Title: Evaluation of the UAN
Author: Mark Ogilvie, teacher at a national plus school in West Jakarta
Topic: UAN SLTP 2003/2004
Date: 02/08/2004

Evaluation of the UAN national Indonesian English test for final year junior high school students

INTRODUCTION
This paper evaluates the UAN national Indonesian English test for final year (grade 9) junior high school students, using the 2002/2003 ujian akhir nasional (UAN) [final national test] administered in the Jakarta district. The Bachman and Palmer (1996) framework is used as the evaluative framework. This is an objective study that evaluates the UAN test so that an awareness of the issues involved in testing can be developed.
BACKGROUND
Before 2001/2002 the UAN was known as the evaluasi belajar tahap akhir nasional (EBTANAS). The change was in name only, not in the construction or administration of the test. The test's future existence is under threat: a government commission for education is moving toward scrapping the exam and replacing it with school-based criterion-referenced exams. Naommy (2004, p. 4) reports:
Under the UAN policy, the Ministry of National Education requires third year junior and senior high school students to take national tests in English, Indonesian and mathematics, instead of letting individual schools make their own exams.
The commission said the policy violated Law No. 20/2003 on the national education system, insisting national exams could not be used to measure the performance of individual students or the general quality of education in the country.
UAN TEST COMPONENTS
The UAN test is a reading, grammar, conversation, and vocabulary test consisting of sixty multiple choice (a, b, c, or d) questions. According to Manurung (2001, p. 1), the reading component accounts for 50% of all questions. Manurung identifies the components of the junior high UAN English test as comprising the following items: reading comprehension, present tenses, past tenses, future tenses, question tags, elliptical sentences (two sentences becoming one), pronouns, infinitives, comparisons, prepositions, 'introduction' (it and their use), modal auxiliaries (can, may, should, and must), conditional sentences, question words (what, who, whom, whose, which, why, where, when, and how), conversation (greetings, introducing, complimenting, leave taking, inviting, refusing, agreement and disagreement, apologizing, and expressing surprise), and vocabulary.
Students need to obtain a score of 40.10% to pass. Students who fail are allowed to repeat the test the same year. Naommy (2004, p.4) reports, "Education observers have predicted that up to 25% of 2.5 million junior high school students could fail the exams this year." Students cannot proceed to senior high school unless they pass the UAN exams.
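The official scoring scheme is not described here, so the following is only an illustration, assuming the pass mark is a simple percentage of the sixty equally weighted items:
0.4010 x 60 = 24.06, i.e. a student must answer at least 25 of the 60 questions correctly to pass. On the same basis, the predicted failure rate quoted above would amount to as many as 0.25 x 2,500,000 = 625,000 students.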
BACHMAN AND PALMER (1996) FRAMEWORK OF EVALUATION
The Bachman and Palmer (1996) framework for evaluating test usefulness will be used to evaluate the UAN national Indonesian English test for final year junior high school students. The questions for the logical evaluation of usefulness, as posed by Bachman and Palmer, are presented as numbered questions below.
RELIABILITY
1. To what extent do characteristics of the test setting vary from one administration of the test to another? Some students take the test in air-conditioned, comfortable classrooms with minimal background noise. Other students (the majority) take the test without air-conditioning and with varying levels of noise. All students sat the exam from 7.30 to 9.30 a.m. (WIB) [Western Indonesian Time] on Monday, 28 April 2003. Students are used to the school environment, tropical heat, and background noise. The setting cannot be changed unless financial assistance is given.
2. To what extent do characteristics of the test rubric vary in an unmotivated way from one part of the test to another, or on different forms of the test? The time is always standardized. Instructions and questions are clear in some questions and ambiguous in others, and question two is very confusing (how many brothers and sisters does Aditya have? songs?). Some questions have instructions whilst others have none. Questions 35 to 39 are based on a reading passage whose instructions direct students to questions 35 to 36 only. Some questions are confusing or ambiguous; thus the rubric has flaws. Question 4's instructions state that it is based on the reading passage, when it is not.
3. To what extent do characteristics of the test input vary in an unmotivated way from one part of the test to another, from one task to another, and on different forms of the test? The input does not change. It is standardized. This is satisfactory.
4. To what extent do characteristics of the expected response vary in an unmotivated way from one part of the test to another or on different forms of the test? All responses are multiple choice, so the format of answering does not change. However, the reading questions are mixed with the grammar questions, causing a constant switch from one type of question to another. Questions 1 to 3, 9 to 13, 17 to 21, 25 to 29, 33, 35 to 39, 44 to 47, 49 to 50, and 58 to 60 are reading questions. It would be far better, more logical, and easier for students if the reading questions were clustered together.
5. To what extent do characteristics of the relationship between input and response vary in an unmotivated way from one part of the test to another, or on different forms of the test? The input does not vary, and neither does the response; it is all multiple choice.
VALIDITY
6. Is the language ability construct for this test clear and unambiguously defined? No. As mentioned previously, there are reading, grammar, vocabulary, and conversational components, but they are not clearly and unambiguously defined. Conversation, tested by a few written questions, is not a valid interpretation of speaking ability at all. There is no indication on the test of the language ability construct; it is only implied.
7. Is the language ability construct for the test relevant to the purpose of the test? The purpose of the UAN test is to rank students and to set a pass level for proceeding to senior high school. Students must pass to reach senior high school, and both schools and students are ranked and compared nationally. There is no speaking, listening, or writing component to the test. A multiple choice test will not test students' practical, active use of the language. It is a passive test construct, and it is not constructed adequately to test students' reading, grammar, conversation, and vocabulary ability effectively.
8. To what extent does the test task reflect the construct definition? The conversation questions do not reflect conversational ability. The grammar questions do not reflect grammatical ability; for example, question 16 could be answered b) he has gone to the office, or d) he went to the office. Other questions have several correct choices, namely questions 4, 5, 7, 10, 24, 26, 29, 38, 44, 46, 47, 50, 51, 53, 57, and 59. However, only one answer is marked correct. The answer given for question 49 by the examiners is "most people in big cities read to certain news papers [sic]". The answer "most people in big cities subscribe to certain news papers [sic]" is marked incorrect! The test needs a construct that is grammatically correct and that limits each multiple choice item to a single correct answer.
9. To what extent do the scoring procedures reflect the construct definition? It is unfair to mark a correct answer as incorrect. Questions 4, 5, 7, 10, 24, 26, 29, 38, 44, 46, 47, 50, 51, 53, 57, and 59 have at least one alternative correct answer, yet only one answer is marked correct. On writing good multiple choice questions, Evaluating multiple choice tests (n.d., para. 13) states, "Make sure there is only one correct answer". Though this is obvious, it is a common mistake according to that source. In the case of question 49 the correct answer 'd' is marked incorrect, whilst the incorrect answer 'c' is marked correct. A single score from 60 multiple choice questions is not adequate to assess a range of English skills, their depth, or whether a student is ready to progress to senior high school.
10. Will the scores obtained from the test help us to make the desired interpretations about the test takers' language ability?
It is a passive test of mainly reading and grammar and does little to evaluate students' language ability and use. Listening, speaking, and writing ability are not tested. The test itself is about scores. Bond (1996, para. 2) claims, "The major reason for using norm-referenced tests (NRT) is to classify students." The UAN seeks to compare and rank rather than assess learners' ability to use language. For this reason, and because many items have more than one defensible answer, the results do not help in interpreting learners' English ability.
11. What characteristics of the test setting are likely to cause different test takers to perform differently? All students complete answers on a Scantron answer sheet. All conditions are the same, except that some classrooms have air-conditioning and quiet surroundings. This aspect is satisfactory.
12. What characteristics of the test rubric are likely to cause different test takers to perform differently? The general instructions are given in Indonesian. The instructions for individual questions are in English and are basic in structure. Some questions use incorrect English and can disadvantage or confuse students; for example, questions 17 to 21 are based on a short letter whose last sentence reads "We look forward to your confirmation by return." This is not everyday English and can confuse students and their understanding of the letter. Questions 58 to 60 are based on a paragraph that reads "The most important thing in the Olympic games is not t wo I [sic] take part." Such examples can cause test takers to perform differently.
13. What characteristics of the test input are likely to cause different test takers to perform differently? Incorrect grammar, words, spelling, and punctuation are input features that can cause problems for test takers. Incorrect grammar occurs in questions 15, 29, 35, 42, 52, and the reading for questions 17 to 21. Incorrect words are found in questions 2, 22, 28, and 37. Incorrect spelling occurs in questions 3, 32, 51, and the readings for questions 35 to 39 and 58 to 60. Incorrect punctuation is found in questions 22, 43, 54, and the reading for questions 35 to 39. Incorrect test questions invalidate the test.
14. What characteristics of the expected response are likely to cause different test takers to perform differently? Once again, having several possible correct answers in the question, whilst only marking one correct, can cause major problems with test takers.
15. What characteristics of the relationship between input and response are likely to cause different test takers to perform differently? The test construct includes Indonesian boys' and girls' names, which does not cause problems when trying to identify gender or subject. This is quite good. Question 8, however, has cultural bias. Most Indonesians do not eat cereal for breakfast, and it is mostly the upper classes who buy cereals. Thus, a can, box, jar, or bottle of cereal is a difficult question for someone who does not know what cereal is or has never eaten it. Apart from question 8, the other questions have input commonly known to Indonesians.
AUTHENTICITY
16. To what extent does the description of tasks in the TLU [target language use] domain include information about the setting, input, expected response, and relationship between input and response? The setting is included in some reading questions, but absent from the grammatical and vocabulary questions. Question 7 epitomizes the confusion over expected response and input. The expected response by the test constructor is "a) which" [photographs]. However, "b) what" and even "c) whose" are grammatically acceptable. The expected response must match the input given in an authentic, valid test. Kehoe (1995, para. 3) states, "As a rule one is concerned with writing stems that are clear and parsimonious, answers that are unequivocal and chosen by the students who do best on the test." Question 28 lacks input for the response; there is no specific information on which to base the answer of "b". In essence, the TLU tasks need more contextual support and only one correct response. Question 42 relies heavily on knowledge of an Indonesian folk tale. Without knowledge of the folk tale, construction of the paragraph could differ from the expected response. A very small minority of students in Indonesia do not know this folk tale, for example, foreign nationals sitting the test.
17. To what extent do the characteristics of the test task correspond to those of the TLU tasks? Conversational items tested as multiple choice items are far from authentic! Thus, question 53 in reality could be a, b, c, or d depending on context. There are also many grammatical mistakes, making the test inauthentic.
INTERACTIVENESS
18. To what extent does the task presuppose the appropriate area or level of topical knowledge, and to what extent can we expect the test takers to have this area or level of topical knowledge? As previously mentioned, question 42 presupposes topical knowledge of an Indonesian folk tale. Overall, the topical knowledge is appropriate for 14 to 15 year old students, for example, in the areas of media, sickness, and sport. There is generally no great need for specific topical knowledge to answer the questions.
19. To what extent are the personal characteristics of the test takers included in the design statement? The design is for final year students of junior high schools in the Jakarta district of Indonesia. It is assumed that all test takers are Indonesian, aged 14 to 15, and speak Indonesian. All have taken the pre-UAN test, and it is assumed that all have completed nine years of formal schooling. It should be noted that a tiny minority of foreign nationals also sit the test, but the government assumes and expects that they are schooled in international schools.
20. To what extent are the characteristics of the test tasks suitable for test takers with the specified personal characteristics? The test tasks are very much appropriate for average and lower ability students. In this regard the test is suitable, but it fails to take into account higher ability students, for whom the tasks are at a functional level far below their ability. That is, some students who achieve excellent results in native-speaker English tests for the same educational level are tested on tasks that neither challenge nor address their level of ability. Year nine students at the school where I teach performed well above average on the 2001 year 9/10 Australian English schools' competition test items, with one student scoring 100%. However, many students across Indonesia fail the UAN test. Thus there is a wide range of ability, but the test tasks do not cover the whole range.
21. Does the processing required in a test task involve a very narrow range of areas of language knowledge? As discussed previously, the tasks engage a very limited range of language knowledge.
22. What language functions, other than the simple demonstration of language ability, are involved in processing the input and formulating a response? None.
23. To what extent are the test tasks interdependent? They are not. Each question depends only on its own item or reading passage and is independent of the other items.
24. How much opportunity for strategy involvement is provided? A pre-UAN test is administered to all students under the same test conditions two weeks prior to the UAN test. A text with past years' UAN test items is also available to teachers in all schools to prepare students. The construction of the test is very similar from year to year and thus provides students and teachers with ample time to prepare.
25. Is this test likely to evoke an affective response that would make it relatively easy or difficult for the test takers to perform at their best? No. The topics are handled with cultural sensitivity and are non-emotive.
IMPACT
26. To what extent might the experience of taking the test or the feedback received affect characteristics of test takers that relate to language use? The test is passive: it tests only understanding, neglecting higher skills such as processing, comparing, debating, and even the production of language. Hughes (2003, p. 1) claims, "If the skill of writing, for example, is tested only by multiple choice items then there is great pressure to practise such items rather than practise the skill of writing itself. This is clearly undesirable." The UAN test aims to test grammar, but students are not required to construct any sentences. Students are expected to learn conversational conventions, but are not tested orally. Research by Hadiatmaja, cited by Somantri (2003, para. 6), observes that Indonesian school students learning English "are passive and receptive only [translation]." Thus the backwash effect of the UAN test can be seen in students' focus on passive, receptive skills and their problems constructing discourse in speaking and writing.
27. What provisions are there for involving test takers directly, or for collecting and utilizing feedback from test takers, in the design and development of the test? There are no known provisions. Students do not have the opportunity to provide any feedback or have any input into the development of the test.
28. How relevant, complete, and meaningful is the feedback that is provided to test takers? Correct answers and students' responses are given, showing their mistakes. A final score and school ranking are also given. These are just statistics: students are not given any explanation as to why test items are correct, and no information is given on their language ability or mastery of the subject matter. It is difficult for the individual teacher to provide good feedback because of the number of items with alternative correct answers.
29. Are decision procedures and criteria applied uniformly to all groups of test takers? Yes. All schools follow the same criteria based on the UAN score, and scores are objective, independent of participation, attendance, attitude, or other factors.
30. How relevant and appropriate are the test scores to the decisions to be made? The test score is the single factor in determining the grade and in deciding whether the student can proceed to senior high school.
31. Are test takers fully informed about the procedures and criteria that will be used in making decisions?
32. Are these procedures and criteria actually followed in making the decisions? Yes. There are no exceptions, though those who fail may sit for the test again.
33. How consistent are the areas of language ability to be measured with those that are included in teaching materials? In the majority of schools, teachers' teaching materials match the language ability to be measured. However, schools such as mine do not follow the national curriculum per se and go well beyond it, including the active skills of listening, speaking, and writing in addition to the reading and grammar of the national curriculum. These schools, their teachers, and their students feel uncomfortable with the test, as it neither matches their learning content nor tests most of their ability.
34. How consistent are the characteristics of the test and test tasks with the characteristics of teaching and learning activities? This depends on the individual teacher. Due to the passive nature of the test, many students learn English in a passive manner, and as a result Artsiyanti (2002, para. 6) claims, "Students do not know when structures [grammar] have to be used and how to apply them in everyday life [translation]." The test tasks contribute to a negative backwash effect in the classroom.
35. How consistent is the purpose of the test with the values and goals of teachers and of the instructional program? The test is far from achieving the goals of English teaching at IPEKA, the school where I teach. Because it is limited to passive, receptive skills, it is also not consistent with the goals of other schools' English courses, even though it is consistent with the national curriculum.
36. Are the interpretations we make of the test scores consistent with the values and goals of society and the education system? If wages are a reflection of worth, Indonesian society does not value its teachers as western countries do. The average wage of a teacher is Rp 700,000 to Rp 800,000 (just over AUD 100) a month (Sistem pendidikan harus dirombak secara radikal, 2004). Schools are often dilapidated and some students cannot afford their tuition. Many language teachers do not have adequate mastery of English to teach effectively and efficiently in Indonesian schools. Yet test scores are regarded as highly valid and are respected by most as the major measure of performance in English and as the means of determining students' academic progression to the next level.
There are more pressing concerns here, namely terrorism, hunger, and work. The acceptance of the test scores by society and the education system should not be equated with the usefulness of the test. The UAN needs reform!
37. To what extent do the values and goals of the test developer coincide or conflict with those of society and the education system? They are in agreement with the education system and most of society.
38. What are the potential consequences, both positive and negative, for society and the education system, of using a test in this particular way? The backwash effect contributes to passive learners and English speakers who are not confident in producing English, which is the case in Indonesia today.
39. What is the most desirable positive consequence, or the best thing that could happen as a result of using the test in this particular way, and how likely is this to happen? The test could act as a motivating factor for some students in mastering passive English. Even this is not likely.
40. What is the least desirable negative consequence, or the worst thing that could happen as a result of using the test in this particular way, and how likely is this to happen? As mentioned previously, many students will develop an understanding of reading and grammar in a passive and receptive manner, to the exclusion of the active skills of speaking, listening, and writing. This is highly likely, as it is already evident throughout the country.
PRACTICALITY
41. What type and relative amounts of resources are required for (a) the design stage, (b) the operationalization stage, and (c) the administration stage? There is not much money, time, or expertise available for the UAN tests. The design is done by a few local English teachers with no resources provided by the government apart from the syllabus and the test construction design. The operationalization is carried out by a central team using Scantron computer marking. The administration of the test is handled by individual schools.
42. What resources will be available for carrying out (a), (b), and (c) above? Teachers, computers, printers, and paper are available. Resources are very limited in Indonesia due to its massive student population and limited budget.
CONCLUSION
The UAN is not very useful. It is not valid, authentic, or interactive, and it has negative impacts on learning. It is, however, reasonably reliable and practical.
The purpose of the UAN is to measure a level of English competence for progression to senior high school. It clearly fails to do this. It is necessary first to determine the aim or goal of the test. Kitao and Kitao (1996b, para. 2) state, "The goal of the test is what you want to measure." There are many unmeasured skills that could be tested: listening, writing, and speaking can all be assessed in addition to reading and grammar. In regard to grammar, Kitao and Kitao (1996a, concluding para.) state, "While the testing of grammatical knowledge is limited - it does not necessarily indicate whether the testee can use the grammatical knowledge in a communicative situation - it is sometimes necessary and useful."
Indonesian schools are moving towards an outcome-based curriculum. A criterion-referenced test (CRT) could well be an excellent alternative to the present UAN test. Gorsuch (1997, para. 25) claims, "Only CRTs will allow teachers to set standards, measure achievement and give students valuable feedback at the course level." This could present an opportunity for positive backwash, so that students become active users of English who are confident in all skills.
All in all, the UAN fails to be useful because of its test construction, which is riddled with mistakes and contains many multiple choice items with more than one correct answer. Hughes (2003, p. 2) claims, "Students' true abilities are not always reflected in the test scores that they obtain." This is the case with the UAN test.
REFERENCES
Artsiyanti, E.P. (2002, March). Bagaimana meningkatkan mutu hasil pelajaran bahasa Inggris di sekolah. Pendidikan Network. Retrieved April 16, 2004 from http://artikel.us/artsiyanti.html
Bachman, L.F. & Palmer, A.S. (1996). Language testing in practice. Oxford: Oxford University Press.
Bond, L.A. (1996). Norm- and criterion-referenced testing. Practical Assessment, Research and Evaluation, 5 (2). Retrieved May 12, 2004 from http://PAREonline.net/getvn.asp?v=5&n=2
Evaluating multiple choice tests. (n.d.). Luzerne County Community College. Retrieved April 16, 2004 from http://academic.luzerne.edu/adjuncts/evaluating_tests.htm
Gorsuch, G.J. (1997, January). Test purposes. The Language Teacher. Retrieved March 16, 2004 from http://www.jalt-publications.org/tlt/files/97/jan/gorsuch.html
Hughes, A. (2003). Testing for Language Teachers (2nd Ed.). Cambridge: Cambridge University Press.
Kehoe, J. (1995). Writing multiple-choice test items. Practical Assessment, Research and Evaluation, 4 (9). Retrieved May 12, 2004 from http://PAREonline.net/getvn.asp?v=4&n=9
Kitao, S.K. & Kitao, K. (1996a, June). Testing grammar. The Internet TESL Journal, 2 (6). Retrieved May 12, 2004 from http://iteslj.org/Articles/kitao-TestingGrammar.html
Kitao, S.K. & Kitao, K. (1996b). Writing a good test. TESL-L archives. Retrieved May 12, 2004 from http://www.ling.lancs.ac.uk/staff/visitors/kenji/kitao/design2.htm
Manurung, A. (2000). Kumpulan soal dan pembahasan EBTANAS bahasa Inggris SLTP. Jakarta: Penerbit Erlangga.
Naommy, P.C. (2004, May 5). Lawmakers opposes[sic] continuation of national final examinations. The Jakarta Post.
Sistem pendidikan harus dirombak secara radikal. (2004, May 14). Media Indonesia Online. Retrieved May 28, 2004 from http://mediaindo.i2.co.id/cetak/berita.asp?id=2004051401404919
Somantri, N. (2003, January). Penerapan metode simulasi tematis untuk peningkatkan kemampuan bahasa Inggris siswa. Pendidikan Network. Retrieved March 16, 2004 from http://www.pendidikan.net/nsomantri2.html