Reflection of the Test-Item Quality in State SMP and SMA in Bandar Lampung

The objectives of this research are to critically analyze the quality of test items used in SMP and SMA (mid-semester exams, final semester exams, and National Examination practice) in terms of overall reliability, level of difficulty, discriminating power, and the quality of answer keys and distractors. The test items were analyzed with the ITEMAN item-analysis program and two kinds of descriptive statistics, one for analyzing the test items and another for analyzing the options. The findings are far from what is commonly believed: the majority of the test items, as well as the answer keys and distractors, are of unsatisfactory quality. Based on the results of the analysis, conclusions are drawn and recommendations are put forward.


INTRODUCTION
This research focuses on analyzing the quality of multiple-choice (MC) test items, also called objective test items (OTI). Although some experts have raised objections to the use of MC test items (Hinchliffe, 2014; Srivastava et al., 2004), MC tests are still widely used in examinations (Rodriguez, 2005; Mehta et al., 2014; Kaur et al., 2016; Namdeo et al., 2016; Rauch et al., 2010; McKenna, 2019) and in tests that involve many participants, such as final semester examinations at schools, school-final examinations, university entrance tests, and new-employee recruitment. Therefore, this research remains relevant and up to date in meeting the need for good-quality test items. Good-quality English instruction should be accompanied by good-quality assessment (Wiliam, 2013; He et al., 2018; Black et al., 1998; Quaigrain et al., 2017). Good-quality assessment is indicated by high indices of validity and reliability (Bolarinwa, 2015; Mohajan, 2017; Bajpai et al., 2014; Taherdoost, 2016), level of difficulty, and discriminating power (Boopathiraj et al., 2013; Khoshaim et al., 2016; Chauhan et al., 2013). Besides, good-quality assessment is indicated by good answer keys and distractors (Hasan et al., 2017; Chauhan et al., 2015; Rahma et al., 2017; Burud et al., 2019; Rao et al., 2016; D'Sar et al., 2017). Assessment cannot be separated from instruction because assessment is intended to measure whether the intended outcomes of the instruction are achieved.
In other words, if the assessment is in line with the instruction and meets the necessary qualities of good and effective assessment, its results will reflect the objectives of the instruction. By contrast, if the assessment is not congruent with the instruction and does not meet the quality required for effective and optimal assessment, it will never reflect the intended outcomes of the instruction.
Many studies have been conducted on the quality of assessment (Çanakkale et al., 2013; Büyükkarcı, 2014; Patnaik et al., 2015; Ibili et al., 2019; Gokdas et al., 2019; Haidari et al., 2019; Astawa et al., 2017), but few studies, if any, have addressed the quality of the test items, answer keys, and distractors used for mid-semester, final semester, and national examinations in SMP and SMA. The current study tried to deal with these unresolved issues. Büyükkarcı (2014) investigates teachers' beliefs about the assessment of student achievement; he found that although assessment has a primary role in education and cannot be separated from instruction, language teachers do not apply the principles of assessment required by the curriculum. The condition of student assessment may not differ much between Turkey and the Indonesian context, where language teachers remain unaware of the importance of test-item quality in English education. Patnaik et al. (2015) carry out a study from the teacher's perspective. They investigate the parameters of teacher quality and prioritize the necessity of updating oneself regularly to meet the challenges of the teaching profession, including teachers' mastery of student achievement assessment. Ibili et al. (2019) have studied the relationship between students' feeling of ease and their cognitive load. They found a strong correlation between the feeling of ease of test items and extraneous load in males, and a strong relationship between the feeling of usefulness and intrinsic load in females. Both the feeling of usefulness and the feeling of ease of use of test items have a strong correlation with students' cognitive mastery of language. This means that the level of difficulty and the discriminating power of a test item are very important for students in solving English test materials.
In line with the review of the literature above, it was assumed that there was a discrepancy between the theories of assessment and the reality in the field, particularly in relation to the quality of the stem of a test item: it must be proportional, neither too difficult nor too easy. If a test item is too difficult, it may not be answerable by the majority of the participants, including the clever ones. By contrast, if it is too easy, it may be answered correctly by both clever and non-clever students. In other words, if a test item is too difficult or too easy, it may have poor discriminating power, because it cannot be used to discriminate between clever and non-clever students. It may even happen that the resulting index is negative (-), that is, when an individual or group of clever students cannot answer an item correctly but an individual or group of non-clever students can. Such a case suggests that the test item does not have good discriminating power.
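The two indices discussed above can be computed directly from a scored response matrix. The sketch below is a minimal illustration of the idea, not the study's actual ITEMAN computation; the function name and the sample data are assumptions made for demonstration.

```python
# Minimal sketch: item difficulty (proportion correct) and a point-biserial
# discrimination index from a 0/1 response matrix. Illustrative only.
import numpy as np

def item_stats(responses):
    """responses: rows = testees, columns = items, entries 1 (correct) or 0."""
    x = np.asarray(responses, dtype=float)
    total = x.sum(axis=1)            # each testee's raw score
    difficulty = x.mean(axis=0)      # proportion answering each item correctly
    # point-biserial discrimination: correlation of each 0/1 item column with
    # the total score; a negative value signals that weaker students
    # outperform stronger ones on that item
    discrimination = np.array(
        [np.corrcoef(x[:, j], total)[0, 1] for j in range(x.shape[1])]
    )
    return difficulty, discrimination

# four testees, three items (hypothetical data)
diff, disc = item_stats([[1, 1, 1],
                         [1, 1, 0],
                         [1, 0, 0],
                         [0, 0, 1]])
```

On this toy matrix, the first item is answered by 75% of testees and correlates positively with the total score, so it is easy but still discriminates in the right direction.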
Besides, there are two other components of a multiple-choice test item which are almost always neglected by English teachers when designing and administering an objective test, namely the options, comprising the answer key and the distractors (Burud et al., 2019; Rahma et al., 2017). Based on an informal focused group discussion (FGD) in the field (with SMP and SMA English teachers in an MGMP meeting), test designers are not aware of the important role of the options. They may think that the most important component of a test item is the answer key; consequently, they do not focus seriously on constructing good distractors. According to theory, all the options should function well, which is indicated by each being chosen by at least 5% of the testees. If an option is chosen by fewer than 5% of the testees, or no one chooses it at all, this suggests that the testees can immediately tell that it is not the answer; its inappropriateness as a distractor is obvious to them, so they do not choose it. Although this issue is theoretically very important for assessing student achievement, no special research, at least none that has been published, focuses on it. Therefore, this research dealt, among others, with this unresolved issue.
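The 5% rule described above is straightforward to operationalize. The following sketch is illustrative; the function name and sample data are assumptions, though the 5% threshold follows the theory cited above.

```python
# Illustrative distractor-functioning check using the 5% rule.
from collections import Counter

def distractor_report(answers, key, options="ABCD", threshold=0.05):
    """answers: list of options chosen by the testees for one item."""
    counts = Counter(answers)
    n = len(answers)
    report = {}
    for opt in options:
        prop = counts.get(opt, 0) / n
        if opt == key:
            report[opt] = ("key", prop)
        else:
            # a distractor functions only if at least 5% of testees choose it
            label = "works" if prop >= threshold else "non-functioning"
            report[opt] = (label, prop)
    return report

# hypothetical item with key B and 20 testees; nobody chooses D,
# so D is flagged as a non-functioning distractor
report = distractor_report(["B"] * 12 + ["A"] * 5 + ["C"] * 3, key="B")
```

A report like this makes the neglected component visible: the teacher sees at a glance which distractors attract no testees and should be rewritten.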
Theoretically, Gronlund et al. (2009: 93-106) put forward the following rules for designing multiple-choice (MC) items:
1. Construct a test item to assess a significant learning achievement;
2. Put forward only one clearly formulated problem in the stem of the item;
3. Express the stem of the item in easily understandable language;
4. Put as much of the wording as possible in the stem of the item, avoiding repeating the same material in each of the choices;
5. If possible, state the stem of the item in an affirmative form;
6. When negative wording is used in the stem of an item, it should be emphasized;
7. Make sure that the answer key is correct or clearly best;
8. All options should be grammatically correct, in line with the stem of the item, and similar in form;
9. Avoid verbal clues that may lead students to select the correct answer or to eliminate incorrect options;
10. Make the distractors attractive to the uninformed;
11. Vary the relative length of the correct answer to remove length as a clue;
12. Avoid using "all of the above" and use "none of the above" with great care;
13. Change the position of the correct answer in a random manner;
14. Control the difficulty of the item either by changing the problem in the stem or by varying the options;
15. Make sure that each item is independent of the other items in the test;
16. Apply an efficient item format; and
17. Use normal grammatical rules.
There may be some other rules not included in this list, but these are enough as general guidelines.
The objectives of this study were, first, to analyze the reliability of the test items as a whole; then the quality of each of the test items in terms of level of difficulty and discriminating power; after that, the quality of the answer keys; and finally the quality of the distractors. Each of these analyses was followed by decisions or recommendations.

METHODOLOGY
This research used a descriptive and evaluative method, that is, a study which described the results of an evaluation of a certain object against standard criteria. The objects of the current research were English test items consisting of one unit of the mid-semester exam for SMPN, one unit of the final semester exam for SMPN, one unit of the mid-semester exam for SMAN, one unit of the final semester exam for SMAN, and one unit of the National Exam Practice (LUN). These five different units of English test items were intended to identify whether the quality of each unit is similar to or different from the others, and finally to predict what may happen in the future if the results of the analysis of such test items are interpreted. The outcome of the current research is expected to support the theory of assessment in general and to provide beneficial feedback for curriculum developers and test-item designers in practice.
This research used a documentary procedure: the data were five different units of test items and the students' answer sheets from five different groups, depending on the types of test items relevant to the levels of the participants, that is, students' answer sheets for the SMPN mid-semester exam, the SMPN final semester exam, the SMPN national exam practice (LUN), the SMAN mid-semester exam, and the SMAN final semester exam. The data pertaining to the quality of the stems of the test items were analyzed using the item-analysis software Iteman, part of the Micro Computer Adaptive Test (MicroCAT) package, version 3.50A, and interpreted using standard criteria of assessment. Iteman itself can be defined as one of "the analysis programs that comprise assessment systems of test items and test analysis package" (Assessment Systems Corporation, 1989-2006).
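As an illustration of the kind of whole-test reliability index Iteman reports, the sketch below computes KR-20, which equals Cronbach's alpha for dichotomously (0/1) scored items. This is a hedged reconstruction for readers without the software, not the MicroCAT implementation itself.

```python
# Illustrative KR-20 reliability for dichotomously scored test items.
import numpy as np

def kr20(responses):
    """KR-20 reliability for a 0/1 response matrix (rows = testees)."""
    x = np.asarray(responses, dtype=float)
    k = x.shape[1]                           # number of items
    p = x.mean(axis=0)                       # proportion correct per item
    total_var = x.sum(axis=1).var(ddof=0)    # variance of testees' total scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)
```

For example, a matrix in which every testee answers all items consistently (all correct or all wrong) yields a reliability of 1.0, while inconsistent answer patterns pull the index down toward the low bands reported in the results.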
The reliability of each test unit was analyzed using the Iteman software, and the results were compared with the standard criteria in Table 1. The discriminating power of each test item was analyzed using Iteman version 3.50A, and the results of the statistical calculation were compared with the standard criteria in Table 2. Likewise, the level of difficulty was analyzed with Iteman version 3.50A, and to reach a decision the results of the statistical computation were compared with the standard criteria in Table 3 below. Finally, the quality of the distractors was analyzed with Iteman version 3.50A, and the results of the statistical computation were compared with the standard criteria below.
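Since the criteria tables themselves are not reproduced in this text, the decision rules can be sketched as follows. The band cut-offs here are assumptions chosen to be consistent with the interpretations reported later in the results (for example, alpha 0.727 read as high/good, 0.427 as average, 0.274 as low, and point biserials below 0.200 leading to an item being dropped); the study's exact tables may differ.

```python
# Hypothetical decision rules standing in for the paper's criteria tables.

def reliability_band(alpha):
    # assumed bands, consistent with the interpretations in the results
    if alpha >= 0.70:
        return "high/good"
    if alpha >= 0.40:
        return "average"
    return "low"

def classify_item(difficulty, point_biserial):
    # the 0.200 cut-off is reported in the study's results; the other
    # bands are illustrative assumptions
    if point_biserial < 0.200:
        return "drop"
    diff_ok = 0.30 <= difficulty <= 0.70   # neither too easy nor too difficult
    disc_ok = point_biserial >= 0.30       # assumed "good" discrimination band
    if diff_ok and disc_ok:
        return "use"
    return "revise"
```

This three-way outcome (use directly, revise first, drop) is exactly how each unit's items are grouped in the results section.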

RESULTS AND DISCUSSION
As stated in the background of the research, there are five objectives of the current research, addressed with five different units of test items; the results of the data analysis and the discussion are organized accordingly.
First, the results of the SMPN mid-semester exam data analysis are as follows: 1. There were 34 examinees in the data file. From the scale statistics, it can be inferred that the score of alpha is 0.727, which means that the reliability of the test items is high/good. It suggests that the test items as a whole are good and can be used, but the problematic items should be revised first. 2. There are 9 items out of 50 (18%) that are considered good, can be used directly without any revision, and can be put in the item bank. 3. 24 items out of 50 (48%) should be revised first before being used because either the prop correct (level of difficulty) or the point biserial (level of discriminating power) of the items does not achieve the good criteria (see Tables 3 and 4). 4. 17 items out of 50 (34%) should be dropped because they fulfill neither the criteria of level of difficulty nor those of discriminating power (see Tables 3 and 4). 5. There are 38 answer keys out of 50 (76%) which are considered good and can be directly used without any revision; 12 out of 50 answer keys (24%) are poor because they do not have good discriminating power; 37 distractors out of 200 (18.5%) work well and can therefore be directly used without any revision; and 163 distractors out of 200 (81.5%) do not work well because some of them have prop endorsing and point biserial indexes of 0.00 (very low), which means that these distractors were not attractive to the testees, or the testees felt sure that they were obviously wrong.
These results show that the number of good-quality test items is smaller than the number of items that should be revised or dropped. This implies that the teachers' mastery of good and effective assessment should be developed; in other words, the principles of good and effective assessment have not been applied. Besides, teachers and prospective teachers also need to be retrained on the quality of answer keys. It was found that not all answer keys are good: 24% of them are still poor because their discriminating power is low, that is, they cannot discriminate between clever and non-clever students. In other words, it is quite possible that neither group can answer correctly, or that both groups can. It is also interesting to note that 81.5% of the distractors do not work well, which means that the majority of the distractors are not attractive to the testees.

Second, the results of the SMPN final semester exam data analysis are as follows: 1. There were 32 examinees. From the scale statistics, it can be inferred that the score of alpha is 0.427, which means that the reliability of the test items is average. It suggests that the test items as a whole can still be used, but the problematic items should be revised first. 2. There is 1 item out of 25 (4%) that is considered good, can be used directly without any revision, and can be put in the item bank. 3. 6 out of 25 items (24%) should be revised because either the prop correct or the point biserial of the items does not achieve the good criteria (see Tables 3 and 4). 4. There are 18 items out of 25 (72%) that should be dropped because the items do not fulfill the criteria of prop correct and point biserial (see Tables 3 and 4). 5.
There are 9 answer keys out of 50 (18%) which are considered good and can be directly used without any revision; 41 out of 50 answer keys (82%) are poor because they do not have good discriminating power; 9 out of 100 distractors (9%) are good because they are chosen by at least 5% of the participants; 16 distractors out of 100 (16%) should be revised because their point biserial indexes belong to the low category; and 75 distractors out of 100 (75%) should be dropped because they have prop endorsing and point biserial indexes of 0.00 (very low).
These results are even more surprising because only one item out of 25 is categorized as good; the rest are poor, and 18 items (72%) should be dropped because they are too poor. This suggests that the test items were not tried out before they were administered: the teacher(s) designed the test items and then administered them directly to assess their students' achievement. When the answer keys are compared, the poor ones (82%) far outnumber the good ones (18%). This is a real challenge for the LPTK to reconsider the English Teaching Assessment subject; the topic should receive more focus so that teachers are aware of the importance of answer keys and distractors. Distractors should be conceptually and grammatically plausible so that students who are not well prepared will choose them; only then do the distractors function well (Rahma et al., 2017; Chauhan et al., 2015; Burud et al., 2019; Rao et al., 2016; D'Sar et al., 2017).
The following pie diagrams show the proportions of the quality of the test items used for the final semester exam in SMPN.

Third, the results of the SMPN LUN data analysis are as follows: 1. There were 36 examinees in the data file. From the scale statistics, we can conclude that the score of alpha is 0.274, which means that the reliability of the test items is low (not sufficient). 2. There are 5 items out of 50 (10%) that are considered good, can be used directly without any revision, and can be put in the test-item bank. 3. 10 items out of 50 (20%) should be revised because either the prop correct or the point biserial of the items does not achieve the good criteria. 4. 35 items out of 50 (70%) should be dropped because they do not fulfill the criteria of prop correct and point biserial.

Fourth, the results of the SMAN mid-semester exam data analysis are as follows: 1. The score of alpha (reliability) is 0.544, which means that the reliability of the test items is average. 2. Besides the reliability, it was found that there were 13 out of 50 items (26%) which were considered good and can be used directly without any prior revision in terms of level of difficulty (prop correct) and discriminating power (point biserial). 3. 21 items out of 50 (42%) should be revised first before being used, because most of their point biserials are very low or need revising. 4. 16 out of 50 items (32%) should be dropped because their point biserials are less than 0.200; therefore, those items do not fulfill the criteria of test-item quality. 5. 14 answer keys out of 50 (28%) are considered good and can therefore be directly used without any revision; 6 answer keys out of 50 (12%) should be revised because their point biserial indexes belong to the low category; and 28 answer keys out of 50 (56%) should be dropped because their point biserial indexes belong to the very low category. Besides, there are 52 distractors out of 200 (26%) which belong to the good category and can therefore be directly used without any revision, while the remaining 148 distractors out of 200 (74%) do not belong to the good category.

Finally, the results of the last analysis, the SMAN Final Semester Exam (UAS) test-item analysis, are the following: 1. The alpha index of the whole set of test items (reliability) is 0.799, which belongs to the high or good category. 2. There were 27 items out of 50 (54%) which were considered good and can be used directly without any prior revision in terms of level of difficulty (prop correct) and discriminating power (point biserial). 3. 18 items out of 50 (36%) should be revised first before being used, because most of their point biserials are very low or need revising. 4. 5 items out of 50 (10%) should be dropped because their point biserials are less than 0.200; therefore, those items do not fulfill the criteria of test-item quality. 5. There are 23 out of 50 answer keys (46%) which belong to the good category and can therefore be used directly without any revision; 20 out of 50 answer keys (40%) should be revised because they do not have sufficient discriminating power; and there are 7 items (14%) whose answer keys should be dropped and replaced with new ones because the analysis produced the message *Please check the answer keys*. The relatively better quality of this unit can be seen from the number of test items that can be used directly without any revision (54%), compared with those which should be revised (36%) and those which should be dropped (10%).

Based on these results of the data analysis of the five different units of test items, it can be concluded that the concepts of good-quality test items need to be shared with English teachers in almost all schools at every level of education. Teachers should be trained intensively to master the analysis of test items (daily tests, mid-semester tests, and final semester or school examination items). If they do not have sufficient ability to analyze test items, they cannot precisely measure the intended learning outcomes.
Based on these findings, there are discrepancies between the theories of assessment and the realities in the field. This is a challenging task for Teacher Training and Education Faculties and other LPTK (institutions whose responsibility is to produce high-quality teachers for all school levels, from kindergarten through general senior and vocational high schools), for curriculum designers, for policy makers, and for other stakeholders dealing with education.

CONCLUSIONS AND SUGGESTIONS
In line with the objectives of this research stated in the background section (to analyze the reliability of the test items as a whole; the quality of each test item in terms of level of difficulty and discriminating power; the quality of the answer keys; and finally the quality of the distractors, each followed by decisions), and given the five different units of test items used (SMPN Mid-Semester Exam, SMPN Final Semester Exam, SMPN National Examination Practice (LUN), SMAN Mid-Semester Exam, and SMAN Final Semester Exam), the following conclusions are drawn: 1. To some extent, the construction of the test items does not contain many mistakes, either in the stems or in the options; consequently, the test items tend to be applicable, provided that the items which do contain mistakes are revised and tried out again to ensure optimal quality before they are administered. 2. Special priority should be given to ensuring the quality of reliability, level of difficulty, discriminating power, answer keys, and distractors. 3. Based on the findings, it can also be interpreted that, to a certain extent, the stems and the options of the test items, which consist of answer keys and distractors, are still far from the theories of good-quality assessment.

RECOMMENDATIONS
1. Given that this research focuses only on English test items for SMPN and SMAN, further researchers are recommended to investigate those used in vocational schools and religious-institution-based schools (e.g., M.Ts. and MAN). 2. Further researchers are also recommended to carry out research on authentic assessment, since this research focuses only on multiple-choice assessment. 3. Given that the findings show that in all five units of test items fewer than half of the items are good, the rest being poor or even needing to be dropped, and likewise that more than fifty percent of the options are poor or even need to be dropped, test-item analysis should be socialized more widely to teachers and prospective teachers by LPTK (teacher training and education institutions) and by the education authorities at the province and regency levels.