Abstract: This paper analyzes an English reading test in the light of Arthur Hughes' theories of language testing. The general information about the test, its results, its validity, and selected items are discussed in turn. Thirty-three first-year non-English majors studying economics at Guizhou University took the test. All procedures and rules were strictly observed, as in a formal examination, so that the results can be considered trustworthy. The results showed that the students were good at answering multiple-choice questions but performed poorly on short-answer questions. This indicates that the students were unable to apply their language ability to passage analysis, and that teachers should cultivate students' language ability in different areas.
Key words: English reading test; validity; facility value
On December 19, 2013, 33 non-English majors took a reading test. Seventeen of them were girls (52%) and 16 were boys (48%). All were first-year economics majors at Guizhou University.
To make the results valid, all procedures and rules were strictly observed, as in a formal examination. The students were told to finish the test within 40 minutes, and no one violated the examination rules during the process.
Their grades in the National College Entrance Examination (hereafter NCEX) give a general picture of their overall proficiency. The grades are distributed as shown in the following diagram (Diagram 1):
Diagram 1 Distribution of Students' Grades in the NCEX
The diagram shows that most students obtained satisfactory results in that examination, but it reveals nothing about specific competences such as reading or writing. It was therefore decided to give them a reading test in order to ascertain what level they could reach in reading competence.
The Test of English for Undergraduate Studies is a test battery designed to assess the English language proficiency of students whose first language is not English and who hope to undertake undergraduate study at universities and colleges where English is the medium of instruction.
The aim of the battery is to select students whose command of elementary English is sufficient for them to benefit from college courses. The test is thus a proficiency test.
The Reading Test contained between 16 and 20 items, approximately four items for each reading passage. Each reading passage and its items formed one sub-test, and marks were assigned to the items so that the total score was 100. The item designer used a variety of item types, including the following (Hughes, 2000):
Identifying appropriate headings
Matching
Labeling or completing diagrams, tables, charts, etc.
Short answer questions
Multiple choice
Identifying order of events, topics
The test itself consisted of five passages. In the first, students were required to find the main idea of each paragraph; there were five questions, each worth 4 points, 20 points in total. In the second, students were asked to find the corresponding meanings of certain words; each item was worth 4 points, 20 points in total. In the third, the questions took the form of multiple choice, each worth 4 points, 20 points in total. In the fourth, students answered short answer questions, each worth 4 points, 20 points in total. Passage 5 contained 20 blanks, each worth 2 points, 20 points in total.
The test was based on the College English Syllabus, and its difficulty lay between Level 1 and Level 2, since the examinees were freshmen.
The following is an analysis of the scores of the test, shown in Diagram 2 and Table 1. The scores followed a normal distribution.
Diagram 2 The Scores of the Test
Table 1 Descriptive Statistics of the Test
Table 1 reports the descriptive statistics generated from the data analysis. Relatively high reliability was obtained, and the other descriptive statistics look normal and acceptable.
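As a rough illustration (the paper does not give the actual analysis script, and the score list below is hypothetical, standing in for the 33 students' real results), descriptive statistics of this kind can be computed in a few lines of Python:

```python
import statistics

# Hypothetical scores standing in for the students' actual results;
# the real figures are those summarized in Table 1.
scores = [52, 58, 61, 63, 65, 66, 68, 70, 71, 73, 74, 76, 78, 80, 83]

n = len(scores)
mean = statistics.mean(scores)        # central tendency
median = statistics.median(scores)
sd = statistics.stdev(scores)         # sample standard deviation

print(f"N = {n}, mean = {mean:.2f}, median = {median:.2f}, SD = {sd:.2f}")
```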
According to Arthur Hughes, "… a test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough. When closely examined, however, the concept of validity reveals a number of aspects, each of which deserves our attention" (2000: 22). Accordingly, the analysis focuses on the following: content validity, criterion-related validity, construct validity and face validity.
Arthur Hughes points out: "A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned… The test would have content validity only if it included a proper sample of the relevant structures. Just what are the relevant structures will depend, of course, upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction. It isn't to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to appear in a single test" (2000: 22). Accordingly, five passages were chosen for this reading test, covering biology, culture, daily life, commerce and so on. As the subjects were all freshmen, they were likely to be attracted by such topics as going abroad or eating at Kentucky Fried Chicken. The passages were not especially simple, requiring certain reading skills and an adequate vocabulary, yet together they form a reasonable sample of the relevant skills; the test therefore presents no problems as far as content validity is concerned.
Another approach to test validity is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate’s ability. This independent assessment is thus the criterion measure against which the test is validated.
“There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time.” (Hughes, 2000: 23)
“The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates’ future performance.” (Hughes, 2000: 23)
For this reading test, however, the subjects had only one chance to take the exam. It was therefore hard to gather enough data to establish either concurrent or predictive validity, which means that an analysis of the criterion-related validity of this reading test would require a follow-up experiment in a further study.
Arthur Hughes says: "A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. The word 'construct' refers to any underlying ability (or trait) which is hypothesized in a theory of language ability" (2000: 26). As the introduction to the test shows, the test combined different testing methods: to achieve variety, it was designed with multiple choice, item matching, short answer questions and cloze. These various methods required the candidates to think in different ways and added to the difficulty of the questions. The construct validity of this reading test can therefore be considered adequate.
As Arthur Hughes points out: "A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test which pretended to measure pronunciation ability but which did not require the candidate to speak (and there have been some) might be thought to lack face validity. This would be true even if the test's construct and criterion-related validity could be demonstrated. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers" (2000: 27). Clearly, the candidates and teachers in this experimental group accepted this paper as a test of reading skills, since both parties understood its purpose and gained useful information from it. At the same time, the content validity and construct validity show clearly what the test was supposed to measure. There is thus little doubt about the test's face validity.
The output of item analysis can aid in the interpretation of students' performance on the test. The discrimination index refers to the difference between the proportion of the advanced group (usually the top 27%) who got an item right and the proportion of the elementary group (usually the bottom 27%) who got it right. In this part, three items from the test are picked out for analysis of their discrimination and facility values.
The facility value is another measure used to determine the quality of a test. Items 1, 2 and 3 are again used here, this time to calculate the facility value, whose formula is slightly different from that for discrimination.
H refers to the number of examinees in the higher group who answered the item correctly, and L to the number of examinees in the lower group who did so. N refers to the total number of examinees in the two groups taken together; here, the value of N is 20.
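The formula itself does not survive in the text. Reconstructed from the definitions above, the facility value is simply the proportion of the sampled examinees who answered the item correctly:

FV = (H + L) / N

For example, an item answered correctly by everyone in both groups gives FV = 20/20 = 1.0, while an item nobody answered correctly gives FV = 0.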
A good item discriminates between students who scored high and students who scored low on the examination as a whole.
As 33 examinees took part in the exam, they were divided into an advanced group and an elementary group for this analysis; eight examinees were placed in the advanced group and the rest in the elementary group. The formula below was used to calculate the discrimination value.
D refers to the discrimination index. H refers to the number of examinees in the higher group who answered the item correctly, and L to the number of examinees in the lower group who did so. N refers to the total number of examinees who answered the item correctly.
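Again the formula is not reproduced in the text. Taking the definitions above at face value, it can be written as:

D = (H − L) / N

Note that N is defined here as the number of examinees answering the item correctly; in the more common formulation the denominator is the number of examinees in each group. Which form was actually used cannot be confirmed from the text, so this reconstruction should be read as an assumption.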
The results are shown in the following tables based on the above two formulas:
Table 2 Item No. 1 (Passage One, No. 1) (multiple choice)
Table 3 Item No. 2 (Passage Five, No. 3) (multiple choice)
Table 4 Item No. 3 (Passage Two, No. 4) (short answer question)
For a test of this kind, the discrimination value is an important indicator for separating upper-group students from lower-group students. It is generally held that if the discrimination value is above 0.4, the quality of the item is excellent; if it is between 0.30 and 0.39, the quality is good; if it is between 0.20 and 0.29, it is acceptable; but if it is below 0.20, the item is not good enough to distinguish good examinees from poorly qualified ones.
From Tables 2, 3 and 4, the discrimination index of item 1 was 0, showing that this item could not distinguish good examinees from poorly qualified ones. The discrimination index of item 2 was 0.38, suggesting that the item was good and reliable. The discrimination index of item 3 was 0.88, indicating that the item was excellent.
Next, the facility values of the three items were checked. The higher the facility value, the easier the test item, and vice versa. Generally speaking, a facility value between 0.7 and 0.8 means that an item is suitable for the examinees' level. Items 1, 2 and 3, however, fell outside this range: item 1 was too easy, since all 33 examinees answered it correctly, while item 3 was too difficult, since only 9 examinees answered it correctly. Hence we may say either that these examinees had not mastered the language points involved, or that some items were not suitable for the students' level.
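A minimal sketch of the two calculations in Python follows. The per-item counts are hypothetical, since the actual figures behind Tables 2, 3 and 4 are not reproduced in the text; they were chosen only to yield indices close to those reported above.

```python
def discrimination(h: int, l: int, n_correct: int) -> float:
    """Discrimination index as defined above: (H - L) / N,
    where N is the number of examinees answering the item correctly."""
    return (h - l) / n_correct

def facility_value(h: int, l: int, n_groups: int) -> float:
    """Facility value: proportion of the sampled groups answering correctly."""
    return (h + l) / n_groups

def quality(d: float) -> str:
    """Verbal label for a discrimination value, per the thresholds above."""
    if d >= 0.4:
        return "excellent"
    if d >= 0.3:
        return "good"
    if d >= 0.2:
        return "acceptable"
    return "poor"

# Hypothetical counts: (H, L, total number answering correctly).
items = {
    "Item 1": (8, 8, 33),   # every examinee answered correctly
    "Item 2": (8, 3, 13),
    "Item 3": (8, 0, 9),    # only 9 of 33 answered correctly
}
for name, (h, l, n_correct) in items.items():
    d = discrimination(h, l, n_correct)
    fv = facility_value(h, l, 16)   # illustrative: 8 examinees per group
    print(f"{name}: D = {d:.2f} ({quality(d)}), FV = {fv:.2f}")
```

With these illustrative counts, item 1 yields D = 0.00 and FV = 1.00, item 2 yields D ≈ 0.38, and item 3 yields D ≈ 0.89, matching the pattern reported above.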
Item analysis is likely to be a useless process unless its results help teachers to improve their classroom teaching and item writers to improve their tests. In this paper three items were examined; more items are left for further research. Even these three items, however, show the need to strengthen students' grammar awareness in teaching. As for test writing, all items should be of appropriate difficulty for the students to whom the test is administered. It is desirable to have most items in the 0.3 to 0.5 range of difficulty, since very hard or very easy items contribute little to the discriminating power of a test.
This reading test was pitched between Level 1 and Level 2, and most students obtained satisfactory results. The reading passages tested the students' reading ability from different angles, e.g. vocabulary, grammar and background knowledge.
The content and method of a test will certainly influence teaching, and that influence may be positive or negative. From this reading test it could easily be seen that the students were good at answering multiple-choice questions but fell short in answering short-answer questions. This suggests that their language ability was limited to what they could recognize at first sight of the words, so that they were good at guessing answers but could not apply their language ability to passage analysis. That is to say, their ability to organize language and to draw conclusions was very poor: they could not summarize a paragraph in their own words, but simply copied words or sentences from the original passage. They may therefore not have understood the meaning of the reading materials at all.
This fact tells us that we, as teachers, should cultivate students' ability to apply the language flexibly.
Hughes, A. 2000. Testing for Language Teachers [M]. Beijing: Foreign Language Teaching and Research Press, People's Education Press & Cambridge University Press.