The significant role of rater factors in performance assessment contexts has been emphasized by a number of studies (e.g., Bachman et al., 1995; Hamp-Lyons, 1990; Lumley & McNamara, 1995; Weigle, 1998). Previous studies show that rater perceptions and behaviors vary across rater groups, and this variation can be attributed to a range of variables such as personality, cultural, linguistic, and educational background, and teaching and rating experience (Hamp-Lyons, 1990). These variables influence raters’ decision-making behaviors, their interpretations and expectations concerning task requirements and scoring criteria, their reactions to essays, their severity (inter-rater reliability), and their self-consistency (intra-rater reliability) (Cumming, 1990; Engelhard, 1994; Erdosy, 2004; Freedman & Calfee, 1983; McNamara, 1996; Pula & Huot, 1993; Ruth & Murphy, 1988; Santos, 1988; Vaughan, 1991; Weigle, 2002; Wolfe, 1997; Wolfe et al., 1998; Wood, 1991). Three main factors have received most of the attention in the literature, namely rater L1 background (language and cultural), academic background, and professional experience (rating and teaching); these are discussed in detail in the following sections.
Using the native speaker (NS) as the point of reference has had a long and contested history in language testing (Davies, 2004). It is no surprise, then, that one of the areas attracting increasing attention in the field of language testing is whether native speakers of English, who have traditionally been regarded as eligible to score L2 performance (Lazaraton, 2005; Seidlhofer, 2001), should continue to be the exclusive “norm makers”, given that they are far outnumbered by non-native speakers of English for whom NS norms may have limited relevance (Taylor, 2002). In the context of writing assessment, a great number of studies have explored the influence of raters’ L1 background, in terms of both language and culture, on their ratings (Connor-Linton, 1995; Hamp-Lyons & Zhang, 2001; Kobayashi, 1992; Kobayashi & Rinnert, 1996; Santos, 1988; Shi, 2001; Zhang, 1999).
The influence of raters’ L1 background emerges most clearly in studies contrasting NS and non-native-speaker (NNS) raters of EFL/ESL writing, which have yielded ambiguous and inconclusive findings. A number of such studies report harsher ratings by NNS raters than by their NS counterparts (e.g., Fayer & Krasinski, 1987; Santos, 1988), while others have found the reverse (e.g., Brown, 1995). To complicate the picture further, a third group of studies shows no language background differences with respect to severity and consistency (e.g., Brown, 1991; Connor-Linton, 1995; O’Loughlin, 1994; Shi, 2001).
Besides raters’ language background, many researchers have also recognized the importance of raters’ cultural background, particularly with regard to the role of contrastive rhetoric (Kaplan, 1996). Some researchers (e.g., Erdosy, 2003; Zhang, 1998) identified differences in writing assessment between raters from different cultural backgrounds, while others did not (Kobayashi & Rinnert, 1996).
Raters’ academic background refers to whether the rater is a language or a content teacher. Raters from different disciplines have been reported to hold different interests, assumptions, and expectations concerning tasks, essays, and rating criteria. Several researchers (e.g., Santos, 1988; Vann et al., 1991) found that teachers from different departments rated and reacted differently to various aspects of ESL essays and disagreed as to when particular criteria were being met. Hamp-Lyons (1991) identified a tension between the nature of the scoring task (with its focus on discipline-specific content) and the raters’ academic background (ESL teachers rather than teachers from other disciplines). Mendelsohn and Cumming (1987) compared the perceptions and scores of sixteen engineering, English literature, and ESL teachers for eight essays manipulated to reflect effective and ineffective language use and rhetorical organization. The results demonstrated complex interactions between essay features and rater backgrounds: when judging the essays, the engineering professors attributed more importance to language use, the ESL teachers gave more weight to rhetorical organization, and the English literature teachers did not seem to be biased in either direction.
In the context of rating, raters’ professional experience refers to their rating and/or teaching experience. Regarding the effects of teaching experience on rating, Song and Caruso (1996) found that experienced teachers assigned higher holistic scores to ESL essays. Cumming (1990) found that experienced teachers had a much fuller mental representation of the rating task and used a larger and more varied set of criteria, self-control strategies, and knowledge sources to judge ESL essays. Novice teachers, in contrast, tended to evaluate essays with only a few skills and criteria, which may have derived from their general reading abilities or other previously acquired knowledge. Erdosy (2004) found that differences in raters’ teaching experience and, to a lesser extent, native language resulted in different scoring criteria and strategies, since raters tended to bring different teaching concerns into the rating process.
Another key dimension of variability is the level of rating experience. There is a relatively extensive literature exploring the effects of raters’ rating experience on essay scores and rating processes (Delaruelle, 1997; Erdosy, 2004; Sakyi, 2003; Weigle, 1999). Shohamy, Gordon and Kraemer (1992) compared the ratings of experienced and novice raters of EFL essays and found no significant differences between the groups in terms of inter-rater reliability. Other studies found that the effect of rater expertise on essay scores depended on other factors. For instance, Schoonen et al. (1997) found that the effect of rater expertise in an L1 writing assessment depended on the writing task at hand (restricted or free) and the rating criteria used (content and language). Weigle (1999) identified an interaction between rating expertise and writing task. Research has also indicated that experienced and novice raters employ qualitatively different rating processes, in that the former have a broader range of responses and reading repertoires to draw upon when scoring ESL essays than the latter (cf. Huot, 1993; Pula & Huot, 1993, in L1).
The third dimension in this line of research concerns the influence of training experiences on raters’ rating performance. The findings suggest that focused training can help raters apply rating scales consistently (Connor & Carrell, 1993; Weigle, 1994). On the other hand, the benefits of training may be short-lived, and training cannot eliminate other effects (Lumley & McNamara, 1995); for example, “the context in which training occurs, the type of training given, the extent to which training is monitored, the extent to which reading is monitored, and the feedback given to readers all play an important part in maintaining both the reliability and the validity of the scoring of essays.”
On the whole, the literature on the factors influencing raters’ rating performance suggests that rating should be regarded as a contextualized process, inevitably shaped by the various factors that raters bring into it from their educational and assessment contexts. Identifying these potentially important sources of rater variation helps explain variability in the scoring judgments of the participants in the current study.
It is also evident that few studies have examined raters’ writing proficiency as an influencing factor in the context of writing assessment, rather than focusing solely on the native/non-native dichotomy of L1 backgrounds. Methodologically, in terms of quantitative analysis, researchers have examined the correlations between the scores raters assign and measures of specific essay features (e.g., Homburg, 1984; Tedick & Mathison, 1995), or have rewritten essays to reflect particular strengths and weaknesses (e.g., Kobayashi & Rinnert, 1996; Mendelsohn & Cumming, 1987). Other studies have used qualitative data, including interviews (e.g., Erdosy, 2004), questionnaires (e.g., Shi, 2001), written score comments (e.g., Milanovic et al., 1996; Rinnert & Kobayashi, 2001), and think-aloud protocols (e.g., Cumming et al., 2002; Delaruelle, 1997), to identify the evaluation criteria that raters employ. No mixed-method design—combining a quantitative analysis of discourse features in raters’ writing performance with a qualitative analysis of raters’ perceptions and behaviors—has been employed in recent years to investigate raters’ language background and its influence on their ratings. Furthermore, studies investigating the role of raters’ language proficiency in writing assessment have yet to be conducted in the Chinese context, which is arguably an influential one given the variety of English used there and the size of its population (Berns, 2005).