It is well recognized that raters do not mechanically record what they see when engaged in the scoring process; their ratings are based on observation, interpretation and the exercise of personal and professional judgments (Myford & Wolfe, 2003). It is thus reasonable to assume that raters, with their internalized criteria and individual manner of implementing scoring criteria, mediate between candidate performance and the final score, and so determine the meaningfulness of the score as well as the appropriateness of inferences made from it. These complex interactions among the various factors in L2 essay rating give rise to the need for studies into the factors affecting the essay rating process and its outcomes. The nature of the rating process has thus been the focus of many studies.
Two major orientations of studies have been conducted in this regard. One line of research, which has long been dominant in the literature, is mainly concerned with how rater factors affect the rating through statistical modeling of scores, using three main approaches: (a) Classical Test Theory (CTT) to examine intra-/inter-rater reliability indices, (b) the Generalizability Theory (GT) approach to estimate the variance component associated with the rater facet, and (c) the Many-Facet Rasch Model (MFRM) to calibrate individual raters’ rating patterns. In this line of research, final scores are regarded as rich sources of information and direct evidence of how accurately and consistently raters have fulfilled their scoring tasks. Specifically, information can be obtained regarding how far different raters agree with each other in their rank-ordering of the same group of candidates, using inter-rater correlation coefficients (Huot, 1990); the extent to which the rater group as a whole contributes to the total variance in the final scores, using a G-study (Lynch & McNamara, 1998); and how raters differ from each other in terms of their overall severity, self-consistency and significant bias towards candidates, tasks or rating occasions, using calibrations from MFRM (Engelhard, 1994; Bachman et al., 1995; Myford & Wolfe, 2000; Eckes, 2005).
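By way of illustration only, the following minimal sketch (in Python, using numpy and scipy; the score matrix and all variable names are hypothetical) shows the kind of evidence the first two approaches extract from a persons-by-raters score matrix: an inter-rater correlation in the spirit of CTT, and a simple G-study variance decomposition for a fully crossed persons × raters design. A full MFRM calibration requires specialized estimation with dedicated Rasch software and is not attempted here.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical ratings: rows = candidates (persons), columns = raters,
# each cell a holistic essay score on a 1-9 band.
rng = np.random.default_rng(0)
scores = rng.integers(3, 9, size=(30, 4)).astype(float)

# --- CTT-style evidence: inter-rater correlation between two raters ---
r, _ = pearsonr(scores[:, 0], scores[:, 1])
print(f"Inter-rater correlation (rater 1 vs rater 2): {r:.2f}")

# --- Simple G-study: variance components for a crossed p x r design ---
n_p, n_r = scores.shape
grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

ss_p = n_r * ((person_means - grand) ** 2).sum()
ss_r = n_p * ((rater_means - grand) ** 2).sum()
ss_total = ((scores - grand) ** 2).sum()
ss_pr = ss_total - ss_p - ss_r            # person-by-rater interaction + error

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

var_pr = ms_pr                            # residual (interaction + error)
var_p = max((ms_p - ms_pr) / n_r, 0.0)    # person (true-score) variance
var_r = max((ms_r - ms_pr) / n_p, 0.0)    # rater severity variance

total = var_p + var_r + var_pr
for name, v in [("person", var_p), ("rater", var_r), ("residual", var_pr)]:
    print(f"{name:>8} variance: {v:.3f} ({100 * v / total:.1f}% of total)")
```

The share of total variance attributable to the rater facet is the quantity a G-study reports; a large share indicates that raters, rather than candidates, are driving differences in the final scores.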
Statistical methods such as the G-theory approach and Rasch analysis enable researchers to investigate the effects of rater factors in a more sophisticated manner than traditional correlation coefficients for inter-rater reliability. These models, however, leave the complexity and richness inherent in the rating process under-explored. In other words, statistical modeling might leave some sources of score variance unexplained. Since writing assessment is a complex and multifaceted activity beyond the scope of statistical manipulation of scores (Hamp-Lyons, 1995), researchers have therefore called for an in-depth investigation into this “black box” in their quantitative studies (Eckes, 2005; Weigle, 1998).
Another line of inquiry has been devoted to exploring raters’ thought processes during rating in order to unveil the underlying reasons leading to the scores. Rather than focusing on quantitative score analysis, the verbal protocol approach perceives raters as decision makers who might follow different mental paths to arrive at their final judgments. The following part thus addresses rater judgment from this perspective, with two major focuses: the essay features that raters pay attention to, and the reading styles/sequences they adopt and the decision-making behaviors they employ for assessment purposes. It covers two dimensions of research: (1) studies using indirect evidence to infer raters’ rationale for their decision-making, and (2) studies drawing on raters’ verbal protocols to directly investigate their scoring behaviors while rating.
Researchers have attempted to infer what textual features raters are likely to attend to, and how they weigh these features when making evaluative judgments, by linking textual features to candidates’ writing performance. The majority of such studies first extract a range of lexical, sentential and discoursal features from the written texts and correlate those features with the scores awarded by a particular group of raters. Statistical analyses such as analysis of variance (ANOVA), multiple regression and discriminant analysis are usually employed to identify the weight of those features in explaining variance in scores, and thereby to infer what criteria raters use to judge writing performance at various levels of proficiency.
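To make this feature-to-score logic concrete, the following sketch (illustrative only; the feature set, data and variable names are hypothetical) regresses simulated holistic scores on a handful of extracted text measures, in the way such studies estimate the relative weight of each feature in explaining score variance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-essay measures: lexical diversity, mean words per sentence,
# error-free T-unit ratio, and metadiscourse markers per 100 words.
rng = np.random.default_rng(1)
n = 120
X = np.column_stack([
    rng.uniform(0.3, 0.8, n),    # type-token ratio
    rng.uniform(8, 25, n),       # words per sentence
    rng.uniform(0.2, 0.9, n),    # error-free T-unit ratio
    rng.uniform(0, 6, n),        # metadiscourse markers / 100 words
])
# Simulated holistic scores loosely driven by the features plus noise.
y = 2 + 3 * X[:, 0] + 0.05 * X[:, 1] + 4 * X[:, 2] + 0.1 * X[:, 3] \
    + rng.normal(0, 0.7, n)

model = LinearRegression().fit(X, y)
print("R^2 (variance in scores explained):", round(model.score(X, y), 3))
names = ["type-token ratio", "words/sentence",
         "error-free T-unit ratio", "metadiscourse/100w"]
for name, coef in zip(names, model.coef_):
    print(f"{name:>24}: {coef:+.3f}")
```

The size of each coefficient is then read as indirect evidence of how heavily raters weight the corresponding feature, which is precisely the inferential leap criticized below.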
Generally, a number of studies have found that scores are significantly correlated with textual features at the micro level, such as vocabulary (Engber, 1995; Grant & Ginther, 2000; Laufer & Nation, 1995), complexity and accuracy of the syntax and morphology in the written texts (Bardovi-Harlig, 1997; Bardovi-Harlig & Bofman, 1989; Cumming & Mellow, 1996; Ishikawa, 1995; Reid, 1992), and accurate use of articles (Cumming & Mellow, 1996), as well as with features of macro-level text structure, including meta-discourse markers (Intaraprawat & Steffensen, 1995), rhetorical framing (Tedick & Mathison, 1995), and cohesive devices and various indices of clause structure (Grant & Ginther, 2000; Ishikawa, 1995; Intaraprawat & Steffensen, 1995; Reid, 1992; Reynolds, 1995). However, findings regarding the respective weight of these features vary according to the context of the assessment.
The discourse analysis approach has enabled researchers to extract objective discourse features statistically correlated with scores in order to distinguish writing performances at different levels. However, this approach assumes that statistically salient textual features are the ones most likely to be perceived by raters and thus employed by them as criteria for judging writing performances. Additionally, correlational studies often fall short of giving causal accounts of observed phenomena, especially in the case of the rating process (Freedman & Calfee, 1983). For example, if text length is found to be highly correlated with essay score, this does not necessarily mean that raters use text length as an important criterion in their judgment. In addition, these correlational studies have ignored “an interactive theory of textual communication where the text is constructed by the reader.”
Other sources of evidence from the perspective of raters are therefore needed to present a more plausible and detailed picture of and account for raters’ rating processes.
Instead of simply resorting to statistical analysis of essay features, another strand of studies is more “rater-oriented.” Such studies have revealed that raters vary in the complex decision-making behaviors they use to identify key text features, interpret rating scales, weigh the value of different features and then reach a scoring decision (e.g., Barkaoui, 2010a; Vaughan, 1991). The underlying assumption of these studies is that raters are readers who operationally define the construct of writing proficiency in making evaluative judgments on the overall quality of writing performance.
Research adopting this approach follows two major orientations. One is concerned with academic writing assessment in educational settings and aims to investigate relevant raters of EFL essays, such as teachers in particular disciplines and EFL teachers, in order to identify the aspects of EFL essays they consider most significant in differentiating good essays from poor ones (e.g., Connor-Linton, 1995; Hamp-Lyons & Zhang, 2001; Kobayashi & Rinnert, 1996; Santos, 1988). Studies of university professors’ reactions to academic writing showed that professors seemed to make a distinction between content and language and tended to focus more on content, development and organization when evaluating both native English students’ and non-native students’ writing (Mendelsohn & Cumming, 1987; Santos, 1988). Other researchers have surveyed specific rater groups about various types of errors in EFL students’ writing and asked them to identify the varying degrees of influence such errors have on their judgment of essay quality, referred to as “error gravity” (Rifkin & Roberts, 1995).
The other line of research mainly inquires into the salient features raters attend to in their rating process, through qualitative data collection techniques such as questionnaires (Connor-Linton, 1995; Shi, 2001), immediate retrospective written reports (e.g., Milanovic et al., 1996) or observation of raters’ discussion of essay scores (Hamp-Lyons, 1991). Eckes (2008) identified six major rater types, each focusing on certain dimensions of discourse features: the Syntax Type, the Correctness Type, the Structure Type, the Fluency Type, the Non-fluency Type and the Non-argumentation Type. His findings indicated that raters show significant variability in their views of the weight or importance of well-specified scoring criteria.
Correlating quantifiable discourse features with scores and eliciting qualitative comments from raters have given insights into what raters perceive as important to their rating and revealed part of what goes through raters’ minds in their decision-making processes. However, much remains unexplored about how raters acquire and process information about students’ writing performance and how they make decisions and monitor their rating processes. In the context of performance assessment, researchers have begun to look for direct evidence underlying raters’ judgments. Verbal protocol analysis is credited with providing such insights, not only into students’ thinking processes, but also into the attitudes and behaviors of raters (Lazaraton & Taylor, 2007).
A number of studies have used think-aloud or intro-/retrospective protocols to explore both rater focus and the procedures raters follow to arrive at their final judgments. The findings of these studies are discussed below in terms of their specific focuses.
One important concern in studies of raters is to trace the information they heed when rating candidates’ writing performance. It has been found that raters tend to focus on a wide range of textual features, which can be grouped into several broad categories of discourse features. In analyzing rater focus, researchers have adopted different approaches, top-down or bottom-up. For instance, Cumming et al. (2001) prescribed no common scoring rubric for the raters to follow. In his earlier study, Cumming (1990) found that ratings by both expert and novice raters tended to increase consistently with students’ level of ESL proficiency or writing expertise. The ratings of the novice and expert groups differed in that novice raters were more lenient than experienced raters in their ratings of content and rhetorical organization, although not in their ratings of language use. Statistically significant differences were related to raters’ attention to specific aspects of the content, syntax, or rhetorical organization of the compositions. Vaughan (1991) found that in addition to criterion-related aspects, raters also attended to a variety of non-criterion information not covered by the given rating scale. Furthermore, even when commenting on similar categories of discourse features, raters vary in which features they focus on and attach different weights to these features of candidates’ performances.
With regard to weighting different features, Sakyi (2000) identified three dominant sets of factors determining raters’ judgments of essay quality: content-related factors (agreement with the opinion expressed by the writer, assumptions about task demand/task completion, presentation of ideas, essay length); language-related factors (grammatical errors, vocabulary complexity, sentence structure, style); and comparison of an essay’s quality with that of preceding ones. Milanovic et al. (1996) catalogued sixteen general categories of textual features raters remarked upon when differentiating the quality of writing performance and found that raters distribute their focus across these general categories in different patterns. Regarding proficiency levels, they noted that with higher-level scripts, raters tended to focus more on vocabulary and content, and less on effectiveness and text length. Milanovic et al. (1996) thus concluded that raters’ decision-making with regard to the final scores “was determined to varying extent by the weight they attributed to those composition elements.”
However, a list of heeded textual features only indicates the aspects of essays raters notice when rating students’ scripts. Further analysis of rater focus suggests that raters choose to focus on different textual features across different proficiency levels of the target performance and across various writing tasks. Green (1998), investigating rater strategies in assessing essays, found that raters focused on different aspects of essays at different proficiency levels. It can be inferred that good scripts tend to elicit more rater attention to higher-level features such as register, style, layout and content, and that raters’ focus on these features declines with essay quality, while features such as grammatical accuracy, task understanding and task completion are more readily attended to when raters rate poor essays. Similar findings have been echoed in other studies (Milanovic et al., 1996; Cumming et al., 2001). Cumming et al. (2001) identified a list of text features and sub-features (“traits” in Hamp-Lyons, 1991, or “rater objectives” in DeRemer, 1998), derived empirically from think-aloud data, that raters attended to with any frequency. Like other researchers, Smith (2000) found substantial evidence of individual differences in the particular features discussed for each essay when raters considered each of the rating criteria. However, he found little consistency in the ways in which the raters interpreted and applied the assessment criteria. Raters therefore employ the assessment criteria to justify their scoring decisions but interpret the criteria in different ways. A further source of disagreement is that raters cite examples of specific textual features when making rating decisions but fail to agree at that specific level.
These inconsistent findings underpin the need to explore what raters actually do with scoring descriptors while rating. Findings on rater focus, as discussed, have revealed what textual features raters may focus on in making their scoring judgments (see Table 3-1). Raters’ decision-making is, by its nature, a cognitively complex process involving information acquisition and processing. Another important factor to be examined in raters’ rating processes is therefore the way in which raters process the information they acquire about candidates’ writing performance and the procedures they follow to arrive at their scoring decisions.
Table 3-1 Summary of studies on rater focus
(Table 3-1 continued)
Studies in the field do not provide adequate knowledge of how EFL essay raters distinguish between different levels of writing proficiency; as Lumley (2002) observed, although raters follow fundamentally similar rating processes, the relationship between scale content and text quality remains obscure. Charney (1984) suggested that raters might apply idiosyncratic criteria, and they usually fall back on their own judging criteria when an essay does not fit the features defined in the rating scale (Vaughan, 1990). According to Wolfe and his colleagues, the scoring criteria adopted by a rater may be determined by three factors, namely, interactions between the rater’s prior beliefs and understanding of writing and the writing process; the compatibility of the rater’s values with the scoring rubric; and the effectiveness of the methods and materials used to train the rater. As a result, the scoring criteria adopted by a particular rater may differ from the explicit, external criteria contained in the scoring rubric.
Cumming’s (1990) observation suggests that some raters focus more on observable features of the text, as indicated by Charney (1984). This assertion seems to be substantiated by Homburg’s (1984) funnel model. Homburg used objective measures to determine what readers do when they grade essays and suggested that raters first group ESL essays grossly on the basis of one feature and then categorize them further on the basis of other features. Discriminant analysis identified five significant measures associated with the steps of the rating process, namely, second-degree errors per essay, dependent clauses per essay, words per sentence, conjunctions per essay and error-free t-units per essay. The funnel model depicts rater behavior as linear and sequential, following a specific pattern based entirely on countable features of the graded text, such as the number of words, errors, dependent clauses, and conjunctions in an essay. However, it has been criticized for not taking into account the complex nature of the rating process (Cumming, 1990; Cumming et al., 2002; Lumley, 2000, 2002; Milanovic et al., 1996; Wolfe & Feltovich, 1994).
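The kind of analysis behind the funnel model can be sketched as follows (a sketch under stated assumptions: the data are simulated and the feature values invented; only the five measure names come from the study as reported above). A linear discriminant analysis is fitted to the five countable measures to see how well they separate holistically graded score groups:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Hypothetical data: five countable measures per essay (as in the funnel
# model), plus a holistic grade group (0 = low, 1 = mid, 2 = high).
rng = np.random.default_rng(2)
n = 150
grade = rng.integers(0, 3, n)
X = np.column_stack([
    rng.poisson(8 - 2 * grade),        # second-degree errors per essay
    rng.poisson(4 + 2 * grade),        # dependent clauses per essay
    rng.normal(12 + 3 * grade, 2),     # words per sentence
    rng.poisson(5 + grade),            # conjunctions per essay
    rng.poisson(6 + 4 * grade),        # error-free t-units per essay
]).astype(float)

lda = LinearDiscriminantAnalysis()
acc = cross_val_score(lda, X, grade, cv=5).mean()
print(f"Cross-validated classification accuracy: {acc:.2f}")

lda.fit(X, grade)
names = ["2nd-degree errors", "dependent clauses", "words/sentence",
         "conjunctions", "error-free t-units"]
print("Loadings on the first discriminant function:")
for name, w in zip(names, lda.scalings_[:, 0]):
    print(f"{name:>20}: {w:+.3f}")
```

Such a model, like the funnel model itself, treats rating as a sequence of checks on countable features of the text, which is exactly the simplification that process-oriented studies of rating behavior call into question.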
To summarize, no consensus has been reached among researchers concerning the roles and effects of rater background, or the interactions among various rater factors, on EFL essay scores and rating processes. In addition, one limitation of these studies is that researchers do not actually explain how the language proficiency of raters mediates the rating process. Another limitation is that the majority of the studies have been either strictly qualitative, focusing on the rating process and rater behavior, or strictly quantitative, limiting themselves to the analysis of essay scores.
Many of the studies of rater judgments have been conducted from the perspective of the sequence of steps that raters go through when they rate. The main issue raised in these studies is the disagreement over whether or not there is one basic procedure adopted by raters (Wolfe, 1997). Many researchers (e.g., DeRemer, 1998; Milanovic & Saville, 1996; Smith, 2000; Vaughan, 1991) have suggested that there are important variations in the approaches to rating used by different raters, the exceptions being Freedman and Calfee (1983) and Homburg (1984), who each proposed a different model of the rating sequence. Freedman and Calfee (1983) developed an information-processing model of holistic rating in which they identified three main processes underlying the rating of an essay; their model also incorporates two influencing factors: rater characteristics (expectations, values, reading ability, world knowledge) and the assessment context (writing task, time of day).
From a cognitive perspective, Ruth and Murphy (1988) describe the processes of raters in their various roles in relation to the writing task and essay. Essay raters assume dual roles: a comprehender of the writing task and essay, and an evaluator of the essay. In these roles, each rater “interacts with each of the given texts to make meaning and to accomplish the communicative goals of the assessment episode as those goals are understood by each participant.”
Differences in these interactions may lead to differences in the scores assigned to various features of each essay. With a view to model building, Wolfe and colleagues (Wolfe & Feltovich, 1994; Wolfe & Ranney, 1996; Wolfe, 1997; Wolfe, Kao & Ranney, 1998) propose a cognitive model of essay scoring comprising two main components. The framework of scoring describes the “mental representation of the process through which an essay is read and evaluated” (Wolfe, 1997, p. 91). The framework of writing, on the other hand, describes the rater’s “mental representation of the characteristics that constitute proficient or non-proficient writing” (Wolfe, 1997, p. 90). However, these studies treated operational ratings and ratings accompanied by think-aloud protocols as though they were equivalent, an assumption for which there is insufficient evidence.
From the perspective of the linguistic features that raters focus on while rating, Vaughan (1991) concluded that each rater relied upon their own strategies or style of reading when making scoring judgments. Five reading strategies were identified: the single-focus approach, the first-impression approach, the two-category strategy, the laughing rater and the grammar-oriented rater. However, her classification was couched in very broad terms and seems to conflate raters’ differing preferences in rating focus with their particular strategies. Her findings suggest that there are considerable individual differences between raters, particularly in how raters perceive essay content and organization and in regard to specific rater characteristics.
Some researchers have investigated the approaches raters use to evaluate essays in terms of the number of times they read each essay, together with their motivations for reading and rereading. Weigle (1994) explored the judgments made by 16 experienced and less-experienced raters of Cambridge FCE/CPE essays. The data revealed four discernible approaches to essay marking: the principled and pragmatic two-scan approaches, the reread-through approach and the provisional-mark approach. Milanovic et al. (1996) examined a suite of tests that relied on single scores given by raters, following the rating schemes adapted from previous studies on the FCE/CPE exams. Four broad types of rater reading strategies were identified (Milanovic et al., 1996, pp. 98-100). Within these four reading strategies, Milanovic et al. (1996) identified seven steps: pre-marking; scan; read quickly; rate; modify; reassess/revise (including, possibly, an additional scan or quick read); and decide on a final mark. Their follow-up study (Milanovic & Saville, 1994) revised this model, dividing raters’ approaches into linear and cyclical rater strategy models. The linear model proceeds through these stages in sequence, suggesting a much more careful type of behavior. The cyclical model essentially identifies interruption to and repetition of the stages mentioned above, involving a number of loops and adjustments to the score awarded. Milanovic and Saville suggested that although they identified various differences in rater behavior, these may not have significant consequences for scoring. In terms of rater focus, the results of protocol analysis indicate that different script levels do appear to elicit different marking behavior. With higher-level scripts, markers focused more on vocabulary and content; with intermediate-level scripts (FCE), markers focused more on communicative effectiveness and task realization. Smith (2000) employed think-aloud verbal reports to explore the rating consistency and rating decisions of six experienced raters. He identified three reading strategies in the assessment process: the read-through-once-then-scan approach, the performance-criteria-focused approach and the first-impression-dominates approach. One of the interesting findings was that some raters who adopted a holistic approach may be resistant to externally imposed assessment criteria (i.e., the scale), bringing with them “an internalized and personalized view of what constitutes an acceptable quality or standard of writing which may be impervious to attempts to control, through the performance criteria, variations inherent in the judgments of individual raters.”
The weakness of this study is the very small number of scripts examined.
Cumming (1990) talked of rating in terms of “complex, interactive processes” and “a far more multi-faceted, variable process than the non-linear funnel model.”
He investigated the decision-making behaviors of 13 raters (7 novices and 6 experts) when evaluating ESL essays holistically, to assess whether raters implicitly distinguish students’ L2 writing proficiency from their language proficiency. Analyses of raters’ verbal reports revealed 28 common decision-making behaviors, many of which varied significantly between the two groups. Twenty of these behaviors were classified under three dimensions of features to which raters awarded scores: substantive content, language use and rhetorical organization. The remaining judgment behaviors were labeled as aspects of a “self-control focus” used by raters. However, unlike previous researchers (Huot, 1993), Cumming did not claim that these comments contribute to the final score.
Two more recent studies have built upon Cumming’s (1990) study: Sakyi (2000) and Cumming et al. (2001). Sakyi (2000) proposed a “tentative model” of factors affecting holistic scores based on an analysis of the verbal protocols of six experienced raters. Four different reading strategies were observed in the raters’ think-aloud data. He claimed that “for raters who made a conscious effort to follow the scoring guide, the restrictions imposed on them to assign a final score caused them to depend mostly on only one or two particular features to distinguish between different levels of ability.”
Raters thus focused on (1) errors in the text, (2) the essay topic and presentation of ideas, (3) their personal reaction to the text and (4) the scoring guide. Cumming et al. (2001) revised Cumming’s (1990) framework of rater behaviors into 35 behaviors. They grouped these behaviors into three main categories: self-monitoring focus, rhetorical and ideational focus, and language focus, which will be elaborated in Chapter 5.
Focusing on the influence of rater background on rating, Erdosy (2004) observed two quite distinct reading strategies amongst four experienced raters from different backgrounds. Two raters made multiple readings of the compositions, sorting them into piles before deciding on final scores, while the other two read each text only once and assigned a score immediately afterwards. Erdosy (2004) suggested that reading strategies, rather than deriving from language teaching or learning experience, “must be developed through focused training and practical assessment experiences” (p. 12). Such differences in behavior appeared to result from a lack of specific guidance about how the rating task should be approached, leading raters to draw on their previous teaching and assessment experience. He pointed out that raters’ judgments of students’ writing performances are partly based on their inferences about candidates’ knowledge, efforts and thought. In the absence of a scoring rubric, the relationship between performance and proficiency was mediated with reference to raters’ experiences in teaching and, in the case of non-native-speaker raters, in learning ESL. He called for studies on the “extent to which variability in judgments can be ascribed to variability in raters’ backgrounds.”
Studies have investigated the decision-making of raters when rating essays with no specific rating guidelines (e.g., Cumming et al., 2002), or when using holistic (e.g., Milanovic et al., 1996) or multiple-trait rating scales (e.g., Cumming, 1990; Lumley, 2002; Smith, 2000). Research in this area is providing a broader understanding of the cognitive processes that raters go through when assessing essays.
Cumming’s (1990) study of the verbal protocols of ESL instructors also supports Freedman and Calfee’s (1983) model. He observed that experienced raters tend to integrate their interpretations and judgments of situational and textual features of compositions simultaneously, using a wide range of relevant knowledge and strategies. He identified 28 interpretation and judgment strategies that could not only form the component behaviors of the three processes outlined in Freedman and Calfee’s model but could also serve as a broader basis for further research in developing explicit and accurate definitions of the knowledge and strategies involved in scoring. Taking the complexity of raters’ cognitive processes into account, Milanovic et al. (1996) later developed a model of raters’ decision-making processes based in part on the decision-making behaviors identified in Cumming’s (1990) study.
Similarly, as discussed in Section 3.2.2.3, Wolfe and Feltovich (1994) have presented a model of rater cognition built around two types of mental models: models of performance and models of scoring. In the model of scoring, the authors describe three general scoring behaviors: interpretation (how a rater takes in information and determines what aspects of the response will be considered as evidence of competence or non-competence); evaluation (how discrepancies or inconsistencies in the evidence will be dealt with, or how different aspects of the response will be weighted in the decision-making process); and justification (how raters monitor their own performance and attention and how they incorporate corrective feedback into their scoring activities). According to the authors, these behaviors form the bulk of raters’ scoring frameworks, and each of them is likely to be performed through the execution of a number of processing actions. However, the manner in which different raters execute these behaviors appears to vary considerably. Wolfe and Feltovich concluded that raters’ models of performance become more cohesive and complex as their professional experience increases, and that raters’ models of scoring differ according to their rating expertise.
Building upon and refining Cumming’s (1990) and Sakyi’s (2000) frameworks of behaviors, Cumming et al. (2002) proposed two models of rater decision-making behavior. The first, discussed in the previous section, is a macro-strategy model that describes the overall sequence of rater decision-making behaviors when rating essays holistically. The second model considers both raters’ decision-making behaviors when rating L2 essays holistically and the aspects of writing that raters attend to. This micro-strategy model consists of 35 decision-making behaviors grouped under three focuses (self-monitoring, rhetoric and ideas, and language) and two strategies (interpretation and judgment) (see Table 3-2 for the framework). As Cumming et al. (2002) emphasize, the model portrays essay rating as an interactive process wherein the rater reads, judges, exercises diverse self-control strategies, and attends to numerous aspects of writing simultaneously. Cumming et al.’s model offers a useful framework for examining the role and effects of rater background and scoring method on essay rating processes. Besides recognizing the interpretative and judgmental nature of scoring, Cumming et al. also claim, regarding the source of assessment criteria, that essay scoring is based on “prevailing norms of educational practices as well as individuals’ past experiences.”
Table 3-2 Descriptive framework of decision-making behaviors(Cumming et al., 2002)
(Table 3-2 continued)
To summarize, two observations can be made concerning studies of rater focus, reading strategies, and raters’ decision-making behaviors. First, given that the studies reviewed are exploratory and descriptive in nature, their findings are mixed, reflecting the specific assessment contexts and the different rater groups investigated. This is quite natural in that rating, as a complex cognitive process and an ill-structured problem-solving task, is bound to exhibit considerable variation across raters and contexts. However, what really matters is not just describing the superficial similarities or differences observed, but investigating what actually leads raters to reach their scoring decisions. Investigating and comparing raters of various backgrounds is one starting point for such inquiry. This approach avoids an obsession with the taxonomies of rating processes and rater behaviors outlined above, which do not actually explain how the language proficiency of raters mediates the rating process.
Second, the majority of the studies have been either strictly qualitative, focusing on the rating process and rater behavior, or strictly quantitative, limiting themselves to the analysis of essay scores. More systematic empirical studies, combining statistical analysis with qualitative interpretation, are thus needed to provide evidence of the links between raters’ judgments and various sources of influence, and to establish a more unified picture of how these influencing factors cause the detected differences among raters when they make scoring decisions.
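One way such a quantitative-qualitative link could be operationalized (a sketch only; the rater groups and behavior counts are hypothetical, although the behavior categories follow Cumming et al., 2002) is to test whether the frequencies of decision-making behaviors coded from think-aloud protocols differ between rater groups, alongside the usual score-based analysis:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of coded think-aloud behaviors (columns) for two
# rater groups (rows), e.g. higher- vs lower-writing-proficiency raters.
behaviors = ["self-monitoring", "rhetoric/ideas focus", "language focus"]
counts = np.array([
    [40, 85, 60],   # group A raters
    [35, 55, 95],   # group B raters
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
for name, col_exp in zip(behaviors, expected.T):
    print(f"{name:>22}: expected counts {np.round(col_exp, 1)}")
```

A significant association between group membership and behavior frequencies, interpreted alongside score-level analyses, is the kind of converging evidence such mixed-methods designs aim to provide.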
Regarding the factors involved in and influencing rater judgments, no consensus has yet been reached on the effect of language proficiency, and in particular writing proficiency, on rater judgments, and this may be explained on both theoretical and methodological grounds. Theoretically, language background has been operationalized from different perspectives across the studies conducted. Few studies have really examined raters’ writing proficiency rather than focusing on the native/non-native dichotomy of L1 backgrounds. Additional information about how writing proficiency influences raters’ rating judgments is therefore essential. Methodologically, in terms of quantitative analysis, researchers have examined the correlations between the scores raters assign and measures of specific essay features (e.g., Homburg, 1984; Tedick & Mathison, 1995), or have had essays rewritten to reflect particular strengths and weaknesses (e.g., Kobayashi & Rinnert, 1996; Mendelsohn & Cumming, 1987). Other studies have used qualitative data including interviews (e.g., Erdosy, 2004), questionnaires (e.g., Shi, 2001), written comments (e.g., Milanovic et al., 1996; Rinnert & Kobayashi, 2001), and think-aloud protocols (e.g., Cumming et al., 2002) to identify the evaluation criteria that raters employ. However, no mixed-methods study, combining a quantitative analysis of discourse features with a qualitative analysis of raters’ perceptions and behaviors, has been conducted in recent years with a view to investigating raters’ language proficiency and its influence on their ratings.