3.4 Assessing rater effects on essay scores: Many-Facet Rasch Model

Various models and techniques have been used to investigate essay scores. Classical Test Theory (CTT) has proven inadequate for performance assessment since it can deal with only a single source of variance within a given analysis, usually occasions, tests, or raters. Generalizability theory (G-theory) (Bachman, 1997; Brennan, 2001) and the Many-Facet Rasch Model (MFRM) (Linacre, 1994; McNamara, 1996; Weigle, 1998) provide more powerful techniques for identifying and estimating the effects of various factors on rater-mediated test scores simultaneously. Furthermore, because they usually do not require control of factors in the assessment context, these newer theories and techniques often allow the investigation of scores obtained in authentic assessment contexts.

G-theory provides a theoretical framework and a set of procedures for estimating the relative effects of different factors on test scores in performance assessment (Bachman, 1997). This approach, however, has several limitations. First, G-theory often requires a fully crossed design in order to obtain accurate estimates of the variance components of the different facets in the assessment context, which is often impractical in authentic contexts. Second, G-theory does not consider the quality of ratings, such as raters not using the whole range of scores available. Third, G-theory assumes that the properties of the rating scale are linear, that the distances between score points are equal, and, as a result, that the observed raw scores represent an interval scale. These assumptions do not always hold, however, and G-theory does not provide procedures to assess them. Finally, G-theory provides information at the group level, rather than at the level of the individual rater, student, and task (Linacre, 1996).

MFRM addresses the limitations raised above. Rasch models are a family of logistic latent-trait models of probabilities that permit the calibration of test item difficulty and test-taker ability independently of each other, while placing them in a common frame of reference. These models use sophisticated mathematical procedures to develop estimates of person ability and item difficulty in terms of the probabilities of persons’ expected responses to items (McNamara, 1996). According to the Rasch model, the probability of a correct response to an item in a dichotomously scored test is merely a function of the difference between the student’s ability and the difficulty of the item. Measures of person ability and item difficulty are expressed in units called logits, which are log-odds transformations of observed scores across all students and items. As McNamara (1996) explains, a logit scale is a true interval scale for expressing the relationship between item difficulty and student ability.
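In standard Rasch notation (an illustrative rendering, not reproduced from the original text), the dichotomous model and its logit form can be written as

\[
P(X_{ni}=1) = \frac{\exp(B_n - D_i)}{1 + \exp(B_n - D_i)}, \qquad
\log\frac{P(X_{ni}=1)}{P(X_{ni}=0)} = B_n - D_i,
\]

where \(B_n\) is the ability of student \(n\) and \(D_i\) is the difficulty of item \(i\), both expressed in logits on the same scale.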

MFRM enables researchers (a) to model different aspects, called facets, of the assessment context, for example, rater, testing context, and rating session, in addition to student ability and item difficulty; (b) to estimate their effects on scores; and (c) to place them on the same logit scale for comparison. When a rating scale is used, the model provides a mechanism for developing linear measures of all facets of the assessment from the raw, potentially ordinal, ratings through the logistic transformation of raw scores to a logit scale. Each facet is calibrated from the observed scores, and all facets (student, rater, rating dimension, etc.) are placed on a single common linear scale called a variable or facets map. Assuming three facets (student, task, and rater), McNamara (1996) explains that MFRM regards each rating as a function of the interaction of the ability of the candidate, the difficulty of the task, and the severity of the rater.
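As an illustration in the standard notation for Linacre’s rating-scale formulation of MFRM (not quoted from the original text), a three-facet model for a rating scale with ordered categories can be written as

\[
\log\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k,
\]

where \(B_n\) is the ability of student \(n\), \(D_i\) the difficulty of task \(i\), \(C_j\) the severity of rater \(j\), and \(F_k\) the difficulty of being awarded category \(k\) rather than category \(k-1\) on the rating scale. Because all parameters are estimated on the common logit scale, students, tasks, raters, and scale steps can be placed together on a single variable map.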

MFRM enjoys several advantages. First, it is robust in the presence of missing data. Second, raw scores are sufficient statistics for estimating student ability, task difficulty, and rater severity. Third, the estimation of a parameter depends on the accumulation of all the ratings in which it participates, but is independent of the particular values of any of those ratings. Fourth, the measures constructed from the raw scores lie on a linear frame of reference, which allows comparison. Fifth, the values of each parameter are independent of all other parameters within the frame of reference (Bond & Fox, 2001; Linacre, 1994, 1996; McNamara, 1996). In addition, while studies using G-theory require completely crossed designs, the only design requirement for MFRM is that “there be enough linkage between all elements of all facets that all parameters can be estimated within one frame of reference without indeterminacy.”

The computer program FACETS (Linacre, 2010) is employed to operationalize MFRM. This program uses observed scores to provide parameter estimates for each facet, as well as information about the reliability of each of these estimates, in the form of standard errors, and the validity of the measures, in the form of fit statistics. Each facet is calibrated from the relevant observed ratings, and all but the student facet are anchored at a common origin of zero. When the rating scale includes several rating categories, FACETS also allows the estimation of rating category difficulties. FACETS further permits scale diagnosis and bias analysis. Scale diagnosis aims to assess the quality of the rating scale by examining how scale steps or levels are functioning to create an interpretable measure and whether scale-step thresholds indicate a hierarchical pattern in the rating scale (Bond & Fox, 2001). Bias analysis aims to identify any systematic sub-patterns of behavior arising from the interaction of a particular rater with a particular aspect of the rating situation, and to estimate the effects of these interactions on essay scores (Kondo-Brown, 2002; Lumley & McNamara, 1995). FACETS analysis has therefore been employed in various studies to explore the effects of various facets in the rating context, and of interactions among them, on essay scores (e.g., Hill, 1997; Lumley & McNamara, 1995; Myford & Wolfe, 2000; Weigle, 1998).
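FACETS itself is driven by its own specification and data files, which are not reproduced here. Purely as a hypothetical sketch of the underlying logic (not the FACETS program, and not the data of the present study), the following Python code fits a dichotomous three-facet model to invented ratings as a logistic regression with dummy-coded student, task, and rater facets; the coefficients play the roles of ability, (negative) relative difficulty, and (negative) relative severity, and the standard errors parallel the precision information FACETS reports.

```python
# Illustrative sketch only: a toy dichotomous three-facet Rasch analysis
# approximated by logistic regression with dummy-coded facets.
# All names, parameter values, and data below are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical generating parameters, in logits.
ability    = {"S1": 1.0, "S2": 0.2, "S3": -0.5, "S4": -1.2}   # student ability B_n
difficulty = {"T1": -0.3, "T2": 0.4}                           # task difficulty D_i
severity   = {"R1": -0.5, "R2": 0.0, "R3": 0.6}                # rater severity C_j

# Fully crossed toy design with replications: every rater scores every
# student on every task, score = 1 ("adequate") or 0 ("inadequate").
rows = []
for _ in range(30):
    for s, b in ability.items():
        for t, d in difficulty.items():
            for r, c in severity.items():
                p = 1.0 / (1.0 + np.exp(-(b - d - c)))
                rows.append({"student": s, "task": t, "rater": r,
                             "score": rng.binomial(1, p)})
data = pd.DataFrame(rows)

# Dummy-code the facets; dropping the first task and first rater anchors
# those facets (FACETS instead centres the non-student facets at zero).
X = pd.get_dummies(data[["student", "task", "rater"]])
X = X.drop(columns=["task_T1", "rater_R1"]).astype(float)
y = data["score"].astype(float)

fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)  # student columns ~ ability; task/rater columns ~ -(relative difficulty/severity)
print(fit.bse)     # standard errors, analogous to the precision FACETS reports
```

This is only a rough analogue of a FACETS run: the real program also handles polytomous rating scales, sparse rating designs, fit statistics, scale diagnosis, and bias analysis, none of which are shown in this sketch.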

One issue related to the current study is that many researchers have been concerned about the Rasch model’s assumption of unidimensionality and its appropriateness when dealing with writing performance scores. The major concern is that while performance on writing is complex, involving various abilities and cognitive skills, measurement models assume that the test is measuring one trait. In response to this concern, McNamara (1996) explains the need to distinguish between psychometric and psychological unidimensionality. Performance on any language task is necessarily psychologically multidimensional, as models of language ability suggest (Bachman, 1990). Psychometrically, however, the Rasch model merely “hypothesizes a single measurement dimension of ability and difficulty,” and its analysis of test data “represents a test of this hypothesis in relation to the data” (McNamara, 1996, p. 275).

To summarize, MFRM has proven a valuable tool for investigating the effects of different facets in the assessment context and interactions among these facets. The present study employed FACETS to analyze the scores obtained from the raters at different levels of writing proficiency.
