
The use of predictive modeling in education for predicting student performance has increased substantially over the last few years (Borray, 2017). Researchers and practitioners estimate, or train, statistical models on student data where the outcome has already been observed so that predictions can be made for new students. In the training process, a subset of the data is withheld in order to validate the accuracy of the prediction model. In these models, it is generally assumed that the predictors are valid and reliable. Moreover, each observation has equal weight in the modeling process. However, in many situations, researchers may have additional information with regard to the quality, or reliability, of each individual measure. This paper explores the impact of lower-quality measures on the accuracy of prediction models, especially when used for scoring students’ written work.

Rubrics are a common tool for measuring student performance (Moskal & Leydens, 2000). They describe the quality of student work across multiple criteria. On the rubric, each score level has descriptors that characterize the quality of a response at that level. The result is a student score on an ordinal scale. The reliability of these measures is typically estimated by having multiple raters score the same student work. The score is assumed to be reliable and accurate when the raters agree. Disagreements are often resolved either by having a third rater score the student work (and taking the average or mode) or by having the two raters confer and agree on a “true” score. There are numerous reasons why raters may disagree, but regardless of the reason for the disagreement, the disagreement provides valuable information with regard to the quality, or confidence, of the score.

If we conceptualize rubric scores as the categorization of an underlying, but unmeasured, continuous construct, then disagreements could indicate that the actual score falls between the scoring levels. One approach would be to take the average of the two raters’ scores, thereby expanding a \(k\)-level rubric to a \(2k - 1\)-level rubric. However, this is undesirable for predictive modeling since we wish to predict with the same number of levels as the training dataset.
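For example, with a hypothetical four-level rubric (\(k = 4\)), averaging two raters’ scores yields up to \(2k - 1 = 7\) possible levels. A minimal sketch in R (the rater scores here are made up for illustration):

```r
# Hypothetical scores from two raters on a four-level rubric (k = 4)
rater1 <- c(1, 2, 3, 4, 2, 3)
rater2 <- c(1, 3, 3, 4, 3, 3)

# Averaging expands the scale to half-point increments
averaged <- (rater1 + rater2) / 2

# The averaged scale has 2k - 1 = 7 possible levels: 1, 1.5, 2, ..., 4
possible_levels <- seq(1, 4, by = 0.5)
length(possible_levels)  # 7
```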

Another perspective is that if rater A scored a student work a one, and rater B scored the same work a two, it is possible the work has qualities (as described in the rubric) of both a one and a two. In situations where written work (e.g. an essay) is quantified for use in predictive models (e.g. automated scoring engines), the exclusion of a potential third scoring level could have implications for the accuracy of statistical models predicting new student scores. Regardless, given the disagreement of raters, our confidence that the score is clearly a one or a two is much lower than if the raters agreed.


The Hewlett Foundation sponsored a Kaggle competition to advance the accuracy of automated scoring algorithms (Shermis & Hamner, 2012). Essay set two (see Appendix A) includes responses from 1,800 essays from 10th graders with rubric scores in two domains: writing applications (scores ranging from 1 to 6) and language conventions (scores ranging from 1 to 4). All essays were scored by two raters. Given the rarity of 1s, 5s, and 6s in domain one and of 1s in domain two, those essays were excluded, leaving 1,682 essays for this analysis.

LightSIDE (Mayfield & Rosé, 2014) was used to extract text features from the essays (i.e. unigrams and bigrams, including punctuation, with stemmed N-grams and stopwords skipped in N-grams). The training set consisted of 70% of the essays randomly selected, with the remaining 30% used for validation. A multinomial log-linear regression model was used to train two prediction models: one using the entire training set (full model) and one using only essays where the two raters agreed (agreement model). The nnet (Venables & Ripley, 2002) package for R (R Core Team, 2019) was used to estimate the models. The accuracy for the writing applications domain was 0.7% lower for the agreement model, but accuracy was 5.6% higher for the language conventions domain (Table 1).
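The full-versus-agreement comparison can be sketched with nnet::multinom. The data below are a hypothetical stand-in (the actual models were fit to the LightSIDE text features, and all object names here are assumptions):

```r
library(nnet)

set.seed(42)
# Hypothetical stand-in data: x1 and x2 play the role of extracted text
# features; score1 and score2 are the two raters' scores (levels 2-4)
essays <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
essays$score1 <- sample(2:4, 200, replace = TRUE)
essays$score2 <- ifelse(runif(200) < 0.7, essays$score1,
                        pmin(4, essays$score1 + 1))

# 70/30 train/validation split
train_rows <- sample(nrow(essays), 0.7 * nrow(essays))
train <- essays[train_rows, ]
valid <- essays[-train_rows, ]

# Full model: all training essays
full_model <- multinom(factor(score1) ~ x1 + x2, data = train, trace = FALSE)

# Agreement model: only essays where the two raters agreed
agreement_model <- multinom(factor(score1) ~ x1 + x2,
                            data = train[train$score1 == train$score2, ],
                            trace = FALSE)

# Validation accuracy for each model
acc_full  <- mean(predict(full_model, newdata = valid) == valid$score1)
acc_agree <- mean(predict(agreement_model, newdata = valid) == valid$score1)
```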

Table 1. Prediction accuracy

Domain                     Full     Agreement
1. Writing Applications    63.9%    63.1%
2. Language Conventions    55.7%    61.3%

Research Questions

Given the differences in model accuracy for differing data, this paper is guided by the following research question: What is the impact of rater disagreements on the accuracy of predicting scores?


To explore the impact of lower-confidence scores on predictive models, data will be simulated with a single quantitative predictor variable for a dichotomous outcome variable. For the predictor variable, three groups will be simulated from random normal distributions with separate means but equivalent standard deviations: Group A with an outcome of 0, Group B with an outcome of 1, and Group Unknown with a random outcome with a specified probability (the default is 50% for each outcome). Group sizes can be specified, but are \(n_A = 100\), \(n_B = 100\), and \(n_{unknown} = 50\) by default. The simulatePredictionData function in the IRRsim R package facilitates the simulation of the data matrix. The following code generates two data frames for training and validating the predictive models.

test1.train <- simulatePredictionData(
    mean1 = -0.1, n1 = 100,
    mean2 = 0.1, n2 = 100,
    mean_unknown = 0.0, n_unknown = 50, ratio_unknown = 0.5,
    sd = 0.2)
table(test1.train$Y, test1.train$Group)
#>       A   B  NA
#>   0 100   0  24
#>   1   0 100  26
test1.valid <- simulatePredictionData(
    mean1 = -0.1, n1 = 100,
    mean2 = 0.1, n2 = 100,
    mean_unknown = 0.0, n_unknown = 50, ratio_unknown = 0.5,
    sd = 0.2)
table(test1.valid$Y, test1.valid$Group)
#>       A   B  NA
#>   0 100   0  22
#>   1   0 100  28

Logistic regression is used to train two predictive models using the glm function in R: (1) using all the data in the training set, and (2) using only data from groups A and B, thereby excluding the uncertain group, which has a random outcome.

uncertain_label <- attr(test1.valid, 'labels')[3]
# Note: the model object names lr.full and lr.known are assumed; the
# original names were lost from the source
lr.full <- glm(Y ~ X, 
               data = test1.train, 
               family = binomial(link = 'logit'))
lr.known <- glm(Y ~ X, 
                data = test1.train[test1.train$Group != uncertain_label,], 
                family = binomial(link = 'logit'))
test1.valid$Fitted.Full <- predict(lr.full, newdata = test1.valid)
test1.valid$Fitted.Known <- predict(lr.known, newdata = test1.valid)

Figure 1 shows the distributions of the predictor variable for each group separately, along with the fitted logistic regression using all the data in black and using only groups A and B in orange. This is a fairly large effect of the predictor, with a standardized difference of 1.0 between groups A and B. The mean for the uncertain group in this example falls in the middle, so those observations have equal likelihood of an outcome of 0 or 1.

Figure 1. Logistic Regression Plot


The accuracy of the predictions using the model trained with all the data is 62%, and for the model trained with only groups A and B it is 63%. The simulatePredictionModels function will facilitate simulating, training, and validating many datasets (500 by default) with the same parameters.
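Accuracy here is the proportion of validation observations whose predicted class matches the observed outcome. A minimal sketch with made-up data mirroring the simulation design, thresholding predicted probabilities at 0.5:

```r
set.seed(2112)
# Hypothetical two-group data: standardized difference of 1.0 between groups
x <- c(rnorm(100, mean = -0.1, sd = 0.2), rnorm(100, mean = 0.1, sd = 0.2))
y <- rep(0:1, each = 100)

fit <- glm(y ~ x, family = binomial(link = 'logit'))
prob <- predict(fit, type = 'response')    # predicted probabilities
acc  <- mean(as.integer(prob > 0.5) == y)  # classification accuracy
```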

test1 <- simulatePredictionModels(
    mean1 = -0.1, n1 = 100,
    mean2 = 0.1, n2 = 100,
    mean_unknown = 0.0, n_unknown = 50, ratio_unknown = 0.5,
    sd = 0.2)


Figure 2 is a histogram of the differences in prediction accuracy on the validation dataset between the models trained with all data and the models trained excluding the disagreements (full minus agreement). On average, training the model with only agreement scores resulted in an increase in accuracy of 1.17%.
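The quantities plotted in Figure 2 can be summarized as follows. The accuracy values below are randomly generated for illustration only; the actual values come from simulatePredictionModels:

```r
set.seed(7)
# Illustrative paired validation accuracies for 500 simulated datasets
acc_full      <- rnorm(500, mean = 0.62, sd = 0.03)
acc_agreement <- acc_full + rnorm(500, mean = 0.01, sd = 0.02)

# Full-model accuracy minus agreement-model accuracy, as in Figure 2
diffs <- acc_full - acc_agreement
hist(diffs, xlab = 'Difference in prediction accuracy',
     main = 'Full model minus agreement model')
mean(diffs)  # negative on average: the agreement model is more accurate
```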

Figure 2. Distribution of differences in prediction accuracies.



The reliability of measures used in predictive modeling is an important contributor to the overall accuracy of the model. The inclusion and exclusion of data in training datasets is an important consideration for researchers and practitioners. The impact of disagreements, or uncertain scores, on the accuracy of predictive models will vary depending on the data available. To assist practitioners in estimating the impact of uncertain scores on the accuracy of predictive models, an R Shiny (Chang et al., 2019) application was developed (Figure 3). This application simulates many datasets using the parameters specified by the user, providing model accuracy estimates using both the full and agreement-only data.

Figure 3. Screenshot of the prediction model simulation Shiny application


Borray, A. (2017). Predicting student success with big data. Educause Review. Retrieved from

Chang, W., Cheng, J., Allaire, J.J., Xie, Y. & McPherson, J. (2019). shiny: Web Application Framework for R. R package version 1.3.2.

Mayfield, E., & Rosé, C.P. (2014). LightSIDE: Open source machine learning. In M.D. Shermis and J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions (pp. 124-135). New York: Routledge.

Moskal, B.M., & Leydens, J.A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research, & Evaluation, 7(10).

R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL

Shermis, M.D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays: analysis. Paper presented at the National Council on Measurements in Education, 2012, Vancouver, BC.

Venables, W. N. & Ripley, B. D. (2002), Modern Applied Statistics with S. Fourth Edition. Springer, New York.

Appendix A: Essay Prompt and Rubric

Essay Prompt

Censorship in the Libraries

“All of us can think of a book that we hope none of our children or any other children have taken off the shelf. But if I have the right to remove that book from the shelf – that work I abhor – then you also have exactly the same right and so does everyone else. And then we have no books left on the shelf for any of us.” –Katherine Paterson, Author

Write a persuasive essay to a newspaper reflecting your views on censorship in libraries. Do you believe that certain materials, such as books, music, movies, magazines, etc., should be removed from the shelves if they are found offensive? Support your position with convincing arguments from your own experience, observations, and/or reading.

Scoring Rubric