April 11, 2023

Motivation

  • The Diagnostic Assessment and Achievement of College Skills (DAACS; www.DAACS.net) is a suite of technological and social supports designed to optimize student learning. DAACS provides personalized feedback about students’ strengths and weaknesses in key academic (mathematics, reading, & writing) and self-regulated learning skills, linking students to resources that help them become successful students.
  • For the writing assessment, we’re trying to establish inter-rater reliability in two ways:
    1. Human-to-human
    2. Human-to-machine
  • The literature clearly suggests ICC is the appropriate measure; however, most of that research comes from the medical literature.
  • Guidance on interpretation of ICC is not very clear.

Guiding Research Questions

  1. What is the relationship between intraclass correlation (ICC) and percent rater agreement (PRA)?
  2. Are the published guidelines for interpreting ICC appropriate for all rating designs?

Measuring Inter-rater Reliability

Percent Rater Agreement (PRA)

  • Percentage of scoring events where raters’ scores agree.

Intraclass Correlation (ICC)

  • ICC is a measure of reliability used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble one another.

Cohen’s Kappa

  • Measure of agreement between two raters that takes into account agreement by chance.

Fleiss’ Kappa

  • Extension of Cohen’s kappa for more than two raters. It also takes into account agreement by chance.
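
Both kappa statistics can be computed directly in R. The short sketch below is illustrative only and is not part of the original analysis: it assumes the psych and irr packages are installed, and the rating matrices are made-up examples rather than DAACS data.

    # Illustrative sketch; assumes the psych and irr packages are available.
    library(psych)   # provides cohen.kappa()
    library(irr)     # provides kappam.fleiss()

    set.seed(2112)
    # Two raters scoring 50 hypothetical essays on a 1-3 scale
    two.raters <- data.frame(r1 = sample(1:3, 50, replace = TRUE),
                             r2 = sample(1:3, 50, replace = TRUE))
    cohen.kappa(two.raters)        # Cohen's kappa (two raters)

    # Add a third rater to illustrate Fleiss' kappa
    three.raters <- cbind(two.raters, r3 = sample(1:3, 50, replace = TRUE))
    kappam.fleiss(three.raters)    # Fleiss' kappa (more than two raters)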

Components of Measuring IRR

Models:

  • One-way random effects: each subject is measured by a different set of k randomly selected raters.
  • Two-way random effects: k raters are randomly selected, then, each subject is measured by the same set of k raters.
  • Two-way mixed effects: k fixed raters are defined. Each subject is measured by the k raters.

Number of measurements:

  • Single measures: even though more than one measurement is taken in the experiment, reliability is applied to a context where a single measurement from a single rater will be used.
  • Average measures: the reliability is applied to a context where measures of k raters will be averaged for each subject.

Consistency or absolute agreement:

  • Absolute agreement: the agreement between two raters is of interest, including systematic errors of both raters and random residual errors.
  • Consistency: in the context of repeated measurements by the same rater, systematic errors of the rater are canceled and only the random residual error is kept.

IRR Statistic | Description | Formula
Percent Agreement | One-way random effects; absolute agreement | \(\frac{number\ of\ observations\ agreed\ upon}{total\ number\ of\ observations}\)
ICC(1,1) | One-way random effects; absolute agreement; single measures | \(\frac{MS_R - MS_W}{MS_R + (k - 1)MS_W}\)
ICC(2,1) | Two-way random effects; absolute agreement; single measures | \(\frac{MS_R - MS_E}{MS_R + (k - 1)MS_E + \frac{k}{n}(MS_C - MS_E)}\)
ICC(3,1) | Two-way mixed effects; consistency; single measures | \(\frac{MS_R - MS_E}{MS_R + (k - 1)MS_E}\)
ICC(1,k) | One-way random effects; absolute agreement; average measures | \(\frac{MS_R - MS_W}{MS_R}\)
ICC(2,k) | Two-way random effects; absolute agreement; average measures | \(\frac{MS_R - MS_E}{MS_R + \frac{MS_C - MS_E}{n}}\)
ICC(3,k) | Two-way mixed effects; consistency; average measures | \(\frac{MS_R - MS_E}{MS_R}\)
Cohen’s Kappa | Absolute agreement | \(\frac{P_o - P_e}{1 - P_e}\)

Note. \(MS_R\) = mean square for rows; \(MS_W\) = mean square for residual sources of variance; \(MS_E\) = mean square error; \(MS_C\) = mean square for columns; \(P_o\) = observed agreement rates; \(P_e\) = expected agreement rates.
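
As a point of reference (this example is not from the original document), all six ICC forms in the table can be computed with the ICC() function from the psych package, and percent agreement with the agreement() function from the IRRsim package introduced below; psych is an assumption here and is not part of IRRsim.

    # Sketch: computing the table's statistics for one simulated rating matrix.
    # simulateRatingMatrix() and agreement() are IRRsim functions shown later in
    # this document; psych::ICC() is used here for the ICC estimates.
    library(IRRsim)
    library(psych)

    set.seed(2112)
    ratings <- simulateRatingMatrix(nLevels = 3, k = 6, k_per_event = 6,
                                    agree = 0.6, nEvents = 100)
    agreement(ratings)      # percent rater agreement
    ICC(ratings)$results    # ICC(1,1) through ICC(3,k)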

Guidelines for Interpreting IRR

Altman (1990) (Not specified)
  • < 0.2 Poor
  • 0.2 - 0.4 Fair
  • 0.4 - 0.6 Moderate
  • 0.6 - 0.8 Good
  • > 0.8 Very good

Cicchetti & Sparrow (1981); Cicchetti (2001) (ICC, Cohen’s kappa)
  • < 0.4 Poor
  • 0.4 - 0.6 Fair
  • 0.6 - 0.75 Good
  • > 0.75 Excellent

Fleiss (1981, 1986); Brage et al. (1998); Martin et al. (1997); Svanholm et al. (1989) (Cohen’s kappa)
  • < 0.4 Poor
  • 0.4 - 0.75 Fair
  • > 0.75 Excellent

Koo & Li (2016) (ICC)
  • < 0.5 Poor
  • 0.5 - 0.75 Moderate
  • 0.75 - 0.9 Good
  • > 0.9 Excellent

Landis & Koch (1977); Zeger et al. (2010) (Cohen’s kappa)
  • < 0.2 Slight
  • 0.2 - 0.4 Fair
  • 0.4 - 0.6 Moderate
  • 0.6 - 0.8 Substantial
  • > 0.8 Almost perfect

Portney & Watkins (2009) (ICC)
  • < 0.75 Poor to moderate
  • > 0.75 Reasonable for clinical measurement

Shrout (1998) (Not specified)
  • < 0.1 Virtually none
  • 0.1 - 0.4 Slight
  • 0.4 - 0.6 Fair
  • 0.6 - 0.8 Moderate
  • > 0.8 Substantial

    Guidelines for Education

    In education, ICC(1,1) is often the most appropriate form of ICC to use given that each subject is measured by two randomly selected raters (one-way), single measures are used, and absolute agreement is desired (Shrout & Fleiss, 1979).

    Guidance on interpreting ICC is largely found in the medical literature. When discussing ICC(1,1), Koo and Li (2016) state that “practically, this model is rarely used in clinical reliability analysis because majority of the reliability studies typically involve the same set of raters to measure all subjects” (pp. 156-157).

    The published guidelines for interpreting ICC may not be appropriate for all forms of ICC. As will be shown, the magnitude of the ICC can vary greatly for the same percent agreement.

    The IRRsim R Package

    The IRRsim package provides functions to simulate various scoring designs.

    # Install the development version of IRRsim from GitHub, then load it
    devtools::install_github('jbryer/IRRsim')
    library(IRRsim)

    Key functions:

    • simulateRatingMatrix - Simulate a single rating matrix.
    • simulateIRR - Simulates many scoring matrices with varying percent rater agreements. S3 methods implemented for objects returned by simulateIRR:
      • summary
      • plot
      • as.data.frame
    • IRRsim_demo - Run an interactive Shiny application.

    simulateRatingMatrix

    Simulate a single rating matrix.

    For each scoring event (i.e. row within a scoring matrix; see the sketch after this list):

    1. One of k raters is randomly selected.
    2. A score, y, is randomly selected from the response distribution (uniform distribution by default).
    3. A random number, x, between 0 and 1 is generated.
      • If x is less than the specified desired percent agreement, the remaining values in the row are set to y.
      • Otherwise, scores for the remaining raters are randomly selected from the response distribution.
    4. Repeat steps 1 through 3 for the remaining scoring events.
    5. If \(k_m < k\), then \(k - k_m\) scores per row are set to NA (missing).
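
    The following is a minimal sketch of the procedure above, written for illustration only; it is not the package’s actual implementation, the function name simulate_matrix_sketch is hypothetical, and a uniform response distribution is assumed.

    # Illustrative sketch of the steps described above (not the IRRsim source code).
    simulate_matrix_sketch <- function(nLevels, k, k_per_event, agree, nEvents) {
      ratings <- matrix(NA, nrow = nEvents, ncol = k)
      for (i in seq_len(nEvents)) {               # 4. loop over scoring events (repeat steps 1-3)
        rater <- sample(seq_len(k), 1)            # 1. randomly select one of the k raters
        y <- sample(seq_len(nLevels), 1)          # 2. draw a score from a uniform response distribution
        ratings[i, rater] <- y
        others <- setdiff(seq_len(k), rater)
        if (runif(1) < agree) {                   # 3. with probability `agree`, the remaining raters agree
          ratings[i, others] <- y
        } else {                                  #    otherwise their scores are drawn at random
          ratings[i, others] <- sample(seq_len(nLevels), length(others), replace = TRUE)
        }
        if (k_per_event < k) {                    # 5. set k - k_m scores per row to NA (missing)
          ratings[i, sample(seq_len(k), k - k_per_event)] <- NA
        }
      }
      ratings
    }

    For example, simulate_matrix_sketch(nLevels = 3, k = 6, k_per_event = 2, agree = 0.6, nEvents = 10) mirrors the second simulateRatingMatrix() call shown in the next section.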

    Simulating Scoring Matrices

    set.seed(2112)
    test1 <- simulateRatingMatrix(
        nLevels = 3, k = 6, k_per_event = 6,
        agree = 0.6, nEvents = 10)
    test1
    ##    aa ab ac ad ae af
    ## 1   1  1  1  1  1  1
    ## 2   2  2  2  2  2  2
    ## 3   1  1  1  1  1  1
    ## 4   1  1  1  2  1  1
    ## 5   2  1  2  1  3  1
    ## 6   2  2  2  2  2  2
    ## 7   2  2  2  2  2  2
    ## 8   2  2  1  1  1  2
    ## 9   2  2  2  2  2  2
    ## 10  3  1  3  2  3  2
    agreement(test1)
    ## [1] 0.6
    set.seed(2112)
    test2 <- simulateRatingMatrix(
        nLevels = 3, k = 6, k_per_event = 2,
        agree = 0.6, nEvents = 10)
    test2
    ##    aa ab ac ad ae af
    ## 1  NA  1 NA  1 NA NA
    ## 2  NA NA NA NA  2  2
    ## 3   1 NA NA NA  1 NA
    ## 4  NA NA  1 NA  1 NA
    ## 5  NA NA  2 NA NA  1
    ## 6   2 NA  2 NA NA NA
    ## 7  NA  2 NA  2 NA NA
    ## 8  NA NA  1 NA NA  2
    ## 9  NA NA NA  2  2 NA
    ## 10  3  1 NA NA NA NA
    agreement(test2)
    ## [1] 0.7

    simulateIRR

    Simulates many scoring matrices with varying percent rater agreements. For each scoring matrix, IRR statistics are calculated. This function returns an object of class IRRsim, which has S3 methods defined for plot, summary, and as.data.frame.

    For the remainder of the document, we wish to estimate ICC for 6, 9, and 12 raters under the conditions of 3, 5, and 9 scoring levels.

    tests.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3)
    tests.5levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 5)
    tests.9levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 9)

    Note that if the parameter parallel = TRUE is specified, the function will use multiple cores if available.
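
    The other S3 methods noted above can be used to inspect the simulation results directly; a brief sketch (the exact structure of the returned data frame is not shown in this document and may differ):

    head(as.data.frame(tests.3levels))   # simulated matrices with their IRR statistics
    plot(tests.3levels)                  # default plot of the simulated IRR statistics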

    tests.3levels.sum <- summary(tests.3levels, stat = 'ICC1', method = 'quadratic')
    summary(tests.3levels.sum$model[[1]]) # k = 6 raters
    ## 
    ## Call:
    ## lm(formula = as.formula(paste0(stat, " ~ I(agreement^2) + agreement")), 
    ##     data = test)
    ## 
    ## Residuals:
    ##       Min        1Q    Median        3Q       Max 
    ## -0.186551 -0.029589 -0.000278  0.031287  0.150912 
    ## 
    ## Coefficients:
    ##                Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)     0.25002    0.02316   10.80   <2e-16 ***
    ## I(agreement^2)  1.96041    0.05491   35.70   <2e-16 ***
    ## agreement      -1.30374    0.07322  -17.81   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.04921 on 897 degrees of freedom
    ## Multiple R-squared:  0.959,  Adjusted R-squared:  0.9589 
    ## F-statistic: 1.049e+04 on 2 and 897 DF,  p-value: < 2.2e-16

    96% of the variance in ICC(1,1) is accounted for by percent agreement!
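
    One practical use of these fitted models is translating a target percent agreement into an expected ICC(1,1). A brief sketch using the model object extracted above (the agreement values are illustrative):

    m.k6 <- tests.3levels.sum$model[[1]]   # quadratic model for k = 6 raters
    predict(m.k6, newdata = data.frame(agreement = c(0.6, 0.7, 0.8, 0.9)))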

    Simulating Different Response Distributions

    # Uniform response distribution
    test1.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                                 response.probs = c(.33, .33, .33))
    # Lightly skewed response distribution
    test4.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                                 response.probs = c(.275, .275, .45))
    # Moderately skewed response distributions
    test2.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                                 response.probs = c(.2, .2, .6))
    # Highly skewed response distributions
    test3.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                                 response.probs = c(.1, .1, .8))

    Uniform Distribution

    ## .
    ##         1         2         3 
    ## 0.3341111 0.3319944 0.3338944

    Lightly Skewed

    ## .
    ##         1         2         3 
    ## 0.2746389 0.2750574 0.4503037



    Moderately Skewed

    ## .
    ##         1         2         3 
    ## 0.2003907 0.1991685 0.6004407

    Highly Skewed

    ## .
    ##          1          2          3 
    ## 0.10013519 0.09962222 0.80024259

    Modeling ICC from Percent Agreement

    Simulated 374,400 scoring matrices (each 100 x k) with k between 2 and 16 raters; between 2 and 5 scoring levels; and four response distributions (i.e. uniform, lightly skewed, moderately skewed, and highly skewed). The results are included in the package and can be loaded with the data(IRRsimData) command. The script used to generate these results is available at https://github.com/jbryer/IRRsim/blob/master/data-raw/IRRsimulations.R.

    For each scoring design (i.e. \(k\), \(k_m\), & number of scoring levels), the following model was estimated:

    \[ICC = \beta_{2}PRA^2 + \beta_{1}PRA + \beta_{0}\]
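
    A sketch of estimating one of these per-design models from the simulations included in the package is shown below; the column names (agreement, ICC1, k, k_per_event, nLevels) are assumptions based on the lm() output shown earlier and may differ from the actual IRRsimData columns.

    # Sketch only: the subsetting columns are assumed, not verified against IRRsimData.
    library(IRRsim)
    data(IRRsimData)
    design <- subset(IRRsimData, k == 6 & k_per_event == 2 & nLevels == 3)
    fit <- lm(ICC1 ~ I(agreement^2) + agreement, data = design)
    summary(fit)$r.squared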

    For ICC(1,1), ICC(2,1), and ICC(3,1), \(R^2 \ge 0.80\)

    For ICC(1,k), ICC(2,k), and ICC(3,k), the relationship is weaker as \(k_m\) approaches \(k\).

    However, when \(k_m < \frac{k}{2}\), \(R^2 \ge 0.80\).

    Distribution of \(R^2\) values across the scoring designs:

    Statistic | Mean | Median | Minimum | Maximum
    ICC1      | 0.91 | 0.93   | 0.71    | 0.99
    ICC2      | 0.91 | 0.93   | 0.71    | 0.99
    ICC3      | 0.91 | 0.93   | 0.71    | 0.99
    ICC1k     | 0.78 | 0.79   | 0.63    | 0.88
    ICC2k     | 0.78 | 0.79   | 0.63    | 0.88
    ICC3k     | 0.78 | 0.79   | 0.63    | 0.88

    Shiny Application

    An interactive Shiny web application was developed to facilitate the simulation of scoring matrices for any particular scoring design.

    IRRsim_demo()

    Discussion

    • ICC and percent agreement are highly correlated, with PRA accounting for at least 80% of the variance in ICC, and more than 90% when \(k_m = 2\).
    • Skewness in the response distribution does not appear to impact ICC. This may not be desirable: as one score increases in likelihood (say, response C out of four possible scores), agreement by chance between the raters should also increase. ICC, which is supposed to account for agreement by chance, does not do so in this situation.
    • When \(k_m = 2\) (and more generally when \(k_m < \frac{k}{2}\)), there is a substantial penalty in ICC as k increases.
    • The published guidelines for interpreting ICC may not be appropriate for one-way designs (i.e. ICC(1,1) and ICC(1,k)). For many common scoring designs in education (e.g. \(k_m = 2\), \(k > 20\)), achieving “excellent” (Cicchetti, 2001) inter-rater reliability is not possible.
    • The Shiny application is a tool to help researchers interpret their ICC results in relation to the percent agreement achieved.

    Recommendations

    • Researchers should report their full scoring design including \(k\), \(k_m\), and number of scoring levels.
    • Report/utilize percent agreement. This metric is much more understandable.

    Thank You!


    Jason Bryer, Ph.D.

    Assistant Professor and Associate Director
    City University of New York
    jason@bryer.org

    Website: irrsim.bryer.org

    Source Code: github.com/jbryer/IRRsim





    DAACS was developed under grant #P116F150077 from the U.S. Department of Education. However, the contents do not necessarily represent the policy of the U.S. Department of Education, and you should not assume endorsement by the Federal Government.