Intraclass Correlation Coefficient Calculator

Author: Roboculator Team

Calculator

Between-Subjects Mean Square (MSb)

Within-Subjects Mean Square (MSw)

Number of Raters / Measurements (k)

Results

ICC Single Measure: 0.571429

ICC Average Measure: 0.8

F Statistic

Reliability Score: 57.1 / 100

Error Share: 42.86

The Intraclass Correlation Coefficient (ICC) Calculator measures the reliability of ratings or measurements made by multiple raters or instruments on the same set of subjects. Unlike Pearson's correlation (which measures association between two specific raters), the ICC assesses consistency among any number of raters and accounts for both the correlation and the agreement in absolute values. It is the standard measure of inter-rater reliability in medicine, psychology, and quality assessment.

Enter the between-subjects mean square (MSb), within-subjects mean square (MSw), and the number of raters from a one-way random effects ANOVA to compute both the single-measures ICC (reliability of a single rater) and the average-measures ICC (reliability of the mean of k raters). This calculator implements the ICC(1,1) and ICC(1,k) forms from the Shrout and Fleiss classification.

How It Works

The ICC is derived from a one-way random effects ANOVA model. The single-measures ICC, denoted ICC(1,1), is:

$$ICC(1,1) = \frac{MS_B - MS_W}{MS_B + (k-1) \cdot MS_W}$$

The average-measures ICC, denoted ICC(1,k), is the Spearman-Brown step-up of ICC(1,1) for the mean of k raters, which simplifies to:

$$ICC(1,k) = \frac{MS_B - MS_W}{MS_B}$$

Where:

MS_B = Between-subjects mean square, reflecting variability between the subjects being rated
MS_W = Within-subjects mean square, reflecting variability among raters for the same subject (measurement error)
k = Number of raters or repeated measures
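
The two formulas above can be sketched in Python as follows. This is a minimal illustration rather than the calculator's actual source; the function name `icc_one_way` and the keys of the returned dictionary are hypothetical.

```python
def icc_one_way(ms_between: float, ms_within: float, k: int) -> dict:
    """One-way random effects ICCs from ANOVA mean squares (illustrative sketch)."""
    if k < 2:
        raise ValueError("k must be at least 2 raters or measurements")
    if ms_between <= 0 or ms_within < 0:
        raise ValueError("ms_between must be positive and ms_within non-negative")

    diff = ms_between - ms_within
    icc_single = diff / (ms_between + (k - 1) * ms_within)  # ICC(1,1)
    icc_average = diff / ms_between                          # ICC(1,k)
    # F is undefined when MSw is exactly zero (perfect agreement); report infinity.
    f_stat = ms_between / ms_within if ms_within > 0 else float("inf")
    return {
        "icc_single": icc_single,
        "icc_average": icc_average,
        "f_statistic": f_stat,
    }


result = icc_one_way(ms_between=25, ms_within=5, k=3)
print(result)
```

With MSb = 25, MSw = 5 and k = 3, the values round to 0.571, 0.8 and 5, in line with the medical image rating example.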

The ICC partitions the total variance into between-subject variance (σ²_b) and within-subject (error) variance (σ²_w):

$$\sigma_b^2 = \frac{MS_B - MS_W}{k}$$

$$\sigma_w^2 = MS_W$$

$$ICC = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}$$

This shows the ICC as the proportion of total variance that is due to true differences between subjects. When raters agree perfectly (MS_W → 0), ICC → 1. When between-subject differences are swamped by rater disagreement (MS_B ≈ MS_W), ICC → 0. Negative estimates occur when MS_W exceeds MS_B, that is, when ratings of the same subject vary more than the subject means do. This points to disagreement among raters that is worse than chance, or to a sample with very little true between-subject variation.
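
As a check that the variance-ratio form agrees with the ICC(1,1) formula, substitute the two variance estimates and multiply the numerator and denominator by k:

$$\frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2} = \frac{\frac{MS_B - MS_W}{k}}{\frac{MS_B - MS_W}{k} + MS_W} = \frac{MS_B - MS_W}{MS_B - MS_W + k \cdot MS_W} = \frac{MS_B - MS_W}{MS_B + (k-1) \cdot MS_W}$$

For example, with MS_B = 25, MS_W = 5 and k = 3, σ²_b = 20/3 ≈ 6.67 and σ²_w = 5, so the ratio is 6.67/11.67 ≈ 0.571, the same as 20/35.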

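The F statistic (MS_B/MS_W) tests the null hypothesis that ICC = 0. In the one-way model it has n − 1 and n(k − 1) degrees of freedom, where n is the number of subjects. A significant F indicates that the ICC is greater than zero; it does not by itself show that reliability is high.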

Understanding Your Results

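ICC values are commonly interpreted using the thresholds from Koo and Li (2016), shown below. Cicchetti (1994) proposes a different set of cutoffs, so state which guideline you are using when you report an interpretation: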

ICC Range	Interpretation
< 0.50	Poor reliability
0.50 - 0.74	Moderate reliability
0.75 - 0.89	Good reliability
≥ 0.90	Excellent reliability
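
The bands in the table can be applied with a small lookup. The sketch below uses a hypothetical helper named `interpret_icc` and assumes the boundaries sit at exactly 0.50, 0.75 and 0.90, so a value such as 0.745 is still labelled moderate.

```python
def interpret_icc(icc: float) -> str:
    """Map an ICC value to the reliability bands in the table (hypothetical helper)."""
    if icc < 0.50:
        return "Poor reliability"  # also covers negative estimates
    if icc < 0.75:
        return "Moderate reliability"
    if icc < 0.90:
        return "Good reliability"
    return "Excellent reliability"


for value in (0.571429, 0.8, 0.9375):
    print(value, interpret_icc(value))
```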

ICC (Single Measures): The reliability of a single rater's score. Use this when each subject will be rated by only one rater in practice, or when you want to know how reliable an individual rater is.
ICC (Average Measures): The reliability of the mean of k raters' scores. Use this when the final score will always be the average of k raters. It is higher than the single-measures ICC whenever that value lies between 0 and 1 and k > 1.
F Statistic: Tests whether subjects differ significantly from each other after accounting for rater variability. A large F supports the existence of reliable individual differences.

The choice between single and average measures depends on how the measurement will be used. If a clinical scale will always be scored by one rater in practice, report the single-measures ICC. If the protocol requires averaging multiple raters, report the average-measures ICC.

Worked Examples

Medical Image Rating (3 Radiologists)

Inputs

ms between: 25

ms within: 5

Results

icc single: 0.571429

icc average: 0.8

interpretation: Moderate reliability

f statistic: 5

Three radiologists rate 20 images. ANOVA yields MSb = 25, MSw = 5. ICC(1,1) = (25-5)/(25+2×5) = 20/35 = 0.571 (moderate for single rater). ICC(1,k) = (25-5)/25 = 0.80 (good for the average of 3 raters). F = 25/5 = 5.0, indicating significant inter-rater agreement.

Pain Assessment (4 Clinicians)

Inputs

ms between: 48

ms within: 3

Results

icc single: 0.789474

icc average: 0.9375

interpretation: Good reliability

f statistic: 16

Four clinicians assess pain in 15 patients. MSb = 48, MSw = 3. ICC(1,1) = (48-3)/(48+3×3) = 45/57 = 0.789 (good for a single clinician). ICC(1,k) = (48-3)/48 = 0.938 (excellent for 4-rater average). F = 16.0, highly significant agreement among clinicians.

Frequently Asked Questions

What is the Intraclass Correlation Coefficient?

The ICC is a reliability coefficient that quantifies the degree of agreement among multiple raters or measurements on the same set of subjects. Unlike Pearson's r (which only measures the linear association between two specific raters), the ICC can handle any number of raters and measures both consistency (relative ranking) and absolute agreement (same numerical values). It is calculated from an ANOVA framework, partitioning total variability into between-subject and within-subject components.

What is the difference between ICC(1,1) and ICC(1,k)?

ICC(1,1) estimates the reliability of a single rater's measurement. It answers: how reliable is one randomly selected rater? ICC(1,k) estimates the reliability of the average of k raters. It answers: how reliable is the mean score across all raters? ICC(1,k) is higher than ICC(1,1) whenever ICC(1,1) lies between 0 and 1, because averaging multiple ratings reduces random error. They are related by the Spearman-Brown formula: ICC(1,k) = k × ICC(1,1) / (1 + (k-1) × ICC(1,1)).
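
The Spearman-Brown relation can be checked numerically. The sketch below uses the mean squares from the pain assessment example (MSb = 48, MSw = 3, k = 4); the helper name `spearman_brown` is hypothetical.

```python
def spearman_brown(icc_single: float, k: int) -> float:
    """Step a single-measure reliability up to the reliability of a k-rater mean."""
    return k * icc_single / (1 + (k - 1) * icc_single)


ms_between, ms_within, k = 48.0, 3.0, 4
icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
icc_average = (ms_between - ms_within) / ms_between

# Stepping ICC(1,1) up by k should reproduce ICC(1,k) up to rounding error.
stepped_up = spearman_brown(icc_single, k)
print(round(icc_single, 6), round(icc_average, 6), round(stepped_up, 6))
```

Both routes give 0.9375 here, which is why the average-measures value can be computed directly as (MSb − MSw) / MSb.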

How do I get MSb and MSw from my data?

Run a one-way random effects ANOVA where subjects are the groups and rater scores are the observations within each group. The ANOVA table will provide MS_Between (the mean square for the subject factor, reflecting between-subject variability) and MS_Within (the residual mean square, reflecting rater disagreement). Most statistical software can also compute ICCs directly (SPSS: Analyze → Scale → Reliability Analysis; R: icc() from the irr package; SAS: PROC MIXED).
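
For a balanced design (every subject has the same number of ratings), the two mean squares can also be computed by hand. The following is a minimal pure-Python sketch; the function name `one_way_mean_squares` is hypothetical and the ratings are made-up numbers used only to illustrate the arithmetic.

```python
def one_way_mean_squares(ratings):
    """Return (MSb, MSw, n, k) for a balanced table: one row per subject, one column per rating."""
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 subjects and the same k >= 2 ratings per subject")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]

    # Between-subjects sum of squares: spread of subject means around the grand mean.
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    # Within-subjects sum of squares: spread of ratings around their own subject mean.
    ss_within = sum((x - m) ** 2 for row, m in zip(ratings, subject_means) for x in row)

    ms_between = ss_between / (n - 1)       # df = n - 1
    ms_within = ss_within / (n * (k - 1))   # df = n(k - 1)
    return ms_between, ms_within, n, k


# Made-up ratings for illustration only: 4 subjects, 3 ratings each.
ratings = [
    [7, 8, 9],
    [4, 5, 6],
    [1, 2, 3],
    [5, 5, 5],
]
ms_b, ms_w, n, k = one_way_mean_squares(ratings)
print(ms_b, ms_w, n, k)
```

The resulting MSb, MSw and k are the three inputs the calculator asks for.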

Can the ICC be negative?

Yes. A negative ICC occurs when within-subject variance exceeds between-subject variance (MSw > MSb), meaning raters disagree more than expected by chance. This can happen when: (1) raters interpret the scale in opposite ways; (2) there is a systematic bias between raters; (3) the sample is very homogeneous (little true between-subject variation). A negative ICC is usually treated as zero in practice, and the rating process should be examined for problems.

How many subjects and raters do I need?

Guidelines suggest a minimum of 30 subjects for stable ICC estimates, though more is better. For the number of raters, k ≥ 3 is recommended for the one-way model. The precision of the ICC estimate depends on both: with 30 subjects and 3 raters, the 95% confidence interval for an ICC of 0.70 spans roughly 0.50 to 0.85. With 50 subjects and 4 raters, it narrows to approximately 0.58 to 0.80. Confidence intervals for ICC should always be reported alongside the point estimate.

Which ICC form should I use?

Shrout and Fleiss (1979) defined 6 forms based on the study design; McGraw and Wong (1996) expanded to 10. The key decisions are: (1) Model -- one-way random (raters vary randomly across subjects) vs. two-way random (same raters rate all subjects) vs. two-way mixed (raters are fixed); (2) Type -- single measures vs. average measures; (3) Definition -- consistency (relative agreement) vs. absolute agreement. This calculator implements ICC(1,1) and ICC(1,k) -- the one-way random model, appropriate when each subject is rated by a different random set of raters.

Sources & Methodology

Shrout, P.E. & Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
Koo, T.K. & Li, M.Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155-163.
McGraw, K.O. & Wong, S.P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30-46.

Roboculator Team

The Roboculator Team explains calculations, planning tools, and practical formulas in clear language for real-life situations.
