Empirical Priors
May 4, 2026 · Lab

Inter-Rater Reliability

A pragmatic walkthrough of agreement statistics in clinical research.

Abstract

A pragmatic walkthrough of the agreement statistics that show up in clinical research methods sections: Cohen's κ, weighted κ, PABAK and Gwet's AC1, Fleiss' κ, Krippendorff's α, ICC, and the often-confused Cronbach's α. The page covers when each is appropriate, where the common traps live, and the most important distinction the literature gets wrong: Cronbach's α is not an inter-rater reliability statistic, no matter how many methods sections imply otherwise.

Why are there so many agreement statistics?

If you have read a clinical AI paper in the last three years, you have probably seen at least one agreement statistic, or at least expected one in the methods section (unfortunately, not always the case). Cohen's κ. Fleiss' κ. Krippendorff's α. ICC. Cronbach's α. They get cited interchangeably. They are not interchangeable. They measure different things, on different kinds of data, under different assumptions, and they routinely tell different stories about the same dataset.

This page is a pragmatic walkthrough. The goal is not to derive every formula. The goal is for you to know which statistic is appropriate for which problem, where the common traps live, and how to read a methods section that reports one of these numbers without taking the number at face value.

The single most important distinction comes at the end. Cronbach's α is not an inter-rater reliability statistic. It looks like one in the methods section. It is not. Reporting Cronbach's α and calling an instrument “reliable” is a category error that the literature commits constantly. We will get there.

The setup

Imagine two physicians scoring the same set of clinical notes on a quality rubric. They both rate 100 notes. We want to know: how much do they agree, and is the agreement meaningful?

The naive answer is “count how often they picked the same score.” This is raw agreement, expressed as a percentage. If they agreed on 70 of 100 notes, raw agreement is 70 percent.

The problem is that raw agreement does not control for chance. If both raters happened to use one category 90 percent of the time, they would agree on most notes by accident, even if their judgments were completely independent. Raw agreement does not tell you whether the agreement is signal or base rate.

This is the problem Cohen's κ was designed to solve.

Cohen's κ, the workhorse

Cohen's κ is the most widely used agreement statistic in clinical research. It is appropriate when:

  • You have exactly two raters.
  • The categories are nominal (unordered: yes/no, AI/Human/Can't tell, present/absent).
  • Each item is rated by both raters.

The formula subtracts the chance-agreement baseline from the observed agreement and normalizes:

κ = (Pₒ − Pₑ) / (1 − Pₑ)

Where Pₒ is the proportion of items the raters agreed on (raw agreement), and Pₑ is the proportion they would have agreed on by chance given their marginal distributions.

Interpretation, per Landis and Koch (1977): 0.0–0.2 slight, 0.2–0.4 fair, 0.4–0.6 moderate, 0.6–0.8 substantial, 0.8+ almost perfect. Most published clinical research treats κ above 0.6 as the floor for “reliable” agreement.
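A minimal sketch of the computation in plain Python (no dependencies; the labels and counts are invented for illustration):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on nominal labels."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)

    # Observed agreement: proportion of items with identical labels.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the raters' marginal proportions,
    # summed over categories.
    marg_a, marg_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((marg_a[c] / n) * (marg_b[c] / n) for c in categories)

    return (p_o - p_e) / (1 - p_e)

# Toy 2x2 example with balanced marginals (hypothetical counts):
# 40 yes/yes, 10 yes/no, 10 no/yes, 40 no/no.
rater_a = ["yes"] * 40 + ["yes"] * 10 + ["no"] * 10 + ["no"] * 40
rater_b = ["yes"] * 40 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 40
print(cohens_kappa(rater_a, rater_b))  # Po = 0.80, Pe = 0.50, kappa = 0.60
```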

Try it below.

Interactive

Cohen’s κ Explorer

Two raters. Two or three categories. Pick a preset or move the sliders. Watch raw agreement and κ tell different stories.

[Interactive widget: a 2×2 Rater A × Rater B contingency table recomputed from the sliders. Example preset with balanced marginals: P(Yes) = 50% for both raters, raw agreement Pₒ = 75.0%, expected chance agreement Pₑ = 50.0%, Cohen's κ = 0.50, PABAK = 0.50.]
Presets

Both raters use both categories about equally and agree most of the time. The textbook case where κ behaves nicely.

See this in the wild: the Brodeur source-attribution data sits at ~84% / 94% “Can’t tell” with raw agreement near 88%, but the paper does not report κ on that task. Try the “Imbalanced marginals” preset and crank both sliders past 0.85.

When does κ mislead? PABAK and Gwet's AC1

Cohen's κ has a well-known weakness: when one category dominates the marginals, κ can crash to near zero even when raters are agreeing strongly on most items. This is the prevalence problem, and it is exactly the pattern that the Brodeur source-attribution data exhibits.

The basic problem: when 94 percent of ratings fall in one category (e.g., “Can't tell”), raw agreement on that category is dominated by base rate, and κ correctly penalizes that. But this can make a substantively reliable rater look unreliable on paper, because there is so little variance for the chance-correction to work against.

Two common fixes:

  • PABAK (Prevalence-Adjusted Bias-Adjusted Kappa), proposed by Byrt, Bishop, and Carlin (1993). Rebalances the table so that prevalence (one category dominating) and bias (raters using categories at different rates) are neutralized, which amounts to swapping κ's marginal-based chance model for a uniform one; for two categories it reduces to 2Pₒ − 1.
  • Gwet's AC1 / AC2, proposed by Gwet (2008). Replaces κ's chance-agreement model with one that is more robust when prevalence is extreme.

These statistics tend to be higher than κ when one category dominates. They tend to be similar to κ when categories are balanced. Reporting all three is the safe move when prevalence is a concern.
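A sketch of both corrections for the two-rater nominal case, in plain Python. The PABAK line follows the (k·Pₒ − 1)/(k − 1) form; the AC1 chance model follows my reading of Gwet (2008), so treat it as illustrative rather than a reference implementation. The rating vectors are invented to mimic an extreme-prevalence task:

```python
from collections import Counter

def pabak(ratings_a, ratings_b):
    """Prevalence- and bias-adjusted kappa: (k*Po - 1) / (k - 1)."""
    n = len(ratings_a)
    k = len(set(ratings_a) | set(ratings_b))
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    return (k * p_o - 1) / (k - 1)

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1: chance model built from mean category prevalence."""
    n = len(ratings_a)
    cats = set(ratings_a) | set(ratings_b)
    k = len(cats)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    marg_a, marg_b = Counter(ratings_a), Counter(ratings_b)
    pi = {c: (marg_a[c] + marg_b[c]) / (2 * n) for c in cats}  # mean prevalence
    p_e = sum(p * (1 - p) for p in pi.values()) / (k - 1)
    return (p_o - p_e) / (1 - p_e)

# Extreme prevalence: 94 of 100 items are "can't tell" for both raters,
# plus a handful of scattered disagreements (hypothetical counts).
a = ["cant_tell"] * 94 + ["ai"] * 3 + ["human"] * 3
b = ["cant_tell"] * 94 + ["human"] * 3 + ["ai"] * 3

# Cohen's kappa on this data is about 0.48 despite 94% raw agreement;
# the prevalence-robust statistics land much higher.
print(pabak(a, b))      # ~0.91
print(gwet_ac1(a, b))   # ~0.94
```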

See this in the wild. The Brodeur et al. Science 2026 paper on LLM diagnostic reasoning reports raw percentages for a source-attribution task where one rater answered “Can't tell” 94 percent of the time. The paper does not report κ, PABAK, or Gwet's AC1. The exact pattern this section is about. See the full critique →

Weighted κ, when categories are ordinal

When categories have a natural ordering (Bond 1-5, NEJM Healer rubric ratings, Likert scales), unweighted κ throws away information. A rater giving a 4 when the other rater gave a 5 is much closer to agreement than one giving a 1 when the other gave a 5. Cohen's weighted κ accounts for this with a penalty matrix:

  • Linear weights: penalty proportional to the rank distance between categories.
  • Quadratic weights: penalty proportional to the squared rank distance. The more common choice for many clinical scales.

A subtle trap: weighted κ requires the categories to actually be ordinal. Applying linear-weighted κ to nominal categories (e.g., AI / Human / Can't tell) is a methodological mismatch: the rank ordering is not meaningful, so the weights are arbitrary. Some published papers do this. The Brodeur supplement reports a linear-weighted κ for nominal source attribution.
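If you are working in Python, scikit-learn's cohen_kappa_score exposes the weighting directly through its weights argument; a quick sketch with invented ordinal scores:

```python
from sklearn.metrics import cohen_kappa_score

# Two raters scoring the same ten notes on an ordinal 1-5 rubric (invented data).
rater_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
rater_b = [4, 4, 5, 3, 5, 2, 3, 3, 4, 4]

print(cohen_kappa_score(rater_a, rater_b))                       # unweighted: 4-vs-5 is a full miss
print(cohen_kappa_score(rater_a, rater_b, weights="linear"))     # penalty grows with rank distance
print(cohen_kappa_score(rater_a, rater_b, weights="quadratic"))  # penalty grows with squared distance
```

Near-miss disagreements (a 4 against a 5) cost much less under the weighted versions, which is exactly the behavior you want for ordinal scales and exactly the behavior that makes no sense for nominal ones.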

Fleiss' κ, when there are more than two raters

Cohen's κ is locked to exactly two raters. Many real research designs use three or more. Fleiss' κ (1971) extends the chance-corrected agreement framework to n raters per item, with a key flexibility: the raters scoring each item do not have to be the same individuals, as long as every item receives the same number of ratings. You can have 50 items each rated by 3 of 5 possible raters, and Fleiss' κ still works.

Mechanics: instead of computing pairwise agreement, Fleiss' κ computes the proportion of rater-pairs that agree within each item, averaged across items, then chance-corrects against the overall category marginals.

When you have a fixed set of raters who all rate every item, Fleiss' κ and the multi-rater extension of Cohen's κ tend to agree closely. When the rater roster varies across items, Fleiss' κ is the appropriate choice.
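A sketch of the computation from an item-by-category count matrix (plain Python; the counts are invented, but the structure, every row summing to the same number of raters, is what the standard Fleiss' κ formula assumes):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i in category j.
    Every row must sum to the same number of raters n."""
    N = len(counts)        # items
    n = sum(counts[0])     # raters per item
    k = len(counts[0])     # categories

    # Overall category proportions define the chance model.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    p_e = sum(p * p for p in p_j)

    # Per-item agreement: proportion of rater pairs that agree, averaged over items.
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / N

    return (p_bar - p_e) / (1 - p_e)

# 5 items, 3 raters each, 3 categories (hypothetical counts).
counts = [
    [3, 0, 0],
    [2, 1, 0],
    [0, 3, 0],
    [0, 2, 1],
    [1, 1, 1],
]
print(fleiss_kappa(counts))
```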

Visualization coming in v2.

Krippendorff's α, the general case

Krippendorff's α is the Swiss army knife of agreement statistics. It handles:

  • Any number of raters (one, two, many).
  • Any level of measurement (nominal, ordinal, interval, ratio).
  • Missing data, items rated by different subsets of raters.
  • Different category sets per item, in some specifications.

The cost of this generality is conceptual overhead. The formula is more involved than κ's, and the interpretation depends on the measurement level and the distance metric you choose. For most use cases where Cohen's κ or Fleiss' κ would be appropriate, α and κ tend to give similar values.

If you are designing a study, the practical advice is: use the simpler statistic when its assumptions are met, and use Krippendorff's α when they are not (especially with missing data).
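In practice nobody computes this by hand. A sketch using the krippendorff package from PyPI, which handles missing ratings and the different measurement levels; the data matrix here is invented, with np.nan marking items a rater skipped:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = raters, columns = items; np.nan = this rater did not rate this item.
reliability_data = np.array([
    [1,      2, 2, 1, 3, np.nan, 2],
    [1,      2, 2, 2, 3, 1,      2],
    [np.nan, 2, 2, 1, 3, 1,      1],
], dtype=float)

# level_of_measurement also accepts "ordinal", "interval", and "ratio".
print(krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement="nominal"))
```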

Visualization coming in v2.

ICC, when ratings are continuous

When ratings are continuous (Likert composites, percentages, lab values, image scores on a 0–100 scale), κ is the wrong tool. The right tool is the Intraclass Correlation Coefficient (ICC).

ICC has six common forms, formalized by Shrout and Fleiss (1979). The forms differ along three axes:

  1. One-way vs. two-way model. One-way assumes each item is rated by a different set of raters drawn from a population. Two-way assumes the same set of raters rates every item.
  2. Random vs. mixed. Random treats raters as drawn from a population (you want to generalize to other raters). Mixed treats raters as fixed (you only care about these specific raters).
  3. Single rater vs. average of k raters. Are you reporting reliability of one rater's score, or the mean score across k raters?

The most common forms in clinical research:

  • ICC(2,1), two-way random, single rater. “If we drew a new rater from the population, how well would they agree with these raters?”
  • ICC(3,1), two-way mixed, single rater. “Among these specific raters, how well do they agree?”
  • ICC(2,k) and ICC(3,k), same models, but for the average of k raters' scores. Always higher than the single-rater versions.

Picking the wrong ICC form is a common mistake. Two-way random when raters are actually fixed inflates the implied generalizability. Average-of-k when reporting single-rater reliability inflates the implied agreement.
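A sketch using pingouin, which reports all six Shrout-Fleiss forms from a long-format table; the scores and column names here are invented:

```python
import pandas as pd
import pingouin as pg  # pip install pingouin

# Long format: one row per (note, rater) pair, with a continuous quality score.
df = pd.DataFrame({
    "note":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "rater": ["A", "B"] * 6,
    "score": [72, 68, 85, 88, 60, 55, 91, 90, 77, 80, 66, 70],
})

# Returns ICC1, ICC2, ICC3 and the average-of-k forms ICC1k, ICC2k, ICC3k.
icc = pg.intraclass_corr(data=df, targets="note", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```

The point of printing the whole table is that you then have to pick the row that matches your design (random vs. mixed, single vs. average of k) rather than reporting whichever number looks best.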

Visualization coming in v2.

The most important section on this page: Cronbach's α ≠ inter-rater reliability

Cronbach's α (1951) is everywhere in the medical literature. It is reported as a reliability statistic. It is interpreted as “the instrument is reliable.” Both of these things are misleading in a way that matters.

What Cronbach's α actually measures. Internal consistency: the degree to which items within a single instrument correlate with each other. If a rubric has nine items and a rater scores all nine, Cronbach's α tells you how strongly those nine items hang together for that rater. High α means the items are measuring something coherent within an instrument-and-rater pair.

What Cronbach's α does not measure. Whether two raters using the same instrument will agree on what to score. That is inter-rater reliability, and it is a different statistic family entirely (κ, ICC, Krippendorff's α).

Why this matters. A rubric can have very high Cronbach's α and very low inter-rater reliability at the same time. Each rater can be internally consistent, applying their own personal interpretation of the rubric coherently across items, while raters disagree completely with each other on how to apply the rubric. The instrument looks “reliable” in the abstract but is not reliable in deployment.

This conflation is endemic. Many published rubrics in clinical informatics are validated almost exclusively on Cronbach's α, with no inter-rater reliability data, or with κ values too low to support the deployment claims the paper makes. The visualization below makes the conflation visible. Same data, two statistics, different stories.
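To make the conflation concrete before the interactive version below, here is a sketch on invented data: two raters score 12 subjects on a 9-item rubric, with rater B running systematically about a point lower than rater A. Each rater's Cronbach's α comes out high; their agreement with each other does not.

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: subjects x items matrix for ONE rater.
    alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
n_subjects, n_items = 12, 9

# One latent "quality" score per subject drives every rubric item,
# so items hang together within each rater.
quality = rng.normal(3.5, 1.0, size=n_subjects)
rater_a = np.clip(np.round(quality[:, None] + rng.normal(0, 0.3, (n_subjects, n_items))), 1, 5)
# Rater B reads the rubric more harshly: a systematic one-point offset.
rater_b = np.clip(np.round(quality[:, None] - 1.0 + rng.normal(0, 0.3, (n_subjects, n_items))), 1, 5)

print(cronbach_alpha(rater_a), cronbach_alpha(rater_b))  # both high: items cohere within each rater
print((rater_a == rater_b).mean())                       # exact cell-by-cell agreement: poor
# Unweighted exact-match agreement (and a kappa computed on it) is poor,
# even though both alphas look "reliable".
```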

Interactive

Cronbach’s α ≠ inter-rater reliability

Same dataset, two statistics. 12 subjects rated by two raters on a 9-item rubric (1-5 scale). Watch α and κ tell different stories.

Item correlation within rater (drives α): 0.85. High = items hang together inside an instrument.
Inter-rater agreement (drives κ): 0.15. High = raters apply the rubric the same way.
Presets

Each rater is internally consistent (items hang together for them), but raters apply the rubric differently. Cronbach's α looks great. Inter-rater κ is poor. This is the failure mode many clinical rubric papers paper over.

[Interactive widget: the full 12-subject × 9-item score matrices for both raters, with Cronbach's α computed per rater and a quadratic-weighted inter-rater κ averaged item by item, all recomputed as the sliders move.]
The trap: a paper reports α = 0.85 and calls the instrument “reliable.” They have not measured inter-rater agreement at all. A separate paper finds κ ≈ 0.2 on the same instrument. Both can be true simultaneously. Cronbach’s α tells you whether items inside the instrument hang together for one rater. It does not tell you whether two raters will agree.

A worked example in the wild

The Brodeur et al. Science 2026 paper makes several claims that depend on agreement statistics being interpreted correctly. The headline κ = 0.51 on the Bond rubric, the rubric driving the headline accuracy claim, is moderate at best by Landis-Koch conventions. The source-attribution task reports raw percentages but no κ or PABAK, despite one rater answering "Can't tell" 94 percent of the time, exactly the pattern PABAK was designed to handle. The supplement reports a linear-weighted κ on nominal data, which is a methodological mismatch.

Full critique →

Coming next

This page is a primer. A more focused critique of one specific clinical rubric, PDQI-9, the Physician Documentation Quality Instrument, is coming as a co-released paper with the team at Suki. The case study there is exactly the Cronbach-vs-IRR conflation: PDQI-9 is widely cited as “reliable,” the reliability evidence is largely Cronbach's α, the inter-rater reliability evidence is sparse, and the cases where it has been measured show concerning numbers. More on that when the paper drops.

If you found this useful and you want to read me being similarly skeptical about clinical AI papers, the Differential series is the home for that.

Statistics · Clinical AI · Methodology · Lab
empiricalpriors.io