Differential

Reading the clinical AI literature carefully.

There is a pattern in clinical AI publishing that has stopped being a coincidence. A frontier model gets benchmarked against physicians on cases that are almost certainly in its training data, scored on rubrics that reward its house style, and the result hits the wire as "AI matches doctors" before anyone has looked at the kappa. The press release moves faster than the supplementary materials, and the supplementary materials are where the paper actually lives.

A fun aside before we go further. I have close friends who took part in labeling for studies in this family. One told me they forgot the assignment entirely, woke up about ten minutes ahead of the submission deadline, and rushed to finish because they wanted the monetary incentive 😅. That is the actual gap between practice and whatever these studies are measuring.

Speaking of practice. I read a lot of this literature for my day job. I have never used a single piece of clinical AI literature to design my experiments. ML literature, statistics literature, LLM literature, yes. Clinical AI literature, no. So I beg of you, reader: if you are implementing AI at a hospital, or you are a new hire at a clinical AI company and you have no clue what you are doing, ask biomedical PhD research scientists and statisticians to help with your design. Do not follow whatever this is.

The bar I am holding the literature to in this series is "would the result survive a reasonably aggressive deployment review at a clinical AI company." I can't believe I'm saying this, but that bar now seems higher than peer review in our most prestigious journals, and certainly higher than press coverage. If the paper does not clear it, I want to know exactly which load-bearing assumption breaks.

Each Differential issue is a close read of one paper. The bar applies. Let's see which assumption gives.

1 Issue

May 2, 2026 / Issue 01

The Word They Cut

Close read of Brodeur et al. (Science 2026), the latest "AI matches doctors" headline. The 78/80 R-IDEA score reads more as a contamination signal than a capability claim. Eight methodological problems below.

empiricalpriors.io