Program directors and the pass/fail USMLE

Just over a year ago, the NBME announced that Step 1 would become pass/fail in 2022. A lot of program directors complained, saying the change would make it harder to compare applicants. In this study of radiology PDs, most weren’t fans of the news:

A majority of PDs (69.6%) disagreed that the change is a good idea, and a minority (21.6%) believe the change will improve medical student well-being. Further, 90.7% of PDs believe a pass/fail format will make it more difficult to objectively compare applicants and most will place more emphasis on USMLE Step 2 scores and medical school reputation (89.3% and 72.7%, respectively).

Some students also complained, believing that a high Step score was their one chance to break into a competitive specialty.

There are two main reasons some program directors want to maintain a three-digit score for the USMLE exams.

The Bad Reason Step Scores Matter

One reason Step scores matter is that they’re a convenience metric that allows program staff to rapidly summarize a candidate’s merit across schools and other metrics that aren’t directly comparable. This is a garbage use case in all the ways you might imagine, but a few reasons include:

  • The test wasn’t designed for this. It’s a licensing exam, and it’s a single data point.
  • The standard error of measurement is 6. According to the NBME scoring interpretation guide, “plus and minus one SEM represents an interval that will encompass about two thirds of the observed scores for an examinee’s given true score.” As in, given your score on test day, you should expect a score in that 12-point range only 2/3 of the time. That’s quite the range for an objective summary of a student’s worth.
  • The standard error of difference is 8, which is supposed to help us figure out if two candidates are statistically different. According to the NBME, “if the scores received by two examinees differ by two or more SEDs, it is likely that the examinees are different in their proficiency.” Another way of stating this is that within 16 points, we should consider applicants as being statistically inseparable. A 235 and 250 may seem like a big difference, but our treatment of candidates as such isn’t statistically valid. Not to mention, a statistical difference doesn’t mean a real-life clinical difference (a concept tested on Step 1, naturally).
  • The standard deviation is ~20 (19 in 2019), a broad range. With a mean of 232 in 2019 and our standard errors as above, the majority of applicants are going to fall into that +/- 1SD range with lots of overlap in the error ranges. All that hard work of these students is mostly just to see the average score creep up year to year (it was 229 in 2017 and 230 in 2018). If our goal was just to find the “smartest” 10% of medical students suitable for dermatology, then we could just use a nice IQ test and forget the whole USMLE thing.
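To make the arithmetic in that list concrete, here is a minimal sketch in Python. It uses only the figures quoted above (SEM of 6, SED of 8, SD of roughly 20). The function names are my own invention for illustration and aren't anything the NBME publishes.

```python
# Figures quoted in the list above; treated here as fixed assumptions.
SEM = 6   # standard error of measurement
SED = 8   # standard error of difference
SD = 20   # approximate standard deviation of Step 1 scores


def sem_band(score, sem=SEM):
    """Return the (low, high) interval of plus and minus one SEM around a score."""
    return (score - sem, score + sem)


def likely_different(score_a, score_b, sed=SED):
    """Apply the quoted rule of thumb: a gap of two or more SEDs
    suggests the two examinees likely differ in proficiency."""
    return abs(score_a - score_b) >= 2 * sed


def gap_in_sds(score_a, score_b, sd=SD):
    """Express the gap between two scores in standard deviations."""
    return abs(score_a - score_b) / sd


if __name__ == "__main__":
    print(sem_band(235))               # 12-point band around a 235
    print(sem_band(250))               # 12-point band around a 250
    print(likely_different(235, 250))  # 15-point gap, under the 16-point threshold
    print(likely_different(200, 280))  # 80-point gap, well over the threshold
    print(gap_in_sds(200, 280))        # gap in SD units
```

By that rule of thumb, a 235 and a 250 sit 15 points apart, just under the 16-point threshold, so they don't count as likely different.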

It’s easier to believe in a world where candidates are both smarter and just plain better when they have higher scores than it is to acknowledge that a score is a poor proxy for picking smart, hard-working, dedicated, honest, and caring doctors. You know, the things that would actually help predict future performance. Is there a difference in raw intelligence between someone with a 200 vs 280? Almost certainly. That’s 4 standard deviations apart. But what about a 230 and 245? How much are we really just rewarding, by accident, the luxury of having lots of time and money to dedicate to Step prep?

In my field of radiology, I care a lot about your attention to detail (and maybe your tolerance for eyestrain). I care about your ability to not cut corners and lose your focus when you’re busy or at the end of a long shift. I care that you’re patient with others and care about the real humans on the other side of those images.

There’s no test for that.

If there were, it wouldn’t be given by the NBME.

The Less Bad Reason Step Scores Matter

But there is one use case that unfortunately has some merit: multiple-choice exams are pretty good at predicting performance on other multiple-choice exams. That wouldn’t matter here if licensure were the end of the test-taking game, but Step performance tends to predict future board exam performance.

Some board exams are quite challenging, and programs pride themselves on high pass rates and hate dealing with residents who can’t pass their boards. So, Step 1 helps programs screen applicants by test-taking ability.

Once upon a time, I considered a career as a neurosurgeon instead of a neuroradiologist. No denying it certainly sounded cooler. I remember attending a meeting with the chair of neurosurgery at my medical school. This is only noteworthy because of his somewhat uncommon frankness. At the meeting, he said his absolute minimum interview/rank threshold was 230 (this was back around 2010). And I remember him saying the only reason he cared was because of the boards. They’d recently had a resident that everyone loved and thought was an excellent surgeon but just couldn’t seem to pass his boards after multiple attempts. It was a blight on the program.

Now, leave aside for a moment the possible issue with test validity if a dutiful clinician and excellent operator is being screened out over some multiple-choice questions. At the end of the day, programs need their residents to pass their boards. And it’s ideal if they pass their boards without special accommodations or other back-bending (like extra study time off-service) to help enable success. So while Step 1 cutoffs may be a way to quickly filter a large number of ERAS applications down to a smaller, more manageable number, they’re also a way to help programs in specialties with more challenging board exams ensure that candidates will eventually move on successfully to independent practice.

There is only one real reason a “good” Step score matters, and that is because specialty board certification exams are also broken.

One of the easiest ways a program can demonstrate high-quality and high board passage rates regardless of the underlying training quality is to select residents who can bring strong test-taking abilities to bear when it comes to another round of bullshitty multiple-choice exams.

A widely known secret is that board exams don’t exactly reflect real-life practice or real-life practical skills. Much of this type of board knowledge is learned by the trainees on their own, often through commercial prep products. A residency program in a field with a challenging board exam, like radiology, may be incentivized to pick students with high scores simply as a way to best ensure that their board pass rates will remain high. If Step 1 mania has taught us anything, it’s shown us that if you want high scores on a high-stakes exam, you pick people with high academic performance and then get out of their way.

What Are We Measuring?

When I see the work of other radiologists, I am rarely of the opinion that the quality of their work depends on their innate intelligence such as might be measured on a standardized exam. Ironically, most radiology exam questions ask about obvious findings. Almost none rely on actually making the finding or combating satisfaction of search (missing secondary or incidental findings when another finding is more obvious). And literally none test whether or not a radiologist can communicate findings in writing or verbally. When radiologists miss findings and get sued, the vast majority of cases are for “perceptual errors” and not “interpretive ones.” As in, when I miss things, it’s relatively rare that I misinterpreted the findings I made and more often that I just didn’t see something (often something that even I normally would see, because I’m human).

Obviously, it’s never a bad thing to be super smart or even hard-working. But the medical testing industrial complex has already selected sufficiently for intelligence. What it hasn’t selected for is being competent at practicing medicine.

While everyone would like to have a smarter doctor and train “smarter” residents, the key here is that board passage rates are another reflection of knowledge grounded predominantly in general test-taking ability and not clinical prowess. All tests are an indirect measure, for obvious reasons, but most include a wide variety of dubiously useful material largely designed to simply make exams challenging without necessarily distinguishing capable from dangerous candidates.

So when program directors complain about a pass/fail Step 1, they should also be talking with their medical boards. I don’t think we should worry about seeing less qualified doctors, but we should be proactive about ensuring trainee success in the face of exams of arbitrary difficulty.

