On August 15, the ABR released the 2019 Core Exam results, which included the highest failure rate since the exam’s inception in 2013: 15.9%.
(Side note: due to a “computer error,” the ABR decided to release the aggregate results before sharing individual results with trainees, resulting in entirely unnecessary extra anxiety. This itchy trigger finger release is in stark contrast to the Certifying Exam pass rates, which have never been released.)
|Year||Percent Passed||Percent Failed||Percent Conditioned||Total Examinees|
So what happened?
One potential explanation is that current residents are less intelligent, less hard-working, or less prepared for the exam despite similar baseline board scores in medical school, similar training at their residency programs, and now very mature and continually improving board preparation materials. This would seem unlikely.
If it really does simply chalk up to resident “caliber” as reflected in minor variations in Step scores, then I would volunteer that we should be concerned that a minimally related test could be so predictive (i.e., so what are we testing here? Radiology knowledge as gained over years of training or just MCQ ability?).
Another explanation is that—despite the magical Angoff method used to determine the difficulty/fairness of questions—the ABR simply isn’t very good at figuring out how hard their test is, and we should expect to see large swings in success rates year to year because different exams are simply easier or harder than others. This is feasible but does not speak well to the ABR’s ability to fairly and accurately test residents (i.e., their primary stated purpose). In terms of psychometrics, this would make the Core exam “unreliable.”
The ABR would certainly argue that the exam is criterion-based and that a swing of 10% is within the norms of expected performance. The simple way to address this would be to have the ABR’s psychometric data evaluated by an independent third-party such as the ACR. Transparency is the best disinfectant.
The third and most entertaining explanation is that current residents are essentially being sacrificed in petty opposition to Prometheus Lionheart. The test got too easy a couple years back and there needed to be a course correction.
The Core Prep Arms Race
With the widespread availability of continually evolving high-yield board prep material, the ABR may feel the need to update the exam in unpredictable ways year to year in order to stay ahead of “the man.”
(I’ve even heard secondhand stories about persons affiliated with the ABR in some capacity making intimations to that effect including admitting to feeling threatened by Lionheart’s materials/snarky approach and expressing a desire to “get him.” I wouldn’t reprint such things because they seem like really stupid things for someone to admit within public earshot, and I certainly cannot vouch for their veracity.)
If you’re happy with how your exam works, and then third parties create study materials that you feel devalue the exam, then your only option is to change (at least parts of) the exam. This may necessitate more unusual questions that do not make appearances in any of the several popular books or question banks. This is also not a good long-term plan.
This scenario was not just predictable but was the inevitable outcome of creating the Core exam to replace the oral boards. If the ABR thought people “cheating” on the oral boards by using recalls was bad, replacing that live performance with an MCQ test–the single most recallable and reproducible exam format ever created–was a true fool’s errand.
A useless high-stakes MCQ test based on a large and unspecified fraction of bullshit results in residents optimizing their learning for exam preparation. I see first-year residents using Crack the Core as a primary text, annotating it like a medical student annotates First Aid for the USMLE Step 1. Look no further than undergraduate medical education to see what happens when you make a challenging test that is critically important and cannot be safely passed without a large amount of dedicated studying: you devalue the actual thing you ostensibly want to promote.
In medical school, that means swathes of students ignoring their actual curricula in favor of self-directed board prep throughout the basic sciences and third-year students who would rather study for shelf exams than see patients. The ABR has said in the past that the Core Exam should require no dedicated studying outside of daily service learning. That is blatantly untrue, and an increasing failure rate only confirms how nonsensical that statement was and continues to be. Instead, the ABR is going to drive more residents into a board prep attitude that will detract from their actual learning. Time is finite; something always has to give.
If I were running a program that had recurrent Core Exam failures, I wouldn’t focus on improving teaching and service-learning. Because on a system-level, those things are not only hard to do well but probably wouldn’t even help. The smart move would be to give struggling residents more time to study. And that is bad for radiology and bad for patients.
The underlying impression is that the ABR’s efforts to make the test feel fresh every year have forced them to abandon some of the classic Aunt Minnie’s and reasonable questions in favor of an increasing number of bullshit questions in either content or form in order to drive the increasing failure rates. Even if this is not actually true, those are the optics, and that’s what folks in the community are saying. It’s the ABR’s job to convince people otherwise, but they’ve shown little interest in doing so in the past.
There is no evidence that the examination has gotten more relevant to clinical practice or better at predicting clinical performance, because there has never been any data nor will there ever be any data regarding the validity of the exam to do that.
The Impossibility of True Exam Validity
The ABR may employ a person with the official title of “Psychometric Director” with an annual base salary of $132,151, but it’s crucial to realize the difference between psychometrics in terms of making a test reliable and reproducible (such that the same person will achieve a similar score on different days) and that score being meaningful or valid in demonstrating what it is you designed the test to do. The latter would be if passing the Core Exam meant that you were actually safe to practice diagnostic radiology and failing it meant you were unsafe. That isn’t going to happen. It is unlikely to happen with any multiple-choice test because real life is not a closed book multiple-choice exam, but it’s compounded by the fact that the content choices just aren’t that great (no offense to the unpaid volunteers that do the actual work here). Case in point: there is completely separate dedicated Cardiac imaging section, giving it the same weight as all of MSK or neuroradiology. Give me a break.
The irony here is that one common way to demonstrate supposed validity is to norm results with a comparison group. In this case, to determine question fairness and passing thresholds, you wouldn’t just convene a panel of subject matter experts (self-selected mostly-academic rads) and then ask them to estimate the fraction of minimally competent radiologists you’d expect to get the question right (the Angoff method). You’d norm the test against a cohort of practicing general radiologists.
Unfortunately, this wouldn’t work, because the test includes too much material that a general radiologist would never use. Radiologists in practice would probably be more likely to fail than residents. That’s why MOC is so much easier than initial certification. Unlike the Core exam, the statement that no studying is required for MOC is actually true. Now, why isn’t the Core Exam more like MOC? That’s a question only the ABR can answer.
I occasionally hear the counter-argument that the failure rate should go up because some radiologists are terrible at their jobs. I wouldn’t necessarily argue that last part, with the caveat that we are all human and there are weak practitioners of all ages. But this sort of callous offhand criticism only makes sense if an increasing failure rate means that the people who pass the exam are better radiologists, the people who fail the exam are worse radiologists, and those who initially fail and then pass demonstrate a measurable increase in their ability to independently practice radiology. It is likely that none of the three statements are true.
Without getting too far into the weeds discussing types of validity (e.g., content, construct, and criterion), a valid Core Exam should have content that aligns closely with the content of practicing radiology, should actually measure radiology practice ability and not just radiology “knowledge,” and should be predictive of job performance. 0 for 3, it would seem.
So, this exam is lame and apparently getting lamer with no hope in sight. And let’s not get started on shameless exercise in redundant futility that is the Certifying Exam. So where did everything go wrong? Right from the start.
That’s the end of the rant. But let’s end with some thoughts for the future.
What the Core Exam SHOULD Be
To the ABR, feel free to use this obvious solution. It will be relatively expensive to produce, but luckily, you have the funds.
Diagnostic radiology is a specialty of image interpretation. While some content would be reasonable to continue in a single-best-answer multiple-choice format, the bulk of the test should be composed of simulated day-to-day practice. Unlike most medical fields, where it would be impossible to objectively see a resident perform in a standardized assortment of medical situations, the same portability of radiology that makes AIs so easy to train and cases so easy to share would be equally easy to use for resident testing.
Oral boards aren’t coming back. The testing software should be a PACS.
Questions would be cases, and the answers would be impressions. Instead of having a selection of radio buttons to click on, there would be free text boxes that would narrow down to a list of diagnoses as you type (like when you try to order a lab or enter a diagnosis in the EMR; this component would be important to make grading automated.)
The exam could be anchored in everyday practice. One should present cases centered on the common and/or high-stakes pathology that we expect every radiologist to safely and consistently diagnose. We could even have differential questions by having the examinee enter two or three diagnoses for the cases where such things are important considerations (e.g., some cases of diverticulitis vs colon cancer). These real-life PACS-based cases could be tied into second-order questions about management, communication, image quality, and even radiation dose. But it should all center around how radiologists actually view real studies. It could all be a true real-world simulation that is a direct assessment of relevant practice ability and not a proxy for other potentially related measurables. Let’s just have the examinees practice radiology and see how they do.
The ABR has argued in the past that the Core exam cannot be ported to a commercial center, which is largely the fault of the ABR for producing a terrible test. But at least that argument would finally hold water if the ABR actually deployed a truly unique evaluative experience that could actually demonstrate a trainee’s ability. The current paradigm is silly and outdated, and radiology is uniquely positioned within all of medicine to do better. The exam of the future should not be rooted in the largely failed techniques of the past.