How we view large language models (like GPT-4)
In my academic life, my group published a few papers on training and applying large language models (LLMs) to medical notes (here and here). We deliberately chose an area where scale mattered (reviewing millions of notes) and where the risk of a false positive or negative was minimal.
But we founded Atman not to publish papers with the (naive) wish that someone else might find some use for them; we founded it to wholly own the process ourselves and to carefully evaluate any advance that might help our patients, within the context of a fully systematized approach. Accordingly, we think deeply about what’s available, including LLMs. As these manuscripts and far loftier applications in the news make clear, the strength of LLMs (and multi-layered neural networks in general) lies in carrying out complex tasks at tremendous scale, with high accuracy. As I’ve mentioned in previous posts, given enough training data they have the potential to outperform clinicians at a number of tasks. However, as anyone who has worked with these models knows, they can also make head-scratching errors that you wouldn’t expect of a first-year medical student.
Their primary weakness, described by Gary Marcus and others and obvious to anyone who has trained these models, is unreliability: an inability to cope with examples that deviate from the data on which they were trained. A second problem can severely compromise their use in clinical practice: there is little to no transparency about how they arrive at a decision (how are you supposed to express a billion-parameter non-linear model in words?). As a result, a physician who is expected to follow their recommendation (provided they are still in the loop) often cannot validate the decision independently or communicate anything meaningful to a patient. One might wish that patients would simply accept the recommendation, impressed by the model's past accuracy, and not ask for more detail. But this is neither realistic nor desirable: the risks of medical decisions are real, outcomes are uncertain, and patients are increasingly expected to share in decision-making, since they and their family members will ultimately bear the consequences.
So these models are plagued by unreliability, an unquantifiable chance of egregious error, and the inability of the humans involved (whose medical licenses are on the line) to verify and explain the results. Unsurprisingly, the FDA regulates any software involved in actual clinical decisions (beyond mere decision support) as a medical device, and passing the FDA certification process is arduous and exceedingly costly (we are in the midst of this). All of this has shaped and tempered the intended use of most models certified by the FDA: they typically serve as “an extra pair of eyes” that may flag something that would otherwise be missed, or suggest that a study be read sooner. Modest in ambition, but perhaps not that risky. Alternatively, industry applications have tried to steer clear of clinical decision-making altogether and instead focus on “drudgery relief,” such as drafting initial responses to emails and messages, work that is already performed today by non-licensed personnel.
Cognizant of this, our own pragmatic approach to AI has been to partition our applications into:
1. those where black box models (potentially reflecting inscrutable high-dimensional interactions) are appropriate.
2. those where interpretability is of far greater importance.
Our software makes extensive use of quantitative models throughout all aspects of decision-making: guiding which tests to order to enable a diagnosis for a new complaint, which medication to choose to treat a patient optimally, and how to disambiguate side effects. But our approach deliberately follows category 2. These models take as inputs fully interpretable parameterized data and produce transparent decisions that can be reviewed fully by the treating provider, with all underlying logic. This parameterization also lends itself to rapid iterative learning and intuitive hypothesis-driven experimental design. And, of course, documentation is fully automated.
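To make the contrast concrete, here is a minimal sketch of what a “category 2” model could look like in code. The patient fields, thresholds, and rule are hypothetical and purely illustrative (not our actual models); the point is that every recommendation carries its complete, human-readable rationale, which a provider can review and which can feed directly into documentation.

```python
# A hypothetical sketch of a "category 2" model: fully parameterized inputs,
# a transparent decision, and an audit trail the treating provider can review
# line by line. Names, thresholds, and rules below are illustrative only.
from dataclasses import dataclass, field


@dataclass
class PatientInputs:
    age: int
    creatinine_mg_dl: float
    on_ace_inhibitor: bool


@dataclass
class Decision:
    recommendation: str
    rationale: list = field(default_factory=list)  # every rule that fired, in order


def recommend_renal_monitoring(p: PatientInputs) -> Decision:
    """Transparent rule evaluation: each step appends a human-readable reason."""
    d = Decision(recommendation="no additional testing")
    if p.on_ace_inhibitor:
        d.rationale.append("Patient is on an ACE inhibitor (renal function relevant).")
        if p.creatinine_mg_dl >= 1.5:
            d.rationale.append(
                f"Creatinine {p.creatinine_mg_dl} mg/dL >= 1.5 mg/dL threshold."
            )
            d.recommendation = "repeat basic metabolic panel in 1-2 weeks"
        else:
            d.rationale.append("Creatinine below threshold; routine monitoring only.")
    else:
        d.rationale.append("No medication requiring renal monitoring.")
    return d


if __name__ == "__main__":
    decision = recommend_renal_monitoring(
        PatientInputs(age=67, creatinine_mg_dl=1.7, on_ace_inhibitor=True)
    )
    print(decision.recommendation)
    for line in decision.rationale:
        print(" -", line)  # the full logic, reviewable and documentable as-is
```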
However, there are plenty of opportunities on both sides of this decision-making process that will benefit from large language models (category 1). The information we actually get from the world is messy and unstructured: patient and provider conversations, test reports, diagnostic images. LLMs are perfect for converting these into useful structured inputs, and the process is fully verifiable, minimizing any risk. Likewise, downstream of decisions, written communication of recommendations can be facilitated by LLMs, much as they are used in other applications.
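As a rough illustration of the upstream use, the sketch below shows an LLM extracting a handful of structured fields from a clinic note, with each value accompanied by a verbatim quote from the source so that a reviewer (or a simple check) can verify it. The prompt, field names, and `call_llm` placeholder are assumptions made for illustration, not a description of our pipeline.

```python
# A hypothetical sketch of the "category 1" use: an LLM turns unstructured text
# into the structured, parameterized inputs that the transparent models consume.
# `call_llm` is a placeholder for whatever model endpoint is actually used;
# the prompt and field names are illustrative, not a production schema.
import json


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("wire this to your model of choice")


EXTRACTION_PROMPT = """Extract the following fields from the clinic note below
and return strict JSON: chief_complaint, symptom_duration_days, current_medications.
For each field, return an object with "value" and a "source_quote" copied verbatim
from the note so a reviewer can verify the extraction.

Note:
{note}
"""


def extract_structured_inputs(note: str) -> dict:
    # Expected shape: {"chief_complaint": {"value": ..., "source_quote": ...}, ...}
    raw = call_llm(EXTRACTION_PROMPT.format(note=note))
    data = json.loads(raw)
    # Verification step: every extracted value must be traceable to the note.
    for field_name, payload in data.items():
        quote = payload.get("source_quote", "")
        if quote and quote not in note:
            raise ValueError(f"Unverifiable extraction for {field_name!r}: {quote!r}")
    return data
```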
There is no doubt that these models represent a tremendous advance. We feel fortunate that the nature of our systematized care process allows us to benefit our patients by adopting AI’s strengths while avoiding most of its weaknesses.
Rahul D.