The Diagnostic Revolution
"It is a capital mistake to theorize before one has data." — Arthur Conan Doyle, A Scandal in Bohemia
The Differential
The MRI was on my screen for eleven minutes before I admitted I didn't know.
The patient was forty-three — a graphic designer referred by her primary care physician after six months of symptoms that refused to assemble into a pattern. Intermittent numbness in her right hand, lasting minutes to hours, never in the same distribution twice. Two episodes of monocular visual blurring, each lasting under a minute, each in the left eye. Fatigue she described as "hitting a wall at 2 PM" — which could mean a demyelinating disease or could mean she had a toddler and a full-time job, both of which were true.
Her neurological exam was normal. Completely, stubbornly normal. Which in neurology means very little, because the diseases we fear most are the ones that hide between episodes, presenting a clean exam on the one afternoon the patient happens to be in your office.
The MRI report said what MRI reports say when the radiologist is being honest: Several nonspecific T2 hyperintensities in the periventricular and subcortical white matter. Clinical correlation recommended.
Clinical correlation. The two-word phrase that translates, roughly, to: I can see something, but I cannot tell you what it means. That part is your problem.
The images came up on the workstation. The lesions were there — small, bright on FLAIR, a few in the periventricular white matter, one arguably juxtacortical. Did they have the ovoid, perpendicular-to-the-ventricle morphology that whispers multiple sclerosis? Or were they the punctate, nonspecific dots that mean nothing more than migraine, hypertension, or simply being over forty?
A neurologist who has read thousands of brain MRIs knows this moment. The clinic room, the patient watching for clues the physician isn't yet ready to give, the silent work of every diagnostician at the boundary of knowledge: pattern-matching against training, experience, memory of every similar case — and the match is ambiguous. The data fit multiple stories. The photographs — the exam, the imaging, the history — could be assembled into at least four different movies: early MS, migraine with aura, small-vessel ischemic changes, or the cruelest possibility of all — nothing, an incidental finding that would shadow this patient with anxiety for years.
I ordered more tests. Spinal fluid. Visual evoked potentials. Repeat imaging in three months. The right call — evidence-based, guideline-concordant, careful. But I knew what was really happening: I was buying time, spreading the uncertainty across future data points because the current ones weren't enough to resolve it.
This is what diagnosis looks like from inside. Not the clean arc from symptom to test to answer that textbooks describe, but a negotiation with ambiguity — an iterative narrowing of possibility that depends, at every step, on the physician's ability to hold multiple competing hypotheses in working memory while integrating information from sources that don't naturally talk to each other. The MRI doesn't know about the visual episodes. The visual episodes don't know about the family history of lupus she mentioned offhand while I was typing. The family history doesn't know about two prior normal MRIs at an outside facility that I haven't yet obtained because the fax machine at the other hospital is, as always, broken.
The human diagnostic mind is extraordinary. It is also, by the laws of biology, limited — not in intelligence, but in bandwidth. It can hold approximately four to seven items in working memory. The average complex diagnostic case generates dozens.
This chapter is about what happens when a fundamentally different kind of intelligence enters that negotiation — one that doesn't forget, doesn't fatigue, and can hold the entire movie in memory at once. And it is about a question this intelligence is now forcing us to confront: when the machine can see farther into the data than you can, what exactly is the physician's role?
The answer, I believe, is not what you expect.
Before the Revolution, the Wreckage
Before we meet the machines that are transforming diagnosis, we need to understand the one that failed — because the failure is the map.
In 2018, hospitals across South Korea, India, and Europe quietly stopped using IBM Watson for Oncology after discovering it recommended treatments that oncologists found unsafe. Watson had been trained primarily on data from Memorial Sloan Kettering — one hospital, one population, one treatment culture — and assumed its patterns were universal. They were not. The system marketed as the future of AI diagnosis became, instead, its most instructive cautionary tale: an opaque model, trained on narrow data, deployed without the transparency or equity safeguards that might have caught the mismatch before patients were affected. Everything that went wrong with Watson is a photographic negative of what the systems succeeding today get right. Keep that negative in mind. You will need it.
The Numbers That Changed Everything
By February 2026, the U.S. Food and Drug Administration had authorized more than a thousand AI-enabled medical devices — a count that was close to zero a decade ago and that has roughly doubled every two years since. Three-quarters of these devices operate in radiology, which is not accidental: medical imaging is where AI's capacity for high-dimensional pattern recognition, the alien perception we explored in Chapter 2, meets a domain that has always been defined by visual pattern recognition. The substrate is the same. The scale is what differs.
The clinical evidence behind these devices has crossed from debatable to established. In lung cancer screening, AI systems detect small, subtle nodules on chest CT with sensitivity that matches or exceeds experienced radiologists, particularly under the time pressure of high-volume reads. In breast cancer screening, AI-assisted interpretation reduces false positives — the callbacks that cause weeks of anxiety for patients who ultimately have no cancer — while maintaining or improving detection rates. In diabetic retinopathy, the FDA authorized an autonomous AI system to diagnose without a physician in the loop — the first time a machine was trusted to render a clinical diagnosis entirely on its own.
These are not projections from startup pitch decks. They are peer-reviewed results from systems reading real patients' images in real hospitals. They establish the floor of what diagnostic AI can do. But a floor is not a ceiling. And what happened next revealed a gap between human and machine performance that the field is still learning how to talk about.
The Augmentation Gap
In late 2025, Microsoft Research published a result that landed like a quiet earthquake across diagnostic medicine. The system, called MAI-DxO, was given the hardest diagnostic cases the New England Journal of Medicine publishes — complex, ambiguous, multi-system cases used to teach clinical reasoning at the highest level. These are not cases where the answer is obvious. They are the cases where experienced physicians disagree, where the differential spans a dozen possibilities, where the correct answer requires integrating physical exam, laboratory data, imaging, and clinical history into a single coherent narrative.
MAI-DxO scored 85.5 percent.
Physicians — experienced, board-certified physicians — scored approximately 20 percent on the same cases.
I need you to sit with that number for a moment. Not rush past it. Not rationalize it as a benchmark artifact. The machine was not four percent better. Not ten percent. It was more than four times as accurate as the humans it was designed to assist.
This result has a name that the field has not yet settled on, so I will give it one: the Augmentation Gap. It is the distance between what AI can perceive in diagnostic data and what a human physician can perceive in the same data, unaided — and on the hardest cases in medicine, that distance is large enough to make the word "augmentation" sound like a euphemism.
The Augmentation Principle, introduced in Chapter 1, holds that AI must amplify human capability, not replace human judgment. I believe this. I have seen it work — in the stroke detector that pages the specialist before the radiologist finishes her coffee, in the sepsis alert that catches the drift hours before the fever spikes. But the MAI-DxO result forces an uncomfortable question: when the gap is this large, what does "amplify" actually mean? If a system outperforms the physician by a factor of four on the hardest cases in medicine, is the physician augmenting the machine, or is the machine carrying the physician?
But the gap is not uniform, and this is where the story turns. When researchers tested large language models on clinical reasoning tasks that specifically required managing uncertainty — not diagnosing from complete information, but reasoning through ambiguity, weighing contradictory evidence, deciding what to do when the data does not resolve — the machines faltered. They became confident where they should have been cautious, decisive where the correct clinical answer was I don't know yet, and here is what I need to find out. The same models that could outdiagnose physicians four-to-one on cases with knowable answers stumbled on the cases where the answer was: there is no answer yet. In February 2026, a contamination-resistant benchmark called LiveClin (arXiv 2602.13864) tested the leading foundation models on 3,500 expert-curated clinical reasoning questions built from cases the models could not have memorized — and the best-performing system reached only 35.7% accuracy. The headline benchmark numbers do not transfer to fresh, open-ended clinical problems. They never did.
But the gap is not merely a benchmark phenomenon. It replicates in prospective care. In March 2026, Google's AMIE system became the first large language model deployed for real clinical encounters — conducting pre-visit history-taking with ninety-eight patients at Beth Israel Deaconess Medical Center, with zero safety interventions required (arXiv 2603.08448). The system's differential diagnosis included the final diagnosis in ninety percent of cases, a performance statistically indistinguishable from the primary care physicians it worked alongside (p = 0.6). But when the evaluation shifted from diagnosis to management — from what is wrong to what should we do about it — the physicians outperformed AMIE on plan practicality (p = 0.003) and cost-effectiveness (p = 0.004). The gap between pattern recognition and clinical action, which benchmarks had been mapping for years, now had a prospective address: a real hospital, real patients, real stakes. The machine matched the physician on the question. The physician outmatched the machine on the answer.
But the most disquieting data point is not about what AI can or cannot do. It is about what happens when humans use AI. A randomized study of 1,298 participants published in Nature Medicine in early 2026 gave people access to the same large language models that, tested alone, identified the correct medical condition in 94.9 percent of scenarios. The participants, equipped with these near-perfect tools, identified the relevant conditions in fewer than 34.5 percent of cases — no better than a control group working without AI at all. The gap did not narrow. It inverted. The humans made the machine worse. The study tested the general public, not clinicians — but the mechanism it exposed is not ignorance. It is architecture. People anchor on the machine's first suggestion. They skim rather than interrogate. They treat algorithmic confidence as permission to stop thinking rather than as raw material for thinking harder. The failure is not in the model. It is in the interface between model and mind — and that interface, in most medical AI systems, is designed as if showing a physician the right answer were the same as helping the physician reason about it.
This is the topology of the Augmentation Gap — and it has three dimensions, not two. The gap is widest where the problem is well-defined — where a finite set of data points maps to a diagnosis, where pattern recognition at scale wins. It narrows, sharply, where the problem is open-ended — where managing uncertainty matters more than resolving it, where the physician's value lies not in knowing but in navigating not-knowing. And it can invert — completely — when the interface between human and machine encourages deference rather than reasoning. A system that presents a confident recommendation with a green checkmark invites acceptance. A system that shows its work, names what it does not know, and refuses to offer certainty on insufficient data invites thought. The distance between those two designs is the distance between augmentation and its opposite: a tool that makes the physician worse by making thinking feel unnecessary.
Which brings the argument back to that clinic room, and the patient with the ambiguous MRI.
No model available in 2026 could have done what I did in that exam room: look a forty-three-year-old woman in the eye and say, "These findings are uncertain. Here is what they might mean. Here is what we should do next, and here is why. And someone will call you when the results come back, because you should not have to wait for a portal notification to learn something this important about your body." The diagnosis — if there is one — may eventually belong to the machine. But the uncertainty, and the relationship that sustains a patient through it, belong to the physician.
This is not sentimentality. It is the clinical reality that no benchmark captures. The MAI-DxO study measured diagnostic accuracy. It did not measure what happens in the room when the diagnosis is uncertain, when the patient is frightened, when the next step is not a test but a conversation. The physician who justifies her existence solely by the accuracy of her diagnoses is standing on ground that is actively eroding. The physician who justifies her existence by what she does with the diagnosis — and, crucially, what she does in the absence of one — is standing on the only ground the machine cannot reach.
At least, not yet.
The Stroke That Didn't Wait
Consider the domain where the Augmentation Principle may matter most: acute stroke care.
In vascular neurology, the bottleneck has never been treatment — effective therapies for large vessel occlusions exist, including mechanical thrombectomy, a procedure that physically retrieves the clot from the brain's blood vessels. The bottleneck has been time. Every minute a large vessel occlusion goes untreated, approximately 1.9 million neurons die. The clock starts when the clot forms, and it does not pause for shift changes, radiologist availability, or the seventeen steps between a CT scan being acquired and a neurointerventionalist being paged.
This is where Viz.ai entered the picture. The system analyzes CT angiography images in real time, detects suspected large vessel occlusions, and sends an alert directly to the stroke specialist's phone — bypassing the traditional chain of radiologist reads the scan, calls the ER physician, ER physician calls the neurologist, neurologist reviews the images, neurologist calls the interventionalist. That chain, on a good night, takes twenty to thirty minutes. On a bad night — when the radiologist is reading a backlog, when the ER is overwhelmed — it takes longer. And in stroke, longer means brain tissue that will never recover.
Viz.ai collapses that chain. The algorithm reads the scan within minutes of acquisition and pings the specialist directly. The largest evaluation of AI-powered stroke imaging to date — 452,000 patients across all 107 NHS hospitals in England — measured the result: thrombectomy rates doubled from 2.3% to 4.6%, door-in-door-out times fell by 64 minutes, and approximately 15,000 additional patients received treatment (Lancet Digital Health, 2026). In stroke care, those minutes are not an efficiency metric. They are neurons. They are the difference between a patient who walks out of the hospital and a patient who spends the rest of their life in a wheelchair.
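A rough back-of-envelope multiplication, using the chapter's own figures and nothing more, makes the scale concrete. Treat the 64-minute reduction as treatment delay avoided, and apply the estimate that a large vessel occlusion costs about 1.9 million neurons per untreated minute (this is an illustration of magnitude, not a measured clinical outcome):

    64 minutes × ~1.9 million neurons per minute ≈ 120 million neurons per patient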
Ask any neurointerventionalist who has stood in an angiography suite at 3 AM, threading a catheter into a patient's brain, knowing that the minutes lost before arrival are minutes that no amount of skill can recover. The Augmentation Principle is not a slogan in that room. It is the knowledge that a machine caught what the workflow would have delayed — and that the patient on the table has a better chance because of it.
The photograph view of stroke care sees a single CT scan, interpreted by a single radiologist, at a single point in time. The movie view integrates the imaging, the clinical presentation, the time of onset, the vessel anatomy, and the treatment window into a continuous narrative that moves at the speed the disease demands. AI did not replace anyone in this story. The radiologist still reads the scan. The neurologist still makes the clinical decision. The interventionalist still performs the procedure. What AI replaced was the gap — the dead space between data acquisition and human action, where brain tissue was dying while information waited in a queue.
A Chest Pain at 4 AM
Let me construct a scenario — not a real patient, but a composite drawn from thousands of real encounters — to show you what the photograph-to-movie shift looks like in practice.
A fifty-four-year-old man arrives in the emergency department at 4 AM with chest pain. He is diaphoretic — sweating — and clutching his sternum. His troponin level, a marker of cardiac injury, comes back mildly elevated. His ECG shows nonspecific ST changes. His blood pressure is 152/94. He has a family history of coronary artery disease. He smoked for twenty years and quit five years ago.
The photograph view: The emergency physician sees this moment. Elevated troponin. Abnormal but ambiguous ECG. Risk factors. The clinical decision is binary: admit and observe, or escalate to catheterization. The physician uses a risk score — HEART, TIMI, or one of the other validated tools — plugs in the variables, and gets a number. The number informs the decision, but the number is static. It captures this single point in time, this one blood draw, this one ECG tracing.
The movie view: An AI system integrating continuous data sees something different. It has access to this patient's electronic health record — not just tonight's troponin, but the trend of his troponin over three serial draws, each ninety minutes apart. It sees that the trajectory is rising in a curve characteristic of acute myocardial injury, not the flat or declining pattern that suggests a more benign cause. It integrates his continuous cardiac monitoring, detecting a subtle heart rate variability pattern associated with autonomic instability. It cross-references his genomic data, noting a variant in the LPA gene associated with elevated lipoprotein(a) and accelerated atherosclerosis. It pulls his imaging history and notes progressive coronary calcium scores over the past six years that, individually, were each "within normal limits" for their respective age ranges but, viewed as a trajectory, form an unmistakable upward curve.
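For readers who want the contrast in something more concrete than metaphor, here is a minimal sketch in code. It is a toy, not a clinical tool: the scoring rules, troponin thresholds, and units are invented for illustration, they are not the validated HEART or TIMI instruments, and a real trajectory model would integrate far more than serial troponins. What the sketch does capture is the structural difference: the photograph scores one frame; the movie reads the curve.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """The 'photograph': everything the score will ever see is in this one frame."""
    age: int
    troponin_ng_l: float      # a single troponin draw (illustrative units)
    nonspecific_ecg: bool     # ST changes that do not clearly localize
    risk_factors: int         # count of risk factors (family history, smoking history, ...)


def photograph_score(s: Snapshot) -> int:
    """A deliberately simplified static risk score. Illustrative only:
    NOT the validated HEART or TIMI instrument."""
    score = 0
    score += 2 if s.age >= 65 else (1 if s.age >= 45 else 0)
    score += 2 if s.troponin_ng_l > 50 else (1 if s.troponin_ng_l > 14 else 0)
    score += 1 if s.nonspecific_ecg else 0
    score += min(s.risk_factors, 2)
    return score  # 0-7 on an arbitrary scale; higher means riskier


def troponin_trajectory(draws: List[float]) -> str:
    """The 'movie': serial troponins, roughly 90 minutes apart, read as a curve.
    The shape of the change carries information no single value contains."""
    if len(draws) < 2:
        return "insufficient data"
    delta = draws[-1] - draws[0]
    if delta > 5 and delta > 0.2 * draws[0]:   # invented thresholds
        return "rising: pattern consistent with acute myocardial injury"
    return "flat or falling: argues against acute injury"


if __name__ == "__main__":
    now = Snapshot(age=54, troponin_ng_l=22.0, nonspecific_ecg=True, risk_factors=2)
    print("photograph score:", photograph_score(now))              # one number, one moment
    print("movie read:", troponin_trajectory([16.0, 22.0, 31.0]))  # the trend tells the story
```

The static score cannot distinguish a troponin that has sat at 22 for years from one that was 16 ninety minutes ago. The trajectory function exists for exactly that distinction.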
No single data point in that movie is invisible to a human physician. Any cardiologist could, given time, review every chart, trace every trend, cross-reference every genomic marker. But "given time" is the operative phrase. At 4 AM, in a busy emergency department, with six other patients waiting, the time does not exist. The photograph is what fits in the available cognitive bandwidth. The movie requires a computational collaborator.
This is the Augmentation Principle made clinical. The AI does not decide whether this patient goes to the catheterization lab. The physician decides. But the physician decides with a film playing behind their eyes instead of a snapshot — and the decision is richer, faster, and more informed because of it. A January 2026 randomized trial of real-time AI-ECG alerts across 14,989 emergency department patients (PMID 41507124, Nature Communications) demonstrated the design lesson embedded in that distinction: broad AI alerts did not significantly improve overall treatment rates, but among the high-risk hyperkalemia patients the system flagged, treatment rose sharply — 69.1% versus 41.6%. The movie is only useful when it zooms into the right frame and hands the clinician a clear reason to act. Flood the screen with undifferentiated signal, and the physician tunes it out.
The Rules Are Being Written
In January 2026, the FDA issued updated guidance on clinical decision support software — the latest iteration of a regulatory framework that has been evolving, haltingly, since the agency first grappled with the question of when software crosses the line from information tool to medical device.
The guidance matters because it attempts to draw a boundary that is inherently blurry: when does an AI system inform a physician's decision, and when does it make one? A system that displays a patient's lab results in a chart is clearly an information tool. A system that autonomously diagnoses diabetic retinopathy and recommends treatment is clearly a medical device. But what about the vast middle ground — the system that highlights a region on a CT scan as suspicious, the system that ranks differential diagnoses by probability, the system that flags a patient's vital sign trajectory as concerning?
This middle ground is where most diagnostic AI lives, and the regulatory framework is still catching up. The 2026 guidance acknowledges what clinicians have known for years: automation bias is real. When a machine suggests an answer, physicians are more likely to agree with it — even when the suggestion is wrong. A 2025 study in Communications Medicine (PMID 40038550) documented this effect directly, showing that AI recommendations measurably modified physician clinical decisions, sometimes overriding the physician's independent assessment. The machine did not force the physician's hand. It nudged it. And in medicine, a nudge can change an outcome.
The regulatory challenge, then, is not just about whether AI systems are accurate. It is about how they interact with human cognition — whether they genuinely augment decision-making or subtly colonize it. This is the Transparency Principle operating at the systems level. An AI that is accurate but opaque, that gives the right answer but cannot explain its reasoning, creates a dependency that looks like augmentation but functions like replacement. The physician follows the machine not because they understand its logic, but because they trust its track record. And trust without understanding is not collaboration. It is abdication.
The best diagnostic AI systems now emerging — the ones that will define the next decade of clinical practice — are the ones that show their work. Systems that highlight which features of an image drove the diagnosis. Systems that present not just a prediction but a confidence interval and the factors that widen or narrow it. Systems that are, in effect, transparent projectors — machines whose movies come with subtitles.
The Alien in the Emergency Room
In Chapter 2, I described AI as an alien intelligence — not conscious, not scheming, but perceptually alien, operating in dimensional spaces that human cognition cannot access. The diagnostic revolution is where that alien perception produces its most tangible returns.
Consider sepsis — the body's catastrophic overreaction to infection, which kills more than a quarter of a million people annually in the United States alone. Sepsis is notoriously difficult to predict because its early signs are subtle, nonspecific, and buried in the noise of routine vital signs. A heart rate that drifts upward by eight beats per minute. A respiratory rate that increases slightly. A blood pressure that softens by a few points. Each of these changes, viewed in isolation — as photographs — looks like nothing. A patient who is anxious. A patient who walked to the bathroom. Normal variation.
But viewed as a movie — as a trajectory playing out across hours in a high-dimensional space that includes vital signs, lab values, medication timing, fluid balance, and the patterns of ten thousand prior patients who developed sepsis — those subtle drifts form a signature. AI systems designed for early sepsis detection can identify this signature hours before a clinician would notice, triggering an alert that moves the clinical team from reactive to proactive. Not responding to a crisis, but preventing one. A February 2026 ward-level randomized trial of 10,422 patients (PMID 41644641, Nature Medicine) proved the distinction between prediction and action: AI deterioration alerts improved the speed of rapid response review — from 12.6% to 20.8% within thirty minutes — without increasing false-positive burden, but the primary endpoint of deterioration duration did not budge. The alert saw the movie. The clinical workflow had not yet been redesigned to act on what the alert saw. This is the lesson the field keeps learning and keeps forgetting: the algorithm is never the bottleneck. The workflow is.
Or consider rare diseases — the roughly seven thousand conditions that individually affect small numbers of patients but collectively affect hundreds of millions worldwide. The average rare disease patient waits years for an accurate diagnosis, cycling through specialists, accumulating incorrect diagnoses, enduring treatments for conditions they do not have. The bottleneck is not indifference. It is combinatorics. No physician can hold seven thousand rare diseases in active memory and pattern-match against each one during a twenty-minute encounter.
AI can. Phenotypic analysis systems that integrate facial morphology, clinical features, laboratory patterns, and genomic data can narrow the diagnostic search space from thousands of possibilities to a manageable handful — not replacing the geneticist's judgment, but focusing it. Giving the human expert a short list instead of an encyclopedia. This is the alien intelligence at its most benevolent: perceiving patterns across dimensional spaces that no human could traverse, then translating those patterns into a form that human clinicians can evaluate, challenge, and act upon.
The Movie That Must Be Fair
There is a scene missing from the revolution so far, and it is the most important one.
Every AI system I have described — the stroke detector, the sepsis predictor, the cancer screener, the rare disease identifier — learned its patterns from data. And data is not neutral. Data is a fossil record of human decisions, and human decisions carry the sediment of every bias, every structural inequity, every historical injustice that shaped the system producing them.
If a training dataset overrepresents patients from academic medical centers — who tend to be whiter, wealthier, and better-insured than the general population — the model learns the patterns of that population. When deployed in a community hospital serving a predominantly Black or Latino neighborhood, its performance may degrade. Not because the algorithm is racist in any intentional sense, but because it was never shown the full movie. It was shown a movie cast entirely from one demographic and asked to generalize to a world that looks nothing like the set.
This is the Equity Principle, and it is not a footnote. It is a load-bearing wall. A diagnostic AI that works brilliantly for the patients who already have the best access to care and fails for the patients who need it most is not a revolution. It is a replication — a high-tech reproduction of the same disparities that medicine has been failing to address for generations.
The corrective is not to slow down. It is to be deliberate. To demand diverse training data — not as an afterthought, but as a prerequisite for deployment. To validate AI systems not just on aggregate accuracy but on equity of accuracy — performance stratified by race, ethnicity, gender, socioeconomic status, and geographic context. To treat the question "Does this system work equally well for everyone?" not as a research question but as a deployment criterion.
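What does treating that question as a deployment criterion look like in practice? A minimal version can be expressed in a few lines of code, which is part of the point: the hard part is institutional will, not mathematics. The sketch below is an illustration under stated assumptions. The record fields, the sensitivity-only metric, and the five-percentage-point gap threshold are choices made here for clarity, not a published standard, and a real audit would stratify across many more attributes and metrics.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def sensitivity(pairs: List[Tuple[bool, bool]]) -> float:
    """pairs of (model_flagged, truly_positive); fraction of true cases the model caught."""
    positives = [flagged for flagged, truth in pairs if truth]
    return sum(positives) / len(positives) if positives else float("nan")


def stratified_sensitivity(records: List[dict], group_key: str) -> Dict[str, float]:
    """Group validation records by a demographic or site attribute and score each group."""
    by_group: Dict[str, List[Tuple[bool, bool]]] = defaultdict(list)
    for r in records:
        by_group[r[group_key]].append((r["model_flagged"], r["has_disease"]))
    return {group: sensitivity(pairs) for group, pairs in by_group.items()}


def passes_equity_gate(per_group: Dict[str, float], max_gap: float = 0.05) -> bool:
    """Deployment criterion: best and worst subgroup may differ by at most
    max_gap (here 5 percentage points, an illustrative choice)."""
    values = [v for v in per_group.values() if v == v]  # drop NaN groups
    return bool(values) and (max(values) - min(values)) <= max_gap


if __name__ == "__main__":
    validation_set = [
        {"site": "academic_center",    "model_flagged": True,  "has_disease": True},
        {"site": "academic_center",    "model_flagged": True,  "has_disease": True},
        {"site": "community_hospital", "model_flagged": False, "has_disease": True},
        {"site": "community_hospital", "model_flagged": True,  "has_disease": True},
    ]
    per_site = stratified_sensitivity(validation_set, "site")
    print(per_site)                      # {'academic_center': 1.0, 'community_hospital': 0.5}
    print(passes_equity_gate(per_site))  # False: the aggregate number would have hidden this
```

In that toy example the aggregate sensitivity is 75 percent, which looks respectable. The stratified view shows the community hospital's patients getting half the detection rate of the academic center's, which is precisely the disparity the aggregate conceals.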
The diagnostic revolution deserves to reach every patient, in every hospital, in every community. If the movie AI creates is only available in high-definition for some populations and in static-filled low resolution for others, we will have built a technology that amplifies the very inequities it had the potential to erase.
The First Movie, Not the Last
The projector is running. After two chapters of building the machine and explaining its optics, the first movie is playing — and it is a diagnostic one. AI systems are reading scans faster than the workflow can deliver them, detecting cancers that human eyes would miss, predicting deterioration hours before the clinical signs emerge, and collapsing the deadly gaps between data and action.
But this first movie has revealed something its creators did not fully anticipate. The Augmentation Gap is real. On the hardest cases in medicine, the machine does not merely assist the physician — it outperforms her, by a margin wide enough to restructure the relationship between human and machine that this entire book is built on. The chapters ahead will take us into territories where this question becomes sharper: into the operating room, where machines and surgeons share the same patient; into drug discovery, where AI is redesigning the molecular search space itself; into mental health, where the patterns AI must read are not in images or lab values but in the cadence of a voice, the rhythm of sleep, the words a patient chooses and the ones they avoid. Each frontier will test the three principles — Augmentation, Transparency, Equity — in new and more demanding ways.
But I want to leave you with the image I carry from my clinic. The patient with the ambiguous MRI. The four possible movies. The machine that could have ranked those movies in seconds and been right more often than I was. And the part of the encounter no machine could have performed: the moment I looked across the desk and said, I don't have an answer yet. But I'm going to stay with this until I do.
That promise — not the diagnosis, but the promise — is the thing the machine cannot make. It is also the thing the patient needed most.
The diagnostic revolution is here. The question is no longer whether AI will transform how we find what is wrong with the human body. The question is whether we — the physicians standing inside the gap — will have the honesty to say what the machines can do better, the humility to let them do it, and the clarity to know where the human story begins that no algorithm can tell.
Next: Chapter 4 — When the Machine Kills: The Anatomy of AI Failure in Medicine