I want to tell you something that took me nearly two years and almost fourteen thousand AI responses to understand.

Everyone is afraid of the wrong thing.

When people talk about AI in healthcare, the conversation always circles back to hallucination. Models making things up. Inventing drugs that don't exist, citing papers that were never written, fabricating medical guidelines from thin air. And yes, that's terrifying. If an AI tells a doctor to prescribe a nonexistent medication, someone could get hurt.

But here's what we found when we actually ran the numbers at scale: hallucination barely happens.

The Data

What 13,728 AI Responses Actually Tell Us

We sent real professional queries to four frontier AI models: Claude, GPT-4o, Gemini Pro, and DeepSeek. Medical questions, legal questions, technical questions. We scored every single response under a strict four-way taxonomy: correct, incomplete, hallucination, or abstention.
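
If it helps to picture the scoring schema, here's a minimal sketch of that four-way taxonomy as a data structure. The names and fields are illustrative, not our internal schema.

```python
from enum import Enum
from dataclasses import dataclass

class Verdict(Enum):
    """Four-way scoring taxonomy: one label per response."""
    CORRECT = "correct"              # factually right, sufficiently complete
    INCOMPLETE = "incomplete"        # factually right, missing critical detail
    HALLUCINATION = "hallucination"  # contains fabricated content
    ABSTENTION = "abstention"        # the model declined to answer

@dataclass
class ScoredResponse:
    model: str      # e.g. "claude", "gpt-4o", "gemini-pro", "deepseek"
    query_id: str
    turn: int       # 1 = factual ... 4 = speculative (explained below)
    verdict: Verdict
```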

The pattern that emerged genuinely surprised me.

90.1% Correct · 5.7% Incomplete · 0.8% Hallucination

See that middle number? That 5.7% is the one keeping me up at night. Not the 0.8%.

An incomplete response is factually correct but missing something critical. The AI says "administer IV fluids" for sepsis. That's true. But it doesn't mention the 30 mL/kg bolus volume or the three-hour window. A junior doctor following that answer might delay appropriate intervention. Not because the AI lied. Because it didn't say enough.
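
This failure mode is checkable, though. The crude version of completeness verification is a checklist comparison. Here's a toy sketch built on the sepsis example above; a real verifier would use semantic matching rather than substring search, and the checklist itself is illustrative:

```python
# Toy completeness check: does a response mention every element the
# guideline requires? Real verifiers would use semantic matching,
# not substring search; this checklist is illustrative only.
SEPSIS_FLUID_CHECKLIST = ["IV fluids", "30 mL/kg", "3-hour window"]

def missing_elements(response: str, checklist: list[str]) -> list[str]:
    return [item for item in checklist if item.lower() not in response.lower()]

answer = "For suspected sepsis, administer IV fluids promptly."
print(missing_elements(answer, SEPSIS_FLUID_CHECKLIST))
# -> ['30 mL/kg', '3-hour window']  (true, but incomplete)
```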

The Gap Nobody Is Measuring

Traditional hallucination detectors would flag these answers as "safe." Every word is true. But in professional practice, truth is not enough. You need the right amount of truth.

Incomplete responses are seven times more common than hallucinations. And not a single safety framework on the market catches them.

The Pattern

Two Failure Modes, Not One

Here's the finding that changed how I think about this entire problem.

We structured our queries in four turns of increasing difficulty: factual questions, inferential follow-ups, edge cases, and speculative questions that pushed beyond established knowledge. When we mapped the failure types against these turns, a clean separation appeared.

Turns 1–3: Zero Hallucinations. Only Incompleteness.
Across 10,296 factual, inferential, and edge-case responses, the models never fabricated anything. But they gave incomplete answers 4.9% to 9.4% of the time. The danger when asking questions within established knowledge isn't lies. It's omissions.
Turn 4: Hallucination Appears. Incompleteness Vanishes.
All 112 hallucinations in the dataset occurred on speculative queries pushing beyond the models' training. And here, zero incomplete responses appeared. When models are pushed beyond what they know, they either fabricate or abstain. They don't hedge.
The Implication: Different Risks Need Different Safety Mechanisms.
For routine clinical queries, you need completeness verification and detail enrichment. For speculative queries, you need consensus-gated human review. One safety tool cannot cover both failure modes.
The Solution

A Signal You Can Actually Govern With

So if hallucination detectors aren't enough, what do we do?

The answer came from a simple observation. I'd been watching thousands of students interact with AI during robotics competitions through our Techlympics program across Malaysia. The students who got the best results weren't the ones who trusted the AI's first answer. They were the ones who asked multiple AIs the same question and compared the responses.

When different AI models agree on an answer, that answer is almost always safe. When they disagree, something is off.

We formalised this into what we call a consensus field (σ). You send the same clinical question to four different frontier models. You measure the semantic similarity between their responses. That agreement score ranges from 0 (total disagreement) to 1 (perfect consensus).
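
In code, the core of σ is only a few lines. Here's a minimal sketch using off-the-shelf sentence embeddings and mean pairwise cosine similarity. The embedding model and the aggregation are assumptions for illustration; the paper defines the exact similarity measure.

```python
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works for the sketch; this choice is
# an assumption, not the encoder used in our study.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def consensus_sigma(responses: list[str]) -> float:
    """Mean pairwise cosine similarity across model responses."""
    vecs = encoder.encode(responses, normalize_embeddings=True)
    sims = [float(np.dot(a, b)) for a, b in combinations(vecs, 2)]
    # Clamp to [0, 1]: cosine can dip slightly below zero for text.
    return float(np.clip(np.mean(sims), 0.0, 1.0))

# Usage: collect one answer per model, then score agreement.
# answers = [ask(m, query) for m in ("claude", "gpt-4o", "gemini-pro", "deepseek")]
# sigma = consensus_sigma(answers)   # ask() is a hypothetical API wrapper
```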

Consensus Level | Unsafe Rate | What It Means
σ = 0 (No agreement) | 17.5% | One in six responses is unsafe. High risk.
σ ≈ 0.50 (Partial) | 5.3% | Risk drops substantially but remains meaningful.
σ ≈ 0.83 (Strong) | 4.3% | Approaching safe-for-automation threshold.
σ = 1.0 (Full consensus) | 0.74% | Below 1%. Near the shared blindness floor.

That's a 24-fold improvement from worst to best. And it costs nothing beyond running the query through multiple models. No retraining. No access to model internals. Just standard API calls that any hospital IT department could implement tomorrow.

We compared our consensus signal against the established detection methods: SelfCheck, majority voting, and model self-reported confidence. On the medical baseline, σ achieved an AUC of 0.816. Every other method performed at or below chance. Self-reported confidence was literally uninformative: all four models reported maximum confidence on 100% of responses, including the ones that were wrong.
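
If you want to reproduce the shape of that comparison, scoring σ as a detector reduces to a standard AUC computation over per-query labels. A sketch with placeholder numbers, not our study's data:

```python
from sklearn.metrics import roc_auc_score

# Placeholder data, NOT the study's values: one σ score and one
# unsafe/safe label (1 = unsafe) per query.
sigma_scores = [0.95, 0.45, 0.40, 0.10, 0.83, 0.55]
unsafe_labels = [0, 0, 1, 1, 0, 1]

# σ is a *safety* signal, so negate it to score it as an
# unsafe-detector: higher detector score should mean more risk.
auc = roc_auc_score(unsafe_labels, [-s for s in sigma_scores])
print(f"AUC = {auc:.3f}")
```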

The Honest Limitation

The Shared Blindness Floor

Now, I could stop there and make this sound like a solved problem. But I'd be doing the same thing I'm criticising others for. Telling you the truth, but not enough of it.

Even at perfect consensus, when every model agrees, the unsafe rate doesn't reach zero. It hits 0.74%. We call this the shared blindness floor. It exists because all four models were trained on overlapping data. They share the same blind spots. They are confidently wrong about the same things.

We quantified this precisely: the 95% confidence interval for the floor is [0.50%, 1.10%]. That's not a vague caveat. That's a measured boundary. And it means no consensus-based method, ours or anyone else's, can guarantee zero risk.
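
The interval itself is standard binomial statistics. Here's a self-contained 95% Wilson score interval; the counts below are placeholders chosen for illustration, not our study's actual tallies.

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Placeholder counts, not the study's: unsafe responses among
# full-consensus (sigma = 1.0) queries.
lo, hi = wilson_interval(k=23, n=3100)
print(f"floor ≈ {23/3100:.2%}, 95% CI [{lo:.2%}, {hi:.2%}]")
```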

What The Floor Tells Us

Consensus is necessary but not sufficient for safety. It can take you from 17.5% unsafe to under 1%. But that last fraction of a percent requires something else entirely. It requires a human clinician in the loop.

I think this honesty is what makes the work credible. We're not selling a silver bullet. We're offering a governance tool that knows its own limits.

The Governance Framework

Three Tiers That Regulators Can Actually Audit

Here's where this becomes immediately actionable. The consensus field gives us a natural way to create tiers of automation, not based on committee opinions but on measured, auditable risk levels.

Tier 1 · σ ≥ 0.83
Confident automation. Unsafe rate below 1%. Suitable for automated draft documentation, low-risk decision support, and routine clinical summaries. The AI proceeds with minimal oversight.
Tier 2 · 0.50 ≤ σ < 0.83
Cautious automation. Unsafe rate 4–5%. Output is flagged for clinician confirmation. The AI assists but does not proceed autonomously.
Tier 3 · σ < 0.50
Mandatory human review. Unsafe rate 8–18%. No autonomous use. The AI surfaces its disagreement, and a clinician makes the call.

This is the kind of framework a hospital governance board can understand. A regulator can ask: "What threshold are you using?" A hospital can answer: "σ ≥ 0.83 for autonomous draft notes, mandatory clinician review below 0.5." That's auditable. That's governable.
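
In implementation terms, the whole routing policy is a threshold check. A sketch using the tier boundaries above; the action strings are illustrative.

```python
def route(sigma: float) -> str:
    """Map a consensus score to the governance tier described above."""
    if sigma >= 0.83:
        return "tier-1: proceed with minimal oversight"   # unsafe rate < 1%
    if sigma >= 0.50:
        return "tier-2: flag for clinician confirmation"  # unsafe rate 4-5%
    return "tier-3: mandatory human review"               # unsafe rate 8-18%

assert route(0.91).startswith("tier-1")
assert route(0.60).startswith("tier-2")
assert route(0.20).startswith("tier-3")
```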

The Abstention Insight

The AI That Knows When to Shut Up

One more finding that I think deserves attention, especially from a governance perspective.

Not all AI models behave the same way when they're uncertain. Some push through and give you an answer anyway, even when they shouldn't. Others pause. They say, effectively, "I'm not confident enough to answer this."

We tracked this across all four models. The results were clear.

Model | Abstention Rate | Unsafe Rate
Claude (Anthropic) | 6.47% | 3.38%
DeepSeek | 0.79% | 4.69%
Gemini Pro (Google) | 2.71% | 8.42%
GPT-4o (OpenAI) | 0.70% | 9.56%

The model that declined to answer most often had the lowest unsafe rate. The one that almost never declined had the highest. In clinical settings, an AI that knows when to shut up is safer than one that always has an answer. That's how good doctors operate too.
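
Measuring this requires nothing exotic: count abstentions and unsafe verdicts per model. A sketch over (model, verdict) records, assuming "unsafe" counts hallucinations plus incomplete responses:

```python
from collections import Counter

# Each record: (model_name, verdict), with verdict one of
# "correct", "incomplete", "hallucination", "abstention".
def per_model_rates(scored: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    by_model: dict[str, Counter] = {}
    for model, verdict in scored:
        by_model.setdefault(model, Counter())[verdict] += 1
    rates = {}
    for model, counts in by_model.items():
        n = sum(counts.values())
        # "Unsafe" here assumes hallucination + incomplete responses.
        unsafe = counts["hallucination"] + counts["incomplete"]
        rates[model] = {
            "abstention_rate": counts["abstention"] / n,
            "unsafe_rate": unsafe / n,
        }
    return rates
```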

The Malaysian Opportunity

Why Malaysia Can Lead This

I'm writing this from Kuala Lumpur, where Malaysia's National AI Office (NAIO) is preparing the country's AI Governance Bill, expected before Cabinet by mid-2026. We're at an inflection point. The government is actively figuring out how to regulate AI in healthcare, and most of the frameworks on the table are imported from the US or EU.

Those frameworks are built around hallucination detection. They're asking the wrong question.

Malaysia has something unique to contribute here. We have a healthcare system that serves millions with limited specialist capacity. If AI can safely handle even a portion of clinical documentation or decision support, that's not a convenience. It's a lifeline. But "safely" is the operative word. And safety, as our data shows, isn't just about catching lies. It's about catching omissions.

We don't have legacy regulatory frameworks that assumed AI would look like medical devices. We have NAIO actively seeking input. And we have the data, the patents, the platform, and the methodology to propose something concrete: not "regulate AI" in the abstract, but "here's a measurable, auditable governance signal that any hospital can implement."

The consensus field doesn't depend on English-language training data or American clinical guidelines. It measures whether independent AI systems agree, regardless of the language or medical tradition they were trained on. For a multilingual country like ours, that matters.

The Ask

What I'm Asking For

If you're a policymaker: look at the data. AI hallucination in healthcare is a real problem, but it's the smaller problem. Incomplete answers are seven times more common, and no current framework catches them. We have a measurement tool that does.

If you're a hospital administrator: this doesn't require replacing your AI vendor. It works on top of any existing system. Run the same query through multiple models, measure the consensus, and use that number to decide how much human oversight is needed.

If you're a clinician: we're not trying to replace you. The data proves you're irreplaceable. Even our best system hits a floor below which only a human expert can catch the remaining errors. What we're offering is a way to tell you which AI outputs deserve your trust and which ones need your scrutiny.

If you're a researcher: the full methodology is described in our paper submitted for peer review. The scoring taxonomy, the statistical analysis, the formal limitations. We're not hiding behind marketing. The work is open to challenge.

Hallucination is rare and structurally confined to speculative queries. Incompleteness is common and invisible to current detection methods. Multi-model consensus reduces unsafe outputs by 24×. The residual risk requires human oversight, and we've quantified exactly where that oversight is needed. This isn't theoretical. This is infrastructure. And Malaysia can lead the way.

See It In Action

Eptim.ai's medical verification platform applies multi-model consensus and clinical consistency rules to AI outputs. Try it on your own medical queries.

Try Eptim.health →

If this resonated, share it with someone working in healthcare AI or governance policy.