Why AI Hallucinates — and How Teaching It to Say "I Don't Know" May Save Us All

The number one reason people dismiss AI tools is hallucinations. Not the occasional mistake — every tool makes mistakes — but the confident, articulate, completely fabricated response delivered in the same tone as a factual one. When your AI assistant invents a citation that doesn't exist and presents it as if it's quoting a peer-reviewed journal, it erodes trust in a way that's hard to recover from.
A recent paper from OpenAI gives us something rare: clarity on why this happens and a surprisingly simple fix.
It's Not Bad Data. It's Bad Training.
The biggest surprise from the paper? Hallucinations aren't primarily caused by bad input data. Even with perfect training data, language models would still hallucinate. The problem is structural — baked into how these models learn.
During training, models are taught to always produce an answer. They're rewarded for fluency and completeness. They're penalised for hesitation or gaps. The training process literally optimises for confidence, even when the model is working from insufficient information.
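The incentive is easy to see in miniature. Here's a sketch (my own illustration, not code from the paper): under the binary grading most training and benchmark setups use, a wrong answer and an "I don't know" both score zero, so guessing beats abstaining at any confidence level above zero.

```python
def expected_score(p_correct: float, abstain: bool,
                   wrong_penalty: float = 0.0) -> float:
    """Expected score on one question under a simple grading scheme.

    p_correct:     the model's true chance of being right if it answers
    abstain:       whether it says "I don't know" instead (scores 0)
    wrong_penalty: points deducted for a confident wrong answer;
                   0.0 reproduces standard accuracy-only grading
    """
    if abstain:
        return 0.0
    return p_correct - (1.0 - p_correct) * wrong_penalty

# With no penalty for being wrong, answering beats abstaining even at
# 1% confidence, so training optimises the hedging out.
print(expected_score(0.01, abstain=False))  # positive, vs. 0.0 for abstaining
```

Note that once `wrong_penalty` is positive, abstaining becomes the better move below some confidence threshold. That, roughly, is the shape of the fix the paper argues for.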
Think about what that means. We've built systems that are constitutionally incapable of saying "I don't know." Not because they're arrogant — because we trained the uncertainty out of them.
The Camera Recording a TV
Here's the analogy that stuck with me. Imagine pointing a camera at a TV screen that's showing the camera's own feed. You get a feedback loop — each generation of the image gets slightly more distorted, slightly more removed from reality, but still looks plausible if you don't know what the original looked like.
That's what's happening as AI models train on AI-generated content. The internet is increasingly populated with text written by LLMs. New models train on that text. Each generation inherits the previous generation's hallucinations and adds its own. The feedback loop compounds.
This isn't theoretical. Researchers are already seeing it in the wild. Model outputs trained on model outputs degrade in measurable ways — losing nuance, converging toward generic responses, and amplifying factual errors that were minor in the original training data.
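The compounding is easy to reproduce in a toy setting. A minimal sketch (my own, not taken from the cited research): treat each "model generation" as re-estimating a token distribution from a finite sample of the previous generation's output. Sampling noise accumulates with nothing to correct it, and diversity drains away, which shows up as falling entropy.

```python
import math
import random

def entropy(dist: dict) -> float:
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def next_generation(dist: dict, n_samples: int, rng: random.Random) -> dict:
    """'Train' the next model: re-estimate token frequencies from a
    finite sample of the previous model's output."""
    tokens = list(dist)
    sample = rng.choices(tokens, weights=[dist[t] for t in tokens], k=n_samples)
    return {t: sample.count(t) / n_samples for t in set(sample)}

def simulate(vocab_size: int = 10, n_samples: int = 50,
             generations: int = 200, seed: int = 0) -> list:
    """Track the entropy of the learned distribution across generations,
    starting from a uniform (maximally diverse) distribution."""
    rng = random.Random(seed)
    dist = {i: 1 / vocab_size for i in range(vocab_size)}
    history = [entropy(dist)]
    for _ in range(generations):
        dist = next_generation(dist, n_samples, rng)
        history.append(entropy(dist))
    return history

history = simulate()
# Diversity degrades generation over generation: the final distribution
# carries less information than the uniform one we started from.
print(f"gen 0: {history[0]:.2f} bits, gen 200: {history[-1]:.2f} bits")
```

Nothing in the loop pushes the distribution back toward its original diversity, so drift only accumulates. It's the camera-pointed-at-the-TV dynamic, in ten lines.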
Teaching "I Don't Know"
The OpenAI paper proposes a fix that's deceptively simple: train models to express uncertainty.
Instead of always generating a confident answer, models can be taught to flag when they're operating outside their reliable knowledge. Not a full refusal — more like a calibrated confidence signal. "Based on what I know, I think X, but I'm not confident about Y."
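At the interface level, that could look something like the following hypothetical sketch (the thresholds and wording are my own placeholders, not GPT-5's actual behaviour): route a calibrated confidence estimate into one of three response modes instead of always answering flat out.

```python
def respond(answer: str, confidence: float,
            abstain_below: float = 0.3, hedge_below: float = 0.75) -> str:
    """Map a calibrated confidence score to a response style.

    The thresholds are illustrative; in practice they would be tuned
    against the model's actual calibration curve.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < abstain_below:
        return "I don't know."
    if confidence < hedge_below:
        return f"I think {answer}, but I'm not confident. Please verify."
    return answer

print(respond("Paris", 0.95))  # confident: answer plainly
print(respond("Paris", 0.50))  # uncertain: hedge explicitly
print(respond("Paris", 0.10))  # out of depth: abstain
```

The hard part isn't this routing logic, of course. It's producing a `confidence` value that actually tracks how often the model is right, which is exactly what the paper means by calibration.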
GPT-5's implementation of this is interesting. It can now indicate when it's uncertain, when it's working from limited information, and when a claim should be verified. It's not perfect — the calibration is still rough — but it's the first serious attempt to make uncertainty a feature rather than a bug.
Why This Matters More Than Better Benchmarks
I use AI more intensively than most people I know. My entire workflow runs through Claude Code. And I can tell you that hallucinations aren't my biggest problem anymore — not because they've stopped, but because I've developed the habit of verifying everything that matters.
The real problem is the people who haven't developed that habit. The student who cites the fabricated source. The manager who makes a decision based on AI-generated analysis without checking the underlying data. The developer who ships code that was plausible but wrong.
Teaching models to say "I don't know" doesn't just reduce hallucinations. It changes the user's relationship with the tool. A model that flags its own uncertainty trains its users to think critically. A model that never hesitates trains its users to trust blindly.
The Bigger Question
The paper raises something it doesn't fully address: what happens when AI-generated content becomes the majority of training data?
We're not far from that point. Some estimates suggest over 50% of internet text will be AI-generated within the next few years. If models train on that text, and those models generate text that future models train on, we're building a house of mirrors.
The hallucination problem isn't just about individual model outputs. It's about the entire knowledge ecosystem becoming less reliable, one training cycle at a time. Teaching models to express uncertainty is a start. But the deeper challenge is maintaining a clean signal of human-generated, verified information in a world increasingly flooded with synthetic text.
Where I Land
I'm cautiously optimistic about the "I don't know" approach. Not because it solves hallucinations — it doesn't. But because it shifts the paradigm from "AI as oracle" to "AI as collaborator with known limitations."
That's a healthier relationship. I don't need AI to be perfect. I need it to be honest about when it's guessing. The best human experts I've worked with have always been the ones who say "I'm not sure about this part" — because that's when you know you can trust the parts they are sure about.
If AI can learn the same skill, that changes everything. Not because the technology improves. Because the trust model finally works.