When “Elite” DORA Metrics Hide Illegible Systems in AI-Heavy Orgs
If your DORA dashboard says “elite” while your incidents sound like “we have no idea what this thing does,” you are not imagining a gap. Your dashboard shows a system. In practice, it is just the part you have chosen to look at.
AI is breaking the assumptions underneath DORA without touching the numbers. This is the obligation gap. Under AI load, DORA shows you AI-assisted output and a thin slice of failures, then calls that “team capability.” If you carry a crown jewel, that gap belongs to you.
The core argument is simple. DORA is not wrong. It is incomplete.
Why this matters now
On paper, the DORA story for an AI rollout is often beautiful.
Deployment frequency up 30 to 40 percent.
Lead time down 30 to 40 percent.
Change failure rate stable.
MTTR inside acceptable ranges.
Inside the war room, the story is different.
“I do not know why it behaves like that.”
“Copilot refactored this; the tests passed; I do not really remember the details.”
“The one person who understood this left six months ago.”
Diagnosis Time stretches. Review depth erodes. Runbooks fall behind. More and more of your real system lives in prompts, vendor tools, and private mental models that never make it into the repo.
If you ignore that curve, you are betting your crown jewels on a story your own engineers do not believe.
DORA does not see any of that. It was never designed to.
You trade a little slide comfort today for not having to explain a ghost system to a regulator tomorrow. In return you get correct models built faster, safer reversals, and fewer unknown owners in the war room.
The DORA bubble
Every metric has a scope, whether you name it or not. Typical DORA scope for a “service” is a deploy pipeline for a web or API tier, a set of repos “owned” by a team, and an incident process that records outages tied to those components. Everything else quietly falls out of frame.
Shared identity or auth layers that six teams depend on but no single team owns.
Core data pipelines that feed the service but live under a different VP.
Vendor systems that sit in the middle of critical flows but never appear in your architecture diagram or your org chart.
Manual operational work that keeps the whole thing from drifting off a cliff.
The crown jewel blast radius
For a crown jewel system, the blast radius is bigger than the bubble. By crown jewel system I mean anything whose failure puts you in front of a regulator, a board, or a patient or customer in real pain.
When those systems fail, the story often looks like this.
A dependency outside the DORA scope behaves strangely.
Signals show up in another team’s tooling first.
Manual workarounds start happening in Slack and side channels.
The service inside the bubble is the last one to “fail,” so it gets the headline.
The report talks about the part inside the bubble. The real story started somewhere else.
AI and shadow scope
AI accelerates this scope drift in ways DORA was never designed to see.
AI-assisted changes to vendor configs and schemas never count as deploys because they happen in admin consoles or through API calls outside your pipeline.
AI-generated jobs, scripts, and glue code live between systems and never make it onto the architecture diagram because they started as quick fixes and became load bearing.
AI agents act at the edge, triaging tickets, triggering workflows, routing traffic, all without clear ownership or incident tags.
Most of that lives outside the DORA bubble. No deploy counts. No formal ownership. No consistent incident tagging. From the dashboard, your delivery looks stable. From the inside, you are carrying a growing ball of invisible coupling and ad hoc glue.
Reality in the war room
The moment you feel the cost is not in a metrics review. It is in the war room.
People who have never met are suddenly debugging the same broken flow.
ICs, vendors, product, and on-call engineers are all staring at different panels that disagree. Two or three monitoring systems show conflicting truths.
Someone finally asks “who actually owns this vendor integration” and three people start talking at once about different systems, each assuming someone else has it covered.
When the dust settles, the post-incident writeup tends to collapse everything back into the bubble. Root cause is expressed in terms of the in-scope service. Actions stay inside existing team boundaries. Out-of-scope pieces get labeled “contributing factors” and slide off the action item list.
You get to keep your clean charts. You do not get better judgment.
Scope and Reality, running the pass on one crown jewel
For each crown jewel
Draw the blast radius. List every dependency that can hurt this system, including identity, data, vendors, and AI-driven tools.
If you cannot map it, you cannot claim you run it.
Mark what DORA actually sees. Where do deploy counts, failure rates, and MTTR exist today.
Replay the last three serious incidents. Where did they really start. When did the part you measure notice.
Ask who owns the off-graph parts. Is there a named owner with the authority and time to act. Or is it “shared responsibility” that belongs to no one.
If the answer is “most serious problems start outside the bubble we measure,” your metrics are not lying. They are just not telling the whole story.
You have a scope problem, not just a performance problem.
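One way to keep this pass honest is to write the blast radius down as data, not just a diagram. Below is a minimal sketch, assuming a simple in-house record; CrownJewelScope, Dependency, and all the example entries are illustrative names, not a prescribed schema.

```python
# A minimal sketch of a blast radius map for one crown jewel system.
# CrownJewelScope, Dependency, and the example entries are illustrative,
# not a prescribed schema.
from dataclasses import dataclass, field

@dataclass
class Dependency:
    name: str
    kind: str           # "identity", "data", "vendor", "ai_tool", ...
    dora_visible: bool  # do deploy counts, failure rates, and MTTR exist for it today
    owner: str | None   # named owner with authority and time, or None

@dataclass
class CrownJewelScope:
    system: str
    dependencies: list[Dependency] = field(default_factory=list)
    # indices into `dependencies` marking where recent serious incidents really started
    incident_origins: list[int] = field(default_factory=list)

    def off_graph(self) -> list[Dependency]:
        """Dependencies that can hurt the system but that DORA never sees."""
        return [d for d in self.dependencies if not d.dora_visible]

    def unowned(self) -> list[Dependency]:
        """Dependencies left to 'shared responsibility', which belongs to no one."""
        return [d for d in self.dependencies if d.owner is None]

    def incidents_outside_bubble(self) -> int:
        """How many recent serious incidents started outside the measured scope."""
        return sum(
            1 for i in self.incident_origins
            if not self.dependencies[i].dora_visible
        )

# Purely illustrative output of one pass
scope = CrownJewelScope(
    system="payments-core",
    dependencies=[
        Dependency("web/API tier", "service", dora_visible=True, owner="payments team"),
        Dependency("shared auth layer", "identity", dora_visible=False, owner=None),
        Dependency("vendor fraud check", "vendor", dora_visible=False, owner=None),
    ],
    incident_origins=[1, 2, 2],  # the last three serious incidents
)
print(len(scope.off_graph()), len(scope.unowned()), scope.incidents_outside_bubble())  # 2 2 3
```

If the last three origins all land in off-graph dependencies, the pass has done its job: the scope problem is now legible enough to argue about.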
The field ritual, 60 Minute Four Pass Review
Set a timer. Pick one crown jewel. Use the Four Pass lens.
Pass 1 Power and Blame
Pass 2 Obligation and OTW
Pass 3 Scope and Reality
Pass 4 Repair and Load
By the end of the hour you produce three artifacts.
A scope note signed by the accountable owner and witnessed by incident leadership.
A fracture list.
One instrumentation change scheduled with an owner and a date.
If you cannot produce them, treat the system as more illegible than your org admits and escalate.
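If it helps to pin down what those three artifacts look like, here is a minimal sketch of the records the hour should leave behind. The field names are assumptions; a shared doc with the same fields works just as well as code.

```python
# A minimal sketch of the three artifacts. Field names are assumptions,
# not a prescribed format.
from dataclasses import dataclass
from datetime import date

@dataclass
class ScopeNote:
    system: str
    accountable_owner: str   # signs the note
    incident_witness: str    # incident leadership who witnessed it
    in_scope: list[str]      # what DORA actually measures today
    off_graph: list[str]     # dependencies that can hurt the system but go unmeasured

@dataclass
class Fracture:
    flow: str                # where legibility or ownership breaks
    symptom: str             # e.g. "no named owner", "vendor config changed outside the pipeline"
    found_in_pass: int       # which of the four passes surfaced it

@dataclass
class InstrumentationChange:
    description: str         # e.g. "record Diagnosis Time for this system"
    owner: str
    due: date                # scheduled, not aspirational
```

Keeping these as structured records rather than prose makes it obvious when one of the three is missing.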
The instrumentation patch
Add one primary gauge next to DORA.
Diagnosis Time
Diagnosis Time is the time from first alert to a correct shared model of cause and next safe action. Measure it in real incidents and in drills, using the same start and stop criteria each time.
SRE time-to-detect and time-to-resolve measure speed inside known scope. Diagnosis Time measures how long it takes to build the correct model of the system when the scope itself is unclear.
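Here is a minimal sketch of the measurement, assuming your incident tooling can carry two timestamps per event. The marker names, first_alert and correct_model_confirmed, are placeholders; what matters is that the same two markers are applied to every incident and every drill.

```python
# A minimal sketch of Diagnosis Time with fixed start and stop criteria.
# The marker names are assumptions; keep them identical across incidents and drills.
from datetime import datetime

def diagnosis_time_minutes(timeline: dict[str, datetime]) -> float:
    """Minutes from first alert to a correct shared model of cause and next safe action."""
    start = timeline["first_alert"]
    # recorded when the war room agrees on the cause and the next safe action
    stop = timeline["correct_model_confirmed"]
    return (stop - start).total_seconds() / 60

incident = {
    "first_alert": datetime(2026, 1, 12, 3, 14),
    "correct_model_confirmed": datetime(2026, 1, 12, 7, 55),
}
print(diagnosis_time_minutes(incident))  # 281.0
```

The arithmetic is trivial. The discipline is agreeing in advance on what counts as a correct shared model, and recording that moment honestly during the incident.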
Around Diagnosis Time, add these.
Code Confidence
Knowledge Retention
Root Cause Depth
Fracture Tracker
Tool-deference and hesitation are legibility alarms. They mark places where knowledge is missing, ownership is unclear, or the real system lives outside the repo. Track which flows trigger them.
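One way to track which flows keep triggering those alarms is sketched below. The signal names and the shape of a review note are assumptions; the only real requirement is that every incident review tags the same alarms the same way.

```python
# A minimal sketch of a fracture tracker fed by incident reviews.
# Signal names and the review-note shape are assumptions, not a standard.
from collections import Counter

LEGIBILITY_ALARMS = {"tool_deference", "hesitation", "unknown_owner", "off_graph_dependency"}

def fracture_counts(review_notes: list[dict]) -> Counter:
    """Count how often each flow trips a legibility alarm across reviews."""
    counts: Counter = Counter()
    for note in review_notes:
        for signal in note.get("signals", []):
            if signal in LEGIBILITY_ALARMS:
                counts[note["flow"]] += 1
    return counts

reviews = [
    {"flow": "vendor auth callback", "signals": ["tool_deference", "unknown_owner"]},
    {"flow": "billing export", "signals": ["hesitation"]},
    {"flow": "vendor auth callback", "signals": ["off_graph_dependency"]},
]
print(fracture_counts(reviews).most_common())
# [('vendor auth callback', 3), ('billing export', 1)] -- the flows that keep coming up are your fracture list
```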
If Diagnosis Time improves only because you narrowed incident scope or suppressed escalation, the metric is being used as theater and the system remains illegible.
If Diagnosis Time is rising while you are handing out “elite” labels, the problem is not the engineers. The problem is the panel.
This is not anti-AI. It is anti-illegibility.
This is not a control play. It is a scope honesty requirement that forces ownership where the risk actually lives.
How to use this
Start small and specific.
Pick one crown jewel system. Instrument Diagnosis Time next to DORA on that system. Run the 60 Minute Four Pass Review with the people who actually carry incidents and maintain the critical paths. Not just the people who present the slides.
Then, over the next 90 days
Month 1
Baseline Diagnosis Time and see how it behaves against your “elite” DORA bands.
Month 2
Add Code Confidence and Knowledge Retention checks into incident reviews.
Month 3
Tie one real decision to what you learned. Make a promotion, platform, or funding call that uses Diagnosis Time and fracture trends alongside DORA, not under it.
Never promote on velocity alone when Diagnosis Time is rising. Success on that first crown jewel looks like DORA stable while Diagnosis Time comes down.
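A minimal sketch of that gate, assuming Diagnosis Time rolls up as a monthly median per crown jewel. The band label and the trend test are deliberately crude; the point is that the promotion or funding conversation has to look at both curves.

```python
# A minimal sketch of the promotion gate. The monthly medians and band labels
# are assumptions about how you roll the numbers up, not a standard.
def diagnosis_time_rising(monthly_medians_minutes: list[float]) -> bool:
    """Crude trend check: is the latest monthly median worse than the baseline month."""
    return len(monthly_medians_minutes) >= 2 and monthly_medians_minutes[-1] > monthly_medians_minutes[0]

def promotion_gate(dora_band: str, monthly_medians_minutes: list[float]) -> str:
    if dora_band == "elite" and diagnosis_time_rising(monthly_medians_minutes):
        return "hold: velocity is up, but the org is getting slower at understanding itself"
    return "proceed: judge on DORA and Diagnosis Time together"

print(promotion_gate("elite", [140, 185, 230]))  # hold
```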
This costs you something. You spend time on the ritual, political capital on the conversation, and comfort when the first Diagnosis Time curve contradicts your DORA story. You get governance in return.
If leadership will not fund the time and authority to fix the fractures you find, they are choosing harm risk deliberately.
Your role is steward, not narrator of a panel.
Reading this without instrumenting at least one crown jewel is choosing comfort over governance.
Artifacts are cheap, judgment is scarce.
Per ignem, veritas.