How we built an AI SRE agent that fixes production issues

OpsAI is Middleware's AI SRE agent that auto-resolves production incidents using full-stack telemetry correlation, root cause analysis, and GitHub PR fixes.

We built OpsAI because investigating a production incident has become one of the hardest cognitive tasks in software engineering, and the industry’s answer has been to add more dashboards.

That’s the wrong answer.

As distributed environments grow, the signals get noisier, failures span more systems, and Kubernetes adds layers that nobody fully understands at 3 AM. Telemetry volume keeps doubling. Somewhere in the middle of all that, a PagerDuty alert fires and a human is expected to pull it all apart, fast.

At Middleware, we’ve spent years unifying logs, metrics, and traces on a single OpenTelemetry-native platform, precisely so engineers don’t have to bounce between five tools to understand a single incident. But the more we shipped, the more we kept asking ourselves the same uncomfortable question: if all the data is already in one place, why are humans still doing the correlation work?

So we built OpsAI, an AI SRE agent that detects, diagnoses, and fixes production issues end-to-end. Not a chatbot bolted onto a dashboard. Not a summarizer. An agent that runs the same investigative loop a senior on-call engineer would run, just faster, at machine scale, without sleeping through the 3 AM page.

This is the full write-up: why we built it the way we did, how it works under the hood, what we’ve seen in production, and where we honestly stand relative to the rest of the AI SRE space.

The pain we kept hearing about

We didn’t design OpsAI from a whiteboard. We designed it from hundreds of conversations with platform engineers, SREs, and CTOs who were all circling the same five problems.

Alert fatigue is brutal. Modern microservices stacks throw thousands of alerts a week. Most of them are noise. In one customer environment we benchmarked, a single misconfigured CPU monitor fired 1,073 times in seven days with zero resolutions. That’s not a monitoring problem. That’s a trust collapse: engineers stop responding to alerts as signals because they’ve learned most of them are false. And when they stop trusting alerts, the real ones get missed.

Context-switching is the real MTTR killer. During a live incident, an engineer typically jumps between PagerDuty, Datadog or Grafana, Loki or Splunk, a Kubernetes console, GitHub, Slack, and a runbook that’s three months out of date. Every context switch costs minutes. A single P1 can burn hours before anyone has confirmed what broke, let alone why.

Kubernetes is opaque. Pods crash. ConfigMaps drift. Init containers fail silently. Diagnosis usually means chaining six kubectl commands an engineer half-remembers from a Slack thread last quarter. The platform abstracts infrastructure well until something goes wrong, at which point it abstracts the failure too.

AI-generated code is multiplying production bugs. Teams ship more code than ever, much of it generated by Copilot, Cursor, or in-house LLMs. The code passes review and tests, then surfaces subtle runtime errors under real traffic. The velocity gain from AI-assisted coding is real; so is the reliability cost.

Dashboards tell you something broke. They don’t tell you why. Even with the best observability stack in the world, a human still has to do the detective work. Engineers spend up to 60% of their time finding root causes instead of building features. That ratio is unsustainable.

The industry has been trying to solve this with more tools. We think the real answer is fewer humans doing manual correlation, and an agent built specifically to do it.

“Every engineering team already has enough telemetry. Very few have operational clarity.”

What Is an AI SRE Agent?

An AI SRE agent is a system that autonomously performs the investigative and remediation workflows a Site Reliability Engineer would otherwise execute manually. Not an alert router. Not a glorified runbook. An agent that detects an anomaly, correlates it across telemetry signals, identifies root cause at the code level, and applies a fix, either as a pull request or, for infrastructure-layer issues, directly against the running system.

The distinction from traditional AIOps matters. AIOps platforms reduce alert noise and surface probable causes. That’s useful. It’s not sufficient. An AI SRE agent operates across the full incident lifecycle, from detection through remediation, without requiring a human in the loop for the common case.

OpsAI watches APM, RUM, logs, infrastructure, and Kubernetes events continuously. It pulls in alerts from third-party tools like Datadog and Grafana too. When something breaks, it doesn’t just acknowledge the alert. It investigates. It runs root cause analysis using the full observability context. When it’s confident, it ships a code fix as a pull request through a secure GitHub MCP connection, without waking up your on-call engineer.

Think of it as an always-on SRE that knows your stack from the load balancer down to the line of code, never context-switches, and never gets tired.

See OpsAI resolve your next production incident automatically →

How OpsAI Investigates: A Real Process, Not a Guessing Game

One of the first things we learned building this: the more telemetry you pile into a context window, the more the model gets distracted by suspicious-but-unrelated signals. Performance degrades. Root causes get missed. The early instinct (give the agent more data) made it worse, not better.

The fix wasn’t more tool calls. It was a structured, hypothesis-driven investigation loop.

OpsAI runs four stages on every incident, in order.

Stage 1: Detect

OpsAI continuously monitors APM traces, RUM sessions, logs, infrastructure metrics, and Kubernetes events. It also ingests alerts from Middleware’s native monitors and from third-party platforms like Datadog and Grafana. The moment something deviates (a latency spike, an error-rate breach, a pod stuck in CrashLoopBackOff, an SLO burn), the agent picks it up. No human needs to forward the alert. The detection layer immediately begins building the incident context graph.
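
As a rough illustration of what "deviates" can mean for a single metric, the sketch below flags a latency sample that sits far above a recent baseline. The z-score rule, the threshold of 3, and the function name are assumptions made for illustration; the post does not describe OpsAI's actual detection logic.

```python
from statistics import mean, stdev

def is_latency_spike(baseline_ms, sample_ms, z_threshold=3.0):
    """Return True when sample_ms sits more than z_threshold standard
    deviations above the mean of baseline_ms (a hypothetical rule)."""
    if len(baseline_ms) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    if sigma == 0:
        return sample_ms > mu  # flat baseline: any increase counts
    return (sample_ms - mu) / sigma > z_threshold

# Hypothetical p95 latencies (ms) for the last few intervals.
baseline = [120, 118, 125, 122, 119, 121, 123]
print(is_latency_spike(baseline, 124))  # within normal variation
print(is_latency_spike(baseline, 480))  # far above the baseline
```

A production detector would also need seasonality handling and per-service baselines; this only shows the shape of a deviation check.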

Stage 2: Diagnose

This is where most “AI summarizer” tools stop. OpsAI keeps going.

It forms an initial hypothesis about what might be wrong, then validates or rejects it using targeted queries against the actual telemetry. If the hypothesis holds, it digs deeper into sub-hypotheses. If it doesn’t, it backtracks and follows a different lead.

The agent queries only what it needs to test a specific hypothesis, not every signal in the stack at once. That focused, branching approach is what lets OpsAI handle multi-component incidents spanning several services without drowning in noise.

It’s the same shape Datadog described for Bits AI SRE, and honestly, it’s how good human SREs already think. The hard part isn’t the idea, it’s building an agent that does it reliably across thousands of incidents without going off the rails.

Stage 3: Root Cause

Once OpsAI converges on a likely root cause, it assembles everything a reviewer would want: the stack trace, the affected service, recent commits, related log patterns, infrastructure context, and a structured explanation of why this is the cause, not just a correlated symptom.
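
One way to picture that bundle is as a single structured record. The sketch below is a hypothetical container for the fields listed above; the class name, field names, and sample values are assumptions, not OpsAI's internal schema.

```python
from dataclasses import dataclass, field

@dataclass
class RootCauseReport:
    """Hypothetical container for the evidence a reviewer would want."""
    affected_service: str
    stack_trace: str
    explanation: str  # why this is the cause, not just a correlated symptom
    recent_commits: list = field(default_factory=list)
    log_patterns: list = field(default_factory=list)
    infra_context: dict = field(default_factory=dict)

    def summary(self) -> str:
        """One-line summary a reviewer could scan first."""
        return (f"{self.affected_service}: {self.explanation} "
                f"({len(self.recent_commits)} recent commits, "
                f"{len(self.log_patterns)} log patterns)")

# Illustrative values only.
report = RootCauseReport(
    affected_service="user-api",
    stack_trace="KeyError: 'alice'",
    explanation="unsafe dictionary lookup on a missing user",
    recent_commits=["abc123"],
)
print(report.summary())
```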

We’ve had real incidents where OpsAI traced a Kafka lag spike not to the obvious upstream error logs (which were unrelated) but to a commit latency spike buried two service layers deep. A human SRE would have arrived at the same conclusion after a few hours of digging. OpsAI got there in minutes.

Stage 4: Fix

This is what genuinely separates OpsAI from “AI on top of a dashboard.”

For application-level bugs, OpsAI uses a secure GitHub MCP connection to access only the files referenced in the error’s stack trace, generates a targeted fix, and opens a pull request with a clean diff. It never scans the full codebase. It never stores source code. File-scoped, incident-scoped, ephemeral access.
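
To make "only the files referenced in the error’s stack trace" concrete, here is a minimal sketch that pulls file paths out of a Python traceback and keeps only those inside the repository. The regular expression matches the standard `File "...", line N` frame format; the `repo_prefixes` allow-list, the sample traceback, and the function name are assumptions for illustration.

```python
import re

# Matches the frame lines Python prints in a traceback, e.g.
#   File "flask/app.py", line 63, in get_user
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+)')

def files_in_traceback(traceback_text, repo_prefixes=("flask/",)):
    """Return the ordered, de-duplicated repo files referenced by a traceback.

    repo_prefixes is a hypothetical allow-list that keeps application
    frames and drops interpreter and third-party library frames.
    """
    seen = []
    for match in FRAME_RE.finditer(traceback_text):
        path = match.group("path")
        if path.startswith(repo_prefixes) and path not in seen:
            seen.append(path)
    return seen

# Illustrative traceback; paths and function names are made up.
sample = '''Traceback (most recent call last):
  File "/usr/lib/python3/site-packages/somelib/core.py", line 10, in dispatch
    return handler()
  File "flask/app.py", line 63, in get_user
    return users[username]
KeyError: 'alice'
'''
print(files_in_traceback(sample))
```

Scoping access to this list is what keeps the request file-scoped: anything not named by a frame is never fetched.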

For Kubernetes, you choose your comfort level: Auto RCA mode (OpsAI investigates, you apply the fix) or Auto Fix mode (OpsAI applies it directly).
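
The two modes amount to a gate on who applies the fix. A minimal sketch of that gate follows; the enum, the numeric confidence score, and the 0.8 cut-off are all hypothetical, since the post only says the agent acts when it is confident.

```python
from enum import Enum

class K8sMode(Enum):
    AUTO_RCA = "auto_rca"  # investigate only; a human applies the fix
    AUTO_FIX = "auto_fix"  # investigate and apply the fix directly

def next_action(mode, confidence, min_confidence=0.8):
    """Decide what to do with a proposed Kubernetes fix.

    confidence and min_confidence are illustrative assumptions.
    """
    if mode is K8sMode.AUTO_FIX and confidence >= min_confidence:
        return "apply_fix"
    return "report_for_human_review"

print(next_action(K8sMode.AUTO_RCA, 0.95))
print(next_action(K8sMode.AUTO_FIX, 0.95))
print(next_action(K8sMode.AUTO_FIX, 0.40))
```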

A concrete example from one of our beta runs: a Python service started throwing KeyError on the /user/<username> endpoint. OpsAI traced the issue to an unsafe dictionary lookup at lines 61–66 of flask/app.py, replaced it with a safe .get() call, and opened a pull request titled “Fix: Handle missing users gracefully in /user/<username> endpoint”, ready for review. End-to-end resolution: under two minutes.
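
The shape of that change is easy to show in isolation. The sketch below is not the actual diff from the pull request; it uses a hypothetical in-memory user store and plain functions (no Flask) to contrast a direct lookup with a `.get()` that handles a missing user.

```python
# Hypothetical in-memory user store standing in for the real data source.
USERS = {"alice": {"name": "Alice", "role": "admin"}}

def get_user_unsafe(username):
    """Before: a direct lookup raises KeyError for unknown users."""
    return USERS[username], 200

def get_user_safe(username):
    """After: .get() returns None for unknown users, which maps to a 404."""
    user = USERS.get(username)
    if user is None:
        return {"error": f"user '{username}' not found"}, 404
    return user, 200

print(get_user_safe("alice"))
print(get_user_safe("bob"))
```

Calling `get_user_unsafe("bob")` raises `KeyError`, which is the runtime error the service was throwing; the safe version turns the same input into an explicit not-found response.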

Read the full OpsAI setup guide in our docs →

Investigating Like Humans, Not Like a Summary Engine

We want to be direct about the design choice that took us longest to land on.

Early prototypes of OpsAI, like most early SRE agents in this space, scaled by making more tool calls and asking the LLM to summarize the responses. Every tool call added more tokens to the prompt. Performance got worse as we added more telemetry, not better. Sometimes the agent landed on the wrong root cause because an unrelated critical-looking error log distracted it from the actual signal.

The hypothesis-driven loop changed that. OpsAI now:

Forms a specific hypothesis about the root cause
Queries only the data needed to validate or reject that hypothesis
Branches into sub-hypotheses when evidence supports going deeper
Prunes dead branches and follows the most promising lead
Repeats until it reaches a root cause with enough confidence to act
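
The loop above can be sketched as a depth-first search over a tree of hypotheses, where each node carries one targeted check. This is a simplified illustration, not OpsAI's implementation: the `Hypothesis` class, the boolean checks standing in for telemetry queries, and the sample flags (loosely modeled on the Kafka incident described earlier) are all assumptions, and confidence scoring and lead ranking are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Hypothesis:
    """A candidate cause plus the targeted check that tests it."""
    name: str
    check: Callable[[], bool]  # stands in for a targeted telemetry query
    children: List["Hypothesis"] = field(default_factory=list)

def investigate(h: Hypothesis, trail: Optional[list] = None) -> Optional[list]:
    """Depth-first search over hypotheses.

    Returns the chain of confirmed hypotheses down to the deepest one,
    or None if this branch is rejected and should be pruned.
    """
    trail = (trail or []) + [h.name]
    if not h.check():
        return None  # rejected: prune and backtrack
    for child in h.children:
        deeper = investigate(child, trail)
        if deeper is not None:
            return deeper  # a sub-hypothesis explained more
    return trail  # nothing deeper held: stop here

# Hypothetical query results for one incident.
telemetry = {
    "consumer_lag_high": True,
    "upstream_errors_correlate": False,
    "commit_latency_high": True,
}

root = Hypothesis("kafka lag spike", lambda: telemetry["consumer_lag_high"], [
    Hypothesis("upstream error logs", lambda: telemetry["upstream_errors_correlate"]),
    Hypothesis("commit latency two layers down", lambda: telemetry["commit_latency_high"]),
])
print(investigate(root))
```

Each check runs only when its parent has been confirmed, which is the point of the design: the agent never pulls data for a branch it has no reason to explore.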

The result is an agent that gets more accurate as incidents get more complex, not less. That property is what makes it trustworthy enough to act autonomously.

“The operational bottleneck shifted from telemetry collection to operational reasoning. Engineers spend more time correlating than fixing.”

What We’re Seeing in Production

We didn’t ship OpsAI based on lab benchmarks. We’ve been running it on Middleware’s own production stack for months, alongside a select group of design-partner customers.

Metric	Result	Where It Shows Up
Internal incidents auto-resolved	50%+ automatically	Middleware’s own on-call
Detection-to-resolution rate	70%+	Design-partner deployments
MTTR reduction	5× on incidents OpsAI handles	Beta customer measurements
On-call productivity improvement	80%+	Pages auto-resolved vs. escalated
Fix accuracy	75% solved correctly end-to-end	PR quality review
Speed vs. competing AI SRE agents	6×–10× faster on identical prompts	Head-to-head benchmark

The 6×–10× speed advantage came from running identical prompts against OpsAI and competing agents across Grafana, Datadog, APM, RUM, and Kubernetes scenarios. The gap comes from architecture: agents without first-party observability context spend their inference budget on data retrieval and API orchestration. OpsAI starts the reasoning pipeline with a fully populated incident context graph.

The internal number (50%+ of Middleware’s own production issues handled automatically) is the one we’re proudest of. OpsAI is on-call alongside our engineers. Most days, it resolves incidents before a human ever sees the page.

“Middleware reduced the time we spend on debugging and resolving issues by nearly 90%. What sets their AI apart is that it doesn’t stop at detecting issues. It actually helps fix problems in production, and for engineering teams, that’s been a real game changer.”
Nico Laqua, CEO, Corgi Insurance

Where OpsAI Sits in the AI SRE Landscape

The space has gotten crowded fast. Here’s an honest read on where we land, and we’d encourage you to run identical prompts against each of these before making a decision. We did, and the benchmark results are public.

Agent	Approach	How OpsAI Compares
Resolve AI	Platform-agnostic, orchestrates over third-party APIs	Strong for vendor-neutral overlays, but every investigation is a round trip across external APIs. OpsAI runs on first-party data, which makes it faster and deeper.
Datadog Bits AI SRE	Native to Datadog’s platform, shares the “own the data layer” philosophy	Alert-only and locked to Datadog’s pricing tier. OpsAI detects from APM, RUM, infra, and Kubernetes directly, ingests Datadog alerts, and ships on usage-based pricing.
Deductive AI	Focused on reasoning and RCA under uncertainty	Interesting RCA work. OpsAI closes the loop with auto-fix and PR generation.
New Relic SRE Agent	Native to New Relic’s platform	Same vendor lock-in trade-off. OpsAI is OpenTelemetry-native and ingests third-party alerts without requiring migration.

Resolve AI has done genuinely interesting work on platform-agnostic orchestration. Deductive AI is pushing on reasoning under uncertainty using reinforcement learning. Lightrun and Mezmo are approaching the problem from different angles. We don’t think any one team wins the AI SRE category outright; the problem space is large.

What we’ve focused on is the combination that no one else has: full-stack observability ownership, GitHub MCP code awareness, end-to-end resolution with PR generation, and third-party alert ingestion so teams aren’t forced into a migration.

“Observability platforms have spent the last decade getting better at telling you something is wrong. The next decade is about systems that fix it for you. With OpsAI, we’re not building another dashboard or another alert channel; we’re building an SRE agent that lives inside your observability stack, reasons across your full telemetry, and ships actual code fixes when it’s confident.”
Laduram Vishnoi, Founder & CEO, Middleware

Why Building on Full-Stack Observability Changes the Math

There’s a real architectural debate happening in the AI SRE space, and we want to be direct about it.

Some agents, like Resolve AI, sit on top of whatever stack you already have and orchestrate over vendor APIs. That sounds good for teams with heavy existing investments, but every external API call is a potential point of latency, rate-limiting, schema mismatch, or missing context. The agent is only as fast as the slowest API it queries.

Other agents (Datadog Bits AI, New Relic’s SRE Agent) live inside a single observability platform. Fast and accurate, but locked to one vendor’s pricing and ecosystem.

OpsAI takes the best of both. Three concrete advantages fall out of building on first-party, OpenTelemetry-native data:

Speed. No waiting on third-party API responses for every investigation. Correlations happen inside a single data layer. That’s where the 6×–10× speed advantage comes from.

Accuracy. No schema mismatches, no missing fields, no rate-limited queries. The agent sees exactly the same data the rest of the platform sees: APM traces, RUM sessions, logs, infrastructure metrics, and Kubernetes events in one unified context.

Depth. Correlations between traces, logs, metrics, RUM sessions, and Kubernetes telemetry happen in one place, not across five APIs. That’s the difference between finding the symptom and finding the cause.

And for teams that don’t want to migrate off Datadog or Grafana, OpsAI ingests their alerts and queries their metrics, logs, and traces, then runs investigations inside Middleware. You get the architectural advantage without the migration cost.

“Dashboards don’t solve incidents. Engineers do. OpsAI changes that equation.”

Why Pairing an AI SRE Agent With Full-Stack Observability Matters

A standalone AI SRE agent bolted onto existing monitoring is useful. An AI SRE agent that lives inside the same platform collecting your traces, logs, metrics, RUM data, and Kubernetes events is something fundamentally different.

One context, not five. OpsAI correlates across signals natively because they share a data layer. No translation tax, no schema mismatch between systems.

Code-aware fixes. Pairing full-stack observability with a secure GitHub MCP connection means OpsAI matches a stack trace to the exact file and line, then ships a PR. That’s the difference between “something broke in service X” and “line 78 of app.py needs a .get() instead of a [] lookup.”

Less alert tuning, less manual triage. Anomaly detection runs across your application, infrastructure, and logs simultaneously. The agent learns your baseline and filters the false positives that keep on-call engineers up at night, like that CPU monitor that fired 1,073 times in a week.
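
A simple version of that filter can be expressed as a count over alert history: a monitor that fires constantly and is never resolved is a candidate for muting or retuning. The event shape, the threshold, and the monitor names below are assumptions for illustration; only the 1,073 figure comes from the benchmark mentioned above.

```python
from collections import Counter

def noisy_monitors(events, min_firings=100):
    """Return monitors that fired at least min_firings times and were never resolved.

    events is an iterable of (monitor_name, status) pairs, where status is
    "fired" or "resolved". Both the shape and the threshold are assumptions.
    """
    fired, resolved = Counter(), Counter()
    for name, status in events:
        if status == "fired":
            fired[name] += 1
        elif status == "resolved":
            resolved[name] += 1
    return sorted(
        name for name, count in fired.items()
        if count >= min_firings and resolved[name] == 0
    )

# One week of hypothetical alert history.
events = [("cpu-high", "fired")] * 1073 + [
    ("disk-full", "fired"),
    ("disk-full", "resolved"),
]
print(noisy_monitors(events))
```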

Privacy stays intact. OpsAI reads only the files referenced in a specific error. It never scans your full codebase. It never stores your source code.

See how OpsAI connects to your stack →

A Note on How Others Are Building This and What We’ve Learned

Datadog’s Bits AI SRE team has been public about benchmarking on real incidents instead of synthetic ones, and about the limits of “more tool calls plus a summarization prompt.” That mirrors our own experience: you can’t evaluate an SRE agent against toy scenarios and expect it to generalize.

We benchmark OpsAI the same way: real incidents from real production stacks, scored against ground truth.

We’ve learned from watching every serious team in this space. The hard part isn’t the concept of an AI SRE agent; it’s building one that works reliably across thousands of heterogeneous production incidents without going off the rails.

Confidence calibration, blast radius control, context pruning, third-party schema normalization: none of those problems are solved by the underlying model. They’re engineering problems, and they took us longer than the core reasoning pipeline.

We’re still early. The agent gets better every week as we feed it more real-world incidents and refine the investigation loop. But the direction is clear: less firefighting, more building. Fewer humans doing manual correlation. More incidents that quietly resolve themselves before anyone’s phone buzzes at 3 AM.

“AI agents won’t replace SREs. They’ll eliminate repetitive operational toil and free engineers to do the work that actually requires human judgment.”

The Future of SRE: Autonomous, Context-Aware, Remediation-Capable

The trajectory is clear. AI agents will handle an increasing fraction of production operations: anomaly detection, root cause analysis, code-level remediation, Kubernetes management, and incident communication. Not because AI is fashionable, but because the alternative, scaling human on-call capacity linearly with system complexity, simply doesn’t work.

The future SRE stack is autonomous, context-aware, and remediation-capable.

What that means in practice: SRE teams shift from reactive firefighting to proactive reliability engineering. The on-call rotation still exists, handling the edge cases and architectural decisions that require human judgment. The 3 AM pages become rarer, then rare. Engineering velocity increases. Cognitive overhead from incident management decreases.

Self-healing infrastructure isn’t a destination; it’s a continuous process of expanding the boundary of what the system handles without human intervention. OpsAI auto-resolves 50%+ of our internal incidents today. That number grows every week as the investigation loop improves on real production data. The scope expands from application errors and Kubernetes remediations to deployment validation, capacity management, and proactive failure prevention.

We’re in early innings. But the direction is unmistakable.

“The next generation of engineering tools won’t just detect problems. They’ll resolve them.”

Get Started With OpsAI

OpsAI is generally available today for all Middleware customers, existing and new.

To turn it on: For Kubernetes, install the Kube Agent with opsai.enabled=true. For application errors, install the APM or RUM agents and connect your GitHub repo via the Middleware MCP server. For Datadog or Grafana, connect those accounts in Middleware settings; OpsAI starts ingesting alerts immediately.

If you want to see it run on your own stack, we’d love to show you.

Book a demo → See OpsAI handle root cause analysis and auto-remediation against your actual production telemetry.

Start free, no credit card required → Full-stack observability plus OpsAI, running in your environment in under an hour.

Read the OpsAI documentation → Setup guides for Kubernetes, APM, RUM, Datadog, and Grafana.
