Tatiana Mikhaleva

Independent Developer Advocate and founder of DevOps.Pink. Docker Captain, CNCF Ambassador, AWS Community Builder — ambassador across eight developer tooling programs in total. I cover Docker, Kubernetes, and the agentic-AI stack for engineers who actually ship.

AI SRE Joined My On-Call — A Beginner-Friendly Walkthrough of Rootly

By Tatiana Mikhaleva · Founder & Senior Developer Advocate

2026-05-12

DevOps & Cloud

AI / AI Agents / MCP / SRE / Incident Response / DevOps / Beginners

You know the feeling. You finally sat down to eat, or just closed your laptop for the night, and your phone lights up. P1. Prod is down.

If you’re new to on-call, that first incident is rough. Slack has twelve people typing. Your dashboards are a wall of red. You’re flipping between logs, recent deploys, and that PR a coworker merged twenty minutes ago. Was it the deploy? The cache? CI/CD? Half the code in your stack was written by someone who left two years ago. The other half was written by an AI nobody fully audited. Welcome to modern on-call.

This is the gap a new category of tooling, AI SRE, is trying to close. I walked through one such tool, Rootly, and here’s the honest beginner’s take.

Why “AI SRE” became a thing in 2026

Three forces converged. First, microservice sprawl means no single human can hold the whole architecture in their head anymore. Second, the people who wrote half the code have moved on, taking their context with them. Third, LLMs got good enough at code reasoning that they can plausibly assemble that lost context from logs, runbooks, and past postmortems.

So “AI SRE” is not a buzzword for “we put a chatbot in our incident channel.” It’s tooling that ingests your full incident surface area — runbooks, past postmortems, Slack incident channels, IaC, application code — and helps you reason about a failure faster than you could alone.

The AI part is doing two specific jobs:

Indexing your tribal knowledge. Every postmortem you’ve ever written and every Slack thread where someone said “oh, this is just X again” becomes reference material your AI SRE keeps indexed and searchable.
Correlating live signals when an alert fires. Logs, metrics, recent commits, recent deploys. It pulls those together and produces a hypothesis with citations: “This looks similar to incident #438 from last March, and a deploy went out 12 minutes ago that touched the same code path.”

You still decide what to do. You just skip the thirty minutes of frantic tab-switching that usually precedes any real thinking.
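To make the correlation step concrete, here is a minimal Python sketch of the idea. It is my own illustration, not Rootly's engine or API: the `Deploy` and `PastIncident` types, the `build_hypothesis` function, the 30-minute window, and the sample data are all assumptions. It matches an alert against recent deploys and past incidents that touched the same code paths, and returns each match as a line of evidence.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    sha: str
    service: str
    deployed_at: datetime
    paths: frozenset  # code paths the deploy touched


@dataclass
class PastIncident:
    number: int
    summary: str
    paths: frozenset  # code paths implicated in the postmortem


def build_hypothesis(alert_time, alert_paths, deploys, past_incidents,
                     window=timedelta(minutes=30)):
    """Return evidence lines linking an alert to recent deploys and past incidents."""
    evidence = []
    for d in deploys:
        age = alert_time - d.deployed_at
        overlap = d.paths & alert_paths
        # Keep only deploys that shipped shortly before the alert
        # and touched at least one of the alerting code paths.
        if timedelta(0) <= age <= window and overlap:
            minutes = int(age.total_seconds() // 60)
            evidence.append(
                f"deploy {d.sha} went out {minutes} min before the alert "
                f"and touched {sorted(overlap)}"
            )
    for inc in past_incidents:
        if inc.paths & alert_paths:
            evidence.append(f"similar to incident #{inc.number}: {inc.summary}")
    return evidence


# Made-up sample data for illustration only.
alert_time = datetime(2026, 5, 12, 21, 0)
deploys = [
    Deploy("abc123", "auth", alert_time - timedelta(minutes=12),
           frozenset({"auth/session.py"})),
    Deploy("def456", "billing", alert_time - timedelta(hours=3),
           frozenset({"billing/invoice.py"})),
]
past = [PastIncident(438, "session cache stampede", frozenset({"auth/session.py"}))]

evidence = build_hypothesis(alert_time, frozenset({"auth/session.py"}), deploys, past)
for line in evidence:
    print(line)
```

A real product would rank evidence and use far richer similarity than set overlap, but the shape is the same: every claim in the hypothesis points back to something you can check.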

How Rootly handles trust: the four-level model

This is the part I cared about most. AI in production sounds great in a demo and terrifying when it’s actually firing on a real incident. Nobody wants an autonomous agent rolling back deploys because it pattern-matched on the wrong thing.

Rootly publishes an AI SRE Maturity Model with four explicit levels, and you adopt them progressively. The names in the docs are formal; here’s the practical version:

Level 1 — Observe. (Officially: “Read-Only Copilot.”) The AI watches and produces context packets with verifiable hypotheses linked to evidence. It does not execute anything. This is the trust-building stage — you can challenge every conclusion, and the AI shows its work.

Level 2 — Advise. Still part of Level 1 in the formal model, but worth calling out: the AI now suggests specific next steps with a reasoning trail. “Consider rolling back deploy abc123. Here’s why.” You still drive.

Level 3 — Approve. (Officially: “Assisted Actions With Approvals.”) The AI can initiate actions — rollbacks, config changes, scale-ups — but only through a workflow engine that enforces RBAC, audit logs, and verification gates. Nothing executes without a human approval click. This is where the time savings start to compound.

Level 4 — Autonomous. (Officially: “Guardrailed Autonomy for Narrow, Reversible Failure Modes.”) As Rootly puts it: “This is not ‘AI resolves incidents.’ It is ‘AI resolves a small number of repeatable incidents safely.’” Autonomous execution applies only to allow-listed, reversible runbooks with clear preconditions and stop conditions. Most teams reach this level only for a handful of specific failure modes after months of safe operation at Level 3.
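If it helps to see the levels as logic, here is a small Python sketch of the gate they describe. This is my own simplification and not how Rootly implements it: the `Level` enum, the `decide` function, and the allow-list contents are hypothetical, and it leaves out RBAC, audit logs, preconditions, and stop conditions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    OBSERVE = 1      # read-only context and hypotheses
    ADVISE = 2       # suggestions with a reasoning trail
    APPROVE = 3      # actions run only after a human approves
    AUTONOMOUS = 4   # allow-listed, reversible runbooks only


@dataclass(frozen=True)
class Action:
    runbook: str
    reversible: bool


# Hypothetical allow-list of runbooks trusted for Level 4.
ALLOW_LISTED = {"restart-stuck-worker"}


def decide(level, action, human_approved=False):
    """Return 'suggest_only', 'await_approval', or 'execute' for a proposed action."""
    if level <= Level.ADVISE:
        return "suggest_only"        # Levels 1 and 2 never execute anything
    if human_approved:
        return "execute"             # an explicit human approval click
    if (level == Level.AUTONOMOUS and action.reversible
            and action.runbook in ALLOW_LISTED):
        return "execute"             # narrow, reversible, allow-listed
    return "await_approval"


rollback = Action("rollback-deploy", reversible=True)
restart = Action("restart-stuck-worker", reversible=True)

print(decide(Level.OBSERVE, rollback))                       # suggest_only
print(decide(Level.APPROVE, rollback))                       # await_approval
print(decide(Level.APPROVE, rollback, human_approved=True))  # execute
print(decide(Level.AUTONOMOUS, rollback))                    # await_approval
print(decide(Level.AUTONOMOUS, restart))                     # execute
```

Notice that even at Level 4 the rollback still waits for a human, because it is not on the allow-list. Autonomy is granted per runbook, not per team.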

The philosophy underneath is what Rootly calls “human-on-the-loop”: an engineer always has the final say on critical actions, and accountability stays human. That’s the right framing. It’s also what separates serious AI SRE products from demoware.

What stood out beyond the trust model

The post-incident timeline writes itself. After an incident closes, Rootly assembles the full sequence — when the alert fired, who joined the war room, what commands were run, what was tried, what fixed it. Postmortems usually take me an hour minimum. This drops it to “review and edit.”
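Under the hood, a timeline like this is mostly a merge of timestamped events from several feeds. Here is a minimal Python sketch of that idea. The `build_timeline` function, the tuple format, and the sample events are my own assumptions for illustration, not Rootly's data model.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge (timestamp, source, text) events from several feeds into one ordered list."""
    merged = [event for source in sources for event in source]
    merged.sort(key=lambda event: event[0])  # chronological order
    return [f"{ts:%H:%M} [{src}] {text}" for ts, src, text in merged]


# Made-up events for illustration only.
alerts = [(datetime(2026, 5, 12, 21, 0), "alert", "P1: auth error rate above threshold")]
chat = [
    (datetime(2026, 5, 12, 21, 4), "slack", "on-call joined the incident channel"),
    (datetime(2026, 5, 12, 21, 19), "slack", "error rate back to normal"),
]
commands = [(datetime(2026, 5, 12, 21, 15), "shell", "rolled back deploy abc123")]

timeline = build_timeline(alerts, chat, commands)
print("\n".join(timeline))
```

The hard part in practice is collecting those events reliably while the incident is still burning, which is exactly the part you want a tool to do for you.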

Citations on every recommendation. When the AI suggests a fix, it links to the past incident, runbook, or code line that informed the decision. You can challenge it. AI tools that refuse to show their work make me nervous; this one doesn’t.

Bring your own AI API key. You plug in your own LLM credentials — OpenAI, Anthropic, or a self-hosted model. Your incident data is scoped to your account. Rootly’s docs are explicit: “PII is automatically scrubbed and never used for training. Opt out anytime.” For anyone in a regulated industry, that’s the difference between “evaluate” and “can’t even start the conversation.”

The Rootly MCP server. This is the 2026-flavored part I didn’t expect. Rootly ships an MCP (Model Context Protocol) server that plugs your incident context directly into MCP clients like Cursor, Windsurf, and Claude Desktop. Mid-incident, you can ask your editor “what was the last change to the auth service?” and the answer comes from Rootly’s correlation engine, with no copy-pasting between five windows. If you’ve been following the MCP wave, this is the kind of practical integration the protocol was designed for.
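If you have never wired up an MCP server, the client side is usually a small JSON file. The Python sketch below prints one in the `mcpServers` shape that clients such as Claude Desktop and Cursor read. The launch command and the environment variable name are deliberate placeholders, not Rootly's documented values, so take the real ones from Rootly's MCP docs.

```python
import json

# Placeholders, not Rootly's documented values: look up the real launch
# command, arguments, and token variable in the Rootly MCP server docs.
config = {
    "mcpServers": {
        "rootly": {
            "command": "<command-from-rootly-docs>",
            "args": [],
            "env": {"<ROOTLY_TOKEN_ENV_VAR>": "<your-api-token>"},
        }
    }
}

print(json.dumps(config, indent=2))
```

Keep the token out of version control. Treat this file the same way you would treat any other credential.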

The customer list is heavy on companies you’d recognize: NVIDIA, Canva, Webflow, Figma, LinkedIn, Replit, Grammarly, Mistral AI, Brex, DoorDash, SoFi. That’s not a side-project tool; it’s running in production at scale.

Will it replace me?

The honest answer: no, but it changes the job.

Junior on-call shifts today are mostly “navigate to the right dashboard, identify the obvious thing, page a senior if it’s not obvious.” Levels 1 and 2 of the trust model handle the navigation and the obvious things. What’s left for the human is the actual judgment: “is this the right rollback target,” “do we communicate to customers now,” “is this incident a symptom of a deeper architectural problem.”

That’s a better job. Less tab-juggling, more thinking. The path from junior to senior gets shorter because you spend rotation time on real decisions instead of pattern-matching against dashboards.

Who this is for

If you’re a solo SRE drowning in alerts, this is for you. If you’re a team lead and your engineers are burning out on rotation, this is for you. If you’re early in your career and want to see what production on-call actually looks like, walking through Rootly’s setup is a decent way to learn what the work involves before your first real pager.

The promise isn’t “AI replaces your on-call.” It’s “AI handles the busywork so the thinking part is faster.” Which, after a year of false-positive alerts and broken pager rotations, is what most of us actually need.

If on-call is starting to hurt, Rootly is worth a real evaluation. Start at Level 1 — observe-only — and see what it catches that you’d miss. You don’t promote to Level 3 until you trust the Level 1 output. That’s the right shape for adopting any AI tool in production.

Tatiana Mikhaleva

Docker Captain · IBM Champion · AWS Community Builder

DevOps.Pink — cloud-native education for the agentic-AI era.
