AI SRE Joined My On-Call — A Beginner-Friendly Walkthrough of Rootly
By Tatiana Mikhaleva · Founder & Senior Developer Advocate
You know the feeling. You finally sat down to eat, or just closed your laptop for the night, and your phone lights up. P1. Prod is down.
If you’re new to on-call, that first incident is rough. Slack has twelve people typing. Your dashboards are a wall of red. You’re flipping between logs, recent deploys, and that PR a coworker merged twenty minutes ago. Was it the deploy? The cache? CI/CD? Half the code in your stack was written by someone who left two years ago. The other half was written by an AI nobody fully audited. Welcome to modern on-call.
This is the gap a new category of tooling is trying to close — AI SRE. I walked through one of them, Rootly, and here’s the honest beginner’s take.
Why “AI SRE” became a thing in 2026
Three forces converged. First, microservice sprawl means no single human can hold the whole architecture in their head anymore. Second, the people who wrote half the code have moved on, taking their context with them. Third, LLMs got good enough at code reasoning that they can plausibly assemble that lost context from logs, runbooks, and past postmortems.
So “AI SRE” is not a buzzword for “we put a chatbot in our incident channel.” It’s tooling that ingests your full incident surface area — runbooks, past postmortems, Slack incident channels, IaC, application code — and helps you reason about a failure faster than you could alone.
The AI part is doing two specific jobs:
- Indexing your tribal knowledge. Every postmortem you’ve ever written, every Slack thread where someone said “oh, this is just X again” — that’s training material your AI SRE keeps loaded and searchable.
- Correlating live signals when an alert fires. Logs, metrics, recent commits, recent deploys. It pulls those together and produces a hypothesis with citations: “This looks similar to incident #438 from last March, and a deploy went out 12 minutes ago that touched the same code path.”
You still decide what to do. You just skip the thirty minutes of frantic tab-switching that usually precedes any real thinking.
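To make the correlation step concrete, here's a toy sketch of the simplest version of that idea. This is not Rootly's implementation — the data model and the thirty-minute window are my own assumptions — it just shows the "recent deploy on the same code path" signal in isolation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deploy:
    sha: str
    service: str
    deployed_at: datetime

def correlate(alert_service: str, alert_time: datetime,
              deploys: list[Deploy],
              window: timedelta = timedelta(minutes=30)) -> list[Deploy]:
    """Return recent deploys that touched the alerting service, newest first.

    A real AI SRE also folds in logs, metrics, runbooks, and past
    incidents; this sketch covers only one signal.
    """
    suspects = [
        d for d in deploys
        if d.service == alert_service
        and timedelta(0) <= alert_time - d.deployed_at <= window
    ]
    return sorted(suspects, key=lambda d: d.deployed_at, reverse=True)
```

Feed it an alert on `auth` and it surfaces the deploy that went out 12 minutes ago while ignoring other services and stale deploys — which is exactly the shape of the hypothesis quoted above, minus the citations.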
How Rootly handles trust: the four-level model
This is the part I cared about most. AI in production sounds great in a demo and terrifying when it’s actually firing on a real incident. Nobody wants an autonomous agent rolling back deploys because it pattern-matched on the wrong thing.
Rootly publishes an AI SRE Maturity Model with four explicit levels, and you adopt them progressively. The names in the docs are formal; here’s the practical version:
Level 1 — Observe. (Officially: “Read-Only Copilot.”) The AI watches and produces context packets with verifiable hypotheses linked to evidence. It does not execute anything. This is the trust-building stage — you can challenge every conclusion, and the AI shows its work.
Level 2 — Advise. Still part of Level 1 in the formal model, but worth calling out: the AI now suggests specific next steps with a reasoning trail. “Consider rolling back deploy abc123. Here’s why.” You still drive.
Level 3 — Approve. (Officially: “Assisted Actions With Approvals.”) The AI can initiate actions — rollbacks, config changes, scale-ups — but only through a workflow engine that enforces RBAC, audit logs, and verification gates. Nothing executes without a human approval click. This is where the time savings start to compound.
Level 4 — Autonomous. (Officially: “Guardrailed Autonomy for Narrow, Reversible Failure Modes.”) As Rootly puts it: “This is not ‘AI resolves incidents.’ It is ‘AI resolves a small number of repeatable incidents safely.’” Autonomous execution applies only to allow-listed, reversible runbooks with clear preconditions and stop conditions. Most teams reach this level only for a handful of specific failure modes after months of safe operation at Level 3.
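The Level 3/Level 4 distinction boils down to a policy gate, and it's worth seeing how small that gate is. This is an illustrative toy, not Rootly's API — the runbook names, the allow-list, and the approval flag are all made up:

```python
# Toy policy gate illustrating Levels 3 and 4 (hypothetical, not Rootly's API).
# Level 4: only allow-listed, reversible runbooks run without a human.
# Level 3: everything else waits for an explicit human approval.
AUTONOMOUS_RUNBOOKS = {"restart-stuck-worker", "clear-bad-cache-key"}

def may_execute(runbook: str, reversible: bool, human_approved: bool) -> bool:
    if runbook in AUTONOMOUS_RUNBOOKS and reversible:
        return True          # Level 4: guardrailed autonomy, narrow scope
    return human_approved    # Level 3: nothing runs without the approval click
```

Note what's not in the allow-list: anything like `rollback-prod-db`. However confident the model is, that action falls through to the Level 3 branch and blocks on a human.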
The philosophy underneath is what Rootly calls “human-on-the-loop” — an engineer always has the final say on critical actions, and accountability stays human. That’s the right framing. It’s also what separates serious AI SRE products from the demoware.
What stood out beyond the trust model
The post-incident timeline writes itself. After an incident closes, Rootly assembles the full sequence — when the alert fired, who joined the war room, what commands were run, what was tried, what fixed it. Postmortems usually take me an hour minimum. This drops it to “review and edit.”
Citations on every recommendation. When the AI suggests a fix, it links to the past incident, runbook, or code line that informed the decision. You can challenge it. AI tools that refuse to show their work make me nervous; this one doesn’t.
Bring your own AI API key. You plug in your own LLM credentials — OpenAI, Anthropic, or a self-hosted model. Your incident data is scoped to your account. Rootly’s docs are explicit: “PII is automatically scrubbed and never used for training. Opt out anytime.” For anyone in a regulated industry, that’s the difference between “evaluate” and “can’t even start the conversation.”
The Rootly MCP server. This is the 2026-flavored part I didn’t expect. Rootly ships an MCP (Model Context Protocol) server that plugs your incident context directly into IDEs like Cursor, Windsurf, and Claude Desktop. Mid-incident, you can ask your editor “what was the last change to the auth service?” and the answer comes from Rootly’s correlation engine — without copy-pasting between five windows. If you’ve been following the MCP wave, this is the kind of practical integration the protocol was designed for.
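For the curious: MCP is JSON-RPC 2.0 under the hood — the editor sends a `tools/call` request and the server answers with context. Here's a sketch of what that request looks like on the wire. The method name comes from the MCP spec, but the tool name and arguments are hypothetical, not Rootly's actual tool schema:

```python
import json

def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request (JSON-RPC 2.0),
    the way an editor like Cursor would send it to an MCP server."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# e.g. asking an incident-context server about recent changes to a service
# ("list_recent_changes" is a made-up tool name for illustration)
req = mcp_tool_call(1, "list_recent_changes", {"service": "auth", "limit": 5})
```

In practice you never write this by hand — the editor's MCP client does it for you — but it demystifies what "plugs your incident context into the IDE" actually means: structured tool calls, not screen-scraping.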
The customer list is heavy on companies you’d recognize — NVIDIA, Canva, Webflow, Figma, LinkedIn, Replit, Grammarly, Mistral AI, Brex, DoorDash, SoFi. That’s not a side-project tool; it’s running production at scale.
Will it replace me?
The honest answer: no, but it changes the job.
Junior on-call shifts today are mostly “navigate to the right dashboard, identify the obvious thing, page a senior if it’s not obvious.” Levels 1 through 3 of the trust model handle the navigation and the obvious fixes. What’s left for the human is the actual judgment — “is this the right rollback target,” “do we communicate to customers now,” “is this incident a symptom of a deeper architectural problem.”
That’s a better job. Less tab-juggling, more thinking. The path from junior to senior gets shorter because you spend rotation time on real decisions instead of pattern-matching against dashboards.
Who this is for
If you’re a solo SRE drowning in alerts, this is for you. If you’re a team lead and your engineers are burning out on rotation, this is for you. If you’re early in your career and want to see what production on-call actually looks like, walking through Rootly’s setup is a decent way to learn what the work involves before your first real pager.
The promise isn’t “AI replaces your on-call.” It’s “AI handles the busywork so the thinking part is faster.” Which, after a year of false-positive alerts and broken pager rotations, is what most of us actually need.
If on-call is starting to hurt, Rootly is worth a real evaluation. Start at Level 1 — observe-only — and see what it catches that you’d miss. You don’t promote to Level 3 until you trust the Level 1 output. That’s the right shape for adopting any AI tool in production.