AI SRE Joined My On-Call — A Beginner-Friendly Walkthrough of Rootly
By Tatiana Mikhaleva · Developer Advocate · Docker Captain · IBM Champion
You know the feeling, darling. You finally sat down to eat, or just closed your laptop for the night. Then your phone lights up. P1. Prod is down.
If you’re new to on-call, that first incident is rough. Slack has twelve people typing. Your dashboards are a wall of red. And you’re flipping between logs, recent deploys, and that PR a coworker merged twenty minutes ago. Was it the deploy? The cache? CI/CD? Half the code in your stack was written by someone who left two years ago. The other half? Written by an AI nobody fully audited. Welcome to modern on-call, queen.
There’s a whole new category of tooling trying to close that gap, and it goes by AI SRE. So I walked through one of those tools, Rootly. Here’s the honest beginner’s take.
Why “AI SRE” became a thing in 2026
Three forces converged here. First: microservice sprawl. No single human can hold the whole architecture in their head anymore, full stop. Second, the people who actually wrote half the code have moved on, and they took all their context with them. Third, LLMs finally got good enough at code reasoning that they can plausibly rebuild that lost context from logs, runbooks, and old postmortems.
So no, “AI SRE” isn’t a fancy way of saying “we dropped a chatbot in our incident channel.” It’s tooling that swallows your whole incident surface area, runbooks and past postmortems and Slack incident channels and IaC and application code, and then helps you reason about a failure faster than you ever could on your own.
The AI is really doing two jobs here.
- Indexing your tribal knowledge. Every postmortem you’ve ever written. Every Slack thread where someone went “oh, this is just X again.” That’s all reference material your AI SRE keeps indexed and searchable, sis.
- Correlating live signals when an alert fires. Logs, metrics, recent commits, recent deploys. It yanks those together and hands you a hypothesis with citations: “This looks similar to incident #438 from last March, and a deploy went out 12 minutes ago that touched the same code path.”
You still decide what to do. You just skip the thirty minutes of frantic tab-switching that usually comes before any real thinking starts.
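If you want to picture the correlation job in code, here’s a minimal Python sketch. To be clear, this is not Rootly’s engine or its API. The `Deploy` and `PastIncident` types, the `build_hypothesis` function, the 30-minute window, and the sample data are all made up for illustration, and “similar” here just means “same service,” which is much cruder than what a real product would do.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    sha: str
    service: str
    deployed_at: datetime


@dataclass
class PastIncident:
    incident_id: int
    service: str
    summary: str


def build_hypothesis(alert_service, alert_time, deploys, past_incidents,
                     window_minutes=30):
    """Gather evidence for one alert: recent deploys and past incidents
    on the same service. The window size is an arbitrary assumption."""
    window_start = alert_time - timedelta(minutes=window_minutes)
    recent = [
        d for d in deploys
        if d.service == alert_service
        and window_start <= d.deployed_at <= alert_time
    ]
    similar = [i for i in past_incidents if i.service == alert_service]
    # Every claim in the hypothesis carries a pointer back to its evidence.
    citations = ([f"deploy {d.sha}" for d in recent]
                 + [f"incident #{i.incident_id}" for i in similar])
    return {
        "service": alert_service,
        "recent_deploys": recent,
        "similar_incidents": similar,
        "citations": citations,
    }


# Hypothetical sample data, loosely following the example in the text.
alert_time = datetime(2026, 3, 1, 3, 0)
deploys = [
    Deploy("abc123", "auth", alert_time - timedelta(minutes=12)),
    Deploy("def456", "billing", alert_time - timedelta(minutes=5)),
    Deploy("old999", "auth", alert_time - timedelta(hours=3)),
]
incidents = [PastIncident(438, "auth", "made-up summary for the demo")]

hypothesis = build_hypothesis("auth", alert_time, deploys, incidents)
print(hypothesis["citations"])  # ['deploy abc123', 'incident #438']
```

Notice the function returns evidence, not a verdict. The billing deploy and the three-hour-old auth deploy get filtered out, and what’s left is a short list you can click through and argue with.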
How Rootly handles trust: the four-level model
Okay, this is the part I actually cared about. AI in production sounds dreamy in a demo. It sounds terrifying when it’s firing on a real incident at 3am. Because nobody, and I mean nobody, wants an autonomous agent rolling back deploys because it pattern-matched on the wrong thing.
Rootly publishes an AI SRE Maturity Model with four explicit levels, and you adopt them one at a time. The names in the docs are pretty formal. Here’s the practical version, code cuties.
Level 1 — Observe. (Officially: “Read-Only Copilot.”) The AI watches and produces context packets with verifiable hypotheses linked to evidence. It does not execute anything. This is the trust-building stage. You can challenge every conclusion, and the AI shows its work.
Level 2 — Advise. Still technically part of Level 1 in the formal model, but worth calling out on its own. Now the AI suggests specific next steps and brings a reasoning trail along: “Consider rolling back deploy abc123. Here’s why.” You still drive.
Level 3 — Approve. (Officially: “Assisted Actions With Approvals.”) Now the AI can initiate actions like rollbacks, config changes, and scale-ups, but only through a workflow engine that enforces RBAC, audit logs, and verification gates. Nothing executes without a human approval click. This is where the time savings really start to stack up.
Level 4 — Autonomous. (Officially: “Guardrailed Autonomy for Narrow, Reversible Failure Modes.”) As Rootly puts it: “This is not ‘AI resolves incidents.’ It is ‘AI resolves a small number of repeatable incidents safely.’” Autonomous execution applies only to allow-listed, reversible runbooks with clear preconditions and stop conditions. Most teams get here only for a handful of specific failure modes, and only after months of safe operation at Level 3.
The philosophy holding all of this up is what Rootly calls “human-on-the-loop.” An engineer always has the final say on critical actions. Accountability stays human. That’s the right framing, and honestly it’s what separates the serious AI SRE products from the demoware.
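Here’s how I picture those four levels as a single gate, in a small Python sketch. This is my own illustration of the rules described above, not Rootly’s workflow engine. The `TrustLevel` enum, the `ProposedAction` type, the `decide` function, and the runbook names are all hypothetical, and I’ve left out the real-world parts like RBAC, audit logs, preconditions, and stop conditions.

```python
from dataclasses import dataclass
from enum import IntEnum


class TrustLevel(IntEnum):
    OBSERVE = 1
    ADVISE = 2
    APPROVE = 3
    AUTONOMOUS = 4


@dataclass
class ProposedAction:
    runbook: str
    reversible: bool


def decide(level, action, allow_list, human_approved=False):
    """Return what happens to a proposed action at a given trust level."""
    if level == TrustLevel.OBSERVE:
        return "context_only"      # hypotheses and evidence, nothing else
    if level == TrustLevel.ADVISE:
        return "suggest_only"      # next steps with reasoning, human drives
    if human_approved:
        return "execute"           # levels 3 and 4: an approval click runs it
    if (level == TrustLevel.AUTONOMOUS
            and action.reversible
            and action.runbook in allow_list):
        return "execute"           # narrow, reversible, allow-listed only
    return "await_approval"


# Hypothetical runbooks for the demo.
allow_list = {"rollback_last_deploy"}
rollback = ProposedAction("rollback_last_deploy", reversible=True)
migration = ProposedAction("migrate_schema", reversible=False)

print(decide(TrustLevel.APPROVE, rollback, allow_list))     # await_approval
print(decide(TrustLevel.AUTONOMOUS, rollback, allow_list))  # execute
print(decide(TrustLevel.AUTONOMOUS, migration, allow_list)) # await_approval
```

The thing to notice is the last line of the function. Anything that isn’t explicitly cleared falls back to waiting on a human, even at Level 4.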
What stood out beyond the trust model
The post-incident timeline writes itself. Once an incident closes, Rootly assembles the full sequence: when the alert fired, who joined the war room, what commands got run, what was tried, and what actually fixed it. My postmortems usually eat an hour, minimum. This drops the whole thing to “review and edit.” Bless.
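Under the hood, the core idea is simple: take events from several places and put them in one chronological list. Here’s a rough Python sketch of that idea. It says nothing about how Rootly actually builds its timeline. The `Event` type, the `build_timeline` and `render` functions, the source labels, and the sample events are all invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str
    text: str


def build_timeline(*streams):
    """Merge event lists from several sources into one chronological list."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda e: e.at)


def render(timeline):
    """Format each event as 'HH:MM [source] text' for a postmortem draft."""
    return [f"{e.at:%H:%M} [{e.source}] {e.text}" for e in timeline]


# Hypothetical events from three sources.
alerts = [Event(datetime(2026, 3, 1, 3, 0), "alert", "P1 fired on auth")]
chat = [
    Event(datetime(2026, 3, 1, 3, 4), "slack", "on-call joined the war room"),
    Event(datetime(2026, 3, 1, 3, 20), "slack", "error rate back to normal"),
]
actions = [Event(datetime(2026, 3, 1, 3, 15), "action", "rolled back deploy abc123")]

lines = render(build_timeline(chat, actions, alerts))
for line in lines:
    print(line)
```

The order you pass the sources in doesn’t matter, because everything gets sorted by timestamp at the end. The hard part in real life is collecting the events, not ordering them.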
Citations on every recommendation. When the AI suggests a fix, it links straight to the past incident, runbook, or code line that informed the call. You can push back. AI tools that refuse to show their work make me nervous, and this one doesn’t.
Bring your own AI API key. You plug in your own LLM credentials, whether that’s OpenAI, Anthropic, or a self-hosted model. Your incident data stays scoped to your account. And Rootly’s docs are blunt about it: “PII is automatically scrubbed and never used for training. Opt out anytime.” For anyone in a regulated industry, that one line is the difference between “let’s evaluate this” and “we can’t even start the conversation.”
The Rootly MCP server. Now this is the very 2026 part I didn’t see coming. Rootly ships an MCP (Model Context Protocol) server that wires your incident context straight into IDEs like Cursor, Windsurf, and Claude Desktop. Mid-incident, you can ask your editor “what was the last change to the auth service?” and the answer comes back from Rootly’s correlation engine, no copy-pasting between five windows. If you’ve been watching the MCP wave, this is exactly the kind of practical integration the protocol was built for.
The customer list leans heavy on names you’d know on sight: NVIDIA, Canva, Webflow, Figma, LinkedIn, Replit, Grammarly, Mistral AI, Brex, DoorDash, SoFi. That’s not some weekend side-project tool. That’s running production at scale.
Will it replace me?
Honest answer? No. But it does change the job.
Think about what a junior on-call shift looks like today. Mostly it’s “navigate to the right dashboard, spot the obvious thing, page a senior if it’s not obvious.” Levels 1 and 2 of the trust model swallow the navigation and the obvious stuff. What’s left for the human is the part that’s actually judgment: “is this the right rollback target,” “do we communicate to customers now,” “is this incident a symptom of a deeper architectural problem.”
And that’s a better job, sis. Less tab-juggling, more thinking. The road from junior to senior gets shorter, because you spend your rotation on real decisions instead of pattern-matching against dashboards all night.
Who this is for
Solo SRE drowning in alerts? This is for you. Team lead watching your engineers burn out on rotation? Also for you. And if you’re early in your career and just want to see what production on-call actually looks like, walking through Rootly’s setup is a genuinely decent way to learn what the work feels like before your first real pager.
The promise here isn’t “AI replaces your on-call.” It’s “AI handles the busywork so the thinking part goes faster.” Which, after a year of false-positive alerts and broken pager rotations, is what most of us actually need.
So if on-call is starting to hurt, Rootly is worth a real evaluation. Start at Level 1, observe-only, and see what it catches that you’d have missed. You don’t promote to Level 3 until you trust the Level 1 output. That’s the right shape for adopting any AI tool in production, queen.