Production anomaly → auto-ticket → agent investigation
The production anomaly to auto-ticket to agent investigation pipeline automates the first phase of incident response.
- Production anomaly detection auto-creates tickets and triggers agent investigation
- Self-healing for known patterns: agent detects known error pattern, applies known fix, deploys, and verifies
- Infrastructure recommends code changes based on production data (Vercel SDI model)
- Auto-created tickets include full context (traces, logs, affected users, similar past incidents)
- Self-healing success rate is tracked (% of auto-fixes that resolve the issue without human intervention)
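The success-rate metric in the last bullet can be computed from self-healing event records. The following is a minimal sketch, not a prescribed implementation: the `SelfHealingEvent` record, its field names, and the pattern identifiers are hypothetical, and it assumes an auto-fix counts as successful only when verification passed and no human had to step in.

```python
from dataclasses import dataclass


@dataclass
class SelfHealingEvent:
    pattern: str            # hypothetical identifier of the known error pattern
    verified: bool          # post-deploy verification passed
    human_intervened: bool  # a person had to step in after the auto-fix


def self_healing_success_rate(events):
    """Percentage of auto-fixes that resolved the issue without human intervention."""
    if not events:
        return None  # no auto-fixes attempted, so the rate is undefined
    resolved = sum(1 for e in events if e.verified and not e.human_intervened)
    return 100.0 * resolved / len(events)


# Illustrative data only.
events = [
    SelfHealingEvent("oom-restart", verified=True, human_intervened=False),
    SelfHealingEvent("oom-restart", verified=True, human_intervened=True),
    SelfHealingEvent("stale-cache", verified=False, human_intervened=True),
    SelfHealingEvent("stale-cache", verified=True, human_intervened=False),
]
rate = self_healing_success_rate(events)
```

Returning `None` for an empty window avoids reporting a misleading 0% or 100% when no auto-fix was attempted.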
Evidence
- Auto-ticket creation logs triggered by production anomalies
- Self-healing event logs showing detection, fix, deploy, and verification steps
- Infrastructure recommendation pipeline configuration (production data to code change suggestions)
What It Is
The production anomaly to auto-ticket to agent investigation pipeline automates the first phase of incident response. When a production anomaly is detected - a metric crossing a threshold, an error rate spike, an SLO burn rate anomaly - the system automatically creates a structured incident ticket and dispatches an AI agent to investigate. The agent queries the observability stack, retrieves historical context, analyzes the current production signals, and produces a preliminary root cause hypothesis and set of recommended actions - all before any human reviews the page.
This is the first level at which AI agents become primary participants in incident response rather than supporting tools. The human on-call is no longer the first investigator; the agent is. By the time the on-call engineer opens the page (typically 5-15 minutes after an automated investigation starts), the agent has already: queried metrics for the affected service and neighboring services, retrieved distributed traces from the incident window, found the most similar past incidents, executed the first steps of the runbook, and produced a structured preliminary analysis. The human's job is to review the agent's work and decide what action to take, not to start the investigation from scratch.
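The investigation sequence described above (metrics, traces, similar incidents, runbook) could be orchestrated roughly as follows. This is a sketch under stated assumptions: the `tools` mapping, the step names, and the `Investigation` record are hypothetical stand-ins for whatever observability, incident-history, and runbook integrations a team actually has.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Investigation:
    service: str
    steps: List[str] = field(default_factory=list)          # ordered log of what the agent did
    findings: Dict[str, Any] = field(default_factory=dict)  # raw results keyed by step name


# Hypothetical step names, in the order the text describes them.
STEP_ORDER = ("metrics", "traces", "similar_incidents", "runbook")


def investigate(service: str, tools: Dict[str, Callable[[str], Any]]) -> Investigation:
    """Run each investigation step in a fixed order and record the outcome."""
    inv = Investigation(service)
    for step in STEP_ORDER:
        tool = tools.get(step)
        if tool is None:
            inv.steps.append(f"{step}: skipped (no tool configured)")
            continue
        try:
            inv.findings[step] = tool(service)
            inv.steps.append(f"{step}: ok")
        except Exception as exc:  # one failed query should not abort the investigation
            inv.steps.append(f"{step}: failed ({exc})")
    return inv


# Usage with stub tools; real tools would query the observability stack.
stub_tools = {
    "metrics": lambda svc: {"error_rate": 0.2},
    "similar_incidents": lambda svc: ["INC-EXAMPLE-1"],
}
result = investigate("checkout", stub_tools)
```

Recording skipped and failed steps alongside successful ones gives the reviewing engineer a complete picture of what the agent did and did not check.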
The auto-ticket creation is an important architectural detail. The ticket is not just a notification - it is a structured context container that accumulates investigation data over time. The ticket contains: the anomaly detection event with the triggering metric and its deviation from baseline, the agent's preliminary analysis with links to the relevant traces and log queries, the list of investigation steps the agent executed, and a structured field for the root cause when determined. This ticket structure means that when the human engages, they have a full investigation history in one place. The ticket also becomes part of the incident history that future agents will query for similar incidents.
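One way to picture the ticket as a structured context container is a small record type. The sketch below is illustrative only: the class names, field names, and the example link are assumptions, not the schema of any particular ticketing system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnomalyEvent:
    metric: str      # the triggering metric
    observed: float  # value when the anomaly fired
    baseline: float  # expected value

    @property
    def deviation(self) -> float:
        """Deviation of the triggering metric from its baseline."""
        return self.observed - self.baseline


@dataclass
class IncidentTicket:
    anomaly: AnomalyEvent
    preliminary_analysis: str = ""
    evidence_links: List[str] = field(default_factory=list)       # traces and log queries
    investigation_steps: List[str] = field(default_factory=list)  # what the agent executed
    root_cause: Optional[str] = None                               # filled in once determined

    def add_step(self, description: str, link: Optional[str] = None) -> None:
        """Append an investigation step, plus its supporting link if there is one."""
        self.investigation_steps.append(description)
        if link:
            self.evidence_links.append(link)


# Illustrative values only.
ticket = IncidentTicket(AnomalyEvent("http_5xx_percent", observed=5.0, baseline=1.0))
ticket.add_step("Queried error rate for checkout and neighboring services")
ticket.add_step("Pulled traces from the incident window", link="https://example.invalid/traces/abc")
```

Keeping `root_cause` empty until it is confirmed separates the agent's preliminary hypothesis from the final determination.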
The anomaly detection layer is the foundation of this pipeline. At L4, anomaly detection goes beyond simple threshold alerts. Statistical anomaly detection - using tools like Facebook Prophet, AWS CloudWatch Anomaly Detection, or custom Prometheus alerting rules with predict_linear and dynamic thresholds - identifies anomalies that fixed-threshold alerts would miss: a gradual drift that stays within normal bounds but shows an abnormal trend, or a metric that is within its absolute bounds but matches a pattern that has historically preceded failures. The richer the anomaly detection, the broader the set of conditions that trigger agent investigation.
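To illustrate the gradual-drift case, the sketch below fits a least-squares line to recent samples and checks whether the extrapolated value would cross a limit, which is similar in spirit to Prometheus's predict_linear. It is a simplified stand-in rather than a reimplementation of any of the tools named above; the function names, the sample data, and the limit are assumptions.

```python
def fit_line(points):
    """Ordinary least-squares fit over (timestamp, value) pairs; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    return slope, mean_v - slope * mean_t


def forecast_breach(points, limit, horizon_s):
    """Return (breach_expected, predicted_value) for horizon_s seconds after the last sample.

    A breach is flagged only when the current value is still under the limit,
    which is exactly the case a fixed-threshold alert would not catch.
    """
    slope, intercept = fit_line(points)
    last_t, current = points[-1]
    predicted = slope * (last_t + horizon_s) + intercept
    return (current < limit and predicted >= limit), predicted


# Disk usage (%) sampled every 60 s: still well under a 90% limit, but climbing.
samples = [(0, 60.0), (60, 62.0), (120, 64.0), (180, 66.0), (240, 68.0)]
breach, predicted = forecast_breach(samples, limit=90.0, horizon_s=3600)
```

In this example every sample is below the limit, so a fixed threshold stays silent while the trend check flags the series for investigation.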
Why It Matters
Automated anomaly-to-agent investigation fundamentally changes the economics and speed of incident response:
- Mean time to investigation can drop from as much as 15 minutes to under 2 minutes - agents start investigating the moment the anomaly fires, without waiting for a human to acknowledge the page and open their laptop
- Incident response quality becomes consistent - agent investigation follows the runbook every time, without the variability of humans who are tired, distracted, or unfamiliar with the affected service
- On-call burden decreases - when agents handle the investigation phase, the on-call engineer's job shifts to reviewing and approving agent recommendations rather than performing the investigation; this significantly reduces on-call cognitive load
- Small anomalies that humans would deprioritize are investigated - a P3 alert that would sit in the queue for hours receives immediate agent investigation; some P3 alerts turn out to be early signals of P1 incidents
- Every investigation creates structured knowledge - agent investigations written into tickets are queryable historical data; the next similar incident has the previous investigation to learn from
How Different Roles See It
Bob's on-call engineers are burning out. The team's services are mature and incidents are infrequent, but each incident requires deep investigation that can take hours and happens at unpredictable times. He wants to reduce the on-call burden without reducing service quality.
Sarah tracks on-call sustainability as a developer experience metric. She knows that unsustainable on-call rotations cause senior engineers to leave and make hiring harder. The agent investigation pipeline is a direct intervention on this problem.
Victor is building the investigation agent pipeline. He has the observability stack (metrics, traces, logs via MCP), the incident history MCP, and the runbook MCP. He is now designing the agent's investigation protocol and the ticket writing format.