Production anomaly → auto-ticket → agent investigation

The production anomaly to auto-ticket to agent investigation pipeline automates the first phase of incident response.

·Production anomaly detection auto-creates tickets and triggers agent investigation
·Self-healing for known patterns: agent detects known error pattern, applies known fix, deploys, and verifies
·Infrastructure recommends code changes based on production data (Vercel SDI model)

·Auto-created tickets include full context (traces, logs, affected users, similar past incidents)
·Self-healing success rate is tracked (% of auto-fixes that resolve the issue without human intervention)

Evidence

·Auto-ticket creation logs triggered by production anomalies
·Self-healing event logs showing detection, fix, deploy, and verification steps
·Infrastructure recommendation pipeline configuration (production data to code change suggestions)

What It Is

The production anomaly to auto-ticket to agent investigation pipeline automates the first phase of incident response. When a production anomaly is detected - a metric crossing a threshold, an error rate spike, an SLO burn rate anomaly - the system automatically creates a structured incident ticket and dispatches an AI agent to investigate. The agent queries the observability stack, retrieves historical context, analyzes the current production signals, and produces a preliminary root cause hypothesis and set of recommended actions - all before any human reviews the page.

This is the first level at which AI agents become primary participants in incident response rather than supporting tools. The human on-call is no longer the first investigator; the agent is. By the time the on-call engineer opens the page (typically 5-15 minutes after an automated investigation starts), the agent has already: queried metrics for the affected service and neighboring services, retrieved distributed traces from the incident window, found the most similar past incidents, executed the first steps of the runbook, and produced a structured preliminary analysis. The human's job is to review the agent's work and decide what action to take, not to start the investigation from scratch.

The auto-ticket creation is an important architectural detail. The ticket is not just a notification - it is a structured context container that accumulates investigation data over time. The ticket contains: the anomaly detection event with the triggering metric and its deviation from baseline, the agent's preliminary analysis with links to the relevant traces and log queries, the list of investigation steps the agent executed, and a structured field for the root cause when determined. This ticket structure means that when the human engages, they have a full investigation history in one place. The ticket also becomes part of the incident history that future agents will query for similar incidents.
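
One way to picture the ticket as a context container is a small data structure that accumulates steps and evidence over time. The sketch below is illustrative only: the class names, field names, and the `INC-0001` ID are assumptions, not the schema of any particular ticketing system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InvestigationStep:
    tool: str      # which data source was queried, e.g. "metrics" or "traces"
    query: str     # the query the agent ran
    finding: str   # one-line summary of what came back


@dataclass
class IncidentTicket:
    ticket_id: str
    anomaly_event: dict                      # triggering metric and its deviation from baseline
    preliminary_analysis: str = ""
    evidence_links: List[str] = field(default_factory=list)
    steps: List[InvestigationStep] = field(default_factory=list)
    root_cause: Optional[str] = None         # stays empty until a root cause is determined

    def record_step(self, tool: str, query: str, finding: str,
                    link: Optional[str] = None) -> None:
        """Append one investigation step and, if given, its evidence link."""
        self.steps.append(InvestigationStep(tool, query, finding))
        if link:
            self.evidence_links.append(link)


# Hypothetical usage: the agent records its first step on a fresh ticket.
ticket = IncidentTicket(
    ticket_id="INC-0001",
    anomaly_event={"metric_name": "error_rate", "current_value": 4.5, "baseline_value": 1.5},
)
ticket.record_step("metrics", "error rate, last 30m", "error rate is 3x baseline")
```

Because every step is appended rather than overwritten, the human who engages later sees the investigation in the order the agent performed it.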

The anomaly detection layer is the foundation of this pipeline. At L4, anomaly detection goes beyond simple threshold alerts. Statistical anomaly detection - using tools like Facebook Prophet, AWS CloudWatch Anomaly Detection, or custom Prometheus alerting rules with predict_linear and dynamic thresholds - identifies anomalies that fixed-threshold alerts would miss: a gradual drift that stays within normal bounds but shows an abnormal trend, or a metric that is within its absolute bounds but matches a pattern that historically precedes failures. The richer the anomaly detection, the broader the set of conditions that trigger agent investigation.
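
To make the "gradual drift" case concrete, here is a minimal sketch of trend-based detection using a least-squares line, in the spirit of Prometheus's predict_linear but not a reimplementation of it. The function names, the sample data, and the 90% limit are assumptions chosen for illustration.

```python
def project_linear(samples, horizon_s):
    """Fit a least-squares line to (timestamp_s, value) samples and
    return the value projected horizon_s seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_s) + intercept


def is_trend_anomaly(samples, limit, horizon_s):
    """True when the latest value is still under the limit but the fitted
    trend crosses it within the horizon, which a fixed threshold would miss."""
    current = samples[-1][1]
    return current < limit and project_linear(samples, horizon_s) >= limit


# Hypothetical disk usage (%), one sample per minute, all well under a 90% limit.
disk_usage = [(0, 70.0), (60, 71.0), (120, 72.0), (180, 73.0), (240, 74.0)]
drifting = is_trend_anomaly(disk_usage, limit=90.0, horizon_s=3600)
```

A fixed alert at 90% stays silent for this series, while the projection flags it because the fitted slope reaches the limit within the hour.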

Why It Matters

Automated anomaly-to-agent investigation fundamentally changes the economics and speed of incident response:

·Mean time to investigation drops from 15 minutes to under 2 minutes - agents start investigating the moment the anomaly fires, without waiting for a human to acknowledge the page and open their laptop
·Incident response quality becomes consistent - agent investigation follows the runbook every time, without the variability of humans who are tired, distracted, or unfamiliar with the affected service
·On-call burden decreases - when agents handle the investigation phase, the on-call engineer's job shifts to reviewing and approving agent recommendations rather than performing the investigation; this significantly reduces on-call cognitive load
·Small anomalies that humans would deprioritize are investigated - a P3 alert that would sit in the queue for hours receives immediate agent investigation; some P3 alerts turn out to be early signals of P1 incidents
·Every investigation creates structured knowledge - agent investigations written into tickets are queryable historical data; the next similar incident has the previous investigation to learn from

Getting Started

Define structured anomaly event schemas - Every anomaly event that triggers the pipeline should be a structured JSON object: {service, alert_name, metric_name, current_value, baseline_value, deviation_percent, timestamp, severity, runbook_link}. This structure is what the agent receives as its starting context. Unstructured alert text is not sufficient.
Build the auto-ticket creation webhook - Configure your alerting system (Prometheus Alertmanager, Datadog Monitors, PagerDuty) to call a webhook when an alert fires. The webhook creates a Jira or Linear ticket with the structured anomaly data, assigns it to the on-call engineer, and tags it for agent investigation. The ticket ID becomes the coordination point for all subsequent investigation activity.
Implement the investigation agent - The agent receives the ticket ID, reads the anomaly context, and executes its investigation plan: query metrics (PromQL), retrieve traces (TraceQL), search logs (LogQL), find similar incidents (incident MCP), read the runbook (runbook MCP). The agent writes its findings back to the ticket as structured comments. Use a stateless agent with explicit tool calling rather than a conversational agent - the investigation task is well-defined and benefits from deterministic tool execution.
Define the agent's investigation protocol - Write an explicit investigation protocol the agent follows for each alert type. For a payment error rate spike: (1) query payment service error rate for last 30 minutes, (2) compare to 24-hour and 7-day baseline, (3) find traces with errors in the window, (4) check payment service deployment history in the last 2 hours, (5) check external payment gateway API status, (6) query similar incidents. The protocol gives the agent a structured investigation path rather than an open-ended search.
Set human escalation thresholds - Not every agent investigation ends in a clear answer. Define the conditions under which the agent escalates to a human: root cause not identified after N investigation steps, multiple conflicting hypotheses, root cause identified but remediation requires human approval, or confidence below a defined threshold. Escalation should carry the full investigation context so the human continues where the agent left off.
Instrument investigation quality - Track the percentage of agent investigations that correctly identify the root cause (as confirmed by human resolution). Track the time from anomaly detection to agent producing a preliminary analysis. Track the percentage of investigations that require human escalation. These metrics show where the investigation protocol needs improvement and demonstrate the value of the automated pipeline.
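
The first two steps above can be sketched as a small amount of glue code: build a structured anomaly event with the fields listed in the schema, validate it, and turn it into a ticket payload. This is a sketch under stated assumptions: the payload shape, the `agent-investigation` label, and the example values are hypothetical and do not correspond to the Jira or Linear APIs.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = (
    "service", "alert_name", "metric_name", "current_value", "baseline_value",
    "deviation_percent", "timestamp", "severity", "runbook_link",
)


def build_anomaly_event(service, alert_name, metric_name, current_value,
                        baseline_value, severity, runbook_link, timestamp=None):
    """Assemble the structured event the agent receives as starting context."""
    if baseline_value == 0:
        raise ValueError("baseline_value must be non-zero to compute a deviation")
    deviation = (current_value - baseline_value) / baseline_value * 100.0
    return {
        "service": service,
        "alert_name": alert_name,
        "metric_name": metric_name,
        "current_value": current_value,
        "baseline_value": baseline_value,
        "deviation_percent": deviation,
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "runbook_link": runbook_link,
    }


def validate_event(event):
    """Reject events with missing or empty required fields."""
    missing = [f for f in REQUIRED_FIELDS if event.get(f) in (None, "")]
    if missing:
        raise ValueError(f"anomaly event is missing fields: {missing}")


def to_ticket_payload(event):
    """Shape a generic ticket payload; a real webhook would map this
    onto whatever the ticketing system expects (an assumption here)."""
    validate_event(event)
    return {
        "title": f"[{event['severity']}] {event['alert_name']} on {event['service']}",
        "labels": ["agent-investigation"],
        "anomaly_event": event,
    }


# Hypothetical alert values for illustration.
event = build_anomaly_event(
    "payments", "PaymentErrorRateHigh", "error_rate",
    current_value=4.5, baseline_value=1.5, severity="P2",
    runbook_link="https://runbooks.example.com/payments",
)
payload = to_ticket_payload(event)
```

Validating at the webhook boundary means an alert that arrives without a runbook link or baseline fails loudly there, rather than leaving the agent to investigate with partial context.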

Tip

The hardest part of this pipeline is not the technology - it is defining what "correct root cause identification" means precisely enough to measure it. Before instrumenting investigation quality, spend a sprint defining a taxonomy of root cause categories for your system and ensuring every resolved incident is tagged with a root cause category. Without this taxonomy, you cannot evaluate whether the agent's hypotheses are correct.
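
As a rough illustration of why the taxonomy matters, the sketch below scores agent hypotheses against human-confirmed categories. The four categories and the record format are assumptions; a real taxonomy would be specific to your system.

```python
from enum import Enum


class RootCause(Enum):
    # Hypothetical taxonomy; replace with categories that fit your system.
    DEPLOY_REGRESSION = "deploy_regression"
    DEPENDENCY_OUTAGE = "dependency_outage"
    CAPACITY = "capacity"
    CONFIG_CHANGE = "config_change"


def investigation_quality(records):
    """Each record has 'agent' (RootCause), 'confirmed' (RootCause or None
    when the incident was never tagged), and 'escalated' (bool)."""
    if not records:
        raise ValueError("no investigation records to score")
    scored = [r for r in records if r["confirmed"] is not None]
    correct = sum(1 for r in scored if r["agent"] == r["confirmed"])
    escalated = sum(1 for r in records if r["escalated"])
    return {
        # Accuracy is only defined over incidents a human has tagged.
        "accuracy": correct / len(scored) if scored else None,
        "escalation_rate": escalated / len(records),
        "untagged": len(records) - len(scored),
    }


records = [
    {"agent": RootCause.DEPLOY_REGRESSION, "confirmed": RootCause.DEPLOY_REGRESSION, "escalated": False},
    {"agent": RootCause.CAPACITY, "confirmed": RootCause.DEPENDENCY_OUTAGE, "escalated": True},
    {"agent": RootCause.CONFIG_CHANGE, "confirmed": RootCause.CONFIG_CHANGE, "escalated": False},
    {"agent": RootCause.DEPLOY_REGRESSION, "confirmed": None, "escalated": False},
]
quality = investigation_quality(records)
```

The `untagged` count is worth watching on its own: every resolved incident without a confirmed category is one the accuracy figure cannot see.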

Common Pitfalls

Agents that investigate but cannot communicate results clearly. An agent that produces a 3,000-word investigation report is worse than no report during an incident. The agent's output should be structured and scannable: root cause hypothesis in one sentence, evidence in three bullet points, recommended action in one sentence. Format matters as much as accuracy when humans need to act on the output quickly.

No escalation path for agent uncertainty. An agent that confidently produces a wrong root cause hypothesis can send the incident response team down the wrong path, making resolution slower than if the agent had not investigated at all. Agents must communicate their confidence level and escalate when uncertain. "I found two plausible root causes with similar evidence; human judgment needed to differentiate" is better than a confident wrong answer.
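
An escalation rule of this kind can be expressed as a small, explicit check. The following is a sketch, not a recommended policy: the thresholds (0.7 minimum confidence, 0.15 conflict margin, 8 steps) and the reason strings are placeholder assumptions to tune for your own incidents.

```python
def escalation_reasons(hypotheses, steps_used, max_steps=8,
                       min_confidence=0.7, conflict_margin=0.15,
                       remediation_needs_approval=False):
    """hypotheses: list of (description, confidence in [0, 1]).
    Returns the reasons to hand off to a human; an empty list means
    the agent may continue or conclude on its own."""
    reasons = []
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    if not ranked:
        # No hypothesis yet: escalate only once the step budget is spent.
        if steps_used >= max_steps:
            reasons.append("no_root_cause_after_max_steps")
        return reasons
    if ranked[0][1] < min_confidence:
        reasons.append("low_confidence")
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] < conflict_margin:
        reasons.append("conflicting_hypotheses")
    if remediation_needs_approval:
        reasons.append("remediation_requires_approval")
    return reasons


# Two plausible causes with similar evidence: the agent should say so.
reasons = escalation_reasons(
    [("deploy regression", 0.55), ("payment gateway outage", 0.50)],
    steps_used=4,
)
```

Returning named reasons rather than a single boolean lets the ticket state why the agent is handing off, which is the context the human needs to continue where the agent left off.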

Alert volume overwhelming the agent pipeline. If 50 alerts fire simultaneously during a cascading failure, dispatching 50 concurrent investigation agents is not useful and may be expensive. Build throttling and deduplication into the pipeline: during a cascading failure, investigate the root cause alert rather than every downstream symptom. Alert correlation (identifying which alerts are symptoms of the same root cause) is a prerequisite for efficient agent dispatch.
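
A minimal sketch of deduplication plus correlation, assuming a service dependency map is available: drop repeated alerts, then keep only alerts whose service has no alerting dependency upstream of it. The alert fields and the dependency map are hypothetical, and real correlation usually needs more signals than topology alone.

```python
def _has_alerting_dependency(service, depends_on, alerting):
    """Walk transitive dependencies; True if any of them is also alerting."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if dep != service and dep in alerting:
            return True
        stack.extend(depends_on.get(dep, []))
    return False


def root_cause_alerts(alerts, depends_on):
    """Deduplicate by (service, alert_name), then keep alerts that are not
    explained by an alerting dependency."""
    unique = {}
    for alert in alerts:
        unique.setdefault((alert["service"], alert["alert_name"]), alert)
    alerting = {service for service, _ in unique}
    roots = [alert for (service, _), alert in unique.items()
             if not _has_alerting_dependency(service, depends_on, alerting)]
    # With cyclic dependencies every alert can look like a symptom;
    # fall back to the deduplicated set rather than dispatching nothing.
    return roots or list(unique.values())


alerts = [
    {"service": "checkout", "alert_name": "HighLatency"},
    {"service": "payments", "alert_name": "ErrorRate"},
    {"service": "db", "alert_name": "ConnSaturation"},
    {"service": "db", "alert_name": "ConnSaturation"},  # duplicate firing
]
depends_on = {"checkout": ["payments"], "payments": ["db"], "db": []}
to_investigate = root_cause_alerts(alerts, depends_on)
```

In this example four firings collapse to a single dispatch for the database alert, and the checkout and payments alerts are treated as downstream symptoms.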

Investigation agents with write access to production. An investigation-phase agent should be read-only. It queries metrics, retrieves traces, reads logs, and reads runbooks - but it does not take remediation actions. Mixing investigation and remediation in a single agent is dangerous: an agent that incorrectly identifies a root cause might take a remediation action that makes the incident worse. Separate the investigation agent (read-only) from the remediation agent (write access, requires approval).
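
One way to enforce the read-only boundary is at the tool dispatch layer rather than in the prompt. The sketch below assumes a hypothetical tool registry; the tool names and stub handlers are illustrative, and in practice the same boundary should also be enforced with read-only credentials.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Tool:
    name: str
    handler: Callable[..., object]
    read_only: bool


class ReadOnlyToolbox:
    """Dispatcher for the investigation agent: refuses any tool that can write."""

    def __init__(self, tools: Iterable[Tool]):
        self._tools: Dict[str, Tool] = {t.name: t for t in tools}

    def call(self, name: str, **kwargs):
        tool = self._tools.get(name)
        if tool is None:
            raise KeyError(f"unknown tool: {name}")
        if not tool.read_only:
            raise PermissionError(f"{name} can modify production; use the remediation agent")
        return tool.handler(**kwargs)


# Stub handlers stand in for real metric queries and remediation actions.
toolbox = ReadOnlyToolbox([
    Tool("query_metrics", lambda query: {"query": query, "series": []}, read_only=True),
    Tool("restart_service", lambda service: f"restarted {service}", read_only=False),
])
result = toolbox.call("query_metrics", query="error rate, last 30m")
```

With this shape, a wrong root cause hypothesis can at worst produce a wrong report; it cannot trigger a restart or rollback without going through the separately approved remediation path.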

Not feeding agent investigation results back into runbooks. When an agent investigation correctly identifies a root cause that was not in the runbook, that discovery should update the runbook. Runbooks that do not incorporate agent investigation learnings become stale. Create a process: when an agent's novel finding helps resolve an incident, add it to the relevant runbook within 48 hours.

How Different Roles See It

Bob, Head of Engineering

Bob's on-call engineers are burning out. The team's services are mature and incidents are infrequent, but each incident requires deep investigation that can take hours and happens at unpredictable times. He wants to reduce the on-call burden without reducing service quality.

What Bob should do: Bob should position the agent investigation pipeline as an on-call load reduction investment. The goal is to shift the on-call engineer's role from investigator to decision-maker: the agent does the investigation, the human makes the call on remediation. This shift reduces the cognitive and time burden of on-call, making rotation more sustainable and accessible to more team members. Bob should measure on-call experience with a simple monthly survey: mean pages per week, mean investigation time per incident, subjective experience rating. These metrics should improve as the agent pipeline matures. Bob should also ensure the pipeline has a clear human override: on-call engineers should always be able to bypass agent recommendations and investigate manually. The agent is a tool, not an authority.

Sarah, Productivity Lead

Sarah tracks on-call sustainability as a developer experience metric. She knows that unsustainable on-call rotations cause senior engineers to leave and make hiring harder. The agent investigation pipeline is a direct intervention on this problem.

What Sarah should do: Sarah should instrument the full incident response timeline with the agent pipeline in place: time from anomaly detection to agent starting investigation, time from agent investigation to human decision, time from human decision to resolution. Compare this to the pre-agent baseline. The improvement in "time from anomaly to human decision" is the clearest evidence that the pipeline is working. Sarah should also interview on-call engineers after their first month with the agent pipeline: are they more confident? Are incidents less stressful? Do they feel the agent's analyses are useful? This qualitative data supplements the quantitative metrics and surfaces issues with agent output quality that numbers alone might miss.

Victor, Staff Engineer - AI Champion

Victor is building the investigation agent pipeline. He has the observability stack (metrics, traces, logs via MCP), the incident history MCP, and the runbook MCP. He is now designing the agent's investigation protocol and the ticket writing format.

What Victor should do: Victor should start with a narrow, high-frequency incident type for the first production deployment of the investigation agent. Pick the alert that fires most often, has the most consistent root causes, and has the most detailed runbook. Build an investigation agent specifically for this alert type, test it against the last 20 historical incidents (were the root causes it would have identified correct?), and deploy it to production for that alert type only. This narrow deployment validates the pipeline with real incidents before expanding to all alert types. Victor should also design the investigation protocol as data (a JSON schema) rather than code, so that runbook owners can update the investigation steps without modifying the agent code. The protocol schema is the bridge between human-written runbooks and agent-executed investigation procedures.
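
A protocol-as-data design might look like the following sketch: the investigation steps live in a JSON document that runbook owners can edit, and a small generic runner executes them. The field names (`id`, `tool`, `params`), the alert name, and the stub tools are all assumptions for illustration, not a defined schema.

```python
import json

# Hypothetical protocol document, maintained alongside the runbook.
PROTOCOL_JSON = """
{
  "alert_name": "PaymentErrorRateHigh",
  "steps": [
    {"id": "error_rate_30m", "tool": "metrics", "params": {"service": "payments", "window": "30m"}},
    {"id": "recent_deploys", "tool": "deploys", "params": {"service": "payments", "window": "2h"}},
    {"id": "similar_incidents", "tool": "incidents", "params": {"service": "payments"}}
  ]
}
"""


def load_protocol(raw):
    """Parse and minimally validate a protocol document."""
    protocol = json.loads(raw)
    if not protocol.get("alert_name"):
        raise ValueError("protocol needs an alert_name")
    steps = protocol.get("steps")
    if not isinstance(steps, list) or not steps:
        raise ValueError("protocol needs a non-empty list of steps")
    for step in steps:
        for key in ("id", "tool", "params"):
            if key not in step:
                raise ValueError(f"step is missing '{key}': {step}")
    return protocol


def run_protocol(protocol, tools):
    """Execute steps in order; tools maps a tool name to a callable.
    A missing tool is recorded as a finding instead of aborting the run."""
    findings = []
    for step in protocol["steps"]:
        handler = tools.get(step["tool"])
        if handler is None:
            findings.append({"step": step["id"], "error": f"no tool named {step['tool']}"})
            continue
        findings.append({"step": step["id"], "result": handler(**step["params"])})
    return findings


# Stub tools standing in for the metrics, deployment, and incident-history integrations.
stub_tools = {
    "metrics": lambda service, window: f"error rate for {service} over {window}",
    "deploys": lambda service, window: f"deploys to {service} in last {window}",
    "incidents": lambda service: f"past incidents for {service}",
}
findings = run_protocol(load_protocol(PROTOCOL_JSON), stub_tools)
```

Adding or reordering a step is then an edit to the JSON document, which can be reviewed by the runbook owner without touching the runner.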
