Incident data available for context

·Full observability stack is operational (OpenTelemetry + Grafana/Datadog or equivalent)
·Production metrics feed into dashboards accessible to all developers
·Incident data (post-mortems, error patterns) is available as agent context

·SLOs are defined and tracked for key services
·Incident data is structured for machine consumption (not just human-readable post-mortem docs)

Evidence

·Observability stack configuration (OTel collector, Grafana dashboards)
·Production metrics dashboards with developer access
·Incident data accessible via MCP or structured API

What It Is

"Incident data available for context" means that when an AI agent or human engineer begins investigating a production issue, all the relevant historical context is immediately accessible through programmatic interfaces - not buried in Confluence pages, Slack threads, or people's memories. This includes past incidents with their root cause, resolution steps, and affected services; runbooks with step-by-step investigation procedures; historical metric baselines that define what "normal" looks like; and the linkage between specific error patterns and their known remediation paths.

At L3, this context availability is primarily about making data accessible to agents via MCP servers. An MCP server wrapping PagerDuty's API allows an agent to query: "what incidents has the payment service had in the last 6 months?" An MCP server wrapping Confluence allows an agent to retrieve the runbook for a specific alert type. An MCP server wrapping your internal incident database allows an agent to find all incidents where this specific error code appeared and what the resolution was each time. The agent combines this historical context with the current production signals (metrics, traces, logs) to form an investigation hypothesis that would take a human hours to assemble.
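
The two queries above can be sketched as plain functions that an MCP server might register as tools. This is a minimal illustration, not a real PagerDuty integration: the Incident fields, the sample records, and find_by_error_code are hypothetical, and an in-memory list stands in for the incident tool's API or an internal incident database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    incident_id: str
    service: str
    opened_at: datetime
    error_code: str
    root_cause: str
    resolution: str


# Hypothetical sample records; a real server would fetch these from the
# incident tool's API or an internal incident database.
INCIDENTS = [
    Incident("INC-101", "payment", datetime(2024, 1, 10), "E_POOL_TIMEOUT",
             "connection pool exhaustion", "raised pool size"),
    Incident("INC-117", "payment", datetime(2024, 4, 2), "E_UPSTREAM_503",
             "card processor outage", "failed over to backup processor"),
    Incident("INC-120", "checkout", datetime(2024, 4, 20), "E_POOL_TIMEOUT",
             "connection pool exhaustion", "restarted stuck workers"),
]


def get_incidents(service: str, since: datetime, now: datetime) -> list[Incident]:
    """Incidents for one service opened between since and now, newest first."""
    hits = [i for i in INCIDENTS
            if i.service == service and since <= i.opened_at <= now]
    return sorted(hits, key=lambda i: i.opened_at, reverse=True)


def find_by_error_code(error_code: str) -> list[tuple[str, str]]:
    """(incident_id, resolution) for every incident that logged this error code."""
    return [(i.incident_id, i.resolution)
            for i in INCIDENTS if i.error_code == error_code]


now = datetime(2024, 5, 1)
# "What incidents has the payment service had in the last 6 months?"
recent_payment = get_incidents("payment", now - timedelta(days=182), now)
# "Where has this error code appeared, and what resolved it each time?"
pool_history = find_by_error_code("E_POOL_TIMEOUT")
```

Returning structured records rather than rendered text is the point of the sketch: the agent can filter, join, and rank the results without parsing prose.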

The architecture of this context layer matters. Incident data that lives only in human-readable pages is not agent-accessible in any practical sense. The agent needs structured, queryable interfaces: REST APIs, MCP tools, or search endpoints that return machine-parseable data. PagerDuty, OpsGenie, and Jira Service Management all expose APIs that can be wrapped as MCP tools. Confluence and Notion expose search APIs. Internal runbook systems can be queried via Elasticsearch or similar. The MCP layer is the translation between human-oriented incident management tools and agent-consumable context APIs.

Runbooks are the highest-value context assets for agent investigation. A well-structured runbook contains: the alert condition that triggers it, the likely root causes ranked by frequency, the investigation steps in order, the remediation steps for each root cause, and links to relevant dashboards and past incidents. An agent with access to this runbook can follow its investigation steps programmatically: query the metrics the runbook specifies, check the conditions the runbook lists, and execute the remediation steps if a root cause is confirmed. The runbook essentially programs the agent's investigation procedure.
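
One way to picture such a runbook is as a small data structure. The sketch below is an assumption about shape only: the class names, field names, and sample values are hypothetical, and the same structure could equally live in YAML or a database row.

```python
from dataclasses import dataclass, field


@dataclass
class RootCause:
    name: str
    past_occurrences: int      # how often this cause was confirmed in past incidents
    remediation: list[str]     # ordered remediation steps for this cause


@dataclass
class Runbook:
    alert_condition: str
    root_causes: list[RootCause]
    investigation_steps: list[str]
    dashboards: list[str] = field(default_factory=list)
    related_incidents: list[str] = field(default_factory=list)

    def ranked_causes(self) -> list[RootCause]:
        """Most frequently confirmed root cause first."""
        return sorted(self.root_causes,
                      key=lambda c: c.past_occurrences, reverse=True)


# Hypothetical example content.
runbook = Runbook(
    alert_condition="payment-service error rate > 5% for 5 minutes",
    root_causes=[
        RootCause("upstream processor outage", 2,
                  ["fail over to backup processor"]),
        RootCause("connection pool exhaustion", 7,
                  ["raise pool size", "restart workers"]),
    ],
    investigation_steps=[
        "check error rate by endpoint",
        "check database connection pool saturation",
        "check upstream processor status",
    ],
    related_incidents=["INC-101"],
)

# The agent starts from the historically most likely cause.
first_hypothesis = runbook.ranked_causes()[0]
```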

Why It Matters

Making incident data available for agent context enables qualitatively different incident response:

·Agent investigation approaches expert quality - a new on-call engineer has no institutional memory, whereas an agent with access to two years of incident history and runbooks can draw on the knowledge of your most experienced SRE
·Runbook execution becomes automatable - when runbooks are structured data (not just prose) accessible via API, agents can follow them step by step rather than requiring humans to translate text into actions
·Pattern recognition across incident history - agents can identify that three seemingly different incidents in the past month all shared the same root cause, something a human might miss when investigating each incident in isolation
·Reduces tribal knowledge risk - incident knowledge stored only in people's heads is lost when those people leave; incident data in structured, queryable systems is organizational knowledge that persists
·Enables proactive context injection - when an alert fires, the agent automatically retrieves all relevant historical incidents, runbooks, and baselines before the human even opens the page, reducing time-to-context from 20 minutes to seconds

Getting Started

·Build an MCP server for your incident management tool - PagerDuty, OpsGenie, and Jira Service Management all expose REST APIs. Wrap yours in an MCP server that exposes tools such as get_incidents(service, time_range), get_incident_detail(incident_id), and get_runbook(alert_name). This gives agents programmatic access to your incident history and runbooks without scraping web pages.
·Structure your runbooks for machine consumption - Convert prose runbooks into structured formats: investigation steps as an ordered list, each step with an expected outcome and a branching condition ("if this step shows X, go to step 4; if it shows Y, go to step 7"). Store runbooks in a system with a query API (Confluence and Notion both have REST APIs). Even YAML files in a git repository are more agent-accessible than PDF documents.
·Create an incident tagging taxonomy - Tag every incident with affected service, root cause category, resolution category, and duration. This taxonomy makes historical queries meaningful: "how many incidents in the last 6 months had root_cause=database_connection_pool_exhaustion?" Without consistent tagging, incident history is a collection of text rather than queryable data.
·Define and store metric baselines - For every key metric, store the "normal" baseline as a Prometheus recording rule or a separate time series: P99 latency baseline, request rate baseline, error rate baseline. When an alert fires, the agent queries the current value and the baseline to quantify the deviation: "error rate is 15x the 30-day baseline."
·Link Sentry error groups to incident records - When a Sentry error group corresponds to a past incident, link them in both directions: add the incident ID to the Sentry error group's tags, and add the Sentry error group ID to the incident record. This bidirectional linkage allows agents to find all past incidents related to a current error pattern and retrieve their resolution paths.
·Build a "context package" assembler - Create a service that, given an alert (service name, alert type, timestamp), assembles a context package: recent incidents for the service, the relevant runbook, current metric deviation from baseline, recent deployments, and the Sentry error groups active in the window. This package is delivered to the agent at the start of investigation rather than requiring the agent to query each source separately.
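
The assembler in the last step can be sketched as a single function that queries each source once and bundles the results. Everything here is illustrative: the Alert and ContextPackage shapes are assumptions, each data source is passed in as a plain callable so the sketch stays independent of any particular vendor API, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass
class Alert:
    service: str
    alert_type: str
    fired_at: datetime


@dataclass
class ContextPackage:
    alert: Alert
    recent_incidents: list[str]
    runbook: str
    deviation_from_baseline: float   # current value divided by baseline value
    recent_deployments: list[str]
    active_error_groups: list[str]


def assemble_context(
    alert: Alert,
    incidents_for: Callable[[str], list[str]],
    runbook_for: Callable[[str], str],
    metric_now: Callable[[str], float],
    metric_baseline: Callable[[str], float],
    deployments_for: Callable[[str], list[str]],
    error_groups_for: Callable[[str], list[str]],
) -> ContextPackage:
    """Query every source once and bundle the results for the agent."""
    baseline = metric_baseline(alert.service)
    # Guard against a zero baseline so the ratio is always defined.
    deviation = metric_now(alert.service) / baseline if baseline else float("inf")
    return ContextPackage(
        alert=alert,
        recent_incidents=incidents_for(alert.service),
        runbook=runbook_for(alert.alert_type),
        deviation_from_baseline=deviation,
        recent_deployments=deployments_for(alert.service),
        active_error_groups=error_groups_for(alert.service),
    )


# Hypothetical stand-ins for the real sources (incident tool, runbook store,
# metrics backend, deploy log, Sentry).
package = assemble_context(
    Alert("payment", "error-rate", datetime(2024, 5, 1, 9, 30)),
    incidents_for=lambda service: ["INC-117", "INC-101"],
    runbook_for=lambda alert_type: "runbooks/payment-error-rate.yaml",
    metric_now=lambda service: 30.0,       # errors per minute right now
    metric_baseline=lambda service: 2.0,   # stored 30-day baseline
    deployments_for=lambda service: ["deploy-2024-05-01-0915"],
    error_groups_for=lambda service: ["PAYMENT-4F2"],
)
summary = f"error rate is {package.deviation_from_baseline:.0f}x the baseline"
```

Passing the sources in as callables keeps the assembler testable without network access; in practice each callable would wrap one of the tools built in the earlier steps.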

Tip

The most valuable incident context is often the "false alarm" documentation: incidents that were investigated and found to be benign (a flaky external API, an expected traffic spike from a marketing campaign). Without this documentation, agents and humans will repeatedly investigate the same benign conditions as if they were novel problems. Tag resolved-as-not-actionable incidents explicitly.

Common Pitfalls

Runbooks that describe symptoms rather than investigation procedures. A runbook that says "this alert means the payment service is unhealthy - check the dashboard" is not useful to an agent. A runbook needs to be procedural: step 1 check metric X, if above threshold Y proceed to step 2, step 2 check log query Z, if pattern W found execute command V. The procedural format is what makes a runbook automatable.
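
The procedural format described above maps directly onto a small branching structure. The following is a hedged sketch of how an agent might walk such a runbook. The Step shape, the "run:" prefix for remediation actions, the "handoff" outcome, and all thresholds and signal names are hypothetical conventions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    description: str
    check: Callable[[dict], bool]   # evaluates the step against current signals
    if_true: str                    # next step id, or an action prefixed "run:"
    if_false: Optional[str] = None  # None means: stop and hand off to a human


def follow_runbook(steps: dict[str, Step], start: str, signals: dict) -> list[str]:
    """Walk the runbook from start; return visited step ids plus the outcome."""
    trail: list[str] = []
    current: Optional[str] = start
    while current is not None and not current.startswith("run:"):
        if current in trail:
            raise ValueError(f"runbook loops back to step {current}")
        trail.append(current)
        step = steps[current]
        current = step.if_true if step.check(signals) else step.if_false
    trail.append(current if current is not None else "handoff")
    return trail


steps = {
    "1": Step("error rate above threshold",
              lambda s: s["error_rate"] > 0.05, "2"),
    "2": Step("pool timeout pattern in logs",
              lambda s: "pool timeout" in s["log_sample"],
              "run:restart-workers", "3"),
    "3": Step("upstream returning 503",
              lambda s: s["upstream_status"] == 503, "run:failover"),
}

signals = {"error_rate": 0.12,
           "log_sample": "ERROR pool timeout after 30s",
           "upstream_status": 200}
trail = follow_runbook(steps, "1", signals)
```

A symptom-only runbook cannot be expressed in this form at all, which is a quick test of whether a given runbook is automatable.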

Incident data in inaccessible formats. PDF runbooks, screenshots of dashboards, Slack threads as the primary incident record - none of these are agent-accessible. Every piece of incident knowledge stored in a human-only format is a dead end for agent investigation. Migrate knowledge to systems with APIs before expecting agents to use it.

Not maintaining incident context over time. Incident runbooks that were accurate two years ago may be completely wrong today if the architecture has changed. Stale runbooks are worse than no runbooks because they send agents down incorrect investigation paths. Assign runbook owners and require quarterly review. Track when each runbook was last validated and flag unreviewed runbooks in the agent's context package.
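
Flagging unreviewed runbooks can be as simple as comparing each runbook's last validation date with the review interval. A minimal sketch, assuming a 90-day interval as a stand-in for "quarterly" and hypothetical runbook names and label strings:

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumed stand-in for "quarterly"


def annotate_freshness(last_validated: dict[str, date], today: date) -> dict[str, str]:
    """Label each runbook 'validated' or 'unreviewed' by its last validation date."""
    return {
        name: "validated" if today - validated_on <= REVIEW_INTERVAL else "unreviewed"
        for name, validated_on in last_validated.items()
    }


labels = annotate_freshness(
    {
        "payment-error-rate": date(2024, 4, 15),
        "checkout-latency": date(2022, 6, 1),
    },
    today=date(2024, 5, 1),
)
```

Including the label in the context package lets the agent weigh an unreviewed runbook as a hint rather than as an authoritative procedure.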

Over-relying on incident history without current context. An agent that looks only at historical incidents without also examining current metrics, traces, and logs will match the wrong historical pattern. Historical context should inform hypothesis formation, not replace investigation. The agent should use historical incidents to generate hypotheses and then validate those hypotheses against current production data.

No feedback loop from agent investigations back to incident records. When an agent investigates an incident and identifies a root cause, that finding should be written back to the incident record. Over time, this creates an incident history that includes agent-generated analyses alongside human ones - compounding institutional knowledge rather than just consuming it.

How Different Roles See It

Bob, Head of Engineering

Bob's team loses significant institutional knowledge every time an engineer leaves. Post-mortems are written but rarely referenced; runbooks exist but are not maintained; new on-call engineers spend their first incidents reinventing investigations that veterans handle in minutes. Bob wants to capture this knowledge in a form that persists and compounds.

What Bob should do: Bob should treat incident knowledge management as an engineering investment with measurable ROI: the metric is "time for a new on-call engineer to reach full effectiveness." Currently that might be 3-6 months; with structured, accessible incident history and runbooks, it should be 2-4 weeks. Bob should mandate that every significant incident generate a structured post-mortem that follows a template: timeline, root cause, contributing factors, resolution steps, and "what an agent should check" as a specific section. This last section is the bridge to agent-assisted investigation. Bob should also schedule a quarterly runbook review as a team ritual: every runbook is checked against current system architecture and updated or retired.

Sarah, Productivity Lead

Sarah is focused on reducing the expertise barrier for on-call rotation. Junior developers avoid on-call because they feel unprepared; senior developers carry disproportionate on-call burden because they are the only ones who can investigate incidents effectively. Accessible incident context is the equalizer.

What Sarah should do: Sarah should advocate for making incident context the first thing a new on-call engineer receives when a page fires. She should work with the team to build the context package assembler: when a PagerDuty alert fires, an automated system immediately posts in the incident Slack thread: the relevant runbook link, the three most similar past incidents with their resolutions, the current metric deviation from baseline, and a link to the pre-filtered Grafana dashboard. This context package reduces the expertise gap: a junior engineer with this package can follow a structured investigation path that previously required institutional memory. Sarah should measure whether on-call rotation participation increases after context packages are introduced, and whether junior engineers' incident resolution time approaches senior engineers' time.
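
Selecting "the three most similar past incidents" requires some notion of similarity, which the text leaves open. One simple option, used here purely as an assumption, is Jaccard overlap between incident tags; the tag values and incident IDs are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(current_tags: set[str],
                 history: dict[str, set[str]],
                 k: int = 3) -> list[str]:
    """Ids of the k past incidents whose tags overlap most with the current alert."""
    scored = [(jaccard(current_tags, tags), incident_id)
              for incident_id, tags in history.items()]
    # Highest score first; ties broken by incident id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [incident_id for score, incident_id in scored if score > 0][:k]


history = {
    "INC-090": {"search", "latency"},
    "INC-095": {"payment", "latency"},
    "INC-101": {"payment", "error-rate", "pool-timeout"},
    "INC-117": {"payment", "error-rate", "upstream-503"},
    "INC-120": {"checkout", "pool-timeout"},
}
top_three = most_similar({"payment", "error-rate", "pool-timeout"}, history)
```

Tag overlap only works if incidents are tagged consistently; embedding-based search over post-mortem text is an alternative when tags are sparse.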

Victor, Staff Engineer - AI Champion

Victor wants agents to be the primary consumers of the incident context layer. He is building an automated investigation agent that, when an alert fires, assembles all relevant context and produces a preliminary root cause hypothesis before any human engages. This requires the context layer to be comprehensive, structured, and API-accessible.

What Victor should do: Victor should build the incident context MCP server as a priority infrastructure investment. The server exposes: get_similar_incidents(service, error_pattern), get_runbook(alert_name), get_metric_baseline(service, metric_name), get_recent_deployments(service, time_range), and get_active_sentry_errors(service). An investigation agent calls all five in parallel when an alert fires, synthesizes the results into a preliminary analysis, and posts it in the incident Slack thread within 2 minutes of alert firing. Victor should also instrument the quality of agent investigations: when an agent's preliminary analysis is confirmed as correct by the human responder, that success case is logged. When the agent is wrong, the failure case is logged with the correct root cause. Over time, this feedback loop improves both the runbooks and the agent's investigation strategy.
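
Calling the five tools in parallel could look like the following asyncio sketch. The tool names come from the list above, but their bodies here are stubs returning made-up data, and the gather_context wrapper, the "1h" time range, and the result keys are assumptions; a real agent would invoke the tools through its MCP client.

```python
import asyncio


# Stub tools: each would be a call to the incident context MCP server.
async def get_similar_incidents(service: str, error_pattern: str) -> list[str]:
    await asyncio.sleep(0)
    return ["INC-101"]


async def get_runbook(alert_name: str) -> str:
    await asyncio.sleep(0)
    return f"runbooks/{alert_name}.yaml"


async def get_metric_baseline(service: str, metric_name: str) -> float:
    await asyncio.sleep(0)
    return 2.0


async def get_recent_deployments(service: str, time_range: str) -> list[str]:
    await asyncio.sleep(0)
    return ["deploy-2024-05-01-0915"]


async def get_active_sentry_errors(service: str) -> list[str]:
    await asyncio.sleep(0)
    return ["PAYMENT-4F2"]


RESULT_KEYS = ["similar_incidents", "runbook", "baseline",
               "deployments", "sentry_errors"]


async def gather_context(service: str, alert_name: str, error_pattern: str) -> dict:
    """Run all five lookups concurrently; gather returns results in call order."""
    results = await asyncio.gather(
        get_similar_incidents(service, error_pattern),
        get_runbook(alert_name),
        get_metric_baseline(service, "error_rate"),
        get_recent_deployments(service, "1h"),
        get_active_sentry_errors(service),
    )
    return dict(zip(RESULT_KEYS, results))


context = asyncio.run(gather_context("payment", "payment-error-rate", "pool timeout"))
```

Because the lookups are independent, total latency is bounded by the slowest source rather than the sum of all five, which matters for a two-minute target.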
