Incident data available for context
- Full observability stack is operational (OpenTelemetry + Grafana/Datadog or equivalent)
- Production metrics feed into dashboards accessible to all developers
- Incident data (post-mortems, error patterns) is available as agent context
- SLOs are defined and tracked for key services
- Incident data is structured for machine consumption (not just human-readable post-mortem docs)
Evidence
- Observability stack configuration (OTel collector, Grafana dashboards)
- Production metrics dashboards with developer access
- Incident data accessible via MCP or structured API
What It Is
"Incident data available for context" means that when an AI agent or human engineer begins investigating a production issue, all the relevant historical context is immediately accessible through programmatic interfaces - not buried in Confluence pages, Slack threads, or people's memories. This includes past incidents with their root cause, resolution steps, and affected services; runbooks with step-by-step investigation procedures; historical metric baselines that define what "normal" looks like; and the linkage between specific error patterns and their known remediation paths.
At L3, this context availability is primarily about making data accessible to agents via MCP servers. An MCP server wrapping PagerDuty's API allows an agent to query: "what incidents has the payment service had in the last 6 months?" An MCP server wrapping Confluence allows an agent to retrieve the runbook for a specific alert type. An MCP server wrapping your internal incident database allows an agent to find all incidents where this specific error code appeared and what the resolution was each time. The agent combines this historical context with the current production signals (metrics, traces, logs) to form an investigation hypothesis that would take a human hours to assemble.
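The two incident queries described above can be sketched as plain functions that a tool layer would expose to an agent. This is a minimal illustration, not any vendor's API: the `Incident` fields, the sample records, and the function names are all hypothetical, and registration with an actual MCP server is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Incident:
    # Hypothetical schema; a real incident database will differ.
    incident_id: str
    service: str
    opened_on: date
    error_code: str
    root_cause: str
    resolution: str


# Made-up sample data standing in for an internal incident database.
INCIDENTS = [
    Incident("INC-101", "payment", date(2024, 1, 10), "E_TIMEOUT",
             "connection pool exhausted", "raised pool size"),
    Incident("INC-102", "checkout", date(2024, 2, 2), "E_TIMEOUT",
             "slow downstream dependency", "added circuit breaker"),
    Incident("INC-103", "payment", date(2024, 5, 20), "E_AUTH",
             "expired certificate", "rotated certificate"),
    Incident("INC-090", "payment", date(2023, 3, 1), "E_TIMEOUT",
             "connection pool exhausted", "restarted service"),
]


def incidents_for_service(service: str, today: date, months: int = 6) -> list[Incident]:
    """Answer: what incidents has this service had in the last N months?"""
    cutoff = today - timedelta(days=30 * months)  # approximate month length
    return [i for i in INCIDENTS if i.service == service and i.opened_on >= cutoff]


def resolutions_for_error(error_code: str) -> list[tuple[str, str]]:
    """Answer: where did this error code appear, and how was it resolved?"""
    return [(i.incident_id, i.resolution) for i in INCIDENTS if i.error_code == error_code]
```

In a real deployment each function would be backed by the incident tool's API rather than an in-memory list; the point is that the agent receives typed records it can filter and combine with live signals.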
The architecture of this context layer matters. Incident data that lives only in human-readable pages is not agent-accessible in any practical sense. The agent needs structured, queryable interfaces: REST APIs, MCP tools, or search endpoints that return machine-parseable data. PagerDuty, OpsGenie, and Jira Service Management all expose APIs that can be wrapped as MCP tools. Confluence and Notion expose search APIs. Internal runbook systems can be queried via Elasticsearch or similar. The MCP layer is the translation between human-oriented incident management tools and agent-consumable context APIs.
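To make "machine-parseable" concrete, the sketch below turns an incident record into a JSON payload and rejects records that lack the fields an agent would need. The field list and the `to_agent_payload` name are assumptions chosen for illustration, not a schema taken from PagerDuty, OpsGenie, or any other tool.

```python
import json

# Hypothetical minimum set of fields for an agent-consumable incident record.
REQUIRED_FIELDS = (
    "incident_id",
    "service",
    "error_pattern",
    "root_cause",
    "resolution_steps",
)


def to_agent_payload(record: dict) -> str:
    """Serialize an incident record to JSON, failing loudly on missing fields."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError(f"incident record missing fields: {missing}")
    # Emit only the known fields so downstream consumers see a stable shape.
    return json.dumps({field: record[field] for field in REQUIRED_FIELDS}, sort_keys=True)
```

A prose-only post-mortem would fail this check, which is the practical difference between a human-readable page and a record an agent can query.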
Runbooks are the highest-value context assets for agent investigation. A well-structured runbook contains: the alert condition that triggers it, the likely root causes ranked by frequency, the investigation steps in order, the remediation steps for each root cause, and links to relevant dashboards and past incidents. An agent with access to this runbook can follow its investigation steps programmatically: query the metrics the runbook specifies, check the conditions the runbook lists, and execute the remediation steps if a root cause is confirmed. The runbook essentially programs the agent's investigation procedure.
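One way to picture a runbook "programming" the investigation is to store it as structured data and walk it in order. The sketch below assumes a simplified model in which each root cause is confirmed by a single metric exceeding a threshold; the `Runbook` and `RootCause` types, the metric names, and the `read_metric` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RootCause:
    name: str
    metric: str        # metric the runbook says to check
    threshold: float   # cause is considered confirmed above this value
    remediation: str   # remediation step recorded in the runbook


@dataclass(frozen=True)
class Runbook:
    alert: str
    root_causes: tuple[RootCause, ...]  # ordered most frequent first


def follow_runbook(runbook: Runbook, read_metric: Callable[[str], float]) -> Optional[RootCause]:
    """Check each listed root cause in order; return the first confirmed one."""
    for cause in runbook.root_causes:
        if read_metric(cause.metric) > cause.threshold:
            return cause
    return None  # nothing confirmed: escalate to a human
```

Here `read_metric` stands in for whatever query the observability stack provides. Whether the agent then executes the remediation or only proposes it is a policy decision outside this sketch.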
Why It Matters
Making incident data available for agent context enables qualitatively different incident response:
- Agent investigation approaches human-expert quality - a new on-call engineer has no institutional memory, whereas an agent with access to 2 years of incident history and runbooks can investigate with the knowledge of your most experienced SRE
- Runbook execution becomes automatable - when runbooks are structured data (not just prose) accessible via API, agents can follow them step by step rather than requiring humans to translate text into actions
- Pattern recognition across incident history - agents can identify that three seemingly different incidents in the past month all involved the same root cause, something a human might miss when investigating each incident in isolation
- Reduces tribal knowledge risk - incident knowledge stored only in people's heads is lost when those people leave; incident data in structured, queryable systems is organizational knowledge that persists
- Enables proactive context injection - when an alert fires, the agent automatically retrieves all relevant historical incidents, runbooks, and baselines before the human even opens the page, reducing time-to-context from 20 minutes to seconds
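The pattern-recognition point above can be illustrated with a small grouping pass over incident history. This is a sketch under the assumption that each incident already carries a recorded root-cause label; the function name and the simple label normalization are hypothetical, and real data would likely need fuzzier matching.

```python
from collections import defaultdict


def recurring_root_causes(incidents, min_count=3):
    """Group (incident_id, root_cause) pairs and keep causes seen min_count+ times."""
    by_cause = defaultdict(list)
    for incident_id, root_cause in incidents:
        # Normalize case and whitespace so trivially different labels group together.
        by_cause[root_cause.strip().lower()].append(incident_id)
    return {cause: ids for cause, ids in by_cause.items() if len(ids) >= min_count}
```

Three incidents filed against different services but sharing one root-cause label would surface here as a single entry, which is the cross-incident view a person investigating each one in isolation tends to miss.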
How Different Roles See It
Bob's team loses significant institutional knowledge every time an engineer leaves. Post-mortems are written but rarely referenced; runbooks exist but are not maintained; new on-call engineers spend their first incidents reinventing investigations that veterans handle in minutes. Bob wants to capture this knowledge in a form that persists and compounds.
Sarah is focused on reducing the expertise barrier for on-call rotation. Junior developers avoid on-call because they feel unprepared; senior developers carry disproportionate on-call burden because they are the only ones who can investigate incidents effectively. Accessible incident context is the equalizer.
Victor wants agents to be the primary consumers of the incident context layer. He is building an automated investigation agent that, when an alert fires, assembles all relevant context and produces a preliminary root cause hypothesis before any human engages. This requires the context layer to be comprehensive, structured, and API-accessible.
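A rough sketch of the context-assembly step such an agent might run when an alert fires is shown below. It assumes alerts and past incidents are plain dictionaries with `service` and `error_code` keys and that runbooks are looked up by alert name; all of these shapes, and the frequency-based hypothesis, are illustrative assumptions rather than a description of any particular system.

```python
from collections import Counter


def build_context(alert: dict, history: list, runbooks: dict) -> dict:
    """Collect related incidents and the runbook, then propose a first hypothesis."""
    related = [
        h for h in history
        if h["service"] == alert["service"] and h["error_code"] == alert["error_code"]
    ]
    # Rank previously confirmed root causes by how often they occurred.
    ranked = Counter(h["root_cause"] for h in related).most_common()
    return {
        "alert": alert,
        "related_incidents": [h["id"] for h in related],
        "runbook": runbooks.get(alert["name"]),
        # The most frequent past cause is only a starting point for investigation.
        "hypothesis": ranked[0][0] if ranked else None,
    }
```

The returned hypothesis is preliminary by design: it tells the on-call engineer where to look first and still has to be confirmed against current metrics, traces, and logs.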