Maturity Matrix

Alerting on errors

Alerting on errors is the practice of automatically notifying a human when something goes wrong in production - before a customer reports it.

  • Basic application logging exists
  • Alerting fires on application errors
  • Logs are searchable (centralized logging, not just local files)
  • No feedback loop exists between production issues and development priorities

Evidence

  • Logging configuration in application code
  • Alert configuration (PagerDuty, Opsgenie, or equivalent)

What It Is

Alerting on errors is the practice of automatically notifying a human when something goes wrong in production - before a customer reports it. At the basic level (L1), this means hooking your application's error output to a notification system: PagerDuty, Opsgenie, or simply a Slack channel that receives error messages. When an unhandled exception is thrown, an HTTP 500 is returned, or an error log line is written, a human gets notified. That person then logs in, investigates, and decides what to do.
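As a rough illustration of hooking error output to a notification system, the sketch below forwards every ERROR-level log record to a webhook. The class name WebhookAlertHandler, the helper post_json, and the example URL are hypothetical. The {"text": ...} payload follows the shape Slack incoming webhooks accept; other tools will expect a different body, so treat it as an assumption.

```python
import json
import logging
import urllib.request
from typing import Callable


def post_json(url: str, payload: dict) -> None:
    """POST a JSON payload to a webhook URL (hypothetical endpoint)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5):
        pass


class WebhookAlertHandler(logging.Handler):
    """Forward ERROR-and-above log records to a notification webhook."""

    def __init__(self, url: str, send: Callable[[str, dict], None] = post_json):
        # The handler level filters out anything below ERROR.
        super().__init__(level=logging.ERROR)
        self.url = url
        self.send = send  # injectable so the handler can be tested offline

    def emit(self, record: logging.LogRecord) -> None:
        text = f"[{record.levelname}] {record.name}: {record.getMessage()}"
        try:
            self.send(self.url, {"text": text})
        except Exception:
            # A failed notification must never crash the application.
            self.handleError(record)
```

Attaching it is one line, for example logging.getLogger("payments").addHandler(WebhookAlertHandler("https://hooks.example.invalid/alerts")). Note that the message a human receives is exactly as terse as the log line that produced it.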

This is an enormous step up from purely reactive operations - where the first signal of a production problem is a customer complaint - but it remains fundamentally human-driven. The alert is a page to a person, not a trigger for automated investigation. The on-call engineer receives "Error: NullPointerException in PaymentService" at 2 a.m. and must manually trace what caused it, assess the impact, and determine the fix. The alerting system provides the notification; everything after that is manual.

At L1, alerts are typically coarse-grained: any error triggers a page, any 5xx rate above a threshold fires an alarm. There is little sophistication around alert severity, routing, or suppression. The same alert fires for a transient database connection blip and for a total payment system outage. This leads to the classic "alert fatigue" failure mode: teams that page too aggressively train their engineers to ignore pages, which defeats the purpose of alerting entirely.
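A minimal sketch of such a coarse rule, assuming a sliding window over recent response codes, is shown here. The class name ErrorRateAlarm and the default window_size and threshold values are illustrative choices, not recommendations from any particular tool.

```python
from collections import deque


class ErrorRateAlarm:
    """Coarse L1-style rule: page when the 5xx share of recent requests
    exceeds a fixed threshold. No severity, routing, or suppression."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # Keep only the most recent `window_size` status codes.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code: int) -> bool:
        """Record one response and return True if the rule would page."""
        self.statuses.append(status_code)
        errors = sum(1 for status in self.statuses if 500 <= status <= 599)
        return errors / len(self.statuses) > self.threshold
```

With these defaults, six 503 responses after 94 successes and an unbroken run of 503 responses both make record return True. The rule produces the same page for a short blip and for a full outage, which is the lack of differentiation that feeds alert fatigue.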

The critical gap in L1 alerting is the absence of context in the alert. The notification tells you something broke; it does not tell you what changed, which users were affected, whether it is getting worse or better, or what the likely cause is. All of that investigation happens manually after the page fires. For AI-assisted operations, this gap is the primary bottleneck: agents can be paged just like humans, but without structured context attached to the alert, they have as little to work with as a human receiving a terse error message.
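To make the gap concrete, the sketch below models an alert as a structured record with one optional field per question listed above. All names (EnrichedAlert, recent_change, affected_users, trend, suspected_cause) are hypothetical and do not correspond to the schema of PagerDuty, Opsgenie, or any other tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class EnrichedAlert:
    """An alert plus the context a responder (human or agent) would need."""

    message: str
    service: str
    recent_change: Optional[str] = None    # what changed before the error
    affected_users: Optional[int] = None   # how many users were affected
    trend: Optional[str] = None            # e.g. "rising", "falling", "steady"
    suspected_cause: Optional[str] = None  # best current guess, if any

    def missing_context(self) -> List[str]:
        """Names of the context fields that are still empty."""
        return [name for name, value in asdict(self).items() if value is None]

    def to_json(self) -> str:
        """Serialize the alert so it can be attached to a notification."""
        return json.dumps(asdict(self), sort_keys=True)
```

A typical L1 alert populates only message and service, so missing_context() returns all four context fields. Each one is a question the responder has to answer by hand after the page fires.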

Why It Matters

Even basic error alerting is a significant maturity step with real operational benefits:

  • Shifts from reactive to proactive - the team learns about production issues from the monitoring system rather than from customer complaints, reducing the time-to-detection from hours to minutes
  • Creates an incident record - alert history provides a log of production events that informs post-mortems and helps identify recurring patterns
  • Establishes the on-call culture - defining who receives alerts, how they escalate, and what the response SLA is builds the operational discipline that advanced observability practices require
  • Baseline for improvement - you cannot tune alerts, reduce noise, or add context to them until you have alerts running; basic alerting is the prerequisite for every more sophisticated alerting practice
  • Enables agent notification - even crude error alerts can be routed to an agent as the starting signal for investigation, provided the agent has access to supporting context elsewhere

Getting Started

6 steps to get from here to the next level

Common Pitfalls

Mistakes teams actually make at this stage - and how to avoid them

How Different Roles See It

Bob, Head of Engineering

Bob's team is currently learning about production problems from customer support tickets. The time-to-detection for production incidents is measured in hours or days. He wants to shift to proactive detection but is not sure where to start given the team has no existing monitoring infrastructure.


Sarah, Productivity Lead

Sarah tracks developer experience metrics and knows that production incidents are a major source of developer stress and context-switching. On-call rotations are dreaded because incidents take hours to diagnose with no tooling support. She wants to make on-call more manageable.


Victor, Staff Engineer - AI Champion

Victor wants to wire AI agents into the incident response pipeline. His goal is to have an agent receive an alert, pull the relevant context, and produce a preliminary diagnosis before any human needs to engage. The current alerting setup is too thin to support this - alerts carry no structured context that an agent can act on.
