Alerting on errors

· Basic application logging exists
· Alerting fires on application errors

· Logs are searchable (centralized logging, not just local files)
· Production issues do not yet feed back into dev priorities

Evidence

· Logging configuration in application code
· Alert configuration (PagerDuty, Opsgenie, or equivalent)

What It Is

Alerting on errors is the practice of automatically notifying a human when something goes wrong in production - before a customer reports it. At the basic level (L1), this means hooking your application's error output to a notification system: PagerDuty, Opsgenie, or simply a Slack channel that receives error messages. When an unhandled exception is thrown, an HTTP 500 is returned, or an error log line is written, a human gets pinged. That person then logs in, investigates, and decides what to do.

This is an enormous step up from pure reactive operations - where the first signal of a production problem is a customer complaint - but it remains fundamentally human-driven. The alert is a page to a person, not a trigger for automated investigation. The on-call engineer receives "Error: NullPointerException in PaymentService" at 2am and must manually trace what caused it, assess the impact, and determine the fix. The alerting system provides the notification; everything after that is manual.

At L1, alerts are typically coarse-grained: any error triggers a page, any 5xx rate above a threshold fires an alarm. There is little sophistication around alert severity, routing, or suppression. The same alert fires for a transient database connection blip and for a total payment system outage. This leads to the classic "alert fatigue" failure mode: teams that page too aggressively train their engineers to ignore pages, which defeats the purpose of alerting entirely.

The critical gap in L1 alerting is the absence of context in the alert itself. The notification tells you something broke; it does not tell you what changed, which users were affected, whether it is getting worse or better, or what the likely cause is. All of that investigation happens manually after the page fires. For AI-assisted operations, this gap is the primary bottleneck: agents can be paged just like humans, but without structured context attached to the alert, they have as little to work with as a human receiving a terse error message.
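The basic hookup described in this section can be sketched with nothing but Python's standard logging module. This is a minimal illustration, not a replacement for an error tracker: AlertHandler and build_logger are hypothetical names, and the notify callback stands in for whatever actually delivers the message (a Slack webhook call, a PagerDuty client, and so on).

```python
import logging
from typing import Callable


class AlertHandler(logging.Handler):
    """Forward ERROR-and-above log records to a notification callback.

    `notify` is a stand-in (assumption) for the real delivery mechanism.
    """

    def __init__(self, notify: Callable[[str], None]) -> None:
        # Records below ERROR never reach emit().
        super().__init__(level=logging.ERROR)
        self._notify = notify

    def emit(self, record: logging.LogRecord) -> None:
        try:
            message = f"[{record.levelname}] {record.name}: {record.getMessage()}"
            self._notify(message)
        except Exception:
            # A failing notifier must not crash the application.
            self.handleError(record)


def build_logger(name: str, notify: Callable[[str], None]) -> logging.Logger:
    """Return a logger whose error lines are also sent to `notify`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(AlertHandler(notify))
    logger.propagate = False  # keep the example's output self-contained
    return logger
```

Note that the message this handler sends is exactly the kind of terse, context-free alert the paragraph above describes: it says what broke, and nothing else.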

Why It Matters

Even basic error alerting is a significant maturity step with real operational benefits:

· Shifts from reactive to proactive - the team learns about production issues from the monitoring system rather than from customer complaints, reducing the time-to-detection from hours to minutes
· Creates an incident record - alert history provides a log of production events that informs post-mortems and helps identify recurring patterns
· Establishes the on-call culture - defining who receives alerts, how they escalate, and what the response SLA is builds the operational discipline that advanced observability practices require
· Baseline for improvement - you cannot tune alerts, reduce noise, or add context to them until you have alerts running; basic alerting is the prerequisite for every more sophisticated alerting practice
· Enables agent notification - even crude error alerts can be routed to an agent as the starting signal for investigation, provided the agent has access to supporting context elsewhere

Getting Started

· Integrate an error tracking tool - Sentry is a common choice for application-level error tracking. It captures unhandled exceptions with full stack traces, groups similar errors, and provides alert routing. Adding the Sentry SDK to an application typically takes an afternoon. For infrastructure-level alerting, connect your metrics to PagerDuty or Opsgenie.
· Define alert severity tiers - Not every error should page an engineer at 2am. Define three tiers: P1 (customer-impacting, page immediately), P2 (degraded functionality, notify during business hours), P3 (noise or investigation needed, log and review weekly). Assign a tier to every alert before enabling it.
· Set thresholds, not just boolean triggers - Alerting on every single 5xx response will produce noise. Instead, alert when, for example, the 5xx rate exceeds 1% of requests over a 5-minute window, or the error count exceeds 10 per minute. Rate-based thresholds filter out transient blips while catching sustained problems.
· Configure alert routing - Alerts should go to the team responsible for the affected service, not a single shared inbox. Use PagerDuty or Opsgenie routing rules to match the alert source to the on-call team. This prevents alert fatigue from irrelevant pages and ensures the right person investigates.
· Add runbooks to every alert - Every alert should link to a runbook: a document that says "when this alert fires, check these things in this order." The runbook does not need to be comprehensive on day one - even a paragraph helps. This runbook becomes the context that agents will later use for automated investigation.
· Review alert noise weekly - For the first month, review which alerts fired and whether each page was actionable. Alerts that fire frequently but never require action should be tuned or removed. Alert fatigue is a failure mode that undermines your entire observability investment.
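The rate-based threshold from the steps above (5xx rate over 1% of requests in a 5-minute window) can be sketched as a sliding-window check. This is an illustrative sketch, not how any particular monitoring product evaluates rules: ErrorRateMonitor is a hypothetical class, and the min_requests guard is an added assumption so that one error among three requests does not count as a 33% error rate.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class ErrorRateMonitor:
    """Sliding-window 5xx rate check. Timestamps are seconds, non-decreasing."""

    window_seconds: float = 300.0   # 5-minute window
    max_error_rate: float = 0.01    # alert above 1% of requests
    min_requests: int = 100         # assumption: ignore tiny samples
    _events: Deque[Tuple[float, bool]] = field(
        default_factory=deque, init=False, repr=False
    )

    def record(self, timestamp: float, status_code: int) -> None:
        """Store one request outcome and drop anything older than the window."""
        self._events.append((timestamp, status_code >= 500))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def should_alert(self, now: float) -> bool:
        """True when the windowed 5xx rate exceeds the threshold."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.max_error_rate
```

With these defaults, a single 500 among a couple of hundred successful requests stays below the threshold, while a short burst of failures on the same traffic pushes the rate above 1% and fires the alert.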

Tip

The first time you set up Sentry or a similar error tracker, you will discover errors that have been silently occurring in production for months. Budget time to triage this backlog before assuming every new alert is urgent - many will be longstanding low-priority issues you never knew about.

Common Pitfalls

Alerting on everything, paging on everything. The fastest way to destroy an on-call culture is to page engineers for every non-critical error. When pages fire 20 times a week for issues that resolve themselves, engineers stop treating pages as urgent. Distinguish between "this needs immediate human attention" and "this is worth knowing about" before routing any alert to PagerDuty.
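One way to make that distinction explicit is to classify every alert into the P1/P2/P3 tiers from Getting Started before it is routed anywhere. The sketch below is an assumption-laden illustration: the classify inputs, the destination strings, and the "unowned-alerts" fallback are all hypothetical, and real routing would live in PagerDuty or Opsgenie rules rather than application code.

```python
from enum import Enum
from typing import Dict


class Severity(Enum):
    P1 = "page"    # customer-impacting: page immediately
    P2 = "notify"  # degraded functionality: notify during business hours
    P3 = "log"     # noise or needs investigation: log and review weekly


def classify(customer_impacting: bool, degraded: bool) -> Severity:
    """Map two coarse impact flags (hypothetical inputs) onto a tier."""
    if customer_impacting:
        return Severity.P1
    if degraded:
        return Severity.P2
    return Severity.P3


def route(severity: Severity, service: str, owners: Dict[str, str]) -> str:
    """Return a destination label; only P1 ever results in a page."""
    team = owners.get(service, "unowned-alerts")  # hypothetical fallback queue
    if severity is Severity.P1:
        return f"page:{team}"
    if severity is Severity.P2:
        return f"channel:{team}"
    return "weekly-review"
```

For example, route(classify(False, True), "payments", {"payments": "payments-oncall"}) yields a channel notification for the payments team rather than a page.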

No runbooks attached to alerts. An alert that says "PaymentService error rate high" with no additional context forces the on-call engineer to start from scratch every time. Even a minimal runbook - what to check, what the likely causes are, what the previous occurrence looked like - dramatically reduces mean time to resolution and is the prerequisite for any agent-assisted investigation.

Not setting up alert suppression or deduplication. A single cascading failure can trigger dozens of alerts from different systems. Without deduplication, the on-call engineer receives 40 pages in 5 minutes - an unhelpful flood rather than an actionable signal. Configure alert grouping in PagerDuty or Opsgenie and deduplication in your alerting rules before going live.
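The idea behind deduplication can be sketched as a suppression window keyed on a fingerprint. This is a simplified illustration under stated assumptions: the (service, error_type) key and the 5-minute window are hypothetical choices, and hosted tools apply their own grouping logic that this does not reproduce.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Fingerprint = Tuple[str, str]  # (service, error_type), a hypothetical key


@dataclass
class Deduplicator:
    """Send one notification per fingerprint per suppression window."""

    suppress_seconds: float = 300.0
    _last_sent: Dict[Fingerprint, float] = field(
        default_factory=dict, init=False, repr=False
    )
    suppressed_counts: Dict[Fingerprint, int] = field(
        default_factory=dict, init=False
    )

    def should_notify(self, service: str, error_type: str, now: float) -> bool:
        """Return True only for the first alert of a fingerprint in the window."""
        key = (service, error_type)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.suppress_seconds:
            # Count the duplicate so the eventual page can say "and N more".
            self.suppressed_counts[key] = self.suppressed_counts.get(key, 0) + 1
            return False
        self._last_sent[key] = now
        return True
```

Fed 40 identical alerts within the window, this sends one notification and counts 39 suppressed duplicates, while an alert from a different service still gets through.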

Forgetting infrastructure alerts alongside application alerts. Application error tracking (Sentry) captures code-level exceptions but misses disk full, memory pressure, network partition, and database connection pool exhaustion. These infrastructure-level problems are often the root cause of application errors. A complete alerting setup covers both layers.

No post-mortem process connected to alerts. Alerts that fire and are resolved without a follow-up record create no organizational learning. When the same alert fires three times in a month, you should know that - and have documented what was done each time. Connect your alerting tool to your incident management process (even a simple Slack channel or Jira ticket) so every significant alert creates a record.

How Different Roles See It

Bob, Head of Engineering

Bob's team is currently learning about production problems from customer support tickets. The time-to-detection for production incidents is measured in hours or days. He wants to shift to proactive detection but is not sure where to start given the team has no existing monitoring infrastructure.

What Bob should do: Bob should start with Sentry for application error tracking and PagerDuty for alert routing - both have free tiers that cover small teams. The goal in the first two weeks is not comprehensive coverage; it is getting any alert to the right person within five minutes of a production error. Bob should assign one engineer to own the alerting setup for each major service, and make alert coverage a definition-of-done criterion for new features. Within a month, the team should have basic error alerting on all customer-facing services, a simple severity classification, and at least stub runbooks for the top five alert types. The cultural shift - from "wait for customers to complain" to "we know before customers do" - is as important as the tooling.

Sarah, Productivity Lead

Sarah tracks developer experience metrics and knows that production incidents are a major source of developer stress and context-switching. On-call rotations are dreaded because incidents take hours to diagnose with no tooling support. She wants to make on-call more manageable.

What Sarah should do: Sarah should instrument the on-call experience as a developer productivity metric: how many pages fire per week, what the mean time to resolution is, and what fraction of pages are actionable versus noise. These numbers establish the baseline against which future improvements - structured logging, richer alerts, runbooks, agent investigation - will be measured. Sarah should also push for runbook quality as a first-class engineering investment. A runbook that reduces a 2-hour incident investigation to a 20-minute checklist is a direct productivity win. Over time, these runbooks become the training data for agent-automated investigation - every step a human follows during an incident is a step an agent can be taught to follow automatically.
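The three baseline numbers Sarah needs can be computed from a simple export of page history. The sketch below assumes a hypothetical Page record with fire and resolve timestamps plus a manually assigned actionable flag; how those fields are obtained from a given alerting tool is outside its scope.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Page:
    fired_at: float     # epoch seconds
    resolved_at: float  # epoch seconds
    actionable: bool    # judged during the weekly noise review


def on_call_baseline(pages: List[Page], weeks: float) -> Dict[str, float]:
    """Compute pages per week, mean time to resolution, and actionable share."""
    if not pages or weeks <= 0:
        raise ValueError("need at least one page and a positive number of weeks")
    total = len(pages)
    total_seconds = sum(p.resolved_at - p.fired_at for p in pages)
    return {
        "pages_per_week": total / weeks,
        "mean_time_to_resolution_minutes": total_seconds / total / 60,
        "actionable_fraction": sum(1 for p in pages if p.actionable) / total,
    }
```

Tracking these values week over week shows whether later investments such as runbooks or richer alerts are actually shortening incidents and cutting noise.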

Victor, Staff Engineer - AI Champion

Victor wants to wire AI agents into the incident response pipeline. His goal is to have an agent receive an alert, pull the relevant context, and produce a preliminary diagnosis before any human needs to engage. The current alerting setup is too thin to support this - alerts carry no structured context that an agent can act on.

What Victor should do: Victor should design alert payloads as agent inputs from the start. When an alert fires in PagerDuty, its payload should include: the service name, the error message, a link to the relevant Sentry error group, a link to the most recent deployment at alert time, and a link to the runbook. An agent receiving this payload can immediately query Sentry for the full stack trace, check the deployment history, pull the runbook steps, and start forming a hypothesis - all before a human is paged. Victor should build a proof-of-concept where a Slack alert fires and an agent automatically replies with a preliminary analysis. Even if the analysis is occasionally wrong, the signal to the team that automated investigation is possible is worth the experiment.
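The payload Victor describes can be pinned down as a small typed structure with a completeness check. This is a sketch under assumptions: the field names and the missing_fields helper are hypothetical, and the shape is not the schema of PagerDuty, Sentry, or any other tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class AlertPayload:
    """The five context fields listed above (field names are assumptions)."""

    service: str
    error_message: str
    error_group_url: str    # link to the Sentry error group
    latest_deploy_url: str  # link to the most recent deployment at alert time
    runbook_url: str        # link to the runbook

    def to_json(self) -> str:
        """Serialize for delivery to a webhook or agent."""
        return json.dumps(asdict(self), sort_keys=True)


def missing_fields(payload: AlertPayload) -> List[str]:
    """Name the fields left blank, so thin alerts are caught before going live."""
    return [name for name, value in asdict(payload).items() if not value.strip()]
```

Running missing_fields over every configured alert gives a quick list of which alerts still lack a runbook or deployment link, which is the same gap that limits agent investigation.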
