Maturity Matrix

Post-deploy monitoring

  • Structured logging is implemented (JSON logs with consistent fields)
  • OpenTelemetry basic instrumentation is deployed (traces and metrics)
  • Post-deploy monitoring checks run after each deployment
  • Traces are correlated across services
  • Post-deploy checks include automated smoke tests

Evidence

  • Structured logging configuration showing JSON format with standard fields
  • OpenTelemetry SDK configuration in application code
  • Post-deploy monitoring job configuration in CD pipeline

May 2026 Update

Post-deploy monitoring now includes token-cost telemetry as table stakes. ccusage (13.2k stars on GitHub, ccusage.com) tracks per-session and per-project spend from local JSONL with cache breakdown and offline pricing. Claude-Code-Usage-Monitor adds live charts and "time-to-limit" predictions. Both /usage and /context shipped as built-in commands in April. Treat agent token spend the same way you treat error rate and p95 latency - it is a leading indicator that often spikes before the user-facing impact.

For Claude-based pipelines specifically, also watch the harness-quality signals from Stella Laurenzo's 6,852-session audit: median thinking length per turn and files read before edit. A sharp drop in those metrics (Anthropic confirmed one in March-April 2026, caused by harness changes rather than the model) is the same kind of leading indicator that a change in error rate is for code deploys.
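One way to treat these signals like error rate is to compare a recent window against a baseline and flag large relative moves. The sketch below is illustrative only: the function names, the 30% tolerance, and the sample numbers are assumptions, not values from ccusage or the audit.

```python
from statistics import median


def relative_change(baseline: list[float], current: list[float]) -> float:
    """Relative change of the current median versus the baseline median."""
    base = median(baseline)
    if base == 0:
        raise ValueError("baseline median is zero; relative change is undefined")
    return (median(current) - base) / base


def drifted(baseline: list[float], current: list[float],
            direction: str, tolerance: float = 0.3) -> bool:
    """Flag a spike ("up") or a drop ("down") beyond the assumed tolerance."""
    change = relative_change(baseline, current)
    if direction == "up":
        return change > tolerance
    if direction == "down":
        return change < -tolerance
    raise ValueError("direction must be 'up' or 'down'")


# Hypothetical per-session samples before and after a change.
tokens_before = [12000, 15000, 11000, 14000]
tokens_after = [21000, 19000, 24000, 20000]
files_read_before = [6, 5, 7, 6]
files_read_after = [2, 3, 2, 2]

spend_spiked = drifted(tokens_before, tokens_after, "up")
reads_dropped = drifted(files_read_before, files_read_after, "down")
```

Token spend is checked for spikes and the harness-quality signals for drops, so the direction is explicit per metric rather than inferred.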

What It Is

Post-deploy monitoring is the practice of actively watching key production metrics in the minutes and hours after a deployment, with the goal of detecting deployment-induced regressions before they cause sustained user impact. Instead of deploying and moving on, the team treats the post-deployment window as a distinct monitoring phase: error rates, latency distributions, and business metrics are watched with heightened attention and lower alert thresholds immediately after a release.
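The "lower alert thresholds immediately after a release" idea can be sketched as a threshold that depends on time since the last deploy. The window length and both threshold values below are assumed for illustration; a real team would set its own.

```python
from datetime import datetime, timedelta

# Assumed values, chosen only to illustrate the idea.
POST_DEPLOY_WINDOW = timedelta(minutes=30)
NORMAL_ERROR_RATE_THRESHOLD = 0.05
POST_DEPLOY_ERROR_RATE_THRESHOLD = 0.01


def error_rate_threshold(now: datetime, last_deploy: datetime) -> float:
    """Return the stricter threshold while inside the post-deploy window."""
    in_window = timedelta(0) <= now - last_deploy <= POST_DEPLOY_WINDOW
    if in_window:
        return POST_DEPLOY_ERROR_RATE_THRESHOLD
    return NORMAL_ERROR_RATE_THRESHOLD


def should_alert(error_rate: float, now: datetime, last_deploy: datetime) -> bool:
    """Alert when the observed error rate exceeds the active threshold."""
    return error_rate > error_rate_threshold(now, last_deploy)


deploy_time = datetime(2026, 5, 1, 12, 0)
alert_soon_after = should_alert(0.02, deploy_time + timedelta(minutes=10), deploy_time)
alert_much_later = should_alert(0.02, deploy_time + timedelta(hours=2), deploy_time)
```

The same 2% error rate alerts ten minutes after the deploy but not two hours later, which is the heightened-attention behavior described above.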

At L2, post-deploy monitoring is a semi-manual practice. A deployment triggers a notification in Slack, the deploying developer or a designated reviewer watches the dashboard for 10-30 minutes, and if nothing alarming appears, the deployment is considered stable. The monitoring might involve watching a Grafana dashboard, checking Sentry for new error groups, or looking at trace volume in Datadog APM. The key distinction from no monitoring is intentionality: the team has defined what "healthy after deploy" looks like and is actively verifying that the new version meets that definition.
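Writing the "healthy after deploy" definition down as data makes the manual check repeatable, even while a human is still the one reading the dashboard. A minimal sketch follows; the field names and limits are hypothetical and not tied to Grafana, Sentry, or Datadog APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthDefinition:
    """Hypothetical limits a team might agree on for the post-deploy window."""
    max_error_rate: float
    max_p95_latency_ms: float
    max_new_error_groups: int


@dataclass(frozen=True)
class Observation:
    """Values the reviewer reads off the dashboards during the window."""
    error_rate: float
    p95_latency_ms: float
    new_error_groups: int


def violations(definition: HealthDefinition, obs: Observation) -> list[str]:
    """List every limit the observation breaks; an empty list means healthy."""
    problems = []
    if obs.error_rate > definition.max_error_rate:
        problems.append("error rate above limit")
    if obs.p95_latency_ms > definition.max_p95_latency_ms:
        problems.append("p95 latency above limit")
    if obs.new_error_groups > definition.max_new_error_groups:
        problems.append("new error groups above limit")
    return problems


definition = HealthDefinition(max_error_rate=0.01, max_p95_latency_ms=400.0,
                              max_new_error_groups=0)
healthy = violations(definition, Observation(0.004, 310.0, 0))
unhealthy = violations(definition, Observation(0.03, 350.0, 2))
```

Returning the list of broken limits, rather than a single boolean, gives the reviewer something concrete to post back to the deployment thread.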

Canary deployments are the natural complement to post-deploy monitoring at this level. Rather than deploying to 100% of traffic immediately, a canary routes a small percentage (5-10%) of requests to the new version while the monitoring phase runs. If metrics stay healthy, traffic gradually shifts to 100%. If metrics degrade, the canary is rolled back before the majority of traffic is affected. Kubernetes supports canary patterns through replica ratios across Deployments, ingress controllers with weighted routing, or a service mesh; AWS and GCP offer traffic-splitting in their deployment services. The canary reduces the blast radius of a bad deployment from "all users" to "a small percentage of users for a short window."
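The gradual shift can be pictured as a loop over traffic weights with a health gate at each step. This is a platform-neutral sketch: the step percentages are assumed, and `is_healthy` stands in for whatever monitoring check the team runs at each weight.

```python
from typing import Callable, Sequence


def ramp_canary(is_healthy: Callable[[int], bool],
                steps: Sequence[int] = (5, 25, 50, 100)) -> list[int]:
    """Return the sequence of canary traffic weights that were applied.

    A trailing 0 means the canary was rolled back at the preceding weight.
    """
    history: list[int] = []
    for percent in steps:
        history.append(percent)        # shift this share of traffic to the canary
        if not is_healthy(percent):    # monitoring phase at the current weight
            history.append(0)          # roll back: canary receives no traffic
            return history
    return history


promoted = ramp_canary(lambda percent: True)
rolled_back = ramp_canary(lambda percent: percent < 50)
```

In the second call the check fails once the canary holds 50% of traffic, so the weight drops to 0 and the remaining steps never run.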

Health checks are the baseline mechanism for post-deploy monitoring: each service exposes an HTTP endpoint that returns 200 when healthy and a non-200 status when something is wrong. Kubernetes uses readiness and liveness probes. Load balancers use health checks to determine which instances receive traffic. At L2, these checks verify that the service started correctly and responds to requests; they do not yet verify business logic correctness. Combined with watching error rates and latency in the minutes after deploy, health checks form the basic canary evaluation criterion.
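A health endpoint of this kind can be very small. The sketch below uses only the Python standard library; the `/healthz` path, the port, and the `dependencies_ok` helper are assumptions for illustration, and a real service would replace the helper with its own startup and dependency checks.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def dependencies_ok() -> bool:
    """Hypothetical placeholder: check database connections, caches, etc."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        healthy = dependencies_ok()
        body = b"ok" if healthy else b"unhealthy"
        # 200 when healthy, 503 otherwise, so probes and load balancers
        # can decide from the status code alone.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of stderr in this sketch


def serve(port: int = 8080) -> None:
    """Run the health endpoint until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint; a readiness probe or load balancer health check would then be pointed at `/healthz` on that port.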

Why It Matters

The deployment window is the highest-risk period in any service's lifecycle:

  • Most production incidents are deployment-caused - commonly cited industry estimates put 60-80% of production incidents shortly after a deployment; monitoring this window catches the majority of incidents at the earliest possible moment
  • Blast radius is smallest immediately after deploy - a regression caught in the first 5 minutes after deploying affects far fewer users than one caught hours later by a customer complaint
  • Enables faster deployment cycles - teams that have reliable post-deploy monitoring deploy more frequently because they trust the safety net; without it, fear of regressions leads to batching changes into large, risky releases
  • Provides deployment quality signal - tracking the percentage of deployments that required rollback or triggered an alert creates a metric for deployment quality that informs where to invest in testing
  • Creates the data foundation for automated canary evaluation - the manual post-deploy monitoring practice, once instrumented, can be automated: define the healthy thresholds, run the deploy, let the system evaluate the canary and promote or roll back automatically
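The deployment quality signal mentioned in the list above can be computed from a simple deployment log. The record shape below is hypothetical; the point is only that the metric is a percentage of deployments that were rolled back or triggered an alert, optionally grouped by service.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment log record."""
    service: str
    rolled_back: bool
    triggered_alert: bool


def failure_rate(deployments: list[Deployment]) -> float:
    """Percentage of deployments that were rolled back or triggered an alert."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.rolled_back or d.triggered_alert)
    return 100.0 * failed / len(deployments)


def failure_rate_by_service(deployments: list[Deployment]) -> dict[str, float]:
    """Same metric per service, to show where testing investment is needed."""
    grouped: dict[str, list[Deployment]] = defaultdict(list)
    for d in deployments:
        grouped[d.service].append(d)
    return {service: failure_rate(items) for service, items in grouped.items()}


log = [
    Deployment("checkout", rolled_back=True, triggered_alert=True),
    Deployment("checkout", rolled_back=False, triggered_alert=False),
    Deployment("search", rolled_back=False, triggered_alert=False),
    Deployment("search", rolled_back=False, triggered_alert=False),
]
overall = failure_rate(log)
per_service = failure_rate_by_service(log)
```

With this sample log, one of four deployments failed overall, and all of the failures are concentrated in one service.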

Getting Started

6 steps to get from here to the next level

Common Pitfalls

Mistakes teams actually make at this stage - and how to avoid them

How Different Roles See It

Bob, Head of Engineering

Bob's team has had several incidents where a deployment was the root cause but no one noticed until customers complained 2-3 hours later. He wants to implement a post-deploy monitoring practice but is not sure how to make it a consistent team behavior rather than something that happens when individuals remember to do it.

What Bob should do - role-specific action plan

Sarah, Productivity Lead

Sarah has noticed that fear of deployment causes developers to batch changes into large, infrequent releases. Developers avoid deploying on Fridays, delay releases until the end of the sprint, and sometimes hold back small changes for weeks waiting for larger releases. This reduces deployment frequency and increases risk, because larger batches mean harder rollbacks.

What Sarah should do - role-specific action plan

Victor, Staff Engineer - AI Champion

Victor wants to automate the post-deploy monitoring decision. Rather than a human watching a dashboard for 10 minutes, he wants the deployment pipeline to automatically evaluate canary health and either promote to full traffic or roll back, without human intervention. This requires the health criteria to be codified and the evaluation to be programmable.
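Codified criteria of the kind Victor needs might look like the sketch below, which compares canary metrics against the stable baseline and returns a promote-or-rollback decision. The metric set and the tolerance values are assumptions; they are not drawn from any particular canary analysis tool.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class Metrics:
    error_rate: float
    p95_latency_ms: float


@dataclass(frozen=True)
class Criteria:
    """Assumed tolerances for how much worse the canary may be than baseline."""
    max_error_rate_increase: float = 0.005  # absolute increase in error rate
    max_latency_ratio: float = 1.2          # canary p95 / baseline p95


def evaluate_canary(baseline: Metrics, canary: Metrics,
                    criteria: Criteria = Criteria()) -> Decision:
    """Promote only if every criterion holds; otherwise roll back."""
    error_ok = (canary.error_rate - baseline.error_rate
                <= criteria.max_error_rate_increase)
    latency_ok = (canary.p95_latency_ms
                  <= baseline.p95_latency_ms * criteria.max_latency_ratio)
    return Decision.PROMOTE if error_ok and latency_ok else Decision.ROLLBACK


baseline = Metrics(error_rate=0.010, p95_latency_ms=200.0)
good = evaluate_canary(baseline, Metrics(error_rate=0.012, p95_latency_ms=210.0))
bad = evaluate_canary(baseline, Metrics(error_rate=0.030, p95_latency_ms=210.0))
```

Comparing against the baseline rather than fixed limits keeps the decision meaningful when overall traffic conditions shift during the evaluation window.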

What Victor should do - role-specific action plan