Full observability stack (OTel + Grafana)
A full observability stack means having all three telemetry pillars - metrics, traces, and logs - collected, correlated, and queryable in a unified system.
- Full observability stack is operational (OpenTelemetry + Grafana/Datadog or equivalent)
- Production metrics feed into dashboards accessible to all developers
- Incident data (post-mortems, error patterns) is available as agent context
- SLOs are defined and tracked for key services
- Incident data is structured for machine consumption (not just human-readable post-mortem docs)
Evidence
- Observability stack configuration (OTel collector, Grafana dashboards)
- Production metrics dashboards with developer access
- Incident data accessible via MCP or structured API
What It Is
A full observability stack means having all three telemetry pillars - metrics, traces, and logs - collected, correlated, and queryable in a unified system. At L3, this typically means OpenTelemetry for instrumentation, Prometheus for metrics storage, Grafana Tempo for distributed traces, Grafana Loki for logs, and Grafana as the unified query and visualization layer that ties all three together. The defining characteristic of this level is correlation: from a single Grafana dashboard, you can click an anomalous metric spike, see the traces from that time window, and drill into the logs for a specific trace - all without switching tools or losing context.
The OTel Collector is the hub of this architecture. Every service ships telemetry to the Collector over OTLP. The Collector applies transformations, sampling decisions, and routing: metrics go to Prometheus (via remote_write), traces go to Tempo, and logs go to Loki. Grafana connects to all three as data sources. With this unified pipeline, adding a new service to observability requires only configuring it to send OTLP to the Collector; the rest of the pipeline handles it automatically. It also decouples instrumentation from storage: because services talk only to the Collector, the backend stack can change without touching the instrumentation code in each service.
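The routing described above can be sketched as a Collector-style configuration built in Python. This is a minimal illustration, not a production config: the endpoint URLs are hypothetical, the `build_collector_config` helper is invented for this example, and the exporter names (`prometheusremotewrite`, `otlp`, `otlphttp`) assume a Collector distribution that ships those components. The output is JSON, which the Collector's YAML parser should also accept since YAML is a superset of JSON.

```python
import json


def build_collector_config(prometheus_url, tempo_endpoint, loki_url):
    """Return a Collector-style config: one OTLP receiver, one pipeline per signal.

    All three endpoint arguments are placeholders chosen for this sketch.
    """
    return {
        # Every service sends OTLP (gRPC or HTTP) to this single receiver.
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "processors": {"batch": {}},
        "exporters": {
            "prometheusremotewrite": {"endpoint": prometheus_url},
            "otlp/tempo": {"endpoint": tempo_endpoint},
            "otlphttp/loki": {"endpoint": loki_url},
        },
        "service": {
            "pipelines": {
                # Each signal type is routed to its own backend.
                "metrics": {
                    "receivers": ["otlp"],
                    "processors": ["batch"],
                    "exporters": ["prometheusremotewrite"],
                },
                "traces": {
                    "receivers": ["otlp"],
                    "processors": ["batch"],
                    "exporters": ["otlp/tempo"],
                },
                "logs": {
                    "receivers": ["otlp"],
                    "processors": ["batch"],
                    "exporters": ["otlphttp/loki"],
                },
            }
        },
    }


# Hypothetical in-cluster addresses; substitute your own.
config = build_collector_config(
    prometheus_url="http://prometheus:9090/api/v1/write",
    tempo_endpoint="tempo:4317",
    loki_url="http://loki:3100/otlp",
)
print(json.dumps(config, indent=2))
```

Note that the services themselves never appear in this config. Swapping a backend means editing one exporter entry here, not redeploying instrumented services.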
Service Level Objectives (SLOs) are the critical practice that distinguishes L3 from L2. At L3, teams define formal SLOs for customer-facing services: "99.9% of requests will complete successfully" and "P99 latency will be below 500ms." These SLOs are tracked as error budgets in Grafana: a 99.9% SLO over a 30-day month allows about 43 minutes of downtime, so if you have already consumed 30 minutes, roughly 70% of your error budget is gone. Error budgets create a shared language between engineering and product: "we cannot take on this risky deployment because we have only 13 minutes of error budget remaining this month."
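The error-budget arithmetic above can be reproduced with a few lines of Python. This is a sketch that assumes a time-based SLO over a 30-day window; the function names are illustrative and do not come from any Grafana or Prometheus API.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Total allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_status(slo_target, consumed_minutes, window_days=30):
    """Summarize how much of the error budget has been used and what remains."""
    total = error_budget_minutes(slo_target, window_days)
    return {
        "total_minutes": total,
        "remaining_minutes": total - consumed_minutes,
        "consumed_fraction": consumed_minutes / total,
    }


# The example from the text: 99.9% monthly SLO, 30 minutes already consumed.
status = budget_status(slo_target=0.999, consumed_minutes=30)
print(round(status["total_minutes"], 1))      # about 43.2 minutes of budget
print(round(status["remaining_minutes"], 1))  # about 13.2 minutes left
print(round(status["consumed_fraction"], 2))  # about 0.69, i.e. roughly 70% used
```

A request-based SLO ("99.9% of requests succeed") follows the same shape, with failed requests in place of downtime minutes and total requests in place of window minutes.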
The full observability stack at L3 also enables exemplars: individual data points in a Prometheus metric that carry a reference to a specific trace. When your P99 latency metric spikes, Grafana can show you the exemplars attached to the slow samples, and each one links to an actual trace with a single click. This bridges the gap between aggregate metrics ("P99 is bad") and specific request context ("here is an exact trace that was slow"). Without exemplars, you know that latency is bad but must search for representative traces manually. With exemplars, representative slow traces are surfaced alongside the metric itself.
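To make the exemplar idea concrete, here is a small sketch that pulls the trace reference out of a metric sample written in the OpenMetrics text format, where an exemplar follows the sample after ` # `. The sample line, the metric name, and the `parse_exemplar` helper are made up for illustration. The parser is deliberately simplistic and assumes label values do not themselves contain ` # ` or `}`.

```python
import re

# Exemplar section: {labels} value [timestamp]
EXEMPLAR_RE = re.compile(
    r'\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+(?P<timestamp>\S+))?\s*$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exemplar(line):
    """Return the exemplar attached to one exposition line, or None if absent."""
    sample, separator, exemplar = line.partition(" # ")
    if not separator:
        return None  # plain sample with no exemplar
    match = EXEMPLAR_RE.match(exemplar.strip())
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return {
        "sample": sample.strip(),
        "trace_id": labels.get("trace_id"),
        "value": float(match.group("value")),
    }


# Hypothetical histogram bucket sample carrying one exemplar.
line = (
    'http_request_duration_seconds_bucket{le="0.5"} 129 '
    '# {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.47 1700000000.0'
)
exemplar = parse_exemplar(line)
print(exemplar["trace_id"], exemplar["value"])
```

The `trace_id` recovered here is what a visualization layer would use to link the metric data point to the matching trace in the tracing backend.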
Why It Matters
The correlated observability stack unlocks capabilities that no individual tool provides alone:
- Correlation cuts investigation time - moving from a metric anomaly to the relevant traces to the specific log lines can take seconds in a correlated stack, versus 20+ minutes of manual searching across disconnected tools
- SLOs create objective quality standards - error budgets transform abstract reliability goals into concrete operational constraints that govern when teams should deploy, experiment, and take risks
- Exemplars surface concrete slow or failed requests automatically - instead of searching for representative traces by hand, you follow the exemplars attached to the degraded metric
- The stack is agent-queryable - Prometheus, Loki, and Tempo all expose query APIs; an agent can programmatically query any of the three using PromQL, LogQL, or TraceQL to investigate an incident without a human intermediary
- Unified data model enables novel queries - with all telemetry in one system, you can ask cross-pillar questions: "find all traces from users who appeared in ERROR logs in the last hour" or "show me the latency distribution for requests that triggered this specific log message"
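As a rough sketch of what "agent-queryable" means in practice, the helpers below build request URLs for the range-query endpoints of Prometheus and Loki and the search endpoint of Tempo. The base URLs, helper names, and example queries are assumptions for illustration. The code only constructs URLs and does not send requests, so check the endpoint paths and parameters against the versions you run.

```python
from urllib.parse import urlencode


def prometheus_range_url(base, promql, start, end, step="30s"):
    """PromQL range query; start and end are Unix timestamps in seconds."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


def loki_range_url(base, logql, start_ns, end_ns, limit=100):
    """LogQL range query; start and end are Unix timestamps in nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


def tempo_search_url(base, traceql, start, end, limit=20):
    """TraceQL search; start and end are Unix timestamps in seconds."""
    params = {"q": traceql, "start": start, "end": end, "limit": limit}
    return f"{base}/api/search?{urlencode(params)}"


# Hypothetical incident window and service addresses.
start, end = 1700000000, 1700003600
metrics_url = prometheus_range_url(
    "http://prometheus:9090", "rate(http_requests_total[5m])", start, end
)
logs_url = loki_range_url(
    "http://loki:3100", '{app="checkout"} |= "ERROR"',
    start * 10**9, end * 10**9,
)
traces_url = tempo_search_url(
    "http://tempo:3200", "{ status = error }", start, end
)
for url in (metrics_url, logs_url, traces_url):
    print(url)
```

An agent tool wrapping these helpers would issue the GET request, then hand the JSON response back as context. Keeping URL construction separate from the HTTP call makes each piece easy to test without a running stack.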
How Different Roles See It
Bob has invested in structured logging and basic tracing at L2, but the tools are fragmented: engineers use different dashboards, switch between Sentry, Grafana, and CloudWatch during incidents, and lose time translating between different query languages and mental models. He wants to unify the stack and establish SLOs as the team's reliability language.
Sarah's developers spend the first 15 minutes of every incident figuring out which tool has the relevant data and how to query it. The fragmented tool stack is a developer experience problem that compounds under incident stress. She wants to reduce the tool-switching overhead and create a single starting point for all incident investigation.
Victor wants agents to be able to query the observability stack using all three telemetry types. He knows that Prometheus exposes PromQL, Loki exposes LogQL, and Tempo exposes TraceQL - all via HTTP APIs. He wants to build MCP tools that wrap these APIs so agents can query any pillar of the observability stack programmatically.