Full observability stack (OTel + Grafana)

A full observability stack means having all three telemetry pillars - metrics, traces, and logs - collected, correlated, and queryable in a unified system.

·Full observability stack is operational (OpenTelemetry + Grafana/Datadog or equivalent)
·Production metrics feed into dashboards accessible to all developers
·Incident data (post-mortems, error patterns) is available as agent context

·SLOs are defined and tracked for key services
·Incident data is structured for machine consumption (not just human-readable post-mortem docs)

Evidence

·Observability stack configuration (OTel collector, Grafana dashboards)
·Production metrics dashboards with developer access
·Incident data accessible via MCP or structured API

What It Is

A full observability stack means having all three telemetry pillars - metrics, traces, and logs - collected, correlated, and queryable in a unified system. At L3, this typically means OpenTelemetry for instrumentation, Prometheus for metrics storage, Grafana Tempo for distributed traces, Grafana Loki for logs, and Grafana as the unified query and visualization layer that ties all three together. The defining characteristic of this level is correlation: from a single Grafana dashboard, you can click an anomalous metric spike, see the traces from that time window, and drill into the logs for a specific trace - all without switching tools or losing context.

The OTel Collector is the hub of this architecture. Every service ships telemetry to the Collector over OTLP (the OpenTelemetry Protocol). The Collector applies transformations, sampling decisions, and routing: metrics go to Prometheus (via remote_write), traces go to Tempo, logs go to Loki. Grafana connects to all three as data sources. This unified pipeline means that adding a new service to observability requires only configuring it to send OTLP to the Collector - the rest of the pipeline handles it automatically. It also decouples instrumentation from storage: because services only speak OTLP, changing the backend stack means reconfiguring the Collector, not rewriting instrumentation code in each service.

Service Level Objectives (SLOs) are the critical practice that distinguishes L3 from L2. At L3, teams define formal SLOs for customer-facing services: "99.9% of requests will complete successfully" and "P99 latency will be below 500ms." These SLOs are tracked as error budgets in Grafana: if you start the month with 43 minutes of allowed downtime (a 99.9% monthly SLO) and you have already consumed 30 minutes, your error budget is 70% depleted. Error budgets create a shared language between engineering and product: "we cannot take on this risky deployment because we have only 13 minutes of error budget remaining this month."
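
The error budget arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming a 30-day window and an availability SLO expressed as a fraction; the function names are illustrative and not part of any Grafana or Prometheus API.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_status(slo: float, consumed_minutes: float, window_days: int = 30) -> dict:
    """Summarize how much of the error budget has been spent."""
    budget = error_budget_minutes(slo, window_days)
    return {
        "budget_minutes": round(budget, 1),
        "remaining_minutes": round(budget - consumed_minutes, 1),
        "consumed_fraction": round(consumed_minutes / budget, 3),
    }


# The example from the text: a 99.9% SLO with 30 minutes already consumed.
status = budget_status(slo=0.999, consumed_minutes=30)
print(status)
```

For a 99.9% SLO over 30 days the budget works out to 43.2 minutes, so 30 consumed minutes leaves about 13 minutes, roughly 70% depleted, which matches the figures quoted above.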

The full observability stack at L3 also enables exemplars: individual data points in a Prometheus metric that carry a reference to a specific trace. When your P99 latency metric spikes, Grafana can show you the exemplars - the actual traces that represent that P99 - with a single click. This bridges the gap between aggregate metrics ("P99 is bad") and specific request context ("here is the exact trace that was slow"). Without exemplars, you know that latency is bad but must search for representative traces manually. With exemplars, the worst traces are surfaced automatically.
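
To make the exemplar idea concrete, the sketch below pulls the slowest trace references out of a payload shaped like a Prometheus `/api/v1/query_exemplars` response. The sample data is invented for illustration, and the label that carries the trace ID (`trace_id` here) is an assumption: it depends on how your SDK and Collector are configured.

```python
# Invented sample, shaped like the "data" section of a query_exemplars response.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": [
        {
            "seriesLabels": {"__name__": "http_request_duration_seconds_bucket", "le": "1"},
            "exemplars": [
                {"labels": {"trace_id": "aaa111"}, "value": "0.21", "timestamp": 1700000001.0},
                {"labels": {"trace_id": "bbb222"}, "value": "0.93", "timestamp": 1700000002.0},
            ],
        },
        {
            "seriesLabels": {"__name__": "http_request_duration_seconds_bucket", "le": "+Inf"},
            "exemplars": [
                {"labels": {"trace_id": "ccc333"}, "value": "2.75", "timestamp": 1700000003.0},
            ],
        },
    ],
}


def slowest_exemplar_traces(response: dict, trace_label: str = "trace_id", limit: int = 3):
    """Return (trace_id, observed_value) pairs, largest observed value first."""
    candidates = []
    for series in response.get("data", []):
        for exemplar in series.get("exemplars", []):
            trace_id = exemplar.get("labels", {}).get(trace_label)
            if trace_id is not None:
                # Exemplar values arrive as strings in the JSON payload.
                candidates.append((trace_id, float(exemplar["value"])))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:limit]


worst = slowest_exemplar_traces(SAMPLE_RESPONSE, limit=2)
print(worst)
```

Grafana performs the equivalent lookup when you click an exemplar dot; the same data is available to scripts and agents through the API.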

Why It Matters

The correlated observability stack unlocks capabilities that no individual tool provides alone:

Correlation cuts investigation time - moving from a metric anomaly to the relevant traces to the specific log lines takes seconds in a correlated stack and 20+ minutes in disconnected tools
SLOs create objective quality standards - error budgets transform abstract reliability goals into concrete operational constraints that govern when teams should deploy, experiment, and take risks
Exemplars surface the worst cases automatically - instead of searching for representative slow or failed requests, exemplars present them directly when metrics degrade
The stack is agent-queryable - Prometheus, Loki, and Tempo all expose query APIs; an agent can programmatically query any of the three using PromQL, LogQL, or TraceQL to investigate an incident without a human intermediary
Unified data model enables novel queries - with all telemetry in one system, you can ask cross-pillar questions: "find all traces from users who appeared in ERROR logs in the last hour" or "show me the latency distribution for requests that triggered this specific log message"

Getting Started

Deploy the Grafana OSS stack - Use Helm charts on Kubernetes or docker-compose to run Prometheus, Loki, Tempo, and Grafana. The kube-prometheus-stack Helm chart (published by the prometheus-community project, not by Grafana) deploys Prometheus, Alertmanager, and Grafana with default alerting rules in one command. Add Loki and Tempo as separate Helm releases from the Grafana chart repository. Expect roughly a day to deploy and configure a new environment.
Configure the OTel Collector as the central pipeline - Deploy the OTel Collector Contrib distribution, which bundles the community-maintained exporters. Configure it to receive OTLP from all services and export to: Prometheus remote_write for metrics, Tempo OTLP for traces, Loki push API for logs. Prometheus must be started with --web.enable-remote-write-receiver to accept remote_write traffic. Every service then needs one setting: OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317.
Enable exemplars in Prometheus and Grafana - Start Prometheus with the --enable-feature=exemplar-storage flag so that it stores exemplars. Configure the OTel SDK to emit exemplars with metrics (most SDKs do this automatically when a trace is active). In Grafana, enable exemplars on your Prometheus data source and point them at the Tempo data source. Verify that metric graphs show exemplar dots that link to traces.
Define SLOs for all customer-facing services - For each service, define availability SLO (success rate %) and latency SLO (P99 threshold). Use the Grafana SLO plugin or the Prometheus recording rules pattern to track error budget consumption. Create a top-level dashboard showing error budget status for all services.
Build service-specific dashboards with the USE and RED methods - For infrastructure: Utilization, Saturation, Errors (USE). For services: Rate, Errors, Duration (RED). Every service should have a standard dashboard with these metrics. Grafana's dashboard-as-code feature (Grafonnet or Terraform provider) lets you generate standard dashboards automatically for each new service.
Implement alert routing from Prometheus Alertmanager - Define PrometheusRule resources for your SLO burn rates and key operational metrics. Route alerts through Alertmanager to PagerDuty or OpsGenie. Multi-window, multi-burn-rate SLO alerts (the Google SRE approach) are the most reliable alerting pattern: alert when you are burning your error budget fast enough to exhaust it ahead of schedule.
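
The multi-window, multi-burn-rate pattern from the last step can be sketched as plain logic before it is translated into PrometheusRule expressions. The window pairs and thresholds below follow the values commonly cited from the Google SRE Workbook for a 30-day SLO; the class and function names are illustrative assumptions, and the error ratios would come from recording rules in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str
    short_window: str
    threshold: float  # burn rate that must be exceeded in BOTH windows
    severity: str


# Window pairs and thresholds for a 30-day SLO (assumed defaults; tune per service).
RULES = [
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def firing_alerts(error_ratios: dict, slo: float) -> list:
    """Return rules whose long AND short windows both exceed the threshold."""
    firing = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo)
        short_burn = burn_rate(error_ratios[rule.short_window], slo)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            firing.append(rule)
    return firing


# A brief spike: high error ratio over 5m, but the longer windows are healthy.
spike = {"5m": 0.05, "30m": 0.01, "1h": 0.005, "6h": 0.001, "3d": 0.0002}
# A sustained problem: 2% errors for the past hour against a 99.9% SLO.
sustained = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.008, "3d": 0.0005}

print([r.long_window for r in firing_alerts(spike, slo=0.999)])
print([r.long_window for r in firing_alerts(sustained, slo=0.999)])
```

Requiring both windows to exceed the threshold is what filters out transient spikes: the short window confirms the problem is still happening, and the long window confirms it has consumed a meaningful share of the budget.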

Tip

Start Grafana dashboards from the Grafana community dashboard library rather than building from scratch. Dashboards for Kubernetes, PostgreSQL, Redis, and most common infrastructure already exist and are maintained by the community. Import them, then customize for your specific services.


Common Pitfalls

Running disconnected tools instead of a correlated stack. Using Datadog for metrics, Jaeger for traces, and ELK for logs without correlation between them recreates the fragmentation problem. The value of the full stack comes from correlation - metrics linking to traces linking to logs. If your three pillars do not share trace IDs and do not have a unified query interface, you have three tools, not an observability stack.
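
Shared trace IDs are what make the log-to-trace jump possible. The sketch below shows one way to stamp every log line with the current trace ID using only the standard library. The `current_trace_id` context variable is a stand-in assumption: in an OpenTelemetry-instrumented service the ID would come from the active span context rather than being set by hand.

```python
import contextvars
import io
import json
import logging

# Stand-in for the active trace context (hypothetical; normally supplied by the OTel SDK).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit structured JSON so the log backend can index or extract trace_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("payment")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
logger = build_logger(stream)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.error("card declined")
logged = json.loads(stream.getvalue())
print(logged)
```

With the trace ID present in every log line, the log data source can be configured to turn that field into a link to the matching trace.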

Defining SLOs that are aspirational rather than contractual. An SLO that nobody acts on when the error budget is depleted is a vanity metric. SLOs need organizational commitment: when error budget is exhausted, the team pauses new feature deployments until the budget is restored. Without this commitment, SLOs are dashboards, not operational constraints.

Alert fatigue from alerting on raw thresholds rather than SLO burn rate. Alerting directly on error rate exceeding a fixed threshold creates pages for every transient spike. SLO burn rate alerts - "you are burning your error budget 10x faster than your SLO allows" - are far more robust: they fire when the sustained rate of degradation threatens your SLO, not when a brief spike occurs.

Not instrumenting business logic, only infrastructure. A full observability stack that only instruments HTTP requests and database queries is missing the business-level signals that indicate whether the system is correct, not just operational. Add custom metrics for business operations: orders placed per minute, payment success rate, user signup rate. These business metrics catch correctness bugs that technical metrics miss.

Neglecting the OTel Collector as a cost control lever. The Collector's sampling and filtering capabilities are significant cost controls. Without them, a high-traffic service can generate enormous telemetry volumes that make the stack expensive. Configure the Collector to sample normal traces aggressively (5-10%), keep all error and slow traces, and drop redundant low-value metrics. Keeping every error and slow trace requires tail-based sampling, because the decision can only be made once the whole trace has been seen. Done well, this can cut telemetry storage costs substantially, by as much as 80%, with minimal observability impact.
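
The sampling policy described above can be expressed as a small decision function. This is an illustration of the policy only, not the Collector's implementation: in the Collector the same intent is configured declaratively through its sampling processors. The 500 ms slow threshold and 10% baseline rate are assumed values.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               slow_threshold_ms: float = 500.0, baseline_rate: float = 0.10) -> bool:
    """Keep every error and slow trace; sample the rest at baseline_rate."""
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    # Hash the trace ID so the decision is deterministic for a given trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < baseline_rate


# Simulate 10,000 healthy, fast traces and measure how many survive sampling.
normal_ids = [f"trace-{i:05d}" for i in range(10_000)]
kept = sum(keep_trace(tid, has_error=False, duration_ms=120.0) for tid in normal_ids)
kept_fraction = kept / len(normal_ids)
print(f"kept {kept_fraction:.1%} of normal traces")
```

Because errors and slow requests bypass the sampler, the traces most useful during an incident are always retained while routine traffic is thinned out.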


How Different Roles See It

Bob, Head of Engineering

Bob has invested in structured logging and basic tracing at L2, but the tools are fragmented: engineers use different dashboards, switch between Sentry, Grafana, and CloudWatch during incidents, and lose time translating between different query languages and mental models. He wants to unify the stack and establish SLOs as the team's reliability language.

What Bob should do: Bob should make the unified observability stack a Q1 infrastructure investment. The business case is incident resolution time and SLO accountability. Bob should commission two workstreams in parallel: one team consolidates telemetry into the Grafana stack (Prometheus, Loki, Tempo), another defines SLOs for all customer-facing services and presents them to product leadership. The SLO definition exercise is not purely technical - product and business stakeholders need to agree on what reliability means. Getting that agreement early creates the organizational foundation for the error budget policy. Bob should also ensure that the observability stack is designed from day one to support agent queries, since the long-term goal is agent-assisted incident investigation.


Sarah, Productivity Lead

Sarah's developers spend the first 15 minutes of every incident figuring out which tool has the relevant data and how to query it. The fragmented tool stack is a developer experience problem that compounds under incident stress. She wants to reduce the tool-switching overhead and create a single starting point for all incident investigation.

What Sarah should do: Sarah should work with the team to designate Grafana as the single starting point for all production investigation. This means Grafana dashboards that link out to other tools, not standalone tools that link to Grafana occasionally. Every alert notification should include a direct link to a Grafana dashboard pre-filtered to the relevant service and time window. Every runbook should reference Grafana dashboards for its investigation steps. Sarah should also track "time to first relevant data" during incidents as a developer experience metric: from alert receipt to seeing the relevant metric/trace/log, how long does it take? A unified stack should reduce this from minutes to seconds.
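
A pre-filtered dashboard link for alert notifications can be built with a small helper. This is a sketch under stated assumptions: the base URL, dashboard UID, and slug are placeholders, and the dashboard is assumed to define a template variable named `service`. The `from` and `to` parameters are epoch milliseconds.

```python
from urllib.parse import urlencode


def grafana_dashboard_link(base_url: str, dashboard_uid: str, slug: str,
                           service: str, alert_time_ms: int,
                           window_minutes: int = 15) -> str:
    """Build a dashboard URL scoped to one service and a window around the alert."""
    window_ms = window_minutes * 60 * 1000
    params = {
        "var-service": service,              # assumes a `service` template variable
        "from": alert_time_ms - window_ms,   # window start, epoch milliseconds
        "to": alert_time_ms + window_ms,     # window end, epoch milliseconds
    }
    return f"{base_url.rstrip('/')}/d/{dashboard_uid}/{slug}?{urlencode(params)}"


link = grafana_dashboard_link(
    base_url="https://grafana.example.com",  # placeholder host
    dashboard_uid="abc123",                  # placeholder UID
    slug="service-overview",
    service="payment",
    alert_time_ms=1_700_000_000_000,
)
print(link)
```

Embedding this link in the alert annotation means the on-call engineer lands on the right service and time range with one click, which directly targets the "time to first relevant data" metric.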


Victor, Staff Engineer - AI Champion

Victor wants agents to be able to query the observability stack using all three telemetry types. He knows that Prometheus exposes PromQL, Loki exposes LogQL, and Tempo exposes TraceQL - all via HTTP APIs. He wants to build MCP tools that wrap these APIs so agents can query any pillar of the observability stack programmatically.

What Victor should do: Victor should build an observability-query MCP server with four tools: query_metrics(promql, time_range), query_logs(logql, time_range), query_traces(trace_id), and find_traces(service, time_range, filters). These tools expose the full Grafana stack to agents. The trace tools hide TraceQL entirely behind structured parameters, while query_metrics and query_logs accept raw PromQL and LogQL for cases where the agent needs full expressive power. When an alert fires, an agent can call find_traces(service="payment", time_range="last_10m", error=true) to retrieve the relevant traces, then query_logs(logql='{service="payment"} |= "error"', time_range="last_10m") to find the corresponding log lines. The agent synthesizes this data into a preliminary root cause analysis before any human engages. Victor should run this as a live demo with the team, showing an incident investigation that takes an agent 30 seconds versus a human 20 minutes.
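
The four tools map onto HTTP endpoints that Prometheus, Loki, and Tempo already expose. The sketch below only builds the request URLs and leaves out the MCP server wiring and the HTTP calls themselves. Host names and ports are assumptions for a default in-cluster setup, the helper names are hypothetical, and time ranges are passed as explicit epoch seconds rather than strings like "last_10m".

```python
from urllib.parse import parse_qs, quote, urlencode, urlparse

# Assumed in-cluster addresses; adjust to your deployment.
PROMETHEUS = "http://prometheus:9090"
LOKI = "http://loki:3100"
TEMPO = "http://tempo:3200"


def query_metrics_url(promql: str, start_s: int, end_s: int, step_s: int = 30) -> str:
    """Prometheus range query; start/end are epoch seconds."""
    params = {"query": promql, "start": start_s, "end": end_s, "step": step_s}
    return f"{PROMETHEUS}/api/v1/query_range?{urlencode(params)}"


def query_logs_url(logql: str, start_s: int, end_s: int, limit: int = 100) -> str:
    """Loki range query; this sketch sends start/end as epoch nanoseconds."""
    params = {"query": logql, "start": start_s * 10**9, "end": end_s * 10**9, "limit": limit}
    return f"{LOKI}/loki/api/v1/query_range?{urlencode(params)}"


def query_traces_url(trace_id: str) -> str:
    """Fetch a single trace from Tempo by ID."""
    return f"{TEMPO}/api/traces/{quote(trace_id, safe='')}"


def find_traces_url(service: str, start_s: int, end_s: int, error: bool = False) -> str:
    """Tempo search; the TraceQL is built here so the agent never writes it."""
    # Note: the service name is not escaped; validate it before use.
    conditions = [f'resource.service.name = "{service}"']
    if error:
        conditions.append("status = error")
    traceql = "{ " + " && ".join(conditions) + " }"
    params = {"q": traceql, "start": start_s, "end": end_s}
    return f"{TEMPO}/api/search?{urlencode(params)}"


url = find_traces_url("payment", start_s=1_700_000_000, end_s=1_700_000_600, error=True)
traceql = parse_qs(urlparse(url).query)["q"][0]
print(traceql)
```

Each MCP tool handler would call the matching builder, issue the request, and return the JSON response to the agent. Keeping URL construction separate from the transport makes the tools easy to unit test without a running stack.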
