OpenTelemetry basic

OpenTelemetry (OTel) is the open standard for collecting and exporting telemetry data - traces, metrics, and logs - from distributed systems.

·Structured logging is implemented (JSON logs with consistent fields)
·OpenTelemetry basic instrumentation is deployed (traces and metrics)
·Post-deploy monitoring checks run after each deployment

·Traces are correlated across services
·Post-deploy checks include automated smoke tests

Evidence

·Structured logging configuration showing JSON format with standard fields
·OpenTelemetry SDK configuration in application code
·Post-deploy monitoring job configuration in CD pipeline

What It Is

OpenTelemetry (OTel) is the open standard for collecting and exporting telemetry data - traces, metrics, and logs - from distributed systems. "OpenTelemetry basic" means instrumenting your services with the OTel SDK to produce distributed traces, configuring an OTel Collector to receive and forward that data, and visualizing the results in a tracing backend like Jaeger or Grafana Tempo. At this level, you have your first end-to-end view of how a request travels through multiple services - which services it touched, how long each took, and where errors occurred.

Distributed tracing solves a problem that structured logging alone cannot: following a single request across service boundaries. When a user action triggers calls across an API gateway, an auth service, a business logic service, and a database, each service may log the relevant steps. But without a shared trace context, those logs are four unrelated streams. A trace links them: every log line, every database query, every external call that occurred during that request is tagged with the same trace ID and organized into a parent-child span hierarchy. You can see in a single view that the API gateway received the request (10ms), forwarded it to the auth service (45ms, of which 38ms was a query to the user database), and then passed it to the business logic service (120ms), which made two downstream calls before responding.
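To make the parent-child span hierarchy concrete, here is a minimal plain-Python sketch of how spans that share a trace ID can be assembled into a tree. It does not use the OTel SDK; the Span fields, span IDs, service names, and durations are all hypothetical values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique per operation
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int


def render_trace(spans: list[Span]) -> list[str]:
    """Return one line per span, with children indented under their parent."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms}ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical spans for one request; every span carries the same trace ID.
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    Span(TRACE_ID, "a1", None, "api-gateway: POST /checkout", 180),
    Span(TRACE_ID, "b2", "a1", "auth-service: verify token", 45),
    Span(TRACE_ID, "c3", "b2", "user-db: SELECT user", 38),
    Span(TRACE_ID, "d4", "a1", "business-logic: place order", 120),
]

for line in render_trace(spans):
    print(line)
```

A tracing backend does essentially this grouping for you: it collects spans by trace ID from every service and uses the parent IDs to draw the waterfall view.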

The OTel SDK provides auto-instrumentation for most popular frameworks. In Node.js, the @opentelemetry/auto-instrumentations-node package instruments Express, HTTP clients, database drivers, and more without changes to application code. In Python, the opentelemetry-instrument command (installed with the opentelemetry-instrumentation package) loads per-library instrumentation packages that cover Flask, Django, SQLAlchemy, Redis clients, and gRPC. In Java, the OTel Java agent instruments Spring Boot, JDBC, and most common libraries via the -javaagent JVM flag. With auto-instrumentation, a first set of distributed traces can often be running across your application within a day, without modifying a single line of business logic.

The OTel Collector is the recommended deployment pattern even at basic scale. Rather than shipping telemetry directly from each service to the backend, services send to the Collector (running as a sidecar or cluster-level agent), which batches, filters, samples, and forwards to the configured backend. This decouples your services from the choice of tracing backend and allows you to change or add backends (Jaeger, Tempo, Datadog, Honeycomb) without modifying service code. The Collector can also sample: in a high-traffic service, you do not need to record every trace; sampling 10% of requests while keeping 100% of error traces is a common and sensible configuration.

Why It Matters

The shift from structured logging to distributed tracing changes how you understand your system:

End-to-end request visibility - see exactly how long each service took for a specific user request, not just aggregated averages; identify the specific hop where latency is added
Error attribution in distributed systems - when a request fails, the trace shows exactly which service threw the exception and what the call chain looked like at that moment
Performance regression detection - traces make it obvious when a deployment added latency to a specific service; the P99 latency graph shows the change, and individual traces show which span got slower
Database and external call visibility - auto-instrumentation captures database queries, HTTP calls, queue operations, and cache interactions without any manual instrumentation; these are often the root cause of performance problems
Agent-queryable request context - agents investigating incidents can query the tracing backend for traces matching specific criteria (high latency, specific error type, specific user) and reconstruct the exact sequence of events during an incident

Getting Started

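Start with auto-instrumentation on one service - Pick your most trafficked service and add the OTel auto-instrumentation package for its language. For Node.js: npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node. For Python: pip install opentelemetry-distro opentelemetry-exporter-otlp, then run opentelemetry-bootstrap -a install to add instrumentation packages for the libraries you use, and start the app with opentelemetry-instrument. Configure the OTLP exporter endpoint (for example via the OTEL_EXPORTER_OTLP_ENDPOINT environment variable) and run. You should see traces within minutes.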
Deploy the OTel Collector - Run the otel/opentelemetry-collector-contrib Docker image. Configure a minimal pipeline: OTLP receiver → batch processor → OTLP exporter pointed at Jaeger or Tempo. Both backends accept OTLP directly, and the Collector's dedicated Jaeger exporter has been removed from current releases. Point all services at the Collector on port 4317 (gRPC) or 4318 (HTTP). The Collector handles buffering and retry so your services are not blocked by backend unavailability.
Set up Jaeger or Grafana Tempo as your tracing backend - Jaeger (jaegertracing/all-in-one) runs in Docker with a single command and provides a UI for trace search and visualization. Grafana Tempo integrates with the broader Grafana stack and scales better for production. Start with Jaeger for speed; migrate to Tempo when you need the Grafana integration.
Propagate trace context across service calls - Auto-instrumentation handles context propagation for HTTP calls, gRPC, and most message queues. Verify that traces actually span multiple services by sending a test request that touches at least two services and confirming that the resulting trace contains spans from both. If the spans show up as separate, disconnected traces, trace context is not being propagated.
Add custom spans for business-critical operations - Auto-instrumentation captures framework-level operations but not your business logic. Add manual spans for operations that matter: tracer.startSpan("process_payment"), tracer.startSpan("validate_inventory"). End each span when its operation completes; a span that is never ended is never exported. These custom spans appear in the trace alongside the auto-instrumented framework operations and give you business-level visibility.
Configure tail-based sampling for production - Recording 100% of traces in a high-traffic production service is expensive. Configure the OTel Collector's tail sampling processor to keep 100% of error traces, 100% of slow traces (those above a fixed latency threshold, set for example near your P99), and 10% of normal traces. This captures the interesting data while controlling cost.
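The sampling rule in the last step can be sketched in plain Python. This is an illustration of the decision logic only, not the Collector's tail sampling processor; the CompletedTrace type, the 2000ms threshold, and the hash-based 10% bucket are assumptions made for the example.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletedTrace:
    trace_id: str
    duration_ms: int
    has_error: bool


def keep_trace(
    trace: CompletedTrace,
    slow_threshold_ms: int = 2000,  # assumed threshold, e.g. near your P99
    baseline_percent: int = 10,     # share of ordinary traces to keep
) -> bool:
    """Decide after the trace has finished, when its outcome is known."""
    if trace.has_error:
        return True  # keep every error trace
    if trace.duration_ms >= slow_threshold_ms:
        return True  # keep every slow trace
    # Hash the trace ID into 100 buckets so the baseline decision is stable.
    bucket = zlib.crc32(trace.trace_id.encode("ascii")) % 100
    return bucket < baseline_percent


failed = CompletedTrace("4bf92f3577b34da6a3ce929d0e0e4736", 120, has_error=True)
slow = CompletedTrace("0af7651916cd43dd8448eb211c80319c", 3400, has_error=False)
print(keep_trace(failed), keep_trace(slow))
```

The key property is that the decision runs on the completed trace, which is why it has to live in the Collector (or another component that sees all spans) rather than in each service.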

Tip

Use the W3C Trace Context standard (traceparent header) rather than vendor-specific propagation formats. OTel defaults to W3C Trace Context, which means traces propagate correctly across services instrumented with different OTel SDK versions and different languages without any special configuration.
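For reference, a traceparent header has four dash-separated hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags. The sketch below parses one and builds the header a service would send downstream. It is a simplified illustration that only handles the four-field form; the function names are made up for this example, and in practice the OTel propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(incoming: TraceParent) -> str:
    """Build the outgoing header: same trace ID, new span ID, same sampled flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if incoming.sampled else "00"
    return f"00-{incoming.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
if incoming is not None:
    print(child_traceparent(incoming))
```

When debugging disconnected spans, checking whether this header is present on the outgoing request, and whether its trace ID matches the incoming one, is usually the fastest test.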

Common Pitfalls

Instrumenting only one service and declaring success. A trace that shows only one service is just a structured log with extra steps. The value of distributed tracing emerges when traces span multiple services. Instrument all services that participate in user-facing requests before drawing conclusions about your tracing setup.

Ignoring sampling from the start. A high-traffic service with 100% trace sampling will generate enormous data volumes and incur significant storage costs. Configure sampling before going to production. Head-based sampling (decide at the entry point) is simple, but the decision is made before the outcome of the request is known, so error traces are dropped at the same rate as everything else. Tail-based sampling in the OTel Collector (decide after the trace completes) is more sophisticated and can keep all error traces.
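A short simulation shows why head-based sampling loses error traces. The sampler below is a simplified stand-in for ratio-based head sampling, not the OTel SDK's own sampler; the trace IDs are generated from a fixed seed and the 2% error rate is an assumption for the example.

```python
import random


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide from the trace ID alone, before any span has run."""
    bound = int(ratio * (1 << 64))
    # Compare the low 64 bits of the 128-bit trace ID against the bound.
    return int(trace_id[16:], 16) < bound


# 1000 hypothetical traces; every 50th one ends in an error.
rng = random.Random(42)
traces = [
    {"trace_id": f"{rng.getrandbits(128):032x}", "has_error": i % 50 == 0}
    for i in range(1000)
]

kept = [t for t in traces if head_sample(t["trace_id"], 0.10)]
total_errors = sum(t["has_error"] for t in traces)
kept_errors = sum(t["has_error"] for t in kept)
print(
    f"kept {len(kept)} of {len(traces)} traces, "
    f"{kept_errors} of {total_errors} error traces"
)
```

Because the decision depends only on the trace ID, every service reaches the same verdict for a given request, which keeps traces whole. The cost is that the sampler cannot favor errors: roughly 90% of them are discarded at a 10% ratio.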

Confusing metrics with traces. OTel supports both. Traces answer "what happened during this specific request?" Metrics answer "what is the overall behavior of this service over time?" Both are necessary and complementary. Many teams adopt OTel for tracing and forget to configure metrics export as well, leaving gaps in their observability.

Not adding trace IDs to structured logs. Traces and logs are most powerful when correlated: given a trace, find the logs; given a log line, find the trace. This requires the same trace ID to appear in both. Configure your structured logging library to read the current OTel trace ID from the context and include it as a log field. Most OTel libraries provide a bridge for this; use it from day one.
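The pattern looks roughly like the sketch below, which uses only the standard library. The current_trace_id context variable is a stand-in for the OTel context; with the real SDK, the filter would instead read the ID from the active span (for example via opentelemetry.trace.get_current_span()). The logger name and field names are assumptions for the example.

```python
import contextvars
import io
import json
import logging

# Stand-in for the OTel context: holds the trace ID of the current request.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit each record as a JSON object with a trace_id field."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "message": record.getMessage(),
                "trace_id": getattr(record, "trace_id", None),
            }
        )


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service logger
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("payment authorized")
print(buffer.getvalue().strip())
```

With the trace ID in every log line, the log backend can filter on that field to pull up everything a single request produced, and the trace UI can link back to the same logs.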

Over-instrumenting and creating span noise. Not every function call needs a span. A service with thousands of spans per request creates traces that are harder to read than a trace with 20 meaningful spans. Instrument at the boundary level (external calls, database queries, significant business operations) rather than at the function call level.

How Different Roles See It

Bob, Head of Engineering

Bob has structured logging in place and is hearing about distributed tracing from his staff engineers. His primary concern is cost and complexity - he does not want a two-month instrumentation project that distracts from product work. He wants to understand whether OTel is worth the investment at this stage.

What Bob should do: Bob should authorize a two-week OTel proof of concept on two services that have frequent inter-service communication. The goal is specific: can the team answer the question "which service is responsible for the latency in this user-visible operation?" using the trace UI within 5 minutes of an incident? If yes, the investment is justified. The POC also surfaces the cost question: run with 100% sampling for one week, then apply sampling and measure the reduction in data volume. This gives Bob concrete cost projections before committing to full rollout. Bob should also ask Victor (the staff engineer) to evaluate whether the tracing setup will support agent-assisted investigation, since that is the long-term goal.

Sarah, Productivity Lead

Sarah's developers spend significant time during incidents trying to identify which service caused a latency spike or error cascade. Distributed tracing promises to reduce this investigation time dramatically, but she needs to validate the claim with data before recommending it as a team-wide investment.

What Sarah should do: Sarah should instrument one of the most incident-prone service pairs with OTel, then compare incident investigation time for the next month against the previous month. The before metric (time from alert to root cause identification) and the after metric (same, with tracing available) provide the ROI case. Sarah should also survey developers about their experience with the tracing UI - is it helping them find answers faster, or is it adding complexity without clarity? Developer experience with the tooling is as important as the tooling's technical capability. A tracing system that developers find confusing will not be used during the high-stress conditions of a production incident.

Victor, Staff Engineer - AI Champion

Victor wants to connect OTel traces to AI agents so agents can query traces during incident investigation. He knows this requires the tracing backend to expose a query API, not just a human-readable UI. He is evaluating whether Jaeger's API, Grafana Tempo's API, or a commercial solution like Honeycomb is the best choice for agent integration.

What Victor should do: Victor should evaluate tracing backends primarily on API capability, not UI quality. Honeycomb and Grafana Tempo both expose REST APIs suitable for agent queries. Victor should build a prototype MCP tool that accepts a trace ID or a query (service name, time range, error type) and returns trace summaries in a format an agent can reason about. The key test: given an alert that says "payment service P99 latency exceeded 2s at 14:32," can an agent use the tracing API to retrieve the relevant traces, identify which span is adding latency, and propose a root cause hypothesis? If yes, the architecture is correct. Victor should also ensure trace IDs are included in all alert payloads - without the trace ID as an anchor, agents must search for relevant traces rather than fetching them directly.
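The summarization step of such a prototype might look like the sketch below. It assumes the spans have already been fetched from the tracing backend and normalized into dictionaries; the field names (span_id, parent_id, name, duration_ms, error) are hypothetical and do not match any particular backend's response format. Self time is estimated by subtracting child durations from the parent, which assumes child spans run one after another rather than concurrently.

```python
from collections import defaultdict


def summarize_trace(spans: list) -> dict:
    """Reduce a trace to the facts an agent needs: root, slowest span, errors."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_ms"]

    def self_time(span: dict) -> int:
        # Time spent in the span itself, excluding its direct children.
        return max(span["duration_ms"] - child_total[span["span_id"]], 0)

    root = next(s for s in spans if s["parent_id"] is None)
    slowest = max(spans, key=self_time)
    return {
        "root": root["name"],
        "total_ms": root["duration_ms"],
        "slowest_span": slowest["name"],
        "slowest_self_ms": self_time(slowest),
        "error_spans": [s["name"] for s in spans if s["error"]],
    }


# Hypothetical trace for a slow payment request.
example = [
    {"span_id": "a", "parent_id": None, "name": "payment-service POST /pay",
     "duration_ms": 2300, "error": False},
    {"span_id": "b", "parent_id": "a", "name": "fraud-check",
     "duration_ms": 150, "error": False},
    {"span_id": "c", "parent_id": "a", "name": "card-processor HTTP POST",
     "duration_ms": 2050, "error": False},
    {"span_id": "d", "parent_id": "c", "name": "ledger-db INSERT",
     "duration_ms": 40, "error": False},
]

summary = summarize_trace(example)
print(summary)
```

A compact summary like this gives the agent a starting hypothesis (here, the card processor call accounts for most of the latency) without requiring it to reason over the raw span list.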
