OpenTelemetry basic
OpenTelemetry (OTel) is the open standard for collecting and exporting telemetry data - traces, metrics, and logs - from distributed systems.
- Structured logging is implemented (JSON logs with consistent fields)
- OpenTelemetry basic instrumentation is deployed (traces and metrics)
- Post-deploy monitoring checks run after each deployment
- Traces are correlated across services
- Post-deploy checks include automated smoke tests
Evidence
- Structured logging configuration showing JSON format with standard fields
- OpenTelemetry SDK configuration in application code
- Post-deploy monitoring job configuration in CD pipeline
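As an illustration of the first evidence item, here is a minimal sketch of a JSON log formatter built on Python's standard logging module. The field names (timestamp, level, service, message, trace_id) and the helper build_logger are assumptions chosen for the example, not a schema this guide prescribes.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # trace_id is supplied by the caller via `extra=`; None when absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Hypothetical helper: a logger that writes one JSON line per event."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as build_logger("checkout").info("order placed", extra={"trace_id": "abc123"}) would then emit a single JSON line carrying the same fields on every event, which is what makes the logs machine-searchable.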
What It Is
OpenTelemetry (OTel) is the open standard for collecting and exporting telemetry data - traces, metrics, and logs - from distributed systems. "OpenTelemetry basic" means instrumenting your services with the OTel SDK to produce distributed traces, configuring an OTel Collector to receive and forward that data, and visualizing the results in a tracing backend like Jaeger or Grafana Tempo. At this level, you have your first end-to-end view of how a request travels through multiple services - which services it touched, how long each took, and where errors occurred.
Distributed tracing solves a problem that structured logging alone cannot: following a single request across service boundaries. When a user action triggers calls across an API gateway, an auth service, a business logic service, and a database, each service may log the relevant steps. But without a shared trace context, those logs are four unrelated streams. A trace links them: every log line, every database query, every external call that occurred during that request is tagged with the same trace ID and organized into a parent-child span hierarchy. You can see in a single view that the API gateway received the request (10ms), forwarded it to the auth service (45ms, of which 38ms was spent querying the user database), and then passed it to the business logic service (120ms), which made two downstream calls before responding.
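The parent-child span hierarchy can be sketched in a few lines. This is a simplified model, not the OTel data model: the Span class, the IDs, and the render_trace function are hypothetical, and the durations are illustrative values loosely based on the example above (the root span's 180ms is invented).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique per operation
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int


def render_trace(spans: list[Span], trace_id: str) -> list[str]:
    """Return indented lines showing the span hierarchy of one trace."""
    # Group the spans of this trace by their parent so children are easy to find.
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms}ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Illustrative data: spans from two unrelated requests, stored flat.
SPANS = [
    Span("t1", "a", None, "api-gateway", 180),
    Span("t1", "b", "a", "auth-service", 45),
    Span("t1", "c", "b", "user-db query", 38),
    Span("t1", "d", "a", "business-logic", 120),
    Span("t2", "e", None, "api-gateway", 12),
]

if __name__ == "__main__":
    print("\n".join(render_trace(SPANS, "t1")))
```

The point of the sketch is that the trace ID selects the spans belonging to one request, and the parent ID alone is enough to rebuild the call tree, regardless of which service emitted each span.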
The OTel SDK provides auto-instrumentation for most popular frameworks. In Node.js, the @opentelemetry/auto-instrumentations-node package instruments Express, HTTP clients, database drivers, and more without changes to application code. In Python, the opentelemetry-instrumentation package and its per-library companions (opentelemetry-instrumentation-flask, opentelemetry-instrumentation-django, and so on) cover Flask, Django, SQLAlchemy, Redis clients, and gRPC. In Java, the OTel Java agent instruments Spring Boot, JDBC, and most common libraries when attached with the -javaagent JVM flag. Because none of this touches business logic, a team can often get distributed traces flowing across an application within a day.
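Auto-instrumentation handles context propagation for you, so you would not normally write the following by hand. As a sketch of what an instrumented HTTP client puts on the wire, the example below builds and parses a traceparent header in the W3C Trace Context format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The function names inject, extract, and new_span_id are hypothetical and are not the OTel SDK's API.

```python
import re
import secrets
from typing import Optional

# version "00" - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_span_id() -> str:
    """Random 8-byte span ID rendered as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict[str, str], trace_id: str, span_id: str,
           sampled: bool = True) -> None:
    """Caller side: attach the current trace context to outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict[str, str]) -> Optional[tuple[str, str, bool]]:
    """Callee side: return (trace_id, parent_span_id, sampled) or None."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)
```

The receiving service keeps the extracted trace ID, generates its own span ID, and records the extracted span ID as its parent. That is how spans from separate processes end up in one trace.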
The OTel Collector is the recommended deployment pattern even at basic scale. Rather than shipping telemetry directly from each service to the backend, services send to the Collector (running as a sidecar or cluster-level agent), which batches, filters, samples, and forwards to the configured backend. This decouples your services from the choice of tracing backend and lets you change or add backends (Jaeger, Tempo, Datadog, Honeycomb) without modifying service code. The Collector also provides sampling: in a high-traffic service, you do not need to record every trace. Sampling 10% of requests while keeping 100% of error traces is a common and sensible configuration; because a trace's error status is only known once its spans have arrived, this policy is applied as tail-based sampling in the Collector rather than in the SDK.
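In practice this sampling policy is configured in the Collector rather than written by hand. Purely to illustrate the decision logic ("keep every error trace, keep roughly 10% of the rest"), here is a small sketch. The function keep_trace and its hashing scheme are assumptions for the example and do not reflect how the Collector implements sampling internally.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, sample_rate: float = 0.10) -> bool:
    """Decide whether a completed trace should be forwarded to the backend.

    Error traces are always kept. Other traces are kept with probability
    `sample_rate`, decided deterministically from the trace ID so that
    every evaluation of the same trace gives the same answer.
    """
    if has_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Deriving the decision from the trace ID instead of a random draw is a deliberate choice in this sketch: it keeps the outcome stable if the same trace is evaluated more than once.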
Why It Matters
The shift from structured logging to distributed tracing changes how you understand your system:
- End-to-end request visibility - see exactly how long each service took for a specific user request, not just aggregated averages; identify the specific hop where latency is added
- Error attribution in distributed systems - when a request fails, the trace shows exactly which service threw the exception and what the call chain looked like at that moment
- Performance regression detection - traces make it obvious when a deployment added latency to a specific service; the P99 latency graph shows the change, and individual traces show which span got slower
- Database and external call visibility - auto-instrumentation captures database queries, HTTP calls, queue operations, and cache interactions without any manual instrumentation; these are often the root cause of performance problems
- Agent-queryable request context - agents investigating incidents can query the tracing backend for traces matching specific criteria (high latency, specific error type, specific user) and reconstruct the exact sequence of events during an incident
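The regression-detection point above can be made concrete with a small sketch: compute the P99 duration per span name before and after a deployment and flag the spans that got slower. The function names, the (name, duration) record shape, and the 1.2x threshold are all assumptions for illustration; a real setup would pull these durations from the tracing backend.

```python
from collections import defaultdict


def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list."""
    if not values:
        raise ValueError("p99 requires at least one value")
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def slower_spans(
    before: list[tuple[str, float]],
    after: list[tuple[str, float]],
    threshold: float = 1.2,
) -> dict[str, tuple[float, float]]:
    """Return {span name: (old p99, new p99)} where p99 grew past the threshold."""

    def by_name(records: list[tuple[str, float]]) -> dict[str, list[float]]:
        grouped: dict[str, list[float]] = defaultdict(list)
        for name, duration_ms in records:
            grouped[name].append(duration_ms)
        return grouped

    old_groups, new_groups = by_name(before), by_name(after)
    flagged: dict[str, tuple[float, float]] = {}
    # Only compare span names observed in both windows.
    for name in sorted(old_groups.keys() & new_groups.keys()):
        old, new = p99(old_groups[name]), p99(new_groups[name])
        if new > old * threshold:
            flagged[name] = (old, new)
    return flagged
```

An aggregate like this tells you which span regressed; opening a few individual slow traces for that span then shows what the request was doing at the time.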
How Different Roles See It
Bob has structured logging in place and is hearing about distributed tracing from his staff engineers. His primary concern is cost and complexity - he does not want a two-month instrumentation project that distracts from product work. He wants to understand whether OTel is worth the investment at this stage.
Sarah's developers spend significant time during incidents trying to identify which service caused a latency spike or error cascade. Distributed tracing promises to reduce this investigation time dramatically, but she needs to validate the claim with data before recommending it as a team-wide investment.
Victor wants to connect OTel traces to AI agents so agents can query traces during incident investigation. He knows this requires the tracing backend to expose a query API, not just a human-readable UI. He is evaluating whether Jaeger's API, Grafana Tempo's API, or a commercial solution like Honeycomb is the best choice for agent integration.