Structured logging

Structured logging replaces free-form text log output with machine-parseable records - typically JSON - where every field has a defined name and type.

·Structured logging is implemented (JSON logs with consistent fields)
·OpenTelemetry basic instrumentation is deployed (traces and metrics)
·Post-deploy monitoring checks run after each deployment

·Traces are correlated across services
·Post-deploy checks include automated smoke tests

Evidence

·Structured logging configuration showing JSON format with standard fields
·OpenTelemetry SDK configuration in application code
·Post-deploy monitoring job configuration in CD pipeline

What It Is

Structured logging replaces free-form text log output with machine-parseable records - typically JSON - where every field has a defined name and type. Instead of writing "Error processing payment for user 42: timeout after 5000ms", you write {"level":"error","service":"payment","event":"payment_timeout","user_id":42,"duration_ms":5000,"timestamp":"2024-01-15T14:32:01Z"}. The information content is identical, but the format is queryable: you can filter by level=error, aggregate by service, histogram by duration_ms, and alert when event=payment_timeout exceeds a threshold.
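As a minimal sketch of what "queryable" means here, the snippet below treats a few JSON log lines as records and runs the kind of filter and aggregation described above. The first line is the example from the text; the other two are invented for illustration.

```python
import json
from collections import Counter

# First line: the example from the text. The other two are hypothetical.
LOG_LINES = [
    '{"level":"error","service":"payment","event":"payment_timeout","user_id":42,"duration_ms":5000,"timestamp":"2024-01-15T14:32:01Z"}',
    '{"level":"info","service":"payment","event":"payment_ok","user_id":7,"duration_ms":120,"timestamp":"2024-01-15T14:32:02Z"}',
    '{"level":"error","service":"checkout","event":"cart_missing","user_id":42,"duration_ms":15,"timestamp":"2024-01-15T14:32:03Z"}',
]

records = [json.loads(line) for line in LOG_LINES]

# Filter by level=error.
errors = [r for r in records if r["level"] == "error"]

# Aggregate error counts by service.
errors_by_service = Counter(r["service"] for r in errors)

# Count one specific event, the way an alert rule would.
timeouts = sum(1 for r in records if r["event"] == "payment_timeout")

print(errors_by_service, timeouts)
```

None of these operations needs a regular expression or knowledge of how the message was worded, which is the practical difference from the free-form version.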

Structured logging is the first observability practice that genuinely enables machine consumption of log data. Unstructured text logs require human pattern recognition to extract meaning - a human reads "Error processing payment" and understands what happened. Structured logs require schema definition upfront but return that investment as queryability, alertability, and eventually agent-accessibility. When every log line is a JSON object with consistent fields, the entire log stream becomes a queryable database rather than a text file.

The standard approach is to adopt a structured logging library in each language your services use - structlog in Python, zerolog or zap in Go, winston with JSON format in Node.js, SLF4J with Logback and a JSON encoder in Java - and configure it to emit JSON to stdout. The output flows through your container orchestrator to a log aggregation system: Datadog, the ELK stack (Elasticsearch, Logstash, Kibana), Grafana Loki, or AWS CloudWatch Logs Insights. Once in the aggregation layer, the fields in every log line become queryable, either because the platform indexes them on ingest or, as with Loki, because it parses the JSON at query time.

The key fields that every structured log line should include are: timestamp (ISO 8601, UTC), log level (DEBUG/INFO/WARN/ERROR), service name, trace ID (discussed below), and the specific event that occurred. Beyond these mandatory fields, add domain-specific fields for every log line: user IDs, request IDs, operation names, durations, result codes. The richness of these fields determines the richness of the queries you can run against your logs later. A structured log line with 15 well-chosen fields is dramatically more valuable than 15 unstructured log lines covering the same events.
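One way to guarantee these fields is a formatter that builds every line from the same template. The sketch below uses only Python's standard logging module rather than any of the libraries named above. The service name, the `fields` key passed through `extra`, and the logger name are assumptions made for the example.

```python
import json
import logging
import sys
from datetime import datetime, timezone

SERVICE_NAME = "payment"  # hypothetical service name


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the mandatory fields."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.isoformat().replace("+00:00", "Z"),  # ISO 8601, UTC
            "level": record.levelname,
            "service": SERVICE_NAME,
            "trace_id": getattr(record, "trace_id", None),
            "event": record.getMessage(),
        }
        # Domain-specific fields arrive via extra={"fields": {...}} (our convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger():
    logger = logging.getLogger("structured_example")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


logger = build_logger()
logger.error(
    "payment_timeout",
    extra={"trace_id": "abc123", "fields": {"user_id": 42, "duration_ms": 5000}},
)
```

Note that Python's standard level name is WARNING rather than WARN; whichever spelling you choose, record it in the schema so every service emits the same one.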

Why It Matters

The shift from unstructured to structured logging unlocks a cascade of downstream capabilities:

·Queryability transforms investigation speed - diagnosing a production issue by running service:payment level:error event:payment_timeout in Datadog takes on the order of seconds; grepping log files across hosts takes on the order of minutes and requires infrastructure access
·Aggregation enables anomaly detection - when logs have consistent numeric fields (duration_ms, retry_count, queue_depth), you can build dashboards and alerts on their distributions; unstructured logs have no numeric fields to aggregate
·Consistent schema enables cross-service correlation - when every service uses the same field names for common concepts (trace_id, user_id, service), you can query across service boundaries and find the path a request took through your system
·Log data becomes agent-accessible - Datadog, Elasticsearch, and Loki all expose query APIs; an AI agent can call these APIs to retrieve relevant log data for investigation, something impractical with unstructured log files
·Audit trails become reliable - structured authentication, authorization, and data-mutation logs can serve as compliance audit trails; unstructured logs cannot, because there is no guarantee they contain the required fields

Getting Started

·Choose a structured logging library for each service - Pick the idiomatic choice for your language: structlog (Python), zap or zerolog (Go), winston (Node.js), logback with logstash-logback-encoder (Java/Kotlin). Configure it to output JSON to stdout. This is a one-time setup per service.
·Define your mandatory field schema - Before touching application code, define the fields every log line must include: timestamp, level, service, version, trace_id. Write a logging wrapper or middleware that adds these fields automatically. No application code should ever have to set service or timestamp manually.
·Add a request-scoped context carrier - Use your language's context mechanism (Go context.Context, Python contextvars, Java MDC) to carry trace ID and other request-scoped fields through your call stack. Any log line emitted during request processing automatically inherits these fields without the application code passing them explicitly.
·Connect to a log aggregation platform - Configure your container runtime or log shipper (Fluentd, Filebeat, the Datadog Agent) to forward stdout JSON logs to your chosen aggregation system. Datadog parses JSON logs automatically; Grafana Loki extracts JSON fields at query time with its json parser; ELK requires a Logstash or Beats pipeline. Validate that logs are appearing in the UI before proceeding.
·Migrate unstructured logs incrementally - Do not attempt to refactor all logging at once. Start with the highest-traffic code paths: HTTP request handlers, database query wrappers, external API calls. Structured logs for these critical paths give you immediate value even while the rest of the codebase is still unstructured.
·Define and document your log schema - Write a document listing all standard field names and their types. When a new service is created or a new event type is logged, it follows the schema. Consistency across services is the entire value proposition of structured logging; inconsistency recreates the fragmentation problem you were trying to solve.
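The wrapper and context-carrier steps can be sketched together in Python using contextvars and a logging filter from the standard library. This is one possible arrangement, not a prescribed one: the service name, version string, handler functions, and the "-" placeholder for lines emitted outside a request are all assumptions made for the example.

```python
import contextvars
import json
import logging
import sys
import time

# Request-scoped carrier. "-" marks lines emitted outside any request.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class ContextFilter(logging.Filter):
    """Stamp service-wide and request-scoped fields onto every record."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def filter(self, record):
        record.service = self.service
        record.version = self.version
        record.trace_id = trace_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # render timestamps in UTC

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": record.service,
            "version": record.version,
            "trace_id": record.trace_id,
            "event": record.getMessage(),
        })


def build_logger(stream, name="ctx_example"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(ContextFilter(service="payment", version="1.4.2"))
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def charge_card(logger):
    # No trace_id parameter: the filter reads it from the context.
    logger.info("charge_started")


def handle_request(logger, incoming_trace_id):
    """Hypothetical entry point: set the trace ID once, at the edge."""
    token = trace_id_var.set(incoming_trace_id)
    try:
        charge_card(logger)
    finally:
        trace_id_var.reset(token)


handle_request(build_logger(sys.stdout), "4bf92f3577b34da6a3ce929d0e0e4736")
```

Application code such as `charge_card` never mentions service, version, timestamp, or trace ID, which is the property the schema and carrier steps ask for.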

Tip

Add trace_id to every log line from day one, even before you have distributed tracing set up. When you later implement OTel tracing, populate the same field from the active span's trace ID and your logs will correlate with your traces without a schema change. Adding the field retroactively is much harder than including it from the start.


Common Pitfalls

Inconsistent field names across services. Service A logs userId, service B logs user_id, service C logs uid. All three contain user identifiers but no query can retrieve them together. Enforce a field naming standard before teams start instrumenting. The opentelemetry-semantic-conventions package defines standard names for common fields - use it as your baseline rather than inventing your own.
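A naming standard is easier to enforce when it is executable. The following Python sketch checks a log record against a small schema and flags known non-standard spellings. The schema contents and the alias table are hypothetical and would come from your own schema document.

```python
# Hypothetical schema: standard field names mapped to their expected types.
LOG_SCHEMA = {
    "timestamp": str,
    "level": str,
    "service": str,
    "trace_id": str,
    "event": str,
    "user_id": int,
    "duration_ms": int,
}

# Known non-standard spellings mapped to the standard name.
ALIASES = {"userId": "user_id", "uid": "user_id"}


def check_record(record):
    """Return a list of schema problems found in one parsed log record."""
    problems = []
    for name, value in record.items():
        if name in ALIASES:
            problems.append(f"{name}: use {ALIASES[name]}")
        elif name not in LOG_SCHEMA:
            problems.append(f"{name}: not in schema")
        elif not isinstance(value, LOG_SCHEMA[name]):
            problems.append(f"{name}: expected {LOG_SCHEMA[name].__name__}")
    return problems


print(check_record({"level": "error", "userId": 42, "duration_ms": "5000"}))
```

A check like this could run in each service's test suite against sample log output, so a drifting field name fails a build instead of surfacing during an incident.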

Over-logging at DEBUG level in production. Structured logs are cheaper to query than unstructured logs because they are machine-parseable, but they are not free to produce, ship, or store. Debug-level logs emitted in production hot paths can add meaningful latency and cost. Set production log levels to INFO by default, with the ability to dynamically enable DEBUG for specific services or request IDs when investigating an incident.
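One possible way to enable DEBUG for selected request IDs is a logging filter, sketched here with Python's standard library. The `request_id` attribute and the mutable set of IDs are assumptions for the example; in practice the set might be updated from an admin endpoint or a feature flag.

```python
import logging


class RequestDebugFilter(logging.Filter):
    """Always pass INFO and above; pass DEBUG only for selected request IDs."""

    def __init__(self):
        super().__init__()
        self.debug_request_ids = set()

    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True
        return getattr(record, "request_id", None) in self.debug_request_ids


debug_filter = RequestDebugFilter()
handler = logging.StreamHandler()
handler.addFilter(debug_filter)

logger = logging.getLogger("hot_path_example")
logger.setLevel(logging.DEBUG)  # let DEBUG records reach the filter
logger.propagate = False
logger.addHandler(handler)

logger.debug("cache_lookup", extra={"request_id": "req-1"})  # dropped by the filter
debug_filter.debug_request_ids.add("req-1")
logger.debug("cache_lookup", extra={"request_id": "req-1"})  # now emitted
```

Because the logger level stays at DEBUG, each record is still constructed before the filter drops it. This approach reduces shipped volume, not the per-call overhead in the hot path.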

Treating JSON as sufficient for correlation. Structured logs from multiple services are still siloed unless they share a common correlation ID (trace ID). Without trace IDs, you can query within a service but cannot follow a request across service boundaries. The trace ID is what connects structured logging to distributed tracing, and it needs to be threaded through every service call from day one.

No log retention policy. Centralized structured logs at scale can be expensive. A service emitting 10,000 log lines per second produces about 864 million lines per day; at roughly 1 KB per line, that is close to a terabyte per day from a single service. Without a retention policy (for example, keep ERROR logs for 90 days and DEBUG logs for 7 days), costs grow unchecked. Define retention tiers by log level as part of your initial aggregation setup.
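The effect of retention tiers can be estimated with simple arithmetic. The Python sketch below starts from the 10,000 lines per second in the text; the 1 KB average line size and the share of volume at each level are assumptions chosen only to make the calculation concrete.

```python
LINES_PER_SECOND = 10_000
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_LINE = 1_000  # assumption: average size of one serialized JSON line

lines_per_day = LINES_PER_SECOND * SECONDS_PER_DAY
gb_per_day = lines_per_day * BYTES_PER_LINE / 1e9


def steady_state_gb(daily_gb, share, retention_days):
    """Stored volume for one level once its retention window has filled."""
    return daily_gb * share * retention_days


# Hypothetical shares of total volume by level.
ERROR_SHARE = 0.05
DEBUG_SHARE = 0.60

error_90d = steady_state_gb(gb_per_day, ERROR_SHARE, 90)
debug_7d = steady_state_gb(gb_per_day, DEBUG_SHARE, 7)
debug_90d = steady_state_gb(gb_per_day, DEBUG_SHARE, 90)

print(f"{lines_per_day:,} lines/day, about {gb_per_day:.0f} GB/day")
print(f"ERROR kept 90 days: {error_90d:,.0f} GB stored")
print(f"DEBUG kept 7 days:  {debug_7d:,.0f} GB stored")
print(f"DEBUG kept 90 days: {debug_90d:,.0f} GB stored")
```

Under these assumed shares, keeping DEBUG for 7 days instead of 90 cuts its stored volume by a factor of about 13, which is why tiering by level matters more than the exact retention number for ERROR.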

Logging the same information at multiple points. Logging a request at entry, logging each step, logging the result, and logging a summary creates redundant data that inflates storage costs without adding analytical value. Log the entry with full context, log errors if they occur, log the result with duration. Omit intermediate steps unless they represent distinct events worth tracking independently.


How Different Roles See It

Bob, Head of Engineering

Bob's team has centralized logging but it is still unstructured text. Post-mortems show that even with logs available, investigation takes a long time because finding relevant entries requires pattern-matching against free-form text. The team is also being asked to provide audit trails for compliance, and the current logs do not have consistent enough fields to serve as reliable audit records.

What Bob should do: Bob should frame structured logging as both an operational efficiency investment and a compliance requirement. The compliance angle provides budget justification that the operational angle alone may not. Bob should commission a two-week migration project per service, starting with the highest-risk services (payment, authentication, data mutations). Each service migration follows the same playbook: adopt the structured logging library, define the field schema, validate in the aggregation platform, update alerting to use field-based queries. Bob should also make structured logging a standard criterion in the team's definition of done for new services: no service goes to production without structured JSON logs flowing to the aggregation platform.


Sarah, Productivity Lead

Sarah has heard developers complain that finding the relevant log line during an incident is harder than it should be. She measures the time from "alert fires" to "root cause identified" and knows it is too long. She suspects that moving to structured logging would cut this time significantly.

What Sarah should do: Sarah should run a structured logging pilot with one team and measure the before/after investigation time for a comparable set of incidents. The hypothesis is clear: structured logs with field-based queries dramatically reduce time-to-root-cause compared to unstructured text search. If the pilot confirms the hypothesis (it will), the data makes the case for the rest of the organization. Sarah should also note that structured logging is the prerequisite for agent-assisted investigation: without structured, queryable logs, agents cannot meaningfully participate in incident response. Every minute invested in structured logging migration is a minute that enables future agent automation.


Victor, Staff Engineer - AI Champion

Victor wants to build an agent that can investigate production incidents by querying the log system, correlating errors with deployments, and proposing root causes. He knows this requires structured logs with consistent fields and a query API. His current setup has neither.

What Victor should do: Victor should design the log schema with agent consumption as the primary use case. Agents need to ask questions like: "What errors occurred in the payment service in the 10 minutes before the alert?" and "Which user IDs were affected by this error?" These questions require specific fields (service, level, timestamp, user_id) with consistent types and names. Victor should also select a log aggregation platform based on its API quality, not just its UI quality. Grafana Loki's HTTP API, Datadog's Log Query API, and Elasticsearch's REST API are all suitable for agent queries; a platform with a beautiful UI but no programmatic API is a dead end for agent-assisted investigation. Victor should prototype the agent query path before committing to a platform.
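To make the agent query path concrete, here is a minimal Python sketch that builds a request URL for the first question above against Grafana Loki's query_range endpoint. It only constructs the URL and does not send it. The base address is hypothetical, and the LogQL query assumes that service is a Loki label and level is a field inside the JSON line, which depends on how your logs are shipped.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

LOKI_BASE_URL = "http://loki.example.internal:3100"  # hypothetical address


def errors_before_alert_url(service, alert_time, window_minutes=10, limit=100):
    """Build a Loki query_range URL for error lines in the window before an alert."""
    start = alert_time - timedelta(minutes=window_minutes)
    # LogQL: select the stream by label, parse the JSON body, keep error lines.
    query = f'{{service="{service}"}} | json | level="error"'
    params = {
        "query": query,
        # Loki accepts start and end as Unix epoch nanoseconds.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(alert_time.timestamp()) * 1_000_000_000,
        "limit": limit,
    }
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{urlencode(params)}"


alert_time = datetime(2024, 1, 15, 14, 32, 1, tzinfo=timezone.utc)
print(errors_before_alert_url("payment", alert_time))
```

If an agent can answer its core questions with a handful of parameterized queries like this one, the platform's API is likely adequate; if each question needs UI-only features, that is worth knowing before committing.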
