Cloud & DevOps | June 28, 2026 | 9 min read

OpenTelemetry Distributed Tracing: How to Instrument Your Entire Stack and Eliminate Blind Spots in Production

Most engineering teams only discover production failures after users complain — OpenTelemetry distributed tracing changes that by giving you deep, correlated visibility across every service, database call, and API hop in your system.

Lucas Bennett
UI/UX Design Director
TL;DR Quick Answer: OpenTelemetry distributed tracing lets you follow a single request as it travels through every microservice, database, queue, and third-party API in your system — generating a correlated trace with spans, attributes, and timing data. When implemented correctly, teams typically see a 60% reduction in Mean Time to Resolution (MTTR) and eliminate the "log archaeology" that burns engineering hours during incidents. This guide walks you through full-stack instrumentation from Node.js services to Python workers to your PostgreSQL layer.

If your team is still debugging production incidents by grepping through disconnected log files across five different services, you already know the pain. OpenTelemetry distributed tracing is the industry-standard answer to that problem — and in 2025, there is no credible excuse for running a microservices architecture without it. At Apargo, we instrument every production system we build with OpenTelemetry from day one, not as an afterthought. This article is the deep technical guide we wish existed when we started.

What Is OpenTelemetry Distributed Tracing and Why It Matters

OpenTelemetry (OTel) is a CNCF-hosted observability framework that provides vendor-neutral APIs, SDKs, and a collector for generating, collecting, and exporting telemetry data (traces, metrics, and logs) from your applications. The "distributed" in distributed tracing refers to the ability to correlate a single logical request across multiple independent processes and services using a shared Trace ID.

Without OpenTelemetry distributed tracing, your observability stack typically looks like this:

  • Service A logs a request ID that only exists in its own log stream
  • Service B has no idea which upstream call triggered its execution
  • Your database slow-query log has timestamps but no correlation to the user-facing request
  • A 1,200ms latency spike exists somewhere — but finding it takes 45 minutes of manual cross-referencing

With OpenTelemetry distributed tracing in place, that same investigation takes under 90 seconds. You open your tracing backend (Jaeger, Tempo, Honeycomb, Datadog), search by Trace ID, and see the entire waterfall: which service was slow, which database query took 800ms, and which downstream HTTP call timed out.
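To make the waterfall concrete, here is a minimal Python sketch of roughly what a tracing backend does once you search by Trace ID: it rebuilds the parent/child tree from the spans and surfaces the unit of work that consumed the most time. The span records, service names, and helper functions below are hypothetical illustrations, not any backend's real API.

```python
from collections import defaultdict

# Hypothetical spans for one trace: (span_id, parent_id, name, start_ms, end_ms)
SPANS = [
    ("a1", None, "api-gateway GET /orders", 0, 1200),
    ("b2", "a1", "order-service processOrder", 20, 1150),
    ("c3", "b2", "postgres SELECT orders", 40, 840),
    ("d4", "b2", "payment-api POST /charge", 850, 1100),
]


def build_children(spans):
    """Index spans by parent ID so the tree can be walked top-down."""
    children = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)
    return children


def self_time(span, children):
    """Time spent in the span itself, excluding its direct children.

    Assumes children run sequentially (no overlap), which keeps the sketch simple.
    """
    span_id, _, _, start, end = span
    child_total = sum(c[4] - c[3] for c in children[span_id])
    return (end - start) - child_total


def print_waterfall(children, parent=None, depth=0):
    """Print an indented waterfall, ordered by start time."""
    for span in sorted(children[parent], key=lambda s: s[3]):
        print(f"{'  ' * depth}{span[2]}: {span[4] - span[3]}ms")
        print_waterfall(children, span[0], depth + 1)


children = build_children(SPANS)
print_waterfall(children)
slowest = max(SPANS, key=lambda s: self_time(s, children))
print("Most self time:", slowest[2])
```

In this made-up trace the gateway and order service spend almost all of their time waiting, and the 800ms database query stands out as the span with the most self time.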

Core Concepts You Must Understand Before Writing a Single Line

Traces, Spans, and Context Propagation

A Trace is the complete journey of a request through your system, represented as a directed acyclic graph (DAG) of Spans. Each Span represents a single unit of work — an HTTP handler, a database query, a cache lookup, a message queue publish. Every Span carries:

  • Trace ID: A 128-bit globally unique identifier shared across all spans in a trace
  • Span ID: A 64-bit identifier unique to this specific unit of work
  • Parent Span ID: Links this span to its caller, forming the tree structure
  • Start/End Timestamps: Nanosecond-precision timing for latency analysis
  • Attributes: Key-value metadata (e.g., http.method, db.statement, user.id)
  • Status: OK, ERROR, or UNSET — critical for alerting
  • Events: Timestamped log-like entries attached to a span
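The fields above can be sketched as a small Python data structure. This is an illustrative model only, assuming hex-encoded IDs; SpanSketch and its helpers are hypothetical names, and the real SDK span classes carry more state than this.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


def new_trace_id() -> str:
    """128-bit trace ID, rendered as 32 hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """64-bit span ID, rendered as 16 hex characters."""
    return secrets.token_hex(8)


@dataclass
class SpanSketch:
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)
    status: str = "UNSET"  # becomes "OK" or "ERROR" only when set explicitly
    events: list = field(default_factory=list)

    def child(self, name: str) -> "SpanSketch":
        """A child shares the trace ID and points back at its caller."""
        return SpanSketch(name, self.trace_id, new_span_id(), parent_span_id=self.span_id)

    def add_event(self, name: str, **attrs) -> None:
        """Attach a timestamped, log-like entry to this span."""
        self.events.append((time.time_ns(), name, attrs))

    def end(self) -> None:
        self.end_ns = time.time_ns()


root = SpanSketch("GET /orders", new_trace_id(), new_span_id())
query = root.child("SELECT orders")
query.attributes["db.system"] = "postgresql"
query.add_event("rows.fetched", count=3)
query.end()
root.end()
print(query.trace_id == root.trace_id, query.parent_span_id == root.span_id)
```

The only thing tying the two spans together is the shared trace ID plus the parent pointer, which is exactly what a backend uses to rebuild the tree.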

Context Propagation is the mechanism by which Trace IDs travel across service boundaries. The W3C Trace Context standard (HTTP headers: traceparent and tracestate) is the default propagation format in OpenTelemetry, and most major HTTP client and server frameworks support it either natively or through OpenTelemetry instrumentation libraries.
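As a rough illustration of what travels in that header, the sketch below parses and re-emits a version 00 traceparent value (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). In practice the OpenTelemetry propagators do this for you; the function names here are hypothetical, and the parser deliberately ignores tracestate and future header versions.

```python
import re
import secrets

# version "00": 2 hex - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def format_traceparent(trace_id, span_id, sampled=True):
    """Render a version 00 traceparent header value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the propagated context, or None if the header is invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def continue_trace(incoming_headers):
    """Build outgoing headers: keep the trace ID, mint a new span ID."""
    ctx = parse_traceparent(incoming_headers.get("traceparent", ""))
    if ctx is None:
        trace_id, sampled = secrets.token_hex(16), True  # start a new trace
    else:
        trace_id, sampled = ctx["trace_id"], ctx["sampled"]
    return {"traceparent": format_traceparent(trace_id, secrets.token_hex(8), sampled)}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(parse_traceparent(incoming["traceparent"]))
print(continue_trace(incoming))
```

Each hop keeps the trace ID and replaces the span ID with its own, so the downstream service records the caller's span as its parent.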

The OpenTelemetry Collector

The OTel Collector is a standalone binary that sits between your instrumented services and your observability backend. It receives spans via OTLP (OpenTelemetry Protocol) over gRPC or HTTP, applies processors (batching, filtering, attribute enrichment), and exports to one or many backends simultaneously. Running the Collector decouples your application code from your backend vendor — swap Jaeger for Grafana Tempo without touching a single line of application code.
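The batching and fan-out behavior described above can be sketched in a few lines of Python. This is a simplified model, not the Collector's implementation: the BatchingFanOut class and the list-based "backends" are hypothetical, and the timeout is only checked when a span arrives, whereas a real batch processor flushes on a background timer.

```python
import time


class BatchingFanOut:
    """Buffers spans and hands each batch to every configured exporter."""

    def __init__(self, exporters, send_batch_size=512, timeout_s=5.0, clock=time.monotonic):
        self.exporters = list(exporters)
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def add(self, span):
        self.buffer.append(span)
        size_reached = len(self.buffer) >= self.send_batch_size
        timed_out = self.clock() - self.last_flush >= self.timeout_s
        if size_reached or timed_out:
            self.flush()

    def flush(self):
        if self.buffer:
            batch, self.buffer = self.buffer, []
            for export in self.exporters:
                export(list(batch))  # each backend receives its own copy
        self.last_flush = self.clock()


# Two stand-in "backends" that simply collect the batches they receive.
backend_a, backend_b = [], []
pipeline = BatchingFanOut([backend_a.append, backend_b.append], send_batch_size=3)
for i in range(7):
    pipeline.add({"span_id": i})
pipeline.flush()  # drain the remainder, as a shutdown hook would
print([len(batch) for batch in backend_a])
```

Because the application only ever talks to the pipeline, adding or swapping a backend is a change to the exporter list, not to the code that produces spans.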

Full-Stack OpenTelemetry Distributed Tracing: Instrumentation Walkthrough

Step 1: Instrument a Node.js Express Service

Install the required packages:

npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Create a tracing.js file that must be loaded before any other module:

// tracing.js — Load this FIRST via node --require ./tracing.js
'use strict';

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

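// Define the service resource; this appears as the service name in your tracing UI.
// Note: written against the 1.x JS SDK. Newer releases deprecate SemanticResourceAttributes
// in favor of ATTR_* constants (e.g. ATTR_SERVICE_NAME) and replace `new Resource(...)`
// with resourceFromAttributes(...), so check the versions you install.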
const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]: 'api-gateway',
  [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
  [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'production',
});

// Configure the OTLP exporter pointing to your OTel Collector
const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
});

const sdk = new NodeSDK({
  resource,
  traceExporter,
  // Auto-instrumentation covers: HTTP, Express, gRPC, pg, redis, mongoose, and 40+ more
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingRequestHook: (req) => req.url === '/health', // skip health checks
      },
    }),
  ],
});

sdk.start();

// Graceful shutdown: flush remaining spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown()
    .catch((err) => console.error('Error shutting down OpenTelemetry SDK', err))
    .finally(() => process.exit(0));
});

Start your service with the tracer loaded:

node --require ./tracing.js server.js

With this single file, you get automatic instrumentation for all incoming/outgoing HTTP requests, Express route handlers, PostgreSQL queries via pg, Redis calls, and more — with zero changes to your business logic code.

Step 2: Add Custom Spans for Business Logic

Auto-instrumentation covers infrastructure calls, but you need manual spans for your domain logic — payment processing, fraud checks, AI inference calls. Here's how:

// order-service.js
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-service', '1.0.0');

async function processOrder(orderId, userId) {
  // Create a custom span wrapping the entire order processing flow
  return await tracer.startActiveSpan('order.process', async (span) => {
    try {
      // Attach business-context attributes — searchable in your tracing UI
      span.setAttributes({
        'order.id': orderId,
        'user.id': userId,
        'order.source': 'api',
      });

      const inventory = await checkInventory(orderId); // auto-instrumented DB call nested here
      const payment = await chargePayment(userId, inventory.totalAmount);

      span.setAttributes({
        'payment.transaction_id': payment.transactionId,
        'payment.amount_usd': inventory.totalAmount,
      });

      // Add a timestamped event (like a log, but attached to the trace)
      span.addEvent('payment.authorized', {
        'payment.method': payment.method,
        'payment.gateway_latency_ms': payment.latencyMs,
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return { success: true, transactionId: payment.transactionId };

    } catch (error) {
      // Record the exception — this shows the full stack trace in your tracing backend
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end(); // Always end the span
    }
  });
}

Step 3: Instrument a Python FastAPI Service

Install OpenTelemetry for Python:

pip install opentelemetry-sdk \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-sqlalchemy \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-httpx

Then configure tracing in a dedicated setup module:

# tracing_setup.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
import os

def setup_tracing(app):
    resource = Resource.create({
        SERVICE_NAME: "ml-inference-service",
        SERVICE_VERSION: os.getenv("APP_VERSION", "1.0.0"),
        "deployment.environment": os.getenv("ENVIRONMENT", "production"),
    })

    exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector:4317"),
        insecure=True,  # plaintext gRPC; only appropriate on a trusted internal network
    )

    provider = TracerProvider(resource=resource)
    # BatchSpanProcessor buffers spans and exports them in batches, which is critical for production performance.
    # Defaults: max_queue_size=2048, max_export_batch_size=512, schedule_delay_millis=5000
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Auto-instrument FastAPI routes, SQLAlchemy queries, and outbound HTTPX calls
    FastAPIInstrumentor.instrument_app(app)
    SQLAlchemyInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()

    return trace.get_tracer("ml-inference-service")

Step 4: Deploy the OpenTelemetry Collector with Docker Compose

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # gRPC — preferred for high throughput
      http:
        endpoint: 0.0.0.0:4318   # OTLP/HTTP (protobuf or JSON), useful for browser SDKs

processors:
  batch:
    timeout: 5s                  # Flush every 5 seconds
    send_batch_size: 512         # Or when 512 spans are buffered
  memory_limiter:
    check_interval: 1s
    limit_mib: 512               # Prevent OOM crashes under spike load
  resource:
    attributes:
      - key: cluster.name
        value: "production-us-east-1"
        action: insert

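exporters:
  # Jaeger ingests OTLP natively, and the dedicated `jaeger` exporter has been
  # removed from current Collector releases, so send traces over OTLP instead.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Exposes a scrape endpoint for a metrics pipeline. Span-derived metrics also
  # need a connector (such as spanmetrics) feeding that pipeline; the prometheus
  # exporter cannot be attached to a traces pipeline directly.
  prometheus:
    endpoint: "0.0.0.0:8889"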

service:
  pipelines:
    traces:
      receivers: