
OpenTelemetry Complete Implementation Guide: Production-Ready Observability with Distributed Tracing, Metrics, and Logs

🎯 Key Takeaways

  • What is OpenTelemetry and Why Does It Matter?
  • OpenTelemetry Architecture: Core Components
  • Implementing OpenTelemetry: Step-by-Step Production Guide
  • Distributed Tracing: Context Propagation
  • Metrics Collection with OpenTelemetry

Observability has evolved from basic logging and metrics to comprehensive distributed tracing, contextual metrics, and structured logs. OpenTelemetry has emerged as the universal standard for observability instrumentation, unifying telemetry data collection across applications, services, and infrastructure. This comprehensive guide explains how to implement OpenTelemetry in production environments, migrate from legacy observability tools, and build a complete observability stack that scales from startups to enterprises.

What is OpenTelemetry and Why Does It Matter?

OpenTelemetry (OTel) is a Cloud Native Computing Foundation (CNCF) project that provides vendor-neutral APIs, SDKs, and tools for generating, collecting, and exporting telemetry data (traces, metrics, logs). It’s the merger of OpenTracing and OpenCensus projects, backed by major cloud providers and observability vendors.

The Observability Crisis OpenTelemetry Solves

| Problem | Legacy Approach | OpenTelemetry Solution |
| --- | --- | --- |
| Vendor lock-in | Proprietary agents per vendor | Standardized instrumentation |
| Instrumentation complexity | Different libraries per tool | Single SDK for all backends |
| Data correlation | Siloed logs, metrics, traces | Unified context propagation |
| Performance overhead | Multiple agents per service | Single collector pipeline |

OpenTelemetry Architecture: Core Components

1. Instrumentation Libraries (SDKs)

SDKs for all major languages generate telemetry data:

  • Auto-instrumentation: Automatic tracing for frameworks (Express, Flask, Spring Boot)
  • Manual instrumentation: Custom spans, metrics, logs for business logic
  • Semantic conventions: Standardized attribute naming for consistency
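To make the manual side concrete, here is a minimal sketch of a custom span that mixes a semantic-convention key with a business attribute (the function, the cart object, and the checkout.* names are illustrative):

const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('checkout-service', '1.0.0');

function recordCheckout(cart) {
  // Manual span around business logic; 'enduser.id' follows the
  // semantic conventions so every backend interprets it the same way
  const span = tracer.startSpan('checkout.process', {
    attributes: {
      'enduser.id': cart.userId,                  // semantic-convention key
      'checkout.items.count': cart.items.length,  // custom business key
    },
  });
  try {
    // ... business logic ...
  } finally {
    span.end();
  }
}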

2. OpenTelemetry Collector

The Collector receives, processes, and exports telemetry data:

  • Receivers: Accept telemetry from applications (OTLP, Jaeger, Prometheus)
  • Processors: Transform data (sampling, filtering, enrichment)
  • Exporters: Send data to backends (Prometheus, Jaeger, Grafana, Datadog)

3. Backend Storage and Visualization

Popular backends compatible with OpenTelemetry:

  • Jaeger: Distributed tracing (open-source)
  • Prometheus + Grafana: Metrics and dashboards
  • Tempo: Grafana’s trace storage
  • Loki: Grafana’s log aggregation
  • Commercial: Datadog, New Relic, Honeycomb, Splunk

Implementing OpenTelemetry: Step-by-Step Production Guide

Phase 1: Deploy OpenTelemetry Collector

Deploy the Collector as a sidecar or DaemonSet in Kubernetes:

# otel-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
    
    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      attributes:
        actions:
          - key: environment
            value: production
            action: insert
    
    exporters:
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
      debug:
        verbosity: detailed
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          # enrich with attributes first; batch should run last
          processors: [memory_limiter, attributes, batch]
          exporters: [otlp/tempo, debug]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.95.0
        args: ["--config=/conf/otel-collector-config.yaml"]
        volumeMounts:
        - name: config
          mountPath: /conf
        ports:
        - containerPort: 4317  # OTLP gRPC
        - containerPort: 4318  # OTLP HTTP
        - containerPort: 8889  # Prometheus metrics
        resources:
          limits:
            memory: 512Mi
            cpu: 500m
      volumes:
      - name: config
        configMap:
          name: otel-collector-config

Phase 2: Instrument Applications (Auto-Instrumentation)

Example: Node.js/Express Application

// tracing.js - Initialize OpenTelemetry BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  // metricReader expects a reader, not a bare exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

Your application code (app.js):

// Import tracing FIRST
require('./tracing');

const express = require('express');
const app = express();

// Your application code (auto-instrumented)
app.get('/users/:id', async (req, res) => {
  const userId = req.params.id;
  
  // Database query through a pg client assumed created elsewhere (auto-traced)
  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  
  // Redis lookup through a redis client assumed created elsewhere (auto-traced)
  const cached = await redis.get(`user:${userId}`);
  
  res.json({ user });
});

app.listen(3000);
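If you prefer to keep the application entry point untouched, Node can preload the SDK with its --require flag instead of the explicit require('./tracing') at the top of app.js:

node --require ./tracing.js app.js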

Phase 3: Add Custom Instrumentation

Capture business-specific metrics and traces:

const { trace, metrics, context, SpanStatusCode } = require('@opentelemetry/api');

// Get tracer
const tracer = trace.getTracer('user-service', '1.0.0');

// Get meter for metrics
const meter = metrics.getMeter('user-service', '1.0.0');

// Create custom metrics
const orderCounter = meter.createCounter('orders.created', {
  description: 'Count of orders created',
});

const orderValue = meter.createHistogram('orders.value', {
  description: 'Order value in USD',
});

// Business logic with custom spans
app.post('/orders', async (req, res) => {
  const span = tracer.startSpan('process.order', {
    attributes: {
      'user.id': req.user.id,
      'order.items.count': req.body.items.length,
    }
  });
  // The JS API has no `parent` option on startSpan; instead, build a
  // context in which `span` is active and pass it when starting children
  const ctx = trace.setSpan(context.active(), span);
  
  try {
    // Validate order (child span)
    const validationSpan = tracer.startSpan('validate.order', {}, ctx);
    await validateOrder(req.body);
    validationSpan.end();
    
    // Process payment (child span)
    const paymentSpan = tracer.startSpan('process.payment', {}, ctx);
    const payment = await processPayment(req.body.payment);
    paymentSpan.setAttribute('payment.method', payment.method);
    paymentSpan.setAttribute('payment.amount', payment.amount);
    paymentSpan.end();
    
    // Create order
    const order = await createOrder(req.body);
    
    // Record metrics
    orderCounter.add(1, { 'order.status': 'success' });
    orderValue.record(order.total, { 'order.currency': 'USD' });
    
    span.setStatus({ code: SpanStatusCode.OK });
    res.json({ orderId: order.id });
    
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    orderCounter.add(1, { 'order.status': 'failed' });
    res.status(500).json({ error: error.message });
  } finally {
    span.end();
  }
});
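As an alternative to passing contexts by hand, tracer.startActiveSpan activates the span for the duration of the callback, so children pick up the parent automatically. A condensed sketch of the same handler (spans must still be ended explicitly):

app.post('/orders', (req, res) =>
  tracer.startActiveSpan('process.order', async (span) => {
    try {
      // Children started inside the callback parent to `span` automatically
      await tracer.startActiveSpan('validate.order', async (child) => {
        await validateOrder(req.body);
        child.end();
      });
      span.setStatus({ code: SpanStatusCode.OK });
      res.json({ status: 'created' });
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      res.status(500).json({ error: error.message });
    } finally {
      span.end();
    }
  })
);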

Phase 4: Python/FastAPI Example

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.trace import Status, StatusCode
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from fastapi import FastAPI

# Initialize OpenTelemetry
resource = Resource.create({
    "service.name": "payment-service",
    "service.version": "1.0.0",
    "deployment.environment": "production"
})

trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)

# Metric readers are passed to the MeterProvider constructor
metrics.set_meter_provider(
    MeterProvider(
        resource=resource,
        metric_readers=[
            PeriodicExportingMetricReader(
                OTLPMetricExporter(endpoint="http://otel-collector:4317")
            )
        ],
    )
)

# Create FastAPI app
app = FastAPI()

# Auto-instrument FastAPI
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

# Get tracer and meter
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Custom metrics
payment_counter = meter.create_counter(
    "payments.processed",
    description="Number of payments processed"
)

# API endpoint with tracing
# (PaymentRequest is a Pydantic request model, and validate_payment /
#  charge_customer are business helpers, all assumed defined elsewhere)
@app.post("/payments")
async def process_payment(payment: PaymentRequest):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", payment.amount)
        span.set_attribute("payment.currency", payment.currency)
        
        try:
            # Validate payment
            with tracer.start_as_current_span("validate_payment"):
                validate_payment(payment)
            
            # Charge customer
            with tracer.start_as_current_span("charge_customer"):
                result = await charge_customer(payment)
            
            payment_counter.add(1, {"status": "success"})
            return {"transaction_id": result.id}
            
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            payment_counter.add(1, {"status": "failed"})
            raise
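Run the service as usual (assuming the module is named main.py); the instrumentors hook in when the app starts:

uvicorn main:app --host 0.0.0.0 --port 8000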

Distributed Tracing: Context Propagation

OpenTelemetry automatically propagates trace context across service boundaries using W3C Trace Context headers.

How It Works

  1. Service A creates a trace with trace ID abc123 (in practice a 32-character hex string)
  2. Service A calls Service B with the HTTP header traceparent: 00-abc123-def456-01 (the fields are version, trace ID, parent span ID, and trace flags)
  3. Service B extracts the trace context from the header and creates a child span
  4. Both spans share trace ID abc123, allowing end-to-end visibility
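Auto-instrumentation performs this injection and extraction for you on HTTP calls. For transports without auto-instrumentation (a custom message queue, for example), the same mechanism is available through the propagation API. A minimal sketch, with the plain carrier object chosen for illustration:

const { context, propagation } = require('@opentelemetry/api');

// Producer: copy the active trace context into a carrier object
const carrier = {};
propagation.inject(context.active(), carrier);
// ... send `carrier` alongside the message payload ...

// Consumer: restore the context and run the handler inside it
const extracted = propagation.extract(context.active(), carrier);
context.with(extracted, () => {
  // Spans started here become children of the producer's span
});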

Example: Microservices Communication

// Service A (Order Service)
const axios = require('axios');

app.post('/orders', async (req, res) => {
  // startActiveSpan makes the span current, so the auto-instrumented
  // HTTP client parents its request span to it and injects the
  // traceparent header into the outgoing call
  await tracer.startActiveSpan('create.order', async (span) => {
    const payment = await axios.post('http://payment-service/charge', {
      amount: req.body.total,
      userId: req.user.id
    });
    
    // Trace context propagated automatically!
    
    span.end();
    res.json({ orderId: 'order-123' });
  });
});

Metrics Collection with OpenTelemetry

Metric Types

| Type | Use Case | Example |
| --- | --- | --- |
| Counter | Monotonically increasing values | HTTP requests, errors |
| Gauge | Current value that can go up/down | Memory usage, queue depth |
| Histogram | Distribution of values | Response time, request size |
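In the JavaScript SDK, a gauge is modeled as an observable instrument whose callback is sampled on each collection cycle. A minimal sketch (the queue object and its depth() method are illustrative):

const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('api-service');

const queueDepth = meter.createObservableGauge('queue.depth', {
  description: 'Current number of jobs waiting in the queue',
});

// Invoked on every metric collection/export interval
queueDepth.addCallback((observableResult) => {
  observableResult.observe(queue.depth(), { 'queue.name': 'orders' });
});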

Golden Signals with OpenTelemetry

const meter = metrics.getMeter('api-service');

// Latency (histogram)
const httpDuration = meter.createHistogram('http.server.duration', {
  description: 'HTTP request duration in milliseconds',
  unit: 'ms'
});

// Traffic (counter)
const httpRequests = meter.createCounter('http.server.requests', {
  description: 'Total HTTP requests'
});

// Errors (counter)
const httpErrors = meter.createCounter('http.server.errors', {
  description: 'Total HTTP errors'
});

// Saturation (UpDownCounter used as a gauge)
const activeRequests = meter.createUpDownCounter('http.server.active_requests', {
  description: 'Currently active HTTP requests'
});

// Middleware to record metrics
app.use((req, res, next) => {
  const start = Date.now();
  activeRequests.add(1);
  
  res.on('finish', () => {
    const duration = Date.now() - start;
    
    httpDuration.record(duration, {
      'http.method': req.method,
      'http.route': req.route?.path || 'unknown',
      'http.status_code': res.statusCode
    });
    
    httpRequests.add(1, {
      'http.method': req.method,
      'http.status_code': res.statusCode
    });
    
    if (res.statusCode >= 400) {
      httpErrors.add(1, {
        'http.method': req.method,
        'http.status_code': res.statusCode
      });
    }
    
    activeRequests.add(-1);
  });
  
  next();
});

Observability Backend: Grafana Stack

Deploy Complete Observability Stack

# docker-compose.yml - Complete observability stack
version: '3.8'
services:
  # Trace storage
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
  
  # Metrics storage
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
  
  # Log storage
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
  
  # Visualization
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    ports:
      - "3000:3000"
    depends_on:
      - tempo
      - prometheus
      - loki

volumes:
  tempo-data:
  prometheus-data:
  loki-data:
  grafana-data:
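The compose file mounts a grafana-datasources.yml that is not shown above. A minimal provisioning sketch for the three backends might look like the following (the URLs use each service's default port; Tempo serves queries on 3200):

# grafana-datasources.yml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100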

Migration Strategy: From Legacy Tools to OpenTelemetry

Phase 1: Add OpenTelemetry Alongside Existing Tools

Run both systems in parallel:

  • Deploy the OpenTelemetry Collector with exporters to both the new and legacy backends (see the pipeline sketch after this list)
  • Instrument 1-2 non-critical services with OpenTelemetry
  • Validate data quality and completeness
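As a sketch of this dual-export phase, a single traces pipeline can fan out to both backends. This assumes the contrib collector build and Datadog as the legacy example (the datadog exporter ships with the contrib distribution; the API key reference is illustrative):

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      # Same spans delivered to the new and the legacy backend
      exporters: [otlp/tempo, datadog]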

Phase 2: Incremental Service Migration

Migrate services one at a time:

  • Start with stateless services (easier rollback)
  • Remove legacy instrumentation after validating OTel data
  • Update dashboards and alerts to use OTel metrics

Phase 3: Decommission Legacy Tools

Once 80%+ of services have been migrated:

  • Sunset legacy observability agents
  • Consolidate backends (optional: move to single vendor or open-source stack)
  • Train teams on new observability workflows

Performance and Cost Optimization

Sampling Strategies

Reduce trace volume without losing visibility:

processors:
  # Tail-based sampling: Keep interesting traces
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      
      # Always sample slow requests
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      
      # Sample 1% of normal requests
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
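The processor only takes effect once it is wired into the traces pipeline (tail_sampling ships in the contrib distribution, which the DaemonSet above already uses):

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sampling decisions need complete traces, so sample before batching
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]

Note that tail sampling requires every span of a trace to reach the same collector instance, so it is typically run in a dedicated gateway tier rather than on per-node agents.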

Resource Attribution

Tag telemetry with cost allocation metadata:

processors:
  resource:
    attributes:
      - key: team
        value: platform-engineering
        action: insert
      - key: cost_center
        value: engineering-ops
        action: insert
      - key: environment
        from_attribute: deployment.environment
        action: insert

Real-World Production Case Studies

Case Study 1: E-Commerce Platform

Challenge: $50k/month observability costs, vendor lock-in with Datadog
Solution: Migrated to OpenTelemetry + Grafana Cloud
Results: 60% cost reduction ($20k/month), no functionality loss, 2x faster troubleshooting with correlated traces

Case Study 2: Financial Services Company

Challenge: Siloed observability (3 different APM tools across teams)
Solution: Standardized on OpenTelemetry, unified backends
Results: Mean time to resolution (MTTR) reduced 40%, cross-team collaboration improved, unified dashboards

Case Study 3: SaaS Startup

Challenge: Rapid growth (10 → 100 microservices in 12 months)
Solution: Built on OpenTelemetry from day 1
Results: Zero vendor lock-in, flexibility to switch backends, $10k/month vs $50k+ for commercial APM at scale

Conclusion: The Future of Observability is Open

OpenTelemetry has become the de facto standard for observability instrumentation. Its vendor-neutral approach eliminates lock-in while providing best-in-class telemetry collection. Organizations implementing OpenTelemetry gain flexibility to choose backends, reduce costs, and future-proof their observability stack.

The transition from proprietary APM tools to OpenTelemetry requires upfront investment but pays dividends through reduced costs, increased flexibility, and unified observability across your entire infrastructure.

Start with auto-instrumentation, add custom spans for business logic, deploy the Collector, and connect to your preferred backend. Within weeks, you'll have production-grade observability that scales from startups to enterprises, without vendor lock-in.

The observability revolution is here, and it’s open source. OpenTelemetry is the foundation of that revolution.
