OpenTelemetry Complete Implementation Guide: Production-Ready Observability with Distributed Tracing, Metrics, and Logs
Observability has evolved from basic logging and metrics to comprehensive distributed tracing, contextual metrics, and structured logs. OpenTelemetry has emerged as the universal standard for observability instrumentation, unifying telemetry data collection across applications, services, and infrastructure. This comprehensive guide explains how to implement OpenTelemetry in production environments, migrate from legacy observability tools, and build a complete observability stack that scales from startups to enterprises.
Table of Contents
- What is OpenTelemetry and Why Does It Matter?
- The Observability Crisis OpenTelemetry Solves
- OpenTelemetry Architecture: Core Components
- 1. Instrumentation Libraries (SDKs)
- 2. OpenTelemetry Collector
- 3. Backend Storage and Visualization
- Implementing OpenTelemetry: Step-by-Step Production Guide
- Phase 1: Deploy OpenTelemetry Collector
- Phase 2: Instrument Applications (Auto-Instrumentation)
- Phase 3: Add Custom Instrumentation
- Phase 4: Python/FastAPI Example
- Distributed Tracing: Context Propagation
- How It Works
- Example: Microservices Communication
- Metrics Collection with OpenTelemetry
- Metric Types
- Golden Signals with OpenTelemetry
- Observability Backend: Grafana Stack
- Deploy Complete Observability Stack
- Migration Strategy: From Legacy Tools to OpenTelemetry
- Phase 1: Add OpenTelemetry Alongside Existing Tools
- Phase 2: Incremental Service Migration
- Phase 3: Decommission Legacy Tools
- Performance and Cost Optimization
- Sampling Strategies
- Resource Attribution
- Real-World Production Case Studies
- Case Study 1: E-Commerce Platform
- Case Study 2: Financial Services Company
- Case Study 3: SaaS Startup
- Conclusion: The Future of Observability is Open
What is OpenTelemetry and Why Does It Matter?
OpenTelemetry (OTel) is a Cloud Native Computing Foundation (CNCF) project that provides vendor-neutral APIs, SDKs, and tools for generating, collecting, and exporting telemetry data (traces, metrics, logs). It is the merger of the OpenTracing and OpenCensus projects, backed by major cloud providers and observability vendors.
The Observability Crisis OpenTelemetry Solves
| Problem | Legacy Approach | OpenTelemetry Solution |
|---|---|---|
| Vendor Lock-in | Proprietary agents per vendor | Standardized instrumentation |
| Instrumentation Complexity | Different libraries per tool | Single SDK for all backends |
| Data Correlation | Siloed logs, metrics, traces | Unified context propagation |
| Performance Overhead | Multiple agents per service | Single collector pipeline |
OpenTelemetry Architecture: Core Components
1. Instrumentation Libraries (SDKs)
SDKs for all major languages generate telemetry data:
- Auto-instrumentation: Automatic tracing for frameworks (Express, Flask, Spring Boot)
- Manual instrumentation: Custom spans, metrics, logs for business logic
- Semantic conventions: Standardized attribute naming for consistency
2. OpenTelemetry Collector
The Collector receives, processes, and exports telemetry data (a minimal pipeline example follows this list):
- Receivers: Accept telemetry from applications (OTLP, Jaeger, Prometheus)
- Processors: Transform data (sampling, filtering, enrichment)
- Exporters: Send data to backends (Prometheus, Jaeger, Grafana, Datadog)
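Conceptually, these pieces compose into pipelines. A minimal, illustrative config (the production-ready version appears in Phase 1 below) wires an OTLP receiver through a batch processor to a single trace exporter:
# Minimal illustration of the receiver -> processor -> exporter pipeline
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]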
3. Backend Storage and Visualization
Popular backends compatible with OpenTelemetry:
- Jaeger: Distributed tracing (open-source)
- Prometheus + Grafana: Metrics and dashboards
- Tempo: Grafana’s trace storage
- Loki: Grafana’s log aggregation
- Commercial: Datadog, New Relic, Honeycomb, Splunk
Implementing OpenTelemetry: Step-by-Step Production Guide
Phase 1: Deploy OpenTelemetry Collector
Deploy the Collector as a sidecar or DaemonSet in Kubernetes:
# otel-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      attributes:
        actions:
          - key: environment
            value: production
            action: insert
    exporters:
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
      logging:
        loglevel: debug
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, attributes]
          exporters: [otlp/tempo, logging]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.95.0
          args: ["--config=/conf/otel-collector-config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /conf
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 8889   # Prometheus metrics
          resources:
            limits:
              memory: 512Mi
              cpu: 500m
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
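The application examples in the next phase point at http://otel-collector:4317, so the Collector pods need a stable in-cluster DNS name. A minimal sketch of a matching ClusterIP Service is below; the name, namespace, and port names are assumptions and must line up with however your workloads resolve the Collector (for example, otel-collector.observability.svc.cluster.local from other namespaces):
# otel-collector-service.yaml (sketch; name and namespace are assumptions)
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
    - name: prom-exporter
      port: 8889
      targetPort: 8889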
Phase 2: Instrument Applications (Auto-Instrumentation)
Example: Node.js/Express Application
// tracing.js - Initialize OpenTelemetry BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  // metricReader expects a reader, not a bare exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});
Your application code (app.js):
// Import tracing FIRST
require('./tracing');

const express = require('express');
const app = express();

// Your application code (auto-instrumented)
app.get('/users/:id', async (req, res) => {
  const userId = req.params.id;

  // Database query (auto-traced)
  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);

  // Redis lookup (auto-traced)
  const cached = await redis.get(`user:${userId}`);

  res.json({ user });
});

app.listen(3000);
Phase 3: Add Custom Instrumentation
Capture business-specific metrics and traces:
const { trace, metrics, context, SpanStatusCode } = require('@opentelemetry/api');

// Get tracer
const tracer = trace.getTracer('user-service', '1.0.0');

// Get meter for metrics
const meter = metrics.getMeter('user-service', '1.0.0');

// Create custom metrics
const orderCounter = meter.createCounter('orders.created', {
  description: 'Count of orders created',
});

const orderValue = meter.createHistogram('orders.value', {
  description: 'Order value in USD',
});

// Business logic with custom spans
// (validateOrder, processPayment, and createOrder are the service's own business helpers)
app.post('/orders', async (req, res) => {
  const span = tracer.startSpan('process.order', {
    attributes: {
      'user.id': req.user.id,
      'order.items.count': req.body.items.length,
    }
  });
  // Make the order span the parent of the child spans created below
  const ctx = trace.setSpan(context.active(), span);

  try {
    // Validate order (child span)
    const validationSpan = tracer.startSpan('validate.order', undefined, ctx);
    await validateOrder(req.body);
    validationSpan.end();

    // Process payment (child span)
    const paymentSpan = tracer.startSpan('process.payment', undefined, ctx);
    const payment = await processPayment(req.body.payment);
    paymentSpan.setAttribute('payment.method', payment.method);
    paymentSpan.setAttribute('payment.amount', payment.amount);
    paymentSpan.end();

    // Create order
    const order = await createOrder(req.body);

    // Record metrics
    orderCounter.add(1, { 'order.status': 'success' });
    orderValue.record(order.total, { 'order.currency': 'USD' });

    span.setStatus({ code: SpanStatusCode.OK });
    res.json({ orderId: order.id });
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    orderCounter.add(1, { 'order.status': 'failed' });
    res.status(500).json({ error: error.message });
  } finally {
    span.end();
  }
});
Phase 4: Python/FastAPI Example
from fastapi import FastAPI
from pydantic import BaseModel

from opentelemetry import trace, metrics
from opentelemetry.trace import Status, StatusCode
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Initialize OpenTelemetry
resource = Resource.create({
    "service.name": "payment-service",
    "service.version": "1.0.0",
    "deployment.environment": "production"
})

trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)

# Metric readers are passed to the MeterProvider constructor
metrics.set_meter_provider(MeterProvider(
    resource=resource,
    metric_readers=[
        PeriodicExportingMetricReader(OTLPMetricExporter(endpoint="http://otel-collector:4317"))
    ],
))

# Create FastAPI app
app = FastAPI()

# Auto-instrument FastAPI, outgoing HTTP calls, and SQLAlchemy
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

# Get tracer and meter
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Custom metrics
payment_counter = meter.create_counter(
    "payments.processed",
    description="Number of payments processed"
)

# Request body model
class PaymentRequest(BaseModel):
    amount: float
    currency: str

# API endpoint with tracing
# (validate_payment and charge_customer are the service's own business helpers)
@app.post("/payments")
async def process_payment(payment: PaymentRequest):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", payment.amount)
        span.set_attribute("payment.currency", payment.currency)
        try:
            # Validate payment
            with tracer.start_as_current_span("validate_payment"):
                validate_payment(payment)
            # Charge customer
            with tracer.start_as_current_span("charge_customer"):
                result = await charge_customer(payment)
            payment_counter.add(1, {"status": "success"})
            return {"transaction_id": result.id}
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            payment_counter.add(1, {"status": "failed"})
            raise
Distributed Tracing: Context Propagation
OpenTelemetry automatically propagates trace context across service boundaries using W3C Trace Context headers.
How It Works
- Service A creates a trace with trace ID abc123
- Service A calls Service B with the HTTP header traceparent: 00-abc123-def456-01 (shortened for readability; real trace IDs are 32 hex characters and span IDs are 16)
- Service B extracts the trace context from the header and creates a child span
- Both spans share trace ID abc123, allowing end-to-end visibility
Example: Microservices Communication
// Service A (Order Service)
const axios = require('axios');
const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-service');

app.post('/orders', async (req, res) => {
  const span = tracer.startSpan('create.order');

  // The HTTP auto-instrumentation injects the W3C traceparent header into this
  // outgoing request, so the payment service joins the same trace automatically.
  const payment = await axios.post('http://payment-service/charge', {
    amount: req.body.total,
    userId: req.user.id
  });

  span.end();
  res.json({ orderId: 'order-123' });
});
Metrics Collection with OpenTelemetry
Metric Types
| Type | Use Case | Example |
|---|---|---|
| Counter | Monotonically increasing values | HTTP requests, errors |
| Gauge | Current value that can go up/down | Memory usage, queue depth |
| Histogram | Distribution of values | Response time, request size |
Golden Signals with OpenTelemetry
const meter = metrics.getMeter('api-service');

// Latency (histogram)
const httpDuration = meter.createHistogram('http.server.duration', {
  description: 'HTTP request duration in milliseconds',
  unit: 'ms'
});

// Traffic (counter)
const httpRequests = meter.createCounter('http.server.requests', {
  description: 'Total HTTP requests'
});

// Errors (counter)
const httpErrors = meter.createCounter('http.server.errors', {
  description: 'Total HTTP errors'
});

// Saturation (UpDownCounter tracking in-flight requests)
const activeRequests = meter.createUpDownCounter('http.server.active_requests', {
  description: 'Currently active HTTP requests'
});

// Middleware to record metrics
app.use((req, res, next) => {
  const start = Date.now();
  activeRequests.add(1);

  res.on('finish', () => {
    const duration = Date.now() - start;

    httpDuration.record(duration, {
      'http.method': req.method,
      'http.route': req.route?.path || 'unknown',
      'http.status_code': res.statusCode
    });

    httpRequests.add(1, {
      'http.method': req.method,
      'http.status_code': res.statusCode
    });

    if (res.statusCode >= 400) {
      httpErrors.add(1, {
        'http.method': req.method,
        'http.status_code': res.statusCode
      });
    }

    activeRequests.add(-1);
  });

  next();
});
Observability Backend: Grafana Stack
Deploy Complete Observability Stack
# docker-compose.yml - Complete observability stack
version: '3.8'

services:
  # Trace storage
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  # Metrics storage
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"

  # Log storage
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  # Visualization
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    ports:
      - "3000:3000"
    depends_on:
      - tempo
      - prometheus
      - loki

volumes:
  tempo-data:
  prometheus-data:
  loki-data:
  grafana-data:
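The compose file mounts ./grafana-datasources.yml (along with ./tempo.yaml and ./prometheus.yml) but does not show its contents. A minimal sketch of the Grafana datasource provisioning file is below; it assumes Grafana reaches the other containers by their compose service names on default ports (Tempo HTTP API 3200, Prometheus 9090, Loki 3100):
# grafana-datasources.yml (sketch; URLs assume the compose service names above)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100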
Migration Strategy: From Legacy Tools to OpenTelemetry
Phase 1: Add OpenTelemetry Alongside Existing Tools
Run both systems in parallel:
- Deploy the OpenTelemetry Collector with exporters to both the new and legacy backends (see the pipeline sketch after this list)
- Instrument 1-2 non-critical services with OpenTelemetry
- Validate data quality and completeness
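As a sketch of what exporting to both backends can look like at the Collector level, the traces pipeline below fans out to the new Tempo backend and to the legacy tool over OTLP; the otlp/legacy name and endpoint are placeholders for whatever your existing vendor accepts (many commercial APMs now ingest OTLP directly):
# Sketch: one pipeline, two destinations during the migration window
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  otlp/legacy:
    endpoint: legacy-apm-vendor:4317   # placeholder for your existing backend's OTLP endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, otlp/legacy]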
Phase 2: Incremental Service Migration
Migrate services one at a time:
- Start with stateless services (easier rollback)
- Remove legacy instrumentation after validating OTel data
- Update dashboards and alerts to use OTel metrics
Phase 3: Decommission Legacy Tools
Once 80%+ of services migrated:
- Sunset legacy observability agents
- Consolidate backends (optional: move to single vendor or open-source stack)
- Train teams on new observability workflows
Performance and Cost Optimization
Sampling Strategies
Reduce trace volume without losing visibility:
processors:
  # Tail-based sampling: keep interesting traces
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always sample slow requests
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      # Sample 1% of normal requests
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
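Tail-based sampling only takes effect once the processor is wired into the traces pipeline. A minimal sketch, assuming the collector-contrib image used earlier (which ships the tail_sampling processor):
service:
  pipelines:
    traces:
      receivers: [otlp]
      # tail_sampling sits ahead of batch so sampling decisions are made on whole traces
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]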
Resource Attribution
Tag telemetry with cost allocation metadata:
processors:
  resource:
    attributes:
      - key: team
        value: platform-engineering
        action: insert
      - key: cost_center
        value: engineering-ops
        action: insert
      - key: environment
        from_attribute: deployment.environment
        action: insert
Real-World Production Case Studies
Case Study 1: E-Commerce Platform
Challenge: $50k/month observability costs, vendor lock-in with Datadog
Solution: Migrated to OpenTelemetry + Grafana Cloud
Results: 60% cost reduction ($20k/month), no functionality loss, 2x faster troubleshooting with correlated traces
Case Study 2: Financial Services Company
Challenge: Siloed observability (3 different APM tools across teams)
Solution: Standardized on OpenTelemetry, unified backends
Results: Mean time to resolution (MTTR) reduced 40%, cross-team collaboration improved, unified dashboards
Case Study 3: SaaS Startup
Challenge: Rapid growth (10 to 100 microservices in 12 months)
Solution: Built on OpenTelemetry from day 1
Results: Zero vendor lock-in, flexibility to switch backends, $10k/month vs $50k+ for commercial APM at scale
Conclusion: The Future of Observability is Open
OpenTelemetry has become the de facto standard for observability instrumentation. Its vendor-neutral approach eliminates lock-in while providing best-in-class telemetry collection. Organizations implementing OpenTelemetry gain flexibility to choose backends, reduce costs, and future-proof their observability stack.
The transition from proprietary APM tools to OpenTelemetry requires upfront investment but pays dividends through reduced costs, increased flexibility, and unified observability across your entire infrastructure.
Start with auto-instrumentation, add custom spans for business logic, deploy the Collector, and connect to your preferred backend. Within weeks, you’ll have production-grade observability that scales from startups to enterprises, without vendor lock-in.
The observability revolution is here, and it’s open source. OpenTelemetry is the foundation of that revolution.