Replacing Jaeger with self-hosted Tempo + Grafana as the distributed tracing backend

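Background

We previously used Jaeger as our trace backend. The pain points:

The Elasticsearch backend is memory-hungry, and running an ES cluster is a chore
The UI is weaker than Grafana's and lives apart from Prometheus / Loki
Trace data filled the disk after only a few days of retention

Tempo is Grafana Labs' open-source trace backend, designed for object storage:

The backend is S3 / GCS / any compatible object store (cheap, effectively unlimited capacity)
No ES required, which saves memory
Same ecosystem as Grafana / Loki / Prometheus, so a single UI ties together metrics, logs, and traces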

Installation

docker-compose.yml (development / single-node production):

services:
  tempo:
    image: grafana/tempo:2.6.0
    command: ['-config.file=/etc/tempo/tempo.yaml']
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
      - tempo-data:/var/tempo
    ports:
      - "3200:3200"   # tempo API
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  grafana:
    image: grafana/grafana:11.2.0
    ports: ["3000:3000"]
    environment:
      # Anonymous Admin is convenient for local dev; turn it off before exposing Grafana
      GF_AUTH_ANONYMOUS_ENABLED: 'true'
      GF_AUTH_ANONYMOUS_ORG_ROLE: 'Admin'
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml

volumes:
  tempo-data:

tempo.yaml:

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc: { endpoint: 0.0.0.0:4317 }
        http: { endpoint: 0.0.0.0:4318 }

ingester:
  trace_idle_period: 10s
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 168h     # keep 7 days

storage:
  trace:
    backend: local           # local disk; switch to s3 / gcs in production
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

For production, switch storage to S3:

storage:
  trace:
    backend: s3
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      bucket: my-tempo-traces
      # ${VAR} is only expanded if Tempo is started with -config.expand-env=true
      access_key: ${AWS_ACCESS_KEY_ID}
      secret_key: ${AWS_SECRET_ACCESS_KEY}
    wal:
      path: /var/tempo/wal

S3, Backblaze B2, and Cloudflare R2 all work. Tempo stores trace blocks in object storage,
which gives effectively unlimited capacity at very low cost (B2 is about $6/TB/month).

Connecting Grafana to Tempo

grafana-datasources.yaml:

apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    isDefault: true
    jsonData:
      tracesToLogsV2:
        datasourceUid: 'loki'
        spanStartTimeShift: '-1m'
        spanEndTimeShift: '1m'
        tags: ['service.name', 'pod', 'container']
      serviceMap:
        datasourceUid: 'prometheus'
      nodeGraph:
        enabled: true

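Once everything is up, go to Grafana → Explore, select Tempo, and paste a trace ID to look it up directly,
or use Search to filter by service / operation / tag.
The datasourceUid values 'loki' and 'prometheus' must match the uid of Loki and Prometheus data sources
provisioned in the same Grafana; the compose file above starts neither, so add them before relying on those links.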
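
The same lookup also works outside Grafana, because Tempo serves traces over HTTP on its API port. The sketch below is a minimal illustration that assumes port 3200 is published on localhost as in the compose file; the helper names `trace_url` and `fetch_trace` are hypothetical.

```python
import json
import re
import urllib.request

TEMPO_URL = "http://localhost:3200"  # assumption: port 3200 published as in the compose file


def trace_url(trace_id: str, base: str = TEMPO_URL) -> str:
    """Build the Tempo URL for a single trace (trace IDs are hex, at most 32 chars)."""
    tid = trace_id.strip().lower()
    if not re.fullmatch(r"[0-9a-f]{1,32}", tid):
        raise ValueError(f"not a hex trace ID: {trace_id!r}")
    return f"{base.rstrip('/')}/api/traces/{tid}"


def fetch_trace(trace_id: str, base: str = TEMPO_URL, timeout: float = 5.0) -> dict:
    """GET the trace as JSON; urlopen raises HTTPError if Tempo answers 404."""
    req = urllib.request.Request(
        trace_url(trace_id, base), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_trace("<trace id copied from a log line>")` against a running Tempo is a quick way to check ingestion before Grafana is wired up.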

Sending traces from the application

As in the earlier OpenTelemetry post, configure the application to export OTLP to tempo:4317:

OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317 \
opentelemetry-instrument myapp

Tempo is fully OTLP-compatible, so there is no need to switch SDKs.
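
Since the receiver speaks plain OTLP, a hand-built span can be pushed without any SDK as a smoke test. The sketch below uses the OTLP/HTTP JSON encoding on port 4318; the service name, scope name, and the function names `build_otlp_payload` / `send_span` are illustrative assumptions, and it presumes port 4318 is reachable on localhost.

```python
import json
import secrets
import time
import urllib.request


def build_otlp_payload(service: str, span_name: str, duration_s: float = 0.05) -> dict:
    """Build a one-span OTLP/JSON export request with random IDs."""
    end_ns = time.time_ns()
    start_ns = end_ns - int(duration_s * 1e9)
    span = {
        "traceId": secrets.token_hex(16),  # 16 random bytes -> 32 hex chars
        "spanId": secrets.token_hex(8),    # 8 random bytes -> 16 hex chars
        "name": span_name,
        "kind": 1,                         # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
    }
    resource = {
        "attributes": [{"key": "service.name", "value": {"stringValue": service}}]
    }
    scope_spans = [{"scope": {"name": "manual-smoke-test"}, "spans": [span]}]
    return {"resourceSpans": [{"resource": resource, "scopeSpans": scope_spans}]}


def send_span(payload: dict, endpoint: str = "http://localhost:4318/v1/traces") -> int:
    """POST the payload to the OTLP/HTTP traces endpoint and return the HTTP status."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

The `traceId` in the payload can then be pasted into Grafana Explore to confirm the span arrived.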

TraceQL (a powerful query language)

Tempo 2.x supports TraceQL, which resembles LogQL but targets traces:

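# traces containing a span that took longer than 1s
{ duration > 1s }

# errors from a specific service (attributes need a scope such as resource. or span.)
{ resource.service.name = "api" && status = error }

# spans carrying a specific attribute
{ span.http.status_code = 500 }

# traces that contain both of two specific operations
{ name = "db.query" } && { name = "redis.set" }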

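This is far more powerful than the form-based filters in the Jaeger UI. Complex queries can pinpoint exactly which kind of call is slow or failing.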
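
TraceQL queries can also be run from scripts through Tempo's search endpoint, which is handy for ad-hoc reports. This is a minimal sketch assuming Tempo's API is reachable at localhost:3200; `search_url` and `search` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def search_url(traceql: str, base: str = "http://localhost:3200", limit: int = 20) -> str:
    """URL-encode a TraceQL query into a Tempo /api/search request URL."""
    query = urllib.parse.urlencode({"q": traceql, "limit": limit})
    return f"{base.rstrip('/')}/api/search?{query}"


def search(traceql: str, base: str = "http://localhost:3200", limit: int = 20) -> list:
    """Run the query and return the list of matching trace summaries."""
    with urllib.request.urlopen(search_url(traceql, base, limit), timeout=10) as resp:
        return json.load(resp).get("traces", [])
```

For example, `search('{ resource.service.name = "api" && status = error }')` would return summaries of recent failing traces for the `api` service, assuming such traces exist.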

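Linking with Loki: trace ↔ log

In the trace view, click a span to see the logs of that service for the same time window (pulled from Loki).
The reverse also works: if a Loki log line carries a trace_id, clicking it jumps straight to the trace view.

Have the OTel SDK inject trace_id into logs: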

import logging
from opentelemetry.instrumentation.logging import LoggingInstrumentor

LoggingInstrumentor().instrument(set_logging_format=True)

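Log records now carry trace_id / span_id automatically. For the log → trace jump, the Loki data source in Grafana needs a derived field that extracts trace_id and links it to the Tempo data source.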
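
To make the log → trace link concrete, the sketch below formats a log line with a `trace_id=` field and pulls the ID back out with a regex, which is the same kind of pattern a Loki derived field would use. It does not call the OTel instrumentor: `FakeOtelContext` is a hypothetical stand-in that stamps fixed IDs onto the `otelTraceID` / `otelSpanID` record attributes, and the format string and regex are assumptions to adapt to your real log format.

```python
import io
import logging
import re

# Assumed pattern for a derived field: a 32-hex-char ID after "trace_id="
TRACE_ID_RE = re.compile(r"trace_id=([0-9a-f]{32})")


class FakeOtelContext(logging.Filter):
    """Stand-in for the instrumentor: stamps fixed IDs on every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.otelTraceID = "5b8efff798038103d269b633813fc60c"
        record.otelSpanID = "eee19b7ec3c1b174"
        return True


def demo_log_line() -> str:
    """Emit one log record into a buffer and return the formatted line."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s [trace_id=%(otelTraceID)s span_id=%(otelSpanID)s] - %(message)s"
    ))
    logger = logging.getLogger("tempo-demo")
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addFilter(FakeOtelContext())
    logger.addHandler(handler)
    try:
        logger.info("order created")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().strip()


def extract_trace_id(line: str):
    """Return the trace ID found in a log line, or None if there is none."""
    match = TRACE_ID_RE.search(line)
    return match.group(1) if match else None
```

If the regex matches your real log lines here, the same expression is a reasonable starting point for the derived field's matcher.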

Linking with Prometheus: service graph + metrics

# tempo.yaml
metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      # Prometheus must be started with --web.enable-remote-write-receiver
      - url: http://prometheus:9090/api/v1/write

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]

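Tempo generates these from traces automatically:

traces_service_graph_request_total: service call relationships (how many times A calls B)
traces_spanmetrics_latency_bucket: latency distribution per endpoint

Once these metrics land in Prometheus, they can drive Grafana dashboards and alerts.
No metric code is needed in the application; the metrics are derived from traces.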
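
As an illustration of using the generated metrics, the sketch below builds a P99 latency query over `traces_spanmetrics_latency_bucket` and the matching Prometheus HTTP API URL. It assumes the span-metrics processor's default `service` and `span_name` labels are present and that Prometheus is reachable at `http://prometheus:9090`; the function names are hypothetical.

```python
import urllib.parse


def p99_latency_query(service: str, window: str = "5m") -> str:
    """PromQL for per-span-name P99 latency of one service."""
    return (
        "histogram_quantile(0.99, sum by (le, span_name) ("
        f'rate(traces_spanmetrics_latency_bucket{{service="{service}"}}[{window}])))'
    )


def prometheus_query_url(promql: str, base: str = "http://prometheus:9090") -> str:
    """Instant-query URL for the Prometheus HTTP API."""
    return f"{base.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
```

The string returned by `p99_latency_query("api")` can be pasted into a Grafana panel or an alert rule as is.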

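Resource usage comparison

Our production setup has 25 services and roughly 10k spans/s of trace volume:

	Jaeger + ES	Tempo + S3
Total RAM	~12 GB	~3 GB
Storage for 7 days	200 GB SSD ($30/month on EBS)	80 GB S3 ($2/month)
Maintenance burden	ES cluster ops	Tempo single binary
UI experience	Jaeger UI is OK	Unified in Grafana
TraceQL	❌	✅

Tempo is significantly cheaper and the experience is better.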
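
For anyone sizing their own deployment, the storage row reduces to simple arithmetic. The sketch below is a back-of-the-envelope estimator; the 10% sample ratio, the 130 bytes per stored span, and the $0.023 per GB-month price are illustrative assumptions, not measurements from the setup described here.

```python
SECONDS_PER_DAY = 86_400


def stored_gb(spans_per_sec: float, sample_ratio: float,
              bytes_per_span: float, retention_days: float) -> float:
    """Estimate steady-state stored volume in decimal GB."""
    total_bytes = (spans_per_sec * sample_ratio * bytes_per_span
                   * retention_days * SECONDS_PER_DAY)
    return total_bytes / 1e9


def monthly_cost_usd(gb: float, usd_per_gb_month: float) -> float:
    """Storage-only cost; request and egress fees are not included."""
    return gb * usd_per_gb_month


if __name__ == "__main__":
    # Hypothetical inputs: 10k spans/s, 10% kept, 130 B per stored span, 7 days
    gb = stored_gb(10_000, 0.10, 130, 7)
    print(f"{gb:.1f} GB, ${monthly_cost_usd(gb, 0.023):.2f}/month")
```

Plugging in your own span rate and a measured bytes-per-span figure from an existing bucket gives a more trustworthy number than any generic estimate.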

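Compared with alternatives

	Tempo	Jaeger	SigNoz	DataDog APM
Backend	Object storage	ES / Cassandra	ClickHouse	Commercial cloud
Self-hosted	✅ easy	✅ but ES is a hassle	✅ moderate	❌
Price	Very cheap	Expensive to maintain	Moderate	Expensive
Metric/log integration	Unified in Grafana	Scattered	Built in	Unified in DD

Tight budget → Tempo / Jaeger.
Want metric/log/trace one-stop → Grafana stack or DataDog (expensive but low-effort).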

Sampling

Sampling 100% of traces makes data volume explode. Recommended for production:

processors:
  probabilistic_sampler:
    sampling_percentage: 10

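This keeps roughly 10% of traces. A smarter option is tail sampling: keep all error and slow traces, plus 5% of the ordinary ones.

Configure sampling at the collection tier (OTel Collector); Tempo itself does no further sampling.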
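
The tail-sampling policy described above boils down to a small decision function. The sketch below illustrates that logic only and is not how the OTel Collector implements it; `TraceSummary`, `keep_trace`, and the 1000 ms slow threshold are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str       # hex string, e.g. 32 chars
    duration_ms: float  # end-to-end duration of the whole trace
    has_error: bool     # True if any span ended with error status


def keep_trace(t: TraceSummary, slow_ms: float = 1000.0, baseline_pct: float = 5.0) -> bool:
    """Keep every error trace, every slow trace, and baseline_pct% of the rest."""
    if t.has_error:
        return True
    if t.duration_ms >= slow_ms:
        return True
    # Deterministic bucket in 0..9999 derived from the trace ID, so every
    # collector instance makes the same decision for the same trace.
    bucket = int(t.trace_id, 16) % 10_000
    return bucket < baseline_pct * 100
```

The decision needs the whole trace (its total duration and error status), which is why it has to run after spans are collected rather than in the application at span start.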

Full stack diagram

Application (Python/Go/Node)
   ↓ OTLP
OTel Collector (sampling + routing)
   ↓ OTLP
Tempo (traces stored in S3)
   ↑ TraceQL
Grafana (UI) ← Loki (logs), Prometheus (metrics)

Three data stores (trace/log/metric), one UI (Grafana), fully open source.

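Gotchas

block_retention set wrong: 168h is 7 days; for 30 days write 720h.
Stick to h / m / s units: the value is parsed as a Go duration, which has no d unit.
Frequent object-storage list calls: Tempo polls the bucket on a timer to discover new blocks,
and B2 / R2 bill for those API calls. Raise storage.trace.blocklist_poll
to reduce list frequency.
WAL corruption: sudden power loss on a single node → corrupted WAL → Tempo fails to start.
rm -rf /var/tempo/wal/* drops the few seconds of data that were being ingested, after which the restart succeeds.
In production, use a PV / persistent disk.
Multi-tenancy not configured: the default is single-tenant; for multiple teams sharing one Tempo,
set multitenancy_enabled: true and send the X-Scope-OrgID header.
Slow search: finding traces by attribute relies on search, which was slow in older versions.
2.0+ improved this significantly; run the latest release.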
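
Two of these gotchas are easy to guard against in tooling. The sketch below converts a retention in days to the hour form used in this post, and attaches the tenant header to a request against a multi-tenant Tempo; `retention` and `tenant_request` are hypothetical helpers, and the tenant name is a placeholder.

```python
import urllib.request


def retention(days: int) -> str:
    """Express a retention period in hours, e.g. 7 days -> '168h'."""
    if days <= 0:
        raise ValueError("retention must be at least one day")
    return f"{days * 24}h"


def tenant_request(url: str, tenant: str) -> urllib.request.Request:
    """Build a request carrying the X-Scope-OrgID tenant header."""
    return urllib.request.Request(url, headers={"X-Scope-OrgID": tenant})
```

For example, `retention(30)` yields the value to put in block_retention, and `tenant_request("http://localhost:3200/api/search", "team-a")` can be passed to `urllib.request.urlopen`.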

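How the switch actually felt

Moving from Jaeger + ES to Tempo + S3 took one week (running both in parallel, then cutting traffic over):

9 GB of memory freed
Storage cost down 15x
The unified Grafana flow of "look at a metric → jump to the trace → jump to the logs" is smooth
TraceQL surfaces "the traces behind a high P99 latency" with a one-line query

For new projects, I strongly recommend the Tempo + Grafana stack over Jaeger.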