3 posts under «Computer Science / Distributed & Cloud Computing / Observability»

## Background

Distributed tracing tools:

- **Jaeger**: the classic; self-hosted; stores to Cassandra / ES
- **Zipkin**: older still
- **Tempo** (Grafana Labs): the newer option; stores to S3

Trace volume is large (many spans per request, TB/day), and keeping it in Cassandra / ES is expensive.

Tempo is designed as an "object storage native trace store", similar to what Loki is for logs.

## Tempo characteristics

- Stores trace blobs in S3 / GCS (very cheap)
- Indexes only the trace ID, not attributes → there is no full-text index over service / tag; attribute search goes through TraceQL, which scans blocks instead
- Query pattern: use metrics to find the time window → get a trace ID → look up the trace

The metric / log / trace workflow:

```
metric (Prometheus / Mimir)
   ↓ spot the spike window
log (Loki)
   ↓ find the trace_id
trace (Tempo)
   ↓ inspect the trace in detail
```

## Install

```yaml
# docker-compose
services:
  tempo:
    image: grafana/tempo:2.5.0
    command: ['-config.file=/etc/tempo.yaml']
    ports:
      - 3200:3200   # HTTP
      - 4317:4317   # OTLP gRPC
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo

volumes:
  tempo-data: {}    # named volumes must be declared at the top level
```

`tempo.yaml`:

```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: s3
    s3:
      bucket: my-traces
      endpoint: s3.amazonaws.com
      region: us-east-1

compactor:
  compaction:
    block_retention: 720h   # 30 days
```

## Ingest

The application's OTEL SDK sends traces to Tempo directly (or via otel-collector → Tempo):

```python
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# insecure=True: plaintext gRPC; without it the exporter attempts TLS
exporter = OTLPSpanExporter(endpoint='tempo:4317', insecure=True)
```

Routing through otel-collector is the recommended setup for production.

## Querying from Grafana

Add a Tempo data source in Grafana:

```
URL: http://tempo:3200
```

Explore → Tempo → "Search" tab:

- search by trace ID (direct lookup)
- search by service name + tag (TraceQL)
- duration / status filter

```traceql
{ resource.service.name = "api" && duration > 500ms && status = error }
```

This returns a list of traces → click one to see the span tree.

## TraceQL (Tempo query language)

Think PromQL, but for traces:

```traceql
# traces from service=api within the time window
{ resource.service.name = "api" }

# slow traces
{ duration > 1s }

# error traces
{ status = error }

# relationships across spans
{ name = "GET /api/users" } >> { name = "db.query" && duration > 100ms }
# ancestor span is "GET /api/users" and a descendant span is a slow DB query
# (use > instead of >> to require a direct child)
```

Powerful, and it needs no full-text index.

## Correlating with metrics

Grafana panel:

```
metric: rate(http_requests{status="500"}[5m])
   ↓ click data point
trace search: status=error around timestamp
   ↓
list of error trace IDs
```

One click takes you from a metric spike to concrete traces. A great debugging tool.

## Correlating with logs

Write the trace_id into every log line:

```python
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Injects otelTraceID / otelSpanID into every log record
# (package: opentelemetry-instrumentation-logging)
LoggingInstrumentor().instrument()

handler = logging.StreamHandler()
formatter = logging.Formatter(
    '%(asctime)s [%(levelname)s] [trace_id=%(otelTraceID)s] %(message)s'
)
handler.setFormatter(formatter)
logging.getLogger().addHandler(handler)
```

Query the logs in Loki:

```logql
{service="api"} |= "error"
# shows trace_id=abc123
```

Grafana recognizes the trace_id automatically → click it to jump to Tempo and pull the trace.

## Storage cost comparison

| | Jaeger (ES) | Tempo (S3) |
|---|---|---|
| 1 TB trace/day | ~$2000/month (ES cluster) | ~$50/month (S3) |
| Query latency | < 100ms | ~1s (fetching from S3) |
| Index flexibility | any attribute | trace ID + some attributes |

Tempo costs 1-2 orders of magnitude less.

Trade-off: slower queries (S3 fetch + scan) and weaker indexing.

## Sampling

Trace volume is large → you don't store everything. Two kinds of sampling:

- **head sampling**: decided in the application (e.g. 1%)
- **tail sampling**: the collector buffers the whole trace, then decides (e.g. 100% of errors + 1% of normal traces)

Tail sampling is smarter but costs collector resources (otel-collector `tail_sampling` processor).

## Comparison with Jaeger

| | Jaeger | Tempo |
|---|---|---|
| Storage | Cassandra / ES / Memory | S3 / GCS |
| Query power | strong (fully indexed) | medium (trace ID) |
| Cost | high | low |
| Integration | Jaeger UI | Grafana |
| Deployment | multiple components | single binary |

Jaeger fits: low volume + heavy ad-hoc querying.

Tempo fits: high volume + you accept metric/log-driven trace lookup.

## OpenTelemetry → backend-agnostic

As long as the application uses the OTEL SDK, the backend is swappable:

- Jaeger
- Tempo
- Honeycomb (SaaS)
- Datadog APM (SaaS)
- Splunk

No code changes.

## Real deployment

Our prod:

- 100 microservices, 100k QPS
- 50 GB trace/day (10% sampling)
- Tempo + S3 backend
- 30-day retention
- Grafana as the main entry point

Cost:

- S3: 50 GB × 30 days × $0.023/GB = $35/month
- Tempo compute (2 small instances): $30/month
- Total: < $100/month

Jaeger + ES equivalent: > $1000/month.

Experience trade-offs:

- queries take 1-3s (vs. Jaeger < 200ms)
- you must know the trace_id or run a cross-service search (no full-text search over trace content)

Acceptable.

## Metrics generator (Tempo extra)

Tempo can generate metrics from traces:

```yaml
metrics_generator:
  registry:
    external_labels:
      source: tempo
  processor:
    service_graphs:
    span_metrics:
```

Generated automatically:

- `traces_spanmetrics_calls_total{service, operation}` (call rate)
- `traces_spanmetrics_latency_*` (latency histogram)
- service graph (which service calls which)

This replaces a separate RED metric exporter: the metrics are derived from traces.

## Comparison with Datadog APM

| | Tempo | Datadog APM |
|---|---|---|
| Cost | $100/month | $1000-10000/month |
| Deployment | self-hosted | SaaS |
| UX | decent (Grafana) | excellent |
| Integration | OTEL + your own stack | full Datadog ecosystem |

Big budget + want batteries included → Datadog. Budget-sensitive / already on Grafana → Tempo.

## Pitfalls we hit

1. **S3 cost spike**: PUT requests are billed, and frequent small trace writes add up fast → have the compactor merge blocks.
2. **Query timeouts**: searching across a large time window → a lot pulled from S3 → timeout. Narrow the window or use trace ID lookup.
3. **OTEL version compatibility**: the OTLP schema changed on a Tempo upgrade → traces from older SDKs were rejected. Pin SDK and Tempo versions.
4. **Head sampling drops key traces**: the random 1% missed the error traces → nothing to look at. Tail sampling fixes this.
5. **Trace too large**: a single trace with thousands of spans → slow UI. Keep the span count per trace under control.
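The log-to-trace hop described in this post (find the `trace_id` in a log line, then look the trace up in Tempo) can be sketched in a few lines. This is a minimal illustration, not what Grafana does internally. It assumes log lines in the `trace_id=<32 hex chars>` format used by the logging formatter earlier in the post and a Tempo instance reachable at `http://tempo:3200`; the helper names `extract_trace_id` and `tempo_trace_url` are hypothetical. `/api/traces/<id>` is Tempo's trace-by-ID HTTP endpoint.

```python
import re
from typing import Optional

# Assumption: log lines carry "trace_id=<32 lowercase hex chars>",
# as produced by the formatter shown in the post.
TRACE_ID_RE = re.compile(r"trace_id=([0-9a-f]{32})\b")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID embedded in a log line, or None if there is none."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(1) if match else None


def tempo_trace_url(trace_id: str, base_url: str = "http://tempo:3200") -> str:
    """Build the URL of Tempo's trace-by-ID lookup (base_url is an assumption)."""
    return f"{base_url}/api/traces/{trace_id}"


if __name__ == "__main__":
    line = ("2026-05-13 23:02:11 [ERROR] "
            "[trace_id=4bf92f3577b34da6a3ce929d0e0e4736] payment failed")
    trace_id = extract_trace_id(line)
    if trace_id is not None:
        print(tempo_trace_url(trace_id))
```

Log lines written outside any span carry a placeholder ID rather than 32 hex characters, so they simply yield `None` here instead of producing a lookup that can never succeed.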

Ops Field Notes editorial team (official) @ops_lab · 2026-05-13 23:02

OpenTelemetry Collector: one collector for traces / metrics / logs

## Background

The 3 pillars of observability:

- **metric**: Prometheus + node_exporter / app exporter
- **trace**: Jaeger / Tempo / Zipkin
- **log**: Loki / ELK

Each data type comes with its own collector: promtail / vector / fluentd / filebeat / otel-trace and so on. The application has to carry N SDKs and ops has to run N agents.

**OpenTelemetry Collector** unifies this: one binary receives all 3 kinds of data and forwards them to the backends. The application uses a single OTEL SDK and the agent speaks a single protocol.

## Install

```bash
# The contrib image reads its config from /etc/otelcol-contrib/config.yaml
docker run -d -p 4317:4317 -p 4318:4318 \
  -v $(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml \
  otel/opentelemetry-collector-contrib:latest
```

`config.yaml`:

```yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  prometheus:
    config:
      scrape_configs:
        - job_name: 'apps'
          static_configs:
            - targets: ['app:8080']

processors:
  batch:
    timeout: 10s

exporters:
  otlphttp/jaeger:
    endpoint: http://jaeger:4318
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```

The pipeline is receiver → processor → exporter.

## Application setup (Python)

```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install   # installs instrumentation packages automatically
```

```bash
# start the application
opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --logs_exporter otlp \
  --service_name myapp \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py
```

Or in code:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    # insecure=True: plaintext gRPC; without it the exporter attempts TLS
    BatchSpanProcessor(OTLPSpanExporter(endpoint='otel-collector:4317', insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span('process_order'):
    # ...
    pass
```

Auto-instrumentation traces Django / requests / SQLAlchemy and others automatically.

## Cross-language

JS / Java / Go / Ruby / .NET all have OTEL SDKs. They all send OTLP to the same collector → the backend aggregates.

## Processors

```yaml
processors:
  attributes:
    actions:
      - key: env
        value: production
        action: insert
      - key: password   # sensitive field
        action: delete

  filter:
    traces:
      span:
        - 'attributes["http.url"] == "/health"'   # drop health-check spans

  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
```

- attributes: add / remove tags
- filter: drop noisy spans
- tail sampling: 1% sample + 100% of errors + 100% of slow requests → saves backend storage

## Tail vs. head sampling

Head sampling: the decision is made when the trace starts (in the application).

Tail sampling: the decision is made after the trace completes, looking at the whole trace (in the collector).

Tail advantage: the decision is based on the outcome (keep all errors and slow traces, 1% of normal ones). Downside: the collector has to buffer every trace for a few seconds.

## Deployment modes

```
agent (DaemonSet, one per node) → gateway (Deployment, cluster-level) → backend
```

- agent: one per node; applications connect locally, which cuts network hops
- gateway: central processing (sampling / batching / fan-out to multiple backends)

Or a single tier: application → collector → backend (small clusters).

## k8s deployment (operator)

```bash
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install opentelemetry-operator open-telemetry/opentelemetry-operator
```

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: gateway
spec:
  mode: deployment
  replicas: 3
  config:   # structured YAML in v1beta1 (v1alpha1 took a string: `config: |`)
    receivers: ...
    processors: ...
    exporters: ...
```

The operator manages the deployment and config reloads.

## Auto-instrumentation (k8s)

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instr
spec:
  exporter:
    # Python auto-instrumentation exports OTLP over HTTP by default, hence 4318
    endpoint: http://otel-gateway:4318
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
```

```yaml
# pod annotation
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-python: "true"
```

The operator injects a sidecar / init container automatically → the application gets traces / metrics with zero code changes.

## Metric processing

```yaml
processors:
  metricstransform:
    transforms:
      - include: http.server.duration
        action: update
        new_name: http_request_duration_seconds
```

Renames OTEL metrics to Prometheus-style names → stays compatible with older Grafana dashboards.

## Comparison with Vector

Vector (open source, maintained by Datadog) is another general-purpose telemetry pipeline:

| | OTEL Collector | Vector |
|---|---|---|
| Standard | OTEL (CNCF) | Datadog's own |
| log | ✅ | ✅ strong |
| metric | ✅ | ✅ |
| trace | ✅ strong | weak |
| Config | YAML | TOML / YAML |
| Performance | medium | very high (Rust) |

Trace-heavy → OTEL.

Log / metric pipeline + maximum performance → Vector.

I use OTEL for the standardization benefits and because it spans multiple backends.

## Versus fluentd / fluent-bit

fluent-bit / fluentd are mainly log shippers; their metric / trace support is weak. New projects go all-in on OTEL. Older ELK projects may well stay on fluent.

## Real case

Designing observability for a new project from scratch:

```
application (Python / Go / TS) + OTEL SDK auto-instrument
   ↓ OTLP gRPC
otel-collector (DaemonSet, one per node)
   ↓ OTLP
otel-collector (gateway, 3 replicas)
   ↓
   ├─ trace → Tempo
   ├─ metric → Mimir
   └─ log → Loki
   ↓
Grafana, one place to view everything
```

One SDK + one collector → 3 kinds of data → viewed in one place in Grafana (trace ID links logs and metrics).

Trace ID correlation is the killer feature: see the error in a trace → pull logs with the same trace ID → check the time window of the metric spike. Debugging gets much faster.

## Pitfalls we hit

1. **OTLP gRPC vs. HTTP**: 4317 is gRPC by default, 4318 is HTTP. A client pointed at the wrong port fails.
2. **Batch processor too large**: oversized batches mean higher latency and OOM risk. Tune `send_batch_size: 8192`.
3. **Tail sampling memory**: at high QPS, buffering a few seconds of traces takes several GB of RAM. Run the gateway as its own deployment with plenty of memory.
4. **Auto-instrumentation overhead**: with some frameworks, instrumenting everything pushes P99 up. Disable what doesn't matter (health check / metrics endpoints).
5. **Missing env tags**: dev / staging / prod data gets mixed → hard to tell apart. Force the tag, e.g. `OTEL_RESOURCE_ATTRIBUTES=env=prod` on the SDK or the collector's `resource` processor.
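The tail sampling policy from this post (keep every error, keep every slow trace, keep about 1% of the rest) boils down to a small decision function evaluated once the whole trace has been buffered. The sketch below is a simplified model of that decision, not the collector's implementation: the `Trace` dataclass, the `keep_trace` helper and the trace-ID bucketing used for the probabilistic part are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical summary of a fully buffered trace."""
    trace_id: str        # 32 hex characters
    duration_ms: float   # end-to-end duration of the whole trace
    has_error: bool      # True if any span ended with status ERROR


def keep_trace(trace: Trace,
               latency_threshold_ms: float = 1000.0,
               sample_pct: float = 1.0) -> bool:
    """Decide after the fact whether a trace is worth storing."""
    # Policy 1: always keep traces that contain an error.
    if trace.has_error:
        return True
    # Policy 2: always keep slow traces.
    if trace.duration_ms >= latency_threshold_ms:
        return True
    # Policy 3: keep roughly sample_pct percent of everything else.
    # Assumption: bucket on the trace ID so the decision is deterministic
    # (the same trace always gets the same answer).
    bucket = int(trace.trace_id[:8], 16) % 10_000
    return bucket < sample_pct * 100


if __name__ == "__main__":
    traces = [
        Trace("f" * 32, 20.0, True),     # error
        Trace("f" * 32, 1500.0, False),  # slow
        Trace("f" * 32, 20.0, False),    # normal, falls outside the 1% bucket
    ]
    for t in traces:
        print(t.duration_ms, t.has_error, keep_trace(t))
```

The function needs the finished trace as input, which is why the collector has to hold every trace in memory for `decision_wait` before it can decide, and where the RAM cost mentioned in the pitfalls comes from.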

Ops Field Notes editorial team (official) @ops_lab · 2026-05-08 06:52

PromQL recording rules: precompute and cache expensive queries

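## Background

A Grafana dashboard has 30+ panels, each with its own PromQL query. Opening the dashboard once makes Prometheus run 30 queries → CPU spikes → slow.

Some of the queries are complex:

```promql
histogram_quantile(0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
```

Every run scans millions of samples and then computes the percentile. Slow, and repeated on every load.

**Recording rules**: evaluate the expensive query periodically ahead of time → store the result as a new metric → the dashboard queries the new metric → fast.

## Configuring a recording rule

`/etc/prometheus/rules/api.yml`:

```yaml
groups:
  - name: api_recording
    interval: 30s
    rules:
      - record: api:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
      - record: api:http_requests:rate5m
        expr: |
          sum by (service, status) (rate(http_requests_total[5m]))
```

```yaml
# prometheus.yml
rule_files:
  - 'rules/*.yml'
```

Reload Prometheus → the rule is evaluated every 30 seconds → the result is stored as the metric `api:http_request_duration_seconds:p95`.

Change the dashboard query to:

```promql
api:http_request_duration_seconds:p95
```

It returns immediately (already computed).

## Naming convention

```
<level>:<metric>:<aggregation>
```

Examples:

- `api:http_requests:rate5m`
- `node:cpu:usage_pct`
- `cluster:pod_count`

This avoids collisions with raw metric names and makes it obvious at a glance that the series comes from a recording rule.

## When to use a recording rule

Good fit:

- complex queries (histogram_quantile / multiple joins)
- queries a dashboard runs frequently
- the same query used by several alerts
- cross-metric calculations (A + B / C and the like)

Poor fit:

- simple queries (ratio / count)
- one-off ad-hoc queries
- high cardinality (the recorded series blow up storage)

## Alerting rules

Similar syntax:

```yaml
- alert: HighErrorRate
  expr: api:http_requests:rate5m{status=~"5.."} > 10
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "High 5xx rate on {{ $labels.service }}"
```

`expr` uses the recording rule's result → alert evaluation is fast too.

## Performance numbers

One of our clusters, 100 nodes / 50 services. Opening the dashboard (30 panels):

| | latency | Prometheus CPU |
|---|---|---|
| no recording rules | 8s | 80% |
| all recording rules | 0.5s | 20% |

Dashboard data may lag slightly (recording interval 30s).

Trade-off: freshness vs. performance.

## external_labels

```yaml
# prometheus.yml
global:
  external_labels:
    cluster: prod
    region: us-east
```

External labels are attached when data leaves this Prometheus (federation, remote write, alerts), so recording rule results arrive with cluster / region labels → multi-cluster federation.

## Federation (multi-cluster aggregation)

Several Prometheus servers scrape from each other:

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true            # keep the labels set by the source servers
    metrics_path: '/federate'
    params:
      match[]:
        - '{__name__=~".+:.+"}'   # names with a colon, i.e. recording rules only
    static_configs:
      - targets:
          - 'prom-us-east:9090'
          - 'prom-eu-west:9090'
```

The central Prometheus pulls the recording rules of every region → global dashboard.

Pulling only recording rules (not raw metrics) → small data volume, standardized names.

## Comparison with Mimir / VictoriaMetrics

Classic single-node Prometheus:

- data stored locally, hundreds of millions of samples in memory
- recording rules speed things up, but the single node is still the bottleneck

Grafana Mimir / VictoriaMetrics are Prometheus-compatible distributed TSDBs:

- multi-node storage and querying
- long-term retention (years)
- built-in recording rules

At large scale you need one of them. At small to medium scale, single-node Prometheus + recording rules is enough.

## Real case: dashboard optimization

A customer's SRE dashboard:

- 10 services × 4 SLIs (latency p50/p95/p99 + error rate) = 40 panels
- took 15 seconds to open
- periodic Prometheus CPU spikes

Optimization:

1. Create recording rules that compute each SLI metric
2. Switch the dashboard queries to the recording rules
3. Align the recording rule interval with the dashboard auto-refresh (30s)

Result:

- dashboard opens in 1 second
- average Prometheus CPU down 50%
- alert evaluation sped up as well

## Comparison with cortex / thanos

| | self-hosted Prometheus | Mimir | Thanos | VictoriaMetrics |
|---|---|---|---|---|
| Deployment | simple | complex | complex | medium |
| Long-term storage | weak | S3 | S3 | local/S3 |
| Multi-cluster | federation | native | native | native |
| Performance | single node | scales horizontally | scales horizontally | high |

Small and medium projects: self-hosted Prometheus + recording rules + remote write for backup.

More than a few hundred million series → Mimir / VictoriaMetrics.

## Subquery (Prometheus 2.7+)

```promql
max_over_time(rate(http_requests_total[5m])[1h:1m])
```

`[1h:1m]` = over a 1h window, evaluate the inner `rate` every 1m, then take the max of those values. Complex but powerful. Precomputing the inner expression with a recording rule is friendlier.

## Debugging missing data

```promql
# check the raw metric
http_request_duration_seconds_bucket

# check the recording rule result
api:http_request_duration_seconds:p95
```

Query both directly in Grafana Explore.

Check rule status via `/api/v1/rules`:

```bash
curl http://prom:9090/api/v1/rules
```

`health: ok / err / unknown` shows whether the rule is being evaluated.

## Pitfalls we hit

1. **Rule cycles**: rule A depends on rule B depends on rule A → errors and empty data. Enforce strict layers: raw → level 1 → level 2, one direction only.
2. **Label cardinality explosion**: a recording rule that keeps a high-cardinality label → the new metric has millions of series → TSDB OOM. Aggregate the label away with `sum without`.
3. **Interval too short**: a rule with a 1s interval runs more often than the raw scrape → it adds load instead of removing it. 30s-1m is reasonable.
4. **Alerts on stale data**: recording rules evaluate slowly → alerts use old values → real spikes get missed. Monitor `prometheus_rule_evaluation_duration_seconds`.
5. **Federation label mix-ups**: without `honor_labels: true` the central Prometheus rewrites conflicting labels from the source servers (`job`, `instance`) → hard to tell which region a series came from. Set `honor_labels: true` on the federation job.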
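To make concrete what the p95 recording rule precomputes, here is a rough Python sketch of the interpolation that `histogram_quantile` performs over cumulative `le` buckets. It is a simplified model for illustration, not Prometheus's actual code: it assumes a single series, buckets given as `(upper bound, cumulative count)` pairs ending with the `+Inf` bucket, and a lower bound of 0 for the first bucket. The bucket values in the example are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (le upper bound, cumulative count) pairs; the last bound is +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile falls into the +Inf bucket: the best available
                # answer is the highest finite bound.
                return prev_le
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return math.nan


if __name__ == "__main__":
    # Made-up request durations: 50 under 0.1s, 90 under 0.5s, 100 under 1s.
    example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, example))
```

In a real query this runs per `service` over the per-bucket rates of every matching series, which is why evaluating it once every 30s in a rule is much cheaper than evaluating it for every panel on every dashboard load.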

Ops Field Notes editorial team (official) @ops_lab · 2026-05-04 00:20