
Redis Streams vs NATS JetStream: choosing a lightweight message queue

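## Background

We needed an asynchronous message queue for:

- user uploads an image → compress it in the background
- sending email / SMS
- running data exports

Kafka is too heavy (ZooKeeper in the old days; KRaft today is still complex). RabbitMQ installs fine but is still on the heavy side for a small project. Redis Streams and NATS JetStream are the lightweight alternatives. Below: a comparison of the two, plus usage.

## Redis Streams

A stream data structure built into Redis 5.0+. It resembles a Kafka topic, but lives on a single instance.

### Install

Redis 7.x ships with it by default:

```bash
sudo apt install -y redis
redis-cli --version
```

### Producer

```python
import redis

r = redis.Redis()

# Append one message to the stream "tasks"
r.xadd('tasks', {'type': 'send_email', 'to': 'user@example.com', 'subject': 'hi'})
# Returns the message ID, e.g. b'1716543210000-0'
```

The ID is generated automatically as `timestamp-seq`.

### Consumer Group

```python
import logging

log = logging.getLogger(__name__)

# Create the consumer group (it holds the consumption state shared by all workers in the group).
# Raises a BUSYGROUP error if the group already exists.
r.xgroup_create('tasks', 'email-workers', id='0', mkstream=True)

# Worker loop: pull messages
while True:
    msgs = r.xreadgroup(
        groupname='email-workers',
        consumername='worker-1',
        streams={'tasks': '>'},  # '>' means messages never delivered to this group
        count=10,
        block=5000,  # block for up to 5 s when there is nothing to read
    )
    for stream_name, messages in msgs:
        for msg_id, data in messages:
            try:
                send_email(data[b'to'], data[b'subject'])  # your own business function
                # ack: confirm the message is done
                r.xack('tasks', 'email-workers', msg_id)
            except Exception:
                # no ack: the message stays in the pending list and can be reclaimed or retried
                log.exception('failed')
```

Properties:

- **at-least-once delivery** (confirmed by ack; unacked messages can be reclaimed)
- **multiple consumers share one group**: each message goes to one of the workers reading from the group
- **persistence**: written to disk (when AOF is on), and the full message history can be re-read

### Maintenance: MAXLEN to cap growth

```python
# {...} stands for the message fields
r.xadd('tasks', {...}, maxlen=100000, approximate=True)
# Beyond ~100k entries the oldest are trimmed automatically (approximate trimming is cheaper)
```

Without it the stream grows without bound.

### Pending list (failed processing / dead worker)

```python
# Messages that worker-1 fetched but never acked
pending = r.xpending_range('tasks', 'email-workers',
                           min='-', max='+', count=100,
                           consumername='worker-1')
stuck_ids = [p['message_id'] for p in pending]

# Messages unacked for more than 60 s (worker died) → reclaim them for worker-2
r.xclaim('tasks', 'email-workers', 'worker-2',
         min_idle_time=60000, message_ids=stuck_ids)
```

Mature retry and dead-letter handling are things you have to wrap yourself.

## NATS JetStream

NATS is a lightweight messaging system written in Go; JetStream is its persistence layer (v2.2+).

### Install

```bash
# Check the releases page for the exact asset name of the release you want
curl -L https://github.com/nats-io/nats-server/releases/latest/download/nats-server-linux-amd64.tar.gz \
  | sudo tar xz -C /usr/local/bin --strip-components=1 nats-server-*/nats-server

# Start with JetStream enabled
nats-server -js
```

Docker:

```bash
docker run -d -p 4222:4222 nats:latest -js
```

### Install the client

```bash
uv add nats-py
```

### Producer

```python
import asyncio
import nats

async def main():
    nc = await nats.connect('nats://localhost:4222')
    js = nc.jetstream()

    # Create the stream (one-off)
    await js.add_stream(
        name='TASKS',
        subjects=['tasks.>'],    # accept every tasks.X subject
        retention='workqueue',   # work-queue mode (a message is deleted once consumed)
        max_msgs=100000,
    )

    # Publish a message
    await js.publish('tasks.email', b'{"to":"user@example.com"}')
    await nc.close()

asyncio.run(main())
```

### Consumer

```python
import asyncio
import json

import nats
from nats.js.api import ConsumerConfig

async def worker():
    nc = await nats.connect('nats://localhost:4222')
    js = nc.jetstream()

    # Create a durable pull consumer
    psub = await js.pull_subscribe(
        subject='tasks.email',
        durable='email-workers',
        config=ConsumerConfig(
            ack_policy='explicit',
            max_deliver=3,   # deliver at most 3 times
            ack_wait=30,     # must ack within 30 s
        ),
    )

    while True:
        try:
            msgs = await psub.fetch(batch=10, timeout=5)
            for msg in msgs:
                try:
                    await send_email(json.loads(msg.data))  # your own business coroutine
                    await msg.ack()
                except Exception:
                    await msg.nak()  # ask for immediate redelivery
        except nats.errors.TimeoutError:
            continue

asyncio.run(worker())
```

Properties:

- **at-least-once / exactly-once (with dedup)**
- **several consumption modes**: work queue / fanout / replay
- **multi-level subject routing**: `tasks.>` / `orders.created.*`
- **clustering / replicas**: 3-node Raft built in

### Replicas + HA

```yaml
# nats.conf
jetstream {
  store_dir: /var/lib/nats
}

cluster {
  name: my-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats-route://node1:6222,
    nats-route://node2:6222,
    nats-route://node3:6222,
  ]
}
```

```python
await js.add_stream(name='TASKS', subjects=['tasks.>'], num_replicas=3)
```

The 3 nodes sync automatically over Raft. If one node dies, the remaining 2 keep working.

**Redis Streams has nothing equivalent** (you need Sentinel plus third-party extensions).

## Comparison table

| | Redis Streams | NATS JetStream |
|---|---|---|
| Learning curve | Low (Redis is an old friend) | Medium |
| Clustering / HA | Weak (needs Sentinel) | Built-in Raft |
| Throughput | Medium (10k-50k msg/s) | High (100k+ msg/s) |
| Persistence | AOF / RDB | Log files |
| Client languages | Very many | Fairly many |
| Resource usage | Medium (Redis process) | Low (single Go binary) |
| Subject routing | None (by stream name) | Multi-level wildcards |
| Cross-cluster federation | None | Leaf nodes |
| Best fit | Simple work queue | Complex routing + replay |

### Redis Streams fits when

- you already run Redis and don't want a new component
- a single instance is fine, QPS < 10k
- you need a simple work queue / event log
- the team knows Redis

### JetStream fits when

- you need HA / replicas
- you need several consumption modes (the same message read by multiple groups)
- subject-based routing gets complex
- you need high throughput (> 10k msg/s)
- you don't want to operate Kafka but need similar capabilities

## Compared with Kafka

Kafka suits very large scale (millions of msg/s) + long-term storage + many consumer groups + stream processing (Kafka Streams / Flink).

JetStream and Redis Streams are both "simplified Kafka", and that is enough for 80% of use cases.

## Real-world case

We are a small startup:

- 1k-10k msg/s
- work queues: image processing / email / API calls / data ETL
- no appetite for operating Kafka

We chose **NATS JetStream**:

- single-binary deployment (done in 5 minutes)
- a 3-node cluster has run for 6 months with zero downtime
- subject routing makes "split by business type" natural (`tasks.email.*`, `tasks.image.*`)
- the Go client performs very well (our Go services connect directly)

Compared with our earlier Celery + RabbitMQ setup:

- about 1/3 of the resource usage
- about 5x the throughput
- noticeably lower maintenance cost

That said, Celery's ecosystem (task discovery / chain / chord / scheduled tasks) is richer, and it remains a reasonable choice for Python-heavy projects.

## Compared with Celery / Sidekiq / dramatiq

Those are task-queue frameworks (with retry / scheduling / monitoring); their underlying broker is Redis or RabbitMQ.

Redis Streams and JetStream sit at the broker level. If you want the "task framework experience", put a framework on top of a broker (e.g. dramatiq + Redis).

```bash
uv add 'dramatiq[redis]'
```

```python
import dramatiq
from dramatiq.brokers.redis import RedisBroker

# dramatiq defaults to RabbitMQ, so select the Redis broker explicitly
dramatiq.set_broker(RedisBroker(host='localhost'))

@dramatiq.actor(max_retries=3, queue_name='email')
def send_email(to, subject, body):
    smtp.send(to, subject, body)  # smtp: your own mail client

# Business code
send_email.send('user@example.com', 'Welcome', '...')
```

Usage is similar to Celery with roughly a third of the code. Note that dramatiq's Redis broker is built on ordinary Redis data structures, not on Redis Streams.

## Monitoring

### Redis

```bash
redis-cli xinfo stream tasks
# length / first-entry / last-entry / consumer group details

# Keep an eye on the backlog
watch -n 5 'redis-cli xlen tasks'
```

The Prometheus redis_exporter exposes stream length / lag.

### NATS

```bash
nats stream info TASKS
nats consumer info TASKS email-workers
# shows pending / ack pending / lag
```

Use the `nats` CLI together with prometheus-nats-exporter.

## Pitfalls

### Redis Streams

1. **No MAXLEN → unbounded growth**: after a few hundred thousand messages in an hour, Redis runs out of memory. Always use `xadd ... MAXLEN ~ 100000`.
2. **Duplicate consumer names**: two processes both using `consumer-1` → the pending list becomes a mess. Give every worker its own name (`worker-${hostname}-${pid}`).
3. **Claiming without min_idle_time**: you steal messages that have not timed out yet → duplicate processing. Use 30-60 s in production.

### JetStream

4. **Changing config after the stream exists**: some settings (subject / retention) require delete + recreate, which loses messages. Think it through up front.
5. **Pull vs push subscription**: pull is simple and controllable; push needs the client to stay connected. Beginners should use pull.
6. **JetStream domain / account isolation**: in multi-tenant setups a wrong permission lets tenants see each other. Small projects can stay on the default.

## Summary

Simple "event stream" → Redis Streams (if you already run Redis). "Kafka-like but lighter" → NATS JetStream. Maximum simplicity + Python → Celery + Redis broker. Truly large scale → Kafka.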
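
The Redis Streams section notes that retry and dead-letter handling have to be wrapped by hand, and pitfall 3 warns against claiming messages that have not been idle long enough. The sketch below shows one way that decision could look. It is an illustration under assumptions, not a library feature: the `triage_pending` helper and the `max_deliveries` threshold are hypothetical, and the entry dicts mirror the keys redis-py returns from `xpending_range` (`message_id`, `time_since_delivered`, `times_delivered`).

```python
from typing import Dict, List, Tuple


def triage_pending(
    entries: List[Dict],
    min_idle_ms: int = 60_000,
    max_deliveries: int = 3,
) -> Tuple[List[bytes], List[bytes]]:
    """Split pending entries into (ids_to_reclaim, ids_to_dead_letter).

    Hypothetical helper: `entries` is shaped like redis-py's xpending_range output.
    """
    reclaim: List[bytes] = []
    dead: List[bytes] = []
    for entry in entries:
        # Not idle long enough: the original consumer may still be working on it.
        if entry["time_since_delivered"] < min_idle_ms:
            continue
        # Delivered too many times already: park it instead of retrying forever.
        if entry["times_delivered"] >= max_deliveries:
            dead.append(entry["message_id"])
        else:
            reclaim.append(entry["message_id"])
    return reclaim, dead


if __name__ == "__main__":
    sample = [
        {"message_id": b"1-0", "time_since_delivered": 5_000, "times_delivered": 1},
        {"message_id": b"2-0", "time_since_delivered": 90_000, "times_delivered": 1},
        {"message_id": b"3-0", "time_since_delivered": 120_000, "times_delivered": 3},
    ]
    to_reclaim, to_dead_letter = triage_pending(sample)
    print("reclaim:", to_reclaim)
    print("dead-letter:", to_dead_letter)
```

In a real worker, the first list would be passed to `xclaim` and the second copied to a separate stream (for example with `xadd`) and then acked. That wiring is left out so the sketch runs without a Redis server.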

Distributed & Cloud Computing · 后端工程纪要 (official) @backend_jot · 2026-05-20 14:25

Grafana Tempo: storing distributed traces at low cost

## Background

Distributed tracing tools:

- **Jaeger**: the classic; self-hosted; stores in Cassandra / ES
- **Zipkin**: older still
- **Tempo** (Grafana Labs): the newer option; stores in S3

Trace volume is large (many spans per request, TB / day), and keeping it in Cassandra / ES is expensive.

Tempo is designed as an "object-storage-native trace store", much like Loki is for logs.

## Tempo characteristics

- stores trace blobs in S3 / GCS (very cheap)
- indexes only the trace ID (not attributes) → there is no full-text index over service / tag
- query pattern: use metrics to find the time range → get a trace ID → look up the trace

The metric / log / trace workflow:

```
metric (Prometheus / Mimir)
   ↓ spot the spike window
log (Loki)
   ↓ find the trace_id
trace (Tempo)
   ↓ inspect the trace in detail
```

## Install

```yaml
# docker-compose
services:
  tempo:
    image: grafana/tempo:2.5.0
    command: ['-config.file=/etc/tempo.yaml']
    ports:
      - 3200:3200   # HTTP
      - 4317:4317   # OTLP gRPC
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo

volumes:
  tempo-data:
```

`tempo.yaml`:

```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: s3
    s3:
      bucket: my-traces
      endpoint: s3.amazonaws.com
      region: us-east-1

compactor:
  compaction:
    block_retention: 720h   # 30 days
```

## Ingest

The application's OTEL SDK sends traces to Tempo (or otel-collector → Tempo):

```python
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# insecure=True: plaintext gRPC, matching the Tempo receiver configured above
exporter = OTLPSpanExporter(endpoint='tempo:4317', insecure=True)
```

Or relay through otel-collector (recommended in production).

## Querying from Grafana

Add a Tempo data source in Grafana:

```
URL: http://tempo:3200
```

Explore → Tempo → "Search" tab:

- search by trace ID (direct lookup)
- search by service name + tag (TraceQL)
- duration / status filter

```traceql
{ resource.service.name = "api" && duration > 500ms && status = error }
```

This returns a list of traces → click one to see the span tree.

## TraceQL (Tempo query language)

Think PromQL, but for traces:

```traceql
# Traces from service=api within the time window
{ resource.service.name = "api" }

# Slow traces
{ duration > 1s }

# Error traces
{ status = error }

# Relationships across spans
{ name = "GET /api/users" } >> { name = "db.query" && duration > 100ms }
# a span named "GET /api/users" with a descendant span that is a slow DB query
```

Powerful, and it needs no full-text index.

## Linking with metrics

Grafana panel:

```
metric: rate(http_requests{status="500"})
   ↓ click data point
trace search: status=error around timestamp
   ↓ list of error trace IDs
```

One click takes you from a metric spike to concrete traces. A great debugging aid.

## Linking with logs

Write the trace_id into your logs:

```python
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Injects otelTraceID / otelSpanID into every log record
# (package: opentelemetry-instrumentation-logging)
LoggingInstrumentor().instrument()

handler = logging.StreamHandler()
formatter = logging.Formatter(
    '%(asctime)s [%(levelname)s] [trace_id=%(otelTraceID)s] %(message)s'
)
handler.setFormatter(formatter)
```

Query the logs in Loki:

```logql
{service="api"} |= "error"
# you see trace_id=abc123
```

Grafana recognizes the trace_id → click it to jump to Tempo and load the trace.

## Storage cost comparison

| | Jaeger (ES) | Tempo (S3) |
|---|---|---|
| 1 TB trace/day | ~$2000/month (ES cluster) | ~$50/month (S3) |
| Query latency | < 100ms | ~1s (fetch from S3) |
| Index flexibility | any attribute | trace ID + some attributes |

Tempo costs 1-2 orders of magnitude less. Trade-off: slower queries (S3 fetch + scan) and a weaker index.

## Sampling

Traces are voluminous → you don't store them all. Two kinds of sampling:

- **head sampling**: decided in the application (e.g. 1%)
- **tail sampling**: the collector buffers the whole trace, then decides (e.g. 100% of errors + 1% of normal traces)

Tail sampling is smarter but needs collector resources (otel-collector `tail_sampling` processor).

## Compared with Jaeger

| | Jaeger | Tempo |
|---|---|---|
| Storage | Cassandra / ES / Memory | S3 / GCS |
| Query power | Strong (fully indexed) | Medium (trace ID) |
| Cost | High | Low |
| Integration | Jaeger UI | Grafana |
| Deployment | Multiple components | Single binary |

Jaeger fits: low volume + heavy ad-hoc querying. Tempo fits: high volume + acceptance of metric/log-driven trace lookup.

## OpenTelemetry → backend-agnostic

As long as the application uses the OTEL SDK, the backend is swappable:

- Jaeger
- Tempo
- Honeycomb (SaaS)
- Datadog APM (SaaS)
- Splunk

No code changes.

## Real deployment

Our prod:

- 100 microservices + 100k QPS
- 50 GB trace/day (10% sampling)
- Tempo + S3 backend
- 30-day retention
- Grafana as the main entry point

Cost:

- S3: 50 GB × 30 days × $0.023/GB = $35/month
- Tempo compute (2 small instances): $30/month
- Total: < $100/month

The Jaeger + ES equivalent: > $1000/month.

Experience trade-offs:

- queries take 1-3 s (vs Jaeger < 200 ms)
- you must know the trace_id or do a cross-service search (no full-text search over trace content)

Acceptable for us.

## Metrics generator (Tempo extra)

Tempo can derive metrics from traces:

```yaml
metrics_generator:
  registry:
    external_labels:
      source: tempo
  processor:
    service_graphs:
    span_metrics:
```

It generates automatically:

- `traces_spanmetrics_calls_total{service, operation}` (call rate)
- `traces_spanmetrics_latency_*` (latency histogram)
- service graph (which service calls which)

This replaces a separate RED metric exporter by deriving the metrics from traces.

## Compared with Datadog APM

| | Tempo | Datadog APM |
|---|---|---|
| Cost | $100/month | $1000-10000/month |
| Deployment | self-hosted | SaaS |
| UX | Grafana, average | Excellent |
| Integration | OTEL + your own stack | Full Datadog ecosystem |

Big budget + want batteries included → Datadog. Budget-sensitive / already on Grafana → Tempo.

## Pitfalls

1. **S3 cost spike**: PUT requests are billed, and a stream of frequent small traces adds up fast → let the compactor merge blocks.
2. **Query timeouts**: searching across a large time window → large S3 reads → timeout. Narrow the window or use trace ID lookup.
3. **OTEL version compatibility**: the OTLP schema changed on a Tempo upgrade → traces from old SDKs were rejected. Pin SDK + Tempo versions.
4. **Head sampling misses key traces**: a random 1% dropped the error trace → nothing to look at. Tail sampling solves this.
5. **Trace too large**: a single trace with thousands of spans → slow UI. Keep the number of spans per trace under control.
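
Pitfall 4 and the sampling section describe the tail-sampling policy "100% of errors + 1% of normal traces". The decision itself is small enough to sketch in plain Python. This is an illustration of the idea, not how the otel-collector `tail_sampling` processor is implemented; the function name `keep_trace` and the hash-based bucketing are assumptions chosen so the decision is deterministic per trace ID.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, normal_rate: float = 0.01) -> bool:
    """Decide, after the whole trace is buffered, whether to store it.

    Hypothetical policy: keep every trace that contains an error, plus a
    deterministic `normal_rate` fraction of the rest.
    """
    if has_error:
        return True
    # Map the trace ID to a stable bucket in [0, 1) so that every collector
    # instance reaches the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < normal_rate


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(10_000)]
    kept = sum(keep_trace(t, has_error=False, normal_rate=0.01) for t in ids)
    print(f"kept {kept} of {len(ids)} normal traces")
    print("error trace kept:", keep_trace("trace-x", has_error=True))
```

The important property is that the error check happens after all spans are known, which is exactly what head sampling cannot do.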

Observability · 运维实录编辑部 (official) @ops_lab · 2026-05-13 23:02

Prometheus + Alertmanager: sending alerts to Slack / email / DingTalk

## Background

With Prometheus + Grafana installed you can look at graphs, but "the disk filled up at 3 a.m." still needs someone staring at a dashboard to notice. We want "a metric condition is met" to turn into a pushed alert automatically.

Alertmanager is the official alert-dispatch component of the Prometheus ecosystem.

## Solution

### 1. Write alert rules (Prometheus side)

`/etc/prometheus/rules/node.yml`:

```yaml
groups:
  - name: node-alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: 'Node {{ $labels.instance }} is down'
          description: 'Prometheus has been unable to scrape {{ $labels.instance }} for 2 minutes'

      - alert: HighCPU
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.instance }} CPU > 85% for 10 minutes'

      - alert: DiskAlmostFull
        expr: 100 - node_filesystem_avail_bytes{mountpoint="/"} * 100 / node_filesystem_size_bytes{mountpoint="/"} > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: '{{ $labels.instance }} root partition > 90%'
          # $value is already on a 0-100 scale, so print it directly
          # (humanizePercentage expects a 0-1 ratio)
          description: 'Currently {{ $value | printf "%.1f" }}%'

      - alert: MemoryPressure
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 92
        for: 15m
        labels:
          severity: warning
```

Key points:

- `expr`: a PromQL expression; the alert triggers when the result is non-empty
- `for`: how long the condition must hold before it really fires (avoids alerts on momentary jitter)
- `labels`: usable for routing (severity / team)
- `annotations`: the human-readable part; templates support `{{ $labels.x }}` and `{{ $value }}`

Reference the rules from `prometheus.yml`:

```yaml
rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```

Reload Prometheus:

```bash
curl -X POST http://localhost:9090/-/reload
```

Check alert state in the UI at http://prom:9090/alerts. There are three states: Inactive / Pending / Firing.

### 2. Install Alertmanager

```bash
# Pin the release tag that matches the file name
curl -fsSL https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz \
  | sudo tar xz -C /opt/
sudo ln -s /opt/alertmanager-0.27.0.linux-amd64 /opt/alertmanager

sudo useradd -rs /bin/false alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /var/lib/alertmanager
```

### 3. Configure Alertmanager

`/etc/alertmanager/alertmanager.yml`:

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'   # placeholder address
  smtp_auth_username: 'notifier'
  smtp_auth_password: '...'

# Routing tree
route:
  receiver: default
  group_by: ['alertname', 'instance']
  group_wait: 30s        # wait 30 s after the first alert for others in the same group
  group_interval: 5m     # minimum gap before the next batch for the same group
  repeat_interval: 4h    # minimum gap before re-sending the same alert
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty
      group_wait: 10s
      repeat_interval: 1h
    - matchers:
        - severity = warning
        - team = ops
      receiver: slack-ops

inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: ['instance']
    # when one machine has both a critical and a warning alert,
    # the warning is suppressed (less noise)

receivers:
  - name: default
    email_configs:
      - to: 'oncall@example.com'   # placeholder address

  - name: slack-ops
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts-ops'
        title: '{{ .CommonAnnotations.summary }}'
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          severity: {{ .Labels.severity }}
          {{ end }}

  - name: pagerduty
    pagerduty_configs:
      - service_key: '<PD key>'

  - name: dingtalk
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=...'
        send_resolved: true
```

The DingTalk webhook expects a [specific JSON format](https://open.dingtalk.com/document/orgapp-server/custom-robot-access), so you usually need an adapter in between (`prometheus-webhook-dingtalk`) to convert.

systemd unit:

```ini
[Service]
User=alertmanager
ExecStart=/opt/alertmanager/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable --now alertmanager
```

### 4. Test

Fire a fake alert by hand:

```bash
curl -XPOST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '
[{
  "labels": {"alertname": "TestAlert", "severity": "critical", "instance": "test"},
  "annotations": {"summary": "test alert"},
  "startsAt": "2026-05-24T10:00:00Z"
}]'
```

The notification should arrive through the receiver that matches its labels within seconds.

### 5. Inhibition + silences

```bash
# Silence during planned maintenance
amtool silence add alertname=NodeDown instance=app1.example.com \
  --duration=2h --comment 'planned maintenance'

amtool silence query
amtool silence expire <id>
```

The Alertmanager UI (port 9093) also has a silence form.

## Results

- alerts went from "refresh the dashboard" → "push to your phone"
- group_by + group_interval merge "cluster-wide alert bursts" into a handful of messages, so no more flooding
- inhibit turns "machine is down, so every service on it alerts" into a single NodeDown
- on-call MTTR dropped from 30 minutes (notice + log in + diagnose) to 5 minutes

## Some best practices

1. **Don't skip `for: 5m`**: alert only when the disk stays above 90% for 5 minutes, so a cron job's momentary spike doesn't raise a false alarm
2. **Give every alert a runbook link**: add a `runbook_url` annotation so the alert message links straight to the "what to do when X happens" doc
3. **Don't alert on everything**: CPU at 80% needs no immediate human action; put it in a daily report. A page in the middle of the night should mean "business breaks if nobody acts now"
4. **Balance `repeat_interval: 4h`**: too short floods the channel; too long and an important alert gets slept through
5. **Review alert noise regularly**: run `amtool alert query` to see which alerts keep firing with nobody reacting → tune the threshold or delete them

## Pitfalls

1. **Rule reload has no effect**: Prometheus's `/-/reload` endpoint is disabled by default; start it with `--web.enable-lifecycle`.
2. **Alert timestamps look wrong**: Prometheus and Alertmanager disagree on time zone → the UI shows alerts from years ago. Use UTC on both sides.
3. **A wrong expr gives no syntax error**: Prometheus accepts expressions that are syntactically valid but semantically wrong (e.g. `metric > 100` when the metric is in GB); they just return nothing forever. Run the expr in the Graph tab of the Prometheus UI and check the result before writing the rule.
4. **Slack webhook URL committed to git**: it leaked and someone spammed the channel. Keep it in env / a secret, or use Alertmanager's file reference:
   ```yaml
   api_url_file: /etc/alertmanager/slack_url
   ```
5. **Inconsistent labels in the inhibit_rules `equal` field**: when source and target don't share the label, the inhibition does not apply. Check the label lists carefully.
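
Pitfall 5 is easier to reason about with the inhibition rule written out as code. The sketch below is a simplified model of the semantics, not Alertmanager's implementation: the `is_inhibited` function is hypothetical, matchers are reduced to exact equality, and a label missing on one side is treated as the empty string, which follows the documented rule that missing and empty labels are equivalent.

```python
from typing import Dict, Iterable, List

Labels = Dict[str, str]


def matches(labels: Labels, matchers: Labels) -> bool:
    """True when every matcher label is present with exactly that value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(
    target: Labels,
    firing: Iterable[Labels],
    source_matchers: Labels,
    target_matchers: Labels,
    equal: List[str],
) -> bool:
    """Simplified model of one inhibit rule applied to one target alert."""
    if not matches(target, target_matchers):
        return False
    for source in firing:
        if source is target or not matches(source, source_matchers):
            continue
        # Every label listed in `equal` must carry the same value on both alerts.
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


if __name__ == "__main__":
    node_down = {"alertname": "NodeDown", "severity": "critical", "instance": "app1"}
    high_cpu = {"alertname": "HighCPU", "severity": "warning", "instance": "app1"}
    other_cpu = {"alertname": "HighCPU", "severity": "warning", "instance": "app2"}
    rule = dict(
        source_matchers={"severity": "critical"},
        target_matchers={"severity": "warning"},
        equal=["instance"],
    )
    firing = [node_down, high_cpu, other_cpu]
    print("app1 warning inhibited:", is_inhibited(high_cpu, firing, **rule))
    print("app2 warning inhibited:", is_inhibited(other_cpu, firing, **rule))
```

If the critical alert carried `instance` but the warning used a different label name such as `host`, the `equal` check would compare `"app1"` with `""` and the warning would still be sent, which is the failure mode the pitfall describes.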

Distributed & Cloud Computing · 运维实录编辑部 (official) @ops_lab · 2026-05-12 21:29

etcd backup + restore: the last line of defense before a K8s disaster

## Background

The entire state of a K8s cluster lives in etcd:

- every deployment / service / configmap / secret
- node state
- every namespace

etcd down → the whole cluster is paralyzed. Worse: etcd data corruption (disk fault / operator error) → data loss.

Regular etcd backups plus rehearsed restores are basic cluster operations. A backup you have never restored from is as good as no backup.

## Check etcd status

```bash
# Cluster deployed with kubeadm
sudo etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table
```

```
+------------------------+------------+---------+---------+--------+
| ENDPOINT               | DB SIZE    | LEADER  | RAFT IDX| ALARMS |
+------------------------+------------+---------+---------+--------+
| 10.0.1.10:2379         | 250 MB     | true    | 123456  |        |
| 10.0.1.11:2379         | 250 MB     | false   | 123456  |        |
| 10.0.1.12:2379         | 250 MB     | false   | 123456  |        |
+------------------------+------------+---------+---------+--------+
```

The 3 endpoints should agree, with exactly 1 leader and no alarms.

## Create a snapshot

```bash
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
```

`Snapshot saved at /backup/etcd-...db`

A snapshot is a single binary file, a few hundred MB for a small cluster.

## Automated backup (cron job)

```bash
# /etc/cron.d/etcd-backup
0 */6 * * * root /usr/local/bin/etcd-backup.sh
```

```bash
#!/bin/bash
# /usr/local/bin/etcd-backup.sh
set -euo pipefail

BACKUP_DIR=/backup
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d-%H%M%S)

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save "$BACKUP_DIR/etcd-$TIMESTAMP.db"

# Upload to S3
aws s3 cp "$BACKUP_DIR/etcd-$TIMESTAMP.db" s3://my-backups/etcd/

# Delete old local backups
find "$BACKUP_DIR" -name 'etcd-*.db' -mtime +"$RETENTION_DAYS" -delete

# Delete old S3 backups (a lifecycle policy works too)
aws s3 ls s3://my-backups/etcd/ | awk '{print $4}' | while read -r f; do
  # etcd-YYYYmmdd-HHMMSS.db -> YYYY-mm-dd
  day=$(echo "$f" | sed 's/etcd-\(....\)\(..\)\(..\)-.*/\1-\2-\3/')
  days_old=$(( ($(date +%s) - $(date -d "$day" +%s)) / 86400 ))
  if [ "$days_old" -gt 30 ]; then
    aws s3 rm "s3://my-backups/etcd/$f"
  fi
done
```

A backup every 6 hours, kept 7 days locally and 30 days in S3.

## Restore (disaster recovery)

**Rehearsal steps** (run them several times in a test environment first):

```bash
# 1. Stop etcd on every etcd node (kubeadm runs it as a static pod)
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait a few seconds and confirm etcd has stopped
sudo crictl ps | grep etcd

# 2. Back up the current data dir
sudo mv /var/lib/etcd /var/lib/etcd.bak

# 3. Restore the snapshot
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-xxx.db \
  --data-dir /var/lib/etcd \
  --name etcd-node-1 \
  --initial-cluster etcd-node-1=https://10.0.1.10:2380,etcd-node-2=https://10.0.1.11:2380,etcd-node-3=https://10.0.1.12:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://10.0.1.10:2380

# 4. Repeat steps 2-3 on every node (with that node's own --name and peer URL)

# 5. Start etcd
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/

# 6. Verify
kubectl get nodes
kubectl get pods -A
```

The cluster returns to its state at snapshot time. Resources created after the snapshot (pods / deployments etc.) are lost, unless your IaC / GitOps setup can re-apply them.

## Velero (higher-level backup)

An etcd snapshot is raw K8s API state.

**Velero** backs up and restores K8s resources + PV data:

```bash
helm install velero vmware-tanzu/velero \
  --namespace velero --create-namespace \
  --set configuration.provider=aws \
  --set configuration.backupStorageLocation.bucket=my-backups \
  ...
```

```bash
# Back up one namespace
velero backup create myapp-backup --include-namespaces myapp

# Back up the whole cluster (except system namespaces)
velero backup create cluster-backup --exclude-namespaces kube-system,kube-public

# List
velero backup get

# Restore
velero restore create --from-backup myapp-backup --namespace-mappings myapp:myapp-restore
```

Velero's advantages over etcd snapshots:

- selective backup (by namespace / label)
- restore into a different namespace / cluster
- PV data snapshots (CSI snapshot support)
- cross-cluster migration

The etcd snapshot is the last resort (whole-cluster disaster); Velero is for day-to-day operations (accidentally deleted namespace / migration / test restores). Do both.

## Testing restores is critical

Lesson learned: **a backup you have never rehearsed = no backup**.

Once a quarter:

1. spin up a temporary cluster
2. restore from backup
3. verify the applications run
4. document every step

The first run will reveal:

- a resource that was never backed up
- wrong steps in the restore script
- a PV whose data had no snapshot

Fix those → no panic in a real disaster.

## etcd tuning

defrag (DB defragmentation):

```bash
etcdctl defrag --endpoints=...
```

Large DB size but few keys → heavy fragmentation. Defrag regularly, and after every leader change or large batch of changes.

quota:

```bash
etcd --quota-backend-bytes=8589934592   # 8 GB
```

The default is 2 GB → not enough for a large cluster.

monitor:

```promql
etcd_mvcc_db_total_size_in_bytes
etcd_disk_wal_fsync_duration_seconds
etcd_server_leader_changes_seen_total
```

Frequent leader changes → network / disk problems.

## Multi-cluster: cross-region backup

```bash
# velero schedule
velero schedule create daily-backup --schedule="0 2 * * *" --ttl 720h \
  --include-namespaces myapp \
  --storage-location aws-prod-east

# Cross-region replication (don't keep backups in the same region, in case of a region-wide outage)
aws s3 cp s3://my-backups-east/... s3://my-backups-west/...
```

## With cloud-managed K8s (EKS / GKE / AKS)

On cloud-managed K8s, etcd is maintained by the cloud provider.

- automatic backup / failover
- but you cannot run etcdctl snapshot yourself

You still need Velero for the resource layer (protection against an accidental namespace delete / restoring into another cluster).

## Real incident case

A customer accidentally ran `kubectl delete namespace prod`.

Recovery:

1. find the most recent velero backup (4 hours old)
2. `velero restore create --from-backup ... --include-namespaces prod`
3. 30 minutes later deployments / services / configmaps / secrets were all back
4. PV data intact (CSI snapshot)

No backup → the cluster is finished. Backup + rehearsal → recovered in 30 minutes.

## Pitfalls

1. **Backup never tested**: you assume you have a backup → in a real disaster the backup file turns out corrupt / incomplete. Test restores regularly.
2. **The etcd snapshot does not contain the encryption config**: K8s 1.13+ supports encrypting secrets in etcd. After a restore, if the encryption config is missing → decryption fails. Copy the encryption config alongside every backup.
3. **Velero does not back up PV data by default**: install the CSI plugin or the restic integration.
4. **IPs change after restore**: a recreated service may get a different ClusterIP → DNS changes take time to propagate.
5. **etcd full**: DB size exceeds the quota → the cluster goes read-only. Run `defrag` + `alarm disarm`.
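
The S3 cleanup loop in the backup script derives a snapshot's age from its file name with `sed` and `date`. The same retention check can be sketched in Python, where the parsing is easier to test. This assumes the `etcd-YYYYmmdd-HHMMSS.db` naming used by the script above; the function names are illustrative, and the actual S3 listing and deletion are left out.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

# File names produced by the backup script: etcd-YYYYmmdd-HHMMSS.db
_SNAPSHOT_NAME = re.compile(r"^etcd-(\d{8}-\d{6})\.db$")


def snapshot_time(name: str) -> Optional[datetime]:
    """Parse the timestamp embedded in a snapshot file name, or None if it does not match."""
    match = _SNAPSHOT_NAME.match(name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")


def expired_snapshots(names: Iterable[str], now: datetime, keep_days: int = 30) -> List[str]:
    """Return the snapshot names older than `keep_days`; unrelated files are ignored."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in names:
        taken_at = snapshot_time(name)
        if taken_at is not None and taken_at < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    listing = [
        "etcd-20260401-000000.db",
        "etcd-20260510-060000.db",
        "README.txt",
    ]
    print(expired_snapshots(listing, now=datetime(2026, 5, 12)))
```

Ignoring names that do not match the pattern is a deliberate choice here: a retention job should never delete a file it cannot date.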

Kubernetes · 运维实录编辑部 (official) @ops_lab · 2026-05-12 10:03

Installing Prometheus + Grafana + node_exporter on a Linux server (10-minute version)

Every long-running service deserves monitoring. Prometheus + Grafana is the de facto standard: Prometheus scrapes metrics, Grafana draws graphs, and both are free and self-hosted. Here is how to set them up.

## 1. Install node_exporter (on the monitored machines)

```bash
# On every machine you want to monitor (pin the release tag that matches the file name)
curl -fsSL https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz \
  | sudo tar xz -C /usr/local/bin --strip-components=1 \
    node_exporter-1.8.2.linux-amd64/node_exporter

sudo useradd -rs /bin/false node_exporter
```

systemd unit `/etc/systemd/system/node_exporter.service`:

```ini
[Unit]
Description=Prometheus node exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.textfile.directory=/var/lib/node_exporter
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable --now node_exporter
curl localhost:9100/metrics | head
```

## 2. Install Prometheus (on the central machine)

```bash
curl -fsSL https://github.com/prometheus/prometheus/releases/download/v2.55.1/prometheus-2.55.1.linux-amd64.tar.gz \
  | sudo tar xz -C /opt/
sudo ln -s /opt/prometheus-2.55.1.linux-amd64 /opt/prometheus

sudo useradd -rs /bin/false prometheus
sudo mkdir -p /var/lib/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus /opt/prometheus*
```

Config `/opt/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'prod'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1.example.com:9100'
          - 'server2.example.com:9100'
          - 'server3.example.com:9100'
        labels:
          environment: prod
```

systemd `/etc/systemd/system/prometheus.service`:

```ini
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=:9090
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable --now prometheus
# Browser: http://<central-host>:9090
```

## 3. Install Grafana

```bash
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | \
  sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | \
  sudo tee /etc/apt/sources.list.d/grafana.list

sudo apt update && sudo apt install -y grafana
sudo systemctl enable --now grafana-server
# Browser: http://<central-host>:3000 (default admin/admin)
```

## 4. Add the data source in Grafana

In the UI:

```
Connections → Data sources → Add data source → Prometheus
URL: http://localhost:9090
Save & test
```

## 5. Import the node_exporter dashboard

The Grafana community already has one:

```
Dashboards → New → Import → ID: 1860 (Node Exporter Full)
```

You instantly get full visualizations for CPU / RAM / disk / network / I/O / processes.

## 6. A few life-saving PromQL queries

```promql
# CPU usage (5-minute average)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

# Disk usage (root partition)
100 - (node_filesystem_avail_bytes{mountpoint="/"} * 100 / node_filesystem_size_bytes{mountpoint="/"})

# 5xx requests per second over 5 minutes (if you also scrape nginx metrics)
sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) by (instance)

# Top 5 machines by memory usage
topk(5, 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
```

## 7. Alerting

Add to `prometheus.yml`:

```yaml
rule_files:
  - /opt/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```

`/opt/prometheus/rules/node.yml`:

```yaml
groups:
  - name: node
    rules:
      - alert: HighCPU
        expr: 100 - avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'CPU > 80% on {{ $labels.instance }} for 10m'

      - alert: DiskAlmostFull
        expr: 100 - node_filesystem_avail_bytes{mountpoint="/"} * 100 / node_filesystem_size_bytes{mountpoint="/"} > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Disk > 85% on {{ $labels.instance }}'
```

Install Alertmanager to deliver alerts to Slack / email / DingTalk:

```bash
curl -fsSL https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz \
  | sudo tar xz -C /opt/
sudo ln -s /opt/alertmanager-0.27.0.linux-amd64 /opt/alertmanager
```

`/opt/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: slack
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
```

## 8. Firewall

```bash
# On the Prometheus host
sudo ufw allow from <Grafana_IP> to any port 9090

# On the node_exporter hosts
sudo ufw allow from <Prometheus_IP> to any port 9100
```

Alternatively, run Prometheus and Grafana on the same machine (common when the box monitors itself) and expose only Grafana on 80 / 443 to the outside.

## 9. Security: never expose 9100 directly to the public internet

node_exporter metrics contain sensitive information (startup arguments, process lists). Always do one of:

- restrict source IPs with an internal firewall
- put nginx with basic auth in front
- or enable mTLS in node_exporter through `--web.config.file`

## 10. Storage retention

```bash
--storage.tsdb.retention.time=30d   # default 15d
--storage.tsdb.retention.size=50GB
```

For long-term storage use Thanos / VictoriaMetrics / Cortex. On a single machine, 30 days of metrics usually takes a few GB to a few tens of GB.

## Pitfalls

- The Prometheus UI shows a target as "down": run `curl host:9100/metrics` to tell whether the exporter isn't running or the network is blocked. Often the firewall simply doesn't allow Prometheus to reach 9100.
- Time zones: Prometheus uses UTC internally; Grafana displays in the browser's time zone. If the two confuse you, force Grafana to your local time zone.
- `scrape_interval` too short: scraping every 1 s makes Prometheus data explode and adds network traffic. 15 s is a solid default; 30 s / 1 m saves more when you have many nodes.
- Runaway metric cardinality: a label takes unbounded values such as user_id / request_id, the number of time series explodes, and Prometheus runs out of memory. This is the most common root cause of Prometheus outages.
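
The last pitfall, runaway cardinality, comes down to multiplication: in the worst case a metric has one time series per combination of label values. A few lines of arithmetic make the effect concrete. The label names and counts below are made-up illustrations, and the product is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod
from typing import Dict


def max_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    # Hypothetical HTTP request counter with bounded labels
    bounded = {"instance": 3, "method": 4, "status": 5}
    print("bounded labels:", max_series(bounded))

    # The same metric after someone adds a user_id label
    unbounded = dict(bounded, user_id=100_000)
    print("with user_id:", max_series(unbounded))
```

Three instances, four methods, and five status codes stay at 60 series. Adding a label with 100,000 distinct users raises the bound to 6,000,000 for that single metric.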

Distributed & Cloud Computing · 运维实录编辑部 (official) @ops_lab · 2026-05-09 13:13

Writing incident runbooks that the 3 a.m. on-call can actually follow

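## Background

A service goes down. Monitoring pages the on-call's phone.

Ideal: the on-call knows what to do. Reality: the on-call is new / unfamiliar with this service / foggy at 3 a.m. → panic.

A `runbook` (operations manual) is a document of explicit steps, one per service / per alert type. It does not ask the on-call to reason about root cause: follow the steps to recover → the owner does root cause analysis afterwards.

## What a runbook contains

```markdown
# Runbook: API service down

## TL;DR
1. Check status: `kubectl get pods -n api`
2. If pod CrashLoopBackOff → restart: `kubectl rollout restart deploy api -n api`
3. If still bad → check dashboard: <url>
4. If beyond 15 min → page tech lead: <name>

## What the alert means
Trigger: API 5xx > 5% for 5 min
Impact: user login / core features unavailable

## Quick mitigation
[5 concrete commands / steps]

## Diagnose
[how to read the logs / metrics]

## Escalation path
Mitigation didn't help → who to contact + how
```

The key: **the first line already says what to do**, instead of opening with 5 paragraphs of background.

## Template

```markdown
# Runbook: <Alert Name>

## Emergency mitigation (< 5 minutes)
Concrete commands / steps that work when copy-pasted.

## Impact
- User-visible: xxx
- Affected services: xxx

## When to escalate
- Unresolved after 5 minutes → contact oncall #2
- Unresolved after 15 minutes → page tech lead
- Data integrity at risk → pull in an incident commander

## Diagnose
1. Grafana: <link>
2. Log query: <link Loki / ELK>
3. Trace: <link Jaeger>

## Known scenarios
- Scenario A: symptom → action
- Scenario B: symptom → action

## Related links
- Architecture diagram: <link>
- Design doc: <link>
- Post-mortem: <link to history>
```

## Example: DB connection pool full

````markdown
# Runbook: PG connection pool exhausted

## 5-minute mitigation

```bash
# 1. Check pool state
psql -h pgbouncer -U admin pgbouncer -c "SHOW POOLS;"
# cl_waiting > 0 → clients are queueing

# 2. Restart the apps (releases connections that may have leaked)
kubectl rollout restart deploy/api -n api
kubectl rollout restart deploy/worker -n api

# 3. Temporarily enlarge the pool
kubectl edit configmap pgbouncer-config -n db
# default_pool_size: 30 → 50
kubectl rollout restart deploy/pgbouncer -n db

# 4. Not recovered within 5 minutes → escalate
```

## Impact
- all API requests time out / are slow
- background workers fail

## Diagnose
- Grafana "PG conn pool": <link>
- log: `{service="api"} |= "connection refused"`
- trace for slow queries: <link>

## Known scenarios
- Long transaction leak (application bug): find long-running queries:
  ```sql
  SELECT pid, now() - xact_start, query
  FROM pg_stat_activity
  WHERE state != 'idle'
  ORDER BY 2 DESC;
  ```
  Kill the long runners: `SELECT pg_terminate_backend(pid);`
- Traffic spike: HPA didn't keep up → scale by hand `kubectl scale deploy/api --replicas=20 -n api`

## Escalation
- Minute 5 → @oncall-secondary
- Minute 15 → @platform-lead
- Signs of data corruption → @dba @cto

## History
- 2025-02-12 incident: long tx leak in checkout flow
- 2024-11-03 incident: traffic spike from marketing
````

## Keeping runbooks useful

A dead runbook = no runbook.

Update after every incident:

- this step worked → reinforce it
- it didn't → delete / change it
- new case → add it to "Known scenarios"

Updating the runbook is mandatory once the post-mortem is written (enforced by process).

## Linking alerts to runbooks

```yaml
# Prometheus alert rule
- alert: ApiHighErrorRate
  # Ratio of errors to requests; api_requests is an assumed metric name
  expr: rate(api_errors[5m]) / rate(api_requests[5m]) > 0.05
  annotations:
    summary: "API error rate > 5%"
    runbook_url: "https://wiki.example.com/runbook/api-error-rate"
```

PagerDuty / Slack alerts then carry the runbook link automatically → the on-call opens it with one click.

## Naming

- `api-5xx-spike` (not "api-bad")
- `pg-replica-lag`
- `cert-expiring-soon`

Specific + searchable. Each alert name maps to one runbook URL slug.

## Where to keep them

- Confluence / Notion: good search + versioning
- GitHub repo: ops/runbooks/<name>.md: reviewed together with code
- built-in Grafana annotations

We use a git repo plus a tool that publishes automatically to a web view. PR review enforces it: a new alert must come with a runbook PR.

## Decision-tree pattern

For complex scenarios, draw a decision tree:

```
alert: api-down

start
  │
  ▼
Is the pod running?
  │
  ├─ No → kubectl describe pod → ImagePullBackOff?
  │        ├─ Yes → registry outage, see <link>
  │        └─ No → check events, possibly OOMKilled → raise memory
  │
  └─ Yes → 5xx responses?
           ├─ Yes → check dependencies (DB / Redis), see <link>
           └─ No → ingress problem, see <link>
```

A mermaid diagram renders directly. For complex scenarios it lets newcomers simply follow the map.

## Rehearsal (game day)

Simulate an incident every quarter:

1. SRE deliberately triggers a "fake incident" (kill a pod / fail over the DB / push a bad config)
2. the on-call handles it by following the runbook
3. retro afterwards: which runbook step was vague / missing

The more newcomers take part → the more runbook gaps surface. No rehearsal = a rusty runbook.

## What about alerts without a runbook

There shouldn't be any. Principle: "no runbook, no alert". If nobody knows what to do when the alert arrives → the alert is meaningless; delete it or write the runbook.

Avoid the "alert ocean": 100 alerts a day that nobody reads → the real one gets missed.

## Relation to chaos engineering

Chaos engineering actively creates small failures to verify system resilience. A runbook is a reactive response document.

They complement each other: chaos finds the weakness → add a runbook + fix the system.

## Tools

- **PagerDuty**: alert → on-call + runbook
- **Opsgenie**: similar
- **Atlassian Statuspage**: status page for customers
- **Linear / Jira**: incident ticket tracking

## Real case

A prod incident at 2 a.m.:

- API 5xx spiked
- the on-call (a newcomer, 3 months in) got the alert
- runbook link: 5-step mitigation
- step 2 (rollout restart) didn't fix it
- step 4 (enlarge the pool) relieved it
- user-visible recovery in 5 minutes
- in the morning the team found the root cause: a third-party API timeout; a longer timeout was added

The full incident was resolved in 30 minutes, with < 5 minutes of impact. Without the runbook the newcomer would probably have paged a crowd first, multiplying the impact time.

## Anti-patterns

- "Just look at the monitoring": monitoring doesn't tell you how to fix it
- "Go ask veteran X": X is travelling / has left
- "Just read the code": nobody reads code at 3 a.m.
- "Wait for it to alert again": users are already affected

A runbook is written for the worst case and assumes the reader knows nothing.

## Pitfalls

1. **Runbook out of sync with reality**: the service config changed, the runbook didn't → commands fail. Add "was the runbook updated?" to review.
2. **Too long**: nobody finishes a 5-page runbook in the small hours. TL;DR + mitigation on the first screen.
3. **Depends on outside knowledge**: "ask the platform team on Slack", but the Slack channel was renamed. Use absolute paths / permanent links.
4. **No rehearsal**: assuming a written runbook is enough → in a real incident the steps don't work. Rehearse quarterly.
5. **No owner**: who updates the runbook? Each service owner is responsible. Assign a "DRI" to every new alert.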
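
The "no runbook, no alert" rule and the PR-review requirement can be backed by a small lint in CI. The sketch below checks already-parsed rule groups, i.e. the dict structure a YAML loader would produce from a Prometheus rule file. The function name and sample data are hypothetical, and loading the YAML is omitted so the example has no third-party dependency.

```python
from typing import Dict, List


def alerts_missing_runbook(groups: List[Dict]) -> List[str]:
    """Return the names of alert rules that have no non-empty runbook_url annotation."""
    missing = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no alert name and need no runbook
            url = rule.get("annotations", {}).get("runbook_url", "")
            if not url.strip():
                missing.append(rule["alert"])
    return missing


if __name__ == "__main__":
    # Hypothetical parsed rule file
    groups = [
        {
            "name": "api",
            "rules": [
                {
                    "alert": "ApiHighErrorRate",
                    "annotations": {
                        "summary": "API error rate > 5%",
                        "runbook_url": "https://wiki.example.com/runbook/api-error-rate",
                    },
                },
                {"alert": "PgReplicaLag", "annotations": {"summary": "replica lag"}},
                {"record": "job:api_errors:rate5m", "expr": "rate(api_errors[5m])"},
            ],
        }
    ]
    offenders = alerts_missing_runbook(groups)
    print("alerts without runbook:", offenders)
    # In CI, a non-empty list would fail the build.
```

Failing the build on a non-empty result turns the principle into something a reviewer no longer has to remember to check.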

Distributed and Cloud Computing · 运维实录编辑部 (official) @ops_lab · 2026-05-09 07:56

OpenTelemetry Collector: One Collector for Traces, Metrics, and Logs

## 起因可观测性 3 大 pillar: - **metric**：Prometheus + node_exporter / app exporter - **trace**：Jaeger / Tempo / Zipkin - **log**：Loki / ELK 每个数据类型一套 collector：promtail / vector / fluentd / filebeat / otel-trace 等。应用要装 N 个 SDK，运维要管 N 套 agent。 **OpenTelemetry Collector** 统一：一个 binary 收 3 类数据 + 转发给后端。应用用一套 OTEL SDK，agent 收一套 protocol。 ## 装 ```bash docker run -d -p 4317:4317 -p 4318:4318 \ -v $(pwd)/config.yaml:/etc/otelcol/config.yaml \ otel/opentelemetry-collector-contrib:latest ``` `config.yaml`： ```yaml receivers: otlp: protocols: grpc: { endpoint: 0.0.0.0:4317 } http: { endpoint: 0.0.0.0:4318 } prometheus: config: scrape_configs: - job_name: 'apps' static_configs: - targets: ['app:8080'] processors: batch: timeout: 10s exporters: otlphttp/jaeger: endpoint: http://jaeger:4318 prometheusremotewrite: endpoint: http://mimir:9009/api/v1/push loki: endpoint: http://loki:3100/loki/api/v1/push service: pipelines: traces: receivers: [otlp] processors: [batch] exporters: [otlphttp/jaeger] metrics: receivers: [otlp, prometheus] processors: [batch] exporters: [prometheusremotewrite] logs: receivers: [otlp] processors: [batch] exporters: [loki] ``` receiver → processor → exporter 流水线。 ## 应用接入 (Python) ```bash pip install opentelemetry-distro opentelemetry-exporter-otlp opentelemetry-bootstrap -a install # 自动装 instrumentation ``` ```bash # 启动应用 opentelemetry-instrument \ --traces_exporter otlp \ --metrics_exporter otlp \ --logs_exporter otlp \ --service_name myapp \ --exporter_otlp_endpoint http://otel-collector:4317 \ python app.py ``` 或者代码内： ```python from opentelemetry import trace from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import BatchSpanProcessor from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter provider = TracerProvider() provider.add_span_processor( BatchSpanProcessor(OTLPSpanExporter(endpoint='otel-collector:4317')) ) trace.set_tracer_provider(provider) tracer = trace.get_tracer(__name__) with 
tracer.start_as_current_span('process_order'): # ... pass ``` auto-instrumentation 自动 trace Django / requests / SQLAlchemy 等。 ## 跨语言 JS / Java / Go / Ruby / .NET 都有 OTEL SDK。全用 OTLP 发到同 collector → 后端汇总。 ## 处理器 (processor) ```yaml processors: attributes: actions: - key: env value: production action: insert - key: password # 敏感字段 action: delete filter: traces: span: - 'attributes["http.url"] == "/health"' # 过滤健康检查 trace tail_sampling: decision_wait: 10s policies: - name: errors type: status_code status_code: { status_codes: [ERROR] } - name: slow type: latency latency: { threshold_ms: 1000 } - name: sample type: probabilistic probabilistic: { sampling_percentage: 1 } ``` - attributes 加 / 删 tag - filter 丢弃噪音 span - tail sampling: 1% 抽样 + 100% 错误 + 100% 慢请求 → 节省后端存储 ## tail vs head sampling head sampling：trace 开始时决定要不要采集（应用端）。 tail sampling：trace 完成后看完整决定（collector 端）。 tail 优势：基于结果决定（错误 / 慢的全采，正常的 1%）。缺点：collector 要 buffer 所有 trace 几秒。 ## deployment 模式 ``` agent (DaemonSet, 每 node) → gateway (Deployment, 集群级) → 后端 ``` - agent：每 node 一个，应用本地连，减少网络 - gateway：中央处理（采样 / batch / 多后端 fanout）或者单层：应用 → collector → 后端（小集群）。 ## k8s 部署 (operator) ```bash helm install opentelemetry-operator open-telemetry/opentelemetry-operator ``` ```yaml apiVersion: opentelemetry.io/v1beta1 kind: OpenTelemetryCollector metadata: name: gateway spec: mode: deployment replicas: 3 config: | receivers: ... processors: ... exporters: ... 
```

The operator manages the deployment and config reloads.

## Auto-instrumentation (k8s)

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instr
spec:
  exporter:
    endpoint: http://otel-gateway:4317
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
```

```yaml
# pod annotation
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-python: "true"
```

The operator automatically injects a sidecar / init container → the application gets traces and metrics with zero code changes.

## Metric processing

```yaml
processors:
  metricstransform:
    transforms:
      - include: http.server.duration
        action: update
        new_name: http_request_duration_seconds
```

This renames an OTEL metric to Prometheus style → compatible with old Grafana dashboards.

## Compared with Vector

Vector (Datadog OSS) is another general-purpose telemetry pipeline:

| | OTEL Collector | Vector |
|---|---|---|
| Standard | OTEL (CNCF) | Datadog's own |
| log | ✅ | ✅ strong |
| metric | ✅ | ✅ |
| trace | ✅ strong | weak |
| Config | YAML | TOML |
| Performance | medium | very high (Rust) |

Trace-heavy → OTEL. Log / metric pipeline with maximum performance → Vector. I use OTEL for the standardization advantage and because it spans multiple backends.

## Compared with fluentd / fluent-bit

fluent-bit / fluentd are mainly log shippers and are weak on metrics and traces. New projects can use OTEL for the whole stack. Old ELK projects may still stay on fluent.

## Real case

A new project designing observability from scratch:

```
app (Python / Go / TS) + OTEL SDK auto-instrument
    ↓ OTLP gRPC
otel-collector (DaemonSet, one per node)
    ↓ OTLP
otel-collector (gateway, 3 replicas)
    ↓
    ├─ trace  → Tempo
    ├─ metric → Mimir
    └─ log    → Loki
    ↓
Grafana, one place to view everything
```

One SDK + one collector → 3 kinds of data → viewed in one place in Grafana (the trace ID links logs and metrics).

Trace ID correlation is the killer feature: from an error open the trace → pull the logs with the same trace ID → look at the metric spike in that time window. Debugging gets much faster.

## Pitfalls we hit

1. **OTLP gRPC vs HTTP**: by default 4317 is gRPC and 4318 is HTTP. A client pointed at the wrong port fails with an error.
2. **Batch processor too large**: an oversized batch means high latency and OOM risk. Tune `send_batch_size` (for example 8192).
3. **Tail sampling memory**: at high QPS, buffering a few seconds of traces takes several GB of RAM. Run the gateway as its own deployment with plenty of memory.
4. **Auto-instrumentation overhead**: with some frameworks, instrumenting everything raises P99. Disable the unimportant parts (health check / metrics endpoint).
5. **Missing environment tag**: dev / staging / prod data gets mixed and is hard to tell apart. Force an `env=prod` resource attribute on every signal, for example through the SDK's `OTEL_RESOURCE_ATTRIBUTES` environment variable.
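To make the tail-sampling policy (keep every error, keep every slow trace, keep about 1% of the rest) concrete, here is a minimal sketch of the keep/drop decision for one completed trace. It is an illustration under stated assumptions, not the collector's implementation: the span dict shape (`status`, `start_ms`, `end_ms`), the function name `keep_trace`, and the hash-based sampling are all hypothetical.

```python
import hashlib


def keep_trace(trace_id, spans, threshold_ms=1000, sampling_percentage=1.0):
    """Decide whether to keep a finished trace (illustrative policy only)."""
    # Policy 1: keep every trace that contains an error span.
    if any(span["status"] == "ERROR" for span in spans):
        return True
    # Policy 2: keep every slow trace (earliest start to latest end).
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms >= threshold_ms:
        return True
    # Policy 3: keep a deterministic fraction of the rest, keyed on the trace ID,
    # so every collector replica would reach the same decision.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10000
    return bucket < sampling_percentage * 100


error_trace = [{"status": "ERROR", "start_ms": 0, "end_ms": 20}]
slow_trace = [
    {"status": "OK", "start_ms": 0, "end_ms": 400},
    {"status": "OK", "start_ms": 300, "end_ms": 1500},
]
fast_trace = [{"status": "OK", "start_ms": 0, "end_ms": 30}]

print(keep_trace("trace-a", error_trace))  # True: contains an error
print(keep_trace("trace-b", slow_trace))   # True: 1500 ms >= 1000 ms
```

The decision needs the whole trace, which is why the collector has to buffer spans for the decision window and why memory grows with QPS.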

Observability · 运维实录编辑部 (official) @ops_lab · 2026-05-08 06:52

Building a Grafana Dashboard That Shows Every Key Production Metric in 60 Seconds

## 起因监控装好了，Prometheus + Loki + Grafana 都在跑，仪表盘有 30 多个。出问题时点开仪表盘要翻几屏才看到关键指标。on-call 的同事中午吃饭被叫起来排查，越急越慌越找不到。 "作战仪表盘"就是把"凌晨被叫醒后要看的 8 个图"压缩到一屏内。 ## 解决方案 ### 设计原则 1. **一屏装下，不滚屏**：1920×1080 上 8-12 个 panel 是甜点 2. **金字塔布局**：最上面是"系统健康分"，下方是各组件细节 3. **时间窗口默认 1h**：足够看到趋势又不太久 4. **每个 panel 自带告警阈值线**：红色 horizontal line 标 SLO 5. **stats vs graph**：当前值用 stat panel（巨大数字），趋势用 timeseries ### 顶部：4 个 stat panel（系统状态） ```promql # Panel: 在线节点数 / 总节点数 sum(up{job="node"}) / count(up{job="node"}) # Panel: 当前 RPS（应用总） sum(rate(http_requests_total[1m])) # Panel: 当前 P95 延迟 histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) # Panel: 当前 5xx 错误率 sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) ``` 每个 stat panel 配 thresholds: - 绿色：正常 - 黄色：警戒 - 红色：异常例如 P95 延迟 < 100ms 绿、100-500ms 黄、> 500ms 红。 ### 中部：时间序列趋势（2x2） 1. **RPS by endpoint**： ``` sum by (path) (rate(http_requests_total[1m])) ``` 2. **延迟 P50/P95/P99**： ``` histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) histogram_quantile(0.95, ...) histogram_quantile(0.99, ...) ``` 3. **错误率 by status code**： ``` sum by (status) (rate(http_requests_total[1m])) ``` 4. **数据库 query 时间 P95**： ``` histogram_quantile(0.95, sum by (le) (rate(db_query_duration_seconds_bucket[5m]))) ``` ### 下部：基础设施（2x2） 5. **每节点 CPU**： ``` 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 ``` 6. **每节点 RAM**： ``` 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) ``` 7. **磁盘使用率**： ``` 100 - node_filesystem_avail_bytes{mountpoint="/"} * 100 / node_filesystem_size_bytes{mountpoint="/"} ``` 8. 
**网络入/出**： ``` sum by (instance) (rate(node_network_receive_bytes_total[1m])) sum by (instance) (rate(node_network_transmit_bytes_total[1m])) ``` ### 底部：log 错误日志（Loki） Logs panel，query： ```logql {job="myapp"} |~ "ERROR|FATAL|panic" | json ``` 直接看到最近的错误日志，配合上面的指标曲线看时间关联。 ### 配置 tip #### 单位每个 panel 设正确的 unit（seconds / percent / bytes/sec）， Grafana 自动用 K/M/G 缩写。`{r}/s` 表示每秒请求数。 #### 颜色一致所有"错误"用红、"延迟"用紫、"流量"用蓝、"资源"用橙。形成视觉默契，眼睛快速分辨。 #### 阈值线 ``` Thresholds: - color: green, value: 0 - color: yellow, value: 100 - color: red, value: 500 ``` panel 上自动画水平线。 #### Variables（下拉切环境 / 服务）仪表盘顶部： ``` Variable: env Type: query Query: label_values(up, env) Variable: service Type: query Query: label_values(up{env="$env"}, job) ``` 用户切 env=prod / service=api → 所有 panel 自动 filter。 ### 自动化：dashboard as code 不要在 UI 里手动建。用 [grafonnet](https://github.com/grafana/grafonnet) 或者 [grizzly](https://github.com/grafana/grizzly) 把 dashboard 写成 Jsonnet / YAML 进 git： ```jsonnet local g = import 'g.libsonnet'; g.dashboard.new('My App SLI Overview') + g.dashboard.withRefresh('30s') + g.dashboard.withPanels([ g.panel.stat.new('Active nodes') + g.panel.stat.queryOptions.withTargets([ g.query.prometheus.new('default', 'sum(up)') ]), // ... 
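  // more panels
])
```

`grr apply` deploys it to Grafana in one step. Changes go through PR review.

### Provisioning

Put the dashboard JSON in `/etc/grafana/provisioning/dashboards/` and Grafana loads it automatically at startup. That way dashboards do not have to be rebuilt by hand on deploy or disaster recovery.

## Results

- On-call gets an alert in the middle of the night → opens the dashboard → sees "which metric went red" in 5 seconds
- Mean time to diagnose dropped from ~15 minutes to ~3 minutes
- For onboarding, one dashboard is a crash course on the whole system
- Vague "it feels slow" feedback is replaced by "P95 on this panel tripled"

## Some advanced panels

### Service Level Objective (SLO) burn rate

```
(1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.995)
```

Define an SLO (for example "99.5% success rate"), then divide the observed hourly error ratio by the error budget `1 - SLO` to get the burn rate. A value above 1 means the budget is being spent faster than the pace that would exactly exhaust it over the SLO window.

### Apdex

```
(sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m])) + sum(rate(http_request_duration_seconds_bucket{le="0.4"}[5m]))) / 2 / sum(rate(http_request_duration_seconds_count[5m]))
```

100ms satisfied + 400ms tolerated → a score from 0 to 1. Histogram buckets are cumulative, so the `le="0.4"` bucket already contains the satisfied requests; summing both buckets and halving counts satisfied requests fully and tolerated ones as half. One number summarizes "user satisfaction".

### USE method (Utilization / Saturation / Errors)

For each resource (CPU / RAM / disk / network), three columns: current utilization / queue length / error count. This is Brendan Gregg's classic system troubleshooting framework, and it works on a dashboard too.

## Pitfalls we hit

1. **`sum` without `by`**: adding up the RPS of all nodes hides which node has the problem. Group key metrics with `by (instance)` or `by (job)`.
2. **Time window too long**: a 6h default makes the dashboard load slowly and hard to read. Default to 1h and extend when needed.
3. **Flickering stat panel numbers**: with a 5s refresh every reload recomputes the value → visual jitter. Set `min_step: 30s` to smooth it.
4. **Alerts configured in the dashboard (Grafana alerting) vs Alertmanager rules**: do not mix the two. In production keep the rules in Prometheus rule files (decoupled from Grafana) and let Grafana only display.
5. **Too many dashboards**: past 30 dashboards nobody knows which one to use. Categorize, apply a naming convention, organize in folders, and delete unused ones in a monthly review.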
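The Apdex and burn-rate panels reduce to simple arithmetic, which is easy to sanity-check outside Grafana. A minimal sketch with made-up request counts; the function names are hypothetical, and the bucket counts are assumed to be cumulative, as Prometheus histogram buckets are.

```python
def apdex(satisfied_cum, tolerated_cum, total):
    """Apdex from cumulative bucket counts.

    satisfied_cum: requests at or under the satisfied threshold (le="0.1").
    tolerated_cum: requests at or under the tolerated threshold (le="0.4");
                   this count already includes the satisfied requests.
    """
    return (satisfied_cum + tolerated_cum) / 2 / total


def burn_rate(error_ratio, slo):
    """How fast the error budget (1 - slo) is being consumed."""
    return error_ratio / (1 - slo)


# Made-up numbers: 1000 requests, 800 under 100 ms, 950 under 400 ms.
score = apdex(satisfied_cum=800, tolerated_cum=950, total=1000)
print(score)  # 0.875

# 1% of requests failing against a 99.5% SLO burns budget at roughly twice the sustainable pace.
rate = burn_rate(error_ratio=0.01, slo=0.995)
print(round(rate, 6))
```

Because the tolerated bucket includes the satisfied one, the score stays within 0 to 1: it reaches 1 only when every request lands in the satisfied bucket.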

Distributed and Cloud Computing · 运维实录编辑部 (official) @ops_lab · 2026-05-07 17:21

Longhorn: Distributed Block Storage for K8s (No Cloud Required)

## 起因 K8s pod 用 PV： - 云上：EBS / GCE PD / Azure Disk - 自托管：hostPath（坏到没救）/ NFS（慢 + 没 snapshot）/ Ceph（运维炸）需要"PVC 跑 stateful 应用"但又不在云上： - 边缘 cluster - 自建机房 - 本地 dev / 小 prod **Longhorn**（Rancher / SUSE，CNCF）：K8s-native distributed block storage。轻量 + 部署简单 + UI 友好。 ## 装 ```bash helm install longhorn longhorn/longhorn \ -n longhorn-system --create-namespace ``` 每 node 自动跑 longhorn-manager pod + 用本地 disk 做 storage。 `StorageClass` 自动创建： ```bash kubectl get sc longhorn (default) driver.longhorn.io Delete Immediate ``` ## 用 ```yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: name: data spec: storageClassName: longhorn accessModes: [ReadWriteOnce] resources: requests: storage: 10Gi ``` ```yaml spec: containers: - volumeMounts: - mountPath: /data name: data volumes: - name: data persistentVolumeClaim: claimName: data ``` 挂载即用。Longhorn 在背后： - 找有空间的 node 创建 volume - replica 跨 3 node 同步（默认） - pod schedule 到任意 node，volume 自动跟随 ## replica 跨 node 3 replica = 3 个 node 各存一份。任一 node 挂 → 还有 2 副本，pod 在别 node 重启接着用。 ## snapshot + backup ```yaml # 创建 snapshot apiVersion: longhorn.io/v1beta2 kind: Snapshot metadata: name: snap-1 namespace: longhorn-system spec: volume: pvc-xxx ``` snapshot 是 volume 状态 point-in-time copy（本地，秒级）。 backup 是把 snapshot 上传到 S3 / NFS： ```bash # Longhorn UI 配 S3 backup target # 或者 CR ``` ```yaml apiVersion: longhorn.io/v1beta2 kind: Backup metadata: name: backup-1 spec: snapshotName: snap-1 ``` 恢复：从 backup 创建新 volume → PVC → pod 挂载。跨集群迁移：cluster A backup → cluster B restore。 ## recurring job ```yaml apiVersion: longhorn.io/v1beta2 kind: RecurringJob metadata: name: daily-backup spec: cron: "0 2 * * *" task: backup retain: 7 groups: [default] ``` 每天 2:00 自动 snapshot + backup，保留 7 份。 ## UI ```bash kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80 # http://localhost:8080 ``` UI 看 volume / replica 状态 / backup / settings。比 yaml 直观，运维友好。 ## 性能 3-node cluster + NVMe disk + Longhorn 3 replica： - random read 4k: ~50k IOPS - random write 4k: ~15k 
IOPS - sequential write: ~600 MB/s 跟单盘比有 30-50% overhead（network + replication）。跟 EBS gp3 接近。够多数 workload。 ## 与 Ceph (Rook) 对比 | | Longhorn | Rook-Ceph | |---|---|---| | 复杂度 | 低 | 极高 | | 性能 | 中 | 高 | | scale | 中（几十 node） | 巨（几百+） | | object storage | ❌ | ✅ | | file storage (NFS-like) | RWX 实验 | ✅ | | block storage | ✅ | ✅ | 中小 cluster + 主要 block storage → Longhorn。大 scale + 需 object / file → Ceph。 ## RWX (多 pod 同时读写) ```yaml accessModes: [ReadWriteMany] # RWX ``` Longhorn RWX 用内置 NFS share volume 之上。性能比 RWO 弱，但能用。 ReadWriteOncePod (K8s 1.27+) 是更严格 RWO。 ## 与 hostPath ```yaml volumes: - name: data hostPath: path: /mnt/data ``` 简单粗暴但： - pod schedule 必须在那 node（pin） - 没备份 / 副本 - 移走 node 数据丢 dev 还行，prod 别用。 ## 与 NFS NFS provisioner：装个 NFS server + provisioner，PVC 在 NFS 上分 volume。 | | NFS | Longhorn | |---|---|---| | RWX | ✅ | ✅（弱） | | 性能 | 中 | 高 | | HA | NFS server 单点 | replica HA | | 部署 | 简 | 中 | 小数据 / RWX 重 → NFS 简单。 HA / 性能 → Longhorn。 ## 真实 case 某客户私有云 cluster： - 5 node bare metal - 每 node 1 TB NVMe - Longhorn 3 replica - PG / Redis / app data 全 Longhorn 效果： - Postgres 跑得跟单盘差 30%（可接受） - 任一 node down → pod 自动迁移 + volume 跟随 - 每天自动 backup 到 S3 - UI 让 ops 直观看 storage 状态挑战： - 网络 IO 飙（replica sync 占带宽）→ 10 GbE 网卡 - node 重启时 replica rebuild 几小时 ## 监控 Prometheus metrics： - `longhorn_volume_actual_size_bytes` - `longhorn_volume_state` - `longhorn_node_status` Grafana dashboard 官方提供。 alert 关键： - replica 不足（应 3 实际 < 3） - volume detached - node down - backup 失败 ## 与 cloud volume 对比 | | Longhorn | EBS / PD | |---|---|---| | 部署 | 自管 | 托管 | | 跨 AZ | 自配 | 内置 | | 性能 | 看本地 disk | 一致 | | 成本 | hardware 一次性 | 按月 | | 适合 | 自托管 / 边缘 | 云 | 在云上没必要 Longhorn（用 EBS）。自托管 / 混合云 → Longhorn 填补"K8s storage" 空白。 ## 踩过的坑 1. **每 node 一份 replica = 数据膨胀**：3 replica 占 3x 空间。规划存储 capacity * 3。 2. **kernel module**：Longhorn 用 iSCSI 协议。某些 minimal OS 没装 `open-iscsi` → 启不来。 3. **disk fill 90%**：默认 reserve 30% 给系统。改 `Storage Minimal Available Percentage`。 4. 
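**Slow node drain**: during a drain, volume detach + reattach is slow (tens of seconds). Leave time for it in the maintenance window.
5. **Misconfigured backup target**: a wrong S3 endpoint or credential makes backups fail, yet the UI looks successful at first glance. Verify with a manual restore on a regular schedule.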
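The capacity pitfalls (3 replicas take 3x the space, and part of each disk is reserved) can be turned into a quick planning calculation. A rough sketch, assuming every node contributes one equally sized disk and ignoring snapshots and over-provisioning; the function name is hypothetical, and the 30% reserve is the figure quoted in the pitfalls, not a value read from a live cluster.

```python
def usable_capacity_gb(nodes, disk_gb_per_node, replicas=3, reserved_pct=30):
    """Estimate how much volume data fits once reserve and replication are paid for."""
    raw_gb = nodes * disk_gb_per_node
    schedulable_gb = raw_gb * (1 - reserved_pct / 100)  # space left after the reserve
    return schedulable_gb / replicas                    # every byte is stored `replicas` times


# The 5-node, 1 TB-per-node cluster from the case study.
estimate = usable_capacity_gb(nodes=5, disk_gb_per_node=1000)
print(f"{estimate:.0f} GB usable out of 5000 GB raw")
```

Roughly a quarter of the raw capacity is left for actual data in this configuration, which is worth knowing before sizing the disks.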

Kubernetes · 运维实录编辑部 (official) @ops_lab · 2026-05-06 15:32

podman rootless: Running Containers Without a Root Daemon (Safer, with systemd Integration)

## 起因我的服务器上 Docker daemon 用 root 跑。任何能跟 docker.sock 通信的用户都等于 root（passing socket 到容器 = 容器逃逸）。多用户 / 共享开发机上风险大。 `podman` 是 Red Hat 主导的 OCI 兼容容器引擎，几个区别： - **no daemon**：每个容器一个进程 fork 出来，systemd 直接 supervise - **rootless first**：默认非 root 用户跑 - **Docker CLI 兼容**：`alias docker=podman` 大多数命令直接 work - **支持 pod**（K8s 那种多容器组） ## 装 ```bash # Debian 12+ / Ubuntu 22.04+ sudo apt install -y podman # 给当前用户 subuid / subgid（rootless 必须） # /etc/subuid + /etc/subgid 应当各有： # me:100000:65536 # 默认装好就有；没有就： sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 me # 配置 storage（rootless 在 ~/.local/share/containers/） podman info | grep -A 3 storage ``` ## 基础用法（跟 Docker 几乎一样） ```bash podman pull nginx:alpine podman run -d --name web -p 8080:80 nginx:alpine podman ps podman logs -f web podman exec -it web sh podman stop web podman rm web ``` `alias docker=podman` 后大部分 docker 命令直接用。 ## rootless 网络 Rootless 容器不能 bind 80/443（< 1024 需 root）。解决： ```bash # 方案 A：bind 8080，前置 nginx 反代 podman run -d -p 8080:80 nginx # 方案 B：开 unprivileged port range sudo sysctl net.ipv4.ip_unprivileged_port_start=80 # 之后 rootless 也能 bind 80 ``` 方案 B 安全风险：任何用户能 bind 80。多人共享机器不建议。 ## podman generate systemd 杀手锏：把 container 转成 systemd unit： ```bash podman run -d --name web -p 8080:80 nginx:alpine # 生成 unit 文件 podman generate systemd --name web --files --new # 输出：container-web.service mkdir -p ~/.config/systemd/user/ mv container-web.service ~/.config/systemd/user/ systemctl --user daemon-reload systemctl --user enable --now container-web.service # 开机自启（必须 enable linger） loginctl enable-linger $USER ``` 之后 systemctl 全程管理 container： ```bash systemctl --user status container-web systemctl --user restart container-web journalctl --user -u container-web -f ``` 容器 = systemd service。一致性极佳。 ## Quadlet（podman 4.4+ 推荐）更现代的方式：`.container` 文件直接被 systemd 当 unit： `~/.config/containers/systemd/web.container`： ```ini [Unit] Description=Web server [Container] Image=docker.io/library/nginx:alpine PublishPort=8080:80 
Volume=/home/me/www:/usr/share/nginx/html:Z AutoUpdate=registry [Service] Restart=always [Install] WantedBy=default.target ``` ```bash systemctl --user daemon-reload systemctl --user start web # 注意是 web 不是 web.container ``` 比 generate systemd 简洁得多。每改 image / volume / port 直接改 .container 文件 + daemon-reload。 ## Pod（多容器组） K8s 风格的 Pod（共享 network + IPC）： ```bash podman pod create --name myapp -p 8080:80 podman run -d --pod myapp --name api myapi:latest podman run -d --pod myapp --name worker myworker:latest podman pod ls ``` api 和 worker 共享网络（同 localhost），共享 IPC namespace。适合"sidecar" 模式。 K8s YAML 直接转 podman pod： ```bash podman play kube k8s-pod.yaml ``` 支持基础 K8s YAML。 ## auto-update ```bash # 容器加 label podman run -d --label io.containers.autoupdate=registry nginx:alpine # 系统级 timer 自动检查 + 升级 systemctl --user enable --now podman-auto-update.timer ``` 每天检查 registry 上 image 新版，有就 pull + restart 容器。Watchtower 等价。 ## docker-compose 兼容 ```bash sudo apt install -y podman-compose # 或：pip install podman-compose podman-compose up -d # 大部分 docker-compose.yml 直接 work ``` 或者用 `podman compose`（subcommand，跟 Docker Compose v2 接近）。少数 docker-only feature（如 `network_mode: bridge` 的特定行为）偶尔不兼容，看具体场景。 ## 与 Docker 对比 | | Docker | podman | |---|---|---| | daemon | dockerd (root) | 无 daemon（每容器自己进程） | | rootless | 实验性 + 有限 | 默认 + 一等公民 | | systemd 集成 | 弱 | 极强（Quadlet） | | Docker CLI 兼容 | 原生 | 极高 | | Compose | docker compose | podman-compose / podman compose | | K8s 友好 | Pod 概念无 | 原生 Pod | | 性能 | 略好（daemon overhead 摊销） | 接近 | | 生态成熟度 | 极高 | 高 + 增长中 | | RHEL / Fedora 默认 | ❌ | ✅ | Red Hat 系（RHEL / CentOS Stream / Fedora）官方推 podman。其它 distro 都装得上。 ## 适用场景 ✅ **podman 适合**： - 单机部署 + 喜欢 systemd 集成 - 多用户开发机（rootless 安全） - 不需要 docker swarm - RHEL / Fedora 用户 - K8s 学习（pod 概念友好） ❌ **保 Docker 适合**： - 重度用 docker swarm / docker compose 高级 feature - 团队工具链都基于 docker - 公司 SaaS 工具明确支持 Docker（如 GitHub Actions docker-container action） ## 我的实际迁移家用服务器（10+ 容器：nginx / db / 监控 / 个人 app）从 docker 迁 podman + Quadlet： 
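```ini
# ~/.config/containers/systemd/postgres.container
[Container]
Image=postgres:16-alpine
PublishPort=127.0.0.1:5432:5432
Environment=POSTGRES_PASSWORD=...
Volume=postgres-data.volume:/var/lib/postgresql/data
HealthCmd=pg_isready -U postgres
HealthInterval=10s

[Install]
WantedBy=default.target
```

One `.container` file per container, tracked in git. After a server reboot, systemd starts all containers in order automatically.

### Results

- Total memory: 200 MB less (no resident dockerd)
- Deploying / restarting a container works the same as any systemctl service
- Container crash / OOM / abnormal exit: full logs and restart history in the journal
- Rootless blocks the "container vulnerability → host root" path
- Consistent backup: backing up `~/.local/share/containers` + `~/.config` captures the full state

## Pitfalls we hit

1. **Rootless networking defaults to slirp4netns, which is slow**: throughput is far below rootful Docker. For high-traffic cases use `pasta` (the default in newer versions) or a rootful pod.
2. **Rootless users cannot share images**: each user has separate storage. `podman --root=/path/to/system-storage` can share it, but it is complicated.
3. **Some Docker-only environment variables**: for example the `DOCKER_HOST` socket path. podman exposes its unix socket at `$XDG_RUNTIME_DIR/podman/podman.sock`.
4. **SELinux labels**: a volume mount without the `:Z` / `:z` suffix gives permission errors inside the container. `-v /path:/path:Z` makes SELinux relabel automatically.
5. **GPU passthrough**: rootless + GPU does not work out of the box. It needs `--device /dev/nvidia*` plus a udev rule. For production, run GPU containers with rootful podman.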
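With one `.container` file per container kept in git, the files can also be generated from a small inventory instead of written by hand. A minimal sketch that renders a unit using only the Quadlet keys that appear in this post (`Image`, `PublishPort`, `Volume`, `Restart`, `WantedBy`); the `render_quadlet` helper is hypothetical and performs no validation.

```python
def render_quadlet(description, image, ports=(), volumes=(), restart="always"):
    """Render the text of a Quadlet .container unit (illustrative helper)."""
    lines = ["[Unit]", f"Description={description}", "", "[Container]", f"Image={image}"]
    lines += [f"PublishPort={port}" for port in ports]
    lines += [f"Volume={volume}" for volume in volumes]
    lines += ["", "[Service]", f"Restart={restart}", "", "[Install]", "WantedBy=default.target"]
    return "\n".join(lines) + "\n"


unit_text = render_quadlet(
    description="Web server",
    image="docker.io/library/nginx:alpine",
    ports=["8080:80"],
    volumes=["/home/me/www:/usr/share/nginx/html:Z"],
)
print(unit_text)
```

Writing the result to `~/.config/containers/systemd/web.container` and running `systemctl --user daemon-reload` would then follow the same workflow as a hand-written file.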

Containers · 运维实录编辑部 (official) @ops_lab · 2026-05-06 12:17

ArgoCD: Turning K8s Deployments into GitOps (Git as the Single Source of Truth)

## 起因 K8s 部署演化： 1. `kubectl apply -f` 手动（不知道现在集群什么状态） 2. CI script `kubectl apply` 自动化（但 git 跟 cluster 不一致仍可能） 3. **GitOps**：git 是 source of truth，controller 自动同步到 cluster ArgoCD 是 GitOps controller。修改 yaml + push git → cluster 自动 apply。 diff / rollback / approval flow 全自动化。 ## 装 ```bash kubectl create namespace argocd kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml # UI kubectl port-forward svc/argocd-server -n argocd 8080:443 # https://localhost:8080 # 默认密码： kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d ``` ## 第一个 Application ```yaml # argocd-apps/web.yaml apiVersion: argoproj.io/v1alpha1 kind: Application metadata: name: web namespace: argocd spec: project: default source: repoURL: https://github.com/myorg/k8s-manifests targetRevision: main path: apps/web destination: server: https://kubernetes.default.svc namespace: web syncPolicy: automated: prune: true # git 删了 yaml → cluster 也删 selfHeal: true # 手动改 cluster → 自动 revert 回 git 状态 syncOptions: - CreateNamespace=true ``` ```bash kubectl apply -f argocd-apps/web.yaml ``` argocd 会： 1. clone git repo 2. apply `apps/web/*.yaml` 到 `web` namespace 3. 持续监听 git → 有 change pull + apply 4. 持续监听 cluster → drift 自动修复 ## UI ArgoCD UI 显示： - 所有 Application 健康状态 - 每 Application 的 K8s resource 图（service → deploy → pod） - 跟 git desired state 的 diff - sync 历史 + commit message ``` ┌─ Application: web ──────────────────────────────────┐ │ Status: Synced Healthy │ │ Last sync: 2 min ago (commit abc123) │ │ ┌─Service──┐ ┌─Deployment──┐ ┌─Pod ×3─┐ │ │ │ web │ → │ web (3/3) │ → │ ... 
│ │ │ └──────────┘ └─────────────┘ └────────┘ │ └─────────────────────────────────────────────────────┘ ``` ## kustomize / helm 支持 ```yaml # Application 用 helm spec: source: repoURL: https://github.com/myorg/charts path: charts/web helm: values: | replicaCount: 3 image: tag: v1.2.3 ``` 或者 kustomize： ```yaml spec: source: path: overlays/production kustomize: images: - myorg/web:v1.2.3 ``` ## ApplicationSet（多 cluster / 多 env） ```yaml apiVersion: argoproj.io/v1alpha1 kind: ApplicationSet metadata: name: web-multi-env spec: generators: - list: elements: - cluster: prod url: https://prod.k8s values: values-prod.yaml - cluster: staging url: https://staging.k8s values: values-staging.yaml template: metadata: name: 'web-{{cluster}}' spec: destination: server: '{{url}}' namespace: web source: repoURL: ... helm: valueFiles: - '{{values}}' ``` 一份 template → 给 prod / staging 生成两个 Application。 git 改 → 两个 env 同步。 ## 部署流程 ``` 1. dev 提 PR 改 image tag (v1.2.3 → v1.2.4) 2. PR review + merge to main 3. ArgoCD 检测 main 改动 (default 3 min polling 或 webhook 即时) 4. ArgoCD diff: deployment image change 5. ArgoCD apply: rolling update deployment 6. 
健康 check 通过 → Sync success ``` 整个过程 git 是源 + UI 可见。错了 git revert → ArgoCD 自动 revert cluster。 ## image automation (argocd-image-updater) 不想手改 image tag → image-updater 监听 registry，新 tag 自动 commit git： ```yaml metadata: annotations: argocd-image-updater.argoproj.io/image-list: web=myorg/web argocd-image-updater.argoproj.io/web.update-strategy: semver argocd-image-updater.argoproj.io/write-back-method: git ``` dev 推 image → image-updater 改 git → ArgoCD 同步 cluster。全流程自动。 ## sync wave ```yaml metadata: annotations: argoproj.io/sync-wave: "1" # 先 apply（如 namespace / CRD） ``` ```yaml metadata: annotations: argoproj.io/sync-wave: "2" # 后 apply（如 deployment） ``` 控制 apply 顺序：CRD 先 → operator 后 → custom resource 最后。 ## sync hook ```yaml metadata: annotations: argoproj.io/hook: PreSync # apply 之前跑 argoproj.io/hook-delete-policy: HookSucceeded spec: template: spec: containers: - name: migrate image: myorg/migrate command: ['./migrate.sh'] ``` PreSync = 部署前跑 db migrate。 PostSync = 部署后跑 smoke test。 SyncFail = 失败时通知。类似 helm hook。 ## 与 FluxCD 对比 | | ArgoCD | FluxCD | |---|---|---| | 哲学 | UI + CLI first | CLI + GitOps Toolkit | | UI | 强 | 弱（无官方 UI） | | 配置 | Application CRD | Kustomization / HelmRelease | | 多 cluster | ApplicationSet | 不擅长 | | 学习 | 简单 UI 上手 | CLI 思维 | | CNCF | graduated | graduated | ArgoCD 适合 visual ops + 多 cluster。 Flux 适合 git-pure / 自动化 first。我们用 ArgoCD（UI 让 dev 也能看部署状态，减少 ops 沟通）。 ## 安全 / RBAC ```yaml # AppProject 限制 apiVersion: argoproj.io/v1alpha1 kind: AppProject metadata: name: dev-team spec: sourceRepos: - https://github.com/myorg/dev-manifests.git destinations: - namespace: 'dev-*' server: '*' clusterResourceWhitelist: - group: '' kind: Namespace namespaceResourceWhitelist: - group: 'apps' kind: Deployment ``` dev team 只能从特定 git repo 部署到 dev-* namespace，不能改 cluster-wide 资源。 ## SSO 集成 ArgoCD 接 OIDC（Okta / Google / GitHub）→ 用 SSO 登录 UI + CLI。 ```yaml # argocd-cm ConfigMap data: oidc.config: | name: GitHub issuer: https://github.com/login/oauth clientID: ... 
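    clientSecret: ...
```

## Real deployment case

We run prod + staging + several dev clusters:

- 1 manifest repo (`gitops/`)
- One ApplicationSet per env
- Each app lives in the manifest repo under `apps/<name>/{base,overlays/{prod,staging}}/`
- image-updater promotes to staging automatically (main branch tag)
- prod promotion is manual (PR + approval)

For ops, changing something = a PR that edits yaml. Devs watch the ArgoCD UI to follow deployment progress. Devs do not need to learn kubectl.

## Drawbacks

- The Application CRD adds a learning curve
- ArgoCD itself has to be operated (HA setup)
- Bugs: automatic sync occasionally hits a race condition

## Pitfalls we hit

1. **selfHeal blocks manual cluster changes**: a dev firefights with `kubectl edit` → ArgoCD reverts it within 5 seconds. Temporarily disable selfHeal, or make the change in git.
2. **Diff noise**: the cluster adds annotations on its own (such as `deployment.kubernetes.io/revision`) → ArgoCD sees a diff. Filter it with `ignoreDifferences`.
3. **CRD ordering**: install the operator first, then apply the custom resource. Control the order with sync-wave.
4. **Large manifest repo is slow**: with thousands of yaml files ArgoCD slows down. Split into multiple repos or multiple Applications.
5. **Keep secrets out of git**: use sealed-secret or the external-secret operator, so git stores only the encrypted version.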
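The selfHeal and diff-noise pitfalls both come down to comparing the desired state from git with the live state in the cluster. The sketch below shows that comparison conceptually, with an ignore list playing the role of `ignoreDifferences`; it is a simplified illustration on plain dicts, not ArgoCD's diff engine, and `diff_paths` is a hypothetical name.

```python
def diff_paths(desired, live, ignore=(), prefix=""):
    """Return the slash-separated paths at which two nested dicts differ."""
    paths = []
    for key in sorted(set(desired) | set(live)):
        path = f"{prefix}/{key}"
        if path in ignore:
            continue  # a field the controller has been told not to compare
        want, have = desired.get(key), live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            paths.extend(diff_paths(want, have, ignore, path))
        elif want != have:
            paths.append(path)
    return paths


desired = {"metadata": {"name": "web"}, "spec": {"replicas": 3}}
live = {
    "metadata": {
        "name": "web",
        # Added by the cluster itself, never present in git.
        "annotations": {"deployment.kubernetes.io/revision": "7"},
    },
    "spec": {"replicas": 5},  # somebody ran kubectl edit
}

print(diff_paths(desired, live))
# ['/metadata/annotations', '/spec/replicas']
print(diff_paths(desired, live, ignore=("/metadata/annotations",)))
# ['/spec/replicas']
```

With the annotation path ignored, only the genuine drift on `replicas` remains, which is the change selfHeal would revert.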

Kubernetes · 运维实录编辑部 (official) @ops_lab · 2026-05-04 09:09

PromQL Recording Rules: Precompute and Cache Expensive Queries

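## Background

A Grafana dashboard has 30+ panels, each with one PromQL query. Opening the dashboard once makes Prometheus run 30 queries → CPU spikes → slow.

Some queries are complex:

```promql
histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
```

Every run scans millions of samples before computing the percentile. Slow and repetitive.

**Recording rules**: periodically precompute the expensive query → store the result as a new metric → the dashboard queries that new metric → fast.

## Configuring a recording rule

`/etc/prometheus/rules/api.yml`:

```yaml
groups:
  - name: api_recording
    interval: 30s
    rules:
      - record: api:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
      - record: api:http_requests:rate5m
        expr: |
          sum by (service, status) (rate(http_requests_total[5m]))
```

```yaml
# prometheus.yml
rule_files:
  - 'rules/*.yml'
```

Reload Prometheus → the rule is evaluated every 30 seconds → the result is stored as the metric `api:http_request_duration_seconds:p95`.

Change the dashboard query to:

```promql
api:http_request_duration_seconds:p95
```

It returns immediately (already computed).

## Naming convention

```
<level>:<metric>:<aggregation>
```

Examples:

- `api:http_requests:rate5m`
- `node:cpu:usage_pct`
- `cluster:pod_count`

This avoids clashing with the raw metric and makes a recording rule recognizable at a glance.

## When to use a recording rule

Good fit:

- Complex queries (histogram_quantile / multiple joins)
- Queried frequently by dashboards
- The same query used several times by alerts
- Cross-metric computation (A + B / C and so on)

Poor fit:

- Simple queries (ratio / count)
- One-off ad hoc queries
- High cardinality (storage explodes once it is recorded)

## Alerting rule

The syntax is similar:

```yaml
- alert: HighErrorRate
  expr: api:http_requests:rate5m{status=~"5.."} > 10
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "High 5xx rate on {{ $labels.service }}"
```

`expr` uses the recording rule result → alert evaluation is fast too.

## Performance numbers

One of our clusters has 100 nodes / 50 services. Opening the dashboard (30 panels):

| | latency | Prometheus CPU |
|---|---|---|
| No recording rules | 8s | 80% |
| All recording rules | 0.5s | 20% |

Dashboard data may lag slightly (recording interval 30s). The trade-off is freshness vs performance.

## external_labels

```yaml
# prometheus.yml
global:
  external_labels:
    cluster: prod
    region: us-east
```

External labels are attached when series leave this server (federation, remote write, alerts), so recording rule results carry the cluster / region labels in a multi-cluster federation.

## Federation (multi-cluster aggregation)

Multiple Prometheus servers scrape one another:

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true   # keep the labels of the source servers
    metrics_path: '/federate'
    params:
      match[]:
        - '{__name__=~"(api|node|cluster):.+"}'   # only pull recording rules
    static_configs:
      - targets:
          - 'prom-us-east:9090'
          - 'prom-eu-west:9090'
```

The central Prometheus pulls the recording rules of every region → global dashboard. Pulling only recording rules (not raw metrics) keeps the data volume small and standardized. The `match[]` selector has to fit the rule naming scheme; here it matches the `api:` / `node:` / `cluster:` prefixes used above.

## Compared with Mimir / VictoriaMetrics

Classic single-node Prometheus:

- Data is stored locally, with hundreds of millions of samples in memory
- Recording rules speed things up, but the single node remains the bottleneck

Grafana Mimir / VictoriaMetrics are Prometheus-compatible distributed TSDBs:

- Multi-node storage and querying
- Long-term retention (years)
- Built-in recording rules

They are a must at large scale. At small to medium scale, single-node Prometheus + recording rules is enough.

## Real case: dashboard optimization

A customer's SRE dashboard:

- 10 services × 4 SLIs (latency p50/p95/p99 + error rate) = 40 panels
- Took 15 seconds to open
- Periodic Prometheus CPU spikes

Optimization:

1. Create recording rules that compute each SLI metric
2. Switch the dashboard queries to the recording rules
3. Align the recording rule interval with the dashboard auto-refresh (30s)

Results:

- Dashboard opens in 1 second
- Average Prometheus CPU down 50%
- Alert evaluation sped up as well

## Compared with Cortex / Thanos

| | self-host prom | Mimir | Thanos | VictoriaMetrics |
|---|---|---|---|---|
| Deployment | simple | complex | complex | medium |
| Long-term storage | weak | S3 | S3 | local/S3 |
| Multi-cluster | federation | native | native | native |
| Performance | single node | horizontal scaling | horizontal scaling | high |

Small to medium projects: self-hosted Prometheus + recording rules + remote write for backup. Beyond several hundred million series → Mimir / VictoriaMetrics.

## Subquery (Prometheus 2.7+)

```promql
max_over_time(rate(http_requests_total[5m])[1h:1m])
```

`[1h:1m]` evaluates the inner `rate(...)` at 1m resolution over the last 1h, and `max_over_time` then takes the maximum of those points. Complex but powerful. Precomputing the inner expression with a recording rule is friendlier.

## Debugging missing data

```promql
# look at the raw metric
http_request_duration_seconds_bucket

# look at the recording rule result
api:http_request_duration_seconds:p95
```

Query both directly in Grafana Explore.

`/api/v1/rules` shows rule status:

```bash
curl http://prom:9090/api/v1/rules
```

`health: ok / err / unknown` shows whether a rule is running.

## Pitfalls we hit

1. **Rule cycle**: rule A depends on rule B, which depends on rule A → stale or empty data. Keep strict layers: raw → level 1 → level 2, one direction only.
2. **Label cardinality explosion**: a recording rule that keeps a high-cardinality label → the new metric has millions of series → TSDB OOM. Merge them with `sum without`.
3. **Interval too short**: a rule with a 1s interval runs more often than the raw scrape → it adds load instead. 30s to 1m is reasonable.
4. **Alerts on stale data**: slow recording rules → alerts use old values → a real spike is missed. Monitor `prometheus_rule_evaluation_duration_seconds`.
5. **Federation loses the source labels**: without `honor_labels: true`, the central Prometheus overwrites conflicting labels such as `job` and `instance` with its own scrape labels → with multiple regions you cannot tell where a series came from. Set `honor_labels: true`.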
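The strict layering rule (raw → level 1 → level 2, one direction only) can be checked before rules are deployed. A minimal sketch, assuming the rules are available as a name → expression mapping; the tokenizer is a plain regex rather than a PromQL parser, and `rule_dependencies` / `has_cycle` are hypothetical helper names.

```python
import re

# Metric-name-like tokens; recording rule names contain colons.
NAME_TOKEN = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def rule_dependencies(rules):
    """Map each recording rule to the other recording rules its expression mentions."""
    return {
        name: sorted(token for token in set(NAME_TOKEN.findall(expr)) if token in rules)
        for name, expr in rules.items()
    }


def has_cycle(rules):
    """Depth-first search for a dependency cycle among recording rules."""
    deps = rule_dependencies(rules)
    state = {}  # rule name -> "visiting" or "done"

    def visit(name):
        if state.get(name) == "done":
            return False
        if state.get(name) == "visiting":
            return True  # reached a rule that is still on the current path
        state[name] = "visiting"
        if any(visit(dep) for dep in deps[name]):
            return True
        state[name] = "done"
        return False

    return any(visit(name) for name in deps)


layered = {
    "api:http_requests:rate5m": "sum by (service, status) (rate(http_requests_total[5m]))",
    "api:http_errors:ratio5m": (
        'sum by (service) (api:http_requests:rate5m{status=~"5.."})'
        " / sum by (service) (api:http_requests:rate5m)"
    ),
}
cyclic = {"a:x": "b:y + 1", "b:y": "a:x * 2"}

print(has_cycle(layered))  # False
print(has_cycle(cyclic))   # True
```

Running a check like this in the same PR pipeline that validates the rule files catches a cycle before it produces empty panels.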

Observability · 运维实录编辑部 (official) @ops_lab · 2026-05-04 00:20

Log Aggregation with Loki + Promtail (Lightweight, Same Ecosystem as Grafana)

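The ELK Stack is heavyweight (Java, plus Elasticsearch wanting several GB of RAM). For a small team, Loki is a better fit:

- Written in Go; a single-node instance runs in a few hundred MB of memory
- No full-text index; only labels are indexed (like Prometheus)
- Chunks are stored in object storage (S3) or on the local filesystem
- First-class Grafana support, with the same UI experience as Prometheus

## Architecture

```
nodes → promtail (agent) → Loki (central) → Grafana
```

## 1. Install Loki

```bash
# Dedicated system user plus data and config directories
sudo useradd -rs /bin/false loki
sudo mkdir -p /var/lib/loki /etc/loki
sudo chown -R loki:loki /var/lib/loki

# Download the latest release binary
curl -fsSL https://github.com/grafana/loki/releases/latest/download/loki-linux-amd64.zip \
  -o /tmp/loki.zip
sudo unzip /tmp/loki.zip -d /usr/local/bin
sudo mv /usr/local/bin/loki-linux-amd64 /usr/local/bin/loki
sudo chmod +x /usr/local/bin/loki
```

The simplest "all-in-one" config, `/etc/loki/config.yml`:

```yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h
```

The systemd unit, `/etc/systemd/system/loki.service`:

```ini
[Unit]
Description=Loki
After=network.target

[Service]
User=loki
ExecStart=/usr/local/bin/loki -config.file=/etc/loki/config.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable --now loki
curl localhost:3100/ready
```

## 2. Install promtail (on every machine you collect from)

```bash
curl -fsSL https://github.com/grafana/loki/releases/latest/download/promtail-linux-amd64.zip \
  -o /tmp/promtail.zip
sudo unzip /tmp/promtail.zip -d /usr/local/bin
sudo mv /usr/local/bin/promtail-linux-amd64 /usr/local/bin/promtail
sudo chmod +x /usr/local/bin/promtail
# Directories used by the config below
sudo mkdir -p /etc/promtail /var/lib/promtail
```

`/etc/promtail/config.yml`:

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://loki.example.com:3100/loki/api/v1/push

scrape_configs:
  # systemd journal
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
        host: ${HOSTNAME}
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit

  # nginx logs
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          host: ${HOSTNAME}
          __path__: /var/log/nginx/*.log

  # custom application logs
  - job_name: myapp
    static_configs:
      - targets: [localhost]
        labels:
          job: myapp
          host: ${HOSTNAME}
          __path__: /var/log/myapp/*.log
```

`${HOSTNAME}` is only expanded when promtail is started with `-config.expand-env=true`, and the variable has to be set in the service's environment.

The systemd unit mirrors the Loki one. Then start it:

```bash
sudo systemctl enable --now promtail
journalctl -u promtail -n 20
```

## 3. Add the data source in Grafana

```
Connections → Data sources → Loki
URL: http://loki.example.com:3100
```

## 4. LogQL query syntax

```
# All logs of one unit
{unit="nginx.service"}

# Multiple labels, then a line filter (any line containing "500")
{job="nginx", host="server1"} |= "500"

# Exclude lines
{job="nginx"} != "kube-probe"

# Regex match
{job="myapp"} |~ "ERROR|FATAL"

# Parse JSON logs, then filter on a field
{job="myapp"} | json | level="error" | line_format "{{.timestamp}} {{.msg}}"

# Error volume per second, per host
sum by (host) (rate({job="nginx"} |= "500" [5m]))

# Top N units by error count
topk(5, sum by (unit) (count_over_time({job="systemd-journal"} |= "ERROR" [1h])))
```

`|=`, `!=`, `|~` and `!~` filter lines by substring or regex; `| json` and `| logfmt` are parsers.

## 5. Dashboards

Write LogQL queries in Grafana Explore; once you are happy with one, save it with "Add to dashboard". For community dashboards, search "Loki" on grafana.com/dashboards; there are ready-made ones for nginx, docker and k8s.

## 6. Alerting (log-based alerting)

`rules/myapp.yml` (a Loki rule file):

```yaml
groups:
  - name: myapp-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({job="myapp"} |~ "ERROR|FATAL" [5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High myapp error rate'
```

The expression fires when more than 0.1 matching lines per second are logged for 5 minutes. Loki's built-in Ruler evaluates the rule and sends firing alerts to Alertmanager (the same setup as Prometheus). Two things to check if nothing fires: the ruler needs an Alertmanager URL configured, and with `auth_enabled: false` rule files are normally expected under a `fake` tenant subdirectory of the rules directory.

## 7. Retention / storage

```yaml
limits_config:
  retention_period: 30d  # global: 30 days
  retention_stream:
    - selector: '{job="audit"}'
      priority: 1
      period: 1y  # keep audit logs for 1 year
```

Retention is enforced by the compactor, and only when it runs with `retention_enabled: true`; without that, `retention_period` deletes nothing.

Underneath, data is stored as chunks (filesystem by default). For production, S3 or GCS is recommended:

```yaml
common:
  storage:
    s3:
      bucketnames: my-loki-logs
      region: us-east-1
```

When switching, also point `object_store` in `schema_config` at `s3` for the new period. Tune the retention periods to keep storage cost under control.

## 8. Structured JSON logs (recommended)

Have the application emit JSON logs directly:

```python
import structlog
from structlog.processors import JSONRenderer, add_log_level

# structlog pretty-prints by default; configure it to render JSON
structlog.configure(processors=[add_log_level, JSONRenderer()])
log = structlog.get_logger()
log.info('user signed up', user_id=42, plan='pro')
# emits something like (key order may differ):
# {"event": "user signed up", "user_id": 42, "plan": "pro", "level": "info"}
```

In Loki, `{job="myapp"} | json | user_id="42"` then filters on the field directly. Do not put high-cardinality values such as user_id into labels; filter at query time with `| json` instead.

## 9. Resource usage

Loki + Promtail on a 4-core / 8 GB machine is quoted as handling about 100k log lines/sec or 5 GB/day. The two figures are not equivalent (100k lines/sec sustained would be far more than 5 GB/day), so treat them as rough estimates. Compared with ES, expect roughly 5-10x lower resource usage.

## 10. Complementing Prometheus

- Prometheus = time-series metrics (numbers)
- Loki = logs (text)

Put both on one dashboard: request counts (Prometheus) on top, error logs (Loki) below, aligned on the time axis so you can read an incident across both.

## Pitfalls

- promtail cannot read `/var/log/...`: many log files are readable only by root or a log group. Run promtail as root, add its user to the group that owns the logs (`adm` on Debian/Ubuntu), or change the file permissions.
- Label cardinality: as with Prometheus, using user_id as a label will drag Loki down. High-cardinality information belongs in the log line (parsed with `| json`), not in labels.
- Time zones: Loki stores timestamps in UTC and the UI displays them in the browser's time zone. When the timestamps inside the log lines come in mixed formats, parse them with `pipeline_stages`.
- The "all-in-one" config is not suitable for multiple replicas or HA. Once the deployment grows, split it into distributor, ingester and querier components.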
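To make the label-versus-line distinction from the cardinality pitfall concrete, here is a minimal sketch of the JSON body that the `/loki/api/v1/push` endpoint (the one promtail's `clients.url` points at) accepts. The helper names `build_push_payload` and `push` are illustrative, not part of any Loki client library, and the host name and sample values are assumptions.

```python
import json
import time
import urllib.request


def build_push_payload(labels, lines, ts_ns=None):
    """Build a push body with one stream, identified by its label set."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Each value is [timestamp in nanoseconds as a string, log line].
    values = [[str(ts_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


def push(url, payload):
    """POST the payload to Loki (hypothetical helper; needs a running Loki)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


# Labels stay low-cardinality (job, host); user_id lives inside the line.
line = json.dumps({"event": "user signed up", "user_id": 42, "level": "info"})
payload = build_push_payload(
    {"job": "myapp", "host": "server1"},
    [line],
    ts_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

Every distinct label set creates a separate stream in Loki, which is why `user_id` is kept in the line and recovered at query time with `| json`.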

Distributed & Cloud Computing · 运维实录编辑部 (official) @ops_lab · 2026-05-03 13:33

Docker BuildKit + cache mount: CI builds cut from 8 minutes to 90 seconds

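## Background

Every CI build of our Node + Python microservice went through:

1. `apt install` of system dependencies: 1 minute
2. `npm ci` to install node_modules: 2.5 minutes
3. `pip install` of Python dependencies: 1.5 minutes
4. webpack compile: 2 minutes
5. image push: 1 minute

That is 8 minutes in total. Change one line of code, wait 8 minutes. At 30 builds a day, that is 4 hours wasted.

BuildKit is Docker's new builder (on by default). It supports cache mounts, secret mounts and parallel stage builds. Combined with layer caching on the CI side, repeated builds skip most steps.

## Solution

### 1. Enable BuildKit

```bash
# On by default in current Docker Desktop / Docker 23+
# If it is not enabled:
export DOCKER_BUILDKIT=1
docker build .

# Compose builds with BuildKit as well:
docker compose build --progress plain
```

### 2. Cache mounts for package managers

The old Dockerfile:

```dockerfile
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci          # downloads and installs everything from scratch
COPY . .
RUN npm run build
```

Whenever `package*.json` changes, or the CI runner starts without a layer cache, the `npm ci` layer is rebuilt and the whole dependency set is downloaded again.

With a cache mount:

```dockerfile
# syntax=docker/dockerfile:1.7
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm,sharing=locked \
    npm ci --prefer-offline
COPY . .
RUN --mount=type=cache,target=/root/.npm,sharing=locked \
    npm run build
```

`--mount=type=cache` creates a cache directory that persists across builds (it lives inside BuildKit and never ends up in the final image). Packages downloaded by npm are cached there and the next build hits them directly.

The first line, `# syntax=docker/dockerfile:1.7`, pins the Dockerfile frontend so that `--mount` is understood regardless of the Docker version doing the build; keep it.

Result: the first build takes 2.5 minutes; later npm installs take 5-10 seconds.

### 3. apt cache mount

```dockerfile
FROM ubuntu:24.04
# Keep downloaded .deb files so the cache mount has something to reuse
RUN rm -f /etc/apt/apt.conf.d/docker-clean \
    && echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' \
       > /etc/apt/apt.conf.d/keep-cache
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update \
    && apt-get install -y --no-install-recommends \
       build-essential libpq-dev
```

Do not `rm -rf /var/lib/apt/lists/*` in this step: the lists live in the cache mount, so they never reach the image, and deleting them forces a full index download on the next build.

The first `apt-get update` takes about 30 seconds; on later builds this step takes under 5 seconds.

### 4. pip / uv cache

```dockerfile
FROM python:3.12-slim
# uv is very fast
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
# The cache mount is on another filesystem, so copy instead of hardlinking
ENV UV_LINK_MODE=copy
WORKDIR /app
COPY pyproject.toml uv.lock ./
# Install dependencies only; this layer survives code changes
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-install-project
COPY . .
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen
CMD ["uv", "run", "gunicorn", "myapp:app"]
```

uv is fast on its own; with a cache mount, everything after the first build is a matter of seconds.

### 5. Multi-stage build: keep build dependencies out of the shipped image

```dockerfile
# === build stage ===
FROM node:20-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm \
    npm ci
COPY . .
RUN --mount=type=cache,target=/root/.npm \
    npm run build

# === runtime stage ===
FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html
EXPOSE 80
```

The final image contains only nginx and the static files, with no node_modules and no source code. It is small, safer, and starts quickly.

### 6. Share layer cache across CI jobs

GitHub Actions:

```yaml
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    tags: ghcr.io/myorg/myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
```

`type=gha` uses the cache backend built into GitHub Actions, so repeated builds hit the cache across jobs. This exports layer cache only: the contents of `--mount=type=cache` directories stay in the builder, so on ephemeral runners they start out empty.

GitLab CI:

```yaml
build:
  image: docker:cli
  services: [docker:dind]
  script:
    # Log in so that --push and the registry cache can write
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker buildx create --use
    - >-
      docker buildx build
      --cache-from type=registry,ref=$CI_REGISTRY_IMAGE/cache
      --cache-to type=registry,ref=$CI_REGISTRY_IMAGE/cache,mode=max
      --push
      -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
```

### 7. Secret mounts (keep secrets out of image layers)

```dockerfile
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci
```

```bash
docker build --secret id=npmrc,src=$HOME/.npmrc .
```

`.npmrc` is mounted during the build and is not left in any layer of the image afterwards. This is safer than ARG (ARG values are recorded in the image history and visible through `docker history`).

### 8. Parallel stages

```dockerfile
# base, runtime, install_deps and compile_assets are placeholders
FROM base AS deps
RUN install_deps

FROM base AS assets
RUN compile_assets

FROM runtime
COPY --from=deps /opt/deps /opt/deps
COPY --from=assets /opt/assets /opt/assets
```

BuildKit automatically runs stages that do not depend on each other in parallel (`deps` and `assets` run at the same time).

## Results

Our build after the optimizations above:

| Step | Before | After |
|---|---|---|
| apt install | 60s | 5s |
| npm ci | 150s | 8s |
| pip install | 90s | 5s |
| webpack | 120s | 60s (changed application code still has to be built) |
| push | 60s | 10s (only changed layers are pushed) |
| **Total** | **8m** | **~90s** |

CI iteration is about 5x faster, and the developer experience is noticeably better.

## Debugging BuildKit

```bash
# Verbose output
docker build --progress plain -t myapp .

# Inspect the build cache
docker buildx du

# Clear the cache (if it is corrupted or too large)
docker buildx prune
docker buildx prune --all
```

## Pitfalls

1. **Missing `# syntax=...` comment**: on older Docker versions, or when BuildKit is disabled, cache mounts and other newer syntax are not recognized and the build fails with "unknown flag --mount". Always put it on the first line.
2. **`sharing=locked` matters**: several builds writing to the same cache concurrently can race and corrupt it. `locked` makes BuildKit serialize access.
3. **CI cache evicted for being too large**: the GitHub Actions cache is capped at 10 GB per repo, and `mode=max`, which caches every layer, easily exceeds it. Switch to `mode=min`, which caches only the layers of the final image, or clean up regularly.
4. **Slow multi-arch builds**: with `--platform linux/amd64,linux/arm64`, QEMU emulation of the other architecture is very slow. Use native runners instead (GitHub offers arm64 runners, or self-host).
5. **Wrong layer order**: putting the frequently changing `COPY . .` before dependency installation invalidates the dependency layers on every code change. Always COPY the package files and install dependencies first, then COPY the code.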
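The totals and the "about 5x" figure in the results table can be re-derived from the per-step numbers. The sketch below only does that arithmetic; the per-step timings and the 30 builds per day are the post's own figures, while the variable names are illustrative.

```python
# Per-step timings in seconds, copied from the results table.
before = {"apt install": 60, "npm ci": 150, "pip install": 90, "webpack": 120, "push": 60}
after = {"apt install": 5, "npm ci": 8, "pip install": 5, "webpack": 60, "push": 10}

total_before = sum(before.values())
total_after = sum(after.values())
speedup = total_before / total_after

builds_per_day = 30  # figure quoted in the post
hours_saved = builds_per_day * (total_before - total_after) / 3600

print(f"{total_before}s -> {total_after}s, {speedup:.1f}x faster, "
      f"{hours_saved:.1f} h/day saved")
# prints: 480s -> 88s, 5.5x faster, 3.3 h/day saved
```

webpack now accounts for 60 of the remaining 88 seconds, so further gains would have to come from the compile step rather than from dependency caching.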

Containers · 运维实录编辑部 (official) @ops_lab · 2026-05-01 19:03

Running an isolated dev environment in an LXC container (closer to a VM than Docker)

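Docker fits the "one process + image distribution" model. LXC (a system container) fits "I want a full Linux, but not another VM": it shares the host kernel yet provides a complete init, cron, ssh and multiple processes, like a lightweight VM.

Good use cases: isolating local dev environments (one LXC per project, without polluting the host), running old Ubuntu releases to try legacy software, and CI isolation.

## Install LXD (the modern wrapper around LXC)

```bash
sudo snap install lxd
sudo lxd init
# Press Enter to accept the defaults; this walkthrough assumes
# dir storage, a new lxdbr0 bridge, and no clustering
sudo usermod -aG lxd $USER
# Log in again for the group change to take effect
```

## Launch a container

```bash
# List available images
lxc image list images: ubuntu/22.04 | head

# Launch
lxc launch images:ubuntu/22.04 devbox
lxc list
# +--------+---------+--------------------+------+-----------+-----------+
# | NAME   | STATE   | IPV4               | IPV6 | TYPE      | SNAPSHOTS |
# +--------+---------+--------------------+------+-----------+-----------+
# | devbox | RUNNING | 10.10.20.42 (eth0) |      | CONTAINER | 0         |
# +--------+---------+--------------------+------+-----------+-----------+
```

## Get a shell / run commands

```bash
# Interactive shell
lxc exec devbox -- bash

# One-off commands
lxc exec devbox -- apt update
lxc exec devbox -- apt install -y python3-pip git

# Enter as a regular user
lxc exec devbox -- su - ubuntu
```

## Mount a host directory

```bash
# Mount host /home/yourname/code/myproj at /opt/myproj in the container
lxc config device add devbox myproj disk \
  source=/home/yourname/code/myproj \
  path=/opt/myproj
```

An ID-mapping pitfall: LXD containers are unprivileged by default, so container UIDs are shifted on the host. Files owned by host `1000:1000` therefore usually show up as `nobody:nogroup` inside the container unless the mount is shifted or mapped, for example by adding `shift=true` to the disk device (requires kernel support) or by setting `raw.idmap`. Host filesystems such as NFS or overlay add their own quirks; when permissions look wrong, start with `lxc config show devbox`.

## Port mapping / reverse proxy

The container has an IP, but it sits behind NAT on the host bridge. To expose the container's port 8080 on the host:

```bash
lxc config device add devbox webproxy proxy \
  listen=tcp:0.0.0.0:18080 \
  connect=tcp:127.0.0.1:8080
```

After that, `curl http://<host-IP>:18080/` is forwarded to port 8080 inside the container.

## Snapshots / restore

```bash
lxc snapshot devbox before-upgrade
lxc restore devbox before-upgrade
lxc info devbox | grep -A 5 Snapshots
```

This is ideal for "I am about to install something that might wreck the environment, and I want a way back".

## Resource limits

```bash
lxc config set devbox limits.cpu 2
lxc config set devbox limits.memory 2GB
lxc config set devbox limits.memory.swap false
lxc restart devbox
```

## Templating

```bash
# Use devbox as a template
lxc snapshot devbox base
lxc publish devbox/base --alias my-dev-base

# Launch a fresh copy whenever you need one
lxc launch my-dev-base devbox2
```

## Autostart

```bash
lxc config set devbox boot.autostart true
lxc config set devbox boot.autostart.priority 10
```

## Comparison with Docker

| Dimension | LXC (system container) | Docker (app container) |
|---|---|---|
| Processes | Multiple, with an init | Single process by default (PID 1) |
| Filesystem | Full distro | Minimal image |
| Startup time | 1-2 seconds | < 1 second |
| Lifecycle | Long (like a VM) | Short (each start is a new instance) |
| App distribution | Inconvenient | Mainstream (image registry) |
| Orchestration ecosystem | LXD cluster | Kubernetes, Swarm, Compose |

## Pitfalls

- The `dir` storage backend (a plain directory) is slow and cannot clone snapshots efficiently. For production or heavy use, pick `btrfs` or `zfs` during `sudo lxd init`.
- Commands such as dmesg or mount may lack permission inside the container. That is expected behavior for an unprivileged container, not a bug. If you really need them, `lxc config set devbox security.privileged true`, but container root is then equivalent to host root.
- LXD vs LXC: `apt install lxc` installs the low-level tools; installing LXD with `snap install lxd` gives you the more modern API. The two share the same foundation, but their commands are completely different (confusingly, the LXD client binary is also called `lxc`).
- A snap refresh of LXD restarts the daemon and can disrupt running containers. If workloads run on LXD, keep upgrades out of your busy windows.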
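The snapshot / restore workflow above ("try something risky, roll back if it fails") is easy to script. The following is a minimal sketch, assuming the `lxc` CLI is on the PATH; `try_with_snapshot` and its `run` parameter are illustrative names, and `run` is injectable only so the logic can be exercised without a real container.

```python
import subprocess
from typing import Callable, Sequence


def try_with_snapshot(
    container: str,
    command: Sequence[str],
    snapshot: str = "before-change",
    run: Callable = subprocess.run,
) -> bool:
    """Snapshot the container, run `command` in it, restore on failure."""
    # Take the snapshot first; raise if that fails (check=True).
    run(["lxc", "snapshot", container, snapshot], check=True)

    # Run the risky command inside the container.
    result = run(["lxc", "exec", container, "--", *command], check=False)
    if result.returncode == 0:
        return True

    # The command failed: roll back to the pre-change state.
    run(["lxc", "restore", container, snapshot], check=True)
    return False


if __name__ == "__main__":
    # Hypothetical usage against the devbox container from this post.
    ok = try_with_snapshot("devbox", ["apt", "install", "-y", "some-package"])
    print("kept changes" if ok else "rolled back")
```

The snapshot is left in place after a successful run, so a later manual `lxc restore` is still possible.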

Containers · 运维实录编辑部 (official) @ops_lab · 2026-05-01 11:56
