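Any long-running service should be monitored. Prometheus + Grafana is the de facto standard:
Prometheus scrapes the metrics, Grafana draws the graphs, and both are free and self-hosted. Let's install them.
1. Install node_exporter (on each monitored machine)
# On every machine you want to monitor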
curl -fsSL https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz \
| sudo tar xz -C /usr/local/bin --strip-components=1 \
node_exporter-1.8.2.linux-amd64/node_exporter
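sudo useradd -rs /bin/false node_exporter
# The textfile collector directory referenced in the unit below must exist
sudo mkdir -p /var/lib/node_exporter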
systemd unit /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus node exporter
After=network.target
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--collector.textfile.directory=/var/lib/node_exporter
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
[Install]
WantedBy=multi-user.target
sudo systemctl enable --now node_exporter
curl localhost:9100/metrics | head
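The unit above passes `--collector.textfile.directory=/var/lib/node_exporter`, which makes node_exporter expose any `*.prom` files it finds there. As a minimal sketch of how a cron job might publish its own metric that way (the function name `write_textfile_metric` and the metric name in the example are hypothetical, not part of node_exporter):

```shell
# Write one gauge in the Prometheus text format to DIR/NAME.prom.
# The data goes to a temp file first and is then renamed, so node_exporter
# never reads a half-written file. The temp name does not end in .prom,
# so the collector ignores it.
write_textfile_metric() {
    dir=$1 name=$2 value=$3
    tmp="$dir/.$name.prom.$$"
    printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$value" > "$tmp" &&
        mv "$tmp" "$dir/$name.prom"
}

# Example (hypothetical metric name, directory from the unit file):
# write_textfile_metric /var/lib/node_exporter backup_last_success_timestamp_seconds "$(date +%s)"
```

The user running the job needs write access to the directory; node_exporter itself only reads it.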
2. Install Prometheus (on the central machine)
curl -fsSL https://github.com/prometheus/prometheus/releases/download/v2.55.1/prometheus-2.55.1.linux-amd64.tar.gz \
| sudo tar xz -C /opt/
sudo ln -s /opt/prometheus-2.55.1.linux-amd64 /opt/prometheus
sudo useradd -rs /bin/false prometheus
sudo mkdir -p /var/lib/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus /opt/prometheus*
Configure /opt/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'prod'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1.example.com:9100'
          - 'server2.example.com:9100'
          - 'server3.example.com:9100'
        labels:
          environment: prod
systemd /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
--config.file=/opt/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=30d \
--web.listen-address=:9090
Restart=on-failure
[Install]
WantedBy=multi-user.target
sudo systemctl enable --now prometheus
# Browser: http://<central-host>:9090
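Besides the browser, the targets endpoint of the Prometheus HTTP API (`/api/v1/targets`) reports the health of each scrape target. A rough sketch for counting failed targets from the command line follows; the helper name `count_down_targets` is made up for illustration, and it assumes the API returns compact JSON in which `"health":"down"` appears once per failing target:

```shell
# Count targets whose last scrape failed, given the JSON body of
# /api/v1/targets on stdin. Uses grep instead of jq to avoid a dependency,
# so it depends on the compact "health":"down" spelling (an assumption).
count_down_targets() {
    grep -o '"health":"down"' | wc -l | tr -d ' '
}

# Example (assumes Prometheus listens on localhost:9090 as configured above):
# curl -s http://localhost:9090/api/v1/targets | count_down_targets
```

A result of 0 right after the first scrape interval suggests every configured node_exporter is reachable.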
3. Install Grafana
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | \
sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | \
sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana
sudo systemctl enable --now grafana-server
# Browser: http://<central-host>:3000 (default login admin/admin; change it on first login)
4. Add the data source in Grafana
In the UI:
Connections → Data sources → Add data source → Prometheus
URL: http://localhost:9090
Save & test
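5. Import a node_exporter dashboard
The Grafana community already has a ready-made one:
Dashboards → New → Import → ID: 1860 (Node Exporter Full)
You immediately get full visualizations of CPU, RAM, disk, network, I/O, processes, and more.
6. A few lifesaving PromQL queries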
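# CPU usage in percent (5-minute average)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage in percent
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
# Disk usage in percent (root partition)
100 - (node_filesystem_avail_bytes{mountpoint="/"} * 100 /
  node_filesystem_size_bytes{mountpoint="/"})
# 5xx requests per second over the last 5 minutes (only if you also scrape nginx
# metrics from an exporter that exposes a status label); divide by the same sum
# without the status matcher to get an error ratio
sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) by (instance)
# Top 5 machines by memory usage (append "> 90" to keep only those above 90%)
topk(5, 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))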
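To see what the CPU query computes, it helps to do the arithmetic by hand for one core. The sketch below takes two readings of the idle counter and the time between them. It is a simplification: the real `rate()` also handles counter resets and extrapolates to the window edges, and the query averages over all cores. The function name `cpu_usage_percent` is invented for this example.

```shell
# cpu_usage_percent IDLE_START IDLE_END WINDOW_SECONDS
# IDLE_* are two readings of node_cpu_seconds_total{mode="idle"} for one core.
cpu_usage_percent() {
    awk -v a="$1" -v b="$2" -v w="$3" \
        'BEGIN { idle = (b - a) / w; printf "%.1f\n", 100 - idle * 100 }'
}

# The core was idle for 240 of the 300 seconds in the window.
cpu_usage_percent 1000 1240 300   # prints 20.0
```

The idle counter grows by one per second of idleness, so its per-second rate is the idle fraction, and 100 minus that fraction times 100 is the busy percentage.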
7. Alerting
Add to prometheus.yml:
rule_files:
  - /opt/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
/opt/prometheus/rules/node.yml:
groups:
  - name: node
    rules:
      - alert: HighCPU
        expr: 100 - avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'CPU > 80% on {{ $labels.instance }} for 10m'
      - alert: DiskAlmostFull
        expr: 100 - node_filesystem_avail_bytes{mountpoint="/"} * 100 / node_filesystem_size_bytes{mountpoint="/"} > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Disk > 85% on {{ $labels.instance }}'
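A YAML indentation slip in either file keeps Prometheus from loading it, so it may be worth validating both before restarting. The Prometheus tarball ships `promtool`, which has `check config` and `check rules` subcommands. The wrapper below is only a sketch: the function name `check_prometheus_files` is hypothetical, and its default paths assume the layout used in this article.

```shell
# check_prometheus_files [PROMTOOL] [CONFIG] [RULES_FILE]
# Defaults follow the paths used in this article.
check_prometheus_files() {
    promtool=${1:-/opt/prometheus/promtool}
    config=${2:-/opt/prometheus/prometheus.yml}
    rules=${3:-/opt/prometheus/rules/node.yml}
    if [ ! -x "$promtool" ]; then
        echo "promtool not found at $promtool" >&2
        return 127
    fi
    # Both checks must pass for the function to succeed.
    "$promtool" check config "$config" && "$promtool" check rules "$rules"
}

# Example:
# check_prometheus_files && sudo systemctl restart prometheus
```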
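Install Alertmanager to route alerts to Slack, email, or DingTalk (DingTalk through a webhook receiver):
curl -fsSL https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz \
| sudo tar xz -C /opt/
sudo ln -s /opt/alertmanager-0.27.0.linux-amd64 /opt/alertmanager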
/opt/alertmanager/alertmanager.yml:
route:
  receiver: slack
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        # Double quotes so that YAML turns \n into a real line break
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
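To confirm the route reaches Slack without waiting for a real incident, one option is to post a synthetic alert to Alertmanager's v2 API (`POST /api/v2/alerts`). The sketch below only builds the JSON payload; the helper name `build_test_alert` and the alert name in the example are placeholders.

```shell
# Print a one-element JSON alert list for Alertmanager's /api/v2/alerts.
# ALERTNAME and SUMMARY must not contain double quotes or backslashes:
# this sketch does no JSON escaping.
build_test_alert() {
    printf '[{"labels":{"alertname":"%s","severity":"warning"},"annotations":{"summary":"%s"}}]\n' "$1" "$2"
}

# Example (assumes Alertmanager on localhost:9093, as in prometheus.yml):
# build_test_alert TestAlert "test from curl" |
#   curl -s -X POST -H 'Content-Type: application/json' \
#        --data @- http://localhost:9093/api/v2/alerts
```

With the `group_wait: 30s` setting above, the message should show up in the channel roughly half a minute after the POST.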
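8. Firewall
# On the Prometheus host
sudo ufw allow from <Grafana_IP> to any port 9090
# On each node_exporter host
sudo ufw allow from <Prometheus_IP> to any port 9100
Alternatively, run Prometheus and Grafana on the same machine (common for a small setup that monitors itself)
and open only Grafana, behind a reverse proxy on 80 / 443, to the outside.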
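9. Security: never expose 9100 directly to the public internet
node_exporter metrics reveal host details (kernel version, mount points, network interfaces, and systemd unit names when that collector is enabled). Always do at least one of:
- restrict source IPs with the internal firewall
- put nginx in front with basic auth
- or start node_exporter with --web.config.file pointing at a web configuration file that enables TLS with client certificates (mTLS)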
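For the mTLS option, node_exporter reads its TLS settings from a web configuration file passed with `--web.config.file`. The sketch below writes such a file with a `tls_server_config` section that requires client certificates. The function name `write_web_config` and every path in the example are assumptions; the certificates have to be issued separately.

```shell
# write_web_config OUT CERT KEY CLIENT_CA
# Writes a node_exporter web configuration that requires and verifies
# a client certificate signed by CLIENT_CA.
write_web_config() {
    cat > "$1" <<EOF
tls_server_config:
  cert_file: $2
  key_file: $3
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: $4
EOF
}

# Example (hypothetical paths):
# write_web_config /etc/node_exporter/web.yml /etc/node_exporter/node.crt \
#     /etc/node_exporter/node.key /etc/node_exporter/ca.crt
# Then add --web.config.file=/etc/node_exporter/web.yml to ExecStart.
```

On the Prometheus side, the `node` job would then need `scheme: https` and a `tls_config` with a matching client certificate and key.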
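10. Storage retention
--storage.tsdb.retention.time=30d # default is 15d
--storage.tsdb.retention.size=50GB # if both are set, whichever limit is hit first applies
For long-term storage use Thanos / VictoriaMetrics / Cortex. On a single machine, 30 days of metrics
typically take a few GB to a few tens of GB.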
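The Prometheus storage documentation gives a rough capacity formula: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. A small sketch of that arithmetic follows; the function name `estimate_tsdb_bytes` and the series count in the example are assumptions, so plug in your own numbers.

```shell
# estimate_tsdb_bytes SERIES SCRAPE_INTERVAL_S RETENTION_DAYS BYTES_PER_SAMPLE
# samples per second = SERIES / SCRAPE_INTERVAL_S
estimate_tsdb_bytes() {
    awk -v s="$1" -v i="$2" -v d="$3" -v b="$4" \
        'BEGIN { printf "%.0f\n", (s / i) * d * 86400 * b }'
}

# 3 hosts x 1000 series each (assumed), 15s interval, 30 days, 2 bytes/sample
estimate_tsdb_bytes 3000 15 30 2   # prints 1036800000 (about 1 GB)
```

Doubling the number of series or halving the scrape interval doubles the estimate, which is why both knobs matter for sizing.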
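Pitfalls
- Prometheus UI shows a target as "down": run curl host:9100/metrics from the Prometheus host to tell whether the exporter is not running or the network is blocked; often the firewall has not opened 9100 to Prometheus.
- Time zones: Prometheus uses UTC internally; Grafana displays in the browser's time zone. If the mismatch causes confusion, pin the dashboard time zone in Grafana.
- scrape_interval too short: scraping every 1s makes Prometheus data balloon and adds network traffic. 15s is a solid default; 30s / 1m is cheaper when you have many nodes.
- Runaway metric cardinality: a label that holds unbounded values such as user_id or request_id makes the number of time series explode until Prometheus runs out of memory. This is one of the most common root causes of Prometheus outages.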
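One way to spot a cardinality problem early is to count how many series each metric name contributes on a single target's /metrics page. The sketch below does that with standard text tools; the function name `series_per_metric` is invented for this example, and it assumes the plain Prometheus text exposition format.

```shell
# Read Prometheus text-format metrics on stdin and print
# "<series count> <metric name>", highest count first.
series_per_metric() {
    grep -v '^#' | grep -v '^$' |   # drop HELP/TYPE comments and blank lines
        sed 's/[{ ].*//' |          # keep only the metric name
        sort | uniq -c |
        sort -k1,1nr -k2 |
        awk '{ print $1, $2 }'
}

# Example:
# curl -s localhost:9100/metrics | series_per_metric | head
```

A metric whose count keeps growing between runs is a likely candidate for an unbounded label.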