systemd WatchdogSec: automatically restarting hung processes (not just crashed ones)

Background

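When a service crashes, systemd can restart it automatically with
Restart=on-failure. What systemd cannot see on its own is a process that
has not crashed but has hung and stopped responding: the TCP connections
are still open and the process is still alive, it just no longer handles
requests.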

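WatchdogSec has the service send systemd a periodic "I'm still alive"
signal. If the signal stops arriving before the timeout, systemd treats
the service as dead and, given a suitable Restart= setting, restarts it.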

How it works

service process
    ↓ every N seconds: sd_notify("WATCHDOG=1")
systemd watchdog timer
    ↓ ping received → reset the timer
    ↓ no ping within WatchdogSec → SIGABRT, then restart per Restart=

Time budget: with WatchdogSec=30s, the service must "feed the watchdog"
at least once every 30 seconds.
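
To make the protocol concrete, here is a minimal sketch of a single ping, with the script itself standing in for systemd by binding a datagram socket. The helper names sd_notify and demo are made up for this illustration; under a real unit, systemd creates the socket and passes its path in NOTIFY_SOCKET.

```python
import os
import socket
import tempfile

def sd_notify(message: str, environ=os.environ) -> bool:
    """Send one datagram to the socket named by NOTIFY_SOCKET; False if unset."""
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # '@' marks an abstract-namespace socket
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True

def demo() -> bytes:
    """Play the role of systemd: bind a socket, then receive one ping."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "notify.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as receiver:
            receiver.bind(path)
            receiver.settimeout(1.0)
            sd_notify("WATCHDOG=1", {"NOTIFY_SOCKET": path})
            return receiver.recv(1024)
```

The whole message is the plain text WATCHDOG=1 in one datagram; there is no reply, so the sender never learns whether systemd accepted it.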

Configuring the systemd unit

/etc/systemd/system/myapp.service:

[Service]
Type=notify
WatchdogSec=30
NotifyAccess=main
ExecStart=/usr/local/bin/myapp
Restart=on-watchdog
RestartSec=5

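Type=notify: systemd waits for the service to send READY=1 once it has finished starting.
WatchdogSec=30: if no watchdog ping arrives within 30 seconds, the service is treated as failed and killed.
Restart=on-watchdog: restart only after a watchdog timeout. Restart=on-failure covers watchdog timeouts as well as crashes; the default, Restart=no, never restarts.
NotifyAccess=main: only the main process may send notifications.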

Application code: sending WATCHDOG=1

Python

import os
import socket
import threading
import time

def notify(message):
    """Send a message to the systemd notify socket."""
    sock_path = os.environ.get('NOTIFY_SOCKET')
    if not sock_path:
        return  # not running under systemd
    s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    if sock_path.startswith('@'):
        # '@' marks an abstract-namespace socket
        sock_path = '\0' + sock_path[1:]
    try:
        s.sendto(message.encode(), sock_path)
    finally:
        s.close()

def watchdog_loop():
    # WATCHDOG_USEC is in microseconds; ping at half the timeout.
    # Float division avoids a zero interval when WatchdogSec is below 2s.
    interval = int(os.environ.get('WATCHDOG_USEC', '30000000')) / 2_000_000
    while True:
        # Check that the application is really alive, not just that this loop runs
        if app_is_healthy():
            notify('WATCHDOG=1')
        time.sleep(interval)

# At startup
notify('READY=1')

# Feed the watchdog in the background
threading.Thread(target=watchdog_loop, daemon=True).start()

# Main business loop
serve_forever()

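WATCHDOG_USEC is an environment variable that systemd sets automatically
(in microseconds). The ping interval is usually set to half of
WatchdogSec to leave some margin.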
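
It is worth checking how much margin the half-interval rule actually gives. The sketch below counts how many consecutive pings can be skipped (for example because one health check failed) before the deadline passes. It assumes an idealized schedule with no jitter and an instantaneous health check; the function name tolerated_misses is made up for this illustration.

```python
def tolerated_misses(watchdog_usec: int, interval_usec: int) -> int:
    """Consecutive skipped pings that still leave one ping before the deadline."""
    if watchdog_usec <= 0 or interval_usec <= 0:
        raise ValueError("both values must be positive")
    # Ping slots that fall strictly before the deadline, counted from the
    # last successful ping: n * interval < watchdog.
    slots_before_deadline = (watchdog_usec - 1) // interval_usec
    # The last of those slots must succeed, so the rest may be missed.
    return max(0, slots_before_deadline - 1)

print(tolerated_misses(30_000_000, 15_000_000))  # interval = WatchdogSec / 2
print(tolerated_misses(30_000_000, 10_000_000))  # interval = WatchdogSec / 3
```

Under these assumptions an interval of exactly half the timeout tolerates no skipped ping at all, because the next slot lands on the deadline itself. If a single transient health-check failure should not cause a restart, a third of WatchdogSec or less is the safer interval.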

app_is_healthy() is your own health check:

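def app_is_healthy():
    # db and queue stand for the application's own objects.
    # Check the DB, Redis, and other critical resources.
    try:
        db.ping(timeout=2)
        # Check that the event loop is not deadlocked and the queue is not overflowing
        return queue.qsize() < 10000
    except Exception:
        return False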

If the health check fails, skip the ping; once WatchdogSec elapses,
systemd kills the process and restarts it.
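
One way to check that the application is making progress, rather than only that the watchdog thread is running, is a heartbeat the worker code refreshes itself. This is a sketch under assumptions: the Heartbeat class and its max_age parameter are invented for this example, and a server that can legitimately sit idle would also need to beat on idle ticks.

```python
import threading
import time

class Heartbeat:
    """Timestamp that worker code refreshes whenever it makes progress."""

    def __init__(self, max_age: float, clock=time.monotonic):
        self._max_age = max_age
        self._clock = clock          # injectable so the class can be tested
        self._lock = threading.Lock()
        self._last = clock()

    def beat(self) -> None:
        """Record progress; call this from the main loop."""
        with self._lock:
            self._last = self._clock()

    def is_fresh(self) -> bool:
        """True if the last beat is no older than max_age seconds."""
        with self._lock:
            return self._clock() - self._last <= self._max_age
```

The main loop would call beat() once per iteration, and app_is_healthy() would return False as soon as is_fresh() does. A main loop stuck on a lock or an endless retry then stops the pings even though the watchdog thread itself keeps running.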

Go

package main

import (
    "time"

    "github.com/coreos/go-systemd/v22/daemon"
)

func main() {
    // ... initialization
    daemon.SdNotify(false, daemon.SdNotifyReady)

    // Start the watchdog goroutine only if systemd enabled the watchdog
    interval, err := daemon.SdWatchdogEnabled(false)
    if err == nil && interval > 0 {
        go func() {
            tick := time.NewTicker(interval / 2)
            defer tick.Stop()
            for range tick.C {
                if appHealthy() {
                    daemon.SdNotify(false, daemon.SdNotifyWatchdog)
                }
            }
        }()
    }

    serve()
}

Rust

use systemd::daemon;
use std::thread;
use std::time::Duration;

fn main() {
    daemon::notify(false, [(daemon::STATE_READY, "1")].iter()).unwrap();

    thread::spawn(|| {
        // Half of WatchdogSec=30; keep in sync with the unit file
        let interval = Duration::from_secs(15);
        loop {
            if app_healthy() {
                daemon::notify(false, [(daemon::STATE_WATCHDOG, "1")].iter()).unwrap();
            }
            thread::sleep(interval);
        }
    });

    serve();
}

Services whose source you can't change

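Running an off-the-shelf binary with no sd_notify support? Wrap it in a
script that calls systemd-notify. Because systemd-notify runs as a child
of the wrapper, the unit needs NotifyAccess=all, and WatchdogSec still
has to be set:

[Service]
ExecStart=/path/to/wrapper.sh
WatchdogSec=30
NotifyAccess=all

#!/bin/bash
# wrapper.sh
your-app &
APP_PID=$!

# Ping systemd only while the app answers its health endpoint
while kill -0 "$APP_PID" 2>/dev/null; do
    if curl -sf --max-time 5 http://localhost:8080/health > /dev/null; then
        systemd-notify WATCHDOG=1
    fi
    sleep 15
done

# Propagate the app's exit status to systemd
wait "$APP_PID"

This approach adds complexity, though; prefer changing the application
code when you can.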

Testing

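sudo systemctl start myapp
sudo systemctl status myapp
# After startup, status shows: active (running)
systemctl show myapp -p WatchdogUSec
# Prints the configured timeout: WatchdogUSec=30s

# Simulate a hang by making app_is_healthy return False.
# About 30 seconds after the last ping, systemd will:
# 1. Send SIGABRT (the default watchdog signal)
# 2. Wait for the process to terminate
# 3. Restart it, as Restart=on-watchdog asks
journalctl -u myapp -f
# Look for "Watchdog timeout (limit 30s)!" followed by a restart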

Compared with "passive monitoring + external restart"

An external script that checks and restarts:

# cron */1 * * * *
if ! curl -sf http://localhost:8080/health; then
    systemctl restart myapp
fi

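Pros: simple, no change to application code.
Cons:

Minimum interval is 1 minute (cron)
The health check goes over HTTP (extra coupling)
The restart path is slow (systemctl command + process exit + restart)

Advantages of WatchdogSec:

Detection delay is bounded by WatchdogSec, and systemd acts as soon as the timer expires
The application itself decides "am I healthy?"
systemd handles everything in one place

For production, WatchdogSec is the recommended choice.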

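Tuning notes

WatchdogSec too short

WatchdogSec=5: network jitter, a GC pause, or a large transaction commit
can get a healthy process killed. 30 to 120 seconds is usually a
reasonable range.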

Not feeding the watchdog during startup

Startup takes several minutes to load a large model:

TimeoutStartSec=10min
WatchdogSec=30

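During startup, TimeoutStartSec is in charge; WatchdogSec supervision
begins only after READY=1 has been sent.

Application code: call notify('READY=1') and start the watchdog thread
only after the model has finished loading.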
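
The ordering can be sketched as a small helper. The name start_service and its parameters are hypothetical; notify is assumed to be a function like the one shown earlier. STATUS= is a standard sd_notify field whose free-form text shows up in systemctl status, which helps during a slow startup.

```python
from typing import Callable, Iterable, Tuple

def start_service(
    steps: Iterable[Tuple[str, Callable[[], None]]],
    notify: Callable[[str], None],
    start_watchdog: Callable[[], None],
) -> None:
    """Run slow startup steps, then announce readiness, then begin pinging."""
    for label, step in steps:
        notify(f"STATUS={label}")  # progress text while still starting
        step()
    notify("READY=1")              # startup is complete from systemd's view
    start_watchdog()               # first ping follows shortly after READY=1
```

Sending READY=1 before the slow steps would start the watchdog timer while the model is still loading, which is exactly the false kill this section is about.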

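RuntimeMaxSec: periodic restarts

WatchdogSec=30
# Force a stop once the service has run for 24 hours
RuntimeMaxSec=24h
Restart=on-failure

This guards against slow memory leaks. When RuntimeMaxSec expires,
systemd terminates the service and puts it in a failed state, so it
takes Restart=on-failure or Restart=always (on-watchdog alone is not
enough) to bring it back. The comment sits on its own line because unit
files do not allow trailing comments. With a restart every day, a slow
leak has far less time to grow toward OOM.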

In practice: FastAPI + uvicorn + WatchdogSec

uvicorn has no native sd_notify support.
Option 1: use hypercorn or gunicorn (with UvicornWorker) and write the hook yourself.

Option 2: use the ASGI lifespan event:

# app/main.py
import asyncio, socket, os
from fastapi import FastAPI
from contextlib import asynccontextmanager

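def notify(msg):
    sock_path = os.environ.get('NOTIFY_SOCKET')
    if not sock_path:
        return
    if sock_path.startswith('@'):
        # '@' marks an abstract-namespace socket
        sock_path = '\0' + sock_path[1:]
    s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        s.sendto(msg.encode(), sock_path)
    finally:
        s.close()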

async def watchdog():
    interval = int(os.environ.get('WATCHDOG_USEC', '30000000')) / 2_000_000
    while True:
        if await check_health_async():
            notify('WATCHDOG=1')
        await asyncio.sleep(interval)

@asynccontextmanager
async def lifespan(app: FastAPI):
    notify('READY=1')
    task = asyncio.create_task(watchdog())
    yield
    task.cancel()

app = FastAPI(lifespan=lifespan)

# The health check really exercises the event loop and the DB
async def check_health_async():
    try:
        await asyncio.wait_for(db.execute('SELECT 1'), timeout=2)
        return True
    except Exception:
        return False

systemd unit:

[Service]
Type=notify
WatchdogSec=60
ExecStart=/srv/app/.venv/bin/uvicorn app.main:app --host 127.0.0.1 --port 8000
Restart=on-watchdog

Effect: the DB hangs → the health check fails → no ping is sent → after
60s systemd kills and restarts the service.
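
A fully blocked event loop stops the watchdog coroutine on its own, but a loop that is only badly delayed still pings. One possible extra signal is the loop's scheduling lag, sketched below. The helper names loop_lag and demo are invented for this example, and the threshold you compare the lag against is an application-specific choice.

```python
import asyncio
import time

async def loop_lag(probe: float = 0.05) -> float:
    """Seconds by which the event loop woke us late after a short sleep."""
    start = time.monotonic()
    await asyncio.sleep(probe)
    return max(0.0, time.monotonic() - start - probe)

async def demo():
    """Measure lag on an idle loop, then while a task blocks the loop."""
    idle = await loop_lag()

    async def blocker():
        time.sleep(0.3)  # blocking call made on the loop, for demonstration only

    task = asyncio.create_task(blocker())
    busy = await loop_lag()  # the sleep cannot finish until blocker returns
    await task
    return idle, busy
```

Inside check_health_async(), a lag above some limit could be treated as unhealthy alongside the DB probe.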

Monitoring

When the watchdog fires, the systemd journal shows:

systemd[1]: myapp.service: Watchdog timeout (limit 30s)!
systemd[1]: myapp.service: Killing process 12345 with signal SIGABRT.

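Prometheus node_exporter can export per-unit restart counts, but its
systemd collector is off by default: start it with --collector.systemd
and --collector.systemd.enable-restarts-metrics.
Alert: increase(node_systemd_service_restart_total[1h]) > 5 → the
service is persistently unhealthy. (rate() gives a per-second value, so
a threshold of 5 on rate() would practically never fire.)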
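
Without Prometheus, the same signal can be pulled out of journal text. The sketch below counts watchdog timeouts per unit. It assumes the message format shown in the journal excerpt above, and the function name count_watchdog_timeouts is made up for this example.

```python
import re

# Matches e.g. "myapp.service: Watchdog timeout (limit 30s)!"
WATCHDOG_LINE = re.compile(
    r"(?P<unit>\S+\.service): Watchdog timeout \(limit (?P<limit>[^)]+)\)!"
)

def count_watchdog_timeouts(lines):
    """Return {unit name: number of watchdog timeout lines seen}."""
    counts = {}
    for line in lines:
        match = WATCHDOG_LINE.search(line)
        if match:
            unit = match["unit"]
            counts[unit] = counts.get(unit, 0) + 1
    return counts
```

It could be fed the output of journalctl -u myapp, one line per entry.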

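Pitfalls

Type=simple is a poor fit: systemd then treats the service as started the
moment the process is spawned, so there is no READY=1 handshake and the
watchdog timer is already running during initialization. Use Type=notify.
Feeding the watchdog from a forked child: NotifyAccess defaults to main,
so notifications sent by a child process are ignored. Either switch to
NotifyAccess=all or have the main process send the pings.
Test value left in place: WatchdogSec=5 is convenient for testing because
it is fast; forget to restore the normal value for production and a GC
pause under high load gets the process killed.
Python threads and the GIL: a long call that holds the GIL (typically
inside a C extension) also stalls the watchdog thread, so no pings go out
and a busy but healthy process gets killed. Move such work into a separate
process (multiprocessing) or keep WatchdogSec comfortably above the
longest GIL-holding call.
WATCHDOG_USEC not visible: systemd injects the variable when it starts the
service's main process. If the service is launched through a shell
wrapper, have the wrapper exec the real binary (so it keeps both the
environment and the main PID) or pass the variables through explicitly
with env.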
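
To debug the last pitfall, it helps to reproduce the check that sd_watchdog_enabled() performs: besides WATCHDOG_USEC, systemd sets WATCHDOG_PID, and the variables only apply to the process whose PID matches. The sketch below follows that rule; the function name watchdog_timeout is made up for this example.

```python
import os
from typing import Mapping, Optional

def watchdog_timeout(
    environ: Mapping[str, str] = os.environ,
    pid: Optional[int] = None,
) -> Optional[float]:
    """Return the watchdog timeout in seconds, or None if it does not apply."""
    pid = os.getpid() if pid is None else pid
    usec = environ.get("WATCHDOG_USEC")
    if usec is None:
        return None  # watchdog not configured, or the environment was stripped
    expected = environ.get("WATCHDOG_PID")
    if expected is not None and int(expected) != pid:
        return None  # variables were meant for a different process
    value = int(usec)
    return value / 1_000_000 if value > 0 else None
```

Logging this value once at startup makes it obvious when a wrapper script, sudo, or env -i has left the application without a watchdog.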