Writing a genuinely useful /healthz and /readyz (it takes more than returning 200)

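Kubernetes, reverse proxies, and monitoring systems all probe an application's
health. Many people implement the endpoint as return {"ok": true} and consider it
done. An endpoint like that cannot tell "the process is alive" apart from "the
process can actually serve requests", so when something breaks, the alerts do not
line up with the real failure.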

The right approach is to split this into two endpoints:

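/healthz (liveness): is the process alive? Failure → restart the container
/readyz (readiness): can it accept new requests? Failure → remove it from the LB backends, but do not restart it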

liveness: keep it as thin as possible

@app.get('/healthz')
def liveness():
    return {'status': 'alive'}

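That is all it needs. The rule: it must not check any external dependency, because:

A DB that is temporarily unreachable → is no reason to restart the web process
A slow Redis → is not fixed by a restart
A liveness failure means "the process is broken and cannot heal itself"; only
things like OOM, an infinite loop, or a panic should make it fail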

A small addition (to confirm the process is not deadlocked):

import time
@app.get('/healthz')
def liveness():
    # Getting a response at all shows the event loop / main thread is not stuck
    return {'status': 'alive', 'ts': time.time()}
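
The handler above only proves that something answered the request. If you want liveness to notice a stalled event loop more directly, one option is a heartbeat: a background task on the loop keeps refreshing a timestamp, and the liveness handler fails when that timestamp goes stale. The sketch below is an assumption-laden illustration, not part of the original design; LoopHeartbeat, max_age, and interval are hypothetical names.

```python
import asyncio
import time
from typing import Optional


class LoopHeartbeat:
    """Tracks when the event loop last proved it was still running."""

    def __init__(self, max_age: float = 5.0) -> None:
        self.max_age = max_age              # seconds of silence tolerated
        self.last_beat = time.monotonic()

    def beat(self) -> None:
        """Record that the loop is making progress right now."""
        self.last_beat = time.monotonic()

    def is_alive(self, now: Optional[float] = None) -> bool:
        """True if the last beat is recent enough."""
        now = time.monotonic() if now is None else now
        return (now - self.last_beat) <= self.max_age

    async def run(self, interval: float = 1.0) -> None:
        # Runs on the event loop; if the loop stalls, beats stop arriving.
        while True:
            self.beat()
            await asyncio.sleep(interval)
```

To use it, start asyncio.create_task(heartbeat.run()) in a startup hook and have a plain def liveness handler return 503 when heartbeat.is_alive() is False. In FastAPI, plain def endpoints run in a thread pool, so the handler can still answer while the loop itself is blocked.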

readiness: check every hard dependency

import asyncio
from fastapi.responses import JSONResponse  # assuming a FastAPI app
from sqlalchemy import text

@app.get('/readyz')
async def readiness():
    checks = {}
    overall_ok = True

    # DB
    try:
        async with db_session() as s:
            await asyncio.wait_for(
                s.execute(text('SELECT 1')), timeout=2.0)
        checks['db'] = 'ok'
    except Exception as e:
        checks['db'] = f'fail: {e!r}'
        overall_ok = False

    # Redis
    try:
        await asyncio.wait_for(redis.ping(), timeout=1.0)
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = f'fail: {e!r}'
        overall_ok = False

    # Critical external APIs (optional): readiness usually does not check
    # third-party APIs, because there is nothing you can do when they are down
    # checks['stripe'] = ...

    status = 200 if overall_ok else 503
    return JSONResponse(
        status_code=status,
        content={'ok': overall_ok, 'checks': checks},
    )

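Notes:

wait_for + timeout: when a dependency hangs, readiness itself must not hang
Return 503 on failure; only then does K8s remove the pod from the Service endpoints
Return the details as well: during manual troubleshooting you can see at a glance which dependency is down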
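
As the list of dependencies grows, the try/except pattern above gets repetitive, and sequential checks add their timeouts together. A possible refactoring, sketched here under the assumption that every check can be expressed as a zero-argument async callable, runs them concurrently so the endpoint takes roughly as long as the slowest check. run_check and run_checks are hypothetical helper names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Check = Callable[[], Awaitable[object]]


async def run_check(name: str, check: Check, timeout: float) -> Tuple[str, str]:
    """Run one dependency check with a timeout; never raises."""
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return name, 'ok'
    except asyncio.TimeoutError:
        return name, f'fail: timed out after {timeout}s'
    except Exception as e:
        return name, f'fail: {e!r}'


async def run_checks(
    checks: Dict[str, Tuple[Check, float]],
) -> Tuple[bool, Dict[str, str]]:
    """Run all checks concurrently and return (overall_ok, per-check report)."""
    results = await asyncio.gather(
        *(run_check(name, fn, timeout) for name, (fn, timeout) in checks.items())
    )
    report = dict(results)
    return all(v == 'ok' for v in report.values()), report
```

The readiness handler would then call run_checks with entries such as 'redis': (redis.ping, 1.0) and map overall_ok to 200 or 503 as before.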

startup probe（K8s 1.16+）

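Applications that start slowly (for example, ones that load a large model) need a
third kind of probe: startup. It keeps the container from being killed while it
is still starting and not yet ready, and gives it time: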

# Example K8s Deployment YAML
livenessProbe:
  httpGet: { path: /healthz, port: 8000 }
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet: { path: /readyz, port: 8000 }
  periodSeconds: 5
  failureThreshold: 2

startupProbe:
  httpGet: { path: /readyz, port: 8000 }
  periodSeconds: 5
  failureThreshold: 60   # allows 60 * 5 = 300 seconds to start

Until the startupProbe succeeds, the liveness and readiness probes are not run. Once it succeeds, the regular probes take over.
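
The probe settings above translate into rough time budgets. The small helper below is only back-of-the-envelope arithmetic (it ignores timeoutSeconds and probe jitter), and failure_window is a hypothetical name, but it makes the trade-off visible when tuning the values.

```python
def failure_window(period_seconds: int, failure_threshold: int,
                   initial_delay_seconds: int = 0) -> int:
    """Approximate seconds of consecutive failures before K8s acts."""
    return initial_delay_seconds + period_seconds * failure_threshold


startup_budget = failure_window(5, 60)    # time allowed for startup
liveness_window = failure_window(10, 3)   # time until a stuck container restarts
readiness_window = failure_window(5, 2)   # time until a pod leaves the endpoints

print(startup_budget, liveness_window, readiness_window)  # prints "300 30 10"
```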

Anti-patterns to avoid

# Wrong: health and ready folded into one endpoint
@app.get('/health')
def health():
    db_ok = db.check()
    return {'ok': db_ok}
# Problem: one DB hiccup gets the whole process restarted → cascading failure

# Wrong: liveness checks an external dependency
@app.get('/healthz')
def liveness():
    requests.get('https://api.example.com/ping', timeout=5)
    return {'ok': True}
# Problem: slow third-party API → slow liveness → K8s thinks the process is dead → restart loop

# Wrong: not distinguishing 200 from 503
@app.get('/readyz')
def ready():
    return {'db': 'fail'}   # status=200! The LB still considers this instance healthy

Adding an "I am draining" flag to readiness

Sometimes you want to deliberately stop a pod from taking new requests (for example, ahead of a deploy or a drain):

ready_flag = True

@app.get('/readyz')
def readiness():
    if not ready_flag:
        return JSONResponse(status_code=503, content={'ok': False, 'reason': 'draining'})
    return ...

@app.post('/admin/drain')
def drain():
    global ready_flag
    ready_flag = False
    return {'ok': True, 'state': 'draining'}

On SIGTERM, first set ready_flag=False, wait for the LB to remove the pod, and then exit:

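import asyncio, signal, sys

async def graceful_shutdown():
    global ready_flag
    ready_flag = False
    await asyncio.sleep(15)   # wait for the LB to notice
    sys.exit(0)               # then exit

def install_sigterm_handler():
    # Call this from inside the running event loop (e.g. a startup hook).
    # loop.add_signal_handler is the asyncio-safe way to react to signals;
    # signal.signal + create_task may run outside the loop's control.
    # Servers such as uvicorn install their own SIGTERM handler, so check
    # how this interacts with yours.
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(
        signal.SIGTERM, lambda: asyncio.create_task(graceful_shutdown()))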
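
The drain-then-exit sequence is easier to test when the flag, the wait, and the final stop action live in one small object instead of a global plus sys.exit. The following is one possible shape, not the original implementation; Drainer, drain_seconds, and the injected stop callback are all hypothetical.

```python
import asyncio
from typing import Callable


class Drainer:
    """Holds the readiness flag and runs the drain-then-stop sequence once."""

    def __init__(self, stop: Callable[[], None], drain_seconds: float = 15.0) -> None:
        self.ready = True
        self.draining = False
        self._stop = stop                    # e.g. asks the server to shut down
        self._drain_seconds = drain_seconds

    def status_code(self) -> int:
        """HTTP status the /readyz handler should return right now."""
        return 200 if self.ready else 503

    async def drain(self) -> None:
        if self.draining:                    # a second SIGTERM must not restart the timer
            return
        self.draining = True
        self.ready = False                   # /readyz starts returning 503 immediately
        await asyncio.sleep(self._drain_seconds)   # give the LB time to notice
        self._stop()
```

Both the SIGTERM handler and an admin drain endpoint can schedule drainer.drain(); because the method is idempotent, whichever arrives first wins and later calls return immediately.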

Expose metrics alongside

from prometheus_client import Counter

ready_counter = Counter('readyz_total', 'readyz checks', ['result'])

@app.get('/readyz')
def readiness():
    result = 'ok' if all_ok else 'fail'
    ready_counter.labels(result=result).inc()
    ...

In Prometheus you can then watch the readiness pass rate over time.

Pitfalls we have hit

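Checking dependencies synchronously with requests → blocks the event loop →
readiness takes several seconds, and healthy pods get taken out of rotation by
mistake. Every dependency check must have a timeout and be async.
SELECT 1 confirms basic DB health but does not verify writability. If your
service must be able to write, run SELECT 1 and also
INSERT ... ON CONFLICT DO NOTHING against a dedicated sentinel row.
Exposing /healthz and /readyz to the public internet lets attackers DoS your
check endpoints with slow requests. Keep them on the internal network, or add a
simple IP allowlist.
Leaving terminationGracePeriodSeconds unset in K8s → SIGKILL arrives 30 seconds
(the default) after SIGTERM, and graceful shutdown has no time to finish. Raise
this value to at least 60 seconds.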
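
For the "simple IP allowlist" mentioned above, the standard library's ipaddress module is enough for a first pass. This is a minimal sketch: the networks listed are assumed private ranges and should be replaced with your own cluster or VPC CIDRs, and is_allowed is a hypothetical helper you would call with the client address before running any checks.

```python
import ipaddress

# Assumed internal ranges; replace with your own cluster / VPC CIDRs.
ALLOWED_NETWORKS = [
    ipaddress.ip_network('10.0.0.0/8'),
    ipaddress.ip_network('172.16.0.0/12'),
    ipaddress.ip_network('192.168.0.0/16'),
    ipaddress.ip_network('127.0.0.0/8'),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the allowed networks."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False                      # malformed address: deny
    return any(addr in net for net in ALLOWED_NETWORKS)
```

If the app sits behind a proxy, the address seen by the app is the proxy's, so the check only makes sense against whatever client address your proxy setup reliably provides.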