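nginx tuning: keepalive, workers, and buffers in practice

Background

A freshly deployed nginx on its default config showed these symptoms under moderate traffic:

- occasional P99 latency spikes above 200 ms
- connection reset errors on the client side
- connection timeouts to the upstream (PHP-FPM / gunicorn)

Untuned, it started to wobble at around 1k QPS. After adjusting a handful of key parameters it held steady at 5k QPS.
Below is the baseline I apply to every new deployment.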

worker_processes / worker_connections

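worker_processes auto;          # one worker per CPU core
worker_rlimit_nofile 65535;     # file descriptor limit for each worker

events {
    worker_connections 8192;     # max connections per worker (client + upstream)
    use epoll;                   # epoll on Linux (nginx also picks it by default)
    multi_accept on;             # accept all pending connections at once
}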

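worker_processes auto starts one worker per CPU core.
worker_connections × worker_processes is the theoretical maximum number of concurrent connections.
8 cores × 8192 = 65536 connections. When proxying, each request holds two connections (client and upstream), so the practical ceiling is about half that, still far more than 1k QPS needs.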
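
The capacity arithmetic above can be written down as a sizing sketch. The 8-core host and the descriptor limit of 20000 are assumptions chosen for illustration, not measured values; the point is only that worker_rlimit_nofile should comfortably exceed twice worker_connections when every client connection also needs an upstream connection.

```nginx
# Sizing sketch for a reverse proxy on an assumed 8-core host.
worker_processes 8;

# Per-worker descriptor limit (assumed value): more than 2 x worker_connections,
# with headroom for log files, temp files, and listening sockets.
worker_rlimit_nofile 20000;

events {
    # Counts ALL connections a worker holds: clients and upstreams.
    worker_connections 8192;
}

# Proxied capacity is roughly 8 x 8192 / 2 = 32768 simultaneous client connections.
```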

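upstream keepalive (critical)

By default, nginx opens a new TCP connection to the upstream for every proxied request, so each request pays the TCP handshake cost (plus the TLS handshake for HTTPS upstreams).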

upstream backend {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
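    keepalive 64;               # cache up to 64 idle upstream connections per worker
    keepalive_timeout 60s;      # close idle upstream connections after 60s (here: nginx 1.15.3+)
    keepalive_requests 1000;     # recycle a connection after 1000 requests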
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";      # required: otherwise nginx sends "Connection: close"
    }
}

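The proxy_set_header Connection "" line is the easiest one to forget.
Without it, nginx sends Connection: close to the upstream, and without proxy_http_version 1.1 it speaks HTTP/1.0. Either way connections are never reused and the keepalive pool has no effect.

With both in place, latency drops by roughly 30-50% (no repeated TCP handshake and connection setup).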
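
The proxy_* directives above only apply to HTTP upstreams such as gunicorn. For a PHP-FPM upstream spoken to over FastCGI, the nginx documentation states that fastcgi_keep_conn must be enabled for upstream keepalive to work. A minimal sketch, assuming a hypothetical upstream name (php_backend), socket path, and document root:

```nginx
upstream php_backend {
    server unix:/run/php/php-fpm.sock;   # hypothetical socket path
    keepalive 16;                        # idle connections cached per worker (assumed value)
}

server {
    root /var/www/app;                   # hypothetical document root

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass php_backend;
        fastcgi_keep_conn on;            # required for keepalive to FastCGI upstreams
    }
}
```

Whether this helps depends on how the FPM pool is sized: each cached connection pins an FPM child, so the keepalive value should stay well below the pool's process count.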

buffer / timeout

# request buffers
client_max_body_size 20m;
client_body_buffer_size 16k;
client_header_buffer_size 4k;
large_client_header_buffers 4 16k;

# upstream buffers
proxy_buffering on;
proxy_buffer_size 8k;
proxy_buffers 8 16k;
proxy_busy_buffers_size 16k;

# timeout
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
send_timeout 60s;

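proxy_buffering on (the default): nginx reads the upstream response into its buffers as fast as the upstream can send it, then feeds the client at the client's own pace, so a slow client does not tie up the upstream.

Turn it off for special cases such as SSE / streaming: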

location /sse/ {
    proxy_pass http://backend;
    proxy_buffering off;            # do not buffer streamed responses
    proxy_cache off;
}
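
A slightly fuller sketch of the streaming location, for the case where events can be more than 60 seconds apart. proxy_read_timeout is the timeout between two successive reads from the upstream, so with the 60s value from the baseline an idle stream would be closed. The 1h value here is an assumption; pick whatever matches your heartbeat interval.

```nginx
location /sse/ {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";   # keep using the upstream keepalive pool

    proxy_buffering off;              # pass each event through as it arrives
    proxy_cache off;

    proxy_read_timeout 1h;            # assumed value: allow long gaps between events
}
```

As an alternative to a dedicated location, the upstream can disable buffering for a single response by sending the X-Accel-Buffering: no response header.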

gzip

gzip on;
gzip_vary on;
gzip_min_length 1024;
gzip_proxied any;
gzip_comp_level 5;
gzip_types
    text/plain text/css application/json application/javascript
    text/xml application/xml application/xml+rss text/javascript
    image/svg+xml;

JSON / HTML typically shrinks by 70-90%.
gzip_comp_level 5 balances CPU cost against compression ratio (the valid range is 1-9).

brotli (better compression)

brotli on;
brotli_comp_level 6;
brotli_types text/plain text/css application/json application/javascript;

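Requires the third-party ngx_brotli module (some distros bundle it, e.g. in an nginx-extras package; otherwise it has to be built separately).
Compresses about 15-20% better than gzip, at slightly higher CPU cost.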

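Static files: sendfile / tcp_nopush

sendfile on;            # send files from kernel space, no copy through user space
tcp_nopush on;          # fill packets before sending (only takes effect with sendfile)
tcp_nodelay on;         # do not delay small payloads on keepalive connections

# cache headers for static files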
location /static/ {
    expires 30d;
    add_header Cache-Control "public, immutable";
    access_log off;
}

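sendfile is zero-copy: file data goes from the page cache to the socket without passing through user space, which cuts CPU use for large file transfers to roughly 1/4.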

TLS tuning

ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
ssl_prefer_server_ciphers off;

ssl_session_cache shared:SSL:10m;       # 10 MB shared session cache, about 40000 sessions
ssl_session_timeout 1d;
ssl_session_tickets off;

ssl_stapling on;
ssl_stapling_verify on;

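# HTTP/2 (on nginx 1.25.1+ the "http2" listen parameter is deprecated; use "http2 on;" instead)
listen 443 ssl http2;

# HTTP/3 (nginx 1.25+, requires TLSv1.3)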
listen 443 quic reuseport;
add_header Alt-Svc 'h3=":443"; ma=86400';

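The session cache lets returning clients resume a session instead of doing a full handshake: with TLS 1.2 that is 1 RTT instead of 2, and the certificate exchange and signature are skipped. (True 0-RTT needs TLS 1.3 early data, which this config does not enable.)
At high QPS this saves a significant amount of CPU.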
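
The listen directives above only make sense inside a server block. One way to put the pieces together on nginx 1.25.1 or later, using the newer http2 directive, is sketched below. The server name and certificate paths are placeholders, and the backend upstream is assumed to be the one defined earlier.

```nginx
# Sketch only: server_name and certificate paths are placeholders.
server {
    listen 443 ssl;               # TCP: HTTP/1.1 and HTTP/2
    listen 443 quic reuseport;    # UDP: HTTP/3
    http2 on;                     # nginx 1.25.1+, replaces "listen ... http2"

    server_name example.com;
    ssl_certificate     /etc/nginx/tls/example.com.crt;
    ssl_certificate_key /etc/nginx/tls/example.com.key;

    ssl_protocols TLSv1.2 TLSv1.3;       # HTTP/3 needs TLSv1.3
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 1d;

    # Advertise HTTP/3 to clients that first connected over TCP
    add_header Alt-Svc 'h3=":443"; ma=86400';

    location / {
        proxy_pass http://backend;
    }
}
```

HTTP/3 runs over UDP, so UDP port 443 also has to be open in the firewall for the Alt-Svc hint to be of any use.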

rate limit

To guard against abuse:

http {
    limit_req_zone $binary_remote_addr zone=api:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=conn:10m;
}

server {
    location /api/ {
        limit_req zone=api burst=200 nodelay;
        limit_conn conn 20;
        proxy_pass http://backend;
    }
}

Per IP: 100 req/s with a burst of 200, and at most 20 concurrent connections.
Basic protection, enough to stop script kiddies.
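
With burst=200 nodelay, up to 200 requests above the rate are served immediately and anything beyond that is rejected. By default the rejection is a 503, which is easy to confuse with a real backend outage. A possible variation, assuming you prefer 429 Too Many Requests, uses limit_req_status and limit_conn_status:

```nginx
http {
    limit_req_zone  $binary_remote_addr zone=api:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=conn:10m;

    limit_req_status  429;      # default is 503
    limit_conn_status 429;      # default is 503
    limit_req_log_level warn;   # default is error; rejected requests are logged at this level

    server {
        location /api/ {
            limit_req  zone=api burst=200 nodelay;
            limit_conn conn 20;
            proxy_pass http://backend;
        }
    }
}
```

Keep in mind that $binary_remote_addr is the address nginx sees, so clients behind a shared NAT or a CDN share one bucket.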

log format + sampling

log_format main_json escape=json
    '{'
    '"time":"$time_iso8601",'
    '"remote":"$remote_addr",'
    '"method":"$request_method",'
    '"uri":"$request_uri",'
    '"status":$status,'
    '"bytes":$body_bytes_sent,'
    '"rt":$request_time,'
    '"upstream_rt":"$upstream_response_time",'
    '"ua":"$http_user_agent"'
    '}';

# under heavy traffic, skip logging for noisy requests (health checks / favicon)
map $request_uri $loggable {
    default 1;
    /health 0;
    ~*\.ico 0;
}

access_log /var/log/nginx/access.log main_json if=$loggable;

The JSON format is easy for ELK / Loki / Datadog to parse.
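
The map above filters by URI rather than sampling. If true sampling is wanted, one possible sketch uses split_clients to pick a fixed share of requests and a second map to always keep errors. The variable names $sampled and $log_this and the 10% share are assumptions for illustration.

```nginx
# Hash the per-request ID and mark roughly 10% of requests as sampled.
split_clients $request_id $sampled {
    10%  1;
    *    0;
}

# Always log 4xx/5xx responses; otherwise fall back to the sample.
map $status $log_this {
    ~^[45]   1;
    default  $sampled;
}

access_log /var/log/nginx/access.log main_json if=$log_this;
```

This access_log line would take the place of the one above. Dashboards built from a sampled log need to scale counts for non-error requests by the sampling ratio.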

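Real-world tuning case

A Django app on a single server, 4 vCPU / 8 GB:

Before:
- handled 1000 QPS, but P99 was 300 ms
- 4000+ upstream connections, constantly churning through TIME_WAIT
- occasional 502s

After (the baseline above):
- stable at 3000 QPS, P99 80 ms
- about 200 upstream connections (reused via keepalive)
- 502s all but gone

Key changes: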
1. upstream keepalive 64 + Connection "" header
2. worker_processes auto + worker_connections 8192
3. gzip on + brotli on

Monitoring

/nginx_status (stub_status module):

location /nginx_status {
    stub_status;
    allow 127.0.0.1;
    deny all;
}

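Scrape it with a Prometheus exporter (nginx-vts / nginx-prometheus-exporter) and build Grafana panels for:

- active connections
- accepted / handled / requests rate
- per-status-code rate
- upstream response time

stub_status alone only exposes connection and request counters; per-status-code rates and upstream response times need the VTS module or metrics derived from the access log.

That way spikes and leaks show up right away.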

Advanced: worker_cpu_affinity

worker_processes 4;
worker_cpu_affinity 0001 0010 0100 1000;

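Pinning each worker to a specific CPU improves cache locality, worth about 5-10%.
Usually unnecessary on machines with fewer than 8 cores; it only pays off on many-core machines under heavy load.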
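
Writing the bitmasks by hand gets tedious as the core count grows. Since nginx 1.9.10 the auto value binds workers to the available CPUs automatically. A minimal sketch, assuming Linux or FreeBSD (the only platforms where the directive is available):

```nginx
worker_processes auto;
worker_cpu_affinity auto;   # nginx 1.9.10+: bind each worker to one available CPU
```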

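Pitfalls I have hit

- Connection "" missing: upstream keepalive silently does nothing, so performance does not improve even though you think it is configured. Confirm with tcpdump / netstat by watching the upstream connection count.
- worker_connections high but ulimit low: set worker_rlimit_nofile as well, otherwise accept() and connect() fail with "Too many open files". Check with ulimit -n (or /proc/<worker pid>/limits for the running process).
- proxy_buffer_size too small: a large upstream response header (lots of cookies) triggers the "upstream sent too big header" error. Raise it to 16k or more.
- server_tokens on: by default the nginx version is shown in the Server header and on error pages, which leaks information. Set server_tokens off.
- reload is not restart: nginx -s reload is graceful and the right habit; restart drops existing connections. Use reload in CI deploys.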
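
Two of the pitfalls above map directly onto config lines. A sketch of the corresponding http-level settings follows; the 32k value for proxy_busy_buffers_size is an assumption. nginx requires it to be at least as large as both proxy_buffer_size and a single proxy_buffers buffer, and smaller than the total proxy_buffers size minus one buffer, so it has to be rechecked whenever proxy_buffer_size is raised.

```nginx
http {
    # Hide the nginx version in the Server header and on error pages.
    server_tokens off;

    # Room for large upstream response headers (many cookies).
    proxy_buffer_size 16k;
    proxy_buffers 8 16k;

    # Must be >= max(proxy_buffer_size, one buffer) = 16k
    # and <= 8 x 16k - 16k = 112k. 32k is an assumed middle value.
    proxy_busy_buffers_size 32k;
}
```

Running nginx -t before nginx -s reload catches a violated buffer constraint before it reaches the running workers.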