## Motivation

For personal and small projects I use SQLite:

- Single file, zero ops
- Fast enough (in WAL mode it handles on the order of 10k read QPS and 1k write QPS)
- Backup = `cp` the file

The downside: if the single server dies, the service stops and everything written since the last backup is lost.

`Litestream` (by Ben Johnson) streams the SQLite WAL to S3 / SFTP / other storage in real time. It needs no application changes and is near real-time (RPO measured in seconds).

## Install

```bash
# Download the prebuilt binary
wget https://github.com/benbjohnson/litestream/releases/download/v0.3.13/litestream-v0.3.13-linux-amd64.tar.gz
tar -xf litestream-*.tar.gz
sudo mv litestream /usr/local/bin/
```

## Configure

`/etc/litestream.yml`:

```yaml
dbs:
  - path: /srv/myapp/db.sqlite3
    replicas:
      - type: s3
        bucket: my-backups
        path: myapp/db
        region: us-east-1
        access-key-id: ${AWS_ACCESS_KEY_ID}
        secret-access-key: ${AWS_SECRET_ACCESS_KEY}
        retention: 720h  # 30 days
```

Start it:

```bash
litestream replicate -config /etc/litestream.yml
```

## systemd service

```ini
# /etc/systemd/system/litestream.service
[Unit]
Description=Litestream
After=network.target

[Service]
ExecStart=/usr/local/bin/litestream replicate -config /etc/litestream.yml
Restart=always
EnvironmentFile=/etc/litestream.env

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable --now litestream
```

## How it works

In WAL mode, SQLite writes go to the `-wal` file first and are periodically checkpointed into the main database file.

Litestream:

1. Takes an initial snapshot as the baseline and uploads it to S3
2. Tails the WAL and streams the incremental changes (every second)
3. Takes a new snapshot when the WAL is checkpointed automatically

What ends up in S3:

```
myapp/db/
  snapshots/
    20250314T100000.db.lz4
    20250320T100000.db.lz4
  wal/
    20250314T100000_000001.wal.lz4
    ...
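```

You can restore to any point in time (snapshot + WAL replay).

## Restore

```bash
# Latest state
litestream restore -o /tmp/restored.db -config /etc/litestream.yml /srv/myapp/db.sqlite3

# Point in time
litestream restore -o /tmp/restored.db \
  -timestamp '2025-03-14T10:00:00Z' \
  -config /etc/litestream.yml \
  /srv/myapp/db.sqlite3
```

RTO is on the order of 5 minutes (depending on DB size).

## Transparent to the application

Litestream does not change SQLite's behavior. The application reads and writes `db.sqlite3` as usual.

Litestream runs as a sidecar process that watches the WAL. The application does not know Litestream exists, so the integration risk is close to zero.

## Performance impact

Litestream shares disk I/O with SQLite. For a small DB (< 1 GB) with moderate writes (< 100 wps, writes per second) the overhead is barely noticeable. With high write throughput, the question becomes whether Litestream's upload bandwidth can keep up.

Our production numbers:

- 1 GB SQLite
- 50 wps
- Litestream + S3: CPU < 1%, network a few KB/s on average
- WAL upload latency P99 < 2s

## Not a replacement for HA

Litestream is **disaster recovery** (backup + restore), not HA.

HA requires:

- Multiple servers / multiple regions live at the same time
- Automatic failover

SQLite + Litestream is a primary/standby setup: when the primary dies, a standby machine restores the database and starts up, which means about 5 minutes of downtime. If you really need HA, use Postgres + replication or a managed DB.

## Read replica

Litestream 0.4+ offers read replicas as an experimental feature:

```yaml
dbs:
  - path: /srv/myapp/db.sqlite3
    replicas:
      - type: s3
        bucket: ...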
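```

```bash
# On another server
litestream replicate \
  -read-only \
  -config replica.yml
```

It pulls from S3 and applies the WAL to a local read-only SQLite database. This makes read scaling feasible (writes still go to the single primary).

## Price

S3 storage: 1 GB DB + 1 month of WAL = a few GB → about $0.10/month.

S3 PUT: one WAL upload per second → 86,400 PUT/day × $0.005/1k ≈ $0.43/day. That is an upper bound of roughly $13/month, assuming there is something new to upload every second around the clock; with idle periods the total comes down to a few dollars/month. Either way it stays below RDS db.t3.micro (about $15/month).

## SFTP / GCS / Azure

It does not have to be S3:

```yaml
replicas:
  - type: sftp
    host: backup.example.com
    user: backup
    path: /backups/myapp
    key-path: ~/.ssh/backup_key
```

Or GCS:

```yaml
replicas:
  - type: gcs
    bucket: my-bucket
    path: myapp/db
```

## Compared with PG / MySQL

| | SQLite + Litestream | Postgres |
|---|---|---|
| Write QPS | Thousands | Tens of thousands |
| Multiple readers | Yes | Yes |
| Multiple writers | Single | Yes |
| HA | Primary/standby (5 min downtime) | Streaming replica |
| Ops | Minimal | Moderate |
| Backup | litestream | pg_dump / WAL-G |

Simple app + single server → SQLite + Litestream. Multiple servers / high throughput / HA required → Postgres.

## Real deployment

My personal project / small SaaS (< 1k DAU):

- VPS at $5/month (Hetzner)
- Django + SQLite + Litestream → S3
- nginx reverse proxy
- Cloudflare CDN (free tier)

Total cost < $10/month.

`db.sqlite3` and its WAL go to S3 automatically, with a fresh snapshot at most every 5 minutes (the snapshot interval). If the server blows up: new VPS + restore + deploy, and the service is back online in half an hour.

The SLA is not 99.99%, but 99% is easy to reach.

## Compared with cron + cp

```bash
cp db.sqlite3 /backup/db-$(date +%F).sqlite3
```

Simple, but:

- Long interval (daily) → RPO of one day
- No PITR
- The copy may be inconsistent with the WAL at the moment of `cp`

Litestream gives second-level RPO, PITR, and WAL-consistent backups.

## Monitoring

Litestream exposes Prometheus metrics:

```yaml
addr: ":9090"
```

- `litestream_replica_position_bytes`
- `litestream_replica_last_sync_seconds`

Alert when there has been no sync for more than 60s.

## Pitfalls I hit

1. **WAL not enabled**: `PRAGMA journal_mode=WAL;` is required. Without WAL mode Litestream has nothing to tail.
2. **Multi-process writes are awkward**: SQLite has limits on multi-process writes. Let only one process write, and let Litestream tail that WAL.
3. **DB deleted**: after a manual `rm db.sqlite3`, Litestream sees the deletion but the data is still in S3, so a restore fixes it. Be careful that an accidental `--full-resync` does not overwrite the backup.
4. **Restore is not fast for large DBs**: restoring a large DB (10 GB+) takes several minutes (download the snapshot + replay the WAL). RTO is not instant.
5. **Missing monitoring**: if the Litestream process dies, the application keeps running and nobody notices, so backups stop silently. Use systemd `Restart=always` plus Prometheus monitoring.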
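For the first pitfall (WAL not enabled), a minimal sketch of how an application could switch the database to WAL mode at startup and confirm that the switch took effect. It uses only Python's standard `sqlite3` module; the helper name `ensure_wal` is hypothetical and is not part of Litestream or Django.

```python
import sqlite3


def ensure_wal(db_path: str) -> str:
    """Switch the database at db_path to WAL mode and return the active journal mode.

    Hypothetical helper for illustration. The PRAGMA returns the journal mode
    that is actually in effect, so the caller can detect a failed switch.
    """
    conn = sqlite3.connect(db_path)
    try:
        # journal_mode=WAL is persistent: it is stored in the database file
        # and stays in effect for later connections.
        row = conn.execute("PRAGMA journal_mode=WAL;").fetchone()
    finally:
        conn.close()
    return str(row[0]).lower()


if __name__ == "__main__":
    # Path taken from the config in this post; adjust to your own deployment.
    mode = ensure_wal("/srv/myapp/db.sqlite3")
    if mode != "wal":
        raise SystemExit(f"journal_mode is {mode!r}, Litestream needs WAL")
```

Running a check like this once at deploy time is one way to fail loudly before Litestream starts, rather than discovering later that nothing was being replicated.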
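Going back to the price estimate, the arithmetic can be checked with a few lines of Python. This is a sketch under stated assumptions: the $0.005 per 1,000 PUT figure comes from the post, while the $0.023 per GB-month storage price and the 30-day month are assumptions added here for illustration, and both function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def s3_put_cost_per_day(puts_per_second: float, price_per_1k_puts: float = 0.005) -> float:
    """Daily S3 PUT cost in dollars for a steady upload rate."""
    return puts_per_second * SECONDS_PER_DAY / 1000 * price_per_1k_puts


def monthly_cost(
    storage_gb: float,
    puts_per_second: float,
    price_per_gb_month: float = 0.023,  # assumed storage price, not from the post
    price_per_1k_puts: float = 0.005,   # PUT price quoted in the post
    days: int = 30,                     # assumed month length
) -> float:
    """Storage plus PUT cost in dollars for one month."""
    storage = storage_gb * price_per_gb_month
    puts = s3_put_cost_per_day(puts_per_second, price_per_1k_puts) * days
    return storage + puts


if __name__ == "__main__":
    # Upper bound: one upload every second, all day.
    print(f"PUT per day:   ${s3_put_cost_per_day(1.0):.3f}")
    print(f"Month (4 GB):  ${monthly_cost(4, 1.0):.2f}")
    # A database that only has new WAL data one second in ten.
    print(f"Month at 0.1/s: ${monthly_cost(4, 0.1):.2f}")
```

At one upload per second the PUT cost is $0.432/day, so the monthly total is dominated by request count rather than storage; lowering the effective upload rate is what brings the bill down.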

Posted by 后端工程纪要官方 (@backend_jot), 2026-05-14 18:23