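Turning a trained PyTorch model into a callable HTTP API with BentoML

Motivation

The model is trained and the .pt file is sitting on disk. For a frontend, a
mobile app, or another service to use it, it has to be wrapped in a REST API.
Hand-writing that wrapper in Flask or FastAPI costs about 100 lines of
boilerplate (loading the model, parsing input, converting between tensors and
NumPy arrays, error handling, batching). Repeat that for a few models and the
work gets very tedious.

BentoML standardizes the path from ML model to production service: you write
one service file, and it generates the HTTP / gRPC endpoints, the OpenAPI spec,
and the Docker image.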

Solution

Install

uv add bentoml torch torchvision pillow

Model store

# save_model.py
import bentoml
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
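# batchable=True is what lets the runner apply adaptive batching later
bento_model = bentoml.pytorch.save_model(
    'resnet50', model,
    signatures={'__call__': {'batchable': True, 'batch_dim': 0}},
)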
print(bento_model)
# Model(tag="resnet50:abc123...")

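The model is saved to the local store under ~/bentoml/models/, versioned by tag.
To share it across a team, use bentoml models push / pull against BentoCloud,
or export it to your own S3 bucket.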

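service.py (the core)

The code below uses the runner-based API of BentoML 1.0/1.1; releases from 1.2
onward favor a class-based @bentoml.service style instead.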

import bentoml
from bentoml.io import Image, JSON
from PIL import Image as PILImage
import torch
from torchvision import transforms

resnet = bentoml.pytorch.get('resnet50:latest').to_runner()

svc = bentoml.Service('image_classifier', runners=[resnet])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                          std=[0.229, 0.224, 0.225]),
])

with open('imagenet_classes.txt') as f:
    LABELS = [line.strip() for line in f]

@svc.api(input=Image(), output=JSON())
async def predict(img: PILImage.Image) -> list:
    # convert('RGB') guards against grayscale / RGBA uploads
    x = preprocess(img.convert('RGB')).unsqueeze(0)
    logits = await resnet.async_run(x)
    probs = torch.softmax(logits, dim=1)[0]
    top5 = torch.topk(probs, k=5)
    # .tolist() already yields plain Python ints / floats
    return [
        {'label': LABELS[idx], 'prob': prob}
        for idx, prob in zip(top5.indices.tolist(), top5.values.tolist())
    ]

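Key points:

to_runner(): lets BentoML manage the model lifecycle and batch concurrent requests
@svc.api(input=Image(), output=JSON()): declares the input and output types,
from which the OpenAPI docs and input validation are generated automatically
async_run: BentoML runs the model in a separate runner process, while the API
process awaits the result as a coroutine

Running locally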

bentoml serve service:svc --reload
# uvicorn serves on http://localhost:3000
# Swagger UI is generated automatically at http://localhost:3000/docs

Calling it:

curl -X POST http://localhost:3000/predict \
  -H 'Content-Type: image/jpeg' \
  --data-binary @cat.jpg
# [{"label": "Egyptian cat", "prob": 0.84}, ...]
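The same call can be made from Python. The following is a minimal sketch using only the standard library; the helper names build_predict_request and classify are made up for illustration, and the URL assumes the local server started above on port 3000.

```python
import json
import urllib.request


def build_predict_request(
    image_bytes: bytes,
    url: str = "http://localhost:3000/predict",  # assumed local endpoint
) -> urllib.request.Request:
    """Build the same POST the curl command sends: raw JPEG bytes as the body."""
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        headers={"Content-Type": "image/jpeg"},
    )


def classify(image_path: str) -> list:
    """Send an image file to the service and return the decoded JSON response."""
    with open(image_path, "rb") as f:
        req = build_predict_request(f.read())
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```

With the server running, classify('cat.jpg') should return the same list of label / prob entries that the curl call prints.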

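Adaptive batching

When the model is saved with a batchable signature, the runner applies adaptive
batching: a lone request is held briefly (on the order of 10 ms) to wait for
more requests, which are then merged into one batch for a single forward pass.
Throughput typically rises severalfold.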

runner = bentoml.pytorch.get('resnet50:latest').to_runner(
    method_configs={'__call__': {
        'max_batch_size': 32,
        'max_latency_ms': 100,
    }},
)

The business code needs no changes.
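To make the idea concrete, here is a toy micro-batcher written with asyncio. It is not BentoML's implementation, only a sketch of the hold-then-merge behaviour described above; the class name MicroBatcher and the doubling function standing in for a model forward pass are made up for illustration.

```python
import asyncio


class MicroBatcher:
    """Toy illustration of hold-then-merge batching (not BentoML's code)."""

    def __init__(self, batch_fn, max_batch_size=32, max_latency_ms=100):
        self.batch_fn = batch_fn            # takes a list of inputs, returns a list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait = max_latency_ms / 1000
        self.pending = []                   # (input, future) pairs waiting for a batch
        self.batch_sizes = []               # size of every batch that was executed
        self._timer = None

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self.pending.append((item, fut))
        if len(self.pending) >= self.max_batch_size:
            self._flush()                   # batch is full: run it right away
        elif self._timer is None:
            # first request of a new batch: start the latency budget
            self._timer = loop.call_later(self.max_wait, self._flush)
        return await fut

    def _flush(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        outputs = self.batch_fn([item for item, _ in batch])  # one "forward pass"
        self.batch_sizes.append(len(batch))
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)


async def demo():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs],
                           max_batch_size=4, max_latency_ms=20)
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
print(batch_sizes)  # [4, 4, 2]: ten requests served by three batched calls
```

The last batch of two only runs once the 20 ms timer fires, which is the latency cost that adaptive batching trades for throughput.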

Packaging as a Bento + Docker

# bentofile.yaml
service: 'service:svc'
include:
  - 'service.py'
  - 'imagenet_classes.txt'
python:
  packages:
    - torch
    - torchvision
    - pillow
models:
  - resnet50:latest

bentoml build
# produces ~/bentoml/bentos/image_classifier/<tag>

bentoml containerize image_classifier:latest -t image_classifier:latest
# builds the Docker image; -t pins the image tag so the docker run below finds it

docker run -p 3000:3000 image_classifier:latest

The image contains Python, the dependencies, the service code, and the model
weights, so it can be deployed as-is.

K8s deployment (BentoML Yatai)

bentoml deployment create my-deploy \
  --bento image_classifier:latest \
  --cluster prod

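Alternatively, use the Yatai operator, which manages Bentos through
Kubernetes-native CRDs.

Results

Trained model to production API: from 2 days down to 2 hours
Automatic batching quadrupled single-GPU throughput
OpenAPI docs are generated automatically, so the frontend team no longer has to chase us for the schema
Multi-version management and canary deploys are supported natively by the framework
Monitoring metrics are exposed automatically on a /metrics endpoint for Prometheus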

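Comparison with alternatives

	BentoML	TorchServe	Triton	Hand-written FastAPI
Learning curve	Medium	Medium	High	Low
Multi-framework	✅	Mainly PyTorch	✅	N/A
Automatic batching	✅	✅	✅	Write it yourself
Docker / K8s	✅	✅	✅	Write it yourself
Model store	✅	✅	✅	Build it yourself
API simplicity	Medium	Medium	Complex	Minimal

For a complex ML system, choose BentoML or Triton; for a single model under
100 QPS, a hand-written FastAPI service is lighter.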

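Pitfalls

Slow model loading in runner processes: a cold start takes from a few seconds
to tens of seconds. In production, start with
bentoml serve --production --api-workers 4 ... and send a warm-up request
before routing real traffic. Runner worker counts are set in the BentoML
configuration file rather than on the command line.
Adaptive batching adds latency: a single request's P99 can exceed 100 ms
(it waits for other requests to fill the batch). For low-QPS workloads, turn
batching off with max_batch_size=1.
Loose input/output schemas: by default JSON accepts any structure. In
production, enforce validation with pydantic via JSON(pydantic_model=MyInput).
Model weights baked into the image: a multi-GB image is slow to push. Keeping
the model in object storage and downloading it at startup gives a lighter
image; the trade-off is a slower cold start.
GPU not released: when a bentoml runner process exits, PyTorch occasionally
leaves GPU memory allocated. When restarting the service under systemd, set
KillMode=control-group so that all child processes are killed.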
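For the cold-start pitfall, a deploy script can wait for the server to report ready before sending a warm-up request or routing traffic. The following is a minimal sketch, assuming the server listens on localhost:3000 and exposes BentoML's /readyz health endpoint; the helper names http_probe and wait_until_ready are made up for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:3000/readyz") -> bool:
    """Return True if the (assumed) readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or non-2xx: not ready yet


def wait_until_ready(
    probe: Callable[[], bool] = http_probe,
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll `probe` until it succeeds or roughly `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() + interval_s > deadline:
            return False  # another sleep would overshoot the deadline
        time.sleep(interval_s)
```

Calling wait_until_ready() right after starting the container, then sending one real image through /predict, moves the slow first inference out of the user-facing path.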