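Building a text-to-image photo search engine with CLIP + Faiss (personal photo library edition)

Motivation

My phone holds 100,000 photos, and finding "the cherry blossoms I shot in Japan last year" by scrolling would take days.
Google Photos offers semantic search, but it means handing the whole library to a cloud service, so I wanted something that runs locally.

OpenAI's CLIP model encodes images and text into the same semantic vector space.
The text vector for "cherry blossoms" lands close to the image vectors of cherry blossom photos.
Run CLIP locally, add a Faiss index and a few lines of Python, and you have your own Google Photos.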
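
To make the shared-space idea concrete, here is a toy sketch with made-up 3-dimensional vectors standing in for CLIP embeddings (real embeddings have hundreds of dimensions or more, and every number below is invented for illustration). After L2-normalization, the inner product of two vectors equals their cosine similarity, which is the score used for ranking.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so inner product equals cosine similarity."""
    v = np.asarray(v, dtype="float32")
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical embeddings, invented for illustration only.
text_vec = normalize([0.9, 0.1, 0.2])      # stands in for the text "cherry blossoms"
image_vecs = normalize([
    [0.8, 0.2, 0.1],   # stands in for a cherry blossom photo
    [0.1, 0.9, 0.3],   # stands in for a beach photo
    [0.2, 0.1, 0.9],   # stands in for a photo of a receipt
])

scores = image_vecs @ text_vec             # cosine similarity per image
ranking = np.argsort(-scores)              # best match first
print(ranking, scores.round(3))
```

The photo whose vector points in nearly the same direction as the text vector ranks first; the rest of the pipeline does the same thing at scale.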

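Solution

Install

uv add open-clip-torch faiss-cpu torch pillow tqdm
# GPU acceleration (optional; check that a faiss-gpu wheel exists for your platform and Python version):
uv add faiss-gpu torch --index https://download.pytorch.org/whl/cu124

open-clip (the open_clip project) is an open-source CLIP implementation. Many of its checkpoints were trained on LAION datasets, and the larger ones outperform the original OpenAI CLIP weights.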

Step 1: Extract image features

import torch
import open_clip
from PIL import Image
from pathlib import Path
import numpy as np
from tqdm import tqdm

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

# Large, high-quality, multilingual: the xlm-roberta-large text encoder supports Chinese
model, _, preprocess = open_clip.create_model_and_transforms(
    'xlm-roberta-large-ViT-H-14',
    pretrained='frozen_laion5b_s13b_b90k',
)
model = model.to(DEVICE).eval()
tokenizer = open_clip.get_tokenizer('xlm-roberta-large-ViT-H-14')

def encode_image(path):
    img = Image.open(path).convert('RGB')
    x = preprocess(img).unsqueeze(0).to(DEVICE)
    with torch.no_grad():
        feat = model.encode_image(x)
        feat = feat / feat.norm(dim=-1, keepdim=True)
    return feat.cpu().numpy().squeeze().astype('float32')

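# Encode the whole library (extensions matched case-insensitively; .heic also needs pillow-heif)
photos_dir = Path('~/Pictures/Photos').expanduser()
exts = {'.jpg', '.jpeg', '.heic'}
paths = [p for p in photos_dir.rglob('*') if p.suffix.lower() in exts]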

features = []
valid_paths = []
for p in tqdm(paths):
    try:
        features.append(encode_image(p))
        valid_paths.append(str(p))
    except Exception as e:
        print(f'skip {p}: {e}')

features = np.stack(features)   # shape (N, 1024)
np.save('image_features.npy', features)
with open('image_paths.txt', 'w', encoding='utf-8') as f:   # UTF-8 keeps non-ASCII paths intact
    f.write('\n'.join(valid_paths))

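On a GPU, 100,000 photos take roughly 2-4 hours; on CPU expect it to be 5-10x slower. It is a one-off cost: afterwards you only index newly added photos.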
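
The loop above sends one image per forward pass, which usually leaves a GPU underused. One common speed-up is to group paths into batches and encode each batch in a single model.encode_image call. The helper below only does the grouping; chunked and the batch size of 3 are illustrative choices, not part of the original script.

```python
from itertools import islice

def chunked(items, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Placeholder file names; a real run would pass the `paths` list from Step 1.
demo_paths = [f"photo_{i}.jpg" for i in range(7)]
batches = list(chunked(demo_paths, 3))
print([len(b) for b in batches])
```

Each batch can then be preprocessed, stacked with torch.stack into a single tensor and passed to model.encode_image at once. The right batch size depends on GPU memory, so start small and raise it until memory becomes the limit.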

Step 2: Faiss index

import faiss
import numpy as np

features = np.load('image_features.npy')
N, D = features.shape   # 100000, 1024

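# Small datasets (< 1M vectors): use a flat index, exact and simple
index = faiss.IndexFlatIP(D)   # IP = inner product = cosine similarity (features are already normalized)
index.add(features)
faiss.write_index(index, 'images.index')

# Large datasets (millions of vectors): use IVF + PQ compression
# index = faiss.index_factory(D, 'IVF1024,PQ32', faiss.METRIC_INNER_PRODUCT)
# index.train(features)
# index.add(features)

A flat index over 100,000 × 1024-dim float32 vectors takes about 400 MB (100,000 × 1024 × 4 bytes). A single search takes < 10 ms.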
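
The 400 MB figure follows directly from the array shape: a flat index stores every vector as float32, 4 bytes per dimension. The sketch below redoes that arithmetic and, for comparison, estimates the code storage of the PQ32 variant from the commented-out lines. It assumes 8 bits (1 byte) per sub-quantizer code and ignores the coarse quantizer, codebooks and ID overhead. The function names are illustrative.

```python
def flat_index_bytes(n_vectors, dim, bytes_per_float=4):
    """Raw vector storage of a flat float32 index."""
    return n_vectors * dim * bytes_per_float

def pq_code_bytes(n_vectors, n_subquantizers, bits_per_code=8):
    """Storage for PQ codes only (no IDs, no codebooks, no coarse quantizer)."""
    return n_vectors * n_subquantizers * bits_per_code // 8

N, D = 100_000, 1024
flat = flat_index_bytes(N, D)
pq32 = pq_code_bytes(N, 32)

print(f"flat: {flat / 1e6:.1f} MB")                               # flat: 409.6 MB
print(f"PQ32 codes: {pq32 / 1e6:.1f} MB, ratio {flat / pq32:.0f}x")  # PQ32 codes: 3.2 MB, ratio 128x
```

At 100,000 photos the flat index fits comfortably in RAM, which is why compression only becomes interesting at much larger scales.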

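Step 3: Text → images

# Continues from Steps 1 and 2: needs torch, faiss, model, tokenizer and DEVICE
with open('image_paths.txt', encoding='utf-8') as f:
    paths = f.read().splitlines()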
index = faiss.read_index('images.index')

def search(query: str, k=12):
    tokens = tokenizer([query]).to(DEVICE)
    with torch.no_grad():
        feat = model.encode_text(tokens)
        feat = feat / feat.norm(dim=-1, keepdim=True)
    feat = feat.cpu().numpy().astype('float32')

    scores, indices = index.search(feat, k)
    return [(paths[i], scores[0][rank]) for rank, i in enumerate(indices[0])]

# Try it
for path, score in search('cherry blossoms in Japan'):
    print(f'{score:.3f}  {path}')

# Chinese query ("a golden retriever running on the beach")
for path, score in search('一只在沙滩上奔跑的金毛狗'):
    print(f'{score:.3f}  {path}')

This returns the 12 most similar images, ranked by cosine similarity.
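
For intuition, what a flat inner-product index does at query time can be reproduced in a few lines of NumPy: score the query against every stored vector, then keep the k highest. This is a sketch of the idea rather than Faiss's actual implementation, brute_force_search is a made-up name, and the random vectors are stand-ins for real image features.

```python
import numpy as np

def brute_force_search(features, query, k):
    """Exact top-k by inner product; features is (N, D), query is (D,)."""
    scores = features @ query                    # one score per stored vector
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k in O(N)
    top = top[np.argsort(-scores[top])]          # order those k by score
    return scores[top], top

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64)).astype("float32")
feats /= np.linalg.norm(feats, axis=1, keepdims=True)   # unit vectors, as in Step 1

# Query with a vector that is itself in the set: it should come back first.
scores, ids = brute_force_search(feats, feats[42], k=5)
print(ids, scores.round(3))
```

Faiss does the same computation with heavily optimized kernels, which is why an exact search over 100,000 vectors stays in the millisecond range.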

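Step 4: Web UI (about 10 lines of Streamlit)

# app.py
import streamlit as st
from PIL import Image
# ... the search function from above ...
# (Streamlit reruns this script on every interaction, so load the model and
#  index inside a function decorated with @st.cache_resource.)

st.title('My photo search')
query = st.text_input('Describe the photo you are looking for:')
if query:
    results = search(query, k=12)
    cols = st.columns(4)
    for i, (path, score) in enumerate(results):
        with cols[i % 4]:
            st.image(Image.open(path), caption=f'{score:.2f}')

streamlit run app.py
# The browser opens automatically

Type "sunset on the beach" and the top 12 matches appear within 3 seconds.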

Going further

1. Search by image (find similar photos)

def search_by_image(image_path, k=12):
    feat = encode_image(image_path)
    feat = feat.reshape(1, -1)
    scores, indices = index.search(feat, k)
    return [(paths[i], scores[0][rank]) for rank, i in enumerate(indices[0])]

"Find every photo that looks like this one." Handy for pulling together all the photos from the same trip.
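
One detail when the query image is itself in the index: it comes back as the top hit with a score near 1.0. A small post-filter can drop it. The sketch assumes results is the list of (path, score) tuples returned by search_by_image; drop_self is a hypothetical helper name and the sample results are invented.

```python
from pathlib import Path

def drop_self(results, query_path):
    """Remove the query image from (path, score) results, comparing resolved paths."""
    query = Path(query_path).resolve()
    return [(p, s) for p, s in results if Path(p).resolve() != query]

# Placeholder results, invented for illustration.
results = [("/photos/a.jpg", 1.0), ("/photos/b.jpg", 0.81), ("/photos/c.jpg", 0.77)]
print(drop_self(results, "/photos/a.jpg"))
```

Requesting k + 1 results from the index leaves k after the query image is removed.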

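2. Incremental indexing

When new photos arrive, there is no need to rebuild the whole index:

# new_paths: the newly added image files (str or Path)
new_paths = [str(p) for p in new_paths]
new_features = []
for path in new_paths:
    new_features.append(encode_image(path))
new_features = np.stack(new_features)

index.add(new_features)
faiss.write_index(index, 'images.index')

# Append to the paths file and keep the in-memory list in sync with the index
with open('image_paths.txt', 'a', encoding='utf-8') as f:
    f.write('\n' + '\n'.join(new_paths))
paths.extend(new_paths)

A flat index supports add directly. With IVF + PQ, reuse the already-trained index and call add on it; there is no need to retrain.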
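
The snippet above takes new_paths as given. One simple way to compute it, sketched here under the assumption that image_paths.txt holds one already-indexed path per line, is a set difference between what is on disk and what is in that file. find_new_photos is an illustrative helper, not part of the original script.

```python
def find_new_photos(paths_on_disk, indexed_paths):
    """Return paths present on disk but not yet indexed, preserving scan order."""
    seen = set(indexed_paths)
    new = []
    for p in map(str, paths_on_disk):
        if p not in seen:
            seen.add(p)        # also guards against duplicates within the scan
            new.append(p)
    return new

# Placeholder data; in practice the first list comes from rglob and the
# second from reading image_paths.txt.
on_disk = ["/p/a.jpg", "/p/b.jpg", "/p/c.jpg", "/p/c.jpg"]
indexed = ["/p/a.jpg"]
print(find_new_photos(on_disk, indexed))
```

Comparing plain strings only works if both sides use the same path form, so it is safest to store absolute paths consistently.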

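3. Filtering by metadata

# Read capture time / GPS / camera model from EXIF
from PIL import Image
from PIL.ExifTags import TAGS

def get_exif(path):
    exif = Image.open(path).getexif()
    merged = dict(exif)
    merged.update(exif.get_ifd(0x8769))   # Exif sub-IFD, where DateTimeOriginal lives
    return {TAGS.get(k, k): v for k, v in merged.items()}

def get_year(path):
    # A minimal helper: EXIF timestamps look like '2023:04:01 10:30:00'; None if missing
    exif = get_exif(path)
    stamp = exif.get('DateTimeOriginal') or exif.get('DateTime')
    return int(str(stamp)[:4]) if stamp else None

# Filter search results by metadata (fetch extra results first, since filtering drops some)
results = search('cherry blossoms', k=100)
filtered = [(p, s) for p, s in results if get_year(p) == 2023]

More advanced: store the metadata in SQLite and join against it.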
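
A rough sketch of that SQLite idea using the standard-library sqlite3 module. The table name, column names and sample rows are assumptions made for illustration; in a real setup the rows would be filled from EXIF at indexing time, and the results list would come from search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # use a file path instead to persist between runs
conn.execute("CREATE TABLE photos (path TEXT PRIMARY KEY, year INTEGER, camera TEXT)")
conn.executemany(
    "INSERT INTO photos VALUES (?, ?, ?)",
    [
        ("/p/kyoto_1.jpg", 2023, "iPhone 14"),
        ("/p/kyoto_2.jpg", 2022, "iPhone 13"),
        ("/p/tokyo_1.jpg", 2023, "X100V"),
    ],
)

def filter_by_year(conn, results, year):
    """Keep (path, score) results whose photo was taken in `year`, order preserved."""
    if not results:
        return []
    marks = ",".join("?" * len(results))
    rows = conn.execute(
        f"SELECT path FROM photos WHERE year = ? AND path IN ({marks})",
        [year, *(p for p, _ in results)],
    )
    keep = {row[0] for row in rows}
    return [(p, s) for p, s in results if p in keep]

# Placeholder search output, invented for illustration.
results = [("/p/kyoto_2.jpg", 0.31), ("/p/kyoto_1.jpg", 0.29), ("/p/tokyo_1.jpg", 0.12)]
print(filter_by_year(conn, results, 2023))
```

Keeping metadata in a table avoids reopening every image file at query time, and the same pattern extends to camera, GPS region or album columns.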

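4. Choosing a CLIP model

Model	Size	Speed	Quality	Languages
ViT-B/32	150 MB	Fast	Medium	English only
ViT-L/14	430 MB	Medium	High	English only
ViT-H/14	1.1 GB	Slow	Very high	English only
xlm-roberta + ViT-H	4 GB	Slow	Very high	Multilingual
siglip-large	1 GB	Medium	Very high	Depends on variant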

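For Chinese queries, be sure to pick a multilingual model (the xlm-roberta CLIP or chinese-clip). Treat the sizes above as rough: the figures for the three English-only ViT models appear to track parameter counts more closely than on-disk checkpoint sizes, which are larger at fp32.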

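5. Faiss on large datasets

For millions to tens of millions of photos:

# IVF: partition the vectors into buckets; a query only scans the nearest few buckets
nlist = int(np.sqrt(N))   # number of buckets
quantizer = faiss.IndexFlatIP(D)
index = faiss.IndexIVFFlat(quantizer, D, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(features)
index.add(features)
index.nprobe = 8           # scan 8 buckets per query (the default is 1; higher means better recall but slower search)

At tens of millions of vectors, add PQ compression on top of IVF: it trades a little accuracy for a large cut in memory, on the order of 50x or more depending on the PQ code size.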
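
Since IVF and PQ are approximate, it helps to measure how much recall a given nprobe costs before trusting it. A common check, sketched below, is to run the same queries against a flat index and against the approximate one, then compute the fraction of exact top-k ids that the approximate search also returned. recall_at_k is a made-up helper, and the id lists are placeholders for the indices arrays that index.search returns.

```python
def recall_at_k(exact_ids, approx_ids):
    """Mean fraction of exact top-k ids recovered by the approximate search.

    Both arguments are sequences of per-query id lists, one entry per query.
    """
    if len(exact_ids) != len(approx_ids):
        raise ValueError("need one approximate result list per exact list")
    total = 0.0
    for exact, approx in zip(exact_ids, approx_ids):
        exact_set = set(exact)
        total += len(exact_set & set(approx)) / len(exact_set)
    return total / len(exact_ids)

# Placeholder ids for two queries with k = 4.
exact = [[3, 7, 9, 1], [5, 2, 8, 0]]
approx = [[3, 7, 1, 4], [5, 2, 8, 0]]
print(recall_at_k(exact, approx))   # 0.875
```

If the number comes out too low, raising nprobe is the first knob to try; it costs search time but no retraining.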

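6. Deploying to a phone

Once exported to ONNX or Core ML, the CLIP model can run on-device:

dummy_image = torch.randn(1, 3, 224, 224, device=DEVICE)   # one RGB image at the model's 224x224 input size
torch.onnx.export(model.visual, dummy_image,
                  'clip_vision.onnx', opset_version=14)

On iOS, Apple's Core ML tooling is more direct. Encoding a single image on the phone takes < 200 ms.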

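Full results

On my real library (40,000 photos):

Index size: 160 MB
Text search, single query: ~30 ms
"Cherry blossoms in Kyoto last year" → 95% recall (the misses were blossoms hidden behind branches)
"A person wearing sunglasses" → 90%
"A group photo" → 85%
"Blue sky" → 100%, but far too many matches
"My dad" → 0% (no face recognition ability)

CLIP is strong on semantic concepts and weak on specific people and on text inside images (OCR).
Those two need a dedicated face recognition + OCR pipeline working alongside it.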

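Comparison with alternatives

	Self-hosted CLIP	Google Photos	Apple Photos	immich (open source)
Privacy	✅	❌	Partial (on-device)	✅
Flexibility	High	Low	Medium	Medium
Face recognition	No (add it yourself)	✅	✅	✅
Semantic search	✅ (CLIP)	✅	Limited	✅ (CLIP)
Multi-device	Build it yourself	✅	✅	✅

If you would rather not build from scratch: immich is an open-source Google Photos alternative
with built-in CLIP search, face recognition and multi-device sync.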

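Pitfalls

HEIC: Pillow cannot open the .heic files that iPhones produce by default.
pip install pillow-heif, then call pillow_heif.register_heif_opener() once at startup.
GPU out of memory: the H/14 model can still OOM at batch=1 on a small GPU. (preprocess resizes every image to the model's input size, so the weights and activations are the cost, not the source resolution.)
Half precision halves the memory: model = model.half(), then feat = model.encode_image(x.half()).
Non-ASCII paths: on Windows, paths containing Chinese characters occasionally get mangled. Read and write path files as UTF-8 and
convert everything to absolute paths.
Forgetting to update after adding photos: run the incremental encode + index.add on every sync.
A cron job takes care of it.
Poor face matching: CLIP is not good at telling whether two faces belong to the same person.
For face recognition, add a dedicated model such as ArcFace or FaceNet.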