Feast feature store: one set of feature code for training and online inference

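Background

Our recommendation model computed "clicks per user over the past 30 days"
with pandas at training time, while online inference used a Redis lookup.
Two code paths inevitably drift: once, the pandas code used the wrong
timezone, the window stopped matching, and the offline gains fell apart online.
Training/serving feature skew is one of the most common pain points in
production ML.

Feast is an open-source feature store. It keeps feature definitions in one
place, and both training and online serving read from that same definition.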

Solution

Install

uv add 'feast[redis]'   # the redis extra installs the client for the Redis online store used below
feast version

Initialize the project

feast init my_store
cd my_store/feature_repo

This generates a sample example_repo.py.

Define features

# feature_repo/user_features.py
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64, String

user = Entity(name='user_id', value_type=ValueType.INT64)

# Data source: a parquet file (use BigQuery / Snowflake / S3 etc. in production)
user_clicks_source = FileSource(
    path='data/user_clicks.parquet',
    timestamp_field='event_time',
)

user_clicks_30d = FeatureView(
    name='user_clicks_30d',
    entities=[user],
    ttl=timedelta(days=30),
    # schema takes Field objects with dtypes from feast.types
    schema=[
        Field(name='click_count', dtype=Int64),
        Field(name='avg_dwell_seconds', dtype=Float32),
        Field(name='top_category', dtype=String),
    ],
    source=user_clicks_source,
    online=True,
)

feast apply
# Registers the entity / feature view metadata

Fetch historical features for training

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path='feature_repo')

# Training samples (user_id + label + event_timestamp).
# Feast expects the entity timestamp column to be named event_timestamp.
training_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'event_timestamp': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-03'], utc=True
    ),
    'label': [1, 0, 1],
})

# Point-in-time join of the features (guarantees no future data is used)
training = store.get_historical_features(
    entity_df=training_df,
    features=[
        'user_clicks_30d:click_count',
        'user_clicks_30d:avg_dwell_seconds',
        'user_clicks_30d:top_category',
    ],
).to_df()

# Train; `model` is any estimator with a scikit-learn style fit()
X = training[['click_count', 'avg_dwell_seconds']]
y = training['label']
model.fit(X, y)

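For each row of entity_df, Feast looks up the feature values in the source
as of that row's event_timestamp: the latest row at or before that time,
within the view's ttl. This avoids data leakage (a historical sample is
never trained on a future click_count).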
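
To make the point-in-time rule concrete, here is a minimal pure-Python sketch of the lookup semantics described above. It illustrates the idea only and is not Feast's implementation; the function name point_in_time_lookup, the helper utc, and the sample click history are all made up for this example.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone


def point_in_time_lookup(history, as_of, ttl):
    """Return the latest value observed at or before `as_of`, or None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Values older than `as_of - ttl` are treated as expired.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of) - 1  # last row with ts <= as_of
    if idx < 0:
        return None  # nothing observed yet; never peek at later rows
    ts, value = history[idx]
    if as_of - ts > ttl:
        return None  # the newest usable value is too old
    return value


def utc(year, month, day):
    """Shorthand for a timezone-aware UTC midnight timestamp."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical click_count history for a single user.
clicks = [(utc(2023, 12, 20), 5), (utc(2024, 1, 2), 9), (utc(2024, 1, 10), 14)]
ttl = timedelta(days=30)

# A sample dated 2024-01-03 sees the value from 2024-01-02,
# not the later one from 2024-01-10.
sample_value = point_in_time_lookup(clicks, utc(2024, 1, 3), ttl)
```

A sample dated before the first observation, or more than ttl after the last one, gets None instead of a value borrowed from the wrong time.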

Online materialization (push recent features to Redis)

feast materialize-incremental $(date -u +%Y-%m-%dT%H:%M:%S)
# Writes the features from last_materialization → now to the online store (Redis)
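
The time range an incremental run covers can be modelled in a few lines. This is a simplified sketch of the behaviour described in this article (resume from the last materialization, or fall back to now minus ttl on the first run), not Feast's actual code; incremental_window is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def incremental_window(last_materialized, ttl, now):
    """Return the (start, end) range an incremental run would cover.

    With no previous run, the window only reaches back `ttl` from `now`,
    so anything older than that is never loaded.
    """
    start = last_materialized if last_materialized is not None else now - ttl
    return start, now


now = datetime(2024, 2, 1, tzinfo=timezone.utc)
ttl = timedelta(days=30)

# First run: nothing materialized yet, so the window starts at now - ttl.
first_start, first_end = incremental_window(None, ttl, now)

# Later run: resume from where the previous run stopped.
previous = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
next_start, next_end = incremental_window(previous, ttl, now)
```

The first-run case is why a one-off full `feast materialize <start> <end>` is the safer way to bootstrap an online store.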

Run it on a schedule (hourly / daily):

# /etc/systemd/system/feast-materialize.timer
# Triggers feast-materialize.service (not shown) at the top of every hour
[Timer]
OnCalendar=*-*-* *:00:00

Read features at online inference time

features = store.get_online_features(
    features=[
        'user_clicks_30d:click_count',
        'user_clicks_30d:avg_dwell_seconds',
    ],
    entity_rows=[{'user_id': 42}],
).to_dict()

x = [[features['click_count'][0], features['avg_dwell_seconds'][0]]]
prediction = model.predict(x)
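
An entity that has never been materialized comes back from the online store with None values, which most models cannot take as input. A minimal sketch of one way to handle that, assuming a dict shaped like the to_dict() result above; to_model_row and the fallback values are hypothetical and should be chosen to match how the model was trained.

```python
def to_model_row(online, feature_names, defaults):
    """Build one model input row from a get_online_features(...).to_dict() result."""
    row = []
    for name in feature_names:
        value = online[name][0]  # one entity row was requested
        row.append(defaults[name] if value is None else value)
    return row


# Shape of the dict for a user whose dwell time was never materialized.
online = {'user_id': [42], 'click_count': [17], 'avg_dwell_seconds': [None]}
defaults = {'click_count': 0, 'avg_dwell_seconds': 0.0}  # hypothetical fallbacks

row = to_model_row(online, ['click_count', 'avg_dwell_seconds'], defaults)
```

Whatever fallback is used online should match the imputation used on the training data, otherwise the skew comes back through the defaults.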

A single FeatureView definition fixes what "click_count" means for both
training and online serving, so the two paths can no longer drift apart.

Online store backend

Feast supports several online stores:

# feature_repo/feature_store.yaml
online_store:
  type: redis
  connection_string: 'localhost:6379'

# or:
online_store:
  type: dynamodb
  region: us-east-1
  # table names are derived by Feast from the project and feature view names

# or:
online_store:
  type: sqlite       # for dev

Supported data sources

offline_store:
  type: file                   # parquet
# or bigquery / snowflake / redshift / spark / trino

Different companies run different stacks; Feast puts one abstraction over
all of these sources.

Workflow in practice

                   ┌─────────────────────────┐
                   │ FeatureView definition  │
                   │ (Python code)           │
                   └────────────┬────────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
          ▼                     ▼                     ▼
┌───────────────────┐   ┌───────────────┐   ┌───────────────────┐
│ Training pipeline │   │ materialize   │   │ Online inference  │
│ get_historical    │   │ → Redis       │   │ get_online        │
└───────────────────┘   └───────────────┘   └───────────────────┘

The feature definition is the single source of truth.

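When to adopt a feature store

Several models share the same features (user profiles / item profiles)
Training/serving skew keeps offline results from reproducing online
The feature engineering team and the ML team are separate (a feature ownership model)
You need point-in-time correctness (to prevent data leakage)

When not to (overkill):

A single model with simple features (just compute the mean / sum directly)
Fewer than 5 ML engineers on the team
Nothing is served online / predictions are batch only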

Alternatives compared

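	Feast (OSS)	Hopsworks (commercial)	Tecton (commercial)	DIY (Redis + DAG)
Complexity	Medium	Medium	High	Low to high
Price	Free	Paid	Paid	Free
Online store	Pluggable	Mainly RonDB	Several	Your choice
Offline store	Pluggable	Built-in + S3	Several	Your choice
Real-time features	Weaker	Strong	Strong	Depends on the implementation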

Small and mid-size teams: Feast. Large enterprises: Tecton / Hopsworks
(they come with operational support). Minimal teams can get by with
Redis + cron.

Integrating with the training pipeline

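# Kubeflow / Airflow / Dagster pipeline sketch. `task` stands in for the
# orchestrator's task decorator; `model` and `save_to_s3` are placeholders.
from datetime import datetime, timezone
from feast import FeatureStore

FEATURES = [
    'user_clicks_30d:click_count',
    'user_clicks_30d:avg_dwell_seconds',
]

@task
def fetch_features(entity_df):
    store = FeatureStore(repo_path='feature_repo')
    return store.get_historical_features(entity_df=entity_df, features=FEATURES).to_df()

@task
def train(df):
    model.fit(df[['click_count', 'avg_dwell_seconds']], df['label'])
    return model

@task
def deploy(model):
    save_to_s3(model)
    # Refresh the online store so serving sees current feature values
    FeatureStore(repo_path='feature_repo').materialize_incremental(
        end_date=datetime.now(timezone.utc)
    )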

Fetching features is the first pipeline step; the later train and deploy
steps go through Feast as well.

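Results

For one of our churn prediction models, after adopting Feast:

Training/online feature consistency: 100% (previously ~95%; the 5% drift was caused by bugs)
Time to ship a new model: 2 weeks → 3 days (features are reused from existing views instead of being rewritten)
Several models share the user_features view, so nothing is computed twice
Debugging questions like "why is this user's score so low": query the
online features through Feast and compare them with the training distribution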
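
A consistency figure like the one above can be computed by comparing, per entity, the values the training path produced with the values the online path serves. The sketch below is one possible way to do it, not necessarily how our number was measured; consistency_rate and the sample values are hypothetical.

```python
import math


def consistency_rate(offline_rows, online_rows, rel_tol=1e-6):
    """Fraction of (entity, feature) pairs whose offline and online values agree.

    Both arguments map entity id -> {feature name: value}.
    Floats are compared with a relative tolerance, everything else exactly.
    """
    total = 0
    matched = 0
    for entity_id, offline in offline_rows.items():
        online = online_rows.get(entity_id, {})
        for name, expected in offline.items():
            total += 1
            actual = online.get(name)
            if isinstance(expected, float) and isinstance(actual, float):
                matched += math.isclose(expected, actual, rel_tol=rel_tol)
            else:
                matched += expected == actual
    return matched / total if total else 1.0


# Hypothetical values: user 2's click_count disagrees between the two paths.
offline = {
    1: {'click_count': 12, 'avg_dwell_seconds': 3.5},
    2: {'click_count': 7, 'avg_dwell_seconds': 1.25},
}
online = {
    1: {'click_count': 12, 'avg_dwell_seconds': 3.5},
    2: {'click_count': 8, 'avg_dwell_seconds': 1.25},
}

rate = consistency_rate(offline, online)
```

The mismatching pairs are usually more useful than the rate itself, since they point straight at the feature whose two implementations disagree.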

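Pitfalls

event_time timezone: Feast treats timezone-naive timestamps as UTC, so
passing naive local times shifts every window by the UTC offset. Make
timestamps timezone-aware and in UTC before handing them to Feast, e.g.
pd.to_datetime(..., utc=True) for values that are already UTC.
Materialize missing data: materialize-incremental starts from
last_materialization. If nothing was materialized before, it starts from
now minus ttl, so older rows can be missed. For the first run, use
materialize <start> <end> to do a full load.
Online store consistency: if a single-node Redis goes down during
materialize, some features are written and some are not, and online reads
return a mix of old and new values. Redis Cluster helps with availability,
but it does not make a materialize run atomic.
Schema changes: adding a field is easy (re-apply the FeatureView).
Removing a field or changing its type is painful and requires
re-materializing the whole entity population.
Slow point-in-time joins: a million-row entity_df joined across several
feature views takes minutes to tens of minutes. In production, use a
BigQuery / Snowflake source so the join is pushed down to the warehouse,
which is faster.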
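
For the timezone pitfall at the top of this list, the important detail is that naive local timestamps must first be tagged with their real zone and only then converted; labelling them as UTC directly keeps the wrong wall-clock time. A minimal stdlib sketch, assuming the source rows were recorded in UTC+8 (LOCAL_TZ and to_utc are hypothetical names for this example):

```python
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=8))  # assumption: source rows are in UTC+8


def to_utc(naive_local):
    """Attach the real source timezone to a naive timestamp, then convert to UTC."""
    if naive_local.tzinfo is not None:
        raise ValueError('expected a timezone-naive timestamp')
    return naive_local.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc)


local_event = datetime(2024, 1, 1, 3, 0, 0)  # 03:00 local wall-clock time
utc_event = to_utc(local_event)              # 19:00 on the previous day in UTC
```

An event at 03:00 local time on January 1 belongs to December 31 in UTC, which is exactly the kind of off-by-one-day window error described in the background section.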