pandas memory optimization: dtype downcasting / categorical / sparse take 5 GB → 800 MB

Background

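Loading a "user behavior log" CSV, 10M rows × 30 columns:
pd.read_csv('events.csv') → memory climbs to 5 GB, then OOM.
The machine has only 8 GB of RAM.
Inspecting df.info(memory_usage='deep') together with per-column df.memory_usage(deep=True):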

RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 30 columns):
 #   Column          Dtype          MemUsage (MB)
---  ------          -----          -------------
 0   event_type      object         620
 1   user_id         int64          76
 2   country         object         580
 3   device_type     object         610
 4   amount          float64        76
 ...
dtypes: float64(8), int64(7), object(15)
Memory: 5102 MB

Most of the memory goes to the string columns: a pandas object column is an
array of pointers to Python str objects, and each string carries 40+ bytes of overhead.
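To get a feel for that per-string overhead, here is a small standard-library sketch. It assumes a 64-bit CPython build; the helper name object_column_bytes is made up for illustration, and exact byte counts vary by Python version, so it prints what it measures rather than promising specific numbers.

```python
import sys

def object_column_bytes(values):
    """Rough cost of storing strings the way an object column does:
    one 8-byte pointer per row plus one Python str object per row
    (assumes 64-bit CPython and no sharing of str objects between rows)."""
    pointer_bytes = 8 * len(values)
    str_bytes = sum(sys.getsizeof(v) for v in values)
    return pointer_bytes + str_bytes

sample = ["click", "view", "purchase", "click"]
payload = sum(len(v) for v in sample)   # bytes of actual text (ASCII)
total = object_column_bytes(sample)

print(f"payload: {payload} bytes, stored: {total} bytes")
print(f"overhead per row: {(total - payload) / len(sample):.0f} bytes")
```

The gap between payload and stored size is what categorical and Arrow-backed strings claw back.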

The six techniques below bring the same data down to 800 MB:

1. categorical: low-cardinality strings

df['event_type'] = df['event_type'].astype('category')
df['country'] = df['country'].astype('category')
df['device_type'] = df['device_type'].astype('category')

categorical maps each string to an int8/int16 code plus a single dictionary of unique values.
event_type has 5 values across 10M rows → 1 byte / row (an int8 code) instead of ~60 bytes / row.
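The code-plus-dictionary layout is easy to inspect through the .cat accessor. A minimal sketch on made-up data (the event names here are illustrative, not taken from the real log):

```python
import pandas as pd

# 6,000 rows but only 3 distinct strings
events = pd.Series(["click", "view", "click", "purchase", "view", "click"] * 1000)
cat = events.astype("category")

categories = cat.cat.categories.tolist()   # the dictionary of unique values
first_codes = cat.cat.codes[:6].tolist()   # one small integer per row
code_dtype = cat.cat.codes.dtype           # int8 here, since there are only 3 categories

print(categories, first_codes, code_dtype)
print("before:", events.memory_usage(deep=True), "bytes")
print("after: ", cat.memory_usage(deep=True), "bytes")
```

Each row now costs one byte for its code, and the strings themselves are stored once in the categories index.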

Result:

 0   event_type      category       10        (was 620 MB)
 2   country         category       11        (was 580 MB)
 3   device_type     category       12        (was 610 MB)

Three columns go from 1.8 GB → 33 MB.

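Deciding whether a column should be categorical:

# ratio of unique values to total rows
df['country'].nunique() / len(df)
# 0.00002 = 200 / 10M → strong categorical candidate
# 0.5 = 5M unique / 10M → should not be categorical (the overhead can exceed the savings)

# rule of thumb: categorical pays off when ratio < 0.05

String columns where more than 50% of the values are unique (such as URL / session_id) are better left as object or string[pyarrow].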
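The ratio check can be run over every string column at once. Below is a sketch; cardinality_report is a hypothetical helper name, the demo frame is synthetic, and the 0.05 default simply reuses the rule of thumb above.

```python
import pandas as pd

def cardinality_report(df, threshold=0.05):
    """List each object column with its unique/total ratio and a verdict.
    The threshold is a rule of thumb, not a hard limit; tune it for your data."""
    records = []
    for col in df.select_dtypes(include=["object"]).columns:
        ratio = df[col].nunique() / max(len(df), 1)  # guard against empty frames
        records.append((col, ratio, ratio < threshold))
    return pd.DataFrame(records, columns=["column", "unique_ratio", "use_category"])

demo = pd.DataFrame(
    {
        "country": ["US", "DE", "FR", "JP"] * 250,         # 4 unique / 1000 rows
        "session_id": [f"s{i:04d}" for i in range(1000)],  # every value unique
    },
    dtype=object,
)
report = cardinality_report(demo)
print(report)
```

Columns that land between the two extremes are worth measuring both ways with memory_usage(deep=True) before deciding.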

2. Downcasting numeric columns

df['user_id'].max()    # 200_000_000  → fits in int32 (max 2.1B)
df['amount'].max()     # 9999.99      → float32 is enough (~7 significant digits; default is float64)
df['hour_of_day'].max()  # 23           → int8 (max 127)

df = df.astype({
    'user_id': 'int32',
    'amount': 'float32',
    'hour_of_day': 'int8',
    'minute_of_day': 'int16',   # 0..1439 does not fit in int8 (max 127)
})

int64 → int32 halves the size; int8 cuts it by 8x.
float64 → float32 halves the size; for some data (ML features) even float16 is OK.

Helper:
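Before narrowing an integer column it is worth confirming that its range fits the target type, since a NumPy-backed astype can wrap out-of-range values without raising. A minimal sketch of such a check; fits is a hypothetical helper, and the sample values are made up:

```python
import numpy as np
import pandas as pd

def fits(series, dtype):
    """True if every value in an integer series lies within dtype's range."""
    info = np.iinfo(np.dtype(dtype))
    return bool(info.min <= series.min() and series.max() <= info.max)

hour_of_day = pd.Series([0, 12, 23])
minute_of_day = pd.Series([0, 600, 1439])

hour_ok_int8 = fits(hour_of_day, "int8")        # 23 <= 127
minute_ok_int8 = fits(minute_of_day, "int8")    # 1439 > 127
minute_ok_int16 = fits(minute_of_day, "int16")  # 1439 <= 32767

print(hour_ok_int8, minute_ok_int8, minute_ok_int16)
```

pd.to_numeric(..., downcast='integer') performs an equivalent range check internally, which is why it is the safer default when the target type is not known in advance.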

def shrink_ints(df):
    for col in df.select_dtypes(include=['int64']).columns:
        c = df[col]
        mx, mn = c.max(), c.min()
        if mn >= 0:
            if mx < 256:           df[col] = c.astype('uint8')
            elif mx < 65536:        df[col] = c.astype('uint16')
            elif mx < 4294967296:   df[col] = c.astype('uint32')
        else:
            if -128 <= mn and mx < 128:       df[col] = c.astype('int8')
            elif -32768 <= mn and mx < 32768:  df[col] = c.astype('int16')
            elif -2147483648 <= mn and mx < 2147483648:
                df[col] = c.astype('int32')
    return df

Run it once to downcast every int64 column automatically.

3. sparse for mostly-zero columns

# a dummy-variable column that is 90% zeros
df['premium'] = df['premium'].astype('Sparse[int8, 0]')

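Only the non-zero values and their positions are stored.
9M zeros + 1M ones → 10 MB as dense int8 → roughly 5 MB sparse
(1 byte per value plus a 4-byte int32 position for each non-zero entry).

A good fit for the sparse matrices produced by one-hot encoding.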
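Whether sparse pays off depends on the density, so it helps to measure it on the actual column. A sketch on synthetic data (a random flag that is about 10% ones; the seed and size are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dense = pd.Series((rng.random(100_000) < 0.1).astype("int8"))  # ~10% ones
sparse = dense.astype(pd.SparseDtype("int8", 0))               # 0 is the fill value

density = sparse.sparse.density          # fraction of values actually stored
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)

print(f"density: {density:.3f}")
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")

# Converting back gives the original values
restored = sparse.sparse.to_dense()
```

Because every stored value also needs a position, an int8 column stops benefiting once the density gets much above roughly 20%; wider dtypes tolerate more.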

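4. Parquet instead of CSV

Reading and writing are much faster, and the file is much smaller:

df.to_parquet('events.parquet', compression='zstd')   # needs pyarrow or fastparquet
# 5GB CSV → 800MB Parquet (columnar storage + compression)

df = pd.read_parquet('events.parquet')
# 5-10x faster than read_csv
# dtypes are preserved automatically (including categorical)

With read_csv, specify dtype as well so pandas does not have to guess: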

df = pd.read_csv('events.csv', dtype={
    'event_type': 'category',
    'country': 'category',
    'amount': 'float32',
    'user_id': 'int32',
})

This avoids the intermediate memory peak of loading as the default int64 / float64 / object and converting afterwards.
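The effect of the dtype argument can be checked on a tiny in-memory CSV before pointing it at the real file. A minimal sketch with made-up rows; the column names match the example above:

```python
import io
import pandas as pd

csv_text = (
    "event_type,country,amount,user_id\n"
    "click,US,9.99,1\n"
    "view,DE,0.0,2\n"
    "click,US,19.5,3\n"
)

dtypes = {
    "event_type": "category",
    "country": "category",
    "amount": "float32",
    "user_id": "int32",
}

# Columns are parsed straight into the requested types
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(df.dtypes)
```

Columns missing from the dtype mapping still fall back to inference, so it is worth listing at least every string column.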

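5. chunked read: process in chunks

If the data still does not fit after conversion, stream it chunk by chunk:

partials = []
for chunk in pd.read_csv('events.csv', chunksize=100_000):
    chunk = shrink_ints(chunk)   # helper from section 2
    # business logic / per-chunk aggregation (example: total amount per user)
    partials.append(chunk.groupby('user_id')['amount'].sum())

# finally, combine the partial results by re-aggregating on the same key
final = pd.concat(partials).groupby(level=0).sum()

Or just use polars (covered in an earlier post), which supports streaming and multi-core natively.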
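The chunk-then-combine pattern can be tried end to end on a small in-memory CSV. In this sketch the data is synthetic and the per-user sum stands in for whatever aggregation the real job needs:

```python
import io
import pandas as pd

# 10 rows: user_id cycles 0, 1, 2 and amount is the row number
csv_text = "user_id,amount\n" + "".join(f"{i % 3},{i}\n" for i in range(10))

partials = []
reader = pd.read_csv(
    io.StringIO(csv_text),
    chunksize=4,  # tiny on purpose, so each user spans several chunks
    dtype={"user_id": "int32", "amount": "float32"},
)
for chunk in reader:
    # Per-chunk partial aggregate: small enough to keep in memory
    partials.append(chunk.groupby("user_id")["amount"].sum())

# The same user shows up in several partials, so aggregate once more
final = pd.concat(partials).groupby(level=0).sum()
print(final)
```

Sums, counts, mins and maxes combine this way directly. A mean does not: carry a sum and a count per chunk and divide at the end.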

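6. PyArrow strings instead of object

pandas has a string[pyarrow] dtype (introduced in 1.3, with much broader Arrow support since 2.0) that uses less memory than object and is faster:

df['session_id'] = df['session_id'].astype('string[pyarrow]')   # requires pyarrow

Suited to high-cardinality strings, where categorical does not pay off.
40-60% less memory than object, and groupby / sort run 2-3x faster.

pd.options.future.infer_string = True (pandas 2.1+) makes read_csv infer a
PyArrow-backed string dtype by default (slated to become the default in pandas 3.0).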

A practical script

import pandas as pd

def shrink(df):
    """通用 dtype 收缩 + categorical 自动检测"""
    df = df.copy()

    # 1. integer downcast
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')

    # 2. float downcast (float32 keeps about 7 significant digits)
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')

    # 3. low-cardinality string → category
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.05:
            df[col] = df[col].astype('category')

    return df

df = pd.read_csv('events.csv')
print('before:', df.memory_usage(deep=True).sum() / 1e9, 'GB')

df = shrink(df)
print('after: ', df.memory_usage(deep=True).sum() / 1e9, 'GB')

Typically about 5x memory reduction.

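Results

My 10M × 30 dataset:

Step	Memory
Default read_csv	5.1 GB
+ int / float downcast	3.2 GB
+ categorical (3 cols)	1.4 GB
+ string[pyarrow] (1 col)	1.1 GB
+ sparse (2 cols)	800 MB

The whole pass takes < 30 seconds and keeps the data's meaning intact (apart from
float32 rounding). Later groupby / merge calls are also much faster
(smaller data = better cache hit rates).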

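When to consider switching tools

If pandas is still not enough after all of this, switch to:

polars: native streaming + multi-core, 5-30x faster than pandas
DuckDB: runs SQL directly on Parquet / CSV, memory-efficient
Dask: pandas-like API + out-of-core + clusters
Spark: cluster-scale datasets

But for up to 100 GB on a single machine, pandas plus these techniques is usually enough.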

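Pitfalls I ran into

Slow joins after categorical: the categorical key columns of the two DataFrames
need identical categories, otherwise pandas converts the key back to object.
Align them first: df['x'] = df['x'].cat.set_categories(combined_cats).

pd.read_csv without dtype: pandas first reads everything as object / int64 / float64,
which takes a lot of memory, and the conversion only happens after the read finishes.
Specifying dtype cuts peak memory by 2-3x.

Some operations do not work on categorical: assigning a value that is not among the
categories raises an error, and .str results come back as object rather than category,
so the savings disappear. When needed, use .astype('object').str.lower() and re-cast afterwards.

sparse has poor numpy/sklearn compatibility: many sklearn estimators do not
accept SparseArray, so you have to call .sparse.to_dense() first. It is a trade-off.

string[pyarrow] is still evolving: in pandas 2.x some functionality (such as groupby)
still falls back to object. Follow the changelog.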
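For the join pitfall, one way to keep the key categorical is to cast both sides to a single shared CategoricalDtype before merging. A sketch on made-up frames; the column names are illustrative, and building the shared dtype from the union of both category sets is just one reasonable choice:

```python
import pandas as pd

left = pd.DataFrame({"country": pd.Categorical(["US", "DE", "US"]), "clicks": [1, 2, 3]})
right = pd.DataFrame({"country": pd.Categorical(["US", "FR"]), "region": ["NA", "EU"]})

# Different category sets on each side: the merged key loses its category dtype
mismatched = left.merge(right, on="country", how="left")

# Build one dtype from the union of both category sets and cast both sides to it
all_categories = left["country"].cat.categories.union(right["country"].cat.categories)
shared = pd.CategoricalDtype(all_categories)
left_aligned = left.assign(country=left["country"].astype(shared))
right_aligned = right.assign(country=right["country"].astype(shared))

aligned = left_aligned.merge(right_aligned, on="country", how="left")
print(mismatched["country"].dtype, "->", aligned["country"].dtype)
```

Defining the dtype once and passing it to read_csv for every file avoids the mismatch from the start.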