
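## Background

Common requests a data team gets:

- build an interactive dashboard for PMs / business stakeholders
- give an ML model a demo UI
- internal tools (look up a user / run a cohort analysis)

A full frontend (React + API) takes several days. **streamlit** / **gradio** describe the UI in pure Python, and a usable app is ready in about 5 minutes.

## streamlit

```python
# app.py
import streamlit as st
import pandas as pd

st.title('Sales Dashboard')

uploaded = st.file_uploader('Upload CSV', type='csv')
if uploaded:
    df = pd.read_csv(uploaded)
    st.dataframe(df)

    country = st.selectbox('Country', df.country.unique())
    filtered = df[df.country == country]

    st.bar_chart(filtered.groupby('product').amount.sum())
    st.metric('Total sales', f"${filtered.amount.sum():,.0f}")
```

```bash
streamlit run app.py
# the browser opens localhost:8501 automatically
```

The whole app is a single .py file, written top to bottom in imperative style. Every widget interaction reruns the script from the top (that is the reactive model).

## gradio

```python
# app.py
import gradio as gr

def predict(image):
    # run the model here; the values below are placeholders
    label, score = 'cat', 0.93
    return label, score

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(),
    outputs=[gr.Label(), gr.Number()],
    title='Image Classifier',
)
demo.launch()
```

gradio is more function-centric: define the inputs/outputs and it wraps a UI around the function automatically.

## Positioning

- **streamlit**: general dashboards / internal tools (many interacting widgets)
- **gradio**: ML model demos (input → run → output)

streamlit is a layout + state-aware app framework; gradio is a function demo wrapper.

## streamlit details

### session state

```python
import streamlit as st

if 'count' not in st.session_state:
    st.session_state.count = 0

if st.button('+1'):
    st.session_state.count += 1

st.write(f'count: {st.session_state.count}')
```

`st.session_state` persists across reruns of the script (within the same browser session).

### cache

```python
@st.cache_data
def load_data(path):
    return pd.read_csv(path)  # slow operation

df = load_data('big.csv')  # repeated calls hit the cache
```

`@st.cache_data` caches data (return values); `@st.cache_resource` caches resources (such as a model). Without a cache, every widget click rereads the CSV, which is slow.

### Multiple pages

```
my_app/
├── app.py              # home page
├── pages/
│   ├── 1_📊_Dashboard.py
│   ├── 2_🔍_Search.py
│   └── 3_⚙️_Settings.py
```

With `streamlit run app.py`, a page switcher appears in the left sidebar automatically.

### Chart libraries

streamlit has built in:

- `st.line_chart`, `st.bar_chart`, `st.area_chart` (lightweight)
- `st.pyplot()` for matplotlib
- `st.plotly_chart()` for plotly
- `st.altair_chart()` for altair
- `st.vega_lite_chart()`

Pretty much any chart can be dropped in.

## gradio details

### Composite inputs/outputs

```python
import gradio as gr

demo = gr.Interface(
    # int() guards against the slider handing back a float
    fn=lambda txt, slider: txt * int(slider),
    inputs=[
        gr.Textbox(label='Text'),
        gr.Slider(1, 10, step=1),
    ],
    outputs=gr.Textbox(),
)
```

### Blocks (complex layouts)

```python
import gradio as gr

def process(text):
    return text.upper()  # placeholder for the real processing

with gr.Blocks() as demo:
    gr.Markdown('# My App')
    with gr.Row():
        with gr.Column():
            inp = gr.Textbox()
            btn = gr.Button('Run')
        with gr.Column():
            out = gr.Textbox()
    btn.click(fn=process, inputs=inp, outputs=out)

demo.launch()
```

Blocks gets close to streamlit's flexibility.

### chat interface

```python
import gradio as gr

def respond(message, history):
    return f"echo: {message}"

gr.ChatInterface(respond).launch()
```

A few lines give you an LLM chat UI. By my estimate, about 90% of the chat demos on huggingface spaces use gradio.

## Performance / scale

- streamlit: each user connection gets its own session, but all sessions run in the same process; heavy compute blocks other users
- gradio: a queue system; multiple requests wait in line

Use gradio for LLM demos (recent versions queue by default); use streamlit for multi-user dashboards.

## Deployment

### streamlit cloud / hugging face

streamlit cloud free tier:

- connect a GitHub repo
- push → automatic deploy to a streamlit.app domain

hugging face space:

- same idea, free CPU
- supports both gradio and streamlit

### Self-hosting

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
EXPOSE 8501
CMD streamlit run app.py --server.address 0.0.0.0
```

Put it behind an nginx reverse proxy + auth and you have an internal tool.

## Authentication

Neither ships a full user-management system (gradio's `launch(auth=...)` only covers simple username/password login). Options:

- nginx + basic auth
- Cloudflare Access (zero-trust)
- the streamlit-authenticator package
- an OAuth proxy (oauth2-proxy)

For internal tools I use Cloudflare Access: 5 minutes to configure, no maintenance.

## Comparison with dash / panel

- **Dash** (Plotly): built on Flask + React, more flexible but more code to write
- **Panel** (HoloViz): friendly to scientific computing, multiple backends
- **Reflex** (formerly Pynecone): write Python that compiles to React, strong UI

streamlit / gradio put simplicity first; dash / reflex suit complex applications.

## My choices

- **Data dashboard** → streamlit
- **ML model demo** → gradio
- **Internal admin tool** → streamlit
- **Needs a complex frontend** → SPA + FastAPI

## Case: customer demo tool

We needed to demo a text-summarization LLM to a customer:

```python
import gradio as gr
from transformers import pipeline

summarizer = pipeline('summarization')

def summarize(text):
    return summarizer(text, max_length=100)[0]['summary_text']

gr.Interface(
    fn=summarize,
    inputs=gr.Textbox(lines=10, placeholder='Paste a long text'),
    outputs=gr.Textbox(label='Summary'),
    title='LLM Summarization Demo',
    examples=[['Long text 1...'], ['Long text 2...']],
).launch(share=True)
# share=True gives a temporary public URL (gradio.live)
```

Deploy in 30 seconds, send the URL to the customer, and the customer can try it directly. That beats a slide deck by a wide margin.

The `share=True` temporary tunnel expires after a limited time (72 hours when this was written; the launch message of your Gradio version states the current lifetime).

## Internal case: cohort analysis tool

```python
from datetime import date, timedelta

import duckdb
import plotly.express as px
import streamlit as st

st.title('User Cohort Analysis')

# Default window of the last 90 days is an assumption; the original left start/end undefined.
end = date.today()
start = end - timedelta(days=90)
date_range = st.date_input('Date range', value=(start, end))
group_by = st.selectbox('Group by', ['country', 'plan', 'source'])

@st.cache_data
def query(date_range, group_by):
    # group_by only ever comes from the fixed selectbox list above
    return duckdb.sql(f"""
        SELECT {group_by},
               DATE_TRUNC('week', signup_date) AS cohort,
               COUNT(*) AS users
        FROM read_parquet('s3://.../users/*.parquet')
        WHERE signup_date BETWEEN '{date_range[0]}' AND '{date_range[1]}'
        GROUP BY 1, 2
    """).df()

if len(date_range) == 2:  # the tuple is incomplete while the user is still picking dates
    df = query(date_range, group_by)
    st.plotly_chart(px.line(df, x='cohort', y='users', color=group_by))
    st.dataframe(df)
```

Business users change the dropdown themselves to look at different dimensions. Previously a BA had to ask the data team to run it; now the business side is self-service.

## Pitfalls

1. **State reset**: streamlit reruns the script on every interaction. Expensive operations without a cache make the app lag.
2. **gradio queue off by default (older versions)**: Gradio 3.x did not enable the queue by default, so requests blocked under high concurrency; call `demo.queue()` to turn it on. Gradio 4 and later enable it by default.
3. **streamlit with multiple tabs**: the same user in several tabs does not share state. `st.session_state` is per session, per tab.
4. **share=True security**: a gradio share link is public and has no auth. Use it for demos; do not put secret data behind it.
5. **Upload size limit**: streamlit defaults to 200 MB; for more, set `--server.maxUploadSize=1000`.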
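
The rerun model behind pitfalls 1 and 3 can be illustrated without Streamlit at all. The sketch below is a plain-Python simulation, not Streamlit's actual implementation: `run_script`, `tab_a` and `tab_b` are hypothetical names, and a dict stands in for `st.session_state`. It shows that a local variable is recreated on every run while the per-session store survives, and that a second tab starts with its own empty store.

```python
def run_script(session_state, clicked):
    """One simulated top-to-bottom run of a Streamlit-style script."""
    local_count = 0                        # plain local: recreated on every run
    session_state.setdefault("count", 0)   # per-session store: survives reruns
    if clicked:
        local_count += 1
        session_state["count"] += 1
    return local_count, session_state["count"]


tab_a = {}  # stand-in for the session store of one browser tab
tab_b = {}  # a second tab gets a separate store

# Three button clicks in tab A trigger three full reruns of the script.
results = [run_script(tab_a, clicked=True) for _ in range(3)]
# Tab B is opened afterwards and never clicked.
untouched = run_script(tab_b, clicked=False)

print(results)    # [(1, 1), (1, 2), (1, 3)]
print(untouched)  # (0, 0)
```

The local counter never gets past 1 because each click starts the script over, which is also why anything expensive has to live in a cache or in session state instead of being recomputed at the top level.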

Data Analysis | 数据与机器学习官方 (@data_ml_lab) | 2026-05-13 20:53

DuckDB: analytical SQL on your laptop, 10x faster than pandas

## Background

A common data-analysis situation:

- you receive a pile of CSV / Parquet files (a few GB to tens of GB)
- you want to run SQL JOINs / aggregations / window functions over them
- there is no Snowflake / BigQuery (personal project / local exploration)
- pandas is slow and the groupby code is ugly

`DuckDB` is an embedded OLAP database ("SQLite for analytics"): a single-file binary that runs analytical SQL about as fast as a columnar warehouse, **on your laptop**.

## Install

```bash
pip install duckdb     # Python
brew install duckdb    # CLI
```

## CLI

```bash
$ duckdb my.db
D SELECT * FROM 'data.csv' LIMIT 5;    -- reads the CSV directly, no import needed
D SELECT COUNT(*) FROM 'data.parquet';
D SELECT a.x, b.y FROM 'a.csv' a JOIN 'b.parquet' b ON a.id = b.id;
```

CSV / Parquet / JSON files are queried as tables directly, with no import step.

## Python

```python
import duckdb

# in-memory
con = duckdb.connect()

# query a CSV directly
df = con.execute("""
    SELECT country, SUM(amount) AS total
    FROM 'orders.csv'
    WHERE qty > 5
    GROUP BY country
    ORDER BY total DESC
""").df()  # returns a pandas DataFrame

# persistent
con = duckdb.connect('analysis.db')
con.execute("CREATE TABLE orders AS SELECT * FROM 'orders.csv'")
```

## Interop with pandas / DataFrames

```python
import pandas as pd
import duckdb

# a pandas DataFrame can be queried as a table directly (DuckDB scans it in place)
df = pd.read_csv('big.csv')
result = duckdb.sql("""
    SELECT col1, AVG(col2)
    FROM df            -- references the pandas DataFrame by variable name
    GROUP BY col1
""").df()
```

The same works for polars:

```python
import duckdb
import polars as pl

pl_df = pl.read_csv('big.csv')
result = duckdb.sql("SELECT * FROM pl_df WHERE col1 > 100").pl()
```

DuckDB exchanges data with pandas / polars / Arrow with **little or no copying**: polars and Arrow already use the Arrow columnar memory format, and pandas DataFrames are scanned in place (some column types, such as object strings, may still need conversion).

## Performance

8-core, 16 GB laptop, 10 GB Parquet file:

```sql
SELECT country, SUM(amount), COUNT(*)
FROM 'orders.parquet'
GROUP BY country
ORDER BY 2 DESC
LIMIT 10;
```

| Tool | Time |
|---|---|
| pandas | 35s |
| polars (eager) | 8s |
| polars (lazy) | 4s |
| DuckDB | 2.5s |

Columnar storage, vectorized execution, multi-core parallelism and a query optimizer are what make DuckDB fast on analytical queries.

## Querying remote Parquet directly

```python
duckdb.sql("""
    SELECT * FROM read_parquet('s3://my-bucket/orders/*.parquet')
    WHERE date = '2025-03-01'
""")
```

DuckDB reads directly from S3 / GCS / Azure / HTTP. Combined with partitioning and Parquet column pruning, it reads only the columns and partitions it needs.

## Querying the data lake directly

It integrates with Iceberg / Delta Lake:

```sql
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/table/');
```

You can query Iceberg without Spark.

## Window functions / complex SQL

```sql
SELECT
    user_id,
    date,
    amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY date) AS cum_sum,
    RANK() OVER (PARTITION BY DATE_TRUNC('month', date) ORDER BY amount DESC) AS rank_in_month
FROM orders;
```

Standard SQL, a large set of Postgres-compatible extensions, and conveniences such as ANTI/SEMI JOIN and QUALIFY.

## EXPORT / IMPORT

```sql
COPY (SELECT * FROM big_table) TO 'out.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM big_table) TO 'out.csv' (HEADER, DELIMITER ',');
```

A Swiss army knife for converting between data formats.

## Real case: replacing pandas for EDA

Data exploration used to look like this:

```python
import pandas as pd

df = pd.read_csv('events.csv')
df_filtered = df[df.user_age > 18]
grouped = df_filtered.groupby(['country', 'product']).agg({
    'amount': ['sum', 'mean'],
    'qty': 'count',
}).reset_index()
# renamed from `sorted` so the Python builtin is not shadowed
top = grouped.sort_values(('amount', 'sum'), ascending=False)
top.head(20)
```

The DuckDB equivalent:

```python
import duckdb

duckdb.sql("""
    SELECT
        country, product,
        SUM(amount) AS total,
        AVG(amount) AS avg_amount,
        COUNT(*) AS n
    FROM 'events.csv'
    WHERE user_age > 18
    GROUP BY country, product
    ORDER BY total DESC
    LIMIT 20
""").df()
```

The SQL is more direct, runs faster, and does not need the whole CSV loaded first.

## Comparison with Snowflake / BigQuery

| | DuckDB | Snowflake | BigQuery |
|---|---|---|---|
| Deployment | single binary | SaaS | SaaS |
| Data scale | < 1 TB (single machine) | PB | PB |
| Cost | 0 | per credit | per GB scanned |
| Startup | < 100ms | < 1s | < 5s |
| SQL | Postgres-like | ANSI++ | ANSI+ |
| Concurrent users | single | many | many |

DuckDB does not replace Snowflake (it is not multi-user and does not scale without limit). But for roughly 90% of personal / team analysis (< 1 TB), DuckDB is enough, free, and fast.

## motherduck (DuckDB cloud)

MotherDuck (motherduck.com), a separate company that partners with DuckDB's creators, offers DuckDB + cloud sync:

- query locally, store permanently in the cloud
- shared datasets
- consistency across devices

Use it if you need it, but DuckDB itself works fully offline.

## Embedding in an application

```python
# Embed DuckDB in Django / FastAPI as an analytics endpoint
import duckdb
from fastapi import FastAPI

app = FastAPI()

@app.get('/analytics/top-products')
def top_products():
    return duckdb.sql("""
        SELECT product, SUM(amount) AS total
        FROM read_parquet('s3://.../orders/*.parquet')
        WHERE date > current_date - 7
        GROUP BY product
        ORDER BY total DESC
        LIMIT 10
    """).df().to_dict('records')
```

No separate analytics DB; everything is embedded in the application.

## Extension ecosystem

```sql
INSTALL httpfs;       -- HTTP / S3
INSTALL spatial;      -- geospatial
INSTALL fts;          -- full-text search
INSTALL postgres;     -- query Postgres tables
INSTALL excel;        -- read/write .xlsx
INSTALL sqlite;       -- read/write SQLite files
```

After `INSTALL postgres; LOAD postgres;`:

```sql
ATTACH 'host=pg.example.com dbname=app user=...' AS pg (TYPE postgres);
SELECT * FROM pg.public.orders LIMIT 10;
```

Query Postgres tables as if they were local and JOIN them with local CSV files: heterogeneous data queries.

## Pitfalls

1. **Data larger than RAM**: DuckDB spills to disk, but it can still be slow. Use `SET memory_limit='10GB'` and leave the rest to the OS.
2. **Wrong automatic column type inference**: if a CSV column sometimes contains "N/A", DuckDB infers a string type. Be explicit with `read_csv_auto(..., nullstr='N/A', types={'col': 'INTEGER'})`; declaring the null string matters, because "N/A" cannot be cast to INTEGER otherwise.
3. **Slow updates**: DuckDB is OLAP and does not suit frequent UPDATEs. For point lookups / single-row updates, use SQLite.
4. **Concurrent writes do not work**: single writer, multiple readers. Several web-app workers writing to the same DuckDB file at once will hit the lock.
5. **Extension version mismatch**: after upgrading DuckDB, old versions in the extension cache raise errors. `FORCE INSTALL <ext>` forces an update.
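
To see why a single "N/A" is enough to trigger pitfall 2, here is a deliberately simplified type sniffer in plain Python. It is a toy illustration under stated assumptions, not DuckDB's actual inference algorithm; `infer_column_type` and its `null_tokens` parameter are hypothetical names, loosely mirroring the role that DuckDB's `nullstr` CSV option plays.

```python
def infer_column_type(values, null_tokens=("",)):
    """Toy sniffer: INTEGER if every non-null value parses as int, else VARCHAR."""
    for raw in values:
        if raw in null_tokens:
            continue              # treated as NULL, so it does not affect the type
        try:
            int(raw)
        except ValueError:
            return "VARCHAR"      # one unparseable value demotes the whole column
    return "INTEGER"


qty = ["3", "12", "N/A", "7"]

default_guess = infer_column_type(qty)
with_null_token = infer_column_type(qty, null_tokens=("", "N/A"))

print(default_guess)    # VARCHAR
print(with_null_token)  # INTEGER
```

The point of the sketch is the order of decisions: the placeholder has to be recognised as a null marker before the type is chosen, otherwise forcing an integer type only turns a silent string column into a cast error.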

Data Analysis | 数据与机器学习官方 (@data_ml_lab) | 2026-05-01 00:40

polars vs pandas (a 2026 perspective)

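## Background

pandas has been the de facto standard of the Python data world for 15 years. But:

- single-threaded (GIL), slow on large data
- memory bloat (multiple copies of the same column)
- clumsy API design (SettingWithCopyWarning, the index getting in the way)

`polars` is a DataFrame library written in Rust that took off from 2020 onward: Apache Arrow memory format + multithreading + lazy execution.

From a 2026 perspective, polars beats pandas on several dimensions.

## Install

```bash
pip install polars
# or
uv add polars
```

## Syntax comparison

```python
import polars as pl
import pandas as pd

# pandas
df = pd.read_csv('orders.csv')
result = (
    df[df['country'] == 'US']
    .groupby('product')
    .agg({'amount': 'sum', 'qty': 'count'})
    .reset_index()
    .sort_values('amount', ascending=False)
    .head(10)
)

# polars
df = pl.read_csv('orders.csv')
result = (
    df.filter(pl.col('country') == 'US')
    .group_by('product')
    .agg([
        pl.col('amount').sum(),
        pl.col('qty').count(),
    ])
    .sort('amount', descending=True)
    .head(10)
)
```

polars syntax chains methods smoothly. The explicit `pl.col(...)` reads more clearly than pandas' `df['x']` inside complex expressions.

## Performance

One of our 10 GB CSV files, 80 columns:

| Operation | pandas | polars | polars-lazy |
|---|---|---|---|
| read_csv | 95s | 22s | 22s |
| filter + groupby + agg | 38s | 5s | 3s |
| join of two 10 GB tables | 90s (OOM risk) | 18s | 12s |
| sort by 3 columns | 25s | 4s | 3s |

5-10x faster. The gap is wider on a 32-core machine (pandas uses one core). Memory: pandas peaks at 30 GB, polars at 12 GB (Arrow columnar + zero-copy).

## Lazy execution

The polars killer feature:

```python
import polars as pl

# eager (every step actually runs)
df = pl.read_csv('big.csv')
result = df.filter(pl.col('x') > 0).group_by('y').agg(pl.col('z').sum())

# lazy (builds a query plan; nothing runs until collect)
result = (
    pl.scan_csv('big.csv')        # note scan_ instead of read_
    .filter(pl.col('x') > 0)
    .group_by('y')
    .agg(pl.col('z').sum())
    .collect()                     # triggers execution
)
```

Advantages of lazy:

- **predicate pushdown**: the filter is pushed into the CSV reading stage, so only matching rows are kept
- **projection pushdown**: only the columns actually used are read
- **CSE**: a repeated expression is computed once
- **streaming**: data larger than memory is processed as a stream

```python
result = (
    pl.scan_csv('100GB.csv')
    .filter(pl.col('date') > '2025-01-01')   # ISO-format date strings compare correctly as text
    .select(['user_id', 'amount'])           # only these two columns are read
    .group_by('user_id')
    .agg(pl.col('amount').sum())
    # streaming: not loaded all at once; newer Polars releases spell this collect(engine='streaming')
    .collect(streaming=True)
)
```

A 100 GB CSV runs on a 16 GB machine. pandas has no streaming and simply goes OOM.

## SQL interface

```python
ctx = pl.SQLContext()
ctx.register('orders', df)

result = ctx.execute("""
    SELECT country, SUM(amount)
    FROM orders
    WHERE qty > 5
    GROUP BY country
""").collect()
```

If you know SQL but not polars expressions, write SQL.

## Converting to and from pandas

```python
df_pd = pl.DataFrame(...).to_pandas()
df_pl = pl.from_pandas(df_pd)
```

The conversion goes through Arrow. It is close to zero-copy only when pandas uses Arrow-backed dtypes (for example `to_pandas(use_pyarrow_extension_array=True)`); otherwise the data is copied into NumPy arrays. Either way, mixing the two is convenient.

## Compared with pandas 2.x (Arrow backend)

pandas 2.x added a pyarrow backend:

```python
df = pd.read_csv('data.csv', dtype_backend='pyarrow')
```

Performance improves, but it is **still single-threaded**. It remains a step behind polars (multi-core + lazy + native Rust).

## Compared with spark / dask

| | pandas | polars | dask | spark |
|---|---|---|---|---|
| Memory model | columnar (NumPy blocks) | columnar (Arrow) | partition | columnar |
| Parallelism | single-threaded | multithreaded | multi-process / cluster | cluster |
| Data scale | < RAM | > RAM (streaming) | TB | PB |
| Learning curve | low | medium | medium | high |
| Startup | 0.1s | 0.1s | 1s | 30s+ |

- < 10 GB → polars
- 10 GB - 1 TB → polars streaming / dask
- > 1 TB → spark / dask cluster

## A real project migration

Our ETL pipeline had 30 scripts, moved from pandas to polars:

```
1. read_csv → scan_csv: a one-line change
2. df[df.x > 5] → df.filter(pl.col('x') > 5): changed by hand
3. groupby().agg({}) → group_by().agg([]): changed by hand
4. .reset_index() → deleted (polars has no index concept)
5. lambda apply → rewritten as polars expressions
```

About 30-50% of the lines needed changes. But the run time went from 2 hours to 12 minutes, which was worth it.

LLM-assisted migration works well, since pandas to polars is a well-defined transformation.

## API drawbacks / things to note

- no index (a feature, not a bug, but long-time pandas users need to adjust)
- merge → join (slightly different semantics; pandas merge defaults to inner and polars join defaults to inner, so that part is fine)
- pivot / melt and friends exist too, with slightly different semantics (recent Polars releases call melt `unpivot`)
- no multi-index columns

polars is fine for 90% of workflows. For some special transformations (time-series resample combined with a multi-index), pandas still wins.

## When to use what

- **New ETL** → polars by default
- **Existing pandas codebase** → decide by pain points; no need to migrate everything
- **Exploratory analysis in notebooks** → either works, with polars ahead on performance
- **DataFrame as ML input** → sklearn is still pandas-friendly; convert polars to numpy and pass that to sklearn

## My workflow

- data ingestion / heavy ETL: polars
- ML feature engineering: polars
- handing data to sklearn / pytorch: `.to_numpy()` or `.to_pandas()`
- small ad-hoc data: pandas (broad ecosystem)

## Pitfalls

1. **Misplaced expressions**: `pl.col('x') - 5 + pl.col('y')` vs `pl.col('x') - (5 + pl.col('y'))`. Operator precedence is the same as Python's, but it is easy to misread. In particular `&` and `|` bind tighter than comparisons, so wrap each comparison in parentheses: `(pl.col('x') > 0) & (pl.col('y') < 5)`.
2. **Forgetting `.collect()`**: without it the frame stays lazy and nothing is computed. When debugging, use `.head(10).collect()` to look at the data.
3. **datetime time zones**: polars strictly separates timezone-aware from naive values. pandas often mixes them. For a DataFrame coming from pandas, timezone information may be lost in `pl.from_pandas`.
4. **null handling**: polars uses the Arrow null bit, which differs from pandas NaN. Use `pl.col('x').is_null()`, not `x != x`.
5. **Sorting by key after groupby**: pandas sorts by default, polars does not. If you need order, call `.sort()` explicitly.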
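
Because polars expressions follow Python's operator precedence (pitfall 1), the most common trap can be reproduced with plain integers, no polars required. The sketch below uses the hypothetical values `x = 1` and `y = 10` and shows how leaving out the parentheses around comparisons silently changes the meaning of a condition combined with `&`.

```python
x, y = 1, 10

# Intended: "x is positive AND y is below 5" -> False, because y is 10.
intended = (x > 0) & (y < 5)

# Without parentheses, & binds tighter than the comparisons, so Python reads
# this as the chained comparison  x > (0 & y) < 5,  i.e. 1 > 0 and 0 < 5.
unparenthesized = x > 0 & y < 5

print(intended)         # False
print(unparenthesized)  # True
```

With polars expressions in place of the integers, the unparenthesized form is likewise not the filter you meant, so writing `(pl.col('x') > 0) & (pl.col('y') < 5)` with explicit parentheses is the safe habit.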

Data Analysis | 数据与机器学习官方 (@data_ml_lab) | 2026-04-24 11:52