Data-Centric AI · LLM Data Systems · AI4Science

Data-centric AI infrastructure for next-generation models

Wentao Zhang is an Assistant Professor, Researcher, and PhD Advisor at the Center for Machine Learning Research (CMLR), Peking University (PKU), where he leads the Data-Centric AI Group. He received his PhD from PKU under Prof. Bin Cui and previously worked at the Tencent Machine Learning Platform Department, Apple AIML, and Mila in Canada. His research focuses on Data-Centric AI, LLM data systems, data governance agents, and AI4Science, with the goal of building reusable, scalable, and verifiable data infrastructure for next-generation models.

In the past five years, he has published 100+ CCF-A papers as first or corresponding author, with 14,000+ Google Scholar citations, and has been named among Elsevier's top 2% of scientists worldwide. In 2026, he ranks #1 among PKU scholars in both AI/ML and AI+Data on CSRankings. He serves as Area Chair for NeurIPS, ACL, and SIGKDD, and has led 20+ research projects funded by NSFC, MOST, MOE, the Beijing Municipal Science and Technology Commission, and university-industry collaborations.

He has received the Special Prize of the Guangdong Science and Technology Progress Award, the First Prize of the Chinese Institute of Electronics (CIE) Science and Technology Progress Award, and the World Internet Conference Leading Science and Technology Achievement Award, along with best-paper-level awards at WWW 2022, APWeb 2023, and CIKM 2024. He has also been named a Zhiyuan Scholar and a Pujiang Young Scholar, and has received the ACM SIGMOD China Rising Star Award and the WAIC Yunfan Award. His group builds open-source systems including DataFlow, MinerU, MinerU-HTML, DataFlex, AgentFlow, OpenWorldLib, One-Eval, and Paper2Any, forming a systematic toolchain and infrastructure for Data-Centric AI.

Research: Data-Centric AI, LLMs, AI systems, and AI4Science.

Grants: NSFC Major Research Plan, MOST Key R&D Program (subtopic), MOE disciplinary breakthrough pilot project (Co-PI), and the PKU-Tencent Joint Laboratory for Large-Model Data.

Honors & awards: Zhiyuan Scholar, Pujiang Young Scholar, Special Prize of the Guangdong Science and Technology Progress Award, First Prize of the CIE Science and Technology Progress Award, WAIC Yunfan Award, ACM SIGMOD China Rising Star Award, World Internet Conference Leading Science and Technology Achievement Award, Huawei Spark Award, and more.

Academic impact: best-paper-level awards at WWW'22, APWeb'23, and CIKM'24; open-source data infrastructure systems including DataFlow, MinerU, and Angel have accumulated 80k+ GitHub stars.

Wentao Zhang (张文涛)

Assistant Professor, Researcher, and PhD Advisor at PKU

Google Scholar · DBLP · GitHub · Twitter · Xiaohongshu · Bilibili · Zhihu

Email: wentao.zhang@pku.edu.cn
WeChat: z1299799152
Address: Jingyuan Courtyard 6, Peking University, 5 Yiheyuan Road, Haidian District, Beijing, China

100+ first/corresponding-author CCF-A papers in the past five years

#1 among PKU faculty in both AI and AI+Data on CSRankings

14k+ Google Scholar citations · h-index 50

80k+ GitHub stars across DataFlow, MinerU, Angel, and related open-source systems

3+ best-paper-level awards: WWW · APWeb · CIKM

Research

Data infrastructure for next-generation models

Data-Centric AI

From LLMs and multimodal foundation models to agentic LLMs and world models, we keep investing in data infrastructure. The goal is to handle data acquisition, parsing, synthesis, cleaning, evaluation, and training orchestration at lower cost, with a lower barrier to entry, and at higher quality, so that Data-Centric AI becomes a core infrastructure layer for the capability growth of next-generation models.

Recent updates

2026-07

Four papers are accepted by ACM MM 2026.

2026-07

One paper is accepted by COLM 2026: Variational Co-Evolution via Reinforcement Learning.

2026-07

One paper is accepted by IEEE TPAMI: Aligning Condensed Graph via Hashing: A New Insight for Federated Graph Learning.

2026-06

Awarded the Special Prize of the Guangdong Science and Technology Progress Award.

2026-06

Four papers are accepted by ECCV 2026: MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding; Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility; Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs; and GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models.

2026-05

Four papers are accepted by ICML 2026.

2026-05

Three papers are accepted by SIGKDD 2026: RARE: Retrieval-Augmented Reasoning Modeling; Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM; and ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows.

2026-05

One paper is accepted by VLDB 2026: QA-GraphRAG: Query-Adaptive Plug-and-Play Retrieval Integration for Graph-based Retrieval-Augmented Generation.

2026-04

One paper is accepted by the ACL 2026 Industry Track.

2026-04

Eleven papers are accepted by ACL 2026 Findings.

2026-04

Eight papers are accepted by the ACL 2026 Main Conference.

2026-02

Named a Pujiang Young Scholar.

2026-02

Two papers are accepted by ICDE 2026.

2026-02

Five papers are accepted by CVPR 2026.

2026-02

Named a Zhiyuan Scholar.

2026-01

PhD student Minghao Xu received the Tencent Qingyun Scholarship.

2026-01

Six papers are accepted by ICLR 2026.

2026-01

Two papers are accepted by WWW 2026.

2025-11

One paper is accepted by SIGKDD 2026.

2025-11

Two papers are accepted by AAAI 2026.

2025-09

Seven papers are accepted by NeurIPS 2025.

2025-08

Six papers are accepted by EMNLP 2025.

2025-08

🏆 We won first place in the ICML 2025 Challenges on Automated Math Reasoning and Extensions!

2025-07

Five papers are accepted by ACM MM 2025.

2025-06

Two papers are accepted by ICCV 2025.

2025-06

One paper is accepted by VLDB 2025.

2025-05

One paper is accepted by ECML 2025.

2025-05

Six papers are accepted by ACL 2025 Main.

2025-05

One paper is accepted by ACL 2025 Findings.

2025-05

Three papers are accepted by SIGKDD 2025.

2025-05

One paper is accepted by IEEE TKDE 2025.

2025-05

Two papers are accepted by ICML 2025.

2025-04

One paper is accepted by IJCAI 2025.

2025-04

One paper is accepted by ISSTA 2025.

2025-04

One paper is accepted by SIGIR 2025.

2025-04

One paper is accepted by ISSTA 2025.

2025-02

Two papers are accepted by CVPR 2025.

2025-02

One paper is accepted by ICDE 2025.

2025-01

Two papers are accepted by IEEE TKDE 2025.

2025-01

Four papers are accepted by ICLR 2025.

2025-01

Two papers are accepted by ICDE 2025.

2025-01

Two papers are accepted by WWW 2025.

2025-01

One paper is accepted by VLDB 2025.

2024-12

Two papers are accepted by AAAI 2025.

2024-11

Two papers are accepted by ICDE 2025.

2024-10

🏆 We won the Best Student Full Paper Award at CIKM 2024!

2024-10

One paper is accepted by IEEE BIBM 2024.

2024-09

Three papers are accepted by NeurIPS 2024.

2024-06

One paper is accepted by CIKM 2024.

2024-06

One paper is accepted by VLDB 2024.

2024-05

One paper is accepted by SIGKDD 2024.

2024-05

One paper is accepted by the main track of ACL 2024.

2024-04

One paper is accepted by ICML 2024.

2024-04

One paper is accepted by JMLR 2024.

2024-04

One paper is accepted by TKDE 2024.

2024-03

Four papers are accepted by ICDE 2024.

2024-02

I was awarded the First Prize of the CIE Science and Technology Progress Award for the Angel project.

2024-02

One paper is accepted by VLDB 2024.

2024-02

One paper is accepted by SIGMOD 2024.

2024-02

I was awarded the 2023 CAAI Doctoral Dissertation Award.

2024-01

One paper is accepted by WWW 2024.

2024-01

One paper is accepted by ACM Computing Surveys 2024.

2024-01

Two papers are accepted by ICLR 2024.

2023-12

One paper is accepted by AAAI 2024.

2023-12

I was awarded the 2023 Beijing Doctoral Dissertation Award.

2023-12

Three papers are accepted by ICDE 2024.

2023-10

One paper is accepted by ICDE 2024.

2023-10

🏆 We won the Best Paper Runner-Up Award at APWeb-WAIM 2023.

2023-09

One paper is accepted by ACM Computing Surveys 2023.

2023-09

One paper is accepted by NeurIPS 2023.

2023-08

One paper is accepted by VLDB 2024.

2023-08

One paper is accepted by APWeb-WAIM 2023.

2023-08

One paper is accepted by CIKM 2023.

2023-08

Our book on diffusion models is now available.

2023-05

One paper is accepted by TKDE 2023.

2023-05

One paper is accepted by SIGKDD 2023.

2023-05

One paper is accepted by VLDB 2023.

2023-03

One paper is accepted by SIGMOD 2023.

2022-11

One paper is accepted by AAAI 2023.

2022-11

One paper is accepted by ICDE 2023.

2022-10

One paper is accepted by VLDBJ 2022.

2022-09

One paper is accepted by NeurIPS 2022.

2022-09

Awarded the Yunfan Award (Rising Star) at the World Artificial Intelligence Conference (WAIC), 2022.

2022-06

I was honored to serve as valedictorian for the class of 2022 in computer science at PKU.

2022-06

I received my Ph.D. degree in computer science from Peking University, with the Outstanding Doctoral Dissertation Award.

2022-05

One paper is accepted by the journal VLDBJ 2022.

2022-05

Four papers are accepted by the conference SIGKDD 2022.

2022-05

Two first-author papers have been accepted by ICML 2022.

2022-05

One paper related to AutoML has been accepted by Bioinformatics 2022.

2022-04

🏆 We won the Best Student Paper Award at WWW 2022!

2022-04

We release the first version of SGL, our scalable graph learning toolkit.

2022-03

One paper is selected as a Best Paper Award nominee at WWW 2022. The corresponding PasCa system (integrated into SGL) will be open-sourced next month!

2022-03

One paper as corresponding author, related to GNN-based recommendation, has been accepted by the journal ACM Computing Surveys 2022.

2022-01

One paper related to graph-based recommendation has been accepted by the conference ICDE 2022.

2022-01

One paper as first author, related to graph data annotation, has been accepted by the conference ICLR 2022.

2022-01

One paper related to our large-scale hyperparameter tuning system has been accepted by the conference VLDB 2022.

2022-01

I accepted the invitation to serve as a Program Committee member of the Research Track of ACM SIGKDD 2022.

2022-01

One paper as first author, related to our scalable graph NAS system, has been accepted by the conference WWW 2022.

2021-12

Our OpenBox team won the “Outstanding Winner” at the openGCC contest in CCF ChinaSoft 2021. Congratulations!

2021-09

Two papers as first author, related to scalable graph learning and graph data annotation, have been accepted by the conference NeurIPS 2021 with Spotlight (< 3%).

2021-08

We propose GAMLP, a scalable and efficient graph model that ranks #1 on three of the largest public ogbn graphs (ogbn-papers100M, ogbn-products, and ogbn-mag)! See the OGB leaderboards.

2021-07

One paper as first author, related to large-scale graph data selection, has been accepted by the conference VLDB 2021.

2021-07

One paper as co-first author, related to deep GNN, has been accepted by the journal TKDE 2021.

2021-06

One paper as third author, related to our AutoML system VolcanoML, has been accepted by the conference VLDB 2021.

2021-05

Three papers, related to sparse graphs, graph decomposition, and our black-box optimization (BBO) system OpenBox, are accepted by the conference SIGKDD 2021.

2021-03

I was selected as an Apple Scholar in AI/ML, the only recipient in China. Many thanks to Apple!

2021-03

One paper as first author has been accepted by the conference SIGMOD 2021. Looking forward to the meeting in Xi'an this summer!

DataFlow Ecosystem: Programmable infrastructure from data inputs to application outputs

L5 · Application layer: research productivity and world-model services
Paper2Any: research asset generation
OpenWorldLib: open knowledge services

L4 · Training and evaluation layer: data-driven closed-loop optimization
DataFlex: selection / mixture / weighting
One-Eval: model / task evaluation
DataFlow Eval: evaluation pipelines

L3 · Agentic data and orchestration layer: agent and tool orchestration
AgentFlow: agent data synthesis
DataFlow-Skills & Harness: skills and harness
DataFlow-WebUI: visual workspace

L2 · Data asset factory: data asset production and governance
DataFlow Core: Operator / Pipeline / Storage
DataFlow-MM: multimodal assets
DataFlow-Table: table assets
DataFlow-KG: KG assets

L1 · Data intake and parsing: parsing and cleaning hub
MinerU: document parsing
MinerU-HTML: HTML parsing
Flash-MinerU: fast document parsing
RayOrch: distributed scheduling

L0 · Heterogeneous data: multi-modal, multi-domain sources

DataMind

Context · Memory · Wiki · RAG / GraphRAG · Second Brain

Profile / Session · Memory · Wiki · RAG · GraphRAG · Feedback Trace · Knowledge Reuse

DataFlow

An LLM data preparation system covering data acquisition, processing, quality evaluation, operators, and workflow orchestration.

Protocol · Data prep

MinerU

A general-purpose document parsing engine that converts PDFs, images, and complex layouts into AI-ready document data.

Document AI · PDF

MinerU-HTML

A main-content extraction tool for web pages, built on a lightweight language model, for high-quality HTML data cleaning.

HTML · Extraction

DataFlex

A data-centric LLM training framework that dynamically selects, mixes, and reweights data during training.

Training · Mixture

DataFlow-MM

The multimodal extension of DataFlow, covering processing and evaluation of image, video, audio, and related data assets.

Multimodal · Data assets

DataFlow-LoopAI

A closed-loop optimization framework for LLMs, spanning evaluation, failure analysis, data acquisition, and training feedback.

LoopAI · Feedback

DataFlow-Table

An automated table data processing framework covering three kinds of agentic workflow: data retrieval, processing, and analysis.

Table · Agentic workflow

DataFlow-Graph

A DataFlow extension for knowledge graph data processing, supporting graph construction, completion, reasoning, and evaluation.

KG · Graph

OpenWorldLib

Extends DataFlow to world-model scenarios, supporting data preparation and evaluation for world models.

World Model · DataFlow

One-Eval

An automated evaluation framework that aims to go from a one-sentence user request to a model evaluation report.

NL2Eval · Evaluation

Paper2Any

A research asset generation application built on DataFlow-Agent, supporting scientific figures, slides, posters, and related workflows.

Research workflow · Assets

AgentFlow

The first agent data synthesis framework to span multiple environments, including RAG, MM-RAG, DeepResearch, Code, and GUI.

Agent data · Workflow

DataFlow-Skills

A reusable skill library for the DataFlow ecosystem that packages data operators, workflow generation, and quality evaluation as composable capabilities.

Skills · Operators

MemOS

A Memory OS for LLMs and agents that unifies the storage, retrieval, management, and personalized use of long-term memory.

Memory OS · Agents

Angel

A high-performance distributed machine learning and graph computing platform jointly designed by Tencent and PKU.

Distributed ML · Graph

Books

Books on data and generative AI

LLM Data: Principles, Technologies, and Practice (大模型数据：原理、技术与实战)

A systematic treatment of full-lifecycle management of LLM data, covering data acquisition, cleaning and parsing, annotation, synthesis and augmentation, quality evaluation, and compliance and safety, as well as hands-on practice in fine-tuning a domain-specific model on PDF data.

Authors: Conghui He, Lijun Wu, Wentao Zhang. Publishing House of Electronics Industry, Apr. 2026, ISBN 9787121523748.

Diffusion Models: Theory, Applications, and Code Practice (扩散模型：生成式 AI 模型的理论、应用与代码实践)

A book on the theory, applications, and code practice of diffusion models for generative AI, covering foundations, representative generation tasks, and frontier applications.

Authors: Ling Yang, Zhilong Zhang, Wentao Zhang, Bin Cui. Publishing House of Electronics Industry, Aug. 2023, ISBN 9787121459856.

Publications

Topic cloud, all publications, and featured papers

Group strengths

Supporting students across papers, open source, and industrial impact

1. Research directions

DCAI and large models sit at the frontier of both academia and industry.
The group is good at identifying impactful, practically meaningful, yet under-explored research problems, steering clear of overcrowded topics.

2. Student mentoring

Weekly subgroup meetings for sharing and discussion are held in person in Room 208, Jingyuan Courtyard 6, and online via Tencent Meeting.
Experienced senior students help newcomers get started, and problems can be discussed at any time.
Each student gets one-on-one mentoring and a training plan tailored to their background, interests, and future goals.

3. Collaboration resources

Ample computing resources, including thousand-GPU-scale H20/H100/H200 clusters.
Research internship and job referrals to Apple AIML, Tencent Hunyuan, Huawei, Shanghai AI Lab, ByteDance Seed, Kuaishou Kling, Alibaba Qwen, MSRA, Ant Group, and more.
Academic collaboration and exchange opportunities with Mila, Stanford, ETH, HKUST, NUS, UQ, and other partners.

4. Career support

Research assistant stipends, recommendation letters, and priority for recommendation-based admission to the group's master's and PhD programs.
PhD students enrolled through the Academy for Advanced Interdisciplinary Studies (叉院) have housing and workspace on PKU's main Yanyuan campus.
The group has a friendly atmosphere and regularly organizes optional team activities such as hiking, badminton, and group meals.

For students interested in AI research

In today's highly competitive AI talent ecosystem, whether for top industry programs, PhD admissions at leading academic groups, or faculty hiring, core competitiveness has shifted from paper count to overall impact. I will support your growth along the following five dimensions.

1. Practical frontier research

We focus on real needs and industry trends rather than overcrowded tracks, for example LLM data, Data-Centric AI, and data agents.

2. High-quality top-tier papers

CCF-A top-tier papers remain an important threshold, but we care more about whether the work is cited, adopted by mainstream open-source projects, and solves key problems.

3. Strong open-source experience

Students lead or deeply contribute to highly starred open-source projects, building both engineering and research skills, developing a personal brand, and shaping a research agenda around core projects.

4. Research internships

Students are encouraged to intern at leading companies and research institutes, sharpening their ability to define problems through real scenarios, massive data, and strong compute.

5. Group and platform support

The group's academic network, collaboration resources, and the university platform directly affect research efficiency, collaboration opportunities, and career outcomes.

Close partners and internship platforms

Students currently take part in research internships, project collaborations, and joint training mainly through the following companies, research institutes, and joint-training platforms.

Tencent Hunyuan

ByteDance Seed

ByteDance TikTok

Alibaba Qwen

Kuaishou Kling

Zhongguancun Academy

Shanghai AI Lab

Apple AIML

Microsoft Research Asia (MSRA)

Beijing Academy of Artificial Intelligence (BAAI)

Ant Group

Kimi

StepFun

Huawei

Xiaomi

Meituan LongCat

Samsung

Ubiquant

Join us

Openings for PhD students, postdocs, and research interns

PhD / Master

PhD students are recruited through the PKU Center for Machine Learning Research (CMLR), and jointly trained PhD students through platforms such as Shanghai AI Lab and Beijing Zhongguancun Academy. Students applying for PhD or master's admission in Fall 2027 are encouraged to get in touch about an internship first.

Training toward CCF-A top-tier papers is supported, with greater emphasis on impact, open-source adoption, and solving key problems.
Research is organized around core open-source projects rather than fragmented topics.

Postdoc

Postdoc positions are open on a long-term basis, jointly mentored with Academician Weinan E and Prof. Bin Cui of Peking University. Directions include LLM data, Data-Centric AI, AI4Science, and agent systems.

Joint opening with Prof. Bin Cui · Joint opening with Prof. Weinan E

Research Intern

Research interns are recruited year-round, and remote or off-campus internships are possible. This suits students who want to work on CCF-A papers, open-source system building, real industrial data problems, and turning research into products.

Work on systems such as DataFlow, MinerU, DataFlex, AgentFlow, and Paper2Any.
Referrals to industry research internships and resources at leading companies and research institutes.
Contact wentao.zhang@pku.edu.cn or WeChat z1299799152.

Awards & Service

Selected Awards / Academic Service

Selected Awards

Academic Appointments and Service

1. China Computer Federation (CCF)

Executive member, CCF Technical Committee on Databases; Executive member, CCF Technical Committee on Software Engineering.

2. Chinese Information Processing Society of China

Executive member, Technical Committee on Social Media Processing.

3. International Conference Chairing Roles

Tutorial & Workshop Chair: ADC 2024; Area Chair: WWW 2025-2026, NeurIPS 2025-2026, ACL 2025-2026, SIGKDD 2025-2026.

4. Program Committee Member

Database conferences: VLDB 2024-2026, ICDE 2023-2026, DASFAA 2022.
Machine learning conferences: ICML 2021-2026, NeurIPS 2022-2026, ICLR 2024-2026.
Data mining conferences: SIGKDD 2021-2026, WWW 2022-2026, SDM 2024.
Computer vision and NLP conferences: ICCV/ECCV 2023-2026, CVPR 2023-2025, ACL 2023-2026.

5. Journal Reviewer / Action Editor

International journals: JMLR, TMLR, VLDBJ, IEEE TKDE, IEEE TNNLS, WWWJ, SCIS, DSE, Cell Patterns, Nature Communications. Chinese journals: Scientia Sinica.

6. Shanghai Artificial Intelligence Laboratory

Research Advisor.

7. Beijing Zhongguancun Academy

Adjunct mentor and project lead.

8. National Engineering Laboratory for Big Data Analysis and Applications

Researcher.

9. PKU Turing Class

Research advisor.

Data-Centric AI

Training data preparation

Data-model interaction training

Inference data preparation

Data agents
