Wentao Zhang (张文涛) · Peking University · Data-Centric AI Group
Data-Centric AI · LLM Data Systems · AI4Science
Data-centric AI infrastructure for next-generation models
Wentao Zhang is an Assistant Professor, Researcher, and PhD Advisor at the Center for Machine Learning Research (CMLR), Peking University, where he leads the Data-Centric AI Group. He received his PhD from Peking University under Prof. Bin Cui and previously worked at Tencent's Machine Learning Platform Department, Apple AIML, and Mila. His research focuses on Data-Centric AI, LLM data systems, data governance agents, and AI4Science, with the goal of building reusable, scalable, and verifiable data infrastructure for next-generation models.
In the past five years, he has published 100+ CCF-A papers as first or corresponding author, with 14,000+ Google Scholar citations, and has been listed among Elsevier's top 2% of scientists worldwide. In 2026, he ranks #1 among PKU scholars in AI/ML and AI+Data on CSRankings. He serves as an Area Chair for NeurIPS, ACL, and SIGKDD, and has led 20+ research projects funded by NSFC, MOST, MOE, Beijing municipal programs, and industry collaborations.
He has received the First Prize of the CIE Science and Technology Progress Award and the World Internet Conference leading achievement award, along with three best paper awards (WWW 2022, APWeb 2023, and CIKM 2024). He has also been named a Zhiyuan Scholar and a Pujiang Young Scholar, and has received the ACM SIGMOD China Rising Star and WAIC Rising Star awards. His group builds open-source systems including DataFlow, MinerU, MinerU-HTML, DataFlex, AgentFlow, OpenWorldLib, One-Eval, and Paper2Any, forming a systematic toolchain and infrastructure for Data-Centric AI.
Research
Data-Centric AI, LLMs, AI systems, and AI4Science.
Grants
NSFC Major Research Plan, MOST Key R&D Program (subtopic), MOE disciplinary breakthrough pilot project (Co-PI), and the PKU-Tencent Joint Laboratory for Large-Model Data.
Honors & awards
Zhiyuan Scholar, Pujiang Young Scholar, First Prize of the CIE Science and Technology Progress Award, WAIC Rising Star, ACM SIGMOD China Rising Star, World Internet Conference leading achievement award, Huawei Spark Award, and more.
Academic impact
Best paper awards at WWW'22, APWeb'23, and CIKM'24; open-source data infrastructure systems including DataFlow, MinerU, and Angel have accumulated 80k+ GitHub stars.
Data infrastructure for next-generation models
Data-Centric AI
From LLMs and multimodal foundation models to agentic LLMs and world models, we continue to build data infrastructure that makes data acquisition, parsing, synthesis, cleaning, evaluation, and training orchestration cheaper, more accessible, and higher in quality. The goal is for Data-Centric AI to become core infrastructure for the capability growth of next-generation models.
Four papers have been accepted by ECCV 2026: MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding; Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility; Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs; and GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models.
Four papers have been accepted by ICML 2026.
Three papers have been accepted by SIGKDD 2026: RARE: Retrieval-Augmented Reasoning Modeling; Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM; and ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows.
Four papers have been accepted by SIGKDD 2022.
Two first-author papers have been accepted by ICML 2022.
One paper on AutoML has been accepted by Bioinformatics 2022.
🏆 We won the Best Student Paper Award at WWW 2022!
We released the first version of SGL, our scalable graph learning toolkit.
One paper was selected as a Best Paper Award nominee at WWW 2022. The corresponding PasCa system (integrated into SGL) will be open-sourced next month!
One corresponding-author paper on GNN-based recommendation has been accepted by ACM Computing Surveys 2022.
One paper on graph-based recommendation has been accepted by ICDE 2022.
One first-author paper on graph data annotation has been accepted by ICLR 2022.
One paper on our large-scale hyperparameter tuning system has been accepted by VLDB 2022.
I accepted the invitation to serve as a Program Committee member for the Research Track of ACM SIGKDD 2022.
One first-author paper on our scalable graph NAS system has been accepted by WWW 2022.
Our OpenBox team won the “Outstanding Winner” at the openGCC contest in CCF ChinaSoft 2021. Congratulations!
Two first-author papers, on scalable graph learning and graph data annotation, have been accepted by NeurIPS 2021 with Spotlight (< 3%).
We propose GAMLP, a scalable and efficient graph model that ranks #1 on the three largest public ogbn graphs (ogbn-papers100M, ogbn-products, and ogbn-mag). See the leaderboards here.
One first-author paper on large-scale graph data selection has been accepted by VLDB 2021.
One co-first-author paper on deep GNNs has been accepted by TKDE 2021.
One third-author paper on our AutoML system, VolcanoML, has been accepted by VLDB 2021.
Three papers, on sparse graphs, graph decomposition, and our black-box optimization (BBO) system OpenBox, have been accepted by SIGKDD 2021.
DCAI and large models sit at the frontier of both academia and industry.
The group values impactful, practically meaningful, and under-explored research problems.
2. Student mentoring
Weekly subgroup meetings, organized by research direction, are held for sharing and discussion, offline at Jingyuan Courtyard 6, Room 208, and online via Tencent Meeting.
Experienced senior students help newcomers get started and are available for discussion whenever problems arise.
Each student receives one-on-one mentoring under a plan tailored to their background, interests, and future goals.
3. Collaboration resources
Strong computing resources, including thousand-GPU-scale H20/H100/H200 clusters.
Research internship and job referrals at Apple AIML, Tencent Hunyuan, Huawei, Shanghai AI Lab, ByteDance Seed, Kuaishou Kling, Alibaba Qwen, MSRA, Ant Group, and more.
Academic collaboration and exchange opportunities with Mila, Stanford, ETH, HKUST, NUS, UQ, and other partners.
4. Career support
Research assistant stipends, recommendation letters, and priority consideration for master's and PhD admission to the group.
PhD students in the School of Intelligence Science and Technology have housing and workspace on PKU's main Yanyuan campus.
The group keeps a friendly atmosphere and regularly organizes hiking, badminton, group meals, and other activities, all optional.
For students interested in AI research
In today's highly competitive AI talent ecosystem, whether for top industry programs or for PhD admissions and faculty hiring at top academic groups, core competitiveness has shifted from paper count to integrated impact. The group supports students' growth across the following five dimensions.
We focus on real needs and industry trends, including LLM data, Data-Centric AI, and data agents.
2. High-quality top-tier papers
CCF-A top-tier papers remain an important threshold, but we care more about whether the work is cited, adopted by mainstream open-source projects, and solves key problems.
3. Strong open-source experience
Students can lead or deeply contribute to high-star open-source projects, developing both engineering and research skills, building a personal reputation, and shaping a research agenda around core systems.
4. Research internships
Students are encouraged to intern at leading companies and research labs, sharpening their ability to define problems through real scenarios, massive data, and strong compute.
5. Group and platform support
The group's academic network, collaboration resources, and the PKU platform directly affect research efficiency, collaboration opportunities, and career outcomes.
Close partners and internship platforms
Students currently take part in research internships, project collaborations, and joint training mainly through the following companies, research institutes, and joint-training platforms.
Openings for PhD students, postdocs, and research interns
PhD / Master
PhD and master's students are welcome through PKU CMLR, and jointly trained PhD students through Shanghai AI Lab, Beijing Zhongguancun Academy, and similar platforms. Students applying for Fall 2027 PhD or master's admission are encouraged to get in touch about an internship first.
Training toward CCF-A top-tier papers is supported, with greater emphasis on real impact, open-source adoption, and solving key problems.
Research is organized around core open-source systems to avoid fragmented topics.
Long-term postdoc openings are available in LLM data, Data-Centric AI, AI4Science, and agent systems, jointly mentored with Prof. Weinan E and Prof. Bin Cui.
Research interns are welcome year-round, including remote interns. This is suitable for students who want to work on CCF-A papers, open-source systems, real data problems, and research products.
Projects include DataFlow, MinerU, DataFlex, AgentFlow, Paper2Any, and related systems.
Research internship and referral resources are available across leading companies and labs.
Contact wentao.zhang@pku.edu.cn or WeChat z1299799152.
Awards & Service
Selected Awards / Academic Service