
Wentao Zhang is an assistant professor (Principal Investigator/PhD Advisor) in the Center of Machine Learning Research at Peking University (PKU), where he leads the Data-centric Machine Learning (DCML) group. His research focuses on DCML, graph ML, ML systems, and AI4Science. Wentao has published 50+ papers, including 10+ first-author papers in top DB (SIGMOD, VLDB, ICDE), DM (KDD, WWW), and ML (ICML, NeurIPS, ICLR) venues. He is a contributor to or designer of several system projects, including Angel, SGL, MindWare, and OpenBox. His research has powered several billion-scale applications at Tencent, and some of it has been recognized with best paper awards, including the Best Paper Runner Up Award at APWeb-WAIM 2023 and the Best Student Paper Award at WWW 2022.

Before joining PKU, Wentao worked as a research fellow with Prof. Jian Tang at the Montreal Institute for Learning Algorithms (Mila, led by Prof. Yoshua Bengio). He received his Ph.D. degree in CS from PKU, supervised by Prof. Bin Cui, and worked with Prof. Lei Chen as a visiting scholar at HKUST in 2019. Wentao also has 4 years of industrial experience in the ML & Data Platform Department of Tencent and the AIML Department of Apple.

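Wentao Zhang is an assistant professor, researcher, and PhD advisor at the Center of Machine Learning Research, Peking University. He previously worked at the Machine Learning Platform Department of Tencent, Apple AIML, and Mila in Canada. His research interests are data-centric machine learning (Data-centric ML, DCML), graph machine learning, machine learning systems, and interdisciplinary applications (e.g., Diffusion, multimodal learning, and AI4Science). Over the past 5 years he has published more than 50 CCF-A papers in machine learning (ICML/NeurIPS/ICLR), data mining (SIGKDD/WWW), and data management (SIGMOD/VLDB/ICDE), and he has served as PC Member/Area Chair for several top international conferences (VLDB/NeurIPS/WWW, etc.). He has received multiple best paper awards (the WWW'22 Best Student Paper Award as first author and the APWeb-WAIM'23 Best Paper Runner Up Award as corresponding author), and he has led or contributed to several open-source machine learning systems, such as the large-scale graph learning system SGL, the distributed machine learning system Angel (6.7k GitHub stars), and the black-box optimization system OpenBox. His honors include being the only Apple Scholar in the Asia-Pacific region in 2021, the Yunfan Award of the World AI Conference, Outstanding Doctoral Dissertation Awards from Peking University, Beijing, and the Chinese Association for Artificial Intelligence (CAAI), and the 2023 First Prize of the Scientific and Technological Progress Award of the Chinese Institute of Electronics (CIE).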

Email: wentao.zhang@pku.edu.cn

WeChat: z1299799152

Office: 217, Jingyuan Courtyard 6, PKU, Beijing

We have open positions for PhD students (2 per year), Master's students, and research interns (not limited to PKU; remote work is possible). If you are interested, please contact me directly!

We are also recruiting engineers and postdocs on a long-term basis. If you are interested, please contact wentao.zhang@pku.edu.cn directly (WeChat: z1299799152)!


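  • 1. Research directions
    • DCML, large models, generative AI, and AI4Science are all at the frontier of academia and industry
    • As a junior faculty member doing hands-on research, I am good at identifying and distilling good research problems (new problems that are impactful and practically meaningful but under-explored, so that we avoid overcrowded topics)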
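  • 2. Student mentoring
    • Weekly group meetings organized by sub-direction for sharing and discussion (in person: Room 208, Jingyuan Courtyard 6; online: Tencent Meeting)
    • Experienced senior students help newcomers get started, and problems can be discussed at any time (you can also come to me at any time)
    • A tailored training plan for each student based on background, interests, and future plans, with one-on-one guidance
    • As a peer :) I discuss study, life, work, and career planning with students, respect their ideas, and aim to be a friend 😊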
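  • 3. Collaboration and resources
    • Abundant computing resources (e.g., an 80GB Tesla A100 cluster)
    • Industry partners (e.g., Apple, Tencent, Huawei, Shanghai AI Lab, Baichuan AI, ByteDance, Kuaishou, and Ant Group): recommendations for research internships and jobs. Students can use industry compute, data, and good research problems, and build up internship experience.
    • Academic partners (e.g., Mila, Stanford, ETH, HKUST, NUS, and UQ): exchange opportunities
    • Research assistant stipend (for interns, depending on the level of involvement)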
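  • 4. Other
    • Recommendation letters and priority for recommended admission to this group's Master's/PhD programs (for PhD students of the Academy for Advanced Interdisciplinary Studies, housing and desks are on the main Yanyuan campus)
    • A friendly group atmosphere, with regular team-building activities (e.g., hiking, badminton, and group dinners); participation is voluntary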

Research Interests

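  • General DCML: In recent years, AI model development has hit a bottleneck. Most SOTA models (e.g., ChatGPT and SAM) still use the Transformer architecture proposed in 2017, and the source of performance gains has shifted from models to data. Our group mainly works on optimizing data quality, quantity, and efficiency, so as to obtain large amounts of high-quality data at lower cost and in less time. Taking large models (e.g., ChatGPT) as an example, and keeping data acquisition cost and efficiency in mind, we design efficient data processing methods (e.g., filtering, deduplication, and denoising), study scientific and systematic data quality evaluation frameworks and strategies, explore more effective data generation (e.g., synthesis and augmentation), and build effective data extraction methods (e.g., RAG, distribution matching, and data mixing ratios).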

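  • DCML on Graph: Graph data is ubiquitous in real life, for example social networks in WeChat, knowledge graphs, and user-item bipartite graphs in Taobao recommendation scenarios. Graph machine learning, i.e., applying machine learning to graph data, is expected to address problems such as relational reasoning and interpretability that traditional deep learning cannot handle. I mainly consider 1) taking graph neural networks (GNNs) as the entry point and using DCML ideas to optimize graph data (e.g., graph feature engineering, graph structure optimization, graph data augmentation, and graph anomaly handling); and 2) LLM+GNN, exploring better graph data representations to support general-purpose large graph models.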

  • DCML Applications:
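    • For Science: AI4Science lies at the intersection of artificial intelligence and science, and is currently a hot frontier direction in both academia and industry. Taking a data-centric view, I mainly study and design efficient ways to construct and preprocess scientific data (e.g., proteins and molecules), as well as interdisciplinary applications such as molecular modeling and biopharmaceuticals.
    • For AIGC&Diffusion Model: Diffusion models are currently the most popular generative models, with applications spanning CV, NLP, and interdisciplinary fields. Taking a data-centric view, I mainly explore how to better apply diffusion models to complex data generation scenarios, such as text-to-image, text-to-video, controllable 3D generation, and multimodal learning.
  • DCML Systems: ML systems lie at the intersection of artificial intelligence and computer systems, and are currently a hot frontier direction in computer systems research. Our group mainly considers supporting DCML tasks at the system level, such as supporting multiple data types (e.g., Graph and Text), supporting large-scale data processing (e.g., Distributed ML), and lowering the barrier to using the system (e.g., AutoML). On the data side of large models, the group is also developing a DCML system that supports multiple data types and large-scale data, covering large-model data processing, synthesis, quality evaluation, and data extraction.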

A summary of my recent works:

Contributed Open-source Projects

  • Angel: a high-performance distributed machine learning and graph computing platform, jointly designed by Tencent and PKU.

  • SGL: a scalable graph learning toolkit for extremely large graph datasets.

  • MindWare: a powerful AutoML system, which automates feature engineering, algorithm selection and hyperparameter tuning.

  • OpenBox: an efficient open-source system designed for solving generalized black-box optimization (BBO) problems.

Selected Awards

  1. First Prize of the Scientific and Technological Progress Award (科技进步一等奖), CIE (中国电子学会), 2023
  2. Outstanding Doctoral Dissertation Award (优秀博士学位论文奖), CAAI (中国人工智能学会) (10 recipients in China), 2023
  3. Outstanding Doctoral Dissertation Award (优秀博士学位论文奖), Beijing (14 recipients at PKU and 104 in Beijing), 2023
  4. 🏆 Best Paper Runner Up Award, APWeb-WAIM 2023
  5. Rising Star (云帆奖-明日之星), World AI Conference, 2022
  6. 🏆 Best Student Paper Award of WWW 2022 (1/1822, the 2nd WWW Best Student Paper from China), 2022
  7. IVADO Postdoctoral Fellowship, Canada
  8. Outstanding Doctoral Dissertation Award (优秀博士学位论文奖), Peking University (sole winner in Computer Software and Theory), 2022
  9. Outstanding Graduate of Beijing, China, 2022
  10. Candidate for the May 4th Medal (五四奖章) (each school recommends 1 candidate; highest honor at PKU), 2022
  11. The Big Data Expo Leading Technology Achievement Award, China International Big Data Industry Expo (Angel Graph project), 2022
  12. Candidate for People of the Year (年度人物) (1 person in EECS and 42 at PKU), 2021
  13. Merit Student of Beijing (2 people in EECS and 58 at PKU), 2021
  14. Apple PhD Fellowship (1 person in China and 15 worldwide), 2021
  15. National Scholarship (top 1% at PKU), 2019, 2021
  16. Baidu Scholarship Nominee (20 people worldwide), 2021

