About

Wentao Zhang is an assistant professor (Principal Investigator/PhD Advisor) in the Center of Machine Learning Research at Peking University (PKU), where he leads the Data-centric Machine Learning (DCML) group. Wentao’s research focuses on DCML, Graph ML, ML systems and AI4Science. Wentao has published 40+ papers, including 10+ first-author papers, in the top DB (SIGMOD, VLDB, ICDE), DM (KDD, WWW), and ML (ICML, NeurIPS, ICLR) venues. Wentao is a contributor to or designer of several system projects, including Angel, SGL, MindWare, and OpenBox. His research has been powering several billion-scale applications at Tencent, and some of it has been recognized by best paper awards, including the Best Paper Runner Up Award at APWeb-WAIM 2023 and the Best Student Paper Award at WWW 2022.

Before joining PKU, Wentao worked as a research fellow with Prof. Jian Tang at the Montreal Institute for Learning Algorithms (Mila, led by Prof. Yoshua Bengio). He received his Ph.D. degree in CS from PKU, supervised by Prof. Bin Cui, and worked with Prof. Lei Chen as a visiting scholar at HKUST in 2019. In addition, Wentao has four years of industry experience in the ML & Data Platform Department of Tencent and the AIML Department of Apple.

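Wentao Zhang (张文涛) is an assistant professor, principal investigator, and PhD advisor at the International Center for Machine Learning Research, Peking University. He previously worked in the Machine Learning Platform Department of Tencent, at Apple AIML, and at the Mila AI lab. His research interests are data-centric machine learning, graph machine learning, machine learning systems, and interdisciplinary applications (e.g., diffusion models, multimodal learning, and AI4Science). Over the past five years he has published more than 50 CCF-A papers in machine learning (ICML/NeurIPS/ICLR), data mining (KDD/WWW), and data management (SIGMOD/VLDB/ICDE), and has received several best paper awards (e.g., the WWW’22 Best Student Paper Award as first author and the APWeb-WAIM’23 Best Paper Runner Up Award as corresponding author). He has led or contributed to the open-sourcing of several machine learning systems, such as the large-scale graph learning system SGL, the distributed machine learning system Angel, and the black-box optimization system OpenBox. His honors include Apple Scholar (the only recipient in the Asia-Pacific region in 2021), the Yunfan Award (云帆奖) of the World AI Conference, Outstanding Doctoral Dissertation Awards from Peking University, Beijing, and CAAI, and the First Prize of the Scientific and Technological Progress Award of the Chinese Institute of Electronics.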

Email: wentao.zhang@pku.edu.cn

WeChat (微信): z1299799152

Office: 217, Jingyuan Courtyard 6, PKU, Beijing

We have open positions for PhD students (2 per year), Master's students, and research interns (not limited to PKU; remote work is possible). If you are interested, please contact me directly!

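We recruit interns on a long-term basis (remote internships from outside PKU are possible; students applying for PhD/Master's admission in fall 2025 are advised to get in touch about an internship first). If you are interested, please contact me directly!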

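Advantages of the group:

  • 1. Research directions
    • DCML, large models, generative AI, and AI4Science are all at the frontier of academia and industry
    • As a junior faculty member working on the front line, I am good at finding and distilling good research problems (new problems that are influential and practically meaningful but under-explored, which avoids overcrowded topics)
  • 2. Student mentoring
    • Weekly group meetings organized by sub-direction for sharing and discussion (offline: Room 208, Jingyuan Courtyard 6; online: Tencent Meeting)
    • Experienced senior students guide newcomers, and problems can be discussed at any time (you can also come to me at any time)
    • Training plans tailored to each student's background, interests, and future plans, with one-on-one mentoring
    • As a peer :) I discuss study, life, work, and career planning with students, respect their ideas, and become friends 😊
  • 3. Collaboration resources
    • Abundant computing resources (e.g., an 80GB Tesla A100 cluster)
    • Internship and job referrals to industry partners (e.g., Apple, Tencent, Huawei, Shanghai AI Lab, and Baichuan (百川智能))
    • Exchange opportunities with academic partners (e.g., Mila, Stanford, ETH, HKUST, NUS, and UQ)
    • Research assistant stipend (for interns, depending on the level of involvement)
  • 4. Others
    • Recommendation letters and priority for recommended admission to the group's Master's and PhD programs (PhD students of the Academy for Advanced Interdisciplinary Studies have housing and desks on the main Yanyuan campus)
    • A friendly group atmosphere, with regular team-building activities (e.g., hiking, badminton, and group dinners); participation is voluntary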

Research Interests

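  • General DCML: In recent years the development of AI models has hit a bottleneck: most SOTA models (such as ChatGPT and SAM) still use the Transformer architecture proposed in 2017, and the source of performance gains has shifted from models to data. I mainly consider optimizing data quality (e.g., imbalance, noise and OOD), quantity (e.g., annotation and augmentation), efficiency (e.g., distillation, compression, and selection), and privacy (e.g., attack and FL), in order to obtain large amounts of high-quality data at lower cost and in less time. Taking large language models as an example, and accounting for the cost and efficiency of data acquisition, I study scientific and systematic strategies for data quality evaluation, design efficient data selection methods (e.g., filtering, deduplication, and denoising), construct effective data mixing ratios, and explore using large models to assist data optimization (e.g., automatic data annotation and data generation).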

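  • DCML on Graph: Graph data is ubiquitous in real life, for example social networks in WeChat, knowledge graphs, and the user-item bipartite graphs in Taobao recommendation scenarios. Graph machine learning, i.e., "applying machine learning to graph data", is expected to address a range of problems that traditional deep learning cannot handle, such as relational reasoning and interpretability. I mainly consider 1) taking graph neural networks (GNNs) as the entry point and using DCML ideas to optimize graph data (e.g., graph feature engineering, graph structure optimization, graph data augmentation, and graph anomaly handling); 2) LLM+GNN, exploring better representations of graph data to support general-purpose large graph models.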

  • DCML Applications:
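    • For Science: AI4Science is the intersection of artificial intelligence and science, and is currently a hot frontier direction in both academia and industry. Taking a data-centric view, I mainly study and design efficient ways to construct and preprocess scientific data (e.g., proteins and molecules), as well as interdisciplinary applications such as molecular modeling and biopharmaceuticals.
    • For AIGC & Diffusion Model: Diffusion models are currently the most popular generative models, with applications spanning CV, NLP, and interdisciplinary fields. Taking a data-centric view, I mainly explore how to better apply diffusion models to complex data generation scenarios, such as text-to-image, text-to-video, controllable 3D generation, and multimodal learning.
  • DCML Systems: ML systems are the intersection of artificial intelligence and computer systems, and are currently a hot frontier direction in computer systems research. I mainly consider supporting DCML tasks at the system level, such as supporting multiple types of data formats (e.g., graph and text), supporting large-scale data processing (e.g., distributed ML), and lowering the barrier to using such systems (e.g., AutoML).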

A summary of my recent works:

What's New

  • 2024-03: Four papers are accepted by ICDE 2024.
  • 2024-02: I am awarded the First Prize of the Scientific and Technological Progress Award of CIE for the Angel project.
  • 2024-02: One paper is accepted by VLDB 2024.
  • 2024-02: One paper is accepted by SIGMOD 2024.
  • 2024-02: I am awarded the 2023 CAAI Doctoral Dissertation Award.
  • 2024-01: One paper is accepted by WWW 2024.
  • 2024-01: One paper is accepted by ACM Computing Surveys 2024.
  • 2024-01: Two papers are accepted by ICLR 2024.
  • 2023-12: One paper is accepted by AAAI 2024.
  • 2023-12: I am awarded the 2023 Beijing Doctoral Dissertation Award.
  • 2023-12: Three papers are accepted by ICDE 2024.
  • 2023-10: One paper is accepted by ICDE 2024.
  • 2023-10: 🏆 We win the Best Paper Runner Up Award at APWeb-WAIM 2023.
  • 2023-09: One paper is accepted by ACM Computing Surveys 2023.
  • 2023-09: One paper is accepted by NeurIPS 2023.
  • 2023-08: One paper is accepted by VLDB 2024.
  • 2023-08: One paper is accepted by APWeb-WAIM 2023.
  • 2023-08: One paper is accepted by CIKM 2023.
  • 2023-08: Our book on diffusion models is now available.
  • 2023-05: One paper is accepted by TKDE 2023.
  • 2023-05: One paper is accepted by SIGKDD 2023.
  • 2023-05: One paper is accepted by VLDB 2023.
  • 2023-03: One paper is accepted by SIGMOD 2023.
  • 2022-11: One paper is accepted by AAAI 2023.
  • 2022-11: One paper is accepted by ICDE 2023.
  • 2022-10: One paper is accepted by VLDBJ 2022.
  • 2022-09: One paper is accepted by NeurIPS 2022.
  • 2022-09: I am awarded the Rising Star award (云帆奖-明日之星) at the World AI Conference, 2022.
  • 2022-06: I am honored to be the valedictorian for the class of 2022 in CS at PKU.
  • 2022-06: I receive my Ph.D. degree in computer science from Peking University with the Outstanding Doctoral Dissertation Award.
  • 2022-05: One paper is accepted by the journal VLDBJ 2022.
  • 2022-05: Four papers are accepted by the conference SIGKDD 2022.
  • 2022-05: Two first-author papers have been accepted by ICML 2022.
  • 2022-05: One paper related to AutoML has been accepted by Bioinformatics 2022.
  • 2022-04: 🏆 We win the Best Student Paper Award at WWW 2022!
  • 2022-04: We release the first version of our scalable graph learning toolkit, SGL.
  • 2022-03: One paper is selected as a Best Paper Award nominee at WWW 2022. The corresponding PaSca system (integrated into SGL) will be open-sourced next month!
  • 2022-03: One paper as corresponding author, related to GNN-based recommendation, has been accepted by the journal ACM Computing Surveys 2022.
  • 2022-01: One paper related to graph-based recommendation has been accepted by the conference ICDE 2022.
  • 2022-01: One paper as first author, related to graph data annotation, has been accepted by the conference ICLR 2022.
  • 2022-01: One paper related to our large-scale hyper-parameter tuning system has been accepted by the conference VLDB 2022.
  • 2022-01: I accepted the invitation to serve as a Program Committee member of the Research Track of ACM SIGKDD 2022.
  • 2022-01: One paper as first author, related to our scalable graph NAS system, has been accepted by the conference WWW 2022.
  • 2021-12: Our OpenBox team won the “Outstanding Winner” at the openGCC contest in CCF ChinaSoft 2021. Congratulations!
  • 2021-09: Two papers as first author, related to scalable graph learning and graph data annotation, have been accepted by the conference NeurIPS 2021 with Spotlight (< 3%).
  • 2021-08: We propose GAMLP, a scalable and efficient graph model, which ranks #1 on three of the largest public ogbn graphs (i.e., ogbn-papers100M, ogbn-products, and ogbn-mag)! See the leaderboards here.
  • 2021-07: One paper as first author, related to large-scale graph data selection, has been accepted by the conference VLDB 2021.
  • 2021-07: One paper as co-first author, related to deep GNN, has been accepted by the journal TKDE 2021.
  • 2021-06: One paper as third author, related to our AutoML system, VolcanoML, has been accepted by the conference VLDB 2021.
  • 2021-05: Three papers, related to sparse graphs, graph decomposition, and our black-box optimization (BBO) system, OpenBox, are accepted by the conference SIGKDD 2021.
  • 2021-03: I was awarded the Apple Scholars in AI/ML PhD fellowship, as the only recipient in China. Many thanks to Apple!
  • 2021-03: One paper as first author has been accepted by the conference SIGMOD 2021. Looking forward to the meeting in Xi’an this summer!

Contributed Open-source Projects

  • Angel: a high-performance distributed machine learning and graph computing platform, jointly designed by Tencent and PKU.

  • SGL: a scalable graph learning toolkit for extremely large graph datasets.

  • MindWare: a powerful AutoML system, which automates feature engineering, algorithm selection and hyperparameter tuning.

  • OpenBox: an efficient open-source system designed for solving generalized black-box optimization (BBO) problems.

Selected Awards

  1. First Prize of Scientific and Technological Progress Award(科技进步一等奖), CIE(中国电子学会), 2023
  2. Outstanding Doctoral Dissertation Award(优秀博士学位论文奖), CAAI(中国人工智能学会) (10 people in China), 2023
  3. Outstanding Doctoral Dissertation Award(优秀博士学位论文奖), Beijing (14 people in PKU, and 104 people in Beijing), 2023
  4. 🏆 Best Paper Runner Up Award, APWeb-WAIM 2023.
  5. Rising Star (云帆奖-明日之星), World AI Conference, 2022.
  6. 🏆 Best Student Paper Award of WWW 2022 (1/1822, the 2nd WWW Best Student Paper from China), 2022
  7. IVADO Postdoctoral Fellowship, Canada
  8. Outstanding Doctoral Dissertation Award(优秀博士学位论文奖), Peking University (Sole winner in Computer Software and Theory), 2022
  9. Outstanding Graduate of Beijing, China, 2022
  10. Candidate of May 4th Medal(五四奖章) (Each School recommends 1 candidate, highest honor in PKU), 2022
  11. The Big Data Expo Leading Technology Achievement Award, China International Big Data Industry Expo (Angel Graph project), 2022
  12. Candidate of People of the Year(年度人物) (1 person in EECS, and 42 people in PKU), 2021
  13. Merit Student of Beijing (2 people in EECS, and 58 people in PKU), 2021
  14. Apple PhD Fellowship (1 person in China, and 15 people in the world), 2021
  15. National Scholarship (Top 1% in PKU), 2019, 2021
  16. Baidu Scholarship Nominee (20 people in the world), 2021

Selected Competitions

  1. Outstanding Winner of the openGCC contest in CCF ChinaSoft (1/3814), 2021
  2. Rank #1 in Open Graph Benchmark, 2021
  3. Outstanding Winner of the BDIC Big Data Competition (1/575), 2018

Selected Program Committee Member and Reviewer

  • Database and Data Management:
    • ICDE 2023, 2024
    • DASFAA 2022
    • VLDBJ 2022, 2023
  • Machine Learning:
    • ICML 2021, 2022, 2023, 2024
    • NeurIPS 2022, 2023
    • ICLR 2024
    • JMLR 2023
    • Machine Learning 2023
    • LoG 2024
  • Data Mining:
    • SIGKDD 2021, 2022, 2023, 2024
    • SDM 2024
    • WWW 2022
    • DASFAA 2022, 2023, 2024
    • IEEE TKDE 2022
    • IEEE TNNLS 2022
    • PAKDD 2023, 2024
  • Others:
    • ICCV 2023
    • ECCV 2024
    • CVPR 2023, 2024
    • SCIS 2023

Invited Talks

I am happy to give a talk if you are interested in my work. 😊

  1. Model Degradation Hinders Deep Graph Neural Networks.
    KDD’22, 2022.08
  2. Graph Attention Multi-Layer Perceptron.
    KDD’22, 2022.08
  3. NAFS: A Simple yet Tough-to-beat Baseline for Graph Representation Learning.
    AI Time [News]
    ICML’22, Virtual, 2022.07
    Jiqizhixin, Virtual, 2022.07 [News][Slides]
  4. Deep and Flexible Graph Neural Architecture Search.
    ICML’22, Virtual, 2022.07
    Jiqizhixin, Virtual, 2022.07
  5. Towards Large Scale Graph Learning: Data, Model and System.《大规模图学习:数据、模型与系统》
    THU, Virtual, 2023.02
    PKU, Virtual, 2023.02
    SUSTech, 2023.01
    HKUST (Guangzhou), Virtual, 2022.04 [News]
    Stanford, Virtual, 2021.11
    Mila, Virtual, 2021.09
  6. Towards Automated Graph Learning. 《自动化图机器学习》 [Doc]
    HKUST, Virtual, 2022.11 [News]
    NUDT, Virtual, 2022.07
    HUST, 2022.08
    Zhejiang University, 2022.08
  7. Information Gain Propagation: A New Way to Graph Active Learning with Soft Labels. 《软标签场景下的图主动学习》
    AI Time, Virtual, 2022.06 [News]
    ICLR’22, Virtual, 2022.04
  8. Data-centric ML on Graph.
    UvA, 2022.04
    PKU, 2023.05
    HKUST, 2023.04
  9. Towards Data-Centric ML.《以数据为中心的机器学习》
    Apple Research, 2022.06
    RUC, 2023.06
    SEU, 2023.07
    PKU, 2023.08
  10. Valedictorian Speech.《北京大学计算机系2022级毕业生代表致辞》
    CS of PKU, 2022.06 [News]
  11. PaSca: a graph neural architecture search system under the scalable paradigm. 《可扩展性的图神经结构搜索系统》
    DGL Team, Amazon, Virtual, 2022.07
    CSU, Virtual, 2022.07
    CCF, Virtual, 2022.06 [News] [Slides]
    DataFun, Virtual, 2022.06 [Slides]
    MLNLP, Virtual, 2022.06 [News][Slides][Video]
    InfoQ, Tencent Cloud, Virtual, 2022.06 [News]
    WWW’22, Virtual, 2022.04 [Slides]
    Data Platform, Tencent, Virtual, 2022.05
  12. Towards Large-scale Graph Machine Learning. 《大规模图机器学习》 [Doc]
    HKUST, Virtual, 2022.08 (in preparation)
    LOGs, Virtual, 2022.07 [Video]
  13. How to Do Research? 《浅谈科研》
    Apple Research, Virtual, 2021.12
    PKU, Virtual, 2021.12 [News-1, News-2] [Slides]

  14. The Scalability of Large-scale Graph Machine Learning.《大规模图机器学习的可扩展性》
    Tencent Big Data, Virtual, 2022.04
    NeurIPS, Virtual, 2021.12
    4Paradigm, Virtual, 2021.12
    AI Drive, 2021.12 [Video] [News] [Slides]
  15. RIM: Reliable Influence-based Active Learning on Graphs.
    NeurIPS, Virtual, 2021.12
    NeurIPS MeetUp China, 2021.12 [News] [Slides]
  16. A Survey of GNN Systems.《GNN系统调研》
    Tencent, Virtual, 2021.12 [Slides]

  17. Graph Attention Multi-Layer Perceptron.《图注意力多层感知器》
    DataFun, Virtual, 2021.10 [News] [Slides]