About
Wentao Zhang is an assistant professor (Principal Investigator/PhD Advisor) in the Center of Machine Learning Research at Peking University (PKU), where he leads the Data-centric Machine Learning (DCML) group. Wentao’s research focuses on DCML, Graph ML, ML systems, and AI4Science. He has published 50+ papers in top DB (SIGMOD, VLDB, ICDE), DM (KDD, WWW), and ML (ICML, NeurIPS, ICLR) venues. Wentao is a contributor to or designer of several system projects, including Angel, Open-DataFlow-Eval, SGL, MindWare, and OpenBox. His research has powered several billion-scale applications at Tencent, and some of it has been recognized with best paper awards (e.g., WWW’22, APWeb’23, and CIKM’24).
Before joining PKU, Wentao worked as a research fellow with Prof. Jian Tang at the Montreal Institute for Learning Algorithms (Mila, led by Prof. Yoshua Bengio). He received his Ph.D. degree in CS from PKU, supervised by Prof. Bin Cui, and worked with Prof. Lei Chen as a visiting scholar at HKUST in 2019. In addition, Wentao has four years of industrial experience in the ML & Data Platform Department of Tencent and the AIML Department of Apple.
Wentao Zhang (张文涛) is an assistant professor, researcher, and PhD advisor at the Center of Machine Learning Research (国际机器学习研究中心), Peking University. He previously worked in the Machine Learning Platform Department of Tencent, at Apple AIML, and at the Mila AI institute in Canada. His research interests are data-centric machine learning (Data-centric ML, DCML), machine learning systems, and AI4Science. Over the past five years he has published 50+ CCF-A papers in machine learning (ICML/NeurIPS/ICLR), data mining (SIGKDD/WWW), and data management (SIGMOD/VLDB/ICDE), and he has served as PC Member/Area Chair for multiple top international conferences. He has received multiple best paper awards (e.g., WWW’22, APWeb’23, and CIKM’24) and has led or contributed to several open-source machine learning systems. His honors include Apple Scholar, the World AI Conference Yunfan Award, the Outstanding Doctoral Dissertation Awards of Peking University, Beijing, and CAAI, and the 2023 First Prize of the Scientific and Technological Progress Award of the Chinese Institute of Electronics.
Email: wentao.zhang@pku.edu.cn
WeChat: z1299799152
Office: 217, Jingyuan Courtyard 6, PKU, Beijing
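We have open positions for PhD students (2 per year), Master's students, and research interns (not limited to PKU; remote work is possible). If you are interested, please contact me directly!
We are also recruiting engineers and postdocs on a long-term basis. If you are interested, please contact wentao.zhang@pku.edu.cn directly (WeChat: z1299799152)!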
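Why join our group:
- 1. Research directions
- DCML, large models, generative AI, and AI4Science are all at the frontier of academia and industry
- As an early-career faculty member working hands-on in research, I am good at finding and distilling good research problems (new problems that are impactful and practically meaningful but under-explored, which avoids crowded topics)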
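- 2. Student mentoring
- Weekly group meetings organized by sub-topic for sharing and discussion (in person: Room 208, Jingyuan Courtyard 6; online: Tencent Meeting)
- Experienced senior students help newcomers get started, and problems can be discussed at any time (you can also come to me at any time)
- A training plan tailored to each student's background, interests, and future plans, with one-on-one guidance
- As a peer :) I discuss study, life, work, and career planning with students, respect their ideas, and become friends with them 😊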
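- 3. Collaboration and resources
- Abundant computing resources (e.g., an 80GB Tesla A100 cluster)
- Research internships and job referrals with industry partners (e.g., Apple, Tencent, Huawei, Shanghai AI Lab, Baichuan AI, ByteDance, Kuaishou, and Ant Group). Students can draw on industry compute, data, and good research problems while building internship experience.
- Exchange opportunities with academic partners (e.g., Mila, Stanford, ETH, HKUST, NUS, and UQ)
- Research assistant stipend (for interns, depending on the level of involvement)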
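- 4. Other
- Recommendation letters and priority for recommended admission to the group's Master's/PhD programs (PhD students of the Academy for Advanced Interdisciplinary Studies have housing and desks on the main Yanyuan campus)
- A friendly group atmosphere, with regular team-building activities (e.g., hiking, badminton, and group dinners); participation is voluntary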
Research Interests
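- General DCML: In recent years, AI model development has hit a bottleneck. Most SOTA models (e.g., ChatGPT and SAM) still use the Transformer architecture proposed in 2017, and the source of performance gains has shifted from models to data. Our group mainly works on optimizing data quality, quantity, and efficiency, so that large amounts of high-quality data can be obtained at lower cost and in less time. Taking large models (e.g., ChatGPT) as an example, and accounting for the cost and efficiency of data acquisition, we design efficient data processing methods (e.g., filtering, deduplication, and denoising), study scientific and systematic data quality evaluation frameworks and strategies, explore more effective data synthesis approaches (e.g., synthesis and augmentation), and build effective data extraction methods (e.g., RAG, distribution matching, and data mixing).
- DCML on Graph: Graph data is ubiquitous in real life, for example social networks in WeChat, knowledge graphs, and user-item bipartite graphs in Taobao recommendation scenarios. Graph machine learning, i.e., "applying machine learning to graph data", is expected to address problems that traditional deep learning cannot handle, such as relational reasoning and interpretability. I mainly work on 1) using graph neural networks (GNNs) as the entry point and applying DCML ideas to optimize graph data (e.g., graph feature engineering, graph structure optimization, graph data augmentation, and graph anomaly handling); 2) LLM+GNN, exploring better representations of graph data to support general-purpose graph foundation models.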
- DCML Applications:
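- For Science: AI4Science is the intersection of artificial intelligence and science, and is currently a hot frontier in both academia and industry. Taking a data-centric view, I mainly study and design efficient ways to construct and preprocess scientific data (e.g., proteins and molecules), as well as interdisciplinary applications such as molecular modeling and biopharmaceuticals.
- For AIGC & Diffusion Model: Diffusion models are currently the most popular generative models, with applications spanning CV, NLP, and interdisciplinary fields. Taking a data-centric view, I mainly explore how to better apply diffusion models to complex data generation scenarios such as text-to-image, text-to-video, controllable 3D generation, and multimodal learning.
- DCML Systems: ML systems sit at the intersection of AI and computer systems, and are currently a hot frontier in computer systems research. Our group mainly works on system-level support for DCML tasks, such as supporting multiple data types (e.g., graph and text), processing large-scale data (e.g., distributed ML), and lowering the barrier to using the system (e.g., AutoML). On the data side of large models, the group is also developing DCML systems that support multiple data types and large-scale data, covering large-model data processing, synthesis, quality evaluation, and data extraction.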
A summary of my recent works:
- General DCML: how to improve the data quality, quantity and efficiency for ML?
- Prompt Engineering for LLMs
- Prompt Augmentation System [PAS, Arxiv 24]
- Data Selection for LLMs
- Large-scale Video Keyframe Selection [KeyVideoLLM, Arxiv 24]
- Efficient Selection of Empathy Data [Efficient-Empathy, Arxiv 24]
- Survey for Data-centric Multimodal LLMs [Survey, Arxiv 24]
- DCML on graph: how to improve graph data and support large graph foundation models?
- Data annotation
- Better efficiency [ALG, SIGMOD 21]
- Model free [Grain, VLDB 21]
- Noise handling [RIM, NeurIPS 21, Spotlight]
- Simplifying the labeling task [IGP, ICLR 22]
- Feature engineering (Complex model –> better features + simple model)
- Feature/label smoothing + simple model [NDLS, NeurIPS 21, Spotlight]
- Unsupervised and non-parametric feature smoothing [NAFS, ICML 22]
- Graph-based MLP deployed at Tencent [GAMLP, KDD 22]
- Inference at large scale [NAI, ICDE 24]
- Non-parametric optimization [NPA, SIGMOD 24]
- Experimental evaluation [AIR, KDD 22]
- Data distillation
- Offline distillation [RDD, SIGMOD 20]
- Online distillation [ROD, KDD 21]
- DCML Systems: how to make DCML faster and easier?
- Distributed ML & AutoML
- Distributed NAS on graph [PasCa, WWW 22, Best Student Paper Award]
- Deep and flexible NAS on graph [DF-GNAS, ICML 22]
- Scalable graph learning [SGL]
- Distributed graph learning [Angel Graph]
- End-to-End AutoML [MindWare, VLDB 21]
- Black box optimization [OpenBox, KDD 21, JMLR 24]
- Large-scale hyper-parameter tuning [Hyper-Tune, VLDB 22]
- Distributed GNN training [The First Survey of Distributed GNN Training, Arxiv 22]
- Online Spark SQL tuning service [Rover, KDD 23]
- DCML Application: how to use machine learning in real applications?
- For Industry
- GNN-based recommendation [The First Survey of GNN-based RS, CSUR 22]
- GNN-based recommendation system deployed at Taobao [Zoomer, ICDE 22]
- For Science
- Diffusion models [The First Survey of Diffusion Models, CSUR 23][CONPREDIFF, NeurIPS 23]
- AutoML for biology [AutoDC, Bioinformatics 22]
- Protein-Language LLM [ProtLLM, ACL 2024]
- Benchmark for Glycan Machine Learning [GlycanML, Arxiv 2024]
What's New
- 2024-10: 🏆 We won the Best Student Full Paper Award at CIKM 2024!
- 2024-10: One paper is accepted by IEEE BIBM 2024.
- 2024-09: Three papers are accepted by NeurIPS 2024.
- 2024-06: One paper is accepted by CIKM 2024.
- 2024-06: One paper is accepted by VLDB 2024.
- 2024-05: One paper is accepted by SIGKDD 2024.
- 2024-05: One paper is accepted by the main track of ACL 2024.
- 2024-04: One paper is accepted by ICML 2024.
- 2024-04: One paper is accepted by JMLR 2024.
- 2024-04: One paper is accepted by TKDE 2024.
- 2024-03: Four papers are accepted by ICDE 2024.
- 2024-02: I was awarded the First Prize of the Scientific and Technological Progress Award of CIE for the Angel project.
- 2024-02: One paper is accepted by VLDB 2024.
- 2024-02: One paper is accepted by SIGMOD 2024.
- 2024-02: I was awarded the 2023 CAAI Doctoral Dissertation Award.
- 2024-01: One paper is accepted by WWW 2024.
- 2024-01: One paper is accepted by ACM Computing Surveys 2024.
- 2024-01: Two papers are accepted by ICLR 2024.
- 2023-12: One paper is accepted by AAAI 2024.
- 2023-12: I was awarded the 2023 Beijing Doctoral Dissertation Award.
- 2023-12: Three papers are accepted by ICDE 2024.
- 2023-10: One paper is accepted by ICDE 2024.
- 2023-10: 🏆 We won the Best Paper Runner Up Award at APWeb-WAIM 2023.
- 2023-09: One paper is accepted by ACM Computing Surveys 2023.
- 2023-09: One paper is accepted by NeurIPS 2023.
- 2023-08: One paper is accepted by VLDB 2024.
- 2023-08: One paper is accepted by APWeb-WAIM 2023.
- 2023-08: One paper is accepted by CIKM 2023.
- 2023-08: Our book on diffusion models is now available.
- 2023-05: One paper is accepted by TKDE 2023.
- 2023-05: One paper is accepted by SIGKDD 2023.
- 2023-05: One paper is accepted by VLDB 2023.
- 2023-03: One paper is accepted by SIGMOD 2023.
- 2022-11: One paper is accepted by AAAI 2023.
- 2022-11: One paper is accepted by ICDE 2023.
- 2022-10: One paper is accepted by VLDBJ 2022.
- 2022-09: One paper is accepted by NeurIPS 2022.
- 2022-09: I was awarded Rising Star (云帆奖-明日之星) at the World AI Conference, 2022.
- 2022-06: I am honored to serve as the valedictorian for the class of 2022 in CS at PKU.
- 2022-06: I received my Ph.D. degree in computer science from Peking University with the Outstanding Doctoral Dissertation Award.
- 2022-05: One paper is accepted by the journal VLDBJ 2022.
- 2022-05: Four papers are accepted by the conference SIGKDD 2022.
- 2022-05: Two papers as first author have been accepted by ICML 2022.
- 2022-05: One paper related to AutoML has been accepted by Bioinformatics 2022.
- 2022-04: 🏆 We won the Best Student Paper Award at WWW 2022!
- 2022-04: We released the first version of our scalable graph learning toolkit, SGL.
- 2022-03: One paper was selected as a Best Paper Award nominee at WWW 2022. The corresponding PasCa system (integrated into SGL) will be open-sourced next month!
- 2022-03: One paper as corresponding author, related to GNN-based recommendation, has been accepted by the journal ACM Computing Surveys 2022.
- 2022-01: One paper related to graph-based recommendation has been accepted by the conference ICDE 2022.
- 2022-01: One paper as first author, related to graph data annotation, has been accepted by the conference ICLR 2022.
- 2022-01: One paper related to our large-scale hyperparameter tuning system has been accepted by the conference VLDB 2022.
- 2022-01: I accepted the invitation to serve as Program Committee member of the Research Track of ACM SIGKDD 2022.
- 2022-01: One paper as first author, related to our scalable graph NAS system, has been accepted by the conference WWW 2022.
- 2021-12: Our OpenBox team won the “Outstanding Winner” at the openGCC contest in CCF ChinaSoft 2021. Congratulations!
- 2021-09: Two papers as first author, related to scalable graph learning and graph data annotation, have been accepted by the conference NeurIPS 2021 with Spotlight (< 3%).
- 2021-08: We propose GAMLP, a scalable and efficient graph model, which achieves #1 performance on the three largest public ogbn graphs (i.e., ogbn-papers100M, ogbn-products, and ogbn-mag)! See the leaderboards here.
- 2021-07: One paper as first author, related to large-scale graph data selection, has been accepted by the conference VLDB 2021.
- 2021-07: One paper as co-first author, related to deep GNN, has been accepted by the journal TKDE 2021.
- 2021-06: One paper as third author, related to our AutoML system, VolcanoML, has been accepted by the conference VLDB 2021.
- 2021-05: Three papers, related to sparse graph, graph decomposition and our blackbox optimization (BBO) system – OpenBox, are accepted by the conference SIGKDD 2021.
- 2021-03: I was selected for the Apple Scholars in AI/ML PhD fellowship, as the only recipient in China. Many thanks to Apple!
- 2021-03: One paper as first author has been accepted by the conference SIGMOD 2021. Looking forward to the meeting in Xi’an this summer!
Contributed Open-source Projects
- Angel: a high-performance distributed machine learning and graph computing platform, jointly designed by Tencent and PKU.
- SGL: a scalable graph learning toolkit for extremely large graph datasets.
- MindWare: a powerful AutoML system, which automates feature engineering, algorithm selection and hyperparameter tuning.
- OpenBox: an efficient open-source system designed for solving generalized black-box optimization (BBO) problems.
Selected Awards
- 🏆 Best Student Full Paper Award, CIKM 2024.
- Weiming Young Scholar, Peking University, 2024
- First Prize of Scientific and Technological Progress Award (科技进步一等奖), CIE (中国电子学会), 2023
- Outstanding Doctoral Dissertation Award (优秀博士学位论文奖), CAAI (中国人工智能学会) (10 people in China), 2023
- Outstanding Doctoral Dissertation Award (优秀博士学位论文奖), Beijing (14 people in PKU, and 104 people in Beijing), 2023
- 🏆 Best Paper Runner Up Award, APWeb-WAIM 2023.
- Rising Star (云帆奖-明日之星), World AI Conference, 2022.
- 🏆 Best Student Paper Award of WWW 2022 (1/1822, the 2nd WWW Best Student Paper from China), 2022
- IVADO Postdoctoral Fellowship, Canada
- Outstanding Doctoral Dissertation Award (优秀博士学位论文奖), Peking University (Sole winner in Computer Software and Theory), 2022
- Outstanding Graduate of Beijing, China, 2022
- Candidate of May 4th Medal (五四奖章) (each school recommends 1 candidate; highest honor in PKU), 2022
- The Big Data Expo Leading Technology Achievement Award, China International Big Data Industry Expo (Angel Graph project), 2022
- Candidate of People of the Year (年度人物) (1 person in EECS, and 42 people in PKU), 2021
- Merit Student of Beijing (2 people in EECS, and 58 people in PKU), 2021
- Apple PhD Fellowship (1 person in China, and 15 people in the world), 2021
- National Scholarship (Top 1% in PKU), 2019, 2021
- Baidu Scholarship Nominee (20 people in the world), 2021
Selected Program Committee Member and Area Chair
- Database and Data Management:
- ICDE 2023, 2024
- DASFAA 2022
- VLDBJ 2022, 2023
- VLDB 2024
- Machine Learning:
- ICML 2021, 2022, 2023, 2024
- NeurIPS 2022, 2023, 2024
- ICLR 2024
- JMLR 2023
- Machine Learning 2023
- LoG 2024
- Data Mining:
- SIGKDD 2021, 2022, 2023, 2024, 2025
- SDM 2024
- WWW 2022, 2023, 2024
- DASFAA 2022, 2023, 2024
- IEEE TKDE 2022, 2023, 2024
- IEEE TNNLS 2022
- PAKDD 2023, 2024
- Others:
- ICCV 2023
- ECCV 2024
- CVPR 2023, 2024
- SCIS 2023