Projects

Our recent open-source work centers on data-centric AI: preparing high-quality data, dynamically interacting with models during training, evaluating models in an agentic way, and extending the ecosystem to world models and research productivity tools.

1. Data Preparation and Parsing

DataFlow

DataFlow is an LLM-driven framework for unified data preparation and workflow automation. It abstracts data operators, prompts, and workflows into reusable pipelines, supporting data generation, cleaning, filtering, evaluation, and format conversion for training, fine-tuning, RAG, and domain-specific AI applications.

DataFlow also provides a foundation for the broader DataFlow-EcoSystem, where multiple repositories can follow shared operator and pipeline protocols and collaborate as a data-centric AI toolchain.
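
To make the idea of shared operator and pipeline protocols concrete, here is a minimal sketch in plain Python. It is not DataFlow's API: the names `Operator`, `Filter`, `Map`, and `Pipeline` are hypothetical and chosen only for illustration. It shows how operators that agree on one small interface can be chained into a reusable pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol

Record = dict  # one data sample, e.g. {"text": "..."}


class Operator(Protocol):
    """Hypothetical shared protocol: take records in, yield records out."""

    def run(self, records: Iterable[Record]) -> Iterable[Record]: ...


@dataclass
class Filter:
    """Keeps only the records for which the predicate holds."""
    predicate: Callable[[Record], bool]

    def run(self, records):
        return (r for r in records if self.predicate(r))


@dataclass
class Map:
    """Applies a transformation to every record."""
    fn: Callable[[Record], Record]

    def run(self, records):
        return (self.fn(r) for r in records)


@dataclass
class Pipeline:
    """Chains operators in order; it exposes run() too, so pipelines can nest."""
    steps: list

    def run(self, records):
        for step in self.steps:
            records = step.run(records)
        return records


clean = Pipeline([
    # Collapse runs of whitespace and trim the ends.
    Map(lambda r: {**r, "text": " ".join(r["text"].split())}),
    # Drop samples that are too short to be useful.
    Filter(lambda r: len(r["text"]) >= 10),
])

raw = [{"text": "  hello   world, this is a sample "}, {"text": "hi"}]
result = list(clean.run(raw))
```

Because `Pipeline` satisfies the same `run` interface as a single operator, a cleaning pipeline like `clean` could itself be used as one step inside a larger pipeline, which is the kind of composition a shared protocol is meant to allow.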

MinerU 2.5

MinerU is a high-accuracy document parsing engine for LLM, RAG, and agent workflows. MinerU 2.5 introduces a decoupled vision-language parsing strategy for efficient high-resolution document understanding, separating global layout analysis from local content recognition to better handle dense text, formulas, tables, and complex layouts.
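
The decoupling described above can be illustrated with a toy sketch. This is not MinerU's implementation: the page is a small grid of integers, and `detect_layout` and `recognize` are hypothetical stand-ins for the layout and recognition models. The point it shows is the control flow: regions are found on a cheap downsampled view, and only those regions are then read at native resolution.

```python
from typing import List, Tuple

Image = List[List[int]]          # toy grayscale page, row-major, 0 = blank
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def downsample(img: Image, k: int) -> Image:
    """Keep every k-th pixel in each direction to get a cheap thumbnail."""
    return [row[::k] for row in img[::k]]


def detect_layout(thumb: Image) -> List[Box]:
    """Stand-in for global layout analysis: one box per run of non-blank rows."""
    boxes, start = [], None
    blank_row = [0] * len(thumb[0])
    for y, row in enumerate(thumb + [blank_row]):  # sentinel row closes the last run
        inked = any(row)
        if inked and start is None:
            start = y
        elif not inked and start is not None:
            cols = [x for r in thumb[start:y] for x, v in enumerate(r) if v]
            boxes.append((start, min(cols), y, max(cols) + 1))
            start = None
    return boxes


def scale_box(box: Box, k: int) -> Box:
    """Map a thumbnail box back to native-resolution coordinates."""
    return tuple(v * k for v in box)


def crop(img: Image, box: Box) -> Image:
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]


def recognize(patch: Image) -> int:
    """Stand-in for local content recognition: count inked pixels."""
    return sum(1 for row in patch for v in row if v)


def parse_page(img: Image, k: int = 2):
    """Stage 1 on the thumbnail, stage 2 on native-resolution crops."""
    thumb = downsample(img, k)
    boxes = [scale_box(b, k) for b in detect_layout(thumb)]
    return [(b, recognize(crop(img, b))) for b in boxes]


page = [
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
regions = parse_page(page, k=2)
```

In this toy page the thumbnail sees only two inked pixels per region, while the native-resolution crop passed to `recognize` contains all eight, which is why fine-grained content is read from the original page rather than from the downsampled view.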

MinerU-HTML

MinerU-HTML is an SLM-powered HTML main-content extractor. It converts complex web pages into cleaner AI-ready content by removing boilerplate, navigation, ads, and metadata while preserving structured elements such as code blocks, formulas, and tables.

The resulting extraction pipeline supports web-scale corpus construction, Deep Research agents, RAG, and model training.

Flash-MinerU

Flash-MinerU is a Ray-powered acceleration layer for MinerU that turns PDF-to-Markdown parsing into a scalable data infrastructure component. It keeps MinerU's parsing logic and output format while adding distributed execution, high-throughput VLM inference, and asynchronous pipeline parallelism for multi-GPU and cluster-ready document processing.

RayOrch

RayOrch is a lightweight orchestration framework for asynchronous Ray pipelines. It provides RayModule, overlapped micro-batch execution, and DAG-style scheduling, making it easier to dynamically schedule and serve multiple deep learning models across NLP, CV, and multimodal inference pipelines.
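
Overlapped micro-batch execution can be sketched without Ray at all. The example below uses only Python's standard `asyncio` and is not RayOrch's API (`RayModule` is not shown); the stage names `detect` and `recognize` and the `asyncio.sleep` calls are hypothetical stand-ins for model inference. It shows two stages connected by queues, so that the first stage can begin the next micro-batch while the second is still busy with the previous one.

```python
import asyncio


async def stage(name, inbox, outbox, delay, log):
    """Consume micro-batches from inbox, 'process' them, and pass them on."""
    while True:
        batch = await inbox.get()
        if batch is None:            # sentinel: propagate shutdown downstream
            await outbox.put(None)
            return
        log.append((name, "start", batch["id"]))
        await asyncio.sleep(delay)   # stand-in for model inference
        log.append((name, "end", batch["id"]))
        await outbox.put({**batch, "trace": batch["trace"] + [name]})


async def run_pipeline(num_batches=3):
    log = []
    q0, q1, q2 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    workers = [
        asyncio.create_task(stage("detect", q0, q1, 0.01, log)),
        asyncio.create_task(stage("recognize", q1, q2, 0.01, log)),
    ]
    # Feed all micro-batches, then a sentinel to signal the end of input.
    for i in range(num_batches):
        await q0.put({"id": i, "trace": []})
    await q0.put(None)

    # Drain the final queue until the sentinel arrives.
    results = []
    while (item := await q2.get()) is not None:
        results.append(item)
    await asyncio.gather(*workers)
    return results, log


results, log = asyncio.run(run_pipeline())
```

In the recorded `log`, the `detect` stage starts micro-batch 1 before the `recognize` stage has finished micro-batch 0, which is the overlap. A distributed version would replace the coroutines with remote workers, but the queue-and-sentinel structure would stay much the same.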

2. Data-Model Interaction Training and Evaluation

DataFlex

DataFlex is a data-centric dynamic training framework built on top of LLaMA-Factory. It supports three core paradigms of data optimization during training: sample selection, domain mixture adjustment, and sample reweighting. By integrating difficult-to-reproduce data selection and weighting methods into one framework, DataFlex improves reproducibility and enables more flexible model-data interaction.
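
The three paradigms can be told apart with a small sketch. The functions below are generic, loss-based stand-ins written for illustration; they are not DataFlex's algorithms or API and do not involve LLaMA-Factory, and every name and parameter (`select_top_k`, `softmax_weights`, `update_mixture`, `temperature`, `lr`) is an assumption. Each one takes per-sample losses from the current model and returns a different kind of training signal.

```python
import math
from collections import defaultdict


def select_top_k(losses, k):
    """Sample selection: indices of the k highest-loss samples."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:k]


def softmax_weights(losses, temperature=1.0):
    """Sample reweighting: weights proportional to exp(loss / temperature)."""
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def update_mixture(mixture, losses, domains, lr=1.0):
    """Domain mixture adjustment: shift probability toward high-loss domains."""
    sums, counts = defaultdict(float), defaultdict(int)
    for loss, domain in zip(losses, domains):
        sums[domain] += loss
        counts[domain] += 1
    raw = {}
    for domain, prob in mixture.items():
        if counts[domain]:
            mean_loss = sums[domain] / counts[domain]
            raw[domain] = prob * math.exp(lr * mean_loss)
        else:
            raw[domain] = prob  # no samples seen this step: keep as-is
    norm = sum(raw.values())
    return {domain: value / norm for domain, value in raw.items()}


losses = [0.2, 1.5, 0.9, 0.1]
domains = ["code", "math", "math", "code"]

selected = select_top_k(losses, k=2)                  # which samples to train on
weights = softmax_weights(losses)                     # how much each sample counts
mixture = update_mixture({"code": 0.5, "math": 0.5},  # how to sample domains next
                         losses, domains)
```

Selection changes which samples are seen, reweighting changes how much each seen sample contributes to the loss, and mixture adjustment changes the sampling proportions across domains for subsequent steps. Keeping all three behind one interface is what makes methods of each type comparable under the same training setup.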

One-Eval

One-Eval is an agentic system for automated and traceable LLM evaluation. It targets NL2Eval: starting from a natural-language evaluation requirement, the system plans benchmarks, prepares datasets, runs inference, selects metrics, and generates reports with traceable evidence and human-in-the-loop checkpoints.

3. World Models and Scientific Productivity

OpenWorldLib

OpenWorldLib is a unified codebase and framework for advanced world models. It defines a world model as a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. OpenWorldLib integrates tasks such as multimodal understanding, visual action prediction, visual generation, and simulation into a standardized research and development framework.

Paper2Any

Paper2Any turns papers, text, topics, and screenshots into editable research assets. It supports model architecture diagrams, technical route diagrams, experimental plots, slide decks, rebuttals, posters, narrated videos, and citation exploration.

The project focuses on multimodal workflows built around papers, so that research communication artifacts stay editable rather than static.

4. Ecosystem Summary

Project | Main Role | Links
DataFlow | Unified LLM data preparation and workflow automation | Code / Technical Report / Docs
MinerU 2.5 | High-resolution document parsing VLM | Code / Technical Report / ACL 2026
MinerU-HTML | AI-ready web-page main-content extraction | Code / Technical Report / Model
Flash-MinerU | Ray-powered distributed acceleration for MinerU parsing | Code / PyPI / Benchmark
DataFlow-MM | Multimodal data preparation | Code / Docs
AgentFlow | Agent data synthesis framework | Code
RayOrch | Ray-based asynchronous model orchestration | Code
DataFlex | Data selection, mixture, and reweighting during training | Code / Technical Report / Docs
One-Eval | Agentic NL2Eval evaluation system | Code / Technical Report
OpenWorldLib | Unified framework for advanced world models | Code / Technical Report
Paper2Any | Editable scientific figures, slides, and paper workflows | Code / Demo / Paper2SysArch / SciFlow-Bench