DataFlow
DataFlow is an LLM-driven framework for unified data preparation and workflow automation. It abstracts data operators, prompts, and workflows into reusable pipelines, supporting data generation, cleaning, filtering, evaluation, and format conversion for training, fine-tuning, RAG, and domain-specific AI applications.
DataFlow also provides a foundation for the broader DataFlow-EcoSystem, where multiple repositories can follow shared operator and pipeline protocols and collaborate as a data-centric AI toolchain.
Code / Technical Report / Documentation
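To make the idea of operators composed into reusable pipelines concrete, here is a minimal sketch in plain Python. It is not DataFlow's actual API: the `Operator` protocol, the `run` method, and the `StripWhitespace`, `MinLengthFilter`, and `Pipeline` classes are all hypothetical names chosen for illustration.

```python
from typing import Iterable, Protocol

Record = dict  # one data sample, e.g. {"text": "..."}


class Operator(Protocol):
    """Hypothetical shared protocol: an operator maps records to records."""

    def run(self, records: Iterable[Record]) -> Iterable[Record]: ...


class StripWhitespace:
    """Cleaning operator: trims leading and trailing whitespace from the text field."""

    def run(self, records):
        for record in records:
            yield {**record, "text": record["text"].strip()}


class MinLengthFilter:
    """Filtering operator: drops records whose text is shorter than min_chars."""

    def __init__(self, min_chars: int):
        self.min_chars = min_chars

    def run(self, records):
        return (r for r in records if len(r["text"]) >= self.min_chars)


class Pipeline:
    """Chains operators in order; it also satisfies the Operator protocol,
    so pipelines can be nested inside other pipelines."""

    def __init__(self, *operators: Operator):
        self.operators = operators

    def run(self, records):
        for op in self.operators:
            records = op.run(records)
        return records


def demo():
    raw = [{"text": "  hello world  "}, {"text": " hi "}]
    pipeline = Pipeline(StripWhitespace(), MinLengthFilter(min_chars=5))
    return list(pipeline.run(raw))
```

Because every stage exposes the same `run` signature, a repository that follows the protocol could contribute operators without knowing which pipeline will host them, which is the kind of collaboration the shared protocols are meant to enable.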
MinerU 2.5
MinerU is a high-accuracy document parsing engine for LLM, RAG, and agent workflows. MinerU 2.5 introduces a decoupled vision-language parsing strategy for efficient high-resolution document understanding, separating global layout analysis from local content recognition to better handle dense text, formulas, tables, and complex layouts.
Code / Technical Report / ACL 2026 Industry Track Oral
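The decoupled strategy described for MinerU 2.5 can be sketched as a two-stage function. This is an illustrative toy, not MinerU's API: `Region`, `parse_page`, and the two stub models are hypothetical, and the sketch assumes the global layout stage runs on a downsampled page while recognition runs on full-resolution crops.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]                  # toy grayscale page: rows of pixel values
Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)


@dataclass
class Region:
    kind: str  # e.g. "text", "table", "formula"
    box: Box


def downsample(image: Image, factor: int) -> Image:
    """Keep every factor-th pixel in both directions."""
    return [row[::factor] for row in image[::factor]]


def crop(image: Image, box: Box) -> Image:
    """Cut a normalized box out of the full-resolution image."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    rows = image[int(y0 * height):int(y1 * height)]
    return [row[int(x0 * width):int(x1 * width)] for row in rows]


def parse_page(
    image: Image,
    detect_layout: Callable[[Image], List[Region]],
    recognize: Callable[[str, Image], str],
    factor: int = 4,
) -> List[Tuple[str, str]]:
    # Stage 1: global layout analysis on a cheap thumbnail (assumed).
    regions = detect_layout(downsample(image, factor))
    # Stage 2: local content recognition on native-resolution crops.
    return [(r.kind, recognize(r.kind, crop(image, r.box))) for r in regions]


def demo():
    page = [[0] * 8 for _ in range(8)]

    def fake_layout(thumbnail):
        # Stub model: pretend a text block sits above a table.
        return [Region("text", (0.0, 0.0, 1.0, 0.5)),
                Region("table", (0.0, 0.5, 1.0, 1.0))]

    def fake_recognize(kind, patch):
        # Stub model: report the crop size instead of real content.
        return f"{kind} crop {len(patch)}x{len(patch[0])}"

    return parse_page(page, fake_layout, fake_recognize)
```

The point of the split is that only the small crops are processed at full resolution, so dense text, formulas, and tables keep their detail without running the whole page through the model at native size.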
MinerU-HTML
MinerU-HTML is an SLM-powered HTML main-content extractor. It converts complex web pages into cleaner AI-ready content by removing boilerplate, navigation, ads, and metadata while preserving structured elements such as code blocks, formulas, and tables.
The resulting extraction pipeline supports web-scale corpus construction, Deep Research agents, RAG, and model training.
Code / Technical Report / Model
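For intuition about what main-content extraction has to do, the sketch below uses a fixed tag list from Python's standard `html.parser` in place of a model. It is a deliberately simple rule-based stand-in, not MinerU-HTML's SLM approach; `SKIP_TAGS`, `MainContentExtractor`, and `extract_main_content` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed boilerplate containers; a learned extractor would decide this per block.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate container
        self.in_pre = False   # True while inside <pre>, where whitespace matters
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.skip_depth:
            return                                      # drop boilerplate text
        if self.in_pre:
            self.parts.append(data)                     # keep code verbatim
        elif data.strip():
            self.parts.append(" ".join(data.split()))   # normalize prose spacing


def extract_main_content(html: str) -> str:
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


SAMPLE = (
    "<html><body><nav>Home | About</nav>"
    "<p>Main   text.</p><pre><code>x  =  1</code></pre>"
    "<script>track()</script><footer>(c) 2024</footer></body></html>"
)
```

Calling `extract_main_content(SAMPLE)` keeps the paragraph and the code block and drops the navigation, script, and footer. Fixed rules like these break on pages whose boilerplate is not marked up semantically, which is the gap a model-based extractor is meant to close.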
RayOrch
RayOrch is a lightweight orchestration framework for asynchronous Ray pipelines. It provides RayModule, overlapped micro-batch execution, and DAG-style scheduling, making it easier to dynamically schedule and serve multiple deep learning models across NLP, CV, and multimodal inference pipelines.
Code
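Overlapped micro-batch execution can be illustrated without Ray. The sketch below uses Python's `asyncio` queues in place of Ray actors and does not use RayOrch's `RayModule`; `worker`, `run_pipeline`, `encode`, and `score` are hypothetical names. Each stage has its own worker, so while the second stage handles micro-batch i, the first stage can already work on micro-batch i + 1.

```python
import asyncio


async def worker(fn, inbox, outbox):
    """Run one stage: pull a micro-batch, process it, pass it downstream."""
    while True:
        batch = await inbox.get()
        if batch is None:            # sentinel: forward shutdown and stop
            await outbox.put(None)
            return
        await outbox.put(await fn(batch))


async def run_pipeline(items, stages, micro_batch_size=2):
    # One queue in front of each stage, plus one for the final results.
    queues = [asyncio.Queue() for _ in range(len(stages) + 1)]
    tasks = [
        asyncio.create_task(worker(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    # Split the input into micro-batches and feed the first stage.
    for start in range(0, len(items), micro_batch_size):
        await queues[0].put(items[start:start + micro_batch_size])
    await queues[0].put(None)

    results = []
    while (batch := await queues[-1].get()) is not None:
        results.extend(batch)
    await asyncio.gather(*tasks)
    return results


async def encode(batch):
    await asyncio.sleep(0.01)        # stand-in for a model forward pass
    return [x * 2 for x in batch]


async def score(batch):
    await asyncio.sleep(0.01)        # stand-in for a second model
    return [x + 1 for x in batch]


def demo():
    return asyncio.run(run_pipeline([1, 2, 3, 4, 5], [encode, score]))
```

With one worker per stage and FIFO queues, results come back in input order. A DAG-style scheduler would generalize this linear chain by letting a stage read from several upstream queues.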