We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm. By making our work publicly available, we provide the broader research community and society with practical access to scalable reinforcement learning, enabling all to benefit from these advancements. Our system is based on the awesome verl framework. Thanks for their great work! Training the Qwen2.5-32B base model with DAPO outperforms the previous state-of-the-art DeepSeek-R1-Zero-Qwen-32B on AIME 2024, achieving 50% accuracy with 50% fewer training steps.
To benefit the broader research community, we fully open-source our RL training recipe, including algorithm details, dataset, verifier, model weights, and infrastructure. We use verl to train the Qwen2.5-32B base model from scratch with DAPO on 128 H20 GPUs. Our open-source experiments were conducted on the Volcano Engine Machine Learning Platform, and we will provide a full reproduction guide on that platform to help users replicate our experiments.
We provide training and validation datasets for DAPO training.
We adopt a simple but robust rule-based verifier relying on string normalization and matching.
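To make the idea concrete, here is a minimal sketch of such a string-matching verifier. The function names, the `\boxed{...}` answer extraction, and the specific normalization rules are illustrative assumptions for this sketch, not the exact verifier we ship.

```python
import re

def normalize_answer(ans: str) -> str:
    """Normalize an answer string before comparison: strip whitespace,
    surrounding math delimiters, thousands separators, and case."""
    ans = ans.strip()
    ans = ans.strip("$")            # drop surrounding math delimiters
    ans = ans.replace(",", "")      # "1,000" -> "1000"
    ans = re.sub(r"\s+", " ", ans)  # collapse internal whitespace
    return ans.lower()

def verify(model_output: str, ground_truth: str) -> bool:
    """Rule-based check: extract the final answer from the model output
    (assumed here to appear inside \\boxed{...}) and compare the
    normalized strings for exact match."""
    match = re.search(r"\\boxed\{([^{}]*)\}", model_output)
    if match is None:
        return False
    return normalize_answer(match.group(1)) == normalize_answer(ground_truth)
```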
We provide an out-of-the-box script for reproducing DAPO training.
We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, which comprises several key techniques, listed below. Detailed analysis and insights can be found in our technical report.
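As a rough illustration of two of these techniques (decoupled clipping and a token-level policy loss), here is a minimal PyTorch sketch. The function name, tensor layout, and clip-range defaults are assumptions made for this sketch rather than our exact training code; see the technical report for the precise objective.

```python
import torch

def dapo_clip_loss(logprobs, old_logprobs, advantages, response_mask,
                   eps_low=0.2, eps_high=0.28):
    """Token-level policy loss with decoupled clip ranges (sketch).

    All inputs are [batch, seq_len] tensors; response_mask is 1 on response
    tokens and 0 elsewhere. Using eps_high > eps_low loosens the upper clip,
    leaving more room to up-weight low-probability exploration tokens.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.minimum(ratio * advantages, clipped_ratio * advantages)
    # Token-level aggregation: average over all response tokens in the batch,
    # so long responses are not down-weighted relative to short ones.
    return -(per_token * response_mask).sum() / response_mask.sum().clamp(min=1)
```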