Table of Contents
- Preface
- Technical Details and Methods
- Observation, Action and Reward
- Neural Network Architecture
- Imitation Learning with Importance Sampling
- Diversified League Training
- Rule-Guided Policy Search
- Stabilized Policy Improvement with DAPO
- Results
- Overall Performance
- Human Evaluation
- League Evaluation