Reinforcement Learning via Self-Distillation
https://arxiv.org/abs/2601.20802
https://www.alphaxiv.org/ru/overview/2601.20802
https://github.com/lasgroup/SDPO