LLMs Can Learn to Reason Via Off-Policy RL https://arxiv.org/abs/2602.19362 https://www.alphaxiv.org/ru/overview/2602.19362
From this channel
- #6171 The Art of Efficient Reasoning: Data, Reward, and Optimization https://arxiv.org/abs/2602.20945 https://www.alphaxiv.org/ru/overview/2602.20945
- #6172 LocoOperator-4B is a 4B-parameter tool-calling agent model trained via knowledge distillation from Qwen3-Coder-Next inference traces.
- #6174 Grab the Qwens: https://huggingface.co/Qwen/Qwen3.5-27B https://huggingface.co/Qwen/Qwen3.5-35B-A3B
- #6168 On Data Engineering for Scaling LLM Terminal Capabilities https://arxiv.org/abs/2602.21193 https://www.alphaxiv.org/ru/overview/2602.21193…
- #6167 https://www.inceptionlabs.ai/blog/introducing-mercury-2