Development of a method for self-correction of large language models via reinforcement learning
Abstract
This work presents a reinforcement learning-based approach to the self-correction of large language models (LLMs). The motivation stems from the instability in output quality of current LLMs, their tendency to produce factual and logical errors, and the lack of built-in mechanisms for verifying and improving their own responses. We formalize self-correction as an episodic Markov Decision Process (MDP) in which the model generates an initial response and a subsequent correction attempt, receiving a binary reward based on the improvement in quality. The study addresses the major challenges of this setting, including distributional shift, behavioral collapse, and reward hacking. We survey existing approaches such as test-time reasoning, chain-of-thought prompting, fine-tuning, and reinforcement learning-based strategies. Our proposed method employs the Advantage Actor-Critic (A2C) algorithm, selected for its balance between implementation simplicity and optimization efficiency. We describe a two-stage training architecture designed to foster stable self-correction capabilities. Experimental results on mathematical problem-solving tasks demonstrate notable improvements in response quality. These findings highlight the practical relevance of the proposed method, showing enhanced reliability and robustness in the outputs of large language models.
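As a rough illustration of the formalization summarized above, the following minimal Python sketch shows how a binary improvement reward and an A2C-style advantage could be computed for a single self-correction episode. It is not the authors' implementation: the function names and the external answer-checking interface are hypothetical assumptions introduced only for illustration.

```python
# Minimal sketch (not the authors' implementation) of the episodic
# self-correction setting described in the abstract: the model emits an
# initial answer and one correction attempt, and receives a binary reward
# for improvement. `first_correct` / `second_correct` are assumed to come
# from an external answer checker (hypothetical interface).

def improvement_reward(first_correct: bool, second_correct: bool) -> float:
    """Binary reward: 1.0 only if the correction fixes an initially wrong answer."""
    return 1.0 if (second_correct and not first_correct) else 0.0


def a2c_advantage(reward: float, value_estimate: float) -> float:
    """A2C-style advantage for a one-step episode: the terminal reward minus
    the critic's value estimate, used to scale the policy-gradient update."""
    return reward - value_estimate


# Example usage with made-up numbers:
r = improvement_reward(first_correct=False, second_correct=True)  # -> 1.0
adv = a2c_advantage(r, value_estimate=0.4)                        # -> 0.6
```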
Full Text: PDF (Russian)
ISSN: 2307-8162