Search and Refine During Think:

Autonomous Retrieval‑Augmented Reasoning of LLMs

Yaorui Shi1*, Sihang Li1*, Chang Wu1, Zhiyuan Liu2, Junfeng Fang2, Hengxing Cai3†, An Zhang1, Xiang Wang1†
1University of Science and Technology of China
2National University of Singapore 3DP Technology
*Equal Contribution  †Correspondence

Abstract

Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.



Research Gaps


Current retrieval-augmented reasoning approaches face two key limitations:

  • Lack of refinement of retrieved documents: Existing methods typically feed retrieved documents directly to the LLM without first distilling key information. This forces LLMs to process large volumes of potentially irrelevant content, making it difficult to identify and focus on the most relevant knowledge.
  • Underexplored retrieval-specific rewards: Most reinforcement learning-based RAG methods rely solely on outcome-based rewards (like answer correctness) to guide the model's behavior. They neglect the importance of retrieval-specific rewards that could directly improve the quality of the retrieval process itself.

Method

The Search-and-Refine-During-Think Paradigm


At the core of AutoRefine is a novel "search-and-refine-during-think" paradigm that extends the traditional "search-during-think" approach. This paradigm allows the LLM to:

  • Think: Reason about the problem and identify knowledge gaps.
  • Search: Formulate queries to retrieve relevant information.
  • Receive Documents: Obtain documents from an external knowledge source.
  • Refine: Distill and extract key information from retrieved documents.
This cycle can repeat multiple times until the model has gathered sufficient information to confidently answer the question.
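
Below is a minimal Python sketch of one rollout under this paradigm. The tag names (<think>, <search>, <documents>, <refine>, <answer>) mirror the step names above, and llm_generate and retrieve are hypothetical stand-ins for the policy model and the external retriever; the code illustrates the control flow only and is not the official AutoRefine implementation.

import re

def rollout(question, llm_generate, retrieve, max_turns=4, top_k=5):
    """One search-and-refine-during-think trajectory (illustrative sketch)."""
    trajectory = f"Question: {question}\n"
    for _ in range(max_turns):
        # Think step: the model reasons, then emits either a <search> query
        # or a final <answer>; generation stops at whichever tag closes first.
        step = llm_generate(trajectory, stop=["</search>", "</answer>"])
        trajectory += step
        answer = re.search(r"<answer>(.*)", step, re.DOTALL)
        if answer:
            return trajectory, answer.group(1).strip()
        query = re.search(r"<search>(.*)", step, re.DOTALL)
        if query:
            # Retrieved documents are appended to the context; the next turn
            # is expected to open with a <refine> block that distills the key
            # evidence before the model searches again or answers.
            docs = retrieve(query.group(1).strip(), top_k=top_k)
            trajectory += "\n<documents>\n" + "\n".join(docs) + "\n</documents>\n"
    return trajectory, None  # no confident answer within the turn budget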

Reward Modeling with Retrieval-Aware Signals


AutoRefine trains its policy with group relative policy optimization (GRPO) on a reward that combines two signals (see the sketch after the list below):

  • Outcome-Based Reward: This measures the correctness of the final answer using metrics like F1-score.
  • Retrieval-Specific Reward: This evaluates the quality of refined knowledge based on how well it captures essential information from the retrieved documents.
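
As a rough illustration, the sketch below combines a token-level F1 outcome reward with a simple retrieval-specific reward that checks whether the gold answer is preserved inside the <refine> spans, and standardizes rewards across a sampled group of rollouts in GRPO fashion. The weight alpha and the exact form of the refinement check are assumptions for illustration; the paper's precise reward definitions may differ.

import re
from collections import Counter

def f1_score(pred, gold):
    """Token-level F1 between predicted and gold answers (outcome reward)."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def trajectory_reward(trajectory, predicted_answer, gold_answer, alpha=0.5):
    """Outcome reward plus a simple retrieval-specific bonus (assumed weighting)."""
    outcome = f1_score(predicted_answer, gold_answer)
    refined = " ".join(re.findall(r"<refine>(.*?)</refine>", trajectory, re.DOTALL))
    retrieval = 1.0 if gold_answer.lower() in refined.lower() else 0.0
    return outcome + alpha * retrieval

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within a rollout group."""
    mean = sum(group_rewards) / len(group_rewards)
    std = (sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]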

Experimental Results

Overall Performance


AutoRefine was evaluated on seven question-answering benchmarks, including three single-hop datasets (NQ, TriviaQA, and PopQA) and four multi-hop datasets (HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle). The results demonstrate that:

  • AutoRefine significantly outperforms existing methods, achieving 6.9% higher average accuracy than the strongest baseline across all benchmarks.
  • The model shows particularly strong performance on multi-hop QA tasks, demonstrating its ability to handle complex reasoning that requires combining multiple pieces of information.
  • The search frequency increases during training, especially for multi-hop questions, showing that the model learns to issue more searches when dealing with complex queries.
  • The refinement process successfully extracts crucial information from retrieved documents.

Ablation Study


  • AutoRefine maintains high performance across different retrieval depths (the number of documents retrieved per search), with the best results achieved at k = 5.
  • Models trained with retrieval-specific rewards show higher search frequency, better search quality, and better refinement quality than models trained with only outcome-based rewards.

Acknowledgements

This project is built upon the foundational work of VeRL and Search-R1. We sincerely thank the authors of these projects for their valuable contributions, which have significantly supported and inspired our work.

Citation

@article{shi2025search,
  title={Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs},
  author={Shi, Yaorui and Li, Sihang and Wu, Chang and Liu, Zhiyuan and Fang, Junfeng and Cai, Hengxing and Zhang, An and Wang, Xiang},
  journal={arXiv preprint arXiv:2505.11277},
  year={2025}
}