Search and Refine During Think:

Autonomous Retrieval‑Augmented Reasoning of LLMs

Yaorui Shi1*, Sihang Li1*, Chang Wu1, Zhiyuan Liu2, Junfeng Fang2, Hengxing Cai3†, An Zhang1, Xiang Wang1†
1University of Science and Technology of China
2National University of Singapore 3DP Technology
*Equal Contribution  †Correspondence

Abstract

Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.



Research Gaps


Current retrieval-augmented reasoning approaches face two key limitations:

  • Lack of refinement of retrieved documents: Existing methods typically feed retrieved documents directly to the LLM without first distilling key information. This forces LLMs to process large volumes of potentially irrelevant content, making it difficult to identify and focus on the most relevant knowledge.
  • Underexplored retrieval-specific rewards: Most reinforcement learning-based RAG methods rely solely on outcome-based rewards (like answer correctness) to guide the model's behavior. They neglect the importance of retrieval-specific rewards that could directly improve the quality of the retrieval process itself.

Method

The Search-and-Refine-During-Think Paradigm


At the core of AutoRefine is a novel "search-and-refine-during-think" paradigm that extends the traditional "search-during-think" approach. This paradigm allows the LLM to:

  • Think: Reason about the problem and identify knowledge gaps.
  • Search: Formulate queries to retrieve relevant information.
  • Receive Documents: Obtain documents from an external knowledge source.
  • Refine: Distill and extract key information from retrieved documents.
This cycle can repeat multiple times until the model has gathered sufficient information to confidently answer the question.
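
Below is a minimal Python sketch of one rollout under this paradigm. The tag names (<think>, <search>, <documents>, <refine>, <answer>) mirror the step names above, and llm_generate and retrieve are hypothetical stand-ins for the policy model and the external retriever; the code illustrates the control flow only and is not the official AutoRefine implementation.

import re

def rollout(question, llm_generate, retrieve, max_turns=4, top_k=5):
    """One search-and-refine-during-think trajectory (illustrative sketch)."""
    trajectory = f"Question: {question}\n"
    for _ in range(max_turns):
        # Think step: the model reasons, then emits either a <search> query
        # or a final <answer>; generation stops at whichever tag closes first.
        step = llm_generate(trajectory, stop=["</search>", "</answer>"])
        trajectory += step
        answer = re.search(r"<answer>(.*)", step, re.DOTALL)
        if answer:
            return trajectory, answer.group(1).strip()
        query = re.search(r"<search>(.*)", step, re.DOTALL)
        if query:
            # Retrieved documents are appended to the context; the next turn
            # is expected to open with a <refine> block that distills the key
            # evidence before the model searches again or answers.
            docs = retrieve(query.group(1).strip(), top_k=top_k)
            trajectory += "\n<documents>\n" + "\n".join(docs) + "\n</documents>\n"
    return trajectory, None  # no confident answer within the turn budget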

Reward Modeling with Retrieval-Aware Signals


AutoRefine trains its policy with group relative policy optimization (GRPO) on a reward that combines two signals (see the sketch after the list below):

  • Outcome-Based Reward: This measures the correctness of the final answer using metrics like F1-score.
  • Retrieval-Specific Reward: This evaluates the quality of refined knowledge based on how well it captures essential information from the retrieved documents.
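
As a rough illustration, the sketch below combines a token-level F1 outcome reward with a simple retrieval-specific reward that checks whether the gold answer is preserved inside the <refine> spans, and standardizes rewards across a sampled group of rollouts in GRPO fashion. The weight alpha and the exact form of the refinement check are assumptions for illustration; the paper's precise reward definitions may differ.

import re
from collections import Counter

def f1_score(pred, gold):
    """Token-level F1 between predicted and gold answers (outcome reward)."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def trajectory_reward(trajectory, predicted_answer, gold_answer, alpha=0.5):
    """Outcome reward plus a simple retrieval-specific bonus (assumed weighting)."""
    outcome = f1_score(predicted_answer, gold_answer)
    refined = " ".join(re.findall(r"<refine>(.*?)</refine>", trajectory, re.DOTALL))
    retrieval = 1.0 if gold_answer.lower() in refined.lower() else 0.0
    return outcome + alpha * retrieval

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within a rollout group."""
    mean = sum(group_rewards) / len(group_rewards)
    std = (sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]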

Experimental Results

Overall Performance


AutoRefine was evaluated on seven question-answering benchmarks, including three single-hop datasets (NQ, TriviaQA, and PopQA) and four multi-hop datasets (HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle). The results demonstrate that:

  • AutoRefine significantly outperforms existing methods, achieving 6.9% higher average accuracy than the strongest baseline across all benchmarks.
  • The model shows particularly strong performance on multi-hop QA tasks, demonstrating its ability to handle complex reasoning that requires combining multiple pieces of information.
  • The search frequency increases during training, especially for multi-hop questions, showing that the model learns to issue more searches when dealing with complex queries.
  • The refinement process successfully extracts crucial information from retrieved documents.

Ablation Study


  • AutoRefine maintains high performance across different retrieval depths (the number of documents retrieved per search), with the best results achieved at k = 5.
  • Models trained with retrieval-specific rewards show higher search frequency, better search quality, and better refinement quality than models trained with only outcome-based rewards.

Acknowledgements

This project is built upon the foundational work of VeRL and Search-R1. We sincerely thank the authors of these projects for their valuable contributions, which have significantly supported and inspired our work.

Citation

@article{shi2025search,
  title={Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs},
  author={Shi, Yaorui and Li, Sihang and Wu, Chang and Liu, Zhiyuan and Fang, Junfeng and Cai, Hengxing and Zhang, An and Wang, Xiang},
  journal={arXiv preprint arXiv:2505.11277},
  year={2025}
}