Look Back to Reason Forward:

Revisitable Memory for Long-Context LLM Agents

Yaorui Shi1*, Yuxin Chen2*, Siyuan Wang3, Sihang Li1, Hengxing Cai4, Qi Gu5, Xiang Wang1†, An Zhang1†
1University of Science and Technology of China 2National University of Singapore
3Shanghai Jiao Tong University 4DP Technology 5Meituan
*Equal Contribution †Correspondence

Abstract

Large language models face challenges in long-context question answering, where the key evidence for a query may be dispersed across millions of tokens. Existing works equip large language models with a memory buffer that is dynamically updated via a linear document scan, a paradigm known as "memorize while reading". While this approach scales efficiently, it suffers from pruning of latent evidence, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, which integrates a memory-retrieval mechanism into the memory update process, enabling the agent to selectively call back historical memories for non-linear reasoning. To further strengthen training, we propose a multi-level reward design that combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support complex multi-hop reasoning. Extensive experiments demonstrate that ReMemR1 significantly outperforms state-of-the-art baselines on long-context question answering while incurring negligible computational overhead.


Motivation

(Figure: teaser illustrating the conceptual example)

Current memory-augmented approaches for long-context QA face key limitations:

  • Progressive information loss: "Memorize while reading" paradigms perform a single linear scan over the document, progressively overwriting memory. This causes important evidence to be pruned, especially when key information is scattered far apart in the context.
  • Sparse RL signals: Existing RL-based methods rely solely on outcome-level rewards (final-answer correctness), which provide sparse and delayed feedback. This makes it difficult for the model to learn effective memory update and retrieval strategies at each step.

ReMemR1 addresses both challenges with: (1) a callback mechanism that enables non-linear memory revisiting over past details, and (2) a multi-level reward design combining outcome rewards with dense step-level supervision.

Method

Memory Update with Callback

(Figure: framework overview)

Unlike conventional memory agents, whose restrictive state makes each memory update depend only on the current context and the previous memory, ReMemR1 augments the state representation with a callback query. At each time step, the agent:

  • Updates memory: Generates a new memory state by integrating the current document chunk with the previous memory.
  • Generates callback query: Produces a query to retrieve relevant information from the full memory history.
  • Retrieves and integrates: Uses the callback query to fetch past memories, enabling non-linear access to previously processed information.

This mechanism enables the agent to selectively revisit historical memories, preventing the information loss caused by linear overwriting.
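The three steps above can be sketched as a simple reading loop. This is a minimal illustration, not the paper's implementation: the `llm` callable and the character-overlap retriever are hypothetical stand-ins for the actual model call and its learned retrieval.

```python
from difflib import SequenceMatcher

def retrieve(query, history, top_k=1):
    # Rank past memory states by similarity to the callback query.
    # A real agent would use a learned or embedding-based retriever;
    # character overlap is only a stand-in.
    return sorted(history,
                  key=lambda m: SequenceMatcher(None, query, m).ratio(),
                  reverse=True)[:top_k]

def read_document(chunks, llm):
    """Linear scan with callback: each update sees the current chunk,
    the previous memory, and memories recalled from the full history."""
    memory, query, history = "", "", []
    for chunk in chunks:
        recalled = retrieve(query, history) if query and history else []
        # The model rewrites its memory and emits the next callback query.
        memory, query = llm(chunk, memory, recalled)
        history.append(memory)
    return memory
```

Because every past memory state stays in `history`, evidence compressed away at step t can still be recalled at a later step via the callback query, which a purely linear scan cannot do.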

History-Augmented State Transition

(Figure: state-transition comparison)

(Left) Conventional memory agents use a restrictive state s_t = m_t, where the next memory depends only on the current context and memory. (Right) Our method represents the state as s_t = (m_t, q_t), where the agent generates a callback query q_t to retrieve relevant information from its entire memory history, enabling non-linear reasoning paths.
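In symbols (our notation, reconstructed from the description above rather than copied from the paper): with incoming chunk c_{t+1}, memory m_t, and memory history M_{<t} = {m_1, ..., m_t}, the two transitions read:

```latex
% Conventional linear scan: the state is the memory alone
s_t = m_t, \qquad m_{t+1} = f_\theta(c_{t+1}, m_t)

% ReMemR1: the state also carries a callback query q_t, and the update
% conditions on memories retrieved from the full history
s_t = (m_t, q_t), \qquad
(m_{t+1}, q_{t+1}) = f_\theta\bigl(c_{t+1},\, m_t,\, \mathrm{Retrieve}(q_t, M_{<t})\bigr)
```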

Multi-Level Reward Design

(Figure: multi-level reward computation)

ReMemR1 employs a multi-level reward design that combines:

  • Outcome rewards: Measure the correctness of the final answer at terminal states.
  • State rewards: Provide dense, step-level signals that guide effective memory use at each intermediate step.

Each reward type is normalized at the corresponding level: state rewards across all states at the same step, and outcome rewards across all trajectories in the group. This design alleviates the sparse supervision problem while maintaining alignment with the ultimate objective.
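A sketch of this normalization, assuming a GRPO-style group of sampled trajectories and a hypothetical weight `alpha` trading off the two signal types (names and shapes are illustrative, not the paper's code):

```python
import statistics

def normalize(values):
    # Z-score normalization within one comparison group.
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance
    return [(v - mean) / std for v in values]

def multi_level_advantages(state_rewards, outcome_rewards, alpha=0.8):
    """Combine step-level and outcome-level signals.

    state_rewards: list over trajectories, each a list over steps.
    outcome_rewards: one scalar per trajectory.
    State rewards are normalized across trajectories at the same step;
    outcome rewards are normalized across the whole group.
    """
    n_traj, n_steps = len(state_rewards), len(state_rewards[0])
    # Normalize state rewards per step, across trajectories.
    per_step = [normalize([state_rewards[i][t] for i in range(n_traj)])
                for t in range(n_steps)]
    outcome_adv = normalize(outcome_rewards)
    # Weighted combination: final-answer signal dominates,
    # dense step-level signal guides intermediate memory use.
    return [[alpha * outcome_adv[i] + (1 - alpha) * per_step[t][i]
             for t in range(n_steps)] for i in range(n_traj)]
```

With `alpha` near 1 the objective reduces to outcome-only GRPO; lowering it injects more of the dense per-step guidance.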

Experimental Results

Main Results

ReMemR1 was evaluated on long-context QA benchmarks with context lengths ranging from 100 to 6400 documents, using HotpotQA (in-distribution) and 2WikiMultiHopQA (out-of-distribution):

Accuracy on HotpotQA (In-Distribution)

| Scale | Method | Avg | 100 | 200 | 400 | 800 | 1600 | 3200 | 6400 |
|---|---|---|---|---|---|---|---|---|---|
| 3B | MemAgent | 60.9 | 70.3 | 69.4 | 68.8 | 60.9 | 60.2 | 59.4 | 58.8 |
| 3B | ReMemR1 (Ours) | 63.8 | 70.9 | 71.7 | 74.0 | 65.4 | 65.0 | 65.4 | 66.1 |
| 7B+ | R1-Distill-7B | 10.2 | 40.6 | 25.8 | 0.8 | 1.6 | 2.3 | 1.5 | 3.1 |
| 7B+ | Qwen2.5-1M-7B | 54.7 | 75.8 | 71.9 | 68.0 | 67.2 | 69.5 | 22.7 | 0.0 |
| 7B+ | QwenLong-L1-32B | 57.8 | 83.6 | 85.2 | 74.2 | 73.4 | 57.8 | 38.9 | 38.3 |
| 7B+ | MemAgent-7B | 74.0 | 81.8 | 78.9 | 78.9 | 77.0 | 79.7 | 72.1 | 75.8 |
| 7B+ | ReMemR1-7B (Ours) | 81.1 | 82.3 | 82.8 | 78.9 | 82.0 | 79.7 | 80.0 | 80.8 |

Accuracy on 2WikiMultiHopQA (Out-of-Distribution)

| Scale | Method | Avg | 100 | 200 | 400 | 800 | 1600 | 3200 | 6400 |
|---|---|---|---|---|---|---|---|---|---|
| 3B | MemAgent | 40.2 | 41.4 | 45.3 | 39.4 | 36.3 | 28.9 | 26.7 | 25.9 |
| 3B | ReMemR1 (Ours) | 42.5 | 53.5 | 50.4 | 41.7 | 37.0 | 36.2 | 35.4 | 37.8 |
| 7B+ | R1-Distill-7B | 25.8 | 36.7 | 29.7 | 0.0 | 0.8 | 2.3 | 2.3 | 0.8 |
| 7B+ | Qwen2.5-1M-7B | 45.3 | 62.5 | 59.4 | 57.8 | 47.7 | 46.1 | 25.8 | 0.0 |
| 7B+ | QwenLong-L1-32B | 45.3 | 74.2 | 69.5 | 65.6 | 58.6 | 38.3 | 24.6 | 29.9 |
| 7B+ | MemAgent-7B | 50.8 | 61.7 | 57.8 | 47.6 | 50.7 | 44.5 | 46.9 | 44.7 |
| 7B+ | ReMemR1-7B (Ours) | 55.6 | 63.9 | 63.1 | 54.5 | 54.7 | 45.4 | 48.9 | 50.3 |

Key findings:

  • ReMemR1 achieves the best average accuracy on both benchmarks at both model scales, outperforming all baselines at most context lengths.
  • The advantage grows with context length: At 6400 documents, ReMemR1-7B achieves 80.8% on HotpotQA vs. 75.8% for MemAgent, and 50.3% vs. 44.7% on 2WikiMultiHopQA.
  • Strong generalization: Trained only on HotpotQA, ReMemR1 still achieves the best results on the out-of-distribution 2WikiMultiHopQA.

Distant Evidence Challenge

(Figure: accuracy as a function of evidence distance)

When key evidence pieces are placed far apart in the document, ReMemR1's callback mechanism proves especially effective. The accuracy gap between ReMemR1 and baselines widens as the evidence distance increases, demonstrating its robustness to scattered information.

Computational Efficiency

(Figure: accuracy vs. computational cost)

ReMemR1 consistently achieves higher accuracy with only modest additional computation. The callback retrieval module introduces less than 0.1% additional time and negligible memory overhead (<0.001% of total memory), making it highly practical for real-world long-context scenarios.

Ablation Studies


  • The multi-level reward design with α=0.8 consistently delivers the best accuracy, balancing outcome rewards with dense step-level guidance.
  • RL-driven callback significantly outperforms rule-based callback, confirming that learning when and what to recall through RL is essential.

Training Dynamics

ReMemR1 quickly learns the callback format requirements under RL guidance, with performance improving rapidly during the first 20 training steps.

Acknowledgements

This project is built upon the foundational work of VeRL and MemAgent. We sincerely thank the authors of these projects for their valuable contributions, which have significantly supported and inspired our work.

Citation

@article{rememr1,
  author       = {Yaorui Shi and
                  Yuxin Chen and
                  Siyuan Wang and
                  Sihang Li and
                  Hengxing Cai and
                  Qi Gu and
                  Xiang Wang and
                  An Zhang},
  title        = {Look Back to Reason Forward: Revisitable Memory for Long-Context {LLM}
                  Agents},
  journal      = {CoRR},
  volume       = {abs/2509.23040},
  year         = {2025},
  eprinttype    = {arXiv},
  eprint       = {2509.23040},
}