This repository contains the code and data for the paper "Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards".

THUDM/CaRR


Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards


Multi-Turn RL Training

🔥 News

  • [2026/01/11] Our SFT trajectories and RL QA pairs with rubrics have been fully open-sourced on Hugging Face as the dataset CaRR-DeepDive.
  • [2026/01/11] Released the CaRR framework, implemented as a remote reward model server; it is fully available in ./deepsearch_rm_with_rubrics.
  • Model and training code are currently being organized and will be released soon!
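As a rough sketch of how a remote reward model server like the one in ./deepsearch_rm_with_rubrics might be queried: the endpoint schema, JSON field names, and `"reward"` key below are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical client for a remote rubric-reward server. The payload schema
# ("question", "trajectory", "rubrics") and the "reward" response field are
# assumptions for illustration only.
import json
import urllib.request


def build_reward_request(question, trajectory, rubrics):
    """Assemble the JSON payload sent to the reward server (assumed schema)."""
    return {
        "question": question,      # the multi-hop question
        "trajectory": trajectory,  # agent reasoning steps with citations
        "rubrics": rubrics,        # atomic, verifiable rubrics
    }


def score_trajectory(server_url, question, trajectory, rubrics):
    """POST one trajectory to the server and return its scalar reward."""
    data = json.dumps(build_reward_request(question, trajectory, rubrics))
    req = urllib.request.Request(
        server_url,
        data=data.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["reward"]
```

Keeping the reward model behind an HTTP boundary lets the RL trainer and the rubric judge scale independently, which is a common pattern for remote reward servers.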

🚀 Overview

Existing Reinforcement Learning (RL) approaches for deep search agents primarily rely on binary outcome rewards (i.e., whether the final answer is correct). However, pure outcome rewards fail to capture the comprehensiveness and factuality of agents’ reasoning process, often leading to undesirable behaviors such as:

  • Shortcut exploitation: Agents may find the answer using only partial information, ignoring complex constraints.

  • Hallucinations: Agents may arrive at the correct answer via fortuitous hallucinations.

Optimizing toward these flawed trajectories yields agents with diminished robustness and suboptimal performance.

To address these issues, we propose Citation-aware Rubric Rewards (CaRR) and Citation-aware Group Relative Policy Optimization (C-GRPO) to encourage deep search agents to conduct comprehensive, evidence-grounded reasoning.


✨ Key Features

1. Citation-Aware Rubric Rewards (CaRR)


CaRR is a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. It decomposes complex, multi-hop questions into atomic, verifiable rubrics. A trajectory satisfies a rubric only if:

  • Entity Identification: It explicitly identifies all hidden entities involved.

  • Citation Grounding: The statement is fully supported by the cited web contents.

  • Evidence Connectivity: The satisfied rubrics form an evidence chain that connects to the final predicted answer.
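The three conditions above can be sketched as toy checks over a trajectory. The data layout below (dicts with `entities`, `citations`, `supports`) and the string-matching logic are illustrative assumptions; the actual verification lives in the reward model server in ./deepsearch_rm_with_rubrics.

```python
# Toy sketch of rubric satisfaction and evidence-chain connectivity.
# All field names and the exact-match logic are illustrative assumptions.

def rubric_satisfied(rubric, trajectory):
    """Return True iff the trajectory satisfies one atomic rubric."""
    # Entity identification: every hidden entity in the rubric must be
    # explicitly named somewhere in the trajectory's reasoning text.
    text = " ".join(step["text"] for step in trajectory["steps"])
    if not all(entity in text for entity in rubric["entities"]):
        return False
    # Citation grounding: the rubric's statement must be supported by at
    # least one cited web page (modeled here as a substring check).
    citations = [c for step in trajectory["steps"] for c in step["citations"]]
    return any(rubric["statement"] in c["supports"] for c in citations)


def evidence_chain_connected(satisfied_rubrics, answer):
    """Evidence connectivity: satisfied rubrics must chain to the answer.

    Modeled as reachability: starting from the predicted answer, every
    satisfied rubric must be linked into the chain through shared entities.
    """
    frontier = {answer}
    remaining = list(satisfied_rubrics)
    while True:
        progressed = False
        for r in remaining[:]:  # iterate over a copy while removing
            if frontier & set(r["entities"]):
                frontier |= set(r["entities"])
                remaining.remove(r)
                progressed = True
        if not progressed:
            return not remaining  # connected iff every rubric was reached
```

In this sketch a rubric that is grounded but disconnected from the answer chain contributes nothing, mirroring the connectivity requirement above.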

2. Citation-aware Group Relative Policy Optimization (C-GRPO)


C-GRPO extends Group Relative Policy Optimization (GRPO) by assigning an additional weighted citation-aware rubric reward to trajectories that find the correct final answer. This encourages the model to improve accuracy and reasoning quality simultaneously, thereby promoting more robust policy learning.
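Under illustrative assumptions, the reward combination can be sketched as follows: correct trajectories earn the binary outcome reward plus a weighted rubric reward (here the fraction of satisfied rubrics, with a hypothetical weight `alpha` that is not the paper's notation), and advantages are then computed group-relatively as in GRPO.

```python
# Sketch of C-GRPO's reward shaping per sampled group. The weight `alpha`
# and the normalization details are illustrative assumptions.
from statistics import mean, pstdev


def c_grpo_advantages(outcomes, rubric_scores, alpha=0.5, eps=1e-6):
    """outcomes: 0/1 answer correctness per trajectory in one group.
    rubric_scores: fraction of rubrics satisfied, each in [0, 1].
    Returns group-normalized advantages."""
    # The rubric reward is added only when the final answer is correct,
    # so wrong answers cannot be rescued by partial evidence alone.
    rewards = [o + alpha * r * o for o, r in zip(outcomes, rubric_scores)]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Gating the rubric bonus on correctness means that, among correct trajectories, those with stronger evidence chains receive strictly larger advantages, which is the ranking pressure the method relies on.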


📊 Experimental Results

Our RL experiments use Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507 as backbone models, and use DeepDive as the training data.

Evaluation results on four challenging deep search benchmarks show that C-GRPO consistently outperforms standard outcome-based GRPO and demonstrates superior test-time scaling, effectively using longer context budgets to improve performance.


C-GRPO agents also generalize well to open-ended deep research tasks.



📖 Citation

If you find our work useful, please consider citing:

@misc{lu2025deepdiveadvancingdeepsearch,
      title={Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards},
      author={Jiajie Zhang and Xin Lv and Ling Feng and Lei Hou and Juanzi Li},
      year={2025},
      eprint={2601.06021},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.06021},
}
