[📜 Paper] [🖥️ Code] [🤗 Hugging Face]
In this repository, we introduce the Thinking-supervised Reward Model (TRM): a sentence-level generative reward model that equips language models with critical thinking abilities. TRM enables stepwise reasoning—from document faithfulness to factual correctness—for Chinese question answering (QA) tasks with supporting documents.
Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, and then applies a reasoning step to evaluate sentence-level correctness.

TRM is further incorporated into policy optimization within a reinforcement learning (RL) framework, where TRM ensures correctness and an auxiliary reward model addresses usefulness.

Please follow the steps below to set up, train, and evaluate with the Thinking-supervised Reward Model (TRM).
Our work is built on Wechat-YATT. Please clone it into a subdirectory of this repository:

```bash
git clone https://github.com/Tencent/Wechat-YATT.git
```
Download the pre-trained TRM from Hugging Face: https://huggingface.co/QiyaoMa/TRM.
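One way to fetch the checkpoint is via the Hugging Face CLI; this is a minimal sketch assuming `huggingface_hub` is installed, and the local directory name is an arbitrary choice:

```bash
pip install -U "huggingface_hub[cli]"   # provides the huggingface-cli tool
huggingface-cli download QiyaoMa/TRM --local-dir ./checkpoints/TRM
```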
Edit the following scripts to provide dataset/model paths, cluster settings, and any required environment variables:
- `preprocess/preprocess.sh`
- `policy/grpo.sh`
- `train.sh`
- `evaluate.sh`
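In practice this usually means editing a handful of variables at the top of each script. The sketch below is illustrative only; the variable names are hypothetical placeholders, so check each script for the actual ones:

```bash
# Hypothetical placeholders; the real variable names are defined in the scripts.
export DATA_PATH=/path/to/qa_dataset       # QA pairs with supporting documents
export TRM_CKPT=./checkpoints/TRM          # TRM reward model downloaded above
export POLICY_CKPT=/path/to/policy_model   # initial policy model to optimize
export NNODES=1                            # number of nodes for distributed training
```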
To train your policy model with the provided TRM checkpoint, run `sh train.sh` (a launch sketch follows the list):
- Input: query, supporting documents
- Output: updated policy model
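A minimal launch sketch, assuming the hypothetical variables from the configuration step above; the actual variable names and any distributed-launch flags are defined inside the scripts themselves:

```bash
# Preprocess the QA data, then run GRPO training with TRM as the reward model.
# Variable names are illustrative; consult the scripts for the real ones.
sh preprocess/preprocess.sh      # prepare queries and supporting documents
DATA_PATH=/path/to/qa_dataset \
TRM_CKPT=./checkpoints/TRM \
sh train.sh                      # produces an updated Megatron policy checkpoint
```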
To evaluate both correctness and usefulness of the policy model, run `sh evaluate.sh`, which performs three steps (a pipeline sketch follows the list):
- Convert the Megatron checkpoint to a Hugging Face checkpoint
- Generate answers using the policy model
- Evaluate the answers with LLM-as-a-Judge
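For orientation, the three stages can be pictured as the pipeline below. Every command is a hypothetical stand-in for logic that lives inside `evaluate.sh`, not a real entry point in this repository:

```bash
# Hypothetical decomposition of evaluate.sh (placeholder script names).
python convert_megatron_to_hf.py --load "$MEGATRON_CKPT" --save "$HF_CKPT"             # stage 1: checkpoint conversion
python generate_answers.py --model "$HF_CKPT" --data "$EVAL_DATA" --out answers.jsonl  # stage 2: answer generation
python llm_as_judge.py --answers answers.jsonl                                         # stage 3: judge correctness and usefulness
```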