From Faithfulness to Correctness: Generative Reward Models that Think Critically

[📜 Paper] [🖥️ Code] [🤗 Hugging Face]

In this repository, we introduce the Thinking-supervised Reward Model (TRM): a sentence-level generative reward model that equips language models with critical thinking abilities. TRM enables stepwise reasoning—from document faithfulness to factual correctness—for Chinese question answering (QA) tasks with supporting documents.

Thinking-supervised Reward Model (TRM)

Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, and then applies a reasoning step to evaluate sentence-level correctness.
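The two-stage flow above can be sketched in Python. This is a toy illustration only: `judge_faithfulness` and `judge_correctness` are hypothetical stand-ins for the generative reward model's actual reasoning steps, and the substring check is a placeholder, not how TRM judges faithfulness.

```python
def judge_faithfulness(sentence: str, documents: list[str]) -> bool:
    """Stage 1 (toy stand-in): is the sentence supported by a document?"""
    return any(sentence in doc for doc in documents)

def judge_correctness(sentence: str, faithful: bool) -> bool:
    """Stage 2 (toy stand-in): reason from faithfulness to correctness.

    In TRM this is a reasoning step by the reward model itself; here we
    simply reuse the faithfulness signal for illustration.
    """
    return faithful

def score_answer(answer_sentences: list[str], documents: list[str]) -> float:
    """Fraction of answer sentences judged correct."""
    verdicts = []
    for sentence in answer_sentences:
        faithful = judge_faithfulness(sentence, documents)
        verdicts.append(judge_correctness(sentence, faithful))
    return sum(verdicts) / max(len(verdicts), 1)

docs = ["TRM is a sentence-level generative reward model."]
answer = [
    "TRM is a sentence-level generative reward model.",  # supported
    "It was released in 1999.",                          # unsupported
]
print(score_answer(answer, docs))  # 0.5
```

The point of the structure is that each answer sentence gets its own faithfulness verdict before correctness is assessed, rather than scoring the answer as one blob.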

Policy Optimization

TRM is further incorporated into policy optimization within a reinforcement learning (RL) framework, where TRM ensures correctness and an auxiliary reward model addresses usefulness.
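One simple way such a two-signal setup could feed a scalar RL reward is a weighted mix. The weighting scheme and the `alpha` parameter below are illustrative assumptions, not the repository's actual reward shaping:

```python
def combined_reward(correctness: float, usefulness: float, alpha: float = 0.7) -> float:
    """Hypothetical weighted mix: TRM guards correctness, the auxiliary
    reward model contributes usefulness. `alpha` is an assumed weight."""
    if not (0.0 <= correctness <= 1.0 and 0.0 <= usefulness <= 1.0):
        raise ValueError("rewards are expected in [0, 1]")
    return alpha * correctness + (1.0 - alpha) * usefulness

print(round(combined_reward(1.0, 0.5), 2))  # 0.85
```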

Getting Started

Please follow the steps below to set up, train, and evaluate with the Thinking-supervised Reward Model (TRM).

1. Clone the dependency repository

Our work is developed using Wechat-YATT. Please clone it into a subdirectory of this repository:

git clone https://github.com/Tencent/Wechat-YATT.git

2. Download the TRM checkpoint

Download the pre-trained TRM from Hugging Face: https://huggingface.co/QiyaoMa/TRM.

3. Prepare scripts and configuration

Follow the instructions in each of the following scripts to set dataset/model paths, cluster settings, and any required environment variables:

  • preprocess/preprocess.sh
  • policy/grpo.sh
  • train.sh
  • evaluate.sh

4. Train with TRM

To train your policy model with the provided TRM checkpoint, run sh train.sh:

  • Input: query, supporting documents
  • Output: updated policy model

5. Evaluate correctness and usefulness

To evaluate both correctness and usefulness of the policy model, run sh evaluate.sh:

  1. Convert Megatron checkpoint to Huggingface checkpoint
  2. Generate answers using the policy model
  3. Evaluate with LLM-as-a-Judge
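The final judging step boils down to aggregating per-answer verdicts into correctness and usefulness rates. The verdict format below is an assumption for illustration, not the repository's actual evaluation schema:

```python
def aggregate(verdicts: list[dict]) -> dict:
    """Aggregate hypothetical LLM-as-a-Judge verdicts into summary rates."""
    n = len(verdicts)
    return {
        "correctness": sum(v["correct"] for v in verdicts) / n,
        "usefulness": sum(v["useful"] for v in verdicts) / n,
    }

verdicts = [
    {"correct": True,  "useful": True},
    {"correct": True,  "useful": False},
    {"correct": False, "useful": True},
    {"correct": True,  "useful": True},
]
print(aggregate(verdicts))  # {'correctness': 0.75, 'usefulness': 0.75}
```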
