[📜 Paper] [🖥️ Code] [🤗 Hugging Face]
In this repository, we introduce the Thinking-supervised Reward Model (TRM): a sentence-level generative reward model that equips language models with critical thinking abilities. TRM enables stepwise reasoning—from document faithfulness to factual correctness—for Chinese question answering (QA) tasks with supporting documents.
Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, and then applies a reasoning step to evaluate sentence-level correctness.

TRM is further incorporated into policy optimization within a reinforcement learning (RL) framework, where TRM ensures correctness and an auxiliary reward model addresses usefulness.

Please follow the steps below to set up, train, and evaluate with the Thinking-supervised Reward Model (TRM).
Our work is built on Wechat-YATT. Please clone it into a subdirectory of this repository:

```bash
git clone https://github.com/Tencent/Wechat-YATT.git
```
Download the pre-trained TRM from Hugging Face: https://huggingface.co/QiyaoMa/TRM.
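One way to fetch the checkpoint is via the Hugging Face CLI; this is a minimal sketch assuming `huggingface_hub` is installed, and the local directory name is an arbitrary choice:

```bash
pip install -U "huggingface_hub[cli]"   # provides the huggingface-cli tool
huggingface-cli download QiyaoMa/TRM --local-dir ./checkpoints/TRM
```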
Edit the following scripts to provide dataset/model paths, cluster settings, and any required environment variables:
- `preprocess/preprocess.sh`
- `policy/grpo.sh`
- `train.sh`
- `evaluate.sh`
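In practice this usually means editing a handful of variables at the top of each script. The sketch below is illustrative only; the variable names are hypothetical placeholders, so check each script for the actual ones:

```bash
# Hypothetical placeholders; the real variable names are defined in the scripts.
export DATA_PATH=/path/to/qa_dataset       # QA pairs with supporting documents
export TRM_CKPT=./checkpoints/TRM          # TRM reward model downloaded above
export POLICY_CKPT=/path/to/policy_model   # initial policy model to optimize
export NNODES=1                            # number of nodes for distributed training
```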
To train your policy model with the provided TRM checkpoint, run `sh train.sh` (a launch sketch follows the list):
- Input: query, supporting documents
- Output: updated policy model
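A minimal launch sketch, assuming the hypothetical variables from the configuration step above; the actual variable names and any distributed-launch flags are defined inside the scripts themselves:

```bash
# Preprocess the QA data, then run GRPO training with TRM as the reward model.
# Variable names are illustrative; consult the scripts for the real ones.
sh preprocess/preprocess.sh      # prepare queries and supporting documents
DATA_PATH=/path/to/qa_dataset \
TRM_CKPT=./checkpoints/TRM \
sh train.sh                      # produces an updated Megatron policy checkpoint
```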
To evaluate both correctness and usefulness of the policy model, run `sh evaluate.sh`, which performs three steps (a pipeline sketch follows the list):
- Convert the Megatron checkpoint to a Hugging Face checkpoint
- Generate answers using the policy model
- Evaluate the answers with LLM-as-a-Judge
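For orientation, the three stages can be pictured as the pipeline below. Every command is a hypothetical stand-in for logic that lives inside `evaluate.sh`, not a real entry point in this repository:

```bash
# Hypothetical decomposition of evaluate.sh (placeholder script names).
python convert_megatron_to_hf.py --load "$MEGATRON_CKPT" --save "$HF_CKPT"             # stage 1: checkpoint conversion
python generate_answers.py --model "$HF_CKPT" --data "$EVAL_DATA" --out answers.jsonl  # stage 2: answer generation
python llm_as_judge.py --answers answers.jsonl                                         # stage 3: judge correctness and usefulness
```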