GitHub - open-sciencelab/GraphGen: GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

📚 Table of Contents

📝 What is GraphGen?
🚀 Quick Start
📌 Latest Updates
🏗️ System Architecture
🍀 Acknowledgements
📚 Citation
📜 License

📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. See our paper and best-practice guide for details.

It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. GraphGen also uses multi-hop neighborhood sampling to capture complex relational information and style-controlled generation to diversify the resulting QA data.
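To make two of these ideas concrete, here is a minimal sketch of the textbook binned ECE and of a breadth-first k-hop neighborhood lookup. Both function names, the bin count, and the adjacency-dict graph format are assumptions for illustration; GraphGen's own implementation and sampling strategy may differ, so refer to the paper for the actual method.

```python
from collections import deque


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    `confidences` are floats in [0, 1]; `correct` are booleans of the same length.
    This is the textbook definition, not necessarily GraphGen's exact variant.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def k_hop_neighborhood(adjacency, start, k):
    """Return the set of nodes reachable from `start` in at most `k` hops (BFS).

    `adjacency` is a hypothetical dict mapping each node to its neighbors.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

A high ECE means the model's stated confidence diverges from how often it is actually right, which is one way to flag knowledge worth targeting. The k-hop lookup shows how a sampled subgraph around an entity can carry multi-step relations into a single QA prompt.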

🚀 Quick Start

Try GraphGen through the web demo or the backup web entrance.

If you have questions, check the FAQ, open a new issue, or join our WeChat group and ask.

Gradio Demo

```
python webui/app.py
```

Run from PyPI

Install GraphGen
```
pip install graphg
```

Run in CLI

```
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
```

Run from Source

Install dependencies
```
pip install -r requirements.txt
```

Configure the environment

Create an .env file in the root directory
```
cp .env.example .env
```

Set the following environment variables:

```
# Synthesizer is the model used to construct KG and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model used to train with the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
```
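Before launching a run, it can help to confirm that all six variables are actually set. The following is a minimal pre-flight sketch using only the standard library; the helper names `parse_env_file` and `missing_vars` are hypothetical, and the script does not reflect how GraphGen itself loads its environment.

```python
import os

# The six variables listed above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(env):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    env = dict(os.environ)
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            # Values already exported in the shell take precedence over .env.
            env = {**parse_env_file(handle.read()), **env}
    missing = missing_vars(env)
    print("Missing variables:", ", ".join(missing) if missing else "none")
```

Run it from the repository root so that the relative `.env` path resolves.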

(Optional) To change the default generation configuration, edit configs/graphgen_config.yaml.

```
# configs/graphgen_config.yaml
# Example configuration
data_type: "raw"
input_file: "resources/examples/raw_demo.jsonl"
# more configurations...
```
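Since `input_file` points at a JSONL file, a quick sanity check on your own input can catch malformed lines before a run. The sketch below only verifies that each non-empty line parses as JSON; it makes no assumption about the fields GraphGen expects, and `check_jsonl` is a hypothetical helper, not part of the project.

```python
import json


def check_jsonl(path):
    """Return (record_count, errors) for a JSONL file.

    `errors` is a list of (line_number, message) for lines that are not valid JSON.
    Blank lines are ignored.
    """
    count = 0
    errors = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
                count += 1
            except json.JSONDecodeError as exc:
                errors.append((line_number, exc.msg))
    return count, errors
```

For example, `check_jsonl("resources/examples/raw_demo.jsonl")` returns the number of parseable records together with any offending line numbers.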

Run the generation script
```
bash scripts/generate.sh
```
Get the generated data
```
ls cache/data/graphgen
```
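The same check can be done from Python, for instance in a script that post-processes the results. This is a minimal sketch that lists whatever files exist under the output directory with their sizes; it assumes nothing about the file names or formats GraphGen writes, and `list_generated_files` is a hypothetical helper.

```python
from pathlib import Path


def list_generated_files(output_dir="cache/data/graphgen"):
    """Return sorted (relative_path, size_in_bytes) pairs for files under output_dir.

    Returns an empty list if the directory does not exist yet.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    for name, size in list_generated_files():
        print(f"{size:>10}  {name}")
```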

Run with Docker

Build the Docker image
```
docker build -t graphgen .
```
Run the Docker container
```
docker run -p 7860:7860 graphgen
```

📌 Latest Updates

2025.04.21: We have released the initial version of GraphGen.

🏗️ System Architecture

See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

🍀 Acknowledgements

SiliconCloud: Abundant LLM API, some models are free
LightRAG: Simple and efficient graph retrieval solution
ROGRAG: A Robustly Optimized GraphRAG Framework

📚 Citation

If you find this repository useful, please consider citing our work:

```
@misc{chen2025graphgenenhancingsupervisedfinetuning,
      title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
      author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
      year={2025},
      eprint={2505.20416},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20416},
}
```

📜 License

This project is licensed under the Apache License 2.0.
