
ZeroNL2SQL

💭 Introduction

This repository contains the code for our VLDB 2024 paper “Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL”.

📂 Data Preparation

Train data

  • Spider: Put it under src/datasets/spider.

Test data

  • KaggleDBQA: Put it under src/datasets/kaggledbqa.
  • Dr.Spider: Put it under src/datasets/drspider.
mkdir data/
unzip src/datasets/kaggledbqa/kaggledbqa.zip -d data/
unzip src/datasets/drspider/drspider.zip -d data/
# Don't delete the original .zip file
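
A quick sanity check after extraction (the exact directory names inside data/ depend on the archives):

ls data/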

💻 Environment Preparation

Please refer to requirements.txt to install the required packages.
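
A minimal setup sketch, assuming a standard pip workflow (adjust for conda or another environment manager):

pip install -r requirements.txt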

Prepare the following folders:

cd ZeroNL2SQL
mkdir -p logs
mkdir -p experimental_outputs/train/template_generator
mkdir -p experimental_outputs/train/aligner

⚡ Quick Start

Download models

  • Template Generator: Put it under experimental_outputs/train/template_generator.
  • Aligner: Put it under experimental_outputs/train/aligner.
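
After downloading both checkpoints, the directories created above should be non-empty; a quick check (exact file names depend on the released checkpoints):

ls experimental_outputs/train/template_generator
ls experimental_outputs/train/aligner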

Text-to-SQL inference

Use the following script to run inference directly on a text-to-SQL test set. The script performs four steps: (1) generate the SQL template; (2) align (SELECT, STRUCTURE) with the user question; (3) prepare the data for LLM inference; (4) run text-to-SQL with the LLM.

CUDA_VISIBLE_DEVICES={gpu_id} bash scripts/infer_LLM_with_template.sh test_set_name your_openai_key

  • The first argument is the name of the test set, which can be selected from kaggledbqa, DB_DBcontent_equivalence, DB_schema_abbreviation, DB_schema_synonym, NLQ_keyword_synonym, NLQ_keyword_carrier, NLQ_column_synonym, NLQ_column_carrier, NLQ_column_attribute, NLQ_column_value, NLQ_value_synonym, NLQ_multitype, NLQ_others, SQL_comparison, SQL_sort_order, SQL_NonDB_number, SQL_DB_text, SQL_DB_number.
  • The second argument is your OpenAI API key, which you can obtain from the OpenAI website.
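
For example, to run inference on KaggleDBQA with GPU 0 (the key below is a placeholder; substitute your own):

CUDA_VISIBLE_DEVICES=0 bash scripts/infer_LLM_with_template.sh kaggledbqa sk-xxxxxxxxxxxxxxxx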

Note that we evaluate the text-to-SQL results with test_suite_evaluation; the evaluation results are written to eval.output.

👐 Train From Scratch

Train template generator

CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run.py configs/train_template_generator.json"

The best model will be saved at experimental_outputs/train/template_generator/BEST_MODEL/.

Train aligner

CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run_aligner.py configs/train_aligner.json"

The best model will be saved at experimental_outputs/train/aligner/checkpoint_best.pkl.

💬 Citation

If our code is helpful to you, please cite our work:

@misc{gu2023interleaving,
      title={Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation}, 
      author={Zihui Gu and Ju Fan and Nan Tang and Songyue Zhang and Yuxin Zhang and Zui Chen and Lei Cao and Guoliang Li and Sam Madden and Xiaoyong Du},
      year={2023},
      eprint={2306.08891},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
