shashikg/transformer-punct-and-capit

BERT Based Model for Punctuation and Capitalization Restoration

Features:

  • Uses the Hugging Face Transformers library for the base transformer architecture.
  • Uses PyTorch Lightning for training and checkpointing.
  • Config-based model description for easy experimentation and research.
  • Can be exported as a quantized PyTorch model for faster inference on CPU.
  • Includes helper functions for data preparation, text normalization, and offline sentence augmentation specific to punctuation and capitalization restoration.
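
To make the data preparation step concrete, here is a minimal sketch of how a punctuated sentence can be turned into a training example for this kind of task: the input is the lowercased, unpunctuated word sequence, and each word gets a punctuation label and a capitalization label. The label names (`PERIOD`, `COMMA`, `QUESTION`, `O`, `U`) and the function `make_example` are hypothetical and chosen for illustration; they are not taken from this repository's helpers or configs.

```python
# Hypothetical label set for illustration; the repository's configs may differ.
PUNCT_LABELS = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}


def make_example(sentence):
    """Split a punctuated sentence into (words, punct_labels, capit_labels)."""
    words, punct, capit = [], [], []
    for token in sentence.split():
        # A trailing mark becomes the word's punctuation label; "O" means none.
        mark = token[-1] if token[-1] in PUNCT_LABELS else ""
        core = token[:-1] if mark else token
        if not core:
            continue  # the token was punctuation only
        words.append(core.lower())
        punct.append(PUNCT_LABELS.get(mark, "O"))
        # "U" marks a word whose first letter was uppercase in the source text.
        capit.append("U" if core[0].isupper() else "O")
    return words, punct, capit


words, punct, capit = make_example("Hello, how are you?")
print(words)  # ['hello', 'how', 'are', 'you']
print(punct)  # ['COMMA', 'O', 'O', 'QUESTION']
print(capit)  # ['U', 'O', 'O', 'O']
```

A model trained on such pairs predicts both label sequences for raw lowercase input, and the restored text is rebuilt by applying the predicted labels word by word.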

Quick guide:

# Install requirements:
pip install -r requirements.txt

# Download the raw English sentence corpus from Tatoeba
bash download_tatoeba_en_sent.sh

# Preprocess raw text data. Check config file for more details
python preprocess_raw_text_data.py --config="example_configs/preprocess_config_en.yaml"

# Merge multiple data files into one, then apply sentence augmentation and tokenization. Check config file for more details
python merge_and_tokenize_datasets.py --config="example_configs/model_config_en.yaml"

# Train the punctuation and capitalization model. Check config file for more details
python train_punct_and_capit_model.py --config="example_configs/model_config_en.yaml"

For inference:

from transformer_punct_and_capit.models import TransformerPunctAndCapitModel

model_path = "experiments/model.pcm"  # path to the .pcm checkpoint
model = TransformerPunctAndCapitModel.restore_model(model_path, device='cuda')

model.predict("how are you") # Single example
# Output: ["How are you?"]

model.predict_batch(["how are you"], batch_size=64, show_pbar=True) # Batch example
# Output: ["How are you?"]
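
When restoring a whole file, it can help to skip blank lines and send the rest through `predict_batch` in a single call. The helper below is a sketch, not part of this repository: `restore_lines` is a hypothetical name, and it assumes that `predict_batch` accepts a `batch_size` keyword (as in the call above) and returns one output string per input, in the same order.

```python
def restore_lines(lines, predict_batch, batch_size=64):
    """Restore punctuation and capitalization line by line, keeping blank lines."""
    # Remember the positions of non-empty lines so blank lines pass through untouched.
    kept = [(i, line.strip()) for i, line in enumerate(lines) if line.strip()]
    restored = list(lines)
    if kept:
        # Assumption: one output per input, returned in input order.
        outputs = predict_batch([text for _, text in kept], batch_size=batch_size)
        for (i, _), out in zip(kept, outputs):
            restored[i] = out
    return restored
```

With the model loaded as above, this would be called as `restore_lines(lines, model.predict_batch)`.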
