🍜VPhoBertTagger

Token classification using PhoBERT models for 🇻🇳Vietnamese

🏞️Environments🏞️

Get started in seconds with verified environments. Run the script below to install all dependencies:

bash ./install_dependencies.sh

📚Dataset📚

The input data format of 🍜VPhoBertTagger follows the VLSP-2016 format: four columns separated by a tab character, containing the word, POS tag, chunk tag, and named-entity tag. Each word-segmented word is placed on a separate line, and an empty line follows each sentence. For details, see the sample data in the 'datasets/samples' directory. The table below shows an example Vietnamese sentence from the dataset.

| Word     | POS | Chunk | NER   |
|----------|-----|-------|-------|
| Dương    | Np  | B-NP  | B-PER |
| là       | V   | B-VP  | O     |
| một      | M   | B-NP  | O     |
| chủ      | N   | B-NP  | O     |
| cửa hàng | N   | B-NP  | O     |
| lâu      | A   | B-AP  | O     |
| năm      | N   | B-NP  | O     |
| ở        | E   | B-PP  | O     |
| Hà Nội   | Np  | B-NP  | B-LOC |
| .        | CH  | O     | O     |

The dataset must be placed in a directory with the structure below.

├── data_dir
│   ├── train.txt
│   ├── dev.txt
│   └── test.txt
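
For reference, here is a minimal sketch of reading this format in Python. It follows the column layout described above (tab-separated word, POS, chunk, NER; blank line between sentences); the function name is illustrative and not part of this repository:

```python
from typing import List, Tuple

def read_vlsp2016(path: str) -> List[List[Tuple[str, str, str, str]]]:
    """Parse a VLSP-2016-style file into sentences of (word, pos, chunk, ner) tuples."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # a blank line closes the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk, ner = line.split("\t")
            current.append((word, pos, chunk, ner))
    if current:                          # handle a file that lacks a trailing blank line
        sentences.append(current)
    return sentences

# Usage: sentences = read_vlsp2016("data_dir/train.txt")
```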

🎓Training🎓

The commands below fine-tune PhoBERT for the token classification task. Pre-trained models are downloaded automatically from Hugging Face.

python main.py train --task vlsp2016 --run_test --data_dir ./datasets/vlsp2016 --model_name_or_path vinai/phobert-base --model_arch softmax --output_dir outputs --max_seq_length 256 --train_batch_size 32 --eval_batch_size 32 --learning_rate 3e-5 --epochs 20 --early_stop 2 --overwrite_data

or

bash ./train.sh

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • task (str, *optional): Training task selected in the list: [vlsp2016, vlsp2018_l1, vlsp2018_l2, vlsp2018_join]. Default=vlsp2016.
  • data_dir (Union[str, os.PathLike], *required): The input data directory. Should contain the .txt files (or other data files) for the task.
  • overwrite_data (bool, *optional): Whether to overwrite the split dataset. Default=False.
  • load_weights (Union[str, os.PathLike], *optional): Path to a pre-trained weights file.
  • model_name_or_path (str, *required): Pre-trained model selected in the list: [vinai/phobert-base, vinai/phobert-large, ...]. Default=vinai/phobert-base.
  • model_arch (str, *required): Token classification model architecture selected in the list: [softmax, crf, lstm_crf]. See the sketch after this list.
  • output_dir (Union[str, os.PathLike], *required): The output directory where the model predictions and checkpoints will be written.
  • max_seq_length (int, *optional): The maximum total input sequence length after subword tokenization. Longer sequences will be truncated; shorter sequences will be padded. Default=190.
  • train_batch_size (int, *optional): Total batch size for training. Default=32.
  • eval_batch_size (int, *optional): Total batch size for evaluation. Default=32.
  • learning_rate (float, *optional): The initial learning rate for Adam. Default=1e-4.
  • classifier_learning_rate (float, *optional): The initial learning rate for the classifier head. Default=5e-4.
  • epochs (float, *optional): Total number of training epochs to perform. Default=100.0.
  • weight_decay (float, *optional): Weight decay to apply, if any. Default=0.01.
  • adam_epsilon (float, *optional): Epsilon for the Adam optimizer. Default=5e-8.
  • max_grad_norm (float, *optional): Maximum gradient norm. Default=1.0.
  • early_stop (float, *optional): Number of steps without improvement before training stops early. Default=10.0.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.
  • run_test (bool, *optional): Whether to evaluate the best model on the test set after training. Default=False.
  • seed (int, *optional): Random seed for initialization. Default=42.
  • num_workers (int, *optional): How many subprocesses to use for data loading. 0 means the data will be loaded in the main process. Default=0.
  • save_step (int, *optional): The number of steps between model checkpoints. Default=10000.
  • gradient_accumulation_steps (int, *optional): Number of update steps to accumulate before performing a backward/update pass. Default=1.
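
To make the model_arch options concrete, below is a minimal sketch of the softmax variant: a PhoBERT encoder with a per-token linear classifier. This illustrates the general architecture only; the class name, dropout value, and label count are assumptions, not the repository's exact implementation. The crf and lstm_crf variants replace the softmax head with a CRF layer, optionally preceded by a BiLSTM.

```python
import torch.nn as nn
from transformers import AutoModel

class PhoBertSoftmaxTagger(nn.Module):
    """Illustrative PhoBERT encoder + per-token linear (softmax) head."""
    def __init__(self, model_name_or_path="vinai/phobert-base", num_labels=9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name_or_path)
        self.dropout = nn.Dropout(0.1)  # assumed value, for illustration
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Encode the subword sequence, then classify each position independently.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.classifier(self.dropout(hidden))  # (batch, seq_len, num_labels)
```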

📈Tensorboard📈

The command below starts TensorBoard to help you follow the fine-tuning process.

tensorboard --logdir runs --host 0.0.0.0 --port=6006

🥇Performances🥇

All experiments were performed on an RTX 3090 with 24GB VRAM and a Xeon® E5-2678 v3 CPU with 64GB RAM, both of which are available for rent on vast.ai. The pre-trained models used for comparison are available on Hugging Face.
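
The BIO-Metrics columns below score per-token BIO labels, while the NE-Metrics columns score whole named-entity spans: an entity counts as correct only if both its boundaries and its type match. As an illustration of span-level scoring (not necessarily the exact metric code used here), the seqeval package computes it like this:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-PER", "O", "O", "B-LOC", "I-LOC"]]
y_pred = [["B-PER", "O", "O", "O", "O"]]  # the LOC span is missed entirely

print(precision_score(y_true, y_pred))  # 1.0: the one predicted span is correct
print(recall_score(y_true, y_pred))     # 0.5: one of two gold spans is found
print(f1_score(y_true, y_pred))         # ~0.667 (harmonic mean)
```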

VLSP 2016

| Model | Arch | BIO Acc. | BIO Prec. | BIO Rec. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE Prec. | NE Rec. | NE F1 | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | 0.9905 | 0.9239 | 0.8776 | 0.8984 | 0.9068 | 0.9905 | 0.8938 | 0.8941 | 0.8939 | Matrix, Log |
| | CRF | 0.9903 | 0.9241 | 0.8880 | 0.9048 | 0.9087 | 0.9903 | 0.8951 | 0.8945 | 0.8948 | Matrix, Log |
| | LSTM_CRF | 0.9905 | 0.9183 | 0.8898 | 0.9027 | 0.9178 | 0.9905 | 0.8879 | 0.8992 | 0.8935 | Matrix, Log |
| PhoBert-base [2] | Softmax | 0.9950 | 0.9312 | 0.9404 | 0.9348 | 0.9570 | 0.9950 | 0.9434 | 0.9466 | 0.9450 | Matrix, Log |
| | CRF | 0.9949 | 0.9497 | 0.9248 | 0.9359 | 0.9525 | 0.9949 | 0.9516 | 0.9456 | 0.9486 | Matrix, Log |
| | LSTM_CRF | 0.9949 | 0.9535 | 0.9181 | 0.9349 | 0.9456 | 0.9949 | 0.9520 | 0.9396 | 0.9457 | Matrix, Log |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

VLSP 2018

Level 1

| Model | Arch | BIO Acc. | BIO Prec. | BIO Rec. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE Prec. | NE Rec. | NE F1 | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | 0.9828 | 0.7421 | 0.7980 | 0.7671 | 0.8510 | 0.9828 | 0.7302 | 0.8339 | 0.7786 | Matrix, Log |
| | CRF | 0.9824 | 0.7716 | 0.7619 | 0.7601 | 0.8284 | 0.9824 | 0.7542 | 0.8127 | 0.7824 | Matrix, Log |
| | LSTM_CRF | 0.9829 | 0.7533 | 0.7750 | 0.7626 | 0.8296 | 0.9829 | 0.7612 | 0.8122 | 0.7859 | Matrix, Log |
| PhoBert-base [2] | Softmax | 0.9896 | 0.7970 | 0.8404 | 0.8170 | 0.8892 | 0.9896 | 0.8421 | 0.8942 | 0.8674 | Matrix, Log |
| | CRF | 0.9903 | 0.8124 | 0.8428 | 0.8260 | 0.8834 | 0.9903 | 0.8695 | 0.8943 | 0.8817 | Matrix, Log |
| | LSTM_CRF | 0.9901 | 0.8240 | 0.8278 | 0.8241 | 0.8715 | 0.9901 | 0.8671 | 0.8773 | 0.8721 | Matrix, Log |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Level 2

| Model | Arch | BIO Acc. | BIO Prec. | BIO Rec. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE Prec. | NE Rec. | NE F1 | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Join

| Model | Arch | BIO Acc. | BIO Prec. | BIO Rec. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE Prec. | NE Rec. | NE F1 | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

References

[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT (pp. 4171-4186).

[2] Nguyen, D. Q., & Nguyen, A. T. (2020, November). PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1037-1042).

[3] The, V. B., Thi, O. T., & Le-Hong, P. (2020). Improving sequence tagging for Vietnamese text using transformer-based neural models. arXiv preprint arXiv:2006.15994.

🧠Inference🧠

The command below loads your fine-tuned model and runs inference on your text input.

python main.py predict --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.
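
If you would rather call a fine-tuned checkpoint from Python than through the CLI, the sketch below shows one plausible pattern. It assumes best_model.pt stores a whole model object whose forward signature matches the tagger sketched in the Training section; both assumptions may differ from how this repository actually saves checkpoints:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
# Assumption: the checkpoint is a pickled model object (not a bare state_dict).
model = torch.load("outputs/best_model.pt", map_location="cpu")
model.eval()

text = "Dương là một chủ cửa_hàng lâu năm ở Hà_Nội ."  # PhoBERT expects word-segmented input
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
pred_ids = logits.argmax(dim=-1).squeeze(0).tolist()
print(pred_ids)  # map ids back to BIO labels using the label list from training
```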

🌟Demo🌟

The command below loads your fine-tuned model and starts the demo page.

python main.py demo --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.

💡Acknowledgements💡

Pre-trained PhoBERT models by VinAI Research and the PyTorch implementation by Hugging Face.
