🍜VPhoBertTagger

Token classification using PhoBERT models for 🇻🇳Vietnamese

🏞️Environments🏞️

Get started in seconds with verified environments. Run the script below to install all dependencies:

bash ./install_dependencies.sh

📚Dataset📚

The input data format of 🍜VPhoBertTagger follows the VLSP-2016 format: four columns separated by a tab character, containing the word, POS tag, chunk tag, and named-entity tag. Each word-segmented word is placed on a separate line, and an empty line follows each sentence. For details, see the sample data in the 'datasets/samples' directory. The table below shows an example Vietnamese sentence from the dataset.

| Word     | POS | Chunk | NER   |
|----------|-----|-------|-------|
| Dương    | Np  | B-NP  | B-PER |
| là       | V   | B-VP  | O     |
| một      | M   | B-NP  | O     |
| chủ      | N   | B-NP  | O     |
| cửa hàng | N   | B-NP  | O     |
| lâu      | A   | B-AP  | O     |
| năm      | N   | B-NP  | O     |
| ở        | E   | B-PP  | O     |
| Hà Nội   | Np  | B-NP  | B-LOC |
| .        | CH  | O     | O     |

The dataset must be placed in a directory with the structure below.

├── data_dir
│   ├── train.txt
│   ├── dev.txt
│   └── test.txt
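
For reference, here is a minimal sketch of reading this format in Python. It follows the column layout described above (tab-separated word, POS, chunk, NER; blank line between sentences); the function name is illustrative and not part of this repository:

```python
from typing import List, Tuple

def read_vlsp2016(path: str) -> List[List[Tuple[str, str, str, str]]]:
    """Parse a VLSP-2016-style file into sentences of (word, pos, chunk, ner) tuples."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # a blank line closes the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk, ner = line.split("\t")
            current.append((word, pos, chunk, ner))
    if current:                          # handle a file that lacks a trailing blank line
        sentences.append(current)
    return sentences

# Usage: sentences = read_vlsp2016("data_dir/train.txt")
```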

🎓Training🎓

The commands below fine-tune PhoBERT for the token classification task. Pre-trained models are downloaded automatically from Hugging Face.

python main.py train --task vlsp2016 --run_test --data_dir ./datasets/vlsp2016 --model_name_or_path vinai/phobert-base --model_arch softmax --output_dir outputs --max_seq_length 256 --train_batch_size 32 --eval_batch_size 32 --learning_rate 3e-5 --epochs 20 --early_stop 2 --overwrite_data

or

bash ./train.sh

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • task (str, *optional): Training task selected in the list: [vlsp2016, vlsp2018_l1, vlsp2018_l2, vlsp2018_join]. Default=vlsp2016.
  • data_dir (Union[str, os.PathLike], *required): The input data directory. Should contain the .txt files (or other data files) for the task.
  • overwrite_data (bool, *optional): Whether to overwrite the split dataset. Default=False.
  • load_weights (Union[str, os.PathLike], *optional): Path to a pre-trained weights file.
  • model_name_or_path (str, *required): Pre-trained model selected in the list: [vinai/phobert-base, vinai/phobert-large, ...]. Default=vinai/phobert-base.
  • model_arch (str, *required): Token classification model architecture selected in the list: [softmax, crf, lstm_crf]. See the sketch after this list.
  • output_dir (Union[str, os.PathLike], *required): The output directory where the model predictions and checkpoints will be written.
  • max_seq_length (int, *optional): The maximum total input sequence length after subword tokenization. Longer sequences will be truncated; shorter sequences will be padded. Default=190.
  • train_batch_size (int, *optional): Total batch size for training. Default=32.
  • eval_batch_size (int, *optional): Total batch size for evaluation. Default=32.
  • learning_rate (float, *optional): The initial learning rate for Adam. Default=1e-4.
  • classifier_learning_rate (float, *optional): The initial learning rate for the classifier head. Default=5e-4.
  • epochs (float, *optional): Total number of training epochs to perform. Default=100.0.
  • weight_decay (float, *optional): Weight decay to apply, if any. Default=0.01.
  • adam_epsilon (float, *optional): Epsilon for the Adam optimizer. Default=5e-8.
  • max_grad_norm (float, *optional): Maximum gradient norm. Default=1.0.
  • early_stop (float, *optional): Number of steps without improvement before training stops early. Default=10.0.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.
  • run_test (bool, *optional): Whether to evaluate the best model on the test set after training. Default=False.
  • seed (int, *optional): Random seed for initialization. Default=42.
  • num_workers (int, *optional): How many subprocesses to use for data loading. 0 means the data will be loaded in the main process. Default=0.
  • save_step (int, *optional): The number of steps between model checkpoints. Default=10000.
  • gradient_accumulation_steps (int, *optional): Number of update steps to accumulate before performing a backward/update pass. Default=1.
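
To make the model_arch options concrete, below is a minimal sketch of the softmax variant: a PhoBERT encoder with a per-token linear classifier. This illustrates the general architecture only; the class name, dropout value, and label count are assumptions, not the repository's exact implementation. The crf and lstm_crf variants replace the softmax head with a CRF layer, optionally preceded by a BiLSTM.

```python
import torch.nn as nn
from transformers import AutoModel

class PhoBertSoftmaxTagger(nn.Module):
    """Illustrative PhoBERT encoder + per-token linear (softmax) head."""
    def __init__(self, model_name_or_path="vinai/phobert-base", num_labels=9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name_or_path)
        self.dropout = nn.Dropout(0.1)  # assumed value, for illustration
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Encode the subword sequence, then classify each position independently.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.classifier(self.dropout(hidden))  # (batch, seq_len, num_labels)
```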

📈Tensorboard📈

The command below starts TensorBoard to help you follow the fine-tuning process.

tensorboard --logdir runs --host 0.0.0.0 --port=6006

🥇Performances🥇

All experiments were performed on an RTX 3090 with 24GB VRAM and a Xeon® E5-2678 v3 CPU with 64GB RAM, both of which are available for rent on vast.ai. The pre-trained models used for comparison are available on Hugging Face.
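
The BIO-Metrics columns below score per-token BIO labels, while the NE-Metrics columns score whole named-entity spans: an entity counts as correct only if both its boundaries and its type match. As an illustration of span-level scoring (not necessarily the exact metric code used here), the seqeval package computes it like this:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-PER", "O", "O", "B-LOC", "I-LOC"]]
y_pred = [["B-PER", "O", "O", "O", "O"]]  # the LOC span is missed entirely

print(precision_score(y_true, y_pred))  # 1.0: the one predicted span is correct
print(recall_score(y_true, y_pred))     # 0.5: one of two gold spans is found
print(f1_score(y_true, y_pred))         # ~0.667 (harmonic mean)
```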

VLSP 2016

| Model | Arch | BIO Acc. | BIO Prec. | BIO Rec. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE Prec. | NE Rec. | NE F1 | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | 0.9905 | 0.9239 | 0.8776 | 0.8984 | 0.9068 | 0.9905 | 0.8938 | 0.8941 | 0.8939 | Matrix, Log |
| | CRF | 0.9903 | 0.9241 | 0.8880 | 0.9048 | 0.9087 | 0.9903 | 0.8951 | 0.8945 | 0.8948 | Matrix, Log |
| | LSTM_CRF | 0.9905 | 0.9183 | 0.8898 | 0.9027 | 0.9178 | 0.9905 | 0.8879 | 0.8992 | 0.8935 | Matrix, Log |
| PhoBert-base [2] | Softmax | 0.9950 | 0.9312 | 0.9404 | 0.9348 | 0.9570 | 0.9950 | 0.9434 | 0.9466 | 0.9450 | Matrix, Log |
| | CRF | 0.9949 | 0.9497 | 0.9248 | 0.9359 | 0.9525 | 0.9949 | 0.9516 | 0.9456 | 0.9486 | Matrix, Log |
| | LSTM_CRF | 0.9949 | 0.9535 | 0.9181 | 0.9349 | 0.9456 | 0.9949 | 0.9520 | 0.9396 | 0.9457 | Matrix, Log |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

VLSP 2018

Level 1

| Model | Arch | BIO Acc. | BIO Prec. | BIO Rec. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE Prec. | NE Rec. | NE F1 | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | 0.9828 | 0.7421 | 0.7980 | 0.7671 | 0.8510 | 0.9828 | 0.7302 | 0.8339 | 0.7786 | Matrix, Log |
| | CRF | 0.9824 | 0.7716 | 0.7619 | 0.7601 | 0.8284 | 0.9824 | 0.7542 | 0.8127 | 0.7824 | Matrix, Log |
| | LSTM_CRF | 0.9829 | 0.7533 | 0.7750 | 0.7626 | 0.8296 | 0.9829 | 0.7612 | 0.8122 | 0.7859 | Matrix, Log |
| PhoBert-base [2] | Softmax | 0.9896 | 0.7970 | 0.8404 | 0.8170 | 0.8892 | 0.9896 | 0.8421 | 0.8942 | 0.8674 | Matrix, Log |
| | CRF | 0.9903 | 0.8124 | 0.8428 | 0.8260 | 0.8834 | 0.9903 | 0.8695 | 0.8943 | 0.8817 | Matrix, Log |
| | LSTM_CRF | 0.9901 | 0.8240 | 0.8278 | 0.8241 | 0.8715 | 0.9901 | 0.8671 | 0.8773 | 0.8721 | Matrix, Log |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Level 2

| Model | Arch | BIO Acc. | BIO Prec. | BIO Rec. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE Prec. | NE Rec. | NE F1 | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Join

| Model | Arch | BIO Acc. | BIO Prec. | BIO Rec. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE Prec. | NE Rec. | NE F1 | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

References

[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT (pp. 4171-4186).

[2] Nguyen, D. Q., & Nguyen, A. T. (2020, November). PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1037-1042).

[3] The, V. B., Thi, O. T., & Le-Hong, P. (2020). Improving sequence tagging for Vietnamese text using transformer-based neural models. arXiv preprint arXiv:2006.15994.

🧠Inference🧠

The command below loads your fine-tuned model and runs inference on your text input.

python main.py predict --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.
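
If you would rather call a fine-tuned checkpoint from Python than through the CLI, the sketch below shows one plausible pattern. It assumes best_model.pt stores a whole model object whose forward signature matches the tagger sketched in the Training section; both assumptions may differ from how this repository actually saves checkpoints:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
# Assumption: the checkpoint is a pickled model object (not a bare state_dict).
model = torch.load("outputs/best_model.pt", map_location="cpu")
model.eval()

text = "Dương là một chủ cửa_hàng lâu năm ở Hà_Nội ."  # PhoBERT expects word-segmented input
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
pred_ids = logits.argmax(dim=-1).squeeze(0).tolist()
print(pred_ids)  # map ids back to BIO labels using the label list from training
```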

🌟Demo🌟

The command below loads your fine-tuned model and starts the demo page.

python main.py demo --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.

💡Acknowledgements💡

Pre-trained PhoBERT models by VinAI Research and the PyTorch implementation by Hugging Face.
