BanglaBERT

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" accepted in Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: NAACL 2022.

Table of Contents

  • Models
  • Datasets
  • Setup
  • Training & Evaluation
  • Benchmarks
  • Acknowledgements
  • License
  • Citation

Models

The pretrained model checkpoint is available here. To use this model for the supported downstream tasks in this repository, see Training & Evaluation.

Note: This model was pretrained using a specific normalization pipeline available here. All finetuning scripts in this repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized using this pipeline before tokenizing to get the best results. A basic example is available at the model page.
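As a rough illustration of this workflow (not the official example from the model page), the sketch below loads the checkpoint, normalizes a Bangla sentence, and tokenizes it. The Hub model id "csebuetnlp/banglabert" and the normalizer package exposing a normalize() function are assumptions for illustration; adjust them to the actual release.

# Minimal usage sketch; names marked below are assumptions, not the official example.
from normalizer import normalize  # assumed: pip install git+https://github.com/csebuetnlp/normalizer
from transformers import AutoModelForPreTraining, AutoTokenizer

model_id = "csebuetnlp/banglabert"  # assumed Hugging Face Hub id of the pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForPreTraining.from_pretrained(model_id)

text = "আমি বাংলায় গান গাই।"  # any Bangla input sentence
inputs = tokenizer(normalize(text), return_tensors="pt")  # normalize *before* tokenizing
outputs = model(**inputs)

The important point is the order of operations: raw text goes through the normalization pipeline first, and only the normalized text is passed to the tokenizer.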

Datasets

We are also releasing the Bangla Natural Language Inference (NLI) and Bangla Question Answering (QA) datasets introduced in the paper.

Setup

To install the necessary requirements, use the following snippet:

$ git clone https://github.com/csebuetnlp/banglabert
$ cd banglabert/
$ conda create python==3.7.9 pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash setup.sh 
  • Use the newly created environment for running the scripts in this repository.
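As an optional sanity check (not part of the official setup), the following command only confirms that the environment's PyTorch build is importable and reports whether a CUDA device is visible:

$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"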

Training & Evaluation

To use the pretrained model for finetuning / inference on different downstream tasks, see the following sections:

  • Sequence Classification.
    • For single sequence classification such as
      • Document classification
      • Sentiment classification
      • Emotion classification etc.
    • For double sequence classification such as
      • Natural Language Inference (NLI)
      • Paraphrase detection etc.
  • Token Classification.
    • For token tagging / classification tasks such as
      • Named Entity Recognition (NER)
      • Parts of Speech Tagging (PoS) etc.
  • Question Answering.
    • For tasks such as,
      • Extractive Question Answering
      • Open-domain Question Answering
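As a rough illustration of the sequence classification setup (the task-specific scripts in this repository are the supported way to reproduce the paper's results), the sketch below finetunes the pretrained checkpoint on a toy single-sequence classification task with the Hugging Face Trainer. The model id, label count, and examples are placeholders, not the paper's configuration.

# Hypothetical finetuning sketch; the repository's own scripts are the supported path.
import torch
from normalizer import normalize  # assumed: pip install git+https://github.com/csebuetnlp/normalizer
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "csebuetnlp/banglabert"  # assumed Hub id; see Models above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy labelled data standing in for a real sentiment / document classification set.
texts = ["চমৎকার একটি সিনেমা!", "খুবই বাজে অভিজ্ঞতা।"]
labels = [1, 0]

# Normalize every example before tokenizing, matching the pretraining pipeline.
encodings = tokenizer([normalize(t) for t in texts], truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()

For double-sequence tasks such as NLI or paraphrase detection, the only change in this sketch is passing sentence pairs to the tokenizer, e.g. tokenizer(normalize(premise), normalize(hypothesis), truncation=True).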

Benchmarks

  • Zero-shot cross-lingual transfer-learning
| Model | Params | SC (macro-F1) | NLI (accuracy) | NER (micro-F1) | QA (EM/F1) | BangLUE score |
|---|---|---|---|---|---|---|
| mBERT | 180M | - | 62.22 | 39.27 | 59.01/64.18 | 50.35 |
| XLM-R (base) | 270M | - | 72.18 | 45.37 | 55.03/61.83 | 55.29 |
| XLM-R (large) | 550M | - | 78.16 | 57.74 | 71.13/77.70 | 70.74 |
  • Supervised fine-tuning
| Model | Params | SC (macro-F1) | NLI (accuracy) | NER (micro-F1) | QA (EM/F1) | BangLUE score |
|---|---|---|---|---|---|---|
| mBERT | 180M | 67.59 | 75.13 | 68.97 | 67.12/72.64 | 70.29 |
| XLM-R (base) | 270M | 69.54 | 78.46 | 73.32 | 68.09/74.27 | 72.82 |
| XLM-R (large) | 550M | 70.97 | 82.40 | 78.39 | 73.15/79.06 | 76.79 |
| sahajBERT | 18M | 71.12 | 76.92 | 70.94 | 65.48/70.69 | 71.03 |
| BanglaBERT | 110M | 72.89 | 82.80 | 77.78 | 72.63/79.34 | 77.09 |

The benchmarking datasets are as follows:

  • SC: Sentiment Classification
  • NLI: Natural Language Inference
  • NER: Named Entity Recognition
  • QA: Question Answering

Acknowledgements

We would like to thank Intelligent Machines and Google TFRC Program for providing cloud support for pretraining the models.

License

Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).


Citation

If you use any of the datasets, models or code modules, please cite the following paper:

@inproceedings{bhattacharjee-etal-2022-banglabert,
    title     = {BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla},
    author    = "Bhattacharjee, Abhik  and
      Hasan, Tahmid  and
      Mubasshir, Kazi  and
      Islam, Md. Saiful  and
      Uddin, Wasi Ahmad  and
      Iqbal, Anindya  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the North American Chapter of the Association for Computational Linguistics: NAACL 2022",
    month     = jul,
    year      = {2022},
    url       = {https://arxiv.org/abs/2101.00204},
    eprinttype = {arXiv},
    eprint    = {2101.00204}
}
