Hi everyone! 👋

I’m excited to share my recent project: GPT2-Nepali, a GPT-2 model pretrained from scratch for the Nepali language. This project builds upon the GPT-2 model training code detailed in Build a Large Language Model (From Scratch), adapting it specifically for the Nepali language.

Project Highlights:
🔗 Chat Interface: GPT2-Nepali Chat Interface on Hugging Face
📦 Pre-Trained Model: GPT2-Nepali on Hugging Face
💻 Training Code: GitHub Repository
📊 Dataset: 12GB of Nepali text derived from the NepBERTa project.


Modifications from Original Code

1️⃣ Tokenizer:

  • Initially, I experimented with the pretrained GPT-2 tokenizer. However, it generated more tokens than there are Nepali characters in a sentence. This is likely due to the Unicode representation of Devanagari text, where a single visible character is often composed of multiple code points (a base character plus combining diacritics) that the byte-level tokenizer splits further; see the short illustration after this list.
  • To address this, I trained a new BPE tokenizer tailored for the Nepali language. This tokenizer was based on an earlier version of my NepaliBPE tokenizer.
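To make the character-vs-code-point point concrete, here is a small illustration (an assumption-laden sketch using Python 3 and tiktoken's GPT-2 encoding; exact token counts may vary):

```python
# Why GPT-2's byte-level BPE inflates counts for Devanagari: one visible
# syllable is several Unicode code points, and each code point is 3 UTF-8
# bytes, which the GPT-2 vocabulary then splits into multiple tokens.
import tiktoken

word = "राम"                                   # two visible syllables
print(len(word))                               # 3 code points: 'र', 'ा', 'म'
print(len(word.encode("utf-8")))               # 9 UTF-8 bytes
print(len(tiktoken.get_encoding("gpt2").encode(word)))  # typically more tokens than visible characters
```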

2️⃣ Dataloader:

  • I pre-tokenized the dataset and made minor modifications to the dataloader so it works with the pre-tokenized dataset; a minimal sketch follows below.
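The following is a minimal sketch under stated assumptions (not the project's exact code): a PyTorch dataset over a pre-tokenized corpus stored as a flat Python list of token IDs, yielding (input, target) windows shifted by one position, mirroring the dataloader from the book.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PreTokenizedDataset(Dataset):
    def __init__(self, token_ids, max_length=1024, stride=1024):
        # Slice the flat ID list into fixed-length input/target windows.
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# "train_ids.pt" is a hypothetical file holding the pre-tokenized ID list.
token_ids = torch.load("train_ids.pt")
loader = DataLoader(PreTokenizedDataset(token_ids), batch_size=8, shuffle=True, drop_last=True)
```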

A huge thank you to @rasbt for the inspiration and for writing such an incredible resource—easily the best book on LLMs I’ve ever read!




Awesome project! I don't speak Nepali, unfortunately, but I find that super interesting as a kind of case study for adapting LLMs to new languages, codes, structures, etc.

May I ask what tool you used for training the tokenizer?

PS: I have some code for implementing and training a BPE tokenizer from scratch (one of the outtakes that didn't fit into the book / was way too long for chapter 2). I have ample notes for this but totally forgot to upload it as part of the bonus materials, thanks for reminding me. I'll probably share it here in a few days. It's not meant for efficiency but more for educational purposes.
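For readers curious about the core idea, here is a minimal educational sketch (not the code referred to above, just an illustration): BPE training repeatedly merges the most frequent adjacent symbol pair in the corpus.

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Start with each word as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = Counter(merged)
    return merges

print(train_bpe(["राम", "रामले", "भात", "खायो"], num_merges=5))
```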

@Aananda-giri

Thank you so much, @rasbt ! 🙌

I just checked out your implementation, and it’s amazing! The fact that it’s only 5x faster than the Hugging Face implementation is mind-blowing. I’m looking forward to diving deeper into the notebook and understanding the internals.

Your work continues to inspire me, and I’m sure this resource will be incredibly valuable for the entire community. Thanks again for sharing and for tagging me—it’s much appreciated!

@rasbt

The fact that it’s only 5x faster than Hugging Face

"only" 5x faster 😆

@d-kleine

Do you have any idea why your implementation is way faster than the HF one?

@rasbt

No idea. My guess is that mine is simpler, while HF has more bells and whistles because it has to support so many other tokenizers, settings, and so forth. Just my assumption here, though.

@d-kleine

I just saw that you used HF's GPT2Tokenizer for the comparison; there is also a faster version called GPT2TokenizerFast (see here). I just tested it, and your implementation is still roughly 2x faster than GPT2TokenizerFast, at least on my hardware.
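For reference, a rough way to reproduce this kind of timing comparison (illustrative only; the exact speedup depends on hardware and input text):

```python
import time
from transformers import GPT2Tokenizer, GPT2TokenizerFast

texts = ["Hello world, this is a short benchmark sentence."] * 10_000

slow = GPT2Tokenizer.from_pretrained("gpt2")
fast = GPT2TokenizerFast.from_pretrained("gpt2")

for name, tok in [("GPT2Tokenizer", slow), ("GPT2TokenizerFast", fast)]:
    start = time.perf_counter()
    for t in texts:
        tok.encode(t)
    print(name, round(time.perf_counter() - start, 2), "s")
```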


Hi @Aananda-giri, this is fantastic work!

I'm particularly interested in the tokenizer comparison. Could you share any insights on how the performance of your new BPE tokenizer compares to the original GPT-2 pretrained tokenizer? Did the BPE tokenizer lead to more accurate text generation or better understanding of Nepali nuances?

@Aananda-giri


Thank you so much for the kind words and your interest in the project! 😊

Here are some insights into the tokenizer comparison:


1️⃣ Did the BPE tokenizer lead to more accurate text generation?

Yes, the BPE tokenizer significantly improved tokenization efficiency for the Nepali language by generating fewer and more meaningful tokens than the GPT-2 tokenizer; a small reproduction sketch follows the example below.

Example:

  • Input text: "राम ले भात खायो ।"
  • Tokens generated by GPT-2 tokenizer (tiktoken):
    ['�', '�', 'ा', '�', '�', ' �', '�', '�', '�', ' �', '�', 'ा', '�', '�', ' �', '�', 'ा', '�', '�', '�', '�', ' ', '�', '�']
    (24 tokens, with many fragmented and unusable outputs)
  • Tokens generated by BPE tokenizer:
    ['राम</w>', 'ले</w>', 'भात</w>', 'खायो</w>', '।</w>']
    (5 tokens, clean and intuitive for Nepali text)
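The comparison above can be reproduced with something like the following minimal sketch (the file name "NepaliBPE.json" is an illustrative assumption, not the project's exact artifact):

```python
import tiktoken
from tokenizers import Tokenizer

text = "राम ले भात खायो ।"

# GPT-2 byte-level BPE via tiktoken: many fragmented byte-level pieces.
gpt2 = tiktoken.get_encoding("gpt2")
gpt2_ids = gpt2.encode(text)
print(len(gpt2_ids), [gpt2.decode([i]) for i in gpt2_ids])

# Nepali BPE tokenizer saved in the Hugging Face `tokenizers` JSON format:
# a handful of whole-word tokens such as 'राम</w>'.
nepali = Tokenizer.from_file("NepaliBPE.json")
enc = nepali.encode(text)
print(len(enc.tokens), enc.tokens)
```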

2️⃣ Performance of BPE tokenizer vs. GPT-2 pretrained tokenizer

  • Training: The BPE tokenizer was trained with the Hugging Face Tokenizers library; a minimal training sketch follows below.
  • Speed: While the Hugging Face tokenizer offers flexibility and ease of use, it is approximately 10x slower than tiktoken (as noted in this discussion by @rasbt).
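Here is a minimal training sketch with the Hugging Face Tokenizers library (the corpus file name and 50k vocabulary size are illustrative assumptions, not the project's exact settings; the `end_of_word_suffix` option produces tokens like 'राम</w>' as shown above):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,                          # illustrative vocab size
    special_tokens=["<unk>", "<|endoftext|>"],
    end_of_word_suffix="</w>",                  # yields word-final tokens like 'खायो</w>'
)
tokenizer.train(files=["nepali_corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("NepaliBPE.json")
```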

3️⃣ More accurate token splitting and generation?

Yes, the BPE tokenizer demonstrated more accurate handling of Nepali morphology and word structures.

Example:

  • Input text: "राम घरको छानामाथि बस्यो"
  • Tokens generated by BPE tokenizer:
    ['राम</w>', 'घरको</w>', 'छाना', 'माथि</w>', 'बस्यो</w>']

Here, छानामाथि was correctly split into छाना and माथि</w>, with माथि recognized as a suffix. This aligns well with the structure of the Nepali language, making the outputs more meaningful.


Improvements and Future Directions

  • Handling prefixes/suffixes: While the tokenizer correctly split some suffixes, not all were separated consistently. Incorporating pre-tokenization techniques (e.g., a regex to separate prefixes/suffixes, as sketched after this list) could further improve accuracy.
  • Rare word handling: Rare words are currently split into individual characters. Increasing the vocabulary size might reduce this issue and improve tokenization for less common words.
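A hedged illustration of the regex idea (not the project's actual pipeline): a Split pre-tokenizer from the Hugging Face Tokenizers library that peels off the two suffixes appearing in the examples above (माथि, को) before BPE is applied. A real implementation would need a fuller, linguistically informed suffix list.

```python
from tokenizers import Regex, pre_tokenizers

pre = pre_tokenizers.Sequence([
    pre_tokenizers.Whitespace(),
    # Isolate a word-final माथि or को as its own piece.
    pre_tokenizers.Split(Regex("(माथि|को)$"), behavior="isolated"),
])

print(pre.pre_tokenize_str("राम घरको छानामाथि बस्यो"))
# e.g. [('राम', ...), ('घर', ...), ('को', ...), ('छाना', ...), ('माथि', ...), ('बस्यो', ...)]
```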

I hope this provides a clear overview of the tokenizer comparison and the improvements achieved! Let me know if you’d like further details.


@ark-sandbox

@Aananda-giri I am trying to gauge how much compute is required to train tokenizers from scratch. It would be helpful to know what percentage of the data from nepali_llm_datasets you used and what the peak memory usage was, as I am trying something similar.

@Aananda-giri

Unfortunately, I do not have the exact details of the peak memory usage. However, I trained the tokenizer on Google Colab using 12GB of text data. The environment had 12.7GB of RAM and 107GB of storage, which was sufficient for the task.

Later, I trained an updated version of the tokenizer (available here) with approximately 30GB of text data, also on Google Colab.
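For future runs, the peak-memory question above could be answered directly by recording the process's peak RSS around the training call (a small sketch, assuming a Linux/Colab environment where ru_maxrss is reported in kilobytes):

```python
import resource

# ... run the tokenizer training here ...

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.1f} MB")
```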

BTW, what kind of tokenizer are you building (size of your dataset, vocab size)?

@ark-sandbox

Just with a Colab notebook, awesome!

I was trying to replicate a Gemma-like tokenizer with a 256k vocab size, using a 246GB Wikipedia dataset focused on Indic languages.
https://huggingface.co/datasets/ai4bharat/wiki-translate
Here is the gist of the code I used,
https://gist.github.com/karavindhan/63930554f5481d34efdd7f2c5ba2ce56.

I will definitely check out minBPE and @rasbt's implementation, or try reducing both the vocab size and the dataset size.

Congratulations @Aananda-giri,

I'm also working on a local language, Chadian Arabic (shu), which differs slightly from Standard Arabic but is written in the Latin alphabet. I'm currently building up a corpus, and once it's finished, I plan to create a BPE tokenizer.
After that, I plan to create a text-to-text model.
