Hi everyone! 👋

I’m excited to share my recent project: GPT2-Nepali, a GPT-2 model pretrained from scratch for the Nepali language. This project builds upon the GPT-2 model training code detailed in Build a Large Language Model (From Scratch), adapting it specifically for the Nepali language.

Project Highlights:
🔗 Chat Interface: GPT2-Nepali Chat Interface on Hugging Face
📦 Pre-Trained Model: GPT2-Nepali on Hugging Face
💻 Training Code: GitHub Repository
📊 Dataset: 12GB of Nepali text derived from the NepBERTa project.


Modifications from Original Code

1️⃣ Tokenizer:

  • Initially, I experimented with the pretrained GPT-2 tokenizer. However, it generated more tokens than there are Nepali characters in a sentence. This is likely due to the Unicode representation of Devanagari text, where a single visible character is often composed of multiple code points (a base character plus combining diacritics) that the byte-level tokenizer splits further; see the short illustration after this list.
  • To address this, I trained a new BPE tokenizer tailored for the Nepali language. This tokenizer was based on an earlier version of my NepaliBPE tokenizer.
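To make the character-vs-code-point point concrete, here is a small illustration (an assumption-laden sketch using Python 3 and tiktoken's GPT-2 encoding; exact token counts may vary):

```python
# Why GPT-2's byte-level BPE inflates counts for Devanagari: one visible
# syllable is several Unicode code points, and each code point is 3 UTF-8
# bytes, which the GPT-2 vocabulary then splits into multiple tokens.
import tiktoken

word = "राम"                                   # two visible syllables
print(len(word))                               # 3 code points: 'र', 'ा', 'म'
print(len(word.encode("utf-8")))               # 9 UTF-8 bytes
print(len(tiktoken.get_encoding("gpt2").encode(word)))  # typically more tokens than visible characters
```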

2️⃣ Dataloader:

  • I pre-tokenized the dataset and made minor modifications to the dataloader so it works with the pre-tokenized dataset; a minimal sketch follows below.
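The following is a minimal sketch under stated assumptions (not the project's exact code): a PyTorch dataset over a pre-tokenized corpus stored as a flat Python list of token IDs, yielding (input, target) windows shifted by one position, mirroring the dataloader from the book.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PreTokenizedDataset(Dataset):
    def __init__(self, token_ids, max_length=1024, stride=1024):
        # Slice the flat ID list into fixed-length input/target windows.
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# "train_ids.pt" is a hypothetical file holding the pre-tokenized ID list.
token_ids = torch.load("train_ids.pt")
loader = DataLoader(PreTokenizedDataset(token_ids), batch_size=8, shuffle=True, drop_last=True)
```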

A huge thank you to @rasbt for the inspiration and for writing such an incredible resource—easily the best book on LLMs I’ve ever read!




Awesome project! I don't speak Nepali, unfortunately, but I find that super interesting as a kind of case study for adapting LLMs to new languages, codes, structures, etc.

May I ask what tool you used for training the tokenizer?

PS: I have some code for implementing and training a BPE tokenizer from scratch (one of the outtakes that didn't fit into the book / was way too long for chapter 2). I have ample notes for this but totally forgot to upload it as part of the bonus materials, thanks for reminding me. I'll probably share it here in a few days. It's not meant for efficiency but more for educational purposes.
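For readers curious about the core idea, here is a minimal educational sketch (not the code referred to above, just an illustration): BPE training repeatedly merges the most frequent adjacent symbol pair in the corpus.

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Start with each word as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = Counter(merged)
    return merges

print(train_bpe(["राम", "रामले", "भात", "खायो"], num_merges=5))
```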

@Aananda-giri

Thank you so much, @rasbt ! 🙌

I just checked out your implementation, and it’s amazing! The fact that it’s only 5x faster than the Hugging Face implementation is mind-blowing. I’m looking forward to diving deeper into the notebook and understanding the internals.

Your work continues to inspire me, and I’m sure this resource will be incredibly valuable for the entire community. Thanks again for sharing and for tagging me—it’s much appreciated!

@rasbt

The fact that it’s only 5x faster than Hugging Face

"only" 5x faster 😆

@d-kleine

Do you have any idea why your implementation is way faster than the HF one?

@rasbt

No idea. My guess is that mine is simpler, while HF has more bells and whistles because it has to support so many other tokenizers, settings, and so forth. Just my assumption here, though.

@d-kleine

I just saw that you used HF's GPT2Tokenizer for the comparison; there is also a faster version called GPT2TokenizerFast (see here). I just tested it, and your implementation is still roughly 2x faster than GPT2TokenizerFast, at least on my hardware.
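For reference, a rough way to reproduce this kind of timing comparison (illustrative only; the exact speedup depends on hardware and input text):

```python
import time
from transformers import GPT2Tokenizer, GPT2TokenizerFast

texts = ["Hello world, this is a short benchmark sentence."] * 10_000

slow = GPT2Tokenizer.from_pretrained("gpt2")
fast = GPT2TokenizerFast.from_pretrained("gpt2")

for name, tok in [("GPT2Tokenizer", slow), ("GPT2TokenizerFast", fast)]:
    start = time.perf_counter()
    for t in texts:
        tok.encode(t)
    print(name, round(time.perf_counter() - start, 2), "s")
```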


Hi @Aananda-giri, this is fantastic work!

I'm particularly interested in the tokenizer comparison. Could you share any insights on how the performance of your new BPE tokenizer compares to the original GPT-2 pretrained tokenizer? Did the BPE tokenizer lead to more accurate text generation or better understanding of Nepali nuances?

@Aananda-giri


Thank you so much for the kind words and your interest in the project! 😊

Here are some insights into the tokenizer comparison:


1️⃣ Did the BPE tokenizer lead to more accurate text generation?

Yes, the BPE tokenizer significantly improved tokenization efficiency for the Nepali language by generating fewer and more meaningful tokens than the GPT-2 tokenizer; a small reproduction sketch follows the example below.

Example:

  • Input text: "राम ले भात खायो ।"
  • Tokens generated by GPT-2 tokenizer (tiktoken):
    ['�', '�', 'ा', '�', '�', ' �', '�', '�', '�', ' �', '�', 'ा', '�', '�', ' �', '�', 'ा', '�', '�', '�', '�', ' ', '�', '�']
    (24 tokens, with many fragmented and unusable outputs)
  • Tokens generated by BPE tokenizer:
    ['राम</w>', 'ले</w>', 'भात</w>', 'खायो</w>', '।</w>']
    (5 tokens, clean and intuitive for Nepali text)
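The comparison above can be reproduced with something like the following minimal sketch (the file name "NepaliBPE.json" is an illustrative assumption, not the project's exact artifact):

```python
import tiktoken
from tokenizers import Tokenizer

text = "राम ले भात खायो ।"

# GPT-2 byte-level BPE via tiktoken: many fragmented byte-level pieces.
gpt2 = tiktoken.get_encoding("gpt2")
gpt2_ids = gpt2.encode(text)
print(len(gpt2_ids), [gpt2.decode([i]) for i in gpt2_ids])

# Nepali BPE tokenizer saved in the Hugging Face `tokenizers` JSON format:
# a handful of whole-word tokens such as 'राम</w>'.
nepali = Tokenizer.from_file("NepaliBPE.json")
enc = nepali.encode(text)
print(len(enc.tokens), enc.tokens)
```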

2️⃣ Performance of BPE tokenizer vs. GPT-2 pretrained tokenizer

  • Training: The BPE tokenizer was trained with the Hugging Face Tokenizers library; a minimal training sketch follows below.
  • Speed: While the Hugging Face tokenizer offers flexibility and ease of use, it is approximately 10x slower than tiktoken (as noted in this discussion by @rasbt).
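Here is a minimal training sketch with the Hugging Face Tokenizers library (the corpus file name and 50k vocabulary size are illustrative assumptions, not the project's exact settings; the `end_of_word_suffix` option produces tokens like 'राम</w>' as shown above):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,                          # illustrative vocab size
    special_tokens=["<unk>", "<|endoftext|>"],
    end_of_word_suffix="</w>",                  # yields word-final tokens like 'खायो</w>'
)
tokenizer.train(files=["nepali_corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("NepaliBPE.json")
```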

3️⃣ More accurate token splitting and generation?

Yes, the BPE tokenizer demonstrated more accurate handling of Nepali morphology and word structures.

Example:

  • Input text: "राम घरको छानामाथि बस्यो"
  • Tokens generated by BPE tokenizer:
    ['राम</w>', 'घरको</w>', 'छाना', 'माथि</w>', 'बस्यो</w>']

Here, छानामाथि was correctly split into छाना and माथि</w>, with माथि recognized as a suffix. This aligns well with the structure of the Nepali language, making the outputs more meaningful.


Improvements and Future Directions

  • Handling prefixes/suffixes: While the tokenizer correctly split some suffixes, not all were separated consistently. Incorporating pre-tokenization techniques (e.g., a regex to separate prefixes/suffixes, as sketched after this list) could further improve accuracy.
  • Rare word handling: Rare words are currently split into individual characters. Increasing the vocabulary size might reduce this issue and improve tokenization for less common words.
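A hedged illustration of the regex idea (not the project's actual pipeline): a Split pre-tokenizer from the Hugging Face Tokenizers library that peels off the two suffixes appearing in the examples above (माथि, को) before BPE is applied. A real implementation would need a fuller, linguistically informed suffix list.

```python
from tokenizers import Regex, pre_tokenizers

pre = pre_tokenizers.Sequence([
    pre_tokenizers.Whitespace(),
    # Isolate a word-final माथि or को as its own piece.
    pre_tokenizers.Split(Regex("(माथि|को)$"), behavior="isolated"),
])

print(pre.pre_tokenize_str("राम घरको छानामाथि बस्यो"))
# e.g. [('राम', ...), ('घर', ...), ('को', ...), ('छाना', ...), ('माथि', ...), ('बस्यो', ...)]
```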

I hope this provides a clear overview of the tokenizer comparison and the improvements achieved! Let me know if you’d like further details.


@ark-sandbox

@Aananda-giri I am trying to gauge how much compute is required to train tokenizers from scratch. It would be helpful to know what percentage of the data from nepali_llm_datasets you used and what the peak memory usage was, as I am trying something similar.

@Aananda-giri

Unfortunately, I do not have the exact details of the peak memory usage. However, I trained the tokenizer on Google Colab using 12GB of text data. The environment had 12.7GB of RAM and 107GB of storage, which was sufficient for the task.

Later, I trained an updated version of the tokenizer (available here) with approximately 30GB of text data, also on Google Colab.
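For future runs, the peak-memory question above could be answered directly by recording the process's peak RSS around the training call (a small sketch, assuming a Linux/Colab environment where ru_maxrss is reported in kilobytes):

```python
import resource

# ... run the tokenizer training here ...

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.1f} MB")
```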

BTW, what kind of tokenizer are you building (size of your dataset, vocab size)?

@ark-sandbox

Just with a Colab notebook, awesome!

I was trying to replicate a Gemma-like tokenizer with a 256k vocab size, using a 246GB Wikipedia dataset focused on Indic languages.
https://huggingface.co/datasets/ai4bharat/wiki-translate
Here is the gist of the code I used,
https://gist.github.com/karavindhan/63930554f5481d34efdd7f2c5ba2ce56.

I will definitely check out minBPE and @rasbt's implementation, or try reducing both the vocab size and the dataset size.

Congratulations @Aananda-giri,

I'm also working on a local language, Chadian Arabic (shu), which differs slightly from Standard Arabic but is written in the Latin alphabet. I'm currently building up a corpus, and once it's finished, I plan to create a BPE tokenizer.
After that, I plan to create a text-to-text model.
