
Hi all, here's some new bonus material that I thought you might enjoy 😊

Byte Pair Encoding (BPE) Tokenizer From Scratch

Happy weekend!


Replies: 3 comments · 11 replies


What I have not yet fully understood is why whitespace is preserved within some tokens during the BPE training process rather than treated as separate tokens. Is this because separate whitespace tokens would significantly increase context window utilization?

7 replies
@rasbt (Maintainer, Author) · Jan 20, 2025

Btw, for GPT-4, multiple whitespaces have dedicated tokens, which makes it a much better tokenizer for coding tasks.
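A minimal sketch of that difference, assuming tiktoken is installed (exact token counts may vary by tiktoken version):

import tiktoken

# A short snippet with four-space indentation, as is common in Python code
code = "    if x > 0:\n        return x"

gpt2_enc = tiktoken.get_encoding("gpt2")         # GPT-2 BPE
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 BPE

# GPT-4's encoder has dedicated tokens for runs of whitespace, so it
# typically needs noticeably fewer tokens for the indented code above
print("GPT-2 tokens:", len(gpt2_enc.encode(code)))
print("GPT-4 tokens:", len(gpt4_enc.encode(code)))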

@d-kleine

Interesting - yeah, that makes sense for code indentations! 👍🏻 It's actually quite insightful how tokenizers have evolved over time, becoming multimodal 🧠

@d-kleine

BTW Aleph Alpha just proposed a new tokenizer-free autoregressive LLM architecture:

@rasbt (Maintainer, Author) · Jan 24, 2025

Oh nice! This goes right onto my bookmark list. It kind of reminds me of the Byte Latent Transformer from December. These tokenizer-free approaches could be a nice article one day.

@d-kleine

> These tokenizer-free approaches could be a nice article one day.

Yeah, this would be awesome!


BTW, since you mentioned minbpe in the notebook: you can also train a BPE tokenizer with HF's tokenizers library, for example:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Initialize tokenizer
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))

# Configure trainer
trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["<|endoftext|>"]
)

# Train the tokenizer
tokenizer.train(files=["the-verdict.txt"], trainer=trainer)

# Save the tokenizer
tokenizer.save("tokenizer.json")
  • You can also load the original tokenizer, e.g. for GPT-2
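For instance, a minimal sketch, assuming the tokenizer.json saved above and Hugging Face Hub access for the GPT-2 files:

from tokenizers import Tokenizer

# Reload the tokenizer that was trained and saved above
trained_tokenizer = Tokenizer.from_file("tokenizer.json")
print(trained_tokenizer.encode("Hello, world!").tokens)

# Load the original GPT-2 tokenizer from the Hugging Face Hub
gpt2_tokenizer = Tokenizer.from_pretrained("gpt2")
print(gpt2_tokenizer.encode("Hello, world!").tokens)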
4 replies
@rasbt (Maintainer, Author) · Jan 21, 2025

> You should be able to train the tokenizer with HF's tokenizer framework.

Ah yes, this is what @Aananda-giri did in #485

@d-kleine

Yes, exactly. I think it would be great to add this to the introduction text in the notebook, at:

  • The difference between the implementations above and my implementation in this notebook is that it also includes a function for training the tokenizer (for educational purposes)
  • There's also an implementation called minBPE with training support, which may be more performant (my implementation here is focused on educational purposes); in contrast to minbpe, my implementation additionally allows loading the original OpenAI tokenizer vocabulary and merges

because HF's tokenizers can do both training and loading a pretrained tokenizer (along with transformers)

@rasbt (Maintainer, Author) · Jan 21, 2025

@d-kleine Sure, I can add a note about that. I find HF code really hard to read tbh so I would prefer recommending minBPE. But yeah, I added the note as part of #495

@d-kleine

Thanks! 👍🏻
