
Hi all, here's some new bonus material that I thought you might enjoy 😊

Byte Pair Encoding (BPE) Tokenizer From Scratch

Happy weekend!


Replies: 3 comments · 11 replies


What I have not yet fully understood is why whitespace is preserved within some tokens during the BPE training process rather than treated as separate tokens. Is this because separate whitespace tokens would significantly increase context window utilization?

7 replies
@rasbt (Maintainer, Author) · Jan 20, 2025

Btw, for GPT-4, multiple whitespaces have dedicated tokens, which makes it a much better tokenizer for coding tasks.
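A minimal sketch of that difference, assuming tiktoken is installed (exact token counts may vary by tiktoken version):

import tiktoken

# A short snippet with four-space indentation, as is common in Python code
code = "    if x > 0:\n        return x"

gpt2_enc = tiktoken.get_encoding("gpt2")         # GPT-2 BPE
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 BPE

# GPT-4's encoder has dedicated tokens for runs of whitespace, so it
# typically needs noticeably fewer tokens for the indented code above
print("GPT-2 tokens:", len(gpt2_enc.encode(code)))
print("GPT-4 tokens:", len(gpt4_enc.encode(code)))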

@d-kleine

Interesting - yeah, that makes sense for code indentations! 👍🏻 It's actually quite insightful how tokenizers have evolved over time, becoming multimodal 🧠

@d-kleine

BTW Aleph Alpha just proposed a new tokenizer-free autoregressive LLM architecture:

@rasbt (Maintainer, Author) · Jan 24, 2025

Oh nice! This goes right onto my bookmark list. It kind of reminds me of the Byte Latent Transformer from December. These tokenizer-free approaches could be a nice article one day.

@d-kleine

> These tokenizer-free approaches could be a nice article one day.

Yeah, this would be awesome!


BTW, since you mentioned minbpe in the notebook: you can also train a BPE tokenizer with HF's tokenizers library, for example:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Initialize tokenizer
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))

# Configure trainer
trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["<|endoftext|>"]
)

# Train the tokenizer
tokenizer.train(files=["the-verdict.txt"], trainer=trainer)

# Save the tokenizer
tokenizer.save("tokenizer.json")
  • You can also load the original tokenizer, e.g. for GPT-2
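For instance, a minimal sketch, assuming the tokenizer.json saved above and Hugging Face Hub access for the GPT-2 files:

from tokenizers import Tokenizer

# Reload the tokenizer that was trained and saved above
trained_tokenizer = Tokenizer.from_file("tokenizer.json")
print(trained_tokenizer.encode("Hello, world!").tokens)

# Load the original GPT-2 tokenizer from the Hugging Face Hub
gpt2_tokenizer = Tokenizer.from_pretrained("gpt2")
print(gpt2_tokenizer.encode("Hello, world!").tokens)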
4 replies
@rasbt (Maintainer, Author) · Jan 21, 2025

> You should be able to train the tokenizer with HF's tokenizer framework.

Ah yes, this is what @Aananda-giri did in #485

@d-kleine

Yes, exactly. I think it would be great to add this to the introduction text in the notebook, at:

  • The difference between the implementations above and my implementation in this notebook is that it also includes a function for training the tokenizer (for educational purposes)
  • There's also an implementation called minBPE with training support, which may be more performant (my implementation here is focused on educational purposes); in contrast to minbpe, my implementation additionally allows loading the original OpenAI tokenizer vocabulary and merges

because HF's tokenizers can do both training and loading a pretrained tokenizer (along with transformers)

@rasbt (Maintainer, Author) · Jan 21, 2025

@d-kleine Sure, I can add a note about that. I find HF code really hard to read tbh so I would prefer recommending minBPE. But yeah, I added the note as part of #495

@d-kleine

Thanks! 👍🏻
