
Small Language Model

This project is about creating a Small Language Model from scratch utilizing transformer architectures. The purpose is to learn about transformers and their related parts. This project includes:

  • Data preprocessing
  • Tokenization
  • Transformer implementation (embedding, attention, etc.)
  • Training and generating tokens

The data used for this project: WikiText-2

Generated text with the trained model (character-level tokenization):

The meaning of life is a production of border on the major proteins . In 1992 , the country was the former received at the Most American American second conflict of the South America , Middle Marine , and the state was for the first first part of the season . The second s


Set up

To create an environment and install the requirements:

python -m venv .venv
.venv\Scripts\activate        # Windows (use `source .venv/bin/activate` on Linux/macOS)
pip install -r requirements.txt

To download and preprocess the data:

python data_prep.py

To select and run tokenization (create the vocabulary mapping, etc.):

python tokenizer.py

To train the model and select training and model parameters:

python train.py

To try the model's generation capabilities:

python main.py

Results

The model was trained on WikiText-2 for 10 epochs using the settings described in the Training section below. Both training and validation loss decreased steadily, showing that the model successfully learned language structure.

  • Final Validation Loss: ~1.37

  • Final Validation Perplexity: ~3.94

A perplexity of ~3.94 means that, on average, the model is about as uncertain as a uniform choice among ~4 characters, compared to ~1014 for a uniform random baseline over the full vocabulary.
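
Perplexity is simply the exponential of the average cross-entropy loss, so the two numbers above are consistent:

    import math

    val_loss = 1.37                # average validation cross-entropy (nats per character)
    print(math.exp(val_loss))      # ≈ 3.94, the reported validation perplexity
    # A uniform random baseline over the 1014-character vocabulary would have perplexity 1014.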

The following are examples of text generated by the TinyLLM after training. Note that the model is small and trained on limited data, so the outputs are not always coherent, but they demonstrate that the model has learned meaningful character/word sequences:

Start sequence: The meaning of life is

The meaning of life is a memorial strength of the former of the state . The start has been formed to be the first and international content of the command was an area of the start of the first transferred to the season . The production of the first was produced by the Com
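
Text like this is typically produced with a simple autoregressive sampling loop. Below is a minimal sketch of such a loop; the actual implementation lives in main.py, and the names stoi/itos for the character mappings as well as the assumed model output shape (batch, time, vocab_size) are illustrative, not taken from the repository.

    import torch

    @torch.no_grad()
    def generate(model, stoi, itos, start="The meaning of life is",
                 max_new_tokens=200, block_size=128, temperature=1.0):
        # Encode the start sequence with the character-level mapping.
        idx = torch.tensor([[stoi[c] for c in start]], dtype=torch.long)
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]           # crop context to the block size
            logits = model(idx_cond)                  # assumed shape: (1, T, vocab_size)
            logits = logits[:, -1, :] / temperature   # distribution over the next character
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_id], dim=1)
        return "".join(itos[int(i)] for i in idx[0])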


Model

The final model is implemented in tiny_LLM_model.py and uses the submodules implemented in the submodels folder. The reason for building it modularly is the learning value of implementing each part separately and then combining the parts.

Model parameters:

    vocab_size = 1014 (unique characters from the training dataset)
    d_model = 256
    n_layer = 4
    n_head = 16
    block_size = 128

The model uses pre-normalization and residual connections. A short overview of the model is presented below:

--> Embedding: Token embedding + Positional embedding (SinusoidalPositionalEncoding)

--> 4 x Transformer Decoder Blocks

--> LM head
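
For reference, the overview above corresponds roughly to the following minimal sketch (token embedding + sinusoidal positional encoding, pre-norm decoder blocks with residual connections, and a linear LM head). The parameters match the list above, but module names and details in tiny_LLM_model.py and the submodels folder may differ.

    import math
    import torch
    import torch.nn as nn

    class SinusoidalPositionalEncoding(nn.Module):
        def __init__(self, d_model, max_len=128):
            super().__init__()
            pe = torch.zeros(max_len, d_model)
            pos = torch.arange(max_len).unsqueeze(1).float()
            div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            self.register_buffer("pe", pe)

        def forward(self, x):                          # x: (B, T, d_model)
            return x + self.pe[: x.size(1)]

    class DecoderBlock(nn.Module):
        """Pre-norm transformer decoder block with causal self-attention."""
        def __init__(self, d_model, n_head, dropout=0.1):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(),
                nn.Linear(4 * d_model, d_model), nn.Dropout(dropout),
            )

        def forward(self, x):
            T = x.size(1)
            causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
            h = self.ln1(x)                            # pre-normalization
            attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
            x = x + attn_out                           # residual connection
            x = x + self.mlp(self.ln2(x))              # residual connection
            return x

    class TinyLLM(nn.Module):
        def __init__(self, vocab_size=1014, d_model=256, n_layer=4, n_head=16, block_size=128):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_enc = SinusoidalPositionalEncoding(d_model, max_len=block_size)
            self.blocks = nn.Sequential(*[DecoderBlock(d_model, n_head) for _ in range(n_layer)])
            self.ln_f = nn.LayerNorm(d_model)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, idx):                        # idx: (B, T) token ids
            x = self.pos_enc(self.tok_emb(idx))
            x = self.blocks(x)
            return self.lm_head(self.ln_f(x))          # (B, T, vocab_size)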


Data Preprocessing + Tokenization

The dataset used is WikiText-2, a popular benchmark for language modeling. Before tokenization, some cleaning steps were applied:

  • All tab characters (\t) were removed, ensuring consistency in the text.
  • The data was split into train, validation, and test sets. Only the training split was used to build the vocabulary. This avoids data leakage from validation/test into training.

Tokenization Approaches

Two different tokenization methods were implemented for learning purposes:

Character-level tokenization (very simple to implement; small vocabulary size, ≈ 1k tokens).

  • Each unique character (letters, digits, punctuation, whitespace, etc.) becomes a token.
  • Example: "Hello" → [‘H’, ‘e’, ‘l’, ‘l’, ‘o’].

Word-level tokenization (shorter sequences; larger vocabulary, which can easily reach 20k–30k+)

  • Each unique word becomes a token.
  • Example: "Hello world" → [‘Hello’, ‘world’].

Character-level tokenization was used in the end --> vocabulary size: 1014
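
A minimal sketch of the character-level approach, building the vocabulary from the training split only (the file name below is illustrative, not the repository's):

    # Build the character vocabulary from the training split only (avoids data leakage).
    def build_char_vocab(train_text):
        chars = sorted(set(train_text))
        stoi = {ch: i for i, ch in enumerate(chars)}   # character -> id
        itos = {i: ch for ch, i in stoi.items()}       # id -> character
        return stoi, itos

    def encode(text, stoi):
        return [stoi[ch] for ch in text]

    def decode(ids, itos):
        return "".join(itos[i] for i in ids)

    train_text = open("wikitext2_train.txt", encoding="utf-8").read().replace("\t", "")
    stoi, itos = build_char_vocab(train_text)
    print(len(stoi))                                   # ~1014 unique characters
    print(encode("Hello", stoi))                       # five ids, one per character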


Training

The model is trained on WikiText-2 with standard Transformer optimization settings. The main goal was to balance stable convergence with efficient training for a small-scale GPT-style model. Training was performed on a CPU, which required using relatively small model sizes and batch sizes. As seen in the plots, the model successfully learns over time, with both training and validation loss/perplexity decreasing steadily.

Note: The validation loss and perplexity curves should be shifted by one epoch. This happened because the model performed a training step (backpropagation) before evaluating on the validation set; I noticed this after running the experiments. The results are still valid, but normally training loss and perplexity should be lower than the validation scores.

Training Results

Training parameters (most important):

  • Adam optimizer: betas=(0.9, 0.95)
  • Dropout: 0.1
  • Learning rate: 3e-4
  • batch_size = 16
  • epochs = 10
  • batches_per_epoch = 5000
  • avg_val_iter = 100
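
A minimal sketch of a training step consistent with these settings (next-character prediction with cross-entropy loss). The batch sampling and evaluation logic in train.py may differ; train_ids stands in for the encoded WikiText-2 training split, and TinyLLM refers to the model sketch above.

    import torch
    import torch.nn.functional as F

    model = TinyLLM()
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

    # Placeholder: replace with the encoded WikiText-2 training split as a 1-D LongTensor.
    train_ids = torch.randint(0, 1014, (100_000,))

    def get_batch(data, block_size=128, batch_size=16):
        # Sample random contiguous chunks of token ids and their shifted targets.
        ix = torch.randint(len(data) - block_size - 1, (batch_size,))
        x = torch.stack([data[i : i + block_size] for i in ix])
        y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
        return x, y

    for epoch in range(10):                            # epochs = 10
        for _ in range(5000):                          # batches_per_epoch = 5000
            xb, yb = get_batch(train_ids)
            logits = model(xb)                         # (B, T, vocab_size)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), yb.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()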
