This project is about creating a Small Language Model from scratch using the Transformer architecture. The purpose is to learn about Transformers and their related parts. The project includes:
- Data preprocessing
- Tokenization
- Transformer implementation (embedding, attention, etc.)
- Training and generating tokens
The data used for this project: WikiText-2
Generated text with the trained model (character-level tokenization):
The meaning of life is a production of border on the major proteins . In 1992 , the country was the former received at the Most American American second conflict of the South America , Middle Marine , and the state was for the first first part of the season . The second s
To create an environment and install requirements:
python -m venv .venv
.venv/Scripts/activate
pip install -r requirements.txt
To download and preprocess the data:
run -> data_prep.py
To select and run tokenization (create mapping, etc.):
run -> tokenizer.py
To train the model and select training and model parameters:
run -> train.py
To try the model's generation capabilities:
run -> main.py
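The following is a minimal sketch of what character-level autoregressive generation could look like; names such as `stoi`, `itos`, and `generate` are illustrative assumptions, not the actual main.py API.

```python
import torch

# Sketch only: `model` is a trained TinyLLM, `stoi`/`itos` are the
# character <-> id mappings produced by tokenizer.py (assumed names).
@torch.no_grad()
def generate(model, stoi, itos, prompt, max_new_tokens=200, block_size=128):
    model.eval()
    ids = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -block_size:])          # (1, T, vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)        # append sampled character
    return "".join(itos[i] for i in ids[0].tolist())

# print(generate(model, stoi, itos, "The meaning of life is"))
```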
The model was trained on WikiText-2 for 10 epochs using the settings described above. Both training and validation loss decreased steadily, showing that the model successfully learned language structure.
- Final Validation Loss: ~1.37
- Final Validation Perplexity: ~3.94
A perplexity of ~3.94 means that, on average, the model is about as uncertain as choosing uniformly among roughly four characters for the next token, far better than a random baseline over the 1014-character vocabulary.
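For reference, the reported perplexity is consistent with being the exponential of the average cross-entropy loss (an assumption about how the numbers were computed):

```python
import math

val_loss = 1.37                # final validation cross-entropy (from above)
print(math.exp(val_loss))      # ≈ 3.94, the reported validation perplexity
# A uniform guess over the 1014-character vocabulary would instead give
# perplexity ≈ 1014 (loss ≈ ln(1014) ≈ 6.92).
```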
The following are examples of text generated by the TinyLLM after training. Note that the model is small and trained on limited data, so the outputs are not always coherent, but they demonstrate that the model has learned meaningful character/word sequences:
Start sequence:
The meaning of life is
The meaning of life is a memorial strength of the former of the state . The start has been formed to be the first and international content of the command was an area of the start of the first transferred to the season . The production of the first was produced by the Com
The final model is implemented in tiny_LLM_model.py and uses the submodules implemented in the submodels folder. The reason for the modular structure is the learning purpose: each part is built separately and later combined.
Model parameters:
vocab_size = 1014 (unique characters from the training dataset)
d_model = 256
n_layer = 4
n_head = 16
block_size = 128
The model uses pre-normalization and residual connections. A short overview of the model is presented below:
--> Embedding: Token embedding + Positional embedding (SinusoidalPositionalEncoding)
--> 4 x Transformer Decoder Blocks
--> LM head
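Below is a minimal PyTorch sketch of how these parts could fit together using the parameters listed above. Class and attribute names are illustrative assumptions and may differ from the actual tiny_LLM_model.py and submodels code.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sine/cosine positional encoding added to the token embeddings."""
    def __init__(self, d_model, block_size):
        super().__init__()
        pos = torch.arange(block_size).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(block_size, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                       # x: (B, T, d_model)
        return x + self.pe[: x.size(1)]

class DecoderBlock(nn.Module):
    """Pre-norm block: LayerNorm -> causal self-attention -> residual,
    then LayerNorm -> feed-forward -> residual."""
    def __init__(self, d_model, n_head, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model), nn.Dropout(dropout),
        )

    def forward(self, x):
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        x = x + self.ff(self.ln2(x))
        return x

class TinyLLM(nn.Module):
    def __init__(self, vocab_size=1014, d_model=256, n_layer=4, n_head=16, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = SinusoidalPositionalEncoding(d_model, block_size)
        self.blocks = nn.Sequential(*[DecoderBlock(d_model, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):                     # idx: (B, T) token ids
        x = self.pos_emb(self.tok_emb(idx))
        x = self.blocks(x)
        return self.lm_head(self.ln_f(x))       # (B, T, vocab_size)
```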
The dataset used is WikiText-2, a popular benchmark for language modeling. Before tokenization, some cleaning steps were applied:
- All tab characters (\t) were removed, ensuring consistency in the text.
- The data was split into train, validation, and test sets. Only the training split was used to build the vocabulary. This avoids data leakage from validation/test into training.
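A minimal sketch of the described cleaning and splitting is shown below, assuming the Hugging Face `datasets` copy of WikiText-2; the actual data_prep.py may load and clean the data differently.

```python
from datasets import load_dataset

# Assumed data source for illustration; splits are train/validation/test.
ds = load_dataset("wikitext", "wikitext-2-raw-v1")

def clean(split):
    # Remove tab characters and join the lines into one long string.
    return "".join(row["text"].replace("\t", "") for row in ds[split])

train_text = clean("train")
val_text   = clean("validation")
test_text  = clean("test")

# The vocabulary is built from the training split only, to avoid leakage.
vocab = sorted(set(train_text))
print(len(vocab))   # on the order of ~1k unique characters
```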
Tokenization Approaches
Two different tokenization methods were implemented for learning purposes:
Character-level tokenization (very simple to implement; small vocabulary, ≈ 1k tokens):
- Each unique character (letters, digits, punctuation, whitespace, etc.) becomes a token.
- Example: "Hello" → [‘H’, ‘e’, ‘l’, ‘l’, ‘o’].
Word-level tokenization (shorter sequences; larger vocabulary, can easily reach 20k–30k+):
- Each unique word becomes a token.
- Example: "Hello world" → [‘Hello’, ‘world’].
Character-level tokenization was ultimately used (a minimal sketch is shown below).
--> Vocabulary size: 1014
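The sketch below shows the core character-level mapping; the real tokenizer.py may wrap this in a class and also implement the word-level variant.

```python
# Toy stand-in for the cleaned training split; in practice this is WikiText-2.
train_text = "Hello world"

chars = sorted(set(train_text))                 # vocabulary from training text only
stoi = {ch: i for i, ch in enumerate(chars)}    # char -> id
itos = {i: ch for ch, i in stoi.items()}        # id -> char

def encode(text):
    return [stoi[c] for c in text]

def decode(ids):
    return "".join(itos[i] for i in ids)

print(encode("Hello"))            # [1, 3, 4, 4, 5] for this toy text
print(decode(encode("Hello")))    # "Hello"
```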
The model is trained on WikiText-2 with standard Transformer optimization settings. The main goal was to balance stable convergence with efficient training for a small-scale GPT-style model. Training was performed on a CPU, which required using relatively small model sizes and batch sizes. As seen in the plots, the model successfully learns over time, with both training and validation loss/perplexity decreasing steadily.
Note: The validation loss and perplexity curves should be shifted by one epoch. This happened because the model performed a training step (backpropagation) before evaluating on the validation set; I noticed this after running the experiments. The results are still valid, but normally training loss and perplexity should be lower than the validation scores.
Training parameters (most important):
- Adam optimizer: betas=(0.9, 0.95)
- Dropout: 0.1
- Learning rate: 3e-4
- batch_size = 16
- epochs = 10
- batches_per_epoch = 5000
- avg_val_iter = 100
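A sketch of a training loop using these settings is shown below; `get_batch` is a hypothetical helper (not the actual train.py API) that returns (inputs, targets) id tensors of shape (batch_size, block_size). Dropout (0.1) is applied inside the model itself.

```python
import math
import torch

batch_size, epochs = 16, 10
batches_per_epoch, avg_val_iter = 5000, 100

model = TinyLLM()   # from the architecture sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(epochs):
    model.train()
    for _ in range(batches_per_epoch):
        xb, yb = get_batch("train", batch_size)               # hypothetical helper
        logits = model(xb)                                    # (B, T, vocab_size)
        loss = criterion(logits.view(-1, logits.size(-1)), yb.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation loss averaged over `avg_val_iter` batches, evaluated after the
    # epoch's updates so the curves stay aligned (see the note above).
    model.eval()
    with torch.no_grad():
        total = 0.0
        for _ in range(avg_val_iter):
            xb, yb = get_batch("val", batch_size)
            out = model(xb)
            total += criterion(out.view(-1, out.size(-1)), yb.view(-1)).item()
    val_loss = total / avg_val_iter
    print(f"epoch {epoch}: val_loss={val_loss:.3f}, ppl={math.exp(val_loss):.2f}")
```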