This project is about creating a Small Language Model from scratch using the Transformer architecture. The purpose is to learn about Transformers and their related parts. The project includes:
- Data preprocessing
- Tokenization
- Transformer implementation (embedding, attention, etc.)
- Training and generating tokens
The data used for this project: WikiText-2
Generated text with the trained model (character-level tokenization):
The meaning of life is a production of border on the major proteins . In 1992 , the country was the former received at the Most American American second conflict of the South America , Middle Marine , and the state was for the first first part of the season . The second s
To create an environment and install requirements:
python -m venv .venv
.venv/Scripts/activate
pip install -r requirements.txt
To download and preprocess the data:
run -> data_prep.py
To select and run tokenization (create mapping, etc.):
run -> tokenizer.py
To train the model and select training and model parameters:
run -> train.py
To try the model's generation capabilities:
run -> main.py
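The following is a minimal sketch of what character-level autoregressive generation could look like; names such as `stoi`, `itos`, and `generate` are illustrative assumptions, not the actual main.py API.

```python
import torch

# Sketch only: `model` is a trained TinyLLM, `stoi`/`itos` are the
# character <-> id mappings produced by tokenizer.py (assumed names).
@torch.no_grad()
def generate(model, stoi, itos, prompt, max_new_tokens=200, block_size=128):
    model.eval()
    ids = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -block_size:])          # (1, T, vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)        # append sampled character
    return "".join(itos[i] for i in ids[0].tolist())

# print(generate(model, stoi, itos, "The meaning of life is"))
```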
The model was trained on WikiText-2 for 10 epochs using the settings described above. Both training and validation loss decreased steadily, showing that the model successfully learned language structure.
- Final Validation Loss: ~1.37
- Final Validation Perplexity: ~3.94
A perplexity of ~3.94 means that, on average, the model is about as uncertain as choosing uniformly among roughly four characters for the next token, far better than a random baseline over the 1014-character vocabulary.
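For reference, the reported perplexity is consistent with being the exponential of the average cross-entropy loss (an assumption about how the numbers were computed):

```python
import math

val_loss = 1.37                # final validation cross-entropy (from above)
print(math.exp(val_loss))      # ≈ 3.94, the reported validation perplexity
# A uniform guess over the 1014-character vocabulary would instead give
# perplexity ≈ 1014 (loss ≈ ln(1014) ≈ 6.92).
```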
The following are examples of text generated by the TinyLLM after training. Note that the model is small and trained on limited data, so the outputs are not always coherent, but they demonstrate that the model has learned meaningful character/word sequences:
Start sequence:
The meaning of life is
The meaning of life is a memorial strength of the former of the state . The start has been formed to be the first and international content of the command was an area of the start of the first transferred to the season . The production of the first was produced by the Com
The final model is implemented in tiny_LLM_model.py and uses the submodules implemented in the submodels folder. The reason for the modular structure is the learning purpose: each part is built separately and later combined.
Model parameters:
vocab_size = 1014 (unique characters from the training dataset)
d_model = 256
n_layer = 4
n_head = 16
block_size = 128
The model uses pre-normalization and residual connections. A short overview of the model is presented below:
--> Embedding: Token embedding + Positional embedding (SinusoidalPositionalEncoding)
--> 4 x Transformer Decoder Blocks
--> LM head
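Below is a minimal PyTorch sketch of how these parts could fit together using the parameters listed above. Class and attribute names are illustrative assumptions and may differ from the actual tiny_LLM_model.py and submodels code.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sine/cosine positional encoding added to the token embeddings."""
    def __init__(self, d_model, block_size):
        super().__init__()
        pos = torch.arange(block_size).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(block_size, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                       # x: (B, T, d_model)
        return x + self.pe[: x.size(1)]

class DecoderBlock(nn.Module):
    """Pre-norm block: LayerNorm -> causal self-attention -> residual,
    then LayerNorm -> feed-forward -> residual."""
    def __init__(self, d_model, n_head, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model), nn.Dropout(dropout),
        )

    def forward(self, x):
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        x = x + self.ff(self.ln2(x))
        return x

class TinyLLM(nn.Module):
    def __init__(self, vocab_size=1014, d_model=256, n_layer=4, n_head=16, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = SinusoidalPositionalEncoding(d_model, block_size)
        self.blocks = nn.Sequential(*[DecoderBlock(d_model, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):                     # idx: (B, T) token ids
        x = self.pos_emb(self.tok_emb(idx))
        x = self.blocks(x)
        return self.lm_head(self.ln_f(x))       # (B, T, vocab_size)
```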
The dataset used is WikiText-2, a popular benchmark for language modeling. Before tokenization, some cleaning steps were applied:
- All tab characters (\t) were removed, ensuring consistency in the text.
- The data was split into train, validation, and test sets. Only the training split was used to build the vocabulary. This avoids data leakage from validation/test into training.
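A minimal sketch of the described cleaning and splitting is shown below, assuming the Hugging Face `datasets` copy of WikiText-2; the actual data_prep.py may load and clean the data differently.

```python
from datasets import load_dataset

# Assumed data source for illustration; splits are train/validation/test.
ds = load_dataset("wikitext", "wikitext-2-raw-v1")

def clean(split):
    # Remove tab characters and join the lines into one long string.
    return "".join(row["text"].replace("\t", "") for row in ds[split])

train_text = clean("train")
val_text   = clean("validation")
test_text  = clean("test")

# The vocabulary is built from the training split only, to avoid leakage.
vocab = sorted(set(train_text))
print(len(vocab))   # on the order of ~1k unique characters
```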
Tokenization Approaches
Two different tokenization methods were implemented for learning purposes:
Character-level tokenization (very simple to implement; small vocabulary, ≈ 1k tokens):
- Each unique character (letters, digits, punctuation, whitespace, etc.) becomes a token.
- Example: "Hello" → [‘H’, ‘e’, ‘l’, ‘l’, ‘o’].
Word-level tokenization (shorter sequences; larger vocabulary, can easily reach 20k–30k+):
- Each unique word becomes a token.
- Example: "Hello world" → [‘Hello’, ‘world’].
Character-level tokenization was ultimately used (a minimal sketch is shown below).
--> Vocabulary size: 1014
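The sketch below shows the core character-level mapping; the real tokenizer.py may wrap this in a class and also implement the word-level variant.

```python
# Toy stand-in for the cleaned training split; in practice this is WikiText-2.
train_text = "Hello world"

chars = sorted(set(train_text))                 # vocabulary from training text only
stoi = {ch: i for i, ch in enumerate(chars)}    # char -> id
itos = {i: ch for ch, i in stoi.items()}        # id -> char

def encode(text):
    return [stoi[c] for c in text]

def decode(ids):
    return "".join(itos[i] for i in ids)

print(encode("Hello"))            # [1, 3, 4, 4, 5] for this toy text
print(decode(encode("Hello")))    # "Hello"
```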
The model is trained on WikiText-2 with standard Transformer optimization settings. The main goal was to balance stable convergence with efficient training for a small-scale GPT-style model. Training was performed on a CPU, which required using relatively small model sizes and batch sizes. As seen in the plots, the model successfully learns over time, with both training and validation loss/perplexity decreasing steadily.
Note: The validation loss and perplexity curves should be shifted by one epoch. This happened because the model performed a training step (backpropagation) before evaluating on the validation set; I noticed this after running the experiments. The results are still valid, but normally training loss and perplexity should be lower than the validation scores.
Training parameters (most important):
- Adam optimizer: betas=(0.9, 0.95)
- Dropout: 0.1
- Learning rate: 3e-4
- batch_size = 16
- epochs = 10
- batches_per_epoch = 5000
- avg_val_iter = 100
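A sketch of a training loop using these settings is shown below; `get_batch` is a hypothetical helper (not the actual train.py API) that returns (inputs, targets) id tensors of shape (batch_size, block_size). Dropout (0.1) is applied inside the model itself.

```python
import math
import torch

batch_size, epochs = 16, 10
batches_per_epoch, avg_val_iter = 5000, 100

model = TinyLLM()   # from the architecture sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(epochs):
    model.train()
    for _ in range(batches_per_epoch):
        xb, yb = get_batch("train", batch_size)               # hypothetical helper
        logits = model(xb)                                    # (B, T, vocab_size)
        loss = criterion(logits.view(-1, logits.size(-1)), yb.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation loss averaged over `avg_val_iter` batches, evaluated after the
    # epoch's updates so the curves stay aligned (see the note above).
    model.eval()
    with torch.no_grad():
        total = 0.0
        for _ in range(avg_val_iter):
            xb, yb = get_batch("val", batch_size)
            out = model(xb)
            total += criterion(out.view(-1, out.size(-1)), yb.view(-1)).item()
    val_loss = total / avg_val_iter
    print(f"epoch {epoch}: val_loss={val_loss:.3f}, ppl={math.exp(val_loss):.2f}")
```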