Vision Transformer Paper Replication

A complete PyTorch implementation and replication of the groundbreaking Vision Transformer (ViT) paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" applied to the FoodVision Mini dataset.

🎯 Project Overview

This project demonstrates how to replicate a state-of-the-art machine learning research paper from scratch, building a Vision Transformer architecture using PyTorch and applying it to image classification tasks. The implementation focuses on understanding the core concepts by building each component step-by-step.

📋 Table of Contents

  1. Getting Setup
  2. Data Preparation
  3. Model Architecture
  4. Training
  5. Evaluation
  6. Transfer Learning
  7. Results
  8. Requirements
  9. Usage
  10. Key Concepts

🚀 Getting Setup

Prerequisites

  • Python 3.8+
  • PyTorch 1.12+ (with CUDA support recommended)
  • torchvision 0.13+
  • GPU with CUDA support (recommended)

Dependencies

torch>=1.12.0
torchvision>=0.13.0
matplotlib
torchinfo

The notebook automatically handles dependency installation and downloads required helper modules from the pytorch-going-modular repository.
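
For a local setup, the same dependencies can be installed up front (one possible invocation):

pip install "torch>=1.12.0" "torchvision>=0.13.0" matplotlib torchinfo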

📊 Data Preparation

The project uses the FoodVision Mini dataset containing three food categories:

  • 🍕 Pizza
  • 🥩 Steak
  • 🍣 Sushi

Dataset Statistics:

  • Image size: 224×224 pixels (as per ViT paper specifications)
  • Batch size: 32
  • Data split: Train/Test
  • Transforms: Resize and normalization

The dataset is automatically downloaded and prepared using the provided data setup utilities.
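
A minimal sketch of the manual transforms implied by the statistics above (the variable name manual_transforms matches the Usage section; ToTensor also scales pixel values to [0, 1]):

from torchvision import transforms

IMG_SIZE = 224  # as per the ViT paper

# Resize images to 224x224 and convert to float tensors in [0, 1]
manual_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
])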

🏗️ Model Architecture

Vision Transformer Components

The ViT architecture is described by four main equations in the paper, implemented here as modular components:

1. Patch Embedding (Equation 1)

  • Converts input images into sequences of learnable patches
  • Patch size: 16×16 pixels
  • Adds positional embeddings and class token
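
A minimal sketch of this layer, assuming ViT-Base hyperparameters (16×16 patches, 768-dim embeddings); the Conv2d trick below is equivalent to splitting the image into non-overlapping patches and linearly projecting each one:

import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Turns a 2D image into a 1D sequence of learnable patch embeddings."""
    def __init__(self, in_channels=3, patch_size=16, embedding_dim=768):
        super().__init__()
        # kernel_size = stride = patch_size extracts and projects
        # each non-overlapping 16x16 patch in a single conv pass
        self.patcher = nn.Conv2d(in_channels, embedding_dim,
                                 kernel_size=patch_size, stride=patch_size)
        self.flatten = nn.Flatten(start_dim=2, end_dim=3)

    def forward(self, x):
        x = self.patcher(x)        # (B, 768, 14, 14) for a 224x224 input
        x = self.flatten(x)        # (B, 768, 196)
        return x.permute(0, 2, 1)  # (B, 196, 768) -> sequence of patches
        # (the class token and position embeddings are added to this
        #  sequence afterwards)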

2. Multi-Head Self-Attention (MSA) - Equation 2

  • Core attention mechanism of the Transformer
  • Enables the model to focus on relevant image patches
  • Multiple attention heads for diverse representation learning
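
A sketch of the MSA block using PyTorch's built-in nn.MultiheadAttention, with the pre-norm layout the paper describes (hyperparameters assume ViT-Base: 12 heads, 768-dim embeddings):

from torch import nn

class MultiheadSelfAttentionBlock(nn.Module):
    """LayerNorm followed by multi-head self-attention (Equation 2)."""
    def __init__(self, embedding_dim=768, num_heads=12, attn_dropout=0.0):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.multihead_attn = nn.MultiheadAttention(embed_dim=embedding_dim,
                                                    num_heads=num_heads,
                                                    dropout=attn_dropout,
                                                    batch_first=True)

    def forward(self, x):
        x = self.layer_norm(x)
        # Query, key and value all come from the same sequence (self-attention)
        attn_output, _ = self.multihead_attn(query=x, key=x, value=x,
                                             need_weights=False)
        return attn_output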

3. Multilayer Perceptron (MLP) - Equation 3

  • Feed-forward network component
  • Used in Transformer Encoder blocks and output layer
  • Provides non-linear transformations
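
A matching sketch of the MLP block (ViT-Base expands to 3072 hidden units with GELU activation):

from torch import nn

class MLPBlock(nn.Module):
    """LayerNorm -> Linear -> GELU -> Dropout -> Linear -> Dropout (Equation 3)."""
    def __init__(self, embedding_dim=768, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, mlp_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_size, embedding_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.mlp(self.layer_norm(x))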

4. Transformer Encoder (Equations 2 & 3, stacked)

  • Alternating layers of MSA and MLP
  • Residual connections and layer normalization
  • Multiple encoder blocks stacked together
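
Putting the two sub-blocks together with residual connections, reusing the MSA and MLP sketches above (ViT-Base stacks 12 of these blocks):

from torch import nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block: MSA and MLP sub-blocks, each with a residual connection."""
    def __init__(self, embedding_dim=768, num_heads=12, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.msa_block = MultiheadSelfAttentionBlock(embedding_dim, num_heads)
        self.mlp_block = MLPBlock(embedding_dim, mlp_size, dropout)

    def forward(self, x):
        x = self.msa_block(x) + x  # residual connection around attention
        x = self.mlp_block(x) + x  # residual connection around the MLP
        return x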

Model Variants

The implementation includes:

  • Custom ViT: Built from scratch following the paper
  • Pretrained ViT: Using torchvision's pretrained models for transfer learning
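
Loading the pretrained variant takes only a couple of lines with the torchvision >= 0.13 weights API:

import torchvision

# ViT-Base/16 with pretrained ImageNet weights
pretrained_weights = torchvision.models.ViT_B_16_Weights.DEFAULT
pretrained_vit = torchvision.models.vit_b_16(weights=pretrained_weights)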

🔧 Training

Training Configuration

  • Optimizer: Adam (as commonly used with Transformers)
  • Loss Function: Cross-entropy loss
  • Device: CUDA GPU (if available), CPU fallback
  • Epochs: Configurable (typically 10-50 epochs)
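
A representative setup (the exact hyperparameter values are illustrative; vit_model refers to the custom ViT created in the Usage section below):

import torch

# Adam as used with Transformers; lr and weight decay are illustrative
optimizer = torch.optim.Adam(params=vit_model.parameters(),
                             lr=3e-3, betas=(0.9, 0.999), weight_decay=0.3)
loss_fn = torch.nn.CrossEntropyLoss()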

Training Process

  1. Data Loading: Efficient DataLoaders with proper transforms
  2. Forward Pass: Through custom ViT architecture
  3. Loss Calculation: Cross-entropy for classification
  4. Backpropagation: Gradient computation and parameter updates
  5. Validation: Regular evaluation on test set
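
A minimal sketch of one training epoch, matching the numbered steps above (the engine.train helper in the Usage section encapsulates a loop like this):

def train_one_epoch(model, dataloader, loss_fn, optimizer, device):
    model.train()
    total_loss = 0
    for X, y in dataloader:            # 1. data loading
        X, y = X.to(device), y.to(device)
        y_pred = model(X)              # 2. forward pass
        loss = loss_fn(y_pred, y)      # 3. loss calculation
        optimizer.zero_grad()
        loss.backward()                # 4. backpropagation
        optimizer.step()               #    ...and parameter update
        total_loss += loss.item()
    return total_loss / len(dataloader)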

📈 Evaluation

Metrics Tracked

  • Training/Validation Loss
  • Training/Validation Accuracy
  • Loss curves visualization
  • Model performance comparison

Visualization Tools

  • Loss and accuracy curves
  • Sample predictions on test images
  • Attention visualization (where applicable)
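
A minimal plotting sketch, assuming engine.train returns a dictionary with train_loss, test_loss, train_acc and test_acc lists (the format used by the pytorch-going-modular helpers):

import matplotlib.pyplot as plt

def plot_loss_curves(results):
    """Plots training and test loss/accuracy curves from a results dict."""
    epochs = range(len(results["train_loss"]))

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(epochs, results["train_loss"], label="train_loss")
    plt.plot(epochs, results["test_loss"], label="test_loss")
    plt.title("Loss")
    plt.xlabel("Epochs")
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(epochs, results["train_acc"], label="train_acc")
    plt.plot(epochs, results["test_acc"], label="test_acc")
    plt.title("Accuracy")
    plt.xlabel("Epochs")
    plt.legend()
    plt.show()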

🎯 Transfer Learning

The project demonstrates the power of transfer learning by:

  • Using pretrained ViT models from torchvision
  • Fine-tuning on the FoodVision Mini dataset
  • Comparing custom implementation vs. pretrained performance

Benefits of Transfer Learning:

  • Faster convergence
  • Better performance with limited data
  • Reduced computational requirements
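
The typical fine-tuning pattern looks like this (a sketch; pretrained_vit is the torchvision model loaded in Model Variants, and 768 is the ViT-Base embedding dimension):

from torch import nn

# Freeze the pretrained backbone so only the new head is trained
for param in pretrained_vit.parameters():
    param.requires_grad = False

# Swap in a classification head for the 3 FoodVision Mini classes
pretrained_vit.heads = nn.Linear(in_features=768,
                                 out_features=len(class_names))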

📊 Results

The notebook provides comprehensive results including:

  • Model accuracy comparisons
  • Training time analysis
  • Loss convergence patterns
  • Predictions on custom images (including the famous "pizza-dad" image)

💻 Usage

Running the Notebook

  1. Open in Google Colab: Click the "Open in Colab" badge
  2. Local Setup:
    git clone https://github.com/NANDAGOPALNG/Vision_Transformer_Paper_Replication
    cd Vision_Transformer_Paper_Replication
    jupyter notebook Vision_Transformers_paper_replicating.ipynb

Key Code Sections

# 0. Imports (data_setup and engine are the helper modules downloaded
# from the pytorch-going-modular repository; the import path may vary)
import torch
from going_modular.going_modular import data_setup, engine

# 1. Environment Setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# 2. Data Preparation  
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms,
    batch_size=BATCH_SIZE
)

# 3. Model Creation
vit_model = VisionTransformer(
    img_size=224,
    patch_size=16,
    num_classes=len(class_names),
    # ... other parameters
)

# 4. Training
results = engine.train(
    model=vit_model,
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    optimizer=optimizer,
    loss_fn=loss_fn,
    epochs=epochs,
    device=device
)

🧠 Key Concepts

Vision Transformer Innovation

  • Patch-based Processing: Treats image patches as "words" in a sequence (see the arithmetic check after this list)
  • Self-Attention: Captures global dependencies across the entire image
  • No Convolutions: Pure transformer architecture for computer vision
  • Scalability: Performance improves with larger datasets and models
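
The patch arithmetic is easy to verify for ViT-Base on 224×224 inputs:

# A 224x224 image split into 16x16 patches
patches_per_side = 224 // 16           # 14
num_patches = patches_per_side ** 2    # 196 "words" in the sequence
patch_dim = 16 * 16 * 3                # 768 values per flattened RGB patch
print(num_patches, patch_dim)          # 196 768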

Implementation Highlights

  • Modular Design: Each component built as separate, reusable classes
  • Paper Fidelity: Close adherence to original paper specifications
  • Educational Focus: Step-by-step breakdown for learning purposes
  • Practical Application: Real-world dataset and evaluation

🎓 Learning Outcomes

By working through this implementation, you'll understand:

  • How to read and replicate machine learning research papers
  • Vision Transformer architecture and its components
  • PyTorch implementation of complex neural networks
  • Transfer learning techniques and benefits
  • Computer vision pipeline: data → model → training → evaluation

🔗 References

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. arXiv:2010.11929.

📝 Notes

Important: This implementation focuses on educational understanding rather than production optimization. The goal is to demonstrate how mathematical concepts from research papers translate into working code.

🤝 Contributing

Feel free to contribute by:

  • Improving code documentation
  • Adding new features or optimizations
  • Fixing bugs or issues
  • Enhancing visualization tools

📄 License

This project is for educational purposes. Please refer to the original paper and PyTorch licensing for commercial use.


Author: NANDAGOPALNG
Project Type: Educational Implementation
Framework: PyTorch
Domain: Computer Vision, Deep Learning, Transformer Architecture
