Vision Transformer Paper Replication

A complete PyTorch implementation and replication of the groundbreaking Vision Transformer (ViT) paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" applied to the FoodVision Mini dataset.

🎯 Project Overview

This project demonstrates how to replicate a state-of-the-art machine learning research paper from scratch, building a Vision Transformer architecture using PyTorch and applying it to image classification tasks. The implementation focuses on understanding the core concepts by building each component step-by-step.

📋 Table of Contents

  1. Getting Setup
  2. Data Preparation
  3. Model Architecture
  4. Training
  5. Evaluation
  6. Transfer Learning
  7. Results
  8. Requirements
  9. Usage
  10. Key Concepts

🚀 Getting Setup

Prerequisites

  • Python 3.8+
  • PyTorch 1.12+ (with CUDA support recommended)
  • torchvision 0.13+
  • GPU with CUDA support (recommended)

Dependencies

torch>=1.12.0
torchvision>=0.13.0
matplotlib
torchinfo

The notebook automatically handles dependency installation and downloads required helper modules from the pytorch-going-modular repository.
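
For a local setup, the same dependencies can be installed up front (one possible invocation):

pip install "torch>=1.12.0" "torchvision>=0.13.0" matplotlib torchinfo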

📊 Data Preparation

The project uses the FoodVision Mini dataset containing three food categories:

  • 🍕 Pizza
  • 🥩 Steak
  • 🍣 Sushi

Dataset Statistics:

  • Image size: 224×224 pixels (as per ViT paper specifications)
  • Batch size: 32
  • Data split: Train/Test
  • Transforms: Resize and normalization

The dataset is automatically downloaded and prepared using the provided data setup utilities.
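
A minimal sketch of the manual transforms implied by the statistics above (the variable name manual_transforms matches the Usage section; ToTensor also scales pixel values to [0, 1]):

from torchvision import transforms

IMG_SIZE = 224  # as per the ViT paper

# Resize images to 224x224 and convert to float tensors in [0, 1]
manual_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
])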

🏗️ Model Architecture

Vision Transformer Components

The ViT architecture is described by four main equations in the paper, implemented here as modular components:

1. Patch Embedding (Equation 1)

  • Converts input images into sequences of learnable patches
  • Patch size: 16×16 pixels
  • Adds positional embeddings and class token
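
A minimal sketch of this layer, assuming ViT-Base hyperparameters (16×16 patches, 768-dim embeddings); the Conv2d trick below is equivalent to splitting the image into non-overlapping patches and linearly projecting each one:

import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Turns a 2D image into a 1D sequence of learnable patch embeddings."""
    def __init__(self, in_channels=3, patch_size=16, embedding_dim=768):
        super().__init__()
        # kernel_size = stride = patch_size extracts and projects
        # each non-overlapping 16x16 patch in a single conv pass
        self.patcher = nn.Conv2d(in_channels, embedding_dim,
                                 kernel_size=patch_size, stride=patch_size)
        self.flatten = nn.Flatten(start_dim=2, end_dim=3)

    def forward(self, x):
        x = self.patcher(x)        # (B, 768, 14, 14) for a 224x224 input
        x = self.flatten(x)        # (B, 768, 196)
        return x.permute(0, 2, 1)  # (B, 196, 768) -> sequence of patches
        # (the class token and position embeddings are added to this
        #  sequence afterwards)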

2. Multi-Head Self-Attention (MSA) - Equation 2

  • Core attention mechanism of the Transformer
  • Enables the model to focus on relevant image patches
  • Multiple attention heads for diverse representation learning
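
A sketch of the MSA block using PyTorch's built-in nn.MultiheadAttention, with the pre-norm layout the paper describes (hyperparameters assume ViT-Base: 12 heads, 768-dim embeddings):

from torch import nn

class MultiheadSelfAttentionBlock(nn.Module):
    """LayerNorm followed by multi-head self-attention (Equation 2)."""
    def __init__(self, embedding_dim=768, num_heads=12, attn_dropout=0.0):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.multihead_attn = nn.MultiheadAttention(embed_dim=embedding_dim,
                                                    num_heads=num_heads,
                                                    dropout=attn_dropout,
                                                    batch_first=True)

    def forward(self, x):
        x = self.layer_norm(x)
        # Query, key and value all come from the same sequence (self-attention)
        attn_output, _ = self.multihead_attn(query=x, key=x, value=x,
                                             need_weights=False)
        return attn_output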

3. Multilayer Perceptron (MLP) - Equation 3

  • Feed-forward network component
  • Used in Transformer Encoder blocks and output layer
  • Provides non-linear transformations
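
A matching sketch of the MLP block (ViT-Base expands to 3072 hidden units with GELU activation):

from torch import nn

class MLPBlock(nn.Module):
    """LayerNorm -> Linear -> GELU -> Dropout -> Linear -> Dropout (Equation 3)."""
    def __init__(self, embedding_dim=768, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, mlp_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_size, embedding_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.mlp(self.layer_norm(x))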

4. Transformer Encoder (Equations 2 & 3, stacked)

  • Alternating layers of MSA and MLP
  • Residual connections and layer normalization
  • Multiple encoder blocks stacked together
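
Putting the two sub-blocks together with residual connections, reusing the MSA and MLP sketches above (ViT-Base stacks 12 of these blocks):

from torch import nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block: MSA and MLP sub-blocks, each with a residual connection."""
    def __init__(self, embedding_dim=768, num_heads=12, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.msa_block = MultiheadSelfAttentionBlock(embedding_dim, num_heads)
        self.mlp_block = MLPBlock(embedding_dim, mlp_size, dropout)

    def forward(self, x):
        x = self.msa_block(x) + x  # residual connection around attention
        x = self.mlp_block(x) + x  # residual connection around the MLP
        return x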

Model Variants

The implementation includes:

  • Custom ViT: Built from scratch following the paper
  • Pretrained ViT: Using torchvision's pretrained models for transfer learning
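
Loading the pretrained variant takes only a couple of lines with the torchvision >= 0.13 weights API:

import torchvision

# ViT-Base/16 with pretrained ImageNet weights
pretrained_weights = torchvision.models.ViT_B_16_Weights.DEFAULT
pretrained_vit = torchvision.models.vit_b_16(weights=pretrained_weights)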

🔧 Training

Training Configuration

  • Optimizer: Adam (as commonly used with Transformers)
  • Loss Function: Cross-entropy loss
  • Device: CUDA GPU (if available), CPU fallback
  • Epochs: Configurable (typically 10-50 epochs)
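
A representative setup (the exact hyperparameter values are illustrative; vit_model refers to the custom ViT created in the Usage section below):

import torch

# Adam as used with Transformers; lr and weight decay are illustrative
optimizer = torch.optim.Adam(params=vit_model.parameters(),
                             lr=3e-3, betas=(0.9, 0.999), weight_decay=0.3)
loss_fn = torch.nn.CrossEntropyLoss()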

Training Process

  1. Data Loading: Efficient DataLoaders with proper transforms
  2. Forward Pass: Through custom ViT architecture
  3. Loss Calculation: Cross-entropy for classification
  4. Backpropagation: Gradient computation and parameter updates
  5. Validation: Regular evaluation on test set
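
A minimal sketch of one training epoch, matching the numbered steps above (the engine.train helper in the Usage section encapsulates a loop like this):

def train_one_epoch(model, dataloader, loss_fn, optimizer, device):
    model.train()
    total_loss = 0
    for X, y in dataloader:            # 1. data loading
        X, y = X.to(device), y.to(device)
        y_pred = model(X)              # 2. forward pass
        loss = loss_fn(y_pred, y)      # 3. loss calculation
        optimizer.zero_grad()
        loss.backward()                # 4. backpropagation
        optimizer.step()               #    ...and parameter update
        total_loss += loss.item()
    return total_loss / len(dataloader)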

📈 Evaluation

Metrics Tracked

  • Training/Validation Loss
  • Training/Validation Accuracy
  • Loss curves visualization
  • Model performance comparison

Visualization Tools

  • Loss and accuracy curves
  • Sample predictions on test images
  • Attention visualization (where applicable)
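
A minimal plotting sketch, assuming engine.train returns a dictionary with train_loss, test_loss, train_acc and test_acc lists (the format used by the pytorch-going-modular helpers):

import matplotlib.pyplot as plt

def plot_loss_curves(results):
    """Plots training and test loss/accuracy curves from a results dict."""
    epochs = range(len(results["train_loss"]))

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(epochs, results["train_loss"], label="train_loss")
    plt.plot(epochs, results["test_loss"], label="test_loss")
    plt.title("Loss")
    plt.xlabel("Epochs")
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(epochs, results["train_acc"], label="train_acc")
    plt.plot(epochs, results["test_acc"], label="test_acc")
    plt.title("Accuracy")
    plt.xlabel("Epochs")
    plt.legend()
    plt.show()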

🎯 Transfer Learning

The project demonstrates the power of transfer learning by:

  • Using pretrained ViT models from torchvision
  • Fine-tuning on the FoodVision Mini dataset
  • Comparing custom implementation vs. pretrained performance

Benefits of Transfer Learning:

  • Faster convergence
  • Better performance with limited data
  • Reduced computational requirements
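
The typical fine-tuning pattern looks like this (a sketch; pretrained_vit is the torchvision model loaded in Model Variants, and 768 is the ViT-Base embedding dimension):

from torch import nn

# Freeze the pretrained backbone so only the new head is trained
for param in pretrained_vit.parameters():
    param.requires_grad = False

# Swap in a classification head for the 3 FoodVision Mini classes
pretrained_vit.heads = nn.Linear(in_features=768,
                                 out_features=len(class_names))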

📊 Results

The notebook provides comprehensive results including:

  • Model accuracy comparisons
  • Training time analysis
  • Loss convergence patterns
  • Predictions on custom images (including the famous "pizza-dad" image)

💻 Usage

Running the Notebook

  1. Open in Google Colab: Click the "Open in Colab" badge
  2. Local Setup:
    git clone https://github.com/NANDAGOPALNG/Vision_Transformer_Paper_Replication
    cd Vision_Transformer_Paper_Replication
    jupyter notebook Vision_Transformers_paper_replicating.ipynb

Key Code Sections

# 0. Imports (data_setup and engine are the helper modules downloaded
# from the pytorch-going-modular repository; the import path may vary)
import torch
from going_modular.going_modular import data_setup, engine

# 1. Environment Setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# 2. Data Preparation  
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms,
    batch_size=BATCH_SIZE
)

# 3. Model Creation
vit_model = VisionTransformer(
    img_size=224,
    patch_size=16,
    num_classes=len(class_names),
    # ... other parameters
)

# 4. Training
results = engine.train(
    model=vit_model,
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    optimizer=optimizer,
    loss_fn=loss_fn,
    epochs=epochs,
    device=device
)

🧠 Key Concepts

Vision Transformer Innovation

  • Patch-based Processing: Treats image patches as "words" in a sequence (see the arithmetic check after this list)
  • Self-Attention: Captures global dependencies across the entire image
  • No Convolutions: Pure transformer architecture for computer vision
  • Scalability: Performance improves with larger datasets and models
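
The patch arithmetic is easy to verify for ViT-Base on 224×224 inputs:

# A 224x224 image split into 16x16 patches
patches_per_side = 224 // 16           # 14
num_patches = patches_per_side ** 2    # 196 "words" in the sequence
patch_dim = 16 * 16 * 3                # 768 values per flattened RGB patch
print(num_patches, patch_dim)          # 196 768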

Implementation Highlights

  • Modular Design: Each component built as separate, reusable classes
  • Paper Fidelity: Close adherence to original paper specifications
  • Educational Focus: Step-by-step breakdown for learning purposes
  • Practical Application: Real-world dataset and evaluation

🎓 Learning Outcomes

By working through this implementation, you'll understand:

  • How to read and replicate machine learning research papers
  • Vision Transformer architecture and its components
  • PyTorch implementation of complex neural networks
  • Transfer learning techniques and benefits
  • Computer vision pipeline: data → model → training → evaluation

🔗 References

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. arXiv:2010.11929.

📝 Notes

Important: This implementation focuses on educational understanding rather than production optimization. The goal is to demonstrate how mathematical concepts from research papers translate into working code.

🤝 Contributing

Feel free to contribute by:

  • Improving code documentation
  • Adding new features or optimizations
  • Fixing bugs or issues
  • Enhancing visualization tools

📄 License

This project is for educational purposes. Please refer to the original paper and PyTorch licensing for commercial use.


Author: NANDAGOPALNG
Project Type: Educational Implementation
Framework: PyTorch
Domain: Computer Vision, Deep Learning, Transformer Architecture
