# 📄 Document Image Understanding & Analysis

Fine-tuning transformer models (BERT, RoBERTa, LayoutLM, GPT-2) for document understanding and token classification on structured document datasets.



## 🎯 Objective

This project explores Document Image Understanding — the task of classifying tokens in scanned documents into semantic categories:

| Label | Description |
|----------|--------------------------|
| Answer | Answer fields in forms |
| Question | Question fields in forms |
| Header | Document headers |
| Other | Other content |
| PAD | Padding tokens |

The goal: benchmark multiple transformer architectures on this structured document classification task.
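Token classification treats each label as an integer class id. A minimal, framework-free sketch of the label mapping (label names from the table above; the id order and helper name are illustrative assumptions):

```python
# Semantic labels from the table above; the id order is an assumption.
LABELS = ["Answer", "Question", "Header", "Other", "PAD"]

label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

def encode_tags(tags):
    """Convert a sequence of string tags to integer class ids."""
    return [label2id[t] for t in tags]

print(encode_tags(["Question", "Answer", "Other"]))  # → [1, 0, 3]
```

HuggingFace token-classification models accept such `label2id`/`id2label` dicts in their config, so the same mapping can be reused across all four architectures.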


## 🛠️ Models & Stack

| Component | Technology |
|---------------|--------------------------------|
| Base Models | BERT, RoBERTa, LayoutLM, GPT-2 |
| Framework | HuggingFace Transformers |
| Deep Learning | PyTorch |
| Environment | Google Colab |

## 🗂️ Project Structure

```text
Document-Image-Understanding-and-Analysis/
│
├── 📓 LayoutLM.ipynb         ← LayoutLM fine-tuning (layout-aware)
├── 🐍 Bert.py                ← BERT experiments
├── 🐍 Roberta.py             ← RoBERTa experiments
├── 🐍 Layout-LM.py           ← LayoutLM script
├── 📄 GPT-2                  ← GPT-2 experiment
└── 📖 README.md
```

## 📊 Experimental Results

### BERT — Token Classification

7 experiments varying epochs and learning rate

| Experiment | Epochs | LR | Accuracy | Best F1 (Answer) |
|---------------------|--------|------|----------|------------------|
| Exp 1 (random init) | n/a | n/a | 0.4106 | 0.5110 |
| Exp 2 | 3 | 3e-5 | 0.4189 | 0.5302 |
| Exp 3 | 5 | 3e-5 | 0.4369 | 0.5219 |
| Exp 4 | 5 | 2e-5 | 0.4042 | 0.5041 |
| Exp 5 | 5 | 2e-5 | 0.4042 | 0.5041 |
| Exp 6 | 5 | 2e-5 | 0.4042 | 0.5041 |
| Exp 7 | 5 | 2e-5 | 0.4186 | 0.5122 |

Key insight: BERT struggles with Header classification (F1 ≈ 0) across all experiments — suggesting the model lacks layout awareness to distinguish headers from body text.
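The "F1 ≈ 0" pattern has a mechanical explanation: a model that never predicts a label scores zero F1 on it regardless of accuracy elsewhere. A small, self-contained sketch of per-label F1 (toy sequences, hypothetical predictions):

```python
def per_label_f1(gold, pred, label):
    """F1 for one label over aligned gold/predicted tag sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: the model never emits "Header", mirroring the observed failure.
gold = ["Header", "Question", "Answer", "Answer", "Other"]
pred = ["Other",  "Question", "Answer", "Other",  "Other"]

print(per_label_f1(gold, pred, "Header"))                 # → 0.0
print(round(per_label_f1(gold, pred, "Answer"), 2))       # → 0.67
```

In practice a library such as seqeval or sklearn's `classification_report` would compute this, but the zero is the same: no Header predictions means zero true positives.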


### RoBERTa — Token Classification

4 experiments varying epochs (3 → 20)

| Experiment | Epochs | Best Accuracy | Best F1 |
|------------|--------|---------------|---------|
| Exp 1 | 3 | 0.7894 | 0.2020 |
| Exp 2 | 5 | 0.7993 | 0.2348 |
| Exp 3 | 7 | 0.7970 | 0.2272 |
| Exp 3b | 10 | 0.7997 | 0.2365 |
| Exp 4 | 20 | 0.7894 | 0.3136 |

Key insight: RoBERTa reaches much higher accuracy than BERT, but F1 stays low. Accuracy plateaus around 0.80 after roughly 5 epochs; only the 20-epoch run pushes F1 above 0.3, suggesting the model quickly saturates on frequent labels while rarer ones improve slowly.
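When the validation metric plateaus like this, a patience-based early-stopping rule avoids wasted epochs (HuggingFace ships this as `EarlyStoppingCallback`; here is a framework-free sketch with illustrative numbers):

```python
def should_stop(metric_history, patience=2, min_delta=1e-4):
    """Return True once the metric fails to improve by min_delta
    for `patience` consecutive evaluations."""
    best, since_best = float("-inf"), 0
    for value in metric_history:
        if value > best + min_delta:
            best, since_best = value, 0
        else:
            since_best += 1
        if since_best >= patience:
            return True
    return False

# Plateauing accuracy, loosely shaped like the RoBERTa runs (numbers illustrative):
print(should_stop([0.72, 0.78, 0.80, 0.80, 0.80]))  # → True
print(should_stop([0.72, 0.78, 0.80]))              # → False
```

Monitoring F1 rather than accuracy would be the better choice here, since the table shows F1 still moving after accuracy has flattened.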


## 💡 Key Takeaways

  • BERT with random parameters already learns Answer tokens reasonably well (F1 ~0.51) but completely fails on Header — a structural label that requires layout context
  • RoBERTa shows better raw accuracy but similar F1 ceiling — text-only models have inherent limits on document understanding tasks
  • LayoutLM (layout-aware) is the natural next step — it incorporates bounding box coordinates alongside text, making it purpose-built for this task
  • Optimal learning rate appears to be around 2e-5 to 3e-5 across both architectures
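LayoutLM expects each token's bounding box normalized onto a 0–1000 coordinate grid alongside the text. A minimal sketch of that preprocessing step (function name and page size are illustrative):

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) box to LayoutLM's 0-1000 coordinate grid."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# A token box on a 612x792-point page (US Letter):
print(normalize_bbox((50, 100, 150, 120), 612, 792))  # → [81, 126, 245, 151]
```

These four integers are fed to the model as the `bbox` input next to the token ids, which is exactly the layout signal the text-only BERT/RoBERTa runs lacked.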

## 🔬 Context

This project was conducted as part of my graduate coursework in NLP and representation learning — benchmarking transformer architectures before the emergence of layout-aware models as the standard for document AI.


## 👤 Author

Sami Bahig — Data Scientist & AI Engineer



MIT License · Sami Bahig · 2023
