Machine Learning Algorithms from Scratch

Overview

This repository contains implementations of core machine learning algorithms built entirely from first principles using Python and NumPy.

The primary objective of this project is to develop a deep understanding of optimization, gradient-based learning, and distance-based methods by manually implementing model training and evaluation pipelines without relying on high-level machine learning libraries such as scikit-learn.

The emphasis is on mathematical clarity, algorithmic correctness, and clean implementation.

Motivation

Modern machine learning frameworks abstract away much of the underlying mathematics. While this improves productivity, it can limit conceptual depth.

This project focuses on:

Understanding gradient-based optimization
Implementing loss functions explicitly
Studying convergence behavior
Strengthening intuition behind classification algorithms
Preparing for technical interviews requiring algorithm-level clarity

Implemented Algorithms

Logistic Regression

Binary classification
Sigmoid activation function
Cross-entropy loss
Manual gradient computation
Batch gradient descent optimization
Custom train-test split implementation
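A custom train-test split can be sketched in a few lines of NumPy. The function name, the `test_size` and `seed` parameters, and the toy arrays below are illustrative assumptions, not the notebook's actual code.

```python
import numpy as np

def train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices once, then hold out a `test_size` fraction as the test set."""
    X = np.asarray(X)
    y = np.asarray(y)
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    rng = np.random.default_rng(seed)          # seeded for reproducibility
    indices = rng.permutation(len(X))          # shuffled row indices
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (hypothetical): row i of X is [2i, 2i + 1], label is i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, seed=42)
```

Shuffling indices rather than the arrays themselves keeps each feature row aligned with its label.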

Gradient Descent (Standalone)

Parameter initialization
Iterative update rule
Learning rate experimentation
Convergence tracking
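The four steps above can be sketched as follows. This is a minimal illustration that assumes a mean-squared-error objective on a synthetic noise-free line; the objective, the stopping tolerance, and all names are assumptions rather than the notebook's contents.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, max_iters=5000, tol=1e-10):
    """Fit y ~ X @ w + b by minimizing mean squared error with batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)        # parameter initialization
    b = 0.0
    history = []           # loss per iteration, for convergence tracking
    for _ in range(max_iters):
        error = X @ w + b - y
        loss = np.mean(error ** 2)
        history.append(loss)
        # Stop once the loss changes by less than `tol` between iterations
        if len(history) > 1 and abs(history[-2] - loss) < tol:
            break
        grad_w = 2.0 * X.T @ error / n
        grad_b = 2.0 * np.mean(error)
        w -= lr * grad_w   # iterative update rule
        b -= lr * grad_b
    return w, b, history

# Synthetic data (assumed): y = 2x + 1 with no noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0
w, b, history = gradient_descent(X, y)

# Learning rate experimentation: compare how quickly each rate converges
for lr in (0.01, 0.1, 0.5):
    _, _, h = gradient_descent(X, y, lr=lr)
    print(f"lr={lr}: stopped after {len(h)} iterations, final loss {h[-1]:.2e}")
```

Recording the loss at every iteration makes it easy to plot convergence curves for different learning rates.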

K-Nearest Neighbors (KNN)

Euclidean distance computation
Majority voting
Manual prediction pipeline
Accuracy evaluation without external ML libraries
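A minimal sketch of this KNN pipeline is shown below. The function names, the choice of `k`, the tie-breaking rule, and the toy data are assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query row by majority vote among its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=2))
    # Indices of the k closest training points for each query
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = []
    for neighbor_labels in y_train[nearest]:
        labels, counts = np.unique(neighbor_labels, return_counts=True)
        preds.append(labels[np.argmax(counts)])   # ties go to the smallest label
    return np.array(preds)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(y_true == y_pred))

# Toy data (hypothetical): two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [5.0, 5.1]])
preds = knn_predict(X_train, y_train, X_query, k=3)
```

The broadcasted distance matrix uses O(n_query · n_train · d) memory, which is fine for small datasets but may need batching for larger ones.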

Gaussian Naive Bayes

Probabilistic classification model
Assumes feature independence
Gaussian likelihood estimation
Log-probability computation for numerical stability
Manual prior and likelihood calculation
Performance compared against scikit-learn's GaussianNB
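The reason for working in log space can be seen with a small experiment. The likelihood values below are made up purely to trigger underflow; they do not come from the project's dataset.

```python
import numpy as np

# 500 hypothetical per-feature likelihoods, each equal to 1e-3
likelihoods = np.full(500, 1e-3)

# The direct product is 1e-1500, far below the smallest float64, so it underflows to 0.0
direct_product = np.prod(likelihoods)

# Summing logs keeps the same information as a finite number
log_sum = np.sum(np.log(likelihoods))

print(direct_product, log_sum)
```

Once the product collapses to zero for every class, the classes can no longer be ranked, whereas the log sums remain comparable.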

Mathematical Formulation

Logistic Regression hypothesis:

h(x) = σ(wᵀx + b), where σ(z) = 1 / (1 + e^(-z))

Loss function (Binary Cross-Entropy):

L = -[y log(h(x)) + (1 - y) log(1 - h(x))]

Parameters are optimized using gradient descent:

w := w - α ∂L/∂w
b := b - α ∂L/∂b

where α is the learning rate.
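For the cross-entropy loss averaged over n samples, the gradients reduce to ∂L/∂w = (1/n) Xᵀ(h - y) and ∂L/∂b = mean(h - y). The sketch below applies these in a batch gradient descent loop; the function names, hyperparameters, clipping constants, and synthetic data are assumptions, not the notebook's exact code.

```python
import numpy as np

def sigmoid(z):
    # Clip z so that exp() cannot overflow for large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def bce_loss(y, h, eps=1e-12):
    # Clip predictions away from 0 and 1 so log() stays finite
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    losses = []
    for _ in range(n_iters):
        h = sigmoid(X @ w + b)          # forward pass
        losses.append(bce_loss(y, h))
        grad_w = X.T @ (h - y) / n      # dL/dw
        grad_b = np.mean(h - y)         # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

def predict(X, w, b, threshold=0.5):
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Synthetic data (assumed): two Gaussian clusters, one per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])
w, b, losses = fit_logistic(X, y)
train_accuracy = np.mean(predict(X, w, b) == y)
```

With zero-initialized parameters every prediction starts at 0.5, so the first recorded loss equals log 2, which is a quick sanity check on the loss implementation.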

Naive Bayes (Gaussian)

Posterior probability:

P(C | X) ∝ P(X | C) P(C)

Gaussian likelihood:

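P(x_i | C) = (1 / √(2πσ²)) * exp( - (x_i - μ)² / (2σ²) )

where μ and σ² are the mean and variance of feature i estimated from the training samples of class C.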

Log posterior used for numerical stability:

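log P(C | X) = log P(C) + Σ_i log P(x_i | C) + const

The constant is -log P(X), which is the same for every class and can be dropped when comparing classes.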
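The formulas above translate into a short classifier. This is a sketch under stated assumptions: the class name `ScratchGaussianNB`, the variance floor `var_eps`, and the synthetic data are all hypothetical and not taken from the notebook.

```python
import numpy as np

class ScratchGaussianNB:
    def fit(self, X, y, var_eps=1e-9):
        self.classes_ = np.unique(y)
        # Log prior: log of each class's share of the training samples
        self.log_prior_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        # Per-class, per-feature mean and variance, each of shape (n_classes, n_features)
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + var_eps
        return self

    def log_posterior(self, X):
        """Unnormalized log posterior, shape (n_samples, n_classes)."""
        diff = X[:, None, :] - self.mean_[None, :, :]
        log_lik = -0.5 * (np.log(2.0 * np.pi * self.var_)[None] + diff ** 2 / self.var_[None])
        return self.log_prior_[None, :] + log_lik.sum(axis=2)

    def predict(self, X):
        return self.classes_[np.argmax(self.log_posterior(X), axis=1)]

# Synthetic data (assumed): two Gaussian clusters, one per class
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = ScratchGaussianNB().fit(X, y)
train_accuracy = np.mean(model.predict(X) == y)
```

The small `var_eps` term guards against division by zero when a feature is constant within a class.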

Computational Complexity

Logistic Regression:

Training: O(n · d · iterations)
Inference: O(d) per sample

K-Nearest Neighbors:

Training: O(1) (the training set is only stored)
Inference: O(n · d) per query

Where:
n = number of training samples
d = number of features

Technical Approach

Each algorithm follows a structured implementation pipeline:

Data loading using Pandas
Conversion to NumPy arrays for numerical computation
Parameter initialization
Forward propagation
Loss computation
Gradient calculation
Parameter updates
Prediction and evaluation

All mathematical steps are implemented explicitly to ensure transparency and full control over the training process.
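The first two steps of the pipeline, loading with Pandas and converting to NumPy, might look like the sketch below. The column names and the inline CSV are hypothetical stand-ins, since the layout of the repository's datasets is not described here.

```python
import io
import numpy as np
import pandas as pd

# Inline CSV used in place of a real dataset file (hypothetical columns)
csv_text = """feature_1,feature_2,label
0.5,1.2,0
1.5,0.3,1
2.5,2.2,1
"""

df = pd.read_csv(io.StringIO(csv_text))

# Separate the feature matrix from the label column and convert to NumPy arrays
X = df.drop(columns=["label"]).to_numpy(dtype=float)
y = df["label"].to_numpy(dtype=int)

print(X.shape, y.shape)
```

From here on, all computation can work on `X` and `y` directly without further dependence on Pandas.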

Project Structure

ML-Algorithm-From-Scratch/
│
├── Logistic_Regression.ipynb
├── Gradient_Descent.ipynb
├── Knn.ipynb
├── Naive_Bayes.ipynb
│
├── big_logistic_regression_dataset.csv
├── gradient_descent_large_dataset.csv
├── knn_200_dataset.csv
├── knn_large_dataset.csv
├── naive_bayes_100k_dataset.csv
│
├── .gitignore
└── README.md

Design Principles

No high-level ML frameworks used for training
Clear separation between training and evaluation logic
Emphasis on mathematical transparency
Readable and structured code

Future Improvements

Linear Regression with L1/L2 regularization
Decision Tree (CART) implementation
Performance benchmarking against scikit-learn
Refactoring notebooks into modular Python scripts
Adding unit tests and benchmarking support

Author

Atul Kumar
B.Tech – Artificial Intelligence & Data Science
IIITDM Kurnool
