This repository contains implementations of core machine learning algorithms built entirely from first principles using Python and NumPy.
The goal is to build a deep understanding of optimization, gradient-based learning, and distance-based methods by implementing the training and evaluation pipelines by hand, without relying on high-level machine learning libraries such as scikit-learn.
The emphasis is on mathematical clarity, algorithmic correctness, and clean implementation.
Modern machine learning frameworks abstract away much of the underlying mathematics. While this improves productivity, it can limit conceptual depth.
This project focuses on:
- Understanding gradient-based optimization
- Implementing loss functions explicitly
- Studying convergence behavior
- Strengthening intuition behind classification algorithms
- Preparing for technical interviews requiring algorithm-level clarity
Logistic Regression:
- Binary classification
- Sigmoid activation function
- Cross-entropy loss
- Manual gradient computation
- Batch gradient descent optimization
- Custom train-test split implementation
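A custom train-test split can be sketched with NumPy alone. This is a minimal illustration, not the notebook's actual code: the function name `train_test_split`, the `test_ratio` and `seed` parameters, and the toy arrays are all assumptions made for the example.

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=0):
    """Shuffle the row indices once, then use the first block as the test set."""
    X = np.asarray(X)
    y = np.asarray(y)
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))          # random order of row indices
    n_test = int(round(len(X) * test_ratio))   # number of held-out rows
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    # Index X and y with the same indices so features and labels stay aligned.
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (hypothetical): row i has features (2i, 2i + 1) and label i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.3, seed=42)
print(X_train.shape, X_test.shape)
```

Shuffling indices rather than the arrays themselves keeps each feature row paired with its label, and fixing the seed makes the split reproducible across runs.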
Gradient Descent:
- Parameter initialization
- Iterative update rule
- Learning rate experimentation
- Convergence tracking
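The sketch below shows how these four pieces can fit together. It assumes a mean-squared-error objective on a linear model, which may differ from what the notebook optimizes; the function name `gradient_descent`, the `tol` stopping rule, and the synthetic data are hypothetical.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000, tol=1e-10):
    """Batch gradient descent on mean squared error for y ~ X @ w + b."""
    n, d = X.shape
    w = np.zeros(d)   # parameter initialization
    b = 0.0
    history = []      # loss per iteration, for convergence tracking
    for _ in range(n_iters):
        residual = X @ w + b - y
        loss = np.mean(residual ** 2)
        history.append(loss)
        # Stop early once the loss has stopped changing meaningfully.
        if len(history) > 1 and abs(history[-2] - loss) < tol:
            break
        grad_w = (2.0 / n) * (X.T @ residual)
        grad_b = 2.0 * np.mean(residual)
        w -= lr * grad_w   # iterative update rule
        b -= lr * grad_b
    return w, b, history

# Synthetic, noiseless data (assumption): y = 3x + 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0

# Learning rate experimentation: same budget, different step sizes.
for lr in (0.001, 0.01, 0.1):
    w, b, history = gradient_descent(X, y, lr=lr, n_iters=500)
    print(f"lr={lr}: {len(history)} iterations, final loss {history[-1]:.6f}")
```

Recording the loss at every iteration makes it easy to plot convergence curves and to see how smaller learning rates need more iterations to reach the same loss.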
K-Nearest Neighbors:
- Euclidean distance computation
- Majority voting
- Manual prediction pipeline
- Accuracy evaluation without external ML libraries
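A minimal sketch of this prediction pipeline, assuming small in-memory arrays. The names `knn_predict` and `accuracy`, the default `k=3`, and the toy points are illustrative and not taken from the notebook.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict a label for each query point by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in X_query:
        # Euclidean distance from x to every training point.
        distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        nearest = np.argsort(distances)[:k]
        # Majority vote; np.argmax breaks ties in favor of the smallest label.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Two well-separated toy clusters (hypothetical data).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5]])
y_test = np.array([0, 1])

y_pred = knn_predict(X_train, y_train, X_test, k=3)
print(y_pred, accuracy(y_test, y_pred))
```

Sorting all distances is the simplest approach; for large training sets, `np.argpartition` would find the k smallest distances without a full sort.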
Gaussian Naive Bayes:
- Probabilistic classification model
- Assumes feature independence
- Gaussian likelihood estimation
- Log-probability computation for numerical stability
- Manual prior and likelihood calculation
- Performance compared against scikit-learn's GaussianNB
Logistic Regression hypothesis:
h(x) = σ(wᵀx + b)
Loss function (Binary Cross-Entropy):
L = -[y log(h(x)) + (1 - y) log(1 - h(x))]
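The hypothesis and loss above translate almost directly into NumPy. The sketch below is one possible way to write them; the function names and the `eps` clipping constant are assumptions, and the loss is averaged over the batch.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))                      # always in (0, 1]
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean of -[y log(p) + (1 - y) log(1 - p)] over all samples."""
    y = np.asarray(y, dtype=float)
    # Clip predictions away from 0 and 1 so log() never sees zero.
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

# Hypothetical parameters and inputs, just to exercise the two functions.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[0.0, 0.0], [2.0, 1.0]])
y = np.array([1, 0])

p = sigmoid(X @ w + b)   # h(x) = sigmoid(w^T x + b) for each row of X
print(p, binary_cross_entropy(y, p))
```

Clipping is a pragmatic guard: without it, a prediction of exactly 0 or 1 would send the loss to infinity and break gradient tracking.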
Parameters are optimized using gradient descent:
w := w - α ∂L/∂w
b := b - α ∂L/∂b
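For the cross-entropy loss with a sigmoid output, the gradients for a single sample simplify to ∂L/∂w = (h(x) - y) x and ∂L/∂b = h(x) - y. The sketch below applies the update rule with these gradients averaged over the whole batch. Function names, the default learning rate, the iteration count, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid."""
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def train_logistic_regression(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the mean binary cross-entropy loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)        # forward pass: h(x) for every sample
        error = p - y                 # per-sample dL/dz
        w -= lr * (X.T @ error) / n   # w := w - alpha * dL/dw
        b -= lr * np.mean(error)      # b := b - alpha * dL/db
    return w, b

def predict(X, w, b, threshold=0.5):
    """Label a sample 1 when its predicted probability reaches the threshold."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Two Gaussian blobs (synthetic data, not one of the repository's CSV files).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = train_logistic_regression(X, y)
print("training accuracy:", np.mean(predict(X, w, b) == y))
```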
Posterior probability:
P(C | X) ∝ P(X | C) P(C)
Gaussian likelihood:
P(x_i | C) = (1 / √(2πσ²)) * exp( - (x_i - μ)² / (2σ²) )
where μ and σ² are the mean and variance of feature i among the training samples of class C.
Log posterior used for numerical stability:
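log P(C | X) = log P(C) + Σ_i log P(x_i | C) + constant
The constant (-log P(X)) is the same for every class, so it can be ignored when picking the most probable class.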
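The formulas above can be sketched in NumPy as follows. This is an illustrative version under stated assumptions: the function names are hypothetical, and the small `eps` added to each variance is a guard against division by zero that is not part of the formulas themselves.

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate class priors plus a per-class, per-feature mean and variance."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances

def log_posterior(X, priors, means, variances):
    """Unnormalized log P(C | X) for every sample and class, shape (n_samples, n_classes)."""
    diff = X[:, None, :] - means[None, :, :]
    # log of the Gaussian density: -0.5 * [log(2 pi sigma^2) + (x - mu)^2 / sigma^2]
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Summing log-likelihoods over features encodes the independence assumption.
    return np.log(priors)[None, :] + log_lik.sum(axis=2)

def predict(X, classes, priors, means, variances):
    """Pick the class with the highest log posterior for each sample."""
    return classes[np.argmax(log_posterior(X, priors, means, variances), axis=1)]

# Two Gaussian blobs (synthetic data, not one of the repository's CSV files).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = fit_gaussian_nb(X, y)
y_pred = predict(X, *model)
print("training accuracy:", np.mean(y_pred == y))
```

Working in log space replaces a product of many small probabilities with a sum, which avoids floating-point underflow when the number of features is large.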
Logistic Regression:
- Training: O(n · d · iterations)
- Inference: O(d) per sample
K-Nearest Neighbors:
- Training: O(1) (the training data is only stored)
- Inference: O(n · d) per query
Where n = number of training samples and d = number of features.
Each algorithm follows a structured implementation pipeline (the forward propagation, loss, gradient, and update steps apply to the gradient-based models):
- Data loading using Pandas
- Conversion to NumPy arrays for numerical computation
- Parameter initialization
- Forward propagation
- Loss computation
- Gradient calculation
- Parameter updates
- Prediction and evaluation
All mathematical steps are implemented explicitly to ensure transparency and full control over the training process.
ML-Algorithm-From-Scratch/
│
├── Logistic_Regression.ipynb
├── Gradient_Descent.ipynb
├── Knn.ipynb
├── NaiveBayes.ipynb
│
├── big_logistic_regression_dataset.csv
├── gradient_descent_large_dataset.csv
├── knn_200_dataset.csv
├── knn_large_dataset.csv
├── naive_bayes_100k_dataset.csv
│
├── .gitignore
└── README.md
Design principles:
- No high-level ML frameworks used for training
- Clear separation between training and evaluation logic
- Emphasis on mathematical transparency
- Readable and structured code
Planned improvements:
- Linear Regression with L1/L2 regularization
- Decision Tree (CART) implementation
- Performance benchmarking against scikit-learn
- Refactoring notebooks into modular Python scripts
- Adding unit tests and benchmarking support
Atul Kumar
B.Tech – Artificial Intelligence & Data Science
IIITDM Kurnool