This project implements multiple machine learning algorithms to predict breast cancer diagnoses based on medical diagnostic data. The project compares the performance of various models, providing insights into which algorithms are most effective for this task. The complete code is available in the model.ipynb file.
- Dataset: Features medical data such as `radius_mean`, `texture_mean`, `perimeter_mean`, and others. The target column is `diagnosis` (malignant or benign).
- Algorithms: Includes Logistic Regression, Decision Tree, Random Forest, SVM, k-NN, and more.
- Evaluation Metrics: Accuracy, precision, recall, F1-score, and ROC-AUC.
- Visualization: Graphs and tables illustrate model comparisons.
- Notebook Implementation: Code is structured in a Jupyter notebook for easy reproducibility.
- Python (>= 3.8)
- Required libraries:
pip install pandas numpy scikit-learn matplotlib seaborn
- Data Preprocessing:
  - Load and clean the dataset
  - Encode categorical variables
  - Normalize features using `StandardScaler`
  - Split data into training and testing sets
- Model Training and Evaluation:
  - Implement multiple ML algorithms
  - Evaluate models using metrics like accuracy and ROC-AUC
- Visualization:
  - Generate comparison graphs for model performance
The dataset includes:
- Features:
  - Mean values: `radius_mean`, `texture_mean`, `perimeter_mean`
  - Standard error: `radius_se`, `texture_se`, `perimeter_se`
  - Worst values: `radius_worst`, `texture_worst`, `perimeter_worst`
- Target Variable:
  - `diagnosis`: Malignant (`M`) or Benign (`B`)
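The naming scheme above (one `_mean`, `_se`, and `_worst` column per measurement) makes it easy to select a feature group by suffix. The following is a minimal sketch on a hypothetical two-row sample; the helper `group_features` and the numeric values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample: column names follow the README, values are made up.
sample = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "radius_mean": [18.0, 13.5],
    "texture_mean": [10.0, 14.0],
    "radius_se": [1.0, 0.3],
    "texture_se": [0.9, 0.8],
    "radius_worst": [25.0, 15.0],
    "texture_worst": [17.0, 19.0],
})


def group_features(columns, suffixes=("mean", "se", "worst")):
    """Map each suffix to the columns whose names end in '_<suffix>'."""
    return {s: [c for c in columns if c.endswith(f"_{s}")] for s in suffixes}


groups = group_features(sample.columns)
for suffix, cols in groups.items():
    print(suffix, cols)
```

Selecting `sample[groups["mean"]]` then yields only the mean-value features, which can be handy when comparing models trained on one feature group at a time.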
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
# Load dataset
data = pd.read_csv('breast_cancer_data.csv')
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})
# Feature-target split
X = data.drop(columns=['diagnosis'])
y = data['diagnosis']
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
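Note that the snippet above fits `StandardScaler` on all rows before splitting, so statistics from the test rows influence the scaling. One common alternative, shown here as a sketch rather than as the notebook's actual code, is to split first and wrap the scaler and model in a scikit-learn pipeline. The example assumes scikit-learn's bundled copy of the Wisconsin diagnostic dataset as a stand-in for `breast_cancer_data.csv`; `max_iter=1000` and `stratify=y` are choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for breast_cancer_data.csv.
# Its labels are encoded 0 = malignant, 1 = benign, the reverse of the
# M -> 1, B -> 0 mapping used in this README.
X, y = load_breast_cancer(return_X_y=True)

# Split first, so the test rows play no part in fitting the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits StandardScaler on the training data only and reuses
# those statistics when transforming the test data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The same pipeline object can be passed to cross-validation utilities, which then refit the scaler inside every fold.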
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Evaluation
y_pred = log_reg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
# Evaluation
y_pred_rf = rf_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
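The two snippets above print accuracy only. The other metrics listed for this project (precision, recall, F1-score, ROC-AUC) can be computed in the same way. The sketch below uses a hypothetical helper, `evaluate`, and scikit-learn's bundled copy of the dataset as a stand-in for the CSV; scores obtained this way are not expected to match the notebook's exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Assumption: stand-in data. The bundled labels are 0 = malignant, so they
# are flipped here to match the README's mapping (1 = malignant).
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)


def evaluate(fitted_model, X_eval, y_eval):
    """Return the five metrics for a fitted binary classifier."""
    y_pred = fitted_model.predict(X_eval)
    # ROC-AUC needs a continuous score: the probability of class 1.
    y_score = fitted_model.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, y_pred),
        "precision": precision_score(y_eval, y_pred),
        "recall": recall_score(y_eval, y_pred),
        "f1": f1_score(y_eval, y_pred),
        "roc_auc": roc_auc_score(y_eval, y_score),
    }


metrics = evaluate(model, X_test, y_test)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For models without `predict_proba` (for example an `SVC` created with default settings), `decision_function` output can be passed to `roc_auc_score` instead.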
Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
---|---|---|---|---|---|
Logistic Regression | 96% | 94% | 95% | 94.5% | 97% |
Random Forest | 98% | 97% | 96% | 96.5% | 99% |
SVM | 95% | 93% | 94% | 93.5% | 96% |
k-NN | 92% | 90% | 91% | 90.5% | 93% |
The Random Forest classifier achieved the highest accuracy (98%) and ROC-AUC (99%) of the four models compared, making it the most effective model on this dataset. Logistic Regression (96% accuracy) and SVM (95% accuracy) also performed well, which suggests they are reasonable choices for this diagnostic task.
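A comparison graph of these results could be drawn as a grouped bar chart. The sketch below copies the values from the table above; the figure size, axis limits, and legend placement are arbitrary choices, and the notebook's own plots may look different.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the comparison table (percent).
results = pd.DataFrame(
    {
        "Accuracy": [96, 98, 95, 92],
        "Precision": [94, 97, 93, 90],
        "Recall": [95, 96, 94, 91],
        "F1-Score": [94.5, 96.5, 93.5, 90.5],
        "ROC-AUC": [97, 99, 96, 93],
    },
    index=["Logistic Regression", "Random Forest", "SVM", "k-NN"],
)

# One group of bars per model, one bar per metric.
ax = results.plot(kind="bar", figsize=(9, 4), rot=0)
ax.set_ylim(85, 100)  # zoom in, since all scores are at or above 90%
ax.set_ylabel("Score (%)")
ax.set_title("Model comparison")
ax.legend(loc="lower left", ncol=5)
plt.tight_layout()

best_model = results["Accuracy"].idxmax()
print("Highest accuracy:", best_model)
```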
- Incorporate deep learning models for enhanced prediction accuracy
- Perform feature selection to reduce dimensionality
- Use cross-validation for more robust performance evaluation
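As an illustration of the cross-validation item, the following sketch scores a Random Forest with stratified 5-fold cross-validation. It is not part of the notebook: it assumes scikit-learn's bundled copy of the dataset as a stand-in for the CSV, and the fold count and `roc_auc` scoring are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: stand-in data with the same 30 features as the UCI dataset.
X, y = load_breast_cancer(return_X_y=True)

# Stratified folds keep the malignant/benign ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(random_state=42)

# The model is refit on four folds and scored on the fifth, five times.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("ROC-AUC per fold:", scores.round(3))
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a steadier estimate than a single 80/20 split, whose result depends on which rows happen to land in the test set.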
- Clone the repository:
git clone https://github.com/udityamerit/Breast-Cancer-Prediction-using-different-ML-models
- Install the dependencies:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook model.ipynb
- The dataset is sourced from the UCI Machine Learning Repository.
- Project inspired by real-world medical applications of machine learning.