Layoffs Data Cleaning Project
Cleaned and standardized a raw layoffs dataset using SQL for analysis and reporting purposes
Tools & Technologies Used:
- MySQL 8.0+
- SQL Queries (CTEs, Window Functions, JOINs)
- STR_TO_DATE(), TRIM(), LIKE
Dataset Source:
- Original: https://www.kaggle.com/datasets/swaptr/layoffs-2022
- Columns: company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions
Step-by-Step Cleaning Process:
Step 1: Remove Duplicates
- Created layoffs_staging as a copy of original table.
- Added row_num using ROW_NUMBER() PARTITION BY company, industry, total_laid_off, percentage_laid_off, date.
- Created layoffs_staging2 with row_num column.
- Deleted all rows where row_num > 1.
Step 2: Standardize Data
- Trimmed whitespace from company names: UPDATE layoffs_staging2 SET company = TRIM(company);
- Standardized crypto-related industries to 'Crypto': UPDATE layoffs_staging2 SET industry = 'Crypto' WHERE industry LIKE 'cry%';
- Fixed country names: Updated "United States." → "United States" using TRIM(TRAILING '.' FROM country).
- Converted date from TEXT ('MM/DD/YYYY') to DATE type: UPDATE layoffs_staging2 SET date = STR_TO_DATE(date, '%m/%d/%Y'); ALTER TABLE layoffs_staging2 MODIFY COLUMN date DATE;
Step 3: Handle Null / Blank Values
- Changed empty strings in industry to NULL: UPDATE layoffs_staging2 SET industry = NULL WHERE industry = '';
- Filled missing industry values using self-JOIN on company: UPDATE layoffs_staging2 lay2 JOIN layoffs_staging2 lay1 ON lay1.company = lay2.company SET lay1.industry = lay2.industry WHERE (lay1.industry IS NULL OR lay1.industry = '') AND lay2.industry IS NOT NULL;
Step 4: Remove Unnecessary Columns
- Kept all columns (none removed); row_num was dropped after deduplication.
Final Table Structure (layoffs_staging2):
- company (TEXT)
- location (TEXT)
- industry (TEXT) — standardized
- total_laid_off (INT)
- percentage_laid_off (TEXT)
- date (DATE)
- stage (TEXT)
- country (TEXT) — cleaned
- funds_raised_millions (INT)
Sample Queries:
Top 5 Industries by Total Laid Off: SELECT industry, SUM(total_laid_off) AS total_laid_off FROM layoffs_staging2 WHERE total_laid_off IS NOT NULL GROUP BY industry ORDER BY total_laid_off DESC LIMIT 5;
Monthly Layoff Trends: SELECT YEAR(date) AS year, MONTH(date) AS month, SUM(total_laid_off) AS monthly_layoffs FROM layoffs_staging2 WHERE date IS NOT NULL GROUP BY year, month ORDER BY year, month;
Highest % Laid Off by Company: SELECT company, percentage_laid_off, total_laid_off, country FROM layoffs_staging2 WHERE percentage_laid_off IS NOT NULL ORDER BY CAST(percentage_laid_off AS DECIMAL(5,2)) DESC LIMIT 10;
Key Insights:
- Crypto had the highest total layoffs (>100k).
- US, India, Germany were top countries.
- Peak layoffs occurred Q1–Q2 2022.
Next Steps:
- Build dashboard in Tableau/Power BI.
- Join with economic data (interest rates, GDP).
- Add Python EDA notebook.
Acknowledgments:
- Kaggle for dataset
- Public reports from TechCrunch, Bloomberg, LinkedIn
License: Educational use only.
Contact: Email: your.email@example.com LinkedIn: linkedin.com/in/yourprofile GitHub: github.com/yourgithubusername
Star ⭐ this repo if you found it helpful!