World Layoffs: SQL Data Cleaning
Goal — Clean and standardize a real-world layoffs dataset to enable reliable SQL-based trend analysis and exploratory data analysis.
Stack — MySQL (Workbench), GitHub.
Data — Public global layoffs dataset (Layoffs.fyi, CSV format).
Output — Cleaned and structured SQL table `layoffs_staging2`, optimized for business intelligence and reporting.
Overview
This project focused on preparing a raw layoffs dataset for deeper business analysis using SQL. It involved building a structured ETL (Extract, Transform, Load) pipeline across three stages:
- Raw Import (unchanged data source)
- Staging Copy (intermediate cleaning and transformation)
- Final Cleaned Table (analysis-ready)
The cleaning process addressed duplicates, inconsistent text formatting, missing values, and incorrect data types. The resulting dataset was optimized for downstream analytics and dashboard use.
Key Highlights
- Duplicates were removed using `ROW_NUMBER()` partitioned over company, location, industry, and date fields.
- Text fields were standardized by trimming whitespace and consolidating variations in industry and country formatting.
- Null values in key fields were resolved using self-joins or dropped when non-informative.
- Text-based date fields were converted to the SQL `DATE` type for accurate filtering and time-series analysis.
- The final schema was aligned for efficiency, consistency, and compatibility with business reporting tools.
Database Structure
Database: `world_layoffs`

| Table | Purpose |
|---|---|
| `layoffs` | Raw data directly imported from CSV |
| `layoffs_staging` | Intermediate working copy |
| `layoffs_staging2` | Final cleaned table used for analysis |
1. Raw Data Import
The initial table `layoffs` was created from a CSV export of the original dataset. This table was preserved as an unmodified baseline.
```sql
CREATE TABLE layoffs (
  company TEXT,
  location TEXT,
  industry TEXT,
  total_laid_off INT,
  percentage_laid_off TEXT,
  `date` TEXT,
  stage TEXT,
  country TEXT,
  funds_raised_millions INT
);
```
2. Staging & Cleaning Pipeline
Step 1: Create Staging Tables
Two copies of the raw data were created so transformations could be performed safely without affecting the original. The final table includes an extra `row_num` column, which is populated in the next step to flag duplicate rows.

```sql
-- Clone structure and data for transformation
CREATE TABLE layoffs_staging LIKE layoffs;
INSERT INTO layoffs_staging SELECT * FROM layoffs;

-- Define the final table structure, with a helper column for duplicate removal
CREATE TABLE layoffs_staging2 (
  company TEXT,
  location TEXT,
  industry TEXT,
  total_laid_off INT,
  percentage_laid_off TEXT,
  `date` TEXT,
  stage TEXT,
  country TEXT,
  funds_raised_millions INT,
  row_num INT
);
```

Step 2: Remove Duplicate Records

Duplicate rows were flagged with `ROW_NUMBER()` while loading the final table, then removed. MySQL does not allow a `DELETE` to target a CTE directly, so the row number is materialized as a column instead.

```sql
-- Load the final table, numbering identical rows within each group
INSERT INTO layoffs_staging2
SELECT *,
  ROW_NUMBER() OVER (
    PARTITION BY company, location, industry, total_laid_off,
                 percentage_laid_off, `date`, stage, country, funds_raised_millions
  ) AS row_num
FROM layoffs_staging;

-- Any row numbered above 1 is an exact duplicate
DELETE FROM layoffs_staging2
WHERE row_num > 1;
```
Step 3: Standardize Text Fields
Inconsistent formatting in company names, industries, and countries was corrected.
```sql
-- Trim whitespace from company names
UPDATE layoffs_staging2 SET company = TRIM(company);

-- Consolidate industry values
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Remove trailing punctuation from countries
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';
```
Step 4: Handle Missing Values
Wherever possible, missing `industry` values were filled using a self-join on company and location. Rows without useful signals were removed.

```sql
-- Fill missing industry values using other rows for the same company
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2
  ON t1.company = t2.company AND t1.location = t2.location
SET t1.industry = t2.industry
WHERE t1.industry IS NULL AND t2.industry IS NOT NULL;

-- Remove rows missing both layoff metrics
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL AND percentage_laid_off IS NULL;
```
Step 5: Convert Dates to SQL DATE Format
Text-formatted dates were converted for compatibility with time-based analysis.
```sql
-- Convert string to DATE format
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Update column type
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
```
Step 6: Final Table Verification
After transformation, the `row_num` helper column was dropped and the final cleaned dataset was reviewed.

```sql
-- Drop the helper column used for duplicate removal
ALTER TABLE layoffs_staging2 DROP COLUMN row_num;

-- Preview the cleaned dataset
SELECT * FROM layoffs_staging2;
```
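A quick sanity check on deduplication is to group on the main identifying columns and look for groups that still contain more than one row. A sketch, assuming the schema above (an empty result suggests the removal worked; rows returned here warrant inspection, since they may also be legitimate repeat events):

```sql
-- Sanity check: groups returning here may still contain duplicates
SELECT company, location, industry, `date`, COUNT(*) AS cnt
FROM layoffs_staging2
GROUP BY company, location, industry, `date`
HAVING cnt > 1;
```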
Use Cases for Analysis
The cleaned dataset supports a range of business intelligence use cases, including:
- Layoffs by country, industry, or company
- Time-series visualizations of layoff spikes or patterns
- Correlations between funding and layoffs
- Stage-based breakdowns (early, growth, IPO, etc.)
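As an illustration, the first two use cases above can be queried directly from the cleaned table, using the column names defined in the schema:

```sql
-- Total layoffs by country, largest first
SELECT country, SUM(total_laid_off) AS total_laid_off
FROM layoffs_staging2
GROUP BY country
ORDER BY total_laid_off DESC;

-- Monthly layoffs for time-series charting
SELECT DATE_FORMAT(`date`, '%Y-%m') AS month,
       SUM(total_laid_off) AS total_laid_off
FROM layoffs_staging2
GROUP BY month
ORDER BY month;
```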
System Setup & Execution
Software:
- MySQL Workbench or equivalent SQL IDE
- GitHub for version control and documentation
Workflow:
- Import the raw CSV to create the `layoffs` table
- Duplicate it to `layoffs_staging` and `layoffs_staging2`
- Run the cleaning scripts step by step
- Analyze data from `layoffs_staging2`
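The CSV import can be done through the Workbench import wizard or scripted with `LOAD DATA`; a sketch, where the file path is hypothetical and `local_infile` is assumed to be enabled on the server:

```sql
-- Assumes local_infile is enabled and the path points at the downloaded CSV
LOAD DATA LOCAL INFILE 'layoffs.csv'
INTO TABLE layoffs
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
```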
Future Enhancements
- Automate the cleaning pipeline using stored procedures or Python scripts
- Connect the final cleaned table to Tableau or Power BI for interactive dashboards
- Integrate API-based data ingestion for real-time updates
- Add version tracking for schema changes and refresh logs
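The stored-procedure idea could be sketched by bundling the repeatable steps into one callable unit; the procedure name is illustrative, and the body shows only the text-standardization steps:

```sql
DELIMITER $$

-- Illustrative sketch: wrap the recurring standardization steps
CREATE PROCEDURE standardize_layoffs()
BEGIN
  UPDATE layoffs_staging2 SET company = TRIM(company);
  UPDATE layoffs_staging2
  SET country = TRIM(TRAILING '.' FROM country)
  WHERE country LIKE 'United States%';
END $$

DELIMITER ;
```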
Updated: August 31, 2025
GitHub Repo: Data-Analyst-Portfolio