World Layoffs: SQL Data Cleaning
Goal — Clean and standardize a real-world layoffs dataset to enable reliable SQL-based trend analysis and exploratory data analysis.
Stack — MySQL (Workbench), GitHub.
Data — Public global layoffs dataset (Layoffs.fyi, CSV format).
Output — Cleaned and structured SQL table `layoffs_staging2`, optimized for business intelligence and reporting.
Overview
This project focused on preparing a raw layoffs dataset for deeper business analysis using SQL. It involved building a structured ETL (Extract, Transform, Load) pipeline across three stages:
- Raw Import (unchanged data source)
- Staging Copy (intermediate cleaning and transformation)
- Final Cleaned Table (analysis-ready)
The cleaning process addressed duplicates, inconsistent text formatting, missing values, and incorrect data types. The resulting dataset was optimized for downstream analytics and dashboard use.
Key Highlights
- Duplicates were removed using `ROW_NUMBER()` partitioned over company, location, industry, and date fields.
- Text fields were standardized by trimming whitespace and consolidating variations in industry and country formatting.
- Null values in key fields were resolved using self-joins or dropped when non-informative.
- Text-based date fields were converted to the SQL `DATE` type for accurate filtering and time-series analysis.
- The final schema was aligned for efficiency, consistency, and compatibility with business reporting tools.
Database Structure
Database: `world_layoffs`

| Table | Purpose |
|---|---|
| `layoffs` | Raw data directly imported from CSV |
| `layoffs_staging` | Intermediate working copy |
| `layoffs_staging2` | Final cleaned table used for analysis |
1. Raw Data Import
The initial table `layoffs` was created from a CSV export of the original dataset. This table was preserved as an unmodified baseline.
```sql
CREATE TABLE layoffs (
  company TEXT,
  location TEXT,
  industry TEXT,
  total_laid_off INT,
  percentage_laid_off TEXT,
  `date` TEXT,
  stage TEXT,
  country TEXT,
  funds_raised_millions INT
);
```
2. Staging & Cleaning Pipeline
Step 1: Create Staging Tables
Two copies of the raw data were created so transformations could be performed safely without affecting the original. The final table includes an extra `row_num` column, which is populated in the next step to flag duplicate rows.

```sql
-- Clone structure and data for transformation
CREATE TABLE layoffs_staging LIKE layoffs;
INSERT INTO layoffs_staging SELECT * FROM layoffs;

-- Define the final table structure, with a helper column for duplicate removal
CREATE TABLE layoffs_staging2 (
  company TEXT,
  location TEXT,
  industry TEXT,
  total_laid_off INT,
  percentage_laid_off TEXT,
  `date` TEXT,
  stage TEXT,
  country TEXT,
  funds_raised_millions INT,
  row_num INT
);
```

Step 2: Remove Duplicate Records

Duplicate rows were flagged with `ROW_NUMBER()` while loading the final table, then removed. MySQL does not allow a `DELETE` to target a CTE directly, so the row number is materialized as a column instead.

```sql
-- Load the final table, numbering identical rows within each group
INSERT INTO layoffs_staging2
SELECT *,
  ROW_NUMBER() OVER (
    PARTITION BY company, location, industry, total_laid_off,
                 percentage_laid_off, `date`, stage, country, funds_raised_millions
  ) AS row_num
FROM layoffs_staging;

-- Any row numbered above 1 is an exact duplicate
DELETE FROM layoffs_staging2
WHERE row_num > 1;
```
Step 3: Standardize Text Fields
Inconsistent formatting in company names, industries, and countries was corrected.
```sql
-- Trim whitespace from company names
UPDATE layoffs_staging2 SET company = TRIM(company);

-- Consolidate industry values
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Remove trailing punctuation from countries
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';
```
Step 4: Handle Missing Values
Wherever possible, missing `industry` values were filled using a self-join on company and location. Rows without useful signals were removed.

```sql
-- Fill missing industry values using other rows for the same company
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2
  ON t1.company = t2.company AND t1.location = t2.location
SET t1.industry = t2.industry
WHERE t1.industry IS NULL AND t2.industry IS NOT NULL;

-- Remove rows missing both layoff metrics
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL AND percentage_laid_off IS NULL;
```
Step 5: Convert Dates to SQL DATE Format
Text-formatted dates were converted for compatibility with time-based analysis.
```sql
-- Convert string to DATE format
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Update column type
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
```
Step 6: Final Table Verification
After transformation, the `row_num` helper column was dropped and the final cleaned dataset was reviewed.

```sql
-- Drop the helper column used for duplicate removal
ALTER TABLE layoffs_staging2 DROP COLUMN row_num;

-- Preview the cleaned dataset
SELECT * FROM layoffs_staging2;
```
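A quick sanity check on deduplication is to group on the main identifying columns and look for groups that still contain more than one row. A sketch, assuming the schema above (an empty result suggests the removal worked; rows returned here warrant inspection, since they may also be legitimate repeat events):

```sql
-- Sanity check: groups returning here may still contain duplicates
SELECT company, location, industry, `date`, COUNT(*) AS cnt
FROM layoffs_staging2
GROUP BY company, location, industry, `date`
HAVING cnt > 1;
```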
Use Cases for Analysis
The cleaned dataset supports a range of business intelligence use cases, including:
- Layoffs by country, industry, or company
- Time-series visualizations of layoff spikes or patterns
- Correlations between funding and layoffs
- Stage-based breakdowns (early, growth, IPO, etc.)
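As an illustration, the first two use cases above can be queried directly from the cleaned table, using the column names defined in the schema:

```sql
-- Total layoffs by country, largest first
SELECT country, SUM(total_laid_off) AS total_laid_off
FROM layoffs_staging2
GROUP BY country
ORDER BY total_laid_off DESC;

-- Monthly layoffs for time-series charting
SELECT DATE_FORMAT(`date`, '%Y-%m') AS month,
       SUM(total_laid_off) AS total_laid_off
FROM layoffs_staging2
GROUP BY month
ORDER BY month;
```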
System Setup & Execution
Software:
- MySQL Workbench or equivalent SQL IDE
- GitHub for version control and documentation
Workflow:
- Import the raw CSV to create the `layoffs` table
- Duplicate it to `layoffs_staging` and `layoffs_staging2`
- Run the cleaning scripts step by step
- Analyze data from `layoffs_staging2`
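The CSV import can be done through the Workbench import wizard or scripted with `LOAD DATA`; a sketch, where the file path is hypothetical and `local_infile` is assumed to be enabled on the server:

```sql
-- Assumes local_infile is enabled and the path points at the downloaded CSV
LOAD DATA LOCAL INFILE 'layoffs.csv'
INTO TABLE layoffs
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
```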
Future Enhancements
- Automate the cleaning pipeline using stored procedures or Python scripts
- Connect the final cleaned table to Tableau or Power BI for interactive dashboards
- Integrate API-based data ingestion for real-time updates
- Add version tracking for schema changes and refresh logs
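The stored-procedure idea could be sketched by bundling the repeatable steps into one callable unit; the procedure name is illustrative, and the body shows only the text-standardization steps:

```sql
DELIMITER $$

-- Illustrative sketch: wrap the recurring standardization steps
CREATE PROCEDURE standardize_layoffs()
BEGIN
  UPDATE layoffs_staging2 SET company = TRIM(company);
  UPDATE layoffs_staging2
  SET country = TRIM(TRAILING '.' FROM country)
  WHERE country LIKE 'United States%';
END $$

DELIMITER ;
```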
Updated: August 31, 2025
GitHub Repo: Data-Analyst-Portfolio