SQL · Data Cleaning · MySQL · Duplicate Removal · Standardization · Workforce Analytics

World Layoffs Data Cleaning for Business Analysis

Cleaned a real-world global layoffs dataset in MySQL to support reliable exploratory analysis and business reporting. Built a raw-to-staging pipeline, removed duplicate rows, standardized text fields, handled missing values, and converted dates for time series and trend analysis.

  • 3,000+ records: layoff events standardized for analysis
  • 5 cleaning phases: from raw import to final output table
  • 100% duplicate removal: exact copies dropped using window functions
  • 0 raw changes: all work done in staging layers, not the source
  • 1 analytics table: clean, trusted dataset for BI tools
Database Management Practice · MySQL, MySQL Workbench · January 2025

Project overview

The layoffs dataset was assembled from multiple public sources and news reports, which made it noisy and inconsistent out of the box. Entries included duplicated records, inconsistent company and industry labels, country names with trailing punctuation, and dates stored as text strings.

Key deliverables

  • Raw CSV import workflow and profiling in MySQL Workbench
  • Staging and secondary staging tables for safe, stepwise cleaning
  • Duplicate removal using ROW_NUMBER() window functions
  • Standardization of company, industry, and country values
  • Logic to handle missing values and discard unusable records
  • Date conversion from text into native DATE type for time series analysis

Workflow Overview

Phase 1

Raw data import

Loaded the layoffs CSV into MySQL, confirmed columns, and profiled data quality issues.

Phase 2

Staging architecture

Created staging copies of the raw table to isolate transformations and preserve the source.

Phase 3

Duplicate removal

Applied window functions to identify and remove exact duplicate layoff records.

Phase 4

Standardization & nulls

Cleaned company, industry, and country labels and resolved missing or unusable records.

Phase 5

Date conversion

Converted dates to DATE, dropped helper columns, and finalized the analytics table.

Lab Breakdown

Phase 1 — Raw dataset overview and initial profiling

The original dataset captured company-level layoff events with columns for location, industry, total employees laid off, percent laid off, layoff date, funding stage, country, and funds raised. It arrived as a CSV file with no constraints and visible quality issues.

  • Duplicate rows for the same company, location, date, and layoff counts
  • Inconsistent capitalization and whitespace across text fields
  • Country names such as "United States." with trailing punctuation
  • Dates stored as text strings instead of native DATE types
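
These issues surface quickly with a few exploratory queries. A minimal profiling sketch (the column names assume the schema shown in the staging table later in this write-up):

-- Row count and a quick sample of the raw import
SELECT COUNT(*) FROM layoffs;
SELECT * FROM layoffs LIMIT 10;

-- Spot inconsistent labels and trailing punctuation
SELECT DISTINCT country FROM layoffs ORDER BY country;
SELECT DISTINCT industry FROM layoffs ORDER BY industry;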
World Layoffs raw data CSV preview

Preview of the world layoffs CSV as originally delivered with company, location, industry, and layoff data.

World Layoffs raw table in MySQL

The same data imported into the MySQL database world_layoffs and exposed as the layoffs table.

Phase 2 — Staging and cleaning setup

To protect the original layoff records, staging tables were created that mirror the raw structure. All cleaning steps run inside these staging layers, which makes the pipeline repeatable and easy to debug without risking the source data.

-- Duplicate raw table into a staging table
CREATE TABLE layoffs_staging LIKE layoffs;

INSERT INTO layoffs_staging
SELECT *
FROM layoffs;

-- Create a second staging table for stepwise cleaning,
-- with an extra row_num column used later for deduplication
CREATE TABLE layoffs_staging2 (
  company               TEXT,
  location              TEXT,
  industry              TEXT,
  total_laid_off        INT DEFAULT NULL,
  percentage_laid_off   TEXT,
  `date`                TEXT,
  stage                 TEXT,
  country               TEXT,
  funds_raised_millions INT DEFAULT NULL,
  row_num               INT
);

-- Populate it, computing row_num across all descriptive columns
INSERT INTO layoffs_staging2
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry,
                        total_laid_off, percentage_laid_off,
                        `date`, stage, country,
                        funds_raised_millions
       ) AS row_num
FROM layoffs_staging;
Staging table

View of the layoffs_staging2 table in MySQL Workbench after initial copy from the raw table.

Phase 3 — Removing duplicate records

Exact duplicates occur when the same company, location, industry, layoff counts, date, stage, country, and funding appear more than once. A CTE with ROW_NUMBER() was used to flag these cases and then remove all but a single instance.

-- Identify duplicates using ROW_NUMBER()
WITH duplicate_cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry,
                            total_laid_off, percentage_laid_off,
                            `date`, stage, country,
                            funds_raised_millions
           ) AS row_num
    FROM layoffs_staging
)
SELECT *
FROM duplicate_cte
WHERE row_num > 1;

-- MySQL does not allow DELETE directly against a CTE,
-- so the row_num column materialized in layoffs_staging2 is used instead
DELETE
FROM layoffs_staging2
WHERE row_num > 1;

After this step, each unique combination of company, location, industry, layoff figures, date, stage, country, and funding is represented by exactly one row.
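
As a sanity check, re-running the duplicate filter against the cleaned staging table should return no rows (this assumes layoffs_staging2 carries the row_num column referenced by the delete):

-- Verify no flagged duplicates remain
SELECT *
FROM layoffs_staging2
WHERE row_num > 1;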

Phase 4 — Standardizing text fields and handling missing values

Consistent text values and meaningful non-null records are essential for trustworthy analysis. In this phase, company, industry, and country fields were standardized; missing industry values were then backfilled from matching rows, and records with no usable layoff metrics were removed.

4.1 — Company, industry, and country standardization

-- Standardize company names by trimming whitespace
UPDATE layoffs_staging2
SET company = TRIM(company);

-- Consolidate industry variations (Crypto)
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Remove inconsistent formatting in country field
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';

These standardization steps allow analysts to roll up layoffs by industry and country without losing records to minor spelling differences.
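
Before writing each UPDATE, candidate variations can be surveyed with simple DISTINCT scans. A sketch (the specific variant labels are illustrative):

-- Inspect industry labels for near-duplicates such as 'Crypto' vs 'Crypto Currency'
SELECT DISTINCT industry
FROM layoffs_staging2
ORDER BY industry;

-- Inspect country values for trailing punctuation
SELECT DISTINCT country
FROM layoffs_staging2
ORDER BY country;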

4.2 — Filling gaps and removing unusable records

-- Populate missing industry fields by self joining
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2
  ON t1.company = t2.company
 AND t1.location = t2.location
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;

-- Remove rows where both layoff metrics are NULL
DELETE
FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;

This step retains informative records while discarding entries that cannot contribute to quantitative analysis.
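
One caveat: if missing industries arrived as empty strings rather than true NULLs, which is common with CSV imports, the self join will not match them. A preliminary normalization step like this one (an assumption about how the blanks were imported) handles that case:

-- Convert empty industry strings to NULL so the self join can backfill them
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';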

Phase 5 — Date conversion and final analytics output

To support time-based analysis, the date column was converted from text to a native DATE type and temporary helper fields were dropped. The result is a clean, analysis-ready table for BI tools.

-- Convert the date column from TEXT to DATE format
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Modify the date column type to DATE
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;

-- Drop the unnecessary row_num column
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;

-- Verify the final cleaned table
SELECT *
FROM layoffs_staging2;

At this point, layoffs_staging2 serves as the cleaned, analysis-ready table for BI tools like Tableau or Power BI.
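
With `date` stored as a native DATE, time series rollups become straightforward. A sketch of a monthly trend query against the cleaned table:

-- Total layoffs per month across all companies
SELECT DATE_FORMAT(`date`, '%Y-%m') AS `month`,
       SUM(total_laid_off) AS total_laid_off
FROM layoffs_staging2
WHERE `date` IS NOT NULL
GROUP BY `month`
ORDER BY `month`;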

Data Cleaning Impact

Systematic data cleaning transformed a noisy, duplicate-heavy CSV into a structured and dependable database table ready for analysis and reporting.

Key Insights Enabled

5 CAPABILITIES
INDUSTRY TRENDS

Industry-level trends showing which sectors experienced the most layoffs over time.

GEOGRAPHIC ANALYSIS

Geographic comparisons of layoffs across countries after standardizing country names.

FUNDING STAGE INSIGHTS

Analysis by funding stage to see how capital access relates to layoffs.

TIME SERIES READY

Ability to build time series charts from the cleaned DATE column and group by year or quarter.

TRUSTWORTHY METRICS

More trustworthy totals and averages now that duplicates and unusable rows are removed.

Architecture & Usage

Architecture overview

  • Dataset — World layoffs CSV file hosted on GitHub
  • Database — MySQL schema world_layoffs with raw and staging tables
  • Client — MySQL Workbench for SQL execution and inspection
  • Scripts — Cleaning steps stored in version control for repeatability

Data flow

  • CSV → raw table layoffs
  • Raw table → layoffs_staging for safe experimentation
  • Staging → layoffs_staging2 with deduplication and cleaning logic
  • Clean table → analytics queries, dashboards, and exported datasets

Tier 1: Setup

Objective: Configure the data cleaning environment and prepare raw data.

  • System requirements: MySQL database server with support for window functions.
  • Initial import: import the raw layoffs CSV into a MySQL database.
  • Script execution: run staging, deduplication, and cleaning scripts in order.

Tier 2: Analysis

Objective: Query the cleaned data to extract business insights.

  • Clean table access: use layoffs_staging2 as the cleaned table for all analysis.
  • Aggregation queries: aggregate by industry, country, year, or funding stage.
  • Dashboard exports: export results to dashboards or notebooks for storytelling.
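
For example, an aggregation by industry and year might look like this (a sketch; the metrics chosen will vary by dashboard):

-- Layoffs by industry per year, largest first
SELECT industry,
       YEAR(`date`) AS layoff_year,
       SUM(total_laid_off) AS total_laid_off
FROM layoffs_staging2
GROUP BY industry, layoff_year
ORDER BY total_laid_off DESC;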

Tier 3: Enhancement

Objective: Extend the pipeline for ongoing data operations.

  • Automated ETL: build automated jobs to rerun cleaning scripts on updated data.
  • Real-time integration: connect to upstream sources if data is refreshed regularly.
  • BI connections: direct connections to BI tools for scheduled reporting.

Key insights

  • Staging architecture protects raw data while enabling iterative cleaning operations.
  • Window functions provide an elegant solution for identifying and removing exact duplicates.
  • Text standardization ensures consistent aggregation across industry and geographic dimensions.
  • Date type conversion enables native time series operations and temporal analysis.
  • Systematic null handling ensures only informative records remain in the final dataset.

Skills demonstrated

MySQL · SQL · Data Cleaning · Window Functions · Text Standardization · Null Handling · Date Conversion · ETL · Staging Design · Analytics Preparation

Summary

This World Layoffs SQL project turns a noisy, duplicate-heavy CSV into a structured and dependable database table. By introducing staging layers, using window functions for deduplication, cleaning text fields, resolving missing values, and converting dates into a proper type, the final dataset now supports accurate industry, geography, and funding stage insights. It gives decision makers a solid base for dashboards, time series analysis, and deeper workforce trend investigations.