SQL · Data Cleaning · MySQL · Duplicate Removal · Standardization · Workforce Analytics

World Layoffs Data Cleaning for Business Analysis

Cleaned a real-world global layoffs dataset in MySQL to support reliable exploratory analysis and business reporting. Built a raw-to-staging pipeline, removed duplicate rows, standardized text fields, handled missing values, and converted dates for time series and trend analysis.

  • 3,000+ records: layoff events standardized for analysis
  • 5 cleaning phases: from raw import to final output table
  • 100% duplicate removal: exact copies dropped using window functions
  • 0 raw changes: all work done in staging layers, not the source
  • 1 analytics table: clean, trusted dataset for BI tools
Database Management Practice · MySQL, MySQL Workbench · January 2025

Project overview

The layoffs dataset was assembled from multiple public sources and news reports, which made it noisy and inconsistent out of the box. Entries included duplicated records, inconsistent company and industry labels, country names with trailing punctuation, and dates stored as text strings.

Key deliverables

  • Raw CSV import workflow and profiling in MySQL Workbench
  • Staging and secondary staging tables for safe, stepwise cleaning
  • Duplicate removal using ROW_NUMBER() window functions
  • Standardization of company, industry, and country values
  • Logic to handle missing values and discard unusable records
  • Date conversion from text into native DATE type for time series analysis

Workflow Overview

Phase 1

Raw data import

Loaded the layoffs CSV into MySQL, confirmed columns, and profiled data quality issues.

Phase 2

Staging architecture

Created staging copies of the raw table to isolate transformations and preserve the source.

Phase 3

Duplicate removal

Applied window functions to identify and remove exact duplicate layoff records.

Phase 4

Standardization & nulls

Cleaned company, industry, and country labels and resolved missing or unusable records.

Phase 5

Date conversion

Converted dates to DATE, dropped helper columns, and finalized the analytics table.

Lab Breakdown

Phase 1 — Raw dataset overview and initial profiling

The original dataset captured company-level layoff events with columns for location, industry, total employees laid off, percent laid off, layoff date, funding stage, country, and funds raised. It arrived as a CSV file with no constraints and visible quality issues.

  • Duplicate rows for the same company, location, date, and layoff counts
  • Inconsistent capitalization and whitespace across text fields
  • Country names such as "United States." with trailing punctuation
  • Dates stored as text strings instead of native DATE types
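
These issues surface quickly with a few exploratory queries. A minimal profiling sketch (the column names assume the schema shown in the staging table later in this write-up):

-- Row count and a quick sample of the raw import
SELECT COUNT(*) FROM layoffs;
SELECT * FROM layoffs LIMIT 10;

-- Spot inconsistent labels and trailing punctuation
SELECT DISTINCT country FROM layoffs ORDER BY country;
SELECT DISTINCT industry FROM layoffs ORDER BY industry;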
World Layoffs raw data CSV preview

Preview of the world layoffs CSV as originally delivered with company, location, industry, and layoff data.

World Layoffs raw table in MySQL

The same data imported into the MySQL database world_layoffs and exposed as the layoffs table.

Phase 2 — Staging and cleaning setup

To protect the original layoff records, staging tables were created that mirror the raw structure. All cleaning steps run inside these staging layers, which makes the pipeline repeatable and easy to debug without risking the source data.

-- Duplicate raw table into a staging table
CREATE TABLE layoffs_staging LIKE layoffs;

INSERT INTO layoffs_staging
SELECT *
FROM layoffs;

-- Create a second staging table for stepwise cleaning,
-- with an extra row_num column used later for deduplication
CREATE TABLE layoffs_staging2 (
  company               TEXT,
  location              TEXT,
  industry              TEXT,
  total_laid_off        INT DEFAULT NULL,
  percentage_laid_off   TEXT,
  `date`                TEXT,
  stage                 TEXT,
  country               TEXT,
  funds_raised_millions INT DEFAULT NULL,
  row_num               INT
);

-- Populate it, computing row_num across all descriptive columns
INSERT INTO layoffs_staging2
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry,
                        total_laid_off, percentage_laid_off,
                        `date`, stage, country,
                        funds_raised_millions
       ) AS row_num
FROM layoffs_staging;
Staging table

View of the layoffs_staging2 table in MySQL Workbench after initial copy from the raw table.

Phase 3 — Removing duplicate records

Exact duplicates occur when the same company, location, industry, layoff counts, date, stage, country, and funding appear more than once. A CTE with ROW_NUMBER() was used to flag these cases and then remove all but a single instance.

-- Identify duplicates using ROW_NUMBER()
WITH duplicate_cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry,
                            total_laid_off, percentage_laid_off,
                            `date`, stage, country,
                            funds_raised_millions
           ) AS row_num
    FROM layoffs_staging
)
SELECT *
FROM duplicate_cte
WHERE row_num > 1;

-- MySQL does not allow DELETE directly against a CTE,
-- so the row_num column materialized in layoffs_staging2 is used instead
DELETE
FROM layoffs_staging2
WHERE row_num > 1;

After this step, each unique combination of company, location, industry, layoff figures, date, stage, country, and funding is represented by exactly one row.
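
As a sanity check, re-running the duplicate filter against the cleaned staging table should return no rows (this assumes layoffs_staging2 carries the row_num column referenced by the delete):

-- Verify no flagged duplicates remain
SELECT *
FROM layoffs_staging2
WHERE row_num > 1;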

Phase 4 — Standardizing text fields and handling missing values

Consistent text values and meaningful non-null records are essential for trustworthy analysis. In this phase, company, industry, and country fields were standardized; missing industry values were then backfilled from matching rows, and records with no usable layoff metrics were removed.

4.1 — Company, industry, and country standardization

-- Standardize company names by trimming whitespace
UPDATE layoffs_staging2
SET company = TRIM(company);

-- Consolidate industry variations (Crypto)
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Remove inconsistent formatting in country field
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';

These standardization steps allow analysts to roll up layoffs by industry and country without losing records to minor spelling differences.
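
Before writing each UPDATE, candidate variations can be surveyed with simple DISTINCT scans. A sketch (the specific variant labels are illustrative):

-- Inspect industry labels for near-duplicates such as 'Crypto' vs 'Crypto Currency'
SELECT DISTINCT industry
FROM layoffs_staging2
ORDER BY industry;

-- Inspect country values for trailing punctuation
SELECT DISTINCT country
FROM layoffs_staging2
ORDER BY country;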

4.2 — Filling gaps and removing unusable records

-- Populate missing industry fields by self joining
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2
  ON t1.company = t2.company
 AND t1.location = t2.location
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;

-- Remove rows where both layoff metrics are NULL
DELETE
FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;

This step retains informative records while discarding entries that cannot contribute to quantitative analysis.
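
One caveat: if missing industries arrived as empty strings rather than true NULLs, which is common with CSV imports, the self join will not match them. A preliminary normalization step like this one (an assumption about how the blanks were imported) handles that case:

-- Convert empty industry strings to NULL so the self join can backfill them
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';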

Phase 5 — Date conversion and final analytics output

To support time-based analysis, the date column was converted from text to a native DATE type and temporary helper fields were dropped. The result is a clean, analysis-ready table for BI tools.

-- Convert the date column from TEXT to DATE format
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Modify the date column type to DATE
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;

-- Drop the unnecessary row_num column
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;

-- Verify the final cleaned table
SELECT *
FROM layoffs_staging2;

At this point, layoffs_staging2 serves as the cleaned, analysis-ready table for BI tools like Tableau or Power BI.
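
With `date` stored as a native DATE, time series rollups become straightforward. A sketch of a monthly trend query against the cleaned table:

-- Total layoffs per month across all companies
SELECT DATE_FORMAT(`date`, '%Y-%m') AS `month`,
       SUM(total_laid_off) AS total_laid_off
FROM layoffs_staging2
WHERE `date` IS NOT NULL
GROUP BY `month`
ORDER BY `month`;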

Data Cleaning Impact

Systematic data cleaning transformed a noisy, duplicate-heavy CSV into a structured and dependable database table ready for analysis and reporting.

Key Insights Enabled

5 CAPABILITIES
INDUSTRY TRENDS

Industry-level trends showing which sectors experienced the most layoffs over time.

GEOGRAPHIC ANALYSIS

Geographic comparisons of layoffs across countries after standardizing country names.

FUNDING STAGE INSIGHTS

Analysis by funding stage to see how capital access relates to layoffs.

TIME SERIES READY

Ability to build time series charts from the cleaned DATE column and group by year or quarter.

TRUSTWORTHY METRICS

More trustworthy totals and averages now that duplicates and unusable rows are removed.

Architecture & Usage

Architecture overview

  • Dataset — World layoffs CSV file hosted on GitHub
  • Database — MySQL schema world_layoffs with raw and staging tables
  • Client — MySQL Workbench for SQL execution and inspection
  • Scripts — Cleaning steps stored in version control for repeatability

Data flow

  • CSV → raw table layoffs
  • Raw table → layoffs_staging for safe experimentation
  • Staging → layoffs_staging2 with deduplication and cleaning logic
  • Clean table → analytics queries, dashboards, and exported datasets

Tier 1: Setup

Objective: Configure the data cleaning environment and prepare raw data.

  • System requirements: MySQL database server with support for window functions.
  • Initial import: import the raw layoffs CSV into a MySQL database.
  • Script execution: run staging, deduplication, and cleaning scripts in order.

Tier 2: Analysis

Objective: Query the cleaned data to extract business insights.

  • Clean table access: use layoffs_staging2 as the cleaned table for all analysis.
  • Aggregation queries: aggregate by industry, country, year, or funding stage.
  • Dashboard exports: export results to dashboards or notebooks for storytelling.
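
For example, an aggregation by industry and year might look like this (a sketch; the metrics chosen will vary by dashboard):

-- Layoffs by industry per year, largest first
SELECT industry,
       YEAR(`date`) AS layoff_year,
       SUM(total_laid_off) AS total_laid_off
FROM layoffs_staging2
GROUP BY industry, layoff_year
ORDER BY total_laid_off DESC;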

Tier 3: Enhancement

Objective: Extend the pipeline for ongoing data operations.

  • Automated ETL: build automated jobs to rerun cleaning scripts on updated data.
  • Real-time integration: connect to upstream sources if data is refreshed regularly.
  • BI connections: direct connections to BI tools for scheduled reporting.

Key insights

  • Staging architecture protects raw data while enabling iterative cleaning operations.
  • Window functions provide an elegant solution for identifying and removing exact duplicates.
  • Text standardization ensures consistent aggregation across industry and geographic dimensions.
  • Date type conversion enables native time series operations and temporal analysis.
  • Systematic null handling ensures only informative records remain in the final dataset.

Skills demonstrated

MySQL · SQL · Data Cleaning · Window Functions · Text Standardization · Null Handling · Date Conversion · ETL · Staging Design · Analytics Preparation

Summary

This World Layoffs SQL project turns a noisy, duplicate-heavy CSV into a structured and dependable database table. By introducing staging layers, using window functions for deduplication, cleaning text fields, resolving missing values, and converting dates into a proper type, the final dataset now supports accurate industry, geography, and funding stage insights. It gives decision makers a solid base for dashboards, time series analysis, and deeper workforce trend investigations.