Goals
- Practice real-world data cleaning techniques on a retail dataset.
- Handle missing values using appropriate strategies rather than blindly dropping data.
- Standardize column names, categories, and date formats for consistency.
- Produce a clean CSV ready for analysis or visualization.
Process
- Explored dataset structure using `df.info()`, `df.isna().sum()`, and `df.describe()`.
- Standardized column names — stripped whitespace, lowercased, replaced spaces with underscores.
- Handled nulls per column based on context — filled, calculated, or dropped where justified.
- Cleaned and standardized the `category` column formatting.
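The process above can be sketched with pandas. The file is replaced by a small inline sample, and the column names (`quantity`, `price_per_unit`, `total_spent`, `category`) are assumptions for illustration, not the dataset's actual schema:

```python
import io

import pandas as pd

# Small inline sample standing in for the retail CSV (hypothetical columns).
raw = io.StringIO(
    " Transaction ID ,Item,Category,Quantity,Price Per Unit,Total Spent\n"
    "T1,Item_1_Food, food ,2,3.0,6.0\n"
    "T2,Item_7_Drink,Drink,,2.5,5.0\n"
    "T3,Item_2_Food,Food,4,1.5,\n"
)
df = pd.read_csv(raw)

# Explore structure: dtypes, null counts, summary stats.
df.info()
print(df.isna().sum())
print(df.describe())

# Standardize column names: strip whitespace, lowercase, spaces -> underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Handle nulls per column based on context: a missing quantity can be
# calculated from total_spent / price_per_unit, and vice versa.
df["quantity"] = df["quantity"].fillna(df["total_spent"] / df["price_per_unit"])
df["total_spent"] = df["total_spent"].fillna(df["quantity"] * df["price_per_unit"])

# Clean and standardize the category column formatting.
df["category"] = df["category"].str.strip().str.title()
```

Rows where neither value is recoverable would still be null after this pass and could then be dropped with justification.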
Insights
- Nearly 5% of transactions had no recoverable quantity or spending data; dropping them was justified and had minimal impact on the dataset.
- The `item` column encoded category information in its naming pattern (`Item_[number]_[category]`), which revealed a way to validate the `category` column.
- `discount_applied` nulls were meaningful: they represented transactions with no discount, not truly missing data. Context matters when handling nulls.
- Standardizing text columns early (before any grouping or analysis) prevents silent bugs where `"Food"` and `"food"` are treated as different categories.