Cafe Sale Data Cleaning

Tool

Python

Project Type

Data Cleaning

Link

https://github.com/john-nio/data-cleaning-projects/tree/main/cafe_sales

Source

kaggle.com

🥅

Goals

Handle fake null values (ERROR, UNKNOWN) alongside real nulls

Recover missing values using mathematical relationships between columns

Apply appropriate strategies per column based on context and null percentage

Produce a clean, analysis-ready dataset with correct data types

💼

Process

Handled nulls column by column based on context: filled, calculated, or dropped where justified.

Replaced ERROR and UNKNOWN strings with np.nan to standardize all missing values

Recovered price_per_unit, quantity, and total_spent using a price chart and column arithmetic

Filled high-null categorical columns (payment_method, location) with "unknown" rather than dropping
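The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `item` column name, the `PRICE_CHART` values, the `clean_sales` helper, and the sample rows are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical price chart (assumed values, for illustration only).
PRICE_CHART = {"Coffee": 2.0, "Cake": 3.0}
NUM_COLS = ["quantity", "price_per_unit", "total_spent"]


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # 1. Turn placeholder strings into real missing values.
    out = out.replace(["ERROR", "UNKNOWN"], np.nan)

    # 2. Convert the numeric columns; anything unparseable becomes NaN.
    for col in NUM_COLS:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # 3. Recover price_per_unit from the price chart via the item name.
    out["price_per_unit"] = out["price_per_unit"].fillna(
        out["item"].map(PRICE_CHART)
    )

    # 4. Recover the remaining values with quantity * price = total.
    out["total_spent"] = out["total_spent"].fillna(
        out["quantity"] * out["price_per_unit"]
    )
    out["quantity"] = out["quantity"].fillna(
        out["total_spent"] / out["price_per_unit"]
    )
    out["price_per_unit"] = out["price_per_unit"].fillna(
        out["total_spent"] / out["quantity"]
    )

    # 5. Keep rows with missing categories by labelling them "unknown".
    for col in ["payment_method", "location"]:
        out[col] = out[col].fillna("unknown")

    # 6. Drop only the rows whose numeric values could not be recovered.
    return out.dropna(subset=NUM_COLS)


# Made-up sample rows, not taken from the real dataset.
raw = pd.DataFrame({
    "item": ["Coffee", "Cake", "Coffee", "UNKNOWN"],
    "quantity": ["2", "ERROR", "3", "1"],
    "price_per_unit": ["2.0", "3.0", "UNKNOWN", "ERROR"],
    "total_spent": ["ERROR", "9.0", "6.0", "UNKNOWN"],
    "payment_method": ["Cash", "UNKNOWN", None, "Card"],
    "location": ["ERROR", "Takeaway", "In-store", None],
})
clean = clean_sales(raw)
print(clean)
```

In this sample the first three rows are fully recovered, while the last row has neither a known item nor a price or total, so it is the only one dropped.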

✨

Insights

25% of payment methods and 32% of locations were missing, so dropping those rows would have discarded a large share of the dataset

Converting the date column with errors='coerce' revealed 301 extra nulls stored as ERROR/UNKNOWN strings, which isna() alone would have missed

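Recovering values mathematically (quantity × price_per_unit = total_spent) beats filling with a statistic whenever two of the three values are known, because it restores the exact value rather than an estimate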

Only 60 rows (0.6%) were truly unrecoverable, a sign that most missing data had a recovery path worth exploring
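The point about errors='coerce' can be shown with a small sketch. It assumes the dates were parsed with `pd.to_datetime`. The `transaction_date` name, the date format, and the sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample: one real null plus two placeholder strings.
dates = pd.Series(
    ["2023-01-15", "ERROR", None, "UNKNOWN", "2023-03-02"],
    name="transaction_date",  # assumed column name
)

# isna() on the raw strings only sees the real null.
nulls_before = int(dates.isna().sum())

# errors="coerce" turns every unparseable value into NaT,
# so the placeholder strings become visible as missing.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
nulls_after = int(parsed.isna().sum())

hidden = nulls_after - nulls_before
print(nulls_before, nulls_after, hidden)  # 1 3 2
```

The difference between the two counts is the number of values that looked populated but were in fact missing, which is how the 301 extra nulls in the date column would surface.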
