Goals
- Handle fake null values (`ERROR`, `UNKNOWN`) alongside real nulls
- Recover missing values using mathematical relationships between columns
- Apply appropriate strategies per column based on context and null percentage
- Produce a clean, analysis-ready dataset with correct data types
Process
- Handled nulls per column based on context — filled, calculated, or dropped where justified.
- Replaced `ERROR` and `UNKNOWN` strings with `np.nan` to standardize all missing values
- Recovered `price_per_unit`, `quantity`, and `total_spent` using a price chart and column arithmetic
- Filled high-null categorical columns (`payment_method`, `location`) with `"unknown"` rather than dropping
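The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `item` column, the `price_chart` mapping, the sample values, and the order of the recovery steps are all assumptions made for the example. Only the column names `price_per_unit`, `quantity`, `total_spent`, and `payment_method` come from the write-up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values and the "item" column are made up.
df = pd.DataFrame({
    "item": ["coffee", "tea", "coffee", "tea"],
    "price_per_unit": ["2.0", "ERROR", "2.0", "1.5"],
    "quantity": ["3", "2", "UNKNOWN", "4"],
    "total_spent": ["UNKNOWN", "3.0", "10.0", "6.0"],
    "payment_method": ["cash", "UNKNOWN", None, "card"],
})

# Hypothetical price chart: one fixed price per item.
price_chart = {"coffee": 2.0, "tea": 1.5}

# 1. Standardize: turn the placeholder strings into real missing values,
#    then convert the numeric columns to a numeric dtype.
df = df.replace(["ERROR", "UNKNOWN"], np.nan)
num_cols = ["price_per_unit", "quantity", "total_spent"]
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")

# 2. Recover: look up missing prices, then use quantity * price = total
#    to fill whichever of the remaining values is missing.
df["price_per_unit"] = df["price_per_unit"].fillna(df["item"].map(price_chart))
df["total_spent"] = df["total_spent"].fillna(df["price_per_unit"] * df["quantity"])
df["quantity"] = df["quantity"].fillna(df["total_spent"] / df["price_per_unit"])

# 3. Fill: keep rows with a missing category by labelling them explicitly.
df["payment_method"] = df["payment_method"].fillna("unknown")

print(df)
```

Rows where two of the three numeric values are missing would stay null after step 2, which is where a drop or a statistical fill would have to be considered.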
Insights
- 25% of payment methods and 32% of locations were missing — dropping would have destroyed the dataset
- `errors='coerce'` revealed 301 extra nulls hiding as `ERROR`/`UNKNOWN` in the date column that `isna()` alone would have missed
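- Recovering values mathematically (quantity × price = total) is preferable to filling with a statistic whenever the other two values are present, because it reproduces the actual value rather than an estimate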
- Only 60 rows (0.6%) were truly unrecoverable — a sign that most missing data had a recovery path worth exploring
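The hidden-null effect in the date column can be reproduced on a toy example. The series name `transaction_date`, the sample values, and the date format are assumptions for illustration; the counts here are unrelated to the 301 found in the real data.

```python
import pandas as pd

# Hypothetical date column: one real null plus two placeholder strings.
dates = pd.Series(
    ["2023-01-05", "ERROR", None, "UNKNOWN", "2023-03-10"],
    name="transaction_date",
)

# isna() only sees the real null; the placeholder strings look like valid data.
nulls_before = int(dates.isna().sum())

# errors="coerce" turns anything that cannot be parsed as a date into NaT.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
nulls_after = int(parsed.isna().sum())

# The difference is the number of nulls that were hiding as strings.
hidden = nulls_after - nulls_before
print(nulls_before, nulls_after, hidden)
```

Comparing the null count before and after coercion is a quick way to check whether a column contains placeholder strings that still need handling.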