📝
Cafe Sales Data Cleaning




🥅
Goals
  • Handle fake null values (ERROR, UNKNOWN) alongside real nulls
  • Recover missing values using mathematical relationships between columns
  • Apply appropriate strategies per column based on context and null percentage
  • Produce a clean, analysis-ready dataset with correct data types
 
💼
Process
  • Handled nulls per column based on context: filled, calculated, or dropped where justified
  • Replaced ERROR and UNKNOWN strings with np.nan to standardize all missing values
  • Recovered price_per_unit, quantity, and total_spent using a price chart and column arithmetic
  • Filled high-null categorical columns (payment_method, location) with "unknown" rather than dropping
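The steps above can be sketched in pandas. This is a minimal illustration on a toy frame; the column names (`quantity`, `price_per_unit`, `total_spent`, `payment_method`) follow the dataset, but the values are made up:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cafe sales data (values are illustrative).
df = pd.DataFrame({
    "item": ["Coffee", "Tea", "Cake", "Coffee"],
    "quantity": [2, "ERROR", 1, 3],
    "price_per_unit": [2.0, 1.5, np.nan, 2.0],
    "total_spent": [4.0, 3.0, 3.5, "UNKNOWN"],
    "payment_method": ["Cash", "UNKNOWN", None, "Card"],
})

# Standardize fake nulls, then coerce numerics so leftover strings become NaN too.
df = df.replace(["ERROR", "UNKNOWN"], np.nan)
for col in ["quantity", "price_per_unit", "total_spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Recover each numeric column from the other two (quantity × price = total).
df["total_spent"] = df["total_spent"].fillna(df["quantity"] * df["price_per_unit"])
df["price_per_unit"] = df["price_per_unit"].fillna(df["total_spent"] / df["quantity"])
df["quantity"] = df["quantity"].fillna(df["total_spent"] / df["price_per_unit"])

# Keep high-null categoricals instead of dropping rows.
df["payment_method"] = df["payment_method"].fillna("unknown")
```

Running the recovery in this order lets a column filled in one step feed the next, so a row missing only one of the three numeric fields is always recoverable.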
 
 
💡
Insights
  • 25% of payment methods and 32% of locations were missing — dropping would have destroyed the dataset
  • errors='coerce' revealed 301 extra nulls hiding as ERROR/UNKNOWN in the date column that isna() alone would have missed
  • Recovering values mathematically (quantity × price = total) restores the true value, unlike filling with a mean or mode, so it should be tried before any statistical imputation
  • Only 60 rows (0.6%) were truly unrecoverable — a sign that most missing data had a recovery path worth exploring
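The `errors='coerce'` behaviour noted above is easy to demonstrate on a small example (the dates here are illustrative, not from the dataset):

```python
import pandas as pd

dates = pd.Series(["2023-01-05", "ERROR", "UNKNOWN", "2023-02-10", None])

# isna() only sees the real null; the fake nulls look like ordinary strings.
print(dates.isna().sum())  # 1

# to_datetime with errors="coerce" turns unparseable strings into NaT,
# exposing the fake nulls as missing values as well.
parsed = pd.to_datetime(dates, errors="coerce")
print(parsed.isna().sum())  # 3
```

The gap between the two counts is exactly the number of fake nulls hiding in the column, which is how the 301 extra nulls surfaced in the date column.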

 
 