Retail Data Store Cleaning Project




🥅
Goals
  • Practice real-world data cleaning techniques on a retail dataset.
  • Handle missing values using appropriate strategies rather than blindly dropping data.
  • Standardize column names, categories, and date formats for consistency.
  • Produce a clean CSV ready for analysis or visualization.
 
💼
Process
  • Explored the dataset structure using df.info(), df.isna().sum(), and df.describe().
  • Standardized column names — stripped whitespace, lowercased, replaced spaces with underscores.
  • Handled nulls per column based on context — filled, calculated, or dropped where justified.
  • Cleaned and standardized the formatting of the category column.
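
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the toy data, the column names (price_per_unit, quantity, total_spent), the rule quantity = total_spent / price_per_unit, and the choice of lowercase for category labels are all assumptions made for the example.

```python
import pandas as pd

# Toy frame with hypothetical column names and values (not the project's data).
raw = pd.DataFrame({
    " Transaction ID ": ["T1", "T2", "T3", "T4"],
    "Category": ["Food", " food", "FOOD", "Beverages "],
    "Price Per Unit": [2.0, 5.0, None, 4.0],
    "Quantity": [3.0, None, None, 1.0],
    "Total Spent": [6.0, 10.0, None, None],
})


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace, lowercase, and replace spaces with underscores."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    return out


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = standardize_columns(df)

    # Calculate: recover a missing value from the other two where possible
    # (assumed relationship: total_spent = price_per_unit * quantity).
    out["quantity"] = out["quantity"].fillna(
        out["total_spent"] / out["price_per_unit"]
    )
    out["total_spent"] = out["total_spent"].fillna(
        out["price_per_unit"] * out["quantity"]
    )

    # Drop: rows where quantity or spending is still unrecoverable.
    out = out.dropna(subset=["quantity", "total_spent"]).reset_index(drop=True)

    # Standardize category text so "Food" and " food" become one label.
    out["category"] = out["category"].str.strip().str.lower()
    return out


cleaned = clean(raw)
print(cleaned)
```

In this toy run, row T3 is dropped because neither quantity nor spending can be recovered, while T2 and T4 are filled by calculation. Checking df.isna().sum() before and after each step is a simple way to confirm that every fill or drop did what was intended.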
 
Insights
  • Nearly 5% of transactions had no recoverable quantity or spending data, so dropping them was justified and had minimal impact on the dataset.
  • The item column encoded category information in its naming pattern (Item_[number]_[category]), which provided a way to validate the category column.
  • discount_applied nulls were meaningful: they represented transactions with no discount, not truly missing data. Context matters when handling nulls.
  • Standardizing text columns early (before any grouping or analysis) prevents silent bugs where "Food" and "food" are treated as different categories.
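
The item-name check described above could look something like the sketch below. It is written in plain Python so it runs on its own; the helper names, the sample rows, and the token values (FOOD, BEV) are hypothetical. Because the exact form of the category part of an item name is not specified here, the sketch only checks consistency: each item token should appear alongside exactly one normalized category label.

```python
import re
from collections import defaultdict

# Naming pattern described above: Item_[number]_[category]
ITEM_PATTERN = re.compile(r"^Item_(\d+)_(.+)$")


def category_token(item):
    """Return the trailing category token of an item name, or None."""
    match = ITEM_PATTERN.match(item.strip())
    return match.group(2) if match else None


def find_category_conflicts(rows):
    """Return item tokens that appear with more than one category label.

    Labels are stripped and lowercased first, so "Food" and "food "
    count as the same category.
    """
    seen = defaultdict(set)
    for row in rows:
        token = category_token(row["item"])
        if token is None or row["category"] is None:
            continue  # nothing to validate for this row
        seen[token].add(row["category"].strip().lower())
    return {token: labels for token, labels in seen.items() if len(labels) > 1}


# Hypothetical sample rows; the last one has a mismatched category.
rows = [
    {"item": "Item_1_FOOD", "category": "Food"},
    {"item": "Item_2_FOOD", "category": "food "},
    {"item": "Item_3_BEV", "category": "Beverages"},
    {"item": "Item_4_BEV", "category": "Food"},
]

conflicts = find_category_conflicts(rows)
print(conflicts)
```

Only the BEV token is flagged here, because the two FOOD rows collapse to a single label once the text is normalized. On a DataFrame, the same regular expression can be passed to Series.str.extract to pull the token into its own column before grouping.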

 
 