Goals
- Practice real-world data cleaning techniques on a retail dataset.
- Handle missing values using appropriate strategies rather than blindly dropping data.
- Standardize column names, categories, and date formats for consistency.
- Produce a clean CSV ready for analysis or visualization.
Process
- Explored dataset structure using `df.info()`, `df.isna().sum()`, and `df.describe()`.
- Standardized column names — stripped whitespace, lowercased, replaced spaces with underscores.
- Handled nulls per column based on context — filled, calculated, or dropped where justified.
- Cleaned and standardized the `category` column formatting.
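The process above can be sketched with pandas. The file is replaced by a small inline sample, and the column names (`quantity`, `price_per_unit`, `total_spent`, `category`) are assumptions for illustration, not the dataset's actual schema:

```python
import io

import pandas as pd

# Small inline sample standing in for the retail CSV (hypothetical columns).
raw = io.StringIO(
    " Transaction ID ,Item,Category,Quantity,Price Per Unit,Total Spent\n"
    "T1,Item_1_Food, food ,2,3.0,6.0\n"
    "T2,Item_7_Drink,Drink,,2.5,5.0\n"
    "T3,Item_2_Food,Food,4,1.5,\n"
)
df = pd.read_csv(raw)

# Explore structure: dtypes, null counts, summary stats.
df.info()
print(df.isna().sum())
print(df.describe())

# Standardize column names: strip whitespace, lowercase, spaces -> underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Handle nulls per column based on context: a missing quantity can be
# calculated from total_spent / price_per_unit, and vice versa.
df["quantity"] = df["quantity"].fillna(df["total_spent"] / df["price_per_unit"])
df["total_spent"] = df["total_spent"].fillna(df["quantity"] * df["price_per_unit"])

# Clean and standardize the category column formatting.
df["category"] = df["category"].str.strip().str.title()
```

Rows where neither value is recoverable would still be null after this pass and could then be dropped with justification.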
Insights
- Nearly 5% of transactions had no recoverable quantity or spending data; dropping them was justified and had minimal impact on the dataset.
- The `item` column encoded category information in its naming pattern (`Item_[number]_[category]`), which revealed a way to validate the `category` column.
- `discount_applied` nulls were meaningful: they represented transactions with no discount, not truly missing data. Context matters when handling nulls.
- Standardizing text columns early (before any grouping or analysis) prevents silent bugs where `"Food"` and `"food"` are treated as different categories.