Data Science Jobs — Data Cleaning Project


🥅
Goals
  • Identify and handle fake nulls disguised as placeholder strings (e.g. '-1') across multiple columns
  • Parse composite columns (salary range, company name + rating, location, size) into structured fields
  • Simplify high-cardinality categorical columns into clean, grouped labels
  • Extract skill signals from unstructured job description text into binary columns
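A minimal pandas sketch of the fake-null goal, scanning object-typed columns for placeholder strings that act as hidden nulls (the sample data and the exact placeholder tokens here are assumptions):

```python
import pandas as pd

# Hypothetical sample rows; the placeholder tokens below are assumptions
df = pd.DataFrame({
    "company_name": ["Acme Corp\n4.1", "-1", "Globex\n3.8"],
    "salary_range": ["$50K-$90K", "$70K-$120K", "-1"],
})

PLACEHOLDERS = {"-1", "na", "n/a", "unknown", "none", ""}

def placeholder_counts(frame: pd.DataFrame) -> pd.Series:
    """Count placeholder strings per object-typed column."""
    obj = frame.select_dtypes(include="object")
    return obj.apply(
        lambda col: col.str.strip().str.lower().isin(PLACEHOLDERS).sum()
    )

counts = placeholder_counts(df)
```

Running this kind of scan before trusting `df.info()` surfaces "nulls" that pandas counts as valid strings.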
 
💼
Process
  • Split location and headquarters into separate city and state columns
  • Parsed size range strings into min_employee, max_employee, and avg_company_size
  • Mapped verbose ownership labels into 8 clean groups using keyword matching
  • Extracted skill_python, skill_sql, skill_ml, and skill_aws from job description text using regex
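The size-range parsing step can be sketched as follows (the "X to Y employees" / "X+ employees" string formats are assumptions about the raw data):

```python
import re
import pandas as pd

# Hypothetical raw size strings; the formats are assumptions
df = pd.DataFrame({"size": ["51 to 200 employees", "10000+ employees", "1 to 50 employees"]})

def parse_size(text: str):
    """Parse a size-range string into (min, max) employee counts."""
    m = re.match(r"(\d+)\s*to\s*(\d+)", text)
    if m:
        return int(m.group(1)), int(m.group(2))
    m = re.match(r"(\d+)\+", text)
    if m:  # open-ended range: use the floor for both bounds
        return int(m.group(1)), int(m.group(1))
    return None, None  # unparseable string stays a true null

df[["min_employee", "max_employee"]] = df["size"].apply(lambda s: pd.Series(parse_size(s)))
df["avg_company_size"] = (df["min_employee"] + df["max_employee"]) / 2
```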
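The regex-based skill extraction can be reproduced with a small sketch; the patterns below are illustrative assumptions, not the project's exact expressions:

```python
import pandas as pd

# Hypothetical job descriptions
df = pd.DataFrame({"job_description": [
    "Experience with Python and SQL required; AWS a plus",
    "Deep learning / machine learning research background",
]})

# Word-boundary patterns keyed by the binary columns they produce
SKILL_PATTERNS = {
    "skill_python": r"\bpython\b",
    "skill_sql": r"\bsql\b",
    "skill_ml": r"\bmachine learning\b|\bdeep learning\b",
    "skill_aws": r"\baws\b",
}

for col, pattern in SKILL_PATTERNS.items():
    df[col] = (
        df["job_description"]
        .str.contains(pattern, case=False, regex=True)
        .astype(int)  # binary 0/1 flag per posting
    )
```

Word boundaries (`\b`) keep short tokens like "sql" from matching inside longer words.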
 
 
💡
Insights
  • df.info() showed 0 nulls — yet the data was far from clean, proving that null checks alone are never enough
  • 53 rows had a mismatch between rating and the rating embedded in company_name, raising a trust question about which source was more reliable
  • Structured data (salary range, employee count) stored as raw strings is a common scraping artifact — always check object-typed columns for hidden numeric data
  • Unstructured text (job descriptions) can be a rich source of structured signals — boolean skill columns extracted from free text are often more reliable than manually tagged fields
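The rating-mismatch count above can be checked with a sketch like this (the newline-delimited rating inside company_name is an assumption about the scraped format):

```python
import pandas as pd

# Hypothetical rows where company_name embeds a rating after a newline
df = pd.DataFrame({
    "company_name": ["Acme Corp\n3.8", "Globex\n4.2", "Initech"],
    "rating": [3.8, 4.0, -1.0],
})

# Pull a trailing one-decimal rating out of company_name, if present
embedded = df["company_name"].str.extract(r"\n(\d\.\d)$")[0].astype(float)

# Flag rows where both ratings exist but disagree
mismatch = embedded.notna() & (embedded != df["rating"])
n_mismatch = int(mismatch.sum())
```

Rows where the two sources disagree deserve manual review before either value is trusted.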
