Goals
- Identify and handle fake nulls disguised as -1 strings across multiple columns
- Parse composite columns (salary range, company name + rating, location, size) into structured fields
- Simplify high-cardinality categorical columns into clean, grouped labels
- Extract skill signals from unstructured job description text into binary columns
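A minimal sketch of the fake-null handling, assuming the placeholders are literal `"-1"` strings (the column names and toy data here are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with fake nulls stored as "-1" strings (illustrative data)
df = pd.DataFrame({
    "salary_range": ["$50K-$70K", "-1", "$80K-$100K"],
    "company_name": ["Acme\n4.1", "-1", "Beta Corp\n3.8"],
})

# df.isnull() sees nothing here, because "-1" is a real string, not NaN
fake_null_counts = df.apply(lambda col: (col == "-1").sum())

# Replace the sentinel with a proper NaN so pandas treats it as missing
df = df.replace("-1", np.nan)
```

Once the sentinels are real NaNs, the usual `isna()` / `dropna()` machinery works as expected.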
Process
- Split `location` and `headquarters` into separate city and state columns
- Parsed `size` range strings into `min_employee`, `max_employee`, and `avg_company_size`
- Mapped verbose ownership labels into 8 clean groups using keyword matching
- Extracted `skill_python`, `skill_sql`, `skill_ml`, and `skill_aws` from job description text using regex
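The steps above can be sketched in pandas as follows. The sample rows, the exact range format, and the keyword rules are assumptions for illustration (the real mapping used 8 ownership groups; only a subset is shown):

```python
import re
import pandas as pd

# Illustrative rows mimicking the scraped columns described above
df = pd.DataFrame({
    "location": ["Austin, TX", "New York, NY"],
    "size": ["51 to 200 employees", "1001 to 5000 employees"],
    "ownership": ["Company - Private", "Nonprofit Organization"],
    "job_description": ["Strong Python and SQL required",
                        "AWS and machine learning a plus"],
})

# 1. Split "City, ST" strings into two columns
df[["city", "state"]] = df["location"].str.split(", ", n=1, expand=True)

# 2. Parse "51 to 200 employees" into numeric min/max/avg
bounds = df["size"].str.extract(r"(\d+)\s+to\s+(\d+)").astype(float)
df["min_employee"], df["max_employee"] = bounds[0], bounds[1]
df["avg_company_size"] = (df["min_employee"] + df["max_employee"]) / 2

# 3. Collapse verbose ownership labels via keyword matching
def simplify_ownership(label: str) -> str:
    label = label.lower()
    if "private" in label:
        return "Private"
    if "public" in label:
        return "Public"
    if "nonprofit" in label:
        return "Nonprofit"
    return "Other"

df["ownership_group"] = df["ownership"].map(simplify_ownership)

# 4. Binary skill flags from free text (word-boundary, case-insensitive regex)
skills = {
    "skill_python": r"\bpython\b",
    "skill_sql": r"\bsql\b",
    "skill_ml": r"\bmachine learning\b|\bdeep learning\b",
    "skill_aws": r"\baws\b",
}
for col, pattern in skills.items():
    df[col] = (df["job_description"]
               .str.contains(pattern, flags=re.IGNORECASE, regex=True)
               .astype(int))
```

Word boundaries (`\b`) matter here: without them, a pattern like `sql` would also match inside words such as "MySQL", inflating the skill counts.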
Insights
- `df.info()` showed 0 nulls, yet the data was far from clean, proving that null checks alone are never enough
- 53 rows had a mismatch between `rating` and the rating embedded in `company_name`, raising a trust question about which source was more reliable
- Structured data (salary range, employee count) stored as raw strings is a common scraping artifact — always check object-typed columns for hidden numeric data
- Unstructured text (job descriptions) can be a rich source of structured signals — boolean skill columns extracted from free text are often more reliable than manually tagged fields
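The rating cross-check above can be sketched as follows, assuming the scrape appends the rating to the company name after a newline (e.g. `"Acme\n4.1"`, a common Glassdoor artifact); the sample data and the exact embedded format are assumptions:

```python
import pandas as pd

# Illustrative frame: company_name carries a trailing rating, as in "Acme\n4.1"
df = pd.DataFrame({
    "company_name": ["Acme\n4.1", "Beta Corp\n3.8", "Gamma LLC\n4.5"],
    "rating": [4.1, 3.9, 4.5],
})

# Pull the embedded rating out of the name (assumed "Name\n<rating>" format)
embedded = df["company_name"].str.extract(r"\n(\d\.\d)$")[0].astype(float)

# Flag rows where the two sources disagree
mismatch = df[embedded.ne(df["rating"])]
```

Counting the rows in `mismatch` is what surfaces the kind of discrepancy reported above; which source to trust then becomes a judgment call about the scrape pipeline.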