Goals
- Identify and handle fake nulls disguised as -1 strings across multiple columns
- Parse composite columns (salary range, company name + rating, location, size) into structured fields
- Simplify high-cardinality categorical columns into clean, grouped labels
- Extract skill signals from unstructured job description text into binary columns
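A minimal sketch of the fake-null handling, assuming the placeholders are literal `"-1"` strings (the column names and toy data here are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with fake nulls stored as "-1" strings (illustrative data)
df = pd.DataFrame({
    "salary_range": ["$50K-$70K", "-1", "$80K-$100K"],
    "company_name": ["Acme\n4.1", "-1", "Beta Corp\n3.8"],
})

# df.isnull() sees nothing here, because "-1" is a real string, not NaN
fake_null_counts = df.apply(lambda col: (col == "-1").sum())

# Replace the sentinel with a proper NaN so pandas treats it as missing
df = df.replace("-1", np.nan)
```

Once the sentinels are real NaNs, the usual `isna()` / `dropna()` machinery works as expected.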
Process
- Split `location` and `headquarters` into separate city and state columns
- Parsed `size` range strings into `min_employee`, `max_employee`, and `avg_company_size`
- Mapped verbose ownership labels into 8 clean groups using keyword matching
- Extracted `skill_python`, `skill_sql`, `skill_ml`, and `skill_aws` from job description text using regex
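The steps above can be sketched in pandas as follows. The sample rows, the exact range format, and the keyword rules are assumptions for illustration (the real mapping used 8 ownership groups; only a subset is shown):

```python
import re
import pandas as pd

# Illustrative rows mimicking the scraped columns described above
df = pd.DataFrame({
    "location": ["Austin, TX", "New York, NY"],
    "size": ["51 to 200 employees", "1001 to 5000 employees"],
    "ownership": ["Company - Private", "Nonprofit Organization"],
    "job_description": ["Strong Python and SQL required",
                        "AWS and machine learning a plus"],
})

# 1. Split "City, ST" strings into two columns
df[["city", "state"]] = df["location"].str.split(", ", n=1, expand=True)

# 2. Parse "51 to 200 employees" into numeric min/max/avg
bounds = df["size"].str.extract(r"(\d+)\s+to\s+(\d+)").astype(float)
df["min_employee"], df["max_employee"] = bounds[0], bounds[1]
df["avg_company_size"] = (df["min_employee"] + df["max_employee"]) / 2

# 3. Collapse verbose ownership labels via keyword matching
def simplify_ownership(label: str) -> str:
    label = label.lower()
    if "private" in label:
        return "Private"
    if "public" in label:
        return "Public"
    if "nonprofit" in label:
        return "Nonprofit"
    return "Other"

df["ownership_group"] = df["ownership"].map(simplify_ownership)

# 4. Binary skill flags from free text (word-boundary, case-insensitive regex)
skills = {
    "skill_python": r"\bpython\b",
    "skill_sql": r"\bsql\b",
    "skill_ml": r"\bmachine learning\b|\bdeep learning\b",
    "skill_aws": r"\baws\b",
}
for col, pattern in skills.items():
    df[col] = (df["job_description"]
               .str.contains(pattern, flags=re.IGNORECASE, regex=True)
               .astype(int))
```

Word boundaries (`\b`) matter here: without them, a pattern like `sql` would also match inside words such as "MySQL", inflating the skill counts.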
Insights
- `df.info()` showed 0 nulls, yet the data was far from clean, proving that null checks alone are never enough
- 53 rows had a mismatch between `rating` and the rating embedded in `company_name`, raising a trust question about which source was more reliable
- Structured data (salary range, employee count) stored as raw strings is a common scraping artifact — always check object-typed columns for hidden numeric data
- Unstructured text (job descriptions) can be a rich source of structured signals — boolean skill columns extracted from free text are often more reliable than manually tagged fields
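The rating cross-check above can be sketched as follows, assuming the scrape appends the rating to the company name after a newline (e.g. `"Acme\n4.1"`, a common Glassdoor artifact); the sample data and the exact embedded format are assumptions:

```python
import pandas as pd

# Illustrative frame: company_name carries a trailing rating, as in "Acme\n4.1"
df = pd.DataFrame({
    "company_name": ["Acme\n4.1", "Beta Corp\n3.8", "Gamma LLC\n4.5"],
    "rating": [4.1, 3.9, 4.5],
})

# Pull the embedded rating out of the name (assumed "Name\n<rating>" format)
embedded = df["company_name"].str.extract(r"\n(\d\.\d)$")[0].astype(float)

# Flag rows where the two sources disagree
mismatch = df[embedded.ne(df["rating"])]
```

Counting the rows in `mismatch` is what surfaces the kind of discrepancy reported above; which source to trust then becomes a judgment call about the scrape pipeline.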