Goals
- Identify which physical measurements of cell nuclei are most strongly associated with malignant diagnosis
- Compare mean, standard error, and worst feature categories to determine which group has the highest predictive power
- Visualize the separation between benign and malignant tumors across 30 features using multiple chart types
- Practice EDA on real medical data and translate statistical findings into meaningful clinical insights
Process
- Dropped the empty `Unnamed: 32` column and the non-analytical `id` column, then encoded diagnosis as a numeric variable for correlation analysis
- Computed the correlation of each of the 30 features with diagnosis, identifying `concave points_worst` as the strongest predictor with a correlation of 0.79
- Built 20 visualizations across box plots, histograms, scatter plots, violin plots, swarm plots, a pairplot, radar chart, and correlation charts
- Compared all three feature categories (mean, SE, worst) side by side to determine predictive ranking
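The cleaning, correlation, and category-comparison steps above can be sketched roughly as follows. This is a minimal pandas sketch and not necessarily the exact notebook code: the function names are hypothetical, the M = 1 / B = 0 encoding is an assumption, and the category ranking uses mean absolute correlation as one plausible way to compare the groups.

```python
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Drop non-analytical columns and encode diagnosis as 0/1."""
    # errors="ignore" keeps this safe if a column is already absent
    out = df.drop(columns=["Unnamed: 32", "id"], errors="ignore").copy()
    # Assumed encoding: M (malignant) -> 1, B (benign) -> 0
    out["diagnosis"] = out["diagnosis"].map({"M": 1, "B": 0})
    return out


def diagnosis_correlations(df: pd.DataFrame) -> pd.Series:
    """Correlation of each feature with diagnosis, strongest first."""
    features = df.drop(columns="diagnosis")
    corr = features.corrwith(df["diagnosis"])
    # Sort by absolute value so strong negative correlations also rank high
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


def category_summary(corr: pd.Series) -> pd.Series:
    """Mean absolute correlation per feature category (mean, se, worst)."""
    # Column names end in _mean, _se, or _worst, e.g. "concave points_worst"
    category = [name.rsplit("_", 1)[-1] for name in corr.index]
    return corr.abs().groupby(category).mean().sort_values(ascending=False)
```

With the CSV loaded into a DataFrame, `category_summary(diagnosis_correlations(prepare(df)))` would give one number per category, which makes the mean vs SE vs worst ranking easy to read off.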
Insights
- Malignant tumors are substantially larger: the average `area_mean` is 111% higher in malignant cases than in benign ones
- Shape irregularity is the strongest differentiator — concave points and concavity show up to 250% difference between benign and malignant
- Mean fractal dimension is the only feature where the benign and malignant averages are virtually identical (0.0629 vs 0.0627), so boundary complexity alone cannot distinguish tumor type
- Worst features consistently outperform mean features in predictive power — the extreme measurement of a sample is more diagnostic than the average
- SE features are the weakest predictors overall — variability within a sample adds little diagnostic value
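The percentage figures above read as relative differences between group means, that is, (malignant mean - benign mean) / benign mean. A small sketch of that calculation is below. It assumes diagnosis is already encoded as 1 = malignant and 0 = benign, and `percent_difference` is a hypothetical helper name.

```python
import pandas as pd


def percent_difference(df: pd.DataFrame) -> pd.Series:
    """Percent by which each malignant mean exceeds the benign mean."""
    # Assumes diagnosis is already encoded as 1 = malignant, 0 = benign
    means = df.groupby("diagnosis").mean(numeric_only=True)
    benign, malignant = means.loc[0], means.loc[1]
    # Positive values mean the feature is higher in malignant cases
    return ((malignant - benign) / benign * 100).sort_values(ascending=False)
```

A value near 0 (as reported for fractal dimension) means the two group averages are almost the same, while a value above 100 means the malignant average is more than double the benign one.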
What I Learned
- Working with real medical data requires more care in how you interpret and present findings — numbers represent real patients
- Not all features are equal — 30 features sounds like a lot, but only 8–10 carry meaningful predictive signal
- The radar chart required normalization before it worked correctly — a lesson in how scale affects visualization
- Worst features being stronger predictors than mean features was not obvious before the analysis — the data revealed it
- This project made me curious about what comes next — training a classification model to actually predict diagnosis using these features
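On the radar chart point: features like area (hundreds) and smoothness (fractions) cannot share one radial axis until they are on a common scale. The sketch below shows one way to do that, assuming min-max scaling to [0, 1]; the original notebook may have used a different normalization, and `minmax_group_means` is a hypothetical helper name.

```python
import pandas as pd


def minmax_group_means(df: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    """Scale features to [0, 1], then average them per diagnosis group."""
    selected = df[features]
    # A constant column would have zero range and produce NaN here
    value_range = selected.max() - selected.min()
    scaled = (selected - selected.min()) / value_range
    # One row per diagnosis value, one column per feature
    return scaled.groupby(df["diagnosis"]).mean()
```

Each row of the result can then be plotted as one polygon on a polar axis, so the benign and malignant shapes are directly comparable.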
This dataset is a perfect candidate for a machine learning classification project. Next step: build a logistic regression or random forest model to predict breast cancer diagnosis using the top features identified in this EDA.
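As a rough starting point for that next step, a logistic regression baseline might look like the sketch below. It uses scikit-learn's bundled copy of the Wisconsin breast cancer data so it runs without a CSV; that copy names columns differently (`worst concave points` rather than `concave points_worst`) and encodes the target as 0 = malignant, 1 = benign. The three-feature selection is a placeholder, since only `concave points_worst` is confirmed above as a top feature, and `fit_baseline` is a hypothetical helper name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_baseline(features: list[str], seed: int = 0) -> float:
    """Train logistic regression on the chosen columns; return test accuracy."""
    data = load_breast_cancer(as_frame=True)
    # Note: in this bundled copy, 0 = malignant and 1 = benign
    X, y = data.data[features], data.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Scaling first keeps large-valued features from dominating the fit
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Placeholder feature choice; swap in the features ranked highest by the EDA
accuracy = fit_baseline(["worst concave points", "worst perimeter", "worst radius"])
print(f"Held-out accuracy: {accuracy:.3f}")
```

Accuracy alone is a thin metric for medical data, so a fuller version would also look at recall on malignant cases.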