Breast Cancer Diagnostic Dataset

🥅

Goals

Identify which physical measurements of cell nuclei are most strongly associated with a malignant diagnosis
Compare mean, standard error, and worst feature categories to determine which group has the highest predictive power
Visualize the separation between benign and malignant tumors across 30 features using multiple chart types
Practice EDA on real medical data and translate statistical findings into meaningful clinical insights

💼

Process

Dropped the empty Unnamed: 32 column and the non-analytical id column, then encoded diagnosis as a numeric variable for correlation analysis
Computed the correlation of each of the 30 features with diagnosis, identifying concave points_worst as the strongest predictor (correlation of 0.79)
Built 20 visualizations across box plots, histograms, scatter plots, violin plots, swarm plots, a pairplot, radar chart, and correlation charts
Compared all three feature categories (mean, SE, worst) side by side to determine predictive ranking
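
The cleaning and correlation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: it assumes pandas, a diagnosis column holding "M"/"B" labels, and the column names mentioned above. The helper names prepare and rank_by_correlation are hypothetical, and the small in-memory frame is made-up toy data rather than real patient records.

```python
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Drop non-analytical columns and encode diagnosis as 0/1."""
    # errors="ignore" keeps the helper usable if a column is already gone.
    cleaned = df.drop(columns=["id", "Unnamed: 32"], errors="ignore").copy()
    # Assumed encoding: M (malignant) -> 1, B (benign) -> 0.
    cleaned["diagnosis"] = cleaned["diagnosis"].map({"M": 1, "B": 0})
    return cleaned


def rank_by_correlation(cleaned: pd.DataFrame) -> pd.Series:
    """Correlation of every feature with diagnosis, strongest first."""
    corr = cleaned.corr()["diagnosis"].drop("diagnosis")
    # Sort by absolute value so strong negative correlations also rank high.
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy rows for illustration only (not values from the real dataset).
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6],
        "diagnosis": ["M", "M", "M", "B", "B", "B"],
        "concave points_worst": [0.26, 0.19, 0.24, 0.07, 0.10, 0.05],
        "fractal_dimension_mean": [0.061, 0.064, 0.063, 0.062, 0.064, 0.063],
        "Unnamed: 32": [float("nan")] * 6,
    }
)

cleaned = prepare(toy)
ranking = rank_by_correlation(cleaned)
print(ranking)
```

With the real CSV, the same two helpers would be applied to the frame returned by pd.read_csv instead of the toy frame.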

✨

Insights

Malignant tumors are substantially larger: area mean is 111% higher in malignant cases than in benign ones
Shape irregularity is the strongest differentiator: concave points and concavity are up to 250% higher in malignant cases than in benign ones
Mean fractal dimension is the only feature where the benign and malignant averages are practically identical (0.0629 vs 0.0627), so boundary complexity alone cannot distinguish tumor type
Worst features consistently outperform mean features in predictive power — the extreme measurement of a sample is more diagnostic than the average
SE features are the weakest predictors overall — variability within a sample adds little diagnostic value
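
The percentage gaps and the category ranking in these insights come down to two small calculations. The sketch below shows one plausible way to compute them, assuming diagnosis is already encoded as 1 = malignant and 0 = benign and that a feature's category is its column suffix (_mean, _se, _worst). The function names are hypothetical and every number in the example is a made-up toy value, not a result from the dataset.

```python
import pandas as pd


def pct_difference(df: pd.DataFrame, label: str = "diagnosis") -> pd.Series:
    """Percent by which the malignant mean exceeds the benign mean, per feature."""
    means = df.groupby(label).mean()
    # Assumes 1 = malignant and 0 = benign in the label column.
    return (means.loc[1] - means.loc[0]) / means.loc[0] * 100


def category_strength(corr: pd.Series) -> pd.Series:
    """Average absolute correlation per feature category, strongest first."""
    # The category is read from the suffix, e.g. "area_worst" -> "worst".
    by_suffix = corr.abs().groupby(lambda name: name.rsplit("_", 1)[-1])
    return by_suffix.mean().sort_values(ascending=False)


# Toy measurements (illustrative only).
toy = pd.DataFrame(
    {
        "diagnosis": [1, 1, 0, 0],
        "area_mean": [900.0, 1100.0, 400.0, 600.0],
    }
)
pct = pct_difference(toy)
print(pct)

# Toy feature-to-diagnosis correlations (illustrative only).
toy_corr = pd.Series(
    {
        "area_mean": 0.7,
        "area_se": 0.5,
        "area_worst": 0.8,
        "texture_mean": 0.4,
        "texture_se": -0.1,
        "texture_worst": 0.5,
    }
)
strength = category_strength(toy_corr)
print(strength)
```

Averaging absolute correlations is only one way to rank the categories; comparing the median or the single best feature per category would be equally reasonable.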

What I Learned

Working with real medical data requires more care in how you interpret and present findings — numbers represent real patients
Not all features are equal — 30 features sounds like a lot, but only 8–10 carry meaningful predictive signal
The radar chart required normalization before it worked correctly — a lesson in how scale affects visualization
Worst features being stronger predictors than mean features was not obvious before the analysis — the data revealed it
This project made me curious about what comes next — training a classification model to actually predict diagnosis using these features
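
To illustrate the radar chart point above: features such as area and smoothness live on very different scales, so their raw group means cannot share one radial axis. The sketch below assumes min-max scaling to the 0 to 1 range, which is one common choice; the write-up does not say which normalization was actually used. The function name and toy values are hypothetical.

```python
import pandas as pd


def min_max_scale(features: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to the [0, 1] range."""
    span = features.max() - features.min()
    # A constant column has zero span; use 1 instead so it scales to 0, not NaN.
    span = span.replace(0, 1)
    return (features - features.min()) / span


# Toy values (illustrative only): area is in the hundreds, smoothness near 0.1.
toy = pd.DataFrame(
    {
        "diagnosis": [1, 1, 0, 0],
        "area_mean": [900.0, 1100.0, 400.0, 600.0],
        "smoothness_mean": [0.11, 0.10, 0.09, 0.08],
    }
)

scaled = min_max_scale(toy.drop(columns="diagnosis"))
# One row per diagnosis group: these are the values a radar chart would plot.
profile = scaled.groupby(toy["diagnosis"]).mean()
print(profile)
```

After scaling, both features contribute comparably to each group's outline instead of area dominating the chart.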

This dataset is a perfect candidate for a machine learning classification project. Next step: build a logistic regression or random forest model to predict breast cancer diagnosis using the top features identified in this EDA.
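
A possible starting point for that next step is sketched below. It makes several assumptions: it uses the copy of the Wisconsin diagnostic data bundled with scikit-learn (whose column names differ from the CSV, for example "worst concave points"), it picks the top 8 features by absolute correlation as a stand-in for the features identified in this EDA, and it uses a standardized logistic regression with default settings. None of these choices are taken from the original analysis.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X = data.data
# scikit-learn codes malignant as 0 and benign as 1; flip so malignant = 1.
y = 1 - data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Rank features on the training rows only, so the test set stays unseen.
corr = X_train.corrwith(y_train).abs().sort_values(ascending=False)
top_features = corr.index[:8].tolist()  # 8 is an assumed cut-off

# Standardize, then fit a plain logistic regression on the selected features.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train[top_features], y_train)

accuracy = model.score(X_test[top_features], y_test)
print(top_features)
print(f"Hold-out accuracy: {accuracy:.3f}")
```

For a medical use case, accuracy alone is a weak yardstick; recall on the malignant class and cross-validated scores would be the natural things to check next.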