Breast Cancer Diagnostic Dataset

Tool: Python
Project Type: Data Cleaning, Data Analysis
Source: kaggle.com


🥅
Goals
  • Identify which physical measurements of cell nuclei are most strongly associated with malignant diagnosis
  • Compare mean, standard error, and worst feature categories to determine which group has the highest predictive power
  • Visualize the separation between benign and malignant tumors across 30 features using multiple chart types
  • Practice EDA on real medical data and translate statistical findings into meaningful clinical insights
 
💼
Process
  • Dropped the empty Unnamed: 32 column and the non-analytical id column, then encoded diagnosis as a numeric variable for correlation analysis
  • Computed correlations between all 30 features and diagnosis, identifying concave points_worst as the feature most strongly correlated with diagnosis at 0.79
  • Built 20 visualizations across box plots, histograms, scatter plots, violin plots, swarm plots, a pairplot, radar chart, and correlation charts
  • Compared all three feature categories (mean, SE, worst) side by side to determine predictive ranking
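The cleaning, correlation, and category-comparison steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the helper names (clean_and_encode, correlation_with_diagnosis, rank_categories) are hypothetical, the encoding M = 1 / B = 0 is an assumption, and the small table is made-up toy data used only so the example runs without the Kaggle file.

```python
import numpy as np
import pandas as pd


def clean_and_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Drop non-analytical columns and encode diagnosis as 0/1."""
    out = df.drop(columns=["id", "Unnamed: 32"], errors="ignore").copy()
    # Assumption: malignant (M) -> 1, benign (B) -> 0.
    out["diagnosis"] = out["diagnosis"].map({"M": 1, "B": 0})
    return out


def correlation_with_diagnosis(df: pd.DataFrame) -> pd.Series:
    """Correlation of each feature with diagnosis, largest magnitude first."""
    corr = df.corr()["diagnosis"].drop("diagnosis")
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def feature_category(column: str) -> str:
    """Map a column such as 'radius_worst' to its suffix: mean, se, or worst."""
    return column.rsplit("_", 1)[-1]


def rank_categories(corr: pd.Series) -> pd.Series:
    """Average absolute correlation per feature category, highest first."""
    return corr.abs().groupby(feature_category).mean().sort_values(ascending=False)


# Made-up toy rows that only mimic the column layout of the Kaggle CSV.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "diagnosis": ["M", "B", "M", "B", "B", "M"],
    "radius_mean": [18.0, 12.1, 20.5, 11.4, 13.0, 17.2],
    "radius_se": [0.9, 0.3, 0.5, 0.4, 0.6, 0.4],
    "radius_worst": [25.1, 13.5, 27.0, 12.8, 14.9, 22.3],
    "Unnamed: 32": [np.nan] * 6,
})

cleaned = clean_and_encode(toy)
corr = correlation_with_diagnosis(cleaned)
ranking = rank_categories(corr)
print(corr)
print(ranking)
```

With the real data, the toy frame would presumably be replaced by the CSV downloaded from Kaggle and loaded with pd.read_csv; the rest of the sketch stays the same.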
 
Insights
  • Malignant tumors are significantly larger — area mean is 111% higher in malignant cases compared to benign
  • Shape irregularity is the strongest differentiator: concave points and concavity are up to 250% higher in malignant cases than in benign ones
  • Fractal dimension is the only feature whose average is practically identical for benign and malignant tumors (0.0629 vs 0.0627), so boundary complexity alone cannot distinguish tumor type
  • Worst features generally outperform their mean counterparts in predictive power: the most extreme measurements in a sample tend to be more diagnostic than the average
  • SE features are the weakest predictors overall — variability within a sample adds little diagnostic value
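The "X% higher" figures above are relative differences between group means: (malignant mean - benign mean) / benign mean. A hedged sketch of that arithmetic is below; the function name percent_difference is hypothetical and the numbers are made-up toy values, not values from the dataset. Note that a result above 100% means the malignant average is more than double the benign one.

```python
import pandas as pd


def percent_difference(df: pd.DataFrame, label_col: str = "diagnosis",
                       malignant: str = "M", benign: str = "B") -> pd.Series:
    """Percent by which each feature's malignant mean exceeds its benign mean."""
    means = df.groupby(label_col).mean(numeric_only=True)
    pct = (means.loc[malignant] - means.loc[benign]) / means.loc[benign] * 100
    return pct.sort_values(ascending=False)


# Made-up toy values chosen for easy arithmetic, not real measurements.
toy = pd.DataFrame({
    "diagnosis": ["B", "B", "M", "M"],
    "area_mean": [90.0, 110.0, 240.0, 260.0],      # group means: 100 vs 250
    "texture_mean": [10.0, 10.0, 10.0, 12.0],      # group means: 10 vs 11
})

pct = percent_difference(toy)
print(pct)
```

Sorting the result makes it easy to see which features separate the two groups most on average, although a large percent difference says nothing about how much the two distributions overlap.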
What I Learned
  • Working with real medical data requires more care in how you interpret and present findings — numbers represent real patients
  • Not all features are equal — 30 features sounds like a lot, but only 8–10 carry meaningful predictive signal
  • The radar chart required normalization before it worked correctly — a lesson in how scale affects visualization
  • Worst features being stronger predictors than mean features was not obvious before the analysis — the data revealed it
  • This project made me curious about what comes next — training a classification model to actually predict diagnosis using these features
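The radar chart point can be illustrated with a short sketch. It assumes min-max scaling, which is only one possible choice (the write-up does not say which normalization was used), and the helper names and toy values are hypothetical. Without rescaling, a feature measured in the hundreds (such as area) would dwarf one measured in hundredths (such as smoothness) on a shared radial axis.

```python
import pandas as pd


def min_max_normalize(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Rescale each feature column to the range [0, 1]."""
    feats = df[feature_cols]
    return (feats - feats.min()) / (feats.max() - feats.min())


def radar_profile(df: pd.DataFrame, feature_cols: list,
                  label_col: str = "diagnosis") -> pd.DataFrame:
    """Mean of the normalized features per diagnosis: one radar polygon per row."""
    scaled = min_max_normalize(df, feature_cols)
    return scaled.groupby(df[label_col]).mean()


# Made-up toy values on very different scales.
toy = pd.DataFrame({
    "diagnosis": ["B", "B", "M", "M"],
    "area_mean": [400.0, 500.0, 900.0, 1000.0],
    "smoothness_mean": [0.08, 0.09, 0.10, 0.12],
})

profile = radar_profile(toy, ["area_mean", "smoothness_mean"])
print(profile)
```

Each row of profile could then be drawn as one closed polygon on a polar axis, with every feature sharing the same 0 to 1 radial scale.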

This dataset is a perfect candidate for a machine learning classification project. Next step: build a logistic regression or random forest model to predict breast cancer diagnosis using the top features identified in this EDA.
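As a rough idea of what that next step might look like, here is a minimal logistic regression baseline. Several things are assumptions rather than part of this project: it loads the copy of the Wisconsin diagnostic data bundled with scikit-learn instead of the Kaggle CSV (so column names use spaces, e.g. "worst concave points"), and the four-feature shortlist is illustrative, with only concave points_worst confirmed by the EDA as the top correlate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)

# Illustrative shortlist (assumption); swap in the features the EDA ranks highest.
top_features = [
    "worst concave points",
    "worst perimeter",
    "mean concave points",
    "worst radius",
]
X = data.data[top_features]

# scikit-learn codes malignant as 0 and benign as 1; recode so malignant = 1.
y = (data.target == 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```

Accuracy alone is a coarse summary for a medical problem; a fuller version would also look at recall on malignant cases, since a missed malignancy is the costlier error.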
 