
2024

Diabetes Risk Prediction — ML Study

Academic project · Python · scikit-learn

Python · scikit-learn · pandas · KNN · SVM · Random Forest · K-Means


Project

Authors: Pierre-Antoine Faribaud & Maïna Cerede

Dataset: Pima Indians Diabetes Database — 768 patients, 8 physiological features (Glucose, BMI, Insulin, Age, Pregnancies, BloodPressure, SkinThickness, DiabetesPedigreeFunction), binary outcome (diabetic / not diabetic).

Objective: build and compare supervised classifiers for diabetes risk prediction; identify unsupervised risk profiles with clustering.

Data Exploration and Preprocessing

The dataset contains zero-value anomalies in physiological variables that are biologically impossible (e.g., Glucose = 0, BMI = 0). These were treated as missing values and imputed with the median for each outcome class.
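This class-wise median imputation can be sketched as follows. A minimal example with a toy frame standing in for the Pima data (the column names match the dataset, the values here are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Pima data; column names match the real dataset.
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 0, 137],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1, 0.0],
    "Outcome": [1, 0, 1, 0, 1, 1],
})

# Zeros in these columns are biologically impossible: treat as missing.
cols = ["Glucose", "BMI"]
df[cols] = df[cols].replace(0, np.nan)

# Impute each column with the median of its outcome class.
df[cols] = df.groupby("Outcome")[cols].transform(lambda s: s.fillna(s.median()))
```

Imputing per class rather than globally preserves the separation between the two outcome groups instead of pulling imputed values toward a pooled median.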

Feature distributions were analysed by class: Glucose shows the clearest separation between diabetic and non-diabetic patients.
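A quick way to see that separation is a per-class summary; again with illustrative values, not the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 197, 137],
    "Outcome": [1, 0, 1, 0, 1, 1],
})

# A large gap between the class means/medians is what makes
# Glucose the most discriminative feature.
summary = df.groupby("Outcome")["Glucose"].agg(["mean", "median"])
print(summary)
```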

Supervised Classification

Three classifiers were trained and tuned:

K-Nearest Neighbours (KNN)

  • Optimal k found by cross-validation: k = 11
  • Accuracy: 73.2% on the held-out test set
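The k-selection step can be sketched with `GridSearchCV`; synthetic data of the same shape (768 samples, 8 features, roughly 35% positives) stands in for the Pima features, so the selected k will differ from the study's k = 11:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pima-like synthetic stand-in: 768 samples, 8 features, ~35% positives.
X, y = make_classification(n_samples=768, n_features=8, weights=[0.65],
                           random_state=0)

# KNN is distance-based, so scale first; search odd k via 5-fold CV.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe,
                    {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Scaling inside the pipeline matters: fitting the scaler within each CV fold avoids leaking test-fold statistics into the k selection.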

Support Vector Machine (SVM)

  • RBF kernel, grid search on C and γ
  • Accuracy: 75.3% — best classifier on this dataset
  • Achieves the highest AUC on the ROC curve
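The RBF-kernel grid search over C and γ can be sketched like this, again on a synthetic stand-in (the real accuracy of 75.3% comes from the Pima data, not this sketch):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=768, n_features=8, weights=[0.65],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# RBF kernel; joint grid over the penalty C and the kernel width gamma.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10],
                           "svc__gamma": ["scale", 0.01, 0.1]}, cv=5)
grid.fit(X_tr, y_tr)
acc = grid.score(X_te, y_te)  # held-out accuracy
```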

Random Forest

  • 100 estimators, max depth tuned
  • Accuracy: 74.0%
  • Advantage: provides feature importance ranking (Glucose > BMI > Age)
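The feature-importance ranking falls out of the fitted forest directly. A minimal sketch on synthetic data (on the real data the top features were Glucose > BMI > Age):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=768, n_features=8, n_informative=3,
                           random_state=0)

# 100 trees as in the study; max_depth would be tuned by CV in practice.
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
rf.fit(X, y)

# Indices of features, most important first (impurity-based importances).
ranking = rf.feature_importances_.argsort()[::-1]
```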

Unsupervised: K-Means Clustering

K-Means (k = 2) applied to Glucose and BMI (the two most informative features):

  • Cluster 1 — Low Glucose, Low BMI: predominantly non-diabetic patients
  • Cluster 2 — High Glucose, High BMI: predominantly diabetic patients

The cluster boundaries align closely with the SVM decision boundary, suggesting that the two-class structure is genuinely present in the geometry of the data rather than being an artifact of the labels.
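The clustering step can be sketched as follows; two synthetic blobs stand in for the (Glucose, BMI) pairs, with the means chosen only for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic (Glucose, BMI) pairs: a low group and a high group.
low = rng.normal([100, 28], [10, 3], size=(200, 2))
high = rng.normal([160, 38], [10, 3], size=(100, 2))
X = np.vstack([low, high])

# k = 2 on standardised features, as in the study.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(StandardScaler().fit_transform(X))
```

Standardising first matters here too: Glucose spans a much wider numeric range than BMI and would otherwise dominate the Euclidean distances.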

Conclusions

| Model | Accuracy |
|-------|----------|
| KNN | 73.2% |
| Random Forest | 74.0% |
| SVM (RBF) | 75.3% |

SVM with an RBF kernel is the best classifier on this dataset. The main limiting factors are the small sample size (768 patients) and the class imbalance (~35% positive). Future work: SMOTE oversampling, feature engineering from domain knowledge, or a neural approach once more data is available.

What This Demonstrates

This project shows the full ML pipeline: data cleaning, EDA, feature analysis, supervised model training with cross-validation, and unsupervised validation — not just "run scikit-learn and report accuracy."