
2024

Diabetes Risk Prediction — ML Study

Academic project · Python · scikit-learn

Python · scikit-learn · pandas · KNN · SVM · Random Forest · K-Means


Project

Authors: Pierre-Antoine Faribaud & Maïna Cerede

Dataset: Pima Indians Diabetes Database — 768 patients, 8 physiological features (Glucose, BMI, Insulin, Age, Pregnancies, BloodPressure, SkinThickness, DiabetesPedigreeFunction), binary outcome (diabetic / not diabetic).

Objective: build and compare supervised classifiers for diabetes risk prediction; identify unsupervised risk profiles with clustering.

Data Exploration and Preprocessing

The dataset contains zero-value anomalies in physiological variables that are biologically impossible (e.g., Glucose = 0, BMI = 0). These were treated as missing values and imputed with the median for each outcome class.
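This class-wise median imputation can be sketched as follows. A minimal example with a toy frame standing in for the Pima data (the column names match the dataset, the values here are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Pima data; column names match the real dataset.
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 0, 137],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1, 0.0],
    "Outcome": [1, 0, 1, 0, 1, 1],
})

# Zeros in these columns are biologically impossible: treat as missing.
cols = ["Glucose", "BMI"]
df[cols] = df[cols].replace(0, np.nan)

# Impute each column with the median of its outcome class.
df[cols] = df.groupby("Outcome")[cols].transform(lambda s: s.fillna(s.median()))
```

Imputing per class rather than globally preserves the separation between the two outcome groups instead of pulling imputed values toward a pooled median.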

Feature distributions were analysed by class: Glucose shows the clearest separation between diabetic and non-diabetic patients.
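A quick way to see that separation is a per-class summary; again with illustrative values, not the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 197, 137],
    "Outcome": [1, 0, 1, 0, 1, 1],
})

# A large gap between the class means/medians is what makes
# Glucose the most discriminative feature.
summary = df.groupby("Outcome")["Glucose"].agg(["mean", "median"])
print(summary)
```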

Supervised Classification

Three classifiers were trained and tuned:

K-Nearest Neighbours (KNN)

  • Optimal k found by cross-validation: k = 11
  • Accuracy: 73.2% on the held-out test set
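The k-selection step can be sketched with `GridSearchCV`; synthetic data of the same shape (768 samples, 8 features, roughly 35% positives) stands in for the Pima features, so the selected k will differ from the study's k = 11:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pima-like synthetic stand-in: 768 samples, 8 features, ~35% positives.
X, y = make_classification(n_samples=768, n_features=8, weights=[0.65],
                           random_state=0)

# KNN is distance-based, so scale first; search odd k via 5-fold CV.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe,
                    {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Scaling inside the pipeline matters: fitting the scaler within each CV fold avoids leaking test-fold statistics into the k selection.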

Support Vector Machine (SVM)

  • RBF kernel, grid search on C and γ
  • Accuracy: 75.3% — best classifier on this dataset
  • Achieves the highest AUC on the ROC curve
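The RBF-kernel grid search over C and γ can be sketched like this, again on a synthetic stand-in (the real accuracy of 75.3% comes from the Pima data, not this sketch):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=768, n_features=8, weights=[0.65],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# RBF kernel; joint grid over the penalty C and the kernel width gamma.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10],
                           "svc__gamma": ["scale", 0.01, 0.1]}, cv=5)
grid.fit(X_tr, y_tr)
acc = grid.score(X_te, y_te)  # held-out accuracy
```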

Random Forest

  • 100 estimators, max depth tuned
  • Accuracy: 74.0%
  • Advantage: provides feature importance ranking (Glucose > BMI > Age)
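The feature-importance ranking falls out of the fitted forest directly. A minimal sketch on synthetic data (on the real data the top features were Glucose > BMI > Age):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=768, n_features=8, n_informative=3,
                           random_state=0)

# 100 trees as in the study; max_depth would be tuned by CV in practice.
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
rf.fit(X, y)

# Indices of features, most important first (impurity-based importances).
ranking = rf.feature_importances_.argsort()[::-1]
```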

Unsupervised: K-Means Clustering

K-Means (k = 2) applied to Glucose and BMI (the two most informative features):

  • Cluster 1 — Low Glucose, Low BMI: predominantly non-diabetic patients
  • Cluster 2 — High Glucose, High BMI: predominantly diabetic patients

The cluster boundaries align closely with the SVM decision boundary, suggesting that the two-class structure is genuinely present in the geometry of the data rather than being an artifact of the labels.
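The clustering step can be sketched as follows; two synthetic blobs stand in for the (Glucose, BMI) pairs, with the means chosen only for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic (Glucose, BMI) pairs: a low group and a high group.
low = rng.normal([100, 28], [10, 3], size=(200, 2))
high = rng.normal([160, 38], [10, 3], size=(100, 2))
X = np.vstack([low, high])

# k = 2 on standardised features, as in the study.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(StandardScaler().fit_transform(X))
```

Standardising first matters here too: Glucose spans a much wider numeric range than BMI and would otherwise dominate the Euclidean distances.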

Conclusions

| Model | Accuracy |
|-------|----------|
| KNN | 73.2% |
| Random Forest | 74.0% |
| SVM (RBF) | 75.3% |

SVM with an RBF kernel is the best classifier on this dataset. The main limiting factors are the small sample size (768 patients) and the class imbalance (~35% positive). Future work: SMOTE oversampling, feature engineering from domain knowledge, or a neural approach once more data is available.

What This Demonstrates

This project shows the full ML pipeline: data cleaning, EDA, feature analysis, supervised model training with cross-validation, and unsupervised validation — not just "run scikit-learn and report accuracy."