
2024
Diabetes Risk Prediction — ML Study
Academic project · Python · scikit-learn
Authors: Pierre-Antoine Faribaud & Maïna Cerede
Dataset: Pima Indians Diabetes Database — 768 patients, 8 physiological features (Glucose, BMI, Insulin, Age, Pregnancies, BloodPressure, SkinThickness, DiabetesPedigreeFunction), binary outcome (diabetic / not diabetic).
Objective: build and compare supervised classifiers for diabetes risk prediction; identify unsupervised risk profiles with clustering.
Data Exploration and Preprocessing
The dataset contains biologically impossible zero values in several physiological variables (e.g., Glucose = 0, BMI = 0). These were treated as missing values and imputed with the median of each outcome class, as sketched below.
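A minimal sketch of this step, assuming the standard Pima column names and a local `diabetes.csv` file (the filename is a placeholder):

```python
import numpy as np
import pandas as pd

# Load the Pima dataset; the filename is a placeholder.
df = pd.read_csv("diabetes.csv")

# Zeros in these physiological columns are biologically impossible,
# so treat them as missing values.
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols] = df[cols].replace(0, np.nan)

# Impute each column with the median of its own outcome class (0 or 1),
# keeping diabetic and non-diabetic medians separate.
df[cols] = df.groupby("Outcome")[cols].transform(lambda s: s.fillna(s.median()))
```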
Feature distributions were analysed by class: Glucose shows the clearest separation between diabetic and non-diabetic patients.
Supervised Classification
Three classifiers were trained and tuned:
K-Nearest Neighbours (KNN)
- Optimal k found by cross-validation: k = 11 (see the sketch after this list)
- Accuracy: 73.2% on the held-out test set
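A sketch of the k selection, reusing `df` from the preprocessing sketch; the stratified split, 5-fold CV, and k range here are illustrative assumptions, not the study's exact settings:

```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Held-out test set; stratify to preserve the ~35% positive rate.
X = df.drop(columns="Outcome")
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# KNN is distance-based, so standardise inside the pipeline, then
# score each odd k by cross-validated accuracy on the training set.
scores = {
    k: cross_val_score(
        make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k)),
        X_train, y_train, cv=5,
    ).mean()
    for k in range(1, 31, 2)
}
best_k = max(scores, key=scores.get)  # the write-up reports k = 11
```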
Support Vector Machine (SVM)
- RBF kernel, grid search on C and γ (sketched below)
- Accuracy: 75.3% — best classifier on this dataset
- Achieves the highest AUC on the ROC curve
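The grid search could look like the following; the C and γ grids are illustrative placeholders, not the values actually searched:

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 5-fold grid search over C and gamma for an RBF-kernel SVM.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X_train, y_train)

# Test accuracy, plus ROC AUC computed from the SVM margin scores.
print(grid.score(X_test, y_test))
print(roc_auc_score(y_test, grid.decision_function(X_test)))
```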
Random Forest
- 100 estimators, maximum depth tuned (sketched below)
- Accuracy: 74.0%
- Advantage: provides feature importance ranking (Glucose > BMI > Age)
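A sketch of the forest and its importance ranking, reusing the split from the KNN sketch; `max_depth=5` stands in for the tuned value:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# 100 trees; max_depth=5 is a placeholder for the tuned depth.
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based feature importance ranking; the study reports
# Glucose > BMI > Age at the top.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```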
Unsupervised: K-Means Clustering
K-Means (k = 2) was applied to Glucose and BMI, the two most informative features (sketched below):
- Cluster 1 — Low Glucose, Low BMI: predominantly non-diabetic patients
- Cluster 2 — High Glucose, High BMI: predominantly diabetic patients
The cluster boundaries align closely with the SVM decision boundary, suggesting that the two-class structure is genuinely present in the data geometry.
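A minimal version of the clustering step, again reusing `df` from the preprocessing sketch; the cross-tabulation at the end is one way to check the "predominantly diabetic / non-diabetic" reading of the clusters:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Cluster on the two most informative features only.
X2 = StandardScaler().fit_transform(df[["Glucose", "BMI"]])
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X2)

# Cluster indices are arbitrary, so cross-tabulate against the true
# labels to see which cluster is the predominantly diabetic one.
print(pd.crosstab(km.labels_, df["Outcome"]))
```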
Conclusions
| Model | Accuracy |
|---------------|----------|
| KNN | 73.2% |
| Random Forest | 74.0% |
| SVM (RBF) | 75.3% |
SVM with an RBF kernel is the best classifier for this dataset. The main limiting factors are the small dataset size (768 samples) and the class imbalance (35% positive). Future work: SMOTE oversampling (sketched below), feature engineering from domain knowledge, or a neural approach with more data.
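As a pointer for the SMOTE direction, a minimal sketch using the imbalanced-learn package (an assumption here, not part of the original study), applied to the training split only:

```python
# Requires imbalanced-learn: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

# Synthesise new minority-class (diabetic) samples from the training
# split only, leaving the test set untouched.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```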
What This Demonstrates
This project shows the full ML pipeline: data cleaning, EDA, feature analysis, supervised model training with cross-validation, and unsupervised validation — not just "run scikit-learn and report accuracy."