Explainable Recurrent PPO for Automated Insulin Dosing

(Research Project, University of Toronto - supervised by Prof. Roger Grosse, Anthropic researcher and founding member of the Vector Institute)

Jan. 2025 - May 2025

AI, Reinforcement Learning, PyTorch, PPO, Recurrent PPO, LSTM, Attention Mechanism, Research, Stable-Baselines3, Leadership

Goal

Insulin dosing for people with Type 1 diabetes is a difficult and high-stakes task. Blood sugar reacts slowly and non-linearly to meals, activity, and stress. Standard reinforcement learning can automate dosing, but most models act as black boxes, offering little explanation for their decisions.

This project aimed to build an agent that delivers safe, personalized insulin recommendations and explains its reasoning, helping make AI-driven glucose control trustworthy for clinical use.

Data and Evaluation

I led a team of three students. We used the SimGlucose simulator, which is based on the FDA-validated UVA/Padova model of human glucose-insulin dynamics.

Training was done on five virtual patients, with testing on five unseen ones.
Each run covered 24 hours with randomized meals.
We measured time in a healthy glucose range, the frequency of dangerous highs or lows, and the smoothness of insulin delivery.
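
The three measures above can be sketched in a few lines of Python. This is an illustration, not the project's evaluation code: the write-up does not state its thresholds, so the 70 to 180 mg/dL healthy range is an assumption, as is using the mean absolute change between consecutive doses as a smoothness proxy. The helper name `glucose_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def glucose_metrics(
    glucose_mg_dl: Sequence[float],
    insulin_units: Sequence[float],
    low: float = 70.0,    # assumed lower bound of the healthy range (mg/dL)
    high: float = 180.0,  # assumed upper bound of the healthy range (mg/dL)
) -> Dict[str, float]:
    """Summarize one simulated episode of CGM readings and insulin doses."""
    if not glucose_mg_dl:
        raise ValueError("need at least one glucose reading")
    n = len(glucose_mg_dl)

    # Fraction of readings inside, below, and above the assumed range.
    in_range = sum(low <= g <= high for g in glucose_mg_dl) / n
    below = sum(g < low for g in glucose_mg_dl) / n
    above = sum(g > high for g in glucose_mg_dl) / n

    # Smoothness proxy (assumption): mean absolute change between
    # consecutive doses. Lower values mean smoother delivery.
    steps = [abs(b - a) for a, b in zip(insulin_units, insulin_units[1:])]
    mean_dose_change = sum(steps) / len(steps) if steps else 0.0

    return {
        "time_in_range": in_range,
        "time_below_range": below,
        "time_above_range": above,
        "mean_dose_change": mean_dose_change,
    }


# Toy episode with made-up numbers, for illustration only.
metrics = glucose_metrics([60, 100, 150, 200], [0.0, 1.0, 1.0, 3.0])
```

Reporting the low and high fractions separately keeps hypoglycemia, which is the more acute danger, visible instead of folding it into a single out-of-range number.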

Models

Baseline PPO: a simple feedforward controller.
Recurrent PPO: added LSTM layers so the agent could learn from past glucose, insulin, and meal data.
Attention Recurrent PPO: combined LSTM with a feature-level attention module. At each step, attention weights were extracted to show which inputs (CGM readings, meal size, previous dose, time) shaped the decision.
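
As a rough illustration of the attention variant, the sketch below places a feature-level attention layer in front of an LSTM in PyTorch. It is a minimal sketch under stated assumptions, not the project's implementation: the class name `FeatureAttentionLSTM`, the single linear scoring layer, the hidden size, and the four-feature input ordering (CGM reading, meal size, previous dose, time) are all hypothetical, and the PPO value head and training loop are omitted.

```python
import torch
import torch.nn as nn


class FeatureAttentionLSTM(nn.Module):
    """Hypothetical policy core: per-feature attention, then an LSTM."""

    def __init__(self, n_features: int = 4, hidden_size: int = 32):
        super().__init__()
        # Produces one attention score per input feature at each time step.
        self.score = nn.Linear(n_features, n_features)
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        # Maps the LSTM output to a single (unbounded) dose signal.
        self.dose_head = nn.Linear(hidden_size, 1)

    def forward(self, obs, state=None):
        # obs has shape (batch, time, n_features).
        # Softmax over the feature axis, so weights sum to 1 at each step.
        weights = torch.softmax(self.score(obs), dim=-1)
        # Reweight the inputs before they enter the recurrent layer.
        out, state = self.lstm(weights * obs, state)
        dose = self.dose_head(out)
        # Returning the weights lets them be logged for explanations.
        return dose, weights, state


torch.manual_seed(0)
model = FeatureAttentionLSTM()
obs = torch.randn(2, 5, 4)  # 2 episodes, 5 time steps, 4 input features
dose, weights, _ = model(obs)
```

Because the weights are normalized across features at every step, averaging them over a window (for example, the steps following a meal) gives a simple per-feature importance profile of the kind described in the results.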

Figure: architecture of the Recurrent PPO model.

Results

Both recurrent models achieved a 40% increase in time spent in the safe glucose range compared with the baseline PPO.
They also reduced dangerous episodes and avoided sharp swings in insulin dosing.
The attention model matched this performance while giving clear insights into its reasoning. After meals, carbohydrate intake had the highest importance, while time of day mattered least.

Key Takeaways

Temporal memory is essential for handling insulin’s delayed effects.
A small attention layer can reveal why each dose is chosen without hurting accuracy.
Interpretable reinforcement learning can support safer, more transparent decision-making in healthcare.

Collaboration

I carried out this project under the guidance of Prof. Roger Grosse at the University of Toronto, who is also a researcher at Anthropic and a founding member of the Vector Institute.