Figure Legends
Figure 1: Reinforcement learning framework. (a) Basic RL loop.
(b) RL workflow applied to real-time control of T cell activation and
expansion. Cell profiles and properties are inferred from the donor
sample pool using imaging and sensing instruments. These properties,
coupled with a data-driven approach, are used to create a simulation of
the cell culture process. This simulation is then cast as a game-like RL
control environment. An agent is trained on the RL environment and then
used to navigate the actual cell activation process.
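The workflow in (b), wrapping the data-driven cell-culture simulation as an RL control environment and training an agent on it, can be sketched as follows. This is a minimal illustration only, assuming a Gymnasium-style interface and stable-baselines3 PPO; the class name CellCultureEnv, its observation layout, and the stand-in dynamics are hypothetical, not the actual implementation.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO


class CellCultureEnv(gym.Env):
    """Hypothetical Gym-style wrapper around a data-driven cell-culture simulator."""

    def __init__(self, n_steps=20):
        super().__init__()
        self.n_steps = n_steps
        # Actions: 0 = no action (o), 1 = add beads (+), 2 = remove beads (-)
        self.action_space = spaces.Discrete(3)
        # Tabular observation: counts of naive, activated, exhausted cells, plus bead count
        self.observation_space = spaces.Box(low=0, high=np.inf, shape=(4,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.state = np.array([100.0, 0.0, 0.0, 0.0], dtype=np.float32)  # start with naive cells only
        return self.state, {}

    def step(self, action):
        self.t += 1
        self.state = self._simulate(self.state, action)  # placeholder for the simulator update
        reward = float(self.state[1])                     # e.g. reward the activated-cell count
        terminated = self.t >= self.n_steps
        return self.state, reward, terminated, False, {}

    def _simulate(self, state, action):
        # Stand-in dynamics; the real simulator is inferred from donor-sample data.
        naive, activated, exhausted, beads = state
        beads = max(beads + (action == 1) * 10 - (action == 2) * 10, 0)
        activated += 0.1 * naive * beads / (beads + 10)
        return np.array([naive, activated, exhausted, beads], dtype=np.float32)


# Train an agent on the simulated environment before deploying it on the real process.
env = CellCultureEnv()
agent = PPO("MlpPolicy", env, verbose=0)
agent.learn(total_timesteps=10_000)
```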
Figure 2: Proposed simulation replicating cell activation and
expansion. (a) Process and actions permitted to the cells at each
simulation step. (b) Simulated life trajectory of a naïve starting cell
to a fully potent activated cell, with natural exhaustion caused by
aging. Two modes of division, symmetric and asymmetric, are also
defined. (c) Sample simulation trajectories for three control
strategies: the top, middle, and bottom rows depict optimal,
sub-optimal, and random bead additions, respectively. The bar plot at
left shows the number of cells of each type at each simulation step; the
symbols on the x axis indicate the action taken, where (+) denotes bead
addition, (-) denotes bead removal, and (o) denotes no action. The three
windows at right show simulation screens at steps 1, 5, and 19.
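The per-cell life trajectory in (b) can be viewed as a small state machine with two division modes. The sketch below illustrates one possible encoding; the class, state names, probabilities, and potency rule are hypothetical stand-ins chosen only to mirror the legend, not the simulator itself.

```python
import random
from dataclasses import dataclass

# Hypothetical cell states mirroring panel (b): naive -> activated -> exhausted (by aging).
NAIVE, ACTIVATED, EXHAUSTED = "naive", "activated", "exhausted"


@dataclass
class Cell:
    state: str = NAIVE
    age: int = 0
    potency: float = 1.0


def step_cell(cell, beads_present, max_age=15):
    """Advance one cell by one simulation step and return the resulting cell(s)."""
    cell.age += 1
    if cell.state == NAIVE and beads_present:
        cell.state = ACTIVATED           # bead contact activates a naive cell
    elif cell.state == ACTIVATED and cell.age > max_age:
        cell.state = EXHAUSTED           # natural exhaustion caused by aging
    if cell.state != ACTIVATED:
        return [cell]
    # Activated cells divide, here once per step, in one of two modes.
    if random.random() < 0.5:
        # Symmetric division: both daughters retain the parent's potency.
        return [Cell(ACTIVATED, 0, cell.potency), Cell(ACTIVATED, 0, cell.potency)]
    # Asymmetric division: one daughter retains potency, the other is reduced.
    return [Cell(ACTIVATED, 0, cell.potency), Cell(ACTIVATED, 0, cell.potency * 0.5)]
```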
Figure 3: A higher average reward with a tight distribution of
outcomes indicates a better-trained agent. The quality of the policy can
be judged from the episodic reward distribution of a trained agent. For
example, with PPO-tabular and DQN-image, the agent adopted a stable
strategy by 100,000 training episodes, as seen from the episodic rewards
and the flattened average reward curve. However, the episodic reward
distribution around the average is about +/- 50 for PPO-tabular versus
about +/- 250 for DQN-image, indicating that the PPO-tabular agent is
better trained, with a tighter distribution of higher rewards, whereas
DQN-image remains subject to variability and chance events. The
distribution is even tighter for A2C-combined, but its average reward is
far below that of PPO-combined or DQN-combined.
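The episodic reward distribution used for this comparison can be estimated by rolling out a trained agent many times and summarizing the spread of total rewards. The sketch below assumes a stable-baselines3-style agent and a Gymnasium-style environment; both objects and the commented usage names are placeholders.

```python
import numpy as np


def episodic_reward_distribution(model, env, n_episodes=100):
    """Roll out a trained agent and summarize the spread of episodic rewards.

    `model` is assumed to expose a stable-baselines3-style predict(obs) method and
    `env` a Gymnasium-style reset()/step() interface; both are placeholders here.
    """
    totals = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        totals.append(total)
    totals = np.asarray(totals)
    # A well-trained agent shows a high mean with a small spread; a wider spread
    # reflects variability and chance events.
    return totals.mean(), totals.std(), totals


# Example comparison (agents and environments are assumed to exist):
# mean_ppo, std_ppo, _ = episodic_reward_distribution(ppo_tabular_agent, tabular_env)
# mean_dqn, std_dqn, _ = episodic_reward_distribution(dqn_image_agent, image_env)
```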
Figure 4: Change of strategy by the agent using 20 control
steps for different cell types. (a) Simulation process used to obtain
control-strategy information. (b) Strategy of the agent visualized as
the average number of beads (y axis) at each control step (x axis).
Error bars indicate the standard deviation of the number of beads used
at that control step, reflecting simulation variability; the absence of
a bar indicates constancy. A learning curve accompanies each bar plot,
with axes the same as in Figure 3. Arrows between plots indicate the
change in cell type (also see Table 1).
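The bar plots in (b) summarize, across many evaluation rollouts, how many beads the agent uses at each control step. A minimal sketch of that aggregation is shown below; the mapping from discrete actions to a bead-count change, and the agent/environment interfaces, are assumptions for illustration only.

```python
import numpy as np

# Hypothetical mapping from discrete actions to a change in bead count:
# 0 = no action (o), 1 = add beads (+), 2 = remove beads (-).
BEAD_DELTA = {0: 0, 1: 10, 2: -10}


def bead_usage_per_step(model, env, n_rollouts=50, n_steps=20):
    """Return mean and std of the bead count at each control step across rollouts."""
    beads = np.zeros((n_rollouts, n_steps))
    for r in range(n_rollouts):
        obs, _ = env.reset()
        count = 0
        for t in range(n_steps):
            action, _ = model.predict(obs, deterministic=True)
            count = max(count + BEAD_DELTA[int(action)], 0)
            beads[r, t] = count
            obs, _, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                break
    # The per-step mean gives the bar height; the per-step std gives the error bar
    # (zero std, i.e. no error bar, indicates a constant strategy at that step).
    return beads.mean(axis=0), beads.std(axis=0)
```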
Figure 5: Change of strategy by the agent using 50 control
steps, learned from training with different cell types. Strategy of the
agent visualized as the average number of beads (y axis) per control
step (x axis). Error bars indicate one standard deviation, showing
variability across runs; the absence of error bars indicates uniformity.
A learning curve accompanies each bar plot. Arrows indicate the change
in cell type (also see Table 1).
Figure 6: (a) Learning curves for agents trained with and without
noise, and reward histograms for simulations conducted with agents
trained for 0, 250k, and 500k episodes. (b) Agents trained with 20, 50,
and 400 timesteps. (c) Number of training episodes required to reach
accuracies of 80%, 90%, and 95% by agents pre-trained for 500k steps on
cell type 1 versus agents trained on the respective cell types from
scratch. The y axis shows the number of training runs required on a
log10 scale.
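Panel (c) compares how quickly a pre-trained agent can be adapted to a new cell type versus training from scratch. A minimal sketch of such a comparison is given below, assuming stable-baselines3 PPO and a Gymnasium-style environment for the new cell type; the chunked training loop, the threshold-based stopping rule, and all names in the commented usage are illustrative, not the reported protocol.

```python
import numpy as np
from stable_baselines3 import PPO


def training_to_reach_target(agent, env, target_reward, chunk_timesteps=5_000, max_chunks=100):
    """Train in chunks and return the total training timesteps needed to reach a target mean reward."""
    for chunk in range(1, max_chunks + 1):
        agent.learn(total_timesteps=chunk_timesteps, reset_num_timesteps=False)
        rewards = []
        for _ in range(10):  # short evaluation rollouts after each chunk
            obs, _ = env.reset()
            done, total = False, 0.0
            while not done:
                action, _ = agent.predict(obs, deterministic=True)
                obs, r, terminated, truncated, _ = env.step(action)
                total += r
                done = terminated or truncated
            rewards.append(total)
        if np.mean(rewards) >= target_reward:
            return chunk * chunk_timesteps
    return None  # target not reached within the training budget


# Hypothetical usage: fine-tune an agent pre-trained on cell type 1 vs. train from scratch.
# env_new = make_cell_env(cell_type=2)                 # assumed environment constructor
# pretrained = PPO.load("ppo_cell1.zip", env=env_new)  # agent pre-trained for 500k steps
# scratch = PPO("MlpPolicy", env_new)
# print(training_to_reach_target(pretrained, env_new, target_reward=0.9 * MAX_REWARD))
# print(training_to_reach_target(scratch, env_new, target_reward=0.9 * MAX_REWARD))
```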