Figure Legends
Figure 1: Reinforcement learning framework. (a) Basic RL loop. (b) RL workflow applied to real-time control of T cell activation and expansion. Cell profiles and properties are inferred from a donor sample pool with the help of imaging and sensing instruments. These properties, coupled with a data-driven approach, are used to create a simulation of the cell culture process. This simulation is then cast as a game-like RL control environment. An agent is trained on the RL environment and subsequently used to navigate the actual cell activation process.
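As an illustration of panel (b), the sketch below shows how such a cell-culture simulation could be wrapped as an RL control environment. It is a minimal example assuming a Gymnasium-style interface; the class name, state encoding, and placeholder dynamics are hypothetical rather than taken from the paper's implementation.

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class TCellCultureEnv(gym.Env):
    # Hypothetical wrapper exposing the cell-culture simulation as an RL environment.
    # Observation: counts of naive, activated, and exhausted cells (tabular state).
    # Actions: 0 = no action (o), 1 = add beads (+), 2 = remove beads (-).
    def __init__(self, n_steps=20, cell_type=1):
        super().__init__()
        self.n_steps = n_steps
        self.cell_type = cell_type  # illustrative parameter for donor/cell-type variants
        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(low=0, high=np.inf, shape=(3,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.state = np.array([100.0, 0.0, 0.0], dtype=np.float32)  # naive, activated, exhausted
        return self.state, {}

    def step(self, action):
        self.state = self._simulate(self.state, action)  # data-driven culture model goes here
        self.t += 1
        terminated = self.t >= self.n_steps
        reward = float(self.state[1]) if terminated else 0.0  # e.g., reward potent cells at harvest
        return self.state, reward, terminated, False, {}

    def _simulate(self, state, action):
        return state  # placeholder for the simulated activation/expansion dynamics

An agent can then be trained on this environment with an off-the-shelf RL library, for example stable_baselines3.PPO("MlpPolicy", TCellCultureEnv()).learn(total_timesteps=100_000).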
Figure 2: Proposed simulation replicating cell activation and expansion. (a) Processes and actions permitted for the cells at each simulation step. (b) Simulated life trajectory of a starting naïve cell through activation at full potency to natural exhaustion caused by aging. Two modes of division, symmetric and asymmetric, are also defined. (c) Sample simulation trajectories for three control strategies; rows from top to bottom depict optimal, sub-optimal, and random bead additions. The bar plot at left shows the number of cells of each type at each simulation step; the symbols on the x-axis indicate the action taken: (+) bead addition, (-) bead removal, and (o) no action. The three windows at right show simulation screens at steps 1, 5, and 19.
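The per-cell rules summarized in panels (a) and (b) can be captured by a simple state-machine update; the sketch below is a hedged illustration, with all probabilities, the exhaustion age, and the function name chosen arbitrarily rather than taken from the proposed simulation.

import random
from dataclasses import dataclass

@dataclass
class Cell:
    state: str = "naive"   # "naive", "activated", or "exhausted"
    age: int = 0

def step_cell(cell, beads_present, p_activate=0.5, p_divide=0.4,
              p_symmetric=0.5, exhaustion_age=15):
    # Return the list of cells that replace `cell` after one simulation step.
    cell.age += 1
    if cell.state == "naive" and beads_present and random.random() < p_activate:
        cell.state = "activated"  # activation on bead contact
    if cell.state == "activated":
        if cell.age >= exhaustion_age:            # natural exhaustion through aging
            cell.state = "exhausted"
        elif random.random() < p_divide:
            if random.random() < p_symmetric:     # symmetric division: two activated daughters
                return [Cell("activated", 0), Cell("activated", 0)]
            return [Cell("activated", 0), Cell("naive", 0)]  # asymmetric division
    return [cell]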
Figure 3: A higher average reward with a tight distribution of outcomes indicates a better-trained agent. The quality of the policy can be judged from the episodic reward distribution of a trained agent. For example, with PPO-tabular and DQN-image, the agent adopted a stable strategy by 100,000 training episodes, as seen from the episodic rewards and the flattening of the average reward. However, with PPO-tabular the episodic rewards are spread within +/- 50 of the average, whereas the spread is +/- 250 for DQN-image. This indicates that the PPO-tabular agent is better trained, with a tighter distribution of higher rewards, while the DQN-image agent is more subject to variability and chance events. The distribution is even tighter for A2C-combined, but its average reward is far lower than that of PPO-combined or DQN-combined.
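The reward-distribution comparison described above can be reproduced by rolling out a trained agent for many episodes and summarizing the returns; the helper below is a sketch assuming a stable-baselines3-style model with a predict() method and a Gymnasium-style environment, neither of which is confirmed by the paper.

import numpy as np

def episodic_reward_distribution(model, env, n_episodes=100):
    # Roll out a trained agent and summarize its episodic reward distribution.
    # A well-trained agent shows a high mean with a tight spread (e.g., about +/- 50),
    # while a poorly trained one shows a wide spread (e.g., about +/- 250).
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return np.mean(returns), np.std(returns)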
Figure 4: Change in the agent's strategy across different cell types with 20 control steps. (a) Simulation process used to obtain the control strategy information. (b) Strategy of the agent visualized as the average number of beads at each control step (y- and x-axes, respectively). Error bars indicate the standard deviation of the number of beads used at that control step, reflecting simulation variability; the absence of a bar indicates constancy. The learning curve accompanies each bar plot, with axes as in Figure 3. Arrows between plots indicate the change in cell type (see also Table 1).
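The bar plots in panel (b) can be built by aggregating bead counts over repeated rollouts; the sketch below assumes the environment reports the current bead count in its info dictionary under a hypothetical "beads" key, which is an illustrative choice rather than the paper's interface.

import numpy as np

def bead_strategy_profile(model, env_fn, n_rollouts=100, n_steps=20):
    # Mean and standard deviation of the bead count at each control step,
    # aggregated over repeated simulations (cf. the bars and error bars in Figure 4b).
    beads = np.zeros((n_rollouts, n_steps))
    for i in range(n_rollouts):
        env = env_fn()
        obs, _ = env.reset()
        for t in range(n_steps):
            action, _ = model.predict(obs, deterministic=True)
            obs, _, terminated, truncated, info = env.step(action)
            beads[i, t] = info.get("beads", 0)  # assumed reporting of bead count
            if terminated or truncated:
                break
    return beads.mean(axis=0), beads.std(axis=0)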
Figure 5: Change in the agent's strategy with 50 control steps, learned from training on different cell types. The strategy is visualized as the average number of beads per control step (y- and x-axes, respectively). Error bars indicate one standard deviation, showing variability across simulations; the absence of a bar indicates uniformity. The learning curve accompanies each bar plot. Arrows indicate the change in cell type (see also Table 1).
Figure 6: (a) Learning curves for agents trained with and without noise, and reward histograms for simulations conducted with agents trained for 0, 250k, and 500k episodes. (b) Agents trained with 20, 50, and 400 timesteps. (c) Number of training episodes required to reach 80%, 90%, and 95% accuracy by agents pre-trained for 500k steps on cell type 1 versus agents trained on the respective cell types from the beginning. The y-axis shows the number of training runs required on a log10 scale.
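Panel (c) compares transfer learning against training from scratch; a minimal sketch of such a comparison is shown below, assuming stable-baselines3 PPO and the hypothetical cell_type argument of the illustrative environment above, neither of which is specified by the paper.

from stable_baselines3 import PPO

# Pre-train on cell type 1 for 500k steps, then fine-tune on a new cell type.
pretrained = PPO("MlpPolicy", TCellCultureEnv(cell_type=1), verbose=0)
pretrained.learn(total_timesteps=500_000)
pretrained.save("ppo_cell1")

# Transfer: reload the pre-trained weights and continue training on cell type 2.
transfer = PPO.load("ppo_cell1", env=TCellCultureEnv(cell_type=2))
transfer.learn(total_timesteps=50_000)

# Baseline: train on cell type 2 from the beginning for comparison.
scratch = PPO("MlpPolicy", TCellCultureEnv(cell_type=2), verbose=0)
scratch.learn(total_timesteps=500_000)

In both cases the number of training episodes needed to reach a target accuracy (80%, 90%, 95%) can then be compared, as in panel (c).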