Figure 5: Change of strategy by an agent using 50 control steps, learned from training with different cell types. The agent's strategy is visualized as the average number of beads (y axis) at each control step (x axis). Error bars indicate one standard deviation; large bars show variability across runs, while their absence indicates a uniform strategy. The learning curve is shown alongside each bar plot. Arrows indicate the change in cell type; see also Table 1.
Effect of measurement noise, number of control steps, and number of training runs
The ability of an agent to learn unique control strategies for different cell types is a major finding; however, to put this into practice, it is important to know how accurate the measurements (the inputs to the agent) must be, as well as how many training runs are required (since 10^6 experiments to determine a unique training regime is not tractable). Here we explore both topics using the T-cell expansion simulator and the PPO algorithm with combined input.
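A minimal sketch of this kind of training setup is given below, assuming a Gymnasium-style interface and the stable-baselines3 PPO implementation. The stub environment is only a placeholder: its action structure (add, remove, or maintain beads) and tabular observation (cell count, potency estimate, time step) follow the text, but the dynamics and reward are dummy values, not the authors' simulator.

```python
# Sketch of the PPO training loop, with a stub standing in for the T-cell
# expansion simulator. All dynamics and rewards below are dummy placeholders.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO


class StubExpansionEnv(gym.Env):
    """Placeholder for the T-cell expansion simulator (50 control steps)."""

    def __init__(self, n_control_steps=50):
        super().__init__()
        self.n_control_steps = n_control_steps
        self.action_space = spaces.Discrete(3)  # 0: add, 1: remove, 2: maintain beads
        # Tabular observation: [cell count, potency estimate, time step]
        self.observation_space = spaces.Box(0.0, np.inf, shape=(3,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return np.array([1e5, 1.0, 0.0], dtype=np.float32), {}

    def step(self, action):
        self.t += 1
        obs = np.array([1e5, 1.0, float(self.t)], dtype=np.float32)  # dummy dynamics
        reward = 0.0                                                  # dummy reward
        terminated = self.t >= self.n_control_steps
        return obs, reward, terminated, False, {}


model = PPO("MlpPolicy", StubExpansionEnv(), verbose=0)
model.learn(total_timesteps=10_000)  # the paper trains for ~500k runs of 50 steps
model.save("ppo_cell1_baseline")
```

An MlpPolicy over the tabular stub is used here for simplicity; the combined input described in the paper would require a matching observation space and policy network.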
The observation space for tabular input would be obtained from cell-monitoring sensors that distinguish between cell types and estimate potency (optical, impedance, etc.). These devices will not have complete precision. To observe the effect of noise, an agent is trained with Gaussian noise equal to 40% of the initial cell number added to the cell count and potency estimates to simulate measurement error. There is no observable change in the episodic and average reward over the training steps, or in the reward distribution, with and without noise (Figure 7a). There are two possible reasons: first, Gaussian noise in an already stochastic environment may not make a perceivable difference in mapping observations to actions; second, the agent either learns the noise along with the observations or disregards the noisy observations entirely and builds its policy on more stable inputs such as the time step. Histograms are also drawn at three stages of training: the zeroth training run, where the agent is fully random, and at 250k and 500k episodes. There is a clear difference in the reward distribution between the random agent at the start and the trained agent at 250k runs, but the distributions at 250k and 500k episodes are indistinguishable.
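One way such measurement error could be injected is through an observation wrapper, sketched below for a Gymnasium-style environment. The observation indices, the choice of sigma, and the wrapper itself are illustrative assumptions, not the authors' implementation.

```python
# Sketch of adding zero-mean Gaussian noise to selected observation components
# (e.g. cell count and potency) while leaving stable inputs such as the time
# step untouched. Indices and sigma are hypothetical choices.
import numpy as np
import gymnasium as gym


class NoisyObservation(gym.ObservationWrapper):
    """Adds zero-mean Gaussian noise to selected observation components."""

    def __init__(self, env, noisy_indices, sigma):
        super().__init__(env)
        self.noisy_indices = np.asarray(noisy_indices)
        self.sigma = sigma  # e.g. 0.4 * initial cell number, per the text (assumption)

    def observation(self, obs):
        noisy = np.array(obs, dtype=float)
        noisy[self.noisy_indices] += np.random.normal(
            0.0, self.sigma, size=self.noisy_indices.shape
        )
        return noisy


# Hypothetical usage: perturb cell count (index 0) and potency (index 1) only.
# env = NoisyObservation(expansion_env, noisy_indices=[0, 1],
#                        sigma=0.4 * initial_cell_number)
```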
Figures 4 and 5 demonstrate that the agent performs better with increased interaction with the environment (50 control steps rather than 20). With more interaction it has finer control and attains a higher reward with less fluctuation, whereas with fewer interactions it is difficult to control the environment. We investigated whether this pattern holds for even more frequent interaction. Conceivably, an agent could interact with a fully automated environment at every observation point. To observe the effect of increased control, we trained an agent with 400 control steps (adding, removing, or maintaining beads every 24 min). In this case there are an overwhelming 3^400 possible action sequences. With such a large number, the agent finds it difficult to settle on a control policy, and the learning curve fluctuates more than in the 50-control-step case (Figure 7b). This finding indicates that ‘real-time’ control is likely not as advantageous as a control strategy that is still dynamic yet has a tractable number of possible actions.
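The scale of this action-sequence space can be checked with a few lines of Python, assuming three possible actions (add, remove, maintain) at each control step as described above.

```python
# Rough check of how the number of possible action sequences grows with the
# number of control steps, assuming 3 actions (add / remove / maintain) per step.
n_actions = 3
for n_steps in (20, 50, 400):
    n_sequences = n_actions ** n_steps
    print(f"{n_steps:>3} control steps -> 3^{n_steps} ≈ {float(n_sequences):.2e} sequences")
```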
In a real clinical setting, there will likely be a limited number of experiments that can be performed on a new cell type (patient sample) for the agent to self-learn a control strategy. The average learning curve of cell 1 shows 90% of the maximum average reward after 29,000 training sessions for an agent with 50 control steps (Figure 6c). We hypothesized that this number could be reduced further if an agent trained on one cell type is used as the starting point for another cell (e.g., training the agent on a stock cell line prior to testing with the patient cell sample). To test this approach, the agent is trained for 500k training runs on the base-case cell 1 and then used as the starting point for subsequent training on cell types 1-4. For cell 1 and cell 2 the optimum strategy is similar: add beads at the beginning. In that case the agent adapts faster, and a smaller number of runs (1,000, or one policy update step) is required to reach the same level of accuracy compared with training from scratch. The optimum strategy for cells 3 and 4, however, is different: add beads at the end. In those cases, the agent must unlearn the previous strategy and adopt a new one, and with such a change in policy it takes longer to reach the same level of accuracy than starting training from scratch. An alternative or parallel approach to settling on an optimal control strategy would be to take patient cells and perform a series of tests to obtain growth parameters that allow for efficient simulation (Figure 1b). In silico tests, much like those presented here, would then augment the physical training data. An in silico test can thus indicate whether a change in policy is needed and weight the choice between retraining from another cell's policy and training from scratch, given the desired yield and available resources.
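A hedged sketch of this warm-start procedure, again assuming the stable-baselines3 PPO implementation: "ppo_cell1_baseline" is the policy saved after pretraining on cell 1 (as in the earlier sketch), and "cell3_env" is a hypothetical handle to the simulator configured for the new (patient) cell type.

```python
# Sketch of warm-starting training on a new cell type from the cell-1 policy,
# rather than training from scratch. "cell3_env" is a hypothetical placeholder
# for the expansion simulator configured for cell type 3.
from stable_baselines3 import PPO

model = PPO.load("ppo_cell1_baseline", env=cell3_env)  # start from the pretrained policy
# Continue training for a small number of runs (~1,000 episodes of 50 control steps);
# reset_num_timesteps=False keeps the learning-curve bookkeeping continuous.
model.learn(total_timesteps=1_000 * 50, reset_num_timesteps=False)
```

Whether such fine-tuning pays off depends, as noted above, on how similar the new cell type's optimal strategy is to the pretrained one.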