Figure 5: Change of strategy by the agent with 50 control steps learned from training on different cell types. The strategy is visualized as the average number of beads (y-axis) at each control step (x-axis). Error bars indicate one standard deviation, showing the variability of a step; the absence of an error bar indicates a uniform choice. The learning curve is also shown with each bar plot. Arrows indicate the change in cell type; see also Table 1.
Effect of measurement
noise, number of control steps, and number of training runs
The ability of an agent to learn unique control strategies for different
cell types is a major finding; however, to put this into practice, it
will be important to know how accurate the measurements (the inputs to
the agent) must be, as well as how many training runs are required
(since 10^6 experiments to determine a unique training regime is not
tractable). Here we explore both topics with the T-cell expansion
simulator, using the PPO algorithm with combined input.
The observation space for tabular input would be obtained from cell
monitoring sensors that distinguish between cell types and estimate
potency (optical, impedance, etc.). These devices will not have complete
precision. To observe the effect of noise, an agent is trained with 40%
of the initial cell number added as Gaussian noise to the cell count and
potency estimates to simulate measurement error. There is no observable
change in the episodic reward, average reward, or reward distribution of
the training steps with and without noise (Figure 7a). There are two
possible reasons: first, Gaussian noise in an already stochastic
environment does not make a perceivable difference in the mapping from
observation to action; second, the agent either incorporates the noise
along with the observations or disregards the noisy observations
entirely and builds its policy on more stable inputs such as the time
step. A histogram is also drawn at three stages of training: the zeroth
training run, where the agent is fully random, and at 250k and 500k
episodes. There is a clear difference in the reward distribution between
the random agent at the start and the trained agent at 250k runs, but
the distributions of rewards at 250k and 500k episodes are
indistinguishable.
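To make the perturbation concrete, noise of this kind could be injected through an observation wrapper around the simulator. The sketch below is a minimal illustration assuming a Gymnasium-style environment interface; the channel layout (measured quantities first, time step last) and the noise scales are assumptions for illustration, not the implementation used here.

```python
import numpy as np
import gymnasium as gym


class NoisyObservation(gym.ObservationWrapper):
    """Add zero-mean Gaussian noise to the measured entries of a tabular
    observation (e.g. cell counts and potency estimate)."""

    def __init__(self, env, noise_std):
        super().__init__(env)
        # Per-channel standard deviations, e.g. 40% of the initial cell
        # number for the count channels (values are illustrative).
        self.noise_std = np.asarray(noise_std, dtype=np.float64)

    def observation(self, obs):
        noisy = np.asarray(obs, dtype=np.float64).copy()
        n = self.noise_std.size
        # Perturb only the measured channels; trailing entries such as
        # the elapsed time step are left noise-free.
        noisy[:n] += np.random.normal(0.0, self.noise_std)
        return noisy
```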
Figures 4 and 5 demonstrate that the agent can perform better with
increased interaction with the environment (50 control steps rather than
20). With more interaction the agent has finer control, and the reward
is higher with less fluctuation, whereas with fewer interactions the
environment is harder to control. We investigated whether this pattern
holds for even more frequent interaction. Conceivably, an agent could
interact with a fully automated environment at every observation point.
To observe the effect of increased control, we trained an agent with 400
control steps (adding, removing, or maintaining beads every 24 min). In
this case there are an overwhelming 3^400 possible action sequences.
With such a large space the agent finds it difficult to settle on a
control policy, and the learning curve fluctuates more than in the
50-control-step case (Figure 7b). This finding indicates that
‘real-time’ control is likely not as advantageous as a control strategy
that is still dynamic yet has a tractable number of possible actions.
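For perspective, the growth of this action space can be checked with a quick back-of-the-envelope calculation; the Python sketch below uses the 20-, 50-, and 400-step agents discussed here.

```python
from math import log10

# Three possible actions per control step (add, remove, or maintain
# beads), so the number of distinct action sequences is 3**steps.
for steps in (20, 50, 400):
    print(f"3^{steps} ≈ 10^{steps * log10(3):.0f} possible action sequences")
# Roughly 10^10 sequences for 20 steps, 10^24 for 50, and 10^191 for 400.
```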
In a realized clinical setting, there will likely be a limited number of
experiments that can be performed on a new cell type (patient sample)
for the agent to self-learn a control strategy. The average learning
curve for cell 1 reaches 90% of the maximum average reward after 29,000
training runs for an agent with 50 control steps (Figure 6c). We
hypothesized that this number could be further reduced if an agent
trained on one cell type is used as the starting point for another cell
type (e.g., training the agent on a stock cell prior to testing with the
patient cell sample). To test this approach, the agent is first trained
for 500k training runs on the base case, cell 1, and then trained
further on cell types 1-4. For cell 1 and cell 2 the optimal strategy is
similar: add beads at the beginning. In that case the agent adapts
quickly, and a smaller number of runs (1000, or one policy update) is
required to reach the same level of accuracy as training from scratch.
The optimal strategy for cells 3 and 4, however, is different: add beads
at the end. In those cases, the agent must unlearn the previous strategy
and adopt a new one, and with such a change in policy it takes longer to
reach the same level of accuracy than starting training from scratch. An
alternative or parallel approach to settling on an optimal control
strategy would be to take patient cells and perform a series of tests to
obtain growth parameters that allow for efficient simulation (Figure
1b). In silico tests, much like those presented here, would then augment
the physical training data. An in silico test can thus indicate whether
a change in policy is needed and inform the choice between retraining
from another cell's policy and training from scratch, given the desired
yield and available resources.
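The warm-start scheme described above can be sketched with an off-the-shelf PPO implementation such as Stable-Baselines3: train on the base cell type, save the policy, and continue training the same policy on the new cell type. The make_env() factory, the MlpPolicy choice, and the timestep budgets below are hypothetical placeholders, not the configuration used in this work.

```python
from stable_baselines3 import PPO

# Hypothetical factory returning the T-cell expansion simulator
# configured for a given cell type (a stand-in for the actual
# environment, which is not shown here).
base_env = make_env(cell_type=1)

# Base-case training on cell 1.
model = PPO("MlpPolicy", base_env, verbose=0)
model.learn(total_timesteps=500_000)
model.save("ppo_cell1_base")

# Warm start: load the cell-1 policy and keep training on a new cell
# type, using far fewer interactions than training from scratch.
new_env = make_env(cell_type=2)
model = PPO.load("ppo_cell1_base", env=new_env)
model.learn(total_timesteps=10_000, reset_num_timesteps=False)
```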