Figure 3. A higher average reward with a tight distribution of outcomes indicates a better-trained agent.

The quality of the policy can be determined from
the episodic reward distribution of a trained agent. For example, with
PPO-tabular and DQN-image (Figure 3), the agents adopted a stable strategy by 100,000 training episodes, as observed from the episodic rewards and the flattening of the average reward. However, with PPO-tabular the episodic reward distribution around the average is roughly +/- 50, whereas it is +/- 250 for DQN-image. This indicates that the PPO-tabular agent is better trained, with a tighter distribution of higher rewards, while the DQN-image agent is subject to variability and chance events. The distribution is even tighter for A2C-combined, but the average reward is far lower than that of PPO-combined or DQN-combined.
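This comparison can be reproduced by rolling out a trained policy for a fixed number of evaluation episodes and summarizing the episodic returns. A minimal sketch is shown below, assuming the classic OpenAI Gym step/reset API and a hypothetical environment id (CellCultureEnv-v0) and policy function; the actual environment name and agent interface are not specified in this section.

```python
import numpy as np
import gym

# Hypothetical names: the real environment id and trained-agent interface are assumptions.
env = gym.make("CellCultureEnv-v0")

def policy(obs):
    # Placeholder: substitute the trained agent's action selection here,
    # e.g. trained_model.predict(obs). Random actions keep the sketch runnable.
    return env.action_space.sample()

episodic_rewards = []
for _ in range(1000):                                     # evaluation rollouts
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done, info = env.step(policy(obs))   # classic Gym API
        total += reward
    episodic_rewards.append(total)

rewards = np.asarray(episodic_rewards)
# A high mean with a tight spread (e.g. +/- 50 rather than +/- 250) indicates a better-trained agent.
print(f"episodic reward: {rewards.mean():.1f} +/- {rewards.std():.1f}")
```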
With image input, our goal was to observe whether it is possible to navigate the environment from a snapshot of the number of cells and beads, cell type, potency, and age obtained from an image alone, without any temporal labels. We probed whether the simulation strategy could be step-independent. We observed that context information is important: performance for all algorithms was higher with context data (tabular and combined) than without context data (image only). Across all three input strategies, the behavior of DQN was very similar: it settled for a sub-optimal strategy with a broader reward distribution (details in Discussion). In this work, the default hyperparameters for each neural architecture (Supplements 3, 4, and 5), as reported in OpenAI Gym, were used without fine-tuning. How an untrained and a trained agent navigate the environment is demonstrated in Supplementary Videos 1 and 2, respectively.
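The RL library used is not named in this section; as an illustration only, the following sketch assumes Stable-Baselines3, which provides PPO, DQN, and A2C with documented default hyperparameters, together with hypothetical Gym environment ids for the three input strategies. The policy classes shown ("MlpPolicy" for tabular, "CnnPolicy" for image, "MultiInputPolicy" for combined dictionary observations) are Stable-Baselines3 conventions and are not necessarily the architectures of Supplements 3, 4, and 5.

```python
import gym
from stable_baselines3 import A2C, DQN, PPO

# Hypothetical environment ids for the three input strategies.
envs = {
    "tabular": "CellCulture-Tabular-v0",
    "image": "CellCulture-Image-v0",
    "combined": "CellCulture-Combined-v0",
}
policies = {"tabular": "MlpPolicy", "image": "CnnPolicy", "combined": "MultiInputPolicy"}

# Train each algorithm with library-default hyperparameters (no fine-tuning);
# the 1M-timestep budget mirrors the training duration quoted for the cell-type experiments.
for name, env_id in envs.items():
    env = gym.make(env_id)
    for algo in (PPO, DQN, A2C):
        model = algo(policies[name], env, verbose=0)
        model.learn(total_timesteps=1_000_000)
        model.save(f"{algo.__name__}_{name}")
```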
Learned control strategies for different cell types and number of control steps
Next, a PPO-combined agent was tested on each cell type, simulating the diversity of patient-derived cells, to assess how the RL agent adapts its learned control strategy. Six cell types were simulated by changing the cell parameters (Table 1). For each cell type, an agent was first trained for 1M timesteps and then used to navigate 1000 simulations on the same environment. The average number of beads added at each control step is plotted with standard deviations to reveal the bead-addition patterns (Figure 4). The variability of actions taken in response to observations (presence of error bars) indicates that the policy adapts to different situations rather than simply memorizing and repeating the same actions at each step. In a few instances, the actions were uniform (no error bars; the same number of beads in all 1000 simulations). The learning curve is also included with each bar plot (insets), indicating that the agent settled on a policy by the end of training (discussed above).
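The bead-addition patterns of Figure 4 can be summarized by replaying the trained agent over many simulations and recording the action taken at each control step. A sketch under the same assumptions as above (Stable-Baselines3, hypothetical environment id and model path, and an assumed number of control steps), where the action is taken to encode the number of beads added:

```python
import numpy as np
import gym
from stable_baselines3 import PPO

env = gym.make("CellCulture-Combined-v0")      # hypothetical environment id
model = PPO.load("PPO_combined")               # hypothetical saved PPO-combined agent

n_runs, n_steps = 1000, 10                     # 10 control steps is an assumption, not from the text
beads = np.zeros((n_runs, n_steps))            # beads added at each control step of each run

for run in range(n_runs):
    obs, done, step = env.reset(), False, 0
    while not done and step < n_steps:
        # Greedy policy: variability across runs then reflects the differing observations
        # produced by the stochastic simulation, i.e. an adaptive (non-memorized) strategy.
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        beads[run, step] = np.asarray(action).item()   # assumes the action is the bead count
        step += 1

# Mean +/- standard deviation per control step, as in the Figure 4 bar plots;
# a zero standard deviation means the same number of beads was added in every run.
print(beads.mean(axis=0))
print(beads.std(axis=0))
```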