Figure 3. A higher average reward with a tight distribution of outcomes indicates a better-trained agent.

The quality of the policy can be determined from the episodic reward distribution of a trained agent. For example, with PPO-tabular and DQN-image (Figure 3), the agent adopted a stable strategy by 100,000 training episodes, as observed from the episodic reward and the flattening of the average reward. With PPO-tabular, however, the episodic reward distribution around the average is ±50, whereas it is ±250 for DQN-image. This indicates that the PPO-tabular agent is better trained, with a tighter distribution of higher rewards, while the DQN-image agent is more subject to variability and chance events. The distribution is even tighter for A2C-combined, but the average reward is far lower than that of PPO-combined or DQN-combined.
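The episodic reward distribution used here can be obtained by rolling out the trained policy for a batch of evaluation episodes and recording each episode's total reward. The sketch below is illustrative only: it assumes the classic OpenAI Gym reset()/step() interface and a Stable-Baselines-style predict() method, and the function name and episode count are not taken from the paper.

```python
import numpy as np

def episodic_reward_distribution(model, env, n_episodes=100):
    """Collect episodic rewards of a trained agent.

    A high mean with a tight spread (e.g. roughly +/-50 for PPO-tabular)
    suggests a stable, well-trained policy; a broad spread (e.g. +/-250 for
    DQN-image) suggests the outcome depends more on chance events.
    """
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()                      # classic Gym API assumed
        done, total = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    returns = np.asarray(returns)
    return returns.mean(), returns.std(), returns
```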
With image input, our goal was to observe whether it is possible to navigate the environment from a snapshot alone, inferring the number of cells and beads, cell type, potency, and age from an image without any temporal labels. We probed whether the simulation strategy could be step-independent. We found that context information is important: performance for all algorithms was higher with context data (tabular and combined) than without context (image only). Across all three input strategies, the behavior of DQN was very similar; it settled for a sub-optimal strategy with a broader reward distribution (details in Discussion). In this work, the default hyperparameters for each neural architecture (Supplements 3, 4, 5), as reported in OpenAI Gym, were used without fine-tuning. How an untrained and a trained agent navigate the environment is demonstrated in Supplementary Videos 1 and 2, respectively.
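As a rough illustration of this training setup, the sketch below trains PPO, DQN, and A2C agents with library-default hyperparameters. It assumes the simulation is registered as a Gym environment (the id "CellCulture-v0" is hypothetical) and that Stable-Baselines3 implementations are used; the paper states only that default hyperparameters were used without fine-tuning, so the library choice is an assumption.

```python
# Minimal sketch, not the authors' code: trains each agent with the library's
# default hyperparameters on a hypothetical Gym-registered environment.
import gym
from stable_baselines3 import PPO, DQN, A2C

env = gym.make("CellCulture-v0")   # hypothetical environment id

# "MlpPolicy" suits tabular input; "CnnPolicy" (image) or "MultiInputPolicy"
# (combined tabular + image) would be substituted for the other input modes.
agents = {
    "PPO-tabular": PPO("MlpPolicy", env, verbose=0),
    "DQN-tabular": DQN("MlpPolicy", env, verbose=0),
    "A2C-tabular": A2C("MlpPolicy", env, verbose=0),
}

for name, model in agents.items():
    model.learn(total_timesteps=1_000_000)   # default hyperparameters, no tuning
    model.save(name)
```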
Learned control strategies for different cell types and number of control steps
Next, a PPO-combined agent is tested on each cell type, simulating the diversity of patient-derived cells, to assess how the RL agent adapts its learned control strategy. Six cell types are simulated by changing the cell parameters (Table 1). For each cell type, an agent is first trained for 1M timesteps and then used to navigate 1000 simulations of the same ‘environment’. The average number of beads added at each control step is plotted with standard deviations to reveal the bead-addition patterns (Figure 4). The variability of actions taken in response to observations (presence of error bars) indicates that the policy adapts to different situations rather than simply memorizing and repeating the same actions at each step. In a few instances, the actions were uniform (no error bars, the same number of beads in all 1000 simulations). The learning curve is also included with each bar plot (insets), indicating that the agent settled on a policy by the end of training (discussed above).
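The per-step bead statistics plotted in Figure 4 can be computed from evaluation rollouts along the lines of the sketch below. This is an illustrative reconstruction, assuming the classic Gym interface, a Stable-Baselines-style predict(), a fixed number of control steps per episode, and that the action value equals the number of beads added; none of these names come from the paper.

```python
import numpy as np

def bead_addition_profile(model, env, n_episodes=1000):
    """Record the bead-addition action at every control step of each rollout
    and return the per-step mean and standard deviation across episodes
    (the bars and error bars of Figure 4)."""
    per_episode_actions = []
    for _ in range(n_episodes):
        obs = env.reset()                          # classic Gym API assumed
        done, actions = False, []
        while not done:
            # Deterministic policy: variability across episodes comes from the
            # stochastic simulation, i.e. the agent reacting to different states.
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            actions.append(int(action))            # beads added at this step
        per_episode_actions.append(actions)
    actions = np.asarray(per_episode_actions)      # (episodes, control steps)
    return actions.mean(axis=0), actions.std(axis=0)
```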