Figure 4: Change of strategy by the agent using 20 control steps for different cell types. (a) Simulation process used to obtain control strategy information. (b) Strategy of the agent, visualized as the average number of beads (y axis) at each control step (x axis). Error bars indicate the standard deviation of the number of beads used at that control step, a measure of variability across simulations; where no bar is shown, the agent used the same number of beads in every simulation. The learning curve is attached to each bar plot, with the same axes as in Figure 3. Arrows between plots indicate the change in cell type (see also Table 1).
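The per-step summary plotted in panel (b) can be illustrated with a minimal sketch. The function name `summarize_strategy`, the array layout (one row per simulation, one column per control step), and the toy bead counts are all assumptions made for illustration; the use of the population standard deviation (NumPy's default, `ddof=0`) is likewise an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def summarize_strategy(beads_per_run):
    """Return the mean and standard deviation of beads at each control step.

    beads_per_run: 2D array-like, shape (n_simulations, n_control_steps).
    This layout is a hypothetical choice for this sketch.
    """
    beads = np.asarray(beads_per_run, dtype=float)
    if beads.ndim != 2:
        raise ValueError("expected shape (n_simulations, n_control_steps)")
    # Average over simulations (axis 0) to get one value per control step.
    mean = beads.mean(axis=0)
    # Population standard deviation (ddof=0); an assumption, not from the text.
    std = beads.std(axis=0)
    return mean, std

# Toy data: 3 simulated runs, 4 control steps (made-up numbers).
runs = [
    [3, 2, 0, 1],
    [3, 4, 0, 1],
    [3, 0, 0, 1],
]
mean, std = summarize_strategy(runs)
# A zero entry in `std` corresponds to a control step with no error bar.
constant_steps = np.flatnonzero(std == 0)
```

In this toy example only the second control step varies between runs, so it would be the only step drawn with an error bar.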
Beads are required to convert the newly produced naïve cells, but those same beads cause the activated cells to become exhausted. To navigate this trade-off, the agent alternately adds and removes beads, and the overall end score is lower than for the other cell types.
To test the effect of an agent that has more control over the environment, we repeated the training process with 50 control steps (interacting with the growth vessel every 3.2 hr instead of every 8 hr; see justification in Supplement 8) for six cell types (Table 1). The base case behaved the same way as before, with more beads dosed at the beginning and fewer at the end (Figure 5). However, because it has more frequent control points, the agent skips adding beads at the onset to account for the small natural exhaustion, adds beads continuously from the second to the fifth step, and then adds, removes, or skips depending on the simulated status, with a diminishing number of beads in subsequent steps. For cell type 2, it adds beads for more steps at the outset (Figure 5) than before (Figure 4b), and cell types 3 and 4 differ as well. Cell type 5 is simulated with only regeneration increased from the base case, and the agent removes beads in the second half to let the activated cells grow without becoming exhausted. However, when the natural exhaustion is increased in cell type 3, the agent faces a dilemma: if it adds beads at the beginning, the converted cells will be exhausted in the following steps; if it adds beads at the end, it cannot take advantage of the higher regeneration rate. Balancing these constraints, the agent adds beads in the first two steps, removes them in the third, and skips the next 10 steps. It then adds or removes beads depending on the current state. This strategy is nevertheless less favorable than those for the other cell types and ends with a lower number of potent cells. Finally, for cell type 6 we increased the rate of natural exhaustion and added asymmetric regeneration. In this case the agent alternately adds and removes beads for the first third of the control steps and then ramps the number of beads, with variability based on the current cell count; again, the expected outcome (average reward) for this unfortunate cell type depends on chance and is lower than for the others.
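The two control intervals quoted above are consistent with a fixed total culture time split into more steps. The short sketch below checks that arithmetic; the 160 hr total is inferred here from 20 steps at 8 hr each and is not stated explicitly in the text, and the helper name `control_interval_hours` is hypothetical.

```python
def control_interval_hours(total_hours, n_steps):
    """Hours between agent interactions for a fixed total culture time."""
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    return total_hours / n_steps

# Assumption: total duration inferred from 20 control steps at 8 hr each.
TOTAL_HOURS = 20 * 8.0

interval_20 = control_interval_hours(TOTAL_HOURS, 20)  # 20-step setting
interval_50 = control_interval_hours(TOTAL_HOURS, 50)  # 50-step setting

print(f"20 steps: every {interval_20:.1f} hr; 50 steps: every {interval_50:.1f} hr")
```

Under this assumption, moving from 20 to 50 control steps gives the agent 2.5 times as many decision points over the same culture duration.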