Evaluating input strategies and algorithms
At each step of an RL episode, the agent chooses an action by taking an observation snapshot of the environment as input. There are many possible observation data formats that can be sent to the agent. For example, bulk measurements could be made by impedimetric (Agilent, Xcelligence) or permittivity-based sensors (Skroot Lab Inc). Real-time imaging systems (Sartorius Incucyte) coupled with Artificial Intelligence (AI)-powered cell classification tools can identify and quantify cell types based on morphology. Those tools can be used to count naïve and activated cells and to report other cell properties such as age and robustness. Other data, such as the time elapsed, the number of beads in the system, and the action history, can be obtained from the system itself. All these data can be supplied to the agent as a list of measured values; this method is termed the tabular method in this work (Figure 3). Another possible observation format is an image, obtained from high-precision microscopy. In this work we also test whether a three-channel image of the simulation environment, like the one in Figure 2c, alone provides the agent with sufficient information to train adequately (Figure 3). The third input format tested is a fusion of the two, in which both tabular and image information are provided to the agent (Figure 3). Here we refer to each agent in 'algorithm-input' format; for example, PPO-image refers to an agent trained with the PPO algorithm on image data. The aim of this analysis is to demonstrate how agent training depends on the choice of algorithm and input scheme.
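To illustrate how these three observation formats could be declared for a Gym-style environment, the sketch below uses the gymnasium spaces API. The field names, value bounds, and image resolution are illustrative assumptions, not the simulator's actual schema.

```python
import numpy as np
from gymnasium import spaces

# Tabular observation: a flat vector of measured values.
# The fields listed here are assumed placeholders for illustration.
TABULAR_FIELDS = [
    "naive_cell_count", "activated_cell_count", "bead_count",
    "time_elapsed", "last_action",
]
tabular_space = spaces.Box(
    low=0.0, high=np.inf, shape=(len(TABULAR_FIELDS),), dtype=np.float32
)

# Image observation: a three-channel snapshot of the simulation environment
# (cf. Figure 2c). The 84x84 resolution is an assumed placeholder.
image_space = spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8)

# Fusion observation: both modalities delivered together as a dictionary,
# so the agent's network can process each branch before merging them.
fusion_space = spaces.Dict({"tabular": tabular_space, "image": image_space})
```

In this kind of setup, the tabular and image branches of a fusion agent are typically processed by separate sub-networks (e.g. a small MLP and a CNN) whose features are concatenated before the policy head.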
The reward the agent accumulates by the end of each episode is the episodic reward; for each episode we also plot the average of all episodic rewards observed so far (the average reward, shown in red in Figure 3). A rising trend in the average reward early in training indicates that the agent is learning and steadily improving its strategy, whereas a flattening of the average reward indicates that the agent has settled on an optimized strategy (see PPO-tabular and DQN-tabular in Figure 3).
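The averaged curve is simply the cumulative mean of the episodic rewards. A minimal sketch of how such a curve could be computed from logged episodic rewards is shown below; the function name is ours, and the rewards are assumed to be available as a plain list.

```python
def running_average(episodic_rewards):
    """Return one value per episode: the mean of all episodic rewards so far.

    This corresponds to the red curve in Figure 3; the k-th point is the
    mean of the first k episodic rewards, so a rising curve suggests the
    policy is still improving and a plateau suggests it has converged.
    """
    averages = []
    cumulative = 0.0
    for k, reward in enumerate(episodic_rewards, start=1):
        cumulative += reward
        averages.append(cumulative / k)
    return averages
```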