Reward Function Design
In RL, the agent receives a reward or penalty at each step depending on its latest action. The purpose of a good reward function is to direct the agent towards the desired outcome as efficiently as possible. Designing a reward function is an iterative process informed by an understanding of both the algorithm and the environment. For this cell activation and expansion control task, a reward scheme is proposed in which the end goal is to obtain the greatest number of potent activated cells. To achieve this, the agent must be encouraged both to add beads to activate cells and to remove beads when it estimates that exhaustion is imminent. At each step, if the average potency is higher than in the previous step, the agent receives a small reward of 5; otherwise it receives a penalty. At the last step, the sum of the potencies of all cells above a threshold is multiplied by 100 and added to the reward. The reasoning behind this scheme is that, early in training, the small per-step rewards encourage the agent to add beads to activate cells, because doing so increases the average potency per cell.
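As a concrete illustration, the following Python sketch implements this reward scheme under stated assumptions: the environment is assumed to expose the per-cell potency values at each timestep, and the constant names, the penalty magnitude, and the potency threshold value are illustrative placeholders rather than values from the actual environment.

```python
import numpy as np

# Assumed constants (illustrative values, not from the original implementation).
STEP_REWARD = 5.0        # small reward when average potency improves
STEP_PENALTY = -5.0      # penalty otherwise (magnitude is an assumption)
TERMINAL_SCALE = 100.0   # multiplier on the end-of-episode potency sum
POTENCY_THRESHOLD = 0.5  # potency cutoff for counting a cell (assumed value)

def step_reward(potencies, prev_avg_potency, is_last_step):
    """Return (reward, new_avg_potency) for one environment step."""
    avg_potency = float(np.mean(potencies))

    # Per-step shaping: reward an increase in average potency, penalize otherwise.
    reward = STEP_REWARD if avg_potency > prev_avg_potency else STEP_PENALTY

    # Terminal bonus: sum of potency over cells above the threshold, scaled up
    # so that the end goal dominates the accumulated per-step rewards.
    if is_last_step:
        potent_cells = potencies[potencies > POTENCY_THRESHOLD]
        reward += TERMINAL_SCALE * float(np.sum(potent_cells))

    return reward, avg_potency
```

The large terminal scale relative to the per-step reward reflects the design choice described above: the end-of-episode term should outweigh the shaping rewards so the agent is willing to sacrifice some of them.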
After a few steps, the beads already present can begin to exhaust the activated cells. The reward obtained for the end goal is much larger than the rewards of all other timesteps combined, which prompts the agent to act so as to score higher at the end, even at the expense of sacrificing some early rewards. In this way, through repeated interaction with the environment, the agent can train itself and determine when to add or remove beads to maximize the reward score. Intuitively, it might seem that potency ratios further above one should be rewarded more strongly, but empirically we found that the agent suffers from reward frustration and fails to learn if larger reward values are used at the beginning.