Reward Function Design
In RL, the agent receives a reward or penalty at each step depending
on its latest action. The purpose of a good reward function is to
direct the agent towards the desired outcome as efficiently as
possible. Designing a reward function is an iterative process that
requires an understanding of both the algorithm and the environment.
For this cell activation and expansion control task, we propose a
reward scheme whose end goal is to maximize the number of potent
activated cells. To achieve this, the agent must be encouraged to add
beads to activate the cells and to remove beads when it anticipates
exhaustion. At each step, if the average potency is higher than at the
previous step, the agent receives a small reward of 5; otherwise it
receives a penalty. At the final step, the summed potency of all cells
above a threshold is multiplied by 100 and added to the reward. The
reasoning behind this scheme is that, early in training, the small
per-timestep rewards encourage the agent to add beads to activate the
cells, since doing so increases the average potency per cell.
After a few steps, the beads already present can begin to exhaust the
activated cells. The reward obtained for the end goal is much higher
than that of all other timesteps combined, which prompts the agent to
act so as to score higher at the end, even at the expense of
sacrificing some early rewards. In this way, through repeated
interaction with the environment, the agent can train itself to
determine when to add or remove beads in order to maximize the reward.
Intuitively, it may seem that ratio values further above one should be
rewarded more strongly, but empirically we found that the agent
suffers from reward frustration and fails to learn if larger values
are used at the beginning.
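
As a minimal sketch of this scheme, the reward could be computed as
follows. The function names, the penalty magnitude, and the potency
threshold are illustrative assumptions rather than the exact values or
implementation used.

```python
# Sketch of the described reward scheme. The penalty magnitude,
# potency threshold, and function signatures are assumptions for
# illustration, not taken from the actual implementation.

STEP_REWARD = 5          # small reward when average potency increases
STEP_PENALTY = -5        # assumed magnitude of the per-step penalty
TERMINAL_SCALE = 100     # multiplier on the terminal potency sum
POTENCY_THRESHOLD = 0.5  # assumed cutoff defining a "potent" cell

def step_reward(avg_potency: float, prev_avg_potency: float) -> float:
    """+5 if average potency rose since the previous step, else a penalty."""
    return STEP_REWARD if avg_potency > prev_avg_potency else STEP_PENALTY

def terminal_reward(cell_potencies: list[float]) -> float:
    """Sum the potency of cells above the threshold and scale it by 100."""
    potent_sum = sum(p for p in cell_potencies if p > POTENCY_THRESHOLD)
    return TERMINAL_SCALE * potent_sum

def reward(avg_potency, prev_avg_potency, cell_potencies, is_last_step):
    """Per-step shaping reward plus the terminal bonus at the final step."""
    r = step_reward(avg_potency, prev_avg_potency)
    if is_last_step:
        r += terminal_reward(cell_potencies)
    return r
```

Keeping the per-step reward small relative to the terminal term
reflects the intended trade-off: the agent may forgo a few early
rewards if doing so raises the final potency sum.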