Figure 1: Reinforcement learning framework; (a) basic RL loop, (b) RL workflow applied to real-time control of T cell activation and expansion. Cell profiles and properties are inferred from a donor sample pool with the help of imaging and sensing instruments. These properties, coupled with a data-driven approach, are used to create a simulation of the cell culture process. This simulation is then cast as a game, i.e., an RL control environment. An agent is trained in this RL environment and is then used to navigate the actual cell activation process.
In this paper, CAR T-cell activation and expansion is first coded as a 2D simulation in which a player or agent algorithm can decide, at each step, to add, skip, or remove antigen-presenting beads for a given population of T cells, with the end goal of obtaining the highest number of potent cells when the simulation ends. This simulation can be termed an agent-based model because each cell acts as an agent and follows a predefined set of rules (note that this use of "agent" differs from the RL agent discussed above). The simulation is then converted into a customized OpenAI Gym environment, which enables several RL algorithms to be tested and their policies benchmarked on this custom environment.
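As an illustration of what such a custom environment might look like, the sketch below uses the legacy OpenAI Gym API; the class name `TCellCultureEnv`, the state variables, the placeholder dynamics, and the reward definition are illustrative assumptions, not the paper's actual implementation.

```python
import gym
import numpy as np
from gym import spaces


class TCellCultureEnv(gym.Env):
    """Illustrative wrapper around an agent-based T-cell culture simulation.

    Actions: 0 = skip (do nothing), 1 = add antigen-presenting beads,
    2 = remove beads. The reward is the change in the number of potent
    cells, so the cumulative return tracks the final potent-cell count.
    """

    def __init__(self, n_steps=30):
        super().__init__()
        self.action_space = spaces.Discrete(3)
        # "List" observation: cell counts and other measurable parameters.
        self.observation_space = spaces.Box(low=0.0, high=np.inf,
                                            shape=(4,), dtype=np.float32)
        self.n_steps = n_steps

    def reset(self):
        self.t = 0
        # [resting cells, activated cells, exhausted cells, beads]
        self.state = np.array([100.0, 0.0, 0.0, 0.0], dtype=np.float32)
        return self.state.copy()

    def step(self, action):
        if action == 1:
            self.state[3] += 10.0      # add beads
        elif action == 2:
            self.state[3] = 0.0        # remove beads
        potent_before = self.state[1]
        self._simulate_one_day()       # advance the agent-based model
        reward = float(self.state[1] - potent_before)
        self.t += 1
        done = self.t >= self.n_steps
        return self.state.copy(), reward, done, {}

    def _simulate_one_day(self):
        # Placeholder dynamics: beads activate resting cells, while
        # prolonged stimulation slowly exhausts activated cells.
        activated = min(self.state[0], 0.1 * self.state[3])
        exhausted = 0.02 * self.state[1]
        self.state[0] -= activated
        self.state[1] += activated - exhausted
        self.state[2] += exhausted
```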
An RL agent then settles on an optimized strategy by repeatedly interacting with this environment. Three model-free algorithms, proximal policy optimization (PPO), advantage actor-critic (A2C), and deep Q-network (DQN), are selected as candidate algorithms and are trained in this environment using three different observation spaces: 1) a list of cell counts and other measurable parameters, 2) an image of the 2D cell environment, and 3) a combined list-and-image approach (Supplement 5).
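The paper does not name a specific RL library, but as a sketch of how such a benchmark might be run, the Stable-Baselines3 implementations of the three algorithms could be trained on the list-observation environment defined above; the timestep budget is illustrative.

```python
from stable_baselines3 import PPO, A2C, DQN

# Hypothetical benchmark loop over the three candidate algorithms on the
# list-observation version of the illustrative environment above.
env = TCellCultureEnv()
for algo in (PPO, A2C, DQN):
    model = algo("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=100_000)   # training budget is illustrative
    model.save(f"tcell_{algo.__name__.lower()}")
```

For the image observation space a `CnnPolicy` would be used instead, and the combined list-and-image case maps naturally onto a `Dict` observation space with Stable-Baselines3's `MultiInputPolicy`.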
Different cell types are then used to test how the learned policies adapt their bead-dosing control strategies. The effect of poor sensors on training efficiency is also tested by corrupting the observation variables with Gaussian noise.
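One simple way to emulate such sensor noise, sketched under the assumption of an additive noise model, is a Gym observation wrapper that corrupts each measurement with zero-mean Gaussian noise; the noise scale here is a placeholder.

```python
import gym
import numpy as np


class GaussianSensorNoise(gym.ObservationWrapper):
    """Adds zero-mean Gaussian noise to every observation variable."""

    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        noisy = obs + np.random.normal(0.0, self.sigma, size=obs.shape)
        # Clip so corrupted readings stay within the declared observation space.
        return np.clip(noisy, self.observation_space.low,
                       self.observation_space.high).astype(obs.dtype)


# Usage: train on noisy observations to probe each algorithm's robustness.
noisy_env = GaussianSensorNoise(TCellCultureEnv(), sigma=0.05)
```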
Finally, the effects on control performance of changing the number of times the agent is allowed to interact with the environment, as well as of pre-training the agents, are also tested and discussed.
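These last two experiments can likewise be expressed, again as a hedged Stable-Baselines3 sketch rather than the paper's actual code, as varying the training budget and as continuing training from a previously saved (pre-trained) agent.

```python
from stable_baselines3 import PPO

# Effect of the interaction budget: train separate agents with different
# numbers of environment interactions (values are illustrative).
for budget in (10_000, 100_000, 1_000_000):
    PPO("MlpPolicy", TCellCultureEnv(), verbose=0).learn(total_timesteps=budget)

# Effect of pre-training: load a previously trained agent and continue
# training it; here the same illustrative environment stands in for a
# culture with a different cell type.
pretrained = PPO.load("tcell_ppo", env=TCellCultureEnv())
pretrained.learn(total_timesteps=50_000, reset_num_timesteps=False)
```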