Figure 1: Reinforcement learning framework; (a) basic RL loop, (b) RL workflow applied to real-time control of T cell activation and expansion. Cell profiles and properties are inferred from a donor sample pool with the help of imaging and sensing instruments. These properties, coupled with a data-driven approach, are used to build a simulation of the cell culture process. This simulation is then cast as a game-like RL control environment, on which an agent is trained to navigate the actual cell activation process.
In this paper, CAR T-cell activation and expansion is first encoded as a 2D simulation in which a player or agent algorithm decides at each step whether to add, skip, or remove antigen-presenting beads for a given population of T cells, with the goal of maximizing the number of potent cells when the simulation ends. This simulation can be termed an agent-based model because each cell acts as an agent that follows a predefined set of rules (note that this use of "agent" differs from the RL agent discussed above). The simulation is then wrapped as a customized OpenAI Gym environment, which enables testing several RL algorithms and benchmarking policies for this custom environment. An RL agent then converges on an optimized strategy by repeatedly interacting with the environment. Three model-free algorithms, proximal policy optimization (PPO), advantage actor-critic (A2C), and deep Q-network (DQN), are selected as candidates and trained in this environment using three different observation spaces: 1) a list of cell counts and other measurable parameters, 2) an image of the 2D cell environment, and 3) a combined list-and-image approach (Supplement 5). Different cell types are then used to test how the learned policies adapt their bead-dosing control strategies. The effect of noise from poor sensors on training efficiency is also tested by corrupting the observation variables with Gaussian noise. Finally, the effects of varying the number of interactions the agent is allowed with the environment, as well as of pre-training agents, on control performance are tested and discussed; a minimal code sketch of the overall setup follows.
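To make the workflow concrete, the sketch below shows how such a bead-dosing simulation could be exposed as a custom environment and benchmarked with the three candidate algorithms. It is a hedged illustration rather than the paper's implementation: the state variables, growth dynamics, reward, and all numeric constants are placeholder assumptions; only the list-of-measurables observation space is shown; and it is written against the Gymnasium successor of OpenAI Gym together with the Stable-Baselines3 implementations of PPO, A2C, and DQN, which the source does not specify.

```python
"""Minimal sketch of the bead-dosing problem as a custom RL environment.

Everything below is illustrative: the state variables, growth dynamics,
reward, and numeric constants are placeholder assumptions, not the
paper's agent-based model.
"""
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import A2C, DQN, PPO


class TCellCultureEnv(gym.Env):
    """Toy stand-in for the 2D T-cell culture simulation.

    Observation (the "list" observation space): cell count, bead count,
    potent-cell count, and elapsed steps. Actions: 0 = skip, 1 = add
    antigen-presenting beads, 2 = remove beads. Reward: number of potent
    cells when the episode ends.
    """

    SKIP, ADD_BEADS, REMOVE_BEADS = 0, 1, 2

    def __init__(self, horizon=30, obs_noise_std=0.0, seed=None):
        super().__init__()
        self.horizon = horizon
        self.obs_noise_std = obs_noise_std            # Gaussian sensor-noise level
        self.action_space = spaces.Discrete(3)        # add / skip / remove beads
        self.observation_space = spaces.Box(
            low=0.0, high=np.inf, shape=(4,), dtype=np.float32)
        self.rng = np.random.default_rng(seed)

    def _obs(self):
        obs = np.array([self.cells, self.beads, self.potent, self.t],
                       dtype=np.float32)
        if self.obs_noise_std > 0:                    # mimic poor sensors
            obs = obs + self.rng.normal(0.0, self.obs_noise_std, size=obs.shape)
            obs = np.clip(obs, 0.0, None)
        return obs.astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.cells, self.beads, self.potent = 0, 100.0, 0.0, 0.0
        return self._obs(), {}

    def step(self, action):
        if action == self.ADD_BEADS:
            self.beads += 50.0
        elif action == self.REMOVE_BEADS:
            self.beads = max(0.0, self.beads - 50.0)
        # Placeholder dynamics: bead stimulation drives expansion, but
        # over-stimulation erodes potency (purely illustrative).
        stimulation = min(1.0, self.beads / max(self.cells, 1.0))
        self.cells *= 1.0 + 0.2 * stimulation
        self.potent = self.cells * max(0.0, 1.0 - 0.5 * stimulation ** 2)
        self.t += 1
        terminated = self.t >= self.horizon
        reward = float(self.potent) if terminated else 0.0   # sparse terminal reward
        return self._obs(), reward, terminated, False, {}


# Benchmark the three candidate algorithms on the same environment.
env = TCellCultureEnv(obs_noise_std=0.1)              # illustrative noise level
for algo in (PPO, A2C, DQN):
    model = algo("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=50_000)               # interaction budget (a tunable)
```

In this sketch, the image and combined observation spaces described above would correspond to replacing the vector Box with an image-shaped Box or a Dict space and a multi-input policy, while the obs_noise_std and total_timesteps arguments are the knobs that mirror the sensor-noise and interaction-budget experiments.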