Introduction
CAR T-cell therapy is a promising approach for personalized cancer treatment, applicable to a growing range of diseases including B cell malignancies, multiple myeloma, solid tumors, HIV, and several other types of cancer. In brief, CAR T-cell manufacturing involves collection and separation of naïve T cells from the patient, transfecting them to produce chimeric antigen receptors (CARs), and expanding them to provide a suitable dose. The cells are then infused back into the patient, where they efficiently attack the malignant cells. An important step during production is activating the T cells, as activated cells proliferate more rapidly than naïve cells and express CARs more readily. One popular approach to activating the cells is to use antigen-presenting beads; however, prolonged proximity to these artificial antigen-presenting cells (aAPCs) can lead to cell exhaustion. Exhausted cells consequently lose proliferative and therapeutic capacity. A successful activation and expansion campaign yields the maximum number of robust, active cells. Logically, there is an optimal activation (bead addition) strategy that keeps the maximum number of cells activated while minimizing the number of exhausted cells. However, such optimal strategies are often confounded by heterogeneous activation and proliferation rates of donor cells, which vary with age and other genetic factors. We therefore posit that the addition of activating beads is a dynamic control problem that must adapt to the specific features of a patient's cells and is thus a good candidate for real-time control strategies.
The current practice for activating T cells is to add beads at the beginning of the culture and remove them at the end. Prolonged signaling causes exhaustion, and this can be mitigated by halting stimulation when it is unnecessary. Because younger cells likely correlate with higher proliferation rates, the manufacturing time is often curtailed ad hoc. Although intermittent exposure to beads has been observed to yield a greater number of robust effector cells, the underlying activation-exhaustion mechanism, and its response to dosing across all cell types, remains elusive to date. Considering the difficulty of the production process, there is substantial room for improvement. No monitoring or control is involved in the activation process, which could partially explain the loss of potency of manufactured CAR T-cells. Optimizing CAR-T production is not limited to activation bead timing; cytokines such as IL-2 are necessary for survival and growth but can also induce Fas-mediated activation-induced cell death (AICD). Treatment efficacy can be enhanced by controlling the dosage of IL-27, while excessive IL-2 concentration leads to exhaustion. While the complex and rapid interplay between different cytokines and metabolic and genetic pathways in individual cells is hard to map, tracking sensor outputs and imaging data from the bulk population on the fly is far more tractable.
Recently there has been significant progress in the field of real-time cell monitoring and automated control of biological processes. It is possible to track and monitor individual cells and to infer their condition from morphology. Sensors can output a high-dimensional feature space that can be used to profile cells or infer growth trajectories. Learning-based algorithms can then be applied to the decision making of adding or removing the activator. Reinforcement learning (RL) is well adapted to such black-box decision problems, where the system dynamics do not obey defined analytical expressions.
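As an illustration of how such sensor or imaging outputs could be condensed into an observation state, the sketch below extracts simple morphological summary statistics from a single segmented microscopy frame. The segmentation step, the chosen features, and the function name are assumptions for illustration only, not the pipeline used in this work.

```python
import numpy as np
from skimage import filters, measure

def observation_from_image(image):
    """Assemble a bulk-population observation vector from one grayscale frame.

    Illustrative sketch only: Otsu thresholding stands in for a real
    cell-segmentation pipeline, and the features (cell count, mean area,
    mean eccentricity, mean solidity) are placeholders for whatever the
    sensor actually reports.
    """
    mask = image > filters.threshold_otsu(image)   # crude foreground segmentation
    labels = measure.label(mask)                   # connected components ~ candidate cells
    props = measure.regionprops(labels)
    if not props:
        return np.zeros(4)
    areas = np.array([p.area for p in props], dtype=float)
    eccentricities = np.array([p.eccentricity for p in props])
    solidities = np.array([p.solidity for p in props])
    return np.array([len(props), areas.mean(), eccentricities.mean(), solidities.mean()])
```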
RL comprises an agent that interacts with an environment and gleans rules to develop a control policy (Figure 1a). At each time step, the agent assesses the current environment (observation space) and takes an action from a predefined list of allowed moves. The environment then updates based on the action, and a reward or penalty is assigned to the agent, which updates its policy based on this learning. By repeatedly interacting with the environment, the agent continuously refines the learned strategy and establishes a policy. The policy is a function that maps every possible observation state to an action, with the goal of maximizing the final reward. RL has been widely used for chatbots, autonomous vehicles, robot automation, predicting stock prices and projections, and industrial processes such as manufacturing and supply chains. Agents can perform better in an actual environment after being trained on incrementally complex simulated environments. RL algorithms are divided into two types: model-based and model-free. Model-free RL algorithms optimize a policy or value function instead of modeling the environment; they can learn directly from sensor data and are useful in situations where the environment is difficult to model. Despite RL being a well-established field, its application to the optimization of biological systems is largely untapped. The main reasons are the lack of suitable environments in which to train agents and the confounding, inherent variability of biological processes. To benchmark new RL algorithms, OpenAI has established a test platform called gym, which provides several environments on which new policy algorithms can be tested. The greatest benefit of using an RL platform is the easy integration of different algorithms and policies to test on a specific environment. Different environments have been coded for specific control tasks, for example robo-gym for simple robotic tasks, panda-gym and gym-pybullet-drones for multi-goal robotic tasks, and MACAD-gym for self-driving bots. Biological processes differ from existing control problems, like driving a car, in their complexity and variability. Each action by the agent on a biological 'environment' will produce a stochastic outcome rather than a deterministic one. Controlling biological systems through RL will become possible with a better understanding of system dynamics as well as the design of better process simulations.
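To make the agent-environment loop concrete, the following sketch outlines a gym-style environment for bead dosing, written against the classic Gym reset/step interface. The class name, state variables, transition rates, and reward are illustrative assumptions rather than the model developed in this work.

```python
import numpy as np
import gym
from gym import spaces

class BeadDosingEnv(gym.Env):
    """Illustrative gym-style environment for activation-bead dosing.

    All dynamics are placeholders: the observation is the count of
    (naive, activated, exhausted) cells, and the actions are
    0 = hold, 1 = add beads, 2 = remove beads.
    """

    def __init__(self, horizon=50):
        super().__init__()
        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(3,), dtype=np.float32)
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.beads_in = False
        self.state = np.array([1e5, 0.0, 0.0], dtype=np.float32)  # naive, activated, exhausted
        return self.state.copy()

    def step(self, action):
        if action == 1:
            self.beads_in = True
        elif action == 2:
            self.beads_in = False
        naive, act, exh = self.state
        if self.beads_in:                 # beads activate naive cells but drive exhaustion
            newly_activated = 0.20 * naive
            newly_exhausted = 0.05 * act
        else:                             # without beads, activated cells rest and keep dividing
            newly_activated = 0.0
            newly_exhausted = 0.01 * act
        growth = 0.10 * act               # proliferation of activated cells
        naive -= newly_activated
        act += newly_activated + growth - newly_exhausted
        exh += newly_exhausted
        self.state = np.array([naive, act, exh], dtype=np.float32)
        self.t += 1
        done = self.t >= self.horizon
        reward = float(act - exh) if done else 0.0  # terminal reward: robust minus exhausted cells
        return self.state.copy(), reward, done, {}
```

A model-free algorithm (for example, DQN or PPO from a library such as Stable-Baselines3) could then be trained against such an environment without any analytical model of the cell dynamics.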
Multiple efforts have been made thus far to model T-cell expansion. Researchers have presented defined, analytical models using systems of ordinary differential equations. For modeling biological systems, stochastic models are often better suited than deterministic ones. For instance, Monte Carlo methods have been used to model the CD4+ T-cell response to infection and to manage biological variability in cell therapy production. Growth of organisms can be modeled with lattice kinetic Monte Carlo simulations; for example, Hall et al. modeled the growth of yeast under the influence of nutrient concentration and magnetic field exposure. Agent-based modeling (ABM) is another stochastic approach in which each component of the model is an autonomous entity governed by its own rules. ABM is widely used in T cell therapy models; for instance, Neve-Oz et al. presented agent-based simulations of T cell-aAPC interactions, and Azarov et al. modeled the chemotaxis of T cells toward dendritic cells. Zheng et al. have demonstrated a hybrid RL strategy to optimize media replacement steps in cell therapy production and showed via simulation that it outperforms deterministic models. Although control of the real physical environment (Figure 1b) is the near-term goal, demonstrating RL on a simulated biological process is a good first step toward understanding how RL can be effectively applied. Overall, the emergence of faster computing architectures is propelling us toward a future of ML-driven policy making and of training robotic arms for precision medicine and bioengineering.
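As an illustration of the stochastic, agent-based flavor of such models, the sketch below advances a population in which each cell independently samples its own activation, division, or exhaustion event at every time step. The transition probabilities are placeholders chosen for readability, not fitted parameters from the literature.

```python
import random

# Per-cell states for an illustrative agent-based update
NAIVE, ACTIVATED, EXHAUSTED = 0, 1, 2

def step_population(cells, beads_present, p_act=0.2, p_exh=0.05, p_div=0.1):
    """Advance every cell one time step with stochastic transitions.

    Each cell is an autonomous agent: naive cells may activate when beads
    are present, and activated cells may divide or become exhausted. The
    probabilities are placeholders, not fitted parameters.
    """
    next_cells = []
    for state in cells:
        if state == NAIVE and beads_present and random.random() < p_act:
            state = ACTIVATED
        elif state == ACTIVATED:
            if random.random() < p_exh:
                state = EXHAUSTED
            elif random.random() < p_div:
                next_cells.append(ACTIVATED)   # daughter cell from division
        next_cells.append(state)
    return next_cells

# Example: simulate 24 time steps of continuous bead exposure
population = [NAIVE] * 10_000
for _ in range(24):
    population = step_population(population, beads_present=True)
```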