Introduction
CAR T-cell therapy is a promising approach for personalized cancer treatment, applicable to a growing range of diseases including B cell malignancies, multiple myeloma, solid tumors, HIV, and several other conditions. In brief, CAR T-cell manufacturing involves collection and separation of naïve T cells from the patient, transfecting them to express Chimeric Antigen Receptors (CARs), and expanding them to provide a suitable dose. The cells are then infused back into the patient, where they efficiently attack the malignant cells. An important step during CAR T-cell production is activating the T cells, since activated cells proliferate more rapidly than naïve cells and express CARs more readily. One popular approach to activation uses antigen-presenting beads (artificial antigen-presenting cells, aAPCs); however, prolonged exposure to aAPCs can lead to cell exhaustion. Exhausted cells consequently lose proliferative and therapeutic capacity. The success of an activation and expansion campaign is therefore measured by the number of robust, activated cells it yields.
Logically, there is an optimal activation (bead-addition) strategy that maximizes the number of cells remaining activated while minimizing the number of exhausted cells. However, such optimal strategies are often confounded by heterogeneous activation and proliferation rates of donor cells, which vary with donor age and other genetic factors. We therefore posit that the addition of activating beads is a dynamic control problem that must adapt to the features of a specific patient's cells and is thus a good candidate for real-time control strategies.
The current practice for activating T cells is to add beads at the beginning of the culture and remove them at the end. Prolonged signaling causes exhaustion, which can be mitigated by halting stimulation once it is no longer needed. Because younger cells likely correlate with higher proliferation rates, manufacturing time is often curtailed ad hoc.
Although intermittent exposure to beads has been observed to yield a greater number of robust effector cells, the underlying activation-exhaustion mechanism, and its response to dosing across cell types, remains elusive to date. There is room for improvement given the difficulty of the production process. No monitoring or control is involved in the activation step, which could partially explain the loss of potency of manufactured CAR T-cells. Optimizing CAR-T production is not limited to the timing of activation beads; cytokines such as IL-2 are necessary for survival and growth but can also induce Fas-mediated activation-induced cell death (AICD). Treatment efficacy can be enhanced by controlling the dosage of IL-27, while excessive IL-2 concentration leads to exhaustion. While the complex and rapid interplay among cytokines and the metabolic and genetic pathways of individual cells is hard to map, tracking sensor outputs and imaging data from the bulk population on the fly is far more tractable.
Recently, there has been significant progress in real-time cell monitoring and automated control of biological processes. It is possible to track, monitor, and infer the condition of individual cells from their morphology. Sensors can output a high-dimensional feature space that can be used to profile cells or infer growth trajectories. Learning-based algorithms can be applied to guide the decision of when to add or remove the activator. Reinforcement learning (RL) is well suited to such black-box decision problems, where the system dynamics do not obey defined analytical expressions.
RL comprises an agent that interacts with an environment and gleans rules to develop a control policy (Figure 1a). At each time step, the agent assesses the current environment (the observation space) and takes an action from a predefined list of allowed moves. The environment then updates based on the action, and a reward or penalty is assigned to the agent, which updates its policy based on this feedback. Through repeated interaction with the environment, the agent continuously refines its learned strategy and establishes a policy: a function that maps every possible observation state to an action with the goal of maximizing the cumulative reward. RL has been widely used for chatbots, autonomous vehicles, robot automation, stock price prediction and projection, and industrial processes such as manufacturing and supply chains. Agents can perform better in an actual environment after being trained on incrementally complex simulated environments. RL algorithms are divided into two types: model-based and model-free. Model-free RL algorithms optimize a policy or value function instead of modeling the environment; they can learn directly from sensor data and are useful when the environment is difficult to model. Despite RL being a well-established field, its application to optimizing biological systems is largely untapped. The main reasons are the lack of suitable environments in which to train agents and the confounding, inherent variability of biological processes. To benchmark new RL algorithms, OpenAI established a test platform called Gym, which provides several environments on which new policy algorithms can be tested. The greatest benefit of using such an RL platform is the easy integration of different algorithms and policies for testing on a specific environment.
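As an illustrative sketch only (the environment name, observation features, placeholder dynamics, and reward shaping below are hypothetical and not the process model developed in this work), a bead-dosing task can be exposed through the standard Gym interface, here via the maintained Gymnasium package, and passed to an off-the-shelf model-free algorithm such as PPO from stable-baselines3:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class BeadDosingEnv(gym.Env):
    """Hypothetical activation/exhaustion environment (illustrative placeholder only)."""

    def __init__(self, horizon=96):
        super().__init__()
        # Bulk observation, e.g. [viable cell count, activated fraction, exhausted fraction]
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(3,), dtype=np.float32)
        # Action: 0 = remove beads, 1 = hold, 2 = add beads
        self.action_space = spaces.Discrete(3)
        self.horizon = horizon

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.state = np.array([1.0, 0.0, 0.0], dtype=np.float32)
        return self.state, {}

    def step(self, action):
        self.t += 1
        # Placeholder stochastic update; a real simulator or live sensor readout goes here.
        growth = self.np_random.normal(0.05 * action, 0.01)
        self.state = self.state + np.array([growth, 0.02 * action, 0.01 * action], dtype=np.float32)
        reward = float(self.state[1] - self.state[2])  # reward activated cells, penalize exhausted ones
        truncated = self.t >= self.horizon              # episode ends when the culture horizon is reached
        return self.state, reward, False, truncated, {}


# Any model-free agent can now be trained against this interface, e.g. PPO:
from stable_baselines3 import PPO

model = PPO("MlpPolicy", BeadDosingEnv(), verbose=0)
model.learn(total_timesteps=50_000)
```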
Different environments have been coded for specific control tasks: for example, robo-gym for simple robotic tasks, panda-gym and gym-pybullet-drones for multi-goal robotic tasks, and MACAD-gym for self-driving agents. Biological processes differ from existing control problems such as driving a car in their complexity and variability: each action the agent takes on a biological 'environment' produces a stochastic outcome rather than a deterministic one. Controlling biological systems through RL will become possible with a better understanding of system dynamics as well as the design of better process simulations.
Multiple efforts have been made to model T-cell expansion. Researchers have presented defined, analytical models based on systems of ordinary differential equations. For biological systems, stochastic models are often better suited than deterministic ones. For instance, Monte Carlo methods have been used to model the CD4+ T-cell response to infection and to manage biological variability in cell therapy production. Growth of organisms can be modeled with lattice kinetic Monte Carlo simulation; for example, Hall et al. modeled the growth of yeast under the influence of nutrient concentration and magnetic field exposure.
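To make the stochastic viewpoint concrete (the rates below are arbitrary placeholders rather than fitted parameters), a Monte Carlo update draws the number of division and exhaustion events in each time step from probability distributions, so repeated runs yield a distribution of trajectories instead of a single deterministic curve:

```python
import numpy as np

rng = np.random.default_rng(0)


def mc_step(active, exhausted, beads_present, p_div=0.10, p_exhaust=0.03):
    """One stochastic hour: each active cell may divide and, if beads are present, may exhaust."""
    divisions = rng.binomial(active, p_div if beads_present else p_div / 4)
    newly_exhausted = rng.binomial(active, p_exhaust if beads_present else 0.0)
    return active + divisions - newly_exhausted, exhausted + newly_exhausted


# Repeated simulations from the same initial condition give different outcomes.
active, exhausted = 1_000, 0
for hour in range(72):
    active, exhausted = mc_step(active, exhausted, beads_present=(hour < 24))
```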
Agent-based modeling (ABM) is another stochastic approach in which each component of the model is an autonomous entity governed by its own rules. ABM is widely used in T-cell therapy models; for instance, Neve-Oz et al. presented agent-based simulations of T cell-aAPC interactions, and Azarov et al. modeled the chemotaxis of T cells toward dendritic cells. Zheng et al. demonstrated a hybrid-RL strategy to optimize media replacement steps in cell therapy production and showed via simulation that it outperforms deterministic models.
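For intuition about the ABM formalism (and not a reproduction of any of the cited models), each cell can be represented as an autonomous agent that carries its own state and applies its own transition rules at every time step; the rates and thresholds in this toy sketch are arbitrary:

```python
import random
from dataclasses import dataclass

random.seed(0)


@dataclass
class TCell:
    state: str = "naive"        # "naive", "activated", or "exhausted"
    hours_stimulated: int = 0   # cumulative time spent in contact with beads


def update(cell, beads_present):
    """Apply this agent's rules for one time step (placeholder rates)."""
    if beads_present and cell.state == "naive" and random.random() < 0.20:
        cell.state = "activated"
    if beads_present and cell.state == "activated":
        cell.hours_stimulated += 1
        if cell.hours_stimulated > 48 and random.random() < 0.05:
            cell.state = "exhausted"
    # Activated cells may divide, producing a daughter that inherits the stimulation history.
    if cell.state == "activated" and random.random() < 0.08:
        return [cell, TCell(state="activated", hours_stimulated=cell.hours_stimulated)]
    return [cell]


population = [TCell() for _ in range(500)]
for hour in range(72):
    beads = hour < 24  # beads present only for the first 24 h
    population = [c for cell in population for c in update(cell, beads)]
```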
Although control of the real physical environment (Figure 1b) is the near-term goal, demonstrating RL on a simulated biological process is a good first step toward understanding how RL can be applied effectively. Overall, the emergence of faster computing architectures is propelling us toward a future of ML-driven policy making and trained robotic arms for precision medicine and bioengineering.