Abhisek Panda - 21DOCS Test Area

Serverless computing systems have become very popular because of their natural advantages with respect to auto-scaling, load balancing, and fast distributed processing. Almost all serverless systems today define two QoS classes: one set of applications is deemed latency sensitive (LS), and the other is handled on a best-effort (BE) basis. LS jobs are often very short, and thus they require low latencies and deterministic processing times (akin to soft real-time jobs). Hence, several performance metrics have been considered in the literature for the different QoS classes, such as the mean, median, and tail latency. We use this standard setting and combine all performance metrics into a performance vector that we refer to as the comprehensive latency (CL). We design a system, FaaSCtrl, that is superior to prior work on every component of the CL for LS applications by a wide margin (e.g., 44.2% and 28% better than the nearest competing work for the variance and mean components, respectively). We achieve this by first performing detailed characterization experiments to understand the reasons for a poor CL. Once we identify the causative factors, we establish that the knobs that control these factors, such as process priorities and CPU affinities, have a complex relationship with the CL. This makes the problem a good fit for reinforcement learning (RL), which has been used successfully in the past to solve problems where the benefit of a certain intervention is probabilistic in nature. Our detailed characterization experiments naturally segue into our RL formulation. The last section establishes that our RL-based system is robust and generalizes well to hitherto unseen situations.
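
To make the notion of a performance vector concrete, the following is a minimal sketch of how a CL-style vector could be computed from a list of measured latencies for one QoS class. The function name `comprehensive_latency`, the dictionary layout, the choice of the 99th percentile as the tail, and the nearest-rank percentile method are all assumptions made for illustration; the abstract names only the mean, median, tail latency, and variance as components and does not specify how they are computed.

```python
import statistics


def comprehensive_latency(samples, tail_pct=99):
    """Summarize latency samples as a vector of statistics (hypothetical helper).

    samples  -- iterable of latency measurements (e.g., milliseconds)
    tail_pct -- percentile used as the tail latency (assumed: 99)
    """
    ordered = sorted(samples)
    if len(ordered) < 2:
        raise ValueError("need at least two latency samples")

    # Nearest-rank percentile: the smallest sample such that at least
    # tail_pct percent of all samples are less than or equal to it.
    rank = max(1, -(-tail_pct * len(ordered) // 100))  # ceiling division

    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "tail": ordered[rank - 1],
        # Population variance of the observed samples.
        "variance": statistics.pvariance(ordered),
    }


if __name__ == "__main__":
    # Made-up latencies with a single slow outlier.
    print(comprehensive_latency([10, 12, 11, 13, 50]))
```

In this sketch a single slow request inflates the mean, tail, and variance while leaving the median almost untouched, which illustrates why comparing systems on every component of the vector can be more informative than comparing them on one number.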