
DRS: A Deep Reinforcement Learning enhanced Kubernetes Scheduler for Microservice-based System
  • Zhaolong Jian,
  • Xueshuo Xie,
  • Yaozheng Fang,
  • Yibing Jiang,
  • Tao Li,
  • Ye Lu
Zhaolong Jian
Nankai University College of Computer Science
Xueshuo Xie
Nankai University College of Computer Science
Yaozheng Fang
Nankai University College of Computer Science
Yibing Jiang
Nankai University College of Computer Science
Tao Li
Nankai University College of Computer Science

Corresponding Author: [email protected]

Ye Lu
Nankai University College of Computer Science

Abstract

Kubernetes, the best-known container orchestration framework, is widely used to manage and schedule the resources of microservices in cloud-native distributed applications. However, Kubernetes preferentially schedules microservices to nodes whose own CPU and memory resources are rich and balanced. As a result, its native scheduler, Kube-scheduler, may cause resource fragmentation and decrease resource utilization. In this paper, we propose a deep reinforcement learning enhanced Kubernetes scheduler named DRS. To improve resource utilization and reduce load imbalance, we first formulate the Kubernetes scheduling problem as a Markov decision process and carefully design the state, action, and reward. Then, we design and implement the DRS monitor, which collects six resource-utilization metrics to construct a comprehensive global resource view. Finally, DRS automatically learns the scheduling policy through interaction with the Kubernetes cluster, without relying on expert knowledge about workload and cluster status. We implement a prototype of DRS in a five-node Kubernetes cluster and evaluate its performance. Experimental results show that DRS overcomes the shortcomings of Kube-scheduler and achieves the expected scheduling target under three workloads. Compared with Kube-scheduler, DRS improves resource utilization by 27.29% and reduces load imbalance by 2.90× on average, with only 3.27% CPU overhead and 0.648% communication latency.
03 Jan 2023 Submitted to Software: Practice and Experience
03 Jan 2023 Submission Checks Completed
03 Jan 2023 Assigned to Editor
10 Jan 2023 Review(s) Completed, Editorial Evaluation Pending
11 Jan 2023 Reviewer(s) Assigned
14 Mar 2023 Editorial Decision: Revise Major
13 Jun 2023 1st Revision Received
19 Jun 2023 Assigned to Editor
19 Jun 2023 Submission Checks Completed
19 Jun 2023 Review(s) Completed, Editorial Evaluation Pending
20 Jun 2023 Reviewer(s) Assigned
11 Aug 2023 Editorial Decision: Revise Minor
13 Sep 2023 2nd Revision Received
15 Sep 2023 Submission Checks Completed
15 Sep 2023 Assigned to Editor
15 Sep 2023 Review(s) Completed, Editorial Evaluation Pending
16 Sep 2023 Reviewer(s) Assigned
22 Sep 2023 Editorial Decision: Accept