AI-Driven Latency-Constrained Resource Management in Kubernetes

Team

Supervisors

Table of Contents

  1. Abstract
  2. Related works
  3. Methodology
  4. Experiment Setup and Implementation
  5. Results and Analysis
  6. Conclusion
  7. Publications
  8. Links

Abstract

Modern cloud-native applications often run on Kubernetes with conservative resource limits to avoid service degradation. However, this leads to substantial CPU and memory overprovisioning, increasing operational cost without proportional performance benefits. This research addresses the challenge of reducing Kubernetes resource limits and requests for microservices while keeping latency within acceptable Service-Level Objectives (SLOs), specifically allowing no more than a 20% increase over baseline latency. Through controlled experiments on multiple Java- and Go-based microservices, we analyze latency patterns under CPU-only, memory-only, and combined resource reductions. We discover service-specific resource sensitivity profiles, identify non-linear latency cliffs (Critical Reduction Points), and uncover the compounded behavior caused by simultaneous resource constraints. These insights lay the foundation for designing intelligent, SLA-aware resource tuners using machine learning, reinforcement learning, and Bayesian optimization. Our ultimate goal is to integrate this solution into platforms like Choreo to support dynamic, real-time resource tuning that is both cost-effective and performance-stable.

This work combines empirical workload characterization with intelligent control to achieve latency-preserving resource tuning.

Methodology

Workload Profiler

The Workload Profiler resides inside each pod and continuously monitors real-time metrics. These metrics are gathered via Prometheus, the Kubernetes API, and OpenTelemetry, and serve as the foundation for predicting future resource demands.
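
As a concrete illustration, the sketch below shows how such a per-pod collection loop might query the Prometheus HTTP API for CPU, memory, and tail-latency samples. The Prometheus address and the latency histogram name (`http_server_duration_seconds_bucket`) are illustrative assumptions, not the project's actual configuration.

```python
# Minimal sketch of the per-pod profiler's metric collection loop.
# Assumes a Prometheus server at PROM_URL and the standard cAdvisor
# container metrics; the latency histogram name is an assumption.
import time
import requests

PROM_URL = "http://prometheus.monitoring:9090"  # assumed in-cluster address

def query(promql: str) -> float:
    """Run an instant PromQL query and return the first sample value."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def profile_pod(pod: str, namespace: str = "default") -> dict:
    sel = f'{{pod="{pod}", namespace="{namespace}"}}'
    return {
        "cpu_cores": query(f"rate(container_cpu_usage_seconds_total{sel}[5m])"),
        "mem_bytes": query(f"container_memory_working_set_bytes{sel}"),
        # p95 latency from an OpenTelemetry-exported histogram (name assumed)
        "p95_latency_s": query(
            f"histogram_quantile(0.95, sum by (le) "
            f"(rate(http_server_duration_seconds_bucket{sel}[5m])))"
        ),
    }

if __name__ == "__main__":
    while True:  # continuous monitoring loop
        print(profile_pod("prime-verifier-7d9f"))  # hypothetical pod name
        time.sleep(30)
```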

Cold-Start Mitigation Module

This module is also deployed within each pod to reduce the delays associated with starting new pods. Using pre-trained models and incremental online learning, it predicts a pod's resource requirements before startup, enabling proactive resource pre-allocation. New pods can therefore become operational quickly, without service interruptions or performance degradation, and because resources are sized in advance, allocation time shrinks, cutting startup latency and improving overall system responsiveness.
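
A minimal sketch of the incremental-learning idea follows, assuming scikit-learn's `SGDRegressor` as the online model and an illustrative three-feature input; the real module's features and pre-trained weights are not specified here.

```python
# Sketch of an incremental predictor for cold-start mitigation.
# Feature choice is illustrative. Requires scikit-learn and numpy.
import numpy as np
from sklearn.linear_model import SGDRegressor

class ColdStartPredictor:
    """Predicts a new pod's CPU request (millicores) before it starts,
    and keeps learning online as observations arrive."""

    def __init__(self):
        # A pre-trained model would be loaded here in practice; we
        # initialize a fresh online regressor for illustration.
        self.model = SGDRegressor(learning_rate="adaptive", eta0=0.01)
        self.fitted = False

    def predict_cpu_millicores(self, features: np.ndarray) -> float:
        if not self.fitted:
            return 500.0  # conservative default before any training
        return float(self.model.predict(features.reshape(1, -1))[0])

    def observe(self, features: np.ndarray, actual_cpu: float) -> None:
        # Incremental update: one gradient step per new observation.
        self.model.partial_fit(features.reshape(1, -1), [actual_cpu])
        self.fitted = True

# Example features (hypothetical): [request_rate, payload_kb, replicas]
pred = ColdStartPredictor()
pred.observe(np.array([15.0, 2.0, 3.0]), actual_cpu=220.0)
print(pred.predict_cpu_millicores(np.array([18.0, 2.0, 3.0])))
```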

Dependency-Aware Scaling Module

The Dependency-Aware Scaling Module is deployed at the node level as part of the Custom Controller. It leverages distributed tracing tools such as Jaeger and OpenTelemetry to map inter-service dependencies and identify critical microservices. This module ensures that tightly coupled services receive prioritized resource allocations, preventing scaling actions from negatively affecting dependent services and maintaining system balance and efficiency.
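
The sketch below illustrates one way such dependency mapping could work: caller-to-callee edges extracted from trace spans form a directed graph, and PageRank on the reversed graph serves as a simple criticality score. The span tuples and service names are illustrative stand-ins for real Jaeger/OpenTelemetry data.

```python
# Sketch of dependency mapping from trace spans. Requires networkx.
import networkx as nx

# (caller_service, callee_service) edges extracted from trace spans
spans = [
    ("gateway", "prime-verifier"),
    ("gateway", "hash-generator"),
    ("prime-verifier", "echo"),
    ("hash-generator", "echo"),
]

graph = nx.DiGraph(spans)

# Services many others depend on get higher criticality; PageRank on
# the reversed graph is one simple proxy for "how many callers rely on me".
criticality = nx.pagerank(graph.reverse())

# Heavily depended-on services are prioritized when the controller
# decides whose resources to protect during scaling actions.
for service, score in sorted(criticality.items(), key=lambda kv: -kv[1]):
    print(f"{service}: {score:.3f}")
```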

Latency-Constrained Optimizer

Also part of the Custom Controller, the Latency-Constrained Optimizer uses metaheuristic algorithms such as Genetic Algorithms (GA) and Simulated Annealing (SA) to determine the optimal allocation of CPU and memory resources across the node. The optimizer ensures that scaling decisions do not increase latency by more than a given threshold, incorporating a latency prediction model to estimate the impact of resource changes before applying them. By constantly evaluating resource adjustments and their potential impact on overall system latency, the optimizer maintains system responsiveness while distributing resources to prevent bottlenecks.
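
A minimal simulated-annealing sketch of this idea follows. The latency model, cooling schedule, and the wiring of the 20% threshold are illustrative assumptions rather than the module's actual implementation.

```python
# Simulated-annealing sketch: minimize allocated CPU subject to an SLO
# that latency stays within 20% of baseline. `predict_latency` is a
# stand-in for the module's learned latency prediction model.
import math
import random

BASELINE_LATENCY = 100.0   # ms, per-service baseline (illustrative)
MAX_INCREASE = 0.20        # SLO: at most 20% above baseline

def predict_latency(cpu_millicores: float) -> float:
    # Placeholder latency model: latency rises as CPU shrinks.
    return BASELINE_LATENCY * (1000.0 / max(cpu_millicores, 1.0)) ** 0.5

def cost(cpu: float) -> float:
    latency = predict_latency(cpu)
    penalty = 0.0
    if latency > BASELINE_LATENCY * (1 + MAX_INCREASE):
        penalty = 1e6  # hard penalty: never trade the SLO for savings
    return cpu + penalty  # minimize CPU allocation subject to the SLO

def anneal(start: float, steps: int = 5000) -> float:
    current, best = start, start
    temp = 1.0
    for _ in range(steps):
        candidate = max(50.0, current + random.uniform(-50, 50))
        delta = cost(candidate) - cost(current)
        # Accept improvements always; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if delta < 0 or random.random() < math.exp(-delta / max(temp, 1e-9)):
            current = candidate
            if cost(current) < cost(best):
                best = current
        temp *= 0.999  # geometric cooling schedule
    return best

print(f"suggested CPU request: {anneal(1000.0):.0f} millicores")
```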

Reinforcement Learning-Based Decision Maker

This module operates at both the pod level and within the Custom Controller. At the pod level, a reinforcement learning model based on Deep Q-Networks (DQN) learns the optimal scaling policy from historical and real-time data. At the node level, the RL model aggregates feedback from individual pods and collaborates with the Latency-Constrained Optimizer to refine global scaling decisions. The reward function penalizes actions that lead to excessive latency, ensuring both resource efficiency and latency awareness in every decision, and the model continuously refines its policy as system conditions evolve to meet long-term performance goals.
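
The sketch below shows the shape such a reward signal might take: resource savings are rewarded, but any latency beyond the SLO incurs a penalty that grows with the violation. All weights and thresholds are illustrative, not the project's tuned values.

```python
# Sketch of the reward signal for the pod-level DQN agent. The key
# property from the text is that latency violations dominate any
# resource-saving reward; the constants here are illustrative.
SLO_LATENCY_MS = 120.0  # e.g. 100 ms baseline + 20% headroom (assumed)

def reward(latency_ms: float, cpu_request: float, cpu_limit_max: float) -> float:
    """Reward = resource savings, heavily penalized on SLO violation."""
    savings = 1.0 - cpu_request / cpu_limit_max  # in [0, 1]
    if latency_ms > SLO_LATENCY_MS:
        # Penalty scales with the size of the violation, so the agent
        # learns to back away from latency cliffs, not merely avoid them.
        return -10.0 * (latency_ms / SLO_LATENCY_MS - 1.0)
    return savings

# Staying just under the SLO with 40% savings beats violating it:
print(reward(latency_ms=115.0, cpu_request=600.0, cpu_limit_max=1000.0))  # 0.4
print(reward(latency_ms=180.0, cpu_request=400.0, cpu_limit_max=1000.0))  # -5.0
```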

Experiment Setup and Implementation

Services Tested:

  1. Prime Verifier (Java)
  2. Echo (Go)
  3. Hash Generator (Java)
  4. Password Generator (Java)

Load Generation:

Constant 10–20 requests/sec per service.
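
A simple constant-rate generator in this spirit might look like the following; the service URL and the use of Python's `requests` library are assumptions for illustration.

```python
# Constant-rate load generator matching the 10-20 req/s setup.
# The URL is illustrative. Requires the `requests` library.
import threading
import time
import requests

def fire(url: str) -> None:
    try:
        requests.get(url, timeout=5)
    except requests.RequestException:
        pass  # the real harness would record failures separately

def generate_load(url: str, rps: float, duration_s: float) -> None:
    """Issue requests at a fixed rate, each on its own thread so that
    slow responses do not distort the arrival rate."""
    interval = 1.0 / rps
    deadline = time.time() + duration_s
    while time.time() < deadline:
        threading.Thread(target=fire, args=(url,), daemon=True).start()
        time.sleep(interval)

# e.g. a constant 15 req/s against the Echo service for ten minutes
generate_load("http://echo.default.svc.cluster.local/ping", rps=15, duration_s=600)
```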

Metrics Collected:

Automation Tools:

Results and Analysis

Latency Profiles

CPU-bound Services

Memory-resilient Services

Note: Echo remained stable under all reductions.

Critical Reduction Points (CRPs)

Small additional reductions beyond the CRP caused exponential latency spikes.
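
As an illustration, a CRP can be read off a reduction sweep as the last step whose latency still satisfies the 20%-over-baseline SLO; the sweep data below is invented to show the cliff shape.

```python
# Locating a Critical Reduction Point from an experiment sweep: the CRP
# is the last reduction step that still meets the SLO (<= 1.2x baseline).
BASELINE_MS = 100.0
SLO_FACTOR = 1.20

# (cpu reduction %, observed p95 latency ms) -- illustrative values
sweep = [(0, 100.0), (10, 102.0), (20, 105.0), (30, 111.0),
         (40, 118.0), (50, 210.0), (60, 950.0)]  # cliff after 40%

def critical_reduction_point(samples):
    crp = 0
    for reduction, latency in samples:
        if latency <= BASELINE_MS * SLO_FACTOR:
            crp = reduction
        else:
            break  # latency cliff: stop at the first SLO violation
    return crp

print(f"CRP: {critical_reduction_point(sweep)}% CPU reduction")
# -> CRP: 40% CPU reduction; at 50% the latency roughly doubles
```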

Combined Reductions

Nonlinear and sometimes adaptive behavior was observed; for example, the JVM sometimes improved under tighter memory limits due to increased GC pressure.

Comparative Analysis

| Service | CPU-Only | Memory-Only | Combined |
| --- | --- | --- | --- |
| Prime Verifier | Gradual spike | Mild | Latency cliff |
| Echo (Go) | No change | No change | Very stable |
| Hash Generator | Immediate spikes | Flat → volatile | Chaotic |
| Password Generator | Step-wise spikes | Bursty | Unexpected dips |

Conclusion

This research reveals that Kubernetes resource tuning is neither linear nor generic. Different microservices have distinct performance profiles, and static autoscaling is insufficient. We demonstrate that intelligent resource tuning must account for service-specific sensitivity profiles, respect Critical Reduction Points, and anticipate the compounded effects of simultaneous CPU and memory constraints.

Our findings will power a machine learning–driven, SLA-aware resource optimizer that can be integrated into Kubernetes platforms like Choreo, enabling developers to reduce cost without compromising performance.

Publications