Multi-Tenant AI Workload Scheduling on Kubernetes: Addressing Modern Cloud Computing Challenges

Authors

  • Anuj Harishkumar Chaudhari

DOI:

https://doi.org/10.22399/ijcesen.3931

Keywords:

Multi-Tenant Scheduling, Container Orchestration, Artificial Intelligence Workloads, GPU Resource Management, Fairness Optimization, Distributed Machine Learning

Abstract

Scheduling artificial intelligence workloads on multi-tenant container orchestration platforms is a challenging problem that goes well beyond classic microservices deployment. Existing scheduling mechanisms have inherent limitations when managing machine learning workloads that depend on specialized hardware accelerators, exhibit irregular consumption profiles, and require simultaneous resource allocation across distributed compute nodes. The intersection of containerized computing and artificial intelligence has produced scheduling environments in which fairness, efficiency, and performance predictability must be optimized simultaneously across diverse tenant requirements. Techniques such as gang scheduling, topology-aware placement, and predictive resource management have emerged as key solutions to the resource heterogeneity, communication overhead, and fairness violations that afflict conventional scheduling methods. Frameworks that combine workload classification, fairness engines, and topology optimization show significant improvements in cluster utilization while maintaining service level agreement adherence for latency-sensitive inference tasks. Experimental results show substantial reductions in job completion times, fairer resource allocation among tenants, and higher GPU utilization through placement decisions that account for both real-time resource demands and longer-term organizational goals. The architectural solutions presented address key challenges of present-day cloud-native AI deployments and offer scalable frameworks for increasingly complex multi-tenant computing environments.
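
The abstract names gang scheduling and fairness engines among the core mechanisms. As a rough illustration of how those two ideas compose, the Python sketch below admits distributed jobs all-or-nothing (gang scheduling) and orders tenants by Dominant Resource Fairness (DRF). It is a minimal sketch under assumed toy capacities, not the paper's implementation; every class, tenant name, and number in it is hypothetical.

```python
# Illustrative sketch -- NOT the paper's implementation. Combines gang
# scheduling (all-or-nothing placement of a distributed job) with Dominant
# Resource Fairness (DRF) ordering across tenants. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Job:
    tenant: str
    pods: int          # gang size: place every pod or none at all
    gpu_per_pod: int
    cpu_per_pod: int

@dataclass
class Cluster:
    free_gpu: int
    free_cpu: int
    total_gpu: int
    total_cpu: int

def dominant_share(used_gpu: int, used_cpu: int, cluster: Cluster) -> float:
    # DRF: a tenant's share is its largest per-resource share.
    return max(used_gpu / cluster.total_gpu, used_cpu / cluster.total_cpu)

def schedule(jobs: list[Job], cluster: Cluster):
    used: dict[str, tuple[int, int]] = {}   # tenant -> (gpu, cpu) in use
    placed, queued = [], list(jobs)
    progress = True
    while progress:
        progress = False
        # DRF ordering: favor the tenant with the smallest dominant share.
        queued.sort(key=lambda j: dominant_share(*used.get(j.tenant, (0, 0)),
                                                 cluster))
        for job in queued:
            need_gpu = job.pods * job.gpu_per_pod
            need_cpu = job.pods * job.cpu_per_pod
            # Gang condition: admit only if the whole job fits at once, so a
            # distributed job never deadlocks on a partial allocation.
            if need_gpu <= cluster.free_gpu and need_cpu <= cluster.free_cpu:
                cluster.free_gpu -= need_gpu
                cluster.free_cpu -= need_cpu
                g, c = used.get(job.tenant, (0, 0))
                used[job.tenant] = (g + need_gpu, c + need_cpu)
                placed.append(job)
                queued.remove(job)
                progress = True
                break   # shares changed; re-sort before the next admission
    return placed, queued   # queued jobs wait rather than run partially

if __name__ == "__main__":
    cluster = Cluster(free_gpu=16, free_cpu=128, total_gpu=16, total_cpu=128)
    jobs = [Job("tenant-a", pods=4, gpu_per_pod=2, cpu_per_pod=8),
            Job("tenant-b", pods=2, gpu_per_pod=4, cpu_per_pod=16),
            Job("tenant-a", pods=8, gpu_per_pod=2, cpu_per_pod=8)]
    placed, waiting = schedule(jobs, cluster)
    print(f"placed={len(placed)} waiting={len(waiting)}")
```

On Kubernetes itself, gang semantics of this kind are usually obtained from an existing scheduler rather than hand-rolled: for example, the coscheduling plugin and its PodGroup resource in the kubernetes-sigs scheduler-plugins project, or a batch scheduler such as Volcano.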

Published

2025-09-21

How to Cite

Anuj Harishkumar Chaudhari. (2025). Multi-Tenant AI Workload Scheduling on Kubernetes: Addressing Modern Cloud Computing Challenges. International Journal of Computational and Experimental Science and Engineering, 11(3). https://doi.org/10.22399/ijcesen.3931

Issue

Vol. 11 No. 3 (2025)

Section

Research Article