Autonomous AI Agents for Apache Flink Pipeline Management on Kubernetes

Authors

  • Jyothish Sreedharan

DOI:

https://doi.org/10.22399/ijcesen.4998

Keywords:

Apache Flink Stream Processing, Kubernetes Orchestration, Reinforcement Learning Optimization, Autonomous Failure Recovery, Predictive Service-Level Agreement Enforcement

Abstract

Modern companies struggle to operate data pipelines that process millions of events per second while meeting strict performance requirements. Traditional approaches rely on fixed configurations and manual intervention, which break down when workloads change unexpectedly. This paper presents an AI-driven system that autonomously manages Apache Flink pipelines on Kubernetes. The system comprises three cooperating AI agents. The prediction agent forecasts problems before they occur with high accuracy, using a neural network that continuously processes multiple metrics. The recovery agent detects failures rapidly using isolation forests, autoencoders, and long short-term memory (LSTM) networks, and recovers from them automatically. The optimization agent uses reinforcement learning to adjust resources dynamically based on workload patterns. The system was evaluated on two publicly available benchmark datasets: the NYC Taxi Trip Record dataset, adapted for streaming scenarios, and the Yahoo Cloud Serving Benchmark dataset. In these tests, the AI-driven approach significantly reduced service-level agreement violations, substantially cut recovery time, and lowered infrastructure costs compared to manual management, while maintaining better performance. Together, the agents enable the system to operate autonomously, dramatically reducing manual interventions and operational overhead.
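
The division of labor among the three agents can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: the abstract does not specify interfaces, so every name below (Metrics, predict_latency, is_anomalous, choose_parallelism, control_step) is hypothetical. The learned models are also replaced by deliberately simple stand-ins, namely an exponentially weighted moving average in place of the neural network, a z-score test in place of the isolation forest, autoencoder, and LSTM detectors, and a threshold rule in place of the reinforcement learning policy.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Metrics:
    """One sample of pipeline metrics (field names are hypothetical)."""
    latency_ms: float
    events_per_sec: float


def predict_latency(history, alpha=0.5):
    """Prediction agent stand-in: EWMA forecast, not a neural network."""
    forecast = history[0].latency_ms
    for sample in history[1:]:
        forecast = alpha * sample.latency_ms + (1 - alpha) * forecast
    return forecast


def is_anomalous(history, latest, threshold=3.0):
    """Recovery agent stand-in: z-score test, not an isolation forest."""
    values = [sample.latency_ms for sample in history]
    spread = pstdev(values)
    if spread == 0:
        return latest.latency_ms != values[0]
    return abs(latest.latency_ms - mean(values)) / spread > threshold


def choose_parallelism(current, forecast_ms, sla_ms, max_parallelism=32):
    """Optimization agent stand-in: threshold rule, not an RL policy."""
    if forecast_ms > sla_ms:
        # Forecast breaches the SLA: scale out, capped at the maximum.
        return min(current * 2, max_parallelism)
    if forecast_ms < 0.5 * sla_ms and current > 1:
        # Comfortable headroom: release one unit of parallelism.
        return current - 1
    return current


def control_step(history, latest, parallelism, sla_ms):
    """Run one pass of the loop and return (action, new_parallelism)."""
    if is_anomalous(history, latest):
        # Recovery takes priority over optimization.
        return "restart_from_checkpoint", parallelism
    forecast = predict_latency(history + [latest])
    new_parallelism = choose_parallelism(parallelism, forecast, sla_ms)
    action = "rescale" if new_parallelism != parallelism else "hold"
    return action, new_parallelism


if __name__ == "__main__":
    past = [Metrics(100, 1000), Metrics(110, 1000),
            Metrics(90, 1000), Metrics(105, 1000)]
    print(control_step(past, Metrics(115, 1000), parallelism=4, sla_ms=100))
```

In this sketch the anomaly check runs first, so a suspected failure triggers recovery before any scaling decision is made. That ordering is a design assumption of the example and is not stated in the abstract.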

Published

2026-03-05

How to Cite

Jyothish Sreedharan. (2026). Autonomous AI Agents for Apache Flink Pipeline Management on Kubernetes. International Journal of Computational and Experimental Science and Engineering, 12(1). https://doi.org/10.22399/ijcesen.4998

Section

Research Article