The Role of Chaos Engineering in Enhancing System Resilience and Reliability in Modern Distributed Architectures
DOI:
https://doi.org/10.22399/ijcesen.3885
Keywords:
Chaos Engineering, System Resilience, Distributed Systems, Microservices Architecture, Fault Injection Testing, Reliability Engineering
Abstract
The transition from monolithic systems to microservices and cloud-native architectures has reshaped software development, offering greater agility, scalability, and fault isolation. This shift has also made distributed systems more complex and fragile: interdependent services are more exposed to partial failures, cascading effects, and unpredictable behavior. Traditional testing methods, including unit and integration testing, are often inadequate for uncovering hidden failure modes under real-world, high-stress scenarios. In this context, Chaos Engineering has emerged as a critical methodology for improving system resilience and reliability. Pioneered by companies like Netflix, Chaos Engineering involves deliberately introducing faults into production or staging environments to evaluate how systems respond to turbulent conditions. By simulating outages, latency spikes, or infrastructure failures, the practice enables teams to discover vulnerabilities, validate recovery mechanisms, and confirm the effectiveness of fail-safes such as circuit breakers and retry logic. Despite growing awareness of its benefits, many organizations still rely heavily on reactive techniques such as monitoring tools and post-incident reviews. These methods often fall short in preventing failures, particularly those arising from unknown or emergent behaviors in large-scale systems. There is therefore an urgent need to incorporate proactive resilience strategies such as Chaos Engineering into the software development life cycle. This study explores the principles of Chaos Engineering, its alignment with resilience engineering, and its practical implementation across industries. It evaluates the impact of chaos experiments on system performance, incident response, and organizational culture. Through real-world case studies, the paper highlights both the benefits and the challenges of adopting this approach. Ultimately, it emphasizes the need to shift from reactive firefighting to proactive reliability assurance, thereby strengthening the foundations of modern distributed systems.
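As a minimal illustration of the kind of experiment the abstract describes, the following Python sketch injects faults into a simulated dependency and checks that a circuit breaker fails fast once the dependency becomes unhealthy. It is not taken from the paper: the names `make_flaky_service`, `CircuitBreaker`, and `run_experiment` are hypothetical, the dependency is an in-process stand-in for a remote service, and the breaker is deliberately simplified (it has no half-open state or recovery timeout).

```python
import random


class DependencyError(Exception):
    """Raised by the simulated dependency when a fault is injected."""


class CircuitOpenError(Exception):
    """Raised by the breaker when it refuses to call the dependency."""


def make_flaky_service(failure_rate, rng):
    """Return a callable that fails with probability `failure_rate` (the injected fault)."""
    def service():
        if rng.random() < failure_rate:
            raise DependencyError("injected fault")
        return "ok"
    return service


class CircuitBreaker:
    """Hypothetical breaker: opens after `threshold` consecutive failures, then fails fast."""

    def __init__(self, func, threshold=3):
        self.func = func
        self.threshold = threshold
        self.consecutive_failures = 0
        self.is_open = False

    def call(self):
        if self.is_open:
            raise CircuitOpenError("circuit open")
        try:
            result = self.func()
        except DependencyError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.is_open = True
            raise
        # A successful call resets the failure streak.
        self.consecutive_failures = 0
        return result


def run_experiment(failure_rate, calls=50, seed=0):
    """Inject faults at `failure_rate` and count how each call ended."""
    rng = random.Random(seed)
    breaker = CircuitBreaker(make_flaky_service(failure_rate, rng))
    outcomes = {"ok": 0, "failed": 0, "rejected": 0}
    for _ in range(calls):
        try:
            breaker.call()
            outcomes["ok"] += 1
        except DependencyError:
            outcomes["failed"] += 1
        except CircuitOpenError:
            outcomes["rejected"] += 1
    return outcomes
```

Calling `run_experiment(1.0)` simulates a full outage: the first three calls fail, the breaker opens, and the remaining calls are rejected without reaching the dependency. A real chaos experiment would state this expectation in advance as a hypothesis about steady-state behavior and then compare it with the observed outcome.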
License
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.