Designing a Hybrid Execution Strategy: Balancing Full Restarts and Partial Recovery in Distributed Frameworks
DOI:
https://doi.org/10.22399/ijcesen.3688
Keywords:
distributed fault tolerance, hybrid recovery mechanisms, adaptive threshold systems, DAG-based computation, checkpoint coordination
Abstract
Large-scale distributed data processing systems now carry a large share of enterprise computational workloads, and they need fault tolerance mechanisms that balance recovery speed, resource efficiency, and system reliability at runtime. This article presents a hybrid execution strategy that combines full restart and partial recovery within DAG-based computation frameworks. Adaptive thresholds and execution heuristics select the fault handling approach based on runtime characteristics, job properties, and system conditions. The framework evaluates thresholds along several dimensions (task count, data volume, execution parallelism, and failure patterns) to choose a recovery strategy for workloads ranging from compute-intensive applications to complex shuffle operations. In the performance evaluation, the hybrid strategy reduces recovery time and improves resource utilization on both synthetic and production workloads, while maintaining strong consistency guarantees through coordination protocols and distributed snapshot mechanisms. The implementation discussion addresses state consistency management, distributed coordination, and system integration, and offers practical guidance for deploying fault tolerance in distributed computing platforms that require both high throughput and high reliability.
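As an illustration only, the threshold-based choice between full restart and partial recovery described in the abstract might be sketched as follows. The abstract does not give the article's actual thresholds or decision rules, so every name here (`JobSnapshot`, `Thresholds`, `choose_recovery`), every default value, and the order of the checks is a hypothetical assumption, not the authors' method.

```python
import math
from dataclasses import dataclass
from enum import Enum


class Recovery(Enum):
    FULL_RESTART = "full_restart"
    PARTIAL_RECOVERY = "partial_recovery"


@dataclass(frozen=True)
class JobSnapshot:
    """Hypothetical runtime view of a DAG job at the moment a failure is seen."""
    total_tasks: int
    failed_tasks: int
    completed_bytes: int   # output already produced by finished tasks
    total_bytes: int       # expected output of the whole job
    parallelism: int       # tasks that can run concurrently
    recent_failures: int   # failures observed in a recent time window


@dataclass(frozen=True)
class Thresholds:
    """Illustrative defaults only; real values would be tuned per workload."""
    min_tasks_for_partial: int = 10
    max_failed_fraction: float = 0.5
    min_completed_fraction: float = 0.1
    max_recent_failures: int = 5


def choose_recovery(job: JobSnapshot, t: Thresholds = Thresholds()) -> Recovery:
    # Task count: tiny jobs are cheap to rerun from scratch.
    if job.total_tasks < max(t.min_tasks_for_partial, 1):
        return Recovery.FULL_RESTART

    # Failure pattern: a burst of failures suggests a systemic problem.
    if job.recent_failures > t.max_recent_failures:
        return Recovery.FULL_RESTART

    # Failure extent: if most tasks failed, little work can be salvaged.
    if job.failed_tasks / job.total_tasks > t.max_failed_fraction:
        return Recovery.FULL_RESTART

    # Data volume: partial recovery pays off only if enough output survives.
    completed = job.completed_bytes / job.total_bytes if job.total_bytes else 0.0
    if completed < t.min_completed_fraction:
        return Recovery.FULL_RESTART

    # Parallelism: if rerunning the failed tasks needs as many scheduling
    # waves as rerunning everything, partial recovery saves no wall time.
    slots = max(job.parallelism, 1)
    if math.ceil(job.failed_tasks / slots) >= math.ceil(job.total_tasks / slots):
        return Recovery.FULL_RESTART

    return Recovery.PARTIAL_RECOVERY
```

With these assumed defaults, a 100-task job that has lost 3 tasks and kept most of its output is routed to partial recovery, while a 4-task job or a job seeing a burst of recent failures is restarted in full. An adaptive variant would update the `Thresholds` values from observed recovery times rather than fixing them up front.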
License
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.