Simulation Modeling of a Fault-Tolerant Computing Cluster with Two-Level Load Balancing and Container Virtualization

Vagif Mamedov; Vladimir Bogatyrev; Manh Kiem Do

Abstract

This paper investigates the fault tolerance of a computing cluster with container virtualization and a dedicated load balancer. A discrete-event simulation model is proposed and implemented in Python/SimPy. It treats the load balancer as a vulnerable single point of failure (SPOF, Level 0) in front of a pool of worker nodes with containers (Level 1). Unlike classical queueing models, each server passes through a realistic hardware/software aging cascade UP → DEGRADE → DOWN, and a host-node failure physically blocks both its containers and their migration. The container service rate is derived dynamically from an experimentally measured performance matrix μ(n, m). To capture the true delay profile, the architecture combines infinite persistent queues with strict rejection of new requests during hardware downtime. The software implementation is verified against the analytical M/M/c model in a no-failure regime, with a relative error in mean waiting time below 6%. Tail characteristics are computed over the pooled sample of all runs (pooled metric), which eliminates the systematic bias introduced by averaging per-run percentiles. Four series of experiments are conducted (10 Monte Carlo runs each), covering the influence of MTTF_fail under three cascade-aging profiles p ∈ {0.7, 0.8, 0.9}, the influence of MTTR, balancer routing policies, and arrival intensity λ ∈ [10, 30] tasks/s. The key result is that accounting for hidden degradation of the SPOF node critically shifts the cluster stability boundary. A saturation threshold λ* ≈ 22–26 tasks/s is observed: over the studied range, the pooled 95th percentile of waiting time W₀.₉₅ grows from 0.11 s at λ = 10 tasks/s to 109 s at λ = 30 tasks/s. The model source code and experiments are published in an open repository. The proposed model and the obtained data form a foundation for the synthesis of predictive routing policies and proactive cluster resource management.
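The two statistical checks named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the nearest-rank percentile definition, and the sample numbers are assumptions made for illustration. The first part computes the analytical M/M/c mean waiting time in queue from the standard Erlang C formula, which is the kind of reference value a no-failure simulation run can be compared against. The second part contrasts a pooled percentile with the mean of per-run percentiles.

```python
import math


def erlang_c(c: int, a: float) -> float:
    """Probability that an arriving task must wait in an M/M/c queue.

    c is the number of servers, a = lam / mu is the offered load (a < c).
    """
    if a >= c:
        raise ValueError("unstable queue: offered load must be below c")
    below_c = sum(a**k / math.factorial(k) for k in range(c))
    at_c = a**c / math.factorial(c) * c / (c - a)
    return at_c / (below_c + at_c)


def mmc_mean_wait(lam: float, mu: float, c: int) -> float:
    """Analytical mean waiting time in queue (excluding service) for M/M/c."""
    return erlang_c(c, lam / mu) / (c * mu - lam)


def relative_error(simulated: float, analytical: float) -> float:
    """Relative deviation of a simulated value from the analytical one."""
    return abs(simulated - analytical) / analytical


def percentile(sample, q: float) -> float:
    """Nearest-rank percentile (an assumed definition), 0 < q <= 1."""
    ordered = sorted(sample)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def pooled_percentile(runs, q: float) -> float:
    """Percentile over the concatenation of all runs (pooled metric)."""
    pooled = [w for run in runs for w in run]
    return percentile(pooled, q)


def mean_of_run_percentiles(runs, q: float) -> float:
    """Average of per-run percentiles, the estimator the pooled metric replaces."""
    return sum(percentile(run, q) for run in runs) / len(runs)


# Hypothetical waiting-time samples: one long calm run and one short bad run.
example_runs = [[1.0] * 10, [100.0]]
print(pooled_percentile(example_runs, 0.5))
print(mean_of_run_percentiles(example_runs, 0.5))
print(mmc_mean_wait(lam=1.0, mu=1.0, c=2))
```

In the toy data, the short run carries the same weight as the long one when per-run percentiles are averaged, whereas pooling weights every observed task equally. This is one way the two estimators can diverge when runs differ in length or in the number of completed tasks.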

Full Text:

PDF (Russian)



ISSN: 2307-8162

International Journal of Open Information Technologies