Digital Polygon of the MEPhI Higher Engineering School: Infrastructure Support for Educational and Practical Projects

G.G. Hajilov, A.V. Khatunov, T.A. Voloshin, M.G. Zhabitsky

Abstract


This article describes the concept, architecture, and trial-operation results of the "Digital Polygon" of the Higher Engineering School of the National Research Nuclear University MEPhI, a digital infrastructure that brings together education, research, and applied projects with industrial partners. It shows how the gap in expectations between academic and industrial environments, expressed in terms of technology readiness levels (TRL), translates into requirements for services, access policy, and observability. The authors propose a target architecture based on virtualization (Proxmox VE), infrastructure as code (Ansible), and a unified observability layer (Prometheus/Grafana, distributed tracing), together with operating procedures for educational-research and pilot-production scenarios. Two cases are demonstrated: local inference of large language models on multi-GPU workstations, and a microservice application testbed (≈15 services) with tracing and business metrics. After deployment, operational indicators improved: availability rose to ~99.5%, the number of incidents fell by ~78%, and the deployment time for typical services dropped to 1-2 hours; for the microservice testbed, the time to detect and the time to resolve failures (TTD/TTR) fell by 67% and 58%, respectively. The scientific and practical novelty of the work lies in combining a TRL-based approach to industry collaboration with a reproducible engineering template for a campus platform (virtualization + IaC + observability), and in demonstrating that the template is applicable to on-prem AI workloads. Further development steps are outlined: clustering, IAM/SSO unification, and a backup policy.
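To give a sense of scale for the reported indicators, the short Python sketch below converts an availability target into a downtime budget and a percentage reduction into the fraction of the baseline that remains. The function names and the 30-day and 365-day periods are illustrative assumptions, not taken from the article, and the article does not state the baseline values.

```python
def downtime_budget_hours(availability: float, period_hours: float) -> float:
    """Maximum downtime (in hours) permitted by an availability target.

    `availability` is a fraction, e.g. 0.995 for 99.5%.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


def remaining_fraction(reduction_percent: float) -> float:
    """Fraction of the baseline left after a reduction of `reduction_percent`."""
    if not 0.0 <= reduction_percent <= 100.0:
        raise ValueError("reduction_percent must be between 0 and 100")
    return 1.0 - reduction_percent / 100.0


if __name__ == "__main__":
    # Assumed reporting periods: a 30-day month and a 365-day year.
    month_hours = 30 * 24
    year_hours = 365 * 24

    print(f"Monthly budget at 99.5%: {downtime_budget_hours(0.995, month_hours):.1f} h")
    print(f"Yearly budget at 99.5%:  {downtime_budget_hours(0.995, year_hours):.1f} h")

    # Reported reductions for the microservice testbed (TTD 67%, TTR 58%).
    print(f"TTD relative to baseline: {remaining_fraction(67):.2f}")
    print(f"TTR relative to baseline: {remaining_fraction(58):.2f}")
```

Under these assumptions, ~99.5% availability corresponds to roughly 3.6 hours of permitted downtime per 30-day month, or about 43.8 hours per year, and the reported reductions leave TTD at about a third and TTR at about two fifths of their baseline values.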






ISSN: 2307-8162