Development of an authentication method for running Apache Spark tasks in Apache Airflow

P.A. Golovnyak, K.S. Zaytsev, A.M. Pinchuk

Abstract


The number of payments made with plastic cards is growing significantly. Banking organizations therefore have to analyze huge volumes of data and are putting big data processing software into operation. Apache Spark is a popular tool for streaming and batch data processing. Data analysis often relies on software that runs processes on a schedule; Apache Airflow is one example. Processes launched in Airflow are customarily represented as a directed acyclic graph whose vertices are tasks, each an instance of a particular Airflow operator. The purpose of this article is to develop an Apache Airflow operator that runs Spark tasks on a server other than the one hosting Airflow. Developing such an operator raises the problem of authenticating on the remote server. The article proposes to solve it with access tokens issued by the Vault secret store, together with a web service developed by the authors that manages the secrets and tokens stored in Vault. The result is a solution that runs Spark tasks from Airflow on a remote server while keeping the secrets used for authentication in Vault, which improves the security of the system.
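
The flow outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `InMemorySecretStore` is a hypothetical stand-in for a Vault-like token store (it does not use the Vault API), `RemoteSparkSubmitTask` is a hypothetical operator-like class (a real operator would presumably subclass Airflow's `BaseOperator`), and the "remote server" is simulated by a local function. The role name, the `yarn` master, and the file name `job.py` are assumptions chosen for illustration.

```python
import secrets
import shlex
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class InMemorySecretStore:
    """Hypothetical stand-in for a Vault-like store that issues short-lived tokens."""
    ttl_seconds: float = 60.0
    _tokens: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def issue_token(self, role: str) -> str:
        # Create a random token bound to a role and an expiry time.
        token = secrets.token_hex(16)
        self._tokens[token] = (role, time.monotonic() + self.ttl_seconds)
        return token

    def validate(self, token: str, role: str) -> bool:
        # A token is valid only if it is known, matches the role, and has not expired.
        entry = self._tokens.get(token)
        if entry is None:
            return False
        stored_role, expires_at = entry
        return stored_role == role and time.monotonic() < expires_at


class RemoteSparkSubmitTask:
    """Hypothetical operator-like task (assumption: not Airflow's real operator API)."""

    def __init__(self, store: InMemorySecretStore, role: str,
                 application: str, master: str = "yarn") -> None:
        self.store = store
        self.role = role
        self.application = application
        self.master = master

    def build_command(self) -> List[str]:
        # The spark-submit invocation the remote side would run.
        return ["spark-submit", "--master", self.master, self.application]

    def execute(self, remote_runner: Callable[[str, List[str]], str]) -> str:
        # Obtain a fresh token at run time instead of storing a long-lived secret.
        token = self.store.issue_token(self.role)
        return remote_runner(token, self.build_command())


def make_remote_runner(store: InMemorySecretStore,
                       role: str) -> Callable[[str, List[str]], str]:
    """Simulated remote server: checks the token before accepting a command."""
    def run(token: str, command: List[str]) -> str:
        if not store.validate(token, role):
            raise PermissionError("invalid or expired token")
        return "accepted: " + shlex.join(command)
    return run
```

In this sketch the task requests a token only when it executes, so no credential is kept in the task definition itself; the simulated server rejects unknown or expired tokens with `PermissionError`.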

Full Text:

PDF (Russian)


ISSN: 2307-8162