Classification of the Tajik Language Words for Their Generation and Recognition in Natural Language Processing
Abstract
Natural language processing remains relevant despite the emergence of large language models. Natural language is processed at the morphological, syntactic, and semantic levels. At the morphological level, the tasks are word-form generation and recognition, which are solved with a universal method for generating and recognizing word-forms. The universality of this method has been demonstrated for languages of various families and groups. To apply the method, the words of a natural language must first be classified by their word-forming types. The purpose of this study is to automate Tajik word-form generation and recognition using the universal method. We classify numerals, adverbs, and participles of the Tajik language. The 46 numerals fall into 3 types; numerals have from 88 to 91 forms. The 289 adverbs also fall into 3 types; adverbs have from 94 to 97 forms. The 21 participles likewise fall into 3 types; participles have 80 or 81 forms. Each type has attributes that allow words to be assigned to it. Nouns, verbs, adjectives, and pronouns were classified previously; the types of these parts of speech, their characteristics, and their attributes are summarized in the article. The article presents the final results of a five-year study of word-formation types in the Tajik language. The classification was used in developing the Ibora web application (http://ibora.su) for generating word-forms.
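The figures reported in the abstract can be tabulated in a few lines of Python. This is only a bookkeeping sketch: the class and function names are hypothetical, and the totals are derived here rather than stated in the article. The bounds on the number of word-forms additionally assume that the stated ranges apply to each individual word, which the abstract does not say explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartOfSpeechSummary:
    """One row of figures taken from the abstract (names are illustrative)."""
    name: str
    words: int      # number of words classified in the study
    types: int      # number of word-forming types
    min_forms: int  # smallest reported number of forms
    max_forms: int  # largest reported number of forms


# Values copied from the abstract; nothing here comes from the full text.
SUMMARY = [
    PartOfSpeechSummary("numeral", 46, 3, 88, 91),
    PartOfSpeechSummary("adverb", 289, 3, 94, 97),
    PartOfSpeechSummary("participle", 21, 3, 80, 81),
]


def total_words(rows):
    """Total number of classified words across the listed parts of speech."""
    return sum(r.words for r in rows)


def total_types(rows):
    """Total number of word-forming types across the listed parts of speech."""
    return sum(r.types for r in rows)


def form_bounds(rows):
    """Lower and upper bounds on the number of generated word-forms.

    ASSUMPTION: the form ranges are per word, so every word of a part of
    speech has between min_forms and max_forms forms.
    """
    low = sum(r.words * r.min_forms for r in rows)
    high = sum(r.words * r.max_forms for r in rows)
    return low, high


if __name__ == "__main__":
    print(total_words(SUMMARY), total_types(SUMMARY), form_bounds(SUMMARY))
```

Under that per-word assumption, the 356 words in these three parts of speech would yield between 32,894 and 33,920 word-forms, counting coinciding forms separately.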
ISSN: 2307-8162