CrossLingual-Noised BackTranslation

Aslan Aghayev, Sergey Molodyakov

Abstract


Low-resource languages suffer from data scarcity, which limits the training of robust text classifiers. We introduce a cross-lingual, noise-injected augmentation pipeline for Azerbaijani that leverages the latent space of a higher-resource language. Starting from Azerbaijani, we translate into Turkish, encode with a multilingual encoder–decoder, inject token-level noise into the encoder states, decode in Turkish, and translate back to Azerbaijani to obtain diverse, fluent paraphrases. This process exploits the model's stronger expressiveness in Turkish while anchoring semantics across languages. On an Azerbaijani news dataset, training with the generated paraphrases improves accuracy and robustness and expands lexical and structural variety. Measured by cosine similarity and BLEU, the paraphrases preserve meaning while increasing diversity. The approach offers a scalable way to create diverse augmentations for low-resource text classification.
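
As a minimal sketch of this pipeline, the snippet below uses a single mBART-50 many-to-many checkpoint from HuggingFace Transformers for all three steps (Azerbaijani→Turkish translation, noised Turkish re-decoding, Turkish→Azerbaijani translation). The checkpoint name, the Gaussian perturbation with scale noise_std, and the decoding settings are illustrative assumptions, not the exact configuration used in the paper, which may rely on separate translation models for each direction.

```python
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"  # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_NAME)
model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)
model.eval()


def noised_translate(text, src_lang, tgt_lang, noise_std=0.0):
    """Translate src_lang -> tgt_lang, optionally perturbing encoder states."""
    tokenizer.src_lang = src_lang
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Run the encoder separately so its hidden states can receive
        # token-level Gaussian noise before decoding.
        encoder_outputs = model.get_encoder()(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        if noise_std > 0:
            encoder_outputs.last_hidden_state += noise_std * torch.randn_like(
                encoder_outputs.last_hidden_state
            )
        generated = model.generate(
            encoder_outputs=encoder_outputs,
            attention_mask=batch["attention_mask"],
            forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
            num_beams=4,
            max_length=128,
        )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


def augment(az_sentence, noise_std=0.1):
    """Azerbaijani -> Turkish -> noised Turkish paraphrase -> Azerbaijani."""
    # "az_AZ" and "tr_TR" are the mBART-50 language codes.
    turkish = noised_translate(az_sentence, "az_AZ", "tr_TR")
    paraphrase_tr = noised_translate(turkish, "tr_TR", "tr_TR", noise_std=noise_std)
    return noised_translate(paraphrase_tr, "tr_TR", "az_AZ")
```

Sampling several paraphrases per sentence with different noise draws, and tuning noise_std against cosine similarity and BLEU with respect to the source sentence, controls the diversity/fidelity trade-off described above.
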



