KFU. Publication record. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment

DEVELOPING THE TAJIK LANGUAGE IN THE ERA OF LARGE LANGUAGE MODELS: CORPUS INFRASTRUCTURE, LINGUISTIC CHALLENGES, AND SAFETY ALIGNMENT

Publication type

Articles in Russian journals and collections

Year of publication

2025

Language

English

Authors affiliated with KFU

Arabov Mullosharaf Kurbonovich (Арабов Муллошараф Курбонович), author

Bibliographic description in the original language

Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB.

Abstract

The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language.

Keywords

TAJIK LANGUAGE, LARGE LANGUAGE MODELS, LOW-RESOURCE LANGUAGES, CORPUS INFRASTRUCTURE, MORPHOLOGICAL RICHNESS, TOKENISATION, CODE-SWITCHING, LANGUAGE SAFETY, DETOXIFICATION, DIGITAL INEQUALITY

Journal title

MODERN SCIENCE

Please use this identifier to cite or link to this record

https://repository.kpfu.ru/?p_id=323443

Resource files

File name	Size (MB)	Format
elibrary_87881993_55071084.pdf	0.17	pdf

Full metadata record

DC field	Value	Language
dc.contributor.author	Arabov Mullosharaf Kurbonovich (Арабов Муллошараф Курбонович)	ru_RU
dc.date.accessioned	2025-01-01T00:00:00Z	ru_RU
dc.date.available	2025-01-01T00:00:00Z	ru_RU
dc.date.issued	2025	ru_RU
dc.identifier.citation	Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB.	ru_RU
dc.identifier.uri	https://repository.kpfu.ru/?p_id=323443	ru_RU
dc.description.abstract	MODERN SCIENCE	ru_RU
dc.description.abstract	The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language.	ru_RU
dc.language.iso	ru	ru_RU
dc.subject	TAJIK LANGUAGE	ru_RU
dc.subject	LARGE LANGUAGE MODELS	ru_RU
dc.subject	LOW-RESOURCE LANGUAGES	ru_RU
dc.subject	CORPUS INFRASTRUCTURE	ru_RU
dc.subject	MORPHOLOGICAL RICHNESS	ru_RU
dc.subject	TOKENISATION	ru_RU
dc.subject	CODE-SWITCHING	ru_RU
dc.subject	LANGUAGE SAFETY	ru_RU
dc.subject	DETOXIFICATION	ru_RU
dc.subject	DIGITAL INEQUALITY	ru_RU
dc.title	Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment	ru_RU
dc.type	Articles in Russian journals and collections	ru_RU