| Form of presentation | Articles in Russian journals and collections |
| Year of publication | 2025 |
| Language | English |
|
Арабов Муллошараф Курбонович (M. K. Arabov), author
|
| Bibliographic description in the original language |
Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB. |
| Abstract |
The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language. |
| Keywords |
TAJIK LANGUAGE, LARGE LANGUAGE MODELS, LOW-RESOURCE LANGUAGES, CORPUS INFRASTRUCTURE, MORPHOLOGICAL RICHNESS, TOKENISATION, CODE-SWITCHING, LANGUAGE SAFETY, DETOXIFICATION, DIGITAL INEQUALITY |
| Journal title |
MODERN SCIENCE
|
| Please use this identifier to cite or link to this record |
https://repository.kpfu.ru/?p_id=323443 |
Full metadata record |
| DC field |
Value |
Language |
| dc.contributor.author |
Арабов Муллошараф Курбонович |
ru_RU |
| dc.date.accessioned |
2025-01-01T00:00:00Z |
ru_RU |
| dc.date.available |
2025-01-01T00:00:00Z |
ru_RU |
| dc.date.issued |
2025 |
ru_RU |
| dc.identifier.citation |
Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB. |
ru_RU |
| dc.identifier.uri |
https://repository.kpfu.ru/?p_id=323443 |
ru_RU |
| dc.description.abstract |
MODERN SCIENCE |
ru_RU |
| dc.description.abstract |
The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language. |
ru_RU |
| dc.language.iso |
en
ru_RU |
| dc.subject |
TAJIK LANGUAGE |
ru_RU |
| dc.subject |
LARGE LANGUAGE MODELS |
ru_RU |
| dc.subject |
LOW-RESOURCE LANGUAGES |
ru_RU |
| dc.subject |
CORPUS INFRASTRUCTURE |
ru_RU |
| dc.subject |
MORPHOLOGICAL RICHNESS |
ru_RU |
| dc.subject |
TOKENISATION |
ru_RU |
| dc.subject |
CODE-SWITCHING |
ru_RU |
| dc.subject |
LANGUAGE SAFETY |
ru_RU |
| dc.subject |
DETOXIFICATION |
ru_RU |
| dc.subject |
DIGITAL INEQUALITY |
ru_RU |
| dc.title |
Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment |
ru_RU |
| dc.type |
Articles in Russian journals and collections |
ru_RU |
|