Kazan (Volga Region) Federal University, KFU
 
DEVELOPING THE TAJIK LANGUAGE IN THE ERA OF LARGE LANGUAGE MODELS: CORPUS INFRASTRUCTURE, LINGUISTIC CHALLENGES, AND SAFETY ALIGNMENT
Form of presentation: Articles in Russian journals and collections
Year of publication: 2025
Language: English
  • Arabov Mullosharaf Kurbonovich, author
  • Bibliographic description in the original language: Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB.
    Abstract: The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language.
    Keywords: TAJIK LANGUAGE, LARGE LANGUAGE MODELS, LOW-RESOURCE LANGUAGES, CORPUS INFRASTRUCTURE, MORPHOLOGICAL RICHNESS, TOKENISATION, CODE-SWITCHING, LANGUAGE SAFETY, DETOXIFICATION, DIGITAL INEQUALITY
    Journal title: MODERN SCIENCE
    Please use this identifier to cite or link to this record: https://repository.kpfu.ru/?p_id=323443
    Resource files
    File name: elibrary_87881993_55071084.pdf; Size (MB): 0.17; Format: pdf
