  • Original research article
  • April 25, 2024
  • Open access

Using machine learning for the topic annotation of oral speech corpus texts

Abstract

This study evaluates how effective the thesaurus method is for compiling the list of topic classes used in machine-learning-based topic classification of sociolinguistic interview texts. More broadly, the paper considers the potential of machine learning for the topic annotation of linguistic corpus materials. The analyzed material is polytopical because it belongs to the genre of dialogical speech. The topics identified in a preliminary introspective analysis of the texts form a hierarchy that can be described with a thesaurus. The paper discusses the results of applying an unsupervised machine learning method with two sets of topic class names: the list of topics used in manual text annotation, and an extended list of micro-topics whose names were selected from a Russian-language thesaurus. The novelty of the paper lies in proposing, for the first time, the thesaurus method for selecting topic labels in the zero-shot classification of weakly structured Russian texts. The findings show that a more detailed lexical description of the topic classes improves the classification result.
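
The idea of comparing a coarse topic list with a thesaurus-extended list of micro-topics can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the topic names, the `THESAURUS` mapping, and the `toy_score` function are hypothetical and are not taken from the paper. In particular, `toy_score` is a simple word-match placeholder standing in for a real zero-shot model, so the sketch only shows how micro-topic labels might be rolled up to their parent topics, with several topics allowed per fragment to reflect polytopical dialogue.

```python
from collections import defaultdict

# Hypothetical two-level thesaurus: macro-topic -> micro-topic labels.
# These names are illustrative only and do not come from the paper.
THESAURUS = {
    "family": ["parents", "children", "marriage"],
    "work": ["profession", "salary", "colleagues"],
    "education": ["school", "university", "teacher"],
}


def toy_score(text, label):
    """Placeholder for a zero-shot model: 1.0 if the label occurs as a word."""
    tokens = text.lower().split()
    return 1.0 if label in tokens else 0.0


def classify(text, thesaurus, score=toy_score, threshold=0.5):
    """Return every macro-topic whose name or micro-topics reach the threshold."""
    best = defaultdict(float)
    for macro, micros in thesaurus.items():
        # Score the macro-topic name itself plus each of its micro-topics,
        # and keep the highest score as the evidence for the macro-topic.
        for label in [macro, *micros]:
            best[macro] = max(best[macro], score(text, label))
    return sorted(m for m, s in best.items() if s >= threshold)


fragment = "my parents wanted me to go to university but the salary there was low"

# Coarse list: only the macro-topic names are available as labels.
coarse_only = {macro: [] for macro in THESAURUS}
print(classify(fragment, coarse_only))  # → []

# Extended list: micro-topics from the thesaurus are added as labels.
print(classify(fragment, THESAURUS))  # → ['education', 'family', 'work']
```

In this toy setting the coarse labels match nothing, while the extended labels recover three topics for the same fragment. With a real zero-shot classifier, the `score` argument would be replaced by a call to the model; the aggregation by maximum is one possible design choice, not the method reported in the paper.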

Author information

Elena Nikolaevna Pogodaeva

Tomsk State University

About this article

Publication history

  • Received: February 20, 2024.
  • Published: April 25, 2024.

Keywords

  • linguistic corpus
  • machine learning
  • topic classification
  • data annotation
  • dialogical speech

Copyright

© 2024 The Author(s)
© 2024 Gramota Publishing, LLC

User license

Creative Commons Attribution 4.0 International (CC BY 4.0)