  • Original research article
  • May 17, 2023
  • Open access

Building a linguistic corpus based on natural language processing tools: Planning software solutions

Abstract

The paper aims to build a model of a linguistic corpus generated according to the rules of the spaCy natural language processing library. The scientific novelty lies in applying the modelling method within humanities research, combining it with a corpus approach and taking the technological (software) component into account from the goal-setting stage onward. First, a general structural model of a linguistic corpus as a sequence of blocks was defined and standard database queries were formulated. Second, a model of a corpus manager interface capable of executing these standard queries was built. Third, the proposed model was analysed using mini-programs that assess the technical feasibility of the queries and their practical value. At this stage, the linguistic material comprised the texts of works of fiction by German-language (F. Kafka, E. M. Remarque) and English-language (A. C. Doyle, G. Orwell) writers. The results showed that the constructed model has a number of advantages and few disadvantages, is flexible with respect to further development, and can be implemented in software in the short term.
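
The idea of a corpus stored as a sequence of blocks and explored through standard database queries can be illustrated with a minimal sketch. It is not the author's implementation. Several things here are assumptions: a block is treated as a sentence-like unit, the database is an in-memory SQLite store, and the annotations are written by hand rather than produced by spaCy. The column names only mirror spaCy's Token.text, Token.lemma_ and Token.pos_ attributes, and the function names are hypothetical.

```python
import sqlite3

# Hand-written annotations standing in for spaCy output (an assumption).
# Each block is (block_id, [(text, lemma, pos), ...]); the three fields
# mirror spaCy's Token.text, Token.lemma_ and Token.pos_.
SAMPLE_BLOCKS = [
    (1, [("He", "he", "PRON"), ("opened", "open", "VERB"),
         ("the", "the", "DET"), ("door", "door", "NOUN"),
         (".", ".", "PUNCT")]),
    (2, [("The", "the", "DET"), ("doors", "door", "NOUN"),
         ("were", "be", "AUX"), ("open", "open", "ADJ"),
         (".", ".", "PUNCT")]),
]


def build_corpus(blocks):
    """Store annotated blocks in an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens ("
        "block_id INTEGER, position INTEGER, "
        "text TEXT, lemma TEXT, pos TEXT)"
    )
    for block_id, tokens in blocks:
        conn.executemany(
            "INSERT INTO tokens VALUES (?, ?, ?, ?, ?)",
            [(block_id, i, text, lemma, pos)
             for i, (text, lemma, pos) in enumerate(tokens)],
        )
    conn.commit()
    return conn


def find_by_lemma(conn, lemma, pos=None):
    """Return (block_id, position, text) for every token with the given lemma.

    If pos is given, only tokens with that part-of-speech tag are returned.
    """
    sql = "SELECT block_id, position, text FROM tokens WHERE lemma = ?"
    params = [lemma]
    if pos is not None:
        sql += " AND pos = ?"
        params.append(pos)
    sql += " ORDER BY block_id, position"
    return conn.execute(sql, params).fetchall()


def pos_frequencies(conn):
    """Count tokens per part-of-speech tag across the whole corpus."""
    rows = conn.execute(
        "SELECT pos, COUNT(*) FROM tokens GROUP BY pos"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    corpus = build_corpus(SAMPLE_BLOCKS)
    print(find_by_lemma(corpus, "door"))
    print(find_by_lemma(corpus, "open", pos="VERB"))
    print(pos_frequencies(corpus))
```

In a fuller version, the hand-written tuples would presumably be replaced by tokens taken from a spaCy pipeline for German or English, and the two query functions would sit behind the corpus manager interface.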

Author information

Alexey Ivanovich Gorozhanov

Dr

Moscow State Linguistic University

About this article

Publication history

  • Received: March 17, 2023.
  • Published: May 17, 2023.

Keywords

  • modelling
  • corpus linguistics
  • corpus manager
  • graphical user interface
  • spaCy

Copyright

© 2023 The Author(s)
© 2023 Gramota Publishing, LLC

User license

Creative Commons Attribution 4.0 International (CC BY 4.0)