STUDY OF THE EFFECTIVENESS OF THE MODIFIED METHOD OF AUTOMATED SEARCH FOR KEYWORDS IN TEXT
DOI:
https://doi.org/10.32689/maup.it.2024.1.4
Keywords:
keywords, performance analysis, text data processing, Python NLTK, Stanford classification
Abstract
As the volume of text data that people must process in almost every sphere of activity keeps growing, ensuring quick access to the necessary information becomes extremely important. To address this, existing search engines typically index data: special bots scan resources and try to identify the keywords associated with them. The relevance of the results returned to the user depends directly on how correctly those keywords are identified. This article discusses a modified method of automated keyword search in natural-language text data. The method is based on the analysis of complex syntactic relationships between words in the sentences of a text and can find key terms consisting of several words. The research objective is the software implementation and experimental study of the effectiveness of the modified method of automated keyword search in text data. Methodology of implementation. For testing, the modified method was implemented in Python using NLTK. Two sets of texts were chosen as the test dataset: short texts (up to 400 words) and longer texts (up to 2500 words). The method was compared with three popular analogues, each based on a different approach (machine learning, N-gram analysis, statistical analysis). To measure effectiveness quantitatively and compare it with the existing analogues, the article proposes using the absolute accuracy and Jaccard completeness metrics. Conclusions. The tests demonstrated that the proposed method outperforms the analogues in keyword search accuracy. As the volume of the texts increases, absolute accuracy increases in almost all cases, while Jaccard completeness decreases. Based on the test results, directions for further improvement of the proposed method are formulated.
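The abstract names the two metrics but does not define them, so the following is only a minimal sketch under stated assumptions: "Jaccard completeness" is taken to be the standard Jaccard index between the extracted and the reference keyword sets, and "absolute accuracy" is taken to be the share of extracted keywords that also appear in the reference set. The function names and the sample keyword lists are hypothetical and do not come from the article.

```python
def normalize(keywords):
    """Lowercase each keyword and collapse whitespace so matching ignores case and spacing."""
    return {" ".join(k.lower().split()) for k in keywords}


def jaccard_index(extracted, reference):
    """Assumed 'Jaccard completeness': |A & B| / |A | B| over normalized keyword sets."""
    a, b = normalize(extracted), normalize(reference)
    union = a | b
    if not union:
        return 0.0  # both sets empty: define the score as 0 to avoid division by zero
    return len(a & b) / len(union)


def accuracy(extracted, reference):
    """Assumed 'absolute accuracy': share of extracted keywords found in the reference set."""
    a, b = normalize(extracted), normalize(reference)
    if not a:
        return 0.0  # nothing extracted, so nothing can be correct
    return len(a & b) / len(a)


# Hypothetical data: output of an extractor vs. a manually prepared reference list.
extracted = ["keyword extraction", "Python NLTK", "search engine"]
reference = ["keyword extraction", "python nltk", "text data", "indexing"]

print(f"Jaccard index: {jaccard_index(extracted, reference):.2f}")
print(f"Accuracy:      {accuracy(extracted, reference):.2f}")
```

Under these assumed definitions, a method that returns more candidates for a longer text can raise the count of correct hits while the union of the two sets grows faster than their intersection, which is one way accuracy and the Jaccard score can move in opposite directions.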