DOCUMENT CLASSIFICATION VIA AUGMENTATION OF DOCUMENT EMBEDDINGS WITH GRAPH EMBEDDINGS OF SYNONYMS DICTIONARY

Authors

Shaptala, R., & Kyselov, G.

DOI:

https://doi.org/10.32689/maup.it.2022.3.6

Keywords:

natural language processing, vector embeddings, document classification, mathematical model, machine learning, neural networks, low-resource

Abstract

The article assesses how augmenting vector representations of documents with graph representations of the elements of a synonym dictionary affects the quality of classification of these documents in a low-resource environment. Studying such environments is an important task, because most of the world's languages, as well as highly specialized application areas, fall under this criterion: there is not enough data to build and train modern, powerful machine learning models. The main goal of this article is to improve the quality of document classification in a low-resource environment by augmenting documents with information from a synonym dictionary, obtained by encoding the dictionary. The research was carried out through the analysis and use of modern developments in mathematical modeling, machine learning, natural language processing, and data science. The scientific novelty of the work is a vector model of words from the synonym dictionary that, unlike existing models, operates on representations of individual nodes of the dictionary graph and can therefore be reused in other text data processing tasks. Transfer learning supports such reuse, since dense vector representations can be combined within neural network methods. At the same time, the choice of the method for building vector representations of the synonym dictionary directly affects the quality of the results, as well as the speed and hardware requirements of their use. The work also presents a set of preprocessing steps and a method for converting the dictionary into a graph for modeling. In conclusion, the article shows that the proposed model increases the F1-score of document classification in a low-resource environment by 2–3%, using the classification of petitions to the Kyiv City Council by topic as an example. The highest quality gain was obtained with the Node2Vec method of constructing graph vector representations, which is based on random walks and does not require large training resources.
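
As an illustration of the described pipeline, the Python sketch below shows one way the idea can be put together; it is not the authors' implementation. The toy synonym dictionary, the toy petition-like corpus, the TF-IDF document vectors, and the logistic-regression classifier are assumptions chosen for brevity, and the node2vec and networkx packages stand in for whatever graph-embedding tooling the authors used. The dictionary is converted into a graph, Node2Vec node embeddings are trained on it, averaged per document, and concatenated with the document vectors before classification; a real pipeline would also apply the preprocessing steps mentioned above (e.g. lemmatization) so that document tokens match dictionary nodes.

import numpy as np
import networkx as nx
from node2vec import Node2Vec                               # pip install node2vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 1. Synonym dictionary -> graph: every word is a node, every synonym pair is an edge.
synonym_dict = {"будинок": ["дім", "оселя"], "дорога": ["шлях", "путь"]}  # toy entries
graph = nx.Graph()
for word, synonyms in synonym_dict.items():
    for syn in synonyms:
        graph.add_edge(word, syn)

# 2. Node2Vec embeddings of the dictionary nodes (random-walk based, cheap to train).
node_model = Node2Vec(graph, dimensions=32, walk_length=20, num_walks=50,
                      workers=1).fit(window=5, min_count=1)

def graph_vector(text, dim=32):
    # Average the Node2Vec vectors of the document tokens that occur in the graph.
    vecs = [node_model.wv[tok] for tok in text.split() if tok in node_model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# 3. Augment document embeddings (here TF-IDF) with the dictionary-graph vectors.
documents = ["ремонт дорога біля будинок", "облаштування шлях до оселя",   # toy stand-in
             "освітлення у двір дім", "нова дорога та тротуар",            # for the petition
             "паркування біля дім", "ремонт шлях у двір"]                  # corpus (lemmatized)
labels = [0, 1, 0, 1, 0, 1]                                                # toy topic labels
doc_vectors = TfidfVectorizer().fit_transform(documents).toarray()
aug_vectors = np.hstack([doc_vectors,
                         np.vstack([graph_vector(d) for d in documents])])

# 4. Train a classifier on the augmented representations and report macro F1.
X_tr, X_te, y_tr, y_te = train_test_split(aug_vectors, labels,
                                          test_size=0.34, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))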

References

Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., & Klakow, D. (2021). A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 2545–2568).

Perez, L., & Wang, J. (2017). The Effectiveness of Data Augmentation in Image Classification using Deep Learning. Retrieved from https://arxiv.org/abs/1712.04621.

Eskander, R., Muresan, S., & Collins, M. (2020). Unsupervised Cross-Lingual Part-of-Speech Tagging for Truly Low-Resource Scenarios. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4820–4831). Association for Computational Linguistics (ACL). https://doi.org/10.18653/V1/2020.EMNLP-MAIN.391.

Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (pp. 1003–1011). Association for Computational Linguistics. Retrieved from https://aclanthology.org/P09-1113.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12, 2493–2537. https://doi.org/10.5555/1953048.2078186.

Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020). XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation. In International Conference on Machine Learning. PMLR.

Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8342–8360). Association for Computational Linguistics. Retrieved from https://github.com/allenai.

Wei, J., & Zou, K. (2019). EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (pp. 6382–6388). Retrieved from http://github.

Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78–94. https://doi.org/10.1016/J.KNOSYS.2018.03.022

Ahmed, A., Shervashidze, N., Narayanamurthy, S., Josifovski, V., & Smola, A. J. (2013). Distributed Large-Scale Natural Graph Factorization. In Proceedings of the 22nd International Conference on World Wide Web (pp. 37–48). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2488388.2488393.

Ou, M., Cui, P., Pei, J., Zhang, Z., & Zhu, W. (2016). Asymmetric Transitivity Preserving Graph Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1105–1114). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2939672.2939751.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500), 2323–2326. https://doi.org/10.1126/science.290.5500.2323.

Cao, S., Lu, W., & Xu, Q. (2015). GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (pp. 891–900). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2806416.2806512.

Belkin, M., & Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6), 1373–1396. https://doi.org/10.1162/089976603321780317.

Grover, A., & Leskovec, J. (2016). Node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855–864). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2939672.2939754.

Perozzi, B., Kulkarni, V., Chen, H., & Skiena, S. (2017). Don’t Walk, Skip! Online Learning of Multi-Scale Network Embeddings. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017 (pp. 258–265). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3110025.3110086.

Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online learning of social representations. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 701–710). https://doi.org/10.1145/2623330.2623732.

Wang, D., Cui, P., & Zhu, W. (2016). Structural Deep Network Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1225–1234). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2939672.2939753.

Cao, S., Lu, W., & Xu, Q. (2016). Deep Neural Networks for Learning Graph Representations. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/10179.

Kipf, T. N., & Welling, M. (2016). Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations (ICLR 2017), Conference Track Proceedings. Retrieved from https://arxiv.org/abs/1609.02907v4.

Samvelyan, A., Shaptala, R., & Kyselov, G. (2020). Exploratory data analysis of Kyiv city petitions. In 2020 IEEE 2nd International Conference on System Analysis & Intelligent Computing (SAIC) (pp. 1–4). https://doi.org/10.1109/SAIC51296.2020.9239185.

Official site of the Ukrainian language. Retrieved from https://ukrainskamova.com.

McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153–157. https://doi.org/10.1007/BF02295996.

Published

2023-01-26

How to Cite

ШАПТАЛА, Р., & КИСЕЛЬОВ, Г. (2023). DOCUMENT CLASSIFICATION VIA AUGMENTATION OF DOCUMENT EMBEDDINGS WITH GRAPH EMBEDDINGS OF SYNONYMS DICTIONARY. Information Technology and Society, 3(5), 49–55. https://doi.org/10.32689/maup.it.2022.3.6