D.V. Lande, O.O. Dmytrenko, A.A. Snarskii
Transformation texts into complex network with applying visibility graphs algorithms

// Information Technology and Security. Proceedings of the XVIII International Scientific and Practical Conference ITB-2018. - Kyiv: LLC "Engineering", 2018. - pp. 20-33.


This article proposes using visibility algorithms to transform texts into a complex network. Key words and concepts are extracted from a set of documents that describes a subject domain. Each word or phrase is assigned a numeric value using the GTF-IDF metric, which is proposed in this article in place of the ordinary TF-IDF metric, a statistic intended to reflect how important a word is to a document in a collection or corpus. The result is a time series. The visibility graph algorithm, a tool from time series analysis, is then used to construct the graph of the subject domain. Two current subject domains, "Information extraction" and "Complex network", are considered as examples. The document corpora for these domains were taken from arXiv (https://arxiv.org), an open-access repository of electronic preprints, and the proposed algorithm is applied to them. The article shows that, when the set of documents describes a single subject domain, using the GTF metric alone is more expedient than using GTF-IDF. The results of the visibility graph algorithm and the compactified horizontal visibility graph algorithm are also compared: in some cases the compactified horizontal visibility graph algorithm yields a network of words with more connections between concepts than the visibility graph algorithm does. Gephi, open-source software for visualizing and exploring all kinds of graphs and networks, and an original package of specially developed Python modules are used as additional tools for simulation and visualization. The proposed algorithm can be used to visualize a subject domain and in information support systems, where it helps reveal the key components of the subject domain. The results can also be used to build user interfaces for information retrieval systems that make searching for relevant information easier.
Keywords: Set of Documents, Subject Domain, Time Series, Network of Words, TF-IDF, Visibility Graph, Compactified Horizontal Visibility Graph.
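
The graph constructions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' package. The function names, the sample words, and the sample weights are hypothetical, and the compactification step (merging all nodes that carry the same word) is an assumption about how the compactified horizontal visibility graph is formed. The weights stand in for the GTF or GTF-IDF values, whose formula is not given here.

```python
def visibility_edges(series):
    """Natural visibility graph: points a and b are linked when every
    intermediate point lies strictly below the straight line joining them."""
    edges = set()
    n = len(series)
    for a in range(n):
        for b in range(a + 1, n):
            ya, yb = series[a], series[b]
            if all(
                series[c] < yb + (ya - yb) * (b - c) / (b - a)
                for c in range(a + 1, b)
            ):
                edges.add((a, b))
    return edges


def horizontal_visibility_edges(series):
    """Horizontal visibility graph: points a and b are linked when every
    intermediate point is strictly lower than both of them."""
    edges = set()
    n = len(series)
    for a in range(n):
        for b in range(a + 1, n):
            if all(series[c] < min(series[a], series[b]) for c in range(a + 1, b)):
                edges.add((a, b))
    return edges


def compactify(edges, words):
    """Assumed compactification: merge positions holding the same word,
    dropping self-loops and duplicate edges."""
    merged = set()
    for a, b in edges:
        if words[a] != words[b]:
            merged.add(tuple(sorted((words[a], words[b]))))
    return merged


# Hypothetical word sequence and per-word weights (illustrative values only).
words = ["network", "graph", "text", "network", "word"]
weights = [3.0, 1.0, 2.0, 3.0, 1.5]

nvg = visibility_edges(weights)
hvg = horizontal_visibility_edges(weights)
word_network = compactify(hvg, words)
print(sorted(word_network))
```

The resulting edge list could then be exported to a format that Gephi reads for visualization. The quadratic pair loop is chosen for readability, not speed.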
