D.V. Lande, O.O. Dmytrenko, A.A. Snarskii
Transformation texts into complex network with applying visability graphs algorithms
//
Information Technology and Security. Proceedings of the XVIII International Scientific and Practical Conference ITB-2018. Kyiv: LLC "Engineering", 2018.
P. 2033.
This article proposes using visibility graph algorithms to transform texts into a complex network. Key words and concepts are extracted from a set of documents that describe a subject domain. Each word or phrase is assigned a numeric value using the GTF-IDF metric, which is proposed in this article in place of the ordinary TF-IDF metric, a statistic intended to reflect how important a word is to a document in a collection or corpus. The result is a time series. The visibility graph algorithm, a tool from time series analysis, is then used to construct the graph of the subject domain. Two current subject domains, "Information extraction" and "Complex network", are considered as examples. The corpora of documents related to these subject domains were taken from arXiv (https://arxiv.org), an open-access repository of electronic preprints, and the proposed algorithm is applied to each corpus. The article shows that, when the set of documents describes a single subject domain, applying the GTF metric alone is more expedient than applying the GTF-IDF metric. The results of the visibility graph algorithm and the compactified horizontal visibility graph algorithm are also compared. In some cases the compactified horizontal visibility graph algorithm yields a network of words with more connections between concepts than the visibility graph algorithm does. Gephi, an open-source visualization and exploration software for all kinds of graphs and networks, and an original package of specially developed Python modules are used as additional tools for simulation and visualization. The proposed algorithm can be used to visualize a subject domain and in information support systems, where it helps to reveal the key components of the subject domain.
The results of this article can also be used to build user interfaces for information retrieval systems, making the search for relevant information easier.
Keywords: Set of Documents, Subject Domain, Time Series, Network of Words, TF-IDF, Visibility Graph, Compactified Horizontal Visibility Graph.
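The pipeline summarized in the abstract (a series of word weights turned into a graph) can be illustrated with a minimal Python sketch. It is not the authors' package. It assumes the standard definitions of the natural and horizontal visibility criteria, and it assumes that "compactified" means merging all positions that carry the same word into a single node. The function names, the example words, and the weights are hypothetical; the weights merely stand in for GTF or GTF-IDF values.

```python
from itertools import combinations


def natural_visibility_edges(values):
    """Edges (i, j) of the natural visibility graph of a numeric series."""
    edges = set()
    for i, j in combinations(range(len(values)), 2):
        # i and j are linked if every point between them lies strictly
        # below the straight line joining (i, values[i]) and (j, values[j]).
        if all(
            values[k] < values[j] + (values[i] - values[j]) * (j - k) / (j - i)
            for k in range(i + 1, j)
        ):
            edges.add((i, j))
    return edges


def horizontal_visibility_edges(values):
    """Edges (i, j) of the horizontal visibility graph of a numeric series."""
    edges = set()
    for i, j in combinations(range(len(values)), 2):
        # i and j are linked if every point between them is strictly
        # lower than both endpoints.
        if all(values[k] < min(values[i], values[j]) for k in range(i + 1, j)):
            edges.add((i, j))
    return edges


def compactify(edges, words):
    """Assumed compactification: merge positions holding the same word."""
    merged = set()
    for i, j in edges:
        if words[i] != words[j]:  # drop self-loops created by merging
            merged.add(tuple(sorted((words[i], words[j]))))
    return merged


# Hypothetical input: a word sequence and made-up weights for each position.
words = ["graph", "text", "network", "graph", "word"]
weights = [3.0, 1.0, 2.0, 3.0, 1.0]

nvg = natural_visibility_edges(weights)
hvg = horizontal_visibility_edges(weights)
word_network = compactify(hvg, words)
```

The resulting `word_network` is a set of undirected word pairs, which could be written out as an edge list and loaded into Gephi for visualization. The pairwise check is quadratic in the number of pairs and is meant only for small illustrative inputs.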
