Ланде Д., Дмитренко О.
Building word networks from texts using visibility graph algorithms
// Information Technology and Security, 2018. - Vol. 6. Iss. 2 (11). - pp. 5-18.
DOI: https://doi.org/10.20535/2411-1031.2018.6.2.153486
A method for constructing language networks is proposed. Key words and concepts are retrieved from a set of documents that describe a subject domain. Each word is assigned a numeric value using the TF-IDF metric, which is intended to reflect how important a word is to a document in a collection or corpus. As a result, a time series is constructed. The visibility graph algorithm, a tool from time series analysis, is then used to construct the graph of the subject domain. Two topical subject domains ("Space" and "Computer graphic") are considered as examples, and the proposed method is applied to sets of documents related to each of them. A network of connections between the terms and concepts that occur in the text documents is built. Building networks of words, whose nodes are elements of the text, makes it possible to reveal the key components of the text. At the same time, the task of determining the structural elements of a text that are also informationally important remains relevant. The research found that words such as "uranium", "nuclear", "waste", "Jupiter", "Mercury", "Moon", "Earth", "comet", "space" and others are key for the subject area "Space". The article shows that using the TF metric alone is more expedient than using the TF-IDF metric when the set of documents describes a single subject domain. The results of applying the visibility graph algorithm and the compactified horizontal visibility graph algorithm are also compared. It was found that in some cases the compactified horizontal visibility graph algorithm gives a network of words with more connections between concepts than the visibility graph algorithm. Gephi, open-source software for visualizing and exploring all kinds of graphs and networks, and an original package of specially developed Python modules are used as additional tools for simulation and visualization.
The proposed method can be used to visualize a subject domain and in information decision support systems, where it helps reveal the key components of a subject domain. The results of this article can also be used to build user interfaces for information retrieval systems that make searching for relevant information easier.
Keywords: Set of documents; domain; time series; network of words; statistical weight of word; visibility graph; compactified horizontal visibility graph.
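The pipeline summarized in the abstract (word weights, a series of weights in text order, a visibility graph, then a network of words) can be illustrated with a minimal Python sketch. It is not the authors' package. All function names here are hypothetical, raw term counts stand in for the TF weight, and the compactification step is assumed to merge every occurrence of a word into one node while dropping self-loops and duplicate edges.

```python
from collections import Counter
from itertools import combinations


def tf_series(tokens):
    """Map each token, in text order, to its raw count in the text.

    Assumption: raw counts stand in for the TF weight.
    """
    counts = Counter(tokens)
    return [counts[w] for w in tokens]


def visibility_edges(values):
    """Natural visibility graph over positions 0..n-1.

    Positions i < j are linked when every point between them lies strictly
    below the straight line from (i, values[i]) to (j, values[j]).
    The test is cross-multiplied so integer weights are compared exactly.
    """
    edges = set()
    for i, j in combinations(range(len(values)), 2):
        yi, yj = values[i], values[j]
        if all((values[k] - yi) * (j - i) < (yj - yi) * (k - i)
               for k in range(i + 1, j)):
            edges.add((i, j))
    return edges


def horizontal_visibility_edges(values):
    """Horizontal visibility graph over positions 0..n-1.

    Positions i < j are linked when every point between them is strictly
    lower than both values[i] and values[j].
    """
    edges = set()
    for i, j in combinations(range(len(values)), 2):
        lower = min(values[i], values[j])
        if all(values[k] < lower for k in range(i + 1, j)):
            edges.add((i, j))
    return edges


def compactify(tokens, edges):
    """Collapse position-level edges into word-level edges.

    Assumption: all positions holding the same word become one node;
    self-loops are dropped and repeated edges are kept once.
    """
    word_edges = set()
    for i, j in edges:
        a, b = tokens[i], tokens[j]
        if a != b:
            word_edges.add(tuple(sorted((a, b))))
    return word_edges


# Toy input; a real run would use the preprocessed tokens of a document set.
tokens = "earth moon earth sun comet earth".split()
series = tf_series(tokens)
word_network = compactify(tokens, horizontal_visibility_edges(series))
```

Multiplying all weights by the same positive constant changes neither visibility criterion, so normalizing counts by text length would give the same edges in this sketch. The resulting edge set can be written out as an edge list for a graph tool such as Gephi.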