Dmytro Lande,
Oleh Dmytrenko
Using Part-of-Speech Tagging for Building Networks of Terms in Legal Sphere
Abstract. This paper considers the problem of formalizing and building a terminological ontology of problem subject domains from content-related text data. As an ontological model, we propose a linguistic network model of text representation, the so-called network of key terms. In this network, the nodes are the keywords and phrases that appear in the text corpus, and the links are the semantic-syntactic relations between these terms in the text. The input sets of text data were prepared using systems that aggregate thematic information flows from freely available information resources distributed in global computer networks. In particular, this paper addresses the important and urgent problem of computerized processing of legal information. The task of computerized processing of natural language texts lies at the intersection of linguistic theory and the mathematical sciences. Therefore, more extensive natural language processing based on Part-of-Speech tagging was used to extract the key terms. After extraction, the resulting words and phrases were weighted statistically. The horizontal visibility graph algorithm was used to build undirected links between key terms. This paper also considers a new method for determining the direction of links between terms and for weighting these links in the undirected network of words and phrases. This method takes Part-of-Speech tags into account and follows the principle that a word or phrase is included in the corresponding extended phrases that contain more words. The proposed method was tested on a freely available legal document, the "Universal Declaration of Human Rights". After the key terms were extracted from this document and the direction and weight of the links between words or phrases were determined using the proposed methods, a directed weighted network of terms was built.
The method for building terminological networks considered in this work can be used, in particular, in systems for automatic structuring and summarization of legal texts, or in systems for detecting duplicates and contradictions in normative legal documents. It can promote the formation and improvement of the conceptual and terminological apparatus in the legal sphere and help harmonize national and international law.
Keywords: Information space, unstructured data, ontological model, problem subject domain, legal information, text corpus, computerized text processing, Part-of-Speech tagging, network of terms, automatic summarization
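To illustrate the horizontal visibility graph step mentioned in the abstract, the following is a minimal Python sketch, not the authors' implementation. It uses the standard definition: two positions in a numeric series are linked if every value strictly between them is lower than both. The function names, the example tokens, and the example weights are hypothetical, and merging positions that carry the same term into a single node is an assumption made only for this illustration.

```python
from typing import List, Set, Tuple


def horizontal_visibility_edges(weights: List[float]) -> Set[Tuple[int, int]]:
    """Return undirected HVG edges (i, j) with i < j for a numeric series.

    Positions i and j are linked when every value strictly between
    them is lower than min(weights[i], weights[j]).
    """
    edges: Set[Tuple[int, int]] = set()
    n = len(weights)
    for i in range(n - 1):
        highest_between = float("-inf")  # max of the values between i and j
        for j in range(i + 1, n):
            if highest_between < min(weights[i], weights[j]):
                edges.add((i, j))
            highest_between = max(highest_between, weights[j])
            if highest_between >= weights[i]:
                break  # nothing further to the right is visible from i
    return edges


def term_edges(tokens: List[str], weights: List[float]) -> Set[Tuple[str, str]]:
    """Map index edges to term pairs (assumption: equal terms share a node)."""
    pairs: Set[Tuple[str, str]] = set()
    for i, j in horizontal_visibility_edges(weights):
        if tokens[i] != tokens[j]:  # skip self-loops after merging
            pairs.add(tuple(sorted((tokens[i], tokens[j]))))
    return pairs


# Hypothetical tokens and statistical weights, for illustration only.
example_tokens = ["right", "to", "life", "liberty"]
example_weights = [3.0, 1.0, 2.0, 4.0]
index_edges = horizontal_visibility_edges(example_weights)
undirected_term_links = term_edges(example_tokens, example_weights)
```

For the example weights, the index edges are (0, 1), (0, 2), (0, 3), (1, 2) and (2, 3); positions 1 and 3 are not linked because the value 2.0 between them is higher than 1.0. The sketch yields undirected links only; determining link direction and link weight is the subject of the method proposed in the paper and is not reproduced here.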