Alan Nafiiev, Hlib Kholodulkin, Andrii Rodionov, Dmytro Lande
Comparative Analysis of Machine Learning Models with Different Types of Data Representations for Detecting Malicious Files

// Selected Papers of the IX International Scientific Conference "Information Technology and Implementation" (IT&I-2022). Conference Proceedings. Kyiv, Ukraine, November 30 - December 02, 2022. CEUR Workshop Proceedings (ceur-ws.org). - Vol-3347. - pp. 367-375. ISSN 1613-0073. [https://ceur-ws.org/Vol-3347/Short_8.pdf]

Malware authors are creating increasingly advanced and sophisticated malicious software, and machine learning methods are increasingly used to detect it. However, these solutions can consume substantial computing resources. The problem therefore arises of creating a machine learning model that is optimal with respect to both learning rate and malware detection accuracy. In addition, a single method is usually not enough for high-quality file detection, so it is more effective to use several types of data representation. The purpose of this work is to study several machine learning models based on the support vector machine, compare them with each other, and identify the best model for two different types of data representation. The dataset was collected from various Internet sources and consists of 12824 executable files in .exe format, of which 11844 are malicious and 980 are benign. This article presents recommended methods for feature selection and input data generation for a machine learning model. These methods make it possible to find the best way of preparing the features that describe a malicious file, which are then used to train the model and to determine the model with the best parameters.

Keywords: PE format, malware detection, feature selection, machine learning, intrusion detection.
