Alan Nafiiev,
Hlib Kholodulkin,
Andrii Rodionov,
Dmytro Lande
Comparative Analysis of Machine Learning Models with Different Types of Data Representations for Detecting Malicious Files
//
Selected Papers of the IX International Scientific Conference "Information Technology and Implementation" (IT&I-2022). Conference Proceedings Kyiv, Ukraine, November 30 - December 02, 2022. CEUR Workshop Proceedings (ceur-ws.org). - Vol-3347. - pp 367-375. ISSN 1613-0073. [https://ceur-ws.org/Vol-3347/Short_8.pdf]
Malware authors are creating increasingly advanced and sophisticated malicious software.
Machine learning methods are increasingly used to detect such programs. However,
these solutions can consume substantial computing resources.
The problem therefore arises of building a machine learning model that is optimal with
respect to both learning rate and malware detection accuracy. In addition, a single method
is usually not enough for high-quality file detection, so it is more effective to use several
types of data representation. The purpose of this work is to study several machine learning
models based on the support vector machine, compare them with each other, and identify the
best model for two different types of data representation. The dataset was collected from
various Internet sources. It consists of 12824 executable files in .exe format, of which 11844
are malicious and 980 are benign. This article presents recommended methods for feature
selection and input data generation for a machine learning model. These methods make it
possible to find the best way of preparing the features that describe a malicious file; these
features are then used to train the models and to determine the model with the best parameters.
Keywords
PE format, malware detection, feature selection, machine learning, intrusion detection.
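
The class counts reported in the abstract (11844 malicious, 980 benign) imply a heavily imbalanced dataset. The following is a minimal sketch, not taken from the paper, of the arithmetic behind that observation. The function names and the inverse-frequency weighting heuristic, total / (number of classes * class count), are illustrative assumptions; the abstract does not say whether the authors applied any class weighting.

```python
# Class counts as reported in the abstract.
counts = {"malicious": 11844, "benign": 980}


def class_shares(counts):
    """Fraction of the dataset that belongs to each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count).

    This is an assumed heuristic for illustration, not the paper's method.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


shares = class_shares(counts)
weights = balanced_weights(counts)

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(shares.values())

for label in counts:
    print(f"{label}: share={shares[label]:.4f}, weight={weights[label]:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Because a classifier that labels every file as malicious would already be right on roughly 92% of this dataset, raw accuracy alone can overstate detection quality. Per-class metrics or class weights such as those above are one common way to account for this.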