D.V. Lande, publications

Alan Nafiiev, Dmytro Lande.
Malware detection model based on machine learning.
Bulletin of Cherkasy State Technological University, 2023. Iss. 3. - pp. 40-50. DOI: 10.24025/2306-4412.3.2023.286374

Every year, malware authors create increasingly sophisticated malware that can harm computers. Traditional methods, which are based on searching for program signatures, are no longer effective at detecting malware. They are being replaced by automated file analysis, which is a more promising approach to detecting suspicious files. Machine learning methods are increasingly used to detect such malicious programs. However, these solutions may require substantial computing resources. Therefore, the task arises of creating a machine learning model that is optimal in terms of both training speed and malware detection accuracy. In addition, a single method of data representation is usually not sufficient to detect the malicious features of files. Therefore, this paper describes two different methods: one is based on the binary information of the file, the other on the disassembled code of executable files. The purpose of this work is to improve the efficiency of malware detection by optimising feature extraction methods and applying machine learning. The main tasks of the study are extracting features from exe files, creating several machine learning models, and comparing them to determine the most effective one. The dataset used in this study was collected from various online sources and consists of 12824 executable files in .exe format, of which 11844 are malicious and 980 are benign. This paper presents recommended methods of feature extraction and input data generation for machine learning models based on the support vector machine algorithm. These methods make it possible to find the best way to process the features describing a malicious file. Six machine learning models were created, each of which performed well in terms of the F-score, precision, and recall metrics. The model based on the binary type of data representation showed the highest results for all metrics.
Keywords: intrusion detection, PE format, feature extraction, disassembled instructions, support vector machine.
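
As an illustration of the binary representation and the metrics named in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not specify which binary features were used, so the byte-frequency histogram here is an assumed stand-in, and the function names byte_histogram and precision_recall_f1 are hypothetical. The second part uses only the class counts given in the abstract (11844 malicious, 980 benign) to show what precision, recall, and F-score a trivial "label everything malicious" baseline would reach on such an imbalanced dataset, which may be a useful reference point when reading the reported metrics.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Return a 256-element vector of relative byte frequencies.

    Assumed stand-in for a "binary information" feature vector;
    the paper's actual features are not given in the abstract.
    """
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints 0..255
    total = len(data)
    return [counts.get(b, 0) / total for b in range(256)]


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (malicious) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical input: eight bytes beginning with the "MZ" magic that
# opens every PE file. A real pipeline would read the whole .exe file.
features = byte_histogram(b"MZ\x90\x00\x03\x00\x00\x00")

# Class counts taken from the abstract.
MALICIOUS, BENIGN = 11844, 980

# Trivial baseline: predict "malicious" for every file, so every malicious
# file is a true positive and every benign file is a false positive.
baseline = precision_recall_f1(tp=MALICIOUS, fp=BENIGN, fn=0)
print("baseline precision, recall, F1:", baseline)
```

A vector like features could serve as one input row for a support vector machine classifier; the disassembly-based representation mentioned in the abstract would need a separate extraction step that is not sketched here.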