검색 상세

FAST AND ACCURATE CONTENT-CLASSIFICATION TECHNIQUES FOR MALWARE DETECTION AND FILE TYPE IDENTIFICATION

초록/요약

A file-type (such as MP3, DOC and AVI) represents an encoding scheme in order to understand the underlying information in file contents. The classification of contents into their respective file types is a non-trivial task since it is required by many security applications in order perform their functions effectively. For instance, email attachment filtering requires blocking the types of inbound attachment that may contain malicious contents. The current practice of identifying encoding scheme relies on metadata information. For instance, file extensions (combining the file type with the name using a period), magic numbers (keeping the file type information in a file header) and file systems are used to store metadata information. However, these are susceptible to tampering or corruption, for instance, the file-extension can be easily spoofed and the magic numbers can be obfuscated. A more reliable approach may be to analyze the content for type identification. Since content-based schemes use statistical and data mining techniques, they are inaccurate and time-consuming (if compared with metadata techniques). In this dissertation, we propose several techniques to improve the accuracy and detection speed of content-based type identification. First, we propose a divide-and-conquer approach to improve the classification accuracy. We decompose the identification procedure into two steps: In the first step, the similar files in terms of byte pattern frequencies are grouped into several clusters. In the next step, the cluster which contains different file-types is fed to a neural network in order for finer classification. The experiments showed that the classification followed by clustering leads to higher accuracies. Second, we propose a feature selection technique in order to reduce the number of features, since current schemes use all the byte patterns as features. It uses a subset of highly-occurring byte patterns as features assuming that they are sufficient to build the representative model of a file type and classifying files. To evaluate its effectiveness, we applied it to the six most popular classification algorithms (i.e. neural network, linear discriminant analysis, K-means, K-nearest neighbor, decision tree, and support vector machine). On average, the K-nearest neighbor method achieved the optimum accuracy of 90% using only 40% of byte patterns; this reduces 55% of computation time. Furthermore, we propose to use the cosine distance as a similarity metric when comparing the file content rather than the Mahalanobis distance that is popular and has been used by the other related approaches. We show that the cosine similarity (unlike the Mahalanobis distance) retains the classification accuracy on a small number of highly frequent byte patterns which leads to smaller model size and faster detection rate. Third, we propose a content sampling technique, since the current schemes process a whole file to obtain its byte frequency distribution, It uses a small portion of a file to obtain its byte-frequency distribution. To evaluate the effectiveness of this approach, we sample in two ways: 1) initial contiguous bytes (where the frequency is generated from the first few consecutive bytes of a file), 2) a few small blocks in random locations in a file. The scheme is effective for large size files where a relatively small sample can generate the representative byte frequency distribution. For instance, it reduces the sampling size of MP3 files from 5MB to 400KB (without compromising the accuracy). This is a 15 fold size reduction. Furthermore, since the content sampling technique cannot classify small-size contents (such as packet payload), we propose a signature-free content-classification scheme that identifies executable contents in incoming packets. For accurate detection, the proposed scheme analyzes the packet payload in two steps. It first analyzes the packet payload to see if it contains multimedia-type data (binary contents except executables such as avi, wmv, jpg, etc.). If not, in the second step it classifies the payload either as text-type (txt, jsp, asp, etc.) or executable-type data. We propose two-step scheme because we found that characteristics of multimedia-type are different enough from text and executable-type, but the difference between text and executable (code)-type is smaller, which requires different technique to distinguish the two. To evaluate the proposed scheme, we transfer the different types of files (i.e. executable, multimedia and text files) using an FTP server in order to classify them into their respective types. It produces 2.53% and 4.69% of false positive and false negative rates respectively.

more

목차

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Contribution of the dissertation . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Brief summary of the proposed techniques . . . . . . . . . . . . . . . . . . 5
1.3.1 Divide-and-conquer . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Using high-frequency byte patterns . . . . . . . . . . . . . . . . . . 5
1.3.3 Content sampling technique . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Content-classification for malware detection . . . . . . . . . . . . . 7
1.4 Dissertation structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 File-type identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Using file headers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Using file contents . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Malware detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Improving the content-classi cation accuracy for file-type identification . . . . . 16
3.1 Divide-and-conquer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3 Type identification process of a file . . . . . . . . . . . . . . . . . . 18
4 Improving the content-classification speed for file-type identification . . . . . . . 20
4.1 Using high-frequency byte patterns . . . . . . . . . . . . . . . . . . . . . . 20
4.1.1 Using cosine similarity vs Mahalanobis distance . . . . . . . . . . . 22
4.1.2 Using classification (machine learning) algorithms . . . . . . . . . . 22
4.2 Content sampling technique . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.1 Initial contiguous bytes sampling . . . . . . . . . . . . . . . . . . . 25
4.2.2 Random sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Experiments and their analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Results on divide-and-conquer . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2.1 Experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2.2 Analysis of empirical results . . . . . . . . . . . . . . . . . . . . . . 29
5.3 Results on high-frequency byte patterns using cosine similarity . . . . . . . 32
5.3.1 Experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3.2 Analysis of empirical results . . . . . . . . . . . . . . . . . . . . . . 33
5.3.3 Reduction in processing time . . . . . . . . . . . . . . . . . . . . . 36
5.4 Results on high-frequency byte patterns using the classification algorithms 38
5.4.1 Experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4.2 Analysis of empirical results . . . . . . . . . . . . . . . . . . . . . . 39
5.4.3 Reduction in processing time . . . . . . . . . . . . . . . . . . . . . 44
5.5 Results on content sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.5.1 Analysis of empirical results of initial contiguous byte sampling . . 46
5.5.2 Analysis of empirical results of random sampling . . . . . . . . . . . 48
5.5.3 Reduction in processing time . . . . . . . . . . . . . . . . . . . . . 49
6 Content-classification for malware detection . . . . . . . . . . . . . . . . . . . . 51
6.1 Identification of multimedia packets . . . . . . . . . . . . . . . . . . . . . . 54
6.1.1 Learning phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1.2 Identification phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.1.3 Determining the n-gram order . . . . . . . . . . . . . . . . . . . . . 56
6.1.4 Limitation of the distance-based algorithm . . . . . . . . . . . . . . 58
6.2 Identification of executable and text packets . . . . . . . . . . . . . . . . . 60
6.2.1 Distance-based algorithm for text contents . . . . . . . . . . . . . . 60
6.2.2 Pattern counting scheme . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 Analyzing the classification accuracy using file segments as packet payload 63
6.3.1 Distance-based algorithm for multimedia contents . . . . . . . . . . 63
6.3.2 Distance-based algorithm for text contents . . . . . . . . . . . . . . 66
6.3.3 Pattern counting scheme . . . . . . . . . . . . . . . . . . . . . . . . 67
6.4 Evaluating with real-world network applications and malwares . . . . . . . 69
6.4.1 Protecting an FTP server . . . . . . . . . . . . . . . . . . . . . . . 70
6.4.2 Protecting a Web client . . . . . . . . . . . . . . . . . . . . . . . . 70
6.4.3 Applying the content classi cation scheme on malwares . . . . . . . 71
6.5 Detecting malwares automatically . . . . . . . . . . . . . . . . . . . . . . . 72
6.5.1 Detecting malwares in the presence of false positives . . . . . . . . . 72
6.5.2 Detecting malwares in the presence of false negatives . . . . . . . . 73
6.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6.1 Memory overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6.2 Higher n-gram computation overhead . . . . . . . . . . . . . . . . . 77
7 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.1 Content classi cation for file-type identification . . . . . . . . . . . . . . . 85
7.1.1 Divide-and-conquer . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.1.2 Using high-frequency byte patterns . . . . . . . . . . . . . . . . . . 86
7.1.3 Content sampling technique . . . . . . . . . . . . . . . . . . . . . . 87
7.2 Content classification for malware detection . . . . . . . . . . . . . . . . . 88
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Appendix
A File-type-wise comparison of cosine similarity and Mahalanobis distance . . . . . 95
B Byte frequency distributions of les of different file-types . . . . . . . . . . . . . 98

more