Abstract
The explosive growth of malware variants poses a major threat to information security. Traditional signature-based anti-virus systems fail to classify unknown malware into their corresponding families and to detect new kinds of malware programs. Therefore, we propose a machine learning based malware analysis system, which is composed of three modules: data processing, decision making, and new malware detection. The data processing module deals with gray-scale images, opcode n-grams, and import functions, which are employed to extract the features of the malware. The decision-making module uses the features to classify the malware and to identify suspicious malware. Finally, the detection module uses the shared nearest neighbor (SNN) clustering algorithm to discover new malware families. Our approach is evaluated on more than 20 000 malware instances, which were collected by Kingsoft, ESET NOD32, and Anubis. The results show that our system can effectively classify the unknown malware with a best accuracy of 98.9%, and successfully detect 86.7% of the new malware.
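To make the feature types and the clustering notion named above more concrete, the following is a minimal Python sketch, not the system's actual implementation. All function names, the n-gram length, the image width, and the mutual-neighbor condition in the SNN similarity are assumptions made for illustration; the abstract does not specify these details.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def opcode_ngrams(opcodes: Sequence[str], n: int = 3) -> Counter:
    """Count contiguous opcode n-grams in a disassembled instruction stream.

    The value of n is an assumed parameter, not one taken from the paper.
    """
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def bytes_to_gray_image(data: bytes, width: int = 4) -> List[List[int]]:
    """Map each byte to one gray-scale pixel (0-255), wrapping rows at `width`.

    A trailing partial row is zero-padded. The width is an assumed parameter.
    """
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def snn_similarity(neighbors: Dict[str, Set[str]], a: str, b: str) -> int:
    """Shared nearest neighbor similarity between samples a and b.

    `neighbors` maps each sample to its k-nearest-neighbor set. In this
    common formulation (assumed here), the similarity is the number of
    neighbors the two samples share, counted only when each sample
    appears in the other's neighbor set.
    """
    if a not in neighbors[b] or b not in neighbors[a]:
        return 0
    return len(neighbors[a] & neighbors[b])
```

In a pipeline of this kind, the n-gram counts and image pixels would serve as classifier inputs, while samples with low SNN similarity to every known family would be candidates for a new family.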