A short listing of resources useful for creating malware training sets for machine learning.
In leading academic and industry research on malware detection, it is common to build labeled training data with variations of the following techniques, all of which rely on VirusTotal determinations.
- “In this paper, we use a ‘1-/5+’ criterion for labeling a given file as malicious or benign: if a file has one or fewer vendors reporting it as malicious, we label the file as ‘benign’”. See ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation for more details.
- “We assign malicious/benign labels on a 5+/1- basis, i.e., documents for which one or fewer vendors labeled malicious, we ascribe the aggregate label benign, while documents for which 5 or more vendors labeled malicious, we ascribe the aggregate label malicious.” See MEADE: Towards a Malicious Email Attachment Detection Engine for more details.
- Uses a method similar to the ones above, but further removes files that have hash-based file names or file names that are “malware” or “sample”. See Learning from Context: Exploiting and Interpreting File Path Information for Better Malware Detection for more details.
- “To train and evaluate our model at low false positive rates, we require accurate labels for our malware and benignware binaries. We accomplish this by running all of our data through VirusTotal, which runs the binaries through approximately 55 malware engines. We then use a voting strategy to decide if each file is either malware or benignware… We label any file against which 30% or more of the antivirus engines alarm as malware, and any file that no antivirus engine alarms on as benignware. For the purposes of both training and accuracy evaluation we discard any files that more than 0% and less than 30% of VirusTotal’s antivirus engines declare it malware, given the uncertainty surrounding the nature of these files.” See Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features for more details.
- Scraping packages from Ninite, Chocolatey, and Cygwin.
- Endgame’s EMBER is becoming one of the most cited datasets used for security machine learning. “The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign)”. See EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models for more details.
- Labeling the VirusShare Corpus: Lessons Learned - John Seymour
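
The vote-threshold labeling rules quoted above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited papers: the function names are hypothetical, the inputs are assumed to be the number of VirusTotal engines that flagged a file and the number that scanned it, and the treatment of files between the two thresholds of the 1-/5+ rule (discarding them) is an assumption. The file name filter is likewise only a guess at what “hash-based” means (a name whose stem is an MD5, SHA-1, or SHA-256 hex digest).

```python
import os
import re
from typing import Optional

# Assumption: "hash-based" means the stem is a 32, 40, or 64 character hex digest.
_HASH_STEM = re.compile(r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$")


def label_1minus_5plus(positives: int) -> Optional[str]:
    """1-/5+ rule: one or fewer detections is benign, five or more is malicious.

    Files with 2 to 4 detections get no label (assumed here to be discarded).
    """
    if positives <= 1:
        return "benign"
    if positives >= 5:
        return "malicious"
    return None


def label_30_percent(positives: int, total: int) -> Optional[str]:
    """Voting rule: zero detections is benign, 30% or more is malicious.

    Files with more than 0% and less than 30% detections are discarded (None).
    """
    if total <= 0:
        raise ValueError("total must be a positive number of engines")
    if positives == 0:
        return "benign"
    # Integer comparison avoids floating point error at the 30% boundary.
    if positives * 10 >= total * 3:
        return "malicious"
    return None


def has_uninformative_name(filename: str) -> bool:
    """True if the bare file name is a hash digest, 'malware', or 'sample'."""
    stem = os.path.splitext(filename)[0]
    return bool(_HASH_STEM.match(stem)) or stem.lower() in {"malware", "sample"}


if __name__ == "__main__":
    print(label_1minus_5plus(3))        # None: between the two thresholds
    print(label_30_percent(17, 55))     # malicious: 17/55 is just over 30%
    print(has_uninformative_name("sample.exe"))  # True
```

With roughly 55 engines, the 30% rule requires at least 17 detections for a malware label, which is considerably stricter than the five detections required by the 1-/5+ rule, so the two schemes will discard different portions of the same corpus.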
The “short links” format was inspired by O’Reilly’s Four Short Links series.