This post outlines some experiments I ran using Auxiliary Loss Optimization for Hypothesis Augmentation (ALOHA) for DGA domain detection.

(Update 2019-07-18) After getting feedback from one of the ALOHA paper authors, I modified my code to set loss weights for the auxiliary targets as they did in their paper (weights used: main target 1.0, auxiliary targets 0.1). I also added 3 word-based/dictionary DGAs. All diagrams and metrics have been updated to reflect this.

I recently read the paper ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation by Ethan M. Rudd, Felipe N. Ducau, Cody Wild, Konstantin Berlin, and Richard Harang from Sophos. This research will be presented at USENIX Security 2019 in August 2019. The paper shows that by supplying more prediction targets to their model at training time, the authors can improve the prediction performance of the primary target. More specifically, they modify a deep learning based model for detecting malware (a binary classifier) to also predict things like individual vendor predictions, malware tags, and the number of VT detections. Their “auxiliary loss architecture yields a significant reduction in detection error rate (false negatives) of 42.6% at a false positive rate (FPR) of 10^−3 when compared to a similar model with only one target, and a decrease of 53.8% at 10^−5 FPR.”

ALOHA Model Architecture

Figure 1 from the paper

A schematic overview of our neural network architecture. Multiple output layers with corresponding loss functions are optionally connected to a common base topology which consists of five dense blocks. Each block is composed of a Dropout, dense and batch normalization layers followed by an exponential linear unit (ELU) activation of sizes 1024, 768, 512, 512, and 512. This base, connected to our main malicious/benign output (solid line in the figure) with a loss on the aggregate label constitutes our baseline architecture. Auxiliary outputs and their respective losses are represented in dashed lines. The auxiliary losses fall into three types: count loss, multi-label vendor loss, and multi-label attribute tag loss
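
To make the multi-output idea concrete, here is a rough Keras sketch of a shared base topology feeding one main output and several auxiliary outputs. The dense block sizes follow the caption above; everything else (input featurization, dropout rate, head sizes, loss functions, and names) is my assumption rather than the authors' code.

from tensorflow.keras import Model, layers

def dense_block(x, units, dropout=0.05):
    # Dropout -> Dense -> BatchNorm -> ELU, as described in the caption;
    # the dropout rate here is an assumption.
    x = layers.Dropout(dropout)(x)
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("elu")(x)

def build_aloha_style_model(input_dim, n_vendors, n_tags):
    inp = layers.Input(shape=(input_dim,))
    x = inp
    for units in (1024, 768, 512, 512, 512):
        x = dense_block(x, units)

    # Main malicious/benign head plus three auxiliary heads.
    main = layers.Dense(1, activation="sigmoid", name="malware")(x)
    vendors = layers.Dense(n_vendors, activation="sigmoid", name="vendors")(x)
    tags = layers.Dense(n_tags, activation="sigmoid", name="tags")(x)
    counts = layers.Dense(1, activation="relu", name="detection_counts")(x)

    model = Model(inp, [main, vendors, tags, counts])
    model.compile(
        optimizer="adam",
        loss={"malware": "binary_crossentropy",
              "vendors": "binary_crossentropy",
              "tags": "binary_crossentropy",
              "detection_counts": "poisson"},
        # Main target weighted 1.0, auxiliary targets down-weighted
        # (0.1, as in the update note above).
        loss_weights={"malware": 1.0, "vendors": 0.1,
                      "tags": 0.1, "detection_counts": 0.1},
    )
    return model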

This paper made me wonder how well this technique would work for other areas in network security such as:

  • Detecting malicious URLs from Exploit Kits - possible auxiliary labels: Exploit Kit names, Web Proxy Categories, etc.
  • Detecting malicious C2 domains - possible auxiliary labels: malware family names, DGA or not, proxy categories.
  • Detecting DGA Domains - possible auxiliary labels: malware families, DGA type (wordlist, hex based, alphanumeric, etc).

I decided to explore the last use case: how well auxiliary loss optimization would improve DGA domain detection. For this work I identified four DGA models and used these as baselines. Then I ran some experiments. All code from these experiments is hosted here. This code is based heavily on Endgame’s dga_predict, but with many modifications.

Data:

For this work, I used the same data sources selected by Endgame’s dga_predict (but I added 3 additional DGAs: gozi, matsnu, and suppobox).

  • Alexa top 1m domains
  • Classical DGA domains for the following malware families: banjori, corebot, cryptolocker, dircrypt, kraken, lockyv2, pykspa, qakbot, ramdo, ramnit, and simda.
  • Word-based/dictionary DGA domains for the following malware families: gozi, matsnu, and suppobox.

Baseline Models:

I used 4 baseline binary models + 4 extensions of these models that use Auxiliary Loss Optimization for Hypothesis Augmentation.

Baseline Models:

  • Bigram
  • CNN
  • LSTM
  • CNN+LSTM

ALOHA Extended Models (each simply uses the malware families as additional binary labels):

  • ALOHA CNN
  • ALOHA Bigram
  • ALOHA LSTM
  • ALOHA CNN+LSTM

I trained each of these models using the default settings as provided by dga_predict, except that I added stratified sampling based on the full labels (benign + malware families); a minimal sketch of that split follows the list below:

  • training splits: 76% training, 4% validation, 20% testing
  • all models were trained with a batch size of 128
  • The CNN, LSTM, and CNN+LSTM models used up to 25 epochs, while the bigram models used up to 50 epochs.
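
For reference, here is a minimal sketch of how such a stratified 76/4/20 split could be done with scikit-learn; dga_predict's actual implementation differs, and the variable names are placeholders.

from sklearn.model_selection import train_test_split

def stratified_splits(domains, labels, seed=42):
    """Split into 76% train / 4% validation / 20% test, stratified on the
    full label (benign + malware family). Variable names are placeholders."""
    X_train, X_test, y_train, y_test = train_test_split(
        domains, labels, test_size=0.20, stratify=labels, random_state=seed)
    # 4% of the total is 5% of the remaining 80% (0.04 / 0.80 = 0.05).
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.05, stratify=y_train, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)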

Below are counts of how many domains from each DGA family were used and how many Alexa top 1m domains were included (denoted as “benign”).

In [1]: import pickle

In [2]: from collections import Counter

In [3]: data = pickle.loads(open('traindata.pkl', 'rb').read())

In [4]: Counter([d[0] for d in data]).most_common(100)
Out[4]: 
[('benign', 139935),
 ('qakbot', 10000),
 ('dircrypt', 10000),
 ('pykspa', 10000),
 ('corebot', 10000),
 ('kraken', 10000),
 ('suppobox', 10000),
 ('gozi', 10000),
 ('ramnit', 10000),
 ('matsnu', 10000),
 ('locky', 9999),
 ('banjori', 9984),
 ('simda', 9984),
 ('ramdo', 9984),
 ('cryptolocker', 9984)]

Results

Model AUC scores (sorted by AUC):

  • aloha_bigram 0.9435
  • bigram 0.9444
  • cnn 0.9817
  • aloha_cnn 0.9820
  • lstm 0.9944
  • aloha_cnn_lstm 0.9947
  • aloha_lstm 0.9950
  • cnn_lstm 0.9957

Overall, by AUC, the ALOHA technique only improved the LSTM and CNN models, and only marginally. The ROC curves, however, show reductions in the error rates at very low false positive rates (between 10^-5 and 10^-3), similar to the gains seen in the ALOHA paper, although the paper’s gains appeared much larger.
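
For context, the AUC values above and the log-scale ROC curves below can be produced with scikit-learn and matplotlib along these lines; this is a sketch only (y_true and model_scores are placeholder names, not the project's exact code).

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def plot_rocs(y_true, model_scores):
    """y_true: binary labels (1 = DGA); model_scores: {model_name: predicted_probs}."""
    for name, scores in model_scores.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        auc = roc_auc_score(y_true, scores)
        plt.plot(fpr, tpr, label=f"{name} (AUC={auc:.4f})")
    plt.xscale("log")            # log scale emphasizes the low-FPR region
    plt.xlim(1e-6, 1.0)
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()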


ROC: All Models Linear Scale


ROC: All Models Log Scale


ROC: Bigram Models Log Scale


ROC: CNN Models Log Scale


ROC: CNN+LSTM Models Log Scale


ROC: LSTM Models Log Scale

Heatmap

Below is a heatmap showing the percentage of detections across all the malware families for each model. Low numbers are good for the benign label (top row), high numbers are good for all the others.

Note that the last 3 rows are all word-based/dictionary DGAs. It is interesting, although not too surprising, that the models that include LSTMs tended to do better against these DGAs.

I annotated with green boxes places where the ALOHA models did better. This seems to be most apparent with the models that include LSTMs and for the word-based/dictionary DGAs.
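
The heatmap itself is simple to build; here is a sketch assuming a pandas DataFrame with hypothetical columns family, model, and detected (one row per test domain and model), which is not the project's exact code.

import matplotlib.pyplot as plt
import seaborn as sns

def detection_heatmap(df):
    """df columns (placeholder names): 'family', 'model', and a boolean 'detected'."""
    rates = (df.groupby(["family", "model"])["detected"]
               .mean()                    # fraction of each family's domains flagged
               .unstack("model") * 100)   # families as rows, models as columns
    sns.heatmap(rates, annot=True, fmt=".1f", cmap="viridis")
    plt.title("Percent of domains detected, per family and model")
    plt.show()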


Future Work:

These are some areas of future work I hope to have time to try out.

  • Add more DGA generators to the project, especially word-based/dictionary DGAs, and see how the models react. I have identified several (see “Word-based / Dictionary-based DGA Resources” from here for more info).
  • Try incorporating other auxiliary targets like:
    • Type of DGA (hex based, alphanumeric, custom alphabet, dictionary/word-based, etc)
    • Classical DGA domain features like string entropy, the length of the longest consecutive consonant run, the length of the longest consecutive vowel run, etc. I am curious whether forcing the NN to learn these would improve its primary scoring mechanism (see the sketch after this list).
    • Metadata from VT domain report.
    • Summary / stats from Passive DNS (PDNS).
    • Features from various aspects of the domain’s whois record.
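
Here is that quick sketch of the classical DGA features mentioned above; these are my own minimal implementations, not taken from any particular paper.

import math
import re
from collections import Counter

def shannon_entropy(s: str) -> float:
    if not s:
        return 0.0
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def longest_run(s: str, charset: str) -> int:
    runs = re.findall(f"[{charset}]+", s)
    return max((len(r) for r in runs), default=0)

def classical_features(domain: str) -> dict:
    return {
        "entropy": shannon_entropy(domain),
        "longest_consonant_run": longest_run(domain, "bcdfghjklmnpqrstvwxyz"),
        "longest_vowel_run": longest_run(domain, "aeiou"),
    }

# e.g. classical_features("xjkqpwmrtz") yields a long consonant run,
# while classical_features("google") does not.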

If you enjoyed this post, you may be interested in my other recent post on Getting Started with DGA Domain Detection Research. Also, please see more Security Data Science blog posts at my personal blog: covert.io.

As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost

This post provides resources for getting started with research on Domain Generation Algorithm (DGA) Domain Detection.

DGA domains are commonly used by malware as a mechanism to maintain command and control (C2) and to make it more difficult for defenders to block. Prior to DGA domains, most malware used a small hardcoded list of IPs or domains. Once these IPs/domains were discovered, they could be blocked by defenders or taken down for abuse. DGA domains make this more difficult since the C2 domain changes frequently, and enumerating and blocking all generated domains can be expensive.
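
As a toy illustration (not any real family's algorithm), a date-seeded DGA might look something like this; the malware and its operator can both derive the same candidate C2 domains for a given day, while defenders have to predict and block all of them.

import hashlib
from datetime import date

def toy_dga(day: date, count: int = 10, tld: str = ".com"):
    """Generate `count` pseudo-random candidate C2 domains for a given day."""
    domains = []
    for i in range(count):
        seed = f"{day.isoformat()}-{i}".encode()
        digest = hashlib.md5(seed).hexdigest()
        domains.append(digest[:12] + tld)   # 12 hex chars + TLD
    return domains

print(toy_dga(date(2019, 7, 18)))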

I have recently been working on a research project related to DGA detection (hopefully it will turn into a blog post or a presentation somewhere), and it occurred to me that DGA detection is probably one of the most accessible areas for those getting into security data science, due to the availability of so much labelled data and so many open source implementations of DGA detection. One might argue that this means it is not an area worth researching due to saturation, but I think that depends on your situation/goals. This short post outlines some of the resources that I found useful for DGA research.

Data:

This section lists some domain lists and DGA generators that may be useful for creating “labelled” DGA domain lists.

DGA Data:

  • DGArchive - a large private collection of DGA related data. It contains ~88 CSV files of DGA domains organized by malware family. DGArchive is password protected; if you want access, you need to reach out to the maintainer.
  • Bambenek Feeds (see “DGA Domain Feed”).
  • Netlab 360 DGA Feeds

DGA Generators:

Word-based / Dictionary-based DGA Resources:

Below are all malware families that use word-based/dictionary DGAs, meaning their domains consist of 2 or more words selected from a list/dictionary and concatenated together. I separate these out since they are different from most other “classical” DGAs.

“Benign” / Non-DGA Data:

This section lists some domain lists that may be useful for creating “labelled” benign domain lists. Several academic papers use one or more of these sources, but they generally create derivatives that represent the stable N-day top X sites (e.g. the stable Alexa 30-day top 500k, meaning domains from the Alexa top 500k that have been on the list consecutively for the last 30 days straight; the Alexa data needs to be downloaded each day for 30+ days to create this, since only today’s snapshot is provided by Amazon). This filters out domains that become popular for a short amount of time but then drop off, as sometimes happens with malicious domains.
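
Deriving such a stable list is straightforward once the daily snapshots have been collected; here is a sketch (the file names and the 500k cutoff are placeholders).

import csv

def load_top_domains(path: str, n: int = 500_000) -> set:
    """Load the first n domains from a daily Alexa-style CSV (rank,domain)."""
    with open(path, newline="") as f:
        return {row[1] for _, row in zip(range(n), csv.reader(f))}

def stable_domains(snapshot_paths) -> set:
    """Domains present in every snapshot (e.g. 30 consecutive daily files)."""
    return set.intersection(*(load_top_domains(p) for p in snapshot_paths))

# Usage (file names are placeholders):
# stable = stable_domains([f"alexa-top1m-day{i}.csv" for i in range(30)])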

Utilities:

Domain Parser:

When parsing the various domain list data, tldextract is very helpful for stripping off TLDs or subdomains if desired. I have seen several projects attempt to parse domains using “split(‘.’)” or “domain[:-3]”. This does not work very well since a domain’s TLD can contain multiple “.”s (e.g. .co.uk).

Installation:

pip install tldextract

Example:

In [1]: import tldextract
In [2]: e = tldextract.extract('abc.www.google.co.uk')

In [3]: e
Out[3]: ExtractResult(subdomain='abc.www', domain='google', suffix='co.uk')

In [4]: e.domain
Out[4]: 'google'

In [5]: e.subdomain
Out[5]: 'abc.www'

In [6]: e.registered_domain
Out[6]: 'google.co.uk'

In [7]: e.fqdn
Out[7]: 'abc.www.google.co.uk'

In [8]: e.suffix
Out[8]: 'co.uk'

Domain Resolution:

During the course of your research you may need to perform DNS resolutions on lots of DGA domains. If you do this, I highly recommend setting up your own bind9 server on Digital Ocean or Amazon and using adnshost (a utility from adns). If you perform the DNS resolutions from your home or office, your ISP may interfere with the DNS responses because they will appear malicious, which can bias your research. If you use a provider’s recursive nameservers, you may violate the acceptable use policy (AUP) due to the volume AND the provider may also interfere with the responses.

Adnshost enables high throughput / bulk DNS queries to be performed asynchronously. It will be much faster than performing the DNS queries synchronously (one after the other).

Here is an example of using adnshost (assuming you are running it from the Bind9 server you setup):

cat huge-domains-list.txt | adnshost \
    --asynch \
    --config "nameserver 127.0.0.1" \
    --type a \
    --pipe \
    --addr-ipv4-only > results.txt

This article should get you most of the way there with setting up the bind9 server.

Models:

This section provides links to a few models that could be used as baselines for comparison.

Research:


I hope this is helpful. As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost

This short post catalogs some resources that may be useful for those interested in security data science. It is not meant to be an exhaustive list. It is meant to be a curated list to help you get started.

Staying Current with Security Data Science

Here is my current strategy for staying current with security data science research. It leans heavier towards academic research since this is what interests me at the moment.

  1. Google Scholar Publication alerts on known respected researchers.
  2. Google Scholar Citation alerts on interesting or noteworthy papers.
  3. Follow security ML researchers on Twitter and Medium. They frequently share interesting and cutting edge research papers / videos / blogs.
  4. Periodically review proceedings from noteworthy security conferences.
  5. Skim published security conference videos from Irongeek looking for topics of interest.

Google Scholar alerts

Citation Alerts on these papers:

  • “Acing the IOC game: Toward automatic discovery and analysis of open-source cyber threat intelligence”
  • “AI^2: training a big data machine to defend”
  • “APT Infection Discovery using DNS Data”
  • “Beehive: Large-scale log analysis for detecting suspicious activity in enterprise networks”
  • “Deep neural network based malware detection using two dimensional binary program features”
  • “Detecting malicious domains via graph inference”
  • “Detecting malware based on DNS graph mining”
  • “Detecting structurally anomalous logins in Enterprise Networks”
  • “Discovering malicious domains through passive DNS data graph analysis”
  • “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models”
  • “Enabling network security through active DNS datasets”
  • “Feature-based transfer learning for network security”
  • “Gotcha-Sly Malware!: Scorpion A Metagraph2vec Based Malware Detection System”
  • “Guilt by association: large scale malware detection by mining file-relation graphs”
  • “Identifying suspicious activities through dns failure graph analysis”
  • “Polonium: Tera-scale graph mining and inference for malware detection”
  • “Segugio: Efficient behavior-based tracking of malware-control domains in large ISP networks”

New article alerts on these authors; the ones with notes below are the most relevant / interesting to me.

  • Alina Oprea - heavily focused on operational security ML.
  • Josh Saxe, Rich Harang, and Konstantin Berlin - heavily focused on Malware detection/analytics using ML. Also a published book author.
  • Manos Antonakakis and Roberto Perdisci - heavily focused on network security analytics using ML with a specialty in DNS traffic.
  • Balduzzi Marco
  • Battista Biggio
  • Chaz Lever
  • Christopher Kruegel
  • Damon McCoy
  • David Dagon
  • David Freeman
  • Gianluca Stringhini
  • Giovanni Vigna
  • Guofei Gu
  • Han Yufei
  • Hossein Siadati
  • Issa Khalil
  • Jason (Iasonas) Polakis
  • Michael Donald Bailey
  • Michael Iannacone
  • Nick Feamster
  • Niels Provos
  • Nir Nissim
  • Patrick McDaniel
  • Stefan Savage
  • Steven Noel
  • Terry Nelms
  • Ting-Fang Yen
  • Vern Paxson
  • Wenke Lee
  • Yacin Nadji
  • Yanfang (Fanny) Ye
  • Yizheng Chen
  • Yuval Elovici

Twitter

Twitter can be a gold mine for new and relevant ideas, blogs, and presentations for security data science. You just need to make sure you continually follow the right folks. Here is a short list of thought leaders in this space (if I left you off, it is my oversight, so please don’t take offense).

For a more exhaustive list of others I would recommend following on Twitter, see this gist. This list is focused on Threat Intel, Threat Hunting, Detection Engineering, IR, and Security Engineering. It is not exhaustive, but is a good start.

Conferences

Below are several interesting security conferences where research on security data science topics is published. It is a good idea to be on the lookout for the proceedings from these events.

This page is also an excellent resource in general for top academic security conferences: Top Academic Security conferences list. The major industry focused security conferences like Blackhat, RSA, Defcon, BSides*, DerbyCon, and ShmooCon all frequently have talks relevant to security data science, but this is not their primary focus, so they are not explicitly called out above.

Learning Resources

These resources will help you build a baseline of knowledge in Cyber Security and Machine Learning.

Books

Security:

Machine Learning / Data Science:

Courses


I hope this is helpful, and I would be interested to hear about other resources that you find useful. Please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost

A short listing of research papers I’ve read or plan to read that use passive DNS (PDNS) data and graph analytics for identifying malicious domains.

Host-Domain Graphs

Host-domain graphs are bipartite graphs mapping hosts/IPs to the domains that they either resolved (passive DNS) or visited (web proxy logs). These graphs are used heavily in operational security machine learning papers on network threat hunting, as they provide insight into behavioral patterns across an enterprise or ISP.

Detecting Malicious Domains via Graph Inference P. K. Manadhata, S. Yadav, P. Rao, and W. Horne. In Proceedings of 19th European Symposium on Research in Computer Security, Wroclaw, Poland, September 7-11, 2014.

Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data Alina Oprea, Zhou Li, Ting-Fang Yen, Sang H. Chin, and Sumyah Alrwais In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015.

Segugio: Efficient Behavior-Based Tracking of Malware-Control Domains in Large ISP Networks Babak Rahbarinia and Manos Antonakakis In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015

Domain Resolution Graphs (Domain-IP Graphs)

A domain resolution graph is an undirected bipartite graph representing observed domain->IP DNS resolution from Passive DNS data.
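
As a sketch, such a graph can be built from passive DNS records with networkx; the rrname/rdata field names follow common passive DNS output formats but are an assumption here.

import networkx as nx

def build_resolution_graph(pdns_records):
    """pdns_records: iterable of dicts like
    {"rrname": "example.com", "rdata": "93.184.216.34"};
    the field names are an assumption, not a fixed schema."""
    g = nx.Graph()
    for rec in pdns_records:
        domain, ip = rec["rrname"], rec["rdata"]
        g.add_node(domain, kind="domain")
        g.add_node(ip, kind="ip")
        g.add_edge(domain, ip)
    return g

# Known-bad seed labels can then be propagated across the graph
# (e.g. belief propagation or simple guilt-by-association scoring).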

Notos: Building a Dynamic Reputation System for DNS M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. In the Proceedings of the 19th USENIX Security Symposium, Washington, DC, USA, August 11-13, 2010.

EXPOSURE: Finding Malicious Domains using Passive DNS Analysis L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi. In Proceedings of the Network and Distributed System Security Symposium, San Diego, California, USA, February 2011.

Discovering Malicious Domains through Passive DNS Data Graph Analysis Issa Khalil, Ting Yu, and Bei Guan. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security (ASIA CCS ‘16), 2016.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

Beehive: Large-Scale Log Analysis for Detecting Suspicious Activity in Enterprise Networks Ting-Fang Yen, Alina Oprea, Kaan Onarlioglu, Todd Leetham, William Robertson, Ari Juels, and Engin Kirda In Proceedings of Annual Computer Security Applications Conference (ACSAC), 2013

An Epidemiological Study of Malware Encounters in a Large Enterprise Ting-Fang Yen, Victor Heorhiadi, Alina Oprea, Michael K. Reiter, and Ari Juels In Proceedings of ACM Conference on Computer and Communications Security (CCS), 2014

Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data Alina Oprea, Zhou Li, Ting-Fang Yen, Sang H. Chin, and Sumyah Alrwais In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015

Segugio: Efficient Behavior-Based Tracking of Malware-Control Domains in Large ISP Networks Babak Rahbarinia and Manos Antonakakis In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015

Malicious Behavior Detection using Windows Audit Logs Konstantin Berlin, David Slater, Joshua Saxe In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security (AISec) 2015

Operational security log analytics for enterprise breach detection Zhou Li and Alina Oprea In Proceedings of the First IEEE Cybersecurity Development Conference (SecDev), 2016

Lens on the endpoint: Hunting for malicious software through endpoint data analysis. Ahmet Buyukkayhan, Alina Oprea, Zhou Li, and William Robertson. In Proceedings of Recent Advances in Intrusion Detection (RAID), 2017

–Jason
@jason_trost

PS …