This post provides resources for getting started with research on Domain Generation Algorithm (DGA) Domain Detection.

DGA domains are commonly used by malware as a mechanism to maintain command and control (C2) and to make it more difficult for defenders to block. Prior to DGA domains, most malware used a small hardcoded list of IPs or domains. Once these IPs / domains were discovered, they could be blocked by defenders or taken down for abuse. DGA domains make this more difficult, since the C2 domain changes frequently and enumerating and blocking all generated domains can be expensive.
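To make the mechanism concrete, here is a minimal sketch of a “classical” DGA in Python. It is a hypothetical illustration rather than any real family’s algorithm: both the malware and its operator can derive the same daily list of pseudorandom domains from the date, so the operator only needs to register one of them.

import hashlib
from datetime import date

def toy_dga(seed_date: date, count: int = 10, tld: str = ".com") -> list:
    """Hypothetical 'classical' DGA: derive pseudorandom domains from the date."""
    domains = []
    for i in range(count):
        digest = hashlib.md5(f"{seed_date.isoformat()}-{i}".encode()).hexdigest()
        domains.append(digest[:12] + tld)  # e.g. 'a3f91c0d82be.com'
    return domains

print(toy_dga(date.today()))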

I have recently been working on a research project related to DGA detection (hopefully it will turn into a blog post or a presentation somewhere), and it occurred to me that DGA detection is probably one of the most accessible areas for those getting into security data science, thanks to the availability of so much labelled data and so many open source implementations of DGA detection. One might argue that this means it is not an area worth researching due to saturation, but I think that depends on your situation/goals. This short post outlines some of the resources that I found useful for DGA research.

Data:

This section lists some domain lists and DGA generators that may be useful for creating “labelled” DGA domain lists.

DGA Data:

  • DGArchive: a large private collection of DGA-related data. It contains ~88 CSV files of DGA domains organized by malware family. DGArchive is password protected; if you want access, you need to reach out to the maintainer.
  • Bambenek Feeds (see “DGA Domain Feed”).
  • Netlab 360 DGA Feeds

DGA Generators:

Word-based / Dictionary-based DGA Resources:

Below are malware families that use word-based / dictionary DGAs, meaning their domains consist of two or more words selected from a list/dictionary and concatenated together. I separate these out since they are different from most other “classical” DGAs.
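As a quick illustration of the difference, here is a minimal sketch of a dictionary DGA (the wordlist and seed are made up for this example):

import random

WORDLIST = ["cloud", "river", "stone", "market", "paper", "light"]  # hypothetical wordlist

def toy_dictionary_dga(seed: int, count: int = 5, tld: str = ".net") -> list:
    """Hypothetical dictionary DGA: concatenate two words drawn from a wordlist."""
    rng = random.Random(seed)
    return [rng.choice(WORDLIST) + rng.choice(WORDLIST) + tld for _ in range(count)]

print(toy_dictionary_dga(seed=20200322))  # e.g. ['riverlight.net', 'stonepaper.net', ...]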

“Benign” / Non-DGA Data:

This section lists some domain lists that may be useful for creating “labelled” benign domain lists. Several academic papers use one or more of these sources, but they generally create derivatives that represent the stable N-day top X sites (e.g. stable Alexa 30-day top 500k, meaning domains from the Alexa top 500k that have been on the list for the last 30 consecutive days; since Amazon only provides today’s snapshot, the Alexa data needs to be downloaded each day for 30+ days to build this). This filters out domains that become popular for a short amount of time but then drop off, as sometimes happens with malicious domains.
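Here is a rough sketch of how such a stable list could be computed, assuming you have already collected one top-sites CSV snapshot per day (the file naming and the rank,domain CSV layout are assumptions for illustration):

import csv
import glob

def load_top_domains(path, limit=500_000):
    """Load the domain column from one daily top-sites CSV snapshot (rank,domain)."""
    with open(path, newline="") as f:
        return {row[1] for _, row in zip(range(limit), csv.reader(f))}

# one snapshot file per day, e.g. snapshots/top-1m-2020-03-01.csv ... (assumed naming)
snapshots = sorted(glob.glob("snapshots/top-1m-*.csv"))[-30:]

# "Stable Alexa 30-day top 500k": domains present in every one of the last 30 snapshots
stable = set.intersection(*(load_top_domains(p) for p in snapshots))
print(len(stable), "domains were in the top 500k on all 30 days")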

Update (2020-03-22) - More Heuristics for Benign training set curation:

Excerpt from Inline Detection of DGA Domains Using Side Information (page 12)

The benign samples are collected based on a predefined set of heuristics as listed below:

  • Domain name should have valid DNS characters only (digits, letters, dot and hyphen)
  • Domain has to be resolved at least once for every day between June 01, 2019 and July 31, 2019.
  • Domain name should have a valid public suffix
  • Characters in the domain name are not all digits (after removing ‘.’ and ‘-‘)
  • Domain should have at most four labels (Labels are sequence of characters separated by a dot)
  • Length of the domain name is at most 255 characters
  • Longest label is between 7 and 64 characters
  • Longest label is more than twice the length of the TLD
  • Longest label is more than 70% of the combined length of all labels
  • Excludes IDN (Internationalized Domain Name) domains (such as domains starting with xn--)
  • Domain must not exist in DGArchive
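Below is a rough sketch of a filter implementing several of these heuristics in Python. It uses tldextract (covered below) for the public-suffix check, treats the public suffix as the “TLD”, and omits the resolution and DGArchive checks since those require external data:

import re
import tldextract

VALID_CHARS = re.compile(r"^[a-z0-9.-]+$")

def passes_benign_heuristics(domain: str) -> bool:
    """Check a subset of the heuristics listed above (resolution/DGArchive omitted)."""
    domain = domain.lower().rstrip(".")
    labels = domain.split(".")
    ext = tldextract.extract(domain)

    if not VALID_CHARS.match(domain):        # valid DNS characters only
        return False
    if not ext.suffix:                       # must have a valid public suffix
        return False
    if domain.replace(".", "").replace("-", "").isdigit():  # not all digits
        return False
    if len(labels) > 4:                      # at most four labels
        return False
    if len(domain) > 255:                    # length at most 255 characters
        return False
    if any(label.startswith("xn--") for label in labels):   # exclude IDN domains
        return False

    longest = max(len(label) for label in labels)
    if not (7 <= longest <= 64):             # longest label between 7 and 64 chars
        return False
    if longest <= 2 * len(ext.suffix):       # longest label > 2x the TLD length
        return False
    if longest <= 0.7 * sum(len(label) for label in labels):  # > 70% of combined length
        return False
    return True

print(passes_benign_heuristics("independent.co.uk"))  # True
print(passes_benign_heuristics("1234.4321.com"))      # False (longest label too short)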

Utilities:

Domain Parser:

When parsing the various domain list data, tldextract is very helpful for stripping off TLDs or subdomains if desired. I have seen several projects attempt to parse domains using “split(‘.’)” or “domain[:-3]”. This does not work very well since TLDs can contain multiple “.”s (e.g. .co.uk).

Installation:

pip install tldextract

Example:

In [1]: import tldextract
In [2]: e = tldextract.extract('abc.www.google.co.uk')

In [3]: e
Out[3]: ExtractResult(subdomain='abc.www', domain='google', suffix='co.uk')

In [4]: e.domain
Out[4]: 'google'

In [5]: e.subdomain
Out[5]: 'abc.www'

In [6]: e.registered_domain
Out[6]: 'google.co.uk'

In [7]: e.fqdn
Out[7]: 'abc.www.google.co.uk'

In [8]: e.suffix
Out[8]: 'co.uk'

Domain Resolution:

During the course of your research you may need to perform DNS resolutions on lots of DGA domains. If you do this, I highly recommend setting up your own bind9 server on Digital Ocean or Amazon and using adnshost (a utility from adns). If you perform the DNS resolutions from your home or office, your ISP may interfere with the DNS responses because the lookups will appear malicious, which can bias your research. If you use a provider’s recursive nameservers, you may violate the acceptable use policy (AUP) due to the volume, and the provider may also interfere with the responses.

Adnshost enables high throughput / bulk DNS queries to be performed asynchronously. It will be much faster than performing the DNS queries synchronously (one after the other).

Here is an example of using adnshost (assuming you are running it from the Bind9 server you setup):

cat huge-domains-list.txt | adnshost \
    --asynch \
    --config "nameserver 127.0.0.1" \
    --type a \
    --pipe \
    --addr-ipv4-only > results.txt

This article should get you most of the way there with setting up the bind9 server.

Models:

This section provides links to a few models that could be used as baselines for comparison.

Research:


I hope this is helpful. As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost