This post explores Heterogeneous Information Networks (HIN) and applications to Cyber security.

Over the past few months I have been researching Heterogeneous Information Networks (HINs) and cyber security use cases. I first encountered HINs after discovering the paper "Gotcha-Sly Malware!: Scorpion A Metagraph2vec Based Malware Detection System" through a Google Scholar alert I had set up for "Guilt by Association: Large Scale Malware Detection by Mining File-relation Graphs". If you're interested in how I set up my Google Scholar alerts to stay abreast of the latest security data science research, see this: Security Data Science Learning Resources.

Heterogeneous Information Networks are a relatively simple way of modelling one or more datasets as a graph consisting of nodes and edges where 1) all nodes and edges have defined types, and 2) there is more than one node type or more than one edge type (hence "Heterogeneous"). The set of node and edge types represents the schema of the network. This differs from homogeneous networks, where all nodes and edges are the same type (e.g. the Facebook social network graph, the World Wide Web, etc.). HINs provide a very rich abstraction for modelling complex datasets.

Below, I will walk through important HIN concepts using the HinDom paper as an example. HinDom uses DNS relationship data from passive DNS, DNS query logs, and DNS response logs to build a malicious domain classifier using a HIN. They use the Alexa Top 1K list, DGArchive, Google Safe Browsing, and VirusTotal for deriving labels. Below is an example HIN schema taken from this paper.

HinDom Schema

This schema represents three combined datasets (Passive DNS, DNS query logs, DNS response logs) and it models three node types (Client, Domain, and IP Address) and six edge types (segment, query, CNAME, similar, resolve, and same-domain). Here is an expanded example and descriptions of the relationships:

HinDom Example

  • Client-query-Domain - matrix Q denotes that domain i is queried by client j.
  • Client-segment-Client - matrix N denotes that client i and client j belong to the same network segment.
  • Domain-resolve-IP - matrix R denotes that domain i is resolved to IP address j.
  • Domain-similar-Domain - matrix S denotes the character-level similarity between domain i and j.
  • Domain-cname-Domain - matrix C denotes that domain i and domain j are in a CNAME record.
  • IP-domain-IP - matrix D denotes that IP address i and IP address j were once mapped to the same domain.
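To make the typed-graph idea concrete, here is a minimal, dependency-free Python sketch of a HIN with the HinDom schema. The node names are hypothetical placeholders, and plain dicts stand in for whatever graph library you would actually use:

```python
# Toy sketch of a heterogeneous graph with HinDom-style node and edge types.
# Node names are made up for illustration; plain dicts keep it dependency-free.

nodes = {
    "client1": "Client",
    "client2": "Client",
    "example.com": "Domain",
    "evil.example": "Domain",
    "203.0.113.7": "IP",
}

edges = [
    ("client1", "client2", "segment"),          # same network segment (matrix N)
    ("client1", "example.com", "query"),        # client queried domain (matrix Q)
    ("client2", "evil.example", "query"),
    ("example.com", "203.0.113.7", "resolve"),  # domain resolved to IP (matrix R)
    ("evil.example", "203.0.113.7", "resolve"),
]

# The schema of the HIN is simply the set of observed node and edge types.
node_types = set(nodes.values())
edge_types = {t for _, _, t in edges}
print(sorted(node_types))  # ['Client', 'Domain', 'IP']
print(sorted(edge_types))  # ['query', 'resolve', 'segment']
```

Since both type sets have more than one element, this toy network is heterogeneous by the definition above.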

Once the dataset is represented as a graph, feature vectors need to be extracted before machine learning models can be built. A common technique for featurizing a HIN is to define meta-paths or meta-graphs against the graph and then perform guided random walks along those meta-paths/graphs. A meta-path represents a graph traversal through a specific sequence of node and edge types. Meta-path selection is akin to feature engineering in classical machine learning, as it is very important to select meta-paths that provide useful signal for whatever variable is being predicted. As seen in many HIN papers, meta-paths/graphs are often evaluated individually or in combination to determine their influence on model performance. Guided random walks against meta-paths produce sequences of nodes (similar to sentences of words), which can then be fed into models like Skipgram or Continuous Bag-of-Words (CBOW) to create embeddings. Once the nodes are represented as embeddings, many different models (SVM, DNN, etc.) can be used to solve many different types of problems (similarity search, classification, clustering, recommendation, etc.). Below are the meta-paths used in the HinDom paper.

HinDom Meta-paths
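To illustrate how a guided random walk follows a meta-path, here is a small Python sketch over a toy typed graph. The graph, node names, and meta-path are assumptions for illustration, not HinDom's actual data:

```python
import random

# Toy adjacency structure: node -> edge type -> list of neighbors.
# Only "resolve" edges exist here, connecting domains and IPs.
adj = {
    "a.com": {"resolve": ["1.1.1.1"]},
    "b.com": {"resolve": ["1.1.1.1", "2.2.2.2"]},
    "c.com": {"resolve": ["2.2.2.2"]},
    "1.1.1.1": {"resolve": ["a.com", "b.com"]},
    "2.2.2.2": {"resolve": ["b.com", "c.com"]},
}

def guided_walk(start, meta_path, length, rng=random.Random(0)):
    """Walk from `start`, repeatedly following the edge types of `meta_path`.

    Returns a node sequence (a 'sentence') suitable for skip-gram training.
    """
    walk = [start]
    step = 0
    while len(walk) < length:
        edge_type = meta_path[step % len(meta_path)]
        neighbors = adj.get(walk[-1], {}).get(edge_type, [])
        if not neighbors:
            break  # dead end: no edge of the required type
        walk.append(rng.choice(neighbors))
        step += 1
    return walk

# Meta-path Domain -resolve-> IP -resolve-> Domain, repeated.
sentence = guided_walk("a.com", ["resolve", "resolve"], length=7)
print(sentence)
```

Many such sentences, started from many nodes, would then be fed to a skip-gram/CBOW model to learn node embeddings.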

Below is the HinDom Architecture to illustrate how all these concepts come together.

HinDom Architecture

Below are some resources that I found useful for learning more about Heterogeneous Information Networks as well as several security related papers that used HIN.


HIN Papers:

Malware Detection / Code Analysis:

Mining the Darkweb / Fraud Detection / Social Network Analysis:



Prominent Security Researchers using HIN:

As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!


This post outlines some experiments I ran using Auxiliary Loss Optimization for Hypothesis Augmentation (ALOHA) for DGA domain detection.

(Update 2019-07-18) After getting feedback from one of the ALOHA paper authors, I modified my code to set loss weights for the auxiliary targets as they did in their paper (weights used: main target 1.0, auxiliary targets 0.1). I also added 3 word-based/dictionary DGAs. All diagrams and metrics have been updated to reflect this.

I recently read the paper ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation by Ethan M. Rudd, Felipe N. Ducau, Cody Wild, Konstantin Berlin, and Richard Harang of Sophos. The research will be presented at USENIX Security 2019 in August 2019. The paper's finding is that by supplying more prediction targets to their model at training time, they can improve the prediction performance of the primary target. More specifically, they modify a deep learning based model for detecting malware (a binary classifier) to also predict things like individual vendor predictions, malware tags, and the number of VT detections. Their "auxiliary loss architecture yields a significant reduction in detection error rate (false negatives) of 42.6% at a false positive rate (FPR) of 10^−3 when compared to a similar model with only one target, and a decrease of 53.8% at 10^−5 FPR."

Aloha Model Architecture

Figure 1 from the paper

A schematic overview of our neural network architecture. Multiple output layers with corresponding loss functions are optionally connected to a common base topology which consists of five dense blocks. Each block is composed of a Dropout, dense and batch normalization layers followed by an exponential linear unit (ELU) activation of sizes 1024, 768, 512, 512, and 512. This base, connected to our main malicious/benign output (solid line in the figure) with a loss on the aggregate label constitutes our baseline architecture. Auxiliary outputs and their respective losses are represented in dashed lines. The auxiliary losses fall into three types: count loss, multi-label vendor loss, and multi-label attribute tag loss

This paper made me wonder how well this technique would work for other areas in network security such as:

  • Detecting malicious URLs from Exploit Kits - possible auxiliary labels: Exploit Kit names, Web Proxy Categories, etc.
  • Detecting malicious C2 domains - possible auxiliary labels: malware family names, DGA or not, proxy categories.
  • Detecting DGA Domains - possible auxiliary labels: malware families, DGA type (wordlist, hex based, alphanumeric, etc).

I decided to explore the last use case: how well auxiliary loss optimizations would improve DGA domain detection. For this work I identified four DGA models and used these as baselines. Then I ran some experiments. All code from these experiments is hosted here. This code is based heavily on Endgame's dga_predict, but with many modifications.


For this work, I used the same data sources selected by Endgame’s dga_predict (but I added 3 additional DGAs: gozi, matsnu, and suppobox).

  • Alexa top 1m domains
  • classical DGA domains for the following malware families: banjori, corebot, cryptolocker, dircrypt, kraken, lockyv2, pykspa, qakbot, ramdo, ramnit, and simda.
  • Word-based/dictionary DGA domains for the following malware families - gozi, matsnu, and suppobox

Baseline Models:

I used 4 baseline binary models + 4 extensions of these models that use Auxiliary Loss Optimization for Hypothesis Augmentation.

Baseline Models:

  • Bigram
  • CNN
  • LSTM
  • CNN+LSTM

ALOHA Extended Models (each simply uses the 11 malware families as additional binary labels):

  • ALOHA Bigram
  • ALOHA CNN
  • ALOHA LSTM
  • ALOHA CNN+LSTM

I trained each of these models using the default settings as provided by dga_predict (except, I added stratified sampling based on the full labels: benign + malware families):

  • training splits: 76% training, 4% validation, 20% testing
  • all models were trained with a batch size of 128
  • The CNN, LSTM, and CNN+LSTM models used up to 25 epochs, while the bigram models used up to 50 epochs.
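To clarify what the auxiliary loss weighting does, below is a dependency-free Python sketch of the weighted multi-task loss: binary cross-entropy on the main malicious/benign target plus down-weighted binary cross-entropy on each family target, using the 1.0/0.1 weights mentioned above. This is my own illustration, not the authors' code; in the actual models, Keras' `loss_weights` argument does the equivalent during training.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one (label, predicted probability) pair."""
    p = min(max(y_pred, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def aloha_loss(main, aux, main_weight=1.0, aux_weight=0.1):
    """main: (y_true, y_pred) for the malicious/benign target.
    aux: list of (y_true, y_pred) pairs, one per malware-family target."""
    total = main_weight * bce(*main)
    total += aux_weight * sum(bce(yt, yp) for yt, yp in aux)
    return total

# A DGA domain scored 0.9 malicious, with 11 family targets:
# one true family scored 0.8, ten others scored near zero.
main = (1, 0.9)
aux = [(1, 0.8)] + [(0, 0.05)] * 10
print(round(aloha_loss(main, aux), 4))
```

The small auxiliary weight keeps the family targets from dominating the gradient while still forcing the shared layers to learn family-discriminative features.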

Below shows counts of how many of each DGA family were used and how many Alexa top 1m domains were included (denoted as “benign”).

In [1]: import pickle

In [2]: from collections import Counter

In [3]: data = pickle.loads(open('traindata.pkl', 'rb').read())

In [4]: Counter([d[0] for d in data]).most_common(100)
[('benign', 139935),
 ('qakbot', 10000),
 ('dircrypt', 10000),
 ('pykspa', 10000),
 ('corebot', 10000),
 ('kraken', 10000),
 ('suppobox', 10000),
 ('gozi', 10000),
 ('ramnit', 10000),
 ('matsnu', 10000),
 ('locky', 9999),
 ('banjori', 9984),
 ('simda', 9984),
 ('ramdo', 9984),
 ('cryptolocker', 9984)]


Model AUC scores (sorted by AUC):

  • aloha_bigram 0.9435
  • bigram 0.9444
  • cnn 0.9817
  • aloha_cnn 0.9820
  • lstm 0.9944
  • aloha_cnn_lstm 0.9947
  • aloha_lstm 0.9950
  • cnn_lstm 0.9957

Overall, by AUC, the ALOHA technique only seemed to improve the LSTM and CNN models, and only marginally. The ROC curves show reductions in error rates at very low false positive rates (between 10^-5 and 10^-3), similar to the gains seen in the ALOHA paper, though the paper's gains appeared much larger.

ROC: All Models Linear Scale

ROC: All Models Log Scale

ROC: Bigram Models Log Scale

ROC: CNN Models Log Scale

ROC: CNN+LSTM Models Log Scale

ROC: LSTM Models Log Scale


Below is a heatmap showing the percentage of detections across all the malware families for each model. Low numbers are good for the benign label (top row), high numbers are good for all the others.

Note the last 3 rows are all word-based/dictionary DGAs. It is interesting, although not too surprising that the models that include LSTMs tended to do better against these DGAs.

I annotated with green boxes places where the ALOHA models did better. This seems to be most apparent with the models that include LSTMs and for the word-based/dictionary DGAs.

Future Work:

These are some areas of future work I hope to have time to try out.

  • Add more DGA generators to the project, esp word-based / dictionary DGAs and see how the models react. I have identified several (see “Word-based / Dictionary-based DGA Resources” from here for more info).
  • try incorporating other auxiliary targets like:
    • Type of DGA (hex based, alphanumeric, custom alphabet, dictionary/word-based, etc)
    • Classical DGA domain features like string entropy, count of longest consecutive consonant string, count of longest consecutive vowel string, etc. I am curious if forcing the NN to learn these would improve its primary scoring mechanism.
    • Metadata from VT domain report.
    • Summary / stats from Passive DNS (PDNS).
    • Features from various aspects of the domain’s whois record.
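For reference, the classical features mentioned above (string entropy, longest consecutive consonant run, longest consecutive vowel run) are simple to compute. Here is a sketch; the example domain label is made up:

```python
import math
from itertools import groupby

VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

def entropy(s):
    """Shannon entropy of the character distribution of s, in bits."""
    counts = {c: s.count(c) for c in set(s)}
    return -sum(n / len(s) * math.log2(n / len(s)) for n in counts.values())

def longest_run(s, charset):
    """Length of the longest consecutive run of characters from charset."""
    runs = [len(list(g)) for k, g in groupby(s, key=lambda c: c in charset) if k]
    return max(runs, default=0)

domain = "qakbotxkcdq"  # hypothetical DGA-looking label (TLD stripped)
print(round(entropy(domain), 3))
print(longest_run(domain, CONSONANTS))
print(longest_run(domain, VOWELS))
```

Supplying these as auxiliary regression targets would test whether forcing the network to learn them improves the primary score.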

If you enjoyed this post, you may be interested in my other recent post on Getting Started with DGA Domain Detection Research. Also, please see more Security Data Science blog posts at my personal blog.

As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!


This post provides resources for getting started with research on Domain Generation Algorithm (DGA) Domain Detection.

DGA domains are commonly used by malware as a mechanism to maintain command and control (C2) infrastructure and make it more difficult for defenders to block. Prior to DGA domains, most malware used a small hardcoded list of IPs or domains. Once these IPs/domains were discovered, they could be blocked by defenders or taken down for abuse. DGA domains make this more difficult since the C2 domain changes frequently, and enumerating and blocking all generated domains can be expensive.
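To illustrate the mechanism, here is a toy seed-plus-date DGA in Python. Both the malware and its operator can derive the same candidate domains deterministically, so no domain list ever needs to be hardcoded. This scheme is invented for illustration and does not correspond to any real malware family:

```python
import hashlib
from datetime import date

def dga(seed: str, day: date, count: int = 3):
    """Derive `count` candidate C2 domains from a shared seed and the date."""
    domains = []
    for i in range(count):
        data = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(data).hexdigest()
        # Map the first 12 hex characters to lowercase letters.
        label = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:12])
        domains.append(label + ".example")
    return domains

print(dga("toy-seed", date(2019, 7, 18)))
```

Because the output changes every day, defenders must either reverse the algorithm and pre-register/block the candidates or detect the generated-looking domains statistically, which is what the models below attempt.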

Recently, I have been working on a research project related to DGA detection (hopefully it will turn into a blog post or a presentation somewhere), and it occurred to me that DGA detection is probably one of the most accessible areas for those getting into security data science, due to the availability of so much labelled data and so many open source implementations of DGA detection. One might argue that this means it is not an area worth researching due to saturation, but I think that depends on your situation/goals. This short post outlines some of the resources that I found useful for DGA research.


This section lists some domain lists and DGA generators that may be useful for creating “labelled” DGA domain lists.

DGA Data:

  • DGArchive - a large private collection of DGA-related data. It contains ~88 CSV files of DGA domains organized by malware family. DGArchive is password protected; if you want access you need to reach out to the maintainer.
  • Bambenek Feeds (see “DGA Domain Feed”).
  • Netlab 360 DGA Feeds

DGA Generators:

Word-based / Dictionary-based DGA Resources:

Below are all malware families that use word-based / dictionary DGAs, meaning their domains consist of 2 or more words selected from a list/dictionary and concatenated together. I separate these out since they are different from most other "classical" DGAs.

“Benign” / Non-DGA Data:

This section lists some domain lists that may be useful for creating "labelled" benign domain lists. Several academic papers use one or more of these sources, but they generally create derivatives that represent the Stable N-day Top X Sites (e.g. Stable Alexa 30-day top 500k, meaning domains from the Alexa top 500k that have been on the list consecutively for the last 30 days straight; the Alexa data needs to be downloaded each day for 30+ days to create this, since Amazon only provides the current day's snapshot). This filters out domains that become popular for a short amount of time but then drop off, as sometimes happens with malicious domains.
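The stable-list derivation is just an intersection across daily snapshots: a domain survives only if it appears in the top-X list on every one of the last N days. Here is a sketch with toy in-memory snapshots standing in for the daily downloads:

```python
# Toy daily top-X snapshots; in practice each list is that day's downloaded CSV.
daily_snapshots = [
    ["google.com", "youtube.com", "spike-domain.example", "facebook.com"],
    ["google.com", "youtube.com", "facebook.com", "other.example"],
    ["google.com", "facebook.com", "youtube.com", "spike-domain.example"],
]

def stable_top(snapshots):
    """Domains present in every snapshot (i.e. every day of the window)."""
    stable = set(snapshots[0])
    for day in snapshots[1:]:
        stable &= set(day)
    return stable

print(sorted(stable_top(daily_snapshots)))
# spike-domain.example drops out: it was missing on day 2
```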

Update (2020-03-22) - More Heuristics for Benign training set curation:

Excerpt from Inline Detection of DGA Domains Using Side Information (page 12)

The benign samples are collected based on a predefined set of heuristics as listed below:

  • Domain name should have valid DNS characters only (digits, letters, dot and hyphen)
  • Domain has to be resolved at least once for every day between June 01, 2019 and July 31, 2019.
  • Domain name should have a valid public suffix
  • Characters in the domain name are not all digits (after removing ‘.’ and ‘-‘)
  • Domain should have at most four labels (Labels are sequence of characters separated by a dot)
  • Length of the domain name is at most 255 characters
  • Longest label is between 7 and 64 characters
  • Longest label is more than twice the length of the TLD
  • Longest label is more than 70% of the combined length of all labels
  • Excludes IDN (Internationalized Domain Name) domains (such as domains starting with xn--)
  • Domain must not exist in DGArchive
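A few of these heuristics are easy to implement directly. Here is a sketch covering the character-set, all-digits, label-count, label-length, and IDN checks; the thresholds mirror the excerpt above, and the checks that require external data (resolution history, public suffix validation, DGArchive lookups) are omitted:

```python
import re

# Valid DNS characters only: digits, letters, dot, and hyphen.
VALID = re.compile(r"^[a-z0-9.-]+$", re.IGNORECASE)

def passes_heuristics(domain: str) -> bool:
    """Apply a subset of the benign-set heuristics from the excerpt above."""
    if len(domain) > 255 or not VALID.match(domain):
        return False
    if domain.replace(".", "").replace("-", "").isdigit():
        return False  # characters are all digits
    labels = domain.split(".")
    if len(labels) > 4:
        return False  # at most four labels
    longest = max(len(label) for label in labels)
    if not (7 <= longest <= 64):
        return False  # longest label between 7 and 64 characters
    if labels[0].startswith("xn--"):
        return False  # exclude IDN domains
    return True

print(passes_heuristics("mail.example.com"))  # True: longest label is 7 chars
print(passes_heuristics("1.2.3.4"))           # False: all digits
print(passes_heuristics("a.b.c.d.e"))         # False: five labels
```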


Domain Parser:

When parsing the various domain list data, tldextract is very helpful for stripping off TLDs or subdomains if desired. I have seen several projects attempt to parse domains using "split('.')" or "domain[:-3]". This does not work very well since a domain's TLD can contain multiple "."s (e.g. .co.uk).


pip install tldextract


In [1]: import tldextract

In [2]: e = tldextract.extract('http://abc.www.google.co.uk')

In [3]: e
Out[3]: ExtractResult(subdomain='abc.www', domain='google', suffix='co.uk')

In [4]: e.domain
Out[4]: 'google'

In [5]: e.subdomain
Out[5]: 'abc.www'

In [6]: e.registered_domain
Out[6]: 'google.co.uk'

In [7]: e.fqdn
Out[7]: 'abc.www.google.co.uk'

In [8]: e.suffix
Out[8]: 'co.uk'

Domain Resolution:

During the course of your research you may need to perform DNS resolutions on lots of DGA domains. If you do this, I highly recommend setting up your own bind9 server on Digital Ocean or Amazon and using adnshost (a utility from adns). If you perform the DNS resolutions from your home or office, your ISP may interfere with the DNS responses because they will appear malicious, which can bias your research. If you use a provider’s recursive nameservers, you may violate the acceptable use policy (AUP) due to the volume AND the provider may also interfere with the responses.

Adnshost enables high throughput / bulk DNS queries to be performed asynchronously. It will be much faster than performing the DNS queries synchronously (one after the other).

Here is an example of using adnshost (assuming you are running it from the Bind9 server you setup):

cat huge-domains-list.txt | adnshost \
    --asynch \
    --config "nameserver 127.0.0.1" \
    --type a \
    --pipe \
    --addr-ipv4-only > results.txt

This article should get you most of the way there with setting up the bind9 server.


This section provides links to a few models that could be used as baselines for comparison.


I hope this is helpful. As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!


This short post catalogs some resources that may be useful for those interested in security data science. It is not meant to be an exhaustive list. It is meant to be a curated list to help you get started.

Staying Current with Security Data Science

Here is my current strategy for staying current with security data science research. It leans heavier towards academic research since this is what interests me at the moment.

  1. Google Scholar Publication alerts on known respected researchers.
  2. Google Scholar Citation alerts on interesting or noteworthy papers.
  3. Follow security ML researchers on Twitter and Medium. They frequently share interesting and cutting edge research papers / videos / blogs.
  4. Periodically review proceedings from noteworthy security conferences.
  5. Skim published security conference videos from Irongeek looking for topics of interest.

Google Scholar alerts

Citation Alerts on these papers:

  • “Acing the IOC game: Toward automatic discovery and analysis of open-source cyber threat intelligence”
  • “AI^2: Training a big data machine to defend”
  • “APT Infection Discovery using DNS Data”
  • “Beehive: Large-scale log analysis for detecting suspicious activity in enterprise networks”
  • “Deep neural network based malware detection using two dimensional binary program features”
  • “Detecting malicious domains via graph inference”
  • “Detecting malware based on DNS graph mining”
  • “Detecting structurally anomalous logins in Enterprise Networks”
  • “Discovering malicious domains through passive DNS data graph analysis”
  • “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models”
  • “Enabling network security through active DNS datasets”
  • “Feature-based transfer learning for network security”
  • “Gotcha-Sly Malware!: Scorpion A Metagraph2vec Based Malware Detection System”
  • “Guilt by association: large scale malware detection by mining file-relation graphs”
  • “Identifying suspicious activities through dns failure graph analysis”
  • “Polonium: Tera-scale graph mining and inference for malware detection”
  • “Segugio: Efficient behavior-based tracking of malware-control domains in large ISP networks”

New article alerts on these authors with the bolded being the most relevant / interesting to me.

  • Alina Oprea - heavily focused on operational security ML.
  • Josh Saxe, Rich Harang, and Konstantin Berlin - heavily focused on Malware detection/analytics using ML. Also a published book author.
  • Manos Antonakakis and Roberto Perdisci - heavily focused on network security analytics using ML with a specialty in DNS traffic.
  • Balduzzi Marco
  • Battista Biggio
  • Chaz Lever
  • Christopher Kruegel
  • Damon McCoy
  • David Dagon
  • David Freeman
  • Gianluca Stringhini
  • Giovanni Vigna
  • Guofei Gu
  • Han Yufei
  • Hossein Siadati
  • Issa Khalil
  • Jason (Iasonas) Polakis
  • Michael Donald Bailey
  • Michael Iannacone
  • Nick Feamster
  • Niels Provos
  • Nir Nissim
  • Patrick McDaniel
  • Stefan Savage
  • Steven Noel
  • Terry Nelms
  • Ting-Fang Yen
  • Vern Paxson
  • Wenke Lee
  • Yacin Nadji
  • Yanfang (Fanny) Ye
  • Yizheng Chen
  • Yuval Elovici


Twitter can be a gold mine for new and relevant ideas, blogs, presentations, etc for security data science. You just need to make sure you continually follow the right folks. Here is a short list of thought leaders in this space (if I left you off it is my oversight so please don’t take offense).

For a more exhaustive list of others I would recommend following on Twitter, see this gist. This list is focused on Threat Intel, Threat Hunting, Detection Engineering, IR, and Security Engineering. It is not exhaustive, but is a good start.


Below are several interesting security conferences where research is published on security data science topics. It is a good idea to be on the look out for the proceedings from these events.

This page is also an excellent resource in general for top academic security conferences: Top Academic Security conferences list. The major industry focused security conferences like Blackhat, RSA, Defcon, BSides*, DerbyCon, and ShmooCon all frequently have talks relevant to security data science, but this is not their primary focus, so they are not explicitly called out above.

Learning Resources

These resources will help you build a baseline of knowledge in Cyber Security and Machine Learning.



Machine Learning / Data Science:


I hope this is helpful, and I would be interested to hear about other resources that you find useful. Please leave a message here, on Medium, or @ me on twitter!


A short listing of research papers I’ve read or plan to read that use passive DNS (PDNS) data and graph analytics for identifying malicious domains.

Host-Domain Graphs

Host domain graphs are bipartite graphs mapping hosts/IPs to domains that they either resolved (passive DNS) or visited (web proxy logs). These graphs are used heavily in operational security machine learning papers on network threat hunting as they provide insight into the behavioral patterns across an enterprise or ISP.
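As a quick illustration of the guilt-by-association intuition behind these papers, here is a toy sketch that builds a host-domain bipartite graph from passive DNS style (host, domain) pairs and projects it onto hosts: two hosts become linked when they resolved the same domain. The data is invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Toy passive DNS observations: (host, resolved domain) pairs.
resolutions = [
    ("host1", "benign.example"),
    ("host1", "bad-c2.example"),
    ("host2", "bad-c2.example"),
    ("host3", "benign.example"),
]

# One side of the bipartite graph: domain -> set of hosts that resolved it.
domain_to_hosts = defaultdict(set)
for host, domain in resolutions:
    domain_to_hosts[domain].add(host)

# Project onto hosts: two hosts are associated if they share a domain.
host_links = set()
for hosts in domain_to_hosts.values():
    for a, b in combinations(sorted(hosts), 2):
        host_links.add((a, b))

print(sorted(host_links))  # [('host1', 'host2'), ('host1', 'host3')]
```

Inference techniques in the papers below (e.g. belief propagation) then spread maliciousness scores across exactly this kind of structure.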

Detecting Malicious Domains via Graph Inference P. K. Manadhata, S. Yadav, P. Rao, and W. Horne. In Proceedings of 19th European Symposium on Research in Computer Security, Wroclaw, Poland, September 7-11, 2014.

Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data Alina Oprea, Zhou Li, Ting-Fang Yen, Sang H. Chin, and Sumyah Alrwais In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015.

Segugio: Efficient Behavior-Based Tracking of Malware-Control Domains in Large ISP Networks Babak Rahbarinia, Roberto Perdisci, and Manos Antonakakis. In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015.

Domain Resolution Graphs (Domain-IP Graphs)

A domain resolution graph is an undirected bipartite graph representing observed domain->IP DNS resolution from Passive DNS data.

Notos: Building a Dynamic Reputation System for DNS M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. In the Proceedings of the 19th USENIX Security Symposium, Washington, DC, USA, August 11-13, 2010.

EXPOSURE: Finding Malicious Domains using Passive DNS Analysis L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi. In Proceedings of the Network and Distributed System Security Symposium, San Diego, California, USA, February 2011.

Discovering Malicious Domains through Passive DNS Data Graph Analysis Issa Khalil, Ting Yu, and Bei Guan. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security (ASIA CCS ‘16), 2016.


The “short links” format was inspired by O’Reilly’s Four Short Links series.