covert.io

9 Short links on Network Beacon Detection

2022-01-16T00:00:00-00:00

In this post I share 9 links to resources related to Network Beacon detection.

Network beacons are continuous automated communications between two hosts. Network beacon detection focuses on identifying this automated traffic, with the primary goal of detecting malware infections or adversary activity that other controls have missed.

Beacon detection is a useful building-block analytic with many different use cases:

Threat Hunting and Malware command and control (C2) detection - aid in detecting malware missed by anti-virus products.
Detection of automated third party traffic - detection of ongoing automated traffic to third parties may reveal unknown or emerging business relationships.
Identify automated web application dependencies (within an enterprise or external to an enterprise)

Links:

Identifying beaconing malware using Elastic [code] by Apoorva Joshi, Thomas Veasey, and Craig Chamberlain - uses statistical techniques of coefficient of variation (COV), relative variance (RV), and autocorrelation; implemented as Elastic Painless scripts.
Enterprise Scale Threat Hunting: C2 Beacon Detection with Unsupervised ML and KQL — [Part 1] [Part 2] [code] by Mehmet Ergene
Detecting network beacons via KQL using simple spread stats functions by Alex Teixeira
Detect Network beaconing via Intra-Request time delta patterns in Azure Sentinel [code] by Ashwin Patil
RITA (Real Intelligence Threat Analytics) beacon analyzer - uses simple statistical approach based on 6 measures: connection time delta skew, connection dispersion, connection counts over time, data size skew, data size dispersion, and data size smallness score.
How to detect beaconing traffic with Splunk? by Alex Teixeira
Detect Beaconing with Flare, Elastic Stack, and Intrusion Detection Systems [code] by Austin Taylor
BAYWATCH: Robust Beaconing Detection to Identify Infected Hosts in Large-Scale Enterprise Networks - uses FFT and periodogram based technique for identifying automated traffic.
Malware Beaconing Detection by Mining Large-scale DNS Logs for Targeted Attack Identification
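
Several of the resources above score beaconing by how regular the gaps between connections are. The following is a minimal sketch of that idea using the coefficient of variation (COV) of inter-arrival times. The function name, the sample timestamps, and the idea of treating a low COV as "beacon-like" are illustrative assumptions, not code taken from any of the linked tools.

```python
from statistics import mean, pstdev


def interarrival_cov(timestamps):
    """Return the coefficient of variation of the gaps between connections.

    Values near 0 mean the gaps are almost identical (beacon-like);
    larger values mean irregular, bursty traffic.
    """
    ts = sorted(timestamps)
    deltas = [b - a for a, b in zip(ts, ts[1:])]
    if len(deltas) < 2:
        return None  # too few connections to judge regularity
    avg = mean(deltas)
    if avg == 0:
        return None  # all connections share one timestamp
    return pstdev(deltas) / avg


# Hypothetical connection times in seconds for two host pairs.
beacon_like = [0, 60, 121, 180, 241, 300]  # roughly every 60s with jitter
human_like = [0, 5, 7, 400, 410, 2000]     # irregular browsing-style gaps

beacon_score = interarrival_cov(beacon_like)
human_score = interarrival_cov(human_like)
```

In practice a score like this would probably be combined with other measures (data size dispersion, connection counts) as the tools above do, since a single statistic is easy to evade with jitter.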

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

9 Short links on Network Beacon Detection was originally published by Jason Trost at covert.io on January 16, 2022.

10 Short links on Cybersquatting domain detection

2022-01-08T00:00:00-00:00

In this short blog, I share 3 papers and 7 tools that focus on detecting cybersquatting domains (including typosquatting, homograph, combosquatting, etc.).

Tools for generating cybersquatting domains (for use in detection)

Lots of other tools/libraries now exist if you need an implementation in a different language. See these GitHub tags for many more tools: typosquatting, homoglyph, and homograph-attack.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

10 Short links on Cybersquatting domain detection was originally published by Jason Trost at covert.io on January 08, 2022.

Four Short Links on Malicious Lateral Movement Detection

2021-05-30T00:00:00-00:00

In this short blog, I share four papers that focus on detecting malicious lateral movement (a.k.a. pivoting, a.k.a. island hopping).

Papers:

Lastly, if you’re interested in discovering more interesting papers like these, use the method I outlined here.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

Four Short Links on Malicious Lateral Movement Detection was originally published by Jason Trost at covert.io on May 30, 2021.

Seven Short Links of Dictionary DGA Detection

2021-05-11T00:00:00-00:00

In this short blog, I share seven papers that focus on detecting Dictionary Domain Generation Algorithm (DGA) domains, a.k.a. word-based DGAs. Dictionary DGAs are algorithms seen in various malware families (suppobox, matsnu, gozi, rovnix, etc.) that are used to periodically generate a large number of domain names that use pseudo-randomly concatenated words from a dictionary. These domains may appear legitimate at first glance and are often able to evade blacklisting as well as traditional DGA detections based on entropy or counts of consonants vs. vowels. Below is a small sample of rovnix domains from Unit42’s blog post.

kingwhichtotallyadminis[.]biz
thareplunjudiciary[.]net
townsunalienable[.]net
taxeslawsmockhigh[.]net
transientperfidythe[.]biz
inhabitantslaindourmock[.]cn
thworldthesuffer[.]biz
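
To make the entropy point concrete, here is a small sketch of the kind of character-entropy feature that traditional DGA detectors rely on. The function names are illustrative assumptions. Because dictionary DGA labels are built from real words, their character distribution tends to look like ordinary English hostnames, which is why a feature like this struggles to separate them.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy (bits per character) of the string s."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def label_entropy(domain):
    """Entropy of the leftmost label of a (possibly defanged) domain."""
    label = domain.replace("[.]", ".").split(".")[0]
    return shannon_entropy(label)


# One of the rovnix samples listed above.
score = label_entropy("townsunalienable[.]net")
```

A label of 16 characters can score at most 4 bits per character, and this word-based label lands well below that ceiling.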

Papers:

In a previous post, I also shared details on several models that are capable of effectively detecting dictionary DGA domains as well. Please see Auxiliary Loss Optimization for Hypothesis Augmentation for DGA Domain Detection.

Lastly, if you’re interested in discovering more interesting papers like these, use the method I outlined here.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

Seven Short Links of Dictionary DGA Detection was originally published by Jason Trost at covert.io on May 11, 2021.

Eight Short Links of Recent Cyber Security Data Science Papers

2021-04-24T00:00:00-00:00

A short listing of cyber security data science research papers I’ve discovered recently. Each of them uses machine learning or enables ML (i.e. providing training data or enabling creation of training data) to solve various security use cases, and many provide open source code as well.

BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware. [data]. Other malware related training data can be found here.
Compromised or Attacker-Owned: A Large Scale Classification and Study of Hosting Domains of Malicious URLs. [code] referenced in paper, but not live as of 4/24/2021.
DeepHunter: A Graph Neural Network Based Approach for Robust Cyber Threat Hunting. This uses an open source EDR tool named BLUESPAWN that I had not heard of before.
DeepReflect: Discovering Malicious Functionality through Binary Reconstruction. [code]
Explanation-Guided Backdoor Poisoning Attacks Against Malware Classifiers. [code]
EXTRACTOR: Extracting Attack Behavior from Threat Reports. [code]
On Generating and Labeling Network Traffic with Realistic, Self-Propagating Malware.
Stratosphere: Finding Vulnerable Cloud Storage Buckets. [code]

If you’re interested in discovering more interesting papers like these, use the method I outlined here.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

Eight Short Links of Recent Cyber Security Data Science Papers was originally published by Jason Trost at covert.io on April 24, 2021.

All your SPF are belong to us: Exploring trust relationships through global scale SPF Mining

2020-07-06T00:00:00-00:00

In this post we explore a large collection of Sender Policy Framework (SPF) records to see what they might tell us about global email sending trust relationships and how they relate to email security providers. This is a fast follow-up to my previous post on Mining DNS MX Records for Fun and Profit.

Here is the methodology I devised for this (very similar to the previous post, but with new custom built tools):

Collect a large sample of SPF records via DNS TXT lookups of popular domain names (and recursively resolving SPF “include” domains).
Enrich SPF records with IP intelligence and useful metadata (including email security provider mappings)
Analyze the enriched results.

Intro to Sender Policy Framework (SPF)

The Sender Policy Framework (SPF) enables domain name administrators to authorize hosts to use their domain names when sending email (i.e. in the “MAIL FROM” or “HELO” identities in SMTP). One of the goals of SPF is to limit spammers’ ability to spoof email messages. SPF is limited and is usually used with DKIM and DMARC. SPF records are published using DNS TXT records. SPF-compliant mail receivers use the published SPF records to test the authorization of sending Mail Transfer Agents (MTAs). SPF can be used to build complex policies around who can send email on whose behalf. Below is an example SPF record for Florida State University.

According to this SPF record 146.201.58.212, 146.201.58.213, 146.201.107.145, 146.201.107.249, 192.12.121.23, and 199.188.157.80 are allowed to send email purporting to be from fsu.edu. Also, the SPF records from spf.protection.outlook.com, _spf.qualtrics.com, spf.blackboardconnect.com, servers.mcsv.net, and _spf.mlsend.com should be retrieved and their policies applied as well. Below are the SPF records for each of these domains. As you can see they include more and more IPs/CIDRs as well as additional SPF includes.

As you can see, SPF forms a chain of trust between the domain owner and all the SPF policies included recursively (potentially crossing several different administrative boundaries). In this post I was hoping to explore this chain of trust at a large scale by collecting a large sample of SPF records and mining them.
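
For readers who have not looked at raw SPF before, the sketch below splits a record into its `ip4`/`ip6` networks and `include` domains. The record shown is a made-up example using documentation address ranges, not the fsu.edu record, and the parser is a simplified assumption that ignores other mechanisms such as `a`, `mx`, and `redirect`.

```python
def parse_spf(txt):
    """Split an SPF TXT record into networks, includes, and the 'all' qualifier.

    Returns None if the TXT record is not an SPF record.
    """
    terms = txt.split()
    if not terms or terms[0].lower() != "v=spf1":
        return None
    result = {"ip4": [], "ip6": [], "include": [], "all": None}
    for term in terms[1:]:
        qualifier = "+"  # SPF default qualifier is "pass"
        if term[0] in "+-~?":
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name in ("ip4", "ip6", "include") and value:
            result[name].append(value)
        elif name == "all":
            result["all"] = qualifier
    return result


# Hypothetical record for illustration only.
record = "v=spf1 ip4:192.0.2.10 ip4:198.51.100.0/24 include:_spf.example.net ~all"
parsed = parse_spf(record)
```

Each `include` value is another domain whose own SPF record has to be fetched and applied, which is what creates the chain of trust described above.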

Below are some useful resources for understanding SPF:

RFC7208: Sender Policy Framework (SPF) for Authorizing Use of Domains in Email
SPF Syntax Table - really useful guide for understanding SPF “mechanisms”.

Step One: Collection

For step one, I built a very ~~crude~~ useful SPF crawler that uses dig (optionally adnshost) to perform DNS TXT requests, parse out SPF records found, and then recursively follow the trail of SPF include records and perform TXT lookups against the included domains.
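
The recursive part of such a crawler can be sketched roughly as follows. This is not the crawler's actual code: `crawl_spf`, the `lookup_txt` callable, and the in-memory `FAKE_DNS` table are hypothetical stand-ins so the example runs without network access. A real version would wrap dig or a DNS library behind `lookup_txt`.

```python
def crawl_spf(domain, lookup_txt, seen=None):
    """Collect every ip4/ip6 network reachable from a domain's SPF record.

    lookup_txt(domain) must return a list of TXT record strings.
    Returns (networks, visited_domains).
    """
    seen = set() if seen is None else seen
    networks = set()
    if domain in seen:
        return networks, seen  # guard against include loops
    seen.add(domain)
    for txt in lookup_txt(domain):
        terms = txt.split()
        if not terms or terms[0].lower() != "v=spf1":
            continue  # skip non-SPF TXT records
        for term in terms[1:]:
            name, _, value = term.lstrip("+-~?").partition(":")
            name = name.lower()
            if name in ("ip4", "ip6") and value:
                networks.add(value)
            elif name == "include" and value:
                child, _ = crawl_spf(value.lower(), lookup_txt, seen)
                networks |= child
    return networks, seen


# Hypothetical TXT data, including a deliberate include loop.
FAKE_DNS = {
    "example.com": ["v=spf1 ip4:192.0.2.10 include:_spf.example.net -all"],
    "_spf.example.net": ["v=spf1 ip4:198.51.100.0/24 include:example.com ~all"],
}

networks, visited = crawl_spf("example.com", lambda d: FAKE_DNS.get(d, []))
```

Tracking visited domains matters at this scale, since include loops and heavily shared includes would otherwise cause repeated or endless lookups.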

In order to seed the SPF crawler, I used the same domains I used in my previous blog post on mining MX records. I downloaded the Alexa top 1M domains, Quantcast top 1m domains (from WaybackMachine), Domcop Top 10m domains, Majestic Million Domains and Cisco Umbrella top 1m domains. I identified the registered domain using tldextract for each of these and then combined them into a single de-duplicated list. This resulted in ~8.3M unique domain names.

These domains were fed into my SPF crawler and then the results were collected, parsed, and assembled. I ended up backing the SPF crawler with “dig” instead of “adnshost” this time since I found dig was more reliable, completing 23% more DNS requests in an experiment against the Fortune 1000 domains. Dig is single threaded, but I easily parallelized it using split files and xargs, and its performance ended up being good enough. See parallel_dig.sh for more details.

Below are a few simple commands as well as example output data collected with my SPF crawler applied to just one domain. As you can see, the assembled output for fsu.edu includes all the IPs and Netblocks from all the SPF includes that it links to, recursively.

Below is the same information, visualized as a network (and enriched with ASN info from Maxmind).

Step Two: Enrichment

For this step, I reused a lot of the code from my previous blog post on Mining MX records and performed the following enrichments:

Maxmind ASN
Maxmind Country
Cloud Provider IP Lookups for AWS, Azure, and GCP
Alexa Ranking
Email Security Provider mapping

netaddr, tldextract, and cidr-trie were useful during this stage.

Step Three: Analysis

Through this analysis, I hoped to answer the following questions:

What is the largest trusted network size (both single CIDR and aggregate network space)? … HUGE
Could I find any blatantly misconfigured SPF records? … YES
What does SPF data show about email security providers? … A lot that MX doesn’t
What are the most “included” SPF includes? … Not many surprises here
Does SPF augment the MX record mining (give more coverage? reveal things previously hidden? or 100% redundant?) … YES!
Are domains trusting IP space from cloud providers that may be re-usable (i.e. AWS EC2)? … YES!

Below are some outputs and commentary from this project’s Jupyter notebook that answer the questions above.

Network Graphs

These networkx visualizations of the Fortune 100 and Alexa 100 are a bit of a mess, but they should get the point across of how interconnected the SPF trust relationships are.

Fortune 100 SPF Trusted Networks Graph

Alexa 100 SPF Trusted Networks Graph

Heatmaps

As you can see from the next several heatmaps, as we go beyond the Alexa top 1,000 domains the number of networks trusted drastically increases, and as we hit the Alexa 1m, the entire Internet is trusted (likely due to SPF misconfigurations).

These heatmaps were generated with the awesome ipv4-heatmap tool provided by the Measurement Factory. The code to automate this can be found in my Jupyter Notebook here.

Fortune 1,000 SPF Trusted Networks Heatmap

Alexa 1,000 SPF Trusted Networks Heatmap

Alexa 10,000 SPF Trusted Networks Heatmap

Alexa 100,000 SPF Trusted Networks Heatmap

Alexa 1,000,000 SPF Trusted Networks Heatmap

Alexa Top 1M Domains Trusting /7 or larger networks

As you can see from this list, there are quite a few domains that trust very large networks. Several of these seem like likely misconfigurations. For example, these four domains trust the entire Internet:

hitadouble[.]com: 208.67.207.0/0
payukraine[.]com: 0.0.0.0/0
angliss[.]edu[.]au: 0.0.0.0/0
hutkigrosh[.]by: 0.0.0.0/0

This domain trusts half of the Internet - salaam[.]af: 175.106.32.0/1

And these five domains trust 1/4 of the Internet. cfe[.]fr appears to have fixed this apparent misconfiguration now, as their TXT record has changed.

creativecircle[.]com: 64.4.22.64/2
gevestor[.]de: 91.241.72.0/2
debeersgroup[.]com: 10.47.149.168/2
cfe[.]fr: 82.97.62.0/2
adecco[.]com: 148.105.8.0/2
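
The fractions quoted above follow directly from the prefix length, and Python's standard `ipaddress` module can confirm them. The helper name `ipv4_fraction` is an illustrative assumption. `strict=False` is needed because these records have host bits set beyond the prefix.

```python
import ipaddress


def ipv4_fraction(cidr):
    """Fraction of the whole IPv4 address space covered by a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=False)  # tolerate set host bits
    return net.num_addresses / 2**32


whole = ipv4_fraction("208.67.207.0/0")    # hitadouble[.]com
half = ipv4_fraction("175.106.32.0/1")     # salaam[.]af
quarter = ipv4_fraction("64.4.22.64/2")    # creativecircle[.]com
```

Note that a `/0` covers everything no matter which address precedes it, so `208.67.207.0/0` is equivalent to `0.0.0.0/0`.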

Top SPF Includes from all top domain lists (via SPF)

Using all the popular domain names, here is a summary of the top 10 SPF includes.

Major Cloud Email Providers:

Microsoft: spf.protection.outlook.com
Google: _spf.google.com

Hosting Providers:

HostGator: websitewelcome.com
OVH: mx.ovh.com
Bluehost: bluehost.com

Commercial Email Marketing companies

MailChimp: servers.mcsv.net
Mandrill: spf.mandrillapp.com (MailChimp add-on)
Sendgrid: sendgrid.net

Email Security company:

MailChannels: mailchannels.net (more on this later)

Top SPF Includes from Fortune 1000 (via SPF)

Top SPF Includes from Alexa top1m

Email Security Providers

If you read my previous blog post on Mining DNS MX Records for Fun and Profit, then you might notice that these top lists look significantly different from the top email providers identified from MX records. The top 5 providers identified in the SPF data are MailChannels, Mimecast, Proofpoint, Solarwinds, and Barracuda. In the MX post, the top 5 were Proofpoint, Mimecast, Deteque, Barracuda, and Solarwinds, and MailChannels was #48 on that list. These top lists use all the popular domains data, which is likely not an accurate reflection of the actual email security market. For the Fortune 1000 the story is less surprising: the top 4 email security providers were nearly identical across SPF and MX records, with just the order being different. I suspect that MailChannels shows up as popular in SPF because either it is the default setting on newly registered domains or it is the default setting for domains that are parked with certain hosting providers, but I haven’t spent the time to prove or disprove this.

(Update 7/7/2020) I received this message from Ken Simpson, CEO of MailChannels, that helps explain why there is a mismatch between the MX and SPF counts.

“You were wondering why MailChannels shows up in a lot of SPF records (actually, we’re number one), but relatively few MX records. MailChannels delivers email for the web hosting industry, with over 700 service provider customers worldwide. To deliver email reliably, they have to add us to their customers’ SPF records. Those same customers often host their inbound email with someone else - GSuite, Microsoft 365, or another provider. Hence the mismatch in SPF and MX records.”

One other interesting aspect with SPF is it (potentially) reveals relationships with multiple email security providers. See the “Fortune 100 Email Security Providers Listing (via SPF)” and “Domains with 4 or more Email Security Providers (via SPF)” gists below. In the Fortune 100 list, there are 3 domains with SPF relationships with more than one provider. If you look across all the top domains data you can see there are many. For anyone who has worked in the cyber security department at a large company before, this is not surprising, but it was cool to be able to see this in the data.

Domains with 2 SPF relationships with Email Security Providers: 11,393
Domains with 3 SPF relationships with Email Security Providers: 468
Domains with 4 SPF relationships with Email Security Providers: 35
Domains with 5 SPF relationships with Email Security Providers: 1

Top Email Security Provider from all top domain lists (via SPF)

Top Email Security Provider from Alexa 1m (via SPF)

Top Email Security Provider from Fortune 1000 (via SPF)

Top Email Security Provider from Fortune 100 (via SPF)

Fortune 100 Email Security Providers Listing (via SPF)

Domains with 4 or more Email Security Providers (via SPF)

Trusting Cloud Provider Networks

As you can see from the next few tables, many domains transitively trust a lot of Cloud provider IP space for SPF. For some of the larger networks trusted it seems like this carries risk since it may be possible for the cloud IP space to get reused; see Fishing the AWS IP Pool for Dangling Domains for a practical example of this. Like I mentioned earlier, SPF is usually used with DKIM and DMARC so this data doesn’t paint the whole picture. I am hoping to dive into DMARC/DKIM next.
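
A check along these lines can be sketched with the standard `ipaddress` module. The provider name and all ranges below are hypothetical placeholders built from documentation address blocks. A real run would load the ranges that AWS, Azure, and GCP publish, and would likely use a trie for speed rather than this nested loop.

```python
import ipaddress


def cloud_overlaps(spf_networks, cloud_ranges):
    """Return (spf_network, provider, cloud_range) for every overlap found."""
    hits = []
    for cidr in spf_networks:
        net = ipaddress.ip_network(cidr, strict=False)
        for provider, ranges in cloud_ranges.items():
            for cloud_cidr in ranges:
                cloud_net = ipaddress.ip_network(cloud_cidr)
                # Only compare networks of the same IP version.
                if net.version == cloud_net.version and net.overlaps(cloud_net):
                    hits.append((cidr, provider, cloud_cidr))
    return hits


# Hypothetical inputs for illustration.
CLOUD_RANGES = {"ExampleCloud": ["203.0.113.0/24"]}
SPF_NETWORKS = ["203.0.113.128/25", "192.0.2.1"]

hits = cloud_overlaps(SPF_NETWORKS, CLOUD_RANGES)
```

Using `overlaps` rather than a containment test catches both cases: an SPF network inside a cloud range, and an oversized SPF network that swallows a cloud range.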

Alexa 1000 Trusting AWS Networks

Alexa 1000 Trusting Azure Networks

Alexa 1000 Trusting GCP Networks

Fortune 1000 Trusting AWS Networks

Fortune 1000 Trusting Azure Networks

Fortune 1000 Trusting GCP Networks

Some other potentially interesting results, not worth dumping here:

Alexa top1m domains trusting AWS Networks
Alexa top1m domains trusting Azure Networks
Alexa top1m domains trusting GCP Networks
Top Maxmind ASNs of SPF Trusted Networks from Fortune 1000 (via SPF)
Top Maxmind ASNs of SPF Trusted Networks from all top domain lists (via SPF)
Top Maxmind ASNs of SPF Trusted Networks from Alexa top1m (via SPF)
Graph analytics applied to Fortune 1000 and Alexa 1000: degree centrality, edge betweenness centrality, pagerank, closeness centrality, triangle counts, and connected components stats, see the notebook and search for “print_graph_metrics”.

Future Work

SPF Crawler enhancements: As you can see from the SPF guide I shared above for “a” and “mx”, SPF supports some fairly complex policies for allowing certain IPs to send email (esp. the prefix operators on these SPF mechanisms). I did not provide support for these mechanisms in the first version of my SPF crawler, mainly due to the complexity involved. Because of this, my results will underrepresent the trust relationships where these are used. I hope to add support for these operators to expand what could be found in this data.
Try some more graph analytics on the entire dataset. In the Jupyter notebook I ran several graph algorithms on subsets of the entire graph (Fortune 100 and Alexa 100). These showed some mildly interesting results, but testing against larger graphs caused graphviz to fail due to some data format issues that I have not had a chance to research.
Perform another study measuring DMARC and DKIM usage across popular domains.

Resources

As usual all notebooks, code, and summary results can be found in Github: https://github.com/covert-labs/mx-intel.

And all data can be found at the links below:

all-registered-domains.txt.gz - base domains extracted from combining several popular domains lists together and then uniqued.
all-registered-domains-outputs-combined.txt.gz - raw dig output for all the TXT requests.
spf-results-all-registered-domains.json.gz - the parsed results from running the SPF Crawler against all-registered-domains.txt.gz.
spf-linked-all-registered-domains.json.gz - the assembled results from processing spf-results-all-registered-domains.json.gz. This is the collapsed/combined data that shows all the SPF domains and networks included recursively.

–Jason
@jason_trost

All your SPF are belong to us: Exploring trust relationships through global scale SPF Mining was originally published by Jason Trost at covert.io on July 06, 2020.

Mining DNS MX Records for Fun and Profit

2020-06-27T00:00:00-00:00

If you have read my blog before, you may realize that I really love DNS data and DNS analytics. In this post, I share some experiences in using mostly DNS data for identifying the visible footprint of popular email security providers.

This may not be terribly novel, but it was an interesting exploration during a time of boredom for me. This work was initially motivated by two events:

When the Proofpoint email protection machine learning vulnerability (CVE-2019-20634) was announced by Will Pearce and Nick Landers I got to wondering about how large their deployment footprint was and how one could figure this out, and
A friend at another company mentioned that they were using a specific startup email security provider and I wondered whether I could determine what other companies were also using this same provider.

Here is the methodology I devised for this:

Collect a large sample of MX records
Enrich MX records with IP intelligence and useful metadata
Sift through the enriched records and identify recognizable email providers’ domains through OSINT (whois, PDNS, Google) and market research.
Profit?!?!?

For step one, I downloaded the Alexa top 1M domains, Quantcast top 1m domains (from WaybackMachine), Domcop Top 10m domains, Majestic Million Domains and Cisco Umbrella top 1m domains. I identified the registered domain using tldextract for each of these and then combined them into a single de-duplicated list. This resulted in ~8.3M unique domain names. I then performed bulk MX lookups using adnshost against my own bind9 recursive nameserver. In my experience, adnshost works pretty well for bulk DNS resolution at this scale, and it will perform both the lookup requested (MX) as well as a domain resolution (A-lookup). When performing bulk DNS lookups at this scale it is important to add retry logic for failed resolutions as this tends to happen enough to be a problem. I did this using a simple bash script that retried failed lookups up to three times.
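
The retry logic was done in bash, but the same idea can be sketched in Python. Everything here is an illustrative assumption: `resolve_with_retries`, the `resolve` callable, and the deliberately flaky fake resolver used to exercise it. Failed lookups are retried in whole passes, up to three times after the initial attempt.

```python
def resolve_with_retries(domains, resolve, max_retries=3):
    """Resolve each domain, retrying only the failures on each extra pass.

    Returns (results, still_failed).
    """
    results = {}
    pending = list(domains)
    for _attempt in range(1 + max_retries):
        failed = []
        for domain in pending:
            try:
                results[domain] = resolve(domain)
            except Exception:
                failed.append(domain)  # queue for the next pass
        pending = failed
        if not pending:
            break
    return results, pending


# Hypothetical resolver: one domain never answers, the other
# only succeeds on its third lookup.
calls = {}


def flaky_resolve(domain):
    calls[domain] = calls.get(domain, 0) + 1
    if domain == "dead.example" or calls[domain] < 3:
        raise TimeoutError(domain)
    return ["mx." + domain]


results, failed = resolve_with_retries(["example.com", "dead.example"], flaky_resolve)
```

Retrying in passes rather than immediately gives transient resolver problems some time to clear before the same name is asked again.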

For step two, I then developed a simple Jupyter notebook to parse the adnshost logs and perform the enrichments using tldextract, PTR lookups (also using adnshost), Maxmind ASN, Maxmind City, Alexa ranking, and Cloud provider IP Ranges for AWS, Azure, and GCP. Side note: I also attempted to perform SOA lookups on the /24 networks of each IP after noticing some useful patterns with failed PTR lookups. This appears potentially useful for identifying how some of the cloud providers’ IP space is used, but this turned into a rabbit hole since adnshost appears to crash when trying to handle some of the results it received.

For step three, I did the following:

Performed market research on the top email security providers as well as emerging and niche providers. This site was helpful as well as just googling around and exploring PDNS/Whois data from PassiveTotal and SecurityTrails.
Scrutinized the top MX server registered domains and ASNs and tried to identify potential security providers.
Sifted through the remaining results trying to identify any obvious providers with “malware”, “phish”, “spam”, or “security” in their domain names.

I used this to build two mappings to email security providers: MX server base domains and ASN names. The mappings can be found here. Then I summarized the overall dataset and those results are presented below. Elephants in the room: I purposefully did not include Microsoft, Google, and some of the bigger tech companies that provide email service as part of these mappings since I don’t consider them email security companies. This may be debatable since these companies do provide security features through their offerings.

Brief Intro to MX records

For those of you who may not be familiar with DNS MX records, these are DNS Resource Records (RRs) used to map a domain name to the Mail Exchange (MX) servers responsible for accepting email for that domain. MX records are used by Mail Transfer Agents (MTA) in order to identify where email should be sent for a given recipient email address. Below we use the command line utility “dig” to perform an MX lookup on gmail.com to find its Mail Exchange servers. As you can see, at the time of this writing, there are five MX domains that can accept email for gmail.com.

Besides being critical for identifying where email should be sent, MX records are also useful for mapping out infrastructure and can sometimes be used to identify which email security providers are being used by a company of interest. Below is an example for Florida State University (go Noles!) that reveals that, at the time of this writing, they are using Proofpoint to receive their email. How do we know this? Their mail exchanges are hosted on sub domains of pphosted.com which is owned by Proofpoint.
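
The mapping step can be illustrated with a small lookup. Only the pphosted.com to Proofpoint pair comes from the text above. The MX hostname used in the example is made up, and plain suffix matching is a simplification swapped in here for the tldextract-based registered-domain extraction used in the actual study.

```python
# Base domain -> provider. Only this one entry is taken from the post.
PROVIDER_BY_BASE_DOMAIN = {
    "pphosted.com": "Proofpoint",
}


def provider_for_mx(mx_host, mapping=PROVIDER_BY_BASE_DOMAIN):
    """Return the provider whose base domain matches the MX hostname, if any."""
    host = mx_host.rstrip(".").lower()  # dig prints a trailing dot
    for base, provider in mapping.items():
        # Match the base domain itself or any subdomain of it.
        if host == base or host.endswith("." + base):
            return provider
    return None


# Hypothetical MX hostname as it might appear in dig output.
provider = provider_for_mx("mx0a-example.pphosted.com.")
```

Matching on `"." + base` rather than a bare suffix avoids false hits on unrelated domains that merely end with the same characters.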

Some companies obscure their security providers by first receiving their email to other mail exchanges such as ones hosted in their own data center or ones hosted by Google or Microsoft. In this blog, we explore a large DNS dataset to identify interesting info about the visible footprint / market share of email security companies.

All code and data for this study can be found in this Github Repo: https://github.com/covert-labs/mx-intel.

Observations:

Email security provider OPSEC is remarkably bad in a lot of cases and it is often easy to determine which provider is being used. Anyone who works in cyber security knows it is generally not a good idea to broadcast which cyber security products you are using since it may provide information that can be exploited by the adversary. This is especially true when vulnerabilities are announced in security products.
Since email exchanges can be chained together, only the outermost layer is visible in DNS MX records. For this reason, this research will underestimate the size of each provider’s market share.
Some security providers supply very specialized services (like anti-phishing only) and because of this they are often not the first layer in the email exchange chain. They will be dramatically underrepresented in this study.

Results:

Summary:

8,395,595 domains (derived from several top domain lists)
12,910,550 unique MX records (from 5,994,452 unique domains)
2,901,843 Unique Mail server domains
1,940,993 Unique Mail server base domains
25,733 Unique Mail server ASNs
56 Unique Security Providers identified

Analytics:

Here are the questions I was hoping to answer with the tables presented below:

Who are the market leading security companies reflected in the data?
What is the visible market share of email security providers as reflected in DNS records?
What can be inferred from publicly available MX records about email security?
Which email security providers are leveraging cloud hosting? And which cloud hosting environments are used most?
Who are the visible customers of provider X?

Note: All tables below show the count of domains hosted, NOT companies; companies can own many domains. Fortune 1000 domains are from 2015 and are based on this file created by Bob Rudis.

Top Email Security Providers Overall

Fortune 1000 Email security providers

Fortune 100 domain, MX base domain, email security provider

Note: I ended up adding Google and Microsoft to this table since they were very well represented. As you can see, Proofpoint and self-hosting dominate the Fortune 100.

Alexa 1000 Email security providers

Alexa 100 domain, MX base domain, email security provider

Note: I ended up adding Google and Microsoft to this table since they were very well represented. As you can see, self-hosting, Google and Microsoft dominate the Alexa 100. Almost all of these domains are from large technology / web companies so this isn’t so surprising, but it is interesting as compared to the Fortune 100.

Top Email Security Providers Hosted in AWS

Many large email security companies are operating from AWS.

Top Email Security Providers Hosted in Azure

Only a small number of identifiable email security companies were operating from Azure.

Top Self-hosted Email Security Providers

Misc Findings

When mining this data I discovered a few interesting items.

Linode / CSC Digital Brand Services

One of the more popular email security providers, “CSC Digital Brand Services” (which services multiple Fortune 100 companies), uses Linode for their hosting. This was surprising since Linode seems like a much smaller player in the Cloud hosting market.

googlemial[.]com

When I initially collected this data, freecodecamp.org had a misconfigured MX domain pointing to googlemial[.]com. This sketchy domain is not owned by Google, and it resolved to a GCP IP. Upon further inspection, this IP appears to be hosting a parking page for unregistered domains owned by GoDaddy. A quick PDNS check of other domains resolving to this IP reveals ~4.2M+ domains, and a quick DNS resolution on those domains with any subdomain shows that they all resolve to the same IP.

adnshost logs for freecodecamp.org

Future Work

I am not sure if I will return to this research or not, but I had some ideas that may be worth pursuing at some point, maybe during the next pandemic :)

Perform similar work at a much larger scale - using all major zone files (COM, NET, ORG) and ICANN’s CZDS as the inputs.
Or perform similar work using the Rapid7 Opendata DNS data sets.
Determine if port scans against MX servers could be useful to augment this.
Automate PDNS queries and analysis against the MX records found to identify other domains not found in the top domain lists.
Perform similar work, but collect SPF records and see what interesting insights could be gleaned about email sending trust (and whether vulns could be identified – like AWS IPs in the SPF that are stale and potentially obtainable).
Completely automate this entire process and use it to generate weekly reports.
Identify providers hidden by the first layer mail exchange. It may be possible to do this at scale (but only for some companies) if the companies send bounce notifications to external email senders for non-existent recipients. These bounce messages often contain all the SMTP headers of the original message sent. These headers can reveal security products. This technique was used on a targeted basis by Will Pearce and Nick Landers in their DerbyCon research on Proofpoint. Trying to do this at scale may draw a lot of attention or get my research box put on some blacklists. It would also likely be a lot more effort to identify the SMTP headers associated with different security providers.

Resources

Data:

all-registered-domains.txt.gz - base domains extracted from combining several popular domains lists together and then uniqued.
all-popular-domains-MX-20200620.txt.unique.gz - adnshost logs from performing MX lookups on domains from all-registered-domains.txt.gz.
mailserver_registered_domain-NS-20200620.txt.gz - adnshost logs from performing NS lookups on all the MX base domains; used for enrichment.
mx-intel-enriched.csv.gz - the final enriched output from this work.

Notebooks, Code, and summary results: https://github.com/covert-labs/mx-intel.

–Jason
@jason_trost

Mining DNS MX Records for Fun and Profit was originally published by Jason Trost at covert.io on June 27, 2020.

Seven Short Links on Cyber Security Alert Triage Automation

2020-05-23T00:00:00-00:00

A short listing of research papers I’ve discovered recently that aim to automate or speed up cyber security alert triage (alert prioritization/ranking, causal event correlation, and enrichment).

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

Seven Short Links on Cyber Security Alert Triage Automation was originally published by Jason Trost at covert.io on May 23, 2020.

Eight Short Links on Provenance Analytics for Cyber Security

2020-02-01T00:00:00-00:00

A short listing of research papers I’ve discovered recently that use Provenance Analytics for various Cyber Security usecases from EDR data analysis to malware analysis to threat hunting and IR.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

Eight Short Links on Provenance Analytics for Cyber Security was originally published by Jason Trost at covert.io on February 01, 2020.

3 Short Links on Popular Domain Lists for Threat Intelligence

2020-02-01T00:00:00-00:00

A short listing of research papers I’ve read that analyze popular domain lists. These papers analyze Alexa, Quantcast, Cisco Umbrella, and Majestic top websites/domains.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

3 Short Links on Popular Domain Lists for Threat Intelligence was originally published by Jason Trost at covert.io on February 01, 2020.

6 Short Links on Malware Training Set Creation for Machine Learning

2020-02-01T00:00:00-00:00

A short listing of resources useful for creating malware training sets for machine learning.

In leading academic and industry research on malware detection, it is common to use variations of the following techniques (based on Virustotal determinations) in order to build labeled training data.

“In this paper, we use a ‘1-/5+’ criterion for labeling a given file as malicious or benign: if a file has one or fewer vendors reporting it as malicious, we label the file as ‘benign’”. See ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation for more details.
“We assign malicious/benign labels on a 5+/1- basis, i.e., documents for which one or fewer vendors labeled malicious, we ascribe the aggregate label benign, while documents for which 5 or more vendors labeled malicious, we ascribe the aggregate label malicious.” See MEADE: Towards a Malicious Email Attachment Detection Engine for more details.
Uses similar method as above, but further removes files that use hash based file names or filenames that are “malware” or “sample”. See Learning from Context: Exploiting and Interpreting File Path Information for Better Malware Detection for more details.
“To train and evaluate our model at low false positive rates, we require accurate labels for our malware and benignware binaries. We accomplish this by running all of our data through VirusTotal, which runs the binaries through approximately 55 malware engines. We then use a voting strategy to decide if each file is either malware or benignware… We label any file against which 30% or more of the antivirus engines alarm as malware, and any file that no antivirus engine alarms on as benignware. For the purposes of both training and accuracy evaluation we discard any files that more than 0% and less than 30% of VirusTotal’s antivirus engines declare it malware, given the uncertainty surrounding the nature of these files.” See Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features for more details.
Scraping packages from Ninite, Chocolatey, and Cygwin.
Endgame’s Ember is becoming one of the most cited datasets used for security machine learning. “The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign)”. See EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models for more details.
Labeling the VirusShare Corpus: Lessons Learned - John Seymour
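The vendor-count rules quoted above are simple to express in code. Below is a minimal sketch of the 1-/5+ criterion, assuming you already have a per-file count of vendors reporting the file as malicious. The function name, the parameter names, and the sample counts are illustrative and are not taken from any of the cited papers' code.

```python
from typing import Optional


def label_from_detections(num_detections: int,
                          benign_max: int = 1,
                          malicious_min: int = 5) -> Optional[str]:
    """Apply a 1-/5+ style rule to a count of vendor detections.

    Returns "benign", "malicious", or None when the count falls in the
    ambiguous middle band and the sample should be left out of training.
    """
    if num_detections < 0:
        raise ValueError("detection count cannot be negative")
    if num_detections <= benign_max:
        return "benign"
    if num_detections >= malicious_min:
        return "malicious"
    return None


# Hypothetical (sample id, number of vendors flagging it) pairs.
samples = [("a", 0), ("b", 1), ("c", 3), ("d", 5), ("e", 40)]
labeled = {sid: label_from_detections(n) for sid, n in samples}

# Drop the ambiguous samples (here only "c") before training.
training_set = {sid: lab for sid, lab in labeled.items() if lab is not None}
```

Making the two thresholds parameters keeps the same function usable for stricter variants, at the cost of discarding more samples.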

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

6 Short Links on Malware Training Set Creation for Machine Learning was originally published by Jason Trost at covert.io on February 01, 2020.

Collecting and Curating IOC Whitelists for Threat Intelligence and Machine Learning Research

2020-02-01T00:00:00-00:00

In this post, I share my experience in building and maintaining large collections of benign IOCs (whitelists) for Threat Intelligence and Machine Learning Research.

Whitelisting is a useful concept in Threat Intelligence correlation since it can be very easy for benign observables to make their way into threat intelligence indicator feeds, esp. coming from open source providers or vendors that are not as careful as they should be. If these threat intelligence feeds are used for blocking (e.g. in firewalls or WAF devices) or alerting (e.g. log correlation in SIEM or IDS), the cost of benign entries making their way into a security control will be very high (wasted analyst time for triaging false positive alerts or loss of business productivity for blocked legitimate websites). Whitelists are generally used to filter out observables from threat intelligence feeds that almost certainly would be marked as a false positive if they were intersected against event logs (e.g. bluecoat proxy logs, firewall logs, etc) and used for alerting. Whitelists are also very useful for building labeled datasets required for building machine learning models and enriching alerts with contextual information.

The classic example of a benign observable is 8.8.8.8 (Google’s published open DNS resolver). This has found its way into many open source and commercial threat intelligence feeds by mistake since sometimes malware use this IP for DNS resolution or they ping it for connectivity checks. There are many other observables that commonly make their way into threat feeds due to how the threat feeds are derived / collected. Below is a summary of the major sources of false positives for threat intelligence feeds and ways to identify these to prevent their use. Most commercial threat intelligence platforms are pretty good at identifying these today and the dominant open source threat intelligence platform MISP is getting better with its MISP-warninglists, but as you will discover below there is some room for improvement.

Benign Inbound Observables

Benign Inbound Observables commonly show up in threat intelligence feeds derived from distributed network sensors such as honeypots or firewall logs. These IPs show up in firewall logs and are generally benign or at best are considered noise. Below are several common Benign Inbound Observable types. Each type also comes with recommended data sources or collection techniques listed as sub bullets:

Known Web Crawlers - Web crawlers are servers that crawl the World Wide Web and through this process may enter the networks of many companies or may accidentally hit honeypots or firewalls.
- RDNS + DNS analytics can be used to enumerate these in bulk once patterns are identified. Here is an example pattern for googlebots. Mining large collections of rdns data can reveal other patterns to focus on. Below is an example of a simple PTR lookup on a known googlebot IP. This should start to reveal patterns that could be codified assuming you have access to a large corpus of RDNS data like is provided here (or could easily be generated).

Known port scanners associated with highly visible projects or security companies (Shodan, Censys, Rapid7 Project Sonar, ShadowServer, etc.)
- RDNS + DNS analytics may be able to enumerate these in bulk (assuming the vendors want to be identified). Example:

Mail Servers - these servers send email and they sometimes wind up on Threat feeds by mistake.
- In order to enumerate these, you need a good list of popular email domains. Then perform DNS TXT requests against this list and parse the SPF records. Multiple lookups will likely be needed as SPF allows for redirects and includes. Below shows the commands needed to do this manually for gmail.com as an example. The CIDR blocks returned are the IP space where gmail emails are sent from. Alerting or blocking on these is gonna cause a bad day.

Cloud PaaS Providers – Most Cloud providers publish their IP space via APIs or in their documentation. These lists are useful to derive whitelists, but they will need to be further filtered. Ideally you only whitelist Cloud IP space that is massively shared (like S3, CLOUDFRONT, etc), not IPs that are easy for bad guys to use, such as EC2 instances. These whitelists should not be used to exclude domain names that resolve to this IP space, but instead should be used for either enrichments on alerting or to suppress IOC based alerting from these IP ranges.
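The RDNS + DNS analytics suggested above for crawlers and scanners boil down to two checks: the PTR name matches a known pattern, and a forward lookup of that name returns the original IP. Here is a minimal sketch of that check. The suffix list, the function names, and the stub data (documentation IP ranges with made-up PTR names) are assumptions for illustration; the two default lookup functions use the system resolver through Python's socket module, and the stubs let the example run offline.

```python
import socket
from typing import Callable, List, Sequence


def ptr_lookup(ip: str) -> str:
    """Reverse (PTR) lookup via the system resolver."""
    return socket.gethostbyaddr(ip)[0]


def a_lookup(name: str) -> List[str]:
    """Forward (A) lookup via the system resolver."""
    return socket.gethostbyname_ex(name)[2]


def is_forward_confirmed(ip: str,
                         suffixes: Sequence[str],
                         ptr: Callable[[str], str] = ptr_lookup,
                         fwd: Callable[[str], List[str]] = a_lookup) -> bool:
    """True only if the PTR name ends with an expected suffix AND the
    name resolves back to the same IP."""
    try:
        name = ptr(ip).rstrip(".").lower()
    except OSError:  # no PTR record or resolver failure
        return False
    if not name.endswith(tuple(suffixes)):
        return False
    try:
        return ip in fwd(name)
    except OSError:
        return False


# Offline stubs with made-up data (assumption: these are NOT real crawler IPs).
FAKE_PTR = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "198.51.100.7": "crawl-1.googlebot.com",  # PTR spoofed by the IP owner
    "203.0.113.5": "host5.isp.example",
}
FAKE_A = {
    "crawl-192-0-2-10.googlebot.com": ["192.0.2.10"],
    "crawl-1.googlebot.com": ["192.0.2.99"],  # does not map back
}


def fake_ptr(ip: str) -> str:
    if ip not in FAKE_PTR:
        raise OSError("no PTR record")
    return FAKE_PTR[ip]


def fake_a(name: str) -> List[str]:
    if name not in FAKE_A:
        raise OSError("no A record")
    return FAKE_A[name]


CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")
results = {ip: is_forward_confirmed(ip, CRAWLER_SUFFIXES, fake_ptr, fake_a)
           for ip in FAKE_PTR}
```

Only the first address passes: the second has a convincing PTR name that does not resolve back to it, and the third does not match the pattern at all.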

Note: Greynoise is a commercial provider of “anti-threat” intelligence (i.e. they identify the noise and other benign observables). They are very good at identifying the types of benign observables listed above since they maintain a globally distributed sensor array and are specifically analyzing network events in order to identify benign activity.

Note: MISP-warninglists provides many of these items today but they may be stale (several of their lists have not been updated in months). Ideally all of these lists are kept up-to-date through automated collection from authoritative sources instead of hard coded data stored in github (unless these are automatically updated frequently). See section on “Building / Maintaining Whitelist Data” for more tips.
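As one example of automated collection from an authoritative source, the SPF walk described for mail servers above can be sketched as a small recursive function. This is a simplified sketch and not a full SPF parser: it only collects ip4/ip6 networks and follows include and redirect terms, and it ignores other mechanisms (a, mx, exists, macros). The TXT lookup is passed in as a function so any DNS client can be plugged in; the records below are made-up data for a hypothetical domain, and the lookup cap is a simple guard against loops.

```python
from typing import Callable, List, Optional, Set


def expand_spf(domain: str,
               txt_lookup: Callable[[str], List[str]],
               seen: Optional[Set[str]] = None,
               max_lookups: int = 10) -> List[str]:
    """Collect ip4/ip6 networks from a domain's SPF record, following
    include: and redirect= terms to other domains."""
    seen = set() if seen is None else seen
    if domain in seen or len(seen) >= max_lookups:
        return []
    seen.add(domain)

    networks: List[str] = []
    for record in txt_lookup(domain):
        if not record.lower().startswith("v=spf1"):
            continue  # some other TXT record
        for term in record.split()[1:]:
            # Only "+" (or no qualifier) means "authorized"; terms with
            # "-", "~" or "?" qualifiers fall through and are ignored.
            term = term.lstrip("+")
            low = term.lower()
            if low.startswith(("ip4:", "ip6:")):
                networks.append(term.split(":", 1)[1])
            elif low.startswith("include:"):
                networks.extend(expand_spf(term.split(":", 1)[1],
                                           txt_lookup, seen, max_lookups))
            elif low.startswith("redirect="):
                networks.extend(expand_spf(term.split("=", 1)[1],
                                           txt_lookup, seen, max_lookups))
    return networks


# Made-up TXT records for a hypothetical mail provider.
FAKE_TXT = {
    "mail.example": ["v=spf1 redirect=_spf.mail.example"],
    "_spf.mail.example": [
        "v=spf1 include:_netblocks.mail.example "
        "include:_netblocks2.mail.example ~all"
    ],
    "_netblocks.mail.example": [
        "v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.0/24 ~all"
    ],
    "_netblocks2.mail.example": ["v=spf1 ip6:2001:db8::/32 ~all"],
}


def fake_txt_lookup(domain: str) -> List[str]:
    return FAKE_TXT.get(domain, [])


mail_networks = expand_spf("mail.example", fake_txt_lookup)
```

Run on a schedule against a list of popular email domains, the resulting networks can be stored with their source domain so each whitelist entry stays traceable.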

Benign Outbound Observables

Benign Outbound Observables show up frequently in threat intelligence feeds derived from malware sandboxing, URL sandboxing, outbound web crawling, email sandboxing, and other similar threat feeds. Below are several common Benign Outbound Observable types. Each type also comes with recommended data sources or collection techniques listed as sub bullets:

Popular Domains - Popular domains can wind up on threat intelligence feeds, especially those derived from malware sandboxing since often times malware uses benign domains as connectivity checks and some malware, like those conducting click fraud, act more like web crawlers, visiting many different benign sites. These same popular domains show up very often in most corporate networks and are almost always benign in nature (Note: they can be compromised and used for hosting malicious content so great care needs to be taken here).
- Below are several data sources for popular domain names. Each is slightly different in how it measures popularity (by volume of Web visitors, frequency of occurrence in Web Crawling data, by volume of DNS queries, or a combination). These lists should not be used as-is for whitelisting; they need to be filtered/refined. See section on “Building / Maintaining Whitelist Data” below for more details on recommendations for refinement.
  - Amazon Alexa top 1 Million
  - Cisco Umbrella Top 1 Million
  - Domcop Top 10m Domains (data) - The top 10 million websites taken from the Open PageRank Initiative.
  - Majestic Million Domains
  - Moz’s list of the most popular 500 websites on the internet
  - Quantcast Top 1 Million
  - Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation
  - MISP-warninglists’ dax30 websites, bank websites, university domains, url shorteners, whats-my-ip sites
- For more details and analysis on these popular domain lists, check out this post:
Popular IP Addresses - Popular IPs are very similar to popular domains. They show up everywhere and when they wind up on threat intelligence feeds they cause a lot of false positives. Popular IP lists can be generated from resolving the Popular domain lists. These lists should not be used as-is for whitelisting; they need to be filtered/refined. See section on “Building / Maintaining Whitelist Data” below for more details on recommendations for refinement.
Free email domains - free email domains occasionally show up in threat intelligence feeds by accident so it is good to maintain a good list of these to prevent false positives. Hubspot provides a list that is decent.
Ad servers - Ad servers show up very frequently in URL sandbox feeds as these feeds are often obtained by visiting many websites and waiting for exploitation attempts or for AV alerts. These same servers show up all the time in benign Internet traffic. Easylist provides this sort of data.
CDN IPs - Content Distribution Networks are geographically distributed networks of proxy servers or caches that provide high availability and high performance for web content distribution. Their servers are massively shared for distributing varied web content. When IPs from CDNs make it into threat intelligence feeds, false positives are soon to follow. Below are several CDN IP and domain sources.
- WPO-Foundation CDN list (embedded in Python code)
- AWS IP Ranges - but filtered for cloudfront and S3 IP space.
- Cloudflare IP Ranges
- Fastly IP Ranges
- MaxCDN IP Ranges
- Very similar to identifying known web crawlers, DNS PTR-Lookup + DNS A-Lookup analytics can be used to enumerate these in bulk once patterns are identified.
Certificate Revocation Lists (CRL) and the Online Certificate Status Protocol (OCSP) domains/URLs - When executing a binary in a malware sandbox and the executable has been signed, connections will be made to CRL and OCSP servers. Because of this, these often mistakenly wind up in threat feeds.
- Grab Certificates from Alexa top websites, extract OCSP URL. This old Alienvault post describes the process (along with another approach using the now defunct EFF SSL Observatory), and this github repo provides the code to do it. Care should be taken here since adversaries can influence the data collected in this way.
- MISP-warninglists’ crl-ip-hostname
NTP Servers - Some malware call out to NTP servers for connectivity checks or to determine the real date/time. Because of this, NTP servers often wind up mistakenly on threat intelligence feeds that are derived from malware sandboxing.
- Web scrape lists of NTP servers (such as the NIST Internet Time Servers and NTP Pool Project Servers) and perform DNS resolutions to derive all the servers behind each regional load balancer.
Root Nameservers and TLD Nameservers
- Perform DNS NS-lookups against each domain in the Public Suffix List and then perform DNS A-lookups on each nameserver domain to obtain their IP addresses.
Mail Exchange servers
- Obtain a list of popular email domains and then perform MX lookups against popular email domains to get their respective Mail Exchange (MX) servers. Perform DNS A-lookups on the MX servers list to obtain their IP addresses.
STUN Servers - “Session Traversal Utilities for NAT (STUN) is a standardized set of methods, including a network protocol, for traversal of network address translator (NAT) gateways in applications of real-time voice, video, messaging, and other interactive communications.” via https://en.wikipedia.org/wiki/STUN. Below are some sources of STUN servers (some of these appear old though).
Parking IPs - IPs used as the default IP for DNS-A records for brand new registered domains.
- maltrails parking_sites
Popular Open DNS Resolvers
- Public recursive name server (Wikipedia) - lists the largest and most popular open recursive nameservers.
- Public DNS Server List - maintains a large list of open recursive nameservers that may be useful for context, but should not be whitelisted.
Security Companies, Security Blogs and Security Tool sites - These sites show up in threat mailing lists frequently which are sometimes scraped as threat feeds and these domains are mistakenly flagged as malicious.
- Scrape all the reputable awesome-* security related GitHub repos. This is a little risky since an adversary could potentially get their domain added to these lists. Examples:
- MISP-warninglists provides security-provider-blogpost and automated-malware-analysis lists that look pretty good.
Bit Torrent Trackers - github.com/ngosang/trackerslist
Tracking domains - commonly used by well known email marketing companies. Often shows up in threat intel feeds derived from spam or phishing email sinkholes. Results in high false positive rates in practice.
- PDNS and/or Domain Whois analytics are one way to identify these once patterns can be observed. Below is an example of using Whois data for Marketo.com and identifying all the other Marketo email tracking domains that use Marketo’s nameserver. This example is from Whoisology, but bulk Whois mining is a preferred method.

Note: MISP-warninglists provides some of these items today but they may be stale. Ideally all of these lists are kept up-to-date through automated collection from authoritative sources. See section on “Building / Maintaining Whitelist Data” for more tips.
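For the AWS IP Ranges source listed under CDN IPs, the filtering step can be sketched as below. The field names (prefixes, ipv6_prefixes, ip_prefix, ipv6_prefix, service) follow the layout of the ip-ranges.json file that AWS publishes; the sample document itself is made-up data using documentation IP ranges, and the function name is illustrative. In practice you would download the published file on a schedule and feed the parsed JSON to the same function.

```python
import json
from typing import Dict, Iterable, List

# Made-up sample in the same shape as the published ip-ranges.json.
SAMPLE_JSON = """
{
  "prefixes": [
    {"ip_prefix": "192.0.2.0/24", "region": "us-east-1", "service": "S3"},
    {"ip_prefix": "198.51.100.0/24", "region": "GLOBAL", "service": "CLOUDFRONT"},
    {"ip_prefix": "203.0.113.0/24", "region": "us-east-1", "service": "EC2"},
    {"ip_prefix": "203.0.113.0/24", "region": "us-east-1", "service": "AMAZON"}
  ],
  "ipv6_prefixes": [
    {"ipv6_prefix": "2001:db8::/32", "region": "GLOBAL", "service": "CLOUDFRONT"}
  ]
}
"""


def shared_service_prefixes(doc: Dict,
                            services: Iterable[str] = ("S3", "CLOUDFRONT")
                            ) -> List[str]:
    """Return the sorted, de-duplicated CIDR blocks tagged with one of the
    massively shared services, skipping everything else (e.g. EC2)."""
    wanted = set(services)
    keep = set()
    for entry in doc.get("prefixes", []):
        if entry.get("service") in wanted:
            keep.add(entry["ip_prefix"])
    for entry in doc.get("ipv6_prefixes", []):
        if entry.get("service") in wanted:
            keep.add(entry["ipv6_prefix"])
    return sorted(keep)


whitelist_cidrs = shared_service_prefixes(json.loads(SAMPLE_JSON))
```

The EC2 block is deliberately dropped, since that address space is easy for an adversary to rent.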

Benign Host-based Observables

Benign Host-based Observables show up very commonly in threat intelligence feeds based on malware sandboxing. Here are some example observable types. So far, I have only found decent benign lists for File hashes (see below).

File hashes
Mutexes
Registry Keys
File Paths
Service names

Data Sources:

NSRL Hashsets
Windows-7/32 Diskprint
Neo23x0/fp-hashes.py
MISP common IOC false positives
Mandiant Redline Whitelist (mirror) - NOTE: this is ~5yr old at the time of this blog.
Hashsets.com (commercial) hash lists

In leading academic and industry research on malware detection, it is common to use Virustotal in order to build labeled training data. See this post for more details. These techniques seem very suitable for training data creation, but are not recommended for whitelisting for operational use due to the high likelihood of false negatives.

Note: If your goal is building a machine learning model on binaries, you should strongly consider Endgame’s Ember. “The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign)”. See EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models for more details.

Whitelist Exclusions

There are many observables that we will never want to whitelist due to their popularity or importance. These should be maintained in a whitelist exclusions list (a.k.a. greylist). Below are some examples:

Shared hosting domains and Dynamic DNS domains - these base domains should never be alerted on as many are in the Alexa top 1m list and will be incredibly noisy. BUT subdomains of these are fair game for alerting as they are easily adversary controlled and often abused. Below are some sources of this information, but identifying the major providers and scraping their websites or APIs would be a better way to keep these fresh.
- Shared Hosting - Maltrails free web hosting
- Dynamic DNS - Maltrails DynDNS
DNS Sinkhole IPs

Building / Maintaining Whitelist Data

Whitelist generation needs to be automated in order to be maintainable. There may be exceptions to this rule for things that you want to ensure are always in the whitelist, but for everything else, ideally they are collected from authoritative sources or are generated based on sound analytic techniques. You cannot always blindly trust each data source listed above. For several, some automated verification, filtering, or analytics will be needed. Below are some tips for how to do this effectively.

Each entity in the whitelist should be categorized (what type of whitelist entry is this?) and sourced (where did this come from?) so we know exactly how it got there (i.e. what data source was responsible) and when it was added/updated. This will help if there is ever a problem related to the whitelist so the specific source of the problem can be addressed.
Retrieve whitelist entries from the original source sites and parse/extract data from there. Avoid one time dumps of whitelist entries where possible since these will become stale very quickly. If you are including one-time dumps be sure to maintain their lineage.
Several bulk data sets will be very useful for analytics to expand or filter various whitelists
- Bulk Active DNS resolution (A-lookups, MX-lookups, NS-lookups, and TXT-lookups). Adns may be useful for this.
- Bulk RDNS data (either from scans.io or collected yourself).
- Bulk Whois data - This can be purchased from several vendors. Here are a few: whoisxmlapi.com, iqwhois.com, jsonwhois.com, whoisdatabasedownload.com, and research.domaintools.com.
- Passive DNS (PDNS) data - PDNS data can be purchased from several vendors or you can instrument your own network to collect and store this data. Here are some PDNS suppliers: farsightsecurity.com, deteque.com, circl.lu, riskiq.com, passivedns.mnemonic.no, and coresecurity.com (formerly Damballa).
Netblock ownership (Maxmind) lookups / analytics will be useful for some of the vetting.
The whitelist should be updated at least daily to stay fresh. There may be data sources that change more or less frequently than this.
- BE CAREFUL when refreshing the whitelist. Add sanity checks to ensure that the new whitelist was generated correctly before replacing the old one. The costs of a failed whitelist load will be mass false positives (unfortunately, I had to learn this lesson the hard way …).
Popular domain lists cannot be taken at face value as benign. Malicious domains get into these lists all the time. Here are some ways to combat this:
- Use the N-day stable top-X technique - e.g. Stable 6-month Alexa top 500k - create a derivative list from the top Alexa domains where you filter the list for only domains that have been on the Alexa top 500k list every day for the past 6 months. This technique is commonly used in malicious domain detection literature as a way to build high quality benign labeled data. It is not perfect and may need to be tuned based on how the whitelist is being used. This technique requires keeping historic popular domain lists. The Wayback Machine appears to have a large historic mirror of the Alexa top1m data that may be suitable for bootstrapping your own collection.
Bulk DNS resolution of these lists can also be useful for generating Popular IP lists, but only when using the N-day stable top-X concept or if great care is taken in how they are used.
Use a whitelist exclusions set for removing categories of domains/IPs that you never want whitelisted. The whitelist exclusions set should also be kept fresh through automated collection from authoritative sources (e.g. scraping dynamic DNS providers and shared hosting websites where possible, PDNS / Whois analytics may also work).
Lastly, be careful when generating whitelists and think about what aspects of the data are adversary controlled. These are things we need to be careful not to blindly trust. Some examples:
- RDNS entries can be made to be deceptive especially if the adversary knows they are used for whitelisting. For example, an adversary can create PTR records for IP address space they own that are identical to Google’s googlebot RDNS or Shodan’s census RDNS, BUT they cannot change the DNS A record mapping that domain name back to their IP space. For these a forward lookup (A Lookup) is generally also needed OR a netblock ownership verification.
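The N-day stable top-X technique from the tips above is a set intersection over historic daily lists. Here is a minimal sketch, assuming each daily list is already loaded as a rank-ordered list of domains keyed by date; the function name and the tiny sample lists are made up for illustration.

```python
from typing import Dict, List, Optional, Set


def stable_top_x(daily_lists: Dict[str, List[str]], top_x: int) -> Set[str]:
    """Return domains that appear in the top X of every daily ranking."""
    stable: Optional[Set[str]] = None
    for day in sorted(daily_lists):
        top_today = set(daily_lists[day][:top_x])
        stable = top_today if stable is None else stable & top_today
    return stable if stable is not None else set()


# Made-up rankings for three days (most popular first).
history = {
    "2020-01-01": ["a.example", "b.example", "c.example", "d.example"],
    "2020-01-02": ["b.example", "a.example", "evil.example", "c.example"],
    "2020-01-03": ["a.example", "c.example", "b.example", "d.example"],
}

stable_top3 = stable_top_x(history, top_x=3)
```

A domain that briefly spikes into the list, like evil.example on the second day, never makes it into the stable set. The window length and X remain tuning knobs that depend on how the whitelist is used.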

In conclusion, whitelists are useful for filtering out observables from threat intelligence lists before correlation with event data, building labeled datasets for machine learning models, and enriching threat intelligence or alerts with contextual information. Creating and maintaining these lists can be a lot of work. Great care should be taken so as not to go too far or whitelist domains or IPs that are easily adversary controlled.

As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost

Collecting and Curating IOC Whitelists for Threat Intelligence and Machine Learning Research was originally published by Jason Trost at covert.io on February 01, 2020.

Heterogeneous Information Networks + Cyber Security Use cases

2020-01-20T00:00:00-00:00

This post explores Heterogeneous Information Networks (HIN) and applications to Cyber security.

Over the past few months I have been researching Heterogeneous Information Networks (HIN) and Cyber security use cases. I first encountered HINs after discovering this paper: “Gotcha: Sly Malware!- Scorpion A Metagraph2vec Based Malware Detection System” through a Google Scholar Alert I had set up for “Guilt by Association: Large Scale Malware Detection by Mining File-relation Graphs”. If you’re interested in how I set up my Google Alerts to stay abreast of the latest security data science research, see this: Security Data Science Learning Resources.

Heterogeneous Information Networks are a relatively simple way of modelling one or more datasets as a graph consisting of nodes and edges where 1) all nodes and edges have defined types, and 2) types of nodes > 1 or types of edges > 1 (hence “Heterogeneous”). The set of node and edge types represents the schema of the network. This differs from homogeneous networks where the nodes and edges are all the same type (e.g. Facebook Social Network Graph, World Wide Web, etc.). HINs provide a very rich abstraction for modelling complex datasets.

Below, I will walk through important HIN concepts using the HinDom paper as an example. HinDom uses DNS relationship data from passive DNS, DNS query logs, and DNS response logs to build a malicious domain classifier using HIN. They use Alexa Top 1K list, Malwaredomains.com, Malwaredomainlist.com, DGArchive, Google Safe Browsing, and VirusTotal for deriving labels. Below is an example HIN schema taken from this paper.

This schema represents three combined datasets (Passive DNS, DNS query logs, DNS response logs) and it models three node types (Client, Domain, and IP Address) and six edge types (segment, query, CNAME, similar, resolve, and same-domain). Here is an expanded example and descriptions of the relationships:

Client-query-Domain - matrix Q denotes that domain i is queried by client j.
Client-segment-Client - matrix N denotes that client i and client j belong to the same network segment.
Domain-resolve-IP - matrix R denotes that domain i is resolved to IP address j.
Domain-similar-Domain - matrix S denotes the character-level similarity between domain i and j.
Domain-cname-Domain - matrix C denotes that domain i and domain j are in a CNAME record.
IP-domain-IP - matrix D denotes that IP address i and IP address j are once mapped to the same domain.
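To make the matrix view concrete, here is a small sketch that builds the query matrix Q from a toy query log and multiplies it by its transpose. The result counts, for every pair of domains, how many clients queried both, which is the number of Domain-Client-Domain path instances between them. The data and helper names are made up, and this only illustrates the representation; it is not the HinDom implementation.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]


def build_relation(rows: Sequence[str], cols: Sequence[str],
                   pairs: Sequence[Tuple[str, str]]) -> Matrix:
    """Binary relation matrix: entry [i][j] is 1 if (rows[i], cols[j]) is related."""
    row_idx = {name: i for i, name in enumerate(rows)}
    col_idx = {name: j for j, name in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for r, c in pairs:
        matrix[row_idx[r]][col_idx[c]] = 1
    return matrix


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


# Toy data: (client, domain) pairs from a made-up DNS query log.
domains = ["a.example", "b.example", "c.example"]
clients = ["10.0.0.1", "10.0.0.2"]
query_log = [("10.0.0.1", "a.example"), ("10.0.0.1", "b.example"),
             ("10.0.0.2", "b.example"), ("10.0.0.2", "c.example")]

# Q[i][j] = 1 when domain i is queried by client j.
Q = build_relation(domains, clients, [(d, c) for c, d in query_log])

# Domain-Client-Domain: number of clients shared by each pair of domains.
shared_clients = matmul(Q, transpose(Q))
```

Here a.example and b.example share one client, while a.example and c.example share none, so the corresponding entries are 1 and 0.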

Once the dataset is represented as a graph, feature vectors need to be extracted before machine learning models can be built. A common technique for featurizing a HIN is by defining Meta-paths or Meta-graphs against the graph and then performing guided random walks against the defined meta-paths/graphs. Meta-paths represent graph traversals through specific node and edge sequences. Meta-path selection is akin to feature engineering in classical machine learning as it is very important to select meta-paths that provide useful signals for whatever variable is being predicted. As seen in many HIN papers, meta-paths/graphs are often evaluated individually or in combination to determine their influence on model performance. Guided random walks against meta-paths produce a sequence of nodes (similar to sentences of words), which can then be fed into models like Skipgram or Continuous Bag-of-Words (CBOW) to create embeddings. Once the nodes are represented as embeddings many different models (SVM, DNN, etc) can be used to solve many different types of problems (Similarity Search, Classification, Clustering, Recommendation, etc). Below are the meta-paths used in the HinDom paper.
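A meta-path guided random walk can be sketched in a few lines. The version below is a simplification that I am assuming for illustration: the meta-path is given as a cycle of node types (for example domain, client, domain), edge types are ignored, and the next node is picked uniformly from the neighbors of the required type. The toy graph and all names are made up, and this is not the pipeline from any specific paper. The resulting walks are the "sentences" that would be fed to a Skipgram or CBOW model.

```python
import random
from collections import defaultdict
from typing import Dict, List

# Toy HIN: node types plus (client, domain) query edges. All data is made up.
node_type = {
    "c1": "client", "c2": "client",
    "a.example": "domain", "b.example": "domain", "c.example": "domain",
}
edges = [("c1", "a.example"), ("c1", "b.example"),
         ("c2", "b.example"), ("c2", "c.example")]

# adjacency[node][neighbor_type] -> list of neighbors of that type
adjacency: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
for src, dst in edges:
    adjacency[src][node_type[dst]].append(dst)
    adjacency[dst][node_type[src]].append(src)


def metapath_walk(start: str, metapath: List[str], walk_length: int,
                  rng: random.Random) -> List[str]:
    """Random walk that only steps to nodes of the type the meta-path
    requires next. The meta-path must start and end with the same type
    so it can be repeated."""
    if metapath[0] != metapath[-1]:
        raise ValueError("meta-path must start and end with the same node type")
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the meta-path")
    cycle = metapath[:-1]
    walk = [start]
    step = 0
    while len(walk) < walk_length:
        step += 1
        wanted_type = cycle[step % len(cycle)]
        candidates = adjacency[walk[-1]].get(wanted_type, [])
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk


walk = metapath_walk("a.example", ["domain", "client", "domain"],
                     walk_length=5, rng=random.Random(7))
walk_types = [node_type[n] for n in walk]
```

Generating many such walks per start node and per meta-path gives the corpus for the embedding step.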

Below is the HinDom Architecture to illustrate how all these concepts come together.

Below are some resources that I found useful for learning more about Heterogeneous Information Networks as well as several security related papers that used HIN.

Books:

HIN Papers:

Malware Detection / Code Analysis:

Tutorials:

Code:

github.com/zhoushengisnoob/HINE - Heterogeneous Information Network Embedding: papers and code implementations.
github.com/stellargraph/stellargraph (see stellargraph-metapath2vec.ipynb)
github.com/hetio/hetnetpy - HIN library
github.com/hetio/hetmatpy - HIN library that represents as matrices.
github.com/csiesheep/hin2vec

Prominent Security Researchers using HIN:

As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost

Heterogeneous Information Networks + Cyber Security Use cases was originally published by Jason Trost at covert.io on January 20, 2020.

Auxiliary Loss Optimization for Hypothesis Augmentation for DGA Domain Detection

2019-07-18T00:00:00-00:00

This post outlines some experiments I ran using Auxiliary Loss Optimization for Hypothesis Augmentation (ALOHA) for DGA domain detection.

(Update 2019-07-18) After getting feedback from one of the ALOHA paper authors, I modified my code to set loss weights for the auxiliary targets as they did in their paper (Weights used: main target 1.0, auxiliary targets 0.1). I also added 3 word-based/dictionary DGAs. All diagrams and metrics have been updated to reflect this.

I recently read the paper ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation by Ethan M. Rudd, Felipe N. Ducau, Cody Wild, Konstantin Berlin, and Richard Harang from Sophos Lab. This research will be presented at USENIX Security 2019 in Aug, 2019. The paper finds that supplying more prediction targets to a model at training time can improve its performance on the primary prediction target. More specifically, the authors modify a deep learning based malware detection model (a binary classifier) to also predict things like individual vendor predictions, malware tags, and number of VT detections. Their “auxiliary loss architecture yields a significant reduction in detection error rate (false negatives) of 42.6% at a false positive rate (FPR) of 10^−3 when compared to a similar model with only one target, and a decrease of 53.8% at 10^−5 FPR.”

Figure 1 from the paper

A schematic overview of our neural network architecture. Multiple output layers with corresponding loss functions are optionally connected to a common base topology which consists of five dense blocks. Each block is composed of a Dropout, dense and batch normalization layers followed by an exponential linear unit (ELU) activation of sizes 1024, 768, 512, 512, and 512. This base, connected to our main malicious/benign output (solid line in the figure) with a loss on the aggregate label constitutes our baseline architecture. Auxiliary outputs and their respective losses are represented in dashed lines. The auxiliary losses fall into three types: count loss, multi-label vendor loss, and multi-label attribute tag loss

This paper made me wonder how well this technique would work for other areas in network security such as:

Detecting malicious URLs from Exploit Kits - possible auxiliary labels: Exploit Kit names, Web Proxy Categories, etc.
Detecting malicious C2 domains - possible auxiliary labels: malware family names, DGA or not, proxy categories.
Detecting DGA Domains - possible auxiliary labels: malware families, DGA type (wordlist, hex based, alphanumeric, etc).

I decided to explore the last use case of how well auxiliary loss optimizations would improve DGA domain detections. For this work I identified four DGA models and used these as baselines. Then I ran some experiments. All code from these experiments is hosted here. This code is based heavily off of Endgame’s dga_predict, but with many modifications.

Data:

For this work, I used the same data sources selected by Endgame’s dga_predict (but I added 3 additional DGAs: gozi, matsnu, and suppobox).

Alexa top 1m domains
classical DGA domains for the following malware families: banjori, corebot, cryptolocker, dircrypt, kraken, lockyv2, pykspa, qakbot, ramdo, ramnit, and simda.
Word-based/dictionary DGA domains for the following malware families - gozi, matsnu, and suppobox

Baseline Models:

I used 4 baseline binary models + 4 extensions of these models that use Auxiliary Loss Optimization for Hypothesis Augmentation.

Baseline Models:

Bigram - Endgame’s Bigram model from dga_predict.
LSTM - Endgame’s LSTM model from dga_predict.
CNN - CNN adapted from Keegan Hines’s snowman.
LSTM + CNN - CNN adapted from Keegan Hines’s snowman, combined with an LSTM as defined by Deep Learning For Realtime Malware Detection (ShmooCon 2018)’s LSTM + CNN (see 13:17 for architecture) by Domenic Puzio and Kate Highnam.

ALOHA Extended Models (each simply uses the 11 malware families as additional binary labels):

ALOHA CNN
ALOHA Bigram
ALOHA LSTM
ALOHA CNN+LSTM
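
To make "malware families as additional binary labels" concrete, here is a minimal sketch of how one training label could be expanded into a main target plus one auxiliary target per family. The function name and the shortened family list are hypothetical and are not taken from the experiment code.

```python
# Shortened, hypothetical family list; the real experiments used more families.
FAMILIES = ["banjori", "corebot", "cryptolocker"]

def make_targets(label, families=FAMILIES):
    """Return (main_target, aux_targets) for one training label.

    main_target is 1 for any DGA family and 0 for 'benign'.
    aux_targets holds one binary indicator per malware family.
    """
    main_target = 0 if label == "benign" else 1
    aux_targets = [1 if label == family else 0 for family in families]
    return main_target, aux_targets

benign_targets = make_targets("benign")
corebot_targets = make_targets("corebot")
```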

I trained each of these models using the default settings as provided by dga_predict (except, I added stratified sampling based on the full labels: benign + malware families):

training splits: 76% training, 4% validation, 20% testing
all models were trained with a batch size of 128
The CNN, LSTM, and CNN+LSTM models used up to 25 epochs, while the bigram models used up to 50 epochs.
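
The stratified 76% / 4% / 20% split described above can be sketched as follows. This is a standard-library illustration, not the code used in the experiments. It assumes samples are (label, domain) pairs, and the function name and seed are my own choices.

```python
import random
from collections import defaultdict

def stratified_split(samples, fractions=(0.76, 0.04, 0.20), seed=0):
    """Split (label, domain) samples into train/validation/test per label."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[0]].append(sample)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to testing
    return train, val, test

# Toy data: 100 benign domains and 50 domains from one family.
samples = [("benign", f"b{i}.com") for i in range(100)]
samples += [("qakbot", f"q{i}.net") for i in range(50)]
train, val, test = stratified_split(samples)
```

Splitting within each label keeps the benign/family proportions roughly the same across the three sets, which matters when some families are much smaller than the benign class.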

Below shows counts of how many of each DGA family were used and how many Alexa top 1m domains were included (denoted as “benign”).

In [1]: import pickle

In [2]: from collections import Counter

In [3]: data = pickle.loads(open('traindata.pkl', 'rb').read())

In [4]: Counter([d[0] for d in data]).most_common(100)
Out[4]: 
[('benign', 139935),
 ('qakbot', 10000),
 ('dircrypt', 10000),
 ('pykspa', 10000),
 ('corebot', 10000),
 ('kraken', 10000),
 ('suppobox', 10000),
 ('gozi', 10000),
 ('ramnit', 10000),
 ('matsnu', 10000),
 ('locky', 9999),
 ('banjori', 9984),
 ('simda', 9984),
 ('ramdo', 9984),
 ('cryptolocker', 9984)]

Results

Model AUC scores (sorted by AUC):

aloha_bigram 0.9435
bigram 0.9444
cnn 0.9817
aloha_cnn 0.9820
lstm 0.9944
aloha_cnn_lstm 0.9947
aloha_lstm 0.9950
cnn_lstm 0.9957

Overall, by AUC, the ALOHA technique only improved the LSTM and CNN models, and only marginally. The ROC curves show reductions in error rates at very low false positive rates (between 10^-5 and 10^-3), similar to the gains seen in the ALOHA paper, although the paper’s gains appeared much larger.
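
Since AUC summarizes the whole curve, comparing models at a fixed low false positive rate can be more telling. Below is a small sketch of computing the best true positive rate achievable at or below a given FPR. The function name and toy scores are assumptions for illustration, and it assumes both classes are present in the labels.

```python
def tpr_at_fpr(labels, scores, max_fpr):
    """Highest TPR achievable while keeping FPR <= max_fpr.

    labels: 1 for malicious, 0 for benign. scores: higher means more malicious.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)

    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score fall on the same side of any threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break
        best = tp / n_pos
    return best

# Toy example: 3 malicious and 10 benign samples.
labels = [1, 1, 1] + [0] * 10
scores = [0.9, 0.8, 0.3] + [0.85] + [0.1] * 9
tpr_strict = tpr_at_fpr(labels, scores, max_fpr=0.0)
tpr_loose = tpr_at_fpr(labels, scores, max_fpr=0.1)
```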

ROC: All Models Linear Scale

ROC: All Models Log Scale

ROC: Bigram Models Log Scale

ROC: CNN Models Log Scale

ROC: CNN+LSTM Models Log Scale

ROC: LSTM Models Log Scale

Heatmap

Below is a heatmap showing the percentage of detections across all the malware families for each model. Low numbers are good for the benign label (top row), high numbers are good for all the others.

Note the last 3 rows are all word-based/dictionary DGAs. It is interesting, although not too surprising, that the models that include LSTMs tended to do better against these DGAs.

I annotated with green boxes places where the ALOHA models did better. This seems to be most apparent with the models that include LSTMs and for the word-based/dictionary DGAs.

Future Work:

These are some areas of future work I hope to have time to try out.

Add more DGA generators to the project, especially word-based / dictionary DGAs, and see how the models react. I have identified several (see “Word-based / Dictionary-based DGA Resources” from here for more info).
Try incorporating other auxiliary targets like:
- Type of DGA (hex based, alphanumeric, custom alphabet, dictionary/word-based, etc)
- Classical DGA domain features like string entropy, count of longest consecutive consonant string, count of longest consecutive vowel string, etc. I am curious if forcing the NN to learn these would improve its primary scoring mechanism.
- Metadata from VT domain report.
- Summary / stats from Passive DNS (PDNS).
- Features from various aspects of the domain’s whois record.
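
For reference, the classical string features mentioned in the list above are cheap to compute. The sketch below shows one possible definition of each. The function names are mine, and treating every non-vowel letter as a consonant is a simplifying assumption.

```python
import math
from collections import Counter

VOWELS = set("aeiou")

def shannon_entropy(s):
    """Shannon entropy (bits per character) of the string's character distribution."""
    if not s:
        return 0.0
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def longest_run(s, predicate):
    """Length of the longest run of consecutive characters matching predicate."""
    best = current = 0
    for ch in s:
        if predicate(ch):
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best

def dga_features(label):
    """Classical features for a single domain label (without the suffix)."""
    label = label.lower()
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "longest_consonant_run": longest_run(
            label, lambda c: c.isalpha() and c not in VOWELS),
        "longest_vowel_run": longest_run(label, lambda c: c in VOWELS),
    }

features = dga_features("google")
```

Used as auxiliary targets, these would be regression outputs rather than binary labels, so each would need its own loss (for example, mean squared error).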

If you enjoyed this post, you may be interested in my other recent post on Getting Started with DGA Domain Detection Research. Also, please see more Security Data Science blog posts at my personal blog: covert.io.

As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost

Auxiliary Loss Optimization for Hypothesis Augmentation for DGA Domain Detection was originally published by Jason Trost at covert.io on July 17, 2019.

Getting Started with DGA Domain Detection Research

2020-03-22T00:00:00-00:00

This post provides resources for getting started with research on Domain Generation Algorithm (DGA) Domain Detection.

DGA Domains are commonly used by malware as a mechanism to maintain a command and control (C2) and make it more difficult for defenders to block. Prior to DGA domains, most malware used a small hardcoded list of IPs or domains. Once these IPs / domains were discovered they could be blocked by defenders or taken down for abuse. DGA domains make this more difficult since the C2 domain changes frequently and enumerating and blocking all generated domains can be expensive.

I have recently been working on a research project related to DGA detection (hopefully it will turn into a blog post or a presentation somewhere), and it occurred to me that DGA detection is probably one of the most accessible areas for those getting into security data science, due to the availability of so much labelled data and so many open source implementations of DGA detection. One might argue that this means it is not an area worth researching due to saturation, but I think that depends on your situation/goals. This short post outlines some of the resources that I found useful for DGA research.

Data:

This section lists some domain lists and DGA generators that may be useful for creating “labelled” DGA domain lists.

DGA Data:

DGArchive Large private collection of DGA related data. This contains ~88 csv files of DGA domains organized by malware family. DGArchive is password protected and if you want access you need to reach out to the maintainer.
Bambenek Feeds (see “DGA Domain Feed”).
Netlab 360 DGA Feeds

DGA Generators:

baderj/domain_generation_algorithms (276-stars on Github) by Johannes Bader - DGA algorithms implemented in python.
andrewaeva/DGA (123-stars on Github) - smaller collection of DGA algorithms and data, but fills in some of the gaps from domain_generation_algorithms.
pchaigno/dga-collection (37-stars on Github)

Word-based / Dictionary-based DGA Resources:

Below are all Malware Families that use word-based / dictionary DGAs, meaning their domains consist of 2 or more words selected from a list/dictionary and concatenated together. I separate these out since they are different than most other “classical” DGAs.

“Benign” / Non-DGA Data:

This section lists some domain lists that may be useful for creating “labelled” benign domain lists. Several academic papers use one or more of these sources, but they generally create derivatives that represent the Stable N-day Top X Sites (e.g. Stable Alexa 30-day top 500k, meaning domains from the Alexa top 500k that have been on the list for each of the last 30 days; the Alexa data needs to be downloaded each day for 30+ days to create this since only today’s snapshot is provided by Amazon). This filters out domains that become popular for a short amount of time but then drop off, as sometimes happens with malicious domains.
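
The Stable N-day Top X idea boils down to a set intersection over daily snapshots. Here is a minimal sketch, assuming each snapshot has already been loaded as a list of domains ordered by rank. The function name and the toy data are hypothetical.

```python
def stable_top_domains(daily_snapshots, top_x):
    """Domains present in the top_x of every daily snapshot.

    daily_snapshots: list of ranked domain lists, one per day, best rank first.
    """
    if not daily_snapshots:
        return set()
    stable = set(daily_snapshots[0][:top_x])
    for snapshot in daily_snapshots[1:]:
        stable &= set(snapshot[:top_x])
    return stable

# Toy example with three days of tiny "top lists".
days = [
    ["a.com", "b.com", "c.com"],
    ["b.com", "a.com", "d.com"],
    ["a.com", "e.com", "b.com"],
]
stable_top2 = stable_top_domains(days, top_x=2)
stable_top3 = stable_top_domains(days, top_x=3)
```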

Update (2020-03-22) - More Heuristics for Benign training set curation:

Excerpt from Inline Detection of DGA Domains Using Side Information (page 12)

The benign samples are collected based on a predefined set of heuristics as listed below:

Domain name should have valid DNS characters only (digits, letters, dot and hyphen)

Domain has to be resolved at least once for every day between June 01, 2019 and July 31, 2019.

Domain name should have a valid public suffix

Characters in the domain name are not all digits (after removing ‘.’ and ‘-’)

Domain should have at most four labels (Labels are sequence of characters separated by a dot)

Length of the domain name is at most 255 characters

Longest label is between 7 and 64 characters

Longest label is more than twice the length of the TLD

Longest label is more than 70% of the combined length of all labels

Excludes IDN (International Distribution Network) domains (such as domains starting with xn--)

Domain must not exist in DGArchive
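
Several of the heuristics above are purely syntactic and can be checked without any external data. The sketch below covers only those. It skips the DNS resolution history, public suffix validity, and DGArchive checks, and it expects the caller to supply the domain's suffix (for example from tldextract). The function name is my own, and this is not the paper's code.

```python
import re

VALID_CHARS = re.compile(r"^[a-z0-9.-]+$")

def passes_syntactic_heuristics(domain, suffix):
    """Apply the syntactic benign-sample heuristics to one domain.

    suffix is the public suffix of the domain (e.g. 'org' or 'co.uk'),
    supplied by the caller.
    """
    domain = domain.lower().rstrip(".")
    if not VALID_CHARS.match(domain):
        return False  # only digits, letters, dot and hyphen allowed
    if len(domain) > 255:
        return False
    labels = domain.split(".")
    if len(labels) > 4:
        return False
    if any(label.startswith("xn--") for label in labels):
        return False  # exclude IDN (punycode) labels
    if domain.replace(".", "").replace("-", "").isdigit():
        return False  # all digits once '.' and '-' are removed
    longest = max(len(label) for label in labels)
    if not 7 <= longest <= 64:
        return False
    if longest <= 2 * len(suffix):
        return False
    if longest <= 0.7 * sum(len(label) for label in labels):
        return False
    return True

ok = passes_syntactic_heuristics("wikipedia.org", "org")
too_short = passes_syntactic_heuristics("google.com", "com")
```

Note how strict the label-length rules are: a short but perfectly legitimate name like google.com is rejected because its longest label has fewer than 7 characters.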

Utilities:

Domain Parser:

When parsing the various domain lists, tldextract is very helpful for stripping off TLDs or subdomains if desired. I have seen several projects attempt to parse domains using “split('.')” or “domain[:-3]”. This does not work very well since a domain’s suffix can contain multiple “.”s (e.g. .co.uk).

Installation:

pip install tldextract

Example:

In [1]: import tldextract
In [2]: e = tldextract.extract('abc.www.google.co.uk')

In [3]: e
Out[3]: ExtractResult(subdomain='abc.www', domain='google', suffix='co.uk')

In [4]: e.domain
Out[4]: 'google'

In [5]: e.subdomain
Out[5]: 'abc.www'

In [6]: e.registered_domain
Out[6]: 'google.co.uk'

In [7]: e.fqdn
Out[7]: 'abc.www.google.co.uk'

In [8]: e.suffix
Out[8]: 'co.uk'
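
For contrast, here is what the naive approaches mentioned earlier produce on the same kind of input. The helper name is hypothetical. It simply assumes the suffix is always the last label, which is exactly the assumption that breaks.

```python
def naive_registered_domain(fqdn):
    """Naive approach: assume the public suffix is always the last label."""
    return ".".join(fqdn.split(".")[-2:])

naive_result = naive_registered_domain("abc.www.google.co.uk")
# naive_result is 'co.uk', whereas tldextract reports 'google.co.uk'

sliced = "google.co.uk"[:-3]
# sliced is 'google.co': the slice only removed '.uk'
```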

Domain Resolution:

During the course of your research you may need to perform DNS resolutions on lots of DGA domains. If you do this, I highly recommend setting up your own bind9 server on Digital Ocean or Amazon and using adnshost (a utility from adns). If you perform the DNS resolutions from your home or office, your ISP may interfere with the DNS responses because they will appear malicious, which can bias your research. If you use a provider’s recursive nameservers, you may violate the acceptable use policy (AUP) due to the volume AND the provider may also interfere with the responses.

Adnshost enables high throughput / bulk DNS queries to be performed asynchronously. It will be much faster than performing the DNS queries synchronously (one after the other).

Here is an example of using adnshost (assuming you are running it from the Bind9 server you set up):

cat huge-domains-list.txt | adnshost \
    --asynch \
    --config "nameserver 127.0.0.1" \
    --type a \
    --pipe \
    --addr-ipv4-only > results.txt

This article should get you most of the way there with setting up the bind9 server.
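
If you would rather stay in Python, the same "many queries in flight at once" idea can be sketched with asyncio. This is a different technique from adnshost and will likely be slower, since asyncio's getaddrinfo runs blocking lookups in a thread pool. It also uses whatever resolver the system is configured with, so the advice above about using your own nameserver still applies. The function names and the max_in_flight default are my own assumptions.

```python
import asyncio
import socket

async def system_lookup(domain):
    """Resolve one domain to a sorted list of IPv4 addresses via the system resolver."""
    loop = asyncio.get_running_loop()
    try:
        infos = await loop.getaddrinfo(
            domain, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # NXDOMAIN and other resolution failures
    return sorted({info[4][0] for info in infos})

async def resolve_all(domains, lookup=system_lookup, max_in_flight=100):
    """Resolve many domains concurrently, capped at max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(domain):
        async with semaphore:
            return domain, await lookup(domain)

    return await asyncio.gather(*(bounded(d) for d in domains))

# Example usage (performs real DNS lookups):
# results = asyncio.run(resolve_all(["example.com", "example.org"]))
```

The lookup parameter exists so a different resolver (or a stub for testing) can be swapped in without touching the concurrency logic.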

Models:

This section provides links to a few models that could be used as baselines for comparison.

dga_predict’s LSTM and Bigram model from Endgame.
snowman’s CNN model from Keegan Hines. This is not specifically designed for DGA, but it works for this.
matthoffman/degas - DGA-generated domain detection using deep learning models.
#dga-detection, #dga, and #dga-domains on Github - these tags provide other DGA related projects (DGA domain generators, DGA detection, DGA domain lists).
BKCS-HUST/LSTM-MI

Research:

I hope this is helpful. As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost

Getting Started with DGA Domain Detection Research was originally published by Jason Trost at covert.io on July 16, 2019.

Security Data Science Learning Resources

2019-05-05T00:00:00-00:00

This short post catalogs some resources that may be useful for those interested in security data science. It is not meant to be an exhaustive list. It is meant to be a curated list to help you get started.

Staying Current with Security Data Science

Here is my current strategy for staying current with security data science research. It leans heavier towards academic research since this is what interests me at the moment.

Google Scholar Publication alerts on known respected researchers.
Google Scholar Citation alerts on interesting or noteworthy papers.
Follow security ML researchers on Twitter and Medium. They frequently share interesting and cutting edge research papers / videos / blogs.
Periodically review proceedings from noteworthy security conferences.
Skim published security conference videos from Irongeek looking for topics of interest.

Google Scholar alerts

Citation Alerts on these papers:

“Acing the IOC game: Toward automatic discovery and analysis of open-source cyber threat intelligence”
“AI^2: training a big data machine to defend”
“APT Infection Discovery using DNS Data”
“Beehive: Large-scale log analysis for detecting suspicious activity in enterprise networks”
“Deep neural network based malware detection using two dimensional binary program features”
“Detecting malicious domains via graph inference”
“Detecting malware based on DNS graph mining”
“Detecting structurally anomalous logins in Enterprise Networks”
“Discovering malicious domains through passive DNS data graph analysis”
“EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models”
“Enabling network security through active DNS datasets”
“Feature-based transfer learning for network security”
“Gotcha-Sly Malware!: Scorpion A Metagraph2vec Based Malware Detection System”
“Guilt by association: large scale malware detection by mining file-relation graphs”
“Identifying suspicious activities through dns failure graph analysis”
“Polonium: Tera-scale graph mining and inference for malware detection”
“Segugio: Efficient behavior-based tracking of malware-control domains in large ISP networks”

New article alerts on these authors with the bolded being the most relevant / interesting to me.

Alina Oprea - heavily focused on operational security ML.
Josh Saxe, Rich Harang, and Konstantin Berlin - heavily focused on Malware detection/analytics using ML. Also a published book author.
Manos Antonakakis and Roberto Perdisci - heavily focused on network security analytics using ML with a specialty in DNS traffic.
Balduzzi Marco
Battista Biggio
Chaz Lever
Christopher Kruegel
Damon McCoy
David Dagon
David Freeman
Gianluca Stringhini
Giovanni Vigna
Guofei Gu
Han Yufei
Hossein Siadati
Issa Khalil
Jason (Iasonas) Polakis
Michael Donald Bailey
Michael Iannacone
Nick Feamster
Niels Provos
Nir Nissim
Patrick McDaniel
Stefan Savage
Steven Noel
Terry Nelms
Ting-Fang Yen
Vern Paxson
Wenke Lee
Yacin Nadji
Yanfang (Fanny) Ye
Yizheng Chen
Yuval Elovici

Twitter

Twitter can be a gold mine for new and relevant ideas, blogs, presentations, etc for security data science. You just need to make sure you continually follow the right folks. Here is a short list of thought leaders in this space (if I left you off it is my oversight so please don’t take offense).

For a more exhaustive list of others I would recommend following on Twitter, see this gist. This list is focused on Threat Intel, Threat Hunting, Detection Engineering, IR, and Security Engineering. It is not exhaustive, but is a good start.

Conferences

Below are several interesting security conferences where research is published on security data science topics. It is a good idea to be on the lookout for the proceedings from these events.

This page is also an excellent resource in general for top academic security conferences: Top Academic Security conferences list. The major industry focused security conferences like Blackhat, RSA, Defcon, BSides*, DerbyCon, and ShmooCon all frequently have talks relevant to security data science, but this is not their primary focus, so they are not explicitly called out above.

Learning Resources

These resources will help you build a baseline of knowledge in Cyber Security and Machine Learning.

Books

Security:

Extrusion Detection: Security Monitoring for Internal Intrusions by Richard Bejtlich
Intelligence-Driven Incident Response: Outwitting the Adversary by Scott J. Roberts and Rebekah Brown
Counter Hack Reloaded: A Step-by-Step Guide to Computer Attacks and Effective Defenses (2nd Edition) by Edward Skoudis and Tom Liston

Machine Learning / Data Science:

Network Security Through Data Analysis: Building Situational Awareness by Michael S Collins
Malware Data Science: Attack Detection and Attribution by Joshua Saxe and Hillary Sanders
Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition by Sebastian Raschka and Vahid Mirjalili
Deep Learning with Python by Francois Chollet

Courses

I hope this is helpful, and I would be interested to hear about other resources that you find useful. Please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost

Security Data Science Learning Resources was originally published by Jason Trost at covert.io on May 05, 2019.

6 Short Links on PDNS Graph Analytics for Security

2017-08-08T00:00:00-00:00

A short listing of research papers I’ve read or plan to read that use passive DNS (PDNS) data and graph analytics for identifying malicious domains.

Host-Domain Graphs

Host domain graphs are bipartite graphs mapping hosts/IPs to domains that they either resolved (passive DNS) or visited (web proxy logs). These graphs are used heavily in operational security machine learning papers on network threat hunting as they provide insight into the behavioral patterns across an enterprise or ISP.
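
As a small illustration of the data structure (not taken from any of the papers below), a host-domain bipartite graph can be held in two adjacency maps built from (host, domain) log events. The function names and the "rare domain" query are my own examples of the kind of question such a graph makes easy.

```python
from collections import defaultdict

def build_host_domain_graph(events):
    """Build both sides of a bipartite graph from (host, domain) events."""
    host_to_domains = defaultdict(set)
    domain_to_hosts = defaultdict(set)
    for host, domain in events:
        host_to_domains[host].add(domain)
        domain_to_hosts[domain].add(host)
    return host_to_domains, domain_to_hosts

def rare_domains(domain_to_hosts, max_hosts=1):
    """Domains contacted by at most max_hosts distinct hosts."""
    return sorted(d for d, hosts in domain_to_hosts.items() if len(hosts) <= max_hosts)

# Toy events, e.g. parsed from passive DNS or web proxy logs.
events = [
    ("10.0.0.1", "a.com"),
    ("10.0.0.2", "a.com"),
    ("10.0.0.2", "rare.example"),
]
host_to_domains, domain_to_hosts = build_host_domain_graph(events)
rare = rare_domains(domain_to_hosts)
```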

Detecting Malicious Domains via Graph Inference P. K. Manadhata, S. Yadav, P. Rao, and W. Horne. In Proceedings of 19th European Symposium on Research in Computer Security, Wroclaw, Poland, September 7-11, 2014.

Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data Alina Oprea, Zhou Li, Ting-Fang Yen, Sang H. Chin, and Sumyah Alrwais In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015.

Segugio: Efficient Behavior-Based Tracking of Malware-Control Domains in Large ISP Networks Babak Rahbarinia and Manos Antonakakis In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015

Domain Resolution Graphs (Domain-IP Graphs)

A domain resolution graph is an undirected bipartite graph representing observed domain->IP DNS resolution from Passive DNS data.
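
A minimal sketch of working with such a graph: given observed (domain, IP) resolutions, find pairs of domains that share at least one IP. This only illustrates the structure and is not the method of any paper listed here. The function name and the toy data are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def domains_sharing_ips(resolutions):
    """Map each pair of domains to the set of IPs both were seen resolving to."""
    ip_to_domains = defaultdict(set)
    for domain, ip in resolutions:
        ip_to_domains[ip].add(domain)

    shared = defaultdict(set)
    for ip, domains in ip_to_domains.items():
        for a, b in combinations(sorted(domains), 2):
            shared[(a, b)].add(ip)
    return dict(shared)

# Toy passive DNS observations using documentation IP ranges.
resolutions = [
    ("a.example", "192.0.2.1"),
    ("b.example", "192.0.2.1"),
    ("c.example", "192.0.2.2"),
]
shared = domains_sharing_ips(resolutions)
```

On real data, IPs belonging to shared hosting or CDNs would connect huge numbers of unrelated domains, so some filtering of high-degree IPs is usually needed before the pairs mean much.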

Notos: Building a Dynamic Reputation System for DNS M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. In the Proceedings of the 19th USENIX Security Symposium, Washington, DC, USA, August 11-13, 2010.

EXPOSURE: Finding Malicious Domains using Passive DNS Analysis L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi. In Proceedings of the Network and Distributed System Security Symposium, San Diego, California, USA, February 2011.

Discovering Malicious Domains through Passive DNS Data Graph Analysis Issa Khalil, Ting Yu, and Bei Guan. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security (ASIA CCS ‘16), 2016.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

6 Short Links on PDNS Graph Analytics for Security was originally published by Jason Trost at covert.io on August 14, 2017.

7 Short Links on Operational Security Machine Learning

2017-08-08T00:00:00-00:00

Beehive: Large-Scale Log Analysis for Detecting Suspicious Activity in Enterprise Networks Ting-Fang Yen, Alina Oprea, Kaan Onarlioglu, Todd Leetham, William Robertson, Ari Juels, and Engin Kirda In Proceedings of Annual Computer Security Applications Conference (ACSAC), 2013

An Epidemiological Study of Malware Encounters in a Large Enterprise Ting-Fang Yen, Victor Heorhiadi, Alina Oprea, Michael K. Reiter, and Ari Juels In Proceedings of ACM Conference on Computer and Communications Security (CCS), 2014

Malicious Behavior Detection using Windows Audit Logs Konstantin Berlin, David Slater, Joshua Saxe In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security (AISec) 2015

Operational security log analytics for enterprise breach detection Zhou Li and Alina Oprea In Proceedings of the First IEEE Cybersecurity Development Conference (SecDev), 2016

Lens on the endpoint: Hunting for malicious software through endpoint data analysis. Ahmet Buyukkayhan, Alina Oprea, Zhou Li, and William Robertson. In Proceedings of Recent Advances in Intrusion Detection (RAID), 2017

–Jason
@jason_trost

PS …

many of these papers were found via Alina Oprea’s home page.
The “short links” format was inspired by O’Reilly’s Four Short Links series.

7 Short Links on Operational Security Machine Learning was originally published by Jason Trost at covert.io on August 08, 2017.

The Definitive Security Data Science and Machine Learning Guide

2016-12-31T00:00:00-00:00

This is the Definitive Security Data Science and Machine Learning Guide. It includes books, tutorials, presentations, blog posts, and research papers about solving security problems using data science.

Machine Learning and Security Papers
Deep Learning and Security Papers
Deep Learning and Security Presentations
Security Data Science Blogs
Security Data Science Blogposts / Tutorials
Security Data Science Projects
Security Data
Security Data Science Books
Security Data Science Presentations / Talks
Misc

Machine Learning and Security Papers

Intrusion Detection Papers

Malware Papers

Data Collection Papers

Vulnerability Analysis/Reversing Papers

Anonymity/Privacy/OPSEC/Censorship Papers

Data Mining Papers

Cyber Crime Papers

CND/CNA/CNE/CNO Papers

Deep Learning and Security Papers

Deep Learning and Security Presentations

Security Data Science Blogs

Blogs that frequently cover topics on security data science, machine learning, etc. These are recommended for your RSS feed.

Security Data Science Blogposts / Tutorials

Security Data Science Projects

Open source projects and code applying data science/machine learning to security problems.

Clearcut - a tool that uses machine learning to help you focus on the log entries that really need manual review
Click Security’s Data Hacking Project
Combine - Tool to gather Threat Intelligence indicators from publicly available sources
dga_predict - Predicting Domain Generation Algorithms using LSTMs.
mlsec.org - Various Machine Learning and Computer Security Research projects from mlsec.org.
tiq-test - Threat Intelligence Quotient Test - Dataviz and Statistical Analysis of TI feeds.
CuckooML: Machine Learning for Cuckoo Sandbox https://honeynet.github.io/cuckooml/

Security Data

Collection of Security and Network Data Resources.

See Covert.io Data Page
See Covert.io Threat Intelligence Page
See secrepo.com, which is more comprehensive and should be checked as well.

Security Data Science Books

Security Data Science Presentations / Talks

Misc

awesome-ml-for-cybersecurity

The Definitive Security Data Science and Machine Learning Guide was originally published by Jason Trost at covert.io on January 01, 2017.

Deep Learning Security Papers

2017-01-01T00:00:00-00:00

Update (1/1/2017): I will not be updating this page and instead will make all updates to this page: The Definitive Security Data Science and Machine Learning Guide (see Deep Learning and Security Papers section).

This is another quick post. Over the past few months I started researching deep learning to determine if it may be useful for solving security problems. This post on The Unreasonable Effectiveness of Recurrent Neural Networks was what got me interested in this topic, and I highly recommend reading it in its entirety.

Throughout this research, I came across several academic and professional research papers on security topics that use Deep Learning. What follows is a list of the papers/slides/videos that I found, which may be useful to others. If you have others that you think should be added to this list, please ping me: @jason_trost.

Deep Learning Papers on Security

Deep Learning Presentations on Security

Security Machine Learning Resources:

General Deep Learning Resources:

–Jason
@jason_trost

Deep Learning Security Papers was originally published by Jason Trost at covert.io on December 29, 2016.

covert.io

9 Short links on Network Beacon Detection

10 Short links on Cybersquatting domain detection

Four Short Links on Malicious Lateral Movement Detection

Seven Short Links of Dictionary DGA Detection

Eight Short Links of Recent Cyber Security Data Science Papers

All your SPF are belong to us: Exploring trust relationships through global scale SPF Mining

Intro to Sender Policy Framework (SPF)

Step One: Collection

Step Two: Enrichment

Step Three: Analysis

Network Graphs

Fortune 100 SPF Trusted Networks Graph

Alexa 100 SPF Trusted Networks Graph

Heatmaps

Fortune 1,000 SPF Trusted Networks Heatmap

Alexa 1,000 SPF Trusted Networks Heatmap

Alexa 10,000 SPF Trusted Networks Heatmap

Alexa 100,000 SPF Trusted Networks Heatmap

Alexa 1,000,000 SPF Trusted Networks Heatmap

Alexa Top 1M Domains Trusting /7 or larger networks

Top SPF Includes from all top domain lists (via SPF)

Top SPF Includes from Fortune 1000 (via SPF)

Top SPF Includes from Alexa top1m

Email Security Providers

Top Email Security Provider from all top domain lists (via SPF)

Top Email Security Provider from Alexa 1m (via SPF)

Top Email Security Provider from Fortune 1000 (via SPF)

Top Email Security Provider from Fortune 100 (via SPF)

Fortune 100 Email Security Providers Listing (via SPF)

Domains with 4 or more Email Security Providers (via SPF)

Trusting Cloud Provider Networks

Alexa 1000 Trusting AWS Networks

Alexa 1000 Trusting Azure Networks

Alexa 1000 Trusting GCP Networks

Fortune 1000 Trusting AWS Networks

Fortune 1000 Trusting Azure Networks

Fortune 1000 Trusting GCP Networks

Some other potentially interesting results, not worth dumping here:

Future Work

Resources

Mining DNS MX Records for Fun and Profit

Brief Intro to MX records

Observations:

Results:

Summary:

Analytics:

Top Email Security Providers Overall

Fortune 1000 Email security providers

Fortune 100 domain, MX base domain, email security provider

Alexa 1000 Email security providers

Alexa 100 domain, MX base domain, email security provider

Top Email Security Providers Hosted in AWS

Top Email Security Providers Hosted in Azure

Top Self-hosted Email Security Providers

Misc Findings

Linode / CSC Digital Brand Services

googlemial[.]com

Future Work

Resources

Seven Short Links on Cyber Security Alert Triage Automation

Eight Short Links on Provenance Analytics for Cyber Security

3 Short Links on Popular Domain Lists for Threat Intelligence

6 Short Links on Malware Training Set Creation for Machine Learning

Collecting and Curating IOC Whitelists for Threat Intelligence and Machine Learning Research

Benign Inbound Observables

Benign Outbound Observables

Benign Host-based Observables

Whitelist Exclusions

Building / Maintaining Whitelist Data

Heterogeneous Information Networks + Cyber Security Use cases

Books:

HIN Papers:

Security-related HIN Papers:

Malware Detection / Code Analysis:

Mining the Darkweb / Fraud Detection / Social Network Analysis:

Tutorials:

Code:

Prominent Security Researchers using HIN:

Auxiliary Loss Optimization for Hypothesis Augmentation for DGA Domain Detection