covert.ioJekyll2024-02-03T13:43:50-05:00http://www.covert.io/Jason Trosthttp://www.covert.io/jason.trost@gmail.comhttp://www.covert.io/nine-short-links-on-network-beacon-detection2022-01-16T00:00:00-00:002022-01-16T00:00:00-05:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>In this post I share 9 links to resources related to Network Beacon detection.</p>
<p>Network beacons are continuous automated communications between two hosts. Network beacon detection focuses on identifying this automated traffic, with the primary goal of detecting malware infections or adversary activity that other controls have missed.</p>
<p>Beacon detection is a useful building-block analytic with many different use cases:</p>
<ul>
<li>Threat hunting and malware command and control (C2) detection - helps detect malware missed by anti-virus products.</li>
<li>Detection of automated third party traffic - ongoing automated traffic to third parties may reveal unknown or emerging business relationships.</li>
<li>Identification of automated web application dependencies (within an enterprise or external to it).</li>
</ul>
<p>Links:</p>
<ul>
<li><a href="https://www.elastic.co/blog/identifying-beaconing-malware-using-elastic">Identifying beaconing malware using Elastic</a> <a href="https://github.com/elastic/detection-rules/releases/tag/ML-Beaconing-20211216-1">[code]</a> by Apoorva Joshi, Thomas Veasey, and Craig Chamberlain - uses statistical techniques of coefficient of variation (COV), relative variance (RV), and autocorrelation; implemented as Elastic Painless scripts.</li>
<li>Enterprise Scale Threat Hunting: C2 Beacon Detection with Unsupervised ML and KQL — [<a href="https://posts.bluraven.io/enterprise-scale-threat-hunting-network-beacon-detection-with-unsupervised-machine-learning-and-277c4c30304f">Part 1</a>] [<a href="https://posts.bluraven.io/enterprise-scale-threat-hunting-network-beacon-detection-with-unsupervised-ml-and-kql-part-2-bff46cfc1e7e">Part 2</a>] <a href="https://github.com/Cyb3r-Monk/Threat-Hunting-and-Detection/tree/main/Command%20and%20Control">[code]</a> by Mehmet Ergene</li>
<li><a href="https://ateixei.medium.com/detecting-network-beacons-via-kql-using-simple-spread-stats-functions-c2f031b0736b">Detecting network beacons via KQL using simple spread stats functions</a> by Alex Teixeira</li>
<li><a href="https://techcommunity.microsoft.com/t5/microsoft-sentinel-blog/detect-network-beaconing-via-intra-request-time-delta-patterns/ba-p/779586">Detect Network beaconing via Intra-Request time delta patterns in Azure Sentinel</a> <a href="https://github.com/Azure/Azure-Sentinel/blob/master/Detections/CommonSecurityLog/PaloAlto-NetworkBeaconing.yaml">[code]</a> by Ashwin Patil</li>
<li><a href="https://github.com/activecm/rita/blob/master/pkg/beacon/analyzer.go">RITA (Real Intelligence Threat Analytics) beacon analyzer</a> - uses a simple statistical approach based on 6 measures: connection time delta skew, connection dispersion, connection counts over time, data size skew, data size dispersion, and data size smallness score.</li>
<li><a href="https://github.com/inodee/threathunting-spl/blob/master/hunt-queries/Detecting_Beaconing.md">How to detect beaconing traffic with Splunk?</a> by Alex Teixeira</li>
<li><a href="http://www.austintaylor.io/detect/beaconing/intrusion/detection/system/command/control/flare/elastic/stack/2017/06/10/detect-beaconing-with-flare-elasticsearch-and-intrusion-detection-systems/">Detect Beaconing with Flare, Elastic Stack, and Intrusion Detection Systems</a> <a href="https://github.com/austin-taylor/flare/blob/master/flare/analytics/command_control.py">[code]</a> by Austin Taylor</li>
<li><a href="https://alps-lab.github.io/paper/hu-dsn-2016.pdf">BAYWATCH: Robust Beaconing Detection to Identify Infected Hosts in Large-Scale Enterprise Networks</a> - uses an FFT and periodogram-based technique for identifying automated traffic.</li>
<li><a href="https://publications.waset.org/10004242/malware-beaconing-detection-by-mining-large-scale-dns-logs-for-targeted-attack-identification">Malware Beaconing Detection by Mining Large-scale DNS Logs for Targeted Attack Identification</a></li>
</ul>
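<p>Several of the links above score how regular the gaps between connections are, for example with the coefficient of variation (COV). Below is a minimal Python sketch of that idea. It is not the implementation from any of the linked projects; the function names, the 0.1 threshold, the minimum event count, and the sample timestamps are all assumptions chosen for illustration.</p>

```python
from statistics import mean, pstdev


def interarrival_cov(timestamps):
    """Coefficient of variation (stdev / mean) of gaps between connections."""
    ts = sorted(timestamps)
    deltas = [b - a for a, b in zip(ts, ts[1:])]
    if len(deltas) < 2:
        return None  # too few connections to judge regularity
    avg = mean(deltas)
    if avg == 0:
        return None  # all connections share one timestamp
    return pstdev(deltas) / avg


def looks_like_beacon(timestamps, max_cov=0.1, min_events=10):
    """Flag a host pair whose connection gaps are nearly constant.

    max_cov and min_events are illustrative assumptions, not tuned values.
    """
    cov = interarrival_cov(timestamps)
    return len(timestamps) >= min_events and cov is not None and cov <= max_cov


# Hypothetical data: a 60-second beacon with a few seconds of jitter ...
beacon = [i * 60 + jitter
          for i, jitter in enumerate([0, 1, -1, 2, 0, -2, 1, 0, -1, 1, 0, 2])]
# ... and bursty, human-driven traffic between the same two hosts.
human = [0, 5, 7, 200, 230, 900, 905, 2000, 2400, 2410, 5000, 5003]

print(looks_like_beacon(beacon), looks_like_beacon(human))
```

<p>A low COV only says that the timing is regular. Software update checks and monitoring agents are regular too, so a score like this is better treated as a hunting lead than as a verdict.</p>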
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</p>
<p><a href="http://www.covert.io/nine-short-links-on-network-beacon-detection/">9 Short links on Network Beacon Detection</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on January 16, 2022.</p>http://www.covert.io/10-Short-links-on-Cybersquatting-domain-detection2022-01-08T00:00:00-00:002022-01-08T00:00:00-05:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>In this short blog, I share 3 papers and 7 tools that focus on detecting cybersquatting domains (including typosquatting, homograph, combosquatting, etc.).</p>
<ul>
<li><a href="http://scg.unibe.ch/archive/masters/Fris21a.pdf">Detection of Cybersquatted Domains (Master’s Thesis)</a> by Patrick Frischknecht</li>
<li><a href="https://arxiv.org/pdf/1708.08519.pdf">Hiding in plain sight: A longitudinal study of combosquatting abuse</a></li>
<li><a href="https://www.ndss-symposium.org/wp-content/uploads/2017/09/01_3_1.pdf">Seven months’ worth of mistakes: A longitudinal study of typosquatting abuse</a></li>
</ul>
<p>Tools for generating cybersquatting domains (for use in detection):</p>
<ul>
<li><a href="https://github.com/elceef/dnstwist">https://github.com/elceef/dnstwist</a></li>
<li><a href="https://github.com/atenreiro/opensquat">https://github.com/atenreiro/opensquat</a></li>
<li><a href="http://www.morningstarsecurity.com/research/urlcrazy">http://www.morningstarsecurity.com/research/urlcrazy</a></li>
<li><a href="https://github.com/phar/eyephish">https://github.com/phar/eyephish</a></li>
<li><a href="https://github.com/SquatPhish/2-Distributed-Crawler">https://github.com/SquatPhish/2-Distributed-Crawler</a></li>
<li><a href="https://github.com/SquatPhish/3-Phish-Page-Detection">https://github.com/SquatPhish/3-Phish-Page-Detection</a></li>
<li><a href="https://github.com/SquatPhish/4-Evasion-Obfuscation-Analysis">https://github.com/SquatPhish/4-Evasion-Obfuscation-Analysis</a></li>
</ul>
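<p>To give a feel for what "generating cybersquatting domains" means, here is a small Python sketch that produces three classic typo classes (character omission, repetition, and transposition) for a single label. It is a toy under stated assumptions, not how dnstwist or any other tool listed above works, and <code>typo_candidates</code> and the <code>example</code> label are hypothetical names used only for illustration.</p>

```python
def typo_candidates(label):
    """Return simple typo variants of a domain label (the part before the TLD)."""
    variants = set()
    for i in range(len(label)):
        # Omission: drop the character at position i.
        variants.add(label[:i] + label[i + 1:])
        # Repetition: double the character at position i.
        variants.add(label[:i] + label[i] + label[i:])
    for i in range(len(label) - 1):
        # Transposition: swap adjacent characters i and i + 1.
        variants.add(label[:i] + label[i + 1] + label[i] + label[i + 2:])
    variants.discard(label)  # swapping identical neighbours recreates the input
    variants.discard("")     # a one-character label minus its character
    return sorted(variants)


# Hypothetical brand label; candidates can be matched against newly
# observed domains (for example from passive DNS or zone files).
candidates = {c + ".com" for c in typo_candidates("example")}
print(len(candidates), sorted(candidates)[:5])
```

<p>Real generators cover many more classes (homoglyphs, bitsquatting, keyboard-adjacent substitutions, combosquatting with extra words), which is the main reason to reach for the tools above rather than a sketch like this.</p>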
<p>Many other tools and libraries now exist if you need an implementation in a different language. See these GitHub topics for more: <a href="https://github.com/topics/typosquatting">typosquatting</a>, <a href="https://github.com/topics/homoglyph">homoglyph</a>, and <a href="https://github.com/topics/homograph-attack">homograph-attack</a>.</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</p>
<p><a href="http://www.covert.io/10-Short-links-on-Cybersquatting-domain-detection/">10 Short links on Cybersquatting domain detection</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on January 08, 2022.</p>http://www.covert.io/four-short-links-on-malicious-lateral-movement-detection2021-05-30T00:00:00-00:002021-05-30T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>In this short blog, I share four papers that focus on detecting malicious lateral movement (a.k.a. pivoting, a.k.a. island hopping).</p>
<p><strong>Papers:</strong></p>
<ul>
<li><a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/03/Milcom2018_Liu.pdf">Latte: Large-Scale Lateral Movement Detection</a></li>
<li><a href="https://iris.unimore.it/retrieve/handle/11380/1149159/349875/Detection%20and%20Threat.pdf">Detection and Threat Prioritization of Pivoting Attacks in Large Networks</a></li>
<li><a href="http://dl.ifip.org/db/conf/im/im2021-ws4-grasec/213223.pdf">Towards an Efficient Detection of Pivoting Activity</a></li>
<li><a href="https://uwspace.uwaterloo.ca/bitstream/handle/10012/15074/Bai_Zhenyu.pdf?sequence=3">A Machine Learning Approach for RDP-based Lateral Movement Detection</a></li>
</ul>
<p>Lastly, if you’re interested in discovering more interesting papers like these, use the method I outlined <a href="/security-data-science-learning-resources/">here</a>.</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</p>
<p><a href="http://www.covert.io/four-short-links-on-malicious-lateral-movement-detection/">Four Short Links on Malicious Lateral Movement Detection</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on May 30, 2021.</p>http://www.covert.io/seven-short-links-on-dictionary-dga-detection2021-05-11T00:00:00-00:002021-05-11T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>In this short blog, I share seven papers that focus on detecting Dictionary Domain Generation Algorithm (DGA) domains, a.k.a. word-based DGAs. Dictionary DGAs are algorithms seen in various malware families (suppobox, matsnu, gozi, rovnix, etc.) that are used to periodically generate a large number of domain names by pseudo-randomly concatenating words from a dictionary. These domains may appear legitimate at first glance and are often able to evade blacklisting as well as traditional DGA detections based on entropy or counts of consonants vs vowels. Below is a small sample of rovnix domains from <a href="https://unit42.paloaltonetworks.com/rovnix-declaration-generation-algorithm/">Unit42’s blogpost</a>.</p>
<ul>
<li>kingwhichtotallyadminis[.]biz</li>
<li>thareplunjudiciary[.]net</li>
<li>townsunalienable[.]net</li>
<li>taxeslawsmockhigh[.]net</li>
<li>transientperfidythe[.]biz</li>
<li>inhabitantslaindourmock[.]cn</li>
<li>thworldthesuffer[.]biz</li>
</ul>
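<p>To illustrate why entropy-based detections struggle with these domains, here is a short Python sketch that computes the character-level Shannon entropy of a label. The rovnix label comes from the list above; the benign-looking label and the random-looking label are hypothetical comparison strings picked for illustration, and the function name is my own.</p>

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Character-level Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


dict_dga = "townsunalienable"     # rovnix sample from the list above
benign = "stackoverflow"          # hypothetical benign-looking comparison label
random_like = "xkqzjvbwpfhgtydm"  # hypothetical random-character label

for label in (dict_dga, benign, random_like):
    print(f"{label:20s} {shannon_entropy(label):.3f}")
```

<p>The dictionary DGA label scores close to the benign one and well below the random-looking one, so a simple entropy threshold set to catch random-character DGAs would likely let it through. That gap is what the word-level approaches in the papers below try to close.</p>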
<p><strong>Papers:</strong></p>
<ul>
<li><a href="https://arxiv.org/pdf/2003.12805.pdf">Real-Time Detection of Dictionary DGA Network Traffic using Deep Learning</a></li>
<li><a href="https://machine-learning-and-security.github.io/slides/Mayana-final-of-NIPS-DDGA.pdf">A Word Graph Approach for Dictionary Detection and Extraction in DGA Domain Names</a></li>
<li><a href="http://faculty.washington.edu/mdecock/papers/mpereira2018a.pdf">Dictionary Extraction and Detection of Algorithmically Generated Domain Names in Passive DNS Traffic</a></li>
<li><a href="https://arxiv.org/pdf/1811.08705.pdf">Inline Detection of Domain Generation Algorithms with Context-Sensitive Word Embeddings</a></li>
<li><a href="http://faculty.washington.edu/mdecock/papers/rsivaguru2018a.pdf">An Evaluation of DGA Classifiers</a></li>
<li><a href="https://link.springer.com/chapter/10.1007/978-3-030-00009-7_43">A Novel Detection Method for Word-Based DGA</a></li>
<li><a href="https://res.mdpi.com/d_attachment/electronics/electronics-10-01039/article_deploy/electronics-10-01039-v2.pdf">A Word-Level Analytical Approach for Identifying Malicious Domain Names Caused by Dictionary-Based DGA Malware</a></li>
</ul>
<p>In a previous post, I also shared details on several models that are capable of effectively detecting dictionary DGA domains as well. Please see <a href="/auxiliary-loss-optimization-for-hypothesis-augmentation-for-dga-domain-detection/">Auxiliary Loss Optimization for Hypothesis Augmentation for DGA Domain Detection</a>.</p>
<p>Lastly, if you’re interested in discovering more interesting papers like these, use the method I outlined <a href="/security-data-science-learning-resources/">here</a>.</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</p>
<p><a href="http://www.covert.io/seven-short-links-on-dictionary-dga-detection/">Seven Short Links of Dictionary DGA Detection</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on May 11, 2021.</p>http://www.covert.io/eight-short-links-on-recent-cyber-data-science-papers2021-04-24T00:00:00-00:002021-04-24T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>A short listing of cyber security data science research papers I’ve discovered recently. Each of them uses machine learning or enables ML (i.e. providing training data or enabling creation of training data) to solve various security use cases, and many provide open source code as well.</p>
<ul>
<li><a href="https://liminyang.web.illinois.edu/data/DLS21_BODMAS.pdf">BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware</a>. <a href="https://whyisyoung.github.io/BODMAS/">[data]</a>. Other malware related training data can be found <a href="/data-links/">here</a>.</li>
<li><a href="https://www.usenix.org/system/files/sec21fall-desilva.pdf">Compromised or Attacker-Owned: A Large Scale Classification and Study of Hosting Domains of Malicious URLs</a>. <a href="https://github.com/qcri/compromised">[code]</a> referenced in paper, but not live as of 4/24/2021.</li>
<li><a href="https://arxiv.org/pdf/2104.09806.pdf">DeepHunter: A Graph Neural Network Based Approach for Robust Cyber Threat Hunting</a>. This uses an open source EDR tool named <a href="https://github.com/ION28/BLUESPAWN/">BLUESPAWN</a> that I had not heard of before.</li>
<li><a href="https://www.usenix.org/system/files/sec21fall-downing.pdf">DeepReflect: Discovering Malicious Functionality through Binary Reconstruction</a>. <a href="https://github.com/evandowning/deepreflect">[code]</a></li>
<li><a href="https://www.usenix.org/system/files/sec21fall-severi.pdf">Explanation-Guided Backdoor Poisoning Attacks Against Malware Classifiers</a>. <a href="https://github.com/ClonedOne/MalwareBackdoors">[code]</a></li>
<li><a href="https://arxiv.org/pdf/2104.08618.pdf">EXTRACTOR: Extracting Attack Behavior from Threat Reports</a>. <a href="https://github.com/ksatvat/Extractor">[code]</a></li>
<li><a href="https://arxiv.org/pdf/2104.10034.pdf">On Generating and Labeling Network Traffic with Realistic, Self-Propagating Malware</a>.</li>
<li><a href="https://zakird.com/papers/stratosphere-preprint.pdf">Stratosphere: Finding Vulnerable Cloud Storage Buckets</a>. <a href="https://github.com/stanford-esrg/stratosphere">[code]</a></li>
</ul>
<p>If you’re interested in discovering more interesting papers like these, use the method I outlined <a href="/security-data-science-learning-resources/">here</a>.</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</p>
<p><a href="http://www.covert.io/eight-short-links-on-recent-cyber-data-science-papers/">Eight Short Links of Recent Cyber Security Data Science Papers</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on April 24, 2021.</p>http://www.covert.io/all-your-spf-are-belong-to-us-exploring-trust-relationships-through-gloabl-scale-spf-mining2020-07-06T00:00:00-00:002020-07-06T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>In this post we explore a large collection of Sender Policy Framework (SPF) records to see what they might tell us about global email sending trust relationships and how they relate to email security providers. This is a fast follow-up to my previous post on <a href="http://www.covert.io/mining-mx-records-for-fun-and-profit/">Mining DNS MX Records for Fun and Profit</a>.</p>
<p>Here is the methodology I devised for this (very similar to the previous post, but with <a href="https://github.com/covert-labs/mx-intel/blob/master/parallel_dig.sh">new</a> <a href="https://github.com/covert-labs/mx-intel/blob/master/spf_crawler.py">custom</a> <a href="https://github.com/covert-labs/mx-intel/blob/master/spf_results_parser.py">built</a> <a href="https://github.com/covert-labs/mx-intel/blob/master/SPF-Parse-Enrich.ipynb">tools</a>):</p>
<ol>
<li>Collect a large sample of SPF records via DNS TXT lookups of popular domain names (and recursively resolving SPF “include” domains).</li>
<li>Enrich SPF records with IP intelligence and useful metadata (including <a href="https://github.com/covert-labs/mx-intel/blob/master/email_security_providers.py">email security provider mappings</a>)</li>
<li>Analyze the enriched results.</li>
</ol>
<h2 id="intro-to-sender-policy-framework-spf">Intro to Sender Policy Framework (SPF)</h2>
<p>The Sender Policy Framework (SPF) enables domain name administrators to authorize hosts to use their domain names when sending email (i.e. in the “MAIL FROM” or “HELO” identities in SMTP). One of the goals of SPF is to limit spammers’ ability to spoof email messages. SPF on its own is limited and is usually used together with DKIM and DMARC. SPF records are published using DNS TXT records. SPF-compliant mail receivers use the published SPF records to test the authorization of sending Mail Transfer Agents (MTAs). SPF can be used to build complex policies around who can send email on whose behalf. Below is an example SPF record for Florida State University.</p>
<script src="https://gist.github.com/jatrost/544dfbc979332f6948a2bca065830dc5.js"></script>
<p>According to this SPF record, 146.201.58.212, 146.201.58.213, 146.201.107.145, 146.201.107.249, 192.12.121.23, and 199.188.157.80 are allowed to send email purporting to be from fsu.edu. Also, the SPF records from spf.protection.outlook.com, _spf.qualtrics.com, spf.blackboardconnect.com, servers.mcsv.net, and _spf.mlsend.com should be retrieved and their policies applied as well. Below are the SPF records for each of these domains. As you can see, they include more and more IPs/CIDRs as well as additional SPF includes.</p>
<script src="https://gist.github.com/jatrost/e342a1b77bde98d231cc4ef3f30e71b7.js"></script>
<p>As you can see, SPF forms a chain of trust between the domain owner and all the SPF policies included recursively (potentially crossing several different administrative boundaries). In this post I was hoping to explore this chain of trust at a large scale by collecting a large sample of SPF records and mining them.</p>
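<p>To make the record structure concrete, here is a minimal Python sketch that splits an SPF TXT record into its <code>ip4</code>, <code>ip6</code>, and <code>include</code> terms. It is not the parser from my repo; it ignores other mechanisms and modifiers (<code>a</code>, <code>mx</code>, <code>exists</code>, <code>redirect</code>, macros), and the record shown is a made-up example using documentation address ranges.</p>

```python
def parse_spf(txt):
    """Split an SPF TXT record into ip4/ip6 networks and include targets.

    Returns None when the text is not an SPF version 1 record.
    """
    terms = txt.split()
    if not terms or terms[0].lower() != "v=spf1":
        return None
    result = {"ip4": [], "ip6": [], "include": [], "all": None}
    for term in terms[1:]:
        qualifier = "+"  # "+" (pass) is the default qualifier
        if term[0] in "+-~?":
            qualifier, term = term[0], term[1:]
        # partition splits on the first ":" only, so IPv6 values stay intact.
        name, sep, value = term.partition(":")
        name = name.lower()
        if name in ("ip4", "ip6", "include") and sep:
            result[name].append(value)
        elif name == "all":
            result["all"] = qualifier
    return result


# Hypothetical record (documentation ranges, not a real domain's policy).
record = ("v=spf1 ip4:192.0.2.10 ip4:198.51.100.0/24 "
          "ip6:2001:db8::/32 include:_spf.example.net ~all")
print(parse_spf(record))
```

<p>Each <code>include</code> target is another domain whose SPF record has to be fetched and parsed in turn, which is where the chain of trust comes from.</p>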
<p>Below are some useful resources for understanding SPF:</p>
<ul>
<li><a href="https://tools.ietf.org/html/rfc7208">RFC7208: Sender Policy Framework (SPF) for Authorizing Use of Domains in Email</a></li>
<li><a href="https://dmarcian.com/spf-syntax-table/">SPF Syntax Table</a> - really useful guide for understanding SPF “mechanisms”.</li>
</ul>
<h2 id="step-one-collection">Step One: Collection</h2>
<p>For step one, I built a very <del>crude</del> useful <a href="https://github.com/covert-labs/mx-intel/blob/master/spf_crawler.py">SPF crawler</a> that uses dig (optionally adnshost) to perform DNS TXT requests, parse out SPF records found, and then recursively follow the trail of SPF include records and perform TXT lookups against the included domains.</p>
<p>In order to seed the SPF crawler, I used the same domains I used in my <a href="http://www.covert.io/mining-mx-records-for-fun-and-profit/">previous blog post on mining MX records</a>. I downloaded the <a href="http://s3.amazonaws.com/alexa-static/top-1m.csv.zip">Alexa top 1M domains</a>, <a href="https://web.archive.org/web/*/https://ak.quantcast.com/quantcast-top-sites.zip">Quantcast top 1m domains (from WaybackMachine)</a>, <a href="https://www.domcop.com/top-10-million-domains">Domcop Top 10m domains</a>, <a href="https://majestic.com/reports/majestic-million">Majestic Million Domains</a> and <a href="https://umbrella.cisco.com/blog/cisco-umbrella-1-million">Cisco Umbrella top 1m domains</a>. I identified the registered domain using <a href="https://pypi.org/project/tldextract/">tldextract</a> for each of these and then combined them into a <a href="https://mx-intel-public.s3.amazonaws.com/all-registered-domains.txt.gz">single de-duplicated list</a>. This resulted in ~8.3M unique domain names.</p>
<p>These domains were fed into my SPF crawler and then the results were collected, parsed, and assembled. I ended up backing the SPF crawler with “dig” instead of “adnshost” this time since I found dig was more reliable, completing 23% more DNS requests in an experiment against the Fortune 1000 domains. Dig is single-threaded, but I easily parallelized it using split files and xargs, and its performance ended up being good enough. See <a href="https://github.com/covert-labs/mx-intel/blob/master/parallel_dig.sh">parallel_dig.sh</a> for more details.</p>
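<p>The recursive part of the crawl can be sketched in a few lines of Python. This is a simplified illustration rather than the code in spf_crawler.py: the DNS lookup is injected as a function so the example runs without network access, only terms with a pass qualifier are followed, and the records in the dictionary are hypothetical.</p>

```python
def collect_trusted(domain, lookup, seen=None):
    """Recursively gather ip4/ip6 networks reachable through SPF includes.

    lookup(domain) must return the domain's SPF record text, or None.
    seen tracks visited domains so include loops terminate.
    """
    if seen is None:
        seen = set()
    if domain in seen:
        return set()
    seen.add(domain)
    networks = set()
    record = lookup(domain)
    if not record:
        return networks
    # The first term is assumed to be the "v=spf1" version tag.
    for term in record.split()[1:]:
        if term[0] in "-~?":
            continue  # skip fail/softfail/neutral terms such as "-all"
        name, _, value = term.lstrip("+").partition(":")
        name = name.lower()
        if name in ("ip4", "ip6") and value:
            networks.add(value)
        elif name == "include" and value:
            networks |= collect_trusted(value, lookup, seen)
    return networks


# Hypothetical records, including an include loop back to example.com.
records = {
    "example.com": "v=spf1 ip4:192.0.2.10 include:_spf.example.net -all",
    "_spf.example.net": ("v=spf1 ip4:198.51.100.0/24 include:example.com "
                         "include:_spf.example.org ~all"),
    "_spf.example.org": "v=spf1 ip6:2001:db8::/32 -all",
}
print(sorted(collect_trusted("example.com", records.get)))
```

<p>In a live crawl, <code>lookup</code> would wrap a DNS TXT query (dig, in my case) instead of a dictionary. The <code>seen</code> set matters in practice because include loops do exist in real data.</p>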
<p>Below are a few simple commands as well as example output data collected with my SPF crawler applied to just one domain. As you can see, the assembled output for fsu.edu includes all the IPs and Netblocks from all the SPF includes that it links to, recursively.</p>
<script src="https://gist.github.com/jatrost/ea5826ea2d5596499507e5ad78bec398.js"></script>
<p>Below is the same information, visualized as a network (and enriched with ASN info from Maxmind).</p>
<p><a href="/images/spf/fsu-networkx.png"><img src="/images/spf/fsu-networkx.png" width="600px" /></a></p>
<h2 id="step-two-enrichment">Step Two: Enrichment</h2>
<p>For this step, I reused <a href="https://github.com/covert-labs/mx-intel">a lot of the code</a> from my previous blog post on <a href="http://www.covert.io/mining-mx-records-for-fun-and-profit/">Mining MX records</a> and performed the following enrichments:</p>
<ol>
<li>Maxmind ASN</li>
<li>Maxmind Country</li>
<li>Cloud Provider IP Lookups for AWS, Azure, and GCP</li>
<li>Alexa Ranking</li>
<li><a href="https://github.com/covert-labs/mx-intel/blob/master/email_security_providers.py">Email Security Provider mapping</a></li>
</ol>
<p><a href="https://pypi.org/project/netaddr/">netaddr</a>, <a href="https://pypi.org/project/tldextract/">tldextract</a>, and <a href="https://github.com/Figglewatts/cidr-trie">cidr-trie</a> were useful during this stage.</p>
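<p>As an illustration of the cloud provider lookup, here is a Python sketch that checks whether an SPF-trusted network overlaps any of a provider's published prefixes. It uses a linear scan with the standard library <code>ipaddress</code> module rather than the trie-based lookup I used, and the provider names and prefixes are hypothetical placeholders (documentation ranges), not real cloud address space.</p>

```python
import ipaddress

# Hypothetical provider prefixes. Real ones would be loaded from each
# provider's published IP range list.
PROVIDER_PREFIXES = {
    "cloud-a": ["198.51.100.0/24"],
    "cloud-b": ["203.0.113.0/25"],
}


def providers_overlapping(cidr):
    """Return the providers whose prefixes overlap the given IP or CIDR."""
    net = ipaddress.ip_network(cidr, strict=False)
    hits = []
    for provider, prefixes in PROVIDER_PREFIXES.items():
        for prefix in prefixes:
            candidate = ipaddress.ip_network(prefix)
            # Only compare networks of the same IP version.
            if net.version == candidate.version and net.overlaps(candidate):
                hits.append(provider)
                break
    return hits


for trusted in ["198.51.100.17", "203.0.113.200", "203.0.112.0/23", "0.0.0.0/0"]:
    print(trusted, providers_overlapping(trusted))
```

<p>A linear scan is fine for a handful of lookups. With millions of SPF networks against thousands of provider prefixes, a prefix trie is the better fit.</p>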
<h2 id="step-three-analysis">Step Three: Analysis</h2>
<p>Through this analysis, I hoped to answer the following questions:</p>
<ul>
<li>What is the largest trusted network size (both single CIDR and aggregate network space)? … HUGE</li>
<li>Could I find any blatantly misconfigured SPF records? … YES</li>
<li>What does SPF data show about email security providers? … A lot that MX doesn’t</li>
<li>What are the most “included” SPF includes? … Not many surprises here</li>
<li>Does SPF augment the MX record mining (give more coverage? reveal things previously hidden? or 100% redundant?) … YES!</li>
<li>Are domains trusting IP space from cloud providers that may be re-usable (i.e. AWS EC2)? … YES!</li>
</ul>
<p>Below are some outputs and commentary from this project’s Jupyter notebook that answer the questions above.</p>
<h2 id="network-graphs">Network Graphs</h2>
<p>These <a href="https://networkx.github.io/">networkx</a> visualizations of the Fortune 100 and Alexa 100 are a bit of a mess, but they should get the point across of how interconnected the SPF trust relationships are.</p>
<h3 id="fortune-100-spf-trusted-networks-graph">Fortune 100 SPF Trusted Networks Graph</h3>
<p><a href="/images/spf/fortune-100-networkx.png"><img src="/images/spf/fortune-100-networkx.png" width="600px" /></a></p>
<h3 id="alexa-100-spf-trusted-networks-graph">Alexa 100 SPF Trusted Networks Graph</h3>
<p><a href="/images/spf/alexa-100-networkx.png"><img src="/images/spf/alexa-100-networkx.png" width="600px" /></a></p>
<h2 id="heatmaps">Heatmaps</h2>
<p>As you can see from the next several heatmaps, as we go beyond the Alexa top 1,000 domains the number of networks trusted drastically increases, and as we hit the Alexa 1m, the entire Internet is trusted (likely due to SPF misconfigurations).</p>
<p>These heatmaps were generated with the awesome <a href="https://github.com/measurement-factory/ipv4-heatmap">ipv4-heatmap</a> tool provided by the <a href="http://www.measurement-factory.com/">Measurement Factory</a>. The code to automate this can be found in my Jupyter Notebook <a href="https://github.com/covert-labs/mx-intel/blob/master/SPF-Parse-Enrich.ipynb">here</a>.</p>
<h3 id="fortune-1000-spf-trusted-networks-heatmap">Fortune 1,000 SPF Trusted Networks Heatmap</h3>
<p><img src="/images/spf/fortune-1000-heatmap.png" width="600px" /></p>
<h3 id="alexa-1000-spf-trusted-networks-heatmap">Alexa 1,000 SPF Trusted Networks Heatmap</h3>
<p><img src="/images/spf/alexa-1000-heatmap.png" width="600px" /></p>
<h3 id="alexa-10000-spf-trusted-networks-heatmap">Alexa 10,000 SPF Trusted Networks Heatmap</h3>
<p><img src="/images/spf/alexa-10000-heatmap.png" width="600px" /></p>
<h3 id="alexa-100000-spf-trusted-networks-heatmap">Alexa 100,000 SPF Trusted Networks Heatmap</h3>
<p><img src="/images/spf/alexa-100000-heatmap.png" width="600px" /></p>
<h3 id="alexa-1000000-spf-trusted-networks-heatmap">Alexa 1,000,000 SPF Trusted Networks Heatmap</h3>
<p><img src="/images/spf/alexa-1000000-heatmap.png" width="600px" /></p>
<h3 id="alexa-top-1m-domains-trusting-7-or-larger-networks">Alexa Top 1M Domains Trusting /7 or larger networks</h3>
<p>As you can see from this list, there are quite a few domains that trust very large networks. Several of these look like misconfigurations. For example, these four domains trust the entire Internet:</p>
<ul>
<li>hitadouble[.]com: 208.67.207.0/0</li>
<li>payukraine[.]com: 0.0.0.0/0</li>
<li>angliss[.]edu[.]au: 0.0.0.0/0</li>
<li>hutkigrosh[.]by: 0.0.0.0/0</li>
</ul>
<p>This domain trusts half of the Internet - salaam[.]af: 175.106.32.0/1</p>
<p>And these five domains trust 1/4 of the Internet. cfe[.]fr appears to have since fixed this apparent misconfiguration, as its TXT record has changed.</p>
<ul>
<li>creativecircle[.]com: 64.4.22.64/2</li>
<li>gevestor[.]de: 91.241.72.0/2</li>
<li>debeersgroup[.]com: 10.47.149.168/2</li>
<li>cfe[.]fr: 82.97.62.0/2</li>
<li>adecco[.]com: 148.105.8.0/2</li>
</ul>
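<p>The "entire", "half", and "1/4" figures follow directly from the prefix length: a /n covers 2^(32-n) of the 2^32 IPv4 addresses. Here is a quick Python check using the standard library. The function name is my own, and masking off the host bits with <code>strict=False</code> is an assumption about how these records are best normalized.</p>

```python
import ipaddress


def ipv4_share(cidr):
    """Return the normalized network and the fraction of IPv4 space it covers."""
    # strict=False masks off host bits, so 64.4.22.64/2 becomes 64.0.0.0/2.
    net = ipaddress.ip_network(cidr, strict=False)
    return str(net), net.num_addresses / 2 ** 32


for cidr in ["208.67.207.0/0", "175.106.32.0/1", "64.4.22.64/2", "10.47.149.168/2"]:
    network, share = ipv4_share(cidr)
    print(f"{cidr:18s} -> {network:14s} {share:.0%} of IPv4 space")
```

<p>Note how the address part barely matters once the prefix is this short: a record that was presumably meant to authorize a small block ends up authorizing whichever quarter, half, or whole of the Internet that block sits in.</p>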
<script src="https://gist.github.com/jatrost/1bb8d1fa91e2346cb1deca6a6e7761e6.js"></script>
<h3 id="top-spf-includes-from-all-top-domain-lists-via-spf">Top SPF Includes from all top domain lists (via SPF)</h3>
<script src="https://gist.github.com/jatrost/4d60851dcab2928e5e82e68714187237.js"></script>
<p>Using all the popular domain names, here is a summary of the top 10 SPF includes.</p>
<p>Major Cloud Email Providers:</p>
<ul>
<li>Microsoft: spf.protection.outlook.com</li>
<li>Google: _spf.google.com</li>
</ul>
<p>Hosting Providers:</p>
<ul>
<li>HostGator: websitewelcome.com</li>
<li>OVH: mx.ovh.com</li>
<li>Bluehost: bluehost.com</li>
</ul>
<p>Commercial Email Marketing companies:</p>
<ul>
<li>MailChimp: servers.mcsv.net</li>
<li>Mandrill: spf.mandrillapp.com (MailChimp add-on)</li>
<li>Sendgrid: sendgrid.net</li>
</ul>
<p>Email Security company:</p>
<ul>
<li>MailChannels: mailchannels.net (more on this later)</li>
</ul>
<h3 id="top-spf-includes-from-fortune-1000-via-spf">Top SPF Includes from Fortune 1000 (via SPF)</h3>
<script src="https://gist.github.com/jatrost/6944e81b2759125be864dde4f3db4ced.js"></script>
<h3 id="top-spf-includes-from-alexa-top1m">Top SPF Includes from Alexa top1m</h3>
<script src="https://gist.github.com/jatrost/b8c209b7149768123e357301a66b89e6.js"></script>
<h2 id="email-security-providers">Email Security Providers</h2>
<p>If you read my previous blog post on <a href="http://www.covert.io/mining-mx-records-for-fun-and-profit/">Mining DNS MX Records for Fun and Profit</a>, you might notice that these top lists look significantly different from the top email providers identified from MX records. The top 5 providers identified in the SPF data are MailChannels, Mimecast, Proofpoint, Solarwinds, and Barracuda. In the MX post, the top 5 were Proofpoint, Mimecast, Deteque, Barracuda, and Solarwinds, and MailChannels was #48 on that list. These top lists use all the popular domains data, which is likely not an accurate reflection of the actual email security market. For the Fortune 1000, the story is less surprising: the top 4 email security providers were nearly identical across SPF and MX records, with only the order differing. I suspect that MailChannels shows up as popular in SPF because it is either the default setting on newly registered domains or the default setting for domains that are parked with certain hosting providers, but I haven’t spent the time to prove or disprove this.</p>
<p><strong>(Update 7/7/2020)</strong> I received this message from <a href="https://www.linkedin.com/in/ksimpson/">Ken Simpson</a>, CEO of MailChannels, that helps explain why there is a mismatch between the MX and SPF counts.</p>
<blockquote>
<p>“You were wondering why MailChannels shows up in a lot of SPF records (actually, we’re number one), but relatively few MX records. MailChannels delivers email for the web hosting industry, with over 700 service provider customers worldwide. To deliver email reliably, they have to add us to their customers’ SPF records. Those same customers often host their inbound email with someone else - GSuite, Microsoft 365, or another provider. Hence the mismatch in SPF and MX records.”</p>
</blockquote>
<p>One other interesting aspect of SPF is that it (potentially) reveals relationships with multiple email security providers. See the “Fortune 100 Email Security Providers Listing (via SPF)” and “Domains with 4 or more Email Security Providers (via SPF)” gists below. In the Fortune 100 list, there are 3 domains with SPF relationships with more than one provider. If you look across all the top domains data, you can see there are many. For anyone who has worked in the cyber security department at a large company, this is not surprising, but it was cool to be able to see this in the data.</p>
<ul>
<li>Domains with 2 SPF relationships with Email Security Providers: 11,393</li>
<li>Domains with 3 SPF relationships with Email Security Providers: 468</li>
<li>Domains with 4 SPF relationships with Email Security Providers: 35</li>
<li>Domains with 5 SPF relationships with Email Security Providers: 1</li>
</ul>
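<p>Counts like these fall out of a simple histogram once each domain has been mapped to the set of email security providers found in its SPF chain. Here is a rough Python sketch of that step. The domains and provider names are hypothetical placeholders rather than rows from my dataset, and the function name is my own.</p>

```python
from collections import Counter


def provider_count_histogram(domain_providers, minimum=2):
    """Count domains by how many distinct providers appear in their SPF chain."""
    return Counter(
        len(providers)
        for providers in domain_providers.values()
        if len(providers) >= minimum
    )


# Hypothetical mapping of domain -> providers seen via SPF includes/networks.
domain_providers = {
    "a.example": {"provider-1"},
    "b.example": {"provider-1", "provider-2"},
    "c.example": {"provider-2", "provider-3"},
    "d.example": {"provider-1", "provider-2", "provider-3"},
}

for count, domains in sorted(provider_count_histogram(domain_providers).items()):
    print(f"Domains with {count} SPF relationships: {domains}")
```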
<h3 id="top-email-security-provider-from-all-top-domain-lists-via-spf">Top Email Security Provider from all top domain lists (via SPF)</h3>
<script src="https://gist.github.com/jatrost/a1a3a0b2c4a7dbc3babefee99ad753eb.js"></script>
<h3 id="top-email-security-provider-from-alexa-1m-via-spf">Top Email Security Provider from Alexa 1m (via SPF)</h3>
<script src="https://gist.github.com/jatrost/657214251b56e6267059330a169b0ea1.js"></script>
<h3 id="top-email-security-provider-from-fortune-1000-via-spf">Top Email Security Provider from Fortune 1000 (via SPF)</h3>
<script src="https://gist.github.com/jatrost/d16bf6b4dc95e1c122b88593a3803e05.js"></script>
<h3 id="top-email-security-provider-from-fortune-100-via-spf">Top Email Security Provider from Fortune 100 (via SPF)</h3>
<script src="https://gist.github.com/jatrost/56344ec6076379a83958087ae63afc3e.js"></script>
<h3 id="fortune-100-email-security-providers-listing-via-spf">Fortune 100 Email Security Providers Listing (via SPF)</h3>
<script src="https://gist.github.com/jatrost/34a30bca216d171ae66f466359e498a5.js"></script>
<h3 id="domains-with-4-or-more-email-security-providers-via-spf">Domains with 4 or more Email Security Providers (via SPF)</h3>
<script src="https://gist.github.com/jatrost/4b505a6dbcbbd55e7e068cf7262fb468.js"></script>
<h2 id="trusting-cloud-provider-networks">Trusting Cloud Provider Networks</h2>
<p>As you can see from the next few tables, many domains transitively trust a lot of Cloud provider IP space for SPF. For some of the larger trusted networks this seems to carry risk, since cloud IP space can be released and reassigned to another customer; see <a href="https://labs.bishopfox.com/tech-blog/2015/10/fishing-the-aws-ip-pool-for-dangling-domains">Fishing the AWS IP Pool for Dangling Domains</a> for a practical example of this. As I mentioned earlier, SPF is usually used with DKIM and DMARC so this data doesn’t paint the whole picture. I am hoping to dive into DMARC/DKIM next.</p>
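<p>To make “transitively trust” concrete, here is a minimal sketch of following <code>include:</code> terms and collecting the <code>ip4:</code> networks a domain ends up trusting. It is not the crawler used for this study: the records are hand-written placeholders (documentation IP ranges and made-up domains), and the function name <code>trusted_networks</code> is an assumption for illustration.</p>

```python
import ipaddress

# Hypothetical, hand-written SPF records keyed by domain (not real data).
SPF_RECORDS = {
    "example.com": "v=spf1 ip4:192.0.2.0/24 include:_spf.mailer.example -all",
    "_spf.mailer.example": "v=spf1 ip4:198.51.100.0/25 include:_spf.cloud.example ~all",
    "_spf.cloud.example": "v=spf1 ip4:203.0.113.0/24 -all",
}


def trusted_networks(domain, records, seen=None):
    """Collect ip4 networks reachable from `domain` by following include: terms."""
    seen = set() if seen is None else seen
    if domain in seen or domain not in records:
        return set()  # skip include loops and domains we have no record for
    seen.add(domain)
    networks = set()
    for term in records[domain].split()[1:]:  # skip the "v=spf1" version tag
        term = term.lstrip("+-~?")            # drop the optional qualifier
        if term.startswith("ip4:"):
            networks.add(ipaddress.ip_network(term[4:], strict=False))
        elif term.startswith("include:"):
            networks |= trusted_networks(term[8:], records, seen)
    return networks


nets = trusted_networks("example.com", SPF_RECORDS)
total_addresses = sum(net.num_addresses for net in nets)
print(sorted(str(n) for n in nets), total_addresses)
```

<p>Summing <code>num_addresses</code> gives a rough sense of how much IP space one SPF record ends up trusting once every include is followed.</p>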
<h3 id="alexa-1000-trusting-aws-networks">Alexa 1000 Trusting AWS Networks</h3>
<script src="https://gist.github.com/jatrost/9349839085ad5362d9cbb8ae981da524.js"></script>
<h3 id="alexa-1000-trusting-azure-networks">Alexa 1000 Trusting Azure Networks</h3>
<script src="https://gist.github.com/jatrost/3fc802f5727049f390f427c5cc43651c.js"></script>
<h3 id="alexa-1000-trusting-gcp-networks">Alexa 1000 Trusting GCP Networks</h3>
<script src="https://gist.github.com/jatrost/e956f69d3d8120898eba5f2b07c19dac.js"></script>
<h3 id="fortune-1000-trusting-aws-networks">Fortune 1000 Trusting AWS Networks</h3>
<script src="https://gist.github.com/jatrost/6b2ce9608a2ec3078243bd2dcdc99cc0.js"></script>
<h3 id="fortune-1000-trusting-azure-networks">Fortune 1000 Trusting Azure Networks</h3>
<script src="https://gist.github.com/jatrost/44eedaa273e05522953581b09d4e5c1d.js"></script>
<h3 id="fortune-1000-trusting-gcp-networks">Fortune 1000 Trusting GCP Networks</h3>
<script src="https://gist.github.com/jatrost/a240ba87eb430cecfe62fe41d5ba752a.js"></script>
<h3 id="some-other-potentially-interesting-results-not-worth-dumping-here">Some other potentially interesting results, not worth dumping here:</h3>
<ul>
<li><a href="https://gist.github.com/jatrost/60cc44bf1b3b4a4617ca8ffb74b726a7">Alexa top1m domains trusting AWS Networks</a></li>
<li><a href="https://gist.github.com/jatrost/97214849df789fd987b585f7321a8907">Alexa top1m domains trusting Azure Networks</a></li>
<li><a href="https://gist.github.com/jatrost/a9b3c5ed9efd6cdd931673d9da6882e1">Alexa top1m domains trusting GCP Networks</a></li>
<li><a href="https://gist.github.com/jatrost/6d789a41a0712b1ef71818234335d5eb">Top Maxmind ASNs of SPF Trusted Networks from Fortune 1000 (via SPF)</a></li>
<li><a href="https://gist.github.com/jatrost/eb52a7f3c19607ce4d08407660fc09aa">Top Maxmind ASNs of SPF Trusted Networks from all top domain lists (via SPF)</a></li>
<li><a href="https://gist.github.com/jatrost/f2e3567eb48788b8ba923852cb4aec96">Top Maxmind ASNs of SPF Trusted Networks from Alexa top1m (via SPF)</a></li>
<li>Graph analytics applied to Fortune 1000 and Alexa 1000: degree centrality, edge betweenness centrality, pagerank, closeness centrality, triangle counts, and connected components stats, see the <a href="https://github.com/covert-labs/mx-intel/blob/master/SPF-Parse-Enrich.ipynb">notebook</a> and search for “print_graph_metrics”.</li>
</ul>
<h3 id="future-work">Future Work</h3>
<ul>
<li>SPF Crawler enhancements: As you can see from the SPF guide I shared above for <a href="https://dmarcian.com/spf-syntax-table/#a">“a”</a> and <a href="https://dmarcian.com/spf-syntax-table/#mx">“mx”</a>, SPF supports some fairly complex policies for allowing certain IPs to send email (esp. the prefix operators on these SPF mechanisms). I did not support these mechanisms in the first version of my SPF crawler, mainly due to the complexity involved. Because of this, my results underrepresent the trust relationships where these mechanisms are used. I hope to add support for them to expand what can be found in this data.</li>
<li>Try some more graph analytics on the entire dataset. In the Jupyter notebook I ran several graph algorithms on subsets of the entire graph (Fortune 100 and Alexa 100). These showed some mildly interesting results, but testing against larger graphs caused graphviz to fail due to some data format issues that I have not had a chance to research.</li>
<li>Perform another study measuring DMARC and DKIM usage across popular domains.</li>
</ul>
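<p>As a starting point for the “a” and “mx” support mentioned in the first item, here is a minimal sketch of parsing those two mechanisms, including the optional qualifier, target domain, and IPv4/IPv6 prefix lengths. The regular expression and the name <code>parse_a_mx</code> are my own illustration, not code from the crawler, and SPF macros are deliberately not handled.</p>

```python
import re

# [qualifier] a|mx [":" domain] ["/" ip4-prefix] ["//" ip6-prefix]
TERM = re.compile(
    r"^(?P<qualifier>[+\-~?]?)"
    r"(?P<mechanism>a|mx)"
    r"(?::(?P<domain>[^/]+))?"
    r"(?:/(?P<ip4_prefix>\d{1,2}))?"
    r"(?://(?P<ip6_prefix>\d{1,3}))?$",
    re.IGNORECASE,
)


def parse_a_mx(term):
    """Parse one SPF term; return None unless it is an "a" or "mx" mechanism."""
    m = TERM.match(term)
    if m is None:
        return None
    return {
        "qualifier": m.group("qualifier") or "+",        # "+" (pass) is the default
        "mechanism": m.group("mechanism").lower(),
        "domain": m.group("domain"),                     # None means the current domain
        "ip4_prefix": int(m.group("ip4_prefix") or 32),  # default: exact IPv4 match
        "ip6_prefix": int(m.group("ip6_prefix") or 128), # default: exact IPv6 match
    }


for term in ["a", "mx:mail.example.com/28", "~a/24", "a:example.com/24//64", "-all"]:
    print(term, parse_a_mx(term))
```

<p>A crawler would still need to resolve the A or MX targets and widen each returned address by the prefix length before counting the resulting networks as trusted.</p>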
<h3 id="resources">Resources</h3>
<p>As usual all notebooks, code, and summary results can be found in Github: <a href="https://github.com/covert-labs/mx-intel">https://github.com/covert-labs/mx-intel</a>.</p>
<p>And all data can be found at the links below:</p>
<ul>
<li><a href="https://mx-intel-public.s3.amazonaws.com/all-registered-domains.txt.gz">all-registered-domains.txt.gz</a> - base domains extracted from combining several popular domains lists together and then uniqued.</li>
<li><a href="https://mx-intel-public.s3.amazonaws.com/all-registered-domains-outputs-combined.txt.gz">all-registered-domains-outputs-combined.txt.gz</a> - raw dig output for all the TXT requests.</li>
<li><a href="https://mx-intel-public.s3.amazonaws.com/spf-results-all-registered-domains.json.gz">spf-results-all-registered-domains.json.gz</a> - the parsed results from running the SPF Crawler against all-registered-domains.txt.gz.</li>
<li><a href="https://mx-intel-public.s3.amazonaws.com/spf-results-all-registered-domains.json.gz">spf-linked-all-registered-domains.json.gz</a> - the assembled results from processing spf-results-all-registered-domains.json.gz. This is the collapsed/combined data that shows all the SPF domains and networks included recursively.</li>
</ul>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p><a href="http://www.covert.io/all-your-spf-are-belong-to-us-exploring-trust-relationships-through-gloabl-scale-spf-mining/">All your SPF are belong to us: Exploring trust relationships through global scale SPF Mining</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on July 06, 2020.</p>http://www.covert.io/mining-mx-records-for-fun-and-profit2020-06-27T00:00:00-00:002020-06-27T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>If you have read my blog before, you may realize that I <a href="http://www.covert.io/three-short-links-on-popular-domain-lists-for-threat-intelligence/">really</a> <a href="http://www.covert.io/post/23331714509/hadoop-dns-mining">love</a> <a href="http://www.covert.io/post/72223750985/dns-census-2013">DNS</a> <a href="http://www.covert.io/six-short-links-on-pdns-graph-analytics-for-security/">data</a> and <a href="http://www.covert.io/getting-started-with-dga-research/">dns</a> <a href="http://www.covert.io/auxiliary-loss-optimization-for-hypothesis-augmentation-for-dga-domain-detection/">analytics</a>. In this post, I share some experiences in using mostly DNS data for identifying the visible footprint of popular email security providers.</p>
<p>This may not be terribly novel, but it was an interesting exploration during a time of boredom for me. This work was initially motivated by two events:</p>
<ol>
<li>When the Proofpoint email protection machine learning vulnerability (<a href="https://nvd.nist.gov/vuln/detail/CVE-2019-20634">CVE-2019-20634</a>) was <a href="https://github.com/moohax/Talks/blob/master/slides/DerbyCon19.pdf">announced by Will Pearce and Nick Landers</a> I got to wondering about how large their deployment footprint was and how one could figure this out, and</li>
<li>A friend at another company mentioned that they were using a specific startup email security provider and I wondered whether I could determine what other companies were also using this same provider.</li>
</ol>
<p>Here is the methodology I devised for this:</p>
<ol>
<li>Collect a large sample of MX records</li>
<li>Enrich MX records with IP intelligence and useful metadata</li>
<li>Sift through the enriched records and identify recognizable email providers’ domains through OSINT (whois, PDNS, Google) and market research.</li>
<li>Profit?!?!?</li>
</ol>
<p>For step one, I downloaded the <a href="http://s3.amazonaws.com/alexa-static/top-1m.csv.zip">Alexa top 1M domains</a>, <a href="https://web.archive.org/web/*/https://ak.quantcast.com/quantcast-top-sites.zip">Quantcast top 1m domains (from WaybackMachine)</a>, <a href="https://www.domcop.com/top-10-million-domains">Domcop Top 10m domains</a>, <a href="https://majestic.com/reports/majestic-million">Majestic Million Domains</a> and <a href="https://umbrella.cisco.com/blog/cisco-umbrella-1-million">Cisco Umbrella top 1m domains</a>. I identified the registered domain using <a href="https://pypi.org/project/tldextract/">tldextract</a> for each of these and then combined them into a <a href="https://mx-intel-public.s3.amazonaws.com/all-registered-domains.txt.gz">single de-duplicated list</a>. This resulted in ~8.3M unique domain names. I then performed bulk MX lookups using <a href="https://www.gnu.org/software/adns/">adnshost</a> against my own bind9 recursive nameserver. In my experience, adnshost works pretty well for bulk DNS resolution at this scale, and it will perform both the lookup requested (MX) as well as a domain resolution (A-lookup). When performing bulk DNS lookups at this scale it is important to add retry logic for failed resolutions as this tends to happen enough to be a problem. I did this using a <a href="https://github.com/covert-labs/mx-intel/blob/master/robust_perform_resolutions.sh">simple bash script</a> that retried failed lookups up to three times.</p>
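<p>The retry idea can be sketched in a few lines of Python. This is not the bash script linked above; it is a rough equivalent under the assumption that lookups are done through some callable (here the hypothetical <code>resolve</code> argument) that raises an exception when a lookup fails.</p>

```python
def resolve_with_retries(domains, resolve, max_retries=3):
    """Resolve every domain, retrying only the ones that failed.

    `resolve` is a hypothetical callable that returns a list of records or
    raises an exception on failure (timeout, SERVFAIL, ...).
    """
    results, pending = {}, list(domains)
    for _ in range(1 + max_retries):  # first pass plus up to max_retries retries
        failed = []
        for domain in pending:
            try:
                results[domain] = resolve(domain)
            except Exception:
                failed.append(domain)
        pending = failed
        if not pending:
            break
    return results, pending  # pending holds the domains that never resolved


# Fake resolver for demonstration: one name succeeds on its third attempt,
# one never resolves.
attempts = {}


def flaky_resolve(domain):
    attempts[domain] = attempts.get(domain, 0) + 1
    if domain == "flaky.example" and attempts[domain] < 3:
        raise TimeoutError(domain)
    if domain == "dead.example":
        raise TimeoutError(domain)
    return ["10 mail." + domain]


resolved, unresolved = resolve_with_retries(
    ["ok.example", "flaky.example", "dead.example"], flaky_resolve
)
print(sorted(resolved), unresolved)
```

<p>Retrying only the failed subset keeps each extra pass small, which matters when the first pass covers millions of names.</p>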
<p>For step two, I then developed a simple Jupyter notebook to parse the adnshost logs and perform the enrichments using <a href="https://pypi.org/project/tldextract/">tldextract</a>, PTR lookups (also using adnshost), Maxmind ASN, Maxmind City, Alexa ranking, and Cloud provider IP Ranges for <a href="https://ip-ranges.amazonaws.com/ip-ranges.json">AWS</a>, <a href="https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519">Azure</a>, and <a href="http://www.gstatic.com/ipranges/cloud.json">GCP</a>. <strong>Side note</strong>: I also attempted to perform SOA lookups on the /24 networks of each IP after noticing some useful patterns with failed PTR lookups. This appears potentially useful for identifying how some of the cloud providers’ IP space is used, but this turned into a rabbit hole since adnshost appears to crash when trying to handle some of the results it received.</p>
<p><img src="/images/mx-intel-diagram.png" border="1px" /></p>
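<p>The cloud provider enrichment boils down to a membership test of each mail server IP against published CIDR ranges. Below is a minimal sketch of that check using Python’s standard <code>ipaddress</code> module. The prefixes are placeholders taken from documentation address space, not real AWS, Azure, or GCP ranges, and <code>cloud_provider</code> is a name I made up for illustration.</p>

```python
import ipaddress

# Placeholder prefixes (RFC 5737 documentation ranges) standing in for the
# real lists published by AWS, Azure, and GCP.
CLOUD_RANGES = {
    "AWS": ["192.0.2.0/24"],
    "Azure": ["198.51.100.0/24"],
    "GCP": ["203.0.113.0/25"],
}

NETWORKS = [
    (name, ipaddress.ip_network(cidr))
    for name, cidrs in CLOUD_RANGES.items()
    for cidr in cidrs
]


def cloud_provider(ip):
    """Return the provider whose range contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for name, net in NETWORKS:
        if addr in net:
            return name
    return None


print(cloud_provider("192.0.2.1"), cloud_provider("203.0.113.200"))
```

<p>A linear scan is fine for a sketch; with the full provider lists and millions of IPs, a prefix tree or sorted interval lookup would likely be a better fit.</p>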
<p>For step three, I did the following:</p>
<ol>
<li>Performed market research on the top email security providers as well as emerging and niche providers. This <a href="https://www.datanyze.com/market-share/email-security--343">site</a> was helpful as well as just googling around and exploring PDNS/Whois data from <a href="https://community.riskiq.com/">PassiveTotal</a> and <a href="https://securitytrails.com/">SecurityTrails</a>.</li>
<li>Scrutinized the top MX server registered domains and ASNs and tried to identify potential security providers.</li>
<li>Sifted through the remaining results trying to identify any obvious providers with “malware”, “phish”, “spam”, or “security” in their domain names.</li>
</ol>
<p>I used this to build two mappings to email security providers: MX server base domains and ASN names. The mappings can be found <a href="https://github.com/covert-labs/mx-intel/blob/master/email_security_providers.py">here</a>. Then I summarized the overall dataset and those results are presented below. <strong>Elephants in the room</strong>: I purposefully did not include Microsoft, Google, and some of the bigger tech companies that provide email service as part of these mappings since I don’t consider them email security companies. This may be debatable since these companies do provide security features through their offerings.</p>
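<p>For illustration, a base-domain mapping can be applied with a simple suffix match on each MX hostname. This is a sketch and not the contents of the linked file: only the pphosted.com entry reflects something stated in this post, while the other entry, the sample hostname, and the function name are invented.</p>

```python
# Illustrative mapping only; apart from pphosted.com the entries are invented.
PROVIDER_BY_BASE_DOMAIN = {
    "pphosted.com": "Proofpoint",
    "mailfilter.example": "Example Mail Filter",
}


def provider_for_mx(mx_host, mapping=PROVIDER_BY_BASE_DOMAIN):
    """Return the provider whose base domain matches the MX hostname, if any."""
    host = mx_host.rstrip(".").lower()  # DNS answers often end with a dot
    for base, provider in mapping.items():
        if host == base or host.endswith("." + base):
            return provider
    return None


# Hypothetical MX hostname under a known base domain.
print(provider_for_mx("mx0a-00000000.pphosted.com."))
```

<p>Matching on <code>"." + base</code> rather than a bare suffix avoids false hits on unrelated domains that merely end with the same characters.</p>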
<h3 id="brief-intro-to-mx-records">Brief Intro to MX records</h3>
<p>For those of you who may not be familiar with DNS MX records, these are DNS Resource Records (RRs) used to map a domain name to the Mail Exchange (MX) servers responsible for accepting email for that domain. MX records are used by Mail Transfer Agents (MTA) in order to identify where email should be sent for a given recipient email address. Below we use the command line utility “dig” to perform an MX lookup on gmail.com to find its Mail Exchange servers. As you can see, at the time of this writing, there are five MX domains that can accept email for gmail.com.</p>
<script src="https://gist.github.com/jatrost/dbe0c0b5b111fdb86e43a56a6b074e24.js"></script>
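<p>If you want to work with such answers programmatically, the short form of a dig MX answer is one “preference exchange” pair per line. Here is a small sketch that parses text in that shape and orders the exchanges by preference (lower values are tried first). The sample text uses example.com placeholders rather than real output.</p>

```python
def parse_mx_lines(text):
    """Parse "preference exchange" lines into (preference, host) tuples,
    sorted so the most preferred exchange comes first."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # ignore blank lines and anything that is not an MX answer
        records.append((int(parts[0]), parts[1].rstrip(".").lower()))
    return sorted(records)


# Placeholder answer text, not real dig output.
sample = """20 mx2.example.com.
10 mx1.example.com.
"""

print(parse_mx_lines(sample))
```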
<p>Besides being critical for identifying where email should be sent, MX records are also useful for mapping out infrastructure and can sometimes be used to identify which email security providers are being used by a company of interest. Below is an example for Florida State University (go Noles!) that reveals that, at the time of this writing, they are using Proofpoint to receive their email. How do we know this? Their mail exchanges are hosted on sub domains of pphosted.com which is owned by Proofpoint.</p>
<script src="https://gist.github.com/jatrost/12d74b8837a84fc1865669a333ae75fa.js"></script>
<p>Some companies obscure their security providers by first receiving their email to other mail exchanges such as ones hosted in their own data center or ones hosted by Google or Microsoft. In this blog, we explore a large DNS dataset to identify interesting info about the visible footprint / market share of email security companies.</p>
<p>All code and data for this study can be found in this Github Repo: <a href="https://github.com/covert-labs/mx-intel">https://github.com/covert-labs/mx-intel</a>.</p>
<h2 id="observations">Observations:</h2>
<ul>
<li>Email security provider OPSEC is remarkably bad in a lot of cases and it is often easy to determine which provider is being used. Anyone who works in cyber security knows it is generally not a good idea to broadcast which cyber security products you are using since it may provide information that can be exploited by the adversary. This is especially true when vulnerabilities are announced in security products.</li>
<li>Since email exchanges can be chained together, only the outermost layer is visible in DNS MX records. For this reason, this research will underestimate the size of each provider’s market share.</li>
<li>Some security providers supply very specialized services (like anti-phishing only) and because of this they are often not the first layer in the email exchange chain. They will be dramatically underrepresented in this study.</li>
</ul>
<h2 id="results">Results:</h2>
<h3 id="summary">Summary:</h3>
<ul>
<li>8,395,595 domains (derived from several top domain lists)</li>
<li>12,910,550 unique MX records (from 5,994,452 unique domains)</li>
<li>2,901,843 Unique Mail server domains</li>
<li>1,940,993 Unique Mail server base domains</li>
<li>25,733 Unique Mail server ASNs</li>
<li>56 Unique Security Providers identified</li>
</ul>
<h3 id="analytics">Analytics:</h3>
<p>Here are the questions I was hoping to answer with the tables presented below:</p>
<ul>
<li>Who are the market leading security companies reflected in the data?</li>
<li>What is the visible market share of email security providers as reflected in DNS records?</li>
<li>What can be inferred from publicly available MX records about email security?</li>
<li>Which email security providers are leveraging cloud hosting? And which cloud hosting environments are used most?</li>
<li>Who are the visible customers of provider X?</li>
</ul>
<p>Note: All tables below show the count of domains hosted, NOT companies; companies can own many domains. Fortune 1000 domains are from 2015 and are based on <a href="https://gist.github.com/hrbrmstr/ae574201af3de035c684/">this file created by Bob Rudis</a>.</p>
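<p>To illustrate what “count of domains” means here, the sketch below counts each (domain, provider) pair once, so a domain with several MX records at the same provider is not counted multiple times. The rows are invented placeholders, and this is a simplified stand-in for the notebook’s aggregation rather than a copy of it.</p>

```python
from collections import Counter

# Invented rows: one per MX record, as (domain, provider).
rows = [
    ("example.com", "Provider A"),
    ("example.com", "Provider A"),  # second MX record, same provider
    ("example.org", "Provider A"),
    ("example.net", "Provider B"),
]

# De-duplicate (domain, provider) pairs before counting domains per provider.
domain_counts = Counter(provider for _, provider in set(rows))
print(domain_counts.most_common())
```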
<h4 id="top-email-security-providers-overall">Top Email Security Providers Overall</h4>
<script src="https://gist.github.com/jatrost/61ff8d530c8dbb1587b5148beb890515.js"></script>
<h4 id="fortune-1000-email-security-providers">Fortune 1000 Email security providers</h4>
<script src="https://gist.github.com/jatrost/95914680804782b62d2273928575e730.js"></script>
<h4 id="fortune-100-domain-mx-base-domain-email-security-provider">Fortune 100 domain, MX base domain, email security provider</h4>
<p>Note: I ended up adding Google and Microsoft to this table since they were very well represented. As you can see, Proofpoint and self-hosting dominate the Fortune 100.</p>
<script src="https://gist.github.com/jatrost/ffcea5afa0a204901c2bfefac931b30b.js"></script>
<h4 id="alexa-1000-email-security-providers">Alexa 1000 Email security providers</h4>
<script src="https://gist.github.com/jatrost/107d72575748e7f1e7975ce1f124bb8c.js"></script>
<h4 id="alexa-100-domain-mx-base-domain-email-security-provider">Alexa 100 domain, MX base domain, email security provider</h4>
<p>Note: I ended up adding Google and Microsoft to this table since they were very well represented. As you can see, self-hosting, Google and Microsoft dominate the Alexa 100. Almost all of these domains are from large technology / web companies so this isn’t so surprising, but it is interesting as compared to the Fortune 100.</p>
<script src="https://gist.github.com/jatrost/df324355c19b0b1c0f1ca142d49f33e8.js"></script>
<h4 id="top-email-security-providers-hosted-in-aws">Top Email Security Providers Hosted in AWS</h4>
<p>Many large email security companies are operating from AWS.</p>
<script src="https://gist.github.com/jatrost/03a63988d2ffb1fdafa61355db2be1d9.js"></script>
<h4 id="top-email-security-providers-hosted-in-azure">Top Email Security Providers Hosted in Azure</h4>
<p>Only a small number of identifiable email security companies were operating from Azure.</p>
<script src="https://gist.github.com/jatrost/1683f4cd598c96f7787a2ff0fe1cd965.js"></script>
<h4 id="top-self-hosted-email-security-providers">Top Self-hosted Email Security Providers</h4>
<script src="https://gist.github.com/jatrost/ec1a85077dbd9f6d171a82d9a7888711.js"></script>
<h2 id="misc-findings">Misc Findings</h2>
<p>When mining this data I discovered a few interesting items.</p>
<h3 id="linode--csc-digital-brand-services">Linode / CSC Digital Brand Services</h3>
<p>One of the more popular email security providers, “CSC Digital Brand Services” (which services multiple Fortune 100 companies), uses Linode for their hosting. This was surprising since Linode seems like a much smaller player in the Cloud hosting market.</p>
<h3 id="googlemialcom">googlemial[.]com</h3>
<p>When I initially collected this data, freecodecamp.org had a misconfigured MX domain pointing to googlemial[.]com. This sketchy domain is not owned by Google, but it resolved to a GCP IP. Upon further inspection, this IP appears to be hosting a parking page for unregistered domains owned by GoDaddy. A quick <a href="https://securitytrails.com/list/ip/35.186.238.101">PDNS check</a> of other domains resolving to this IP reveals ~4.2M+ domains, and a quick DNS resolution on those domains with any subdomain shows that they all resolve to the same IP.</p>
<script src="https://gist.github.com/jatrost/f5963734af7b7fe0f5bc0474cb5b6d17.js"></script>
<p><strong>adnshost logs for freecodecamp.org</strong></p>
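<p>The “any subdomain resolves to the same IP” observation is the usual signature of wildcard DNS, and it can be checked by resolving a few random labels that almost certainly do not exist. Below is a minimal sketch of that check. The <code>resolve</code> argument is a hypothetical callable mapping a hostname to a set of IP strings, and the fake resolver and domain names are placeholders for demonstration.</p>

```python
import uuid


def looks_like_wildcard(domain, resolve, probes=3):
    """Return True if random, almost certainly nonexistent subdomains all
    resolve to the same non-empty set of addresses.

    `resolve` is a hypothetical callable: hostname -> set of IP strings
    (an empty set when the name does not resolve).
    """
    answers = [
        frozenset(resolve(f"{uuid.uuid4().hex}.{domain}")) for _ in range(probes)
    ]
    return bool(answers[0]) and all(a == answers[0] for a in answers)


def fake_resolve(hostname):
    """Stand-in resolver: everything under parked.example hits one parking IP."""
    if hostname.endswith(".parked.example"):
        return {"192.0.2.10"}
    return set()


print(looks_like_wildcard("parked.example", fake_resolve))
print(looks_like_wildcard("normal.example", fake_resolve))
```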
<h2 id="future-work">Future Work</h2>
<p>I am not sure if I will return to this research or not, but I had some ideas that may be worth pursuing at some point, maybe during the next pandemic :)</p>
<ul>
<li>Perform similar work at a much larger scale - using all major zone files (COM, NET, ORG) and <a href="https://czds.icann.org/home">ICANN’s CZDS</a> as the inputs.</li>
<li>Or perform similar work using the <a href="https://opendata.rapid7.com/">Rapid7 Opendata</a> DNS data sets.</li>
<li>Determine if port scans against MX servers could be useful to augment this.</li>
<li>Automate PDNS queries and analysis against the MX records found to identify other domains not found in the top domain lists.</li>
<li>Perform similar work, but collect SPF records and see what interesting insights could be gleaned about email sending trust (and whether vulns could be identified – like AWS IPs in the SPF that are stale and potentially obtainable).</li>
<li>Completely automate this entire process and use it to generate weekly reports.</li>
<li>Identify providers hidden by the first layer mail exchange. It may be possible to do this at scale (but only for some companies) if the companies send bounce notifications to external email senders for non-existent recipients. These bounce messages often contain all the SMTP headers of the original message sent. These headers can reveal security products. This technique was used on a targeted basis by Will Pearce and Nick Landers in their <a href="https://github.com/moohax/Talks/blob/master/slides/DerbyCon19.pdf">DerbyCon research on Proofpoint</a>. Trying to do this at scale may draw a lot of attention or get my research box put on some blacklists. It would also likely be a lot more effort to identify the SMTP headers associated with different security providers.</li>
</ul>
<h2 id="resources">Resources</h2>
<p>Data:</p>
<ul>
<li><a href="https://mx-intel-public.s3.amazonaws.com/all-registered-domains.txt.gz">all-registered-domains.txt.gz</a> - base domains extracted from combining several popular domains lists together and then uniqued.</li>
<li><a href="https://mx-intel-public.s3.amazonaws.com/all-popular-domains-MX-20200620.txt.unique.gz">all-popular-domains-MX-20200620.txt.unique.gz</a> - adnshost logs from performing MX lookups on domains from all-registered-domains.txt.gz.</li>
<li><a href="https://mx-intel-public.s3.amazonaws.com/mailserver_registered_domain-NS-20200620.txt.gz">mailserver_registered_domain-NS-20200620.txt.gz</a> - adnshost logs from performing NS lookups on all the MX base domains; used for enrichment.</li>
<li><a href="https://mx-intel-public.s3.amazonaws.com/mx-intel-enriched.csv.gz">mx-intel-enriched.csv.gz</a> - the final enriched output from this work.</li>
</ul>
<p>Notebooks, Code, and summary results: <a href="https://github.com/covert-labs/mx-intel">https://github.com/covert-labs/mx-intel</a>.</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p><a href="http://www.covert.io/mining-mx-records-for-fun-and-profit/">Mining DNS MX Records for Fun and Profit</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on June 27, 2020.</p>http://www.covert.io/seven-short-links-on-cyber-security-alert-triage-automation2020-05-23T00:00:00-00:002020-05-23T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>A short listing of research papers I’ve discovered recently that aim to automate or speed up cyber security alert triage (alert prioritization/ranking, causal event correlation, and enrichment).</p>
<ul>
<li><a href="https://kangkookjee.github.io/publications/nodoze-ndss2019.pdf">NODOZE: Combatting Threat Alert Fatigue with Automated Provenance Triage</a></li>
<li><a href="https://www.ndss-symposium.org/wp-content/uploads/2020/02/24270-paper.pdf">OmegaLog: High-Fidelity Attack Investigation via Transparent Multi-layer Log Analysis</a></li>
<li><a href="http://www.princeton.edu/~pmittal/publications/priotracker-ndss18.pdf">Towards a Timely Causality Analysis for Enterprise Security</a></li>
<li><a href="https://arxiv.org/pdf/1810.05711.pdf">ProPatrol: Attack Investigation via Extracted High-Level Tasks</a></li>
<li><a href="https://www.osti.gov/servlets/purl/1505905">Exploiting Time and Subject Locality for Fast, Efficient, and Understandable Alert Triage</a></li>
<li><a href="https://ieeexplore.ieee.org/abstract/document/8170757">Deep learning for prioritizing and responding to intrusion detection alerts</a></li>
<li><a href="https://ieeexplore.ieee.org/abstract/document/8949029">Automated Threat-Alert Screening for Battling Alert Fatigue with Temporal Isolation Forest</a></li>
</ul>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</p>
<p><a href="http://www.covert.io/seven-short-links-on-cyber-security-alert-triage-automation/">Seven Short Links on Cyber Security Alert Triage Automation</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on May 23, 2020.</p>http://www.covert.io/eight-short-links-on-provenance-analytics-for-cyber-security2020-02-01T00:00:00-00:002020-05-23T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>A short listing of research papers I’ve discovered recently that use Provenance Analytics for various Cyber Security usecases from EDR data analysis to malware analysis to threat hunting and IR.</p>
<ul>
<li><a href="https://whassan3.web.engr.illinois.edu/papers/rapsheet-oakland20.pdf">Tactical Provenance Analysis for Endpoint Detection and Response Systems</a></li>
<li><a href="https://whassan3.web.engr.illinois.edu/papers/provdetector-NDSS2020.pdf">You Are What You Do: Hunting Stealthy Malware via Data Provenance Analysis</a></li>
<li><a href="https://whassan3.web.engr.illinois.edu/papers/NoDoze-NDSS2019.pdf">NODOZE: Combatting Threat Alert Fatigue with Automated Provenance Triage</a></li>
<li><a href="https://adambates.org/documents/Bates_Www17.pdf">Transparent Web Service Auditing via Network Provenance Functions</a></li>
<li><a href="https://whassan3.web.engr.illinois.edu/papers/omegalog-NDSS2020.pdf">OmegaLog: High-Fidelity Attack Investigation via Transparent Multi-layer Log Analysis</a></li>
<li><a href="https://www.usenix.org/system/files/conference/tapp2017/tapp17_paper_lemay.pdf">Automated Provenance Analytics: A Regular Grammar Based Approach with Applications in Security</a></li>
<li><a href="https://www.usenix.org/system/files/tapp2019-paper-barre.pdf">Mining Data Provenance to Detect Advanced Persistent Threats</a></li>
<li><a href="https://dl.acm.org/doi/pdf/10.1145/3319535.3363217">Poirot: Aligning Attack Behavior with Kernel Audit Records for Cyber Threat Hunting</a></li>
</ul>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</p>
<p><a href="http://www.covert.io/eight-short-links-on-provenance-analytics-for-cyber-security/">Eight Short Links on Provenance Analytics for Cyber Security</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on May 23, 2020.</p>http://www.covert.io/three-short-links-on-popular-domain-lists-for-threat-intelligence2020-02-01T00:00:00-00:002020-02-01T00:00:00-05:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>A short listing of research papers I’ve read that analyze popular domain lists. These papers analyze Alexa, Quantcast, Cisco Umbrella, and Majestic top websites/domains.</p>
<p><img src="/images/popular-lists-overlap.png" text="Average daily intersection of the top domain lists (from TRANCO paper)" /></p>
<ul>
<li><a href="https://arxiv.org/pdf/1805.11506.pdf">A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists</a></li>
<li><a href="https://pdfs.semanticscholar.org/0047/4a718cac85d240f605acdffe396046be0ac0.pdf">Rigging Research Results by Manipulating Top Websites Rankings</a></li>
<li><a href="https://tranco-list.eu/assets/tranco-ndss19.pdf">TRANCO: A Research-Oriented Top Sites Ranking Hardened Against Manipulation</a></li>
</ul>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</p>
<p><a href="http://www.covert.io/three-short-links-on-popular-domain-lists-for-threat-intelligence/">3 Short Links on Popular Domain Lists for Threat Intelligence</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on February 01, 2020.</p>http://www.covert.io/six-short-links-on-malware-training-set-creation-for-machine-learning2020-02-01T00:00:00-00:002020-02-01T00:00:00-05:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>A short listing of resources useful for creating malware training sets for machine learning.</p>
<p>In leading academic and industry research on malware detection, it is common to use variations of the following techniques (based on <a href="https://www.virustotal.com/">VirusTotal</a> determinations) in order to build labeled training data.</p>
<ul>
<li>“In this paper, we use a ‘1-/5+ criterion for labeling a given file as malicious or benign: if a file has one or fewer vendors reporting it as malicious, we label the file as ‘benign’”. See <a href="https://www.usenix.org/system/files/sec19-rudd.pdf">ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation</a> for more details.</li>
<li>“We assign malicious/benign labels on a 5+/1- basis, i.e., documents for which one or fewer vendors labeled malicious, we ascribe the aggregate label benign, while documents for which 5 or more vendors labeled malicious, we ascribe the aggregate label malicious.” See <a href="https://arxiv.org/pdf/1804.08162.pdf">MEADE: Towards a Malicious Email Attachment Detection Engine</a> for more details.</li>
<li>Uses a similar method to the ones above, but further removes files that have hash-based file names or file names that are “malware” or “sample”. See <a href="https://arxiv.org/pdf/1905.06987.pdf">Learning from Context: Exploiting and Interpreting File Path Information for Better Malware Detection</a> for more details.</li>
<li>“To train and evaluate our model at low false positive rates, we require accurate labels for our malware and benignware binaries. We accomplish this by running all of our data through VirusTotal, which runs the binaries through approximately 55 malware engines. We then use a voting strategy to decide if each file is either malware or benignware… We label any file against which 30% or more of the antivirus engines alarm as malware, and any file that no antivirus engine alarms on as benignware. For the purposes of both training and accuracy evaluation we discard any files that more than 0% and less than 30% of VirusTotal’s antivirus engines declare it malware, given the uncertainty surrounding the nature of these files.”
See <a href="https://arxiv.org/pdf/1508.03096.pdf">Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features</a> for more details.</li>
<li>Scraping packages from <a href="https://ninite.com/">Ninite</a>, <a href="https://chocolatey.org/">Chocolatey</a>, and <a href="https://www.cygwin.com/">Cygwin</a>.</li>
<li>Endgame’s <a href="https://github.com/endgameinc/ember">Ember</a> is becoming one of the <a href="https://scholar.google.com/scholar?cites=15291045276750854027&as_sdt=80005&sciodt=0,11&hl=en">most cited</a> datasets used for security machine learning. “The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign)”. See <a href="https://arxiv.org/abs/1804.04637">EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models</a> for more details.</li>
<li><a href="https://www.youtube.com/watch?v=BlnZEh4q72I">Labeling the VirusShare Corpus: Lessons Learned - John Seymour</a></li>
</ul>
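<p>The two labeling rules quoted above are simple enough to write down directly. The sketch below is my own restatement, not code from any of the papers. Function names are made up, and returning <code>None</code> for the in-between range of the 1-/5+ rule is an assumption, since the quotes do not say how those files are handled.</p>

```python
def label_1_5(detections):
    """1-/5+ criterion: one or fewer detections is benign, five or more is
    malicious. Returning None for 2-4 detections is an assumption."""
    if detections <= 1:
        return "benign"
    if detections >= 5:
        return "malicious"
    return None


def label_30_percent(detections, total_engines):
    """Voting rule: no detections is benignware, 30% or more of the engines
    is malware, anything in between is discarded (None)."""
    if detections == 0:
        return "benignware"
    if detections / total_engines >= 0.30:
        return "malware"
    return None


for n in (0, 1, 3, 5, 17):
    print(n, label_1_5(n), label_30_percent(n, total_engines=55))
```

<p>Note how the two rules disagree on a file with a single detection: the first calls it benign, the second discards it.</p>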
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</p>
<p><a href="http://www.covert.io/six-short-links-on-malware-training-set-creation-for-machine-learning/">6 Short Links on Malware Training Set Creation for Machine Learning</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on February 01, 2020.</p>http://www.covert.io/collecting-and-curating-benign-observables-for-threat-intelligence-and-machine-learning2020-02-01T00:00:00-00:002020-02-01T00:00:00-05:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>In this post, I share my experience in building and maintaining large collections of benign IOCs (whitelists) for Threat Intelligence and Machine Learning Research.</p>
<p>Whitelisting is a useful concept in Threat Intelligence correlation since it is very easy for benign observables to make their way into <a href="/threat-intelligence/">threat intelligence indicator feeds</a>, especially feeds from open source providers or vendors that are not as careful as they should be. If these threat intelligence feeds are used for blocking (e.g. in firewalls or WAF devices) or alerting (e.g. log correlation in SIEM or IDS), the cost of benign entries making their way into a security control can be very high (wasted analyst time triaging false positive alerts, or lost business productivity when legitimate websites are blocked). Whitelists are generally used to filter out observables from threat intelligence feeds that would almost certainly be marked as false positives if they were intersected against event logs (e.g. Blue Coat proxy logs, firewall logs, etc.) and used for alerting. Whitelists are also very useful for building the labeled datasets required for machine learning models and for enriching alerts with contextual information.</p>
<p>The classic example of a benign observable is 8.8.8.8 (Google’s public DNS resolver). It has found its way into many open source and commercial threat intelligence feeds by mistake, since malware sometimes uses this IP for DNS resolution or pings it for connectivity checks. Many other observables commonly make their way into threat feeds because of how the feeds are derived or collected. Below is a summary of the major sources of false positives for threat intelligence feeds and ways to identify them so they are not used. Most commercial threat intelligence platforms are pretty good at identifying these today, and the dominant open source threat intelligence platform MISP is getting better with its <a href="https://github.com/MISP/misp-warninglists/">MISP-warninglists</a>, but as you will discover below there is some room for improvement.</p>
<h2 id="benign-inbound-observables">Benign Inbound Observables</h2>
<p>Benign Inbound Observables commonly show up in threat intelligence feeds derived from distributed network sensors such as honeypots or firewall logs. These IPs show up in firewall logs and are generally benign, or at worst are considered noise. Below are several common Benign Inbound Observable types. Each type also comes with recommended data sources or collection techniques listed as sub bullets:</p>
<ul>
<li><strong>Known Web Crawlers</strong> - Web crawlers are servers that crawl the World Wide Web and through this process may enter the networks of many companies or may accidentally hit honeypots or firewalls.
<ul>
<li>RDNS + DNS analytics can be used to enumerate these in bulk once patterns are identified. <a href="https://support.google.com/webmasters/answer/80553?hl=en">Here</a> is an example pattern for googlebots. Mining large collections of RDNS data can reveal other patterns to focus on. Below is an example of a simple PTR lookup on a known googlebot IP. This should start to reveal patterns that could be codified, assuming you have access to a large corpus of RDNS data such as the one provided <a href="https://opendata.rapid7.com/sonar.rdns_v2/">here</a> (or one you generate yourself).</li>
</ul>
</li>
</ul>
<p><img src="/images/whitelists/googlebot-dns.png" alt="Googlebot DNS" /></p>
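<p>The PTR pattern check can be sketched in Python as below. This is a minimal sketch under stated assumptions, not a definitive implementation: <code>CRAWLER_SUFFIXES</code> and <code>is_verified_crawler</code> are hypothetical names, the suffix values are only illustrative, and the forward lookup is included because PTR records alone can be set by whoever controls the IP space. <code>socket.gethostbyname_ex</code> only returns IPv4 addresses, so IPv6 would need a different resolver.</p>

```python
import socket

# Illustrative suffixes only (an assumption); derive real patterns from the
# crawler operator's documentation or from bulk RDNS mining.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, suffixes=CRAWLER_SUFFIXES,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Return True if ip has a PTR name under a trusted suffix AND that
    name resolves back to the same ip (forward-confirmed reverse DNS)."""
    try:
        ptr_name = reverse(ip)[0].rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not ptr_name.endswith(suffixes):
        return False
    try:
        forward_ips = forward(ptr_name)[2]
    except OSError:
        return False
    return ip in forward_ips
```

<p>The <code>reverse</code> and <code>forward</code> parameters default to live DNS lookups, but they can be swapped for functions that read from a bulk RDNS or active DNS dataset.</p>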
<ul>
<li><strong>Known port scanners</strong> associated with highly visible projects or security companies (Shodan, Censys, Rapid7 Project Sonar, ShadowServer, etc.)
<ul>
<li>RDNS + DNS analytics may be able to enumerate these in bulk (assuming the vendors want to be identified). Example:</li>
</ul>
</li>
</ul>
<p><img src="/images/whitelists/shodan-dns.png" alt="Shodan DNS" /></p>
<ul>
<li><strong>Mail Servers</strong> - these servers send email and they sometimes wind up on Threat feeds by mistake.
<ul>
<li>In order to enumerate these, you need a good list of popular email domains. Then perform DNS TXT requests against this list and parse the SPF records. Multiple lookups will likely be needed, as SPF allows for redirects and includes. Below are the commands needed to do this manually for gmail.com as an example. The CIDR blocks returned are the IP space that gmail emails are sent from. Alerting or blocking on these is gonna cause a bad day.</li>
</ul>
</li>
</ul>
<p><img src="/images/whitelists/gmail-dns.png" alt="Gmail DNS" /></p>
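<p>The recursive SPF walk shown above could be automated roughly as follows. This is a hedged sketch: <code>spf_networks</code> is a hypothetical helper, and <code>lookup_txt</code> is an assumed callable that returns the TXT strings for a domain (for example a thin wrapper around whatever DNS library you use). It only follows <code>include:</code> and <code>redirect=</code>, and it ignores mechanisms such as <code>a</code> and <code>mx</code>.</p>

```python
def spf_networks(domain, lookup_txt, _seen=None):
    """Collect ip4/ip6 networks from a domain's SPF record, following
    include: and redirect= terms recursively.

    lookup_txt is an assumed callable: domain -> list of TXT record strings.
    """
    seen = _seen if _seen is not None else set()
    if domain in seen:  # guard against include loops
        return []
    seen.add(domain)

    networks = []
    for record in lookup_txt(domain):
        if not record.lower().startswith("v=spf1"):
            continue  # not an SPF record
        for term in record.split()[1:]:
            # Only the explicit "+" (pass) qualifier is stripped, so terms
            # qualified with "-", "~" or "?" are skipped entirely.
            term = term.lstrip("+")
            lowered = term.lower()
            if lowered.startswith(("ip4:", "ip6:")):
                networks.append(term.split(":", 1)[1])
            elif lowered.startswith("include:"):
                networks.extend(
                    spf_networks(term.split(":", 1)[1], lookup_txt, seen))
            elif lowered.startswith("redirect="):
                networks.extend(
                    spf_networks(term.split("=", 1)[1], lookup_txt, seen))
    return networks
```

<p>Passing the resolver in as a function keeps the parsing logic testable offline and lets the same code run against cached bulk TXT data.</p>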
<ul>
<li><strong>Cloud PaaS Providers</strong> – Most Cloud providers publish their IP space via APIs or in their documentation. These lists are useful for deriving whitelists, but they need to be filtered further. Ideally you only whitelist Cloud IP space that is massively shared (like S3, CLOUDFRONT, etc.), not IPs that are easy for bad guys to use, such as EC2 instances. These whitelists should not be used to exclude domain names that resolve to this IP space, but should instead be used either for enrichment of alerts or to suppress IOC based alerting from these IP ranges.
<ul>
<li><a href="https://ip-ranges.amazonaws.com/ip-ranges.json">Amazon AWS IP Ranges</a></li>
<li><a href="https://gist.github.com/n0531m/f3714f6ad6ef738a3b0a">Google Cloud Platform IP Ranges</a></li>
<li><a href="https://www.microsoft.com/en-us/download/details.aspx?id=56519">Azure IP Ranges</a></li>
</ul>
</li>
</ul>
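<p>As a rough illustration of that filtering step, the sketch below keeps only the shared-service prefixes from an already-parsed copy of the AWS <code>ip-ranges.json</code> file. The function names and the choice of <code>SHARED_SERVICES</code> are assumptions made for this example; the field names (<code>prefixes</code>, <code>ipv6_prefixes</code>, <code>ip_prefix</code>, <code>ipv6_prefix</code>, <code>service</code>) follow the published file format.</p>

```python
import ipaddress

# Assumption: these are the "massively shared" services worth keeping.
SHARED_SERVICES = {"CLOUDFRONT", "S3"}


def shared_aws_networks(ip_ranges, services=SHARED_SERVICES):
    """Return the set of networks in a parsed ip-ranges.json document
    whose service is one of the shared services."""
    networks = set()
    for entry in ip_ranges.get("prefixes", []):
        if entry.get("service") in services:
            networks.add(ipaddress.ip_network(entry["ip_prefix"]))
    for entry in ip_ranges.get("ipv6_prefixes", []):
        if entry.get("service") in services:
            networks.add(ipaddress.ip_network(entry["ipv6_prefix"]))
    return networks


def in_networks(ip, networks):
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

<p>A linear scan is fine for a few thousand prefixes; for bulk enrichment a radix tree or sorted interval lookup would be a better fit.</p>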
<p><strong>Note</strong>: <a href="https://greynoise.io">Greynoise</a> is a commercial provider of “anti-threat” intelligence (i.e. they identify the noise and other benign observables). They are very good at identifying the types of benign observables listed above since they maintain a globally distributed sensor array and specifically analyze network events in order to identify benign activity.</p>
<p><strong>Note</strong>: <a href="https://github.com/MISP/misp-warninglists">MISP-warninglists</a> provides many of these items today but they may be stale (several of their lists have not been updated in months). Ideally all of these lists are kept up-to-date through automated collection from authoritative sources instead of hard coded data stored in github (unless these are automatically updated frequently). See section on “<a href="#building-maintaining-whitelists">Building / Maintaining Whitelist Data</a>” for more tips.</p>
<h2 id="benign-outbound-observables">Benign Outbound Observables</h2>
<p>Benign Outbound Observables show up frequently in threat intelligence feeds derived from malware sandboxing, URL sandboxing, outbound web crawling, email sandboxing, and other similar threat feeds. Below are several common Benign Outbound Observable types. Each type also comes with recommended data sources or collection techniques listed as sub bullets:</p>
<ul>
<li><strong>Popular Domains</strong> - Popular domains can wind up on threat intelligence feeds, especially those derived from malware sandboxing, since malware often uses benign domains for connectivity checks and some malware, like that conducting click fraud, acts more like a web crawler, visiting many different benign sites. These same popular domains show up very often in most corporate networks and are almost always benign in nature (Note: they can be compromised and used for hosting malicious content so great care needs to be taken here).
<ul>
<li>Below are several data sources for popular domain names. Each is slightly different in how it measures popularity (by volume of Web visitors, by frequency of occurrence in Web crawling data, by volume of DNS queries, or a combination). These lists should not be used as-is for whitelisting; they need to be filtered/refined. See section on “<a href="#building-maintaining-whitelists">Building / Maintaining Whitelist Data</a>” below for more details on recommendations for refinement.
<ul>
<li><a href="http://s3.amazonaws.com/alexa-static/top-1m.csv.zip">Amazon Alexa top 1 Million</a></li>
<li><a href="http://s3-us-west-1.amazonaws.com/umbrella-static/index.html">Cisco Umbrella Top 1 Million</a></li>
<li><a href="https://www.domcop.com/top-10-million-websites">Domcop Top 10m Domains</a> (<a href="https://www.domcop.com/files/top/top10milliondomains.csv.zip">data</a>) - The top 10 million websites taken from the Open PageRank Initiative.</li>
<li><a href="https://blog.majestic.com/development/majestic-million-csv-daily/">Majestic Million Domains</a></li>
<li><a href="https://moz.com/top500">Moz’s list of the most popular 500 websites on the internet</a></li>
<li><a href="http://ak.quantcast.com/quantcast-top-sites.zip">Quantcast Top 1 Million</a></li>
<li><a href="https://tranco-list.eu/">Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation</a></li>
<li>MISP-warninglists’ <a href="https://github.com/MISP/misp-warninglists/tree/master/lists/dax30">dax30 websites</a>, <a href="https://github.com/MISP/misp-warninglists/tree/master/lists/bank-website">bank websites</a>, <a href="https://github.com/MISP/misp-warninglists/tree/master/lists/university_domains">university domains</a>, <a href="https://github.com/MISP/misp-warninglists/tree/master/lists/url-shortener">url shorteners</a>, <a href="https://github.com/MISP/misp-warninglists/tree/master/lists/whats-my-ip">whats-my-ip sites</a></li>
</ul>
</li>
<li>For more details and analysis on these popular domain lists, check out this <a href="/three-short-links-on-popular-domain-lists-for-threat-intelligence/">post</a>.</li>
</ul>
</li>
<li><strong>Popular IP Addresses</strong> - Popular IPs are very similar to popular domains. They show up everywhere and when they wind up on threat intelligence feeds they cause a lot of false positives. Popular IP lists can be generated from resolving the Popular domain lists. These lists should not be used as-is for whitelisting; they need to be filtered/refined. See section on “<a href="#building-maintaining-whitelists">Building / Maintaining Whitelist Data</a>” below for more details on recommendations for refinement.</li>
<li><strong>Free email domains</strong> - free email domains occasionally show up in threat intelligence feeds by accident so it is good to maintain a good list of these to prevent false positives. Hubspot provides a <a href="https://knowledge.hubspot.com/articles/kcs_article/forms/what-domains-are-blocked-when-using-the-forms-email-domains-to-block-feature">list</a> that is decent.</li>
<li><strong>Ad servers</strong> - Ad servers show up very frequently in URL sandbox feeds as these feeds are often obtained by visiting many websites and waiting for exploitation attempts or for AV alerts. These same servers show up all the time in benign Internet traffic. <a href="https://easylist.to/">Easylist</a> provides this sort of data.</li>
<li><strong>CDN IPs</strong> - Content Distribution Networks are geographically distributed networks of proxy servers or caches that provide high availability and high performance for web content distribution. Their servers are massively shared for distributing varied web content. When IPs from CDNs make it into threat intelligence feeds, false positives are soon to follow. Below are several CDN IP and domain sources.
<ul>
<li><a href="https://github.com/WPO-Foundation/wptagent/blob/master/internal/optimization_checks.py#L62-L286">WPO-Foundation CDN list (embedded in Python code)</a></li>
<li><a href="https://ip-ranges.amazonaws.com/ip-ranges.json">AWS IP Ranges</a> - but filtered for cloudfront and S3 IP space.</li>
<li><a href="https://www.cloudflare.com/ips-v4">Cloudflare IP Ranges</a></li>
<li><a href="https://api.fastly.com/public-ip-list">Fastly IP Ranges</a></li>
<li><a href="https://www.maxcdn.com/one/assets/ips.txt">MaxCDN IP Ranges</a></li>
<li>Very similar to identifying known web crawlers, DNS PTR-Lookup + DNS A-Lookup analytics can be used to enumerate these in bulk once patterns are identified.</li>
</ul>
</li>
<li><strong>Certificate Revocation Lists (CRL) and the Online Certificate Status Protocol (OCSP) domains/URLs</strong> - When executing a binary in a malware sandbox and the executable has been signed, connections will be made to CRL and OCSP servers. Because of this, these often mistakenly wind up in threat feeds.
<ul>
<li>Grab Certificates from Alexa top websites, extract OCSP URL. This <a href="https://cybersecurity.att.com/blogs/labs-research/massively-collecting-crl-and-ocsp-information">old Alienvault post</a> describes the process (along with another approach using the now defunct EFF SSL Observatory), and this <a href="https://github.com/pmurgatroyd/alienvault-labs-garage/tree/master/certs">github repo</a> provides the code to do it. Care should be taken here since adversaries can influence the data collected in this way.</li>
<li><a href="https://github.com/MISP/misp-warninglists/tree/master/lists/crl-ip-hostname">MISP-warninglists’ crl-ip-hostname</a></li>
</ul>
</li>
<li><strong>NTP Servers</strong> - Some malware call out to NTP servers for connectivity checks or to determine the real date/time. Because of this, NTP servers often wind up mistakenly on threat intelligence feeds that are derived from malware sandboxing.
<ul>
<li>Web scrape lists of NTP servers (such as the <a href="https://tf.nist.gov/tf-cgi/servers.cgi">NIST Internet Time Servers</a> and <a href="http://www.pool.ntp.org/en/">NTP Pool Project Servers</a>) and perform DNS resolutions to derive all the servers behind each regional load balancer.</li>
</ul>
</li>
<li><strong>Root Nameservers and TLD Nameservers</strong>
<ul>
<li>Perform DNS NS-lookups against each domain in the <a href="https://publicsuffix.org/list/public_suffix_list.dat">Public Suffix List</a> and then perform a DNS A-lookup on each nameserver domain to obtain their IP addresses.</li>
</ul>
</li>
<li><strong>Mail Exchange servers</strong>
<ul>
<li>Obtain a list of popular email domains and then perform MX lookups against popular email domains to get their respective Mail Exchange (MX) servers. Perform DNS A-lookups on the MX servers list to obtain their IP addresses.</li>
</ul>
</li>
<li><strong>STUN Servers</strong> - “Session Traversal Utilities for NAT (STUN) is a standardized set of methods, including a network protocol, for traversal of network address translator (NAT) gateways in applications of real-time voice, video, messaging, and other interactive communications.” via https://en.wikipedia.org/wiki/STUN. Below are some sources of STUN servers (some of these appear old though).
<ul>
<li><a href="https://www.voip-info.org/stun">https://www.voip-info.org/stun/</a></li>
<li><a href="https://gist.github.com/mondain/b0ec1cf5f60ae726202e">https://gist.github.com/mondain/b0ec1cf5f60ae726202e</a></li>
<li><a href="https://gist.github.com/zziuni/3741933">https://gist.github.com/zziuni/3741933</a></li>
<li><a href="http://enumer.org/public-stun.txt">http://enumer.org/public-stun.txt</a></li>
</ul>
</li>
<li><strong>Parking IPs</strong> - IPs used as the default IP for DNS-A records for brand new registered domains.
<ul>
<li><a href="https://github.com/stamparm/maltrail/blob/master/trails/static/suspicious/parking_site.txt">maltrails parking_sites</a></li>
</ul>
</li>
<li><strong>Popular Open DNS Resolvers</strong>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Public_recursive_name_server">Public recursive name server (Wikipedia)</a> - lists the largest and most popular open recursive nameservers.</li>
<li><a href="https://public-dns.info/">Public DNS Server List</a> - maintains a large list of open recursive nameservers that may be useful for context, but should not be whitelisted.</li>
</ul>
</li>
<li><strong>Security Companies, Security Blogs and Security Tool sites</strong> - These sites show up in threat mailing lists frequently which are sometimes scraped as threat feeds and these domains are mistakenly flagged as malicious.
<ul>
<li>Scrape all the reputable awesome-* security related GitHub repos. This is a little risky since an adversary could potentially get their domain added to these lists. Examples:
<ul>
<li><a href="https://github.com/sbilly/awesome-security">awesome-security</a></li>
<li><a href="https://github.com/rshipp/awesome-malware-analysis">awesome-malware-analysis</a></li>
<li><a href="https://github.com/paralax/awesome-honeypots">awesome-honeypots</a></li>
<li>etc.</li>
</ul>
</li>
<li>MISP-warninglists provides a <a href="https://github.com/MISP/misp-warninglists/blob/master/lists/security-provider-blogpost/">security-provider-blogpost</a> and <a href="https://github.com/MISP/misp-warninglists/tree/master/lists/automated-malware-analysis">automated-malware-analysis</a> lists that look pretty good.</li>
</ul>
</li>
<li><strong>Bit Torrent Trackers</strong> - <a href="https://github.com/ngosang/trackerslist">github.com/ngosang/trackerslist</a></li>
<li><strong>Tracking domains</strong> - commonly used by well known email marketing companies. These often show up in threat intel feeds derived from spam or phishing email sinkholes and result in high false positive rates in practice.
<ul>
<li>PDNS and/or Domain Whois analytics are one way to identify these once patterns can be observed. Below is an example of using Whois data for Marketo.com and identifying all the other Marketo email tracking domains that use Marketo’s nameserver. This example is from <a href="https://whoisology.com/ns1/ns1.marketo.com/1">Whoisology</a>, but bulk Whois mining is a preferred method.</li>
</ul>
</li>
</ul>
<p><img src="/images/whitelists/whoisology.com-ns1.marketo.com.png" alt="Marketo Example" /></p>
<p><strong>Note:</strong> <a href="https://github.com/MISP/misp-warninglists">MISP-warninglists</a> provides some of these items today but they may be stale. Ideally all of these lists are kept up-to-date through automated collection from authoritative sources. See section on “<a href="#building-maintaining-whitelists">Building / Maintaining Whitelist Data</a>” for more tips.</p>
<h2 id="benign-host-based-observables">Benign Host-based Observables</h2>
<p>Benign Host-based Observables show up very commonly in threat intelligence feeds based on malware sandboxing. Here are some example observable types. So far, I have only found decent benign lists for File hashes (see below).</p>
<ul>
<li>File hashes</li>
<li>Mutexes</li>
<li>Registry Keys</li>
<li>File Paths</li>
<li>Service names</li>
</ul>
<p>Data Sources:</p>
<ul>
<li><a href="https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl/nsrl-download">NSRL Hashsets</a></li>
<li><a href="https://www.nist.gov/itl/ssd/software-quality-group/windows-732-diskprint">Windows-7/32 Diskprint</a></li>
<li><a href="https://gist.github.com/Neo23x0/fd9af35c5061578025d00838c215dfe4">Neo23x0/fp-hashes.py</a></li>
<li><a href="https://github.com/MISP/misp-warninglists/blob/master/lists/common-ioc-false-positive/list.json">MISP common IOC false positives</a></li>
<li><a href="https://github.com/kost/m-whitelist">Mandiant Redline Whitelist (mirror)</a> - NOTE: this is ~5yr old at the time of this blog.</li>
<li><a href="https://www.hashsets.com/white-hash-sets-2/">Hashsets.com (commercial) hash lists</a></li>
</ul>
<p>In leading academic and industry research on malware detection, it is common to use Virustotal in order to build labeled training data. See this <a href="/six-short-links-on-malware-training-set-creation-for-machine-learning/">post</a> for more details. These techniques seem very suitable for training data creation, but are not recommended for whitelisting for operational use due to the high likelihood of false negatives.</p>
<p><strong>Note:</strong> If your goal is building a machine learning model on binaries, you should strongly consider Endgame’s <a href="https://github.com/endgameinc/ember">Ember</a>. “The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign)”. See <a href="https://arxiv.org/abs/1804.04637">EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models</a> for more details.</p>
<h2 id="whitelist-exclusions">Whitelist Exclusions</h2>
<p>There are many observables that we will never want to whitelist due to their popularity or importance. These should be maintained in a whitelist exclusions list (a.k.a. greylist). Below are some examples:</p>
<ul>
<li><strong>Shared hosting domains and Dynamic DNS domains</strong> - these base domains should never be alerted on as many are in the Alexa top 1m list and will be incredibly noisy. BUT subdomains of these are fair game for alerting as they are easily adversary controlled and often abused. Below are some sources of this information, but identifying the major providers and scraping their websites or APIs would be a better way to keep these fresh.
<ul>
<li><strong>Shared Hosting</strong> - <a href="https://github.com/stamparm/maltrail/blob/master/trails/static/suspicious/free_web_hosting.txt">Maltrails free web hosting</a></li>
<li><strong>Dynamic DNS</strong> - <a href="https://github.com/stamparm/maltrail/blob/master/trails/static/suspicious/dynamic_domain.txt">Maltrails DynDNS</a></li>
</ul>
</li>
<li><strong>DNS Sinkhole IPs</strong>
<ul>
<li><a href="https://tisiphone.net/2017/05/16/consolidated-malware-sinkhole-list/">https://tisiphone.net/2017/05/16/consolidated-malware-sinkhole-list/</a></li>
<li><a href="https://github.com/brakmic/Sinkholes">github.com/brakmic/Sinkholes</a></li>
<li><a href="https://sinkdb.abuse.ch/">sinkdb.abuse.ch</a></li>
<li><a href="https://github.com/MISP/misp-warninglists/blob/master/lists/sinkholes/">MISP warninglists sinkholes</a></li>
</ul>
</li>
</ul>
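<p>One way to apply an exclusions list to domain lookups is sketched below. It is an illustration under assumptions, not a prescribed design: <code>parent_domains</code> and <code>is_whitelisted</code> are hypothetical helpers, and the rule encoded here is the one described above for shared hosting and dynamic DNS providers, where the base domain itself is suppressed but its subdomains never inherit that treatment.</p>

```python
def parent_domains(domain):
    """Yield the domain and each parent: a.b.c -> a.b.c, b.c, c."""
    labels = domain.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])


def is_whitelisted(domain, whitelist, exclusions):
    """Suffix-match a domain against the whitelist, except that an
    excluded base domain never covers its subdomains."""
    domain = domain.lower().rstrip(".")
    for candidate in parent_domains(domain):
        if candidate in exclusions:
            # Base domain of a shared hosting / dynamic DNS provider:
            # suppress the base domain itself, never a subdomain of it.
            return candidate == domain
        if candidate in whitelist:
            return True
    return False
```

<p>Candidates are checked from most specific to least specific, so an explicitly whitelisted subdomain still wins over an excluded parent; whether that is desirable depends on how the whitelist is used.</p>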
<h2 id="building--maintaining-whitelist-data"><a name="building-maintaining-whitelists">Building / Maintaining Whitelist Data</a></h2>
<p>Whitelist generation needs to be automated in order to be maintainable. There may be exceptions to this rule for things that you want to ensure are always in the whitelist, but for everything else, ideally they are collected from authoritative sources or are generated based on sound analytic techniques. You cannot always blindly trust each data source listed above. For several, some automated verification, filtering, or analytics will be needed. Below are some tips for how to do this effectively.</p>
<ul>
<li>Each entity in the whitelist should be categorized (what type of whitelist entry is this?) and sourced (where did this come from?) so we know exactly how it got there (i.e. what data source was responsible) and when it was added/updated. This will help if there is ever a problem related to the whitelist so the specific source of the problem can be addressed.</li>
<li>Retrieve whitelist entries from the original source sites and parse/extract data from there. Avoid one time dumps of whitelist entries where possible since these will become stale very quickly. If you are including one-time dumps be sure to maintain their lineage.</li>
<li>Several bulk data sets will be very useful for analytics to expand or filter various whitelists
<ul>
<li><strong>Bulk Active DNS resolution</strong> (A-lookups, MX-lookups, NS-lookups, and TXT-lookups). <a href="https://www.gnu.org/software/adns/">Adns</a> may be useful for this.</li>
<li><strong>Bulk RDNS data</strong> (either from <a href="https://scans.io">scans.io</a> or collected yourself).</li>
<li><strong>Bulk Whois data</strong> - This can be purchased from several vendors. Here are a few: <a href="https://www.whoisxmlapi.com/">whoisxmlapi.com</a>, <a href="https://iqwhois.com/whois-database-download">iqwhois.com</a>, <a href="https://jsonwhois.com/whois-database-download">jsonwhois.com</a>, <a href="https://whoisdatabasedownload.com/">whoisdatabasedownload.com</a>, and <a href="http://research.domaintools.com/bulk-parsed-whois/">research.domaintools.com</a>.</li>
<li><strong>Passive DNS (PDNS) data</strong> - PDNS data can be purchased from several vendors or you can instrument your own network to collect and store this data. Here are some PDNS suppliers: <a href="https://www.farsightsecurity.com/solutions/dnsdb/">farsightsecurity.com</a>, <a href="https://www.deteque.com/passive-dns/">deteque.com</a>, <a href="https://www.circl.lu/services/passive-dns/">circl.lu</a>, <a href="https://www.riskiq.com/products/passivetotal/">riskiq.com</a>, <a href="https://passivedns.mnemonic.no/">passivedns.mnemonic.no</a>, and <a href="https://www.coresecurity.com/content/core-pdns">coresecurity.com (formerly Damballa)</a>.</li>
</ul>
</li>
<li>Netblock ownership (Maxmind) lookups / analytics will be useful for some of the vetting.</li>
<li>The whitelist should be updated at least daily to stay fresh. There may be data sources that change more or less frequently than this.
<ul>
<li>BE CAREFUL when refreshing the whitelist. Add sanity checks to ensure that the new whitelist was generated correctly before replacing the old one. The costs of a failed whitelist load will be mass false positives (unfortunately, I had to learn this lesson the hard way …).</li>
</ul>
</li>
<li>Popular domain lists cannot be taken at face value as benign. Malicious domains get into these lists all the time. Here are some ways to combat this:
<ul>
<li>Use the <strong>N-day stable top-X technique</strong> - e.g. Stable 6-month Alexa top 500k - create a derivative list from the top Alexa domains where you filter the list for only domains that have been on the Alexa top 500k list every day for the past 6 months. This technique is commonly used in malicious domain detection literature as a way to build high quality benign labeled data. It is not perfect and may need to be tuned based on how the whitelist is being used. This technique requires keeping historic popular domain lists. The <a href="https://web.archive.org">Wayback Machine</a> appears to have a <a href="https://web.archive.org/web/*/http://s3.amazonaws.com/alexa-static/top-1m.csv.zip">large historic mirror of the Alexa top1m data</a> that may be suitable for bootstrapping your own collection.</li>
</ul>
</li>
<li>Bulk DNS resolution of these lists can also be useful for generating Popular IP lists, but only when using the N-day stable top-X concept or if great care is taken in how they are used.</li>
<li>Use a whitelist exclusions set for removing categories of domains/IPs that you never want whitelisted. The whitelist exclusions set should also be kept fresh through automated collection from authoritative sources (e.g. scraping dynamic DNS providers and shared hosting websites where possible, PDNS / Whois analytics may also work).</li>
<li>Lastly, be careful when generating whitelists and think about what aspects of the data are adversary controlled. These are things we need to be careful not to blindly trust. Some examples:
<ul>
<li>RDNS entries can be made to be deceptive especially if the adversary knows they are used for whitelisting. For example, an adversary can create PTR records for IP address space they own that are identical to Google’s googlebot RDNS or Shodan’s census RDNS, BUT they cannot change the DNS A record mapping that domain name back to their IP space. For these a forward lookup (A Lookup) is generally also needed OR a netblock ownership verification.</li>
</ul>
</li>
</ul>
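<p>The N-day stable top-X technique described above reduces to a set intersection over the retained daily snapshots. The sketch below assumes each snapshot is already loaded as a ranked list of domain names (best rank first); <code>stable_top_domains</code> is a hypothetical name.</p>

```python
def stable_top_domains(daily_lists, top_x):
    """Return the domains present in the top_x entries of every daily list.

    daily_lists: iterable of ranked domain lists, one per day.
    """
    stable = None
    for ranked in daily_lists:
        todays_top = set(ranked[:top_x])
        # Intersect with the running result; the first day seeds it.
        stable = todays_top if stable is None else stable & todays_top
    return stable or set()
```

<p>For a "stable 6-month top 500k" list, <code>daily_lists</code> would hold roughly 180 snapshots and <code>top_x</code> would be 500000. A missing day silently weakens the guarantee, so it is worth checking that the expected number of snapshots was actually loaded.</p>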
<p>In conclusion, whitelists are useful for filtering out observables from threat intelligence lists before correlation with event data, building labeled datasets for machine learning models, and enriching threat intelligence or alerts with contextual information. Creating and maintaining these lists can be a lot of work. Great care should be taken not to go too far and whitelist domains or IPs that are easily adversary controlled.</p>
<p>As always, feedback is welcome so please leave a message here, on <a href="https://medium.com/@jason_trost">Medium</a>, or @ me on <a href="https://twitter.com/#!/jason_trost">Twitter</a>!</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p><a href="http://www.covert.io/collecting-and-curating-benign-observables-for-threat-intelligence-and-machine-learning/">Collecting and Curating IOC Whitelists for Threat Intelligence and Machine Learning Research</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on February 01, 2020.</p>http://www.covert.io/heterogeneous-information-networks-and-applications-to-cyber-security2020-01-20T00:00:00-00:002020-01-20T00:00:00-05:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>This post explores Heterogeneous Information Networks (HIN) and applications to Cyber security.</p>
<p>Over the past few months I have been researching Heterogeneous Information Networks (HIN) and Cyber security use cases. I first encountered HINs after discovering this paper: <a href="/research-papers/heterogeneous-information-networks/Gotcha - Sly Malware!- Scorpion A Metagraph2vec Based Malware Detection System.pdf">“Gotcha: Sly Malware!- Scorpion A Metagraph2vec Based Malware Detection System”</a> through a Google Scholar Alert I had set up for <a href="/research-papers/heterogeneous-information-networks/Guilt by Association - Large Scale Malware Detection by Mining File-relation Graphs.pdf">“Guilt by Association: Large Scale Malware Detection by Mining File-relation Graphs”</a>. If you’re interested in how I set up my Google Alerts to stay abreast of the latest security data science research, see this: <a href="https://medium.com/@jason_trost/security-data-science-learning-resources-8f7586995040">Security Data Science Learning Resources</a>.</p>
<p>Heterogeneous Information Networks are a relatively simple way of modelling one or more datasets as a graph consisting of nodes and edges where 1) all nodes and edges have defined types, and 2) types of nodes > 1 or types of edges > 1 (hence “Heterogeneous”). The set of node and edge types represents the schema of the network. This differs from homogeneous networks where the nodes and edges are all the same type (e.g. Facebook Social Network Graph, World Wide Web, etc.). HINs provide a very rich abstraction for modelling complex datasets.</p>
<p>Below, I will walk through important HIN concepts using the <a href="/research-papers/heterogeneous-information-networks/HinDom- A Robust Malicious Domain Detection System based on Heterogeneous Information Network with Transductive Classification.pdf">HinDom paper</a> as an example. HinDom uses DNS relationship data from passive DNS, DNS query logs, and DNS response logs to build a malicious domain classifier using HIN. The authors use the Alexa Top 1K list, Malwaredomains.com, Malwaredomainlist.com, DGArchive, Google Safe Browsing, and VirusTotal for deriving labels. Below is an example HIN schema taken from this paper.</p>
<p><img src="/images/hin/hindom-schema-2.png" alt="HinDom Schema" /></p>
<p>This schema represents three combined datasets (Passive DNS, DNS query logs, DNS response logs) and it models three node types (Client, Domain, and IP Address) and six edge types (segment, query, CNAME, similar, resolve, and same-domain). Here is an expanded example and descriptions of the relationships:</p>
<p><img src="/images/hin/hindom-example.png" alt="HinDom Example" /></p>
<ul>
<li><strong>Client-query-Domain</strong> - matrix Q denotes that domain i is queried by client j.</li>
<li><strong>Client-segment-Client</strong> - matrix N denotes that client i and client j belong to the same network segment.</li>
<li><strong>Domain-resolve-IP</strong> - matrix R denotes that domain i is resolved to IP address j.</li>
<li><strong>Domain-similar-Domain</strong> - matrix S denotes the character-level similarity between domain i and j.</li>
<li><strong>Domain-cname-Domain</strong> - matrix C denotes that domain i and domain j are in a CNAME record.</li>
<li><strong>IP-domain-IP</strong> - matrix D denotes that IP address i and IP address j are once mapped to the same domain.</li>
</ul>
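<p>To make the matrix notation concrete, here is a small sketch that builds a toy query matrix Q (domains as rows, clients as columns, matching the definition above) and multiplies it by its transpose to count the clients shared by each pair of domains. The data is invented, the helper names are hypothetical, and the PathSim-style score at the end is a standard HIN similarity measure used here for illustration rather than a claim about exactly how HinDom computes it.</p>

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Toy data: Q[i][j] = 1 when domain i is queried by client j.
Q = [
    [1, 1, 0],  # domain 0 queried by clients 0 and 1
    [1, 1, 0],  # domain 1 queried by clients 0 and 1
    [0, 0, 1],  # domain 2 queried by client 2 only
]

# Domain-Client-Domain path counts:
# M[i][k] = number of clients that queried both domain i and domain k.
M = matmul(Q, transpose(Q))


def pathsim(m, i, k):
    """PathSim-style similarity for a symmetric meta-path count matrix."""
    denom = m[i][i] + m[k][k]
    return 2 * m[i][k] / denom if denom else 0.0
```

<p>With this toy data, domains 0 and 1 are queried by exactly the same clients and score 1.0, while domains 0 and 2 share no clients and score 0.0. Real matrices are large and sparse, so a sparse matrix library would replace the list-of-lists helpers.</p>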
<p>Once the dataset is represented as a graph, feature vectors need to be extracted before machine learning models can be built. A common technique for featurizing a HIN is to define Meta-paths or Meta-graphs against the graph and then perform guided random walks along them. Meta-paths represent graph traversals through specific node and edge sequences. Meta-path selection is akin to feature engineering in classical machine learning, as it is very important to select meta-paths that provide useful signals for whatever variable is being predicted. As seen in many HIN papers, meta-paths/graphs are often evaluated individually or in combination to determine their influence on model performance. Guided random walks against meta-paths produce sequences of nodes (similar to sentences of words), which can then be fed into models like <a href="https://arxiv.org/pdf/1301.3781.pdf">Skipgram or Continuous Bag-of-Words (CBOW)</a> to create embeddings. Once the nodes are represented as embeddings, many different models (SVM, DNN, etc.) can be used to solve many different types of problems (Similarity Search, Classification, Clustering, Recommendation, etc.). Below are the meta-paths used in the HinDom paper.</p>
<p><img src="/images/hin/hindom-metapaths.png" alt="HinDom Meta-paths" /></p>
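<p>As a rough illustration of a meta-path guided random walk, here is a minimal sketch in plain Python. The toy graph, the node names, and the helper functions <code>build_adjacency</code> and <code>metapath_walk</code> are all hypothetical and are not taken from the HinDom paper or any HIN library; the sketch only assumes that every node has a type and that the meta-path starts and ends on the same type so it can be repeated.</p>

```python
import random

# Toy HIN: node name -> node type (hypothetical data, not from the HinDom paper).
NODE_TYPE = {
    "d1.example": "domain", "d2.example": "domain", "d3.example": "domain",
    "client-a": "client", "client-b": "client",
    "192.0.2.1": "ip", "192.0.2.2": "ip",
}

# Undirected edges: client queries domain, domain resolves to IP.
EDGES = [
    ("client-a", "d1.example"), ("client-a", "d2.example"),
    ("client-b", "d2.example"), ("client-b", "d3.example"),
    ("d1.example", "192.0.2.1"), ("d2.example", "192.0.2.1"),
    ("d3.example", "192.0.2.2"),
]


def build_adjacency(edges):
    """Build an undirected adjacency list from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def metapath_walk(adj, start, metapath, walk_length, rng):
    """Random walk that only steps to neighbours whose type matches the
    next node type in the (repeated) meta-path."""
    if NODE_TYPE[start] != metapath[0] or metapath[0] != metapath[-1]:
        raise ValueError("meta-path must start at the start node's type and be cyclic")
    step_types = metapath[1:]  # types to visit after the start node, repeated
    walk = [start]
    step = 0
    while len(walk) < walk_length:
        wanted = step_types[step % len(step_types)]
        candidates = [n for n in adj.get(walk[-1], []) if NODE_TYPE[n] == wanted]
        if not candidates:  # dead end: no neighbour of the required type
            break
        walk.append(rng.choice(candidates))
        step += 1
    return walk


adjacency = build_adjacency(EDGES)
# Domain -> Client -> Domain: domains queried by the same clients.
print(metapath_walk(adjacency, "d1.example", ["domain", "client", "domain"], 7, random.Random(0)))
```

<p>Each walk is a "sentence" of node names, so a collection of such walks could be handed to a Skipgram or CBOW implementation to learn node embeddings.</p>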
<p>Below is the HinDom Architecture to illustrate how all these concepts come together.</p>
<p><img src="/images/hin/hindom-arch.png" alt="HinDom Architecture" /></p>
<p>Below are some resources that I found useful for learning more about Heterogeneous Information Networks as well as several security related papers that used HIN.</p>
<h3 id="books">Books:</h3>
<ul>
<li><a href="https://www.amazon.com/Mining-Heterogeneous-Information-Networks-Methodologies/dp/1608458806/ref=as_li_ss_tl?ie=UTF8&linkCode=ll1&tag=cyberanaly-20&linkId=7f761b6fc9b4e6a799a8d71d9e018cbb&language=en_US">Mining Heterogeneous Information Networks: Principles and Methodologies</a></li>
<li><a href="https://www.amazon.com/Heterogeneous-Information-Analysis-Applications-Analytics-ebook/dp/B071P9W8JV/ref=as_li_ss_tl?ie=UTF8&linkCode=ll1&tag=cyberanaly-20&linkId=3b1edf38484828593ed6a0faad474d4a&language=en_US">Heterogeneous Information Network Analysis and Applications</a></li>
</ul>
<h3 id="hin-papers">HIN Papers:</h3>
<ul>
<li><a href="/research-papers/heterogeneous-information-networks/Mining Heterogeneous Information Networks- A Structural Analysis Approach.pdf">Mining Heterogeneous Information Networks- A Structural Analysis Approach</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/HIN2Vec- Explore Meta-paths in Heterogeneous Information Networks for Representation Learning.pdf">HIN2Vec: Explore Meta-paths in Heterogeneous Information Networks for Representation Learning</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/PathSim- Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks.pdf">PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema.pdf">Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/metapath2vec- Scalable Representation Learning for Heterogeneous Networks.pdf">Metapath2vec: Scalable Representation Learning for Heterogeneous Networks</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/A Survey of Heterogeneous Information Network Analysis.pdf">A Survey of Heterogeneous Information Network Analysis</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/Adversarial Learning on Heterogeneous Information Networks.pdf">Adversarial Learning on Heterogeneous Information Networks</a></li>
</ul>
<h3 id="security-related-hin-papers">Security-related HIN Papers:</h3>
<h4 id="malware-detection--code-analysis">Malware Detection / Code Analysis:</h4>
<ul>
<li><a href="/research-papers/heterogeneous-information-networks/AiDroid - When Heterogeneous Information Network Marries Deep Neural Network for Real-time Android Malware Detection.pdf">AiDroid: When Heterogeneous Information Network Marries Deep Neural Network for Real-time Android Malware Detection</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/Gotcha - Sly Malware!- Scorpion A Metagraph2vec Based Malware Detection System.pdf">Gotcha: Sly Malware!- Scorpion A Metagraph2vec Based Malware Detection System</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/HinDroid - An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network.pdf">HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/Make Evasion Harder- An Intelligent Android Malware Detection System.pdf">Make Evasion Harder: An Intelligent Android Malware Detection System</a></li>
<li><a href="https://link.springer.com/article/10.1007/s10115-017-1058-9">DeepAM: a heterogeneous deep learning framework for intelligent malware detection</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/HinDom- A Robust Malicious Domain Detection System based on Heterogeneous Information Network with Transductive Classification.pdf">HinDom: A Robust Malicious Domain Detection System based on Heterogeneous Information Network with Transductive Classification</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/iTrustSO - An Intelligent System for Automatic Detection of Insecure Code Snippets in Stack Overflow.pdf">iTrustSO: An Intelligent System for Automatic Detection of Insecure Code Snippets in Stack Overflow</a></li>
</ul>
<h4 id="mining-the-darkweb--fraud-detection--social-network-analysis">Mining the Darkweb / Fraud Detection / Social Network Analysis:</h4>
<ul>
<li><a href="/research-papers/heterogeneous-information-networks/Key Player Identification in Underground Forums over Attributed Heterogeneous Information Network Embedding Framework.pdf">Key Player Identification in Underground Forums over Attributed Heterogeneous Information Network Embedding Framework</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/Your Style Your Identity- Leveraging Writing and Photography Styles for Drug Trafficker Identification in Darknet Markets over Attributed Heterogeneous Information Network.pdf">Your Style Your Identity: Leveraging Writing and Photography Styles for Drug Trafficker Identification in Darknet Markets over Attributed Heterogeneous Information Network</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/iDetector - Automate Underground Forum Analysis Based on Heterogeneous Information Network.pdf">iDetector: Automate Underground Forum Analysis Based on Heterogeneous Information Network</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/Cash-out User Detection based on Attributed Heterogeneous Information Network with a Hierarchical Attention Mechanism.pdf">Cash-out User Detection based on Attributed Heterogeneous Information Network with a Hierarchical Attention Mechanism</a></li>
<li><a href="/research-papers/heterogeneous-information-networks/iDev- Enhancing Social Coding Security by Cross-platform User Identification Between GitHub and Stack Overflow.pdf">iDev: Enhancing Social Coding Security by Cross-platform User Identification Between GitHub and Stack Overflow</a></li>
</ul>
<h3 id="tutorials">Tutorials:</h3>
<ul>
<li><a href="http://web.cs.ucla.edu/~yzsun/Tutorials/KDD2017/KDD_17_Recommendation.pdf">KDD 2017: Mining Heterogeneous Information Networks</a></li>
<li><a href="http://people.cs.vt.edu/~badityap/classes/cs6604-Fall17/student-lectures/prashant-hetero-networks.pdf">Intro to Heterogeneous (Information) Networks</a></li>
</ul>
<h3 id="code">Code:</h3>
<ul>
<li><a href="https://github.com/zhoushengisnoob/HINE">github.com/zhoushengisnoob/HINE</a> - Heterogeneous Information Network Embedding: papers and code implementations.</li>
<li><a href="https://github.com/stellargraph/stellargraph">github.com/stellargraph/stellargraph</a> (see <a href="https://github.com/stellargraph/stellargraph/blob/develop/demos/embeddings/stellargraph-metapath2vec.ipynb">stellargraph-metapath2vec.ipynb</a>)</li>
<li><a href="https://github.com/hetio/hetnetpy">github.com/hetio/hetnetpy</a> - HIN library</li>
<li><a href="https://github.com/hetio/hetmatpy">github.com/hetio/hetmatpy</a> - HIN library that represents as matrices.</li>
<li><a href="https://github.com/csiesheep/hin2vec">github.com/csiesheep/hin2vec</a></li>
</ul>
<h3 id="prominent-security-researchers-using-hin">Prominent Security Researchers using HIN:</h3>
<ul>
<li><a href="https://scholar.google.com/citations?user=egjr888AAAAJ&hl=en&oi=ao">Yanfang Ye</a></li>
<li><a href="https://scholar.google.com/citations?hl=en&user=-NnGknEAAAAJ">Shifu Hou</a></li>
<li><a href="https://scholar.google.com/citations?hl=en&user=0fTVEgQAAAAJ">Yiming Zhang</a></li>
</ul>
<hr />
<p>As always, feedback is welcome so please leave a message here, on <a href="https://medium.com/@jason_trost">Medium</a>, or @ me on <a href="https://twitter.com/#!/jason_trost">twitter</a>!</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p><a href="http://www.covert.io/heterogeneous-information-networks-and-applications-to-cyber-security/">Heterogeneous Information Networks + Cyber Security Use cases</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on January 20, 2020.</p>http://www.covert.io/auxiliary-loss-optimization-for-hypothesis-augmentation-for-dga-domain-detection2019-07-18T00:00:00-00:002019-07-17T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>This post outlines some experiments I ran using Auxiliary Loss Optimization for Hypothesis Augmentation (ALOHA) for DGA domain detection.</p>
<p><strong>(Update 2019-07-18)</strong> After getting feedback from one of the ALOHA paper authors, I <a href="https://github.com/covert-labs/aloha_dga/pull/2">modified my code</a> to set loss weights for the auxiliary targets as they did in their paper (weights used: main target 1.0, auxiliary targets 0.1). I also added 3 word-based/dictionary DGAs. All diagrams and metrics have been updated to reflect this.</p>
<p>I recently read the paper <a href="https://arxiv.org/abs/1903.05700">ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation</a> by Ethan M. Rudd, Felipe N. Ducau, Cody Wild, Konstantin Berlin, and Richard Harang from Sophos Lab. This <a href="https://www.usenix.org/conference/usenixsecurity19/presentation/rudd">research</a> will be presented at <a href="https://www.usenix.org/conference/usenixsecurity19">USENIX Security 2019</a> in August 2019. The paper finds that supplying more prediction targets to a model at training time can improve its performance on the primary prediction target. More specifically, the authors modify a deep learning based malware detection model (a binary classifier) to also predict things like individual vendor predictions, malware tags, and the number of VT detections. Their “auxiliary loss architecture yields a significant reduction in detection error rate (false negatives) of 42.6% at a false positive rate (FPR) of 10^−3 when compared to a similar model with only one target, and a decrease of 53.8% at 10^−5 FPR.”</p>
<p><a href="/images/aloha-dga/aloha-arch.png"><img src="/images/aloha-dga/aloha-arch.png" alt="Aloha Model Architecture" width="400px" /></a></p>
<p><strong>Figure 1 from the paper</strong></p>
<blockquote>
<p>A schematic overview of our neural network architecture. Multiple output layers with corresponding loss functions are optionally connected to a common base topology which consists of five dense blocks. Each block is composed of a Dropout, dense and batch normalization layers followed by an exponential linear unit (ELU) activation of sizes 1024, 768, 512, 512, and 512. This base, connected to our main malicious/benign output (solid line in the figure) with a loss on the aggregate label constitutes our baseline architecture. Auxiliary outputs and their respective losses are represented in dashed lines. The auxiliary losses fall into three types: count loss, multi-label vendor loss, and multi-label attribute tag loss</p>
</blockquote>
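<p>To make the idea of a combined loss concrete, here is a minimal sketch in plain Python of a weighted sum of a main loss and several auxiliary losses, using the weights mentioned in the update above (1.0 for the main target, 0.1 for each auxiliary target). It is a simplification and not the paper's code: every target is treated as a binary label with binary cross-entropy, whereas the paper also uses a count loss, and the function names <code>binary_cross_entropy</code> and <code>aloha_style_loss</code> are made up for this example.</p>

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one example; y_pred is clipped away from 0 and 1."""
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def aloha_style_loss(main, auxiliary, main_weight=1.0, aux_weight=0.1):
    """Weighted sum of the main loss and the auxiliary losses.

    main:      (y_true, y_pred) for the malicious/benign output
    auxiliary: list of (y_true, y_pred), one pair per auxiliary output
    """
    loss = main_weight * binary_cross_entropy(*main)
    for y_true, y_pred in auxiliary:
        loss += aux_weight * binary_cross_entropy(y_true, y_pred)
    return loss


# One malicious example: the main output is fairly confident, and two
# auxiliary outputs (e.g. two family labels) are scored alongside it.
print(aloha_style_loss((1, 0.9), [(1, 0.6), (0, 0.2)]))
```

<p>Because the auxiliary weight is small, the auxiliary targets nudge the shared layers during training without dominating the main malicious/benign objective.</p>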
<p>This paper made me wonder how well this technique would work for other areas in network security such as:</p>
<ul>
<li>Detecting malicious URLs from Exploit Kits - possible auxiliary labels: Exploit Kit names, Web Proxy Categories, etc.</li>
<li>Detecting malicious C2 domains - possible auxiliary labels: malware family names, DGA or not, proxy categories.</li>
<li>Detecting DGA Domains - possible auxiliary labels: malware families, DGA type (wordlist, hex based, alphanumeric, etc).</li>
</ul>
<p>I decided to explore the last use case of how well auxiliary loss optimizations would improve DGA domain detections. For this work I identified four DGA models and used these as baselines. Then I ran some experiments. All code from these experiments is hosted <a href="https://github.com/covert-labs/aloha_dga">here</a>. This code is based heavily off of Endgame’s <a href="https://github.com/endgameinc/dga_predict">dga_predict</a>, but with many <a href="https://github.com/endgameinc/dga_predict/compare/master...covert-labs:master">modifications</a>.</p>
<h3 id="data">Data:</h3>
<p>For this work, I used the same data sources selected by Endgame’s dga_predict (but I added 3 additional DGAs: gozi, matsnu, and suppobox).</p>
<ul>
<li>Alexa top 1m domains</li>
<li>Classical DGA domains for the following malware families: banjori, corebot, cryptolocker, dircrypt, kraken, lockyv2, pykspa, qakbot, ramdo, ramnit, and simda.</li>
<li>Word-based/dictionary DGA domains for the following malware families: gozi, matsnu, and suppobox.</li>
</ul>
<h3 id="baseline-models">Baseline Models:</h3>
<p>I used 4 baseline binary models + 4 extensions of these models that use Auxiliary Loss Optimization for Hypothesis Augmentation.</p>
<p>Baseline Models:</p>
<ul>
<li>Bigram - Endgame’s Bigram model from <a href="https://github.com/endgameinc/dga_predict">dga_predict</a>.</li>
<li>LSTM - Endgame’s LSTM model from <a href="https://github.com/endgameinc/dga_predict">dga_predict</a>.</li>
<li>CNN - CNN adapted from Keegan Hines’s <a href="https://github.com/keeganhines/snowman">snowman</a>.</li>
<li>LSTM + CNN - CNN adapted from Keegan Hines’s <a href="https://github.com/keeganhines/snowman">snowman</a>, combined with an LSTM as defined by <a href="https://www.youtube.com/watch?v=99hniQYB6VM">Deep Learning For Realtime Malware Detection (ShmooCon 2018)</a>’s LSTM + CNN (see 13:17 for architecture) by Domenic Puzio and Kate Highnam.</li>
</ul>
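<p>For readers unfamiliar with bigram models, the sketch below shows the kind of feature such a model consumes: counts of adjacent character pairs in a domain name. This is only an illustration in plain Python, not dga_predict's implementation, and the helper name <code>char_bigrams</code> is made up for this example.</p>

```python
from collections import Counter


def char_bigrams(domain):
    """Count every pair of adjacent characters in the domain string."""
    return Counter(domain[i:i + 2] for i in range(len(domain) - 1))


# Bigram counts for a short, made-up domain label.
print(char_bigrams("abab"))
```

<p>A classifier then learns a weight per bigram, so character pairs that are rare in benign domains but common in DGA output push the score toward "malicious".</p>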
<p>ALOHA Extended Models (each simply uses the 11 malware families as additional binary labels):</p>
<ul>
<li>ALOHA CNN</li>
<li>ALOHA Bigram</li>
<li>ALOHA LSTM</li>
<li>ALOHA CNN+LSTM</li>
</ul>
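<p>The "additional binary labels" can be pictured as follows. This is a minimal sketch, assuming each training row carries a single string label that is either "benign" or a malware family name; the family list is copied from the classical DGA list in the Data section, and the function name <code>make_targets</code> is hypothetical rather than taken from the aloha_dga code.</p>

```python
# The 11 classical DGA families listed in the Data section of this post.
FAMILIES = [
    "banjori", "corebot", "cryptolocker", "dircrypt", "kraken", "lockyv2",
    "pykspa", "qakbot", "ramdo", "ramnit", "simda",
]


def make_targets(label, families=FAMILIES):
    """Return (main_target, auxiliary_targets) for one training row.

    main_target is 1 for any DGA family and 0 for benign; auxiliary_targets
    holds one binary label per family.
    """
    main_target = 0 if label == "benign" else 1
    auxiliary_targets = {family: int(label == family) for family in families}
    return main_target, auxiliary_targets


print(make_targets("kraken"))
```

<p>The main target is what the model is scored on; the per-family labels are only used as extra training signal.</p>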
<p>I trained each of these models using the default settings as provided by dga_predict (except, I added stratified sampling based on the full labels: benign + malware families):</p>
<ul>
<li>Training splits: 76% training, 4% validation, 20% testing</li>
<li>All models were trained with a batch size of 128</li>
<li>The CNN, LSTM, and CNN+LSTM models used up to 25 epochs, while the bigram models used up to 50 epochs.</li>
</ul>
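<p>A stratified 76% / 4% / 20% split on the full labels can be sketched with the standard library alone. This is an illustrative version under simple assumptions (one label per row, rounding per class), not the code used for the experiments, and the function name <code>stratified_split</code> is made up for this example.</p>

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.76, 0.04, 0.20), seed=0):
    """Split row indices into train/validation/test so that every label
    (benign or a malware family) keeps roughly the same proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, validation, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_validation = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        validation.extend(indices[n_train:n_train + n_validation])
        test.extend(indices[n_train + n_validation:])  # remainder goes to test
    return train, validation, test


labels = ["benign"] * 100 + ["gozi"] * 50
train, validation, test = stratified_split(labels)
print(len(train), len(validation), len(test))
```

<p>Stratifying on the full label rather than the binary label keeps small families from ending up entirely in one split.</p>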
<p>Below shows counts of how many of each DGA family were used and how many Alexa top 1m domains were included (denoted as “benign”).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-code"><pre>In [1]: import pickle
In [2]: from collections import Counter
In [3]: data = pickle.loads(open('traindata.pkl', 'rb').read())
In [4]: Counter([d[0] for d in data]).most_common(100)
Out[4]:
[('benign', 139935),
('qakbot', 10000),
('dircrypt', 10000),
('pykspa', 10000),
('corebot', 10000),
('kraken', 10000),
('suppobox', 10000),
('gozi', 10000),
('ramnit', 10000),
('matsnu', 10000),
('locky', 9999),
('banjori', 9984),
('simda', 9984),
('ramdo', 9984),
('cryptolocker', 9984)]
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="results">Results</h3>
<p>Model AUC scores (sorted by AUC):</p>
<ul>
<li>aloha_bigram 0.9435</li>
<li>bigram 0.9444</li>
<li>cnn 0.9817</li>
<li>aloha_cnn 0.9820</li>
<li>lstm 0.9944</li>
<li>aloha_cnn_lstm 0.9947</li>
<li>aloha_lstm 0.9950</li>
<li>cnn_lstm 0.9957</li>
</ul>
<p>Overall, by AUC, the ALOHA technique only seemed to improve the LSTM and CNN models and only marginally. The ROC curves show reductions in the error rates at very low false positive rates (between 10^-5 and 10^-3) which is similar to those gains seen in the ALOHA paper, yet the paper’s gains appeared much larger.</p>
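<p>Comparing models at very low false positive rates amounts to reading the detection rate off the ROC curve at a fixed FPR. A minimal sketch of that calculation in plain Python is below; it is not the plotting code used for the figures, and the function name <code>tpr_at_fpr</code> and the toy scores are made up for illustration.</p>

```python
def tpr_at_fpr(y_true, scores, max_fpr):
    """Best true positive rate achievable while keeping the false positive
    rate at or below max_fpr. A sample is flagged when its score is strictly
    above the chosen threshold."""
    negatives = sorted((s for y, s in zip(y_true, scores) if y == 0), reverse=True)
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative sample")

    allowed_false_positives = int(max_fpr * len(negatives))
    if allowed_false_positives >= len(negatives):
        return 1.0
    # The next-highest benign score must NOT be flagged, so it is the threshold.
    threshold = negatives[allowed_false_positives]
    true_positives = sum(1 for s in positives if s > threshold)
    return true_positives / len(positives)


y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.9, 0.95, 0.8, 0.4, 0.05]
print(tpr_at_fpr(y_true, scores, max_fpr=0.25))
```

<p>Note that measuring an FPR of 10^-5 meaningfully requires well over 100,000 benign test samples, since at least one false positive must be representable at that rate.</p>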
<p><a href="/images/aloha-dga/results-linear-all-1.0.png"><img src="/images/aloha-dga/results-linear-all-1.0.png" width="400px" /></a><br />
<strong>ROC: All Models Linear Scale</strong></p>
<p><a href="/images/aloha-dga/results-logscale-all-0.000001-to-1.05.png"><img src="/images/aloha-dga/results-logscale-all-0.000001-to-1.05.png" width="400px" /></a><br />
<strong>ROC: All Models Log Scale</strong></p>
<p><a href="/images/aloha-dga/results-logscale-bigram-0.000001-to-1.05.png"><img src="/images/aloha-dga/results-logscale-bigram-0.000001-to-1.05.png" width="400px" /></a><br />
<strong>ROC: Bigram Models Log Scale</strong></p>
<p><a href="/images/aloha-dga/results-logscale-cnn-0.000001-to-1.05.png"><img src="/images/aloha-dga/results-logscale-cnn-0.000001-to-1.05.png" width="400px" /></a><br />
<strong>ROC: CNN Models Log Scale</strong></p>
<p><a href="/images/aloha-dga/results-logscale-cnn_lstm-0.000001-to-1.05.png"><img src="/images/aloha-dga/results-logscale-cnn_lstm-0.000001-to-1.05.png" width="400px" /></a><br />
<strong>ROC: CNN+LSTM Models Log Scale</strong></p>
<p><a href="/images/aloha-dga/results-logscale-lstm-0.000001-to-1.05.png"><img src="/images/aloha-dga/results-logscale-lstm-0.000001-to-1.05.png" width="400px" /></a><br />
<strong>ROC: LSTM Models Log Scale</strong></p>
<p><strong>Heatmap</strong></p>
<p>Below is a heatmap showing the percentage of detections across all the malware families for each model. Low numbers are good for the benign label (top row), high numbers are good for all the others.</p>
<p>Note that the last 3 rows are all word-based/dictionary DGAs. It is interesting, although not too surprising, that the models that include LSTMs tended to do better against these DGAs.</p>
<p>I annotated with green boxes places where the ALOHA models did better. This seems to be most apparent with the models that include LSTMs and for the word-based/dictionary DGAs.</p>
<p><a href="/images/aloha-dga/heatmap.png"><img src="/images/aloha-dga/heatmap.png" width="400px" /></a><br /></p>
<h3 id="future-work">Future Work:</h3>
<p>These are some areas of future work I hope to have time to try out.</p>
<ul>
<li>Add more DGA generators to the project, esp word-based / dictionary DGAs and see how the models react. I have identified several (see “Word-based / Dictionary-based DGA Resources” from <a href="http://www.covert.io/getting-started-with-dga-research/">here</a> for more info).</li>
<li>Try incorporating other auxiliary targets, such as:
<ul>
<li>Type of DGA (hex based, alphanumeric, custom alphabet, dictionary/word-based, etc)</li>
<li>Classical DGA domain features like string entropy, count of longest consecutive consonant string, count of longest consecutive vowel string, etc. I am curious if forcing the NN to learn these would improve its primary scoring mechanism.</li>
<li>Metadata from VT <a href="https://developers.virustotal.com/reference#domain-report">domain report</a>.</li>
<li>Summary / stats from Passive DNS (PDNS).</li>
<li>Features from various aspects of the domain’s whois record.</li>
</ul>
</li>
</ul>
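<p>The classical string features mentioned in the list above are cheap to compute. Here is a minimal sketch in plain Python of string entropy and the longest consecutive consonant and vowel runs; the exact feature definitions vary between papers, so treat the choices here (Shannon entropy over characters, "aeiou" as the vowel set, digits and hyphens counted as neither) and the function names as assumptions.</p>

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def shannon_entropy(text):
    """Shannon entropy (in bits) of the character distribution of text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def longest_run(text, predicate):
    """Length of the longest run of consecutive characters matching predicate."""
    best = current = 0
    for ch in text:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def classical_features(label):
    """Feature dict for one domain label (e.g. the part before the TLD)."""
    label = label.lower()
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "longest_consonant_run": longest_run(
            label, lambda c: c.isalpha() and c not in VOWELS),
        "longest_vowel_run": longest_run(label, lambda c: c in VOWELS),
    }


print(classical_features("google"))
```

<p>Used as auxiliary targets, these values would be regression outputs rather than binary labels, so they would need their own loss (for example mean squared error).</p>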
<p>If you enjoyed this post, you may be interested in my other recent post on <a href="http://www.covert.io/getting-started-with-dga-research/">Getting Started with DGA Domain Detection Research</a>. Also, please see more Security Data Science blog posts on my personal blog: <a href="http://www.covert.io/">covert.io</a>.</p>
<p>As always, feedback is welcome so please leave a message here, on <a href="https://medium.com/@jason_trost">Medium</a>, or @ me on <a href="https://twitter.com/#!/jason_trost">twitter</a>!</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p><a href="http://www.covert.io/auxiliary-loss-optimization-for-hypothesis-augmentation-for-dga-domain-detection/">Auxiliary Loss Optimization for Hypothesis Augmentation for DGA Domain Detection</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on July 17, 2019.</p>http://www.covert.io/getting-started-with-dga-research2020-03-22T00:00:00-00:002019-07-16T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>This post provides resources for getting started with research on Domain Generation Algorithm (DGA) Domain Detection.</p>
<p>DGA domains are commonly used by malware as a mechanism to maintain command and control (C2) and to make that C2 channel more difficult for defenders to block. Prior to DGA domains, most malware used a small hardcoded list of IPs or domains. Once these IPs / domains were discovered they could be blocked by defenders or taken down for abuse. DGA domains make this more difficult since the C2 domain changes frequently and enumerating and blocking all generated domains can be expensive.</p>
<p>Recently, I have been working on a <a href="http://www.covert.io/auxiliary-loss-optimization-for-hypothesis-augmentation-for-dga-domain-detection/">research project</a> related to DGA detection (hopefully it will turn into a blogpost or a presentation somewhere), and it occurred to me that DGA detection is probably one of the most accessible areas for those getting into security data science due to the availability of so much labelled data and so many open source implementations of DGA detection. One might argue that this means it is not an area worth researching due to saturation, but I think that depends on your situation/goals. This short post outlines some of the resources that I found useful for DGA research.</p>
<h2 id="data">Data:</h2>
<p>This section lists some domain lists and DGA generators that may be useful for creating “labelled” DGA domain lists.</p>
<h3 id="dga-data">DGA Data:</h3>
<ul>
<li><a href="https://dgarchive.caad.fkie.fraunhofer.de/">DGArchive</a> - a large private collection of DGA related data. This contains ~88 csv files of DGA domains organized by malware family. DGArchive is password protected and if you want access you need to reach out to the maintainer.</li>
<li><a href="https://osint.bambenekconsulting.com/feeds/">Bambenek Feeds</a> (see “DGA Domain Feed”).</li>
<li><a href="https://data.netlab.360.com/dga/">Netlab 360 DGA Feeds</a></li>
</ul>
<h3 id="dga-generators">DGA Generators:</h3>
<ul>
<li><a href="https://github.com/baderj/domain_generation_algorithms">baderj/domain_generation_algorithms</a> (276-stars on Github) by <a href="https://twitter.com/viql">Johannes Bader</a> - DGA algorithms implemented in python.</li>
<li><a href="https://github.com/andrewaeva/DGA">andrewaeva/DGA</a> (123-stars on Github) - smaller collection of DGA algorithms and data, but fills in some of the gaps from domain_generation_algorithms.</li>
<li><a href="https://github.com/pchaigno/dga-collection">pchaigno/dga-collection</a> (37-stars on Github)</li>
</ul>
<h4 id="word-based--dictionary-based-dga-resources">Word-based / Dictionary-based DGA Resources:</h4>
<p>Below are all Malware Families that use word-based / dictionary DGAs, meaning their domains consist of 2 or more words selected from a list/dictionary and concatenated together. I separate these out since they are different than most other “classical” DGAs.</p>
<ul>
<li><a href="https://github.com/andrewaeva/DGA/blob/master/dga_algorithms/Matsnu.py">Matsnu DGA generator</a></li>
<li><a href="https://github.com/baderj/domain_generation_algorithms/blob/master/gozi/dga.py">gozi DGA generator</a></li>
<li><a href="https://github.com/baderj/domain_generation_algorithms/blob/master/suppobox/dga.py">suppobox DGA generator</a></li>
<li><a href="https://github.com/baderj/domain_generation_algorithms/blob/master/nymaim2/dga.py">nymaim2 DGA generator</a></li>
<li><a href="https://github.com/baderj/domain_generation_algorithms/blob/master/pizd/pizd">pizd DGA generator</a></li>
</ul>
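<p>To show what "2 or more words selected from a list/dictionary and concatenated together" looks like in practice, here is a deliberately toy generator in plain Python. It does not reproduce the algorithm of Matsnu, gozi, suppobox, or any other real family; the word list, the integer seed, the reserved <code>.example</code> TLD, and the function name <code>toy_dictionary_dga</code> are all made up for illustration.</p>

```python
import random

# Tiny hypothetical word list; real families ship lists of hundreds of words.
WORDS = ["apple", "river", "stone", "cloud", "green", "table"]


def toy_dictionary_dga(seed, count, words=WORDS, tld="example"):
    """Generate `count` domains by concatenating two distinct dictionary words.

    The same seed always yields the same domains, which is what lets malware
    and its operator agree on the list without communicating.
    """
    rng = random.Random(seed)
    domains = []
    for _ in range(count):
        first, second = rng.sample(words, 2)
        domains.append(f"{first}{second}.{tld}")
    return domains


print(toy_dictionary_dga(seed=20190716, count=3))
```

<p>Domains like these have ordinary character statistics, which is why detectors built on entropy or character n-grams tend to struggle with this class of DGA.</p>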
<h3 id="benign--non-dga-data">“Benign” / Non-DGA Data:</h3>
<p>This section lists some domain lists that may be useful for creating “labelled” benign domain lists. Several academic papers use one or more of these sources, but they generally create derivatives that represent the Stable N-day Top X Sites (e.g. Stable Alexa 30-day top 500k, meaning domains from the Alexa top 500k that have been on the list for each of the last 30 consecutive days). The Alexa data needs to be downloaded each day for 30+ days to create this, since only today’s snapshot is provided by Amazon. This filters out domains that become popular for a short amount of time but then drop off, as sometimes happens with malicious domains.</p>
<ul>
<li><a href="http://s3.amazonaws.com/alexa-static/top-1m.csv.zip">Alexa Top 1M Domains</a></li>
<li><a href="https://ak.quantcast.com/quantcast-top-sites.zip">Quantcast</a></li>
<li><a href="http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip">Cisco Umbrella Top 1 million</a></li>
<li><a href="http://downloads.majestic.com/majestic_million.csv">The Majestic Million</a></li>
<li><a href="https://www.domcop.com/files/top/top10milliondomains.csv.zip">DomCop Top 1M</a></li>
<li><a href="https://raw.githubusercontent.com/maravento/blackweb/master/bwupdate/lst/whiteurls.txt">Whitelisted Domains</a> from <a href="https://github.com/maravento/blackweb">maravento/blackweb</a></li>
</ul>
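<p>Building a "Stable N-day Top X" list from daily downloads is essentially a set intersection. A minimal sketch in plain Python, assuming each daily snapshot has already been parsed into a list of domains ordered by rank (the function name <code>stable_top_domains</code> and the sample data are hypothetical):</p>

```python
def stable_top_domains(daily_snapshots, top_n):
    """Return domains that appear in the top_n of every daily snapshot.

    daily_snapshots: iterable of lists, one per day, ordered by rank.
    """
    stable = None
    for ranked_domains in daily_snapshots:
        todays_top = set(ranked_domains[:top_n])
        stable = todays_top if stable is None else stable & todays_top
    return stable if stable is not None else set()


snapshots = [
    ["a.example", "b.example", "c.example"],  # day 1
    ["b.example", "a.example", "d.example"],  # day 2
    ["a.example", "c.example", "b.example"],  # day 3
]
print(stable_top_domains(snapshots, top_n=2))
```

<p>For the 30-day top 500k case, this would be called with 30 snapshots and <code>top_n=500000</code>.</p>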
<h4 id="update-2020-03-22---more-heuristics-for-benign-training-set-curation">Update (2020-03-22) - More Heuristics for Benign training set curation:</h4>
<p>Excerpt from <a href="https://arxiv.org/pdf/2003.05703.pdf">Inline Detection of DGA Domains Using Side Information (page 12)</a></p>
<blockquote>
<p>The benign samples are collected based on a predefined set of heuristics as listed below:</p>
<ul>
<li>Domain name should have valid DNS characters only (digits, letters, dot and hyphen)</li>
<li>Domain has to be resolved at least once for every day between June 01, 2019 and July 31, 2019.</li>
<li>Domain name should have a valid public suffix</li>
<li>Characters in the domain name are not all digits (after removing ‘.’ and ‘-‘)</li>
<li>Domain should have at most four labels (Labels are sequence of characters separated by a dot)</li>
<li>Length of the domain name is at most 255 characters</li>
<li>Longest label is between 7 and 64 characters</li>
<li>Longest label is more than twice the length of the TLD</li>
<li>Longest label is more than 70% of the combined length of all labels</li>
<li>Excludes IDN (International Distribution Network) domains (such as domains starting with xn--)</li>
<li>Domain must not exist in DGArchive</li>
</ul>
</blockquote>
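<p>Several of these heuristics only need the domain string, so they are easy to prototype. The sketch below, in plain Python, covers the string-only checks and skips the ones that need external data (daily resolution history, a public suffix list, DGArchive membership). It approximates "TLD" as the last label, which is an assumption of this sketch, and the function name <code>passes_basic_heuristics</code> is made up for illustration; it is not code from the paper.</p>

```python
import re

# Digits, letters, dot and hyphen only (domain is lowercased first).
VALID_CHARS = re.compile(r"^[a-z0-9.-]+$")


def passes_basic_heuristics(domain):
    """Apply the string-only benign-set heuristics to one domain name."""
    name = domain.lower().rstrip(".")
    if not VALID_CHARS.match(name):
        return False
    if len(name) > 255:
        return False
    labels = name.split(".")
    if len(labels) > 4:
        return False
    if name.replace(".", "").replace("-", "").isdigit():
        return False  # all digits after removing '.' and '-'
    if any(label.startswith("xn--") for label in labels):
        return False  # IDN (punycode) label
    longest = max(len(label) for label in labels)
    if not 7 <= longest <= 64:
        return False
    if longest <= 2 * len(labels[-1]):  # last label used as a stand-in for the TLD
        return False
    if longest <= 0.7 * sum(len(label) for label in labels):
        return False
    return True


for candidate in ["wikipedia.org", "google.com", "www.wikipedia.org"]:
    print(candidate, passes_basic_heuristics(candidate))
```

<p>Note how strict the length rules are: short names and names with a <code>www</code> prefix fail, so the heuristics are best applied to registered domains rather than full hostnames.</p>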
<h3 id="utilities">Utilities:</h3>
<p><strong>Domain Parser:</strong></p>
<p>When parsing the various domain lists, <a href="https://pypi.org/project/tldextract/">tldextract</a> is very helpful for stripping off TLDs or subdomains if desired. I have seen several projects attempt to parse domains using “split(‘.’)” or “domain[:-3]”. This does not work very well, since a domain’s public suffix can contain multiple “.”s (e.g. .co.uk).</p>
<p><strong>Installation:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-code"><pre>pip install tldextract
</pre></td></tr></tbody></table></code></pre></div></div>
<p><strong>Example:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-code"><pre>In [1]: import tldextract
In [2]: e = tldextract.extract('abc.www.google.co.uk')
In [3]: e
Out[3]: ExtractResult(subdomain='abc.www', domain='google', suffix='co.uk')
In [4]: e.domain
Out[4]: 'google'
In [5]: e.subdomain
Out[5]: 'abc.www'
In [6]: e.registered_domain
Out[6]: 'google.co.uk'
In [7]: e.fqdn
Out[7]: 'abc.www.google.co.uk'
In [8]: e.suffix
Out[8]: 'co.uk'
</pre></td></tr></tbody></table></code></pre></div></div>
<p><strong>Domain Resolution:</strong></p>
<p>During the course of your research you may need to perform DNS resolutions on lots of DGA domains. If you do this, I highly recommend setting up your own bind9 server on Digital Ocean or Amazon and using adnshost (a utility from <a href="https://www.gnu.org/software/adns/">adns</a>). If you perform the DNS resolutions from your home or office, your ISP may interfere with the DNS responses because they will appear malicious, which can bias your research. If you use a provider’s recursive nameservers, you may violate the acceptable use policy (AUP) due to the volume AND the provider may also interfere with the responses.</p>
<p>Adnshost enables high throughput / bulk DNS queries to be performed asynchronously. It will be much faster than performing the DNS queries synchronously (one after the other).</p>
<p>Here is an example of using adnshost (assuming you are running it from the bind9 server you set up):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-code"><pre>cat huge-domains-list.txt | adnshost \
--asynch \
--config "nameserver 127.0.0.1" \
--type a \
--pipe \
--addr-ipv4-only > results.txt
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This <a href="https://www.digitalocean.com/community/tutorials/how-to-configure-bind-as-a-private-network-dns-server-on-ubuntu-14-04">article</a> should get you most of the way there with setting up the bind9 server.</p>
<h2 id="models">Models:</h2>
<p>This section provides links to a few models that could be used as baselines for comparison.</p>
<ul>
<li><a href="https://github.com/endgameinc/dga_predict">dga_predict</a>’s LSTM and Bigram model from Endgame.</li>
<li><a href="https://github.com/keeganhines/snowman">snowman</a>’s CNN model from <a href="https://twitter.com/keeghin">Keegan Hines</a>. This is not specifically designed for DGA, but it works for this.</li>
<li><a href="https://github.com/matthoffman/degas">matthoffman/degas</a> - DGA-generated domain detection using deep learning models.</li>
<li><a href="https://github.com/topics/dga-detection">#dga-detection</a>, <a href="https://github.com/topics/dga">#dga</a>, and <a href="https://github.com/topics/dga-domains">#dga-domains</a> on Github - these tags provide other DGA related projects (DGA domain generators, DGA detection, DGA domain lists).</li>
<li><a href="https://github.com/BKCS-HUST/LSTM-MI">BKCS-HUST/LSTM-MI</a></li>
</ul>
<h2 id="research">Research:</h2>
<ul>
<li><a href="https://sites.cs.ucsb.edu/~chris/research/doc/ndss11_exposure.pdf">EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis</a></li>
<li><a href="https://www.usenix.org/system/files/conference/usenixsecurity12/sec12-final127.pdf">From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware</a></li>
<li><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.221.4391&rep=rep1&type=pdf">Detecting Algorithmically Generated Domain-Flux Attacks With DNS Traffic Analysis</a></li>
<li><a href="https://resources.sei.cmu.edu/asset_files/Presentation/2013_017_101_51242.pdf">Clairvoyant Squirrel: Large-scale malicious domain classification (FloCon 2013)</a> – shameless plug :), I worked on this.</li>
<li><a href="https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_plohmann.pdf">(DGArchive) A Comprehensive Measurement Study of Domain Generating Malware</a></li>
<li><a href="https://arxiv.org/pdf/1610.01969.pdf">DeepDGA: Adversarially-Tuned Domain Generation and Detection</a></li>
<li><a href="https://www.researchgate.net/publication/321165269_A_LSTM_based_Framework_for_Handling_Multiclass_Imbalance_in_DGA_Botnet_Detection">A LSTM based Framework for Handling Multiclass Imbalance in DGA Botnet Detection</a> <a href="https://github.com/BKCS-HUST/LSTM-MI">[code]</a></li>
<li><a href="http://faculty.washington.edu/mdecock/papers/byu2018a.pdf">Character Level based Detection of DGA Domain Names</a> <a href="https://github.com/matthoffman/degas">[code]</a></li>
<li><a href="http://faculty.washington.edu/mdecock/papers/mpereira2018a.pdf">Dictionary Extraction and Detection of Algorithmically Generated Domain Names in Passive DNS Traffic</a></li>
<li><a href="http://faculty.washington.edu/mdecock/papers/rsivaguru2018a.pdf">An Evaluation of DGA Classifiers</a></li>
<li><a href="https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-schuppen.pdf">FANCI : Feature-based Automated NXDomain Classification and Intelligence</a></li>
<li><a href="https://www.youtube.com/watch?v=99hniQYB6VM">Deep Learning For Realtime Malware Detection (ShmooCon 2018)</a> by Domenic Puzio and Kate Highnam. – shameless plug :), I worked with Dom and Kate on several projects.</li>
<li><a href="https://link.springer.com/chapter/10.1007/978-3-030-00009-7_43">A Novel Detection Method for Word-Based DGA</a></li>
<li><a href="https://arxiv.org/pdf/1805.08426.pdf">A Survey on Malicious Domains Detection through DNS Data Analysis</a></li>
<li><a href="https://arxiv.org/pdf/1810.02023.pdf">Detecting DGA domains with recurrent neural networks and side information</a></li>
<li><a href="https://arxiv.org/pdf/1905.01078.pdf">CharBot: A Simple and Effective Method for Evading DGA Classifiers</a></li>
</ul>
<hr />
<p>I hope this is helpful. As always, feedback is welcome, so please leave a message here, on <a href="https://medium.com/@jason_trost">Medium</a>, or @ me on <a href="https://twitter.com/#!/jason_trost">Twitter</a>!</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p><a href="http://www.covert.io/getting-started-with-dga-research/">Getting Started with DGA Domain Detection Research</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on July 16, 2019.</p>http://www.covert.io/security-data-science-learning-resources2019-05-05T00:00:00-00:002019-05-05T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>This short post catalogs some resources that may be useful for those interested in security data science. It is not meant to be an exhaustive list. It is meant to be a curated list to help you get started.</p>
<h3 id="staying-current-with-security-data-science">Staying Current with Security Data Science</h3>
<p>Here is my current strategy for staying current with security data science research. It leans more heavily toward academic research, since that is what interests me at the moment.</p>
<ol>
<li>Google Scholar Publication alerts on known respected researchers.</li>
<li>Google Scholar Citation alerts on interesting or noteworthy papers.</li>
<li>Follow security ML researchers on Twitter and Medium. They frequently share interesting and cutting edge research papers / videos / blogs.</li>
<li>Periodically review proceedings from noteworthy security conferences.</li>
<li>Skim published security conference videos from <a href="http://www.irongeek.com/">Irongeek</a> looking for topics of interest.</li>
</ol>
<h3 id="google-scholar-alerts">Google Scholar alerts</h3>
<p>Citation Alerts on these papers:</p>
<ul>
<li>“Acing the IOC game: Toward automatic discovery and analysis of open-source cyber threat intelligence”</li>
<li>“AI^2: Training a big data machine to defend”</li>
<li>“APT Infection Discovery using DNS Data”</li>
<li>“Beehive: Large-scale log analysis for detecting suspicious activity in enterprise networks”</li>
<li>“Deep neural network based malware detection using two dimensional binary program features”</li>
<li>“Detecting malicious domains via graph inference”</li>
<li>“Detecting malware based on DNS graph mining”</li>
<li>“Detecting structurally anomalous logins in Enterprise Networks”</li>
<li>“Discovering malicious domains through passive DNS data graph analysis”</li>
<li>“EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models”</li>
<li>“Enabling network security through active DNS datasets”</li>
<li>“Feature-based transfer learning for network security”</li>
<li>“Gotcha-Sly Malware!: Scorpion A Metagraph2vec Based Malware Detection System”</li>
<li>“Guilt by association: large scale malware detection by mining file-relation graphs”</li>
<li>“Identifying suspicious activities through dns failure graph analysis”</li>
<li>“Polonium: Tera-scale graph mining and inference for malware detection”</li>
<li>“Segugio: Efficient behavior-based tracking of malware-control domains in large ISP networks”</li>
</ul>
<p>New article alerts on these authors, with the <strong>bolded</strong> names being the most relevant / interesting to me.</p>
<ul>
<li><strong>Alina Oprea</strong> - heavily focused on operational security ML.</li>
<li><strong>Josh Saxe</strong>, <strong>Rich Harang</strong>, and <strong>Konstantin Berlin</strong> - heavily focused on Malware detection/analytics using ML. Also a published book author.</li>
<li><strong>Manos Antonakakis</strong> and <strong>Roberto Perdisci</strong> - heavily focused on network security analytics using ML with a specialty in DNS traffic.</li>
<li>Balduzzi Marco</li>
<li>Battista Biggio</li>
<li>Chaz Lever</li>
<li>Christopher Kruegel</li>
<li>Damon McCoy</li>
<li>David Dagon</li>
<li>David Freeman</li>
<li>Gianluca Stringhini</li>
<li>Giovanni Vigna</li>
<li>Guofei Gu</li>
<li>Han Yufei</li>
<li>Hossein Siadati</li>
<li>Issa Khalil</li>
<li>Jason (Iasonas) Polakis</li>
<li>Michael Donald Bailey</li>
<li>Michael Iannacone</li>
<li>Nick Feamster</li>
<li>Niels Provos</li>
<li>Nir Nissim</li>
<li>Patrick McDaniel</li>
<li>Stefan Savage</li>
<li>Steven Noel</li>
<li>Terry Nelms</li>
<li>Ting-Fang Yen</li>
<li>Vern Paxson</li>
<li>Wenke Lee</li>
<li>Yacin Nadji</li>
<li>Yanfang (Fanny) Ye</li>
<li>Yizheng Chen</li>
<li>Yuval Elovici</li>
</ul>
<h3 id="twitter">Twitter</h3>
<p>Twitter can be a gold mine for new and relevant ideas, blogs, presentations, etc. for security data science. You just need to make sure you continually follow the right folks. Here is a short list of thought leaders in this space (if I left you off, it is my oversight, so please don’t take offense).</p>
<ul>
<li><a href="https://twitter.com/_delta_zero">@_delta_zero</a></li>
<li><a href="https://twitter.com/alexcpsec">@alexcpsec</a></li>
<li><a href="https://twitter.com/DavidJBianco">@DavidJBianco</a></li>
<li><a href="https://twitter.com/DhiaLite">@DhiaLite</a></li>
<li><a href="https://twitter.com/drhyrum">@drhyrum</a></li>
<li><a href="https://twitter.com/filar">@filar</a></li>
<li><a href="https://twitter.com/hrbrmstr">@hrbrmstr</a></li>
<li><a href="https://twitter.com/jayjacobs">@jayjacobs</a></li>
<li><a href="https://twitter.com/JohnLaTwC">@JohnLaTwC</a></li>
<li><a href="https://twitter.com/JosephZadeh">@JosephZadeh</a></li>
<li><a href="https://twitter.com/joshua_saxe">@joshua_saxe</a></li>
<li><a href="https://twitter.com/Kym_Possible">@Kym_Possible</a></li>
<li><a href="https://twitter.com/mroytman">@mroytman</a></li>
<li><a href="https://twitter.com/mrphilroth">@mrphilroth</a></li>
<li><a href="https://twitter.com/MSwannMSFT">@MSwannMSFT</a></li>
<li><a href="https://twitter.com/ram_ssk">@ram_ssk</a></li>
<li><a href="https://twitter.com/rharang">@rharang</a></li>
<li><a href="https://twitter.com/rseymour">@rseymour</a></li>
<li><a href="https://twitter.com/sooshie">@sooshie</a></li>
</ul>
<p>For a longer list of others I would recommend following on Twitter, see <a href="https://gist.github.com/jatrost/b0ae18a545af69130e0033460562aca2">this gist</a>. This list is focused on Threat Intel, Threat Hunting, Detection Engineering, IR, and Security Engineering. It is not exhaustive, but it is a good start.</p>
<h3 id="conferences">Conferences</h3>
<p>Below are several interesting security conferences where research is published on security data science topics. It is a good idea to be on the lookout for the proceedings from these events.</p>
<ul>
<li><a href="http://ycheng.org/codaspy/2018/program.html">ACM CODASPY (ACM Conference on Data and Application Security and Privacy)</a></li>
<li><a href="http://www-personal.umich.edu/~arunesh/AISec2016/Program.html">AI Sec</a></li>
<li><a href="https://www.acsac.org/">Annual Computer Security Applications Conference (ACSAC)</a></li>
<li><a href="https://www.camlis.org/">Conference on Applied Machine Learning for Information Security</a></li>
<li><a href="https://www.ieee-security.org/TC/SPW2018/DLS/">Deep Learning and Security Workshop (co-located with IEEE Security Oakland conference)</a></li>
<li><a href="https://deepintel.net/index.php">DEEPINTEL Conference. Focus on security intelligence.</a></li>
<li><a href="https://aivillage.org/">Defcon AIVillage</a></li>
<li><a href="https://machine-learning-and-security.github.io/cfp.html">Machine Learning and Computer Security Workshop (colocated at NIPS)</a></li>
<li><a href="https://www.usenix.org/conference/scainet18">ScAINet: 2018 USENIX Security and AI Networking Conference</a></li>
<li><a href="http://deem-workshop.org/">Workshop on Data Management for End-to-End Learning</a></li>
<li><a href="https://event.cwi.nl/grades/2017/index.shtml">Workshop on Graph Data-management Experiences & Systems (colocated with SIGMOD/PODS)</a></li>
<li><a href="http://isyou.info/conf/mist17/">Workshop on Managing Insider Security Threats (In Conjunction with ACM CCS 2017)</a></li>
<li><a href="http://www.mlgworkshop.org/2017/">Workshop on Mining and Learning with Graphs (colocated with KDD)</a></li>
</ul>
<p>This page is also an excellent resource in general for top academic security conferences: <a href="http://faculty.cs.tamu.edu/guofei/sec_conf_stat.htm">Top Academic Security conferences list</a>. The major industry focused security conferences like <a href="https://www.blackhat.com/">Blackhat</a>, <a href="https://www.rsaconference.com/">RSA</a>, <a href="https://www.defcon.org/">Defcon</a>, <a href="http://www.securitybsides.com/w/page/12194156/FrontPage">BSides*</a>, <a href="https://www.derbycon.com/">DerbyCon</a>, and <a href="https://www.shmoocon.org/">ShmooCon</a> all frequently have talks relevant to security data science, but this is not their primary focus, so they are not explicitly called out above.</p>
<h2 id="learning-resources">Learning Resources</h2>
<p>These resources will help you build a baseline of knowledge in Cyber Security and Machine Learning.</p>
<h3 id="books">Books</h3>
<h4 id="security">Security:</h4>
<ul>
<li><a href="http://www.amazon.com/gp/product/0321349962/ref=as_li_tf_tl?ie=UTF8&tag=cyberanaly-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0321349962">Extrusion Detection: Security Monitoring for Internal Intrusions</a> by Richard Bejtlich</li>
<li><a href="https://www.amazon.com/Intelligence-Driven-Incident-Response-Outwitting-Adversary/dp/1491934948/ref=as_li_ss_tl?keywords=Intelligence+Driven+Incident+Response&qid=1557078317&s=gateway&sr=8-1&linkCode=ll1&tag=cyberanaly-20&linkId=6973e7074668673eb200b437790fc76a&language=en_US">Intelligence-Driven Incident Response: Outwitting the Adversary</a> by Scott J. Roberts and Rebekah Brown</li>
<li><a href="https://www.amazon.com/Counter-Hack-Reloaded-Step-Step/dp/0131481045/ref=as_li_ss_tl?crid=2XJH9F6S3WM3I&keywords=counter+hack+reloaded&qid=1557078411&s=gateway&sprefix=Counter+Hack+Reload,aps,155&sr=8-1-fkmrnull&linkCode=ll1&tag=cyberanaly-20&linkId=a99cc70ae09426878c765479651c034b&language=en_US">Counter Hack Reloaded: A Step-by-Step Guide to Computer Attacks and Effective Defenses (2nd Edition)</a> by Edward Skoudis and Tom Liston</li>
</ul>
<h4 id="machine-learning--data-science">Machine Learning / Data Science:</h4>
<ul>
<li><a href="http://www.amazon.com/gp/product/1449357903/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=1449357903&linkCode=as2&tag=cyberanaly-20&linkId=35TDX547RG2KPGAB">Network Security Through Data Analysis: Building Situational Awareness</a> by Michael S Collins</li>
<li><a href="https://www.amazon.com/Malware-Data-Science-Detection-Attribution/dp/1593278594/ref=as_li_ss_tl?crid=27NUF8YTLE98J&keywords=malware+data+science&qid=1557078010&s=gateway&sprefix=malware+daya,aps,155&sr=8-1&linkCode=ll1&tag=cyberanaly-20&linkId=224929fd1a36732beae7c48cdda46135&language=en_US">Malware Data Science: Attack Detection and Attribution</a> by Joshua Saxe and Hillary Sanders</li>
<li><a href="https://www.amazon.com/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939/ref=as_li_ss_tl?keywords=python+machine+learning&qid=1557078084&s=gateway&sr=8-3&linkCode=ll1&tag=cyberanaly-20&linkId=ff7c0b5e77ba0fbb6759cf06a5225fef&language=en_US">Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition</a> by Sebastian Raschka and Vahid Mirjalili</li>
<li><a href="https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438/ref=as_li_ss_tl?_encoding=UTF8&qid=1557078084&sr=8-7&linkCode=ll1&tag=cyberanaly-20&linkId=54592a68a3e717dbc6263596795829fb&language=en_US">Deep Learning with Python</a> by Francois Chollet</li>
</ul>
<h3 id="courses">Courses</h3>
<ul>
<li><a href="https://learning.oreilly.com/home/">O’Reilly Learning Platform</a></li>
<li><a href="https://course.fast.ai/">FastAI: Practical Deep Learning for Coders</a></li>
<li><a href="http://course18.fast.ai/part2.html">FastAI: Cutting Edge Deep Learning for Coders</a></li>
<li><a href="http://course18.fast.ai/ml">FastAI: Introduction to Machine Learning for Coders</a></li>
<li><a href="https://github.com/fastai/numerical-linear-algebra/blob/master/README.md">FastAI: Computational Linear Algebra</a></li>
<li><a href="https://www.coursera.org/specializations/deep-learning">Coursera Deep Learning</a></li>
<li><a href="https://www.coursera.org/learn/machine-learning">Coursera Machine Learning</a></li>
<li><a href="https://www.udacity.com/course/deep-learning-nanodegree--nd101">Udacity Deep Learning</a></li>
</ul>
<hr />
<p>I hope this is helpful, and I would be interested to hear about other resources that you find useful. Please leave a message here, on <a href="https://medium.com/@jason_trost">Medium</a>, or @ me on <a href="https://twitter.com/#!/jason_trost">Twitter</a>!</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p><a href="http://www.covert.io/security-data-science-learning-resources/">Security Data Science Learning Resources</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on May 05, 2019.</p>http://www.covert.io/six-short-links-on-pdns-graph-analytics-for-security2017-08-08T00:00:00-00:002017-08-14T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>A short listing of research papers I’ve read or plan to read that use passive DNS (PDNS) data and graph analytics for identifying malicious domains.</p>
<h2 id="host-domain-graphs">Host-Domain Graphs</h2>
<p>Host-domain graphs are bipartite graphs that map hosts/IPs to the domains they either resolved (passive DNS) or visited (web proxy logs). These graphs are used heavily in operational security machine learning papers on network threat hunting because they provide insight into behavioral patterns across an enterprise or ISP.</p>
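<p>To make the structure concrete, here is a minimal Python sketch of a host-domain graph. It is an illustration only and not the method from any of the papers in this section: the record format, the function names <code>build_host_domain_graph</code> and <code>suspicion_scores</code>, and the crude "fraction of hosts that also contacted a known-bad domain" score are all assumptions made for this example. The papers use more sophisticated inference over the same kind of graph.</p>

```python
from collections import defaultdict


def build_host_domain_graph(records):
    """Build a bipartite graph from (host, domain) pairs.

    Returns two adjacency maps: host -> set of domains and
    domain -> set of hosts. The input format is hypothetical;
    real data would come from passive DNS or web proxy logs.
    """
    host_to_domains = defaultdict(set)
    domain_to_hosts = defaultdict(set)
    for host, domain in records:
        domain = domain.lower().rstrip(".")  # normalize case and trailing dot
        host_to_domains[host].add(domain)
        domain_to_hosts[domain].add(host)
    return host_to_domains, domain_to_hosts


def suspicion_scores(host_to_domains, domain_to_hosts, known_bad):
    """Score each unlabeled domain by the fraction of its hosts that
    also contacted at least one known-bad domain (a simple
    guilt-by-association heuristic, assumed for illustration)."""
    infected_hosts = {
        host for host, domains in host_to_domains.items() if domains & known_bad
    }
    scores = {}
    for domain, hosts in domain_to_hosts.items():
        if domain in known_bad:
            continue  # already labeled, nothing to infer
        scores[domain] = len(hosts & infected_hosts) / len(hosts)
    return scores


# Toy data using reserved example names and private addresses.
records = [
    ("10.0.0.1", "evil-c2.example"),
    ("10.0.0.1", "dropper.example"),
    ("10.0.0.2", "dropper.example"),
    ("10.0.0.2", "news.example"),
    ("10.0.0.3", "news.example"),
]
h2d, d2h = build_host_domain_graph(records)
scores = suspicion_scores(h2d, d2h, known_bad={"evil-c2.example"})
```

<p>In this toy data set, <code>dropper.example</code> is shared with the one host that contacted the known-bad domain, so it scores higher than <code>news.example</code>.</p>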
<p><strong><a href="/research-papers/security/Detecting malicious domains via graph inference.pdf">Detecting Malicious Domains via Graph Inference</a></strong>
P. K. Manadhata, S. Yadav, P. Rao, and W. Horne.
In Proceedings of 19th European Symposium on Research in Computer Security, Wroclaw, Poland, September 7-11, 2014.</p>
<p><strong><a href="/research-papers/security/Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data.pdf">Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data</a></strong>
Alina Oprea, Zhou Li, Ting-Fang Yen, Sang H. Chin, and Sumyah Alrwais
In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015.</p>
<p><strong><a href="/research-papers/security/Segugio - Efficient Behavior-Based Tracking of Malware-Control Domains in Large ISP Networks.pdf">Segugio: Efficient Behavior-Based Tracking of Malware-Control Domains in Large ISP Networks</a></strong>
Babak Rahbarinia and Manos Antonakakis
In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015</p>
<h2 id="domain-resolution-graphs-domain-ip-graphs">Domain Resolution Graphs (Domain-IP Graphs)</h2>
<p>A domain resolution graph is an undirected bipartite graph representing domain-to-IP DNS resolutions observed in passive DNS data.</p>
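<p>As a rough illustration, the sketch below builds a domain resolution graph from (domain, IP) pairs and derives a domain-to-domain similarity from shared IPs. The record format, the function names, and the choice of Jaccard similarity are assumptions made for this example; the papers in this section define their own features and association measures.</p>

```python
from collections import defaultdict
from itertools import combinations


def build_resolution_graph(pdns_records):
    """Build a bipartite domain-IP graph from (domain, ip) pairs.

    Returns domain -> set of IPs and ip -> set of domains.
    The input format is hypothetical.
    """
    domain_to_ips = defaultdict(set)
    ip_to_domains = defaultdict(set)
    for domain, ip in pdns_records:
        domain_to_ips[domain].add(ip)
        ip_to_domains[ip].add(domain)
    return domain_to_ips, ip_to_domains


def shared_ip_similarity(domain_to_ips, ip_to_domains):
    """Jaccard similarity of the IP sets for every pair of domains
    that share at least one IP address."""
    pairs = set()
    for domains in ip_to_domains.values():
        # Only domains co-hosted on some IP can have non-zero similarity.
        pairs.update(combinations(sorted(domains), 2))
    similarity = {}
    for a, b in pairs:
        ips_a, ips_b = domain_to_ips[a], domain_to_ips[b]
        similarity[(a, b)] = len(ips_a & ips_b) / len(ips_a | ips_b)
    return similarity


# Toy data using reserved example names and documentation address ranges.
pdns_records = [
    ("a.example", "192.0.2.1"),
    ("a.example", "192.0.2.2"),
    ("b.example", "192.0.2.2"),
    ("c.example", "198.51.100.7"),
]
d2i, i2d = build_resolution_graph(pdns_records)
similarity = shared_ip_similarity(d2i, i2d)
```

<p>Enumerating pairs per IP avoids comparing every domain with every other domain, although very popular shared-hosting IPs would still need special handling in practice.</p>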
<p><strong><a href="/research-papers/security/Notos - Building a dynamic reputation system for dns.pdf">Notos: Building a Dynamic Reputation System for DNS</a></strong>
M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster.
In the Proceedings of the 19th USENIX Security Symposium, Washington, DC, USA, August 11-13, 2010.</p>
<p><strong><a href="/research-papers/security/Exposure - Finding malicious domains using passive dns analysis.pdf">EXPOSURE: Finding Malicious Domains using Passive DNS Analysis</a></strong>
L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi.
In Proceedings of the Network and Distributed System Security Symposium, San Diego, California, USA, February 2011.</p>
<p><strong><a href="/research-papers/security/Discovering Malicious Domains through Passive DNS Data Graph Analysis.pdf">Discovering Malicious Domains through Passive DNS Data Graph Analysis</a></strong>
Issa Khalil, Ting Yu, and Bei Guan.
In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security (ASIA CCS ‘16), 2016.</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p><em>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</em></p>
<p><a href="http://www.covert.io/six-short-links-on-pdns-graph-analytics-for-security/">6 Short Links on PDNS Graph Analytics for Security</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on August 14, 2017.</p>http://www.covert.io/seven-short-links-on-operational-security-machine-learning2017-08-08T00:00:00-00:002017-08-08T00:00:00-04:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p><strong><a href="http://www.ccs.neu.edu/home/alina/papers/Beehive.pdf">Beehive: Large-Scale Log Analysis for Detecting Suspicious Activity in Enterprise Networks</a></strong>
Ting-Fang Yen, Alina Oprea, Kaan Onarlioglu, Todd Leetham, William Robertson, Ari Juels, and Engin Kirda
In Proceedings of Annual Computer Security Applications Conference (ACSAC), 2013</p>
<p><strong><a href="http://www.ccs.neu.edu/home/alina/papers/InfectionDemographics.pdf">An Epidemiological Study of Malware Encounters in a Large Enterprise</a></strong>
Ting-Fang Yen, Victor Heorhiadi, Alina Oprea, Michael K. Reiter, and Ari Juels
In Proceedings of ACM Conference on Computer and Communications Security (CCS), 2014</p>
<p><strong><a href="http://www.ccs.neu.edu/home/alina/papers/EnterpriseInfection.pdf">Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data</a></strong>
Alina Oprea, Zhou Li, Ting-Fang Yen, Sang H. Chin, and Sumyah Alrwais
In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015</p>
<p><strong><a href="http://roberto.perdisci.com/publications/publication-files/segugio_dsn.pdf">Segugio: Efficient Behavior-Based Tracking of Malware-Control Domains in Large ISP Networks</a></strong>
Babak Rahbarinia and Manos Antonakakis
In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015</p>
<p><strong><a href="https://arxiv.org/pdf/1506.04200.pdf">Malicious Behavior Detection using Windows Audit Logs</a></strong>
Konstantin Berlin, David Slater, Joshua Saxe
In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security (AISec) 2015</p>
<p><strong><a href="http://www.ccs.neu.edu/home/alina/papers/LogAnalytics.pdf">Operational security log analytics for enterprise breach detection</a></strong>
Zhou Li and Alina Oprea
In Proceedings of the First IEEE Cybersecurity Development Conference (SecDev), 2016</p>
<p><strong><a href="http://www.ccs.neu.edu/home/alina/papers/Endpoint.pdf">Lens on the endpoint: Hunting for malicious software through endpoint data analysis.</a></strong>
Ahmet Buyukkayhan, Alina Oprea, Zhou Li, and William Robertson.
In Proceedings of Recent Advances in Intrusion Detection (RAID), 2017</p>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p>PS …</p>
<ul>
<li>Many of these papers were found via <a href="http://www.ccs.neu.edu/home/alina/publications.html">Alina Oprea’s home page</a>.</li>
<li>The “short links” format was inspired by <a href="https://www.oreilly.com/feed/four-short-links">O’Reilly’s Four Short Links</a> series.</li>
</ul>
<p><a href="http://www.covert.io/seven-short-links-on-operational-security-machine-learning/">7 Short Links on Operational Security Machine Learning</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on August 08, 2017.</p>http://www.covert.io/the-definitive-security-datascience-and-machinelearning-guide2016-12-31T00:00:00-00:002017-01-01T00:00:00-05:00Jason Trosthttp://www.covert.iojason.trost@gmail.com<p>This is the Definitive Security Data Science and Machine Learning Guide. It includes books, tutorials, presentations, blog posts, and research papers about solving security problems using data science.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#machine-learning-and-security-papers">Machine Learning and Security Papers</a></li>
<li><a href="#deep-learning-and-security-papers">Deep Learning and Security Papers</a></li>
<li><a href="#deep-learning-and-security-presentations">Deep Learning and Security Presentations</a></li>
<li><a href="#security-data-science-blogs">Security Data Science Blogs</a></li>
<li><a href="#security-data-science-blogposts--tutorials">Security Data Science Blogposts / Tutorials</a></li>
<li><a href="#security-data-science-projects">Security Data Science Projects</a></li>
<li><a href="#security-data">Security Data</a></li>
<li><a href="#security-data-science-books">Security Data Science Books</a></li>
<li><a href="#security-data-science-presentations--talks">Security Data Science Presentations / Talks</a></li>
<li><a href="#misc">Misc</a></li>
</ul>
<h2 id="machine-learning-and-security-papers">Machine Learning and Security Papers</h2>
<h3 id="intrusion-detection-papers">Intrusion Detection Papers</h3>
<ul>
<li><a href="/research-papers/security/A Close Look on n-Grams in Intrusion Detection- Anomaly Detection vs. Classification.pdf">A Close Look on n-Grams in Intrusion Detection - Anomaly Detection vs. Classification</a></li>
<li><a href="/research-papers/security/A Framework for the Application of Association Rule Mining in Large Intrusion Detection Infrastructures.pdf">A Framework for the Application of Association Rule Mining in Large Intrusion Detection Infrastructures</a></li>
<li><a href="/research-papers/security/A Kill Chain Analysis of the 2013 Target Data Breach.pdf">A Kill Chain Analysis of the 2013 Target Data Breach</a></li>
<li><a href="/research-papers/security/A Lone Wolf No More - Supporting Network Intrusion Detection with Real-Time Intelligence.pdf">A Lone Wolf No More - Supporting Network Intrusion Detection with Real-Time Intelligence</a></li>
<li><a href="/research-papers/security/A Machine-learning Approach for Classifying and Categorizing Android Sources and Sinks.pdf">A Machine-learning Approach for Classifying and Categorizing Android Sources and Sinks</a></li>
<li><a href="/research-papers/security/Acquiring Digital Evidence from Botnet Attacks: Procedures and Methods (PhD Thesis).pdf">Acquiring Digital Evidence from Botnet Attacks: Procedures and Methods (PhD Thesis)</a></li>
<li><a href="/research-papers/security/ALERT-ID - Analyze Logs of the network Element in Real Time for Intrusion Detection.pdf">ALERT-ID - Analyze Logs of the network Element in Real Time for Intrusion Detection</a></li>
<li><a href="/research-papers/security/Anagram - A Content Anomaly Detector Resistant to Mimicry Attack.pdf">Anagram - A Content Anomaly Detector Resistant to Mimicry Attack</a></li>
<li><a href="/research-papers/security/Anomaly-based intrusion detection in software as a service.pdf">Anomaly-based Intrusion Detection in Software as a Service</a></li>
<li><a href="/research-papers/security/Application of the PageRank Algorithm to Alarm Graphs.pdf">Application of the PageRank Algorithm to Alarm Graphs</a></li>
<li><a href="/research-papers/security/Back to Basics - Beyond Network Hygiene.pdf">Back to Basics - Beyond Network Hygiene</a></li>
<li><a href="/research-papers/security/Beehive - Large-Scale Log Analysis for Detecting Suspicious Activity in Enterprise Networks.pdf">Beehive - Large-Scale Log Analysis for Detecting Suspicious Activity in Enterprise Networks</a></li>
<li><a href="/research-papers/security/Behavioral clustering of http-based malware and signature generation using malicious network traces.pdf">Behavioral Clustering of HTTP-based Malware and Signature Generation Using Malicious Network Traces</a></li>
<li><a href="/research-papers/security/Beheading Hydras - Performing Effective Botnet Takedowns.pdf">Beheading Hydras - Performing Effective Botnet Takedowns</a></li>
<li><a href="/research-papers/security/Bloodhound - Searching Out Malicious Input in Network Flows for Automatic Repair Validation.pdf">Bloodhound - Searching Out Malicious Input in Network Flows for Automatic Repair Validation</a></li>
<li><a href="/research-papers/security/Boosting the Scalability of Botnet Detection Using Adaptive Traffic Sampling.pdf">Boosting the Scalability of Botnet Detection Using Adaptive Traffic Sampling</a></li>
<li><a href="/research-papers/security/CAMP - Content Agnostic Malware Protection.pdf">CAMP - Content Agnostic Malware Protection</a></li>
<li><a href="/research-papers/security/Casting out demons - Sanitizing training data for anomaly sensors.pdf">Casting out demons - Sanitizing training data for anomaly sensors</a></li>
<li><a href="/research-papers/security/CloudFence - Data Flow Tracking as a Cloud Service.pdf">CloudFence - Data Flow Tracking as a Cloud Service</a></li>
<li><a href="/research-papers/security/Comparing anomaly detection techniques for HTTP.pdf">Comparing anomaly detection techniques for HTTP</a></li>
<li><a href="/research-papers/security/Cujo - Efficient detection and prevention of drive-by-download attacks.pdf">Cujo - Efficient detection and prevention of drive-by-download attacks</a></li>
<li><a href="/research-papers/security/Decoy Document Deployment for Effective Masquerade Attack Detection.pdf">Decoy Document Deployment for Effective Masquerade Attack Detection</a></li>
<li><a href="/research-papers/security/Detecting Spammers with SNARE - Spatio-temporal Network-level Automatic Reputation Engine.pdf">Detecting Spammers with SNARE - Spatio-temporal Network-level Automatic Reputation Engine</a></li>
<li><a href="/research-papers/security/Detecting unknown network attacks using language models.pdf">Detecting Unknown Network Attacks Using Language Models</a></li>
<li><a href="/research-papers/security/Early Detection of Malicious Flux Networks via Large-Scale Passive DNS Traffic Analysis.pdf">Early Detection of Malicious Flux Networks via Large-Scale Passive DNS Traffic Analysis</a></li>
<li><a href="/research-papers/security/Effective Anomaly Detection with Scarce Training Data.pdf">Effective Anomaly Detection with Scarce Training Data</a></li>
<li><a href="/research-papers/security/Efficient Multidimensional Aggregation for Large Scale Monitoring.pdf">Efficient Multidimensional Aggregation for Large Scale Monitoring</a></li>
<li><a href="/research-papers/security/EFFORT - Efficient and Effective Bot Malware Detection.pdf">EFFORT - Efficient and Effective Bot Malware Detection</a></li>
<li><a href="/research-papers/security/ExecScent- Mining for New C and C Domains in Live Networks with Adapive Control Protocol Templates - slides.pdf">ExecScent - Mining for New C and C Domains in Live Networks with Adaptive Control Protocol Templates - slides</a></li>
<li><a href="/research-papers/security/ExecScent- Mining for New C and C Domains in Live Networks with Adapive Control Protocol Templates.pdf">ExecScent - Mining for New C and C Domains in Live Networks with Adaptive Control Protocol Templates</a></li>
<li><a href="/research-papers/security/Exposure - Finding malicious domains using passive dns analysis.pdf">EXPOSURE - Finding Malicious Domains Using Passive DNS Analysis</a></li>
<li><a href="/research-papers/security/FiG - Automatic Fingerprint Generation.pdf">FiG - Automatic Fingerprint Generation</a></li>
<li><a href="/research-papers/security/Filtering Spam with Behavioral Blacklisting.pdf">Filtering Spam with Behavioral Blacklisting</a></li>
<li><a href="/research-papers/security/Finding The Needle- Suppression of False Alarms in Large Intrusion Detection Data Sets.pdf">Finding The Needle - Suppression of False Alarms in Large Intrusion Detection Data Sets</a></li>
<li><a href="/research-papers/security/FLIPS - Hybrid Adaptive Intrusion Prevention.pdf">FLIPS - Hybrid Adaptive Intrusion Prevention</a></li>
<li><a href="/research-papers/security/Heuristics for Improved Enterprise Intrusion Detection (Jim Treinen PhD Thesis).pdf">Heuristics for Improved Enterprise Intrusion Detection</a> by Jim Treinen</li>
<li><a href="/research-papers/security/HMMPayl - An intrusion detection system based on Hidden Markov Models.pdf">HMMPayl - An Intrusion Detection System Based on Hidden Markov Models</a></li>
<li><a href="/research-papers/security/Kopis - Detecting malware domains at the upper dns hierarchy.pdf">Kopis - Detecting malware domains at the upper dns hierarchy</a></li>
<li><a href="/research-papers/security/Large-Scale Malware Analysis, Detection, and Signature Generation.pdf">Large-Scale Malware Analysis, Detection, and Signature Generation</a></li>
<li><a href="/research-papers/security/Leveraging Honest Users - Stealth Command-and-Control of Botnets - slides.pdf">Leveraging Honest Users - Stealth Command-and-Control of Botnets - slides</a></li>
<li><a href="/research-papers/security/Leveraging Honest Users - Stealth Command-and-Control of Botnets.pdf">Leveraging Honest Users - Stealth Command-and-Control of Botnets</a></li>
<li><a href="/research-papers/security/Local System Security via SSHD Instrumentation .pdf">Local System Security via SSHD Instrumentation</a></li>
<li><a href="/research-papers/security/Machine learning in adversarial environments.pdf">Machine Learning In Adversarial Environments</a></li>
<li><a href="/research-papers/security/Malware vs Big Data (Ubrella Labs).pdf">Malware vs. Big Data (Umbrella Labs)</a></li>
<li><a href="/research-papers/security/McPAD - A multiple classifier system for accurate payload-based anomaly detection.pdf">McPAD - A Multiple Classifier System for Accurate Payload-based Anomaly Detection</a></li>
<li><a href="/research-papers/security/Measuring and Detecting Malware Downloads in Live Network Traffic.pdf">Measuring and Detecting Malware Downloads in Live Network Traffic</a></li>
<li><a href="/research-papers/security/Mining Botnet Sink holes - slides.pdf">Mining Botnet Sink Holes - slides</a></li>
<li><a href="/research-papers/security/MISHIMA - Multilateration of Internet hosts hidden using malicious fast-flux agents.pdf">MISHIMA - Multilateration of Internet hosts hidden using malicious fast-flux agents</a></li>
<li><a href="/research-papers/security/Monitoring the Initial DNS Behavior of Malicious Domains.pdf">Monitoring the Initial DNS Behavior of Malicious Domains</a></li>
<li><a href="/research-papers/security/N-Gram against the Machine - On the Feasibility of the N-Gram Network Analysis for Binary Protocols.pdf">N-Gram against the Machine - On the Feasibility of the N-Gram Network Analysis for Binary Protocols</a></li>
<li><a href="/research-papers/security/Nazca - Detecting Malware Distribution in Large-Scale Networks.pdf">Nazca - Detecting Malware Distribution in Large-Scale Networks</a></li>
<li><a href="/research-papers/security/Netgator - Malware Detection Using Program Interactive Challenges - slides.pdf">Netgator - Malware Detection Using Program Interactive Challenges - slides</a></li>
<li><a href="/research-papers/security/Network Traffic Characterization Using (p, n)-grams Packet Representation.pdf">Network Traffic Characterization Using (p, n)-grams Packet Representation</a></li>
<li><a href="/research-papers/security/Notos - Building a dynamic reputation system for dns.pdf">Notos - Building a Dynamic Reputation System for DNS</a></li>
<li><a href="/research-papers/security/On the Feasibility of Online Malware Detection with Performance Counters.pdf">On the Feasibility of Online Malware Detection with Performance Counters</a></li>
<li><a href="/research-papers/security/On the infeasibility of modeling polymorphic shellcode.pdf">On the Infeasibility of Modeling Polymorphic Shellcode</a></li>
<li><a href="/research-papers/security/On the Mismanagement and Maliciousness of Networks.pdf">On the Mismanagement and Maliciousness of Networks</a></li>
<li><a href="/research-papers/security/Outside the Closed World - On Using Machine Learning For Network Intrusion Detection.pdf">Outside the Closed World - On Using Machine Learning For Network Intrusion Detection</a></li>
<li><a href="/research-papers/security/PAYL - Anomalous Payload-based Network Intrusion Detection.pdf">PAYL - Anomalous Payload-based Network Intrusion Detection</a></li>
<li><a href="/research-papers/security/PAYL2 - Anomalous Payload-based Worm Detection and Signature Generation.pdf">PAYL2 - Anomalous Payload-based Worm Detection and Signature Generation</a></li>
<li><a href="/research-papers/security/From throw-away traffic to bots - detecting the rise of dga-based malware.pdf">Pleiades - From Throw-away Traffic To Bots - Detecting The Rise Of DGA-based Malware</a></li>
<li><a href="/research-papers/security/Polonium - Tera-Scale Graph Mining for Malware Detection.pdf">Polonium - Tera-Scale Graph Mining for Malware Detection</a></li>
<li><a href="/research-papers/security/Practical Comprehensive Bounds on Surreptitious Communication Over DNS - slides.pdf">Practical Comprehensive Bounds on Surreptitious Communication Over DNS - slides</a></li>
<li><a href="/research-papers/security/Practical Comprehensive Bounds on Surreptitious Communication Over DNS.pdf">Practical Comprehensive Bounds on Surreptitious Communication Over DNS</a></li>
<li><a href="/research-papers/security/Privacy-preserving payload-based correlation for accurate malicious traffic detection.pdf">Privacy-preserving Payload-based Correlation for Accurate Malicious Traffic Detection</a></li>
<li><a href="/research-papers/security/Revealing Botnet Membership Using DNSBL Counter-Intelligence.pdf">Revealing Botnet Membership Using DNSBL Counter-Intelligence</a></li>
<li><a href="/research-papers/security/Revolver - An Automated Approach to the Detection of Evasive Web-based Malware.pdf">Revolver - An Automated Approach to the Detection of Evasive Web-based Malware</a></li>
<li><a href="/research-papers/security/Self-organized Collaboration of Distributed IDS Sensors.pdf">Self-organized Collaboration of Distributed IDS Sensors</a></li>
<li><a href="/research-papers/security/SinkMiner- Mining Botnet Sinkholes for Fun and Profit.pdf">SinkMiner - Mining Botnet Sinkholes for Fun and Profit</a></li>
<li><a href="/research-papers/security/Spamming botnets - signatures and characteristics.pdf">Spamming Botnets - Signatures and Characteristics</a></li>
<li><a href="/research-papers/security/Spectrogram - A mixture-of-markov-chains model for anomaly detection in web traffic.pdf">Spectrogram - A Mixture of Markov Chain models for Anomaly Detection in Web Traffic</a></li>
<li><a href="/research-papers/security/The security of machine learning.pdf">The Security of Machine Learning</a></li>
<li><a href="/research-papers/security/Toward Stealthy Malware Detection.pdf">Toward Stealthy Malware Detection</a></li>
<li><a href="/research-papers/security/Traffic aggregation for malware detection.pdf">Traffic Aggregation for Malware Detection</a></li>
<li><a href="/research-papers/security/Understanding the Domain Registration Behavior of Spammers.pdf">Understanding the Domain Registration Behavior of Spammers</a></li>
<li><a href="/research-papers/security/Understanding the Network-Level Behavior of Spammers.pdf">Understanding the Network-Level Behavior of Spammers</a></li>
<li><a href="/research-papers/security/VAST- Network Visibility Across Space and Time.pdf">VAST - Network Visibility Across Space and Time</a></li>
</ul>
<h3 id="malware-papers">Malware Papers</h3>
<ul>
<li><a href="/research-papers/security/A static, packer-agnostic filter to detect similar malware samples.pdf">A static, packer-agnostic filter to detect similar malware samples</a></li>
<li><a href="/research-papers/security/A study of malcode-bearing documents.pdf">A study of malcode-bearing documents</a></li>
<li><a href="/research-papers/security/A survey on automated dynamic malware-analysis techniques and tools.pdf">A survey on automated dynamic malware-analysis techniques and tools</a></li>
<li><a href="/research-papers/security/APT1 Technical backstage (malware.lu hack backs of APT1 servers).pdf">APT1 Technical backstage (malware.lu hack backs of APT1 servers)</a></li>
<li><a href="/research-papers/security/Automatic Analysis of Malware Behavior using Machine Learning.pdf">Automatic Analysis of Malware Behavior using Machine Learning</a></li>
<li><a href="/research-papers/security/BitShred - Fast, Scalable Code Reuse Detection in Binary Code.pdf">BitShred - Fast, Scalable Code Reuse Detection in Binary Code</a></li>
<li><a href="/research-papers/security/BitShred - Fast, Scalable Malware Triage.pdf">BitShred - Fast, Scalable Malware Triage</a></li>
<li><a href="/research-papers/security/Deobfuscating Embedded Malware using Probable-Plaintext Attacks.pdf">Deobfuscating Embedded Malware using Probable-Plaintext Attacks</a></li>
<li><a href="/research-papers/security/Escape from Monkey Island - Evading High-Interaction Honeyclients.pdf">Escape from Monkey Island - Evading High-Interaction Honeyclients</a></li>
<li><a href="/research-papers/security/Eureka - A framework for enabling static malware analysis.pdf">Eureka - A framework for enabling static malware analysis</a></li>
<li><a href="/research-papers/security/Extraction of Statistically Significant Malware Behaviors.pdf">Extraction of Statistically Significant Malware Behaviors</a></li>
<li><a href="/research-papers/security/Fast Automated Unpacking and Classification of Malware.pdf">Fast Automated Unpacking and Classification of Malware</a></li>
<li><a href="/research-papers/security/FIRMA - Malware Clustering and Network Signature Generation with Mixed Network Behaviors.pdf">FIRMA - Malware Clustering and Network Signature Generation with Mixed Network Behaviors</a></li>
<li><a href="/research-papers/security/FuncTracker - Discovering Shared Code (to aid malware forensics) - slides.pdf">FuncTracker - Discovering Shared Code (to aid malware forensics) - slides</a></li>
<li><a href="/research-papers/security/FuncTracker - Discovering Shared Code to Aid Malware Forensics Extended Abstract.pdf">FuncTracker - Discovering Shared Code to Aid Malware Forensics Extended Abstract</a></li>
<li><a href="/research-papers/security/Malware files clustering based on file geometry and visualization using R language.pdf">Malware files clustering based on file geometry and visualization using R language</a></li>
<li><a href="/research-papers/security/Mobile Malware Detection Based on Energy Fingerprints — A Dead End.pdf">Mobile Malware Detection Based on Energy Fingerprints — A Dead End</a></li>
<li><a href="/research-papers/security/Polonium - Tera-Scale Graph Mining for Malware Detection.pdf">Polonium - Tera-Scale Graph Mining for Malware Detection</a></li>
<li><a href="/research-papers/security/Putting out a HIT - Crowdsourcing Malware Installs.pdf">Putting out a HIT - Crowdsourcing Malware Installs</a></li>
<li><a href="/research-papers/security/Scalable fine-grained behavioral clustering of http-based malware.pdf">Scalable Fine-grained Behavioral Clustering of HTTP-based Malware</a></li>
<li><a href="/research-papers/security/Selecting Features to Classify Malware.pdf">Selecting Features to Classify Malware</a> by Karthik Raman</li>
<li><a href="/research-papers/security/SigMal - A Static Signal Processing Based Malware Triage.pdf">SigMal - A Static Signal Processing Based Malware Triage</a></li>
<li><a href="/research-papers/security/Tracking Memory Writes for Malware Classification and Code Reuse Identification.pdf">Tracking Memory Writes for Malware Classification and Code Reuse Identification</a></li>
<li><a href="/research-papers/security/Using File Relationships in Malware Classification.pdf">Using File Relationships in Malware Classification</a></li>
<li><a href="/research-papers/security/VAMO - Towards a Fully Automated Malware Clustering Validity Analysis.pdf">VAMO - Towards a Fully Automated Malware Clustering Validity Analysis</a></li>
</ul>
<h3 id="data-collection-papers">Data Collection Papers</h3>
<ul>
<li><a href="/research-papers/security/Crawling BitTorrent DHTs for Fun and Profit.pdf">Crawling BitTorrent DHTs for Fun and Profit</a></li>
<li><a href="/research-papers/security/CyberProbe - Towards Internet-Scale Active Detection of Malicious Servers.pdf">CyberProbe - Towards Internet-Scale Active Detection of Malicious Servers</a></li>
<li><a href="/research-papers/security/Demystifying service discovery - Implementing an internet-wide scanner.pdf">Demystifying service discovery - Implementing an internet-wide scanner</a></li>
<li><a href="/research-papers/security/gitDigger - Creating useful wordlists from GitHub.pdf">gitDigger - Creating useful wordlists from GitHub</a></li>
<li><a href="/research-papers/security/PoisonAmplifier - A Guided Approach of Discovering Compromised Websites through Reversing Search Poisoning Attacks.pdf">PoisonAmplifier - A Guided Approach of Discovering Compromised Websites through Reversing Search Poisoning Attacks</a></li>
<li><a href="/research-papers/security/ZMap - Fast Internet-Wide Scanning and its Security Applications (slides).pdf">ZMap - Fast Internet-Wide Scanning and its Security Applications (slides)</a></li>
<li><a href="/research-papers/security/ZMap - Fast Internet-Wide Scanning and its Security Applications.pdf">ZMap - Fast Internet-Wide Scanning and its Security Applications</a></li>
</ul>
<h3 id="vulnerability-analysisreversing-papers">Vulnerability Analysis/Reversing Papers</h3>
<ul>
<li><a href="/research-papers/security/A Preliminary Analysis of Vulnerability Scores for Attacks in Wild.pdf">A Preliminary Analysis of Vulnerability Scores for Attacks in Wild</a></li>
<li><a href="/research-papers/security/Attacker Economics for Internet-scale Vulnerability Risk Assessment.pdf">Attacker Economics for Internet-scale Vulnerability Risk Assessment</a></li>
<li><a href="/research-papers/security/Detecting Logic Vulnerabilities in E-Commerce Applications.pdf">Detecting Logic Vulnerabilities in E-Commerce Applications</a></li>
<li><a href="/research-papers/security/ReDeBug - finding unpatched code clones in entire os distributions.pdf">ReDeBug - Finding Unpatched Code Clones in Entire OS Distributions</a></li>
<li><a href="/research-papers/security/The Classification of Valuable Data in an Assumption of Breach Paradigm.pdf">The Classification of Valuable Data in an Assumption of Breach Paradigm</a></li>
<li><a href="/research-papers/security/Toward Black-Box Detection of Logic Flaws in Web Applications.pdf">Toward Black-Box Detection of Logic Flaws in Web Applications</a></li>
<li><a href="/research-papers/security/Vulnerability Extrapolation - Assisted Discovery of Vulnerabilities using Machine Learning - slides.pdf">Vulnerability Extrapolation - Assisted Discovery of Vulnerabilities using Machine Learning - slides</a></li>
<li><a href="/research-papers/security/Vulnerability Extrapolation - Assisted Discovery of Vulnerabilities using Machine Learning.pdf">Vulnerability Extrapolation - Assisted Discovery of Vulnerabilities using Machine Learning</a></li>
</ul>
<h3 id="anonymityprivacyopseccensorship-papers">Anonymity/Privacy/OPSEC/Censorship Papers</h3>
<ul>
<li><a href="/research-papers/security/Anonymous Hacking Group -- OpNewblood-Super-Secret-Security-Handbook.pdf">Anonymous Hacking Group – #OpNewblood Super Secret Security Handbook</a></li>
<li><a href="/research-papers/security/Detecting Traffic Snooping in Tor Using Decoys.pdf">Detecting Traffic Snooping in Tor Using Decoys</a></li>
<li><a href="/research-papers/security/Risks and Realization of HTTPS Traffic Analysis.pdf">Risks and Realization of HTTPS Traffic Analysis</a></li>
<li><a href="/research-papers/security/Selling Off Privacy at Auction.pdf">Selling Off Privacy at Auction</a></li>
<li><a href="/research-papers/security/The Sniper Attack - Anonymously Deanonymizing and Disabling the Tor Network.pdf">The Sniper Attack - Anonymously Deanonymizing and Disabling the Tor Network</a></li>
<li><a href="/research-papers/security/The Velocity of Censorship - High-Fidelity Detection of Microblog Post Deletions - slides.pdf">The Velocity of Censorship - High-Fidelity Detection of Microblog Post Deletions - slides</a></li>
<li><a href="/research-papers/security/The Velocity of Censorship - High-Fidelity Detection of Microblog Post Deletions.pdf">The Velocity of Censorship - High-Fidelity Detection of Microblog Post Deletions</a></li>
<li><a href="/research-papers/security/Tor vs NSA.pdf">Tor vs. NSA</a></li>
</ul>
<h3 id="data-mining-papers">Data Mining Papers</h3>
<ul>
<li><a href="/research-papers/security/An Exploration of Geolocation and Traffic Visualisation Using Network Flows to Aid in Cyber Defence.pdf">An Exploration of Geolocation and Traffic Visualization Using Network Flows to Aid in Cyber Defense</a></li>
<li><a href="/research-papers/security/DSpin - Detecting Automatically Spun Content on the Web.pdf">DSpin - Detecting Automatically Spun Content on the Web</a></li>
<li><a href="/research-papers/security/Gyrus - A Framework for User-Intent Monitoring of Text-Based Networked Applications.pdf">Gyrus - A Framework for User-Intent Monitoring of Text-Based Networked Applications</a></li>
<li><a href="/research-papers/security/Indexing Million of Packets per Second using GPUs.pdf">Indexing Million of Packets per Second using GPUs</a></li>
<li><a href="/research-papers/security/Multi-Label Learning with Millions of Labels - Recommending Advertiser Bid Phrases for Web Pages.pdf">Multi-Label Learning with Millions of Labels - Recommending Advertiser Bid Phrases for Web Pages</a></li>
<li><a href="/research-papers/security/Real-Time Handling of Network Monitoring Data Using a Data-Intensive Framework.pdf">Real-Time Handling of Network Monitoring Data Using a Data-Intensive Framework</a></li>
<li><a href="/research-papers/security/Shingled Graph Disassembly - Finding the Undecideable Path.pdf">Shingled Graph Disassembly - Finding the Undecideable Path</a></li>
<li><a href="/research-papers/security/Synoptic Graphlet - Bridging the Gap between Supervised and Unsupervised Profiling of Host-level Network Traffic.pdf">Synoptic Graphlet - Bridging the Gap between Supervised and Unsupervised Profiling of Host-level Network Traffic</a></li>
</ul>
<h3 id="cyber-crime-papers">Cyber Crime Papers</h3>
<ul>
<li><a href="/research-papers/security/Connected Colors - Unveiling the Structure of Criminal Networks.pdf">Connected Colors - Unveiling the Structure of Criminal Networks</a></li>
<li><a href="/research-papers/security/Image Matching for Branding Phishing Kit Images - slides.pdf">Image Matching for Branding Phishing Kit Images - slides</a></li>
<li><a href="/research-papers/security/Image Matching for Branding Phishing Kit Images.pdf">Image Matching for Branding Phishing Kit Images</a></li>
<li><a href="/research-papers/security/Inside-a-Targeted-Point-of-Sale-Data-Breach.pdf">Inside a Targeted Point-of-Sale Data Breach</a></li>
<li><a href="/research-papers/security/Investigating Advanced Persistent Threat 1 (APT1).pdf">Investigating Advanced Persistent Threat 1 (APT1)</a></li>
<li><a href="/research-papers/security/Measuring pay-per-install - the commoditization of malware distribution.pdf">Measuring pay-per-install - the Commoditization of Malware Distribution</a></li>
<li><a href="/research-papers/security/Scambaiter - Understanding Targeted Nigerian Scams on Craigslist.pdf">Scambaiter - Understanding Targeted Nigerian Scams on Craigslist</a></li>
<li><a href="/research-papers/security/Sherlock Holmes and The Case of the Advanced Persistent Threat.pdf">Sherlock Holmes and the Case of the Advanced Persistent Threat</a></li>
<li><a href="/research-papers/security/The Role of the Underground Market in Twitter Spam and Abuse.pdf">The Role of the Underground Market in Twitter Spam and Abuse</a></li>
<li><a href="/research-papers/security/The Tangled Web of Password Reuse.pdf">The Tangled Web of Password Reuse</a></li>
<li><a href="/research-papers/security/Trafficking Fraudulent Accounts - The Role of the Underground Market in Twitter Spam and Abuse.pdf">Trafficking Fraudulent Accounts - The Role of the Underground Market in Twitter Spam and Abuse</a></li>
</ul>
<h3 id="cndcnacnecno-papers">CND/CNA/CNE/CNO Papers</h3>
<ul>
<li><a href="/research-papers/security/Amplification Hell - Revisiting Network Protocols for DDoS Abuse.pdf">Amplification Hell - Revisiting Network Protocols for DDoS Abuse</a></li>
<li><a href="/research-papers/security/HITB2013AMS - Defending The Enterprise, the Russian Way.pdf">Defending The Enterprise, the Russian Way</a></li>
<li><a href="/research-papers/security/Protecting a moving target - Addressing web application concept drift.pdf">Protecting a Moving Target - Addressing Web Application Concept Drift</a></li>
<li><a href="/research-papers/security/Timing of Cyber Conflict.pdf">Timing of Cyber Conflict</a></li>
</ul>
<h2 id="deep-learning-and-security-papers">Deep Learning and Security Papers</h2>
<ul>
<li><a href="/research-papers/deep-learning-security/A Deep Learning Approach for Network Intrusion Detection System.pdf">A Deep Learning Approach for Network Intrusion Detection System</a></li>
<li><a href="/research-papers/deep-learning-security/A Hybrid Malicious Code Detection Method based on Deep Learning.pdf">A Hybrid Malicious Code Detection Method based on Deep Learning</a></li>
<li><a href="/research-papers/deep-learning-security/A Hybrid Spectral Clustering and Deep Neural Network Ensemble Algorithm for Intrusion Detection in Sensor Networks.pdf">A Hybrid Spectral Clustering and Deep Neural Network Ensemble Algorithm for Intrusion Detection in Sensor Networks</a></li>
<li><a href="/research-papers/deep-learning-security/A Multi-task Learning Model for Malware Classification with Useful File Access Pattern from API Call Sequence.pdf">A Multi-task Learning Model for Malware Classification with Useful File Access Pattern from API Call Sequence</a></li>
<li><a href="/research-papers/deep-learning-security/A novel LSTM-RNN decoding algorithm in CAPTCHA recognition.pdf">A Novel LSTM-RNN Decoding Algorithm in CAPTCHA Recognition</a> (Short Paper)</li>
<li><a href="/research-papers/deep-learning-security/An Analysis of Recurrent Neural Networks for Botnet Detection Behavior.pdf">An Analysis of Recurrent Neural Networks for Botnet Detection Behavior</a></li>
<li><a href="/research-papers/deep-learning-security/Application of Recurrent Neural Networks for User Verification based on Keystroke Dynamics.pdf">Application of Recurrent Neural Networks for User Verification based on Keystroke Dynamics</a></li>
<li><a href="/research-papers/deep-learning-security/Applications of Deep Learning On Traffic Identification.pdf">Applications of Deep Learning On Traffic Identification</a> (video: <a href="https://www.youtube.com/watch?v=yZ-Y1WCM0lc">here</a>)</li>
<li><a href="/research-papers/deep-learning-security/Combining Restricted Boltzmann Machine and One Side Perceptron for Malware Detection.pdf">Combining Restricted Boltzmann Machine and One Side Perceptron for Malware Detection</a></li>
<li><a href="/research-papers/deep-learning-security/Comparison Deep Learning Method to Traditional Methods Using for Network Intrusion Detection.pdf">Comparison Deep Learning Method to Traditional Methods Using for Network Intrusion Detection</a> (Short Paper)</li>
<li><a href="/research-papers/deep-learning-security/Convolutional Neural Networks for Malware Classification.pdf">Convolutional Neural Networks for Malware Classification</a> (THESIS)</li>
<li><a href="/research-papers/deep-learning-security/Deep learning approach for Network Intrusion Detection in Software Defined Networking.pdf">Deep Learning Approach for Network Intrusion Detection in Software Defined Networking</a></li>
<li><a href="/research-papers/deep-learning-security/Deep Learning for Classification of Malware System Call Sequences.pdf">Deep Learning for Classification of Malware System Call Sequences</a></li>
<li><a href="/research-papers/deep-learning-security/Poster - Deep Learning for Zero-day Flash Malware Detection.pdf">Deep Learning for Zero-day Flash Malware Detection</a> (Short Paper)</li>
<li><a href="/research-papers/deep-learning-security/Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover sourcemismatch.pdf">Deep Learning is a Good Steganalysis Tool When Embedding Key is Reused for Different Images, even if there is a cover source mismatch</a></li>
<li><a href="/research-papers/deep-learning-security/Deep Learning-based Feature Selection for Intrusion Detection System in Transport Layer.pdf">Deep Learning-based Feature Selection for Intrusion Detection System in Transport Layer</a> (Short Paper)</li>
<li><a href="/research-papers/deep-learning-security/Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features.pdf">Deep Neural Network Based Malware Detection using Two Dimensional Binary Program Features</a></li>
<li><a href="/research-papers/deep-learning-security/DeepDGA- Adversarially-Tuned Domain Generation and Detection.pdf">DeepDGA: Adversarially-Tuned Domain Generation and Detection</a></li>
<li><a href="/research-papers/deep-learning-security/DeepSign- Deep Learning for Automatic Malware Signature Generation and Classification.pdf">DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification</a></li>
<li><a href="/research-papers/deep-learning-security/DL4MD- A Deep Learning Framework for Intelligent Malware Detection.pdf">DL4MD: A Deep Learning Framework for Intelligent Malware Detection</a></li>
<li><a href="/research-papers/deep-learning-security/DroidSec - Deep Learning in Android Malware Detection.pdf">Droid-Sec: Deep Learning in Android Malware Detection</a></li>
<li><a href="/research-papers/deep-learning-security/Droiddetector - android malware char- acterization and detection using deep learning.pdf">DroidDetector: Android Malware Characterization and Detection using Deep Learning</a></li>
<li><a href="/research-papers/deep-learning-security/HADM- Hybrid Analysis for Detection of Malware.pdf">HADM: Hybrid Analysis for Detection of Malware</a></li>
<li><a href="/research-papers/deep-learning-security/Identifying Top Sellers In Underground Economy Using Deep Learning-based Sentiment Analysis.pdf">Identifying Top Sellers In Underground Economy Using Deep Learning-based Sentiment Analysis</a></li>
<li><a href="/research-papers/deep-learning-security/Intrusion Detection System Using Deep Neural Network for In-Vehicle Network Security.pdf">Intrusion Detection System Using Deep Neural Network for In-Vehicle Network Security</a></li>
<li><a href="/research-papers/deep-learning-security/Large-scale Malware Classification using Random Projections and Neural Networks.pdf">Large-scale Malware Classification using Random Projections and Neural Networks</a></li>
<li><a href="/research-papers/deep-learning-security/Learning a Static Analyzer - A Case Study on a Toy Language.pdf">Learning a Static Analyzer: A Case Study on a Toy Language</a></li>
<li><a href="/research-papers/deep-learning-security/Learning Spam Features using Restricted Boltzmann Machines.pdf">Learning Spam Features using Restricted Boltzmann Machines</a></li>
<li><a href="/research-papers/deep-learning-security/Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection.pdf">Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection</a></li>
<li><a href="/research-papers/deep-learning-security/LSTM-based System-call Language Modeling and Robust Ensemble Method for Designing Host-based Intrusion Detection Systems.pdf">LSTM-based System-call Language Modeling and Robust Ensemble Method for Designing Host-based Intrusion Detection Systems</a></li>
<li><a href="/research-papers/deep-learning-security/Malware Classification on Time Series Data Through Machine Learning.pdf">Malware Classification on Time Series Data Through Machine Learning</a> (THESIS)</li>
<li><a href="/research-papers/deep-learning-security/Malware Classification with Recurrent Networks.pdf">Malware Classification with Recurrent Networks</a></li>
<li><a href="/research-papers/deep-learning-security/Malware Detection with Deep Neural Network using Process Behavior.pdf">Malware Detection with Deep Neural Network using Process Behavior</a></li>
<li><a href="/research-papers/deep-learning-security/MS-LSTM - a Multi-Scale LSTM Model for BGP Anomaly Detection.pdf">MS-LSTM: a Multi-Scale LSTM Model for BGP Anomaly Detection</a></li>
<li><a href="/research-papers/deep-learning-security/MtNet - A Multi-Task Neural Network for Dynamic Malware Classification.pdf">MtNet: A Multi-Task Neural Network for Dynamic Malware Classification</a></li>
<li><a href="/research-papers/deep-learning-security/Network anomaly detection with the restricted Boltzmann machine.pdf">Network Anomaly Detection with the Restricted Boltzmann Machine</a></li>
<li><a href="/research-papers/deep-learning-security/Predicting Domain Generation Algorithms with Long Short-Term Memory Networks.pdf">Predicting Domain Generation Algorithms with Long Short-Term Memory Networks</a></li>
<li><a href="/research-papers/deep-learning-security/Recognizing Functions in Binaries with Neural Networks.pdf">Recognizing Functions in Binaries with Neural Networks</a></li>
<li><a href="/research-papers/deep-learning-security/The limitations of deep learning in adversarial settings.pdf">The Limitations of Deep Learning in Adversarial Settings</a></li>
<li><a href="/research-papers/deep-learning-security/Toward large-scale vulnerability discovery using Machine Learning.pdf">Toward large-scale vulnerability discovery using Machine Learning</a></li>
</ul>
<h2 id="deep-learning-and-security-presentations">Deep Learning and Security Presentations</h2>
<ul>
<li><a href="/research-papers/deep-learning-security/Presentations/A Deep Learning Approach for Network Intrusion Detection System.pdf">A Deep Learning Approach for Network Intrusion Detection System</a></li>
<li><a href="/research-papers/deep-learning-security/Presentations/Deep Learning on Disassembly Data.pdf">Deep Learning on Disassembly Data</a> (video: <a href="https://www.youtube.com/watch?v=LQh8dktQReI">here</a>)</li>
</ul>
<h2 id="security-data-science-blogs">Security Data Science Blogs</h2>
<p>Blogs that frequently cover security data science and machine learning. They are worth adding to your RSS reader.</p>
<ul>
<li><a href="http://www.covert.io">covert.io</a></li>
<li><a href="http://datadrivensecurity.info/blog/">Data Driven Security Blog</a></li>
<li><a href="http://www.mlsecproject.org/#blog">mlsecproject</a></li>
<li><a href="http://www.automatingosint.com/blog/">Automating OSINT</a></li>
<li><a href="https://bigsnarf.wordpress.com/">BigSnarf Blog</a></li>
</ul>
<h2 id="security-data-science-blogposts--tutorials">Security Data Science Blogposts / Tutorials</h2>
<ul>
<li><a href="http://blog.sqrrl.com/an-introduction-to-machine-learning-for-cybersecurity-and-threat-hunting">An Introduction to Machine Learning for Cybersecurity and Threat Hunting</a> (<a href="https://github.com/DavidJBianco/Clearcut">code</a>)</li>
<li><a href="http://clicksecurity.github.io/data_hacking/">Click Security’s Data Hacking</a> (<a href="https://github.com/ClickSecurity/data_hacking">code</a>)</li>
<li><a href="https://blog.opendns.com/2016/09/06/dominos-botnets-little-lstm/">Dominos, Botnets, and a little LSTM</a> (<a href="https://gist.github.com/DavidRdgz/8601bfad4ad512ff9d6d46d1ba0fbcca">code</a>)</li>
<li><a href="http://fsecurify.com/machine-learning-based-password-strength-checking/">Machine Learning based Password Strength Classification</a> (<a href="https://github.com/faizann24/Machine-Learning-based-Password-Strength-Classification">code</a>)</li>
<li><a href="https://deepmlblog.wordpress.com/2016/01/12/recurrent-neural-networks-for-decoding-captchas/">Recurrent neural networks for decoding CAPTCHAS</a></li>
<li><a href="https://deepmlblog.wordpress.com/2016/01/15/sequence-to-sequence-learning-to-decode-variable-length-captchas/">Sequence to sequence learning to decode variable length captchas</a></li>
<li><a href="https://deepmlblog.wordpress.com/2016/01/03/how-to-break-a-captcha-system/">Using deep learning to break a Captcha system</a> (<a href="https://github.com/arunpatala/captcha">code</a>)</li>
<li><a href="http://fsecurify.com/using-machine-learning-detect-malicious-urls/">Using Machine Learning to Detect Malicious URLs</a> (<a href="https://github.com/faizann24/Using-machine-learning-to-detect-malicious-URLs">code</a>)</li>
<li><a href="http://fsecurify.com/using-neural-networks-to-generate-human-readable-passwords/">Using Neural Networks to generate human readable passwords</a></li>
<li><a href="https://edhenry.github.io/2016/12/21/Netflow-flow2vec/">Netflow Flow2vec</a></li>
</ul>
<h2 id="security-data-science-projects">Security Data Science Projects</h2>
<p>Open source projects and code applying data science/machine learning to security problems.</p>
<ul>
<li><a href="https://github.com/DavidJBianco/Clearcut">Clearcut</a> - a tool that uses machine learning to help you focus on the log entries that really need manual review</li>
<li><a href="https://github.com/ClickSecurity/data_hacking">Click Security’s Data Hacking Project</a></li>
<li><a href="https://github.com/mlsecproject/combine">Combine</a> - Tool to gather Threat Intelligence indicators from publicly available sources</li>
<li><a href="https://github.com/endgameinc/dga_predict">dga_predict</a> - Predicting Domain Generation Algorithms using LSTMs.</li>
<li><a href="http://www.mlsec.org/">mlsec.org</a> - Various Machine Learning and Computer Security Research projects from mlsec.org.</li>
<li><a href="https://github.com/mlsecproject/tiq-test">tiq-test</a> - Threat Intelligence Quotient Test - Dataviz and Statistical Analysis of TI feeds.</li>
<li><a href="https://github.com/honeynet/cuckooml">CuckooML</a> - Machine Learning for Cuckoo Sandbox (<a href="https://honeynet.github.io/cuckooml/">https://honeynet.github.io/cuckooml/</a>)</li>
</ul>
<h2 id="security-data">Security Data</h2>
<p>Collection of Security and Network Data Resources.</p>
<ul>
<li>See <a href="/data-links/">Covert.io Data Page</a></li>
<li>See <a href="/threat-intelligence/">Covert.io Threat Intelligence Page</a></li>
<li>See also <a href="http://www.secrepo.com/">secrepo.com</a>, which is more comprehensive and should be checked as well.</li>
</ul>
<h2 id="security-data-science-books">Security Data Science Books</h2>
<ul>
<li><a href="https://www.amazon.com/Machine-Learning-Approach-Phishing-Detection-Defense/dp/0128029277/?tag=cyberanaly-20">A Machine-Learning Approach to Phishing Detection and Defense</a></li>
<li><a href="https://www.amazon.com/Applied-Security-Visualization-Raffael-Marty/dp/0321510100/?tag=cyberanaly-20">Applied Security Visualization</a></li>
<li><a href="https://www.amazon.com/Data-Mining-Machine-Learning-Cybersecurity/dp/1439839425?tag=cyberanaly-20">Data Mining and Machine Learning in Cybersecurity</a></li>
<li><a href="https://www.amazon.com/Data-Driven-Security-Analysis-Visualization-Dashboards/dp/1118793722/?tag=cyberanaly-20">Data-Driven Security: Analysis, Visualization and Dashboards</a></li>
<li><a href="https://www.amazon.com/Information-Security-Analytics-Insights-Anomalies/dp/0128002077/?tag=cyberanaly-20">Information Security Analytics: Finding Security Insights, Patterns, and Anomalies in Big Data</a></li>
<li><a href="https://www.amazon.com/Machine-Learning-Mining-Computer-Security/dp/184628029X?tag=cyberanaly-20">Machine Learning and Data Mining for Computer Security</a></li>
<li><a href="https://www.amazon.com/Network-Anomaly-Detection-Learning-Perspective/dp/1466582081?tag=cyberanaly-20">Network Anomaly Detection: A Machine Learning Perspective</a></li>
<li><a href="https://www.amazon.com/Network-Security-Through-Data-Analysis/dp/1449357903/?tag=cyberanaly-20">Network Security Through Data Analysis: Building Situational Awareness</a></li>
</ul>
<h2 id="security-data-science-presentations--talks">Security Data Science Presentations / Talks</h2>
<ul>
<li><a href="https://www.youtube.com/watch?v=dGwH7m4N8DE">Applied Machine Learning for Data Exfil and Other Fun Topics</a></li>
<li><a href="https://www.youtube.com/watch?v=vy-jpFpm1AU">Applying Machine Learning to Network Security Monitoring</a></li>
<li><a href="https://www.youtube.com/watch?v=iLNHVwSu9EA&t=245s">Build an Antivirus in 5 Min – Fresh Machine Learning #7. A fun video to watch</a></li>
<li><a href="https://www.youtube.com/watch?v=u6a7afsD39A">CrowdSource: Crowd Trained Machine Learning Model for Malware Capability Det</a></li>
<li><a href="https://www.youtube.com/watch?v=6JMEKnes-w0">Data-Driven Threat Intelligence: Metrics On Indicator Dissemination And Sharing</a></li>
<li><a href="https://www.youtube.com/watch?v=oiuS1DyFNd8">Defeating Machine Learning What Your Security Vendor Is Not Telling You</a></li>
<li><a href="https://www.youtube.com/watch?v=sPtbDUJjhbk">Defeating Machine Learning: Systemic Deficiencies for Detecting Malware</a></li>
<li><a href="https://www.youtube.com/watch?v=_0CRSF6yPB4">Defending Networks With Incomplete Information: A Machine Learning Approach</a></li>
<li><a href="https://www.youtube.com/watch?v=36IT9VgGr0g">Defending Networks with Incomplete Information</a></li>
<li><a href="https://www.youtube.com/watch?v=l7U0pDcsKLg">Delta Zero, KingPhish3r – Weaponizing Data Science for Social Engineering</a></li>
<li><a href="https://www.youtube.com/watch?v=gHtN4jU69W0">Fraud detection using machine learning & deep learning</a></li>
<li><a href="https://www.youtube.com/watch?v=zT-4zdtvR30">Hunting for Malware with Machine Learning</a></li>
<li><a href="https://www.youtube.com/watch?v=JAGDpJFFM2A">Machine Duping 101: Pwning Deep Learning Systems</a></li>
<li><a href="https://www.youtube.com/watch?v=fRklX97iGIw">Machine Learning and the Cloud: Disrupting Threat Detection and Prevention</a></li>
<li><a href="https://www.youtube.com/watch?v=qVwktOa-F34">Machine Learning for Threat Detection</a></li>
<li><a href="https://www.youtube.com/watch?v=yG6QlHOAWiE">Measuring the IQ of your Threat Intelligence Feeds</a></li>
<li><a href="https://www.youtube.com/watch?v=2cQRSPFSY-s">Packet Capture Village – Theodora Titonis – How Machine Learning Finds Malware</a></li>
<li><a href="https://www.youtube.com/watch?v=TYVCVzEJhhQ">Secure Because Math: A Deep-Dive on ML-Based Monitoring</a></li>
<li><a href="https://www.youtube.com/watch?v=B7OKgC3AJVM">The Applications Of Deep Learning On Traffic Identification</a></li>
<li><a href="https://www.youtube.com/watch?v=tukidI5vuBs">Using Machine Learning to Support Information Security</a></li>
<li><a href="https://www.youtube.com/watch?v=fN5TOB4ZPVI">Clusterf*ck Actionable Intelligence from Machine Learning</a></li>
<li><a href="https://www.youtube.com/watch?v=jCIT7rXX8y0">I Am Packer And So Can You</a></li>
<li><a href="https://www.youtube.com/watch?v=8lF5rBmKhWk">Practical Applications of Data Science in Detection</a></li>
</ul>
<h2 id="misc">Misc</h2>
<ul>
<li><a href="https://github.com/jivoi/awesome-ml-for-cybersecurity">awesome-ml-for-cybersecurity</a></li>
</ul>
<p><a href="http://www.covert.io/the-definitive-security-datascience-and-machinelearning-guide/">The Definitive Security Data Science and Machine Learning Guide</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on January 01, 2017.</p>
http://www.covert.io/deep-learning-security-papers 2017-01-01T00:00:00-00:00 2016-12-29T00:00:00-05:00 Jason Trost http://www.covert.io jason.trost@gmail.com
<p><img src="/images/deep-learning-logo.png" /></p>
<p><strong>Update (1/1/2017)</strong>: I will no longer update this page. All future updates will go to <a href="/the-definitive-security-datascience-and-machinelearning-guide/">The Definitive Security Data Science and Machine Learning Guide</a> (see the <a href="/the-definitive-security-datascience-and-machinelearning-guide/#deep-learning-and-security-papers">Deep Learning and Security Papers</a> section).</p>
<p>This is another quick post. Over the past few months I started researching <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a> to determine if it may be useful for solving security problems. This post on <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">The Unreasonable Effectiveness of Recurrent Neural Networks</a> was what got me interested in this topic, and I highly recommend reading it in its entirety.</p>
<p>Throughout this research, I came across several academic and professional research papers on security topics that use deep learning. What follows is a list of the papers, slides, and videos that I found, which may be useful to others. If you know of others that should be added to this list, please ping me: <a href="https://twitter.com/jason_trost">@jason_trost</a>.</p>
<h2 id="deep-learning-papers-on-security">Deep Learning Papers on Security</h2>
<ul>
<li><a href="/research-papers/deep-learning-security/A Deep Learning Approach for Network Intrusion Detection System.pdf">A Deep Learning Approach for Network Intrusion Detection System</a></li>
<li><a href="/research-papers/deep-learning-security/A Hybrid Malicious Code Detection Method based on Deep Learning.pdf">A Hybrid Malicious Code Detection Method based on Deep Learning</a></li>
<li><a href="/research-papers/deep-learning-security/A Hybrid Spectral Clustering and Deep Neural Network Ensemble Algorithm for Intrusion Detection in Sensor Networks.pdf">A Hybrid Spectral Clustering and Deep Neural Network Ensemble Algorithm for Intrusion Detection in Sensor Networks</a></li>
<li><a href="/research-papers/deep-learning-security/A Multi-task Learning Model for Malware Classification with Useful File Access Pattern from API Call Sequence.pdf">A Multi-task Learning Model for Malware Classification with Useful File Access Pattern from API Call Sequence</a></li>
<li><a href="/research-papers/deep-learning-security/A novel LSTM-RNN decoding algorithm in CAPTCHA recognition.pdf">A Novel LSTM-RNN Decoding Algorithm in CAPTCHA Recognition</a> (Short Paper)</li>
<li><a href="/research-papers/deep-learning-security/An Analysis of Recurrent Neural Networks for Botnet Detection Behavior.pdf">An Analysis of Recurrent Neural Networks for Botnet Detection Behavior</a></li>
<li><a href="/research-papers/deep-learning-security/Application of Recurrent Neural Networks for User Verification based on Keystroke Dynamics.pdf">Application of Recurrent Neural Networks for User Verification based on Keystroke Dynamics</a></li>
<li><a href="/research-papers/deep-learning-security/Applications of Deep Learning On Traffic Identification.pdf">Applications of Deep Learning On Traffic Identification</a> (video: <a href="https://www.youtube.com/watch?v=yZ-Y1WCM0lc">here</a>)</li>
<li><a href="/research-papers/deep-learning-security/Combining Restricted Boltzmann Machine and One Side Perceptron for Malware Detection.pdf">Combining Restricted Boltzmann Machine and One Side Perceptron for Malware Detection</a></li>
<li><a href="/research-papers/deep-learning-security/Comparison Deep Learning Method to Traditional Methods Using for Network Intrusion Detection.pdf">Comparison Deep Learning Method to Traditional Methods Using for Network Intrusion Detection</a> (Short Paper)</li>
<li><a href="/research-papers/deep-learning-security/Convolutional Neural Networks for Malware Classification.pdf">Convolutional Neural Networks for Malware Classification</a> (THESIS)</li>
<li><a href="/research-papers/deep-learning-security/Deep learning approach for Network Intrusion Detection in Software Defined Networking.pdf">Deep Learning Approach for Network Intrusion Detection in Software Defined Networking</a></li>
<li><a href="/research-papers/deep-learning-security/Deep Learning for Classification of Malware System Call Sequences.pdf">Deep Learning for Classification of Malware System Call Sequences</a></li>
<li><a href="/research-papers/deep-learning-security/Poster - Deep Learning for Zero-day Flash Malware Detection.pdf">Deep Learning for Zero-day Flash Malware Detection</a> (Short Paper)</li>
<li><a href="/research-papers/deep-learning-security/Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover sourcemismatch.pdf">Deep Learning is a Good Steganalysis Tool When Embedding Key is Reused for Different Images, even if there is a cover source mismatch</a></li>
<li><a href="/research-papers/deep-learning-security/Deep Learning-based Feature Selection for Intrusion Detection System in Transport Layer.pdf">Deep Learning-based Feature Selection for Intrusion Detection System in Transport Layer</a> (Short Paper)</li>
<li><a href="/research-papers/deep-learning-security/Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features.pdf">Deep Neural Network Based Malware Detection using Two Dimensional Binary Program Features</a></li>
<li><a href="/research-papers/deep-learning-security/DeepDGA- Adversarially-Tuned Domain Generation and Detection.pdf">DeepDGA: Adversarially-Tuned Domain Generation and Detection</a></li>
<li><a href="/research-papers/deep-learning-security/DeepSign- Deep Learning for Automatic Malware Signature Generation and Classification.pdf">DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification</a></li>
<li><a href="/research-papers/deep-learning-security/DL4MD- A Deep Learning Framework for Intelligent Malware Detection.pdf">DL4MD: A Deep Learning Framework for Intelligent Malware Detection</a></li>
<li><a href="/research-papers/deep-learning-security/DroidSec - Deep Learning in Android Malware Detection.pdf">Droid-Sec: Deep Learning in Android Malware Detection</a></li>
<li><a href="/research-papers/deep-learning-security/Droiddetector - android malware char- acterization and detection using deep learning.pdf">DroidDetector: Android Malware Characterization and Detection using Deep Learning</a></li>
<li><a href="/research-papers/deep-learning-security/HADM- Hybrid Analysis for Detection of Malware.pdf">HADM: Hybrid Analysis for Detection of Malware</a></li>
<li><a href="/research-papers/deep-learning-security/Identifying Top Sellers In Underground Economy Using Deep Learning-based Sentiment Analysis.pdf">Identifying Top Sellers In Underground Economy Using Deep Learning-based Sentiment Analysis</a></li>
<li><a href="/research-papers/deep-learning-security/Intrusion Detection System Using Deep Neural Network for In-Vehicle Network Security.pdf">Intrusion Detection System Using Deep Neural Network for In-Vehicle Network Security</a></li>
<li><a href="/research-papers/deep-learning-security/Large-scale Malware Classification using Random Projections and Neural Networks.pdf">Large-scale Malware Classification using Random Projections and Neural Networks</a></li>
<li><a href="/research-papers/deep-learning-security/Learning a Static Analyzer - A Case Study on a Toy Language.pdf">Learning a Static Analyzer: A Case Study on a Toy Language</a></li>
<li><a href="/research-papers/deep-learning-security/Learning Spam Features using Restricted Boltzmann Machines.pdf">Learning Spam Features using Restricted Boltzmann Machines</a></li>
<li><a href="/research-papers/deep-learning-security/Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection.pdf">Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection</a></li>
<li><a href="/research-papers/deep-learning-security/LSTM-based System-call Language Modeling and Robust Ensemble Method for Designing Host-based Intrusion Detection Systems.pdf">LSTM-based System-call Language Modeling and Robust Ensemble Method for Designing Host-based Intrusion Detection Systems</a></li>
<li><a href="/research-papers/deep-learning-security/Malware Classification on Time Series Data Through Machine Learning.pdf">Malware Classification on Time Series Data Through Machine Learning</a> (THESIS)</li>
<li><a href="/research-papers/deep-learning-security/Malware Classification with Recurrent Networks.pdf">Malware Classification with Recurrent Networks</a></li>
<li><a href="/research-papers/deep-learning-security/Malware Detection with Deep Neural Network using Process Behavior.pdf">Malware Detection with Deep Neural Network using Process Behavior</a></li>
<li><a href="/research-papers/deep-learning-security/MS-LSTM - a Multi-Scale LSTM Model for BGP Anomaly Detection.pdf">MS-LSTM: a Multi-Scale LSTM Model for BGP Anomaly Detection</a></li>
<li><a href="/research-papers/deep-learning-security/MtNet - A Multi-Task Neural Network for Dynamic Malware Classification.pdf">MtNet: A Multi-Task Neural Network for Dynamic Malware Classification</a></li>
<li><a href="/research-papers/deep-learning-security/Network anomaly detection with the restricted Boltzmann machine.pdf">Network Anomaly Detection with the Restricted Boltzmann Machine</a></li>
<li><a href="/research-papers/deep-learning-security/Predicting Domain Generation Algorithms with Long Short-Term Memory Networks.pdf">Predicting Domain Generation Algorithms with Long Short-Term Memory Networks</a></li>
<li><a href="/research-papers/deep-learning-security/Recognizing Functions in Binaries with Neural Networks.pdf">Recognizing Functions in Binaries with Neural Networks</a></li>
<li><a href="/research-papers/deep-learning-security/The limitations of deep learning in adversarial settings.pdf">The Limitations of Deep Learning in Adversarial Settings</a></li>
<li><a href="/research-papers/deep-learning-security/Toward large-scale vulnerability discovery using Machine Learning.pdf">Toward large-scale vulnerability discovery using Machine Learning</a></li>
</ul>
<h2 id="deep-learning-presentations-on-security">Deep Learning Presentations on Security</h2>
<ul>
<li><a href="/research-papers/deep-learning-security/Presentations/A Deep Learning Approach for Network Intrusion Detection System.pdf">A Deep Learning Approach for Network Intrusion Detection System</a></li>
<li><a href="/research-papers/deep-learning-security/Presentations/Deep Learning on Disassembly Data.pdf">Deep Learning on Disassembly Data</a> (video: <a href="https://www.youtube.com/watch?v=LQh8dktQReI">here</a>)</li>
</ul>
<h2 id="security-machine-learning-resources">Security Machine Learning Resources</h2>
<ul>
<li><a href="/security-datascience-papers/">Security Data Science Papers</a></li>
<li><a href="/interesting-security-papers/">Interesting security papers</a></li>
<li><a href="https://github.com/jivoi/awesome-ml-for-cybersecurity/blob/master/README.md">awesome-ml-for-cybersecurity project on Github</a></li>
<li><a href="http://www.mlsecproject.org/">mlsecproject</a></li>
<li><a href="https://speakerdeck.com/davidjbianco/getting-started-with-machine-learning-for-incident-detection">Getting Started With Machine Learning for Incident Detection</a> (code examples <a href="https://github.com/DavidJBianco/Clearcut">here</a>).</li>
</ul>
<h2 id="general-deep-learning-resources">General Deep Learning Resources</h2>
<ul>
<li><a href="https://github.com/sbrugman/deep-learning-papers">deep-learning-papers project on Github</a></li>
<li><a href="https://github.com/songrotek/Deep-Learning-Papers-Reading-Roadmap">Deep-Learning-Papers-Reading-Roadmap project on Github</a></li>
<li><a href="https://github.com/terryum/awesome-deep-learning-papers">awesome-deep-learning-papers project on Github</a></li>
<li><a href="http://deeplearning.net/reading-list/">deeplearning.net Reading List</a></li>
<li><a href="http://www.deeplearningpatterns.com/doku.php/start">Deep Learning Patterns</a></li>
<li><a href="https://openreview.net/group?id=ICLR.cc/2017/conference">International Conference on Learning Representations (ICLR) 2017 Conference CFP</a></li>
<li><a href="http://course.fast.ai/">Practical Deep Learning For Coders by fast.ai</a></li>
</ul>
<p>–Jason
<br /><a href="https://twitter.com/#!/jason_trost">@jason_trost</a></p>
<p><a href="http://www.covert.io/deep-learning-security-papers/">Deep Learning Security Papers</a> was originally published by Jason Trost at <a href="http://www.covert.io">covert.io</a> on December 29, 2016.</p>