Over the past several years I have collected and read many security research papers/slides and have started a small catalog of sorts. The topics of these papers span intrusion detection, anomaly detection, machine learning/data mining, Internet-scale data collection, malware analysis, and intrusion/breach reports. I figured this collection might be useful to others. All links lead to PDFs hosted here.

I hope to clean this up (add author info, date, and publication) when I get some more time, and to add the detailed notes I have on the features, models, algorithms, and datasets used in many of these papers.

Here are some of my favorites (nice uses of machine learning, graph analytics, and/or anomaly detection to solve interesting security problems):

Here is the entire collection:

Intrusion Detection


Data Collection

Vulnerability Analysis/Reversing


Data Mining

Cyber Crime



This is an awesome collection of Security Data Science IPython notebooks from @clicksecurity.

They demonstrate using pandas, scikit-learn, and Matplotlib to explore security datasets involving:

  • Detecting Algorithmically Generated Domains
  • Hierarchical Clustering of Syslogs
  • Exploration of data from Malware Domain List
  • SQL Injection
  • Browser Agent Fingerprinting
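As a toy illustration of the kind of feature the DGA-detection work builds on (this sketch is mine, not code from the notebooks): algorithmically generated labels tend to have higher character entropy than human-chosen names.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy (bits per character) of a string."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_generated(domain, threshold=3.5):
    """Crude DGA signal: flag domains whose second-level label has high
    character entropy. The threshold here is an assumption for
    illustration, not a value taken from the notebooks."""
    label = domain.split(".")[0]
    return shannon_entropy(label) >= threshold
```

A real classifier, as in the notebooks, would combine features like this with n-gram scores and a trained model rather than a single fixed cutoff.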


Mubix (@mubix) used the subdomains list from here to brute-force subdomains using dig and xargs. Really nice use of xargs for parallel execution.

cat subdomains.txt | \
	xargs -P 122 -I{} dig +noall {}.microsoft.com +answer


Another security-related “big data” release. The DNS Census is an anonymous public release of DNS data. The person behind it claimed to have been inspired by the Internet Census.

Some stats:

  • 2.5B DNS records
  • ~106M unique domain names
  • Most DNS RR types are represented (A/AAAA/CNAME/DNAME/MX/NS/SOA/TXT)
  • 15 GB compressed
  • 157 GB uncompressed
  • Available as a torrent


This is another quick post. I have been working on this small framework for a while now and decided to publish the code before it was completely finished.

The hadoop-dns-mining framework enables large-scale DNS lookups using Hadoop. For example, if you had access to zone files from COM, NET, ORG, etc. (all free and publicly available), you could take each domain in these files and use this framework to resolve it for various record types (A, AAAA, MX, TXT, NS, etc.). After resolving the domains, you can run them through an enrichment process that adds city, country, latitude, longitude, ASN, and AS name (via MaxMind's databases). You could then scale out this collection and processing effort with Hadoop, say, running over EC2.
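The per-domain lookup step can be sketched roughly like this (function and parameter names here are hypothetical, not the framework's actual API; the lookup backend is injected as a callable so the sketch stays self-contained and testable without a network):

```python
def resolve_domain(domain, rtypes, resolve):
    """Resolve one domain for several record types.

    resolve(domain, rtype) -> list of answer strings; injected so the
    backend (dnspython, raw sockets, etc.) is pluggable. Inside the real
    framework, logic like this would run in a Hadoop mapper over the
    zone-file domains.
    """
    records = {}
    for rtype in rtypes:
        try:
            records[rtype] = resolve(domain, rtype)
        except Exception:
            # Lookup failures (NXDOMAIN, timeouts) become empty answers
            records[rtype] = []
    return domain, records
```

The enrichment pass would then join each answer against MaxMind's databases to attach the geo and ASN fields.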

There are some interesting applications of this type of system, like using names from the existing zone files to brute-force "generated" zone files for TLDs that do not publish them (most ccTLDs do not). For example, this company has done exactly that: http://viewdns.info/data/
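The brute-force "generation" idea amounts to reusing second-level labels seen in published zones as candidates under an unpublished ccTLD. A minimal sketch (hypothetical helper, not code from the repo):

```python
def candidate_domains(known_names, tld):
    """Reuse second-level labels from published zones (COM/NET/ORG) as
    guesses under a TLD that does not publish its zone file."""
    labels = {name.split(".")[0].lower() for name in known_names}
    return sorted(label + "." + tld for label in labels)
```

Each candidate would then be resolved (with the framework above, or dig); the names that answer approximate the unpublished zone.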

For more details, check out my GitHub repo. The README covers DNS collection and geo enrichment. I also have code checked in that stores this data in Accumulo using a few different storage/access patterns; more explanation will come later as I have time.