1. Click Security Data Hacking Project

    This is an awesome collection of security data science IPython notebooks from @clicksecurity.

    They demonstrate using Pandas, Scikit-Learn, and Matplotlib for exploring security datasets involving:

    • Detecting Algorithmically Generated Domains
    • Hierarchical Clustering of Syslogs
    • Exploration of data from Malware Domain List
    • SQL Injection
    • Browser Agent Fingerprinting


    1 month ago  /  0 notes

  2. Internet Scale Port Scan Data and Analysis

    A couple days ago, this was posted:

    Port scanning /0 using insecure embedded devices

    Abstract: While playing around with the Nmap Scripting Engine (NSE) we discovered an amazing number of open embedded devices on the Internet. Many of them are based on Linux and allow login to standard BusyBox with empty or default credentials. We used these devices to build a distributed port scanner to scan all IPv4 addresses. These scans include service probes for the most common ports, ICMP ping, reverse DNS and SYN scans. We analyzed some of the data to get an estimation of the IP address usage.

    It is a write-up about performing an Internet-scale port scan using thousands of compromised BusyBox embedded devices and Linux servers.
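    The study itself used distributed SYN scans from the compromised devices, but the basic idea of probing a port is simple. Here is a minimal sketch of an ordinary TCP connect() scan (slower than a SYN scan, but needing no raw sockets or root privileges); the function name and interface are my own, not from the write-up:

```python
import socket

def tcp_connect_scan(host, ports, timeout=0.5):
    """Try a full TCP connect() to each port; return the ports that accept."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 when the three-way handshake completes
            if s.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports
```

    At Internet scale you would fan this out across many workers and add the service probes, ICMP ping, and reverse DNS lookups the abstract mentions.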

    While this is wildly unethical, and almost certainly illegal, the results of the study are pretty interesting. Even more interesting is that the author decided to post all of his code and data (~9 TB uncompressed, 1.5 TB compressed) online for free download.

    The author also posted some interactive web apps that allow exploration of this data set:

    It is definitely interesting to see how more and more network/security data is being collected and made freely available on the Internet. I am undecided whether this helps or hurts security in the long term. It definitely makes the situation worse in the short term.


    1 year ago  /  1 note

  3. Proactive Defense for Evolving Cyber Threats (Sandia Report)

    I stumbled on this recently. It is a small collection of reports/publications from Sandia National Labs on using Machine Learning and Predictive Analytics for Computer Network Defense. Here is what is contained in the PDF:

    [1] Early warning analysis for social diffusion events, Security Informatics, Vol. 1, 2012, SAND 2010-5334C.
    [2] Proactive cyber defense, chapter in Springer Integrated Series on Intelligent Systems, 2012 (Document No. 5299122, SAND 2011-8794P).
    [3] Predictability-oriented defense against adaptive adversaries, Proc. IEEE International Conference on Systems, Man, and Cybernetics, Seoul, Korea, October 2012 [or Predictive moving target defense, Proc. 2012 National Symposium on Moving Target Research, Annapolis, MD, June 2012], SAND 2012-4007C.
    [4] Leveraging sociological models for prediction I: Inferring adversarial relationships, and II: Early warning for complex contagions, Proc. IEEE International Conference on Intelligence and Security Informatics, Washington, DC, June 2012 [Winner of the 2012 Best Paper Award, IEEE ISI], SAND 2012-6729C.
    [5] Predictive defense against evolving adversaries, Proc. IEEE International Conference on Intelligence and Security Informatics, Washington, DC, June 2012, SAND 2012-4007C.
    [6] Proactive defense for evolving cyber threats, Proc. IEEE International Conference on Intelligence and Security Informatics, Beijing, China, July 2011 [Winner of the 2011 Best Paper Award, IEEE ISI], SAND 2011-2445C.

    Proactive Defense for Evolving Cyber Threats (PDF)


    1 year ago  /  1 note

  4. Big Data Security Analytics from Packetloop/Hortonworks

    This was a great series of articles from the guys at Packetloop on using Packetpig for large-scale pcap analysis and security analytics, including offline intrusion detection by running Snort over terabytes of pcaps.


    1 year ago  /  1 note

  5. Large Scale Malicious Domain Classification with Storm, Random Forests, and Markov Models

    At Endgame we have been working on a system for large-scale malicious DNS detection, and John Munro and I recently presented some of this work at FloCon.


    Clairvoyant Squirrel: Large Scale Malicious Domain Classification

    Large scale classification of domain names has many applications in network monitoring, intrusion detection, and forensics. The goal with this research is to predict a domain’s maliciousness solely based on the domain string itself, and to perform this classification on domains seen in real-time on high traffic networks, giving network administrators insight into possible intrusions. Our classification model uses the Random Forest algorithm with a 22-feature vector of domain string characteristics. Most of these features are numeric and are quick to calculate. Our model is currently trained off-line on a corpus of highly malicious domains gathered from DNS traffic originating from a malware execution sandbox and benign, popular domains from a high traffic DNS sensor. For stream classification, we use an internally developed platform for distributed high speed event processing that was built over Twitter’s recently open sourced Storm project. We discuss the system architecture as well as the logic behind our model’s features and sampling techniques that have led to 97% classification accuracy on our dataset and the model’s performance within our streaming environment.
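    To make the "string-derived, quick to calculate" idea concrete, here is a small sketch of the kind of features the abstract describes. This particular selection (length, entropy, digit and vowel ratios) is illustrative; it is not the actual 22-feature list from the talk:

```python
import math
from collections import Counter

def domain_features(domain):
    """Compute a few numeric features from the domain string alone."""
    label = domain.lower().split(".")[0]  # use the registered label, not the TLD
    n = len(label)
    counts = Counter(label)
    # Algorithmically generated names tend to have higher character entropy
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values()) if n else 0.0
    return {
        "length": n,
        "entropy": entropy,
        "digit_ratio": sum(ch.isdigit() for ch in label) / n if n else 0.0,
        "vowel_ratio": sum(ch in "aeiou" for ch in label) / n if n else 0.0,
    }
```

    Vectors like these would then be fed to a Random Forest classifier (e.g. scikit-learn's RandomForestClassifier) trained on the malicious and benign corpora the abstract mentions.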

    Here are the slides in case you’re interested.


    1 year ago  /  0 notes

  6. Packetpig - Open Source Big Data Security Analysis

    A coworker told me about this project today, and I thought I would share since it looks promising.

    Packetpig is an open source project hosted on GitHub by @packetloop that contains Hadoop InputFormats, Pig Loaders, Pig scripts, and R scripts for processing and analyzing pcap data. It also has classes that allow you to stream packets from Hadoop to local Snort and p0f processes so you can parallelize this type of packet processing.

    Check it out:


    1 year ago  /  1 note

  7. Hadoop DNS Mining

    This is another quick post. I have been working on this small framework for a while now, and I decided to publish the code before it was completely finished.

    The hadoop-dns-mining framework enables large-scale DNS lookups using Hadoop. For example, if you had access to zone files from COM, NET, ORG, etc. (all free and publicly available), you could take each domain in these files and use this framework to resolve the domains for various record types (A, AAAA, MX, TXT, NS, etc). After resolving the domains, you can run them through an enrichment process to add city, country, lat, long, ASN, and AS name (via Maxmind's DBs). And you could scale out this collection and processing effort using Hadoop, say, running over EC2.
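    The per-domain lookup step can be sketched in a few lines of Python (the framework itself is Hadoop-based; this function name and shape are my own). Note that the stdlib resolver only covers A/AAAA lookups; the MX/TXT/NS record types the framework supports would need a real DNS library:

```python
import socket

def resolve_a_records(domains, timeout=2.0):
    """Resolve A records for a batch of domains; failures map to empty lists."""
    socket.setdefaulttimeout(timeout)
    results = {}
    for domain in domains:
        try:
            infos = socket.getaddrinfo(domain, None, socket.AF_INET, socket.SOCK_STREAM)
            # getaddrinfo returns (family, type, proto, canonname, sockaddr);
            # the IPv4 address is the first element of sockaddr
            results[domain] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            results[domain] = []  # NXDOMAIN or resolver failure
    return results
```

    Hadoop's role is simply to partition the millions of zone-file domains and run batches like this in parallel map tasks.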

    There are some interesting applications of this type of system, like using the existing zone files names to brute force “generating” zone files for TLDs that do not publish them (most ccTLDs do not). For example, like this company has done: http://viewdns.info/data/

    For more details, check out my GitHub repo. The README covers DNS collection and geo enrichment. I have code checked in that will store this data in Accumulo using a few different storage/access patterns, but more explanation will come later as I have time.



    1 year ago  /  1 note

  8. Hadoop Binary Analysis Framework

    This is a quick post. I wrote this little framework for using Hadoop to analyze lots of small files. This may not be the optimal way of doing this, but it worked well and makes repeated analysis tasks easy and scalable.


    I recently needed a quick way to analyze millions of small binary files (100 KB to 19 MB each), and I wanted a scalable way to repeatedly do this sort of analysis. I chose Hadoop as the platform, and I built this little framework (really, a single MapReduce job) to do it. This is very much a work in progress, and feedback and pull requests are welcome.

    The main MapReduce job in this framework accepts a SequenceFile of <Text, BytesWritable>, where the Text is a name and the BytesWritable is the contents of a file. The framework unpacks the bytes of the BytesWritable to the local filesystem of the mapper it is running on, allowing the mapper to run arbitrary analysis tools that require local filesystem access. The framework then captures stdout and stderr from the analysis tool/script and stores it (how it stores it is pluggable; see io.covert.binary.analysis.OutputParser).
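    The pattern the mapper implements is easy to sketch outside Hadoop. Here it is in Python rather than the framework's Java (the function name and interface are my own): write the bytes to local disk, run a tool against the file, and capture its output:

```python
import os
import subprocess
import tempfile

def run_tool_on_bytes(name, data, tool_cmd):
    """Write `data` to a local temp file, run `tool_cmd` on it,
    and return (stdout, stderr, returncode), as the mapper does."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, name)
        with open(path, "wb") as f:
            f.write(data)
        # Capture both output streams, mirroring what the framework stores
        proc = subprocess.run(tool_cmd + [path], capture_output=True, text=True)
        return proc.stdout, proc.stderr, proc.returncode
```

    In the real framework the same steps happen per record inside each map task, so thousands of files can be analyzed concurrently across the cluster.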


    mvn package assembly:assembly


    # LOCAL_FILES is a local directory of input files (subdirectories are ignored for now);
    # pack these relatively small files into one sequence file of (Text, BytesWritable)
    hadoop jar $JAR io.covert.binary.analysis.BuildSequenceFile $LOCAL_FILES $INPUT
    # Use the config properties in example.xml to run the wrapper.sh script on each file,
    # with Hadoop as the platform for computation
    hadoop jar $JAR io.covert.binary.analysis.BinaryAnalysisJob -files wrapper.sh -conf example.xml $INPUT $OUTPUT

    From example.xml:


    This example block instructs the framework to run wrapper.sh with an argument of ${file} (where ${file} is replaced by the name of the unpacked file from the SequenceFile). If multiple command-line args are required, they can be specified by appending a delimiter and then each arg to the value of the binary.analysis.program.args property.
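    For reference, a Hadoop-style property block matching that description would look roughly like this. This is a hypothetical reconstruction, not the actual example.xml; only the binary.analysis.program.args name appears in the text above:

```xml
<property>
  <name>binary.analysis.program.args</name>
  <value>${file}</value>
  <!-- multiple args: pick a delimiter and append each arg to this value -->
</property>
```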


    2 years ago  /  1 note