At Endgame we have been working on a system for large-scale malicious DNS detection, and John Munro and I recently presented some of this work at FloCon.


Clairvoyant Squirrel: Large Scale Malicious Domain Classification

Large-scale classification of domain names has many applications in network monitoring, intrusion detection, and forensics. The goal of this research is to predict a domain's maliciousness based solely on the domain string itself, and to perform this classification on domains seen in real time on high-traffic networks, giving network administrators insight into possible intrusions. Our classification model uses the Random Forest algorithm with a 22-feature vector of domain string characteristics. Most of these features are numeric and quick to calculate. Our model is currently trained offline on a corpus of highly malicious domains gathered from DNS traffic originating from a malware execution sandbox, plus benign, popular domains from a high-traffic DNS sensor. For stream classification, we use an internally developed platform for distributed high-speed event processing, built on Twitter's recently open-sourced Storm project. We discuss the system architecture, the logic behind our model's features, and the sampling techniques that have led to 97% classification accuracy on our dataset, as well as the model's performance within our streaming environment.
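The abstract doesn't enumerate the 22 features, but "numeric and quick to calculate" string characteristics can be illustrated with one common example, Shannon entropy of the domain string. The `entropy` function below is purely an illustration of the kind of feature involved, not Endgame's implementation:

```shell
# Shannon entropy (bits per character) of a domain string.
# awk's log() is natural log, so divide by log(2) to convert to bits.
entropy() {
  printf '%s' "$1" | fold -w1 | sort | uniq -c | awk -v len="${#1}" '
    { p = $1 / len; H -= p * log(p) / log(2) } END { printf "%.4f\n", H }'
}

entropy "aaaa"           # 0.0000 -- a single repeated character
entropy "x9f2-qk7v.biz"  # higher entropy, typical of algorithmically generated domains
```

Random, machine-generated domains tend to score high on measures like this, while human-chosen, pronounceable domains score lower, which is why cheap string statistics carry real signal for this task.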

Here are the slides in case you’re interested.


This is a great series of articles from the guys at Packetloop on using PacketPig for large-scale pcap analysis, including offline intrusion detection by running Snort over terabytes of pcaps, and security analytics. --Jason

A coworker told me about this project today, and I thought I would share since it looks promising.

Packetpig is an open source project hosted on GitHub by @packetloop that contains Hadoop InputFormats, Pig loaders, Pig scripts, and R scripts for processing and analyzing pcap data. It also has classes that let you stream packets from Hadoop to local Snort and p0f processes, so you can parallelize this type of packet processing.

Check it out:


Update (2013-08-01): This project is no longer maintained, since we ported all of this functionality over to BinaryPig. Use BinaryPig instead. For more information on BinaryPig, see the slides, paper, or video.

This is a quick post. I wrote this little framework for using Hadoop to analyze lots of small files. It may not be the optimal way of doing this, but it works well and makes repeated analysis tasks easy and scalable.

I recently needed a quick way to analyze millions of small binary files (from 100KB to 19MB each), and I wanted a scalable way to repeatedly do this sort of analysis. I chose Hadoop as the platform and built this little framework (really, a single MapReduce job) to do it. This is very much a work in progress; feedback and pull requests are welcome.

The main MapReduce job in this framework accepts a SequenceFile of `<Text, BytesWritable>` pairs, where the `Text` is a file name and the `BytesWritable` is the contents of that file. The framework unpacks the bytes of the `BytesWritable` to the local filesystem of the mapper it is running on, allowing the mapper to run arbitrary analysis tools that require local filesystem access. The framework then captures stdout and stderr from the analysis tool/script and stores it (how it stores it is pluggable).
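The per-mapper behavior can be sketched locally without Hadoop. In this sketch, a plain directory of files stands in for the SequenceFile, and the second argument is whatever analysis tool you would configure; both names are stand-ins for illustration, not part of the framework:

```shell
# Local sketch (no Hadoop) of what each mapper does: for every input file,
# run the analysis tool against it and capture stdout/stderr, which the real
# framework would then store via its pluggable storage layer.
run_analysis() {
  # $1 = directory of input files (stand-in for the SequenceFile contents)
  # $2 = analysis tool/script to run against each file
  mkdir -p out
  for f in "$1"/*; do
    name=$(basename "$f")
    sh "$2" "$f" > "out/$name.stdout" 2> "out/$name.stderr"
  done
}
```

The actual job does the same thing per record on the mapper's local disk, which is what lets you parallelize tools that only know how to read local files.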


mvn package assembly:assembly



# a local directory with files in it (directories are ignored for now)

# convert a bunch of relatively small files into one sequence file (Text, BytesWritable)
hadoop jar $JAR io.covert.binary.analysis.BuildSequenceFile $LOCAL_FILES $INPUT

# Use the config properties in example.xml to basically run the script on each file using Hadoop
# as the platform for computation
hadoop jar $JAR io.covert.binary.analysis.BinaryAnalysisJob -files -conf example.xml $INPUT $OUTPUT
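The script that example.xml points the framework at can be anything that reads a local file and writes results to stdout/stderr. As a hypothetical stand-in (the name analyze.sh and its output format are inventions for illustration, not part of the project):

```shell
# Create a tiny hypothetical analysis tool; example.xml would point the
# framework at something like this.
cat > analyze.sh <<'EOF'
#!/bin/sh
# $1 is the filename the framework unpacked onto the mapper's local disk.
# Everything written to stdout/stderr here is captured and stored.
echo "size_bytes=$(wc -c < "$1" | tr -d ' ')"
echo "md5=$(md5sum "$1" | awk '{print $1}')"
EOF
chmod +x analyze.sh
```

In practice this is where you would invoke your real tool (a disassembler, an AV scanner, `file`, etc.), since the framework doesn't care what the tool does as long as it works on a local file.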

From example.xml:


This example block instructs the framework to run the configured analysis program with the configured arguments, where a placeholder argument is replaced by the unpacked filename from the SequenceFile. If multiple command-line args are required, they can be specified by appending a delimiter and then each arg to the value of the args property.


I just posted a functional AccumuloStorage module to github.

Here’s how you use it (also in the github README)

### 1. Build the JAR

Note: you will need to download the Accumulo source, build it, and install it into your local Maven repo before this will work

mvn package

This will create a JAR file here:


### 2. Download the JARs needed by pig

mvn dependency:copy-dependencies -DoutputDirectory=lib  \

This should have copied the needed dependency JARs into a local `lib` directory.


### 3. Print the register statements we will need in pig

for JAR in lib/*.jar target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar; do
    echo register `pwd`/$JAR;
done
Here is some example output

register /home/developer/workspace/accumulo-pig/lib/accumulo-core-1.5.0-incubating-SNAPSHOT.jar
register /home/developer/workspace/accumulo-pig/lib/cloudtrace-1.5.0-incubating-SNAPSHOT.jar
register /home/developer/workspace/accumulo-pig/lib/libthrift-0.6.1.jar
register /home/developer/workspace/accumulo-pig/lib/zookeeper-3.3.1.jar
register /home/developer/workspace/accumulo-pig/target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar

### 4. Run Pig

Copy the register statements above and paste them into the pig terminal. Then you can LOAD from and STORE into accumulo.

$ pig
2012-03-02 08:15:25,808 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/developer/workspace/accumulo-pig/pig_1330694125807.log
2012-03-02 08:15:25,937 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://
2012-03-02 08:15:26,032 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at:
grunt> register /home/developer/workspace/accumulo-pig/lib/accumulo-core-1.5.0-incubating-SNAPSHOT.jar
grunt> register /home/developer/workspace/accumulo-pig/lib/cloudtrace-1.5.0-incubating-SNAPSHOT.jar
grunt> register /home/developer/workspace/accumulo-pig/lib/libthrift-0.6.1.jar
grunt> register /home/developer/workspace/accumulo-pig/lib/zookeeper-3.3.1.jar
grunt> register /home/developer/workspace/accumulo-pig/target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar
grunt> DATA = LOAD 'accumulo://webpage?instance=inst&user=root&password=secret&zookeepers=' 
>> using org.apache.accumulo.pig.AccumuloStorage() AS (row, cf, cq, cv, ts, val);
grunt> DATA2 = FOREACH DATA GENERATE row, cf, cq, cv, val;
grunt> STORE DATA2 into 'accumulo://webpage_content?instance=inst&user=root&password=secret&zookeepers=' using org.apache.accumulo.pig.AccumuloStorage();
2012-03-02 08:18:44,090 [main] INFO - Pig features used in the script: UNKNOWN
2012-03-02 08:18:44,093 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for DATA: $4
2012-03-02 08:18:44,108 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-03-02 08:18:44,110 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-03-02 08:18:44,110 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-03-02 08:18:44,117 [main] INFO - Pig script settings are added to the job
2012-03-02 08:18:44,118 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-03-02 08:18:44,120 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job7611629033341757288.jar
2012-03-02 08:18:46,282 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job7611629033341757288.jar created
2012-03-02 08:18:46,286 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-03-02 08:18:46,375 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-03-02 08:18:46,876 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-03-02 08:18:46,878 [Thread-17] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-03-02 08:18:47,887 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201203020643_0001
2012-03-02 08:18:47,887 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at:
2012-03-02 08:18:54,434 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-03-02 08:18:57,484 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-03-02 08:18:57,485 [main] INFO - Script Statistics: 
HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
                                   2012-03-02 08:18:44  2012-03-02 08:18:57  UNKNOWN
Job Stats (time in seconds):
Successfully read 288 records from: "accumulo://webpage?instance=inst&user=root&password=secret&zookeepers="
Successfully stored 288 records in: "accumulo://webpage_content?instance=inst&user=root&password=secret&zookeepers="
Total records written : 288
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
2012-03-02 08:18:57,492 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Here are the pig commands run if you don’t want to look through the output above:

# load just the web content (from the f:cnt column) from the webpage table
DATA = LOAD 'accumulo://webpage?instance=inst&user=root&password=secret&zookeepers='
   using org.apache.accumulo.pig.AccumuloStorage() AS (row, cf, cq, cv, ts, val);

# basically, remove the ts field since it is not needed
DATA2 = FOREACH DATA GENERATE row, cf, cq, cv, val;

# store the data as is in a new table called webpage_content
STORE DATA2 into 'accumulo://webpage_content?instance=inst&user=root&password=secret&zookeepers='
   using org.apache.accumulo.pig.AccumuloStorage();
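The comment above mentions wanting just the web content from the f:cnt column, but the LOAD brings back every column in the table. A hypothetical FILTER (not part of the original script) would narrow DATA down to that one column:

```pig
-- hypothetical: keep only the f:cnt column mentioned in the comment above
CNT = FILTER DATA BY cf == 'f' AND cq == 'cnt';
```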

A more detailed post on how and why this is useful will follow.


Update (2012/03/04): you may want to run this as the first line of the pig script:

SET false

This will avoid ingesting duplicate entries into Accumulo. For the data in this post, ingesting duplicate entries wouldn't cause any real issues, because Accumulo's versioning iterator would only keep the newest copy; but for columns/tables with aggregation configured, we definitely don't want duplicates.