A coworker told me about this project today, and I thought I would share since it looks promising.

Packetpig is an open source project hosted on github by @packetloop that contains Hadoop InputFormats, Pig Loaders, Pig scripts and R scripts for processing and analyzing pcap data. It also has classes that allow you to stream packets from Hadoop to local snort and p0f processes so you can parallelize this type of packet processing.

Check it out:

–Jason
@jason_trost

Update (2013-08-01): This project is no longer maintained since we ported all of this functionality over to BinaryPig. Use BinaryPig instead. For more information on BinaryPig, see Slides, Paper, or Video.


This is a quick post. I wrote this little framework for using Hadoop to analyze lots of small files. This may not be the optimal way of doing this, but it worked well and makes repeated analysis tasks easy and scalable.

https://github.com/jt6211/hadoop-binary-analysis

I recently needed a quick way to analyze millions of small binary files (100KB to 19MB each), and I wanted a scalable way to repeat this sort of analysis. I chose Hadoop as the platform, and I built this little framework (really, a single MapReduce job) to do it. This is very much a work in progress, and feedback and pull requests are welcome.

The main MapReduce job in this framework accepts a Sequence file of `<Text, BytesWritable>` where the `Text` is a name and the `BytesWritable` is the contents of a file. The framework unpacks the bytes of the `BytesWritable` to the local filesystem of the mapper it is running on, allowing the mapper to run arbitrary analysis tools that require local filesystem access. The framework then captures stdout and stderr from the analysis tool/script and stores it (how it stores it is pluggable, see `io.covert.binary.analysis.OutputParser`).
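To make the per-record work concrete, here is a minimal Python sketch of the same idea. It is only an illustration under stated assumptions: the real framework is a Java MapReduce job, and the function name `analyze_record` and its parameters are hypothetical, not part of the project. It writes one record's bytes to a local temporary file, runs a tool against that file, and captures stdout and stderr.

```python
import os
import subprocess
import sys
import tempfile


def analyze_record(name, content, program, args):
    """Unpack one (name, bytes) record to local disk and run a tool on it.

    Hypothetical helper for illustration only; the real framework does this
    inside a Hadoop mapper written in Java.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Keep only the base name so a record name cannot escape workdir.
        local_path = os.path.join(workdir, os.path.basename(name) or "record")
        with open(local_path, "wb") as fh:
            fh.write(content)
        # Pass the unpacked file's path as the last argument to the tool.
        argv = [program, *args, local_path]
        proc = subprocess.run(argv, capture_output=True)
        return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Stand-in "analysis tool": a Python one-liner that prints the file size.
    tool_args = ["-c", "import os, sys; print(os.path.getsize(sys.argv[1]))"]
    code, out, err = analyze_record("sample.bin", b"\x00" * 1024, sys.executable, tool_args)
    print(code, out.decode().strip())
```

The temporary directory is removed as soon as the tool finishes, so only the captured output survives, which mirrors the "capture stdout and stderr, then store it" step described above.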

Building:

mvn package assembly:assembly

Running:

JAR=target/hadoop-binary-analysis-1.0-SNAPSHOT-job.jar

# a local directory with files in it (directories are ignored for now)
LOCAL_FILES=src/main/java/io/covert/binary/analysis/
INPUT="dir-in-hdfs"
OUTPUT="output-dir-in-hdfs"

# convert a bunch of relatively small files into one sequence file (Text, BytesWritable)
hadoop jar $JAR io.covert.binary.analysis.BuildSequenceFile $LOCAL_FILES $INPUT

# Use the config properties in example.xml to basically run the wrapper.sh script on each file using Hadoop
# as the platform for computation
hadoop jar $JAR io.covert.binary.analysis.BinaryAnalysisJob -files wrapper.sh -conf example.xml $INPUT $OUTPUT

From example.xml:

<property>
  <name>binary.analysis.program</name>
  <value>./wrapper.sh</value>
</property>
<property>
  <name>binary.analysis.program.args</name>
  <value>${file}</value>
</property>
<property>
  <name>binary.analysis.program.args.delim</name>
  <value>,</value>
</property>


This block of the example instructs the framework to run `wrapper.sh` using the args of `${file}` (where `${file}` is replaced by the unpacked filename from the Sequence File). If multiple command line args are required, they can be specified by appending a delimiter and then each arg to the value of the `binary.analysis.program.args` property.
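As a rough illustration of how those three properties combine into a command line, here is a small Python sketch. The function `build_command` is hypothetical, and the exact parsing inside the framework may differ; this only assumes what the paragraph above states (split the args value on the delimiter, then substitute `${file}`).

```python
def build_command(program, args_value, delim, unpacked_path):
    """Split the args property on the delimiter and substitute ${file}.

    Hypothetical helper; it models the behavior described in the post,
    not the framework's actual Java implementation.
    """
    args = args_value.split(delim) if args_value else []
    return [program] + [arg.replace("${file}", unpacked_path) for arg in args]


if __name__ == "__main__":
    # Property values from example.xml.
    print(build_command("./wrapper.sh", "${file}", ",", "/tmp/sample.bin"))
    # A made-up second argument ("-v") to show the delimiter in use.
    print(build_command("./wrapper.sh", "-v,${file}", ",", "/tmp/sample.bin"))
```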

–Jason

I just posted a functional AccumuloStorage module to github.

Here’s how you use it (also in the github README)

###1. Build the JAR

Note: you will need to download the accumulo src, build it, and install it into your maven repo before this will work

mvn package

This will create a JAR file here:

target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar

###2. Download the JARs needed by pig

mvn dependency:copy-dependencies -DoutputDirectory=lib  \
    -DincludeArtifactIds=zookeeper,libthrift,accumulo-core,cloudtrace

This should have copied the needed dependency jars into a `lib` directory.

###3. Print the register statements we will need in pig

for JAR in lib/*.jar target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar ; 
do 
    echo register `pwd`/$JAR; 
done

Here is some example output

register /home/developer/workspace/accumulo-pig/lib/accumulo-core-1.5.0-incubating-SNAPSHOT.jar
register /home/developer/workspace/accumulo-pig/lib/cloudtrace-1.5.0-incubating-SNAPSHOT.jar
register /home/developer/workspace/accumulo-pig/lib/libthrift-0.6.1.jar
register /home/developer/workspace/accumulo-pig/lib/zookeeper-3.3.1.jar
register /home/developer/workspace/accumulo-pig/target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar

###4. Run Pig

Copy the register statements above and paste them into the pig terminal. Then you can LOAD from and STORE into accumulo.

$ pig
2012-03-02 08:15:25,808 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/developer/workspace/accumulo-pig/pig_1330694125807.log
2012-03-02 08:15:25,937 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://127.0.0.1/
2012-03-02 08:15:26,032 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 127.0.0.1:9001
grunt> register /home/developer/workspace/accumulo-pig/lib/accumulo-core-1.5.0-incubating-SNAPSHOT.jar
grunt> register /home/developer/workspace/accumulo-pig/lib/cloudtrace-1.5.0-incubating-SNAPSHOT.jar
grunt> register /home/developer/workspace/accumulo-pig/lib/libthrift-0.6.1.jar
grunt> register /home/developer/workspace/accumulo-pig/lib/zookeeper-3.3.1.jar
grunt> register /home/developer/workspace/accumulo-pig/target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar
grunt> 
grunt> DATA = LOAD 'accumulo://webpage?instance=inst&user=root&password=secret&zookeepers=127.0.0.1:2181&columns=f:cnt' 
>>using org.apache.accumulo.pig.AccumuloStorage() AS (row, cf, cq, cv, ts, val);
grunt> 
grunt> DATA2 = FOREACH DATA GENERATE row, cf, cq, cv, val;
grunt> 
grunt> STORE DATA2 into 'accumulo://webpage_content?instance=inst&user=root&password=secret&zookeepers=127.0.0.1:2181' using org.apache.accumulo.pig.AccumuloStorage();
2012-03-02 08:18:44,090 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2012-03-02 08:18:44,093 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for DATA: $4
2012-03-02 08:18:44,108 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-03-02 08:18:44,110 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-03-02 08:18:44,110 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-03-02 08:18:44,117 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-03-02 08:18:44,118 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-03-02 08:18:44,120 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job7611629033341757288.jar
2012-03-02 08:18:46,282 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job7611629033341757288.jar created
2012-03-02 08:18:46,286 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-03-02 08:18:46,375 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-03-02 08:18:46,876 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-03-02 08:18:46,878 [Thread-17] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-03-02 08:18:47,887 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201203020643_0001
2012-03-02 08:18:47,887 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://127.0.0.1:50030/jobdetails.jsp?jobid=job_201203020643_0001
2012-03-02 08:18:54,434 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-03-02 08:18:57,484 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-03-02 08:18:57,485 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 
 
HadoopVersion  PigVersion  UserId     StartedAt            FinishedAt           Features
0.20.2         0.9.2       developer  2012-03-02 08:18:44  2012-03-02 08:18:57  UNKNOWN
 
Success!
 
Job Stats (time in seconds):
JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
job_201203020643_0001  1  0  3  3  3  0  0  0  DATA,DATA2  MAP_ONLY  accumulo://webpage_content?instance=inst&user=root&password=secret&zookeepers=127.0.0.1:2181,
 
Input(s):
Successfully read 288 records from: "accumulo://webpage?instance=inst&user=root&password=secret&zookeepers=127.0.0.1:2181&columns=f:cnt"
 
Output(s):
Successfully stored 288 records in: "accumulo://webpage_content?instance=inst&user=root&password=secret&zookeepers=127.0.0.1:2181"
 
Counters:
Total records written : 288
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
 
Job DAG:
job_201203020643_0001
 
 
2012-03-02 08:18:57,492 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
grunt> 

Here are the pig commands run if you don’t want to look through the output above:

-- load just the web content (from the f:cnt column) from the webpage table
DATA = LOAD 
'accumulo://webpage?instance=inst&user=root&password=secret&zookeepers=127.0.0.1:2181&columns=f:cnt' 
   using org.apache.accumulo.pig.AccumuloStorage() AS (row, cf, cq, cv, ts, val);

-- drop the ts field since it is not needed
DATA2 = FOREACH DATA GENERATE row, cf, cq, cv, val;

-- store the data as is in a new table called webpage_content
STORE DATA2 into 
'accumulo://webpage_content?instance=inst&user=root&password=secret&zookeepers=127.0.0.1:2181' 
   using org.apache.accumulo.pig.AccumuloStorage();
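
The location strings used in the LOAD and STORE statements pack the table name and connection options into a URI. The Python sketch below pulls one apart to show which piece is which. It is based only on the examples in this post; `parse_accumulo_location` is a hypothetical helper and does not reflect how AccumuloStorage itself parses the string.

```python
from urllib.parse import parse_qs, urlsplit


def parse_accumulo_location(location):
    """Split an accumulo:// location string into (table name, options)."""
    parts = urlsplit(location)
    if parts.scheme != "accumulo":
        raise ValueError("expected an accumulo:// location")
    # parse_qs returns a list per key; keep the last value given for each.
    options = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, options


if __name__ == "__main__":
    table, options = parse_accumulo_location(
        "accumulo://webpage?instance=inst&user=root&password=secret"
        "&zookeepers=127.0.0.1:2181&columns=f:cnt"
    )
    print(table)
    for key in sorted(options):
        print(key, "=", options[key])
```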

A more detailed blog post explaining how and why this is useful will follow.

–Jason

Update (2012/03/04): you may want to run this as the first line of the pig script:

SET mapred.map.tasks.speculative.execution false

This will avoid ingesting duplicate entries into accumulo. For the data from this post, ingesting duplicate entries wouldn’t cause any real issues because Accumulo’s `VersioningIterator` would only keep the newest copy, but for columns/tables with aggregation configured (e.g. using `LongCombiner`) we definitely don’t want this.
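
To see why duplicates matter in one case and not the other, here is a deliberately simplified in-memory model in Python. It does not use Accumulo's iterator API; `versioning_view` and `summing_view` are hypothetical stand-ins for "keep the newest version" and "sum all versions" behavior.

```python
def versioning_view(entries, max_versions=1):
    """Keep only the newest value(s) per key, like a versioning iterator."""
    by_key = {}
    for key, timestamp, value in entries:
        by_key.setdefault(key, []).append((timestamp, value))
    view = {}
    for key, versions in by_key.items():
        newest_first = sorted(versions, key=lambda tv: tv[0], reverse=True)
        view[key] = [value for _, value in newest_first[:max_versions]]
    return view


def summing_view(entries):
    """Add up every value per key, like a summing combiner."""
    totals = {}
    for key, _timestamp, value in entries:
        totals[key] = totals.get(key, 0) + value
    return totals


if __name__ == "__main__":
    once = [("row1", 100, 5)]
    # A speculative duplicate of the map task writes the same entry again.
    twice = once + [("row1", 101, 5)]
    print(versioning_view(once), versioning_view(twice))
    print(summing_view(once), summing_view(twice))
```

With versioning, the duplicate write leaves the visible value unchanged; with summing, the duplicate is counted twice, which is the problem speculative execution can cause.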

In this post I will outline the steps necessary to use Accumulo and Gora to store content retrieved by Nutch.

###Apache Accumulo


For those of you unfamiliar with Accumulo, it is an incubating Apache project and …

> “Accumulo is a sorted, distributed key/value store based on Google’s BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.”

Accumulo is conceptually very similar to HBase, but it has some nice features that HBase is currently lacking.  Some of these features are:

  • Cell level security
  • No fat row problem - i.e. entire rows don’t need to fit in RAM
  • No limitation on Column Families or when Column Families can be created
  • Server side, data local, programming abstraction called Iterators.  Iterators are incredibly useful for adding functionality to Tablet Servers such as data local filtering, aggregation, and search.  
  • Check out this example application that shows off some of the features mentioned above: THE WIKIPEDIA SEARCH EXAMPLE EXPLAINED, WITH PERFORMANCE NUMBERS.

###Apache Gora

Gora is an object relational/non-relational mapping for arbitrary data stores, including both relational (MySQL) and non-relational data stores (HBase, Cassandra, Accumulo, etc).  It was designed for Big Data applications and has support (interfaces) for Apache Pig, Apache Hive, Cascading, and generic Map/Reduce.

###Apache Nutch

Nutch is a highly scalable web crawler built over Hadoop Map/Reduce.  It was designed from the ground up to be an Internet scale web crawler.  This is a great overview of Nutch’s architecture: Nutch as a Web Data Mining Platform

###Accumulo + Nutch + Gora

I generally prefer git over svn, so in this post I use the source code hosted on github.

####1. Obtain all sources (and Accumulo patch for GORA)

git clone git://github.com/apache/nutch.git
git clone git://github.com/apache/gora.git
git clone git://github.com/apache/accumulo.git
wget https://issues.apache.org/jira/secure/attachment/12507986/GORA-65-1.patch

####2. Configure and Build Each Project:

#####Standard build and maven install for accumulo

cd accumulo
mvn package install

#####Patching GORA for Accumulo Support

Gora needs to be patched to support Accumulo.  This patch should be considered beta, but I found it works well enough for experimenting with Nutch/GORA.  Note: `-DskipTests` is used because some of the tests seemed to hang indefinitely, so I skipped them for now.

cd ../gora
patch -p0 < ../GORA-65-1.patch
mvn package install -DskipTests

#####Building Nutch/GORA

So, getting Nutch/GORA to build was a bit of a challenge.  I will outline some of the hoops I had to jump through.  File paths mentioned below are assuming you are in the nutch project directory.

Run the following commands to check out the nutchgora branch of Nutch.

cd ../nutch
git checkout origin/nutchgora
  • Modify the `ivy/ivy.xml` file. Change the gora-core and gora-sql dependencies' rev from `"0.1.1-incubating"` to `"0.2-SNAPSHOT"`.  This is to match the patched version we just installed. Also, add the following lines:
<dependency org="org.apache.accumulo" name="accumulo-core" rev="1.5.0-incubating-SNAPSHOT" />
<dependency org="org.apache.accumulo" name="cloudtrace" rev="1.5.0-incubating-SNAPSHOT" />
<dependency org="org.apache.thrift" name="libthrift" rev="0.6.1" />
<dependency org="org.apache.gora" name="gora-accumulo" rev="0.2-SNAPSHOT" />
<dependency org="org.apache.zookeeper" name="zookeeper" rev="3.4.3" />
  • Modify the `ivy/ivysettings.xml` file.  Add the following line to the top of the `<resolvers>` section. This will configure ant/ivy to use your local maven repository when resolving dependencies.  This is necessary because the patched version of GORA and the latest Accumulo version are not in any public maven repos.
<ibiblio name="local" root="file://${user.home}/.m2/repository/"
pattern="${maven2.pattern.ext}"  m2compatible="true"  />


  • Patch src/java/org/apache/nutch/storage/StorageUtils.java:
    diff --git a/src/java/org/apache/nutch/storage/StorageUtils.java b/src/java/org/apache/nutch/storage/StorageUtils.java
    index de740b5..19b37ad 100644
    --- a/src/java/org/apache/nutch/storage/StorageUtils.java
    +++ b/src/java/org/apache/nutch/storage/StorageUtils.java
    @@ -40,8 +40,9 @@ public class StorageUtils {
           Class<K> keyClass, Class<V> persistentClass) throws ClassNotFoundException, GoraException {
         Class<? extends DataStore<K, V>> dataStoreClass =
           (Class<? extends DataStore<K, V>>) getDataStoreClass(conf);
    +    
         return DataStoreFactory.createDataStore(dataStoreClass,
    -            keyClass, persistentClass);
    +            keyClass, persistentClass, conf);
       }
     
       @SuppressWarnings("unchecked")
    @@ -56,8 +57,9 @@ public class StorageUtils {
     
         Class<? extends DataStore<K, V>> dataStoreClass =
           (Class<? extends DataStore<K, V>>) getDataStoreClass(conf);
    +    
         return DataStoreFactory.createDataStore(dataStoreClass,
    -            keyClass, persistentClass, schema);
    +            keyClass, persistentClass, conf, schema);
       }
     
       @SuppressWarnings("unchecked")
  • Create the file conf/gora-accumulo-mapping.xml with the following contents:
    <gora-orm>
      <table name="webpage">
      <config key="table.file.compress.blocksize" value="32K"/>
      </table>
    
      <class table="webpage" keyClass="java.lang.String" 
                   name="org.apache.nutch.storage.WebPage">
        <!-- fetch fields                     -->
        <field name="baseUrl" family="f" qualifier="bas"/>
        <field name="status" family="f" qualifier="st"/>
        <field name="prevFetchTime" family="f" qualifier="pts"/>
        <field name="fetchTime" family="f" qualifier="ts"/>
        <field name="fetchInterval" family="f" qualifier="fi"/>
        <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
        <field name="reprUrl" family="f" qualifier="rpr"/>
        <field name="content" family="f" qualifier="cnt"/>
        <field name="contentType" family="f" qualifier="typ"/>
        <field name="protocolStatus" family="f" qualifier="prot"/>
        <field name="modifiedTime" family="f" qualifier="mod"/>
        
        <!-- parse fields                     -->
        <field name="title" family="p" qualifier="t"/>
        <field name="text" family="p" qualifier="c"/>
        <field name="parseStatus" family="p" qualifier="st"/>
        <field name="signature" family="p" qualifier="sig"/>
        <field name="prevSignature" family="p" qualifier="psig"/>
        
        <!-- score fields                     -->
        <field name="score" family="s" qualifier="s"/>
        <field name="headers" family="h"/>
        <field name="inlinks" family="il"/>
        <field name="outlinks" family="ol"/>
        <field name="metadata" family="mtdt"/>
        <field name="markers" family="mk"/>
      </class>
    </gora-orm>
  • Edit conf/gora.properties and add the following lines:
    gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore
    gora.datastore.accumulo.mock=false
    gora.datastore.accumulo.instance=inst
    gora.datastore.accumulo.user=root
    gora.datastore.accumulo.password=secret
    gora.datastore.accumulo.zookeepers=127.0.0.1:2181
  • Edit (or create) conf/nutch-site.xml and add the following property settings to it:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.accumulo.store.AccumuloStore</value>
      </property>
      <property>
        <name>http.agent.name</name>
        <value>Nutch</value>
      </property>
    </configuration>
  • Edit $HOME/.ivy2/cache/jaxen/jaxen/ivy-1.1.3.xml and find and comment out the following lines. If anyone knows a more elegant way to accomplish this, please let me know.
    <!--            <dependency org="maven-plugins" name="maven-cobertura-plugin" rev="1.3" force="true" conf="compile->compile(*),master(*);runtime->runtime(*)">
                            <artifact name="maven-cobertura-plugin" type="plugin" ext="plugin" conf=""/>
                    </dependency>
                    <dependency org="maven-plugins" name="maven-findbugs-plugin" rev="1.3.1" force="true" conf="compile->compile(*),master(*);runtime->runtime(*)">
                            <artifact name="maven-findbugs-plugin" type="plugin" ext="plugin" conf=""/>
                    </dependency>
    -->
    If I don't comment out those dependencies, I get this error during compilation:
    [ivy:resolve]     ::::::::::::::::::::::::::::::::::::::::::::::
    [ivy:resolve]     ::              FAILED DOWNLOADS            ::
    [ivy:resolve]     :: ^ see resolution messages for details  ^ ::
    [ivy:resolve]     ::::::::::::::::::::::::::::::::::::::::::::::
    [ivy:resolve]     :: maven-plugins#maven-cobertura-plugin;1.3!maven-cobertura-plugin.plugin
    [ivy:resolve]     :: maven-plugins#maven-findbugs-plugin;1.3.1!maven-findbugs-plugin.plugin
    [ivy:resolve]     ::::::::::::::::::::::::::::::::::::::::::::::
  • Run the following command:
    ant
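
Looking back at the `conf/gora-accumulo-mapping.xml` file created in the steps above, each `<field>` ties a `WebPage` field to an Accumulo column family and, optionally, a qualifier. The Python sketch below lists those column names for a trimmed copy of the mapping. The helper `field_columns` is hypothetical and only reads the XML; it is not part of Gora. Treating a field without a qualifier as "the whole family" is an inference from the scan output later in this post (for example `il:http://...`), not something taken from Gora's documentation.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the mapping file above, for illustration.
MAPPING_XML = """
<gora-orm>
  <class table="webpage" keyClass="java.lang.String"
         name="org.apache.nutch.storage.WebPage">
    <field name="content" family="f" qualifier="cnt"/>
    <field name="title" family="p" qualifier="t"/>
    <field name="inlinks" family="il"/>
  </class>
</gora-orm>
"""


def field_columns(mapping_xml):
    """Return {field name: "family:qualifier"} for each mapped field."""
    root = ET.fromstring(mapping_xml)
    columns = {}
    for field in root.iter("field"):
        family = field.get("family")
        qualifier = field.get("qualifier")
        # Assumption: fields without a qualifier occupy the whole family.
        columns[field.get("name")] = family if qualifier is None else f"{family}:{qualifier}"
    return columns


if __name__ == "__main__":
    for name, column in field_columns(MAPPING_XML).items():
        print(name, "->", column)
```

Note that `content` maps to `f:cnt`, which is the same column the AccumuloStorage example earlier on this page loads with `columns=f:cnt`.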


####3. Deploy and Run

#####Configure Accumulo and its Dependencies

For this post I am only going to cover the basics for getting these systems to run on a single machine. Deploying and running over a cluster may be covered in another post.

######Configure and Start Hadoop

cd ..
wget ftp://apache.cs.utah.edu/apache.org//hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar zxvf hadoop-0.20.2.tar.gz

Add the following to `hadoop-0.20.2/conf/core-site.xml`:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/${user.name}/tmp/hadoop</value>
  </property>
</configuration>

Add the following to `hadoop-0.20.2/conf/mapred-site.xml`:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>127.0.0.1:9001</value>
  </property>
</configuration>

Set the JAVA_HOME variable in `hadoop-0.20.2/conf/hadoop-env.sh`. Here is an example from my system:

export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26

At this point, you should have a bare bones configured hadoop installation. It is time to start it…

Run the following commands:

mkdir -p ~/tmp/hadoop
cd hadoop-0.20.2
./bin/hadoop namenode -format
./bin/start-all.sh

If you configured everything properly, you should be able to open http://127.0.0.1:50070/dfshealth.jsp in a web browser and see the NameNode status page. If there is a message saying that the Namenode is in safe mode, wait a minute or two and refresh the page. It should go away.

You should also be able to open http://127.0.0.1:50030/jobtracker.jsp in a web browser and see the JobTracker status page.

In both of these status webpages you should be able to see a `1` listed after `"Live Nodes"` and `"Nodes"` respectively.

######Configure and Start Zookeeper

cd ..
wget ftp://apache.cs.utah.edu/apache.org/zookeeper/zookeeper-3.4.3/zookeeper-3.4.3.tar.gz
tar zxvf zookeeper-3.4.3.tar.gz

Add the following to `zookeeper-3.4.3/conf/zoo.cfg`. Create this file if it does not exist. NOTE: replace `_USERNAME_` with your username.

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/home/_USERNAME_/tmp/zookeeper-data
# the port at which the clients will connect
clientPort=2181
maxClientCnxns=100

Edit bin/zkEnv.sh. Right after the following lines:

ZOOBINDIR=${ZOOBINDIR:-/usr/bin}
ZOOKEEPER_PREFIX=${ZOOBINDIR}/..

Add this line:

export ZOO_LOG_DIR=${ZOOKEEPER_PREFIX}/logs

Then run the following commands:

cd zookeeper-3.4.3
export ZOO_LOG_DIR=$HOME/blogpost/zookeeper-3.4.3/logs
mkdir -p $ZOO_LOG_DIR
mkdir -p ~/tmp/zookeeper-data

At this point, you should have a configured zookeeper installation. It’s time to start it…

./bin/zkServer.sh start

If you configured zookeeper and it successfully started you should be able to run the following command:

./bin/zkCli.sh -server 127.0.0.1:2181

It will output a bunch of logging messages; this is fine. Press `<ENTER>`, and then you should be dropped into a shell with a prompt that looks like the following:

[zk: 127.0.0.1:2181(CONNECTED) 0]

Type `ls /` and then `<ENTER>`. You should see a single line of output (followed again by a prompt) that looks like the following:

[zookeeper]

If so, then zookeeper is configured and running properly. When you first run the zkCli.sh command, if you see stack traces that look like this:

java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)

Then it means zookeeper failed to start for some reason, isn’t listening on 127.0.0.1:2181, or you may have a local firewall blocking access to that port.
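
A quick way to tell "not listening" apart from other problems is to try a plain TCP connection to the port. The sketch below is a generic Python check, not a ZooKeeper or Hadoop tool; `port_is_open` is a hypothetical helper, and the port list simply repeats the ports used in this post.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # ZooKeeper client port and the two Hadoop web UIs from earlier steps.
    checks = [("zookeeper", 2181), ("namenode ui", 50070), ("jobtracker ui", 50030)]
    for name, port in checks:
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print(name, port, state)
```

A closed port here points at the service not running (or a firewall), while an open port with a failing client suggests a configuration problem instead.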

######Configure and Start Accumulo

cd ../accumulo
mvn package -P assemble
cp src/assemble/target/accumulo-1.5.0-incubating-SNAPSHOT-dist.tar.gz ../
cd ../
tar zxf accumulo-1.5.0-incubating-SNAPSHOT-dist.tar.gz
cd accumulo-1.5.0-incubating-SNAPSHOT/conf
rename 's/\.example$//' *.example
# basically disabling the custom security policy file for now
# since I had issues getting accumulo to work with it enabled
mv accumulo.policy accumulo.policy.example 

Edit `accumulo-env.sh`. At the top of the file, define HADOOP_HOME, ZOOKEEPER_HOME, and JAVA_HOME. Here is an example:

export HADOOP_HOME=$HOME/hadoop-0.20.2
export ZOOKEEPER_HOME=$HOME/zookeeper-3.4.3
export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26

Edit `accumulo-site.xml`.

Set the `logger.dir.walog` property.

<property>
  <name>logger.dir.walog</name>
  <value>$HOME/tmp/walogs</value>
</property>

Set the `instance.secret` property.

<property>
  <name>instance.secret</name>
  <value>SOME-PASSWORD-OF-YOUR-CHOOSING</value>
</property>

Run the following commands:

mkdir $HOME/tmp/walogs
cd ../

At this point you should have a fully configured Accumulo installation. It is time to initialize it and start it…

./bin/accumulo init

You should see similar output to this. I set my instance name to `"inst"` and my password to `"secret"`. You may want to do the same for the sake of this tutorial or make sure to set the correct config parameters later.

23 08:15:26,635 [util.Initialize] INFO : Hadoop Filesystem is hdfs://127.0.0.1/
23 08:15:26,637 [util.Initialize] INFO : Accumulo data dir is /accumulo
23 08:15:26,637 [util.Initialize] INFO : Zookeeper server is localhost:2181

Warning!!! Your instance secret is still set to the default, this is not secure. 
We highly recommend you change it.

You can change the instance secret in accumulo by using: bin/accumulo 
org.apache.accumulo.server.util.ChangeSecret oldPassword newPassword. 
You will also need to edit your secret in your configuration file by adding the property 
instance.secret to your conf/accumulo-site.xml. Without this accumulo will not operate correctly
Instance name : inst
Enter initial password for root: ******
Confirm initial password for root: *****
23 08:15:34,100 [util.NativeCodeLoader] INFO : Loaded the native-hadoop library
23 08:15:34,337 [security.ZKAuthenticator] INFO : Initialized root user with 
username: root at the request of user !SYSTEM

Note: If it appears to hang after you entered the instance name, zookeeper may not be running. `<CTRL>-C` the accumulo init and make sure zookeeper is running.

Now run:

./bin/start-all.sh

After this finishes, you should be able to open http://127.0.0.1:50095/ in a web browser and see the Accumulo monitor page.

The important items to note on this page are that there is a 1 after Tablet Servers, Live Data Nodes, and Trackers in the “Accumulo Master”, “NameNode”, and “JobTracker” tables, respectively. There should also be an entry in the “Zookeeper” table.

####4. Crawl

At this point, you should have a fully functional Hadoop, Zookeeper, and Accumulo install, so we are ready to run a Nutch web crawl. Create a file with URLs, one per line, call it seeds.txt and place it in your home directory. I added the following URLs to my seeds file:

http://projects.apache.org/indexes/alpha.html
http://www.dmoz.org/Arts/People/

Run the following commands:

cd ../nutch/
./runtime/local/bin/nutch crawl file://$HOME/seeds.txt -depth 1

You should see some log messages printed to the console, but hopefully no stack traces. If you see a stack trace, you may need to go back and check your configs to make sure they match the ones we created earlier.
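
One way to do that config check without eyeballing XML is to read the `<property>` entries and compare them with the values used in this post. The Python sketch below is a generic helper, not a Hadoop or Nutch tool. `read_site_properties` and `find_mismatches` are hypothetical names, and the demo uses an inline string where you would normally read a file such as `hadoop-0.20.2/conf/mapred-site.xml`.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Return {name: value} for each <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            properties[name] = value
    return properties


def find_mismatches(actual, expected):
    """Return {name: (expected, actual)} for settings that differ or are missing."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


if __name__ == "__main__":
    # Stand-in for the contents of mapred-site.xml from the setup steps.
    mapred_site = """
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>127.0.0.1:9001</value>
      </property>
    </configuration>
    """
    expected = {"mapred.job.tracker": "127.0.0.1:9001"}
    print(find_mismatches(read_site_properties(mapred_site), expected))
```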

After the crawler finishes, you should be able to explore the crawled data using the accumulo shell.

cd ../accumulo-1.5.0-incubating-SNAPSHOT/bin/
./accumulo shell -u root -p secret
Shell - Accumulo Interactive Shell
- 
- version: 1.5.0-incubating-SNAPSHOT
- instance name: inst
- instance id: ce63fe79-6624-46c7-98a5-c6a98b8cfcef
- 
- type 'help' for a list of available commands
- 
root@inst> table webpage
root@inst webpage> scan
org.apache.projects:http/categories.html f:fi []    \x00'\x8D\x00
org.apache.projects:http/categories.html f:st []    \x00\x00\x00\x01
org.apache.projects:http/categories.html f:ts []    \x00\x00\x015\xC3\xA8\xCD\x90
org.apache.projects:http/categories.html il:http://projects.apache.org/indexes/alpha.html []    Categories
org.apache.projects:http/categories.html mk:_gnmrk_ []    1330427514-1313139543
org.apache.projects:http/categories.html mtdt:_csh_ []    <\x16O\xDA
org.apache.projects:http/categories.html s:s []    <\x16O\xDA
org.apache.projects:http/create.html f:fi []    \x00'\x8D\x00
org.apache.projects:http/create.html f:st []    \x00\x00\x00\x01
org.apache.projects:http/create.html f:ts []    \x00\x00\x015\xC3\xA8\xCD\x92
org.apache.projects:http/create.html il:http://projects.apache.org/indexes/alpha.html []    Create a DOAP File
org.apache.projects:http/create.html mk:_gnmrk_ []    1330427514-1313139543
org.apache.projects:http/create.html mtdt:_csh_ []    <\x16O\xDA
org.apache.projects:http/create.html s:s []    <\x16O\xDA
org.apache.projects:http/doap.html f:fi []    \x00'\x8D\x00
org.apache.projects:http/doap.html f:st []    \x00\x00\x00\x01
org.apache.projects:http/doap.html f:ts []    \x00\x00\x015\xC3\xA8\xCD\x92
org.apache.projects:http/doap.html il:http://projects.apache.org/indexes/alpha.html []    DOAP Files
org.apache.projects:http/doap.html mk:_gnmrk_ []    1330427514-1313139543
org.apache.projects:http/doap.html mtdt:_csh_ []    <\x16O\xDA
org.apache.projects:http/doap.html s:s []    <\x16O\xDA
org.apache.projects:http/doapfaq.html f:fi []    \x00'\x8D\x00
org.apache.projects:http/doapfaq.html f:st []    \x00\x00\x00\x01
org.apache.projects:http/doapfaq.html f:ts []    \x00\x00\x015\xC3\xA8\xCD\x93
org.apache.projects:http/doapfaq.html il:http://projects.apache.org/indexes/alpha.html []    DOAP File FAQ
org.apache.projects:http/doapfaq.html mk:_gnmrk_ []    1330427514-1313139543
org.apache.projects:http/doapfaq.html mtdt:_csh_ []    <\x16O\xDA
org.apache.projects:http/doapfaq.html s:s []    <\x16O\xDA
org.apache.projects:http/docs/dependancies.html f:fi []    \x00'\x8D\x00
org.apache.projects:http/docs/dependancies.html f:st []    \x00\x00\x00\x01
org.apache.projects:http/docs/dependancies.html f:ts []    \x00\x00\x015\xC3\xA8\xCD\x94
org.apache.projects:http/docs/dependancies.html il:http://projects.apache.org/indexes/alpha.html []    Dependencies
org.apache.projects:http/docs/dependancies.html mk:_gnmrk_ []    1330427514-1313139543
org.apache.projects:http/docs/dependancies.html mtdt:_csh_ []    <\x16O\xDA
org.apache.projects:http/docs/dependancies.html s:s []    <\x16O\xDA
org.apache.projects:http/docs/index.html f:fi []    \x00'\x8D\x00
org.apache.projects:http/docs/index.html f:st []    \x00\x00\x00\x01
org.apache.projects:http/docs/index.html f:ts []    \x00\x00\x015\xC3\xA8\xCD\x96
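
The scan output above is easier to read once the row keys and binary values are decoded. The Python sketch below makes two assumptions drawn only from the output itself: row keys are the URL with the host name reversed, and numeric values are big-endian integers whose width matches the byte count shown (4 bytes for `f:fi`, 8 bytes for `f:ts`). The helper names are hypothetical and this is not Nutch's or Gora's own decoding code.

```python
import re
import struct
from datetime import datetime, timezone

# Matches the \xNN escapes the shell prints for non-printable bytes.
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def unescape_shell_value(text):
    """Turn escaped shell output back into raw bytes."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        out += text[pos:match.start()].encode("latin-1")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += text[pos:].encode("latin-1")
    return bytes(out)


def unreverse_url(row):
    """Turn a row key like org.apache.projects:http/x.html into a URL."""
    reversed_host, rest = row.split(":", 1)
    protocol_and_port, _, path = rest.partition("/")
    protocol, _, port = protocol_and_port.partition(":")
    host = ".".join(reversed(reversed_host.split(".")))
    if port:
        host = f"{host}:{port}"
    return f"{protocol}://{host}/{path}"


if __name__ == "__main__":
    print(unreverse_url("org.apache.projects:http/categories.html"))
    # f:fi, assumed to be a 4-byte big-endian int (fetch interval in seconds).
    print(struct.unpack(">i", unescape_shell_value(r"\x00'\x8D\x00"))[0])
    # f:ts, assumed to be an 8-byte big-endian long (milliseconds since epoch).
    fetch_ms = struct.unpack(">q", unescape_shell_value(r"\x00\x00\x015\xC3\xA8\xCD\x90"))[0]
    print(datetime.fromtimestamp(fetch_ms / 1000, tz=timezone.utc))
```

Under those assumptions, the `f:ts` value decodes to a millisecond timestamp whose seconds part matches the `1330427514` prefix in the `mk:_gnmrk_` values, which suggests the interpretation is at least plausible.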

Further details and exploration of this data in Accumulo will have to wait for another blog post.

I ended up posting all the modified code from Gora (accumulo patch) and Nutchgora (patches for getting gora-accumulo working) to my github. Check it out.

Let me know if you have any questions…

–Jason
@jason_trost

Update (3/2): someone told me that they had problems getting nutch to build and that this patch worked for them (even though the patch is for GORA). I would be curious if anyone else has this same issue. Here is the error they encountered when building with `ant`:

        
[ivy:resolve] :: problems summary ::
[ivy:resolve] :::: WARNINGS
[ivy:resolve]           ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]           ::          UNRESOLVED DEPENDENCIES         ::
[ivy:resolve]           ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]           :: log4j#log4j;1.2.15: configuration not found in log4j#log4j;1.2.15: 'master'.

My name is Jason Trost. I am a developer and security researcher deeply interested in Big Data/cloud computing and machine learning. I have a few years of experience using Hadoop and MapReduce to process and analyze computer network and security data. I also have experience developing applications with Apache Accumulo and Backtype’s Storm. I like Java and Python, as well as UNIX shell scripting. This blog is going to cover topics ranging from processing data with Hadoop and MapReduce to applied machine learning techniques to interesting hacking and computer network defense tools I encounter. I am always open to requests for blog posts…

Feel free to follow me on Twitter and/or Github.

  • Github: jt6211
  • Twitter: @jason_trost

–Jason