1. Accumulo and Pig


    I just posted a functional AccumuloStorage module to GitHub.

    Here’s how you use it (this is also in the GitHub README):

    1. Build the JAR

    Note: you will need to download the Accumulo source, build it, and install it into your local Maven repository before this will work.

    mvn package
    

    This will create a JAR file here: target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar

    2. Download the JARs needed by Pig

    mvn dependency:copy-dependencies -DoutputDirectory=lib  \
        -DincludeArtifactIds=zookeeper,libthrift,accumulo-core,cloudtrace
    

    This should have copied the needed dependency jars into a lib directory.
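
    To confirm that the copy worked, something like the following could be used. It is only a sketch: the function name check_jars is made up for this post, and it assumes the JAR file names start with the artifact IDs passed to -DincludeArtifactIds above.

```shell
# Sketch: report which of the expected dependency JARs are missing from a directory.
# check_jars is an illustrative helper, not part of accumulo-pig.
check_jars() {
    dir="${1:-lib}"
    missing=0
    for artifact in zookeeper libthrift accumulo-core cloudtrace; do
        # Matches e.g. lib/zookeeper-3.3.1.jar; ls fails if nothing matches.
        if ! ls "$dir"/"$artifact"-*.jar >/dev/null 2>&1; then
            echo "missing: $artifact"
            missing=1
        fi
    done
    return "$missing"
}
```

    Running check_jars lib prints one line per missing artifact and returns non-zero if any are absent.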

    3. Print the register statements we will need in Pig

    for JAR in lib/*.jar target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar; do
        echo "register $(pwd)/$JAR"
    done
    

    Here is some example output

    register /home/developer/workspace/accumulo-pig/lib/accumulo-core-1.5.0-incubating-SNAPSHOT.jar
    register /home/developer/workspace/accumulo-pig/lib/cloudtrace-1.5.0-incubating-SNAPSHOT.jar
    register /home/developer/workspace/accumulo-pig/lib/libthrift-0.6.1.jar
    register /home/developer/workspace/accumulo-pig/lib/zookeeper-3.3.1.jar
    register /home/developer/workspace/accumulo-pig/target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar
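
    If you would rather keep the register statements in a file than paste them by hand, a small helper along these lines might work. This is a sketch: write_registers and the file name registers.pig are names invented here, and each line is terminated with a semicolon, as Pig statements in a script file usually are.

```shell
# Sketch: write one "register <absolute path>;" line per JAR into a Pig file.
# write_registers and the output file name are illustrative choices.
write_registers() {
    out="$1"; shift
    : > "$out"   # start with an empty file
    for jar in "$@"; do
        # Resolve to an absolute path so Pig can be started from any directory.
        echo "register $(cd "$(dirname "$jar")" && pwd)/$(basename "$jar");" >> "$out"
    done
}
```

    For example: write_registers registers.pig lib/*.jar target/accumulo-pig-1.5.0-incubating-SNAPSHOT.jar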
    

    4. Run Pig

    Copy the register statements above and paste them into the Pig terminal. Then you can LOAD from and STORE into Accumulo.

    $ pig
    

    Here are the Pig commands to run at the prompt:

    -- load just the web content (from the f:cnt column) from the webpage table
    DATA = LOAD 
    'accumulo://webpage?instance=inst&user=root&password=secret&zookeepers=127.0.0.1:2181&columns=f:cnt' 
       using org.apache.accumulo.pig.AccumuloStorage() AS (row, cf, cq, cv, ts, val);
    
    -- drop the ts field since it is not needed
    DATA2 = FOREACH DATA GENERATE row, cf, cq, cv, val;
    
    -- store the data as is in a new table called webpage_content
    STORE DATA2 into 
    'accumulo://webpage_content?instance=inst&user=root&password=secret&zookeepers=127.0.0.1:2181' 
       using org.apache.accumulo.pig.AccumuloStorage();
    

    A more detailed post on how and why this is useful will follow.

    —Jason

    Update (2012/03/04): you may want to run this as the first line of the Pig script:

    SET mapred.map.tasks.speculative.execution false
    

    This will avoid ingesting duplicate entries into Accumulo. For the data from this post, duplicate entries wouldn’t cause any real issues, because Accumulo’s VersioningIterator would only keep the newest copy. For columns/tables with aggregation configured (e.g. using LongCombiner), though, every duplicate would be folded into the aggregate again, and we definitely don’t want that.
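
    The difference can be illustrated with a toy simulation in plain awk. This is not Accumulo code: the "key timestamp value" line layout and the two function names are made up for this sketch. keep_newest mimics keeping only the newest version of each key, and sum_values mimics a summing combiner.

```shell
# Toy simulation, not Accumulo code. Each input line is: key timestamp value

# Mimics version-based cleanup: keep only the newest value for each key.
keep_newest() {
    awk '{
        if (!($1 in ts) || $2 + 0 > ts[$1]) { ts[$1] = $2 + 0; val[$1] = $3 }
    } END { for (k in val) print k, val[k] }' | sort
}

# Mimics a summing combiner: every version of a key is added together.
sum_values() {
    awk '{ sum[$1] += $3 } END { for (k in sum) print k, sum[k] }' | sort
}

once=$(printf 'row1 1 5\nrow2 1 7\n')
# The same entries written a second time, e.g. by a speculative map task.
twice=$(printf 'row1 1 5\nrow2 1 7\nrow1 2 5\nrow2 2 7\n')

echo "$once"  | keep_newest   # row1 5, row2 7
echo "$twice" | keep_newest   # still row1 5, row2 7
echo "$once"  | sum_values    # row1 5, row2 7
echo "$twice" | sum_values    # row1 10, row2 14: the duplicates are counted twice
```

    With versioning, the duplicate write leaves the result unchanged; with summing, it doubles every total.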
