I just posted a functional AccumuloStorage module to GitHub.
Here’s how you use it (also in the GitHub README):
###1. Build the JAR
Note: you will need to download the Accumulo source, build it, and install it into your local Maven repository before this will work.
This will create a JAR file here:
###2. Download the JARs needed by Pig
This should have copied the needed dependency jars into a
###3. Print the register statements we will need in Pig
Here is some example output
###4. Run Pig
Copy the register statements above and paste them into the Pig terminal. Then you can LOAD from and STORE into Accumulo.
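If the register statements ever need to be regenerated, for example after the dependency directory changes, a small script can print them. The following is only a minimal sketch in Python: the `register_statements` helper and the default `lib` directory are hypothetical names chosen for illustration and are not part of the AccumuloStorage repository. It assumes all the needed jars sit in a single directory.

```python
from pathlib import Path
import sys


def register_statements(jar_dir):
    """Return one Pig REGISTER statement per .jar file in jar_dir.

    Hypothetical helper for illustration only. Jars are sorted by name so
    the output is stable between runs; a missing directory yields no lines.
    """
    directory = Path(jar_dir)
    if not directory.is_dir():
        return []
    jars = sorted(directory.glob("*.jar"))
    # Absolute paths let the statements be pasted from any working directory.
    return ["REGISTER {};".format(jar.resolve()) for jar in jars]


if __name__ == "__main__":
    # "lib" is an assumed default; pass the real jar directory as an argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "lib"
    for statement in register_statements(target):
        print(statement)
```

Each printed line can be pasted into the Pig terminal as-is.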
Here are the Pig commands that were run, if you don’t want to look through the output above:
A more detailed blog post on how and why this is useful will follow.
Update (2012/03/04): you may want to run this as the first line of the Pig script:
This will avoid ingesting duplicate entries into Accumulo. For the data from this post, ingesting duplicate entries wouldn’t cause any real issues because Accumulo’s
would only keep the newest copy, but for columns/tables with aggregation configured (e.g. using
) we definitely don’t want this.
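To make the difference concrete, here is a toy model in plain Python. It is a sketch of the idea only and does not use Accumulo's API: `newest_only` and `summed` are hypothetical stand-ins for a keep-the-newest-version policy and a summing aggregation, and the keys and values are made up.

```python
def newest_only(entries):
    """Keep the value with the highest timestamp for each key."""
    latest = {}
    for key, timestamp, value in entries:
        if key not in latest or timestamp >= latest[key][0]:
            latest[key] = (timestamp, value)
    return {key: value for key, (_, value) in latest.items()}


def summed(entries):
    """Add up every value seen for each key, duplicates included."""
    totals = {}
    for key, _, value in entries:
        totals[key] = totals.get(key, 0) + value
    return totals


# The same two entries ingested once, and then ingested a second time later.
once = [("row1:count", 1, 5), ("row2:count", 1, 3)]
twice = once + [(key, ts + 1, value) for key, ts, value in once]

# Keeping only the newest copy hides the duplicate ingest.
print(newest_only(once) == newest_only(twice))  # True

# A summing aggregation counts every duplicate again.
print(summed(once))   # {'row1:count': 5, 'row2:count': 3}
print(summed(twice))  # {'row1:count': 10, 'row2:count': 6}
```

With the keep-newest policy the second ingest changes nothing, while the summed totals double.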