This is a quick post about a little framework I wrote for using Hadoop to analyze lots of small files. It may not be the optimal way of doing this, but it works well and makes repeated analysis tasks easy and scalable.
I recently needed a quick way to analyze millions of small binary files (from 100 KB to 19 MB each), and I wanted to be able to repeat this sort of analysis at scale. I chose Hadoop as the platform and built this little framework (really, a single MapReduce job) to do it. This is very much a work in progress, and feedback and pull requests are welcome.
The main MapReduce job in this framework accepts a Sequence file of key/value pairs, where the key
is a name and the value
is the contents of a file. The framework unpacks the bytes of the value
to the local filesystem of the mapper it is running on, allowing the mapper to run
arbitrary analysis tools that require local filesystem access. The framework then captures stdout and stderr from the
analysis tool/script and stores it (how it stores it is pluggable, see
This example block instructs the framework to run
using the args of
is replaced by the unpacked filename from the Sequence File. If multiple command line args are required,
they can be specified by appending a delimiter and then each arg to the value of the
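
To make the unpack-and-run flow described above concrete, here is a minimal local sketch in Python. It is not the framework's code and does not use Hadoop. It only illustrates the idea of writing a record's bytes to local disk, substituting the unpacked filename into the arguments, running a tool, and capturing stdout and stderr. The placeholder token `{file}`, the `,` delimiter, and the function names are all assumptions made up for this illustration; the framework's actual token, delimiter, and property names may differ.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical values chosen for this sketch only; not the framework's real ones.
PLACEHOLDER = "{file}"
DELIMITER = ","


def build_command(tool, arg_spec, filename):
    """Split arg_spec on the delimiter and substitute the unpacked filename."""
    args = [a.replace(PLACEHOLDER, filename) for a in arg_spec.split(DELIMITER) if a]
    return [tool] + args


def analyze_record(name, content, tool, arg_spec):
    """Unpack one (name, bytes) record to local disk, run the tool, capture its output."""
    with tempfile.TemporaryDirectory() as workdir:
        # Keep only the final path component so a record name cannot escape workdir.
        local_path = Path(workdir) / Path(name).name
        local_path.write_bytes(content)
        cmd = build_command(tool, arg_spec, str(local_path))
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in "analysis tool": the Python interpreter printing the file size in bytes.
    spec = "-c,import sys; import os; print(os.path.getsize(sys.argv[1])),{file}"
    code, out, err = analyze_record("sample.bin", b"hello", sys.executable, spec)
    print(code, out.strip(), err.strip())
```

In a real mapper the same steps would run once per record, with the captured output handed to whatever storage mechanism is configured. The temporary directory is removed after each record, which keeps mapper-local disk usage bounded when processing millions of files.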