
April 13, 2012

Hadoop and Dumbo

Dumbo is a Python framework for writing MapReduce flows with or without Hadoop. Until now it has been a pain to get going, because it relied on a number of patches to Hadoop (for different byte streams, type codes etc.) to make it work. No longer: the necessary patches have now made it into Hadoop core as of 1.0.2.
On Ubuntu 12.04 all I needed was the Debian package from here (installed as per these instructions), and then to run sudo easy_install dumbo.
The only catch is that Dumbo does not currently recognise the directory layout used by the Hadoop Debian package, so it cannot find the jars it needs (such as the streaming jar). I had to make a one-line patch to compensate:
diff --git a/dumbo/util.py b/dumbo/util.py
index a57166d..cd35df3 100644
--- a/dumbo/util.py
+++ b/dumbo/util.py
@@ -267,6 +267,7 @@ def findjar(hadoop, name):
     hadoop home directory and component base name (e.g 'streaming')"""
 
     jardir_candidates = filter(os.path.exists, [
+        os.path.join(hadoop, 'share', 'hadoop', 'contrib', name),
         os.path.join(hadoop, 'mapred', 'build', 'contrib', name),
         os.path.join(hadoop, 'build', 'contrib', name),
         os.path.join(hadoop, 'mapred', 'contrib', name, 'lib'),
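
To check whether a given Hadoop install needs this patch, the directory search can be reproduced outside Dumbo. This is a minimal sketch, not Dumbo's own code: existing_jar_dirs is a hypothetical helper name, and it only mirrors the four candidate paths visible in the diff above (the real list in dumbo/util.py continues beyond the context shown).

```python
import os


def existing_jar_dirs(hadoop, name):
    """Return the candidate jar directories that exist, in search order.

    hadoop is the Hadoop home directory (e.g. '/usr' for the Debian package)
    and name is the component base name (e.g. 'streaming').
    """
    candidates = [
        # Layout used by the Debian package (the line added by the patch).
        os.path.join(hadoop, 'share', 'hadoop', 'contrib', name),
        # Layouts already known to Dumbo, as shown in the diff context.
        os.path.join(hadoop, 'mapred', 'build', 'contrib', name),
        os.path.join(hadoop, 'build', 'contrib', name),
        os.path.join(hadoop, 'mapred', 'contrib', name, 'lib'),
    ]
    return [d for d in candidates if os.path.exists(d)]


if __name__ == "__main__":
    # '/usr' matches the -hadoop /usr argument used in this post.
    for directory in existing_jar_dirs('/usr', 'streaming'):
        print(directory)
```

If the only directory printed is the one under share/hadoop/contrib, an unpatched Dumbo will most likely fail to find the streaming jar.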

Then run the quick tutorial example from here like so:
hadoop fs -copyFromLocal /var/log/apache2/access.log /user/hduser/access.log
hadoop fs -ls /user/hduser/
dumbo start ipcount.py -hadoop /usr -input /user/hduser/access.log -output ipcounts
dumbo cat ipcounts/part* -hadoop /usr | sort -k2,2nr | head -n 5
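
For anyone without the tutorial to hand, ipcount.py is essentially a mapper that emits the first field of each access log line (the client IP) with a count of 1, and a reducer that sums the counts. The following is a sketch from memory of the tutorial rather than a verbatim copy. The run_locally helper is hypothetical and not part of Dumbo; it is only there so the logic can be tried without Hadoop.

```python
from collections import defaultdict


def mapper(key, value):
    # key is the position of the line in the input, value is the line itself.
    # The first space-separated field of an Apache access log line is the IP.
    yield value.split(" ")[0], 1


def reducer(key, values):
    # values iterates over all the counts emitted for this IP.
    yield key, sum(values)


def run_locally(lines):
    """Hypothetical helper: emulate map, shuffle and reduce in-process."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for ip, count in mapper(offset, line):
            grouped[ip].append(count)
    results = []
    for ip in sorted(grouped):
        results.extend(reducer(ip, grouped[ip]))
    return results


if __name__ == "__main__":
    try:
        import dumbo  # present when the script is launched via `dumbo start`
    except ImportError:
        dumbo = None
    if dumbo is not None:
        # Reusing the reducer as a combiner is safe because summing is associative.
        dumbo.run(mapper, reducer, combiner=reducer)
```

The sort -k2,2nr | head -n 5 in the last command then orders the (IP, count) pairs by count, descending, and keeps the top five.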

Posted by PiersHarding at April 13, 2012 5:20 PM