
April 28, 2012

R and Hadoop

R is my hacker's language of choice for analysis work. It really appeals to my sense of iteratively refining a solution. To my delight, I stumbled across RHadoop, a set of libraries for calling out to Hadoop MapReduce, HDFS, and HBase directly from R.
It was surprisingly easy to get going, especially with some patient help from Antonio, the project owner. RHadoop relies on the same fixes that Dumbo requires, but the game changer here is that from Hadoop 1.0.2, all the key patches that both require are now part of core.
The thing that tripped me up was a custom .Rprofile file I was using to load and print things when R starts up. This caused R to write to stdout, which is what Hadoop streaming uses to pass data between tasks. The stray output corrupted the data transfer, which killed RHadoop with a weird Java heap error. Anyway, once that was sorted out, everything ran smoothly, and I like the intuitive, R-esque way things are handled. For example, take this session from the tutorial:

> library(rmr)
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
> small.ints = to.dfs(1:10)
Warning: $HADOOP_HOME is deprecated.

12/04/28 19:17:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/04/28 19:17:45 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/04/28 19:17:45 INFO compress.CodecPool: Got brand-new compressor
> out = mapreduce(input = small.ints, map = function(k,v) keyval(v, v^2))
Warning: $HADOOP_HOME is deprecated.

packageJobJar: [/tmp/RtmpXlELmY/rmr-local-env, /tmp/RtmpXlELmY/rmr-global-env, 
                              /home/piers/hadoop/tmp/hadoop-unjar1509588906818235502/] []
                              /tmp/streamjob912555254031649512.jar tmpDir=null
12/04/28 19:18:04 INFO mapred.FileInputFormat: Total input paths to process : 1
12/04/28 19:18:05 INFO streaming.StreamJob: getLocalDirs(): [/home/piers/hadoop/tmp/mapred/local]
12/04/28 19:18:05 INFO streaming.StreamJob: Running job: job_201204281916_0001
12/04/28 19:18:05 INFO streaming.StreamJob: To kill this job, run:
12/04/28 19:18:05 INFO streaming.StreamJob: /usr/libexec/../bin/hadoop job 
                             -Dmapred.job.tracker= -kill job_201204281916_0001
12/04/28 19:18:15 INFO streaming.StreamJob: Tracking URL:
12/04/28 19:18:16 INFO streaming.StreamJob:  map 0%  reduce 0%
12/04/28 19:18:45 INFO streaming.StreamJob:  map 100%  reduce 0%
12/04/28 19:18:54 INFO streaming.StreamJob:  map 100%  reduce 17%
12/04/28 19:18:57 INFO streaming.StreamJob:  map 100%  reduce 67%
12/04/28 19:19:06 INFO streaming.StreamJob:  map 100%  reduce 100%
12/04/28 19:19:21 INFO streaming.StreamJob: Job complete: job_201204281916_0001
12/04/28 19:19:21 INFO streaming.StreamJob: Output: /tmp/RtmpXlELmY/file2cf7546b881b
> from.dfs('/tmp/RtmpXlELmY/file2cf7546b881b')
Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

[1] 1

[1] 1

[1] TRUE

[1] 2

[1] 4
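
To make the .Rprofile problem above concrete, here is a minimal Python sketch of why stray startup output breaks a stdout-based pipeline. It is an illustration only: the tab-separated text records and the names parse_records and run_task are assumptions made for this example, and rmr's real wire format is different.

```python
import io


def parse_records(stream):
    """Parse 'key<TAB>value' integer records, as a downstream task might."""
    records = []
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        records.append((int(key), int(value)))
    return records


def run_task(out, startup_banner=None):
    """Hypothetical task: write the squares of 1..3, optionally after a banner."""
    if startup_banner is not None:
        # This is the equivalent of a chatty startup file printing to stdout.
        out.write(startup_banner + "\n")
    for v in range(1, 4):
        out.write("%d\t%d\n" % (v, v * v))


# A quiet task produces a stream the consumer can parse.
clean = io.StringIO()
run_task(clean)
clean.seek(0)
print(parse_records(clean))

# A task that prints a banner first poisons the very first record.
noisy = io.StringIO()
run_task(noisy, startup_banner="Loading my favourite things...")
noisy.seek(0)
try:
    parse_records(noisy)
except ValueError as exc:
    print("stream corrupted:", exc)
```

The consumer has no way to tell a banner from data, so the failure shows up far from its cause. One way to keep a startup file quiet under streaming is to wrap its chatty parts in `if (interactive())`, since R reports non-interactive when run as a streaming task.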

Posted by PiersHarding at 8:04 PM

April 14, 2012

Moodle bulk management of users, courses, and course categories

One of the (many) new features of Moodle 2.2 is the ability to create administration tool plugins. This enables us developers to create and package (hopefully) useful tools that make the management of Moodle easier. One of the things that I've seen wished for is the ability to bulk upload courses and related material, and this is something that I've been working on over recent months.

The key things that people want (from an administration point of view) are to manage people and courses. Often these activities are a tiresome bulk process at set times of the year, with relatively minor tweaking in between. For managing users (create/update/delete) and enrolments, we already have the built-in bulk user upload. I have added equivalent tools for courses and for course categories.
The course upload admin tool can be used to create and manage course outlines, but it can also populate courses using either a nominated course as a template (copies the course contents using the Moodle backup/restore facility), or populate the course from a Moodle backup file.

Posted by PiersHarding at 8:41 PM

April 13, 2012

Hadoop and Dumbo

Dumbo is a Python framework for writing MapReduce flows with or without Hadoop. Until now it has been a pain to get going, as it relied on a number of patches to Hadoop (for different byte streams, type codes, etc.) to make it work. No longer: the necessary patches have now made it into core as of 1.0.2.
On Ubuntu 12.04 all I needed was the Debian package from here (installed as per these instructions), followed by sudo easy_install dumbo .
The only catch is that Dumbo does not currently recognise the Debian package layout used by the Hadoop package maintainers, so I found that I had to make a one-line patch to compensate for it:
diff --git a/dumbo/util.py b/dumbo/util.py
index a57166d..cd35df3 100644
--- a/dumbo/util.py
+++ b/dumbo/util.py
@@ -267,6 +267,7 @@ def findjar(hadoop, name):
     hadoop home directory and component base name (e.g 'streaming')"""
     jardir_candidates = filter(os.path.exists, [
+        os.path.join(hadoop, 'share', 'hadoop', 'contrib', name),
         os.path.join(hadoop, 'mapred', 'build', 'contrib', name),
         os.path.join(hadoop, 'build', 'contrib', name),
         os.path.join(hadoop, 'mapred', 'contrib', name, 'lib'),
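
For readers who have not looked at the surrounding code, the patch adds one more directory to a list of places that are searched for the streaming jar. The following is a rough, standalone sketch of that kind of lookup, not Dumbo's actual implementation: the function name find_jar and the jar file-name pattern are assumptions made for illustration, and only the four candidate paths come from the diff above.

```python
import os
import re


def find_jar(hadoop_home, name):
    """Return the first matching jar under the candidate dirs, or None.

    Sketch only: the file-name pattern below is an assumption.
    """
    candidates = [
        # Debian package layout, the line added by the patch.
        os.path.join(hadoop_home, "share", "hadoop", "contrib", name),
        os.path.join(hadoop_home, "mapred", "build", "contrib", name),
        os.path.join(hadoop_home, "build", "contrib", name),
        os.path.join(hadoop_home, "mapred", "contrib", name, "lib"),
    ]
    pattern = re.compile(r"hadoop.*%s.*\.jar$" % re.escape(name))
    # Only directories that actually exist are searched, in order.
    for jardir in filter(os.path.exists, candidates):
        for filename in sorted(os.listdir(jardir)):
            if pattern.match(filename):
                return os.path.join(jardir, filename)
    return None
```

With the Hadoop home given as /usr, the added candidate resolves to /usr/share/hadoop/contrib/streaming, which is where the Debian package layout is assumed to put the streaming jar.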

And then run the quick tutorial example from here like so:
hadoop fs -copyFromLocal /var/log/apache2/access.log /user/hduser/access.log
hadoop fs -ls /user/hduser/
dumbo start ipcount.py -hadoop /usr -input /user/hduser/access.log -output ipcounts
dumbo cat ipcounts/part* -hadoop /usr | sort -k2,2nr | head -n 5
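
The post does not reproduce ipcount.py, so here is a sketch of what such a job computes, runnable locally without Hadoop or Dumbo installed. The mapper and reducer are assumed to resemble the tutorial's (count requests per client IP, taken as the first field of each log line); run_local and top are hypothetical helpers standing in for the framework and for the sort/head pipeline above.

```python
from collections import defaultdict


def mapper(key, value):
    # value is one access-log line; the client IP is assumed to be
    # the first space-separated field.
    yield value.split(" ")[0], 1


def reducer(key, values):
    # Sum the 1s emitted for each IP.
    yield key, sum(values)


def run_local(lines):
    """Tiny in-memory stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for k, v in mapper(offset, line):
            groups[k].append(v)
    counts = []
    for k in sorted(groups):
        counts.extend(reducer(k, groups[k]))
    return counts


def top(counts, n=5):
    """Equivalent of `sort -k2,2nr | head -n 5`: highest counts first."""
    return sorted(counts, key=lambda kv: kv[1], reverse=True)[:n]


sample = [
    "10.0.0.1 - - GET /",
    "10.0.0.2 - - GET /a",
    "10.0.0.1 - - GET /b",
]
print(top(run_local(sample)))
```

Under Dumbo the same two generator functions would be handed to the framework instead of run_local, and the grouping by key would happen in Hadoop's shuffle.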

Posted by PiersHarding at 5:20 PM

April 12, 2012

Email, Gource, Hadoop, and Python

I never knew that one of the guys (Andrew C) who works at Catalyst wrote a fantastic time series visualisation tool called Gource. It's incredible what people have done with it - just look on YouTube. The focus of use seems to have been on analysis of source code repository activity, but I think there is more mileage to be had from Gource than this. I wrote a simple Map/Reduce map chain for Hadoop (not really necessary for my volume of data) that stripped out the from/to/date information from all my mbox history since 1996. It really is simple - all you need to do is generate a file in the custom format, e.g.:

0970518767|"DJ Adams" |M|Andrew_Powis/RVSUK/FES/Rank@rank.com
and then pump this through Gource:
gource --start-position 0.28 --stop-position 0.29 --title 'Communication sphere since 1996' -s 1 --log-format custom email-log.txt
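
At this data volume the log can also be produced without Hadoop. The following is a minimal single-process sketch using Python's standard mailbox and email modules, not the Map/Reduce chain described above: the function names are made up for illustration, and treating each recipient address as the "file" field of Gource's custom format mirrors the sample line rather than any documented convention.

```python
import email
import email.utils
import mailbox


def gource_lines(msg):
    """Yield one 'timestamp|sender|M|recipient' line per To: address."""
    date = msg.get("Date")
    if date is None:
        return
    try:
        when = email.utils.parsedate_to_datetime(date)
    except (TypeError, ValueError):
        return  # skip messages with an unparseable Date header
    timestamp = int(when.timestamp())
    name, addr = email.utils.parseaddr(msg.get("From", ""))
    sender = (name or addr).replace("|", " ")  # '|' is the field separator
    if not sender:
        return
    for _name, rcpt in email.utils.getaddresses(msg.get_all("To", [])):
        if rcpt:
            yield "%d|%s|M|%s" % (timestamp, sender, rcpt.replace("|", " "))


def write_gource_log(mbox_path, log_path):
    """Convert an mbox file into a time-sorted Gource custom log."""
    lines = []
    for msg in mailbox.mbox(mbox_path):
        lines.extend(gource_lines(msg))
    # Sort by the leading timestamp so entries are in chronological order.
    lines.sort(key=lambda line: int(line.split("|", 1)[0]))
    with open(log_path, "w") as out:
        for line in lines:
            out.write(line + "\n")
```

Calling write_gource_log with an mbox path and email-log.txt as the output would give a file suitable for the --log-format custom invocation shown above.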

You can record it as a video too:

gource --start-position 0.28 --stop-position 0.29 --title 'Communication sphere since 1996' -s 1 \
    --log-format custom email-log.txt  -1280x720 -o - | ffmpeg -y -r 60 \
    -f image2pipe -vcodec ppm -i - -vcodec libx264 -preset ultrafast -crf 1 -threads 0 -bf 0 gource-video-of-email.mp4
And this is what it looks like:

Posted by PiersHarding at 8:27 PM