March 31, 2012

CSV files need SQL

As part of learning R, it soon became apparent that the basic unit of currency is the CSV file. There are plenty of other ways of getting data in and out of the R environment (JSON with library(RJSONIO), database interfaces with library(RPostgreSQL), and so on), but for the majority of work (which consists of hackery and experimentation, which is why R is so attractive) CSV is the transport mechanism.

I have found, particularly at the beginning, that it is often harder to think through basic data munging concepts in R. For typical tasks like sorting, grouping, and data type conversion, my language of choice (Perl, Python, or even bash) is usually quicker in the first instance, when I'm paring the data down into what I want to apply some form of statistical analysis or charting to.

With this in mind, I basically wanted to be able to run SQL against a CSV file without the hassle of loading it into a database first. Enter txt-sushi, a clever tool written in Haskell. It lets you do interesting things like:

cat test.csv | tssql -table x - 'select a,b, sum(hours) AS hours_sum from x group by a,b'

However, for my purposes tssql is too strict about data types, and it depends on Haskell, so I've built my own simple CSV SQL processor - csvtable - in Python, using SQLite as a backend. This is surprisingly easy to do, and lets you benefit from the convenience and power of SQLite's SQL syntax:

python csvtable.py \
  --where="system_code != 'LEAVE'" \
  --convert='date_epoch:date,hours:int' \
  --list="*, sum(hours) AS hours_sum, min(date_epoch) AS date_epoch_min, 
               max(date_epoch) AS date_epoch_max, count(*) AS days,
               ROUND(AVG(hours), 2) AS avg_time, MIN(hours) AS hours_min,
               MAX(hours) AS hours_max" \
  --group='organisation_code, system_code, request_id' \
  --file=test1.csv | \
 python csvtable.py --list='*, ROUND(((date_epoch_max - date_epoch_min) / (60 * 60 * 24)) + 1, 2) AS duration' > test2.csv
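The core of the technique is simple enough to sketch in a few lines. This is a minimal illustration of the CSV-into-SQLite approach, not the actual csvtable.py internals - the function name and signature here are invented for the example:

```python
import csv
import io
import sqlite3

def csv_sql(csv_text, query, table="x"):
    """Load CSV text into an in-memory SQLite table, run a query,
    and return the result as CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    con = sqlite3.connect(":memory:")
    cols = ", ".join('"%s"' % c for c in header)
    con.execute("CREATE TABLE %s (%s)" % (table, cols))
    placeholders = ", ".join("?" for _ in header)
    con.executemany("INSERT INTO %s VALUES (%s)" % (table, placeholders), data)
    cur = con.execute(query)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([d[0] for d in cur.description])
    writer.writerows(cur.fetchall())
    return out.getvalue()
```

Because the columns are created without type affinity, everything arrives as text - which is exactly why the real tool needs an explicit --convert option to coerce columns like hours and date_epoch before doing arithmetic on them.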

Posted by PiersHarding at 6:46 AM

March 27, 2012

Hadoop and single file to mapper processing flow

It seems like a trivial thing to want to do, but the standard Hadoop workflow treats all input files as line-oriented transactions, which does not help at all when you want to process on a file-by-file basis. The example I was working through involves 20 years' worth of mbox email files. Each file needs to be broken into individual emails, the contents parsed, and the useful information in the headers stripped out into a convenient format for subsequent processing. Doing this in the context of Hadoop is slightly odd. The usual approach appears to be to create an input file of mbox file names (loaded into HDFS), and then have each mapper invocation use the HDFS API to pull the file and process it.
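The per-file half of that pattern can be sketched with Python's standard mailbox module. This is illustrative only - the function name is invented, and in the real job you would first pull the mbox file from HDFS to a local path before parsing it:

```python
import mailbox

def process_mbox(local_path):
    """Break an mbox file into individual messages and emit one
    tab-separated record per message with the headers we care about."""
    for msg in mailbox.mbox(local_path):
        yield "%s\t%s\t%s" % (msg.get("From", ""),
                              msg.get("Date", ""),
                              msg.get("Subject", "").replace("\t", " "))
```

Each mapper would loop over the file names it receives on stdin, fetch each one, and print these records for the reduce phase.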

This presented another problem: how do you access the HDFS API from Python? There are two existing integrations that I can find - https://github.com/traviscrawford/python-hdfs and http://code.google.com/p/libpyhdfs/. Travis Crawford's is easy to get going, but as it uses a JNI binding I didn't relish the prospect of making sure CLASSPATHs etc. were right across all my Hadoop nodes (which, for my purposes, are any machines I can beg, borrow or steal). In light of this I created my own cheap and cheerful library, hdfsio, which uses subprocess to call the 'hadoop' executable with its 'fs' subcommands.
I admit this isn't the height of efficiency (or possibly elegance), but it is surprisingly robust and very simple.
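The subprocess approach boils down to something like the following sketch - the function names are illustrative, not the actual hdfsio API, and it assumes the 'hadoop' executable is on the PATH of every node (the binary is parameterised so it can be swapped out for testing):

```python
import subprocess

HADOOP = "hadoop"  # assumed to be on PATH on every node

def hdfs_cat(path, hadoop=HADOOP):
    """Return the contents of an HDFS file as bytes."""
    return subprocess.check_output([hadoop, "fs", "-cat", path])

def hdfs_put(local_path, hdfs_path, hadoop=HADOOP):
    """Copy a local file into HDFS."""
    subprocess.check_call([hadoop, "fs", "-put", local_path, hdfs_path])

def hdfs_ls(path, hadoop=HADOOP):
    """List an HDFS directory; returns the raw listing lines."""
    out = subprocess.check_output([hadoop, "fs", "-ls", path])
    return out.decode().splitlines()
```

The win is that there is nothing to deploy beyond the script itself: if a node can run 'hadoop fs' from the shell, the library works there too.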

Posted by PiersHarding at 6:13 AM

March 26, 2012

Journey into Hadoop

I've been building up my background knowledge of the current toolsets used in Data Science; part of this is R, and another is Hadoop.

Hadoop is a big thing, and takes (to my mind) quite a lot of effort to get going and to understand how you can bend it to your will. Part of this learning process has been about finding a comfortable installation pattern for Linux - in particular Ubuntu - and the best help I've found so far has been from Michael Noll. Things I had to be careful about were getting ssh working and name resolution exactly right on all nodes in the cluster, as you distribute things like /etc/hadoop/masters and the *-site.xml config files.

The next stage was to find a development pattern that enabled me to avoid Java. The answer, for me, is Hadoop Streaming. This basically allows you to pipe IO in and out of programs written in your favourite language - and in this case Michael does brilliantly again with Python and MapReduce.
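The shape of a Streaming job is easy to show with the canonical word-count example - a minimal sketch, not tied to any particular tutorial's code. The mapper reads lines from stdin and emits tab-separated key/value pairs; Streaming sorts those by key before the reducer sees them, so the reducer can count runs of identical keys:

```python
def mapper(lines):
    """Emit 'word\t1' for every word on every input line."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reducer(lines):
    """Sum counts for each word; assumes input is sorted by key,
    which is what the Streaming framework guarantees."""
    current, count = None, 0
    for line in lines:
        word, n = line.rstrip("\n").split("\t")
        if word == current:
            count += int(n)
        else:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)
```

In a real job each function would be a standalone script reading sys.stdin and printing to stdout, wired up with the hadoop-streaming jar's -mapper and -reducer options.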

Posted by PiersHarding at 9:23 AM