« Hadoop and single file to mapper processing flow | Main | Email, Gource, Hadoop, and Python »

March 31, 2012

CSV files need SQL

As part of learning about R it soon has become apparent that the basic unit of currency is a CSV file - there are lots of other ways of getting data in and out of the R environment (JSON with library(RJSONIO), DB intefaces with library(RPostgreSQL) ...) but for the majority of work (which consists of hackery and experimentation which is why R is so attractive) CSV is the transportation mechanism.

I have found that, in particular at the beginning, it is often harder to think of basic data munging concepts in R - typical tasks like sorting, grouping, data type conversion - often your language of choice (Perl, Python, or even bash) is just quicker for doing these things in the first instance when I'm paring the data down into what I want to apply some form of statistical analysis or charting too.

With this in mind - I basically wanted to be able to perform SQL against a CSV file, without the hassle of loading it into a database first. Enter a clever tool written in Haskell called txt-sushi. This enables you to do interesting things like:

cat test.csv | tssql -table x - 'select a,b, sum(hours) AS hours_sum from x group by a,b'

However, for my purposes tssql is too strict on handling data types, and is dependent on Haskell, so I've built my own simple CSV SQL processor - csvtable in Python using SQLite as a backend. This is surprisingly easy to do, and let's you have the benefit of the convenience and power of SQLite syntax:

python csvtable.py \
  --where="system_code != 'LEAVE'" \
  --convert='date_epoch:date,hours:int' \
  --list="*, sum(hours) AS hours_sum, min(date_epoch) AS date_epoch_min, 
               max(date_epoch) AS date_epoch_max, count(*) AS days,
               ROUND(AVG(hours), 2) AS avg_time, MIN(hours) AS hours_min,
               MAX(hours) AS hours_max" \
  --group='organisation_code, system_code, request_id' \
  --file=test1.csv | \
 python csvtable.py --list='*, ROUND(((date_epoch_max - date_epoch_min) / (60 * 60 * 24)) + 1, 2) AS duration' > test2.csv

Posted by PiersHarding at March 31, 2012 6:46 AM