<?xml version="1.0" encoding="utf-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en">
<title>Where on Earth is Piers?</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/" />
<modified>2013-02-09T19:59:49Z</modified>
<tagline></tagline>
<id>tag:www.piersharding.com,2013:/blog//1</id>
<generator url="http://www.movabletype.org/" version="4.24-en">Movable Type</generator>
<copyright>Copyright (c) 2013, PiersHarding</copyright>

<entry>
<title>Hosting an R Repository for RSAP and RMonet</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2013/02/hosting_an_r_repo.html" />
<modified>2013-02-09T19:59:49Z</modified>
<issued>2013-02-09T19:49:27Z</issued>
<id>tag:www.piersharding.com,2013:/blog//1.98</id>
<created>2013-02-09T19:49:27Z</created>
<summary type="text/plain">I&apos;ve just set up an R repository to host the R extensions that I&apos;ve published. This currently contains RSAP, the SAP RFC connector, and RMonet, the MonetDB connector using the Monet MAPI C API. It&apos;s a very easy process as document...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>R</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>I've just set up an R repository to host the R extensions that I've published.  This currently contains <a href="https://github.com/piersharding/RSAP">RSAP</a>, the SAP RFC connector, and <a href="https://github.com/piersharding/RMonet">RMonet</a>, the MonetDB connector using the Monet MAPI C API.</p>

<p>It's a very easy process, as documented <a href="http://cran.r-project.org/doc/manuals/R-admin.html#Setting-up-a-package-repository">here</a>.</p>
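<p>As a rough sketch of what that process involves (not necessarily the exact steps used for this repository): a source repository is a directory tree with a src/contrib folder holding the package tarballs plus a PACKAGES index, which tools::write_PACKAGES() generates.  The staging directory below is a hypothetical temporary location.</p>

```r
# Hypothetical staging directory standing in for the web server's document root.
repo_root <- file.path(tempdir(), "R")
contrib_dir <- file.path(repo_root, "src", "contrib")
dir.create(contrib_dir, recursive = TRUE, showWarnings = FALSE)

# Copy the built source tarballs (output of R CMD build) into contrib_dir,
# then index whatever is there. With no tarballs present nothing is indexed.
tools::write_PACKAGES(contrib_dir, type = "source")

# This is the URL that install.packages() will look under for source packages.
src_url <- contrib.url("http://piersharding.com/R", type = "source")
print(src_url)
```

<p>Publishing is then a matter of serving the repo_root directory over HTTP.</p>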

<p>This repository can be generally accessed by doing the following:<br />
setRepositories(addURLs = c(PiersHarding = "http://piersharding.com/R"))</p>

<p>Or for an individual package:<br />
install.packages('RMonet', repos=c('http://piersharding.com/R'))</p>]]>

</content>
</entry>

<entry>
<title>Data Hackery - R, SAP, and OpenSource in-memory databases</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2013/01/r_sap_and_opens.html" />
<modified>2013-02-09T20:02:54Z</modified>
<issued>2013-01-31T04:38:09Z</issued>
<id>tag:www.piersharding.com,2013:/blog//1.97</id>
<created>2013-01-31T04:38:09Z</created>
<summary type="text/plain">Data Hackery - R, SAP, and OpenSource in-memory databases</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[I've just completed a post on SAP SCN about using the in-memory, column-oriented database MonetDB with SAP and R for exploratory data analysis, titled "<a href="http://scn.sap.com/community/scripting-languages/blog/2013/01/31/r-sap-and-opensource-in-memory-databases">Data Hackery - R, SAP, and OpenSource in-memory databases</a>".

This uses an R library that I've created as a database interface to <a href="http://www.monetdb.org/">MonetDB</a> called <a href="https://github.com/piersharding/RMonet">RMonet</a>.

]]>

</content>
</entry>

<entry>
<title>Google Drive repository plugin for Moodle</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2012/07/google_drive_re.html" />
<modified>2012-07-18T01:34:07Z</modified>
<issued>2012-07-18T01:31:03Z</issued>
<id>tag:www.piersharding.com,2012:/blog//1.96</id>
<created>2012-07-18T01:31:03Z</created>
<summary type="text/plain">Just added a Google Drive repository plugin for Moodle to my moodle-google set of applications here: https://github.com/piersharding/moodle-google/tree/master/repository/googledrive....</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>Google</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>Just added a <a href="https://drive.google.com/start">Google Drive</a> repository plugin for <a href="http://moodle.org/">Moodle</a> to my moodle-google set of applications here: <a href="https://github.com/piersharding/moodle-google/tree/master/repository/googledrive">https://github.com/piersharding/moodle-google/tree/master/repository/googledrive</a>.</p>]]>

</content>
</entry>

<entry>
<title>SAP with R</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2012/06/sap_with_r.html" />
<modified>2012-06-13T03:46:54Z</modified>
<issued>2012-06-13T01:37:14Z</issued>
<id>tag:www.piersharding.com,2012:/blog//1.95</id>
<created>2012-06-13T01:37:14Z</created>
<summary type="text/plain"><![CDATA[Something that piqued my curiosity lately was the development of SAP HANA and R (good overview here).  This is definitely a new and exciting direction for SAP: creating a well-structured and organised 'Big Table' option for in-memory computing, and then going the extra mile to embed a specialised Open Source statistical computing package (R) in it - opening the forefront of statistical analysis to those that dare.
 
This is utterly brilliant, but the problem is that I can't access it as I don't have access to a SAP HANA instance (nor would most people).  It is also heavily geared to 'Big Data', when there is still an awful lot to be gained from small, and mid-range data analysis arenas (resisting the temptation about size and clich&eacute;s).
 
This has definitely touched on my hacker's itch, and in response to this I've created one more Scripting Language Connector for R - RSAP.
 ]]></summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>R</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>Something that piqued my curiosity lately was the development of <a href="http://www.sap.com/solutions/technology/in-memory-computing-platform/hana/overview/index.epx">SAP HANA</a> and <a href="http://www.r-project.org/">R</a> (good overview <a href="http://www.slideshare.net/JitenderAswani/na-6693-r-and-sap-hana-dkom-jitenderaswanijensdoeprmund">here</a>).  This is definitely a new and exciting direction for SAP: creating a well-structured and organised 'Big Table' option for in-memory computing, and then going the extra mile to embed a specialised Open Source statistical computing package (R) in it - opening the forefront of statistical analysis to those that dare.</p>
<p> </p>
<p>This is utterly brilliant, but the problem is that I can't access it as I don't have access to a SAP HANA instance (nor would most people).  It is also heavily geared to 'Big Data', when there is still an awful lot to be gained from small, and mid-range data analysis arenas (resisting the temptation about size and clich&eacute;s).</p>
<p> </p>
<p>This has definitely touched on my hacker's itch, and in response to this I've created one more Scripting Language Connector for R - <a href="https://github.com/piersharding/RSAP">RSAP</a>.</p>
<p> </p>
<p>The idea of this is to enable RFC calls (using the SAP NW RFC SDK) where any table contents are returned as <a href="http://cran.r-project.org/doc/manuals/R-intro.html#Data-frames">data.frames</a> (in R parlance). </p>
<p> </p>
<p>Once you have this data in R, then the world is your oyster - it is up to your imagination as to what you do with it.  To give an overview of how it works, and what you can do, I'm going to step through the process of installing and using RSAP.</p>
<p> </p>
<h2><span style="font-size: 12pt;"><strong>Obtaining and Installing</strong></span></h2>
<p> </p>
<p>Firstly you need to install R.  I recommend using <a href="http://www.rstudio.org/">RStudio</a> as it is a comfortable graphical user interface - you can get it from <a href="http://www.rstudio.org/download/">here</a>.   </p>
<p>Under Debian-flavoured Linux (read Ubuntu) you can install R before downloading and installing RStudio using:</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">sudo apt-get install r-base-core r-base-dev r-base-html r-recommended</span></p>
<p> </p>
<p> </p>
<h2>SAP NW RFCSDK</h2>
<p> </p>
<p>The SDK is available from the SAP Service Market Place SWDC - this forum discussion covers how to get it: <a href="http://scn.sap.com/thread/950318">http://scn.sap.com/thread/950318</a></p>
<p>If you have (like me) installed the NPL SAP Test Drive instance, then the SAP NW RFC libs exist in the /usr/sap/NPL/SYS/exe/run directory, the only problem being that it does not contain the C header files (really - SAP should make this available on SDN).</p>
<p> </p>
<h2>RSAP</h2>
<p> </p>
<p>Download or clone the RSAP project source from <a href="https://github.com/piersharding/RSAP">https://github.com/piersharding/RSAP</a></p>
<p> </p>
<h2>Building</h2>
<p> </p>
<p>Ensure that the R library prerequisites are installed.  To do this there is a helper script in the RSAP source code directory.  cd to the source directory (downloaded above) - in my case /home/piers/git/public/RSAP - and run the following:</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">R --no-save &lt; install_dependencies.R</span></p>
<p> </p>
<p>This will prompt to install the packages yaml, reshape, plotrix, and RUnit.</p>
<p> </p>
<p>To build and install the RSAP package, cd to the source directory (downloaded above) - in my case /home/piers/git/public/RSAP - run the following:</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">R CMD INSTALL --build --preclean --clean --configure-args='--with-nwrfcsdk-include=/home/piers/code/sap/nwrfcsdk/include --with-nwrfcsdk-lib=/home/piers/code/sap/nwrfcsdk/lib' .</span></p>
<p> </p>
<p>You must change the values for <span style="font-family: 'andale mono', times;">--with-nwrfcsdk-include</span> and <span style="font-family: 'andale mono', times;">--with-nwrfcsdk-lib</span> to point to the directory locations that you have downloaded the SAP NW RFC SDK to.</p>
<p> </p>
<p>Under Linux, it is also likely that you need to add the lib directory to the LD cache or set the LD_LIBRARY_PATH variable.</p>
<p> </p>
<p>Setting the LD Cache:</p>
<p>As root, edit /etc/ld.so.conf and add the lib path from above on its own line.  Now regenerate the cache by executing 'sudo ldconfig'.</p>
<p> </p>
<p>Setting LD_LIBRARY_PATH</p>
<p>You must ensure that the following environment variable is set in all your shells:</p>
<p>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/nwrfcsdk/lib</p>
<p>The easiest way to do this is to add the above line to your $HOME/.bashrc file so that it happens automatically for all future shells.</p>
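<p>A quick way to confirm from inside R that the variable was picked up is sketched below.  The SDK path is the same placeholder used above, so substitute your own.  Note that this only inspects LD_LIBRARY_PATH; it says nothing about the ld.so.conf route.</p>

```r
# Placeholder SDK location; replace with your actual lib directory.
sdk_lib <- "/path/to/nwrfcsdk/lib"

# LD_LIBRARY_PATH is a colon separated list of directories.
ld_paths <- strsplit(Sys.getenv("LD_LIBRARY_PATH"), ":", fixed = TRUE)[[1]]

# TRUE if the SDK lib directory is listed exactly as given above.
on_path <- sdk_lib %in% ld_paths
message("SDK lib dir on LD_LIBRARY_PATH: ", on_path)
```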
<p> </p>
<h2>Does it work?</h2>
<p> </p>
<p>Once the build and install of the RSAP package is complete, you should test that it is all working.</p>
<p> </p>
<p>Change to the package source code directory (you are probably still there from the above activities), and launch either R or RStudio.</p>
<p>From the R command line try the following:</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">&gt; library(RSAP)</span></p>
<p><span style="font-family: 'andale mono', times;">Loading required package: yaml</span></p>
<p><span style="font-family: 'andale mono', times;">&gt; </span></p>
<p> </p>
<p>You should get the above confirmation message that the dependent yaml package has been loaded.  Now we are ready to try some R wizardry.</p>
<p> </p>
<h2>How to work with RSAP</h2>
<p> </p>
<p>Let's work through the general process steps for interacting with SAP.</p>
<p> </p>
<h3>Connecting to SAP</h3>
<p> </p>
<p>Using RSAP we need to establish a connection to SAP.  For this you need an account that has the appropriate access for RFC calls, and functionality access.  Connections can be built in two ways - directly passing connection parameters:</p>
<p><span style="font-family: 'andale mono', times;">&gt;     conn &lt;- RSAPConnect(ashost="nplhost", sysnr="42",</span></p>
<p><span style="font-family: 'andale mono', times;">                          client="001", user="developer", </span></p>
<p><span style="font-family: 'andale mono', times;">                          passwd="developer", lang="EN")</span></p>
<p><span style="font-family: 'andale mono', times;">&gt; </span></p>
<p> </p>
<p>Or using a YAML encoded file that contains the connection details:</p>
<p><span style="font-family: 'andale mono', times;">&gt; conn &lt;- RSAPConnect("sap.yml")</span></p>
<p><span style="font-family: 'andale mono', times;">&gt; </span></p>
<p> </p>
<p>The sap.yml file is structured like:</p>
<p><span style="font-family: 'andale mono', times;">ashost: nplhost</span></p>
<p><span style="font-family: 'andale mono', times;">sysnr: "42"</span></p>
<p><span style="font-family: 'andale mono', times;">client: "001"</span></p>
<p><span style="font-family: 'andale mono', times;">user: developer</span></p>
<p><span style="font-family: 'andale mono', times;">passwd: developer</span></p>
<p><span style="font-family: 'andale mono', times;">lang: EN</span></p>
<p><span style="font-family: 'andale mono', times;">trace: 1</span></p>
<p> </p>
<p>The trace: 1 entry above activates the trace functionality in the NW RFC SDK.  This creates trace files in the current working directory, which are invaluable for debugging connectivity problems.</p>
<p> </p>
<p> </p>
<h3><span style="font-family: 'andale mono', times;">Calling SAP</span></h3>
<p><span style="font-family: 'andale mono', times;"><br /></span></p>
<p>Now we have the connection object, we can get connection info with it:</p>
<p><span style="font-family: 'andale mono', times;">info &lt;- RSAPGetInfo(conn)</span></p>
<p>Query the system with:</p>
<p><span style="font-family: 'andale mono', times;">res &lt;- RSAPInvoke(conn, "&lt;RFC Function Name&gt;", parms)</span></p>
<p>Or close the connection:</p>
<p><span style="font-family: 'andale mono', times;">RSAPClose(conn)</span></p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">RSAPInvoke</span>() is what we are most interested in, and we need to pass the parameters as a series of nested named lists.  The classic example is RFC_READ_TABLE:</p>
<p><span style="font-family: 'andale mono', times;">parms &lt;- list('DELIMITER' = '|',</span></p>
<p><span style="font-family: 'andale mono', times;">              'FIELDS' = list(FIELDNAME = list('CARRID', 'CONNID', 'PRICE',</span></p>
<p><span style="font-family: 'andale mono', times;">                                               'SEATSMAX', 'SEATSOCC')),</span></p>
<p><span style="font-family: 'andale mono', times;">              'OPTIONS' = list(TEXT = list("CARRID = 'AA' ", " AND CONNID = 0017 ")),</span></p>
<p><span style="font-family: 'andale mono', times;">              'QUERY_TABLE' = 'SFLIGHTS2')</span></p>
<p><span style="font-family: 'andale mono', times;">res &lt;- RSAPInvoke(conn, "RFC_READ_TABLE", parms)</span></p>
<p> </p>
<p>The names must correspond directly to the parameter and structure (for tables) names, and use numeric and character types as appropriate.</p>
<p>The other thing that is really important to get your head around is that R data structures are column oriented, which means we have to think differently about tables that we get from SAP.  Tables in SAP translate to lists of vectors where the outer list is a list of column names (a slightly loose analogy but it will do) and the vectors hang off these column names corresponding to all the values in that column down the rows.</p>
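<p>To make the column orientation concrete, here is a minimal sketch using a hand-built list in place of a real SAP result.  The values are made up and no SAP connection is involved.</p>

```r
# Mock of a table as R sees it: a named list with one vector per column.
# All values are invented for illustration.
tbl <- list(CARRID   = c("AA", "AA", "LH"),
            CONNID   = c("0017", "0064", "0400"),
            SEATSOCC = c(320, 280, 190))

# A data.frame is the same idea with equal length columns enforced.
df <- as.data.frame(tbl, stringsAsFactors = FALSE)

print(df$SEATSOCC)  # one column: every row's value for SEATSOCC
print(df[2, ])      # one row: assembled across all the column vectors
```

<p>Pulling out a column is the cheap, natural operation; a row has to be assembled across the vectors.</p>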
<p> </p>
<p> </p>
<h2>Working through the examples in get_flights.R</h2>
<p> </p>
<p>In the source code package there is an example script - get_flights.R.  It uses the standard demonstration data for the Flight Data system contained in table SFLIGHTS2.  Let's look at what this does.</p>
<p> </p>
<p> Load libraries:</p>
<p><span style="font-family: 'andale mono', times;">&gt; library(RSAP)</span></p>
<p><span style="font-family: 'andale mono', times;">Loading required package: yaml</span></p>
<p><span style="font-family: 'andale mono', times;">&gt; library(reshape)</span></p>
<p><span style="font-family: 'andale mono', times;">Loading required package: plyr</span></p>
<p>  <span style="font-family: 'andale mono', times;">Attaching package: &lsquo;reshape&rsquo;</span></p>
<p>  <span style="font-family: 'andale mono', times;">The following object(s) are masked from &lsquo;package:plyr&rsquo;:</span></p>
<p>  <span style="font-family: 'andale mono', times;">    rename, round_any</span></p>
<p><span style="font-family: 'andale mono', times;">&gt; library(plotrix)</span></p>
<p><span style="font-family: 'andale mono', times;">&gt;</span></p>
<p>We now have all the necessary libraries for the rest of the examples.</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">conn &lt;- RSAPConnect("sap.yml")</span></p>
<p><span style="font-family: 'andale mono', times;">parms &lt;- list('DELIMITER' = ';',</span></p>
<p><span style="font-family: 'andale mono', times;">              'QUERY_TABLE' = 'SFLIGHTS2')</span></p>
<p><span style="font-family: 'andale mono', times;">res &lt;- RSAPInvoke(conn, "RFC_READ_TABLE", parms)</span></p>
<p><span style="font-family: 'andale mono', times;">RSAPClose(conn)</span></p>
<p><span style="font-family: 'andale mono', times;">sflight = res$DATA</span></p>
<p><span style="font-family: 'andale mono', times;">flds &lt;- sub("\\s+$", "", res$FIELDS$FIELDNAME)</span></p>
<p><span style="font-family: 'andale mono', times;">sflight &lt;- data.frame(sflight, colsplit(sflight$WA, split = ";", names = flds))</span></p>
<p><span style="font-family: 'andale mono', times;"><br /></span></p>
<p> </p>
<p>This connects to SAP and calls RFC_READ_TABLE to get the contents of SFLIGHTS2, setting the column delimiter for the returned rows to ';'.  We close the connection and copy the table data from the return parameter res$DATA (see RFC_READ_TABLE in transaction SE37) into sflight.  We also grab the field names returned in table FIELDS, and remove the trailing whitespace.  Next - and this is where the ';' delimiter matters - using the colsplit() function from the reshape package, we split each row of the returned DATA into columns named by the FIELDS that RFC_READ_TABLE provided.</p>
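<p>The splitting step can be tried without an SAP system.  The sketch below uses a hand-made stand-in for the RFC_READ_TABLE result (made-up values) and base R's strsplit() in place of colsplit(), so it runs without the reshape package.</p>

```r
# Mock of the return shape described above: field names padded with
# whitespace, and each row delivered as one delimited string in WA.
res <- list(
  FIELDS = list(FIELDNAME = c("CARRID  ", "CONNID  ", "SEATSOCC")),
  DATA   = list(WA = c("AA ;0017;320", "LH ;0400;190"))
)

# Strip the trailing padding from the field names.
flds <- sub("\\s+$", "", res$FIELDS$FIELDNAME)

# Split each row on the delimiter and stack the pieces into a matrix.
parts <- strsplit(res$DATA$WA, ";", fixed = TRUE)
sflight_demo <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(sflight_demo) <- flds

# Everything arrives as text, so convert numeric columns explicitly.
sflight_demo$SEATSOCC <- as.numeric(sflight_demo$SEATSOCC)
print(sflight_demo)
```

<p>One caveat with this base R variant: strsplit() drops a trailing empty field, so rows that end with the delimiter would need extra care.</p>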
<p> </p>
<p>Now we have a data.frame that looks a lot like the table SFLIGHTS2 when viewed in transaction SE16.</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">sflight &lt;- cbind(sflight, FLIGHTNO = paste(sub("\\s+$", "",</span></p>
<p><span style="font-family: 'andale mono', times;">                                           sflight$CARRID),sflight$CONNID, sep=""))</span></p>
<p><span style="font-family: 'andale mono', times;">sflight$SEGMENT &lt;- paste(sflight$AIRPFROM, sflight$AIRPTO, sep=" - ")</span></p>
<p><span style="font-family: 'andale mono', times;">sflight$CARRNAME &lt;- sub("\\s+$", "", sflight$CARRNAME)</span></p>
<p><span style="font-family: 'andale mono', times;">sflight$DISTANCE &lt;- as.numeric(lapply(sflight$DISTANCE,</span></p>
<p><span style="font-family: 'andale mono', times;">                                      FUN=function (x) {sub("\\*","", x)}))</span></p>
<p><span style="font-family: 'andale mono', times;">sflight$DISTANCE &lt;- as.numeric(lapply(sflight$DISTANCE,</span></p>
<p><span style="font-family: 'andale mono', times;">                                      FUN=function (x) {if (x == 0) NA else x}))</span></p>
<p><span style="font-family: 'andale mono', times;">sflight[sflight$CARRNAME == 'Qantas Airways','DISTANCE'] &lt;- 10258</span></p>
<p> </p>
<p>This next chunk creates the new vectors (columns) FLIGHTNO, combined from CARRID and CONNID, and SEGMENT, from AIRPFROM and AIRPTO, and cleans the vectors CARRNAME and DISTANCE.</p>
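<p>The DISTANCE clean-up can also be written without lapply(), since sub() and the comparison operators already work on whole vectors.  A small sketch on made-up values:</p>

```r
# Invented raw values: one clean, one with a stray '*', one zero.
distance_raw <- c("5347.0000", "*1234.0000", "0.0000")

# Remove the literal '*' and convert to numeric in one vectorised pass.
distance <- as.numeric(sub("*", "", distance_raw, fixed = TRUE))

# Treat zero distances as missing.
distance[distance == 0] <- NA
print(distance)
```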
<p> </p>
<p>Now create some aggregated views, to generate visualisations from:</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">airline_avgocc &lt;- aggregate(data.frame(SEATSMAX=sflight$SEATSMAX,</span></p>
<p><span style="font-family: 'andale mono', times;">                                       SEATSOCC=sflight$SEATSOCC,</span></p>
<p><span style="font-family: 'andale mono', times;">                                       OCCUPANCY=sflight$SEATSOCC/sflight$SEATSMAX),</span></p>
<p><span style="font-family: 'andale mono', times;">                            by=list(carrname=sflight$CARRNAME), FUN=mean, na.rm=TRUE)</span></p>
<p><span style="font-family: 'andale mono', times;">airline_sumocc &lt;- aggregate(data.frame(SEATSOCC=sflight$SEATSOCC), </span></p>
<p><span style="font-family: 'andale mono', times;">                            by=list(carrname=sflight$CARRNAME), FUN=sum, na.rm=TRUE)</span></p>
<p> </p>
<p>Show a pie chart  - sum of airline occupancy as a share of market:</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">x11()</span></p>
<p><span style="font-family: 'andale mono', times;">lbls &lt;- paste(airline_sumocc$carrname, "\n", sprintf("%.2f%%",          </span></p>
<p><span style="font-family: 'andale mono', times;">        (airline_sumocc$SEATSOCC/sum(airline_sumocc$SEATSOCC))*100), sep="")</span></p>
<p><span style="font-family: 'andale mono', times;">pie3D(airline_sumocc$SEATSOCC, labels=lbls, </span></p>
<p><span style="font-family: 'andale mono', times;">      col=rainbow(length(airline_sumocc$carrname)),</span></p>
<p><span style="font-family: 'andale mono', times;">      main="Occupancy sum share for Airlines", explode=0.1)</span></p>
<p> </p>
<p><img alt="pie.png" class="jive-image" height="196" src="http://scn.sap.com/servlet/JiveServlet/downloadImage/38-67980-110479/240-196/pie.png" style="width: 240px; height: 196.36363636363635px;" width="240" __jive_id="110479" /></p>
<p> </p>
<p> </p>
<p>Create a stacked bar plot with colours and a legend showing a summary of occupancy by segment and carrier.  To do this we need to generate a summary (aggregate), fill in the missing combinations of the grid, and then switch the orientation of rows for columns to present to the plotting functions:</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;">d &lt;- aggregate(SEATSOCC ~ CARRNAME:SEGMENT, data=sflight, FUN=sum, na.rm=FALSE)</span></p>
<p><span style="font-family: 'andale mono', times;">d2 &lt;- with(d, expand.grid(CARRNAME = unique(d$CARRNAME), SEGMENT = unique(d$SEGMENT)))</span></p>
<p><span style="font-family: 'andale mono', times;">airline_sumsegocc &lt;- merge(d, d2, all.y = TRUE)</span></p>
<p><span style="font-family: 'andale mono', times;">airline_sumsegocc$SEATSOCC[is.na(airline_sumsegocc$SEATSOCC)] &lt;- 0</span></p>
<p><span style="font-family: 'andale mono', times;"># switch orientation to segment * carrier</span></p>
<p><span style="font-family: 'andale mono', times;">counts &lt;- data.frame(unique(airline_sumsegocc$CARRNAME))</span></p>
<p><span style="font-family: 'andale mono', times;">for (a in unique(airline_sumsegocc$SEGMENT))  </span></p>
<p><span style="font-family: 'andale mono', times;">    {counts &lt;- cbind(counts, </span></p>
<p><span style="font-family: 'andale mono', times;">     airline_sumsegocc$SEATSOCC[which(airline_sumsegocc$SEGMENT == a)]);}</span></p>
<p><span style="font-family: 'andale mono', times;">counts[,1] &lt;- NULL</span></p>
<p><span style="font-family: 'andale mono', times;">colnames(counts) &lt;- unique(airline_sumsegocc$SEGMENT);</span></p>
<p><span style="font-family: 'andale mono', times;">rownames(counts) &lt;- unique(airline_sumsegocc$CARRNAME);</span></p>
<p><span style="font-family: 'andale mono', times;">x11()</span></p>
<p><span style="font-family: 'andale mono', times;">barplot(as.matrix(counts), main="Total Occupancy by Segment and Carrier",</span></p>
<p><span style="font-family: 'andale mono', times;">        ylab="Number of Seats", </span></p>
<p><span style="font-family: 'andale mono', times;">        col=rainbow(dim(counts)[1]), </span></p>
<p><span style="font-family: 'andale mono', times;">        ylim=c(0, 15000), legend = rownames(counts))</span></p>
<p> </p>
<p><img alt="barchart.png" class="jive-image" height="284" src="http://scn.sap.com/servlet/JiveServlet/downloadImage/38-67980-110480/348-284/barchart.png" style="width: 348px; height: 284.72727272727275px;" width="348" __jive_id="110480" /></p>
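<p>As an alternative sketch (not what get_flights.R does), base R's xtabs() can sum, fill the missing combinations with zero, and pivot to a carrier by segment grid in one call.  The data below is made up.</p>

```r
# Invented occupancy figures; note Air A never flies FRA - JFK.
demo <- data.frame(
  CARRNAME = c("Air A", "Air A", "Air B", "Air B"),
  SEGMENT  = c("JFK - SFO", "JFK - SFO", "FRA - JFK", "JFK - SFO"),
  SEATSOCC = c(100, 50, 200, 30)
)

# Rows are carriers, columns are segments, cells are summed SEATSOCC.
# Combinations that do not occur in the data come out as 0.
counts_demo <- xtabs(SEATSOCC ~ CARRNAME + SEGMENT, data = demo)
print(counts_demo)
```

<p>The result is a two-dimensional table that can be handed to barplot() in the same way as the counts matrix above.</p>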
<p> </p>
<p> </p>
<p>Lastly - we create a simple performance indicator using a time series comparison of different airlines:</p>
<p> </p>
<p><span style="font-family: 'andale mono', times;"># performance by airline over time - dollars per customer KM</span></p>
<p><span style="font-family: 'andale mono', times;">sflight$FLDATEYYMM &lt;- substr(sflight$FLDATE, start=1, stop=6)</span></p>
<p><span style="font-family: 'andale mono', times;">d &lt;- aggregate(data.frame(PAYMENTSUM=sflight$PAYMENTSUM,</span></p>
<p><span style="font-family: 'andale mono', times;">                          SEATSOCC=sflight$SEATSOCC,</span></p>
<p><span style="font-family: 'andale mono', times;">                          DISTANCE=sflight$DISTANCE,</span></p>
<p><span style="font-family: 'andale mono', times;">                          PERFORMANCE=(sflight$PAYMENTSUM/(sflight$SEATSOCC *</span></p>
<p><span style="font-family: 'andale mono', times;">                             sflight$DISTANCE))),</span></p>
<p><span style="font-family: 'andale mono', times;">               by=list(carrname=sflight$CARRNAME, </span></p>
<p><span style="font-family: 'andale mono', times;">                       fldateyymm=sflight$FLDATEYYMM),</span></p>
<p><span style="font-family: 'andale mono', times;">               FUN=sum, na.rm=TRUE)</span></p>
<p><span style="font-family: 'andale mono', times;">d2 &lt;- with(d, expand.grid(carrname = unique(d$carrname), </span></p>
<p><span style="font-family: 'andale mono', times;">                          fldateyymm = unique(d$fldateyymm)))</span></p>
<p><span style="font-family: 'andale mono', times;">agg_perf &lt;- merge(d, d2, all.y = TRUE)</span></p>
<p><span style="font-family: 'andale mono', times;">agg_perf &lt;- agg_perf[order(agg_perf$carrname, agg_perf$fldateyymm),]</span></p>
<p><span style="font-family: 'andale mono', times;">agg_perf$PERFORMANCE[is.na(agg_perf$PERFORMANCE)] &lt;- 0</span></p>
<p> </p>
<p><span style="font-family: 'andale mono', times;"># create time series and plot comparison</span></p>
<p><span style="font-family: 'andale mono', times;">perf_series &lt;- data.frame(1:length(unique(agg_perf$fldateyymm)))</span></p>
<p><span style="font-family: 'andale mono', times;">for (a in unique(agg_perf$carrname)) </span></p>
<p><span style="font-family: 'andale mono', times;">    {perf_series &lt;- cbind(perf_series, </span></p>
<p><span style="font-family: 'andale mono', times;">       agg_perf$PERFORMANCE[which(agg_perf$carrname == a)]);}</span></p>
<p><span style="font-family: 'andale mono', times;">perf_series[,1] &lt;- NULL</span></p>
<p><span style="font-family: 'andale mono', times;">colnames(perf_series) &lt;- unique(agg_perf$carrname);</span></p>
<p><span style="font-family: 'andale mono', times;"># convert all to time series</span></p>
<p><span style="font-family: 'andale mono', times;">for (a in seq_along(unique(agg_perf$carrname)))</span></p>
<p><span style="font-family: 'andale mono', times;">    {perf_series[[a]] &lt;- ts(perf_series[,a], start=c(2011,5), frequency=12)}</span></p>
<p><span style="font-family: 'andale mono', times;"># plot the first and line the rest</span></p>
<p><span style="font-family: 'andale mono', times;">x11()</span></p>
<p><span style="font-family: 'andale mono', times;">ts.plot(ts(perf_series, start=c(2011,5), frequency=12), </span></p>
<p><span style="font-family: 'andale mono', times;">           gpars=list(main="Performance: dollar per customer KM",</span></p>
<p><span style="font-family: 'andale mono', times;">                      xlab="Months", </span></p>
<p><span style="font-family: 'andale mono', times;">                      ylab="Dollars", </span></p>
<p><span style="font-family: 'andale mono', times;">                      col=rainbow(dim(perf_series)[2]), xy.labels=TRUE))</span></p>
<p><span style="font-family: 'andale mono', times;">legend(2012.05, 3.2, legend=colnames(perf_series), </span></p>
<p><span style="font-family: 'andale mono', times;">                     col=rainbow(dim(perf_series)[2]), lty=1, seg.len=1)</span></p>
<p> </p>
<p><img alt="timeseries.png" class="jive-image" height="273" src="http://scn.sap.com/servlet/JiveServlet/downloadImage/38-67980-110481/334-273/timeseries.png" style="width: 334px; height: 273.27272727272725px;" width="334" __jive_id="110481" /></p>
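<p>A small sketch of the time series step on made-up numbers: ts() accepts a matrix with one column per series, so a monthly multi-series object starting in May 2011 can be built in one call.</p>

```r
# Invented performance figures, one column per airline, one row per month.
perf <- cbind(AirA = c(1.2, 1.4, 1.1),
              AirB = c(0.9, 1.0, 1.3))

# Monthly data (frequency = 12) starting in May 2011.
perf_ts <- ts(perf, start = c(2011, 5), frequency = 12)

print(start(perf_ts))  # year and month of the first observation
print(end(perf_ts))    # year and month of the last observation
```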
<p> </p>
<p> </p>
<p>Hopefully, I've shown that there is a lot that can be done with R - especially in the area of ad hoc advanced business intelligence and data analysis.  I have not really even scratched the surface in terms of what R can offer for advanced statistical analysis and modelling - that is where the true wizards live.</p>
<p> </p>
<p>I would love to hear back from anyone who tries RSAP out - issues and user experiences alike.</p>
<p> </p>
<p>References:</p>
<ul>
<li>Post on <a href="http://scn.sap.com/community/developer-center/hana/blog/2012/05/21/when-sap-hana-met-r--first-kiss" _jive_internal="true">SAP HANA and R</a> from Alvaro</li>
</ul>
<p>Basic R Tutorials</p>
<ul>
<li><a href="http://www.statmethods.net/">Quick R</a></li>
<li><a href="http://wiki.stdout.org/rcookbook/">The R Cookbook</a></li>
<li><a href="http://www.r-statistics.com/">The R Statistics Blog</a></li>
<li><a href="http://www.harding.edu/fmccown/r/">Producing simple Graphs with R</a></li>
<li><a href="http://cran.r-project.org/">The CRAN project Documentation</a></li>
</ul>]]>

</content>
</entry>

<entry>
<title>R and Hadoop</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2012/04/r_and_hadoop.html" />
<modified>2012-04-28T07:25:33Z</modified>
<issued>2012-04-28T07:04:09Z</issued>
<id>tag:www.piersharding.com,2012:/blog//1.94</id>
<created>2012-04-28T07:04:09Z</created>
<summary type="text/plain"> R is my hacker&apos;s language of choice for analysis work. It really appeals to my sense of iteratively refining a solution. To my delight, I stumbled across this set of libraries for calling out to Hadoop Mapreduce, HDFS, and...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>Hadoop</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>
<a href="http://www.r-project.org/">R</a> is my hacker's language of choice for analysis work.  It really appeals to my sense of iteratively refining a solution.  To my delight, I stumbled across this set of libraries for calling out to Hadoop Mapreduce, HDFS, and HBASE directly from R - <a href="https://github.com/RevolutionAnalytics/RHadoop">RHadoop</a>.
<br/>
It was surprisingly easy to get going - especially with some patient help from <a href="https://github.com/piccolbo">Antonio</a> - the project owner.  RHadoop relies on the same fixes that <a href="https://github.com/klbostee/dumbo/wiki">Dumbo</a> requires, but the game changer here is that from <a href="http://hadoop.apache.org/common/releases.html#3+Apr%2C+2012%3A+Release+1.0.2+available">Hadoop 1.0.2</a>, all the key patches that both require are now part of core.<br/>
The thing that tripped me up was a custom .Rprofile file I was using to load and print things at R startup.  This caused R to write to stdout, which is what Hadoop streaming uses to pass data between tasks.  This corrupted the data transfer, which was killing RHadoop with a weird Java Heap error.  Anyway, once sorted out, everything runs smoothly, and I like the intuitive way things are handled in an R-esque manner, e.g. take the example from the <a href="https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial">tutorial</a>:
<pre>
> library(rmr)
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
> small.ints = to.dfs(1:10)
Warning: $HADOOP_HOME is deprecated.

12/04/28 19:17:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/04/28 19:17:45 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/04/28 19:17:45 INFO compress.CodecPool: Got brand-new compressor
> out = mapreduce(input = small.ints, map = function(k,v) keyval(v, v^2))
Warning: $HADOOP_HOME is deprecated.

packageJobJar: [/tmp/RtmpXlELmY/rmr-local-env, /tmp/RtmpXlELmY/rmr-global-env, 
                              /tmp/RtmpXlELmY/rhstr.map2cf71bf8a3a9, 
                              /home/piers/hadoop/tmp/hadoop-unjar1509588906818235502/] []
                              /tmp/streamjob912555254031649512.jar tmpDir=null
12/04/28 19:18:04 INFO mapred.FileInputFormat: Total input paths to process : 1
12/04/28 19:18:05 INFO streaming.StreamJob: getLocalDirs(): [/home/piers/hadoop/tmp/mapred/local]
12/04/28 19:18:05 INFO streaming.StreamJob: Running job: job_201204281916_0001
12/04/28 19:18:05 INFO streaming.StreamJob: To kill this job, run:
12/04/28 19:18:05 INFO streaming.StreamJob: /usr/libexec/../bin/hadoop job 
                             -Dmapred.job.tracker=192.168.1.3:9001 -kill job_201204281916_0001
12/04/28 19:18:15 INFO streaming.StreamJob: Tracking URL: http://192.168.1.3:50030/jobdetails.jsp?jobid=job_201204281916_0001
12/04/28 19:18:16 INFO streaming.StreamJob:  map 0%  reduce 0%
12/04/28 19:18:45 INFO streaming.StreamJob:  map 100%  reduce 0%
12/04/28 19:18:54 INFO streaming.StreamJob:  map 100%  reduce 17%
12/04/28 19:18:57 INFO streaming.StreamJob:  map 100%  reduce 67%
12/04/28 19:19:06 INFO streaming.StreamJob:  map 100%  reduce 100%
12/04/28 19:19:21 INFO streaming.StreamJob: Job complete: job_201204281916_0001
12/04/28 19:19:21 INFO streaming.StreamJob: Output: /tmp/RtmpXlELmY/file2cf7546b881b
> from.dfs('/tmp/RtmpXlELmY/file2cf7546b881b')
Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

[[1]]
[[1]]$key
[1] 1

[[1]]$val
[1] 1

attr(,"rmr.keyval")
[1] TRUE

[[2]]
[[2]]$key
[1] 2

[[2]]$val
[1] 4
...
</pre>
</p>
]]>

</content>
</entry>

<entry>
<title>Moodle bulk management of users, courses, and course categories</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2012/04/moodle_bulk_man.html" />
<modified>2012-04-28T07:24:04Z</modified>
<issued>2012-04-14T07:41:36Z</issued>
<id>tag:www.piersharding.com,2012:/blog//1.93</id>
<created>2012-04-14T07:41:36Z</created>
<summary type="text/plain"> One of the (many) new features of Moodle 2.2 is the ability to create administration Tools plugins. This enables us developers to create and package (hopefully) useful tools that make the management of Moodle easier. One of the things...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>moodle</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>
One of the (many) new features of <a href="http://moodle.org/">Moodle 2.2</a> is the ability to create administration Tools <a href="http://docs.moodle.org/dev/Admin_tools">plugins</a>.  This enables us developers to create and package (hopefully) useful tools that make the management of Moodle easier.  One of the things that I've seen wished for is the ability to bulk upload courses and related material, and over recent months, this is something that I've been working on.
</p>
<p>The key things that people want (from an administration point of view) are to manage people and courses.  Often these activities are a tiresome bulk process at set times of the year with a relatively minor tweaking type of activity in between.  For managing users - create/update/delete, and enrolments - we already have the built in functionality to do <a href="http://docs.moodle.org/22/en/Upload_users">bulk user upload</a>.  I have added to this for <a href="https://gitorious.org/moodle-tool_uploadcourse">Courses</a> and for <a href="https://gitorious.org/moodle-tool_uploadcoursecategory">Course categories</a>.<br/>
The course upload admin tool can be used to create and manage course outlines, but it can also populate courses using either a nominated course as a template (copies the course contents using the Moodle backup/restore facility), or populate the course from a Moodle backup file.
</p>]]>

</content>
</entry>

<entry>
<title>Hadoop and Dumbo</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2012/04/hadoop_and_dumb.html" />
<modified>2012-04-28T07:24:11Z</modified>
<issued>2012-04-13T04:20:43Z</issued>
<id>tag:www.piersharding.com,2012:/blog//1.92</id>
<created>2012-04-13T04:20:43Z</created>
<summary type="text/plain">Dumbo is a Python framework for writing Map Reduce flows with or without Hadoop. It&apos;s been a pain up until now, trying to get it going as it has relied on a number of patches to Hadoop for different byte...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>Data</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<a href="https://github.com/klbostee/dumbo">Dumbo</a> is a <a href="http://www.python.org/">Python</a> framework for writing Map Reduce flows with or without <a href="http://hadoop.apache.org/">Hadoop</a>.  It's been a pain up until now, trying to get it going, as it has relied on a number of patches to Hadoop for different byte streams, type codes etc. to make it work.  No longer - the necessary patches have now made it into core as of <a href="http://hadoop.apache.org/common/docs/r1.0.2/releasenotes.html">1.0.2</a>.
<br/>
On Ubuntu 12.04 all I needed was the Debian package from <a href="http://hadoop.apache.org/common/releases.html#Download">here</a> (<a href="http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/">install</a> as per these instructions), and then to run sudo easy_install dumbo.
<br/>
The only catch is that Dumbo does not currently recognise the Debian package layout used by the Hadoop package maintainers, so I found that I had to make a one line patch to compensate for it:
<pre>
diff --git a/dumbo/util.py b/dumbo/util.py
index a57166d..cd35df3 100644
--- a/dumbo/util.py
+++ b/dumbo/util.py
@@ -267,6 +267,7 @@ def findjar(hadoop, name):
     hadoop home directory and component base name (e.g 'streaming')"""
 
     jardir_candidates = filter(os.path.exists, [
+        os.path.join(hadoop, 'share', 'hadoop', 'contrib', name),
         os.path.join(hadoop, 'mapred', 'build', 'contrib', name),
         os.path.join(hadoop, 'build', 'contrib', name),
         os.path.join(hadoop, 'mapred', 'contrib', name, 'lib'),
</pre>
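<br/>
The effect of the patch can be sketched in isolation.  This is a simplified illustration rather than Dumbo's actual findjar code: the function name jar_dir_candidates is hypothetical, and only the four directory layouts visible in the diff are listed.  With -hadoop /usr the added first entry resolves to /usr/share/hadoop/contrib/streaming, which is the location the extra line implies for the Debian package.
<pre>
```python
import os


def jar_dir_candidates(hadoop_home, name):
    """Return the existing contrib directories for a component such as 'streaming'.

    Hypothetical helper: it mirrors only the candidate list shown in the diff.
    """
    candidates = [
        # Debian package layout (the line added by the patch)
        os.path.join(hadoop_home, "share", "hadoop", "contrib", name),
        # Layouts already known to Dumbo, as shown in the diff context
        os.path.join(hadoop_home, "mapred", "build", "contrib", name),
        os.path.join(hadoop_home, "build", "contrib", name),
        os.path.join(hadoop_home, "mapred", "contrib", name, "lib"),
    ]
    # Keep only the directories that actually exist on this machine
    return [path for path in candidates if os.path.exists(path)]
```
</pre>
Calling jar_dir_candidates("/usr", "streaming") on a machine without any of these directories simply returns an empty list.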
<br/>
And then run the quick tutorial example from <a href="https://github.com/klbostee/dumbo/wiki/Short-tutorial">here</a> like so:
<pre>
hadoop fs -copyFromLocal /var/log/apache2/access.log /user/hduser/access.log
hadoop fs -ls /user/hduser/
dumbo start ipcount.py -hadoop /usr -input /user/hduser/access.log -output ipcounts
dumbo cat ipcounts/part* -hadoop /usr | sort -k2,2nr | head -n 5
</pre>]]>

</content>
</entry>

<entry>
<title>Email, Gource, Hadoop, and Python</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2012/04/email_gource_ha.html" />
<modified>2012-04-28T07:24:15Z</modified>
<issued>2012-04-12T07:27:43Z</issued>
<id>tag:www.piersharding.com,2012:/blog//1.91</id>
<created>2012-04-12T07:27:43Z</created>
<summary type="text/plain">I never knew that one of the guys (Andrew C) who works at Catalyst wrote a fantastic times series visualisation tool called Gource . It&apos;s incredible what people have done with it - just look on Youtube. The focus of...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>Data</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>I never knew that one of the guys (Andrew C) who works at <a href="http://www.catalyst.net.nz/">Catalyst</a> wrote a fantastic time series visualisation tool called <a href="http://code.google.com/p/gource/">Gource</a>.  It's incredible what people have done with it - just look on <a href="http://www.youtube.com/results?search_query=gource">Youtube</a>.
The focus of use seems to have been on analysis of source code repository activity, but I think there is more mileage to be had from Gource than this.  I wrote a simple Map/Reduce map chain for Hadoop (not really necessary for my volume of data) that stripped out the from/to/date information from all my mbox history since 1996.  It really is simple - all you need is to generate a file in the custom format, e.g.:
</p>
<pre>
0970518767|"DJ Adams" &lt;DJ_Adams@rank.com&gt;|M|Andrew_Powis/RVSUK/FES/Rank@rank.com
...
</pre>

and then pump this through Gource:<pre>
gource --start-position 0.28 --stop-position 0.29 --title 'Communication sphere since 1996' -s 1 --log-format custom email-log.txt
</pre>
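<p>
For a small mailbox the same log can be produced without Hadoop.  The following is a minimal sketch, not the Map/Reduce chain described above: the function names are hypothetical, it assumes one line per message using only the From, To and Date headers, and it always uses the M action as in the sample line.
</p>
<pre>
```python
import mailbox
from email.utils import mktime_tz, parsedate_tz


def clean(value):
    """Collapse whitespace and drop '|', which separates fields in the log."""
    return " ".join(str(value).replace("|", " ").split())


def gource_line(msg):
    """Return 'timestamp|from|M|to' for one message, or None if headers are missing."""
    date = parsedate_tz(str(msg.get("Date", "")))
    sender, recipient = msg.get("From"), msg.get("To")
    if date is None or not sender or not recipient:
        return None
    return "%d|%s|M|%s" % (mktime_tz(date), clean(sender), clean(recipient))


def mbox_to_gource(path):
    """Read an mbox file and return log lines sorted by timestamp."""
    lines = [gource_line(msg) for msg in mailbox.mbox(path)]
    usable = [line for line in lines if line]
    # Sort numerically on the first field so the log is in chronological order
    return sorted(usable, key=lambda line: int(line.split("|", 1)[0]))
```
</pre>
<p>
Writing the result of mbox_to_gource() to email-log.txt, one entry per line, gives a file in the shape shown in the sample above.
</p>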
<p>
You can record it as a video too:
</p>
<pre>
gource --start-position 0.28 --stop-position 0.29 --title 'Communication sphere since 1996' -s 1 \
    --log-format custom email-log.txt  -1280x720 -o - | ffmpeg -y -r 60 \
    -f image2pipe -vcodec ppm -i - -vcodec libx264 -preset ultrafast -crf 1 -threads 0 -bf 0 gource-video-of-email.mp4
</pre>
And this is what it looks like:<br/>
<iframe width="560" height="315" src="http://www.youtube.com/embed/i3nag9vSdjo?rel=1&modestbranding=1" frameborder="0" allowfullscreen></iframe>]]>

</content>
</entry>

<entry>
<title>CSV files need SQL</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2012/03/csv_files_need.html" />
<modified>2012-04-28T07:24:22Z</modified>
<issued>2012-03-30T17:46:27Z</issued>
<id>tag:www.piersharding.com,2012:/blog//1.90</id>
<created>2012-03-30T17:46:27Z</created>
<summary type="text/plain">As part of learning about R it soon has become apparent that the basic unit of currency is a CSV file - there are lots of other ways of getting data in and out of the R environment (JSON with...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>python</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>As part of learning about <a href="http://www.r-project.org/">R</a> it has soon become apparent that the basic unit of currency is a CSV file - there are lots of other ways of getting data in and out of the R environment (JSON with library(RJSONIO), DB interfaces with library(RPostgreSQL) ...) but for the majority of work (which consists of hackery and experimentation, which is why R is so attractive) CSV is the transportation mechanism.</p>
<p>
I have found that, in particular at the beginning, it is often harder to think of basic data munging concepts in R - typical tasks like sorting, grouping, and data type conversion.  Often your language of choice (Perl, Python, or even bash) is just quicker for doing these things in the first instance, when I'm paring the data down into what I want to apply some form of statistical analysis or charting to.</p>
<p>
With this in mind - I basically wanted to be able to perform SQL against a CSV file, without the hassle of loading it into a database first.
Enter a clever tool written in Haskell called <a href="http://keithsheppard.name/txt-sushi/tssql.html">txt-sushi</a>.  This enables you to do interesting things like:
</p>
<pre>
cat test.csv | tssql -table x - 'select a,b, sum(hours) AS hours_sum from x group by a,b'
</pre>
<p>
However, for my purposes tssql is too strict on handling data types, and is dependent on Haskell, so I've built my own simple CSV SQL processor - <a href="https://github.com/piersharding/csvtable">csvtable</a> in Python using <a href="http://www.sqlite.org">SQLite</a> as a backend.  This is surprisingly easy to do, and lets you have the benefit of the convenience and power of SQLite syntax:
</p>
<pre>
python csvtable.py \
  --where="system_code != 'LEAVE'" \
  --convert='date_epoch:date,hours:int' \
  --list="*, sum(hours) AS hours_sum, min(date_epoch) AS date_epoch_min, 
               max(date_epoch) AS date_epoch_max, count(*) AS days,
               ROUND(AVG(hours), 2) AS avg_time, MIN(hours) AS hours_min,
               MAX(hours) AS hours_max" \
  --group='organisation_code, system_code, request_id' \
  --file=test1.csv | \
 python csvtable.py --list='*, ROUND(((date_epoch_max - date_epoch_min) / (60 * 60 * 24)) + 1, 2) AS duration' > test2.csv
</pre>
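<p>
To show why SQLite makes this easy, here is a minimal sketch of the general technique.  It is not csvtable's actual code: the function name query_csv and the default table name x are assumptions, and every column is declared NUMERIC so that SQLite stores number-like text as numbers and leaves everything else as text.
</p>
<pre>
```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="x"):
    """Load CSV text into an in-memory SQLite table and run one query on it."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    conn = sqlite3.connect(":memory:")
    # Quote identifiers so that unusual column names survive
    columns = ", ".join('"%s" NUMERIC' % name.replace('"', '""') for name in header)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, columns))

    placeholders = ", ".join("?" for _ in header)
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, placeholders), data)

    cursor = conn.execute(sql)
    names = [description[0] for description in cursor.description]
    return names, cursor.fetchall()
```
</pre>
<p>
The tssql example above then becomes query_csv(text, 'select a, b, sum(hours) AS hours_sum from x group by a, b'), which returns the column names and the grouped rows.
</p>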

]]>

</content>
</entry>

<entry>
<title>Hadoop and single file to mapper processing flow</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2012/03/hadoop_and_sing.html" />
<modified>2012-04-28T07:24:27Z</modified>
<issued>2012-03-26T17:13:50Z</issued>
<id>tag:www.piersharding.com,2012:/blog//1.89</id>
<created>2012-03-26T17:13:50Z</created>
<summary type="text/plain">It seems like a trivial thing to want to do, but it appears that the standard Hadoop workflow is to treat all input files as line oriented transactions, which does not help at all when I want to process on...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>Hadoop</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>It seems like a trivial thing to want to do, but it appears that the standard Hadoop workflow is to treat all input files as line oriented transactions, which does not help at all when I want to process on a file by file basis.  The example I was working through is where I have 20 years worth of mbox email files.  Each file needs to be broken into individual emails, the contents parsed, and useful information in the headers stripped out into a convenient format for subsequent processing.  To do this in the context of Hadoop is slightly odd.  It appears that the usual approach is to create an input file of mbox file names (loaded into HDFS), and then each mapper execution uses the HDFS API to pull the file and process it.</p>

<p>This presented another problem - in Python, how do you access the HDFS API?  There are two existing integrations that I can find - https://github.com/traviscrawford/python-hdfs, and http://code.google.com/p/libpyhdfs/.  <a href="https://github.com/traviscrawford">Travis Crawford's</a> is easy to get going, but as it uses a JNI binding I didn't relish the prospect of trying to make sure CLASSPATHs etc. are right across all my Hadoop nodes (which for my purposes are any machine that I can beg, borrow or steal).  In light of this, I created my own cheap and cheerful library that uses subprocess to call the 'hadoop' executable for 'fs' - <a href="https://github.com/piersharding/hdfsio">hdfsio</a>.<br />
I admit this isn't the height of efficiency (or possibly elegance), but it is surprisingly robust and very simple.<br />
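The subprocess approach can be sketched in a few lines.  This is an illustration of the idea only, not the hdfsio API: the function names are hypothetical, and it assumes a 'hadoop' executable on the PATH that accepts the standard 'fs -cat' sub-command.
<pre>
```python
import subprocess


def hdfs_command(*args, hadoop="hadoop"):
    """Build the argument list for a 'hadoop fs' sub-command."""
    return [hadoop, "fs"] + list(args)


def hdfs_cat(path, hadoop="hadoop"):
    """Return the bytes of one HDFS file by shelling out to 'hadoop fs -cat'.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    return subprocess.check_output(hdfs_command("-cat", path, hadoop=hadoop))


def process_listed_files(names, handler, hadoop="hadoop"):
    """For each file name (one per input record), fetch the file and hand it on."""
    for name in names:
        name = name.strip()
        if name:
            handler(name, hdfs_cat(name, hadoop=hadoop))
```
</pre>
In a mapper, process_listed_files(sys.stdin, handler) would then pull each listed mbox file in turn, at the cost of starting one 'hadoop' process per file.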
</p>]]>

</content>
</entry>

<entry>
<title>Journey into Hadoop</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2012/03/journey_into_ha.html" />
<modified>2012-04-28T07:24:32Z</modified>
<issued>2012-03-25T20:23:14Z</issued>
<id>tag:www.piersharding.com,2012:/blog//1.88</id>
<created>2012-03-25T20:23:14Z</created>
<summary type="text/plain">I&apos;ve been building up my background knowledge on current toolsets used in Data Science, and part of this is R and another is Hadoop. Hadoop is a big thing, and takes (to my mind) quite a lot of effort to...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>python</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>I've been building up my background knowledge on current toolsets used in Data Science, and part of this is <a href="http://www.r-project.org/">R</a> and another is <a href="http://hadoop.apache.org/">Hadoop</a>.</p>

<p>Hadoop is a big thing, and takes (to my mind) quite a lot of effort to get going, and to understand how you can bend it to your will.  Part of this learning process has been about finding a comfortable installation pattern for Linux - in particular Ubuntu, and the best help I've found so far has been from <a href="http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/">Michael Noll</a>.  Things that I had to be careful about were getting ssh working, and name resolution exactly right on all nodes that you put in your cluster, as you distribute things like /etc/hadoop/masters and the *-site.xml config files.</p>

<p>The next stage was to find a development pattern that enabled me to avoid Java.  The answer to this for me is <a href="http://wiki.apache.org/hadoop/HadoopStreaming">Hadoop Streaming</a>.  This basically allows you to pipe IO in and out of programs written in your favourite language - and in this case Michael does brilliantly again with <a href="http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/">Python and MapReduce</a>.<br />
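As a rough illustration of the streaming model, a mapper is just a program that reads records on stdin and writes tab-separated key/value lines to stdout.  The word-count logic and the function names below are assumptions made for the example and are not taken from the tutorial; anything that is not data is written to stderr so that it cannot end up in the stream passed to the next task.
<pre>
```python
import sys


def map_lines(lines):
    """Yield one 'word, tab, 1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    """Run the mapper over stdin, writing data to stdout and a summary to stderr."""
    count = 0
    for record in map_lines(stdin):
        stdout.write(record + "\n")  # stdout carries data only
        count += 1
    stderr.write("emitted %d records\n" % count)  # diagnostics go to stderr
```
</pre>
Calling main() at the bottom of the script turns it into something that can be tested locally with a plain shell pipe before it is handed to Hadoop Streaming.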
</p>]]>

</content>
</entry>

<entry>
<title>Web Services for Mahara</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2011/06/web_services_fo.html" />
<modified>2012-03-25T20:21:24Z</modified>
<issued>2011-06-24T20:12:37Z</issued>
<id>tag:www.piersharding.com,2011:/blog//1.87</id>
<created>2011-06-24T20:12:37Z</created>
<summary type="text/plain"> As part of some work for the Ministry of Education for LMS -&gt; myPortfolio (Mahara) integration, it became apparent that we needed a Web Services stack. This is not particularly interesting in it&apos;s self, but it is something that...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>mahara</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p><span class="mt-enclosure mt-enclosure-image" style="display: inline;"><img alt="mahara.png" src="http://www.piersharding.com/blog/mahara.png" width="160" height="160" class="mt-image-none" style="" /></span></p>

<p>As part of some work for the <a href='http://www.minedu.govt.nz'>Ministry of Education</a> for LMS -> <a href="http://myportfolio.school.nz">myPortfolio</a>  (<a href="http://www.mahara.org">Mahara</a>) integration, it became apparent that we needed a Web Services stack.  This is not particularly interesting in itself, but it is something that an interconnected service needs, in order to participate in a Socially Networked world. <br />
Building a WS framework is not a difficult thing, but it is relatively time consuming (anything that takes more than a few weeks is considered expensive here), so the problem was how to develop, quickly and cheaply, an unexciting feature that in itself does not deliver any great new user experience.  At this point it occurred to me that there might be a solution in what <a href="http://www.moodle.org">Moodle</a>  has achieved with its <a href="http://docs.moodle.org/dev/Web_services">Web Services Framework</a> - after all, Mahara is (in a previous life) based on Moodle.</p>

<p>It turned out, that the way that Peta Skoda has developed the Moodle WSF is fundamentally based on Zend data services,  and is quite portable.  </p>

<p>To this end, I have ported it as an <a href="https://wiki.mahara.org/index.php/Plugins"> auth plugin</a> which can be <a href="https://gitorious.org/mahara-contrib/auth-webservice">downloaded here</a>, and the documentation is <a href="https://wiki.mahara.org/index.php/Plugins/Artefact/WebServices">here</a>.</p>

<p>This gives the basic features of token, and user based auth, with SOAP, XML-RPC and JSON emitting REST based services.  There are a number of other things that I'd like to add to this, the most important being OAuth based authentication, and JSON based import parameter consumption.</p>

<p>Edit: JSON based import parameter consumption, has been done, but I want to add replacing MNet to the list of things to do.<br />
</p>]]>

</content>
</entry>

<entry>
<title>Moodle, OAuth, and Google Fusion</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2010/09/moodle_oauth_an.html" />
<modified>2011-03-23T22:01:56Z</modified>
<issued>2010-09-05T18:06:30Z</issued>
<id>tag:www.piersharding.com,2010:/blog//1.86</id>
<created>2010-09-05T18:06:30Z</created>
<summary type="text/plain">Convergence is a strange and reoccurring theme, and it&apos;s happened again from me over the last few months with BI reporting, Moodle, OAuth, and Google. I&apos;ve looked at a few BI (well SAP, Business Objects, and Pentaho) implementations over the...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>moodle</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>Convergence is a strange and reoccurring theme, and it's happened again for me over the last few months with BI reporting, <a href="http://www.moodle.org">Moodle</a>, <a href="http://oauth.net">OAuth</a>, and Google.</p>

<p>I've looked at a few BI (well <a href="http://www.sdn.sap.com/irj/sdn/edw">SAP</a>, <a href="http://en.wikipedia.org/wiki/Crystal_Reports">Business Objects</a>, and <a href="http://www.pentaho.com/">Pentaho</a>) implementations over the years, and one of the things that I have always found frustrating/off putting is what I consider the huge startup costs for such implementations.  This has usually been characterised by expensive infrastructure implementations in both hardware and software coupled with the difficulty that most businesses have in visualising what data they need to have access to, and how it should be most effectively presented.</p>

<p>I've found this dilemma more acute in the <a href="http://www.moodle.org">Moodle</a> world, as so many of the customers involved are on a very tight to non-existent budget, yet their requirement to analyse Learning Management System performance data is still there.</p>

<p>A year ago, I concluded that Pentaho was my first choice, for the twin reasons that it's OpenSource (specifically no license fees), and that it has sufficiently good data modelling tools to enable a suite of reports customised to Moodle to be delivered.  While this reduces the cost of delivering a flexible reporting solution for Moodle, it still falls short on a couple of points:</p>

<p>(1) Most people who implement Moodle are not Data Warehousing, or Modelling experts so they are unlikely to be able to sufficiently accurately determine what their requirements are in advance (actually a common business problem, not unique to the Moodle community).<br />
(2) Pentaho, while reasonably straight forward to install, is still another complex piece of software to host - a major barrier to entry for most Moodle implementations.</p>

<p>What I started looking for then, was a set of visualisation tools that could be integrated with Moodle's PHP environment - at least users would then be able to do more complex reporting and analysis.  What I found exceeded my expectations, in the form of a Labs project from Google called <a href="http://tables.googlelabs.com/Home">Fusion Tables</a>.</p>

<p>Fusion Tables is shaping up to be Business Intelligence reporting with the twist of collaborative, and Geo encoding capabilities.  The basic mode is that CSV files of data can be uploaded into a flexible storage engine, datasets can be joined and merged, automatically Geo encoded, and then consumed through a good set of graphical presentation tools.  <a href="http://tables.googlelabs.com/DataSource?dsrcid=197026">Datasets</a> can be shared and collaboratively edited.<br />
<script src="http://www.gmodules.com/ig/ifr?url=http://www.google.com/ig/modules/bar-chart.xml&up__table_query_url=http://tables.googlelabs.com/gvizdata?tq=select+col0%252Ccol5+from+191509++skip+0+limit+228&up__table_query_refresh_interval=0&w=600&h=400&border=%23ffffff%7C3px%2C1px+solid+%23999999&synd=open&output=js"></script></p>

<p><br />
As luck would have it, this service is firstly free, and secondly exposed via an SQL-like <a href="http://code.google.com/apis/fusiontables/">API</a> integrated with the standard Google OAuth mechanism.  This makes it attractive as a generic data analysis and reporting tool for a low cost operating environment like Moodle and the education sector.</p>

<p>To test out the theory of all this, I've implemented 3 things:<br />
 * OAuth integration for Moodle including a site, and secret registry<br />
 * A generic Fusion Tables data proxy for Moodle<br />
 * A Gradebook export module that enables the export of the standard gradebook data to Fusion Tables</p>

<p>For the curious, this can be found at Gitorious - <a href="http://gitorious.org/moodle-local_oauth/moodle-local_oauth">moodle-local_oauth</a>.</p>]]>

</content>
</entry>

<entry>
<title>New release for sapnwrfc PHP and Python</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2009/08/new_release_for.html" />
<modified>2009-08-27T21:27:18Z</modified>
<issued>2009-08-26T18:43:52Z</issued>
<id>tag:www.piersharding.com,2009:/blog//1.85</id>
<created>2009-08-26T18:43:52Z</created>
<summary type="text/plain">Been a busy month, working on the NW SAP RFC connectors. With build help from Menelaos, I now have a working Python build system for Windows on the Python NW RFC Connector as of version 0.07 - this is available...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>general</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>Been a busy month, working on the NW SAP RFC connectors.  With build help from Menelaos, I now have a working Python build system for Windows on the Python NW RFC Connector as of version 0.07 - this is available <a href="http://www.piersharding.com/download/python/sapnwrfc/">here</a>.</p>

<p>Also, with help from Joachim, I've added a static function sapnwrfc_removefunction(&lt;sysid&gt;, &lt;function name&gt;) to the PHP connector that allows function definitions to be removed from the local cache.  This is most useful when developing RFC applications in PHP, as you can modify your RFC definition without having to restart the web server every time.  This is available from version 0.09 <a href="http://www.piersharding.com/download/php/sapnwrfc/">here</a>.</p>]]>

</content>
</entry>

<entry>
<title>Auth SAML 2.0 for Mahara</title>
<link rel="alternate" type="text/html" href="http://www.piersharding.com/blog/archives/2009/08/auth_saml_20_fo.html" />
<modified>2009-08-14T20:31:34Z</modified>
<issued>2009-08-14T20:25:49Z</issued>
<id>tag:www.piersharding.com,2009:/blog//1.84</id>
<created>2009-08-14T20:25:49Z</created>
<summary type="text/plain">Following on from the SAML 2.0 work that I&apos;ve done recently for Moodle, I thought it was useful to do the same for the Mahara ePortfolio service, while I was in the same space. Details of the first release can...</summary>
<author>
<name>PiersHarding</name>
<url>http://www.piersharding.com</url>
<email>piers@ompka.net</email>
</author>
<dc:subject>catalyst</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.piersharding.com/blog/">
<![CDATA[<p>Following on from the SAML 2.0 work that I've done recently for Moodle, I thought it would be useful to do the same for the <a href="http://www.mahara.org">Mahara</a> ePortfolio service while I was in the same space.  Details of the first release can be found <a href="http://wiki.mahara.org/Plugins/Auth/Saml">here</a>, with tested versions for both trunk and 1.1_STABLE.</p>]]>

</content>
</entry>

</feed>