September 10, 2013

Integrating R with Pentaho

RPentaho - R integration for Pentaho based on community tools.

Recently, I've been involved in a project that has implemented Pentaho for an Analytics solution for Moodle . This is a large (and probably will be very large) Moodle implementation, so standard Moodle reporting is just not up to it.
One of the requirements was to be able to export student interaction and activity completion data. This can quickly become huge, and the standard Pentaho CSV exporting interfaces can't cope, but there is a good solution to this based on the WebDetails CDA and CDB work. What WebDetails have done, is provide an excellent authenticated JSON API for common Pentaho queries, whether they be Saiku Analytics or Saiku Adhoc queries. With this, a user can use the familiar tools to design a query, and then bookmark it.

To complete the loop, I've written a library for R that uses the JSON interface to access these stored queries, and import the data as a standard data.frame object - RPentaho.

Posted by PiersHarding at 9:02 AM

February 10, 2013

Hosting an R Repository for RSAP and RMonet

I've just setup an R repository to host my R extensions that I've published. This currently contains RSAP the SAP RFC connector, and RMonet the MonetDB connector using the Monet MAPI C API.

It's a very easy process as document here .

This repository can be generally accessed by doing the following:
setRepositories(addURLs = c(PiersHarding = ""))

Or for and individual package:
install.packages('RMonet', repos=c(''))

Posted by PiersHarding at 8:49 AM

January 31, 2013

Data Hackery - R, SAP, and OpenSource in-memory databases

I've just completed a post on SAP SCN regarding using In-Memory column oriented database MonetDB with SAP and R for exploratory data analysis titled "Data Hackery - R, SAP, and OpenSource in-memory databases" . This uses an R library that I've created as a database interface to MonetDB called RMonet.

Posted by PiersHarding at 5:38 PM

July 18, 2012

Google Drive repository plugin for Moodle

Just added a Google Drive repository plugin for Moodle to my moodle-google set of applications here:

Posted by PiersHarding at 2:31 PM

June 13, 2012

SAP with R

Something that piqued my curiosity lately was the developments with SAP HANA and R (good overview here). This is definitely a new and exciting direction for SAP, with creating a well structured, and organised 'Big Table' option for in memory computing, and then going the extra mile to embed a specialised Open Source Statistical Computing package (R) in it - making the fore front of the world of statistical analysis open to those that dare.

This is utterly brilliant, but the problem is that I can't access it as I don't have access to a SAP HANA instance (nor would most people). It is also heavily geared to 'Big Data', when there is still an awful lot to be gained from small, and mid-range data analysis arenas (resisting the temptation about size and clichés).

This has definitely touched on my hackers itch, and in response to this I've created one more Scripting Language Connector for R - RSAP.

The idea of this is to enable RFC calls (using the SAP NW RFC SDK) where any table contents are returned as data.frames (in R parlance).

Once you have this data in R, then the world is your oyster - it is up to your imagination as to what you do with it. To give an overview of how it works, and what you can do, I'm going to step through the process of installing and using RSAP.

Obtaining and Installing

Firstly you need to install R. I recommend using RStudio as it is a comfortable graphical user interface - you can get it from here.

Under debian (read Ubuntu) flavoured Linux you can install R first before downloading/installing RStudio using:

sudo apt-get install r-base-core r-base-dev r-base-html r-recommended


The SDK is available from the SAP Service Market Place SWDC - this is a forum discussion on getting it

If you have (like me) installed the NPL SAP Test Drive instance, then the SAP NW RFC libs exist in the /usr/sap/NPL/SYS/exe/run directory, the only problem being that it does not contain the C header files (really - SAP should make this available on SDN).


Download or clone the RSAP project source from


Ensure that the R library prerequisites are installed. To do this there is a helper script in the RSAP source code directory. cd to the source directory (downloaded above) - in my case /home/piers/git/public/RSAP - and run the following:

R --no-save < install_dependencies.R

This will prompt to install the packages yaml, reshape, plotrix, and RUnit.

To build and install the RSAP package, cd to the source directory (downloaded above) - in my case /home/piers/git/public/RSAP - run the following:

R CMD INSTALL --build --preclean --clean --configure-args='--with-nwrfcsdk-include=/home/piers/code/sap/nwrfcsdk/include --with-nwrfcsdk-lib=/home/piers/code/sap/nwrfcsdk/lib' .

You must change the values for --with-nwrfcsdk-include and --with-nwrfcsdk-lib to point to the directory locations that you have downloaded the SAP NW RFC SDK to.

Under Linux, it is also likely that you need to add the lib directory to the LD cache or set the LD_LIBRARY_PATH variable.

Setting the LD Cache:

as root, edit /etc/ and add the lib path from above to it on it's own line. Now regenrate the cache by executiong 'sudo ldconfig'.


You must ensure that the following environment variable is set in all your shells:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/nwrfcsdk/lib

The easiest way to do this is to add the above line to your $HOME/.bashrc file so that it happens automatically for all future shells.

Does it work?

Once the build and install of the RSAP package is complete, now you should test to make sure it's all working.

Change to the package source code directory (you are probably still there from the above activities), and launch either R or RStudio.

From the R command line try the following:

> library(RSAP)

Loading required package: yaml


You should get the above confirmation message that the dependent yaml package has been loaded. Now we are ready to try some R wizardry.

How to work with RSAP

Lets work through the general process steps for interacting with SAP.

Connecting to SAP

Using RSAP we need to establish a connection to SAP. For this you need an account that has the appropriate access for RFC calls, and functionality access. Connections can be built in two ways - directly passing connection parameters:

> conn <- RSAPConnect(ashost="nplhost", sysnr="42",

client="001", user="developer",

passwd="developer", lang="EN")


Or using a YAML encoded file that contains the connection details:

> conn <- RSAPConnect("sap.yml")


The sap.yml file is structured like:

ashost: nplhost

sysnr: "42"

client: "001"

user: developer

passwd: developer

lang: EN

trace: 1

The above activates the trace functionality in NW RFC SDK. This will create trace files in the current working directory, and are invaluable for debugging connectivity problems.

Calling SAP

Now we have the connection object, we can get connection info with it:

info <- RSAPGetInfo(conn)

Query the system with:

res <- RSAPInvoke(conn, "<RFC Function Name", parms)

Or close the connection:


RSAPInvoke() is what we are most interested in, and we need to pass the parameters as a series of nested named lists. The classic example is RFC_READ_TABLE:

parms <- list('DELIMITER' = '|',



'OPTIONS' = list(TEXT = list("CARRID = 'AA' ", " AND CONNID = 0017 ")),


res <- RSAPInvoke(conn, "RFC_READ_TABLE", parms)

The names must correspond directly to the parameter and structure (for tables) names, and use numeric and character types as appropriate.

The other thing that is really important to get your head around is that R data structures are column oriented, which means we have to think differently about tables that we get from SAP. Tables in SAP translate to lists of vectors where the outer list is a list of column names (a slightly loose analogy but it will do) and the vectors hang off these column names corresponding to all the values in that column down the rows.

Working through the examples in get_flights.R

In the source code package there is an example script - get_flights.R. It uses the standard demonstration data for the Flight Data system contained in table SFLIGHT2. Let's look at what this does.

Load libraries:

> library(RSAP)

Loading required package: yaml

> library(reshape)

Loading required package: plyr

Attaching package: ‘reshape’

The following object(s) are masked from ‘package:plyr’:

rename, round_any

> library(plotrix)


We now have all the necessary libraries for the rest of the examples.

conn <- RSAPConnect("sap.yml")

parms <- list('DELIMITER' = ';',


res <- RSAPInvoke(conn, "RFC_READ_TABLE", parms)


sflight = res$DATA

flds <- sub("\\s+$", "", res$FIELDS$FIELDNAME)

sflight <- data.frame(sflight, colsplit(sflight$WA, split = ";", names = flds))

This connects to SAP, calls RFC_READ_TABLE to get the contents of SFLIGHT2, and sets the column delimiter for that table as ';'. We close the connection and copy the table data from the return parameter res$DATA (see RFC_READ_TABLE in transaction SE37) into sflight. We also grab the field names returned in table FIELDS, and remove the whitespace at the end. Next - this is where the importance of the ';' delimiter is - using the colsplit() function from the reshape package, we split return DATA into columns named by the FIELDS that RFC_READ_TABLE provided us.

Now we have a data.frame that looks a lot like the table SFLIGHT2 when viewed in transaction SE16.

sflight <- cbind(sflight, FLIGHTNO = paste(sub("\\s+$", "",

sflight$CARRID),sflight$CONNID, sep=""))

sflight$SEGMENT <- paste(sflight$AIRPFROM, sflight$AIRPTO, sep=" - ")

sflight$CARRNAME <- sub("\\s+$", "", sflight$CARRNAME)

sflight$DISTANCE <- as.numeric(lapply(sflight$DISTANCE,

FUN=function (x) {sub("\\*","", x)}))

sflight$DISTANCE <- as.numeric(lapply(sflight$DISTANCE,

FUN=function (x) {if (x == 0) NA else x}))

sflight[sflight$CARRNAME == 'Qantas Airways','DISTANCE'] <- 10258

This next chunk created new vectors (columns) FLIGHTNO combined from CARRID and CONNID, SEGMENT from AIRPFROM and AIRPTO, and cleaned vectors CARRNAME, and DISTANCE.

Now create some aggregated views, to generate visualisations from:

airline_avgocc <- aggregate(data.frame(SEATSMAX=sflight$SEATSMAX,



by=list(carrname=sflight$CARRNAME), FUN=mean, na.rm=TRUE)

airline_sumocc <- aggregate(data.frame(SEATSOCC=sflight$SEATSOCC),

by=list(carrname=sflight$CARRNAME), FUN=sum, na.rm=TRUE)

Show a pie chart - sum of airline occupancy as a share of market:


lbls <- paste(airline_sumocc$carrname, "\n", sprintf("%.2f%%",

(airline_sumocc$SEATSOCC/sum(airline_sumocc$SEATSOCC))*100), sep="")

pie3D(airline_sumocc$SEATSOCC, labels=lbls,


main="Occupancy sum share for Airlines", explode=0.1)


Create a Stacked Bar Plot with Colors and Legend showing a summary of occupancy by segment and carrier - to do this we need to generate a summary (aggregate), and fill in the missing combinations of the grid, and then switch the orientation of rows for columns to present to the plotting funcitons:

d <- aggregate(SEATSOCC ~ CARRNAME:SEGMENT, data=sflight, FUN=sum, na.rm=FALSE)

d2 <- with(d, expand.grid(CARRNAME = unique(d$CARRNAME), SEGMENT = unique(d$SEGMENT)))

airline_sumsegocc <- merge(d, d2, all.y = TRUE)

airline_sumsegocc$SEATSOCC[$SEATSOCC)] <- 0

# switch orientation to segment * carrier

counts <- data.frame(unique(airline_sumsegocc$CARRNAME))

for (a in unique(airline_sumsegocc$SEGMENT))

{counts <- cbind(counts,

airline_sumsegocc$SEATSOCC[which(airline_sumsegocc$SEGMENT == a)]);}

counts[,1] <- NULL

colnames(counts) <- unique(airline_sumsegocc$SEGMENT);

rownames(counts) <- unique(airline_sumsegocc$CARRNAME);


barplot(as.matrix(counts), main="Total Occupancy by Segment and Carrier",

ylab="Number of Seats",


ylim=c(0, 15000), legend = rownames(counts))


Lastly - we create a simple performance indicator using a time series comparison of different airlines:

# performance by airline over time - dollars per customer KM

sflight$FLDATEYYMM <- substr(sflight$FLDATE, start=1, stop=6)

d <- aggregate(data.frame(PAYMENTSUM=sflight$PAYMENTSUM,







FUN=sum, na.rm=TRUE)

d2 <- with(d, expand.grid(carrname = unique(d$carrname),

fldateyymm = unique(d$fldateyymm)))

agg_perf <- merge(d, d2, all.y = TRUE)

agg_perf <- agg_perf[order(agg_perf$carrname, agg_perf$fldateyymm),]


# create time series and plot comparison

perf_series <- data.frame(1:length(unique(agg_perf$fldateyymm)))

for (a in unique(agg_perf$carrname))

{perf_series <- cbind(perf_series,

agg_perf$PERFORMANCE[which(agg_perf$carrname == a)]);}

perf_series[,1] <- NULL

colnames(perf_series) <- unique(agg_perf$carrname);

# convert all to time series

for (a in length(unique(agg_perf$carrname)))

{perf_series[[a]] <- ts(perf_series[,a], start=c(2011,5), frequency=12)}

# plot the first and line the rest


ts.plot(ts(perf_series, start=c(2011,5), frequency=12),

gpars=list(main="Performance: dollar per customer KM",



col=rainbow(dim(perf_series)[2]), xy.labels=TRUE))

legend(2012.05, 3.2, legend=colnames(perf_series),

col=rainbow(dim(perf_series)[2]), lty=1, seg.len=1)


Hopefully, I've shown that there is a lot that can be done with R - especially in the area of adHoc advanced business intelligence and data analysis. I have not really even scratched the surface in terms of what R can offer for advanced statistical analysis and modelling - that is where the true wizards live.

I would love to hear back from anyone who tries RSAP out - issues and user experiences alike.


Basic R Tutorials

Posted by PiersHarding at 2:37 PM