Converting a script from rpy to rpy2 using rpy2.rpy_classic

Converting a script from rpy to rpy2 using rpy2.rpy_classic - python

I'm a bit of a novice when it comes to python, but I want to convert a python script using rpy into one using rpy2. We do have rpy installed somewhere (for python 2.6.x), but it's not playing nicely with the current version of R (3.2.0). We do however have rpy2 installed for the version of python being used in these scripts (python 2.7[.5])
As far as I can tell, these are the lines which need to change (I've simplified the function a bit):
from rpy import r
r.library('<libname>', quietly=True)
r("""\
func <- function(x,a={options.a},b={options.b}) {{
...
*R code here*
...
l<-list(o=o,md=a+b)
l
}}""".format(options=options))
and later in the script, there's a line which calls this function:
out = r.func(<python expression>)['o']
I can do the first half as follows:
import rpy2.rpy_classic as rpy
rpy.set_default_mode(rpy.NO_CONVERSION)
rpy.r.library('<libname>', quietly=True)
rpy.r("""\
func <- function(x,a={options.a},b={options.b}) {{
...
*R code here*
...
l<-list(o=o,md=a+b)
l
}}""".format(options=options))
Trying the above at an interactive prompt (with some fake data), the output is:
<rpy2.rpy_classic.Robj object at 0x2b9e48481510>
but I need the output value of the function rpy.r.func rather than its not-converted value (as I need to obtain the func(<expression)$o value)
Am I moving on the right track? And how do I rewrite the rpy (v1) code so that I get what I want (from rpy2)?

rpy_classic was mostly there in the early days to demonstrate that the lower-level interface in rpy2 could be used to implement any higher-level interface, including the one in rpy. It is not meant to be an ultimate compatibility tool.
With rpy2's high-level interface robjects, your rpy code would look like:
from rpy2.robjects.packages import importr
from rpy2.robjects import r
lib=importr('<libname>')
rfunc=r("""
function(x,a={options.a},b={options.b}) {{
...
*R code here*
...
l<-list(o=o,md=a+b)
l
}}""".format(options=options))
out = rfunc(<python expression>).rx2('o')

Related

Comparing Plumbr to other options for making a chart with R in a Python script

I plan on making a chart with ggplot in a python script. These are details about the project:
I have a script that runs on a remote machine and I can install anything within reason on the machine
The script runs in python and has data that I want to visualize stored as a dictionary
The script runs daily and the data always has the same structure
I think my best bet is to do this...
Write an R script that takes the data and creates the ggplot visualization
Use plumbr to create a rest API for my script
Send a call to the rest API and get a PNG of my plot in return
I'm also familiar with ggpy by yhat and I'm even wondering if I can install R on the machine and just send code directly to the machine to process it without having RStudio.
Would plumbr be a recommended and secure implementation?
This is a reproducible example-
my_data = [{"Chicago": "30"} {"New York": "50"}], [{"Cincinatti": "70"}, {"Green Bay": "95"}]
**{this is the part that's missing}**
library(ggplot)
my_data %>% ggplot(aes(city_name, value)) + geom_col()
png("my_bar_chart.png", my_data)

As mentioned in the comments, most of your question should be answered here: Using R in Python with Rpy2: how to ggplot2?.
You can load ggplot with:
import rpy2.robjects.packages as packages
import rpy2.robjects.lib.ggplot2 as ggp2
assuming of course you have ggplot2 + dependencies available.
Then you can almost use R syntax, except that you put ggp2. in front of every command.
E.g: the Python equivilant of ggplot(mtcars) would be ggp2.ggplot(mtcars).
Your example: (not tested)
my_data = [{"Chicago": "30"} {"New York": "50"}], [{"Cincinatti": "70"}, {"Green Bay": "95"}]
import rpy2.robjects.packages as packages
import rpy2.robjects.lib.ggp2 as ggp2
plot = ggp2.ggplot(my_data) +
ggp2.aes(city_name, value)) +
ggp2.geom_col()
plot.plot()
R("dev.copy(png, 'my_data.png')")

Using R packages in Python using rpy2

There is a package in R that I need to use on my data. All my data preprocessing has already been done in python and all the modelling as well. The package in R is 'PMA'. I have used r2py before using Rs PLS package as follows
import numpy as np
from rpy2.robjects.numpy2ri import numpy2ri
import rpy2.robjects as ro
def Rpcr(X_train,Y_train,X_test):
ro.r('''source('R_pls.R')''')
r_pls=ro.globalenv['R_pls']
r_x_train=numpy2ri(X_train)
r_y_train=numpy2ri(Y_train)
r_x_test=numpy2ri(X_test)
p_res=r_pls(r_x_train,r_y_train,r_x_test)
yp_test=np.array(p_res[0])
yp_test=yp_test.reshape((yp_test.size,))
yp_train=np.array(p_res[1])
yp_train=yp_train.reshape((yp_train.size,))
ncomps=np.array(p_res[2])
ncomps=ncomps.reshape((ncomps.size,))
return yp_test,yp_train,ncomps
when I followed this format is gave an error that function numpy2ri does not exist.
So I have been working off of rpy2 manual and have tried a number of things with no success. The package I am working with in R is implemented like so:
library('PMA')
cspa=CCA(X,Z,typex="standard", typez="standard", K=1, penaltyx=0.25, penaltyz=0.25)
# X and Z are dataframes with dimension ppm and pXq
# cspa returns an R object which I need two attributes u and v
U<-cspa$u
V<-cspa$v
So trying to implement something like I was seeing on the rpy2 tried to load the module in python and use it in python like so
import rpy2.robjects as ro
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage as STAP
from rpy2.robjects import numpy2ri
from rpy2.robjects.packages import importr
base=importr('base'
scca=importr('PMA')
numpy2ri.activate() # To turn NumPy arrays X1 and X2 to r objects
out=scca.CCA(X1,X2,typex="standard",typez="standard", K=1, penaltyz=0.25,penaltyz=0.25)
and got the following error
OMP: Error #15: Initializing libomp.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
Abort trap: 6
I also tried using R code directly using an example they had
string<-'''SCCA<-function(X,Z,K,alpha){
library("PMA")
scca<-CCA(X,Z,typex="standard",typez="standard",K=K penaltyx=alpha,penaltyz=alpha)
u<-scca$u
v<-scca$v
out<-list(U=u,V=v)
return(out)}'''
scca=STAP(string,"scca")
which as I understand can be used like an r function directly
numpy2ri.activate()
scca(X,Z,1,0.25)
this results in the same error as above.
So I do not know exactly how to fix it and have been unable to find anything similar.

The error for some reason is a mac-os issue. https://stackoverflow.com/a/53014308/1628393
Thus all you have to do
is modify it with this command and it works well
os.environ['KMP_DUPLICATE_LIB_OK']='True'
string<-'''SCCA<-function(X,Z,K,alpha){
library("PMA")
scca<-CCA(X,Z,typex="standard",typez="standard",K=Kpenaltyx=alpha,penaltyz=alpha)
u<-scca$u
v<-scca$v
out<-list(U=u,V=v)
return(out)}'''
scca=STAP(string,"scca")
then the function is called by
scca.SCCA(X,Z,1,0.25)

How can I partition pyspark RDDs holding R functions

import rpy2.robjects as robjects
dffunc = sc.parallelize([(0,robjects.r.rnorm),(1,robjects.r.runif)])
dffunc.collect()
Outputs
[(0, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc28618 / R:0x26abd18>), (1, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc283d8 / R:0x26aad28>)]
While the partitioned version results in an error:
dffuncpart = dffunc.partitionBy(2)
dffuncpart.collect()
RuntimeError: ('R cannot evaluate code before being initialized.', <built-in function unserialize>
It seems like this error is that R wasn't loaded on one of the partitions, which I assume implies that the first import step was not performed. Is there anyway around this?
EDIT 1 This second example causes me to think there's a bug in the timing of pyspark or rpy2.
dffunc = sc.parallelize([(0,robjects.r.rnorm), (1,robjects.r.runif)]).partitionBy(2)
def loadmodel(model):
import rpy2.robjects as robjects
return model[1](2)
dffunc.map(loadmodel).collect()
Produces the same error R cannot evaluate code before being initialized.
dffuncpickle = sc.parallelize([(0,pickle.dumps(robjects.r.rnorm)),(1,pickle.dumps(robjects.r.runif))]).partitionBy(2)
def loadmodelpickle(model):
import rpy2.robjects as robjects
import pickle
return pickle.loads(model[1])(2)
dffuncpickle.map(loadmodelpickle).collect()
Works just as expected.

I'd like to say that "this is not a bug in rpy2, this is a feature" but I'll realistically have to settle with "this is a limitation".
What is happening is that rpy2 has 2 interface levels. One is a low-level one (closer to R's C API) and available through rpy2.rinterface and the other one is a high-level interface with more bells and whistles, more "pythonic", and with classes for R objects inheriting from rinterface level-ones (that last part is important for the part about pickling below). Importing the high-level interface results in initializing (starting) the embedded R with default parameters if necessary. Importing the low-level interface rinterface does not have this side effect and the initialization of the embedded R must be performed explicitly (function initr). rpy2 was designed this way because the initialization of the embedded R can have parameters: importing first rpy2.rinterface, setting the initialization, then importing rpy2.robjects makes this possible.
In addition to that the serialization (pickling) of R objects wrapped by rpy2 is currently only defined at the rinterface level (see the documentation). Pickling robjects-level (high-level) rpy2 objects is using the rinterface-level code and when unpickling them they will remain at that lower-level (the Python pickle contains the module the class of the object is defined in and will import that module - here rinterface, which does not imply the initialization of the embedded R). The reason for things being this way are simply that it was "good enough for now": at the time this was implemented I had to simultaneously think of a good way to bridge two somewhat different languages and learn my way through Python C-API and pickling/unpickling Python objects. Given the ease with which one can write something like
import rpy2.robjects
or
import rpy2.rinterface
rpy2.rinterface.initr()
before unpickling, this was never revisited. The uses of rpy2's pickling I know about are using Python's multiprocessing (and adding something similar to the import statements in the code initializing a child process was a cheap and sufficient fix). May this is the time to look at this again. File a bug report for rpy2 if the case.
edit: this is undoubtedly an issue with rpy2. pickled robjects-level objects should unpickle back to robjects-level, not rinterface-level. I have opened an issue in the rpy2 tracker (and already pushed a rudimentary patch in the default/dev branch).
2nd edit: The patch is part of released rpy2 starting with version 2.7.7 (latest release at the time of writing is 2.7.8).

Using R functions in python

I am an avid python user. I have been programming and performing a lot of my statistics using R. Recently, I tried to go into one of my notebooks to perform some statistical analysis. I have written over 5000 lines of code. Now, I have used R functions scattered everywhere throughout my program. Unfortunately, I am unable to even use any of the functions i have written before.
This is what i have done before:
%load_ext rmagic
import rpy2.robjects as R
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
import scipy.stats as sp
stats=importr('stats')
TSA = importr('TSA')
forecast = importr('forecast')
fUnitRoots = importr('fUnitRoots')
tseries = importr('tseries')
urca = importr('urca')
VARS = importr('vars')
zoo = importr('zoo')
aod = importr('aod')
Now, I can't even run any of this any more as i get an import error "r_magic extension has been moved".
Also, i have called R functions by doing the following:
%R acf(x)
Above statement no longer works.
But if i do....
R.r('acf(x)')
it works. This seems like an annoying change i have to incorporate in my large program. Is there a workaround towards this solution?
Thanks

The rmagic is now in rpy2. Do:
%load_ext rpy2.ipython

Creating an R data.frame in python with low level rpy2

I am using the rpy2 package to bring some R functionality to python.
The functions I'm using in R need a data.frame object, and by using rlike.TaggedList and then robjects.DataFrame I am able to make this work.
However I'm having performance issues, when comparing to the exact same R functions with the exact same data, which led me to try and use the rpy2 low level interface as mentioned here - http://rpy.sourceforge.net/rpy2/doc-2.3/html/performances.html
So far I have tried:
Using TaggedList with FloatSexpVector objects instead of numpy arrays, and the DataFrame object.
Dumping the TaggedList and DataFrame classes by using a dictionary like this:
d = dict((var_name, var_sexp_vector) for ...)
dataframe = robjects.r('data.frame')(**d)
Both did not get me any noticeable speedup.
I have noticed that DataFrame objects can get a rinterface.SexpVector in their constructor , so I have thought of creating a such a named vector, but I have no idea on how to put in the names (in R I know its just names(vec) = c('a','b'...)).
How do I do that? Is there another way?
And is there an easy way to profile rpy itself, so I could know where the bottleneck is?
EDIT:
The following code seem to work great (x4 faster) on newer rpy (2.2.3)
data = ro.r('list')([ri.FloatSexpVector(x) for x in vectors])[0]
data.names = ri.StrSexpVector(vector_names)
However it doesn't on version 2.0.8 (last one supported by windows), since R cant seem to be able to use the names: "Error in eval(expr, envir, enclos) : object 'y' not found"
Ideas?
EDIT #2:
Someone did the fine job of building a rpy2.3 binary for windows (python 2.7), the mentioned works great with it (almost x6 faster for my code)
link: https://bitbucket.org/breisfeld/rpy2_w32_fix/issue/1/binary-installer-for-win32

Python can be several times faster than R (even byte-compiled R), and I managed to perform operations on R data structures with rpy2 faster than R would have. Sharing the relevant R and rpy2 code would help make more specific advice (and improve rpy2 if needed).
In the meantime, SexpVector might not be what you want; it is little more than an abstract class for all R vectors (see class diagram for rpy2.rinterface). ListSexpVector might be more appropriate:
import rpy2.rinterface as ri
ri.initr()
l = ri.ListSexpVector([ri.IntSexpVector((1,2,3)),
ri.StrSexpVector(("a","b","c")),])
An important detail is that R lists are recursive data structures, and R avoids a catch 22-type of situation by having the operator "[[" (in addittion to "["). Python does not have that, and I have not (yet ?) implemented "[[" as a method at the low-level.
Profiling in Python can be done with the module stdlib module cProfile, for example.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Converting a script from rpy to rpy2 using rpy2.rpy_classic - python

Related

Comparing Plumbr to other options for making a chart with R in a Python script

Using R packages in Python using rpy2

How can I partition pyspark RDDs holding R functions

Using R functions in python

Creating an R data.frame in python with low level rpy2

Categories

Resources