Using R packages in Python using rpy2

Using R packages in Python using rpy2 - python

There is a package in R that I need to use on my data. All my data preprocessing has already been done in python and all the modelling as well. The package in R is 'PMA'. I have used r2py before using Rs PLS package as follows
import numpy as np
from rpy2.robjects.numpy2ri import numpy2ri
import rpy2.robjects as ro
def Rpcr(X_train,Y_train,X_test):
ro.r('''source('R_pls.R')''')
r_pls=ro.globalenv['R_pls']
r_x_train=numpy2ri(X_train)
r_y_train=numpy2ri(Y_train)
r_x_test=numpy2ri(X_test)
p_res=r_pls(r_x_train,r_y_train,r_x_test)
yp_test=np.array(p_res[0])
yp_test=yp_test.reshape((yp_test.size,))
yp_train=np.array(p_res[1])
yp_train=yp_train.reshape((yp_train.size,))
ncomps=np.array(p_res[2])
ncomps=ncomps.reshape((ncomps.size,))
return yp_test,yp_train,ncomps
when I followed this format is gave an error that function numpy2ri does not exist.
So I have been working off of rpy2 manual and have tried a number of things with no success. The package I am working with in R is implemented like so:
library('PMA')
cspa=CCA(X,Z,typex="standard", typez="standard", K=1, penaltyx=0.25, penaltyz=0.25)
# X and Z are dataframes with dimension ppm and pXq
# cspa returns an R object which I need two attributes u and v
U<-cspa$u
V<-cspa$v
So trying to implement something like I was seeing on the rpy2 tried to load the module in python and use it in python like so
import rpy2.robjects as ro
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage as STAP
from rpy2.robjects import numpy2ri
from rpy2.robjects.packages import importr
base=importr('base'
scca=importr('PMA')
numpy2ri.activate() # To turn NumPy arrays X1 and X2 to r objects
out=scca.CCA(X1,X2,typex="standard",typez="standard", K=1, penaltyz=0.25,penaltyz=0.25)
and got the following error
OMP: Error #15: Initializing libomp.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
Abort trap: 6
I also tried using R code directly using an example they had
string<-'''SCCA<-function(X,Z,K,alpha){
library("PMA")
scca<-CCA(X,Z,typex="standard",typez="standard",K=K penaltyx=alpha,penaltyz=alpha)
u<-scca$u
v<-scca$v
out<-list(U=u,V=v)
return(out)}'''
scca=STAP(string,"scca")
which as I understand can be used like an r function directly
numpy2ri.activate()
scca(X,Z,1,0.25)
this results in the same error as above.
So I do not know exactly how to fix it and have been unable to find anything similar.

The error for some reason is a mac-os issue. https://stackoverflow.com/a/53014308/1628393
Thus all you have to do
is modify it with this command and it works well
os.environ['KMP_DUPLICATE_LIB_OK']='True'
string<-'''SCCA<-function(X,Z,K,alpha){
library("PMA")
scca<-CCA(X,Z,typex="standard",typez="standard",K=Kpenaltyx=alpha,penaltyz=alpha)
u<-scca$u
v<-scca$v
out<-list(U=u,V=v)
return(out)}'''
scca=STAP(string,"scca")
then the function is called by
scca.SCCA(X,Z,1,0.25)

Related

How to import a function from an R package as if it was native Python function and use all its outputs?

There is a function called dea(x, y, *args) in library(Benchmarking) which returns useful objects. I've described 3 key ones below:
crs = dea(mydata_matrix_x, my_data_matrix_y, RTS="IN", ORIENTATION= "in") # both matrixes have N rows
efficiency(crs) # a 'numeric' type object which looks like a 1xN vector
peers(crs) # A matrix: Nx2 (looks to me like a pandas dataframe when run in .ipynb file with R kernel)
lambda(crs) # A matrix: Nx2 of type dbl (also looks like a dataframe)
Now I would like to programatically vary my_data_matrix_x. This matrix represents my inputs. At first it will be a Nx10 matrix. However I intend to drop each column sequentially and run dea() on the Nx9 matrix, then graph the efficiency(crs) scores that come out. The issue is I have no idea how to achieve this in R (amongst other things) and would rather circumvent the issue by writing all my code in Python and importing this dea() function somehow from an R script
I believe the best solution available to me will be to read and write from files:
from Benchmarking_script.r import dea
def test_inputs(data, input):
INPUTS = ['input 1', 'input2', 'input3', 'input4,' 'input5']
OUTPUTS = ['output1', 'output2']
data_inputs = data.drop(f"{input}", axis=1)
data_outputs = data[OUTPUTS]
data_inputs.to_csv("my_inputs.csv")
data_outputs.to_csv("my_outputs.csv")
run Benchmarking.dea(data_inputs, data_outputs, RTS="crs", ORIENTATION="in")
clearly this last line won't work: I am interested to hear flexible (and simple!) ways to run this dea() function idiomatically as if it was a native Python function
Related SO questions
The closest answer on SO I've found has been Importing any function from an R package into python
When adapting the code I've written
import pandas as pd
data = pd.read_csv("path/to_data.csv")
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects.packages import importr
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)
packnames = ('Benchmarking')
utils.install_packages(StrVector(packnames))
Benchmarking = importr('Benchmarking')
crs = Benchmarking.dea(data['Age'], data['CO2'], RTS='crs', ORIENTATION='in')
--------------------------------------------------------------
NotImplementedError: Conversion 'py2rpy' not defined for objects of type '<class 'pandas.core.series.Series'>'
So importing the function natively as a Python file hasn't worked

The second approach is the way to go. You need to use a converter context so python and r variables would be converted automatically. Specifically, try pandas2ri submodule shipped with rpy2. Something like this:
from rpy2.robjects import pandas2ri
with pandas2ri:
crs = Benchmarking.dea(data['Age'], data['CO2'], RTS='crs', ORIENTATION='in')
If this doesn't work, update your post with the error.

Weird GUDHI + rpy2 interaction

This might be a pretty niche problem but I've come across it and found a solution which maybe will help someone eventually. Basically, I'm trying to call the GUDHI C++ library through R using rpy2 using the following code:
import gudhi
import rpy2.robjects.packages as rpackages
import rpy2.robjects.vectors as rvectors
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
import numpy as np
point_cloud = np.random.uniform(0, 10, size=(30, 3))
# tda.point_cloud_to_pl(point_cloud, method="GUDHI_R")
rpy2.robjects.numpy2ri.activate()
# import R's "base" package
base = rpackages.importr('base')
# import R's "utils" package
utils = rpackages.importr('utils')
# select a mirror for R packages
utils.chooseCRANmirror(ind=1) # select the first mirror in the list
# R package names
packnames = ('TDA', 'deldir', 'Matrix', 'SparseM')
names_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
if len(names_to_install) > 0:
utils.install_packages(rvectors.StrVector(names_to_install))
# Importing packages
rpackages.importr('TDA')
rpackages.importr('deldir')
rpackages.importr('Matrix')
rpackages.importr('SparseM')
X = robjects.r.assign("X", point_cloud)
alpha_complex_diag = robjects.r['alphaComplexDiag']
ph_output = alpha_complex_diag(X)
print(ph_output)
But keep getting a segmentation fault error saying
*** caught segfault ***
R[write to console]: address 0x7fa6ac2c6090, cause 'invalid permissions'
The solution is to remove the import gudhi at the top of the python script. For some reason that causes R to not be able to use it? Perhaps because they are trying to call the same library?

It may indeed be a conflict between dynamically loaded libraries used by both Python and R. One such example is scipy doing what seems odd things with BLAS, and this causes a crash: https://github.com/rpy2/rpy2/issues/505
I do not know GUDHI at all unfortunately. If this is a blocker you'd need to run this through a debugger to find what it is happening in the C library.

Error calling a R function from python using rpy2 with survival library

When calling a function in the survival package in R from within python with the rpy2 interface I get the following error:
RRuntimeError: Error in formula[[2]] : subscript out of bounds
Any pointer to solve the issue please?
Thanks
Code:
import pandas as pd
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
R = ro.r
from rpy2.robjects import pandas2ri
pandas2ri.activate()
## install the survival package
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1) # select the first mirror in the list
utils.install_packages(StrVector('survival'))
#Load the library and example data set
survival=importr('survival')
infert = R('infert')
## Linear model works fine
reslm=R.lm('case~spontaneous+induced',data=infert)
#Run the example clogit function, which fails
rescl=R.clogit('case~spontaneous+induced+strata(stratum)',data=infert)

After trying around, I found out, there is a difference, whether you offer the R instance of rpy2 the full R-code string to execute, or not.
Thus, you can make your function run, by giving as much as possible as R code:
#Run the example clogit function, which fails
rescl=R.clogit('case~spontaneous+induced+strata(stratum)',data=infert)
#But give the R code to be executed as one complete string - this works:
rescl=R('clogit(case ~ spontaneous + induced + strata(stratum), data = infert)')
If you capture the return value to a variable within R, you can inspect the data and get out the critical information of the model
by the usual functions in R.
E.g.
R('rescl.in.R <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)')
R('str(rescl.in.R)')
# or:
R('coef(rescl.in.R)')
## array([1.98587552, 1.40901163])
R('names(rescl.in.R)')
## array(['coefficients', 'var', 'loglik', 'score', 'iter',
## 'linear.predictors', 'residuals', 'means', 'method', 'n', 'nevent',
## 'terms', 'assign', 'wald.test', 'y', 'formula', 'xlevels', 'call',
## 'userCall'], dtype='<U17')
It helps a lot - at least in this first phase of using rpy2 (for me, too), to have your r instance open and trying the code in parallel which you do, since the output in R is far more readable and you know and see what you are doing and what you could address.
In Python, the output is stripped off of important informations (like the name etc) - and in addition, it is not pretty-printed.

This fails when including the strata() function within the formula because it's not evaluated in the right environment. In R, formulas are special language constructs and so they need to be treated separately by rpy2.
So, for your example, this would look like:
rescl = R.clogit(ro.Formula('case ~ spontaneous + induced + strata(stratum)'),
data = infert)
See the documentation for rpy2.robjects.Formula for more details. That documentation also discusses the pros & cons of this approach vs that provided by #Gwang-jin-kim

How can I partition pyspark RDDs holding R functions

import rpy2.robjects as robjects
dffunc = sc.parallelize([(0,robjects.r.rnorm),(1,robjects.r.runif)])
dffunc.collect()
Outputs
[(0, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc28618 / R:0x26abd18>), (1, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc283d8 / R:0x26aad28>)]
While the partitioned version results in an error:
dffuncpart = dffunc.partitionBy(2)
dffuncpart.collect()
RuntimeError: ('R cannot evaluate code before being initialized.', <built-in function unserialize>
It seems like this error is that R wasn't loaded on one of the partitions, which I assume implies that the first import step was not performed. Is there anyway around this?
EDIT 1 This second example causes me to think there's a bug in the timing of pyspark or rpy2.
dffunc = sc.parallelize([(0,robjects.r.rnorm), (1,robjects.r.runif)]).partitionBy(2)
def loadmodel(model):
import rpy2.robjects as robjects
return model[1](2)
dffunc.map(loadmodel).collect()
Produces the same error R cannot evaluate code before being initialized.
dffuncpickle = sc.parallelize([(0,pickle.dumps(robjects.r.rnorm)),(1,pickle.dumps(robjects.r.runif))]).partitionBy(2)
def loadmodelpickle(model):
import rpy2.robjects as robjects
import pickle
return pickle.loads(model[1])(2)
dffuncpickle.map(loadmodelpickle).collect()
Works just as expected.

I'd like to say that "this is not a bug in rpy2, this is a feature" but I'll realistically have to settle with "this is a limitation".
What is happening is that rpy2 has 2 interface levels. One is a low-level one (closer to R's C API) and available through rpy2.rinterface and the other one is a high-level interface with more bells and whistles, more "pythonic", and with classes for R objects inheriting from rinterface level-ones (that last part is important for the part about pickling below). Importing the high-level interface results in initializing (starting) the embedded R with default parameters if necessary. Importing the low-level interface rinterface does not have this side effect and the initialization of the embedded R must be performed explicitly (function initr). rpy2 was designed this way because the initialization of the embedded R can have parameters: importing first rpy2.rinterface, setting the initialization, then importing rpy2.robjects makes this possible.
In addition to that the serialization (pickling) of R objects wrapped by rpy2 is currently only defined at the rinterface level (see the documentation). Pickling robjects-level (high-level) rpy2 objects is using the rinterface-level code and when unpickling them they will remain at that lower-level (the Python pickle contains the module the class of the object is defined in and will import that module - here rinterface, which does not imply the initialization of the embedded R). The reason for things being this way are simply that it was "good enough for now": at the time this was implemented I had to simultaneously think of a good way to bridge two somewhat different languages and learn my way through Python C-API and pickling/unpickling Python objects. Given the ease with which one can write something like
import rpy2.robjects
or
import rpy2.rinterface
rpy2.rinterface.initr()
before unpickling, this was never revisited. The uses of rpy2's pickling I know about are using Python's multiprocessing (and adding something similar to the import statements in the code initializing a child process was a cheap and sufficient fix). May this is the time to look at this again. File a bug report for rpy2 if the case.
edit: this is undoubtedly an issue with rpy2. pickled robjects-level objects should unpickle back to robjects-level, not rinterface-level. I have opened an issue in the rpy2 tracker (and already pushed a rudimentary patch in the default/dev branch).
2nd edit: The patch is part of released rpy2 starting with version 2.7.7 (latest release at the time of writing is 2.7.8).

Using R functions in python

I am an avid python user. I have been programming and performing a lot of my statistics using R. Recently, I tried to go into one of my notebooks to perform some statistical analysis. I have written over 5000 lines of code. Now, I have used R functions scattered everywhere throughout my program. Unfortunately, I am unable to even use any of the functions i have written before.
This is what i have done before:
%load_ext rmagic
import rpy2.robjects as R
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
import scipy.stats as sp
stats=importr('stats')
TSA = importr('TSA')
forecast = importr('forecast')
fUnitRoots = importr('fUnitRoots')
tseries = importr('tseries')
urca = importr('urca')
VARS = importr('vars')
zoo = importr('zoo')
aod = importr('aod')
Now, I can't even run any of this any more as i get an import error "r_magic extension has been moved".
Also, i have called R functions by doing the following:
%R acf(x)
Above statement no longer works.
But if i do....
R.r('acf(x)')
it works. This seems like an annoying change i have to incorporate in my large program. Is there a workaround towards this solution?
Thanks

The rmagic is now in rpy2. Do:
%load_ext rpy2.ipython

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using R packages in Python using rpy2 - python

Related

How to import a function from an R package as if it was native Python function and use all its outputs?

Weird GUDHI + rpy2 interaction

Error calling a R function from python using rpy2 with survival library

How can I partition pyspark RDDs holding R functions

Using R functions in python

Categories

Resources