Create temporary dataframe with rpy2: memory issue - python

This question is similar to but simpler than my previous one.
Here is the code that I use to create R dataframes from python using rpy2:
import numpy as np
from rpy2 import robjects
Z = np.zeros((10000, 500))
df = robjects.r["data.frame"]([robjects.FloatVector(column) for column in Z.T])
My problem is that using it repetitively results in huge memory consumption.
I tried to adapt the idea from here but without success.
How can I convert many numpy arrays to dataframe for treatment by R methods without gradually using all my memory?

You should make sure that you're using the latest version of rpy2. With rpy2 version 2.4.2, the following works nicely:
import gc
import numpy as np
from rpy2 import robjects
from rpy2.robjects.numpy2ri import numpy2ri
for i in range(100):
    print(i)
    Z = np.random.random(size=(10000, 500))
    matrix = numpy2ri(Z)
    df = robjects.r("data.frame")(matrix)
    gc.collect()
Memory usage never exceeds 600 MB on my computer.
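For newer rpy2 releases (3.x) the conversion API has changed; a rough sketch of the same loop using a local converter context (my assumption about the 3.x idiom, not part of the original answer) would be:
import gc
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects import numpy2ri
from rpy2.robjects.conversion import localconverter
converter = ro.default_converter + numpy2ri.converter
for i in range(100):
    Z = np.random.random(size=(10000, 500))
    with localconverter(converter):
        r_matrix = ro.conversion.py2rpy(Z)  # numpy ndarray -> R matrix
    df = ro.r["data.frame"](r_matrix)
    gc.collect()  # let Python release the R objects from the previous iteration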

Related

How to import a function from an R package as if it were a native Python function and use all its outputs?

There is a function called dea(x, y, *args) in library(Benchmarking) which returns useful objects. I've described 3 key ones below:
crs = dea(mydata_matrix_x, my_data_matrix_y, RTS="IN", ORIENTATION= "in") # both matrixes have N rows
efficiency(crs) # a 'numeric' type object which looks like a 1xN vector
peers(crs) # A matrix: Nx2 (looks to me like a pandas dataframe when run in .ipynb file with R kernel)
lambda(crs) # A matrix: Nx2 of type dbl (also looks like a dataframe)
Now I would like to programmatically vary my_data_matrix_x. This matrix represents my inputs; at first it will be an Nx10 matrix. However, I intend to drop each column sequentially, run dea() on the resulting Nx9 matrix, and then graph the efficiency(crs) scores that come out. The issue is that I have no idea how to achieve this in R (amongst other things), and I would rather circumvent the issue by writing all my code in Python and importing this dea() function somehow from an R script.
I believe the best solution available to me will be to read and write from files:
from Benchmarking_script.r import dea

def test_inputs(data, input):
    INPUTS = ['input1', 'input2', 'input3', 'input4', 'input5']
    OUTPUTS = ['output1', 'output2']
    data_inputs = data.drop(f"{input}", axis=1)
    data_outputs = data[OUTPUTS]
    data_inputs.to_csv("my_inputs.csv")
    data_outputs.to_csv("my_outputs.csv")
    run Benchmarking.dea(data_inputs, data_outputs, RTS="crs", ORIENTATION="in")
Clearly this last line won't work; I am interested to hear flexible (and simple!) ways to run this dea() function idiomatically, as if it were a native Python function.
Related SO questions
The closest answer on SO I've found has been Importing any function from an R package into python
Adapting that code, I've written:
import pandas as pd
data = pd.read_csv("path/to_data.csv")
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects.packages import importr
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)
packnames = ('Benchmarking',)  # note the trailing comma: a one-element tuple, not a bare string
utils.install_packages(StrVector(packnames))
Benchmarking = importr('Benchmarking')
crs = Benchmarking.dea(data['Age'], data['CO2'], RTS='crs', ORIENTATION='in')
--------------------------------------------------------------
NotImplementedError: Conversion 'py2rpy' not defined for objects of type '<class 'pandas.core.series.Series'>'
So importing the function directly, as if it were a native Python function, hasn't worked.
The second approach is the way to go. You need an active converter context so that Python and R objects are converted automatically; specifically, try the pandas2ri submodule shipped with rpy2, used together with a local converter. Something like this:
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
with localconverter(ro.default_converter + pandas2ri.converter):
    crs = Benchmarking.dea(data['Age'], data['CO2'], RTS='crs', ORIENTATION='in')
If this doesn't work, update your post with the error.
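If the converted call goes through, the returned R object can be queried with the accessors listed in the question and converted back to numpy; a hedged follow-up sketch (whether efficiency() and peers() are exported depends on the installed Benchmarking version):
import numpy as np
# 'crs' is the R object returned by Benchmarking.dea() above
eff = np.asarray(Benchmarking.efficiency(crs))   # 1 x N vector of scores, as described in the question
peers = np.asarray(Benchmarking.peers(crs))      # N x 2 matrix of peers
print(eff[:10])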

PyArrow ipc.read_tensor causes seg fault

I am trying to pass numpy arrays from one process to another using PyArrow's shared memory framework. Currently the sender process has this code:
import numpy as np
import pyarrow as pa
data = np.random.rand(100,100, 100)
tensor = pa.Tensor.from_numpy(data)
output_stream = pa.BufferOutputStream()
pa.ipc.write_tensor(tensor, output_stream)
buf = output_stream.getvalue()
print(buf.address, buf.size)
which outputs something like (5311993741568, 8000256)
In the second process I have:
import pyarrow as pa
import numpy as np
buf2 = pa.foreign_buffer(5311993741568, 8000256)
tensor2 = pa.ipc.read_tensor(buf2)
but I get a segfault on the last line. The documentation isn't very clear on the right way to use read_tensor and write_tensor. I am also running on Windows, so I cannot use the Plasma object store.
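For what it's worth, the write_tensor/read_tensor pair does round-trip within a single process; a minimal sketch (my own check, not from the question) is below. A raw address obtained from another process is not a valid pointer in the reader's address space, and pa.foreign_buffer only wraps the address rather than mapping the memory, which is consistent with the segfault described above.
import numpy as np
import pyarrow as pa
data = np.random.rand(100, 100, 100)
tensor = pa.Tensor.from_numpy(data)
sink = pa.BufferOutputStream()
pa.ipc.write_tensor(tensor, sink)
buf = sink.getvalue()
# Reading back from a buffer owned by this process works; a raw address from a
# different process does not, because virtual addresses are per-process.
tensor2 = pa.ipc.read_tensor(pa.BufferReader(buf))
assert np.array_equal(tensor2.to_numpy(), data)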

Using R packages in Python using rpy2

There is a package in R that I need to use on my data. All my data preprocessing has already been done in Python, as has all the modelling. The package in R is 'PMA'. I have used rpy2 before with R's PLS package, as follows:
import numpy as np
from rpy2.robjects.numpy2ri import numpy2ri
import rpy2.robjects as ro
def Rpcr(X_train, Y_train, X_test):
    ro.r('''source('R_pls.R')''')
    r_pls = ro.globalenv['R_pls']
    r_x_train = numpy2ri(X_train)
    r_y_train = numpy2ri(Y_train)
    r_x_test = numpy2ri(X_test)
    p_res = r_pls(r_x_train, r_y_train, r_x_test)
    yp_test = np.array(p_res[0])
    yp_test = yp_test.reshape((yp_test.size,))
    yp_train = np.array(p_res[1])
    yp_train = yp_train.reshape((yp_train.size,))
    ncomps = np.array(p_res[2])
    ncomps = ncomps.reshape((ncomps.size,))
    return yp_test, yp_train, ncomps
When I followed this format, it gave an error that the function numpy2ri does not exist.
So I have been working off the rpy2 manual and have tried a number of things, with no success. The package I am working with in R is implemented like so:
library('PMA')
cspa=CCA(X,Z,typex="standard", typez="standard", K=1, penaltyx=0.25, penaltyz=0.25)
# X and Z are dataframes with dimension ppm and pXq
# cspa returns an R object from which I need two attributes, u and v
U<-cspa$u
V<-cspa$v
So, trying to implement something like what I was seeing in the rpy2 docs, I tried to load the package in Python and use it there like so:
import rpy2.robjects as ro
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage as STAP
from rpy2.robjects import numpy2ri
from rpy2.robjects.packages import importr
base = importr('base')
scca = importr('PMA')
numpy2ri.activate()  # to turn NumPy arrays X1 and X2 into R objects
out = scca.CCA(X1, X2, typex="standard", typez="standard", K=1, penaltyx=0.25, penaltyz=0.25)
and got the following error
OMP: Error #15: Initializing libomp.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
Abort trap: 6
I also tried using R code directly, following an example they had:
string = '''SCCA <- function(X, Z, K, alpha){
    library("PMA")
    scca <- CCA(X, Z, typex="standard", typez="standard", K=K, penaltyx=alpha, penaltyz=alpha)
    u <- scca$u
    v <- scca$v
    out <- list(U=u, V=v)
    return(out)}'''
scca=STAP(string,"scca")
which, as I understand it, can be used like an R function directly:
numpy2ri.activate()
scca(X,Z,1,0.25)
This results in the same error as above, so I do not know exactly how to fix it and have been unable to find anything similar.
The error is, for some reason, a macOS issue: https://stackoverflow.com/a/53014308/1628393
Thus all you have to do is set this environment variable before running, and it works well:
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
string = '''SCCA <- function(X, Z, K, alpha){
    library("PMA")
    scca <- CCA(X, Z, typex="standard", typez="standard", K=K, penaltyx=alpha, penaltyz=alpha)
    u <- scca$u
    v <- scca$v
    out <- list(U=u, V=v)
    return(out)}'''
scca=STAP(string,"scca")
The function is then called with:
scca.SCCA(X,Z,1,0.25)
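Putting the two pieces together, a hedged end-to-end sketch (the environment variable is set before anything loads an OpenMP runtime; the random X and Z are placeholders for the real data) might look like:
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'  # macOS workaround; set before numpy/rpy2 load OpenMP
import numpy as np
from rpy2.robjects import numpy2ri
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage as STAP
r_code = '''
SCCA <- function(X, Z, K, alpha){
    library("PMA")
    scca <- CCA(X, Z, typex="standard", typez="standard", K=K, penaltyx=alpha, penaltyz=alpha)
    list(U=scca$u, V=scca$v)
}
'''
scca = STAP(r_code, "scca")
numpy2ri.activate()  # so the numpy arrays below are passed to R automatically
X = np.random.rand(50, 5)   # placeholder n x p inputs
Z = np.random.rand(50, 4)   # placeholder n x q inputs
out = scca.SCCA(X, Z, 1, 0.25)
U, V = np.array(out[0]), np.array(out[1])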

RPY2 to use as.xts from XTS library

I am using rpy2 in Python to call as.xts from the xts library. 'as' is a reserved word in Python, hence I am unsure how to use as.xts in my Python code.
My aim is to use as.xts on an existing dataframe with a time series column.
from rpy2.robjects import pandas2ri
pandas2ri.activate()
r_dataframe = pandas2ri.py2ri(pandas_df)
from rpy2.robjects.packages import importr
xts= importr('xts', lib_loc="local path to R library" , robject_translations = {".subset.xts": "_subset_xts2", "to.period": "to_period2"})
r_ts = as.xts(r_dataframe)  # I am unsure of this step's usage
I expect a time series object as the output of the last line of code. I am going to use the forecast package on top of the time series object.
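For reference, importr translates the '.' in R names to '_' by default, so as.xts is reachable without touching the reserved word; alternatively the function can be fetched under its literal R name. A minimal sketch (assuming the xts package is installed and pandas_df, from the question, has a DatetimeIndex):
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
pandas2ri.activate()
r_dataframe = pandas2ri.py2ri(pandas_df)  # pandas_df as in the question; py2rpy() in rpy2 3.x
xts = importr('xts')
# Option 1: importr's default name translation maps as.xts to as_xts
r_ts = xts.as_xts(r_dataframe)
# Option 2: look the function up under its literal R name
as_xts = ro.r['as.xts']
r_ts = as_xts(r_dataframe)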

Any benefits to importing sub modules directly (seems to be slower)?

I wanted to see which is faster:
import numpy as np
np.sqrt(4)
-or-
from numpy import sqrt
sqrt(4)
Here is the code I used to find the average time to run each.
def main():
    import gen_funs as gf
    from time import perf_counter_ns
    t = 0
    N = 40
    for j in range(N):
        tic = perf_counter_ns()
        for i in range(100000):
            imp2()  # I ran the code with this, then with imp1()
        toc = perf_counter_ns()
        t += (toc - tic)
    t /= N
    time = gf.ns2hms(t)  # converts ns to a readable object
    print("Ave. time to run: {:d}h {:d}m {:d}s {:d}ms".format(
        time.hours, time.minutes, time.seconds, time.milliseconds))

def imp1():
    import numpy as np
    np.sqrt(4)
    return

def imp2():
    from numpy import sqrt
    sqrt(4)
    return

if __name__ == "__main__":
    main()
When I import numpy as np and then call np.sqrt(4), I get an average time of about 229ms (the time to run the inner loop of 10**5 calls).
When I run from numpy import sqrt then call sqrt(4), I get an average time of about 332ms.
Since there is such a difference in time to run, what is the benefit to running from numpy import sqrt? Is there a memory benefit or some other reason why I would do this?
I tried timing with the time bash command. I got 215ms for importing numpy and running sqrt(4) and 193ms for importing sqrt from numpy with the same command. The difference is negligible, honestly.
However, importing things you don't actually need is generally discouraged.
In this particular case there is no discernible performance benefit, and there are few situations in which you would import just numpy.sqrt: math.sqrt is roughly 4x faster for scalars, and the extra features numpy.sqrt offers are only usable if you have numpy data, which would require you to import the entire module anyway.
There might be a rare scenario in which you don't need all of numpy but still need numpy.sqrt, e.g. using pandas.DataFrame.to_numpy() and manipulating the data in some way, but honestly I don't feel the ~20ms of speed is worth anything in the real world, especially since you saw worse performance when importing just numpy.sqrt.
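For a cleaner comparison of just the call cost, with the import done once outside the timed region, timeit can be used; a small sketch along those lines (numbers will vary by machine):
import timeit
n = 100_000
t_np_attr = timeit.timeit("np.sqrt(4)", setup="import numpy as np", number=n)
t_np_name = timeit.timeit("sqrt(4)", setup="from numpy import sqrt", number=n)
t_math = timeit.timeit("sqrt(4)", setup="from math import sqrt", number=n)
print(f"np.sqrt: {t_np_attr:.3f}s  numpy sqrt: {t_np_name:.3f}s  math.sqrt: {t_math:.3f}s")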
