Slow glm calculation when using rpy2 - python

I want to calculate logistic regression parameters using R's glm function. I'm working in Python and using rpy2 for that.
For some reason, running glm directly in R gives much faster results than running it through rpy2. Do you know why the calculation is so much slower with rpy2?
I'm using R 2.13.1 and rpy2 2.0.8.
Here is the code I'm using:
import numpy
from rpy2 import robjects as ro
import rpy2.rlike.container as rlc

def train(self, x_values, y_values, weights):
    x_float_vector = [ro.FloatVector(x) for x in numpy.array(x_values).transpose()]
    y_float_vector = ro.FloatVector(y_values)
    weights_float_vector = ro.FloatVector(weights)
    names = ['v' + str(i) for i in xrange(len(x_float_vector))]
    d = rlc.TaggedList(x_float_vector + [y_float_vector], names + ['y'])
    data = ro.RDataFrame(d)
    formula = 'y ~ '
    for x in names:
        formula += x + '+'
    formula = formula[:-1]
    fit_res = ro.r.glm(formula=ro.r(formula), data=data, weights=weights_float_vector, family=ro.r('binomial(link="logit")'))

Without the full R code you are benchmarking against, it is difficult to precisely point out where the problem might be.
You might want to run this through a Python profiler to see where the bottleneck(s) is (are).
Finally, the current release of rpy2 is 2.2.6. Besides API changes, it runs faster and has (presumably) fewer bugs than 2.0.8.
Edit: From your comments, I now suspect that you are calling your function in a loop, and that a large fraction of the time is spent building R vectors (which might only have to be built once).
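As a rough illustration of the profiling suggestion, here is a minimal sketch using the standard library's cProfile. The functions build_vectors, fit_model and train are hypothetical stand-ins (they are not rpy2 calls); the assumption is that the real train splits in a similar way into a data-conversion step and a model-fitting step.

```python
import cProfile
import io
import pstats


def build_vectors(n):
    # Hypothetical stand-in for converting Python data into R vectors.
    return [float(i) for i in range(n)]


def fit_model(vectors):
    # Hypothetical stand-in for the actual glm call.
    return sum(vectors) / len(vectors)


def train(n):
    # Rebuilds the vectors on every call, as the question's code does.
    return fit_model(build_vectors(n))


def profile_training(repeats=20, n=10000):
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        train(n)
    profiler.disable()
    # Collect the report in a string, sorted by cumulative time per function.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return stream.getvalue()


report = profile_training()
print(report)
```

If the conversion step dominates the report, building the vectors once outside the loop is the first thing to try.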

Related

Error calling a R function from python using rpy2 with survival library

When calling a function in the survival package in R from within python with the rpy2 interface I get the following error:
RRuntimeError: Error in formula[[2]] : subscript out of bounds
Any pointer to solve the issue please?
Thanks
Code:
import pandas as pd
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
R = ro.r
from rpy2.robjects import pandas2ri
pandas2ri.activate()
## install the survival package
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1) # select the first mirror in the list
utils.install_packages(StrVector(['survival']))
#Load the library and example data set
survival=importr('survival')
infert = R('infert')
## Linear model works fine
reslm=R.lm('case~spontaneous+induced',data=infert)
#Run the example clogit function, which fails
rescl=R.clogit('case~spontaneous+induced+strata(stratum)',data=infert)
After some experimenting, I found that it makes a difference whether or not you pass rpy2's R instance the complete R code as a single string.
So you can make your function run by expressing as much as possible as R code:
#Run the example clogit function, which fails
rescl=R.clogit('case~spontaneous+induced+strata(stratum)',data=infert)
#But give the R code to be executed as one complete string - this works:
rescl=R('clogit(case ~ spontaneous + induced + strata(stratum), data = infert)')
If you assign the return value to a variable within R, you can inspect the fitted model and extract its key information with the usual R functions.
For example:
R('rescl.in.R <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)')
R('str(rescl.in.R)')
# or:
R('coef(rescl.in.R)')
## array([1.98587552, 1.40901163])
R('names(rescl.in.R)')
## array(['coefficients', 'var', 'loglik', 'score', 'iter',
## 'linear.predictors', 'residuals', 'means', 'method', 'n', 'nevent',
## 'terms', 'assign', 'wald.test', 'y', 'formula', 'xlevels', 'call',
## 'userCall'], dtype='<U17')
It helps a lot, at least when you are starting out with rpy2 (as I am), to keep an R session open and try the same code there in parallel, since the output in R is far more readable and you can see what you are doing and what you could access.
In Python, the output is stripped of important information (such as names) and is not pretty-printed.
This fails when including the strata() function within the formula because it's not evaluated in the right environment. In R, formulas are special language constructs and so they need to be treated separately by rpy2.
So, for your example, this would look like:
rescl = R.clogit(ro.Formula('case ~ spontaneous + induced + strata(stratum)'),
                 data = infert)
See the documentation for rpy2.robjects.Formula for more details. That documentation also discusses the pros and cons of this approach versus the one provided by @Gwang-jin-kim.

Running Python code from Matlab

I would really appreciate some help with running code written in Python 3 from Matlab.
My Python code loads various libraries and uses them to perform numerical integration of a differential equation (for the numpy vector: e_array ).
The Python code which I would like to call from Matlab is the following:
from numba import jit
from scipy.integrate import quad
import numpy as np

#jit(nopython = True)
def integrand1(x,e,delta,r):
    return (-2*np.sqrt(e*r)/np.pi)*(x/np.sqrt(1-x**2))/(1+(delta+2*x*np.sqrt(e*r))**2)

#jit(nopython = True)
def f1(e,delta,r):
    return quad(integrand1, -1, 1, args=(e,delta,r))[0]

#jit(nopython = True)
def runge1(e,dtau,delta,r):
    k1 = f1(e,delta,r)
    k2 = f1((e+k1*dtau/2),delta,r)
    k3 = f1((e+k2*dtau/2),delta,r)
    k4 = f1((e+k3*dtau),delta,r)
    return e + (dtau/6)*(k1+2*k2+2*k3+k4)

time_steps = 60
e = 10
dtau=1
r=1
delta=-1
e_array = np.zeros(time_steps)
time = np.zeros(time_steps)
for i in range(time_steps):
    e_array[i] = e
    time[i] = i*dtau
    e = runge1(e,dtau,delta,r)
Ideally, I would like to be able to call this Python code (pythoncode.py) in Matlab as if it were a Matlab function and feed it the parameters: time_steps, e, dtau, r and delta. I would be very happy with a solution which looks like this:
e_array = pythoncode.py(time_steps = 60, e = 10, dtau = 1, r = 1, delta = -1)
where pythoncode.py is treated as a Matlab function which takes said parameters, feeds them into the Python code and returns the Matlab vector e_array.
I want to point out that there are several additional Python scripts that I'd like to be able to call from Matlab, and I hope to get an idea of how to do this from your answers regarding this specific one.
A related question involves the Python libraries that my code uses: is there a way to "compile" the Python code so that I can call it from Matlab without installing the libraries it uses (e.g. numba) on the computer running the Matlab code?
Thanks very much for helping,
Asaf
You'll probably need to shell escape out of Matlab to invoke python -- prefix the command you'd run on the shell with !.
Matlab Shell Escape Functions suggests saving a mat file and then opening it in your python code -- see Read .mat files in Python .
In terms of compiling the python, you could take a look at How to compile a Python file and see if that helps you.
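To make the shell-escape route a bit more concrete, here is a minimal sketch of how pythoncode.py could take its parameters on the command line and print the result as JSON. The flag names, the JSON output format and the run/main helpers are assumptions made for this sketch, and f1 is a deliberately trivial placeholder so the example runs without scipy; the real script would keep its own quad-based f1.

```python
import argparse
import json


def f1(e, delta, r):
    # Placeholder right-hand side (an assumption for this sketch only);
    # the real script would keep its quad-based f1 here.
    return -e


def runge1(e, dtau, delta, r):
    # One classical fourth-order Runge-Kutta step, as in the question.
    k1 = f1(e, delta, r)
    k2 = f1(e + k1 * dtau / 2, delta, r)
    k3 = f1(e + k2 * dtau / 2, delta, r)
    k4 = f1(e + k3 * dtau, delta, r)
    return e + (dtau / 6) * (k1 + 2 * k2 + 2 * k3 + k4)


def run(time_steps, e, dtau, r, delta):
    # Same loop as the question's script, wrapped in a function.
    e_values = []
    for _ in range(time_steps):
        e_values.append(e)
        e = runge1(e, dtau, delta, r)
    return e_values


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--time_steps", type=int, default=60)
    parser.add_argument("--e", type=float, default=10.0)
    parser.add_argument("--dtau", type=float, default=1.0)
    parser.add_argument("--r", type=float, default=1.0)
    parser.add_argument("--delta", type=float, default=-1.0)
    # parse_known_args ignores unrelated arguments instead of exiting.
    args, _ = parser.parse_known_args(argv)
    result = run(args.time_steps, args.e, args.dtau, args.r, args.delta)
    print(json.dumps(result))


if __name__ == "__main__":
    main()
```

On the Matlab side, the printed text could then be captured with system instead of ! and decoded with jsondecode (available in recent Matlab releases), which would avoid the intermediate mat file. This still requires a Python installation with the script's libraries on the machine running Matlab.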

Call R library DirichletReg from Python using rpy2

I'm trying to do Dirichlet Regression using Python. Unfortunately I cannot find a Python package that does the job. So I tried to call R library DirichletReg using rpy2. However, it is not very intuitive to me how to call a regression function such as DirichReg(Y ~ X1 + X2 + X3, data=predictorData) where Y = DR_data(compositionalData). I saw an example of calling linear regression function lm in the documentation of rpy2. But my case is slightly different as Y is not a column name in the table but an R object DR_data.
I'm wondering what the proper way is to do this, or whether there is a Python package for Dirichlet Regression.
You can send objects into the "Formula" environment from python. This example is from the rpy2 docs:
import array
from rpy2.robjects import IntVector, Formula
from rpy2.robjects.packages import importr
stats = importr('stats')
x = IntVector(range(1, 11))
y = x.ro + stats.rnorm(10, sd=0.2)
fmla = Formula('y ~ x')
env = fmla.environment
env['x'] = x
env['y'] = y
fit = stats.lm(fmla)
You can also create named variables in the R environment (outside the Formula). See here. In the worst case, you move some of your Python data into R through rpy2, then issue the commands directly in R through the rpy2 bridge as described here.

How to use Stargazer to print fit in rpy2

I'm learning how to use rpy2, and I would like to create formatted regression output using the stargazer package. My best guess of how to do this is the following code:
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
stargazer = importr('stargazer')
from rpy2.robjects import pandas2ri
pandas2ri.activate()
r = robjects.r
df = pd.DataFrame({'x': [1,2,3,4,5],
                   'y': [2,1,3,5,4]})
fit = r.lm('y~x', data=df)
print(fit)
print(r.stargazer(fit))
However, when I run it, I get the following output:
Coefficients:
(Intercept) x
0.6 0.8
[1] "\n"
[2] "% Error: Unrecognized object type.\n"
So the fit is being generated, and prints fine. But stargazer doesn't seem to recognize the fit object as something it can parse.
Any suggestions? Am I calling stargazer incorrectly in this context?
I should mention that I am running this in Python 2.7.5 on a Windows 10 machine, with R 3.3.2 and rpy2 version 2.7.8 from the unofficial Windows binary. So it could just be a problem with the Windows build, but it seems odd that everything except stargazer would work.
I am not familiar with the R package stargazer but from a quick look at the documentation this seems to be the correct usage.
Before anything else, you may want to check whether the issue is with execution or with printing. At which of these two lines does it fail?
p = r.stargazer(fit)
print(p)
If the failure is with the execution, you may want to move more code to R and see if you reach a point where you get it to work. If not, this is likely an issue with the R code and/or stargazer. If you get it to work the issue is on the rpy2/conversion side.
rcode = """
df <- data.frame(x = c(1,2,3,4,5),
                 y = c(2,1,3,5,4))
fit <- lm('y~x', data=df)
p <- stargazer(fit)
"""
# parse and evaluate the R code
r(rcode)
# intermediate objects can be retrieved from the `globalenv` to
# investigate where they differ from the ones obtained earlier.
# For example:
print(robjects.globalenv["p"])
Now that we showed that it is likely an issue on the stargazer side, we can make the use of arbitrary data frames a matter of binding it to a symbol in R's globalenv:
robjects.globalenv["df"] = df
rcode = """
fit <- lm('y~x', data=df)
p <- stargazer(fit)
"""
# parse and evaluate the R code
r(rcode)
print(robjects.globalenv["p"])

Speeding up a big loop with a logarithm in Python

I'm trying to speed up the following code:
from math import log
from random import random
def logtest1(N):
    tr=0
    for i in range(1,N):
        T= 40 + 10*random()
        tr += -log(random())/T
    return tr
I'm fairly new to python (coming from matlab)... and this same code runs 5x slower in python than matlab (and Julia), which got my attention.
I tried using a numba and a parakeet wrapper, and numpy functions instead of python ones, but didn't get any improvement at all.
I'd appreciate any help.
Thanks.
edit: the whole thing is a Monte Carlo simulation, so N is very large... 10e6 for testing purposes
You should really be looking into numpy. And Scipy, while you're at it. Numpy is a speed-optimized package for N-dimensional array numerics, and Scipy is a collection of scientific computing stuff built upon numpy.
If you write the function using numpy arrays, it looks like this:
import numpy as np

def logtest2(N):
    T = 40. + 10. * np.random.rand(N)
    return np.sum(-1*np.log(np.random.rand(N)) / T)
It's also a lot faster. Testing with N = 1000000 gave me a runtime of 500ms for your version and 75ms for this one.
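As a sanity check that the loop and the vectorized version compute the same quantity, here is a small sketch that compares both against the analytic mean. The function names, the seeding and the use of 1 - random() (to stay clear of log(0)) are choices made for this sketch, not part of the original code. Since -log(U) has mean 1 and is independent of T, the expected value of each term is E[1/T] = ln(50/40)/10 for T uniform on [40, 50].

```python
import math
import random

import numpy as np


def logtest_loop(n, seed=0):
    # Pure-Python loop with two independent draws per iteration.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t = 40 + 10 * rng.random()
        # 1 - random() lies in (0, 1], which avoids log(0).
        total += -math.log(1.0 - rng.random()) / t
    return total


def logtest_vectorized(n, seed=0):
    # Same computation on whole arrays at once.
    rng = np.random.default_rng(seed)
    t = 40.0 + 10.0 * rng.random(n)
    return float(np.sum(-np.log(1.0 - rng.random(n)) / t))


n = 200_000
expected_mean = math.log(50 / 40) / 10  # E[1/T] for T uniform on [40, 50]
print(logtest_loop(n) / n, logtest_vectorized(n) / n, expected_mean)
```

The two versions use different random streams, so their sums agree only statistically, not digit for digit.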
Right off the bat, I'd say use xrange if you are using Python 2.7.x.
So in 2.7 it would be:
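def logtest1(N):
    tr=0
    for i in xrange(N):
        a = random() # Just generate the random number once
        # Note: T and the log now share one draw, unlike the original code,
        # which used two independent draws.
        T= 40 + 10*a
        tr += -log(a)/T
    return tr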
Here is a summary on why xrange is better: Should you always favor xrange() over range()?
