Creating an R data.frame in python with low level rpy2

Creating an R data.frame in python with low level rpy2 - python

I am using the rpy2 package to bring some R functionality to python.
The functions I'm using in R need a data.frame object, and by using rlike.TaggedList and then robjects.DataFrame I am able to make this work.
However I'm having performance issues, when comparing to the exact same R functions with the exact same data, which led me to try and use the rpy2 low level interface as mentioned here - http://rpy.sourceforge.net/rpy2/doc-2.3/html/performances.html
So far I have tried:
Using TaggedList with FloatSexpVector objects instead of numpy arrays, and the DataFrame object.
Dumping the TaggedList and DataFrame classes by using a dictionary like this:
d = dict((var_name, var_sexp_vector) for ...)
dataframe = robjects.r('data.frame')(**d)
Both did not get me any noticeable speedup.
I have noticed that DataFrame objects can get a rinterface.SexpVector in their constructor , so I have thought of creating a such a named vector, but I have no idea on how to put in the names (in R I know its just names(vec) = c('a','b'...)).
How do I do that? Is there another way?
And is there an easy way to profile rpy itself, so I could know where the bottleneck is?
EDIT:
The following code seem to work great (x4 faster) on newer rpy (2.2.3)
data = ro.r('list')([ri.FloatSexpVector(x) for x in vectors])[0]
data.names = ri.StrSexpVector(vector_names)
However it doesn't on version 2.0.8 (last one supported by windows), since R cant seem to be able to use the names: "Error in eval(expr, envir, enclos) : object 'y' not found"
Ideas?
EDIT #2:
Someone did the fine job of building a rpy2.3 binary for windows (python 2.7), the mentioned works great with it (almost x6 faster for my code)
link: https://bitbucket.org/breisfeld/rpy2_w32_fix/issue/1/binary-installer-for-win32

Python can be several times faster than R (even byte-compiled R), and I managed to perform operations on R data structures with rpy2 faster than R would have. Sharing the relevant R and rpy2 code would help make more specific advice (and improve rpy2 if needed).
In the meantime, SexpVector might not be what you want; it is little more than an abstract class for all R vectors (see class diagram for rpy2.rinterface). ListSexpVector might be more appropriate:
import rpy2.rinterface as ri
ri.initr()
l = ri.ListSexpVector([ri.IntSexpVector((1,2,3)),
ri.StrSexpVector(("a","b","c")),])
An important detail is that R lists are recursive data structures, and R avoids a catch 22-type of situation by having the operator "[[" (in addittion to "["). Python does not have that, and I have not (yet ?) implemented "[[" as a method at the low-level.
Profiling in Python can be done with the module stdlib module cProfile, for example.

Related

Error calling modules when using x = _import_("y") with dictionary values

I'm having trouble understanding an issue with calling packages after they have been imported using the __import_ function in Python. I'll note that using the usual import x as y works fine, but this is an learning exercise for me. I am importing and then checking the version for multiple packages and to learn a little more about Python, I wanted to automate this by using a dictionary.
My dictionary looks something like this:
pack = {"numpy": ["np", "1.7.1"]}
and I then use this to load and check the modules:
for keys in pack.keys():
pack[keys][0] = __import__(keys)
print("%s version: %6.6s (need at least %s)" %(keys, pack[keys][0].__version__, pack[keys][1]))
This works fine, but when I later try to call the package, it does not recognize it: x = np.linspace(0,10,30)
produces an error saying np isn't recognized, but this works: x = pack[keys][0].linspace(0,10,30)
Since this is just a way for me to learn, I'd also be interested in any solutions that change how I've approached the problem. I went with dictionaries because I've at least heard of them, and I used the _import__ function since I was forced to either use quoted characters or numeric values in my dictionary values. The quoted characters created problems for the import x as y technique.

Althought it isn't a good practice, you can create variables dynamically using the builtin locals function.
So,
for module, data in pack.items():
locals()[data[0]] = __import__(module)
nickname = locals()[data[0]]
print("%s version: %6.6s (need at least %s)" %(module, nickname.__version__, pack[module][1]))
Output:
numpy version: 1.12.1 (need at least 1.7.1)

My python codes in general are very slow, is this normal?

I recently began self-learning python, and have been using this language for an online course in algorithms. For some reason, many of my codes I created for this course are very slow (relatively to C/C++ Matlab codes I have created in the past), and I'm starting to worry that I am not using python properly.
Here is a simple python and matlab code to compare their speed.
MATLAB
for i = 1:100000000
a = 1 + 1
end
Python
for i in list(range(0, 100000000)):
a=1 + 1
The matlab code takes about 0.3 second, and the python code takes about 7 seconds. Is this normal? My python codes for much complex problems are very slow. For example, as a HW assignment, I'm running depth first search on a graph with about 900000 nodes, and this is taking forever. Thank you.

Performance is not an explicit design goal of Python:
Don’t fret too much about performance--plan to optimize later when
needed.
That's one of the reasons why Python integrated with a lot of high performance calculating backend engines, such as numpy, OpenBLAS and even CUDA, just to name a few.
The best way to go foreward if you want to increase performance is to let high-performance libraries do the heavy lifting for you. Optimizing loops within Python (by using xrange instead of range in Python 2.7) won't get you very dramatic results.
Here is a bit of code that compares different approaches:
Your original list(range())
The suggestes use of xrange()
Leaving the i out
Using numpy to do the addition using numpy array's (vector addition)
Using CUDA to do vector addition on the GPU
Code:
import timeit
import matplotlib.pyplot as mplplt
iter = 100
testcode = [
"for i in list(range(1000000)): a = 1+1",
"for i in xrange(1000000): a = 1+1",
"for _ in xrange(1000000): a = 1+1",
"import numpy; one = numpy.ones(1000000); a = one+one",
"import pycuda.gpuarray as gpuarray; import pycuda.driver as cuda; import pycuda.autoinit; import numpy;" \
"one_gpu = gpuarray.GPUArray((1000000),numpy.int16); one_gpu.fill(1); a = (one_gpu+one_gpu).get()"
]
labels = ["list(range())", "i in xrange()", "_ in xrange()", "numpy", "numpy and CUDA"]
timings = [timeit.timeit(t, number=iter) for t in testcode]
print labels, timings
label_idx = range(len(labels))
mplplt.bar(label_idx, timings)
mplplt.xticks(label_idx, labels)
mplplt.ylabel('Execution time (sec)')
mplplt.title('Timing of integer addition in python 2.7\n(smaller value is better performance)')
mplplt.show()
Results (graph) ran on Python 2.7.13 on OSX:
The reason that Numpy performs faster than the CUDA solution is that the overhead of using CUDA does not beat the efficiency of Python+Numpy. For larger, floating point calculations, CUDA does even better than Numpy.
Note that the Numpy solution performs more that 80 times faster than your original solution. If your timings are correct, this would even be faster than Matlab...
A final note on DFS (Depth-afirst-Search): here is an interesting article on DFS in Python.

Try using xrange instead of range.
The difference between them is that **xrange** generates the values as you use them instead of range, which tries to generate a static list at runtime.

Unfortunately, python's amazing flexibility and ease comes at the cost of being slow. And also, for such large values of iteration, I suggest using itertools module as it has faster caching.
The xrange is a good solution however if you want to iterate over dictionaries and such, it's better to use itertools as in that, you can iterate over any type of sequence object.

Optimising python function with numba

I am trying to speed up a python function using numba, however I cannot seem to make it compile.
The input for my function is a 27x4 array of type np.int32.
My function is:
#nb.jit(nopython=True)
def edge_profile(input):
pos = input[:,:3]
val = input[:,3]
centre = np.mean(pos,axis=0).astype(np.int32)
diff = np.absolute(pos-centre).sum(axis=1)
cell_edge = np.zeros(3)
for i in range(3):
idx = np.where(diff==i+1)[0]
idy = np.where(val[idx]==1)[0]
cell_edge[i] = len(idy)
return cell_edge.astype(np.int32)
However this produces an extremely large error message which I have unable to use to diagnose the problem. I have tried specifying the input types as follows:
#nb.jit(nb.int32[:](nb.int32[:,:]))
def ...
however this produces an equally large error message.
I fell that I am probably using some function/feature that is not supported in numba, but I do not know enough about it to identify the problem. Any help would be greatly appreciated.

Numba should work ok so long as you stick to basic lists and arrays in the function you want to speed up. It appears that you are already using functions from numpy that are probably already well optimized. So its unlikely you will see a speed up even if you did get it to work. You haven't mentioned what your OS is. Under ubuntu 14.04 you can get it to work through some steps outlined here.

How can I partition pyspark RDDs holding R functions

import rpy2.robjects as robjects
dffunc = sc.parallelize([(0,robjects.r.rnorm),(1,robjects.r.runif)])
dffunc.collect()
Outputs
[(0, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc28618 / R:0x26abd18>), (1, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc283d8 / R:0x26aad28>)]
While the partitioned version results in an error:
dffuncpart = dffunc.partitionBy(2)
dffuncpart.collect()
RuntimeError: ('R cannot evaluate code before being initialized.', <built-in function unserialize>
It seems like this error is that R wasn't loaded on one of the partitions, which I assume implies that the first import step was not performed. Is there anyway around this?
EDIT 1 This second example causes me to think there's a bug in the timing of pyspark or rpy2.
dffunc = sc.parallelize([(0,robjects.r.rnorm), (1,robjects.r.runif)]).partitionBy(2)
def loadmodel(model):
import rpy2.robjects as robjects
return model[1](2)
dffunc.map(loadmodel).collect()
Produces the same error R cannot evaluate code before being initialized.
dffuncpickle = sc.parallelize([(0,pickle.dumps(robjects.r.rnorm)),(1,pickle.dumps(robjects.r.runif))]).partitionBy(2)
def loadmodelpickle(model):
import rpy2.robjects as robjects
import pickle
return pickle.loads(model[1])(2)
dffuncpickle.map(loadmodelpickle).collect()
Works just as expected.

I'd like to say that "this is not a bug in rpy2, this is a feature" but I'll realistically have to settle with "this is a limitation".
What is happening is that rpy2 has 2 interface levels. One is a low-level one (closer to R's C API) and available through rpy2.rinterface and the other one is a high-level interface with more bells and whistles, more "pythonic", and with classes for R objects inheriting from rinterface level-ones (that last part is important for the part about pickling below). Importing the high-level interface results in initializing (starting) the embedded R with default parameters if necessary. Importing the low-level interface rinterface does not have this side effect and the initialization of the embedded R must be performed explicitly (function initr). rpy2 was designed this way because the initialization of the embedded R can have parameters: importing first rpy2.rinterface, setting the initialization, then importing rpy2.robjects makes this possible.
In addition to that the serialization (pickling) of R objects wrapped by rpy2 is currently only defined at the rinterface level (see the documentation). Pickling robjects-level (high-level) rpy2 objects is using the rinterface-level code and when unpickling them they will remain at that lower-level (the Python pickle contains the module the class of the object is defined in and will import that module - here rinterface, which does not imply the initialization of the embedded R). The reason for things being this way are simply that it was "good enough for now": at the time this was implemented I had to simultaneously think of a good way to bridge two somewhat different languages and learn my way through Python C-API and pickling/unpickling Python objects. Given the ease with which one can write something like
import rpy2.robjects
or
import rpy2.rinterface
rpy2.rinterface.initr()
before unpickling, this was never revisited. The uses of rpy2's pickling I know about are using Python's multiprocessing (and adding something similar to the import statements in the code initializing a child process was a cheap and sufficient fix). May this is the time to look at this again. File a bug report for rpy2 if the case.
edit: this is undoubtedly an issue with rpy2. pickled robjects-level objects should unpickle back to robjects-level, not rinterface-level. I have opened an issue in the rpy2 tracker (and already pushed a rudimentary patch in the default/dev branch).
2nd edit: The patch is part of released rpy2 starting with version 2.7.7 (latest release at the time of writing is 2.7.8).

Python c_types .dll functions (pari library)

Alright, so a couple days ago I decided to try and write a primitive wrapper for the PARI library. Ever since then I've been playing with ctypes library in loading the dll and accessing the functions contained using code similar to the following:
from ctypes import *
libcyg=CDLL("<path/cygwin1.dll") #It needs cygwin to be loaded. Not sure why.
pari=CDLL("<path>/libpari-gmp-2.4.dll")
print pari.fibo #fibonacci function
#prints something like "<_FuncPtr object at 0x00BA5828>"
So the functions are there and they can potentially be accessed, but I always receive an access violation no matter what I try. For example:
pari.fibo(5) #access violation
pari.fibo(c_int(5)) #access violation
pari.fibo.argtypes = [c_long] #setting arguments manually
pari.fibo.restype = long #set the return type
pari.fibo(byref(c_int(5))) #access violation reading 0x04 consistently
and any variation on that, including setting argtypes to receive pointers.
The Pari .dll is written in C and the fibonacci function's syntax within the library is GEN fibo(long x).
Could it be the return type that's causing these errors, as it is not a standard int or long but a GEN type, which is unique to the PARI library? Any help would be appreciated. If anyone is able to successfully load the library and use ANY function from within python, please tell; I've been at this for hours now.
EDIT: Seems as though I was simply forgetting to initialize the library. After a quick pari.pari_init(4000000,500000) it stopped erroring. Now my problem lies in the in the fact that it returns a GEN object; which is fine, but whenever I try to reference the address to which it points, it's always 33554435, which I presume is still an address. I'm trying further commands and I'll update if I succeed in getting the correct value of something.

You have two problems here, one give fibo the correct return type and two convert the GEN return type to the value you are looking for.
Poking around the source code a bit, you'll find that GEN is defined as a pointer to a long. Also, at looks like the library provides some converting/printing GENs. I focused in on GENtostr since it would probably be safer for all the pari functions.
import cytpes
pari = ctypes.CDLL("./libpari.so.2.3.5") #I did this under linux
pari.fibo.restype = ctypes.POINTER(ctypes.c_long)
pari.GENtostr.restype = ctypes.POINTER(ctypes.c_char)
pari.pari_init(4000000,500000)
x = pari.fibo(100)
y = pari.GENtostr(x)
ctypes.string_at(y)
Results in:
'354224848179261915075'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Creating an R data.frame in python with low level rpy2 - python

Related

Error calling modules when using x = _import_("y") with dictionary values

My python codes in general are very slow, is this normal?

Optimising python function with numba

How can I partition pyspark RDDs holding R functions

Python c_types .dll functions (pari library)

Categories

Resources