How do Rpy2, pyrserve and PypeR compare?

I would like to access R from within a Python program. I am aware of Rpy2, pyrserve and PypeR.
What are the advantages or disadvantages of these three options?

I know one of the 3 better than the others, but in the order given in the question:
rpy2:
C-level interface between Python and R (R running as an embedded process)
R objects exposed to Python without the need to copy the data over
Conversely, Python's numpy arrays can be exposed to R without making a copy
Low-level interface (close to the R C-API) and high-level interface (for convenience)
In-place modification for vectors and arrays possible
R callback functions can be implemented in Python
Possible to have anonymous R objects with a Python label
Python pickling possible
Full customization of R's behavior with its console (so possible to implement a full R GUI)
MS Windows supported, but with limitations
pyrserve:
native Python code (will/should/may work with CPython, Jython, IronPython)
use R's Rserve
advantages and inconveniences linked to remote computation and to Rserve
pyper:
native Python code (will/should/may work with CPython, Jython, IronPython)
use of pipes to have Python communicate with R (with the advantages and inconveniences linked to it)
edit: Windows support for rpy2
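To make the high-level interface concrete, here is a minimal rpy2 sketch (using only the documented rpy2.robjects API; the data values are arbitrary):

import rpy2.robjects as robjects

# Evaluate R code from Python; the returned value is an R object wrapped for Python.
pi = robjects.r('pi')[0]

# Look up an R function by name and call it on a vector created from Python data.
r_sum = robjects.r['sum']
v = robjects.FloatVector([1.0, 2.5, 4.0])
total = r_sum(v)[0]

print(pi, total)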

From the paper in the Journal of Statistical Software on PypeR:
RPy presents a simple and efficient way of accessing R from Python. It is robust and very
convenient for frequent interaction operations between Python and R. This package allows
Python programs to pass Python objects of basic data types to R functions and return the
results in Python objects. Such features make it an attractive solution for the cases in which Python and R interact frequently. However, there are still limitations of this package as listed below.
Performance:
RPy may not behave very well for large-size data sets or for computation-intensive
duties. A lot of time and memory are inevitably consumed in producing the Python
copy of the R data because in every round of a conversation RPy converts the returned
value of an R expression into a Python object of basic types or NumPy array. RPy2, a
recently developed branch of RPy, uses Python objects to refer to R objects instead of
copying them back into Python objects. This strategy avoids frequent data conversions
and improves speed. However, memory consumption remains a problem. [...]
When we were implementing WebArray (Xia et al. 2005), an online platform for microarray data analysis, a job consumed roughly one quarter more computational time if running R through RPy instead of through R's command-line user interface. Therefore, we decided to run R in Python through pipes in subsequent developments, e.g., WebArrayDB (Xia et al. 2009), which retained the same performance as achieved when running R independently. We do not know the exact reason for such a difference in performance, but we noticed that RPy directly uses the shared library of R to run R scripts. In contrast, running R through pipes means running the R interpreter directly.
Memory:
R has been denounced for its uneconomical use of memory. The memory used by large-
size R objects is rarely released after these objects are deleted. Sometimes the only
way to release memory from R is to quit R. RPy module wraps R in a Python object.
However, the R library will stay in memory even if the Python object is deleted. In other
words, memory used by R cannot be released until the host Python script is terminated.
Portability:
As a module with extensions written in C, the RPy source package has to be compiled
with a specific R version on POSIX (Portable Operating System Interface for Unix)
systems, and the R must be compiled with the shared library enabled. Also, the binary
distributions for Windows are bound to specific combinations of different versions of
Python/R, so it is quite frequent that a user has difficulty in finding a distribution that
fits the user's software environment.

From a developer's perspective, we used to use rpy/rpy2 to provide statistical and drawing functions to our Python-based application. It caused huge problems in delivering our application, because rpy/rpy2 needs to be compiled for specific combinations of Python and R, which makes it infeasible for us to provide binary distributions that work out of the box unless we bundle R as well. Because rpy/rpy2 are not particularly easy to install, we ended up replacing the relevant parts with native Python modules such as matplotlib. We would have switched to pyrserve if we had to use R, because we could start an R server locally and connect to it without worrying about the version of R.
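For context, talking to a locally started Rserve from pyRserve looks roughly like this (assuming Rserve was already launched from R with library(Rserve); Rserve(), listening on its default port):

import pyRserve

# Connect to the local Rserve instance (default host/port).
conn = pyRserve.connect()

conn.r.x = [1.0, 2.0, 3.0, 4.0]   # assign a Python list to the R variable x
mean_x = conn.eval('mean(x)')     # evaluate R code, get the result back in Python
print(mean_x)

conn.close()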

In PypeR, I can't pass a large matrix from Python to the R instance with assign(); however, I don't have that issue with rpy2.
This is just my experience.
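For reference, the PypeR assign()/get() pattern referred to above looks roughly like this (the matrix size is just an example):

import numpy
from pyper import R

r = R()                              # start an R process connected through pipes
m = numpy.random.rand(1000, 100)     # example matrix
r.assign('m', m)                     # push the matrix into the R session
print(r('dim(m)'))                   # run R code; the printed R output comes back as a string
col_means = r.get('colMeans(m)')     # pull an R result back as a Python/numpy object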

Related

Does it hurt performance to dynamically link a shared library that won't be used?

I have a C++ program (in fact a DLL) which dynamically links to another shared library (the Python DLL); this program is used in two situations.
In situation A, the program makes function calls to that dynamically linked shared library, while in situation B it does not.
My question is: if I build the program specifically for situation B, without linking to the shared library, will I gain performance compared with the case where I link the shared library without actually using it?
It really depends on several factors: what OS, what shared library, and what the application actually does; possibly also how the shared library is built.
In general, it's not a particularly large penalty, since shared libraries are demand-loaded and use position-independent code (PIC; PC-relative addressing and similar). What this means is that the shared library is loaded only when it is actually used, and that there is essentially no work to load it. This is something that OS designers and system architects think a lot about, because for many performance-sensitive applications (for example compilers or web services), a badly designed shared-library system will make performance BAD.
It is of course possible to configure this when building the shared library, at least the use-of-PIC aspect of it, so if the person or company configuring the build of the shared library "wants to", it could be badly configured and have a worse-than-zero effect.
To this you have to add any initialization that the shared library does. Well-designed shared libraries do "on demand" or "lazy" initialization; in other words, they don't do much initialization until it's actually required. Again, the details of exactly which library, including how it was configured when it was built, can make a huge difference here.
The only REAL way to tell, in any particular use-case, is to build "with" and "without" the extra shared library, and measure the actual performance.

Python + Q (KDB) - which tools are easy to use and well maintained

From a thread dating back a few years I found some options to integrate Python and kdb, namely:
qpt
Dan's tools
PyQ
qPython
The last two seem to be the only ones actively updated at the moment. My question goes to the folks who actually use any (and have ideally tried several) of these tools. From your experience, which of the latter two is more suitable for me? Selection criteria would be (in that order):
ease of use (I am new to q, ideally I would do more work in python than in q)
documentation (seems to be generally not great on anything kdb)
python 3.x support
speed
If I completely missed a tool that fits my requirements, please let me know. I am aware of threads that raise similar questions, but I am looking for a 2017 answer, not 2015.
Exxeleron's qPython "is a Python library providing support for interprocess communication between Python and kdb+ processes." While the same functionality is available from PyQ, PyQ offers much more than just IPC.
PyQ is a full-featured Python interpreter running inside a kdb+ instance. For a Python programmer, PyQ gives a direct access to kdb+ data without any need to program in q. For a q programmer, PyQ offers an easy access to a rich set of computational and visualization libraries that Python is famous for.
To give an example, here is a linear interpolation function inp written in q:
inp:{y[i]+(z-x i)*(deltas[y]%deltas x)1+i:x bin z}
It takes three arguments: x and y are the coordinates of the known data points, and z is the x-coordinates of the interpolated values. It returns the y-coordinates of the interpolated values. The same function can be written in PyQ using pure Python syntax:
def inp(x, y, z):
    slope = y.deltas / x.deltas
    i = x.bin(z)
    return y[i] + (z - x[i]) * slope[i+1]
If you prepare the data in q
x:0.1*til 10
y:x - x * x
z:5?1f
and call either the Python or the q implementation, you will get the same result. At PyQ's Python prompt this can be verified as follows:
>>> inp(q.x, q.y, q.z) == q.inp(q.x, q.y, q.z)
True
Of course, an experienced Python programmer would not need to write such a function from scratch because NumPy already has numpy.interp which does the same and more. If, as a q programmer, you want to use numpy.interp from q, all you need is a simple wrapper that converts the result to a K object before returning it. This is how it can be done at the q) prompt
q)p)import numpy; from pyq import q, K
q)p)def inp2(x, y, z): return K(numpy.interp(z, x, y))
q)p)q.inp2 = inp2
And now, inp2 is ready to use:
q)inp[x;y;z] ~ inp2(x;y;z)
1b
Since PyQ runs inside kdb+, it gets its IPC implementation for free. For example, I can open a connection to the remote server at port 8888 and ask for its local time in two lines of code:
>>> h = q.hopen('::8888')
>>> h('.z.P')
k('2017.07.07D17:15:19.261285000')
However, most of the tasks can be accomplished in PyQ without any IPC (or even copying) at all because all your kdb+ data is already in the same process as your Python code.
To cover the OP's rubrics: on ease of use, qPython, being a pure Python library, may be easier to install, but PyQ programming is often easier because it does not require a separate kdb+ server. PyQ documentation is comparable in quality to that of qPython. PyQ has offered Python 3.x support since its version 3.0.1 and Python 3.1. Currently (2017) it is actively tested with Python 2.7, 3.5 and 3.6. A speed comparison would not be fair because PyQ has direct access to kdb+ data and does not require IPC, so it can accomplish many tasks 100x faster than qPython.
Disclaimer: I am the author of PyQ.
This kdb/python guide was updated 2017:
For anyone else who needs a Python library, I am highly recommending the exxeleron qpython library (though it does require numpy, which requires Python 2.6 as a minimum, I believe, which can be a limitation).
I have used the exxeleron qpython library fairly extensively and have found it to be a nice package for Python <-> kdb+ IPC. Last I recall, it has issues serialising multibyte characters (at least in Python 2.7) when sending to kdb+, so as a workaround I convert strings/symbols to bytes and do a `$ or `char$ on the kdb+ side.
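A minimal sketch of that workaround with qpython (the port number is made up; adjust the q-side cast to your schema):

from qpython import qconnection

q = qconnection.QConnection(host='localhost', port=5000)
q.open()
try:
    text = u'Grüße'                           # multibyte string that can trip the serialiser
    payload = text.encode('utf-8')            # send plain bytes instead of a unicode string
    result = q.sendSync('{`$ x}', payload)    # cast back to a symbol on the kdb+ side
    print(result)
finally:
    q.close()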
It's not the fastest thing in the world - its de/serialisation feels a little slower than it could be (at least in 2.7 - I haven't tested in Python 3) - but it is a friendly interface to kdb+ IPC from Python. It has nice hooks for the pub/sub model (using .receive on the connection object), and it is relatively well documented for something kdb+ related (there are even some nice client examples for pub/sub processing!).
I haven't tested PyQ, which in theory should be better for computation-heavy work because it does as much as possible in kdb+ rather than in Python. But for times when you can offload most of your work to a kdb+ process and want to, for example, analyse results or use Python-specific packages (e.g. for NLP/ML), qpython works quite well.

IPC/RPC for communication between R and Python

Summary
I'm looking for an IPC/RPC protocol which:
Has libraries in both R and Python 2.7 which work on Windows
Allows me to pass data and make calls between the two languages
Does not embed one language inside the other (I want to use Python and R in my favourite IDEs, not embed bits of R expression strings in Python code or vice versa)
Supports circular references
Supports millisecond time data
Fast and efficient when I need to pass large amounts of data
I have seen this similar question, but unlike the OP for that question, I do not want to use R in a pythonic way. I want to use R in an R way in my RStudio IDE. And I want to use Python in a Pythonic way in a PyCharm IDE. I just want to occasionally pass data or make calls between the two languages, not to blend the languages into one.
Does anyone have any suggestions?
Background
I use both Python and R interactively (by typing into the console of my favourite IDE: PyCharm and RStudio respectively). I often need to pass data and call functions between the two languages ad hoc when doing exploratory data analysis. For instance, I may start by processing data in Python, but later stumble across a great machine learning library in R that I want to try out, or vice versa.
I've looked at the Python libraries PypeR and rpy2, but both embed R within Python, so I lose the ability to use R interactively within RStudio. I have also looked at RPython, but it does not work on Windows, which is what I use.
Additionally, I've looked at XML-RPC, but some of my data contains objects with circular references (e.g. a tree structure where child nodes have references to their parent). The problem is that Python's xmlrpclib supports neither circular references nor timestamps with millisecond precision, which my data also contains.
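A quick way to see the circular-reference limitation in Python 2.7's xmlrpclib (the dictionary shape is just an illustration):

import xmlrpclib   # xmlrpc.client in Python 3

node = {'name': 'root', 'children': []}
child = {'name': 'leaf', 'parent': node}
node['children'].append(child)           # parent <-> child circular reference

try:
    xmlrpclib.dumps((node,))             # marshalling walks into the cycle
except TypeError as exc:
    print(exc)                           # e.g. "cannot marshal recursive dictionaries"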

Rserve vs. rpy [duplicate]


Using embedded C library in Python emulation

Short Question
Which would be the easier way to emulate (in Python) a complex (SAE J1939) communication stack from an existing embedded C library:
1) Full port - meaning manually convert all of the C functions to Python modules
2) Wrap the stack in a Python wrapper - meaning call the real C code from Python
Background Information
I have already written small portions of this stack in Python, but they are very non-trivial to implement with 100% coverage. For this very reason, we have recently purchased an off-the-shelf SAE J1939 stack for our embedded platforms. To clarify, I know that the portions touching the hardware layer will have to be re-created and mapped to the PC's CAN drivers.
I am hoping to find someone here on SO who has ported, or even looked into porting, a 5k LOC C library to Python. If there are any C-to-Python tools that work well, those would be helpful for me to look into as well.
My advice would be to wrap it.
Reasons for that:
if you convert function by function, you'll introduce new bugs (we're just human) and this kind of stuff is pretty hard to test
wrapping for Python is easily done, using SWIG or even ctypes to load a DLL on the fly; you'll find tons of tutorials (see the minimal ctypes sketch after this answer)
if your lib gets updated, you have less impact in the long term.
However, you need to
check that the license you purchase allows you to do that
know that having the same implementation on the embedded and PC sides won't help with tracking bugs
you might have a bit less portability than a full Python implementation (though that's not much of a point for you, since your low-level layer needs to be rewritten per target anyway)
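As a loose illustration of the ctypes route mentioned above (the library name, function name, and signature below are invented for the sketch; a real J1939 stack defines its own API):

import ctypes

# Load the vendor stack built as a shared library (name is hypothetical; use a .dll on Windows).
j1939 = ctypes.CDLL('./libj1939stack.so')

# Declare the signature of one (made-up) C function before calling it.
j1939.j1939_send_pgn.argtypes = [ctypes.c_uint32,
                                 ctypes.POINTER(ctypes.c_ubyte),
                                 ctypes.c_size_t]
j1939.j1939_send_pgn.restype = ctypes.c_int

payload = (ctypes.c_ubyte * 8)(*range(8))   # an 8-byte example frame payload
status = j1939.j1939_send_pgn(0xFECA, payload, len(payload))
print('send returned', status)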
Definitely wrap it. It might be as easy as running ctypesgen.py and then using the result. Check this blog article about using ctypesgen to create a wrapper for libreadline, http://wavetossed.blogspot.com/2011/07/asynchronous-gnu-readline.html, in order to get access to the full API.
