Summary
I'm looking for an IPC/RPC protocol which:
Has libraries in both R and Python 2.7 which work on Windows
Allows me to pass data and make calls between the two languages
Does not embed one language inside the other (I want to use Python and R in my favourite IDEs, without embedding bits of R expression strings in Python code, or vice versa)
Supports circular references
Supports millisecond time data
Fast and efficient when I need to pass large amounts of data
I have seen this similar question, but unlike the OP for that question, I do not want to use R in a pythonic way. I want to use R in an R way in my RStudio IDE. And I want to use Python in a Pythonic way in a PyCharm IDE. I just want to occasionally pass data or make calls between the two languages, not to blend the languages into one.
Does anyone have any suggestions?
Background
I use both Python and R interactively (by typing into the console of my favourite IDE: PyCharm and RStudio respectively). I often need to pass data and call functions between the two languages ad hoc when doing exploratory data analysis. For instance, I may start with processing data in Python, but later stumble across a great machine learning library in R that I want to try out, or vice versa.
I've looked at Python libraries PypeR and rpy2 but both embed R within Python, so I lose the ability to use R interactively within RStudio. I have also looked at RPython, but I use Windows and it does not work with Windows.
Additionally, I've looked at XML-RPC but some of my data has objects which contain circular references (e.g. a tree structure where child nodes have references to their parent). Problem is, Python's xmlrpclib does not support circular references, nor timestamps to millisecond precision which my data also contains.
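One way around both limitations, whatever transport is chosen, is to flatten the data before it crosses the language boundary: replace each parent reference with an integer id, and send timestamps as whole milliseconds since the Unix epoch. The following is only a minimal sketch of that idea. The Node class, the record field names, and the choice of JSON are assumptions made for illustration, it is written for Python 3 rather than 2.7, and it assumes the node passed to flatten() is the root (its parent is None). The R side would need its own reader for the same records.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class Node:
    """Hypothetical tree node whose children point back at their parent."""

    def __init__(self, name, stamp, parent=None):
        self.name = name
        self.stamp = stamp          # timezone-aware datetime
        self.parent = parent        # back-reference: this creates the cycle
        self.children = []
        if parent is not None:
            parent.children.append(self)


def to_millis(stamp):
    """Whole milliseconds since the Unix epoch, using integer arithmetic."""
    delta = stamp - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000


def from_millis(millis):
    """Inverse of to_millis()."""
    return EPOCH + timedelta(milliseconds=millis)


def flatten(root):
    """Turn the tree into flat records; a parent always precedes its children."""
    ids, records, stack = {}, [], [root]
    while stack:
        node = stack.pop()
        ids[id(node)] = len(records)
        records.append({
            "id": ids[id(node)],
            "name": node.name,
            "parent": None if node.parent is None else ids[id(node.parent)],
            "stamp_ms": to_millis(node.stamp),
        })
        stack.extend(reversed(node.children))
    return records


def rebuild(records):
    """Recreate the linked tree from the flat records and return its root."""
    nodes = {}
    for rec in records:
        parent = None if rec["parent"] is None else nodes[rec["parent"]]
        nodes[rec["id"]] = Node(rec["name"], from_millis(rec["stamp_ms"]), parent)
    return nodes[0]


root = Node("root", datetime(2014, 1, 2, 3, 4, 5, 678000, tzinfo=timezone.utc))
Node("kid", datetime(2014, 1, 2, 3, 4, 5, 679000, tzinfo=timezone.utc), root)
wire = json.dumps(flatten(root))      # acyclic, so any serializer can carry it
copy = rebuild(json.loads(wire))
```

Because the records contain no cycles, the same structure could equally be sent over XML-RPC or written to a file; the cycle is restored only on the receiving side.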
Related
I would like to access R from within a Python program. I am aware of Rpy2, pyrserve and PypeR.
What are the advantages or disadvantages of these three options?
I know one of the 3 better than the others, but in the order given in the question:
rpy2:
C-level interface between Python and R (R running as an embedded process)
R objects exposed to Python without the need to copy the data over
Conversely, Python's numpy arrays can be exposed to R without making a copy
Low-level interface (close to the R C-API) and high-level interface (for convenience)
In-place modification for vectors and arrays possible
R callback functions can be implemented in Python
Possible to have anonymous R objects with a Python label
Python pickling possible
Full customization of R's behavior with its console (so possible to implement a full R GUI)
Limited support on MS Windows
pyrserve:
native Python code (will/should/may work with CPython, Jython, IronPython)
uses R's Rserve
advantages and inconveniences linked to remote computation and to RServe
pyper:
native Python code (will/should/may work with CPython, Jython, IronPython)
use of pipes to have Python communicate with R (with the advantages and inconveniences linked to it)
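To make the pipe approach concrete, here is a rough sketch of what "communicating through pipes" means at its simplest: start the other interpreter as a child process, write a script to its standard input, and read its standard output. This is not PypeR's actual implementation. The run_through_pipe helper is a hypothetical name, and the Python interpreter is used as a stand-in child so the example runs without R installed.

```python
import subprocess
import sys


def run_through_pipe(command, script):
    """Start `command`, send `script` on its stdin, and return its stdout."""
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,   # exchange text rather than bytes
    )
    out, err = proc.communicate(script)
    if proc.returncode != 0:
        raise RuntimeError(err)
    return out


# Stand-in so the sketch runs without R: the Python interpreter plays the
# role of the child process and reads its program from stdin.
print(run_through_pipe([sys.executable, "-"], "print(1 + 1)"))

# With R on the PATH, the assumed equivalent would be something like:
# run_through_pipe(["R", "--vanilla", "--slave"], "cat(1 + 1)")
```

Everything crosses the pipe as text, which is the main advantage (no compiled extension tied to a Python/R version pair) and the main inconvenience (data has to be serialized and parsed on each side).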
edit: Windows support for rpy2
From the paper in the Journal of Statistical Software on PypeR:
RPy presents a simple and efficient way of accessing R from Python. It is robust and very convenient for frequent interaction operations between Python and R. This package allows Python programs to pass Python objects of basic data types to R functions and return the results in Python objects. Such features make it an attractive solution for the cases in which Python and R interact frequently. However, there are still limitations of this package as listed below.
Performance:
RPy may not behave very well for large-size data sets or for computation-intensive duties. A lot of time and memory are inevitably consumed in producing the Python copy of the R data because in every round of a conversation RPy converts the returned value of an R expression into a Python object of basic types or NumPy array. RPy2, a recently developed branch of RPy, uses Python objects to refer to R objects instead of copying them back into Python objects. This strategy avoids frequent data conversions and improves speed. However, memory consumption remains a problem. [...]
When we were implementing WebArray (Xia et al. 2005), an online platform for microarray data analysis, a job consumed roughly one quarter more computational time if running R through RPy instead of through R's command-line user interface. Therefore, we decided to run R in Python through pipes in subsequent developments, e.g., WebArrayDB (Xia et al. 2009), which retained the same performance as achieved when running R independently. We do not know the exact reason for such a difference in performance, but we noticed that RPy directly uses the shared library of R to run R scripts. In contrast, running R through pipes means running the R interpreter directly.
Memory:
R has been denounced for its uneconomical use of memory. The memory used by large-size R objects is rarely released after these objects are deleted. Sometimes the only way to release memory from R is to quit R. RPy module wraps R in a Python object. However, the R library will stay in memory even if the Python object is deleted. In other words, memory used by R cannot be released until the host Python script is terminated.
Portability:
As a module with extensions written in C, the RPy source package has to be compiled with a specific R version on POSIX (Portable Operating System Interface for Unix) systems, and the R must be compiled with the shared library enabled. Also, the binary distributions for Windows are bound to specific combinations of different versions of Python/R, so it is quite frequent that a user has difficulty in finding a distribution that fits the user's software environment.
From a developer's perspective, we used to use rpy/rpy2 to provide statistical and drawing functions to our Python-based application. It caused huge problems in delivering our application because rpy/rpy2 needs to be compiled for specific combinations of Python and R, which makes it infeasible for us to provide binary distributions that work out of the box unless we bundle R as well. Because rpy/rpy2 are not particularly easy to install, we ended up replacing the relevant parts with native Python modules such as matplotlib. We would have switched to pyrserve if we had to use R because we could start an R server locally and connect to it without worrying about the version of R.
In PypeR, I can't pass a large matrix from Python to the R instance with assign(). However, I don't have that issue with rpy2.
This is just my experience.
Is it possible to create a glue that makes it possible for python modules (more specifically, library bindings) to be used in node.js? Some of the data structures can be directly mapped to V8 objects - e.g. array, dict.
More importantly - would that be a more elegant way to create bindings than writing them manually or through FFI? In short, would it be worth it?
Try this node.js module, which is a bridge: Node-Python.
NOTE: The project is 7 years old and still stuck at v0.4. A lot of functionality, like converting between Python and Node arrays, is still missing. It may be safe to assume that it's no longer supported by its original author(s).
Edge.js does a fine job at this. It allows you to write a Python script and then call the routines from Node.js, which can be used to easily create bindings with python modules.
Short Question
Which would be the easier way to emulate (in Python) a complex (SAE J1939) communication stack from an existing embedded C library:
1) Full port - meaning manually convert all of the C functions to Python modules
2) Wrap the stack in a Python wrapper - meaning call the real C code from Python
Background Information
I have already written small portions of this stack in Python; however, they are very non-trivial to implement with 100% coverage. For this very reason, we have recently purchased an off-the-shelf SAE J1939 stack for our embedded platforms. To clarify, I know that portions touching the hardware layer will have to be re-created and mapped to the PC's CAN drivers.
I am hoping to find someone here on SO who has ported, or even looked into porting, a 5k LOC C library to Python. If there are any C to Python tools that work well, those would be helpful for me to look into as well.
My advice would be to wrap it.
Reasons for that:
if you convert function by function, you'll introduce new bugs (we're just human) and this kind of stuff is pretty hard to test
wrapping for Python is done easily, using swig or even ctypes to load a DLL on the fly; you'll find tons of tutorials
if your lib gets updated, you have less impact in the long term.
However, you need to
check that the license you purchased allows you to do that
know that having the same implementation on the embedded and PC sides won't help in tracking down bugs
you might have a bit less portability than with a full Python implementation (anyway, not much of a point for you as your low layer needs to be rewritten per target)
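As a rough illustration of the ctypes route mentioned above: the library is loaded at run time, each C prototype is declared once, and a thin Python function hides the call. The purchased J1939 stack is obviously not available here, so this sketch uses the platform's C runtime and its strlen function as a stand-in; load_c_runtime and c_strlen are hypothetical helper names. A real wrapper would declare the stack's own functions in the same way.

```python
import ctypes
import ctypes.util
import sys


def load_c_runtime():
    """Load the platform's C runtime as a stand-in for the purchased library."""
    if sys.platform.startswith("win"):
        return ctypes.CDLL("msvcrt")
    name = ctypes.util.find_library("c")
    # CDLL(None) exposes the symbols already loaded into the process (POSIX).
    return ctypes.CDLL(name) if name else ctypes.CDLL(None)


libc = load_c_runtime()

# Declare the C prototype once: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data):
    """Thin Python wrapper around the C function."""
    return libc.strlen(data)


print(c_strlen(b"J1939"))
```

Declaring argtypes and restype up front matters: without them ctypes assumes int arguments and an int return value, which silently truncates pointers and sizes on 64-bit platforms.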
Definitely wrap it. It might be as easy as running ctypesgen.py and then using the result. Check this blog article about using ctypesgen to create a wrapper for libreadline http://wavetossed.blogspot.com/2011/07/asynchronous-gnu-readline.html in order to get access to the full API.
i just discovered http://code.google.com/p/re2, a promising library that uses a long-neglected way (Thompson NFA) to implement a regular expression engine that can be orders of magnitude faster than the available engines of awk, Perl, or Python.
so i downloaded the code and did the usual sudo make install thing. however, that action had seemingly done little more than adding /usr/local/include/re2/re2.h to my system. there seemed to be some *.a file in addition, but then what is it with this *.a extension?
i would like to use re2 from Python (preferably Python 3.1) and was excited to see files like make_unicode_groups.py in the distro (maybe just used during the build process?). those however were not deployed on my machine.
how can i use re2 from Python?
update two friendly people have pointed out that i could try to build DLLs / *.so files from the sources and then use Python’s ctypes library to access those. can anyone give useful pointers how to do just that? i’m pretty much clueless here, especially with the first part (building the *.so files).
update i have also posted this question (earlier) to the re2 developers’ group, without reply till now (it is a small group), and today to the (somewhat more populous) comp.lang.py group [—thread here—]. the hope is that people from various corners can contact each other. my guess is a skilled person can do this in a few hours during their 20% your-free-time-belongs-to-google-too timeslice; it would tie me up for weeks. is there a tool to automatically dumb-down C++ to whatever flavor of C that Python needs to be able to connect? then maybe getting a viable result can be reduced to clever tool chaining.
(rant)why is this so difficult? to think that in 2010 we still cannot have our abundant pieces of software just talk to each other. this is such a roadblock that whenever you want to address some C code from Python you must always cruft these linking bits. this requires a lot of work, but only delivers an extension module that is specific to the version of the C code and the version of Python, so it ages fast.(/rant) would it be possible to run such things in separate processes (say if i had an re2 executable that can produce results for data that comes in on, say, subprocess/Popen/communicate())? (this should not be a pure command-line tool that necessitates the opening of a process each time it is needed, but a single process that runs continuously; maybe there exist wrappers that sort of ‘daemonize’ such C code).
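The separate-process idea can be sketched as follows: one child process is started once and then answers many requests over its stdin/stdout, one line per request. This is only an illustration of the pattern under stated assumptions. No such re2 executable is known to exist, so the worker here is a small Python script using the standard re module as a stand-in, and the line format (pattern, a tab, then the text) and the LineWorker class are invented for the example.

```python
import subprocess
import sys

# Source of the hypothetical worker. It answers one request per line:
# "<pattern>\t<text>" in, "1" or "0" out. Python's re stands in for re2.
WORKER_SOURCE = r"""
import re, sys
for line in sys.stdin:
    pattern, _, text = line.rstrip("\n").partition("\t")
    print(1 if re.search(pattern, text) else 0, flush=True)
"""


class LineWorker:
    """Keeps one child process alive and exchanges one line per request."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            universal_newlines=True,   # text pipes
            bufsize=1,                 # line-buffered on the parent side
        )

    def ask(self, pattern, text):
        """Send one request and block until the worker's one-line answer."""
        self.proc.stdin.write(pattern + "\t" + text + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().strip() == "1"

    def close(self):
        """Closing stdin ends the worker's loop, so it exits on its own."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


worker = LineWorker()
print(worker.ask("a+b", "xaaab"), worker.ask("^x", "abc"))
worker.close()
```

The framing is deliberately naive: a pattern or text containing a tab or newline would break it, so a real protocol would need escaping or length-prefixed messages.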
David Reiss has put together a Python wrapper for re2. It doesn't have all of the functionality of Python's re module, but it's a start. It's available here: http://github.com/facebook/pyre2.
Possible yes, easy no. Looking at the re2.h, this is a C++ library exposed as a class. There are two ways you could use it from Python.
1.) As Tuomas says, compile it as a DLL/so and use ctypes. In order to use it from Python, though, you would need to wrap the object init and methods into C-style extern functions. I've done this in the past with ctypes by externing functions that pass a pointer to the object around. The "init" function returns a void pointer to the object, which then gets passed on each subsequent method call. Very messy indeed.
2.) Wrap it into a true Python module. Again, those functions exposed to Python would need to be extern "C". One option is to use Boost.Python, which would ease this work.
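The Python half of that void-pointer pattern might look roughly like this. Since the extern "C" shim around re2 would have to be written and compiled first, the sketch substitutes the C runtime's FILE* API, which has the same shape: tmpfile() plays the "init" function that returns an opaque pointer, fputs() and ftell() play the methods that take it, and fclose() plays the destructor. It assumes a POSIX system, and the Handle class is a hypothetical name.

```python
import ctypes
import ctypes.util

# Stand-in library: the C runtime, whose FILE* is a classic opaque handle.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

libc.tmpfile.restype = ctypes.c_void_p                     # "init": returns the handle
libc.fputs.argtypes = [ctypes.c_char_p, ctypes.c_void_p]   # "method" taking the handle
libc.ftell.argtypes = [ctypes.c_void_p]
libc.ftell.restype = ctypes.c_long
libc.fclose.argtypes = [ctypes.c_void_p]                   # "destructor"


class Handle:
    """Python object that owns the opaque pointer and hides it from callers."""

    def __init__(self):
        self._ptr = libc.tmpfile()
        if not self._ptr:
            raise OSError("could not create the underlying object")

    def write(self, data):
        if libc.fputs(data, self._ptr) < 0:
            raise OSError("write failed")

    def position(self):
        return libc.ftell(self._ptr)

    def close(self):
        # Guard against double-free: release the pointer at most once.
        if self._ptr:
            libc.fclose(self._ptr)
            self._ptr = None


handle = Handle()
handle.write(b"hello")
print(handle.position())
handle.close()
```

Wrapping the pointer in a class keeps the messiness in one place: callers never see the raw address, and the lifetime of the C++ object is tied to explicit init and close calls.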
SWIG handles C++ (unlike ctypes), so it may be more straightforward to use it.
You could try to build re2 into its own DLL/so and use ctypes to call functions from that DLL/so. You will probably need to define your own entry points in the DLL/so.
You can use the Python package https://pypi.org/project/google-re2/. Note, though, that the bottom of that page lists a few requirements you need to install yourself before installing the Python package.
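Once such a package is installed, usage is meant to look much like the standard re module. The snippet below is a hedged sketch rather than documentation of the package: it assumes the installed module is importable as re2 and exposes re-style compile/search/group calls, and it falls back to the standard library so it still runs where RE2 is not installed. The sample pattern and text are made up.

```python
try:
    import re2            # assumed module name provided by the RE2 wrapper
except ImportError:
    import re as re2      # fallback so the sketch still runs without RE2

pattern = re2.compile(r"(\w+)@(\w+)\.com")
match = pattern.search("mail bob@example.com today")
print(match.group(1), match.group(2))
```

Keeping the import behind a try/except also makes it easy to compare both engines on the same patterns before committing to the extra dependency.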