dataframes classes from Rpy2 to pandas and back - python

I have a data frame in rpy2 that outputs the following classes:
In [58]: [tuple(x.rclass) for x in df_rpy2]
Out[58]: [('integer',), ('integer',), ('numeric',), ('numeric',)]
Now if I do:
from rpy2.robjects import pandas2ri as pd2ri
tmp = pd2ri.ri2py(df_rpy2)
df_from_pd = pd2ri.py2ri(tmp)
print([tuple(x.rclass) for x in df_from_pd])
I get [('array',), ('array',), ('array',), ('array',)]
How can I keep the initial classes?

Related

Convert MatchIt summary object into a pandas dataframe with rpy2

I am using R's MatchIt package but calling it from Python via the rpy2 package.
On the R side MatchIt gives me a complex result object including raw data and some additional statistical information. One part of it is a matrix that I want to transform into a data frame, which I can do in R code like this:
# R Code
m.out <- matchit(....)
m.sum <- summary(m.out)
# The following two lines should be somehow "translated" into
# Pythons rpy2
balance <- m.sum$sum.matched
balance <- as.data.frame(balance)
My problem is that I don't know how to implement the last two lines with Python's rpy2 package. I am able to get m.out and m.sum with rpy2.
See this MWE please
#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset
if __name__ == '__main__':
    # import
    robjects.packages.importr('MatchIt')
    # data
    p_df = pydataset.data('respiratory')
    p_df.treat = p_df.treat.replace({'P': 0, 'A': 1})
    # Convert Panda data into R data
    with robjects.conversion.localconverter(
            robjects.default_converter + pandas2ri.converter):
        r_df = robjects.conversion.py2rpy(p_df)
    # Call R's matchit with R data object
    match_out = robjects.r['matchit'](
        formula=robjects.Formula('treat ~ age + sex'),
        data=r_df,
        method='nearest',
        distance='glm')
    # matched data
    match_data = robjects.r['match.data'](match_out)
    # Convert R data into Pandas data
    with robjects.conversion.localconverter(
            robjects.default_converter + pandas2ri.converter):
        match_data = robjects.conversion.rpy2py(match_data)
    # summary object
    match_sum = robjects.r['summary'](match_out)
    # x = robjects.r('''
    #     balance <- match_sum$sum.matched
    #     balance <- as.data.frame(balance)
    #
    #     balance
    #     ''')
When inspecting the python object match_sum I can't find anything like sum.matched in it. So I have to "translate" the match_sum$sum.matched somehow with rpy2. But I don't know how.
An alternative solution would be to run everything as R code with robjects.r(''' # r code ...'''). But in that case I don't know how to bring a Pandas data frame into that code.
EDIT: Be aware that the MWE presented here uses an outdated way of converting R objects into Python objects and vice versa. Please see the answer below for a better one.
Ah, it is always the same phenomenon: while formulating the question, the answer jumps right into your face.
My (maybe not the best) solution is:
Use real R code and run it with rpy2.robjects.r().
That R code needs to create an R function() to be able to receive a data frame from the outside (the caller).
Besides that solution, and based on another answer, I also modified the conversion from R to Python data frames in that code.
#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset
if __name__ == '__main__':
    # For converting objects from/into Pandas <-> R
    # Credits: https://stackoverflow.com/a/20808449/4865723
    pandas2ri.activate()
    # import
    robjects.packages.importr('MatchIt')
    # data
    df = pydataset.data('respiratory')
    df.treat = df.treat.replace({'P': 0, 'A': 1})
    # match object
    match_out = robjects.r['matchit'](
        formula=robjects.Formula('treat ~ age + sex'),
        data=df,
        method='nearest',
        distance='glm')
    # matched data
    match_data = robjects.r['match.data'](match_out)
    match_data = robjects.conversion.rpy2py(match_data)
    # SOLUTION STARTS HERE:
    get_balance_dataframe = robjects.r('''f <- function(match_out) {
        as.data.frame(summary(match_out)$sum.matched)
    }
    ''')
    balance = get_balance_dataframe(match_out)
    balance = robjects.conversion.rpy2py(balance)
    print(type(balance))
    print(balance)
Here is the output.
<class 'pandas.core.frame.DataFrame'>
          Means Treated  Means Control  Std. Mean Diff.  Var. Ratio  eCDF Mean  eCDF Max  Std. Pair Dist.
distance       0.514630       0.472067         0.471744    0.512239   0.077104  0.203704         0.507222
age           32.888889      34.129630        -0.089355    1.071246   0.063738  0.203704         0.721511
sexF           0.111111       0.259259        -0.471405         NaN   0.148148  0.148148         0.471405
sexM           0.888889       0.740741         0.471405         NaN   0.148148  0.148148         0.471405
EDIT: Take care that there are no umlauts or other Unicode-problematic characters in the cell values or in the column and row names when you do this on Windows. Otherwise a Unicode decode error occurs from time to time. I wasn't able to reproduce this reliably, so I have no fresh bug report about it.
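One way to act on the note above before handing a frame to R is to scan the labels for non-ASCII characters. The following is only a minimal sketch: `find_non_ascii` is a hypothetical helper name (not part of rpy2 or pandas), the sample labels are made up, and it only inspects the strings passed to it.

```python
def find_non_ascii(labels):
    """Return the labels that contain at least one non-ASCII character."""
    # str.isascii() requires Python 3.7 or newer
    return [label for label in labels if not str(label).isascii()]


# Made-up labels; with a pandas DataFrame one could pass df.columns or df.index
column_names = ["age", "Größe", "sex"]
suspicious = find_non_ascii(column_names)
print(suspicious)
```

The same check could be applied to string-valued cells by passing a column's values instead of the labels.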

Robust 2-Way ANOVA in Python

I need to run robust ANOVA from Python. The function I want to use is t2way from the R package WRS2. I tried with rpy2, but I'm stuck with an error:
>>> import pandas as pd
>>> import rpy2.robjects.packages as rpackages
>>> from rpy2.robjects import pandas2ri
>>> pandas2ri.activate()
>>> df = pd.read_csv("https://github.com/lawrence009/dsur/raw/master/data/goggles.csv")
>>> rdf = pandas2ri.py2rpy(df)
>>> WRS2 = rpackages.importr('WRS2')
>>> WRS2.t2way("attractiveness ~ gender*alcohol", data = rdf)
RRuntimeError: Error in x[[grp[i]]] :
attempt to select less than one element in get1index
I'm looking for either a way to make this work with rpy2, or (even better) a port of WRS2 to the python environment. Any help would be much appreciated.
If the issue is with columns in the data frame that are not factors (as suggested in the other answer), casting them into factors is quite easy:
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
rdf = pandas2ri.py2rpy(df)
base = importr('base')
for cn in ('alcohol', 'gender'):
    i = rdf.colnames.index(cn)
    rdf[i] = base.as_factor(rdf[i])
# We could also do it with
# rdf[i] = ro.FactorVector(rdf[i])
To be on the safe side, it is recommended to create an R formula object. Some R functions will accept strings and assume that they are formulas, but this is up to the package author and not always the case.
WRS2.t2way(ro.Formula('attractiveness ~ gender*alcohol'), data = rdf)
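The cast to factors can also be prepared on the pandas side, before the conversion. The sketch below assumes that the installed rpy2 version maps pandas categorical columns to R factors during conversion, which is worth verifying for your version. The sample data is made up; only the column names follow the question.

```python
import pandas as pd

# Made-up stand-in for the goggles data; only the column names follow the question
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "alcohol": ["None", "None", "2 Pints", "4 Pints"],
    "attractiveness": [65, 50, 70, 45],
})

# Mark the grouping columns as categorical before converting the frame to R
for cn in ("alcohol", "gender"):
    df[cn] = df[cn].astype("category")

print(df.dtypes)
```

After this step the frame would be converted as before (for example with `pandas2ri.py2rpy(df)`), and the classes of the resulting R columns can be checked to confirm the assumption.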
Here is my particular solution for this problem. The first problem on the R side is that after importing the data frame you have to change the type of the columns alcohol and gender with as.factor.
In R the script would be:
library(WRS2)
df <- read.csv2("https://github.com/lawrence009/dsur/raw/master/data/goggles.csv",header = TRUE, sep=',')
df[ , c('attractiveness')] <- as.numeric(df[ , c('attractiveness')])
df[ , c('alcohol')] <- as.factor(df[ , c('alcohol')])
df[ , c('gender')] <- as.factor(df[ , c('gender')])
t2way(attractiveness ~ gender*alcohol, data = df)
In Python, however, I didn't find a way to change the data type of the columns, so I came up with this solution:
First you have to create an .R file named my_t2way.R that contains:
my_t2way <- function(df1){
  library(WRS2)
  df <- read.csv2(df1,header = TRUE, sep=',')
  df[ , c('attractiveness')] <- as.numeric(df[ , c('attractiveness')])
  df[ , c('alcohol')] <- as.factor(df[ , c('alcohol')])
  df[ , c('gender')] <- as.factor(df[ , c('gender')])
  f <- t2way(attractiveness ~ gender*alcohol, data = df)
  df1 = data.frame(factor=c('gender','alcohol','gender:alcohol'),
                   value = c(f$Qa,f$Qb,f$Qab),
                   p.value = c(f$A.p.value,f$B.p.value,f$AB.p.value))
  return(df1)
}
And then you can run the following commands from python
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
# Defining the R script and loading the instance in Python
pandas2ri.activate()
r = robjects.r
r['source']('my_t2way.R')
# Loading the function we have defined in R.
my_t2way_r = robjects.globalenv['my_t2way']
# Reading and processing data
df1 = "https://github.com/lawrence009/dsur/raw/master/data/goggles.csv"
df_result_r = my_t2way_r(df1)
Certainly this solution only works for this particular case, but I think it could easily be extended to other data frames.

how to select columns from R dataframe in rpy2 in python?

I have a dataframe in rpy2 in python and I want to pull out columns from it. What is the rpy2 equivalent of this R code?
df[,c("colA", "colC")]
this works to get the first column:
mydf.rx(1)
but how can I pull a set of columns, e.g. the 1st, 3rd and 5th?
mydf.rx([1,3,5])
does not work. neither does:
mydf.rx(rpy2.robjects.r.c([1,3,5]))
Alternatively, you can convert the R data frame into a Python pandas data frame and select columns 1, 3 and 5 there:
#!/usr/bin/python
import rpy2
import rpy2.robjects as ro
import pandas as pd
import pandas.rpy.common as com
# SOURCE R SCRIPT INSIDE PYTHON
ro.r.source(r'C:\Path\To\R script.R')
# DEFINE PYTHON DF AS R DF
pydf = com.load_data('rdf')
# R positions 1, 3, 5 are pandas positions 0, 2, 4
cols = pydf.iloc[:, [0, 2, 4]]
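Keep in mind that R positions are 1-based while pandas positions are 0-based. A minimal sketch of the pandas-side selection, using a made-up frame and a hypothetical helper `select_r_positions` (not part of pandas or rpy2):

```python
import pandas as pd


def select_r_positions(frame, r_positions):
    """Select columns of a pandas DataFrame by R-style (1-based) positions."""
    py_positions = [p - 1 for p in r_positions]  # shift to 0-based positions
    return frame.iloc[:, py_positions]


# Made-up data for illustration
pydf = pd.DataFrame({"colA": [1, 2], "colB": [3, 4], "colC": [5, 6],
                     "colD": [7, 8], "colE": [9, 10]})

cols = select_r_positions(pydf, [1, 3, 5])
print(list(cols.columns))
```

Selecting by name, as in the R expression from the question, needs no index shift: `pydf[["colA", "colC"]]`.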
I think the answer is:
# cols to select
c = rpy2.robjects.IntVector((1,3))
# selection from df
mydf.rx(True, c)
The best way that I found is this:
import pandas as pd
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
import rpy2.robjects as robjects
dataframe = robjects.r('data.frame')
df_rpy2 = dataframe([1,2],[5,6])
df_pd = pd.DataFrame({'A': [1,2], 'B': [5,6]})
base = importr('base') # Creates an instance of R's base package
pandas2ri.activate() # Converts any pandas dataframe to its R equivalent
base.colnames(df_rpy2) # Finds the column names of the dataframe df_rpy2
base.colnames(df_pd) # Finds the column names of the dataframe df_pd
The output is:
R object with classes: ('character',) mapped to:
<StrVector - Python:0x7fa3504d3048 / R:0x10f65ac0>
['X1L', 'X2L', 'X5L', 'X6L']
R object with classes: ('character',) mapped to:
<StrVector - Python:0x7fa352493548 / R:0x103b6e40>
['A', 'B']
This works for both the dataframes created using pandas & rpy2. Hope this helps!

pandas and rpy2: Why does ezANOVA work via robjects.r but not robjects.packages.importr?

Like many, I'm hoping to stop straddling the R and Python worlds and just work in Python using pandas, rpy2, numpy, etc. I'm using the R package ez for its ezANOVA facility. It works if I do things the hard way, but why doesn't it work when I do them the easy way? I don't understand the resulting error:
  File "/Users/malcomreynolds/analysis/r_with_pandas.py", line 38, in <module>
    res = ez.ezANOVA(data=testData, dv='score', wid='subjectid', between='block', detailed=True)
  File "/usr/local/lib/python2.7/site-packages/rpy2/robjects/functions.py", line 178, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/rpy2/robjects/functions.py", line 106, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
rpy2.rinterface.RRuntimeError: Error in table(temp[, names(temp) == wid]) :
  attempt to set an attribute on NULL
See below for full reproducible code (requires some Python packages: rpy2, pandas, numpy):
import pandas as pd
from rpy2 import robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate() # make rpy2 accept and auto-convert pandas dataframes
from rpy2.robjects.packages import importr
base = importr('base')
ez = importr('ez')
robjects.r['options'](warn=-1) # ???
import numpy as np
"""Make pandas data frame from scratch"""
score = np.random.normal(loc=10, scale=20, size=10)
subjectid = range(10)
block = ["Sugar"] * 5 + ["Salt"] * 5
testData = pd.DataFrame({'score':score, 'block':block, 'subjectid': subjectid})
# it looks just like a dataframe from R
print(testData)
"""HARD WAY: Use ezANOVA through rpy2 *** THIS WORKS ***"""
anova1 = robjects.r("""
library(ez)
function(df) {
# df gets passed in
ezANOVA(
data=df,
dv=score,
wid=subjectid,
between=block,
detailed=TRUE)
}
""")
print(anova1(testData))
# this command shows that the ez instance is set up properly
print(ez.ezPrecis(data=testData)) # successful
"""EASY WAY: Import ez directly and use it """
# *** THIS APPROACH DOES NOT WORK ***
# yet, trying to use ez.ezANOVA yields an exception about the wid value
# res = ez.ezANOVA(data=testData, dv='score', wid='subjectid', between='block', detailed=True)
# print res
# *** THIS APPROACH WORKS (and also uses my options change) ***
res = ez.ezANOVA(data=testData, dv=base.as_symbol('score'), wid=base.as_symbol('subjectid'), between=base.as_symbol('block'))
print(res)
In the easy version you are passing symbol names as strings. This is not the same as a symbol.
Check the use of as_symbol in Minimal example of rpy2 regression using pandas data frame

How to use pandas dataframes and numpy arrays in Rpy2?

I'd like to use pandas and numpy for all my analysis but use rpy2 for plotting my data. I want to do all analyses using pandas dataframes and then use R's full plotting capabilities via rpy2 to plot them. I am using rpy2 from ipython to plot. What's the correct way to do this?
Nearly all commands I try fail. For example:
I'm trying to plot a scatter between two columns of a pandas DataFrame df. I'd like the labels of df to be used in x/y axis just like would be used if it were an R dataframe. Is there a way to do this? When I try to do it with r.plot, I get this gibberish plot:
In: r.plot(df.a, df.b) # df is pandas DataFrame
yields:
Out: rpy2.rinterface.NULL
resulting in the plot:
As you can see, the axes labels are messed up and it's not reading the axes labels from the DataFrame like it should (the X axis is column a of df and the Y axis is column b).
If I try to make a histogram with r.hist, it doesn't work at all, yielding the error:
In: r.hist(df.a)
Out:
...
vectors.pyc in <genexpr>((x,))
293 if l < 7:
294 s = '[' + \
--> 295 ', '.join((p_str(x, max_width = math.floor(52 / l)) for x in self[ : 8])) +\
296 ']'
297 else:
vectors.pyc in p_str(x, max_width)
287 res = x
288 else:
--> 289 res = "%s..." % (str(x[ : (max_width - 3)]))
290 return res
291
TypeError: slice indices must be integers or None or have an __index__ method
And resulting in this plot:
Any idea what the error means? And again here, the axes are all messed up and littered with gibberish data.
EDIT: This error occurs only when using ipython. When I run the command from a script, it still produces the problematic plot, but at least runs with no errors. It must be something wrong with calling these commands from ipython.
I also tried to convert the pandas DataFrame df to an R DataFrame as recommended by the poster below, but that fails too with this error:
com.convert_to_r_dataframe(mydf) # mydf is a pandas DataFrame
----> 1 com.convert_to_r_dataframe(mydf)
in convert_to_r_dataframe(df, strings_as_factors)
275 # FIXME: This doesn't handle MultiIndex
276
--> 277 for column in df:
278 value = df[column]
279 value_type = value.dtype.type
TypeError: iteration over non-sequence
How can I get these basic plotting features to work on Pandas DataFrame (with labels of plots read from the labels of the Pandas DataFrame), and also get the conversion between a Pandas DF to an R DF to work?
EDIT2: Here is a complete example of a csv file "test.txt" (http://pastebin.ca/2311928) and my code to answer @dale's comment:
import rpy2
from rpy2.robjects import r
import rpy2.robjects.numpy2ri
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
from rpy2.robjects.lib import grid
from rpy2.robjects.lib import ggplot2
rpy2.robjects.numpy2ri.activate()
from numpy import *
import scipy
# load up pandas df
import pandas
data = pandas.read_table("./test.txt")
# plotting a column fails
print "data.c2: ", data.c2
r.plot(data.c2)
# Conversion and then plotting also fails
r_df = com.convert_to_r_dataframe(data)
r.plot(r_df)
The call to plot the column of "data.c2" fails, even though data.c2 is a column of a pandas df and therefore for all intents and purposes should be a numpy array. I use the activate() call so I thought it would handle this column as a numpy array and plot it.
The second call to plot the dataframe data after conversion to an R dataframe also fails. Why is that? If I load up test.txt from R as a dataframe, I'm able to plot() it and since my dataframe was converted from pandas to R, it seems like it should work here too.
When I do try rmagic in ipython, it does not fire up a plot window for some reason, though it does not error. I.e. if I do:
In [12]: X = np.array([0,1,2,3,4])
In [13]: Y = np.array([3,5,4,6,7])
In [14]: import rpy2
In [15]: from rpy2.robjects import r
In [16]: import rpy2.robjects.numpy2ri
In [17]: import pandas.rpy.common as com
In [18]: from rpy2.robjects.packages import importr
In [19]: from rpy2.robjects.lib import grid
In [20]: from rpy2.robjects.lib import ggplot2
In [21]: rpy2.robjects.numpy2ri.activate()
In [22]: from numpy import *
In [23]: import scipy
In [24]: r.assign("x", X)
Out[24]:
<Array - Python:0x592ad88 / R:0x6110850>
[ 0, 1, 2, 3, 4]
In [25]: r.assign("y", Y)
<Array - Python:0x592f5f0 / R:0x61109b8>
[ 3, 5, 4, 6, 7]
In [27]: %R plot(x,y)
There's no error, but no plot window either. In any case, I'd like to stick to rpy2 and not rely on rmagic if possible.
Thanks.
[note: Your code in "edit 2" is working here (Python 2.7, rpy2-2.3.2, R-1.15.2).]
As @dale mentions, whenever R objects are anonymous (that is, no R symbol exists for the object), R's deparse(substitute()) ends up returning the structure() of the R object. A possible fix is to specify the "xlab" and "ylab" parameters; for some plots you will also have to specify main (the title).
Another way to work around that is to use R's formulas and feed in the data frame (more below, after we work out the conversion part).
Forget about what is in pandas.rpy. It is both broken and seems to ignore features available in rpy2.
An earlier quick fix to conversion with ipython can be turned into a proper conversion rather easily. I am considering adding one to the rpy2 codebase (with more bells and whistles), but in the meantime just add the following snippet after all your imports in your code examples. It will transparently convert pandas' DataFrame objects into rpy2's DataFrame whenever an R call is made.
from collections import OrderedDict
# names used below that the original snippet left undefined
import rpy2.robjects as ro
from rpy2.robjects.vectors import ListVector
py2ri_orig = rpy2.robjects.conversion.py2ri
def conversion_pydataframe(obj):
    if isinstance(obj, pandas.core.frame.DataFrame):
        od = OrderedDict()
        for name, values in obj.iteritems():
            if values.dtype.kind == 'O':
                od[name] = rpy2.robjects.vectors.StrVector(values)
            else:
                od[name] = rpy2.robjects.conversion.py2ri(values)
        return rpy2.robjects.vectors.DataFrame(od)
    elif isinstance(obj, pandas.core.series.Series):
        # converted as a numpy array
        res = py2ri_orig(obj)
        # "index" is equivalent to "names" in R
        if obj.ndim == 1:
            res.names = ListVector({'x': ro.conversion.py2ri(obj.index)})
        else:
            res.dimnames = ListVector(ro.conversion.py2ri(obj.index))
        return res
    else:
        return py2ri_orig(obj)
rpy2.robjects.conversion.py2ri = conversion_pydataframe
Now the following code will "just work":
r.plot(rpy2.robjects.Formula('c3~c2'), data)
# `data` was converted to an rpy2 data.frame on the fly
# and plotted as a scatter plot of c3 vs c2 (with "c2" and "c3" as the labels
# on the "x" axis and "y" axis).
I also note that you are importing ggplot2 without using it. Currently the conversion has to be requested explicitly. For example:
p = ggplot2.ggplot(rpy2.robjects.conversion.py2ri(data)) +\
    ggplot2.geom_histogram(ggplot2.aes_string(x = 'c3'))
p.plot()
You need to pass in the labels explicitly when calling the r.plot function.
r.plot([1,2,3],[1,2,3], xlab="X", ylab="Y")
When you plot in R, it grabs the labels via deparse(substitute(x)) which essentially grabs the variable name from the plot(testX, testY). When you're passing in python objects via rpy2, it's an anonymous R object and akin to the following in R:
> deparse(substitute(c(1,2,3)))
[1] "c(1, 2, 3)"
which is why you're getting the crazy labels.
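When the data live in a pandas DataFrame, the column names can supply those labels. A minimal sketch, assuming a hypothetical helper `axis_labels` (not part of rpy2 or pandas) that only builds the keyword arguments from two Series:

```python
import pandas as pd


def axis_labels(x, y):
    """Build xlab/ylab keyword arguments from the names of two pandas Series."""
    return {"xlab": str(x.name), "ylab": str(y.name)}


# Made-up data for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]})
labels = axis_labels(df["a"], df["b"])
print(labels)
# The dict could then be unpacked into the plotting call, e.g.
# r.plot(df["a"], df["b"], **labels)
```

This keeps the labels tied to the DataFrame rather than typed by hand, in the same spirit as the explicit `xlab`/`ylab` call above.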
A lot of times it's saner to use rpy2 to only push data back and forth.
r.assign('testX', df.A)
r.assign('testY', df.B)
%R plot(testX, testY)
rdf = com.convert_to_r_dataframe(df)
r.assign('bob', rdf)
%R plot(bob$$A, bob$$B)
http://nbviewer.ipython.org/4734581/
Use rpy. The conversion is part of pandas so you don't need to do it yourself.
http://pandas.pydata.org/pandas-docs/dev/r_interface.html
In [1217]: from pandas import DataFrame
In [1218]: df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[7,8,9]},
......: index=["one", "two", "three"])
......:
In [1219]: r_dataframe = com.convert_to_r_dataframe(df)
In [1220]: print type(r_dataframe)
<class 'rpy2.robjects.vectors.DataFrame'>
