Convert MatchIt summary object into a pandas dataframe with pyr2 - python

I am using R's MatchIt package but calling it from Python via the pyr2 package.
On the R-side MatchIt gives me a complex result object including raw data and some additional statistic information. One of is a matrix I want to transform into a data set which I can do in R code like this
# R Code
m.out <- matchit(....)
m.sum <- summary(m.out)
# The following two lines should be somehow "translated" into
# Pythons rpy2
balance <- m.sum$sum.matched
balance <- as.data.frame(balance)
My problem is that I don't know how to implement the two last lines with Pythons rpy2 package. I am able to get m.out and m.sum with rpy2.
See this MWE please
#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset
if __name__ == '__main__':
# import
robjects.packages.importr('MatchIt')
# data
p_df = pydataset.data('respiratory')
p_df.treat = p_df.treat.replace({'P': 0, 'A': 1})
# Convert Panda data into R data
with robjects.conversion.localconverter(
robjects.default_converter + pandas2ri.converter):
r_df = robjects.conversion.py2rpy(p_df)
# Call R's matchit with R data object
match_out = robjects.r['matchit'](
formula=robjects.Formula('treat ~ age + sex'),
data=r_df,
method='nearest',
distance='glm')
# matched data
match_data = robjects.r['match.data'](match_out)
# Convert R data into Pandas data
with robjects.conversion.localconverter(
robjects.default_converter + pandas2ri.converter):
match_data = robjects.conversion.rpy2py(match_data)
# summary object
match_sum = robjects.r['summary'](match_out)
# x = robjects.r('''
# balance <- match_sum$sum.matched
# balance <- as.data.frame(balance)
#
# balance
# ''')
When inspecting the python object match_sum I can't find anything like sum.matched in it. So I have to "translate" the match_sum$sum.matched somehow with rpy2. But I don't know how.
An alternative solution would be to run everything as R code with robjects.r(''' # r code ...'''). But in that case I don't know how to bring a Pandas data frame into that code.
EDIT: Be aware that in the MWE presented here the conversion from R objects into Python objects and vis-à-vis an outdated solution is used. Please see the answer below for a better one.

Ah, it is always the same phenomena: While formulating the question the answers jump'n right into your face.
My (maybe not the best) solution is:
Use real R code and run it with rpy2.robjects.r().
That R code need to create an R function() to be able to receive a dataframe from the outside (the caller).
Beside that solution and based on another answer I also modified the conversion from R to Python data frames in that code.
#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset
if __name__ == '__main__':
# For converting objects from/into Pandas <-> R
# Credits: https://stackoverflow.com/a/20808449/4865723
pandas2ri.activate()
# import
robjects.packages.importr('MatchIt')
# data
df = pydataset.data('respiratory')
df.treat = df.treat.replace({'P': 0, 'A': 1})
# match object
match_out = robjects.r['matchit'](
formula=robjects.Formula('treat ~ age + sex'),
data=df,
method='nearest',
distance='glm')
# matched data
match_data = robjects.r['match.data'](match_out)
match_data = robjects.conversion.rpy2py(match_data)
# SOLUTION STARTS HERE:
get_balance_dataframe = robjects.r('''f <- function(match_out) {
as.data.frame(summary(match_out)$sum.matched)
}
''')
balance = get_balance_dataframe(match_out)
balance = robjects.conversion.rpy2py(balance)
print(type(balance))
print(balance)
Here is the output.
<class 'pandas.core.frame.DataFrame'>
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
distance 0.514630 0.472067 0.471744 0.512239 0.077104 0.203704 0.507222
age 32.888889 34.129630 -0.089355 1.071246 0.063738 0.203704 0.721511
sexF 0.111111 0.259259 -0.471405 NaN 0.148148 0.148148 0.471405
sexM 0.888889 0.740741 0.471405 NaN 0.148148 0.148148 0.471405
EDIT: Take that there are no umlauts or other unicode-problematic characters in the cell values or in the column and row names when you do this on Windows. From time to time then there comes a unicode decode error. I wasn't able to reproduce this stable so I have no fresh bug report about it.

Related

Robust 2-Way ANOVA in Python

I need to run robust ANOVA from Python. The function I want to use is t2way from R package WRS2. I tried with r2py, but I'm stuck with an error:
>>> import rpy2.robjects.packages as rpackages
>>> from rpy2.robjects import pandas2ri
>>> pandas2ri.activate()
>>> df = pd.read_csv("https://github.com/lawrence009/dsur/raw/master/data/goggles.csv")
>>> rdf = pandas2ri.py2rpy(df)
>>> WRS2 = rpackages.importr('WRS2')
>>> WRS2.t2way("attractiveness ~ gender*alcohol", data = rdf)
RRuntimeError: Error in x[[grp[i]]] :
attempt to select less than one element in get1index
I'm looking for either a way to make this work with rpy2, or (even better) a port of WRS2 to the python environment. Any help would be much appreciated.
If the issue is with columns in the dataframe that are not factors (as suggested in other answer), casting them into factors is quite easy:
rdf = pandas2ri.py2rpy(df)
base = importr('base')
import rpy2.robjects as ro
for cn in ('alcohol', 'gender'):
i = rdf.colnames.index(cn)
rdf[i] = base.as_factor(rdf[i])
# We could also do it with
# rdf[i] = ro.FactorVector(rdf[i])
To be on the safe side, it is recommended to create an R formula object. Some R functions will accept strings and assume that they are formula, but this is up to a package author and not always the case.
WRS2.t2way(ro.Formula('attractiveness ~ gender*alcohol'), data = rdf)
here is my particular solution for this problem. At the very beginnig the first problem in R is that when you import the data frame you have to change the type of the column alcohol and gender as.factor.
in R the script would be:
library(WRS2)
df <- read.csv2("https://github.com/lawrence009/dsur/raw/master/data/goggles.csv",header = TRUE, sep=',')
df[ , c('attractiveness')] <- as.numeric(df[ , c('attractiveness')])
df[ , c('alcohol')] <- as.factor(df[ , c('alcohol')])
df[ , c('gender')] <- as.factor(df[ , c('gender')])
t2way(attractiveness ~ gender*alcohol, data = df)
In python, although, I didn't find the way to change the data type of the column, but I came with this solution:
First you have to create an .R file named my_t2way.R that contains:
my_t2way <- function(df1){
library(WRS2)
df <- read.csv2(df1,header = TRUE, sep=',')
df[ , c('attractiveness')] <- as.numeric(df[ , c('attractiveness')])
df[ , c('alcohol')] <- as.factor(df[ , c('alcohol')])
df[ , c('gender')] <- as.factor(df[ , c('gender')])
f <- t2way(attractiveness ~ gender*alcohol, data = df)
df1 = data.frame(factor=c('gender','alcohol','gender:alcohol'),
value = c(f$Qa,f$Qb,f$Qab),
p.value = c(f$A.p.value,f$B.p.value,f$AB.p.value))
return(df1)
}
And then you can run the following commands from python
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri# Defining the R script and loading the instance in Python
pandas2ri.activate()
r = robjects.r
r['source']('my_t2way.R')# Loading the function we have defined in R.
my_t2way_r = robjects.globalenv['my_t2way']# Reading and processing data
df1 = "https://github.com/lawrence009/dsur/raw/master/data/goggles.csv"
df_result_r = my_t2way_r(df1)
Certainly this solution only works for this particular case, but I think that could be easily extensible to other dataframes.

HTS Prophet Holidays Issue

I am attempting to use the htsprophet package in Python. I am using the following example code below. This example is pulled from https://github.com/CollinRooney12/htsprophet/blob/master/htsprophet/runHTS.py . The issue I am getting is ValueError "holidays must be a DataFrame with 'ds' and 'holiday' column. I am wondering if there is a work around this because I clearly have a data frame holidays with the two columns ds and holidays. I believe that the error comes from one of the dependency packages from fbprophet from the forecaster file. I am wondering if there is anything that I need to add or if anyone has added something to fix this.
import pandas as pd
from htsprophet.hts import hts, orderHier, makeWeekly
from htsprophet.htsPlot import plotNode, plotChild, plotNodeComponents
import numpy as np
#%% Random data (Change this to whatever data you want)
date = pd.date_range("2015-04-02", "2017-07-17")
date = np.repeat(date, 10)
medium = ["Air", "Land", "Sea"]
businessMarket = ["Birmingham","Auburn","Evanston"]
platform = ["Stone Tablet","Car Phone"]
mediumDat = np.random.choice(medium, len(date))
busDat = np.random.choice(businessMarket, len(date))
platDat = np.random.choice(platform, len(date))
sessions = np.random.randint(1000,10000,size=(len(date),1))
data = pd.DataFrame(date, columns = ["day"])
data["medium"] = mediumDat
data["platform"] = platDat
data["businessMarket"] = busDat
data["sessions"] = sessions
#%% Run HTS
##
# Make the daily data weekly (optional)
##
data1 = makeWeekly(data)
##
# Put the data in the format to run HTS, and get the nodes input (a list of list that describes the hierarchical structure)
##
data2, nodes = orderHier(data, 1, 2, 3)
##
# load in prophet inputs (Running HTS runs prophet, so all inputs should be gathered beforehand)
# Made up holiday data
##
holidates = pd.date_range("12/25/2013","12/31/2017", freq = 'A')
holidays = pd.DataFrame(["Christmas"]*5, columns = ["holiday"])
holidays["ds"] = holidates
holidays["lower_window"] = [-4]*5
holidays["upper_window"] = [0]*5
##
# Run hts with the CVselect function (this decides which hierarchical aggregation method to use based on minimum mean Mean Absolute Scaled Error)
# h (which is 12 here) - how many steps ahead you would like to forecast. If youre using daily data you don't have to specify freq.
#
# NOTE: CVselect takes a while, so if you want results in minutes instead of half-hours pick a different method
##
myDict = hts(data2, 52, nodes, holidays = holidays, method = "FP", transform = "BoxCox")
##
The problem lies in the htsProphet package, with the 'fitForecast.py' file. The instantiation of the fbProphet object relies on just positional arguments, however a new argument as been added to the fbProphet class. This means the arguments don't correspond anymore.
You can solve this by hacking the fbProphet module and changing the positional arguments to keyword arguments, just fixing lines '73-74' should be sufficient to get it running:
Prophet(growth=growth, changepoints=changepoints1, n_changepoints=n_changepoints1, yearly_seasonality=yearly_seasonality, weekly_seasonality=weekly_seasonality, holidays=holidays, seasonality_prior_scale=seasonality_prior_scale, \
holidays_prior_scale=holidays_prior_scale, changepoint_prior_scale=changepoint_prior_scale, mcmc_samples=mcmc_samples, interval_width=interval_width, uncertainty_samples=uncertainty_samples)
Ill submit a bug for this to the creators.

how to select columns from R dataframe in rpy2 in python?

I have a dataframe in rpy2 in python and I want to pull out columns from it. What is the rpy2 equivalent of this R code?
df[,c("colA", "colC")]
this works to get the first column:
mydf.rx(1)
but how can I pull a set of columns, e.g. the 1st, 3rd and 5th?
mydf.rx([1,3,5])
does not work. neither does:
mydf.rx(rpy2.robjects.r.c([1,3,5]))
Alternatively, you can pass the R data frame into a Python pandas data frame and subset your resulting 1, 3, 5 columns:
#!/usr/bin/python
import rpy2
import rpy2.robjects as ro
import pandas as pd
import pandas.rpy.common as com
# SOURCE R SCRIPT INSIDE PYTHON
ro.r.source('C:\\Path\To\R script.R')
# DEFINE PYTHON DF AS R DF
pydf = com.load_data('rdf')
cols = pydf[[1,3,5]]
I think the answer is:
# cols to select
c = rpy2.robjects.IntVector((1,3))
# selection from df
mydf.rx(True, c)
The best possible way that I found is by doing this simple thing:
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
import rpy2.robjects as robjects
dataframe = robjects.r('data.frame')
df_rpy2 = dataframe([1,2,],[5,6])
df_pd = pd.DataFrame({'A': [1,2], 'B': [5,6]})
base = importr('base') #Creates an instance of R's base package
pandas2ri.activate() #Converts any pandas dataframe to R equivalent
base.colnames(df_pd) #Finds the column names of the dataframe df_pd
base.colnames(df_rpy2) #Finds the column names of the dataframe df_rpy2
The output is:
R object with classes: ('character',) mapped to:
<StrVector - Python:0x7fa3504d3048 / R:0x10f65ac0>
['X1L', 'X2L', 'X5L', 'X6L']
R object with classes: ('character',) mapped to:
<StrVector - Python:0x7fa352493548 / R:0x103b6e40>
['A', 'B']
This works for both the dataframes created using pandas & rpy2. Hope this helps!

Load R data frame into Python and convert to Pandas data frame

I am trying to run the following code in an R data frame using Python.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import os
import pandas as pd
import timeit
from rpy2.robjects import r
from rpy2.robjects import pandas2ri
pandas2ri.activate()
start = timeit.default_timer()
def f(x):
return fuzz.partial_ratio(str(x["sig1"]),str(x["sig2"]))
def fu_match(file):
f1=r.load(file)
f1=pandas2ri.ri2py(f1)
f1["partial_ratio"]=f1.apply(f, axis=1)
f1=f1.loc[f1["partial_ratio"]>90]
f1.to_csv("test.csv")
stop = timeit.default_timer()
print stop - start
fu_match('test_full.RData')
Here is the error.
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
I guess the problem has to do with the conversion from R to Pandas data frame. I know this is a repeated question, but I have tried all the solutions given to previous questions with no success.
Please, any help will be much appreciated.
EDIT: Here is the head of .RData.
city sig1 sig2
1 19 claudiopillonrobertoscolari almeidabartolomeufrancisco
2 19 claudiopillonrobertoscolari cruzricardosantasergiosilva
3 19 claudiopillonrobertoscolari costajorgesilva
4 19 claudiopillonrobertoscolari costafrancisconaifesilva
5 19 claudiopillonrobertoscolari camarajoseluizreis
6 19 claudiopillonrobertoscolari almeidafilhojoaopimentel
This line
f1=pandas2ri.ri2py(f1)
is setting f1 to be a numpy.ndarray when I think you expect it to be a pandas.DataFrame.
You can cast the array into a DataFrame with something like
f1 = pd.DataFrame(data=f1)
but you won't have your column names defined (which you use in f(x)). What is the structure of test_full.RData? Do you want to manually define your column names? If so
f1 = pd.DataFrame(data=f1, columns=("my", "column", "names"))
should do the trick.
BUT I would suggest you look at using a more standard data format, maybe .csv. pandas has good support for this, and I expect R does too. Check out the docs.

How to use pandas dataframes and numpy arrays in Rpy2?

I'd like to use pandas for all my analysis along with numpy but use Rpy2 for plotting my data. I want to do all analyses using pandas dataframes and then use full plotting of R via rpy2 to plot these. py2, and am using ipython to plot. What's the correct way to do this?
Nearly all commands I try fail. For example:
I'm trying to plot a scatter between two columns of a pandas DataFrame df. I'd like the labels of df to be used in x/y axis just like would be used if it were an R dataframe. Is there a way to do this? When I try to do it with r.plot, I get this gibberish plot:
In: r.plot(df.a, df.b) # df is pandas DataFrame
yields:
Out: rpy2.rinterface.NULL
resulting in the plot:
As you can see, the axes labels are messed up and it's not reading the axes labels from the DataFrame like it should (the X axis is column a of df and the Y axis is column b).
If I try to make a histogram with r.hist, it doesn't work at all, yielding the error:
In: r.hist(df.a)
Out:
...
vectors.pyc in <genexpr>((x,))
293 if l < 7:
294 s = '[' + \
--> 295 ', '.join((p_str(x, max_width = math.floor(52 / l)) for x in self[ : 8])) +\
296 ']'
297 else:
vectors.pyc in p_str(x, max_width)
287 res = x
288 else:
--> 289 res = "%s..." % (str(x[ : (max_width - 3)]))
290 return res
291
TypeError: slice indices must be integers or None or have an __index__ method
And resulting in this plot:
Any idea what the error means? And again here, the axes are all messed up and littered with gibberish data.
EDIT: This error occurs only when using ipython. When I run the command from a script, it still produces the problematic plot, but at least runs with no errors. It must be something wrong with calling these commands from ipython.
I also tried to convert the pandas DataFrame df to an R DataFrame as recommended by the poster below, but that fails too with this error:
com.convert_to_r_dataframe(mydf) # mydf is a pandas DataFrame
----> 1 com.convert_to_r_dataframe(mydf)
in convert_to_r_dataframe(df, strings_as_factors)
275 # FIXME: This doesn't handle MultiIndex
276
--> 277 for column in df:
278 value = df[column]
279 value_type = value.dtype.type
TypeError: iteration over non-sequence
How can I get these basic plotting features to work on Pandas DataFrame (with labels of plots read from the labels of the Pandas DataFrame), and also get the conversion between a Pandas DF to an R DF to work?
EDIT2: Here is a complete example of a csv file "test.txt" (http://pastebin.ca/2311928) and my code to answer #dale's comment:
import rpy2
from rpy2.robjects import r
import rpy2.robjects.numpy2ri
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
from rpy2.robjects.lib import grid
from rpy2.robjects.lib import ggplot2
rpy2.robjects.numpy2ri.activate()
from numpy import *
import scipy
# load up pandas df
import pandas
data = pandas.read_table("./test.txt")
# plotting a column fails
print "data.c2: ", data.c2
r.plot(data.c2)
# Conversion and then plotting also fails
r_df = com.convert_to_r_dataframe(data)
r.plot(r_df)
The call to plot the column of "data.c2" fails, even though data.c2 is a column of a pandas df and therefore for all intents and purposes should be a numpy array. I use the activate() call so I thought it would handle this column as a numpy array and plot it.
The second call to plot the dataframe data after conversion to an R dataframe also fails. Why is that? If I load up test.txt from R as a dataframe, I'm able to plot() it and since my dataframe was converted from pandas to R, it seems like it should work here too.
When I do try rmagic in ipython, it does not fire up a plot window for some reason, though it does not error. I.e. if I do:
In [12]: X = np.array([0,1,2,3,4])
In [13]: Y = np.array([3,5,4,6,7])
In [14]: import rpy2
In [15]: from rpy2.robjects import r
In [16]: import rpy2.robjects.numpy2ri
In [17]: import pandas.rpy.common as com
In [18]: from rpy2.robjects.packages import importr
In [19]: from rpy2.robjects.lib import grid
In [20]: from rpy2.robjects.lib import ggplot2
In [21]: rpy2.robjects.numpy2ri.activate()
In [22]: from numpy import *
In [23]: import scipy
In [24]: r.assign("x", X)
Out[24]:
<Array - Python:0x592ad88 / R:0x6110850>
[ 0, 1, 2, 3, 4]
In [25]: r.assign("y", Y)
<Array - Python:0x592f5f0 / R:0x61109b8>
[ 3, 5, 4, 6, 7]
In [27]: %R plot(x,y)
There's no error, but no plot window either. In any case, I'd like to stick to rpy2 and not rely on rmagic if possible.
Thanks.
[note: Your code in "edit 2" is working here (Python 2.7, rpy2-2.3.2, R-1.15.2).]
As #dale mentions it whenever R objects are anonymous (that is no R symbol exists for the object) the R deparse(substitute()) will end up returning the structure() of the R object, and a possible fix is to specify the "xlab" and "ylab" parameters; for some plots you'll have to also specify main (the title).
An other way to work around that is to use R's formulas and feed the data frame (more below, after we work out the conversion part).
Forget about what is in pandas.rpy. It is both broken and seem to ignore features available in rpy2.
An earlier quick fix to conversion with ipython can be turned into a proper conversion rather easily. I am considering adding one to the rpy2 codebase (with more bells and whistles), but in the meantime just add the following snippet after all your imports in your code examples. It will transparently convert pandas' DataFrame objects into rpy2's DataFrame whenever an R call is made.
from collections import OrderedDict
py2ri_orig = rpy2.robjects.conversion.py2ri
def conversion_pydataframe(obj):
if isinstance(obj, pandas.core.frame.DataFrame):
od = OrderedDict()
for name, values in obj.iteritems():
if values.dtype.kind == 'O':
od[name] = rpy2.robjects.vectors.StrVector(values)
else:
od[name] = rpy2.robjects.conversion.py2ri(values)
return rpy2.robjects.vectors.DataFrame(od)
elif isinstance(obj, pandas.core.series.Series):
# converted as a numpy array
res = py2ri_orig(obj)
# "index" is equivalent to "names" in R
if obj.ndim == 1:
res.names = ListVector({'x': ro.conversion.py2ri(obj.index)})
else:
res.dimnames = ListVector(ro.conversion.py2ri(obj.index))
return res
else:
return py2ri_orig(obj)
rpy2.robjects.conversion.py2ri = conversion_pydataframe
Now the following code will "just work":
r.plot(rpy2.robjects.Formula('c3~c2'), data)
# `data` was converted to an rpy2 data.frame on the fly
# and the a scatter plot c3 vs c2 (with "c2" and "c3" the labels on
# the "x" axis and "y" axis).
I also note that you are importing ggplot2, without using it. Currently the conversion
will have to be explicitly requested. For example:
p = ggplot2.ggplot(rpy2.robjects.conversion.py2ri(data)) +\
ggplot2.geom_histogram(ggplot2.aes_string(x = 'c3'))
p.plot()
You need to pass in the labels explicitly when calling the r.plot function.
r.plot([1,2,3],[1,2,3], xlab="X", ylab="Y")
When you plot in R, it grabs the labels via deparse(substitute(x)) which essentially grabs the variable name from the plot(testX, testY). When you're passing in python objects via rpy2, it's an anonymous R object and akin to the following in R:
> deparse(substitute(c(1,2,3)))
[1] "c(1, 2, 3)"
which is why you're getting the crazy labels.
A lot of times it's saner to use rpy2 to only push data back and forth.
r.assign('testX', df.A)
r.assign('testY', df.B)
%R plot(testX, testY)
rdf = com.convert_to_r_dataframe(df)
r.assign('bob', rdf)
%R plot(bob$$A, bob$$B)
http://nbviewer.ipython.org/4734581/
use rpy. the conversion is part of pandas so you don't need to do it yoursef
http://pandas.pydata.org/pandas-docs/dev/r_interface.html
In [1217]: from pandas import DataFrame
In [1218]: df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[7,8,9]},
......: index=["one", "two", "three"])
......:
In [1219]: r_dataframe = com.convert_to_r_dataframe(df)
In [1220]: print type(r_dataframe)
<class 'rpy2.robjects.vectors.DataFrame'>

Categories

Resources