Load R data frame into Python and convert to Pandas data frame - python

I am trying to run the following code on an R data frame using Python.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import os
import pandas as pd
import timeit
from rpy2.robjects import r
from rpy2.robjects import pandas2ri
pandas2ri.activate()
start = timeit.default_timer()
def f(x):
    return fuzz.partial_ratio(str(x["sig1"]), str(x["sig2"]))

def fu_match(file):
    f1 = r.load(file)
    f1 = pandas2ri.ri2py(f1)
    f1["partial_ratio"] = f1.apply(f, axis=1)
    f1 = f1.loc[f1["partial_ratio"] > 90]
    f1.to_csv("test.csv")
    stop = timeit.default_timer()
    print stop - start

fu_match('test_full.RData')
Here is the error.
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
I guess the problem has to do with the conversion from the R data frame to the Pandas data frame. I know this question has been asked before, but I have tried all the solutions given to previous questions without success.
Please, any help will be much appreciated.
EDIT: Here is the head of .RData.
city sig1 sig2
1 19 claudiopillonrobertoscolari almeidabartolomeufrancisco
2 19 claudiopillonrobertoscolari cruzricardosantasergiosilva
3 19 claudiopillonrobertoscolari costajorgesilva
4 19 claudiopillonrobertoscolari costafrancisconaifesilva
5 19 claudiopillonrobertoscolari camarajoseluizreis
6 19 claudiopillonrobertoscolari almeidafilhojoaopimentel

This line
f1=pandas2ri.ri2py(f1)
is setting f1 to be a numpy.ndarray when I think you expect it to be a pandas.DataFrame.
You can cast the array into a DataFrame with something like
f1 = pd.DataFrame(data=f1)
but you won't have your column names defined (which you use in f(x)). What is the structure of test_full.RData? Do you want to manually define your column names? If so
f1 = pd.DataFrame(data=f1, columns=("my", "column", "names"))
should do the trick.
BUT I would suggest you look at using a more standard data format, maybe .csv. pandas has good support for this, and I expect R does too. Check out the docs.
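To illustrate the CSV route, here is a minimal, self-contained sketch. The sample frame and the file name test_full.csv are stand-ins based on the question's data, and the commented write.csv call is the assumed R-side export.

```python
import pandas as pd

# Hypothetical stand-in for the frame exported from R with, e.g.:
#   write.csv(test_full, "test_full.csv", row.names = FALSE)
sample = pd.DataFrame({"city": [19, 19],
                       "sig1": ["claudiopillonrobertoscolari"] * 2,
                       "sig2": ["almeidabartolomeufrancisco",
                                "costajorgesilva"]})
sample.to_csv("test_full.csv", index=False)

# The Python side then needs nothing but read_csv; the column names
# survive the round trip, so f1.apply(f, axis=1) works directly.
f1 = pd.read_csv("test_full.csv")
print(f1.columns.tolist())  # ['city', 'sig1', 'sig2']
```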

Related

Convert MatchIt summary object into a pandas dataframe with rpy2

I am using R's MatchIt package but calling it from Python via the rpy2 package.
On the R side, MatchIt gives me a complex result object including raw data and some additional statistics. One of them is a matrix that I want to turn into a data frame, which I can do in R like this
# R Code
m.out <- matchit(....)
m.sum <- summary(m.out)
# The following two lines should be somehow "translated" into
# Pythons rpy2
balance <- m.sum$sum.matched
balance <- as.data.frame(balance)
My problem is that I don't know how to implement the two last lines with Pythons rpy2 package. I am able to get m.out and m.sum with rpy2.
Please see this MWE:
#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset

if __name__ == '__main__':
    # import
    robjects.packages.importr('MatchIt')
    # data
    p_df = pydataset.data('respiratory')
    p_df.treat = p_df.treat.replace({'P': 0, 'A': 1})
    # Convert Pandas data into R data
    with robjects.conversion.localconverter(
            robjects.default_converter + pandas2ri.converter):
        r_df = robjects.conversion.py2rpy(p_df)
    # Call R's matchit with R data object
    match_out = robjects.r['matchit'](
        formula=robjects.Formula('treat ~ age + sex'),
        data=r_df,
        method='nearest',
        distance='glm')
    # matched data
    match_data = robjects.r['match.data'](match_out)
    # Convert R data into Pandas data
    with robjects.conversion.localconverter(
            robjects.default_converter + pandas2ri.converter):
        match_data = robjects.conversion.rpy2py(match_data)
    # summary object
    match_sum = robjects.r['summary'](match_out)
    # x = robjects.r('''
    #     balance <- match_sum$sum.matched
    #     balance <- as.data.frame(balance)
    #
    #     balance
    # ''')
When inspecting the python object match_sum I can't find anything like sum.matched in it. So I have to "translate" the match_sum$sum.matched somehow with rpy2. But I don't know how.
An alternative solution would be to run everything as R code with robjects.r(''' # r code ...'''). But in that case I don't know how to bring a Pandas data frame into that code.
EDIT: Be aware that the MWE presented here uses an outdated way of converting between R and Python objects. Please see the answer below for a better one.
Ah, it is always the same phenomenon: while formulating the question, the answer jumps right out at you.
My (maybe not the best) solution is:
Use real R code and run it with rpy2.robjects.r().
That R code needs to define an R function() so it can receive a data frame from the outside (the caller).
Besides that, and based on another answer, I also modified the conversion from R to Python data frames in that code.
#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset

if __name__ == '__main__':
    # For converting objects from/into Pandas <-> R
    # Credits: https://stackoverflow.com/a/20808449/4865723
    pandas2ri.activate()
    # import
    robjects.packages.importr('MatchIt')
    # data
    df = pydataset.data('respiratory')
    df.treat = df.treat.replace({'P': 0, 'A': 1})
    # match object
    match_out = robjects.r['matchit'](
        formula=robjects.Formula('treat ~ age + sex'),
        data=df,
        method='nearest',
        distance='glm')
    # matched data
    match_data = robjects.r['match.data'](match_out)
    match_data = robjects.conversion.rpy2py(match_data)
    # SOLUTION STARTS HERE:
    get_balance_dataframe = robjects.r('''f <- function(match_out) {
        as.data.frame(summary(match_out)$sum.matched)
    }''')
    balance = get_balance_dataframe(match_out)
    balance = robjects.conversion.rpy2py(balance)
    print(type(balance))
    print(balance)
Here is the output.
<class 'pandas.core.frame.DataFrame'>
Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
distance 0.514630 0.472067 0.471744 0.512239 0.077104 0.203704 0.507222
age 32.888889 34.129630 -0.089355 1.071246 0.063738 0.203704 0.721511
sexF 0.111111 0.259259 -0.471405 NaN 0.148148 0.148148 0.471405
sexM 0.888889 0.740741 0.471405 NaN 0.148148 0.148148 0.471405
EDIT: Make sure there are no umlauts or other Unicode-problematic characters in the cell values or in the column and row names when you do this on Windows; otherwise a UnicodeDecodeError sometimes occurs. I wasn't able to reproduce it reliably, so I don't have a proper bug report for it.

For loop using enumerate runs more than expected for a pandas Data Frame

So, I was working on the Titanic dataset to extract the title (Mr, Ms, Mrs) from the Name column of a data frame (df). It has 1309 rows.
for ind, name in enumerate(df['Name']):
    if type(name) == str:
        inf = name.find(', ') + 2
        df.loc[ind + 1, 'Title'] = name[inf:name.find('.')]
    else:
        print(name, ind)
This piece of code gives the following output:
nan 1309
It was supposed to stop at ind=1308, but it goes one step further even though nothing tells it to.
What could be the flaw here? Is it due to the fact that I am using 1 based indexing of the data frame?
If so, what could be done here to prevent such behaviour?
I am new to this platform, so please ask for clarifications in case of any discrepancies.
Here is a short example:
import numpy as np
import pandas as pd

dict1 = {'Name': ['Hey, Mr.', 'Hello, Ms.', 'Hi, Mrs,', 'Welcome, Master.', 'Yes, Mr.'],
         'ind': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data=dict1)
df.set_index('ind')
for ind, name in enumerate(df['Name']):
    if type(name) == str:
        inf = name.find(', ') + 2
        df.loc[ind + 1, 'Title'] = name[inf:name.find('.')]
    else:
        print(name, ind)
print(df['Title'])
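For what it's worth, here is a vectorised sketch that sidesteps the loop entirely; the regex is an assumption about the ", Title." pattern in the names, and the sample frame is made up.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Hey, Mr.', 'Hello, Ms.', 'Welcome, Master.', 'Yes, Mr.']})
# Extract the text between ", " and the next "." in one vectorised call;
# there are no row-by-row .loc writes, so the frame cannot grow mid-iteration.
df['Title'] = df['Name'].str.extract(r',\s*([^.]*)\.', expand=False)
print(df['Title'].tolist())  # ['Mr', 'Ms', 'Master', 'Mr']
```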

Pandas read_csv: Columns are being imported as rows

Edit: I believe this was all user error. I have been typing df.T by default, and it just occurred to me that this is the TRANSPOSE output. Typing df outputs the data frame normally (headers as columns). Thank you to those who stepped up to help. In the end, it was just my misunderstanding of pandas.
Original Post
I'm not sure if I am making a simple mistake, but the columns of a .csv file are being imported as rows by pd.read_csv. The resulting dataframe is 5 rows by 2000 columns. I am importing only 5 of the 14 columns, so I set up a list holding the names of the columns I want; they match those in the .csv file exactly. What am I doing wrong here?
import os
import numpy as np
import pandas as pd

fp = 'C:/Users/my/file/path'
os.chdir(fp)
cols_to_use = ['VCOMPNO_CURRENT', 'MEASUREMENT_DATETIME',
               'EQUIPMENT_NUMBER', 'AXLE', 'POSITION']
df = pd.read_csv('measurement_file.csv',
                 usecols=cols_to_use,
                 dtype={'EQUIPMENT_NUMBER': np.int,
                        'AXLE': np.int},
                 parse_dates=[2],
                 infer_datetime_format=True)
Output:
0 ... 2603
VCOMPNO_CURRENT T92656 ... T5M247
MEASUREMENT_DATETIME 7/26/2018 13:04 ... 9/21/2019 3:21
EQUIPMENT_NUMBER 208 ... 537
AXLE 1 ... 6
POSITION L ... R
[5 rows x 2000 columns]
Thank you.
Edit: To note, if I import the entire .csv with the standard pd.read_csv('measurement_file.csv'), the columns are imported properly.
Edit 2: Sample csv:
VCOMPNO_CURRENT,MEASUREMENT_DATETIME,REPAIR_ORDER_NUMBER,EQUIPMENT_NUMBER,AXLE,POSITION,FLANGE_THICKNESS,FLANGE_HEIGHT,FLANGE_SLOPE,DIAMETER,RO_NUMBER_SRC,CL,VCOMPNO_AT_MEAS,VCOMPNO_SRC
T92656,10/19/2018 7:11,5653054,208,1,L,26.59,27.34,6.52,691.3,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,1,R,26.78,27.25,6.64,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,L,26.6,27.13,6.49,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,R,26.61,27.45,6.75,691.6,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T7L672,10/19/2018 7:11,5653054,208,3,L,26.58,27.14,6.58,644.4,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
T7L672,10/19/2018 7:11,5653054,208,3,R,26.21,27.44,6.17,644.5,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
A simple workaround here is just to take the transpose of the dataframe.
Link to Pandas Documentation
df = pd.DataFrame.transpose(df)
Can you try it like this?
import pandas as pd
dataset = pd.read_csv('yourfile.csv')
#filterhere
dataset = dataset[cols_to_use]
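The symptom in the question can be reproduced in a few lines; this sketch (with made-up sample data) shows that the "5 rows x 2000 columns" display is exactly what df.T prints, while df itself has the expected shape.

```python
import pandas as pd
from io import StringIO

csv = '''VCOMPNO_CURRENT,EQUIPMENT_NUMBER,AXLE,POSITION
T92656,208,1,L
T92656,208,1,R'''
df = pd.read_csv(StringIO(csv))
print(df.shape)    # (2, 4): records as rows, fields as columns
print(df.T.shape)  # (4, 2): the transpose swaps them, as in the question
```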

Problems Sorting Data out of a text-file

I have a csv file imported into a dataframe and have trouble sorting the data.
df looks like this:
Data
0 <WindSpeed>0.69</WindSpeed>
1 <PowerOutput>0</PowerOutput>
2 <ThrustCoEfficient>0</ThrustCoEffici...
3 <RotorSpeed>8.17</RotorSpeed>
4 <ReactivePower>0</ReactivePower>
5 </DataPoint>
6 <DataPoint>
7 <WindSpeed>0.87</WindSpeed>
8 <PowerOutput>0</PowerOutput
I want it to look like this:
0 Windspeed Poweroutput
1 0.69 0.0
Here's the code that I wrote so far:
import pandas as pd
from pandas.compat import StringIO
import re
import numpy as np
df = pd.read_csv('powercurve.csv', encoding='utf-8', skiprows=42)
df.columns = ['Data']
no_of_rows = df.Data.str.count("WindSpeed").sum() / 2
rows = no_of_rows.astype(np.uint32)
TRBX = pd.DataFrame(index=range(0, abs(rows)),
                    columns=['WSpd[m/s]', 'Power[kW]'], dtype='float')
i = 0
for i in range(len(df)):
    if 'WindSpeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
    elif 'Rotorspeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
Is this a suitable approach? If yes, so far there are no values written into the TRBX dataframe. Where is my mistake?
The code below should help, assuming your df is indeed in the format you showed:
import re

split_func = lambda x: re.split('<|>', str(x))
split_series = df.Data.apply(split_func)
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
df = pd.DataFrame(data).set_index(features).T
You may want to drop some columns that have no data or input some N/A values afterwards. You also may want to rename the variables and series to different names that make more sense to you.
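Here is the same approach run end-to-end on a two-row sample (the sample frame and variable names are mine), to show the intermediate pieces:

```python
import re
import pandas as pd

df = pd.DataFrame({'Data': ['<WindSpeed>0.69</WindSpeed>',
                            '<PowerOutput>0</PowerOutput>']})
# re.split('<|>', '<WindSpeed>0.69</WindSpeed>') gives
# ['', 'WindSpeed', '0.69', '/WindSpeed', ''] -> tag at [1], value at [2]
split_series = df.Data.apply(lambda x: re.split('<|>', str(x)))
values = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
wide = pd.DataFrame(values).set_index(features).T
print(wide.columns.tolist())  # ['WindSpeed', 'PowerOutput']
```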

How to add a column to Pandas based off of other columns

I'm using Pandas and I have a very basic dataframe:
session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z
I wish to add a third column based on the values of the current columns:
df['key'] = urlsafe_b64encode(md5('l' + df['session_id'] + df['datetime']))
But I receive:
TypeError: must be convertible to a buffer, not Series
You need to use pandas.DataFrame.apply. The code below will apply the lambda function to each row of df. You could, of course, define a separate function if you need to do something more complicated.
import pandas as pd
from io import StringIO
from base64 import urlsafe_b64encode
from hashlib import md5
s = ''' session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z'''
df = pd.read_csv(StringIO(s), sep='\s+')
df['key'] = df.apply(lambda x: urlsafe_b64encode(md5('l' + x['session_id'] + x['datetime'])), axis=1)
Note: I couldn't get the hashing bit working on my machine, unfortunately; I hit a unicode error (might be because I'm using Python 3) and don't have time to debug its inner workings, but I'm pretty sure about the pandas part :P
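For the record, here is a Python 3 variant of the same idea that I believe avoids the unicode error: encode the string to bytes before hashing, take the digest, and decode the base64 result. The .digest() call matters because urlsafe_b64encode expects bytes, not a hash object; the 'l' prefix is taken from the question.

```python
import pandas as pd
from io import StringIO
from base64 import urlsafe_b64encode
from hashlib import md5

s = '''session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z'''
df = pd.read_csv(StringIO(s), sep=r'\s+')
df['key'] = df.apply(
    lambda x: urlsafe_b64encode(
        md5(('l' + x['session_id'] + x['datetime']).encode('utf-8')).digest()
    ).decode('ascii'),
    axis=1)
print(df['key'].str.len().unique())  # [24]: a 16-byte md5 digest base64-encodes to 24 chars
```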
