I have a dataframe with 6 columns, the first two are an id and a name column, the remaining 4 are potential matches for the name column.
id name match1 match2 match3 match4
1 NXP Semiconductors NaN NaN NaN NaN
2 Cincinnati Children's Hospital Medical Center Montefiore Medical center Children's Hospital Los Angeles Cincinnati Children's Hospital Medical Center SSM Health SLU Hospital
3 Seminole Tribe of Florida The State Board of Administration of Florida NaN NaN NaN
4 Miami-Dade County County of Will County of Orange NaN NaN
5 University of California California Teacher's Association Yale University University of Toronto University System of Georgia
6 Bon Appetit Management Waste Management Sculptor Capital NaN NaN
I'd like to use SequenceMatcher to compare the name column with each match column if there is a value and return the match value with the highest ratio, or closest match, in a new column at the end of the dataframe.
So the output would be something like this:
id name match1 match2 match3 match4 best match
1 NXP Semiconductors NaN NaN NaN NaN NaN
2 Cincinnati Children's Hospital Medical Center Montefiore Medical center Children's Hospital Los Angeles Cincinnati Children's Hospital Medical Center SSM Health SLU Hospital Cincinnati Children's Hospital Medical Center
3 Seminole Tribe of Florida The State Board of Administration of Florida NaN NaN NaN The State Board of Administration of Florida
4 Miami-Dade County County of Will County of Orange NaN NaN County of Orange
5 University of California California Teacher's Association Yale University University of Toronto University System of Georgia California Teacher's Association
6 Bon Appetit Management Waste Management Sculptor Capital NaN NaN Waste Management
I've gotten the data into the dataframe and have been able to compare one column to a single other column using the apply method:
import difflib as diff  # SequenceMatcher comes from difflib; this alias is implied by the code below
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x[0].strip(), x[1].strip()).ratio(), axis=1)
However, I'm not sure how to loop over multiple columns in the same row. I also thought about reformatting my data so that the method above would work, something like this:
name match
name1 match1
name1 match2
name1 match3
However, I was running into issues dealing with the NaN values. Open to suggestions on the best route to accomplish this.
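For the row-wise route, here is a minimal sketch (my own suggestion, assuming SequenceMatcher comes from difflib and the match columns are named match1 through match4 as in the sample):
import difflib
import numpy as np
import pandas as pd

match_cols = ['match1', 'match2', 'match3', 'match4']

def best_match(row):
    # Keep only the candidates that are actually present in this row.
    candidates = [c for c in row[match_cols] if pd.notna(c)]
    if not candidates:
        return np.nan
    # Return the candidate whose SequenceMatcher ratio against the name is highest.
    return max(candidates,
               key=lambda c: difflib.SequenceMatcher(None, row['name'].strip(), c.strip()).ratio())

df['best match'] = df.apply(best_match, axis=1)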
I ended up solving this using the second idea of reformatting the table. Using the melt function I got a two-column table of the name field paired with each possible match. I then used the original lambda function to compare the two columns and output a ratio, after which it was relatively easy to go through and see the most likely matches, although it did require some manual effort.
import difflib as diff
import pandas as pd

df = pd.read_csv('output.csv')
df1 = df.melt(id_vars=['id', 'name'], var_name='match').dropna().drop(columns='match').sort_values('name')
df1['diff'] = df1.apply(lambda x: diff.SequenceMatcher(None, x['name'].strip(), x['value'].strip()).ratio(), axis=1)
df1.to_csv('comparison-output.csv', encoding='utf-8')
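If you want to avoid the manual pass at the end, one small follow-up (using the df1 and diff columns defined above) is to let idxmax pick the highest-ratio candidate per name:
# Highest-ratio candidate per name; names with no candidates at all
# were already dropped by dropna() above.
best = df1.loc[df1.groupby('name')['diff'].idxmax(), ['id', 'name', 'value']]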
I have a dataframe df1:
Plant name Brand Region Units produced capacity Cost incurred
Gujarat Plant Hyundai Asia 8500 9250 18500000
Haryana Plant Honda Asia 10000 10750 21500000
Chennai Plant Hyundai Asia 12000 12750 25500000
Zurich Plant Volkswagen Europe 25000 25750 77250000
Chennai Plant Suzuki Asia 6000 6750 13500000
Rengensburg BMW Europe 12500 13250 92750000
Dingolfing Mercedes Europe 14000 14750 103250000
I want an output dataframe with the following format:
df2= Region BMW Mercedes Volkswagen Toyota Suzuki Honda Hyundai
Europe
North America
Asia
Oceania
where the contents of each cell equals sum(cost incurred) / sum(units produced) for that specific Region and Brand.
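For example, the Asia/Hyundai cell would be (18500000 + 25500000) / (8500 + 12000) = 44000000 / 20500 ≈ 2146.34.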
Code I have tried, resulting in a ValueError:
for i, j in itertools.zip_longest(range(len(df2)), range(len(df2.columns))):
    if (df2.index[i] in list(df1["Region"]) & df2.columns[j] in list(df1["Brand"])) == True:
        temp1 = df1["Region"] == df2.index[i]
        temp2 = df1["Brand"] == df2.columns[j]
        df2.loc[df2.index[i], df2.columns[j]] = df1[temp1 & temp2]["Cost incurred"].sum() \
            / df1[temp1 & temp2]["Units produced"].sum()
    elif (df2.index[i] in list(df1["Region"]) & df2.columns[j] in list(df1["Brand"])) == False:
        df2.loc[df2.index[i], df2.columns[j]] = 0
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
df.pivot_table() is designed for exactly this pivot-and-aggregate job, and it avoids the row-wise boolean logic that raises the ValueError (that error appears whenever an element-wise comparison result is used where Python expects a single True/False). A quick(?) and dirty solution:
import numpy as np

df1.pivot_table(index="Region", columns="Brand", values="Cost incurred", aggfunc=np.sum) \
    / df1.pivot_table(index="Region", columns="Brand", values="Units produced", aggfunc=np.sum)
Output
Brand BMW Honda Hyundai Mercedes Suzuki Volkswagen
Region
Asia NaN 2150.0 2146.341463 NaN 2250.0 NaN
Europe 7420.0 NaN NaN 7375.0 NaN 3090.0
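Equivalently, a groupby variant of the same computation (my own rephrasing, not part of the answer above) produces the ratio in one pass; append .fillna(0) if you want zeros instead of NaN for Region/Brand pairs that never occur:
# Sum cost and units within each Region/Brand group, divide, and
# pivot Brand out to the columns.
ratio = (df1.groupby(['Region', 'Brand'])
            .apply(lambda g: g['Cost incurred'].sum() / g['Units produced'].sum())
            .unstack())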
I have two df's
df1
date League teams
0 201902272215 brazil cup foz do iguacu fcceara ce
1 201902272300 colombia primera a deportes tolimaatletico bucaramanga
2 201902272300 brazil campeonato gaucho 2nd division ypiranga rsuniao frederiquense
3 201902272300 brazil campeonato gaucho 2nd division esportivo rstupi rs
4 201902272300 brazil campeonato gaucho 2nd division sao paulo rsgremio esportivo bage
14 201902280000 four nations women tournament (in usa) usa (w)japan (w)
25 201902280030 bolivia professional football league real potosibolivar
df2
date league teams
0 201902280000 womens international usa womenjapan women
1 201902280000 brazil amazonense sul america ecrio negro am
2 201902280030 bolivia apertura real potosibolivar
3 201902280030 brazil campeonato paulista palmeirasituano
4 201902280030 copa sudamericana racing clubcorinthians
The result I want is all the rows from df2, with a near_match flag showing whether each one is a near match to some row in df1:
date league teams near_match
0 201902280000 womens international usa womenjapan women 1
1 201902280000 brazil amazonense sul america ecrio negro am 0
2 201902280030 bolivia apertura real potosibolivar 1
3 201902280030 brazil campeonato paulista palmeirasituano 0
4 201902280030 copa sudamericana racing clubcorinthians 0
I have tried to use a variation of a for loop using SequenceMatcher and setting the threshold to a match of above 0.8, but haven't had any luck.
import difflib

df_1['merge_teams'] = df_1['teams']  # we will use these as the merge keys
df_1['merge_date'] = df_1['date']
# df_1['merge_league'] = df_1['league']
for teams_1, date_1 in df_1[['teams', 'date']].values:
    for ixb, (teams_2, date_2) in enumerate(df_2[['teams', 'date']].values):
        if difflib.SequenceMatcher(None, teams_1, teams_2).ratio() > .8:
            df_2.loc[ixb, 'merge_teams'] = teams_1  # creates a merge key in df_2
        if difflib.SequenceMatcher(None, str(date_1), str(date_2)).ratio() > .8:
            df_2.loc[ixb, 'merge_date'] = date_1  # creates a merge key in df_2
# This should return all rows where teams and date both match by over 80%.
# This is just for teams and date; I want to include league as well.
Any advice or guidance would be greatly appreciated.
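For what it's worth, here is a minimal sketch of the brute-force scan (the 0.8 threshold and the near_match column come from the question; requiring date, league, and teams to all clear the same threshold is my assumption, and in practice you may need per-column thresholds to reproduce the sample flags exactly):
import difflib

def similar(a, b, threshold=0.8):
    # Stringify both sides so the numeric date column compares cleanly too.
    return difflib.SequenceMatcher(None, str(a), str(b)).ratio() > threshold

def near_match(row):
    # Flag a df2 row if any df1 row clears the threshold on all three columns.
    return int(any(similar(row['date'], d) and similar(row['league'], l) and similar(row['teams'], t)
                   for d, l, t in df1[['date', 'League', 'teams']].values))

df2['near_match'] = df2.apply(near_match, axis=1)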
I have the following dataframe:
Country Continent Population
--- ------- ------------- ------------
0 United States North America 329,451,665
1 Canada North America 37,602,103
2 Brazil South America 210,147,125
3 Argentina South America 43,847,430
I want to group by the continent, and get the name of the country with the highest population in that continent, so basically I want my result to look as follows:
Continent Country
---------- -------------
North America United States
South America Brazil
How can I do this?
Use idxmax to get the index of the row with the maximum Population in each Continent group. Population is stored as strings with thousands separators, so convert it to numeric first:
df['Population'] = pd.to_numeric(df['Population'].str.replace(',', ''))
idx = df.groupby('Continent')['Population'].idxmax()
df.loc[idx]
Result:
Country Continent Population
0 United States North America 329451665
2 Brazil South America 210147125
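If you want exactly the two-column Continent/Country layout from the question, select and reorder the columns:
df.loc[idx, ['Continent', 'Country']].reset_index(drop=True)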
I have a dataframe as below:
import pandas as pd
df = pd.DataFrame({'Country': ['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region': ['Americas','NaN','NaN','Asia','Europe','NaN','NaN'],
                   'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal': ['Bison','NaN','Golden Eagle','Tiger','Lion','Lion','NaN'],
                   'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
I want to group by Country and Flower and forward fill or backward fill the columns Region and Animal where there are missing values; however, the column Game should remain intact.
I have tried this, but it didn't work:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
also:
df.groupby(['Country','Flower'])['Animal', 'Region'].isna().bfill()
I want to know how to go about with this.
This works, but it removes the Game column:
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Animal', 'Region'].bfill().ffill()
And if I do a transform there is a mismatch in the length. Also, please note that this is a sample dataframe; I added "NaN" as a string here, but in the original frame the missing values are np.nan.
If you change your dataframe code to actually include np.nan, then the code you provided works. Although missing values print as the text NaN, you can't create them by typing that text into the constructor; that is interpreted as a string, not an actual missing value.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country': ['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region': ['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                   'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal': ['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                   'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
Then, this:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
actually yields this:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 NaN USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 NaN UK Dandelion cricket Europe
First, you need to know that the string 'NaN' is not NaN:
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Region'].ffill()
Out[109]:
0 Americas
1 Americas
2         NaN   # this group has only a single row, which is why it stays NaN
3 Asia
4 Europe
5 Europe
6 Europe
Name: Region, dtype: object
Second, if you need to chain two fill functions (bfill then ffill) within each group, you need apply:
df.update(df.groupby(['Country','Flower'])[['Animal', 'Region']].apply(lambda x: x.bfill().ffill()))
df
Out[119]:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 Bison USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 Lion UK Dandelion cricket Europe
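Note that df.update aligns on the index and column labels and only overwrites with non-missing values from the updating frame; Game stays intact simply because it isn't among the columns produced by the groupby.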
Since the MEX/Lily group has only a single row and its Region value is NaN, the fillna function cannot find an appropriate value within the group.
If we catch the exception raised during the group-mode fillna, the values with no usable group are left as they are; a subsequent ffill and bfill then covers the values that don't have an appropriate group:
df_stack = pd.DataFrame({'Country': ['USA','USA','MEX','IND','UK','UK','UK'],
                         'Region': ['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                         'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                         'Animal': ['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                         'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
print("-------Before imputation------")
print(df_stack)
def fillna_Region(grp):
    try:
        # Fill missing values in this group with the group's mode.
        return grp.fillna(grp.mode()[0])
    except BaseException as e:
        print('Error as no corresponding group: ' + str(e))

df_stack["Region"] = df_stack["Region"].fillna(
    df_stack.groupby(['Country', 'Flower'])['Region'].transform(lambda grp: fillna_Region(grp)))
df_stack["Animal"] = df_stack["Animal"].fillna(
    df_stack.groupby(['Country', 'Flower'])['Animal'].transform(lambda grp: fillna_Region(grp)))
# Cover the values that don't have a usable group.
df_stack = df_stack.ffill(axis=0)
df_stack = df_stack.bfill(axis=0)
print("-------After imputation------")
print(df_stack)
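A more compact variant of the same idea (my own condensation, not the answer's code): do the group-wise fill with transform, then one global ffill/bfill to cover groups that have no values at all:
cols = ['Region', 'Animal']
# Fill within each Country/Flower group first, then fall back to
# neighbouring rows for groups that are entirely missing.
df_stack[cols] = (df_stack.groupby(['Country', 'Flower'])[cols]
                  .transform(lambda s: s.ffill().bfill()))
df_stack[cols] = df_stack[cols].ffill().bfill()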