Matching values of column to another data frame column and replacing values - python

I have two data frames. (These are easy examples, my real data has nearly 3,000 rows)
>df
player position nation Mins
Messi FW ARG 3302
Ronaldo FW POR 3029
Van Dijk DF NED 500
Mane FW SEN 3088
Alena MF SPA 1592
>df2
player position
Alena CM
Ronaldo ST
Mane LW
Van Dijk CB
Messi ST
What I'm trying to do is replace the position data in df with the position data from df2, matching on the player column.
I've tried sorting the values of the player column in both data frames and then just creating a new position column with df['pos2']=df2['position'],
but it ends up slightly wrong in some areas of the resulting column.
That is why I'm looking to do it by matching on a column.

Merge your dataframes on the player column (df is the first data frame from the question):
>>> df.drop(columns='position').merge(df2, on='player')
player nation Mins position
0 Messi ARG 3302 ST
1 Ronaldo POR 3029 ST
2 Van Dijk NED 500 CB
3 Mane SEN 3088 LW
4 Alena SPA 1592 CM
If you want to keep the old positions as well:
>>> df.merge(df2, on='player', suffixes=('_old', '_new'))
player position_old nation Mins position_new
0 Messi FW ARG 3302 ST
1 Ronaldo FW POR 3029 ST
2 Van Dijk DF NED 500 CB
3 Mane FW SEN 3088 LW
4 Alena MF SPA 1592 CM
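For a version that can be run as-is, here is a minimal sketch that rebuilds the two sample frames from the question. The how="left" argument is an addition that is not in the answer above; it is assumed here so that any player missing from df2 would be kept (with a missing position) rather than dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Ronaldo", "Van Dijk", "Mane", "Alena"],
    "position": ["FW", "FW", "DF", "FW", "MF"],
    "nation": ["ARG", "POR", "NED", "SEN", "SPA"],
    "Mins": [3302, 3029, 500, 3088, 1592],
})
df2 = pd.DataFrame({
    "player": ["Alena", "Ronaldo", "Mane", "Van Dijk", "Messi"],
    "position": ["CM", "ST", "LW", "CB", "ST"],
})

# Drop the old position column, then pull in the new one by matching on player.
# how="left" (an assumption) keeps every row of df, in df's original order.
result = df.drop(columns="position").merge(df2, on="player", how="left")
print(result)
```

Because the match is done on the player value and not on row order, it does not matter that df2 is sorted differently from df.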

Related

Python: add columns to dataframe from another with matching "vlookup"

I'd like to add two columns to an existing dataframe from another dataframe based on a lookup in the name column.
Dataframe to update looks like this:
Player           School           Conf  Cmp  Att
Danny Wuerffel   Florida          1     708  1170
Steve Sarkisian  Brigham Young    0     528  789
Billy Blanton    San Diego State  0     588  920
And I'd like to take the height and weight from this dataframe (actually a json file) and add it based on matching Player names:
Name             School           Conf  Height  Weight  Pct   Yds
Danny Wuerffel   Florida          1     6-2     217     60.5  10875
Steve Sarkisian  Brigham Young    0     6-3     230     66.9  7464
Billy Blanton    San Diego State  0     6-0     222     63.9  8165
Codewise I tried something like this so far:
existing_dataframe['Height'] = pd.Series(height_weight_df['Height'])
But I'm missing the part matching them on the name because the DFs aren't in the same order
Let us try
existing_dataframe = existing_dataframe.merge(height_weight_df[['Name','School','Height','Weight']],left_on=['Player','School'],right_on=['Name','School'],how='left')
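As a runnable sketch of the merge above, the snippet below rebuilds both tables from the question (only the columns that matter here, with the second table deliberately in a different row order). Dropping the Name column afterwards is an extra step that is not part of the answer; it is assumed to be wanted because Name duplicates Player once the rows are matched.

```python
import pandas as pd

existing_dataframe = pd.DataFrame({
    "Player": ["Danny Wuerffel", "Steve Sarkisian", "Billy Blanton"],
    "School": ["Florida", "Brigham Young", "San Diego State"],
    "Conf": [1, 0, 0],
    "Cmp": [708, 528, 588],
    "Att": [1170, 789, 920],
})
# Same players, but in a different row order than existing_dataframe.
height_weight_df = pd.DataFrame({
    "Name": ["Billy Blanton", "Danny Wuerffel", "Steve Sarkisian"],
    "School": ["San Diego State", "Florida", "Brigham Young"],
    "Height": ["6-0", "6-2", "6-3"],
    "Weight": [222, 217, 230],
})

merged = existing_dataframe.merge(
    height_weight_df[["Name", "School", "Height", "Weight"]],
    left_on=["Player", "School"],
    right_on=["Name", "School"],
    how="left",  # keep every row of existing_dataframe, matched or not
).drop(columns="Name")  # Name repeats Player after the merge
print(merged)
```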

Folium FeatureGroup in Python

I am trying to create maps using Folium feature groups. Each feature group should come from the rows of a pandas dataframe. I am able to achieve this when there is a single entry per group, but when there is more than one row in the dataframe and I loop through them in a for loop, I am not able to achieve what I want. Please find the Python code below.
from folium import Map, FeatureGroup, Marker, LayerControl
mapa = Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,
tiles='Stamen Terrain')
feature_group1 = FeatureGroup(name='Tim')
feature_group2 = FeatureGroup(name='Andrew')
feature_group1.add_child(Marker([35.035075, -89.89969], popup='Tim'))
feature_group2.add_child(Marker([35.821835, -90.70503], popup='Andrew'))
mapa.add_child(feature_group1)
mapa.add_child(feature_group2)
mapa.add_child(LayerControl())
mapa
My dataframe contains the following:
Name Address
0 Dollar Tree #2020 3878 Goodman Rd.
1 Dollar Tree #2020 3878 Goodman Rd.
2 National Guard Products Inc 4985 E Raines Rd
3 434 SAVE A LOT C MID WEST 434 Kelvin 3240 Jackson Ave
4 WALGREENS 06765 108 E HIGHLAND DR
5 Aldi #69 4720 SUMMER AVENUE
6 Richmond, Christopher 1203 Chamberlain Drive
City State Zipcode Group
0 Horn Lake MS 38637 Johnathan Shaw
1 Horn Lake MS 38637 Tony Bonetti
2 Memphis TN 38118 Tony Bonetti
3 Memphis TN 38122 Tony Bonetti
4 JONESBORO AR 72401 Josh Jennings
5 Memphis TN 38122 Josh Jennings
6 Memphis TN 38119 Josh Jennings
full_address Color sequence \
0 3878 Goodman Rd.,Horn Lake,MS,38637,USA blue 1
1 3878 Goodman Rd.,Horn Lake,MS,38637,USA cadetblue 1
2 4985 E Raines Rd,Memphis,TN,38118,USA cadetblue 2
3 3240 Jackson Ave,Memphis,TN,38122,USA cadetblue 3
4 108 E HIGHLAND DR,JONESBORO,AR,72401,USA yellow 1
5 4720 SUMMER AVENUE,Memphis,TN,38122,USA yellow 2
6 1203 Chamberlain Drive,Memphis,TN,38119,USA yellow 3
Latitude Longitude
0 34.962637 -90.069019
1 34.962637 -90.069019
2 35.035367 -89.898428
3 35.165115 -89.952624
4 35.821835 -90.705030
5 35.148707 -89.903760
6 35.098829 -89.866838
When I try to do the same by looping through the dataframe in a for loop, I am not able to achieve what I need:
from folium import Map, FeatureGroup, Marker, LayerControl, plugins
mapa = Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,tiles='Stamen Terrain')
#mapa.add_tile_layer()
for i in range(0,len(df_addresses)):
    feature_group = FeatureGroup(name=df_addresses.iloc[i]['Group'])
    feature_group.add_child(Marker([df_addresses.iloc[i]['Latitude'], df_addresses.iloc[i]['Longitude']],
                                   popup=('Address: ' + str(df_addresses.iloc[i]['full_address']) + '<br>'
                                          'Tech: ' + str(df_addresses.iloc[i]['Group'])),
                                   icon = plugins.BeautifyIcon(
                                       number= str(df_addresses.iloc[i]['sequence']),
                                       border_width=2,
                                       iconShape= 'marker',
                                       inner_icon_style= 'margin-top:2px',
                                       background_color = df_addresses.iloc[i]['Color'],
                                   )))
    mapa.add_child(feature_group)
mapa.add_child(LayerControl())
This uses an example dataset because I didn't want to reformat your df. That said, I think you'll get the idea.
print(df_addresses)
Latitude Longitude Group
0 34.962637 -90.069019 B
1 34.962637 -90.069019 B
2 35.035367 -89.898428 A
3 35.165115 -89.952624 B
4 35.821835 -90.705030 A
5 35.148707 -89.903760 A
6 35.098829 -89.866838 A
After I create the map object (mapa), I perform a groupby on the Group column and then iterate through each group. I first create a FeatureGroup named after the group (A or B). Then, for each group, I iterate through that group's dataframe, create Markers, and add them to the FeatureGroup.
import folium
mapa = folium.Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,
                  tiles='Stamen Terrain')
for grp_name, df_grp in df_addresses.groupby('Group'):
    # one FeatureGroup (one layer) per group, not per row
    feature_group = folium.FeatureGroup(grp_name)
    for row in df_grp.itertuples():
        folium.Marker(location=[row.Latitude, row.Longitude]).add_to(feature_group)
    feature_group.add_to(mapa)
folium.LayerControl().add_to(mapa)
mapa
Regarding the Stamen Terrain question: if you're referring to its entry in the layer control box, you can remove it by declaring your map with tiles=None and adding the TileLayer separately with control set to False: folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)
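To see why grouping first gives one layer per group instead of one per row, the following sketch runs the same groupby loop without folium and simply collects the coordinates that would end up in each FeatureGroup. The layers dictionary is a stand-in introduced for illustration only; it is not part of folium.

```python
import pandas as pd

df_addresses = pd.DataFrame({
    "Latitude": [34.962637, 34.962637, 35.035367, 35.165115,
                 35.821835, 35.148707, 35.098829],
    "Longitude": [-90.069019, -90.069019, -89.898428, -89.952624,
                  -90.705030, -89.903760, -89.866838],
    "Group": ["B", "B", "A", "B", "A", "A", "A"],
})

# Hypothetical stand-in for the FeatureGroups: group name -> list of points.
layers = {}
for grp_name, df_grp in df_addresses.groupby("Group"):
    layers[grp_name] = [(row.Latitude, row.Longitude) for row in df_grp.itertuples()]

for name, points in layers.items():
    print(name, len(points))
```

Seven rows collapse into two entries, which is what the layer control should list. The row-by-row loop in the question creates seven separate FeatureGroups instead.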

Exact visual matches between common columns in dataframes not matching on merge

I am trying to merge two dataframes on a common column, "long_name". But the merge is not happening for some names, even for what look like visually exact matches (i.e. "Lionel Andrés Messi Cuccittini" (df) to "Lionel Andrés Messi Cuccittini" (df1)), when I merge on "long_name":
df_merged = df.merge(df1, on="long_name", indicator=True, how='right')
Lionel Messi is left out and, according to the indicator column, he's a "right_only" row from the merge. What's odd is that "Neymar da Silva Santos Júnior" IS merging. Why is there a discrepancy between the rows? Both have been sourced consistently, df from kaggle and df1 from scraping, using the same script for all row name extractions.
I tried to isolate both the Lionel Messi entries from df and df1 using the following code:
name1 = df.loc[df.short_name == 'L. Messi', ["long_name"]]
name2 = df1.loc[df1.name == 'Lionel Messi', ["long_name"]]
name1.values == name2.values
But the result is array([[False]]). I'm not sure why they're not matching.
The first df looks like this (first 8 lines, df = df.loc[0:7,["short_name", "long_name"]]):
short_name long_name
0 L. Messi Lionel Andrés Messi Cuccittini
1 Cristiano Ronaldo Cristiano Ronaldo dos Santos Aveiro
2 Neymar Jr Neymar da Silva Santos Junior
3 J. Oblak Jan Oblak
4 E. Hazard Eden Hazard
5 K. De Bruyne Kevin De Bruyne
6 M. ter Stegen Marc-André ter Stegen
7 V. van Dijk Virgil van Dijk
The second df looks like this (first 8 lines, df1 = df1.loc[0:7,["name", "long_name"]]):
name long_name
0 Kylian Mbappé Kylian Sanmi Mbappé Lottin
1 Neymar Neymar da Silva Santos Júnior
2 Mohamed Salah محمد صلاح
3 Harry Kane Harry Edward Kane
4 Eden Hazard Eden Michael Hazard
5 Lionel Messi Lionel Andrés Messi Cuccitini
6 Raheem Sterling Raheem Shaquille Sterling
7 Antoine Griezmann Antoine Griezmann
Are you sure it is not just a case of a misspelled name?
df lists the long_name as Lionel Andrés Messi Cuccittini, whereas df1 lists it as Lionel Andrés Messi Cuccitini. Note that df has two t's in Cuccittini but df1 has only one.
Manually correct the second dataframe and retry.
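When two strings look identical but compare unequal, it can help to let Python point at the difference instead of spotting it by eye. The snippet below is a small sketch using the standard library's difflib.ndiff on the two spellings quoted above; the variable names are made up for the example.

```python
import difflib

name_df = "Lionel Andrés Messi Cuccittini"   # spelling in df
name_df1 = "Lionel Andrés Messi Cuccitini"   # spelling in df1

print(name_df == name_df1)
print(len(name_df), len(name_df1))

# ndiff compares the strings character by character; entries starting with
# '-' exist only in the first string, entries starting with '+' only in the second.
diff = [d for d in difflib.ndiff(name_df, name_df1) if d[0] != " "]
print(diff)
```

The same check also exposes less visible causes of failed merges, such as trailing spaces or accented characters that differ between the two sources.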

How can I create an artificial key column for merging two datasets using difflib when the column of interest has missing cells?

Goal: If the name in row i of df2 is a substring or an exact match of a name in some row N of df1, and the state and district columns of row N in df1 match the respective state and district columns of row i in df2, combine the rows.
It was recommended that I use difflib to create an artificial key column to merge on.
This new column is called 'Name'. difflib.get_close_matches looks for similar strings in df2.
This works well when all rows in the 'CandidateName' column are present, but I get IndexError: list index out of range when a cell is missing.
I tried resolving this issue by filling in the empty cells with the string 'EMPTY'. However, the same error still occurs.
# I used this method to replace empty cells
df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')
# I then proceeded to run the line again
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
# Data Frame Samples
# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print(df1)
# CandidateName District Party State
#0 Theodorick A. Bland 9 VA
#1 Aedanus Rutherford Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
#4 Theodorick Bland 9 VA
#5 Aedanus Burke 2 SC
#6 Jason Initial Lewis 2 Democrat MN
#7 '' 1 Whig NH
#8 '' 1 Whig NH
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print(df2)
# Name District Party State
#0 Theodorick Bland 9 VA
#1 Aedanus Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
Expected
print(df1)
# CandidateName State District Party Name
#0 Theodorick A. Bland VA 9 Theodorick Bland
#1 Aedanus Rutherford Burke SC 2 Aedanus Burke
#2 Jason Lewis MN 2 Jason Lewis
#3 Barbara Comstock VA 10 Democrat Barbara Comstock
#4 Theodorick Bland VA 9 Theodorick Bland
#5 Aedanus Burke SC 2 Aedanus Burke
#6 Jason Initial Lewis MN 2 Democrat Jason Lewis
#7 NH 1 Whig
#8 NH 1 Whig
Actual Error Result:
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
---> 23 df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
IndexError: list index out of range
difflib.get_close_matches returns a list, and when nothing is similar enough (as with the empty strings) that list is empty, so it has no index 0. That's why you get this error. Instead, we can convert each list to a string, which also lets us do the merge, as follows:
Note: you don't have to use df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')
import difflib
# n=1 keeps only the best match, so join never glues several names together
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'], n=1)))
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')
print(df_merge)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Aedanus Rutherford Burke SC 2 Aedanus Burke
2 Jason Lewis MN 2 Jason Lewis
3 Barbara Comstock VA 10 Democrat Barbara Comstock
4 Theodorick Bland VA 9 Theodorick Bland
5 Aedanus Burke SC 2 Aedanus Burke
6 Jason Initial Lewis MN 2 Democrat Jason Lewis
7 NH 1 Whig
8 NH 1 Whig
Note I added how='left' argument to our merge since you want to keep the shape of your original dataframe.
Explanation of ''.join()
We do this to convert the list to a string; an empty list simply becomes the empty string ''. See this example, which joins with a space instead:
lst = ['hello', 'world']
print(' '.join(lst))
hello world
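An alternative to joining is to test the returned list before indexing it. The helper below is a minimal sketch of that idea; best_match is a hypothetical name introduced here, and returning '' for "no match" is an assumption chosen so the result can still be used as a merge key.

```python
import difflib

names = ["Theodorick Bland", "Aedanus Burke", "Jason Lewis", "Barbara Comstock"]

def best_match(name, candidates):
    """Return the closest candidate, or '' when nothing is close enough."""
    matches = difflib.get_close_matches(name, candidates, n=1)
    # get_close_matches returns [] when no candidate clears the cutoff,
    # so only index into the list when it is non-empty.
    return matches[0] if matches else ""

print(repr(best_match("Theodorick A. Bland", names)))
print(repr(best_match("", names)))
```

With the data frames from the question this would be applied as df1['CandidateName'].apply(lambda x: best_match(x, df2['Name'])).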

Appending or Concatenating DataFrame via for loop to existing DataFrame

As the output below shows, this code takes the Location column (a Series) and places it in a dataframe. The nested for loop then takes the value at one index of each column and creates a dataframe of repeated data to add to the first dataframe. What I have been trying to do is loop through, going up one index on each iteration, and add a new dataframe of repetitive data each time. However, when I print the result, the dataframe only contains the first dataframe and the last repetitive dataframe that the loop produced. What I'm trying to build is one large dataframe with a repetitive dataframe attached for each index from 0 to 17. I have updated this to show the repetitiveness that I am looking for, in truncated form. I hope this helps. Thanks!
Here is the input
for j in range(0,18,1):
    for i in range(0,18,1):
        df['Rep Loc'] = str(df['Location'][j:j+1])
        df['Rep Lat'] = float(df['Latitude'][j:j+1])
        df['Rep Long'] = float(df['Longitude'][j:j+1])
        break
    print(df)
Here is the output
Location Latitude Longitude
0 Letsholathebe II Rd, Maun, North-West District... -19.989491 23.397709
1 North-West District, Botswana -19.389353 23.267951
2 Silobela, Kwekwe, Midlands Province, Zimbabwe -18.993930 29.147992
3 Mosi-Oa-Tunya, Livingstone, Southern Province,... -17.910147 25.861904
4 Parkway Drive, Victoria Falls, Matabeleland No... -17.909231 25.827019
5 A33, Kasane, North-West District, Botswana -17.795057 25.197270
6 T1, Southern Province, Zambia -17.040664 26.608454
7 Sikoongo Road, Siavonga, Southern Province, Za... -16.536204 28.708753
8 New Kasama, Lusaka Province, Zambia -15.471934 28.398588
9 Simon Mwansa Kapwepwe Avenue, Avondale, Lusaka... -15.386244 28.397111
10 Lusaka, Lusaka Province, 1010, Zambia -15.416697 28.281381
11 Chigwirizano Road, Rhodes Park, Lusaka, Lusaka... -15.401848 28.302248
12 T2, Kabwe, Central Province, Zambia -14.420744 28.462169
13 Kabushi Road, Ndola, Copperbelt Province, Zambia -12.997968 28.608536
14 Dr Aggrey Avenue, Mishenshi, Kitwe, Copperbelt... -12.797684 28.199061
15 President Avenue, Kalulushi, Copperbelt Provin... -12.833375 28.108370
16 Eglise Methodiste Unie, Avenue Mantola, Mawawa... -11.699407 27.500234
17 Avenue Babemba, Kolwezi, Lwalaba, Katanga, Lua... -10.698109 25.503816
Rep Loc Rep Lat Rep Long
0 0 Letsholathebe II Rd, Maun, North-West Dis... -19.989491 23.397709
1 0 Letsholathebe II Rd, Maun, North-West Dis... -19.989491 23.397709
2 0 Letsholathebe II Rd, Maun, North-West Dis... -19.989491 23.397709
Rep Loc Rep Lat Rep Long
0 1 North-West District, Botswana\nName: Loca... -19.389353 23.267951
1 1 North-West District, Botswana\nName: Loca... -19.389353 23.267951
2 1 North-West District, Botswana\nName: Loca... -19.389353 23.267951
Rep Loc Rep Lat Rep Long
0 2 Silobela, Kwekwe, Midlands Province, Zimb... -18.99393 29.147992
1 2 Silobela, Kwekwe, Midlands Province, Zimb... -18.99393 29.147992
Rep Loc Rep Lat Rep Long
0 3 Mosi-Oa-Tunya, Livingstone, Southern Prov... -17.910147 25.861904
1 3 Mosi-Oa-Tunya, Livingstone, Southern Prov... -17.910147 25.861904
2 3 Mosi-Oa-Tunya, Livingstone, Southern Prov... -17.910147 25.861904
Rep Loc Rep Lat Rep Long
0 4 Parkway Drive, Victoria Falls, Matabelela... -17.909231 25.827019
1 4 Parkway Drive, Victoria Falls, Matabelela... -17.909231 25.827019
2 4 Parkway Drive, Victoria Falls, Matabelela... -17.909231 25.827019
Rep Loc Rep Lat Rep Long
0 5 A33, Kasane, North-West District, Botswan... -17.795057 25.19727
1 5 A33, Kasane, North-West District, Botswan... -17.795057 25.19727
2 5 A33, Kasane, North-West District, Botswan... -17.795057 25.19727
Good practice when asking questions is to provide an example of what you want your output to look like. However, this is my best guess at what you want.
pd.concat({i: df.shift(i) for i in range(18)}, axis=1)
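To make the effect of that one-liner easier to see, here is a small sketch on a three-row frame with only two shifts. The shortened location names are sample data made up for this example, and it assumes the frame being shifted is the df from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["Maun", "Kasane", "Lusaka"],
    "Latitude": [-19.989491, -17.795057, -15.416697],
})

# One shifted copy of df per dictionary key. axis=1 places the copies side by
# side under a two-level column index: (shift amount, original column name).
wide = pd.concat({i: df.shift(i) for i in range(2)}, axis=1)
print(wide)
```

Each copy is moved down by i rows, so the top i rows of that copy are filled with NaN. Unlike the loop in the question, which overwrites the same three columns on every pass, every copy is kept under its own key.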
