Exact visual matches between common columns in dataframes not matching on merge - python

I am trying to merge two dataframes, df and df1, on a common column, "long_name". But the merge fails for some names, even ones that look like visually exact matches (e.g. "Lionel Andrés Messi Cuccittini" in df and "Lionel Andrés Messi Cuccittini" in df1), when I merge on "long_name":
df_merged = df.merge(df1, on="long_name", indicator=True, how='right')
Lionel Messi is left out and, according to the indicator column, he's a "right_only" row from the merge. What's odd is that "Neymar da Silva Santos Júnior" IS merging. Why is there a discrepancy between the rows? Both have been sourced consistently: df from Kaggle and df1 from scraping, using the same script to extract the name for every row.
I tried to isolate both the Lionel Messi entries from df and df1 using the following code:
name1 = df.loc[df.short_name == 'L. Messi', ["long_name"]]
name2 = df1.loc[df1.name == 'Lionel Messi', ["long_name"]]
name1.values == name2.values
But the result is array([[False]]). I'm not sure why they're not matching.
The first df looks like this (first 8 lines, df = df.loc[0:7,["short_name", "long_name"]]):
   short_name         long_name
0  L. Messi           Lionel Andrés Messi Cuccittini
1  Cristiano Ronaldo  Cristiano Ronaldo dos Santos Aveiro
2  Neymar Jr          Neymar da Silva Santos Junior
3  J. Oblak           Jan Oblak
4  E. Hazard          Eden Hazard
5  K. De Bruyne       Kevin De Bruyne
6  M. ter Stegen      Marc-André ter Stegen
7  V. van Dijk        Virgil van Dijk
The second df looks like this (first 8 lines, df1 = df1.loc[0:7,["name", "long_name"]]):
   name               long_name
0  Kylian Mbappé      Kylian Sanmi Mbappé Lottin
1  Neymar             Neymar da Silva Santos Júnior
2  Mohamed Salah      محمد صلاح
3  Harry Kane         Harry Edward Kane
4  Eden Hazard        Eden Michael Hazard
5  Lionel Messi       Lionel Andrés Messi Cuccitini
6  Raheem Sterling    Raheem Shaquille Sterling
7  Antoine Griezmann  Antoine Griezmann

Are you sure it is not just a case of a misspelled name?
df lists the long_name as Lionel Andrés Messi Cuccittini, whereas df1 lists it as Lionel Andrés Messi Cuccitini. Note that "Cuccittini" has two t's in df but only one in df1.
Manually correct the second dataframe and retry.
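To confirm a mismatch like this without eyeballing it, you can diff the two strings character by character. The following is only a sketch: describe_difference is a hypothetical helper name, and the strip and Unicode normalization steps are added assumptions, included to rule out stray whitespace and composed vs. decomposed accents as the cause.

```python
import difflib
import unicodedata

name_in_df = "Lionel Andrés Messi Cuccittini"   # value as shown for df
name_in_df1 = "Lionel Andrés Messi Cuccitini"   # value as shown for df1


def describe_difference(left, right):
    """Return the non-matching segments between two strings (hypothetical helper)."""
    # Assumption: trim whitespace and normalize accents so only real spelling
    # differences remain.
    left_n = unicodedata.normalize("NFC", left.strip())
    right_n = unicodedata.normalize("NFC", right.strip())
    matcher = difflib.SequenceMatcher(None, left_n, right_n)
    return [
        (tag, left_n[i1:i2], right_n[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


diffs = describe_difference(name_in_df, name_in_df1)
print(diffs)
```

An empty list means the strings are equal after normalization; anything else shows exactly which characters differ.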

Related

How do you create new column from two distinct categorical column values in a dataframe by same column ID in pandas?

Sorry for the confusing title. I am practicing how to manipulate dataframes in Python through pandas. How do I make this kind of table:
   id  role      name
0  11  ACTOR     Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1  11  DIRECTOR  Christian Schwochow
2  22  ACTOR     Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
3  22  DIRECTOR  Andrew Baird
4  33  ACTOR     Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...
5  33  DIRECTOR  Saron Sakina
Into this kind:
   id  director             actors name
0  11  Christian Schwochow  Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1  22  Andrew Baird         Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
2  33  Saron Sakina         Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...
Try pivot:
df.pivot(index='id', columns='role', values='name')
In addition to @Tejas's answer, you can reset the index and rename the columns:
df = (
    df.pivot(index='id', columns='role', values='name')
    .reset_index()
    .rename_axis('', axis=1)
    .rename(columns={'ACTOR': 'actors name', 'DIRECTOR': 'director'})
)
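For anyone who wants to run this end to end, here is a minimal self-contained sketch. The dataframe is rebuilt from the question's table, with the truncated name strings kept exactly as displayed; the variable name wide and the final column reordering are my own additions.

```python
import pandas as pd

# Rebuilt from the question; the "..." truncation is kept verbatim.
df = pd.DataFrame({
    "id": [11, 11, 22, 22, 33, 33],
    "role": ["ACTOR", "DIRECTOR"] * 3,
    "name": [
        "Luna Wedler, Jannis Niewöhner, Milan Peschel, ...",
        "Christian Schwochow",
        "Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...",
        "Andrew Baird",
        "Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...",
        "Saron Sakina",
    ],
})

wide = (
    df.pivot(index="id", columns="role", values="name")  # one row per id, one column per role
    .reset_index()                                       # turn id back into a regular column
    .rename_axis(columns=None)                           # drop the leftover "role" axis label
    .rename(columns={"ACTOR": "actors name", "DIRECTOR": "director"})
)

# Put the columns in the order requested in the question.
wide = wide[["id", "director", "actors name"]]
print(wide)
```

Note that pivot raises an error if an id has more than one row for the same role, so this assumes each (id, role) pair is unique, as in the sample.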

Pandas add column with value by matching similar names from 2 different data frames

I'm trying to make a Champions League Fantasy statistics table, and I'm getting data from 2 different sites that name the players slightly differently from each other.
I have df1 from site 1:
   Name            Pos
0  Lionel Messi    DEL
1  Junior Messias  DEL
2  Kepa            POR
3  Alisson         POR
and df2 from site 2:
   Name               Cost
0  Messi              40
1  Messias            20
2  Kepa Arrizabalaga  10
3  Neymar             40
Note that Messi is still Messi, but one site calls him by his full name and the other by his last name only.
The same goes for Kepa Arrizabalaga, who appears as just Kepa on the first site.
What I want to do is add a 'Cost' column to df1 with the corresponding cost of each player. The challenge is that if I search for the word 'Messi' to find his cost, I could get Junior Messias's cost instead, because 'Messi' is contained in his last name. I considered a SequenceMatcher.ratio() type of solution, but when I tried it for Kepa, 'Kepa' turned out to be very different from 'Kepa Arrizabalaga', so I don't get a close match on that one.
I don't know how to go through df1 and find a close-enough match for each 'Name' in df2 so that I can look up its cost and add it to df1.
The desired output for df3 would be as follows:
   Name            Pos  Cost
0  Lionel Messi    DEL  40
1  Junior Messias  DEL  20
2  Kepa            POR  10
3  Alisson         POR  NaN
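One possible approach, sketched here under assumptions rather than offered as a definitive answer, is to compare whole words instead of substrings: treat two names as the same player when every word of one name appears in the other. That keeps 'Messi' from matching 'Messias' and still lets 'Kepa' match 'Kepa Arrizabalaga'. The helpers tokens and find_cost are hypothetical names, and the rule returns the first match only, so it would need refining for players who share a surname.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Lionel Messi", "Junior Messias", "Kepa", "Alisson"],
    "Pos": ["DEL", "DEL", "POR", "POR"],
})
df2 = pd.DataFrame({
    "Name": ["Messi", "Messias", "Kepa Arrizabalaga", "Neymar"],
    "Cost": [40, 20, 10, 40],
})


def tokens(name):
    """Split a name into a set of lowercase words (hypothetical helper)."""
    return set(name.lower().split())


def find_cost(name):
    """Return the cost of the first df2 name whose words contain, or are contained in, this name's words."""
    wanted = tokens(name)
    for other, cost in zip(df2["Name"], df2["Cost"]):
        have = tokens(other)
        if wanted <= have or have <= wanted:
            return cost
    return float("nan")  # no match on the second site


df3 = df1.assign(Cost=df1["Name"].map(find_cost))
print(df3)
```

Because the comparison is on whole words, spelling variations (accents, abbreviations) would still need a fuzzy fallback such as SequenceMatcher applied per word.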

How to change platform when scraping a website (Futbin) in Python?

I am currently looking at obtaining price data from Futbin specifically from this page of player data. I have used bs4 successfully for this with the following specific code:
spans = soup.find_all("span", class_="ps4_color font-weight-bold")
This collects all PS4 prices from the players page, but I would also like to obtain Xbox and PC prices. On the site you have to select the platform manually from the icons in the top right, and from what I can tell this loads the same URL with updated price data. How can I scrape this data in a similar way to the above? I'm sure there must be an easier way than using Selenium or similar packages.
Any help would be greatly appreciated!
To switch the page to another platform, set the platform cookie through the cookies= parameter of your request:
import requests
from bs4 import BeautifulSoup

url = 'https://www.futbin.com/20/players?page=1'
platforms = ['ps4', 'xone', 'pc']

for platform in platforms:
    print()
    print('Platform: {}'.format(platform))
    print('-' * 80)
    # Request the same URL, sending the wanted platform as a cookie.
    soup = BeautifulSoup(requests.get(url, cookies={'platform': platform}).content, 'html.parser')
    for s in soup.select('span.font-weight-bold'):
        # The player's name is in the closest preceding player link.
        name = s.find_previous('a', class_='player_name_players_table').text
        print('{:<40} {}'.format(name, s.text))
Prints:
Platform: ps4
--------------------------------------------------------------------------------
Lionel Messi 2.5M
Virgil van Dijk 1.82M
Cristiano Ronaldo 3.2M
Diego Maradona 4.5M
Pelé 6.65M
Kevin De Bruyne 1.95M
Virgil van Dijk 1.75M
Lionel Messi 2.18M
Robert Lewandowski 805K
Cristiano Ronaldo 3.08M
Pelé 3.35M
Kylian Mbappé 2.62M
Kevin De Bruyne 1.21M
Sadio Mané 783K
Kylian Mbappé 2.66M
Neymar Jr 3.83M
Diego Maradona 2.19M
Sadio Mané 625K
Alisson 148K
N'Golo Kanté 1.51M
Robert Lewandowski 269K
Ronaldo 0
Zinedine Zidane 7.15M
Lionel Messi 4.6M
Lionel Messi 1.4M
Alisson 143K
Mohamed Salah 459K
Raphaël Varane 847K
Karim Benzema 310K
Luis Suárez 407K
Platform: xone
--------------------------------------------------------------------------------
Lionel Messi 2.15M
Virgil van Dijk 1.65M
Cristiano Ronaldo 2.53M
Diego Maradona 4.07M
Pelé 0
Kevin De Bruyne 1.73M
Virgil van Dijk 1.6M
Lionel Messi 1.9M
Robert Lewandowski 719K
Cristiano Ronaldo 2.51M
Pelé 3.15M
Kylian Mbappé 2.27M
Kevin De Bruyne 1.02M
Sadio Mané 695K
Kylian Mbappé 2.24M
Neymar Jr 3.27M
Diego Maradona 1.61M
Sadio Mané 585K
Alisson 153K
N'Golo Kanté 1.3M
Robert Lewandowski 247K
Ronaldo 0
Zinedine Zidane 6.78M
Lionel Messi 4.26M
Lionel Messi 1.24M
Alisson 130K
Mohamed Salah 470K
Raphaël Varane 725K
Karim Benzema 272K
Luis Suárez 351K
Platform: pc
--------------------------------------------------------------------------------
Lionel Messi 3.56M
Virgil van Dijk 2.5M
Cristiano Ronaldo 3.75M
Diego Maradona 4.3M
Pelé 0
Kevin De Bruyne 2.52M
Virgil van Dijk 2.4M
Lionel Messi 2.86M
Robert Lewandowski 1.16M
Cristiano Ronaldo 3.75M
Pelé 5.75M
Kylian Mbappé 3.35M
Kevin De Bruyne 1.4M
Sadio Mané 925K
Kylian Mbappé 3.3M
Neymar Jr 4.85M
Diego Maradona 1.98M
Sadio Mané 730K
Alisson 179K
N'Golo Kanté 1.9M
Robert Lewandowski 400K
Ronaldo 0
Zinedine Zidane 0
Lionel Messi 4.77M
Lionel Messi 2.3M
Alisson 160K
Mohamed Salah 520K
Raphaël Varane 940K
Karim Benzema 370K
Luis Suárez 679K

Find and replace in dataframe from another dataframe

I have two dataframes; snippets of both are below. I am trying to find the artists' names in the second dataframe and replace them with the ids from the first dataframe. Is there a good way to do this?
   id  fullName
0  1   Colin McCahon
1  2   Robert Henry Dickerson
2  3   Arthur Dagley
Artists
0 Arthur Dagley, Colin McCahon, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 Robert Henry Dickerson
3 Steve Carr
Desired output:
Artists
0 3, 1, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 2
3 Steve Carr
You can do this with replace, passing a dictionary that maps each name to its id:
df1.Artists.replace(dict(zip(df.fullName, df.id.astype(str))), regex=True)
0 3, 1, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 2
3 Steve Carr
Name: Artists, dtype: object
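Convert your first dataframe into a dictionary (pass .values so the ids are not realigned against the new index, which would turn them all into NaN):
d = pd.Series(name_df.id.astype(str).values, index=name_df.fullName).to_dict()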
Then use .replace():
artists_df["Artists"] = artists_df["Artists"].replace(d, regex=True)
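As an alternative to a regex-based replace, here is a sketch that splits each cell on commas and looks up every name exactly. It is not the method from the answers above; I'm suggesting it because a regex replace also matches substrings and treats special characters in names as pattern syntax. The dataframes are rebuilt from the question (the "..." is kept as displayed), and replace_names is a hypothetical helper.

```python
import pandas as pd

name_df = pd.DataFrame({
    "id": [1, 2, 3],
    "fullName": ["Colin McCahon", "Robert Henry Dickerson", "Arthur Dagley"],
})
artists_df = pd.DataFrame({
    "Artists": [
        "Arthur Dagley, Colin McCahon, Maria Cruz",
        "Fiona Gilmore, Peter Madden, Nicholas Spratt, ...",
        "Robert Henry Dickerson",
        "Steve Carr",
    ],
})

# Exact-name lookup table: full name -> id as a string.
name_to_id = dict(zip(name_df["fullName"], name_df["id"].astype(str)))


def replace_names(cell):
    """Swap each comma-separated name for its id when known; keep unknown names as they are."""
    parts = [part.strip() for part in cell.split(",")]
    return ", ".join(name_to_id.get(part, part) for part in parts)


artists_df["Artists"] = artists_df["Artists"].apply(replace_names)
print(artists_df)
```

This assumes names never contain commas themselves; if they can, the split step would need a different delimiter rule.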

Concatenate a set of column values based on another column in Pandas

Given a Pandas dataframe which has a few labeled series in it, say Name and Villain.
Say the dataframe has values such:
Name: {'Batman', 'Batman', 'Spiderman', 'Spiderman', 'Spiderman', 'Spiderman'}
Villain: {'Joker', 'Bane', 'Green Goblin', 'Electro', 'Venom', 'Dr Octopus'}
In total the above dataframe has 2 series(or columns) each with six datapoints.
Now, based on the Name, I want to concatenate 3 more columns: FirstName, LastName, LoveInterest to each datapoint.
The result of which adds 'Bruce; Wayne; Catwoman' to every row which has Name as Batman. And 'Peter; Parker; MaryJane' to every row which has Name as Spiderman.
The final result should be a dataframe containing 5 columns(series) and 6 rows each.
This is a classic inner-join scenario. In pandas, use the module-level merge function:
In [13]: df1
Out[13]:
   Name       Villain
0  Batman     Joker
1  Batman     Bane
2  Spiderman  Green Goblin
3  Spiderman  Electro
4  Spiderman  Venom
5  Spiderman  Dr. Octopus
In [14]: df2
Out[14]:
   FirstName  LastName  LoveInterest  Name
0  Bruce      Wayne     Catwoman      Batman
1  Peter      Parker    MaryJane      Spiderman
In [15]: pd.merge(df1, df2, on='Name')
Out[15]:
   Name       Villain       FirstName  LastName  LoveInterest
0  Batman     Joker         Bruce      Wayne     Catwoman
1  Batman     Bane          Bruce      Wayne     Catwoman
2  Spiderman  Green Goblin  Peter      Parker    MaryJane
3  Spiderman  Electro       Peter      Parker    MaryJane
4  Spiderman  Venom         Peter      Parker    MaryJane
5  Spiderman  Dr. Octopus   Peter      Parker    MaryJane
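For completeness, a self-contained sketch that builds both frames from the values shown and merges them. The validate="many_to_one" argument is my own addition: it asks pandas to raise an error if a hero ever appears more than once in the lookup frame, which would otherwise silently duplicate rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Batman", "Batman", "Spiderman", "Spiderman", "Spiderman", "Spiderman"],
    "Villain": ["Joker", "Bane", "Green Goblin", "Electro", "Venom", "Dr. Octopus"],
})
df2 = pd.DataFrame({
    "FirstName": ["Bruce", "Peter"],
    "LastName": ["Wayne", "Parker"],
    "LoveInterest": ["Catwoman", "MaryJane"],
    "Name": ["Batman", "Spiderman"],
})

# Inner join on Name: every villain row picks up the matching hero's details.
merged = pd.merge(df1, df2, on="Name", how="inner", validate="many_to_one")
print(merged)
```

If some names in df1 might be missing from df2 and those rows should be kept, how="left" would keep them with NaN in the added columns.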
