Get value Dataframe based on similar string

I want to get the value of a cell in a DataFrame based on a string that is not identical but very similar.
This is the dataframe:
Teams GP Pts
0 Liverpool 15 44
1 Chelsea 15 35
2 Manchester C. 15 32
3 West Ham Utd 15 28
4 Manchester Utd 14 24
5 Leicester City 14 22
6 Watford 15 20
7 Aston Villa 14 19
8 Crystal Palace 14 19
9 Arsenal 14 17
10 Brentford 14 17
11 Everton 14 17
12 Newcastle Utd 15 17
13 Brighton 15 14
14 Burnley 14 14
15 Southampton 15 14
16 Leeds Utd 14 13
17 Tottenham 13 13
18 Wolverhampton 15 12
19 Norwich City 14 8
Code
hometeam = 'Manchester City'
pts_man_city = df[df.Teams == hometeam].iloc[0]['Pts']
But I get IndexError: single positional indexer is out-of-bounds

You can use thefuzz.process (previously fuzzywuzzy):
# pip install thefuzz
from thefuzz import process
hometeam = 'Manchester City'
best = process.extractOne(hometeam, df['Teams'])[0]
df.loc[df['Teams'].eq(best), 'Pts'].iloc[0]
output: 32

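We need to find similar strings. One option is difflib.SequenceMatcher from the standard library:
from difflib import SequenceMatcher
def similar(a, b):
    # Similarity ratio between 0.0 and 1.0
    return SequenceMatcher(None, a, b).ratio()
alpha = 0.75  # minimum similarity to accept
hometeam = 'Manchester City'
# Label of the first row whose team name is similar enough
idx = df.index[df['Teams'].apply(lambda x: similar(x, hometeam) > alpha)][0]
df.loc[idx, 'Pts']
Adjust the alpha parameter for your task.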
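The threshold approach above picks the first team that clears the cutoff, which is not necessarily the closest one. As a minimal sketch using only the standard library, difflib.get_close_matches returns candidates ordered by similarity, so asking for n=1 gives the best match. The helper name closest_team and the shortened team list are assumptions made for illustration; the names themselves are copied from the question.

```python
from difflib import get_close_matches

# Team names copied from the question's dataframe (shortened list for illustration).
teams = ['Liverpool', 'Chelsea', 'Manchester C.', 'West Ham Utd',
         'Manchester Utd', 'Leicester City']

def closest_team(name, candidates, cutoff=0.6):
    """Return the candidate most similar to name, or None if none reaches cutoff.

    closest_team is a hypothetical helper, not part of difflib.
    """
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

best = closest_team('Manchester City', teams)
print(best)
```

With a dataframe, the candidates can be df['Teams'] and the result can be fed into df.loc[df['Teams'].eq(best), 'Pts']. Checking for None first avoids the IndexError from the question when nothing is similar enough.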

The code below returns the row for a specific team, provided the name matches exactly:
df.loc[df['Teams'] == hometeam]

Related

How to make a combined histogram of two grouped columns?

My data
I have the attached data and I'm trying to overlay the Home and Away histograms for each team individually. I'm new to Python, by the way.
So far I have made the two plots below, which look exactly how I want, but I would like to combine them for each team:
df_EPL['Away_score'].hist(by=df_EPL['AwayTeam'],figsize = (8,8),color = '#96ddff');
and
df_EPL['Home_score'].hist(by=df_EPL['HomeTeam'],figsize = (8,8),color = '#82c065');

Fake Dataframe creation
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
np.random.seed(42)
teams = ['Arsenal', 'Chelsea', 'Liverpool', 'Manchester City', 'Manchester Utd']
df = pd.DataFrame({'HomeTeam': np.repeat(teams, len(teams) - 1)})
df['AwayTeam'] = [away_team for home_team in teams for away_team in teams if away_team != home_team]
df['Home_score'] = np.random.randint(0, 5, len(df))
df['Away_score'] = np.random.randint(0, 5, len(df))
HomeTeam AwayTeam Home_score Away_score
0 Arsenal Chelsea 3 1
1 Arsenal Liverpool 4 4
2 Arsenal Manchester City 2 3
3 Arsenal Manchester Utd 4 0
4 Chelsea Arsenal 4 0
5 Chelsea Liverpool 1 2
6 Chelsea Manchester City 2 2
7 Chelsea Manchester Utd 2 1
8 Liverpool Arsenal 2 3
9 Liverpool Chelsea 4 3
10 Liverpool Manchester City 3 2
11 Liverpool Manchester Utd 2 3
12 Manchester City Arsenal 4 3
13 Manchester City Chelsea 1 0
14 Manchester City Liverpool 3 2
15 Manchester City Manchester Utd 1 4
16 Manchester Utd Arsenal 3 2
17 Manchester Utd Chelsea 4 4
18 Manchester Utd Liverpool 0 0
19 Manchester Utd Manchester City 3 1
Dataframe re-shape
You need to re-shape your dataframe in a different format in order to make the plot you want. For this purpose, you can use pandas.melt:
df = pd.melt(frame = df,
             id_vars = ['HomeTeam', 'AwayTeam'],
             var_name = 'H/A',
             value_name = 'Score')
df = df.drop('AwayTeam', axis = 1).rename(columns = {'HomeTeam': 'Team'}).replace({'Home_score': 'Home', 'Away_score': 'Away'})
Team H/A Score
0 Arsenal Home 3
1 Arsenal Home 4
2 Arsenal Home 2
3 Arsenal Home 4
4 Chelsea Home 4
5 Chelsea Home 1
6 Chelsea Home 2
7 Chelsea Home 2
8 Liverpool Home 2
9 Liverpool Home 4
10 Liverpool Home 3
11 Liverpool Home 2
12 Manchester City Home 4
13 Manchester City Home 1
14 Manchester City Home 3
15 Manchester City Home 1
16 Manchester Utd Home 3
17 Manchester Utd Home 4
18 Manchester Utd Home 0
19 Manchester Utd Home 3
20 Arsenal Away 1
21 Arsenal Away 4
22 Arsenal Away 3
23 Arsenal Away 0
24 Chelsea Away 0
25 Chelsea Away 2
26 Chelsea Away 2
27 Chelsea Away 1
28 Liverpool Away 3
29 Liverpool Away 3
30 Liverpool Away 2
31 Liverpool Away 3
32 Manchester City Away 3
33 Manchester City Away 0
34 Manchester City Away 2
35 Manchester City Away 4
36 Manchester Utd Away 2
37 Manchester Utd Away 4
38 Manchester Utd Away 0
39 Manchester Utd Away 1
Plot
Now the dataframe is ready to be plotted. You can use seaborn.FacetGrid to create the grid of subplots, one for each team. Each subplot contains two seaborn.histplot layers: one for the home scores and one for the away scores:
g = sns.FacetGrid(df, col = 'Team', hue = 'H/A')
g.map(sns.histplot, 'Score', bins = np.arange(df['Score'].min() - 0.5, df['Score'].max() + 1.5, 1))
g.add_legend()
g.set(xticks = np.arange(df['Score'].min(), df['Score'].max() + 1, 1))
plt.show()

How do I convert a list to a dataframe in a for loop

I would appreciate your help converting the output of this code into a dataframe with two columns. I was able to write the for loop that prints each result, and now I need to save this data to a dataframe, but I have not been able to get the correct result. Can you help me?
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.skysports.com/premier-league-table') #get page web information
soup = BeautifulSoup(r.text, 'html.parser') # interpreter
print(soup.prettify())
print(soup.title)
league_table = soup.find('table', class_ = 'standing-table__table callfn')
for team in league_table.find_all('tbody'):
    rows = team.find_all('tr')
    for row in rows:
        pl_team = row.find('td', class_ ='standing-table__cell standing-table__cell--name')
        pl_team = pl_team['data-long-name']
        points = row.find_all('td', class_ = 'standing-table__cell')[9].text
        print(pl_team, points)

Use pandas. You can do it in 1 line.
import pandas as pd
df = pd.read_html('https://www.skysports.com/premier-league-table')[0]
Output:
print(df[['Team','Pts']])
Team Pts
0 Manchester City 83
1 Manchester United 71
2 Chelsea 67
3 Liverpool 66
4 Leicester City 66
5 West Ham United 62
6 Tottenham Hotspur 59
7 Everton 59
8 Arsenal 58
9 Leeds United 56
10 Aston Villa 52
11 Wolverhampton Wanderers 45
12 Crystal Palace 44
13 Southampton 43
14 Newcastle United 42
15 Brighton and Hove Albion 41
16 Burnley 39
17 Fulham 28
18 West Bromwich Albion 26
19 Sheffield United 20
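If you prefer to keep the scraping loop from the question, a common pattern is to collect one dict per row in a list and build the dataframe once at the end. The sketch below replaces the network request with a small hard-coded list (an assumption made so the example runs offline); in the real loop the two values would come from row.find(...) and row.find_all(...).

```python
import pandas as pd

# Stand-in for the (team, points) pairs the loop extracts from each <tr>.
# Hypothetical sample values; the real ones come from the scraped page.
scraped = [('Manchester City', '83'),
           ('Manchester United', '71'),
           ('Chelsea', '67')]

records = []
for pl_team, points in scraped:
    # Scraped text is a string, so convert the points to an integer.
    records.append({'Team': pl_team, 'Pts': int(points)})

# Build the dataframe once, after the loop has finished.
df = pd.DataFrame(records)
print(df)
```

Building the dataframe once from a list is usually simpler and faster than appending to a dataframe inside the loop.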

Pandas - read a text file

I have a text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
I'm trying to read it into a pandas data frame like this:
df = pd.read_table('assist1.txt',
                   sep='\s+',
                   skiprows=6,
                   header=0,)
This code throws an exception - pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 31, saw 8.
I guess that's because of the space between the first and last name of the player (should be the value of the Player column).
Is there a way to achieve this?
Furthermore, it is a part of a larger text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Table
================================================================================================
Pos Team Pld Won Drn Lst For Ag Won Drn Lst For Ag Pts
--------------------------------------------------------------------------------------------------
1st C Man Utd 38 15 4 0 41 4 10 4 5 34 20 83
--------------------------------------------------------------------------------------------------
2nd Arsenal 38 15 2 2 38 9 11 3 5 28 14 83
3rd Leeds 38 15 4 0 33 8 9 4 6 36 37 80
4th Liverpool 38 13 4 2 25 7 9 2 8 26 24 72
5th Chelsea 38 16 1 2 44 18 4 5 10 24 33 66
6th Newcastle 38 11 5 3 40 23 7 3 9 25 33 62
7th Blackburn 38 11 3 5 36 24 5 5 9 23 30 56
8th Middlesbrough 38 9 7 3 31 19 5 6 8 20 29 55
9th Sunderland 38 8 5 6 31 30 8 2 9 22 25 55
10th West Ham 38 11 3 5 31 17 3 7 9 14 29 52
11th Tottenham 38 10 3 6 35 26 4 5 10 23 35 50
12th Leicester 38 7 5 7 23 20 6 4 9 26 28 48
13th Fulham 38 7 5 7 39 35 5 7 7 33 44 48
14th Ipswich 38 9 4 6 23 22 3 3 13 14 34 43
15th Charlton 38 5 5 9 18 26 5 4 10 16 30 39
16th Everton 38 8 4 7 30 28 1 5 13 11 36 36
17th Aston Villa 38 2 8 9 19 28 5 6 8 21 26 35
--------------------------------------------------------------------------------------------------
18th R Derby 38 6 4 9 25 28 3 3 13 14 39 34
19th R Southampton 38 5 7 7 34 34 1 4 14 12 35 29
20th R Bolton 38 6 3 10 25 31 1 4 14 15 40 28
================================================================================================
2001/2 Goals
================================================================================================
Pos Player Club Apps Gls
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 25
2nd Alan Shearer Newcastle 36 25
3rd Ruud van Nistelrooy Man Utd 26 23
4th Steve Marlet Fulham 38 20
5th Jimmy Floyd Hasselbaink Chelsea 30 (1) 20
6th Les Ferdinand Sunderland 27 (2) 17
7th Kevin Phillips Sunderland 36 17
8th Frédéric Kanouté West Ham 32 (3) 14
9th Marcus Bent Blackburn 28 (4) 13
10th Alen Boksic Middlesbrough 36 13
11th Eidur Gudjohnsen Chelsea 28 (3) 13
12th Luis Boa Morte Fulham 36 13
13th Michael Owen Liverpool 32 (1) 12
14th Dwight Yorke Man Utd 29 (1) 11
15th Henrik Pedersen Bolton 36 11
16th Juan Pablo Angel Aston Villa 34 (2) 11
17th Juan Sebastián Verón Man Utd 29 (2) 11
18th Shaun Bartlett Charlton 35 10
19th Matt Jansen Blackburn 28 (5) 10
20th Duncan Ferguson Everton 28 (5) 10
21st Ian Harte Leeds 37 10
22nd Bosko Balaban Aston Villa 36 10
23rd Robbie Fowler Liverpool 25 (3) 10
24th Georgi Kinkladze Derby 36 (1) 10
25th Hamilton Ricard Middlesbrough 28 (2) 10
26th Robert Pires Arsenal 24 (3) 9
27th Andrew Cole Man Utd 15 (5) 9
28th Rod Wallace Bolton 31 9
29th James Beattie Southampton 28 (1) 9
30th Robbie Keane Leeds 28 (8) 9
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
================================================================================================
2001/2 Average Rating
================================================================================================
Pos Player Club Apps Av R
-------------------------------------------------------------------------
1st Ruud van Nistelrooy Man Utd 26 8.54
2nd Thierry Henry Arsenal 34 8.09
3rd Alan Shearer Newcastle 36 7.97
4th Kieron Dyer Newcastle 33 7.94
5th Steve Marlet Fulham 38 7.89
6th Ian Harte Leeds 37 7.86
7th Andrew Cole Man Utd 15 (5) 7.85
8th Roy Keane Man Utd 19 7.84
9th Les Ferdinand Sunderland 27 (2) 7.83
10th Juan Sebastián Verón Man Utd 29 (2) 7.81
11th Eidur Gudjohnsen Chelsea 28 (3) 7.77
12th Jesper Grønkjær Chelsea 34 7.76
13th Michaël Silvestre Man Utd 32 7.72
14th Dean Gordon Middlesbrough 30 (1) 7.71
15th Michael Owen Liverpool 32 (1) 7.70
16th Patrick Vieira Arsenal 29 7.69
17th Robert Pires Arsenal 24 (3) 7.67
18th Ryan Giggs Man Utd 32 7.66
19th Dwight Yorke Man Utd 29 (1) 7.63
20th Mario Stanic Chelsea 29 (3) 7.63
21st Frédéric Kanouté West Ham 32 (3) 7.57
22nd Mark Viduka Leeds 21 7.57
23rd David Beckham Man Utd 29 7.55
24th Jimmy Floyd Hasselbaink Chelsea 30 (1) 7.55
25th Martin Taylor Blackburn 14 (8) 7.55
26th Titus Bramble Ipswich 33 7.55
27th Sol Campbell Arsenal 20 (1) 7.52
28th Mario Melchiot Chelsea 19 (2) 7.52
29th Stephane Henchoz Liverpool 29 7.52
30th Rio Ferdinand Leeds 36 (1) 7.51
================================================================================================
2001/2 Man of Match
================================================================================================
Pos Player Club Apps MoM
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 8
2nd Ruud van Nistelrooy Man Utd 26 8
3rd Kieron Dyer Newcastle 33 6
4th Les Ferdinand Sunderland 27 (2) 6
5th Steve Marlet Fulham 38 6
6th Eidur Gudjohnsen Chelsea 28 (3) 6
7th Ian Harte Leeds 37 5
8th Richie Wellens Leicester 20 (9) 5
9th Henrik Pedersen Bolton 36 5
10th Alan Shearer Newcastle 36 5
11th Michael Owen Liverpool 32 (1) 4
12th Dean Gordon Middlesbrough 30 (1) 4
13th Matt Jansen Blackburn 28 (5) 4
14th Marcus Bent Blackburn 28 (4) 4
15th Kevin Campbell Everton 27 (4) 4
16th Titus Bramble Ipswich 33 4
17th Roy Keane Man Utd 19 4
18th Frédéric Kanouté West Ham 32 (3) 4
19th Patrick Vieira Arsenal 29 4
20th Hermann Hreidarsson Ipswich 34 4
21st Dennis Bergkamp Arsenal 22 (9) 4
22nd Jimmy Floyd Hasselbaink Chelsea 30 (1) 4
23rd Claus Lundekvam Southampton 27 (2) 4
24th Robert Pires Arsenal 24 (3) 3
25th Shaun Bartlett Charlton 35 3
26th Kevin Phillips Sunderland 36 3
27th Lucas Radebe Leeds 31 (1) 3
28th Ragnvald Soma West Ham 27 (3) 3
29th Dean Richards Tottenham 34 3
30th Wayne Quinn Liverpool 25 (4) 3
Ideally I would like to run a function that creates a data frame out of each table above, but can't figure it out.
Thanks

Another way is to specify the separator as two or more whitespace characters and pass skiprows as a list of row numbers. I tried this and it gave me your expected output. You can write a simple script to find which lines to skip and which to keep.
df = pd.read_table('assist1.txt', sep=r'\s\s+', skiprows=[0,1,2,3,4,5,6,7,8,10], header=0, engine='python')

You're using whitespace as a delimiter, but this file is fixed-width (fixed-length), not whitespace-delimited. Look into fixed-width parsing with pandas.read_fwf: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html.
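The question also asks for a function that produces one table per section of the larger file. The following is a rough sketch rather than a complete parser: it assumes each section title sits between two lines of '=' characters, that the first non-rule line after a title is the header, and that columns in the real file are separated by two or more spaces. SAMPLE, is_rule and split_tables are hypothetical names, and the sample text is a shortened excerpt.

```python
import re

# Hypothetical excerpt; columns are assumed to be separated by two or more spaces.
SAMPLE = """\
************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************
============================================================
2001/2 Goals
============================================================
Pos   Player                    Club       Apps     Gls
------------------------------------------------------------
1st   Thierry Henry             Arsenal    34       25
5th   Jimmy Floyd Hasselbaink   Chelsea    30 (1)   20
============================================================
2001/2 Assists
============================================================
Pos   Player          Club            Apps     Asts
------------------------------------------------------------
1st   David Beckham   Man Utd         29       15
2nd   Dean Gordon     Middlesbrough   30 (1)   11
"""

def is_rule(line):
    """True for separator lines made only of '*', '=' or '-'."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) <= set('*=-')

def split_tables(text):
    """Map each section title to a list of row dicts (values kept as strings)."""
    lines = text.splitlines()
    tables, title, header = {}, None, None
    for i, line in enumerate(lines):
        if not line.strip() or is_rule(line):
            continue  # skip blank lines and separator lines
        # A title is the line sandwiched between two '=' rules.
        between_rules = (0 < i < len(lines) - 1
                         and lines[i - 1].startswith('=')
                         and lines[i + 1].startswith('='))
        if between_rules:
            title, header = line.strip(), None
            tables[title] = []
        elif title is not None:
            fields = re.split(r'\s{2,}', line.strip())
            if header is None:
                header = fields  # first line after the title is the header
            else:
                tables[title].append(dict(zip(header, fields)))
    return tables

tables = split_tables(SAMPLE)
for name, rows in tables.items():
    print(name, len(rows))
```

Each list of rows can then be passed to pd.DataFrame(rows). The league table would need extra handling, because the C and R markers after the position shift the columns.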

Pandas: transform column's values in independent columns

I have a Pandas DataFrame that looks like the following (df_olympic).
I would like the values of the column Type to be transformed into independent columns (df_olympic_table).
Original dataframe
In [3]: df_olympic
Out[3]:
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 19
Transformed dataframe
In [5]: df_olympic_table
Out[5]:
Country N_Gold N_Silver N_Bronze
0 USA 46 37 38
1 GB 27 23 17
2 China 26 18 26
3 Russia 19 18 19
What would be the most convenient way to achieve this?

You can use DataFrame.pivot:
df = df.pivot(index='Country', columns='Type', values='Num')
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
Another solution with DataFrame.set_index and Series.unstack:
df = df.set_index(['Country','Type'])['Num'].unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
you need pivot_table with an aggregate function. By default it is the mean, but you can use sum, first, ...
# add a new row that duplicates an existing 'Country' and 'Type' pair
print (df)
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 20 <- changed value to 20
12 Russia Bronze 100 <- added new row with duplicates
df = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc='mean')
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 <- Russia gets (100 + 20) / 2 = 60
USA 38 46 37
Or use groupby with mean aggregation and reshape with unstack:
df = df.groupby(['Country','Type'])['Num'].mean().unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 <- Russia gets (100 + 20) / 2 = 60
USA 38 46 37
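The pivoted results above have Country as the index and the medal columns in alphabetical order. To get closer to the layout requested in the question (Country as a regular column, columns named N_Gold, N_Silver, N_Bronze, countries in their original order), one possible sketch is the following. The dataframe is rebuilt from the question's data; the variable name medal_order is an assumption.

```python
import pandas as pd

# Data copied from the question.
df_olympic = pd.DataFrame({
    'Country': ['USA'] * 3 + ['GB'] * 3 + ['China'] * 3 + ['Russia'] * 3,
    'Type': ['Gold', 'Silver', 'Bronze'] * 4,
    'Num': [46, 37, 38, 27, 23, 17, 26, 18, 26, 19, 18, 19],
})

medal_order = ['Gold', 'Silver', 'Bronze']          # desired column order
country_order = df_olympic['Country'].unique()      # order of first appearance

df_olympic_table = (
    df_olympic.pivot(index='Country', columns='Type', values='Num')
    .reindex(index=country_order, columns=medal_order)  # restore row and column order
    .add_prefix('N_')                                    # Gold -> N_Gold, ...
    .rename_axis(index='Country', columns=None)          # drop the 'Type' axis label
    .reset_index()                                       # Country becomes a column
)
print(df_olympic_table)
```

If the data contains duplicate Country/Type pairs, pivot raises the ValueError mentioned above, and pivot_table with an explicit aggfunc can be swapped in at the same place in the chain.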

How to add labels to the row index

Here is my code:
>>> import pandas as pd
>>> df = pd.read_csv('Grade.txt',index_col=0,header=None)
>>> print(df)
1 2 3 4 5 6 7 8 9 10
0
Sarah K. 10 9 7 9 10 20 19 19 45 92
John M. 9 9 8 9 8 20 20 18 43 95
David R. 8 7 7 9 6 18 17 17 40 83
Joan A. 9 10 10 10 10 20 19 20 47 99
Nick J. 9 7 10 10 10 20 20 19 46 98
Vicki T. 7 7 8 9 9 17 18 19 44 88
>>> print(df.mean(axis=0))
1 8.666667
2 8.166667
3 8.333333
4 9.333333
5 8.833333
6 19.166667
7 18.833333
8 18.666667
9 44.166667
10 92.500000
Right now they are labeled 1-10 and I want the rows to look like this:
Homework #1 8.67
Homework #2 8.17
Homework #3 8.33
Homework #4 9.33
Homework #5 8.83
Quiz #1 19.17
Quiz #2 18.83
Quiz #3 18.67
Midterm #1 44.17
Final #1 92.50
I'm just looking for the right way to go about the labeling. So instead of 1-10 I'm looking for (Homework#1, Homework#2, Homework#3, etc.) Thanks

Absent logic to derive the column names from the column number, you'll probably need to label the columns with a simple list of the names:
cols
['Homework #1', 'Homework #2', 'Homework #3', 'Homework #4', 'Homework #5', 'Quiz #1', 'Quiz #2', 'Quiz #3', 'Midterm #1', 'Final #1']
df
1 2 3 4 5 6 7 8 9 10
0
Sarah K. 10 9 7 9 10 20 19 19 45 92
John M. 9 9 8 9 8 20 20 18 43 95
David R. 8 7 7 9 6 18 17 17 40 83
Joan A. 9 10 10 10 10 20 19 20 47 99
Nick J. 9 7 10 10 10 20 20 19 46 98
Vicki T. 7 7 8 9 9 17 18 19 44 88
df.columns = cols
df.mean()
Homework #1 8.666667
Homework #2 8.166667
Homework #3 8.333333
Homework #4 9.333333
Homework #5 8.833333
Quiz #1 19.166667
Quiz #2 18.833333
Quiz #3 18.666667
Midterm #1 44.166667
Final #1 92.500000
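If the columns do follow a known pattern, the list does not have to be typed by hand. As a small sketch, assuming the layout from the question (five homework columns, three quizzes, one midterm, one final), the labels can be generated from (name, count) pairs. build_labels is a hypothetical helper name.

```python
def build_labels(spec):
    """Expand (name, count) pairs into labels such as 'Homework #1'."""
    return [f'{name} #{i}' for name, count in spec for i in range(1, count + 1)]

# Assumed column layout, taken from the desired output in the question.
cols = build_labels([('Homework', 5), ('Quiz', 3), ('Midterm', 1), ('Final', 1)])
print(cols)
```

The result can be assigned with df.columns = cols as shown above, and df.mean().round(2) gives the two-decimal values from the question.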
