How to change value of a pd.DataFrame based on a condition? - python

I have Fifa dataset and it includes information about football players. One of the features of this dataset is the value of football players but it is in string form such as "$300K" or "$50M". How can I delete simply these euro and "M, K" symbol and write their values in same units?
import numpy as np
import pandas as pd
location = r'C:\Users\bemrem\Desktop\Python\fifa\fifa_dataset.csv'
_dataframe = pd.read_csv(location)
_dataframe = _dataframe.dropna()
_dataframe = _dataframe.reset_index(drop=True)
_dataframe = _dataframe[['Name', 'Value', 'Nationality', 'Age', 'Wage',
'Overall', 'Potential']]
_array = ['Belgium', 'France', 'Brazil', 'Croatia', 'England',' Portugal',
'Uruguay', 'Switzerland', 'Spain', 'Denmark']
_dataframe = _dataframe.loc[_dataframe['Nationality'].isin(_array)]
_dataframe = _dataframe.reset_index(drop=True)
print(_dataframe.head())
print()
print(_dataframe.tail())
I tried to convert this Value column but I failed. This is what I get
Name Value Nationality Age Wage Overall Potential
0 Neymar €123M Brazil 25 €280K 92 94
1 L. Suárez €97M Uruguay 30 €510K 92 92
2 E. Hazard €90.5M Belgium 26 €295K 90 91
3 Sergio Ramos €52M Spain 31 €310K 90 90
4 K. De Bruyne €83M Belgium 26 €285K 89 92
Name Value Nationality Age Wage Overall Potential
4931 A. Kilgour €40K England 19 €1K 47 56
4932 R. White €60K England 18 €2K 47 65
4933 T. Sawyer €50K England 18 €1K 46 58
4934 J. Keeble €40K England 18 €1K 46 56
4935 J. Lundstram €60K England 18 €1K 46 64
But I want to my output looks like this:
Name Value Nationality Age Wage Overall Potential
0 Neymar 123 Brazil 25 €280K 92 94
1 L. Suárez 97 Uruguay 30 €510K 92 92
2 E. Hazard 90.5 Belgium 26 €295K 90 91
3 Sergio Ramos 52 Spain 31 €310K 90 90
4 K. De Bruyne 83 Belgium 26 €285K 89 92
Name Value Nationality Age Wage Overall Potential
4931 A. Kilgour 0.04 England 19 €1K 47 56
4932 R. White 0.06 England 18 €2K 47 65
4933 T. Sawyer 0.05 England 18 €1K 46 58
4934 J. Keeble 0.04 England 18 €1K 46 56
4935 J. Lundstram 0.06 England 18 €1K 46 64

I do not have enough reputation to flag an answer as a duplicate. However, I believe that this will solve your particular question in addition to providing a solution if there is no "K" or "M" in your string.
You will also need to replace $ with € in the regex.
Convert the string 2.90K to 2900 or 5.2M to 5200000 in pandas dataframe

Related

I have number of matches, but only for home matches for given team. How can I sum values for duplicate pairs?

My dataframe contains number of matches for given fixtures, but only for home matches for given team (i.e. number of matches for Argentina-Uruguay matches is 97, but for Uruguay-Argentina this number is 80). In short I want to sum both numbers of home matches for given teams, so that I have the total number of matches between the teams concerned. The dataframe's top 30 rows looks like this:
most_often = mc.groupby(["home_team", "away_team"]).size().reset_index(name="how_many").sort_values(by=['how_many'], ascending = False)
most_often = most_often.reset_index(drop=True)
most_often.head(30)
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
5 Argentina Paraguay 64
6 Belgium Netherlands 63
7 Netherlands Belgium 62
8 England Scotland 59
9 Argentina Brazil 58
10 Brazil Paraguay 58
11 Scotland England 58
12 Norway Sweden 56
13 England Wales 54
14 Sweden Denmark 54
15 Wales Scotland 54
16 Denmark Sweden 53
17 Argentina Chile 53
18 Scotland Wales 52
19 Scotland Northern Ireland 52
20 Sweden Norway 51
21 Wales England 50
22 England Northern Ireland 50
23 Wales Northern Ireland 50
24 Chile Uruguay 49
25 Northern Ireland England 49
26 Brazil Argentina 48
27 Brazil Chile 48
28 Brazil Uruguay 47
29 Chile Peru 46
In turn, I mean something like this
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 107
5 Uganda Kenya 107
6 Belgium Netherlands 105
7 Netherlands Belgium 105
But this is only an example, I want to apply it for every team, which I have on dataframe.
What should I do?
Ok, you can follow steps below.
Here is the initial df.
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
Here you need to create a siorted list that will be the key foraggregations.
df1['sorted_list_team'] = list(zip(df1['home_team'],df1['away_team']))
df1['sorted_list_team'] = df1['sorted_list_team'].apply(lambda x: np.sort(np.unique(x)))
home_team away_team how_many sorted_list_team
0 Argentina Uruguay 97 [Argentina, Uruguay]
1 Uruguay Argentina 80 [Argentina, Uruguay]
2 Austria Hungary 69 [Austria, Hungary]
3 Hungary Austria 68 [Austria, Hungary]
4 Kenya Uganda 65 [Kenya, Uganda]
Here you will covert this list to tuple and turn it able to be aggregations.
def converter(list):
return (*list, )
df1['sorted_list_team'] = df1['sorted_list_team'].apply(converter)
df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()
sorted_list_team how_many
0 (Argentina, Brazil) 106
1 (Argentina, Chile) 53
2 (Argentina, Paraguay) 64
3 (Argentina, Uruguay) 177
4 (Austria, Hungary) 137
Do the aggregation to make a sum of 'how_many' values in another dataframe that i call 'df_sum'.
df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()
sorted_list_team how_many
0 (Argentina, Brazil) 106
1 (Argentina, Chile) 53
2 (Argentina, Paraguay) 64
3 (Argentina, Uruguay) 177
4 (Austria, Hungary) 137
And merge with 'df1' to get the result of a sum, the colum 'how_many' are in both dfs, for this reason pandas rename the column of df_sum as 'how_many_y'
df1 = pd.merge(df1,df_sum[['sorted_list_team','how_many']], on='sorted_list_team',how='left').drop_duplicates()
And final step you need select only columns that you need from result df.
df1 = df1[['home_team','away_team','how_many_y']]
df1 = df1.drop_duplicates()
df1.head()
home_team away_team how_many_y
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 65
I found a relatively straightforward thing that hopefully does what you want, but is slightly different than your desired output. Your output has what looks like repetitive information where we aren't caring anymore about home-vs-away team but just want the game counts, and so let's get rid of that distinction (if we can...).
If we make a new column that combines the values from home_team and away_team in the same order each time, we can just do a sum on the how_many where that new column matches
df['teams'] = pd.Series(map('-'.join,np.sort(df[['home_team','away_team']],axis=1)))
# this creates values like 'Argentina-Brazil' and 'Chile-Peru'
df[['how_many','teams']].groupby('teams').sum()
This code gave me the following:
how_many
teams
Argentina-Brazil 106
Argentina-Chile 53
Argentina-Paraguay 64
Argentina-Uruguay 177
Austria-Hungary 137
Belgium-Netherlands 125
Brazil-Chile 48
Brazil-Paraguay 58
Brazil-Uruguay 47
Chile-Peru 46
Chile-Uruguay 49
Denmark-Sweden 107
England-Northern Ireland 99
England-Scotland 117
England-Wales 104
Kenya-Uganda 65
Northern Ireland-Scotland 52
Northern Ireland-Wales 50
Norway-Sweden 107
Scotland-Wales 106

Create new dataframe column based on 2 criteria from another dataframe

I have the following dataframe:
df=pd.DataFrame({'Name':['JOHN','ALLEN','BOB','NIKI','CHARLIE','CHANG'],
'Age':[35,42,63,29,47,51],
'Salary_in_1000':[100,93,78,120,64,115],
'FT_Team':['STEELERS','SEAHAWKS','FALCONS','FALCONS','PATRIOTS','STEELERS']})
df output:
- Name Age Salary_in_1000 FT_Team
0 JOHN 35 100 STEELERS
1 ALLEN 42 93 SEAHAWKS
2 BOB 63 78 FALCONS
3 NIKI 29 120 FALCONS
4 CHARLIE 47 64 PATRIOTS
5 CHANG 51 115 STEELERS
And my dataframe that I am trying to complete:
df1=pd.DataFrame({'Name':['JOHN','ALLEN','BOB','NIKI','CHARLIE','CHANG'],
'Age':[35,42,63,29,47,51],})
df1 output:
- Name Age
0 JOHN 35
1 ALLEN 42
2 BOB 63 78
3 NIKI 29
4 CHARLIE 47
5 CHANG 51
I would like to add a new column to df1 that references ['FT_Team'] from df based upon 'Name' and 'Age' in df1.
I believe that the new code should look something like a .map; however, I am completely stumped as to what the arguments would be for multiple arguments.
df1['FT_Team] =
final output:
- Name Age FT_Team
0 JOHN 35 STEELERS
1 ALLEN 42 SEAHAWKS
2 BOB 63 FALCONS
3 NIKI 29 FALCONS
4 CHARLIE 47 PATRIOTS
5 CHANG 51 STEELERS
Ultimately, I would like to match the football team from df based upon Name AND Age in df1
Per Quang Hoang:
df1=df1.merge(df[['Name','Age','FT_Team']], on=['Name','Age'], how='left')

How to take mean across row in Pandas pivot table Dataframe? [duplicate]

This question already has answers here:
Compute row average in pandas
(5 answers)
Closed 2 years ago.
I have a pandas dataframe as seen below which is a pivot table. I would like to print Africa in 2007 as well as do the mean of the entire Americas row; any ideas how to do this? I have been doing combinations of stack/unstack for a while now to no avail.
year 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
continent
Africa 12 13 15 20 39 25 81 12 22 23 25 44
Americas 12 14 65 10 119 15 21 42 47 84 15 89
Asia 12 13 89 20 39 25 81 29 77 23 25 89
Europe 12 13 15 20 39 25 81 29 23 32 15 89
Oceania 12 13 15 20 39 25 81 27 32 85 25 89
import pandas as pd
df = pd.read_csv('dummy_data.csv')
# handy to see the continent name against the value rather than '0' or '3'
df.set_index('continent', inplace=True)
# print mean for all rows - see how the continent name helps here
print(df.mean(axis=1))
print('---')
print()
# print the mean for just the 'Americas' row
print(df.mean(axis=1)['Americas'])
print('---')
print()
# print the value of the 'Africa' row for the year (column) 2007
print(df.query('continent == "Africa"')['2007'])
print('---')
print()
Output:
continent
Africa 27.583333
Americas 44.416667
Asia 43.500000
Europe 32.750000
Oceania 38.583333
dtype: float64
---
44.416666666666664
---
continent
Africa 44
Name: 2007, dtype: int64
---

how to select multiple columns after grouping by a single column [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 3 years ago.
I want to find the player with max overall rating in each position. What is the best and compact way to do it in pandas?.
Name Overall Potential Club Position
L. Messi 94 94 FC Barcelona RF
Ronaldo 94 94 Juventus ST
Neymar Jr 92 93 Paris Saint-Germain LW
De Gea 91 93 Manchester United GK
K. De Bruyne 91 92 Manchester City RCM
E. Hazard 91 91 Chelsea LF
L. Modrić 91 91 Real Madrid RCM
L. Suárez 91 91 FC Barcelona RS
Sergio Ramos 91 91 Real Madrid RCB
J. Oblak 90 93 Atlético Madrid GK
R. Lewandowski 90 90 FC Bayern München ST
T. Kroos 90 90 Real Madrid LCM
I have tried:
fifa.groupby(by = ["Position"])['Overall'].max()
followed by
fifa.loc[(fifa["Position"] == "CAM") & (fifa['Overall'] == 89),:]
But since there are so many categories in Position, it's a tedious task.
You can try this:
df[df["Overall"]==df["Overall"].max()]
This will help.
Use DataFrame.drop_duplicates(assuming Overall column is sorted):
df = df.drop_duplicates(subset=['Position'], keep='first')
print(df)
Name Overall Potential Club Position
0 L. Messi 94 94 FC Barcelona RF
1 Ronaldo 94 94 Juventus ST
2 Neymar Jr 92 93 Paris Saint-Germain LW
3 De Gea 91 93 Manchester United GK
4 K. De Bruyne 91 92 Manchester City RCM
5 E. Hazard 91 91 Chelsea LF
7 L. Suárez 91 91 FC Barcelona RS
8 Sergio Ramos 91 91 Real Madrid RCB
11 T. Kroos 90 90 Real Madrid LCM
You could merge your intermediate result with the original dataframe to get the full rows:
pd.DataFrame(df.groupby('Position')['Overall'].max()).reset_index().merge(df,
on=['Position', 'Overall'])
It gives:
Position Overall Name Potential Club
0 GK 91 De Gea 93 Manchester United
1 LCM 90 T. Kroos 90 Real Madrid
2 LF 91 E. Hazard 91 Chelsea
3 LW 92 Neymar Jr 93 Paris Saint-Germain
4 RCB 91 Sergio Ramos 91 Real Madrid
5 RCM 91 K. De Bruyne 92 Manchester City
6 RCM 91 L. Modrić 91 Real Madrid
7 RF 94 L. Messi 94 FC Barcelona
8 RS 91 L. Suárez 91 FC Barcelona
9 ST 94 Ronaldo 94 Juventus
You can note the 2 ex-aequo for RCM position.

Count number of counties per state using python {census}

I am troubling with counting the number of counties using famous cenus.csv data.
Task: Count number of counties in each state.
Facing comparing (I think) / Please read below?
I've tried this:
df = pd.read_csv('census.csv')
dfd = df[:]['STNAME'].unique() //Gives out names of state
serr = pd.Series(dfd) // converting to series (from array)
After this, i've tried using two approaches:
1:
df[df['STNAME'] == serr] **//ERROR: series length must match**
2:
i = 0
for name in serr: //This generate error 'Alabama'
df['STNAME'] == name
for i in serr:
serr[i] == serr[name]
print(serr[name].count)
i+=1
Please guide me; it has been three days with this stuff.
Use groupby and aggregate COUNTY using nunique:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('census.csv')
In [3]: unique_counties = df.groupby('STNAME')['COUNTY'].nunique()
Now the results
In [4]: unique_counties
Out[4]:
STNAME
Alabama 68
Alaska 30
Arizona 16
Arkansas 76
California 59
Colorado 65
Connecticut 9
Delaware 4
District of Columbia 2
Florida 68
Georgia 160
Hawaii 6
Idaho 45
Illinois 103
Indiana 93
Iowa 100
Kansas 106
Kentucky 121
Louisiana 65
Maine 17
Maryland 25
Massachusetts 15
Michigan 84
Minnesota 88
Mississippi 83
Missouri 116
Montana 57
Nebraska 94
Nevada 18
New Hampshire 11
New Jersey 22
New Mexico 34
New York 63
North Carolina 101
North Dakota 54
Ohio 89
Oklahoma 78
Oregon 37
Pennsylvania 68
Rhode Island 6
South Carolina 47
South Dakota 67
Tennessee 96
Texas 255
Utah 30
Vermont 15
Virginia 134
Washington 40
West Virginia 56
Wisconsin 73
Wyoming 24
Name: COUNTY, dtype: int64
juanpa.arrivillaga has a great solution. However, the code needs a minor modification.
The "counties" with 'SUMLEV' == 40 or 'COUNTY' == 0 should be filtered. Otherwise, all the number of counties are too big by one.
So, the correct answer should be:
unique_counties = census_df[census_df['SUMLEV'] == 50].groupby('STNAME')['COUNTY'].nunique()
with the following result:
STNAME
Alabama 67
Alaska 29
Arizona 15
Arkansas 75
California 58
Colorado 64
Connecticut 8
Delaware 3
District of Columbia 1
Florida 67
Georgia 159
Hawaii 5
Idaho 44
Illinois 102
Indiana 92
Iowa 99
Kansas 105
Kentucky 120
Louisiana 64
Maine 16
Maryland 24
Massachusetts 14
Michigan 83
Minnesota 87
Mississippi 82
Missouri 115
Montana 56
Nebraska 93
Nevada 17
New Hampshire 10
New Jersey 21
New Mexico 33
New York 62
North Carolina 100
North Dakota 53
Ohio 88
Oklahoma 77
Oregon 36
Pennsylvania 67
Rhode Island 5
South Carolina 46
South Dakota 66
Tennessee 95
Texas 254
Utah 29
Vermont 14
Virginia 133
Washington 39
West Virginia 55
Wisconsin 72
Wyoming 23
Name: COUNTY, dtype: int64
#Bakhtawar - This is a very simple way:
df.groupby(df['STNAME']).count().COUNTY

Categories

Resources