My dataframe contains number of matches for given fixtures, but only for home matches for given team (i.e. number of matches for Argentina-Uruguay matches is 97, but for Uruguay-Argentina this number is 80). In short I want to sum both numbers of home matches for given teams, so that I have the total number of matches between the teams concerned. The dataframe's top 30 rows looks like this:
most_often = mc.groupby(["home_team", "away_team"]).size().reset_index(name="how_many").sort_values(by=['how_many'], ascending = False)
most_often = most_often.reset_index(drop=True)
most_often.head(30)
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
5 Argentina Paraguay 64
6 Belgium Netherlands 63
7 Netherlands Belgium 62
8 England Scotland 59
9 Argentina Brazil 58
10 Brazil Paraguay 58
11 Scotland England 58
12 Norway Sweden 56
13 England Wales 54
14 Sweden Denmark 54
15 Wales Scotland 54
16 Denmark Sweden 53
17 Argentina Chile 53
18 Scotland Wales 52
19 Scotland Northern Ireland 52
20 Sweden Norway 51
21 Wales England 50
22 England Northern Ireland 50
23 Wales Northern Ireland 50
24 Chile Uruguay 49
25 Northern Ireland England 49
26 Brazil Argentina 48
27 Brazil Chile 48
28 Brazil Uruguay 47
29 Chile Peru 46
In turn, I mean something like this
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 107
5 Uganda Kenya 107
6 Belgium Netherlands 105
7 Netherlands Belgium 105
But this is only an example, I want to apply it for every team, which I have on dataframe.
What should I do?
Ok, you can follow steps below.
Here is the initial df.
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
Here you need to create a siorted list that will be the key foraggregations.
df1['sorted_list_team'] = list(zip(df1['home_team'],df1['away_team']))
df1['sorted_list_team'] = df1['sorted_list_team'].apply(lambda x: np.sort(np.unique(x)))
home_team away_team how_many sorted_list_team
0 Argentina Uruguay 97 [Argentina, Uruguay]
1 Uruguay Argentina 80 [Argentina, Uruguay]
2 Austria Hungary 69 [Austria, Hungary]
3 Hungary Austria 68 [Austria, Hungary]
4 Kenya Uganda 65 [Kenya, Uganda]
Here you will covert this list to tuple and turn it able to be aggregations.
def converter(list):
return (*list, )
df1['sorted_list_team'] = df1['sorted_list_team'].apply(converter)
df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()
sorted_list_team how_many
0 (Argentina, Brazil) 106
1 (Argentina, Chile) 53
2 (Argentina, Paraguay) 64
3 (Argentina, Uruguay) 177
4 (Austria, Hungary) 137
Do the aggregation to make a sum of 'how_many' values in another dataframe that i call 'df_sum'.
df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()
sorted_list_team how_many
0 (Argentina, Brazil) 106
1 (Argentina, Chile) 53
2 (Argentina, Paraguay) 64
3 (Argentina, Uruguay) 177
4 (Austria, Hungary) 137
And merge with 'df1' to get the result of a sum, the colum 'how_many' are in both dfs, for this reason pandas rename the column of df_sum as 'how_many_y'
df1 = pd.merge(df1,df_sum[['sorted_list_team','how_many']], on='sorted_list_team',how='left').drop_duplicates()
And final step you need select only columns that you need from result df.
df1 = df1[['home_team','away_team','how_many_y']]
df1 = df1.drop_duplicates()
df1.head()
home_team away_team how_many_y
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 65
I found a relatively straightforward thing that hopefully does what you want, but is slightly different than your desired output. Your output has what looks like repetitive information where we aren't caring anymore about home-vs-away team but just want the game counts, and so let's get rid of that distinction (if we can...).
If we make a new column that combines the values from home_team and away_team in the same order each time, we can just do a sum on the how_many where that new column matches
df['teams'] = pd.Series(map('-'.join,np.sort(df[['home_team','away_team']],axis=1)))
# this creates values like 'Argentina-Brazil' and 'Chile-Peru'
df[['how_many','teams']].groupby('teams').sum()
This code gave me the following:
how_many
teams
Argentina-Brazil 106
Argentina-Chile 53
Argentina-Paraguay 64
Argentina-Uruguay 177
Austria-Hungary 137
Belgium-Netherlands 125
Brazil-Chile 48
Brazil-Paraguay 58
Brazil-Uruguay 47
Chile-Peru 46
Chile-Uruguay 49
Denmark-Sweden 107
England-Northern Ireland 99
England-Scotland 117
England-Wales 104
Kenya-Uganda 65
Northern Ireland-Scotland 52
Northern Ireland-Wales 50
Norway-Sweden 107
Scotland-Wales 106
I have the following pandas dataframe, only showing one column
0 Atlantic Division
1 Tampa Bay Lightning*
2 Boston Bruins*
3 Toronto Maple Leafs*
4 Florida Panthers
5 Detroit Red Wings
6 Montreal Canadiens
7 Ottawa Senators
8 Buffalo Sabres
9 Metropolitan Division
10 Washington Capitals*
11 Pittsburgh Penguins*
12 Philadelphia Flyers*
13 Columbus Blue Jackets*
14 New Jersey Devils*
15 Carolina Hurricanes
16 New York Islanders
17 New York Rangers
18 Central Division
19 Nashville Predators*
20 Winnipeg Jets*
21 Minnesota Wild*
22 Colorado Avalanche*
23 St. Louis Blues
24 Dallas Stars
25 Chicago Blackhawks
26 Pacific Division
27 Vegas Golden Knights*
28 Anaheim Ducks*
29 San Jose Sharks*
30 Los Angeles Kings*
31 Calgary Flames
32 Edmonton Oilers
33 Vancouver Canucks
34 Arizona Coyotes
35 Atlantic Division
36 Montreal Canadiens*
37 Ottawa Senators*
38 Boston Bruins*
39 Toronto Maple Leafs*
40 Tampa Bay Lightning
41 Florida Panthers
42 Detroit Red Wings
43 Buffalo Sabres
44 Metropolitan Division
45 Washington Capitals*
46 Pittsburgh Penguins*
47 Columbus Blue Jackets*
48 New York Rangers*
49 New York Islanders
50 Philadelphia Flyers
51 Carolina Hurricanes
52 New Jersey Devils
53 Central Division
54 Chicago Blackhawks*
55 Minnesota Wild*
56 St. Louis Blues*
57 Nashville Predators*
58 Winnipeg Jets
59 Dallas Stars
60 Colorado Avalanche
61 Pacific Division
62 Anaheim Ducks*
63 Edmonton Oilers*
64 San Jose Sharks*
65 Calgary Flames*
66 Los Angeles Kings
67 Arizona Coyotes
68 Vancouver Canucks
69 Atlantic Division
70 Florida Panthers*
71 Tampa Bay Lightning*
72 Detroit Red Wings*
73 Boston Bruins
74 Ottawa Senators
75 Montreal Canadiens
76 Buffalo Sabres
77 Toronto Maple Leafs
78 Metropolitan Division
79 Washington Capitals*
80 Pittsburgh Penguins*
81 New York Rangers*
82 New York Islanders*
83 Philadelphia Flyers*
84 Carolina Hurricanes
85 New Jersey Devils
86 Columbus Blue Jackets
87 Central Division
88 Dallas Stars*
89 St. Louis Blues*
90 Chicago Blackhawks*
91 Nashville Predators*
92 Minnesota Wild*
93 Colorado Avalanche
94 Winnipeg Jets
95 Pacific Division
96 Anaheim Ducks*
97 Los Angeles Kings*
98 San Jose Sharks*
99 Arizona Coyotes
100 Calgary Flames
101 Vancouver Canucks
102 Edmonton Oilers
103 Atlantic Division
104 Montreal Canadiens*
105 Tampa Bay Lightning*
106 Detroit Red Wings*
107 Ottawa Senators*
108 Boston Bruins
109 Florida Panthers
110 Toronto Maple Leafs
111 Buffalo Sabres
112 Metropolitan Division
113 New York Rangers*
114 Washington Capitals*
115 New York Islanders*
116 Pittsburgh Penguins*
117 Columbus Blue Jackets
118 Philadelphia Flyers
119 New Jersey Devils
120 Carolina Hurricanes
121 Central Division
122 St. Louis Blues*
123 Nashville Predators*
124 Chicago Blackhawks*
125 Minnesota Wild*
126 Winnipeg Jets*
127 Dallas Stars
128 Colorado Avalanche
129 Pacific Division
130 Anaheim Ducks*
131 Vancouver Canucks*
132 Calgary Flames*
133 Los Angeles Kings
134 San Jose Sharks
135 Edmonton Oilers
136 Arizona Coyotes
137 Atlantic Division
138 Boston Bruins*
139 Tampa Bay Lightning*
140 Montreal Canadiens*
141 Detroit Red Wings*
142 Ottawa Senators
143 Toronto Maple Leafs
144 Florida Panthers
145 Buffalo Sabres
146 Metropolitan Division
147 Pittsburgh Penguins*
148 New York Rangers*
149 Philadelphia Flyers*
150 Columbus Blue Jackets*
151 Washington Capitals
152 New Jersey Devils
153 Carolina Hurricanes
154 New York Islanders
155 Central Division
156 Colorado Avalanche*
157 St. Louis Blues*
158 Chicago Blackhawks*
159 Minnesota Wild*
160 Dallas Stars*
161 Nashville Predators
162 Winnipeg Jets
163 Pacific Division
164 Anaheim Ducks*
165 San Jose Sharks*
166 Los Angeles Kings*
167 Phoenix Coyotes
168 Vancouver Canucks
169 Calgary Flames
170 Edmonton Oilers
Name: team, dtype: object
I need to create one additional column with the city name.
At first look the regex would be simple (the first word) should be the city name, and the rest is the team name.
However some cities have 2 words (Los Angeles, St Louis ,etc)
Is there a possibility to do this with regex or it has to be done manually?
Update: I tried the following:
nhl_df['city']=nhl_df['team'].str.extract(r'^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$')
But I get this error:
ValueError: Wrong number of items passed 2, placement implies 1
You can try something like that:
^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$
Here you should look for city name in first or second group.
This pattern uses assumption that first part of two-word city names has no more than 5 symbols. The result might not be so clean, but seems to work fine on given example.
You can use
^([\w.]{1,5}(?:\s\w+)?\w*)
See the regex demo. Details:
^ - start of string
([\w.]{1,5}(?:\s\w+)?\w*) - Capturing group 1:
[\w.]{1,5} - one to five word or dot chars
(?:\s\w+)? - an optional occurrence of a whitespace and then one or more word chars
\w* - zero or more word chars.
Pandas test:
import pandas as pd
nhl_df = pd.DataFrame({"team":["Atlantic Division","Tampa Bay Lightning*","Boston Bruins*","Toronto Maple Leafs*","Florida Panthers","Detroit Red Wings","Montreal Canadiens","Ottawa Senators","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Philadelphia Flyers*","Columbus Blue Jackets*","New Jersey Devils*","Carolina Hurricanes","New York Islanders","New York Rangers","Central Division","Nashville Predators*","Winnipeg Jets*","Minnesota Wild*","Colorado Avalanche*","St. Louis Blues","Dallas Stars","Chicago Blackhawks","Pacific Division","Vegas Golden Knights*","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Calgary Flames","Edmonton Oilers","Vancouver Canucks","Arizona Coyotes","Atlantic Division","Montreal Canadiens*","Ottawa Senators*","Boston Bruins*","Toronto Maple Leafs*","Tampa Bay Lightning","Florida Panthers","Detroit Red Wings","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Columbus Blue Jackets*","New York Rangers*","New York Islanders","Philadelphia Flyers","Carolina Hurricanes","New Jersey Devils","Central Division","Chicago Blackhawks*","Minnesota Wild*","St. Louis Blues*","Nashville Predators*","Winnipeg Jets","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Edmonton Oilers*","San Jose Sharks*","Calgary Flames*","Los Angeles Kings","Arizona Coyotes","Vancouver Canucks","Atlantic Division","Florida Panthers*","Tampa Bay Lightning*","Detroit Red Wings*","Boston Bruins","Ottawa Senators","Montreal Canadiens","Buffalo Sabres","Toronto Maple Leafs","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","New York Rangers*","New York Islanders*","Philadelphia Flyers*","Carolina Hurricanes","New Jersey Devils","Columbus Blue Jackets","Central Division","Dallas Stars*","St. Louis Blues*","Chicago Blackhawks*","Nashville Predators*","Minnesota Wild*","Colorado Avalanche","Winnipeg Jets","Pacific Division","Anaheim Ducks*","Los Angeles Kings*","San Jose Sharks*","Arizona Coyotes","Calgary Flames","Vancouver Canucks","Edmonton Oilers","Atlantic Division","Montreal Canadiens*","Tampa Bay Lightning*","Detroit Red Wings*","Ottawa Senators*","Boston Bruins","Florida Panthers","Toronto Maple Leafs","Buffalo Sabres","Metropolitan Division","New York Rangers*","Washington Capitals*","New York Islanders*","Pittsburgh Penguins*","Columbus Blue Jackets","Philadelphia Flyers","New Jersey Devils","Carolina Hurricanes","Central Division","St. Louis Blues*","Nashville Predators*","Chicago Blackhawks*","Minnesota Wild*","Winnipeg Jets*","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Vancouver Canucks*","Calgary Flames*","Los Angeles Kings","San Jose Sharks","Edmonton Oilers","Arizona Coyotes","Atlantic Division","Boston Bruins*","Tampa Bay Lightning*","Montreal Canadiens*","Detroit Red Wings*","Ottawa Senators","Toronto Maple Leafs","Florida Panthers","Buffalo Sabres","Metropolitan Division","Pittsburgh Penguins*","New York Rangers*","Philadelphia Flyers*","Columbus Blue Jackets*","Washington Capitals","New Jersey Devils","Carolina Hurricanes","New York Islanders","Central Division","Colorado Avalanche*","St. Louis Blues*","Chicago Blackhawks*","Minnesota Wild*","Dallas Stars*","Nashville Predators","Winnipeg Jets","Pacific Division","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Phoenix Coyotes","Vancouver Canucks","Calgary Flames","Edmonton Oilers"]})
nhl_df['city']=nhl_df['team'].str.extract(r'^([\w.]{1,5}(?:\s\w+)?\w*)')
>>> nhl_df
team city
0 Atlantic Division Atlantic
1 Tampa Bay Lightning* Tampa Bay
2 Boston Bruins* Boston
3 Toronto Maple Leafs* Toronto
4 Florida Panthers Florida
.. ... ...
166 Los Angeles Kings* Los Angeles
167 Phoenix Coyotes Phoenix
168 Vancouver Canucks Vancouver
169 Calgary Flames Calgary
170 Edmonton Oilers Edmonton
^\S+(?=\s\S+$)
This regex gives you the first word of all teamnames that only consist of two words. The others you have to sort manually, because there is no way to tell just by pattern if the middle word is part of the city or the teamname.
Try using the below regex
/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-z\*]*))$/
Checkthis
function Replace(str) {
var result = str.replace(/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-z\*]*))$/gim, function (a, $1, $2, $3, $4) {
return `${$2}--${$4}`;
});
return result;
}
Here is my code:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://rolltide.com/roster.aspx?roster=226&path=football', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
results = {}
for num, p in enumerate(soup.select('.sidearm-roster-player')):
results[num] = {'position': p.select_one('.sidearm-roster-player-position >span:first-child').text.strip()
,'Height': p.select_one('.sidearm-roster-player-height').text
,'Weight': p.select_one('.sidearm-roster-player-weight').text
,'Number': p.select_one('.sidearm-roster-player-jersey-number').text
,'Name': p.select_one('.sidearm-roster-player-name a').text
,'Year': p.select_one('.sidearm-roster-player-academic-year').text
,'Hometown': p.select_one('.sidearm-roster-player-hometown').text
,'Highschool': p.select_one('.sidearm-roster-player-highschool').text
}
df = pd.DataFrame(results.values(), columns = ['Number','Name','Position','Height','Year','Hometown','Highschool'])
df.to_excel(r'desktop\Alabama.xlsx', index=False)
It scrapes everything but 'Number' and 'Position' and I can't figure out why. Any idea what's going wrong?
you could just use pandas builtin read_html function:
df = pd.read_html('https://rolltide.com/sports/football/roster/2019')[2] #one you're after is the 3rd table being scraped
result:
# Full Name Pos. Ht. Wt. Academic Year Hometown / High School
0 1 Ben Davis LB 6-4 243 R-Jr. Gordo, Ala. / Gordo
1 2 Keilan Robinson RB 5-9 190 Fr. Washington, D.C. / St. John's
2 2 Patrick Surtain II DB 6-2 203 So. Plantation, Fla. / American Heritage
3 3 Daniel Wright DB 6-1 190 R-So. Fort Lauderdale, Fla. / Boyd Anderson
4 4 Christopher Allen LB 6-4 250 R-So. Baton Rouge, La. / Southern Lab School
5 4 Jerry Jeudy WR 6-1 192 Jr. Deerfield Beach, Fla. / Deerfield Beach
6 5 Shyheim Carter DB 6-0 191 Sr. Kentwood, La. / Kentwood
7 5 Taulia Tagovailoa QB 5-11 208 Fr. Ewa Beach, Hawai'i / Thompson
8 6 DeVonta Smith WR 6-1 175 Jr. Amite, La. / Amite
9 7 Braxton Barker QB 6-1 202 R-Fr. Birmingham, Ala. / Spain Park
10 7 Trevon Diggs DB 6-2 207 Sr. Gaithersburg, Md. / Avalon School
11 8 Christian Harris LB 6-2 244 Fr. Baton Rouge, La. / University Lab
12 8 John Metchie III WR 6-0 195 Fr. Brampton, Canada / St. James School (Md.)
13 9 Jordan Battle DB 6-1 201 Fr. Fort Lauderdale, Fla. / St. Thomas Aquinas
14 9 Xavier Williams WR 6-1 195 R-Fr. Hollywood, Fla. / Chaminade-Madonna Prep
15 10 Mac Jones QB 6-2 205 R-So. Jacksonville, Fla. / The Bolles School
16 11 Scooby Carter DB 6-0 186 Fr. Mansfield, Texas / Mansfield Legacy
17 11 Henry Ruggs III WR 6-0 190 Jr. Montgomery, Ala. / Lee
18 12 Skyler DeLong P 6-4 188 So. Fort Mill, S.C. / Nation Ford
19 12 Chadarius Townsend RB/WR 6-0 194 R-So. Tanner, Ala. / Tanner
20 13 Tua Tagovailoa QB 6-1 218 Jr. Ewa Beach, Hawai'i / St. Louis
21 14 Tyrell Shavers WR 6-6 205 R-So. Lewisville, Texas / Lewisville
22 14 Brandon Turnage DB 6-1 185 Fr. Oxford, Miss. / Lafayette
23 15 Xavier McKinney DB 6-1 200 Jr. Roswell, Ga. / Roswell
24 15 Paul Tyson QB 6-5 220 Fr. Trussville, Ala. / Hewitt-Trussville
25 16 Jayden George QB 6-3 192 Fr. Indianapolis, Ind. / Warren Central
26 16 Will Reichard PK 6-1 180 Fr. Hoover, Ala. / Hoover
27 17 Jaylen Waddle WR 5-10 182 So. Houston, Texas / Episcopal
28 18 Slade Bolden WR 5-11 191 R-Fr. West Monroe, La. / West Monroe
29 19 Jahleel Billingsley TE 6-4 228 Fr. Chicago, Ill. / Phillips Academy
.. ... ... ... ... ... ... ...
94 76 Scott Lashley OL 6-7 307 R-Jr. West Point, Miss. / West Point
95 77 Matt Womack OL 6-7 325 R-Sr. Hernando, Miss. / Magnolia Heights
96 78 Amari Kight OL 6-7 302 Fr. Alabaster, Ala. / Thompson
97 80 Michael Parker TE 6-6 216 R-Fr. Huntsville, Ala. / Westminster Christian
98 81 Cameron Latu TE 6-5 247 R-Fr. Salt Lake City, Utah / Olympus
99 82 Richard Hunt TE 6-7 235 Fr. Memphis, Tenn. / Briarcrest Christian
100 83 John Parker WR 6-0 190 Sr. Huntsville, Ala. / Westminster Christian
101 84 Joshua Lanier WR 5-11 160 Sr. Tuscaloosa, Ala. / Tuscaloosa Academy
102 84/79 Chris Owens OL 6-3 315 R-Jr. Arlington, Texas / Lamar
103 85 Drew Kobayashi WR 6-2 200 R-Jr. Honolulu, Hawai'i / St. Louis
104 85/60 Kendall Randolph TE/OL 6-4 296 R-So. Madison, Ala. / Bob Jones
105 86 Connor Adams DB 6-1 194 Sr. Sugar Land, Texas / Austin
106 86 Quindarius Watkins TE 6-4 230 Jr. Fort Stewart, Ga. / Bradwell Institute
107 87 Miller Forristall TE 6-5 242 R-Jr. Cartersville, Ga. / Cartersville
108 88 Major Tennison TE 6-5 248 R-So. Flint, Texas / Bullard
109 89 Grant Krieger WR 6-2 192 Fr. Pittsburgh, Pa. / Pine-Richland
110 89 LaBryan Ray DL 6-5 292 Jr. Madison, Ala. / James Clemens
111 90 Stephon Wynn Jr. DL 6-4 311 R-Fr. Anderson, S.C. / IMG Academy
112 91 Tevita Musika DL 6-1 338 Sr. Milpitas, Calif. / Milpitas/San Mateo J.C.
113 92 Justin Eboigbe DL 6-5 294 Fr. Forest Park, Ga. / Forest Park
114 93 Landon Bothwell DL 5-11 220 So. Oneonta, Ala. / Oneonta
115 93 Tripp Slyman PK/P 6-1 180 R-Fr. Huntsville, Ala. / Randolph
116 94 DJ Dale DL 6-3 308 Fr. Birmingham, Ala. / Clay-Chalkville
117 95 Jack Martin P 6-0 206 Fr. Mobile, Ala. / McGill-Toolen
118 95 Ishmael Sopsher DL 6-4 334 Fr. Amite, La. / Amite
119 96 Taylor Wilson DL 6-0 232 Sr. Huntington Beach, Calif. / Mater Dei
120 97 Joseph Bulovas PK 6-0 203 R-So. Mandeville, La. / Mandeville
121 98 Mike Bernier P 6-2 219 R-Sr. Madison, Ala. / Bob Jones/ Eastern Illinois
122 99 Raekwon Davis DL 6-7 312 Sr. Meridian, Miss. / Meridian
123 99 Ty Perine PK/P 6-1 190 Fr. Prattville, Ala. / Prattville
For the number you do have results just replace:
,'Number': p.select_one('.sidearm-roster-player-jersey-number').text
with
,'Number': p.select_one('.sidearm-roster-player-jersey-number').text.strip()
For the position this is because you have a discrepancy between the way you wrote it.
change :
'position': p.select_one('.sidearm-roster-player-position >span:first-child').text.strip()
to
'Position': p.select_one('.sidearm-roster-player-position >span:first-child').text.strip()
I have Fifa dataset and it includes information about football players. One of the features of this dataset is the value of football players but it is in string form such as "$300K" or "$50M". How can I delete simply these euro and "M, K" symbol and write their values in same units?
import numpy as np
import pandas as pd
location = r'C:\Users\bemrem\Desktop\Python\fifa\fifa_dataset.csv'
_dataframe = pd.read_csv(location)
_dataframe = _dataframe.dropna()
_dataframe = _dataframe.reset_index(drop=True)
_dataframe = _dataframe[['Name', 'Value', 'Nationality', 'Age', 'Wage',
'Overall', 'Potential']]
_array = ['Belgium', 'France', 'Brazil', 'Croatia', 'England',' Portugal',
'Uruguay', 'Switzerland', 'Spain', 'Denmark']
_dataframe = _dataframe.loc[_dataframe['Nationality'].isin(_array)]
_dataframe = _dataframe.reset_index(drop=True)
print(_dataframe.head())
print()
print(_dataframe.tail())
I tried to convert this Value column but I failed. This is what I get
Name Value Nationality Age Wage Overall Potential
0 Neymar €123M Brazil 25 €280K 92 94
1 L. Suárez €97M Uruguay 30 €510K 92 92
2 E. Hazard €90.5M Belgium 26 €295K 90 91
3 Sergio Ramos €52M Spain 31 €310K 90 90
4 K. De Bruyne €83M Belgium 26 €285K 89 92
Name Value Nationality Age Wage Overall Potential
4931 A. Kilgour €40K England 19 €1K 47 56
4932 R. White €60K England 18 €2K 47 65
4933 T. Sawyer €50K England 18 €1K 46 58
4934 J. Keeble €40K England 18 €1K 46 56
4935 J. Lundstram €60K England 18 €1K 46 64
But I want to my output looks like this:
Name Value Nationality Age Wage Overall Potential
0 Neymar 123 Brazil 25 €280K 92 94
1 L. Suárez 97 Uruguay 30 €510K 92 92
2 E. Hazard 90.5 Belgium 26 €295K 90 91
3 Sergio Ramos 52 Spain 31 €310K 90 90
4 K. De Bruyne 83 Belgium 26 €285K 89 92
Name Value Nationality Age Wage Overall Potential
4931 A. Kilgour 0.04 England 19 €1K 47 56
4932 R. White 0.06 England 18 €2K 47 65
4933 T. Sawyer 0.05 England 18 €1K 46 58
4934 J. Keeble 0.04 England 18 €1K 46 56
4935 J. Lundstram 0.06 England 18 €1K 46 64
I do not have enough reputation to flag an answer as a duplicate. However, I believe that this will solve your particular question in addition to providing a solution if there is no "K" or "M" in your string.
You will also need to replace $ with € in the regex.
Convert the string 2.90K to 2900 or 5.2M to 5200000 in pandas dataframe
I am troubling with counting the number of counties using famous cenus.csv data.
Task: Count number of counties in each state.
Facing comparing (I think) / Please read below?
I've tried this:
df = pd.read_csv('census.csv')
dfd = df[:]['STNAME'].unique() //Gives out names of state
serr = pd.Series(dfd) // converting to series (from array)
After this, i've tried using two approaches:
1:
df[df['STNAME'] == serr] **//ERROR: series length must match**
2:
i = 0
for name in serr: //This generate error 'Alabama'
df['STNAME'] == name
for i in serr:
serr[i] == serr[name]
print(serr[name].count)
i+=1
Please guide me; it has been three days with this stuff.
Use groupby and aggregate COUNTY using nunique:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('census.csv')
In [3]: unique_counties = df.groupby('STNAME')['COUNTY'].nunique()
Now the results
In [4]: unique_counties
Out[4]:
STNAME
Alabama 68
Alaska 30
Arizona 16
Arkansas 76
California 59
Colorado 65
Connecticut 9
Delaware 4
District of Columbia 2
Florida 68
Georgia 160
Hawaii 6
Idaho 45
Illinois 103
Indiana 93
Iowa 100
Kansas 106
Kentucky 121
Louisiana 65
Maine 17
Maryland 25
Massachusetts 15
Michigan 84
Minnesota 88
Mississippi 83
Missouri 116
Montana 57
Nebraska 94
Nevada 18
New Hampshire 11
New Jersey 22
New Mexico 34
New York 63
North Carolina 101
North Dakota 54
Ohio 89
Oklahoma 78
Oregon 37
Pennsylvania 68
Rhode Island 6
South Carolina 47
South Dakota 67
Tennessee 96
Texas 255
Utah 30
Vermont 15
Virginia 134
Washington 40
West Virginia 56
Wisconsin 73
Wyoming 24
Name: COUNTY, dtype: int64
juanpa.arrivillaga has a great solution. However, the code needs a minor modification.
The "counties" with 'SUMLEV' == 40 or 'COUNTY' == 0 should be filtered. Otherwise, all the number of counties are too big by one.
So, the correct answer should be:
unique_counties = census_df[census_df['SUMLEV'] == 50].groupby('STNAME')['COUNTY'].nunique()
with the following result:
STNAME
Alabama 67
Alaska 29
Arizona 15
Arkansas 75
California 58
Colorado 64
Connecticut 8
Delaware 3
District of Columbia 1
Florida 67
Georgia 159
Hawaii 5
Idaho 44
Illinois 102
Indiana 92
Iowa 99
Kansas 105
Kentucky 120
Louisiana 64
Maine 16
Maryland 24
Massachusetts 14
Michigan 83
Minnesota 87
Mississippi 82
Missouri 115
Montana 56
Nebraska 93
Nevada 17
New Hampshire 10
New Jersey 21
New Mexico 33
New York 62
North Carolina 100
North Dakota 53
Ohio 88
Oklahoma 77
Oregon 36
Pennsylvania 67
Rhode Island 5
South Carolina 46
South Dakota 66
Tennessee 95
Texas 254
Utah 29
Vermont 14
Virginia 133
Washington 39
West Virginia 55
Wisconsin 72
Wyoming 23
Name: COUNTY, dtype: int64
#Bakhtawar - This is a very simple way:
df.groupby(df['STNAME']).count().COUNTY