Unable to scrape 2nd table from Fbref.com - python

I would like to scrape the 2nd table in the page seen below from the link - https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard
on google collab.
but pd.read_html("https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard") only gives me the first table.
Please help me understand where I am going wrong.
Snippet of page

This is one way to read that data:
import pandas as pd
import requests
url= 'https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard'
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
df = pd.read_html(response, header=1)[2]
print(df)
Result in terminal:
Rk Player Nation Pos Squad Age Born MP Starts Min 90s Gls Ast G-PK PK PKatt CrdY CrdR Gls.1 Ast.1 G+A G-PK.1 G+A-PK Matches
0 1 Sahal Abdul Samad in IND MF Kerala Blasters 24 1997 20 19 1443 16.0 5 1 5 0 0 0 0 0.31 0.06 0.37 0.31 0.37 Matches
1 2 Ayush Adhikari in IND MF Kerala Blasters 21 2000 14 6 540 6.0 0 0 0 0 0 3 1 0.00 0.00 0.00 0.00 0.00 Matches
2 3 Gani Ahammed Nigam in IND FW NorthEast Utd 23 1998 6 0 66 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
3 4 Airam es ESP FW Goa 33 1987 13 8 751 8.3 6 1 5 1 2 0 0 0.72 0.12 0.84 0.60 0.72 Matches
4 5 Alex br BRA MF Jamshedpur 32 1988 20 12 1118 12.4 1 4 1 0 0 2 0 0.08 0.32 0.40 0.08 0.40 Matches
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
302 292 João Victor br BRA MF Hyderabad FC 32 1988 18 18 1590 17.7 5 1 3 2 2 3 0 0.28 0.06 0.34 0.17 0.23 Matches
303 293 David Williams au AUS FW Mohun Bagan 33 1988 15 6 602 6.7 4 1 4 0 1 2 0 0.60 0.15 0.75 0.60 0.75 Matches
304 294 Banana Yaya cm CMR DF Bengaluru 30 1991 5 2 229 2.5 0 1 0 0 0 1 0 0.00 0.39 0.39 0.00 0.39 Matches
305 295 Joe Zoherliana in IND DF NorthEast Utd 22 1999 9 6 677 7.5 0 1 0 0 0 0 0 0.00 0.13 0.13 0.00 0.13 Matches
306 296 Mark Zothanpuia in IND MF Hyderabad FC 19 2002 3 0 63 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
307 rows × 24 columns
Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

Related

get value from one column as variable for subtraction

i have a data frame with XY's and distances. what i am trying to do is store the distance as a variable and subtract it from the next distance if X or Y has a value greater than 0
here is a sample df
dist x y
0 12.93 99.23
200 0 0
400 0 0
600 0 0
800 0 0
1000 12.46 99.14
1200 0 0
1400 0 0
1600 0 0
1800 0 0
2000 12.01 99.07
and this is new df
dist x y
0 12.93 99.23
200 0 0
400 0 0
600 0 0
800 0 0
0 12.46 99.14
200 0 0
400 0 0
600 0 0
800 0 0
2000 12.01 99.07
the last value doesn't matter, but technically, it would be 0.
the idea is that at every know XY, assign the distance as 0 and subtract that distance until the next known XY
in the above example, the distances are rounded numbers, but in reality, they could be like
132.05
19.999
1539.65
and so on
Check with transform
df.dist-=df.groupby(df.x.ne(0).cumsum())['dist'].transform('first')
df
Out[769]:
dist x y
0 0 12.93 99.23
1 200 0.00 0.00
2 400 0.00 0.00
3 600 0.00 0.00
4 800 0.00 0.00
5 0 12.46 99.14
6 200 0.00 0.00
7 400 0.00 0.00
8 600 0.00 0.00
9 800 0.00 0.00
10 0 12.01 99.07
You can use groupby and apply, using a custom grouper calculated as follows:
grouper = (df['x'].ne(0) | df['y'].ne(0)).cumsum()
df['dist'].groupby(grouper).apply(lambda x: x - x.values[0])
0 0
1 200
2 400
3 600
4 800
5 0
6 200
7 400
8 600
9 800
10 0
Name: dist, dtype: int64
Where,
grouper
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 3
dtype: int64
The idea is to mark all rows that must be subtracted from the first non-zero value of that corresponding group.
With where + ffill
df['dist'] = df.dist - df.where(df.x.gt(0) | df.y.gt(0)).dist.ffill()
dist x y
0 0.0 12.93 99.23
1 200.0 0.00 0.00
2 400.0 0.00 0.00
3 600.0 0.00 0.00
4 800.0 0.00 0.00
5 0.0 12.46 99.14
6 200.0 0.00 0.00
7 400.0 0.00 0.00
8 600.0 0.00 0.00
9 800.0 0.00 0.00
10 0.0 12.01 99.07

Pandas: find the n lowest values each m rows

I have a dataframe with a Counter, increasing by 1 each 24 rows, and a value column, like below.
value counter
0 0.00 1
1 0.00 1
2 0.00 1
3 0.00 1
4 0.00 1
5 0.00 1
6 0.00 1
7 0.00 1
8 55.00 1
9 90.00 1
10 49.27 1
11 51.80 1
12 49.06 1
13 43.46 1
14 45.96 1
15 43.95 1
16 45.00 1
17 43.97 1
18 42.00 1
19 41.14 1
20 43.92 1
21 51.74 1
22 40.85 1
23 0.00 2
24 0.00 2
25 0.00 2
26 0.00 2
27 0.00 2
28 0.00 2
29 0.00 2
... ... ...
187 82.38 9
188 66.89 9
189 59.83 9
190 52.46 9
191 40.48 9
192 28.87 9
193 41.90 9
194 42.56 9
195 40.93 9
196 40.02 9
197 36.54 9
198 33.70 9
199 38.99 9
200 46.10 9
201 44.82 9
202 0.00 9
203 0.00 9
204 0.00 9
205 0.00 9
206 0.00 9
207 0.00 10
208 0.00 10
209 0.00 10
210 74.69 10
211 89.20 10
212 74.59 10
213 55.11 10
214 58.39 10
215 40.81 10
216 45.06 10
I would like to know if there is a way to create a third column with the 4 lowest values in each Group where the Counter has the same value. See below an example for the first Group with Count=1:
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
I know about some functions like nsmallest(n,'column') but I don't know how to limit it with the Count grouping
Any idea? thank you in advance! .
I think you need first filter out rows with 0 values in value, sorting by sort_values and get DataFrame.head for top 4 values, last add reindex for filling 0 for not matched values:
df['value 2'] = (df[df['value'] != 0]
.sort_values('value')
.groupby('counter')['value'].head(4)
.reindex(df.index, fill_value=0))
print (df)
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
23 0.00 2 0.00
24 0.00 2 0.00
25 0.00 2 0.00
26 0.00 2 0.00
27 0.00 2 0.00

How to create multiple spacing CSV from pandas?

I have a pandas dataframe and want to output a text file separated by different spacing for input to other model. How can I do that?
The sample OUTPUT text file is as follow (each columns in the text file correspond to columns in df):
SO HOUREMIS 92 5 1 1 MC12 386.91 389.8 11.45
SO HOUREMIS 92 5 1 1 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED2 322.00 397.4 13.00
SO HOUREMIS 92 5 1 1 HL2 25.55 464.3 7.46
SO HOUREMIS 92 5 1 1 WC1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 WC2 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC12 405.35 389.3 11.54
SO HOUREMIS 92 5 1 2 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED2 319.90 396.3 13.00
After referring to this post. I found the solution:
fmt = '%0s %+1s %+1s %+2s %+2s %+2s %+6s %+15s'
np.savetxt('test.txt', data.values[0:10], fmt=fmt)
I can format each columns and specify how many spacing and the alignment.

Fill zero values for combinations of unique multi-index values after groupby

To better explain by problem better lets pretend i have a shop with 3 unique customers and my dataframe contains every purchase of my customers with weekday, name and paid price.
name price weekday
0 Paul 18.44 0
1 Micky 0.70 0
2 Sarah 0.59 0
3 Sarah 0.27 1
4 Paul 3.45 2
5 Sarah 14.03 2
6 Paul 17.21 3
7 Micky 5.35 3
8 Sarah 0.49 4
9 Micky 17.00 4
10 Paul 2.62 4
11 Micky 17.61 5
12 Micky 10.63 6
The information i would like to get is the average price per unique customer per weekday. What i often do in similar situations is to group by several columns with sum and then take the average of a subset of the columns.
df = df.groupby(['name','weekday']).sum()
price
name weekday
Micky 0 0.70
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
2 3.45
3 17.21
4 2.62
Sarah 0 0.59
1 0.27
2 14.03
4 0.49
df = df.groupby(['weekday']).mean()
price
weekday
0 6.576667
1 0.270000
2 8.740000
3 11.280000
4 6.703333
5 17.610000
6 10.630000
Of course this only works if all my unique customers would have at least one purchase per day.
Is there an elegant way to get a zero value for all combinations between unique index values that have no sum after the first groupby?
My solutions has been so far to either to reindex on a multi index i created from the unique values of the grouped columns or the combination of unstack-fillna-stack but both solutions do not really satisfy me.
Appreciate your help!
IIUC, let's use unstack and fillna then stack:
df_out = df.groupby(['name','weekday']).sum().unstack().fillna(0).stack()
Output:
price
name weekday
Micky 0 0.70
1 0.00
2 0.00
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
1 0.00
2 3.45
3 17.21
4 2.62
5 0.00
6 0.00
Sarah 0 0.59
1 0.27
2 14.03
3 0.00
4 0.49
5 0.00
6 0.00
And,
df_out.groupby('weekday').mean()
Output:
price
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
I think you can use pivot_table to do all the steps at once. I'm not exactly sure what you want but the default aggregation from pivot_table is the mean. You can change it to 'sum'.
df1 = df.pivot_table(index='name', columns='weekday', values='price',
fill_value=0, aggfunc='sum')
weekday 0 1 2 3 4 5 6
name
Micky 0.70 0.00 0.00 5.35 17.00 17.61 10.63
Paul 18.44 0.00 3.45 17.21 2.62 0.00 0.00
Sarah 0.59 0.27 14.03 0.00 0.49 0.00 0.00
And then take the mean of each column.
df1.mean()
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333

Inner join on 2 columns for two dataframes python

I have 2 dataframes named geostat and geostat_query. I am trying to do inner join on 2 columns. The code that I have written is giving me empty result.
My dataframes are:
geostat:
STATE COUNT PERCENT state pool number STATE CODE
0 0.00 251 CA
1 0.00 252 CA
2 0.00 253 CA
3 0.00 787 CA
4 0.00 789 CA
5 0.00 4401 CA
6 0.00 4402 CA
7 0.00 4403 CA
8 0.00 4404 CA
9 0.00 4406 CA
10 0.00 4568 CA
11 0.00 4569 FL
12 0.00 4576 CA
13 0.00 4577 CA
14 0.00 4578 CA
15 0.00 4579 CA
16 0.00 4580 CA
17 0.00 4581 CA
18 0.00 4582 CA
19 0.00 4584 CA
20 0.00 4585 CA
21 0.00 4588 CA
22 0.00 4589 CA
23 0.00 4591 CA
24 0.00 4592 CA
25 0.00 4593 CA
26 0.00 4594 FL
27 0.00 4595 CA
28 0.00 4595 FL
29 0.00 6221 MS
30 0.00 817085 GA
31 0.03 817085 IL
32 0.03 817085 IN
33 0.03 817085 MA
34 0.03 817085 ME
35 0.07 817085 MI
36 0.07 817085 MO
37 0.03 817085 NE
38 0.07 817085 OH
39 0.03 817085 PA
40 0.03 817085 SC
41 0.03 817085 SD
42 0.03 817085 TX
43 0.07 817085 WI
44 0.08 817094 AL
45 0.09 817094 CA
geostat_query:
MaxOfState count percent state pool number
0 100 251
1 100 252
2 100 253
3 100 787
4 100 789
5 100 4401
6 100 4402
7 100 4403
8 100 4404
9 100 4406
10 100 4568
11 100 4569
12 100 4576
13 100 4577
14 100 4578
15 100 4579
16 100 4580
17 100 4581
18 100 4582
19 100 4584
20 100 4585
21 100 4588
22 100 4589
23 100 4591
24 100 4592
25 100 4593
26 100 4594
27 75 4595
28 100 6221
29 100 8194
The code I wrote is :
geomerge = geostat.merge(geostat_query, left_on=['STATE COUNT PERCENT','state pool number'], right_on=['MaxOfState count percent','state pool number'],how='inner')
But this gives me empty result. I dont understand where am I going wrong?

Categories

Resources