How to title a pandas dataframe - python

I have the following code that prints descriptive statistics via df.describe() for each class of a categorical variable:
for i in list(merged.Response.unique()):
    print(merged[(merged.Response==i)].describe().round(2))
and it returns
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 687.00 687.00 687.00 687.00 687.00
mean 24.75 13.45 4.56 9.61 243.91
std 7.04 3.35 0.17 1.95 107.45
min 11.00 7.00 4.13 5.85 83.27
25% 20.00 11.00 4.45 8.18 167.44
50% 24.00 13.00 4.57 9.34 213.08
75% 29.00 15.00 4.67 10.51 289.74
max 51.00 24.00 4.97 15.75 700.80
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 1099.0 1099.00 1099.00 1099.00 1099.00
mean 17.2 6.85 4.08 5.18 97.88
std 12.8 2.47 0.24 1.45 101.26
min 1.0 2.00 3.24 2.40 5.72
25% 7.0 5.00 3.89 4.12 31.38
50% 14.0 7.00 4.13 5.21 62.58
75% 24.0 8.00 4.22 5.86 130.90
max 55.0 21.00 4.91 13.46 686.46
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 392.00 392.00 392.00 392.00 392.00
mean 12.41 11.46 4.44 10.13 125.04
std 3.75 3.34 0.19 1.94 43.91
min 3.00 6.00 4.02 6.98 36.92
25% 10.00 9.00 4.31 8.71 92.68
50% 13.00 10.00 4.38 9.30 121.58
75% 15.00 13.00 4.51 11.00 148.64
max 26.00 22.00 4.94 16.25 266.56
Is there any way I can title each summary table so I know which class is which?
I tried the following with the pandas Styler, but despite captioning the dataframe, it only printed one of them, and it doesn't look as good (I'm in Google Colab, btw):
for i in list(merged.Response.unique()):
    test = merged[(merged.Response==i)].describe().round(2).style.set_caption(i)
test
AmznPrime
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 392.000000 392.000000 392.000000 392.000000 392.000000
mean 12.410000 11.460000 4.440000 10.130000 125.040000
std 3.750000 3.340000 0.190000 1.940000 43.910000
min 3.000000 6.000000 4.020000 6.980000 36.920000
25% 10.000000 9.000000 4.310000 8.710000 92.680000
50% 13.000000 10.000000 4.380000 9.300000 121.580000
75% 15.000000 13.000000 4.510000 11.000000 148.640000
max 26.000000 22.000000 4.940000 16.250000 266.560000
All help is appreciated. Thanks!

How about:
merged.groupby("Response").describe().round(2)
To match your expected output, do stack/unstack:
merged.groupby("Response").describe().stack(level=1).unstack(level=0)


pandas categorical doesn't sort multiindex

I've pulled some data from SQL as a CSV:
Year,Decision,Residency,Class,Count
2019,Applied,Resident,Freshmen,1143
2019,Applied,Resident,Transfer,404
2019,Applied," ",Grad/Postbacc,418
2019,Applied,Non-Resident,Freshmen,1371
2019,Applied,Non-Resident,Transfer,371
2019,Admitted,Resident,Freshmen,918
2019,Admitted,Resident,Transfer,358
2019,Admitted," ",Grad/Postbacc,311
2019,Admitted,Non-Resident,Freshmen,1048
2019,Admitted,Non-Resident,Transfer,313
2020,Applied,Resident,Freshmen,1094
2020,Applied,Resident,Transfer,406
2020,Applied," ",Grad/Postbacc,374
2020,Applied,Non-Resident,Freshmen,1223
2020,Applied,Non-Resident,Transfer,356
2020,Admitted,Resident,Freshmen,1003
2020,Admitted,Resident,Transfer,354
2020,Admitted," ",Grad/Postbacc,282
2020,Admitted,Non-Resident,Freshmen,1090
2020,Admitted,Non-Resident,Transfer,288
I've written a transform as follows:
import numpy as np
import pandas as pd

data = pd.read_csv("Data.csv")
#Categorize the rows
data["Class"] = pd.Categorical(data["Class"],["Freshmen","Transfer","Grad/Postbacc","Grand"],ordered=True)
data["Decision"] = pd.Categorical(data["Decision"],["Applied","Admitted"],ordered=True)
data["Residency"] = pd.Categorical(data["Residency"],["Resident","Non-Resident"],ordered=True)
#Subtotal classes
tmp = data.groupby(["Year","Class","Decision"],sort=False)[["Count"]].sum()
tmp["Residency"] = "Total"
tmp.reset_index(inplace=True)
tmp = pd.concat([data,tmp],ignore_index=True)
#Grand total
tmp2 = data.groupby(["Year","Decision"],sort=False)[["Count"]].sum()
tmp2["Class"] = "Grand"
tmp2["Residency"] = "Total"
tmp2.reset_index(inplace=True)
tmp = pd.concat([tmp,tmp2],ignore_index=True)
#Crosstab it
tmp = pd.crosstab(index=[tmp["Year"],tmp["Class"],tmp["Residency"]],
                  columns=[tmp["Decision"]],
                  values=tmp["Count"],
                  aggfunc="sum")
tmp = tmp.loc[~(tmp==0).all(axis=1)]
tmp["%"] = np.round(100*tmp["Admitted"]/tmp["Applied"],1)
tmp = tmp.stack().unstack(["Year","Decision"])
print(tmp)
and it outputs as follows:
Year 2019 2020
Decision Applied Admitted % Applied Admitted %
Class Residency
Freshmen Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4
Transfer Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
Resident 404.0 358.0 88.6 406.0 354.0 87.2
Total 775.0 671.0 86.6 762.0 642.0 84.3
Expected output is
Year 2019 2020
Decision Applied Admitted % Applied Admitted %
Class Residency
Freshmen Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
Transfer Resident 404.0 358.0 88.6 406.0 354.0 87.2
Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
Total 775.0 671.0 86.6 762.0 642.0 84.3
Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4
The categories sort themselves correctly right up until I throw the DataFrame into pd.crosstab, at which point it all falls apart. What's going on, and how do I fix it?
I couldn't fix your code, but I got the expected result doing this:
import pandas as pd
df = pd.read_csv("Data.csv")
df["Class"] = pd.Categorical(df["Class"],["Freshmen","Transfer","Grad/Postbacc","Grand"],ordered=True)
df["Decision"] = pd.Categorical(df["Decision"],["Applied","Admitted","%"],ordered=True)
df["Residency"] = pd.Categorical(df["Residency"],["Resident","Non-Resident"," "],ordered=True)
df_grouped = df.groupby(['Year', 'Decision', 'Class', 'Residency'],as_index=False)['Count'].sum()
df_pivot = df_grouped.pivot_table(columns=["Year","Decision"],index=["Class","Residency"], values="Count",aggfunc='sum')
#Create subtotal for rows
df_totals = pd.concat([y.append(y.sum().rename((x, 'Total'))) for x, y in df_pivot.groupby(level=0)]).append(df_pivot.sum().rename(('Grand', 'Total')))
#Drop not wanted rows
df_totals = df_totals[~(df_totals.values == 0).all(axis=1)].drop_duplicates(keep="last")
#Calculate "%" columns
for year in df_totals.columns.get_level_values('Year').unique():
    df_totals[year, '%'] = (100 * df_totals[year, 'Admitted'] / df_totals[year, 'Applied']).round(1)
df_totals
Output:
Year 2019 2020
Decision Applied Admitted % Applied Admitted %
Class Residency
Freshmen Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
Transfer Resident 404.0 358.0 88.6 406.0 354.0 87.2
Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
Total 775.0 671.0 86.6 762.0 642.0 84.3
Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4
Note: I got a deprecation warning about df.append().
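That warning is because DataFrame.append() is deprecated (it was removed entirely in pandas 2.0). A concat-based sketch of the same subtotal step, assuming the tuple row labels fold back into the MultiIndex the way they did with append:

# Subtotal each Class, then add a grand total, without DataFrame.append()
pieces = []
for name, group in df_pivot.groupby(level=0):
    total = group.sum().rename((name, 'Total')).to_frame().T
    pieces.append(pd.concat([group, total]))
grand = df_pivot.sum().rename(('Grand', 'Total')).to_frame().T
df_totals = pd.concat(pieces + [grand])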

How to groupby in pandas and return dataframe instead of series?

print(df.sample(50)):
match_datetime country league home_team away_team home_odds draw_odds away_odds run_time home_score away_score
72170 2021-10-17 12:30:00 Ukraine Persha Liga Alliance Uzhhorod 1.22 5.62 9.71 2021-10-17 09:22:20.212731 NaN NaN
100398 2021-11-02 14:35:00 Saudi Arabia Division 1 Al Qadisiya Bisha 1.61 3.61 4.94 2021-11-02 09:13:18.768604 2.0 1.0
33929 2021-09-11 23:00:00 Panama LPF Veraguas Plaza Amador 2.75 2.75 2.71 2021-09-10 23:47:54.682982 1.0 1.0
12328 2021-08-15 15:30:00 Poland Ekstraklasa Slask Wroclaw Leczna 1.74 3.74 4.59 2021-08-14 22:44:26.136608 0.0 0.0
81500 2021-10-24 13:00:00 Italy Serie D - Group A Caronnese Saluzzo 1.69 3.60 4.28 2021-10-23 13:37:16.920175 2.0 2.0
143370 2021-12-05 14:00:00 Poland Division 1 Chrobry Glogow Widzew Lodz 3.36 3.17 2.15 2021-11-30 17:40:24.833519 0.0 0.0
175061 2022-01-08 18:00:00 Spain Primera RFEF - Group 1 R. Union Extremadura UD 1.26 4.40 18.00 2022-01-08 17:00:46.662761 0.0 1.0
21293 2021-08-29 16:00:00 Italy Serie B Cittadella Crotone 2.32 3.11 3.31 2021-08-26 18:04:46.221393 4.0 2.0
97427 2021-11-01 17:00:00 Israel Leumit League M. Nazareth Beitar Tel Aviv 1.92 3.26 3.75 2021-10-30 09:40:08.966330 4.0 2.0
177665 2022-01-13 12:30:00 Egypt Division 2 - Group C Said El Mahalla Al Magd 4.12 3.08 1.94 2022-01-12 17:53:33.570126 0.0 0.0
69451 2021-10-17 05:00:00 South Korea K League 1 Gangwon Gwangju FC 2.06 3.38 3.65 2021-10-15 09:55:54.578112 NaN NaN
4742 2021-08-10 20:30:00 Peru Liga 2 Deportivo Coopsol Grau 3.14 3.49 2.06 2021-08-10 18:14:01.996860 0.0 2.0
22266 2021-08-29 13:00:00 France Ligue 1 Angers Rennes 2.93 3.27 2.56 2021-08-27 12:26:34.904374 2.0 0.0
46412 2021-09-26 04:00:00 Japan J2 League Okayama Blaublitz 2.24 2.90 3.63 2021-09-23 09:08:26.979783 1.0 1.0
133207 2021-11-27 21:15:00 Bolivia Division Profesional Palmaflor Blooming 1.51 4.05 5.10 2021-11-25 18:22:28.275844 3.0 0.0
140825 2021-11-28 11:00:00 Spain Tercera RFEF - Group 6 Valencia B Torrellano 1.58 3.56 5.26 2021-11-28 19:54:40.066637 2.0 0.0
226985 2022-03-04 00:30:00 Argentina Copa de la Liga Profesional Central Cordoba Rosario Central 2.36 3.26 2.86 2022-03-02 17:23:10.014424 0.0 1.0
137226 2021-11-28 12:45:00 Greece Super League 2 Apollon Pontou PAOK B 3.37 3.25 2.01 2021-11-27 15:13:05.937815 0.0 3.0
182756 2022-01-22 10:30:00 Turkey 1. Lig Umraniyespor Menemenspor 1.40 4.39 7.07 2022-01-19 17:25:27.128331 2.0 1.0
89895 2021-10-28 16:45:00 Netherlands KNVB Beker Ajax Cambuur 9.10 5.55 1.26 2021-10-27 07:46:56.253996 0.0 5.0
227595 2022-03-06 17:00:00 Israel Ligat ha'Al Ashdod Maccabi Petah Tikva 2.30 3.21 3.05 2022-03-02 17:23:10.014424 NaN NaN
57568 2021-10-02 13:00:00 Estonia Meistriliiga Kalju Legion 1.58 4.10 4.84 2021-10-02 10:55:35.287359 2.0 2.0
227035 2022-03-04 19:00:00 Denmark Superliga FC Copenhagen Randers FC 1.70 3.84 5.06 2022-03-02 17:23:10.014424 NaN NaN
108668 2021-11-07 13:30:00 Germany Oberliga Mittelrhein Duren Freialdenhoven 1.35 5.20 6.35 2021-11-06 17:37:37.629603 2.0 0.0
86270 2021-10-25 18:00:00 Belgium Pro League U21 Lommel SK U21 Lierse K. U21 3.23 3.84 1.92 2021-10-26 01:22:31.111441 0.0 0.0
89437 2021-11-01 02:10:00 Colombia Primera A America De Cali Petrolera 1.86 2.92 4.60 2021-10-27 07:41:24.427246 NaN NaN
13986 2021-08-21 13:00:00 France Ligue 2 Dijon Toulouse 3.92 3.51 1.94 2021-08-16 13:22:02.749887 2.0 4.0
105179 2021-11-06 15:00:00 England NPL Premier Division Atherton South Shields 3.90 3.42 1.82 2021-11-05 10:01:28.567328 1.0 1.0
142821 2021-12-01 12:30:00 Bulgaria Vtora liga Marek Septemvri Simitli 1.79 3.38 4.35 2021-11-30 17:40:24.833519 2.0 2.0
45866 2021-09-24 00:30:00 Venezuela Primera Division Dep. Tachira Portuguesa 1.96 3.60 3.22 2021-09-23 09:08:26.979783 4.0 1.0
76100 2021-10-22 16:30:00 Denmark 1st Division Hvidovre IF Koge 1.91 3.56 3.81 2021-10-21 08:43:12.445245 NaN NaN
115896 2021-11-14 16:00:00 Spain Tercera RFEF - Group 6 Olimpic Xativa Torrellano 2.78 2.89 2.39 2021-11-13 12:21:45.955738 1.0 0.0
156159 2021-12-12 16:00:00 Spain Segunda RFEF - Group 1 Marino de Luanco Coruxo FC 2.19 3.27 3.07 2021-12-10 09:26:45.001977 0.0 0.0
18240 2021-08-21 12:00:00 Germany Regionalliga West Rodinghausen Fortuna Koln 3.25 3.60 2.00 2021-08-21 03:30:43.193978 NaN NaN
184913 2022-01-22 10:00:00 World Club Friendly Zilina B Trinec 3.56 4.14 1.78 2022-01-22 16:44:32.650325 0.0 3.0
16782 2021-08-22 23:05:00 Colombia Primera A Petrolera Dep. Cali 3.01 3.00 2.44 2021-08-19 18:24:24.966505 2.0 3.0
63847 2021-10-10 09:30:00 Spain Tercera RFEF - Group 7 Carabanchel RSD Alcala 4.39 3.42 1.75 2021-10-09 12:03:50.720013 NaN NaN
7254 2021-08-12 16:45:00 Europe Europa Conference League Hammarby Cukaricki 1.72 3.87 4.13 2021-08-11 23:48:31.958394 NaN NaN
82727 2021-10-24 14:00:00 Lithuania I Lyga Zalgiris 2 Neptunas 1.76 3.78 3.35 2021-10-24 12:02:06.306279 1.0 3.0
43074 2021-09-22 18:00:00 Ukraine Super Cup Shakhtar Donetsk Dyn. Kyiv 2.57 3.49 2.59 2021-09-19 09:39:56.624504 NaN NaN
65187 2021-10-11 18:45:00 World World Cup Norway Montenegro 1.56 4.17 6.28 2021-10-11 10:56:09.973470 NaN NaN
120993 2021-11-18 00:00:00 USA NISA Maryland Bobcats California Utd. 2.76 3.23 2.39 2021-11-17 20:36:26.562731 1.0 1.0
201469 2022-02-12 15:00:00 England League One AFC Wimbledon Sunderland 3.30 3.48 2.17 2022-02-10 17:47:36.501159 1.0 1.0
142180 2021-12-01 19:45:00 Scotland Premiership St. Mirren Ross County 2.06 3.25 3.85 2021-11-29 18:28:22.249662 0.0 0.0
4681 2021-08-10 18:30:00 Europe Champions League Young Boys CFR Cluj 1.48 4.29 6.92 2021-08-10 18:14:01.996860 3.0 1.0
67321 2021-10-17 13:00:00 Spain LaLiga Rayo Vallecano Elche 1.78 3.64 4.99 2021-10-13 11:22:34.979378 NaN NaN
27499 2021-09-04 14:00:00 Iceland Inkasso-deildin Kordrengir Fjolnir 2.18 3.66 2.82 2021-09-02 23:28:49.414126 1.0 4.0
48962 2021-09-25 21:00:00 Mexico Liga Premier Serie B Uruapan Lobos Huerta 1.83 3.69 3.70 2021-09-25 13:02:58.238466 NaN NaN
65636 2021-10-16 17:00:00 Switzerland Super League Young Boys Luzern 1.26 6.04 9.43 2021-10-11 10:56:09.973470 NaN NaN
17333 2021-08-21 14:00:00 Finland Kakkonen Group A Atlantis Kiffen 1.57 4.29 4.42 2021-08-20 12:41:03.159846 1.0 1.0
I am trying to get the latest two match_datetime values for every run_time, and then filter (join) df to get all the relevant rows, as below:
df['match_datetime'] = pd.to_datetime(df['match_datetime'])
s = (df['match_datetime'].dt.normalize()
       .groupby([df['run_time']])
       .value_counts()
       .groupby(level=0)
       .head(2))
print(s)
run_time match_datetime
2021-08-07 00:04:36.326391 2021-08-07 255
2021-08-06 188
2021-08-07 10:50:34.574040 2021-08-07 649
2021-08-08 277
2021-08-07 16:56:22.322338 2021-08-07 712
This returns a series while I want a DataFrame so I can merge.
To do this:
df_n = df.reset_index().merge(s, how="left",
                              left_on=["match_datetime", "run_time"],
                              right_on=["match_datetime", "run_time"])
I am sure there is a better way to write the logic behind s, but I am unsure how to do it the correct way.
If I understand correctly, you would like to filter the dataframe to retain, for each run_time, the last two rows (or up to two rows) by match_datetime.
Simplified answer
This can be done relatively easily without any join, using GroupBy.tail(). (Note: my original answer used GroupBy.rank(), but this is simpler, although slower.)
out = df.sort_values(
    ['run_time', 'match_datetime']
).groupby('run_time').tail(2)
Minimal example
import numpy as np
import pandas as pd

np.random.seed(0)
n = 10
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='D'), n)
df = pd.DataFrame({
    'run_time': rt,
    'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='days'),
})
df['match_datetime'] = df['match_datetime'].dt.round('h')
Then:
out = df.sort_values(
    ['run_time', 'match_datetime']
).groupby('run_time').tail(2)
>>> out
run_time match_datetime
9 2022-01-01 2021-12-31 04:00:00
1 2022-01-01 2021-12-31 15:00:00
5 2022-01-02 2022-01-01 02:00:00
7 2022-01-03 2022-01-02 22:00:00
3 2022-01-04 2022-01-03 11:00:00
6 2022-01-04 2022-01-03 22:00:00
0 2022-01-05 2022-01-04 01:00:00
8 2022-01-05 2022-01-05 00:00:00
On the OP's (extended) data
The output is quite verbose. Here is a sample:
>>> out['run_time match_datetime country draw_odds'.split()].head()
run_time match_datetime country draw_odds
4681 2021-08-10 18:14:01.996860 2021-08-10 18:30:00 Europe 4.29
4742 2021-08-10 18:14:01.996860 2021-08-10 20:30:00 Peru 3.49
7254 2021-08-11 23:48:31.958394 2021-08-12 16:45:00 Europe 3.87
12328 2021-08-14 22:44:26.136608 2021-08-15 15:30:00 Poland 3.74
13986 2021-08-16 13:22:02.749887 2021-08-21 13:00:00 France 3.51
Performance
For several million rows, the timing difference starts to matter, and using rank is faster. Faster still, you can avoid sorting on run_time (the result is the same, but the rows come out in a different order):
np.random.seed(0)
n = 1_000_000
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='min'), n)
df = pd.DataFrame({
    'run_time': rt,
    'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='s'),
})
%timeit df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
# 981 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
# 355 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.sort_values(['match_datetime']).groupby('run_time').tail(2)
# 258 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Checking that all three solutions agree:
a = df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
b = df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
c = df.sort_values(['match_datetime']).groupby('run_time').tail(2)
by = ['run_time', 'match_datetime']
>>> a.sort_values(by).equals(b.sort_values(by))
True
>>> b.sort_values(by).equals(c.sort_values(by))
True
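As for the literal question in the title: if all you need is s as a DataFrame (e.g. for the merge above), Series.reset_index is enough; the MultiIndex levels become ordinary columns (the column name 'count' here is my choice):

counts = s.reset_index(name='count')
# columns: run_time, match_datetime, count -- a plain DataFrame, ready to merge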

Calculate Positive Streak for Pandas Rows in reverse

I want to calculate a positive streak for numbers in a row in reverse fashion.
I tried using cumsum() but that's not helping me.
The DataFrame looks as follows with the expected output:
country score_1 score_2 score_3 score_4 score_5 expected_streak
U.S. 12.4 13.6 19.9 22 28.7 4
Africa 11.1 15.5 9.2 7 34.2 1
India 13.9 6.6 16.3 21.8 30.9 3
Australia 25.4 36.9 18.9 29 NaN 0
Malaysia 12.8 NaN -6.2 28.6 31.7 2
Argentina 40.7 NaN 16.3 20.1 39 2
Canada 56.4 NaN NaN -2 -1 1
So, basically, score_5 should be greater than score_4, and so on, to get the streak count. The streak ends at the first score (counting backward from score_5) that is not greater than the score before it.
One way using diff with cummin:
df2 = df.filter(like="score_").loc[:, ::-1]
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(1).sum(1)
print(df)
Output:
country score_1 score_2 score_3 score_4 score_5 expected
0 U.S. 12.4 13.6 19.9 22.0 28.7 4
1 Africa 11.1 15.5 9.2 7.0 34.2 1
2 India 13.9 6.6 16.3 21.8 30.9 3
3 Australia 25.4 36.9 18.9 29.0 NaN 0
4 Malaysia 12.8 NaN -6.2 28.6 31.7 2
5 Argentina 40.7 NaN 16.3 20.1 39.0 2
6 Canada 56.4 NaN NaN -2.0 -1.0 1
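If the chain is opaque, here is a step-by-step sketch of what each call computes, rebuilding two rows of the example so the intermediates can be inspected:

import pandas as pd

# Two rows from the example above
df = pd.DataFrame({
    "country": ["U.S.", "Africa"],
    "score_1": [12.4, 11.1],
    "score_2": [13.6, 15.5],
    "score_3": [19.9, 9.2],
    "score_4": [22.0, 7.0],
    "score_5": [28.7, 34.2],
})

df2 = df.filter(like="score_").loc[:, ::-1]  # columns reversed: score_5 ... score_1
diffs = df2.diff(-1, axis=1)                 # each score minus the chronologically previous one
increasing = diffs.gt(0)                     # True where the score rose (NaN compares as False)
streak = increasing.cummin(axis=1)           # cut the run at the first non-increase
df["expected"] = streak.sum(axis=1)          # length of the unbroken run = the streak
# U.S. -> 4 (28.7 > 22 > 19.9 > 13.6 > 12.4), Africa -> 1 (34.2 > 7, but 7 < 9.2 stops it)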

Webscraping in BeautifulSoup is returning an empty list

I am trying to web scrape a table from basketball reference and it returns an empty list. I was hoping someone could help me debug or explain why. The page has many tables but it is the Miscellaneous Stats section in particular. Thanks in advance!
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html#all_misc_stats'
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
soup.find('div', {'id':'div_misc_stats'})
Your implementation isn't wrong for parsing the soup; it's just that the particular element you're looking for requires JavaScript to render. You're probably better off looking for some other source of the data, if you can find it.
If you really need THIS data, then you may wish to look into rendering the page first (see this for some inspiration).
From my cursory analysis, it also seems there isn't an external network call made to fetch the data before rendering it, so it may be embedded elsewhere in the page, as xml/json/etc., although I didn't find it in my search. It may be worth checking that before investing in a more compute-expensive approach, if this is not a one-time scrape.
The data is inside an HTML comment <!-- ... -->. You can use this script to load it into a DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = soup.select_one('h2:contains("Miscellaneous Stats")').find_next(text=lambda t: isinstance(t, Comment))
df = pd.read_html(str(table))[0].droplevel(0, axis=1)
print(df)
Prints:
Rk Team Age W L PW PL MOV SOS SRS ORtg DRtg ... TS% eFG% TOV% ORB% FT/FGA eFG% TOV% DRB% FT/FGA Arena Attend. Attend./G
0 1.0 Milwaukee Bucks* 29.2 53.0 12.0 52 13 11.29 -0.85 10.44 112.6 101.9 ... 0.583 0.553 12.8 20.7 0.196 0.486 12.2 81.7 0.172 Fiserv Forum 549036 17711
1 2.0 Los Angeles Lakers* 29.6 49.0 14.0 45 18 7.41 0.34 7.75 113.0 105.6 ... 0.577 0.548 13.2 24.6 0.196 0.509 13.8 78.4 0.202 STAPLES Center 588907 18997
2 3.0 Los Angeles Clippers* 27.4 44.0 20.0 44 20 6.52 0.22 6.74 113.6 107.2 ... 0.574 0.532 12.7 24.0 0.232 0.503 12.3 77.3 0.210 STAPLES Center 610176 19068
3 4.0 Toronto Raptors* 26.6 46.0 18.0 44 20 6.45 -0.57 5.88 111.6 105.2 ... 0.574 0.536 12.8 21.6 0.205 0.502 14.6 76.1 0.200 Scotiabank Arena 633456 19796
4 5.0 Dallas Mavericks 26.2 40.0 27.0 45 22 6.04 -0.21 5.84 116.7 110.6 ... 0.581 0.548 11.3 23.5 0.198 0.519 10.9 77.4 0.172 American Airlines Center 682096 20062
5 6.0 Boston Celtics* 25.3 43.0 21.0 44 20 6.17 -0.48 5.69 112.9 106.8 ... 0.567 0.529 12.0 23.9 0.204 0.510 13.6 77.5 0.212 TD Garden 610864 19090
6 7.0 Houston Rockets* 29.1 40.0 24.0 39 25 3.75 0.03 3.78 113.8 110.2 ... 0.578 0.539 12.6 22.4 0.226 0.528 13.5 75.6 0.194 Toyota Center 578458 18077
7 8.0 Utah Jazz* 27.5 41.0 23.0 38 26 3.17 0.03 3.20 112.6 109.4 ... 0.587 0.552 13.6 21.2 0.208 0.514 10.9 79.0 0.180 Vivint Smart Home Arena 567486 18306
8 9.0 Denver Nuggets* 25.6 43.0 22.0 39 26 2.95 0.06 3.02 112.5 109.5 ... 0.564 0.532 12.3 24.7 0.178 0.526 13.0 77.0 0.194 Pepsi Center 633153 19186
9 10.0 Oklahoma City Thunder* 25.6 40.0 24.0 37 27 2.45 0.34 2.79 111.6 109.1 ... 0.577 0.534 12.3 19.2 0.233 0.520 12.4 76.8 0.164 Chesapeake Energy Arena 600699 18203
10 11.0 Miami Heat* 25.9 41.0 24.0 39 26 3.23 -0.65 2.58 112.7 109.4 ... 0.587 0.549 13.5 20.5 0.231 0.522 12.3 79.7 0.208 AmericanAirlines Arena 629771 19680
11 12.0 Philadelphia 76ers* 26.4 39.0 26.0 37 28 2.22 0.01 2.22 110.4 108.2 ... 0.562 0.530 12.7 23.7 0.189 0.522 12.7 80.4 0.211 Wells Fargo Center 639491 20629
12 13.0 Indiana Pacers* 25.6 39.0 26.0 37 28 1.94 -0.33 1.61 110.3 108.3 ... 0.565 0.533 11.9 20.3 0.170 0.513 12.8 77.1 0.193 Bankers Life Fieldhouse 529002 16531
13 14.0 New Orleans Pelicans 25.4 28.0 36.0 30 34 -0.83 1.13 0.30 110.8 111.6 ... 0.567 0.538 13.7 24.3 0.183 0.531 12.3 78.1 0.207 Smoothie King Center 528172 16505
14 15.0 Orlando Magic 26.0 30.0 35.0 30 35 -0.97 0.12 -0.85 108.0 109.0 ... 0.540 0.503 11.4 22.4 0.191 0.535 13.5 79.0 0.170 Amway Center 529870 17093
15 16.0 Memphis Grizzlies 24.0 32.0 33.0 30 35 -1.08 0.02 -1.05 109.4 110.4 ... 0.561 0.530 13.2 23.2 0.178 0.520 12.6 77.6 0.213 FedEx Forum 523297 15857
16 17.0 Phoenix Suns 24.7 26.0 39.0 30 35 -1.37 0.32 -1.05 110.5 111.8 ... 0.572 0.528 13.3 22.2 0.226 0.543 14.0 78.3 0.221 Talking Stick Resort Arena 550633 15606
17 18.0 Portland Trail Blazers 27.5 29.0 37.0 30 36 -1.61 0.49 -1.11 112.5 114.1 ... 0.566 0.530 11.5 22.0 0.191 0.523 11.0 75.0 0.204 Moda Center 628303 19634
18 19.0 Brooklyn Nets 26.5 30.0 34.0 31 33 -0.64 -0.54 -1.18 108.1 108.7 ... 0.550 0.515 13.4 23.5 0.199 0.507 10.9 77.8 0.181 Barclays Center 524907 16403
19 20.0 San Antonio Spurs 27.9 27.0 36.0 28 35 -1.76 0.57 -1.21 111.9 113.7 ... 0.569 0.529 11.0 19.5 0.206 0.542 11.5 79.2 0.194 AT&T Center 550515 18351
20 21.0 Sacramento Kings 27.1 28.0 36.0 28 36 -1.92 0.48 -1.44 109.7 111.6 ... 0.563 0.531 13.0 21.8 0.178 0.540 13.6 78.5 0.222 Golden 1 Center 520663 16796
21 22.0 Minnesota Timberwolves 24.8 19.0 45.0 24 40 -4.30 0.51 -3.78 108.1 112.2 ... 0.551 0.514 13.0 22.1 0.209 0.541 13.2 77.2 0.218 Target Center 482112 15066
22 23.0 Chicago Bulls 24.4 22.0 43.0 26 39 -3.08 -0.73 -3.81 106.7 109.8 ... 0.547 0.515 13.7 22.8 0.175 0.546 16.3 75.6 0.239 United Center 639352 18804
23 24.0 Detroit Pistons 25.9 20.0 46.0 26 40 -3.56 -0.66 -4.22 109.0 112.7 ... 0.561 0.529 13.8 22.6 0.194 0.541 12.7 75.9 0.186 Little Caesars Arena 509469 15294
24 25.0 Washington Wizards 25.4 24.0 40.0 24 40 -4.05 -0.81 -4.86 111.9 115.8 ... 0.568 0.528 12.1 22.0 0.214 0.560 14.0 74.9 0.230 Capital One Arena 532702 16647
25 26.0 New York Knicks 24.5 21.0 45.0 20 46 -6.45 -0.09 -6.55 106.5 113.0 ... 0.531 0.501 12.6 25.8 0.182 0.541 12.4 78.3 0.224 Madison Square Garden (IV) 620789 18812
26 27.0 Charlotte Hornets 24.3 23.0 42.0 19 46 -6.75 -0.12 -6.88 106.3 113.3 ... 0.539 0.504 13.3 23.9 0.188 0.546 13.1 74.4 0.159 Spectrum Center 478591 15428
27 28.0 Cleveland Cavaliers 25.0 19.0 46.0 18 47 -7.89 0.33 -7.55 107.5 115.4 ... 0.553 0.522 14.6 24.6 0.172 0.560 11.7 77.4 0.164 Quicken Loans Arena 643008 17861
28 29.0 Atlanta Hawks 24.1 20.0 47.0 18 49 -7.97 0.40 -7.57 107.2 114.8 ... 0.554 0.515 13.8 21.6 0.204 0.543 12.7 74.9 0.233 State Farm Arena 545453 16043
29 30.0 Golden State Warriors 24.4 15.0 50.0 16 49 -8.71 0.79 -7.92 105.2 113.8 ... 0.540 0.497 13.2 21.5 0.212 0.553 13.7 76.4 0.193 Chase Center 614176 18064
30 NaN League Average 26.2 NaN NaN 32 32 0.00 0.00 0.00 110.4 110.4 ... 0.564 0.528 12.8 22.6 0.199 0.528 12.8 77.4 0.199 NaN 575820 17788
[31 rows x 28 columns]
The website you want to scrape is dynamic: you can't access all the data on the first request, because you have to wait a few seconds for the JavaScript to render before all of the page's data is available. For that you can use Selenium: read the documentation, download the Chrome or Firefox driver, and then use it. I wrote code that gives you access to that table:
from selenium import webdriver
import pandas as pd
import os
import time
chromedriver = "driver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html#all_misc_stats'
driver.get(url)
time.sleep(15)
source = driver.page_source
tables = pd.read_html(source)
for table in tables:
    try:
        if 'Arena' in table.columns[25][1]:
            print(table)
    except Exception:
        pass
Prints:
Rk Team Age ... Arena Attend. Attend./G
0 1.0 Milwaukee Bucks* 29.2 ... Fiserv Forum 549036 17711
1 2.0 Los Angeles Lakers* 29.6 ... STAPLES Center 588907 18997
2 3.0 Los Angeles Clippers* 27.4 ... STAPLES Center 610176 19068
3 4.0 Toronto Raptors* 26.6 ... Scotiabank Arena 633456 19796
4 5.0 Dallas Mavericks 26.2 ... American Airlines Center 682096 20062
5 6.0 Boston Celtics* 25.3 ... TD Garden 610864 19090
6 7.0 Houston Rockets* 29.1 ... Toyota Center 578458 18077
7 8.0 Utah Jazz* 27.5 ... Vivint Smart Home Arena 567486 18306
8 9.0 Denver Nuggets* 25.6 ... Pepsi Center 633153 19186
9 10.0 Oklahoma City Thunder* 25.6 ... Chesapeake Energy Arena 600699 18203
10 11.0 Miami Heat* 25.9 ... AmericanAirlines Arena 629771 19680
11 12.0 Philadelphia 76ers* 26.4 ... Wells Fargo Center 639491 20629
12 13.0 Indiana Pacers* 25.6 ... Bankers Life Fieldhouse 529002 16531
13 14.0 New Orleans Pelicans 25.4 ... Smoothie King Center 528172 16505
14 15.0 Orlando Magic 26.0 ... Amway Center 529870 17093
15 16.0 Memphis Grizzlies 24.0 ... FedEx Forum 523297 15857
16 17.0 Phoenix Suns 24.7 ... Talking Stick Resort Arena 550633 15606
17 18.0 Portland Trail Blazers 27.5 ... Moda Center 628303 19634
18 19.0 Brooklyn Nets 26.5 ... Barclays Center 524907 16403
19 20.0 San Antonio Spurs 27.9 ... AT&T Center 550515 18351
20 21.0 Sacramento Kings 27.1 ... Golden 1 Center 520663 16796
21 22.0 Minnesota Timberwolves 24.8 ... Target Center 482112 15066
22 23.0 Chicago Bulls 24.4 ... United Center 639352 18804
23 24.0 Detroit Pistons 25.9 ... Little Caesars Arena 509469 15294
24 25.0 Washington Wizards 25.4 ... Capital One Arena 532702 16647
25 26.0 New York Knicks 24.5 ... Madison Square Garden (IV) 620789 18812
26 27.0 Charlotte Hornets 24.3 ... Spectrum Center 478591 15428
27 28.0 Cleveland Cavaliers 25.0 ... Quicken Loans Arena 643008 17861
28 29.0 Atlanta Hawks 24.1 ... State Farm Arena 545453 16043
29 30.0 Golden State Warriors 24.4 ... Chase Center 614176 18064
30 NaN League Average 26.2 ... NaN 575820 17788
[31 rows x 28 columns]
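One caveat on the Selenium snippet above: passing the driver path positionally to webdriver.Chrome() was deprecated in Selenium 4 and later removed; the path now goes through a Service object. A sketch of the modern equivalent (the driver path is an assumption):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: the executable path is wrapped in a Service object
driver = webdriver.Chrome(service=Service("driver/chromedriver"))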

how to separate one DataFrame into two small ones

I have a big DataFrame as below:
count mean median min max std
datet
2001-05-16 17 NaN NaN NaN NaN NaN
2001-05-17 24 8.28 8.27 8.15 8.46 0.09
2001-05-18 24 8.41 8.31 8.18 8.85 0.19
2001-05-19 24 10.44 10.64 9.03 10.98 0.60
2001-05-20 24 10.53 10.56 9.98 10.92 0.28
2001-05-21 24 10.28 10.31 9.90 10.66 0.23
2001-05-22 24 10.40 10.42 10.17 10.67 0.17
2001-05-23 24 10.04 10.03 9.87 10.17 0.08
2001-05-24 24 9.63 9.66 9.41 9.88 0.15
2001-05-25 24 9.21 9.22 9.01 9.41 0.11
How can I separate this DataFrame into two smaller ones, according to whether the date is before or after '2001-05-20', like below?
df1:
count mean median min max std
datet
2001-05-16 17 NaN NaN NaN NaN NaN
2001-05-17 24 8.28 8.27 8.15 8.46 0.09
2001-05-18 24 8.41 8.31 8.18 8.85 0.19
2001-05-19 24 10.44 10.64 9.03 10.98 0.60
2001-05-20 24 10.53 10.56 9.98 10.92 0.28
df2:
count mean median min max std
datet
2001-05-21 24 10.28 10.31 9.90 10.66 0.23
2001-05-22 24 10.40 10.42 10.17 10.67 0.17
2001-05-23 24 10.04 10.03 9.87 10.17 0.08
2001-05-24 24 9.63 9.66 9.41 9.88 0.15
2001-05-25 24 9.21 9.22 9.01 9.41 0.11
For a single before/after split, I think grouping by a boolean criterion is the most direct approach.
In [1]: df = pd.DataFrame(np.random.randn(10),
                          index=pd.date_range('2001-05-16', '2001-05-25'))
In [2]: grouper = df.groupby(df.index < pd.Timestamp('2001-05-21'))
In [3]: before, after = grouper.get_group(True), grouper.get_group(False)
In [4]: before
Out[4]:
0
2001-05-16 2.560516
2001-05-17 -2.207314
2001-05-18 0.646882
2001-05-19 0.660611
2001-05-20 0.437303
And after comes out right as well. Can anyone improve on my In [3]?
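One possible simplification of In [3], for what it's worth: index directly with a plain boolean mask, no groupby needed (a sketch against the same df):

mask = df.index < pd.Timestamp('2001-05-21')
before, after = df[mask], df[~mask]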
0.11-dev (at the time, .ix worked equivalently; .ix has since been removed in favor of .loc):
In [16]: df.loc[:'20010520']
Out[16]:
0
2001-05-16 0.105445
2001-05-17 1.660771
2001-05-18 0.485668
2001-05-19 -0.102616
2001-05-20 -0.228228
In [17]: df.loc['20010521':]
Out[17]:
0
2001-05-21 -0.024324
2001-05-22 -1.004362
2001-05-23 2.342225
2001-05-24 1.124695
2001-05-25 -0.291302
or (.ix would have worked here as well; this is just more explicit):
In [27]: i = df.index.get_loc('20010520')
In [28]: df.iloc[:i+1]
Out[28]:
0
2001-05-16 0.105445
2001-05-17 1.660771
2001-05-18 0.485668
2001-05-19 -0.102616
2001-05-20 -0.228228
In [29]: df.iloc[i+1:]
Out[29]:
0
2001-05-21 -0.024324
2001-05-22 -1.004362
2001-05-23 2.342225
2001-05-24 1.124695
2001-05-25 -0.291302
