There's a website: https://www.hockey-reference.com/leagues/NHL_2022.html
I need to get the table in the div with id="div_stats":
import requests
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/leagues/NHL_2022.html'
r = requests.get(url=url)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('div', id='div_stats')
print(table)
# None
The response status is 200, but there is no such div in the BeautifulSoup object. If I open the page using Selenium, or manually in a browser, it loads properly:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

url = 'https://www.hockey-reference.com/leagues/NHL_2022.html'
with webdriver.Chrome() as browser:
    browser.get(url)
    # sleep(1)
    html = browser.page_source
# r = requests.get(url=url, stream=True)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('div', id='div_stats')
However, with the webdriver the page may keep loading for quite a long time: even when I can already see the whole page, browser.get(url) is still blocking and the code cannot continue.
Is there a solution that avoids Selenium, or that stops the loading once the table is in the HTML?
I tried stream and timeout in requests.get(), and an explicit wait:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for season in seasons:
    browser.get(url)
    wait = WebDriverWait(browser, 5)
    wait.until(EC.visibility_of_element_located((By.ID, 'div_stats')))
    html = browser.execute_script('return document.documentElement.outerHTML')

None of that worked.
The table you are after is in the static HTML, but wrapped in an HTML comment (the site un-comments it with JavaScript in the browser, which is why Selenium sees it and plain requests does not). Stripping the comment markers exposes it. This is one way to get that table as a dataframe:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
url = 'https://www.hockey-reference.com/leagues/NHL_2022.html'
# Remove the comment markers so the hidden table becomes visible to the parser
response = requests.get(url, headers=headers).text.replace('<!--', '').replace('-->', '')
soup = bs(response, 'html.parser')
table_w_data = soup.select_one('table#stats')
df = pd.read_html(str(table_w_data), header=1)[0]
print(df)
Result in terminal:
0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 ... Special Teams Shot Data Unnamed: 31_level_0
Rk Unnamed: 1_level_1 AvAge GP W L OL PTS PTS% GF ... PK% SH SHA PIM/G oPIM/G S S% SA SV% SO
0 1.0 Florida Panthers* 27.8 82 58 18 6 122 0.744 337 ... 79.54 12 8 10.1 10.8 3062 11.0 2515 0.904 5
1 2.0 Colorado Avalanche* 28.2 82 56 19 7 119 0.726 308 ... 79.66 6 5 9.0 10.4 2874 10.7 2625 0.912 7
2 3.0 Carolina Hurricanes* 28.3 82 54 20 8 116 0.707 277 ... 88.04 4 3 9.2 7.7 2798 9.9 2310 0.913 6
3 4.0 Toronto Maple Leafs* 28.4 82 54 21 7 115 0.701 312 ... 82.05 13 4 8.6 8.5 2835 11.0 2511 0.900 7
4 5.0 Minnesota Wild* 29.4 82 53 22 7 113 0.689 305 ... 76.14 2 5 10.8 10.8 2666 11.4 2577 0.903 3
5 6.0 Calgary Flames* 28.8 82 50 21 11 111 0.677 291 ... 83.20 7 3 9.1 8.6 2908 10.0 2374 0.913 11
6 7.0 Tampa Bay Lightning* 29.6 82 51 23 8 110 0.671 285 ... 80.56 7 5 11.0 11.4 2535 11.2 2441 0.907 3
7 8.0 New York Rangers* 26.7 82 52 24 6 110 0.671 250 ... 82.30 8 2 8.2 8.2 2392 10.5 2528 0.919 9
8 9.0 St. Louis Blues* 28.8 82 49 22 11 109 0.665 309 ... 84.09 9 5 7.5 7.9 2492 12.4 2591 0.908 4
9 10.0 Boston Bruins* 28.5 82 51 26 5 107 0.652 253 ... 81.30 5 6 9.9 9.4 2962 8.5 2354 0.907 4
10 11.0 Edmonton Oilers* 29.1 82 49 27 6 104 0.634 285 ... 79.37 11 6 8.1 7.1 2790 10.2 2647 0.905 4
11 12.0 Pittsburgh Penguins* 29.7 82 46 25 11 103 0.628 269 ... 84.43 3 8 6.9 8.4 2849 9.4 2576 0.914 7
12 13.0 Washington Capitals* 29.5 82 44 26 12 100 0.610 270 ... 80.44 8 9 7.7 8.8 2577 10.5 2378 0.898 8
13 14.0 Los Angeles Kings* 28.0 82 44 27 11 99 0.604 235 ... 76.65 11 9 7.7 8.3 2865 8.2 2341 0.901 5
14 15.0 Dallas Stars* 29.4 82 46 30 6 98 0.598 233 ... 79.00 7 5 6.7 7.5 2486 9.4 2545 0.904 2
15 16.0 Nashville Predators* 27.7 82 45 30 7 97 0.591 262 ... 79.23 2 5 12.6 11.9 2439 10.7 2646 0.906 4
16 17.0 Vegas Golden Knights 28.5 82 43 31 8 94 0.573 262 ... 77.40 10 7 7.6 7.7 2830 9.3 2458 0.901 3
17 18.0 Vancouver Canucks 27.7 82 40 30 12 92 0.561 246 ... 74.89 5 6 8.0 8.6 2622 9.4 2612 0.912 1
18 19.0 Winnipeg Jets 28.2 82 39 32 11 89 0.543 250 ... 75.00 9 8 8.8 9.5 2645 9.5 2721 0.907 5
19 20.0 New York Islanders 30.1 82 37 35 10 84 0.512 229 ... 84.19 5 7 8.9 8.4 2367 9.7 2669 0.913 9
20 21.0 Columbus Blue Jackets 26.6 82 37 38 7 81 0.494 258 ... 78.57 7 6 7.7 7.2 2463 10.5 2887 0.897 2
21 22.0 San Jose Sharks 29.0 82 32 37 13 77 0.470 211 ... 85.20 4 11 8.8 8.6 2400 8.8 2622 0.900 3
22 23.0 Anaheim Ducks 27.9 82 31 37 14 76 0.463 228 ... 80.80 6 4 9.3 9.8 2393 9.5 2725 0.902 4
23 24.0 Buffalo Sabres 27.5 82 32 39 11 75 0.457 229 ... 76.42 6 6 8.1 7.9 2451 9.3 2702 0.894 1
24 25.0 Detroit Red Wings 26.9 82 32 40 10 74 0.451 227 ... 73.78 4 10 8.9 8.5 2414 9.4 2761 0.888 4
25 26.0 Ottawa Senators 26.6 82 33 42 7 73 0.445 224 ... 80.32 9 4 10.0 10.2 2463 9.1 2740 0.904 2
26 27.0 Chicago Blackhawks 28.0 82 28 42 12 68 0.415 213 ... 76.23 2 6 7.9 8.7 2362 9.0 2703 0.893 4
27 28.0 New Jersey Devils 25.8 82 27 46 9 63 0.384 245 ... 80.19 6 14 8.1 8.4 2562 9.6 2540 0.881 2
28 29.0 Philadelphia Flyers 28.3 82 25 46 11 61 0.372 210 ... 75.74 6 11 9.0 9.0 2539 8.3 2785 0.894 1
29 30.0 Seattle Kraken 28.7 82 27 49 6 60 0.366 213 ... 74.89 8 7 8.5 8.0 2380 8.9 2367 0.880 3
30 31.0 Arizona Coyotes 28.0 82 25 50 7 57 0.348 206 ... 75.00 3 4 10.2 8.2 2121 9.7 2910 0.894 1
31 32.0 Montreal Canadiens 27.8 82 22 49 11 55 0.335 218 ... 75.55 6 12 10.2 9.0 2442 8.9 2823 0.888 3
32 NaN League Average 28.2 82 41 32 9 91 0.555 255 ... 79.39 7 7 8.9 8.9 2593 9.8 2593 0.902 4
33 rows × 32 columns
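As an aside: newer pandas releases (2.1+) deprecate passing a literal HTML string to read_html, so with a recent install you may want to wrap it in io.StringIO to silence the FutureWarning:

import io
df = pd.read_html(io.StringIO(str(table_w_data)), header=1)[0]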
Expect to do a little cleanup of that data once you get it.
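For instance, a minimal cleanup sketch, assuming the two-level header shown in the printout above (the column names here are read off that output, so adjust as needed):

# Flatten the two-level header, keeping the lower level
df.columns = [lower for _, lower in df.columns]
# The unnamed second column holds the team names
df = df.rename(columns={'Unnamed: 1_level_1': 'Team'})
# Drop the League Average row, whose Rk is NaN
df = df[df['Rk'].notna()].reset_index(drop=True)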
Relevant documentation for pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
And for requests: https://requests.readthedocs.io/en/latest/
And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
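As for the Selenium side of the question: if you ever do need to stay with a browser, Selenium 4's page_load_strategy can be relaxed so that get() returns once the DOM is ready instead of waiting for every resource. A sketch (untested here), combined with the explicit wait you already had:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.page_load_strategy = 'eager'  # don't wait for images, ads, etc.
with webdriver.Chrome(options=options) as browser:
    browser.get('https://www.hockey-reference.com/leagues/NHL_2022.html')
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.ID, 'div_stats')))
    html = browser.page_source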
I want to split my data into groups of two rows each using pandas.
For example:
df_A (raw data):
data1  data2  data3
23     13.3   983
13     33.4   124
24     62.3   574
25     78.5   554
63     93.3   982
29     43.3   123
53     62.6   364
83     74.3   453
21     83.0   165
93     23.4   433
df_B (result data):
group  data1  data2  data3
0      23     13.3   983
0      13     33.4   124
1      24     62.3   574
1      25     78.5   554
2      63     93.3   982
2      29     43.3   123
3      53     62.6   364
3      83     74.3   453
4      21     83.0   165
4      93     23.4   433
Thank you!
Try:
df["group"] = df.index // 2
This assumes the default RangeIndex. Or, independent of the index:
import numpy as np

df["group"] = np.arange(len(df)) // 2
Either way this creates the "group" column:
data1 data2 data3 group
0 23 13.3 983 0
1 13 33.4 124 0
2 24 62.3 574 1
3 25 78.5 554 1
4 63 93.3 982 2
5 29 43.3 123 2
6 53 62.6 364 3
7 83 74.3 453 3
8 21 83.0 165 4
9 93 23.4 433 4
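Once the "group" column exists, it plugs straight into groupby if the goal is a per-pair summary, e.g.:

df.groupby("group").mean()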
This is sample data:
Inn R B W Eco AVG SR
111 368 432 30 5.11 12.27 14.4
94 359 444 24 4.85 14.96 18.5
47 187 202 13 5.55 14.38 15.54
59 273 279 16 5.87 17.06 17.44
34 132 140 9 5.66 14.67 15.56
135 437 536 33 4.89 13.24 16.24
1 0 1 1 0 0 1
Now I would like to make a new column, Choice, whose value is Good, Bad, or Moderate (a bowling rating) for each row. How can I achieve it?
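The question does not state the rating criteria, so here is only a minimal sketch with numpy.select, using made-up Eco thresholds purely for illustration (swap the conditions for whatever actually defines Good/Moderate/Bad):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Eco": [5.11, 4.85, 5.55, 5.87, 5.66, 4.89, 0.0]})

conditions = [df["Eco"] < 5.0,   # hypothetical cutoff for "Good"
              df["Eco"] < 5.6]   # hypothetical cutoff for "Moderate"
df["Choice"] = np.select(conditions, ["Good", "Moderate"], default="Bad")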
I need to calculate average values (row-wise, ignoring the index) of columns taken with a constant step.
I have already done this simple operation for the first 4 columns, and it works nicely. After that I created a list of column names for storing the average values in the dataframe. I found out that I might do this with apply and a lambda, and I have tried many variants, but have not found a solution.
import numpy as np
import pandas as pd

data = np.arange(400).reshape(20, 20)
df = pd.DataFrame(data=data)
df.columns = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T']
df['A1_avg'] = df[['A', 'B', 'C', 'D']].mean(axis=1)
colnames_avg = ['A1_avg', 'A2_avg', 'A3_avg', 'A4_avg', 'A5_avg']
df.head()
I have tried this code for generating 5 extra columns containing the average of several subsets of data:
df[colnames_avg]=df[colnames_avg].applymap(lambda x: df[['A', 'B', 'C', 'D'], ['E', 'F', 'G', 'H'], ['I', 'J', 'K', 'L'],['M', 'N', 'O', 'P'],['Q', 'R', 'S', 'T']].mean(axis=1)
Is it possible to do this with the range function with a predefined step (e.g. 4)?
I would do that as follows in a loop, going over the columns and cutting them into groups of 4 columns each (the last group might be smaller):
cols = list(df.columns)
while len(cols) > 0:
    group = cols[:4]
    cols = cols[4:]
    df['mean_' + '_'.join(group)] = df[group].mean(axis='columns')
The result looks like
df[[col for col in df if col.startswith('mean_')]]
mean_A_B_C_D mean_E_F_G_H mean_I_J_K_L mean_M_N_O_P mean_Q_R_S_T
0 1.5 5.5 9.5 13.5 17.5
1 21.5 25.5 29.5 33.5 37.5
2 41.5 45.5 49.5 53.5 57.5
3 61.5 65.5 69.5 73.5 77.5
4 81.5 85.5 89.5 93.5 97.5
5 101.5 105.5 109.5 113.5 117.5
...
If you want result columns like A1..., just add a counter variable in the loop and use 'A{}'.format(i) as the column name.
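A sketch of that suggestion, the same loop with a counter (run on a fresh df, before the mean_ columns were added):

cols = list(df.columns)
i = 1
while cols:
    group, cols = cols[:4], cols[4:]
    df['A{}_avg'.format(i)] = df[group].mean(axis='columns')
    i += 1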
Method 1: numpy.split & DataFrame.loc:
We can split your columns into evenly sized chunks and then use .loc to create the new columns (np.split requires the number of columns to divide evenly; np.array_split would tolerate a smaller last chunk):
for idx, chunk in enumerate(np.split(df.columns, len(df.columns) // 4)):
    df[f'A{idx+1}_avg'] = df.loc[:, chunk].mean(axis=1)
Output
A B C D E F G H I J ... P Q R S T A1_avg A2_avg A3_avg A4_avg A5_avg
0 0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 1.5 5.5 9.5 13.5 17.5
1 20 21 22 23 24 25 26 27 28 29 ... 35 36 37 38 39 21.5 25.5 29.5 33.5 37.5
2 40 41 42 43 44 45 46 47 48 49 ... 55 56 57 58 59 41.5 45.5 49.5 53.5 57.5
3 60 61 62 63 64 65 66 67 68 69 ... 75 76 77 78 79 61.5 65.5 69.5 73.5 77.5
4 80 81 82 83 84 85 86 87 88 89 ... 95 96 97 98 99 81.5 85.5 89.5 93.5 97.5
5 100 101 102 103 104 105 106 107 108 109 ... 115 116 117 118 119 101.5 105.5 109.5 113.5 117.5
6 120 121 122 123 124 125 126 127 128 129 ... 135 136 137 138 139 121.5 125.5 129.5 133.5 137.5
7 140 141 142 143 144 145 146 147 148 149 ... 155 156 157 158 159 141.5 145.5 149.5 153.5 157.5
8 160 161 162 163 164 165 166 167 168 169 ... 175 176 177 178 179 161.5 165.5 169.5 173.5 177.5
9 180 181 182 183 184 185 186 187 188 189 ... 195 196 197 198 199 181.5 185.5 189.5 193.5 197.5
10 200 201 202 203 204 205 206 207 208 209 ... 215 216 217 218 219 201.5 205.5 209.5 213.5 217.5
11 220 221 222 223 224 225 226 227 228 229 ... 235 236 237 238 239 221.5 225.5 229.5 233.5 237.5
12 240 241 242 243 244 245 246 247 248 249 ... 255 256 257 258 259 241.5 245.5 249.5 253.5 257.5
13 260 261 262 263 264 265 266 267 268 269 ... 275 276 277 278 279 261.5 265.5 269.5 273.5 277.5
14 280 281 282 283 284 285 286 287 288 289 ... 295 296 297 298 299 281.5 285.5 289.5 293.5 297.5
15 300 301 302 303 304 305 306 307 308 309 ... 315 316 317 318 319 301.5 305.5 309.5 313.5 317.5
16 320 321 322 323 324 325 326 327 328 329 ... 335 336 337 338 339 321.5 325.5 329.5 333.5 337.5
17 340 341 342 343 344 345 346 347 348 349 ... 355 356 357 358 359 341.5 345.5 349.5 353.5 357.5
18 360 361 362 363 364 365 366 367 368 369 ... 375 376 377 378 379 361.5 365.5 369.5 373.5 377.5
19 380 381 382 383 384 385 386 387 388 389 ... 395 396 397 398 399 381.5 385.5 389.5 393.5 397.5
Method 2: range & iloc:
We can create a range with a step of 4, then use iloc to access each slice of your dataframe, calculate the mean, and create the new column at the same time:
slices = range(0, len(df.columns) + 1, 4)
for idx, rng in enumerate(slices):
    if idx > 0:
        df[f'A{idx}_avg'] = df.iloc[:, slices[idx-1]:slices[idx]].mean(axis=1)
Output
A B C D E F G H I J ... P Q R S T A1_avg A2_avg A3_avg A4_avg A5_avg
0 0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 1.5 5.5 9.5 13.5 17.5
1 20 21 22 23 24 25 26 27 28 29 ... 35 36 37 38 39 21.5 25.5 29.5 33.5 37.5
2 40 41 42 43 44 45 46 47 48 49 ... 55 56 57 58 59 41.5 45.5 49.5 53.5 57.5
3 60 61 62 63 64 65 66 67 68 69 ... 75 76 77 78 79 61.5 65.5 69.5 73.5 77.5
4 80 81 82 83 84 85 86 87 88 89 ... 95 96 97 98 99 81.5 85.5 89.5 93.5 97.5
5 100 101 102 103 104 105 106 107 108 109 ... 115 116 117 118 119 101.5 105.5 109.5 113.5 117.5
6 120 121 122 123 124 125 126 127 128 129 ... 135 136 137 138 139 121.5 125.5 129.5 133.5 137.5
7 140 141 142 143 144 145 146 147 148 149 ... 155 156 157 158 159 141.5 145.5 149.5 153.5 157.5
8 160 161 162 163 164 165 166 167 168 169 ... 175 176 177 178 179 161.5 165.5 169.5 173.5 177.5
9 180 181 182 183 184 185 186 187 188 189 ... 195 196 197 198 199 181.5 185.5 189.5 193.5 197.5
10 200 201 202 203 204 205 206 207 208 209 ... 215 216 217 218 219 201.5 205.5 209.5 213.5 217.5
11 220 221 222 223 224 225 226 227 228 229 ... 235 236 237 238 239 221.5 225.5 229.5 233.5 237.5
12 240 241 242 243 244 245 246 247 248 249 ... 255 256 257 258 259 241.5 245.5 249.5 253.5 257.5
13 260 261 262 263 264 265 266 267 268 269 ... 275 276 277 278 279 261.5 265.5 269.5 273.5 277.5
14 280 281 282 283 284 285 286 287 288 289 ... 295 296 297 298 299 281.5 285.5 289.5 293.5 297.5
15 300 301 302 303 304 305 306 307 308 309 ... 315 316 317 318 319 301.5 305.5 309.5 313.5 317.5
16 320 321 322 323 324 325 326 327 328 329 ... 335 336 337 338 339 321.5 325.5 329.5 333.5 337.5
17 340 341 342 343 344 345 346 347 348 349 ... 355 356 357 358 359 341.5 345.5 349.5 353.5 357.5
18 360 361 362 363 364 365 366 367 368 369 ... 375 376 377 378 379 361.5 365.5 369.5 373.5 377.5
19 380 381 382 383 384 385 386 387 388 389 ... 395 396 397 398 399 381.5 385.5 389.5 393.5 397.5
[20 rows x 25 columns]
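The same idea reads a little tighter by zipping consecutive boundaries, a small variation on Method 2 above:

# slices as defined in Method 2; zip pairs each boundary with the next one
for i, (start, stop) in enumerate(zip(slices, slices[1:]), start=1):
    df[f'A{i}_avg'] = df.iloc[:, start:stop].mean(axis=1)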
I'm trying to read in a discharge data file which looks like this:
Station number: 420
Location: Kotagaon Shringe
Latitude: 27 45 00
River: Kali Gandaki
Longitude: 84 20 50
Year: 2001
Mean daily discharge in m3/s
============================
Day Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec. Year
01 118 99.3 85.9 75.5 119 182 656 2790 1690 402 232 158
02 123 97.4 82.9 74.3 134 251 514 2420 2180 397 230 158
03 118 95.5 80.7 73.1 168 377 466 2190 2190 386 226 157
-------------------------------- Skipping some rows of no real interest
25 95.5 85.5 70.7 83.3 163 583 898 3230 485 257 177 123
26 94.1 88.6 69.9 84.6 167 579 996 2330 474 252 175 121
27 92.2 88.6 71.9 88.1 166 736 1180 2270 461 248 173 120
28 91.8 87.3 69.9 91.3 172 419 1020 2270 431 246 168 118
29 95.5 71.9 93.2 165 446 1670 2140 410 244 163 118
30 98.4 76.0 109 176 575 2040 2100 403 239 159 117
31 98.4 75.1 174 3330 1600 234 117
My problem is that, with whitespace as the separator, the March value on day 29 gets shifted over, because February has no day 29; the same happens in other places with empty values.
Is there a good way to work around this?
I have looked for solutions online, but all I could find deals with uneven row lengths, not uneven column lengths.
My attempt so far has resulted in this code:

disc = pd.read_csv(filename, header=6, sep=r'\s+', nrows=31)
disc['Year'] = 2001
With the dataframe looking like:
Day Jan. Feb. Mar. Apr. May Jun. Jul. Aug. Sep. Oct. Nov. Dec. Year
0 1 118.0 99.3 85.9 75.5 119 182 656 2790.0 1690.0 402.0 232.0 158.0 2001
1 2 123.0 97.4 82.9 74.3 134 251 514 2420.0 2180.0 397.0 230.0 158.0 2001
2 3 118.0 95.5 80.7 73.1 168 377 466 2190.0 2190.0 386.0 226.0 157.0 2001
----------------------------------------------- Skipping some rows of no real interest
28 29 95.5 71.9 93.2 165.0 446 1670 2140 410.0 244.0 163.0 118.0 NaN 2001
29 30 98.4 76.0 109.0 176.0 575 2040 2100 403.0 239.0 159.0 117.0 NaN 2001
30 31 98.4 75.1 174.0 3330.0 1600 234 117 NaN NaN NaN NaN NaN 2001
You can use the pd.read_fwf() function for reading fixed-width files and leverage the skiprows keyword:
disc = pd.read_fwf('test.csv', skiprows=11)
Yields:
Day Jan. Feb. Mar. Apr. ... Sep. Oct. Nov. Dec. Year
0 1 118.0 99.3 85.9 75.5 ... 1690.0 402 232.0 158 NaN
1 2 123.0 97.4 82.9 74.3 ... 2180.0 397 230.0 158 NaN
2 3 118.0 95.5 80.7 73.1 ... 2190.0 386 226.0 157 NaN
3 25 95.5 85.5 70.7 83.3 ... 485.0 257 177.0 123 NaN
4 26 94.1 88.6 69.9 84.6 ... 474.0 252 175.0 121 NaN
5 27 92.2 88.6 71.9 88.1 ... 461.0 248 173.0 120 NaN
6 28 91.8 87.3 69.9 91.3 ... 431.0 246 168.0 118 NaN
7 29 95.5 NaN 71.9 93.2 ... 410.0 244 163.0 118 NaN
8 30 98.4 NaN 76.0 109.0 ... 403.0 239 159.0 117 NaN
9 31 98.4 NaN 75.1 NaN ... NaN 234 NaN 117 NaN
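read_fwf infers the column boundaries from the first rows of data (the default colspecs='infer'). If the inference ever picks wrong boundaries, you can pass them explicitly; a sketch with made-up character offsets (the (start, stop) pairs below are illustrative, not measured from the real file):

disc = pd.read_fwf('test.csv', skiprows=11,
                   colspecs=[(0, 4), (4, 11), (11, 18)])  # hypothetical offsets, one pair per column

Since the trailing Year column in the file is empty, you can then re-add it as before with disc['Year'] = 2001.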