Scraping table returning repeated values - python

I'm trying to build a simple web scraper. I'm scraping a table, but I'm not sure why the output is the same line, School 20-5 33.2, printed 26 times over.
Here is my code:
from bs4 import BeautifulSoup
import requests

url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

teams = soup.find_all('tr')
for team in teams:
    teamname = soup.find('th', class_ = "school").text
    record = soup.find('td', class_= "overall dw").text
    rating = soup.find('td', class_ = "rating sorted dw").text
    print(teamname, record, rating)

Notice that you're never using the Tag that team refers to. Inside the for loop, all of the calls to soup.find() should be calls to team.find():
for team in teams[1:]:
    teamname = team.find('th', class_ = "school").text
    record = team.find('td', class_= "overall dw").text
    rating = team.find('td', class_ = "rating sorted dw").text
    print(teamname, record, rating)
This outputs:
St. Mary's Prep (Orchard Lake) 20-5 33.2
University of Detroit Jesuit (Detroit) 16-7 30.0
Williamston 25-0 29.3
Ferndale 21-3 28.9
Catholic Central (Grand Rapids) 25-1 28.4
King (Detroit) 18-3 27.4
De La Salle Collegiate (Warren) 18-7 27.2
Catholic Central (Novi) 16-9 26.6
Brother Rice (Bloomfield Hills) 15-7 26.5
Unity Christian (Hudsonville) 21-1 26.4
Hamtramck 21-4 26.3
Grand Blanc 20-5 25.9
East Lansing 18-5 25.0
Muskegon 20-3 24.8
Northview (Grand Rapids) 25-1 24.6
Cass Tech (Detroit) 21-4 24.3
North Farmington (Farmington Hills) 18-4 24.2
Beecher (Flint) 23-2 24.0
Okemos 19-5 23.9
Benton Harbor 23-3 23.2
Rockford 19-3 22.9
Grand Haven 17-4 21.9
Hartland 19-4 21.0
Marshall 20-3 21.0
Freeland 24-0 21.0
We use [1:] to skip the table header, slicing off the first element in the teams list.
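If you want the rows in a data structure rather than just printed, a minimal variation of the same loop (assuming the page keeps the same th/td classes) collects each row into a list of dicts and skips any row that is missing one of the cells:

from bs4 import BeautifulSoup
import requests

url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

rows = []
for team in soup.find_all('tr')[1:]:
    name_cell = team.find('th', class_='school')
    record_cell = team.find('td', class_='overall dw')
    rating_cell = team.find('td', class_='rating sorted dw')
    # Skip header/spacer rows that don't contain all three cells
    if not (name_cell and record_cell and rating_cell):
        continue
    rows.append({
        'school': name_cell.text.strip(),
        'record': record_cell.text.strip(),
        'rating': rating_cell.text.strip(),
    })

print(rows[:3])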

Let pandas parse that table for you (it uses BeautifulSoup under the hood).
import pandas as pd
url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
df = pd.read_html(url)[0]
print(df)
Output:
# School Ovr. Rating Str. +/-
0 1 St. Mary's Prep (Orchard Lake) 20-5 33.2 23.0 NaN
1 2 University of Detroit Jesuit (Detroit) 16-7 30.0 24.1 NaN
2 3 Williamston 25-0 29.3 10.9 NaN
3 4 Ferndale 21-3 28.9 16.5 NaN
4 5 Catholic Central (Grand Rapids) 25-1 28.4 11.4 NaN
5 6 King (Detroit) 18-3 27.4 15.2 NaN
6 7 De La Salle Collegiate (Warren) 18-7 27.2 19.6 2.0
7 8 Catholic Central (Novi) 16-9 26.6 22.6 -1.0
8 9 Brother Rice (Bloomfield Hills) 15-7 26.5 21.0 -1.0
9 10 Unity Christian (Hudsonville) 21-1 26.4 10.4 NaN
10 11 Hamtramck 21-4 26.3 14.5 2.0
11 12 Grand Blanc 20-5 25.9 15.3 -1.0
12 13 East Lansing 18-5 25.0 15.6 1.0
13 14 Muskegon 20-3 24.8 11.4 1.0
14 15 Northview (Grand Rapids) 25-1 24.6 8.2 1.0
15 16 Cass Tech (Detroit) 21-4 24.3 11.8 -4.0
16 17 North Farmington (Farmington Hills) 18-4 24.2 13.1 NaN
17 18 Beecher (Flint) 23-2 24.0 8.6 2.0
18 19 Okemos 19-5 23.9 13.7 -1.0
19 20 Benton Harbor 23-3 23.2 9.9 -1.0
20 21 Rockford 19-3 22.9 11.6 NaN
21 22 Grand Haven 17-4 21.9 11.3 NaN
22 23 Hartland 19-4 21.0 10.4 1.0
23 24 Marshall 20-3 21.0 8.6 -1.0
24 25 Freeland 24-0 21.0 2.7 4.0
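If you only need the school, record, and rating (the same fields the hand-written loop pulled out), you can trim the parsed DataFrame down; a minimal sketch, assuming the column labels shown in the printed output above:

import pandas as pd

url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
df = pd.read_html(url)[0]

# Column names taken from the printed output above; adjust if the live page differs
trimmed = df[['School', 'Ovr.', 'Rating']]
# Example filename for saving the trimmed table
trimmed.to_csv('michigan_rankings.csv', index=False)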

Related

How to find elements that match specific conditions in Selenium

I want to crawl data from a web page, but I don't know how to get data from these tags
I don't know how to get the data out of these tags. Please help me.
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

browser = webdriver.Chrome(executable_path="./chromedriver.exe")
idx = 0
data = []
title = []
#print("Process 300 days from {}-{}-{}".format(current_date.day, current_date.month, current_date.year))
url = 'https://24hmoney.vn/stock/HAG/financial-report'
web = browser.get(url)

# Click the quarterly ("theo quy") button
btn1 = browser.find_element(By.XPATH, "/html/body/div[1]/div/div/div[2]/div[1]/div[4]/div[2]/div[1]")
btn1.click()

# Click to show the change versus the same period ("hien thi tang giam so voi cung ki")
#btn2 = browser.find_element(By.XPATH,"/html/body/div[1]/div/div/div[2]/div[1]/div[4]/div[3]/div[1]/span")
#btn2.click()

lai = browser.find_elements(By.CSS_SELECTOR, 'p')
for raw in lai:
    data.append(raw.text)
    #print(raw.text)

tieude = browser.find_elements(By.CLASS_NAME, 'sticky-col.first-col')
for raw2 in tieude:
    title.append(raw2.text)
    print(raw2.text)

#df = pd.DataFrame(data,columns=["HAG"])
df = pd.DataFrame(title, columns=["Tieude"])
df.to_csv("HAG.csv", index=False)
#a = input()
Maybe the following code will solve your issue?
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://24hmoney.vn/stock/HAG/financial-report'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table = soup.select_one('div[class="financial-report-box-content"] table')
df = pd.read_html(str(table))[0]
print(df)
Result in terminal:
Tiêu đề Q3/22 % Q3/21 Q2/22 % Q2/21 Q1/22 % Q1/21 Q4/21 % Q4/20 Q3/21 % Q3/20 Q2/21 % Q2/20 Q1/21 % Q1/20 Q4/20 % Q4/19
0 Doanh thu 1441.4 160.1% 1233.6 125.2% 802.6 182.3% 743.7 -19.2% 554.1 -20.9% 547.7 -15.4% 284.4 -66% 920.4 51%
1 Các khoản giảm trừ NaN NaN 6.2 -81.3% NaN NaN NaN NaN NaN NaN 3.4 68.3% 18.5 -678.5% 6.7 5.7%
2 Doanh thu thuần 1441.4 160.1% 1227.4 125.5% 802.6 201.9% 743.7 -18.6% 554.1 -20.9% 544.3 -14.6% 265.8 -68.1% 913.7 51.6%
3 Giá vốn hàng bán 1160.6 -207.3% 1051.7 -115.3% 512.8 -140.3% 511.5 52.7% 377.6 50.1% 488.4 3% 213.4 61.3% 1082.2 -74.1%
4 Lợi nhuận gộp 280.8 59.1% 175.7 214.4% 289.8 452.8% 232.3 237.9% 176.5 414.3% 55.9 -58.1% 52.4 -81.5% -168.5 -783.9%
5 Thu nhập tài chính 117.5 -10.9% 95.4 -24.8% 192.4 -44.9% 127.6 -83.7% 131.9 -5.3% 126.8 -34.3% 349.4 122.3% 783.6 203.9%
6 Chi phí tài chính 166.0 76% 875.9 -410.8% 185.9 13.4% 254.7 150.6% 691.5 -161.9% 171.5 -39.8% 214.8 33.6% 503.4 -44.1%
7 Chi phí tiền lãi 166.9 -0.2% 223.6 -35.6% 162.7 18.6% 167.4 66.3% 166.5 21.9% 165.0 25.9% 199.8 25.3% 496.9 -72.7%
8 Lãi/lỗ từ công ty liên doanh NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -7.6 -1,149% 1.8 -18.8% 4.9 -86.8%
9 Chi phí bán hàng 58.6 -51.5% 90.5 -191.7% 52.1 -202.7% 42.3 34.5% 38.7 47.6% 31.0 76.5% 17.2 79.6% 64.6 14.5%
10 Chi phí quản lý doanh nghiệp 181.1 -60.4% 950.6 322.9% 5.2 101.4% 404.5 56% 457.0 395.4% 224.8 424.4% 367.3 -272.5% 919.8 -890.1%
11 Lãi/lỗ từ hoạt động kinh doanh 354.8 909% 255.2 29.3% 249.3 227.4% 167.7 119.3% 35.2 108.6% 197.3 10,174% -195.7 -202.8% -867.8 -258.5%
12 Thu nhập khác 2.8 35% 24.8 455.4% 5.7 -81.7% 44.0 66.5% 2.1 7.2% 4.5 -85.2% 31.1 68.9% 26.5 140.4%
13 Chi phí khác -7.4 56.3% -56.0 66.7% -15.3 81.7% -143.8 78.8% -17.0 89.7% -168.3 -98.2% -83.5 -153.3% -679.1 -141.2%
14 Thu nhập khác, ròng -4.6 69.2% -31.2 81% -9.6 81.7% -99.8 84.7% -14.9 90.8% -163.8 -198.8% -52.4 -260.2% -652.7 -141.2%
15 Lãi/lỗ từ công ty liên doanh NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 LỢI NHUẬN TRƯỚC THUẾ 350.2 1,629% 224.0 569.6% 239.8 196.6% 67.9 104.5% 20.2 103.5% 33.5 163.2% -248.1 -213.3% -1520.5 -196.6%
17 Thuế thu nhập doanh nghiệp – hiện thời 1.2 NaN 1.4 -333.9% 0.2 NaN 0.0 96.6% NaN NaN 0.3 -76% NaN NaN 1.1 -27.7%
18 Thuế thu nhập doanh nghiệp – hoãn lại 20.5 1,267% 42.2 -3.9% 18.4 -89.7% 28.6 801.7% 1.5 -19.5% 43.9 1,847% 179.4 15,941% 4.1 -102.4%
19 Chi phí thuế thu nhập doanh nghiệp 19.4 1,191% 40.8 -6.3% 18.2 -89.8% 28.6 656.6% 1.5 -13.8% 43.6 1,718% 179.4 18,238% 5.1 -103%
20 LỢI NHUẬN SAU THUẾ TNDN 369.5 1,599% 264.9 243.7% 258.0 475.2% 96.5 106.3% 21.7 103.8% 77.1 238.6% -68.8 12.1% -1525.6 -344.5%
21 Lợi ích của cổ đông thiểu số 8.8 548.1% -14.9 -3,505% 8.0 177% -45.7 87% -2.0 99.5% 0.4 100.2% -10.3 -14.7% -352.1 11.8%
22 Lợi nhuận của Cổ đông của Công ty mẹ 360.7 1,421% 279.7 265% 250.0 528% 142.2 112.1% 23.7 112.7% 76.6 -56.6% -58.4 15.6% -1173.5 -2,193%
23 EPS 4QGN (đ) 389.0 1,396% 301.0 262.6% 270.0 528.6% 153.0 112.1% 26.0 112.9% 83.0 -56.5% -63.0 16% -1265.0 -540.8%
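If you also want the parsed report written to HAG.csv, as the original Selenium script did, a minimal follow-up sketch reusing the df built above (the 'Tiêu đề' column label is taken from the printed output and may differ on the live page):

# Use the row labels as the index and save, mirroring the original script's HAG.csv
df = df.set_index('Tiêu đề')
df.to_csv('HAG.csv')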

How do I parse out specific dataframes when using pandas to web scrape data from a page with multiple dataframes?

I'm working on a project that will scrape data off of https://www.pro-football-reference.com/years/2021/opp.htm. When you visit this webpage you'll see that there are multiple tables. I can obtain the first table with no issues when I run the following code:
import pandas as pd
year=2021
defense_url = 'https://www.pro-football-reference.com/years/{}/opp.htm'.format(year)
df5 = pd.read_html(defense_url, header=1)[0]
df5.head()
However, when I attempt to obtain data from the other tables by changing the index, I get a table without a header or an error. For example, df5 = pd.read_html(defense_url, header=1)[1] creates a dataframe without a header.
Additionally, df5 = pd.read_html(defense_url, header=1)[2] generates an IndexError (as shown below):
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 df5 = pd.read_html(defense_url, header=1)[2]
2 df5.head()
IndexError: list index out of range
Does anyone know what I could possibly be doing wrong here?
The reason for the IndexError is that the html response only contains 2 <table> tags, so when you try to get the dataframe at index position [2] (the 3rd table), it's not in the list of dataframes.
The other tables are actually there in the html response, but inside comments. So there are two ways to get at them:
Use BeautifulSoup to pull out the comments and parse those.
Simply remove/replace the html strings that denote commented html.
I'll code out both for you below:
1. bs4 Comments:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

year = 2021
defense_url = 'https://www.pro-football-reference.com/years/{}/opp.htm'.format(year)

# Start with the uncommented tables, then append any tables found in comments
df5 = pd.read_html(defense_url, header=1)

result = requests.get(defense_url).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'table' in str(each):
        try:
            level = 0
            # Use the second header row if the table has a MultiIndex header
            if isinstance(pd.read_html(str(each))[0].columns, pd.MultiIndex):
                level = 1
            df5.append(pd.read_html(str(each), header=level)[0])
        except:
            continue
2. Remove/replace the html string that denotes commented html:
import requests
import pandas as pd
year=2021
defense_url = 'https://www.pro-football-reference.com/years/{}/opp.htm'.format(year)
# Get the data
response = requests.get(defense_url)
html = response.text.replace('<!--', '').replace('-->', '')
df5 = pd.read_html(html)
The only difference between the two is that with the second option you'll need one more step to find which dataframes have a MultiIndex header and deal with those if needed.
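For option 2, a minimal sketch of that extra step, assuming you simply want to drop the top level of any MultiIndex header (the same thing header=1 accomplished in option 1); df5 here is the list returned by the option 2 code:

# Flatten MultiIndex headers so every dataframe in the list has a single header row
for i, frame in enumerate(df5):
    if isinstance(frame.columns, pd.MultiIndex):
        df5[i] = frame.droplevel(0, axis=1)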
Output from option 1:
print(df5[2])
Rk Tm G Cmp ... Sk% NY/A ANY/A EXP
0 1.0 Buffalo Bills 17.0 297.0 ... 7.3 4.80 3.8 20.04
1 2.0 New England Patriots 17.0 319.0 ... 6.3 5.50 4.5 -0.86
2 3.0 Chicago Bears 17.0 314.0 ... 9.3 6.20 6.7 -78.80
3 4.0 Carolina Panthers 17.0 337.0 ... 7.0 5.90 6.1 -47.59
4 5.0 Cleveland Browns 17.0 367.0 ... 6.9 5.60 5.5 -51.01
5 6.0 San Francisco 49ers 17.0 372.0 ... 8.1 5.90 6.1 -91.78
6 7.0 Arizona Cardinals 17.0 367.0 ... 6.8 6.10 6.1 -49.69
7 8.0 Denver Broncos 17.0 341.0 ... 6.0 6.10 5.9 -43.39
8 9.0 Pittsburgh Steelers 17.0 355.0 ... 8.9 5.90 5.7 -44.60
9 10.0 Green Bay Packers 17.0 379.0 ... 6.1 5.80 5.5 -33.40
10 11.0 Philadelphia Eagles 17.0 409.0 ... 4.7 6.10 6.1 -81.44
11 12.0 Los Angeles Chargers 17.0 357.0 ... 5.9 6.30 6.4 -89.74
12 13.0 Las Vegas Raiders 17.0 400.0 ... 5.5 5.90 6.4 -159.12
13 14.0 New Orleans Saints 17.0 369.0 ... 7.2 6.00 5.3 -1.74
14 15.0 New York Giants 17.0 402.0 ... 5.3 6.00 5.7 -56.19
15 16.0 Miami Dolphins 17.0 373.0 ... 7.3 5.90 5.6 -26.14
16 17.0 Jacksonville Jaguars 17.0 377.0 ... 5.6 6.70 7.0 -137.16
17 18.0 Atlanta Falcons 17.0 391.0 ... 3.0 6.60 6.8 -119.11
18 19.0 Indianapolis Colts 17.0 390.0 ... 5.2 6.30 6.0 -73.90
19 20.0 Dallas Cowboys 17.0 364.0 ... 6.3 6.20 5.1 13.82
20 21.0 Tampa Bay Buccaneers 17.0 445.0 ... 6.5 5.60 5.3 -30.09
21 22.0 Los Angeles Rams 17.0 416.0 ... 7.4 6.10 5.3 -41.65
22 23.0 Houston Texans 17.0 363.0 ... 5.5 7.10 6.7 -131.46
23 24.0 Detroit Lions 17.0 359.0 ... 5.2 7.20 7.5 -161.97
24 25.0 Tennessee Titans 17.0 395.0 ... 6.4 6.20 5.9 -59.34
25 26.0 Cincinnati Bengals 17.0 420.0 ... 6.3 6.30 6.2 -67.23
26 27.0 Kansas City Chiefs 17.0 401.0 ... 4.8 6.70 6.5 -110.25
27 28.0 Minnesota Vikings 17.0 401.0 ... 7.5 6.40 6.1 -57.45
28 29.0 Washington Football Team 17.0 400.0 ... 6.0 6.80 7.1 -142.08
29 30.0 New York Jets 17.0 401.0 ... 5.3 7.10 7.5 -198.18
30 31.0 Seattle Seahawks 17.0 443.0 ... 4.9 6.50 6.5 -126.34
31 32.0 Baltimore Ravens 17.0 397.0 ... 5.2 7.20 7.6 -166.26
32 NaN Avg Team NaN 378.8 ... 6.2 6.22 6.1 -76.40
33 NaN League Total NaN 12121.0 ... 6.2 6.22 6.1 NaN
34 NaN Avg Tm/G NaN 22.3 ... 6.2 6.22 6.1 NaN
[35 rows x 25 columns]

Unable to extract Tables

Beginner here. I'm having issues trying to extract data from the second (Team Statistics) and third (Team Analytics 5-on-5) tables on this page:
https://www.hockey-reference.com/leagues/NHL_2021.html
I'm using this code:
import pandas as pd
import requests

url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[1]
print(df)
and
url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[2]
print(df)
to get the right tables.
But for some reason I always get this error message:
IndexError: list index out of range
I can extract the first table by using the same code with df = df_list[0]; that works, but it is not the table I need. I really need the 2nd and 3rd tables, and I just don't know why it doesn't work.
Pretty sure that's easy to answer for most of you.
Thanks in advance!
You get this error because the read_html() call returns a list with only one element, and that element is at position 0.
instead of
df = df_list[1]
use this
df = df_list[0]
You get one combined table of all the teams from the site, so to pull out particular divisions, slice the rows with the loc[] accessor:
east_division = df.loc[9:17]
north_division = df.loc[18:25]
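To see why indexing past position 0 raises the IndexError in the first place, you can check how many tables read_html actually found before indexing into the list; a minimal sketch:

import pandas as pd
import requests

url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content

df_list = pd.read_html(html)
# Only the uncommented tables are parsed here, so this may print just 1
print(len(df_list))
df = df_list[0]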
Use the URL directly in pandas.read_html
df = pd.read_html('https://www.hockey-reference.com/leagues/NHL_2021.html')
The tables are in fact there in the html, but within comments. Use BeautifulSoup to pull out the comments and parse those tables as well. The code below pulls all the tables (both commented and uncommented) and puts them into a list; then it's just a matter of pulling out the tables you want by index, in this case indices 1 and 2.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.hockey-reference.com/leagues/NHL_2021.html"
# Get all the uncommented tables
tables = pd.read_html(url, header=1)

# Get the html source
response = requests.get(url, headers=headers)

# Create a soup object from the html
soup = BeautifulSoup(response.content, 'html.parser')

# Get the comments in the html
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

# Iterate through each comment, parse the table if one is found,
# clean it up, and append it to the tables list
for each in comments:
    if 'table' in str(each):
        try:
            table = pd.read_html(str(each), header=1)[0]
            # Drop repeated header rows and rename the team column
            table = table[table['Rk'].ne('Rk')]
            table = table.rename(columns={'Unnamed: 1': 'Team'})
            tables.append(table)
        except:
            continue
Output:
for table in tables[1:3]:
    print(table)
Rk Unnamed: 1 AvAge GP W ... S S% SA SV% SO
0 1.0 New York Islanders 29.0 28 18 ... 841 9.8 767 0.920 5
1 2.0 Tampa Bay Lightning 28.3 26 19 ... 798 12.2 725 0.919 3
2 3.0 Florida Panthers 28.1 27 18 ... 918 10.0 840 0.910 0
3 4.0 Toronto Maple Leafs 28.9 29 19 ... 883 11.2 828 0.909 2
4 5.0 Carolina Hurricanes 27.2 26 19 ... 816 10.9 759 0.912 3
5 6.0 Washington Capitals 30.4 27 17 ... 768 12.0 808 0.895 0
6 7.0 Vegas Golden Knights 29.1 25 18 ... 752 11.0 691 0.920 4
7 8.0 Edmonton Oilers 28.4 30 18 ... 945 10.6 938 0.907 2
8 9.0 Winnipeg Jets 28.0 27 17 ... 795 11.4 856 0.910 1
9 10.0 Pittsburgh Penguins 28.1 27 17 ... 779 11.0 784 0.899 1
10 11.0 Chicago Blackhawks 27.2 29 14 ... 863 10.1 997 0.910 2
11 12.0 Minnesota Wild 28.8 25 16 ... 764 10.3 723 0.913 2
12 13.0 St. Louis Blues 28.2 28 14 ... 836 10.4 835 0.892 0
13 14.0 Boston Bruins 28.8 25 14 ... 772 8.8 665 0.913 2
14 15.0 Colorado Avalanche 26.8 25 15 ... 846 8.7 622 0.905 4
15 16.0 Montreal Canadiens 28.8 27 12 ... 890 9.7 782 0.909 0
16 17.0 Philadelphia Flyers 27.5 25 13 ... 699 11.7 753 0.892 3
17 18.0 Calgary Flames 28.0 28 13 ... 838 8.9 845 0.904 3
18 19.0 Los Angeles Kings 27.7 26 11 ... 748 10.3 814 0.910 2
19 20.0 Vancouver Canucks 27.7 31 13 ... 951 8.8 1035 0.903 1
20 21.0 Columbus Blue Jackets 27.0 29 11 ... 839 9.3 902 0.895 1
21 22.0 Arizona Coyotes 28.5 27 12 ... 689 9.7 851 0.907 1
22 23.0 San Jose Sharks 29.3 25 11 ... 749 9.5 800 0.890 1
23 24.0 New York Rangers 25.7 26 11 ... 773 9.2 746 0.906 2
24 25.0 Nashville Predators 28.9 28 11 ... 880 7.4 837 0.885 1
25 26.0 Anaheim Ducks 28.4 29 8 ... 804 7.7 852 0.891 3
26 27.0 Dallas Stars 28.3 23 8 ... 657 10.2 626 0.904 3
27 28.0 Detroit Red Wings 29.4 28 8 ... 785 8.0 870 0.891 0
28 29.0 Ottawa Senators 26.4 30 9 ... 942 8.2 960 0.874 0
29 30.0 New Jersey Devils 26.2 24 8 ... 708 8.5 741 0.896 2
30 31.0 Buffalo Sabres 27.4 26 6 ... 728 7.7 804 0.893 0
31 NaN League Average 28.1 27 13 ... 808 9.8 808 0.902 2
[32 rows x 32 columns]
Rk Unnamed: 1 S% SV% ... HDGF HDC% HDGA HDCO%
0 1 New York Islanders 8.3 0.931 ... 11 12.2 11 11.8
1 2 Tampa Bay Lightning 8.7 0.933 ... 11 14.9 6 6.3
2 3 Florida Panthers 7.9 0.926 ... 15 14.4 12 17.6
3 4 Toronto Maple Leafs 8.8 0.933 ... 16 13.4 8 11.1
4 5 Carolina Hurricanes 7.5 0.932 ... 12 12.8 7 9.3
5 6 Washington Capitals 9.8 0.919 ... 10 10.9 5 7.8
6 7 Vegas Golden Knights 9.3 0.927 ... 20 15.9 11 14.5
7 8 Edmonton Oilers 8.2 0.920 ... 9 11.3 13 9.8
8 9 Winnipeg Jets 8.5 0.926 ... 15 15.0 8 7.8
9 10 Pittsburgh Penguins 8.8 0.922 ... 10 14.5 15 13.5
10 11 Chicago Blackhawks 7.3 0.925 ... 10 10.5 14 15.1
11 12 Minnesota Wild 9.9 0.930 ... 16 14.2 8 11.9
12 13 St. Louis Blues 8.4 0.914 ... 15 18.1 15 15.8
13 14 Boston Bruins 6.6 0.922 ... 5 7.4 11 12.2
14 15 Colorado Avalanche 6.7 0.916 ... 8 8.1 8 13.3
15 16 Montreal Canadiens 7.8 0.935 ... 15 12.0 8 11.3
16 17 Philadelphia Flyers 10.1 0.907 ... 18 15.9 9 12.9
17 18 Calgary Flames 7.6 0.929 ... 6 6.9 8 9.2
18 19 Los Angeles Kings 7.5 0.925 ... 11 13.1 8 9.8
19 20 Vancouver Canucks 7.3 0.919 ... 17 13.2 20 17.4
20 21 Columbus Blue Jackets 8.1 0.918 ... 5 9.6 15 13.6
21 22 Arizona Coyotes 7.7 0.924 ... 11 14.7 14 12.8
22 23 San Jose Sharks 8.1 0.909 ... 12 14.6 16 14.0
23 24 New York Rangers 7.8 0.921 ... 17 14.0 8 12.7
24 25 Nashville Predators 5.7 0.918 ... 5 10.6 11 13.4
25 26 Anaheim Ducks 7.4 0.909 ... 12 13.3 25 16.8
26 27 Dallas Stars 7.4 0.929 ... 11 13.3 5 12.8
27 28 Detroit Red Wings 7.5 0.923 ... 13 15.3 12 16.7
28 29 Ottawa Senators 7.1 0.894 ... 7 8.6 20 14.3
29 30 New Jersey Devils 7.2 0.923 ... 10 14.3 12 13.2
30 31 Buffalo Sabres 5.8 0.911 ... 6 8.2 16 14.0

Webscraping in BeautifulSoup is returning an empty list

I am trying to web scrape a table from basketball reference, and it returns an empty list. I was hoping someone could help me debug or explain why. The page has many tables, but it's the Miscellaneous Stats table in particular that I'm after. Thanks in advance!
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import matplotlib as plt
import numpy as np
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html#all_misc_stats'
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
soup.find('div', {'id':'div_misc_stats'})
Your implementation isn't wrong for parsing the soup; it's just that the particular element you're looking for requires javascript to render. You're probably better off looking for some other source of the data if you can find it.
If you really need THIS data, then you may wish to look into rendering the page first (see this for some inspiration).
From my cursory analysis, it also seems there isn't an external network call made to fetch the data before rendering it, so it may be embedded elsewhere in the page as xml/json/etc., although I didn't find it in my search. It may be worth checking that before investing in a more compute-expensive approach, if this is not a one-time thing you'll need to scrape.
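One quick way to check whether the data is already embedded somewhere in the raw response (before reaching for a headless browser) is to search the unparsed HTML for the table's id; a minimal sketch:

import requests

url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
html = requests.get(url).text

# If the id shows up in the raw text, the table is in the page source
# (possibly inside an HTML comment) and javascript rendering isn't needed
print('div_misc_stats' in html)
print('<!--' in html)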
The data is inside an HTML comment (<!-- ... -->). You can use this script to load it into a DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = soup.select_one('h2:contains("Miscellaneous Stats")').find_next(text=lambda t: isinstance(t, Comment))
df = pd.read_html(str(table))[0].droplevel(0, axis=1)
print(df)
Prints:
Rk Team Age W L PW PL MOV SOS SRS ORtg DRtg ... TS% eFG% TOV% ORB% FT/FGA eFG% TOV% DRB% FT/FGA Arena Attend. Attend./G
0 1.0 Milwaukee Bucks* 29.2 53.0 12.0 52 13 11.29 -0.85 10.44 112.6 101.9 ... 0.583 0.553 12.8 20.7 0.196 0.486 12.2 81.7 0.172 Fiserv Forum 549036 17711
1 2.0 Los Angeles Lakers* 29.6 49.0 14.0 45 18 7.41 0.34 7.75 113.0 105.6 ... 0.577 0.548 13.2 24.6 0.196 0.509 13.8 78.4 0.202 STAPLES Center 588907 18997
2 3.0 Los Angeles Clippers* 27.4 44.0 20.0 44 20 6.52 0.22 6.74 113.6 107.2 ... 0.574 0.532 12.7 24.0 0.232 0.503 12.3 77.3 0.210 STAPLES Center 610176 19068
3 4.0 Toronto Raptors* 26.6 46.0 18.0 44 20 6.45 -0.57 5.88 111.6 105.2 ... 0.574 0.536 12.8 21.6 0.205 0.502 14.6 76.1 0.200 Scotiabank Arena 633456 19796
4 5.0 Dallas Mavericks 26.2 40.0 27.0 45 22 6.04 -0.21 5.84 116.7 110.6 ... 0.581 0.548 11.3 23.5 0.198 0.519 10.9 77.4 0.172 American Airlines Center 682096 20062
5 6.0 Boston Celtics* 25.3 43.0 21.0 44 20 6.17 -0.48 5.69 112.9 106.8 ... 0.567 0.529 12.0 23.9 0.204 0.510 13.6 77.5 0.212 TD Garden 610864 19090
6 7.0 Houston Rockets* 29.1 40.0 24.0 39 25 3.75 0.03 3.78 113.8 110.2 ... 0.578 0.539 12.6 22.4 0.226 0.528 13.5 75.6 0.194 Toyota Center 578458 18077
7 8.0 Utah Jazz* 27.5 41.0 23.0 38 26 3.17 0.03 3.20 112.6 109.4 ... 0.587 0.552 13.6 21.2 0.208 0.514 10.9 79.0 0.180 Vivint Smart Home Arena 567486 18306
8 9.0 Denver Nuggets* 25.6 43.0 22.0 39 26 2.95 0.06 3.02 112.5 109.5 ... 0.564 0.532 12.3 24.7 0.178 0.526 13.0 77.0 0.194 Pepsi Center 633153 19186
9 10.0 Oklahoma City Thunder* 25.6 40.0 24.0 37 27 2.45 0.34 2.79 111.6 109.1 ... 0.577 0.534 12.3 19.2 0.233 0.520 12.4 76.8 0.164 Chesapeake Energy Arena 600699 18203
10 11.0 Miami Heat* 25.9 41.0 24.0 39 26 3.23 -0.65 2.58 112.7 109.4 ... 0.587 0.549 13.5 20.5 0.231 0.522 12.3 79.7 0.208 AmericanAirlines Arena 629771 19680
11 12.0 Philadelphia 76ers* 26.4 39.0 26.0 37 28 2.22 0.01 2.22 110.4 108.2 ... 0.562 0.530 12.7 23.7 0.189 0.522 12.7 80.4 0.211 Wells Fargo Center 639491 20629
12 13.0 Indiana Pacers* 25.6 39.0 26.0 37 28 1.94 -0.33 1.61 110.3 108.3 ... 0.565 0.533 11.9 20.3 0.170 0.513 12.8 77.1 0.193 Bankers Life Fieldhouse 529002 16531
13 14.0 New Orleans Pelicans 25.4 28.0 36.0 30 34 -0.83 1.13 0.30 110.8 111.6 ... 0.567 0.538 13.7 24.3 0.183 0.531 12.3 78.1 0.207 Smoothie King Center 528172 16505
14 15.0 Orlando Magic 26.0 30.0 35.0 30 35 -0.97 0.12 -0.85 108.0 109.0 ... 0.540 0.503 11.4 22.4 0.191 0.535 13.5 79.0 0.170 Amway Center 529870 17093
15 16.0 Memphis Grizzlies 24.0 32.0 33.0 30 35 -1.08 0.02 -1.05 109.4 110.4 ... 0.561 0.530 13.2 23.2 0.178 0.520 12.6 77.6 0.213 FedEx Forum 523297 15857
16 17.0 Phoenix Suns 24.7 26.0 39.0 30 35 -1.37 0.32 -1.05 110.5 111.8 ... 0.572 0.528 13.3 22.2 0.226 0.543 14.0 78.3 0.221 Talking Stick Resort Arena 550633 15606
17 18.0 Portland Trail Blazers 27.5 29.0 37.0 30 36 -1.61 0.49 -1.11 112.5 114.1 ... 0.566 0.530 11.5 22.0 0.191 0.523 11.0 75.0 0.204 Moda Center 628303 19634
18 19.0 Brooklyn Nets 26.5 30.0 34.0 31 33 -0.64 -0.54 -1.18 108.1 108.7 ... 0.550 0.515 13.4 23.5 0.199 0.507 10.9 77.8 0.181 Barclays Center 524907 16403
19 20.0 San Antonio Spurs 27.9 27.0 36.0 28 35 -1.76 0.57 -1.21 111.9 113.7 ... 0.569 0.529 11.0 19.5 0.206 0.542 11.5 79.2 0.194 AT&T Center 550515 18351
20 21.0 Sacramento Kings 27.1 28.0 36.0 28 36 -1.92 0.48 -1.44 109.7 111.6 ... 0.563 0.531 13.0 21.8 0.178 0.540 13.6 78.5 0.222 Golden 1 Center 520663 16796
21 22.0 Minnesota Timberwolves 24.8 19.0 45.0 24 40 -4.30 0.51 -3.78 108.1 112.2 ... 0.551 0.514 13.0 22.1 0.209 0.541 13.2 77.2 0.218 Target Center 482112 15066
22 23.0 Chicago Bulls 24.4 22.0 43.0 26 39 -3.08 -0.73 -3.81 106.7 109.8 ... 0.547 0.515 13.7 22.8 0.175 0.546 16.3 75.6 0.239 United Center 639352 18804
23 24.0 Detroit Pistons 25.9 20.0 46.0 26 40 -3.56 -0.66 -4.22 109.0 112.7 ... 0.561 0.529 13.8 22.6 0.194 0.541 12.7 75.9 0.186 Little Caesars Arena 509469 15294
24 25.0 Washington Wizards 25.4 24.0 40.0 24 40 -4.05 -0.81 -4.86 111.9 115.8 ... 0.568 0.528 12.1 22.0 0.214 0.560 14.0 74.9 0.230 Capital One Arena 532702 16647
25 26.0 New York Knicks 24.5 21.0 45.0 20 46 -6.45 -0.09 -6.55 106.5 113.0 ... 0.531 0.501 12.6 25.8 0.182 0.541 12.4 78.3 0.224 Madison Square Garden (IV) 620789 18812
26 27.0 Charlotte Hornets 24.3 23.0 42.0 19 46 -6.75 -0.12 -6.88 106.3 113.3 ... 0.539 0.504 13.3 23.9 0.188 0.546 13.1 74.4 0.159 Spectrum Center 478591 15428
27 28.0 Cleveland Cavaliers 25.0 19.0 46.0 18 47 -7.89 0.33 -7.55 107.5 115.4 ... 0.553 0.522 14.6 24.6 0.172 0.560 11.7 77.4 0.164 Quicken Loans Arena 643008 17861
28 29.0 Atlanta Hawks 24.1 20.0 47.0 18 49 -7.97 0.40 -7.57 107.2 114.8 ... 0.554 0.515 13.8 21.6 0.204 0.543 12.7 74.9 0.233 State Farm Arena 545453 16043
29 30.0 Golden State Warriors 24.4 15.0 50.0 16 49 -8.71 0.79 -7.92 105.2 113.8 ... 0.540 0.497 13.2 21.5 0.212 0.553 13.7 76.4 0.193 Chase Center 614176 18064
30 NaN League Average 26.2 NaN NaN 32 32 0.00 0.00 0.00 110.4 110.4 ... 0.564 0.528 12.8 22.6 0.199 0.528 12.8 77.4 0.199 NaN 575820 17788
[31 rows x 28 columns]
The website you want to scrape is dynamic, so you can't access all the data on the first request to the site; you need to wait a few seconds for the javascript to render, and then you can access all of the site's data. For this you can use selenium: read the documentation, download the Chrome or Firefox driver, then use it. I wrote code that lets you access that table:
from selenium import webdriver
import pandas as pd
import os
import time

chromedriver = "driver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

url = 'https://www.basketball-reference.com/leagues/NBA_2020.html#all_misc_stats'
driver.get(url)
# Give the javascript time to render
time.sleep(15)

source = driver.page_source
tables = pd.read_html(source)
for table in tables:
    try:
        # The Miscellaneous Stats table is the one with an 'Arena' column
        if 'Arena' in table.columns[25][1]:
            print(table)
    except:
        pass
print:
Rk Team Age ... Arena Attend. Attend./G
0 1.0 Milwaukee Bucks* 29.2 ... Fiserv Forum 549036 17711
1 2.0 Los Angeles Lakers* 29.6 ... STAPLES Center 588907 18997
2 3.0 Los Angeles Clippers* 27.4 ... STAPLES Center 610176 19068
3 4.0 Toronto Raptors* 26.6 ... Scotiabank Arena 633456 19796
4 5.0 Dallas Mavericks 26.2 ... American Airlines Center 682096 20062
5 6.0 Boston Celtics* 25.3 ... TD Garden 610864 19090
6 7.0 Houston Rockets* 29.1 ... Toyota Center 578458 18077
7 8.0 Utah Jazz* 27.5 ... Vivint Smart Home Arena 567486 18306
8 9.0 Denver Nuggets* 25.6 ... Pepsi Center 633153 19186
9 10.0 Oklahoma City Thunder* 25.6 ... Chesapeake Energy Arena 600699 18203
10 11.0 Miami Heat* 25.9 ... AmericanAirlines Arena 629771 19680
11 12.0 Philadelphia 76ers* 26.4 ... Wells Fargo Center 639491 20629
12 13.0 Indiana Pacers* 25.6 ... Bankers Life Fieldhouse 529002 16531
13 14.0 New Orleans Pelicans 25.4 ... Smoothie King Center 528172 16505
14 15.0 Orlando Magic 26.0 ... Amway Center 529870 17093
15 16.0 Memphis Grizzlies 24.0 ... FedEx Forum 523297 15857
16 17.0 Phoenix Suns 24.7 ... Talking Stick Resort Arena 550633 15606
17 18.0 Portland Trail Blazers 27.5 ... Moda Center 628303 19634
18 19.0 Brooklyn Nets 26.5 ... Barclays Center 524907 16403
19 20.0 San Antonio Spurs 27.9 ... AT&T Center 550515 18351
20 21.0 Sacramento Kings 27.1 ... Golden 1 Center 520663 16796
21 22.0 Minnesota Timberwolves 24.8 ... Target Center 482112 15066
22 23.0 Chicago Bulls 24.4 ... United Center 639352 18804
23 24.0 Detroit Pistons 25.9 ... Little Caesars Arena 509469 15294
24 25.0 Washington Wizards 25.4 ... Capital One Arena 532702 16647
25 26.0 New York Knicks 24.5 ... Madison Square Garden (IV) 620789 18812
26 27.0 Charlotte Hornets 24.3 ... Spectrum Center 478591 15428
27 28.0 Cleveland Cavaliers 25.0 ... Quicken Loans Arena 643008 17861
28 29.0 Atlanta Hawks 24.1 ... State Farm Arena 545453 16043
29 30.0 Golden State Warriors 24.4 ... Chase Center 614176 18064
30 NaN League Average 26.2 ... NaN 575820 17788
[31 rows x 28 columns]

bs4 not giving table

import requests
from bs4 import BeautifulSoup

URL = 'https://www.basketball-reference.com/leagues/NBA_2019.html'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')

table = soup.find_all('table', {'class': 'sortable stats_table now_sortable'})
rows = table.find_all('td')
for i in rows:
    print(i.get_text())
I want to get the content of the table with team per-game stats from this website, but I get this error:
>>>AttributeError: 'NoneType' object has no attribute 'find_all'
The table that you want is dynamically loaded, meaning it is not in the html when you first make a request to the page. So the table you are searching for does not exist yet.
To scrape sites that use javascript, you can look into using the Selenium webdriver and PhantomJS, as described in this post: https://stackoverflow.com/a/26440563/13275492
Actually, you can use pandas.read_html(), which reads all the tables in a nice format. It returns the tables as a list, so you can access each one as a DataFrame by index, for example print(df[0]).
import pandas as pd
df = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2019.html")
print(df)
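To figure out which entry in that list holds the table you want, you can print each DataFrame's position, shape, and first few column labels; a minimal sketch (here the list is named tables rather than df to make it clear it is a list):

import pandas as pd

tables = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2019.html")
# Print each table's index and a few column labels to identify it
for i, table in enumerate(tables):
    print(i, table.shape, list(table.columns)[:6])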
The tables (with a few exceptions) in these sports reference sites are within the comments. You need to pull out the comments, then render those tables with pandas.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.basketball-reference.com/leagues/NBA_2019.html"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'table' in each and 'id="team-stats-per_game"' in each:
        df = pd.read_html(each, attrs={'id': 'team-stats-per_game'})[0]
Output:
print (df)
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 82 241.2 43.4 ... 7.5 5.9 13.9 19.6 118.1
1 2.0 Golden State Warriors* 82 241.5 44.0 ... 7.6 6.4 14.3 21.4 117.7
2 3.0 New Orleans Pelicans 82 240.9 43.7 ... 7.4 5.4 14.8 21.1 115.4
3 4.0 Philadelphia 76ers* 82 241.5 41.5 ... 7.4 5.3 14.9 21.3 115.2
4 5.0 Los Angeles Clippers* 82 241.8 41.3 ... 6.8 4.7 14.5 23.3 115.1
5 6.0 Portland Trail Blazers* 82 242.1 42.3 ... 6.7 5.0 13.8 20.4 114.7
6 7.0 Oklahoma City Thunder* 82 242.1 42.6 ... 9.3 5.2 14.0 22.4 114.5
7 8.0 Toronto Raptors* 82 242.4 42.2 ... 8.3 5.3 14.0 21.0 114.4
8 9.0 Sacramento Kings 82 240.6 43.2 ... 8.3 4.4 13.4 21.4 114.2
9 10.0 Washington Wizards 82 243.0 42.1 ... 8.3 4.6 14.1 20.7 114.0
10 11.0 Houston Rockets* 82 241.8 39.2 ... 8.5 4.9 13.3 22.0 113.9
11 12.0 Atlanta Hawks 82 242.1 41.4 ... 8.2 5.1 17.0 23.6 113.3
12 13.0 Minnesota Timberwolves 82 241.8 41.6 ... 8.3 5.0 13.1 20.3 112.5
13 14.0 Boston Celtics* 82 241.2 42.1 ... 8.6 5.3 12.8 20.4 112.4
14 15.0 Brooklyn Nets* 82 243.7 40.3 ... 6.6 4.1 15.1 21.5 112.2
15 16.0 Los Angeles Lakers 82 241.2 42.6 ... 7.5 5.4 15.7 20.7 111.8
16 17.0 Utah Jazz* 82 240.9 40.4 ... 8.1 5.9 15.1 21.1 111.7
17 18.0 San Antonio Spurs* 82 241.5 42.3 ... 6.1 4.7 12.1 18.1 111.7
18 19.0 Charlotte Hornets 82 241.8 40.2 ... 7.2 4.9 12.2 18.9 110.7
19 20.0 Denver Nuggets* 82 240.6 41.9 ... 7.7 4.4 13.4 20.0 110.7
20 21.0 Dallas Mavericks 82 241.2 38.8 ... 6.5 4.3 14.2 20.1 108.9
21 22.0 Indiana Pacers* 82 240.3 41.3 ... 8.7 4.9 13.7 19.4 108.0
22 23.0 Phoenix Suns 82 242.4 40.1 ... 9.0 5.1 15.6 23.6 107.5
23 24.0 Orlando Magic* 82 241.2 40.4 ... 6.6 5.4 13.2 18.6 107.3
24 25.0 Detroit Pistons* 82 242.1 38.8 ... 6.9 4.0 13.8 22.1 107.0
25 26.0 Miami Heat 82 240.6 39.6 ... 7.6 5.5 14.7 20.9 105.7
26 27.0 Chicago Bulls 82 242.7 39.8 ... 7.4 4.3 14.1 20.3 104.9
27 28.0 New York Knicks 82 241.2 38.2 ... 6.8 5.1 14.0 20.9 104.6
28 29.0 Cleveland Cavaliers 82 240.9 38.9 ... 6.5 2.4 13.5 20.0 104.5
29 30.0 Memphis Grizzlies 82 242.4 38.0 ... 8.3 5.5 14.0 22.0 103.5
30 NaN League Average 82 241.6 41.1 ... 7.6 5.0 14.1 20.9 111.2
[31 rows x 25 columns]
