How to find elements that match specific conditions with Selenium in Python

I want to crawl data from a web page, but I don't know how to get the data out of these tags. Please help me.
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

browser = webdriver.Chrome(executable_path="./chromedriver.exe")
idx = 0
data = []
title = []
url = 'https://24hmoney.vn/stock/HAG/financial-report'
browser.get(url)

# Click the "quarterly" button
btn1 = browser.find_element(By.XPATH, "/html/body/div[1]/div/div/div[2]/div[1]/div[4]/div[2]/div[1]")
btn1.click()

# Click "show change vs. the same period last year"
#btn2 = browser.find_element(By.XPATH, "/html/body/div[1]/div/div/div[2]/div[1]/div[4]/div[3]/div[1]/span")
#btn2.click()

lai = browser.find_elements(By.CSS_SELECTOR, 'p')
for raw in lai:
    data.append(raw.text)
    #print(raw.text)

tieude = browser.find_elements(By.CLASS_NAME, 'sticky-col.first-col')
for raw2 in tieude:
    title.append(raw2.text)
    print(raw2.text)

#df = pd.DataFrame(data, columns=["HAG"])
df = pd.DataFrame(title, columns=["Tieude"])
df.to_csv("HAG.csv", index=False)
#a = input()

Maybe the following code will solve your issue?
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://24hmoney.vn/stock/HAG/financial-report'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table = soup.select_one('div[class="financial-report-box-content"] table')
df = pd.read_html(str(table))[0]
print(df)
Result in terminal:
Tiêu đề Q3/22 % Q3/21 Q2/22 % Q2/21 Q1/22 % Q1/21 Q4/21 % Q4/20 Q3/21 % Q3/20 Q2/21 % Q2/20 Q1/21 % Q1/20 Q4/20 % Q4/19
0 Doanh thu 1441.4 160.1% 1233.6 125.2% 802.6 182.3% 743.7 -19.2% 554.1 -20.9% 547.7 -15.4% 284.4 -66% 920.4 51%
1 Các khoản giảm trừ NaN NaN 6.2 -81.3% NaN NaN NaN NaN NaN NaN 3.4 68.3% 18.5 -678.5% 6.7 5.7%
2 Doanh thu thuần 1441.4 160.1% 1227.4 125.5% 802.6 201.9% 743.7 -18.6% 554.1 -20.9% 544.3 -14.6% 265.8 -68.1% 913.7 51.6%
3 Giá vốn hàng bán 1160.6 -207.3% 1051.7 -115.3% 512.8 -140.3% 511.5 52.7% 377.6 50.1% 488.4 3% 213.4 61.3% 1082.2 -74.1%
4 Lợi nhuận gộp 280.8 59.1% 175.7 214.4% 289.8 452.8% 232.3 237.9% 176.5 414.3% 55.9 -58.1% 52.4 -81.5% -168.5 -783.9%
5 Thu nhập tài chính 117.5 -10.9% 95.4 -24.8% 192.4 -44.9% 127.6 -83.7% 131.9 -5.3% 126.8 -34.3% 349.4 122.3% 783.6 203.9%
6 Chi phí tài chính 166.0 76% 875.9 -410.8% 185.9 13.4% 254.7 150.6% 691.5 -161.9% 171.5 -39.8% 214.8 33.6% 503.4 -44.1%
7 Chi phí tiền lãi 166.9 -0.2% 223.6 -35.6% 162.7 18.6% 167.4 66.3% 166.5 21.9% 165.0 25.9% 199.8 25.3% 496.9 -72.7%
8 Lãi/lỗ từ công ty liên doanh NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -7.6 -1,149% 1.8 -18.8% 4.9 -86.8%
9 Chi phí bán hàng 58.6 -51.5% 90.5 -191.7% 52.1 -202.7% 42.3 34.5% 38.7 47.6% 31.0 76.5% 17.2 79.6% 64.6 14.5%
10 Chi phí quản lý doanh nghiệp 181.1 -60.4% 950.6 322.9% 5.2 101.4% 404.5 56% 457.0 395.4% 224.8 424.4% 367.3 -272.5% 919.8 -890.1%
11 Lãi/lỗ từ hoạt động kinh doanh 354.8 909% 255.2 29.3% 249.3 227.4% 167.7 119.3% 35.2 108.6% 197.3 10,174% -195.7 -202.8% -867.8 -258.5%
12 Thu nhập khác 2.8 35% 24.8 455.4% 5.7 -81.7% 44.0 66.5% 2.1 7.2% 4.5 -85.2% 31.1 68.9% 26.5 140.4%
13 Chi phí khác -7.4 56.3% -56.0 66.7% -15.3 81.7% -143.8 78.8% -17.0 89.7% -168.3 -98.2% -83.5 -153.3% -679.1 -141.2%
14 Thu nhập khác, ròng -4.6 69.2% -31.2 81% -9.6 81.7% -99.8 84.7% -14.9 90.8% -163.8 -198.8% -52.4 -260.2% -652.7 -141.2%
15 Lãi/lỗ từ công ty liên doanh NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 LỢI NHUẬN TRƯỚC THUẾ 350.2 1,629% 224.0 569.6% 239.8 196.6% 67.9 104.5% 20.2 103.5% 33.5 163.2% -248.1 -213.3% -1520.5 -196.6%
17 Thuế thu nhập doanh nghiệp – hiện thời 1.2 NaN 1.4 -333.9% 0.2 NaN 0.0 96.6% NaN NaN 0.3 -76% NaN NaN 1.1 -27.7%
18 Thuế thu nhập doanh nghiệp – hoãn lại 20.5 1,267% 42.2 -3.9% 18.4 -89.7% 28.6 801.7% 1.5 -19.5% 43.9 1,847% 179.4 15,941% 4.1 -102.4%
19 Chi phí thuế thu nhập doanh nghiệp 19.4 1,191% 40.8 -6.3% 18.2 -89.8% 28.6 656.6% 1.5 -13.8% 43.6 1,718% 179.4 18,238% 5.1 -103%
20 LỢI NHUẬN SAU THUẾ TNDN 369.5 1,599% 264.9 243.7% 258.0 475.2% 96.5 106.3% 21.7 103.8% 77.1 238.6% -68.8 12.1% -1525.6 -344.5%
21 Lợi ích của cổ đông thiểu số 8.8 548.1% -14.9 -3,505% 8.0 177% -45.7 87% -2.0 99.5% 0.4 100.2% -10.3 -14.7% -352.1 11.8%
22 Lợi nhuận của Cổ đông của Công ty mẹ 360.7 1,421% 279.7 265% 250.0 528% 142.2 112.1% 23.7 112.7% 76.6 -56.6% -58.4 15.6% -1173.5 -2,193%
23 EPS 4QGN (đ) 389.0 1,396% 301.0 262.6% 270.0 528.6% 153.0 112.1% 26.0 112.9% 83.0 -56.5% -63.0 16% -1265.0 -540.8%
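The pd.read_html call is the heart of the answer above; it can be sketched on a self-contained snippet (the table markup below is invented to mirror the page's structure):

```python
from io import StringIO

import pandas as pd

# Invented stand-in for the report markup; on the real page the <table>
# sits inside <div class="financial-report-box-content">.
html = """
<div class="financial-report-box-content">
  <table>
    <tr><th>Tiêu đề</th><th>Q3/22</th><th>Q2/22</th></tr>
    <tr><td>Doanh thu</td><td>1441.4</td><td>1233.6</td></tr>
    <tr><td>Doanh thu thuần</td><td>1441.4</td><td>1227.4</td></tr>
  </table>
</div>
"""

# read_html returns one DataFrame per <table> found in the markup;
# the all-<th> row becomes the header
df = pd.read_html(StringIO(html))[0]
print(df.shape)  # (2, 3)
```

Passing the markup through StringIO also avoids the deprecation warning newer pandas versions raise for literal HTML strings.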

Related

Scraping table returning repeated values

I'm trying to build a simple web scraper. I am trying to scrape a table, but I'm not sure why the output is the same row (School, 20-5, 33.2) repeated 26 times.
Here is my code:
from bs4 import BeautifulSoup
import requests

url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
teams = soup.find_all('tr')
for team in teams:
    teamname = soup.find('th', class_="school").text
    record = soup.find('td', class_="overall dw").text
    rating = soup.find('td', class_="rating sorted dw").text
    print(teamname, record, rating)
Notice that you're never using the Tag that team refers to. Inside the for loop, all of the calls to soup.find() should be calls to team.find():
for team in teams[1:]:
    teamname = team.find('th', class_="school").text
    record = team.find('td', class_="overall dw").text
    rating = team.find('td', class_="rating sorted dw").text
    print(teamname, record, rating)
This outputs:
St. Mary's Prep (Orchard Lake) 20-5 33.2
University of Detroit Jesuit (Detroit) 16-7 30.0
Williamston 25-0 29.3
Ferndale 21-3 28.9
Catholic Central (Grand Rapids) 25-1 28.4
King (Detroit) 18-3 27.4
De La Salle Collegiate (Warren) 18-7 27.2
Catholic Central (Novi) 16-9 26.6
Brother Rice (Bloomfield Hills) 15-7 26.5
Unity Christian (Hudsonville) 21-1 26.4
Hamtramck 21-4 26.3
Grand Blanc 20-5 25.9
East Lansing 18-5 25.0
Muskegon 20-3 24.8
Northview (Grand Rapids) 25-1 24.6
Cass Tech (Detroit) 21-4 24.3
North Farmington (Farmington Hills) 18-4 24.2
Beecher (Flint) 23-2 24.0
Okemos 19-5 23.9
Benton Harbor 23-3 23.2
Rockford 19-3 22.9
Grand Haven 17-4 21.9
Hartland 19-4 21.0
Marshall 20-3 21.0
Freeland 24-0 21.0
We use [1:] to skip the table header, slicing off the first element in the teams list.
Let pandas parse that table for you (it uses BeautifulSoup under the hood).
import pandas as pd
url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
df = pd.read_html(url)[0]
Output:
print(df)
# School Ovr. Rating Str. +/-
0 1 St. Mary's Prep (Orchard Lake) 20-5 33.2 23.0 NaN
1 2 University of Detroit Jesuit (Detroit) 16-7 30.0 24.1 NaN
2 3 Williamston 25-0 29.3 10.9 NaN
3 4 Ferndale 21-3 28.9 16.5 NaN
4 5 Catholic Central (Grand Rapids) 25-1 28.4 11.4 NaN
5 6 King (Detroit) 18-3 27.4 15.2 NaN
6 7 De La Salle Collegiate (Warren) 18-7 27.2 19.6 2.0
7 8 Catholic Central (Novi) 16-9 26.6 22.6 -1.0
8 9 Brother Rice (Bloomfield Hills) 15-7 26.5 21.0 -1.0
9 10 Unity Christian (Hudsonville) 21-1 26.4 10.4 NaN
10 11 Hamtramck 21-4 26.3 14.5 2.0
11 12 Grand Blanc 20-5 25.9 15.3 -1.0
12 13 East Lansing 18-5 25.0 15.6 1.0
13 14 Muskegon 20-3 24.8 11.4 1.0
14 15 Northview (Grand Rapids) 25-1 24.6 8.2 1.0
15 16 Cass Tech (Detroit) 21-4 24.3 11.8 -4.0
16 17 North Farmington (Farmington Hills) 18-4 24.2 13.1 NaN
17 18 Beecher (Flint) 23-2 24.0 8.6 2.0
18 19 Okemos 19-5 23.9 13.7 -1.0
19 20 Benton Harbor 23-3 23.2 9.9 -1.0
20 21 Rockford 19-3 22.9 11.6 NaN
21 22 Grand Haven 17-4 21.9 11.3 NaN
22 23 Hartland 19-4 21.0 10.4 1.0
23 24 Marshall 20-3 21.0 8.6 -1.0
24 25 Freeland 24-0 21.0 2.7 4.0

Unable to extract Tables

Beginner here. I'm having issues while trying to extract data from the second (Team Statistics) and third (Team Analytics 5-on-5) tables on this page:
https://www.hockey-reference.com/leagues/NHL_2021.html
I'm using this code:
import pandas as pd
import requests

url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[1]
print(df)
and
url = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[2]
print(df)
to get the right tables.
But for some kind of reason I will always get this error message:
IndexError: list index out of range
I could extract the first table by using the same code with df = df_list[0]; that works, but it is useless to me. I really need the 2nd and 3rd tables, and I just don't know why it doesn't work.
Pretty sure that's easy to answer for most of you.
Thanks in advance!
You get this error because read_html() here returns a list with a single element, and that element is at index 0.
instead of
df = df_list[1]
use this
df = df_list[0]
You get one combined table of all teams from the site you mentioned, so if you want the 2nd and 3rd groups of teams (the East and North divisions), slice them out with the loc[] accessor:
east_division = df.loc[9:17]
north_division = df.loc[18:25]
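As a quick illustration of the loc[] slicing used above (the frame here is made up; the real row ranges depend on the page's team ordering), note that .loc slices by label and includes both endpoints:

```python
import pandas as pd

# Hypothetical standings frame with a default integer index
df = pd.DataFrame({'Team': ['Team %d' % i for i in range(1, 11)],
                   'W': range(10)})

# Label-based slicing: rows 3 through 6 inclusive (4 rows)
subset = df.loc[3:6]
print(list(subset.index))  # [3, 4, 5, 6]
```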
Use the URL directly in pandas.read_html
df = pd.read_html('https://www.hockey-reference.com/leagues/NHL_2021.html')
The tables are in fact there in the html (within the comments). Use BeautifulSoup to pull out the comments and parse those tables as well. The code below will pull all tables (both commented and uncommented) and put them into a list; it's then just a matter of pulling out the tables you want by index, in this case indices 1 and 2.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.hockey-reference.com/leagues/NHL_2021.html"

# Get all uncommented tables
tables = pd.read_html(url, header=1)

# Get the html source
response = requests.get(url, headers=headers)

# Create a soup object from the html
soup = BeautifulSoup(response.content, 'html.parser')

# Get the comments in the html
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

# Iterate through each comment; parse any table found, clean it up,
# and append it to the tables list
for each in comments:
    if 'table' in str(each):
        try:
            table = pd.read_html(str(each), header=1)[0]
            table = table[table['Rk'].ne('Rk')]  # drop repeated header rows
            table = table.rename(columns={'Unnamed: 1': 'Team'})
            tables.append(table)
        except (ValueError, KeyError):
            continue
Output:
for table in tables[1:3]:
    print(table)
Rk Unnamed: 1 AvAge GP W ... S S% SA SV% SO
0 1.0 New York Islanders 29.0 28 18 ... 841 9.8 767 0.920 5
1 2.0 Tampa Bay Lightning 28.3 26 19 ... 798 12.2 725 0.919 3
2 3.0 Florida Panthers 28.1 27 18 ... 918 10.0 840 0.910 0
3 4.0 Toronto Maple Leafs 28.9 29 19 ... 883 11.2 828 0.909 2
4 5.0 Carolina Hurricanes 27.2 26 19 ... 816 10.9 759 0.912 3
5 6.0 Washington Capitals 30.4 27 17 ... 768 12.0 808 0.895 0
6 7.0 Vegas Golden Knights 29.1 25 18 ... 752 11.0 691 0.920 4
7 8.0 Edmonton Oilers 28.4 30 18 ... 945 10.6 938 0.907 2
8 9.0 Winnipeg Jets 28.0 27 17 ... 795 11.4 856 0.910 1
9 10.0 Pittsburgh Penguins 28.1 27 17 ... 779 11.0 784 0.899 1
10 11.0 Chicago Blackhawks 27.2 29 14 ... 863 10.1 997 0.910 2
11 12.0 Minnesota Wild 28.8 25 16 ... 764 10.3 723 0.913 2
12 13.0 St. Louis Blues 28.2 28 14 ... 836 10.4 835 0.892 0
13 14.0 Boston Bruins 28.8 25 14 ... 772 8.8 665 0.913 2
14 15.0 Colorado Avalanche 26.8 25 15 ... 846 8.7 622 0.905 4
15 16.0 Montreal Canadiens 28.8 27 12 ... 890 9.7 782 0.909 0
16 17.0 Philadelphia Flyers 27.5 25 13 ... 699 11.7 753 0.892 3
17 18.0 Calgary Flames 28.0 28 13 ... 838 8.9 845 0.904 3
18 19.0 Los Angeles Kings 27.7 26 11 ... 748 10.3 814 0.910 2
19 20.0 Vancouver Canucks 27.7 31 13 ... 951 8.8 1035 0.903 1
20 21.0 Columbus Blue Jackets 27.0 29 11 ... 839 9.3 902 0.895 1
21 22.0 Arizona Coyotes 28.5 27 12 ... 689 9.7 851 0.907 1
22 23.0 San Jose Sharks 29.3 25 11 ... 749 9.5 800 0.890 1
23 24.0 New York Rangers 25.7 26 11 ... 773 9.2 746 0.906 2
24 25.0 Nashville Predators 28.9 28 11 ... 880 7.4 837 0.885 1
25 26.0 Anaheim Ducks 28.4 29 8 ... 804 7.7 852 0.891 3
26 27.0 Dallas Stars 28.3 23 8 ... 657 10.2 626 0.904 3
27 28.0 Detroit Red Wings 29.4 28 8 ... 785 8.0 870 0.891 0
28 29.0 Ottawa Senators 26.4 30 9 ... 942 8.2 960 0.874 0
29 30.0 New Jersey Devils 26.2 24 8 ... 708 8.5 741 0.896 2
30 31.0 Buffalo Sabres 27.4 26 6 ... 728 7.7 804 0.893 0
31 NaN League Average 28.1 27 13 ... 808 9.8 808 0.902 2
[32 rows x 32 columns]
Rk Unnamed: 1 S% SV% ... HDGF HDC% HDGA HDCO%
0 1 New York Islanders 8.3 0.931 ... 11 12.2 11 11.8
1 2 Tampa Bay Lightning 8.7 0.933 ... 11 14.9 6 6.3
2 3 Florida Panthers 7.9 0.926 ... 15 14.4 12 17.6
3 4 Toronto Maple Leafs 8.8 0.933 ... 16 13.4 8 11.1
4 5 Carolina Hurricanes 7.5 0.932 ... 12 12.8 7 9.3
5 6 Washington Capitals 9.8 0.919 ... 10 10.9 5 7.8
6 7 Vegas Golden Knights 9.3 0.927 ... 20 15.9 11 14.5
7 8 Edmonton Oilers 8.2 0.920 ... 9 11.3 13 9.8
8 9 Winnipeg Jets 8.5 0.926 ... 15 15.0 8 7.8
9 10 Pittsburgh Penguins 8.8 0.922 ... 10 14.5 15 13.5
10 11 Chicago Blackhawks 7.3 0.925 ... 10 10.5 14 15.1
11 12 Minnesota Wild 9.9 0.930 ... 16 14.2 8 11.9
12 13 St. Louis Blues 8.4 0.914 ... 15 18.1 15 15.8
13 14 Boston Bruins 6.6 0.922 ... 5 7.4 11 12.2
14 15 Colorado Avalanche 6.7 0.916 ... 8 8.1 8 13.3
15 16 Montreal Canadiens 7.8 0.935 ... 15 12.0 8 11.3
16 17 Philadelphia Flyers 10.1 0.907 ... 18 15.9 9 12.9
17 18 Calgary Flames 7.6 0.929 ... 6 6.9 8 9.2
18 19 Los Angeles Kings 7.5 0.925 ... 11 13.1 8 9.8
19 20 Vancouver Canucks 7.3 0.919 ... 17 13.2 20 17.4
20 21 Columbus Blue Jackets 8.1 0.918 ... 5 9.6 15 13.6
21 22 Arizona Coyotes 7.7 0.924 ... 11 14.7 14 12.8
22 23 San Jose Sharks 8.1 0.909 ... 12 14.6 16 14.0
23 24 New York Rangers 7.8 0.921 ... 17 14.0 8 12.7
24 25 Nashville Predators 5.7 0.918 ... 5 10.6 11 13.4
25 26 Anaheim Ducks 7.4 0.909 ... 12 13.3 25 16.8
26 27 Dallas Stars 7.4 0.929 ... 11 13.3 5 12.8
27 28 Detroit Red Wings 7.5 0.923 ... 13 15.3 12 16.7
28 29 Ottawa Senators 7.1 0.894 ... 7 8.6 20 14.3
29 30 New Jersey Devils 7.2 0.923 ... 10 14.3 12 13.2
30 31 Buffalo Sabres 5.8 0.911 ... 6 8.2 16 14.0
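The comment-extraction trick above can be demonstrated on a self-contained snippet, with a small table deliberately hidden inside an HTML comment the way hockey-reference hides its secondary tables (the markup below is invented):

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup, Comment

html = """
<div>
<!--
<table>
  <tr><th>Rk</th><th>Team</th></tr>
  <tr><td>1</td><td>New York Islanders</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Comments are NavigableString subclasses, so filter on the type
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
# Parse any comment that contains table markup
tables = [pd.read_html(StringIO(str(c)))[0]
          for c in comments if '<table' in c]
print(tables[0].loc[0, 'Team'])  # New York Islanders
```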

bs4 not giving table

import requests
from bs4 import BeautifulSoup

URL = 'https://www.basketball-reference.com/leagues/NBA_2019.html'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find_all('table', {'class': 'sortable stats_table now_sortable'})
rows = table.find_all('td')
for i in rows:
    print(i.get_text())
I want to get the content of the table with team per-game stats from this website, but I get this error:
>>>AttributeError: 'NoneType' object has no attribute 'find_all'
The table that you want is dynamically loaded, meaning it is not in the html when you first make a request to the page. So the table you are searching for does not exist yet.
To scrape sites that use javascript, you can look into using the selenium webdriver with PhantomJS, as described in this post: https://stackoverflow.com/a/26440563/13275492
Actually you can use pandas.read_html(), which reads all the tables on the page in a nice format and returns them as a list. You can then access each one as a DataFrame by index, e.g. print(df[0]).
import pandas as pd
df = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2019.html")
print(df)
The tables (with the exception of a few) on these sports-reference sites are within the html comments. You need to pull out the comments, then render those tables with pandas.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.basketball-reference.com/leagues/NBA_2019.html"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'table' in each and 'id="team-stats-per_game"' in each:
        df = pd.read_html(str(each), attrs={'id': 'team-stats-per_game'})[0]
Output:
print (df)
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 82 241.2 43.4 ... 7.5 5.9 13.9 19.6 118.1
1 2.0 Golden State Warriors* 82 241.5 44.0 ... 7.6 6.4 14.3 21.4 117.7
2 3.0 New Orleans Pelicans 82 240.9 43.7 ... 7.4 5.4 14.8 21.1 115.4
3 4.0 Philadelphia 76ers* 82 241.5 41.5 ... 7.4 5.3 14.9 21.3 115.2
4 5.0 Los Angeles Clippers* 82 241.8 41.3 ... 6.8 4.7 14.5 23.3 115.1
5 6.0 Portland Trail Blazers* 82 242.1 42.3 ... 6.7 5.0 13.8 20.4 114.7
6 7.0 Oklahoma City Thunder* 82 242.1 42.6 ... 9.3 5.2 14.0 22.4 114.5
7 8.0 Toronto Raptors* 82 242.4 42.2 ... 8.3 5.3 14.0 21.0 114.4
8 9.0 Sacramento Kings 82 240.6 43.2 ... 8.3 4.4 13.4 21.4 114.2
9 10.0 Washington Wizards 82 243.0 42.1 ... 8.3 4.6 14.1 20.7 114.0
10 11.0 Houston Rockets* 82 241.8 39.2 ... 8.5 4.9 13.3 22.0 113.9
11 12.0 Atlanta Hawks 82 242.1 41.4 ... 8.2 5.1 17.0 23.6 113.3
12 13.0 Minnesota Timberwolves 82 241.8 41.6 ... 8.3 5.0 13.1 20.3 112.5
13 14.0 Boston Celtics* 82 241.2 42.1 ... 8.6 5.3 12.8 20.4 112.4
14 15.0 Brooklyn Nets* 82 243.7 40.3 ... 6.6 4.1 15.1 21.5 112.2
15 16.0 Los Angeles Lakers 82 241.2 42.6 ... 7.5 5.4 15.7 20.7 111.8
16 17.0 Utah Jazz* 82 240.9 40.4 ... 8.1 5.9 15.1 21.1 111.7
17 18.0 San Antonio Spurs* 82 241.5 42.3 ... 6.1 4.7 12.1 18.1 111.7
18 19.0 Charlotte Hornets 82 241.8 40.2 ... 7.2 4.9 12.2 18.9 110.7
19 20.0 Denver Nuggets* 82 240.6 41.9 ... 7.7 4.4 13.4 20.0 110.7
20 21.0 Dallas Mavericks 82 241.2 38.8 ... 6.5 4.3 14.2 20.1 108.9
21 22.0 Indiana Pacers* 82 240.3 41.3 ... 8.7 4.9 13.7 19.4 108.0
22 23.0 Phoenix Suns 82 242.4 40.1 ... 9.0 5.1 15.6 23.6 107.5
23 24.0 Orlando Magic* 82 241.2 40.4 ... 6.6 5.4 13.2 18.6 107.3
24 25.0 Detroit Pistons* 82 242.1 38.8 ... 6.9 4.0 13.8 22.1 107.0
25 26.0 Miami Heat 82 240.6 39.6 ... 7.6 5.5 14.7 20.9 105.7
26 27.0 Chicago Bulls 82 242.7 39.8 ... 7.4 4.3 14.1 20.3 104.9
27 28.0 New York Knicks 82 241.2 38.2 ... 6.8 5.1 14.0 20.9 104.6
28 29.0 Cleveland Cavaliers 82 240.9 38.9 ... 6.5 2.4 13.5 20.0 104.5
29 30.0 Memphis Grizzlies 82 242.4 38.0 ... 8.3 5.5 14.0 22.0 103.5
30 NaN League Average 82 241.6 41.1 ... 7.6 5.0 14.1 20.9 111.2
[31 rows x 25 columns]

To scrape the data from span tag using beautifulsoup

I am trying to scrape a webpage, where I need to decode an entire table into a dataframe. I am using Beautiful Soup for this purpose. In certain td tags there are span tags which do not have any text, but their values are shown on the webpage.
The following html code corresponds to that webpage,
<td>
<span class="nttu">::after</span>
<span class="ntbb">::after</span>
<span class="ntyc">::after</span>
<span class="nttu">::after</span>
</td>
But the value shown in this td tag is 23.8. I tried to scrape it, but I get an empty string.
How can I scrape this value using Beautiful Soup?
URL: https://en.tutiempo.net/climate/ws-432950.html
and my code for scraping the table is given below:
import requests
import pandas as pd
from bs4 import BeautifulSoup

http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retrieved_data = requests.get(http_url).text
soup = BeautifulSoup(retrieved_data, "lxml")
climate_table = soup.find("table", attrs={"class": "medias mensuales numspan"})
climate_data = climate_table.find_all("tr")
for data in climate_data[1:-2]:
    table_data = data.find_all("td")
    row_data = []
    for row in table_data:
        row_data.append(row.get_text())
    climate_df.loc[len(climate_df)] = row_data  # climate_df is created earlier in the script
I misunderstood your question at first, as you have two different URLs referenced; I see now what you mean.
Yeah, it is weird that in that second table they used CSS to fill in the content of some of those <td> tags. What you need to do is pull those special cases out of the <style> tag. Once you have them, you can replace the corresponding elements within the html source and finally parse it into a dataframe. I used pandas, as it uses BeautifulSoup under the hood to parse <table> tags. But I believe this will get you what you want:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retrieved_data = requests.get(http_url).text
soup = BeautifulSoup(retrieved_data, "lxml")

# Map each span class to the text its CSS ::after rule displays
hiddenData = str(soup.find_all('style')[1])
hiddenSpan = {}
for group in re.findall(r'span\.(.+?)}', hiddenData):
    class_attr = group.split('span.')[-1].split('::')[0]
    content = group.split('"')[1]
    hiddenSpan[class_attr] = content

# Substitute the hidden text into the table markup, then parse it
climate_table = str(soup.find("table", attrs={"class": "medias mensuales numspan"}))
for k, v in hiddenSpan.items():
    climate_table = climate_table.replace('<span class="%s"></span>' % k, v)

df = pd.read_html(climate_table)[0]
Output:
print (df.to_string())
Day T TM Tm SLP H PP VV V VM VG RA SN TS FG
0 1 23.4 30.3 19 - 59 0 6.3 4.3 5.4 - NaN NaN NaN NaN
1 2 22.4 30.3 16.9 - 57 0 6.9 3.3 7.6 - NaN NaN NaN NaN
2 3 24 31.8 16.9 - 51 0 6.9 2.8 5.4 - NaN NaN NaN NaN
3 4 24.2 32 17.4 - 53 0 6 3.3 5.4 - NaN NaN NaN NaN
4 5 23.8 32 18 - 58 0 6.9 3.1 7.6 - NaN NaN NaN NaN
5 6 23.3 31 18.3 - 60 0 6.9 5 9.4 - NaN NaN NaN NaN
6 7 22.8 30.2 17.6 - 55 0 7.7 3.7 7.6 - NaN NaN NaN NaN
7 8 23.1 30.6 17.4 - 46 0 6.9 3.3 5.4 - NaN NaN NaN NaN
8 9 22.9 30.6 17.4 - 51 0 6.9 3.5 3.5 - NaN NaN NaN NaN
9 10 22.3 30 17 - 56 0 6.3 3.3 7.6 - NaN NaN NaN NaN
10 11 22.3 29.4 17 - 53 0 6.9 4.3 7.6 - NaN NaN NaN NaN
11 12 21.8 29.4 15.7 - 54 0 6.9 2.8 3.5 - NaN NaN NaN NaN
12 13 22.3 30.1 15.7 - 43 0 6.9 2.8 5.4 - NaN NaN NaN NaN
13 14 21.8 30.6 14.8 - 41 0 6.9 1.9 5.4 - NaN NaN NaN NaN
14 15 21.6 30.6 14.2 - 43 0 6.9 3.1 7.6 - NaN NaN NaN NaN
15 16 21.1 29.9 15.4 - 55 0 6.9 4.1 7.6 - NaN NaN NaN NaN
16 17 20.4 28.1 15.4 - 59 0 6.9 5 11.1 - NaN NaN NaN NaN
17 18 21.2 28.3 14.5 - 53 0 6.9 3.1 7.6 - NaN NaN NaN NaN
18 19 21.6 29.6 16.4 - 58 0 6.9 2.2 3.5 - NaN NaN NaN NaN
19 20 21.9 29.6 16.6 - 58 0 6.9 2.4 5.4 - NaN NaN NaN NaN
20 21 22.3 29.9 17.5 - 55 0 6.9 3.1 5.4 - NaN NaN NaN NaN
21 22 21.9 29.9 15.1 - 46 0 6.9 4.3 7.6 - NaN NaN NaN NaN
22 23 21.3 29 15.2 - 50 0 6.9 3.3 5.4 - NaN NaN NaN NaN
23 24 21.3 28.8 14.6 - 45 0 6.9 3 5.4 - NaN NaN NaN NaN
24 25 21.6 29.1 15.5 - 47 0 7.7 4.8 7.6 - NaN NaN NaN NaN
25 26 21.8 29.2 14.6 - 41 0 6.9 2.8 3.5 - NaN NaN NaN NaN
26 27 22.3 30.1 15.6 - 40 0 6.9 2.4 5.4 - NaN NaN NaN NaN
27 28 22.4 30.3 16 - 51 0 6.9 2.8 3.5 - NaN NaN NaN NaN
28 29 23 30.3 16.9 - 53 0 6.6 2.8 5.4 - NaN NaN NaN o
29 30 23.1 30 17.8 - 54 0 6.9 5.4 7.6 - NaN NaN NaN NaN
30 31 22.1 29.8 17.3 - 54 0 6.9 5.2 9.4 - NaN NaN NaN NaN
31 Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals:
32 NaN 22.3 30 16.4 - 51.6 0 6.9 3.5 6.3 NaN 0 0 0 1
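The same replace-the-hidden-spans idea can be sketched end to end on a toy page; the class names and content rules below are invented to mimic the real site's CSS trick:

```python
import re
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Toy page: the cell value "23.8" exists only in the CSS content rules
html = (
    '<style></style>'
    '<style>span.nttu::after{content:"2"}span.ntbb::after{content:"3"}'
    'span.ntyc::after{content:"."}span.ntzz::after{content:"8"}</style>'
    '<table><tr><th>T</th></tr>'
    '<tr><td><span class="nttu"></span><span class="ntbb"></span>'
    '<span class="ntyc"></span><span class="ntzz"></span></td></tr></table>'
)

soup = BeautifulSoup(html, 'html.parser')

# Map each span class to the text its ::after rule would display
hidden = {}
for group in re.findall(r'span\.(.+?)}', str(soup.find_all('style')[1])):
    hidden[group.split('::')[0]] = group.split('"')[1]

# Substitute the hidden text into the table markup, then parse it
table = str(soup.find('table'))
for k, v in hidden.items():
    table = table.replace('<span class="%s"></span>' % k, v)

df = pd.read_html(StringIO(table))[0]
print(df.loc[0, 'T'])  # 23.8
```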

Dropping multiple columns in pandas at once

I have a data set consisting of 135 columns. I am trying to drop the columns in which more than 60% of the data is empty; there are roughly 40 such columns. I wrote a function to drop these empty columns, but I am getting a "Not contained in axis" error. Could someone help me solve this, or suggest another way to drop these 40 columns at once?
My function:
list_drop = df.isnull().sum() / len(df)
def empty(df):
    if list_drop > 0.5:
        df.drop(list_drop, axis=1, inplace=True)
    return df
Other method I tried:
df.drop(df.count()/len(df)<0.5,axis=1,inplace=True)
You could use isnull + sum and then use the mask to filter df.columns.
m = df.isnull().sum(0) / len(df) < 0.6
df = df[df.columns[m]]
Demo
df
A B C
0 29.0 NaN 26.6
1 NaN NaN 23.3
2 23.0 94.0 28.1
3 35.0 168.0 43.1
4 NaN NaN 25.6
5 32.0 88.0 31.0
6 NaN NaN 35.3
7 45.0 543.0 30.5
8 NaN NaN NaN
9 NaN NaN 37.6
10 NaN NaN 38.0
11 NaN NaN 27.1
12 23.0 846.0 30.1
13 19.0 175.0 25.8
14 NaN NaN 30.0
15 47.0 230.0 45.8
16 NaN NaN 29.6
17 38.0 83.0 43.3
18 30.0 96.0 34.6
m = df.isnull().sum(0) / len(df) < 0.3 # 0.3 as an example
m
A False
B False
C True
dtype: bool
df[df.columns[m]]
C
0 26.6
1 23.3
2 28.1
3 43.1
4 25.6
5 31.0
6 35.3
7 30.5
8 NaN
9 37.6
10 38.0
11 27.1
12 30.1
13 25.8
14 30.0
15 45.8
16 29.6
17 43.3
18 34.6
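As a side note, the same column filter can also be written with dropna and its thresh parameter (sketched here on a tiny made-up frame): a column survives only if it has at least thresh non-null values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan, np.nan],  # 80% missing
                   'B': [1, 2, 3, np.nan, np.nan],            # 40% missing
                   'C': [1, 2, 3, 4, 5]})                     # 0% missing

# Drop columns that are more than 60% empty: keep a column only if
# at least 40% of its values are non-null
out = df.dropna(axis=1, thresh=int(0.4 * len(df)))
print(list(out.columns))  # ['B', 'C']
```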
