How to iterate over each link and scrape every DataFrame inside the HTML? - python

I'm scraping a website through a list of links, 442 links in total. Each link contains a DataFrame, and by using pd.read_html() I managed to pull it. So I tried to loop over all the links, scrape all the DataFrames and join them, but after I finished everything I found out that some of the links have the tables in different positions, so I was unable to extract the DataFrame. How do I fix this problem? Sorry if I can't explain it clearly, but here's my script:
allin = []
for link in titlelink:
    driver.get(link)
    html = driver.page_source
    soup = bs(html, 'html.parser')
    iframe = soup.find('iframe')['src']
    # open the iframe
    driver.get(iframe)
    iframehtml = driver.page_source
    print('fetching --', link)
    # use pandas read_html to get the tables
    All = pd.read_html(iframehtml)
    try:
        table1 = All[1].set_index([0, All[1].groupby(0).cumcount()])[1].unstack(0)
    except:
        table1 = All[2].set_index([0, All[2].groupby(0).cumcount()])[1].unstack(0)
    try:
        table2 = All[3].set_index([0, All[3].groupby(0).cumcount()])[1].unstack(0)
    except:
        pass
    df = table1.join(table2)
    try:
        df['Remarks'] = All[2].iloc[1]
    except:
        df['Remarks'] = All[3].iloc[1]
    allin.append(df)

finaldf = pd.concat(allin, ignore_index=True)
print(finaldf)
finaldf.to_csv('data.csv', index=False)
Also, I've exported all the links into a CSV and attached it here (https://drive.google.com/file/d/1Tk2oKVEZwfxAnHIx3p2HbACE6vOrJq5A/view?usp=sharing), so that you can get a clearer picture of the problem I'm facing. I appreciate all of your help.
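One way to make the table selection robust to position changes, sketched here under the assumption that each announcement table contains a distinguishing label, is to let pandas.read_html pick tables by text via its match argument instead of by index (the self-answer further down takes this approach):

# Sketch: select the table by a label it contains rather than by its position.
tables = pd.read_html(iframehtml, match='Proposed company name')
table1 = tables[0]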

I found a pattern in the links, so I tried this and it is now working fine. Not 100% perfect, but it works about 95% of the time. Here's the code:
import pandas as pd
import requests

df = pd.read_csv("link.csv")  # that Google Drive document
links = df["0"].values.tolist()
for link in links:
    nlink = f"https://disclosure.bursamalaysia.com/FileAccess/viewHtml?e={link.split('ann_id=')[1]}"
    page = requests.get(nlink, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15"})
    df = pd.read_html(page.text)
    df = pd.concat(df[1:], axis=0).to_numpy().flatten()
    df = pd.DataFrame(df[~pd.isna(df)].reshape(-1, 2))
    # For an explanation of the last two lines, see https://stackoverflow.com/questions/68479177/how-to-shift-a-dataframe-element-wise-to-fill-nans
    print(df)
You will have to adapt parts of this to your needs. If you need any help, you can ask in the comments. It's not technically a complete answer, but it speeds up your scraping and gets you most of the way to your desired output. Take it as a suggestion and an idea that solves the issue maybe 90% to 95% of the way.
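For what it's worth, here is a minimal, self-contained sketch on made-up data (not the real Bursa Malaysia tables) of what those last two lines do: flatten the label/value cells, drop the NaNs, and reshape what is left back into two columns:

import numpy as np
import pandas as pd

# Toy frame mimicking a label/value table with stray NaN cells.
raw = pd.DataFrame({
    0: ["Admission Sponsor", "Sponsor", np.nan],
    1: ["ABC Bank", "XYZ Capital", np.nan],
})

flat = raw.to_numpy().flatten()              # 1-D array of cells, row by row
pairs = flat[~pd.isna(flat)].reshape(-1, 2)  # drop NaNs, regroup into (label, value) pairs
print(pd.DataFrame(pairs, columns=["label", "value"]))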

After some trial and error I finally got the answer to my own question. Here's the script:
frame = []
for link in titlelink:
    time.sleep(1)
    driver.get(link)
    html = driver.page_source
    soup = bs(html, 'html.parser')
    iframe = soup.find('iframe')['src']
    # open the iframe
    driver.get(iframe)
    iframehtml = driver.page_source
    print('fetching --', link)
    # use pandas read_html with match to get the tables
    df_proposed_company_name = pd.read_html(iframehtml, match='Proposed company name')[0]
    df_announcement_info = pd.read_html(iframehtml, match='Stock Name ')[0]
    try:
        df_remarks = pd.read_html(iframehtml, match='Remarks :')[0].iloc[1]
    except:
        pass
    try:
        df_Admission_Sponsor = pd.read_html(iframehtml, match='Admission Sponsor')[1]
    except:
        pass
    try:
        t1_1 = df_Admission_Sponsor.set_index([0, df_Admission_Sponsor.groupby(0).cumcount()])[1].unstack(0)
    except:
        t1_1 = pd.DataFrame({'Admission Sponsor': np.nan,
                             'Sponsor': np.nan}, index=[0])
    t1_2 = df_proposed_company_name.set_index([0, df_proposed_company_name.groupby(0).cumcount()])[1].unstack(0)
    t3 = df_announcement_info.set_index([0, df_announcement_info.groupby(0).cumcount()])[1].unstack(0)
    dfs = t1_1.join(t1_2).join(t3)
    try:
        dfs['remarks'] = df_remarks
    except:
        dfs['remarks'] = np.nan
    frame.append(dfs)

finaldf = pd.concat(frame, ignore_index=True)
# print(finaldf)
finaldf.to_csv('data.csv', index=False)
If any of you have more advanced experience and better solutions, I'm open to them and happy to learn new things from you :-)
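One small suggestion, sketched here rather than tested against all 442 links: the bare except: blocks also swallow unrelated errors (typos, Selenium timeouts, and so on), so catching the specific exceptions that read_html and the positional lookups raise makes failures easier to diagnose:

try:
    df_Admission_Sponsor = pd.read_html(iframehtml, match='Admission Sponsor')[1]
    t1_1 = df_Admission_Sponsor.set_index([0, df_Admission_Sponsor.groupby(0).cumcount()])[1].unstack(0)
except (ValueError, IndexError, KeyError):
    # ValueError: no matching table; IndexError: fewer matches than expected;
    # KeyError: the matched table does not have the expected 0/1 columns.
    t1_1 = pd.DataFrame({'Admission Sponsor': np.nan, 'Sponsor': np.nan}, index=[0])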

Related

Selenium - Iterating Through Drop Down Menu - Let Page Load

I am trying to iterate through player seasons on NBA.com and pull shooting statistics after each click of the season dropdown menu. After each click, I get the error message "list index out of range" for:
headers = table[1].findAll('th')
It seems to me that the page doesn't load all the way before the source data is saved.
Looking at other similar questions, I have tried using a browser.implicitly_wait() for each loop, but I am still getting the same error. It also doesn't seem that the browser waits for more than the first iteration of the loop.
from selenium.webdriver import Chrome
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import pandas as pd

player_id = str(1629216)
url = 'https://www.nba.com/stats/player/' + player_id + "/shooting/"
browser = Chrome(executable_path='/usr/local/bin/chromedriver')
browser.get(url)
select = Select(browser.find_element_by_xpath('/html/body/main/div/div/div/div[4]/div/div/div/div/div[1]/div[1]/div/div/label/select'))
options = select.options
for index in range(0, len(options)):
    select.select_by_index(index)
    browser.implicitly_wait(5)
    src = browser.page_source
    parser = BeautifulSoup(src, "lxml")
    table = parser.findAll("div", attrs={"class": "nba-stat-table__overflow"})
    headers = table[1].findAll('th')
    headerlist = [h.text.strip() for h in headers[1:]]
    headerlist = [a for a in headerlist if not '\n' in a]
    headerlist.append('AST%')
    headerlist.append('UAST%')
    row_labels = table[1].findAll("td", {"class": "first"})
    row_labels_list = [r.text.strip() for r in row_labels[0:]]
    rows = table[1].findAll('tr')[1:]
    player_stats = [[td.getText().strip() for td in rows[i].findAll('td')[1:]] for i in range(len(rows))]
    df = pd.DataFrame(data=player_stats, columns=headerlist, index=row_labels_list)
    print(df)
I found my own answer. I used time.sleep(1) at the top of the loop to give the browser a second to load all the way. Without this delay, the page's source code did not contain the table I am scraping.
Responding to those who answered - I did not want to go the API route, but I have seen people scrape nba.com using that method. table[1] is the correct table; the source code just needed a chance to load after I click through the season dropdown.
import time

for index in range(0, len(options)):
    select.select_by_index(index)
    time.sleep(1)
    src = browser.page_source
    parser = BeautifulSoup(src, "lxml")
    table = parser.findAll("div", attrs={"class": "nba-stat-table__overflow"})
    headers = table[1].findAll('th')
    headerlist = [h.text.strip() for h in headers[1:]]
    headerlist = [a for a in headerlist if not '\n' in a]
    headerlist.append('AST%')
    headerlist.append('UAST%')
    row_labels = table[1].findAll("td", {"class": "first"})
    row_labels_list = [r.text.strip() for r in row_labels[0:]]
    rows = table[1].findAll('tr')[1:]
    player_stats = [[td.getText().strip() for td in rows[i].findAll('td')[1:]] for i in range(len(rows))]
    df = pd.DataFrame(data=player_stats, columns=headerlist, index=row_labels_list)
    print(df)
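As an alternative to the fixed time.sleep(1), here is a minimal sketch of an explicit wait (assuming the stat tables keep the nba-stat-table__overflow class) that blocks until the tables are actually present before grabbing the page source:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the stat tables to be present after each dropdown click.
WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "nba-stat-table__overflow"))
)
src = browser.page_source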

Web scraping - .append adds whitespace and \n to list

I have written some code that helps me scrape websites. It has worked well on some sites, but I am currently running into an issue.
The collectData() function collects data from a site and appends it to 'dataList'. From this dataList I can create a csv file to export the data.
The issue I am having right now is that the function appends multiple whitespace and \n characters into my list. The output looks like this (the excessive whitespace is not shown here):
dataList = ['\n 2.500.000 ']
Does anyone know what could cause this? As I mentioned, there are some websites where the function works fine.
Thank you!
def scrape():
    dataList = []
    linkList = []
    pageNr = range(0, 1)
    for page in pageNr:
        pageUrl = ('https://www.example.com/site:{}'.format(page))
        print(pageUrl)

        def getUrl(pageUrl):
            openUrl = urlopen(pageUrl)
            soup = BeautifulSoup(openUrl, 'lxml')
            links = soup.find_all('a', class_="ellipsis")
            for link in links:
                linkNew = link.get('href')
                linkList.append(linkNew)
            # print(linkList)
            return linkList

        anzList = getUrl(pageUrl)
        length = len(anzList)
        print(length)

        anzLinks = []
        for i in range(length):
            anzLinks.append('https://www.example.com/' + anzList[i])
        print(anzLinks)

        def collectData():
            for link in anzLinks:
                openAnz = urlopen(link)
                soup = BeautifulSoup(openAnz, 'lxml')
                try:
                    kaufpreisSuche = soup.find('h2')
                    kaufpreis = kaufpreisSuche.text
                    dataList.append(kaufpreis)
                    print(kaufpreis)
                except:
                    kaufpreis = None
                    dataList.append(kaufpreis)
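For context, the whitespace and \n come from the HTML markup itself (the text inside the <h2> spans several indented lines), not from .append(). A minimal sketch, assuming the price really lives in that <h2>, of stripping it at extraction time:

kaufpreisSuche = soup.find('h2')
if kaufpreisSuche is not None:
    # get_text(strip=True) removes the surrounding whitespace and newline characters
    kaufpreis = kaufpreisSuche.get_text(strip=True)
    dataList.append(kaufpreis)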

How to efficiently parse large HTML div-class and span data on Python BeautifulSoup?

The data needed:
I want to scrape through two webpages, one here: https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL and the other: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL.
From the first page, I need values of the row called Total Assets. This would be 5 values in that row named: 365,725,000 375,319,000 321,686,000 290,479,000 231,839,000
Then I need 5 values of the row named Total Current Liabilities. These would be: 43,658,000 38,542,000 27,970,000 20,722,000 11,506,000
From the second link, I need 10 values of the row named Operating Income or Loss. These would be: 52,503,000 48,999,000 55,241,000 33,790,000 18,385,000.
EDIT: I need the TTM value too, and then the five years' values mentioned above. Thanks.
Here is the logic of what I want. I want to run this module, and when run, I want the output to be:
TTM array: 365725000, 116866000, 64423000
year1 array: 375319000, 100814000, 70898000
year2 array: 321686000, 79006000, 80610000
My code:
This is what I have written so far. I can extract the value within the div class if I just put it in a variable as shown below. However, how do I loop efficiently through the 'div' classes, as there are thousands of them on the page? In other words, how do I find just the values I am looking for?
# Import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

soup1 = BeautifulSoup("""<div class="D(tbc) Ta(end) Pstart(6px) Pend(4px) Bxz(bb) Py(8px) BdB Bdc($seperatorColor) Miw(90px) Miw(110px)--pnclg" data-test="fin-col"><span>321,686,000</span></div>""", "html.parser")
soup2 = BeautifulSoup("""<span data-reactid="1377">""", "html.parser")

# This works
print(soup1.find("div", class_="D(tbc) Ta(end) Pstart(6px) Pend(4px) Bxz(bb) Py(8px) BdB Bdc($seperatorColor) Miw(90px) Miw(110px)--pnclg").text)

# How to loop through all the relevant div classes?
EDIT - At the request of @Life is complex, edited to add date headings.
Try this using lxml:
import requests
from lxml import html

url = 'https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL'
url2 = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'

page = requests.get(url)
page2 = requests.get(url2)

tree = html.fromstring(page.content)
tree2 = html.fromstring(page2.content)

total_assets = []
Total_Current_Liabilities = []
Operating_Income_or_Loss = []
heads = []

path = '//div[@class="rw-expnded"][@data-test="fin-row"][@data-reactid]'
data_path = '../../div/span/text()'
heads_path = '//div[contains(@class,"D(ib) Fw(b) Ta(end)")]/span/text()'

dats = [tree.xpath(path), tree2.xpath(path)]
for entry in dats:
    heads.append(entry[0].xpath(heads_path))
    for d in entry[0]:
        for s in d.xpath('//div[@title]'):
            if s.attrib['title'] == 'Total Assets':
                total_assets.append(s.xpath(data_path))
            if s.attrib['title'] == 'Total Current Liabilities':
                Total_Current_Liabilities.append(s.xpath(data_path))
            if s.attrib['title'] == 'Operating Income or Loss':
                Operating_Income_or_Loss.append(s.xpath(data_path))

del total_assets[0]
del Total_Current_Liabilities[0]
del Operating_Income_or_Loss[0]

print('Date Total Assets Total_Current_Liabilities:')
for date, asset, current in zip(heads[0], total_assets[0], Total_Current_Liabilities[0]):
    print(date, asset, current)
print('Operating Income or Loss:')
for head, income in zip(heads[1], Operating_Income_or_Loss[0]):
    print(head, income)
Output:
Date Total Assets Total_Current_Liabilities:
9/29/2018 365,725,000 116,866,000
9/29/2017 375,319,000 100,814,000
9/29/2016 321,686,000 79,006,000
Operating Income or Loss:
ttm 64,423,000
9/29/2018 70,898,000
9/29/2017 61,344,000
9/29/2016 60,024,000
Of course, if so desired, this can be easily incorporated into a pandas dataframe.
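For instance, a small sketch (reusing the heads, total_assets and Total_Current_Liabilities lists built above, and zipping them just like the print loop does) of collecting the balance-sheet rows into a DataFrame:

import pandas as pd

# Each *[0] entry holds the column values scraped above; zip pairs them up by date.
balance_df = pd.DataFrame(
    list(zip(heads[0], total_assets[0], Total_Current_Liabilities[0])),
    columns=['Date', 'Total Assets', 'Total Current Liabilities'],
)
print(balance_df)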
Some suggestions for parsing HTML with BeautifulSoup that are helpful for me and may be helpful for you:
Use 'id' to locate the element instead of 'class', because ids change less frequently than classes.
Use structural info to locate the element instead of 'class'; the structure changes less frequently.
Using headers with User-Agent info to get the response is always better than sending no headers. In this case, if you do not specify the headers, you cannot find the id 'Col1-1-Financials-Proxy', but you can find 'Col1-3-Financials-Proxy', which is not the same as the result in the Chrome inspector.
Here is runnable code for your requirement that uses structural info to locate elements. You could definitely use 'class' info to do it as well. Just remember: when your code does not work well, check the website's source code.
# import libraries
import requests
from bs4 import BeautifulSoup

# set the URLs you want to webscrape from
first_page_url = 'https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL'
second_page_url = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

#################
# first page
#################
print('*' * 10, ' FIRST PAGE RESULT ', '*' * 10)

total_assets = {}
total_current_liabilities = {}
operating_income_or_loss = {}
page1_table_keys = []
page2_table_keys = []

# connect to the first page URL
response = requests.get(first_page_url, headers=headers)
# parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# the nearest id to get the result
sheet = soup.find(id='Col1-1-Financials-Proxy')
sheet_section_divs = sheet.section.find_all('div', recursive=False)
# last child
sheet_data_div = sheet_section_divs[-1]
div_ele_table = sheet_data_div.find('div').find('div').find_all('div', recursive=False)

# table header
div_ele_header = div_ele_table[0].find('div').find_all('div', recursive=False)
# the first element is the label, the remaining elements contain data, so use range(1, len())
for i in range(1, len(div_ele_header)):
    page1_table_keys.append(div_ele_header[i].find('span').text)

# table body
div_ele = div_ele_table[-1]
div_eles = div_ele.find_all('div', recursive=False)

tgt_div_ele1 = div_eles[0].find_all('div', recursive=False)[-1]
tgt_div_ele1_row = tgt_div_ele1.find_all('div', recursive=False)[-1]
tgt_div_ele1_row_eles = tgt_div_ele1_row.find('div').find_all('div', recursive=False)
# the first element is the label, the remaining elements contain data, so use range(1, len())
for i in range(1, len(tgt_div_ele1_row_eles)):
    total_assets[page1_table_keys[i - 1]] = tgt_div_ele1_row_eles[i].find('span').text

tgt_div_ele2 = div_eles[1].find_all('div', recursive=False)[-1]
tgt_div_ele2 = tgt_div_ele2.find('div').find_all('div', recursive=False)[-1]
tgt_div_ele2 = tgt_div_ele2.find('div').find_all('div', recursive=False)[-1]
tgt_div_ele2_row = tgt_div_ele2.find_all('div', recursive=False)[-1]
tgt_div_ele2_row_eles = tgt_div_ele2_row.find('div').find_all('div', recursive=False)
# the first element is the label, the remaining elements contain data, so use range(1, len())
for i in range(1, len(tgt_div_ele2_row_eles)):
    total_current_liabilities[page1_table_keys[i - 1]] = tgt_div_ele2_row_eles[i].find('span').text

print('Total Assets', total_assets)
print('Total Current Liabilities', total_current_liabilities)

#################
# second page, same logic as the first page
#################
print('*' * 10, ' SECOND PAGE RESULT ', '*' * 10)

# connect to the second page URL
response = requests.get(second_page_url, headers=headers)
# parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# the nearest id to get the result
sheet = soup.find(id='Col1-1-Financials-Proxy')
sheet_section_divs = sheet.section.find_all('div', recursive=False)
# last child
sheet_data_div = sheet_section_divs[-1]
div_ele_table = sheet_data_div.find('div').find('div').find_all('div', recursive=False)

# table header
div_ele_header = div_ele_table[0].find('div').find_all('div', recursive=False)
# the first element is the label, the remaining elements contain data, so use range(1, len())
for i in range(1, len(div_ele_header)):
    page2_table_keys.append(div_ele_header[i].find('span').text)

# table body
div_ele = div_ele_table[-1]
div_eles = div_ele.find_all('div', recursive=False)
tgt_div_ele_row = div_eles[4]
tgt_div_ele_row_eles = tgt_div_ele_row.find('div').find_all('div', recursive=False)
for i in range(1, len(tgt_div_ele_row_eles)):
    operating_income_or_loss[page2_table_keys[i - 1]] = tgt_div_ele_row_eles[i].find('span').text

print('Operating Income or Loss', operating_income_or_loss)
Output with header info:
********** FIRST PAGE RESULT **********
Total Assets {'9/29/2018': '365,725,000', '9/29/2017': '375,319,000', '9/29/2016': '321,686,000'}
Total Current Liabilities {'9/29/2018': '116,866,000', '9/29/2017': '100,814,000', '9/29/2016': '79,006,000'}
********** SECOND PAGE RESULT **********
Operating Income or Loss {'ttm': '64,423,000', '9/29/2018': '70,898,000', '9/29/2017': '61,344,000', '9/29/2016': '60,024,000'}

AttributeError: 'NoneType' object has no attribute 'tbody' - Spyder 3.3.1 / beautifulsoup4 / python 3.6

Hey this is my setup: Spyder 3.3.1 / beautifulsoup4 / python 3.6
The code below is from an article on Medium (here) about web scraping with Python and BeautifulSoup. It was supposed to be a quick read, but now, TWO days later, I still cannot get the code to run in Spyder and keep getting:
File "/Users/xxxxxxx/Documents/testdir/swiftScrape.py", line 9, in table_to_df
return pd.DataFrame([[td.text for td in row.findAll('td')] for row in table.tbody.findAll('tr')])
AttributeError: 'NoneType' object has no attribute 'tbody'
I'm not sure what is going wrong; it seems to be an implementation error. Can anyone assist in shedding some light on this issue?
Thanks in advance.
import os
import bs4
import requests
import pandas as pd

PATH = os.path.join("C:\\", "Users", "xxxxx", "Documents", "tesdir")

def table_to_df(table):
    return pd.DataFrame([[td.text for td in row.findAll('td')] for row in table.tbody.findAll('tr')])

def next_page(soup):
    return "http:" + soup.find('a', attrs={'rel': 'next'}).get('href')

res = pd.DataFrame()
url = "http://bank-code.net/country/FRANCE-%28FR%29/"
counter = 0
while True:
    print(counter)
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.content, 'lxml')
    table = soup.find(name='table', attrs={'id': 'tableID'})
    res = res.append(table_to_df(table))
    res.to_csv(os.path.join(PATH, "BIC", "table.csv"), index=None, sep=';', encoding='iso-8859-1')
    url = next_page(soup)
    counter += 1
Like a lot of example code found on the web, this code is not production-grade code - it blindly assumes that HTTP requests always succeed and return the expected content. The truth is that it's quite often not the case (network errors, proxies or firewalls that block you, the site being down - temporarily or definitely, updates to the site that changed the URLs and/or the page's markup, etc.).
Your problem manifests itself here:
def table_to_df(table):
    return pd.DataFrame([[td.text for td in row.findAll('td')] for row in table.tbody.findAll('tr')])
and comes from table actually being None, which means that here in the loop:
table = soup.find(name='table', attrs={'id':'tableID'})
there was no "table" tag with id "tableID" found in the html document. You can check this by printing the actual html content:
while True:
    print(counter)
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.content, 'lxml')
    table = soup.find(name='table', attrs={'id': 'tableID'})
    if table is None:
        print("no table 'tableID' found for url {}".format(url))
        print("html content:\n{}\n".format(page.content))
        continue
    # etc
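In the same spirit, a minimal sketch of checking the HTTP response before parsing, so that network or blocking problems surface immediately rather than as a confusing AttributeError further down:

page = requests.get(url)
# Raise an exception on 4xx/5xx responses instead of silently parsing an error page.
page.raise_for_status()
soup = bs4.BeautifulSoup(page.content, 'lxml')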
Thanks @bruno desthuilliers for your pointers. Much appreciated.
This is the rewritten code that worked for me, using Selenium and webdriver rather than requests:
import os
import bs4
import pandas as pd
from selenium import webdriver

PATH = os.path.join('/', 'Users', 'benmorris', 'documents', 'testdir')

def table_to_df(table):
    return pd.DataFrame([[td.text for td in row.find_all('td')] for row in soup.find_all('tr')])

def next_page(soup):
    return "http:" + soup.find('a', attrs={'rel': 'next'}).get('href')

res = pd.DataFrame()
url = "http://bank-code.net/country/FRANCE-%28FR%29/"
counter = 0
driver = webdriver.Chrome()
driver.get(url)
while True:
    print(counter)
    driver.get(url)
    soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
    table = driver.find_element_by_xpath('//*[@id="tableID"]')
    if table is None:
        print("no table 'tableID' found for url {}".format(url))
        print("html content:\n{}\n".format(driver.page_source))
        continue
    res = res.append(table_to_df(table))
    res.to_csv(os.path.join(PATH, "BIC", "table.csv"), index=False, sep=',', encoding='iso-8859-1')
    url = next_page(soup)
    counter += 1

Python - Beginner Scraping with Beautiful Soup 4 - onmouseover

I'm a beginner Python (3) user and I'm currently trying to scrape some sports stats for my fantasy football season. Previously I did this in a roundabout way (downloading with HTTrack, converting to Excel and using VBA to combine my data). But now I'm trying to learn Python to improve my coding abilities.
I want to scrape this page but am running into some difficulty selecting only the rows/tables I want. Here is how my code currently stands; it still has a bit of leftover code from playing around with it.
from urllib.request import urlopen # import the library
from bs4 import BeautifulSoup # Import BS
from bs4 import SoupStrainer # Import Soup Strainer
page = urlopen('http://www.footywire.com/afl/footy/ft_match_statistics?mid=6172') # access the website
only_tables = SoupStrainer('table') # parse only table elements when parsing
soup = BeautifulSoup(page, 'html.parser') # parse the html
# for row in soup('table',{'class':'tbody'}[0].tbody('tr')):
# tds = row('td')
# print (tds[0].string, tds[1].string)
# create variables to keep the data in
team = []
player = []
kicks = []
handballs = []
disposals = []
marks = []
goals = []
tackles = []
hitouts = []
inside50s = []
freesfor = []
freesagainst = []
fantasy = []
supercoach = []
table = soup.find_all('tr')
# print(soup.prettify())
print(table)
Right now I can select all 'tr' elements from the page; however, I'm having trouble selecting only the rows which have the following attributes:
<tr bgcolor="#ffffff" onmouseout="this.bgColor='#ffffff';" onmouseover="this.bgColor='#cbcdd0';">
"onmouseover" seems to be the only attribute which is common/unique to the table I want to scrape.
Does anyone know how I can alter this line of code, to select this attribute?
table = soup.find_all('tr')
From here I am confident I can place the data into a dataframe which hopefully I can export to CSV.
Any help would be greatly appreciated as I have looked through the BS4 documentation with no luck.
As explained in the BeautifulSoup documentation, you may use this:
table = soup.findAll("tr", {"bgcolor": "#ffffff", "onmouseout": "this.bgColor='#ffffff';", "onmouseover": "this.bgColor='#cbcdd0';"})
Alternatively, you can also use the following approach:
tr_tag = soup.findAll(lambda tag: tag.name == "tr" and tag.get("bgcolor") == "#ffffff" and tag.get("onmouseout") == "this.bgColor='#ffffff';" and tag.get("onmouseover") == "this.bgColor='#cbcdd0';")
The advantage of the above approach is that it uses the full power of BeautifulSoup and gives you the result in a very optimized way.
Check this:
soup.find_all("tr", attrs={"onmouseover" : "this.bgColor='#cbcdd0';"})
