My code navigates to a webpage, and I want to scrape the href/HTML of each listing on that page.
(The page this code visits has 2 listings.)
I tried XPath and BeautifulSoup, but both return an empty list for me.
Here is the code:
import time
from selenium import webdriver
import pandas as pd
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
bracket = []
driver.get('https://casehippo.com/spa/symposium/national-kidney-foundation-2021-spring-clinical-meetings/event/gallery/?search=Patiromer')
time.sleep(3)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
eachRow = driver.find_element_by_partial_link_text('symposium')
print(eachRow.text)
I just ran the code you provided, and BeautifulSoup set the soup variable with the full page source successfully:
soup = BeautifulSoup(page_source, 'html.parser')
and in the next line:
eachRow=driver.find_element_by_partial_link_text('symposium')
an exception was raised with the message:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"partial link text","selector":"symposium"}
It seems like you're using an incorrect selector; try something like:
element = driver.find_element_by_xpath("//a[@class='title ng-binding']")
The code I'm using:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
try:
    bracket = []
    driver.get(
        'https://casehippo.com/spa/symposium/national-kidney-foundation-2021-spring-clinical-meetings/event/gallery/?search=Patiromer')
    time.sleep(3)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')
    print(soup)
    element = driver.find_element_by_xpath("//a[@class='title ng-binding']")
    print(element.get_attribute('href'))
    elements = driver.find_elements_by_xpath("//a[@class='title ng-binding']")
    for el in elements:
        print(el.get_attribute('href'))
finally:
    driver.quit()
Updated code:
import time
from selenium import webdriver
import pandas as pd
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
bracket = []
driver.get('https://casehippo.com/spa/symposium/national-kidney-foundation-2021-spring-clinical-meetings/event/gallery/?search=Patiromer')
time.sleep(3)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
eachRow = driver.find_elements_by_xpath("//a[contains(@ui-sref,'symposium')]")
for row in eachRow:
    print(row.text)
You need to use find_elements (not find_element) when there can be more than one match, and then iterate over the results to read their values. Also, partial link text won't work here because the 'symposium' text lives in the ui-sref attribute rather than in the link's visible text, so an XPath is needed.
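Since the original goal was the href of each listing, here is a minimal follow-up sketch building on the same XPath (assuming these Angular links also render a normal href attribute):
for row in eachRow:
    bracket.append(row.get_attribute('href'))  # collect each listing's link
print(bracket)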
I'm trying to scrape the following table from this URL: https://baseballsavant.mlb.com/leaderboard/outs_above_average?type=Fielder&startYear=2022&endYear=2022&split=no&team=&range=year&min=10&pos=of&roles=&viz=show
This is my code:
import requests
from bs4 import BeautifulSoup
url = "https://baseballsavant.mlb.com/leaderboard/outs_above_average?type=Fielder&startYear=2022&endYear=2022&split=no&team=&range=year&min=10&pos=of&roles=&viz=show"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table")
for row in table.findAll("tr"):
    print([i.text for i in row.findAll("td")])
However, my table variable is None, even though there is clearly a table tag in the page's HTML. How do I get it?
The webpage is loaded dynamically and relies on JavaScript, so requests alone won't work. You could use a browser automation tool such as Selenium instead.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
url = "https://baseballsavant.mlb.com/leaderboard/outs_above_average?type=Fielder&startYear=2022&endYear=2022&split=no&team=&range=year&min=10&pos=of&roles=&viz=show"
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, 'table')))
table = driver.find_element(By.TAG_NAME, 'table')
table_html = table.get_attribute('innerHTML')
# print('table html:', table_html)
for tr_web_element in table.find_elements(By.TAG_NAME, 'tr'):
    for td_web_element in tr_web_element.find_elements(By.TAG_NAME, 'td'):
        print(td_web_element.text)
driver.close()
Or see this answer to incorporate Selenium with BeautifulSoup.
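If you do want to combine the two, a minimal sketch (assuming you grab the page source before driver.close() runs): render the page with Selenium, then hand the HTML to BeautifulSoup for parsing.
from bs4 import BeautifulSoup
# Parse the Selenium-rendered page with BeautifulSoup instead of Selenium's finders.
soup = BeautifulSoup(driver.page_source, 'lxml')
for row in soup.find('table').find_all('tr'):
    print([cell.text for cell in row.find_all('td')])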
I am trying to scrape a Nasdaq webpage and am having some issues locating elements.
My code:
from selenium import webdriver
import time
import pandas as pd
driver = webdriver.Chrome()
driver.get('http://www.nasdaqomxnordic.com/shares/microsite?Instrument=CSE32679&symbol=ALK%20B&name=ALK-Abell%C3%B3%20B')
time.sleep(5)
btn_overview = driver.find_element_by_xpath('//*[@id="tabarea"]/section/nav/ul/li[2]/a')
btn_overview.click()
time.sleep(5)
employees = driver.find_element_by_xpath('//*[@id="CompanyProfile"]/div[6]')
After the last call, I receive the following error:
NoSuchElementException: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="CompanyProfile"]/div[6]"}
Normally the problem would be a wrong XPath, but I tried several variants, and also locating by id. I suspect it has something to do with the tabs (in my case, navigating to "Overview"). Visually the webpage changes, but if I scrape the table, for example, I still get the one from the first page:
table_test = pd.read_html(driver.page_source)[0]
What am I missing or doing wrong?
The overview page is inside an iframe:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
# keep Chrome open after the script finishes
option.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=option)
driver.get('http://www.nasdaqomxnordic.com/shares/microsite?Instrument=CSE32679&symbol=ALK%20B&name=ALK-Abell%C3%B3%20B')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="tabarea"]/section/nav/ul/li[2]/a'))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="cookieConsentOK"]'))).click()
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe#MorningstarIFrame")))
employees = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="CompanyProfile"]/div[6]'))).text.split()[1]
print(employees)
Output:
2,537
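Note that after frame_to_be_available_and_switch_to_it the driver stays switched into the iframe, so if you need to interact with the rest of the page afterwards, switch back first:
# Return from the Morningstar iframe to the top-level document.
driver.switch_to.default_content()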
Are you sure you need Selenium? The overview data comes from a Morningstar page that you can request directly:
import requests
from bs4 import BeautifulSoup
url = 'http://lt.morningstar.com/gj8uge2g9k/stockreport/default.aspx'
payload = {
    'SecurityToken': '0P0000A5LL]3]1]E0EXG$XCSE_3060'}
response = requests.get(url, params=payload)
soup = BeautifulSoup(response.text, 'html.parser')
employees = soup.find('h3', text='Employees').next_sibling.text
print(employees)
Output:
2,537
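The SecurityToken here presumably comes from the iframe's src on the Nasdaq page. If it ever changes, one way to rediscover it (an assumption that the token sits in the src query string) is a one-off lookup with the Selenium setup from the other answer, before switching into the frame:
from urllib.parse import urlparse, parse_qs
# Read the iframe's src and pull the SecurityToken out of its query string.
iframe = driver.find_element(By.CSS_SELECTOR, 'iframe#MorningstarIFrame')
query = parse_qs(urlparse(iframe.get_attribute('src')).query)
print(query.get('SecurityToken', ['not found'])[0])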
I am trying to scrape genetic data from web pages like https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta.
I am using Beautiful Soup and Selenium.
The data is located inside an element with the id viewercontent1.
When I print that out with this code:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import re
secondDriver = webdriver.Chrome(executable_path='/Users/me/Documents/chloroPlastGenScrape/chromedriver')
newLink = "https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta"
secondDriver.implicitly_wait(10)
WebDriverWait(secondDriver, 10).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')
secondDriver.get(newLink)
html2 = secondDriver.page_source
subSoup = BeautifulSoup(html2, 'html.parser')
viewercontent1 = subSoup.findAll("div", {"id" : "viewercontent1"})[0]
print(viewercontent1)
I print out:
<div class="seq gbff" id="viewercontent1" sequencesize="450826" style="display: block;" val="426261815" virtualsequence=""><div class="loading">Loading ... <img alt="record loading animation" src="/core/extjs/ext-2.1/resources/images/default/grid/loading.gif"/></div></div>
It seems the content hasn't finished loading.
I tried implicitly waiting and checking whether the content is done loading (both before and after calling the .get() function), but that didn't seem to do anything.
I can't wait for the content to load by ID (presence_of_element_located) because the data is contained directly within a <pre></pre> element with no id.
Any help would be greatly appreciated.
To get the content of the <div>, you can use this script:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta'
fasta_url = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={id}&report=fasta'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
id_ = soup.select_one('meta[name="ncbi_uidlist"]')['content']
fasta_txt = requests.get(fasta_url.format(id=id_)).text
print(fasta_txt)
Prints:
>KC208619.1 Butomus umbellatus mitochondrion, complete genome
CCGCCTCTCCCCCCCCCCCCCCGCTCCGTTGTTGAAGCGGGCCCCCCCCATACTCATGAATCTGCATTCC
CAACCAAGGAGTTGTCTCATATAGACAGAGTTGGGCCCCCGTGTTCTGAGATCTTTTTCAACTTGATTAA
TAAAGAGGATTTCTCGGCCGTCTTTTTCGGCTAGGCTCCATTCGGGGTGGGTGTCCAGCTCGTCCCGCTT
CTCGTTAAAGAAATCGATAAAGGCTTCTTCGGGGGTGTAGGCGGCATTTTCCCCCAAGTGGGGATGTCGA
GAAAGCACTTCTTGAAAACGAGAATAAGCTGCGTGCTTACGTTCCCGGATTTGGAGATCCCGGTTTTCGA
...and so on.
@Andrej's solution seems much simpler, but if you still want to go the waiting route...
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
driver = webdriver.Chrome()
newLink = "https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta"
driver.get(newLink)
WebDriverWait(driver, 10).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#viewercontent1 pre"))
)
html2 = driver.page_source
subSoup = BeautifulSoup(html2, 'html.parser')
viewercontent1 = subSoup.findAll("div", {"id" : "viewercontent1"})[0]
print(viewercontent1)
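As a small follow-up, if you only need the sequence text rather than the tag markup, bs4's get_text() is enough:
# Print just the FASTA text inside the viewer div.
print(viewercontent1.get_text())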
Trying to scrape all the information for every dozer item on this page.
I have just started and have only a fair idea about scraping, so I am not sure how to do this.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.rbauction.com/dozers?keywords=&category=21261693092')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# tried all sorts of different ways but got only NoneType or no element
get = soup.findAll('div', attrs={'class': 'sc-gisBJw eHFfwj'})
get2 = soup.findAll('div', attrs={'id': 'searchResultsList'})
get3 = soup.find('div.searchResultsList').find_all('a')  # fails: find() does not accept CSS selectors
I need to get into each class/id, loop over the a['href'] values, and get the information for each dozer.
Please help.
You need to wait for the data you are looking for to load before reading it into the BeautifulSoup object. Use WebDriverWait in Selenium to wait for the page to load, as it takes a while to render fully:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('https://www.rbauction.com/dozers?keywords=&category=21261693092')
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'searchResultsList')))
soup = BeautifulSoup(driver.page_source,'html.parser')
This line should then return the hrefs from the page:
hrefs = [el.attrs.get('href') for el in soup.find('div', attrs={'id': 'searchResultsList'}).find_all('a')]
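From there, to get each dozer's details as the question asks, a rough sketch (assuming the hrefs are site-relative and that each detail page also renders via JavaScript):
from urllib.parse import urljoin
for href in hrefs:
    driver.get(urljoin('https://www.rbauction.com', href))
    # Wait for the detail page to finish loading before parsing it.
    WebDriverWait(driver, 10).until(
        lambda d: d.execute_script('return document.readyState') == 'complete')
    item_soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(item_soup.title.text)  # placeholder: pull whatever fields you need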
You can just use requests:
import requests
headers = {'Referrer':'https://www.rbauction.com/dozers?keywords=&category=21261693092'}
data = requests.get('https://www.rbauction.com/rba-msapi/search?keywords=&searchParams=%7B%22category%22%3A%2221261693092%22%7D&page=0&maxCount=48&trackingType=2&withResults=true&withFacets=true&withBreadcrumbs=true&catalog=ci&locale=en_US', headers = headers).json()
for item in data['response']['results']:
    print(item['name'], item['url'])
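The page=0 and maxCount=48 parameters in that URL suggest the endpoint is paginated; a hedged sketch for walking further pages (the empty-page stop condition is an assumption):
page = 0
while True:
    url = ('https://www.rbauction.com/rba-msapi/search?keywords='
           '&searchParams=%7B%22category%22%3A%2221261693092%22%7D'
           f'&page={page}&maxCount=48&trackingType=2&withResults=true'
           '&withFacets=true&withBreadcrumbs=true&catalog=ci&locale=en_US')
    results = requests.get(url, headers=headers).json()['response']['results']
    if not results:  # assumed: an empty page means no more results
        break
    for item in results:
        print(item['name'], item['url'])
    page += 1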
I am struggling to scrape with the code below. Would appreciate it if someone could have a look at what I am missing.
Regards,
PyProg70
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from bs4 import BeautifulSoup
import pandas as pd
import re, time
binary = FirefoxBinary('/usr/bin/firefox')
opts = FirefoxOptions()
opts.add_argument("--headless")
browser = webdriver.Firefox(options=opts, firefox_binary=binary)
browser.implicitly_wait(10)
url = 'http://tenderbulletin.eskom.co.za/'
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
Not Java but JavaScript. It's a dynamic page: you need to wait and check whether the Ajax request has finished and the content has rendered, using WebDriverWait.
....
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
.....
browser.get(url)
# wait up to 30 seconds until the table has loaded
WebDriverWait(browser, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR , 'table.CSSTableGenerator .ng-binding')))
html = browser.find_element_by_css_selector('table.CSSTableGenerator')
soup = BeautifulSoup(html.get_attribute("outerHTML"), 'lxml')
print(soup.prettify().encode('utf-8'))