When scraping data from NASDAQ, there are tickers like ACHC whose pages are empty.
My program iterates through all ticker symbols, and when it reaches this one it times out because there is no data to grab. I am trying to figure out a way to check whether the page is empty and, if so, skip the ticker but continue the loop. The code is pretty long, so I'll post the most relevant part: the beginning of the loop, where it opens the page:
# navigate to the annual income statement page
url = url_form.format(symbol, "income-statement")
browser.get(url)
company_xpath = "//h1[contains(text(), 'Company Financials')]"
company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
annuals_xpath = "//thead/tr[th[1][text() = 'Period Ending:']]/th[position()>=3]"
annuals = get_elements(browser, annuals_xpath)
Selenium doesn't have a built-in method for determining whether an element exists or not, so the most common thing to do is use a try/except block.
from selenium.common.exceptions import TimeoutException
...
try:
    company = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.XPATH, company_xpath))).text
except TimeoutException:
    continue
This should keep the loop going without crashing, assuming that continue works as expected with your loop.
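For context, here is a minimal sketch of how that try/except could sit inside the ticker loop (symbols is a stand-in for whatever list of tickers your script already iterates over):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

for symbol in symbols:  # hypothetical list of ticker symbols
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except TimeoutException:
        continue  # empty page (e.g. ACHC): skip this ticker, keep looping
    # ... rest of the scraping for this ticker ...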
You can use libraries like requests or urllib to fetch that web page and check whether what you need is there. These libraries are much faster than Selenium because they only download the page source. If there are particular tags or structures, like tables, that you're looking for, take a look at BeautifulSoup, which you can use together with requests to identify very specific parts of the page.
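For example, a minimal pre-check sketch with requests and BeautifulSoup; it assumes the 'Period Ending:' header is present in the raw HTML (if NASDAQ builds the table with JavaScript, the fetched source will not contain it and you still need Selenium):

import requests
from bs4 import BeautifulSoup

def has_income_data(symbol, url_form):
    # Fetch only the page source; no browser is started.
    resp = requests.get(url_form.format(symbol, "income-statement"), timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    # Look for the same table header the Selenium XPath relies on.
    return soup.find("th", string="Period Ending:") is not None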
I'm very new and learning web scraping in Python by trying to get the search results from the website below after a user types in some information, and then printing the results. Everything works great up until the last two lines of this script. When I include them in the script, nothing happens; however, when I remove them and just type them into the shell after the script has finished running, they work exactly as I intended. Can you think of a reason this is happening? As I'm a beginner, I'm also very open to a much easier solution if you see one. All feedback is welcome. Thank you!
#Setup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
#Open Chrome
driver = webdriver.Chrome()
driver.get("https://myutilities.seattle.gov/eportal/#/accountlookup/calendar")
#Wait for site to load
time.sleep(10)
#Click on street address search box
elem = driver.find_element(By.ID, 'sa')
elem.click()
#Get input from the user
addr = input('Please enter part of your address and press enter to search.\n')
#Enter user input into search box
elem.send_keys(addr)
#Get search results
elem = driver.find_element(By.XPATH, ('/html/body/app-root/main/div/div/account-lookup/div/section/div[2]/div[2]/div/div/form/div/ul/li/div/div[1]'))
print(elem.text)
I haven't used Selenium in a while, so I can only point you in the right direction. It seems to me you need to iterate over the individual entries, and print those, as opposed to printing the entire div as one element.
You should remove the parentheses from the xpath expression
You can shorten the xpath expression as follows:
Code:
elems = driver.find_elements(By.XPATH, '//*[@class="addressResults"]/div')
for elem in elems:
    print(elem.text)
You are using an absolute XPath; what you should be looking into are relative XPaths.
Something like this should do it:
elems = driver.find_elements(By.XPATH, "//*[@id='addressResults']/div")
for elem in elems:
    ...
I ended up figuring out my problem - I just needed to add a bit that waits until the search results actually load before proceeding with the script. Tossing in a time.sleep(5) did the trick. Eventually I'll add a check that an element has loaded before the script continues, but this lets me move on for now. Thanks everyone for your answers!
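For reference, a minimal explicit-wait sketch that could replace the time.sleep(5); it assumes the results container uses the addressResults class mentioned in the answers above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block for up to 10 seconds until at least one result element is present.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//*[@class='addressResults']/div")))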
I am looking for a solution to the StaleElementReferenceException that arises when navigating back to a previous page with Selenium.
Here is a sample code to reproduce the error:
from selenium.webdriver import Chrome
from selenium.common.exceptions import NoSuchElementException
browser = Chrome()
browser.get('https://stackoverflow.com/questions/')
# Closing the pop-up for cookies
try:
    browser.find_element_by_class_name('js-accept-cookies').click()
except NoSuchElementException:
    pass
# Getting list of links on a StackOverflow page
links = browser.find_element_by_id('questions').find_elements_by_tag_name('a')
links[0].click()
# Going back
browser.back()
try:
    browser.find_element_by_class_name('js-accept-cookies').click()
except NoSuchElementException:
    pass
# Using the old links
links[1].click()
I understood the root cause from similar Stack Overflow questions like this one: Stale Element Reference Exception: How to solve?
However, the proposed solution, i.e. refetching the links every time I navigate back, does not suit me for performance reasons.
Is there any alternative?
For example, forcing the new page to open in a new tab so that I can navigate between the two tabs?
Any other solution is appreciated.
links = [x.get_attribute('href') for x in browser.find_element_by_id('questions').find_elements_by_tag_name('a')]
browser.get(links[0])
browser.back()
Simply get the href values and go back and forth like so. The element references you get from a page go stale once you navigate away from it.
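Expanding that into a loop, a minimal sketch using the same browser session as in the question:

# Collect the hrefs once; plain strings never go stale.
hrefs = [a.get_attribute('href')
         for a in browser.find_element_by_id('questions').find_elements_by_tag_name('a')]

for href in hrefs:
    if not href:
        continue  # skip anchors without an href attribute
    browser.get(href)
    # ... scrape the question page here ...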
A total newbie here in search of your wisdom (1st post/question, too)! Thank you in advance for your time and patience.
I am hoping to automate scientific literature searches in Google Scholar using Selenium (via Chrome) with Python. I envision entering a topic, which will be searched on Google Scholar, then entering each link of the articles/books in the results, extracting the abstract/summary, and printing them on the console (or saving them to a text file). This will be an easy way to determine the relevance of the articles in the results for the stuff that I'm writing.
Thus far, I am able to visit Google Scholar, enter text in the search bar, filter by date (newest to oldest), and extract each of the links in the results. I have not been able to write a loop that will enter each article link and extract the abstracts (or other relevant text), as each result may have been coded differently.
Kind regards,
JP (Aotus_californicus)
This is my code so far:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def get_results(search_term):
    url = 'https://scholar.google.com'
    browser = webdriver.Chrome(executable_path=r'C:\Users\Aotuscalifornicus\Downloads\chromedriver_win32\chromedriver.exe')
    browser.get(url)
    searchBar = browser.find_element_by_id('gs_hdr_tsi')
    searchBar.send_keys(search_term)
    searchBar.submit()
    browser.find_element_by_link_text("Trier par date").click()
    results = []
    links = browser.find_elements_by_xpath('//h3/a')
    for link in links:
        href = link.get_attribute('href')
        print(href)
        results.append(href)
    browser.close()

get_results('Primate thermoregulation')
Wrt your comment, and using that as a basis for my answer:
To clarify, I am looking to write a loop that enters each link and extracts an element by tag, for example
Open a new window or start a new driver session to check the links in the results. Then use a rule to extract the text you want. You could reuse your existing driver session if you extract all the hrefs first, or create a new tab as you get each result link (a sketch of the tab approach follows the snippet below).
for link in links:
    href = link.get_attribute('href')
    print(href)
    results.append(href)

extractor = webdriver.Chrome(executable_path=...)  # as above
for result in results:
    extractor.get(result)
    section_you_want = extractor.find_elements_by_xpath(...)  # or whichever set of rules
    # other code here

extractor.close()
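The new-tab alternative mentioned above could look roughly like this (a sketch using Selenium 3-style window handles; the exact calls may differ in your version):

main_window = browser.window_handles[0]
for link in links:
    href = link.get_attribute('href')
    # Open the result in a fresh tab so the results page is never left.
    browser.execute_script("window.open(arguments[0]);", href)
    browser.switch_to.window(browser.window_handles[-1])
    # ... extract the abstract / summary here ...
    browser.close()                         # close the tab
    browser.switch_to.window(main_window)   # return to the results page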
You can set up rules to use with the base find_element() or find_elements() finders and then iterate over them until you get a result (validating based on element presence, text length, or something else sane and useful). Each of the rules is a tuple that can be passed to the base finder function:
from selenium.webdriver.common.by import By  # see the docs linked above for the available `By` class attributes

rules = [(By.XPATH, '//h3/p'),
         (By.ID, 'summary'),
         (By.TAG_NAME, 'div'),
         ...  # etc.
        ]

for url in results:
    extractor.get(url)
    for rule in rules:
        elems = extractor.find_elements(*rule)  # argument unpacking
        if not elems:
            continue  # not found, try next rule
        print(elems[0].text)
        break  # stop after first successful "find"
    else:  # only executed if no rule matches and `break` is never reached, or the `rules` list is empty
        print('Could not find anything for url:', url)
This Python function aims to scrape a specific identifier (called a PMID) from a JavaScript web page. When a URL is passed to the function, it loads the page using Selenium. The code then tries to find an anchor tag with the class "pubmedLink" in the HTML. If found, it returns the extracted PMID to another function.
This works fine, but it is really slow. Is there a way to accelerate the process, maybe by using another parser or a completely different method?
from selenium import webdriver

def _getPMIDfromURL_(url):
    driver = webdriver.Chrome('/usr/protoLivingSystematicReviews/drivers/chromedriver')
    driver.get(url)
    try:
        if driver.find_element_by_css_selector('a.pubmedLink').is_displayed():
            json_text = driver.find_element_by_css_selector('a.pubmedLink').text
            return json_text
    except:
        return "no_pmid"
    driver.quit()  # only reached when the link exists but is not displayed; otherwise the browser is never closed
Examples of the URL for the JS web-page,
http://www.embase.com/search/results?subaction=viewrecord&from=export&id=L617434973
http://www.embase.com/search/results?subaction=viewrecord&from=export&id=L617388849
http://www.embase.com/search/results?subaction=viewrecord&from=export&id=L46141767
Well, Selenium is fast; that's why it is the favorite of many testers. On the other hand, you could improve your code by parsing the content once instead of twice.
The return value of the statement
driver.find_element_by_css_selector('a.pubmedLink')
can be stored in a variable, and that variable used instead. This will improve your speed by roughly 1.5x.
try:
    elem = driver.find_element_by_css_selector('a.pubmedLink')
    if elem.is_displayed():
        return elem.text
except:
    return "no_pmid"
You can try PhantomJS; it's faster:
https://realpython.com/headless-selenium-testing-with-python-and-phantomjs/
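A minimal swap-in might look like this (assuming an older Selenium release that still ships the PhantomJS driver and that the phantomjs binary is on your PATH):

from selenium import webdriver

# Headless browser: no window is drawn, which typically cuts startup and rendering time.
driver = webdriver.PhantomJS()
driver.get(url)  # one of the Embase URLs above
print(driver.find_element_by_css_selector('a.pubmedLink').text)
driver.quit()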
I have a project in which I chose Selenium to open 1-5 links. It's stopping at the 3rd link. I've followed the same methods as for the previously successful requests. I've allowed 17 seconds and watched the page load before the script continues to run in my console. I'm just not sure why it can't find this link, and I hope it's something I'm simply overlooking...
from selenium import *
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import csv
import time
username = "xxxxxxx"
password = "xxxxxxx"
driver = webdriver.Firefox()
driver.get("https://tm.login.trendmicro.com/simplesaml/saml2/idp/SSOService.php")
assert "Trend" in driver.title
elem1 = driver.find_element_by_class_name("input_username")
elem2 = driver.find_element_by_class_name("input_password")
elem3 = driver.find_element_by_id("btn_logon")
elem1.send_keys(username)
elem2.send_keys(password)
elem3.send_keys(Keys.RETURN)
time.sleep(7)
assert "No results found." not in driver.page_source
elem4 = driver.find_element_by_css_selector("a.float-right.open-console")
elem4.send_keys(Keys.RETURN)
time.sleep(17)
elem5 = driver.find_element_by_tag_name("a.btn_left")
elem5.send_keys(Keys.RETURN)
Well, one of the reasons is that elem5 is looking for the element by tag name, but you are passing it a CSS selector. "a.btn_left" is not an HTML tag name, so your script will never actually find it, because it simply doesn't exist in the DOM.
You either need to find it by css_selector or, better yet, by XPath. If you want to make this as reliable as possible and more future-proof, I always try to find elements on a page with at least two descriptors using XPath where possible (a sketch follows below).
Change this:
elem5 = driver.find_element_by_tag_name("a.btn_left")
To this:
elem5 = driver.find_element_by_css_selector("a.btn_left")
You will almost never use tag_name, mostly because it always retrieves the first element with the tag you pass it, so "a" would always find the first link on the page; "a.btn_left", however, is not a tag name and so matches nothing.
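As an illustration of the two-descriptor idea mentioned above, something along these lines might work (the text() value here is purely hypothetical; use whatever the link actually says on the page):

elem5 = driver.find_element_by_xpath("//a[contains(@class, 'btn_left') and contains(text(), 'Open Console')]")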
I wound up solving it with the code below. I increased the wait time to 20 seconds. Believe it or not, I did try finding it by CSS (I actually left the a.btn_left and cycled through all the elements), and none of them worked. Fortunately, I could get at it with tab and key functions, so that works for now.
time.sleep(20)
driver.get("https://wfbs-svc-nabu.trendmicro.com/wfbs-svc/portal/en/view/cm")
elem5 = driver.find_element_by_link_text("Devices")
elem5.send_keys(Keys.ENTER)