How to use XPath to scrape JavaScript website values in Python

I'm trying to scrape (in Python) the savings interest rate from this website using the value's XPath.
I've tried everything: BeautifulSoup, Selenium, etree, etc. I've been able to scrape a few other websites successfully. However, this site and many others are giving me fits. I'd love a solution that can scrape info from several sites regardless of their formatting, using XPath expressions.
My current attempt:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
service = Service(executable_path="/chromedriver")
options = Options()
options.add_argument('--incognito')
options.headless = True
driver = webdriver.Chrome(service=service, options=options)
url = 'https://www.americanexpress.com/en-us/banking/online-savings/account/'
driver.get(url)
element = driver.find_element(By.XPATH, '//*[@id="hysa-apy-2"]')
print(element.text)
if element.text == "":
    print("Error: Element text is empty")
driver.quit()

The interest rates are written inside span elements, and all of the spans that contain rates share the same class, heading-6. But bear in mind that this matches two span elements per interest rate, one for each viewport.
The xpath selector:
'//span[@class="heading-6"]'
You can also select elements by the text they contain, matching on APY:
'//span[contains(., "APY")]'
But this selector matches every span element in the DOM whose text contains the word APY.
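You can combine the two conditions to narrow the match to just the rate spans, and deduplicate the texts to collapse the per-viewport duplicates. A minimal sketch, assuming the heading-6 spans do contain the text APY:
# Narrow to spans that carry the rate class AND mention APY
rate_spans = driver.find_elements(By.XPATH, '//span[@class="heading-6" and contains(., "APY")]')
# Hidden (other-viewport) spans yield an empty .text, so filter them out;
# dict.fromkeys dedupes while preserving order
rates = list(dict.fromkeys(s.text for s in rate_spans if s.text.strip()))
print(rates)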

If an element has a unique id, that locator should be your first choice, e.g. find_element(By.ID, 'hysa-apy-2'), as @John Gordon commented.
But sometimes the element is found before its text has loaded.
Use an XPath that adds the condition text()!="":
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@id="hysa-apy-2" and text()!=""]')))
This requires the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
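Putting it together with the script from the question, a minimal end-to-end sketch:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get('https://www.americanexpress.com/en-us/banking/online-savings/account/')
# Wait until the span exists AND its text has been filled in by JavaScript
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//span[@id="hysa-apy-2" and text()!=""]')))
print(element.text)
driver.quit()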

Related

Can't get all the necessary links from a web page via Selenium

I'm currently trying to use some automation while performing a patent-searching task. I'd like to get all the links corresponding to the search query results. In particular, I'm interested in Apple patents starting from the year 2015. The code is as follows:
import selenium
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By
new_driver_path = r"C:/Users/alexe/Desktop/Apple/PatentSearch/geckodriver-v0.30.0-win64/geckodriver.exe"
ops = options()
serv = Service(new_driver_path)
browser1 = selenium.webdriver.Firefox(service=serv, options=ops)
browser1.get("https://patents.google.com/?assignee=apple&after=priority:20150101&sort=new")
elements = browser1.find_elements(By.CLASS_NAME, "search-result-item")
links = []
for elem in elements:
    href = elem.get_attribute('href')
    if href:
        links.append(href)
links = set(links)
for href in links:
    print(href)
And the output is:
https://patentimages.storage.googleapis.com/ed/06/50/67e30960a7f68d/JP2021152951A.pdf
https://patentimages.storage.googleapis.com/86/30/47/7bc39ddf0e1ea7/KR20210106968A.pdf
https://patentimages.storage.googleapis.com/ca/2a/bc/9380e1657c2767/US20210318798A1.pdf
https://patentimages.storage.googleapis.com/c1/1a/c6/024f785fd5ea10/AU2021204695A1.pdf
https://patentimages.storage.googleapis.com/b3/19/cc/8dc1fae714194f/US20210312694A1.pdf
https://patentimages.storage.googleapis.com/e6/16/c0/292a198e6f1197/AU2021218193A1.pdf
https://patentimages.storage.googleapis.com/3e/77/e0/b59cf47c2b30a1/AU2021212005A1.pdf
https://patentimages.storage.googleapis.com/1b/3d/c2/ad77a8c9724fbc/AU2021204422A1.pdf
https://patentimages.storage.googleapis.com/ad/bc/0f/d1fcc65e53963e/US20210314041A1.pdf
The problem here is that I've got one missing link:
[screenshot: the result item and the missing link]
So I've tried different selectors and still got the same result: one link is missing. I've also tried searching with different parameters, and the pattern is always the same: the missing links belong to items without a PDF. I've spent a lot of time trying to figure out the reason, so I would be really grateful if you could give me any clue on the matter. Thanks in advance!
The highlighted item has no a tag with class pdfLink in it. Put the line of code that extracts the link in a try block; if that element is not found, fall back to whatever a tag is available for that article.
Try like below once:
driver.get("https://patents.google.com/?assignee=apple&after=priority:20150101&sort=new")
articles = driver.find_elements_by_tag_name("article")
print(len(articles))
for article in articles:
    try:
        # The dot in the XPath searches within the current element
        link = article.find_element_by_xpath(".//a[contains(@class,'pdfLink')]").get_attribute("href")
        print(link)
    except:
        print("Exception")
        link = article.find_element_by_xpath(".//a").get_attribute("href")
        print(link)
Output:
10
https://patentimages.storage.googleapis.com/86/30/47/7bc39ddf0e1ea7/KR20210106968A.pdf
https://patentimages.storage.googleapis.com/e6/16/c0/292a198e6f1197/AU2021218193A1.pdf
https://patentimages.storage.googleapis.com/3e/77/e0/b59cf47c2b30a1/AU2021212005A1.pdf
https://patentimages.storage.googleapis.com/c1/1a/c6/024f785fd5ea10/AU2021204695A1.pdf
https://patentimages.storage.googleapis.com/1b/3d/c2/ad77a8c9724fbc/AU2021204422A1.pdf
https://patentimages.storage.googleapis.com/ca/2a/bc/9380e1657c2767/US20210318798A1.pdf
Exception
https://patents.google.com/?assignee=apple&after=priority:20150101&sort=new#
https://patentimages.storage.googleapis.com/b3/19/cc/8dc1fae714194f/US20210312694A1.pdf
https://patentimages.storage.googleapis.com/ed/06/50/67e30960a7f68d/JP2021152951A.pdf
https://patentimages.storage.googleapis.com/ad/bc/0f/d1fcc65e53963e/US20210314041A1.pdf
To extract all the href attributes of the PDFs using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.search-result-item[href]")))])
Using XPATH:
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class, 'search-result-item') and @href]")))])
Console Output:
['https://patentimages.storage.googleapis.com/86/30/47/7bc39ddf0e1ea7/KR20210106968A.pdf', 'https://patentimages.storage.googleapis.com/e6/16/c0/292a198e6f1197/AU2021218193A1.pdf', 'https://patentimages.storage.googleapis.com/3e/77/e0/b59cf47c2b30a1/AU2021212005A1.pdf', 'https://patentimages.storage.googleapis.com/c1/1a/c6/024f785fd5ea10/AU2021204695A1.pdf', 'https://patentimages.storage.googleapis.com/1b/3d/c2/ad77a8c9724fbc/AU2021204422A1.pdf', 'https://patentimages.storage.googleapis.com/ca/2a/bc/9380e1657c2767/US20210318798A1.pdf', 'https://patentimages.storage.googleapis.com/b3/19/cc/8dc1fae714194f/US20210312694A1.pdf', 'https://patentimages.storage.googleapis.com/ed/06/50/67e30960a7f68d/JP2021152951A.pdf', 'https://patentimages.storage.googleapis.com/ad/bc/0f/d1fcc65e53963e/US20210314041A1.pdf']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
PS: You can extract only nine (9) href attributes, as one of the search items is a <span> element rather than a link, i.e. it doesn't have an href attribute.

Can't get tweets with Selenium in Python

I am trying to build a tweet scraper for my NLP project, but I can't get the tweets.
Here is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
query = 'mutluluk'
URL = 'https://twitter.com/search?q=' + query + '&src=typed_query&f=live'
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(URL)
wait.until(EC.title_contains(query + ' - Twitter Araması / Twitter'))
tweets = driver.find_elements_by_css_selector("div#tweet-text").text
print(tweets)
The page that is returned does not have the title you expect; your wait condition is too specific. If you change it to:
wait.until(EC.title_contains(query))
or
wait.until(EC.title_contains(query + ' - Twitter'))
you'll get a page of tweets. After that, I don't think you have the right CSS selector, because it finds no matching element, so you need to further investigate the page contents with the developer tools.
You can wait until the elements you are searching for are present, rather than waiting explicitly for a text in the page title.
Your CSS selector is too fragile for these types of websites. I recommend using XPath, because big websites generally randomize the classes of most elements in the DOM, which makes parsing the document by class name difficult for beginners.
Use this snippet and you will get the text of your elements:
elements = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//main//article")))
for ele in elements:
    print(ele.text)

Element Not Clickable - even though it is there

Hoping you can help. I'm relatively new to Python and Selenium. I'm trying to pull together a simple script that will automate news searching on various websites. The primary focus was football: go and get me the latest Manchester United news from a couple of places and save the list of link titles and URLs. I could then look through the links myself and choose anything I wanted to review.
In trying The Independent newspaper (https://www.independent.co.uk/), I seem to have come up against an element not interactable problem when using the following approaches:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.independent.co.uk')
time.sleep(3)
#accept the cookies/privacy bit
OK = driver.find_element_by_id('qcCmpButtons')
OK.click()
#wait a few seconds, just in case
time.sleep(5)
search_toggle = driver.find_element_by_class_name('icon-search.dropdown-toggle')
search_toggle.click()
This throws the selenium.common.exceptions.ElementNotInteractableException: Message: element not interactable error
I've also tried with XPATH
search_toggle = driver.find_element_by_xpath('//*[@id="quick-search-toggle"]')
and I also tried ID.
I did a lot of reading on here and then also tried using WebDriverWait and execute_script methods:
element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="quick-search-toggle"]')))
driver.execute_script("arguments[0].click();", element)
This didn't seem to error but the search box never appeared, i.e. the appropriate click didn't happen.
Any help you could give would be fantastic.
Thanks,
Pete
Your locator is //*[@id="quick-search-toggle"], and there are 2 matching elements on the page. The first is invisible and the second is visible. By default Selenium refers to the first element, but sadly the element you mean is the second one, so you need another, unique locator. Try this:
search_toggle = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//div[@class="row secondary"]//a[@id="quick-search-toggle"]')))
search_toggle.click()
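Alternatively, if no unique locator is handy, you can fetch every match and click whichever one is actually displayed. A sketch of that idea:
# Both elements match the id; is_displayed() picks out the visible one
toggles = driver.find_elements_by_xpath('//*[@id="quick-search-toggle"]')
visible = [t for t in toggles if t.is_displayed()]
if visible:
    visible[0].click()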
First you need to open the search box, then send the search keys:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import os
chrome_options = Options()
chrome_options.add_argument("--start-maximized")
browser = webdriver.Chrome(executable_path=os.path.abspath(os.getcwd()) + "/chromedriver", options=chrome_options)
link = 'https://www.independent.co.uk'
browser.get(link)
# accept privacy
button = browser.find_element_by_xpath('//*[@id="qcCmpButtons"]/button').click()
# open search box
li = browser.find_element_by_xpath('//*[@id="masthead"]/div[3]/nav[2]/ul/li[1]')
search_tab = li.find_element_by_tag_name('a').click()
# send keys to search box
search = browser.find_element_by_xpath('//*[@id="gsc-i-id1"]')
search.send_keys("python")
search.send_keys(Keys.RETURN)
You can also try the following:
search_toggle = driver.find_element_by_xpath('//*[@class="row secondary"]/nav[2]/ul/li[1]/a')
search_toggle.click()

Selenium Python Unable to locate element

I'm trying to gather price information for each product variation from this web page: https://www.safetysign.com/products/7337/ez-pipe-marker
I'm using Selenium and Firefox with Python 3 and Windows 10.
Here is my current code:
driver = webdriver.Firefox()
driver.get('https://www.safetysign.com/products/7337/ez-pipe-marker')
#frame = driver.find_element_by_class_name('product-dual-holder')
# driver.switch_to.frame('skuer5c866ddb91611')
# driver.implicitly_wait(5)
driver.find_element_by_id('skuer5c866ddb91611-size-label-324').click()
price = driver.find_element_by_class_name("product-pricingnodecontent product-price-content").text.replace('$', '')
products.at[counter, 'safetysign.com Price'] = price
print(price)
print(products['safetysign.com URL'].count()-counter)
So, I'm trying to start by just selecting the first product variation by id (I've also tried class name). But I get an Unable to locate element error. As suggested in numerous SO posts, I tried to change frames (even though I can't find a frame tag in the html that contains this element). I tried switching to different frames using the index, class name, and id of different div elements that I thought might be a frame, but none of this worked. I also tried using waits, but that returned the same error.
Any idea what I am missing or doing wrong?
To locate the elements you have to induce WebDriverWait for visibility_of_all_elements_located(); you can then create a list, iterate over it, and click() each item using the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver=webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get("https://www.safetysign.com/products/7337/ez-pipe-marker")
for product in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//form[@class='product-page-form']//div[@class='sku-contents']//following::ul[1]/li//label[starts-with(@for, 'skuer') and contains(., 'Pipe')]"))):
    WebDriverWait(driver, 20).until(EC.visibility_of(product)).click()
driver.quit()
Those ids may well be dynamic. Select by a label type selector instead and index in to click the required item, e.g. 0 for the item you mention (the first in the list). Also, add a wait condition for the labels to be present.
If you want to limit the match to just those 5 size choices, then use the following CSS selector instead of label:
.sku-contents ul:nth-child(3) label
i.e.
sizes = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".sku-contents ul:nth-child(3) label")))
sizes[0].click()
After selecting a size you can grab the price from the price node, depending on whether you want the price for a given quantity band, e.g. 0-99.
To get final price use:
.product-under-sku-total-label
Code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.safetysign.com/products/7337/ez-pipe-marker'
driver = webdriver.Chrome()
driver.get(url)
labels = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "label")))
labels[0].click()
price0to99 = driver.find_element_by_css_selector('.product-pricingnodecontent').text
priceTotal = driver.find_element_by_css_selector('.product-under-sku-total-label').text
print(priceTotal, price0to99)
# driver.quit()

How to speed up Python Selenium find_elements?

I am trying to scrape company info from kompass.com.
However, as each company profile provides a different amount of detail, certain pages may be missing elements. For example, not all companies have info on 'Associations'. In such cases, my script takes extremely long searching for these missing elements. Is there any way I can speed up the search process?
Here's the excerpt of my script:
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import ElementNotVisibleException
from lxml import html
def init_driver():
    driver = webdriver.Firefox()
    driver.wait = WebDriverWait(driver, 5)
    return driver
def convert2text(webElement):
    if webElement != []:
        webElement = webElement[0].text.encode('utf8')
    else:
        webElement = ['NA']
    return webElement
link='http://sg.kompass.com/c/mizkan-asia-pacific-pte-ltd/sg050477/'
driver = init_driver()
driver.get(link)
driver.implicitly_wait(10)
name = driver.find_elements_by_xpath("//*[@id='productDetailUpdateable']/div[1]/div[2]/div/h1")
name = convert2text(name)
## Problem:
associations = driver.find_elements_by_xpath("//body//div[@class='item minHeight']/div[@id='associations']/div/ul/li/strong")
associations = convert2text(associations)
It takes more than a minute to scrape each page and I have more than 26,000 pages to scrape.
driver.implicitly_wait(10) tells the driver to wait up to 10 seconds for an element to exist in the DOM. That means that every lookup for a non-existing element waits the full 10 seconds. Reducing the time to 2-3 seconds will improve the run time.
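As a sketch of that idea: keep a short implicit wait for elements that must exist, and drop it to zero around lookups of optional sections (find_elements simply returns an empty list when nothing matches). The shortened XPaths here are illustrative, not verified against the site:
driver.implicitly_wait(3)   # short wait for elements that should always exist
name = convert2text(driver.find_elements_by_xpath("//*[@id='productDetailUpdateable']//h1"))
driver.implicitly_wait(0)   # don't wait at all for optional sections
associations = convert2text(driver.find_elements_by_xpath("//div[@id='associations']//strong"))
driver.implicitly_wait(3)   # restore before loading the next page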
In addition, XPath is the slowest selector strategy, and you are making it worse by giving an absolute path. Use find_elements_by_id and find_elements_by_class_name where you can. For example, you can improve
driver.find_elements_by_xpath("//body//div[#class='item minHeight']/div[#id='associations']/div/ul/li/strong")
simply by starting from the associations id:
driver.find_elements_by_xpath("//div[@id='associations']/div/ul/li/strong")
Or by changing it to a CSS selector:
driver.find_elements_by_css_selector("#associations > div > ul > li > strong")
Since your XPaths are not using any attributes apart from class and id to find elements, you could migrate your searches to CSS Selectors. These may be faster on browsers like IE where native XPath searching is not supported.
For example:
//body//div[@class='item minHeight']/div[@id='associations']/div/ul/li/strong
Can become:
body div.item.minHeight > #associations > div > ul > li > strong
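For example, the associations lookup from the question, rewritten with that CSS selector (a sketch):
# Same lookup as the XPath version, expressed as a CSS selector
associations = driver.find_elements_by_css_selector(
    "body div.item.minHeight > #associations > div > ul > li > strong")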
