Web scraping of a website in Python

On the website below, I select the date 27-Jun-2017 and the Series/Run code "USD RATES 1100". After submitting, the rates appear further down the page. Up to this point I can do it programmatically. But I need the 10-year rate (the answer is 2.17) for the date and series combination mentioned above. Can someone please tell me what error I am making in the last line of the code?
https://www.theice.com/marketdata/reports/180
from selenium import webdriver
chrome_path = r"C:\Users\vick\Desktop\python_1\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.theice.com/marketdata/reports/180")
try:
    driver.find_element_by_xpath('/html/body/div[3]/div/div[2]/div/div/div[2]/button').click()
except:
    pass
driver.find_element_by_xpath('//*[@id="seriesNameAndRunCode_chosen"]/a/span').click()
driver.find_element_by_xpath('//*[@id="seriesNameAndRunCode_chosen"]/div/ul/li[5]').click()
driver.find_element_by_xpath('//*[@id="reportDate"]').clear()
driver.find_element_by_xpath('//*[@id="reportDate"]').send_keys("27-Jul-2017")
driver.find_element_by_xpath('//*[@id="selectForm"]/input').click()
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)/2;")
print(driver.find_element_by_xpath('//*[@id="report-content"]/div/div/table/tbody/tr[10]/td[2]').get_attribute('innerHTML'))
The error I am getting on the last line:
NoSuchElementException: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="report-content"]/div/div/table/tbody/tr[10]/td[2]"}
Thank you for the help.

You have to wait a second or two after you click the submit input. Like:
from selenium import webdriver
import time
chrome_path = r"C:\Users\vick\Desktop\python_1\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.theice.com/marketdata/reports/180")
try:
    driver.find_element_by_xpath('/html/body/div[3]/div/div[2]/div/div/div[2]/button').click()
except:
    pass
driver.find_element_by_xpath('//*[@id="seriesNameAndRunCode_chosen"]/a/span').click()
driver.find_element_by_xpath('//*[@id="seriesNameAndRunCode_chosen"]/div/ul/li[5]').click()
driver.find_element_by_xpath('//*[@id="reportDate"]').clear()
driver.find_element_by_xpath('//*[@id="reportDate"]').send_keys("27-Jul-2017")
driver.find_element_by_xpath('//*[@id="selectForm"]/input').click()
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)/2;")
time.sleep(2)  # here is the part where you should wait
print(driver.find_element_by_xpath('//*[@id="report-content"]/div/div/table/tbody/tr[10]/td[2]').get_attribute('innerHTML'))
Option B is to wait until the element has been loaded:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
....
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)/2;")
timeout = 5
try:
    element_present = EC.presence_of_element_located((By.ID, 'report-content'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")
......
print(driver.find_element_by_xpath('//*[@id="report-content"]/div/div/table/tbody/tr[10]/td[2]').get_attribute('innerHTML'))
In the first case Python waits 2 seconds and then continues.
In the second case the WebDriver waits until the element has loaded (for at most 5 seconds).
Tried the code and it works. Hope that helped.
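If the table itself still renders slowly, a variation (a minimal sketch of my own, not from the answer above) is to wait directly for the cell you want rather than only for the report-content container; the XPath is simply reused from the question and assumed to be correct:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the 10-year row's rate cell itself to appear
cell_xpath = '//*[@id="report-content"]/div/div/table/tbody/tr[10]/td[2]'
cell = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, cell_xpath))
)
print(cell.get_attribute('innerHTML'))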

Related

Parsing a dynamically loaded webpage with Selenium

I'm trying to parse https://www.flashscore.com/football/albania/ using Selenium in Python, but my webdriver often doesn't wait for the scores to finish loading.
Here's the code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Firefox()
driver.get("https://www.flashscore.com/football/albania/")
try:
    WebDriverWait(driver, 100).until(
        lambda s: s.execute_script("return jQuery.active == 0"))
    print(driver.page_source)
finally:
    driver.quit()
Occasionally, this will print out source code for a flashscore page with a blank table (i.e. the driver does not wait for the scores to finish loading). I suspect that this is because some of the live scores on the page are dynamically loaded. Is there any way to improve my wait condition?
There's an accept cookies button, so we have to click on that first.
I am using explicit waits: first for the presence of the table and then for the visibility of its main body.
Code:
driver.maximize_window()
driver.implicitly_wait(30)
wait = WebDriverWait(driver, 30)
driver.get("https://www.flashscore.com/football/albania/")
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
try:
    wait.until(EC.presence_of_element_located((By.ID, "live-table")))
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section.event")))
    print(driver.page_source)
finally:
    driver.quit()
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
The output is certainly too long, so I won't be able to post it here because Stack Overflow won't allow me to do so.
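One caveat (my own note, not from the answer): the cookie banner does not always appear, so if this has to run repeatedly it may be safer to tolerate a missing accept button, for example:
from selenium.common.exceptions import TimeoutException

try:
    # click the consent button only if it shows up within a short wait
    WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))
    ).click()
except TimeoutException:
    pass  # no cookie banner this time, continue scraping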

How to refresh a page until an element is available using python (Selenium)?

I want to write code that refreshes the page until a particular element is visible on the webpage, using Selenium. I have designed the following code but it gives me an error.
Error: selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"partial link text","selector":"BUY NOW"}
Code:
while True:
    if driver.find_element_by_partial_link_text('BUY NOW'):
        break
    driver.refresh()
buy = driver.find_element_by_partial_link_text('BUY NOW')
buy.click()
Can anyone help?
Selenium raises an exception if it fails to find a single element, so your while True loop cannot succeed: the line if driver.find_element_by_partial_link_text('BUY NOW') will break the script whenever the element is missing.
There are two ways to do it:
Use WebDriverWait to wait for the element explicitly (WebDriverWait is the suggested approach).
Example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, "BUY NOW")))
driver.refresh()
Use find_elements instead.
Example:
from selenium.webdriver.common.by import By
while True:
    if len(driver.find_elements(By.PARTIAL_LINK_TEXT, 'BUY NOW')) > 0:
        break
    driver.refresh()
buy = driver.find_element_by_partial_link_text('BUY NOW')
buy.click()
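A small variation on the second option (my own sketch, not part of the original answer): reuse the element returned by find_elements and click it directly, so the link is not looked up a second time:
from selenium.webdriver.common.by import By

while True:
    matches = driver.find_elements(By.PARTIAL_LINK_TEXT, 'BUY NOW')
    if matches:  # a non-empty list means the link is present
        matches[0].click()
        break
    driver.refresh()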
You should give some time for the element to show up. Try using waits:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
while True:
    try:
        wait = WebDriverWait(driver, 1)
        wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'BUY NOW')))
        break
    except TimeoutException:
        driver.refresh()

Difficulty with simulating clicks in Selenium and then scraping data of new page after click

I am trying to simulate a click from this page (http://www.oddsportal.com/baseball/usa/mlb/results/) to the last page number found at the bottom. The click I use on the icon in my code seems to work, but I can't get it to scrape the data of the new page after simulating that click. Instead, it just scrapes the data from the original URL. Any help on this would be greatly appreciated.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
url = 'http://www.oddsportal.com/baseball/usa/mlb/results/'
driver = webdriver.Chrome()
driver.get(url)
timeout = 5
while True:
    try:
        element_present = EC.presence_of_element_located((By.LINK_TEXT, '»|'))
        WebDriverWait(driver, timeout).until(element_present)
        last_page_link = driver.find_element_by_link_text('»|')
        last_page_link.click()
        element_present2 = EC.presence_of_element_located((By.XPATH, ".//th[@class='first2 tl']"))
        WebDriverWait(driver, timeout).until(element_present2)
        content = driver.page_source
        soup = BeautifulSoup(content, 'lxml')
        dates2 = soup.find_all('th', {'class': 'first2'})
        dates2 = [element.text for element in dates2]
        dates2 = dates2[1:]
        driver.quit()
    except TimeoutException:
        print('Timeout Error!')
        driver.quit()
        continue
    break
print(dates2)
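No answer is shown for this one, but one common fix (a hedged sketch of my own, meant to slot into the existing try block) is to grab a reference to an element of the currently displayed results before clicking and then wait for it to go stale, so page_source is only read after the new page has replaced the old one:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# remember an element from the currently displayed results
old_header = driver.find_element_by_xpath(".//th[@class='first2 tl']")
last_page_link.click()
# wait until that element is detached from the DOM, i.e. the old results are gone
WebDriverWait(driver, timeout).until(EC.staleness_of(old_header))
content = driver.page_source  # now reflects the last results page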

Selenium Python - Explicit waits not working

I am unable to get explicit waits to work while waiting for the page to render the js, so I am forced to use time.sleep() in order for the code to work as intended.
I read the docs and still wasn't able to get it to work.
http://selenium-python.readthedocs.io/waits.html
The commented-out section of code with time.sleep() works as intended.
The WebDriverWait part runs but does not wait.
from selenium import webdriver
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
driver = webdriver.Chrome()
url = "https://www.target.com/"
# tells the driver to wait up to 10 seconds before timing out
# for data that will be loaded on the screen
DELAY = 10
driver.implicitly_wait(DELAY)
SLEEP_TIME = 1
# navigate to the page
driver.get(url)
time.sleep(SLEEP_TIME)
try:
    WebDriverWait(driver, DELAY).until(EC.visibility_of_element_located((By.XPATH, """//*[@id="js-toggleLeftNav"]/img"""))).click()
    WebDriverWait(driver, DELAY).until(EC.visibility_of_element_located((By.XPATH, """//*[@id="5"]"""))).click()
    WebDriverWait(driver, DELAY).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#leftNavigation > ul:nth-child(2)")))
    """
    # opens up the side bar javascript
    driver.find_element_by_xpath('//*[@id="js-toggleLeftNav"]/img').click()
    time.sleep(SLEEP_TIME)
    # clicks on browse by category
    driver.find_element_by_xpath('//*[@id="5"]').click()
    time.sleep(SLEEP_TIME)
    # gets all the category elements
    items = driver.find_element_by_css_selector("#leftNavigation > ul:nth-child(2)").find_elements_by_tag_name("li")
    time.sleep(SLEEP_TIME)
    """
    # gets the hyperlink and category name for all but the first and the last,
    # since the first is back to main menu and the last is exit
    category_links = {}
    for i in range(1, len(items) - 1):
        hyperlink = items[i].find_element_by_tag_name('a').get_attribute('href')
        category_name = items[i].text
        category_links[category_name] = hyperlink
    print(category_links)
except:
    print("Timed out.")
This version successfully loads the site, waits for it to render, then opens the side menu. Notice how the wait.until method is used to successfully wait until the page has loaded. You should be able to use the pattern below with the rest of your code to achieve your goal.
CODE
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.target.com/")
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#leftNavigation > ul:nth-child(2)")))
button = driver.find_element_by_xpath("""//*[@id="js-toggleLeftNavLg"]""")
button.click()
time.sleep(5)
driver.quit()
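To carry the same pattern through to the original goal of collecting the category links, here is a rough sketch (my own, not from the answer; the selectors are taken from the question and may have changed on the live site, and it assumes it runs before driver.quit()):
# wait for the nav list, then read each list item's link and label
menu = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#leftNavigation > ul:nth-child(2)")))
items = menu.find_elements_by_tag_name("li")

category_links = {}
for item in items[1:-1]:  # skip the "back to main menu" and "exit" entries
    link = item.find_element_by_tag_name("a")
    category_links[item.text] = link.get_attribute("href")
print(category_links)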

Can't get rid of hardcoded delay even when Explicit Wait is already there

I've written some code in Python in combination with Selenium to parse the different questions from quora.com. My scraper is doing its job at the moment. The thing is, I've had to use a hardcoded delay for the scraper to work, even though an Explicit Wait has already been defined. As the page is an infinite-scrolling one, I tried to limit the scrolling to a fixed number of passes. Now I have two questions:
Why is wait.until(EC.staleness_of(page)) not working within my scraper? It is commented out now.
If I use something else instead of page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link"))), the scraper throws an error: can't focus element.
Btw, I do not wish to go for the page = driver.find_element_by_tag_name('body') option.
Here is what I've written so far:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.quora.com/topic/C-programming-language")
wait = WebDriverWait(driver, 10)
page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
for scroll in range(10):
    page.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)
    # wait.until(EC.staleness_of(page))
for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
    print(item.text)
driver.quit()
You can try the code below to trigger as many XHRs as possible and then parse the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
driver = webdriver.Chrome()
driver.get("https://www.quora.com/topic/C-programming-language")
wait = WebDriverWait(driver, 10)
page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
links_counter = len(wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "question_link"))))
while True:
    page.send_keys(Keys.END)
    try:
        wait.until(lambda driver: len(driver.find_elements_by_class_name("question_link")) > links_counter)
        links_counter = len(driver.find_elements_by_class_name("question_link"))
    except TimeoutException:
        break
for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
    print(item.text)
driver.quit()
Here we scroll the page down and wait up to 10 seconds for more links to be loaded, or break out of the while loop if the number of links stays the same.
As for your questions:
wait.until(EC.staleness_of(page)) is not working because when you scroll the page down you don't get a new DOM - you just fire an XHR that adds more links to the existing DOM, so the first link (page) will not become stale in this case.
(I'm not quite confident about this, but...) I guess you can send keys only to nodes that can receive focus (ones a user could focus manually), e.g. links, input fields, textareas, buttons..., but not content divisions (div), paragraphs (p), etc.
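To illustrate the point about staleness_of (my own sketch, not from the answer): it only helps when the element you captured is actually replaced, e.g. across a refresh or a real navigation:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

old_link = driver.find_element_by_class_name("question_link")
driver.refresh()  # the whole DOM is rebuilt here
# now the old reference really does go stale, so this wait succeeds
WebDriverWait(driver, 10).until(EC.staleness_of(old_link))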
