Two issues when getting Yahoo Finance historical data with Selenium in Python

I am currently trying to download historical stock prices from Yahoo Finance for personal research purposes. But when I used Selenium in Python to download the data, I ran into two issues:
1. It takes too long to fully download the web page because it has a lot of external resources to load, so there is always a page load timeout exception.
2. When I use try/except to deal with the timeout exception, the button used to change the date range doesn't work. I guess this is because the web page hasn't fully loaded.
I am a beginner with Python and Selenium, so could you please advise on this issue?

Find below 3 methods:
Checking page readyState (not reliable):
def page_has_loaded(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    page_state = self.driver.execute_script('return document.readyState;')
    return page_state == 'complete'
Comparing new page ids with the old one:
def page_has_loaded2(self, old_page):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    try:
        # Compare the current <html> element's internal id with the one captured
        # before navigation; a different id means a new page has replaced the old one.
        new_page = self.driver.find_element_by_tag_name('html')
        return new_page.id != old_page.id
    except NoSuchElementException:
        return False
Using staleness_of method:
@contextlib.contextmanager
def wait_for_page_load(self, timeout=10):
    self.log.debug("Waiting for page to load at {}.".format(self.driver.current_url))
    old_page = self.driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(self.driver, timeout).until(staleness_of(old_page))
For more details, check Harry's blog.
Hope it will help you :)
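A minimal usage sketch combining the context manager above with an explicit wait for the button from the original question. Assumptions: "scraper" is an instance of a helper class exposing self.driver as in the snippets above, and the selectors below are hypothetical placeholders, not the real Yahoo Finance markup:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

scraper.driver.get("https://finance.yahoo.com/quote/AAPL/history")

# Don't assume readyState == 'complete' means the control is usable; wait for
# the element itself to become clickable before interacting with it.
date_button = WebDriverWait(scraper.driver, 30).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "[data-test='dropdown']"))  # hypothetical selector
)
date_button.click()

# When an action triggers navigation to a new page, wrap it in the context
# manager so the code blocks until the old <html> element has gone stale.
with scraper.wait_for_page_load(timeout=30):
    scraper.driver.find_element(By.LINK_TEXT, "Historical Data").click()  # hypothetical link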

Related

Speed up Selenium, WebDriver, and BeautifulSoup? - Python

I am trying to scrape a Google Shopping page and have it work very reliably. The page is full of JavaScript (which BeautifulSoup can't parse, to my knowledge), so I am using Selenium WebDriver to wake the page up first and then BeautifulSoup to parse the page content. The problem is that it is really, really slow. Just parsing this one page takes about 9 seconds on average, and I need to parse multiple pages using the same method at once; 9 seconds each is just too long for my application. I have done a lot of research and implemented various methods to speed up Selenium, WebDriver, and BeautifulSoup (such as cchardet), but to no significant or noticeable difference. To test what was slowing the operation down, I put a print between each line and watched the prints in the terminal to see where it got stuck. My code is below, and the slowest line by far, which is causing 99% of the problem, is...
google_driver.get('https://www.google.com/search?q=desk&source=lmns&tbm=shop&bih=1043&biw=1866&hl=en&sa=X&ved=2ahUKEwjxh5DYj9T5AhVEsHIEHfpsA_0Q_AUoAXoECAEQAQ')
I can't tell if the long pause in this line is only because it takes a while to wake up the page fully before extracting the contents or if it is taking a long time due to the extraction of content.
def google_initiate(request):
    form = SearchForm(request.POST or None)
    google_service = Service(chromedriver_path)
    google_options = webdriver.ChromeOptions()
    google_options.add_argument("--incognito")
    google_options.add_argument('headless')
    google_driver = webdriver.Chrome(service=google_service, options=google_options)
    google_driver.get('https://www.google.com/search?q=desk&source=lmns&tbm=shop&bih=1043&biw=1866&hl=en&sa=X&ved=2ahUKEwjxh5DYj9T5AhVEsHIEHfpsA_0Q_AUoAXoECAEQAQ')
    google_soup = BeautifulSoup(google_driver.page_source, 'lxml')
    google_parsed = google_soup.find_all('div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']})
    return google_parsed
If it is due to the page needing to fully load and there is no fix for the current setup, is there an alternative, faster way to do this? Can I do it with just BeautifulSoup, since it is very fast on its own (again, the reason I am not using it is the JavaScript on the page)? Thanks in advance!
P.S. I am new to Selenium and WebDriver and really only know enough to make this work, plus a few modifications.
UPDATE: Still stuck
def home(request):
    form = SearchForm(request.POST or None)
    if form.is_valid():
        form.save()
    if request.POST:
        for google_post in google_initiate(request, self):
            # Do some stuff
            # Make a list
            # Append stuff to list
Calling function at the top of the code.
def google_initiate(request, self):
    self.open(
        "https://www.google.com/"
        "search?q=desk&source=lmns&tbm=shop&bih=1043&biw=1866&hl=en&sa="
        "X&ved=2ahUKEwjxh5DYj9T5AhVEsHIEHfpsA_0Q_AUoAXoECAEQAQ")
    soup = self.get_beautiful_soup()
    parsed = soup.find_all(
        'div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']}
    )
    print(parsed)
    return parsed
Underlying function at the bottom of the code
I'm still working at it and trying different things; I'm just stuck on getting SeleniumBase to work with Django and views. Thanks!
Below is a SeleniumBase pytest test that will do that in 5 seconds or less.
Add --headless --block-images as command-line options to speed it up:
from seleniumbase import BaseCase

class MyTestClass(BaseCase):
    def test_parse_shopping(self):
        self.open(
            "https://www.google.com/"
            "search?q=desk&source=lmns&tbm=shop&bih=1043&biw=1866&hl=en&sa="
            "X&ved=2ahUKEwjxh5DYj9T5AhVEsHIEHfpsA_0Q_AUoAXoECAEQAQ")
        soup = self.get_beautiful_soup()
        parsed = soup.find_all(
            'div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']}
        )
        print(len(parsed))
pytest test_NAME.py --headless --block-images
I ran that and it found 88 items.
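If you stay with the plain Selenium setup from the question (for example inside the Django view), roughly the same speed-ups can be applied through ChromeOptions. A hedged sketch assuming Selenium 4 and Chrome; chromedriver_path and search_url stand in for the values already used in the question's code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

google_options = webdriver.ChromeOptions()
google_options.add_argument("--headless")      # no visible browser window
google_options.add_argument("--incognito")
google_options.page_load_strategy = "eager"    # return after DOMContentLoaded instead of the full load
google_options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2})  # block images

google_driver = webdriver.Chrome(service=Service(chromedriver_path), options=google_options)
try:
    google_driver.get(search_url)              # the same Google Shopping URL as above
    google_soup = BeautifulSoup(google_driver.page_source, "lxml")
    google_parsed = google_soup.find_all(
        "div", {"class": ["sh-dgr__gr-auto", "sh-dgr__grid-result"]})
finally:
    google_driver.quit()                       # always release the browser, even on errors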

Python Selenium scraper: pagination to the next page shows an error. Scrape protection from the website?

I'm running a Python Selenium script in a Lambda function on AWS.
I'm scraping this page: Link
The scraper itself works fine, but the pagination to the next page stopped working. It had worked for many months before.
I exported a screenshot via:
png = driver.get_screenshot_as_base64()
The screenshot shows an error page instead of the second page.
I run this code (simplified version):
while url:
    driver.get(url)
    png = driver.get_screenshot_as_base64()
    print(png)
    button_next = driver.find_elements_by_class_name("PaginationArrowLink-sc-imp866-0")
    print("button_next_url: " + str(button_next[-1].get_attribute("href")))
    try:
        url = button_next[-1].get_attribute("href")
    except:
        url = ""
        print('Error in URL')
The interesting thing is that the printed URL is totally fine, and when I open it manually in the browser it loads page 2:
https://www.stepstone.de/5/ergebnisliste.html?what=Berufskraftfahrer&searchorigin=Resultlist_top-search&suid=1faad076-5348-48d8-9834-4e0d9a836e34&of=25&action=paging_next
But "driver.get(url)" leads to the error page on the screenshot.
Is this some sort of scrape protection from the website? Or is there another reason it stopped working from one day to the next?
The solution was to cut the last part of the URL.
from:
https://www.stepstone.de/5/ergebnisliste.html?what=berufskraftfahrer&searchorigin=Resultlist_top-search&of=25&action=paging_next
to:
https://www.stepstone.de/5/ergebnisliste.html?what=berufskraftfahrer&searchorigin=Resultlist_top-search&of=25
I still don't understand why Selenium was not able to load it while opening it manually works, but now it is running again.
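A small sketch of doing that trim programmatically with the standard library instead of by hand (the parameter name is taken from the URLs above):
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_paging_action(url):
    # Drop the trailing "action=paging_next" parameter and keep everything else.
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "action"]
    return urlunsplit(parts._replace(query=urlencode(query)))

url = strip_paging_action(button_next[-1].get_attribute("href"))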

Scraping a JavaScript website with Selenium where pages randomly fail to load across multiple browsers

I have a Python scraper that uses Selenium to scrape a dynamically loaded JavaScript website.
The scraper itself works fine, but pages sometimes fail to load with a 404 error.
The problem is that the public URL doesn't have the data I need but loads every time, while the JavaScript URL with the data I need sometimes won't load for a random amount of time.
Even weirder, the same JavaScript URL loads in one browser but not in another, and vice versa.
I tried the webdrivers for Chrome, Firefox, Firefox Developer Edition, and Opera. Not a single one loads all pages every time.
The public link that doesn't have the data I need looks like this: <https://www.sazka.cz/kurzove-sazky/fotbal/*League*/>.
The JavaScript link that has the data I need looks like this: <https://rsb.sazka.cz/fotbal/*League*/>.
On average, about 8 out of roughly 30 links fail to load, although the same link at the same time loads flawlessly in a different browser.
I tried to search the page source for some clues, but I found nothing.
Can anyone help me find out where the problem might be? Thank you.
Edit: here is my code that I think is relevant.
Edit 2: You can reproduce this problem by right-clicking on some league and trying to open the link in another tab. You can then see that even though the page loaded properly at first, after opening it in a new tab the start of the URL changes from https://www.sazka.cz to https://rsb.sazka.cz, and it sometimes gives a 404 error that can last for an hour or more.
driver = webdriver.Chrome(executable_path='chromedriver',
                          service_args=['--ssl-protocol=any',
                                        '--ignore-ssl-errors=true'])
driver.maximize_window()

for single_url in urls:
    randomLoadTime = random.randint(400, 600) / 100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)
    htmlSourceRedirectCheck = driver1.page_source

    # Redirect Check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)
    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
        pass
    else:
        try:
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable(
                (By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass
        unloadedOdds = driver1.find_elements_by_xpath(
            ".//h3[contains(@data-params, 'loadExpandEvents')]")
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100) / 100
            time.sleep(randomLoadTime2)
        matchArr = []
        leaguer = single_url
        htmlSourceOrig = driver1.page_source
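Since the 404s are intermittent, one thing worth trying is to retry each URL a few times before skipping it. A minimal sketch under the same assumptions as the code above (the retry count and pause are arbitrary), called for example as get_with_retry(driver1, single_url):
import time

def get_with_retry(driver, url, retries=3, pause=5):
    # Reload the URL until the 404 page disappears or the retries run out.
    for attempt in range(retries):
        driver.get(url)
        if '404 - Page not found' not in driver.page_source:
            return True    # real content loaded
        time.sleep(pause)  # give the backend a moment before retrying
    return False           # still a 404 after all retries; the caller can skip this URL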

Improve Web Scraping for Elements in a Container Using Selenium

I am using Firefox, and my code works just fine, except that it's very slow. I prevent images from loading, just to speed things up a little:
firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
driver = webdriver.Firefox(firefox_profile=firefox_profile)
but the performance is still slow. I have tried going headless but, unfortunately, it did not work, as I receive NoSuchElement errors. So is there any way to speed up Selenium web scraping? I can't use Scrapy, because this is a dynamic scrape: I need to click through the next button several times, until no clickable buttons exist, and I also need to click pop-up buttons.
Here is a snippet of the code:
a = []
b = []
c = []
d = []
e = []
f = []
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        time.sleep(2)
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        time.sleep(2)
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for j in B:
            b.append(j.text)
        time.sleep(3)
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for k in C:
            c.append(k.text)
        time.sleep(3)
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for l in D:
            d.append(l.text)
        time.sleep(3)
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for m in E:
            e.append(m.text)
    try:
        time.sleep(2)
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        time.sleep(2)
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException) as e:
        break
Here is an edited version, but speed does not improve.
========================================================================
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ui_bubble_rating bubble_")]')))
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"recommend-titleInline noRatings")]')))
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for i in B:
            b.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"noQuotes")]')))
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for i in C:
            c.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ratingDate")]')))
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for i in D:
            d.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"partial_entry")]')))
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for i in E:
            e.append(i.text)
    try:
        #time.sleep(2)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"nav next taLnk ui_button primary")]')))
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"taLnk ulBlueLinks")]')))
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException) as e:
        break
For dynamic webpages (pages rendered or augmented using JavaScript), I would suggest you use scrapy-splash.
Not that you can't use Selenium, but for scraping purposes scrapy-splash is a better fit.
Also, if you have to use Selenium for scraping, a good idea would be to use the headless option. You can also use Chrome; some benchmarks I ran a while back showed headless Chrome being faster than headless Firefox.
Also, rather than sleep, it is better to use WebDriverWait with an expected condition, as it waits only as long as necessary, whereas a thread sleep makes you wait for the full specified time.
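A minimal sketch of that sleep-vs-wait point, reusing one of the locators from the question's code (the 10-second ceiling is arbitrary):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# time.sleep(2)  # always costs the full 2 seconds
next_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable(
        (By.XPATH, './/*[contains(@class,"nav next taLnk ui_button primary")]'))
)
next_button.click()  # continues as soon as the button is actually clickable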
Edit: Adding this as an edit while trying to answer @QHarr, as the answer is pretty long.
It is a suggestion to evaluate scrapy-splash.
I gravitate towards Scrapy because of the whole ecosystem around scraping: middleware, proxies, deployment, scheduling, scaling. So basically, if you are looking to do some serious scraping, Scrapy might be the better starting position. That suggestion comes with a caveat.
When it comes to speed, I can't give an objective answer, as I have never benchmarked Scrapy against Selenium on a project of any size.
But I would assume you will get roughly comparable times on a serial run if you are doing the same things, as in most cases the time is spent waiting for responses.
If you are scraping any considerable number of items, the speed-up generally comes from parallelising the requests, and from falling back to basic HTTP request/response where rendering the page in a browser is not necessary.
Also, anecdotally, some in-page actions can be performed using the underlying HTTP request/response. So if time is a priority, you should look to get as much as possible done with plain HTTP requests and responses.
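A rough sketch of those two ideas combined (parallel requests plus plain HTTP instead of a rendered browser), using a hypothetical URL list and one of the CSS classes from the question; this only works where the content does not require JavaScript rendering:
import concurrent.futures
import requests
from bs4 import BeautifulSoup

def fetch(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")
    # Grab the review snippets directly from the static HTML.
    return [node.get_text(strip=True) for node in soup.select(".partial_entry")]

urls = [f"https://example.com/reviews?page={n}" for n in range(1, 6)]  # hypothetical pages
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, urls))  # pages are fetched concurrently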

Scraping Flickr with Selenium/Beautiful Soup in Python - ABSWP

I'm going through Automate the Boring Stuff with Python and I'm stuck at the chapter about downloading data from the internet. One of the tasks is to download photos for a given keyword from Flickr.
I have a massive problem scraping this site. I've tried BeautifulSoup (which I think is not appropriate in this case, since the site uses JavaScript) and Selenium. Looking at the HTML, I think I should locate the 'overlay' class. However, no matter which option I use (find_element_by_class_name, ...by_text, ...by_partial_text), I am not able to find these elements (I get: ".
Could you please help me clarify what I'm doing wrong? I'd also be grateful for any materials that could help me understand such cases better. Thanks!
Here's my simple code:
import sys
search_keywords = sys.argv[1]
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(f'https://www.flickr.com/search/?text={search_keywords}')
elems = browser.find_element_by_class_name("overlay")
print(elems)
elems.click()
Sample keywords I type in shell: "industrial design interior"
Are you getting any error message? With Selenium it's useful to surround your code in try/except blocks.
What are you trying to do exactly: download the photos? With a bit of rewriting:
try:
    options = webdriver.ChromeOptions()
    # options.add_argument('--headless')
    driver = webdriver.Chrome(chrome_options=options)
    search_keywords = "cars"
    driver.get(f'https://www.flickr.com/search/?text={search_keywords}')
    time.sleep(1)
except Exception as e:
    print("Error loading search results page" + str(e))

try:
    elems = driver.find_element_by_class_name("overlay")
    print(elems)
    elems.click()
    time.sleep(5)
except Exception as e:
    print(str(e))
This loads the page as expected and then clicks on the photo, taking us to This Page.
I would be able to help more if you could go into more detail about what you want to accomplish.
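If the goal is the book exercise (downloading several photos for a keyword), a natural next step is to collect all of the overlay links instead of clicking just one. A minimal sketch, assuming each search result exposes an anchor with the "overlay" class whose href is the photo's page URL (not verified against Flickr's current markup):
# find_elements (plural) returns every match instead of only the first one.
links = [a.get_attribute("href")
         for a in driver.find_elements_by_class_name("overlay")]
print(len(links), "results found")
for link in links[:5]:
    print(link)  # photo page URLs that can then be visited and downloaded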
