Selenium WebDriver Python: page loads incompletely / sometimes freezes on refresh

I am scraping a website with a lot of JavaScript that is generated when the page is called. As a result, traditional web-scraping methods (BeautifulSoup, etc.) are not working for my purposes (at least I have been unsuccessful in getting them to work; all of the important data is in the JavaScript-generated parts), so I have started using Selenium WebDriver. I need to scrape a few hundred pages, each of which has between 10 and 80 data points (each with about 12 fields), so it is important that this script can run for quite a while without me having to babysit it.
I have the code working for a single page, plus a controlling section that tells the scraping section which page to scrape. The problem is that the JavaScript portions of the page sometimes fail to load (roughly 1 in 7 loads). A refresh usually fixes things, but occasionally the refresh freezes WebDriver, and with it the Python runtime as well. Annoyingly, when it freezes like this, the code never times out. What is going on?
Here is a stripped down version of my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time, re, random, csv
from collections import namedtuple
def main(url_full):
    driver = webdriver.Firefox()
    driver.implicitly_wait(15)
    driver.set_page_load_timeout(30)

    # create HealthPlan namedtuple
    HealthPlan = namedtuple("HealthPlan", ("State, County, FamType, Provider, PlanType, Tier,") +
                            (" Premium, Deductible, OoPM, PrimaryCareVisitCoPay, ER, HospitalStay,") +
                            (" GenericRx, PreferredPrescription, RxOoPM, MedicalDeduct, BrandDrugDeduct"))

    # check whether the page has loaded and handle page load and timeout errors
    pageNotLoaded = True
    while pageNotLoaded:
        try:
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))
        except TimeoutException:
            driver.quit()
            driver = webdriver.Firefox()  # start a fresh browser instance
            time.sleep(3 + abs(random.normalvariate(1.8, 3)))
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))

        # Handle page load error by testing presence of showAll,
        # an important feature of the page, which only appears if everything else loads
        try:
            driver.find_element_by_xpath('//*[@id="showAll"]').text
        # catch NoSuchElementException => refresh page
        except NoSuchElementException:
            try:
                driver.refresh()
            # catch TimeoutException => quit and load the page
            # in a new instance of firefox;
            # I don't think the code ever gets here, because it freezes in the refresh
            # and will not throw the timeout exception like I would like
            except TimeoutException:
                driver.quit()
                driver = webdriver.Firefox()  # start a fresh browser instance
                time.sleep(3 + abs(random.normalvariate(1.8, 3)))
                driver.get(url_full)
                time.sleep(6 + abs(random.normalvariate(1.8, 3)))
        pageNotLoaded = False

    scrapePage()  # this is a dummy function; everything from here down works fine
I have looked extensively for similar problems, and I do not think anyone else has posted about this on SO, or anywhere else that I have looked. I am using Python 2.7 and Selenium 2.39.0, and I am trying to scrape Healthcare.gov's "get premium estimate" pages.
EDIT: (as an example, this page) It may also be worth mentioning that the page fails to load completely more often once the computer has been on and doing this for a while (I'm guessing that free RAM is filling up and the load glitches). This is beside the point, though, because it should be handled by the try/except.
EDIT 2: I should also mention that this is being run on Windows 7 64-bit, with Firefox 17 (which I believe is the newest supported version).

Dude, time.sleep is a fail!
What's this?
time.sleep(3 + abs(random.normalvariate(1.8, 3)))
Try this:
import unittest

class TestPy(unittest.TestCase):
    def waits(self):
        self.implicit_wait = 30
Or this:
(self.)driver.implicitly_wait(10)
Or this:
WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('some_xpath'))
Or, instead of driver.refresh(), you can trick it:
driver.get(your url)
Also you can clear the cookies:
driver.delete_all_cookies()
And for scrapePage() (the dummy function where everything works fine), have a look at:
http://scrapy.org
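Putting those suggestions together, here is a minimal sketch (my assembly, assuming the showAll element from the question and Selenium's expected_conditions helpers) of replacing the sleep loop with an explicit wait:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
driver.set_page_load_timeout(30)
try:
    driver.get(url_full)
    # Wait up to 20 s for the element that signals a complete load,
    # instead of sleeping a fixed, randomized amount of time.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.ID, "showAll")))
except TimeoutException:
    # Tear the browser down and start over rather than refreshing,
    # which is where the question reports the freeze.
    driver.quit()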

Related

Selenium with proxy returns empty website

I am having trouble getting the page source HTML out of a site with Selenium through a proxy. Here is my code:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import codecs
import time
import shutil
proxy_username = 'myProxyUser'
proxy_password = 'myProxyPW'
port = '1080'
hostname = 'myProxyIP'
PROXY = proxy_username+":"+proxy_password+"@"+hostname+":"+port
options = Options()
options.add_argument("--headless")
options.add_argument("--kiosk")
options.add_argument('--proxy-server=%s' %PROXY)
driver = webdriver.Chrome(r'C:\Users\kingOtto\Downloads\chromedriver\chromedriver.exe', options=options)
driver.get("https://www.whatismyip.com")
time.sleep(10)
html = driver.page_source
f = codecs.open('dummy.html', "w", "utf-8")
f.write(html)
driver.close()
This results in a very incomplete HTML, showing only outer brackets of head and body:
html
Out[3]: '<html><head></head><body></body></html>'
Also, the dummy.html file written to disk does not show any content other than what is displayed in the line above.
I am lost; here is what I tried:
It does work when I run it without the options.add_argument('--proxy-server=%s' % PROXY) line, so I am sure it is the proxy. But the proxy connection itself seems to be OK (I do not get any proxy connection errors, plus I do get the outer frame from the website, right? So the driver request gets through and back to me).
Different URLs: not only whatismyip.com fails; so does any other page. I tried different news outlets such as CNN and even Google; virtually nothing comes back from any website except the head and body brackets. It cannot be a JavaScript/iframe issue, right?
Different wait times (this article does not help: Make Selenium wait 10 seconds), up to 60 seconds; plus my connection is super fast, <1 second should be enough (in browser).
What am I getting wrong about the connection?
driver.page_source does not always return what you expect via Selenium. It's likely NOT the full DOM. This is documented in the Selenium docs and in various SO answers, e.g.:
https://stackoverflow.com/a/45247539/1387701
Selenium makes a best effort to provide the page source as it is fetched. On highly dynamic pages, the return can often be limited.
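As a hedged sketch (mine, not from the linked answer): one way to give a dynamic page a chance to finish before reading page_source is to poll document.readyState with an explicit wait, assuming the page ever reaches the "complete" state:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.whatismyip.com")
# Poll (up to 30 s) until the browser itself reports the document
# as fully loaded, then read the source.
WebDriverWait(driver, 30).until(
    lambda d: d.execute_script("return document.readyState") == "complete")
html = driver.page_source
driver.quit()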

While using the chromedriver with Selenium, how do I correct a timeout error?

I may be missing something simple here, but I have tried a lot already without any luck. I'm new to Selenium and I cannot correct the following issue. When navigating to a web page using get() I continually get a timeout message. The page loads properly, but after everything on the page loads (I assume it may have something to do with how long the ads take to load) I get this error.
selenium.common.exceptions.TimeoutException: Message: timeout
(Session info: chrome=65.0.3325.181)
(Driver info: chromedriver=2.36.540470 (e522d04694c7ebea4ba8821272dbef4f9b818c91),platform=Windows NT 10.0.16299 x86_64)
I have tried the following: moving the chromedriver location, older versions of Selenium, waits, implicit waits, time.sleep, and others. Any input would be great, as this seems like something simple and I would like to get it fixed soon.
The code in question:
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("Path\To\chromedriver.exe")
driver.set_page_load_timeout(10)
driver.get("https://www.website.com")
driver.find_element_by_name("name").send_keys("com")
driver.find_element_by_name("word").send_keys("pw")
driver.find_element_by_id("idItem").click()
driver.find_element_by_name("word").send_keys(Keys.ENTER)
#driver.implicitly_wait(10)
driver.get("https://www.website2.com")  # <-- Error here, never gets past this point
time.sleep(10)
driver.close()
As per your question, while navigating to a web page using get() it may appear that the page has loaded properly, while in fact the JavaScript and Ajax calls under the hood have not completed, and the web client has not yet reached 'document.readyState' equal to "complete".
But you have induced set_page_load_timeout(10) in your code, which, in case the complete page load (including the JS and Ajax) does not finish within 10 seconds, results in a TimeoutException. That is exactly what is happening in your case.
Solution
If your use case doesn't have a constraint on page load timeout, remove the set_page_load_timeout(10) line.
If your use case does depend on a page load timeout, catch the exception and invoke quit() to shut down gracefully, as follows:
Code block:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe')
driver.set_page_load_timeout(2)
try:
    driver.get("https://www.booking.com/hotel/in/the-taj-mahal-palace-tower.html?label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaGyIAQGYATG4AQbIAQzYAQHoAQH4AQKSAgF5qAID;sid=338ad58d8e83c71e6aa78c67a2996616;dest_id=-2092174;dest_type=city;dist=0;group_adults=2;hip_dst=1;hpos=1;room1=A%2CA;sb_price_type=total;srfid=ccd41231d2f37b82d695970f081412152a59586aX1;srpvid=c71751e539ea01ce;type=total;ucfs=1&#hotelTmpl")
    print("URL successfully Accessed")
    driver.quit()
except TimeoutException:
    print("Page load timeout occurred. Quitting !!!")
    driver.quit()
Console output:
Page load timeout occurred. Quitting !!!
You can find a detailed discussion on set_page_load_timeout() in How to set the timeout of 'driver.get' for python selenium 3.8.0?
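A further variation (my sketch, not part of the answer above): if you want to keep a short page load timeout but still work with whatever has rendered, you can catch the exception and ask the browser to stop loading; window.stop() is a standard DOM call.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.set_page_load_timeout(10)
try:
    driver.get("https://www.website2.com")
except TimeoutException:
    # Halt the in-flight load (ads, trackers, etc.) and continue
    # with whatever portion of the page has already rendered.
    driver.execute_script("window.stop();")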

Python: how can I print all the source code using Selenium?

driver.page_source doesn't return all the source code. It prints some parts of the code in detail, but a big part of the code is missing. How can I fix this?
This is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def htmlToLuna():
    url = 'https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A'
    driver = webdriver.Chrome('C:\\Python27\\chromedriver\\chromedriver.exe')
    driver.get(url)
    web = open('web.txt', 'w')
    web.write(driver.page_source)
    print driver.page_source
    web.close()

print htmlToLuna()
Here is a simple piece of code: all it does is open the URL, get the length of the page source, wait five seconds, and get the length of the page source again.
from selenium import webdriver
import time

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    initial = len(browser.page_source)
    print(initial)
    time.sleep(5)
    new_source = browser.page_source
    print(len(new_source))
see the output:
15722
48800
You see that the length of the page source increases after a wait? You must make sure the page is fully loaded before getting the source. But this is not a proper implementation, since it blindly waits.
Here is a nicer way to do this: the browser will wait until the element of your choice is found. The timeout is set to 10 seconds.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    try:
        # 10 seconds delay
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '.CodeMirror > div:nth-child(1) > textarea:nth-child(1)')))
        print("Result:")
        print(len(browser.page_source))
    except TimeoutException:
        print("Your exception message here!")
The output: Result: 52195
Reference:
https://stackoverflow.com/a/26567563/7642415
http://selenium-python.readthedocs.io/locating-elements.html
Hold on! Even that won't guarantee you get the full page source, since individual elements are loaded dynamically. If the browser finds the element, it moves on. So make sure you pick the proper element to confirm the page has loaded fully.
P.S. Mine is Python 3 and the webdriver is in my environment PATH, so my code needs to be modified a bit to work on Python 2.x versions. I guess only the print statements need to change.

Intercept when url changes before the page is completely loaded

Is it possible to catch the event when the url is changed inside my browser using selenium?
Here is my scenario:
I load my website test.com
After all the static files are loaded, when one of the JS files executes, I am redirected (not sure how) to another page, redirect-one.test.com/blah
My browser gets the url redirect-one.test.com/blah and gets a 307 response to go to redirect-two.test.com/blahblah
Here my browser receives a final 302 to go to final.test.com/
The page of final.test.com/ is loaded and at the end of this, selenium enables me to search for elements and so on...
I'd like to be able to intercept (and time the moment it happens) each time I am redirected.
After that, I still need to do some other steps for which selenium is more suitable:
Enter my username and password
Test some functionalities
Log out
Here is a sample of how I tried to intercept the first redirect:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait

def url_contains(url):
    def check_contains_url(driver):
        return (url in driver.current_url)
    return check_contains_url

driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.FIREFOX)
driver.get("http://test.com/")
try:
    url = "redirect-one.test.com"
    first_redirect = WebDriverWait(driver, 20).until(url_contains(url))
    print("found first redirect")
finally:
    print("move on to the next redirect....")
Is this even possible using selenium?
I cannot change the behavior of the website and the reason it is built like this is because of an SSO mechanism I cannot bypass.
I realize I specified python but I am open to tools in other languages.
Selenium is not the tool for this. All the redirects that the browser encounters are handled by the browser in a way that Selenium does not allow you to check.
You can perform the checks using urllib2, or if you prefer a sane interface, using requests.
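As a sketch of that suggestion (using the question's placeholder URL): requests follows redirects by default and records each intermediate hop in response.history, so you can see every 307/302 along the way.

import requests

# requests follows redirects by default; each intermediate response
# (e.g. the 307 and 302 hops from the question) is kept in .history.
response = requests.get("http://test.com/")
for hop in response.history:
    print(hop.status_code, hop.url)
print(response.status_code, response.url)  # the final page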

How to use find_element_by_link_text() properly to not raise NoSuchElementException?

I have a HTML code like this:
<div class="links nopreview"><span><a class="csiAction"
href="/WebAccess/home.html#URL=centric://REFLECTION/INSTANCE/_CS_Data/null">Home</a></span> • <span><span><a class="csiAction"
href="/WebAccess/home.html#URL=centric://SITEADMIN/_CS_Site">Setup</a></span> • </span><span><a
title="Sign Out" class="csiAction csiActionLink">Sign Out</a></span></div>
I would like to click on the link that has the text Home. As this Home link appears after login, I have a code like this:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
browser = webdriver.Firefox() # Get local session of firefox
browser.get("http://myServer/WebAccess/login.html") # Load App page
elem = browser.find_element_by_name("LoginID") # Find the Login box
elem.send_keys("Administrator")
elem = browser.find_element_by_name("Password") # Find the Password box
elem.send_keys("Administrator" + Keys.RETURN)
#try:
elem = browser.find_element_by_link_text("Home")
elem.click()
The part up to the login works great. However, the second-to-last line is problematic:
elem = browser.find_element_by_link_text("Home")
It raises this NoSuchElementException, even though the Home link is there, as you can see from the HTML code:
raise exception_class(message, screen, stacktrace)
NoSuchElementException: Message: u'Unable to locate element: {"method":"link text","selector":"Home"}'
Any guidance as to what I am doing wrong, please?
Have you tried adding an implicit wait to this, so that it waits instead of running too quickly?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
browser = webdriver.Firefox() # Get local session of firefox
browser.implicitly_wait(10) #wait 10 seconds when doing a find_element before carrying on
browser.get("http://myServer/WebAccess/login.html") # Load App page
elem = browser.find_element_by_name("LoginID") # Find the Login box
elem.send_keys("Administrator")
elem = browser.find_element_by_name("Password") # Find the Password box
elem.send_keys("Administrator" + Keys.RETURN)
#try:
elem = browser.find_element_by_link_text("Home")
elem.click()
The implicitly_wait call makes the browser poll until the item is on the page and visible to be interacted with.
The most common issues with NoSuchElementException while the element is there are:
the element is in different window/frame, so you've to switch to it first,
your page is not loaded or your method of page load is not reliable.
Solutions could include:
check if you're using the right frame/window via driver.window_handles,
write a wait wrapper to wait for an element to appear (a sketch follows below),
try XPath instead, like: driver.find_element_by_xpath(u'//a[text()="Foo"]').click(),
use pdb to diagnose your problem more efficiently.
See also: How to find_element_by_link_text while having: NoSuchElement Exception?
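A minimal sketch of such a wait wrapper (wait_for_link is a hypothetical helper name; it assumes the browser object and the "Home" link from the question's code):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def wait_for_link(driver, text, timeout=10):
    # Poll up to `timeout` seconds until a link with the given text
    # is present and clickable, then return it.
    return WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.LINK_TEXT, text)))

wait_for_link(browser, "Home").click()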
Maybe the element you are looking for doesn't exactly match that text string?
I know it can be tricky if it looks right on-screen, but sometimes there are oddities embedded, like simple markup that makes the first character italic:
"<i>H</i>ome" is visually identical to "<em>H</em>ome", but neither matches the plain text "Home".
Edit: after writing the above answer, I studied the question closer and discovered the HTML sample does show "Home" in plain text, but was not visible due to long lines not wrapping. So I edited the OP to wrap the line for readability.
New observation: I noticed that the Sign Out element has a "title" attribute, but the Home link element lacks one; try giving it one and using that.
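If you did add such an attribute, a one-line locator sketch (title="Home" being the hypothetical attribute you would add yourself):

# Locate the link by its (hypothetical) title attribute instead of its text.
browser.find_element_by_xpath('//a[@title="Home"]').click()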
Try adding an implicit wait, so that it waits instead of running too quickly.
Or else you can import time and use time.sleep(25).
