I've looked all over and found some very nice solutions to wait for page load and window load (and yes, I know there are a lot of Stack Overflow questions covering each in isolation). However, there doesn't seem to be a way to combine these two waits effectively.
Taking my primary inspiration from the end of this post, I came up with these two functions to be used in a "with" statement:
# This is within a class where the browser is at self.driver
@contextmanager
def waitForLoad(self, timeout=60):
    oldPage = self.driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(self.driver, timeout).until(staleness_of(oldPage))

@contextmanager
def waitForWindow(self, timeout=60):
    oldHandles = self.driver.window_handles
    yield
    WebDriverWait(self.driver, timeout).until(
        lambda driver: len(oldHandles) != len(self.driver.window_handles)
    )
# example
button = self.driver.find_element_by_xpath(xpath)
if button:
    with self.waitForLoad():
        button.click()
These work great in isolation. However, they can't be combined, due to the way each one checks its own internal condition. For example, this code will fail to wait until the second page has loaded, because the act of switching windows causes "oldPage" to become stale:
@contextmanager
def waitForWindow(self, timeout=60):
    with self.waitForLoad(timeout):
        oldHandles = self.driver.window_handles
        yield
        WebDriverWait(self.driver, timeout).until(
            lambda driver: len(oldHandles) != len(self.driver.window_handles)
        )
        self.driver.switch_to_window(self.driver.window_handles[-1])
Is there some Selenium method that will allow these to work together?
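For what it's worth, the direction I've been sketching (rough and untested) is to stop relying on the old page's staleness when a new window opens: wait for a handle that wasn't there before, switch to it, and then wait on the new document's readyState instead:

@contextmanager
def waitForNewWindow(self, timeout=60):
    oldHandles = set(self.driver.window_handles)
    yield
    # wait until a handle appears that was not present before the click
    WebDriverWait(self.driver, timeout).until(
        lambda driver: set(driver.window_handles) - oldHandles
    )
    newHandle = (set(self.driver.window_handles) - oldHandles).pop()
    self.driver.switch_to_window(newHandle)
    # wait on the new window's own readiness instead of staleness of oldPage
    WebDriverWait(self.driver, timeout).until(
        lambda driver: driver.execute_script('return document.readyState') == 'complete'
    )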
Related
I am trying to use some explicit waits with Undetected Chromedriver (v2). Rather than executing the statements once the element has loaded, it appears to pause until the wait time expires.
When I use the normal Selenium chromedriver everything works as expected (the "opt-in" is closed in 1-2 seconds), and when I use sleeps instead of waits the statements execute much more quickly.
Can anyone see the problem?
Here's the code:
class My_Chrome(uc.Chrome):
    def __del__(self):
        pass

options = uc.ChromeOptions()
arguments = [
    '--log-level=3', '--no-first-run', '--no-service-autorun', '--password-store=basic',
    '--start-maximized',
    '--window-size=1920, 1080',
    '--credentials_enable_service=False',
    '--profile.password_manager_enabled=False,'
    '--add_experimental_option("detach", True)'
]
for argument in arguments:
    options.add_argument(argument)
driver = My_Chrome(options=options)
wait = WebDriverWait(driver, 20)
driver.get('https://www.oddschecker.com')
try:
    opt_in = wait.until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Not Now']/..")))
    VirtualClick(driver, opt_in)
    current_time('Closing opt-in')
except:
    pass
I am trying to wait for a page to fully load with Selenium, and tried to use code from other answers here: https://stackoverflow.com/a/30385843/8165689 (the 3rd method in that answer, which uses Selenium's 'staleness_of' condition), originally from: http://www.obeythetestinggoat.com/how-to-get-selenium-to-wait-for-page-load-after-a-click.html
However, I think I have some problem with the Python yield keyword specifically in this code. Based on the above, I have the method:
@contextmanager
def wait_for_page_load(driver, timeout=30):
    old_page = driver.find_element_by_tag_name('html')
    yield WebDriverWait(driver, timeout).until(staleness_of(old_page))
This doesn't get called by Python; a breakpoint shows it is skipped.
I also have the same problem with the apparent original code:
@contextmanager
def wait_for_page_load(driver, timeout=30):
    old_page = driver.find_element_by_tag_name('html')  # up to here with the decorator, the function is called OK; with 'yield' it is NOT called
    yield
    WebDriverWait(driver, timeout).until(staleness_of(old_page))
But if I delete everything from the yield statement onwards, the function does at least get called:
@contextmanager
def wait_for_page_load(driver, timeout=30):
    old_page = driver.find_element_by_tag_name('html')
Does anyone know how I should write the yield statement? I'm not experienced with yield, but it looks like Python has to yield something, so perhaps the original code, which seems to have yield on a line of its own, has a problem?
I think you might have missed out the expected conditions here. Please try this code and see if it helps.
from selenium.webdriver.support import expected_conditions as EC

def wait_for_page_load(driver, timeout=30):
    old_page = driver.find_element_by_tag_name('html')
    yield WebDriverWait(driver, timeout).until(EC.staleness_of(old_page))
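One caveat worth adding: a function containing yield is a generator, so its body never runs when you merely call it — a breakpoint inside will look skipped until something iterates it. To drive it from a with statement it needs the @contextmanager decorator, and if the wait should happen after the action, the yield has to come before the WebDriverWait call. A minimal usage sketch (the 'Next' link is just a placeholder):

from contextlib import contextmanager

@contextmanager
def wait_for_page_load(driver, timeout=30):
    old_page = driver.find_element_by_tag_name('html')
    yield  # the body of the with block runs here
    WebDriverWait(driver, timeout).until(EC.staleness_of(old_page))

# nothing in wait_for_page_load executes until the with block is entered
with wait_for_page_load(driver):
    driver.find_element_by_link_text('Next').click()  # placeholder action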
I am using Firefox, and my code is working just fine, except that it's very slow. I prevent images from loading, just to speed things up a bit:
firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
driver = webdriver.Firefox(firefox_profile=firefox_profile)
but the performance is still slow. I have tried going headless, but unfortunately it did not work, as I receive NoSuchElement errors. So is there any way to speed up Selenium web scraping? I can't use Scrapy, because this is a dynamic web scrape: I need to click through the next button several times until no clickable buttons exist, and I need to click pop-up buttons as well.
Here is a snippet of the code:
a = []
b = []
c = []
d = []
e = []
f = []

while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        time.sleep(2)
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        time.sleep(2)
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for j in B:
            b.append(j.text)
        time.sleep(3)
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for k in C:
            c.append(k.text)
        time.sleep(3)
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for l in D:
            d.append(l.text)
        time.sleep(3)
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for m in E:
            e.append(m.text)
    try:
        time.sleep(2)
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        time.sleep(2)
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):  # 'as e' would shadow the e list above
        break
Here is an edited version, but the speed does not improve:
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ui_bubble_rating bubble_")]')))
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"recommend-titleInline noRatings")]')))
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for i in B:
            b.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"noQuotes")]')))
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for i in C:
            c.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ratingDate")]')))
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for i in D:
            d.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"partial_entry")]')))
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for i in E:
            e.append(i.text)
    try:
        # time.sleep(2)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"nav next taLnk ui_button primary")]')))
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"taLnk ulBlueLinks")]')))
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        break
For dynamic webpages (pages rendered or augmented using JavaScript), I suggest you use scrapy-splash.
Not that you can't use Selenium, but for scraping purposes scrapy-splash is a better fit.
Also, if you do have to use Selenium for scraping, a good idea would be to run headless. You can also use Chrome; some benchmarks I ran a while back showed headless Chrome to be faster than headless Firefox.
Also, rather than sleep, it is better to use WebDriverWait with an expected condition, as it waits only as long as necessary, whereas a thread sleep forces you to wait the full specified time.
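As a rough illustration of both points (the URL and selector below are placeholders, not taken from the question), a headless Chrome setup with an explicit wait might look like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')                            # no visible browser window
options.add_argument('--blink-settings=imagesEnabled=false')  # skip image downloads

driver = webdriver.Chrome(options=options)
driver.get('https://example.com/reviews')                     # placeholder URL

# waits only as long as needed, up to 10 s, instead of a fixed sleep
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.review-container'))
)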
Edit: adding this as an edit while trying to answer @QHarr, as the answer is pretty long.
It is a suggestion to evaluate scrapy-splash.
I gravitate towards Scrapy because of the whole ecosystem around scraping: middleware, proxies, deployment, scheduling, scaling. So basically, if you are looking at some serious scraping, Scrapy might be the better starting point. That suggestion comes with a caveat, though.
When it comes to speed, I can't give any objective answer, as I have never contrasted and benchmarked Scrapy against Selenium on a project of any size.
But I would assume you would get more or less comparable times on a serial run if you are doing the same things, since in most cases the time is spent waiting for responses.
If you are scraping any considerable number of items, the speed-up generally comes from parallelising the requests, and, where rendering is not necessary, from falling back to plain HTTP requests and responses rather than rendering the page in a user agent.
Also, anecdotally, some in-page actions can be performed using the underlying HTTP request/response. So if time is a priority, you should look to get as much as possible done with plain HTTP requests.
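For example (a hypothetical sketch: the endpoint and parameters are invented, you would find the real ones in the browser's dev tools network tab), a "next" button often just fires a request you can replay directly:

import requests

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'

# hypothetical paginated endpoint discovered via the network tab
for offset in range(0, 100, 10):
    resp = session.get('https://example.com/reviews',  # placeholder URL
                       params={'offset': offset})
    resp.raise_for_status()
    # parse resp.text here (e.g. with BeautifulSoup) instead of rendering it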
In the Selenium docs we can see that we must set some timeout for a wait.
For example, code from the docs:
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))
I wonder, must we always set some timeout? Or is there a method that will wait until all of the AJAX code has been downloaded, and only after that will the driver interact with web elements? (I mean without any fixed timeout: it just loads everything and only then starts interacting.)
Hopefully this code will help you. This is how I solved this issue.
# Check with jQuery if it has any outstanding ajax
def ajax_complete(self):
    try:
        return 0 == self.execute_script("return jQuery.active")
    except:
        pass

# Create a method to wait for ajax to complete
driver.wait_for_ajax = lambda: WebDriverWait(driver, 10).until(ajax_complete, "")
driver.implicitly_wait(30)
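Once attached, you would call it after any action that triggers AJAX; a quick usage sketch (the element id is just a placeholder):

driver.find_element_by_id('load-more').click()  # placeholder element
driver.wait_for_ajax()  # blocks until jQuery.active == 0, up to the 10 s timeout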
First of all, I created several functions to use instead of the default "find_element_by_..." methods, plus a login() function to create the "browser". This is how I use them:
def login():
    browser = webdriver.Firefox()
    return browser

def find_element_by_id_u(browser, element):
    obj = WebDriverWait(browser, 10).until(
        lambda browser: browser.find_element_by_id(element)
    )
    return obj

#########
driver = login()
find_element_by_id_u(driver, 'the_id')
Now I run such tests through Jenkins (and launch them on a virtual machine). If I get a TimeoutException, the browser session is not killed, so I have to go to the VM manually and kill the Firefox process; Jenkins will not stop its job while the web browser process is active.
So I faced this problem, and I expect it can be resolved with exception handling.
I tried to add this to my custom functions, but it's not clear where exactly the exception occurred. Even when I get a line number, it takes me to my custom function, not to the place where it was called:
def find_element_by_id_u(browser, element):
    try:
        obj = WebDriverWait(browser, 1).until(
            lambda browser: browser.find_element_by_id(element)
        )
        return obj
    except TimeoutException, err:
        print "Timeout Exception for element '{elem}' using find_element_by_id\n".format(elem=element)
        print traceback.format_exc()
        browser.close()
        sys.exit(1)

#########
driver = login()
driver.get(host)
find_element_by_id_u(driver, 'jj_username').send_keys('login' + Keys.TAB + 'passwd' + Keys.RETURN)
This prints the line number of the string "lambda browser: browser.find_element_by_id(element)", which is useless for debugging. In my case I have nearly 3000 lines, so I need a proper line number.
Can you please share your experience with me?
PS: I divided my program into a few scripts, one of which contains only the Selenium part. That's why I need the login() function: to call it from another script and use the returned object there.
Well, after turning it over in my mind for a while, I've found a proper solution.
def login():
    browser = webdriver.Firefox()
    return browser

def find_element_by_id_u(browser, element):
    obj = WebDriverWait(browser, 10).until(
        lambda browser: browser.find_element_by_id(element)
    )
    return obj

#########
try:
    driver = login()
    find_element_by_id_u(driver, 'the_id')
except TimeoutException:
    print traceback.format_exc()
    driver.close()
    sys.exit(1)
It was so obvious that I missed it :(