I am trying to make a scraper that will go through a bunch of links, export each guide as a PDF, and loop through all the guides in the parent folder. It works fine going in, but when I try to go backwards it throws stale element exceptions, even when I make sure to refresh the elements in the code, or refresh the page.
from selenium import webdriver
import time, bs4

browser = webdriver.Firefox()
browser.get('MYURL')
loginElem = browser.find_element_by_id('email')
loginElem.send_keys('LOGIN')
pwdElem = browser.find_element_by_id('password')
pwdElem.send_keys('PASSWORD')
pwdElem.submit()
time.sleep(3)

category = browser.find_elements_by_class_name('title')
for i in category:
    i.click()
    time.sleep(3)
    guide = browser.find_elements_by_class_name('cell')
    for j in guide:
        j.click()
        time.sleep(3)
        soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
        guidetitle = soup.find_all(id='guide-intro-title')
        print(guidetitle)
        browser.find_element_by_link_text('Options').click()
        time.sleep(0.5)
        browser.find_element_by_partial_link_text('Download PDF').click()
        browser.find_element_by_id('download').click()
        browser.execute_script("window.history.go(-2)")
        print("went back")
        time.sleep(5)
        print("waited")
        guide = browser.find_elements_by_class_name('thumb')
        print("refreshed elements")
    print("made it to outer loop")
This happens whether I use a script to move the browser back or the driver.back() method. I can see that it makes it back to the child directory, then waits, and refreshes the elements. But then it can't seem to load the new element to go into the next guide. I found a similar question here on SO, but someone just provided code tailored to the problem instead of explaining it, so I am still confused.
I also know about WebDriverWait, but I am just using sleep for now since I don't fully understand the EC wait conditions. In any case, increasing the sleep time doesn't fix this issue.
A stale element reference exception occurs upon page refresh because of an element UUID change in the DOM.
How to avoid it: always search for an element right before interacting with it.
In your code, you searched for the cells, found them, and stored them in guide. So now guide holds a list of Selenium element UUIDs. But then you loop through the list, and upon each refresh (which I believe happens when you go back), the cells' UUIDs change, so the old ones you have stored are no longer attached to the DOM. When you try to interact with them, Selenium cannot find them in the DOM and throws this exception.
Instead of looping through guide your way, try re-finding the element every time, like:
guide = browser.find_elements_by_class_name('cell')
for j in range(len(guide)):
    browser.find_elements_by_class_name('cell')[j].click()
Note, it looks like category might have a similar problem, so try applying this solution to category as well.
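For example, restructuring both loops to re-find by index might look like this (a sketch based on the question's code; the class names and the history jump are taken from there):

category_count = len(browser.find_elements_by_class_name('title'))
for i in range(category_count):
    # Re-find the category links on every pass; old references go stale.
    browser.find_elements_by_class_name('title')[i].click()
    time.sleep(3)
    guide_count = len(browser.find_elements_by_class_name('cell'))
    for j in range(guide_count):
        # Re-find the guide cells as well, since going back re-renders them.
        browser.find_elements_by_class_name('cell')[j].click()
        time.sleep(3)
        # ... export the PDF as in the question ...
        browser.execute_script("window.history.go(-2)")
        time.sleep(5)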
Hope this helps. Here is a similar issue and a solution.
I am pretty new to web-scraping...
For example here is the part of my code:
labels = driver.find_elements(By.CLASS_NAME, 'form__item-checkbox-label.placeFinder-search__checkbox-label')
checkboxes = driver.find_elements(By.CLASS_NAME, 'form__item-checkbox-input.placeFinder-search__checkbox-input')
boxes = zip(labels, checkboxes)
time.sleep(3)
for label, checkbox in boxes:
    if checkbox.is_selected():
        label.click()
Here is another example:
driver.get(product_link)
time.sleep(3)
button = driver.find_element(By.XPATH, '//*[@id="tab-panel__tab--product-pos-search"]/h2')
time.sleep(3)
button.click()
And I am scraping through, let's say, hundreds of products. 90% of the time it works fine, but occasionally it gives errors like "could not locate the element" or "element is not clickable", etc. But all these product pages are built the same. Moreover, if I just re-run the code on the product that resulted in the error, most of the time it will scrape the data on the 2nd or 3rd try without the error.
Why does it happen? The code stays the same, the web page stays the same. What is causing the error when it happens? The only thing that comes to my mind is that the Internet connection sometimes falls behind the code and the program is unable to see the elements it is looking for... But as you can see I have added time.sleep(), and it does not always help...
How can this be avoided? It is really annoying to be forced to stay in front of the monitor all day just to supervise and re-run the code... I guess I could just wrap the scrape function inside a try/except/else block, but I am still wondering why the same code will sometimes work and sometimes return an error on the same page?
In short, Selenium deals with three distinct states of a WebElement:
presence
visible
interactable / clickable
Ideally, to click on any clickable element you need to induce WebDriverWait for the element_to_be_clickable() as follows:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='tab-panel__tab--product-pos-search']/h2"))).click()
Similarly you can also create a list of desired elements waiting for their visibility and click on them one by one waiting for each of them to be clickable as follows:
checkboxes = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "form__item-checkbox-input.placeFinder-search__checkbox-input")))
for checkbox in checkboxes:
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable(checkbox)).click()
Welcome to the "dirty" side of web automation. We call these "flaky" tests; in other words, they are fragile. This is the major disadvantage of Selenium WebDriver.
There are several possible reasons for flakiness:
Network instability: all commands are sent over the network: client -> (Selenium Grid, if used) -> browser driver -> actual browser. Any connection issue can cause a failure.
CSS animations: commands are executed immediately, so if you have animated transitions, a click can land while an element is still moving and fail.
AJAX requests and dynamic element changes: if elements are loaded via "load more" buttons or appear only after some action, they may not be detected yet, or may still be overlapped by something else.
Finally, sleep is not a good idea to use; it actually goes against best practices. Instead, use expected conditions to ensure elements are visible and ready, as in the sketch below.
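For example, a guarded click with a retry on failure might look like this (a minimal sketch; the XPath is the one from the question above):

from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the element to become clickable instead of sleeping, and
# retry once if the reference goes stale between locating and clicking.
for attempt in range(2):
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
            (By.XPATH, '//*[@id="tab-panel__tab--product-pos-search"]/h2'))).click()
        break
    except (StaleElementReferenceException, TimeoutException):
        if attempt == 1:
            raise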
https://squidindustries.co/checkout
checkout_cc_number = driver.find_element_by_id("number")
checkout_cc_number.send_keys(card_number)
When I try to input information into the card number field, I get an error saying the element could not be located. I tried using time.sleep and driver.implicitly_wait when I first got to the page, but both failed. Any ideas?
The element is in a frame (i.e. a webpage within a webpage). Selenium will look for elements in the page it has loaded and not within frames. That's the problem.
To solve this we just need a bit more code, which will tell Selenium to look in the frame.
The example you've given is several pages deep into a shopping cart, so I'm going to use a much more accessible example instead: the Mozilla guide to iframes.
Here is some code to open that page and then click the CSS button within the frame:
from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get(r"https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe")
time.sleep(5)
browser.switch_to.frame(browser.find_element_by_class_name("interactive"))
css_button = browser.find_element_by_id("css")
css_button.click()
browser.switch_to.default_content()
There are two lines that are important. The first one is:
browser.switch_to.frame(browser.find_element_by_class_name("interactive"))
That finds the frame and then switches to it. Once we have done that, any code that looks for elements will be looking in the frame and not in the page that we navigated to. That is what you need to do to access the number element. In your example the class of the frame is card-fields-iframe, so use that instead of interactive.
The second important line is:
browser.switch_to.default_content()
That reverts the previous line. So now Selenium will be looking for elements within the page that we navigated to. You'll want to do that after interacting with the frame, so that you can continue through the shopping cart.
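Applied to your checkout page, that would look something like this (a sketch; it assumes the frame class card-fields-iframe and the field id number from your example, and card_number is the variable from your code):

# Switch into the card-number iframe, type the number, then switch back.
frame = driver.find_element_by_class_name("card-fields-iframe")
driver.switch_to.frame(frame)
checkout_cc_number = driver.find_element_by_id("number")
checkout_cc_number.send_keys(card_number)
driver.switch_to.default_content()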
Have you tried getting the input element using the DOM? What happens if you do document.getElementById('number')?
I ran into the same issue, and with checkouts, as you mentioned, all the iframe class names are the same. What I did was get all the iframes with the same class name as a list:
iframes = driver.find_elements(By.CLASS_NAME, "card-fields-iframe")
I then switched through the iframes referencing each one by its place in the list. Since there are only four fields in the checkout, the list is only 4 elements long, starting with [0].
driver.switch_to.frame(iframes[0])
number = driver.find_element(By.ID, "number")
if number.is_displayed():
    number.send_keys("4000300040005000")
driver.switch_to.default_content()
It's important to note that switching back to the default content, using driver.switch_to.default_content(), before switching to the next frame was the only way I was able to make this work. The is_displayed() check just verifies whether the element is on the page or not.
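Putting the pieces together, a sweep over all four frames might look like this (a sketch; the field ids other than "number" are hypothetical examples, not taken from the actual checkout):

fields = [("number", "4000300040005000"),   # card number, from the question
          ("name", "J DOE"),                # hypothetical field id
          ("expiry", "12/25"),              # hypothetical field id
          ("verification_value", "123")]    # hypothetical field id
iframes = driver.find_elements(By.CLASS_NAME, "card-fields-iframe")
for frame, (field_id, value) in zip(iframes, fields):
    driver.switch_to.frame(frame)
    element = driver.find_element(By.ID, field_id)
    if element.is_displayed():
        element.send_keys(value)
    # Always return to the top-level document before the next frame.
    driver.switch_to.default_content()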
I am trying to click on a link, scrape data from that webpage, go back, click on the next link, and so on. But I am not able to go back to the previous page for some reason. I observed that I can execute the code to go back if I am outside the loop, and I can't figure out what is wrong with the loop. I tried using driver.back() too, and yet it won't work. Any help is appreciated! TIA
x = 0  # counter
contents = []
for link in soup_level1.find_all('a', href=re.compile(r"^/new-homes/arizona/phoenix/"), tabindex=-1):
    python_button = driver.find_element_by_xpath("//div[@class='clearfix len-results-items len-view-list']//a[contains(@href,'/new-homes/arizona/phoenix/')]")
    driver.execute_script("arguments[0].click();", python_button)
    driver.implicitly_wait(50)
    soup_level2 = BeautifulSoup(driver.page_source, 'lxml')
    a = soup_level2.find('ul', class_='plan-info-lst')
    for names in a.find('li'):
        contents.append(names.span.next_sibling.strip())
    driver.execute_script("window.history.go(-1)")
    driver.implicitly_wait(50)
    x += 1
Some more information about your use case, in terms of:
Selenium client version
WebDriver variant and version
Browser type and version
would have helped us to debug the issue in a better way.
However to go back to the previous page you can use either of the following solutions:
Using back(): Goes one step backward in the browser history.
Usage:
driver.back()
Using execute_script(): Synchronously Executes JavaScript in the current window/frame.
Usage:
driver.execute_script("window.history.go(-1)")
Use case: Internet Explorer
As per @james.h.evans.jr's comment in the discussion driver.navigate().back() blocks when back button triggers a javascript alert on the page, if you are using Internet Explorer, back() may at times not work. This is pretty much expected, as IE navigates back through the history using the COM GoBack() method of the IWebBrowser interface. Given that, if any modal dialogs appear during the execution of the method, the method will block.
You may face similar issues while invoking forward() through the history, and while submitting forms. The GoBack method could be executed on a separate thread, which would involve calling a few not-very-intuitive COM object marshaling functions, e.g. CoGetInterfaceAndReleaseStream() and CoMarshalInterThreadInterfaceInStream(), but there seems to be little we can do about that.
Instead of using
driver.execute_script("window.history.go(-1)")
you can try using
driver.back()
Please be aware that this functionality depends entirely on the underlying driver. It’s just possible that something unexpected may happen when you call these methods if you’re used to the behavior of one browser over another.
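Whichever variant you use, it usually helps to explicitly wait for the previous page to be restored before re-finding elements. A minimal sketch (the locator is borrowed from the question's results list):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.back()
# Wait until the results list is present again before re-finding links,
# so the next find_element call does not race the page load.
WebDriverWait(driver, 10).until(EC.presence_of_element_located(
    (By.XPATH, "//div[@class='clearfix len-results-items len-view-list']")))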
I'm using the Python version of Selenium to iterate through a Select element's options. It works quite well on one of the websites, but fails on another with the error: Message: stale element reference: element is not attached to the page document. I looked it up, of course, but the answers I found didn't work out for me. I use time.sleep() to wait for the page to load, and I can see it being loaded in the browser. I'm not sure what I should do about it.
How it looks in code:
options = Select(driver.find_element_by_xpath("my_element's_xpath")).options
for option in options:
    option.click()
    sleep(5)
First run it works fine, second run I get the error.
Here is the Select element in Dev Tools in Chromium: [screenshot]
I believe it might have something to do with the first select option not having an <option> tag around it, but I'm not sure how to remove it from the DOM.
The code in my program is a bit larger than what I showed, and as J0HN pointed out, it caused the browser to refresh. I solved it with a kind of hack: storing each option's text in a reference list, then iterating through that. Code speaks better than words, so take a look at it below:
options_reference = []
for option in options:
    options_reference.append(option.text)

for option in options_reference:
    option_element = driver.find_element_by_xpath(
        "//*[contains(text(), '" + option + "')]")
    option_element.click()
It can be further improved by narrowing down XPath to option tag only.
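For example (a sketch of that narrowing):

# Restrict the text match to <option> elements, so an unrelated element
# containing the same text cannot be picked up by accident.
option_element = driver.find_element_by_xpath(
    "//option[contains(text(), '" + option + "')]")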
In my case it was enough to add sleep(2) (with from time import sleep) before options = Select(...).
I'm using Selenium WebDriver with Python to find an element and click it. This is the code. I'm passing 'number' to this code's method, and this doesn't work. I can see in the browser that the element is found, but it doesn't click the element.
subIDTypeIcon = "//a[@id='s_%s_IdType']/img" % str(number)
self.driver.find_element_by_xpath(subIDTypeIcon).click()
Whereas, I tried placing the 'self.driver.find_.....' line twice, and to my surprise it works:
subIDTypeIcon = "//a[@id='s_%s_IdType']/img" % str(number)
self.driver.find_element_by_xpath(subIDTypeIcon).click()
self.driver.find_element_by_xpath(subIDTypeIcon).click()
I have the browser opened on a remote server, so there is sometimes a timeout problem.
Is there a proper way to make this work? Why does it work when the same statement is placed twice?
This is a common problem and the main reason to create abstract, per-page helper classes. Instead of blindly finding elements, you usually need a loop which tries to find an element for a couple of seconds, so the browser has time to update the DOM; see the sketch below.
The second version often works because starting to load a new page doesn't invalidate the DOM immediately. That only happens once the remote server has started to send enough of the new document to the browser. You can see this yourself when you use a browser: pages don't become blank the same instant you click on a link. Instead, the old page stays visible for a while.
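A minimal sketch of such a retry loop (the helper name and timeout are arbitrary choices):

import time
from selenium.common.exceptions import NoSuchElementException

def find_with_retry(driver, xpath, timeout=10):
    # Poll for the element until it appears or the timeout expires,
    # giving the browser time to finish updating the DOM.
    end = time.time() + timeout
    while True:
        try:
            return driver.find_element_by_xpath(xpath)
        except NoSuchElementException:
            if time.time() > end:
                raise
            time.sleep(0.5)

With a helper like this, the duplicated find-and-click line in your example becomes a single call: find_with_retry(self.driver, subIDTypeIcon).click()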