I want to iterate through a set of URLs using Selenium. From time to time I get 'element is not attached to the page document'. Reading a couple of other questions suggested that it's because I am changing the page being looked at. But I am not satisfied with that explanation, since:
for url in urlList:
    driver.get(url)
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '//div/div')))
    # ^ WebDriverWait should have taken care of it
    myString = driver.find_element_by_xpath('//div/div').get_attribute("innerHTML")
    # ^ Error occurs here
    # Then I call this function to go through other elements given other conditions not shown
    if myString:
        getMoreElements(driver)
But if I add a delay like this:
for url in urlList:
    driver.get(url)
    time.sleep(5)  # <<< IT WORKS, BUT WHY?
    element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '//div/div')))
    myString = driver.find_element_by_xpath('//div/div').get_attribute("innerHTML")  # Error occurred here
I feel I am hiding the problem by adding the delay right there. I have implicitly_wait set to 30 s and set_page_load_timeout set to 90 s, which should have been sufficient. So why am I still forced to add what looks like a useless time.sleep?
Did you try the XPath //div/div manually in the dev tools to see how many divs are found on the page? I suspect there are many. That makes your explicit wait below very easy to satisfy: probably within a second of driver.get(), Selenium can find such a div and your wait ends.
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '//div/div')))
Consider the following possibility:
Because of that explicit wait issue, the page has not finished loading when the wait returns; more and more //div/div elements are still being rendered into the page at the moment you ask Selenium to find one such div and interact with it.
Now think about the odds that the first div Selenium finds is never deleted or moved to another DOM node.
How likely is that? I'd say the risk is very high, because div is a very common tag in today's web pages and you used such a relaxed XPath that many divs match, any one of which can trigger the 'stale element' issue.
To resolve your issue, use a stricter locator that waits for one specific element, rather than such a hasty XPath that matches a very common element with many instances.
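For example, a minimal sketch with a stricter wait; the id main-content is a hypothetical placeholder for whatever uniquely identifies the element you actually need:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for url in urlList:
    driver.get(url)
    # wait for one specific element instead of any of many matching divs
    target = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div#main-content')))
    myString = target.get_attribute("innerHTML")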
What you observe as element is not attached to the page document is quite possible here.
Analysis:
In your code, while iterating over urlList, we open a URL and then wait for the WebElement with XPath //div/div using the ExpectedConditions clause presence_of_element_located, which does not necessarily mean that the element is visible or clickable.
Hence, when you next invoke driver.find_element_by_xpath('//div/div').get_attribute("innerHTML"), the reference from the previous search/find_element may no longer be valid.
Solution:
The solution to your question would be to change the ExpectedConditions clause from presence_of_element_located to element_to_be_clickable, which checks that the element is visible and enabled, such that you can even click it.
Code Block:
Your optimized code block may look like:
for url in urlList:
    driver.get(url)
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, '//div/div')))
    myString = driver.find_element_by_xpath('//div/div').get_attribute("innerHTML")
Your other solution:
Your other solution works because you are covering up Selenium's work with time.sleep(5), which is not part of best practices.
Related
I am pretty new to web-scraping...
For example, here is part of my code:
labels = driver.find_elements(By.CLASS_NAME, 'form__item-checkbox-label.placeFinder-search__checkbox-label')
checkboxes = driver.find_elements(By.CLASS_NAME, 'form__item-checkbox-input.placeFinder-search__checkbox-input')
boxes = zip(labels, checkboxes)
time.sleep(3)
for label, checkbox in boxes:
    if checkbox.is_selected():
        label.click()
Here is another example:
driver.get(product_link)
time.sleep(3)
button = driver.find_element(By.XPATH, '//*[@id="tab-panel__tab--product-pos-search"]/h2')
time.sleep(3)
button.click()
And I am scraping through, let's say, hundreds of products. 90% of the time it works fine, but occasionally it gives errors like couldn't locate the element, or something is not clickable, etc. But all these product pages are built the same. Moreover, if I just re-run the code on the product that resulted in the error, most of the time from the 2nd or 3rd attempt I will be able to scrape the data and will not get the error again.
Why does it happen? The code stays the same, the web page stays the same. What is causing the error when it happens? The only thing that comes to my mind is that the Internet connection sometimes gets behind the code and the program is unable to see the elements it is looking for. But as you can see, I have added time.sleep() and it does not always help.
How can this be avoided? It is really annoying to be forced to stay in front of the monitor all day just to supervise and re-run the code. I mean, I guess I could just put the scrape function inside a try/except/else block, but I am still wondering why the same code will sometimes work and sometimes return an error on the same page?
In short, Selenium deals with three distinct states of a WebElement:
presence
visible
interactable / clickable
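As a rough sketch, these three states map to three different expected conditions (the locator is a placeholder):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# presence: the element is attached to the DOM, but may not be rendered
wait.until(EC.presence_of_element_located((By.ID, "myElement")))
# visible: the element is attached, rendered, and has a height and width
wait.until(EC.visibility_of_element_located((By.ID, "myElement")))
# interactable / clickable: the element is visible and enabled
wait.until(EC.element_to_be_clickable((By.ID, "myElement")))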
Ideally, to click on any clickable element you need to induce WebDriverWait for the element_to_be_clickable() as follows:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='tab-panel__tab--product-pos-search']/h2"))).click()
Similarly, you can also create a list of the desired elements, waiting for their visibility, and click them one by one, waiting for each of them to be clickable, as follows:
checkboxes = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "form__item-checkbox-input.placeFinder-search__checkbox-input")))
for checkbox in checkboxes:
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable(checkbox)).click()
Welcome to the "dirty" side of web automation. We call these "flaky" tests; in other words, they are fragile. This is the major disadvantage of Selenium WebDriver.
There can be several reasons for a flaky situation:
Network instability: all commands are sent over the network: client -> (Selenium Grid, if used) -> browser driver -> actual browser. Any connection issue can cause a failure.
CSS animations: since commands are executed immediately, animated transitions may cause a failure.
AJAX-style requests or dynamic element changes: if elements load on "load more" or appear only after some action, Selenium may not detect them, or they may still be overlapping.
And, as a last comment, sleep is not a good idea to use; it actually goes against best practices. Instead, use expected conditions to ensure elements are visible and ready, as sketched below.
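For instance, the second snippet from the question could drop both time.sleep(3) calls in favor of a single explicit wait (a sketch, reusing the XPath from the question):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(product_link)
# waits only as long as actually needed, up to 10 seconds
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="tab-panel__tab--product-pos-search"]/h2')))
button.click()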
driver = webdriver.Chrome(service=s)
url = "https://fourminutebooks.com/book-summaries/"
driver.get(url)
page_tabs = driver.find_elements(By.CSS_SELECTOR, "a[class='post_title w4pl_post_title']")
#html = driver.find_elements(By.CSS_SELECTOR, "header[class='entry-header page-header']")
length_page_tabs = len(page_tabs)
in_length = len(page_tabs)
for i in range(length_page_tabs):
    ran = random.randint(0, in_length)
    page_tabs[ran].click()
    driver.execute_script("window.history.go(-1)")
    time.sleep(10)
    # need to get page source of html and then open it to a new file, extract what I want and add it to the email
I am trying to click one of the links, get the HTML code, email it to myself, and then go back a page and repeat. However, after clicking the first random link, the code stops working and instead I get this error.
You have to be very careful when you store a collection of elements in a variable and then iterate over it to perform actions.
page_tabs = driver.find_elements...
All the elements in this case are cached, and any browser action that navigates to another page, refreshes the page, etc. makes all of these cached elements stale. They become out of date and it is no longer possible to interact with them.
So, to avoid stale element reference errors, you have to either prevent any page reloads or re-locate the elements every time the page state has changed.
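A minimal sketch of the re-locating approach, reusing the CSS selector from your code: iterate by index and look the elements up fresh on every pass, so you never hold a reference across a navigation:

count = len(driver.find_elements(By.CSS_SELECTOR, "a[class='post_title w4pl_post_title']"))
for i in range(count):
    # re-locate on every iteration; the previous references went stale after navigating
    page_tabs = driver.find_elements(By.CSS_SELECTOR, "a[class='post_title w4pl_post_title']")
    page_tabs[i].click()
    driver.execute_script("window.history.go(-1)")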
StaleElementReferenceException
StaleElementReferenceException is a type of WebDriverException which is thrown when a reference to an element has gone stale, i.e. the element no longer appears on the HTML DOM of the page.
Some of the possible causes of StaleElementReferenceException include:
You are no longer on the same page, or the page may have refreshed since the element was last located.
The element may have been removed and re-added to the DOM tree since it was located, such as an element being relocated. This happens typically with a JavaScript framework when values are updated and the node is rebuilt.
The element may have been inside an iframe or another context which was refreshed.
This use case
In your use case, you have created a list of WebElements, i.e. page_tabs, using the locator strategy:
page_tabs = driver.find_elements(By.CSS_SELECTOR, "a[class='post_title w4pl_post_title']")
Next, within the loop, whenever you invoke click on page_tabs[ran] you are redirected to a new page, where the elements within the list page_tabs become stale and new elements are loaded.
Moving forward, when you invoke driver.execute_script("window.history.go(-1)") you move back to the main page where the elements of page_tabs were present, and they load again. At this point the list page_tabs still holds the WebElements from the previous search, which have now become stale. Hence, during the second iteration you face StaleElementReferenceException.
Solution
In your use case, to avoid StaleElementReferenceException, as the desired elements are <a> tags, instead of saving the elements you can store their href attributes in a list and invoke get(href) as follows:
driver.get("https://fourminutebooks.com/book-summaries/")
hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[class='post_title w4pl_post_title']")))]
for href in hrefs:
driver.get(href)
print("Placeholder to perform the desired operations on the respective page")
driver.quit()
References
You can find a couple of relevant detailed discussions in:
StaleElementException when iterating with Python
Message: stale element reference: element is not attached to the page document in Python
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document with Selenium and Python
Use driver.execute_script and JavaScript. JavaScript is never stale because it evaluates right away. In other words, if you select an element with Python and interact with it later, there's a decent chance it won't be there anymore. The only way to be sure it's still there is to evaluate it at the moment you interact with it, and the only way to do that is to stay in the browser context.
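For example, reading the innerHTML through JavaScript at the moment of use, so there is no cached Python-side reference that can go stale (a sketch; the selector is a placeholder):

# evaluated inside the browser at call time, so nothing is cached between calls
html = driver.execute_script(
    "var el = document.querySelector('div > div');"
    " return el ? el.innerHTML : null;")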
I'm trying to use Selenium to scrape Google Maps. Unfortunately it's not quite working: the element is not present on page load and is added after clicking on some button, but it seems the element is not always loaded when I look for it. (I'm talking about the carousel items that appear after clicking on a shop or restaurant while doing a specific search.)
I already tried to use the classic Selenium wait options.
Things I tried:
time.sleep()
WebDriverWait(driver, 30).until(EC.visibility_of_element_located(...))
WebDriverWait(driver, 30).until(EC.element_to_be_clickable(...))
WebDriverWait(driver, 60).until(EC.presence_of_all_elements_located(...))
Even with these, the results are random; sometimes I can access the element via Selenium and sometimes I can't.
It seems that even with long waits, Selenium can't always access the web element, even though I can click it and see it.
You can make the web driver wait until the element is clickable. Have you tried that?
element = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.ID, "myElement")))
or a better way of doing this as pointed out here
element = WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, "myXpath")))
I'm trying to find all my subjects in the dashboard of my college website.
I'm using Selenium to do it.
The site is a little slow, so first I wait:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='multiline']")))
then I find all the elements with
course = driver.find_elements_by_xpath("//span[@class='multiline']")
After that, in a for loop, I try to traverse it. The 0th element of "course" works fine and I'm able to click it and go to the web page, but when the loop runs for the second time, i.e. for the 1st element in "course", it gives me the error selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document.
So I tried adding a little bit of wait time using two methods; it still gives me the error:
driver.implicitly_wait(20)
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='multiline']")))
The loop:
for i in course[1::]:
    #driver.implicitly_wait(20)
    #WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='multiline']")))
    print(i)
    i.click()
    driver.implicitly_wait(2)
    driver.back()
a snippet of the website
Thanks in advance
Answering my own question after extensive research
A common technique used for simulating a tabbed UI in a web app is to prepare DIVs for each tab, but only attach one at a time, storing the rest in variables. In this case my code had a reference to an element that is no longer attached to the DOM (that is, one that no longer has an ancestor which is document.documentElement).
When WebDriver throws a stale element exception in this case, even though the element still exists, the reference is lost. You should discard the current reference you hold and replace it, possibly by locating the element again once it is attached to the DOM.
for i in range(len(course)):
    # here you need to find all the elements again, because once we
    # leave the page the references will be lost and we need to find them again
    course = driver.find_elements_by_xpath("//span[@class='multiline']")
    print(course[i].text)
    course[i].click()
    driver.implicitly_wait(2)
    driver.back()
I want to parse a list of prices on a web page using the Selenium WebDriver in Python. So I try to fetch all the DOM elements using this code:
url = 'https://www.google.com/flights/explore/#explore;f=BDO;t=r-Asia-0x88d9b427c383bc81%253A0xb947211a2643e5ac;li=0;lx=2;d=2018-01-09'
driver = webdriver.Chrome()
driver.get(url)
print(driver.page_source)
The problem is that what I get from page_source is different from what I see when inspecting the elements:
<div class="CTPFVNB-f-a">
<div class="CTPFVNB-f-c"></div>
<div class="CTPFVNB-f-d elt="toolbelt"></div>
<div class="CTPFVNB-f-e" elt="result">Here is the difference</div>
</div>
The difference lies inside the CTPFVNB-f-e class. In the inspected DOM, this tag holds all the prices that I want to fetch, but in the result of page_source this part is missing.
Could anyone tell me what is wrong with my code? Or do I need further steps to parse the list of prices?
JavaScript is modifying the page after it loads. As you are printing the page source immediately after opening the page, you get the initial markup before the JavaScript has run.
You can do any one of the following things:
Add a delay: using time.sleep(x) (change the value of x according to your requirements; it is in seconds) (NOT recommended)
Implicit wait: driver.implicitly_wait(x) (again, x is in seconds)
Explicit wait: wait for the HTML element to appear and then get the page source. To learn how to do this, refer to this link. (HIGHLY recommended)
Using an explicit wait is the better option here, as it waits only as long as the element takes to become visible, and thus won't cause any excess delay. And if the page loads slower than expected, an implicit wait won't get you the desired output.
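A sketch of the explicit-wait option applied to this page; the class name CTPFVNB-f-e is taken from the question, and such generated class names may change between builds:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(url)
# wait until the JavaScript-rendered container is visible,
# so the prices are present before reading page_source
WebDriverWait(driver, 30).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "CTPFVNB-f-e")))
print(driver.page_source)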