Stale Element Reference Exception when web scraping using Python Selenium

I am trying to scrape a website and have written a working script. The problem is that after the script has been running for some time I get a stale element reference exception telling me that the referenced element (the href) can no longer be found.
Here I am extracting the links to all products on each page of the website and saving them in a list, which I later use to extract the data from each link.
for a in tqdm(range(1, pages + 1)):
    time.sleep(3)
    link = driver.find_elements_by_xpath('//div[@class="col-xs-4 animation"]/a')
    for b in link:
        x = b.get_attribute("href")
        print(x)
        LINKS.append(x)
    time.sleep(3)
    # next page
    try:
        WebDriverWait(driver, delay).until(ec.presence_of_element_located((By.XPATH, '//ul[@class="pagination-sm pagination"]')))
        next_page = driver.find_element_by_xpath('.//li[@class="prev"]')
        driver.execute_script("arguments[0].click()", next_page)
    except NoSuchElementException:
        pass
Any idea how to fix this? The error occurs at random: sometimes the links are found and sometimes they are not, which confuses me. It only happens once the script has been scraping for a long time.
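For what it's worth, one common mitigation for this kind of intermittent staleness is to re-find the elements and retry the attribute read whenever the exception is raised. Below is only a sketch of that pattern, reusing the question's XPath; the helper name and the retry count are assumptions, not part of the original script.

import time
from selenium.common.exceptions import StaleElementReferenceException

def collect_hrefs(driver, xpath, retries=3):
    # Hypothetical helper: re-find the elements on every attempt so the
    # references are fresh, and read every href before the page changes again.
    for attempt in range(retries):
        try:
            return [a.get_attribute("href") for a in driver.find_elements_by_xpath(xpath)]
        except StaleElementReferenceException:
            time.sleep(1)
    return []

# e.g. LINKS.extend(collect_hrefs(driver, '//div[@class="col-xs-4 animation"]/a'))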

Related

Stale element reference: element is not attached to the page document when looping through pages

I am trying to go through each product in my catalogue and print the product image links. The following is my code.
product_links = driver.find_elements_by_css_selector(".product-link")
for link in product_links:
    driver.get(link.get_attribute("href"))
    images = driver.find_elements_by_css_selector("#gallery img")
    for image in images:
        print(image.get_attribute("src"))
    driver.back()
But I am receiving the error selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document. I think this is happening because when we go back to the catalogue page, the page gets loaded again and the element references in product_links become stale.
How can we avoid this issue? Is there a better solution for this?
I ran into a similar problem, and here's how I solved it. Basically, you have to refresh the page and re-establish the list of links each time you return to it. Of course, doing this you can't use a for loop, because your objects go stale each time.
Unfortunately I can't test this, as I don't have access to your actual URL, but this should be close:
def get_prod_page(link):
    driver.get(link.get_attribute("href"))
    images = driver.find_elements_by_css_selector("#gallery img")
    for image in images:
        print(image.get_attribute("src"))
    driver.back()

counter = 0
link_count = len(driver.find_elements_by_css_selector(".product-link"))
while counter < link_count:
    # Re-find the links on every pass so none of the references are stale.
    product_links = driver.find_elements_by_css_selector(".product-link")[counter:]
    get_prod_page(product_links[0])
    counter += 1
    driver.refresh()
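An alternative sketch of the same idea, using only the selectors from the question (everything else here is illustrative): collect the href strings up front, so no element reference has to survive a page navigation.

# Gather all catalogue hrefs first, then visit them one by one.
product_hrefs = [link.get_attribute("href")
                 for link in driver.find_elements_by_css_selector(".product-link")]

for href in product_hrefs:
    driver.get(href)
    for image in driver.find_elements_by_css_selector("#gallery img"):
        print(image.get_attribute("src"))
# No driver.back() is needed, because only plain strings are carried between pages.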

Selenium: check if a page has loaded by using the title tag in Python

I'm just trying to figure out how to check, in Selenium using Python, whether the title of a specific page has loaded, and if it hasn't, go back to the beginning of the loop (continue).
However, it's not working at all when I get a 500 error or the page does not load. Why is this not working as it should? Is there a different way to do this?
try:
    element = WebDriverWait(self.driver, 2).until(
        EC.title_is("This is my title")
    )
except TimeoutException as ex:
    print(ex.message)
    self.driver.quit()
    continue
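Note that continue only has an effect inside a loop, so a pattern like the one in the question has to be structured roughly like the sketch below; the URL list and the title text here are assumptions, not the poster's actual values. Also, quitting the driver before continue would break the next iteration.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
urls_to_check = ["http://example.com/a", "http://example.com/b"]  # hypothetical

for url in urls_to_check:
    driver.get(url)
    try:
        WebDriverWait(driver, 2).until(EC.title_is("This is my title"))
    except TimeoutException:
        # The expected title never appeared (e.g. a 500 error page), so skip this URL.
        continue
    # ... scrape the page here ...

driver.quit()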

Selenium Python StaleElementReferenceException

I'm trying to download all the PDFs on a webpage using Selenium for Python with Chrome as the browser, but every time the session ends with this message:
StaleElementReferenceException: stale element reference: element is not attached to the page document
(Session info: chrome=52.0.2743.116)
(Driver info: chromedriver=2.22.397933
This is the code:
def download_pdf(self):
    current = self.driver.current_url
    lista_link_temp = self.driver.find_elements_by_xpath("//*[@href]")
    for link in lista_link_temp:
        if "pdf+html" in str(link.get_attribute("href")):
            tutor = link.get_attribute("href")
            self.driver.get(str(tutor))
            self.driver.get(current)
Please help me. I've already tried lambdas, implicit waits and explicit waits.
Thanks
As soon as you call self.driver.get() in your loop, all the other elements in the list of elements will become stale. Try collecting the href attributes from the elements first, and then visiting them:
def download_pdf(self):
    current = self.driver.current_url
    lista_link_temp = self.driver.find_elements_by_xpath("//*[@href]")
    pdf_hrefs = []
    # You could do this part with a single-line list comprehension too, but it would be really long...
    for link in lista_link_temp:
        href = str(link.get_attribute("href"))
        if "pdf+html" in href:
            pdf_hrefs.append(href)
    for h in pdf_hrefs:
        self.driver.get(h)
        self.driver.get(current)
You get a stale element when you search for an element and the page changes/reloads before you do any action on it.
Make sure the page is fully loaded before performing any actions on it.
So first you need to add a condition to wait for the page to be loaded, and maybe check that all requests are done.
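As a sketch of that idea (a generic pattern, not the answerer's own code): document.readyState becomes "complete" once the initial page load has finished, so you can poll it with a WebDriverWait before touching any elements; AJAX requests fired afterwards still need their own explicit waits.

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_page_load(driver, timeout=10):
    # Poll until the browser reports the document as fully loaded.
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )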

python using Selenium to download files

Guys, I need to write a script that uses Selenium to go over the pages on the website and download each page to a file.
This is the website I need to go through, and I want to download all 10 pages of reviews.
This is my code:
import urllib2, os, sys, time
from selenium import webdriver

browser = urllib2.build_opener()
browser.addheaders = [('User-agent', 'Mozilla/5.0')]
url = 'http://www.imdb.com/title/tt2948356/reviews?ref_=tt_urv'
driver = webdriver.Chrome('chromedriver.exe')
driver.get(url)
time.sleep(2)

if not os.path.exists('reviewPages'): os.mkdir('reviewPages')

response = browser.open(url)
myHTML = response.read()
fwriter = open('reviewPages/' + str(1) + '.html', 'w')
fwriter.write(myHTML)
fwriter.close()
print 'page 1 done'

page = 2
while True:
    cssPath = '#tn15content > table:nth-child(4) > tbody > tr > td:nth-child(2) > a:nth-child(11) > img'
    try:
        button = driver.find_element_by_css_selector(cssPath)
    except:
        error_type, error_obj, error_info = sys.exc_info()
        print 'STOPPING - COULD NOT FIND THE LINK TO PAGE: ', page
        print error_type, 'Line:', error_info.tb_lineno
        break
    button.click()
    time.sleep(2)
    response = browser.open(url)
    myHTML = response.read()
    fwriter = open('reviewPages/' + str(page) + '.html', 'w')
    fwriter.write(myHTML)
    fwriter.close()
    time.sleep(2)
    print 'page', page, 'done'
    page += 1
But the program just stops after downloading the first page. Could someone help? Thanks.
So, a few things are causing this.
The first thing I think is causing you issues is this part of your selector:
table:nth-child(4)
When I go to that website, I think you just want:
table >
The second error is the break statement in your except block. This says: when I get an error, stop looping.
So what's happening is that your try/except is not working because your CSS selector is not quite correct, and you fall through to your exception handler, where you are telling it to stop looping.
Instead of that very complex CSS path, try this simpler XPath ('//a[child::img[@alt="[Next]"]]/@href'), which will return the URL associated with the little triangular 'next' button on each page.
Or notice that each page has 10 reviews and the URLs for pages 2 to 10 just give the starting review number, i.e. http://www.imdb.com/title/tt2948356/reviews?start=10 is the URL for page 2. Simply calculate the URL for the next page and stop when it doesn't fetch anything; a rough sketch of that approach follows.
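Here is a sketch of that URL-calculation approach, in the same Python 2 style as the question; the assumption that there are exactly 10 pages of 10 reviews each comes from the question, not from the site.

import urllib2, os

browser = urllib2.build_opener()
browser.addheaders = [('User-agent', 'Mozilla/5.0')]
base = 'http://www.imdb.com/title/tt2948356/reviews'

if not os.path.exists('reviewPages'): os.mkdir('reviewPages')

for page in range(1, 11):
    # Page 1 is ?start=0, page 2 is ?start=10, and so on.
    url = base + '?start=' + str((page - 1) * 10)
    myHTML = browser.open(url).read()
    if not myHTML:
        break
    fwriter = open('reviewPages/' + str(page) + '.html', 'w')
    fwriter.write(myHTML)
    fwriter.close()
    print 'page', page, 'done'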

Scraper: Try skips code in while loop (Python)

I am working on my first scraper and have run into an issue. My scraper accesses a website and saves links from each result page. Now, I only want it to go through 10 pages. The problem comes when the search results have fewer than 10 pages. I tried using a while loop along with a try statement, but it does not seem to work. After the scraper goes through the first page of results, it does not return any links on the successive pages; however, it does not give me an error, and it stops once it reaches 10 pages or hits the exception.
Here is a snippet of my code:
links = []
page = 1
while(page <= 10):
    try:
        # Get information from the propertyInfo class
        properties = WebDriverWait(driver, 10).until(lambda driver: driver.find_elements_by_xpath('//div[@class = "propertyInfo item"]'))
        # For each listing
        for p in properties:
            # Find all elements with a tags
            tmp_link = p.find_elements_by_xpath('.//a')
            # Get the link from the second element to avoid error
            links.append(tmp_link[1].get_attribute('href'))
        page += 1
        WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('//*[@id="paginador_siguiente"]/a').click())
    except ElementNotVisibleException:
        break
I really appreciate any pointers on how to fix this issue.
You are explicitly catching the ElementNotVisibleException exception and stopping on it. This way you won't see any error message. The error is probably in this line:
WebDriverWait(driver, 10).until(lambda driver:
    driver.find_element_by_xpath('//*[@id="paginador_siguiente"]/a').click())
I assume the lambda here is meant to be a test that is run until it succeeds, so it shouldn't perform an action like a click. I actually believe you don't need to wait here at all; the page should already be fully loaded, so you can just click the link:
driver.find_element_by_xpath('//*[@id="paginador_siguiente"]/a').click()
This will either pass to the next page (and the WebDriverWait at the start of the loop will wait for it) or raise an exception if no next link is found.
Also, you'd better minimize the try ... except scope; this way you won't capture something unintentionally. E.g. here you only want to surround the next-link code, not the whole loop body:
# ...
while(page <= 10):
    # Scrape this page
    properties = WebDriverWait(driver, 10).until(...)
    for p in properties:
        # ...
    page += 1
    # Try to pass to next page
    try:
        driver.find_element_by_xpath('//*[@id="paginador_siguiente"]/a').click()
    except ElementNotVisibleException:
        # Break if no next link is found
        break
