set-up
I use Python + Selenium to scrape company info from this site.
Since the website doesn't let me load page URLs directly, I plan to click the next-page arrow element at the bottom of the list, using a while loop with a counter.
the code
browser.get('https://new.abb.com/channel-partners/search#')
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'abb-pagination')))
# start while loop and counter
c = 1
while c < 65:
    c += 1
    # obtain list-of-companies element (the selector is a CSS selector, so By.CSS_SELECTOR, not By.CLASS_NAME)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#PublicWrapper > main > section:nth-child(7) > div:nth-child(2)')))
    # el_css / el_cn are shorthand helpers for finding an element by CSS selector / class name
    resultlist = el_css('#PublicWrapper > main > section:nth-child(7) > div:nth-child(2)')
    # loop over companies in list
    for company in resultlist.find_elements_by_xpath('div'):
        # company name
        name = company.find_element_by_xpath('h3/a/span').text
        # code to capture more company info follows
    # next page arrow element
    next_page_arrow = el_cn('abb-pagination__item--next')
    next_page_arrow.click()
issue
The code captures the company info just fine outside of the while loop, i.e. just the first page.
However, when inserted in the while loop to iterate over the pages, I get the following error: StaleElementReferenceException: stale element reference: element is not attached to the page document (Session info: chrome=88.0.4324.192)
If I step over the error, the resultlist of the subsequent page does seem to get captured, but the loop over the companies in resultlist yields this error.
What to do?
the simplest solution would be to use a short fixed wait (time.sleep):
driver.get('https://new.abb.com/channel-partners/search#')
company_name = []
while True:
    time.sleep(1)
    company_name += [elem.text for elem in wait.until(EC.presence_of_all_elements_located((By.XPATH, '//span[@property="name"]')))]
    # if next page arrow element still available, click, else break while
    if driver.find_elements_by_xpath('//li[@class="abb-pagination__item--next"]/a[contains(@href, "#page")]'):
        wait.until(EC.presence_of_element_located((By.XPATH, '//li[@class="abb-pagination__item--next"]/a'))).click()
    else:
        break
len(company_name)
output:
951
You don't need the counter: you can check whether the next-page arrow URL is still available, so if a page 65, 66, [...] were added, your logic would still work.
The problem here is that the while loop is too fast and the page does not load in time. Alternatively, you could save the first list of company names, click the next arrow, and compare with the new list; if both are the same, wait a little longer until the new list is different from the previous one.
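A minimal sketch of that compare-and-wait idea, assuming the same page and locators as above (the ten-second deadline is an arbitrary choice, not something from the original answer):

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://new.abb.com/channel-partners/search#')

def current_names():
    # Locate the name spans freshly on every call, so no reference can go stale
    return [e.text for e in driver.find_elements(By.XPATH, '//span[@property="name"]')]

previous = current_names()
arrows = driver.find_elements(By.XPATH, '//li[@class="abb-pagination__item--next"]/a')
if arrows:
    arrows[0].click()
    # Poll until the list actually differs from the previous page, up to 10 s
    deadline = time.time() + 10
    while current_names() == previous and time.time() < deadline:
        time.sleep(0.5)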
Related
I'm very new to programming so apologies in advance if I'm not communicating my issue clearly.
Essentially, using Selenium I have created a list of elements on a webpage by finding all the elements with the same class name I'm looking for.
In this case, I'm finding songs, which have the html class 'item-song' on this website.
On the website, there are lots of clickable options for each listed song. I just want to click the title of the song, which opens a popup modal window in which I edit the note attached to the song, then click save, which closes the popup.
I have successfully been able to do that by using what I guess would be called the title’s XPATH 'relative' to the song class.
songs = driver.find_elements(By.CLASS_NAME, "item-song")
songs[0].find_element(By.XPATH, "div[5]/a").click()
# other code that ends by closing popup
This works, hooray! It also works for any other list index that I put in that line of code.
However, it does not work sequentially, or in a for loop.
i.e.
songs[0].find_element(By.XPATH, "div[5]/a").click()
# other code
time.sleep(5) # to ensure the popup has finished closing
songs[1].find_element(By.XPATH, "div[5]/a").click()
Does not work.
for song in songs:
    song.find_element(By.XPATH, "div[5]/a").click()
    # other code
    time.sleep(5)
    continue
Also does not work.
I get a traceback error:
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
After going back to the original page, the song does now say note(1) so I suppose the site has changed slightly. But as far as I can tell, the 'songs' list object and the xpath for the title of the next song should be exactly the same. To verify this, I even tried:
for song in songs:
    print(song)
    print(songs)
    print()
    song.find_element(By.XPATH, "div[5]/a").click()
    # other code
Sure enough, on the first iteration, print(song) matched the first index of print(songs) and on the second iteration, print(song) matches the second index of print(songs). And print(songs) is identical both times. (Only prints twice as the error happens halfway through the second iteration)
Any help is greatly appreciated, I'm stumped!
---------------------------------
Edit: Of course, it would be easier if my songs list could be all the song titles instead of the class 'item-song'; that was what I was trying first. However, I couldn't find anything in the HTML common to the titles that would let me use find_elements to get just the song title elements, as each song has a different title, and there are also other items like videos listed in between the songs.
Through the comments, the solution is to use an iterative loop and an xpath.
songs = driver.find_elements(By.CLASS_NAME, "item-song")
for i in range(1, len(songs) + 1):
    driver.find_element(By.XPATH, "(//*[@class='item-song'])[" + str(i) + "]/div[5]/a").click()
Breaking this down:
this: By.XPATH, "//*[@class='item-song']" is the same as this: By.CLASS_NAME, "item-song". The former is the xpath equivalent of the latter. I did this so we can build a single identification string to the link instead of trying to find elements within elements.
The [" + str(i) + "] is the iteration for the loop (converted to a string so it can be concatenated). If you were to print this you'd see (//*[@class='item-song'])[1] then (//*[@class='item-song'])[2]. That [x] is the ordinal identifier: it means the xth instance of the element in the DOM. The parentheses around the preceding expression ensure the index applies to the whole match; you can sometimes get unexpected matches without them.
The last part /div[5]/a is just the original solution. Doing div[5] isn't great: your link must ALWAYS be inside the 5th div or it will fail, but as I can't see your application I can't comment on another way.
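If the title link has anything stable about it, a less brittle relative locator might look like this; purely illustrative, since the real markup isn't shown, and the 'title' class here is a placeholder:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# assumes the song list page is already loaded
# Hypothetical: anchor on a stable attribute of the link itself instead of div[5]
song = driver.find_element(By.XPATH, "(//*[@class='item-song'])[1]")
title_link = song.find_element(By.XPATH, ".//a[contains(@class, 'title')]")  # placeholder class
title_link.click()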
The original approach throws a StaleElementReferenceException because of the way Selenium stores identified elements.
Once you've identified an element by doing driver.find_elements(By.CLASS_NAME, "item-song"), Selenium essentially captures a reference to it; it doesn't store the identifier you used. Stick a breakpoint in after you identify an element and you'll see something like this:
That image is from Visual Studio, as that's what I have to hand, but you can see it's a GUID on the ID.
Once you change the page that reference is lost.
Repeat the same steps, identify the same object, and the ID is unique every time. This is the same breakpoint, same element, on a second test run:
Page has changed == Selenium can no longer find it == Stale element.
The solution in this answer works because we're not storing an element.
Every action in the loop freshly identifies the element.
..Then add some clever pun about fresh vs stale... ;-)
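Putting it together, a sketch of the whole fresh-lookup loop with an explicit wait instead of time.sleep; the popup-closing step is only hinted at, and the "modal" locator is a placeholder, since the page structure isn't shown:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
# assumes the song list page is already loaded

count = len(driver.find_elements(By.CLASS_NAME, "item-song"))
for i in range(1, count + 1):
    # Re-locate the i-th song title on every pass; never reuse an old reference
    title = wait.until(EC.element_to_be_clickable(
        (By.XPATH, f"(//*[@class='item-song'])[{i}]/div[5]/a")))
    title.click()
    # ... edit the note in the popup, click save ...
    # then wait for the popup to be gone before the next pass (placeholder locator)
    wait.until(EC.invisibility_of_element_located((By.CLASS_NAME, "modal")))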
I want to collect data from website pages with Python and Selenium.
The website is a news site; I have come to the page where the links to the different news articles are listed.
This is my code:
# finding list of news articles
all_links = driver.find_elements_by_css_selector('article.post a')
print(len(all_links))  # I got 10 different articles
for element in all_links:
    print(element.get_attribute('outerHTML'))  # if I print only this, I get 10 different HTML-s
    element.click()  # clicking on the link to go to the specific page
    time.sleep(1)
    # DATES
    date = driver.find_element_by_css_selector('article header span.no-break-text.lite').text
    print(date)
    # everything works for the first element
But I'm getting the error when I try to iterate to the second element.
So, I'm getting good results for the first element in the list, but then I get this:
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=92.0.4515.159)
I have tried putting in time.sleep(4) pauses and adding driver.close() and driver.back() after each iteration, but the error is the same.
What am I doing wrong?
You need to define the list of web elements once again when you are inside the for loop.
Explanation:
The exact problem here is: when you click on the first element, the browser goes to that element's page, and when you come back using
driver.execute_script("window.history.go(-1)"), the other elements become stale in nature (this is how Selenium works), so we have to redefine them again in order to interact with them. Please see below for an illustration:
# finding list of news articles
all_links = driver.find_elements_by_css_selector('article.post a')
print(len(all_links))  # I got 10 different articles
for j in range(len(all_links)):
    # redefine the list inside the loop so the references are fresh
    elements = driver.find_elements_by_css_selector('article.post a')
    print(elements[j].get_attribute('outerHTML'))  # if I print only this, I get 10 different HTML-s
    elements[j].click()  # clicking on the link to go to the specific page
    time.sleep(1)
    # DATES
    date = driver.find_element_by_css_selector('article header span.no-break-text.lite').text
    print(date)
    time.sleep(1)
    # go back to the previous page; driver.back() should work here too
    driver.execute_script("window.history.go(-1)")
    time.sleep(1)
You are facing a classic case of StaleElementReferenceException here.
Initially you have picked a list of elements with
all_links = driver.find_elements_by_css_selector('article.post a')
But once you click the first link and are taken to another page, the previously picked references (pointers) to the web elements located on the initial page become stale, since these elements are no longer present on the new page.
So even if you get back to the initial page, these references are no longer valid, since they have become stale.
To continue you will have to get the links again.
You can do this as following:
# finding list of news articles
all_links = driver.find_elements_by_css_selector('article.post a')
print(len(all_links))  # I got 10 different articles
for i in range(len(all_links)):
    # get all the elements again
    elements = driver.find_elements_by_css_selector('article.post a')
    # get the i-th element from the list and click it
    elements[i].click()  # clicking on the link to go to the specific page
    time.sleep(1)
    # DATES
    date = driver.find_element_by_css_selector('article header span.no-break-text.lite').text
    print(date)
    # get back to the previous page
    driver.execute_script("window.history.go(-1)")
    time.sleep(1)
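An alternative that sidesteps staleness entirely: since these are links, you can collect the href strings up front and visit them with driver.get, so there is no element reference left to go stale. A sketch assuming the same selectors as above:

import time
from selenium import webdriver

driver = webdriver.Chrome()
# assumes the driver is already on the article listing page;
# plain strings cannot go stale, unlike element references
urls = [a.get_attribute('href')
        for a in driver.find_elements_by_css_selector('article.post a')]
for url in urls:
    driver.get(url)
    time.sleep(1)
    date = driver.find_element_by_css_selector('article header span.no-break-text.lite').text
    print(date)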
In my project, I am downloading all the reports by clicking each link written as a "Date". Below is an image of the table.
I have to extract a report for each date listed in the table column "Payment Date". Each date is a link to a report, so I am clicking all the dates one by one to get each report downloaded.
for dt in driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span'):
    dt.click()
    time.sleep(random.randint(5, 10))
So the process is: when I click one date, it downloads the report for that date; then I click the next date to get its report. So I made a for loop to iterate over all the links and get a report for every date.
But it is giving me a stale element exception. After clicking the first date, it is not able to click the next date; I get the error and the code stops.
How can I solve this?
You're getting a stale element exception because the DOM is updating elements in your selection on each click.
An example: on click, a "clicked" token is appended to an element's class. Since the list you've selected contains elements which have changed (the 1st element has a new class), it throws an error.
A quick and dirty solution is to re-perform your query after each iteration. This is especially helpful if the list of values grows or shrinks with clicks.
# Create an anonymous function to re-use
# This function can contain any selector
get_elements = lambda: driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span')

i = 0
while True:
    elements = get_elements()
    # Exit if you're finished iterating
    if not elements or i >= len(elements):
        break
    # This should always work, because the list is re-queried each pass
    elements[i].click()
    # sleep
    time.sleep(random.randint(5, 10))
    # Update your counter
    i += 1
The simplest way to solve it is to get a specific link each time before clicking on it.
links = driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span')
for i in range(len(links)):
    # re-locate the (i+1)-th link right before clicking it (XPath indices are 1-based)
    element = driver.find_element_by_xpath(f'(//*[@id="tr-undefined"]/td[1]/span)[{i + 1}]')
    element.click()
    time.sleep(random.randint(5, 10))
I am trying to create a web scraper and I ran into a problem. I am trying to iterate over the elements on the left side of the widget, and if a name starts with 'a', I want to click on the minus sign and move it to the right side. I managed to find all the elements; however, once the move of an element to the right side is executed, right after that loop I get the following error.
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=80.0.3987.163)
JS widget.
You need to refactor your code. Your code pattern is likely something like this (of course with different id-s, but since you did not include your code or the page source, this is the best I can offer):
container = driver.find_element_by_xpath('//*[@class="window_of_elements"]')
elements = container.find_elements_by_xpath('.//*[@class="my_selected_class"]')
for e in elements:
    minus_part = e.find_element_by_xpath('.//span[@class="remove"]')
    minus_part.click()
When you click the minus_part, the container of your elements is probably getting re-rendered/reloaded and all your previously found elements turn stale.
To bypass this you should try a different approach:
to_be_removed_count = len(driver.find_elements_by_xpath('//*[@class="window_of_elements"]//*[@class="my_selected_class"]'))
for _ in range(to_be_removed_count):
    # re-find the container and the next remaining element on every pass,
    # since the click re-renders the widget and stales old references
    container = driver.find_element_by_xpath('//*[@class="window_of_elements"]')
    target_element = container.find_element_by_xpath('.//*[@class="my_selected_class"]')
    minus_part = target_element.find_element_by_xpath('.//span[@class="remove"]')
    minus_part.click()
So basically you should:
find out how many elements need to be clicked
in a for loop, find and click them one by one; a sketch applying this to the original goal follows
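Applied to the original goal of moving every entry whose name starts with 'a', the same idea becomes a while loop that re-queries until no matching entry is left (the class names are placeholders copied from the example above):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# assumes the widget page is already loaded
while True:
    # fresh queries on every pass, because each click re-renders the widget
    container = driver.find_element(By.XPATH, '//*[@class="window_of_elements"]')
    matches = [e for e in container.find_elements(By.XPATH, './/*[@class="my_selected_class"]')
               if e.text.lower().startswith('a')]
    if not matches:
        break
    matches[0].find_element(By.XPATH, './/span[@class="remove"]').click()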
Recently I tried scraping; this time I wanted to go from page to page until I get to the final destination I want. Here's my code:
sub_categories = browser.find_elements_by_class_name("ty-menu__submenu-link")
for sub_category in sub_categories:
    sub_category = str(sub_category.get_attribute("href"))
    if sub_category != 'http://www.lsbags.co.uk/all-bags/view-all-handbags-en/' and sub_category != "None":
        browser.get(sub_category)
        print("Entered: " + sub_category)
        product_titles = browser.find_elements_by_class_name("product-title")
        for product_title in product_titles:
            final_link = product_title.get_attribute("href")
            if str(final_link) != "None":
                browser.get(str(final_link))
                print("Entered: " + str(final_link))
                # DO STUFF
I already tried the wait and the wrapper (the try/except one) solutions from here, but I don't get why it's happening. I have an idea why: the browser gets lost when it finishes one item, right?
I don't know how else to express this idea. In my mind I imagine it like this:
TIMELINE:
*PAGE 1 is within a loop, ALL THE URLS WITHIN IT IS PROCESSED ONE BY ONE
*The first url of PAGE 1 is caught. Thus browser.get() turns the page to PAGE 2
*PAGE 2 has the final list of links I want to evaluate, so another loop here
to get that url, and within that url #DO STUFF
*After #DO STUFF get to the second url of PAGE 2 and #DO STUFF again.
*Let's assume PAGE 2 has only two urls, so it finished looping, so it goes back to PAGE 1
*The second url of PAGE 1 is caught...
and so on... I think I have expressed this idea at some point in my code; I don't know what part is not working and is thus returning the exception.
Any help is appreciated. Thanks!
The problem is that after navigating to the next page, but before that page has actually loaded, Selenium finds the elements you are waiting for; however, these are the elements of the page you are coming from. After the next page loads, those elements are no longer connected to the DOM but are replaced by the ones of the new page, yet Selenium goes on to interact with the elements of the former page, which are no longer attached to the DOM, giving a StaleElementReferenceException.
After you press the link for the next page, you have to wait until the next page is completely loaded before you start your loop again.
So you have to find something on the page, other than the elements you are going to interact with, that tells you the next page is loaded.
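One common way to do that is to keep a reference to any element of the old page and wait for it to go stale, then wait for something on the new page. A sketch with placeholder locators, not code from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
# assumes the listing page is already loaded

old_marker = driver.find_element(By.TAG_NAME, 'html')  # any element of the old page
driver.find_element(By.CLASS_NAME, 'product-title').click()  # or browser.get(next_url)
wait.until(EC.staleness_of(old_marker))  # the old page is gone
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'product-title')))  # the new page is there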