I want to collect data from website pages with Python and Selenium. The website is a news site, and I have reached the page where links to the different news articles are listed.
This is my code:
# finding list of news articles
all_links = driver.find_elements_by_css_selector('article.post a')
print(len(all_links))  # I got 10 different articles

for element in all_links:
    print(element.get_attribute('outerHTML'))  # if I print only this, I get 10 different HTMLs
    element.click()  # clicking on the link to go to the specific page
    time.sleep(1)
    # DATES
    date = driver.find_element_by_css_selector('article header span.no-break-text.lite').text
    print(date)
    # up to here everything works, but only for the first element
But I get an error when I try to iterate to the second element.
So, I get good results for the first element in the list, but then I get this:
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=92.0.4515.159)
I have tried adding time.sleep(4) pauses, as well as driver.close() and driver.back() after each iteration, but the error is the same.
What am I doing wrong?
You need to define the list of web elements once again when you are inside the for loop.
Explanation:
The exact problem here is that when you click on the first element, the browser goes to that element's page, and when you come back using
driver.execute_script("window.history.go(-1)") the other elements become stale (this is how Selenium works), so we have to redefine them in order to interact with them. Please see below for an illustration:
# finding list of news articles
all_links = driver.find_elements_by_css_selector('article.post a')
print(len(all_links))  # I got 10 different articles

for j in range(len(all_links)):
    # redefine the list on every iteration, because the old references are stale
    elements = driver.find_elements_by_css_selector('article.post a')
    print(elements[j].get_attribute('outerHTML'))
    elements[j].click()  # clicking on the link to go to the specific page
    time.sleep(1)
    # DATES
    date = driver.find_element_by_css_selector('article header span.no-break-text.lite').text
    print(date)
    time.sleep(1)
    # go back to the previous page (driver.back() should work here as well)
    driver.execute_script("window.history.go(-1)")
    time.sleep(1)
You are facing a classic case of StaleElementReferenceException here.
Initially you have picked a list of elements with
all_links = driver.find_elements_by_css_selector('article.post a')
But once you click the first link and are taken to another page, the previously picked references (pointers) to the web elements located on the initial page become stale, since those elements are no longer present on the new page.
So even if you go back to the initial page, those references are no longer valid.
To continue, you will have to get the links again.
You can do this as follows:
# finding list of news articles
all_links = driver.find_elements_by_css_selector('article.post a')
print(len(all_links))  # I got 10 different articles

for i in range(len(all_links)):
    # get all the elements again
    elements = driver.find_elements_by_css_selector('article.post a')
    # get the i-th element from the list and click it
    elements[i].click()  # clicking on the link to go to the specific page
    time.sleep(1)
    # DATES
    date = driver.find_element_by_css_selector('article header span.no-break-text.lite').text
    print(date)
    # get back to the previous page
    driver.execute_script("window.history.go(-1)")
    time.sleep(1)
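A different way to sidestep staleness entirely is to collect the href attribute of every link up front and navigate with driver.get() instead of clicking, so no element reference ever has to survive a page change. A minimal sketch, assuming each matched <a> carries a usable href:

# collect the URLs first - plain strings can never go stale
all_links = driver.find_elements_by_css_selector('article.post a')
urls = [link.get_attribute('href') for link in all_links]

for url in urls:
    driver.get(url)  # navigate directly instead of clicking
    date = driver.find_element_by_css_selector('article header span.no-break-text.lite').text
    print(date)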
In my project, I am downloading all the reports by clicking each link, written as a date, in a table.
I have to extract a report for each date listed in the table column "Payment Date". Each date is a link to a report, so I am clicking all the dates one by one to get each report downloaded.
for dt in driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span'):
    dt.click()
    time.sleep(random.randint(5, 10))
So, the process is: when I click one date, it downloads the report for that date; then I click the next date to get its report. I made a for loop to go through all the links and download a report for every date.
But it gives me a stale element exception. After clicking the first date, it is not able to click the next one; I get the error and the code stops.
How can I solve this?
You're getting a stale element exception because the DOM is updating elements in your selection on each click.
An example: on click, a class "clicked" is appended to the element's class attribute. Since the list you selected contains elements that have since changed (the first element has a new class), it throws an error.
A quick and dirty solution is to re-perform your query after each iteration. This is especially helpful if the list of values grows or shrinks with clicks.
# Create an anonymous function to re-use
# This function can contain any selector
get_elements = lambda: driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span')

i = 0
while True:
    elements = get_elements()
    # Exit if you're finished iterating
    if not elements or i >= len(elements):
        break
    # A freshly fetched reference, so this click is safe
    elements[i].click()
    # sleep
    time.sleep(random.randint(5, 10))
    # Update your counter
    i += 1
The simplest way to solve it is to get a specific link each time before clicking on it.
links = driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span')
for i in range(len(links)):
    # XPath indexing is 1-based, hence i + 1
    element = driver.find_element_by_xpath(f'(//*[@id="tr-undefined"]/td[1]/span)[{i + 1}]')
    element.click()
    time.sleep(random.randint(5, 10))
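Rather than sleeping a fixed 5 to 10 seconds, you can wait for the clicked element to actually go stale, which is usually both faster and more reliable. A sketch using Selenium's built-in expected conditions, assuming (as the first answer describes) that the click really does replace the element in the DOM:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

links = driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span')
for i in range(len(links)):
    # re-find the i-th link so the reference is fresh (XPath indexing is 1-based)
    element = driver.find_element_by_xpath(f'(//*[@id="tr-undefined"]/td[1]/span)[{i + 1}]')
    element.click()
    # block until the clicked element is detached from the DOM
    WebDriverWait(driver, 10).until(EC.staleness_of(element))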
set-up
I use Python + Selenium to scrape company info from this site.
Since the website doesn't let me load page URLs directly, I plan to click the next-page arrow element at the bottom of the list, using a while loop with a counter.
the code
browser.get('https://new.abb.com/channel-partners/search#')
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'abb-pagination')))

# start while loop and counter
c = 1
while c < 65:
    c += 1
    # obtain list-of-companies element
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#PublicWrapper > main > section:nth-child(7) > div:nth-child(2)')))
    resultlist = el_css('#PublicWrapper > main > section:nth-child(7) > div:nth-child(2)')  # el_css is a shorthand helper for a CSS lookup
    # loop over companies in list
    for company in resultlist.find_elements_by_xpath('div'):
        # company name
        name = company.find_element_by_xpath('h3/a/span').text
        # code to capture more company info follows
    # next page arrow element (el_cn is a shorthand helper for a class-name lookup)
    next_page_arrow = el_cn('abb-pagination__item--next')
    next_page_arrow.click()
issue
The code captures the company info just fine outside of the while loop, i.e. for just the first page.
However, when it is inside the while loop to iterate over the pages, I get the following error: StaleElementReferenceException: stale element reference: element is not attached to the page document (Session info: chrome=88.0.4324.192)
If I step through it, resultlist for the next page does seem to get captured, but the loop over the companies in resultlist raises this error.
What to do?
the simplest solution would be to wait explicitly for the elements to be present:
driver.get('https://new.abb.com/channel-partners/search#')
company_name = []
while True:
    time.sleep(1)
    # wait is the same WebDriverWait instance used in the question
    company_name += [elem.text for elem in wait.until(EC.presence_of_all_elements_located((By.XPATH, '//span[@property="name"]')))]
    # if the next-page arrow element is still available, click it, else break out of the while
    if driver.find_elements_by_xpath('//li[@class="abb-pagination__item--next"]/a[contains(@href,"#page")]'):
        wait.until(EC.presence_of_element_located((By.XPATH, '//li[@class="abb-pagination__item--next"]/a'))).click()
    else:
        break
len(company_name)
output:
951
You don't need the counter: you can check whether the arrow URL is still available, so if pages 65, 66, ... were added, your logic would still work.
The problem here is that the while loop is too fast and the page does not load in time. Alternatively, you could save the first list of company names, click the next arrow, and compare with the new list; if both are the same, wait a little longer, until the new list differs from the previous one.
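A rough sketch of that compare-the-lists idea, reusing the //span[@property="name"] locator from above; the 20-poll cap is an arbitrary safety bound, not part of the original answer:

import time

def names_on_page():
    return [e.text for e in driver.find_elements_by_xpath('//span[@property="name"]')]

company_name = []
previous = []
while True:
    # poll until the names differ from the previous page's names (or we give up)
    for _ in range(20):
        current = names_on_page()
        if current and current != previous:
            break
        time.sleep(0.5)
    company_name += current
    previous = current
    arrows = driver.find_elements_by_xpath('//li[@class="abb-pagination__item--next"]/a[contains(@href,"#page")]')
    if not arrows:
        break  # no next-page arrow left, we are on the last page
    arrows[0].click()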
I am scraping a website where I need to select a value from a dropdown and then click on a search button, which populates a table with multiple pages of search results. For each result I want to click on it, which redirects to another page, extract data from that page, then come back and do the same for the other search results.
select = Select(browser.find_element_by_id("dropdown id here"))
options = select.options
for index in range(len(options)):
    select.select_by_index(index)
    browser.find_element_by_id("checkbox").click()
    time.sleep(5)
    browser.find_element_by_id("click on search").click()
    elems = browser.find_elements_by_xpath("need to enter all href for search results")
    time.sleep(5)
    for elem in elems:
        # Need to enter each search result's href, scrape the data within, and get back to the search results again
        elem.get_attribute("href")
        elem.click()
        browser.find_element_by_id("for going back to search page again").click()
I get this error when I'm trying to iterate:
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=87.0.4280.88)
This problem occurs when the element can no longer be found because it has become stale.
Try to bring the find-element logic inside the for loop; that should solve your problem.
elems = browser.find_elements_by_xpath("need to enter all href for search results")
length = len(elems)

# Run your logic with the index number
for i in range(length):
    # Re-find the search results on each pass, then use the i-th one
    elems = browser.find_elements_by_xpath("need to enter all href for search results")
    elems[i].get_attribute("href")
    elems[i].click()
    time.sleep(1)
you can read more about it here https://stackoverflow.com/a/12967602/2986279
If you are opening new windows, you will likely want to switch focus between newly opened windows/tabs and the original.
Switch to newly opened window.
windows = browser.window_handles
browser.switch_to.window(windows[-1])
Close the newly opened window after finding what you need on the new page, then switch back to the previous window.
browser.close()
windows = browser.window_handles
browser.switch_to.window(windows[-1])
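Putting the two ideas together, one robust pattern for this question is to open each result in a new tab via JavaScript, scrape it, close the tab, and switch back, so the results page and its element references are never torn down. A minimal sketch; the XPath is still the placeholder from the question:

results = browser.find_elements_by_xpath("need to enter all href for search results")
urls = [r.get_attribute("href") for r in results]

original = browser.current_window_handle
for url in urls:
    # open the result in a fresh tab so the results page stays intact
    browser.execute_script("window.open(arguments[0]);", url)
    browser.switch_to.window(browser.window_handles[-1])
    # ... scrape the detail page here ...
    browser.close()
    browser.switch_to.window(original)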
I'm trying to expand some collapsible content by grabbing all the elements that need to be expanded and then clicking on them; once they're open, I scrape the data shown. So far I'm grabbing a list of elements by their XPath with this:
clicks = driver.find_elements_by_xpath("//div[contains(@class, 'collapsableContent') and contains(@class, 'empty')]")
and I've tried iterating with a simple for loop:
for item in clicks:
    item.click()
but that doesn't seem to work. Any suggestions on where to look?
The specific page I'm trying to get this from is: https://sports.betway.com/en/sports/sct/esports/cs-go
Here is the code that you should use to open all the divs which have the collapsed empty class.
# click on the close button in the cookies policy banner
# (this is the element which would otherwise overlap and block the clicks)
driver.find_element_by_css_selector(".messageBoxCloseButton.icon-cross").click()

# get the headers of all the collapsed divs
links = driver.find_elements_by_xpath("//div[@class='collapsableContent empty']/preceding-sibling::div[@class='collapsableHeader']")

# click on each of those headers
for link in links:
    link.location_once_scrolled_into_view  # accessing this property scrolls the element into view
    link.click()
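One caveat: if expanding a section removes its empty class, the list of matches shrinks as you click, and indexes into the original list go stale or skip items. A defensive sketch that simply keeps clicking the first remaining collapsed header until none are left, assuming the class really is removed on expansion (the range(100) is just a safety bound against an endless loop):

xpath = ("//div[@class='collapsableContent empty']"
         "/preceding-sibling::div[@class='collapsableHeader']")
for _ in range(100):
    headers = driver.find_elements_by_xpath(xpath)
    if not headers:
        break  # every section is expanded
    headers[0].location_once_scrolled_into_view  # scroll before clicking
    headers[0].click()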
I'm running into the infamous StaleElementReferenceException error with Selenium. I've checked previous questions on the subject, and the common solution is to add an implicit wait, an explicit wait, or time.sleep to give the website time to load. I've tried this, but I am still getting the error. Can anyone tell me what the issue is?
Here is my code:
links = driver.find_elements_by_css_selector("a.overline-productName")
time.sleep(2)

# finds pricing data of links on page
link_count = 0
for element in links:
    links[link_count].click()
    cents = driver.find_element_by_css_selector("span.cents")
    dollar = driver.find_element_by_css_selector("span.dollar")
    text_price = dollar.text + "." + cents.text
    price = float(text_price)
    print(price)
    print(link_count)
    driver.execute_script("window.history.go(-1)")
    link_count = link_count + 1
    time.sleep(5)
what am I missing?
You're storing your links in a list. The second you follow a link to another page, that set of links is stale, so the next iteration of your loop will attempt to click a stale link from the list.
Even if you go back in history as you do later, that original element reference is gone.
Your best bet is to loop through based on index, and find the links each time you return to the page.
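A minimal sketch of that index-based approach, reusing the selectors from the question:

import time

count = len(driver.find_elements_by_css_selector("a.overline-productName"))
for i in range(count):
    # re-find the links on every pass; the old references died with the page
    links = driver.find_elements_by_css_selector("a.overline-productName")
    links[i].click()
    cents = driver.find_element_by_css_selector("span.cents")
    dollar = driver.find_element_by_css_selector("span.dollar")
    print(float(dollar.text + "." + cents.text))
    print(i)
    driver.back()  # return to the listing page
    time.sleep(2)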