Recently I tried scraping; this time I wanted to go from page to page until I reach the final destination I want. Here's my code:
sub_categories = browser.find_elements_by_class_name("ty-menu__submenu-link")
for sub_category in sub_categories:
    sub_category = str(sub_category.get_attribute("href"))
    if(sub_category != 'http://www.lsbags.co.uk/all-bags/view-all-handbags-en/' and sub_category != "None"):
        browser.get(sub_category)
        print("Entered: " + sub_category)
        product_titles = browser.find_elements_by_class_name("product-title")
        for product_title in product_titles:
            final_link = product_title.get_attribute("href")
            if(str(final_link) != "None"):
                browser.get(str(final_link))
                print("Entered: " + str(final_link))
                #DO STUFF
I already tried the wait and the wrapper (try/except) solutions from here, but I don't get why it's happening. My guess is that the browser gets lost after it finishes one item, right?
I don't know how I should express this idea. In my mind I imagine it like this:
TIMELINE:
*PAGE 1 is within a loop; ALL THE URLS WITHIN IT ARE PROCESSED ONE BY ONE
*The first url of PAGE 1 is caught, so browser.get turns the page to PAGE 2
*PAGE 2 has the final list of links I want to evaluate, so another loop here
to get that url, and within that url #DO STUFF
*After #DO STUFF get to the second url of PAGE 2 and #DO STUFF again.
*Let's assume PAGE 2 has only two urls, so it finishes looping and goes back to PAGE 1
*The second url of PAGE 1 is caught...
and so on... I think I have expressed this idea at some point in my code; I don't know which part is not working and thus returns the exception.
Any help is appreciated, please help. Thanks!
The problem is that after you navigate to the next page, but before that page has loaded, Selenium finds the elements you are waiting for — but these are the elements of the page you are coming from. Once the next page loads, those elements are no longer attached to the DOM; they have been replaced by the elements of the new page. Selenium then tries to interact with the elements of the former page, which are no longer attached to the DOM, giving a StaleElementReferenceException.
After you press the link for the next page, you have to wait until the next page is completely loaded before you start your loop again.
So you have to find something on the page, other than the elements you are going to interact with, that tells you the next page is loaded.
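Applied to the question's code, one common pattern (just a sketch, combining that wait with another safeguard: collecting plain href strings up front, so no stored element is reused after navigation) looks like this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# A sketch reusing the question's browser object and class names.
# Collect plain href strings first, so no WebElement is reused after the page changes.
sub_links = [el.get_attribute("href")
             for el in browser.find_elements_by_class_name("ty-menu__submenu-link")]
for link in sub_links:
    if link and link != 'http://www.lsbags.co.uk/all-bags/view-all-handbags-en/':
        browser.get(link)
        # wait until an element of the new page is present before reading it
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "product-title")))
        product_links = [el.get_attribute("href")
                         for el in browser.find_elements_by_class_name("product-title")]
        for final_link in product_links:
            if final_link:
                browser.get(final_link)
                #DO STUFF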
Related
I'm very new to programming so apologies in advance if I'm not communicating my issue clearly.
Essentially, using Selenium I have created a list of elements on a webpage by finding all the elements with the same class name I'm looking for.
In this case, I'm finding songs, which have the html class 'item-song' on this website.
On the website, there are lots of clickable options for each listed song. I just want to click the title of the song, which opens a popup modal window in which I edit the note attached to the song, then click save, which closes the popup.
I have successfully been able to do that by using what I guess would be called the title’s XPATH 'relative' to the song class.
songs = driver.find_elements(By.CLASS_NAME, "item-song")
songs[0].find_element(By.XPATH, "div[5]/a").click()
# other code that ends by closing popup
This works, hooray! It also works for any other list index that I put in that line of code.
However, it does not work sequentially, or in a for loop.
i.e.
songs[0].find_element(By.XPATH, "div[5]/a").click()
# other code
time.sleep(5) # to ensure the popup has finished closing
songs[1].find_element(By.XPATH, "div[5]/a").click()
Does not work.
for song in songs:
    song.find_element(By.XPATH, "div[5]/a").click()
    # other code
    time.sleep(5)
    continue
Also does not work.
I get a traceback error:
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
After going back to the original page, the song does now say note(1) so I suppose the site has changed slightly. But as far as I can tell, the 'songs' list object and the xpath for the title of the next song should be exactly the same. To verify this, I even tried:
for song in songs:
    print(song)
    print(songs)
    print()
    song.find_element(By.XPATH, "div[5]/a").click()
    # other code
Sure enough, on the first iteration, print(song) matched the first index of print(songs) and on the second iteration, print(song) matches the second index of print(songs). And print(songs) is identical both times. (Only prints twice as the error happens halfway through the second iteration)
Any help is greatly appreciated, I'm stumped!
---------------------------------
Edit: Of course, it would be easier if my songs list could be all the song titles instead of the class ‘item-song’, that was what I was trying first. However I couldn’t find anything common between the titles in the HTML that would let me use find_elements to just get the song title element, as each song has a different title, and there are also other items like videos listed in between each song.
Through the comments, the solution is to use an iterative loop and an xpath.
songs = driver.find_elements(By.CLASS_NAME, "item-song")
for i in range(1, len(songs) + 1):
    # XPath ordinals are 1-based, and the index has to go into the string as text
    driver.find_element(By.XPATH, "(//*[@class='item-song'])[" + str(i) + "]/div[5]/a").click()
Breaking this down:
this: By.XPATH, "//*[@class='item-song']" is the same as this: By.CLASS_NAME, "item-song". The former is the XPath equivalent of the latter. I did this so we can build a single identification string to the link instead of trying to find elements within elements.
The [" + i + "] is the iteration for the the loop. If you were to print this you'd see (//*[#class='item-song'][1])") then (//*[#class='item-song'][2])"). That [x] is the ordinal identifier - it means the xth instance of the element in the DOM. The brackets around it ensure the entire thing is matched for the next part - you can sometimes get unexpected matches without it.
The last part, /div[5]/a, is just the original solution. Doing div[5] isn't great: your link must ALWAYS be inside the 5th div or it will fail, but as I can't see your application I can't comment on another way.
The original approach throws a StaleElementReferenceException because of the way Selenium stores identified elements.
Once you've identified an element by doing driver.find_elements(By.CLASS_NAME, "item-song"), Selenium essentially captures a reference to it; it doesn't store the identifier you used. Put a breakpoint after you identify an element and inspect it, and you'll see the stored reference is a GUID-style ID, not your locator.
Once you change the page that reference is lost.
Repeat the same steps, identifying the same object at the same breakpoint on a second test run, and the ID is different every time.
Page has changed == Selenium can no longer find it == Stale element.
The solution in this answer works because we're not storing an element.
Every action in the loop freshly identifies the element.
..Then add some clever pun about fresh vs stale... ;-)
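An equivalent way to keep things fresh (just a sketch, under the same assumptions about the page) is to re-run the find inside the loop instead of building an ordinal XPath:

count = len(driver.find_elements(By.CLASS_NAME, "item-song"))
for i in range(count):
    # re-find the whole list on every pass, so the reference is never stale
    songs = driver.find_elements(By.CLASS_NAME, "item-song")
    songs[i].find_element(By.XPATH, "div[5]/a").click()
    # other code that ends by closing the popup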
I've built a scraper that has a parent url and many children. I built a list with the urls of the children (all are https) and am looping through it. However, when I get to the second object of the loop, it adds a suffix (?Nao=0) and scrapes the parent again.
I illustrate it below:
links_products = ['https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=24',
                  'https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=48',
                  'https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=72']

from selenium import webdriver

driver = webdriver.Chrome('/home/chromedriver')
for i in links_products:
    driver.get(i)
    print(driver.current_url)
The result, which adds '?Nao=0' at the end of each url, is:
https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=24?Nao=0
https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=48?Nao=0
https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=72?Nao=0
I've tried adding
driver.execute_script('window.history.go(-1)')
driver.refresh()
print(driver.current_url)
Then it prints the urls I actually want to scrape:
https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=24
https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=48
https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=72
But it only scrapes the parent of the three links above, three times, namely:
https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r
Any suggestions on how to bypass this issue?
P.S. It is the same whether I go through the loop as described above or click the "next" button; it all comes back to the parent.
Just try with urls in this form:
https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=0?Nao=24
Keep the &Nao=0 and modify the value after ?Nao=.
It worked for me
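As a sketch of that suggestion, reusing the driver from the question:

base = 'https://www.target.com/c/grocery-deals/-/N-5xt0rZ55e6uZ55e69Z55e6tZ5tdv0r&Nao=0'
for offset in (24, 48, 72):
    driver.get(base + '?Nao=' + str(offset))
    print(driver.current_url)  # the offset should now survive instead of resetting to the parent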
Context
This is a repost of Get a page with Selenium but wait for element value to not be empty, which was closed without any valid reason so far as I can tell.
The linked answers in the closure reasoning both rely on knowing what the expected text value will be; each answer explicitly shows the expected text hardcoded into the WebDriverWait call. Furthermore, neither of the linked answers even remotely touches upon the final part of my question:
[whether the expected conditions] come before or after the page Get
"Duplicate" Questions
How to extract data from the following html?
Assert if text within an element contains specific partial text
Original Question
I'm grabbing a web page using Selenium, but I need to wait for a certain value to load. I don't know what the value will be, only what element it will be present in.
It seems that using the expected condition text_to_be_present_in_element_value or text_to_be_present_in_element is the most likely way forward, but I'm having difficulty finding any actual documentation on how to use these, and I don't know whether they come before or after the page Get:
webdriver.get(url)
Rephrase
How do I get a page using Selenium but wait for an unknown text value to populate an element's text or value before continuing?
I'm sure my answer is not the best one, but here is a part of my own code that helped me with a question similar to yours. In my case I had trouble with the loading time of the DOM: sometimes it took 5 seconds, sometimes 1 second, and so on.
url = 'www.somesite.com'
browser.get(url)
Because in my case browser.implicitly_wait(7) was not enough, I made a simple for loop to check whether the content is loaded:
some code...

from bs4 import BeautifulSoup

for try_html in range(7):
    # make 7 tries to check whether the element is loaded
    browser.implicitly_wait(7)
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.find_all('script', type='application/ld+json')
    # if 'sku' is not found in the html page we skip
    # to the next try, else we stop trying and
    # scrape the page
    if 'sku' not in html:
        continue
    else:
        scrape(raw_data)  # scrape() is my own parsing function
        break
It's not perfect, but you can try it.
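A tidier alternative (just a sketch; the '#price' locator is hypothetical, since I don't know the asker's page) is an explicit wait for non-empty text. Note that the wait comes after the get:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

browser.get(url)
# wait up to 10 seconds for the element's text to be populated;
# the locator is a made-up placeholder
WebDriverWait(browser, 10).until(
    lambda d: d.find_element(By.CSS_SELECTOR, '#price').text.strip() != '')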
I have a problem with this particular website (link to website).
I'm trying to create a script that can go through all entries, but with the condition that it has "memory", so it can continue from the page it last was on. That means I need to know the current page number AND a direct url to that page.
Here is what I have so far:
current_page_el = driver.find_element_by_xpath("//ul[contains(@class, 'pagination')]/li[@class='disabled']/a")
current_page = int(current_page_el.text)
current_page_url = current_page_el.get_attribute("href")
That code results in
current_page_url = 'javascript:void(0);'
Is there a way to get the current url from sites like this? Also, when you click to get to the next page, the link just stays the same as what I posted at the beginning.
I use this great solution for waiting until the page loads completely.
But for one page it doesn't work:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://vodafone.taleo.net/careersection/2a/jobsearch.ftl")
element = driver.find_element_by_xpath(".//*[@id='currentPageInfo']")
print element.id, element.text
driver.find_element_by_xpath(".//a[@id='next']").click()
element = driver.find_element_by_xpath(".//*[@id='currentPageInfo']")
print element.id, element.text
Output:
{52ce3a9f-0efb-49e1-be86-70446760e422} 1 - 25 of 1715
{52ce3a9f-0efb-49e1-be86-70446760e422} 26 - 50 of 1715
How to explain this behavior?
P.S.
The same thing occurs with PhantomJS.
Selenium lib version 2.47.1
Edit
The page makes ajax calls.
This solution is used in tasks similar to the one described in this article.
Without the HTML one can only guess:
Reading the linked answer, the behaviour you observe is most probably because clicking the "next" button does not load the whole page again; it only makes an ajax call (or something similar) and fills an already existing table with new values.
This means that the whole page stays "the same", and thus the "current page info" element also keeps the same id (just some JavaScript changed its text value).
To check this you can do the following:
Write another test method identical to the one you have, but this time replace this line:
driver.find_element_by_xpath(".//a[#id='next']").click()
with this line:
driver.refresh()
If it now gives you different ids, then I'm pretty sure my guess is correct.
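Spelled out, the check would look like this (kept in the Python 2 style of the question's snippet; a different pair of ids here supports the ajax guess):

element = driver.find_element_by_xpath(".//*[@id='currentPageInfo']")
print element.id, element.text
driver.refresh()  # full page reload instead of clicking "next"
element = driver.find_element_by_xpath(".//*[@id='currentPageInfo']")
print element.id, element.text  # a new id after a real reload backs up the guess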