I am trying to click on a link, scrape data from that page, go back, click on the next link, and so on. But I am not able to go back to the previous page for some reason. I noticed that the code to go back works if I run it outside the loop, and I can't figure out what is wrong inside the loop. I tried driver.back() too, and it still doesn't work. Any help is appreciated! TIA
x = 0  # counter
contents = []
for link in soup_level1.find_all('a', href=re.compile(r"^/new-homes/arizona/phoenix/"), tabindex=-1):
    python_button = driver.find_element_by_xpath("//div[@class='clearfix len-results-items len-view-list']//a[contains(@href,'/new-homes/arizona/phoenix/')]")
    driver.execute_script("arguments[0].click();", python_button)
    driver.implicitly_wait(50)
    soup_level2 = BeautifulSoup(driver.page_source, 'lxml')
    a = soup_level2.find('ul', class_='plan-info-lst')
    for names in a.find_all('li'):
        contents.append(names.span.next_sibling.strip())
    driver.execute_script("window.history.go(-1)")
    driver.implicitly_wait(50)
    x += 1
Some more information about your use case in terms of:
Selenium client version
WebDriver variant and version
Browser type and version
would have helped us to debug the issue in a better way.
However, to go back to the previous page you can use either of the following solutions:
Using back(): Goes one step backward in the browser history.
Usage:
driver.back()
Using execute_script(): Synchronously executes JavaScript in the current window/frame.
Usage:
driver.execute_script("window.history.go(-1)")
Use case: Internet Explorer
As per @james.h.evans.jr's comment in the discussion driver.navigate().back() blocks when back button triggers a javascript alert on the page, if you are using Internet Explorer, back() may at times not work, and this is pretty much expected, since IE navigates back through the history using the COM GoBack() method of the IWebBrowser interface. Given that, if any modal dialogs appear during the execution of the method, the method will block.
You may face similar issues while invoking forward() in the history and submitting forms. The GoBack method can be executed on a separate thread, which would involve calling a few not-very-intuitive COM object marshaling functions, e.g. CoGetInterfaceAndReleaseStream() and CoMarshalInterThreadInterfaceInStream(), but there doesn't seem to be much we can do about that.
Instead of using
driver.execute_script("window.history.go(-1)")
You can try using
driver.back() see here
Please be aware that this functionality depends entirely on the underlying driver. It’s just possible that something unexpected may happen when you call these methods if you’re used to the behavior of one browser over another.
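For completeness, here is a minimal, untested sketch of how the original loop could be restructured around driver.back(); the XPath and class names are copied from the question, START_URL is an explicit placeholder, and the result links are re-located on every pass so that navigating back does not leave stale references:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

START_URL = "https://example.com/new-homes/arizona/phoenix/"  # placeholder, not the real site
link_xpath = ("//div[@class='clearfix len-results-items len-view-list']"
              "//a[contains(@href,'/new-homes/arizona/phoenix/')]")

driver = webdriver.Chrome()
driver.get(START_URL)
contents = []

# Count the result links once, then re-locate them on every pass so the
# references never go stale after navigating back.
total = len(driver.find_elements(By.XPATH, link_xpath))
for i in range(total):
    links = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, link_xpath)))
    driver.execute_script("arguments[0].click();", links[i])

    # Scrape the detail page once the plan list has loaded.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "plan-info-lst")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    plan_list = soup.find("ul", class_="plan-info-lst")
    for item in plan_list.find_all("li"):
        contents.append(item.span.next_sibling.strip())

    driver.back()  # return to the results page before the next iteration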
Related
I am pretty new to web-scraping...
For example here is the part of my code:
labels = driver.find_elements(By.CLASS_NAME, 'form__item-checkbox-label.placeFinder-search__checkbox-label')
checkboxes = driver.find_elements(By.CLASS_NAME, 'form__item-checkbox-input.placeFinder-search__checkbox-input')
boxes = zip(labels, checkboxes)
time.sleep(3)
for label, checkbox in boxes:
    if checkbox.is_selected():
        label.click()
Here is another example:
driver.get(product_link)
time.sleep(3)
button = driver.find_element(By.XPATH, '//*[@id="tab-panel__tab--product-pos-search"]/h2')
time.sleep(3)
button.click()
And I am scraping through, let's say, hundreds of products. 90% of the time it works fine, but occasionally it gives errors like "couldn't locate the element" or "element is not clickable", etc. But all these product pages are built the same. Moreover, if I just re-run the code on the product that resulted in the error, most of the time on the 2nd or 3rd try I will be able to scrape the data and will not get the error back.
Why does it happen? The code stays the same, the web page stays the same. What is causing the error when it happens? The only thing that comes to my mind is that the Internet connection sometimes falls behind the code and the program is unable to see the elements it is looking for. But as you can see, I have added time.sleep(), and it does not always help.
How can this be avoided? It is really annoying to be forced to stay in front of the monitor all day just to supervise and re-run the code. I guess I could just wrap the scrape function in a try: except: else: block, but I am still wondering why the same code will sometimes work and sometimes return an error on the same page?
In short, Selenium deals with three distinct states of a WebElement:
presence
visible
interactable / clickable
Ideally, to click on any clickable element you need to induce WebDriverWait for the element_to_be_clickable() as follows:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='tab-panel__tab--product-pos-search']/h2"))).click()
Similarly, you can also create a list of the desired elements by waiting for their visibility, and then click them one by one, waiting for each of them to be clickable, as follows:
checkboxes = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "form__item-checkbox-input.placeFinder-search__checkbox-input")))
for checkbox in checkboxes:
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable(checkbox)).click()
Welcome to the "dirty" side of web automation. We call these "flaky" tests; in other words, they are "fragile". This is the major disadvantage of Selenium WebDriver.
There could be several reasons for a flaky situation:
Network instability: since every command is sent over the network (client -> selenium grid, if used -> browser driver -> actual browser), any connection issue can cause a failure.
CSS animations: commands execute immediately, so animated transitions can make an interaction land at the wrong moment and fail.
Ajax requests or dynamically changing elements: content that appears after a "load more" or some other action may not be detected yet, or may still be overlapped by another element.
And one last comment: sleep is not a good idea to use; in fact, it goes against best practices. Instead, use Expected Conditions to ensure elements are visible and ready, as in the sketch below.
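As an illustration only, here is a minimal sketch of replacing a fixed sleep with an expected-condition wait; the XPath is taken from the question, driver is assumed to exist already, and WebDriverWait/expected_conditions are the standard Selenium helpers:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 15)  # polls the DOM for up to 15 seconds

# Wait until the tab header is actually clickable, rather than sleeping a fixed time.
button = wait.until(EC.element_to_be_clickable(
    (By.XPATH, "//*[@id='tab-panel__tab--product-pos-search']/h2")))
button.click()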
I am trying to make a scraper that will go through a bunch of links, export the guide as a PDF, and loop through all the guides that are in the parent folder. It works fine going in, but when I try to go backwards, it throws stale element exceptions, even when I make sure to refresh the elements in the code, or refresh the page.
from selenium import webdriver
import time, bs4

browser = webdriver.Firefox()
browser.get('MYURL')

loginElem = browser.find_element_by_id('email')
loginElem.send_keys('LOGIN')
pwdElem = browser.find_element_by_id('password')
pwdElem.send_keys('PASSWORD')
pwdElem.submit()
time.sleep(3)

category = browser.find_elements_by_class_name('title')
for i in category:
    i.click()
    time.sleep(3)
    guide = browser.find_elements_by_class_name('cell')
    for j in guide:
        j.click()
        time.sleep(3)
        soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
        guidetitle = soup.find_all(id='guide-intro-title')
        print(guidetitle)
        browser.find_element_by_link_text('Options').click()
        time.sleep(0.5)
        browser.find_element_by_partial_link_text('Download PDF').click()
        browser.find_element_by_id('download').click()
        browser.execute_script("window.history.go(-2)")
        print("went back")
        time.sleep(5)
        print("waited")
        guide = browser.find_elements_by_class_name('thumb')
        print("refreshed elements")
    print("made it to outer loop")
This happens whether I use a script to move the browser back or the driver.back() method. I can see that it makes it back to the child directory, waits, and refreshes the elements. But then it can't seem to load the new element to go into the next guide. I found a similar question here on SO, but the answer just provided code tailored to that problem instead of explaining, so I am still confused.
I also know about WebDriverWait, but I am just using sleep for now since I don't fully understand the EC wait conditions. In any case, increasing the sleep time doesn't fix this issue.
Stale Element Reference Exception occurs upon page refresh because of an element UUID change in the DOM.
How to avoid it: Always try to search for an element right before interaction.
In your code, you searched for the cells, found them, and stored them in guide. So now guide holds a list of Selenium element references (UUIDs). But then you loop through that list, and upon each refresh (which I believe happens when you go back), each cell's UUID changes, so the old references you stored are no longer attached to the DOM. When you try to interact with them, Selenium cannot find them in the DOM and throws this exception.
Instead of looping through guide your way, try re-finding the element every time, like:
guide = browser.find_elements_by_class_name('cell')
for j in range(len(guide)):
    browser.find_elements_by_class_name('cell')[j].click()
Note, it looks like category might have a similar problem, so try applying this solution to category as well (a rough sketch follows).
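For illustration, here is an untested sketch of how both loops could re-find their elements by index on every pass; the class names are taken from the question, and the scraping/export steps are elided:
import time

num_categories = len(browser.find_elements_by_class_name('title'))
for i in range(num_categories):
    # Re-find the category links on every pass so the references stay fresh.
    browser.find_elements_by_class_name('title')[i].click()
    time.sleep(3)

    num_guides = len(browser.find_elements_by_class_name('cell'))
    for j in range(num_guides):
        # Likewise re-find the guide cells after every back-navigation.
        browser.find_elements_by_class_name('cell')[j].click()
        time.sleep(3)
        # ... scrape and export the guide as before ...
        browser.execute_script("window.history.go(-2)")
        time.sleep(5)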
Hope this helps. Here is a similar issue and a solution.
I'm trying to load one web page and get some elements from it. So the first thing I do is to check the page using "inspect element". When I search for the tags I'm looking for, I can see them (in Chrome).
But when I try to do driver.get(url) and then driver.find_element_by_..., it doesn't find those elements because they aren't in the source code.
I think that it is probably because it doesn't load the whole page but only a part.
Here is an example:
I'm trying to find ads on the web page.
from selenium import webdriver

PREPARED_TABOOLA_BLOCK = """//div[contains(@id,'taboola') and not(ancestor::div[contains(@id,'taboola')])]"""

driver = webdriver.PhantomJS(service_args=["--load-images=false"])
# driver = webdriver.Chrome()
driver.maximize_window()

def find_taboola_blocks_selenium(url):
    driver.get(url)
    taboola_blocks = driver.find_elements_by_xpath(PREPARED_TABOOLA_BLOCK)
    return taboola_blocks

print(len(find_taboola_blocks_selenium('http://www.breastfeeding-problems.com/breastfeeding-a-sick-baby.html')))
driver.get('http://www.breastfeeding-problems.com/breastfeeding-a-sick-baby.html')
print(len(driver.page_source))
OUTPUTS:
Using PhantomJS:
0
85103
Using ChromeDriver:
3
420869
Do you know how to make PhantomJS load as much HTML as possible, or any other way to solve this?
Can you compare the request that ChromeDriver is making versus the request you are making in PhantomJS? Since you are only doing GET for the specified url, you may not be including other request parameters that are needed to get the advertisements.
The open() method may give you a better representation of what you are looking for here: http://phantomjs.org/api/webpage/method/open.html
The reason for this is that PhantomJS, by default, renders in a really small window, which makes it load the mobile version of the site. And with the PhantomJSDriver, calling maximizeWindow() (or maximize_window() in Python) does absolutely nothing, since there is no rendered window to maximize. You will have to explicitly set the window's render size with:
edit: Below is the Java solution. I'm not entirely sure what the Python solution would be when setting the window size, but it should be similar.
driver.manage().window().setSize(new Dimension(1920, 1200));
edit again: Found the python version:
driver.set_window_size(1920, 1200)
Hope that helps!
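Putting it together for the question's setup, a small sketch might look like the following; the flags and dimensions are examples, not requirements:
from selenium import webdriver

driver = webdriver.PhantomJS(service_args=["--load-images=false"])
# maximize_window() is a no-op for PhantomJS; give it a desktop-sized viewport explicitly.
driver.set_window_size(1920, 1200)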
PhantomJS 1.x is a really old browser. It only uses SSLv3 (now disabled on most sites) by default and doesn't implement most cutting edge functionality.
Advertisement scripts are usually delivered over HTTPS (SSLv3/TLS) and usually use some obscure feature of JavaScript which is not well tested or simply not implemented in PhantomJS.
If you use PhantomJS < v1.9.8, then you should use these command-line options (service_args): --ignore-ssl-errors=true --ssl-protocol=any.
If iframes or strange cross-domain requests are necessary for the page/ads to work, then add --web-security=false to the service_args.
If this still doesn't solve the problem, then try upgrading to PhantomJS 2.0.0. You might need to compile it yourself on Linux.
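As a sketch under those assumptions, the flags can be passed through service_args when constructing the driver; the exact set of flags you need depends on the site:
from selenium import webdriver

# Flags for older PhantomJS builds: accept any SSL/TLS protocol, ignore
# certificate errors, and relax cross-domain restrictions for ad iframes.
service_args = [
    "--ignore-ssl-errors=true",
    "--ssl-protocol=any",
    "--web-security=false",
    "--load-images=false",
]
driver = webdriver.PhantomJS(service_args=service_args)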
I essentially have a start_url that has my JavaScript search form and button, hence the need for Selenium. I use Selenium to select the appropriate items in my select box objects and click the search button. On the following page, I do some Scrapy magic. However, now I want to go BACK to the original start_url, fill out a different set of objects, etc., and repeat until there are no more.
Essentially, I have tried making a for-loop and trying to get the browser to go back to the original response.url, but somehow it crashed. I may try having a duplicate list of start_urls at the top for Scrapy to parse through, but I'm not sure if that is the best approach. What can I do in my situation?
Here the advice is to use driver.back(): https://selenium-python.readthedocs.io/navigating.html#navigation-history-and-location
The currently selected answer provides a link to an external site and that link is broken. The selenium docs talk about
driver.forward()
driver.back()
but those will sometimes fail, even if you explicitly use some wait functions.
I found a better solution. You can use the below command to navigate backwards.
driver.execute_script("window.history.go(-1)")
Hope this helps someone else in the future.
To move backwards and forwards in your browser's history, use:
driver.forward()
driver.back()
I'm using Selenium WebDriver with Python to find an element and click it. This is the code. I'm passing 'number' to this code's method, and it doesn't work. I can see in the browser that the element is found, but it doesn't click it.
subIDTypeIcon = "//a[@id='s_%s_IdType']/img" % str(number)
self.driver.find_element_by_xpath(subIDTypeIcon).click()
However, I tried placing the 'self.driver.find_.....' line twice, and to my surprise it works:
subIDTypeIcon = "//a[@id='s_%s_IdType']/img" % str(number)
self.driver.find_element_by_xpath(subIDTypeIcon).click()
self.driver.find_element_by_xpath(subIDTypeIcon).click()
The browser is opened on a remote server, so there is sometimes a timeout problem.
Is there a proper way to make this work? Why does it work when the same statement is placed twice?
This is a common problem and the main reason to create abstract, per-page helper classes. Instead of blindly finding elements, you usually need a loop which tries to find an element for a couple of seconds so the browser can update the DOM.
The second version often works because starting to load a new page doesn't invalidate the DOM. That only happens once the remote server has started to send enough of the new document to the browser. You can see this yourself when you use a browser: pages don't become blank the same instant you click on a link. Instead, the old page stays visible for a while.
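As an illustration (the helper name and timeout are made up for this sketch), such a retry loop might look like the following; an explicit WebDriverWait with element_to_be_clickable achieves much the same thing more concisely:
import time
from selenium.common.exceptions import NoSuchElementException, WebDriverException

def click_when_present(driver, xpath, timeout=10, poll=0.5):
    """Retry finding and clicking an element until it succeeds or the timeout expires."""
    end = time.time() + timeout
    while True:
        try:
            driver.find_element_by_xpath(xpath).click()
            return
        except (NoSuchElementException, WebDriverException):
            if time.time() > end:
                raise
            time.sleep(poll)  # give the browser time to update the DOM

# Usage with the XPath from the question (number is whatever the caller passes in):
# click_when_present(self.driver, "//a[@id='s_%s_IdType']/img" % str(number))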