I am using rotating proxies, and sometimes one of them randomly doesn't work: I get "ERR_TIMED_OUT" / unable to reach the server, and the script just crashes without continuing. Is it possible to refresh the webpage automatically in Selenium when this happens (so the proxy will rotate)? I thought about catching the exception and then calling driver.refresh(), but how can I catch it across the entire code without a try/except around every instruction? Is there another solution? Thanks!
You can use event_firing_webdriver:
https://www.selenium.dev/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.event_firing_webdriver.html
You can decorate the get() method and do the try/except there (with a refresh on the timeout exception).
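As a rough sketch of the decorator idea, assuming Chrome (given the ERR_TIMED_OUT message) and that the failed load surfaces as a WebDriverException (TimeoutException is a subclass of it):

from selenium import webdriver
from selenium.common.exceptions import WebDriverException

def refresh_on_error(driver):
    """Return a replacement for driver.get that refreshes on a failed load."""
    original_get = driver.get

    def patched_get(url):
        try:
            original_get(url)
        except WebDriverException:
            # with rotating proxies the next request goes through a new proxy,
            # so a refresh is often enough to recover
            driver.refresh()

    return patched_get

driver = webdriver.Chrome()
driver.get = refresh_on_error(driver)   # patch this one driver instance

driver.get("http://example.com")        # a timed-out load now triggers a refresh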
Related
If I request data from an API on a Raspberry Pi in a while/for loop in Python and append the data to a CSV, and one iteration fails due to something like a flaky Wi-Fi connection that comes and goes, what is a foolproof way to get an indication that an error occurred and keep retrying, either immediately or after some rest period?
Use try/except to catch the exception, e.g.:
import time

while True:
    try:
        my_function_that_sometimes_fails()
    except Exception as e:
        print(e)
        time.sleep(60)  # rest for a bit before trying again
I think the retry package (and its decorator) will suit your needs. You can specify which kinds of exceptions it should catch and how many times it should retry before giving up completely. You can also specify the amount of time between tries.
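For illustration, a minimal sketch using the third-party retry package (pip install retry); requests, the URL, and request_and_append() are placeholders standing in for the actual API call and CSV write:

from retry import retry
import requests

# retry up to 5 times, waiting 30 s, 60 s, 120 s, ... between attempts
@retry(requests.exceptions.RequestException, tries=5, delay=30, backoff=2)
def request_and_append():
    response = requests.get("https://api.example.com/data", timeout=10)
    response.raise_for_status()
    # ... append response.json() to the CSV here ...

request_and_append()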
I'm sending some arguments to a Splash endpoint from a Scrapy crawler, to be used by the Splash script that will be run. Sometimes errors occur in that script. For runtime errors I use pcall to wrap the questionable script so I can gracefully catch and return them. For syntax errors, however, this doesn't work and a 400 error is thrown instead. I set my crawler to handle such errors with the handle_httpstatus_list attribute, so my parse callback is called on such an error and I can inspect what went wrong gracefully.
So far so good, except that since I couldn't handle the syntax error gracefully in the Lua script, I wasn't able to return some of the input Splash args that the callback expects to access. Right now I'm calling response._splash_args() on the SplashJsonResponse object that I do have, and this lets me access those values. However, this is essentially a protected method, which means it might not be a long-term solution.
Is there a better way to access the Splash args in the response when you can't rely on the associated Splash script running at all?
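For context, a rough sketch of the setup described above, assuming scrapy-splash's SplashRequest with the execute endpoint; the spider name, URL, and Lua source are placeholders, and the meta lookup in parse() relies on scrapy-splash keeping the original arguments under request.meta['splash']['args'], which you should verify for your version:

import scrapy
from scrapy_splash import SplashRequest

LUA_SOURCE = """
function main(splash, args)
    -- the questionable code is wrapped in pcall at runtime,
    -- but a syntax error here still produces an HTTP 400
    return {foo = args.foo}
end
"""

class ExampleSpider(scrapy.Spider):
    name = "example"
    handle_httpstatus_list = [400]   # let parse() see Splash syntax-error responses

    def start_requests(self):
        yield SplashRequest(
            "http://example.com",
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": LUA_SOURCE, "foo": "bar"},
        )

    def parse(self, response):
        if response.status == 400:
            # the original args should still be on the request itself,
            # even though the Lua script never ran
            splash_args = response.request.meta.get("splash", {}).get("args", {})
            self.logger.info("foo was %s", splash_args.get("foo"))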
I am using Python Selenium to parse a large amount of data from more than 10,000 URLs. The browser is Firefox.
For each URL, a Firefox browser is opened; after the data is parsed, it is closed, and the script waits 5 seconds before opening the next URL in Firefox.
However, it has happened twice in the past few days that everything was running fine and then, all of a sudden, the newly opened browser was blank and not loading the URL at all. In everyday use, even when I open a browser manually and search for something, it is sometimes blank too.
The problem is that when this happens there is no error at all, even though I wrote except code to catch any exception. I'm also running the code with nohup, which would record any exception, but nothing is logged. Once this happens, the code stops executing and many URLs are left unparsed. If I re-run the code on the remaining URLs, it works fine again.
Here is my code (all 10,000+ URLs are in the comment_urls list):
for comment_url in comment_urls:
    driver = webdriver.Firefox(executable_path='/Users/devadmin/Documents/geckodriver')
    driver.get(comment_url)
    time.sleep(5)
    try:
        # here is my data parsing code .....
        driver.quit()   # the browser will be closed when the data has been parsed
        time.sleep(5)   # and wait 5 seconds
    except:
        with open(error_comment_reactions, 'a') as error_output:
            error_output.write(comment_url + "\n")
        driver.quit()
        time.sleep(5)
At the same time, if any exception occurs in the data-parsing part, my code will record it, close the driver, and wait 5 seconds. But so far, no error has been recorded at all.
I tried to find similar problems and solutions online, but they were not helpful.
So, currently, I have 2 questions in mind:
Have you run into this problem before, and do you know how to deal with it? Is it a network problem, a Selenium problem, or a browser problem?
Or is there any way in Python to tell that the browser is not loading the URL, so it can be closed when that happens?
For the second question, I prefer to use a work queue to parse the URLs. One app should add all of them to a queue (Redis, RabbitMQ, Amazon SQS, etc.), and a second app should take one URL from the queue and try to parse it. If it succeeds, it should delete the URL from the queue and move on to the next one. In case of an exception it should call sys.exit(1) to stop the app. Then use a shell script to run the second app: when it returns 1, meaning an error occurred, restart it. Shell script: Get exit(1) from Python in shell. A rough sketch of the worker is below.
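A minimal sketch of such a worker, assuming a local Redis queue; parse_url() is a hypothetical placeholder for the actual Selenium parsing code:

import sys
import redis

r = redis.Redis()                 # queue assumed to live in a local Redis
QUEUE_KEY = "comment_urls"

def main():
    while True:
        url = r.lpop(QUEUE_KEY)
        if url is None:
            break                        # queue drained, nothing left to parse
        try:
            parse_url(url.decode())      # hypothetical parsing function
        except Exception:
            r.rpush(QUEUE_KEY, url)      # put the URL back at the end of the queue
            sys.exit(1)                  # non-zero exit tells the shell wrapper to restart us

if __name__ == "__main__":
    main()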
To answer your 2 questions:
1) Yes, I have found Selenium to be unpredictable at times. This is usually a problem when opening a browser for the first time, which I will talk about in my solution. Try not to close the browser unless you need to.
2) Yes, you can use the WebDriverWait() class from selenium.webdriver.support.wait.
You said you are parsing thousands of comments, so just make a new get request with the webdriver you already have open.
I use this in my own scraper with the below code:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get("http://someurl.com")
table = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.TAG_NAME, "table")))
The browser variable is just an instance of the webdriver.Firefox() class.
It is a bit long, but what it does is wait for a specific HTML tag to exist on the page, with a timeout of 60 seconds.
It is also possible that your own time.sleep() calls are locking things up. Try not to use sleeps to compensate for issues like this.
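Putting those two points together, a rough sketch that reuses one Firefox instance for all the URLs and waits for a specific element instead of sleeping (the table tag and the 60-second timeout are placeholders for whatever your pages actually need):

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Firefox()

for comment_url in comment_urls:
    try:
        browser.get(comment_url)
        # block until the element we actually need exists in the DOM
        WebDriverWait(browser, 60).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        # ... data parsing code goes here ...
    except TimeoutException:
        with open(error_comment_reactions, 'a') as error_output:
            error_output.write(comment_url + "\n")

browser.quit()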
I'm using PhantomJS as a webdriver to load some URLs. Usually the program runs fine; however, it hangs on driver.get(url) a lot, and I'm wondering if there is anything I can do about it?
driver = webdriver.PhantomJS(executable_path= path_to_phantomjs_exe, service_log_path= path_to_ghostdriver_log)
driver.get(url)
It will just hang trying to load a certain URL forever, but if I try again, it might work. Are webdrivers/PhantomJS really just that unstable? I guess the last resort would be to keep calling driver.get(url) until it finally loads, but is that really going to be necessary? Thanks!
EDIT: It seems to only hang when loading the first link out of a list of them. It eventually does load, but only after a few minutes. The rest of the links load within seconds. Any help at all would be great.
I've answered this exact problem in this post: Geb/Selenium tests hang loading new page, but I copied it here because I see that this question is older.
I hope you can find a way to implement this into your code, but this is what worked for me when I was having a similar situation with PhantomJS hanging.
I traced it to a hang on a driver.get() call, which for me meant that something wasn't going through, or that the webdriver - for some reason - just wasn't passing the load-successful signal back to the driver to let the script continue.
So, I added the following:
from selenium import webdriver

driver = webdriver.PhantomJS()
# set timeout information
driver.set_page_load_timeout(15)
I've tested this with a timeout of 5 seconds and it just didn't wait long enough, so nothing would happen. 15 seconds worked great for me, but that's something you may need to test for yourself.
On top of this, I also created a loop wherever there was a chance for the webdriver to time out, so that driver.get() could attempt to re-send the .get() command. With a stacked try/except approach, I ended up with this:
from time import sleep

finished = 0
while finished == 0:
    try:
        driver.get(url3)
        finished = 1
    except:
        sleep(5)
I have also seen the except clause handled like this:
except TimeoutException as e:
    # Handle your exception here
    print(e)
but I had no use for this. It might be nice to know how to catch specific exceptions, though (TimeoutException lives in selenium.common.exceptions).
See this solution for more options for a timeout: Setting timeout on selenium webdriver.PhantomJS
So I was having the same problem:
driver = webdriver.PhantomJS(executable_path= path_to_phantomjs_exe, service_log_path= path_to_ghostdriver_log)
driver.get(url)
So I changed the service_log_path to:
service_log_path=os.path.devnull
This seemed to work for me!!!
I'm testing a site with lots of proxies, and the problem is that some of those proxies are awfully slow. As a result, my code gets stuck loading pages every now and then.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com/example-page.php")
element = browser.find_element_by_id("someElement")
I've tried lots of things, like explicit waits and implicit waits, and have been searching around for quite a while, but I still haven't found a solution or workaround. Nothing seems to affect the page-loading line browser.get("http://example.com/example-page.php"), which is why the script is always stuck there.
Anybody got a solution for this?
Update 1:
JimEvans' answer solved my previous problem, and here you can find the Python patch for this new feature.
New problem:
browser = webdriver.Firefox()
browser.set_page_load_timeout(30)
browser.get("http://example.com/example-page.php")
element = browser.find_element_by_id("elementA")
element.click() ## assume it's a link to a new page http://example.com/another-example.php
another_element = browser.find_element_by_id("another_element")
As you can see, browser.set_page_load_timeout(30) only affects browser.get("http://example.com/example-page.php"): if that page takes more than 30 seconds to load, a timeout exception is thrown. The problem is that it has no power over page loads triggered by element.click(). Although the click does not block until the new page has entirely loaded, the next line, another_element = browser.find_element_by_id("another_element"), is the new pain point, because both explicit and implicit waits wait for the whole page to load before starting to look for that element. In some extreme cases this can take HOURS. What can I do about it?
You could try using the page load timeout introduced in the library. Its implementation is not universal, but it is exposed for certain by the .NET and Java bindings, has now been implemented in the Firefox driver, and will be in the IE driver in the forthcoming 2.22. In Java, to set the page load timeout to 15 seconds, the code would look like this:
driver.manage().timeouts().pageLoadTimeout(15, TimeUnit.SECONDS);
If it's not exposed in the Python language bindings, I'm sure the maintainer would eagerly accept a patch that implemented it.
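For reference, a rough Python equivalent, assuming a Selenium build where the page load timeout has made it into the Python bindings (as the update above suggests it now has):

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
driver.set_page_load_timeout(15)   # seconds

try:
    driver.get("http://example.com/example-page.php")
except TimeoutException:
    # the page took longer than 15 seconds; decide here whether to retry or move on
    pass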
You can still speed up your script by waiting for the presence (not the visibility) of the expected element for, say, 5-8 seconds and then running a window.stop() JavaScript call to stop loading further elements, instead of waiting for the entire page to load; or by catching the page-load timeout exception after 5-8 seconds and then calling window.stop().
This is because, if the page does not use a lazy-loading technique (loading only the visible elements and the rest only after scrolling), it loads every element before reaching the window ready state, so it will be slow if any element takes a long time to render. A sketch of this approach follows.
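As a minimal sketch of this approach (the element id and the timeout values are placeholders):

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Firefox()
browser.set_page_load_timeout(8)   # stop waiting for the full page after ~8 seconds

try:
    browser.get("http://example.com/example-page.php")
except TimeoutException:
    # the page kept loading past the limit; stop it and work with what is already there
    browser.execute_script("window.stop();")

# presence_of_element_located does not require visibility, so this returns
# as soon as the node exists in the DOM
element = WebDriverWait(browser, 8).until(
    EC.presence_of_element_located((By.ID, "elementA"))
)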