I have a very complex py.test python-selenium test setup where I create a Firefox webdriver inside a py.test fixture. Here is some idea of what I am doing:
'driver.py':
class Driver(object):
"""
Driver class with basic wrappers around the selenium webdriver
and other convenience methods.
"""
def __init__(self, config, options):
"""Sets the driver and the config.
"""
self.remote = options.getoption("--remote")
self.headless = not options.getoption("--with-head")
if self.headless:
self.display = Display(visible=0, size=(13660, 7680))
self.display.start()
# Start the selenium webdriver
self.webdriver = fixefox_module.get_driver()
'conftest.py':
#pytest.fixture
def basedriver(config, options):
driver = driver.Driver(config, options)
yield driver
print("Debug 1")
driver.webdriver.quit()
print("Debug 2")
And when running the test I can only see Debug 1 printed out. The whole process stops at this point and does not seem to proceed. The whole selenium test is stuck at the webdriver.quit).
The tests, however, completed successfully...
What reasons could be for that behavior?
Addendum:
The reason why the execution hangs seems to be a popup that asks the user if he wants to leave the page because of unsaved data. That means that the documentation for the quit method is incorrect. It states:
Quits the driver and close every associated window.
This is a non-trivial problem, to which selenium acts really a inconsistent. The quit method should, as documented, just close the browser window(s) but it does not. Instead you get a popup asking the user if he wants to leave the page:
The nasty thing is that this popup appears only after the user called
driver.quit()
One way to fix this is to set the following profile for the driver
from selenium import webdriver
profile = webdriver.FirefoxProfile()
# other settings here
profile.set_preference("dom.disable_beforeunload", True)
driver = webdriver.Firefox(firefox_profile = profile)
The warning to close is true by default in firefox as you can see in about:config and you can disable them for your profile:
And since,
The reason why the execution hangs seems to be a popup that asks the
user if he wants to leave the page because of unsaved data.
You can set browser.tabs.warnOnClose in your Firefox configuration profile as follows:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.tabs.warnOnClose", False)
driver = webdriver.Firefox(firefox_profile = profile)
You can look at profile.DEFAULT_PREFERENCES which is the json at python/site-packages/selenium/webdriver/firefox/webdriver_prefs.json
As far as I understood, there are basically two questions asked which I will try to answer :
Why failure of driver.webdriver.quit() method call leaves the script in hang/unresponsive state instead of raising any exception ?
Why the testcase was still a pass if the script never completed it's execution cycle ?
For answering the first question I will try to explain the Selenium Architecture which will clear most of our doubts.
So how Selenium Webdriver Functions ?
Every statement or command you write using Selenium Client Library will be converted to JSON Wire Protocol over http which in turn will be passed to our browser drivers(chromedriver, geckodriver) . So basically these generated http URLs (based on REST architecture) reaches to browser drivers. Inside the browser drivers there are http servers which will internally pass the received URLs to Real Browser (as HTTP over HTTP Server) for which the appropriate response will be generated by Web Browser and sent back to Browser Drivers (as HTTP over HTTP Server) which in turn will use JSON Wire Protocol to send the response back to Selenium Client Library which will finally decide on how to proceed further based on response achieved. Please refer to attached image for more clarification :
Now coming back to the question where script is in hold we can simply conclude that our browser is still working on request received that's why no response is sent back to Browser Driver which in turn left Selenium Library quit() function in hold i.e. waiting for the request processing completion.
So there are variety of workarounds available which we can use, among which one is already explained by Alex. But I believe there is a much better way to handle such conditions, as browser Selenium could leave us in hold/freeze state for other cases too as per my experience so I personally prefer Thread Kill Approach with Timeout as Selenium Object always runs in main() thread. We can allocate a specific time to the main thread and can kill main() thread if timeout session time is reached.
Now moving to the second question which is :
Why the testcase was still a pass if the script never completed it's
execution cycle ?
Well I don't have much idea on how pytest works but I do have basic idea on how test engine operates based on which I will try to answer this one.
For starters it's not at all possible for any test case to pass until the full script run is completed. Again, if your test cases are passing there could be very few possible scenarios such as :
Your test methods never made use of method which leaves the whole execution in hang/freeze state.
You must have called the method inside test tear down environment (w.r.t [TestNG][4] test engine in Java : #AfterClass, #AfterTest, #AfterGroups, #AfterMethod, #AfterSuite) meaning your test execution is completed already. So this might be the reason for tests showing up as successful completion.
I am still not sure what proper cause is there for second reason. I will keep looking and update the post if came up with something.
#Alex : Can you update the question with better understanding i.e. your current test design which I can explore to find better explanation.
So I was able to reproduce your issue using below sample HTML file
<html>
<body>
Please enter a value for me: <input name="name" >
<script>
window.onbeforeunload = function(e) {
return 'Dialog text here.';
};
</script>
<h2>ask questions on exit</h2>
</body>
</html>
Then I ran a sample script which reproduces the hang
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://localhost:8090/index.html")
driver.find_element_by_name("name").send_keys("Tarun")
driver.quit()
This will hang selenium python indefinitely. Which is not a good thing as such. The issue is the window.onload and window.onbeforeunload are tough for Selenium to handle because of the lifecycle of when it happens. onload happens even before selenium has injected its own code to suppress alert and confirm for handling. I am pretty sure onbeforeunload also is not in reach of selenium.
So there are multiple ways to get around.
Change in app
Ask the devs not to use onload or onbeforeunload events. Will they listen? Not sure
Disable beforeunload in profile
This is what you have already posted in your answer
from selenium import webdriver
profile = webdriver.FirefoxProfile()
# other settings here
profile.set_preference("dom.disable_beforeunload", True)
driver = webdriver.Firefox(firefox_profile = profile)
Disable the events through code
try:
driver.execute_script("window.onunload = null; window.onbeforeunload=null")
finally:
pass
driver.quit()
This would work only if you don't have multiple tabs opened or the tab suppose to generate the popup is on focus. But is a good generic way to handle this situation
Not letting Selenium hang
Well the reason selenium hangs is that it send a request to the geckodriver which then sends it to the firefox and one of these just doesn't respond as they wait for user to close the dialog. But the problem is Selenium python driver doesn't set any timeout on this connection part.
To solve the problem it is as simple as adding below two lines of code
import socket
socket.setdefaulttimeout(10)
try:
driver.quit()
finally:
# Set to something higher you want
socket.setdefaulttimeout(60)
But the issue with this approach is that driver/browser will still not be closed. This is where you need even more robust approach to kill the browser as discussed in below answer
In Python, how to check if Selenium WebDriver has quit or not?
Code from above link for making answer complete
from selenium import webdriver
import psutil
driver = webdriver.Firefox()
driver.get("http://tarunlalwani.com")
driver_process = psutil.Process(driver.service.process.pid)
if driver_process.is_running():
print ("driver is running")
firefox_process = driver_process.children()
if firefox_process:
firefox_process = firefox_process[0]
if firefox_process.is_running():
print("Firefox is still running, we can quit")
driver.quit()
else:
print("Firefox is dead, can't quit. Let's kill the driver")
firefox_process.kill()
else:
print("driver has died")
The best way to guarantee you run your teardown code in pytest is to define a finalizer function and add it as a finalizer to that fixture. This guarantees that even if something fails before the yield command, you still get your teardown.
To avoid a popup hanging up your teardown, invest in some WebdriverWait.until commands that timeout whenever you want them to. Popup appears, test cannot proceed, times out, teardown is called.
For ChromeDriver users:
options = Options()
options.add_argument('no-sandbox')
driver.close()
driver.quit()
credits to...
https://bugs.chromium.org/p/chromedriver/issues/detail?id=1135
Related
For some unknown reasons ,my browser open test pages of my remote server very slowly. So I am thinking if I can reconnect to the browser after quitting the script but don't execute webdriver.quit() this will leave the browser opened. It is probably kind of HOOK or webdriver handle.
I have looked up the selenium API doc but didn't find any function.
I'm using Chrome 62,x64,windows 7,selenium 3.8.0.
I'll be very appreciated whether the question can be solved or not.
No, you can't reconnect to the previous Web Browsing Session after you quit the script. Even if you are able to extract the Session ID, Cookies and other session attributes from the previous Browsing Context still you won't be able to pass those attributes as a HOOK to the WebDriver.
A cleaner way would be to call webdriver.quit() and then span a new Browsing Context.
Deep Dive
There had been a lot of discussions and attempts around to reconnect WebDriver to an existing running Browsing Context. In the discussion Allow webdriver to attach to a running browser Simon Stewart [Creator WebDriver] clearly mentioned:
Reconnecting to an existing Browsing Context is a browser specific feature, hence can't be implemented in a generic way.
With internet-explorer, it's possible to iterate over the open windows in the OS and find the right IE process to attach to.
firefox and google-chrome needs to be started in a specific mode and configuration, which effectively means that just
attaching to a running instance isn't technically possible.
tl; dr
webdriver.firefox.useExisting not implemented
Yes, that's actually quite easy to do.
A selenium <-> webdriver session is represented by a connection url and session_id, you just reconnect to an existing one.
Disclaimer - the approach is using selenium internal properties ("private", in a way), which may change in new releases; you'd better not use it for production code; it's better not to be used against remote SE (yours hub, or provider like BrowserStack/Sauce Labs), because of a caveat/resource drainage explained at the end.
When a webdriver instance is initiated, you need to get the before-mentioned properties; sample:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.google.com/')
# now Google is opened, the browser is fully functional; print the two properties
# command_executor._url (it's "private", not for a direct usage), and session_id
print(f'driver.command_executor._url: {driver.command_executor._url}')
print(f'driver.session_id: {driver.session_id}')
With those two properties now known, another instance can connect; the "trick" is to initiate a Remote driver, and provide the _url above - thus it will connect to that running selenium process:
driver2 = webdriver.Remote(command_executor=the_known_url)
# when the started selenium is a local one, the url is in the form 'http://127.0.0.1:62526'
When that is ran, you'll see a new browser window being opened.
That's because upon initiating the driver, the selenium library automatically starts a new session for it - and now you have 1 webdriver process with 2 sessions (browsers instances).
If you navigate to an url, you'll see it is executed on that new browser instance, not the one that's left from the previous start - which is not the desired behavior.
At this point, two things need to be done - a) close the current SE session ("the new one"), and b) switch this instance to the previous session:
if driver2.session_id != the_known_session_id: # this is pretty much guaranteed to be the case
driver2.close() # this closes the session's window - it is currently the only one, thus the session itself will be auto-killed, yet:
driver2.quit() # for remote connections (like ours), this deletes the session, but does not stop the SE server
# take the session that's already running
driver2.session_id = the_known_session_id
# do something with the now hijacked session:
driver.get('https://www.bing.com/')
And, that is it - you're now connected to the previous/already existing session, with all its properties (cookies, LocalStorage, etc).
By the way, you do not have to provide desired_capabilities when initiating the new remote driver - those are stored and inherited from the existing session you took over.
Caveat - having a SE process running can lead to some resource drainage in the system.
Whenever one is started and then not closed - like in the first piece of the code - it will stay there until you manually kill it. By this I mean - in Windows for example - you'll see a "chromedriver.exe" process, that you have to terminate manually once you are done with it. It cannot be closed by a driver that has connected to it as to a remote selenium process.
The reason - whenever you initiate a local browser instance, and then call its quit() method, it has 2 parts in it - the first one is to delete the session from the Selenium instance (what's done in the second code piece up there), and the other is to stop the local service (the chrome/geckodriver) - which generally works ok.
The thing is, for Remote sessions the second piece is missing - your local machine cannot control a remote process, that's the work of that remote's hub. So that 2nd part is literally a pass python statement - a no-op.
If you start too many selenium services on a remote hub, and don't have a control over it - that'll lead to resource drainage from that server. Cloud providers like BrowserStack take measures against this - they are closing services with no activity for the last 60s, etc, yet - this is something you don't want to do.
And as for local SE services - just don't forget to occasionally clean up the OS from orphaned selenium drivers you forgot about :)
OK after mixing various solutions shared on here and tweaking I have this working now as below. Script will use previously left open chrome window if present - the remote connection is perfectly able to kill the browser if needed and code functions just fine.
I would love a way to automate the getting of session_id and url for previous active session without having to write them out to a file during hte previous session for pick up...
This is my first post on here so apologies for breaking any norms
#Set manually - read/write from a file for automation
session_id = "e0137cd71ab49b111f0151c756625d31"
executor_url = "http://localhost:50491"
def attach_to_session(executor_url, session_id):
original_execute = WebDriver.execute
def new_command_execute(self, command, params=None):
if command == "newSession":
# Mock the response
return {'success': 0, 'value': None, 'sessionId': session_id}
else:
return original_execute(self, command, params)
# Patch the function before creating the driver object
WebDriver.execute = new_command_execute
driver = webdriver.Remote(command_executor=executor_url, desired_capabilities={})
driver.session_id = session_id
# Replace the patched function with original function
WebDriver.execute = original_execute
return driver
remote_session = 0
#Try to connect to the last opened session - if failing open new window
try:
driver = attach_to_session(executor_url,session_id)
driver.current_url
print(" Driver has an active window we have connected to it and running here now : ")
print(" Chrome session ID ",session_id)
print(" executor_url",executor_url)
except:
print("No Driver window open - make a new one")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=myoptions)
session_id = driver.session_id
executor_url = driver.command_executor._url
Without getting into why do you think that leaving an open browser windows will solve the problem of being slow, you don't really need a handle to do that. Just keep running the tests without closing the session or, in other words, without calling driver.quit() as you have mentioned yourself. The question here though framework that comes with its own runner? Like Cucumber?
In any case, you must have some "setup" and "cleanup" code. So what you need to do is to ensure during the "cleanup" phase that the browser is back to its initial state. That means:
Blank page is displayed
Cookies are erased for the session
Program Logic
I'm opening multiple selenium threads from the list using multithreading library in python3.
These threads are stored in an array from which they're started like this:
for each_thread in browser_threads:
each_thread.start()
for each_thread in browser_threads:
each_thread.join()
Each thread calls a function to start the selenium firefox browser. Function is as follows..
Browser Function
# proxy browser session
def proxy_browser(proxy):
global arg_pb_timesec
global arg_proxyurl
global arg_youtubevideo
global arg_browsermode
# recheck proxyurl
if arg_proxyurl == '':
arg_proxyurl = 'https://www.duckduckgo.com/'
# apply proxy to firefox using desired capabilities
PROX = proxy
webdriver.DesiredCapabilities.FIREFOX['proxy']={
"httpProxy":PROX,
"ftpProxy":PROX,
"sslProxy":PROX,
"proxyType":"MANUAL"
}
options = Options()
# for browser mode
options.headless = False
if arg_browsermode == 'headless':
options.headless = True
driver = webdriver.Firefox(options=options)
try:
print(f"{c_green}[URL] >> {c_blue}{arg_proxyurl}{c_white}")
print(f"{c_green}[Proxy Used] >> {c_blue}{proxy}{c_white}")
print(f"{c_green}[Browser Mode] >> {c_blue}{arg_browsermode}{c_white}")
print(f"{c_green}[TimeSec] >> {c_blue}{arg_pb_timesec}{c_white}\n\n")
driver.get(arg_proxyurl)
time.sleep(2) # seconds
# check if redirected to google captcha (for quitting abused proxies)
if not "google.com/sorry/" in driver.current_url:
# if youtube view mode
if arg_youtubevideo:
delay_time = 5 # seconds
# if delay time is more than timesec for proxybrowser
if delay_time > arg_pb_timesec:
# increase proxybrowser timesec
arg_pb_timesec += 5
# wait for the web element to load
try:
player_elem = WebDriverWait(driver, delay_time).until(EC.presence_of_element_located((By.ID, 'movie_player')))
togglebtn_elem = WebDriverWait(driver, delay_time).until(EC.presence_of_element_located((By.ID, 'toggleButton')))
time.sleep(2)
# click player
webdriver.ActionChains(driver).move_to_element(player_elem).click(player_elem).perform()
try:
# click autoplay button to disable autoplay
webdriver.ActionChains(driver).move_to_element(togglebtn_elem).click(togglebtn_elem).perform()
except Exception:
pass
except TimeoutException:
print("Loading video control taking too much time!")
else:
print(f"{c_red}[Network Error] >> Abused Proxy: {proxy}{c_white}")
driver.close()
driver.quit()
#if proxy not in abused_proxies:
# abused_proxies.append(proxy)
except Exception as e:
print(f"{c_red}{e}{c_white}")
driver.close()
driver.quit()
What the above does is start the browser with a proxy, check if the redirected url is not google recaptcha to avoid sticking on abused proxies page, if youtube video argument is passed, then wait for movie player to load and click it to autoplay.
Sort of like a viewbot for websites as well as youtube.
Problem
The threads indicate to end, but they keep running in the background. The browser window never quits and scripts exists with all browser threads runnning forever!
I tried every Stackoverflow solution and various methods, but nothing works. Here is the only relevant SO question which is also not so relevant since OP is spawing os.system processes, which I'm not: python daemon thread exits but process still run in the background
EDIT: Even when the whole page is loaded, youtube clicker does not work and there is no exception. The threads indicate to stop after network error, but there is no error?!
Entire Script
As suggested by previous stackoverflow programmers, I kept code here minimal and reproducable. But if you need the entire logic it's here: https://github.com/ProHackTech/FreshProxies/blob/master/fp.py
Here is the screenshot of what is happening:
As you are starting multiple threads and joining them as follows:
for each_thread in browser_threads:
each_thread.start()
for each_thread in browser_threads:
each_thread.join()
At this point, it is worth to note that WebDriver is not thread-safe. Having said that, if you can serialise access to the underlying driver instance, you can share a reference in more than one thread. This is not advisable. But you can always instantiate one WebDriver instance for each thread.
Ideally the issue of thread-safety isn't in your code but in the actual browser bindings. They all assume there will only be one command at a time (e.g. like a real user). But on the other hand you can always instantiate one WebDriver instance for each thread which will launch multiple browsing tabs/windows. Till this point it seems your program is perfect.
Now, different threads can be run on same Webdriver, but then the results of the tests would not be what you expect. The reason behind is, when you use multi-threading to run different tests on different tabs/windows a little bit of thread safety coding is required or else the actions you will perform like click() or send_keys() will go to the opened tab/window that is currently having the focus regardless of the thread you expect to be running. Which essentially means all the test will run simultaneously on the same tab/window that has focus but not on the intended tab/window.
Reference
You can find a relevant detailed discussion in:
Chrome crashes after several hours while multiprocessing using Selenium through Python
For some unknown reasons ,my browser open test pages of my remote server very slowly. So I am thinking if I can reconnect to the browser after quitting the script but don't execute webdriver.quit() this will leave the browser opened. It is probably kind of HOOK or webdriver handle.
I have looked up the selenium API doc but didn't find any function.
I'm using Chrome 62,x64,windows 7,selenium 3.8.0.
I'll be very appreciated whether the question can be solved or not.
No, you can't reconnect to the previous Web Browsing Session after you quit the script. Even if you are able to extract the Session ID, Cookies and other session attributes from the previous Browsing Context still you won't be able to pass those attributes as a HOOK to the WebDriver.
A cleaner way would be to call webdriver.quit() and then span a new Browsing Context.
Deep Dive
There had been a lot of discussions and attempts around to reconnect WebDriver to an existing running Browsing Context. In the discussion Allow webdriver to attach to a running browser Simon Stewart [Creator WebDriver] clearly mentioned:
Reconnecting to an existing Browsing Context is a browser specific feature, hence can't be implemented in a generic way.
With internet-explorer, it's possible to iterate over the open windows in the OS and find the right IE process to attach to.
firefox and google-chrome needs to be started in a specific mode and configuration, which effectively means that just
attaching to a running instance isn't technically possible.
tl; dr
webdriver.firefox.useExisting not implemented
Yes, that's actually quite easy to do.
A selenium <-> webdriver session is represented by a connection url and session_id, you just reconnect to an existing one.
Disclaimer - the approach is using selenium internal properties ("private", in a way), which may change in new releases; you'd better not use it for production code; it's better not to be used against remote SE (yours hub, or provider like BrowserStack/Sauce Labs), because of a caveat/resource drainage explained at the end.
When a webdriver instance is initiated, you need to get the before-mentioned properties; sample:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.google.com/')
# now Google is opened, the browser is fully functional; print the two properties
# command_executor._url (it's "private", not for a direct usage), and session_id
print(f'driver.command_executor._url: {driver.command_executor._url}')
print(f'driver.session_id: {driver.session_id}')
With those two properties now known, another instance can connect; the "trick" is to initiate a Remote driver, and provide the _url above - thus it will connect to that running selenium process:
driver2 = webdriver.Remote(command_executor=the_known_url)
# when the started selenium is a local one, the url is in the form 'http://127.0.0.1:62526'
When that is ran, you'll see a new browser window being opened.
That's because upon initiating the driver, the selenium library automatically starts a new session for it - and now you have 1 webdriver process with 2 sessions (browsers instances).
If you navigate to an url, you'll see it is executed on that new browser instance, not the one that's left from the previous start - which is not the desired behavior.
At this point, two things need to be done - a) close the current SE session ("the new one"), and b) switch this instance to the previous session:
if driver2.session_id != the_known_session_id: # this is pretty much guaranteed to be the case
driver2.close() # this closes the session's window - it is currently the only one, thus the session itself will be auto-killed, yet:
driver2.quit() # for remote connections (like ours), this deletes the session, but does not stop the SE server
# take the session that's already running
driver2.session_id = the_known_session_id
# do something with the now hijacked session:
driver.get('https://www.bing.com/')
And, that is it - you're now connected to the previous/already existing session, with all its properties (cookies, LocalStorage, etc).
By the way, you do not have to provide desired_capabilities when initiating the new remote driver - those are stored and inherited from the existing session you took over.
Caveat - having a SE process running can lead to some resource drainage in the system.
Whenever one is started and then not closed - like in the first piece of the code - it will stay there until you manually kill it. By this I mean - in Windows for example - you'll see a "chromedriver.exe" process, that you have to terminate manually once you are done with it. It cannot be closed by a driver that has connected to it as to a remote selenium process.
The reason - whenever you initiate a local browser instance, and then call its quit() method, it has 2 parts in it - the first one is to delete the session from the Selenium instance (what's done in the second code piece up there), and the other is to stop the local service (the chrome/geckodriver) - which generally works ok.
The thing is, for Remote sessions the second piece is missing - your local machine cannot control a remote process, that's the work of that remote's hub. So that 2nd part is literally a pass python statement - a no-op.
If you start too many selenium services on a remote hub, and don't have a control over it - that'll lead to resource drainage from that server. Cloud providers like BrowserStack take measures against this - they are closing services with no activity for the last 60s, etc, yet - this is something you don't want to do.
And as for local SE services - just don't forget to occasionally clean up the OS from orphaned selenium drivers you forgot about :)
OK after mixing various solutions shared on here and tweaking I have this working now as below. Script will use previously left open chrome window if present - the remote connection is perfectly able to kill the browser if needed and code functions just fine.
I would love a way to automate the getting of session_id and url for previous active session without having to write them out to a file during hte previous session for pick up...
This is my first post on here so apologies for breaking any norms
#Set manually - read/write from a file for automation
session_id = "e0137cd71ab49b111f0151c756625d31"
executor_url = "http://localhost:50491"
def attach_to_session(executor_url, session_id):
original_execute = WebDriver.execute
def new_command_execute(self, command, params=None):
if command == "newSession":
# Mock the response
return {'success': 0, 'value': None, 'sessionId': session_id}
else:
return original_execute(self, command, params)
# Patch the function before creating the driver object
WebDriver.execute = new_command_execute
driver = webdriver.Remote(command_executor=executor_url, desired_capabilities={})
driver.session_id = session_id
# Replace the patched function with original function
WebDriver.execute = original_execute
return driver
remote_session = 0
#Try to connect to the last opened session - if failing open new window
try:
driver = attach_to_session(executor_url,session_id)
driver.current_url
print(" Driver has an active window we have connected to it and running here now : ")
print(" Chrome session ID ",session_id)
print(" executor_url",executor_url)
except:
print("No Driver window open - make a new one")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=myoptions)
session_id = driver.session_id
executor_url = driver.command_executor._url
Without getting into why do you think that leaving an open browser windows will solve the problem of being slow, you don't really need a handle to do that. Just keep running the tests without closing the session or, in other words, without calling driver.quit() as you have mentioned yourself. The question here though framework that comes with its own runner? Like Cucumber?
In any case, you must have some "setup" and "cleanup" code. So what you need to do is to ensure during the "cleanup" phase that the browser is back to its initial state. That means:
Blank page is displayed
Cookies are erased for the session
I am using python selenium to parse large amount of data from more than 10,000+ urls. The browser is Firefox.
For each url, a Firefox browser will be opened and after data parsing, it will be closed, and wait 5 seconds before opening the next url through Firefox.
However, it happened twice these days, everything was running great, all of a sudden, the newly opened browser is blank, it is not loading the url at all. In real life experience, sometimes, even when I manually open a browser, searching for something, it is blank too.
The problem is, when this happened, there is no error at all, even when I wrote the except code to catch any exception, meanwhile I'm using nohup command to run the code, it will record any exception too, but there is no error at all. And once this happened, the code won't be executed any more, and many urls are left there without being parsed.... If I re-run the code on the rest urls, it works fine again.
Here is my code (all the 10,000+ urls are in comment_urls list):
for comment_url in comment_urls:
driver = webdriver.Firefox(executable_path='/Users/devadmin/Documents/geckodriver')
driver.get(comment_url)
time.sleep(5)
try:
// here is my data parsing code .....
driver.quit() // the browser will be closed when the data has been parsed
time.sleep(5) // and wait 5 secods
except:
with open(error_comment_reactions, 'a') as error_output:
error_output.write(comment_url+"\n")
driver.quit()
time.sleep(5)
At the same time, in that data parsing part, if there will be any exception, my code will also record the exception and close the driver, wait 5 seconds. But so far, no error recorded at all.
I tried to find similar problems and solutions online, but those are not helpful.
So, currently, I have 2 questions in mind:
Have you met this problem before and do you know how to deal with it? It is network problem or selenium problem or browser problem?
Or is there anyway in python, that it can tell the browser is not loading the url and it will close it?
For second problem, Prefer to use work queue to parse urls. One app should add all of them to queue (redis, rabbit-mq, amazon sqs and etc.) and then Second app should get 1 url from queue and try to parse it. In case if it will succeed, it should delete url from queue and switch to other url in queue. In case of exception it should os.exit(1) to stop app. Use shell to run second app, when it will return 1, meaning that error occurred, restart the app. Shell script: Get exit(1) from Python in shell
To answer your 2 questions:
1) Yes I have found selenium to be unpredictable at times. This is usually a problem when opening a browser for the first time which I will talk about in my solution. Try not to close the browser unless you need to.
2) Yes you can use the WebDriverWait() class in selenium.webdriver.support.wait
You said you are parsing thousands of comments so just make a new get request with the webdriver you have open.
I use this in my own scraper with the below code:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
browser = webdriver.Firefox()
browser.get("http://someurl.com")
table = WebDriverWait(browser,60).until(EC.presence_of_element_located((By.TAG_NAME, "table")))`
The variable browser is just webdriver.Firefox() class.
It is a bit long but what it does is wait for a specific html tag to exist on the page with a timeout of 60 seconds.
It is possible that you are experiencing your own time.sleep() locking the thread up as well. Try not to use sleeps to compensate for things like this.
Below is my script, it works fine, but not to my requirement
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://somewebsite.com/')
#nextline of script
In the above example, it opens the browser and immedtly moves to next step.
i want the script to wait till i close the browser manually and move to next step
( as i want to login and download few files from server and move to next step)
I agree with alecxe that you generally should automate the whole process. However, there are cases where you may be writing "throwaway code" or a proof-of-concept where it might be advantageous to have manual control of part of the process. If I found myself in such a situation, I'd do something like this:
import time
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://google.com/')
try:
while True:
# This will fail when the browser is closed.
browser.execute_script("")
time.sleep(0.2)
# Setting such a wide exception handler is generally not advisable but
# I'm not convinced there is a definite set of exceptions that
# Selenium will stick to if it cannot contact the browser. And I'm not
# convinced the set cannot change from release to release.
except:
has_quit = False
while not has_quit:
try:
# This is to allow Selenium to run cleanup code.
browser.quit()
has_quit = True
except: # See comment above regarding such wide handlers...
pass
# Continue with the script...
print "Whatever"
The call to browser.quit() is so that Selenium can cleanup after itself. It is very important for Firefox in particular because Selenium will create a bunch of temporary files which can fill up /tmp (on a Unix-type system, I don't know where Selenium puts the files on a Windows system) over time. In theory Selenium should be able to handle gracefully the case where the browser no longer exists by the time browser.quit() is called but I've found cases where an internal exception was not caught and browser.quit() would fail right away. (By the way, this supports my comment about the set of exceptions that Selenium can raise if the browser is dead being unclear: even Selenium does not know what exceptions Selenium can raise, which is why browser.quit() sometimes fails.) Repeating the call until it is successful seems to work.
Note that browser becomes effectively unusable as soon as you close the browser. You'll have to spawn a new browser if you wish to do more browserly things.
Also, it is not generally possible to distinguish between the user closing the browser and a browser crash.
If the page is not fully loaded, you can always wait for a specific element on the page to show up, for example, your download button.
Or you can wait for all JavaScript to load.
wait.until( new Predicate<WebDriver>() {
public boolean apply(WebDriver driver) {
return ((JavascriptExecutor)driver).executeScript("return document.readyState").equals("complete");
}
}
);