I'm trying to deal with the creation of a webdriver timing out (which happens once in a while, as covered here). I can't use a signal-based timeout because my server runs on Windows, so I've been looking for an alternative.
I looked at the timeout from eventlet, but I don't think it will cut it: a time.sleep(10000) doesn't trigger the timeout, so I doubt a stalled webdriver creation would either.
What I'm thinking of is spawning a thread to create and return the browser and then setting a join timeout. Something like:
def SpawnPhantomJS(dcap, service_args):
    browser = webdriver.PhantomJS(r'C:\phantomjs.exe', desired_capabilities=dcap, service_args=service_args)
    print "browser made!"
    return browser
proxywrite = '--proxy=' + nextproxy
service_args = [
    proxywrite,
    '--proxy-type=http',
    '--ignore-ssl-errors=true',
]
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (nextuseragent)
newDriver = Thread(target=SpawnPhantomJS, args=[dcap, service_args]).start().join(20)
So I'm having some issues with the syntax of how to do this properly, but in theory it should work: if the creation stalls, the SpawnPhantomJS thread stalls rather than the main one, so the join timeout should let the main thread move on.
Is this possible though? Can I create a webdriver in a thread and return it? Any pointers appreciated.
Updates:
Just calling the function directly returns a webdriver, so that bodes well for what I'm trying to do.
newDriver = SpawnPhantomJS(dcap, service_args)
So I'm hoping it's just a syntax issue with running this as a thread with a timeout.
This didn't do it, however:
spawnthread = Thread(target=SpawnPhantomJS, args=[dcap, service_args])
spawnthread.start()
newDriver = spawnthread.join()
Wishful thinking there: Thread.join() returns None, not the function's return value.
Thread pooling.
from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=1)
async_result = pool.apply_async(SpawnPhantomJS, (dcap, service_args))
newDriver = async_result.get(10)  # wait up to 10 seconds for the browser
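For completeness, a hedged sketch of how the timeout case could be handled around that get(10) call; spawn_with_timeout, the retry count, and the fresh pool per attempt are illustrative choices rather than anything from the code above (multiprocessing.TimeoutError is what AsyncResult.get raises when the timeout expires):

from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool

def spawn_with_timeout(dcap, service_args, timeout=10, retries=3):
    for attempt in range(retries):
        # fresh single-worker pool per attempt so a stuck worker doesn't block retries
        pool = ThreadPool(processes=1)
        async_result = pool.apply_async(SpawnPhantomJS, (dcap, service_args))
        try:
            return async_result.get(timeout)  # blocks at most `timeout` seconds
        except TimeoutError:
            # the stuck worker thread is simply abandoned; there is no way to kill it
            print "spawn attempt %d timed out, retrying..." % (attempt + 1)
    return None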
I found this simple example demonstrating how to use threading to parallelize opening multiple chrome sessions with selenium.
from selenium import webdriver
import threading
import time

def test_logic():
    driver = webdriver.Chrome()
    url = 'https://www.google.de'
    driver.get(url)
    # Implement your test logic
    time.sleep(2)
    driver.quit()

N = 5  # Number of browsers to spawn
thread_list = list()

# Start test
for i in range(N):
    t = threading.Thread(name='Test {}'.format(i), target=test_logic)
    t.start()
    time.sleep(1)
    print(t.name + ' started!')
    thread_list.append(t)

# Wait for all threads to complete
for thread in thread_list:
    thread.join()

print('Test completed!')
I tested it and it works. However, if I modify the test_logic function to take a parameter, i.e. j:
def test_logic(j):
    driver = webdriver.Chrome()
    url = 'https://www.google.de'
    driver.get(url)
    # Implement your test logic
    time.sleep(j)
    driver.quit()
and the corresponding part of threading to:
t = threading.Thread(name='Test {}'.format(i), target=test_logic(i))
the code stops working in parallel and just runs sequentially.
I don't know what I have overlooked and would be very grateful if anybody could give me some advice. Many thanks!
target=test_logic(i) invokes the function test_logic immediately and hands its return value to the thread as the target.
You may want to do:
t = threading.Thread(name='Test {}'.format(i), target=test_logic, args=[i])
where target is the name of the function, and args is the arguments list for the function.
If your function has 2 args, like def test_logic(a, b), then args should contain 2 values.
More info in Python Thread Documentation
You have to pass arguments to the function as below:
t = threading.Thread(name='Test {}'.format(i), target=test_logic, args=(i,))
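For completeness, a minimal sketch of the corrected spawning loop from the question with that fix applied (N, test_logic and thread_list as defined above):

# Corrected loop: pass the function object as target and the argument via args,
# so each browser session starts in its own thread instead of running sequentially.
for i in range(N):
    t = threading.Thread(name='Test {}'.format(i), target=test_logic, args=(i,))
    t.start()
    thread_list.append(t)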
I have an asyncio-based crawler that occasionally offloads crawling that requires the browser to a ThreadPoolExecutor, as follows:
def browserfetch(url):
    browser = webdriver.Chrome()
    browser.get(url)
    # Some explicit wait stuff that can take up to 20 seconds.
    return browser.page_source

async def fetch(url, loop):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, browserfetch, url)
        return result
My issue is that I believe this respawns the headless browser each time I call fetch, which incurs browser startup time on each call to webdriver.Chrome. Is there a way for me to refactor browserfetch or fetch so that the same headless driver can be used on multiple fetch calls?
What have I tried?
I've considered more explicit use of threads/pools to start the Chrome instance in a separate thread/process, communicating within the fetch call via queues, pipes, etc (all run in Executors to keep the calls from blocking). I'm not sure how to make this work, though.
I believe that starting the browser in a separate process and communicating with it via a queue is a good approach (and more scalable). The pseudo-code might look like this:
# worker.py
def entrypoint(in_queue, out_queue):  # run in process
    crawler = Crawler()
    browser = Browser()
    while not stop:
        command = in_queue.get()
        result = crawler.process(command, browser)
        out_queue.put(result)

# main.py
import worker

in_queue, out_queue = Process(worker.entrypoint)
while not stop:
    in_queue.put(new_task)
    result = out_queue.get()
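Filling that pseudo-code in, here is a hedged, runnable sketch using multiprocessing directly; browser_worker, the STOP sentinel and the example URLs are illustrative names rather than anything from the question:

# Hedged sketch: one long-lived Chrome instance in a worker process,
# fed URLs over a queue so repeated fetches reuse the same browser.
from multiprocessing import Process, Queue
from selenium import webdriver

STOP = "STOP"  # sentinel telling the worker to shut down

def browser_worker(in_queue, out_queue):
    browser = webdriver.Chrome()  # started once, reused for every URL
    try:
        while True:
            url = in_queue.get()
            if url == STOP:
                break
            browser.get(url)
            out_queue.put((url, browser.page_source))
    finally:
        browser.quit()

if __name__ == "__main__":
    in_q, out_q = Queue(), Queue()
    worker = Process(target=browser_worker, args=(in_q, out_q))
    worker.start()

    for url in ["https://example.com", "https://example.org"]:
        in_q.put(url)
    for _ in range(2):
        fetched_url, html = out_q.get()
        print(fetched_url, len(html))

    in_q.put(STOP)
    worker.join()

On the asyncio side, the queue put/get calls can still be wrapped in loop.run_in_executor so fetch stays non-blocking while reusing the same browser.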
I am building a Tkinter app with Python that initializes multiple Selenium webdrivers. The initial problem was that lots of chromedriver.exe instances were filling up the user's memory, even after using driver.quit() (sometimes). To get rid of this issue, when closing the Tkinter app I run os.system("taskkill /f /im chromedriver.exe /T"), which solves my problem, but it spawns a command prompt window that kills itself almost instantly. The problem is that the user can see it, and I find that kind of disturbing. Is there any way I could hide it? Or is there a workaround for my initial problem that is user friendly?
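As a side note, a hedged sketch of one way to run the same taskkill cleanup without flashing a console window (assuming Windows and Python 3.7+, where subprocess.CREATE_NO_WINDOW is available; the function name is illustrative):

# Hedged sketch: run the taskkill cleanup via subprocess instead of os.system,
# suppressing the console window with CREATE_NO_WINDOW.
import subprocess

def kill_chromedrivers_quietly():
    subprocess.run(
        ["taskkill", "/f", "/im", "chromedriver.exe", "/T"],
        creationflags=subprocess.CREATE_NO_WINDOW,  # no visible console window
        check=False,  # taskkill returns non-zero if no process was found
    )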
Use both driver.close() and driver.quit() in your code in order to free memory.
driver.close()
driver.quit()
To reduce the memory footprint you should use a ChromeDriver Service:
First you start the service, then use it when creating new drivers, and finally stop the service before program exit.
Put the code below in chrome_factory.py and then:
on program start, call chrome_factory.start_service()
to create a new driver, call chrome_factory.create_driver()
the service and drivers will be stopped/quit automatically at program exit.
Using this approach will result in only ever having a single chromedriver.exe process.
# chrome_factory.py
import atexit
from os.path import expanduser
from typing import Optional

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

EXECUTABLE = expanduser("~/bin/chromedriver")

_chrome_service: Optional[Service] = None

def start_service():
    global _chrome_service
    _chrome_service = Service(EXECUTABLE)
    _chrome_service.start()
    atexit.register(_chrome_service.stop)

def create_driver() -> webdriver.Remote:
    global _chrome_service
    opts = webdriver.ChromeOptions()
    opts.add_argument("--headless")
    driver = webdriver.Remote(_chrome_service.service_url,
                              desired_capabilities=opts.to_capabilities())
    atexit.register(driver.quit)
    return driver

def main():
    start_service()
    for _ in range(20):
        create_driver()

if __name__ == '__main__':
    main()
I'm working on a Django app. I'm using Selenium together with PhantomJS for testing.
I found today that every time I terminate the test (which I do a lot when debugging), the PhantomJS process is still alive. This means that after a debugging session I could be left with 200 zombie PhantomJS processes!
How do I get these PhantomJS processes to terminate when I terminate the Python debug process? If there's a time delay, that works too. (i.e. have them terminate if not used for 2 minutes, that would solve my problem.)
The usual setup is to quit the PhantomJS browser in the teardown method of the class. For example:
from django.conf import settings
from django.test import LiveServerTestCase
from selenium.webdriver.phantomjs.webdriver import WebDriver

PHANTOMJS = (settings.BASE_DIR +
             '/node_modules/phantomjs/bin/phantomjs')

class PhantomJSTestCase(LiveServerTestCase):
    @classmethod
    def setUpClass(cls):
        cls.web = WebDriver(PHANTOMJS)
        cls.web.set_window_size(1280, 1024)
        super(PhantomJSTestCase, cls).setUpClass()

    @classmethod
    def tearDownClass(cls):
        screenshot_file = getattr(settings, 'E2E_SCREENSHOT_FILE', None)
        if screenshot_file:
            cls.web.get_screenshot_as_file(screenshot_file)
        cls.web.quit()
        super(PhantomJSTestCase, cls).tearDownClass()
If you do not use unittest test cases, you'll have to use the quit method yourself. You can use the atexit module to run code when the Python process terminates, for example:
import atexit
web = WebDriver(PHANTOMJS)
atexit.register(web.quit)
Actually this is not a hang, I mean... it's a slow response.
In that case I would like to close IE and restart from the beginning.
Closing is no problem; the problem is how to set a timeout. For example, if I set 15 seconds and the webpage doesn't open within 15 seconds, I want to close it and restart from the beginning.
Is this possible with the IE COM interface?
It's really hard to find a solution.
Paul,
I used the following code to check whether a webpage has completely loaded or not.
But as I mentioned, it is not working well, because IE.Navigate looks like it hangs or does not respond.
while ie.ReadyState != 4:
    time.sleep(0.5)
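As a hedged sketch of the 15-second idea expressed in terms of that polling loop (assuming ie is the InternetExplorer.Application object from the code above and 4 means READYSTATE_COMPLETE; the deadline value is illustrative):

# Hedged sketch: poll ReadyState with a deadline instead of waiting forever.
import time

timed_out = False
deadline = time.time() + 15          # give the page at most 15 seconds
while ie.ReadyState != 4:            # 4 == READYSTATE_COMPLETE
    if time.time() > deadline:
        timed_out = True
        break
    time.sleep(0.5)

if timed_out:
    ie.Quit()                        # close the stuck IE instance...
    # ...then re-create it and Navigate again from the start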
To avoid the blocking problem, use the IE COM object in a thread.
Here is a simple but powerful example demonstrating how you can use a thread and the IE COM object together. You can improve it for your purpose.
This example starts a thread and uses a queue to communicate with the main thread. In the main thread the user can add URLs to the queue, and the IE thread visits them one by one; after it finishes one URL, IE visits the next. As the IE COM object is being used in a thread, you need to call CoInitialize.
from threading import Thread
from Queue import Queue
from win32com.client import Dispatch
import pythoncom
import time

class IEThread(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.queue = Queue()

    def run(self):
        ie = None
        # as IE COM object will be used in a thread, do CoInitialize
        pythoncom.CoInitialize()
        try:
            ie = Dispatch("InternetExplorer.Application")
            ie.Visible = 1
            while 1:
                url = self.queue.get()
                print "Visiting...", url
                ie.Navigate(url)
                while ie.Busy:
                    time.sleep(0.1)
        except Exception, e:
            print "Error in IEThread:", e
        if ie is not None:
            ie.Quit()

ieThread = IEThread()
ieThread.start()

while 1:
    url = raw_input("enter url to visit:")
    if url == 'q':
        break
    ieThread.queue.put(url)