The goal is to write a Python script that opens a specific website, fills out some inputs, and then submits the form. This should be done simultaneously for the same website with different inputs.
I tried using Thread from threading and some other things, but I can't make it run simultaneously.
from selenium import webdriver
import time
from threading import Thread

def test_function():
    driver = webdriver.Chrome()
    driver.get("https://www.google.com")
    time.sleep(3)

if __name__ == '__main__':
    Thread(target=test_function()).start()
    Thread(target=test_function()).start()
Executing this code, the goal is that two Chrome windows open simultaneously, go to Google, and then wait for 3 seconds. Instead, all that happens is that the function is called two times in a serial manner.
The behavior you are seeing is because you are calling test_function() when you pass it as a target. Rather than calling the function, pass just the callable name (test_function), like this:
Thread(target=test_function).start()
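Putting it together, here is the original snippet with just that one fix applied:

from selenium import webdriver
import time
from threading import Thread

def test_function():
    driver = webdriver.Chrome()          # each thread gets its own driver
    driver.get("https://www.google.com")
    time.sleep(3)

if __name__ == '__main__':
    # no parentheses after test_function, so the Thread itself calls it
    Thread(target=test_function).start()
    Thread(target=test_function).start()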
If you want to run whole test suites in parallel, a testing framework such as pytest with the pytest-xdist plugin will do it for you. Here is a quick sketch to get you going.
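A minimal sketch, assuming pytest and pytest-xdist are installed; running "pytest -n 2" executes the tests in two workers, each with its own driver:

import pytest
from selenium import webdriver

@pytest.fixture
def driver():
    d = webdriver.Chrome()
    yield d            # hand the driver to the test
    d.quit()           # teardown runs even if the test fails

def test_search_page_one(driver):
    driver.get("https://www.google.com")

def test_search_page_two(driver):
    driver.get("https://www.google.com")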
Related
I am using Selenium and chrome-driver to scrape data from some pages and then run some additional tasks with that information (for example, typing some comments on some pages).
My program has a button. Every time it's pressed it calls the thread_(self) method (below), starting a new thread. The target function self.main has the code to run all the Selenium work on a chrome-driver.
def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()
My problem is that after the user presses the button the first time, this th thread will open browser A and do some stuff. While browser A is still working, the user will press the button again, opening browser B that runs the same self.main. I want each opened browser to run simultaneously. The problem I faced is that when I run that thread function, the first browser stops and the second browser is opened.
I know my code can create threads infinitely, and I know this will affect PC performance, but I am OK with that. I want to speed up the work done by self.main!
Threading for Selenium speed-up
Consider the following functions to exemplify how threads with Selenium give some speed-up compared to a single-driver approach. The code below scrapes the HTML title from each page opened by Selenium, using BeautifulSoup. The list of pages is links.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def create_driver():
    """returns a new chrome webdriver"""
    chromeOptions = webdriver.ChromeOptions()
    # make it not visible; comment this out if you like seeing the opened browsers
    chromeOptions.add_argument("--headless")
    return webdriver.Chrome(options=chromeOptions)

def get_title(url, driver=None):
    """get the url html title using BeautifulSoup

    if driver is None, use a new chrome-driver and quit() it afterwards;
    otherwise use the driver provided and don't quit() it afterwards
    """
    def print_title(driver):
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        item = soup.find('title')
        print(item.string.strip())

    if driver:
        print_title(driver)
    else:
        driver = create_driver()
        print_title(driver)
        driver.quit()

links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/",
         "https://www.facebook.com/", "https://www.wikipedia.org/", "https://us.yahoo.com/?p=us",
         "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]
Now let's call get_title on the links above.
Sequential approach
A single chrome driver, passing all the links sequentially. Takes 22.3 s on my machine (note: Windows).
start_time = time.time()
driver = create_driver()
for link in links:  # could be 'like' clicks
    get_title(link, driver)
driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")
Multiple threads approach
Using one thread per link results in 10.5 s, more than 2x faster.
start_time = time.time()
threads = []
for link in links:  # each thread could be like a new 'click'
    th = threading.Thread(target=get_title, args=(link,))
    th.start()  # could `time.sleep` between 'clicks' to see what's up without the headless option
    threads.append(th)
for th in threads:
    th.join()  # main thread waits for the threads to finish
print("multiple threads took ", (time.time() - start_time), " seconds")
This here and this better one are some other working examples. The second uses a fixed number of threads on a ThreadPool, and suggests that storing the chrome-driver instance initialized on each thread is faster than creating and starting one every time.
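A minimal sketch of that fixed-pool idea, reusing the create_driver and get_title functions above (the threading.local trick is my own addition, not taken from the linked examples):

import threading
from multiprocessing.pool import ThreadPool

thread_local = threading.local()

def get_thread_driver():
    # create one driver per worker thread and reuse it for every later task
    if not hasattr(thread_local, "driver"):
        thread_local.driver = create_driver()
    return thread_local.driver

def get_title_pooled(url):
    get_title(url, get_thread_driver())

pool = ThreadPool(4)  # 4 worker threads, each keeping its own driver
pool.map(get_title_pooled, links)
pool.close()
pool.join()
# note: the per-thread drivers are never quit() in this sketch; a real
# script should track them and quit each one when the work is done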
Still, I was not sure this was the optimal approach for Selenium to get considerable speed-ups, since threading over non-I/O-bound code ends up being executed sequentially (one thread after another): due to the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (utilize multiple CPU cores).
Processes for Selenium speed-up
To try to overcome the Python GIL limitation, I wrote the following code using the multiprocessing package and its Process class, and ran multiple tests. I even added random page hyperlink clicks to the get_title function above. The additional code is here.
import multiprocessing

start_time = time.time()
processes = []
for link in links:  # each process a new 'click'
    ps = multiprocessing.Process(target=get_title, args=(link,))
    ps.start()  # could sleep between 'clicks' with `time.sleep(1)`
    processes.append(ps)
for ps in processes:
    ps.join()  # main process waits for the processes to finish
print("multiple processes took ", (time.time() - start_time), " seconds")
Contrary to what I expected, Python multiprocessing.Process-based parallelism for Selenium was on average around 8% slower than threading.Thread. But obviously both were on average more than twice as fast as the sequential approach. I just found out that Selenium chrome-driver commands use HTTP requests (like POST, GET), so the work is I/O-bound; it therefore releases the Python GIL, indeed making the threads run in parallel.
Threading is a good start for Selenium speed-up
This is not a definitive answer, as my tests were only a tiny example. Also, I'm using Windows, where multiprocessing has many limitations: each new Process is not a fork like on Linux, which means, among other downsides, that a lot of memory is wasted.
Taking all that into account: it seems that, depending on the use case, threads may be as good as or better than the heavier approach of processes (especially for Windows users).
Try this:

def thread_(self):
    th = threading.Thread(target=self.main)
    self.jobs.append(th)  # assumes self.jobs is a list created in __init__
    th.start()
info: https://pymotw.com/2/threading/
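For context, a minimal sketch of how the surrounding class could look (the class name and main body are placeholders, not from the question):

import threading

class App:
    def __init__(self):
        self.jobs = []  # holds one Thread per opened browser

    def main(self):
        pass  # the selenium work for one browser goes here

    def thread_(self):
        th = threading.Thread(target=self.main)
        self.jobs.append(th)
        th.start()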
Program Logic
I'm opening multiple Selenium threads from a list using the threading library in Python 3.
These threads are stored in an array from which they're started like this:
for each_thread in browser_threads:
    each_thread.start()
for each_thread in browser_threads:
    each_thread.join()
Each thread calls a function that starts the Selenium Firefox browser. The function is as follows:
Browser Function
# proxy browser session
def proxy_browser(proxy):
    global arg_pb_timesec
    global arg_proxyurl
    global arg_youtubevideo
    global arg_browsermode
    # recheck proxyurl
    if arg_proxyurl == '':
        arg_proxyurl = 'https://www.duckduckgo.com/'
    # apply proxy to firefox using desired capabilities
    PROX = proxy
    webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
        "httpProxy": PROX,
        "ftpProxy": PROX,
        "sslProxy": PROX,
        "proxyType": "MANUAL"
    }
    options = Options()
    # for browser mode
    options.headless = False
    if arg_browsermode == 'headless':
        options.headless = True
    driver = webdriver.Firefox(options=options)
    try:
        print(f"{c_green}[URL] >> {c_blue}{arg_proxyurl}{c_white}")
        print(f"{c_green}[Proxy Used] >> {c_blue}{proxy}{c_white}")
        print(f"{c_green}[Browser Mode] >> {c_blue}{arg_browsermode}{c_white}")
        print(f"{c_green}[TimeSec] >> {c_blue}{arg_pb_timesec}{c_white}\n\n")
        driver.get(arg_proxyurl)
        time.sleep(2)  # seconds
        # check if redirected to google captcha (for quitting abused proxies)
        if "google.com/sorry/" not in driver.current_url:
            # if youtube view mode
            if arg_youtubevideo:
                delay_time = 5  # seconds
                # if delay time is more than timesec for proxybrowser
                if delay_time > arg_pb_timesec:
                    # increase proxybrowser timesec
                    arg_pb_timesec += 5
                # wait for the web element to load
                try:
                    player_elem = WebDriverWait(driver, delay_time).until(EC.presence_of_element_located((By.ID, 'movie_player')))
                    togglebtn_elem = WebDriverWait(driver, delay_time).until(EC.presence_of_element_located((By.ID, 'toggleButton')))
                    time.sleep(2)
                    # click player
                    webdriver.ActionChains(driver).move_to_element(player_elem).click(player_elem).perform()
                    try:
                        # click autoplay button to disable autoplay
                        webdriver.ActionChains(driver).move_to_element(togglebtn_elem).click(togglebtn_elem).perform()
                    except Exception:
                        pass
                except TimeoutException:
                    print("Loading video control taking too much time!")
        else:
            print(f"{c_red}[Network Error] >> Abused Proxy: {proxy}{c_white}")
        driver.close()
        driver.quit()
        # if proxy not in abused_proxies:
        #     abused_proxies.append(proxy)
    except Exception as e:
        print(f"{c_red}{e}{c_white}")
        driver.close()
        driver.quit()
What the above does is start the browser with a proxy and check that the redirected URL is not Google reCAPTCHA (to avoid getting stuck on the abused-proxies page); if the youtube video argument is passed, it then waits for the movie player to load and clicks it to start playback.
It is sort of a view bot for websites as well as YouTube.
Problem
The threads appear to end, but they keep running in the background. The browser windows never quit, and the script exits with all browser threads running forever!
I tried every Stack Overflow solution and various methods, but nothing works. Here is the only relevant SO question, which is also not that relevant since that OP is spawning os.system processes, which I'm not: python daemon thread exits but process still run in the background
EDIT: Even when the whole page is loaded, the YouTube clicker does not work and there is no exception. The threads appear to stop after a network error, but there is no error?!
Entire Script
As suggested by previous Stack Overflow programmers, I kept the code here minimal and reproducible. But if you need the entire logic, it's here: https://github.com/ProHackTech/FreshProxies/blob/master/fp.py
As you are starting multiple threads and joining them as follows:
for each_thread in browser_threads:
    each_thread.start()
for each_thread in browser_threads:
    each_thread.join()
At this point, it is worth noting that WebDriver is not thread-safe. Having said that, if you can serialise access to the underlying driver instance, you can share a reference in more than one thread. This is not advisable, but you can always instantiate one WebDriver instance for each thread.
Ideally, the issue of thread-safety isn't in your code but in the actual browser bindings. They all assume there will only be one command at a time (as with a real user). On the other hand, you can always instantiate one WebDriver instance for each thread, which will launch multiple browsing tabs/windows. Up to this point, your program seems fine.
Now, different threads can be run on the same WebDriver, but then the results of the tests will not be what you expect. The reason is that when you use multi-threading to run different tests on different tabs/windows, a little thread-safety coding is required, or else the actions you perform, like click() or send_keys(), will go to the opened tab/window that currently has focus, regardless of the thread you expect to be running. Which essentially means all the tests will run simultaneously on the same tab/window that has focus, not on the intended tab/window.
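For illustration, a minimal sketch of the one-WebDriver-per-thread pattern (the worker function and URLs are assumptions, not from the question):

import threading
from selenium import webdriver

def worker(url):
    driver = webdriver.Firefox()  # each thread owns its own driver
    try:
        driver.get(url)
        # ... per-thread selenium work goes here ...
    finally:
        driver.quit()  # always release this thread's browser

browser_threads = [threading.Thread(target=worker, args=(u,))
                   for u in ("https://example.com", "https://example.org")]
for each_thread in browser_threads:
    each_thread.start()
for each_thread in browser_threads:
    each_thread.join()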
Reference
You can find a relevant detailed discussion in:
Chrome crashes after several hours while multiprocessing using Selenium through Python
I have a very complex py.test Python-Selenium test setup where I create a Firefox webdriver inside a py.test fixture. Here is a rough idea of what I am doing:
'driver.py':
class Driver(object):
    """
    Driver class with basic wrappers around the selenium webdriver
    and other convenience methods.
    """
    def __init__(self, config, options):
        """Sets the driver and the config."""
        self.remote = options.getoption("--remote")
        self.headless = not options.getoption("--with-head")
        if self.headless:
            self.display = Display(visible=0, size=(13660, 7680))
            self.display.start()
        # Start the selenium webdriver
        self.webdriver = firefox_module.get_driver()
'conftest.py':
@pytest.fixture
def basedriver(config, options):
    driver = driver.Driver(config, options)
    yield driver
    print("Debug 1")
    driver.webdriver.quit()
    print("Debug 2")
When running the test, I can only see Debug 1 printed out. The whole process stops at this point and does not seem to proceed; the whole Selenium test is stuck at webdriver.quit().
The tests, however, completed successfully...
What reasons could there be for that behavior?
Addendum:
The reason why the execution hangs seems to be a popup that asks the user if he wants to leave the page because of unsaved data. That means that the documentation for the quit method is incorrect. It states:
Quits the driver and close every associated window.
This is a non-trivial problem, to which Selenium reacts really inconsistently. The quit method should, as documented, just close the browser window(s), but it does not. Instead you get a popup asking the user if he wants to leave the page.
The nasty thing is that this popup appears only after the user called
driver.quit()
One way to fix this is to set the following profile for the driver:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
# other settings here
profile.set_preference("dom.disable_beforeunload", True)
driver = webdriver.Firefox(firefox_profile = profile)
This close warning is enabled by default in Firefox, as you can see in about:config, and you can disable it for your profile.
And since,
The reason why the execution hangs seems to be a popup that asks the
user if he wants to leave the page because of unsaved data.
You can set browser.tabs.warnOnClose in your Firefox configuration profile as follows:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.tabs.warnOnClose", False)
driver = webdriver.Firefox(firefox_profile = profile)
You can look at profile.DEFAULT_PREFERENCES, which is the JSON at python/site-packages/selenium/webdriver/firefox/webdriver_prefs.json.
As far as I understood, there are basically two questions asked, which I will try to answer:
Why does a failing driver.webdriver.quit() call leave the script in a hung/unresponsive state instead of raising an exception?
Why was the test case still a pass if the script never completed its execution cycle?
To answer the first question, I will try to explain the Selenium architecture, which will clear up most of our doubts.
So how does Selenium WebDriver function?
Every statement or command you write using a Selenium client library is converted to the JSON Wire Protocol over HTTP, which is passed to the browser driver (chromedriver, geckodriver). These generated HTTP requests (following a REST architecture) reach the browser driver. Inside the browser driver is an HTTP server, which internally passes the received request on to the real browser; the browser produces the appropriate response and sends it back to the browser driver, which in turn uses the JSON Wire Protocol to send the response back to the Selenium client library, which finally decides how to proceed based on the response received.
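As a rough illustration of one round trip (the endpoint follows the classic JSON Wire Protocol; the port and session id are just examples, 9515 being chromedriver's default):

# driver.get("https://www.google.com") becomes roughly:
#     POST http://localhost:9515/session/<session-id>/url
#     {"url": "https://www.google.com"}
# the client library then blocks until the driver replies,
# e.g. {"value": null}, before your script continues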
Coming back to the question of why the script hangs: we can conclude that the browser is still working on a received request, so no response is sent back to the browser driver, which in turn leaves the Selenium library's quit() call on hold, waiting for the request to complete.
There is a variety of workarounds available, one of which is already explained by Alex. But I believe there is a better way to handle such conditions, as in my experience Selenium can leave us in a hung/frozen state in other cases too; I personally prefer a thread-kill approach with a timeout, since the Selenium object always runs in the main() thread. We can allocate a specific time to the call and kill it if the timeout is reached.
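One way to realise that kill-on-timeout idea (a sketch under my own assumptions, not this answer's exact code): run quit() in a helper thread, and if it does not return in time, kill the driver process instead.

import threading

def quit_with_timeout(driver, timeout=10):
    t = threading.Thread(target=driver.quit)
    t.daemon = True  # don't keep the interpreter alive for this helper
    t.start()
    t.join(timeout)
    if t.is_alive():                   # quit() is still hanging
        driver.service.process.kill()  # kill the geckodriver/chromedriver process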
Now moving to the second question:
Why was the test case still a pass if the script never completed its execution cycle?
Well, I don't have much idea of how pytest works, but I do have a basic idea of how a test engine operates, based on which I will try to answer this one.
For starters, it's not at all possible for any test case to pass until the full script run is completed. If your test cases are passing anyway, there are very few possible scenarios:
Your test methods never used the method that leaves the whole execution hanging/frozen.
You called the method inside the test teardown environment (w.r.t. the TestNG test engine in Java: @AfterClass, @AfterTest, @AfterGroups, @AfterMethod, @AfterSuite), meaning your test execution was already completed. This might be why the tests show up as successfully completed.
I am still not sure of the proper cause for the second scenario. I will keep looking and update the post if I come up with something.
@Alex: can you update the question with a better description, i.e. your current test design, which I can explore to find a better explanation?
So I was able to reproduce your issue using the sample HTML file below:
<html>
<body>
Please enter a value for me: <input name="name" >
<script>
window.onbeforeunload = function(e) {
return 'Dialog text here.';
};
</script>
<h2>ask questions on exit</h2>
</body>
</html>
Then I ran a sample script which reproduces the hang:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://localhost:8090/index.html")
driver.find_element_by_name("name").send_keys("Tarun")
driver.quit()
This hangs Selenium Python indefinitely, which is not a good thing as such. The issue is that window.onload and window.onbeforeunload are tough for Selenium to handle because of the life cycle of when they happen: onload fires even before Selenium has injected its own code to suppress alert and confirm. I am pretty sure onbeforeunload is also out of Selenium's reach.
So there are multiple ways to get around this.
Change in app
Ask the devs not to use onload or onbeforeunload events. Will they listen? Not sure.
Disable beforeunload in profile
This is what you have already posted in your answer
from selenium import webdriver
profile = webdriver.FirefoxProfile()
# other settings here
profile.set_preference("dom.disable_beforeunload", True)
driver = webdriver.Firefox(firefox_profile = profile)
Disable the events through code

try:
    # clear the handlers so the popup cannot fire during quit()
    driver.execute_script("window.onunload = null; window.onbeforeunload = null")
except Exception:
    pass  # ignore failures; quit() below should still run
driver.quit()
This works only if you don't have multiple tabs open, or if the tab that would generate the popup has focus. But it is a good generic way to handle this situation.
Not letting Selenium hang
Well, the reason Selenium hangs is that it sends a request to geckodriver, which then sends it to Firefox, and one of them just doesn't respond while it waits for the user to close the dialog. The problem is that the Selenium Python driver doesn't set any timeout on this connection.
Solving the problem is as simple as adding the two lines of code below:

import socket

socket.setdefaulttimeout(10)
try:
    driver.quit()
finally:
    # set it back to something higher if you want
    socket.setdefaulttimeout(60)
But the issue with this approach is that the driver/browser will still not be closed. This is where you need an even more robust approach that kills the browser, as discussed in the answer below:
In Python, how to check if Selenium WebDriver has quit or not?
Code from the above link, to make the answer complete:
from selenium import webdriver
import psutil

driver = webdriver.Firefox()
driver.get("http://tarunlalwani.com")

driver_process = psutil.Process(driver.service.process.pid)
if driver_process.is_running():
    print("driver is running")
    firefox_process = driver_process.children()
    if firefox_process:
        firefox_process = firefox_process[0]
        if firefox_process.is_running():
            print("Firefox is still running, we can quit")
            driver.quit()
        else:
            print("Firefox is dead, can't quit. Let's kill the driver")
            driver_process.kill()
else:
    print("driver has died")
The best way to guarantee your teardown code runs in pytest is to define a finalizer function and add it as a finalizer to that fixture. This guarantees that even if something fails before the yield command, you still get your teardown.
To avoid a popup hanging up your teardown, invest in some WebDriverWait(...).until(...) calls that time out whenever you want them to: the popup appears, the test cannot proceed, the wait times out, and teardown is called.
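A minimal sketch of that finalizer pattern, reusing the fixture from the question (the request argument is pytest's built-in fixture; the module import is an assumption):

import pytest
import driver as driver_module  # the module containing the Driver class above

@pytest.fixture
def basedriver(config, options, request):
    d = driver_module.Driver(config, options)

    def teardown():
        d.webdriver.quit()  # runs even if setup fails after this point

    request.addfinalizer(teardown)
    return d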
For ChromeDriver users:

options = Options()
options.add_argument('no-sandbox')
# ... pass these options when creating the driver; then, when done:
driver.close()
driver.quit()
credits to...
https://bugs.chromium.org/p/chromedriver/issues/detail?id=1135
I have the following scraping function already implemented serially but, because there are multiple URLs with data, I would like to parallelize some of the work. Here is the working serial code:
from bs4 import BeautifulSoup as bs
import requests

edbURL = 'URL1'
psnURL = 'URL2'

def urlScraper(URL):
    page = requests.get(URL)
    soup = bs(page.text, 'lxml')
    l = ['base_URL' + str(i.a['href']) for i in soup.find_all('div', class_='info')]
    return l

edbs = urlScraper(edbURL)
psns = urlScraper(psnURL)
What I would like is for the two calls to urlScraper(URL) to each get their own thread and run in parallel. I tried using the threads library but only got some big nasty int returns (presumably the thread identifiers) with the following syntax:

edbs = threads.start_new_thread(urlScraper, (edbURL,))
psns = threads.start_new_thread(urlScraper, (psnURL,))

I figure it has something to do with the return in urlScraper(URL). Then again, I basically know almost nothing about anything. Thanks for any help, everyone!
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
https://docs.python.org/2/library/multiprocessing.html
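For the two URLs above, a minimal sketch using multiprocessing.Pool to run both scrapes in parallel and actually collect the return values (which bare start_new_thread cannot do):

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(2) as pool:  # one worker process per URL
        edbs, psns = pool.map(urlScraper, [edbURL, psnURL])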
I'm trying to lower execution time by multithreading the more time-consuming parts of my script, which are mostly the locator calls.
However, I keep getting CannotSendRequest and ResponseNotReady exceptions from the two threads.
Is this because I'm using the same HTTP handle?
input_worker = threading.Thread(name="input_worker", target=find_input_fields, args=(form, args, logger))
input_worker.setDaemon(True)
select_worker = threading.Thread(name="select_worker", target=find_select_fields, args=(form, logger))
select_worker.setDaemon(True)
thread_pool.append(input_worker)
thread_pool.append(select_worker)
And in the find_input_fields function there is something like:
input_fields = form.find_elements_by_tag_name("input")
Selenium takes one CPU core per thread, and multi-threading is not generally suggested for Selenium WebDriver. If you have a 4-core system, you can run 4 separate Selenium threads, one linked to each core.
Since both of your threads issue commands through the same driver instance, and therefore the same HTTP connection to the browser, you are getting these exceptions from the two threads.
FYI
Is it possible to parallelize selenium webdriver get_attribute calls in python?
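If the two workers must share one driver, access can be serialised with a lock, echoing the earlier answer's point; this removes the exceptions but also removes most of the parallelism. A sketch reusing the question's function names:

import threading

driver_lock = threading.Lock()  # one selenium command at a time

def find_input_fields(form, args, logger):
    with driver_lock:
        return form.find_elements_by_tag_name("input")

def find_select_fields(form, logger):
    with driver_lock:
        return form.find_elements_by_tag_name("select")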