I'm trying to lower execution time by multithreading the more time-consuming parts of my script, which are mostly the locator calls.
However, I keep getting "CannotSendRequest" and "ResponseNotReady" exceptions from the two threads.
Is this because I'm using the same http handle?
import threading

thread_pool = []

input_worker = threading.Thread(name="input_worker", target=find_input_fields, args=(form, args, logger))
input_worker.setDaemon(True)
select_worker = threading.Thread(name="select_worker", target=find_select_fields, args=(form, logger))
select_worker.setDaemon(True)
thread_pool.append(input_worker)
thread_pool.append(select_worker)
And the find_input_fields function contains something like
input_fields = form.find_elements_by_tag_name("input")
Selenium uses roughly one CPU core per browser instance, and multithreading a single WebDriver is not recommended. On a 4-core system you could instead run four separate Selenium instances, one per core.
Because your two threads share the same driver (and therefore the same underlying HTTP connection to it), their requests collide, which is why you are getting those exceptions.
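If you still want the two lookups to run concurrently, one pattern that avoids sharing a driver (and its HTTP connection) between threads is to give each thread its own driver and have it open the page and locate the form itself. A rough sketch, where the URL is a placeholder and the find_* functions are simplified versions of the asker's:

import threading
from selenium import webdriver

def find_input_fields(form):
    return form.find_elements_by_tag_name("input")

def find_select_fields(form):
    return form.find_elements_by_tag_name("select")

def worker(find_fields, url):
    driver = webdriver.Chrome()  # each thread gets its own driver
    try:
        driver.get(url)
        form = driver.find_element_by_tag_name("form")  # re-locate the form in this driver
        print(find_fields(form))
    finally:
        driver.quit()

threads = [
    threading.Thread(target=worker, args=(find_input_fields, "https://example.com")),
    threading.Thread(target=worker, args=(find_select_fields, "https://example.com")),
]
for th in threads:
    th.start()
for th in threads:
    th.join()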
FYI: Is it possible to parallelize selenium webdriver get_attribute calls in python?
I'm running a python selenium script using ChromeDriver that goes to one (and only one) URL, enters data into the form, then scrapes the results. I would like to do this in parallel by entering different data into the same URL's form and getting the different results.
In researching Multiprocessing or Multithreading I have found Multithreading is best for I/O bound tasks and Multiprocessing best for CPU bound tasks.
The overall amount of data I'm scraping is small (select text only), so I don't believe it is I/O bound? Does that sound correct? From what I've gathered, web scrapers in general are I/O intensive; maybe my scenario is just an exception?
Running my current (sequential, non-parallel) script, Resource Monitor shows the chrome instance's CPU usage ramp up AND spread across all (4) cores. So is chrome using multiprocessing by default, and is the advantage of multiprocessing within python really just being able to apply the script's function to each chrome instance? Maybe I've got this all wrong...
Also, is a script that wants to open multiple URLs at once and interact with them inherently CPU bound, due to the fact that it runs a lot of chrome instances? Assuming the data scraped is small. Ignoring headless for now.
Image attached of CPU usage; the spike in the middle (across all 4 CPUs) is when chrome is launched.
Any comments or advice appreciated, including any pseudo code on how you might implement something like this. I didn't share my base code; the question is more about the structure of all this.
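Something along these lines is roughly the structure I have in mind - one worker per input payload, each with its own headless driver. This is only a sketch: FORM_URL, the field name "q", and the result locator are made-up placeholders.

from concurrent import futures
from selenium import webdriver

FORM_URL = "https://example.com/form"  # placeholder for the single URL being scraped

def fill_and_scrape(payload):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(FORM_URL)
        field = driver.find_element_by_name("q")  # placeholder form field
        field.send_keys(payload)                  # enter this worker's data
        field.submit()
        return driver.find_element_by_tag_name("body").text  # scrape the result
    finally:
        driver.quit()

payloads = ["data1", "data2", "data3"]
with futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(fill_and_scrape, payloads))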
I am using selenium and chrome-driver to scrape data from some pages and then run some additional tasks with that information (for example, type some comments on some pages)
My program has a button. Every time it's pressed it calls thread_(self) (below), starting a new thread. The target function self.main has the code to run all the selenium work on a chrome-driver.
def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()
My problem is that after the user presses the button the first time, this th thread will open browser A and do some stuff. While browser A is doing its work, the user may press the button again, opening browser B that runs the same self.main. I want each opened browser to run simultaneously. The problem I'm facing is that when I run that thread function, the first browser stops and the second browser is opened.
I know my code can create threads infinitely. And I know that this will affect the pc performance but I am ok with that. I want to speed up the work done by self.main!
Threading for selenium speed up
Consider the following functions to exemplify how threads with selenium give some speed-up compared to a single-driver approach. The code below scrapes the HTML title from a page opened by selenium, using BeautifulSoup. The list of pages is links.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def create_driver():
    """returns a new chrome webdriver"""
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")  # make it not visible; just comment this out if you like seeing the opened browsers
    return webdriver.Chrome(options=chromeOptions)

def get_title(url, webdriver=None):
    """get the url html title using BeautifulSoup
    if webdriver is None, uses a new chrome-driver and quit()s it after,
    otherwise uses the driver provided and doesn't quit() it after"""
    def print_title(driver):
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        item = soup.find('title')
        print(item.string.strip())

    if webdriver:
        print_title(webdriver)
    else:
        webdriver = create_driver()
        print_title(webdriver)
        webdriver.quit()

links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/", "https://www.facebook.com/", "https://www.wikipedia.org/",
         "https://us.yahoo.com/?p=us", "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]
Now calling get_title on the links above.
Sequential approach
A single chrome driver, passing all links sequentially. Takes 22.3 s on my machine (note: Windows).
start_time = time.time()
driver = create_driver()

for link in links:  # could be 'like' clicks
    get_title(link, driver)

driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")
Multiple threads approach
Using a thread for each link. Results in 10.5 s, more than 2x faster.
start_time = time.time()
threads = []

for link in links:  # each thread could be like a new 'click'
    th = threading.Thread(target=get_title, args=(link,))
    th.start()  # could `time.sleep` between 'clicks' to see what's up without the headless option
    threads.append(th)

for th in threads:
    th.join()  # main thread waits for the threads to finish

print("multiple threads took ", (time.time() - start_time), " seconds")
Here and here are two other working examples. The second one uses a fixed number of threads in a ThreadPool, and suggests that storing the chrome-driver instance initialized in each thread is faster than creating and starting it every time.
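To illustrate that idea (this is a sketch of the pattern, not the code from those links), the driver can be cached per pool thread with threading.local(), reusing create_driver and get_title from above:

thread_local = threading.local()

def get_cached_driver():
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = create_driver()
        thread_local.driver = driver  # reused by every url this thread handles
    return driver

def get_title_cached(url):
    get_title(url, get_cached_driver())

A ThreadPool or futures.ThreadPoolExecutor can then map get_title_cached over links, and each pool thread keeps reusing its own browser.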
Still, I was not sure this was the optimal approach for getting considerable speed-ups out of selenium, since threading on non-I/O-bound code ends up being executed sequentially (one thread after another): due to the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (i.e. utilize multiple cpu cores).
Processes for selenium speed up
To try to overcome the Python GIL limitation I used the multiprocessing package and its Process class; I wrote the following code and ran multiple tests. I even added random page hyperlink clicks to the get_title function above. Additional code is here.
import multiprocessing  # note: on Windows, Process spawning must happen under an `if __name__ == "__main__":` guard

start_time = time.time()
processes = []

for link in links:  # each process a new 'click'
    ps = multiprocessing.Process(target=get_title, args=(link,))
    ps.start()  # could sleep between 'clicks' with `time.sleep(1)`
    processes.append(ps)

for ps in processes:
    ps.join()  # main process waits for the processes to finish

print("multiple processes took ", (time.time() - start_time), " seconds")
Contrary to what I expected, Python multiprocessing.Process-based parallelism for selenium was on average around 8% slower than threading.Thread. But obviously both were, on average, more than twice as fast as the sequential approach. I just found out that selenium chrome-driver commands use HTTP requests (like POST, GET), so they are I/O bound and therefore release the Python GIL, which indeed makes the work run in parallel across threads.
Threading is a good start for selenium speed up
This is not a definitive answer, as my tests were only a tiny example. Also, I'm using Windows, and multiprocessing has many limitations in this case: each new Process is not a fork like on Linux, meaning, among other downsides, that a lot of memory is wasted.
Taking all that into account: it seems that, depending on the use case, threads may be as good as or better than the heavier approach of processes (especially for Windows users).
try this:
def thread_(self):
    th = threading.Thread(target=self.main)
    self.jobs.append(th)  # assumes a `self.jobs` list exists to keep track of started threads
    th.start()
info: https://pymotw.com/2/threading/
I am using selenium and Python to do a big project. I have to go through 320,000 (320K) webpages one by one, scrape details, then sleep for a second and move on.
Like below:
import time
from selenium import webdriver

browser = webdriver.Chrome()
links = ["https://www.thissite.com/page=1", "https://www.thissite.com/page=2", "https://www.thissite.com/page=3"]

for link in links:
    browser.get(link)
    scrapedinfo = browser.find_element_by_xpath("*//div/productprice").text
    open("file.csv", "a+").write(scrapedinfo)
    time.sleep(1)
The greatest problem: it's too slow!
With this script it will take days or maybe weeks.
Is there a way to increase speed, such as by visiting multiple links at the same time and scraping them all at once?
I have spent hours searching for answers on Google and Stack Overflow and only found out about multiprocessing.
But I am unable to apply it in my script.
Threading approach
You should start with threading.Thread, as it will give you a considerable performance boost (explained here). Threads are also lighter than processes. You can use a futures.ThreadPoolExecutor, with each thread using its own webdriver. Consider also adding the headless option for your webdriver. Example below using a chrome-webdriver:
from concurrent import futures
from selenium import webdriver

def selenium_work(url):
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    # <actual work that needs to be done by selenium>
    driver.quit()

# the default number of threads is optimized for the cpu cores,
# but you can set it with `max_workers`, e.g. `futures.ThreadPoolExecutor(max_workers=...)`
with futures.ThreadPoolExecutor() as executor:
    # store the url for each thread as a dict, so we can know which url fails
    future_results = {url: executor.submit(selenium_work, url) for url in links}
    for url, future in future_results.items():
        try:
            future.result()  # can use `timeout` to wait a maximum number of seconds for each thread
        except Exception as exc:  # a thread can raise an exception
            print('url {0} generated an exception: {1}'.format(url, exc))
Consider also storing the chrome-driver instance initialized on each thread using threading.local(). From here they reported a reasonable performance improvement.
Consider whether using BeautifulSoup directly on the page source from selenium can give some further speed-up. It's a very fast and well-established package. Example: something like driver.get(url) ... soup = BeautifulSoup(driver.page_source, "lxml") ... result = soup.find('a')
Other approaches
Although I personally did not see much benefit in using concurrent.futures.ProcessPoolExecutor(), you could experiment with it. In fact, it was slower than threads in my experiments on Windows. Also, on Windows the python Process has many limitations.
Consider whether your use case can be satisfied by using arsenic, an asynchronous webdriver client built on asyncio. It really sounds promising, though it has many limitations.
Consider whether Requests-HTML solves your problems with javascript loading, since it claims full JavaScript support. In that case you could use it with BeautifulSoup in a standard data-scraping methodology, as in the sketch below.
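A minimal Requests-HTML sketch of that idea (the URL is just for illustration; render() downloads Chromium on first use and executes the page's JavaScript before you parse the result):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.wikipedia.org/")
r.html.render()  # run the page's JavaScript
title = r.html.find("title", first=True)
print(title.text)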
You can use parallel execution. Divide the list of sites into, for example, ten test cases that all use the same code; only the method names will differ (method1, method2, method3, ...). You will increase the speed. The number of browsers depends on your hardware's performance.
See more on https://www.guru99.com/sessions-parallel-run-and-dependency-in-selenium.html
The main thing is to use TestNG, edit the .xml file, and set how many threads you want to use. Like this:
<suite name="TestSuite" thread-count="10" parallel="methods" >
If the website you are scraping is not too security-oriented against bots, it is better to use Requests; it will reduce your time from days to a couple of hours when combined with multi-threading and multi-processing. The steps are too long to go over fully, so here is just the idea:
import concurrent.futures
from multiprocessing import Process

def scrapper(link):
    # placeholder for the actual Requests-based scraping of a single link
    pass

def threader_run(data):
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for i in data:
            futures.append(executor.submit(scrapper, i))
        for future in concurrent.futures.as_completed(futures):
            print(future.result())

data = {}
data['process1'] = []  # fill each list with one chunk of your links
data['process2'] = []
data['process3'] = []

if __name__ == "__main__":
    jobs = []
    for x in data:
        p = Process(target=threader_run, args=(data[x],))
        jobs.append(p)
        p.start()
        print(f'Started - {x}')
Basically, what this does is: first compile all the links, then split them into 3 arrays so that 3 processes run simultaneously (you could run more processes depending on your cpu cores and how data-intensive these jobs are). After that, those arrays could be split further, into more than 10 or even 100 chunks depending on your project size. Each process runs a thread pool with a maximum of 8 workers, and each worker runs your final function.
Here, with 3 processes and 8 workers each, you are looking at roughly a 24x speed boost. However, using the Requests library is necessary: if you used selenium for this, normal computers/laptops would freeze, because it would mean 24 browsers running simultaneously. A sketch of the link-splitting step is below.
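The link-splitting step is not shown in the snippet above; one simple way to fill the data dict (placeholder URLs, 3 chunks to match the 3 processes) could be:

links = [f"https://www.thissite.com/page={i}" for i in range(1, 301)]  # placeholder list of links

chunks = 3
data = {}
for n in range(chunks):
    data[f"process{n + 1}"] = links[n::chunks]  # every 3rd link goes to one process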
The goal is to write a python script that opens a specific website, fills out some inputs and then submits it. This should be done with different inputs for the same website simultaneously.
I tried using Thread from threading and some other things, but I can't make them run simultaneously.
from selenium import webdriver
import time
from threading import Thread

def test_function():
    driver = webdriver.Chrome()
    driver.get("https://www.google.com")
    time.sleep(3)

if __name__ == '__main__':
    Thread(target=test_function()).start()
    Thread(target=test_function()).start()
Executing this code, the goal is that 2 chrome windows open simultaneously, go to google and then wait 3 seconds. Instead, all that happens is that the function is called two times in a serial manner.
"Now all that's done is that the function is called two times in a serial manner."
The behavior you are seeing is because you are calling test_function() when you pass it as a target. Rather than calling the function, just assign the callable name (test_function).
like this:
Thread(target=test_function).start()
You will need a testing framework like pytest to execute tests in parallel. Here is a quick setup guide to get you going: PythonWebdriverParallel
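As a rough illustration of that route (test names and assertions are just placeholders), the pytest-xdist plugin adds an -n flag that runs tests in parallel worker processes:

# test_google.py - run with: pip install pytest pytest-xdist && pytest -n 2 test_google.py
import time
from selenium import webdriver

def open_google():
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.google.com")
        time.sleep(3)
        assert "Google" in driver.title
    finally:
        driver.quit()

def test_first_window():
    open_google()

def test_second_window():
    open_google()

With -n 2 the two tests run in separate worker processes, so two chrome windows are open at the same time.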
The biggest issue I have with selenium is the long re-opening time of the browser (I use it to scrape every few minutes). I am also using proxies and running multiple browsers with python's threading - all starting/stopping every few minutes (when a new job comes in).
Threading also means only 1 CPU is used and performance suffers.
I've been thinking about starting to use celery (out-of-the-box multi-core support) and making workers (different proxy/browser) run indefinitely (while loop) with open instances of selenium browsers, waiting to get exact URLs to scrape - fed via something like redis.
Is it a good idea to be running continuous tasks like this with celery? Is there any better way to do it?
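Roughly what I have in mind (the broker URL and the scraping body are placeholders):

# tasks.py - run one worker per proxy/browser with: celery -A tasks worker --concurrency=1
from celery import Celery
from selenium import webdriver

app = Celery("tasks", broker="redis://localhost:6379/0")

driver = None  # kept open between tasks inside a worker process

@app.task
def scrape(url):
    global driver
    if driver is None:
        driver = webdriver.Chrome()
    driver.get(url)
    return driver.title

URLs would then be fed to the workers with scrape.delay(url).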
It's never a good idea to hold open instances of selenium indefinitely; best practice is to reopen with each task. So, for your question: in my opinion it's not a good idea.
Let me offer you another architecture instead. Use Docker to run your selenium machines; basically, create a selenium-grid (first result on google) using Docker.
Once everything is set up correctly, the task becomes easy: with multiprocessing, send all the jobs to your selenium hub in parallel, and they will run simultaneously on as many containers as you need.
Once the job is done, you can destroy the containers and start fresh with the next cycle.
Using Docker will also allow you to scale your operation very easily.
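For instance, once a dockerized grid is listening (the hub URL and the job list below are placeholders), each process can point a Remote driver at the hub; a rough sketch:

import multiprocessing
from selenium import webdriver

HUB_URL = "http://localhost:4444/wd/hub"  # address of the selenium-grid hub container

def scrape(url):
    driver = webdriver.Remote(command_executor=HUB_URL, options=webdriver.ChromeOptions())
    try:
        driver.get(url)
        print(url, "->", driver.title)
    finally:
        driver.quit()

if __name__ == "__main__":
    jobs = ["https://www.wikipedia.org/", "https://www.python.org/"]
    processes = [multiprocessing.Process(target=scrape, args=(u,)) for u in jobs]
    for p in processes:
        p.start()
    for p in processes:
        p.join()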