I am using selenium and chrome-driver to scrape data from some pages and then run some additional tasks with that information (for example, type some comments on some pages).
My program has a button. Every time it's pressed it calls thread_(self) (below), starting a new thread. The target function self.main has the code to run all the selenium work on a chrome-driver.
def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()
My problem is that after the user presses the button the first time, this th thread will open browser A and do some work. While browser A is busy, the user will press the button again and open browser B, which runs the same self.main. I want each opened browser to run simultaneously. The problem I face is that when I run that thread function, the first browser stops and the second browser opens.
I know my code can create threads indefinitely, and I know this will affect PC performance, but I am OK with that. I want to speed up the work done by self.main!
Threading for selenium speed up
Consider the following functions to exemplify how threads with selenium give some speed-up compared to a single-driver approach. The code below scrapes the HTML title from a page opened by selenium, using BeautifulSoup. The list of pages is links.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def create_driver():
    """returns a new chrome webdriver"""
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")  # make it not visible; comment this out if you like seeing opened browsers
    return webdriver.Chrome(options=chromeOptions)
def get_title(url, webdriver=None):
    """get the url html title using BeautifulSoup
    if webdriver is None, uses a new chrome-driver and quit()s it after;
    otherwise uses the driver provided and doesn't quit() it after"""
    def print_title(driver):
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        item = soup.find('title')
        print(item.string.strip())

    if webdriver:
        print_title(webdriver)
    else:
        webdriver = create_driver()
        print_title(webdriver)
        webdriver.quit()
links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/", "https://www.facebook.com/", "https://www.wikipedia.org/",
"https://us.yahoo.com/?p=us", "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]
Now calling get_title on the links above.
Sequential approach
A single chrome driver, passing all links sequentially. Takes 22.3 s on my machine (note: Windows).
start_time = time.time()
driver = create_driver()

for link in links:  # could be 'like' clicks
    get_title(link, driver)

driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")
Multiple threads approach
Using a thread for each link. Results in 10.5 s, more than 2x faster.
start_time = time.time()
threads = []

for link in links:  # each thread could be like a new 'click'
    th = threading.Thread(target=get_title, args=(link,))
    th.start()  # could `time.sleep` between 'clicks' to see what's up without the headless option
    threads.append(th)

for th in threads:
    th.join()  # main thread waits for the threads to finish

print("multiple threads took ", (time.time() - start_time), " seconds")
This here and this better are some other working examples. The second uses a fixed number of threads on a ThreadPool, and suggests that storing the chrome-driver instance initialized on each thread is faster than creating and starting it every time.
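As a rough illustration of the fixed-thread-pool variant (not taken from those examples), the pattern could look like the sketch below, reusing create_driver and get_title from above; the pool size of 4 is an arbitrary choice, and this simple version still creates and quits one driver per call rather than caching one per thread:

from multiprocessing.pool import ThreadPool  # a thread-based pool, despite the module name

start_time = time.time()
with ThreadPool(4) as pool:  # fixed number of worker threads
    pool.map(get_title, links)  # each call creates and quits its own driver here
print("thread pool took ", (time.time() - start_time), " seconds")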
Still, I was not sure this was the optimal approach for selenium to get considerable speed-ups, since threading non-I/O-bound code ends up being executed sequentially (one thread after another). Due to the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (utilize multiple CPU cores).
Processes for selenium speed up
To try to overcome the Python GIL limitation, I wrote the following code using the multiprocessing package and its Process class, and I ran multiple tests. I even added random page hyperlink clicks to the get_title function above. Additional code is here.
import multiprocessing  # not imported in the snippet above

start_time = time.time()
processes = []

for link in links:  # each process a new 'click'
    ps = multiprocessing.Process(target=get_title, args=(link,))
    ps.start()  # could sleep 1 between 'clicks' with `time.sleep(1)`
    processes.append(ps)

for ps in processes:
    ps.join()  # main process waits for the processes to finish

print("multiple processes took ", (time.time() - start_time), " seconds")
Contrary to what I expected, Python multiprocessing.Process-based parallelism for selenium was on average around 8% slower than threading.Thread. But obviously both were on average more than twice as fast as the sequential approach. I just found out that selenium chrome-driver commands use HTTP requests (like POST, GET), so they are I/O bound and therefore release the Python GIL, which indeed makes them run in parallel in threads.
Threading is a good start for selenium speed up
This is not a definitive answer, as my tests were only a tiny example. Also, I'm using Windows, and multiprocessing has many limitations in this case. Each new Process is spawned rather than forked as on Linux, meaning, among other downsides, that a lot of memory is wasted.
Taking all that into account: it seems that, depending on the use case, threads may be as good as or better than trying the heavier approach of processes (especially for Windows users).
try this:
def thread_(self):
    th = threading.Thread(target=self.main)
    self.jobs.append(th)  # self.jobs is assumed to be a list initialized elsewhere (e.g. in __init__)
    th.start()
info: https://pymotw.com/2/threading/
Related
I have a dockerized Django app with some threads. Those threads perform periodic tasks and use Selenium and Beautiful Soup to scrape data and save it to the database.
When initializing one thread, the first scrape goes well, the data is checked, and the function sleeps. However, when the sleep finishes, the next scrape isn't performed. This is the thread's head code:
def thread1():
    time_mark = 0
    while True:
        print('Thread1 START')
        op = funct_scrap(thread1_url)
        ...
        sleep(60)
funct_scrap scrapes the web using Selenium; it works the first time but stops after that. I need it to check periodically. On the development server it works well, but now on Docker there is this problem. What's going on?
I am using selenium and Python for a big project. I have to go through 320,000 webpages (320K) one by one, scrape details, then sleep for a second and move on.
Like below:
links = ["https://www.thissite.com/page=1", "https://www.thissite.com/page=2", "https://www.thissite.com/page=3"]

for link in links:
    browser.get(link)
    scrapedinfo = browser.find_elements_by_xpath("*//div/productprice").text
    open("file.csv", "a+").write(scrapedinfo)
    time.sleep(1)
The greatest problem: it's too slow!
With this script it will take days or maybe weeks.
Is there a way to increase speed? Such as by visiting multiple links at the same time and scraping all at once?
I have spent hours looking for answers on Google and Stack Overflow and only found out about multiprocessing. But I am unable to apply it in my script.
Threading approach
You should start with threading.Thread, and it will give you a considerable performance boost (explained here). Also, threads are lighter than processes. You can use a futures.ThreadPoolExecutor with each thread using its own webdriver. Consider also adding the headless option for your webdriver. Example below using a chrome-webdriver:
from concurrent import futures
from selenium import webdriver

def selenium_work(url):
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    # <actual work that needs to be done by selenium>

# the default number of threads is based on the cpu cores,
# but you can set it with `max_workers` like `futures.ThreadPoolExecutor(max_workers=...)`
with futures.ThreadPoolExecutor() as executor:
    # store the url for each thread as a dict, so we can know which thread fails
    future_results = {url: executor.submit(selenium_work, url) for url in links}

    for url, future in future_results.items():
        try:
            future.result()  # can use `timeout` to wait a maximum number of seconds for each thread
        except Exception as exc:  # a thread can raise an exception
            print('url {0} generated an exception: {1}'.format(url, exc))
Consider also storing the chrome-driver instance initialized on each thread using threading.local(). From here, a reasonable performance improvement was reported.
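A minimal sketch of that idea, modifying selenium_work above so each worker thread reuses one driver stored in threading.local() (the thread_local name and the get_thread_driver helper are illustrative, not taken from the linked report):

import threading

thread_local = threading.local()

def get_thread_driver():
    # create one headless chrome driver per worker thread and reuse it for later tasks
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(options=chromeOptions)
        thread_local.driver = driver
    return driver

def selenium_work(url):
    driver = get_thread_driver()
    driver.get(url)
    # <actual work that needs to be done by selenium>

Note that in this sketch the per-thread drivers are never quit; you would still need to track them and call quit() once the executor has finished.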
Consider whether using BeautifulSoup directly on the page source from selenium can give some additional speed-up. It's a very fast and well-established package.
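For example, something like this (driver being an already created webdriver; the 'a' selector is just a placeholder):

driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
result = soup.find('a')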
Other approaches
Although I personally did not see much benefit in using concurrent.futures.ProcessPoolExecutor(), you could experiment with it. In fact, it was slower than threads in my experiments on Windows. Also, on Windows you have many limitations for the Python Process class.
Consider whether your use case can be satisfied by using arsenic, an asynchronous webdriver client built on asyncio. It sounds really promising, though it has many limitations.
Consider whether Requests-HTML solves your problem of JavaScript-loaded content, since it claims full JavaScript support. In that case you could use it with BeautifulSoup in a standard data scraping methodology.
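A minimal sketch of that route, assuming requests-html is installed and that its render() call (which runs the page's JavaScript in a headless Chromium) is enough for the pages in question:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.wikipedia.org/")
r.html.render()  # downloads Chromium on first use and executes the page's JavaScript
title = r.html.find("title", first=True)
print(title.text)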
You can use parallel execution. Divide the list of sites, e.g. into ten test cases (TCs) that use the same code; only the method names will differ (method1, method2, method3, ...). You will increase the speed. The number of browsers depends on your hardware performance.
See more at https://www.guru99.com/sessions-parallel-run-and-dependency-in-selenium.html
The main thing is to use TestNG, edit the .xml file, and set how many threads you want to use. Like this:
<suite name="TestSuite" thread-count="10" parallel="methods" >
If the website you are scraping is not too security-oriented against bots, it is better to use Requests; it will reduce your time from days to a couple of hours when you combine multi-threading with multi-processing. The steps are too long to go over, so here is just the idea:
import concurrent.futures

def threader_run(data):
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for i in data:
            futures.append(executor.submit(scrapper, i))  # scrapper is your own scraping function
        for future in concurrent.futures.as_completed(futures):
            print(future.result())
from multiprocessing import Process

data = {}
data['process1'] = []
data['process2'] = []
data['process3'] = []

if __name__ == "__main__":
    for x in data:
        jobs = []
        p = Process(target=threader_run, args=(data[x],))
        jobs.append(p)
        p.start()
        print(f'Started - {x}')
Basically, what this does is first compile all the links and then split them into 3 arrays for running 3 processes simultaneously (you could run more processes depending on your CPU cores and how data-intensive these jobs are). After that, you could split those arrays further, into 10 or even 100 parts depending on your project size. Each process runs a thread pool with a maximum of 8 workers, which then runs your final function.
Here, with 3 processes and 8 workers each, you are looking at up to a 24x speed boost. However, using the Requests library is necessary; if you use selenium for this, normal computers/laptops will freeze, because this would mean 24 browsers running simultaneously.
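As a rough sketch of that splitting step (assuming links is the full list of URLs already compiled; the chunk count of 3 just mirrors the three processes above):

# split the compiled list of links into 3 roughly equal chunks, one per process
chunks = [links[i::3] for i in range(3)]
data = {f'process{i + 1}': chunk for i, chunk in enumerate(chunks)}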
Program Logic
I'm opening multiple selenium threads from a list using the threading library in Python 3. These threads are stored in an array, from which they're started like this:
for each_thread in browser_threads:
    each_thread.start()

for each_thread in browser_threads:
    each_thread.join()
Each thread calls a function to start the selenium Firefox browser. The function is as follows.
Browser Function
# proxy browser session
def proxy_browser(proxy):
    global arg_pb_timesec
    global arg_proxyurl
    global arg_youtubevideo
    global arg_browsermode

    # recheck proxyurl
    if arg_proxyurl == '':
        arg_proxyurl = 'https://www.duckduckgo.com/'

    # apply proxy to firefox using desired capabilities
    PROX = proxy
    webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
        "httpProxy": PROX,
        "ftpProxy": PROX,
        "sslProxy": PROX,
        "proxyType": "MANUAL"
    }

    options = Options()
    # for browser mode
    options.headless = False
    if arg_browsermode == 'headless':
        options.headless = True
    driver = webdriver.Firefox(options=options)

    try:
        print(f"{c_green}[URL] >> {c_blue}{arg_proxyurl}{c_white}")
        print(f"{c_green}[Proxy Used] >> {c_blue}{proxy}{c_white}")
        print(f"{c_green}[Browser Mode] >> {c_blue}{arg_browsermode}{c_white}")
        print(f"{c_green}[TimeSec] >> {c_blue}{arg_pb_timesec}{c_white}\n\n")
        driver.get(arg_proxyurl)
        time.sleep(2)  # seconds
        # check if redirected to google captcha (for quitting abused proxies)
        if not "google.com/sorry/" in driver.current_url:
            # if youtube view mode
            if arg_youtubevideo:
                delay_time = 5  # seconds
                # if delay time is more than timesec for proxybrowser
                if delay_time > arg_pb_timesec:
                    # increase proxybrowser timesec
                    arg_pb_timesec += 5
                # wait for the web element to load
                try:
                    player_elem = WebDriverWait(driver, delay_time).until(EC.presence_of_element_located((By.ID, 'movie_player')))
                    togglebtn_elem = WebDriverWait(driver, delay_time).until(EC.presence_of_element_located((By.ID, 'toggleButton')))
                    time.sleep(2)
                    # click player
                    webdriver.ActionChains(driver).move_to_element(player_elem).click(player_elem).perform()
                    try:
                        # click autoplay button to disable autoplay
                        webdriver.ActionChains(driver).move_to_element(togglebtn_elem).click(togglebtn_elem).perform()
                    except Exception:
                        pass
                except TimeoutException:
                    print("Loading video control taking too much time!")
        else:
            print(f"{c_red}[Network Error] >> Abused Proxy: {proxy}{c_white}")
            driver.close()
            driver.quit()
            # if proxy not in abused_proxies:
            #     abused_proxies.append(proxy)
    except Exception as e:
        print(f"{c_red}{e}{c_white}")
        driver.close()
        driver.quit()
What the above does is start the browser with a proxy and check that the redirected URL is not a Google reCAPTCHA page, to avoid getting stuck on abused proxies. If the youtube video argument is passed, it then waits for the movie player to load and clicks it to start playback.
Sort of like a viewbot for websites as well as youtube.
Problem
The threads appear to end, but they keep running in the background. The browser windows never quit, and the script exits with all browser threads still running forever!
I tried every Stack Overflow solution and various methods, but nothing works. Here is the only relevant SO question, which is also not that relevant since the OP is spawning os.system processes, which I'm not: python daemon thread exits but process still run in the background
EDIT: Even when the whole page is loaded, the youtube clicker does not work and there is no exception. The threads appear to stop after a network error, but there is no error?!
Entire Script
As suggested by previous Stack Overflow users, I kept the code here minimal and reproducible. But if you need the entire logic, it's here: https://github.com/ProHackTech/FreshProxies/blob/master/fp.py
As you are starting multiple threads and joining them as follows:
for each_thread in browser_threads:
    each_thread.start()

for each_thread in browser_threads:
    each_thread.join()
At this point, it is worth noting that WebDriver is not thread-safe. Having said that, if you can serialise access to the underlying driver instance, you can share a reference across more than one thread. This is not advisable. But you can always instantiate one WebDriver instance for each thread.
Ideally, the issue of thread-safety isn't in your code but in the actual browser bindings. They all assume there will be only one command at a time (e.g. like a real user). But on the other hand, you can always instantiate one WebDriver instance for each thread, which will launch multiple browsing tabs/windows. Up to this point, it seems your program is fine.
Now, different threads can be run on the same WebDriver, but then the results of the tests would not be what you expect. The reason is that when you use multi-threading to run different tests on different tabs/windows, a little bit of thread-safety coding is required; otherwise the actions you perform, like click() or send_keys(), will go to the opened tab/window that currently has focus, regardless of the thread you expect to be running. Which essentially means all the tests will run simultaneously on the same tab/window that has focus, not on the intended tab/window.
Reference
You can find a relevant detailed discussion in:
Chrome crashes after several hours while multiprocessing using Selenium through Python
The goal is to write a Python script that opens a specific website, fills out some inputs and then submits them. This should be done with different inputs for the same website simultaneously.
I tried using Thread from threading and some other things, but I can't make it run simultaneously.
from selenium import webdriver
import time
from threading import Thread

def test_function():
    driver = webdriver.Chrome()
    driver.get("https://www.google.com")
    time.sleep(3)

if __name__ == '__main__':
    Thread(target=test_function()).start()
    Thread(target=test_function()).start()
Executing this code, the goal is that 2 chrome windows open simultaneously, go to Google and then wait for 3 seconds. Instead, all that happens is that the function is called two times in a serial manner.
Now all that's done is that the function is called two times in a serial manner.
The behavior you are seeing is because you are calling test_function() when you pass it as a target. Rather than calling the function, just pass the callable name (test_function),
like this:
Thread(target=test_function).start()
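Putting it together, a minimal corrected version of the script could look like the sketch below (the driver.quit() and join() calls are additions so that the windows close and the main thread waits; they are not required for the windows to open simultaneously):

from selenium import webdriver
import time
from threading import Thread

def test_function():
    driver = webdriver.Chrome()
    driver.get("https://www.google.com")
    time.sleep(3)
    driver.quit()

if __name__ == '__main__':
    # pass the callable itself, so each Thread runs it concurrently in its own chrome window
    threads = [Thread(target=test_function) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()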
You will need a testing framework like pytest to execute tests in parallel. Here is a quick setup guide to get you going.
PythonWebdriverParallel
So I've been working on a scraper that goes over 10k+ pages and scrapes data from them.
The issue is that, over time, memory consumption rises drastically. To overcome this, instead of closing the driver instance only at the end of the scrape, the scraper was updated so that it closes the instance after every page is loaded and its data extracted.
But RAM still gets filled up for some reason.
I tried using PhantomJS but it doesn't load data properly for some reason.
I also tried, with the initial version of the scraper, limiting the cache in Firefox to 100 MB, but that also did not work.
Note: I run tests with both chromedriver and firefox, and unfortunately I can't use libraries such as requests, mechanize, etc... instead of selenium.
Any help is appreciated since I've been trying to figure this out for a week now. Thanks.
The only way to force the Python interpreter to release memory to the OS is to terminate the process. Therefore, use multiprocessing to spawn the selenium Firefox instance; the memory will be freed when the spawned process is terminated:
import multiprocessing as mp
import selenium.webdriver as webdriver

def worker():
    driver = webdriver.Firefox()
    # do memory-intensive work
    # closing and quitting is not what ultimately frees the memory, but it
    # is good to close the WebDriver session gracefully anyway.
    driver.close()
    driver.quit()

if __name__ == '__main__':
    p = mp.Process(target=worker)
    # run `worker` in a subprocess
    p.start()
    # make the main process wait for `worker` to end
    p.join()
    # all memory used by the subprocess will be freed to the OS
See also Why doesn't Python release the memory when I delete a large object?
Are you trying to say that your drivers are what's filling up your memory? How are you closing them? If you're extracting your data, do you still have references to some collection that's storing them in memory?
You mentioned that you were already running out of memory when you closed the driver instance at the end of scraping, which makes it seem like you're keeping extra references.
I have experienced a similar issue, and destroying the driver myself (i.e. setting driver to None) prevented those memory leaks for me.
I was having the same problem until I put the webdriver.get(url) statements inside a try/except/finally statement and made sure webdriver.quit() was in the finally clause; this way, it always executes. Like:
webdriver = webdriver.Firefox()
try:
    webdriver.get(url)
    source_body = webdriver.page_source
except Exception as e:
    print(e)
finally:
    webdriver.quit()
From the docs:
The finally clause of such a statement can be used to specify cleanup code which does not handle the exception, but is executed whether an exception occurred or not in the preceding code.
Use this (Windows only; requires import os):
os.system("taskkill /f /im chromedriver.exe /T")