I'm struggling to get multithreading working in Python. I have a function that I want to execute on 5 threads based on a parameter. It also needs 2 parameters that are the same for every thread. This is what I have:
from concurrent.futures import ThreadPoolExecutor

def do_something_parallel(sameValue1, sameValue2, differentValue):
    print(str(sameValue1))      # same every time
    print(str(sameValue2))      # same every time
    print(str(differentValue))  # different

def main():
    differentValues = ["1000ms", "100ms", "10ms", "20ms", "50ms"]
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [
            executor.submit(do_something_parallel, sameValue1, sameValue2, differentValue)
            for differentValue in differentValues
        ]
But I don't know what to do next.
If you don't care about the order, you can now do:
from concurrent.futures import as_completed

# The rest of your code here

for f in as_completed(futures):
    # Do what you want with f.result(), for example:
    print(f.result())
Otherwise, if you care about order, it might make sense to use ThreadPoolExecutor.map with functools.partial to fill in the arguments that are always the same:
from functools import partial

# The rest of your code...

with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(
        partial(do_something_parallel, sameValue1, sameValue2),
        differentValues
    )
Related
How do I add tqdm to the multiprocessing for loop here? Namely, I want to wrap urls in tqdm():
jobs = []
urls = pd.read_csv(dataset, header=None).to_numpy().flatten()

for url in urls:
    job = pool.apply_async(worker, (url, q))
    jobs.append(job)

for job in jobs:
    job.get()

pool.close()
pool.join()
The suggested solution on GitHub is this:
pbar = tqdm(total=100)

def update(*a):
    pbar.update()
    # tqdm.write(str(a))

for i in range(pbar.total):
    pool.apply_async(myfunc, args=(i,), callback=update)
pool.close()
pool.join()
But my iterable is a list of URLs as opposed to a range like in the above. How do I translate the above solution to my for loop?
The easiest solution that is compatible with your current code is to just specify the callback argument to apply_async (and if there is a possibility of an exception in worker, then specify the error_callback argument too).
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    # So that the progress bar does not proceed too quickly
    # for demo purposes:
    import time
    time.sleep(1)

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    def my_callback(result):
        pbar.update()

    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')

    with tqdm(total=len(urls)) as pbar:
        pool = Pool()
        jobs = [
            pool.apply_async(worker, (url,), callback=my_callback, error_callback=my_callback)
            for url in urls
        ]
        # You can delete the next two statements if you don't need
        # to save the value of job.get(), since the calls to
        # pool.close() and pool.join() will wait for all submitted
        # tasks to complete:
        for job in jobs:
            job.get()
        pool.close()
        pool.join()
Or instead of using apply_async, use imap (or imap_unordered if you do not care either about the results or the order of the results):
from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)  # so that the progress bar does not proceed too quickly
    return url

# For compatibility with platforms that use the *spawn* method (e.g. Windows):
if __name__ == '__main__':
    # for this demo:
    #urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
    urls = list('abcdefghijklmnopqrstuvwxyz')

    pool = Pool()
    results = list(tqdm(pool.imap(worker, urls), total=len(urls)))
    print(results)
    pool.close()
    pool.join()
Note
If you won't or can't use apply_async with a callback, then imap_unordered is to be preferred over imap, assuming you don't need the results returned in task-submission order, which imap is obliged to do. The potential problem with imap is that if for some reason the first task submitted to the pool were the last to complete, no results can be returned until that first submitted task finishes. When that occurs, all the other submitted tasks will have already completed, so your progress bar will not move at all and then suddenly jump from 0% to 100% as quickly as you can iterate the results.
Admittedly, the above scenario is an extreme case that is not likely to occur often, but you would still like the progress bar to advance as tasks complete, regardless of the order of completion. For that, and for getting results back in task-submission order, apply_async with a callback is probably best. The only drawback of apply_async is that if you have a very large number of tasks to submit, they cannot be "chunked up" (see the chunksize argument of imap and imap_unordered) without implementing your own chunking logic.
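For completeness, here is a minimal sketch of the imap_unordered variant mentioned above (my own illustration, not from the original post; the worker and URL list are the same demo placeholders as before, and chunksize=4 is an arbitrary value):

from multiprocessing import Pool
from tqdm import tqdm

def worker(url):
    import time
    time.sleep(1)  # simulate work
    return url

if __name__ == '__main__':
    # demo data, as above:
    urls = list('abcdefghijklmnopqrstuvwxyz')
    with Pool() as pool:
        # Results arrive in completion order, so the progress bar advances
        # smoothly; chunksize batches task submission to reduce overhead.
        results = list(tqdm(pool.imap_unordered(worker, urls, chunksize=4),
                            total=len(urls)))
    print(results)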
You can use Parallel and delayed from joblib and use tqdm in the following manner:

import pandas as pd
from multiprocessing import cpu_count
from joblib import Parallel, delayed
from tqdm import tqdm

def process_urls(urls, i):
    # define your function here
    ...

Call the function using:

urls = pd.read_csv(dataset, header=None).to_numpy().flatten()
Parallel(n_jobs=cpu_count(), prefer='processes')(
    delayed(process_urls)(urls, i) for i in tqdm(range(len(urls)))
)
I want to call 4 methods at once so they run in parallel in Python. These methods make HTTP calls and do some basic operations like verifying the response. I want to call them at once so the total time taken is less: if each method takes ~20 min to run, I want all 4 methods to return a response in 20 min, not 20*4 = 80 min.
It is important to note that the 4 methods I'm trying to run in parallel are async functions. When I tried using ThreadPoolExecutor to run the 4 methods in parallel, I didn't see much difference in the time taken.
Example code - edited from #tomerar comment below
from concurrent.futures import ThreadPoolExecutor

async def foo_1():
    print("foo_1")

async def foo_2():
    print("foo_2")

async def foo_3():
    print("foo_3")

async def foo_4():
    print("foo_4")

with ThreadPoolExecutor() as executor:
    for foo in [await foo_1, await foo_2, await foo_3, await foo_4]:
        executor.submit(foo)
Looking for suggestions
You can use ThreadPoolExecutor from concurrent.futures:
from concurrent.futures import ThreadPoolExecutor

def foo_1():
    print("foo_1")

def foo_2():
    print("foo_2")

def foo_3():
    print("foo_3")

def foo_4():
    print("foo_4")

with ThreadPoolExecutor() as executor:
    for foo in [foo_1, foo_2, foo_3, foo_4]:
        executor.submit(foo)
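Note that the snippet above uses plain synchronous functions. Since the question says the four methods are coroutines, a sketch of the usual asyncio-based alternative may help (my own illustration, not from this thread; foo_1 to foo_4 stand in for the real methods): asyncio.gather runs the coroutines concurrently on one event loop, which overlaps the waiting on HTTP responses instead of serializing it.

import asyncio

async def foo_1():
    print("foo_1")

async def foo_2():
    print("foo_2")

async def foo_3():
    print("foo_3")

async def foo_4():
    print("foo_4")

async def main():
    # Run all four coroutines concurrently and wait for all of them to finish.
    await asyncio.gather(foo_1(), foo_2(), foo_3(), foo_4())

asyncio.run(main())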
You can use the multiprocessing module in Python. It's quite simple:
from multiprocessing import Pool
pool = Pool()
result1 = pool.apply_async(solve1, [A]) # evaluate "solve1(A)"
result2 = pool.apply_async(solve2, [B]) # evaluate "solve2(B)"
answer1 = result1.get(timeout=10)
answer2 = result2.get(timeout=10)
You can see the full details in the multiprocessing documentation.
I am trying to capture all output from a ProcessPoolExecutor.
Imagine you have a file func.py:
print("imported") # I do not want this print in subprocesses
def f(x):
return x
Then you run that function with a ProcessPoolExecutor like this:
from concurrent.futures import ProcessPoolExecutor
from func import f  # ⚠️ the import will print! ⚠️

if __name__ == "__main__":
    with ProcessPoolExecutor() as ex:  # ⚠️ the import will happen here again and print! ⚠️
        futs = [ex.submit(f, i) for i in range(15)]
        for fut in futs:
            fut.result()
Now I can capture the output of the first import using, e.g., contextlib.redirect_stdout; however, I want to capture all output from the subprocesses too and redirect it to the stdout of the main process.
In my real use case, I get warnings that I want to capture, but a simple print reproduces the problem.
This is relevant to prevent the following bug https://github.com/Textualize/rich/issues/2371.
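For reference, this is roughly how I currently capture just the import-time print in the parent process (only a sketch of what I described above; it does nothing about the output produced inside the worker processes):

import contextlib
import io
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        from func import f  # the "imported" print goes into buf instead of stdout

    with ProcessPoolExecutor() as ex:
        # On spawn-based platforms the workers re-import func and print again;
        # that output is NOT captured by the redirect above.
        futs = [ex.submit(f, i) for i in range(15)]
        for fut in futs:
            fut.result()

    print("captured in the main process:", buf.getvalue().strip())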
I need to pass two different functions to ThreadPoolExecutor and wait for both of them to complete, without blocking in a for loop on the first or second set of futures, as these are very long-running tasks.
How can I achieve this with ThreadPoolExecutor?
from concurrent.futures import ThreadPoolExecutor, as_completed

def perform_set():
    pass

def perform_get():
    pass

with ThreadPoolExecutor(max_workers=4) as executor:
    futures_set = [executor.submit(perform_set) for i in range(2)]
    futures_get = [executor.submit(perform_get) for i in range(2)]

    #for f in as_completed(futures_set):
    #    print(f.result())
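One way to do this (my own sketch, not taken from an answer in this thread, reusing perform_set and perform_get from above): hand both lists of futures to a single as_completed call, which yields each future as soon as it finishes instead of blocking on either list in order.

from concurrent.futures import ThreadPoolExecutor, as_completed

def perform_set():
    return "set done"

def perform_get():
    return "get done"

with ThreadPoolExecutor(max_workers=4) as executor:
    futures_set = [executor.submit(perform_set) for _ in range(2)]
    futures_get = [executor.submit(perform_get) for _ in range(2)]

    # as_completed accepts any iterable of futures, so the two lists can
    # simply be concatenated; results arrive in completion order.
    for f in as_completed(futures_set + futures_get):
        print(f.result())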
I'm having issues using most or all of the cores to process the files faster; it could be reading multiple files at a time or using multiple cores to read a single file.
I would prefer using multiple cores to read a single file before moving on to the next.
I tried the code below but can't seem to get all the cores used.
The following code basically retrieves the *.txt files in the directory, each of which contains HTML in JSON format.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import json
import urlparse
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlTheHtml(htmlsource):
    htmlArray = json.loads(htmlsource)
    for eachHtml in htmlArray:
        soup = BeautifulSoup(eachHtml['result'], 'html.parser')
        if all(['another text to search' not in str(soup),
                'text to search' not in str(soup)]):
            try:
                gd_no = ''
                try:
                    gd_no = soup.find('input', {'id': 'GD_NO'})['value']
                except:
                    pass
                r = requests.post('domain api address', data={
                    'gd_no': gd_no,
                })
            except:
                pass

if __name__ == '__main__':
    pool = Pool(cpu_count() * 2)
    print(cpu_count())

    fileArray = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.txt'):
            fileArray.append(filename)

    for file in fileArray:
        with open(file, 'r') as myfile:
            htmlsource = myfile.read()
            results = pool.map(crawlTheHtml(htmlsource), f)
On top of that, I'm not sure what the ,f represents.
Question 1:
What did I not do properly to fully utilize all the cores/threads?
Question 2:
Is there a better way to use try/except? Sometimes the value is not in the page, and that causes the script to stop. When dealing with multiple variables, I end up with a lot of try/except statements.
Answer to question 1: your problem is this line:
from multiprocessing.dummy import Pool # This is a thread-based Pool
Answer taken from: multiprocessing.dummy in Python is not utilising 100% cpu
When you use multiprocessing.dummy, you're using threads, not processes:
multiprocessing.dummy replicates the API of multiprocessing but is no
more than a wrapper around the threading module.
That means you're restricted by the Global Interpreter Lock (GIL), and only one thread can actually execute CPU-bound operations at a time. That's going to keep you from fully utilizing your CPUs. If you want to get full parallelism across all available cores, you're going to need to address the pickling issue you're hitting with multiprocessing.Pool.
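As a hedged illustration of that switch (my own sketch, not part of the quoted answer): use the process-based Pool, keep crawlTheHtml a top-level function so it can be pickled, and pass the function itself plus an iterable of file contents to pool.map.

from multiprocessing import Pool, cpu_count
import os

def crawlTheHtml(htmlsource):
    # ... same parsing logic as in the question; a placeholder result
    # is returned here purely for illustration ...
    return len(htmlsource)

if __name__ == '__main__':
    fileArray = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]

    # Read every file first, then let the process pool work on the contents.
    sources = []
    for filename in fileArray:
        with open(filename, 'r') as myfile:
            sources.append(myfile.read())

    pool = Pool(processes=cpu_count())
    # Pass the function (not a call) and the iterable of arguments.
    results = pool.map(crawlTheHtml, sources)
    pool.close()
    pool.join()
    print(results)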
I had this problem. You need to do:

from multiprocessing import Pool
from multiprocessing import freeze_support

and at the end you need to do:

if __name__ == '__main__':
    freeze_support()

and then you can continue your script.
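Putting that together, a minimal sketch of the pattern (my illustration; square is a placeholder worker):

from multiprocessing import Pool, freeze_support

def square(x):
    # placeholder worker; must be defined at module level so it can be pickled
    return x * x

if __name__ == '__main__':
    freeze_support()  # needed for frozen Windows executables, harmless otherwise
    with Pool() as pool:
        print(pool.map(square, range(10)))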
from multiprocessing import Pool, Queue
from os import getpid
from time import sleep
from random import random

MAX_WORKERS = 10

class Testing_mp(object):

    def __init__(self):
        """
        Initiates a queue, a pool and a temporary buffer, used only
        when the queue is full.
        """
        self.q = Queue()
        self.pool = Pool(processes=MAX_WORKERS, initializer=self.worker_main,)
        self.temp_buffer = []

    def add_to_queue(self, msg):
        """
        If the queue is full, put the message in a temporary buffer.
        If the queue is not full, add the message to the queue.
        If the buffer is not empty and the queue is not full,
        put messages from the buffer back into the queue.
        """
        if self.q.full():
            self.temp_buffer.append(msg)
        else:
            self.q.put(msg)
            if len(self.temp_buffer) > 0:
                self.add_to_queue(self.temp_buffer.pop())

    def write_to_queue(self):
        """
        This function writes some messages to the queue.
        """
        for i in range(50):
            self.add_to_queue("First item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)
            self.add_to_queue("Second item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)

    def worker_main(self):
        """
        Waits indefinitely for an item to be written in the queue.
        Finishes when the parent process terminates.
        """
        print("Process {0} started".format(getpid()))
        while True:
            # If the queue is not empty, pop the next element and do the work.
            # If the queue is empty, wait indefinitely until an element gets in the queue.
            item = self.q.get(block=True, timeout=None)
            print("{0} retrieved: {1}".format(getpid(), item))
            # simulate some random-length operations
            sleep(random())

# Warning from the Python documentation:
# Functionality within this package requires that the __main__ module be
# importable by the children. This means that some examples, such as the
# multiprocessing.Pool examples, will not work in the interactive interpreter.
if __name__ == '__main__':
    mp_class = Testing_mp()
    mp_class.write_to_queue()
    # Wait a bit for the child processes to do some work,
    # because when the parent exits, the children are terminated.
    sleep(5)