I have a fairly long script, so a minimal reproducible example may not be possible.
I'm trying to follow some previous SO posts about threading: I want to execute an exe dozens of times, using as many CPU cores as possible. Early days here.
First I define a list whose elements are populated dynamically:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
work_items = []
Then I'm defining a function, at the same indent level as all my other working functions:
# To help with executing my exe in parallel
def worker(tup):
    subprocess.call(tup)
Then, finally, inside the function that will call this function:
# Execute jobs
start = time.time()
with ThreadPool(4) as pool:
    work_results = pool.map(worker, work_items)
end = time.time()
print(end - start)
The line causing me grief is work_results = pool.map(worker, work_items). Both my linter in VSCode and the Python shell, when I attempt to test, report that worker is not defined. My understanding is that the function should be in scope, and it is defined.
Is there something here that stands out as a reason why worker would be reported as undefined?
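For reference, here is a minimal self-contained sketch of the layout described above, with worker defined at module level before the function that maps over it (the imports and the run_jobs wrapper are my additions for illustration):

import subprocess
import time
from multiprocessing.dummy import Pool as ThreadPool

work_items = []  # populated dynamically elsewhere

# Module-level definition, so it is in scope for any function defined below it
def worker(tup):
    subprocess.call(tup)

def run_jobs():
    # Execute jobs
    start = time.time()
    with ThreadPool(4) as pool:
        work_results = pool.map(worker, work_items)
    end = time.time()
    print(end - start)

If a layout like this still reports worker as undefined, the usual culprits are that worker is actually defined after its first use, or nested inside another function or class rather than at module level.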
Recently I have started using ProcessPoolExecutor in Python to accelerate my processing.
So instead of doing this:
list_of_res = []
for n in range(a_number):
    res = calculate_something(list_of_sources[n])
    list_of_res.append(res)
joint_results = pd.concat(list_of_res)
I do
with ProcessPoolExecutor(max_workers=8) as executor:
    joint_results = pd.concat(executor.map(calculate_something, list_of_sources))
It works great.
However, I've noticed that inside the calculate_something function I call the same function about 8 times in a row, so I might as well apply a map to them instead of a loop.
My question is, can I apply multiprocessing to a function that is already being called in multiprocess?
Yes, you can have a worker process spawn another pool of workers, but it is not optimal.
Each time you launch a new process, it takes a few hundred milliseconds to a few seconds for that process to initialize and start executing work (OS-, disk- and code-dependent).
Launching a worker from a worker just wastes the overhead spent spawning the first child; you are better off extracting the loop inside calculate_something and submitting it directly to your initial executor.
A better approach is to launch your initial calculate_something tasks using a ThreadPoolExecutor and have one shared ProcessPoolExecutor that all the thread workers push work into. This way you limit the number of newly created processes and avoid creating and destroying far more workers than you actually need; launching a thread pool takes only a few microseconds.
This is an example of how to nest a thread pool and a process pool:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def process_worker(n):
    print(n)
    return n

def thread_worker(list_of_n, process_pool: ProcessPoolExecutor):
    work_done = list(process_pool.map(process_worker, list_of_n))
    return work_done

if __name__ == "__main__":
    list_of_lists_of_n = [[1, 2, 3], [4, 5, 6]]
    with ProcessPoolExecutor() as process_pool, ThreadPoolExecutor() as threadpool:
        tasks = []
        work_done = []
        for item in list_of_lists_of_n:
            tasks.append(threadpool.submit(thread_worker, item, process_pool))
        for item in tasks:
            work_done.append(item.result())
        print(work_done)
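Note the design here: the single ProcessPoolExecutor is created once in the main process and handed to every thread worker, so however many threads are submitting work, the number of live worker processes stays bounded by that one pool's size.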
I'm developing a program that involves computing similarity scores for around 480 pairs of images (20 directories with around 24 images in each). I'm utilizing the sentence_transformers Python module for image comparison. It takes around 0.1 - 0.2 seconds on my Windows 11 machine to compare two images when running in serial, but for some reason that time increases to between 1.5 and 3.0 seconds when running in parallel using a process Pool. So either (a) there's something going on behind the scenes that I'm not yet aware of, or (b) I just did it wrong.
Here's a rough structure of the image comparison function:
def compare_images(image_one, image_two, clip_model):
    start = time()
    images = [image_one, image_two]
    # clip_model is set to SentenceTransformer('clip-ViT-B-32') elsewhere in the code
    encoded_images = clip_model.encode(images, batch_size=2, convert_to_tensor=True, show_progress_bar=False)
    processed_images = util.paraphrase_mining_embeddings(encoded_images)
    stop = time()
    print("Comparison time: %f" % (stop - start))
    score, image_id1, image_id2 = processed_images[0]
    return score
Here's a rough structure of the serial version of the code to compare every image:
def compare_all_images(candidate_image, directory, clip_model):
    for dir_entry in os.scandir(directory):
        dir_image_path = dir_entry.path
        dir_image = Image.open(dir_image_path)
        similarity_score = compare_images(candidate_image, dir_image, clip_model)
        # ... code to determine whether this is the maximum score the program has seen ...
Here is a rough structure of the parallel version:
def compare_all_images(candidate_image, directory, clip_model):
    pool_results = dict()
    pool = Pool()
    for dir_entry in os.scandir(directory):
        dir_image_path = dir_entry.path
        dir_image = Image.open(dir_image_path)
        pool_results[dir_image_path] = pool.apply_async(compare_images, args=(candidate_image, dir_image, clip_model))
    # Added everything to the pool, close it and wait for everything to finish
    pool.close()
    pool.join()
    # ... remaining code to determine which image has the highest similarity rating ...
I'm not sure where I might be erring.
The interesting thing here is that I also developed a smaller program to verify whether I was doing things correctly:
def func():
    sleep(6)

def main():
    pool = Pool()
    for i in range(20):
        pool.apply_async(func)
    pool.close()
    start = time()
    pool.join()
    stop = time()
    print("Time: %f" % (stop - start))  # This gave an average of 12 seconds
                                        # across multiple runs on my Windows 11
                                        # machine, on which multiprocessing.cpu_count() = 12
Is this a problem with trying to make things parallel with sentence transformers, or does the problem lie elsewhere?
UPDATE: Now I'm especially confused. I'm now only passing str objects to the comparison function and have temporarily slapped a return 0 as the very first line in the function to see if I can further isolate the issue. Oddly, even though the parallel function is doing absolutely nothing now, several seconds (usually around 5) still seem to pass between the time that the pool is closed and the time that pool.join() finishes. Any thoughts?
UPDATE 2: I've done some more playing around, and have found out that an empty pool still has some overhead. This is the code I'm testing out currently:
# ...
pool = Pool()
pool.close()
start = time()
DebuggingUtilities.debug("empty pool closed, doing a join on the empty pool to see if directory traversal is messing things up")
pool.join()
stop = time()
DebuggingUtilities.debug("Empty pool join time: %f" % (stop - start) )
This gives me an "Empty pool join time" of about 5 seconds. Moving this snippet to the very beginning of my main function still yields the same result. Perhaps Pool works differently on Windows? In WSL (Ubuntu 20.04), the same code runs in about 0.02 seconds. So, what would cause even an empty Pool to hang for so long on Windows?
UPDATE 3: I've made another discovery. The empty pool problem goes away if the only imports I have are from multiprocessing import Pool and from time import time. However, the program uses a boatload of import statements across several source files, which causes the program to hang a bit when it first starts. I suspect that this is propagating down into the Pool for some reason. Unfortunately, I need all of the import statements that are in the source files, so I'm not sure how to get around this (or why the imports would affect an empty Pool).
UPDATE 4: So, apparently it's the from sentence_transformers import SentenceTransformer line that's causing issues (without that import, the pool.join() call happens relatively quickly). I think the easiest solution now is to simply move the compare_images function into a separate file. I'll update this question with my progress as I implement this.
UPDATE 5: I've done a little more playing around, and it seems like on Windows, the import statements get executed multiple times whenever a Pool gets created, which I think is just weird. Here's the code I used to verify this:
from multiprocessing import Pool
from datetime import datetime
from time import time
from utils import test

print("outside function lol")

def get_time():
    now = datetime.now()
    return "%02d/%02d/%04d - %02d:%02d:%02d" % (now.month, now.day, now.year, now.hour, now.minute, now.second)

def main():
    pool = Pool()
    print("Starting pool")
    """
    for i in range(4):
        print("applying %d to pool %s" % (i, get_time()))
        pool.apply_async(test, args=(i,))
    """
    pool.close()
    print("Pool closed, waiting for all processes to finish")
    start = time()
    pool.join()
    stop = time()
    print("pool done: %f" % (stop - start))

if __name__ == "__main__":
    main()
Running through Windows command prompt:
outside function lol
Starting pool
Pool closed, waiting for all processes to finish
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
pool done: 4.794051
Running through WSL:
outside function lol
Starting pool
Pool closed, waiting for all processes to finish
pool done: 0.048856
UPDATE 6: I think I might have a workaround: create the Pool in a file that doesn't directly or indirectly import anything from sentence_transformers, then pass the model and anything else I need from sentence_transformers as parameters to a function that handles the Pool and kicks off all of the parallel processes. Since the sentence_transformers import seems to be the only problematic one, I'll wrap that import statement in an if __name__ == "__main__" guard so it only runs once, which is fine because I'm passing the things I need from it as parameters. It's a rather janky solution, and probably not what others would consider "Pythonic", but I have a feeling it will work.
UPDATE 7: The workaround was successful. I've managed to get the join time on an empty pool down to something reasonable (0.2 - 0.4 seconds). The downside of this approach is that there is considerable overhead in passing the entire model as a parameter to the parallel function, which I needed to do because the Pool is now created in a different place than where the model is imported. I'm quite close, though.
I've done a little more digging, and think I've finally discovered the root of the problem, and it has everything to do with what's described here.
To summarize: on Linux systems, processes are forked from the main process, meaning the current process state is copied (which is why the import statements don't run multiple times). On Windows (and macOS), processes are spawned, meaning the interpreter starts over at the beginning of the "main" file, thereby re-running all the import statements. So the behavior I'm seeing is not a bug, but I will need to rethink my program design to account for it.
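For anyone hitting the same wall, here is a minimal sketch of the spawn-safe pattern that the workaround in UPDATE 6 amounts to (the worker body is a placeholder of my own): keep the heavy import inside the __main__ guard so that children spawned on Windows can re-execute the module's top level without paying for it.

from multiprocessing import Pool

def worker(x):
    # No heavy module-level imports above, so spawning children stays cheap
    return x * x

if __name__ == "__main__":
    # Spawned children re-run module-level code but skip this guarded block,
    # so the expensive import happens only once, in the main process.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('clip-ViT-B-32')
    # Use model in the main process, or pass what you need from it to the
    # workers as parameters, as described in UPDATE 6.
    with Pool() as pool:
        print(pool.map(worker, range(8)))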
I'm experiencing some difficulties with multiprocessing in Python. Using the snippet below I'm getting two errors:
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
and
FileNotFoundError: [Errno 2] No such file or directory
I've found that this is because the main process closes earlier than the sub-processes, causing the sub-processes to be discarded and throwing these errors. I'm using p.join(), as this should keep the main process waiting until the current process is done (if I understand correctly).
When using a small number of processes, the code works perfectly fine (n_processes = 15) but when n_processes becomes bigger than around 100 (differs per machine), the errors pop up.
I'm asking this question because it seems quite arbitrary when the main process closes too soon, and why the p.join() function is not doing what I'm expecting. I've read a ton of similar SO posts, plus documentation, but can't seem to figure it out. Could someone help me with this?
A minimal example is as follows:
"""Test for multiprocessing some large function."""
from multiprocessing import Process, Manager
from timeit import default_timer as timer
import time
def fun(d, i):
"""Simply add some data to the shared list."""
time.sleep(2)
d.append(("workerbee" + str(i), i))
def main(n_processes):
"""Let's do some arbitrary thing to test overall logic."""
print("Starting!")
with Manager() as manager:
data = manager.list() # <-- can be shared between processes.
procs = []
# Create n processes
# if n is small => everything is fine
# if n is large (>100) => ForkAwareLocal object has no attr connection
for i in range(n_processes):
procs.append(Process(target=fun, args=(data, i)))
procs[-1].start()
for p in procs:
p.join()
print(f'Result in main: {data}')
if __name__ == '__main__':
# Choose number of processes
# n_processes = 15
n_processes = 150
start = timer()
main(n_processes)
end = timer()
print(f'elapsed time: {end - start}')
What I ended up doing was to just divide the total number of function calls I wanted to make over cpu_count(). This way we still use all cores, but we don't need to open/close a process for each individual call. So it's a workaround (which may be better than solving the issue), but I still don't understand why the MainProcess stopped ahead of time.
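A minimal sketch of that chunking workaround, reusing the fun-style worker and Manager list from the example above (the chunking helper and interleaving scheme are my own illustration):

import multiprocessing as mp

def fun_chunk(d, indices):
    # One long-lived process handles a whole chunk of calls
    for i in indices:
        d.append(("workerbee" + str(i), i))

if __name__ == '__main__':
    n_calls = 150
    n_workers = mp.cpu_count()
    # Interleaved chunks: worker k handles indices k, k + n_workers, ...
    chunks = [range(k, n_calls, n_workers) for k in range(n_workers)]
    with mp.Manager() as manager:
        data = manager.list()
        procs = [mp.Process(target=fun_chunk, args=(data, chunk)) for chunk in chunks]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(f'{len(data)} results collected')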
I want to perform some benchmarking between 'multiprocessing' a file and processing it sequentially.
Basically, a file is read line by line (it consists of 100 lines); the first character of each line is read and added to a list if it isn't already there.
import multiprocessing as mp
import sys
import time

database_layout = []

def get_first_characters(string):
    global database_layout
    if string[0:1] not in database_layout:
        database_layout.append(string[0:1])

if __name__ == '__main__':
    start_time = time.time()
    bestand_first_read = open('random.txt', 'r', encoding="latin-1")
    for line in bestand_first_read:
        p = mp.Process(target=get_first_characters, args=(line,))
        p.start()
    print(str(len(database_layout)))
    print("Finished part one: " + str(time.time() - start_time))
    bestand_first_read.close()

    ### Part two
    database_layout_two = []
    start_time = time.time()
    bestand_first_read_two = open('random.txt', 'r', encoding="latin-1")
    for linetwo in bestand_first_read_two:
        if linetwo[0:1] not in database_layout_two:
            database_layout_two.append(linetwo[0:1])
    print(str(len(database_layout_two)))
    print("Finished: part two" + str(time.time() - start_time))
But when I execute this program I get the following result:
python test.py
0
Finished part one: 17.105965852737427
10
Finished part two: 0.0
Two problems arise at this point.
1) Why does the multiprocessing take much longer (+/- 17 s) than the sequential processing (+/- 0 s)?
2) Why does the list database_layout not get filled? (It is the same code.)
EDIT
The same example, this time using Pools.
import multiprocessing as mp
import timeit

def get_first_characters(string):
    return string

if __name__ == '__main__':
    database_layout = []
    start = timeit.default_timer()
    nr = 0
    with mp.Pool(processes=4) as pool:
        for i in range(99999):
            nr += 1
            database_layout.append(pool.starmap(get_first_characters, [(str(i),)]))
    stop = timeit.default_timer()
    print("Pools: %s " % (stop - start))

    database_layout = []
    start = timeit.default_timer()
    for i in range(99999):
        database_layout.append(get_first_characters(str(i)))
    stop = timeit.default_timer()
    print("Regular: %s " % (stop - start))
Running the example above gives the following output.
Pools: 22.058468394726148
Regular: 0.051738489109649066
This shows that in such a case working with Pools is 440 times slower than sequential processing. Any clue why this is?
Multiprocessing starts one process for each line of your input. That means you pay the overhead of opening a new Python interpreter for each line of your (possibly very long) file, which accounts for the long time it takes to go through it.
However, there are other issues with your code. While there is no synchronisation issue due to fighting for the file (since all reads are done in the main process, where the line iteration is going on), you have misunderstood how multiprocessing works.
First of all, your global variable is not global across processes. Actually, processes don't usually share memory (unlike threads), and you have to use some interface to share objects (hence why shared objects must be picklable). When your code opens each process, each interpreter instance starts by loading your file, which creates a new database_layout variable. Because of that, each interpreter starts with an empty list, which means each one ends with a single-element list. To actually share the list, you might want to use a Manager (also see how to share state in the docs).
Also, because of the huge overhead of opening new interpreters, your script's performance may benefit from using a pool of workers, since that opens just a few processes to share the work. Remember that resource contention will hurt performance if you open more processes than you have CPU cores.
The second problem, besides the issue of sharing your variable, is that your code does not wait for the processing to finish. Hence, even if the state were shared, your processing might not have finished by the time you check the length of database_layout. Again, using a pool would help with that.
PS: unless you want to preserve the insertion order, you might get even faster by using a set, though I'm not sure the Manager supports it.
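As a rough sketch of the Manager route applied to the first-character task (the hard-coded lines are hypothetical stand-ins for the file contents; note the explicit join before reading the result):

import multiprocessing as mp

def get_first_characters(shared_list, string):
    # The proxy list lives in the Manager process, so appends are visible
    # to all workers. Note: this check-then-append is not atomic, so
    # duplicates are still possible without an extra lock.
    if string[0:1] not in shared_list:
        shared_list.append(string[0:1])

if __name__ == '__main__':
    lines = ["alpha", "beta", "gamma", "alpine"]  # stand-in for the file contents
    with mp.Manager() as manager:
        database_layout = manager.list()
        procs = [mp.Process(target=get_first_characters, args=(database_layout, line))
                 for line in lines]
        for p in procs:
            p.start()
        for p in procs:
            p.join()  # wait for the processing to finish before checking the result
        print(len(database_layout), list(database_layout))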
EDIT after the OP EDIT: Your pool code is still dispatching work to the pool one line (or number) at a time. As before, much of your processing stays in the main process, which just loops and passes arguments to the other processes. Besides, you're running each element through the pool individually and appending to the list, which pretty much uses only one worker process at a time (remember that map or starmap waits until the work finishes before returning). This is what Process Explorer shows while running your code:
Note how the main process is still doing all the hard work (22% on a quad-core machine means its CPU is maxed). What you need to do is pass the whole iterable to map() in a single call, minimizing the work (especially the switching between the Python and C sides):
import multiprocessing as mp
import timeit

def get_first_characters(number):
    return str(number)[0]

if __name__ == '__main__':
    start = timeit.default_timer()
    with mp.Pool(processes=4) as pool:
        database_layout1 = pool.map(get_first_characters, range(99999))
    stop = timeit.default_timer()
    print("Pools: %s " % (stop - start))

    database_layout2 = []
    start = timeit.default_timer()
    for i in range(99999):
        database_layout2.append(get_first_characters(str(i)))
    stop = timeit.default_timer()
    print("Regular: %s " % (stop - start))

    assert database_layout1 == database_layout2
This got me from this:
Pools: 14.169268206710512
Regular: 0.056271265139002935
To this:
Pools: 0.35610273658926417
Regular: 0.07681461930314981
It's still slower than the single-process version, but that's mainly because of the message-passing overhead for a very simple function. If your function is more complex, it'll make more sense.
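As a further tweak (not part of the measurements above), Pool.map takes an optional chunksize argument that batches items into fewer, larger inter-process messages, which can shave that overhead down further for cheap functions:

# Inside the same "with mp.Pool(processes=4) as pool:" block as above
database_layout1 = pool.map(get_first_characters, range(99999), chunksize=1000)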
I have searched and cannot find an answer to this question elsewhere. Hopefully I haven't missed something.
I am trying to use Python multiprocessing to essentially batch run some proprietary models in parallel. I have, say, 200 simulations, and I want to batch run them ~10-20 at a time. My problem is that the proprietary software crashes if two models happen to start at the same / similar time. I need to introduce a delay between processes spawned by multiprocessing so that each new model run waits a little bit before starting.
So far, my solution has been to introduce a random time delay at the start of the child process before it fires off the model run. However, this only reduces the probability of any two runs starting at the same time, so I still run into problems when processing a large number of models. I therefore think the time delay needs to be built into the multiprocessing part of the code, but I haven't been able to find any documentation or examples of this.
Edit: I am using Python 2.7
This is my code so far:
from time import sleep
import numpy as np
import subprocess
import multiprocessing
def runmodels(arg):
    sleep(np.random.rand(1, 1) * 120)  # this is my interim solution to reduce the probability that any two runs start at the same time, but it isn't a guaranteed solution
    subprocess.call(arg)  # this line actually fires off the model run

if __name__ == '__main__':
    arguments = [big list of runs in here
                 ]
    count = 12
    pool = multiprocessing.Pool(processes=count)
    r = pool.imap_unordered(runmodels, arguments)
    pool.close()
    pool.join()
multiprocessing.Pool() already limits the number of processes running concurrently.
You could use a lock, to separate the starting time of the processes (not tested):
import threading
import multiprocessing

def init(lock):
    global starting
    starting = lock

def run_model(arg):
    starting.acquire()  # no other process can get it until it is released
    threading.Timer(1, starting.release).start()  # release in a second
    # ... start your simulation here

if __name__ == "__main__":
    arguments = ...
    pool = multiprocessing.Pool(processes=12,
                                initializer=init, initargs=[multiprocessing.Lock()])
    for _ in pool.imap_unordered(run_model, arguments):
        pass
One way to do this is with threads and a semaphore:
from time import sleep
import subprocess
import threading

def runmodels(arg):
    subprocess.call(arg)
    sGlobal.release()  # release for next launch

if __name__ == '__main__':
    threads = []
    global sGlobal
    sGlobal = threading.Semaphore(12)  # semaphore for max 12 threads
    arguments = [big list of runs in here
                 ]
    for arg in arguments:
        sGlobal.acquire()  # block if more than 12 threads
        t = threading.Thread(target=runmodels, args=(arg,))
        threads.append(t)
        t.start()
        sleep(1)
    for t in threads:
        t.join()
The answer suggested by jfs caused problems for me because it starts a new timer thread with threading.Timer. If the worker happens to finish before the timer fires, the timer can be killed and the lock is never released.
I propose an alternative route, in which each successive worker waits until enough time has passed since the start of the previous one. This has the same desired effect, but without relying on an extra timer thread.
import multiprocessing as mp
import time

def init(shared_val):
    global start_time
    start_time = shared_val

def run_model(arg):
    with start_time.get_lock():
        wait_time = max(0, start_time.value - time.time())
        time.sleep(wait_time)
        start_time.value = time.time() + 1.0  # specify interval here
    # ... start your simulation here

if __name__ == "__main__":
    arguments = ...
    pool = mp.Pool(processes=12,
                   initializer=init, initargs=[mp.Value('d')])
    for _ in pool.imap_unordered(run_model, arguments):
        pass
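A quick note on the mechanics: each worker takes the lock on the shared double, sleeps until the stored timestamp has passed, then pushes the timestamp one interval into the future before releasing. Starts are therefore spaced at least one second apart (adjust the + 1.0 to change the interval), with no timers or extra processes involved.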