I am trying to implement multiprocessing with Python. It works when pooling very quick tasks, but freezes when pooling longer tasks. See my example below:
from multiprocessing import Pool
import math
import time

def iter_count(addition):
    print "starting ", addition
    for i in range(1,99999999+addition):
        if i==99999999:
            print "completed ", addition
            break

if __name__ == '__main__':
    print "starting pooling "
    pool = Pool(processes=2)
    time_start = time.time()
    possibleFactors = range(1,3)

    try:
        pool.map( iter_count, possibleFactors)
    except:
        print "exception"
    pool.close()
    pool.join()
    #iter_count(1)
    #iter_count(2)
    time_end = time.time()
    print "total loading time is : ", round(time_end-time_start, 4)," seconds"
In this example, if I use a smaller number in the for loop (something like 9999999) it works, but when running with 99999999 it freezes. I tried running the two calls (iter_count(1) and iter_count(2)) in sequence, and it takes about 28 seconds, so not really a big task. But when I pool them it freezes. I know there are some known bugs in Python around multiprocessing; however, in my case, the same code works for smaller subtasks but freezes for bigger ones.
You're using some version of Python 2 - we can tell because of how print is spelled.
So range(1,99999999+addition) is creating a gigantic list, with at least 100 million integers. And you're doing that in 2 worker processes simultaneously. I bet your disk is grinding itself to dust while the OS swaps out everything it can ;-)
Change range to xrange and see what happens. I bet it will work fine then.
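For instance, a minimal sketch of that change to the loop above (Python 2; in Python 3, range is already lazy and xrange no longer exists):

def iter_count(addition):
    print "starting ", addition
    # xrange yields numbers lazily instead of building a list of ~100 million ints
    for i in xrange(1, 99999999 + addition):
        if i == 99999999:
            print "completed ", addition
            break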
I'm developing a program that involves computing similarity scores for around 480 pairs of images (20 directories with around 24 images in each). I'm utilizing the sentence_transformers Python module for image comparison, and it takes around 0.1 - 0.2 seconds on my Windows 11 machine to compare two images when running in serial, but for some reason, that time increases to between 1.5 and 3.0 seconds when running in parallel using a process Pool. So either a) there's something going on behind the scenes that I'm not yet aware of, or b) I just did it wrong.
Here's a rough structure of the image comparison function:
def compare_images(image_one, image_two, clip_model):
    start = time()
    images = [image_one, image_two]
    # clip_model is set to SentenceTransformer('clip-ViT-B-32') elsewhere in the code
    encoded_images = clip_model.encode(images, batch_size = 2, convert_to_tensor = True, show_progress_bar = False)
    processed_images = util.paraphrase_mining_embeddings(encoded_images)
    stop = time()
    print("Comparison time: %f" % (stop - start) )
    score, image_id1, image_id2 = processed_images[0]
    return score
Here's a rough structure of the serial version of the code to compare every image:
def compare_all_images(candidate_image, directory, clip_model):
    for dir_entry in os.scandir(directory):
        dir_image_path = dir_entry.path
        dir_image = Image.open(dir_image_path)
        similarity_score = compare_images(candidate_image, dir_image, clip_model)
        # ... code to determine whether this is the maximum score the program has seen...
Here is a rough structure of the parallel version:
def compare_all_images(candidate_image, directory, clip_model):
    pool_results = dict()
    pool = Pool()
    for dir_entry in os.scandir(directory):
        dir_image_path = dir_entry.path
        dir_image = Image.open(dir_image_path)
        pool_results[dir_image_path] = pool.apply_async(compare_images, args = (candidate_image, dir_image, clip_model))
    # Added everything to the pool, close it and wait for everything to finish
    pool.close()
    pool.join()
    # ... remaining code to determine which image has the highest similarity rating
I'm not sure where I might be erring.
The interesting thing here is that I also developed a smaller program to verify whether I was doing things correctly:
from multiprocessing import Pool
from time import sleep, time

def func():
    sleep(6)

def main():
    pool = Pool()
    for i in range(20):
        pool.apply_async(func)
    pool.close()
    start = time()
    pool.join()
    stop = time()
    print("Time: %f" % (stop - start) )  # This gave an average of 12 seconds
                                         # across multiple runs on my Windows 11
                                         # machine, on which multiprocessing.cpu_count() is 12

if __name__ == "__main__":
    main()
Is this a problem with trying to make things parallel with sentence transformers, or does the problem lie elsewhere?
UPDATE: Now I'm especially confused. I'm now only passing str objects to the comparison function and have temporarily slapped a return 0 as the very first line in the function to see if I can further isolate the issue. Oddly, even though the parallel function is doing absolutely nothing now, several seconds (usually around 5) still seem to pass between the time that the pool is closed and the time that pool.join() finishes. Any thoughts?
UPDATE 2: I've done some more playing around, and have found out that an empty pool still has some overhead. This is the code I'm testing out currently:
# ...
pool = Pool()
pool.close()
start = time()
DebuggingUtilities.debug("empty pool closed, doing a join on the empty pool to see if directory traversal is messing things up")
pool.join()
stop = time()
DebuggingUtilities.debug("Empty pool join time: %f" % (stop - start) )
This gives me an "Empty pool join time" of about 5 seconds. Moving this snippet to the very first part of my main function still yields the same result. Perhaps Pool works differently on Windows? In WSL (Ubuntu 20.04), the same code runs in about 0.02 seconds. So, what would cause even an empty Pool to hang for such a long time on Windows?
UPDATE 3: I've made another discovery. The empty pool problem goes away if the only imports I have are from multiprocessing import Pool and from time import time. However, the program uses a boatload of import statements across several source files, which causes the program to hang a bit when it first starts. I suspect that this is propagating down into the Pool for some reason. Unfortunately, I need all of the import statements that are in the source files, so I'm not sure how to get around this (or why the imports would affect an empty Pool).
UPDATE 4: So, apparently it's the from sentence_transformers import SentenceTransformer line that's causing issues (without that import, the pool.join() call happens relatively quickly). I think the easiest solution now is to simply move the compare_images function into a separate file. I'll update this question again as I implement this.
UPDATE 5: I've done a little more playing around, and it seems like on Windows, the import statements get executed multiple times whenever a Pool gets created, which I think is just weird. Here's the code I used to verify this:
from multiprocessing import Pool
from datetime import datetime
from time import time
from utils import test

print("outside function lol")

def get_time():
    now = datetime.now()
    return "%02d/%02d/%04d - %02d:%02d:%02d" % (now.month, now.day, now.year, now.hour, now.minute, now.second)

def main():
    pool = Pool()
    print("Starting pool")
    """
    for i in range(4):
        print("applying %d to pool %s" % (i, get_time() ) )
        pool.apply_async(test, args = (i, ) )
    """
    pool.close()
    print("Pool closed, waiting for all processes to finish")
    start = time()
    pool.join()
    stop = time()
    print("pool done: %f" % (stop - start) )

if __name__ == "__main__":
    main()
Running through Windows command prompt:
outside function lol
Starting pool
Pool closed, waiting for all processes to finish
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
pool done: 4.794051
Running through WSL:
outside function lol
Starting pool
Pool closed, waiting for all processes to finish
pool done: 0.048856
UPDATE 6: I think I might have a workaround, which is to create the Pool in a file that doesn't directly or indirectly import anything from sentence_transformers. I then pass the model and anything else I need from sentence_transformers as parameters to a function that handles the Pool and kicks off all of the parallel processes. Since the sentence_transformers import seems to be the only problematic one, I'll wrap that import statement in an if __name__ == "__main__" guard so it only runs once; that should be fine, since I'm passing the things I need from it as parameters. It's a rather janky solution, and probably not what others would consider "Pythonic", but I have a feeling it will work.
UPDATE 7: The workaround was successful. I've managed to get the pool join time on an empty pool down to something reasonable (0.2 - 0.4 seconds). The downside of this approach is that there is definitely considerable overhead in passing the entire model as a parameter to the parallel function, which I needed to do as a result of creating the Pool in a different place than the model was being imported. I'm quite close, though.
I've done a little more digging, and think I've finally discovered the root of the problem, and it has everything to do with what's described here.
To summarize, on Linux systems, processes are forked from the main process, meaning that the current process state is copied (which is why the import statements don't run multiple times). On Windows (and macOS), processes are spawned, meaning that a fresh interpreter starts at the beginning of the "main" file, thus running all of the import statements again. So, the behavior I'm seeing is not a bug, but I will need to rethink my program design to account for this.
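As a rough sketch of the kind of structure that works under spawn (illustrative only, not the actual program; the work function here is made up): keep heavy imports and one-time setup behind the main guard, and give the workers small, self-contained functions.

from multiprocessing import Pool

def work(item):
    # keep per-task imports/setup inside the function so a spawned worker only
    # pays for what it actually uses
    return item * item

if __name__ == "__main__":
    # heavy imports and one-time setup belong here: under spawn, child processes
    # re-import this module but skip this block because __name__ != "__main__"
    with Pool(processes=4) as pool:
        print(pool.map(work, range(10)))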
I just learned about multiprocessing and tried to see how fast it is compared to a simple for loop.
I used this simple code to compare them:
import multiprocessing
from time import time as tt

def spawn(num,num2):
    print('Process {} {}'.format(num,num2))

#normal process/single core
st = tt()
for i in range (1000):
    spawn(i,i+1)
print('Total Running Time simple for loop:{}'.format((tt()-st)))

#multiprocessing
st2 = tt()
if __name__ == '__main__':
    for i in range(1000):
        p=multiprocessing.Process(target=spawn, args=(i,i+1))
        p.start()
    print('Total Running Time multiprocessing:{}'.format((tt()-st2)))
The output I got showed that multiprocessing is much slower than the simple for loop:
Total Running Time simple for loop:0.09924721717834473
Total Running Time multiprocessing:40.157875299453735
Can anyone explain why this happens?
It is because of the overhead of handling the processes. In this case, the performance boost the code gets from running in parallel does not outweigh the cost of creating and tearing down the processes. If the executed code were more complex, there would probably be a speedup.
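For example, here is an illustrative sketch (not the original code; the heavy function is made up) where each task carries enough CPU-bound work that the process startup cost stops dominating:

import multiprocessing
from time import time as tt

def heavy(n):
    # enough CPU-bound work per task that process overhead no longer dominates
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    jobs = [2000000] * 16

    st = tt()
    serial = [heavy(n) for n in jobs]
    print('Total Running Time simple for loop:{}'.format(tt() - st))

    st = tt()
    with multiprocessing.Pool() as pool:
        parallel = pool.map(heavy, jobs)
    print('Total Running Time multiprocessing:{}'.format(tt() - st))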
I want to perform some benchmarking comparing 'multiprocessing' a file against processing it sequentially.
Basically, a file (consisting of 100 lines) is read line by line, and the first character of each line is put into a list if it isn't already there.
import multiprocessing as mp
import sys
import time

database_layout=[]

def get_first_characters(string):
    global database_layout
    if string[0:1] not in database_layout:
        database_layout.append(string[0:1])

if __name__ == '__main__':
    start_time = time.time()
    bestand_first_read=open('random.txt','r', encoding="latin-1")
    for line in bestand_first_read:
        p = mp.Process(target=get_first_characters, args=(line,))
        p.start()
    print(str(len(database_layout)))
    print("Finished part one: "+ str(time.time() - start_time))
    bestand_first_read.close()

    ###Part two
    database_layout_two=[]
    start_time = time.time()
    bestand_first_read_two=open('random.txt','r', encoding="latin-1")
    for linetwo in bestand_first_read_two:
        if linetwo[0:1] not in database_layout_two:
            database_layout_two.append(linetwo[0:1])
    print(str(len(database_layout_two)))
    print("Finished: part two"+ str(time.time() - start_time))
But when I execute this program I get the following result:
python test.py
0
Finished part one: 17.105965852737427
10
Finished part two: 0.0
Two problems arise at this point.
1) Why does the multiprocessing take much longer (+/- 17 sec) than the sequential processing (+/- 0 sec)?
2) Why doesn't the list 'database_layout' get filled? (It is the same code.)
EDIT
The same kind of example, this time using Pools:
import multiprocessing as mp
import timeit

def get_first_characters(string):
    return string

if __name__ == '__main__':
    database_layout=[]
    start = timeit.default_timer()
    nr = 0
    with mp.Pool(processes=4) as pool:
        for i in range(99999):
            nr += 1
            database_layout.append(pool.starmap(get_first_characters, [(str(i),)]))
    stop = timeit.default_timer()
    print("Pools: %s " % (stop - start))

    database_layout=[]
    start = timeit.default_timer()
    for i in range(99999):
        database_layout.append(get_first_characters(str(i)))
    stop = timeit.default_timer()
    print("Regular: %s " % (stop - start))
After running the above example, the following output is shown:
Pools: 22.058468394726148
Regular: 0.051738489109649066
This shows that in such a case working with Pools is 440 times slower than sequential processing. Any clue why this is?
Multiprocessing starts one process for each line of your input. That means you pay the overhead of starting a new Python interpreter for every line of your (possibly very long) file, which accounts for the long time it takes to go through it.
However, there are other issues with your code. While there is no synchronisation issue due to fighting for the file (since all reads are done in the main process, where the line iteration is going on), you have misunderstood how multiprocessing works.
First of all, your global variable is not global across processes. Unlike threads, processes don't share memory by default, and you have to use some interface to share objects (which is also why shared objects must be picklable). When your code starts each process, each interpreter instance begins by loading your file, which creates a new database_layout variable. Because of that, each interpreter starts with an empty list, which means it ends with a single-element list. For actually sharing the list, you might want to use a Manager (also see how to share state in the docs).
Also, because of the huge overhead of opening new interpreters, your script's performance may benefit from using a pool of workers, since a pool opens just a few processes that share the work. Remember that resource contention will hurt performance if you open more processes than you have CPU cores.
The second problem, besides the issue of sharing your variable, is that your code does not wait for the processing to finish. Hence, even if the state was shared, your processing might not have finished when you check the length of database_layout. Again, using a pool might help with that.
PS: unless you want to preserve the insertion order, you might get even faster by using a set, though I'm not sure the Manager supports it.
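For illustration, here is a minimal sketch (not the asker's code; the sample lines are made up) of sharing a list through a Manager, and of joining the workers before reading the result:

import multiprocessing as mp

def get_first_characters(line, shared_list, lock):
    first = line[0:1]
    with lock:                         # make the check-and-append atomic across processes
        if first not in shared_list:
            shared_list.append(first)

if __name__ == '__main__':
    lines = ["alpha", "beta", "apple", "banana", "cherry"]
    with mp.Manager() as manager:
        shared_list = manager.list()   # proxy to a list living in the manager process
        lock = manager.Lock()
        processes = [mp.Process(target=get_first_characters, args=(line, shared_list, lock))
                     for line in lines]
        for p in processes:
            p.start()
        for p in processes:
            p.join()                   # wait for the workers before reading the shared list
        print(len(shared_list), shared_list[:])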
EDIT after the OP EDIT: Your pool code still feeds the pool one element (one line or number) at a time. As written, much of the processing still happens in the main process, which just loops and passes arguments to the worker processes. Besides, you're running each element through the pool individually and appending to the list, which pretty much uses only one worker process at a time (remember that map or starmap waits until the work finishes before returning). This is from Process Explorer running your code:
Note how the main process is still doing all the hard work (22% on a quad-core machine means one core is essentially maxed out). What you need to do is pass the whole iterable to map() in a single call, minimizing the overhead (especially the switching between the Python and C sides):
import multiprocessing as mp
import timeit

def get_first_characters(number):
    return str(number)[0]

if __name__ == '__main__':
    start = timeit.default_timer()
    with mp.Pool(processes=4) as pool:
        database_layout1 = (pool.map(get_first_characters, range(99999)))
    stop = timeit.default_timer()
    print("Pools: %s " % (stop - start))

    database_layout2=[]
    start = timeit.default_timer()
    for i in range(99999):
        database_layout2.append(get_first_characters(str(i)))
    stop = timeit.default_timer()
    print("Regular: %s " % (stop - start))

    assert database_layout1 == database_layout2
This got me from this:
Pools: 14.169268206710512
Regular: 0.056271265139002935
To this:
Pools: 0.35610273658926417
Regular: 0.07681461930314981
It's still slower than the single-process version, but that's mainly because of the message-passing overhead for a very simple function. If your function is more complex, it'll make more sense.
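If the per-item function has to stay this small, the remaining gap can often be narrowed further by batching the messages; an illustrative variation (not part of the timings above) using map()'s chunksize parameter:

import multiprocessing as mp
import timeit

def get_first_characters(number):
    return str(number)[0]

if __name__ == '__main__':
    start = timeit.default_timer()
    with mp.Pool(processes=4) as pool:
        # a larger chunksize sends the work in big batches, so far fewer
        # pickled messages have to cross the process boundary
        result = pool.map(get_first_characters, range(99999), chunksize=5000)
    stop = timeit.default_timer()
    print("Pools with chunksize: %s " % (stop - start))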
I'm trying to get a handle on Python parallelism. This is the code I'm using:
import time
from concurrent.futures import ProcessPoolExecutor

def listmaker():
    for i in xrange(10000000):
        pass

#Without duo core
start = time.time()
listmaker()
end = time.time()
nocore = "Total time, no core, %.3f" % (end- start)

#with duo core
start = time.time()
pool = ProcessPoolExecutor(max_workers=2) #I have two cores
results = list(pool.map(listmaker()))
end = time.time()
core = "Total time core, %.3f" % (end- start)

print nocore
print core
I was under the assumption that because I'm using two cores, the speed would be close to double. However, when I run this code, most of the time the nocore output is faster than the core output. This is true even if I change
def listmaker():
    for i in xrange(10000000):
        pass
to
def listmaker():
    for i in xrange(10000000):
        print i
In fact, in some runs the no-core run is faster. Could someone shed some light on the issue? Is my setup correct? Am I doing something wrong?
You're using pool.map() incorrectly. Take a look at the pool.map documentation. It expects an iterable argument, and it will pass each of the items from the iterable to the pool individually. Since your function only returns None, there is nothing for it to do. However, you're still incurring the overhead of spawning extra processes, which takes time.
Your usage of pool.map should look like this:
results = pool.map(function_name, some_iterable)
Notice a couple of things:
Since you're using the print statement rather than a function, I'm assuming you're using some Python2 variant. In Python2, pool.map returns a list anyway. No need to convert it to a list again.
The first argument should be the function name without parentheses. This identifies the function that the pool workers should execute. When you include the parentheses, the function is called right there, instead of in the pool.
pool.map is intended to call a function on every item in an iterable, so your test case needs to create some iterable for it to consume, instead of a function that takes no arguments like your current example.
Try to run your trial again with some actual input to the function, and retrieve the output. Here's an example:
import time
from concurrent.futures import ProcessPoolExecutor

def read_a_file(file_name):
    with open(file_name) as fi:
        text = fi.read()
    return text

file_list = ['t1.txt', 't2.txt', 't3.txt']

#Without duo core
start = time.time()
single_process_text_list = []
for file_name in file_list:
    single_process_text_list.append(read_a_file(file_name))
end = time.time()
nocore = "Total time, no core, %.3f" % (end- start)

#with duo core
start = time.time()
pool = ProcessPoolExecutor(max_workers=2) #I have two cores
multiprocess_text_list = pool.map(read_a_file, file_list)
end = time.time()
core = "Total time core, %.3f" % (end- start)

print(nocore)
print(core)
Results:
Total time, no core, 0.047
Total time core, 0.009
The text files are 150,000 lines of gibberish each. Notice how much work had to be done before the parallel processing was worth it. When I ran the trial with 10,000 lines in each file, the single process approach was still faster because it didn't have the overhead of spawning extra processes. But with that much work to do, the extra processes become worth the effort.
And by the way, this functionality is available with multiprocessing pools in Python2, so you can avoid importing anything from futures if you want to.
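For example, the parallel half could look roughly like this with a multiprocessing pool (a sketch only, assuming the same t1.txt/t2.txt/t3.txt files as above):

from multiprocessing import Pool
import time

def read_a_file(file_name):
    with open(file_name) as fi:
        return fi.read()

if __name__ == '__main__':
    file_list = ['t1.txt', 't2.txt', 't3.txt']   # same hypothetical files as above
    start = time.time()
    pool = Pool(processes=2)
    multiprocess_text_list = pool.map(read_a_file, file_list)   # blocks until all files are read
    pool.close()
    pool.join()
    end = time.time()
    print("Total time core, %.3f" % (end - start))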
I have 2 simple functions (loops over a range) that can run separately without any dependency. I'm trying to run these 2 functions using both the Python multiprocessing module and the threading module.
When I compared the output, I saw that the multiprocessing version takes about 1 second more than the multithreading version.
I read that multithreading is not that efficient because of the Global Interpreter Lock.
Based on the above:
1. Is it best to use multiprocessing if there is no dependency between the 2 processes?
2. How do I calculate the number of processes/threads that I can run on my machine for maximum efficiency?
3. Also, is there a way to measure the efficiency of the program when using multithreading?
Multithread module...
from multiprocessing import Process
import thread
import platform
import os
import time
import threading

class Thread1(threading.Thread):
    def __init__(self,threadindicator):
        threading.Thread.__init__(self)
        self.threadind = threadindicator

    def run(self):
        starttime = time.time()
        if self.threadind == 'A':
            process1()
        else:
            process2()
        endtime = time.time()
        print 'Thread 1 complete : Time Taken = ', endtime - starttime

def process1():
    starttime = time.time()
    for i in range(100000):
        for j in range(10000):
            pass
    endtime = time.time()

def process2():
    for i in range(1000):
        for j in range(1000):
            pass

def main():
    print 'Main Thread'
    starttime = time.time()
    thread1 = Thread1('A')
    thread2 = Thread1('B')
    thread1.start()
    thread2.start()
    threads = []
    threads.append(thread1)
    threads.append(thread2)
    for t in threads:
        t.join()
    endtime = time.time()
    print 'Main Thread Complete , Total Time Taken = ', endtime - starttime

if __name__ == '__main__':
    main()
multiprocess module
from multiprocessing import Process
import platform
import os
import time

def process1():
    # print 'process_1 processor =',platform.processor()
    starttime = time.time()
    for i in range(100000):
        for j in range(10000):
            pass
    endtime = time.time()
    print 'Process 1 complete : Time Taken = ', endtime - starttime

def process2():
    # print 'process_2 processor =',platform.processor()
    starttime = time.time()
    for i in range(1000):
        for j in range(1000):
            pass
    endtime = time.time()
    print 'Process 2 complete : Time Taken = ', endtime - starttime

def main():
    print 'Main Process start'
    starttime = time.time()
    processlist = []

    p1 = Process(target=process1)
    p1.start()
    processlist.append(p1)

    p2 = Process(target = process2)
    p2.start()
    processlist.append(p2)

    for i in processlist:
        i.join()
    endtime = time.time()
    print 'Main Process Complete - Total time taken = ', endtime - starttime

if __name__ == '__main__':
    main()
If you have two CPUs available on your machine, you have two processes which don't have to communicate, and you want to use both of them to make your program faster, you should use the multiprocessing module, rather than the threading module.
The Global Interpreter Lock (GIL) prevents the Python interpreter from making efficient use of more than one CPU by using multiple threads, because only one thread can be executing Python bytecode at a time. Therefore, multithreading won't improve the overall runtime of your application unless you have calls that are blocking (e.g. waiting for IO) or that release the GIL (e.g. numpy will do this for some expensive calls) for extended periods of time. However, the multiprocessing library creates separate subprocesses, and therefore several copies of the interpreter, so it can make efficient use of multiple CPUs.
However, in the example you gave, you have one process that finishes very quickly (less than 0.1 seconds on my machine) and one process that takes around 18 seconds to finish. The exact numbers may vary depending on your hardware. In that case, nearly all the work happens in one process, so you're really only using one CPU regardless, and the extra overhead of spawning processes rather than threads is probably what makes the process-based version slower.
If you make both processes do the 18 second nested loops, you should see that the multiprocessing code goes much faster (assuming your machine actually has more than one CPU). On my machine, I saw the multiprocessing code finish in around 18.5 seconds, and the multithreaded code finish in 71.5 seconds. I'm not sure why the multithreaded one took longer than around 36 seconds, but my guess is that the GIL is causing some sort of thread contention issue that slows down the execution of both threads.
As for your second question, assuming there's no other load on the system, you should use a number of processes equal to the number of CPUs on your system. You can discover this by doing lscpu on a Linux system, sysctl hw.ncpu on a Mac system, or running dxdiag from the Run dialog on Windows (there's probably other ways, but this is how I always do it).
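From inside Python itself, a small illustrative snippet: multiprocessing.cpu_count() reports the same number portably, and Pool() already uses it as its default worker count.

import multiprocessing

num_cpus = multiprocessing.cpu_count()   # number of CPUs visible to the interpreter
print("Use up to %d worker processes" % num_cpus)   # Pool() defaults to this value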
For the third question, the simplest way to figure out how much efficiency you're getting from the extra processes is just to measure the total runtime of your program, using time.time() as you were, or the time utility in Linux (e.g. time python myprog.py). The ideal speedup should be equal to the number of processes you're using, so a 4 process program running on 4 CPUs should be at most 4x faster than the same program with 1 process, assuming you get maximum benefit from the extra processes. If the other processes aren't helping you that much, it will be less than 4x.
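For example, with purely hypothetical timings:

serial_seconds = 72.0     # hypothetical: wall-clock time of the 1-process run
parallel_seconds = 20.0   # hypothetical: wall-clock time of the 4-process run
speedup = serial_seconds / parallel_seconds
print("Speedup: %.1fx" % speedup)   # 3.6x here, against an ideal of 4x on 4 CPUs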