Python multithreading.pool running slower than sequential - python

I am trying to make one of my functions run asynchronously. In the code this function runs approximately 600-700 times, so making it asynchronous should speed up the overall program.
I am using a thread pool to make it run faster, but while searching for solutions I came across the GIL (Global Interpreter Lock). If I understood correctly, because Python is not a thread-safe language, the GIL doesn't let Python execute more than one thread at a time.
Here is the piece of code I am trying to run asynchronously. The function is too long, so I pasted only the time-consuming part:
for row in range(0, image_row - 1, 1):
    if row in ignore_row:
        continue
    for column in range(0, image_col - 1, 1):
        if image_binary[row][column] > 0:
            # For the first triangle of a square
            vertice1 = vertices[row][column]
            vertice2 = vertices[row][column + 1]
            vertice3 = vertices[row + 1][column + 1]
            face1 = np.array([vertice1, vertice2, vertice3])
            # For the second triangle of a square
            vertice1 = vertices[row][column]
            vertice2 = vertices[row + 1][column]
            vertice3 = vertices[row + 1][column + 1]
            face2 = np.array([vertice1, vertice2, vertice3])
            faces.extend(face1)
            faces.extend(face2)
At first I thought that if one of the variables was global, the threads might be waiting in a queue to access it. But none of the variables are global; they are all created inside the function.
In the sequential code one full run of the program takes about 0.2 seconds, but in the parallel version this snippet alone takes about 4-9 seconds.
Here is the multiprocess code:
pool = ThreadPool(config.THREAD_COUNT)
pool.map(process_image, png_names)
pool.close()
pool.join()
png_names holds the names of the PNG files in a folder.
For config.THREAD_COUNT I tried 2-17 threads. They all read the PNG files and start to process them, but when they come to the part I shared above, they wait and execute that part in about 4-9 seconds.
I am using the cv2 and numpy libraries for reading and editing the images.
I am using Python 3.10
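Since the loop above is pure Python/NumPy work rather than I/O, the GIL prevents a ThreadPool from running it on more than one core at a time. As a rough sketch (not from the original post; process_image, the file names and the worker count are stand-ins for the question's own names), the same map call can be pointed at a process-based pool instead, provided the worker function is importable and its arguments and results are picklable:

from multiprocessing import Pool

def process_image(png_name):
    # stand-in for the question's per-image function (CPU-bound cv2/NumPy work)
    pass

if __name__ == "__main__":
    png_names = ["a.png", "b.png", "c.png"]  # illustrative; the real list comes from the folder
    worker_count = 4                         # config.THREAD_COUNT in the original code
    with Pool(processes=worker_count) as pool:
        pool.map(process_image, png_names)

Each image is then handled by a separate process with its own interpreter, so the CPU-bound part is no longer serialized by the GIL; the trade-off is the cost of starting processes and pickling arguments and results.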

Related

In python, how can I run a function without the program waiting for its completion?

In Python, I am making a cube game (like Minecraft pre-classic) that renders chunk by chunk (16x16 blocks). It only renders blocks that are exposed (not covered on all sides). Even though this method is fast when the terrain is shallow (like 16x16x2, which is 512 blocks in total), once I make the terrain higher (like 16x16x64, which is 16384 blocks in total), rendering each chunk takes roughly 0.03 seconds, which means that when I render multiple chunks at once the game freezes for about a quarter of a second. I want to render the chunks "asynchronously", meaning that the program keeps drawing frames and calling the chunk render function multiple times, no matter how long it takes.
I tried to make another program in order to test it:
import threading

def run():
    n = 1
    for i in range(10000000):
        n += 1
    print(n)

print("Start")
threading.Thread(target=run()).start()
print("End")
I know that creating threads like this is not the best solution, but nothing else worked.
Threading, however, didn't work, as this is what the output looked like:
>>> Start
>>> 10000001
>>> End
It also took about a quarter of a second to complete, which is about how long the multiple chunk rendering takes.
Then I tried to use async:
import asyncio
async def run():
n = 1
for i in range(10000000):
n += 1
print(n)
print("Start")
asyncio.run(run())
print("End")
It did the exact same thing.
My questions are:
Can I run a function without stopping/pausing the program execution until it's complete?
Did I use the above correctly?
Yes. No. The answer is complicated.
First, your example has at least one error in it:
print("Start")
threading.Thread(target=run).start()  # notice the missing parentheses after run
print("End")
You can of course use multithreading for your game, but it can come at the cost of code complexity because of synchronization, and you might not gain any performance because of the GIL.
asyncio is probably not the right tool for this job either, since you don't need to highly parallelize many tasks, and it has the same GIL limitation as multithreading.
The usual solution for this kind of problem is to divide your work into small batches and only process the next batch if you still have time for it in the same frame, kind of like so:
def runBatch(batch):
    for x in batch:
        print(x)

batches = [range(x, x + 200) for x in range(0, 10000, 200)]

while True:  # main loop
    while batches and timeToNextFrame() > 15:  # pseudocode: time budget left in this frame (ms)
        runBatch(batches.pop())
    renderFrame()  # or whatever
However, in this instance, optimizing the algorithm itself could be even better than any other option. One thing that Minecraft does is it subdivides chunks into subchunks (you can mostly ignore subchunks that are full of blocks). Another is that it only considers the visible surfaces of the blocks (renders only those sides of the block that could be visible, not the whole block).
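To illustrate the second idea, here is a rough sketch of per-face culling (the data layout and the helper names are made up for the example, not taken from the question): for each solid block, only the faces whose neighbouring cell is empty or outside the chunk are kept for rendering.

# assumed layout: blocks is a 3D nested list of booleans, True where a block is solid
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def visible_faces(blocks, size_x, size_y, size_z):
    faces = []
    for x in range(size_x):
        for y in range(size_y):
            for z in range(size_z):
                if not blocks[x][y][z]:
                    continue
                for dx, dy, dz in NEIGHBOURS:
                    nx, ny, nz = x + dx, y + dy, z + dz
                    outside = not (0 <= nx < size_x and 0 <= ny < size_y and 0 <= nz < size_z)
                    if outside or not blocks[nx][ny][nz]:
                        faces.append(((x, y, z), (dx, dy, dz)))  # block position + face normal
    return faces

# example: a fully solid 2x2x2 cube has 24 visible faces (3 per corner block)
cube = [[[True, True], [True, True]], [[True, True], [True, True]]]
print(len(visible_faces(cube, 2, 2, 2)))  # 24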
asyncio only works asynchronously when your function is waiting on an I/O task, such as a network call or disk I/O.
For non-I/O tasks to execute asynchronously, multithreading is the only option, so create all your threads and then wait for them to complete using the thread join method:
from threading import Thread
import time

def draw_pixels(arg):
    time.sleep(arg)
    print(arg)

threads = []
args = [1, 2, 3, 4, 5]
for arg in args:
    t = Thread(target=draw_pixels, args=(arg, ))
    t.start()
    threads.append(t)

# join all threads
for t in threads:
    t.join()


Multiprocessing does not work when first running the same code single-threaded

I am currently trying to parallelize part of an existing program and have encountered a strange behaviour that causes my program to stop making progress (it is still running, but does not get any further).
First, here is a minimal working example:
import multiprocessing
import torch

def foo(x):
    print(x)
    return torch.zeros(x, x)

size = 200

# first block
print("Run with single core")
res = []
for i in range(size):
    res.append(foo(i))

# second block
print("Run with multiprocessing")
data = [i for i in range(size)]
with multiprocessing.Pool(processes=1) as pool:
    res = pool.map(foo, data)
The problem is that the script stops running during multiprocessing, reliably at x = 182. Up to this point I could come up with some reasonable explanations, but now comes the strange part: if I run only the second block of code (the parallel version), the script works perfectly fine. It also works if I first run the parallel version and then the single-threaded code. The program only gets stuck when I run the first block and then the second. The same holds if I run the second block, then the first, and then the second again; in that case the first multiprocessing block works fine, and it gets stuck the second time I run the multiprocessing version.
The problem does not seem to stem from a lack of memory, since I can increase size to much higher values (1000+) when only running the multiprocessing code. Additionally, there is no problem when I use np.zeros((x, x)) instead of torch.zeros. Removing the print call does not help either, so I kept it in for demonstration purposes.
This was also reproducible on other machines, stopping at x = 182 when running both blocks, but working fine when only running the second. The Python and PyTorch versions were (3.7.3, 1.7.1), (3.8.9, 1.8.1), and (3.8.5, 1.9.0) respectively. All systems were Ubuntu based.
Does someone have an explanation for this behaviour, or did I miss any parameters/options I need to set for this to work?
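A commonly suggested direction for hangs that only show up after the parent process has already run a threaded library such as torch (a hedged sketch, not verified against the exact versions listed above) is to switch the pool to the "spawn" start method, so each worker starts from a fresh interpreter instead of inheriting the forked parent's state:

import multiprocessing

import torch

def foo(x):
    print(x)
    return torch.zeros(x, x)

if __name__ == "__main__":
    size = 200

    print("Run with single core")
    res = [foo(i) for i in range(size)]

    print("Run with multiprocessing (spawn)")
    ctx = multiprocessing.get_context("spawn")  # workers start in a fresh interpreter
    with ctx.Pool(processes=1) as pool:
        res = pool.map(foo, list(range(size)))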

Multiprocessing pool for function with no arguments/iterable?

I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 Gb of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty it picks a random file (task) and passes the work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up calculations I have to create many identical single-CPU instances and this means there is a large cost overhead for creating many 80 Gb disks and transferring the data to them each time, even though the calculation is only "reading" one small portion of the data for any particular calculation. I want to make everything more efficient and cost effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way; they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation, and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and in the Python docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here

def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#')[3]
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                # set up and run analysis module, etc.
                pass
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing

if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing on Windows (which I am using for development, but GCE is on Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs were 1 then I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, an empty string and empty brackets (a tuple?) and it doesn't do anything.
Would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and the threads Mulitprocess Pools with different functions and python code with mulitprocessing only spawns one process each time seem to be close to what I am doing, but they still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run the function only once then use a normal Process().
If you want to run the same function 2 times then you can manually create 2 Process() objects.
If you want to use Pool() to run the function 2 times then pass a list with 2 elements (even if you don't need the arguments), because that is how Pool() knows to run it 2 times.
But if you run the function 2 times against the same folder then it may run the same task 2 times. If you run it 5 times then it may run the same task 5 times. I don't know if that is what you need.
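As a rough Python 3 sketch of both options (workermaster here is only a stand-in for the function from the question; it is given a dummy parameter so that Pool.map has something to pass in):

import multiprocessing

def workermaster(_dummy=None):
    # the infinite task loop from the question goes here; the extra
    # parameter is ignored and only exists for Pool.map's benefit
    pass

if __name__ == '__main__':
    # option 1: start the workers directly with Process()
    workers = [multiprocessing.Process(target=workermaster) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    # option 2: a Pool calls the function once per element of a dummy iterable
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(workermaster, range(4))  # the values 0-3 are never used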
As for Ctrl+C, I found Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python on Stack Overflow, but I don't know if it resolves your problem.

Using pool for multiprocessing in Python (Windows)

I have to run my study in parallel to make it much faster. I am new to the multiprocessing library in Python and could not yet make it run successfully.
Here, I am investigating whether each pair of (origin, target) remains at certain locations between the various frames of my study. Several points:
It is one function that I want to run faster (it is not several processes).
The process is performed sequentially; each frame is compared with the previous one.
This code is a much simplified form of the original code. The code outputs a residence_list.
I am using Windows OS.
Can someone check the code (the multiprocessing section) and help me improve it to make it work? Thanks.
import numpy as np
from multiprocessing import Pool, freeze_support

def Main_Residence(total_frames, origin_list, target_list):
    Previous_List = {}
    residence_list = []

    for frame in range(total_frames):                      # Each frame
        Current_List = {}                                  # Dict of pairs and their residence for frames
        for origin in range(origin_list):
            for target in range(target_list):
                Pair = (origin, target)                    # Each pair
                if Pair in Current_List.keys():            # If already considered, continue
                    continue
                else:
                    if origin == target:
                        if (Pair in Previous_List.keys()): # If remained from the previous frame, add residence
                            print "Origin_Target remained: ", Pair
                            Current_List[Pair] = (Previous_List[Pair] + 1)
                        else:                              # If new, add it to the current
                            Current_List[Pair] = 1

        for pair in Previous_List.keys():                  # Add those that exited from residence to the list
            if pair not in Current_List.keys():
                residence_list.append(Previous_List[pair])

        Previous_List = Current_List

    return residence_list


if __name__ == '__main__':
    pool = Pool(processes=5)
    Residence_List = pool.apply_async(Main_Residence, args=(20, 50, 50))
    print Residence_List.get(timeout=1)
    pool.close()
    pool.join()
    freeze_support()
    Residence_List = np.array(Residence_List) * 5
Multiprocessing does not make sense in the context you are presenting here.
You are creating five subprocesses (and three threads belonging to the pool, managing workers, tasks and results) to execute one function once. All of this is coming at a cost, both in system resources and execution time, while four of your worker processes don't do anything at all. Multiprocessing does not speed up the execution of a function. The code in your specific example will always be slower than plainly executing Main_Residence(20, 50, 50) in the main process.
For multiprocessing to make sense in such a context, your work at hand would need to be broken down into a set of homogeneous tasks that can be processed in parallel, with their results potentially being merged later.
As an example (not necessarily a good one), if you want to calculate the largest prime factors for a sequence of numbers, you can delegate the task of calculating that factor for any specific number to a worker in a pool. Several workers would then do these individual calculations in parallel:
from datetime import datetime
from multiprocessing import Pool

def largest_prime_factor(n):
    p = n
    i = 2
    while i * i <= n:
        if n % i:
            i += 1
        else:
            n //= i
    return p, n


if __name__ == '__main__':
    pool = Pool(processes=3)
    start = datetime.now()
    # this delegates half a million individual tasks to the pool, i.e.
    # largest_prime_factor(0), largest_prime_factor(1), ..., largest_prime_factor(499999)
    pool.map(largest_prime_factor, range(500000))
    pool.close()
    pool.join()
    print "pool elapsed", datetime.now() - start

    start = datetime.now()
    # same work just in the main process
    [largest_prime_factor(i) for i in range(500000)]
    print "single elapsed", datetime.now() - start
Output:
pool elapsed 0:00:04.664000
single elapsed 0:00:08.939000
(the largest_prime_factor function is taken from @Stefan in this answer)
As you can see, the pool is only roughly twice as fast as single process execution of the same amount of work, all while running in three processes in parallel. That's due to the overhead introduced by multiprocessing/the pool.
So, you stated that the code in your example has been simplified. You'll have to analyse your original code to see if it can be broken down into homogeneous tasks that can be passed down to your pool for processing. If that is possible, using multiprocessing might help you speed up your program. If not, multiprocessing will likely cost you time rather than save it.
Edit:
Since you asked for suggestions on the code: I can hardly say anything about your function itself. You said yourself that it is just a simplified example to provide an MCVE (much appreciated, by the way! Most people don't take the time to strip their code down to its bare minimum). Requests for a code review are anyway better suited over at Code Review.
Play around a bit with the available methods of task delegation. In my prime factor example, using apply_async came with a massive penalty: execution time increased ninefold compared to using map. But my example uses just a simple iterable, while yours needs three arguments per task. This could be a case for starmap, but that is only available as of Python 3.3. Anyway, the structure/nature of your task data basically determines the correct method to use.
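A minimal sketch of starmap delegation under that assumption (Python 3.3+; main_residence below is only a stand-in for the question's Main_Residence, and the argument tuples reuse the values from the question):

from multiprocessing import Pool

def main_residence(total_frames, origin_list, target_list):
    # stand-in for the question's Main_Residence
    return total_frames * origin_list * target_list

if __name__ == '__main__':
    tasks = [(20, 50, 50)] * 10  # each tuple is unpacked into three arguments
    with Pool(processes=3) as pool:
        results = pool.starmap(main_residence, tasks)
    print(results)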
I did some quick-and-dirty testing with multiprocessing your example function.
The input was defined like this:
inp = [(20, 50, 50)] * 5000 # that makes 5000 tasks against your Main_Residence
I ran that in Python 3.6 in three subprocesses with your function unaltered, except for the removal of the print statement (I/O is costly). I used starmap, apply, starmap_async and apply_async, and also iterated through the results each time to account for the blocking get() on the async results.
Here's the output:
starmap elapsed 0:01:14.506600
apply elapsed 0:02:11.290600
starmap async elapsed 0:01:27.718800
apply async elapsed 0:01:12.571200
# btw: 5k calls to Main_Residence in the main process looks as bad
# as using apply for delegation
single elapsed 0:02:12.476800
As you can see, the execution times differ, although all four methods do the same amount of work; the apply_async you picked appears to be the fastest method.
Coding style: your code looks quite ... unconventional :) You use Capitalized_Words_With_Underscores for your names (both function and variable names); that's pretty much a no-no in Python. Also, assigning the name Previous_List to a dictionary is ... questionable. Have a look at PEP 8, especially the section Naming Conventions, to see the commonly accepted coding style for Python.
Judging by the way your print looks, you are still using Python 2. I know that in corporate or institutional environments that's sometimes all you have available. Still, keep in mind that the clock for Python 2 is ticking.

Python multiprocessing, using pool multiple times in a loop gets stuck after first iteration

I have the following situation, where I create a pool in a for loop as follows (I know it's not very elegant, but I have to do this for pickling reasons). Assume that pathos.multiprocessing is equivalent to Python's multiprocessing library (as it is, up to some details that are not relevant for this problem).
I have the following code I want to execute:
self.pool = pathos.multiprocessing.ProcessingPool(number_processes)
for i in range(5):
    all_responses = self.pool.map(wrapper_singlerun, range(self.no_of_restarts))
    pool._clear()
Now my problem: the loop successfully runs the first iteration. However, at the second iteration the algorithm suddenly stops (it does not finish the pool.map operation). I suspected that zombie processes were generated, or that the process was somehow switched. Below you will find everything I have tried so far.
for i in range(5):
    pool = pathos.multiprocessing.ProcessingPool(number_processes)
    all_responses = self.pool.map(wrapper_singlerun, range(self.no_of_restarts))
    pool._clear()
    gc.collect()
    for p in multiprocessing.active_children():
        p.terminate()
    gc.collect()
    print("We have so many active children: ", multiprocessing.active_children())  # Returns []
The above code works perfectly well on my Mac. However, when I run it on a cluster with the following specs, it gets stuck after the first iteration:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04 LTS"
Here is the link to pathos' multiprocessing library file.
I am assuming that you are calling this from some function, which is not the correct way to use it.
You need to wrap it with:
if __name__ == '__main__':
    for i in range(5):
        pool = pathos.multiprocessing.Pool(number_processes)
        all_responses = pool.map(wrapper_singlerun, range(self.no_of_restarts))
If you don't, it will keep creating copies of itself and putting them onto the stack, which will ultimately fill the stack and block everything. The reason it works on your Mac is that it has fork, while Windows does not.
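Another direction that is often suggested when a reused pool hangs (a hedged sketch using the standard library pool rather than pathos; wrapper_singlerun below is only a stand-in for the question's worker) is to create a fresh pool in every iteration and shut it down explicitly before the next one:

import multiprocessing

def wrapper_singlerun(i):
    # stand-in for the question's worker function
    return i * i

if __name__ == '__main__':
    no_of_restarts = 10
    for _ in range(5):
        pool = multiprocessing.Pool(processes=4)
        try:
            all_responses = pool.map(wrapper_singlerun, range(no_of_restarts))
        finally:
            pool.close()  # no more tasks will be submitted
            pool.join()   # wait for the workers to exit before the next iteration
        print(all_responses)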
