ProcessPoolExecutor runs code outside of scope - Python

I'm trying to run a bunch of processes in parallel with ProcessPoolExecutor from concurrent.futures in Python.
The processes all run in parallel in a while loop, which is great, but for some reason the code outside of the main method runs repeatedly. I saw another answer suggesting the if __name__ == "__main__" check to fix this, but it still doesn't work.
Any ideas how I can get only the code inside the main method to run? My object keeps getting reset repeatedly.
EDIT: I ran my code with ThreadPoolExecutor instead and it fixed the problem, although I'm still curious why this happens.
import concurrent.futures
import time
from myFile import myObject
obj = myObject()

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        while condition:
            for index in range(0, 10):
                executor.submit(obj.function, index, index + 1)
            executor.submit(obj.function2)
            time.sleep(5)
            print("test")

if __name__ == "__main__":
    main()
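For future readers: on platforms that spawn worker processes (Windows, and macOS by default), each worker re-imports the main module, so a module-level statement like obj = myObject() runs again in every child. A minimal sketch of the usual fix, constructing the object inside the guarded block; myObject here is a trivial stand-in for the asker's class, which isn't shown:

```python
import concurrent.futures

class myObject:
    """Stand-in for the asker's class from myFile."""
    def function(self, a, b):
        return a + b

def main():
    # Created only in the parent process, after the __main__ check,
    # so re-importing this module in a worker does not rebuild it.
    obj = myObject()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(obj.function, i, i + 1) for i in range(10)]
        print([f.result() for f in futures])

if __name__ == "__main__":
    main()
```

Note that submitting the bound method obj.function pickles a copy of obj to each worker, so workers see copies, not the parent's object.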

Related

I am having problems with ProcessPoolExecutor from concurrent.futures

I have a big program that takes a while to run its calculation. I decided to learn about multithreading and multiprocessing because only 20% of my processor was being used. After seeing no improvement with multithreading, I decided to try multiprocessing, but whenever I do, it just shows a lot of errors, even on very simple code.
This is the code I tested after the problems started with my big, calculation-heavy code:
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
and the error message that I am having says
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
This is not the whole message, because it is very long, but I think it may be helpful. Pretty much everything else in the error message is just like "error at line ... in ...".
If it helps, the big code is at: https://github.com/nobody48sheldor/fuseeinator2.0
It might not be the latest version.
I updated your code to show main being called. This is an issue on operating systems that spawn processes, like Windows. To test on my Linux machine I had to add a bit of code, but this crashes on my machine:
# Test code to make linux spawn like Windows and generate the error.
# This code is not needed on Windows.
if __name__ == "__main__":
    import multiprocessing as mp
    mp.freeze_support()
    mp.set_start_method('spawn')

# test script
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
In a spawning system, Python can't just fork into a new execution context. Instead, it runs a new instance of the Python interpreter, imports the module, and pickles/unpickles enough state to create a child execution environment. This can be a very heavy operation.
But your script is not import safe. Since main() is called at module level, the import in the child would run main again. That would create a grandchild subprocess which runs main again (and so on, until you hang your machine). Python detects this infinite loop and displays the message instead.
Top-level scripts are always named "__main__". Put all of the code that should run only at script level inside an if. Then, if the module is imported, nothing harmful runs.
if __name__ == "__main__":
main()
and the script will work.
There are code analyzers out there that import modules to extract docstrings or other useful information. Your code shouldn't fire the missiles just because some tool did an import.
Another way to solve the problem is to move everything multiprocessing-related out of the script and into a module. Suppose I had a module with your code in it:
whatever.py
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)
myscript.py
#!/usr/bin/env python3
import whatever
whatever.main()
Now, since the pool is already in an imported module that doesn't do this crazy restart-itself thing, no if __name__ == "__main__": is necessary. It's a good idea to put it in myscript.py anyway, but it's not required.

Stack Overflow: multiprocessing with extra IF condition?

My code is
from some_module import foo, tasks, condition
import multiprocessing
def parallel():
    pool = multiprocessing.Pool()
    pool.map(foo, tasks)
    pool.close()
    pool.join()

if __name__ == '__main__' and condition:
    parallel()
My question is: what is the flow of this code? As I understand it, the first time through, it goes into the if-block, then into parallel(), where it creates a pool; once I call .map, it creates new child processes, and these child processes will... do what, exactly? Will they run the entire script? I know that means they won't go into the if-block, but if that's the case, how the hell do they know what to do? It's only the if-block that tells them to run a function...
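For what it's worth: under the spawn start method each child re-imports the script as a fresh module, so the if-block is skipped in the child (its __name__ is no longer "__main__"); the pool then sends the child a pickled reference to the function by name, which the child resolves in that freshly imported module. A small sketch that makes the re-import visible, forcing spawn so Linux behaves like Windows:

```python
import multiprocessing as mp
import os

# Runs once in the parent, and once more in every spawned child,
# because each child re-imports this module.
print(f"module level, pid={os.getpid()}")

def square(x):
    return x * x

if __name__ == '__main__':
    # Forcing spawn makes Linux behave like Windows here.
    mp.set_start_method('spawn', force=True)
    with mp.Pool(2) as pool:
        # Only the parent reaches this block; the workers merely
        # import the module and look up 'square' by name.
        print(pool.map(square, [1, 2, 3]))
```

Running this prints the module-level line several times (once per process) but the map result only once, from the parent.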

Multiprocessing in Python won't release memory

I am running a multiprocessing code. The framework of the code is something like below:
import multiprocessing
import numpy as np

def func_a(x):
    # main function here
    return result

def func_b(y):
    cores = multiprocessing.cpu_count() - 1
    pool = multiprocessing.Pool(processes=cores)
    results = pool.map(func_a, np.arange(1000))
    return results

if __name__ == '__main__':
    final_resu = []
    for i in range(0, 200):
        final_resu.append(func_b(i))
There are two problems with this code. First, memory usage keeps climbing during the loop. Second, in Task Manager (Windows 10), the number of Python processes increases step-wise, i.e. 14, to 25, to 36, to 47... with every iteration of the main loop.
I believe something is wrong with the multiprocessing, but I'm not sure how to deal with it. It looks like the pool created in func_b is not cleaned up when the main loop finishes an iteration?
As the examples in the docs show, when you're done with a Pool you should shut it down explicitly, via pool.close() followed by pool.join(). That said, it would be better still if, in addition, you created your Pool only once - e.g., pass a Pool as an argument to func_b() - and create it, and close it down, only once, in the __name__ == '__main__' block.
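A minimal sketch of that suggestion, creating the Pool once and passing it in; func_a here is a trivial placeholder for the real calculation:

```python
import multiprocessing

def func_a(x):
    return x * 2  # placeholder for the real calculation

def func_b(pool, y):
    # Reuse the pool passed in instead of building a new one per call,
    # so worker processes are created once, not 200 times.
    return pool.map(func_a, range(1000))

if __name__ == '__main__':
    final_resu = []
    workers = max(1, multiprocessing.cpu_count() - 1)
    # The with-statement shuts the pool down exactly once, on exit.
    with multiprocessing.Pool(processes=workers) as pool:
        for i in range(200):
            final_resu.append(func_b(pool, i))
```

With one long-lived pool, the process count in Task Manager stays flat and per-iteration memory growth from pool creation disappears.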

Python multiprocessing map using with statement does not stop

I am using the multiprocessing Python module to run parallel and unrelated jobs with a function similar to the following example:
import numpy as np
from multiprocessing import Pool
def myFunction(arg1):
    name = "file_%s.npy" % arg1
    A = np.load(arg1)
    A[A < 0] = np.nan
    np.save(arg1, A)

if __name__ == "__main__":
    N = list(range(50))
    with Pool(4) as p:
        p.map_async(myFunction, N)
        p.close()  # I tried with and without this statement
        p.join()   # I tried with and without this statement
    DoOtherStuff()
My problem is that the function DoOtherStuff is never executed, the processes switches into sleep mode on top and I need to kill it with ctrl+C to stop it.
Any suggestions?
You have at least a couple of problems. First, you are using map_async(), which does not block until the results of the task are complete. So what you're doing is starting the task with map_async(), but then immediately closing and terminating the pool (the with statement calls Pool.terminate() upon exiting).
When you add tasks to a process Pool with methods like map_async(), they go onto a task queue handled by a worker thread, which takes tasks off that queue and farms them out to worker processes, possibly spawning new processes as needed (actually there is a separate thread which handles that).
The point being, you have a race condition where you're likely terminating the Pool before any tasks are even started. If you want your script to block until all the tasks are done, just use map() instead of map_async(). For example, I rewrote your script like this:
import numpy as np
from multiprocessing import Pool

def myFunction(N):
    A = np.load(f'file_{N:02}.npy')
    A[A < 0] = np.nan
    np.save(f'file2_{N:02}.npy', A)

def DoOtherStuff():
    print('done')

if __name__ == "__main__":
    N = range(50)
    with Pool(4) as p:
        p.map(myFunction, N)
    DoOtherStuff()
I don't know what your use case is exactly, but if you do want to use map_async(), so that this task can run in the background while you do other stuff, you have to leave the Pool open, and manage the AsyncResult object returned by map_async():
result = pool.map_async(myFunction, N)
DoOtherStuff()
# Is my map done yet? If not, we should still block until
# it finishes before ending the process
result.wait()
pool.close()
pool.join()
You can see more examples in the linked documentation.
I don't know why you got a deadlock in your attempt; I was not able to reproduce it. It's possible there was a bug at some point that has since been fixed, though you were also possibly invoking undefined behavior with your race condition, as well as by calling terminate() on a pool after it had already been join()ed. As for why your answer did anything at all, it's possible that with the multiple calls to apply_async() you managed to skirt around the race condition somewhat, but this is not at all guaranteed to work.

python pool doesn't work

I'm trying to use multiprocessing, and in order to keep it simple at first, I'm running the following code:
import multiprocessing as mp

pool = mp.Pool(4)

def square(x):
    return x**2

results = pool.map(square, range(1, 20))
As I understand it, results should be a list containing the squares from 1 to 19.
However, the code does not seem to terminate. (Doing the same without a pool finishes in a blink; this ran for several minutes before I stopped it manually.)
Additional information: Task Manager tells me that the additional Python processes have been launched and are running, but are using zero percent of my CPU; yet other unrelated processes like Firefox skyrocket in their CPU usage while the program is running.
I'm using Windows 8 and an i5-4300U CPU (a pool of 2 instead of 4 doesn't help either).
What am I doing wrong?
Are there any good resources on the Pool class which could help me understand what is wrong with my code?
Code that initializes the pool should be inside the __name__ == "__main__" guard, because multiprocessing imports the module in each new process it spawns.
import multiprocessing as mp

def square(x):
    return x**2

if __name__ == '__main__':
    pool = mp.Pool(4)
    results = pool.map(square, range(1, 20))
Your code causes each worker process to hit an AttributeError: the workers can't find the square attribute because it wasn't yet defined when you instantiated the pool. The processes therefore hang thereafter. Defining square before creating the pool would also fix the problem.
See also:
yet another confusion with multiprocessing error, 'module' object has no attribute 'f'
