python pool doesn't work - python

I'm trying to use multithreading and in order to keep it simple at first, i'm running the following code:
import multiprocessing as mp
pool = mp.Pool(4)
def square(x):
return x**2
results=pool.map(square,range(1,20))
As i understand, results should be a list containig the squares from 1 to 20.
However, the code does not seem to terminate.(doing the same without pool finishes in a blink, this ran for several minutes, before i stopped it manually).
Additional information: task manager tells me, that the additional python processes have been launched and are running, but are using zero % of my cpu; still other unrelated processes like firefox skyrocket in their cpu usage, while the programm is running.
I'm using windows 8 and a i5-4300U cpu (pooling to 2 instead of 4 doesn't help either)
What am i doing wrong?
Are there any good ressources on the Pool class, which could help me understand what is wrong with my code?

Code with pool initialization should be inside __name__ == "__main__" as multiprocessing imports the module each time to spawn a new process.
import multiprocessing as mp
def square(x):
return x**2
if __name__ == '__main__':
pool = mp.Pool(4)
results=pool.map(square,range(1,20))

Your code causes each process to hit an attribute error because it can't find the square attribute because it wasn't defined when you instantiated the pool. The processes therefore hang thereafter. Defining square before pool would fix the problem.
See also:
yet another confusion with multiprocessing error, 'module' object has no attribute 'f'

Related

Multiprocessing pool doesn't seem to even call function?

I am trying to call a function in parallel using multiprocessing. I have used both starmap and map many times in the past to do so, but suddenly multiprocessing has stopped working. A pool is created, but the function is never called and the cell never finishes running. To test the issue, I am running a simple example code:
from multiprocessing import Pool
def f(x):
print(x)
return x*x
if __name__ == '__main__':
p = Pool(1)
results = p.map(f, [1, 2, 3])
p.close()
p.join()
When I run this, nothing is printed, and the process never completes.
I have also tried running old code from previous notebooks that contain multiprocessing, and these have failed, too. I tried updating all packages as well. Has anyone else experienced this problem before?

Python multiprocessing map using with statement does not stop

I am using multiprocessing python module to run parallel and unrelated jobs with a function similar to the following example:
import numpy as np
from multiprocessing import Pool
def myFunction(arg1):
name = "file_%s.npy"%arg1
A = np.load(arg1)
A[A<0] = np.nan
np.save(arg1,A)
if(__name__ == "__main__"):
N = list(range(50))
with Pool(4) as p:
p.map_async(myFunction, N)
p.close() # I tried with and without that statement
p.join() # I tried with and without that statement
DoOtherStuff()
My problem is that the function DoOtherStuff is never executed, the processes switches into sleep mode on top and I need to kill it with ctrl+C to stop it.
Any suggestions?
You have at least a couple problems. First, you are using map_async() which does not block until the results of the task are completed. So what you're doing is starting the task with map_async(), but then immediately closes and terminates the pool (the with statement calls Pool.terminate() upon exiting).
When you add tasks to a Process pool with methods like map_async it adds tasks to a task queue which is handled by a worker thread which takes tasks off that queue and farms them out to worker processes, possibly spawning new processes as needed (actually there is a separate thread which handles that).
Point being, you have a race condition where you're terminating the Pool likely before any tasks are even started. If you want your script to block until all the tasks are done just use map() instead of map_async(). For example, I rewrote your script like this:
import numpy as np
from multiprocessing import Pool
def myFunction(N):
A = np.load(f'file_{N:02}.npy')
A[A<0] = np.nan
np.save(f'file2_{N:02}.npy', A)
def DoOtherStuff():
print('done')
if __name__ == "__main__":
N = range(50)
with Pool(4) as p:
p.map(myFunction, N)
DoOtherStuff()
I don't know what your use case is exactly, but if you do want to use map_async(), so that this task can run in the background while you do other stuff, you have to leave the Pool open, and manage the AsyncResult object returned by map_async():
result = pool.map_async(myFunction, N)
DoOtherStuff()
# Is my map done yet? If not, we should still block until
# it finishes before ending the process
result.wait()
pool.close()
pool.join()
You can see more examples in the linked documentation.
I don't know why in your attempt you got a deadlock--I was not able to reproduce that. It's possible there was a bug at some point that was then fixed, though you were also possibly invoking undefined behavior with your race condition, as well as calling terminate() on a pool after it's already been join()ed. As for your why your answer did anything at all, it's possible that with the multiple calls to apply_async() you managed to skirt around the race condition somewhat, but this is not at all guaranteed to work.

Python - Multiprocessing pool.map(): Processes don't do anything

I am trying to use the python multiprocessing library in order to parallize a task I am working on:
import multiprocessing as MP
def myFunction((x,y,z)):
...create a sqlite3 database specific to x,y,z
...write to the database (one DB per process)
y = 'somestring'
z = <large read-only global dictionary to be shared>
jobs = []
for x in X:
jobs.append((x,y,z,))
pool = MP.Pool(processes=16)
pool.map(myFunction,jobs)
pool.close()
pool.join()
Sixteen processes are started as seen in htop, however no errors are returned, no files written, no CPU is used.
Could it happen that there is an error in myFunction that is not reported to STDOUT and blocks execution?
Perhaps it is relevant that the python script is called from a bash script running in background.
The lesson learned here was to follow the strategy suggested in one of the comments and use multiprocessing.dummy until everything works.
At least in my case, errors were not visible otherwise and the processes were still running as if nothing had happened.

multiprocessing.Pool in jupyter notebook works on linux but not windows

I'm trying to run a few independent computations (though reading from the same data). My code works when I run it on Ubuntu, but not on Windows (windows server 2012 R2), where I get the error:
'module' object has no attribute ...
when I try to use multiprocessing.Pool (it appears in the kernel console, not as output in the notebook itself)
(And I've already made the mistake of defining the function AFTER creating the pool, and I've also corrected it, that's not the problem).
This happens even on the simplest of examples:
from multiprocessing import Pool
def f(x):
return x**2
pool = Pool(4)
for res in pool.map(f,range(20)):
print res
I know that it needs to be able to import the module (and I have no idea how this works when working in the notebook), and I've heard of IPython.Parallel, but I have been unable to find any documentation or examples.
Any solutions/alternatives would be most welcome.
I would post this as a comment since I don't have a full answer, but I'll amend as I figure out what is going on.
from multiprocessing import Pool
def f(x):
return x**2
if __name__ == '__main__':
pool = Pool(4)
for res in pool.map(f,range(20)):
print(res)
This works. I believe the answer to this question is here. In short, the subprocesses do not know they are subprocesses and are attempting to run the main script recursively.
This is the error I am given, which gives us the same solution:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.

Using python multiprocessing Pool in the terminal and in code modules for Django or Flask

When using multiprocessing.Pool in python with the following code, there is some bizarre behavior.
from multiprocessing import Pool
p = Pool(3)
def f(x): return x
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
try: print(t.get(timeout=1))
except Exception: pass
I get the following error three times (one for each thread in the pool), and it prints "3" through "19":
AttributeError: 'module' object has no attribute 'f'
The first three apply_async calls never return.
Meanwhile, if I try:
from multiprocessing import Pool
p = Pool(3)
def f(x): print(x)
p.map(f, range(20))
I get the AttributeError 3 times, the shell prints "6" through "19", and then hangs and cannot be killed by [Ctrl] + [C]
The multiprocessing docs have the following to say:
Functionality within this package requires that the main module be
importable by the children.
What does this mean?
To clarify, I'm running code in the terminal to test functionality, but ultimately I want to be able to put this into modules of a web server. How do you properly use multiprocessing.Pool in the python terminal and in code modules?
Caveat: Multiprocessing is the wrong tool to use in the context of web servers like Django and Flask. Instead, you should use a task framework like Celery or an infrastructure solution like Elastic Beanstalk Worker Environments. Using multiprocessing to spawn threads or processes is bad because it gives you no oversight or management of those threads/processes, and so you have to build your own failure detection logic, retry logic, etc. At that point, you are better served by using an off-the-shelf tool that is actually designed to handle asynchronous tasks, because it will give you these out of the box.
Understanding the docs
Functionality within this package requires that the main module be importable by the children.
What this means is that pools must be initialized after the definitions of functions to be run on them. Using pools within if __name__ == "__main__": blocks works if you are writing a standalone script, but this isn't possible in either larger code bases or server code (such as a Django or Flask project). So, if you're trying to use Pools in one of these, make sure to follow these guidelines, which are explained in the sections below:
Initialize Pools inside functions whenever possible. If you have to initialize them in the global scope, do so at the bottom of the module.
Do not call the methods of a Pool in the global scope.
Alternatively, if you only need better parallelism on I/O (like database accesses or network calls), you can save yourself all this headache and use pools of threads instead of pools of processes. This involves the completely undocumented:
from multiprocessing.pool import ThreadPool
It's interface is exactly the same as that of Pool, but since it uses threads and not processes, it comes with none of the caveats that using process pools do, with the only downside being you don't get true parallelism of code execution, just parallelism in blocking I/O.
Pools must be initialized after the definitions of functions to be run on them
The inscrutable text from the python docs means that at the time the pool is defined, the surrounding module is imported by the threads in the pool. In the case of the python terminal, this means all and only code you have run so far.
So, any functions you want to use in the pool must be defined before the pool is initialized. This is true both of code in a module and code in the terminal. The following modifications of the code in the question will work fine:
from multiprocessing import Pool
def f(x): return x # FIRST
p = Pool(3) # SECOND
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
try: print(t.get(timeout=1))
except Exception: pass
Or
from multiprocessing import Pool
def f(x): print(x) # FIRST
p = Pool(3) # SECOND
p.map(f, range(20))
By fine, I mean fine on Unix. Windows has it's own problems, that I'm not going into here.
Using pools in modules
But wait, there's more (to using pools in modules that you want to import elsewhere)!
If you define a pool inside a function, you have no problems. But if you are using a Pool object as a global variable in a module, it must be defined at the bottom of the page, not the top. Though this goes against most good code style, it is necessary for functionality. The way to use a pool declared at the top of a page is to only use it with functions imported from other modules, like so:
from multiprocessing import Pool
from other_module import f
p = Pool(3)
p.map(f, range(20))
Importing a pre-configured pool from another module is pretty horrific, as the import must come after whatever you want to run on it, like so:
### module.py ###
from multiprocessing import Pool
POOL = Pool(5)
### module2.py ###
def f(x):
# Some function
from module import POOL
POOL.map(f, range(10))
And second, if you run anything on the pool in the global scope of a module that you are importing, the system hangs. i.e. this doesn't work:
### module.py ###
from multiprocessing import Pool
def f(x): return x
p = Pool(1)
print(p.map(f, range(5)))
### module2.py ###
import module
This, however, does work, as long as nothing imports module2:
### module.py ###
from multiprocessing import Pool
def f(x): return x
p = Pool(1)
def run_pool(): print(p.map(f, range(5)))
### module2.py ###
import module
module.run_pool()
Now, the reasons behind this are only more bizarre, and likely related to the reason that the code in the question only spits an Attribute Error once each and after that appear to execute code properly. It also appears that pool threads (at least with some reliability) reload the code in module after executing.
The function you want to execute on a thread pool must be already defined when you create the pool.
This should work:
from multiprocessing import Pool
def f(x): print(x)
if __name__ == '__main__':
p = Pool(3)
p.map(f, range(20))
The reason is that (at least on Unix-based systems, which have fork) when you create a pool the workers are created by forking the current process. So if the target function isn't already defined at that point, the worker won't be able to call it.
On Windows it's a bit different, as Windows doesn't have fork. Here new worker processes are started and the main module is imported. That's why on Windows it's important to protect the executing code with a if __name__ == '__main__'. Otherwise each new worker will re-execute the code and therefore spawn new processes infinitely, crashing the program (or the system).
There is another possible source for this error. I got this error when running the example code.
The source was that despite having installed multiprosessing correctly, the C++ compiler was not installed on my system, something pip informed me of when trying to update multiprocessing. So It might be worth checking that the compiler is installed.

Categories

Resources