In Python 2.7, is there's a way to identify if the current forked/spawned process is a child process instance (as opposed to being starting as a regular process). My goal is to set a global variable differently if it's a child process (e.g. create a pool with size 0 for child else pool with some number greater than 0).
I can't pass a parameter into the function (being called to execute in the child process), as even before the function is invoked the process would have been initialized and hence the global variable (especially for spawned process).
Also I am not in a position to use freeze_support (unless of course I am miss understood how to use it) as my application is running in a web service container (flask). Hence there's no main method.
Any help will be much appreciated.
Sample code that goes into infinite loop if you run it on windows:
from multiprocessing import Pool, freeze_support
p = Pool(5) # This should be created only in parent process and not the child process
def f(x):
return x*x
if __name__ == '__main__':
freeze_support()
print(p.map(f, [1, 2, 3]))
I would suggest restructuring your program to something more like my example code below. You mentioned that you don't have a main function, but you can create a wrapper that handles your pool:
from multiprocessing import Pool, freeze_support
def f(x):
return x*x
def handle_request():
p = Pool(5) # pool will only be in the parent process
print(p.map(f, [1, 2, 3]))
p.close() # remember to clean up the resources you use
p.join()
return
if __name__ == '__main__':
freeze_support() # do you really need this?
# start your web service here and make it use `handle_request` as the callback
# when a request needs to be serviced
It sounds like you are having a bit of an XY problem. You shouldn't be making a pool of processes global. It's just bad. You're giving your subprocesses access to their own process objects, which allows you to accidentally do bad things, like make a child process join itself. If you create your pool within a wrapper that is called for each request, then you don't need to worry about a global variable.
In the comments, you mentioned that you want a persistent pool. There is indeed some overhead to creating a pool on each request, but it's far safer than having a global pool. Also, you now have the capability to handle multiple requests simultaneously, assuming your web service handles each request in their own thread/process, without multiple requests trampling on each other by trying to use the same pool. I would strongly suggest you try to use this approach, and if it doesn't meet your performance specifications, you can look at optimizing it in other ways (ie, no global pool) to meet your spec.
One other note: multiprocessing.freeze_support() only needs to be called if you intend to bundle your scripts into a Windows executable. Don't use it if you are not doing that.
Move the pool creation into the main section to only create a multiprocessing pool once, any only in the main process:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(5)
print(p.map(f, [1, 2, 3]))
This works because the only process that is executing in the __main__ name is the original process. Spawned processes run with the __mp_main__ module name.
create a pool with size 0 for child
The child processes should never start a new multiprocessing pool. Only handle your processes from a single entry point.
Related
So I'm using multiprocessing pool with 3 threads, to run a function that does a certain job, and I have a variable defined outside this function which equals 0, and every time the function do it job it should add 1 to that variable and print it, but every thread uses a separated variable
here is the code:
from multiprocessing import Pool
number_of_doe_jobs = 0
def thefunction():
global number_of_doe_jobs
# JOB CODE GOES HERE
number_of_doe_jobs+=1
if __name__ =="__main__":
p = Pool(3)
p.map(checker, datalist)
the desired output is that it adds 1 to number_of_doe_jobs ,
but every thread add 1 to it own number_of_doe_jobs , so there are 3 number_of_doe_jobs variables now.
You are not spawning 3 threads. You are spawning 3 processes. Each process has its own memory space, with its own copy of the interpreter and its own independent object space. Global variables are not shared across processes. There are ways to create shared variables (which communicate over sockets), but you might be better served by using a multiprocessing.Queue. Create it in the mainline code, and pass it as a parameter to the subprocesses. Have the jobs push a "complete" flag on the queue, and have the mainline code read the results.
FOLLOWUP
The NUMBER of jobs will always be equal to len(datalist), so it's not clear why you would track that. Here, I create a multiprocessing queue and pass that to the function. Python implements that by creating a socket. The checker function sends a signal when it finishes, and the mainline code fetches each one and prints it. q.get will block until something is in the queue.
import multiprocessing
def checker(q):
# JOB CODE GOES HERE
q.put( "done" )
if __name__ =="__main__":
q = multiprocessing.Queue()
p = Pool(3)
p.map(lambda: checker(q), datalist)
for _ in datalist:
print( q.get() )
I am using multiprocessing python module to run parallel and unrelated jobs with a function similar to the following example:
import numpy as np
from multiprocessing import Pool
def myFunction(arg1):
name = "file_%s.npy"%arg1
A = np.load(arg1)
A[A<0] = np.nan
np.save(arg1,A)
if(__name__ == "__main__"):
N = list(range(50))
with Pool(4) as p:
p.map_async(myFunction, N)
p.close() # I tried with and without that statement
p.join() # I tried with and without that statement
DoOtherStuff()
My problem is that the function DoOtherStuff is never executed, the processes switches into sleep mode on top and I need to kill it with ctrl+C to stop it.
Any suggestions?
You have at least a couple problems. First, you are using map_async() which does not block until the results of the task are completed. So what you're doing is starting the task with map_async(), but then immediately closes and terminates the pool (the with statement calls Pool.terminate() upon exiting).
When you add tasks to a Process pool with methods like map_async it adds tasks to a task queue which is handled by a worker thread which takes tasks off that queue and farms them out to worker processes, possibly spawning new processes as needed (actually there is a separate thread which handles that).
Point being, you have a race condition where you're terminating the Pool likely before any tasks are even started. If you want your script to block until all the tasks are done just use map() instead of map_async(). For example, I rewrote your script like this:
import numpy as np
from multiprocessing import Pool
def myFunction(N):
A = np.load(f'file_{N:02}.npy')
A[A<0] = np.nan
np.save(f'file2_{N:02}.npy', A)
def DoOtherStuff():
print('done')
if __name__ == "__main__":
N = range(50)
with Pool(4) as p:
p.map(myFunction, N)
DoOtherStuff()
I don't know what your use case is exactly, but if you do want to use map_async(), so that this task can run in the background while you do other stuff, you have to leave the Pool open, and manage the AsyncResult object returned by map_async():
result = pool.map_async(myFunction, N)
DoOtherStuff()
# Is my map done yet? If not, we should still block until
# it finishes before ending the process
result.wait()
pool.close()
pool.join()
You can see more examples in the linked documentation.
I don't know why in your attempt you got a deadlock--I was not able to reproduce that. It's possible there was a bug at some point that was then fixed, though you were also possibly invoking undefined behavior with your race condition, as well as calling terminate() on a pool after it's already been join()ed. As for your why your answer did anything at all, it's possible that with the multiple calls to apply_async() you managed to skirt around the race condition somewhat, but this is not at all guaranteed to work.
I'm trying to reduce the processing time of reading a database with roughly 100,000 entries, but I need them to be formatted a specific way, in an attempt to do this, I tried to use python's multiprocessing.map function which works perfectly except that I can't seem to get any form of queue reference to work across them.
I've been using information from Filling a queue and managing multiprocessing in python to guide me for using queues across multiple processes, and Using a global variable with a thread to guide me for using global variables across threads. I've gotten the software to work, but when I check the list/queue/dict/map length after running the process, it always returns zero
I've written a simple example to show what I mean:
You have to run the script as a file, the map's initialize function does not work from the interpreter.
from multiprocessing import Pool
from collections import deque
global_q = deque()
def my_init(q):
global global_q
global_q = q
q.append("Hello world")
def map_fn(i):
global global_q
global_q.append(i)
if __name__ == "__main__":
with Pool(3, my_init, (global_q,)) as pool:
pool.map(map_fn, range(3))
for p in range(len(global_q)):
print(global_q.pop())
Theoretically, when I pass the queue object reference from the main thread to the worker threads using the pool function, and then initialize that thread's global variables using with the given function, then when I insert elements into the queue from the map function later, that object reference should still be pointing to the original queue object reference (long story short, everything should end up in the same queue, because they all point to the same location in memory).
So, I expect:
Hello World
Hello World
Hello World
1
2
3
of course, the 1, 2, 3's are in arbitrary order, but what you'll see on the output is ''.
How come when I pass object references to the pool function, nothing happens?
Here's an example of how to share something between processes by extending the multiprocessing.managers.BaseManager class to support deques.
There's a Customized managers section in the documentation about creating them.
import collections
from multiprocessing import Pool
from multiprocessing.managers import BaseManager
class DequeManager(BaseManager):
pass
class DequeProxy(object):
def __init__(self, *args):
self.deque = collections.deque(*args)
def __len__(self):
return self.deque.__len__()
def appendleft(self, x):
self.deque.appendleft(x)
def append(self, x):
self.deque.append(x)
def pop(self):
return self.deque.pop()
def popleft(self):
return self.deque.popleft()
# Currently only exposes a subset of deque's methods.
DequeManager.register('DequeProxy', DequeProxy,
exposed=['__len__', 'append', 'appendleft',
'pop', 'popleft'])
process_shared_deque = None # Global only within each process.
def my_init(q):
""" Initialize module-level global. """
global process_shared_deque
process_shared_deque = q
q.append("Hello world")
def map_fn(i):
process_shared_deque.append(i) # deque's don't have a "put()" method.
if __name__ == "__main__":
manager = DequeManager()
manager.start()
shared_deque = manager.DequeProxy()
with Pool(3, my_init, (shared_deque,)) as pool:
pool.map(map_fn, range(3))
for p in range(len(shared_deque)): # Show left-to-right contents.
print(shared_deque.popleft())
Output:
Hello world
0
1
2
Hello world
Hello world
You cant use global variable for multiprocesing.
Pass to the function multiprocessing queue.
from multiprocessing import Queue
queue= Queue()
def worker(q):
q.put(something)
Also you are propably experiencing that the code is allright, but as the pool create separate processes, even the errors are separeted and therefore you dont see the code not only isnt working, but that it throws error.
The reason why your output is '', is because nothing was appended to your q/global_q. And if it was appended, then only some variable, that may be called global_q, but its totally different one than your global_q in your main thread
Try to print('Hello world') inside the function you want to multiprocess and you will see by yourself, that nothing is actually printed at all. That processes is simply outside of your main thread and the only way to access that process is by multiprocessing Queues. You access the Queue by queue.put('something') and something = queue.get()
Try to understand this code and you will do well:
import multiprocessing as mp
shared_queue = mp.Queue() # This will be shared among all procesess, but you need to pass the queue as an argument in the process. You CANNOT use it as global variable. Understand that the functions kind of run in total different processes and nothing can really access them... Except multiprocessing.Queue - that can be shared across all processes.
def channel(que,channel_num):
que.put(channel_num)
if __name__ == '__main__':
processes = [mp.Process(target=channel, args=(shared_queue, channel_num)) for channel_num in range(8)]
for p in processes:
p.start()
for p in processes: # wait for all results to close the pool
p.join()
for i in range(8): # Get data from Queue. (you can get data out of it at any time actually)
print(shared_queue.get())
I have a class variable declared as a list that I want to update from a method declared within that class. However since this method processes a large amount of data, I am using multiprocessing to invoke it and hence I need to put lock on the class variable before updating it. I am unable to figure out how to put such a lock and update the class variable. If it matters, I am only creating one object of the said class at any given time.
Because of python's GIL, multiprocessing can only be used whith completely separate tasks, and no shared memory.
But you still can make it happend by using multiprocessing shared Array/Value:
from https://docs.python.org/2/library/multiprocessing.html#sharing-state-between-processes
from multiprocessing import Process, Value, Array
def f(n, a):
n.value = 3.1415927
for i in range(len(a)):
a[i] = -a[i]
if __name__ == '__main__':
num = Value('d', 0.0)
arr = Array('i', range(10))
p = Process(target=f, args=(num, arr))
p.start()
p.join()
print num.value
print arr[:]
Now as you asked, you need to ensure that differents processes won't access the same variable at the same time, and use a Lock. Hopefuly, all the shared variable available in the multiprocessing module are paired with a Lock.
To access the lock :
num.acquire() # get the lock
# do stuff
num.release() # don't forget to release it
I hope this helps.
If you're using the multiprocessing module (as opposed to multithreading, which is different), then unless I'm mistaken, the multiple processes forked don't share memory and each process would have its own copy of your class. This would mean that a lock would not be necessary, but it would also mean that the class attribute is not shared like you want it to be.
The multiprocessing module does offer several ways to allow communication between processes, including shared array objects. Perhaps this is what you're looking for.
Depending on what you're doing, you might also consider using the master-worker pattern, where you create a worker class with methods to manipulate your data, spawn several processes to run this class, and then dispatch datasets to the workers from your main process using the Queue class from the multiprocessing module.
When using multiprocessing.Pool in python with the following code, there is some bizarre behavior.
from multiprocessing import Pool
p = Pool(3)
def f(x): return x
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
try: print(t.get(timeout=1))
except Exception: pass
I get the following error three times (one for each thread in the pool), and it prints "3" through "19":
AttributeError: 'module' object has no attribute 'f'
The first three apply_async calls never return.
Meanwhile, if I try:
from multiprocessing import Pool
p = Pool(3)
def f(x): print(x)
p.map(f, range(20))
I get the AttributeError 3 times, the shell prints "6" through "19", and then hangs and cannot be killed by [Ctrl] + [C]
The multiprocessing docs have the following to say:
Functionality within this package requires that the main module be
importable by the children.
What does this mean?
To clarify, I'm running code in the terminal to test functionality, but ultimately I want to be able to put this into modules of a web server. How do you properly use multiprocessing.Pool in the python terminal and in code modules?
Caveat: Multiprocessing is the wrong tool to use in the context of web servers like Django and Flask. Instead, you should use a task framework like Celery or an infrastructure solution like Elastic Beanstalk Worker Environments. Using multiprocessing to spawn threads or processes is bad because it gives you no oversight or management of those threads/processes, and so you have to build your own failure detection logic, retry logic, etc. At that point, you are better served by using an off-the-shelf tool that is actually designed to handle asynchronous tasks, because it will give you these out of the box.
Understanding the docs
Functionality within this package requires that the main module be importable by the children.
What this means is that pools must be initialized after the definitions of functions to be run on them. Using pools within if __name__ == "__main__": blocks works if you are writing a standalone script, but this isn't possible in either larger code bases or server code (such as a Django or Flask project). So, if you're trying to use Pools in one of these, make sure to follow these guidelines, which are explained in the sections below:
Initialize Pools inside functions whenever possible. If you have to initialize them in the global scope, do so at the bottom of the module.
Do not call the methods of a Pool in the global scope.
Alternatively, if you only need better parallelism on I/O (like database accesses or network calls), you can save yourself all this headache and use pools of threads instead of pools of processes. This involves the completely undocumented:
from multiprocessing.pool import ThreadPool
It's interface is exactly the same as that of Pool, but since it uses threads and not processes, it comes with none of the caveats that using process pools do, with the only downside being you don't get true parallelism of code execution, just parallelism in blocking I/O.
Pools must be initialized after the definitions of functions to be run on them
The inscrutable text from the python docs means that at the time the pool is defined, the surrounding module is imported by the threads in the pool. In the case of the python terminal, this means all and only code you have run so far.
So, any functions you want to use in the pool must be defined before the pool is initialized. This is true both of code in a module and code in the terminal. The following modifications of the code in the question will work fine:
from multiprocessing import Pool
def f(x): return x # FIRST
p = Pool(3) # SECOND
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
try: print(t.get(timeout=1))
except Exception: pass
Or
from multiprocessing import Pool
def f(x): print(x) # FIRST
p = Pool(3) # SECOND
p.map(f, range(20))
By fine, I mean fine on Unix. Windows has it's own problems, that I'm not going into here.
Using pools in modules
But wait, there's more (to using pools in modules that you want to import elsewhere)!
If you define a pool inside a function, you have no problems. But if you are using a Pool object as a global variable in a module, it must be defined at the bottom of the page, not the top. Though this goes against most good code style, it is necessary for functionality. The way to use a pool declared at the top of a page is to only use it with functions imported from other modules, like so:
from multiprocessing import Pool
from other_module import f
p = Pool(3)
p.map(f, range(20))
Importing a pre-configured pool from another module is pretty horrific, as the import must come after whatever you want to run on it, like so:
### module.py ###
from multiprocessing import Pool
POOL = Pool(5)
### module2.py ###
def f(x):
# Some function
from module import POOL
POOL.map(f, range(10))
And second, if you run anything on the pool in the global scope of a module that you are importing, the system hangs. i.e. this doesn't work:
### module.py ###
from multiprocessing import Pool
def f(x): return x
p = Pool(1)
print(p.map(f, range(5)))
### module2.py ###
import module
This, however, does work, as long as nothing imports module2:
### module.py ###
from multiprocessing import Pool
def f(x): return x
p = Pool(1)
def run_pool(): print(p.map(f, range(5)))
### module2.py ###
import module
module.run_pool()
Now, the reasons behind this are only more bizarre, and likely related to the reason that the code in the question only spits an Attribute Error once each and after that appear to execute code properly. It also appears that pool threads (at least with some reliability) reload the code in module after executing.
The function you want to execute on a thread pool must be already defined when you create the pool.
This should work:
from multiprocessing import Pool
def f(x): print(x)
if __name__ == '__main__':
p = Pool(3)
p.map(f, range(20))
The reason is that (at least on Unix-based systems, which have fork) when you create a pool the workers are created by forking the current process. So if the target function isn't already defined at that point, the worker won't be able to call it.
On Windows it's a bit different, as Windows doesn't have fork. Here new worker processes are started and the main module is imported. That's why on Windows it's important to protect the executing code with a if __name__ == '__main__'. Otherwise each new worker will re-execute the code and therefore spawn new processes infinitely, crashing the program (or the system).
There is another possible source for this error. I got this error when running the example code.
The source was that despite having installed multiprosessing correctly, the C++ compiler was not installed on my system, something pip informed me of when trying to update multiprocessing. So It might be worth checking that the compiler is installed.