Why must we explicitly pass constants into multiprocessing functions?

Why must we explicitly pass constants into multiprocessing functions? - python

I have been working with the multiprocessing package to speed up some geoprocessing (GIS/arcpy) tasks that are redundant and need to be done the same for more than 2,000 similar geometries.
The splitting up works well, but my "worker" function is rather long and complicated because the task itself from start to finish is complicated. I would love to break the steps apart down more but I am having trouble passing information to/from the worker function because for some reason ANYTHING that a worker function under multiprocessing uses needs to be passed in explicitly.
This means I cannot define constants in the body of if __name__ == '__main__' and then use them in the worker function. It also means that my parameter list for the worker function is getting really long - which is super ugly since trying to use more than one parameter also requires creating a helper "star" function and then itertools to rezip them back up (a la the second answer on this question).
I have created a trivial example below that demonstrates what I am talking about. Are there any workarounds for this - a different approach I should be using - or can someone at least explain why this is the way it is?
Note: I am running this on Windows Server 2008 R2 Enterprise x64.
Edit: I seem to have not made my question clear enough. I am not that concerned with how pool.map only takes one argument (although it is annoying) but rather I do not understand why the scope of a function defined outside of if __name__ == '__main__' cannot access things defined inside that block if it is used as a multiprocessing function - unless you explicitly pass it as an argument, which is obnoxious.
import os
import multiprocessing
import itertools
def loop_function(word):
file_name = os.path.join(root_dir, word + '.txt')
with open(file_name, "w") as text_file:
text_file.write(word + " food")
def nonloop_function(word, root_dir): # <------ PROBLEM
file_name = os.path.join(root_dir, word + '.txt')
with open(file_name, "w") as text_file:
text_file.write(word + " food")
def nonloop_star(arg_package):
return nonloop_function(*arg_package)
# Serial version
#
# if __name__ == '__main__':
# root_dir = 'C:\\hbrowning'
# word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
# for word in word_list:
# loop_function(word)
#
## --------------------------------------------
# Multiprocessing version
if __name__ == '__main__':
root_dir = 'C:\\hbrowning'
word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
NUM_CORES = 2
pool = multiprocessing.Pool(NUM_CORES, maxtasksperchild=1)
results = pool.map(nonloop_star, itertools.izip(word_list, itertools.repeat(root_dir)),
chunksize=1)
pool.close()
pool.join()

The problem is, at least on Windows (although there are similar caveats with *nix fork style of multiprocessing, too) that, when you execute your script, it (to greatly simplify it) effectively ends up as as if you called two blank (shell) processes with subprocess.Popen() and then have them execute:
python -c "from your_script import nonloop_star; nonloop_star(('dog', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('cat', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('yeti', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('parakeet', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('dolphin', 'C:\\hbrowning'))"
one by one as soon as one of those processes finishes with the previous call. That means that your if __name__ == "__main__" block never gets executed (because it's not the main script, it's imported as a module) so anything declared within it is not readily available to the function (as it was never evaluated).
For the staff outside your function you can at least cheat by accessing your module via sys.modules["your_script"] or even with globals() but that works only for the evaluated staff, so anything that was placed within the if __name__ == "__main__" guard is not available as it didn't even had a chance. That's also a reason why you must use this guard on Windows - without it you'd be executing your pool creation, and other code that you nested within the guard, over and over again with each spawned process.
If you need to share read-only data in your multiprocessing functions, just define it in the global namespace of your script, outside of that __main__ guard, and all functions will have the access to it (as it gets re-evaluated when starting a new process) regardless if they are running as separate processes or not.
If you need data that changes then you need to use something that can synchronize itself over different processes - there is a slew of modules designed for that, but most of the time Python's own pickle-based, datagram communicating multiprocessing.Manager (and types it provides), albeit slow and not very flexible, is enough.

Python » 3.6.1 Documentation: multiprocessing.pool.Pool
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though)
There is no Restriction, only it have to be a iterable!
Try a class Container, for instance:
class WP(object):
def __init__(self, name):
self.root_dir ='C:\\hbrowning'
self.name = name
word_list = [WP('dog'), WP('cat'), WP('llama'), WP('yeti'), WP('parakeet'), WP('dolphin')]
results = pool.map(nonloop_star, word_list, chunksize=1)
Note: The Var Types inside the class have to be pickleable!
Read about what-can-be-pickled-and-unpickled

Related

How to share variables in multiprocessing [duplicate]

The following does not work
one.py
import shared
shared.value = 'Hello'
raw_input('A cheap way to keep process alive..')
two.py
import shared
print shared.value
run on two command lines as:
>>python one.py
>>python two.py
(the second one gets an attribute error, rightly so).
Is there a way to accomplish this, that is, share a variable between two scripts?

Hope it's OK to jot down my notes about this issue here.
First of all, I appreciate the example in the OP a lot, because that is where I started as well - although it made me think shared is some built-in Python module, until I found a complete example at [Tutor] Global Variables between Modules ??.
However, when I looked for "sharing variables between scripts" (or processes) - besides the case when a Python script needs to use variables defined in other Python source files (but not necessarily running processes) - I mostly stumbled upon two other use cases:
A script forks itself into multiple child processes, which then run in parallel (possibly on multiple processors) on the same PC
A script spawns multiple other child processes, which then run in parallel (possibly on multiple processors) on the same PC
As such, most hits regarding "shared variables" and "interprocess communication" (IPC) discuss cases like these two; however, in both of these cases one can observe a "parent", to which the "children" usually have a reference.
What I am interested in, however, is running multiple invocations of the same script, ran independently, and sharing data between those (as in Python: how to share an object instance across multiple invocations of a script), in a singleton/single instance mode. That kind of problem is not really addressed by the above two cases - instead, it essentially reduces to the example in OP (sharing variables across two scripts).
Now, when dealing with this problem in Perl, there is IPC::Shareable; which "allows you to tie a variable to shared memory", using "an integer number or 4 character string[1] that serves as a common identifier for data across process space". Thus, there are no temporary files, nor networking setups - which I find great for my use case; so I was looking for the same in Python.
However, as accepted answer by #Drewfer notes: "You're not going to be able to do what you want without storing the information somewhere external to the two instances of the interpreter"; or in other words: either you have to use a networking/socket setup - or you have to use temporary files (ergo, no shared RAM for "totally separate python sessions").
Now, even with these considerations, it is kinda difficult to find working examples (except for pickle) - also in the docs for mmap and multiprocessing. I have managed to find some other examples - which also describe some pitfalls that the docs do not mention:
Usage of mmap: working code in two different scripts at Sharing Python data between processes using mmap | schmichael's blog
Demonstrates how both scripts change the shared value
Note that here a temporary file is created as storage for saved data - mmap is just a special interface for accessing this temporary file
Usage of multiprocessing: working code at:
Python multiprocessing RemoteManager under a multiprocessing.Process - working example of SyncManager (via manager.start()) with shared Queue; server(s) writes, clients read (shared data)
Comparison of the multiprocessing module and pyro? - working example of BaseManager (via server.serve_forever()) with shared custom class; server writes, client reads and writes
How to synchronize a python dict with multiprocessing - this answer has a great explanation of multiprocessing pitfalls, and is a working example of SyncManager (via manager.start()) with shared dict; server does nothing, client reads and writes
Thanks to these examples, I came up with an example, which essentially does the same as the mmap example, with approaches from the "synchronize a python dict" example - using BaseManager (via manager.start() through file path address) with shared list; both server and client read and write (pasted below). Note that:
multiprocessing managers can be started either via manager.start() or server.serve_forever()
serve_forever() locks - start() doesn't
There is auto-logging facility in multiprocessing: it seems to work fine with start()ed processes - but seems to ignore the ones that serve_forever()
The address specification in multiprocessing can be IP (socket) or temporary file (possibly a pipe?) path; in multiprocessing docs:
Most examples use multiprocessing.Manager() - this is just a function (not class instantiation) which returns a SyncManager, which is a special subclass of BaseManager; and uses start() - but not for IPC between independently ran scripts; here a file path is used
Few other examples serve_forever() approach for IPC between independently ran scripts; here IP/socket address is used
If an address is not specified, then an temp file path is used automatically (see 16.6.2.12. Logging for an example of how to see this)
In addition to all the pitfalls in the "synchronize a python dict" post, there are additional ones in case of a list. That post notes:
All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
The workaround to dict['key'] getting and setting, is the use of the dict public methods get and update. The problem is that there are no such public methods as alternative for list[index]; thus, for a shared list, in addition we have to register __getitem__ and __setitem__ methods (which are private for list) as exposed, which means we also have to re-register all the public methods for list as well :/
Well, I think those were the most critical things; these are the two scripts - they can just be ran in separate terminals (server first); note developed on Linux with Python 2.7:
a.py (server):
import multiprocessing
import multiprocessing.managers
import logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.INFO)
class MyListManager(multiprocessing.managers.BaseManager):
pass
syncarr = []
def get_arr():
return syncarr
def main():
# print dir([]) # cannot do `exposed = dir([])`!! manually:
MyListManager.register("syncarr", get_arr, exposed=['__getitem__', '__setitem__', '__str__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort'])
manager = MyListManager(address=('/tmp/mypipe'), authkey='')
manager.start()
# we don't use the same name as `syncarr` here (although we could);
# just to see that `syncarr_tmp` is actually <AutoProxy[syncarr] object>
# so we also have to expose `__str__` method in order to print its list values!
syncarr_tmp = manager.syncarr()
print("syncarr (master):", syncarr, "syncarr_tmp:", syncarr_tmp)
print("syncarr initial:", syncarr_tmp.__str__())
syncarr_tmp.append(140)
syncarr_tmp.append("hello")
print("syncarr set:", str(syncarr_tmp))
raw_input('Now run b.py and press ENTER')
print
print 'Changing [0]'
syncarr_tmp.__setitem__(0, 250)
print 'Changing [1]'
syncarr_tmp.__setitem__(1, "foo")
new_i = raw_input('Enter a new int value for [0]: ')
syncarr_tmp.__setitem__(0, int(new_i))
raw_input("Press any key (NOT Ctrl-C!) to kill server (but kill client first)".center(50, "-"))
manager.shutdown()
if __name__ == '__main__':
main()
b.py (client)
import time
import multiprocessing
import multiprocessing.managers
import logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.INFO)
class MyListManager(multiprocessing.managers.BaseManager):
pass
MyListManager.register("syncarr")
def main():
manager = MyListManager(address=('/tmp/mypipe'), authkey='')
manager.connect()
syncarr = manager.syncarr()
print "arr = %s" % (dir(syncarr))
# note here we need not bother with __str__
# syncarr can be printed as a list without a problem:
print "List at start:", syncarr
print "Changing from client"
syncarr.append(30)
print "List now:", syncarr
o0 = None
o1 = None
while 1:
new_0 = syncarr.__getitem__(0) # syncarr[0]
new_1 = syncarr.__getitem__(1) # syncarr[1]
if o0 != new_0 or o1 != new_1:
print 'o0: %s => %s' % (str(o0), str(new_0))
print 'o1: %s => %s' % (str(o1), str(new_1))
print "List is:", syncarr
print 'Press Ctrl-C to exit'
o0 = new_0
o1 = new_1
time.sleep(1)
if __name__ == '__main__':
main()
As a final remark, on Linux /tmp/mypipe is created - but is 0 bytes, and has attributes srwxr-xr-x (for a socket); I guess this makes me happy, as I neither have to worry about network ports, nor about temporary files as such :)
Other related questions:
Python: Possible to share in-memory data between 2 separate processes (very good explanation)
Efficient Python to Python IPC
Python: Sending a variable to another script

You're not going to be able to do what you want without storing the information somewhere external to the two instances of the interpreter.
If it's just simple variables you want, you can easily dump a python dict to a file with the pickle module in script one and then re-load it in script two.
Example:
one.py
import pickle
shared = {"Foo":"Bar", "Parrot":"Dead"}
fp = open("shared.pkl","w")
pickle.dump(shared, fp)
two.py
import pickle
fp = open("shared.pkl")
shared = pickle.load(fp)
print shared["Foo"]

sudo apt-get install memcached python-memcache
one.py
import memcache
shared = memcache.Client(['127.0.0.1:11211'], debug=0)
shared.set('Value', 'Hello')
two.py
import memcache
shared = memcache.Client(['127.0.0.1:11211'], debug=0)
print shared.get('Value')

What you're trying to do here (store a shared state in a Python module over separate python interpreters) won't work.
A value in a module can be updated by one module and then read by another module, but this must be within the same Python interpreter. What you seem to be doing here is actually a sort of interprocess communication; this could be accomplished via socket communication between the two processes, but it is significantly less trivial than what you are expecting to have work here.

you can use the relative simple mmap file.
you can use the shared.py to store the common constants. The following code will work across different python interpreters \ scripts \processes
shared.py:
MMAP_SIZE = 16*1024
MMAP_NAME = 'Global\\SHARED_MMAP_NAME'
* The "Global" is windows syntax for global names
one.py:
from shared import MMAP_SIZE,MMAP_NAME
def write_to_mmap():
map_file = mmap.mmap(-1,MMAP_SIZE,tagname=MMAP_NAME,access=mmap.ACCESS_WRITE)
map_file.seek(0)
map_file.write('hello\n')
ret = map_file.flush() != 0
if sys.platform.startswith('win'):
assert(ret != 0)
else:
assert(ret == 0)
two.py:
from shared import MMAP_SIZE,MMAP_NAME
def read_from_mmap():
map_file = mmap.mmap(-1,MMAP_SIZE,tagname=MMAP_NAME,access=mmap.ACCESS_READ)
map_file.seek(0)
data = map_file.readline().rstrip('\n')
map_file.close()
print data
*This code was written for windows, linux might need little adjustments
more info at - https://docs.python.org/2/library/mmap.html

Share a dynamic variable by Redis:
script_one.py
from redis import Redis
from time import sleep
cli = Redis('localhost')
shared_var = 1
while True:
cli.set('share_place', shared_var)
shared_var += 1
sleep(1)
Run script_one in a terminal (a process):
$ python script_one.py
script_two.py
from redis import Redis
from time import sleep
cli = Redis('localhost')
while True:
print(int(cli.get('share_place')))
sleep(1)
Run script_two in another terminal (another process):
$ python script_two.py
Out:
1
2
3
4
5
...
Dependencies:
$ pip install redis
$ apt-get install redis-server

I'd advise that you use the multiprocessing module. You can't run two scripts from the commandline, but you can have two separate processes easily speak to each other.
From the doc's examples:
from multiprocessing import Process, Queue
def f(q):
q.put([42, None, 'hello'])
if __name__ == '__main__':
q = Queue()
p = Process(target=f, args=(q,))
p.start()
print q.get() # prints "[42, None, 'hello']"
p.join()

You need to store the variable in some sort of persistent file. There are several modules to do this, depending on your exact need.
The pickle and cPickle module can save and load most python objects to file.
The shelve module can store python objects in a dictionary-like structure (using pickle behind the scenes).
The dbm/bsddb/dbhash/gdm modules can store string variables in a dictionary-like structure.
The sqlite3 module can store data in a lightweight SQL database.
The biggest problem with most of these are that they are not synchronised across different processes - if one process reads a value while another is writing to the datastore then you may get incorrect data or data corruption. To get round this you will need to write your own file locking mechanism or use a full-blown database.

If you wanna read and modify shared data between 2 scripts which run separately, a good solution would be to take advantage of python multiprocessing module and use a Pipe() or a Queue() (see differences here). This way you get to sync scripts and avoid problems regarding concurrency and global variables (like what happens if both scripts wanna modify a variable at the same time).
The best part about using pipes/queues is that you can pass python objects through them.
Also there are methods to avoid waiting for data if there hasn't been passed yet (queue.empty() and pipeConn.poll()).
See an example using Queue() below:
# main.py
from multiprocessing import Process, Queue
from stage1 import Stage1
from stage2 import Stage2
s1= Stage1()
s2= Stage2()
# S1 to S2 communication
queueS1 = Queue() # s1.stage1() writes to queueS1
# S2 to S1 communication
queueS2 = Queue() # s2.stage2() writes to queueS2
# start s2 as another process
s2 = Process(target=s2.stage2, args=(queueS1, queueS2))
s2.daemon = True
s2.start() # Launch the stage2 process
s1.stage1(queueS1, queueS2) # start sending stuff from s1 to s2
s2.join() # wait till s2 daemon finishes
# stage1.py
import time
import random
class Stage1:
def stage1(self, queueS1, queueS2):
print("stage1")
lala = []
lis = [1, 2, 3, 4, 5]
for i in range(len(lis)):
# to avoid unnecessary waiting
if not queueS2.empty():
msg = queueS2.get() # get msg from s2
print("! ! ! stage1 RECEIVED from s2:", msg)
lala = [6, 7, 8] # now that a msg was received, further msgs will be different
time.sleep(1) # work
random.shuffle(lis)
queueS1.put(lis + lala)
queueS1.put('s1 is DONE')
# stage2.py
import time
class Stage2:
def stage2(self, queueS1, queueS2):
print("stage2")
while True:
msg = queueS1.get() # wait till there is a msg from s1
print("- - - stage2 RECEIVED from s1:", msg)
if msg == 's1 is DONE ':
break # ends loop
time.sleep(1) # work
queueS2.put("update lists")
EDIT: just found that you can use queue.get(False) to avoid blockage when receiving data. This way there's no need to check first if the queue is empty. This is no possible if you use pipes.

Use text files or environnement variables. Since the two run separatly, you can't really do what you are trying to do.

In your example, the first script runs to completion, and then the second script runs. That means you need some sort of persistent state. Other answers have suggested using text files or Python's pickle module. Personally I am lazy, and I wouldn't use a text file when I could use pickle; why should I write a parser to parse my own text file format?
Instead of pickle you could also use the json module to store it as JSON. This might be preferable if you want to share the data to non-Python programs, as JSON is a simple and common standard. If your Python doesn't have json, get simplejson.
If your needs go beyond pickle or json -- say you actually want to have two Python programs executing at the same time and updating the persistent state variables in real time -- I suggest you use the SQLite database. Use an ORM to abstract the database away, and it's super easy. For SQLite and Python, I recommend Autumn ORM.

This method seems straight forward for me:
class SharedClass:
def __init__(self):
self.data = {}
def set_data(self, name, value):
self.data[name] = value
def get_data(self, name):
try:
return self.data[name]
except:
return "none"
def reset_data(self):
self.data = {}
sharedClass = SharedClass()
PS : you can set the data with a parameter name and a value for it, and to access the value you can use the get_data method, below is the example:
to set the data
example 1:
sharedClass.set_data("name","Jon Snow")
example 2:
sharedClass.set_data("email","jon#got.com")\
to get the data
sharedClass.get_data("email")\
to reset the entire state simply use
sharedClass.reset_data()
Its kind of accessing data from a json object (dict in this case)
Hope this helps....

You could use the basic from and import functions in python to import the variable into two.py. For example:
from filename import variable
That should import the variable from the file.
(Of course you should replace filename with one.py, and replace variable with the variable you want to share to two.py.)

You can also solve this problem by making the variable as global
python first.py
class Temp:
def __init__(self):
self.first = None
global var1
var1 = Temp()
var1.first = 1
print(var1.first)
python second.py
import first as One
print(One.var1.first)

Using multiprocessing with runpy

I have a Python module that uses multiprocessing. I'm executing this module from another script with runpy. However, this results in (1) the module running twice, and (2) the multiprocessing jobs never finish (the script just hangs).
In my minimal working example, I have a script runpy_test.py:
import runpy
runpy.run_module('module_test')
and a directory module_test containing an empty __init__.py and a __main__.py:
from multiprocessing import Pool
print 'start'
def f(x):
return x*x
pool = Pool()
result = pool.map(f, [1,2,3])
print 'done'
When I run runpy_test.py, I get:
start
start
and the script hangs.
If I remove the pool.map call (or if I run __main__.py directly, including the pool.map call), I get:
start
done
I'm running this on Scientific Linux 7.6 in Python 2.7.5.

Rewrite your __main__.py like so:
from multiprocessing import Pool
from .implementation import f
print 'start'
pool = Pool()
result = pool.map(f, [1,2,3])
print 'done'
And then write an implementation.py (you can call this whatever you want) in which your function is defined:
def f(x):
return x*x
Otherwise you will have the same problem with most interfaces in multiprocessing, and independently of using runpy. As #Weeble explained, when Pool.map tries to load the function f in each sub-process it will import <your_package>.__main__ where your function is defined, but since you have executable code at module-level in __main__ it will be re-executed by the sub-process.
Aside from this technical reason, this is also better design in terms of separation of concerns and testing. Now you can easily import and call (including for test purposes) the function f without running it in parallel.

Try defining your function f in a separate module. It needs to be serialised to be passed to the pool processes, and then those processes need to recreate it, by importing the module it occurs in. However, the __main__.py file it occurs in isn't a module, or at least, not a well-behaved one. Attempting to import it would result in the creation of another Pool and another invocation of map, which seems like a recipe for disaster.

Although not the "right" way to do it, one solution that ended up working for me was to use runpy's _run_module_as_main instead of run_module. This was ideal for me since I was working with someone else's code and required the fewest changes.

ThreadPoolExecutor, ProcessPoolExecutor and global variables

I am new to parallelization in general and concurrent.futures in particular. I want to benchmark my script and compare the differences between using threads and processes, but I found that I couldn't even get that running because when using ProcessPoolExecutor I cannot use my global variables.
The following code will output Helloas I expect, but when you change ThreadPoolExecutor for ProcessPoolExecutor, it will output None.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
greeting = None
def process():
print(greeting)
return None
def main():
with ThreadPoolExecutor(max_workers=1) as executor:
executor.submit(process)
return None
def init():
global greeting
greeting = 'Hello'
return None
if __name__ == '__main__':
init()
main()
I don't understand why this is the case. In my real program, init is used to set the global variables to CLI arguments, and there are a lot of them. Hence, passing them as arguments does not seem recommended. So how do I pass those global variables to each process/thread correctly?
I know that I can change things around, which will work, but I don't understand why. E.g. the following works for both Executors, but it also means that the globals initialisation has to happen for every instance.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
greeting = None
def init():
global greeting
greeting = 'Hello'
return None
def main():
with ThreadPoolExecutor(max_workers=1) as executor:
executor.submit(process)
return None
def process():
init()
print(greeting)
return None
if __name__ == '__main__':
main()
So my main question is, what is actually happening. Why does this code work with threads and not with processes? And, how do I correctly pass set globals to each process/thread without having to re-initialise them for every instance?
(Side note: because I have read that concurrent.futures might behave differently on Windows, I have to note that I am running Python 3.6 on Windows 10 64 bit.)

I'm not sure of the limitations of this approach, but you can pass (serializable?) objects between your main process/thread. This would also help you get rid of the reliance on global vars:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
def process(opts):
opts["process"] = "got here"
print("In process():", opts)
return None
def main(opts):
opts["main"] = "got here"
executor = [ProcessPoolExecutor, ThreadPoolExecutor][1]
with executor(max_workers=1) as executor:
executor.submit(process, opts)
return None
def init(opts): # Gather CLI opts and populate dict
opts["init"] = "got here"
return None
if __name__ == '__main__':
cli_opts = {"__main__": "got here"} # Initialize dict
init(cli_opts) # Populate dict
main(cli_opts) # Use dict
Works with both executor types.
Edit: Even though it sounds like it won't be a problem for your use case, I'll point out that with ProcessPoolExecutor, the opts dict you get inside process will be a frozen copy, so mutations to it will not be visible across processes nor will they be visible once you return to the __main__ block. ThreadPoolExecutor, on the other hand, will share the dict object between threads.

Actually, the first code of the OP will work as intended on Linux (tested in Python 3.6-3.8) because
On Unix a child process can make use of a shared resource created in a
parent process using a global resource.
as explained in multiprocessing doc. However, for a mysterious reasons, it won't work on my Mac running Mojave (which is supposed to be a UNIX-compliant OS; tested only with Python 3.8). And for sure, it won't work on Windows, and it's in general not a recommended practice with multiple processes.

Let's image a process is a box while a thread is a worker inside a box. A worker can only access the resources in the box and cannot touch the other resources in other boxes.
So when you use threads, you are creating multiple workers for your current box(main process). But when you use process, you are creating another box. In this case, the global variables initialised in this box is completely different from ones in another box. That's why it doesn't work as you expect.
The solution given by jedwards is good enough for most situations. You can expilictly package the resources in current box(serialize variables) and deliver it to another box(transport to another process) so that the workers in that box have access to the resources.

A process represents activity that is run in a separate process in the OS meaning of the term while threads all run in your main process. Every process has its own unique namespace.
Your main process sets the value to greeting by calling init() inside your __name__ == '__main__'condition for its own namespace. In your new process, this does not happen (__name__ is '__mp_name__' here) hence greeting remains None and init() is never actually called unless you do so explicitly in the function your process executes.
While sharing state between processes is generally not recommended, there are ways to do so, like outlined in #jedwards answer.
You might also want to check Sharing State Between Processes from the docs.

How to use multiprocessing.Pool in an imported module?

I have not been able to implement the suggestion here: Applying two functions to two lists simultaneously.
I guess it is because the module is imported by another module and thus my Windows spawns multiple python processes?
My question is: how can I use the code below without the if if __name__ == "__main__":
args_m = [(mortality_men, my_agents, graveyard, families, firms, year, agent) for agent in males]
args_f = [(mortality_women, fertility, year, families, my_agents, graveyard, firms, agent) for agent in females]
with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
p.map_async(process_males, args_m)
p.map_async(process_females, args_f)
Both process_males and process_females are fuctions.
args_m, args_f are iterators
Also, I don't need to return anything. Agents are class instances that need updating.

The reason you need to guard multiprocessing code in a if __name__ == "__main__" is that you don't want it to run again in the child process. That can happen on Windows, where the interpreter needs to reload all of its state since there's no fork system call that will copy the parent process's address space. But you only need to use it where code is supposed to be running at the top level since you're in the main script. It's not the only way to guard your code.
In your specific case, I think you should put the multiprocessing code in a function. That won't run in the child process, as long as nothing else calls the function when it should not. Your main module can import the module, then call the function (from within an if __name__ == "__main__" block, probably).
It should be something like this:
some_module.py:
def process_males(x):
...
def process_females(x):
...
args_m = [...] # these could be defined inside the function below if that makes more sense
args_f = [...]
def do_stuff():
with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
p.map_async(process_males, args_m)
p.map_async(process_females, args_f)
main.py:
import some_module
if __name__ == "__main__":
some_module.do_stuff()
In your real code you might want to pass some arguments or get a return value from do_stuff (which should also be given a more descriptive name than the generic one I've used in this example).

The idea of if __name__ == '__main__': is to avoid infinite process spawning.
When pickling a function defined in your main script, python has to figure out what part of your main script is the function code. It will basically re run your script. If your code creating the Pool is in the same script and not protected by the "if main", then by trying to import the function, you will try to launch another Pool that will try to launch another Pool....
Thus you should separate the function definitions from the actual main script:
from multiprocessing import Pool
# define test functions outside main
# so it can be imported withou launching
# new Pool
def test_func():
pass
if __name__ == '__main__':
with Pool(4) as p:
r = p.apply_async(test_func)
... do stuff
result = r.get()

Cannot yet comment on the question, but a workaround I have used that some have mentioned is just to define the process_males etc. functions in a module that is different to where the processes are spawned. Then import the module containing the multiprocessing spawns.

I solved it by calling the modules' multiprocessing function within "if __ name__ == "__ main__":" of the main script, as the function that involves multiprocessing is the last step in my module, others could try if aplicable.

multiprocessing launch from within module or class, not from main()

I want to use Python's multiprocessing unit to make effective use of multiple cpu's to speed up my processing.
All seems to work, however I want to run Pool.map(f, [item, item]) from within a class, in a sub module somewhere deep in my program. The reason is that the program has to prepare the data first and wait for certain events to happen before there is anything to be processed.
The multiprocessing docs says you can only run from within a if __name__ == '__main__': statement. I don't understand the significance of that and tried it anyway, like so:
from multiprocessing import Pool
class Foo(object):
n = 1000000
def __init__(self, x):
self.x = x + 1
pass
def run(self):
for i in range(1,self.n):
self.x *= 1.0*i/self.x
return self
class Bar(object):
def __init__(self):
pass
def go_all(self):
work = [Foo(i) for i in range(960)]
def do(obj):
return obj.run()
p = Pool(16)
finished_work = p.map(do, work)
return
bar = Bar()
bar.go_all()
It indeed doesn't work! I get the following error:
PicklingError: Can't pickle : attribute lookup
builtin.function failed
I don't quite understand why as everything seems to be perfectly pickeable. I have the following questions:
Can this be made to work without putting the p.map line in my main program?
If not, can "main" programs be called as sub-routines/modules, such to make it still work?
Is there some handy trick to loop back from a submodule to the main program and run it from there?
I'm on Linux and Python 2.7

I believe you misunderstood the documentation. What the documentation says is to do this:
if __name__ == '__main__':
bar = Bar()
bar.go_all()
So your p.map line does not need to be inside your "main function", or whatever. Only the code that actually spawns the subprocesses has to be "guarded". This is unavoidable due to limitations of the Windows OS.
Moreover, the function that you pass to Pool.map has to be importable (functions are pickled simply by their names, the interpreter then has to be able to import them to rebuild the function object when they are passed to the subprocess). So you should probably move your do function at the global level to avoid pickling errors.

The extra restrictions on the multiprocessing module on ms-windows stem from the fact that it doesn't have the fork system call. On UNIX-like operating systems, fork makes a perfect copy of a process and continues to run that next to the parent process. The only difference between them is that fork returns different value in the parent and child processes.
On ms-windows, multiprocessing needs to start a new Python instance using a native method to start processes. Then it needs to bring that Python instance into the same state as the "parent" process.
This means (among other things) that the Python code must be importable without side effects like trying to start yet another process. Hence the use of the if __name__ == '__main__' guard.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why must we explicitly pass constants into multiprocessing functions? - python

Related

How to share variables in multiprocessing [duplicate]

Using multiprocessing with runpy

ThreadPoolExecutor, ProcessPoolExecutor and global variables

How to use multiprocessing.Pool in an imported module?

multiprocessing launch from within module or class, not from main()

Categories

Resources