multiprocessing Programming guidelines unclear - python

I am trying to understand the following guideline:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
What does it mean to "arrange the program"?
How can I share resources by inheriting?
I'm running windows, so the new processes are spawned, does that means only forked processes can inherit?

1. What does it mean to "arrange the program"?
It means that your program should be able to run as self-contained without any external resources. Sharing files will give you locking issues, sharing memory will either do the same or can give you corruption due to multiple processes modifying the data at the same time.
Here's an example of what would be a bad idea:
while some_queue_is_not_empty():
run_external_process(some_queue)
def external_process(queue):
item = queue.pop()
# do processing here
Versus:
while some_queue_is_not_empty():
item = queue.pop()
run_external_process(item)
def external_process(item):
# do processing here
This way you can avoid locking the queue and/or corruption issues due to multiple processes getting the same item.
2. How can I share resources by inheriting?
On Windows, you can't. On Linux you can use file descriptors that your parent opened, on Windows it will be a brand new process so you don't have anything from your parent except what was given.
Example copied from: http://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/
from multiprocessing import Process
f = None
def child():
print f
if __name__ == '__main__':
f = open('mp.py', 'r')
p = Process(target=child)
p.start()
p.join()
On Linux you will get something like:
$ python mp.py
<open file 'mp.py', mode 'r' at 0xb7734ac8>
On Windows you will get:
C:\Users\brandon\dev>python mp.py
None

Related

python multiprocessing -- share huge variable among processes [duplicate]

Do child processes spawned via multiprocessing share objects created earlier in the program?
I have the following setup:
do_some_processing(filename):
for line in file(filename):
if line.split(',')[0] in big_lookup_object:
# something here
if __name__ == '__main__':
big_lookup_object = marshal.load('file.bin')
pool = Pool(processes=4)
print pool.map(do_some_processing, glob.glob('*.data'))
I'm loading some big object into memory, then creating a pool of workers that need to make use of that big object. The big object is accessed read-only, I don't need to pass modifications of it between processes.
My question is: is the big object loaded into shared memory, as it would be if I spawned a process in unix/c, or does each process load its own copy of the big object?
Update: to clarify further - big_lookup_object is a shared lookup object. I don't need to split that up and process it separately. I need to keep a single copy of it. The work that I need to split it is reading lots of other large files and looking up the items in those large files against the lookup object.
Further update: database is a fine solution, memcached might be a better solution, and file on disk (shelve or dbm) might be even better. In this question I was particularly interested in an in memory solution. For the final solution I'll be using hadoop, but I wanted to see if I can have a local in-memory version as well.
Do child processes spawned via multiprocessing share objects created earlier in the program?
No for Python < 3.8, yes for Python ≥ 3.8.
Processes have independent memory space.
Solution 1
To make best use of a large structure with lots of workers, do this.
Write each worker as a "filter" – reads intermediate results from stdin, does work, writes intermediate results on stdout.
Connect all the workers as a pipeline:
process1 <source | process2 | process3 | ... | processn >result
Each process reads, does work and writes.
This is remarkably efficient since all processes are running concurrently. The writes and reads pass directly through shared buffers between the processes.
Solution 2
In some cases, you have a more complex structure – often a fan-out structure. In this case you have a parent with multiple children.
Parent opens source data. Parent forks a number of children.
Parent reads source, farms parts of the source out to each concurrently running child.
When parent reaches the end, close the pipe. Child gets end of file and finishes normally.
The child parts are pleasant to write because each child simply reads sys.stdin.
The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.
Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.
Reading from many named pipes is often done using the select module to see which pipes have pending input.
Solution 3
Shared lookup is the definition of a database.
Solution 3A – load a database. Let the workers process the data in the database.
Solution 3B – create a very simple server using werkzeug (or similar) to provide WSGI applications that respond to HTTP GET so the workers can query the server.
Solution 4
Shared filesystem object. Unix OS offers shared memory objects. These are just files that are mapped to memory so that swapping I/O is done instead of more convention buffered reads.
You can do this from a Python context in several ways
Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a tiny bit of file reading time.
Write a startup program that (1) reads your original gigantic object and writes a page-structured, byte-coded file using seek operations to assure that individual sections are easy to find with simple seeks. This is what a database engine does – break the data into pages, make each page easy to locate via a seek.
Spawn workers with access to this large page-structured file. Each worker can seek to the relevant parts and do their work there.
Do child processes spawned via multiprocessing share objects created earlier in the program?
It depends. For global read-only variables it can be often considered so (apart from the memory consumed) else it should not.
multiprocessing's documentation says:
Better to inherit than pickle/unpickle
On Windows many types from
multiprocessing need to be picklable
so that child processes can use them.
However, one should generally avoid
sending shared objects to other
processes using pipes or queues.
Instead you should arrange the program
so that a process which need access to
a shared resource created elsewhere
can inherit it from an ancestor
process.
Explicitly pass resources to child processes
On Unix a child process can make use
of a shared resource created in a
parent process using a global
resource. However, it is better to
pass the object as an argument to the
constructor for the child process.
Apart from making the code
(potentially) compatible with Windows
this also ensures that as long as the
child process is still alive the
object will not be garbage collected
in the parent process. This might be
important if some resource is freed
when the object is garbage collected
in the parent process.
Global variables
Bear in mind that if code run in a
child process tries to access a global
variable, then the value it sees (if
any) may not be the same as the value
in the parent process at the time that
Process.start() was called.
Example
On Windows (single CPU):
#!/usr/bin/env python
import os, sys, time
from multiprocessing import Pool
x = 23000 # replace `23` due to small integers share representation
z = [] # integers are immutable, let's try mutable object
def printx(y):
global x
if y == 3:
x = -x
z.append(y)
print os.getpid(), x, id(x), z, id(z)
print y
if len(sys.argv) == 2 and sys.argv[1] == "sleep":
time.sleep(.1) # should make more apparant the effect
if __name__ == '__main__':
pool = Pool(processes=4)
pool.map(printx, (1,2,3,4))
With sleep:
$ python26 test_share.py sleep
2504 23000 11639492 [1] 10774408
1
2564 23000 11639492 [2] 10774408
2
2504 -23000 11639384 [1, 3] 10774408
3
4084 23000 11639492 [4] 10774408
4
Without sleep:
$ python26 test_share.py
1148 23000 11639492 [1] 10774408
1
1148 23000 11639492 [1, 2] 10774408
2
1148 -23000 11639324 [1, 2, 3] 10774408
3
1148 -23000 11639324 [1, 2, 3, 4] 10774408
4
S.Lott is correct. Python's multiprocessing shortcuts effectively give you a separate, duplicated chunk of memory.
On most *nix systems, using a lower-level call to os.fork() will, in fact, give you copy-on-write memory, which might be what you're thinking. AFAIK, in theory, in the most simplistic of programs possible, you could read from that data without having it duplicated.
However, things aren't quite that simple in the Python interpreter. Object data and meta-data are stored in the same memory segment, so even if the object never changes, something like a reference counter for that object being incremented will cause a memory write, and therefore a copy. Almost any Python program that is doing more than "print 'hello'" will cause reference count increments, so you will likely never realize the benefit of copy-on-write.
Even if someone did manage to hack a shared-memory solution in Python, trying to coordinate garbage collection across processes would probably be pretty painful.
If you're running under Unix, they may share the same object, due to how fork works (i.e., the child processes have separate memory but it's copy-on-write, so it may be shared as long as nobody modifies it). I tried the following:
import multiprocessing
x = 23
def printx(y):
print x, id(x)
print y
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=4)
pool.map(printx, (1,2,3,4))
and got the following output:
$ ./mtest.py
23 22995656
1
23 22995656
2
23 22995656
3
23 22995656
4
Of course this doesn't prove that a copy hasn't been made, but you should be able to verify that in your situation by looking at the output of ps to see how much real memory each subprocess is using.
Different processes have different address space. Like running different instances of the interpreter. That's what IPC (interprocess communication) is for.
You can use either queues or pipes for this purpose. You can also use rpc over tcp if you want to distribute the processes over a network later.
http://docs.python.org/dev/library/multiprocessing.html#exchanging-objects-between-processes
Not directly related to multiprocessing per se, but from your example, it would seem you could just use the shelve module or something like that. Does the "big_lookup_object" really have to be completely in memory?
No, but you can load your data as a child process and allow it to share its data with other children. see below.
import time
import multiprocessing
def load_data( queue_load, n_processes )
... load data here into some_variable
"""
Store multiple copies of the data into
the data queue. There needs to be enough
copies available for each process to access.
"""
for i in range(n_processes):
queue_load.put(some_variable)
def work_with_data( queue_data, queue_load ):
# Wait for load_data() to complete
while queue_load.empty():
time.sleep(1)
some_variable = queue_load.get()
"""
! Tuples can also be used here
if you have multiple data files
you wish to keep seperate.
a,b = queue_load.get()
"""
... do some stuff, resulting in new_data
# store it in the queue
queue_data.put(new_data)
def start_multiprocess():
n_processes = 5
processes = []
stored_data = []
# Create two Queues
queue_load = multiprocessing.Queue()
queue_data = multiprocessing.Queue()
for i in range(n_processes):
if i == 0:
# Your big data file will be loaded here...
p = multiprocessing.Process(target = load_data,
args=(queue_load, n_processes))
processes.append(p)
p.start()
# ... and then it will be used here with each process
p = multiprocessing.Process(target = work_with_data,
args=(queue_data, queue_load))
processes.append(p)
p.start()
for i in range(n_processes)
new_data = queue_data.get()
stored_data.append(new_data)
for p in processes:
p.join()
print(processes)
For Linux/Unix/MacOS platform, forkmap is a quick-and-dirty solution.

Multiprocessing with pool in python: About several instances with same name at the same time

I'm kind of new to multiprocessing. However, assume that we have a program as below. The program seems to work fine. Now to the question. In my opinion we will have 4 instances of SomeKindOfClass with the same name (a) at the same time. How is that possible? Moreover, is there a potential risk with this kind of programming?
from multiprocessing.dummy import Pool
import numpy
from theFile import someKindOfClass
n = 8
allOutputs = numpy.zeros(n)
def work(index):
a = SomeKindOfClass()
a.theSlowFunction()
allOutputs[index] = a.output
pool = Pool(processes=4)
pool.map(work,range(0,n))
The name a is only local in scope within your work function, so there is no conflict of names here. Internally python will keep track of each class instance with a unique identifier. If you wanted to check this you could check the object id using the id function:
print(id(a))
I don't see any issues with your code.
Actually, you will have 8 instances of SomeKindOfClass (one for each worker), but only 4 will ever be active at the same time.
multiprocessing vs multiprocessing.dummy
Your program will only work if you continue to use the multiprocessing.dummy module, which is just a wrapper around the threading module. You are still using "python threads" (not separate processes). "Python threads" share the same global state; "Processes" don't. Python threads also share the same GIL, so they're still limited to running one python bytecode statement at a time, unlike processes, which can all run python code simultaneously.
If you were to change your import to from multiprocessing import Pool, you would notice that the allOutputs array remains unchanged after all the workers finish executing (also, you would likely get an error because you're creating the pool in the global scope, you should probably put that inside a main() function). This is because multiprocessing makes a new copy of the entire global state when it makes a new process. When the worker modifies the global allOutputs, it will be modifying a copy of that initial global state. When the process ends, nothing will be returned to the main process and the global state of the main process will remain unchanged.
Sharing State Between Processes
Unlike threads, processes aren't sharing the same memory
If you want to share state between processes, you have to explicitly declare shared variables and pass them to each process, or use pipes or some other method to allow the worker processes to communicate with each other or with the main process.
There are several ways to do this, but perhaps the simplest is using the Manager class
import multiprocessing
def worker(args):
index, array = args
a = SomeKindOfClass()
a.some_expensive_function()
array[index] = a.output
def main():
n = 8
manager = multiprocessing.Manager()
array = manager.list([0] * n)
pool = multiprocessing.Pool(4)
pool.map(worker, [(i, array) for i in range(n)])
print array
You can declare class instances inside the pool workers, because each instance has a separate place in memory so they don't conflict. The problem is if you declare a class instance first, then try to pass that one instance into multiple pool workers. Then each worker has a pointer to the same place in memory, and it will fail (this can be handled, just not this way).
Basically pool workers must not have overlapping memory anywhere. As long as the workers don't try to share memory somewhere, or perform operations that may result in collisions (like printing to the same file), there shouldn't be any problem.
Make sure whatever they're supposed to do (like something you want printed to a file, or added to a broader namespace somewhere) is returned as a result at the end, which you then iterate through.
If you are using multiprocessing you shouldn't worry - process doesn't share memory (by-default). So, there is no any risk to have several independent objects of class SomeKindOfClass - each of them will live in own process. How it works? Python runs your program and after that it runs 4 child processes. That's why it's very important to have if __init__ == '__main__' construction before pool.map(work,range(0,n)). Otherwise you will receive a infinity loop of process creation.
Problems could be if SomeKindOfClass keeps state on disk - for example, write something to file or read it.

Run methods in parallel

I have a program that, among other things, parses some big files, and I would like to have this done in parallel to save time.
The code flow looks something like this:
if __name__ == '__main__':
obj = program_object()
obj.do_so_some_stuff(argv)
obj.field1 = parse_file_one(f1)
obj.field2 = parse_file_two(f2)
obj.do_some_more_stuff()
I tried running the file parsing methods in separate processes like this:
p_1 = multiprocessing.Process(target=parse_file_one, args=(f1))
p_2 = multiprocessing.Process(target=parse_file_two, args=(f2))
p_1.start()
p_2.start()
p_1.join()
p_2.join()
There are 2 problems here. One is how to have the separate process modify the filed, but more importantly, forking the process duplicates my whole main! I get exception regarding argv when executing the
do_so_some_stuff(argv)
second time. That really is not what I wanted. It even happened when I run only 1 of the Processes.
How could I get just the file parsing methods to run in parallel to each other, and then continue back with main process like before?
Try putting the parsing methods in a separate module.
First, i guess instead of:
obj = program_object()
program_object.do_so_some_stuff(argv)
you mean:
obj = program_object()
obj.do_so_some_stuff(argv)
Second, try using threading like this:
#!/usr/bin/python
import thread
if __name__ == '__main__':
try:
thread.start_new_thread( parse_file_one, (f1) )
thread.start_new_thread( parse_file_two, (f2) )
except:
print "Error: unable to start thread"
But, as pointed out by Wooble, depending on the implementation of your parsing functions, this might not be a solution that executes truly in parallel, because of the GIL.
In that case, you should check the Python multiprocessing module that will do true concurrent execution:
multiprocessing is a package that supports spawning processes using an
API similar to the threading module. The multiprocessing package
offers both local and remote concurrency, effectively side-stepping
the Global Interpreter Lock by using subprocesses instead of threads.
Due to this, the multiprocessing module allows the programmer to fully
leverage multiple processors on a given machine.

Can I asynchronously delete a file in Python?

I have a long running python script which creates and deletes temporary files. I notice there is a non-trivial amount of time spent on file deletion, but the only purpose of deleting those files is to ensure that the program doesn't eventually fill up all the disk space during a long run. Is there a cross platform mechanism in Python to aschyronously delete a file so the main thread can continue to work while the OS takes care of the file delete?
You can try delegating deleting the files to another thread or process.
Using a newly spawned thread:
thread.start_new_thread(os.remove, filename)
Or, using a process:
# create the process pool once
process_pool = multiprocessing.Pool(1)
results = []
# later on removing a file in async fashion
# note: need to hold on to the async result till it has completed
results.append(process_pool.apply_async(os.remove, filename), callback=lambda result: results.remove(result))
The process version may allow for more parallelism because Python threads are not executing in parallel due to the notorious global interpreter lock. I would expect though that GIL is released when it calls any blocking kernel function, such as unlink(), so that Python lets another thread to make progress. In other words, a background worker thread that calls os.unlink() may be the best solution, see Tim Peters' answer.
Yet, multiprocessing is using Python threads underneath to asynchronously communicate with the processes in the pool, so some benchmarking is required to figure which version gives more parallelism.
An alternative method to avoid using Python threads but requires more coding is to spawn another process and send the filenames to its standard input through a pipe. This way you trade os.remove() to a synchronous os.write() (one write() syscall). It can be done using deprecated os.popen() and this usage of the function is perfectly safe because it only communicates in one direction to the child process. A working prototype:
#!/usr/bin/python
from __future__ import print_function
import os, sys
def remover():
for line in sys.stdin:
filename = line.strip()
try:
os.remove(filename)
except Exception: # ignore errors
pass
def main():
if len(sys.argv) == 2 and sys.argv[1] == '--remover-process':
return remover()
remover_process = os.popen(sys.argv[0] + ' --remover-process', 'w')
def remove_file(filename):
print(filename, file=remover_process)
remover_process.flush()
for file in sys.argv[1:]:
remove_file(file)
if __name__ == "__main__":
main()
You can create a thread to delete files, following a common producer-consumer pattern:
import threading, Queue
dead_files = Queue.Queue()
END_OF_DATA = object() # a unique sentinel value
def background_deleter():
import os
while True:
path = dead_files.get()
if path is END_OF_DATA:
return
try:
os.remove(path)
except: # add the exceptions you want to ignore here
pass # or log the error, or whatever
deleter = threading.Thread(target=background_deleter)
deleter.start()
# when you want to delete a file, do:
# dead_files.put(file_path)
# when you want to shut down cleanly,
dead_files.put(END_OF_DATA)
deleter.join()
CPython releases the GIL (global interpreter lock) around internal file deletion calls, so this should be effective.
Edit - new text
I would advise against spawning a new process per delete. On some platforms, process creation is quite expensive. Would also advise against spawning a new thread per delete: in a long-running program, you really never want the possibility of creating an unbounded number of threads at any point. Depending on how quickly file deletion requests pile up, that could happen here.
The "solution" above is wordier than the others, because it avoids all that. There's only one new thread total. Of course it could easily be generalized to use any fixed number of threads instead, all sharing the same dead_files queue. Start with 1, add more if needed ;-)
The OS-level file removal primitives are synchronous on both Unix and Windows, so I think you pretty much have to use a worker thread. You could have it pull files to delete off a Queue object, and then when the main thread is done with a file it can just post the file to the queue. If you're using NamedTemporaryFile objects, you probably want to set delete=False in the constructor and just post the name to the queue, not the file object, so you don't have object lifetime headaches.

python -> multiprocessing module

Here's what I am trying to accomplish -
I have about a million files which I need to parse & append the parsed content to a single file.
Since a single process takes ages, this option is out.
Not using threads in Python as it essentially comes to running a single process (due to GIL).
Hence using multiprocessing module. i.e. spawning 4 sub-processes to utilize all that raw core power :)
So far so good, now I need a shared object which all the sub-processes have access to. I am using Queues from the multiprocessing module. Also, all the sub-processes need to write their output to a single file. A potential place to use Locks I guess. With this setup when I run, I do not get any error (so the parent process seems fine), it just stalls. When I press ctrl-C I see a traceback (one for each sub-process). Also no output is written to the output file. Here's code (note that everything runs fine without multi-processes) -
import os
import glob
from multiprocessing import Process, Queue, Pool
data_file = open('out.txt', 'w+')
def worker(task_queue):
for file in iter(task_queue.get, 'STOP'):
data = mine_imdb_page(os.path.join(DATA_DIR, file))
if data:
data_file.write(repr(data)+'\n')
return
def main():
task_queue = Queue()
for file in glob.glob('*.csv'):
task_queue.put(file)
task_queue.put('STOP') # so that worker processes know when to stop
# this is the block of code that needs correction.
if multi_process:
# One way to spawn 4 processes
# pool = Pool(processes=4) #Start worker processes
# res = pool.apply_async(worker, [task_queue, data_file])
# But I chose to do it like this for now.
for i in range(4):
proc = Process(target=worker, args=[task_queue])
proc.start()
else: # single process mode is working fine!
worker(task_queue)
data_file.close()
return
what am I doing wrong? I also tried passing the open file_object to each of the processes at the time of spawning. But to no effect. e.g.- Process(target=worker, args=[task_queue, data_file]). But this did not change anything. I feel the subprocesses are not able to write to the file for some reason. Either the instance of the file_object is not getting replicated (at the time of spawn) or some other quirk... Anybody got an idea?
EXTRA: Also Is there any way to keep a persistent mysql_connection open & pass it across to the sub_processes? So I open a mysql connection in my parent process & the open connection should be accessible to all my sub-processes. Basically this is the equivalent of a shared_memory in python. Any ideas here?
Although the discussion with Eric was fruitful, later on I found a better way of doing this. Within the multiprocessing module there is a method called 'Pool' which is perfect for my needs.
It's optimizes itself to the number of cores my system has. i.e. only as many processes are spawned as the no. of cores. Of course this is customizable. So here's the code. Might help someone later-
from multiprocessing import Pool
def main():
po = Pool()
for file in glob.glob('*.csv'):
filepath = os.path.join(DATA_DIR, file)
po.apply_async(mine_page, (filepath,), callback=save_data)
po.close()
po.join()
file_ptr.close()
def mine_page(filepath):
#do whatever it is that you want to do in a separate process.
return data
def save_data(data):
#data is a object. Store it in a file, mysql or...
return
Still going through this huge module. Not sure if save_data() is executed by parent process or this function is used by spawned child processes. If it's the child which does the saving it might lead to concurrency issues in some situations. If anyone has anymore experience in using this module, you appreciate more knowledge here...
The docs for multiprocessing indicate several methods of sharing state between processes:
http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
I'm sure each process gets a fresh interpreter and then the target (function) and args are loaded into it. In that case, the global namespace from your script would have been bound to your worker function, so the data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
An alternative is to pass another Queue that will hold the results from the workers. The workers put the results and the main code gets the results and writes it to the file.

Categories

Resources