I am running a program which first loads 20 GB of data into memory. Then I run N (> 1000) independent tasks, each of which may use (read only) part of that 20 GB of data. I am now trying to run those tasks via multiprocessing. However, as this answer says, all global variables are copied for each process. In my case, I do not have enough memory to run more than 4 tasks, as my machine has only 96 GB of RAM. I wonder whether there is any solution to this kind of problem, so that I can fully use all my cores without consuming too much memory.
In Linux, forked processes have a copy-on-write view of the parent's address space. Forking is lightweight, and the same program runs in both the parent and the child, except that the child takes a different execution path. As a small example,
import os

var = "unchanged"
pid = os.fork()
if pid:
    print('parent:', os.getpid(), var)
    os.waitpid(pid, 0)
else:
    print('child:', os.getpid(), var)
    var = "changed"

# show parent and child views
print(os.getpid(), var)
Results in
parent: 22642 unchanged
child: 22643 unchanged
22643 changed
22642 unchanged
Applying this to multiprocessing, in this example I load data into a global variable. Since python pickles the data sent to the process pool, I make sure it pickles something small like an index and have the worker get the global data itself.
import multiprocessing as mp
import os

my_big_data = "well, bigger than this"

def worker(index):
    """get char in big data"""
    return my_big_data[index]

if __name__ == "__main__":
    pool = mp.Pool(os.cpu_count())
    for c in pool.imap_unordered(worker, range(len(my_big_data)), chunksize=1):
        print(c)
Windows does not have a fork-and-exec model for running programs. It has to start a new instance of the python interpreter and clone all relevant data to the child. This is a heavy lift!
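If you are stuck with spawn (Windows, and the default on recent macOS), one common workaround is to load the data once per worker process rather than once per task, using a Pool initializer. This is a minimal sketch of that idea, not part of the answer above; load_big_data is a hypothetical stand-in for your own loading routine.

import multiprocessing as mp

big_data = None  # filled in once per worker by the initializer

def load_big_data():
    # hypothetical stand-in for reading your real dataset
    return "well, bigger than this"

def init_worker():
    global big_data
    big_data = load_big_data()  # runs once per worker process, not per task

def worker(index):
    return big_data[index]

if __name__ == "__main__":
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        for c in pool.imap_unordered(worker, range(10), chunksize=1):
            print(c)

Each worker still holds its own copy of the data, but at least the loading and transfer cost is paid once per worker instead of once per task.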
Related
I'm trying to figure out how Lock works under the hood. I run this code on macOS, which uses "spawn" as the default method to start new processes.
from multiprocessing import Process, Lock, set_start_method
from time import sleep

def f(lock, i):
    lock.acquire()
    print(id(lock))
    try:
        print('hello world', i)
        sleep(3)
    finally:
        lock.release()

if __name__ == '__main__':
    # set_start_method("fork")
    lock = Lock()
    for num in range(3):
        p = Process(target=f, args=(lock, num))
        p.start()
        p.join()
Output:
140580736370432
hello world 0
140251759281920
hello world 1
140398066042624
hello world 2
The Lock works in my code. However, the ids of the lock confuse me. Since the ids are different, is it still the same lock, or are there multiple locks that somehow communicate secretly? Does id() still hold its usual meaning under multiprocessing? I quote: "CPython implementation detail: id is the address of the object in memory."
If I use the "fork" method, set_start_method("fork"), it prints identical ids, which makes total sense to me.
id is implemented as, but not required to be, the memory location of the given object. When using fork, the child process does not get its own copy of a memory page until it modifies something (copy-on-write), so the memory location does not change because it "is" the same object.
When using spawn, an entirely new process is created and the __main__ file is imported as a library into the local namespace, so all the same functions, classes, and module-level variables are accessible (minus any modifications made by anything under if __name__ == "__main__":). Python then creates a connection between the processes (a pipe) over which it can send which function to call and the arguments to call it with. Everything passing through this pipe must be pickled and then unpickled.
Locks specifically are re-created when unpickling by asking the operating system for a lock with a specific name (the name was created in the parent process when the lock was constructed, then sent across via pickle). This is how the two locks are synchronized: they are backed by an object the operating system controls. Python then stores this lock, along with some other data (the PyObject, as it were), in the memory of the new process. Calling id now returns the location of this struct, which is different because it was created by a different process in a different chunk of memory.
Here's a quick example to convince you that a "spawn"ed lock is still synchronized:
from multiprocessing import Process, Lock, set_start_method

def foo(lock):
    with lock:
        print(f'child process lock id: {id(lock)}')

if __name__ == "__main__":
    set_start_method("spawn")
    lock = Lock()
    print(f'parent process lock id: {id(lock)}')
    lock.acquire()  # lock the lock so child has to wait
    p = Process(target=foo, args=(lock,))
    p.start()
    input('press enter to unlock the lock')
    lock.release()
    p.join()
The different "id's" are the different PyObject locations, but have little to do with the underlying mutex. I am not aware that there's a direct way to inspect the underlying lock which the operating system manages.
Do child processes spawned via multiprocessing share objects created earlier in the program?
I have the following setup:
import glob
import marshal
from multiprocessing import Pool

def do_some_processing(filename):
    for line in file(filename):
        if line.split(',')[0] in big_lookup_object:
            # something here
            pass

if __name__ == '__main__':
    big_lookup_object = marshal.load(open('file.bin', 'rb'))
    pool = Pool(processes=4)
    print pool.map(do_some_processing, glob.glob('*.data'))
I'm loading some big object into memory, then creating a pool of workers that need to make use of that big object. The big object is accessed read-only, I don't need to pass modifications of it between processes.
My question is: is the big object loaded into shared memory, as it would be if I spawned a process in unix/c, or does each process load its own copy of the big object?
Update: to clarify further - big_lookup_object is a shared lookup object. I don't need to split that up and process it separately. I need to keep a single copy of it. The work that I need to split is reading lots of other large files and looking up the items in those large files against the lookup object.
Further update: a database is a fine solution, memcached might be a better solution, and a file on disk (shelve or dbm) might be even better. In this question I was particularly interested in an in-memory solution. For the final solution I'll be using Hadoop, but I wanted to see if I could have a local in-memory version as well.
Do child processes spawned via multiprocessing share objects created earlier in the program?
No for Python < 3.8; yes for Python ≥ 3.8, which adds multiprocessing.shared_memory.
Processes have independent memory space.
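For Python ≥ 3.8, here is a minimal sketch (an illustration, not a full solution) of explicit sharing with multiprocessing.shared_memory: the parent places bytes in a named block and the child attaches to it by name without copying.

from multiprocessing import Process, shared_memory

def worker(name, length):
    # attach to the block created by the parent; no copy of the data is made
    shm = shared_memory.SharedMemory(name=name)
    print(bytes(shm.buf[:length]))
    shm.close()

if __name__ == '__main__':
    data = b"the big read-only blob"
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[:len(data)] = data
    p = Process(target=worker, args=(shm.name, len(data)))
    p.start()
    p.join()
    shm.close()
    shm.unlink()  # release the block once every process is done with it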
Solution 1
To make best use of a large structure with lots of workers, do this.
Write each worker as a "filter" – reads intermediate results from stdin, does work, writes intermediate results on stdout.
Connect all the workers as a pipeline:
process1 <source | process2 | process3 | ... | processn >result
Each process reads, does work and writes.
This is remarkably efficient since all processes are running concurrently. The writes and reads pass directly through shared buffers between the processes.
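As a rough sketch of such a filter worker (assuming comma-separated input lines; adapt the processing step to your data):

#!/usr/bin/env python
# filter-style worker: read records from stdin, process, write results to stdout
import sys

for line in sys.stdin:
    key = line.split(',')[0]
    # ... real work on the record would go here ...
    sys.stdout.write(key + '\n')

Each stage in process1 <source | process2 | ... | processn >result is just another copy of this pattern.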
Solution 2
In some cases, you have a more complex structure – often a fan-out structure. In this case you have a parent with multiple children.
Parent opens source data. Parent forks a number of children.
Parent reads source, farms parts of the source out to each concurrently running child.
When the parent reaches the end, it closes the pipe. The child gets end-of-file and finishes normally.
The child parts are pleasant to write because each child simply reads sys.stdin.
The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.
Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.
Reading from many named pipes is often done using the select module to see which pipes have pending input.
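A minimal Unix-only sketch of that fan-in pattern, with select interleaving whatever output arrives first from several forked children:

import os
import select

# fan-in sketch: three children write results into pipes and the
# collector interleaves whatever arrives first using select
pipes = []
for i in range(3):
    r, w = os.pipe()
    if os.fork() == 0:              # child
        os.close(r)
        os.write(w, ("result from child %d\n" % i).encode())
        os.close(w)
        os._exit(0)
    os.close(w)                     # parent keeps only the read end
    pipes.append(r)

open_fds = set(pipes)
while open_fds:
    ready, _, _ = select.select(list(open_fds), [], [])
    for fd in ready:
        chunk = os.read(fd, 4096)
        if chunk:
            print(chunk.decode(), end='')
        else:                       # EOF: the child closed its write end
            os.close(fd)
            open_fds.discard(fd)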
Solution 3
Shared lookup is the definition of a database.
Solution 3A – load a database. Let the workers process the data in the database.
Solution 3B – create a very simple server using werkzeug (or similar) to provide WSGI applications that respond to HTTP GET so the workers can query the server.
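A bare-bones sketch of Solution 3B, using only the standard library's http.server instead of werkzeug; the lookup table here is a tiny stand-in for the real object:

from http.server import BaseHTTPRequestHandler, HTTPServer

big_lookup_object = {"alpha": "1", "beta": "2"}   # stand-in for the real data

class LookupHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # workers issue GET /<key> and receive the value (or a 404)
        value = big_lookup_object.get(self.path.lstrip('/'))
        self.send_response(200 if value is not None else 404)
        self.end_headers()
        if value is not None:
            self.wfile.write(value.encode())

if __name__ == '__main__':
    HTTPServer(('localhost', 8000), LookupHandler).serve_forever()

Workers can then query it with, for example, urllib.request.urlopen('http://localhost:8000/alpha').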
Solution 4
Shared filesystem object. Unix offers shared memory objects. These are just files that are mapped into memory so that swapping I/O is done instead of more conventional buffered reads.
You can do this from Python in several ways:
Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a tiny bit of file-reading time.
Write a startup program that (1) reads your original gigantic object and (2) writes a page-structured, byte-coded file, using seek operations to ensure that individual sections are easy to find with simple seeks. This is what a database engine does – break the data into pages and make each page easy to locate via a seek.
Spawn workers with access to this large page-structured file. Each worker can seek to the relevant parts and do their work there.
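A small sketch of the page-structured idea using mmap, assuming fixed-size records so an index maps directly to a file offset; the mapped pages are backed by the OS page cache, so multiple workers reading the same file do not each need a private in-memory copy:

import mmap

RECORD_SIZE = 128  # assumption: fixed-size records make seeking trivial

def read_record(path, index):
    # map the file read-only and pull out one record without loading the rest
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = index * RECORD_SIZE
            return mm[offset:offset + RECORD_SIZE]

Each worker can then call read_record(path, i) for just the sections it needs.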
Do child processes spawned via multiprocessing share objects created earlier in the program?
It depends. For global read-only variables it can often be considered so (apart from the memory consumed); otherwise it should not.
multiprocessing's documentation says:
Better to inherit than pickle/unpickle
On Windows many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
Explicitly pass resources to child processes
On Unix a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process. Apart from making the code (potentially) compatible with Windows, this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.
Global variables
Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start() was called.
Example
On Windows (single CPU):
#!/usr/bin/env python
import os, sys, time
from multiprocessing import Pool

x = 23000  # replace `23` to avoid the shared representation of small integers
z = []     # integers are immutable, let's try a mutable object

def printx(y):
    global x
    if y == 3:
        x = -x
    z.append(y)
    print os.getpid(), x, id(x), z, id(z)
    print y
    if len(sys.argv) == 2 and sys.argv[1] == "sleep":
        time.sleep(.1)  # should make the effect more apparent

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(printx, (1,2,3,4))
With sleep:
$ python26 test_share.py sleep
2504 23000 11639492 [1] 10774408
1
2564 23000 11639492 [2] 10774408
2
2504 -23000 11639384 [1, 3] 10774408
3
4084 23000 11639492 [4] 10774408
4
Without sleep:
$ python26 test_share.py
1148 23000 11639492 [1] 10774408
1
1148 23000 11639492 [1, 2] 10774408
2
1148 -23000 11639324 [1, 2, 3] 10774408
3
1148 -23000 11639324 [1, 2, 3, 4] 10774408
4
S.Lott is correct. Python's multiprocessing shortcuts effectively give you a separate, duplicated chunk of memory.
On most *nix systems, using a lower-level call to os.fork() will, in fact, give you copy-on-write memory, which might be what you're thinking of. AFAIK, in theory, in the most simplistic of programs, you could read from that data without having it duplicated.
However, things aren't quite that simple in the Python interpreter. Object data and metadata are stored in the same memory segment, so even if the object never changes, something like the reference counter for that object being incremented will cause a memory write, and therefore a copy. Almost any Python program that does more than "print 'hello'" will cause reference-count increments, so you will likely never realize the benefit of copy-on-write.
Even if someone did manage to hack a shared-memory solution in Python, trying to coordinate garbage collection across processes would probably be pretty painful.
If you're running under Unix, they may share the same object, due to how fork works (i.e., the child processes have separate memory but it's copy-on-write, so it may be shared as long as nobody modifies it). I tried the following:
import multiprocessing

x = 23

def printx(y):
    print x, id(x)
    print y

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(printx, (1,2,3,4))
and got the following output:
$ ./mtest.py
23 22995656
1
23 22995656
2
23 22995656
3
23 22995656
4
Of course this doesn't prove that a copy hasn't been made, but you should be able to verify that in your situation by looking at the output of ps to see how much real memory each subprocess is using.
Different processes have different address spaces, like running different instances of the interpreter. That's what IPC (inter-process communication) is for.
You can use either queues or pipes for this purpose. You can also use RPC over TCP if you want to distribute the processes over a network later.
http://docs.python.org/dev/library/multiprocessing.html#exchanging-objects-between-processes
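As a tiny sketch of the queue approach (a Pipe would look similar):

from multiprocessing import Process, Queue

def worker(q):
    q.put('result computed in the child')   # send data back over the queue

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())                           # receive the object in the parent
    p.join()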
Not directly related to multiprocessing per se, but from your example, it would seem you could just use the shelve module or something like that. Does the "big_lookup_object" really have to be completely in memory?
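For illustration only, a minimal shelve sketch (the key here is a stand-in for the real lookup data); each worker opens the shelf read-only, so only the entries it actually touches are read from disk:

import shelve

# one-time conversion of the lookup data to an on-disk shelf
with shelve.open('lookup.db') as db:
    db['some_key'] = 'some_value'     # stand-in for the real entries

# inside a worker: open read-only and query it like a dict
with shelve.open('lookup.db', flag='r') as db:
    print('some_key' in db)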
No, but you can load your data in a child process and allow it to share its data with the other children. See below.
import time
import multiprocessing

def load_data(queue_load, n_processes):
    # ... load data here into some_variable
    """
    Store multiple copies of the data into
    the data queue. There needs to be enough
    copies available for each process to access.
    """
    for i in range(n_processes):
        queue_load.put(some_variable)

def work_with_data(queue_data, queue_load):
    # Wait for load_data() to complete
    while queue_load.empty():
        time.sleep(1)
    some_variable = queue_load.get()
    """
    ! Tuples can also be used here
    if you have multiple data files
    you wish to keep separate.
    a, b = queue_load.get()
    """
    # ... do some stuff, resulting in new_data
    # store it in the queue
    queue_data.put(new_data)

def start_multiprocess():
    n_processes = 5
    processes = []
    stored_data = []
    # Create two Queues
    queue_load = multiprocessing.Queue()
    queue_data = multiprocessing.Queue()
    for i in range(n_processes):
        if i == 0:
            # Your big data file will be loaded here...
            p = multiprocessing.Process(target=load_data,
                                        args=(queue_load, n_processes))
            processes.append(p)
            p.start()
        # ... and then it will be used here with each process
        p = multiprocessing.Process(target=work_with_data,
                                    args=(queue_data, queue_load))
        processes.append(p)
        p.start()
    for i in range(n_processes):
        new_data = queue_data.get()
        stored_data.append(new_data)
    for p in processes:
        p.join()
    print(processes)
For Linux/Unix/macOS platforms, forkmap is a quick-and-dirty solution.
I'm kind of new to multiprocessing. However, assume we have a program as below. The program seems to work fine. Now to the question: in my opinion we will have 4 instances of SomeKindOfClass with the same name (a) at the same time. How is that possible? Moreover, is there a potential risk with this kind of programming?
from multiprocessing.dummy import Pool
import numpy
from theFile import SomeKindOfClass

n = 8
allOutputs = numpy.zeros(n)

def work(index):
    a = SomeKindOfClass()
    a.theSlowFunction()
    allOutputs[index] = a.output

pool = Pool(processes=4)
pool.map(work, range(0, n))
The name a is local in scope within your work function, so there is no conflict of names here. Internally, Python keeps track of each class instance with a unique identifier. If you wanted to check this, you could inspect the object id using the id function:
print(id(a))
I don't see any issues with your code.
Actually, you will have 8 instances of SomeKindOfClass (one for each call to work), but only 4 will ever be active at the same time.
multiprocessing vs multiprocessing.dummy
Your program will only work if you continue to use the multiprocessing.dummy module, which is just a wrapper around the threading module. You are still using "python threads" (not separate processes). "Python threads" share the same global state; "Processes" don't. Python threads also share the same GIL, so they're still limited to running one python bytecode statement at a time, unlike processes, which can all run python code simultaneously.
If you were to change your import to from multiprocessing import Pool, you would notice that the allOutputs array remains unchanged after all the workers finish executing (also, you would likely get an error because you're creating the pool in the global scope, you should probably put that inside a main() function). This is because multiprocessing makes a new copy of the entire global state when it makes a new process. When the worker modifies the global allOutputs, it will be modifying a copy of that initial global state. When the process ends, nothing will be returned to the main process and the global state of the main process will remain unchanged.
Sharing State Between Processes
Unlike threads, processes don't share the same memory.
If you want to share state between processes, you have to explicitly declare shared variables and pass them to each process, or use pipes or some other method to allow the worker processes to communicate with each other or with the main process.
There are several ways to do this, but perhaps the simplest is using the Manager class
import multiprocessing
from theFile import SomeKindOfClass

def worker(args):
    index, array = args
    a = SomeKindOfClass()
    a.some_expensive_function()
    array[index] = a.output

def main():
    n = 8
    manager = multiprocessing.Manager()
    array = manager.list([0] * n)
    pool = multiprocessing.Pool(4)
    pool.map(worker, [(i, array) for i in range(n)])
    print(array)

if __name__ == '__main__':
    main()
You can declare class instances inside the pool workers, because each instance has a separate place in memory so they don't conflict. The problem is if you declare a class instance first, then try to pass that one instance into multiple pool workers. Then each worker has a pointer to the same place in memory, and it will fail (this can be handled, just not this way).
Basically pool workers must not have overlapping memory anywhere. As long as the workers don't try to share memory somewhere, or perform operations that may result in collisions (like printing to the same file), there shouldn't be any problem.
Make sure whatever they're supposed to do (like something you want printed to a file, or added to a broader namespace somewhere) is returned as a result at the end, which you then iterate through.
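A small sketch of that return-the-result pattern with real processes (replacing the global-array write in the original code):

from multiprocessing import Pool

def work(index):
    # compute locally in the worker and return the value
    # instead of writing to a shared array or file
    return index * index

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        results = pool.map(work, range(8))
    for index, value in enumerate(results):
        print(index, value)   # collect and store everything in the parent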
If you are using multiprocessing you shouldn't worry - processes don't share memory (by default). So there is no risk in having several independent objects of class SomeKindOfClass - each of them will live in its own process. How does it work? Python runs your program and then starts 4 child processes. That's why it's very important to have the if __name__ == '__main__' guard before pool.map(work, range(0, n)). Otherwise you will get an infinite loop of process creation.
Problems could arise if SomeKindOfClass keeps state on disk - for example, writes something to a file or reads it.
I am trying to understand how to share read-only objects with multiprocessing. Sharing bigset when it is a global variable works fine:
from multiprocessing import Pool

bigset = set(xrange(pow(10, 7)))

def worker(x):
    return x in bigset

def main():
    pool = Pool(5)
    print all(pool.imap(worker, xrange(pow(10, 6))))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
htop shows that the parent process uses 100% CPU and 0.8% memory, while the workload is distributed evenly among the five child processes: each uses 10% CPU and 0.8% memory. It's all good.
But the numbers start going crazy if I move bigset inside main:
from multiprocessing import Pool
from functools import partial

def worker(x, l):
    return x in l

def main():
    bigset = set(xrange(pow(10, 7)))
    _worker = partial(worker, l=bigset)
    pool = Pool(5)
    print all(pool.imap(_worker, xrange(pow(10, 6))))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
Now htop shows 2 or 3 processes jumping up and down between 50% and 80% CPU, while the remaining processes use less than 10% CPU. And while the parent process still uses 0.8% memory, now all children use 1.9% memory.
What's happening?
When you pass bigset as an argument, it is pickled by the parent process and unpickled by the child processes.[1][2]
Pickling and unpickling a large set requires a lot of time. This explains why you see only a few processes doing their job: the parent process has to pickle a lot of big objects, and the children have to wait for it. The parent process is a bottleneck.
Pickling parameters implies that parameters have to be sent to the processes. Sending data from one process to another requires system calls, which is why you are not seeing 100% CPU usage in user-space code; part of the CPU time is spent in kernel space.[3]
Pickling objects and sending them to the subprocesses also implies that: 1. you need memory for the pickle buffer; 2. each subprocess gets a copy of bigset. This is why you are seeing an increase in memory usage.
Instead, when bigset is a global variable, it is not sent anywhere (unless you are using a start method other than fork). It is just inherited as-is by subprocesses, using the usual copy-on-write rules of fork().
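One common workaround (a sketch, not from the original question) is to hand bigset to each worker exactly once through a Pool initializer, instead of attaching it to every task with functools.partial:

from multiprocessing import Pool

bigset = None  # set once in each worker by the initializer

def init(shared):
    global bigset
    bigset = shared

def worker(x):
    return x in bigset

def main():
    data = set(range(10 ** 6))
    # initargs is transferred at most once per worker process,
    # not once per task as with the partial() version
    pool = Pool(5, initializer=init, initargs=(data,))
    print(all(pool.imap(worker, range(10 ** 5), chunksize=1000)))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()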
Footnotes:
In case you don't know what "pickling" means: pickle is one of the standard Python protocols to transform arbitrary Python objects to and from a byte sequence.
imap() & co. use queues behind the scenes, and queues work by pickling/unpickling objects.
I tried running your code (with all(pool.imap(_worker, xrange(100))) in order to make the process faster) and I got: 2 minutes user time, 13 seconds system time. That's almost 10% system time.
Using python's multiprocessing module, the following contrived example runs with minimal memory requirements:
import multiprocessing

# completely_unrelated_array = range(2**25)

def foo(x):
    for x in xrange(2**28): pass
    print x**2

P = multiprocessing.Pool()
for x in range(8):
    multiprocessing.Process(target=foo, args=(x,)).start()
Uncomment the creation of completely_unrelated_array and you'll find that each spawned process allocates memory for a copy of it! This is a minimal example of a much larger project that I can't figure out how to work around; multiprocessing seems to make a copy of everything that is global. I don't need a shared-memory object, I simply need to pass in x and process it without the memory overhead of the entire program.
Side observation: What's interesting is that print id(completely_unrelated_array) inside foo gives the same value, suggesting that somehow that might not be copies...
Because of the nature of os.fork(), any variables in the global namespace of your __main__ module will be inherited by the child processes (assuming you're on a Posix platform), so you'll see the memory usage in the children reflect that as soon as they're created. I'm not sure if all that memory is really being allocated, though; as far as I know, that memory is shared until you actually try to change it in the child, at which point a new copy is made.
Windows, on the other hand, doesn't use os.fork() - it re-imports the main module in each child and pickles any local variables you want sent to the children. So, on Windows you can avoid the large global object ending up copied in the child by only defining it inside an if __name__ == "__main__": guard, because everything inside that guard will only run in the parent process:
import time
import multiprocessing

def foo(x):
    for x in range(2**28): pass
    print(x**2)

if __name__ == "__main__":
    completely_unrelated_array = list(range(2**25))  # This will only be defined in the parent on Windows
    P = multiprocessing.Pool()
    for x in range(8):
        multiprocessing.Process(target=foo, args=(x,)).start()
Now, in Python 2.x, you can only create new multiprocessing.Process objects by forking if you're using a Posix platform. But on Python 3.4, you can specify how the new processes are created, by using contexts. So, we can specify the "spawn" context, which is the one Windows uses, to create our new processes, and use the same trick:
# Note that this is Python 3.4+ only
import time
import multiprocessing

def foo(x):
    for x in range(2**28): pass
    print(x**2)

if __name__ == "__main__":
    completely_unrelated_array = list(range(2**23))  # Again, this only exists in the parent
    ctx = multiprocessing.get_context("spawn")  # Use process spawning instead of fork
    P = ctx.Pool()
    for x in range(8):
        ctx.Process(target=foo, args=(x,)).start()
If you need 2.x support, or want to stick with using os.fork() to create new Process objects, I think the best you can do to get the reported memory usage down is immediately delete the offending object in the child:
import time
import multiprocessing
import gc

def foo(x):
    init()
    for x in range(2**28): pass
    print(x**2)

def init():
    global completely_unrelated_array
    completely_unrelated_array = None
    del completely_unrelated_array
    gc.collect()

if __name__ == "__main__":
    completely_unrelated_array = list(range(2**23))
    P = multiprocessing.Pool(initializer=init)
    for x in range(8):
        multiprocessing.Process(target=foo, args=(x,)).start()
    time.sleep(100)
What is important here is which platform you are targeting.
On Unix systems, processes are created using copy-on-write (COW) memory. So even though each process gets a copy of the full memory of the parent process, that memory is only actually allocated, on a per-page basis (4 KiB), when it is modified.
So if you are only targeting these platforms, you don't have to change anything.
If you are targeting platforms without COW forks, you may want to use Python 3.4 and its new forking contexts spawn and forkserver; see the documentation.
These methods will create new processes which share nothing, or only limited state, with the parent, and all memory passing is explicit.
But note that the spawned process will import your module, so all global data will be explicitly copied and no copy-on-write is possible. To prevent this you have to reduce the scope of the data.
import multiprocessing as mp
import numpy as np

def foo(x):
    import time
    time.sleep(60)

if __name__ == "__main__":
    mp.set_start_method('spawn')
    # not global so forks will not have this allocated due to the spawn method
    # if the method would be fork the children would still have this memory allocated
    # but it could be copy-on-write
    completely_unrelated_array = np.ones((5000, 10000))
    P = mp.Pool()
    for x in range(3):
        mp.Process(target=foo, args=(x,)).start()
e.g. top output with spawn:
%MEM TIME+ COMMAND
29.2 0:00.52 python3
0.5 0:00.00 python3
0.5 0:00.00 python3
0.5 0:00.00 python3
and with fork:
%MEM TIME+ COMMAND
29.2 0:00.52 python3
29.1 0:00.00 python3
29.1 0:00.00 python3
29.1 0:00.00 python3
Note how the totals add up to more than 100%, due to copy-on-write.