I have a lookup table (LUT), which is a very large dictionary (24 GB).
I also have millions of inputs to query against it.
I want to split the millions of inputs across 32 jobs and run them in parallel.
Because of the memory constraint, I cannot run multiple Python scripts, since each one would load its own copy of the LUT and overload memory.
I want to use the multiprocessing module to load the LUT just once and then have different processes look it up, sharing it as a global variable without duplicating it.
However, when I look at htop, it seems each subprocess is re-creating the LUT. I say this because the numbers under VIRT, RES, and SHR are very high for every subprocess.
But at the same time I don't see the additional memory used in the Mem row; it increased from 11 GB to 12.3 GB and just hovers there.
So I'm confused: is it, or is it not, re-creating the LUT within each subprocess?
How should I proceed to make sure the work runs in parallel without duplicating the LUT in each subprocess?
The code is shown below.
(In this experiment I'm only using a 1 GB LUT, so don't worry about it not being 24 GB.)
import os, sys, time, pprint, pdb, datetime
import random
import threading, multiprocessing

## Print the process/thread details
def getDetails(idx):
    pid = os.getpid()
    threadName = threading.current_thread().name
    processName = multiprocessing.current_process().name
    print(f"{idx})\tpid={pid}\tprocessName={processName}\tthreadName={threadName} ")
    return pid, threadName, processName

def ComplexAlgorithm(value):
    # Instead of just lookup like this
    # the real algorithm is some complex algorithm that performs some search
    return value in LUT

## Querying the 24Gb LUT from my millions of lines of input
def PerformMatching(idx, NumberOfLines):
    pid, threadName, processName = getDetails(idx)
    NumberMatches = 0
    for _ in range(NumberOfLines):
        # I will actually read the contents from my file live,
        # but here just assume i generate random numbers
        value = random.randint(-100, 100)
        if ComplexAlgorithm(value): NumberMatches += 1
    print(f"\t{idx}) | LUT={len(LUT)} | NumberMatches={NumberMatches} | done")

if __name__ == "__main__":
    ## Init
    num_processes = 9
    # this is just a pseudo-call to show you the structure of my LUT, the real one is larger
    LUT = {i: set([i]) for i in range(1000)}

    ## Store the multiple filenames
    ListOfLists = []
    for idx in range(num_processes):
        NumberOfLines = 10000
        ListOfLists.append( NumberOfLines )

    ## Init the processes
    ProcessList = []
    for processIndex in range(num_processes):
        ProcessList.append(
            multiprocessing.Process(
                target=PerformMatching,
                args=(processIndex, ListOfLists[processIndex])
            )
        )
        ProcessList[processIndex].start()

    ## Wait until the process terminates.
    for processIndex in range(num_processes):
        ProcessList[processIndex].join()
    ## Done
If you want to go the route of using a multiprocessing.Manager, this is how you could do it. The trade-off is that the dictionary is represented by a reference to a proxy for the actual dictionary that exists in a different address space and consequently every dictionary reference results in the equivalent of a remote procedure call. In other words, access is much slower compared with a "regular" dictionary.
In the demo program below, I have only defined a couple of methods for my managed dictionary, but you can define whatever you need. I have also used a multiprocessing pool instead of explicitly starting individual processes; you might consider doing likewise.
from multiprocessing.managers import BaseManager, BaseProxy
from multiprocessing import Pool
from functools import partial

def worker(LUT, key):
    return LUT[key]

class MyDict:
    def __init__(self):
        """ initialize the dictionary """
        # the very large dictionary reduced for demo purposes:
        self._dict = {i: i for i in range(100)}

    def get(self, obj, default=None):
        """ delegates to underlying dict """
        return self._dict.get(obj, default)

    def __getitem__(self, obj):
        """ delegates to underlying dict """
        return self._dict[obj]

class MyDictManager(BaseManager):
    pass

class MyDictProxy(BaseProxy):
    _exposed_ = ('get', '__getitem__')

    def get(self, *args, **kwargs):
        return self._callmethod('get', args, kwargs)

    def __getitem__(self, *args, **kwargs):
        return self._callmethod('__getitem__', args, kwargs)

def main():
    MyDictManager.register('MyDict', MyDict, MyDictProxy)
    with MyDictManager() as manager:
        my_dict = manager.MyDict()
        pool = Pool()
        # pass proxy instead of actual LUT:
        results = pool.map(partial(worker, my_dict), range(100))
        print(sum(results))

if __name__ == '__main__':
    main()
Prints:
4950
Discussion
Python comes with a managed dict class built in obtainable with multiprocessing.Manager().dict(). But initializing such a large number of entries with such a dictionary would be very inefficient based on my prior comment that each access would be relatively expensive. It seemed to me that it would be less expensive to create our own managed class that had an underlying "regular" dictionary that could be initialized directly when the managed class is constructed and not via the proxy reference. And while it is true that the managed dict that comes with Python can be instantiated with an already built dictionary, which avoids that inefficiency problem, my concern is that memory efficiency would suffer because you would have two instances of the dictionary, i.e. the "regular" dictionary and the "managed" dictionary.
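For comparison, here is a minimal sketch of the built-in managed dictionary being created from an already-built dictionary. The tiny table and the helper names (build_table, lookup) are stand-ins I made up for illustration; the sketch only shows the shape of the call and says nothing about the memory or speed cost discussed above.

```python
from functools import partial
from multiprocessing import Manager, Pool


def build_table(size):
    """Build a small stand-in for the real lookup table (hypothetical helper)."""
    return {i: i * i for i in range(size)}


def lookup(table, key):
    """Look up one key; works with a plain dict or a managed dict proxy."""
    return table.get(key)


if __name__ == "__main__":
    plain = build_table(100)
    with Manager() as manager:
        # The managed dict is filled from the existing dict in one call,
        # but the entries now also live in the manager's server process.
        shared = manager.dict(plain)
        with Pool(2) as pool:
            # Each lookup through the proxy is a round trip to that server.
            results = pool.map(partial(lookup, shared), range(5))
    print(results)
```

Because lookup only calls get(), the same function can be exercised against the plain dictionary first, which makes it easy to compare timings before committing to the proxy.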
Related
I'm having trouble finding the correct way to implement the following: I have a class in Python3 for which I keep an instance counter. Using a concurrent.futures.ProcessPoolExecutor, I submit several tasks that make use of this class. I assumed that since the tasks ran in separate processes there would be no shared state between them, but it would seem I was wrong as this instance counter is shared among them. The following code exemplifies what I mean:
import concurrent.futures

class A:
    counter = 0

    def __init__(self):
        A.counter += 1
        self.id = A.counter

    def hello(self):
        return f'Hello from node{self.id}'

def start():
    instance = A()
    return instance.hello()

results = []
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    for i in range(4):
        f = executor.submit(start)
        results.append(f)
    for r in results:
        print(r.result())
The output from the above is:
Hello from node1
Hello from node2
Hello from node1
Hello from node1
The issue is not the race condition when accessing the counter, my issue is that the variable is even shared at all when I was expecting it to be private per process (e.g. start at 0 for each worker). What would be a pythonic way to achieve this?
Thanks in advance.
Here it would seem you found that tasks are not always evenly distributed between workers within a processing pool: one of the workers completed 2 "tasks" while one (of the 4) got none. In each worker the A class is defined either by copying the memory space at the time fork is called (*nix) or by importing the main file (Windows and macOS). The class attribute counter behaves like a global variable, because it is defined on the class rather than on the instance, so any worker that gets multiple tasks will see that value incremented each time. While it is possible to avoid this by limiting workers to one task each before they are terminated and restarted, it is generally good practice to avoid global state. maxtasksperchild is much more frequently used to prevent leaks when child processes, for various reasons, do not release memory or file handles over time.
As you have stated in the comments, a long-running task reduces the relative impact of the overhead of restarting the process for each task. However, if you are using any of the functions that map a function over an iterable and accept a chunksize argument, this approach may break. A "task" isn't necessarily one iteration of the map; it may be several at once (to reduce the overhead of transferring the arguments and results). This example should demonstrate a pool with maxtasksperchild=1, where each child ends up calling start() 4 times:
from multiprocessing import Pool

class A:
    counter = 0

    def __init__(self):
        A.counter += 1
        self.id = A.counter

    def hello(self):
        return f'Hello from node{self.id}'

def start(_):
    instance = A()
    print( instance.hello())

if __name__ == "__main__":
    with Pool(4, maxtasksperchild=1) as p:
        p.map(start, range(16), chunksize=4)
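If the goal is simply to stop depending on the class-level counter, one option is to pass the identifier in as an argument. This is only a sketch under that assumption: the Node class and the choice of numbering tasks from 1 are mine, not part of the question.

```python
import concurrent.futures


class Node:
    """Holds an identifier that is passed in rather than counted globally."""

    def __init__(self, node_id):
        self.id = node_id

    def hello(self):
        return f"Hello from node{self.id}"


def start(node_id):
    # The id arrives as an argument, so the result does not depend on
    # which worker runs the task or how many tasks that worker has seen.
    return Node(node_id).hello()


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        for line in executor.map(start, range(1, 5)):
            print(line)
```

With no shared counter, the output no longer changes with how the pool happens to distribute tasks or with chunksize.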
Context
I am trying to use multiprocessing, specifically Pool().starmap(), within a specific method of a class. Let's name the file with the code for the Quant class model.py. I'd like to be able to import Quant, declare an instance of it, myobj, and call myobj.calculate() from another file called tester.py. My overall goal here is to get tester.py to run as fast as possible, with the cleanest syntax possible.
model.py
import multiprocessing
from optcode import optimizer

def f(arg1, arg2, arg3, arg4):
    return arg1.run(arg_a=arg2, arg_b=arg3, arg_c=arg4)

class Quant:
    def __init__(self, name):
        self.name = name

    def calculate(self):
        optimization = optimizer()
        # x, y, z, q, r, l, m, n stand in for the real inputs
        args = [(optimization, x, y, z), (optimization, q, r, z), (optimization, l, m, n)]
        cpus = multiprocessing.cpu_count()
        with multiprocessing.Pool(processes=cpus) as pool:
            tasks = pool.starmap(f, args)
        self.results = tasks
tester.py
from model import Quant

if __name__ == '__main__':
    myobj = Quant('My Instance')
    myobj.calculate()
    print(myobj.results)
Questions
From what I can tell, the myobj.calculate() line needs to be within if __name__ == '__main__': to prevent everything from freezing (a fork bomb?). It appears that when I move the if __name__ == '__main__': line up to include everything within tester.py (as I have in the above example), it then prevents Python from re-importing (and executing) tester.py ncpu times when I execute tester.py once.
Is my understanding correct, and is there an alternative to needing these conditions? I'm trying to set things up for less savvy users. My actual application has more computationally intensive code within both Quant.__init__() and Quant.calculate().
How else can I speed this up (more) ? I have read that pathos.multiprocessing has superior serialization. Would that help in this context? In my actual application, args[0] is a tuple of pandas DataFrames and floats.
I have also read that there may be a better type of mapping to use with the processing pool. I do need the result of the pool to be ordered the same way as args, but I don't need intermediate results from each worker process; each process is totally independent of one another. Would something like imap() or map_async() make for a more efficient setup?
I haven't been able to get the syntax for pathos to work and I've read all the examples I can find. Something about args[0] being a tuple of arguments, and args itself being an iterable (list) seems to be the issue. I know pathos.multiprocessing.ProcessPool().map() can handle multi-arg functions, but I can't figure out how to use it, given the structure of my inputs.
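To make the argument structure concrete, here is a minimal sketch of adapting a multi-argument function so that single-argument mappers such as map() or imap() can consume a list of tuples. The function f below is a made-up stand-in for the real one, and f_packed is a hypothetical wrapper name. Like starmap(), imap() yields results in input order.

```python
from multiprocessing import Pool


def f(a, b, c):
    """Stand-in for the real multi-argument function."""
    return a + b * c


def f_packed(packed_args):
    """Unpack one tuple of arguments so single-argument mappers can call f."""
    return f(*packed_args)


if __name__ == "__main__":
    args = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
    with Pool(processes=2) as pool:
        # imap returns results lazily but in the same order as args.
        ordered = list(pool.imap(f_packed, args))
    print(ordered)
```

The wrapper lives at module level so that it can be pickled by reference when it is sent to the workers.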
I have a list of objects and need to call a member function of every object. Is it possible to use multiprocessing for that?
I wrote a short example of what I want to do.
import multiprocessing as mp

class Example:
    data1 = 0
    data2 = 3

    def compute():
        self.val3 = 6

listofobjects = []
for i in range(5):
    listofobjects.append(Example())

pool = mp.Pool()
pool.map(listofobjects[range(5)].compute())
There are two conceptual issues that @abarnert has pointed out, beyond the syntactic and usage problems in your "pseudocode". The first is that map works with a function that is applied to the elements of your input. The second is that each sub-process gets a copy of your object, so changes to attributes are not automatically seen in the originals. Both issues can be worked around.
To answer your immediate question, here is how you would apply a method to your list:
with mp.Pool() as pool:
    pool.map(Example.compute, listofobjects)
Example.compute is an unbound method. That means that it is just a regular function that accepts self as a first argument, making it a perfect fit for map. I also use the pool as a context manager, which is recommended to ensure that cleanup is done properly whether or not an error occurs.
The code above would run, but it would not have the intended effect, because the effects of compute would be local to the subprocess. The only way to pass them back to the original process would be to return them from the function you passed to map. If you don't want to modify compute, you could do something like this:
def get_val3(x):
    x.compute()
    return x.val3

with mp.Pool() as pool:
    for value, obj in zip(pool.map(get_val3, listofobjects), listofobjects):
        obj.val3 = value
If you were willing to modify compute to return the object it is operating on (self), you could use it to replace the original objects much more efficiently:
class Example:
    ...
    def compute(self):
        ...
        return self

with mp.Pool() as pool:
    listofobjects = list(pool.map(Example.compute, listofobjects))
Update
If your object or some part of its reference tree does not support pickling (which is the form of serialization normally used to pass objects between processes), you can at least get rid of the wrapper function by returning the updated value directly from compute:
class Example:
    ...
    def compute(self):
        self.val3 = ...
        return self.val3

with mp.Pool() as pool:
    for value, obj in zip(pool.map(Example.compute, listofobjects), listofobjects):
        obj.val3 = value
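Putting the pieces together, a self-contained sketch of the "return self" variant might look like the following. The __init__ method and the value 6 are filled in from the question's example; the pool size of 2 is an arbitrary choice for the demo.

```python
import multiprocessing as mp


class Example:
    def __init__(self):
        self.val3 = None

    def compute(self):
        # Runs in a worker on a pickled copy of the object.
        self.val3 = 6
        # Returning self sends the updated copy back to the parent.
        return self


if __name__ == "__main__":
    objects = [Example() for _ in range(5)]
    with mp.Pool(2) as pool:
        # Replace the originals with the updated copies from the workers.
        objects = pool.map(Example.compute, objects)
    print([obj.val3 for obj in objects])
```

Note that the objects in the list after the map are new instances, so any other references to the originals still point at the unmodified objects.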
Here is a simple scenario:
class Test:
    def __init__(self):
        self.foo = []

    def append(self, x):
        self.foo.append(x)

    def get(self):
        return self.foo

def process_append_queue(append_queue, bar):
    while True:
        x = append_queue.get()
        if x is None:
            break
        bar.append(x)
    print("worker done")

def main():
    import multiprocessing as mp
    bar = Test()
    append_queue = mp.Queue(10)
    append_queue_process = mp.Process(target=process_append_queue, args=(append_queue, bar))
    append_queue_process.start()
    for i in range(100):
        append_queue.put(i)
    append_queue.put(None)
    append_queue_process.join()
    print(str(bar.get()))

if __name__=="__main__":
    main()
When you call bar.get() at the end of the main() function why does it still return an empty list? How can I make it so that the child process also works with the same instance of Test not a new one?
All answers appreciated!
In general, processes have distinct address spaces, so that mutations of an object in one process have no effect on any object in any other process. Interprocess communication is needed to tell a process about changes made in another process.
That can be done explicitly (using things like multiprocessing.Queue), or implicitly if you use a facility implemented by multiprocessing for this purpose. For example, a great deal of work is done under the covers to make changes to a multiprocessing.Queue visible across processes.
The easiest way in your specific example is to replace your __init__ function like so:
    def __init__(self):
        import multiprocessing as mp
        self.foo = mp.Manager().list()
It so happens that an mp.Manager instance supports a list() method that creates a process-aware list object (really a proxy for a list object, which forwards list operations to an under-the-covers server process that maintains a single copy of "the real" list - the list object isn't really shared across processes, because that's impossible - but the proxies make it appear to be shared).
So if you make that change, your code will display the results you expect - and there is no simpler way.
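As a rough illustration of the same idea in a self-contained form, the sketch below passes the list-like storage into Test instead of creating the manager inside __init__. That change, and the fill helper, are my own assumptions made so the class also works with a plain list.

```python
import multiprocessing as mp


class Test:
    def __init__(self, storage):
        # storage is any list-like object; a managed list proxy makes
        # appends from other processes visible here.
        self.foo = storage

    def append(self, x):
        self.foo.append(x)

    def get(self):
        # Copy the contents into an ordinary list for the caller.
        return list(self.foo)


def fill(bar, values):
    """Append every value to bar; runs in the child process in the demo."""
    for value in values:
        bar.append(value)


if __name__ == "__main__":
    with mp.Manager() as manager:
        bar = Test(manager.list())
        child = mp.Process(target=fill, args=(bar, range(5)))
        child.start()
        child.join()
        # The appends happened in the manager's server process, so the
        # parent sees them through its own proxy.
        print(bar.get())
```

Every append in the child is a message to the manager's server process, which is the IPC cost that the following note warns about.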
Note that multiprocessing works better the less IPC (interprocess communication) you need, and that's true pretty much regardless of application or programming language.
Objects are copied between processes by pickling them and passing the resulting bytes over a pipe. There is no way to achieve true "shared memory" for pure Python objects between processes. To achieve precisely this type of synchronization, take a look at the multiprocessing.Manager documentation (https://docs.python.org/2/library/multiprocessing.html#managers), which provides examples of synchronized versions of common Python container types. These are "proxied" containers: operations on the proxy send all arguments, pickled, across the process boundary, where they are executed in the manager's server process.
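The copying behaviour can be seen without starting any processes at all. The sketch below mimics the pickle round trip in a single process; send_copy is a hypothetical helper name, and the dictionary is just a stand-in for whatever object would be sent.

```python
import pickle


def send_copy(obj):
    """Mimic crossing a process boundary: serialize, then rebuild."""
    return pickle.loads(pickle.dumps(obj))


original = {"items": []}
received = send_copy(original)

# Mutate the rebuilt object, as a child process would mutate its copy.
received["items"].append(1)

print(original["items"])  # the original list is untouched
print(received["items"])  # only the copy has the new element
```

This is why mutations made in a child never show up in the parent unless they are sent back explicitly or routed through a proxy.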
I have a function that does a calculation and saves the state of the calculation in the result dictionary (a mutable default argument). I first run it, then run several processes using the multiprocessing module. I need to run the function again in each of those parallel processes, but after this function has run once, I need the cached state to be returned; the value must not be recalculated. This requirement doesn't make sense in my example, but I can't think of a simple realistic example that would require this restriction. Using a dict as a mutable default argument works, but this doesn't work with the multiprocessing module. What approach can I use to get the same effect?
Note that the state value is something (a dictionary containing class values) that cannot be passed to the multiple processes as an argument afaik.
The SO question Python multiprocessing: How do I share a dict among multiple processes? seems to cover similar ground. Perhaps I can use a Manager to do what I need, but it is not obvious how. Alternatively, one could perhaps save the value to a global object, per https://stackoverflow.com/a/4534956/350713, but that doesn't seem very elegant.
def foo(result={}):
    if result:
        print("returning cached result")
        return result
    result[1] = 2
    return result

def parafn():
    from multiprocessing import Pool
    pool = Pool(processes=2)
    arglist = []
    foo()
    for i in range(4):
        arglist.append({})
    results = []
    r = pool.map_async(foo, arglist, callback=results.append)
    r.get()
    r.wait()
    pool.close()
    pool.join()
    return results

print(parafn())
UPDATE: Thanks for the comments. I've got a working example now, posted below.
I think the safest way to exchange data between processes is with a queue. The multiprocessing module provides two types, Queue and JoinableQueue; see the documentation:
http://docs.python.org/library/multiprocessing.html#exchanging-objects-between-processes
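A minimal sketch of that pattern, assuming each worker computes one result and puts it on a shared multiprocessing.Queue for the parent to collect. The chunk_total and worker names, and the sample chunks, are made up for illustration.

```python
import multiprocessing as mp


def chunk_total(values):
    """The per-chunk computation; a simple sum stands in for real work."""
    return sum(values)


def worker(result_queue, idx, values):
    # Send the result back to the parent tagged with the chunk index.
    result_queue.put((idx, chunk_total(values)))


if __name__ == "__main__":
    chunks = [[1, 2, 3], [4, 5], [6]]
    result_queue = mp.Queue()
    procs = [
        mp.Process(target=worker, args=(result_queue, i, chunk))
        for i, chunk in enumerate(chunks)
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    results = dict(result_queue.get() for _ in procs)
    for p in procs:
        p.join()
    print(results)
```

Tagging each result with its index lets the parent reassemble the results in order, since the queue delivers them in completion order.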
This code would not win any beauty prizes, but it works for me.
This example is similar to the example in the question, with some minor changes.
The add_to_d construct is a bit awkward, but I don't see a better way to do this.
Brief summary: I copy the state of foo's d (which is a mutable default argument) from the parent process back into foo in each of the new processes created by the pool. Once this is done, foo in the new processes will not recalculate the cached values.
It seems this is what the pool initializer is for, though the documentation is not very explicit.
class bar(object):
    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return "<bar "+ str(self.x) +">"

def foo(x=None, add_to_d=None, d = {}):
    if add_to_d:
        d.update(add_to_d)
    if x is None:
        return
    if x in d:
        print("returning cached result, d is %s, x is %s"%(d, x))
        return d[x]
    d[x] = bar(x)
    return d[x]

def finit(cacheval):
    # runs once in each worker process and seeds foo's cache there
    foo(x=None, add_to_d=cacheval)

def parafn():
    from multiprocessing import Pool
    arglist = []
    foo(1)
    # foo.__defaults__[2] is foo's mutable default d (func_defaults in older Python 2 code)
    pool = Pool(processes=2, initializer=finit, initargs=[foo.__defaults__[2]])
    arglist = range(4)
    results = []
    r = pool.map_async(foo, iterable=arglist, callback=results.append)
    r.get()
    r.wait()
    pool.close()
    pool.join()
    return results

print(parafn())
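The same idea can be written with a module-level cache instead of a mutable default argument. This is a sketch rather than a drop-in replacement: _cache, init_worker and cached_square are hypothetical names, and squaring stands in for the real calculation.

```python
from multiprocessing import Pool

# Per-process cache: every worker process has its own copy of this dict.
_cache = {}


def init_worker(seed):
    """Runs once in each worker at startup and seeds its cache."""
    _cache.update(seed)


def cached_square(x):
    # Only compute when the value is not already cached in this process.
    if x not in _cache:
        _cache[x] = x * x
    return _cache[x]


if __name__ == "__main__":
    cached_square(1)  # warm the cache in the parent
    # Pass a snapshot of the parent's cache to every worker.
    with Pool(processes=2, initializer=init_worker,
              initargs=(dict(_cache),)) as pool:
        print(pool.map(cached_square, range(4)))
```

Values that a worker adds to its cache after startup stay in that worker only; the initializer copies state in one direction, from the parent to each new worker.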