Why does `Queue.put` seem to be faster at pickling a numpy array than actual pickle?

It appears that I can call q.put 1000 times in under 2.5ms. How is that possible when just pickling that very same array 1000 times takes over 2 seconds?
>>> import pickle, timeit
>>> import numpy as np
>>> from multiprocessing import Queue
>>> a = np.random.rand(1024, 1024)
>>> q = Queue()
>>> timeit.timeit(lambda: q.put(a), number=1000)
0.0025581769878044724
>>> timeit.timeit(lambda: pickle.dumps(a), number=1000)
2.690145633998327
Obviously, I am not understanding something about how Queue.put works. Can anyone enlighten me?
I also observed the following:
>>> def f():
...     q.put(a)
...     q.get()
...
>>> timeit.timeit(lambda: f(), number=1000)
42.33058542700019
This appears to be more realistic and suggests to me that simply calling q.put() will return before the object is actually serialized. Is that correct?

The multiprocessing implementation has a number of moving parts under the covers. Here, dealing with a multiprocessing.Queue is mostly done in a hidden (to the end user) worker thread. .put() just puts an object pointer on an internal queue (fast and constant-time), and that worker thread does the actual pickling when it gets around to it.
This can burn you, though: if, in your example, the main program goes on to mutate the np array after the .put(), an undefined number of those mutations may be captured by the eventually pickled state. The user-level .put() captures only the object pointer, nothing about the object's state.
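Here is a minimal sketch of the race this implies (my own illustration, not from the question): because .put() returns before the feeder thread pickles the array, a mutation made right after .put() can leak into what the consumer later receives.
import numpy as np
from multiprocessing import Queue

q = Queue()
a = np.zeros(1024 * 1024)

q.put(a)      # returns almost immediately; only a reference is enqueued
a[:] = 1.0    # mutate before the hidden feeder thread has (necessarily) pickled the array

b = q.get()
print(b[0])   # may print 1.0 instead of 0.0, depending on when the feeder thread ran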

Related

Ensure pickling is complete before ProcessPoolExecutor.submit returns

Say I have the following simple class (easily pickled):
import time
from concurrent.futures import ProcessPoolExecutor

class A:
    def long_computation(self):
        time.sleep(10)
        return 42
I would like to be able to do this:
a = A()
with ProcessPoolExecutor(1) as executor:
    a.future = executor.submit(a.long_computation)
On Python 3.6.9, this fails with TypeError: can't pickle _thread.RLock objects. On 3.8.0, it results in an endless wait for a lock to be acquired.
What does work (on both versions) is this:
a = A()
with ProcessPoolExecutor(1) as executor:
    future = executor.submit(a.long_computation)
    time.sleep(0.001)
    a.future = future
It seems to me that executor.submit does not block long enough for the pickling of a to finish, and runs into issues with pickling the resulting Future object.
I'm not too happy about the time.sleep(0.001) workaround, as it involves a magic number and I imagine it could easily fail if the pickling ends up taking longer. I don't want to sleep for a safer, longer time as that would be a waste. Ideally I would want executor.submit to block until it is safe to store a reference to the Future object in a.
Is there a better way to do this?
Thinking about it a bit more, I came up with the following:
import pickle

a = A()
with ProcessPoolExecutor(1) as executor:
    a.future = executor.submit(pickle.loads(pickle.dumps(a)).long_computation)
It involves a duplication of effort, as a is pickled twice, but it works fine and ensures that the Future object never gets pickled, as desired.
Then I realised that the reason this works is that it creates a copy of a. So the pickling and unpickling can be avoided by simply (shallow) copying the object before submitting the method, which ensures that no reference to the Future object exists on the copy:
from copy import copy

a = A()
with ProcessPoolExecutor(1) as executor:
    a.future = executor.submit(copy(a).long_computation)
This is faster and much less awkward than the above cycle of pickles, but I'm still interested in the best practice here so I'll wait a bit before accepting this answer.
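For completeness, here is a hedged sketch of another way around the problem (my own illustration, not from the original post): submit a module-level function that takes only the data it needs, so no reference to the Future can ever end up in the pickled payload.
import time
from concurrent.futures import ProcessPoolExecutor

def long_computation():
    time.sleep(10)
    return 42

class A:
    pass

if __name__ == "__main__":
    a = A()
    with ProcessPoolExecutor(1) as executor:
        a.future = executor.submit(long_computation)  # nothing of a is pickled
        print(a.future.result())  # 42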

Passing a method of a big object to imap: 1000-fold speed-up by wrapping the method

Assume yo = Yo() is a big object with a method double, which returns its parameter multiplied by 2.
If I pass yo.double to imap of multiprocessing, it is incredibly slow, because every function call creates a copy of yo (I think).
I.e., this is very slow:
from tqdm import tqdm
from multiprocessing import Pool
import numpy as np

class Yo:
    def __init__(self):
        self.a = np.random.random((10000000, 10))

    def double(self, x):
        return 2 * x

yo = Yo()

with Pool(4) as p:
    for _ in tqdm(p.imap(yo.double, np.arange(1000))):
        pass
Output:
0it [00:00, ?it/s]
1it [00:06, 6.54s/it]
2it [00:11, 6.17s/it]
3it [00:16, 5.60s/it]
4it [00:20, 5.13s/it]
...
BUT, if I wrap yo.double with a function double_wrap and pass it to imap, then it is essentially instantaneous.
def double_wrap(x):
    return yo.double(x)

with Pool(4) as p:
    for _ in tqdm(p.imap(double_wrap, np.arange(1000))):
        pass
Output:
0it [00:00, ?it/s]
1000it [00:00, 14919.34it/s]
How and why does wrapping the function change the behavior?
I use Python 3.6.6.
You are right about the copying. yo.double is a 'bound method', bound to your big object. When you pass it into a pool method, the whole instance gets pickled with it, sent to the child processes and unpickled there. This happens for every chunk of the iterable a child process works on. The default value for chunksize in pool.imap is 1, so you are hitting this communication overhead for every processed item in the iterable.
By contrast, when you pass double_wrap, you are just passing a module-level function. Only its name actually gets pickled, and the child processes import the function from __main__. Since you are obviously on an OS which supports forking, your double_wrap function has access to the forked yo instance of Yo. Your big object won't be serialized (pickled) in this case, hence the communication overhead is tiny compared to the other approach.
@Darkonaut I just don't understand why making the function module-level prevents copying of the object. After all, the function needs a pointer to the yo object itself, which should require all processes to copy yo, as they cannot share memory.
The function running in the child process will automatically find a reference to a global yo, because your operating system (OS) uses fork to create the child process. Forking produces a clone of your whole parent process, and as long as neither the parent nor the child alters a specific object, both will see the same object in the same memory location.
Only if the parent or the child changes something on the object does the object get copied in the child process. That's called "copy-on-write" and happens at OS level, without you taking notice of it in Python. Your code wouldn't work on Windows, which uses 'spawn' as the start method for new processes.
I'm simplifying a bit above where I write "the object gets copied", since the unit the OS operates on is a "page" (most commonly 4 KB in size). This answer would be a good follow-up read for broadening your understanding.
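As an aside, and only as a sketch of my own (not part of the original answer): if you do need to pass the bound method, a larger chunksize reduces how often the big instance is pickled, because the callable travels to the workers once per chunk rather than once per item.
from multiprocessing import Pool
import numpy as np

class Yo:
    def __init__(self):
        self.a = np.random.random((1000000, 10))  # smaller than the original so the demo runs quickly

    def double(self, x):
        return 2 * x

if __name__ == "__main__":
    yo = Yo()
    with Pool(4) as p:
        # With chunksize=250 the pickled bound method (and the instance bound to it)
        # is shipped roughly 4 times instead of 1000 times.
        results = list(p.imap(yo.double, np.arange(1000), chunksize=250))
    print(results[:5])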

Copy a generator

Let's say I have a generator like so
def gen():
    a = yield "Hello World"
    a_ = a + 1  # Imagine that on my computer "+ 1" is an expensive operation
    print "a_ = ", a_
    b = yield a_
    print "b =", b
    print "a_ =", a_
    yield b
Now let's say I do
>>> g = gen()
>>> g.next()
>>> g.send(42)
a_ = 43
43
Now we have calculated a_. Now I would like to clone my generator like so.
>>> newG = clonify(g)
>>> newG.send(7)
b = 7
a_ = 43
7
but my original g still works.
>>> g.send(11)
b = 11
a_ = 43
11
Specifically, clonify takes the state of a generator, and copies it. I could just reset my generator to be like the old one, but that would require calculating a_. Note also that I would not want to modify the generator extensively. Ideally, I could just take a generator object from a library and clonify it.
Note: itertools.tee won't work, because it does not handle sends.
Note: I only care about generators created by placing yield statements in a function.
Python doesn't have any support for cloning generators.
Conceptually, this should be implementable, at least for CPython. But practically, it turns out to be very hard.
Under the covers, a generator is basically nothing but a wrapper around a stack frame.*
And a frame object is essentially just a code object, an instruction pointer (an index into that code object), the builtins/globals/locals environment, an exception state, and some flags and debugging info.
And both types are exposed to the Python level,** as are all the bits they need. So, it really should be just a matter of:
Create a frame object just like g.gi_frame, but with a copy of the locals instead of the original locals. (All the user-level questions come down to whether to shallow-copy, deep-copy, or one of the above plus recursively cloning generators here.)
Create a generator object out of the new frame object (and its code and running flag).
And there's no obvious practical reason it shouldn't be possible to construct a frame object out of its bits, just as it is for a code object or most of the other hidden builtin types.
Unfortunately, as it turns out, Python doesn't expose a way to construct a frame object. I thought you could get around that just by using ctypes.pythonapi to call PyFrame_New, but the first argument to that is a PyThreadState—which you definitely can't construct from Python, and shouldn't be able to. So, to make this work, you either have to:
Reproduce everything PyFrame_New does by banging on the C structs via ctypes, or
Manually build a fake PyThreadState by banging on the C structs (which will still require reading the code to PyFrame_New carefully to know what you have to fake).
I think this may still be doable (and I plan to play with it; if I come up with anything, I'll update the Cloning generators post on my blog), but it's definitely not going to be trivial—or, of course, even remotely portable.
There are also a couple of minor problems.
Locals are exposed to Python as a dict (whether you call locals() for your own, or access g.gi_frame.f_locals for a generator you want to clone). Under the covers, locals are actually stored on the C stack.*** You can get around this by using ctypes.pythonapi to call PyFrame_LocalsToFast and PyFrame_FastToLocals. But the dict just contains the values, not cell objects, so doing this shuffle will turn all nonlocal variables into local variables in the clone.****
Exception state is exposed to Python as a type/value/traceback 3-tuple, but inside a frame there's also a borrowed (non-refcounted) reference to the owning generator (or NULL if it's not a generator frame). (The source explains why.) So, your frame-constructing function can't refcount the generator or you have a cycle and therefore a leak, but it has to refcount the generator or you have a potentially dangling pointer until the frame is assigned to a generator. The obvious answer seems to be to leave the generator NULL at frame construction, and have the generator-constructing function do the equivalent of self.gi_f.f_generator = self; Py_DECREF(self).
* It also keeps a copy of the frame's code object and running flag, so they can be accessed after the generator exits and disposes of the frame.
** generator and frame are hidden from builtins, but they're available as types.GeneratorType and types.FrameType. And they have docstrings, descriptions of their attributes in the inspect module, etc., just like function and code objects.
*** When you compile a function definition, the compiler makes a list of all the locals, stored in co_varnames, and turns each variable reference into a LOAD_FAST/STORE_FAST opcode with the index into co_varnames as its argument. When a function call is executed, the frame object stores the stack pointer in f_valuestack, pushes len(co_varnames)*sizeof(PyObject *) onto the stack, and then LOAD_FAST 0 just accesses *f_valuestack[0]. Closures are more complicated; a bit too much to explain in a comment on an SO answer.
**** I'm assuming you wanted the clone to share the original's closure references. If you were hoping to recursively clone all the frames up the stack to get a new set of closure references to bind, that adds another problem: there's no way to construct new cell objects from Python either.
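To make the moving parts above concrete, here is a small sketch of my own (written in Python 3 syntax) showing that the pieces of generator state the answer talks about are all visible from Python, even though there is no public way to assemble a new frame from them:
import types

def gen():
    a = yield "Hello World"
    yield a + 1

g = gen()
next(g)

frame = g.gi_frame
print(type(g) is types.GeneratorType)  # True
print(type(frame) is types.FrameType)  # True
print(frame.f_lasti)                   # index of the last executed instruction in the code object
print(frame.f_locals)                  # dict snapshot of the frame's locals
print(g.gi_code is frame.f_code)       # True: the generator keeps its own reference to the code object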
You can't, in general. However, if you parametrise over some expensive operation, why not lift that operation out, creating a generator factory?
def make_gen(a):
    a_ = [a + 1]  # Perform expensive calculation
    def gen(a_=a_):
        while True:
            print "a_ = ", a_
            a_[0] = yield a_[0]
    return gen
Then you can create as many generators as you like from the returned object:
gen = make_gen(42)
g = gen()
g.send(None)
# a_ = [43]
g.send(7)
# a_ = [7]
new_g = gen()
new_g.send(None)
# a_ = [7]
Whilst not technically returning a generator, if you don't mind fully expanding your sequence:
>>> source = ( x**2 for x in range(10) )
>>> source1, source2 = zip(*( (s,s) for s in source ))
>>> print( source1, type(source1) )
(0, 1, 4, 9, 16, 25, 36, 49, 64, 81) <class 'tuple'>
>>> print( source2, type(source2) )
(0, 1, 4, 9, 16, 25, 36, 49, 64, 81) <class 'tuple'>
If your function is expensive, then consider using either joblib or pathos.multiprocessing. Joblib has simpler syntax and handles pool management behind the scenes, but only supports batch processing. Pathos forces you to manually manage and close your ProcessPools, but it also has the pool.imap() and pool.uimap() functions, which return generators:
import os
from pathos.multiprocessing import ProcessPool

def expensive(x):
    return x**2

pool = ProcessPool(ncpus=os.cpu_count())
try:
    source = range(10)
    results = pool.imap(expensive, source)
    for result in results:
        print(result)
except KeyboardInterrupt:
    pass
finally:
    pool.terminate()
In theory, you could set this to run in a separate thread and pass in two queue objects that could be read independently, preserving generator-like behavior (see the sketch below), as suggested in this answer:
How to use multiprocessing queue in Python?
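A rough sketch of that thread-plus-queues idea (my own, with hypothetical helper names fan_out and as_iter): a background thread drains the source generator once and feeds two queues, which two consumers can then read independently.
import threading
from queue import Queue

_DONE = object()  # sentinel marking the end of the stream

def fan_out(gen, q1, q2):
    for item in gen:
        q1.put(item)
        q2.put(item)
    q1.put(_DONE)
    q2.put(_DONE)

def as_iter(q):
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item

source = (x ** 2 for x in range(10))
q1, q2 = Queue(), Queue()
threading.Thread(target=fan_out, args=(source, q1, q2), daemon=True).start()

print(list(as_iter(q1)))  # [0, 1, 4, ..., 81]
print(list(as_iter(q2)))  # the same sequence, read independently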

how to create uncollectable garbage in python?

I have a large, long-running server, and over weeks the memory usage steadily climbs.
Generally, as pointed out below, it's unlikely that leaks are my problem; however, I haven't got a lot to go on, so I want to see if there are any leaks.
Getting at console output is tricky, so I'm not running with gc.set_debug(). This is not a big problem though, as I have easily added an API to run gc.collect(), then iterate through gc.garbage and send the results back out to me over HTTP.
My problem is that, running it locally for a short time, my gc.garbage is always empty. I can't test the bit of code that lists the leaks before I deploy it.
Is there a trivial recipe for creating an uncollectable bit of garbage so I can test my code that lists the garbage?
Any cycle of finalizable objects (that is, objects with a __del__ method) is uncollectable, because the garbage collector does not know which order to run the finalizers in (note that this applies to Python 2 and to Python 3 before 3.4; since PEP 442, such cycles are collected and no longer land in gc.garbage):
>>> class Finalizable:
... def __del__(self): pass
...
>>> a = Finalizable()
>>> b = Finalizable()
>>> a.x = b
>>> b.x = a
>>> del a
>>> del b
>>> import gc
>>> gc.collect()
4
>>> gc.garbage
[<__main__.Finalizable instance at 0x1004e0b48>,
<__main__.Finalizable instance at 0x1004e73f8>]
But as a general point, it seems unlikely to me that your problem is due to uncollectable garbage, unless you are in the habit of using finalizers. It's more likely due to the accumulation of live objects, or to fragmentation of memory (since Python uses a non-moving collector).
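As a hedged addendum of my own: on Python 3.4 and later the __del__ cycle above is collected cleanly, so for exercising code that inspects gc.garbage you can instead ask the collector to keep everything it would have freed. gc.DEBUG_SAVEALL populates gc.garbage without printing anything to the console.
import gc

gc.set_debug(gc.DEBUG_SAVEALL)  # unreachable objects are appended to gc.garbage instead of freed

class Node:
    pass

a, b = Node(), Node()
a.other, b.other = b, a  # create a reference cycle
del a, b

gc.collect()
print(len(gc.garbage))   # > 0: the cycle's objects were saved for inspection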

Python: Do (explicit) string parameters hurt performance?

Suppose a function always gets some parameter s that it does not use:
def someFunc(s):
    # do something _not_ using s, for example:
    a = 1
now consider this call
someFunc("the unused string")
which gives a string as a parameter that is not built at runtime but compiled straight into the binary (I hope that's right).
The question is: when calling someFunc this way, say, several thousand times, the reference to "the unused string" is always passed, but does that slow the program down?
In my naive thinking I'd say the reference to "the unused string" is 'constant' and available in O(1) when a call to someFunc occurs. So I'd say 'no, that does not hurt performance'.
Same question as before: am I right?
Thanks :-)
The string is passed (by reference) each time, but the overhead is way too tiny to really affect performance unless it's in a super-tight loop.
This is an implementation detail of CPython and may not apply to other Python implementations, but yes: in many cases in a compiled module, a constant string will reference the same object, minimizing the overhead.
In general, even if it didn't, you really shouldn't worry about it, as it's probably imperceptibly tiny compared to other things going on.
However, here's a little interesting piece of code:
>>> def somefunc(x):
...     print id(x)  # prints the memory address of the object pointed to by x
...
>>>
>>> def test():
...     somefunc("hello")
...
>>> test()
134900896
>>> test()
134900896  # Hooray, as expected, it's the same object id
>>> somefunc("h" + "ello")
134900896  # Whoa, how'd that work?
What's happening here is that CPython keeps a global table of interned strings, and the literal concatenation "h" + "ello" is folded to "hello" at compile time, so you get back the very same interned object.
Note that this is an implementation detail, and you should NOT rely on it, as strings coming from files, sockets, databases, string slicing, regexes, or really any C module are not guaranteed to have this property. But it is interesting nonetheless.
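A small follow-up sketch of my own (Python 3 syntax): the same-object behaviour comes from compile-time constant folding plus interning; strings built at runtime are usually distinct objects unless you intern them explicitly.
import sys

a = "hello"
b = "hell" + "o"             # folded to "hello" at compile time and interned: same object
c = "".join(["hell", "o"])   # built at runtime: typically a different object

print(a is b)                # True on CPython
print(a is c)                # usually False
print(a is sys.intern(c))    # True: interning returns the canonical "hello" object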
