The Queue class from the Queue module has a few methods, namely qsize, empty and full, whose documentation claims they are "not reliable".
What exactly is not reliable about them?
I did notice that on the Python docs site, the following is said about qsize:
Note, qsize() > 0 doesn't guarantee that a subsequent get() will not block, nor will qsize() < maxsize guarantee that put() will not block.
I personally don't consider that behavior "unreliable". But is this what is meant by "unreliable," or is there some more sinister defect in these methods?
Yes, the docs use "unreliable" here to convey exactly this meaning: in a sense, qsize doesn't tell you how many entries there are "right now" -- a concept that is not necessarily very meaningful in a multithreaded world (except at specific points where synchronization precautions are being taken) -- it tells you how many entries it had "a while ago". By the time you act upon that information, even in the very next opcode, the queue might have more entries, or fewer, or none at all, depending on what other threads have been up to in the meantime (if anything ;-).
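To make that concrete, here is a small sketch (the module is named Queue in Python 2, queue in Python 3): both threads may see a non-empty queue, so the safe pattern is to attempt the get and handle queue.Empty rather than trust the earlier check:

import queue
import threading

q = queue.Queue()
q.put("job")

def consumer():
    if not q.empty():               # snapshot: may be stale a moment later
        try:
            item = q.get_nowait()   # so never rely on the check alone
            print("got", item)
        except queue.Empty:
            print("another thread took it first")

threads = [threading.Thread(target=consumer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()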
I don't know which Queue module you're referring to; can you provide a link?
One possible source of unreliability: Generally, a queue is read by one thread and written by another. If you are the only thread accessing a queue, then reliable implementations of qsize(), empty() and full() are possible. But once other threads get involved, the return value of these methods might be out-of-date by the time you test it.
This is one case of unreliability in line with what Alex Martelli suggested:
JoinableQueue.empty() unreliable? What's the alternative?
I am working on a class which operates in a multithreaded environment, and looks something like this (with excess noise removed):
import multiprocessing

class B:
    @classmethod
    def apply(cls, item):
        cls.do_thing(item)

    @classmethod
    def do_thing(cls, item):
        'do something to item'

    def run(self):
        pool = multiprocessing.Pool()
        for list_of_items in self.data_groups:
            pool.map(self.apply, list_of_items)
My concern is that two threads might call apply or do_thing at the same time, or that a subclass might try to do something stupid with cls in one of these functions. I could use staticmethod instead of classmethod, but calling do_thing would become a lot more complicated, especially if a subclass reimplements one of these but not the other. So my question is this: Is the above class thread-safe, or is there a potential problem with using classmethods like that?
Whether a method is thread safe or not depends on what the method does.
Working only with local variables is thread safe. But when you change the same non-local variable from different threads, it becomes unsafe.
'do something to item' seems to modify only the given object, which is independent of any other object in the list, so it should be thread safe.
If the same object appears in the list several times, you may have to think about making the object itself thread safe. That can be done by using with self.object_scope_lock: in every method which modifies the object.
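A minimal sketch of that per-object locking, with a made-up Item class and the object_scope_lock name from above:

import threading

class Item:
    def __init__(self, value):
        self.value = value
        self.object_scope_lock = threading.Lock()

    def update(self, delta):
        with self.object_scope_lock:   # serialize mutations of this object
            self.value += delta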
Anyway, what you are doing here is using processes instead of threads. In this case the objects are pickled and sent through a pipe to the other process, where they are modified and sent back. In contrast to threads, processes do not share memory. So I don't think using a lock in the classmethod would have an effect.
http://docs.python.org/3/library/threading.html?highlight=threading#module-threading
There's no difference between classmethods and regular functions (and instance methods) in this regard. Neither is automagically thread-safe.
If one or more classmethods/methods/functions can manipulate data structures simultaneously from different threads, you'd need to add synchronization protection, typically using threading.Locks.
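For example, if do_thing did update state shared between calls, a single class-level lock would be the typical protection; a hedged sketch (the results list is made up, it is not in the question):

import threading

class B:
    _lock = threading.Lock()
    results = []                     # hypothetical shared mutable state

    @classmethod
    def do_thing(cls, item):
        processed = item * 2         # stand-in for the real work
        with cls._lock:              # serialize access to the shared list
            cls.results.append(processed)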
Both other answers are technically correct in that the safety of do_thing() depends on what happens inside the function.
But the more precise answer is that the call itself is safe. In other words, if apply() and do_thing() are pure functions, then your code is safe. Any unsafety would be due to them not being pure functions (e.g. relying on or modifying a shared variable during execution).
As shx2 mentioned, classmethods are only "in" a class visually, for grouping. They have no inherent attachment to any instance of the class. Therefore this code is roughly functionally equivalent:
import multiprocessing

def apply(item):
    do_thing(item)

def do_thing(item):
    'do something to item'

class B:
    def run(self):
        pool = multiprocessing.Pool()
        for list_of_items in self.data_groups:
            pool.map(apply, list_of_items)
A further note on concurrency given the other answers:
threading.Lock is easy to understand, but should be your last resort. In naive implementations it is often slower than completely linear processing. Your code will usually be faster if you can use things like threading.Event, queue.Queue, or multiprocessing.Pipe to transfer information instead.
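As a rough illustration, here is a tiny two-queue pipeline that hands work and results between threads with queue.Queue, so no explicit Lock is needed (all names are made up for the example):

import queue
import threading

tasks, results = queue.Queue(), queue.Queue()

def worker():
    while True:
        n = tasks.get()
        if n is None:              # None used as a stop signal here
            break
        results.put(n * n)

t = threading.Thread(target=worker)
t.start()
for n in range(5):
    tasks.put(n)                   # no explicit locking anywhere
tasks.put(None)
t.join()
print(sorted(results.get() for _ in range(5)))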
asyncio is the new hotness in python3. It's a bit more difficult to get right but is generally the fastest method.
If you want a great walkthrough of modern concurrency techniques in Python, check out core developer Raymond Hettinger's Keynote on Concurrency. The whole thing is great, but the downside of locks is highlighted starting at t=57:59.
I know that the builtin set type in python is not generally threadsafe, but this answer claims that it is safe to call pop() from two competing threads. Sure, you might get an exception, but your data isn't corrupted. I can't seem to find a doc that validates this claim. Is it true? Documentation, please!
If you look at the set.pop method in the CPython source you'll see that it doesn't release the GIL.
That means that only one set.pop will ever be happening at a time within a CPython process.
Since set.pop checks if the set is empty, you can't cause anything but a KeyError by trying to pop from an empty set.
So no, you can't corrupt the data by popping from a set in multiple threads with CPython.
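A quick way to convince yourself under CPython (this leans on the GIL, not on any documented guarantee):

import threading

s = set(range(1000))
popped = []

def drain():
    while True:
        try:
            popped.append(s.pop())   # atomic under the GIL in CPython
        except KeyError:             # set is empty, nothing was corrupted
            return

threads = [threading.Thread(target=drain) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(popped) == 1000 and len(set(popped)) == 1000   # each value popped exactly once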
I believe the Set "pop" operation to be thread-safe due to being atomic, in the sense that two threads won't be able to pop the same value.
I wouldn't rely on its behavior if another thread was, for instance, iterating over that collection.
I couldn't find any concrete documentation either, just some topics that point in this direction. The official Python documentation would indeed benefit from information of this kind.
After having read various articles that explain the GIL and threads in Python, and Are locks unnecessary in multi-threaded Python code because of the GIL?, which is a very useful answer, I have one "last question".
If, ideally, my threads only operate on shared data through atomic (Python VM) instructions, e.g. appending an item to a list, a lock is not needed, right?
That really depends on your application. You may need locks for your specific use case, just like in any other language, but you don't need them to keep the Python objects themselves from getting corrupted. In that sense you don't need locks.
Here's an example that uses a bunch of pretty much atomic operations, but can still behave in unexpected ways when you combine them.
Thread 1:
v = l[-1]
DoWork(v)
del l[-1]
Thread 2:
l.append(3)
If Thread 2 runs in between the first and last statement of Thread 1, then Thread 1 just deleted the wrong work item. None of the python objects are corrupt or anything, but you still got an unexpected result, and potentially exceptions could be thrown.
If you have shared data structures, you usually need to protect them with locks, or even better use already written protected versions, like in this case maybe a Queue: http://docs.python.org/library/queue.html
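For the example above, a queue-based version might look like this sketch (the print stands in for DoWork):

import queue
import threading

work = queue.Queue()
for n in (1, 2, 3):
    work.put(n)

def worker():
    while True:
        try:
            v = work.get_nowait()   # take and remove the item in one step
        except queue.Empty:
            return
        print("DoWork", v)          # stand-in for the real DoWork(v)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()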
In theory, no, but it depends on the logic; you need a lock when you have to preserve ordering, for example.
When you share data between threads, you should always make sure your data is properly synchronized, because you cannot rely on whether operations will be atomic in the future or not.
It is easier to get the multi-threading design right in the first place than try to fix something that breaks because of a change in implementation or a wrong assumption.
Thanks everyone for the answers!
It's clear that thread-sync requirements are bound to the application logic, but I can rely on the GIL not to corrupt the built-in objects' internals (if the operations are atomic). It wasn't clear to me, when the GIL is said to protect the "internal state" of the interpreter and its internal data structures... I mean, that is one effect, but the GIL protects every allocated built-in structure, both objects created and used by internal operations of the interpreter and objects created by the application. That was my doubt.
PS: I'm sorry for answering so late, but I didn't receive email notifications...
I need to convert a threading application to a multiprocessing application for multiple reasons (GIL, memory leaks). Fortunately the threads are quite isolated and only communicate via Queue.Queues. This primitive is also available in multiprocessing so everything looks fine. Now before I enter this minefield I'd like to get some advice on the upcoming problems:
How to ensure that my objects can be transfered via the Queue? Do I need to provide some __setstate__?
Can I rely on put returning instantly (like with threading Queues)?
General hints/tips?
Anything worthwhile to read apart from the Python documentation?
Answer to part 1:
Everything that has to pass through a multiprocessing.Queue (or Pipe or whatever) has to be picklable. This includes basic types such as tuples, lists and dicts. Classes are also supported if they are defined at the top level of a module and are not too complicated (check the details). Trying to pass lambdas around will fail, however.
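A quick way to check ahead of time is to run pickle.dumps on whatever you intend to send; a small sketch (the Job class is made up for the example):

import pickle

class Job:                          # a top-level class: picklable
    def __init__(self, payload):
        self.payload = payload

print(len(pickle.dumps((1, "two", [3], {"four": 4}))))   # basic types: fine
print(len(pickle.dumps(Job("hello"))))                   # top-level class: fine

try:
    pickle.dumps(lambda x: x)       # lambdas cannot be pickled
except Exception as exc:
    print("lambda failed:", exc)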
Answer to part 2:
A put consists of two parts: It takes a semaphore to modify the queue and it optionally starts a feeder thread. So if no other Process tries to put to the same Queue at the same time (for instance because there is only one Process writing to it), it should be fast. For me it turned out to be fast enough for all practical purposes.
Partial answer to part 3:
The plain multiprocessing.Queue lacks a task_done method, so it cannot be used as a drop-in replacement directly. (The JoinableQueue subclass provides the method.)
The old processing package's Queue lacks a qsize method, and the qsize of the newer multiprocessing version is inaccurate (just keep this in mind).
Since file descriptors are normally inherited on fork, care needs to be taken about closing them in the right processes.
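A minimal sketch of that JoinableQueue subclass in use, with task_done and join (the worker and the None sentinel are just illustrative):

import multiprocessing

def worker(q):
    while True:
        item = q.get()
        if item is None:            # sentinel to shut the worker down
            q.task_done()
            break
        print("did", item)
        q.task_done()               # mark this item as finished

if __name__ == "__main__":
    q = multiprocessing.JoinableQueue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    for i in range(3):
        q.put(i)
    q.put(None)
    q.join()                        # blocks until every put() was task_done()'d
    p.join()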
We're considering re-factoring a large application with a complex GUI which is isolated in a decoupled fashion from the back-end, to use the new (Python 2.6) multiprocessing module. The GUI/backend interface uses Queues with Message objects exchanged in both directions.
One thing I've just concluded (tentatively, but feel free to confirm it) is that "object identity" would not be preserved across the multiprocessing interface. Currently when our GUI publishes a Message to the back-end, it expects to get the same Message back with a result attached as an attribute. It uses object identity (if received_msg is message_i_sent:) to identify returning messages in some cases... and that seems likely not to work with multiprocessing.
This question is to ask what "gotchas" like this you have seen in actual use or can imagine one would encounter in naively using the multiprocessing module, especially in refactoring an existing single-process application. Please specify whether your answer is based on actual experience. Bonus points for providing a usable workaround for the problem.
Edit: Although my intent with this question was to gather descriptions of problems in general, I think I made two mistakes: I made it community wiki from the start (which probably makes many people ignore it, as they won't get reputation points), and I included a too-specific example which -- while I appreciate the answers -- probably made many people miss the request for general responses. I'll probably re-word and re-ask this in a new question. For now I'm accepting one answer as best merely to close the question as far as it pertains to the specific example I included. Thanks to those who did answer!
I have not used multiprocessing itself, but the problems presented are similar to experience I've had in two other domains: distributed systems, and object databases. Python object identity can be a blessing and a curse!
As for general gotchas, it helps if the application you are refactoring can acknowledge that tasks are being handled asynchronously. If not, you will generally end up managing locks, and much of the performance you could have gained by using separate processes will be lost to waiting on those locks. I will also suggest that you spend the time to build some scaffolding for debugging across processes. Truly asynchronous processes tend to be doing much more than the mind can hold and verify -- or at least my mind!
For the specific case outlined, I would manage object identity at the process border when items queued and returned. When sending a task to be processed, annotate the task with an id(), and stash the task instance in a dictionary using the id() as the key. When the task is updated/completed, retrieve the exact task back by id() from the dictionary, and apply the newly updated state to it. Now the exact task, and therefore its identity, will be maintained.
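A hedged sketch of that bookkeeping, using a made-up Task class and a Pool standing in for the worker process:

import multiprocessing

class Task:
    def __init__(self, payload):
        self.payload = payload
        self.result = None

def process(job):
    task_key, payload = job
    return task_key, payload.upper()    # stand-in for the real work

if __name__ == "__main__":
    pending = {}                        # id(task) -> the original task object
    jobs = []
    for task in (Task("first"), Task("second")):
        pending[id(task)] = task        # keep a reference so id() stays valid
        jobs.append((id(task), task.payload))

    pool = multiprocessing.Pool()
    for task_key, result in pool.map(process, jobs):
        original = pending.pop(task_key)   # the exact object we queued
        original.result = result          # identity preserved on our side
    pool.close()
    pool.join()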
Well, of course testing for identity on anything that isn't a singleton (as in "a is None" or "a is False") isn't usually a good practice - it might be quick, but a really quick workaround would be to exchange the "is" test for an "==" test and use an incremental counter to define identity:
# this is not threadsafe.
class Message(object):
    def _next_id():
        i = 0
        while True:
            i += 1
            yield i
    _idgen = _next_id()
    del _next_id

    def __init__(self):
        self.id = self._idgen.next()

    def __eq__(self, other):
        return (self.__class__ == other.__class__) and (self.id == other.id)
This might be an idea.
Also, be aware that if you have tons of "worker processes", memory consumption might be far greater than with a thread-based approach.
You can try the persistent package from my project GarlicSim. It's LGPL'ed.
http://github.com/cool-RR/GarlicSim/tree/development/garlicsim/garlicsim/misc/persistent/
(The main module in it is persistent.py)
I often use it like this:
# ...
self.identity = Persistent()
Then I have an identity that is preserved across processes.