In Python, is set.pop() threadsafe? - python

I know that the builtin set type in python is not generally threadsafe, but this answer claims that it is safe to call pop() from two competing threads. Sure, you might get an exception, but your data isn't corrupted. I can't seem to find a doc that validates this claim. Is it true? Documentation, please!

If you look at the set.pop method in the CPython source you'll see that it doesn't release the GIL.
That means that only one set.pop will ever be happening at a time within a CPython process.
Since set.pop checks whether the set is empty, the worst you can cause by popping from an empty set is a KeyError.
So no, you can't corrupt the data by popping from a set in multiple threads with CPython.
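A minimal sketch of the claim above: two threads draining one shared set, relying on CPython's GIL making set.pop() atomic (an implementation detail, not a documented language guarantee):

```python
# Sketch: two threads draining a shared set with pop().
# Relies on CPython's GIL making set.pop() atomic -- an
# implementation detail, not a documented guarantee.
import threading

s = set(range(10000))
popped = [[], []]  # per-thread results; list.append is also atomic under the GIL

def drain(bucket):
    while True:
        try:
            bucket.append(s.pop())
        except KeyError:  # set is empty
            break

threads = [threading.Thread(target=drain, args=(popped[i],)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every element was popped exactly once -- no duplicates, no corruption.
assert sorted(popped[0] + popped[1]) == list(range(10000))
```

The final assertion holds on CPython: the two result lists partition the original set between them.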

I believe the set pop operation to be thread-safe due to being atomic, in the sense that two threads won't be able to pop the same value.
I wouldn't rely on its behavior if another thread was, for instance, iterating over that collection.
I couldn't find any concrete documentation either, just some topics that point in this direction. The official Python documentation would indeed benefit from information of this kind.

Related

Python multiprocessing.Pool(): am I limited in what I can return?

I am using Python's multiprocessing pool. I have been told, although I have not experienced this myself so I cannot post the code, that one cannot just "return" anything from within the multiprocessing.Pool()-worker back to the multiprocessing.Pool()'s main process. Words like "pickling" and "lock" were being thrown around but I am not sure.
Is this correct, and if so, what are these limitations?
In my case, I have a function which generates a mutable class object and then returns it after it has done some work with it. I'd like to have 8 processes run this function, generate their own classes, and return each of them after they're done. Full code is NOT written yet, so I cannot post it.
Any issues I may run into?
My code is: res = pool.map(foo, list_of_parameters)
Q: "Is this correct, and if so, what are these limitations?"
It depends. It is correct, but the serialization/deserialization (SER/DES) step is the problem here, as a pair of disjoint processes has to "send" something (there: a task specification with parameters; back: the long-awaited result).
Early versions of the standard-library module responsible for this, the pickle module, were not able to serialize some more complex types of objects, class instances being one such example.
Newer and newer versions keep evolving, sure, yet this SER/DES step remains one of the single points of failure (SPoFs) that may prevent smooth code execution in such cases.
Next are the cases that finish by throwing a MemoryError: they request so much memory that the O/S simply rejects any new allocation, and the whole attempt to produce and send pickle.dumps( ... ) crashes unrecoverably.
Do we have any remedies available?
Well, maybe yes, maybe no - Mike McKerns' dill may help in some cases to better handle complex objects during SER/DES processing.
You may try import dill as pickle; pickle.dumps(...) and test whether your candidate class instances get a chance to pass through. If not, this low-hanging-fruit trick is not an option.
Next, a less easy way would be to avoid your dependence on hardwired multiprocessing.Pool() instantiations and their (above-)limited SER/comms/DES methods, and design your processing strategy as a distributed-computing system, based on a communicating-agents paradigm.
That way you benefit from a right-sized, just-enough communication interchange between agents that know (because you designed them to know) what to tell each other, without sending mastodon-sized BLOBs that accidentally crash the processing at one of the SPoFs you can neither prevent nor salvage ex post.
There seem to be no better ways forward that I know of or can foresee in 2020-Q4 for doing this safely and smartly.
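A quick way to act on this answer is to probe, before committing to a Pool design, which of your objects survive pickling at all. A minimal sketch (the Result class and helper are illustrative, not from the question's code):

```python
# Sketch: probing which objects survive pickling before handing
# them to multiprocessing.Pool(). Plain module-level class
# instances usually pass; lambdas and locally-defined classes do not.
import pickle

class Result(object):
    """A module-level class -- picklable by the stock pickle module."""
    def __init__(self, value):
        self.value = value

def survives_pickle(obj):
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

print(survives_pickle(Result(42)))       # True: module-level class instance
print(survives_pickle(lambda x: x + 1))  # False: lambdas cannot be pickled

# If a needed object fails, the answer's suggestion is to try
# `import dill as pickle` (a third-party package) as a drop-in.
```

So returning mutable class instances from pool.map is fine as long as the class is defined at module level and its attributes are themselves picklable.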

Mutable Default Arguments - (Why) is my code dangerous?

My code triggers a warning in pylint:
def getInsertDefault(collection=['key', 'value'], usefile='defaultMode.xml'):
    return doInsert(collection, usefile, True)
The warning is pretty clear: it's Mutable Default Arguments, and I get the point that in several instances it can give a wrong impression of what's happening. There are several posts on SO already, but it doesn't feel like this case here is covered.
Most questions and examples deal with empty lists that end up being mutated and cause errors.
I'm also aware it's better practice to change the code to getInsertDefault(collection=None, ...), but in this method for default initialization I don't intend to do anything with the list except read it. (Why) is my code dangerous, or could it result in a pitfall?
--EDIT--
To the point: Why is the empty dictionary a dangerous default value in Python? would be answering the question.
Kind of: I am aware my code is against the convention and could result in a pitfall - but in this very specific case: Am I safe?
I found the suggestion in the comments useful to use collection=('key', 'value') instead as it's conventional and safe. Still, out of pure interest: Is my previous attempt able to create some kind of major problem?
Assuming that doInsert() (and whatever code doInsert is calling) only ever reads collection, there is indeed no immediate issue - just a time bomb.
As soon as any part of the code seeing this list starts mutating it, your code will break in the most unexpected way, and you may have a hard time debugging the issue (imagine if what changes it is a third-party library function tens of stack frames away... and that's the best case, where the issue and its root cause are still in the same direct branch of the call stack - the list might be stored as an instance attribute somewhere and mutated by some unrelated call, and then you're in for some fun).
Now the odds that this ever happens are rather low, but given the potential for subtle bugs (it might just produce incorrect results once in a while, not necessarily crash the program) that are hard to track down, you should think twice before assuming that this is really "safe".
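The time bomb is easy to demonstrate with a simplified stand-in for the question's getInsertDefault (the names and the doInsert-free body are illustrative):

```python
# Sketch of the time bomb: the default list is created once, at
# function definition time, and shared across every call.
def get_insert_default(collection=['key', 'value']):
    return collection

first = get_insert_default()
first.append('oops')          # some distant caller mutates the "default"

second = get_insert_default()
print(second)                 # ['key', 'value', 'oops'] -- not what it looks like

# The conventional fix: a None sentinel (an immutable tuple default also works).
def get_insert_default_safe(collection=None):
    if collection is None:
        collection = ['key', 'value']
    return collection
```

The second call silently sees the first caller's mutation, which is exactly the "breaks tens of stack frames away" scenario described above.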

Python GIL and threads synchronization

After having read various articles that explain the GIL and threads in Python, and Are locks unnecessary in multi-threaded Python code because of the GIL?, which is a very useful answer, I have one "last question".
If, ideally, my threads only operate on shared data through atomic (Python VM) instructions, e.g. appending an item to a list, a lock is not needed, right?
That really depends on your application. You may need locks for your specific use case just like in any other language, but you don't need to protect the python objects getting corrupted anyway. In that sense you don't need locks.
Here's an example that uses a bunch of pretty much atomic operations, but can still behave in unexpected ways when you combine them.
Thread 1:
v = l[-1]
DoWork(v)
del l[-1]
Thread 2:
l.append(3)
If Thread 2 runs in between the first and last statement of Thread 1, then Thread 1 just deleted the wrong work item. None of the python objects are corrupt or anything, but you still got an unexpected result, and potentially exceptions could be thrown.
If you have shared data structures, you usually need to protect them with locks, or even better use already written protected versions, like in this case maybe a Queue: http://docs.python.org/library/queue.html
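A minimal sketch of the Queue-based alternative: get() hands each work item to exactly one thread atomically, so the lost-update scenario above cannot happen (the doubling "work" and the sentinel protocol are illustrative):

```python
# Sketch: the shared-list pattern above, rebuilt on Queue so that
# get() hands each work item to exactly one thread atomically.
import threading
try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2, as in the original question

q = queue.Queue()
results = []

def worker():
    while True:
        v = q.get()
        if v is None:            # sentinel: no more work
            break
        results.append(v * 2)    # stand-in for DoWork(v)
        q.task_done()

t = threading.Thread(target=worker)
t.start()
for item in [1, 2, 3]:
    q.put(item)
q.put(None)
t.join()
print(sorted(results))           # [2, 4, 6]
```

Another thread appending to q while the worker runs can no longer cause the worker to delete the wrong item, because the take happens in one atomic get().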
In theory no, but it depends on the logic: you need a lock when you are preserving order, for example.
When you share data between threads, you should always make sure your data is properly synchronized, because you cannot rely on whether operations will or will not be atomic in the future.
It is easier to get the multi-threading design right in the first place than try to fix something that breaks because of a change in implementation or a wrong assumption.
Thanks everyone for the answers!
It's clear that thread-sync requirements are bound to the application logic, but I can rely on the GIL not to corrupt the built-in objects' internals (if the operations are atomic). It wasn't clear to me when the GIL was said to protect the "internal state" of the interpreter, its internal data structures... I mean, that is one effect, but the GIL protects every allocated built-in structure, both for objects created and used by internal operations of the interpreter and for objects created by the application. That was my doubt.
PS: I'm sorry for having answered so late, but I didn't receive email notifications...

What problems will one see in using Python multiprocessing naively?

We're considering re-factoring a large application with a complex GUI which is isolated in a decoupled fashion from the back-end, to use the new (Python 2.6) multiprocessing module. The GUI/backend interface uses Queues with Message objects exchanged in both directions.
One thing I've just concluded (tentatively, but feel free to confirm it) is that "object identity" would not be preserved across the multiprocessing interface. Currently when our GUI publishes a Message to the back-end, it expects to get the same Message back with a result attached as an attribute. It uses object identity (if received_msg is message_i_sent:) to identify returning messages in some cases... and that seems likely not to work with multiprocessing.
This question is to ask what "gotchas" like this you have seen in actual use or can imagine one would encounter in naively using the multiprocessing module, especially in refactoring an existing single-process application. Please specify whether your answer is based on actual experience. Bonus points for providing a usable workaround for the problem.
Edit: Although my intent with this question was to gather descriptions of problems in general, I think I made two mistakes: I made it community wiki from the start (which probably makes many people ignore it, as they won't get reputation points), and I included a too-specific example which -- while I appreciate the answers -- probably made many people miss the request for general responses. I'll probably re-word and re-ask this in a new question. For now I'm accepting one answer as best merely to close the question as far as it pertains to the specific example I included. Thanks to those who did answer!
I have not used multiprocessing itself, but the problems presented are similar to experience I've had in two other domains: distributed systems, and object databases. Python object identity can be a blessing and a curse!
As for general gotchas, it helps if the application you are refactoring can acknowledge that tasks are being handled asynchronously. If not, you will generally end up managing locks, and much of the performance you could have gained by using separate processes will be lost to waiting on those locks. I will also suggest that you spend the time to build some scaffolding for debugging across processes. Truly asynchronous processes tend to be doing much more than the mind can hold and verify -- or at least my mind!
For the specific case outlined, I would manage object identity at the process border when items queued and returned. When sending a task to be processed, annotate the task with an id(), and stash the task instance in a dictionary using the id() as the key. When the task is updated/completed, retrieve the exact task back by id() from the dictionary, and apply the newly updated state to it. Now the exact task, and therefore its identity, will be maintained.
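The registry described above can be sketched like this (the Message class, send/receive helpers, and wire format are illustrative, not from the question's code):

```python
# Sketch of the id()-based registry: identity is dropped at the
# process border, so keep the original instance on the GUI side
# and re-associate the result by key when the reply comes back.
class Message(object):
    def __init__(self, payload):
        self.payload = payload
        self.result = None

pending = {}

def send(msg):
    """Annotate with a key and stash the original before queuing."""
    msg.key = id(msg)
    pending[msg.key] = msg
    return {'key': msg.key, 'payload': msg.payload}  # what crosses the border

def receive(reply):
    """Retrieve the exact original instance and apply the new state."""
    original = pending.pop(reply['key'])
    original.result = reply['result']
    return original

m = Message('work')
wire = send(m)
# ... worker process computes, then replies carrying the same key ...
back = receive({'key': wire['key'], 'result': 'done'})
assert back is m   # identity preserved on the GUI side
```

The pending dictionary also keeps the original alive, so the id() key cannot be reused while the task is in flight.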
Well, of course testing for identity on a non-singleton object (e.g. "a is None" or "a is False") isn't usually good practice - it might be quick, but a really quick workaround would be to exchange the "is" for an "==" test and use an incremental counter to define identity:
# this is not threadsafe.
class Message(object):
    def _next_id():
        i = 0
        while True:
            i += 1
            yield i
    _idgen = _next_id()
    del _next_id

    def __init__(self):
        self.id = self._idgen.next()

    def __eq__(self, other):
        return (self.__class__ == other.__class__) and (self.id == other.id)
This might be an idea.
Also, be aware that if you have tons of "worker processes", memory consumption might be far greater than with a thread-based approach.
You can try the persistent package from my project GarlicSim. It's LGPL'ed.
http://github.com/cool-RR/GarlicSim/tree/development/garlicsim/garlicsim/misc/persistent/
(The main module in it is persistent.py)
I often use it like this:
# ...
self.identity = Persistent()
Then I have an identity that is preserved across processes.

Python: Why are some of Queue.Queue's methods "unreliable"?

In the queue class from the Queue module, there are a few methods, namely, qsize, empty and full, whose documentation claims they are "not reliable".
What exactly is not reliable about them?
I did notice that on the Python docs site, the following is said about qsize:
Note, qsize() > 0 doesn't guarantee that a subsequent get() will not block, nor will qsize() < maxsize guarantee that put() will not block.
I personally don't consider that behavior "unreliable". But is this what is meant by "unreliable," or is there some more sinister defect in these methods?
Yes, the docs use "unreliable" here to convey exactly this meaning: for example, in a sense, qsize doesn't tell you how many entries there are "right now", a concept that is not necessarily very meaningful in a multithreaded world (except at specific points where synchronization precautions are being taken) -- it tells you how many entries it had "a while ago"... when you act upon that information, even in the very next opcode, the queue might have more entries, or fewer, or none at all maybe, depending on what other threads have been up to in the meantime (if anything;-).
I don't know which Queue module you're referring to - can you please provide a link?
One possible source of unreliability: Generally, a queue is read by one thread and written by another. If you are the only thread accessing a queue, then reliable implementations of qsize(), empty() and full() are possible. But once other threads get involved, the return value of these methods might be out-of-date by the time you test it.
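That check-then-act gap can be forced deterministically (the "thief" thread here simulates losing the race to another consumer; the names are illustrative):

```python
# Sketch of why empty()/qsize() are only advisory: the check and
# the get() are two separate steps, and another thread can run in
# between them. The thief thread simulates losing that race.
import queue
import threading

q = queue.Queue()
q.put('item')

stolen = []
def thief():
    stolen.append(q.get())

t = threading.Thread(target=thief)
if not q.empty():      # True at this instant...
    t.start()
    t.join()           # another consumer runs before our get()
    try:
        item = q.get_nowait()
    except queue.Empty:  # ...but the item is already gone
        item = None

# A bare q.get() here would have blocked forever; use get_nowait()
# or a timeout and handle queue.Empty instead of trusting empty().
```

This is why the docs call these methods unreliable: their return value describes the past, not the moment you act on it.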
This is one case of unreliability in line with what Alex Martelli suggested:
JoinableQueue.empty() unreliable? What's the alternative?
