After having read various articles that explain the GIL and threads in Python, and the very useful answer to "Are locks unnecessary in multi-threaded Python code because of the GIL?", I have one "last question".
If, ideally, my threads only operate on shared data through atomic (Python VM) instructions, e.g. appending an item to a list, a lock is not needed, right?
That really depends on your application. You may need locks for your specific use case, just like in any other language, but you don't need them to keep the Python objects themselves from getting corrupted. In that sense you don't need locks.
Here's an example that uses a bunch of pretty much atomic operations, but can still behave in unexpected ways when you combine them.
Thread 1:
v = l[-1]
DoWork(v)
del l[-1]
Thread 2:
l.append(3)
If Thread 2 runs in between the first and last statements of Thread 1, then Thread 1 just deleted the wrong work item. None of the Python objects are corrupted, but you still got an unexpected result, and exceptions could potentially be thrown.
If you have shared data structures, you usually need to protect them with locks, or, better, use already-written protected versions, like in this case maybe a Queue: http://docs.python.org/library/queue.html
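For instance, here is a minimal sketch of the Queue-based approach; the DoWork function is just a stand-in for whatever your worker really does, and the module is named Queue on Python 2:

import queue
import threading

work = queue.Queue()

def DoWork(v):
    print("processing", v)      # stand-in for the real work

def worker():
    while True:
        v = work.get()          # blocks until an item is available
        if v is None:           # sentinel meaning "no more work"
            break
        DoWork(v)
        work.task_done()

t = threading.Thread(target=worker)
t.start()

work.put(3)                     # what Thread 2 did with l.append(3)
work.put(None)                  # tell the worker to stop
t.join()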
In theory no, but it depends on the logic: you need a lock when you must preserve ordering, for example.
When you share data between threads, you should always make sure your data is properly synchronized, because you cannot rely on whether operations will or will not be atomic in the future.
It is easier to get the multi-threading design right in the first place than to fix something that breaks because of a change in implementation or a wrong assumption.
Thanks everyone for the answers!
It's clear that thread synchronization requirements are bound to the application logic, but I can rely on the GIL not to corrupt the built-in objects' internals (if the operations are atomic). It wasn't clear to me when the GIL is said to protect the "internal state" of the interpreter, its internal data structures... I mean, that is an effect, but the GIL protects every allocated built-in structure, both objects created and used by internal operations of the interpreter and objects created by the application. That was my doubt.
PS: I'm sorry for having answered so late, but I didn't receive email notifications...
I am using Python's multiprocessing pool. I have been told, although I have not experienced this myself so I cannot post the code, that one cannot just "return" anything from within a multiprocessing.Pool() worker back to the multiprocessing.Pool()'s main process. Words like "pickling" and "lock" were being thrown around, but I am not sure.
Is this correct, and if so, what are these limitations?
In my case, I have a function which generates a mutable class object and then returns it after it has done some work with it. I'd like to have 8 processes run this function, generate their own classes, and return each of them after they're done. Full code is NOT written yet, so I cannot post it.
Any issues I may run into?
My code is: res = pool.map(foo, list_of_parameters)
Q : "Is this correct, and if so, what are these limitations?"
It depends. It is correct, but the SER/DES processing is the problem here, as a pair of disjoint processes tries to "send" something (there: a task specification with parameters, and back: ... yes, the long-awaited result).
Initial versions of the piece of the Python standard library responsible for doing this, the pickle module, were not able to SER-ialise some more complex types of objects, class instances being one such example.
Newer and newer versions keep evolving, sure, yet this SER/DES step remains one of the SPoFs that may prevent smooth code execution in some such cases.
Next are the cases that finish by throwing a MemoryError, because they request so much memory that the O/S simply rejects any further allocation, and the whole attempt to produce and send pickle.dumps( ... ) crashes unresolvably.
Do we have any remedies available?
Well, maybe yes, maybe no - Mike McKerns' dill may help in some cases to better handle complex objects in SER/DES processing.
You may try to use import dill as pickle; pickle.dumps(...) and test your hot candidates for Class() instances to be SER/DES-ed, to see if they get a chance to pass through. If not, there is no way to use this low-hanging-fruit-first trick.
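A quick picklability check, sketched under the assumption that dill is installed (pip install dill) and that Foo stands in for your own class:

import dill as pickle               # drop-in replacement for the stdlib pickle

class Foo:
    def __init__(self):
        self.payload = [1, 2, 3]
        self.fn = lambda x: x * 2   # lambdas break plain pickle; dill handles them

blob = pickle.dumps(Foo())          # raises here if the instance cannot be serialised
restored = pickle.loads(blob)
print(restored.fn(21))              # -> 42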
Next, a less easy way would be to avoid your dependence on hardwired multiprocessing.Pool() instantiations and their (above-)limited SER/comms/DES methods, and to design your processing strategy as a distributed-computing system, based on a communicating-agents paradigm.
That way you benefit from a right-sized, just-enough designed communication interchange between intelligent-enough agents that know (because you've designed them to know it) what to tell one another, without sending any mastodon-sized BLOB(s) that accidentally crash the processing in any of the SPoF(s) you can neither prevent nor salvage ex-post.
There seem to be no better ways forward that I know of or can foresee in 2020-Q4 for doing this safely and smartly.
I'm working on a personal project that has been refactored a number of times. It started off using multithreading, then parts of it used asyncio, and now it is back to being mainly single threaded.
As a result of all these changes I have a number of threading.Lock()s in the code that I would like to remove and clean up to prevent future issues.
How can I easily work out which locks are in use and hit by more than one thread during the runtime of the application?
If I were in the situation of finding that out, I would try to replace the lock with a wrapper that does the counting (or prints something, raises an exception, etc.) for me when the undesired behaviour happens. Python is hacky, so I can simply create a wrapper and overwrite the original threading.Lock to get the job done. That might need some careful implementation, e.g., catching all possible pathways to lock and unlock.
However, be careful that even so, you might not exercise every possible code path, and thus never know whether you really removed all the "bugs".
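A minimal sketch of such a wrapper (the class and attribute names are made up; Python 3 assumed):

import threading

_RealLock = threading.Lock          # keep a handle on the real factory

class CountingLock:
    """Drop-in lock that records every thread that ever acquires it."""
    def __init__(self):
        self._lock = _RealLock()
        self.threads_seen = set()

    def acquire(self, *args, **kwargs):
        self.threads_seen.add(threading.get_ident())
        if len(self.threads_seen) > 1:
            print("lock %r used by multiple threads: %s" % (self, self.threads_seen))
        return self._lock.acquire(*args, **kwargs)

    def release(self):
        return self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()

threading.Lock = CountingLock       # monkey-patch so existing code picks it up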
I know that the builtin set type in python is not generally threadsafe, but this answer claims that it is safe to call pop() from two competing threads. Sure, you might get an exception, but your data isn't corrupted. I can't seem to find a doc that validates this claim. Is it true? Documentation, please!
If you look at the set.pop method in the CPython source you'll see that it doesn't release the GIL.
That means that only one set.pop will ever be happening at a time within a CPython process.
Since set.pop checks whether the set is empty, you can't cause anything worse than a KeyError by trying to pop from an empty set.
So no, you can't corrupt the data by popping from a set in multiple threads with CPython.
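If you want to convince yourself, here is an informal check (not a proof) you can run on CPython; it drains a shared set from several threads at once:

import threading

s = set(range(10000))
popped = []                         # list.extend below is effectively atomic in CPython

def drain():
    out = []
    while True:
        try:
            out.append(s.pop())
        except KeyError:            # the set is empty: the worst that can happen
            break
    popped.extend(out)

threads = [threading.Thread(target=drain) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(popped) == 10000         # every element was popped
assert len(set(popped)) == 10000    # and no element was popped twice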
I believe the set pop() operation to be thread-safe due to being atomic, in the sense that two threads won't be able to pop the same value.
I wouldn't rely on its behaviour if another thread were, for instance, iterating over that collection.
I couldn't find any concrete documentation either, just some topics that point in this direction. The official Python documentation would indeed benefit from information of this kind.
We're considering re-factoring a large application with a complex GUI which is isolated in a decoupled fashion from the back-end, to use the new (Python 2.6) multiprocessing module. The GUI/backend interface uses Queues with Message objects exchanged in both directions.
One thing I've just concluded (tentatively, but feel free to confirm it) is that "object identity" would not be preserved across the multiprocessing interface. Currently when our GUI publishes a Message to the back-end, it expects to get the same Message back with a result attached as an attribute. It uses object identity (if received_msg is message_i_sent:) to identify returning messages in some cases... and that seems likely not to work with multiprocessing.
This question is to ask what "gotchas" like this you have seen in actual use or can imagine one would encounter in naively using the multiprocessing module, especially in refactoring an existing single-process application. Please specify whether your answer is based on actual experience. Bonus points for providing a usable workaround for the problem.
Edit: Although my intent with this question was to gather descriptions of problems in general, I think I made two mistakes: I made it community wiki from the start (which probably makes many people ignore it, as they won't get reputation points), and I included a too-specific example which -- while I appreciate the answers -- probably made many people miss the request for general responses. I'll probably re-word and re-ask this in a new question. For now I'm accepting one answer as best merely to close the question as far as it pertains to the specific example I included. Thanks to those who did answer!
I have not used multiprocessing itself, but the problems presented are similar to experience I've had in two other domains: distributed systems, and object databases. Python object identity can be a blessing and a curse!
As for general gotchas, it helps if the application you are refactoring can acknowledge that tasks are being handled asynchronously. If not, you will generally end up managing locks, and much of the performance you could have gained by using separate processes will be lost to waiting on those locks. I will also suggest that you spend the time to build some scaffolding for debugging across processes. Truly asynchronous processes tend to be doing much more than the mind can hold and verify -- or at least my mind!
For the specific case outlined, I would manage object identity at the process border when items queued and returned. When sending a task to be processed, annotate the task with an id(), and stash the task instance in a dictionary using the id() as the key. When the task is updated/completed, retrieve the exact task back by id() from the dictionary, and apply the newly updated state to it. Now the exact task, and therefore its identity, will be maintained.
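A minimal sketch of that dictionary approach (names are made up, Python 3 assumed):

import multiprocessing

class Task:
    def __init__(self, data):
        self.data = data
        self.result = None

def process_task(payload):
    # worker side: only plain data crosses the process border
    task_id, data = payload
    return task_id, data * 2

if __name__ == "__main__":
    tasks = [Task(n) for n in range(5)]
    registry = {id(t): t for t in tasks}        # stash the originals, keyed by id()

    pool = multiprocessing.Pool()
    pairs = pool.map(process_task, [(id(t), t.data) for t in tasks])
    pool.close()
    pool.join()

    for task_id, result in pairs:
        registry[task_id].result = result       # the exact same object is updated

    assert all(t.result == t.data * 2 for t in tasks)

Note that the id() keys are only meaningful while the original tasks stay alive, which the registry itself guarantees here.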
Well, of course testing for identity on non-singleton objects (e.g. "a is None" or "a is False") isn't usually a good practice - it might be quick, but a really quick workaround would be to exchange the "is" test for an "==" test and use an incremental counter to define identity:
# this is not threadsafe.
class Message(object):
    def _next_id():
        i = 0
        while True:
            i += 1
            yield i
    _idgen = _next_id()
    del _next_id

    def __init__(self):
        self.id = next(self._idgen)

    def __eq__(self, other):
        return (self.__class__ == other.__class__) and (self.id == other.id)
This might be an idea.
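A quick illustration of what this buys you; the pickle round trip here just simulates a copy coming back across the process boundary:

import pickle

m = Message()
round_tripped = pickle.loads(pickle.dumps(m))   # a copy, as multiprocessing would hand back

print(m is round_tripped)   # False: identity is not preserved
print(m == round_tripped)   # True: the counter-based equality still matches them up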
Also, be aware that if you have tons of "worker processes", memory consumption might be far greater than with a thread-based approach.
You can try the persistent package from my project GarlicSim. It's LGPL'ed.
http://github.com/cool-RR/GarlicSim/tree/development/garlicsim/garlicsim/misc/persistent/
(The main module in it is persistent.py)
I often use it like this:
# ...
self.identity = Persistent()
Then I have an identity that is preserved across processes.
I just saw this section of Unladen Swallow's documentation come up on Hacker News. Basically, it's the Google engineers saying that they're not optimistic about removing the GIL. However, it seems as though there is discussion about the garbage collector interspersed with this talk about the GIL. Could someone explain the relation to me?
The really short version is that currently Python manages memory with a reference-counting scheme plus a mark-and-sweep cycle collector, optimized for latency (instead of throughput).
This is all fine when there is only a single mutating thread, but in a multi-threaded system you need to synchronize every time you modify a refcount, or else values can "fall through the cracks", and synchronization primitives are quite expensive on contemporary hardware.
If refcounts weren't changed so often this wouldn't be a problem, but pretty much every single operation you do in CPython can cause a refcount to change somewhere, so the options are either the GIL, doing refcounts with some kind of synchronization (and literally spending almost all your time on the sync), or ditching the refcounting system for some kind of real garbage collector.
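You can watch how easily refcounts move around with sys.getrefcount; the exact numbers are CPython-specific and may vary slightly between versions:

import sys

x = []
print(sys.getrefcount(x))   # typically 2: the name x plus the temporary argument reference

y = x                        # a plain assignment bumps the count
print(sys.getrefcount(x))   # 3

def f(obj):
    # merely passing the object into a function adds another reference (the parameter)
    return sys.getrefcount(obj)

print(f(x))                  # 4 while inside the call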
Tuna-Fish's answer basically covers it. If you want more details, there was a discussion about how the GIL could be removed without having too much of an effect on the reference counting here: http://mail.python.org/pipermail/python-ideas/2009-October/006264.html
I just found another point of view on this subject here: http://renesd.blogspot.com/2009/12/python-gil-unladen-swallow-reference.html