I am working on a class which operates in a multithreaded environment, and looks something like this (with excess noise removed):
class B:
    @classmethod
    def apply(cls, item):
        cls.do_thing(item)

    @classmethod
    def do_thing(cls, item):
        'do something to item'

    def run(self):
        pool = multiprocessing.Pool()
        for list_of_items in self.data_groups:
            pool.map(self.apply, list_of_items)
My concern is that two threads might call apply or do_thing at the same time, or that a subclass might try to do something stupid with cls in one of these functions. I could use staticmethod instead of classmethod, but calling do_thing would become a lot more complicated, especially if a subclass reimplements one of these but not the other. So my question is this: Is the above class thread-safe, or is there a potential problem with using classmethods like that?
Whether a method is thread safe or not depends on what the method does.
Working with local variables only is thread-safe. But when you change the same non-local variable from different threads, it becomes unsafe.
'do something to item' seems to modify only the given object, which is independent of any other object in the list, so it should be thread-safe.
If the same object is in the list several times, you may have to think about making the object thread-safe. That can be done by using with self.object_scope_lock: in every method that modifies the object.
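For example, here is a minimal sketch of that pattern; the attribute name object_scope_lock comes from above, while the counter is just illustrative:

import threading

class Item(object):
    def __init__(self):
        self.object_scope_lock = threading.Lock()
        self.count = 0

    def increment(self):
        # every method that mutates the object takes the same lock,
        # so concurrent calls from different threads are serialized
        with self.object_scope_lock:
            self.count += 1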
Anyway, what you are doing here is using processes instead of threads. In this case the objects are pickled and sent through a pipe to the other process, where they are modified and sent back. In contrast to threads, processes do not share memory. So I don't think using a lock in the classmethod would have an effect.
http://docs.python.org/3/library/threading.html?highlight=threading#module-threading
There's no difference between classmethods and regular functions (and instance methods) in this regard. Neither is automagically thread-safe.
If one or more classmethods/methods/functions can manipulate data structures simultaneously from different threads, you'd need to add synchronization protection, typically using threading.Locks.
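For example, if do_thing() did mutate state shared across calls, a class-level lock would be the typical fix. A rough sketch (the _seen set is made up for illustration):

import threading

class B(object):
    _lock = threading.Lock()   # one lock shared by the whole class
    _seen = set()              # shared mutable state (illustrative)

    @classmethod
    def do_thing(cls, item):
        with cls._lock:        # serialize access to the shared set
            cls._seen.add(item)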
Both other answers are technically correct in that the safety of do_thing() depends on what happens inside the function.
But the more precise answer is that the call itself is safe. In other words, if apply() and do_thing() are pure functions, then your code is safe. Any unsafety would be due to them not being pure functions (e.g. relying on or affecting a shared variable during execution).
As shx2 mentioned, classmethods are only "in" a class visually, for grouping. They have no inherent attachment to any instance of the class. Therefore this code is roughly equivalent in functioning:
def apply(item):
    do_thing(item)

def do_thing(item):
    'do something to item'

class B:
    def run(self):
        pool = multiprocessing.Pool()
        for list_of_items in self.data_groups:
            pool.map(apply, list_of_items)
A further note on concurrency given the other answers:
threading.Lock is easy to understand, but should be your last resort. In naive implementations it is often slower than completely linear processing. Your code will usually be faster if you can use things like threading.Event, queue.Queue, or multiprocessing.Pipe to transfer information instead (see the sketch after this note).
asyncio is the new hotness in python3. It's a bit more difficult to get right but is generally the fastest method.
If you want a great walkthrough of modern concurrency techniques in Python, check out core developer Raymond Hettinger's Keynote on Concurrency. The whole thing is great, but the downside of locks is highlighted starting at t=57:59.
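As a rough sketch of the queue-based style mentioned above (worker threads pull items from a queue instead of locking shared state; this is illustrative, not a drop-in for the original class):

import queue
import threading

def worker(jobs):
    while True:
        item = jobs.get()
        if item is None:      # sentinel: no more work
            break
        # 'do something to item' here; each thread owns the item it pulled

jobs = queue.Queue()
threads = [threading.Thread(target=worker, args=(jobs,)) for _ in range(4)]
for t in threads:
    t.start()
for item in range(100):
    jobs.put(item)
for _ in threads:
    jobs.put(None)            # one sentinel per worker thread
for t in threads:
    t.join()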
I have the following code for a click handler in my PyQt4 program:
def click_btn_get_info(self):
    task = self.window.le_task.text()
    self.statusBar().showMessage('Getting task info...')

    def thread_routine(task_id):
        order = self.ae.get_task_info(task_id)
        if order:
            info_str = "Customer: {email}\nTitle: {title}".format(**order)
            self.window.lbl_order_info.setText(info_str)
            self.statusBar().showMessage('Done')
        else:
            self.statusBar().showMessage('Authentication: failed!')

    thread = threading.Thread(target=thread_routine, args=(task,))
    thread.start()
Is it good practice to declare a function inside a function for use with threads?
In general, yes, this is perfectly reasonable. However, the alternative of creating a separate method (or, for top-level code, a separate function) is also perfectly reasonable. And so is creating a Thread subclass. So, there's no rule saying to always do one of the three; there are different cases where each one seems more reasonable than the others, but there's overlap between those cases, so it's usually a judgment call.
As Maxime pointed out, you probably want to use Qt's threading rather than native Python threading, especially since you want to call methods on your GUI objects. The Qt docs article Threads, Events and QObjects gives you an overview (although from a C++, not Python, viewpoint). And if you're using a QThread rather than a threading.Thread, it is much more common to take the OO approach of defining a QThread subclass and overriding its run method than to define a function, which makes your question moot.
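For illustration, a rough sketch of that QThread-subclass approach; the get_task_info and lbl_order_info names come from the question, while the signal wiring is only an assumption about how you might hook it up, not a drop-in replacement:

from PyQt4 import QtCore

class TaskInfoThread(QtCore.QThread):
    # signals are the safe way to hand results back to the GUI thread
    got_info = QtCore.pyqtSignal(str)
    failed = QtCore.pyqtSignal()

    def __init__(self, ae, task_id, parent=None):
        super(TaskInfoThread, self).__init__(parent)
        self.ae = ae
        self.task_id = task_id

    def run(self):
        order = self.ae.get_task_info(self.task_id)
        if order:
            self.got_info.emit("Customer: {email}\nTitle: {title}".format(**order))
        else:
            self.failed.emit()

In the click handler you would then keep a reference to the thread (so it isn't garbage-collected while running), connect got_info to self.window.lbl_order_info.setText, and call start().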
But if you do stick with Python threading, here's how I'd decide.
Pro separate method:
You're doing this in a method of a class, rather than in a free function, and the only state you want to share with the new thread is self.
Non-trivial code, longer than the function it's embedded in.
Pro local function:
Pretty specific to the info button callback; no one else will ever want to call it.
I'd probably make it a method, but I wouldn't complain about someone else's code that made it a local function.
In a different case—e.g., if the thread needed access to a local variable that had no business being part of the object, or if it were a trivial function I could write as an inline lambda, or if this were a top-level function sharing globals rather than a method sharing self, I'd go the other direction.
I need to convert a threading application to a multiprocessing application for multiple reasons (GIL, memory leaks). Fortunately the threads are quite isolated and only communicate via Queue.Queues. This primitive is also available in multiprocessing so everything looks fine. Now before I enter this minefield I'd like to get some advice on the upcoming problems:
How to ensure that my objects can be transferred via the Queue? Do I need to provide some __setstate__?
Can I rely on put returning instantly (like with threading Queues)?
General hints/tips?
Anything worthwhile to read apart from the Python documentation?
Answer to part 1:
Everything that has to pass through a multiprocessing.Queue (or Pipe or whatever) has to be picklable. This includes basic types such as tuples, lists and dicts. Classes are also supported if they are defined at the top level of a module and not too complicated (check the details). Trying to pass lambdas around will fail, however.
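A quick way to find out ahead of time is simply to try pickling one of your objects; a rough sketch:

import pickle

class Task(object):        # defined at module top level, so it pickles fine
    def __init__(self, n):
        self.n = n

data = pickle.dumps(Task(1))     # works
pickle.dumps(lambda x: x + 1)    # raises pickle.PicklingError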
Answer to part 2:
A put consists of two parts: It takes a semaphore to modify the queue and it optionally starts a feeder thread. So if no other Process tries to put to the same Queue at the same time (for instance because there is only one Process writing to it), it should be fast. For me it turned out to be fast enough for all practical purposes.
Partial answer to part 3:
The plain multiprocessing.Queue lacks a task_done method, so it cannot be used as a drop-in replacement directly. (The multiprocessing.JoinableQueue subclass provides it; see the sketch after this list.)
The old processing.Queue lacked a qsize method, and the newer multiprocessing version's qsize is inaccurate (just keep this in mind).
Since file descriptors are normally inherited on fork, care needs to be taken about closing them in the right processes.
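For the task_done point, a rough sketch of the JoinableQueue variant, which does provide task_done and join (the actual processing inside the worker is left out):

import multiprocessing

def worker(q):
    while True:
        item = q.get()
        # process item here ...
        q.task_done()            # JoinableQueue tracks outstanding items

if __name__ == '__main__':
    q = multiprocessing.JoinableQueue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.daemon = True              # worker dies with the main process
    p.start()
    for i in range(10):
        q.put(i)
    q.join()                     # blocks until every item is task_done'd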
We're considering re-factoring a large application with a complex GUI which is isolated in a decoupled fashion from the back-end, to use the new (Python 2.6) multiprocessing module. The GUI/backend interface uses Queues with Message objects exchanged in both directions.
One thing I've just concluded (tentatively, but feel free to confirm it) is that "object identity" would not be preserved across the multiprocessing interface. Currently when our GUI publishes a Message to the back-end, it expects to get the same Message back with a result attached as an attribute. It uses object identity (if received_msg is message_i_sent:) to identify returning messages in some cases... and that seems likely not to work with multiprocessing.
This question is to ask what "gotchas" like this you have seen in actual use or can imagine one would encounter in naively using the multiprocessing module, especially in refactoring an existing single-process application. Please specify whether your answer is based on actual experience. Bonus points for providing a usable workaround for the problem.
Edit: Although my intent with this question was to gather descriptions of problems in general, I think I made two mistakes: I made it community wiki from the start (which probably makes many people ignore it, as they won't get reputation points), and I included a too-specific example which -- while I appreciate the answers -- probably made many people miss the request for general responses. I'll probably re-word and re-ask this in a new question. For now I'm accepting one answer as best merely to close the question as far as it pertains to the specific example I included. Thanks to those who did answer!
I have not used multiprocessing itself, but the problems presented are similar to experience I've had in two other domains: distributed systems, and object databases. Python object identity can be a blessing and a curse!
As for general gotchas, it helps if the application you are refactoring can acknowledge that tasks are being handled asynchronously. If not, you will generally end up managing locks, and much of the performance you could have gained by using separate processes will be lost to waiting on those locks. I will also suggest that you spend the time to build some scaffolding for debugging across processes. Truly asynchronous processes tend to be doing much more than the mind can hold and verify -- or at least my mind!
For the specific case outlined, I would manage object identity at the process border as items are queued and returned. When sending a task to be processed, annotate the task with an id(), and stash the task instance in a dictionary using the id() as the key. When the task is updated/completed, retrieve the exact task back by id() from the dictionary, and apply the newly updated state to it. Now the exact task, and therefore its identity, will be maintained.
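A rough sketch of that bookkeeping on the GUI side; the queue names and the result attribute are made up for illustration:

pending = {}                          # id -> the original Message instance

def send(msg, to_backend):
    pending[id(msg)] = msg            # keep the real object on this side
    to_backend.put((id(msg), msg))    # a pickled copy crosses the boundary

def receive(from_backend):
    msg_id, processed = from_backend.get()
    original = pending.pop(msg_id)    # same identity as the object we sent
    original.result = processed.result  # copy the new state back onto it
    return original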
Well, of course, testing for identity on anything but a singleton (such as None or False, as in "a is None" or "a is False") isn't usually good practice - it might be quick, but a really quick workaround would be to exchange the "is" test for an "==" test and use an incremental counter to define identity:
# this is not threadsafe.
class Message(object):
    def _next_id():
        i = 0
        while True:
            i += 1
            yield i
    _idgen = _next_id()
    del _next_id

    def __init__(self):
        self.id = next(self._idgen)

    def __eq__(self, other):
        return (self.__class__ == other.__class__) and (self.id == other.id)
This might be an idea.
Also, be aware that if you have tons of "worker processes", memory consumption might be far greater than with a thread-based approach.
You can try the persistent package from my project GarlicSim. It's LGPL'ed.
http://github.com/cool-RR/GarlicSim/tree/development/garlicsim/garlicsim/misc/persistent/
(The main module in it is persistent.py)
I often use it like this:
# ...
self.identity = Persistent()
Then I have an identity that is preserved across processes.
I have been thinking about how I write classes in Python. More specifically how the constructor is implemented and how the object should be destroyed. I don't want to rely on CPython's reference counting to do object cleanup. This basically tells me I should use with statements to manage my object life times and that I need an explicit close/dispose method (this method could be called from __exit__ if the object is also a context manager).
class Foo(object):
    def __init__(self):
        pass

    def close(self):
        pass
Now, if all my objects behave in this way and all my code uses with statements or explicit calls to close() (or dispose()), I don't really see the need for me to put any code in __del__. Should we really use __del__ to dispose of our objects?
Short answer : No.
Long answer: Using __del__ is tricky, mainly because it's not guaranteed to be called. That means you can't do things there that absolutely have to be done. This in turn means that __del__ can basically only be used for cleanups that would happen sooner or later anyway, like cleaning up resources that would be cleaned up when the process exits, so it doesn't matter if __del__ doesn't get called. Of course, these are also generally the same things Python will do for you. So that kinda makes __del__ useless.
Also, __del__ only gets called when Python collects the object, and you said you didn't want to rely on Python doing the cleanup for you, which means you can't use __del__ anyway.
So, don't use __del__. Use __enter__/__exit__ instead.
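For example, the Foo class from the question as a context manager; a minimal sketch:

class Foo(object):
    def __enter__(self):
        # acquire whatever resource the object wraps
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()      # runs even if the with-block raised
        return False      # do not swallow exceptions

    def close(self):
        pass              # release the resource here

with Foo() as foo:
    pass                  # use foo; close() is guaranteed to run afterwards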
FYI: Here is an example of a non-circular situation where the destructor did not get called:
class A(object):
    def __init__(self):
        print('Constructing A')

    def __del__(self):
        print('Destructing A')

class B(object):
    a = A()
OK, so it's a class attribute. Evidently that's a special case. But it just goes to show that making sure __del__ gets called isn't straightforward. I'm pretty sure I've seen more non-circular situations where __del__ isn't called.
Not necessarily. You'll encounter problems when you have cyclic references. Eli Bendersky does a good job of explaining this in his blog post:
Safely using destructors in Python
If you are sure you will not go into cyclic references, then using __del__ in that way is OK: as soon as the reference count goes to zero, the CPython VM will call that method and destroy the object.
If you plan to use cyclic references, please think it through very thoroughly and check whether weak references may help (see the sketch below); in many cases, cyclic references are the first symptom of bad design.
If you have no control over the way your object is going to be used, then using __del__ may not be safe.
If you plan to use Jython or IronPython, __del__ is not reliable at all, because final object destruction will happen at garbage collection, and that's something you cannot control.
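For instance, a parent/child cycle can often be broken by making the back-reference weak; a sketch:

import weakref

class Node(object):
    def __init__(self, parent=None):
        self.children = []
        # weak back-reference: the child no longer keeps the parent alive,
        # so there is no cycle and plain reference counting can free both
        self._parent = weakref.ref(parent) if parent is not None else None

    @property
    def parent(self):
        return self._parent() if self._parent is not None else None

root = Node()
child = Node(root)
root.children.append(child)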
In sum, in my opinion, __del__ is usually perfectly safe and good; however, in many situations it could be better to take a step back and try to look at the problem from a different perspective; a good use of try/except and of with contexts may be a more Pythonic solution.
In Python, with this whole idea of "everything is an object", where is thread safety?
I am developing a Django website with WSGI. It will run on Linux, which as far as I know has effective process management, so perhaps we don't need to think about thread safety much. What I am in doubt about is how modules load, and whether their functions are static or not. Any information would be helpful.
Functions in a module are equivalent to static methods in a class. The issue of thread safety arises when multiple threads may be modifying shared data, or even one thread may be modifying such data while others are reading it; it's best avoided by making data be owned by ONE module (accessed via Queue.Queue from others), but when that's not feasible you have to resort to locking and other, more complex, synchronization primitives.
This applies whether the access to shared data happens in module functions, static methods, or instance methods -- and the shared data is such whether it's instance variables, class ones, or global ones (scoping and thread safety are essentially disjoint, except that function-local data is, to a point, intrinsically thread-safe -- no other thread will ever see the data inside a function instance, until and unless the function deliberately "shares" it through shared containers).
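A rough sketch of the "one owning module" pattern mentioned above; the module and function names are made up, and the Python 3 spelling queue is used (Queue in Python 2):

# counters.py - this module owns _data; other modules never touch it directly
import queue
import threading

_requests = queue.Queue()
_data = {}

def increment(key):
    _requests.put(key)            # safe to call from any thread

def _owner():
    while True:
        key = _requests.get()     # the only thread that ever mutates _data
        _data[key] = _data.get(key, 0) + 1

_owner_thread = threading.Thread(target=_owner)
_owner_thread.daemon = True
_owner_thread.start()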
If you use the multiprocessing module in Python's standard library, instead of the threading module, you may in fact not have to care about "thread safety" -- essentially because NO data is shared among processes... well, unless you go out of your way to change that, e.g. via mmapped files;-).
See the Python documentation to better understand the general thread-safety implications of Python.
Django itself seems to be thread-safe as of 1.0.3, but your code may not be, and you will have to verify that...
My advice would be simply not to care about that, and to serve your application with multiple processes instead of multiple threads (for example by using Apache's 'prefork' MPM instead of 'worker').