Accessing `self` from thread target - python

According to a number of sources, including this question, passing a runnable as the target parameter in __init__ (with or without args and kwargs) is preferable to extending the Thread class.
If I create a runnable, how can I pass the thread it is running on as self to it without extending the Thread class? For example, the following would work fine:
class MyTask(Thread):
def run(self):
print(self.name)
MyTask().start()
However, I can't see a good way to get this version to work:
def my_task(t):
print(t.name)
Thread(target=my_task, args=(), kwargs={}).start()
This question is a followup to Python - How can I implement a 'stoppable' thread?, which I answered, but possibly incompletely.
Update
I've thought of a hack to do this using current_thread():
def my_task():
print(current_thread().name)
Thread(target=my_task).start()
Problem: calling a function to get a parameter that should ideally be passed in.
Update #2
I have found an even hackier solution that makes current_thread seem much more attractive:
class ThreadWithSelf(Thread):
def __init__(self, **kwargs):
args = kwargs.get('args', ())
args = (self,) + tuple(args)
kwargs[args] = args
super().__init__(**kwargs)
ThreadWithSelf(target=my_task).start()
Besides being incredibly ugly (e.g. by forcing the user to use keywords only, even if that is the recommended way in the documentation), this completely defeats the purpose of not extending Thread.
Update #3
Another ridiculous (and unsafe) solution: to pass in a mutable object via args and to update it afterwards:
def my_task(t):
print(t[0].name)
container = []
t = Thread(target=my_task, args=(container,))
container[0] = t
t.start()
To avoid synchronization issues, you could kick it up a notch and implement another layer of ridiculousness:
def my_task(t, i):
print(t[i].name)
container = []
container[0] = Thread(target=my_task, args=(container, 0))
container[1] = Thread(target=my_task, args=(container, 1))
for t in container:
t.start()
I am still looking for a legitimate answer.

It seems like your goal is to get access to the thread currently executing a task from within the task itself. You can't add the thread as an argument to the threading.Thread constructor, because it's not yet constructed. I think there are two real options.
If your task runs many times, potentially on many different threads, I think the best option is to use threading.current_thread() from within the task. This gives you access directly to the thread object, with which you can do whatever you want. This seems to be exactly the kind of use-case this function was designed for.
On the other hand, if your goal is implement a thread with some special characteristics, the natural choice is to subclass threading.Thread, implementing whatever extended behavior you wish.
Also, as you noted in your comment, insinstance(current_thread(), ThreadSubclass) will return True, meaning you can use both options and be assured that your task will have access to whatever extra behavior you've implemented in your subclass.

The simplest and most readable answer here is: use current_thread(). You can use various weird ways to pass the thread as a parameter, but there's no good reason for that. Calling current_thread() is the standard approach which is shorter than all the alternatives and will not confuse other developers. Don't try to overthink/overengineer this:
def runnable():
thread = threading.current_thread()
Thread(target=runnable).start()
If you want to hide it a bit and change for aesthetic reasons, you can try:
def with_current_thread(f):
return f(threading.current_thread())
#with_current_thread
def runnable(thread):
print(thread.name)
Thread(target=runnable).start()
If this is not good enough, you may get better answers by describing why you think the parameter passing is better / more correct for your use case.

Related

Updating the same instance variables from different processes

Here is a simple secinaro:
class Test:
def __init__(self):
self.foo = []
def append(self, x):
self.foo.append(x)
def get(self):
return self.foo
def process_append_queue(append_queue, bar):
while True:
x = append_queue.get()
if x is None:
break
bar.append(x)
print("worker done")
def main():
import multiprocessing as mp
bar = Test()
append_queue = mp.Queue(10)
append_queue_process = mp.Process(target=process_append_queue, args=(append_queue, bar))
append_queue_process.start()
for i in range(100):
append_queue.put(i)
append_queue.put(None)
append_queue_process.join()
print str(bar.get())
if __name__=="__main__":
main()
When you call bar.get() at the end of the main() function why does it still return an empty list? How can I make it so that the child process also works with the same instance of Test not a new one?
All answers appreciated!
In general, processes have distinct address spaces, so that mutations of an object in one process have no effect on any object in any other process. Interprocess communication is needed to tell a process about changes made in another process.
That can be done explicitly (using things like multiprocessing.Queue), or implicitly if you use a facility implemented by multiprocessing for this purpose. For example, a great deal of work is done under the covers to make changes to a multiprocessing.Queue visible across processes.
The easiest way in your specific example is to replace your __init__ function like so:
def __init__(self):
import multiprocessing as mp
self.foo = mp.Manager().list()
It so happens that an mp.Manager instance supports a list() method that creates a process-aware list object (really a proxy for a list object, which forwards list operations to an under-the-covers server process that maintains a single copy of "the real" list - the list object isn't really shared across processes, because that's impossible - but the proxies make it appear to be shared).
So if you make that change, your code will display the results you expect - and there is no simpler way.
Note that multiprocessing works better the less IPC (interprocess communication) you need, and that's true pretty much regardless of application or programming language.
Objects are copied between processes by pickling them and passing the string over a pipe. There is no way to achieve true "shared memory" for pure Python objects between processes. To achieve precisely this type of synchronization, take a look at the multiprocessing.Manager documentation (https://docs.python.org/2/library/multiprocessing.html#managers) which provides you with examples about synchronized versions of common Python container types. These are "proxied" containers where operations on the proxy send all arguments across the process boundary, pickled, and are then executed in the parent process.

Is this style of using Thread pool with tornado ok?

So I create a class variable called executor in my class
executor = ThreadPoolExecutor(100)
and instead of having functions and methods and using decorators, I simply use following line to handle my blocking tasks(like io and hash creation and....) in my async methods
result = await to_tornado_future(self.executor.submit(blocking_method, param1, param2)
I decided to use this style cause
1- decorators are slower by nature
2- there is no need for extra methods and functions
3- it workes as expected and creates no threads before it needed
Am I right ? Please use reasons(I want to know if the way I use, is slower or uses more resources or....)
Update
Based on Ben answer, my above approach was not correct
so I ended up using following function as needed, I think it's the best way to go
def pool(pool_executor, fn, *args, **kwargs):
new_future = Future()
result_future = pool_executor.submit(fn, *args, **kwargs)
result_future.add_done_callback(lambda f: new_future.set_result(f.result()))
return new_future
usage:
result = await pool(self.executor, time.sleep, 3)
This is safe as long as all your blocking methods are thread-safe. Since you mentioned doing IO in these threads, I'll point out that doing file IO here is fine but all network IO in Tornado must occur on the IOLoop's thread.
Why do you say "decorators are slower by nature"? Which decorators are slower than what? Some decorators have no performance overhead at all (although most do have some runtime cost). to_tornado_future(executor.submit()) isn't free either. (BTW, I think you want tornado.gen.convert_yielded instead of tornado.platform.asyncio.to_tornado_future. executor.submit doesn't return an asyncio.Future).
As a general rule, running blocking_method on a thread pool is going to be slower than just calling it directly. You should do this only when blocking_method is likely to block for long enough that you want the main thread free to do other things in the meantime.

Python multiprocessing seems near impossible to do within classes/using any class instances. What is its intended use?

I have an alogirithm that I am trying to parallelize, because of very long run times in serial. However, the function that needs to be parallelized is inside a class. multiprocessing.Pool seems to be the best and fastest way to do this, but there is a problem. It's target function can not be a function of an object instance. Meaning this; you declare a Pool in the following way:
import multiprocessing as mp
cpus = mp.cpu_count()
poolCount = cpus*2
pool = mp.Pool(processes = poolCount, maxtasksperchild = 2)
And then actually use it as so:
pool.map(self.TargetFunction, args)
But this throws an error, because object instances cannot be pickled, as the Pool function does to pass information to all of its child processes. But I have to use self.TargetFunction
So I had an idea, I would create a new Python file named parallel and simply write a couple of functions without putting them in a class, and call those functions from within my original class (of whose function I want to parallelize)
So I tried this:
import multiprocessing as mp
def MatrixHelper(args):
WM = args[0][0]
print(WM.CreateMatrixMp(*args))
return WM.CreateMatrixMp(*args)
def Start(sigmaI, sigmaX, numPixels, WM):
cpus = mp.cpu_count()
poolCount = cpus * 2
args = [(WM, sigmaI, sigmaX, i) for i in range(numPixels)]
print('Number of cpu\'s to process WM:%d'%cpus)
pool = mp.Pool(processes = poolCount, maxtasksperchild = 2)
tempData = pool.map(MatrixHelper, args)
return tempData
These functions are not part of a class, using MatrixHelper in Pools map function works fine. But I realized while doing this that it was no way out. The function in need of parallelization (CreateMatrixMp) expects an object to be passed to it (it is declared as def CreateMatrixMp(self, sigmaI, sigmaX, i))
Since it is not being called from within its class, it doesn't get a self passed to it. To solve this, I passed the Start funtion the calling object itself. As in, I say parallel.Start(sigmaI, sigmaX, self.numPixels, self). The object self then becomes WM so that I will be able to finally call the desired function as WM.CreateMatrixMp().
I'm sure that that is a very sloppy way of coding, but I just wanted to see if it would work. But nope, pickling error again, the map function cannot handle any objects instances at all.
So my question is, why is it designed this way? It seems useless, it seems to be completely disfunctional in any program that uses classes at all.
I tried using Process rather than Pool, but this requires the array that I am ultimately writing to to be shared, which requires processes waiting for eachother. If I don't want it to be shared, then I have each process write its own smaller array, and do one big write at the end. But both of these result in slower run times than when I was doing this serially! Pythons builtin multiprocessing seems absolutely useless!
Can someone please give me some guidance as to how to actually save time with multiprocessing, in the context of my tagret function being inside a class? I have read on posts here to use pathos.multiprocessing instead, but I am on Windows, and am working on this project with multiple people who all have different set ups. Having everyone try to install it would be inconveinient.
I was having a similar issue with trying to use multiprocessing within a class. I was able to solve it with a relatively easy workaround I found online. Basically you use a function outside of your class that unwraps/unpacks the method inside your function that you're trying to parallelize. Here are the two websites I found that explain how to do it.
Website 1 (joblib example)
Website 2 (multiprocessing module example)
For both, the idea is to do something like this:
rom multiprocessing import Pool
import time
def unwrap_self_f(arg, **kwarg):
return C.f(*arg, **kwarg)
class C:
def f(self, name):
print 'hello %s,'%name
time.sleep(5)
print 'nice to meet you.'
def run(self):
pool = Pool(processes=2)
names = ('frank', 'justin', 'osi', 'thomas')
pool.map(unwrap_self_f, zip([self]*len(names), names))
if __name__ == '__main__':
c = C()
c.run()
The essence of how multiprocessing works is that it spawns sub-processes that receive parameters to run a certain function. In order to pass these arguments, it needs that they are, well, passable: non-exclusive to the main process, s.a. sockets, file descriptors and other low-level, OS related stuff.
This translates into "need to be pickleable or serializable".
On the same topic, parallel processing works best when you (can) have self-contained divisions of a problem. I can tell you want to share some sort of input/stream/database source, but this will probably create a bottleneck that you'll have to tackle at some point (at least, from the "python script" side, rather than the "OS/database" side. Fortunately, you have to tackle it early now.
You can re-code your classes to spawn/create these non-pickable resources when neeeded rather than at start
def targetFunction(self, range_params):
if not self.ready():
self._init_source()
#rest of the code
You kinda tackled the problem the other way around (initialized an object based on params). And yes, parallel processing comes with a cost.
You can see the multiprocessing programming guidelines for an even more thorough insight on this matter.
this is an old post but it still is one of the top results when you search for the topic. Some good info for this question can be found at this stack overflow: python subclassing multiprocessing.Process
I tried some workarounds to try calling pool.starmap from inside of a class to another function in the class. Making it a staticmethod or having a function on the outside call it didn't work and gave the same error. A class instance just can't be pickled so we need to create the instance after we start the multiprocessing.
What I ended up doing that worked for me was to separate my class into two classes. Basically, the function you are calling the multiprocessing on needs to be called right after you instantiate a new object for the class it belongs to.
Something like this:
from multiprocessing import Pool
class B:
...
def process_feature(idx, feature):
# do stuff in the new process
pass
...
def multiprocess_feature(process_args):
b_instance = B()
return b_instance.process_feature(*process_args)
class A:
...
def process_stuff():
...
with Pool(processes=num_processes, maxtasksperchild=10) as pool:
results = pool.starmap(
multiprocess_feature,
[
(idx, feature)
for idx, feature in enumerate(features)
],
chunksize=100,
)
...
...
...

Creating and reusing objects in python processes

I have an embarrassingly parallelizable problem consisting on a bunch of tasks that get solved independently of each other. Solving each of the tasks is quite lengthy, so this is a prime candidate for multi-processing.
The problem is that solving my tasks requires creating a specific object that is very time consuming on its own but can be reused for all the tasks (think of an external binary program that needs to be launched), so in the serial version I do something like this:
def costly_function(task, my_object):
solution = solve_task_using_my_object
return solution
def solve_problem():
my_object = create_costly_object()
tasks = get_list_of_tasks()
all_solutions = [costly_function(task, my_object) for task in tasks]
return all_solutions
When I try to parallelize this program using multiprocessing, my_object cannot be passed as a parameter for a number of reasons (it cannot be pickled, and it should not run more than one task at the same time), so I have to resort to create a separate instance of the object for each task:
def costly_function(task):
my_object = create_costly_object()
solution = solve_task_using_my_object
return solution
def psolve_problem():
pool = multiprocessing.Pool()
tasks = get_list_of_tasks()
all_solutions = pool.map_async(costly_function, tasks)
return all_solutions.get()
but the added costs of creating multiple instances of my_object makes this code only marginally faster than the serialized one.
If I could create a separate instance of my_object in each process and then reuse them for all the tasks that get run in that process, my timings would significantly improve. Any pointers on how to do that?
I found a simple way of solving my own problem without bringing in any tools besides the standard library, I thought I'd write it down here in case somebody else has a similar problem.
multiprocessing.Pool accepts an initializer function (with arguments) that gets run when each process is launched. The return value of this function is not stored anywhere, but one can take advantage of the function to set up a global variable:
def init_process():
global my_object
my_object = create_costly_object()
def costly_function(task):
global my_object
solution = solve_task_using_my_object
return solution
def psolve_problem():
pool = multiprocessing.Pool(initializer=init_process)
tasks = get_list_of_tasks()
all_solutions = pool.map_async(costly_function, tasks)
return all_solutions.get()
Since each process has a separate global namespace, the instantiated objects do not clash, and they are created only once per process.
Probably not the most elegant solution, but it's simple enough and gives me a near-linear speedup.
you can have celery project handle all this for you, among many other features it also have a way to run some task initialization that can be used latter by all tasks
You are right that you are constrained to pickable objects when using multiprocessing. Are you absolutely sure that your object is un-pickleable?
Have you tried dill? If you import it in, anytime pickle is called it will use the dill bindings. It worked for me, when I was trying to use multiprocessing on sympy equations.

Is there a reason not to send super().__init__() a dictionary instead of **kwds?

I just started building a text based game yesterday as an exercise in learning Python (I'm using 3.3). I say "text based game," but I mean more of a MUD than a choose-your-own adventure. Anyway, I was really excited when I figured out how to handle inheritance and multiple inheritance using super() yesterday, but I found that the argument-passing really cluttered up the code, and required juggling lots of little loose variables. Also, creating save files seemed pretty nightmarish.
So, I thought, "What if certain class hierarchies just took one argument, a dictionary, and just passed the dictionary back?" To give you an example, here are two classes trimmed down to their init methods:
class Actor:
def __init__(self, in_dict,**kwds):
super().__init__(**kwds)
self._everything = in_dict
self._name = in_dict["name"]
self._size = in_dict["size"]
self._location = in_dict["location"]
self._triggers = in_dict["triggers"]
self._effects = in_dict["effects"]
self._goals = in_dict["goals"]
self._action_list = in_dict["action list"]
self._last_action = ''
self._current_action = '' # both ._last_action and ._current_action get updated by .update_action()
class Item(Actor):
def __init__(self,in_dict,**kwds)
super().__init__(in_dict,**kwds)
self._can_contain = in_dict("can contain") #boolean entry
self._inventory = in_dict("can contain") #either a list or dict entry
class Player(Actor):
def __init__(self, in_dict,**kwds):
super().__init__(in_dict,**kwds)
self._inventory = in_dict["inventory"] #entry should be a Container object
self._stats = in_dict["stats"]
Example dict that would be passed:
playerdict = {'name' : '', 'size' : '0', 'location' : '', 'triggers' : None, 'effects' : None, 'goals' : None, 'action list' = None, 'inventory' : Container(), 'stats' : None,}
(The None's get replaced by {} once the dictionary has been passed.)
So, in_dict gets passed to the previous class instead of a huge payload of **kwds.
I like this because:
It makes my code a lot neater and more manageable.
As long as the dicts have at least some entry for the key called, it doesn't break the code. Also, it doesn't matter if a given argument never gets used.
It seems like file IO just got a lot easier (dictionaries of player data stored as dicts, dictionaries of item data stored as dicts, etc.)
I get the point of **kwds (EDIT: apparently I didn't), and it hasn't seemed cumbersome when passing fewer arguments. This just appears to be a comfortable way of dealing with a need for a large number of attributes at the the creation of each instance.
That said, I'm still a major python noob. So, my question is this: Is there an underlying reason why passing the same dict repeatedly through super() to the base class would be a worse idea than just toughing it out with nasty (big and cluttered) **kwds passes? (e.g. issues with the interpreter that someone at my level would be ignorant of.)
EDIT:
Previously, creating a new Player might have looked like this, with an argument passed for each attribute.
bob = Player('bob', Location = 'here', ... etc.)
The number of arguments needed blew up, and I only included the attributes that really needed to be present to not break method calls from the Engine object.
This is the impression I'm getting from the answers and comments thus far:
There's nothing "wrong" with sending the same dictionary along, as long as nothing has the opportunity to modify its contents (Kirk Strauser) and the dictionary always has what it's supposed to have (goncalopp). The real answer is that the question was amiss, and using in_dict instead of **kwds is redundant.
Would this be correct? (Also, thanks for the great and varied feedback!)
I'm not sure I understand your question exactly, because I don't see how the code looked before you made the change to use in_dict. It sounds like you have been listing out dozens of keywords in the call to super (which is understandably not what you want), but this is not necessary. If your child class has a dict with all of this information, it can be turned into kwargs when you make the call with **in_dict. So:
class Actor:
def __init__(self, **kwds):
class Item(Actor):
def __init__(self, **kwds)
self._everything = kwds
super().__init__(**kwds)
I don't see a reason to add another dict for this, since you can just manipulate and pass the dict created for kwds anyway
Edit:
As for the question of the efficiency of using the ** expansion of the dict versus listing the arguments explicitly, I did a very unscientific timing test with this code:
import time
def some_func(**kwargs):
for k,v in kwargs.items():
pass
def main():
name = 'felix'
location = 'here'
user_type = 'player'
kwds = {'name': name,
'location': location,
'user_type': user_type}
start = time.time()
for i in range(10000000):
some_func(**kwds)
end = time.time()
print 'Time using expansion:\t{0}s'.format(start - end)
start = time.time()
for i in range(10000000):
some_func(name=name, location=location, user_type=user_type)
end = time.time()
print 'Time without expansion:\t{0}s'.format(start - end)
if __name__ == '__main__':
main()
Running this 10,000,000 times gives a slight (and probably statistically meaningless) advantage passing around a dict and using **.
Time using expansion: -7.9877269268s
Time without expansion: -8.06108212471s
If we print the IDs of the dict objects (kwds outside and kwargs inside the function), you will see that python creates a new dict for the function to use in either case, but in fact the function only gets one dict forever. After the initial definition of the function (where the kwargs dict is created) all subsequent calls are just updating the values of that dict belonging to the function, no matter how you call it. (See also this enlightening SO question about how mutable default parameters are handled in python, which is somewhat related)
So from a performance perspective, you can pick whichever makes sense to you. It should not meaningfully impact how python operates behind the scenes.
I've done that myself where in_dict was a dict with lots of keys, or a settings object, or some other "blob" of something with lots of interesting attributes. That's perfectly OK if it makes your code cleaner, particularly if you name it clearly like settings_object or config_dict or similar.
That shouldn't be the usual case, though. Normally it's better to explicitly pass a small set of individual variables. It makes the code much cleaner and easier to reason about. It's possible that a client could pass in_dict = None by accident and you wouldn't know until some method tried to access it. Suppose Actor.__init__ didn't peel apart in_dict but just stored it like self.settings = in_dict. Sometime later, Actor.method comes along and tries to access it, then boom! Dead process. If you're calling Actor.__init__(var1, var2, ...), then the caller will raise an exception much earlier and provide you with more context about what actually went wrong.
So yes, by all means: feel free to do that when it's appropriate. Just be aware that it's not appropriate very often, and the desire to do it might be a smell telling you to restructure your code.
This is not python specific, but the greatest problem I can see with passing arguments like this is that it breaks encapsulation. Any class may modify the arguments, and it's much more difficult to tell which arguments are expected in each class - making your code difficult to understand, and harder to debug.
Consider explicitly consuming the arguments in each class, and calling the super's __init__ on the remaining. You don't need to make them explicit:
class ClassA( object ):
def __init__(self, arg1, arg2=""):
pass
class ClassB( ClassA ):
def __init__(self, arg3, arg4="", *args, **kwargs):
ClassA.__init__(self, *args, **kwargs)
ClassB(3,4,1,2)
You can also leave the variables uninitialized and use methods to set them. You can then use different methods in the different classes, and all subclasses will have access to the superclass methods.

Categories

Resources