How to efficiently share an attribute of a common parent class between multiprocessing tasks? - python

I have a class named "Problem", and two others called "Colony" and "Ant".
A "Problem" has an attribute of type "Colony", and each "Colony" has a list of "Ants".
Each Ant is meant to be a task run through a multiprocessing.JoinableQueue() by a method of the "Problem" class, and when its __call__ method is invoked, it must consult and modify a graph attribute of the "Problem" class, which every ant would have to access.
What would be the most efficient way to implement this?
I have thought of passing each ant a copy of the graph in its constructor, and then, when they are all finished, merging the subgraphs back into one graph. But I think it would be better to somehow share the resource directly between all the ants, for example with a "semaphore"-style design.
Any ideas?
Thanks

If splitting the data and joining the results can be done reasonably, this is almost always going to be more efficient—and a lot simpler—than having them all fight over shared data.
There are cases where there is no reasonable way to do this (it's either very complicated, or very slow, to join the results back up). Even then, though, there can be a reasonable alternative: have each task return "mutation commands" of some form. The parent process can then, e.g., iterate over the output queue and apply each command to the single big graph.
If even that isn't feasible, then you need sharing. There are two parts to sharing: making the data itself sharable, and locking it.
Whatever your graph type is, it probably isn't inherently shareable; it almost certainly has internal pointers and so on. That means you will need to build some shareable representation of it: a multiprocessing.Array, multiprocessing.sharedctypes wrapped around Structures, bytes in a file that each process can mmap, or whatever shared-memory support a library you are already using (NumPy, for example) provides. All of your tasks can then mutate that Array (or whatever you chose), and if you need an extra step at the end to turn it back into a useful graph object, that step should be pretty quick.
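For instance, here is a minimal sketch of one such representation, assuming your graph can be reduced to a dense N-by-N matrix of edge weights (the helper names are made up for illustration):
import multiprocessing

N = 10  # number of nodes, assumed to be known up front

# Flat, shareable encoding: the weight of edge (i, j) lives at index i*N + j.
edge_weights = multiprocessing.Array('d', N * N)

def get_weight(weights, i, j):
    return weights[i * N + j]

def set_weight(weights, i, j, value):
    weights[i * N + j] = value
At the end of the run, walking this array once to rebuild your real graph object is cheap compared to the work the ants did.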
Next, locking. The really simple approach is to create a single multiprocessing.Lock and have each task grab it whenever it needs to mutate the shared data. In some cases it can be more efficient to have multiple locks protecting different parts of the shared data, and in some cases it can be more efficient to grab the lock for each individual mutation instead of holding it for a whole "transaction" worth of mutations (though of course that may not be correct for your algorithm). Without knowing your actual code, there's no way to judge these tradeoffs; in fact, a large part of the art of shared-data threading is knowing how to work this stuff out. (And a large part of the reason shared-nothing threading is easier and usually more efficient is that you don't need to work this stuff out.)
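A minimal sketch of the single-lock version, continuing the edge-weight array above (deposit_pheromone is a made-up name; the real mutation is whatever your ants actually do):
import multiprocessing

lock = multiprocessing.Lock()  # a single lock guarding the whole shared array

def deposit_pheromone(weights, n, i, j, amount):
    # Hold the lock across the whole read-modify-write so two ants
    # can't interleave their updates to the same edge.
    with lock:
        weights[i * n + j] += amount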
Meanwhile, I'm not sure why you need an explicit JoinableQueue here in the first place. It sounds like everything you want can be done with a Pool. To take a simpler but concrete example:
import multiprocessing

a = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
with multiprocessing.Pool() as pool:
    b = pool.map(reversed, a, chunksize=1)
c = [list(i) for i in b]
This is a pretty stupid example, but it illustrates that each task operates on one of the rows of a and returns something, which I can then combine in some custom way (by calling list on each one) to get the result I want.

Related

Is there a good reason why classes shouldn't include a list of all objects created?

I'm learning a lot about Python. For one of my programs, I need to compare all the objects that have been created, so I put them in a list. I thought it might be simpler to create a class variable that contains every object created.
This seems so obvious to me that I wonder why it isn't done all the time, so I figure there must be a really really good reason for that.
So for something like
class Basket:
    baskets = []

    def __init__(self, id, volume):
        self.id = id
        self.volume = volume
        Basket.baskets.append(self)
Why is this not done more often? It seems obvious. My assumption is that there are very good reasons why you wouldn't, so I'm curious to hear them.
This is one of those ideas new programmers come up with over and over again that turns out to be unhelpful and counterproductive in practice. It's important to be able to manage the objects you create, but a single class-managed list of every instance of a class ever created turns out to do a very bad job of that.
The core problem is that the "every" in "every object created" is much too broad. Code that actually needs to operate on every single instance of a specific class is extremely rare. Much more commonly, code needs to operate on every instance that particular code creates, or every member of a particular group of objects.
Using a single list of all instances makes your code inflexible. It's a design that encourages writing code to operate on "all the instances" instead of "all the instances the code cares about". When the scope of a program expands, that design makes it really hard to create instances the code doesn't or shouldn't care about.
Plus, a list of every instance is a data structure with almost no structure. It does nothing to express the relationships between objects. If two objects are in such a list, that just says "both these objects exist". You quickly end up needing more complex data structures to represent useful information about your objects, and once you have those, the class-managed list doesn't do anything useful.
For example, you've got a Basket class. We don't have enough information to tell whether this is a shopping basket, or a bin-packing problem, or what. "Volume" suggests maybe it's a bin-packing problem, so let's go with that. We've got a number of items to pack into a number of baskets, and the solver has to know about "all the baskets" to figure out how to pack items into baskets... except, it really needs to know about all the baskets in this problem. Not every instance of Basket in the entire program.
What if you want to solve two bin-packing problems, with two separate sets of baskets? Throwing all the baskets into a single list makes it hard to keep track of things. What if you want to solve two bin-packing problems at the same time, maybe in two different threads? Then you can't even just clear the list when you're done with one problem before moving on to the next.
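A short sketch of the alternative: keep each problem's baskets in a plain list that belongs to that problem, and pass the list to whatever code needs it (total_volume here is just a stand-in for your solver):
class Basket:
    def __init__(self, id, volume):
        self.id = id
        self.volume = volume

def total_volume(baskets):
    # Sees only the baskets it was given, not every Basket ever created.
    return sum(b.volume for b in baskets)

problem_a = [Basket(1, 10), Basket(2, 20)]
problem_b = [Basket(3, 15), Basket(4, 25)]

print(total_volume(problem_a))  # 30; problem_b's baskets never leak in
print(total_volume(problem_b))  # 40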
What if you want to write unit tests? Those will need to create Basket instances. If you have a class-managed list of all baskets, the tests will add Basket instances to that list, making the tests interfere with each other. The contents of the list when one test runs will depend on test execution order. That's not good. Unit tests are supposed to be independent of each other.
Consider the built-in classes. int, dict, str, classes like those. Have you ever wanted a list of every int in your entire program, or every string? It wouldn't be very useful. It'd include all sorts of stuff you don't care about, and stuff you didn't even know existed. Random constants from modules you've never heard of, os.name, the Python copyright string, etc. You wouldn't have the slightest clue where most of it even came from. How would you get anything useful done with a list like that?
On a smaller scale, the same thing applies to a list of every instance of a class you write. Sure, your class won't be used in quite as many situations as a class like int, but as the scope of a program expands, your class will end up used in more ways, and those uses probably won't need to know about each other. A single list of instances intrinsically makes it hard for different uses of a class to avoid interfering with each other.

How to share large read only dictionary/list across processes in multiprocessing in python?

I have an 18 GB pickle file which I need to access across processes. I tried using
from multiprocessing import Manager
import cPickle as pkl
manager = Manager()
data = manager.dict(pkl.load(open("xyz.pkl","rb")))
However, I am getting the following issue:
IOError: [Errno 11] Resource temporarily unavailable
Someone suggested it might be because of a socket timeout, but that doesn't seem to be the case, as increasing the timeout did not help.
How do I go about this? Is there any other efficient way of sharing data across processes?
This is mostly a duplicate of Shared memory in multiprocessing, but you're specifically looking at a dict or list object, rather than a simple array or value.
Unfortunately, there is no simple way to share a dictionary or list like this, as the internals of a dictionary are complicated (and differ across different Python versions). If you can restructure your problem so that you can use an Array object, you can make a shared Array, fill it in once, and use it with no lock. This will be much more efficient in general.
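A minimal sketch of that approach, assuming your data can be flattened into numbers and that the pool workers are forked so they inherit the array (so not under Windows's spawn start method):
import multiprocessing

# Created and filled once in the parent, before the pool exists.
shared = multiprocessing.Array('d', [float(i) for i in range(1000)], lock=False)

def worker(i):
    return shared[i] * 2  # pure reads, so no lock is needed

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        print(pool.map(worker, [0, 10, 999]))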
It's also possible, depending on access patterns, that simply loading the object first, then creating your process pools, will work well on Unix-like systems with copy-on-write fork. But there are no guarantees here.
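And a sketch of the load-then-fork approach (Unix-only; lookup is a made-up example of read-only access to your unpickled dict):
import multiprocessing
import pickle

big_data = None  # rebound below, before the pool is created

def lookup(key):
    # Each forked worker reads the dict it inherited from the parent.
    return big_data.get(key)

if __name__ == '__main__':
    with open('xyz.pkl', 'rb') as f:
        big_data = pickle.load(f)
    with multiprocessing.Pool() as pool:
        print(pool.map(lookup, ['some', 'keys', 'here']))
Keep in mind that CPython's reference counting writes to the inherited pages, so copy-on-write can still end up duplicating some of the memory; that's part of why there are no guarantees.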
Note that the error you are getting, error number 11 = EAGAIN = "Resource temporarily unavailable", happens when trying to send an enormous set of values through a Manager instance. Managers don't share the underlying data at all: instead, they proxy accesses so that each participating process has its own copy, and when one process updates a value, the update is sent to everyone participating in the illusion of sharing. That way, everyone can (eventually, depending on access timing) agree on what the values are, so they all seem to be using the same data. In reality, though, each process has a private copy.

Walk a list without creating unneeded object

I usually do this:
[worker.do_work() for worker in workers]
This has the advantage of being very readable and contained in a single line, but the drawback of creating an object (a list) that I do not need, which means garbage collection is triggered unnecessarily.
The obvious alternative:
for worker in workers:
    worker.do_work()
Is also quite readable, but uses two lines.
Is there a single-line way of achieving the same result, without creating unnecessary objects?
Sure, there is.
def doLotsOfWork(wks):
    for w in wks:
        w.do_work()
And now, your "one liner":
doLotsOfWork(workers)
In short, there's no "shorter" (or better) way besides using a for loop. I'd advise you not to use the list comprehension, because it relies on side effects - that's a code smell.
"GC" in python is quite different from java. A ref-count decrement is much much cheaper than mark-and-sweep. Benchmark it, then decide if you're placing too much emphasis on a small cost.
To make it a one liner, simply define a helper function and then it's a single line to invoke it. Bury the function in an imported library if convenient.

How to fetch Riak object, change its value and store it back with all indexes in Python

I am using the Riak database to store my Python application objects, which are used and processed in parallel by multiple scripts. Because of that, I need to lock them in various places, to avoid them being processed by more than one script at once, like this:
riak_bucket = riak_connect('clusters')
cluster = riak_bucket.get(job_uuid).get_data()
cluster['status'] = 'locked'
riak_obj = riak_bucket.new(job_uuid, data=cluster)
riak_obj.add_index('username_bin', cluster['username'])
riak_obj.add_index('hostname_bin', cluster['hostname'])
riak_obj.store()
The thing is, this is quite a bit of code to do one simple, repeatable thing, and given that locking occurs quite often, I would like to find a simpler, cleaner way of doing it. I've tried writing a function to do the locking/unlocking, like this (for a different object, called 'build'):
def build_job_locker(uuid, status='locked'):
    riak_bucket = riak_connect('builds')
    build = riak_bucket.get(uuid).get_data()
    build['status'] = status
    riak_obj = riak_bucket.new(build['uuid'], data=build)
    riak_obj.add_index('cluster_uuid_bin', build['cluster_uuid'])
    riak_obj.add_index('username_bin', build['username'])
    riak_obj.store()
    # when locking, return the locked db object to avoid fetching it again
    if 'locked' in status:
        return build
    else:
        return
but since the objects are obviously quite different from one another (they have different indexes and so on), I ended up writing a locking function for every object type... which is almost as messy as not having the functions at all and repeating the code.
The question is: is there a way to write a general function to do this, knowing that every object has a 'status' field, that would lock the object in the database while retaining all its indexes and other attributes? Or, perhaps, is there another, easier way I haven't thought of?
After some more research, and questions asked on various IRC channels, it seems that this is not doable, as there is no way to fetch this kind of metadata about objects from Riak.

Design question in Python: should this be one generic function or two specific ones?

I'm creating a basic database utility class in Python. I'm refactoring an old module into a class. I'm now working on an executeQuery() function, and I'm unsure of whether to keep the old design or change it. Here are the 2 options:
(The old design:) Have one generic executeQuery method that takes the query to execute and a boolean commit parameter that indicates whether to commit (insert, update, delete) or not (select), and determines with an if statement whether to commit or to select and return.
(This is the way I'm used to, but that might be because you can't have a function that sometimes returns something and sometimes doesn't in the languages I've worked with:) Have 2 functions, executeQuery and executeUpdateQuery (or something equivalent). executeQuery will execute a simple query and return a result set, while executeUpdateQuery will make changes to the DB (insert, update, delete) and return nothing.
Is it accepted to use the first way? It seems unclear to me, but maybe it's more Pythonic...? Python is very flexible; maybe I should take advantage of a feature that can't really be used this way in stricter languages...
And a second part of this question, unrelated to the main idea - what is the best way to return query results in Python? Using which function to query the database, in what format...?
It's probably just me and my FP fetish, but I think a function executed solely for its side effects is very different from a non-destructive function that fetches some data, and the two should therefore have different names. Especially if the generic function would do something different depending on exactly that distinction (the part about the commit parameter seems to imply it would).
As for how to return results... I'm a huge fan of generators, but if the library you use for database connections returns a list anyway, you might as well pass this list on - a generator wouldn't buy you anything in this case. But if it allows you to iterate over the results (one at a time), seize the opportunity to save a lot of memory on larger queries.
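For example, here is a sketch with the standard sqlite3 module (your database library will differ, but the pattern is the same): a generator that yields one row at a time instead of building the whole list up front.
import sqlite3

def iter_rows(conn, query, params=()):
    # Yield rows one at a time so a large result set never sits in memory at once.
    for row in conn.execute(query, params):
        yield row

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(1,), (2,), (3,)])
for row in iter_rows(conn, 'SELECT x FROM t'):
    print(row[0])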
I don't know how to answer the first part of your question, it seems like a matter of style more than anything else. Maybe you could invoke the Single Responsibility Principle to argue that it should be two separate functions.
When you're going to return a sequence of indeterminate length, it's best to use a generator.
I'd have two methods, one which updates the database and one which doesn't. Both could delegate to a common private method, if they share a lot of code.
By separating the two methods, it becomes clear to callers what the different semantics are between the two, makes documenting the different methods easier, and clarifies what return types to expect. Since you can pull out shared code into private methods on the object, there's no worry about duplicating code.
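A sketch of that split (the class and method names are illustrative, built on a plain DB-API connection):
class Database:
    def __init__(self, connection):
        self.connection = connection  # any DB-API connection

    def execute_query(self, sql, params=()):
        """Run a SELECT and return the fetched rows."""
        return self._execute(sql, params).fetchall()

    def execute_update(self, sql, params=()):
        """Run an INSERT/UPDATE/DELETE, commit, and return nothing."""
        self._execute(sql, params)
        self.connection.commit()

    def _execute(self, sql, params):
        # Shared plumbing used by both public methods.
        cursor = self.connection.cursor()
        cursor.execute(sql, params)
        return cursor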
As for returning query results, it'll depend on whether you're loading all the results from the database before returning, or returning a cursor object. I'd be tempted to do something like the following:
with db.executeQuery('SELECT * FROM my_table') as results:
    for row in results:
        print row['col1'], row['col2']
... so the executeQuery method returns a context manager object (which cleans up any open connections, if needed) that also acts as a generator, and the rows from the generator act as read-only dicts.
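One way to sketch that wrapper, using contextlib and sqlite3 (the Database class is made up for illustration; sqlite3.Row already gives read-only, dict-style rows):
import sqlite3
from contextlib import contextmanager

class Database:
    def __init__(self, path):
        self.path = path

    @contextmanager
    def executeQuery(self, sql, params=()):
        # Hand back an iterator of dict-like rows, and close the
        # connection when the caller's with-block exits.
        conn = sqlite3.connect(self.path)
        conn.row_factory = sqlite3.Row
        try:
            yield conn.execute(sql, params)
        finally:
            conn.close()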
