How do I make Python respect iterable fields when multiprocessing?

Apologies if this is a dumb question, but I haven't found an elegant workaround for this issue yet. Basically, when using the concurrent.futures module, non-static methods of classes look like they should work fine: nothing in the docs for the module implies they wouldn't, the module produces no errors when running, and it even produces the expected results in many cases!
However, I've noticed that the module seems to not respect updates to iterable fields made in the parent thread, even when those updates occur before starting any child processes. Here's an example of what I mean:
import concurrent.futures

class Thing:
    data_list = [0, 0, 0]
    data_number = 0

    def foo(self, num):
        return sum(self.data_list) * num

    def bar(self, num):
        return num * self.data_number

if __name__ == '__main__':
    thing = Thing()
    thing.data_list[0] = 1
    thing.data_number = 1
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(thing.foo, range(3))
        print('result of changing list:')
        for result in results:
            print(result)
        results = executor.map(thing.bar, range(3))
        print('result of changing number:')
        for result in results:
            print(result)
I would expect the result here to be
result of changing list:
0
1
2
result of changing number:
0
1
2
but instead I get
result of changing list:
0
0
0
result of changing number:
0
1
2
So for some reason, things work as expected for the field that's just an integer, but not at all as expected for the field that's a list. The implication is that updates made to the list are not respected when the child processes are started, even though updates to the simpler fields are. I've tried this with dicts as well, with the same issue, and I suspect that this is a problem for all iterables.
Is there any way to make this work as expected, allowing for updates to iterable fields to be respected by child processes? It seems bizarre that multiprocessing for non-static methods would be half-implemented like this, but I'm hoping that I'm just missing something!

The problem has nothing to do with "respecting iterable fields", but it is a rather subtle issue. In your main process you have:
thing.data_list[0] = 1 # first assignment
thing.data_number = 1 # second assignment
Rather than:
Thing.data_list[0] = 1 # first assignment
Thing.data_number = 1 # second assignment
As far as the first assignment is concerned, there isn't any material difference because with either version you are not modifying a class attribute but rather an element within a list that happens to be referenced by a class attribute. In other words, Thing.data_list is still pointing to the same list; this reference has not been changed. This is an important distinction.
But in the second assignment, your version of the code assigns through the instance reference rather than through the class. When you do that, you are not actually modifying the class attribute: you are creating a new instance attribute with the same name, data_number, which shadows the class attribute.
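You can see the distinction directly in the parent process (a quick sketch using the Thing class from the question):

thing = Thing()
thing.data_list[0] = 1    # mutates the list that the class attribute references
thing.data_number = 1     # creates a NEW instance attribute that shadows the class one
print(vars(thing))        # {'data_number': 1} -- only the instance attribute lives here
print(Thing.data_list)    # [1, 0, 0] -- the shared class-level list was mutated in place
print(Thing.data_number)  # 0 -- the class attribute itself is untouched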
Your class's member functions foo and bar access these attributes via self. The Thing instance thing is pickled and sent to the new process's address space, but when it is unpickled there, the class itself is re-created, so by default the class attributes take the values from the class definition unless you add special pickle rules. Instance attributes, however, are transmitted as part of the pickled instance, such as your newly created data_number. And that's why 'result of changing number:' prints out as you expected: in bar you are actually accessing the instance attribute data_number.
Change bar to the following and you will see that everything will print out as 0:
def bar(self, num):
    return num * Thing.data_number
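If you want the list updates to be visible in the worker processes, one approach (a sketch, not part of the original answer) is to make the mutable data instance attributes, so that they travel inside the pickled instance:

import concurrent.futures

class Thing:
    def __init__(self):
        # Instance attributes are part of the pickled instance,
        # so updates made in the parent are visible in the workers.
        self.data_list = [0, 0, 0]
        self.data_number = 0

    def foo(self, num):
        return sum(self.data_list) * num

if __name__ == '__main__':
    thing = Thing()
    thing.data_list[0] = 1
    with concurrent.futures.ProcessPoolExecutor() as executor:
        print(list(executor.map(thing.foo, range(3))))  # [0, 1, 2]

Alternatively, pass the data explicitly as arguments to the mapped function, which avoids relying on pickling details altogether.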

Related

How to keep track of instances of python objects in a reliable way?

I would like to be able to keep track of instances of geometric Point objects in order to know what names are already "taken" when automatically naming a new one.
For instance, if Points named "A", "B" and "C" have been created, then the next automatically named Point is named "D". If Point named "D" gets deleted, or its reference gets lost, then name "D" becomes available again.
The main attributes of my Point objects are defined as properties and are the quite standard x, y and name.
Solution with a problem and a "heavy" workaround
I proceeded as described here, using a weakref.WeakSet(). I added this to my Point class:
# class attribute
instances = weakref.WeakSet()

@classmethod
def names_in_use(cls):
    return {p.name for p in Point.instances}
Problem is, when I instantiate a Point and then delete it, it is most of the time, but not always, removed from Point.instances. I noticed that if I run the test suite (pytest -x -vv -r w) and a certain exception is raised in a test, then the instance never gets deleted (probable explanation somewhat below).
In the following test code, after the first deletion of p it always gets removed from Point.instances, but after the second deletion of p it never does (test results are always the same), and the last assert statement fails:
def test_instances():
    import sys
    p = Point(0, 0, 'A')
    del p
    sys.stderr.write('1 - Point.instances={}\n'.format(Point.instances))
    assert len(Point.instances) == 0
    assert Point.names_in_use() == set()
    p = Point(0, 0, 'A')
    with pytest.raises(TypeError) as excinfo:
        p.same_as('B')
    assert str(excinfo.value) == 'Can only test if another Point is at the ' \
                                 'same place. Got a <class \'str\'> instead.'
    del p
    sys.stderr.write('2 - Point.instances={}\n'.format(Point.instances))
    assert len(Point.instances) == 0
And here is the result:
tests/04_geometry/01_point_test.py::test_instances FAILED
=============================================================================== FAILURES ===============================================================================
____________________________________________________________________________ test_instances ____________________________________________________________________________
    def test_instances():
        import sys
        p = Point(0, 0, 'A')
        del p
        sys.stderr.write('1 - Point.instances={}\n'.format(Point.instances))
        assert len(Point.instances) == 0
        assert Point.names_in_use() == set()
        p = Point(0, 0, 'A')
        with pytest.raises(TypeError) as excinfo:
            p.same_as('B')
        assert str(excinfo.value) == 'Can only test if another Point is at the ' \
                                     'same place. Got a <class \'str\'> instead.'
        del p
        sys.stderr.write('2 - Point.instances={}\n'.format(Point.instances))
>       assert len(Point.instances) == 0
E       assert 1 == 0
E        +  where 1 = len(<_weakrefset.WeakSet object at 0x7ffb986a5048>)
E        +  where <_weakrefset.WeakSet object at 0x7ffb986a5048> = Point.instances

tests/04_geometry/01_point_test.py:42: AssertionError
------------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------------
1 - Point.instances=<_weakrefset.WeakSet object at 0x7ffb986a5048>
2 - Point.instances=<_weakrefset.WeakSet object at 0x7ffb986a5048>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================= 1 failed, 82 passed in 0.36 seconds ==================================================================
Yet the code tested in the caught exception does not create a new Point instance:
def same_as(self, other):
    """Test geometric equality."""
    if not isinstance(other, Point):
        raise TypeError('Can only test if another Point is at the same '
                        'place. Got a {} instead.'.format(type(other)))
    return self.coordinates == other.coordinates
and coordinates are basically:
@property
def coordinates(self):
    return (self._x, self._y)
where _x and _y basically contain numbers.
The reason seems to be (quoting from python's doc):
CPython implementation detail: It is possible for a reference cycle to prevent the reference count of an object from going to zero. In this case, the cycle will be later detected and deleted by the cyclic garbage collector. A common cause of reference cycles is when an exception has been caught in a local variable.
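If the lingering reference does come from such a cycle, forcing a collection makes the test deterministic on CPython (a small sketch using the standard gc module):

import gc

del p
gc.collect()   # collect the exception/traceback cycle that still references the Point
assert len(Point.instances) == 0   # should now pass reliably on CPython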
The workaround
Adding this method to the Point class:
def untrack(self):
    Point.instances.discard(self)
and using myPoint.untrack() before del myPoint (or before losing reference to the Point in another way) seems to solve the problem.
But having to call untrack() each time is quite heavy... in my tests there are a lot of Points I would need to "untrack" just to ensure all names are available, for instance.
Question
Is there any better way to keep track of these instances? (Either by improving the tracking method used here, or by any other, better means.)
Don't try to track available names based on all Point objects that exist in the entire program. Predicting what objects will exist and when objects will cease to exist is difficult and unnecessary, and it will behave very differently on different Python implementations.
First, why are you trying to enforce Point name uniqueness at all? If, for example, you're drawing a figure in some window and you don't want two points with the same label in the same figure, then have the figure track the points in it and reject a new point with a taken name. This also makes it easy to explicitly remove points from a figure, or have two figures with independent point names. There are a number of other contexts where a similar explicit container object may be reasonable.
If these are free-floating points not attached to some geometry environment, then why name them at all? If I want to represent a point at (3.5, 2.4), I don't care whether I name it A or B or Bob, and I certainly don't want a crash because some other code somewhere halfway across the program decided to call their point Bob too. Why do names or name collisions matter?
I don't know what your use case is, but for most I can imagine, it'd be best to either only enforce name uniqueness within an explicit container, or not enforce name uniqueness at all.
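To make the container idea concrete, here is a minimal sketch (the Figure class and its method names are hypothetical, not from the original question): the figure owns its points, names are allocated and released explicitly, and nothing depends on garbage-collection timing.

import string

class Point:
    def __init__(self, x, y, name):
        self.x, self.y, self.name = x, y, name

class Figure:
    def __init__(self):
        self._points = {}  # name -> Point

    def add_point(self, x, y, name=None):
        if name is None:
            # pick the first free capital letter
            name = next(c for c in string.ascii_uppercase
                        if c not in self._points)
        if name in self._points:
            raise ValueError('name {!r} is already taken in this figure'.format(name))
        point = Point(x, y, name)
        self._points[name] = point
        return point

    def remove_point(self, name):
        # removal is explicit, so the name becomes available again
        del self._points[name]

Here "D" becomes available again as soon as remove_point('D') is called, with no reliance on when, or whether, the object is actually garbage-collected.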

Undo in Python with a very large state. Is it possible?

This appears simple, but I can't find a good solution.
It's the old 'pass by reference'/ 'pass by value' / 'pass by object reference' problem. I understand what is happening, but I can't find a good work around.
I am aware of solutions for small problems, but my state is very large and extremely expensive to save/ recalculate. Given these constraints, I can't find a solution.
Here is some simple pseudocode to illustrate what I would like to do (if Python would let me pass by reference):
class A:
    def __init__(self, x):
        self.g = x
        self.changes = []

    def change_something(self, what, new):  # I want to pass 'what' by reference
        old = what                          # and then de-reference it here to read the value
        self.changes.append([what, old])    # store a reference
        what = new                          # dereference and change the value

    def undo_changes(self):
        for c in self.changes:
            c[0] = c[1]                     # dereference and restore the old value
Edit: adding some more pseudocode to show how I would like to use the above:
test=A(1) # initialise test.g as 1
print(test.g)
out: 1
test.change_something(test.g,2)
# if my imaginary code above functioned as described in its comments,
# this would change test.g to 2 and store the old value in the list of changes
print(test.g)
out: 2
test.undo_changes()
print(test.g)
out: 1
Obviously the above code doesn't work in Python, due to its being 'pass by object reference'. Also, I'd like to be able to undo a single change, not just all of them as in the code above.
The thing is... I can't seem to find a good work around. There are solutions out there like these:
Do/Undo using command pattern in Python
making undo in python
Both involve storing a stack of commands. 'Undo' then means removing the last command and re-building the final state by taking the initial state and re-applying everything but the last command. My state is too large for this to be feasible; the issues are:
The state is very large. Saving it entirely is prohibitively expensive.
'Do' operations are costly (making recalculating from a saved state infeasible).
Do operations are also non-deterministic, relying on random input.
Undo operations are very frequent.
I have one idea, which is to ensure that EVERYTHING is stored in lists, and writing my code so that everything is stored, read from and written to these lists. Then in the code above I can pass the list name and list index every time I want to read/write a variable.
Essentially this amounts to building my own memory architecture and C-style pointer system within Python!
This works, but seems a little... ridiculous? Surely there is a better way?
Please check if this helps:
class A:
    def __init__(self, x):
        self.g = x
        self.changes = {}
        # or make the key anything immutable of your choice
        self.changes[str(x)] = {'init': x, 'old': x, 'new': x}

    def change_something(self, what, new):
        self.changes[what]['new'] = new      # record the changed value in your dict
        what = new                           # note: this only rebinds the local name

    def undo_changes(self, what):
        old = self.changes[what]['old']      # retrieve the old value
        self.changes[what]['new'] = old      # revert the latest new value, undoing the change
For each change you can update the changes dictionary. The only thing you have to figure out is how to create an entry for what as a key in self.changes; I just made it str(x). Check the type of what and decide how to make it a key in your case.
Okay, so I have come up with an answer... but it's ugly! I doubt it's the best solution. It uses exec(), which I am told is bad practice and to be avoided if at all possible. EDIT: see below!
Old code using exec():
class A:
    def __init__(self, x):
        self.old = 0
        self.g = x
        self.h = x * 10
        self.changes = []

    def change_something(self, what, new):
        whatstr = 'self.' + what
        exec('self.old=' + whatstr)
        self.changes.append([what, self.old])
        exec(whatstr + '=new')

    def undo_changes(self):
        for c in self.changes:
            exec('self.' + c[0] + '=c[1]')

    def undo_last_change(self):
        c = self.changes[-1]
        exec('self.' + c[0] + '=c[1]')
        self.changes.pop()
Thanks to barny, here's a much nicer version using getattr and setattr:
class A:
    def __init__(self, x):
        self.g = x
        self.h = x * 10
        self.changes = []

    def change_something(self, what, new):
        self.changes.append([what, getattr(self, what)])
        setattr(self, what, new)

    def undo_changes(self):
        for c in self.changes:
            setattr(self, c[0], c[1])

    def undo_last_change(self):
        c = self.changes[-1]
        setattr(self, c[0], c[1])
        self.changes.pop()
To demonstrate, the input:
print("demonstrate changing one value")
b=A(1)
print('g=',b.g)
b.change_something('g',2)
print('g=',b.g)
b.undo_changes()
print('g=',b.g)
print("\ndemonstrate changing two values and undoing both")
b.change_something('h',3)
b.change_something('g',4)
print('g=', b.g, 'h=',b.h)
b.undo_changes()
print('g=', b.g, 'h=',b.h)
print("\ndemonstrate changing two values and undoing one")
b.change_something('h',30)
b.change_something('g',40)
print('g=', b.g, 'h=',b.h)
b.undo_last_change()
print('g=', b.g, 'h=',b.h)
returns:
demonstrate changing one value
g= 1
g= 2
g= 1
demonstrate changing two values and undoing both
g= 4 h= 3
g= 1 h= 10
demonstrate changing two values and undoing one
g= 40 h= 30
g= 1 h= 30
EDIT 2: Actually, after further testing, my initial version with exec() has some advantages over the second. If the class contains another class, or a list, or whatever, the exec() version has no trouble updating a list within a class within a class; the second version, however, will fail.
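Following up on that edit: you can keep the getattr/setattr style and still reach nested attributes by walking a dotted path (a sketch, not from the original answers; get_by_path and set_by_path are hypothetical helper names):

from functools import reduce

def get_by_path(obj, path):
    # 'inner.g' -> getattr(getattr(obj, 'inner'), 'g')
    return reduce(getattr, path.split('.'), obj)

def set_by_path(obj, path, value):
    parts = path.split('.')
    parent = reduce(getattr, parts[:-1], obj)
    setattr(parent, parts[-1], value)

class A:
    def __init__(self, x):
        self.g = x
        self.changes = []

    def change_something(self, what, new):
        self.changes.append([what, get_by_path(self, what)])
        set_by_path(self, what, new)

    def undo_last_change(self):
        what, old = self.changes.pop()
        set_by_path(self, what, old)

With this, change_something('inner.g', 5) records and restores the attribute g of a nested object stored in self.inner, which plain getattr/setattr on self cannot do.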

Python returns wrong answer for list inclusion

I have a class with some methods, and a dict of dicts E which is shared between threads. I use threading.Lock (instantiated as Elock) to read from and write to E each time I need to.
One of the methods within the class looks like this:
Elock.acquire()
# print self.Num, E[key].keys()
if self.Num not in E[key].keys():
    print '\n\nDisappeared!\n\n', self.Num, E[key].keys()
    # DO STUFF
    Elock.release()
    return
else:
    Elock.release()
What is really blowing my mind is that I will get
Disappeared!
17171875.0 [17175000.0, 17171875.0]
When I uncomment the print command before the conditional, I get what I'm expecting:
17171875.0 [17175000.0, 17171875.0]
As you can see, in both cases self.Num is in E[key].keys(). Why is the conditional returning True and entering the if clause?
Sometimes self.Num will be a float and the corresponding element in E[key].keys() will be an int. This should not be an issue anyway, and I don't think it is causing my problem. I tried writing
if float(self.Num) not in [float(x) for x in E[key].keys()]:
A commenter also suggested I not use floats, so I tried this:
if int(self.Num) not in [int(x) for x in E[key].keys()]:
but in both cases the behavior does not change.
Most puzzling is that IT ONLY BREAKS SOMETIMES! It works normally most of the time, but seems to return the wrong answer if the number ends with 5.0 (at least, that is the only time I have seen it break).

Modify *existing* variable in `locals()` or `frame.f_locals`

I have found some vaguely related questions to this question, but not any clean and specific solution for CPython. And I assume that a "valid" solution is interpreter specific.
First the things I think I understand:
locals() gives a non-modifiable dictionary.
A function may (and indeed does) use some kind of optimization to access its local variables
frame.f_locals gives a locals() like dictionary, but less prone to hackish things through exec. Or at least I have been less able to do hackish undocumented things like the locals()['var'] = value ; exec ""
exec is capable of doing weird things to the local variables, but it is not reliable --e.g. I read somewhere that it doesn't work in Python 3. Haven't tested.
So I understand that, given those limitations, it will never be safe to add extra variables to the locals, because it breaks the interpreter structure.
However, it should be possible to change an already-existing variable, shouldn't it?
Things that I considered
In a function f, one can access the f.func_code.co_nlocals and f.func_code.co_varnames.
In a frame, the variables can be accessed / checked / read through the frame.f_locals. This is in the use case of setting a tracer through sys.settrace.
One can easily access the function in which a frame is, considering the use case of setting a trace and using it to "do things" with the local variables given a certain trigger or whatever.
The variables should be somewhere, preferably writeable... but I am not capable of finding them. Even if it is an array (for efficient interpreter access), or I need some extra C-specific wiring, I am ready to commit to it.
How can I achieve that modification of variables from a tracer function or from a decorated wrapped function or something like that?
A full solution will of course be appreciated, but even some pointers will help me greatly, because I'm stuck here with lots of non-writeable dictionaries :-/
Edit: Hackish exec is doing things like this or this
There exists an undocumented C-API call for doing things like that:
PyFrame_LocalsToFast
There is some more discussion in this PyDev blog post. The basic idea seems to be:
import ctypes
...
frame.f_locals.update({
    'a': 'newvalue',
    'b': other_local_value,
})
ctypes.pythonapi.PyFrame_LocalsToFast(
    ctypes.py_object(frame), ctypes.c_int(0))
I have yet to test if this works as expected.
Note that there might be some way to access the Fast directly, to avoid an indirection if the requirement is only modification of an existing variable. But as this seems to be a mostly undocumented API, the source code is the documentation resource.
Based on the notes from MariusSiuram, I wrote a recipe that shows the behavior.
The conclusions are:
we can modify an existing variable
we can delete an existing variable
we can NOT add a new variable.
So, here is the code:
import inspect
import ctypes

def parent():
    a = 1
    z = 'foo'

    print('- Trying to add a new variable ---------------')
    hack(case=0)  # just try to add a new variable 'b'
    print(a)
    print(z)
    assert a == 1
    assert z == 'foo'
    try:
        print(b)
        assert False  # never is going to reach this point
    except NameError, why:
        print("ok, global name 'b' is not defined")

    print('- Trying to remove an existing variable ------')
    hack(case=1)
    print(a)
    assert a == 2
    try:
        print(z)
    except NameError, why:
        print("ok, we've removed the 'z' var")

    print('- Trying to update an existing variable ------')
    hack(case=2)
    print(a)
    assert a == 3

def hack(case=0):
    frame = inspect.stack()[1][0]
    if case == 0:
        frame.f_locals['b'] = "don't work"
    elif case == 1:
        frame.f_locals.pop('z')
        frame.f_locals['a'] += 1
    else:
        frame.f_locals['a'] += 1
    # passing c_int(1) will remove and update variables as well
    # passing c_int(0) will only update
    ctypes.pythonapi.PyFrame_LocalsToFast(
        ctypes.py_object(frame),
        ctypes.c_int(1))

if __name__ == '__main__':
    parent()
The output would be like:
- Trying to add a new variable ---------------
1
foo
ok, global name 'b' is not defined
- Trying to remove an existing variable ------
2
foo
- Trying to update an existing variable ------
3

Using singleton as a counter

I have an automation test which uses a function that saves screenshots to a folder. This function is called by multiple screenshot instances. On every test run a new folder is created, so I don't care about counter reset. In order to reflect the order in which these screenshots are taken, I had to come up with names that could be sorted by that order. This is my solution:
def make_screenshot_file(file_name):
    order = Counter().count
    test_suites_path = _make_job_directory()
    return make_writable_file(os.path.join(test_suites_path, 'screenshot', file_name % order))

class Counter():
    __counter_instance = None

    def __init__(self):
        if Counter.__counter_instance is None:
            self.count = 1
            Counter.__counter_instance = self
        else:
            Counter.__counter_instance.count += 1
            self.count = Counter.__counter_instance.count
It works fine for me. But I keep thinking that there should be an easier way to solve this problem. Is there? And if singleton is the only way, could my code be optimized in any way?
What you're trying to do here is simulate a global variable.
There is no good reason to do that. If you really want a global variable, make it explicitly a global variable.
You could create a simple Counter class that increments count by 1 each time you access it, and then create a global instance of it. But the standard library already gives you something like that for free, in itertools.count, as DSM explains in a comment.
So:
import itertools

_counter = itertools.count()

def make_screenshot_file(file_name):
    order = next(_counter)
    test_suites_path = _make_job_directory()
    return make_writable_file(os.path.join(test_suites_path, 'screenshot', file_name % order))
I'm not sure why you're so worried about how much storage or time this takes up, because I can't conceive of any program where it could possibly matter whether you were using 8 bytes or 800 for a single object you could never have more than one of, or whether it took 3ns or 3us to access it when you only do so a handful of times.
But if you are worried, as you can see from the source, count is implemented in C, it's pretty memory-efficient, and if you don't do anything fancy with it, it comes down to basically a single PyNumber_Add to generate each number, which is a lot less than interpreting a few lines of code.
Since you asked, here's how you could radically simplify your existing code by using a _count class attribute instead of a __counter_instance class attribute:
class Counter():
    _count = 0

    def count(self):
        Counter._count += 1
        return Counter._count
Of course now you have to call Counter().count() instead of just Counter().count—but you can fix that trivially with @property if it matters.
It's worth pointing out that it's a really bad idea to use a classic class instead of a new-style class (which is what you get here by passing nothing inside the parens; if you do want a classic class, you should leave the parens off), that most Python programmers will associate the name Counter with the class collections.Counter, and that there's no reason count couldn't be a @classmethod or @staticmethod… at which point this is exactly Andrew T.'s answer. Which, as he points out, is much simpler than what you're doing, and no more or less Pythonic.
But really, all of this is no better than just making _count a module-level global and adding a module-level count() function that increments and returns it.
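For comparison, the module-level version that last sentence describes might look like this (a sketch; the module name is hypothetical):

# screenshot_counter.py -- a hypothetical module
_count = 0

def count():
    global _count
    _count += 1
    return _count

Callers then just do import screenshot_counter and call screenshot_counter.count(), with no class machinery at all.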
why not just do
order = time.time()
or do something like
import glob  # glob is used for unix-like path expansion
order = len(glob.glob(os.path.join(test_suites_path, "screenshot", "%s*" % file_name)))
Using static methods and variables. Not very pythonic, but simpler.
def make_screenshot_file(file_name):
    order = Counter.count()  # Note the move of the parens
    test_suites_path = _make_job_directory()
    return make_writable_file(os.path.join(test_suites_path, 'screenshot', file_name % order))

class Counter():
    count_n = 0

    @staticmethod
    def count():
        Counter.count_n += 1
        return Counter.count_n
print Counter.count()
print Counter.count()
print Counter.count()
print Counter.count()
print Counter.count()
atarzwell@freeman:~/src$ python so.py
1
2
3
4
5
Well, you can use this solution; just make sure you never pass the order kwarg explicitly!
Mutable kwarg defaults in functions work like global variables: the value isn't reset to the default between calls, as you might think at first!
def make_screenshot_file(file_name, order=[0]):
    order[0] = order[0] + 1
    test_suites_path = _make_job_directory()
    return make_writable_file(os.path.join(test_suites_path, 'screenshot', file_name % order[0]))
