__iter__() implemented as a generator - python

I have an object subclass which implements a dynamic dispatch __iter__ using a caching generator (I also have a method for invalidating the iter cache), like so:
def __iter__(self):
    print("iter called")
    if self.__iter_cache is None:
        iter_seen = {}
        iter_cache = []
        for name in self.__slots:
            value = self.__slots[name]
            iter_seen[name] = True
            item = (name, value)
            iter_cache.append(item)
            yield item
        for d in self.__dc_list:
            for name, value in iter(d):
                if name not in iter_seen:
                    iter_seen[name] = True
                    item = (name, value)
                    iter_cache.append(item)
                    yield item
        self.__iter_cache = iter_cache
    else:
        print("iter cache hit")
        for item in self.__iter_cache:
            yield item
It seems to be working... Are there any gotchas I may not be aware of? Am I doing something ridiculous?

container.__iter__() returns an iterator object. The iterator objects themselves are required to support the two following methods, which together form the iterator protocol:
iterator.__iter__()
Returns the iterator object itself.
iterator.next()
Return the next item from the container.
That's exactly what every generator has, so don't be afraid of any side-effects.
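A quick way to convince yourself (a minimal demonstration, not from the original answer):

def gen():
    yield 1
    yield 2

g = gen()
assert iter(g) is g    # __iter__ returns the generator itself
assert next(g) == 1    # next() / .next() returns successive items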

It seems like a very fragile approach. It is enough to change any of __slots, __dc_list, __iter_cache during active iteration to put the object into an inconsistent state.
You either need to forbid changing the object during iteration, or generate all cache items at once and return a copy of the list.

It might be better to separate the iteration of the object from the caching of the values it returns. That would simplify the iteration process and allow you to easily control how the caching is accomplished as well as whether it is enabled or not, for example.
Another possibly important consideration is the fact that your code would not predictably handle the situation where the object being iterated over gets changed between successive calls to the method. One simple way to deal with that would be to populate the cache's contents completely on the first call, and then just yield what it contains for each call -- and document the behavior.
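For example, a minimal sketch of that fill-the-cache-first approach, reusing the question's __slots / __dc_list names (their exact types aren't shown, so the dict-like/iterable assumptions below are mine):

def __iter__(self):
    if self.__iter_cache is None:
        cache, seen = [], set()
        for name in self.__slots:          # assumed dict-like
            seen.add(name)
            cache.append((name, self.__slots[name]))
        for d in self.__dc_list:           # assumed iterable of (name, value) pairs
            for name, value in d:
                if name not in seen:
                    seen.add(name)
                    cache.append((name, value))
        self.__iter_cache = cache
    # hand out a copy so callers can't corrupt the cache mid-iteration
    return iter(list(self.__iter_cache))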

What you're doing is valid, albeit weird. What is a __slots or a __dc_list? Generally it's better to describe the contents of your object in an attribute name, rather than its type (e.g. self.users rather than self.u_list).
You can use my LazyProperty decorator to simplify this substantially.
Just decorate your method with @LazyProperty. It will be called the first time, and the decorator will then replace the attribute with the result. The only requirement is that the value is repeatable, i.e. it doesn't depend on mutable state. You also have that requirement in your current code, with your self.__iter_cache.
def __iter__(self):
    return iter(self.__iter)

@LazyProperty
def __iter(self):
    def my_generator():
        yield whatever
    return tuple(my_generator())
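The decorator itself isn't shown in the answer; one common recipe matching the described behavior (compute once, reuse on later accesses) looks roughly like this -- a sketch, not necessarily the answerer's exact implementation:

def LazyProperty(fn):
    """Compute the value on first access, then cache it on the instance."""
    attr_name = '_lazy_' + fn.__name__

    @property
    def _lazy(self):
        if not hasattr(self, attr_name):
            setattr(self, attr_name, fn(self))
        return getattr(self, attr_name)
    return _lazy

This variant caches under a private attribute rather than literally replacing the decorated attribute, but the effect is the same: the wrapped method runs only once per instance.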

Related

What is the proper way to make an object with unpickable fields pickable?

For me, what I do is detect what is unpickable and make it into a string (I guess I could have deleted it too, but then it would falsely tell me that the field didn't exist, and I'd rather have it exist but be a string). But I wanted to know if there was a less hacky, more official way to do this.
Current code I use:
import argparse
from argparse import Namespace
from typing import Any

import dill

def make_args_pickable(args: Namespace) -> Namespace:
    """
    Returns a copy of the args namespace but with unpickable objects as strings.

    note: implementation not tested against deep copying.
    ref:
        - https://stackoverflow.com/questions/70128335/what-is-the-proper-way-to-make-an-object-with-unpickable-fields-pickable
    """
    pickable_args = argparse.Namespace()
    # - go through fields in args; if a field is not pickable, make it a string, else leave it as is
    # (vars() returns the __dict__ attribute of the given object)
    for field in vars(args):
        field_val: Any = getattr(args, field)
        if not dill.pickles(field_val):
            field_val = str(field_val)
        setattr(pickable_args, field, field_val)
    return pickable_args
Context: I think I do it mostly to remove the annoying tensorboard object I carry around (but I don't think I will need the .tb field anymore thanks to wandb/weights and biases). Not that this matters a lot but context is always nice.
Related:
What does it mean for an object to be picklable (or pickle-able)?
Python - How can I make this un-pickleable object pickleable?
Edit:
Since I decided to move away from dill (sometimes it cannot recover classes/objects, probably because it cannot save their code or something), I decided to only use pickle (which seems to be the recommended way in PyTorch).
So what is the official (perhaps optimized) way to check for pickables without dill or with the official pickle?
Is this the best:
def is_picklable(obj):
    try:
        pickle.dumps(obj)
    except pickle.PicklingError:
        return False
    return True
Thus, my current solution:
def make_args_pickable(args: Namespace) -> Namespace:
    """
    Returns a copy of the args namespace but with unpickable objects as strings.

    note: implementation not tested against deep copying.
    ref:
        - https://stackoverflow.com/questions/70128335/what-is-the-proper-way-to-make-an-object-with-unpickable-fields-pickable
    """
    pickable_args = argparse.Namespace()
    # - go through fields in args; if a field is not pickable, make it a string, else leave it as is
    # (vars() returns the __dict__ attribute of the given object)
    for field in vars(args):
        field_val: Any = getattr(args, field)
        # - if the current field value is not pickable, make it pickable by casting to string
        if not dill.pickles(field_val):
            field_val = str(field_val)
        elif not is_picklable(field_val):
            field_val = str(field_val)
        # - after this line the invariant is that field_val is pickable, so set it in the new args obj
        setattr(pickable_args, field, field_val)
    return pickable_args

def make_opts_pickable(opts):
    """Makes a namespace pickable."""
    return make_args_pickable(opts)

def is_picklable(obj: Any) -> bool:
    """
    Checks if something is pickable.

    Ref:
        - https://stackoverflow.com/questions/70128335/what-is-the-proper-way-to-make-an-object-with-unpickable-fields-pickable
    """
    import pickle
    try:
        pickle.dumps(obj)
    except pickle.PicklingError:
        return False
    return True
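For illustration, a hypothetical usage sketch (a file handle stands in for the question's unpicklable tensorboard .tb object; the attribute names here are made up):

import argparse
import pickle

args = argparse.Namespace(lr=0.1, tb=open('log.txt', 'w'))  # pickle can't dump file handles
safe_args = make_args_pickable(args)
pickle.dumps(safe_args)  # now succeeds
print(safe_args.tb)      # the str() form of the original handle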
Note: one of the reasons I want something "official"/tested is because PyCharm halts on the try/except: How to stop PyCharm's break/stop/halt feature on handled exceptions (i.e. only break on python unhandled exceptions)? That is not what I want; I want it to only halt on unhandled exceptions.
What is the proper way to make an object with unpickable fields pickable?
I believe the answer to this belongs in the question you linked -- Python - How can I make this un-pickleable object pickleable?. I've added a new answer to that question explaining how you can make an unpicklable object picklable the proper way, without using __reduce__.
So what is the official (perhaps optimized) way to check for pickables without dill or with the official pickle?
Objects that are picklable are defined in the docs as follows:
None, True, and False
integers, floating point numbers, complex numbers
strings, bytes, bytearrays
tuples, lists, sets, and dictionaries containing only picklable objects
functions defined at the top level of a module (using def, not lambda)
built-in functions defined at the top level of a module
classes that are defined at the top level of a module
instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section Pickling Class Instances for details).
The tricky parts are (1) knowing how functions/classes are defined (you can probably use the inspect module for that) and (2) recursing through objects, checking against the rules above.
There are a lot of caveats to this, such as the pickle protocol versions, whether the object is an extension type (defined in a C extension like numpy, for example) or an instance of a 'user-defined' class. Usage of __slots__ can also impact whether an object is picklable or not (since __slots__ means there's no __dict__), but such objects can still be pickled with __getstate__. Some objects may also be registered with a custom function for pickling, so you'd need to know if that has happened as well.
Technically, you can implement a function to check for all of this in Python, but it will be quite slow by comparison. The easiest (and probably most performant, as pickle is implemented in C) way to do this is to simply attempt to pickle the object you want to check.
I tested this with PyCharm pickling all kinds of things... it doesn't halt with this method. The key is that you must anticipate pretty much any kind of exception (see footnote 3 in the docs). The warnings are optional, they're mostly explanatory for the context of this question.
import pickle
import warnings
from typing import Any

def is_picklable(obj: Any) -> bool:
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, pickle.PickleError, AttributeError, ImportError):
        # https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled
        return False
    except RecursionError:
        warnings.warn(
            f"Could not determine if object of type {type(obj)!r} is picklable "
            "due to a RecursionError that was suppressed. "
            "Setting a higher recursion limit MAY allow this object to be pickled"
        )
        return False
    except Exception as e:
        # https://docs.python.org/3/library/pickle.html#id9
        warnings.warn(
            f"An error occurred while attempting to pickle "
            f"object of type {type(obj)!r}. Assuming it's unpicklable. The exception was {e}"
        )
        return False
Using the example from my other answer I linked above, you could make your object picklable by implementing __getstate__ and __setstate__ (or subclassing and adding them, or making a wrapper class) adapting your make_args_pickable...
class Unpicklable:
    """
    A simple marker class so we can distinguish when a deserialized object
    is a string because it was originally unpicklable
    (and not simply a string to begin with)
    """
    def __init__(self, obj_str: str):
        self.obj_str = obj_str

    def __str__(self):
        return self.obj_str

    def __repr__(self):
        return f'Unpicklable(obj_str={self.obj_str!r})'


class PicklableNamespace(Namespace):
    def __getstate__(self):
        """For serialization"""
        # always make a copy so you don't accidentally modify state
        state = self.__dict__.copy()
        # Any unpicklables will be converted to an ``Unpicklable`` object
        # with its str format stored in the object
        for key, val in state.items():
            if not is_picklable(val):
                state[key] = Unpicklable(str(val))
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)  # or leave unimplemented
In action, I'll pickle a namespace whose attributes contain a file handle (normally not picklable) and then load the pickle data.
# Normally file handles are not picklable
p = PicklableNamespace(f=open('test.txt'))
data = pickle.dumps(p)
del p
loaded_p = pickle.loads(data)
# PicklableNamespace(f=Unpicklable(obj_str="<_io.TextIOWrapper name='test.txt' mode='r' encoding='cp1252'>"))
Yes, a try/except is the best way to go about this.
Per the docs, pickle is capable of recursively pickling objects, that is to say, if you have a list of objects that are pickleable, it will pickle all objects inside of that list if you attempt to pickle that list. This means that you cannot feasibly test to see if an object is pickleable without pickling it. Because of that, your structure of:
def is_picklable(obj):
    try:
        pickle.dumps(obj)
    except pickle.PicklingError:
        return False
    return True
is the simplest and easiest way to go about checking this. If you are not working with recursive structures and/or you can safely assume that all recursive structures will only contain pickleable objects, you could check the type() value of the object against the list of pickleable objects:
None, True, and False
integers, floating point numbers, complex numbers
strings, bytes, bytearrays
tuples, lists, sets, and dictionaries containing only picklable objects
functions defined at the top level of a module (using def, not lambda)
built-in functions defined at the top level of a module
classes that are defined at the top level of a module
instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section Pickling Class Instances for details).
This is likely faster than using a try:... except:... like you showed in your question.
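A rough sketch of that shallow type check (the names below are made up for illustration; it trusts container contents and ignores the function/class rules, per the assumption above):

PICKLABLE_TYPES = (type(None), bool, int, float, complex,
                   str, bytes, bytearray, tuple, list, set, dict)

def is_probably_picklable(obj):
    # shallow check only: assumes containers hold picklable items
    return isinstance(obj, PICKLABLE_TYPES)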
To me, no matter the error, I want my function to tell me the object is not pickable. So it seems to work if I do this:
def is_picklable(obj: Any) -> bool:
    """
    Checks if something is pickable.

    Ref:
        - https://stackoverflow.com/questions/70128335/what-is-the-proper-way-to-make-an-object-with-unpickable-fields-pickable
        - pycharm halting all the time issue: https://stackoverflow.com/questions/70761481/how-to-stop-pycharms-break-stop-halt-feature-on-handled-exceptions-i-e-only-b
    """
    import pickle
    try:
        pickle.dumps(obj)
    except:  # intentionally bare: any failure means "not pickable"
        return False
    return True
Plus, as an added bonus, it doesn't freak PyCharm out; see How to stop PyCharm's break/stop/halt feature on handled exceptions (i.e. only break on python unhandled exceptions)? for details.

append to request.sessions[list] in Django

Something is bugging me.
I'm following along with this beginner tutorial for django (cs50) and at some point we receive a string back from a form submission and want to add it to a list:
https://www.youtube.com/watch?v=w8q0C-C1js4&list=PLhQjrBD2T380xvFSUmToMMzERZ3qB5Ueu&t=5777s
def add(request):
    if 'tasklist' not in request.session:
        request.session['tasklist'] = []

    if request.method == 'POST':
        form_data = NewTaskForm(request.POST)
        if form_data.is_valid():
            task = form_data.cleaned_data['task']
            request.session['tasklist'] += [task]
            return HttpResponseRedirect(reverse('tasks:index'))
I've checked the type of request.session['tasklist'] and Python shows it's a list.
The task variable is a string.
So why doesn't request.session['tasklist'].append(task) work properly? I can see it being added to the list via some print statements but then it is 'forgotten again' - it doesn't seem to be permanently added to the tasklist.
Why do we use this request.session['tasklist'] += [task] instead?
The only thing I could find is https://ogirardot.wordpress.com/2010/09/17/append-objects-in-request-session-in-django/ but that refers to a site that no longer exists.
The code works fine, but I'm trying to understand why you need to use a different operation and can't / shouldn't use the append method.
Thanks.
The reason why it does not work is because django does not see that you have changed anything in the session by using the append() method on a list that is in the session.
What you are doing here is essentially pulling out the reference to the list and making changes to it without the session backend knowing anything about it. Another way to explain:
The append() method is on the list itself not on the session object
When you call append() on the list you are only talking to the list and the list's parent (the session) has no idea what you guys are doing
When you, however, do an assignment on the session itself, session['whatever'] = 'something', then it knows that something is up and changes are made
So the key here is that you need to operate on the session object directly if you want your changes to be updated automatically
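In practice, either of these works, because both end with an assignment the session can see (a minimal sketch based on the question's view code):

# option 1: reassignment via +=, as in the tutorial
request.session['tasklist'] += [task]

# option 2: mutate a local reference, then assign it back to the session
tasklist = request.session['tasklist']
tasklist.append(task)
request.session['tasklist'] = tasklist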
Django only thinks it needs to save a changed session item if the item got reassigned to the session. See the django session base code: the __setitem__ method contains a self.modified = True statement.
session['list'] += [new_element] adds a new list item (it mutates the list stored in the session, so the list reference stays the same) and then reassigns the list to the session -> this first triggers a __getitem__ call -> then your += / __iadd__ runs on the value read -> then a __setitem__ call is made (with the list reference passed to it). You can see in the django codebase that it marks the session as modified after each __setitem__ call.
session['list'] = session['list'] + [new_item] does the same thing but creates a new list every time it runs, so it's a bit less efficient; you should not store hundreds of items in the session anyway, though, so you're probably fine. This also works exactly as above.
However, if you use sub-keys in the session, like session['list']['x'] = 'whatever', the session will not see itself as modified, so you need to mark it as modified yourself via request.session.modified = True, as shown below.
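For example:

# the inner dict's __setitem__ fires, but the session's does not,
# so tell Django explicitly that the session changed
request.session['list']['x'] = 'whatever'
request.session.modified = True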
Short answer: It's about how Python chooses to implement the dict data structure.
Long answer:
Let's start by saying that request.session is a dictionary.
Quoting Django's documentation, "By default, Django only saves to the session database when the session has been modified – that is if any of its dictionary values have been assigned or deleted". Link
So, the problem is that the session database is not being modified by
request.session['tasklist'].append(task)
Looking at the relevant parts of Django's session base code (as posted by @Csaba K. in an answer), the variable self.modified is set to True when the __setitem__ dunder method is called.
Now, at this step the problem seems to be that the __setitem__ dunder method is not called with request.session['tasklist'].append(task), but with request.session['tasklist'] += [task] it does get called. It is not about whether the reference of request.session['tasklist'] changes or not, as pointed out by another answer, because the reference to the underlying list remains the same.
To confirm, let's create a custom dictionary which extends the Python dict, and print something when setitem dunder method is called.
class MyDict(dict):
    def __init__(self, globalVar):
        super().__init__()
        self.globalVar = globalVar

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        print("Called Set item when: ", end="")

myDict = MyDict(0)
print("Creating Dict")
print("-----")
myDict["y"] = []
print("Adding a new key-value pair")
print("-----")
myDict["y"] += ["x"]
print(" using +=")
print("-----")
myDict["y"].append("x")
print("append")
print("-----")
myDict["y"].extend(["x"])
print("extend")
print("-----")
myDict["y"] = myDict["y"] + ["x"]
print(" using +")
print("-----")
It prints:
Creating Dict
-----
Called Set item when: Adding a new key-value pair
-----
Called Set item when: using +=
-----
append
-----
extend
-----
Called Set item when: using +
-----
As we can see, the __setitem__ dunder method is called (and in turn self.modified is set to True) only when adding a new key-value pair or when using += or +, but not when initializing, appending or extending an iterable (in this case a list). Now, the operators + and += do very different things in Python, as explained in the other answer. += behaves more like the append method, but in this case I guess it's more about how Python chooses to implement the dict data structure than about how +, += and append behave on lists.
I found this while doing some more searching:
https://code.djangoproject.com/wiki/NewbieMistakes
Scroll to 'Appending to a list in session doesn't work'
Again, it is a very dated entry but still seems to hold true.
Not completely satisfied because this does not answer the question as to 'why' this doesn't work, but at the very least confirms 'something's up' and you should probably still use the recommendations there.
(if anyone out there can actually explain this in a more verbose manner then I'd be happy to hear it)

Static methods for recursive functions within a class?

I'm working with nested dictionaries on Python (2.7) obtained from YAML objects and I have a couple of questions that I've been trying to get an answer to by reading, but have not been successful. I'm somewhat new to Python.
One of the simplest functions is one that reads the whole dictionary and outputs a list of all the keys that exist in it. I use an underscore at the beginning since this function is later used by others within a class.
class Myclass(object):
    @staticmethod
    def _get_key_list(d, keylist):
        for key, value in d.iteritems():
            keylist.append(key)
            if isinstance(value, dict):
                Myclass._get_key_list(d.get(key), keylist)
        return list(set(keylist))

    def diff(self, dict2):
        keylist = []
        all_keys1 = self._get_key_list(self.d, keylist)
        all_keys2 = self._get_key_list(dict2, keylist)
        ...  # More code
Question 1: Is this a correct way to do this? I am not sure whether it's good practice to use a static method for this reason. Since self._get_key_list(d, keylist) is recursive, I don't want "self" to be the first argument once the function is recursively called, which is what would happen for a regular instance method.
I have a bunch of static methods that I'm using, but I've read in a lot of places that they could perhaps not be good practice when used a lot. I also thought I could make them module functions, but I wanted them to be tied to the class.
Question 2: Instead of passing the argument keylist to self._get_key_list(d,keylist), how can I initialize an empty list inside the recursive function and update it? Initializing it inside would reset it to [] every time.
I would eliminate keylist as an explicit argument:
def _get_keys(d):
    keyset = set()
    for key, value in d.iteritems():
        keyset.add(key)
        if isinstance(value, dict):
            keyset.update(_get_keys(value))
    return keyset
Let the caller convert the set to a list if they really need a list, rather than an iterable.
Often, there is little reason to declare something as a static method rather than a function outside the class.
If you are concerned about efficiency (e.g., getting lots of repeat keys from a dict), you can go back to threading a single set/list through the calls as an explicit argument, but don't make it optional; just require that the initial caller supply the set/list to update. To emphasize that the second argument will be mutated, just return None when the function returns.
def _get_keys(d, result):
    for key, value in d.iteritems():
        result.add(key)
        if isinstance(value, dict):
            _get_keys(value, result)

result = set()
_get_keys(d1, result)
_get_keys(d2, result)
# etc
There's no good reason to make a recursive function in a class a static method unless it is meant to be invoked outside the context of an instance.
To initialize a parameter, we usually assign a default value to it in the parameter list, but when it needs to be a mutable object, such as an empty list in this case, you need to default it to None and then initialize it inside the function, so that the list reference won't get reused on the next call:
class Myclass(object):
    def _get_key_list(self, d, keylist=None):
        if keylist is None:
            keylist = []
        for key, value in d.iteritems():
            keylist.append(key)
            if isinstance(value, dict):
                self._get_key_list(d.get(key), keylist)
        return list(set(keylist))

    def diff(self, dict2):
        all_keys1 = self._get_key_list(self.d)
        all_keys2 = self._get_key_list(dict2)
        ...  # More code

Twisted inlineCallbacks and remote generators

I have used defer.inlineCallbacks in my code as I find it much easier to read and debug than using addCallbacks.
I am using PB and I have hit a problem when returning data to the client. The data is about 18Mb in size and I get a failed BananaError because of the length of the string being returned.
What I want to do is to write a generator so I can just keep calling the function and return some of the data each time the function is called.
How would I write that with inlineCallbacks already being used? Is it actually possible if I return a value instead? Would something like the following work?
@defer.inlineCallbacks
def getLatestVersions(self):
    returnlist = []
    try:
        latest_versions = yield self.cur.runQuery(
            """SELECT id, filename, path, attributes, MAX(version), deleted,
               snapshot, modified, size, hash, chunk_table, added, isDir,
               isSymlink, enchash from files group by filename, path""")
    except:
        logger.exception("problem querying latest versions")
    for result in latest_versions:
        returnlist.append(result)
        if len(returnlist) >= 10:
            yield returnlist
            returnlist = []
    yield returnlist
A generator function decorated with inlineCallbacks returns a Deferred - not a generator. This is always the case. You can never return a generator from a function decorated with inlineCallbacks.
See the pager classes in twisted.spread.util for ideas about another approach you can take.
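As a rough illustration of the paging idea (this is not the twisted.spread.util pager API; the class and method names below are invented): expose a Referenceable that the client calls repeatedly, so each PB message stays well under Banana's size limit.

from twisted.spread import pb

class VersionPager(pb.Referenceable):
    """Hands out query results one page at a time over PB (hypothetical sketch)."""
    def __init__(self, rows, page_size=100):
        self.rows = rows
        self.offset = 0
        self.page_size = page_size

    def remote_nextPage(self):
        # return the next slice; an empty list tells the client it's done
        page = self.rows[self.offset:self.offset + self.page_size]
        self.offset += len(page)
        return page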

design patterns for python generator/iterator? (backward read/total count)

I am writing a python interface which basically constructs messages from a db row by row and sends the stream to a tcp socket; another thread checks the tcp response, decides if there's an error response, skips certain streams and retries from earlier ones.
Pseudo-code below, PK means PrimaryKey.
It's basically like this
def generate_msg(pk_start, pk_stop):
    for x in db.query(pk > pk_start and pk < pk_stop):
        yield pack_to_stream(x)
then the tcp socket send thread is like:
for msg in generate_msg(first_id, last_id):
    socket.send(msg)
The problem is when the tcp socket read thread finds some error in response, the msg's pk is returned, so I need to restart the iterator from the pk
So here's my question:
What's the design pattern for an iterator which can move both forward and backward, especially one working with database row cursors?
Can I get the total count of an iterator up front, without reading the whole list?
What's the general advice for my scenario?
Thanks
Iterators are designed to save memory by dealing with one item at a time, and can potentially produce an unlimited number of items. As a result of their design however, you usually cannot know their length without consuming the whole iterator, and you are normally not expected to be able to steer them.
That said, there is nothing stopping you from making a custom class that can be used both as an iterator and can provide additional functionality. Database cursors are the canonical example of such a class; the cursor can be iterated over to yield rows, but you can also ask it for a rowcount (so the length of the sequence), and get additional information about columns, get multiple rows, or point to a new result set by calling the .execute() method.
If you want to build a custom class that acts as an iterator, you need to give it an __iter__() method. You either make this method a generator (by using the yield statement), or just return self and give your class a .next() method; the latter is expected to return one item (do not use yield) or raise StopIteration when no more items can be returned.
You can then add other methods that return length information, or re-set the query to start from a given primary key.
Untested, python-ish code:
class MessagesIterator(object):
    def __init__(self, pk_start, pk_stop):
        self.pk_start, self.pk_stop = pk_start, pk_stop
        self.cursor = db.query("pk>? and pk<?", (pk_start, pk_stop))

    def __iter__(self):
        return self

    def next(self):
        return next(self.cursor)  # raises StopIteration when done

    def length(self):
        return self.cursor.rowcount

    def move_to(self, pk_start):
        # Validate pk_start perhaps
        self.pk_start = pk_start
        self.cursor = db.query("pk>? and pk<?", (self.pk_start, self.pk_stop))
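A hypothetical usage sketch (pack_to_stream and failed_pk follow the question's pseudo-code):

messages = MessagesIterator(first_id, last_id)
print(messages.length())  # total count up front, via the cursor's rowcount

for row in messages:
    socket.send(pack_to_stream(row))

# on an error response, rewind to the failed primary key and resume
messages.move_to(failed_pk)
for row in messages:
    socket.send(pack_to_stream(row))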
