Persistent references in Python

Persistent references in Python - python

I would like my program to store datas for later uses. Until now, not any problem: there is much ways of doing this in Python.
Things get a little more complicated because I want to keep references between instances. If a list X is a list Y (they have the same ID, modify one is modify the other), it should be true the next time I load the datas (another session of the program which has stopped in the meantime).
I know a solution : the pickle module keeps tracks of references and will remember that my X and Y lists are exactly the same (not only their contents, but their references).
Still, the problem using pickle is that it works if you dump every data in a single file. Which is not really clever if you have a large amount of data.
Do you know another way to handle this problem?

The simplest thing to do is probably to wrap up all your state you wish to save in a dictionary (keyed by variable name, perhaps, or some other unique but predictable identifier), then pickle and unpickle that dictionary. The objects within the dictionary will share references between one another like you want:
>>> class X(object):
... # just some object to be pickled
... pass
...
>>> l1 = [X(), X(), X()]
>>> l2 = [l1[0], X(), l1[2]]
>>> state = {'l1': l1, 'l2': l2}
>>> saved = pickle.dumps(state)
>>> restored = pickle.loads(saved)
>>> restored['l1'][0] is restored['l2'][0]
True
>>> restored['l1'][1] is restored['l2'][1]
False

I would recommand using shelve over pickle. It has higher level functionnality, and is simpler to use.
http://docs.python.org/library/shelve.html
If you have performance issue because you manipulate very large amount of data, you may try other librairies like pyTables:
http://www.pytables.org/moin

ZODB is developed to save persistent python objects and all references. Just inherit your class from Persistent and have a fun. http://www.zodb.org/

Related

Is there any usage of self-referential lists or circular reference in list, eg. appending a list to itself

So if I have a list a and append a to it, I will get a list that contains it own reference.
>>> a = [1,2]
>>> a.append(a)
>>> a
[1, 2, [...]]
>>> a[-1][-1][-1]
[1, 2, [...]]
And this basically results in seemingly infinite recursions.
And not only in lists, dictionaries as well:
>>> b = {'a':1,'b':2}
>>> b['c'] = b
>>> b
{'a': 1, 'b': 2, 'c': {...}}
It could have been a good way to store the list in last element and modify other elements, but that wouldn't work as the change will be seen in every recursive reference.
I get why this happens, i.e. due to their mutability. However, I am interested in actual use-cases of this behavior. Can somebody enlighten me?

The use case is that Python is a dynamically typed language, where anything can reference anything, including itself.
List elements are references to other objects, just like variable names and attributes and the keys and values in dictionaries. The references are not typed, variables or lists are not restricted to only referencing, say, integers or floating point values. Every reference can reference any valid Python object. (Python is also strongly typed, in that the objects have a specific type that won't just change; strings remain strings, lists stay lists).
So, because Python is dynamically typed, the following:
foo = []
# ...
foo = False
is valid, because foo isn't restricted to a specific type of object, and the same goes for Python list objects.
The moment your language allows this, you have to account for recursive structures, because containers are allowed to reference themselves, directly or indirectly. The list representation takes this into account by not blowing up when you do this and ask for a string representation. It is instead showing you a [...] entry when there is a circular reference. This happens not just for direct references either, you can create an indirect reference too:
>>> foo = []
>>> bar = []
>>> foo.append(bar)
>>> bar.append(foo)
>>> foo
[[[...]]]
foo is the outermost [/]] pair and the [...] entry. bar is the [/] pair in the middle.
There are plenty of practical situations where you'd want a self-referencing (circular) structure. The built-in OrderedDict object uses a circular linked list to track item order, for example. This is not normally easily visible as there is a C-optimised version of the type, but we can force the Python interpreter to use the pure-Python version (you want to use a fresh interpreter, this is kind-of hackish):
>>> import sys
>>> class ImportFailedModule:
... def __getattr__(self, name):
... raise ImportError
...
>>> sys.modules["_collections"] = ImportFailedModule() # block the extension module from being loaded
>>> del sys.modules["collections"] # force a re-import
>>> from collections import OrderedDict
now we have a pure-python version we can introspect:
>>> od = OrderedDict()
>>> vars(od)
{'_OrderedDict__hardroot': <collections._Link object at 0x10a854e00>, '_OrderedDict__root': <weakproxy at 0x10a861130 to _Link at 0x10a854e00>, '_OrderedDict__map': {}}
Because this ordered dict is empty, the root references itself:
>>> od._OrderedDict__root.next is od._OrderedDict__root
True
just like a list can reference itself. Add a key or two and the linked list grows, but remains linked to itself, eventually:
>>> od["foo"] = "bar"
>>> od._OrderedDict__root.next is od._OrderedDict__root
False
>>> od._OrderedDict__root.next.next is od._OrderedDict__root
True
>>> od["spam"] = 42
>>> od._OrderedDict__root.next.next is od._OrderedDict__root
False
>>> od._OrderedDict__root.next.next.next is od._OrderedDict__root
True
The circular linked list makes it easy to alter the key ordering without having to rebuild the whole underlying hash table.

However, I am interested in actual use-cases of this behavior. Can somebody enlighten me?
I don't think there are many useful use-cases for this. The reason this is allowed is because there could be some actual use-cases for it and forbidding it would make the performance of these containers worse or increase their memory usage.
Python is dynamically typed and you can add any Python object to a list. That means one would need to make special precautions to forbid adding a list to itself. This is different from (most) typed-languages where this cannot happen because of the typing-system.
So in order to forbid such recursive data-structures one would either need to check on every addition/insertion/mutation if the newly added object already participates in a higher layer of the data-structure. That means in the worst case it has to check if the newly added element is anywhere where it could participate in a recursive data-structure. The problem here is that the same list can be referenced in multiple places and can be part of multiple data-structures already and data-structures such as list/dict can be (almost) arbitrarily deep. That detection would be either slow (e.g. linear search) or would take quite a bit of memory (lookup). So it's cheaper to simply allow it.
The reason why Python detects this when printing is that you don't want the interpreter entering an infinite loop, or get a RecursionError, or StackOverflow. That's why for some operations like printing (but also deepcopy) Python temporarily creates a lookup to detect these recursive data-structures and handles them appropriately.

Consider building a state machine that parse string of digits an check if you can divide by 25 you could model each node as list with 10 outgoing directions consider some connections going to them self
def canDiv25(s):
n0,n1,n1g,n2=[],[],[],[]
n0.extend((n1,n0,n2,n0,n0,n1,n0,n2,n0,n0))
n1.extend((n1g,n0,n2,n0,n0,n1,n0,n2,n0,n0))
n1g.extend(n1)
n2.extend((n1,n0,n2,n0,n0,n1g,n0,n2,n0,n0))
cn=n0
for c in s:
cn=cn[int(c)]
return cn is n1g
for i in range(144):
print("%d %d"%(i,canDiv25(str(i))),end='\t')
While this state machine by itself has little practical it show what could happen. Alternative you could have an simple Adventure game where each room is represented as a dictionary you can go for example NORTH but in that room there is of course a back link to SOUTH. Also sometimes game developers make it so that for example to simulate a tricky path in some dungeon the way in NORTH direction will point to the room itself.

A very simple application of this would be a circular linked list where the last node in a list references the first node. These are useful for creating infinite resources, state machines or graphs in general.
def to_circular_list(items):
head, *tail = items
first = { "elem": head }
current = first
for item in tail:
current['next'] = { "elem": item }
current = current['next']
current['next'] = first
return first
to_circular_list([1, 2, 3, 4])
If it's not obvious how that relates to having a self-referencing object, think about what would happen if you only called to_circular_list([1]), you would end up with a data structure that looks like
item = {
"elem": 1,
"next": item
}
If the language didn't support this kind of direct self referencing, it would be impossible to use circular linked lists and many other concepts that rely on self references as a tool in Python.

The reason this is possible is simply because the syntax of Python doesn't prohibit it, much in the way any C or C++ object can contain a reference to itself. An example might be: https://www.geeksforgeeks.org/self-referential-structures/
As #MSeifert said, you will generally get a RecursionError at some point if you're trying to access the list repeatedly from itself. Code that uses this pattern like this:
a = [1, 2]
a.append(a)
def loop(l):
for item in l:
if isinstance(item, list):
loop(l)
else: print(item)
will eventually crash without some sort of condition. I believe that even print(a) will also crash. However:
a = [1, 2]
while True:
for item in a:
print(item)
will run infinitely with the same expected output as the above. Very few recursive problems don't unravel into a simple while loop. For an example of recursive problems that do require a self-referential structure, look up Ackermann's function: http://mathworld.wolfram.com/AckermannFunction.html. This function could be modified to use a self-referential list.
There is certainly precedent for self-referential containers or tree structures, particularly in math, but on a computer they are all limited by the size of the call stack and CPU time, making it impractical to investigate them without some sort of constraint.

Restore objects from pickled file by name in python

Using Python 2.7.
Is there a way to restore only specified objects from a pickle file?
using the same example as a previous post:
import pickle
# obj0, obj1, obj2 are created here...
# Saving the objects:
with open('objs.pickle', 'w') as f:
pickle.dump([obj0, obj1, obj2], f)
Now, I would like to only restore, say, obj1
I am doing the following:
with open('objs.pickle', 'r') as f:
obj1=pickle.load(f)[1]
but let's say I don't know the order of objects, just it's name.
Writing this, I am guessing the names get dropped during pickling?

Instead of storing the objects in a list, you could use a dictionary to provide names for each object:
import pickle
s = pickle.dumps({'obj0': obj0, 'obj1': obj1, 'obj2': obj2})
obj1 = pickle.loads(s)['obj1']
The order of the items no longer matters, in fact there is no order because a dictionary is being restored.
I'm not 100% sure that this is what you wanted. Were you hoping to restore the object of interest only, i.e. without parsing and restoring the other objects? I don't think that can be done with pickles without writing your own parser, or a fair degree of hacking.

No. Python objects do not have "names" (besides some exceptions like functions and classes that know their declared names), the names are just pointing to the object, and an object does not know its name even at runtime, and thus cannot be persisted in a pickle either.
Perhaps you need a dictionary instead.

Named Stuctured Data storage in Python

I am writing an application that calculates various Pandas dataframes over various time periods. Each of these Dataframes have additional data that need to be stored with them.
I can quite easily define a structure using lists or dicts to carry the data, but it would be nice if it is nicely structured.
I have looked at (tried namedtuples). This is great as it simplifies the syntax when accessing the information a lot. Problems with tuples are of course that they are immutable.
Have gotten around this by either doing all the calcs ahead of time and living with not being able to change them (without jumping through few hoops) or by following code:
from collections import namedtuple
m = namedtuple("Month", 'df StartDate EndDate DaysInMonth
m.Month = 2
m.df = pandas.DataFrame()
etc....
this seems to work, but I am actually misusing the named tuple class. m in the above code is actually a "type" not an instance. Although it is working and I can now assign to it I am probably going to run into some problems later on.
type(m)
>>> type
Any suggestions on whether I could carry on with this structure or if i should rather create my own class for the data structure?

What you're doing setting m.Month to 2 is using something all classes can do because they walk and talk like dictionaries.
class Month():
pass
a = Month()
a.df = 2
This works without doing anything special. If you look inside a's _dict_ attribute
print(a.__dict__)
You'll see something like the following
{'__module__': '__main__', '__doc__': None, 'df': 2}
I would probably use the empty class instead of the namedtuple if you want to change the values at a later time. All the namedtuple machinery in the background get you nothing for your use case.

References in Python

I have a multicasting network that needs to continuously send data to all other users. This data will be changing constantly so I do not want the programmer to have to deal with the sending of packets to users. Because of this, I am trying to find out how I can make a reference to any object or variable in Python (I am new to Python) so it can be modified by the user and changes what is sent in the multicasting packets.
Here is an example of what I want:
>>> test = "test"
>>> mdc = MulticastDataClient()
>>> mdc.add(test) # added into an internal list that is sent to all users
# here we can see that we are successfully receiving the data
>>> print mdc.receive()
{'192.168.1.10_0': 'test'}
# now we try to change the value of test
>>> test = "this should change"
>>> print mdc.receive()
{'192.168.1.10_0': 'test'} # want 'test' to change to -> 'this should change'
Any help on how I can fix this would be very much appreciated.
UPDATE:
I have tried it this way as well:
>>> test = [1, "test"]
>>> mdc = MulticastDataClient()
>>> mdc.add(test)
>>> mdc.receive()
{'192.168.1.10_1': 'test'}
>>> test[1] = "change!"
>>> mdc.receive()
{'192.168.1.10_1': 'change!'}
This did work.
However,
>>> val = "ftw!"
>>> nextTest = [4, val]
>>> mdc.add(nextTest)
>>> mdc.receive()
{'192.168.1.10_1': 'change!', '192.168.1.10_4': 'ftw!'}
>>> val = "different."
>>> mdc.receive()
{'192.168.1.10_1': 'change!', '192.168.1.10_4': 'ftw!'}
This does not work. I need 'ftw!' to become 'different.' in this case.
I am using strings for testing and am used to strings being objects from other languages. I will only be editing the contents inside of an object so would this end up working?

In python everything is a reference, but strings are not mutable. So test is holding a reference to "test". If you assign "this should change" to test you just change it to another reference. But your clients still have the reference to "test". Or shorter: It does not work that way in python! ;-)
A solution might be to put the data into an object:
data = {'someKey':"test"}
mdc.add(data)
Now your clients hold a reference to the dictionary. If you update the dictionary like this, your clients will see the changes:
data['someKey'] = "this should change"

You can't, not easily. A name (variable) in Python is just a location for a pointer. Overwrite it and you just replace the pointer with another pointer, i.e. the change is only visible to people who use the same variable. Object members are basically the same, but as their state is seen by everyone with a pointer to them, you can propagate changes like this. You just have to use obj.var every single time. Of course, strings (along with integers, tuples, a few other built-in types, and several other types) are immutable, i.e. you can't change anything about for others to see as you can't change it at all.
However, the mutability of objects opens another possibility: You could, if you bothered to pull it through, write a wrapper class that contains an arbitrary object, allows changing that object though a set() method and delegates everything important to that object. You'd probably run into nasty little troubles sooner or later though. For example, I can't imagine this would play well with metaprogramming that goes through all members, or anything that thinks it has to mess with. It's also incredibly hacky (i.e. unreliable). There's probably a much easier solution.
(On a side note, PyPy has a become function in one of its non-default object spaces that really and truly replaces one object with another, visible to everyone with a reference to that object. It doesn't work with any other implementations though and I think the incredible potential and misuse confusion as well as the fact most of us have rarely ever needed this makes it nearly unacceptable in real code.)

I need to free up RAM by storing a Python dictionary on the hard drive, not in RAM. Is it possible?

In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. As I build this dictionary up, my RAM goes up super high. Is there a way to write the dictionary as it is being built to the harddrive rather than the RAM so that I can save some memory? I've heard of something called "pickle" but I don't know if this is a feasible method for what I am doing.
Thanks for your help!

Maybe you should be using a database, but check out the shelve module
If shelve isn't powerful enough for you, there is always the industrial strength ZODB

shelve, as #gnibbler recommends, is what I would no doubt be using, but watch out for two traps: a simple one (all keys must be strings) and a subtle one (as the values don't normally exist in memory, calling mutators on them may not work as you expect).
For the simple problem, it's normally easy to find a workaround (and you do get a clear exception if you forget and try e.g. using an int or whatever as the key, so it's not hard t remember that you do need a workaround either).
For the subtle problem, consider for example:
x = d['foo']
x.amutatingmethod()
...much later...
y = d['foo']
# is y "mutated" or not now?
the answer to the question in the last comment depends on whether d is a real dict (in which case y will be mutated, and in fact exactly the same object as x) or a shelf (in which case y will be a distinct object from x, and in exactly the state you last saved to d['foo']!).
To get your mutations to persist, you need to "save them to disk" by doing
d['foo'] = x
after calling any mutators you want on x (so in particular you cannot just do
d['foo'].mutator()
and expect the mutation to "stick", as you would if d were a dict).
shelve does have an option to cache all fetched items in memory, but of course that can fill up the memory again, and result in long delays when you finally close the shelf object (since all the cached items must be saved back to disk then, just in case they had been mutated). That option was something I originally pushed for (as a Python core committer), but I've since changed my mind and I now apologize for getting it in (ah well, at least it's not the default!-), since the cases it should be used in are rare, and it can often trap the unwary user... sorry.
BTW, in case you don't know what a mutator, or "mutating method", is, it's any method that alters the state of the object you call it on -- e.g. .append if the object is a list, .pop if the object is any kind of container, and so on. No need to worry if the object is immutable, of course (numbers, strings, tuples, frozensets, ...), since it doesn't have mutating methods in that case;-).

Pickling an entire hash over and over again is bound to run into the same memory pressures that you're facing now -- maybe even worse, with all the data marshaling back and forth.
Instead, using an on-disk database that acts like a hash is probably the best bet; see this page for a quick introduction to using dbm-style databases in your program: http://docs.python.org/library/dbm
They act enough like hashes that it should be a simple transition for you.

"""I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings""" ... I presume that you mean: """I have a class with about 1000 attributes all of type str or list of str. I have a dictionary mapping about 6000 keys of unspecified type to corresponding instances of that class.""" If that's not a reasonable translation, please correct it.
For a start, 1000 attributes in a class is mindboggling. You must be treating the vast majority generically using value = getattr(obj, attr_name) and setattr(obj, attr_name, value). Consider using a dict instead of an instance: value = obj[attr_name] and obj[attr_name] = value.
Secondly, what percentage of those 6 million attributes are ""? If sufficiently high, you might like to consider implementing a sparse dict which doesn't physically have entries for those attributes, using the __missing__ hook -- docs here.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.