Accessing Dictionaries VS Accessing Shelves - python

Currently, I have a dictionary that has a number as the key and a Class as a value. I can access the attributes of that Class like so:
dictionary[str(instantiated_class_id_number)].attribute1
Due to memory issues, I want to use the shelve module. I am wondering if doing so is plausible. Does a shelve dictionary act the exact same as a standard dictionary? If not, how does it differ?

A shelf doesn't act exactly the same as a dictionary, notably when you modify objects that are already stored in it.
The difference is that when you add an object to a dictionary, a reference is stored, whereas a shelf keeps a pickled (serialized) copy of the object. If you then modify the object, you modify the in-memory copy but not the pickled version. That can be handled (mostly) transparently by opening the shelf with writeback=True: shelf.sync() and shelf.close() then write the modified entries back out. Making that work requires the shelf to cache every retrieved object, so you do have to call shelf.sync() periodically to clear the cache.
The problem with shelf.sync() clearing the cache is that you can keep a reference to an object and modify it again after the flush; that later modification is silently lost.
This code doesn't work as expected with a shelf (opened with writeback=True), but would with a dictionary:
import shelve
s = shelve.open("test.db", writeback=True)
s["foo"] = MyClass()
s["foo"].X = 8
p = s["foo"]  # store a reference to the cached object
p.X = 9       # update through the reference
s.sync()      # flushes (and clears) the cache
p.X = 0       # too late: p is no longer tracked
print("value in memory: %d" % p.X)        # prints 0
print("value in shelf: %d" % s["foo"].X)  # prints 9
sync() flushes and clears the cache, so the modified p object is no longer tracked by the shelf and the final change is never written back.

Yes, it is plausible:
Shelf objects support all methods supported by dictionaries. This eases the transition from dictionary based scripts to those requiring persistent storage.
If you open the shelf with writeback=True, you also need to call shelf.sync() every so often to flush the cache.
EDIT
Take care, it's not exactly a dict. See e.g. Laurion's answer.
Oh, and you can only have str keys.
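As a minimal sketch of the transition (the temporary file location is mine, not the asker's), a shelf can replace the dictionary almost verbatim as long as the keys are strings:

```python
import shelve
import tempfile
import os

# A throwaway location for the shelf file (path is illustrative).
path = os.path.join(tempfile.mkdtemp(), "mydata")

with shelve.open(path) as s:
    s["42"] = {"attribute1": "hello"}  # keys must be str, values are pickled

# Reopen the shelf: the data survived on disk, not in RAM.
with shelve.open(path) as s:
    value = s["42"]["attribute1"]
```

Access looks just like the asker's `dictionary[str(id_number)].attribute1`, except every lookup unpickles a fresh copy from disk.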

Related

How to delete objects to increase free memory in python?

My python application reads data from files and stores these data in dictionaries during start up (dictionaries are properties of data reader classes). Once the application starts and the read data is used, these data in the dictionaries are no longer needed. However, they consume large amount of memory. How do I delete these dictionaries to free the memory?
For example:
class DataReader():
    def __init__(self, data_file):
        self.data_file = data_file

    def read_data_file_and_store_data_in_dictionary(self):
        self.data_dictionary = {}
        for data_name, data in self.data_file:
            self.data_dictionary[data_name] = data

class Application():
    def __init__(self, data_file):
        self.data_reader = DataReader(data_file)
        self.data_reader.read_data_file_and_store_data_in_dictionary()

    def start_app(self):
        self.use_read_data()
After application is started, self.data_dictionary is no longer needed. How do I delete self.data_dictionary permanently?
Use the del statement
del self.data_dictionary # or del obj.data_dictionary
Note this will only delete this reference to the dictionary. If any other references still exist for the dictionary (say if you had done d = data_reader.data_dictionary and d still references data_dictionary) then the dictionary will not be freed from memory. This also includes any references to d.keys(), d.values(), d.items().
Only when all references have been removed will the dictionary finally be released.
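As a hedged illustration of that rule (the variable names are mine), sys.getrefcount shows that del removes one name binding, not the object itself:

```python
import sys

data_dictionary = {"a": 1}
d = data_dictionary  # a second reference to the same dict

# getrefcount counts both names plus its own temporary argument.
refs_before = sys.getrefcount(data_dictionary)

del data_dictionary  # removes one name; the object survives
still_alive = d      # the dict is still reachable through d
```

Only when `d` is also deleted (or goes out of scope) does the dictionary's reference count reach zero and the memory get released.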
In Python, you should rarely need to manage memory by hand.
Python has a garbage collector based on reference counting: each object tracks how many references point to it.
When no references remain, the object is deallocated.
In your case, if the memory is not freed after you're done with the data, it means the object is still reachable from your program. If you delete the name and then try to use it, you will get a NameError (or an AttributeError, for a deleted attribute).
Other answers suggest del, but it only deletes a name binding, not the object itself.
My suggestion is to ensure that your code no longer uses the object at all, and if it still does, restructure your data accordingly (use a lightweight database, save the data to the local hard drive, ...) and retrieve it when needed. If your big dictionaries are attributes of a class which is still in use but no longer needs them, move those dicts out of the class (perhaps into a new class that only manages the dicts). In this Q&A you will find useful tips for optimizing memory usage.
You can read this article for a really interesting dive into Python's garbage collector.
How about having the data in a smaller scope?
class Application():
    def __init__(self, data_file):
        self.use_read_data(DataReader(data_file).read())
After application is started, self.data_dictionary is no longer needed
If you do not need the data for the whole lifetime of the application then you shouldn't be storing it in an instance attribute.
Choose the right scope and you won't need to care about deleting variables.
del deletes the reference to the object; however, the object itself may still be in memory. In that case, the garbage collector (gc.collect([generation])) can free it. Note that collect() mainly matters when reference cycles are involved; non-cyclic garbage is freed by reference counting as soon as the last reference goes away:
https://docs.python.org/2.7/library/gc.html
import gc
[...]
# Delete the reference
del data_dictionary
# Run the garbage collector
gc.collect()
[...]

how to change type in python dictionary in function?

I have a really large group of pandas DataFrames, and I need to convert one column from a JSON-formatted string into a dictionary.
import pandas as pd
import pymysql
db = pymysql.connect(XXXX)
df = pd.read_sql(balabal).to_dict(orient='records')
After we get the list of records, we need to convert one field, say df[0]['paragraphs'], from a string to a dictionary, where i['t'] is the key and i['p'] is the value. Here is the code:
import json

def str2dict(input_str):
    j = json.loads(input_str)
    ret = {}
    for i in j:
        ret[i['t']] = i['p']
    return ret
And I call this function by:
for i in df:
    i['paragraphs'] = preprocess.str2dict(i['paragraphs'])
It works fine, but I suspect the line i['paragraphs'] = preprocess.str2dict(i['paragraphs']) makes an unnecessary copy.
I want my str2dict function to be like this:
def str2dict(input_str):
    j = json.loads(input_str)
    # clear the memory input_str points to, and let it become a new dictionary
    for i in j:
        input_str[i['t']] = i['p']
so that we can avoid the copy assignment.
And I'm confused about the memory model:
in Python everything is an object, so variables act like shared_ptr in C++.
But where are those objects actually allocated?
Are all objects stored on the heap, with every variable in a function's stack frame just a pointer to one?
When we do this:
a = 1
a = {'a': 1}
a = 2
Python creates a new dictionary and makes a point to it,
and when a = 2 runs, the program deletes the dictionary on the heap and points a at 2.
But what about inside a function?
def test(a):
    a = {}
    return

s = 1
test(s)
s is still 1. I think a = {} creates a dict and binds the local name a to it, and this a has no relation to the caller's s. So how can I use the parameter a to make the caller's s become {}?
And finally, where can I learn how Python is implemented (where variables are stored, what happens when their type changes) and how its memory works? I think I should learn something about CPython. Do you have any suggestions for books or videos?
thx :)
CPython works very differently from C++. Everything is on the heap. Memory is managed automatically by reference counting (cycles are dealt with by a garbage collector). Variables are not typed. Python does not support pass-by-reference semantics. i['paragraphs'] = preprocess.str2dict(i['paragraphs']) does not make a copy. If no other reference to the string referenced by i['paragraphs'] exists, that string's reference count will drop to zero once str2dict returns, and the memory will be reclaimed.
This function:
def test(a):
    a = {}
    return
creates a dict object and assigns it to the local name a. Once the function returns, no other references to that dict exist, and the dict object is deallocated. This is handled by the Python runtime; generally, you do not need to worry about these things.
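To make that concrete, here is a small sketch (function names are illustrative) of the difference between rebinding a parameter and mutating the object it refers to; mutation is the usual way to update the caller's object in place:

```python
def rebind(a):
    a = {}           # rebinds the local name only; the caller is unaffected

def mutate(a):
    a.clear()        # mutates the object the caller also references
    a["t"] = "p"

s = {"old": 1}
rebind(s)
after_rebind = dict(s)   # unchanged: still {"old": 1}

mutate(s)
after_mutate = dict(s)   # changed in place: now {"t": "p"}
```

This is why the asker's desired str2dict cannot turn a passed-in string into a dict: strings are immutable, and rebinding the parameter is invisible to the caller.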
Do you want to free RAM while running your program?
I am not sure that you can do that, at least not in a way similar to C++.
With CPython you have gc.
The docs say:
This module provides an interface to the optional garbage collector. It provides the ability to disable the collector, tune the collection frequency, and set debugging options.
Anyway, gc can reclaim memory, but it doesn't necessarily return it to the OS.

Remove object from list after lifetime expires

I am creating a program that spawns objects randomly. These objects have a limited lifetime.
I create these objects and place them in a list. The objects keep track of how long they exist and eventually expire. They are no longer needed after expiration.
I would like to delete the objects after they expire but I'm not sure how to reference the specific object in the list to delete it.
if something:
    list.append(SomeObject())
---- later ----
I would like a cleanup process that looks at a variable in each object and, if the object has expired, removes it from the list.
Thanks for your help in advance.
You can use the reference count if you define "no longer used" as "no other object keeps a reference". That is a good test, because once no references exist, the object can no longer be accessed and may be disposed of. In fact, Python's garbage collector will do that for you.
Where it goes wrong is when you also keep all the instances in a list. The list counts as a reference to each object, so none of them will ever be disposed of.
For example, consider a list of state objects that are not only referenced by their owners, but also by a list to allow linear access. Explicitly call a cleanup function from the accessor to keep the list clean:
from sys import getrefcount

GlobalStateList = []

def gcOnGlobalStateList():
    for s in reversed(GlobalStateList):
        if getrefcount(s) <= 3:  # 1=GlobalStateList, 2=iterator, 3=getrefcount()
            GlobalStateList.remove(s)

def getGlobalStateList():
    gcOnGlobalStateList()
    return GlobalStateList
Note that even looking at the refcount increases it, so the test value is three or less.
Assuming that your concept of SomeObject "expiry" is not directly related to the underlying Python object lifetime (reference counting, etc.) I would suggest that the easiest way to purge the list is to occasionally run through it, dereferencing any expired objects:
lst = [obj for obj in lst if not obj.expired]
Note that you shouldn't call your own variables list, as this will shadow the built-in.
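Putting that together, a minimal sketch of a periodic cleanup pass (the class and attribute names are hypothetical, not from the asker's code) might look like:

```python
import time

class SpawnedObject:
    """Illustrative object with a fixed lifetime in seconds."""
    def __init__(self, lifetime):
        self.born = time.monotonic()
        self.lifetime = lifetime

    @property
    def expired(self):
        return time.monotonic() - self.born > self.lifetime

objects = [SpawnedObject(0.0), SpawnedObject(60.0)]
time.sleep(0.01)  # let the first object expire

# Periodic cleanup: rebuild the list, keeping only live objects.
# Dropping an expired object's last reference lets it be reclaimed.
objects = [obj for obj in objects if not obj.expired]
```

Run the comprehension on a timer or each game/update tick; no per-object bookkeeping of list positions is needed.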

Persistent references in Python

I would like my program to store data for later use. So far, no problem: there are many ways of doing this in Python.
Things get a little more complicated because I want to keep references between instances. If a list X is a list Y (they have the same id; modifying one modifies the other), that should still be true the next time I load the data (in another session of the program, which has stopped in the meantime).
I know one solution: the pickle module keeps track of references and will remember that my X and Y lists are exactly the same object (not only their contents, but their identity).
Still, the problem with pickle is that this only works if you dump all the data into a single file, which is not really clever if you have a large amount of data.
Do you know another way to handle this problem?
The simplest thing to do is probably to wrap up all the state you wish to save in a dictionary (keyed by variable name, perhaps, or some other unique but predictable identifier), then pickle and unpickle that dictionary. The objects within the dictionary will share references with one another, as you want:
>>> import pickle
>>> class X(object):
...     # just some object to be pickled
...     pass
...
>>> l1 = [X(), X(), X()]
>>> l2 = [l1[0], X(), l1[2]]
>>> state = {'l1': l1, 'l2': l2}
>>> saved = pickle.dumps(state)
>>> restored = pickle.loads(saved)
>>> restored['l1'][0] is restored['l2'][0]
True
>>> restored['l1'][1] is restored['l2'][1]
False
I would recommend using shelve over pickle. It has higher-level functionality and is simpler to use.
http://docs.python.org/library/shelve.html
If you have performance issues because you manipulate a very large amount of data, you may try other libraries like PyTables:
http://www.pytables.org/moin
ZODB is designed to save persistent Python objects and all their references. Just inherit your class from Persistent and have fun. http://www.zodb.org/

I need to free up RAM by storing a Python dictionary on the hard drive, not in RAM. Is it possible?

In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. As I build this dictionary up, my RAM goes up super high. Is there a way to write the dictionary as it is being built to the harddrive rather than the RAM so that I can save some memory? I've heard of something called "pickle" but I don't know if this is a feasible method for what I am doing.
Thanks for your help!
Maybe you should be using a database, but check out the shelve module
If shelve isn't powerful enough for you, there is always the industrial strength ZODB
shelve, as @gnibbler recommends, is what I would no doubt be using, but watch out for two traps: a simple one (all keys must be strings) and a subtle one (as the values don't normally live in memory, calling mutators on them may not work as you expect).
For the simple problem, it's normally easy to find a workaround (and you do get a clear exception if you forget and try e.g. using an int or whatever as the key, so it's not hard to remember that you need one).
For the subtle problem, consider for example:
x = d['foo']
x.amutatingmethod()
...much later...
y = d['foo']
# is y "mutated" or not now?
The answer to the question in the last comment depends on whether d is a real dict (in which case y will be mutated, and in fact will be exactly the same object as x) or a shelf (in which case y will be a distinct object from x, in exactly the state you last saved to d['foo']!).
To get your mutations to persist, you need to "save them to disk" by doing
d['foo'] = x
after calling any mutators you want on x (so in particular you cannot just do
d['foo'].mutator()
and expect the mutation to "stick", as you would if d were a dict).
shelve does have an option to cache all fetched items in memory, but of course that can fill up the memory again, and result in long delays when you finally close the shelf object (since all the cached items must be saved back to disk then, just in case they had been mutated). That option was something I originally pushed for (as a Python core committer), but I've since changed my mind and I now apologize for getting it in (ah well, at least it's not the default!-), since the cases it should be used in are rare, and it can often trap the unwary user... sorry.
BTW, in case you don't know what a mutator, or "mutating method", is, it's any method that alters the state of the object you call it on -- e.g. .append if the object is a list, .pop if the object is any kind of container, and so on. No need to worry if the object is immutable, of course (numbers, strings, tuples, frozensets, ...), since it doesn't have mutating methods in that case;-).
Pickling an entire hash over and over again is bound to run into the same memory pressures that you're facing now -- maybe even worse, with all the data marshaling back and forth.
Instead, using an on-disk database that acts like a hash is probably the best bet; see this page for a quick introduction to using dbm-style databases in your program: http://docs.python.org/library/dbm
They act enough like hashes that it should be a simple transition for you.
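A hedged sketch of that dbm-style usage (the file location is illustrative); note that, unlike a dict, values come back as bytes:

```python
import dbm
import tempfile
import os

path = os.path.join(tempfile.mkdtemp(), "cache")

# "c" opens the database for reading and writing, creating it if needed.
with dbm.open(path, "c") as db:
    db["key1"] = "value1"          # str keys/values are encoded to bytes

with dbm.open(path, "r") as db:    # reopen read-only: data is on disk
    stored = db["key1"]            # values are returned as bytes
```

Because everything is stored as bytes, complex values (like the asker's class instances) would still need to be serialized, e.g. with pickle, before going into the database.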
"""I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings""" ... I presume that you mean: """I have a class with about 1000 attributes all of type str or list of str. I have a dictionary mapping about 6000 keys of unspecified type to corresponding instances of that class.""" If that's not a reasonable translation, please correct it.
For a start, 1000 attributes in a class is mindboggling. You must be treating the vast majority generically using value = getattr(obj, attr_name) and setattr(obj, attr_name, value). Consider using a dict instead of an instance: value = obj[attr_name] and obj[attr_name] = value.
Secondly, what percentage of those 6 million attributes are ""? If sufficiently high, you might like to consider implementing a sparse dict which doesn't physically have entries for those attributes, using the __missing__ hook -- docs here.
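A minimal sketch of such a sparse mapping via the __missing__ hook (the class name is hypothetical):

```python
class SparseAttrs(dict):
    """Sparse attribute store: absent attributes read as ""
    without physically storing thousands of empty entries."""
    def __missing__(self, key):
        # Called by dict.__getitem__ for missing keys; returns a
        # default without inserting anything into the dict.
        return ""

obj = SparseAttrs()
obj["name"] = "example"    # only non-empty attributes are stored

present = obj["name"]      # normal lookup
absent = obj["colour"]     # "" via __missing__; nothing is stored
entries = len(obj)         # still just the one real entry
```

With mostly-empty attributes, each of the 6000 instances then stores only its non-empty values instead of 1000 slots apiece.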
