I have some data stored in a DB that I want to process. DB access is painfully slow, so I decided to load all data in a dictionary before any processing. However, due to the huge size of the data stored, I get an out of memory error (I see more than 2 gigs being used). So I decided to use a disk data structure, and found out that using shelve is an option. Here's what I do (pseudo python code)
def loadData():
    if (#dict exists on disk):
        d = shelve.open(name)
        return d
    else:
        d = shelve.open(name, writeback=True)
        #access DB and write data to dict
        # d[key] = value
        # or for mutable values
        # oldValue = d[key]
        # newValue = f(oldValue)
        # d[key] = newValue
        d.close()
        d = shelve.open(name, writeback=True)
        return d
I have a couple of questions:
1) Do I really need writeback=True? What does it do?
2) I still get an out-of-memory error, since I have no control over when the data is written to disk. How do I get that control? I tried calling sync() every few iterations, but that didn't help either.
Thanks!
writeback=True forces the shelf to keep in-memory any item ever fetched, and write them back when the shelf is closed. So, it consumes much more memory, and slows down closing.
The advantage of the parameter is that, with it, you don't need the contorted code you show in your comment for mutable items whose mutator is a method -- just
shelf['foobar'].append(23)
works (if shelf was opened with writeback enabled), assuming the item at key 'foobar' is a list of course, while it would silently be a no-operation (leaving the item on disk unchanged) if shelf was opened without writeback -- in the latter case you actually do need to code
thelist = shelf['foobar']
thelist.append(23)
shelf['foobar'] = thelist
in your comment's spirit -- which is stylistically somewhat of a bummer.
However, since you are having memory problems, I definitely recommend not using this dubious writeback option. I think I can call it "dubious" since I was the one proposing and first implementing it, but that was many years ago, and I've mostly repented of doing it -- it generates more confusion (as your Q evidences) than it allows elegance and handiness in moving code originally written to work with dicts (which would use the first idiom, not the second, and thus need rewriting in order to be usable with shelves without writeback). Ah well, sorry, it did seem a good idea at the time.
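To make the difference between the two idioms concrete, here is a minimal sketch using a shelf opened without writeback. The helper name demo_shelf_mutation and the temporary file location are assumptions made for this illustration only; they are not part of the question's code.

```python
import os
import shelve
import tempfile

def demo_shelf_mutation():
    """Show that in-place mutation is lost on a shelf opened without writeback."""
    # Hypothetical location: a throwaway directory just for this demo.
    path = os.path.join(tempfile.mkdtemp(), "demo_shelf")
    with shelve.open(path) as shelf:          # writeback defaults to False
        shelf["foobar"] = [1, 2]

        # First idiom: mutates a freshly unpickled copy, nothing is saved.
        shelf["foobar"].append(23)
        lost = list(shelf["foobar"])

        # Second idiom: fetch, mutate, and store the item back explicitly.
        thelist = shelf["foobar"]
        thelist.append(23)
        shelf["foobar"] = thelist
        kept = list(shelf["foobar"])
    return lost, kept
```

Because nothing is cached, memory use stays bounded by the items you are holding yourself, which is the point of avoiding writeback when memory is tight.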
Using the sqlite3 module is probably your best choice here. You might even be able to use sqlite entirely in memory, since its memory footprint might be a bit smaller than that of Python objects. It's generally a better choice than shelve anyway; shelve uses pickle underneath, which is rarely what you want.
Hell, you could just convert your entire existing database to a sqlite database. sqlite is nice and fast.
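As a rough sketch of what that could look like, the snippet below stores key/value pairs in a sqlite table and looks them up one at a time. The table name kv, the column names, and the helper functions build_store and lookup are all made up for this example; the real schema would depend on your data.

```python
import sqlite3

def build_store(rows, db_path=":memory:"):
    """Create a simple key/value table and fill it from an iterable of pairs."""
    conn = sqlite3.connect(db_path)   # pass a filename to keep the data on disk
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
    )
    with conn:                        # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", rows
        )
    return conn

def lookup(conn, key):
    """Return the stored value for key, or None if the key is absent."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None
```

Since executemany accepts any iterable, the rows can come from a generator that reads the slow database, so the full data set never has to sit in a Python dict.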
I have a class which primarily contains the three dicts:
class KB(object):
    def __init__(self):
        # key:str value: list of str
        self.linear_patterns = defaultdict(list)
        # key:str value: list of str
        self.nonlinear_patterns = defaultdict(list)
        # key: str value: dict
        self.pattern_meta_info = {}
        ...
        self.__initialize()
    def __initialize(self):
        # the 3 dicts are populated
        ...
The sizes of the three dicts are:
linear_patterns: 100,000
nonlinear_patterns: 900,000
pattern_meta_info: 700,000
After the program has run, it takes about 15 seconds to release the memory. When I reduce the dict sizes above by loading less data during initialization, the memory is released faster, so I conclude that the dict sizes are what slows the release down. The whole program uses about 8 GB of memory. Also, after the dicts are built, all operations are lookups; there are no modifications.
Is there a way to use Cython to optimize the three data structures above, especially in terms of memory usage? Is there a similar Cython dictionary that can replace the Python dicts?
It seems unlikely that a different dictionary or object type would change much. Destructor performance is dominated by the memory allocator. That will be roughly the same unless you switch to a different malloc implementation.
If this is only about object destruction at the end of your program, most languages (but not Python) would allow you to call exit while keeping the KB object alive. The OS releases the memory much more quickly when the process terminates, so why bother? Unfortunately that doesn't work with Python's sys.exit(), since this merely raises an exception.
Everything else relies on changing the data structure or algorithm. Are your strings highly redundant? Maybe you can reuse string objects by interning them. Keep them in a shared set to use the same string in multiple places. A simple call to string = sys.intern(string) is enough. Unlike in earlier versions of Python, this will not keep the string object alive beyond its use so you don't run the risk of leaking memory in a long-running process.
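A small sketch of the interning idea, assuming Python 3 where the function lives in sys. The helper name intern_all is made up for this example.

```python
import sys

def intern_all(strings):
    """Return the same strings, with equal values sharing one object."""
    return [sys.intern(s) for s in strings]

# Build two equal strings at run time so they start out as separate objects.
first = "".join(["pat", "tern"])
second = "".join(["pat", "tern"])
pooled = intern_all([first, second])
```

After interning, pooled[0] and pooled[1] refer to the same object, so a value that occurs in many dict entries is stored only once.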
You could also pool the strings in one large allocation. If access is relatively rare, you could change the class to use one large io.StringIO object for its contained strings and all dictionaries just deal with (offset, length) tuples into that buffer.
That still leaves many tuple and integer objects but those use specialized allocators that may be faster. Also, the length integer will come from the common pool of small integers and not allocate a new object.
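One possible shape for such a pool is sketched below. The class name StringPool and its methods are assumptions for illustration, and the sketch deviates slightly from the description above: it collects the strings in an io.StringIO while building, then freezes them into one large str and slices that, which avoids seeking in the buffer on every lookup.

```python
import io

class StringPool:
    """Store many strings in one buffer and hand out (offset, length) tuples."""

    def __init__(self):
        self._buf = io.StringIO()
        self._size = 0        # number of characters written so far
        self._text = None     # set by freeze()

    def add(self, s):
        """Append s to the pool and return its (offset, length) reference."""
        offset = self._size
        self._buf.write(s)
        self._size += len(s)
        return (offset, len(s))

    def freeze(self):
        """Finish building: keep one large string and drop the buffer."""
        self._text = self._buf.getvalue()
        self._buf.close()

    def get(self, ref):
        """Recreate the string for a reference returned by add()."""
        offset, length = ref
        return self._text[offset:offset + length]
```

The dictionaries would then map keys to lists of these tuples instead of lists of str objects. Each get() creates a new temporary string, which is why this only pays off when access is relatively rare.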
A final thought: 8 GB of string data. Are you sure you don't want a small sqlite or dbm database? It could be a temporary file.
I'm debugging memory leaks in a Django application, and found something curious in django_cachepurge:
from threading import currentThread
_urls_to_purge = {}
def add_purge_url(url):
    # ....
    _urls_to_purge.setdefault(currentThread(), set()).add(url)
Is such a construct prone to memory leaks?
I suspect so, unless I'm not familiar with some Python magic here.
There is no location where the dict is cleaned up.
I don't know what currentThread returns, but you probably can use the built-in id or hash functions on it to get a safe value.
If lookup isn't enough, e.g. because you want to iterate over the container, there is weakref.WeakKeyDictionary.
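Here is a minimal sketch of how a weakref.WeakKeyDictionary drops an entry once its key object is gone. The Worker class stands in for the thread object and is purely hypothetical, as is the URL; the explicit gc.collect() is only there so the demo does not rely on reference counting alone.

```python
import gc
import weakref

class Worker:
    """Hypothetical stand-in for the object used as the dictionary key."""

def pending_after_owner_dies():
    registry = weakref.WeakKeyDictionary()
    owner = Worker()
    registry.setdefault(owner, set()).add("/some/url")
    before = len(registry)    # one entry while the owner is alive

    del owner                 # drop the last strong reference to the key
    gc.collect()
    after = len(registry)     # the entry disappears together with its key
    return before, after
```

With a plain dict, the key would be kept alive by the dict itself and the entry would never go away unless someone deletes it explicitly.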
I am loading a JSON file to parse it and convert part of it to a CSV.
At the end of the method I would like to free the space used by the loaded JSON.
Here is the method:
def JSONtoCSV(input,output):
    outputWriter = csv.writer(open(output,'wb'), delimiter=',')
    jsonfile = open(input).read()
    data = loads(jsonfile)
    for k,v in data["specialKey"].iteritems():
        outputWriter.writerow([v[1],v[5]])
How do you free the space of the "data" variable?
del data
should do it if you only have one reference. Keep in mind this will happen automatically when the current scope ends (the function returns).
Also, you don't need to keep the jsonfile string around, you can just
data = json.load(open(input))
to read the JSON data directly from the file.
If you want data to go away as soon as you're done with it, you can combine all of that:
for k,v in json.load(open(input))["specialKey"].iteritems():
since there is no reference to the data once the loop has ended, Python will free the memory immediately.
In Python, variables are automatically freed when they go out of scope so you shouldn't have to worry about it. However if you really want to, you can use
del data
One thing to note is that the garbage collector probably won't kick in immediately, even if you do use del. That's the downside of garbage collection. You just don't have 100% control of memory management. That is something you will need to accept if you want to use Python. You just have to trust the garbage collector to know what it's doing.
The data variable does not take up any meaningful space; it's just a name. The data object takes up some space, and Python does not allow you to free objects manually. Objects will be garbage collected some time after there are no references to them.
To make sure that you don't keep things alive longer than you want, make sure you don't have a way to access them (don't have a name still bound to them, etc).
An improved implementation might be
def JSONtoCSV(input_filename, output_filename):
    with open(input_filename) as f:
        special_data = json.load(f)[u'specialKey']
    with open(output_filename,'wb') as f:
        outputWriter = csv.writer(f, delimiter=',')
        for k, v in special_data.iteritems():
            outputWriter.writerow([v[1], v[5]])
This doesn't ever store the string you called jsonfile or the dict you called data, so they're freed to be collected as soon as Python wants. The former improvement was made by using json.load instead of json.loads, which takes the file object itself. The latter improvement is made by looking up 'specialKey' immediately rather than binding a name to all of data.
Consider that this delicate dance probably isn't necessary at all, since as soon as you return these references will cease to be around and you've at best sped things up momentarily.
Python is a garbage-collected language, so you don't have to worry about freeing memory once you've used it; once the jsonfile variable goes out of scope, it will automatically be freed by the interpreter.
If you really want to delete the variable, you can use del jsonfile, which will cause an error if you try to refer to it after deleting it. However, unless you're loading enough data to cause a significant drop in performance, I would leave this to the garbage collector.
Please refer to Python json memory bloat. Garbage collection is not kicking-in as thresholds are not met. So even a del call will not free memory. However a forced garbage collection using gc.collect() will free up the object.
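For what it's worth, in CPython an object with no remaining references is freed right away by reference counting; the collector that gc.collect() triggers matters for objects caught in reference cycles. The sketch below illustrates that case. The Node class and the cycle_demo helper are invented for this demonstration, and the automatic collector is switched off temporarily only to make the result repeatable.

```python
import gc
import weakref

class Node:
    """Hypothetical object used only for this demonstration."""
    def __init__(self):
        self.other = None

def cycle_demo():
    gc.disable()                      # keep the automatic collector out of the way
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a       # the two objects now reference each other
        probe = weakref.ref(a)        # lets us observe a without keeping it alive

        del a, b                      # names are gone, but the cycle remains
        alive_after_del = probe() is not None

        gc.collect()                  # the cycle collector reclaims both objects
        alive_after_collect = probe() is not None
    finally:
        gc.enable()
    return alive_after_del, alive_after_collect
```

Whether freed memory is handed back to the operating system is a separate matter that depends on the allocator.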
In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. As I build this dictionary up, my RAM goes up super high. Is there a way to write the dictionary as it is being built to the harddrive rather than the RAM so that I can save some memory? I've heard of something called "pickle" but I don't know if this is a feasible method for what I am doing.
Thanks for your help!
Maybe you should be using a database, but check out the shelve module
If shelve isn't powerful enough for you, there is always the industrial strength ZODB
shelve, as @gnibbler recommends, is what I would no doubt be using, but watch out for two traps: a simple one (all keys must be strings) and a subtle one (as the values don't normally exist in memory, calling mutators on them may not work as you expect).
For the simple problem, it's normally easy to find a workaround (and you do get a clear exception if you forget and try e.g. using an int or whatever as the key, so it's not hard to remember that you do need a workaround either).
For the subtle problem, consider for example:
x = d['foo']
x.amutatingmethod()
...much later...
y = d['foo']
# is y "mutated" or not now?
the answer to the question in the last comment depends on whether d is a real dict (in which case y will be mutated, and in fact exactly the same object as x) or a shelf (in which case y will be a distinct object from x, and in exactly the state you last saved to d['foo']!).
To get your mutations to persist, you need to "save them to disk" by doing
d['foo'] = x
after calling any mutators you want on x (so in particular you cannot just do
d['foo'].mutator()
and expect the mutation to "stick", as you would if d were a dict).
shelve does have an option to cache all fetched items in memory, but of course that can fill up the memory again, and result in long delays when you finally close the shelf object (since all the cached items must be saved back to disk then, just in case they had been mutated). That option was something I originally pushed for (as a Python core committer), but I've since changed my mind and I now apologize for getting it in (ah well, at least it's not the default!-), since the cases it should be used in are rare, and it can often trap the unwary user... sorry.
BTW, in case you don't know what a mutator, or "mutating method", is, it's any method that alters the state of the object you call it on -- e.g. .append if the object is a list, .pop if the object is any kind of container, and so on. No need to worry if the object is immutable, of course (numbers, strings, tuples, frozensets, ...), since it doesn't have mutating methods in that case;-).
Pickling an entire hash over and over again is bound to run into the same memory pressures that you're facing now -- maybe even worse, with all the data marshaling back and forth.
Instead, using an on-disk database that acts like a hash is probably the best bet; see this page for a quick introduction to using dbm-style databases in your program: http://docs.python.org/library/dbm
They act enough like hashes that it should be a simple transition for you.
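A short sketch of the dbm interface, assuming Python 3. The file location and the key and value used here are placeholders for this example. Note one difference from a plain dict: keys and values are stored as bytes, so strings come back as bytes objects.

```python
import dbm
import os
import tempfile

def dbm_roundtrip():
    """Write one entry to an on-disk dbm database and read it back."""
    # Hypothetical location: a throwaway directory just for this demo.
    path = os.path.join(tempfile.mkdtemp(), "students")
    with dbm.open(path, "c") as db:   # "c" creates the database if needed
        db["1001"] = "Alice"          # str is encoded to bytes on the way in
        value = db["1001"]            # comes back as bytes
    return value
```

Since values must be bytes or strings, storing richer objects means serializing them yourself, which is essentially what shelve does on top of dbm.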
"""I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings""" ... I presume that you mean: """I have a class with about 1000 attributes all of type str or list of str. I have a dictionary mapping about 6000 keys of unspecified type to corresponding instances of that class.""" If that's not a reasonable translation, please correct it.
For a start, 1000 attributes in a class is mindboggling. You must be treating the vast majority generically using value = getattr(obj, attr_name) and setattr(obj, attr_name, value). Consider using a dict instead of an instance: value = obj[attr_name] and obj[attr_name] = value.
Secondly, what percentage of those 6 million attributes are ""? If sufficiently high, you might like to consider implementing a sparse dict which doesn't physically have entries for those attributes, using the __missing__ hook (documented with the dict type in the Python library reference).
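One way such a sparse dict might look is sketched below. The class name SparseAttrs and the choice to drop entries on assignment of "" are assumptions for this illustration, not the only possible design.

```python
class SparseAttrs(dict):
    """dict that stores only non-empty values and reports "" for the rest."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when key is absent; nothing is inserted.
        return ""

    def __setitem__(self, key, value):
        if value == "":
            self.pop(key, None)       # storing "" means: keep no entry at all
        else:
            super().__setitem__(key, value)

record = SparseAttrs()
record["name"] = "Smith"
record["middle_name"] = ""            # takes no space in the dict
```

Be aware that __missing__ is only consulted by the d[key] syntax; get() and the in operator still see the key as absent.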
So I have this code in Python that writes some values to a dictionary where each key is a student ID number and each value is an instance of a class (of type student) with some variables associated with it.
Code
try:
    if ((str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)]==varschosen[1])):
        valuetowrite=str(row[i])
        if students[str(variablekey)].var2 != []:
            students[str(variablekey)].var2.append(valuetowrite)
        else:
            students[str(variablekey)].var2=([valuetowrite])
except:
    two=1#This is just a dummy assignment because I #can't leave it empty... I don't need my program to do anything if the "try" doesn't work. I just want to prevent a crash.
#Assign var3
try:
    if ((str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)]==varschosen[2])):
        valuetowrite=str(row[i])
        if students[str(variablekey)].var3 != []:
            students[str(variablekey)].var3.append(valuetowrite)
        else:
            students[str(variablekey)].var3=([valuetowrite])
except:
    two=1
#Assign var4
try:
    if ((str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)]==varschosen[3])):
        valuetowrite=str(row[i])
        if students[str(variablekey)].var4 != []:
            students[str(variablekey)].var4.append(valuetowrite)
        else:
            students[str(variablekey)].var4=([valuetowrite])
except:
    two=1
The same code repeats many, many times for each variable that the student has (var5, var6,....varX). However, the RAM spike in my program comes up as I execute the function that does this series of variable assignments.
I wish to find out a way to make this more efficient in speed or more memory efficient because running this part of my program takes up around half a gig of memory. :(
Thanks for your help!
EDIT:
Okay let me simplify my question:
In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. I don't really care about the number of lines my code is or the speed at which it runs (Right now, my code is at almost 20,000 lines and is about a 1 MB .py file!). What I am concerned about is the amount of memory it is taking up because this is the culprit in throttling my CPU. The ultimate question is: does the number of code lines by which I build up this massive dictionary matter so much in terms of RAM usage?
My original code functions fine, but the RAM usage is high. I'm not sure if that is "normal" with the amount of data I am collecting. Does writing the code in a condensed fashion (as shown by the people who helped me below) actually make a noticeable difference in the amount of RAM I am going to eat up? Sure there are X ways to build a dictionary, but does it even affect the RAM usage in this case?
Edit: The suggested code-refactoring below won't reduce the memory consumption very much. 6000 classes each with 1000 attributes may very well consume half a gig of memory.
You might be better off storing the data in a database and pulling out the data only as you need it via SQL queries. Or you might use shelve or marshal to dump some or all of the data to disk, where it can be read back in only when needed. A third option would be to use a numpy array of strings. The numpy array will hold the strings more compactly. (Python strings are objects with lots of methods which make them bulkier memory-wise. A numpy array of strings loses all those methods but requires relatively little memory overhead.) A fourth option might be to use PyTables.
And lastly (but not leastly), there might be ways to re-design your algorithm to be less memory intensive. We'd have to know more about your program and the problem it's trying to solve to give more concrete advice.
Original suggestion:
for v in ('var2','var3','var4'):
    try:
        if row_num_id.get(str(i))==varschosen[1]:
            valuetowrite=str(row[i])
            value=getattr(students[str(variablekey)],v)
            if value != []:
                value.append(valuetowrite)
            else:
                # rebinding the local name alone would not update the student
                setattr(students[str(variablekey)],v,[valuetowrite])
    except PUT_AN_EXPLICT_EXCEPTION_HERE:
        pass
PUT_AN_EXPLICT_EXCEPTION_HERE should be replaced with something like AttributeError or TypeError, or ValueError, or maybe something else.
It's hard to guess what to put here because I don't know what kind of values the variables might have.
If you run the code without the try...except block, and your program crashes, take note of the traceback error message you receive. The last line will say something like
TypeError: ...
In that case, replace PUT_AN_EXPLICT_EXCEPTION_HERE with TypeError.
If your code can fail in a number of ways, say, with TypeError or ValueError, then you can replace PUT_AN_EXPLICT_EXCEPTION_HERE with
(TypeError,ValueError) to catch both kinds of error.
Note: There is a little technical caveat that should be mentioned regarding
row_num_id.get(str(i))==varschosen[1]. The expression row_num_id.get(str(i)) returns None if str(i) is not in row_num_id.
But what if varschosen[1] is None and str(i) is not in row_num_id? Then the condition is True, when the longer original condition returned False.
If that is a possibility, then the solution is to use a sentinel default value like row_num_id.get(str(i),object())==varschosen[1]. Now row_num_id.get(str(i),object()) returns object() when str(i) is not in row_num_id. Since object() is a new instance of object there is no way it could equal varschosen[1].
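The caveat can be demonstrated in a few lines. The helper names matches_plain and matches_with_sentinel and the module-level _MISSING object are invented for this sketch.

```python
_MISSING = object()   # unique sentinel; compares equal only to itself

def matches_plain(mapping, key, expected):
    """Compare using get() with its default of None."""
    return mapping.get(key) == expected

def matches_with_sentinel(mapping, key, expected):
    """Compare using a sentinel default that can never equal expected."""
    return mapping.get(key, _MISSING) == expected
```

With an empty mapping and expected set to None, the plain version reports a match even though the key is absent, while the sentinel version does not.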
You've spelled this wrong
two=1#This is just a dummy assignment because I
#can't leave it empty... I don't need my program to do anything if the "try" doesn't work. I just want to prevent a crash.
It's spelled
pass
You should read a tutorial on Python.
Also,
except:
Is a bad policy. Your program will fail to crash when it's supposed to crash.
Names like var2 and var3 are evil. They are intentionally misleading.
Don't repeat str(variablekey) over and over again.
I wish to find out a way to make this more efficient in speed or more memory efficient because running this part of my program takes up around half a gig of memory. :(
This request is unanswerable because we don't know what it's supposed to do. Intentionally obscure names like var1 and var2 make it impossible to understand.
"6000 instantiated classes, where each class has 1000 attributed variables"
So. 6 million objects? That's a lot of memory. A real lot of memory.
What I am concerned about is the amount of memory it is taking up because this is the culprit in throttling my CPU
Really? Any evidence?
but the RAM usage is high
Compared with what? What's your basis for this claim?
Python dicts use a surprisingly large amount of memory. Try:
import sys
for i in range(30):
    d = dict((j, j) for j in range(i))
    print("dict with %d elements is %d bytes" % (i, sys.getsizeof(d)))
for an illustration of just how expensive they are. Note that this is just the size of the dict itself: it doesn't include the size of the keys or values stored in the dict.
By default, an instance of a Python class stores its attributes in a dict. Therefore, each of your 6000 instances is using a lot of memory just for that dict.
One way that you could save a lot of memory, provided that your instances all have the same set of attributes, is to use __slots__ (see http://docs.python.org/reference/datamodel.html#slots). For example:
class Foo( object ):
    __slots__ = ( 'a', 'b', 'c' )
Now, instances of class Foo have space allocated for precisely three attributes, a, b, and c, but no instance dict in which to store any other attributes. This uses only 4 bytes (on a 32-bit system) per attribute, as opposed to perhaps 15-20 bytes per attribute using a dict.
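To see the effect for yourself, here is a small comparison sketch. The class names PlainRecord and SlottedRecord and the helper can_add_attribute are invented for this example; exact byte counts vary between Python versions and platforms, so none are claimed here.

```python
class PlainRecord(object):
    """Ordinary class: every instance carries its own attribute dict."""
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

class SlottedRecord(object):
    """Same attributes, but stored in fixed slots with no instance dict."""
    __slots__ = ("a", "b", "c")

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

def can_add_attribute(obj):
    """Return True if an attribute outside the declared set can be assigned."""
    try:
        obj.extra = 1
    except AttributeError:
        return False
    return True
```

The price of the saving is visible in can_add_attribute: a slotted instance rejects any attribute that is not listed in __slots__, so this only works if all instances share a fixed attribute set.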
Another way in which you could be wasting memory, given that you have a lot of strings, is if you're storing multiple identical copies of the same string. Using the intern function (see http://docs.python.org/library/functions.html#intern) could help if this turns out to be a problem.