I have some Python code that writes values to a dictionary where each key is a student ID number and each value is an instance of a student class, and each instance has a number of attributes associated with it.

Code:
try:
    if ((str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)]==varschosen[1])):
        valuetowrite=str(row[i])
        if students[str(variablekey)].var2 != []:
            students[str(variablekey)].var2.append(valuetowrite)
        else:
            students[str(variablekey)].var2=([valuetowrite])
except:
    two=1  # This is just a dummy assignment because I can't leave it empty... I don't need my program to do anything if the "try" doesn't work. I just want to prevent a crash.

# Assign var3
try:
    if ((str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)]==varschosen[2])):
        valuetowrite=str(row[i])
        if students[str(variablekey)].var3 != []:
            students[str(variablekey)].var3.append(valuetowrite)
        else:
            students[str(variablekey)].var3=([valuetowrite])
except:
    two=1

# Assign var4
try:
    if ((str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)]==varschosen[3])):
        valuetowrite=str(row[i])
        if students[str(variablekey)].var4 != []:
            students[str(variablekey)].var4.append(valuetowrite)
        else:
            students[str(variablekey)].var4=([valuetowrite])
except:
    two=1
The same code repeats many, many times for each variable that the student has (var5, var6, ..., varX). However, the RAM spike in my program occurs as I execute the function that performs this series of variable assignments.
I wish to find out a way to make this more efficient in speed or more memory efficient because running this part of my program takes up around half a gig of memory. :(
Thanks for your help!
EDIT:
Okay let me simplify my question:
In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. I don't really care about the number of lines my code is or the speed at which it runs (Right now, my code is at almost 20,000 lines and is about a 1 MB .py file!). What I am concerned about is the amount of memory it is taking up because this is the culprit in throttling my CPU. The ultimate question is: does the number of code lines by which I build up this massive dictionary matter so much in terms of RAM usage?
My original code functions fine, but the RAM usage is high. I'm not sure if that is "normal" with the amount of data I am collecting. Does writing the code in a condensed fashion (as shown by the people who helped me below) actually make a noticeable difference in the amount of RAM I am going to eat up? Sure there are X ways to build a dictionary, but does it even affect the RAM usage in this case?
Edit: The suggested code-refactoring below won't reduce the memory consumption very much. 6000 classes each with 1000 attributes may very well consume half a gig of memory.
You might be better off storing the data in a database and pulling out the data only as you need it via SQL queries. Or you might use shelve or marshal to dump some or all of the data to disk, where it can be read back in only when needed. A third option would be to use a numpy array of strings. The numpy array will hold the strings more compactly. (Python strings are objects with lots of methods which make them bulkier memory-wise. A numpy array of strings loses all those methods but requires relatively little memory overhead.) A fourth option might be to use PyTables.
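As a rough, hedged sketch of the numpy option (it assumes the attribute values are short strings of bounded length; the dtype and data here are made up):

import numpy as np

# Fixed-width string array: one contiguous block of characters instead of
# one full Python string object per value.
values = ["12.5", "13.0", "", "9.75"]     # hypothetical attribute values
arr = np.array(values, dtype="S8")        # at most 8 bytes per entry
print(arr.nbytes)                         # 32 -- payload bytes for 4 entries
print(arr[3])                             # b'9.75'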
And lastly (but not leastly), there might be ways to re-design your algorithm to be less memory intensive. We'd have to know more about your program and the problem it's trying to solve to give more concrete advice.
Original suggestion:
for n, v in enumerate(('var2', 'var3', 'var4'), start=1):
    try:
        if row_num_id.get(str(i)) == varschosen[n]:   # pairs var2/var3/var4 with varschosen[1]/[2]/[3]
            valuetowrite = str(row[i])
            value = getattr(students[str(variablekey)], v)
            if value != []:
                value.append(valuetowrite)
            else:
                setattr(students[str(variablekey)], v, [valuetowrite])
    except PUT_AN_EXPLICIT_EXCEPTION_HERE:
        pass
PUT_AN_EXPLICIT_EXCEPTION_HERE should be replaced with something like AttributeError, TypeError, or ValueError, or maybe something else.
It's hard to guess what to put here because I don't know what kind of values the variables might have.
If you run the code without the try...except block and your program crashes, take note of the traceback error message you receive. The last line will say something like
TypeError: ...
In that case, replace PUT_AN_EXPLICIT_EXCEPTION_HERE with TypeError.
If your code can fail in a number of ways, say with TypeError or ValueError, then you can replace PUT_AN_EXPLICIT_EXCEPTION_HERE with (TypeError, ValueError) to catch both kinds of error.
Note: There is a little technical caveat that should be mentioned regarding
row_num_id.get(str(i))==varschosen[1]. The expression row_num_id.get(str(i)) returns None if str(i) is not in row_num_id.
But what if varschosen[1] is None and str(i) is not in row_num_id? Then the condition is True, when the longer original condition returned False.
If that is a possibility, then the solution is to use a sentinel default value like row_num_id.get(str(i), object()) == varschosen[1]. Now row_num_id.get(str(i), object()) returns a fresh object() when str(i) is not in row_num_id. Since object() is a brand-new instance of object, there is no way it could equal varschosen[1].
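A tiny self-contained illustration of the caveat and the sentinel fix (the dictionary here is hypothetical):

d = {'a': None}
sentinel = object()

print(d.get('b') == None)            # True  -- a missing key is indistinguishable from a stored None
print(d.get('b', sentinel) == None)  # False -- the fresh sentinel object can't equal any stored value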
You've spelled this wrong:

two=1  # This is just a dummy assignment because I can't leave it empty... I don't need my program to do anything if the "try" doesn't work. I just want to prevent a crash.

It's spelled:

pass
You should read a tutorial on Python.
Also,
except:
is a bad policy. Your program will fail to crash when it's supposed to crash.
Names like var2 and var3 are evil. They are intentionally misleading.
Don't repeat str(variablekey) over and over again.
I wish to find out a way to make this more efficient in speed or more memory efficient because running this part of my program takes up around half a gig of memory. :(
This request is unanswerable because we don't know what it's supposed to do. Intentionally obscure names like var1 and var2 make it impossible to understand.
"6000 instantiated classes, where each class has 1000 attributed variables"
So. 6 million objects? That's a lot of memory. A real lot of memory.
What I am concerned about is the amount of memory it is taking up because this is the culprit in throttling my CPU
Really? Any evidence?
but the RAM usage is high
Compared with what? What's your basis for this claim?
Python dicts use a surprisingly large amount of memory. Try:
import sys

for i in range( 30 ):
    d = dict( ( j, j ) for j in range( i ) )
    print "dict with", i, "elements is", sys.getsizeof( d ), "bytes"
for an illustration of just how expensive they are. Note that this is just the size of the dict itself: it doesn't include the size of the keys or values stored in the dict.
By default, an instance of a Python class stores its attributes in a dict. Therefore, each of your 6000 instances is using a lot of memory just for that dict.
One way that you could save a lot of memory, provided that your instances all have the same set of attributes, is to use __slots__ (see http://docs.python.org/reference/datamodel.html#slots). For example:
class Foo( object ):
    __slots__ = ( 'a', 'b', 'c' )
Now, instances of class Foo have space allocated for precisely three attributes, a, b, and c, but no instance dict in which to store any other attributes. This uses only 4 bytes (on a 32-bit system) per attribute, as opposed to perhaps 15-20 bytes per attribute using a dict.
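As a rough, hedged comparison of the per-instance cost (exact numbers vary by Python version and platform; the class names here are made up):

import sys

class Plain(object):
    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3

class Slotted(object):
    __slots__ = ('a', 'b', 'c')
    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3

p, s = Plain(), Slotted()
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # instance plus its per-instance attribute dict
print(sys.getsizeof(s))                              # slotted instance, no __dict__ at all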
Another way in which you could be wasting memory, given that you have a lot of strings, is if you're storing multiple identical copies of the same string. Using the intern function (see http://docs.python.org/library/functions.html#intern) could help if this turns out to be a problem.
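A small sketch of the interning idea (it assumes many attribute values are identical; in Python 2 intern() is a builtin, in Python 3 it is sys.intern()):

try:
    from sys import intern   # Python 3
except ImportError:
    pass                     # Python 2 already provides intern() as a builtin

a = intern("some frequently repeated value")
b = intern("some " + "frequently repeated value")
print(a is b)                # True -- both names now share a single string object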
Related
I have a class which primarily contains three dicts:

from collections import defaultdict

class KB(object):
    def __init__(self):
        # key: str, value: list of str
        self.linear_patterns = defaultdict(list)
        # key: str, value: list of str
        self.nonlinear_patterns = defaultdict(list)
        # key: str, value: dict
        self.pattern_meta_info = {}
        ...
        self.__initialize()

    def __initialize(self):
        # the 3 dicts are populated
        ...
The sizes of the three dicts are:
linear_patterns: 100,000
nonlinear_patterns: 900,000
pattern_meta_info: 700,000
After the program is run and done, it takes about 15 seconds to release the memory. When I reduce the dict sizes above by loading less data at initialization, the memory is released faster, so I conclude that these dict sizes are what make the memory release slow. The whole program uses about 8 GB of memory. Also, after the dicts are built, all operations are lookups; there are no modifications.
Is there a way to use Cython to optimize the three data structures above, especially in terms of memory usage? Is there a similar Cython dictionary that can replace the Python dicts?
It seems unlikely that a different dictionary or object type would change much. Destructor performance is dominated by the memory allocator. That will be roughly the same unless you switch to a different malloc implementation.
If this is only about object destruction at the end of your program, most languages (but not Python) would allow you to call exit while keeping the KB object alive. The OS will release the memory much quicker when the process terminates. So why bother? Unfortunately that doesn't work with Python's sys.exit(), since it merely raises an exception.
Everything else relies on changing the data structure or algorithm. Are your strings highly redundant? Maybe you can reuse string objects by interning them. Keep them in a shared set to use the same string in multiple places. A simple call to string = sys.intern(string) is enough. Unlike in earlier versions of Python, this will not keep the string object alive beyond its use so you don't run the risk of leaking memory in a long-running process.
You could also pool the strings in one large allocation. If access is relatively rare, you could change the class to use one large io.StringIO object for its contained strings and all dictionaries just deal with (offset, length) tuples into that buffer.
That still leaves many tuple and integer objects but those use specialized allocators that may be faster. Also, the length integer will come from the common pool of small integers and not allocate a new object.
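A rough sketch of that pooling idea, under the assumption that strings are written once and then only read back (the helper names are made up):

import io

_pool = io.StringIO()          # all contained strings live in this one buffer

def pool_add(s):
    _pool.seek(0, io.SEEK_END) # always append at the end of the buffer
    offset = _pool.tell()
    _pool.write(s)
    return (offset, len(s))    # the dictionaries store only this small tuple

def pool_get(ref):
    offset, length = ref
    _pool.seek(offset)
    return _pool.read(length)

linear_patterns = {"pattern-1": [pool_add("first match"), pool_add("second match")]}
print(pool_get(linear_patterns["pattern-1"][0]))   # 'first match'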
A final thought: 8 GB of string data. Are you sure you don't want a small sqlite or dbm database? It could be a temporary file.
Background: I need to read the same key/value from a dictionary (exactly) twice.
Question: There are two ways, as shown below,
Method 1. Read it with the same key twice, e.g.:

sample_map = {'A': 1}
...
if sample_map.get('A', None) is not None:
    print("A's value in map is {}".format(sample_map.get('A')))

Method 2. Read it once and store it in a local variable, e.g.:

sample_map = {'A': 1}
...
ret_val = sample_map.get('A', None)
if ret_val is not None:
    print("A's value in map is {}".format(ret_val))
Which way is better? What are their Pros and Cons?
Note that I am aware that print() can naturally handle ret_val of None. This is a hypothetical example and I just use it for illustration purposes.
Under these conditions, I wouldn't use either. What you're really interested in is whether A is a valid key, and the KeyError (or lack thereof) raised by __getitem__ will tell you if it is or not.
try:
    print("A's value in map is {}".format(sample_map['A']))
except KeyError:
    pass

Of course, some would say there is too much code in the try block, in which case method 2 would be preferable.
try:
    ret_val = sample_map['A']
except KeyError:
    pass
else:
    print("A's value in map is {}".format(ret_val))

or the code you already have:

ret_val = sample_map.get('A')  # None is the default value for the second argument
if ret_val is not None:
    print("A's value in map is {}".format(ret_val))
There isn't any effective difference between the options you posted.
See also: Python: List vs Dict for look up table.
Lookups in a dict are about O(1). Same goes for a variable you have stored.
Efficiency is about the same. In this case, I would skip defining the extra variable, since not much else is going on.
But in a case like the one below, where there are a lot of dict lookups going on, I plan to refactor the code to make it more intelligible, since all of the lookups clutter and obfuscate the logic:
# At this point, assuming that these are floats is OK, since no thresholds had text values
if vname in paramRanges:
    """
    Making sure the variable is one that we have a threshold for
    """
    # We might care about it
    # Don't check the equal case, because that won't matter
    if float(tblChanges[vname][0]) < float(tblChanges[vname][1]):
        # Check lower tolerance
        # Distinction is important because tolerances are not always the same +/-
        if abs(float(tblChanges[vname][0]) - float(tblChanges[vname][1])) >= float(paramRanges[vname][2]):
            # Difference from default is greater than tolerance
            # vname : current value, default value, negative tolerance, tolerance units, change date
            alerts[vname] = (
                float(tblChanges[vname][0]), float(tblChanges[vname][1]), float(paramRanges[vname][2]),
                paramRanges[vname][0], tblChanges[vname][2]
            )
    if abs(float(tblChanges[vname][0]) - float(tblChanges[vname][1])) >= float(paramRanges[vname][1]):
        alerts[vname] = (
            float(tblChanges[vname][0]), float(tblChanges[vname][1]), float(paramRanges[vname][1]),
            paramRanges[vname][0], tblChanges[vname][2]
        )
In most cases—if you can't just rewrite your code to use EAFP as chepner suggests, which you probably can for this example—you want to avoid repeated method calls.
The only real benefit of repeating the get is saving an assignment statement.
If your code isn't crammed in the middle of a complex expression, that just means saving one line of vertical space—which isn't nothing, but isn't a huge deal.
If your code is crammed in the middle of a complex expression, pulling the get out may force you to rewrite things a bit. You may have to, e.g., turn a lambda into a def, or turn a while loop with a simple condition into a while True: with an if …: break. Usually that's a sign that you, e.g., really wanted a def in the first place, but "usually" isn't "always". So, this is where you might want to violate the rule of thumb—but see the section at the bottom first.
On the other side…
For dict.get, the performance cost of repeating the method is pretty tiny, and unlikely to impact your code. But what if you change the code to take an arbitrary mapping object from the caller, and someone passes you, say, a proxy that does a get by making a database query or an RPC to a remote server?
For single-threaded code, calling dict.get with the same arguments twice in a row without doing anything in between is correct. But what if you're taking a dict passed by the caller, and the caller has a background thread also modifying the same dict? Then your code is only correct if you put a Lock or other synchronization around the two accesses.
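For instance, a hedged sketch of that situation (the mapping, key, and lock are hypothetical; the writer thread would have to hold the same lock):

import threading

lock = threading.Lock()

def describe(shared_map, key):
    with lock:                       # guard the read if another thread may mutate shared_map
        value = shared_map.get(key)  # read once; no second lookup to race with
    if value is not None:
        return "{}'s value in map is {}".format(key, value)
    return "<missing>"

print(describe({'A': 1}, 'A'))       # "A's value in map is 1"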
Or, what if your expression was something that might mutate some state, or do something dangerous?
Even if nothing like this is ever going to be an issue in your code, unless that fact is blindingly obvious to anyone reading your code, they're still going to have to think about the possibility of performance costs and TOCTOU races and so on.
And, of course, it makes at least two of your lines longer. Assuming you're trying to write readable code that sticks to 72 or 79 or 99 columns, horizontal space is a scarce resource, while vertical space is much less of a big deal. I think your second version is easier to scan than your first, even without all of these other considerations, but imagine making the expression, say, 20 characters longer.
In the rare cases where pulling the repeated value out of an expression would be a problem, you still often want to assign it to a temporary.
Unfortunately, up to Python 3.7, you usually can't. It's either clumsy (e.g., requiring an extra nested comprehension or lambda just to give you an opportunity to bind a variable) or impossible.
But in Python 3.8, PEP 572 assignment expressions handle this case.
if (sample := sample_map.get('A', None)) is not None:
    print("A's value in map is {}".format(sample))
I don't think this is a great use of an assignment expression (see the PEP for some better examples), especially since I'd probably write this the way chepner suggested… but it does show how to get the best of both worlds (assigning a temporary, and being embeddable in an expression) when you really need to.
I am making a POST to a Python script; the POST has two parameters, Name and Location, and the script returns one string. My question is: I am going to have hundreds of these options, so is it faster to do it in a dictionary like this:

myDictionary = {"Name": {"Location": "result", "LocationB": "resultB"},
                "Name2": {"Location2": "result2A", "Location2B": "result2B"}}

And then I would use .get("Name").get("Location") to get the results.
OR do something like this:
if Name = "Name":
if Location = "Location":
result = "result"
elif Location = "LocationB":
result = "resultB"
elif Name = "Name2":
if Location = "Location2B":
result = "result2A"
elif Location = "LocationB":
result = "result2B"
Now if there are hundreds or thousands of these what is faster? Or is there a better way all together?
First of all:
Generally, it's much more pythonic to match keys to values using dictionaries. You should do that from a point of style.
Secondly:
If you really care about performance, Python might not always be the optimal tool. However, the dict approach should be much, much faster, unless your selections happen about as often as the creation of these dicts. The creation of thousands and thousands of PyObjects to check your case is a really bad idea.
Thirdly:
If you care about your application so much, you might really want to benchmark both solutions -- as usual when it comes to performance questions, there are a million factors, including your computing platform, that only experiments will help sort out.
Fourth(ly?):
It looks like you're building something like a protocol parser. That's really not Python's forte, performance-wise. Maybe you'd want to look into one of the dozens of tools that can write C code parsers for you and wrap that in a native module; if done right, it's pretty sure to be faster than either of your implementations.
Here's the python documentation on Extending Python with C or C++
I decided to test the two scenarios with 1000 Names and 2 Locations.
The Test Samples
Team Dictionary:
di = {}
for i in range(1000):
    di["Name{}".format(i)] = {'Location': 'result{}'.format(i), 'LocationB': 'result{}B'.format(i)}

def get_dictionary_value():
    di.get("Name999").get("LocationB")
Team If Statement:
I used a Python script to generate a 5000-line function if_statements(name, location) following this pattern:
    elif name == 'Name994':
        if location == 'Location':
            return 'result994'
        elif location == 'LocationB':
            return 'result994B'

# Some time later ...

def get_if_value():
    if_statements("Name999", "LocationB")
Timing Results
You can use timeit to measure the time it takes a function to complete.
import timeit

print(timeit.timeit(get_dictionary_value))
# 0.06353...

print(timeit.timeit(get_if_value))
# 6.3684...
So there you have it: the dictionary was 100 times faster on my machine than the hefty 165 KB if-statement function.
I will root for dict().
In most cases, [key] selection is much faster than conditional checks. As a rule of thumb, conditionals are generally used for boolean tests.
The reason for this is that when you create a dictionary, you essentially create a registry of that data, stored as hashes in buckets. When you write, for instance, dictionary_name['key'], Python knows the exact location of that value (if it exists) and returns it almost in an instant.
Conditionals, however, are different. They are sequential checks, meaning that in the worst case every condition provided has to be checked just to establish the value's existence before the respective data is returned.
As you can see, with hundreds of statements this can be problematic, and in this case dictionaries are faster. You also need to be cognizant of how often and how quickly these checks happen, because if a lookup can occur before your dictionary has finished being built, you might get an error that the value was not found.
In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. As I build this dictionary up, my RAM goes up super high. Is there a way to write the dictionary as it is being built to the harddrive rather than the RAM so that I can save some memory? I've heard of something called "pickle" but I don't know if this is a feasible method for what I am doing.
Thanks for your help!
Maybe you should be using a database, but check out the shelve module.
If shelve isn't powerful enough for you, there is always the industrial-strength ZODB.
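A minimal, hedged sketch of the shelve approach (the file name and the Student class here are made up):

import shelve

class Student(object):
    def __init__(self):
        self.var2 = []

db = shelve.open("students_example")   # data lives on disk, not in RAM
s = Student()
s.var2.append("some value")
db["1234"] = s                          # keys must be strings
db.close()

db = shelve.open("students_example")
print(db["1234"].var2)                  # ['some value'] -- loaded back only when accessed
db.close()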
shelve, as @gnibbler recommends, is what I would no doubt be using, but watch out for two traps: a simple one (all keys must be strings) and a subtle one (as the values don't normally exist in memory, calling mutators on them may not work as you expect).
For the simple problem, it's normally easy to find a workaround (and you do get a clear exception if you forget and try, e.g., using an int or whatever as the key, so it's not hard to remember that you do need a workaround).
For the subtle problem, consider for example:
x = d['foo']
x.amutatingmethod()
# ...much later...
y = d['foo']
# is y "mutated" or not now?
the answer to the question in the last comment depends on whether d is a real dict (in which case y will be mutated, and in fact exactly the same object as x) or a shelf (in which case y will be a distinct object from x, and in exactly the state you last saved to d['foo']!).
To get your mutations to persist, you need to "save them to disk" by doing
d['foo'] = x
after calling any mutators you want on x (so in particular you cannot just do
d['foo'].mutator()
and expect the mutation to "stick", as you would if d were a dict).
shelve does have an option to cache all fetched items in memory, but of course that can fill up the memory again, and result in long delays when you finally close the shelf object (since all the cached items must be saved back to disk then, just in case they had been mutated). That option was something I originally pushed for (as a Python core committer), but I've since changed my mind and I now apologize for getting it in (ah well, at least it's not the default!-), since the cases it should be used in are rare, and it can often trap the unwary user... sorry.
BTW, in case you don't know what a mutator, or "mutating method", is, it's any method that alters the state of the object you call it on -- e.g. .append if the object is a list, .pop if the object is any kind of container, and so on. No need to worry if the object is immutable, of course (numbers, strings, tuples, frozensets, ...), since it doesn't have mutating methods in that case;-).
Pickling an entire hash over and over again is bound to run into the same memory pressures that you're facing now -- maybe even worse, with all the data marshaling back and forth.
Instead, using an on-disk database that acts like a hash is probably the best bet; see this page for a quick introduction to using dbm-style databases in your program: http://docs.python.org/library/dbm
They act enough like hashes that it should be a simple transition for you.
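A small, hedged sketch of the dbm-style approach (Python 3's dbm module shown; the file and key names are made up, and values come back as bytes):

import dbm

db = dbm.open("students_dbm_example", "c")   # "c" creates the file if it doesn't exist
db["student:1234:var2"] = "some value"       # str keys/values are stored as bytes
print(db["student:1234:var2"])               # b'some value'
db.close()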
"""I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings""" ... I presume that you mean: """I have a class with about 1000 attributes all of type str or list of str. I have a dictionary mapping about 6000 keys of unspecified type to corresponding instances of that class.""" If that's not a reasonable translation, please correct it.
For a start, 1000 attributes in a class is mindboggling. You must be treating the vast majority generically using value = getattr(obj, attr_name) and setattr(obj, attr_name, value). Consider using a dict instead of an instance: value = obj[attr_name] and obj[attr_name] = value.
Secondly, what percentage of those 6 million attributes are the empty string ""? If sufficiently high, you might like to consider implementing a sparse dict which doesn't physically have entries for those attributes, using the __missing__ hook -- docs here.
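A hedged sketch of that sparse-dict idea (the class and keys are made up): attributes that were never set read back as "" without each taking up an entry.

class SparseRecord(dict):
    def __missing__(self, key):
        return ""                 # default for the many empty attributes; nothing is stored

rec = SparseRecord()
rec["gpa"] = "3.7"                # only non-empty attributes occupy space
print(rec["gpa"])                 # '3.7'
print(rec["middle_name"])         # ''
print(len(rec))                   # 1  -- the missing attribute cost nothing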
I have some data stored in a DB that I want to process. DB access is painfully slow, so I decided to load all data in a dictionary before any processing. However, due to the huge size of the data stored, I get an out of memory error (I see more than 2 gigs being used). So I decided to use a disk data structure, and found out that using shelve is an option. Here's what I do (pseudo python code)
def loadData():
    if (#dict exists on disk):
        d = shelve.open(name)
        return d
    else:
        d = shelve.open(name, writeback=True)
        # access DB and write data to dict
        #   d[key] = value
        # or, for mutable values:
        #   oldValue = d[key]
        #   newValue = f(oldValue)
        #   d[key] = newValue
        d.close()
        d = shelve.open(name, writeback=True)
        return d
I have a couple of questions,
1) Do I really need writeback=True? What does it do?
2) I still get an out-of-memory exception, since I do not exercise any control over when the data is being written to disk. How do I do that? I tried doing a sync() every few iterations, but that didn't help either.
Thanks!
writeback=True forces the shelf to keep in-memory any item ever fetched, and write them back when the shelf is closed. So, it consumes much more memory, and slows down closing.
The advantage of the parameter is that, with it, you don't need the contorted code you show in your comment for mutable items whose mutator is a method -- just
shelf['foobar'].append(23)
works (if shelf was opened with writeback enabled), assuming the item at key 'foobar' is a list of course, while it would silently be a no-operation (leaving the item on disk unchanged) if shelf was opened without writeback -- in the latter case you actually do need to code
thelist = shelf['foobar']
thelist.append(23)
shelf['foobar'] = thelist
in your comment's spirit -- which is stylistically somewhat of a bummer.
However, since you are having memory problems, I definitely recommend not using this dubious writeback option. I think I can call it "dubious" since I was the one proposing and first implementing it, but that was many years ago, and I've mostly repented of doing it -- it generates more confusion (as your Q evidences) than it allows elegance and handiness in moving code originally written to work with dicts (which would use the first idiom, not the second, and thus needs rewriting in order to be usable with shelves without writeback). Ah well, sorry, it did seem a good idea at the time.
Using the sqlite3 module is probably your best choice here. You might be able to use sqlite entirely in memory, since its memory footprint might be a bit smaller than that of the equivalent Python objects. It's generally a better choice than shelve anyway; shelve uses pickle underneath, which is rarely what you want.
Hell, you could just convert your entire existing database to a sqlite database. sqlite is nice and fast.
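A hedged sketch of sqlite3 used as a simple key/value store (the file, table, and column names are made up):

import sqlite3

conn = sqlite3.connect("data_example.db")   # or ":memory:" to keep it in RAM
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("student:1234", "some value"))
conn.commit()

row = conn.execute("SELECT value FROM kv WHERE key = ?", ("student:1234",)).fetchone()
print(row[0])                               # 'some value'
conn.close()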