I am making a POST request to a Python script. The POST has two parameters, Name and Location, and the script returns one string. Since I am going to have hundreds of these options, is it faster to look them up in a dictionary like this:
myDictionary = {"Name": {"Location": "result", "LocationB": "resultB"},
                "Name2": {"Location2": "result2A", "Location2B": "result2B"}}
And then I would use .get("Name").get("Location") to get the result.
OR do something like this:
if Name == "Name":
    if Location == "Location":
        result = "result"
    elif Location == "LocationB":
        result = "resultB"
elif Name == "Name2":
    if Location == "Location2":
        result = "result2A"
    elif Location == "Location2B":
        result = "result2B"
Now if there are hundreds or thousands of these, which is faster? Or is there a better way altogether?
First of all:
Generally, it's much more Pythonic to match keys to values using dictionaries. You should do that as a matter of style.
Secondly:
If you really care about performance, Python might not always be the optimal tool. However, the dict approach should be much, much faster, unless your lookups happen about as rarely as the creation of these dicts. Creating thousands and thousands of PyObjects just to check your case is a really bad idea.
Thirdly:
If you care about your application so much, you might really want to benchmark both solutions. As usual with performance questions, there are a million factors, including your computing platform, that only experiments will help to sort out.
Fourth(ly?):
It looks like you're building something like a protocol parser. That's really not python's forte, performance-wise. Maybe you'd want to look into one of the dozens of tools that can write C code parsers for you and wrap that in a native module, it's pretty sure to be faster than either of your implementations, if done right.
Here's the python documentation on Extending Python with C or C++
I decided to test the two scenarios: 1000 names with 2 locations each.
The Test Samples
Team Dictionary:
di = {}
for i in range(1000):
    di["Name{}".format(i)] = {'Location': 'result{}'.format(i),
                              'LocationB': 'result{}B'.format(i)}

def get_dictionary_value():
    di.get("Name999").get("LocationB")
Team If Statement:
I used a python script to generate a 5000-line function if_statements(name, location) following this pattern:
    elif name == 'Name994':
        if location == 'Location':
            return 'result994'
        elif location == 'LocationB':
            return 'result994B'
    # Some time later ...

def get_if_value():
    if_statements("Name999", "LocationB")
Timing Results
You can use the timeit module to measure how long a function takes to complete.
import timeit
print(timeit.timeit(get_dictionary_value))
# 0.06353...
print(timeit.timeit(get_if_value))
# 6.3684...
So there you have it: the dictionary lookup was 100 times faster on my machine than the hefty 165 KB if-statement function.
I will root for dict().
In most cases [key] selection is much faster than conditional checks. As a rule of thumb, conditionals are meant for boolean logic, not for mapping values.
The reason for this is that when you create a dictionary, you essentially create a registry of that data, stored as hashes in buckets. When you look up dictionary_name['key'], Python hashes the key, knows the exact location of the value if it exists, and returns it almost instantly.
Conditionals are different. They are sequential checks, meaning that in the worst case every condition has to be evaluated just to establish the value's existence, before the respective data can even be returned.
As you can see, with hundreds of statements this becomes a problem, so in this case dictionaries are faster. You also need to be aware of when these lookups happen relative to when the dictionary is built: if a lookup runs before the dictionary is fully populated, the key simply won't be found.
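To see the scaling difference for yourself, here is a small self-contained sketch (the table size and key names are made up for illustration): a hashed lookup versus a linear scan over the same data, which is effectively what a long elif chain does.

```python
import timeit

# Hypothetical lookup table with 1000 entries (names invented for illustration)
table = {"Name{}".format(i): "result{}".format(i) for i in range(1000)}

def dict_lookup():
    # Hashed access: roughly constant time, independent of table size
    return table["Name999"]

def sequential_lookup():
    # Sequential checks: worst case walks every entry, like a long elif chain
    for key, value in table.items():
        if key == "Name999":
            return value

print(timeit.timeit(dict_lookup, number=10000))
print(timeit.timeit(sequential_lookup, number=10000))
```

On a typical machine the sequential version is orders of magnitude slower, and the gap grows with the number of entries.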
Related
I'm caching values that are slow to calculate but are usually needed several times. I have a dictionary that looks something like this:
stored_values = {
    hash1: slow_to_calc_value1,
    hash2: slow_to_calc_value2,
    # And so on x5000
}
I'm using it like this, to quickly fetch the value if it has been calculated before.
def calculate_value_for_item(item):
    item_hash = hash_item(item)  # Hash the item, used as the dictionary key
    stored_value = stored_values.get(item_hash, None)
    if stored_value is not None:
        return stored_value

    calculated_value = do_heavy_math(item)  # This is slow and I want to avoid it

    # Storing the result for re-use makes me run out of memory at some point
    stored_values[item_hash] = calculated_value
    return calculated_value
However, I'm running out of memory if I try to store all values that are calculated throughout the program.
How can I manage the size of the lookup dictionary efficiently? It's a reasonable assumption that values which were needed most recently are also most likely to be needed in the future.
Things to note
I have simplified the scenario a lot.
The stored values actually use a lot of memory. The dictionary itself doesn't contain too many items, only several thousand. I can definitely afford some parallel book-keeping data structures if needed.
An ideal solution would let me store n last needed values while removing the rest. But any heuristic close enough is good enough.
Have you tried using the @lru_cache decorator? It seems to do exactly what you are asking for.
from functools import lru_cache

store_this_many_values = 5

@lru_cache(maxsize=store_this_many_values)
def calculate_value_for_item(item):
    calculated_value = do_heavy_math(item)
    return calculated_value
@lru_cache also adds new functions to the decorated callable, such as cache_info, which might help you to optimise for memory and/or performance:
for i in [1, 1, 1, 2]:
    calculate_value_for_item(i)

print(calculate_value_for_item.cache_info())
# CacheInfo(hits=2, misses=2, maxsize=5, currsize=2)
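If @lru_cache doesn't fit your code as-is (for instance, because you want to keep hashing items yourself, as in the question), the same least-recently-used policy is only a few lines with collections.OrderedDict. This is a sketch; hash_item and do_heavy_math below are stand-ins for the question's functions:

```python
from collections import OrderedDict

MAX_ENTRIES = 5000  # tune to your memory budget

def hash_item(item):      # placeholder for the question's hashing helper
    return hash(item)

def do_heavy_math(item):  # placeholder for the slow calculation
    return item * 2

stored_values = OrderedDict()

def calculate_value_for_item(item):
    item_hash = hash_item(item)
    if item_hash in stored_values:
        stored_values.move_to_end(item_hash)  # mark as most recently used
        return stored_values[item_hash]
    value = do_heavy_math(item)
    stored_values[item_hash] = value
    if len(stored_values) > MAX_ENTRIES:
        stored_values.popitem(last=False)     # evict the least recently used
    return value
```

This keeps at most MAX_ENTRIES stored values and always drops the one that was needed least recently, which matches the stated assumption that recently needed values are the likeliest to be needed again.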
I'm working on a mechatronics project where I read current (amps) data from multiple sources and have to calculate a response (fed to a mechanical system) based on changing value trends within and among the sources (increasing/decreasing values and increasing/decreasing relative differences). There are many conditions to assess (each with a unique or mixed response) and many variables they depend on, so I'm left with lots of nested if-elif-else statements, each evaluating multiple conditions and flags, which take too long to respond while data flows in fast (up to 85 Hz).
The module is part of a larger project and needs to be done using Python only. Here's what that part of my current code looks like:
def function(args):
    if flag1 and flag2 and condition1 and not condition2:
        if condition3 and not flag3:
            response += var1
            flag4 = True
        elif -- :
            response = var2
            flag3 = False
        elif -- :
            ------------
        else:
            ------------
    if not flag_n and flag_m and condition_p and condition_q and not condition_r:
        if .. elif ... else:
            flags... response changes...
    # more IFs
What I need is a better and more efficient way of doing this, or a completely different approach, e.g. some machine learning or deep learning algorithm or framework suitable for this kind of usage.
You could use binary, maybe:
flag_bits = {flag1: 0b0000000001,
             flag2: 0b0000000010,
             flag3: 0b0000000100,
             flag4: 0b0000001000,
             condition1: 0b0000010000,
             condition2: 0b0000100000,
             ...}
Then as you receive flags and conditions evaluate them bitwise, and have a dictionary of results or methods to calculate results based on it:
def add_response(response, add_value):
    return response + add_value

def subtract_response(response, subtract_value):
    return response - subtract_value

response_actions = {0b0000110011: ('add', var1, 0b0000001000), ...}
response_methods = {'add': add_response, 'sub': subtract_response, ...}

response_action = response_actions[0b0000110011]
response_method = response_methods[response_action[0]]
response = response_method(response, response_action[1])
flag_bits = response_action[2]
Obviously, not totally perfect, but it will eliminate lots of ifs and turn actions into a lookup and hopefully save time.
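Here is a small runnable sketch of the same idea; the flag names, masks, and actions are invented for illustration:

```python
# Bit flags for the current state (names and values are illustrative)
FLAG1      = 0b0001
FLAG2      = 0b0010
CONDITION1 = 0b0100
CONDITION2 = 0b1000

def add_response(response, value):
    return response + value

def sub_response(response, value):
    return response - value

# Map a complete state to (method, argument) instead of walking an if tree
response_actions = {
    FLAG1 | CONDITION1: (add_response, 5),
    FLAG2 | CONDITION2: (sub_response, 3),
}

def respond(state, response):
    action = response_actions.get(state)
    if action is None:
        return response  # no rule registered for this combination
    method, arg = action
    return method(response, arg)

print(respond(FLAG1 | CONDITION1, 10))  # 15
```

Evaluating the flags once into a single integer and doing one dictionary lookup replaces an arbitrary depth of nested conditionals with a constant-time dispatch.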
From your question, I could not tell whether your problem is that the if-else statements are getting bigger and more confusing over time, or that they are computationally expensive.
In either case, any machine learning or deep learning framework would probably be much slower than your if-elses, and more confusing, because it is very hard to know why a deep learning algorithm does what it does. What if your robot flips over? You'll never know why. But you can trace if-else statements. I would strongly advise against the AI route unless your if-else trees are something like 3000-5000 lines long and changing by 100-200 lines daily, or anything like that.
Software developers generally try to follow good design principles to avoid falling into situations like this; however, if it is too late to change the architecture, polymorphism (see "What is the best way to replace or substitute if..else if..else trees in programs?") can come to your rescue.
That being said, I've worked on a lot of sensor- and math-heavy projects, and they always grew the same way: the project starts nice and slow, then come nice improvements, then comes the deadline and you end up with if-else spaghetti. Always the same, at least for me. So what I do nowadays is, whenever I have an improvement, I try to add it to the source code in a way that keeps the general architecture intact.
Another way of dealing with this issue is drawing flowcharts, as in Matlab's Simulink, to show explicitly how the project is supposed to work and how it is actually implemented.
I have a more-or-less complex data structure (a list of dictionaries of sets) on which I perform a bunch of operations in a loop until the data structure reaches a steady state, i.e. doesn't change anymore. The number of iterations it takes to perform the calculation varies wildly depending on the input.
I'd like to know if there's an established way of forming a halting condition in this case. The best I could come up with is pickling the data structure, storing its md5, and checking whether it has changed since the previous iteration. Since this is more expensive than my operations, I only do it every 20 iterations, but it still feels wrong.
Is there a nicer or cheaper way to check for deep equality so that I know when to halt?
Thanks!
Take a look at python-deep. It should do what you want, and if it's not fast enough you can modify it yourself.
It also very much depends on how expensive the compare operation is and how expensive one calculation iteration is. Say one calculation iteration takes time c, one test takes time t, and the chance of termination per iteration is p; then the optimal testing frequency is:
(t * p) / c
That assumes c < t; if that's not true, you should obviously check every loop.
So, since you can dynamically track c and t and estimate p (with possible adaptations in the code if it suspects the calculation is about to end), you can set your test frequency to an optimal value.
I think your only choices are:
1. Have every update mark a "dirty flag" when it alters a value from its starting state.
2. Do a whole-structure analysis (like the pickle/md5 combination you suggested).
3. Just run a fixed number of iterations known to reach a steady state (possibly running too many times, but without the overhead of checking the termination condition).
Option 1 is analogous to what Python itself does with ref-counting. Option 2 is analogous to what Python does with its garbage collector. Option 3 is common in numerical analysis (e.g. run divide-and-average 20 times to compute a square root).
Checking for equality doesn't seem the right way to go to me. Provided that you have full control over the operations you perform, I would introduce a "modified" flag (a boolean) that is set to False at the beginning of each iteration. Whenever one of your operations modifies (part of) your data structure, it sets the flag to True, and you repeat until the flag stays False throughout a complete iteration.
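A minimal sketch of that pattern, assuming each operation can be made to report whether it changed anything (the toy operation below is invented for illustration):

```python
def run_until_steady(data, operations):
    """Repeat all operations until a full pass makes no modification."""
    while True:
        modified = False
        for op in operations:
            if op(data):       # each op returns True if it changed data
                modified = True
        if not modified:
            return data

# Toy operation: decrement every value above 10 by one per pass
def decrement_large_values(d):
    changed = False
    for k, v in d.items():
        if v > 10:
            d[k] = v - 1
            changed = True
    return changed

print(run_until_steady({'a': 13, 'b': 4}, [decrement_large_values]))
# {'a': 10, 'b': 4}
```

The cost of the halting check is a single boolean per pass, instead of a deep comparison or a hash of the whole structure.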
I would trust the python equality operator to be reasonably efficient for comparing compositions of built-in objects.
I expect it would be faster than pickling+hashing, provided python tests for list equality something like this:
def __eq__(a, b):
    if type(a) == list and type(b) == list:
        if len(a) != len(b):
            return False
        for i in range(len(a)):
            if a[i] != b[i]:
                return False
        return True
    # testing for other types goes here
Since the function returns as soon as it finds two elements that don't match, in the average case it won't need to iterate through the whole thing. Compare to hashing, which does need to iterate through the whole data structure, even in the best case.
Here's how I would do it:
import copy

def perform_a_bunch_of_operations(data):
    # take care not to modify the original data, as we will be using it later
    my_shiny_new_data = copy.deepcopy(data)
    # do lots of math here...
    return my_shiny_new_data

data = get_initial_data()
while True:
    nextData = perform_a_bunch_of_operations(data)
    if data == nextData:  # steady state reached
        break
    data = nextData
This has the disadvantage of having to make a deep copy of your data each iteration, but it may still be faster than hashing - you can only know for sure by profiling your particular case.
There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy.
The current problem seems to be related to a class called RevisionSet in the script. In essence what it does is create a large hashtable(?) of integer-keyed boolean values. In the worst case - one for each revision in our SVN repository, which is near 75,000 now.
After that it performs set operations on such huge arrays - addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) implementation, which, naturally, gets pretty slow on such large sets. The whole data structure could be optimized because there are long spans of continuous values. For example, all keys from 1 to 74,000 might contain true. Also the script is written for Python 2.2, which is a pretty old version and we're using 2.6 anyway, so there could be something to gain there too.
I could try to cobble this together myself, but it would be difficult and take a lot of time - not to mention that it might be already implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?
You could try doing it with numpy instead of plain python. I found it to be very fast for operations like these.
For example:
# Create 1000000 numbers between 0 and 1000, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)
# Get all items that are larger than 500, takes 2.58ms
y = x > 500
# Add 10 to those items, takes 26.1ms
x[y] += 10
Since that's with a lot more rows, I think that 75000 should not be a problem either :)
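Applied to the revision-set case, one option is to represent each set as a boolean array indexed by revision number, so union, intersection and difference become single vectorized operations. A sketch with illustrative spans:

```python
import numpy as np

# Represent each revision set as a boolean array indexed by revision number
# (the sizes and spans here are illustrative)
NUM_REVISIONS = 75000

a = np.zeros(NUM_REVISIONS + 1, dtype=bool)
b = np.zeros(NUM_REVISIONS + 1, dtype=bool)
a[1:74001] = True       # a long contiguous span, as in the question
b[50000:75001] = True

union        = a | b    # revisions in either set
intersection = a & b    # revisions in both sets
difference   = a & ~b   # revisions only in a

print(int(intersection.sum()))  # 24001
```

Each of these operations runs in optimized C over the whole array, instead of looping over 75,000 dictionary entries in Python.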
Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think this will really help because it actually harnesses the fast C implementation of sets rather than looping in Python, which is what the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted; you might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.
class RevisionSet(set):
    """
    A set of revisions. As this class does not include branch
    information, it's assumed that one instance will be used per
    branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a set or list of revisions. Raises ValueError if the input string
        is invalid."""
        revision_range_split_re = re.compile('[-:]')
        if isinstance(parm, set):
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                              int(rev_or_revs[1]) + 1)))
                    else:
                        raise ValueError, 'Ill formatted revision range: ' + R

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start, end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e + 1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s, e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)
Addition:
By the way, I compared doing unions, intersections and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works and if it does, then see if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).
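For reference, the range-merging logic in normalized() can also be exercised on its own; here is a standalone Python 3 sketch of the same algorithm:

```python
def normalize(revisions):
    """Collapse a set of ints into a minimal list of (start, end) ranges."""
    revnums = sorted(revisions, reverse=True)
    ranges = []
    while revnums:
        start = end = revnums.pop()          # smallest remaining revision
        while revnums and revnums[-1] in (end, end + 1):
            end = revnums.pop()              # extend the run while contiguous
        ranges.append((start, end))
    return ranges

print(normalize({1, 2, 3, 7, 8, 10}))  # [(1, 3), (7, 8), (10, 10)]
```

This is what makes the string form compact even when the set contains long contiguous spans like 1-74000.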
For example, all keys from 1 to 74,000 contain true
Why not work on a subset? Just 74001 to the end.
Pruning 74/75th of your data is far easier than trying to write an algorithm more clever than O(n).
You should rewrite RevisionSet to have a set of revisions. I think the internal representation for a revision should be an integer and revision ranges should be created as needed.
There is no compelling reason to use code that supports python 2.3 and earlier.
Just a thought. I used to do this kind of thing using run-length coding in binary image manipulation. That is, store each set as a series of run lengths: number of bits off, number of bits on, number of bits off, and so on.
Then you can do all sorts of boolean operations on them as decorations on a simple merge algorithm.
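A tiny sketch of the representation (the run lengths here are invented for illustration): a set is stored as alternating counts of absent and present revisions, and expanded only when needed.

```python
def runs_to_set(runs, start=1):
    """Expand [off, on, off, on, ...] run lengths into a set of ints."""
    result = set()
    pos = start
    present = False          # runs alternate, beginning with an "off" run
    for length in runs:
        if present:
            result.update(range(pos, pos + length))
        pos += length
        present = not present
    return result

# 2 revisions absent, then 3 present, then 1 absent, then 2 present:
print(sorted(runs_to_set([2, 3, 1, 2])))  # [3, 4, 5, 7, 8]
```

For data with long contiguous spans, like the 1-74000 example in the question, a handful of run lengths replaces tens of thousands of individual entries, and set operations become a merge over two short run lists.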
So I have this code in Python that writes some values to a dictionary, where each key is a student ID number and each value is an instance of a Student class with some variables associated with it.
Code
try:
    if (str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)] == varschosen[1]):
        valuetowrite = str(row[i])
        if students[str(variablekey)].var2 != []:
            students[str(variablekey)].var2.append(valuetowrite)
        else:
            students[str(variablekey)].var2 = [valuetowrite]
except:
    two = 1  # This is just a dummy assignment because I can't leave it empty...
             # I don't need my program to do anything if the "try" doesn't work.
             # I just want to prevent a crash.

# Assign var3
try:
    if (str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)] == varschosen[2]):
        valuetowrite = str(row[i])
        if students[str(variablekey)].var3 != []:
            students[str(variablekey)].var3.append(valuetowrite)
        else:
            students[str(variablekey)].var3 = [valuetowrite]
except:
    two = 1

# Assign var4
try:
    if (str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)] == varschosen[3]):
        valuetowrite = str(row[i])
        if students[str(variablekey)].var4 != []:
            students[str(variablekey)].var4.append(valuetowrite)
        else:
            students[str(variablekey)].var4 = [valuetowrite]
except:
    two = 1
The same code repeats many, many times, once for each variable the student has (var5, var6, ..., varX). However, the RAM spike in my program comes when I execute the function that does this series of variable assignments.
I want to find a way to make this more efficient in speed or memory, because running this part of my program takes up around half a gig of RAM. :(
Thanks for your help!
EDIT:
Okay let me simplify my question:
In my case, I have a dictionary of about 6000 class instances, where each instance has 1000 attributes, all of type string or list of strings. I don't really care about the number of lines of code or the speed at which it runs (right now, my code is almost 20,000 lines and about a 1 MB .py file!). What I am concerned about is the amount of memory it takes up, because that is the culprit throttling my CPU. The ultimate question is: does the number of code lines by which I build up this massive dictionary matter much in terms of RAM usage?
My original code works fine, but the RAM usage is high, and I'm not sure whether that is "normal" for the amount of data I am collecting. Does writing the code in a condensed fashion (as shown by the people who helped me below) actually make a noticeable difference in the amount of RAM I'm going to eat up? Sure, there are X ways to build a dictionary, but does the choice even affect RAM usage in this case?
Edit: The suggested code-refactoring below won't reduce the memory consumption very much. 6000 classes each with 1000 attributes may very well consume half a gig of memory.
You might be better off storing the data in a database and pulling out the data only as you need it via SQL queries. Or you might use shelve or marshal to dump some or all of the data to disk, where it can be read back in only when needed. A third option would be to use a numpy array of strings. The numpy array will hold the strings more compactly. (Python strings are objects with lots of methods which make them bulkier memory-wise. A numpy array of strings loses all those methods but requires relatively little memory overhead.) A fourth option might be to use PyTables.
And lastly (but not leastly), there might be ways to re-design your algorithm to be less memory intensive. We'd have to know more about your program and the problem it's trying to solve to give more concrete advice.
Original suggestion:
for idx, v in enumerate(('var2', 'var3', 'var4'), 1):
    try:
        if row_num_id.get(str(i)) == varschosen[idx]:
            valuetowrite = str(row[i])
            student = students[str(variablekey)]
            value = getattr(student, v)
            if value != []:
                value.append(valuetowrite)
            else:
                setattr(student, v, [valuetowrite])
    except PUT_AN_EXPLICIT_EXCEPTION_HERE:
        pass
PUT_AN_EXPLICIT_EXCEPTION_HERE should be replaced with something like AttributeError, TypeError, or ValueError, or maybe something else.
It's hard to guess what to put here because I don't know what kind of values the variables might have.
If you run the code without the try...except block and your program crashes, take note of the traceback error message you receive. The last line will say something like
TypeError: ...
In that case, replace PUT_AN_EXPLICIT_EXCEPTION_HERE with TypeError.
If your code can fail in a number of ways, say with TypeError or ValueError, then you can replace PUT_AN_EXPLICIT_EXCEPTION_HERE with (TypeError, ValueError) to catch both kinds of error.
Note: There is a little technical caveat that should be mentioned regarding row_num_id.get(str(i)) == varschosen[1]. The expression row_num_id.get(str(i)) returns None if str(i) is not in row_num_id.
But what if varschosen[1] is None and str(i) is not in row_num_id? Then the condition is True, where the longer original condition returned False.
If that is a possibility, the solution is to use a sentinel default value: row_num_id.get(str(i), object()) == varschosen[1]. Now row_num_id.get(str(i), object()) returns a fresh object() when str(i) is not in row_num_id, and since each object() is a brand-new instance, there is no way it could equal varschosen[1].
You've spelled this wrong
two=1#This is just a dummy assignment because I
#can't leave it empty... I don't need my program to do anything if the "try" doesn't work. I just want to prevent a crash.
It's spelled
pass
You should read a tutorial on Python.
Also,
except:
Is a bad policy. Your program will fail to crash when it's supposed to crash.
Names like var2 and var3 are evil. They are intentionally misleading.
Don't repeat str(variablekey) over and over again.
I wish to find out a way to make this more efficient in speed or more memory efficient because running this part of my program takes up around half a gig of memory. :(
This request is unanswerable because we don't know what it's supposed to do. Intentionally obscure names like var1 and var2 make it impossible to understand.
"6000 instantiated classes, where each class has 1000 attributed variables"
So. 6 million objects? That's a lot of memory. A real lot of memory.
What I am concerned about is the amount of memory it is taking up because this is the culprit in throttling my CPU
Really? Any evidence?
but the RAM usage is high
Compared with what? What's your basis for this claim?
Python dicts use a surprisingly large amount of memory. Try:
import sys
for i in range(30):
    d = dict((j, j) for j in range(i))
    print "dict with", i, "elements is", sys.getsizeof(d), "bytes"
for an illustration of just how expensive they are. Note that this is just the size of the dict itself: it doesn't include the size of the keys or values stored in the dict.
By default, an instance of a Python class stores its attributes in a dict. Therefore, each of your 6000 instances is using a lot of memory just for that dict.
One way that you could save a lot of memory, provided that your instances all have the same set of attributes, is to use __slots__ (see http://docs.python.org/reference/datamodel.html#slots). For example:
class Foo(object):
    __slots__ = ('a', 'b', 'c')
Now, instances of class Foo have space allocated for precisely three attributes, a, b, and c, but no instance dict in which to store any other attributes. This uses only 4 bytes (on a 32-bit system) per attribute, as opposed to perhaps 15-20 bytes per attribute using a dict.
Another way you could be wasting memory, given that you have a lot of strings, is by storing multiple identical copies of the same string. Using the intern function (see http://docs.python.org/library/functions.html#intern) could help if this turns out to be a problem.
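A quick illustration of the effect (note that in Python 3 the builtin moved to sys.intern; the string contents below are invented for illustration):

```python
import sys

# Build two equal strings at runtime so they are genuinely separate objects
a = "".join(["record-type-A"] * 3)
b = "".join(["record-type-A"] * 3)
assert a == b and a is not b   # equal contents, two copies in memory

a = sys.intern(a)
b = sys.intern(b)
assert a is b                  # interned: both names share one object
```

If the same handful of string values appears across thousands of instances, interning them means each distinct value is stored once, no matter how many attributes reference it.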