I have a more-or-less complex data structure (a list of dictionaries of sets) on which I perform a bunch of operations in a loop until the data structure reaches a steady state, i.e. it doesn't change anymore. The number of iterations it takes to perform the calculation varies wildly depending on the input.
I'd like to know if there's an established way for forming a halting condition in this case. The best I could come up with is pickling the data structure, storing its md5 and checking if it has changed from the previous iteration. Since this is more expensive than my operations I only do this every 20 iterations but still, it feels wrong.
Is there a nicer or cheaper way to check for deep equality so that I know when to halt?
Thanks!
Take a look at python-deep. It should do what you want, and if it's not fast enough you can modify it yourself.
It also very much depends on how expensive the compare operation is and how expensive one calculation iteration is. Say one calculation iteration takes c time, one test takes t time, and the chance of termination is p; then the optimal testing frequency is:
(t * p) / c
That is assuming c < t; if that's not true then you should obviously check every loop.
So, since you can dynamically track c and t and estimate p (with possible adaptations in the code if it suspects the calculation is going to end), you can set your test frequency to an optimal value.
I think your only choices are:
1. Have every update mark a "dirty flag" when it alters a value from its starting state.
2. Do a whole-structure analysis (like the pickle/md5 combination you suggested).
3. Just run a fixed number of iterations known to reach a steady state (possibly running too many times but not having the overhead of checking the termination condition).
Option 1 is analogous to what Python itself does with ref-counting. Option 2 is analogous to what Python does with its garbage collector. Option 3 is common in numerical analysis (i.e. run divide-and-average 20 times to compute a square root).
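As a minimal sketch of option 3, the divide-and-average square root can be run for a fixed number of passes with no convergence test at all. The function name `sqrt_fixed` and the default of 20 iterations are illustrative assumptions, not part of the answer above.

```python
def sqrt_fixed(x, iterations=20):
    """Approximate sqrt(x) for x > 0 by divide-and-average.

    Hypothetical helper: it always runs the same number of passes and
    never checks whether the value has stopped changing.
    """
    guess = x if x >= 1 else 1.0
    for _ in range(iterations):
        # Average the current guess with x / guess.
        guess = (guess + x / guess) / 2.0
    return guess


print(sqrt_fixed(2.0))
```

The trade-off is the one described above: some of the 20 passes are usually wasted, but no time is spent deciding when to stop.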
Checking for equality doesn't seem to me the right way to go. Provided that you have full control over the operations you perform, I would introduce a "modified" flag (a boolean variable) that is set to false at the beginning of each iteration. Whenever one of your operations modifies (part of) your data structure, the flag is set to true, and the loop repeats until "modified" has remained false throughout a complete iteration.
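The flag approach can be sketched as follows. The operation used here (growing each set with everything reachable through its members) is only a stand-in for the real operations, and the names `close_once` and `run_to_steady_state` are assumptions made for illustration.

```python
def close_once(reach):
    """Run one full pass over a dict of sets.

    Returns True if any set was modified during the pass.
    """
    modified = False
    for node, targets in reach.items():
        # Iterate over a snapshot because targets may grow below.
        for mid in list(targets):
            new = reach.get(mid, set()) - targets
            if new:
                targets |= new   # in-place update of the stored set
                modified = True  # remember that something changed
    return modified


def run_to_steady_state(reach):
    """Repeat passes until one completes without any modification."""
    passes = 0
    while close_once(reach):
        passes += 1
    return passes


graph = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
run_to_steady_state(graph)
print(sorted(graph["a"]))
```

No copy, hash, or deep comparison is needed: the cost of detecting the steady state is one boolean assignment per actual change.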
I would trust the python equality operator to be reasonably efficient for comparing compositions of built-in objects.
I expect it would be faster than pickling+hashing, provided python tests for list equality something like this:
def __eq__(a, b):
    if type(a) == list and type(b) == list:
        if len(a) != len(b):
            return False
        for i in range(len(a)):
            if a[i] != b[i]:
                return False
        return True
    # testing for other types goes here
Since the function returns as soon as it finds two elements that don't match, in the average case it won't need to iterate through the whole thing. Compare to hashing, which does need to iterate through the whole data structure, even in the best case.
Here's how I would do it:
import copy

def perform_a_bunch_of_operations(data):
    # take care to not modify the original data, as we will be using it later
    my_shiny_new_data = copy.deepcopy(data)
    # do lots of math here...
    return my_shiny_new_data

data = get_initial_data()
while True:
    nextData = perform_a_bunch_of_operations(data)
    if data == nextData:  # steady state reached
        break
    data = nextData
This has the disadvantage of having to make a deep copy of your data each iteration, but it may still be faster than hashing - you can only know for sure by profiling your particular case.
The Problem:
Count the number of elements in a List using recursion.
I wrote the following function:
def count_rec(arr, i):
    """
    This function takes List (arr) and Index Number
    then returns the count of number of elements in it
    using Recursion.
    """
    try:
        temp = arr[i]  # if element exists at i, continue
        return 1 + count_rec(arr, i+1)
    except IndexError:
        # if index error, that means, i == length of list
        return 0
I noticed some problems with it:
RecursionError (when the number of elements is more than 990)
Using a temp element (wasting memory..?)
Exception Handling (I feel like we shouldn't use it unless necessary)
If anyone can suggest how to improve the above solution or come up with an alternative one, it would be really helpful.
What you have is probably as efficient as you are going to get for this thought experiment (obviously, Python already calculates and stores the length of list objects, which can be retrieved with the len() built-in, so this function is completely unnecessary).
You could get shorter code if you want:
def count(L):
    return count(L[:-1]) + 1 if L else 0
But you still need to change python's recursion limit.
import sys; sys.setrecursionlimit(100000)
However, we should note that in CPython a try block costs almost nothing when no exception is raised, so in most cases "try except" is cheaper than an explicit "if else" test, and it is the better choice if you are after performance. Of course, it is odd to talk about performance here, because recursion typically doesn't perform very well in Python, due to how Python manages namespaces and such. Recursion in Python is typically frowned upon, unnecessary, and slow, so trying to optimize recursion performance is a little strange.
A last point to note. You mention the temp = arr[i] taking up memory. Yes, possibly a few bytes. Of course, any calculation you do to determine whether arr has an element at i is going to take a few bytes in memory, even simply running "arr[i]" without assignment. In addition, those bytes are freed the second the temp variable falls out of scope, gets re-used, or the function exits. Hence, unless you are planning on launching 10,000,000,000 sub-processes, rest assured there is no performance degradation in using a temp variable like that.
You are probably looking for something like this:
def count_rec(arr):
    if arr == []:
        return 0
    return count_rec(arr[1:]) + 1
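You can use pop() to do it. Note that this empties the list that is passed in.
def count_r(l):
    if l == []:
        return 0
    else:
        l.pop()  # removes the last element from the caller's list
        return count_r(l) + 1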
When checking for equality, is there any actual difference between speed and functionality of the following:
number = 'one'
if number == 'one' or number == 'two':
vs.
number = 'one'
if number in ['one', 'two']:
If the values are literal constants (as in this case), in is likely to run faster, as the (extremely limited) optimizer converts it to a constant tuple which is loaded all at once, reducing the bytecode work performed to two cheap loads, and a single comparison operation/conditional jump, where chained ors involve two cheap loads and a comparison op/conditional jump for each test.
For two values, it might not help as much, but as the number of values increases, the byte code savings over the alternative (especially if hits are uncommon, or evenly distributed across the options) can be meaningful.
The above applies specifically to the CPython reference interpreter; other interpreters may have lower per-bytecode costs that reduce or eliminate the differences in performance.
A general advantage comes in if number is a more complicated expression; my_expensive_function() in (...) will obviously outperform my_expensive_function() == A or my_expensive_function() == B, since the former only computes the value once.
That said, if the values in the tuple aren't constant literals, especially if hits will be common on the earlier values, in will usually be more expensive (because it must create the sequence for testing every time, even if it ends up only testing the first value).
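The point about evaluating the left-hand expression only once can be checked with a small counter. `expensive` is a hypothetical stand-in for a costly function, and the `counter` dictionary exists only to count its calls.

```python
counter = {"calls": 0}


def expensive():
    """Hypothetical costly function; records how often it is called."""
    counter["calls"] += 1
    return "two"


# Membership test: the left-hand side is evaluated a single time.
counter["calls"] = 0
hit_in = expensive() in ("one", "two")
calls_with_in = counter["calls"]

# Chained comparisons: the call is repeated for every test that runs.
counter["calls"] = 0
hit_or = expensive() == "one" or expensive() == "two"
calls_with_or = counter["calls"]

print(calls_with_in, calls_with_or)
```

Here the `or` chain calls the function twice because the first comparison fails. Storing the result in a local variable first gives the chained form the same single evaluation.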
Talking about functionality - no, these two approaches generally differ: see https://stackoverflow.com/a/41957167/747744
I am making a POST to a Python script. The POST has two parameters, Name and Location, and the script returns one string. I am going to have hundreds of these options; is it faster to do it in a dictionary like this:
myDictionary = {"Name": {"Location": "result", "LocationB": "resultB"},
                "Name2": {"Location2": "result2A", "Location2B": "result2B"}}
And then I would use .get("Name").get("Location") to get the results
OR do something like this:
if Name == "Name":
    if Location == "Location":
        result = "result"
    elif Location == "LocationB":
        result = "resultB"
elif Name == "Name2":
    if Location == "Location2":
        result = "result2A"
    elif Location == "Location2B":
        result = "result2B"
Now if there are hundreds or thousands of these, which is faster? Or is there a better way altogether?
First of all:
Generally, it's much more pythonic to match keys to values using dictionaries. You should do that from a point of style.
Secondly:
If you really care about performance, python might not always be the optimal tool. However, the dict approach should be much much faster, unless your selections happen about as often as the creation of these dicts. The creation of thousands and thousands of PyObjects to check your case is a really bad idea.
Thirdly:
If you care about your application so much, you might really want to benchmark both solutions -- as usual when it comes to performance questions, there's a million factors, including your computing platform, that only experiments will help to sort out.
Fourth(ly?):
It looks like you're building something like a protocol parser. That's really not python's forte, performance-wise. Maybe you'd want to look into one of the dozens of tools that can write C code parsers for you and wrap that in a native module, it's pretty sure to be faster than either of your implementations, if done right.
Here's the python documentation on Extending Python with C or C++
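A minimal sketch of the dictionary approach, using the table from the question. The helper name `lookup` and its `default` parameter are assumptions for illustration. One detail worth handling: chaining `.get("Name").get("Location")` raises AttributeError for an unknown name, because the first `.get` returns None.

```python
RESULTS = {
    "Name": {"Location": "result", "LocationB": "resultB"},
    "Name2": {"Location2": "result2A", "Location2B": "result2B"},
}


def lookup(name, location, default=None):
    """Return the result for (name, location), or default if either is unknown."""
    # Falling back to an empty dict keeps the second .get() safe.
    return RESULTS.get(name, {}).get(location, default)


print(lookup("Name2", "Location2B"))
print(lookup("NoSuchName", "Location"))
```

Each lookup is two hash-table probes regardless of how many names the table holds, whereas the if/elif chain is tested top to bottom.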
I decided to test the two scenarios of 1000 Names and 2 locations
The Test Samples
Team Dictionary:
di = {}
for i in range(1000):
    di["Name{}".format(i)] = {'Location': 'result{}'.format(i), 'LocationB': 'result{}B'.format(i)}

def get_dictionary_value():
    di.get("Name999").get("LocationB")
Team If Statement:
I used a python script to generate a 5000 line function if_statements(name, location): following this pattern
    elif name == 'Name994':
        if location == 'Location':
            return 'result994'
        elif location == 'LocationB':
            return 'result994B'
    # Some time later ...

def get_if_value():
    if_statements("Name999", "LocationB")
Timing Results
You can time with the timeit function to test the time it takes a function to complete.
import timeit
print(timeit.timeit(get_dictionary_value))
# 0.06353...
print(timeit.timeit(get_if_value))
# 6.3684...
So there you have it, dictionary was 100 times faster on my machine than the hefty 165 KB if-statement function.
I will root for dict().
In most cases [key] selection is much faster than conditional checks. As a rule of thumb, conditionals are generally used for boolean statements.
The reason is that when you create a dictionary you essentially create a registry of that data, stored as hashes in buckets. When you write, for instance, my_dictionary_name['key'], if that value exists Python knows the exact location of that value and returns it almost instantly.
Conditionals are different. They are sequential checks, meaning that in the worst case every condition has to be checked, first to establish the value's existence and then to return the respective data.
As you can see, with hundreds of statements this can be problematic, so in this case dictionaries are faster. You also need to be cognizant of how often and how quickly these checks happen: if they run before your dictionary has finished being built, you might get a value-not-found error.
I am performing multiple iterations of the type:
masterSet=masterSet.union(setA)
As the set grows the length of time taken to perform these operations is growing (as one would expect, I guess).
I expect that the time is taken up checking whether each element of setA is already in masterSet?
My question is: if I KNOW that masterSet does not already contain any of the elements in setA, can I do this quicker?
[UPDATE]
Given that this question is still attracting views I thought I would clear up a few of the things from the comments and answers below:
When iterating through, there were many iterations where I knew setA would be distinct from masterSet because of how it was constructed (without having to process any checks), but for a few iterations I needed the uniqueness check.
I wondered if there was a way to 'tell' the masterSet.union() procedure not to bother with the uniqueness check this time around, since I know this one is distinct from masterSet, and just add these elements quickly, trusting the programmer's assertion that they are definitely distinct. Perhaps through calling some different ".unionWithDistinctSet()" procedure or something.
I think the responses have suggested that this isn't possible (and that set operations should really be quick enough anyway), but that masterSet.update(setA) should be used instead of union as it's slightly quicker still.
I have accepted the clearest response along those lines, resolved the issue I was having at the time, and got on with my life, but I would still love to hear whether my hypothesised .unionWithDistinctSet() could ever exist.
You can use set.update to update your master set in place. This saves allocating a new set all the time so it should be a little faster than set.union...
>>> s = set(range(3))
>>> s.update(range(4))
>>> s
set([0, 1, 2, 3])
Of course, if you're doing this in a loop:
masterSet = set()
for setA in iterable:
    masterSet = masterSet.union(setA)
You might get a performance boost by doing something like:
masterSet = set().union(*iterable)
Ultimately, membership testing of a set is O(1) (in the average case), so testing if the element is already contained in the set isn't really a big performance hit.
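The three spellings discussed above produce the same set; they differ only in how many intermediate sets are allocated. The sample data in `chunks` is made up for illustration.

```python
chunks = [{1, 2}, {2, 3}, {4}]

# Rebinding with union() allocates a new set on every iteration.
by_union = set()
for chunk in chunks:
    by_union = by_union.union(chunk)

# update() grows the same set object in place.
by_update = set()
for chunk in chunks:
    by_update.update(chunk)

# A single union() call can take every chunk at once.
all_at_once = set().union(*chunks)

print(by_union == by_update == all_at_once)
```

Whether the difference matters for a given workload is best settled by timing it, for example with the timeit module.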
As mgilson points out, you can use update to update a set in-place from another set. That actually works out slightly quicker:
import timeit

def union():
    i = set(range(10000))
    j = set(range(5000, 15000))
    return i.union(j)

def update():
    i = set(range(10000))
    j = set(range(5000, 15000))
    i.update(j)
    return i

timeit.Timer(union).timeit(10000)   # 10.351907968521118
timeit.Timer(update).timeit(10000)  # 8.83384895324707
If you know your elements are unique, a set is not necessarily the best structure.
A simple list is way faster to extend.
masterList = list(masterSet)
masterList.extend(setA)
For sure, forgoing this check could be a big saving when the __eq__(..) method is very expensive. In the CPython implementation, __eq__(..) is called with every element already in the set that hashes to the same number. (Reference: source code for set.)
However, this functionality will never exist, because it would open up another way to violate the integrity of a set. The trouble associated with that far outweighs the (typically negligible) performance gain. If this is determined to be a performance bottleneck, it's not hard to write a C++ extension and use its STL <set>, which could be faster, possibly by an order of magnitude or more; measure before committing to it.
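To see when `__eq__` actually runs during insertion, here is a small experiment. The `Key` class and `count_eq_calls` helper are hypothetical and exist only to count comparisons; the behaviour shown is that of CPython and may differ on other interpreters.

```python
class Key:
    """Hypothetical key type that counts how often __eq__ is invoked."""
    eq_calls = 0

    def __init__(self, value, hash_value):
        self.value = value
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.value == other.value


def count_eq_calls(keys):
    """Build a set from keys and report how many __eq__ calls it took."""
    Key.eq_calls = 0
    set(keys)
    return Key.eq_calls


# Every key has its own hash: no equality checks are needed.
distinct = count_eq_calls([Key(i, i) for i in range(10)])
# Every key shares one hash: each insert is compared with earlier keys.
colliding = count_eq_calls([Key(i, 0) for i in range(10)])
print(distinct, colliding)
```

With well-distributed hashes the uniqueness check therefore costs little beyond the hash computation, which is why skipping it would rarely pay off.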
There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy.
The current problem seems to be related to a class called RevisionSet in the script. In essence what it does is create a large hashtable(?) of integer-keyed boolean values. In the worst case - one for each revision in our SVN repository, which is near 75,000 now.
After that it performs set operations on such huge arrays - addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) implementation, which, naturally, gets pretty slow on such large sets. The whole data structure could be optimized because there are long spans of continuous values. For example, all keys from 1 to 74,000 might contain true. Also the script is written for Python 2.2, which is a pretty old version and we're using 2.6 anyway, so there could be something to gain there too.
I could try to cobble this together myself, but it would be difficult and take a lot of time - not to mention that it might be already implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?
You could try doing it with numpy instead of plain python. I found it to be very fast for operations like these.
For example:
import numpy

# Create 1000000 numbers between 0 and 1000, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)
# Get all items that are larger than 500, takes 2.58ms
y = x > 500
# Add 10 to those items, takes 26.1ms
x[y] += 10
Since that's with a lot more rows, I think that 75000 should not be a problem either :)
Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.
import re

class RevisionSet(set):
    """
    A set of revisions, stored as a set of integers for easy manipulation.
    As this class does not include branch information, it's assumed that
    one instance will be used per branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a set or list of revisions. Raises ValueError if the input string
        is invalid."""
        revision_range_split_re = re.compile('[-:]')

        if isinstance(parm, set):
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                              int(rev_or_revs[1]) + 1)))
                    else:
                        raise ValueError('Ill formatted revision range: ' + R)

    def sorted(self):
        # Inside the method, "sorted" still refers to the built-in function.
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start,end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e+1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s, e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)
Addition:
By the way, I compared doing unions, intersections and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works and if it does, then see if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).
"For example, all keys from 1 to 74,000 contain true"
Why not work on a subset? Just 74001 to the end.
Pruning 74/75th of your data is far easier than trying to write an algorithm more clever than O(n).
You should rewrite RevisionSet to have a set of revisions. I think the internal representation for a revision should be an integer and revision ranges should be created as needed.
There is no compelling reason to use code that supports python 2.3 and earlier.
Just a thought. I used to do this kind of thing using run-coding in binary image manipulation. That is, store each set as a series of numbers: number of bits off, number of bits on, number of bits off, etc.
Then you can do all sorts of boolean operations on them as decorations on a simple merge algorithm.
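A minimal sketch of this idea, with one substitution stated plainly: instead of alternating off/on run lengths, each set is stored as a sorted list of inclusive (start, end) ranges, which carries the same information. The function names `union_ranges` and `intersect_ranges` are assumptions for illustration. Both operations are single merge passes whose cost depends on the number of ranges, not on the number of revisions.

```python
def union_ranges(a, b):
    """Union of two sorted lists of inclusive (start, end) integer ranges."""
    merged = []
    i = j = 0
    while i < len(a) or j < len(b):
        # Merge step: take whichever pending range starts first.
        if j >= len(b) or (i < len(a) and a[i][0] <= b[j][0]):
            start, end = a[i]
            i += 1
        else:
            start, end = b[j]
            j += 1
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect_ranges(a, b):
    """Intersection of two sorted lists of inclusive (start, end) ranges."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


merged_revs = union_ranges([(1, 74000)], [(74001, 74010), (74500, 74500)])
common_revs = intersect_ranges([(1, 10), (20, 30)], [(5, 25)])
print(merged_revs, common_revs)
```

For a repository where revisions 1 to 74,000 form one continuous span, that span is a single tuple here instead of 74,000 entries.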