"Undirected" tuple comparison - python

I am currently working on an undirected graph in python, where the edges are represented by tuples (edge between A and B is represented by either (A,B) or (B,A)). I was wondering whether there is a tuple operation that performs an undirected comparison of tuples like this:
exp1 = undirected_comp((A,B), (B,A)) #exp1 should evaluate to True
exp2 = undirected_comp((A,B), (A,C)) #exp2 should evaluate to False

Not exactly, but in general you can do this kind of comparison with
set((A, B)) == set((B, A))

Sure:
undirected_comp = lambda e1,e2: e1==e2 or (e1[1],e1[0])==e2
Since edges are always exactly 2-tuples it should be robust enough, assuming the A and B objects have equality defined.
EDIT (shameless self-promotion): You probably don't want the overhead of creating two set objects for each comparison, especially if this is part of a larger algorithm. Sets are great for lookups, but the instantiation is much slower: https://stackoverflow.com/a/7717668/837451

In addition to the solutions using sets, it is easy enough to roll your own comparison function:
In [1]: def undirected_comp(tup1, tup2):
   ...:     return tup1 == tup2 or tup1 == tup2[::-1]
In [2]: undirected_comp(('A','B'), ('B','A'))
Out[2]: True
In [3]: undirected_comp(('A','B'), ('A','C'))
Out[3]: False
In [4]: undirected_comp(('A','B'), ('A','B'))
Out[4]: True
As noted by mmdanziger, this is faster than the solution with the sets, since you do not have to pay the cost of the set creation.
But if you care about speed and you spend more time on comparing various edges than on creating them, it is probably best not to store the edges as a tuple with arbitrary order, but to pre-process them and store them in a different format. The two best options would probably be a frozenset or a sorted tuple (i.e. by convention, you always store the smallest node first). Some quick timing:
# edge creation, this time is spent only once, so presumably we don't care:
In [1]: tup1 = (1, 2); tup2 = (2, 1)
In [2]: fs1 = frozenset(tup1); fs2 = frozenset(tup2)
In [3]: sorted_tup1 = tuple(sorted(tup1)); sorted_tup2 = tuple(sorted(tup2))
# now time the comparison operations
In [4]: timeit set(tup1) == set(tup2) # Corley's solution
1000000 loops, best of 3: 674 ns per loop
In [5]: timeit tup1 == tup2 or tup1 == tup2[::-1] # my solution above
1000000 loops, best of 3: 348 ns per loop
In [6]: timeit fs1 == fs2 # frozensets
10000000 loops, best of 3: 120 ns per loop
In [7]: timeit sorted_tup1 == sorted_tup2 # pre-sorted tuples
10000000 loops, best of 3: 83.4 ns per loop
So assuming that you don't care about the creation time of the edges, storing them as a sorted tuple is the fastest for doing the comparisons. In this case, you only have to do a simple comparison and do not have to compare the backwards case, since the order is guaranteed by the pre-sorting.
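If you adopt the sorted-tuple convention, a tiny helper that canonicalises each edge at creation time keeps the rest of the code order-agnostic (a sketch; canonical_edge is just an illustrative name):

def canonical_edge(a, b):
    # store the smaller node first, so (A, B) and (B, A) map to the same tuple
    return (a, b) if a <= b else (b, a)

edges = set()
edges.add(canonical_edge('B', 'A'))
print(canonical_edge('A', 'B') in edges)  # True
print(canonical_edge('A', 'C') in edges)  # False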

Python tuples are ordered, while python sets are not. You could simply convert the tuples to sets before comparison using set.
(A,B) == (B,A)           # evaluates to False
set((A,B)) == set((B,A)) # evaluates to True
set((A,B)) == set((A,C)) # evaluates to False
If you want to use a function, you could do something like this:
def undirected_comp(a, b):
    return set(a) == set(b)
Edit: I was using cmp() to do the comparison, which was incorrect since it returns -1, 0 or 1 rather than a boolean. The function now uses ==, which does return a boolean; if you want the cmp-style result, use return cmp(set(a), set(b)).

Related

FAST comparing two numpy arrays for equality [Python] [duplicate]

Suppose I have a bunch of arrays, including x and y, and I want to check if they're equal. Generally, I can just use np.all(x == y) (barring some dumb corner cases which I'm ignoring now).
However this evaluates the entire array of (x == y), which is usually not needed. My arrays are really large, and I have a lot of them, and the probability of two arrays being equal is small, so in all likelihood, I really only need to evaluate a very small portion of (x == y) before the all function could return False, so this is not an optimal solution for me.
I've tried using the builtin all function, in combination with itertools.izip: all(val1==val2 for val1,val2 in itertools.izip(x, y))
However, that just seems so much slower when the two arrays are equal that, overall, it's still not worth using over np.all. I presume this is because of the builtin all's general-purposeness. And np.all doesn't work on generators.
Is there a way to do what I want in a more speedy manner?
I know this question is similar to previously asked questions (e.g. Comparing two numpy arrays for equality, element-wise) but they specifically don't cover the case of early termination.
Until this is implemented in numpy natively you can write your own function and jit-compile it with numba:
import numpy as np
import numba as nb
@nb.jit(nopython=True)
def arrays_equal(a, b):
    if a.shape != b.shape:
        return False
    for ai, bi in zip(a.flat, b.flat):
        if ai != bi:
            return False
    return True
a = np.random.rand(10, 20, 30)
b = np.random.rand(10, 20, 30)
%timeit np.all(a==b) # 100000 loops, best of 3: 9.82 µs per loop
%timeit arrays_equal(a, a) # 100000 loops, best of 3: 9.89 µs per loop
%timeit arrays_equal(a, b) # 100000 loops, best of 3: 691 ns per loop
Worst case performance (arrays equal) is equivalent to np.all and in case of early stopping the compiled function has the potential to outperform np.all a lot.
Adding short-circuit logic to array comparisons is apparently being discussed on the numpy page on github, and will thus presumably be available in a future version of numpy.
Probably someone who understands the underlying data structure could optimize this or explain whether it's reliable/safe/good practice, but it seems to work.
np.all(a==b)
Out[]: True
memoryview(a.data)==memoryview(b.data)
Out[]: True
%timeit np.all(a==b)
The slowest run took 10.82 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.2 µs per loop
%timeit memoryview(a.data)==memoryview(b.data)
The slowest run took 8.55 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.85 µs per loop
If I understand this correctly, ndarray.data creates a pointer to the data buffer and memoryview creates a native python type that can be short-circuited out of the buffer.
I think.
EDIT: further testing shows it may not be as big a time improvement as shown above; previously a = b = np.eye(5), now:
a=np.random.randint(0,10,(100,100))
b=a.copy()
%timeit np.all(a==b)
The slowest run took 6.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 17.7 µs per loop
%timeit memoryview(a.data)==memoryview(b.data)
10000 loops, best of 3: 30.1 µs per loop
np.all(a==b)
Out[]: True
memoryview(a.data)==memoryview(b.data)
Out[]: True
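A small defensive wrapper around this approach (my addition, not part of the answer): checking shape and dtype up front costs almost nothing and rejects obviously different arrays before the buffer comparison is attempted.

import numpy as np

def buffers_equal(a, b):
    # cheap metadata checks first, then the buffer comparison from above
    if a.shape != b.shape or a.dtype != b.dtype:
        return False
    return memoryview(a.data) == memoryview(b.data)

a = np.random.randint(0, 10, (100, 100))
b = a.copy()
print(buffers_equal(a, b))  # True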
Hmmm, I know it is a poor answer, but it seems there is no easy way to do this. The NumPy developers should fix it. I suggest:
def compare(a, b):
    if len(a) > 0 and not np.array_equal(a[0], b[0]):
        return False
    if len(a) > 15 and not np.array_equal(a[:15], b[:15]):
        return False
    if len(a) > 200 and not np.array_equal(a[:200], b[:200]):
        return False
    return np.array_equal(a, b)
:)
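A slightly more general sketch of the same prefix-checking idea, with geometrically growing (and arbitrarily chosen) chunk sizes:

import numpy as np

def compare_prefix(a, b, first_chunk=16, growth=8):
    # check small prefixes first; only fall through to the full comparison if they all match
    if a.shape != b.shape:
        return False
    n = first_chunk
    while n < len(a):
        if not np.array_equal(a[:n], b[:n]):
            return False
        n *= growth
    return np.array_equal(a, b)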
Well, not really an answer as I haven't checked if it short-circuits, but:
assert_array_equal.
From the documentation:
Raises an AssertionError if two array_like objects are not equal.
Try/except it if you're not on a performance-sensitive code path.
Or follow the underlying source code, maybe it's efficient.
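A minimal sketch of that try/except wrapper (whether assert_array_equal short-circuits internally is exactly the open question above):

import numpy as np

def equal_via_assert(a, b):
    # convert the AssertionError raised on mismatch back into a boolean
    try:
        np.testing.assert_array_equal(a, b)
        return True
    except AssertionError:
        return False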
You could iterate all elements of the arrays and check if they are equal.
If the arrays are most likely not equal it will return much faster than the .all function.
Something like this:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 3, 4])

areEqual = True
for x in range(a.size):
    if a[x] != b[x]:
        areEqual = False
        break
    else:
        print("a[x] is equal to b[x]\n")

if areEqual:
    print("The tables are equal\n")
else:
    print("The tables are not equal\n")
As Thomas Kühn wrote in a comment to your post, array_equal is a function which should solve the problem. It is described in Numpy's API reference.
Breaking down the original problem to three parts: "(1) My arrays are really large, and (2) I have a lot of them, and (3) the probability of two arrays being equal is small"
All the solutions (to date) focus on part (1) - optimizing the performance of each equality check - and some improve that performance by a factor of 10. Points (2) and (3) are ignored. Comparing each pair has O(n^2) complexity, which can become huge for a lot of matrices, and is needless since the probability of two arrays being duplicates is very small.
The check can become much faster with the following general algorithm -
fast hash of each array O(n)
check equality only for arrays with the same hash
A good hash is almost unique, so the number of distinct keys can easily be a very large fraction of n. On average, the number of arrays with the same hash will be very small, and almost 1 in some cases. Duplicate arrays will have the same hash, while having the same hash doesn't guarantee they are duplicates; in that sense, the algorithm will catch all the duplicates. Comparing only arrays with the same hash significantly reduces the number of comparisons, which becomes almost O(n).
For my problem, I had to check for duplicates within ~1 million integer arrays, each with 10k elements. Optimizing only the array equality check (with @MB-F's solution) gave an estimated run time of 5 days. With hashing first it finished in minutes. (I used the array sum as the hash; that suited my arrays' characteristics.)
Some pseudo-Python code:
from collections import defaultdict
import numpy as np

def fast_hash(arr) -> int:
    pass

def arrays_equal(arr1, arr2) -> bool:
    pass

def make_hash_dict(array_stack, hush_fn=np.sum):
    hash_dict = defaultdict(list)
    hashes = np.squeeze(np.apply_over_axes(hush_fn, array_stack, range(1, array_stack.ndim)))
    for idx, hash_val in enumerate(hashes):
        hash_dict[hash_val].append(idx)
    return hash_dict

def get_duplicate_sets(hash_dict, array_stack):
    duplicate_sets = []
    for hash_key, ind_list in hash_dict.items():
        if len(ind_list) == 1:
            continue
        all_duplicates = []
        for idx1 in range(len(ind_list)):
            v1 = ind_list[idx1]
            if v1 in all_duplicates:
                continue
            arr1 = array_stack[v1]
            curr_duplicates = []
            for idx2 in range(idx1 + 1, len(ind_list)):
                v2 = ind_list[idx2]
                arr2 = array_stack[v2]
                if arrays_equal(arr1, arr2):
                    if len(curr_duplicates) == 0:
                        curr_duplicates.append(v1)
                    curr_duplicates.append(v2)
            if len(curr_duplicates) > 0:
                all_duplicates.extend(curr_duplicates)
                duplicate_sets.append(curr_duplicates)
    return duplicate_sets
The variable duplicate_sets is a list of lists, each internal list contains indices of all the same duplicates.
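A hypothetical usage sketch: the arrays_equal stub is filled in with np.array_equal, and the default hush_fn=np.sum plays the role of the fast hash, as in the answer above.

import numpy as np

def arrays_equal(arr1, arr2) -> bool:
    # exact check, only applied within a hash bucket
    return np.array_equal(arr1, arr2)

# a stack of 1000 small integer arrays with a few deliberate duplicates
array_stack = np.random.randint(0, 5, size=(1000, 16))
array_stack[10] = array_stack[3]
array_stack[20] = array_stack[3]

hash_dict = make_hash_dict(array_stack)            # buckets indices by array sum
print(get_duplicate_sets(hash_dict, array_stack))  # expect something like [[3, 10, 20]]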

Python: efficient way to match 2 different length arrays and find index in larger array

I have 2 arrays: x and bigx. They span the same range, but bigx has many more points.
e.g.
x = np.linspace(0,10,100)
bigx = np.linspace(0,10,1000)
I want to find the indices in bigx where x and bigx match to 2 significant figures. I need to do this extremely quickly as I need the indices for each step of an integral.
Using numpy.where is very slow:
index_bigx = [np.where(np.around(bigx,2) == i) for i in np.around(x,2)]
Using numpy.in1d is ~30x faster:
index_bigx = np.where(np.in1d(np.around(bigx, 2), np.around(x, 2)))
I also tried using zip and enumerate, as I know that's supposed to be faster, but it returns empty:
>>> index_bigx = [i for i,(v,myv) in enumerate(zip(np.around(bigx,2), np.around(x,2))) if myv == v]
>>> print index_bigx
[]
I think I must have muddled things here and I want to optimise it as much as possible. Any suggestions?
Since bigx is always evenly spaced, it's quite straightforward to just directly compute the indices:
start = bigx[0]
step = bigx[1] - bigx[0]
indices = ((x - start)/step).round().astype(int)
Linear time, no searching necessary.
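A quick sanity check on the arrays from the question (a sketch):

import numpy as np

x = np.linspace(0, 10, 100)
bigx = np.linspace(0, 10, 1000)

start = bigx[0]
step = bigx[1] - bigx[0]
indices = ((x - start) / step).round().astype(int)

# every computed index points at the nearest bigx value, well within half a step
print(np.allclose(bigx[indices], x, atol=step / 2))  # True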
Since we are mapping x into bigx, whose elements are equidistant, you can use a binning operation with np.searchsorted and its 'left' option to simulate the index-finding operation. Here's the implementation -
out = np.searchsorted(np.around(bigx,2), np.around(x,2),side='left')
Runtime tests
In [879]: import numpy as np
...:
...: xlen = 10000
...: bigxlen = 70000
...: bigx = 100*np.linspace(0,1,bigxlen)
...: x = bigx[np.random.permutation(bigxlen)[:xlen]]
...:
In [880]: %timeit np.where(np.in1d(np.around(bigx,2), np.around(x,2)))
...: %timeit np.searchsorted(np.around(bigx,2), np.around(x,2),side='left')
...:
100 loops, best of 3: 4.1 ms per loop
1000 loops, best of 3: 1.81 ms per loop
If you want just the elements, this should work:
np.intersect1d(np.around(bigx,2), np.around(x,2))
If you want the indices, try this:
around_x = set(np.around(x,2))
index_bigx = [i for i,b in enumerate(np.around(bigx,2)) if b in around_x]
Note: these were not tested.

Dictionaries with numpy - Can I use XY coordinates as a hash? [duplicate]

I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important.
The array represents indices, so while the actual identity of the object is not important, the value is. Mutability is not a concern, as I'm only interested in the current value.
What should I hash in order to store it in a dict?
My current approach is to use str(arr.data), which is faster than md5 in my testing.
I've incorporated some examples from the answers to get an idea of relative times:
In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop
In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop
In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop
In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop
In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop
It would appear that for this particular use case (small arrays of indices), arr.tostring offers the best performance.
While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.
You can simply hash the underlying buffer, if you make it read-only:
>>> a = random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop
For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.
>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
You can try xxhash via its Python binding. For large arrays this is much faster than hash(x.tostring()).
Example IPython session:
>>> import xxhash
>>> import numpy
>>> x = numpy.random.rand(1024 * 1024 * 16)
>>> h = xxhash.xxh64()
>>> %timeit hash(x.tostring())
1 loops, best of 3: 208 ms per loop
>>> %timeit h.update(x); h.intdigest(); h.reset()
100 loops, best of 3: 10.2 ms per loop
And by the way, on various blogs and answers posted to Stack Overflow, you'll see people using sha1 or md5 as hash functions. For performance reasons this is usually not acceptable, as those "secure" hash functions are rather slow. They're useful only if hash collision is one of the top concerns.
Nevertheless, hash collisions happen all the time. And if all you need is implementing __hash__ for data-array objects so that they can be used as keys in Python dictionaries or sets, I think it's better to concentrate on the speed of __hash__ itself and let Python handle the hash collision[1].
[1] You may need to override __eq__ too, to help Python manage hash collision. You would want __eq__ to return a boolean, rather than an array of booleans as is done by numpy.
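A minimal sketch of that idea (HashableArray is my name for it; the hash here just uses the array's bytes, and xxhash could be dropped in for very large arrays):

import numpy as np

class HashableArray:
    # thin wrapper so an ndarray can serve as a dict or set key
    def __init__(self, arr):
        self._arr = arr
        self._hash = hash(arr.tobytes())  # computed once and cached

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        # return a single bool, not an element-wise array
        return isinstance(other, HashableArray) and np.array_equal(self._arr, other._arr)

d = {HashableArray(np.arange(5)): 'cached value'}
print(d[HashableArray(np.arange(5))])  # 'cached value'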
Coming late to the party, but for large arrays, I think a decent way to do it is to randomly subsample the matrix and hash that sample:
def subsample_hash(a):
    rng = np.random.RandomState(89)
    inds = rng.randint(low=0, high=a.size, size=1000)
    b = a.flat[inds]
    b.flags.writeable = False
    return hash(b.data)
I think this is better than doing hash(str(a)), because the latter could confuse arrays that have unique data in the middle but zeros around the edges.
If your np.array() is small and in a tight loop, then one option is to skip hash() completely and just use np.array().data.tobytes() directly as your dict key:
grid = np.array([[True, False, True], [False, False, True]])
hash = grid.data.tobytes()
cache = cache or {}
if hash not in cache:
    cache[hash] = function(grid)
return cache[hash]
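Wrapped into a small caching helper for clarity (a sketch; cached_call and the use of np.count_nonzero are illustrative, not from the answer):

import numpy as np

def cached_call(function, grid, cache):
    key = grid.data.tobytes()  # the raw bytes of the array are the dict key
    if key not in cache:
        cache[key] = function(grid)
    return cache[key]

cache = {}
grid = np.array([[True, False, True], [False, False, True]])
print(cached_call(np.count_nonzero, grid, cache))  # 3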
What kind of data do you have?
- What is the array size?
- Do you have the same index several times in the array?
If your array consists only of a permutation of indices, you can use a base conversion:
(1, 0, 2) -> 1 * 3**0 + 0 * 3**1 + 2 * 3**2 = 19
and use 19 as the hash key via
import numpy as num
base_size = 3
base = base_size ** num.arange(base_size)
max_base = (base * num.arange(base_size)).sum()
hashed_array = (base * array).sum()
Now you can use an array (shape=(base_size, )) instead of a dict in order to access the values.
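A quick worked check of the formula above (a sketch):

import numpy as num

base_size = 3
base = base_size ** num.arange(base_size)  # array([1, 3, 9])
arr = num.array([1, 0, 2])
print((base * arr).sum())                  # 19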


Is there a performance difference in using a tuple over a frozenset as a key for a dictionary?

I have a script that makes many calls to a dictionary using a key consisting of two variables. I know that my program will encounter the two variables again in the reverse order which makes storing the key as a tuple feasible. (Creating a matrix with the same labels for rows and columns)
Therefore, I was wondering if there was a performance difference in using a tuple over a frozenset for a dictionary key.
In a quick test, apparently it makes a negligible difference.
python -m timeit -s "keys = list(zip(range(10000), range(10, 10000)))" -s "values = range(10000)" -s "a=dict(zip(keys, values))" "for i in keys:" " _ = a[i]"
1000 loops, best of 3: 855 usec per loop
python -m timeit -s "keys = [frozenset(i) for i in zip(range(10000), range(10, 10000))]" -s "values = range(10000)" -s "a=dict(zip(keys, values))" "for i in keys:" " _ = a[i]"
1000 loops, best of 3: 848 usec per loop
I really would just go with what is best elsewhere in your code.
Without having done any tests, I have a few guesses. For frozensets, cpython stores the hash after it has been calculated; furthermore, iterating over a set of any kind incurs extra overhead because the data is stored sparsely. In a 2-item set, that imposes a significant performance penalty on the first hash, but would probably make the second hash very fast -- at least when the object itself is the same. (i.e. is not a new but equivalent frozenset.)
For tuples, cpython does not store the hash, but rather calculates it every time. So it might be that repeated hashing is slightly cheaper with frozensets. But for such a short tuple, there's probably almost no difference; it's even possible that very short tuples will be faster.
Lattyware's current timings line up reasonably well with my line of reasoning here; see below.
To test my intuition about the asymmetry of hashing new vs. old frozensets, I did the following. I believe the difference in timings is exclusively due to the extra hash time. Which is pretty insignificant, by the way:
>>> fs = frozenset((1, 2))
>>> old_fs = lambda: [frozenset((1, 2)), fs][1]
>>> new_fs = lambda: [frozenset((1, 2)), fs][0]
>>> id(fs) == id(old_fs())
True
>>> id(fs) == id(new_fs())
False
>>> %timeit hash(old_fs())
1000000 loops, best of 3: 642 ns per loop
>>> %timeit hash(new_fs())
1000000 loops, best of 3: 660 ns per loop
Note that my previous timings were wrong; using 'and' created a timing asymmetry that the above method avoids. This new method produces the expected results for tuples here -- a negligible timing difference:
>>> tp = (1, 2)
>>> old_tp = lambda: [tuple((1, 2)), tp][1]
>>> new_tp = lambda: [tuple((1, 2)), tp][0]
>>> id(tp) == id(old_tp())
True
>>> id(tp) == id(new_tp())
False
>>> %timeit hash(old_tp())
1000000 loops, best of 3: 533 ns per loop
>>> %timeit hash(new_tp())
1000000 loops, best of 3: 532 ns per loop
And, the coup de grace, comparing hash time for a pre-constructed frozenset to hash time for a pre-constructed tuple:
>>> %timeit hash(fs)
10000000 loops, best of 3: 82.2 ns per loop
>>> %timeit hash(tp)
10000000 loops, best of 3: 93.6 ns per loop
Lattyware's results look more like this because they are an average of results for new and old frozensets. (They hash each tuple or frozenset twice, once in creating the dictionary, once in accessing it.)
The upshot of all this is that it probably doesn't matter, except to those of us who enjoy digging around in Python's internals and testing things into oblivion.
While you can use timeit to find out (and I encourage you to do so, if for no other reason than to learn how it works), in the end it almost certainly doesn't matter.
frozensets are designed specifically to be hashable, so I would be shocked if their hash method is linear time. This kind of micro-optimisation can only matter if you need to get through a fixed (large) number of look-ups in a very short amount of time in a realtime application.
Update: Look at the various updates and comments to Lattyware's answer - it took a lot of collective effort (well, relatively), to strip out the confounding factors, and show that the performance of the two approaches is almost the same. The performance hits were not where they were assumed to be, and it will be the same in your own code.
Write your code to work, then profile to find the hotspots, then apply algorithmic optimisations, then apply micro-optimisations.
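For completeness, a minimal in-script timeit sketch along the lines this answer encourages (names and sizes are arbitrary; the numbers will vary by machine and Python version):

import timeit

setup = '''
keys_t = list(zip(range(10000), range(10, 10010)))
keys_f = [frozenset(k) for k in keys_t]
d_t = dict.fromkeys(keys_t, 0)
d_f = dict.fromkeys(keys_f, 0)
'''

print(timeit.timeit('for k in keys_t: d_t[k]', setup=setup, number=1000))
print(timeit.timeit('for k in keys_f: d_f[k]', setup=setup, number=1000))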
The top answer (Gareth Latty's) appears to be out of date. On python 3.6 hashing a frozenset appears to be much faster, but it depends quite a bit on what you're hashing:
sjelin@work-desktop:~$ ipython
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
In [1]: import time
In [2]: def perf(get_data):
   ...:     tuples = []
   ...:     sets = []
   ...:     for _ in range(10000):
   ...:         t = tuple(get_data(10000))
   ...:         tuples.append(t)
   ...:         sets.append(frozenset(t))
   ...:
   ...:     start = time.time()
   ...:     for s in sets:
   ...:         hash(s)
   ...:     mid = time.time()
   ...:     for t in tuples:
   ...:         hash(t)
   ...:     end = time.time()
   ...:     return {'sets': mid-start, 'tuples': end-mid}
   ...:
In [3]: perf(lambda n: range(n))
Out[3]: {'sets': 0.32627034187316895, 'tuples': 0.22960591316223145}
In [4]: from random import random
In [5]: perf(lambda n: (random() for _ in range(n)))
Out[5]: {'sets': 0.3242628574371338, 'tuples': 1.117497205734253}
In [6]: perf(lambda n: (0 for _ in range(n)))
Out[6]: {'sets': 0.0005457401275634766, 'tuples': 0.16936826705932617}
In [7]: perf(lambda n: (str(i) for i in range(n)))
Out[7]: {'sets': 0.33167099952697754, 'tuples': 0.3538074493408203}
In [8]: perf(lambda n: (object() for _ in range(n)))
Out[8]: {'sets': 0.3275420665740967, 'tuples': 0.18484067916870117}
In [9]: class C:
   ...:     def __init__(self):
   ...:         self._hash = int(random()*100)
   ...:
   ...:     def __hash__(self):
   ...:         return self._hash
   ...:
In [10]: perf(lambda n: (C() for i in range(n)))
Out[10]: {'sets': 0.32653021812438965, 'tuples': 6.292834997177124}
Some of these differences are enough to matter in a perf context, but only if hashing is actually your bottleneck (which almost never happens).
I'm not sure what to make of the fact that the frozensets almost always ran in ~0.33 seconds, while the tuples took anywhere between 0.2 and 6.3 seconds. To be clear, rerunning with the same lambda never changed the results by more than 1%, so it's not like there's a bug.
In Python 2 the results were different, and the two were generally closer to each other, which is probably why Gareth didn't see the same differences.
