Recently I have been learning more about hashing in Python and I came across this blog post which states:
Suppose a Python program has 2 lists. If we need to think about
comparing those two lists, what will you do? Compare each element?
Well, that sounds fool-proof but slow as well!
Python has a much smarter way to do this. When a tuple is constructed
in a program, Python Interpreter calculates its hash in its memory. If
a comparison occurs between 2 tuples, it simply compares the hash
values and it knows if they are equal!
So I am really confused about these statements.
First when we do:
[1, 2, 3] == [1, 2, 3], how does this equality check work? Does it calculate the hash values and then compare them?
Second what's the difference when we do:
[1, 2, 3] == [1, 2, 3] and (1, 2, 3) == (1, 2, 3)?
Because when I tried to measure the execution time with timeit, I got this result:
$ python3.5 -m timeit '[1, 2, 3] == [1, 2, 3]'
10000000 loops, best of 3: 0.14 usec per loop
$ python3.5 -m timeit '(1, 2, 3) == (1, 2, 3)'
10000000 loops, best of 3: 0.0301 usec per loop
So why is there a difference in time, from 0.14 µs for the list down to 0.03 µs for the tuple? Why is the tuple comparison faster?
Well, part of your confusion is that the blog post you're reading is just wrong. About multiple things. Try to forget that you ever read it (except to remember the site and the author's name so you'll know to avoid them in the future).
It is true that tuples are hashable and lists are not, but that isn't relevant to their equality-testing functions. And it's certainly not true that "it simply compares the hash values and it knows if they are equal!" Hash collisions happen, and ignoring them would lead to horrible bugs, and fortunately Python's developers are not that stupid. In fact, it's not even true that Python computes the hash value at initialization time.*
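For example, in CPython (where hash can never return -1, because -1 is the error sentinel at the C level), you can easily build two unequal tuples whose hashes collide:
>>> hash(-1), hash(-2)
(-2, -2)
>>> hash((-1, 0)) == hash((-2, 0))
True
>>> (-1, 0) == (-2, 0)
False
If equality really were just a hash comparison, that last line would have to return True, which would be a disaster.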
There actually is one significant difference between tuples and lists (in CPython, as of 3.6), but it usually doesn't make much difference: Lists do an extra check for unequal length at the beginning as an optimization, but the same check turned out to be a pessimization for tuples,** so it was removed from there.
Another, often much more important, difference is that tuple literals in your source get compiled into constant values, and separate copies of the same tuple literal get folded into the same constant object; that doesn't happen with lists, for obvious reasons.
In fact, that's what you're really testing with your timeit. On my laptop, comparing the tuples takes 95ns, while comparing the lists takes 169ns—but breaking it down, that's actually 93ns for the comparison, plus an extra 38ns to create each list. To make it a fair comparison, you have to move the creation to a setup step, and then compare already-existing values inside the loop. (Or, of course, you may not want to be fair—you're discovering the useful fact that every time you use a tuple constant instead of creating a new list, you're saving a significant fraction of a microsecond.)
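To see that for yourself, here's a fairer version (a sketch; your numbers will differ): build the values once in timeit's -s setup argument, and construct the tuples via tuple() so constant folding can't hand you two references to the same object:
$ python3 -m timeit -s 'a = [1, 2, 3]; b = [1, 2, 3]' 'a == b'
$ python3 -m timeit -s 'a = tuple([1, 2, 3]); b = tuple([1, 2, 3])' 'a == b'
With the creation moved out of the loop, the two comparisons should come out nearly identical.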
Other than that, they basically do the same thing. Translating the C source into Python-like pseudocode (and removing all the error handling, and the stuff that makes the same function work for <, and so on):
for i in range(min(len(v), len(w))):
    if v[i] != w[i]:
        break
else:
    return len(v) == len(w)
return False
The list equivalent is like this:
if len(v) != len(w):
    return False
for i in range(min(len(v), len(w))):
    if v[i] != w[i]:
        break
else:
    return True
return False
* In fact, unlike strings, tuples don't even cache their hashes; if you call hash over and over, it'll keep re-computing it. See issue 9685, where a patch to change that was rejected because it slowed down some benchmarks and didn't speed up anything anyone could find.
** Not because of anything inherent in the implementation, but because people often compare lists of different lengths, but rarely do so with tuples.
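A rough way to observe the non-caching from the first footnote (a sketch; exact numbers vary by machine): re-hash a large tuple and a large string in a loop. The string caches its hash after the first call, so it should be dramatically faster:
$ python3 -m timeit -s 't = tuple(range(1000))' 'hash(t)'
$ python3 -m timeit -s 's = "x" * 1000' 'hash(s)'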
The answer was given in that article too :)
Here is a demonstration:
>>> l1=[1,2,3]
>>> l2=[1,2,3]
>>>
>>> hash(l1) #since list is not hashable
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> t=(1,2,3)
>>> t2=(1,2,3)
>>> hash(t)
2528502973977326415
>>> hash(t2)
2528502973977326415
>>>
In the above, calling hash on a list raises a TypeError because lists are not hashable, and to test two lists for equality Python has to check the values inside, which takes more time.
In the case of tuples it calculates the hash value, and two equal tuples have the same hash value, so Python only compares the hash values of the tuples, which is much faster than comparing lists.
From the given article:
Python has a much smarter way to do this. When a tuple is constructed
in a program, Python Interpreter calculates its hash in its memory. If
a comparison occurs between 2 tuples, it simply compares the hash
values and it knows if they are equal!
I understand that sets in Python are unordered, but I'm curious about the 'order' they're displayed in, as it seems to be consistent. They seem to be out-of-order in the same way every time:
>>> set_1 = set([5, 2, 7, 2, 1, 88])
>>> set_2 = set([5, 2, 7, 2, 1, 88])
>>> set_1
set([88, 1, 2, 5, 7])
>>> set_2
set([88, 1, 2, 5, 7])
...and another example:
>>> set_3 = set('abracadabra')
>>> set_4 = set('abracadabra')
>>> set_3
set(['a', 'r', 'b', 'c', 'd'])
>>> set_4
set(['a', 'r', 'b', 'c', 'd'])
I'm just curious why this would be. Any help?
You should watch this video (although it is CPython[1]-specific and about dictionaries -- but I assume it applies to sets as well).
Basically, Python hashes the elements and takes the last N bits (where N is determined by the size of the set) and uses those bits as array indices to place the object in memory. The objects are then yielded in the order they exist in memory. Of course, the picture gets a little more complicated when you need to resolve collisions between hashes, but that's the gist of it.
Also note that the order that they are printed out is determined by the order that you put them in (due to collisions). So, if you reorder the list you pass to set_2, you might get a different order out if there are key collisions.
For example:
list1 = [8,16,24]
set(list1) #set([8, 16, 24])
list2 = [24,16,8]
set(list2) #set([24, 16, 8])
Note that the fact that the order is preserved in these sets is a "coincidence" and has to do with collision resolution (which I don't know much about). The point is that the last 3 bits of hash(8), hash(16) and hash(24) are the same. Because they are the same, collision resolution takes over and puts the elements in "backup" memory locations instead of the first (best) choice, so whether 8 or 16 occupies a given location is determined by which one arrived at the party first and took the "best seat".
If we repeat the example with 1, 2 and 3, you will get a consistent order no matter what order they have in the input list:
list1 = [1,2,3]
set(list1) # set([1, 2, 3])
list2 = [3,2,1]
set(list2) # set([1, 2, 3])
since the last 3 bits of hash(1), hash(2) and hash(3) are unique.
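You can verify the bit arithmetic directly, since small integers hash to themselves in CPython:
>>> [hash(x) % 8 for x in (8, 16, 24)]   # all land in slot 0, so they collide
[0, 0, 0]
>>> [hash(x) % 8 for x in (1, 2, 3)]     # distinct slots, so the order is stable
[1, 2, 3]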
[1] Note: The implementation described here applies to CPython dict and set. I think that the general description is valid for all modern versions of CPython up to 3.6. However, starting with CPython 3.6, there is an additional implementation detail that actually preserves the insertion order for iteration over a dict. It appears that sets still do not have this property. The data structure is described in this blog post by the PyPy folks (who started using this before the CPython folks did). The original idea (at least for the Python ecosystem) is archived on the python-dev mailing list.
The reason for this behavior is that Python uses hash tables for its dictionary implementation: https://en.wikipedia.org/wiki/Hash_table#Open_addressing
The position of the key is defined by its memory address. Note that Python reuses memory for some objects:
>>> a = 'Hello world'
>>> id(a)
140058096568768
>>> a = 'Hello world'
>>> id(a)
140058096568480
You can see that the object a has a different address each time it is initialized.
But for small integers the address doesn't change:
>>> a = 1
>>> id(a)
40060856
>>> a = 1
>>> id(a)
40060856
Even if we create a second object with a different name, the address is the same:
>>> b = 1
>>> id(b)
40060856
This approach allows the Python interpreter to save memory.
AFAIK Python sets are implemented using a hash table. The order in which the items appear depends on the hash function used. Within the same run of the program, the hash function probably does not change, hence you get the same order.
But there are no guarantees that it will always use the same function, and the order will change across runs - or within the same run if you insert a lot of elements and the hash table has to resize.
One key thing that's hinted at in mgilson's great answer, but isn't mentioned explicitly in any of the existing answers:
Small integers hash to themselves:
>>> [hash(x) for x in (1, 2, 3, 88)]
[1, 2, 3, 88]
Strings hash to values that are unpredictable. In fact, from 3.3 on, by default, they're built off a seed that's randomized at startup. So, you'll get different results for each new Python interpreter session, but:
>>> [hash(x) for x in 'abcz']
[6014072853767888837,
8680706751544317651,
-7529624133683586553,
-1982255696180680242]
So, consider the simplest possible hash table implementation: just an array of N elements, where inserting a value means putting it in hash(value) % N (assuming no collisions). And you can make a rough guess at how large N will be—it's going to be a little bigger than the number of elements in it. When creating a set from a sequence of 6 elements, N could easily be, say, 8.
What happens when you store those 5 numbers with N=8? Well, hash(1) % 8, hash(2) % 8, etc. are just the numbers themselves, but hash(88) % 8 is 0. So, the hash table's array ends up holding 88, 1, 2, NULL, NULL, 5, NULL, 7. So it should be easy to figure out why iterating the set might give you 88, 1, 2, 5, 7.
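You can check that slot arithmetic yourself (in CPython, where small integers hash to themselves):
>>> {v: hash(v) % 8 for v in (5, 2, 7, 1, 88)}   # value -> slot index
{5: 5, 2: 2, 7: 7, 1: 1, 88: 0}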
Of course Python doesn't guarantee that you'll get this order every time. A small change to the way it guesses at the right value for N could mean 88 ends up somewhere different (or ends up colliding with one of the other values). And, in fact, running CPython 3.7 on my Mac, I get 1, 2, 5, 7, 88.
Meanwhile, when you build a hash table from a sequence of 11 characters and insert values with randomized hashes into it, what happens? Even assuming the simplest implementation, and assuming there are no collisions, you still have no idea what order you're going to get. It will be consistent within a single run of the Python interpreter, but different the next time you start it up. (Unless you set PYTHONHASHSEED to 0, or to some other int value.) Which is exactly what you see.
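You can watch the randomization from the shell (a sketch; the actual hash values on your machine will differ):
$ python3 -c "print(hash('abracadabra'))"                    # a different value on each run
$ PYTHONHASHSEED=0 python3 -c "print(hash('abracadabra'))"   # the same value on every run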
Of course it's worth looking at the way sets are actually implemented rather than guessing. But what you'd guess based on the assumption of the simplest hash table implementation is (barring collisions and barring expansion of the hash table) exactly what happens.
Sets are based on a hash table. The hash of a value should be consistent so the order will be also - unless two elements hash to the same code, in which case the order of insertion will change the output order.
I am comparing two lists of integers and am trying to access the lowest value without using a for-loop as the lists are quite large. I have tried using set comparison, yet I receive an empty set when doing so. Currently my approach is:
differenceOfIpLists = list(set(reservedArray).difference(set(ipChoicesArray)))
I have also tried:
differenceOfIpLists = list(set(reservedArray) - set(ipChoicesArray))
And the lists are defined as such:
reservedArray = [169017344, 169017345, 169017346, 169017347, 169017348, 169017349, 169017350, 169017351, 169017352, 169017353, 169017354, 169017355, 169017356, 169017357, 169017358, 169017359, 169017360, 169017361, 169017362, 169017363, 169017364, 169017365, 169017366, 169017367, 169017368, 169017369, 169017600, 169017601, 169017602, 169017603, 169017604, 169017605, 169017606, 169017607, 169017608, 169017609, 169017610, 169017611, 169017612, 169017613, 169017614, 169017615, 169017616, 169017617, 169017618, 169017619...]
ipChoicesArray = [169017344, 169017345, 169017346, 169017347, 169017348, 169017349, 169017350, 169017351, 169017352, 169017353, 169017354, 169017355, 169017356, 169017357, 169017358, 169017359, 169017360, 169017361, 169017362, 169017363, 169017364, 169017365, 169017366, 169017367, 169017368, 169017369, 169017370, 169017371, 169017372, 169017373, 169017374, 169017375, 169017376, 169017377, 169017378, 169017379, 169017380, 169017381, 169017382...]
Portions of these lists are the same, yet they are vastly different, as the lengths show:
reservedArrayLength = 6658
ipChoicesArrayLength = 65536
I have also tried converting these values to strings and doing the same style of comparison, also to no avail.
Once I am able to extract a list of the elements in the ipChoicesArray that are not in the reservedArray, I will return the smallest element after sorting.
I do not believe that I am facing a max length issue...
Subtracting the sets should work as you desire, see below:
ipChoicesArray = [1,3,4,7,1]
reservedArray = [1,2,5,7,8,2,1]
min(set(ipChoicesArray) - set(reservedArray))
###Output###
3
By the way, the maximum length of a list is platform-dependent; 536,870,912 elements is the limit on a 32-bit build.
without using a for-loop as the lists are quite large
The presumption that a for-loop is a poor choice because the list is large is likely incorrect. Creating a set from a list (and vice versa) will not only iterate through the containers under the hood anyway (just like a for-loop), it will also allocate new containers and take up more memory. Profile your code before you assume something won't perform well.
That aside, in your code it seems the reason you are getting an empty result is because your difference is inverted. To get the elements in ipChoicesArray but not in reservedArray you want to difference the latter from the former:
diff = set(ipChoicesArray) - set(reservedArray)
The obvious solution (you just did the set difference in the wrong direction):
print(min(set(ipChoicesArray) - set(reservedArray)))
You said they're always sorted, and your reverse difference being empty (and thinking about what you're doing) suggests that the "choices" are a superset of the "reserved", so then this also works and could be faster:
print(next(c for c, r in zip(ipChoicesArray, reservedArray) if c != r))
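For example (illustrative values, assuming both lists are sorted and every reserved address appears among the choices):
ipChoicesArray = [10, 11, 12, 13, 14]
reservedArray = [10, 11, 13, 14]   # 12 was never reserved
# zip pairs (10,10), (11,11), (12,13); the first mismatch is 12, the lowest free value
print(next(c for c, r in zip(ipChoicesArray, reservedArray) if c != r))   # 12
Note that if the reserved list is an exact prefix of the choices, the generator is exhausted without a mismatch and next raises StopIteration, so you may want to pass next a default value.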
Disclaimer: the Python docs state that
A set is an unordered collection with no duplicate elements.
But I can see that the output of an unordered set appears to be ordered:
s = {'z', 1, 0, 'a'}
s #=> {0, 1, 'a', 'z'}
next(iter(s)) #=> 0
So, I don't know if this approach is reliable. Maybe some other user can confirm or deny this with an appropriate reference to the set behaviour.
Having said this...
Don't know if I'm getting the point, but..
Not knowing where the smallest value is, you could use this approach (here using smaller values and shorter list):
a = [2, 5, 5, 1, 6, 7, 8, 9]
b = [2, 3, 4, 5, 6, 6, 1]
Find the smallest of the union:
set_a = set(a)
set_b = set(b)
union = set_a | set_b
next(iter(union))
#=> 1
Or just:
min([next(iter(set_a)), next(iter(set_b))])
#=> 1
Or, maybe this fits better your question:
next(iter(set_a-set_b)) #=> 8
I'm impressed by and enjoy the fact that a simple Python for statement can easily unravel a list of lists, without the need for numpy.unravel or an equivalent flatten function. However, the trade-off is now that I can't access elements of a list like this:
for a,b,c in [[5],[6],[7]]:
    print(str(a),str(b),str(c))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not enough values to unpack (expected 3, got 1)
and instead, this works, up until the length-1 [5]:
for a,b,c in [[1,2,3],[4,5,6],[7,8,9],[0,0,0], [5]]:
    print(a,b,c)

1 2 3
4 5 6
7 8 9
0 0 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not enough values to unpack (expected 3, got 1)
Logically, it doesn't make sense to assume that a list would have a fixed number of elements. How come then, Python allows us to assume that a list of lists would always have the same number of elements?
I'd like to be aware of what Python expects, because I want to anticipate wrongly formatted lists/sublists.
I've poked around Python documentation and Stackoverflow, but haven't found the reasoning or how the interpreter is doing this.
My guess is that flattening same-length arrays is such a common occurrence (e.g. machine learning dimensionality reduction, matrix transformations, etc.), that there's utility in providing this feature at the trade-off of being unable to do what I've tried above.
Python does not assume same-length lists, because this mechanism is not just for lists.
When you iterate with for a,b,c in [[1,2,3],[4,5,6],[7,8,9],[0,0,0], [5]], what happens is that Python creates an iterator that yields each inner list in turn.
So that for loop is equivalent to:
l = [[1,2,3],[4,5,6],[7,8,9],[0,0,0], [5]]
l_iter = iter(l)
a,b,c = next(l_iter)
next(l_iter) will return each element from the list until it raises a StopIteration exception, according to the Python iteration protocol.
This means:
a,b,c = [1,2,3]
a,b,c = [4,5,6]
a,b,c = [7,8,9]
a,b,c = [0,0,0]
a,b,c = [5]
As you can see, Python can't unpack [5] into a, b, c, as there is only one value.
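Putting it all together, the whole loop is roughly equivalent to this sketch:
l = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 0, 0], [5]]
l_iter = iter(l)
while True:
    try:
        item = next(l_iter)   # fetch the next inner list
    except StopIteration:
        break                 # iterator exhausted: the loop ends normally
    a, b, c = item            # the unpacking assignment happens here...
    print(a, b, c)            # ...and raises ValueError when item is [5]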
The interpreter always assumes the length is matching when making an unpacking assignment, and just crashes with ValueError if it doesn't match. A for-loop is actually very similar to a kind of "repeated assignment statement", with the LHS being the free variable(s) of the loop and the RHS being an iterable container yielding the successive value(s) to use in each step of the iteration.
One assignment per iteration, made at the beginning of the loop body - in your case, it's an unpacking assignment, which binds multiple names.
So, in order to be properly equivalent to the second example, your first example which was:
for a,b,c in [[5],[6],[7]]:
    ...
should have been written instead:
for a, in [[5],[6],[7]]:
    ...
There is no "anticipation", and there can't be because (in the general case) you may be iterating over anything, e.g. data streaming in from a socket.
In order to fully grasp how for-loop flow works, the analogy with assignment statements is very useful. Anything that you can use on the left-hand side of an assignment statement, you can use as the target in a for-loop. For example, this is equivalent to setting d[1] = 2 etc. in a dict - and should produce the same result as dict(RHS):
>>> d = {}
>>> for k, d[k] in [[1, 2], [3, 4]]:
... pass
...
>>> d
{1: 2, 3: 4}
It's just a bunch of assignments, in a well-defined order.
Python doesn't know, you just told it to expect three elements by unpacking to three names. The ValueError says "you told us three, but we found a sub-iterable that didn't have three elements, and we don't know what to do".
Python isn't really doing anything special to implement this; aside from special cases for built-in types like tuple (and probably list), the implementation is just to iterate the sub-iterable the expected number of times and dump all the values found on the interpreter stack, then store them to the provided names. It also tries to iterate one more time (expecting StopIteration) so you don't silently ignore extra values.
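You can see this in the bytecode (a sketch; the exact opcodes printed vary a little between CPython versions):
import dis
# The unpacking compiles to a single UNPACK_SEQUENCE opcode that demands
# exactly three values from whatever object 't' turns out to be.
dis.dis("a, b, c = t")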
For limited cases, you can be flexible by having one of the unpack names preceded with a *, so you capture all the "didn't fit" elements into that name (as a list). That lets you set a minimum number of elements while allowing more, e.g. if you really only need the first element from your second example, you could do:
for a, *_ in [[1,2,3],[4,5,6],[7,8,9],[0,0,0], [5]]:
    print(a)
where _ is just a name that, by convention, means "I don't actually care about this value, but I needed a placeholder name".
Another example would be when you want the first and last element, but otherwise don't care about the middle:
for first, *middle, last in myiterable:
    ...
But otherwise, if you need to handle variable length iterables, don't unpack, just store to a single name and iterate that name manually in whatever way makes sense to your program logic.
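For instance, a hedged sketch (the names here are illustrative) that validates each row before unpacking:
rows = [[1, 2, 3], [4, 5, 6], [5]]
for row in rows:
    if len(row) != 3:
        print("skipping malformed row:", row)
        continue
    a, b, c = row
    print(a, b, c)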
Given a list of integers such as integers = [1, 2, 3, 4, 5, 6]
I would like to know if there is an even number in the list using Python's any() function. My question is if it is more efficient to pass a list comprehension outcome like so:
evens = [each for each in integers if each % 2 == 0]
has_even = any(evens)
versus using a generator such as:
has_even = any(each for each in integers if each % 2 == 0)
It's better to pass a generator than a list comprehension to any and all. Both of those functions can short-circuit, i.e., any will stop as soon as it encounters a True value, and all will stop as soon as it encounters a False value. If you pass them a list comprehension, the whole list has to be built before any / all can start work, but if you pass them a generator then only the needed values will be generated. So not only do you save RAM using a generator, you may save a substantial amount of execution time, too.
Your generator can be made more efficient; the if part is redundant.
has_even = any(each % 2 == 0 for each in integers)
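To watch the short-circuiting happen, here's a sketch with a predicate that reports each check (the is_even helper and the prints are just for illustration):
def is_even(x):
    print("checking", x)
    return x % 2 == 0

integers = [1, 3, 4, 5, 6]
print(any(is_even(x) for x in integers))
# checking 1
# checking 3
# checking 4
# True      <- stops at the first even number; 5 and 6 are never checked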
any with a generator is the most efficient method here as it will not allocate the list of all even numbers and moreover it will stop at the first even number, not even considering others. The input could also be a generator (e.g. reading numbers from a file) and in this case the saving is bigger if you stop reading the input.
any with a generator is also very readable, especially if you define an even predicate...
def even(x):
    return x % 2 == 0

if any(even(x) for x in integers):
    ...
Readability should be for most software your primary concern (computers today are generally very fast).
If your eyes are trained with the functional approach then an even more readable version could be
if any(filter(even, integers)):
    ...
which in Python 3 is just as efficient (it stops extracting items from the input once the result is known).
Note however that if efficiency for this kind of computation is your most important concern then Python is probably the wrong tool...