I understand that sets in Python are unordered, but I'm curious about the 'order' they're displayed in, as it seems to be consistent. They seem to be out-of-order in the same way every time:
>>> set_1 = set([5, 2, 7, 2, 1, 88])
>>> set_2 = set([5, 2, 7, 2, 1, 88])
>>> set_1
set([88, 1, 2, 5, 7])
>>> set_2
set([88, 1, 2, 5, 7])
...and another example:
>>> set_3 = set('abracadabra')
>>> set_4 = set('abracadabra')
>>> set_3
set(['a', 'r', 'b', 'c', 'd'])
>>> set_4
set(['a', 'r', 'b', 'c', 'd'])
I'm just curious why this would be. Any help?
You should watch this video (although it is CPython1 specific and about dictionaries -- but I assume it applies to sets as well).
Basically, python hashes the elements and takes the last N bits (where N is determined by the size of the set) and uses those bits as array indices to place the object in memory. The objects are then yielded in the order they exist in memory. Of course, the picture gets a little more complicated when you need to resolve collisions between hashes, but that's the gist of it.
Also note that the order that they are printed out is determined by the order that you put them in (due to collisions). So, if you reorder the list you pass to set_2, you might get a different order out if there are key collisions.
For example:
list1 = [8,16,24]
set(list1) #set([8, 16, 24])
list2 = [24,16,8]
set(list2) #set([24, 16, 8])
Note the fact that the order is preserved in these sets is "coincidence" and has to do with collision resolution (which I don't know anything about). The point is that the last 3 bits of hash(8), hash(16) and hash(24) are the same. Because they are the same, collision resolution takes over and puts the elements in "backup" memory locations instead of the first (best) choice and so whether 8 occupies a location or 16 is determined by which one arrived at the party first and took the "best seat".
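You can check the collision claim directly; this relies on the CPython detail that small integers hash to themselves, so the line below mirrors the "last 3 bits" model rather than CPython's real probing logic:
[hash(x) % 8 for x in (8, 16, 24)]  # [0, 0, 0] -- all three map to slot 0 of an 8-slot table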
If we repeat the example with 1, 2 and 3, you will get a consistent order no matter what order they have in the input list:
list1 = [1,2,3]
set(list1) # set([1, 2, 3])
list2 = [3,2,1]
set(list2) # set([1, 2, 3])
since the last 3 bits of hash(1), hash(2) and hash(3) are unique.
1Note: The implementation described here applies to CPython's dict and set. I think the general description is valid for all modern versions of CPython up to 3.6. However, starting with CPython 3.6, there is an additional implementation detail that actually preserves insertion order for iteration over a dict. It appears that sets still do not have this property. The data structure is described in this blog post by the PyPy folks (who started using it before CPython did). The original idea (at least for the Python ecosystem) is archived on the python-dev mailing list.
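A quick sketch of that difference on a recent CPython (the dict ordering is only a language guarantee from 3.7 on; the set comment describes observed behaviour, not a guarantee):
d = dict.fromkeys([5, 2, 7, 1, 88])
list(d)                 # [5, 2, 7, 1, 88] -- insertion order preserved for dict
set([5, 2, 7, 1, 88])   # iteration order is still driven by the hashes, not by insertion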
The reason for this behavior is that Python uses hash tables for its dictionary implementation: https://en.wikipedia.org/wiki/Hash_table#Open_addressing
The position of a key is defined by its memory address. As you may know, Python reuses memory for some objects:
>>> a = 'Hello world'
>>> id(a)
140058096568768
>>> a = 'Hello world'
>>> id(a)
140058096568480
You can see that the object a has a different address each time it is initialized.
But for small integers the address doesn't change:
>>> a = 1
>>> id(a)
40060856
>>> a = 1
>>> id(a)
40060856
Even if we create a second object with a different name, the address stays the same:
>>> b = 1
>>> id(b)
40060856
This approach reduces the amount of memory the Python interpreter consumes.
AFAIK Python sets are implemented using a hash table. The order in which the items appear depends on the hash function used. Within the same run of the program, the hash function probably does not change, hence you get the same order.
But there are no guarantees that it will always use the same function, and the order will change across runs - or within the same run if you insert a lot of elements and the hash table has to resize.
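For example, a rough sketch of the resize effect (CPython-specific and not guaranteed; the exact orders may differ on your build):
s = {16, 8}                  # 8 and 16 collide in the initial 8-slot table,
print(list(s))               # so their relative order depends on insertion and probing
s.update(range(100, 160))    # enough new elements to force the table to grow
print([x for x in s if x in (8, 16)])   # in the larger table they no longer collide,
                                        # so their relative order may have changed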
One key thing that's hinted at in mgilson's great answer, but isn't mentioned explicitly in any of the existing answers:
Small integers hash to themselves:
>>> [hash(x) for x in (1, 2, 3, 88)]
[1, 2, 3, 88]
Strings hash to values that are unpredictable. In fact, from 3.3 on, by default, they're built off a seed that's randomized at startup. So, you'll get different results for each new Python interpreter session, but:
>>> [hash(x) for x in 'abcz']
[6014072853767888837,
8680706751544317651,
-7529624133683586553,
-1982255696180680242]
So, consider the simplest possible hash table implementation: just an array of N elements, where inserting a value means putting it in hash(value) % N (assuming no collisions). And you can make a rough guess at how large N will be—it's going to be a little bigger than the number of elements in it. When creating a set from a sequence of 6 elements, N could easily be, say, 8.
What happens when you store those 5 numbers with N=8? Well, hash(1) % 8, hash(2) % 8, etc. are just the numbers themselves, but hash(88) % 8 is 0. So, the hash table's array ends up holding 88, 1, 2, NULL, NULL, 5, NULL, 7. So it should be easy to figure out why iterating the set might give you 88, 1, 2, 5, 7.
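You can check that slot arithmetic for the question's numbers (this mirrors the simplified model above, not CPython's actual probing logic):
>>> [hash(x) % 8 for x in (5, 2, 7, 1, 88)]
[5, 2, 7, 1, 0]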
Of course Python doesn't guarantee that you'll get this order every time. A small change to the way it guesses at the right value for N could mean 88 ends up somewhere different (or ends up colliding with one of the other values). And, in fact, running CPython 3.7 on my Mac, I get 1, 2, 5, 7, 88.
Meanwhile, when you build a hash table from a sequence of size 11 and then insert randomized hashes into it, what happens? Even assuming the simplest implementation, and assuming there are no collisions, you still have no idea what order you're going to get. It will be consistent within a single run of the Python interpreter, but different the next time you start it up. (Unless you set PYTHONHASHSEED to 0, or to some other int value.) Which is exactly what you see.
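A quick sanity check of the within-session consistency (a sketch; the order itself will differ between sessions unless the seed is fixed):
>>> orders = {tuple(set('abracadabra')) for _ in range(1000)}
>>> len(orders)   # only one distinct iteration order within this session
1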
Of course it's worth looking at the way sets are actually implemented rather than guessing. But what you'd guess based on the assumption of the simplest hash table implementation is (barring collisions and barring expansion of the hash table) exactly what happens.
Sets are based on a hash table. The hash of a value should be consistent so the order will be also - unless two elements hash to the same code, in which case the order of insertion will change the output order.
I am comparing two lists of integers and am trying to access the lowest value without using a for-loop as the lists are quite large. I have tried using set comparison, yet I receive an empty set when doing so. Currently my approach is:
differenceOfIpLists = list(set(reservedArray).difference(set(ipChoicesArray)))
I have also tried:
differenceOfIpLists = list(set(reservedArray) - set(ipChoicesArray))
And the lists are defined as such:
reservedArray = [169017344, 169017345, 169017346, 169017347, 169017348, 169017349, 169017350, 169017351, 169017352, 169017353, 169017354, 169017355, 169017356, 169017357, 169017358, 169017359, 169017360, 169017361, 169017362, 169017363, 169017364, 169017365, 169017366, 169017367, 169017368, 169017369, 169017600, 169017601, 169017602, 169017603, 169017604, 169017605, 169017606, 169017607, 169017608, 169017609, 169017610, 169017611, 169017612, 169017613, 169017614, 169017615, 169017616, 169017617, 169017618, 169017619...]
ipChoicesArray = [169017344, 169017345, 169017346, 169017347, 169017348, 169017349, 169017350, 169017351, 169017352, 169017353, 169017354, 169017355, 169017356, 169017357, 169017358, 169017359, 169017360, 169017361, 169017362, 169017363, 169017364, 169017365, 169017366, 169017367, 169017368, 169017369, 169017370, 169017371, 169017372, 169017373, 169017374, 169017375, 169017376, 169017377, 169017378, 169017379, 169017380, 169017381, 169017382...]
Portions of these lists are the same, yet they are vastly different as the lengths are:
reservedArrayLength = 6658
ipChoicesArrayLength = 65536
I have also tried converting these values to strings and doing the same style of comparison, also to no avail.
Once I am able to extract a list of the elements in the ipChoicesArray that are not in the reservedArray, I will return the smallest element after sorting.
I do not believe that I am facing a max length issue...
Subtracting the sets should work as you desire, see below:
ipChoicesArray = [1,3,4,7,1]
reservedArray = [1,2,5,7,8,2,1]
min(set(ipChoicesArray) - set(reservedArray))
# Output: 3
By the way, the maximum length of a list is 536,870,912 elements (on a 32-bit build).
without using a for-loop as the lists are quite large
The presumption that a for-loop is a poor choice because the list is large is likely incorrect. Creating a set from a list (and vice versa) iterates through the container under the hood anyway (just like a for-loop), in addition to allocating a new container and taking up more memory. Profile your code before you assume something won't perform well.
That aside, in your code it seems the reason you are getting an empty result is that your difference is inverted. To get the elements that are in ipChoicesArray but not in reservedArray, you want to subtract the latter from the former:
diff = set(ipChoicesArray) - set(reservedArray)
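Putting it together, a minimal sketch with shortened stand-in lists (variable names follow the question):
reservedArray = [169017344, 169017345, 169017346]
ipChoicesArray = [169017344, 169017345, 169017346, 169017347, 169017348]
diff = set(ipChoicesArray) - set(reservedArray)
print(min(diff))   # 169017347 -- the smallest address that is not reserved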
The obvious solution (you just did the set difference in the wrong direction):
print(min(set(ipChoicesArray) - set(reservedArray)))
You said they're always sorted, and your reverse difference being empty (and thinking about what you're doing) suggests that the "choices" are a superset of the "reserved", so then this also works and could be faster:
print(next(c for c, r in zip(ipChoicesArray, reservedArray) if c != r))
Disclaimer: the Python docs state that
A set is an unordered collection with no duplicate elements.
But I can see that the output of this unordered set happens to come out ordered:
s = {'z', 1, 0, 'a'}
s #=> {0, 1, 'a', 'z'}
next(iter(s)) #=> 0
So, I don't know if this approach is reliable. Maybe some other user can confirm or deny this with an appropriate reference to the set behaviour.
Having said this...
Don't know if I'm getting the point, but..
Not knowing where the smallest value is, you could use this approach (here using smaller values and shorter lists):
a = [2, 5, 5, 1, 6, 7, 8, 9]
b = [2, 3, 4, 5, 6, 6, 1]
set_a = set(a)
set_b = set(b)
Find the smallest of the union:
union = set_a | set_b
next(iter(union))
#=> 1
Or just:
min([next(iter(set_a)), next(iter(set_b))])
#=> 1
Or, maybe this fits your question better:
next(iter(set_a-set_b)) #=> 8
Recently I have been learning more about hashing in Python and I came across this blog, where it states that:
Suppose a Python program has 2 lists. If we need to think about
comparing those two lists, what will you do? Compare each element?
Well, that sounds fool-proof but slow as well!
Python has a much smarter way to do this. When a tuple is constructed
in a program, Python Interpreter calculates its hash in its memory. If
a comparison occurs between 2 tuples, it simply compares the hash
values and it knows if they are equal!
So I am really confused about these statements.
First, when we do:
[1, 2, 3] == [1, 2, 3], how does this equality check work? Does it calculate the hash values and then compare them?
Second, what's the difference when we do:
[1, 2, 3] == [1, 2, 3] and (1, 2, 3) == (1, 2, 3)?
Because when I tried to measure the execution time with timeit, I got this result:
$ python3.5 -m timeit '[1, 2, 3] == [1, 2, 3]'
10000000 loops, best of 3: 0.14 usec per loop
$ python3.5 -m timeit '(1, 2, 3) == (1, 2, 3)'
10000000 loops, best of 3: 0.0301 usec per loop
So why is there a difference in time, from 0.14 usec for the list to 0.03 usec for the tuple? Why is the tuple comparison faster than the list comparison?
Well, part of your confusion is that the blog post you're reading is just wrong. About multiple things. Try to forget that you ever read it (except to remember the site and the author's name so you'll know to avoid them in the future).
It is true that tuples are hashable and lists are not, but that isn't relevant to their equality-testing functions. And it's certainly not true that "it simply compares the hash values and it knows if they are equal!" Hash collisions happen, and ignoring them would lead to horrible bugs, and fortunately Python's developers are not that stupid. In fact, it's not even true that Python computes the hash value at initialization time.*
There actually is one significant difference between tuples and lists (in CPython, as of 3.6), but it usually doesn't make much difference: Lists do an extra check for unequal length at the beginning as an optimization, but the same check turned out to be a pessimization for tuples,** so it was removed from there.
Another, often much more important, difference is that tuple literals in your source get compiled into constant values, and separate copies of the same tuple literal get folded into the same constant object; that doesn't happen with lists, for obvious reasons.
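You can see the constant folding with the dis module (a sketch; the exact bytecode differs between CPython versions):
import dis
dis.dis(compile('(1, 2, 3) == (1, 2, 3)', '<test>', 'eval'))  # both operands are LOAD_CONST of a pre-built tuple
dis.dis(compile('[1, 2, 3] == [1, 2, 3]', '<test>', 'eval'))  # each list is rebuilt at runtime before the compare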
In fact, that's what you're really testing with your timeit. On my laptop, comparing the tuples takes 95ns, while comparing the lists takes 169ns—but breaking it down, that's actually 93ns for the comparison, plus an extra 38ns to create each list. To make it a fair comparison, you have to move the creation to a setup step, and then compare already-existing values inside the loop. (Or, of course, you may not want to be fair—you're discovering the useful fact that every time you use a tuple constant instead of creating a new list, you're saving a significant fraction of a microsecond.)
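For a fairer comparison, move the construction into the setup so only the comparison itself is timed (a sketch using the timeit module; the absolute numbers will differ on your machine):
import timeit
print(timeit.timeit('a == b', setup='a = [1, 2, 3]; b = [1, 2, 3]'))
print(timeit.timeit('a == b', setup='a = (1, 2, 3); b = (1, 2, 3)'))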
Other than that, they basically do the same thing. Translating the C source into Python-like pseudocode (and removing all the error handling, and the stuff that makes the same function work for <, and so on):
for i in range(min(len(v), len(w))):
    if v[i] != w[i]:
        break
else:
    return len(v) == len(w)
return False
The list equivalent is like this:
if len(v) != len(w):
    return False
for i in range(min(len(v), len(w))):
    if v[i] != w[i]:
        break
else:
    return True
return False
* In fact, unlike strings, tuples don't even cache their hashes; if you call hash over and over, it'll keep re-computing it. See issue 9685, where a patch to change that was rejected because it slowed down some benchmarks and didn't speed up anything anyone could find.
** Not because of anything inherent about the implementation, but because people often compare lists of different lengths, but rarely do with tuples.
The answer was given in that article too :)
Here is a demonstration:
>>> l1=[1,2,3]
>>> l2=[1,2,3]
>>>
>>> hash(l1) #since list is not hashable
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> t=(1,2,3)
>>> t2=(1,2,3)
>>> hash(t)
2528502973977326415
>>> hash(t2)
2528502973977326415
>>>
In the example above, calling hash on a list raises a TypeError because lists are not hashable, and to check two lists for equality Python has to compare their contents, which takes more time.
In the case of a tuple, the hash value is calculated; two equal tuples have the same hash value, so Python only compares the hash values of the tuples, which is much faster than comparing lists.
From the given article:
Python has a much smarter way to do this. When a tuple is constructed
in a program, Python Interpreter calculates its hash in its memory. If
a comparison occurs between 2 tuples, it simply compares the hash
values and it knows if they are equal!
I have a list my_list (the list contains utf8 strings):
>>> len(my_list)
8777
>>> getsizeof(my_list) # <-- note the size
77848
For some reason, a sorted list (my_sorted_list = sorted(my_list)) uses more memory :
>>> len(my_sorted_list)
8777
>>> getsizeof(my_sorted_list) # <-- note the size
79104
Why is sorted returning a list that takes more space in memory than the initial unsorted list?
As Ignacio points out, this is due to Python allocating a bit more memory than required. This is done in order to make .append on lists amortized O(1).
sorted creates a new list out of the sequence provided, sorts it in place and returns it. To create the new list, Python extends an empty list with the one passed; that results in the observed over-allocation (which happens after calling list_resize). You can corroborate the fact that sorting isn't the culprit by using list.sort; the same algorithm is used without a new list getting created (or, as it's known, it's performed in place). The sizes there, of course, don't differ.
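A small sketch of that corroboration (the list of strings here is a stand-in for the question's data; sizes depend on the build, so only the pattern matters, not the exact numbers):
from sys import getsizeof
l = [str(i) for i in range(8777)]   # stand-in for the question's list of strings
print(getsizeof(l))                 # size of the original list
print(getsizeof(sorted(l)))         # new list built by extending an empty one -> may be over-allocated
l.sort()                            # in place: no new list is created
print(getsizeof(l))                 # same as the first number: sorting in place doesn't reallocate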
It is worth noting that this difference is mostly present when:
The original list was created with a list-comp (where, if space is available and the final append doesn't trigger a resize, the size is smaller).
When list literals are used. There, a PyList_New is created based on the number of values on the stack and no appends are made (values are assigned directly to the underlying array), which doesn't trigger any resizes and keeps the size to its minimum:
So, with a list-comp:
l = [i for i in range(10)]
getsizeof(l) # 192
getsizeof(sorted(l)) # 200
Or a list literal:
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
getsizeof(l) # 144
getsizeof(sorted(l)) # 200
the sizes are smaller (more so with use of literals).
When creating a list through the list constructor, the memory is always over-allocated; Python knows the size and preempts future modifications by over-allocating a bit based on that size:
l = list(range(10))
getsizeof(l) # 200
getsizeof(sorted(l)) # 200
So you get no observed difference in the sizes of the list(s).
As a final note, I must point out that this behavior is specific to the C implementation of Python, i.e. CPython. It's a detail of how the language was implemented and, as such, you shouldn't depend on it in any wacky way.
Jython, IronPython, PyPy and any other implementation might/might not have the same behavior.
The list resize operation overallocates in order to amortize appending to the list versus starting with a list preallocated by the compiler.
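A quick way to observe that over-allocation (a sketch; the exact growth steps are an implementation detail):
from sys import getsizeof
l = []
for i in range(20):
    l.append(i)
    print(i, getsizeof(l))   # the reported size jumps in steps rather than on every append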
Is anyone able to explain the comment under testLookups() in this code snippet?
I've run the code and indeed what the comment says is true. However, I'd like to understand why it's true, i.e. why cPickle outputs different values for the same object depending on how it is referenced.
Does it have anything to do with reference count? If so, isn't that some kind of a bug - i.e. the pickled and deserialized object would have an abnormally high reference count and in effect would never get garbage collected?
There is no guarantee that seemingly identical objects will produce identical pickle strings.
The pickle protocol is a virtual machine, and a pickle string is a program for that virtual machine. For a given object there exist multiple pickle strings (=programs) that will reconstruct that object exactly.
To take one of your examples:
>>> from cPickle import dumps
>>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
>>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]))
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\ntp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
>>> dumps(t)
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt(lp2\nI1\naI2\naI3\naI4\naI5\natp3\n."
The two pickle strings differ in their use of the p opcode. The opcode takes one integer argument and its function is as follows:
name='PUT' code='p' arg=decimalnl_short
Store the stack top into the memo. The stack is not popped.
The index of the memo location to write into is given by the newline-
terminated decimal string following. BINPUT and LONG_BINPUT are
space-optimized versions.
To cut a long story short, the two pickle strings are basically equivalent.
I haven't tried to nail down the exact cause of the differences in generated opcodes. This could well have to do with reference counts of the objects being serialized. What is clear, however, that discrepancies like this will have no effect on the reconstructed object.
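If you want to inspect the opcodes yourself, the stdlib pickletools module will disassemble a pickle string (a sketch; shown with the pickle module, but it works the same way on cPickle output in Python 2):
import pickle, pickletools
t = ({1: 1, 2: 4}, 'Hello World', (1, 2, 3), [1, 2, 3])
pickletools.dis(pickle.dumps(t, protocol=0))   # prints one line per opcode, including the PUT memo operations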
It is looking at the reference counts, from the cPickle source:
if (Py_REFCNT(args) > 1) {
    if (!( py_ob_id = PyLong_FromVoidPtr(args)))
        goto finally;
    if (PyDict_GetItem(self->memo, py_ob_id)) {
        if (get(self, py_ob_id) < 0)
            goto finally;
        res = 0;
        goto finally;
    }
}
The pickle protocol has to deal with pickling multiple references to the same object. In order to prevent duplicating the object when depickled it uses a memo. The memo basically maps indexes to the various objects. The PUT (p) opcode in the pickle stores the current object in this memo dictionary.
However, if there is only a single reference to an object, it is impossible to need to reference it again, so there is no reason to store it in the memo. Thus the cPickle code checks the reference count as a small optimization at this point.
So yes, it's the reference counts. But that's not a problem. The unpickled objects will have the correct reference counts; it just produces a slightly shorter pickle when the reference count is 1.
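A small sketch of why the memo matters: it is what lets shared references survive a round trip, and it keeps the pickle shorter when an object appears more than once:
import pickle
x = [1, 2, 3]
shared = pickle.dumps((x, x))                     # second occurrence becomes a memo reference
separate = pickle.dumps(([1, 2, 3], [1, 2, 3]))   # two independent lists, each pickled in full
print(len(shared) < len(separate))                # True
a, b = pickle.loads(shared)
print(a is b)                                     # True: the shared reference was preserved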
Now, I don't know what you are doing that makes you care about this. But you really shouldn't assume that pickling the same object will always give you the same result. If nothing else, I'd expect dictionaries to give you problems, because the order of the keys is undefined. Unless you have Python documentation that guarantees the pickle is the same each time, I highly recommend you don't depend on it.