I understand that sets in Python are unordered, but I'm curious about the 'order' they're displayed in, as it seems to be consistent. They seem to be out-of-order in the same way every time:
>>> set_1 = set([5, 2, 7, 2, 1, 88])
>>> set_2 = set([5, 2, 7, 2, 1, 88])
>>> set_1
set([88, 1, 2, 5, 7])
>>> set_2
set([88, 1, 2, 5, 7])
...and another example:
>>> set_3 = set('abracadabra')
>>> set_4 = set('abracadabra')
>>> set_3
set(['a', 'r', 'b', 'c', 'd'])
>>> set_4
set(['a', 'r', 'b', 'c', 'd'])
I'm just curious why this would be. Any help?
You should watch this video (although it is CPython¹-specific and about dictionaries -- but I assume it applies to sets as well).
Basically, Python hashes the elements and takes the last N bits (where N is determined by the size of the set) and uses those bits as array indices to place the object in memory. The objects are then yielded in the order they exist in memory. Of course, the picture gets a little more complicated when you need to resolve collisions between hashes, but that's the gist of it.
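A quick way to see this for the question's numbers (small integers hash to themselves, and the exact table size is an implementation detail, but 8 slots is a reasonable guess for a 5-element set):
>>> [hash(x) & 0b111 for x in (5, 2, 7, 1, 88)]  # last 3 bits = slot in a table of 8
[5, 2, 7, 1, 0]
88 lands in slot 0, so it comes out first: 88, 1, 2, 5, 7.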
Also note that, when there are collisions, the order in which the elements are printed depends on the order in which you inserted them. So, if you reorder the list you pass to set_2, you might get a different order out if there are key collisions.
For example:
list1 = [8,16,24]
set(list1) #set([8, 16, 24])
list2 = [24,16,8]
set(list2) #set([24, 16, 8])
Note that the fact that the order is preserved in these sets is a "coincidence" of collision resolution (the details of which I don't know). The point is that the last 3 bits of hash(8), hash(16) and hash(24) are the same. Because they are the same, collision resolution takes over and puts the elements in "backup" memory locations instead of the first (best) choice, and so whether 8 or 16 occupies a given location is determined by which one arrived at the party first and took the "best seat".
If we repeat the example with 1, 2 and 3, you will get a consistent order no matter what order they have in the input list:
list1 = [1,2,3]
set(list1) # set([1, 2, 3])
list2 = [3,2,1]
set(list2) # set([1, 2, 3])
since the last 3 bits of hash(1), hash(2) and hash(3) are unique.
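You can verify both claims about the low bits directly, since small integers hash to themselves:
>>> [hash(x) & 0b111 for x in (8, 16, 24)]  # all three collide on slot 0
[0, 0, 0]
>>> [hash(x) & 0b111 for x in (1, 2, 3)]    # three distinct slots, no collision
[1, 2, 3]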
¹ Note: the implementation described here applies to the CPython dict and set. I think that the general description is valid for all modern versions of CPython up to 3.6. However, starting with CPython 3.6, there is an additional implementation detail that actually preserves insertion order for iteration over a dict. It appears that sets still do not have this property. The data structure is described in this blog post by the PyPy folks (who started using it before the CPython folks did). The original idea (at least for the Python ecosystem) is archived on the python-dev mailing list.
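A quick way to see the difference on CPython 3.7+ (where dict insertion order became a language guarantee, while set order remains a hash-table accident; the set output below matches the question's but may vary by version):
>>> dict.fromkeys([5, 2, 7, 1, 88])  # dict: insertion order preserved
{5: None, 2: None, 7: None, 1: None, 88: None}
>>> set([5, 2, 7, 1, 88])            # set: hash order
{88, 1, 2, 5, 7}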
The reason for this behavior is that Python uses hash tables for its dictionary implementation: https://en.wikipedia.org/wiki/Hash_table#Open_addressing
The position of a key is determined by its memory address. As you may know, Python reuses memory for some objects:
>>> a = 'Hello world'
>>> id(a)
140058096568768
>>> a = 'Hello world'
>>> id(a)
140058096568480
You can see that the object a gets a different address each time it is initialized.
But for small integers the address does not change:
>>> a = 1
>>> id(a)
40060856
>>> a = 1
>>> id(a)
40060856
Even if we create a second object with a different name, the address is the same:
>>> b = 1
>>> id(b)
40060856
This approach saves memory that the Python interpreter would otherwise consume.
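Note that this reuse only covers a fixed range of small integers (CPython caches roughly -5 through 256 as an implementation detail, not a language guarantee); outside that range you get fresh objects:
>>> a = 256
>>> b = 256
>>> a is b  # cached small int: same object
True
>>> a = 257
>>> b = 257
>>> a is b  # beyond the cache: distinct objects (in the REPL)
False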
AFAIK Python sets are implemented using a hash table. The order in which the items appear depends on the hash function used. Within the same run of the program, the hash function probably does not change, hence you get the same order.
But there are no guarantees that it will always use the same function, and the order will change across runs - or within the same run if you insert a lot of elements and the hash table has to resize.
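You can see the across-run variation with strings, whose hashes are randomized per process on Python 3.3+ (the orders shown are illustrative; yours will differ):
$ python3 -c "print(set('abracadabra'))"
{'d', 'c', 'a', 'b', 'r'}
$ python3 -c "print(set('abracadabra'))"
{'b', 'a', 'r', 'd', 'c'}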
One key thing that's hinted at in mgilson's great answer, but isn't mentioned explicitly in any of the existing answers:
Small integers hash to themselves:
>>> [hash(x) for x in (1, 2, 3, 88)]
[1, 2, 3, 88]
Strings hash to values that are unpredictable. In fact, from 3.3 on, by default, they're built off a seed that's randomized at startup. So you'll get different results for each new Python interpreter session, but within one session they're consistent:
>>> [hash(x) for x in 'abcz']
[6014072853767888837,
8680706751544317651,
-7529624133683586553,
-1982255696180680242]
So, consider the simplest possible hash table implementation: just an array of N elements, where inserting a value means putting it in hash(value) % N (assuming no collisions). And you can make a rough guess at how large N will be—it's going to be a little bigger than the number of elements in it. When creating a set from a sequence of 6 elements, N could easily be, say, 8.
What happens when you store those 5 numbers with N=8? Well, hash(1) % 8, hash(2) % 8, etc. are just the numbers themselves, but hash(88) % 8 is 0. So, the hash table's array ends up holding 88, 1, 2, NULL, NULL, 5, NULL, 7. So it should be easy to figure out why iterating the set might give you 88, 1, 2, 5, 7.
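Here is that guess written out as code, a deliberately naive sketch that ignores collisions and CPython's real probing and resizing:
def toy_set_order(values, n=8):
    # place each value at its "home" slot, hash(value) % n
    slots = [None] * n
    for v in values:
        slots[hash(v) % n] = v
    # iteration yields whatever order the slots happen to hold
    return [v for v in slots if v is not None]

print(toy_set_order([5, 2, 7, 2, 1, 88]))  # [88, 1, 2, 5, 7]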
Of course Python doesn't guarantee that you'll get this order every time. A small change to the way it guesses at the right value for N could mean 88 ends up somewhere different (or ends up colliding with one of the other values). And, in fact, running CPython 3.7 on my Mac, I get 1, 2, 5, 7, 88.
Meanwhile, when you build a hash table from a sequence of size 11 and then insert randomized hashes into it, what happens? Even assuming the simplest implementation, and assuming there are no collisions, you still have no idea what order you're going to get. It will be consistent within a single run of the Python interpreter, but different the next time you start it up. (Unless you set PYTHONHASHSEED to 0, or to some other int value.) Which is exactly what you see.
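For example, pinning the seed makes string-keyed set order reproducible from run to run (though which particular order you get still depends on the Python version):
$ PYTHONHASHSEED=0 python3 -c "print(set('abracadabra'))"
$ PYTHONHASHSEED=0 python3 -c "print(set('abracadabra'))"
Both invocations now print the same ordering.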
Of course it's worth looking at the way sets are actually implemented rather than guessing. But what you'd guess based on the assumption of the simplest hash table implementation is (barring collisions and barring expansion of the hash table) exactly what happens.
Sets are based on a hash table. The hash of a value should be consistent, so the order will be too - unless two elements hash to the same code, in which case the order of insertion will change the output order.
I am comparing two lists of integers and am trying to access the lowest value without using a for-loop as the lists are quite large. I have tried using set comparison, yet I receive an empty set when doing so. Currently my approach is:
differenceOfIpLists = list(set(reservedArray).difference(set(ipChoicesArray)))
I have also tried:
differenceOfIpLists = list(set(reservedArray) - set(ipChoicesArray))
And the lists are defined as such:
reservedArray = [169017344, 169017345, 169017346, 169017347, 169017348, 169017349, 169017350, 169017351, 169017352, 169017353, 169017354, 169017355, 169017356, 169017357, 169017358, 169017359, 169017360, 169017361, 169017362, 169017363, 169017364, 169017365, 169017366, 169017367, 169017368, 169017369, 169017600, 169017601, 169017602, 169017603, 169017604, 169017605, 169017606, 169017607, 169017608, 169017609, 169017610, 169017611, 169017612, 169017613, 169017614, 169017615, 169017616, 169017617, 169017618, 169017619...]
ipChoicesArray = [169017344, 169017345, 169017346, 169017347, 169017348, 169017349, 169017350, 169017351, 169017352, 169017353, 169017354, 169017355, 169017356, 169017357, 169017358, 169017359, 169017360, 169017361, 169017362, 169017363, 169017364, 169017365, 169017366, 169017367, 169017368, 169017369, 169017370, 169017371, 169017372, 169017373, 169017374, 169017375, 169017376, 169017377, 169017378, 169017379, 169017380, 169017381, 169017382...]
Portions of these lists are the same, yet the lists differ substantially, as the lengths show:
reservedArrayLength = 6658
ipChoicesArrayLength = 65536
I have also tried converting these values to strings and doing the same style of comparison, also to no avail.
Once I am able to extract a list of the elements in the ipChoicesArray that are not in the reservedArray, I will return the smallest element after sorting.
I do not believe that I am facing a max length issue...
Subtracting the sets should work as you desire, see below:
ipChoicesArray = [1,3,4,7,1]
reservedArray = [1,2,5,7,8,2,1]
min(list(set(ipChoicesArray) - set(reservedArray)))
###Output###
3
By the way, the maximum list length is 536,870,912 elements (on a 32-bit build).
without using a for-loop as the lists are quite large
The presumption that a for-loop is a poor choice because the lists are large is likely incorrect. Creating a set from a list (and vice versa) not only iterates through the containers under the hood anyway (just like a for-loop), it also allocates new containers and takes up more memory. Profile your code before you assume something won't perform well.
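For instance, a rough benchmark along these lines (sizes chosen to mirror the question; absolute numbers will vary by machine) would settle it:
import timeit

setup = "reserved = set(range(6658)); choices = list(range(65536))"
# builds a whole difference set, then scans it for the minimum
print(timeit.timeit("min(set(choices) - reserved)", setup=setup, number=100))
# a plain loop can stop at the first non-reserved element
print(timeit.timeit("next(c for c in choices if c not in reserved)", setup=setup, number=100))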
That aside, in your code it seems the reason you are getting an empty result is because your difference is inverted. To get the elements in ipChoicesArray but not in reservedArray you want to difference the latter from the former:
diff = set(ipChoicesArray) - set(reservedArray)
The obvious solution (you just did the set difference in the wrong direction):
print(min(set(ipChoicesArray) - set(reservedArray)))
You said they're always sorted, and your reverse difference being empty (and thinking about what you're doing) suggests that the "choices" are a superset of the "reserved", so then this also works and could be faster:
print(next(c for c, r in zip(ipChoicesArray, reservedArray) if c != r))
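One caveat with the zip one-liner: if every reserved address matches its corresponding choice, the generator is exhausted and next raises StopIteration. Since the choices list is longer than the reserved list here, you can supply the first element past the reserved prefix as a default:
print(next(
    (c for c, r in zip(ipChoicesArray, reservedArray) if c != r),
    ipChoicesArray[len(reservedArray)]  # first choice beyond the reserved prefix
))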
Disclaimer: the Python docs state that
A set is an unordered collection with no duplicate elements.
But I can see that the output of an unordered set appears ordered:
s = {'z', 1, 0, 'a'}
s #=> {0, 1, 'a', 'z'}
next(iter(s)) #=> 0
So, I don't know if this approach is reliable. Maybe some other user can deny or confirm this with an appropriate reference to the set behaviour.
Having said this...
Don't know if I'm getting the point, but..
Not knowing where the smallest value is, you could use this approach (here using smaller values and shorter list):
a = [2, 5, 5, 1, 6, 7, 8, 9]
b = [2, 3, 4, 5, 6, 6, 1]
set_a = set(a)
set_b = set(b)
Find the smallest of the union:
union = set_a | set_b
next(iter(union))
#=> 1
Or just:
min([next(iter(set_a)), next(iter(set_b))])
#=> 1
Or, maybe this fits your question better:
next(iter(set_a-set_b)) #=> 8
This is somewhat of a broad topic, but I will try to pare it to some specific questions.
I was thinking about a certain ~meta~ property in Python where the console representations of many basic datatypes are equivalent to the code used to construct those objects:
l = [1,2,3]
d = {'a':1,'b':2,'c':3}
s = {1,2,3}
t = (1,2,3)
g = "123"
###
>>> l
[1, 2, 3]
>>> d
{'a': 1, 'b': 2, 'c': 3}
>>> s
{1, 2, 3}
>>> t
(1, 2, 3)
>>> g
'123'
So for any of these objects, I could copy the console output into the code to create those structures or assign them to variables.
This doesn't apply to some objects, like functions:
def foo():
    pass
f = foo
L = [1,2,3, foo]
###
>>> f
<function foo at 0x00000235950347B8>
>>> L
[1, 2, 3, <function foo at 0x00000235950347B8>]
While the list l above had this property, the list L here does not; but this seems to be only because L contains an element which doesn't hold this property. So it seems to me that, generally, list has this property in some way.
This applies to some objects in non-standard libraries as well:
import numpy as np
a = np.array([1,2,3])
import pandas as pd
dr = pd.date_range('01-01-2020','01-02-2020', freq='3H')
###
>>> a
array([1, 2, 3])
>>> dr
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 03:00:00',
'2020-01-01 06:00:00', '2020-01-01 09:00:00',
'2020-01-01 12:00:00', '2020-01-01 15:00:00',
'2020-01-01 18:00:00', '2020-01-01 21:00:00',
'2020-01-02 00:00:00'],
dtype='datetime64[ns]', freq='3H')
For the numpy array, the console output matches the code used, provided you have array in the namespace. For the pandas.date_range, it's a little bit different because the console output can construct the same object produced by dr = pd.date_range('01-01-2020','01-02-2020', freq='3H'), but with different code.
Interesting Example
A DataFrame doesn't hold this property; however, using the to_dict() method converts it into a structure which does:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6]})
###
>>> df
A B
0 1 4
1 2 5
2 3 6
>>> df.to_dict()
{'A': {0: 1, 1: 2, 2: 3}, 'B': {0: 4, 1: 5, 2: 6}}
>>> pd.DataFrame.from_dict({'A': {0: 1, 1: 2, 2: 3}, 'B': {0: 4, 1: 5, 2: 6}})
A B
0 1 4
1 2 5
2 3 6
An example scenario where this is useful is... posting on SO! Because you can convert your DataFrame into a data structure whose text representation can be used to reconstruct it. So if you share the to_dict() version of your DataFrame with someone, they get Python-syntaxed code which can be used to recreate the structure. I have found this to be advantageous over pd.read_clipboard() in some situations.
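You can also verify the round trip with pandas' own equality check:
>>> pd.DataFrame.from_dict(df.to_dict()).equals(df)
True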
My questions based on the above:
Mainly:
Is there a name for this "property" (given it is a real intentional "property" of objects in Python?)
Additionally (these are less concretely answerable, I recognize, and can remove if off-topic):
Is it something unique to Python or does it hold true in other languages?
Are there other basic Python structures for which this property holds or doesn't hold?
I apologize if this is common knowledge to people, or if I am making a mountain out of a molehill here!
What the console representation of an object is, depends on the way its __repr__() method is written. So I think most of us would at least understand if you talked about this "property" as the repr of the object. The method has to return a string but the string's contents are up to the author, so it's impossible to say in general whether the repr of an object is the same as the code needed to create it. In some cases (such as functions) the code might be too long to be useful. In others (such as recursive structures) there might be no reasonable linear representation.
Reposted as an answer instead of a comment in response to suggestions by participants in the comment thread.
I happened to come across some information related to this (a year and a half later). An interesting passage in this article asserts that:
The default objective of __repr__ is to have a string representation of the object from which object can be formed again using Python’s eval such that below holds true: object = eval(repr(object))
object == eval(repr(object)) (!!!)
In Python code this neatly states the concept I was grasping at. An object can be constructed from its console representation.
From Googling that Python phrase, I came across more Stack resources, including the canonical post on the difference between __str__ and __repr__. But particularly relevant here was this answer which highlights how this concept is discussed in the Python documentation for __repr__. Particularly, there is the recommendation that:
If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment).
Furthermore, the __repr__ of an object is meant to be clear and unambiguous such that if you inspect the object, you know exactly what is. And having object == eval(repr(object)) is one way of achieving this.
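As a minimal sketch of what the docs are asking for (Point is a made-up class, not from any library):
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __repr__(self):
        # a repr that is itself valid code to rebuild the object
        return f"Point({self.x!r}, {self.y!r})"
    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

p = Point(1, 2)
print(p == eval(repr(p)))  # True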
So in regard to my initial questions:
Is there a name for this "property"? Not really, but object == eval(repr(object)) is a succinct way of stating it. And the Python docs have their own way of stating it.
Is it something unique to Python or does it hold true in other languages? I don't really know if it is unique, but it certainly is an encouraged part of Python! But its main intention is for developing/unambiguity, rather than sharing/reproducing code.
Recently I have been learning more about hashing in Python and I came around this blog where it states that:
Suppose a Python program has 2 lists. If we need to think about
comparing those two lists, what will you do? Compare each element?
Well, that sounds fool-proof but slow as well!
Python has a much smarter way to do this. When a tuple is constructed
in a program, Python Interpreter calculates its hash in its memory. If
a comparison occurs between 2 tuples, it simply compares the hash
values and it knows if they are equal!
So I am really confused about these statements.
First when we do:
[1, 2, 3] == [1, 2, 3], then how does this equality work? Does it calculate the hash value and then compare it?
Second what's the difference when we do:
[1, 2, 3] == [1, 2, 3] and (1, 2, 3) == (1, 2, 3) ?
Because when I tried to measure the execution time with timeit, I got this result:
$ python3.5 -m timeit '[1, 2, 3] == [1, 2, 3]'
10000000 loops, best of 3: 0.14 usec per loop
$ python3.5 -m timeit '(1, 2, 3) == (1, 2, 3)'
10000000 loops, best of 3: 0.0301 usec per loop
So why is there a difference in time, from 0.14 µs for the list down to 0.03 µs for the tuple? Why is the tuple comparison faster?
Well, part of your confusion is that the blog post you're reading is just wrong. About multiple things. Try to forget that you ever read it (except to remember the site and the author's name so you'll know to avoid them in the future).
It is true that tuples are hashable and lists are not, but that isn't relevant to their equality-testing functions. And it's certainly not true that "it simply compares the hash values and it knows if they are equal!" Hash collisions happen, and ignoring them would lead to horrible bugs, and fortunately Python's developers are not that stupid. In fact, it's not even true that Python computes the hash value at initialization time.*
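You don't even have to try hard to find a collision. In CPython, -1 is reserved as an error code in the C-level hash protocol, so hash(-1) is quietly remapped to -2:
>>> hash(-1), hash(-2)
(-2, -2)
>>> -1 == -2
False
If equality really just compared hashes, -1 and -2 would be "equal".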
There actually is one significant difference between tuples and lists (in CPython, as of 3.6), but it usually doesn't make much difference: Lists do an extra check for unequal length at the beginning as an optimization, but the same check turned out to be a pessimization for tuples,** so it was removed from there.
Another, often much more important, difference is that tuple literals in your source get compiled into constant values, and separate copies of the same tuple literal get folded into the same constant object; that doesn't happen with lists, for obvious reasons.
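You can watch the folding happen with dis (output from one CPython version, abridged; bytecode details vary):
>>> import dis
>>> dis.dis("(1, 2, 3) == (1, 2, 3)")
  1           0 LOAD_CONST               0 ((1, 2, 3))
              2 LOAD_CONST               0 ((1, 2, 3))
              4 COMPARE_OP               2 (==)
              6 RETURN_VALUE
Both literals load the same constant object. Disassembling the list version instead shows two BUILD_LIST instructions, so each comparison pays to construct both lists first.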
In fact, that's what you're really testing with your timeit. On my laptop, comparing the tuples takes 95ns, while comparing the lists takes 169ns—but breaking it down, that's actually 93ns for the comparison, plus an extra 38ns to create each list. To make it a fair comparison, you have to move the creation to a setup step, and then compare already-existing values inside the loop. (Or, of course, you may not want to be fair—you're discovering the useful fact that every time you use a tuple constant instead of creating a new list, you're saving a significant fraction of a microsecond.)
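To reproduce the fair version yourself, hoist the creation into timeit's setup (-s); the numbers are machine-dependent, but the gap between the two comparisons should mostly disappear:
$ python3 -m timeit -s 'a = [1, 2, 3]; b = [1, 2, 3]' 'a == b'
$ python3 -m timeit -s 'a = (1, 2, 3); b = (1, 2, 3)' 'a == b'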
Other than that, they basically do the same thing. Translating the C source into Python-like pseudocode (and removing all the error handling, and the stuff that makes the same function work for <, and so on):
for i in range(min(len(v), len(w))):
    if v[i] != w[i]:
        break
else:
    return len(v) == len(w)
return False
The list equivalent is like this:
if len(v) != len(w):
    return False
for i in range(min(len(v), len(w))):
    if v[i] != w[i]:
        break
else:
    return True
return False
* In fact, unlike strings, tuples don't even cache their hashes; if you call hash over and over, it'll keep re-computing it. See issue 9685, where a patch to change that was rejected because it slowed down some benchmarks and didn't speed up anything anyone could find.
** Not because of anything inherent about the implementation, but because people often compare lists of different lengths, but rarely do with tuples.
The answer was given in that article too :)
Here is a demonstration:
>>> l1=[1,2,3]
>>> l2=[1,2,3]
>>>
>>> hash(l1) #since list is not hashable
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> t=(1,2,3)
>>> t2=(1,2,3)
>>> hash(t)
2528502973977326415
>>> hash(t2)
2528502973977326415
>>>
In the above, calling hash on a list raises a TypeError because lists are not hashable, so to test the equality of two lists Python has to check the values inside, which takes much more time.
In the case of tuples, Python calculates the hash value, and since two equal tuples have the same hash value, Python (according to the article) only compares the hash values, which is much faster than the list comparison.
From the given article:
Python has a much smarter way to do this. When a tuple is constructed
in a program, Python Interpreter calculates its hash in its memory. If
a comparison occurs between 2 tuples, it simply compares the hash
values and it knows if they are equal!