Accessing the lowest value when comparing two python lists - python

I am comparing two lists of integers and am trying to access the lowest value without using a for-loop as the lists are quite large. I have tried using set comparison, yet I receive an empty set when doing so. Currently my approach is:
differenceOfIpLists = list(set(reservedArray).difference(set(ipChoicesArray)))
I have also tried:
differenceOfIpLists = list(set(reservedArray) - set(ipChoicesArray))
And the lists are defined as such:
reservedArray = [169017344, 169017345, 169017346, 169017347, 169017348, 169017349, 169017350, 169017351, 169017352, 169017353, 169017354, 169017355, 169017356, 169017357, 169017358, 169017359, 169017360, 169017361, 169017362, 169017363, 169017364, 169017365, 169017366, 169017367, 169017368, 169017369, 169017600, 169017601, 169017602, 169017603, 169017604, 169017605, 169017606, 169017607, 169017608, 169017609, 169017610, 169017611, 169017612, 169017613, 169017614, 169017615, 169017616, 169017617, 169017618, 169017619...]
ipChoicesArray = [169017344, 169017345, 169017346, 169017347, 169017348, 169017349, 169017350, 169017351, 169017352, 169017353, 169017354, 169017355, 169017356, 169017357, 169017358, 169017359, 169017360, 169017361, 169017362, 169017363, 169017364, 169017365, 169017366, 169017367, 169017368, 169017369, 169017370, 169017371, 169017372, 169017373, 169017374, 169017375, 169017376, 169017377, 169017378, 169017379, 169017380, 169017381, 169017382...]
Portions of these lists are the same, yet the lists differ substantially overall, as the lengths show:
reservedArrayLength = 6658
ipChoicesArrayLength = 65536
I have also tried converting these values to strings and doing the same style of comparison, also to no avail.
Once I am able to extract a list of the elements in the ipChoicesArray that are not in the reservedArray, I will return the smallest element after sorting.
I do not believe that I am facing a max length issue...

Subtracting the sets should work as you desire, see below:
ipChoicesArray = [1, 3, 4, 7, 1]
reservedArray = [1, 2, 5, 7, 8, 2, 1]
min(set(ipChoicesArray) - set(reservedArray))
###Output###
3
By the way, the maximum list length on a 32-bit build is 536,870,912 elements (far larger on 64-bit builds), so you are nowhere near a length limit.

without using a for-loop as the lists are quite large
The presumption that a for-loop is a poor choice because the list is large is likely incorrect. Creating a set from a list (and vice versa) iterates through the container under the hood anyway, just like a for-loop, while also allocating new containers and using more memory. Profile your code before you assume something won't perform well.
That aside, in your code it seems the reason you are getting an empty result is because your difference is inverted. To get the elements in ipChoicesArray but not in reservedArray you want to difference the latter from the former:
diff = set(ipChoicesArray) - set(reservedArray)
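A toy reproduction of the inversion, with illustrative data (the variable names mirror the question's):

```python
reservedArray = [1, 2, 3, 4]
ipChoicesArray = [1, 2, 3, 4, 5, 6]

# Inverted direction: everything reserved is also a choice,
# so the difference is empty -- the symptom in the question.
print(set(reservedArray) - set(ipChoicesArray))  # set()

# Correct direction: choices that are not reserved.
print(set(ipChoicesArray) - set(reservedArray))  # {5, 6}
```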

The obvious solution (you just did the set difference in the wrong direction):
print(min(set(ipChoicesArray) - set(reservedArray)))
You said they're always sorted, and your reverse difference being empty (and thinking about what you're doing) suggests that the "choices" are a superset of the "reserved", so then this also works and could be faster:
print(next(c for c, r in zip(ipChoicesArray, reservedArray) if c != r))
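A quick check of the zip-based idea on toy data (this assumes, as above, that both lists are sorted and that the choices are a superset of the reserved values, with at least one unreserved choice):

```python
# Toy data: both lists sorted, choices a superset of reserved.
reserved = [10, 11, 14, 15]
choices = [10, 11, 12, 13, 14, 15]

# The first position where the two sorted lists disagree is the
# smallest choice that is not reserved -- no sets, no full scan.
print(next(c for c, r in zip(choices, reserved) if c != r))  # 12
```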

Disclaimer: the Python docs state that
A set is an unordered collection with no duplicate elements.
But I can see that the output of an unordered set looks ordered:
s = {'z', 1, 0, 'a'}
s #=> {0, 1, 'a', 'z'}
next(iter(s)) #=> 0
So I don't know whether this approach is reliable. Maybe some other user can confirm or deny this with an appropriate reference to set behaviour.
Having said this...
I don't know if I'm getting the point, but...
Not knowing where the smallest value is, you could use this approach (here with smaller values and shorter lists):
a = [2, 5, 5, 1, 6, 7, 8, 9]
b = [2, 3, 4, 5, 6, 6, 1]
set_a = set(a)
set_b = set(b)
Find the smallest of the union:
union = set_a | set_b
next(iter(union))
#=> 1
Or just:
min([next(iter(set_a)), next(iter(set_b))])
#=> 1
Or, maybe this fits your question better:
next(iter(set_a - set_b)) #=> 8

Related

How does Python low-level order sets? [duplicate]

I understand that sets in Python are unordered, but I'm curious about the 'order' they're displayed in, as it seems to be consistent. They seem to be out-of-order in the same way every time:
>>> set_1 = set([5, 2, 7, 2, 1, 88])
>>> set_2 = set([5, 2, 7, 2, 1, 88])
>>> set_1
set([88, 1, 2, 5, 7])
>>> set_2
set([88, 1, 2, 5, 7])
...and another example:
>>> set_3 = set('abracadabra')
>>> set_4 = set('abracadabra')
>>> set_3
set(['a', 'r', 'b', 'c', 'd'])
>>> set_4
set(['a', 'r', 'b', 'c', 'd'])
I'm just curious why this would be. Any help?
You should watch this video (although it is CPython[1]-specific and about dictionaries -- but I assume it applies to sets as well).
Basically, python hashes the elements and takes the last N bits (where N is determined by the size of the set) and uses those bits as array indices to place the object in memory. The objects are then yielded in the order they exist in memory. Of course, the picture gets a little more complicated when you need to resolve collisions between hashes, but that's the gist of it.
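The placement rule described above can be sketched with a toy table (N=3 bits, i.e. 8 slots). This is only an illustration: real CPython sets also handle collisions and resizing, which this ignores:

```python
def slot(value, table_size=8):
    """Slot index: the last 3 bits of the hash (for a table of 8 slots)."""
    return hash(value) % table_size  # % 8 keeps the last 3 bits

# Place the question's values; this particular data has no collisions.
table = [None] * 8
for v in (5, 2, 7, 1, 88):
    table[slot(v)] = v

# Iterating in memory order then yields 88, 1, 2, 5, 7.
print(table)  # [88, 1, 2, None, None, 5, None, 7]
```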
Also note that the order that they are printed out is determined by the order that you put them in (due to collisions). So, if you reorder the list you pass to set_2, you might get a different order out if there are key collisions.
For example:
list1 = [8,16,24]
set(list1) #set([8, 16, 24])
list2 = [24,16,8]
set(list2) #set([24, 16, 8])
Note the fact that the order is preserved in these sets is "coincidence" and has to do with collision resolution (which I don't know anything about). The point is that the last 3 bits of hash(8), hash(16) and hash(24) are the same. Because they are the same, collision resolution takes over and puts the elements in "backup" memory locations instead of the first (best) choice and so whether 8 occupies a location or 16 is determined by which one arrived at the party first and took the "best seat".
If we repeat the example with 1, 2 and 3, you will get a consistent order no matter what order they have in the input list:
list1 = [1,2,3]
set(list1) # set([1, 2, 3])
list2 = [3,2,1]
set(list2) # set([1, 2, 3])
since the last 3 bits of hash(1), hash(2) and hash(3) are unique.
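The collision claim can be checked directly in CPython, where small integers hash to themselves (an implementation detail, not a language guarantee):

```python
# The low 3 bits (the bucket index for a table of 8 slots) collide
# for 8, 16 and 24, but are distinct for 1, 2 and 3.
print([hash(n) % 8 for n in (8, 16, 24)])  # [0, 0, 0] -> all collide
print([hash(n) % 8 for n in (1, 2, 3)])    # [1, 2, 3] -> distinct buckets
```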
[1] Note: the implementation described here applies to CPython dict and set. I think the general description is valid for all versions of CPython up to 3.6. However, starting with CPython 3.6, there is an additional implementation detail that preserves insertion order for iteration over dict. It appears that sets still do not have this property. The data structure is described in this blog post by the PyPy folks (who started using it before the CPython folks did). The original idea (at least for the Python ecosystem) is archived on the python-dev mailing list.
The reason for this behavior is that Python uses hash tables for its dictionary implementation: https://en.wikipedia.org/wiki/Hash_table#Open_addressing
The position of a key in the table is determined by its hash. Note also that Python reuses memory for some objects:
>>> a = 'Hello world'
>>> id(a)
140058096568768
>>> a = 'Hello world'
>>> id(a)
140058096568480
You can see that object a has a different address each time it is initialized.
But for small integers it doesn't change:
>>> a = 1
>>> id(a)
40060856
>>> a = 1
>>> id(a)
40060856
Even if we create a second object with a different name, the address is the same:
>>> b = 1
>>> id(b)
40060856
This approach allows the Python interpreter to save memory.
AFAIK Python sets are implemented using a hash table. The order in which the items appear depends on the hash function used. Within the same run of the program, the hash function probably does not change, hence you get the same order.
But there are no guarantees that it will always use the same function, and the order will change across runs - or within the same run if you insert a lot of elements and the hash table has to resize.
One key thing that's hinted at in mgilson's great answer, but isn't mentioned explicitly in any of the existing answers:
Small integers hash to themselves:
>>> [hash(x) for x in (1, 2, 3, 88)]
[1, 2, 3, 88]
Strings hash to values that are unpredictable. In fact, from 3.3 on, by default, they're built off a seed that's randomized at startup. So, you'll get different results for each new Python interpreter session, but:
>>> [hash(x) for x in 'abcz']
[6014072853767888837,
8680706751544317651,
-7529624133683586553,
-1982255696180680242]
So, consider the simplest possible hash table implementation: just an array of N elements, where inserting a value means putting it in hash(value) % N (assuming no collisions). And you can make a rough guess at how large N will be—it's going to be a little bigger than the number of elements in it. When creating a set from a sequence of 6 elements, N could easily be, say, 8.
What happens when you store those 5 numbers with N=8? Well, hash(1) % 8, hash(2) % 8, etc. are just the numbers themselves, but hash(88) % 8 is 0. So, the hash table's array ends up holding 88, 1, 2, NULL, NULL, 5, NULL, 7. So it should be easy to figure out why iterating the set might give you 88, 1, 2, 5, 7.
Of course Python doesn't guarantee that you'll get this order every time. A small change to the way it guesses at the right value for N could mean 88 ends up somewhere different (or ends up colliding with one of the other values). And, in fact, running CPython 3.7 on my Mac, I get 1, 2, 5, 7, 88.
Meanwhile, when you build a set from a sequence of 11 characters and insert their randomized hashes into the table, what happens? Even assuming the simplest implementation, and assuming there are no collisions, you still have no idea what order you're going to get. It will be consistent within a single run of the Python interpreter, but different the next time you start it up. (Unless you set PYTHONHASHSEED to 0, or to some other int value.) Which is exactly what you see.
Of course it's worth looking at the way sets are actually implemented rather than guessing. But what you'd guess based on the assumption of the simplest hash table implementation is (barring collisions and barring expansion of the hash table) exactly what happens.
Sets are based on a hash table. The hash of a value should be consistent so the order will be also - unless two elements hash to the same code, in which case the order of insertion will change the output order.

How to get common elements in a deep nested list: my two solutions work but take some time

I have a nested list structure as below. Each of the 4 nested structures represents some free positions for me. I want to find which elements are present in all 4 nested lists.
ary=[ [[0, 4], [5, 11]], [[0, 2], [0, 4], [5,10]], [[0, 4], [0, 14], [5,11]], [[0, 4], [0, 14], [5,11]] ]
As in above, in the first nested list [[0, 4], [5, 11]], the [0,4] is present in all but [5,11] is not. Hence, my answer should be [[0,4]] or even just [0,4].
I did this in two ways.
Solution1:
ary1=ary
newlist = [item for items in ary for item in items]
x=[i for i in ary[0] if newlist.count(i)== len(ary1)]
#OUTPUT is x= [[0,4]]
Solution2:
x = []
for u in ary[0]:
    n = [1 for t in range(1, len(ary)) if u in ary[t]]
    if len(ary) - 1 == len(n):
        x.append(u)
#OUTPUT is x= [[0,4]]
These two seem to take similar computational time when checked with a line profiler. This is the only point of heavy computation in my hundreds of lines of code, and I want to reduce it. So, can you suggest any other Python commands/code that can do the task better than these two solutions?
You can try to convert each nested array at the second level into the set of tuples, where each lowest level array (i.e. [0,4]) is an element of the set.
The conversion into tuples is required because lists are not hashable.
Once you have each nested list of lists as a set, simply find their intersection.
set.intersection(*[set(tuple(elem) for elem in sublist) for sublist in ary])
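Applied to the ary from the question, the one-liner yields the common pair as a set of tuples, which can be converted back to lists if needed:

```python
ary = [[[0, 4], [5, 11]],
       [[0, 2], [0, 4], [5, 10]],
       [[0, 4], [0, 14], [5, 11]],
       [[0, 4], [0, 14], [5, 11]]]

# Tuples are hashable, so each sublist becomes a set of tuples;
# the intersection keeps only pairs present in every sublist.
common = set.intersection(*[set(tuple(elem) for elem in sublist)
                            for sublist in ary])
print(common)                     # {(0, 4)}
print([list(t) for t in common])  # [[0, 4]]
```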
I'm not suggesting doing it like this; I just came up with another way so you can compare. Maybe it's a lucky shot and needs less computation than your typical approaches.
flatten_list = [tuple(item) for sublist in ary for item in sublist]
max(set(flatten_list), key=flatten_list.count)
This requires that there is always one element present in every nested list, because I don't check that explicitly.
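If that guarantee can't be assumed, the count can be verified explicitly; a sketch building on the counting idea above (assuming no duplicates within a single sublist):

```python
ary = [[[0, 4], [5, 11]],
       [[0, 2], [0, 4], [5, 10]],
       [[0, 4], [0, 14], [5, 11]],
       [[0, 4], [0, 14], [5, 11]]]

flatten_list = [tuple(item) for sublist in ary for item in sublist]
candidate = max(set(flatten_list), key=flatten_list.count)

# Only accept the most frequent pair if it occurs once per sublist.
result = candidate if flatten_list.count(candidate) == len(ary) else None
print(result)  # (0, 4)
```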
Two approaches that I can think of.
Depending on how far you are willing to take this and how big the target list will get, the first may need a fair bit of time and testing; if so, consider the second.
First, a binary-split approach, as opposed to iterating through each of the direct elements of the outer array. It may need itertools.tee() and/or multithreading with a recursive function (depending on the length of the outer list). The recursive function splits the list in half on every iteration, until the splits are small enough to start ruling out uncommon elements (like [5, 11]). Common elements are then passed back up the recursion hierarchy. Designing this carefully should help you set conditions/thresholds that avoid runaway behavior such as excessive thread counts.
Second: it seems that all third-level lists (e.g., [[0, 2], [0, 4], [5, 10]]) are sorted. If not, sort them, eliminate (pop) duplicates, merge them all together using the + operator, and re-sort. After that you end up with a structure containing as many [0, 4]'s as the length of ary. That can be your condition for identifying [0, 4] as the answer. That, again, would need testing.


Is ordered ensured in list iteration in Python?

Let's suppose to have a list of strings, named strings, in Python and to execute this line:
lengths = [ len(value) for value in strings ]
Is the strings list order kept? I mean, can I be sure that lengths[i] corresponds to strings[i]?
I've tried many times and it works, but I'm not sure whether my experiments were special cases or the rule.
Thanks in advance
For lists, yes. That is one of the fundamental properties of lists: that they're ordered.
It should be noted, though, that what you're doing is known as "parallel arrays" (having several "arrays" maintain a linked state), which is often considered poor practice. If you change one list, you must change the other in the same way, or they'll fall out of sync, and then you have real problems.
A dictionary would likely be the better option here:
lengths_dict = {value:len(value) for value in strings}
print(lengths_dict["some_word"]) # Prints its length
Or maybe if you want lookups by index, a list of tuples:
lengths = [(value, len(value)) for value in strings]
word, length = lengths[1]
Yes. Since lists in Python are sequences, you can be sure that each length in the resulting list corresponds to the string at the same index,
as the following code demonstrates:
a = ['a', 'ab', 'abc', 'abcd']
print([len(i) for i in a])
Output
[1, 2, 3, 4]

Python Lists and Yielding

I have the following (correct) solution to Project Euler problem 24. I'm relatively new to Python, and am stumped on a couple of Python points.
First, the code:
# A permutation is an ordered arrangement of objects. For example, 3124 is one possible permutation of the digits 1, 2, 3 and 4.
# If all of the permutations are listed numerically or alphabetically, we call it lexicographic order.
# The lexicographic permutations of 0, 1 and 2 are: 012 021 102 120 201 210
# What is the millionth lexicographic permutation of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9?
permutations = []

def getLexicographicPermutationsOf(digits, state):
    if len(digits) == 0:
        permutations.append(str(state))
    for i in range(len(digits)):
        state.append(digits[i])
        rest = digits[:i] + digits[i+1:]
        getLexicographicPermutationsOf(rest, state)
        state.pop()

digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
getLexicographicPermutationsOf(digits, [])
print(permutations[999999])
My first query is regarding the use of the yield statement. Instead of defining the permutations list at the top, my first design was to replace the permutations.append line with yield state. I would then assign the return value of the method to a variable. I checked, and the return value was a generator, as expected. However, looping over its contents indicated that no values were being generated. Am I missing something here?
My second query is about the final line - printing a value from the list. When I run this, it outputs the value as though it were a list, whereas it should be a string. In fact, replacing print(permutations[999999]) with print(type(permutations[999999])) results in <class 'str'>. So why is it being printed like a list (with square brackets, separated by commas)?
When you recursively call getLexicographicPermutationsOf, you need to yield results from there too.
for result in getLexicographicPermutationsOf(rest, state):
    yield result
permutations.append(str(state)) creates a string representation of state, which is a list. This explains why it looks like a list when printed.
There is a much less computationally intensive way to calculate this. It might actually not be so easy to write a program, but it lets you work out the answer by hand. :) (Hint: how many permutations are there? How many of them start with 0?)
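The hint above can be sketched with the factorial number system: there are 10! permutations of ten digits, 9! of them share each possible first digit, and so on, so the millionth permutation can be located digit by digit without generating the rest. A sketch (function name is illustrative):

```python
from math import factorial

def nth_permutation(digits, n):
    """Return the n-th (0-based) lexicographic permutation of digits."""
    digits = sorted(digits)
    out = []
    while digits:
        block = factorial(len(digits) - 1)  # permutations per leading digit
        i, n = divmod(n, block)             # which leading digit; remainder
        out.append(digits.pop(i))
    return out

# The millionth permutation is at 0-based index 999999.
print(nth_permutation(range(10), 999999))  # [2, 7, 8, 3, 9, 1, 5, 4, 6, 0]
```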
Also, range(len(x)) is highly un-Pythonic. Granted, it would be nice to have the indices in order to slice the list to remove the 'current' element... but there is another way: just ask Python to remove the elements with that value (since there is only one such element). That allows us to loop over element values directly:
for digit in digits:
    state.append(digit)
    rest = digits[:]
    rest.remove(digit)  # a copy with the current value removed
    getLexicographicPermutationsOf(rest, state)
    state.pop()
range is primarily useful for actually creating data ranges - such as you initialize digits with. :)
Am I missing something here?
You're missing that just calling a function recursively won't magically put its results anywhere. In fact, even if you 'yield' the results of a recursive call, it still won't do what you want - you'll end up with a generator that returns generators (that return generators, etc.... down to the base of the recursion) when you want one generator. (FogleBird's answer explains how to deal with this: you must take the generator from the recursive call, and explicitly "feed" its yielded elements into the current generator.)
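On Python 3.3+, the explicit feed-the-elements loop can be written with yield from, which delegates to the inner generator. A minimal generator version of the OP's function along those lines (names are illustrative):

```python
def lex_permutations(digits, state=()):
    """Yield each permutation of digits as a tuple, in lexicographic order."""
    if not digits:
        yield state
    for i in range(len(digits)):
        # Delegate to the recursive generator instead of discarding it.
        yield from lex_permutations(digits[:i] + digits[i + 1:],
                                    state + (digits[i],))

print(list(lex_permutations([0, 1, 2])))
# [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]
```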
But there is a much simpler way anyway: the library already has this algorithm built in.
The entire program can be done thus:
from itertools import permutations, islice
print(next(islice(permutations(range(10)), 999999, None)))
(Note the 999999: islice is zero-based, so the millionth permutation is at index 999999, just like permutations[999999] in your code.)
why is it being printed like a list (with square brackets, separated by commas)?
Because the string contains square brackets and commas. That's what you get when you use str on a list (state, in this case).
