Why does the iteration order of a Python set (with the same contents) vary from run to run, and what are my options for making it consistent from run to run?
I understand that the iteration order for a Python set is arbitrary. If I put 'a', 'b', and 'c' into a set and then iterate them, they may come back out in any order.
What I've observed is that the order remains the same within a run of the program. That is, if my program iterates the same set twice in a row, I get the same order both times. However, if I run the program twice in a row, the order changes from run to run.
Unfortunately, this breaks one of my automated tests, which simply compares the output from two runs of my program. I don't care about the actual order, but I would like it to be consistent from run to run.
The best solution I've come up with is:
Copy the set to a list.
Apply an arbitrary sort to the list.
Iterate the list instead of the set.
Is there a simpler solution?
Note: I've found similar questions on Stack Overflow, but none that address this specific issue of getting the same results from run to run.
The reason the set iteration order changes from run to run appears to be that Python uses hash seed randomization by default (see the -R command-line option). Thus set iteration is not only arbitrary (because of hashing), but also non-deterministic (because of the random seed).
You can override the random seed with a fixed value by setting the environment variable PYTHONHASHSEED for the interpreter. Using the same seed from run to run means set iteration is still arbitrary, but now it is deterministic, which was the desired property.
Hash seed randomization is a security measure to make it difficult for an adversary to feed inputs that will cause pathological behavior (e.g., by creating numerous hash collisions). For unit testing, this is not a concern, so it's reasonable to override the hash seed while running tests.
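As a minimal sketch (assuming python3 is on your PATH), fixing the seed makes the order repeatable across runs, even though it is still arbitrary:

```shell
# Same fixed seed -> same (arbitrary) set iteration order on every run.
PYTHONHASHSEED=0 python3 -c "print(list({'a', 'b', 'c'}))"
PYTHONHASHSEED=0 python3 -c "print(list({'a', 'b', 'c'}))"
```

Any fixed integer value works; 0 is just a convenient choice for test environments.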
Use the symmetric difference operator (^, equivalent to the symmetric_difference method) on your two sets to see whether there are any differences:
In [1]: s1 = set([5,7,8,2,1,9,0])
In [2]: s2 = set([9,0,5,1,8,2,7])
In [3]: s1
Out[3]: set([0, 1, 2, 5, 7, 8, 9])
In [4]: s2
Out[4]: set([0, 1, 2, 5, 7, 8, 9])
In [5]: s1 ^ s2
Out[5]: set()
What you want isn't possible. Arbitrary means arbitrary.
My solution would be the same as yours, you have to sort the set if you want to be able to compare it to another one.
The set's iteration order depends not only on its contents, but also on the order in which the items were inserted into the set, and on whether there were deletions along the way. So you can create two different sets, using different insertions and deletions, and end up with the same set at the end, but with different iteration orders.
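A small sketch of this on CPython: 8 and 16 collide in a small hash table, so whichever is inserted first typically claims the earlier slot, and the two equal sets can iterate differently.

```python
# Two sets with identical contents, built in different insertion orders.
s1 = set()
s1.add(8)
s1.add(16)

s2 = set()
s2.add(16)
s2.add(8)

assert s1 == s2            # equal as sets...
print(list(s1), list(s2))  # ...but the iteration orders may differ
```

This is an implementation detail of CPython's open-addressing hash table, not behavior you can rely on across versions or implementations.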
As others have said: if you care about the order of the set, you have to create a sorted list from it.
Your question transformed into two questions: A) how to compare "the output of two runs" in your specific case; B) what's the definition of the iteration order in a set. Maybe you should distinguish them and post B) as a new question if appropriate. I'll answer A.
IMHO, using a sorted list in your case is not a very clean solution. You should decide whether you care for iteration order once and for all and use the appropriate structure.
Either 1) you want to compare the two sets to see if they have equal content, irrespective of the order. Then the simple == operator on sets seems appropriate. See python2 sets, python3 sets.
Or 2) you want to check whether the elements were inserted in the same order. But this seems reasonable only if the order of insertion somehow matters to the users of your library, in which case using the set type was probably inappropriate to begin with. Put another way, it is unclear what you mean exactly by "comparing the output of two runs" and why you want to do that.
In all cases, I doubt a sorted list is appropriate here.
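For option 1) - comparing set contents irrespective of order - a minimal sketch:

```python
# Set equality compares contents only; iteration order is irrelevant.
expected = {"a", "b", "c"}
actual = {"c", "a", "b"}
assert actual == expected
```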
You can make the expected result a set as well, and check whether the two sets are equal using ==.
Unlike sets, lists always have a guaranteed order, so you could drop the set and use a list.
Related
I am writing a program to find the most similar words for a list of input words using Gensim for word2vec.
Example: input_list = ["happy", "beautiful"]
Further, I use a for loop to iterate over the list and store the output in a list data structure using the .append() function.
The final list is a list of lists having tuples. See below
results = [[('glad', 0.7408891320228577),
('pleased', 0.6632170081138611)],
[('gorgeous', 0.8353002667427063),
('lovely', 0.8106935024261475)]]
My question is how to separate the list of lists into independent lists? I followed answers from 1 and 2 which suggest unpacking like a, b = results.
But this is only possible when you know the number of input elements (here 2).
Expected Output (based on above):
list_a = [('glad', 0.7408891320228577), ('pleased', 0.6632170081138611)]
list_b = [('gorgeous', 0.8353002667427063), ('lovely', 0.8106935024261475)]
But, if the number of input elements is always variable, say 4 or 5, then how do we unpack and get a reference to the independent lists at run-time?
Or what is a better data structure to store the above results so that unpacking or further processing is friendlier?
Kindly help.
If you have a variable number of query-words - sometimes 2, sometimes 5, sometimes any other number N – then you almost certainly do not want to bring those out into totally-separate variable names (like list_a, list_b, etc).
Why not? Well, your next steps will then likely be to do something to each of the N items.
And to do that, you'll then want them in some sort of indexed-list you can iterate over.
What if instead, they're in some bunch of local variables - list_a, list_b, list_c, list_d - like you've requested? Then in the case where there's only 3, some of those variables, like list_d, either won't exist (be undefined) or will hold some different signal value (like say None).
For most tasks, that will then be harder to work with - requiring awkward branches/tests for every possible count of results.
Instead, your existing results - a list whose items you can access by numeric index (results[0], results[1]), either individually or in a loop - is a much more useful structure when the count of things you're dealing with will vary.
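As a sketch (with hypothetical, shortened scores), the same loop handles any number of result lists without naming each one:

```python
# A list of lists of (word, similarity) tuples, as produced by the question's loop.
results = [
    [('glad', 0.74), ('pleased', 0.66)],
    [('gorgeous', 0.84), ('lovely', 0.81)],
]

# Works unchanged whether there are 2, 5, or 100 result lists.
for i, hits in enumerate(results):
    best_word, best_score = hits[0]
    print("query %d: best match %r (%.2f)" % (i, best_word, best_score))
```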
If you think you have a valid reason for your expected end-state, please describe the reason, and especially the next things you then want to do, in more detail, via an expansion to the question. And consider those next steps for several different scenarios: just 1 set of results, 2 sets of results, 5 sets of results, 100 sets of results. (In that last case, what would you even name the variables, beyond list_z?)
(Separately, this is not really a question about Gensim or word2vec, but about core Python language features and variable/data-structure handling. So I've removed those tags, and added destructuring, a term for the sort of multiple-variable assignment that almost does what you need but isn't quite right, and will tune the title a bit.)
I quite often use set() to remove duplicates from lists. After doing so, I always directly change it back to a list.
a = [0,0,0,1,2,3,4,5]
b = list(set(a))
Why does set() return a set item, instead of simply a list?
type(set(a)) == set # is true
Is there a use for set items that I have failed to understand?
Yes, sets have many uses. They have lots of nice operations, documented here, which lists don't have. One very useful difference is that membership testing (x in a) can be much faster than for a list.
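A quick sketch of a few of those operations:

```python
a = {1, 2, 3}
b = {3, 4}

assert 2 in a                  # fast membership test
assert a & b == {3}            # intersection
assert a | b == {1, 2, 3, 4}   # union
```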
Okay, by doubles you mean duplicates? set() will always return a set, because it is a data structure in Python, like lists. When you call set() you are creating a set object.
The rest of the information about sets can be found here:
https://docs.python.org/2/library/sets.html
Others have already covered why set does not return a list, so I won't go into that, but as you stated:
I quite often use set() to remove duplicates from lists. After doing so, I always directly change it back to a list.
You could use OrderedDict if you really hate going back to changing it to a list:
source_list = [0,0,0,1,2,3,4,5]
from collections import OrderedDict
print(OrderedDict((x, True) for x in source_list).keys())
OUTPUT:
odict_keys([0, 1, 2, 3, 4, 5])
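On Python 3.7+, plain dicts also preserve insertion order, so dict.fromkeys gives the same order-preserving deduplication without an import:

```python
source_list = [0, 0, 0, 1, 2, 3, 4, 5]

# dict keys are unique and keep first-seen order on Python 3.7+.
deduped = list(dict.fromkeys(source_list))
print(deduped)  # [0, 1, 2, 3, 4, 5]
```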
As said before, for certain operations sets are faster than lists. The Python wiki has a page, TimeComplexity, which gives the speed of operations for various data types. Note that if you have few elements in your list or set, you will most probably not notice a difference, but with more elements it becomes more important.
Notice, for example, that in-place removal from a list is O(n), meaning that a 10-times-longer list needs 10 times more time, while for a set, s.difference_update(t) - where s is a set and t is a set with one element to be removed from s - takes O(1) time, i.e. independent of the number of elements in s.
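Sketched side by side:

```python
lst = list(range(5))
lst.remove(3)              # O(n): scans, then shifts the tail of the list down

s = set(range(5))
s.difference_update({3})   # O(1) on average for a one-element removal

assert lst == [0, 1, 2, 4]
assert s == {0, 1, 2, 4}
```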
I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2,2,4,7,7,8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn´t it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10,20,30,20,10,50,60,40,80,50,40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws an error: TypeError: set() missing 1 required positional argument: 'items'. (This is from a website for Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
Lookup for an element in a list takes O(N) time (you can find an element in logarithmic time, but only if the list is sorted, so that's not your case). So if you use the same list to keep unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup). set is a hash set in Python, so lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, your whole algorithm takes only O(N) time on average - an order of growth better.
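A rough timeit sketch (the exact numbers vary by machine) showing the gap on a worst-case membership test:

```python
import timeit

n = 10_000
data_list = list(range(n))
data_set = set(data_list)

# Looking for the last element: the list scans everything, the set does not.
t_list = timeit.timeit(lambda: (n - 1) in data_list, number=1_000)
t_set = timeit.timeit(lambda: (n - 1) in data_set, number=1_000)
print("list: %.4fs  set: %.4fs" % (t_list, t_set))
```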
In most cases sets are faster than lists. One of these cases is when you look for an item using the in keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in the second code snippet works faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
Your code is also inefficient in that, for each item, it searches an ever-growing list, while the second snippet looks the item up in a set instead. (The set is not always smaller, though - if the list is all unique items, the set grows just as large - but its lookups stay fast.)
Hope it clarifies.
I just finished LearnPythonTheHardWay as my intro to programming and set my mind on a sudoku related project. I've been reading through the code of a Sudoku Generator that was uploaded here
to learn some things, and I ran into the line available = set(range(1,10)). I read that as available = set([1, 2, 3, 4, 5, 6, 7, 8, 9]) but I'm not sure what set is.
I tried googling python set, looked through the code to see if set had been defined anywhere, and now I'm coming to you.
Thanks.
set is a built-in type. From the documentation:
A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
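A short sketch of those documented uses:

```python
words = ["spam", "egg", "spam"]

assert "egg" in set(words)            # membership testing
assert set(words) == {"spam", "egg"}  # removing duplicates
assert {1, 2} - {2} == {1}            # difference
assert {1, 2} ^ {2, 3} == {1, 3}      # symmetric difference
```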
A set in Python is the collection used to mimic the mathematical notion of set. To put it very succinctly, a set is a list of unique objects, that is, it cannot contain duplicates, which a list can do.
A set is kind of like an unordered list, with unique elements. Documentation exists though, so I'm not sure why you couldn't find it:
https://docs.python.org/2/library/stdtypes.html#set
To make it easy to understand, let's take a list:
a = [1,2,3,4,5,5,5,6,7,7,9]
print list(set(a))
The output will be:
[1, 2, 3, 4, 5, 6, 7, 9]
You can eliminate repeated numbers using set.
For more usage of set you have to refer to the docs.
Thanks to my friend here who reminded me about the lack of order: in case the list a was like
a = [7,7,5,5,5,1,2,3,4,6,9]
print list(set(a))
it will still print the output as
[1, 2, 3, 4, 5, 6, 7, 9]
You can't preserve order in a set.
I have a set, I add items (ints) to it, and when I print it, the items apparently are sorted:
a = set()
a.add(3)
a.add(2)
a.add(4)
a.add(1)
a.add(5)
print a
# set([1, 2, 3, 4, 5])
I have tried with various values; apparently this happens only with ints.
I run Python 2.7.5 under MacOSX. It is also reproduced using repl.it (see http://repl.it/TpV)
The question is: is this documented somewhere (I haven't found it so far)? Is it normal? Is it something that can be relied on?
Extra question: when is the sort done? during the print? is it internally stored sorted? (is that even possible given the expected constant complexity of insertion?)
This is a coincidence. The data is neither sorted nor does __str__ sort.
The hash values for integers equal their value (except for -1 and long integers outside the sys.maxint range), which increases the chance that integers are slotted in order, but that's not a given.
set uses a hash table to track items contained, and ordering depends on the hash value, and insertion and deletion history.
The how and why of the interaction between integers and sets are all implementation details, and can easily vary from version to version. Python 3.3 introduced hash randomisation for certain types, and Python 3.4 expanded on this, making ordering of sets and dictionaries volatile between Python process restarts too (depending on the types of values stored).
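The integer-hashing detail can be checked directly (this is CPython behaviour, not a language guarantee):

```python
# Small ints hash to themselves, which is why they often land in "sorted" slots...
assert hash(5) == 5
assert hash(12345) == 12345

# ...except -1, which CPython reserves internally as an error code:
assert hash(-1) == -2
```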