What is `set` definition? AKA what does `set` do?

What is `set` definition? AKA what does `set` do? - python

I just finished LearnPythonTheHardWay as my intro to programming and set my mind on a sudoku related project. I've been reading through the code of a Sudoku Generator that was uploaded here
to learn some things, and I ran into the line available = set(range(1,10)). I read that as available = set([1, 2, 3, 4, 5, 6, 7, 8, 9]) but I'm not sure what set is.
I tried googling python set, looked through the code to see if set had been defined anywhere, and now I'm coming to you.
Thanks.

Set is built-in type. From the documentation:
A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.

A set in Python is the collection used to mimic the mathematical notion of set. To put it very succinctly, a set is a list of unique objects, that is, it cannot contain duplicates, which a list can do.

A set is kind of like an unordered list, with unique elements. Documentation exists though, so I'm not sure why you couldn't find it:
https://docs.python.org/2/library/stdtypes.html#set

to make it easy to understand ,
lets take a list ,
a = [1,2,3,4,5,5,5,6,7,7,9]
print list(set(a))
the output will be ,
[1,2,3,4,5,6,7,9]
You can prevent repetitive number using set.
For more usage of set you have to refer to the docs.
Thanks to my friend here who reminded me about the lack of order ,
Incase if the list 'a' was like,
a =[7,7,5,5,5,1,2,3,4,6,9]
print list(set(a))
will still print the output as
[1,2,3,4,5,6,7,9]
You cant preserve order in set.

Related

Why does Pythons' set() return a set item instead of a list

I quite often use set() to remove duplicates from lists. After doing so, I always directly change it back to a list.
a = [0,0,0,1,2,3,4,5]
b = list(set(a))
Why does set() return a set item, instead of simply a list?
type(set(a)) == set # is true
Is there a use for set items that I have failed to understand?

Yes, sets have many uses. They have lots of nice operations documented here which lists don't have. One very useful difference is that membership testing (x in a) can be much faster than for a list.

Okay, by doubles you mean duplicate? and set() will always return a set because it is a data structure in python like lists. when you are calling set you are creating an object of set().
rest of the information about sets you can find here
https://docs.python.org/2/library/sets.html

As already mentioned, I won't go into why set does not return a list but like you stated:
I quite often use set() to remove doubles from lists. After doing so, I always directly change it back to a list.
You could use OrderedDict if you really hate going back to changing it to a list:
source_list = [0,0,0,1,2,3,4,5]
from collections import OrderedDict
print(OrderedDict((x, True) for x in source_list).keys())
OUTPUT:
odict_keys([0, 1, 2, 3, 4, 5])

As said before, for certain operations if you use set instead of list, it is faster. Python wiki has query TimeComplexity in which speed of operations of various data types are given. Note that if you have few elements in your list or set, you will most probably do not notice difference, but with more elements it become more important.
Notice that for example if you want to make in-place removal, for list it is O(n) meaning that for 10 times longer list it will need 10 times more time, while for set and s.difference_update(t) where s is set, t is set with one element to be removed from s, time is O(1) i.e. independent from number of elements of s.

Access the nth element of an object in Python

Here's the json
{u'skeleton_horde': 2, u'baby_dragon': 3, u'valkyrie': 5, u'chr_witch': 1, u'lightning': 1, u'order_volley': 6, u'building_inferno': 3, u'battle_ram': 2}
I'm trying to make the list look like this
skeleton_horde baby_dragon valkyrie lightning order_volley building_inferno
Here's the python
print(x['right']['troops'])
There's surprisingly no documentation on how to get the n element of an object (not array). I tried:
print(x['right']['troops'][1])
but it doesn't work.

First you want to extract the keys:
x['right']['troops']
Then you want to join them with spaces interspersed
' '.join(x['right']['troops'])
This will be in a different order than what you have, though, since Python dictionaries are unordered.

You want to use dict.keys() to a get a list* of the key values of the dict:
print(list(x['right']['troops'].keys()))
*It's actually a view, in Python 3. It would be a list in Python 2.

There's no way of getting the nth item in a dictionary (perhaps you've conflated Python dicts with JavaScript objects) for the simple reason that they are unordered.
There is however a type of dictionary that does maintain the order of its keys, aptly named OrderedDict.
Solution
As another commenter pointed out, there is a solution to your problem, but it still won't give you the keys in the order of definition:
' '.join(obj['right']['troops'])
Note
In a recent version of CPython (3.6), dictionary keys are indeed ordered. I'm not sure if I'd rely on implementation-specific behaviour, or whether you even need to order the keys in this case, but it's good to know. Props to #ScottColby for pointing this out to me!

Difference between two "contains" operations for python lists

I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list
if i == thingIAmLookingFor
return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list
# do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to which, if either, is more preferred.

In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
print "Found it"
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
print "Found it"

the "if x in thing:" format is strongly preferred, not just because it takes less code, but it also works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on datatypes that are stored in a more searchable form. eg. sets or dictionary keys.

The if thing in somelist is the preferred and fastest way.
Under-the-hood that use of the in-operator translates to somelist.__contains__(thing) whose implementation is equivalent to: any((x is thing or x == thing) for x in somelist).
Note the condition tests identity and then equality.

for i in list
if i == thingIAmLookingFor
return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.

Set iteration order varies from run to run

Why does the iteration order of a Python set (with the same contents) vary from run to run, and what are my options for making it consistent from run to run?
I understand that the iteration order for a Python set is arbitrary. If I put 'a', 'b', and 'c' into a set and then iterate them, they may come back out in any order.
What I've observed is that the order remains the same within a run of the program. That is, if my program iterates the same set twice in a row, I get the same order both times. However, if I run the program twice in a row, the order changes from run to run.
Unfortunately, this breaks one of my automated tests, which simply compares the output from two runs of my program. I don't care about the actual order, but I would like it to be consistent from run to run.
The best solution I've come up with is:
Copy the set to a list.
Apply an arbitrary sort to the list.
Iterate the list instead of the set.
Is there a simpler solution?
Note: I've found similar questions on StackOverlow, but none that address this specific issue of getting the same results from run to run.

The reason the set iteration order changes from run-to-run appears to be because Python uses hash seed randomization by default. (See command option -R.) Thus set iteration is not only arbitrary (because of hashing), but also non-deterministic (because of the random seed).
You can override the random seed with a fixed value by setting the environment variable PYTHONHASHSEED for the interpreter. Using the same seed from run to run means set iteration is still arbitrary, but now it is deterministic, which was the desired property.
Hash seed randomization is a security measure to make it difficult for an adversary to feed inputs that will cause pathological behavior (e.g., by creating numerous hash collisions). For unit testing, this is not a concern, so it's reasonable to override the hash seed while running tests.

Use the symmetric_difference (^) operator on your two sets to see if there are any differences:
In [1]: s1 = set([5,7,8,2,1,9,0])
In [2]: s2 = set([9,0,5,1,8,2,7])
In [3]: s1
Out[3]: set([0, 1, 2, 5, 7, 8, 9])
In [4]: s2
Out[4]: set([0, 1, 2, 5, 7, 8, 9])
In [5]: s1 ^ s2
Out[5]: set()

What you want isn't possible. Arbitrary means arbitrary.
My solution would be the same as yours, you have to sort the set if you want to be able to compare it to another one.

The set's iteration order depends not only its contents, but on the order in which the items were inserted into the set, and whether there were deletions along the way. So you can create two different sets, using different insertions and deletions, and end up with the same set at the end, but with different iteration orders.
As others have said: if you care about the order of the set, you have to create a sorted list from it.

Your question transformed into two questions: A) how to compare "the output of two runs" in your specific case; B) what's the definition of the iteration order in a set. Maybe you should distinguish them and post B) as a new question if appropriate. I'll answer A.
IMHO, using a sorted list in your case is not a very clean solution. You should decide whether you care for iteration order once and for all and use the appropriate structure.
Either 1) you want to compare the two sets to see if they have equal content, irrespective of the order. Then the simple == operator on sets seems appropriate. See python2 sets, python3 sets.
Or 2) you want to check whether the elements were inserted in the same order. But this seems reasonable only if the order of insertion somehow matters to the users of your library, in which case using the set type was probably inappropriate to begin with. Put another way, it is unclear what you mean exactly by "comparing the output of two runs" and why you want to do that.
In all cases, I doubt a sorted list is appropriate here.

You can set the expected result to be also a set. And checks if those two sets are equal using ==.

Contrary to sets, lists have always a guaranteed order, so you could toss the set and use the list.

Python equivalent to java.util.SortedSet?

Does anybody know if Python has an equivalent to Java's SortedSet interface?
Heres what I'm looking for: lets say I have an object of type foo, and I know how to compare two objects of type foo to see whether foo1 is "greater than" or "less than" foo2. I want a way of storing many objects of type foo in a list L, so that whenever I traverse the list L, I get the objects in order, according to the comparison method I define.
Edit:
I guess I can use a dictionary or a list and sort() it every time I modify it, but is this the best way?

Take a look at BTrees. It look like you need one of them. As far as I understood you need structure that will support relatively cheap insertion of element into storage structure and cheap sorting operation (or even lack of it). BTrees offers that.
I've experience with ZODB.BTrees, and they scale to thousands and millions of elements.

You can use insort from the bisect module to insert new elements efficiently in an already sorted list:
from bisect import insort
items = [1,5,7,9]
insort(items, 3)
insort(items, 10)
print items # -> [1, 3, 5, 7, 9, 10]
Note that this does not directly correspond to SortedSet, because it uses a list. If you insert the same item more than once you will have duplicates in the list.

If you're looking for an implementation of an efficient container type for Python implemented using something like a balanced search tree (A Red-Black tree for example) then it's not part of the standard library.
I was able to find this, though:
http://www.brpreiss.com/books/opus7/
The source code is available here:
http://www.brpreiss.com/books/opus7/public/Opus7-1.0.tar.gz
I don't know how the source code is licensed, and I haven't used it myself, but it would be a good place to start looking if you're not interested in rolling your own container classes.
There's PyAVL which is a C module implementing an AVL tree.
Also, this thread might be useful to you. It contains a lot of suggestions on how to use the bisect module to enhance the existing Python dictionary to do what you're asking.
Of course, using insort() that way would be pretty expensive for insertion and deletion, so consider it carefully for your application. Implementing an appropriate data structure would probably be a better approach.
In any case, to understand whether you should keep the data structure sorted or sort it when you iterate over it you'll have to know whether you intend to insert a lot or iterate a lot. Keeping the data structure sorted makes sense if you modify its content relatively infrequently but iterate over it a lot. Conversely, if you insert and delete members all the time but iterate over the collection relatively infrequently, sorting the collection of keys before iterating will be faster. There is no one correct approach.

Similar to blist.sortedlist, the sortedcontainers module provides a sorted list, sorted set, and sorted dict data type. It uses a modified B-tree in the underlying implementation and is faster than blist in most cases.
The sortedcontainers module is pure-Python so installation is easy:
pip install sortedcontainers
Then for example:
from sortedcontainers import SortedList, SortedDict, SortedSet
help(SortedList)
The sortedcontainers module has 100% coverage testing and hours of stress. There's a pretty comprehensive performance comparison that lists most of the options you'd consider for this.

If you only need the keys, and no associated value, Python offers sets:
s = set(a_list)
for k in sorted(s):
print k
However, you'll be sorting the set each time you do this.
If that is too much overhead you may want to look at HeapQueues. They may not be as elegant and "Pythonic" but maybe they suit your needs.

Use blist.sortedlist from the blist package.
from blist import sortedlist
z = sortedlist([2, 3, 5, 7, 11])
z.add(6)
z.add(3)
z.add(10)
print z
This will output:
sortedlist([2, 3, 3, 5, 6, 7, 10, 11])
The resulting object can be used just like a python list.
>>> len(z)
8
>>> [2 * x for x in z]
[4, 6, 6, 10, 12, 14, 20, 22]

Do you have the possibility of using Jython? I just mention it because using TreeMap, TreeSet, etc. is trivial. Also if you're coming from a Java background and you want to head in a Pythonic direction Jython is wonderful for making the transition easier. Though I recognise that use of TreeSet in this case would not be part of such a "transition".
For Jython superusers I have a question myself: the blist package can't be imported because it uses a C file which must be imported. But would there be any advantage of using blist instead of TreeSet? Can we generally assume the JVM uses algorithms which are essentially as good as those of CPython stuff?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.