Irregularities in Python set comprehensions [duplicate]

With this code:
print set(a**b for a in range(2, 5) for b in range(2, 5))
I get this answer:
set([64, 256, 4, 8, 9, 16, 81, 27])
Why isn't it sorted?

Python's built-in sets are not ordered collections. (Some languages offer sorted set types, but Python's set is not one of them.)
Sets are usually implemented with hash tables, so the iteration order depends on the elements' hash values and on the table's internal layout, not on the natural order of the elements.
If you need order, consider using a list, for example sorted(your_set).

Sets are by their nature unordered containers. From the documentation:
A set object is an unordered collection of distinct hashable objects.
They are implemented using a hash table, facilitating O(1) membership tests. If you need an ordered set, try OrderedDict.fromkeys():
from collections import OrderedDict
OrderedDict.fromkeys(a**b for a in range(2, 5) for b in range(2, 5))
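As a rough Python 3 sketch of the two suggestions above (sort the values, or keep them in first-seen order), using the values from the question. The variable names are only illustrative, and the dict-based variant assumes Python 3.7 or later, where plain dicts keep insertion order:

```python
# Python 3 sketch: two ways to get a predictable order out of the question's values.
powers = [a**b for a in range(2, 5) for b in range(2, 5)]

# Sorted, duplicates removed (build a set, then sort it into a list).
sorted_unique = sorted(set(powers))

# First-seen order, duplicates removed (dict keys keep insertion order in 3.7+).
first_seen_unique = list(dict.fromkeys(powers))

print(sorted_unique)      # [4, 8, 9, 16, 27, 64, 81, 256]
print(first_seen_unique)  # [4, 8, 16, 9, 27, 81, 64, 256]
```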

Related

Python - Differences between two lists

For practice, I tried to implement a function that takes two lists as parameters and returns their difference, that is, the elements the lists do not have in common.
I coded the following functions:
list1 = [4,2,5,3,9,11]
list2 = [7,9,2,3,5,1]
def difference(list1, list2):
    return list(set(list1) - set(list2))
difference(list1, list2)
AND
def difference_extra_credit(list1, list2):
    return [value for value in list1 if value not in list2]
difference_extra_credit(list1, list2)
Both versions seem to work, but they appear to require the lists to have the same length. If the lengths differ, for instance after adding the integer 100 to list1, the new element is not shown as a difference when I print the results.
I haven't found a way to modify the code so that the lengths of the lists don't matter. Does anyone have an idea?
Thanks!

If you want symmetric difference, use the ^ operator instead of -
def difference(list1, list2):
    return list(set(list1) ^ set(list2))
Here are the four set operators that combine two sets into one set.
| union : elements in one or both of the sets
& intersection : only elements common to both sets
- difference : elements in the left hand set that are not in the right hand set
^ symmetric difference : elements in either set but not in both.
I think this is a more readable way of writing the function
def symmetric_difference(a, b):
    return {*a} ^ {*b}
(* unpacking in set literals requires python 3.5 or later)
Returning a set instead of a list makes it a bit more clear what the function does. The input arguments can be any iterable types, and since set is an unordered data type, returning a set makes it obvious that any ordering in the input data was not preserved.
>>> symmetric_difference(range(3, 8), [1,2,3,4])
{1, 2, 5, 6, 7}
>>> symmetric_difference('hello', 'world')
{'d', 'e', 'h', 'r', 'w'}
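The four operators listed above can be checked against the lists from the question. This is only a Python 3 sketch; the names s1, s2 and the result variables are illustrative, and sorted() is there purely to make the printed order predictable:

```python
# Python 3 sketch: the four set operators applied to the question's lists.
list1 = [4, 2, 5, 3, 9, 11]
list2 = [7, 9, 2, 3, 5, 1]
s1, s2 = set(list1), set(list2)

union = s1 | s2                  # in one or both lists
intersection = s1 & s2           # in both lists
difference = s1 - s2             # in list1 but not in list2
symmetric_difference = s1 ^ s2   # in exactly one of the two lists

print(sorted(union))                 # [1, 2, 3, 4, 5, 7, 9, 11]
print(sorted(intersection))          # [2, 3, 5, 9]
print(sorted(difference))            # [4, 11]
print(sorted(symmetric_difference))  # [1, 4, 7, 11]
```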

Neither of your versions is symmetric: if you exchange list1 and list2, the result won't be the same.
If you add a number to list2 (not to list1 as your question states), it's not reported as a difference, although it is one.
You want to perform a symmetric difference, so that no matter how the data is distributed between the two lists (swapped or not), the result remains the same:
def difference(list1,list2):
    return list(set(list1).symmetric_difference(list2))
with your data:
[1, 4, 7, 11]

Trying out your code, it seemed to work fine with me regardless of the length of the lists - when I added 100 to list1, it showed up for both difference functions.
However, there appear to be a few issues with your code that could be causing the problems. Firstly, both functions take parameters named list1 and list2, which are the same names as your global list variables. This does not seem to cause an issue here, but it means that the global variables are no longer accessible inside the functions, and it is generally better practice to avoid confusion by using different names for global variables and for variables within functions.
Additionally, your function does not take the symmetric difference: it only loops over the values in the first list, so values unique to the second list are not counted. To fix this easily, you could combine your lists into one list, then loop over that entire list and check whether each value is in only one of the lists. This uses ^ to do an xor comparison of the two membership tests, so that it is true only if the value is in exactly one of the lists. This could be done like so:
def difference_extra_credit(l1,l2):
    combined = l1 + l2  # avoid naming this "list", which would shadow the built-in
    return [value for value in combined if (value in l1) ^ (value in l2)]
Testing this function on my own has resulted in the list [4, 11, 7, 1], and [4, 11, 100, 7, 1] if 100 is added to list1 or list2.
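For longer lists, the repeated `value in l1` tests become slow, since each one scans a list. A possible variation, sketched here for Python 3 under the assumption that the elements are hashable, does the membership tests against sets while keeping first-seen order. The function name ordered_symmetric_difference is made up for this example, and unlike the version above it reports each value only once:

```python
def ordered_symmetric_difference(first, second):
    """Return values found in exactly one input, in first-seen order."""
    # Values that occur in exactly one of the two inputs.
    only_one = set(first) ^ set(second)
    result = []
    seen = set()
    for value in list(first) + list(second):
        # Set lookups are O(1) on average, unlike "value in some_list".
        if value in only_one and value not in seen:
            seen.add(value)
            result.append(value)
    return result


list1 = [4, 2, 5, 3, 9, 11]
list2 = [7, 9, 2, 3, 5, 1]
print(ordered_symmetric_difference(list1, list2))          # [4, 11, 7, 1]
print(ordered_symmetric_difference(list1 + [100], list2))  # [4, 11, 100, 7, 1]
```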

Sorting a list of python sets by value

The frozenset docs say:
The frozenset type is immutable and hashable — its contents cannot be altered after it is created; it can therefore be used as a dictionary key or as an element of another set.
However, the docs for Python sets say:
Since sets only define partial ordering (subset relationships), the output of the list.sort() method is undefined for lists of sets.
This makes me ask: why is this the case? And, if I wanted to sort a list of sets by set content, how could I do this? I know that the extension intbitset (https://pypi.python.org/pypi/intbitset/2.3.0) has a function for returning a bit sequence that represents the set contents. Is there something comparable for Python sets?

Tuples, lists, strings, etc. have a natural lexicographic ordering and can be sorted because you can always compare two elements of a given collection. That is, either a < b, b < a, or a == b.
A natural comparison between two sets is having a <= b mean a is a subset of b, which is what the expression a <= b actually does in Python. What the documentation means by "partial ordering" is that not all sets are comparable. Take, for example, the following sets:
a = {1, 2, 3}
b = {4, 5, 6}
Is a a subset of b? No. Is b a subset of a? No. Are they equal? No. If you can't compare them at all, you clearly can't sort them.
The only way you can sort a collection of sets is if your comparison function actually can compare any two elements (a total order). This means you can still sort a collection of sets using the above subset relation, but you will have to ensure that all of the sets are comparable (e.g. [{1}, {1, 2, 4}, {1, 2}]).
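A small Python 3 sketch of both cases described above: two incomparable sets, and a chain of subsets that happens to be totally ordered. The name chain is just for illustration:

```python
a = {1, 2, 3}
b = {4, 5, 6}

# None of the three relations holds, so a and b are incomparable.
print(a < b, b < a, a == b)  # False False False

# Every pair in this list is related by the subset relation,
# so sorting with the default < comparison is well defined here.
chain = [{1}, {1, 2, 4}, {1, 2}]
ordered_chain = sorted(chain)
print(ordered_chain == [{1}, {1, 2}, {1, 2, 4}])  # True
```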
The easiest way to do what you want is to transform each individual set into something that you actually can compare. Basically, you do f(a) <= f(b) (where <= is obvious) for some simple function f. This is done with the key keyword argument:
In [10]: def f(some_set):
    ...:     return max(some_set)
    ...:
In [11]: sorted([{1, 2, 3, 999}, {4, 5, 6}, {7, 8, 9}], key=f)
Out[11]: [{4, 5, 6}, {7, 8, 9}, {1, 2, 3, 999}]
You're sorting [f(set1), f(set2), f(set3)] and applying the resulting ordering to [set1, set2, set3].

Take an example: say you wanted to sort a list of sets by the "first element" of each set. The issue is that Python sets or frozensets don't have a "first element." They have no sense of their own ordering. A set is an unordered collection with no duplicate elements.
Furthermore, list.sort() sorts the list in place, using only the < operator between items.
If you just use a.sort() without passing any key parameter, relying on set_a < set_b (or set_a.__lt__(set_b)) is insufficient. By insufficient, I mean that set_a.__lt__(set_b) is a proper-subset test (is set_a a proper subset of set_b?). As mentioned by @Blender and referenced in your question, this provides a partial rather than a total ordering, which is insufficient for defining an order over whatever sequence holds the sets.
From the docs:
set < other: Test whether the set is a proper subset of other, that
is, set <= other and set != other.
You could pass a key to sort(); it just couldn't refer to anything to do with the internal "ordering" of the sets because, remember, there is none.
>>> a = {2, 3, 1}
>>> b = {6, 9, 0, 1}
>>> c = {0}
>>> i = [b, a, c]
>>> i.sort(key=len)
>>> i
[{0}, {1, 2, 3}, {0, 9, 6, 1}]
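To sort by set content, as the question asks, rather than by length or maximum, one possible key is the sorted list of each set's elements. This Python 3 sketch assumes the elements within and across the sets are mutually comparable (all integers here); the sample data is made up:

```python
sets = [{7, 8, 9}, {1, 2, 3, 999}, {4, 5, 6}, {1, 2}]

# sorted(s) turns each set into an ascending list. Lists compare
# lexicographically, which gives a total order over the sets.
by_content = sorted(sets, key=sorted)

# Show each set as a sorted list so the printed form is predictable.
print([sorted(s) for s in by_content])
# [[1, 2], [1, 2, 3, 999], [4, 5, 6], [7, 8, 9]]
```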

Does Python keep track of when something has been sorted, internally?

For example, if I call
L = [3,4,2,1,5]
L = sorted(L)
I get a sorted list. Now, in the future, if I want to perform some other kind of sort on L, does Python automatically know "this list has been sorted before and not modified since, so we can perform some internal optimizations on how we perform this other kind of sort" such as a reverse-sort, etc?

Nope, it doesn't. The sorting algorithm is designed to exploit (partially) sorted inputs, but the list itself doesn't "remember" being sorted in any way.
(This is actually a CPython implementation detail, and future versions/different implementations could cache the fact that a list was just sorted. However, I'm not convinced that could be done without slowing down all operations that modify the list, such as append.)
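One way to observe that the algorithm, rather than the list, is what benefits from existing order is to count comparisons. The sketch below (Python 3) wraps values in a hypothetical Counted class whose only job is to count calls to <; the exact numbers depend on the interpreter, so none are shown:

```python
import random


class Counted:
    """Wrap a value and count how many times < is evaluated."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def count_comparisons(values):
    """Sort wrapped copies of values and return the number of < calls used."""
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons


n = 1000
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(0).shuffle(shuffled)  # fixed seed so runs are repeatable

sorted_cost = count_comparisons(already_sorted)
shuffled_cost = count_comparisons(shuffled)
print(sorted_cost, shuffled_cost)  # the first number should be much smaller
```

Sorting the already sorted input is cheap every time it is done, which is consistent with the point above: nothing is cached on the list, the sort simply detects the existing run again.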

As the commenters pointed out, normal Python lists are inherently ordered and efficiently sortable (thanks, Timsort!), but do not remember or maintain sorting status.
If you want lists that invariably retain their sorted status, you can install the SortedContainers package from PyPI.
>>> from sortedcontainers import SortedList
>>> L = SortedList([3,4,2,1,5])
>>> L
SortedList([1, 2, 3, 4, 5])
>>> L.add(3.3)
>>> L
SortedList([1, 2, 3, 3.3, 4, 5])
Note the normal append method becomes add, because the item isn't added on the end. It's added wherever appropriate given the sort order. There is also a SortedListWithKey type (named SortedKeyList in newer releases of the package) that allows you to set your sort key/order explicitly.
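If installing a package is not an option, the standard library's bisect module can keep a plain list sorted as items are added. This is only a minimal Python 3 sketch: the list still does not remember that it is sorted, and keeping it that way is up to the caller.

```python
import bisect

L = sorted([3, 4, 2, 1, 5])

# insort finds the position by binary search, then inserts there.
# The search is O(log n), but the insertion itself still shifts elements.
bisect.insort(L, 3.3)
print(L)  # [1, 2, 3, 3.3, 4, 5]
```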

Some of this, at least the specific reverse sort question, could be done using numpy:
import numpy as np
L = np.array([3,4,2,1,5])
a = np.argsort(L)
b = L[a]
r = L[a[::-1]]
print(L)
[3 4 2 1 5]
print(b)
[1 2 3 4 5]
print(r)
[5 4 3 2 1]
That is, here we just do the sort once (to create a, the sorting indices), and then we can manipulate a, to do other various sorts, like the normal sort b, and the reverse sort r. And many others would be similarly easy, like every other element.

Calling Methods on Python's Dictionary Literals

I'm trying to concatenate some dictionaries. The best way I've come up with to do that is to use dict1.update(dict2).
This is the code I'm trying to run, but it evaluates to None. Why?
{k:30 for k in [4, 9, 11, 6]}.update({k:31 for k in [1, 3, 5, 7, 8, 10, 12]})

The dict.update method works in-place and therefore always returns None. It is no different than other in-place methods such as dict.clear and list.append.
Note too that this behavior is mentioned in the docs:
update([other])
Update the dictionary with the key/value pairs from other, overwriting existing keys. Return None.
Emphasis mine.

Since update doesn't return a reference to the updated dictionary, you can use the following instead:
import itertools
d = dict(itertools.chain({k:30 for k in [...]}.items(),
                         {k:31 for k in [...]}.items()))
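A short Python 3 sketch of the behaviour described above, using the key lists from the question. The variable names are illustrative, and the ** unpacking form assumes Python 3.5 or later:

```python
thirty = {k: 30 for k in [4, 9, 11, 6]}
thirty_one = {k: 31 for k in [1, 3, 5, 7, 8, 10, 12]}

# update() mutates the dict in place and returns None,
# so the dict needs a name before update() is called on it.
days = dict(thirty)
result = days.update(thirty_one)
print(result)     # None
print(len(days))  # 11

# Building a new dict in a single expression with ** unpacking.
merged = {**thirty, **thirty_one}
print(merged == days)  # True
```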

Python "set" with duplicate/repeated elements

Is there a standard way to represent a "set" that can contain duplicate elements?
As I understand it, a set contains exactly one or zero of any element. I want to be able to hold any number of each.
I am currently using a dictionary with elements as keys and quantities as values, but this seems wrong for many reasons.
Motivation:
I believe there are many applications for such a collection. For example, a survey of favourite colours could be represented by:
survey = ['blue', 'red', 'blue', 'green']
Here, I do not care about the order, but I do about quantities. I want to do things like:
survey.add('blue')
# would give survey == ['blue', 'red', 'blue', 'green', 'blue']
...and maybe even
survey.remove('blue')
# would give survey == ['blue', 'red', 'green']
Notes:
Yes, set is not the correct term for this kind of collection. Is there a more correct one?
A list of course would work, but the collection required is unordered. Not to mention that the method naming for sets seems to me to be more appropriate.

You are looking for a multiset.
Python's closest datatype is collections.Counter:
A Counter is a dict subclass for counting hashable objects. It is an
unordered collection where elements are stored as dictionary keys and
their counts are stored as dictionary values. Counts are allowed to be
any integer value including zero or negative counts. The Counter class
is similar to bags or multisets in other languages.
For an actual implementation of a multiset, use the bag class from the data-structures package on pypi. Note that this is for Python 3 only. If you need Python 2, here is a recipe for a bag written for Python 2.4.

Your approach with a dict of element/count pairs seems OK to me. You probably need some more functionality. Have a look at collections.Counter, which gives you:
- an O(1) test for whether an element is present and O(1) retrieval of its current count (faster than element in list and list.count(element))
- counter.elements(), an iterator that yields each element as many times as its count
- easy union/difference operations with other Counters
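Applied to the survey example from the question, the points above might look like the following Python 3 sketch. Counter has no add or remove methods, so incrementing and decrementing a count stands in for them here; that mapping is an assumption about what the asker wants, not an established API:

```python
from collections import Counter

survey = Counter(['blue', 'red', 'blue', 'green'])

survey['blue'] += 1     # stands in for survey.add('blue')
print(survey['blue'])   # 3

survey['blue'] -= 1     # removes one occurrence, not all of them
print(survey['blue'])   # 2
print('red' in survey)  # True

# elements() repeats each key by its count; sorted() makes the order predictable.
print(sorted(survey.elements()))  # ['blue', 'blue', 'green', 'red']

# Adding two Counters adds the counts key by key.
combined = survey + Counter(['red', 'yellow'])
print(combined['red'], combined['yellow'])  # 2 1
```

A count that drops to zero stays in the Counter as a key until it is deleted, so `in` alone is not a reliable emptiness test after decrements.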

Python "set" with duplicate/repeated elements
This depends on how you define a set. One may assume that to the OP:
- order does not matter (whether ordered or unordered)
- replicates/repeated elements (a.k.a. multiplicities) are permitted
Given these assumptions, the options reduce to two abstract types: a list or a multiset. In Python, these types usually translate to a list and a Counter respectively. See the Details section for some subtleties to observe.
Given
import random
import collections as ct
random.seed(123)
elems = [random.randint(1, 11) for _ in range(10)]
elems
# [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]
Code
A list of replicate elements:
list(elems)
# [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]
A "multiset" of replicate elements:
ct.Counter(elems)
# Counter({1: 2, 5: 2, 2: 2, 7: 2, 9: 2})
Details
On Data Structures
We have a mix of terms here that easily get confused. To clarify, here are some basic mathematical data structures compared to ones in Python.
Type        |Abbr|Order|Replicates|   Math*   |   Python    | Implementation
------------|----|-----|----------|-----------|-------------|-------------------------
Set         |Set |  n  |    n     | {2 3 1}   | {2, 3, 1}   | set(el)
Ordered Set |Oset|  y  |    n     | {1, 2, 3} |      -      | list(dict.fromkeys(el))
Multiset    |Mset|  n  |    y     | [2 1 2]   |      -      | <see `mset` below>
List        |List|  y  |    y     | [1, 2, 2] | [1, 2, 2]   | list(el)
From the table, one can deduce the definition of each type. Example: a set is a container that ignores order and rejects replicate elements. In contrast, a list is a container that preserves order and permits replicate elements.
Also from the table, we can see:
- Neither an ordered set nor a multiset is explicitly implemented in Python
- "Order" is the opposite of a random arrangement of elements, e.g. sorted or insertion order
- Sets and multisets are not strictly ordered. They can be ordered, but order does not matter.
- Multisets permit replicates, thus they are not strict sets (the term "set" is indeed confusing).
On Multisets
Some may argue that collections.Counter is a multiset. You are safe in many cases to treat it as such, but be aware that Counter is simply a dict (a mapping) of key-multiplicity pairs. It is a map of multiplicities. See an example of elements in a flattened multiset:
mset = [x for k, v in ct.Counter(elems).items() for x in [k]*v]
mset
# [1, 1, 5, 5, 2, 2, 7, 7, 9, 9]
Notice there is some residual ordering, which may be surprising if you expect disordered results. However, disorder does not preclude order. Thus while you can generate a multiset from a Counter, be aware of the following provisos on residual ordering in Python:
- replicates get grouped together in the mapping, introducing some degree of order
- since Python 3.6, dicts preserve insertion order (a language guarantee from Python 3.7)
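The flattening comprehension above does the same job as Counter.elements(), and the residual ordering does not affect equality. A brief Python 3 sketch (3.7 or later assumed for the key order), reusing the sample values shown earlier rather than regenerating them:

```python
import collections as ct

elems = [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]  # the sample values shown above
counts = ct.Counter(elems)

# elements() yields each key repeated by its count, grouped by key,
# with keys in the order they were first counted.
flattened = list(counts.elements())
print(flattened)  # [1, 1, 5, 5, 2, 2, 7, 7, 9, 9]

# Equality between Counters ignores the order in which items were counted.
same_multiset = counts == ct.Counter(reversed(elems))
print(same_multiset)  # True
```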
Summary
In Python, a multiset can be translated to a map of multiplicities, i.e. a Counter, which is not randomly unordered like a pure set. There can be some residual ordering, which in most cases is ok since order does not generally matter in multisets.
See Also
collections-extended - a package on extra data types in collections
N. Wildberger's lectures on mathematical data structures
*Mathematically (according to N. Wildberger), braces {} denote a set and brackets [] denote a list, as in Python. Unlike in Python, commas imply order.

You can use a plain list and use list.count(element) whenever you want to access the "number" of elements.
my_list = [1, 1, 2, 3, 3, 3]
my_list.count(1) # will return 2

An alternative Python multiset implementation uses a sorted list data structure. There are a couple of implementations on PyPI. One option is the sortedcontainers module, which provides a SortedList data type that efficiently implements set-like methods such as add, remove, and contains. The sortedcontainers module is written in pure Python, is described by its authors as being as fast as C implementations (or even faster), has 100% unit test coverage, and has been through hours of stress testing.
Installation is easy from PyPI:
pip install sortedcontainers
If you can't pip install then simply pull the sortedlist.py file down from the open-source repository.
Use it as you would a set:
from sortedcontainers import SortedList
survey = SortedList(['blue', 'red', 'blue', 'green'])
survey.add('blue')
print(survey.count('blue')) # "3"
survey.remove('blue')
The sortedcontainers module also maintains a performance comparison with other popular implementations.

What you're looking for is indeed a multiset (or bag), a collection of not necessarily distinct elements (whereas a set does not contain duplicates).
There's an implementation for multisets here: https://github.com/mlenzen/collections-extended (the collections-extended package on PyPI).
The data structure for multisets is called bag. A bag is a subclass of the Set class from collections module with an extra dictionary to keep track of the multiplicities of elements.
class _basebag(Set):
    """
    Base class for bag and frozenbag. Is not mutable and not hashable, so there's
    no reason to use this instead of either bag or frozenbag.
    """
    # Basic object methods
    def __init__(self, iterable=None):
        """Create a new basebag.
        If iterable isn't given, is None or is empty then the bag starts empty.
        Otherwise each element from iterable will be added to the bag
        however many times it appears.
        This runs in O(len(iterable))
        """
        self._dict = dict()
        self._size = 0
        if iterable:
            if isinstance(iterable, _basebag):
                for elem, count in iterable._dict.items():
                    self._inc(elem, count)
            else:
                for value in iterable:
                    self._inc(value)
A nice method for bag is nlargest (similar to Counter for lists), that returns the multiplicities of all elements blazingly fast since the number of occurrences of each element is kept up-to-date in the bag's dictionary:
>>> b=bag(random.choice(string.ascii_letters) for x in xrange(10))
>>> b.nlargest()
[('p', 2), ('A', 1), ('d', 1), ('m', 1), ('J', 1), ('M', 1), ('l', 1), ('n', 1), ('W', 1)]
>>> Counter(b)
Counter({'p': 2, 'A': 1, 'd': 1, 'm': 1, 'J': 1, 'M': 1, 'l': 1, 'n': 1, 'W': 1})

You can use collections.Counter to implement a multiset, as already mentioned.
Another way to implement a multiset is by using defaultdict, which would work by counting occurrences, like collections.Counter.
Here's a snippet from the python docs:
Setting the default_factory to int makes the defaultdict useful for counting (like a bag or multiset in other languages):
>>> s = 'mississippi'
>>> d = defaultdict(int)
>>> for k in s:
...     d[k] += 1
...
>>> d.items()
[('i', 4), ('p', 2), ('s', 4), ('m', 1)]

If you need duplicates, use a list, and convert it to a set when you need to operate on it as a set.
