Python has some great structures for modeling data.
Here are some:
+-------------+-------------------------+------------------------------------+
|             | indexed by int          | not indexed by int                 |
+-------------+-------------------------+------------------------------------+
| not indexed | [1, 2, 3]               | {1, 2, 3}                          |
| by key      | or                      | or                                 |
|             | [x+1 for x in range(3)] | {x+1 for x in range(3)}            |
+-------------+-------------------------+------------------------------------+
| indexed     |                         | {'a': 97, 'b': 98, 'c': 99}        |
| by key      |                         | or                                 |
|             |                         | {chr(x): x for x in range(97,100)} |
+-------------+-------------------------+------------------------------------+
Why doesn't Python include, by default, a structure indexed by both key and int (like a PHP array)? I know the standard library emulates this object (http://docs.python.org/3/library/collections.html#ordereddict-objects), but here is the representation of an OrderedDict taken from the documentation:
OrderedDict([('pear', 1), ('apple', 4), ('orange', 2), ('banana', 3)])
Wouldn't it be better to have a native type that could logically be written like this:
['a': 97, 'b': 98, 'c': 99]
And, following the same logic, an OrderedDict comprehension:
[chr(x):x for x in range(97,100)]
Does it make sense to fill that table cell this way in Python's design? Is there any particular reason this has not been implemented yet?
Python's dictionaries are implemented as hash tables. Those are inherently unordered data structures. While it is possible to add extra logic to keep track of the order (as is done in collections.OrderedDict in Python 2.7 and 3.1+), there's a non-trivial overhead involved.
For instance, the recipe that the collections documentation suggests for use in Python 2.4-2.6 requires more than twice as much work to complete many basic dictionary operations (such as adding and removing values). This is because it must maintain a doubly-linked list to use for ordered iteration, and it needs an extra dictionary to help maintain that list. While its operations are still O(1), the constant factors are larger.
Since Python uses dict instances everywhere (for all variable lookups, for instance), they need to be very fast or every part of every program will suffer. Since ordered iteration is not needed very often, it makes sense to avoid the overhead it requires in the general case. If you need an ordered dictionary, use the one in the standard library (or the recipe it suggests, if you're using an earlier version of Python).
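For example, a minimal sketch of using the standard library's OrderedDict:

from collections import OrderedDict

# Keys come back in insertion order, at the cost of the extra
# bookkeeping described above.
d = OrderedDict()
d['pear'] = 1
d['apple'] = 4
d['orange'] = 2

print(list(d.keys()))  # ['pear', 'apple', 'orange']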
Your question appears to be "why does Python not have native PHP-style arrays with ordered keys?"
Python has three core non-scalar datatypes: list, dict, and tuple. Dicts and tuples are absolutely essential for implementing the language itself: they are used for assignment, argument unpacking, attribute lookup, etc. Although not really used for the core language semantics, lists are pretty essential for data and programs in Python. All three must be extremely lightweight, have very well-understood semantics, and be as fast as possible.
PHP-style arrays are none of these things. They are not fast or lightweight, have poorly defined runtime complexity, and they have confused semantics, since they can be used for so many different things; look at the array functions. They are actually a terrible datatype for almost every use case except the very narrow one for which they were created: representing x-www-form-encoded data. Even for that use case a failing is that later keys overwrite the values of earlier keys: in PHP, ?a=1&a=2 results in array('a'=>2). (A common structure for dealing with this in Python is the MultiDict, which has ordered keys and values, and in which each key can have multiple values.)
PHP has one datatype that must be used for pretty much every use case without being great for any of them. Python has many different datatypes (some core, many more in external libraries) which excel at much more narrow use cases.
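As a rough sketch of the MultiDict idea mentioned above, using only the standard library (real MultiDict implementations offer much more):

from collections import defaultdict

# Keep all values for repeated keys, instead of the PHP behaviour
# where ?a=1&a=2 leaves only 'a' => 2.
pairs = [('a', '1'), ('a', '2'), ('b', '3')]

multi = defaultdict(list)
for key, value in pairs:
    multi[key].append(value)

print(multi['a'])  # ['1', '2'] -- both values are kept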
Adding a new answer with updated information: as of CPython 3.6, dicts preserve insertion order. They are still not index-accessible, most likely because integer-based item lookup would be ambiguous, since dict keys can themselves be ints. (Some custom use cases for that exist.)
Unfortunately, the documentation for dict hasn't been updated to reflect this (yet) and still says "Keys and values are iterated over in an arbitrary order which is non-random". Ironically, the collections.OrderedDict docs mention the new behaviour:
Changed in version 3.6: With the acceptance of PEP 468, order is retained for keyword arguments passed to the OrderedDict constructor and its update() method.
And here's an article mentioning some more details about it:
A minor but useful internal improvement: Python 3.6 preserves the order of elements for more structures. Keyword arguments passed to a function, attribute definitions in a class, and dictionaries all preserve the order of elements as they were defined.
So if you're only writing code for Py36 onwards, you shouldn't need collections.OrderedDict unless you're using popitem, move_to_end or order-based equality.
For example, in Python 2.7:
>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d
{'a': 1, 0: None, 'c': 3, 'b': 2, 'd': 4}
And in Python 3.6:
>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d['new'] = 'really?'
>>> d[None]= None
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None, 'new': 'really?', None: None}
>>> d['a'] = 'aaa'
>>> d
{'a': 'aaa', 'b': 2, 'c': 3, 'd': 4, 0: None, 'new': 'really?', None: None}
>>>
>>> # equality is not order-based
>>> d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d2 = {'b': 2, 'a': 1, 'd': 4, 'c': 3, 0: None}
>>> d2
{'b': 2, 'a': 1, 'd': 4, 'c': 3, 0: None}
>>> d1 == d2
True
As of Python 3.7 this is now guaranteed behavior for dictionaries; it was an implementation detail in CPython 3.6 that was officially adopted in June 2018:
the insertion-order preservation nature of dict objects has been declared to be an official part of the Python language spec.
https://docs.python.org/3/whatsnew/3.7.html
Related
The frozenset docs say:
The frozenset type is immutable and hashable — its contents cannot be altered after it is created; it can therefore be used as a dictionary key or as an element of another set.
However, the docs for Python sets say:
Since sets only define partial ordering (subset relationships), the output of the list.sort() method is undefined for lists of sets.
This makes me ask: why is this the case? And if I wanted to sort a list of sets by set content, how could I do it? I know that the intbitset extension (https://pypi.python.org/pypi/intbitset/2.3.0) has a function that returns a bit sequence representing the set contents. Is there something comparable for Python sets?
Tuples, lists, strings, etc. have a natural lexicographic ordering and can be sorted because you can always compare two elements of a given collection. That is, either a < b, b < a, or a == b.
A natural comparison between two sets is having a <= b mean a is a subset of b, which is what the expression a <= b actually does in Python. What the documentation means by "partial ordering" is that not all sets are comparable. Take, for example, the following sets:
a = {1, 2, 3}
b = {4, 5, 6}
Is a a subset of b? No. Is b a subset of a? No. Are they equal? No. If you can't compare them at all, you clearly can't sort them.
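You can check this directly at the console:

>>> a = {1, 2, 3}
>>> b = {4, 5, 6}
>>> a <= b, b <= a, a == b
(False, False, False)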
The only way you can sort a collection of sets is if your comparison function actually can compare any two elements (a total order). This means you can still sort a collection of sets using the above subset relation, but you will have to ensure that all of the sets are comparable (e.g. [{1}, {1, 2, 4}, {1, 2}]).
The easiest way to do what you want is to transform each individual set into something that you actually can compare. Basically, you compare f(a) <= f(b) (where <= is well-defined) for some simple function f. This is done with the key keyword argument:
In [10]: def f(some_set):
    ...:     return max(some_set)
    ...:
In [11]: sorted([{1, 2, 3, 999}, {4, 5, 6}, {7, 8, 9}], key=f)
Out[11]: [{4, 5, 6}, {7, 8, 9}, {1, 2, 3, 999}]
You're sorting [f(set1), f(set2), f(set3)] and applying the resulting ordering to [set1, set2, set3].
Take an example: say you wanted to sort a list of sets by the "first element" of each set. The issue is that Python sets or frozensets don't have a "first element." They have no sense of their own ordering. A set is an unordered collection with no duplicate elements.
Furthermore, list.sort() sorts the list in place, using only the < operator between items.
If you just use a.sort() without passing any key parameter, comparing set_a < set_b (i.e. set_a.__lt__(set_b)) is insufficient, because set_a < set_b is a proper-subset test (is set_a a proper subset of set_b?). As mentioned by @Blender and referenced in your question, this gives a partial rather than total ordering, which is not enough to define a single sorted order for the sets.
From the docs:
set < other: Test whether the set is a proper subset of other, that is, set <= other and set != other.
You could pass a key to sort(); it just can't refer to any internal "ordering" of the sets, because, remember, there is none.
>>> a = {2, 3, 1}
>>> b = {6, 9, 0, 1}
>>> c = {0}
>>> i = [b, a, c]
>>> i.sort(key=len)
>>> i
[{0}, {1, 2, 3}, {0, 9, 6, 1}]
The order of keys stored in dictionaries in Python 3.5 changes across different executions of the interpreter, but it seems to stay the same within a single interpreter instance.
$ python3 <(printf 'print({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})')
{'b': 2, 'a': 1}
{'b': 2, 'a': 1}
{'b': 2, 'a': 1}
{'b': 2, 'a': 1}
$ python3 <(printf 'print({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})')
{'a': 1, 'b': 2}
{'a': 1, 'b': 2}
{'a': 1, 'b': 2}
{'a': 1, 'b': 2}
I always thought the order was based on the hash of the key. Why does the order differ between executions of Python?
Dictionaries use a hash function, and the order is indeed based on the hash of the key.
But, as stated elsewhere in this Q&A, starting from Python 3.3 the seed of the hash is chosen randomly at execution time (and hashing also differs across Python versions).
Note that as of Python 3.3, a random hash seed is used as well, making hash collisions unpredictable to prevent certain types of denial of service (where an attacker renders a Python server unresponsive by causing mass hash collisions). This means that the order of a given dictionary is then also dependent on the random hash seed for the current Python invocation.
So each time you execute your program, you may get a different order.
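If you need to reproduce an ordering while debugging, hash randomization can be disabled with the PYTHONHASHSEED environment variable (the order is then repeatable across runs, but remains an implementation detail that varies between Python versions):

$ PYTHONHASHSEED=0 python3 -c 'print({"a": 1, "b": 2})'
$ PYTHONHASHSEED=0 python3 -c 'print({"a": 1, "b": 2})'
# both runs now print the same (version-dependent) order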
Since the order of dictionaries is not guaranteed (not before Python 3.6, anyway), this is an implementation detail that you shouldn't rely on.
Dictionaries are inherently unordered; expecting any standardized behavior of the "order" is not realistic.
To keep an ordering, maintain a sorted list of the dict's .keys().
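For example, a minimal sketch of iterating in a stable order regardless of the hash seed:

d = {'b': 2, 'a': 1, 'c': 3}

# sorted() gives a deterministic, seed-independent order
for key in sorted(d):
    print(key, d[key])  # a 1, then b 2, then c 3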
How can I test whether one Python Counter is contained in another, using the following definition:
A Counter a is contained in a Counter b if, and only if, for every key k in a, the value a[k] is less than or equal to the value b[k]. For example, Counter({'a': 1, 'b': 1}) is contained in Counter({'a': 2, 'b': 2}), but it is not contained in Counter({'a': 2, 'c': 2}).
I think it is a poor design choice, but in Python 2.x the comparison operators (<, <=, >=, >) do not use the previous definition, so the third Counter is considered greater than the first. In Python 3.x, instead, Counter is an unorderable type.
The best I came up with is to translate the definition I gave into code:
def contains(container, contained):
    return all(container[x] >= contained[x] for x in contained)
But it feels strange that Python doesn't have an out-of-the-box solution, and that I'd have to write a function for every operator (or make a generic one and pass in the comparison function).
While Counter instances are not comparable with the < and > operators, you can find their difference with the - operator. The difference never returns negative counts, so if A - B is empty, you know that B contains all the items in A.
def contains(larger, smaller):
    return not smaller - larger
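For example:

>>> from collections import Counter
>>> contains(Counter({'a': 2, 'b': 2}), Counter({'a': 1, 'b': 1}))
True
>>> contains(Counter({'a': 2, 'c': 2}), Counter({'a': 1, 'b': 1}))
False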
For all the keys in the smaller Counter, make sure that no value is greater than its counterpart in the bigger Counter:
def containment(big, small):
    return not any(v > big[k] for (k, v) in small.items())
>>> containment(Counter({'a': 2, 'b': 2}), Counter({'a': 1, 'b': 1}))
True
>>> containment(Counter({'a': 2, 'c': 2, 'b': 3}), Counter({'a': 2, 'b': 2}))
True
>>> containment(Counter({'a': 2, 'b': 2}), Counter({'a': 2, 'b': 2, 'c': 1}))
False
>>> containment(Counter({'a': 2, 'c': 2}), Counter({'a': 1, 'b': 1}))
False
Another, fairly succinct, way to express this:
"Counter A is a subset of Counter B" is equivalent to (A & B) == A.
That's because the intersection (&) of two Counters has the counts of elements common to both. That'll be the same as A if every element of A (counting multiplicity) is also in B; otherwise it will be smaller.
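For example:

>>> from collections import Counter
>>> A = Counter({'a': 1, 'b': 1})
>>> B = Counter({'a': 2, 'b': 2})
>>> (A & B) == A
True
>>> (A & Counter({'a': 2, 'c': 2})) == A
False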
Performance-wise, this seems to be about the same as the not A - B method proposed by Blckknght. Checking each key as in the answer of enrico.bacis is considerably faster.
As a variation, you can also check that the union is equal to the larger Counter (so nothing was added): (A | B) == B. This is noticeably slower for some largish multisets I tested (1,000,000 elements).
While of historical interest, all these answers are now obsolete: as of Python 3.10, Counter objects are in fact comparable, with <, <=, > and >= implemented as multiset containment (subset/superset) tests.
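A quick check on Python 3.10 or later:

>>> from collections import Counter
>>> Counter({'a': 1, 'b': 1}) <= Counter({'a': 2, 'b': 2})
True
>>> Counter({'a': 1, 'b': 1}) <= Counter({'a': 2, 'c': 2})
False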
I was wondering in what order a Python dictionary stores key:value pairs. I wrote the following in my Python shell, but I can't figure out the reason for the order in which it stores the pairs.
>>> d = {}
>>> d['a'] = 8
>>> d['b'] = 8
>>> d
{'a': 8, 'b': 8}
>>> d['c'] = 8
>>> d
{'a': 8, 'c': 8, 'b': 8}
>>> d['z'] = 8
>>> d
{'a': 8, 'c': 8, 'b': 8, 'z': 8}
>>> d['w'] = 8
>>> d
{'a': 8, 'c': 8, 'b': 8, 'z': 8, 'w': 8}
I also tried the same thing with different values for the same keys, but the order remained the same. Adding one more key:value pair gives another result that I just can't make out. Here it is:
>>> d[1] = 8
>>> d
{'a': 8, 1: 8, 'c': 8, 'b': 8, 'w': 8, 'z': 8}
The short answer is: in an implementation-defined order. You can't rely on, and shouldn't expect, any particular order, and it can change after modifying the dictionary in a supposedly irrelevant way.
It is explained, if somewhat indirectly, in the documentation on dictionary view objects:
Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond.
Elements are stored based on the hash of their key. The documentation states that a key must be a hashable type.
Dictionaries do not have a predictable order as their keys are stored by a hash. If you need order, use a list or collections.OrderedDict.
It's a hash table. The keys are partially ordered by their hash value hash(key), but the actual traversal order of the dictionary can depend on the order that elements were inserted, the number of elements in the dictionary, and possibly other factors. You should never count on it being anything in particular.
Is there a standard way to represent a "set" that can contain duplicate elements?
As I understand it, a set contains a given element either zero times or exactly once. I want the functionality to hold any number.
I am currently using a dictionary with elements as keys, and quantity as values, but this seems wrong for many reasons.
Motivation:
I believe there are many applications for such a collection. For example, a survey of favourite colours could be represented by:
survey = ['blue', 'red', 'blue', 'green']
Here, I do not care about the order, but I do about quantities. I want to do things like:
survey.add('blue')
# would give survey == ['blue', 'red', 'blue', 'green', 'blue']
...and maybe even
survey.remove('blue')
# would give survey == ['blue', 'red', 'green']
Notes:
Yes, set is not the correct term for this kind of collection. Is there a more correct one?
A list would of course work, but the collection I need is unordered. Not to mention that the method naming for sets seems more appropriate to me.
You are looking for a multiset.
Python's closest datatype is collections.Counter:
A Counter is a dict subclass for counting hashable objects. It is an
unordered collection where elements are stored as dictionary keys and
their counts are stored as dictionary values. Counts are allowed to be
any integer value including zero or negative counts. The Counter class
is similar to bags or multisets in other languages.
For an actual implementation of a multiset, use the bag class from the data-structures package on PyPI. Note that this is for Python 3 only. If you need Python 2, here is a recipe for a bag written for Python 2.4.
Your approach, a dict with elements as keys and counts as values, seems fine to me, but you probably want more functionality. Have a look at collections.Counter:
O(1) membership test and current-count retrieval (faster than element in list and list.count(element))
counter.elements() behaves like a list containing all the duplicates
easy union/difference manipulation with other Counters
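For example, a sketch of the survey use case with a Counter:

from collections import Counter

survey = Counter(['blue', 'red', 'blue', 'green'])

survey['blue'] += 1              # plays the role of survey.add('blue')
print(survey['blue'])            # 3

survey['blue'] -= 1              # plays the role of survey.remove('blue')
print(list(survey.elements()))   # ['blue', 'blue', 'red', 'green']
                                 # (element order not guaranteed before 3.7)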
Python "set" with duplicate/repeated elements
This depends on how you define a set. One may assume that, for the OP:
order does not matter (whether ordered or unordered)
replicates/repeated elements (a.k.a. multiplicities) are permitted
Given these assumptions, the options reduce to two abstract types: a list or a multiset. In Python, these types usually translate to a list and a Counter, respectively. See the Details section for some subtleties to observe.
Given
import random
import collections as ct
random.seed(123)
elems = [random.randint(1, 11) for _ in range(10)]
elems
# [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]
Code
A list of replicate elements:
list(elems)
# [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]
A "multiset" of replicate elements:
ct.Counter(elems)
# Counter({1: 2, 5: 2, 2: 2, 7: 2, 9: 2})
Details
On Data Structures
We have a mix of terms here that easily get confused. To clarify, here are some basic mathematical data structures compared to ones in Python.
Type        | Abbr | Order | Replicates | Math*     | Python    | Implementation
------------|------|-------|------------|-----------|-----------|-------------------------
Set         | Set  | n     | n          | {2 3 1}   | {2, 3, 1} | set(el)
Ordered Set | Oset | y     | n          | {1, 2, 3} | -         | list(dict.fromkeys(el))
Multiset    | Mset | n     | y          | [2 1 2]   | -         | <see `mset` below>
List        | List | y     | y          | [1, 2, 2] | [1, 2, 2] | list(el)
From the table, one can deduce the definition of each type. Example: a set is a container that ignores order and rejects replicate elements. In contrast, a list is a container that preserves order and permits replicate elements.
Also from the table, we can see:
Neither an ordered set nor a multiset is explicitly implemented in Python (see the ordered-set sketch after this list)
"Order" is a contrary term to a random arrangement of elements, e.g. sorted or insertion order
Sets and multisets are not strictly ordered. They can be ordered, but order does not matter.
Multisets permit replicates, thus they are not strict sets (the term "set" is indeed confusing).
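For example, a quick sketch of the ordered-set idiom from the table's Implementation column:

el = [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]

# Keep the first occurrence of each element, preserving insertion order.
# Relies on dicts preserving insertion order (guaranteed since Python 3.7).
ordered_set = list(dict.fromkeys(el))
print(ordered_set)  # [1, 5, 2, 7, 9]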
On Multisets
Some may argue that collections.Counter is a multiset. You are safe in many cases to treat it as such, but be aware that Counter is simply a dict (a mapping) of key-multiplicity pairs. It is a map of multiplicities. See an example of elements in a flattened multiset:
mset = [x for k, v in ct.Counter(elems).items() for x in [k]*v]
mset
# [1, 1, 5, 5, 2, 2, 7, 7, 9, 9]
Notice there is some residual ordering, which may be surprising if you expect disordered results. However, disorder does not preclude order. Thus while you can generate a multiset from a Counter, be aware of the following provisos on residual ordering in Python:
replicates get grouped together in the mapping, introducing some degree of order
in Python 3.6+, dicts preserve insertion order
Summary
In Python, a multiset can be translated to a map of multiplicities, i.e. a Counter, which is not randomly unordered like a pure set. There can be some residual ordering, which in most cases is ok since order does not generally matter in multisets.
See Also
collections-extended - a package of extra collection data types
N. Wildberger's lectures on mathematical data structures
*Mathematically (following N. Wildberger), braces {} imply a set and brackets [] imply a list, as in Python. Unlike in Python, the presence of commas implies order.
You can use a plain list and use list.count(element) whenever you want to access the "number" of elements.
my_list = [1, 1, 2, 3, 3, 3]
my_list.count(1) # will return 2
An alternative Python multiset implementation uses a sorted list data structure. There are a couple of implementations on PyPI. One option is the sortedcontainers module, which implements a SortedList data type that efficiently supports set-like methods such as add, remove, and contains. The sortedcontainers module is implemented in pure Python, is as fast as C implementations (or faster), has 100% unit test coverage, and undergoes hours of stress testing.
Installation is easy from PyPI:
pip install sortedcontainers
If you can't pip install then simply pull the sortedlist.py file down from the open-source repository.
Use it as you would a set:
from sortedcontainers import SortedList
survey = SortedList(['blue', 'red', 'blue', 'green'])
survey.add('blue')
print(survey.count('blue'))  # 3
survey.remove('blue')
The sortedcontainers module also maintains a performance comparison with other popular implementations.
What you're looking for is indeed a multiset (or bag), a collection of not necessarily distinct elements (whereas a set does not contain duplicates).
There's an implementation of multisets here: https://github.com/mlenzen/collections-extended (the collections-extended package on PyPI).
The package's data structure for multisets is called bag. A bag is a subclass of the Set abstract base class from the collections module, with an extra dictionary to keep track of the multiplicities of its elements.
class _basebag(Set):
    """
    Base class for bag and frozenbag. Is not mutable and not hashable, so there's
    no reason to use this instead of either bag or frozenbag.
    """
    # Basic object methods
    def __init__(self, iterable=None):
        """Create a new basebag.

        If iterable isn't given, is None or is empty then the bag starts empty.
        Otherwise each element from iterable will be added to the bag
        however many times it appears.

        This runs in O(len(iterable))
        """
        self._dict = dict()
        self._size = 0
        if iterable:
            if isinstance(iterable, _basebag):
                for elem, count in iterable._dict.items():
                    self._inc(elem, count)
            else:
                for value in iterable:
                    self._inc(value)
A nice method of bag is nlargest (similar to Counter's most_common), which returns the multiplicities of all elements blazingly fast, since the number of occurrences of each element is kept up to date in the bag's dictionary:
>>> b = bag(random.choice(string.ascii_letters) for x in xrange(10))
>>> b.nlargest()
[('p', 2), ('A', 1), ('d', 1), ('m', 1), ('J', 1), ('M', 1), ('l', 1), ('n', 1), ('W', 1)]
>>> Counter(b)
Counter({'p': 2, 'A': 1, 'd': 1, 'm': 1, 'J': 1, 'M': 1, 'l': 1, 'n': 1, 'W': 1})
You can use collections.Counter to implement a multiset, as already mentioned.
Another way to implement a multiset is by using defaultdict, which would work by counting occurrences, like collections.Counter.
Here's a snippet from the Python docs:
Setting the default_factory to int makes the defaultdict useful for counting (like a bag or multiset in other languages):
>>> s = 'mississippi'
>>> d = defaultdict(int)
>>> for k in s:
...     d[k] += 1
...
>>> d.items()
[('i', 4), ('p', 2), ('s', 4), ('m', 1)]
If you need duplicates, use a list, and transform it into a set when you need to operate on it as a set.