Is there an article or forum discussion or something somewhere that explains why lists use append/extend, but sets and dicts use add/update?
I frequently find myself converting lists into sets and this difference makes that quite tedious, so for my personal sanity I'd like to know what the rationalization is.
The need to convert between these occurs regularly as we iterate on development. Over time as the structure of the program morphs, various structures gain and lose requirements like ordering and duplicates.
For example, something that starts out as an unordered bunch of stuff in a list might pick up the requirement that there be no duplicates and so need to be converted to a set.
All such changes require finding and changing all places where the relevant structure is added/appended and extended/updated.
So I'm curious to see the original discussion that led to this language choice, but unfortunately I didn't have any luck googling for it.
append has a popular definition of "add to the very end", and extend can be read similarly (in the nuance where it means "...beyond a certain point"); sets have no "end", nor any way to specify some "point" within them or "at their boundaries" (because there are no "boundaries"!), so it would be highly misleading to suggest that these operations could be performed.
x.append(y) always increases len(x) by exactly one (whether y was already in list x or not); no such assertion holds for s.add(z) (s's length may increase or stay the same). Moreover, in these snippets, y can have any value (i.e., the append operation never fails [except for the anomalous case in which you've run out of memory]) -- again no such assertion holds about z (which must be hashable, otherwise the add operation fails and raises an exception). Similar differences apply to extend vs update. Using the same name for operations with such drastically different semantics would be very misleading indeed.
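A quick way to see the difference described above (a minimal sketch; the variable names are only for illustration):

```python
x = [1, 2]
before = len(x)
x.append(2)              # duplicates are fine in a list
grew_by_one = len(x) == before + 1

s = {1, 2}
s.add(2)                 # already present: the size stays the same
size_after_dup = len(s)
s.add(3)                 # new element: the size grows by one
size_after_new = len(s)

x.append([9])            # any object can be appended to a list
try:
    s.add([9])           # a list is unhashable, so add() raises TypeError
    add_failed = False
except TypeError:
    add_failed = True
```

So `append` and `add` only look interchangeable as long as the items happen to be unique and hashable.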
it seems pythonic to just use a list
on the first pass and deal with the
performance on a later iteration
Performance is the least of it! lists support duplicate items, ordering, and any item type -- sets guarantee item uniqueness, have no concept of order, and demand item hashability. There is nothing Pythonic in using a list (plus goofy checks against duplicates, etc) to stand for a set -- performance or not, "say what you mean!" is the Pythonic Way;-). (In languages such as Fortran or C, where all you get as a built-in container type are arrays, you might have to perform such "mental mapping" if you need to avoid using add-on libraries; in Python, there is no such need).
Edit: the OP asserts in a comment that they don't know from the start (e.g.) that duplicates are disallowed in a certain algorithm (strange, but, whatever) -- they're looking for a painless way to make a list into a set once they do discover duplicates are bad there (and, I'll add: order doesn't matter, items are hashable, indexing/slicing unneeded, etc). To get exactly the same effect one would have if Python's sets had "synonyms" for the two methods in question:
class somewhatlistlikeset(set):
    def append(self, x): self.add(x)
    def extend(self, x): self.update(x)
Of course, if the only change is at the set creation (which used to be list creation), the code may be much more challenging to follow, having lost the useful clarity whereby using add vs append allows anybody reading the code to know "locally" whether the object is a set vs a list... but this, too, is part of the "exactly the same effect" above-mentioned!-)
set and dict are unordered. "Append" and "extend" conceptually only apply to ordered types.
It's written that way to annoy you.
Seriously. It's designed so that one can't simply convert one into the other easily. Historically, sets are based off dicts, so the two share naming conventions. While you could easily write a set wrapper to add these methods ...
class ListlikeSet(set):
    def append(self, x):
        self.add(x)
    def extend(self, xs):
        self.update(xs)
... the greater question is why you find yourself converting lists to sets with such regularity. They represent substantially different models of a collection of objects; if you have to convert between the two a lot, it suggests you may not have a very good handle on the conceptual architecture of your program.
Python's documentation has a table with "Common Sequence Operations" that are "supported by most sequence types". It lists for example x in s, s[i], and len(s), which the sequence can support with methods __contains__, __getitem__ and __len__. But it also lists min(s) and max(s), and I don't understand why. Those two work on any iterable, I don't see anything special about them in relation to sequences. There are no __min__ and __max__ or any other ways to truly support them, are there? And if there were, I'd expect max(range(10**8)) to give me the result instantly, not take several seconds. Just like 10**20 in range(10**30) does. And if min and max are just there to showcase built-in functions, I'd rather expect reversed to be listed, as that really has something to do with sequences (it works on every sequence, but not on every iterable).
So am I overlooking something? Or did __min__ and __max__ or some other way to truly support min and max exist in previous Python versions, and the table wasn't updated? Or is there some other good reason to list them there? I'm confused.
The first paragraph in that section even says:
The collections.abc.Sequence ABC is provided to make it easier to correctly implement these operations on custom sequence types.
That sounds like people writing custom sequence types are somehow expected to implement them. That makes no sense to me unless there's an actual way to implement them.
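To make the distinction in the question concrete, here is a small sketch (the generator `gen` is just an illustrative non-sequence iterable):

```python
def gen():
    # A plain iterable that is not a sequence: no __len__, no __getitem__
    yield from (3, 1, 2)

smallest = min(gen())    # min() only needs to iterate
largest = max(gen())     # so does max()

try:
    reversed(gen())      # reversed() needs a sequence (or __reversed__)
    reversed_ok = True
except TypeError:
    reversed_ok = False

reversed_list = list(reversed([3, 1, 2]))

# Membership of an int in a range is computed arithmetically,
# so this returns immediately without iterating
huge_member = 10**20 in range(10**30)
```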
It looks like a documentation bug; the original commit for this was made in 1998. Probably the idea was to list all operations a user can perform on sequences, not necessarily the ones that can be overridden.
reversed on the other hand was added in 2003.
When the sentence "The collections.abc.Sequence ABC is provided to make it easier to correctly implement these operations on custom sequence types" was added, either the min/max functions should have been removed from the table to prevent such confusion (as there are no dunder methods available yet to override the behaviour of these two built-in functions), or the wording should have been improved.
My teacher says that tuples are faster than lists because tuples are immutable, but I don't understand the reason.
I personally think that tuples are faster than lists because tuples are hashable and lists are not hashable.
Please tell me if I am right or wrong.
No, being hashable has nothing to do with being faster.
As in Order to access an element from a collection that is hashable it requires constant time.
You're getting things backward. The time to look up a hashable element in a collection that uses a hash table (like a set) is constant. But that's about the elements being hashable, not the collection, and it's about the collection using a hash table instead of an array, and it's about looking them up by value instead of by index.
Looking up a value in an array by index—whether the value or the array is hashable or not—takes constant time. Searching an array by value takes linear time. (Unless, e.g., it's sorted and you search by bisecting.)
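The different kinds of lookup mentioned above, side by side (a minimal sketch; the data is made up for illustration):

```python
import bisect

items = list(range(0, 1000, 2))   # sorted even numbers, 500 of them
as_set = set(items)

fourth = items[3]                 # lookup by index: constant time
found_linear = 998 in items       # search a list by value: linear scan
found_hashed = 998 in as_set      # hash-table membership: constant time on average

# Searching a sorted list by bisecting: logarithmic time
pos = bisect.bisect_left(items, 998)
found_bisect = pos < len(items) and items[pos] == 998
```

Note that none of this depends on whether `items` itself is hashable; swapping the list for a tuple changes nothing about these costs.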
Your teacher is only partly right—but then they may have been simplifying things to avoid getting into gory details.
There are three reasons why tuples are faster than lists for some operations.
But it's worth noting that these are usually pretty small differences, and usually hard to predict.[1] Almost always, you just want to use whichever one makes more sense, and if you occasionally do find a bottleneck where a few % would make a difference, pull it out and timeit both versions and see.
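A sketch of what "timeit both versions" could look like. The operation being timed (`sum`) is only a stand-in for whatever your real bottleneck is, and the numbers will differ from machine to machine, so no expected output is shown:

```python
import timeit

as_list = list(range(1000))
as_tuple = tuple(as_list)

# Time the same operation against each container
list_time = timeit.timeit(lambda: sum(as_list), number=2000)
tuple_time = timeit.timeit(lambda: sum(as_tuple), number=2000)

print(f"list:  {list_time:.4f}s")
print(f"tuple: {tuple_time:.4f}s")
```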
First, there are some operations that are optimized differently for the two types. Of course this is different for different implementations and even different versions of the same implementation, but a few examples from CPython 3.7:
When sorting a list of tuples, there's a special unsafe_tuple_compare that isn't applied to lists.
When comparing two lists for == or !=, there's a special is test to short-circuit the comparison, which sometimes speeds things up a lot, but otherwise slows things down a little. Benchmarking a whole mess of code showed that this was worth doing for lists, but not for tuples.
Mutability generally doesn't enter into it for these choices; it's more about how the two types are typically used (lists are often homogeneously-typed but arbitrary-length, while tuples are often heterogeneously-typed and consistent-length). However, it's not quite irrelevant: e.g., the fact that a list can be made to contain itself (because they're mutable) and a tuple can't (because they aren't) prevents at least one minor optimization from being applied to lists.[2]
Second, two equal tuple constants in the same compilation unit can be merged into the same value. And at least CPython and PyPy usually do so. Which can speed some things up (if nothing else, you get better cache locality when there's less data to cache, but sometimes it means bigger savings, like being able to use is tests).
And this one is about mutability: the compiler is only allowed to merge equal values if it knows they're immutable.
Third, lists of the same size are bigger. Allocating more memory, using more cache lines, etc. slows things down a little.
And this one is also about mutability. A list has to have room to grow on the end; otherwise, calling append N times would take N**2 time. But tuples don't have to append.
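You can get a rough look at this in CPython with `sys.getsizeof` (a sketch only: the exact byte counts are implementation details and vary between versions, so only the relative comparison is meaningful):

```python
import sys

values = list(range(10))
list_size = sys.getsizeof(values)
tuple_size = sys.getsizeof(tuple(values))

# Watch a list's allocation grow in steps rather than on every append
growing = []
sizes = []
for i in range(20):
    growing.append(i)
    sizes.append(sys.getsizeof(growing))
distinct_sizes = sorted(set(sizes))

print(list_size, tuple_size)
print(distinct_sizes)
```

The handful of distinct sizes across 20 appends is the spare room at the end that keeps repeated `append` calls cheap.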
1. There are a handful of cases that come up often enough in certain kinds of problems that some people who deal with those problems all the time learn them and remember them. Occasionally, you'll see an answer on an optimization question on Stack Overflow where someone chimes in, "this would probably be about 3% faster with a tuple instead of a list", and they're usually right.
2. Also, I could imagine a case where a JIT compiler like the one in PyPy could speed things up better with a tuple. If you run the same code a million times in a row with the same values, you're going to get a million copies of the same answer, unless the value changes. If the value is a tuple of two objects, PyPy can add guards to see if either of those objects changes, and otherwise just reuse the last value. If it's a list of two objects, PyPy would have to add guards to the two objects and the list, which is 50% more checking. Whether this actually happens, I have no idea; every time I try to trace through how a PyPy optimization works and generalize from there, I turn out to be wrong, and I just end up concluding that Armin Rigo is a wizard.
I like to use collections.OrderedDict sometimes when I need an associative array where the order of the keys should be retained. Best example I have of this is in parsing or creating csv files, where it's useful to have the order of columns retained implicitly in the object.
But I'm worried that this is bad practice, since it seems to me that the whole concept of an associative array is that the order of the keys should never matter, and that any operations which rely on ordering should just use lists because that's why lists exist (this can be done for the csv example above). I don't have data on this, but I'm willing to bet that the performance for lists is universally better than OrderedDict.
So my question is: Are there any really compelling use cases for OrderedDict? Is the csv use case a good example of where it should be used or a bad one?
But I'm worried that this is bad practice, since it seems to me that the whole concept of an associative array is that the order of the keys should never matter,
Nonsense. That's not the "whole concept of an associative array". It's just that the order rarely matters and so we default to surrendering the order to get a conceptually simpler (and more efficient) data structure.
and that any operations which rely on ordering should just use lists because that's why lists exist
Stop it right there! Think a second. How would you use lists? As a list of (key, value) pairs, with unique keys, right? Well congratulations, my friend, you just re-invented OrderedDict, just with an awful API and really slow. Any conceptual objections to an ordered mapping would apply to this ad hoc data structure as well. Luckily, those objections are nonsense. Ordered mappings are perfectly fine, they're just different from unordered mappings. Giving it an aptly-named dedicated implementation with a good API and good performance improves people's code.
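To see what "re-inventing OrderedDict" with a list looks like, here is a rough sketch. The helpers `pairs_get` and `pairs_set` are hypothetical, written only to show the bookkeeping you would need:

```python
from collections import OrderedDict

pairs = [("name", "Ada"), ("year", 1843)]

def pairs_get(pairs, key):
    """Linear scan: the lookup you would hand-roll on a list of pairs."""
    for k, v in pairs:
        if k == key:
            return v
    raise KeyError(key)

def pairs_set(pairs, key, value):
    """Replace in place if the key exists, else append, to keep keys unique."""
    for i, (k, _) in enumerate(pairs):
        if k == key:
            pairs[i] = (key, value)
            return
    pairs.append((key, value))

pairs_set(pairs, "year", 1815)
pairs_set(pairs, "field", "math")

# The same thing with the dedicated type
od = OrderedDict([("name", "Ada"), ("year", 1843)])
od["year"] = 1815
od["field"] = "math"
```

Both end up holding the same pairs in the same order, but every lookup and update on the list version walks the whole list.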
Aside from that: Lists are only one kind of ordered data structure. And while they are somewhat universal in that you can build virtually all data structures out of some combination of lists (if you bend over backwards), that doesn't mean you should always use lists.
I don't have data on this, but I'm willing to bet that the performance for lists is universally better than OrderedDict.
Data (structures) doesn't (don't) have performance. Operations on data (structures) have. And thus it depends on what operations you're interested in. If you just need a list of pairs, a list is obviously correct, and iterating over it or indexing it is quite efficient. However, if you want a mapping that's also ordered, or even a tiny subset of mapping functionality (such as handling duplicate keys), then a list alone is pretty awful, as I already explained above.
For your specific use case (writing csv files) an ordered dict is not necessary. Instead, use a DictWriter.
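A minimal sketch of the DictWriter approach, writing to an in-memory buffer so it runs anywhere (the column names and rows are made up). The column order comes from `fieldnames`, not from the dicts:

```python
import csv
import io

rows = [{"name": "Ada", "year": 1815}, {"name": "Alan", "year": 1912}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "year"])
writer.writeheader()      # emits the header row in fieldnames order
writer.writerows(rows)    # each dict is written in fieldnames order
output = buffer.getvalue()
```

With a real file you would open it with `newline=""`, as the csv module documentation recommends.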
Personally I use OrderedDict when I need some LIFO/FIFO access, for which it even has the popitem method. I honestly couldn't think of a good use case, but the one mentioned in PEP 372 for attribute order is a good one:
XML/HTML processing libraries currently drop the ordering of
attributes, use a list instead of a dict which makes filtering
cumbersome, or implement their own ordered dictionary. This affects
ElementTree, html5lib, Genshi and many more libraries.
If you are ever questioning why there is some feature in Python, the PEP is a good place to start because that's where the justification that leads to the inclusion of the feature is detailed.
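For the LIFO/FIFO access mentioned above, `popitem` takes a `last` flag; a short sketch with made-up keys:

```python
from collections import OrderedDict

jobs = OrderedDict()
jobs["a"] = 1
jobs["b"] = 2
jobs["c"] = 3

newest = jobs.popitem()             # LIFO: removes the most recently inserted pair
oldest = jobs.popitem(last=False)   # FIFO: removes the earliest inserted pair
remaining = list(jobs.items())
```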
Probably a comment would suffice...
I think it would be questionable if you use it in places where you don't need it (where order is irrelevant and an ordinary dict would suffice). Otherwise the code will probably be simpler than using lists.
This is valid for any language construct/library - if it makes your code simpler, use the higher level abstraction/implementation.
As long as you feel comfortable with this data structure, and it fits your needs, why care? Perhaps it is not the most efficient one (in terms of speed, etc.), but, if it's there, it's obviously because it's useful in certain cases (or nobody would have thought of writing it).
You can basically use three types of associative arrays in Python:
the classic hash table (no order at all)
the OrderedDict (order which mirrors the way the object was created)
and the binary trees - this is not in the standard lib -, which order their keys exactly as you want, in a custom order (not necessarily the alphabetical one).
So, in fact, the order of the keys can matter. Just choose the structure that you think is the more appropriate to do the job.
For CSV and similar constructs of repeated keys use a namedtuple. It is the best of both worlds.
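One way this could look, as a sketch with made-up in-memory data (the type name `Row` is just for illustration; note the header names have to be valid Python identifiers for this to work):

```python
import csv
import io
from collections import namedtuple

data = io.StringIO("name,year\nAda,1815\nAlan,1912\n")
reader = csv.reader(data)

Row = namedtuple("Row", next(reader))       # the header row supplies the field names
records = [Row(*line) for line in reader]   # values stay strings, as csv reads them

first_name = records[0].name    # access by column name
first_year = records[0][1]      # or by position, in column order
```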
One of the reasons I love Python is the expressive power / reduced programming effort provided by tuples, lists, sets and dictionaries. Once you understand list comprehensions and a few of the basic patterns using in and for, life gets so much better! Python rocks.
However I do wonder why these constructs are treated as differently as they are, and how this is changing (getting stranger) over time. Back in Python 2.x, I could've made an argument they were all just variations of a basic collection type, and that it was kind of irritating that some non-exotic use cases require you to convert a dictionary to a list and back again. (Isn't a dictionary just a list of tuples with a particular uniqueness constraint? Isn't a list just a set with a different kind of uniqueness constraint?).
Now in the 3.x world, it's gotten more complicated. There are now named tuples -- starting to feel more like a special-case dictionary. There are now ordered dictionaries -- starting to feel more like a list. And I just saw a recipe for ordered sets. I can picture this going on and on ... what about unique lists, etc.
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It seems to me this profusion of specialized collections types is in conflict with this Python precept.
What do the hardcore Pythonistas think?
These data types all serve different purposes, and in an ideal world you might be able to unify them more. However, in the real world we need to have efficient implementations of the basic collections, and e.g. ordering adds a runtime penalty.
The named tuples mainly serve to make the interface of stat() and the like more usable, and also can be nice when dealing with SQL row sets.
The big unification you're looking for is actually there, in the form of the different access protocols (getitem, getattr, iter, ...), which these types mix and match for their intended purposes.
tl;dr (duck-typing)
You're correct to see some similarities in all these data structures. Remember that python uses duck-typing (if it looks like a duck and quacks like a duck then it is a duck). If you can use two objects in the same situation then, for your current intents and purposes, they might as well be the same data type. But you always have to keep in mind that if you try to use them in other situations, they may no longer behave the same way.
With this in mind we should take a look at what's actually different and the same about the four data types you mentioned, to get a general idea of the situations where they are interchangeable.
Mutability (can you change it?)
You can make changes to dictionaries, lists, and sets. Tuples cannot be "changed" without making a copy.
Mutable: dict, list, set
Immutable: tuple
Python string is also an immutable type. Why do we want some immutable objects? I would paraphrase from this answer:
Immutable objects can be optimized a lot
Among Python's built-in containers, only the immutable ones are hashable (and only hashable objects can be members of sets, or keys in dictionaries).
Comparing across this property, lists and tuples seem like the "closest" two data types. At a high-level a tuple is an immutable "freeze-frame" version of a list. This makes lists useful for data sets that will be changing over time (since you don't have to copy a list to modify it) but tuples useful for things like dictionary keys (which must be immutable types).
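A small sketch of the dictionary-key point (the names are made up for illustration):

```python
point_names = {(0, 0): "origin", (1, 0): "unit x"}   # tuples work as keys

try:
    point_names[[0, 1]] = "unit y"    # a list is unhashable
    list_key_ok = True
except TypeError:
    list_key_ok = False

# "Freezing" the list into a tuple makes it usable as a key
point_names[tuple([0, 1])] = "unit y"
```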
Ordering (and a note on abstract data types)
A dictionary, like a set, has no inherent conceptual order to it. This is in contrast to lists and tuples, which do have an order. The order for the items in a dict or a set is abstracted away from the programmer, meaning that if element A comes before B in a for k in mydata loop, you shouldn't (and can't generally) rely on A being before B once you start making changes to mydata.
Order-preserving: list, tuple
Non-order-preserving: dict, set
Technically if you iterate over mydata twice in a row it'll be in the same order, but this is more a convenient feature of the mechanics of python, and not really a part of the set abstract data type (the mathematical definition of the data type). Lists and tuples do guarantee order though, especially tuples which are immutable.
What you see when you iterate (if it walks like a duck...)
One "item" per "element": set, list, tuple
Two "items" per "element": dict
I suppose here you could see a named tuple, which has both a name and a value for each element, as an immutable analogue of a dictionary. But this is a tenuous comparison- keep in mind that duck-typing will cause problems if you're trying to use a dictionary-only method on a named tuple, or vice-versa.
Direct responses to your questions
Isn't a dictionary just a list of tuples with a particular uniqueness
constraint?
No, there are several differences. Dictionaries have no inherent order, which is different from a list, which does.
Also, a dictionary has a key and a value for each "element". A tuple, on the other hand, can have an arbitrary number of elements, but each with only a value.
Because of the mechanics of a dictionary, where keys act like a set, you can look up values in constant time if you have the key. In a list of tuples (pairs here), you would need to iterate through the list until you found the key, meaning search would be linear in the number of elements in your list.
Most importantly, though, dictionary items can be changed, while tuples cannot.
Isn't a list just a set with a different kind of uniqueness
constraint?
Again, I'd stress that sets have no inherent ordering, while lists do. This makes lists much more useful for representing things like stacks and queues, where you want to be able to remember the order in which you appended items. Sets offer no such guarantee. However they do offer the advantage of being able to do membership lookups in constant time, while again lists take linear time.
There are now named tuples -- starting to feel more like a special-case dictionary. There are now ordered dictionaries -- starting to feel more like a list. And I just saw a recipe for ordered sets. I can picture this going on and on ... what about unique lists, etc.
To some degree I agree with you. However, data structure libraries can be useful to support common use-cases for already well-established data structures. This keeps the programmer from wasting time trying to come up with custom extensions to the standard structures. As long as it doesn't get out of hand, and we can still see the unique usefulness in each solution, it's good to have a wheel on the shelf so we don't need to reinvent it.
A great example is the Counter() class. This specialized dictionary has been of use to me more times than I can count (badoom-tshhhhh!) and it has saved me the effort of coding up a custom solution. I'd much rather have a solution that the community is helping me to develop and keep with proper python best-practices than something that sits around in my custom data structures folder and only gets used once or twice a year.
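For anyone who hasn't used it, a minimal sketch of what Counter saves you from writing by hand (the sample text is made up):

```python
from collections import Counter

words = "the quick brown fox jumps over the lazy dog the end".split()
counts = Counter(words)

top = counts.most_common(1)   # the most frequent item with its count
missing = counts["cat"]       # missing keys count as zero, no KeyError
```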
First of all, Ordered Dictionaries and Named Tuples were introduced in Python 2, but that's beside the point.
I won't point you at the docs since if you were really interested you would have read them already.
The first difference between collection types is mutability. tuple and frozenset are immutable types. This means they can be more efficient than list or set.
If you want something you can access randomly or in order, but will mainly change at the end, you want a list. If you want something you can also change at the beginning, you want a deque.
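Roughly, the list vs. deque trade-off looks like this (a minimal sketch):

```python
from collections import deque

items = [1, 2, 3]
items.append(4)        # cheap: amortized constant time at the end
items.insert(0, 0)     # works, but shifts every element: linear time

queue = deque([1, 2, 3])
queue.append(4)        # constant time at the right end
queue.appendleft(0)    # constant time at the left end too
first = queue.popleft()
```

The price for the deque is that indexing into the middle is no longer constant time.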
You simply can't have your cake and eat it too -- every feature you add causes you to lose some speed.
dict and set are fundamentally different from lists and tuples. They store the hash of their keys, allowing you to see if an item is in them very quickly, but require that the key be hashable. You don't get the same membership testing speed with linked lists or arrays.
When you get to OrderedDict and NamedTuple, you're talking about subclasses of the builtin types implemented in Python, rather than in C. They are for special cases, just like any other code in the standard library you have to import. They don't clutter up the namespace but are nice to have when you need them.
One of these days, you'll be coding, and you'll say, "Man, now I know exactly what they meant by 'There should be one-- and preferably only one --obvious way to do it', a set is just what I needed for this, I'm so glad it's part of the Python language! If I had to use a list, it would take forever." That's when you'll understand why these different types exist.
A dictionary is indexed by key (in fact, it's a hash map); a generic list of tuples won't be. You might argue that both should be implemented as relations, with the ability to add indices at will, but in practice having optimized types for the common use cases is both more convenient and more efficient.
New specialized collections get added because they are common enough that lots of people would end up implementing them using more basic data types, and then you'd have the usual problems with wheel reinvention (wasted effort, lack of interoperability...). And if Python just offered an entirely generic construct, then we'd get lots of people asking "how do I implement a set using a relation", etc.
(btw, I'm using relation in the mathematical or DB sense)
All of these specialized collection types provide specific functionalities that are not adequately or efficiently provided by the "standard" data types of list, tuple, dict, and set.
For example, sometimes you need a collection of unique items, and you also need to retain the order in which you encountered them. You can do this using a set to keep track of membership and a list to keep track of order, but your solution will probably be slower and more memory-hungry than a specialized data structure designed for exactly this purpose, such as an ordered set.
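The "set for membership plus list for order" approach described above could be sketched like this (`unique_in_order` is a hypothetical helper name):

```python
def unique_in_order(items):
    """Return the distinct items in first-seen order (items must be hashable)."""
    seen = set()      # fast membership checks
    ordered = []      # remembers the encounter order
    for item in items:
        if item not in seen:
            seen.add(item)
            ordered.append(item)
    return ordered

result = unique_in_order([3, 1, 3, 2, 1])
```

It works, but every item is stored twice, which is the memory cost the paragraph above refers to.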
These additional data types, which you see as combinations or variations on the basic ones, actually fill gaps in functionality left by the basic data types. From a practical perspective, if Python's core or standard library did not provide these data types, then anyone who needed them would invent their own inefficient versions. They are used less often than the basic types, but often enough to make it worth while to provide standard implementations.
One of the things I like most in Python is agility. And a lot of functional, effective and usable collection types give it to me.
And there is still one way to do this - each type does its own job.
The world of data structures (language agnostic) can generally be boiled down to a few small basic structures - lists, trees, hash-tables and graphs, etc. and variants and combinations thereof. Each has its own specific purpose in terms of use and implementation.
I don't think that you can do things like reduce a dictionary to a list of tuples with a particular uniqueness constraint without actually specifying a dictionary. A dictionary has a specific purpose - key/value look-ups - and the implementation of the data structure is generally tailored to those needs. Sets are like dictionaries in many ways, but certain operations on sets don't make sense on a dictionary (union, disjunction, etc).
I don't see this violating the 'Zen of Python' of doing things one way. While you can use a sorted dictionary to do what a dictionary does without using the sorted part, you're more violating Occam's razor and likely causing a performance penalty. I see this as different than being able to syntactically do things different ways a la Perl.
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It seems to me this profusion of specialized collections types is in conflict with this Python precept.
Not remotely. There are several different things being done here. We choose the right tool for the job. All of these containers are modeled on decades-old tried, tested and true CS concepts.
Dictionaries are not like tuples: they are optimized for key-value lookup. The tuple is also immutable, which distinguishes it from a list (you could think of it as sort of like a frozenlist). If you find yourself converting dictionaries to lists and back, you are almost certainly doing something wrong; an example would help.
Named tuples exist for convenience and are intended to replace simple classes rather than dictionaries, really. Ordered dictionaries are just a bit of wrapping to remember the order in which things were added to the dictionary. And neither is new in 3.x (although there may be better language support for them; I haven't looked).
I have a small Python program consisting of very few modules (about 4 or so). The main module creates a list of tuples, thereby representing a number of records. These tuples are available to the other modules through a simple function that returns them (say, get_records()).
I am not sure if this is good design however. The problem being that the other modules need to know the indexes of each element in the tuple. This increases coupling between the modules, and isn't very transparent to someone who wants to use the main module.
I can think of a couple of alternatives:
Make the index values of the tuple elements available as module constants (e.g., IDX_RECORD_TITLE, IDX_RECORD_STARTDATE, etc.). This avoids the need of magic numbers like title = record[3].
Don't use tuples, but create a record class, and return a list of these class objects. The advantage being that the class methods will have self-explaining names like record.get_title().
Don't use tuples, but dictionaries instead. So in this scenario, the function would return a list of dictionaries. The advantage being that the dictionary keys are also self-explanatory (though someone using the module would need to know them). But this seems like a huge overhead.
I find tuples to be one of the great strengths of Python (very easy to pass compound data around without the coding overhead of classes/objects), so I currently use (1), but still wonder what would be the best approach.
http://docs.python.org/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields
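A sketch of how the linked namedtuple factory could apply here. The field names and the sample data are assumptions for illustration; only `get_records()` comes from the question:

```python
from collections import namedtuple

# Hypothetical record layout; use whatever fields your tuples actually hold
Record = namedtuple("Record", ["title", "start_date"])

def get_records():
    """Stand-in for the main module's function, returning named records."""
    return [
        Record("Kickoff", "2024-01-15"),
        Record("Review", "2024-02-01"),
    ]

first = get_records()[0]
title = first.title          # self-explanatory access, no magic index
same_title = first[0]        # existing index-based code keeps working
t, start = first             # and so does tuple unpacking
```

Other modules then depend on field names rather than positions, without giving up tuples.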
I do not see any overhead or complexity in passing objects over tuples (tuples are also objects).
IMO if a tuple serves your purpose easily, use it, but since you have seen the constraints, just switch to a class which represents your data cleanly, e.g.
class MyData(object):
    def __init__(self, title, desc):
        self.title = title
        self.desc = desc
You need not add any getter or setter methods.
In those cases, I tend to use dictionaries.
If only to have things easily understandable for myself when I come back a bit later to use the code.
I don't know if it's a "huge overhead". I guess it depends on how often you do it and what it is used for. I start off with the easiest solution and optimize when I really need to. It's surprisingly seldom that I need to change something like that.