I have a small Python program consisting of very few modules (about 4 or so). The main module creates a list of tuples, thereby representing a number of records. These tuples are available to the other modules through a simple function that returns them (say, get_records()).
I am not sure this is good design, however. The problem is that the other modules need to know the index of each element in the tuple. This increases coupling between the modules, and it isn't very transparent to someone who wants to use the main module.
I can think of a couple of alternatives:
1. Make the index values of the tuple elements available as module constants (e.g. IDX_RECORD_TITLE, IDX_RECORD_STARTDATE, etc.). This avoids the need for magic numbers like title = record[3].
2. Don't use tuples, but create a record class, and return a list of these class objects. The advantage is that the class methods have self-explanatory names like record.get_title().
3. Don't use tuples, but dictionaries instead. In this scenario, the function would return a list of dictionaries. The advantage is that the dictionary keys are also self-explanatory (though someone using the module would still need to know them). But this seems like a huge overhead.
I find tuples to be one of the great strengths of Python (very easy to pass compound data around without the coding overhead of classes/objects), so I currently use (1), but still wonder what would be the best approach.
http://docs.python.org/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields
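The linked factory gives you exactly the middle ground the question asks for. A minimal sketch (the field names here are made up for illustration, not the asker's actual record layout):

```python
from collections import namedtuple

# Instances are still tuples, so existing index-based code keeps working.
Record = namedtuple("Record", ["title", "startdate"])

def get_records():
    # hypothetical stand-in for the main module's function
    return [Record(title="Report", startdate="2010-01-01")]

rec = get_records()[0]
print(rec.title)  # access by name
print(rec[0])     # or by index, as with a plain tuple
```

Because a namedtuple is a tuple subclass, you get the self-documenting names of option 2 without giving up the lightweight tuple-passing style.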
I do not see any overhead or complexity in passing objects rather than tuples (tuples are also objects).
IMO, if a tuple serves your purpose, simply use it; but once you run into the constraints you have seen, switch to a class which represents your data cleanly, e.g.
class MyData(object):
    def __init__(self, title, desc):
        self.title = title
        self.desc = desc
You need not add any getter or setter methods.
In those cases, I tend to use dictionaries.
If only to have things easily understandable for myself when I come back a bit later to use the code.
I don't know that it's a "huge overhead". I guess it depends on how often you do it and what it is used for. I start off with the easiest solution and optimize only when I really need to. It is surprisingly seldom that I need to change something like that.
This is how map is defined in Python:
map(function, iterable, ...)
As can be seen, the function is the first parameter; the same goes for filter and reduce.
But when I check functions like sorted, they are defined as
sorted(iterable, key=None, reverse=False)
key being the function that can be used while sorting. I don't know Python well enough to say whether there are other examples like sorted, but for starters this seems a bit unorganized. Coming from a C++/D background, where I can almost always tell where the function parameter will go in the standard library, this is a bit unorthodox for me.
Is there any historical or hierarchical reason why the function parameter is expected in different orders?
The actual signature of map is:
map(function, iterable, ...)
It can take more than one iterable, so making the function the first argument is the most sensible design.
You can argue about filter, there's no one "correct" way to design it, but making it the same order as map rather makes sense.
sorted doesn't require a key function, so it makes no sense to put it first.
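The two points above can be seen side by side in a short sketch: map's variadic iterables explain why its function comes first, while sorted's key is optional and so is passed by keyword.

```python
# map accepts several iterables; the function must come first so that
# any number of iterables can follow it.
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))
print(sums)  # [11, 22, 33]

# sorted works with no key at all, which is the common case...
words = sorted(["pear", "fig", "banana"])
print(words)  # ['banana', 'fig', 'pear']

# ...and takes key as a keyword argument when you do need one.
by_len = sorted(["pear", "fig", "banana"], key=len)
print(by_len)  # ['fig', 'pear', 'banana']
```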
Anyone can contribute a module into the python ecosystem. The lineage of any particular module is fairly unique (though I'm sure there are general families that share common lineages). While there will be attempts to standardise and agree on common conventions, there is a limit to what is possible.
As a result, some modules will have a set of paradigms that will differ vastly from other modules - they will have different focuses, and you just can't standardise down to the level you're looking for.
That being said, if you wanted to make that a priority, there's nothing stopping you from recasting all the non-standard things you find into a new suite of open source libraries and encouraging people to adopt them as the standard.
The design of map() is different because its purpose is different, and the purpose of a function naturally determines the parameters you pass into it. map() executes a given function for each item in an iterable, which is a very different job from that of sorted().
I'm creating some simulation software and I'm putting all the initial conditions in a yaml file to be parsed.
The thing is, there are many different types of objects to model in the simulation, and I only need to model a few in any one run of the simulation.
My first approach was a long, ugly string of if else statements which instantiate and import objects based on the initial conditions. I then replaced that with some still very ugly eval and exec statements. My question is, is there a better way to do this?
perhaps a dictionary?
simulation_objects = {
    'bird': bird.Bird,
    'water': water.Water,
    ...
}
I mean those would point to classes, so then to instantiate something for a configuration you'd do:
obtype = simulation_objects[confvar]  # get the class/type from the dict
ob = obtype()                         # instantiates it (e.g. bird.Bird())
simulation.add(ob)
This lets you instantiate the appropriate types of simulation objects based on the configuration, without a lot of if-else statements. Dictionaries are more or less Python's replacement for C's switch-case statements. And it is a rather nice functional style, I figure: a mapping of parameters to functions (well, classes here, but their constructors anyhow).
I've done this kind of thing often for games and have really liked how classes (and of course functions too) are objects in python so that you can have them in dicts etc.
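Putting the pieces above together, a self-contained sketch might look like this (Bird and Water are hypothetical stand-ins for the real simulation classes, and build for whatever walks the parsed YAML):

```python
# Hypothetical simulation classes; in practice these would be imported
# from their own modules.
class Bird:
    pass

class Water:
    pass

# The registry: configuration strings mapped to classes.
SIMULATION_OBJECTS = {
    "bird": Bird,
    "water": Water,
}

def build(configured_names):
    """Instantiate one simulation object per configured name."""
    return [SIMULATION_OBJECTS[name]() for name in configured_names]

objects = build(["bird", "water", "bird"])
print([type(o).__name__ for o in objects])  # ['Bird', 'Water', 'Bird']
```

No eval or exec needed, and an unknown name fails loudly with a KeyError instead of silently executing arbitrary text from the config file.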
I like to use collections.OrderedDict sometimes when I need an associative array where the order of the keys should be retained. Best example I have of this is in parsing or creating csv files, where it's useful to have the order of columns retained implicitly in the object.
But I'm worried that this is bad practice, since it seems to me that the whole concept of an associative array is that the order of the keys should never matter, and that any operations which rely on ordering should just use lists because that's why lists exist (this can be done for the csv example above). I don't have data on this, but I'm willing to bet that the performance for lists is universally better than OrderedDict.
So my question is: Are there any really compelling use cases for OrderedDict? Is the csv use case a good example of where it should be used or a bad one?
But I'm worried that this is bad practice, since it seems to me that the whole concept of an associative array is that the order of the keys should never matter,
Nonsense. That's not the "whole concept of an associative array". It's just that the order rarely matters and so we default to surrendering the order to get a conceptually simpler (and more efficient) data structure.
and that any operations which rely on ordering should just use lists because that's why lists exist
Stop it right there! Think a second. How would you use lists? As a list of (key, value) pairs, with unique keys, right? Well congratulations, my friend, you just re-invented OrderedDict, just with an awful API and really slow. Any conceptual objections to an ordered mapping would apply to this ad hoc data structure as well. Luckily, those objections are nonsense. Ordered mappings are perfectly fine, they're just different from unordered mappings. Giving it an aptly-named dedicated implementation with a good API and good performance improves people's code.
Aside from that: lists are only one kind of ordered data structure. And while they are somewhat universal, in that you can build virtually any data structure out of some combination of lists (if you bend over backwards), that doesn't mean you should always use lists.
I don't have data on this, but I'm willing to bet that the performance for lists is universally better than OrderedDict.
Data (structures) doesn't (don't) have performance. Operations on data (structures) have. And thus it depends on what operations you're interested in. If you just need a list of pairs, a list is obviously correct, and iterating over it or indexing it is quite efficient. However, if you want a mapping that's also ordered, or even a tiny subset of mapping functionality (such as handling duplicate keys), then a list alone is pretty awful, as I already explained above.
For your specific use case (writing csv files) an ordered dict is not necessary. Instead, use a DictWriter.
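A minimal sketch of that suggestion: DictWriter takes the column order from fieldnames, so a plain dict per row is enough.

```python
import csv
import io

# The column order lives in fieldnames, not in the row dicts,
# so no OrderedDict is needed.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "age"])
writer.writeheader()
writer.writerow({"age": 30, "name": "Ada"})  # key order is irrelevant
print(buf.getvalue())
```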
Personally, I use OrderedDict when I need some LIFO/FIFO access, for which it even has the popitem method. I honestly couldn't think of a good use case myself, but the one mentioned in PEP 372 about attribute order is a good one:
XML/HTML processing libraries currently drop the ordering of attributes, use a list instead of a dict which makes filtering cumbersome, or implement their own ordered dictionary. This affects ElementTree, html5lib, Genshi and many more libraries.
If you are ever questioning why there is some feature in Python, the PEP is a good place to start because that's where the justification that leads to the inclusion of the feature is detailed.
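The LIFO/FIFO access mentioned above can be sketched in a couple of lines:

```python
from collections import OrderedDict

# popitem pops LIFO by default; last=False gives FIFO.
d = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
print(d.popitem())            # ('c', 3)  -- LIFO end
print(d.popitem(last=False))  # ('a', 1)  -- FIFO end
print(dict(d))                # {'b': 2}
```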
Probably a comment would suffice...
I think it would be questionable if you used it in places where you don't need it (where order is irrelevant and an ordinary dict would suffice). Otherwise the code will probably be simpler than using lists.
This is valid for any language construct/library - if it makes your code simpler, use the higher level abstraction/implementation.
As long as you feel comfortable with this data structure and it fits your needs, why care? Perhaps it is not the most efficient one (in terms of speed, etc.), but if it's there, it's obviously because it's useful in certain cases (or nobody would have thought of writing it).
You can basically use three types of associative arrays in Python:
the classic hash table (no order at all)
the OrderedDict (order which mirrors the way the object was created)
and binary trees (not in the standard lib), which keep their keys in a custom order of your choosing (not necessarily alphabetical).
So, in fact, the order of the keys can matter. Just choose the structure that you think is the most appropriate for the job.
For CSV and similar constructs with repeated keys, use a namedtuple. It is the best of both worlds.
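A sketch of that idea, with made-up column names; csv.reader yields plain lists of strings, and namedtuple._make turns each one into a named row:

```python
import csv
import io
from collections import namedtuple

Row = namedtuple("Row", ["name", "age"])

# io.StringIO stands in for a real file here.
data = io.StringIO("Ada,36\nAlan,41\n")
rows = [Row._make(fields) for fields in csv.reader(data)]
print(rows[0].name, rows[1].age)  # Ada 41
```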
One of the reasons I love Python is the expressive power / reduced programming effort provided by tuples, lists, sets and dictionaries. Once you understand list comprehensions and a few of the basic patterns using in and for, life gets so much better! Python rocks.
However I do wonder why these constructs are treated as differently as they are, and how this is changing (getting stranger) over time. Back in Python 2.x, I could've made an argument they were all just variations of a basic collection type, and that it was kind of irritating that some non-exotic use cases require you to convert a dictionary to a list and back again. (Isn't a dictionary just a list of tuples with a particular uniqueness constraint? Isn't a list just a set with a different kind of uniqueness constraint?).
Now in the 3.x world, it's gotten more complicated. There are now named tuples -- starting to feel more like a special-case dictionary. There are now ordered dictionaries -- starting to feel more like a list. And I just saw a recipe for ordered sets. I can picture this going on and on ... what about unique lists, etc.
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It seems to me this profusion of specialized collections types is in conflict with this Python precept.
What do the hardcore Pythonistas think?
These data types all serve different purposes, and in an ideal world you might be able to unify them more. However, in the real world we need to have efficient implementations of the basic collections, and e.g. ordering adds a runtime penalty.
The named tuples mainly serve to make the interface of stat() and the like more usable, and also can be nice when dealing with SQL row sets.
The big unification you're looking for is actually there, in the form of the different access protocols (getitem, getattr, iter, ...), which these types mix and match for their intended purposes.
tl;dr (duck-typing)
You're correct to see some similarities in all these data structures. Remember that python uses duck-typing (if it looks like a duck and quacks like a duck then it is a duck). If you can use two objects in the same situation then, for your current intents and purposes, they might as well be the same data type. But you always have to keep in mind that if you try to use them in other situations, they may no longer behave the same way.
With this in mind we should take a look at what's actually different and the same about the four data types you mentioned, to get a general idea of the situations where they are interchangeable.
Mutability (can you change it?)
You can make changes to dictionaries, lists, and sets. Tuples cannot be "changed" without making a copy.
Mutable: dict, list, set
Immutable: tuple
Python string is also an immutable type. Why do we want some immutable objects? I would paraphrase from this answer:
Immutable objects can be optimized a lot
In Python, only immutables are hashable (and only hashable objects can be members of sets, or keys in dictionaries).
Comparing across this property, lists and tuples seem like the "closest" two data types. At a high-level a tuple is an immutable "freeze-frame" version of a list. This makes lists useful for data sets that will be changing over time (since you don't have to copy a list to modify it) but tuples useful for things like dictionary keys (which must be immutable types).
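The mutability split is easy to demonstrate: lists mutate in place, while only the immutable, hashable tuple can serve as a dictionary key.

```python
point = [1, 2]
point.append(3)               # fine: lists are mutable
coords = {(1, 2): "a point"}  # fine: tuples are hashable

try:
    bad = {[1, 2]: "a point"}  # lists are not hashable
except TypeError:
    bad = None

print(point, bad)  # [1, 2, 3] None
```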
Ordering (and a note on abstract data types)
A dictionary, like a set, has no inherent conceptual order to it. This is in contrast to lists and tuples, which do have an order. The order for the items in a dict or a set is abstracted away from the programmer, meaning that if element A comes before B in a for k in mydata loop, you shouldn't (and can't generally) rely on A being before B once you start making changes to mydata.
Order-preserving: list, tuple
Non-order-preserving: dict, set
Technically, if you iterate over mydata twice in a row, it'll be in the same order, but this is more a convenient feature of the mechanics of Python than a part of the set abstract data type (the mathematical definition of the data type). Lists and tuples, on the other hand, do guarantee order, tuples all the more so because they are immutable.
What you see when you iterate (if it walks like a duck...)
One "item" per "element": set, list, tuple
Two "items" per "element": dict
I suppose here you could see a named tuple, which has both a name and a value for each element, as an immutable analogue of a dictionary. But this is a tenuous comparison- keep in mind that duck-typing will cause problems if you're trying to use a dictionary-only method on a named tuple, or vice-versa.
Direct responses to your questions
Isn't a dictionary just a list of tuples with a particular uniqueness constraint?
No, there are several differences. Dictionaries have no inherent order; lists do.
Also, a dictionary has a key and a value for each "element". A tuple, on the other hand, can have an arbitrary number of elements, but each with only a value.
Because of the mechanics of a dictionary, where keys act like a set, you can look up values in constant time if you have the key. In a list of tuples (pairs here), you would need to iterate through the list until you found the key, meaning search would be linear in the number of elements in your list.
Most importantly, though, dictionary items can be changed, while tuples cannot.
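The lookup-cost difference can be seen in a short sketch: the same data, stored both ways.

```python
# Lookup in a dict is a hash probe; in a list of (key, value)
# pairs it is a linear scan.
pairs = [("a", 1), ("b", 2), ("c", 3)]
as_dict = dict(pairs)

print(as_dict["b"])                           # 2, constant time on average
print(next(v for k, v in pairs if k == "b"))  # 2, linear in len(pairs)
```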
Isn't a list just a set with a different kind of uniqueness constraint?
Again, I'd stress that sets have no inherent ordering, while lists do. This makes lists much more useful for representing things like stacks and queues, where you want to be able to remember the order in which you appended items. Sets offer no such guarantee. However they do offer the advantage of being able to do membership lookups in constant time, while again lists take linear time.
There are now named tuples -- starting to feel more like a special-case dictionary. There are now ordered dictionaries -- starting to feel more like a list. And I just saw a recipe for ordered sets. I can picture this going on and on ... what about unique lists, etc.
To some degree I agree with you. However data structure libraries can be useful to support common use-cases for already well-established data structures. This keep the programmer from wasting time trying to come up with custom extensions to the standard structures. As long as it doesn't get out of hand, and we can still see the unique usefulness in each solution, it's good to have a wheel on the shelf so we don't need to reinvent it.
A great example is the Counter() class. This specialized dictionary has been of use to me more times than I can count (badoom-tshhhhh!) and it has saved me the effort of coding up a custom solution. I'd much rather have a solution that the community is helping me to develop and keep with proper python best-practices than something that sits around in my custom data structures folder and only gets used once or twice a year.
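A quick sketch of what Counter saves you from hand-rolling with get or setdefault:

```python
from collections import Counter

# One line replaces the usual tally-dict boilerplate.
tally = Counter("abracadabra")
print(tally["a"])            # 5
print(tally.most_common(1))  # [('a', 5)]
```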
First of all, Ordered Dictionaries and Named Tuples were introduced in Python 2, but that's beside the point.
I won't point you at the docs since if you were really interested you would have read them already.
The first difference between collection types is mutability. tuple and frozenset are immutable types. This means they can be more efficient than list or set.
If you want something you can access randomly or in order, but will mainly change at the end, you want a list. If you want something you can also change at the beginning, you want a deque.
You simply can't have your cake and eat it too -- every feature you add causes you to lose some speed.
dict and set are fundamentally different from list and tuple. They store the hashes of their keys, allowing you to check whether an item is in them very quickly, but this requires the keys to be hashable. You don't get the same membership-testing speed with linked lists or arrays.
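That membership-testing difference in a small sketch:

```python
# A set hashes its members, so `in` is an average-O(1) probe;
# a list must scan element by element.
items = list(range(1000))
lookup = set(items)

print(999 in lookup)  # True, via a hash probe
print(999 in items)   # True, via a linear scan
print(-1 in lookup)   # False, still just one probe
```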
When you get to OrderedDict and NamedTuple, you're talking about subclasses of the builtin types implemented in Python, rather than in C. They are for special cases, just like any other code in the standard library you have to import. They don't clutter up the namespace but are nice to have when you need them.
One of these days, you'll be coding, and you'll say, "Man, now I know exactly what they meant by 'There should be one-- and preferably only one --obvious way to do it', a set is just what I needed for this, I'm so glad it's part of the Python language! If I had to use a list, it would take forever." That's when you'll understand why these different types exist.
A dictionary is indexed by key (in fact, it's a hash map); a generic list of tuples won't be. You might argue that both should be implemented as relations, with the ability to add indices at will, but in practice having optimized types for the common use cases is both more convenient and more efficient.
New specialized collections get added because they are common enough that lots of people would end up implementing them using more basic data types, and then you'd have the usual problems with wheel reinvention (wasted effort, lack of interoperability...). And if Python just offered an entirely generic construct, then we'd get lots of people asking "how do I implement a set using a relation", etc.
(btw, I'm using relation in the mathematical or DB sense)
All of these specialized collection types provide specific functionalities that are not adequately or efficiently provided by the "standard" data types of list, tuple, dict, and set.
For example, sometimes you need a collection of unique items, and you also need to retain the order in which you encountered them. You can do this using a set to keep track of membership and a list to keep track of order, but your solution will probably be slower and more memory-hungry than a specialized data structure designed for exactly this purpose, such as an ordered set.
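The hand-rolled set-plus-list combination described above, i.e. a poor man's ordered set, would look something like this:

```python
def unique_in_order(items):
    """Keep the first occurrence of each item, preserving encounter order."""
    seen = set()   # membership tracking
    ordered = []   # order tracking
    for x in items:
        if x not in seen:
            seen.add(x)
            ordered.append(x)
    return ordered

print(unique_in_order([3, 1, 3, 2, 1]))  # [3, 1, 2]
```

A dedicated ordered-set type bundles these two structures behind one clean API instead of making every caller maintain them in lockstep.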
These additional data types, which you see as combinations or variations on the basic ones, actually fill gaps in functionality left by the basic data types. From a practical perspective, if Python's core or standard library did not provide these data types, then anyone who needed them would invent their own inefficient versions. They are used less often than the basic types, but often enough to make it worthwhile to provide standard implementations.
One of the things I like in Python the most is agility. And a lot of functional, effective and usable collections types gives it to me.
And there is still one obvious way to do each thing: each type does its own job.
The world of data structures (language agnostic) can generally be boiled down to a few small basic structures - lists, trees, hash-tables and graphs, etc. and variants and combinations thereof. Each has its own specific purpose in terms of use and implementation.
I don't think that you can do things like reduce a dictionary to a list of tuples with a particular uniqueness constraint without actually specifying a dictionary. A dictionary has a specific purpose - key/value look-ups - and the implementation of the data structure is generally tailored to those needs. Sets are like dictionaries in many ways, but certain operations on sets don't make sense on a dictionary (union, disjunction, etc).
I don't see this as violating the 'Zen of Python' rule of doing things one way. While you could use a sorted dictionary to do what a dictionary does without using the sorted part, you'd be violating Occam's razor and likely incurring a performance penalty. I see this as different from being able to do the same thing in syntactically different ways a la Perl.
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It seems to me this profusion of specialized collections types is in conflict with this Python precept.
Not remotely. There are several different things being done here. We choose the right tool for the job. All of these containers are modeled on decades-old tried, tested and true CS concepts.
Dictionaries are not like tuples: they are optimized for key-value lookup. The tuple is also immutable, which distinguishes it from a list (you could think of it as sort of like a frozenlist). If you find yourself converting dictionaries to lists and back, you are almost certainly doing something wrong; an example would help.
Named tuples exist for convenience and are intended to replace simple classes rather than dictionaries, really. Ordered dictionaries are just a bit of wrapping to remember the order in which things were added to the dictionary. And neither is new in 3.x (although there may be better language support for them; I haven't looked).
Is there an article or forum discussion or something somewhere that explains why lists use append/extend, but sets and dicts use add/update?
I frequently find myself converting lists into sets and this difference makes that quite tedious, so for my personal sanity I'd like to know what the rationalization is.
The need to convert between these occurs regularly as we iterate on development. Over time as the structure of the program morphs, various structures gain and lose requirements like ordering and duplicates.
For example, something that starts out as an unordered bunch of stuff in a list might pick up the requirement that there be no duplicates, and so need to be converted to a set.
All such changes require finding and changing all places where the relevant structure is added/appended and extended/updated.
So I'm curious to see the original discussion that led to this language choice, but unfortunately I didn't have any luck googling for it.
append has a popular definition of "add to the very end", and extend can be read similarly (in the nuance where it means "...beyond a certain point"); sets have no "end", nor any way to specify some "point" within them or "at their boundaries" (because there are no "boundaries"!), so it would be highly misleading to suggest that these operations could be performed.
x.append(y) always increases len(x) by exactly one (whether y was already in list x or not); no such assertion holds for s.add(z) (s's length may increase or stay the same). Moreover, in these snippets, y can have any value (i.e., the append operation never fails [except for the anomalous case in which you've run out of memory]) -- again no such assertion holds about z (which must be hashable, otherwise the add operation fails and raises an exception). Similar differences apply to extend vs update. Using the same name for operations with such drastically different semantics would be very misleading indeed.
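The differing guarantees described above fit in a few lines: append always grows the list by one and accepts any value, while add may be a no-op and demands hashability.

```python
lst = [1]
s = {1}

lst.append(1)  # duplicates allowed: length always grows by one
s.add(1)       # already a member: length unchanged
print(len(lst), len(s))  # 2 1

lst.append([2, 3])    # any value may be appended, even an unhashable one
try:
    s.add([2, 3])     # unhashable: add raises TypeError
except TypeError:
    print("add rejected the list")
```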
it seems pythonic to just use a list on the first pass and deal with the performance on a later iteration
Performance is the least of it! lists support duplicate items, ordering, and any item type -- sets guarantee item uniqueness, have no concept of order, and demand item hashability. There is nothing Pythonic in using a list (plus goofy checks against duplicates, etc) to stand for a set -- performance or not, "say what you mean!" is the Pythonic Way;-). (In languages such as Fortran or C, where all you get as a built-in container type are arrays, you might have to perform such "mental mapping" if you need to avoid using add-on libraries; in Python, there is no such need).
Edit: the OP asserts in a comment that they don't know from the start (e.g.) that duplicates are disallowed in a certain algorithm (strange, but, whatever) -- they're looking for a painless way to make a list into a set once they do discover duplicates are bad there (and, I'll add: order doesn't matter, items are hashable, indexing/slicing unneeded, etc). To get exactly the same effect one would have if Python's sets had "synonyms" for the two methods in question:
class somewhatlistlikeset(set):
    def append(self, x): self.add(x)
    def extend(self, x): self.update(x)
Of course, if the only change is at the set creation (which used to be list creation), the code may be much more challenging to follow, having lost the useful clarity whereby using add vs append allows anybody reading the code to know "locally" whether the object is a set vs a list... but this, too, is part of the "exactly the same effect" above-mentioned!-)
set and dict are unordered. "Append" and "extend" conceptually only apply to ordered types.
It's written that way to annoy you.
Seriously. It's designed so that one can't simply convert one into the other easily. Historically, sets are based off dicts, so the two share naming conventions. While you could easily write a set wrapper to add these methods ...
class ListlikeSet(set):
    def append(self, x):
        self.add(x)
    def extend(self, xs):
        self.update(xs)
... the greater question is why you find yourself converting lists to sets with such regularity. They represent substantially different models of a collection of objects; if you have to convert between the two a lot, it suggests you may not have a very good handle on the conceptual architecture of your program.