What is the true difference between a dictionary and a hash table?

I've always used dictionaries. I write in Python.

A dictionary is a general concept that maps keys to values. There are many ways to implement such a mapping.
A hashtable is a specific way to implement a dictionary.
Besides hashtables, another common way to implement dictionaries is red-black trees.
Each method has its own pros and cons. A red-black tree can always perform a lookup in O(log N). A hashtable can perform a lookup in O(1) time although that can degrade to O(N) depending on the input.
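To make that concrete, here is a toy sketch (assuming nothing about CPython's real dict internals) of how a hashtable can implement a dictionary, including the bucket scan that makes the O(N) worst case possible:

class ToyHashTable:
    """A toy separate-chaining hash table; a sketch, not CPython's dict."""
    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() yields an integer; modulo maps it onto one of the buckets
        return self._buckets[hash(key) % len(self._buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        # O(1) on average; O(N) only if every key lands in the same bucket
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

A red-black-tree-backed mapping would replace the buckets with an ordered tree and give O(log N) in every case instead.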

A dictionary is a data structure that maps keys to values.
A hash table is a data structure that maps keys to values by taking the hash value of the key (by applying some hash function to it) and mapping that to a bucket where one or more values are stored.
IMO this is analogous to asking the difference between a list and a linked list.
For clarity it may be important to note that it MAY be the case that Python currently implements their dictionaries using hash tables, and it MAY be the case in the future that Python changes that fact without causing their dictionaries to stop being dictionaries.

"A dictionary" has a few different meanings in programming, as wikipedia will tell you -- "associative array", the sense in which Python uses the term (also known as "a mapping"), is one of those meanings (but "data dictionary", and "dictionary attacks" in password guess attempts, are also important).
Hash tables are important data structures; Python uses them to implement two important built-in data types, dict and set.
So, even in Python, you can't consider "hash table" to be a synonym for "dictionary"... since a similar data structure is also used to implement "sets"!-)
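A quick way to see that dict and set share the same hashing machinery: anything unhashable is rejected as a dict key and as a set member in the same way (illustrative snippet):

d = {}
d[(1, 2)] = "fine"          # tuples are hashable: usable as a dict key
s = {(1, 2)}                # ...and as a set member
try:
    d[[1, 2]] = "nope"      # lists are unhashable
except TypeError as e:
    print(e)                # unhashable type: 'list'
try:
    s.add([1, 2])
except TypeError as e:
    print(e)                # same complaint from the set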

A Python dictionary is internally implemented with a hashtable.

Both a dictionary and a hash table pair keys with values so that insertion, deletion, and lookup are fast in big-O terms; the difference is that a hash table hashes the key to decide where each (key, value) pair is stored, which is what makes access fast. Python implements its dictionaries as hash tables. Map and set types in various languages are further hash-table-based collections, some of which also take insertion order into account and let you use any hashable object as a key...
In Python 3, dicts have become a little more list-like in one respect, since they now preserve insertion order. Check this for more details:
https://softwaremaniacs.org/blog/2020/02/05/dicts-ordered/en/

A hash table always uses some function operating on a value to determine where a value will be stored. A Dictionary (as I believe you intend it) is a more general term, and simply indicates a lookup mechanism, which might be a hash table or might be implemented by a simpler structure which does not consider the value itself in determining its storage location.

A dictionary is implemented using hash tables. In my opinion the difference between the two can be thought of as the difference between stacks and arrays, where we would be using arrays to implement stacks.

Related

When you don't specify an order, why does the result order vary?

I am in the very early stages of learning Python. This question has more to do with basic understanding than coding- hopefully I tag it correctly. I am reading my coursework and it says
"Run the program below that displays the ... The indentation and
spacing of the... key-value pairs simply provides more readability.
Note that order is not maintained in the dict when printed."
I know I can specify things so that the order is the same each time; I can do that. What I want to know is: when you write a program and run it, why do the results get returned in a different order when no order is specified? Is it because of the way it gets handled in the processor?
Thanks.
The answer has nothing to do with Python, and everything to do with data structures - this behavior is universal and expected across all languages that implement a similar data structure. In Python it's called a dictionary, in other languages it's called a Map or a Hash Map or a Hash Table. There are a few other similar names for the same underlying data structure.
The Python dictionary is an associative collection, as opposed to a Python list, which is essentially an array whose elements are contiguous in memory.
The big advantage that dictionaries (associative collections) offer is fast, constant-time lookups (O(1)). Arrays also offer fast lookups, since calculating an index is trivial; however, a dictionary consists of key-value pairs where the key can be anything, as long as it is hashable.
Essentially, to determine the "index" where an associated value should go in an associative container, you take the key, hash it, devise some way of mapping the hash to a number and treat that number like an index. As unlikely as it is for two different objects to yield the same hash, it could theoretically happen - what's more likely to happen is that your hash-to-number procedure maps two unique hashes to the same number - in any case, collisions like this can happen, and there are strategies for handling these collisions.
The point is, the hash of a key determines the order in which the associated value appears in the collection - therefore, there is no inherent order.
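As a concrete (purely illustrative) picture of that hash-to-index step, with a tiny table of 8 slots:

keys = ["apple", "banana", "cherry"]
table_size = 8
for k in keys:
    # hash() returns a large integer; the modulo maps it onto a slot.
    # The resulting slot has nothing to do with insertion order, and the
    # output even varies between runs because Python randomizes string hashes.
    print(k, hash(k) % table_size)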

Injective two-way mappings [duplicate]

This question already has answers here:
Two way/reverse map [duplicate]
I often deal with mappings which are injective. In programming terminology, this can be expressed as a dictionary where all values are unique as well as, of course, all keys.
Is there a memory-efficient data structure for injective mappings with all the time-complexity properties you expect from dictionaries?
For example:
d = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}
d.get(2)                  # returns 'b' -- this works with a normal dictionary
d.get('b', reverse=True)  # would return 2 -- but this is not possible
All the solutions in Two way/reverse map seem to require using or combining two sets of mappings, focusing on making it easier to perform operations on a two-way map. This is fine for small dictionaries that fit neatly in memory, but not good for large dictionaries.
The requirement is there should be no additional memory overhead storing the injective two-way map versus a regular dictionary storing only one-way mappings.
I understand dictionaries use a hash table, which implements the associative array data type. By definition, associative arrays implement key -> value mappings with unique keys. Is it possible, theoretically or in practice, to produce a smart injective mapping which allows reverse lookup?
If it is not possible, I would appreciate an explanation why such a construct is difficult, or impossible, to implement with the same efficiency as dictionaries.
Update
Following the discussion with @rpy (see comments below), any information on how to set up a Python dictionary-like object using a perfect reversible hash function would be useful. But, of course, a working implementation would be ideal (I have already tried perfection).
The net answer to your question is: NO (for any efficient implementation)
You put up two requirements that cannot be fulfilled at the same time:
Do not use extra memory for the reverse mapping
Do not add extra time for doing (reverse) look-ups
Why are those two restrictions prohibiting a solution?
Mappings are pairs of values (tuples).
The most trivial implementation would be:
Sequential searching all tuples for a match.
This would have identical complexity for forward and backward mapping.
However, this clearly violates the time-complexity properties you expect from dictionaries:
If you allowed O(n) complexity, then searching a tuple set sequentially would give you a proper solution.
Usually dictionary implementations try to get down to O(1), or at least O(log(n)), complexity. This is achieved by introducing additional data for speeding up look-ups, like hashes or trees. Unfortunately, such aids only help in one direction, as they deal either with keys (the forward-mapping case) or with values (the reverse-mapping case).
So, as soon as you need to keep look-up complexity down (this also applies for modification complexity, but usually dictionaries are tailored towards fast look-up), you will need to add data for achieving speed.
The whole issue comes down to the classic memory vs. speed trade-off.
EDIT:
An approach for addressing the problem in a general implementation (for cases where keys and values allow for deriving a numeric representation, if they are not integral numbers in the first place) might be:
Calculate a hash value for the key and one for the value, and register the tuple under both hash values. This way you can take a key or a value, identify the matching tuple, and return the proper result. This would even work for non-injective cases if you allow for returning sets of matching tuples.
This will require more space (double the hash entries) while keeping look-up complexity within values typical for hash-based dicts. You might need to keep an eye on hash bucket size (the length of collision chains), especially when the sets of keys and values are not disjoint.
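A minimal sketch of that approach (the class name and the get(..., reverse=True) signature are made up to mirror the question; it spends the extra space the question wanted to avoid in exchange for O(1) average lookups both ways, and it assumes the mapping really is injective):

class TwoWayDict:
    def __init__(self):
        self._fwd = {}   # key -> value
        self._rev = {}   # value -> key (this doubles the hash entries)

    def __setitem__(self, key, value):
        self._fwd[key] = value
        self._rev[value] = key

    def get(self, key, default=None, reverse=False):
        return (self._rev if reverse else self._fwd).get(key, default)

d = TwoWayDict()
d[2] = 'b'
d.get(2)                    # 'b'
d.get('b', reverse=True)    # 2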

Why use hashes instead of testing true equality?

I've recently been looking into Python's dictionaries (I believe they're called associative arrays in other languages) and was confused by a couple of the restrictions on their keys.
First, dict keys must be immutable. When I looked up the logic behind it, the answer was that dictionaries work like hash tables to look up the values for keys, and thus mutable keys (if they were hashable at all) could change their hash value, causing problems when retrieving the value.
I understood why that was the case just fine, but I was still somewhat confused by what the point of using a hash table would be. If you simply didn't hash the keys and tested for true equality (assuming identically constructed objects compare equal), you could replicate most of the functionality of the dictionary without that limitation just by using two lists.
So, I suppose that's my real question - what's the rationale behind using hashes to look up values instead of equality?
If I had to guess, it's likely simply because comparing integers is incredibly fast and optimized, whereas comparing instances of other classes may not be.
You seem to be missing the whole point of a hash table, which is fast (O(1))[1] retrieval, and which cannot be implemented without hashing, i.e. transformation of the key into some kind of well-distributed integer that is used as an index into a table. Notice that equality is still needed on retrieval to be able to handle hash collisions[2], but you resort to it only when you have already narrowed down the set of elements to those having a certain hash.
With just equality you could replicate similar functionality with parallel arrays or something similar, but that would make retrieval O(n)[3]; if you ask for strict weak ordering instead, you can implement RB trees, which allow for O(log n) retrieval. But O(1) requires hashing.
Have a look at Wikipedia for some more insight on hash tables.
Notes
[1] It can become O(n) in pathological scenarios (where all the elements get put in the same bucket), but that's not supposed to happen with a good hashing function.
[2] Since different elements may have the same hash, after getting to the bucket corresponding to the hash you must check whether you are actually retrieving the element associated with the given key.
[3] Or O(log n) if you keep your arrays sorted, but that complicates insertion, which becomes on average O(n) due to shifting elements around; then again, if you have ordering you probably want an RB tree or a heap.
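To make the contrast concrete, here is a rough sketch of the equality-only approach from the question next to the hash-based one (the names are illustrative):

# Equality only: two parallel lists, lookup compares key after key -> O(n)
keys, values = [10, 20, 30], ['a', 'b', 'c']
def slow_get(key):
    for i, k in enumerate(keys):
        if k == key:
            return values[i]
    raise KeyError(key)

# Hashing: jump straight to the right bucket, then compare only the few
# entries that collided there -> O(1) on average
d = dict(zip(keys, values))
slow_get(20)    # 'b', after scanning
d[20]           # 'b', via the hash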
By using hash tables you achieve O(1) retrieval, while comparing each individual value for equality takes O(n) (in a sequential search) or O(log(n)) in a binary search.
Also note that O(1) is amortized time, because if several keys hash to the same bucket, a sequential search is needed among those entries.

List of lists vs dictionary

In Python, are there any advantages / disadvantages of working with a list of lists versus working with a dictionary, more specifically when doing numerical operations with them? I'm writing a class of functions to solve simple matrix operations for my linear algebra class. I was using dictionaries, but then I saw that numpy uses list of lists instead, so I guess there must be some advantages in it.
Example: [[1,2,3],[4,5,6],[7,8,9]] as opposed to {0:[1,2,3],1:[4,5,6],2:[7,8,9]}
I think this largely depends on how you plan on using this structure.
Python's dictionaries are (like most) unordered by default. If you plan on iterating over your data like such, you shouldn't use a dictionary:
for row in my_dict.values():    # iterate the stored lists
    for elem in row:
        ...  # logic
Likewise, it doesn't make a lot of sense to use a dictionary with the keys 1, 2, 3 ... when they have little value other than an index. Dictionaries also take up more space in memory due to the hashing process.
If you plan on accessing items by index (which it sounds like you want to), you'll still want to use a list. Index lookup in a list is O(1), the same as key lookup in a dictionary; the difference is that a list is indexed by position rather than by a hashed key, which in practice makes the list lookup faster than the dictionary one.
You should really only consider using a dictionary when you have some sort of key-value relationship mapping, in which a meaningful key needs to be searched for to retrieve a related value. This doesn't sound like one of those cases. Stick with a list-of-lists.
That's not to say Dictionaries are a bad data structure. Ruby and Python introduced me to them, and they're extremely useful for any of those afore-mentioned mapping problems (which I find I run into quite a lot). They're just useful for a specific class of problems, and this isn't one of them.
When the keys of the dictionary are 0, 1, ..., n, a list will be faster, since no hashing is involved. As soon as the keys are not such a sequence, you need to use a dict.
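For the matrix example in the question, both spellings behave the same; the difference is only that the list indexes by position directly while the dict hashes each key first (quick sketch):

matrix_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
matrix_dict = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}

matrix_list[1][2]    # 6 -- plain positional indexing, no hashing involved
matrix_dict[1][2]    # 6 -- same result, but the key 1 is hashed first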

Why does Python treat tuples, lists, sets and dictionaries as fundamentally different things?

One of the reasons I love Python is the expressive power / reduced programming effort provided by tuples, lists, sets and dictionaries. Once you understand list comprehensions and a few of the basic patterns using in and for, life gets so much better! Python rocks.
However I do wonder why these constructs are treated as differently as they are, and how this is changing (getting stranger) over time. Back in Python 2.x, I could've made an argument they were all just variations of a basic collection type, and that it was kind of irritating that some non-exotic use cases require you to convert a dictionary to a list and back again. (Isn't a dictionary just a list of tuples with a particular uniqueness constraint? Isn't a list just a set with a different kind of uniqueness constraint?).
Now in the 3.x world, it's gotten more complicated. There are now named tuples -- starting to feel more like a special-case dictionary. There are now ordered dictionaries -- starting to feel more like a list. And I just saw a recipe for ordered sets. I can picture this going on and on ... what about unique lists, etc.
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It seems to me this profusion of specialized collections types is in conflict with this Python precept.
What do the hardcore Pythonistas think?
These data types all serve different purposes, and in an ideal world you might be able to unify them more. However, in the real world we need to have efficient implementations of the basic collections, and e.g. ordering adds a runtime penalty.
The named tuples mainly serve to make the interface of stat() and the like more usable, and also can be nice when dealing with SQL row sets.
The big unification you're looking for is actually there, in the form of the different access protocols (__getitem__, __getattr__, __iter__, ...), which these types mix and match for their intended purposes.
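For example, collections.namedtuple gives positional results readable field names; the Row type below is just a made-up illustration of the stat()/SQL-row style of use mentioned above:

from collections import namedtuple

Row = namedtuple("Row", ["id", "name", "email"])   # hypothetical row type
r = Row(1, "Ada", "ada@example.com")
r.name    # 'Ada' -- attribute access (the getattr-style protocol)
r[2]      # 'ada@example.com' -- still indexable like a plain tuple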
tl;dr (duck-typing)
You're correct to see some similarities in all these data structures. Remember that python uses duck-typing (if it looks like a duck and quacks like a duck then it is a duck). If you can use two objects in the same situation then, for your current intents and purposes, they might as well be the same data type. But you always have to keep in mind that if you try to use them in other situations, they may no longer behave the same way.
With this in mind we should take a look at what's actually different and the same about the four data types you mentioned, to get a general idea of the situations where they are interchangeable.
Mutability (can you change it?)
You can make changes to dictionaries, lists, and sets. Tuples cannot be "changed" without making a copy.
Mutable: dict, list, set
Immutable: tuple
Python string is also an immutable type. Why do we want some immutable objects? I would paraphrase from this answer:
Immutable objects can be optimized a lot
In Python, the built-in immutable types are hashable while the mutable containers are not (and only hashable objects can be members of sets, or keys in dictionaries).
Comparing across this property, lists and tuples seem like the "closest" two data types. At a high-level a tuple is an immutable "freeze-frame" version of a list. This makes lists useful for data sets that will be changing over time (since you don't have to copy a list to modify it) but tuples useful for things like dictionary keys (which must be immutable types).
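A small illustration of that "freeze-frame" point (the values are made up): a list can't be a dict key, but a tuple snapshot of it can.

point = [3, 4]
locations = {tuple(point): "treasure"}   # snapshot of the list as a hashable key
point.append(5)                          # mutating the list doesn't disturb the key
locations[(3, 4)]                        # 'treasure'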
Ordering (and a note on abstract data types)
A dictionary, like a set, has no inherent conceptual order to it. This is in contrast to lists and tuples, which do have an order. The order for the items in a dict or a set is abstracted away from the programmer, meaning that if element A comes before B in a for k in mydata loop, you shouldn't (and can't generally) rely on A being before B once you start making changes to mydata.
Order-preserving: list, tuple
Non-order-preserving: dict, set
Technically, if you iterate over mydata twice in a row it will come out in the same order, but this is more a convenient feature of the mechanics of Python than a part of the set abstract data type (the mathematical definition of the data type). Lists and tuples do guarantee order, though, especially tuples, which are immutable.
What you see when you iterate (if it walks like a duck...)
One "item" per "element": set, list, tuple
Two "items" per "element": dict
I suppose here you could see a named tuple, which has both a name and a value for each element, as an immutable analogue of a dictionary. But this is a tenuous comparison; keep in mind that duck-typing will cause problems if you're trying to use a dictionary-only method on a named tuple, or vice-versa.
Direct responses to your questions
Isn't a dictionary just a list of tuples with a particular uniqueness constraint?
No, there are several differences. Dictionaries have no inherent order, whereas a list does.
Also, a dictionary has a key and a value for each "element". A tuple, on the other hand, can have an arbitrary number of elements, but each element is just a value, not a key-value pair.
Because of the mechanics of a dictionary, where keys act like a set, you can look up values in constant time if you have the key. In a list of tuples (pairs here), you would need to iterate through the list until you found the key, meaning search would be linear in the number of elements in your list.
Most importantly, though, dictionary items can be changed, while tuples cannot.
Isn't a list just a set with a different kind of uniqueness constraint?
Again, I'd stress that sets have no inherent ordering, while lists do. This makes lists much more useful for representing things like stacks and queues, where you want to be able to remember the order in which you appended items. Sets offer no such guarantee. However they do offer the advantage of being able to do membership lookups in constant time, while again lists take linear time.
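A small sketch of that membership difference (the numbers are arbitrary; the point is the scan versus the hash lookup):

items = list(range(1_000_000))
as_set = set(items)

999_999 in items     # True, but found by scanning the list element by element: O(n)
999_999 in as_set    # True, found via its hash: O(1) on average
# ...while the list, unlike the set, still remembers the order of appends.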
There are now named tuples -- starting to feel more like a special-case dictionary. There are now ordered dictionaries -- starting to feel more like a list. And I just saw a recipe for ordered sets. I can picture this going on and on ... what about unique lists, etc.
To some degree I agree with you. However, data structure libraries can be useful to support common use-cases for already well-established data structures. This keeps the programmer from wasting time trying to come up with custom extensions to the standard structures. As long as it doesn't get out of hand, and we can still see the unique usefulness in each solution, it's good to have a wheel on the shelf so we don't need to reinvent it.
A great example is the Counter() class. This specialized dictionary has been of use to me more times than I can count (badoom-tshhhhh!) and it has saved me the effort of coding up a custom solution. I'd much rather have a solution that the community is helping me to develop and keep with proper python best-practices than something that sits around in my custom data structures folder and only gets used once or twice a year.
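For instance, a tiny sketch of the kind of use meant above:

from collections import Counter

words = ["spam", "egg", "spam", "spam", "bacon"]
counts = Counter(words)     # Counter({'spam': 3, 'egg': 1, 'bacon': 1})
counts["spam"]              # 3
counts.most_common(1)       # [('spam', 3)]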
First of all, Ordered Dictionaries and Named Tuples were introduced in Python 2, but that's beside the point.
I won't point you at the docs since if you were really interested you would have read them already.
The first difference between collection types is mutability. tuple and frozenset are immutable types. This means they can be more efficient than list or set.
If you want something you can access randomly or in order, but will mainly change at the end, you want a list. If you want something you can also change at the beginning, you want a deque.
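For example, collections.deque gets you cheap operations at both ends, where a list is only cheap at its tail (illustrative snippet):

from collections import deque

d = deque([2, 3, 4])
d.appendleft(1)    # cheap at the front: deque([1, 2, 3, 4])
d.append(5)        # and at the back:   deque([1, 2, 3, 4, 5])
d.popleft()        # 1 -- the list equivalents insert(0, x) / pop(0) are O(n)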
You simply can't have your cake and eat it too -- every feature you add causes you to lose some speed.
dict and set are fundamentally different from lists and tuples. They store the hash of their keys, allowing you to see very quickly whether an item is in them, but this requires the keys to be hashable. You don't get the same membership-testing speed with linked lists or arrays.
When you get to OrderedDict and NamedTuple, you're talking about subclasses of the builtin types implemented in Python, rather than in C. They are for special cases, just like any other code in the standard library you have to import. They don't clutter up the namespace but are nice to have when you need them.
One of these days, you'll be coding, and you'll say, "Man, now I know exactly what they meant by 'There should be one-- and preferably only one --obvious way to do it', a set is just what I needed for this, I'm so glad it's part of the Python language! If I had to use a list, it would take forever." That's when you'll understand why these different types exist.
A dictionary is indexed by key (in fact, it's a hash map); a generic list of tuples won't be. You might argue that both should be implemented as relations, with the ability to add indices at will, but in practice having optimized types for the common use cases is both more convenient and more efficient.
New specialized collections get added because they are common enough that lots of people would end up implementing them using more basic data types, and then you'd have the usual problems with wheel reinvention (wasted effort, lack of interoperability...). And if Python just offered an entirely generic construct, then we'd get lots of people asking "how do I implement a set using a relation", etc.
(btw, I'm using relation in the mathematical or DB sense)
All of these specialized collection types provide specific functionalities that are not adequately or efficiently provided by the "standard" data types of list, tuple, dict, and set.
For example, sometimes you need a collection of unique items, and you also need to retain the order in which you encountered them. You can do this using a set to keep track of membership and a list to keep track of order, but your solution will probably be slower and more memory-hungry than a specialized data structure designed for exactly this purpose, such as an ordered set.
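The set-plus-list workaround described above might look like the sketch below (a dedicated ordered-set type wraps up the same idea more efficiently):

def unique_in_order(items):
    seen = set()      # fast membership checks
    ordered = []      # remembers first-encounter order
    for x in items:
        if x not in seen:
            seen.add(x)
            ordered.append(x)
    return ordered

unique_in_order([3, 1, 3, 2, 1])    # [3, 1, 2]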
These additional data types, which you see as combinations or variations on the basic ones, actually fill gaps in functionality left by the basic data types. From a practical perspective, if Python's core or standard library did not provide these data types, then anyone who needed them would invent their own inefficient versions. They are used less often than the basic types, but often enough to make it worth while to provide standard implementations.
One of the things I like most about Python is its agility, and its many functional, effective and usable collection types are a big part of that.
And there is still only one way to do each thing: each type does its own job.
The world of data structures (language agnostic) can generally be boiled down to a few small basic structures - lists, trees, hash-tables and graphs, etc. and variants and combinations thereof. Each has its own specific purpose in terms of use and implementation.
I don't think that you can do things like reduce a dictionary to a list of tuples with a particular uniqueness constraint without actually specifying a dictionary. A dictionary has a specific purpose - key/value look-ups - and the implementation of the data structure is generally tailored to those needs. Sets are like dictionaries in many ways, but certain operations on sets don't make sense on a dictionary (union, disjunction, etc).
I don't see this as violating the 'Zen of Python' principle of doing things one way. While you can use a sorted dictionary to do what a plain dictionary does without using the sorted part, you're more violating Occam's razor and likely causing a performance penalty. I see this as different from being able to syntactically do things in different ways, a la Perl.
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It seems to me this profusion of specialized collections types is in conflict with this Python precept.
Not remotely. There are several different things being done here. We choose the right tool for the job. All of these containers are modeled on decades-old tried, tested and true CS concepts.
Dictionaries are not like tuples: they are optimized for key-value lookup. The tuple is also immutable, which distinguishes it from a list (you could think of it as sort of like a frozenlist). If you find yourself converting dictionaries to lists and back, you are almost certainly doing something wrong; an example would help.
Named tuples exist for convenience and are intended to replace simple classes rather than dictionaries, really. Ordered dictionaries are just a bit of wrapping to remember the order in which things were added to the dictionary. And neither is new in 3.x (although there may be better language support for them; I haven't looked).
