Why use hashes instead of testing true equality? - python

I've recently been looking into Python's dictionaries (I believe they're called associative arrays in other languages) and was confused by a couple of the restrictions on their keys.
First, dict keys must be immutable. When I looked up the logic behind it, the answer was that dictionaries work like hash tables to look up the values for keys, and thus mutable keys (if they're hashable at all) may change their hash value, causing problems when retrieving the value.
I understood why that was the case just fine, but I was still somewhat confused about what the point of using a hash table would be. If you simply didn't hash the keys and tested for true equality (assuming identically constructed objects compare equal), you could replicate most of the functionality of a dictionary without that limitation just by using two lists.
So, I suppose that's my real question - what's the rationale behind using hashes to look up values instead of equality?
If I had to guess, it's likely simply because comparing integers is incredibly fast and optimized, whereas comparing instances of other classes may not be.

You seem to be missing the whole point of a hash table, which is fast (O(1))[1] retrieval, and which cannot be implemented without hashing, i.e. transformation of the key into some kind of well-distributed integer that is used as an index into a table. Notice that equality is still needed on retrieval to be able to handle hash collisions[2], but you resort to it only when you have already narrowed down the set of elements to those having a certain hash.
With just equality you could replicate similar functionality with parallel arrays or something similar, but that would make retrieval O(n)[3]; if you ask for strict weak ordering, instead, you can implement RB trees, which allow for O(log n) retrieval. But O(1) requires hashing.
Have a look at Wikipedia for some more insight on hash tables.
Notes
1. It can become O(n) in pathological scenarios (where all the elements get put in the same bucket), but that's not supposed to happen with a good hashing function.
2. Since different elements may have the same hash, after getting to the bucket corresponding to the hash you must check whether you are actually retrieving the element associated with the given key.
3. Or O(log n) if you keep your arrays sorted, but that complicates insertion, which becomes on average O(n) due to shifting elements around; but then again, if you have ordering you probably want an RB tree or a heap.
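To make the hash-then-equality sequence concrete, here is a minimal toy sketch of the retrieval logic described above. The TinyDict name and fixed bucket count are invented for the example; it omits resizing and everything else a real implementation needs, and is not how CPython's dict actually works.

class TinyDict:
    def __init__(self, size=8):
        self._buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        # The hash jumps straight to one bucket: this is the O(1) step.
        return self._buckets[hash(key) % len(self._buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # equality is used only within one bucket
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        for k, v in self._bucket(key):   # equality resolves hash collisions
            if k == key:
                return v
        raise KeyError(key)

d = TinyDict()
d["spam"] = 42
print(d["spam"])  # 42

Note that equality only ever runs over the handful of entries sharing a hash bucket, never over the whole table.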

By using hash tables you achieve O(1) data retrieval, while comparing against each individual value for equality takes O(n) (in a sequential search) or O(log n) (in a binary search over sorted data).
Also note that O(1) is an average-case (expected) time, because if several keys hash to the same value, a sequential search is needed among those entries.


When you don't specify an order - why does result order vary?

I am in the very early stages of learning Python. This question has more to do with basic understanding than coding - hopefully I tag it correctly. I am reading my coursework and it says:
"Run the program below that displays the ... The indentation and
spacing of the... key-value pairs simply provides more readability.
Note that order is not maintained in the dict when printed."
I know I can specify things so that the order is the same each time; I can do that. What I want to know is: when you write a program and run it, why do the results get returned in a different order when one isn't specified? Is it because of the way it gets handled in the processor?
Thanks.
The answer has nothing to do with Python, and everything to do with data structures - this behavior is universal and expected across all languages that implement a similar data structure. In Python it's called a dictionary, in other languages it's called a Map or a Hash Map or a Hash Table. There are a few other similar names for the same underlying data structure.
The Python dictionary is an associative collection, as opposed to a Python list (which is essentially an array, with its elements contiguous in memory).
The big advantage that dictionaries (associative collections) offer is fast, constant look-up times (O(1)) - arrays also offer fast look-up, since calculating an index is trivial - but a dictionary consists of key-value pairs where the key can be anything, as long as it is hashable.
Essentially, to determine the "index" where an associated value should go in an associative container, you take the key, hash it, devise some way of mapping the hash to a number and treat that number like an index. As unlikely as it is for two different objects to yield the same hash, it could theoretically happen - what's more likely to happen is that your hash-to-number procedure maps two unique hashes to the same number - in any case, collisions like this can happen, and there are strategies for handling these collisions.
The point is, the hash of a key determines the order in which the associated value appears in the collection - therefore, there is no inherent order.
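To see how the hash, rather than the insertion sequence, dictates position, here is a hypothetical toy illustration; the fixed-size bucket table is invented for the example. With string keys, Python's hash randomization even changes the order between runs, which is exactly the "varying results" the asker observed.

TABLE_SIZE = 8
buckets = [[] for _ in range(TABLE_SIZE)]

for key in ["banana", "apple", "cherry"]:        # insertion order
    buckets[hash(key) % TABLE_SIZE].append(key)  # position is set by the hash

# Walking the table front to back yields whatever order the hashes dictate,
# not the order the keys were added in.
print([key for bucket in buckets for key in bucket])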

Checking for duplicate arrays when I have a huge amount of arrays

I am counting various patterns in graphs, and I store the relevant information in a defaultdict of lists of numpy arrays of size N, where the index values are integers.
I want to efficiently know if I am creating a duplicate array. Not removing duplicates can exponentially grow the amount of duplicates to the point where what I am doing becomes infeasible. But there are potentially hundreds of thousands of arrays, stored in different lists, under different keys. As far as I know, I can't hash an array.
If I simply needed to check for duplicate nonzero indices, I would store the nonzero indices as a bit sequence of ones and then hash that value. But I don't only need to check the indices - I also need to check their integer values. Is there any way to do this short of coming up with a completely new design that uses different structures?
Thanks.
The basic idea is “How can I use my own hash (and perhaps ==) to store things differently in a set/dict?” (where “differently” includes “without raising TypeError for being non-hashable”).
The first part of the answer is defining your hash function, for example following myrtlecat’s comment. However, beware the standard non-answer based on it: store the custom hash of each object in a set (or map it to, say, the original object with a dict). That you don’t have to provide an equality implementation is a hint that this is wrong: hash values aren’t always unique! (Exception: if you want to “hash by identity”, and know all your keys will outlive the map, id does provide unique “hashes”.)
The rest of the answer is to wrap your desired keys in objects that expose your hash/equality functions as __hash__ and __eq__. Note that overriding the non-hashability of mutable types comes with an obligation to not alter the (underlying) keys! (C programmers would often call doing so undefined behavior.)
For code, see an old answer by xperroni (which includes the option to increase safety by basing the comparisons on private copies that are less likely to be altered by some other code), though I’d add __slots__ to combat the memory overhead.
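As a rough sketch of that wrapping idea for the asker's numpy arrays (the HashableArray name is my own; it takes a private read-only copy, in the spirit of xperroni's answer, and uses __slots__ to limit memory overhead):

import numpy as np

class HashableArray:
    __slots__ = ("_a", "_hash")

    def __init__(self, a):
        # Private read-only copy: guards against mutation after hashing.
        self._a = np.array(a)
        self._a.flags.writeable = False
        self._hash = hash(self._a.tobytes())

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        # The hash narrows the candidates; equality settles collisions.
        return isinstance(other, HashableArray) and np.array_equal(self._a, other._a)

seen = set()
seen.add(HashableArray([1, 0, 2]))
print(HashableArray([1, 0, 2]) in seen)  # True: duplicate detected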

Injective two-way mappings [duplicate]

I often deal with mappings which are injective. In programming terminology, this can be expressed as a dictionary where all values are unique as well as, of course, all keys.
Is there a memory-efficient data structure for injective mappings with all the time-complexity properties you expect from dictionaries?
For example:
d = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}
d.get(2)                  # returns 'b' - this works with a normal dictionary
d.get('b', reverse=True)  # would return 2 - but this is not possible
All the solutions in Two way/reverse map seem to require using or combining two sets of mappings, focusing on making it easier to perform operations on a two-way map. This is fine for small dictionaries that fit neatly in memory, but not good for large dictionaries.
The requirement is there should be no additional memory overhead storing the injective two-way map versus a regular dictionary storing only one-way mappings.
I understand dictionaries are implemented with hash tables, which are one way of realizing the associative array abstract data type. By definition, associative arrays implement key -> value mappings with unique keys. Is it possible, theoretically or in practice, to produce a smart injective mapping which allows reverse lookup?
If it is not possible, I would appreciate an explanation why such a construct is difficult, or impossible, to implement with the same efficiency as dictionaries.
Update
Following the discussion with @rpy (see comments below), any information on how to set up a Python dictionary-like object using a perfect reversible hash function would be useful. But, of course, a working implementation would be ideal (I have already tried perfection).
The net answer to your question is: no (for any efficient implementation).
You put up two requirements that cannot be fulfilled at the same time:
Do not use extra memory for the reverse mapping
Do not add extra time for doing (reverse) look-ups
Why are those two restrictions prohibiting a solution?
Mappings are pairs of values (tuples).
The most trivial implementation would be:
Sequentially searching all tuples for a match.
This would have identical complexity for forward and backward mapping.
However, this clearly violates the expectation of time-complexity properties you expect from dictionaries:
If you would allow for O(n) complexity, then searching a tuple set sequentially would give you a proper solution.
Usually dictionary implementations try to get down to O(1), or at least O(log n), complexity. This is achieved by introducing additional data for speeding up look-ups, like hashes or trees. Unfortunately, such aids only help for one direction, as they deal with either keys (forward mapping case) or values (reverse mapping case).
So, as soon as you need to keep look-up complexity down (this also applies for modification complexity, but usually dictionaries are tailored towards fast look-up), you will need to add data for achieving speed.
The whole issue comes down to the classic memory vs. speed trade-off.
EDIT:
An approach for addressing the problem in a general implementation (for cases where keys and values allow for deriving a numeric representative, if they are not integral numbers in the first place) might be:
Calculate a hash value for key and one for value and register the tuple under both hash values. This way you can take key or value and identify the matching tuple and return the proper result. This would even work for non injective cases when you allow for returning sets of matching tuples.
This will require more space (double the hash entries) while keeping look-up complexity within values typical for hash-based dicts. You might need to keep an eye on hash bucket size (length of collision chains), especially when the value sets of keys and values are not disjoint.
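A hedged sketch of that EDIT idea (the TwoWayDict name and the side-tagging scheme are invented for illustration, and it assumes the mapping really is injective): each (key, value) tuple is stored once and indexed under both sides, so memory grows by the extra index entries rather than by duplicating the data.

class TwoWayDict:
    def __init__(self):
        self._index = {}  # tagged entries: ('k', key) and ('v', value)

    def add(self, key, value):
        pair = (key, value)             # one shared tuple per mapping
        self._index[("k", key)] = pair  # forward entry
        self._index[("v", value)] = pair  # reverse entry

    def by_key(self, key):
        return self._index[("k", key)][1]

    def by_value(self, value):
        return self._index[("v", value)][0]

d = TwoWayDict()
d.add(2, 'b')
print(d.by_key(2), d.by_value('b'))  # b 2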

What makes sets faster than lists?

The python wiki says: "Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple."
I've been using sets in place of lists whenever speed is important in my code, but lately I've been wondering why sets are so much faster than lists. Could anyone explain, or point me to a source that would explain, what exactly is going on behind the scenes in python to make sets faster?
list: Imagine you are looking for your socks in your closet, but you don't know in which drawer your socks are, so you have to search drawer by drawer until you find them (or maybe you never do). That's what we call O(n), because in the worst scenario, you will look in all your drawers (where n is the number of drawers).
set: Now, imagine you're still looking for your socks in your closet, but now you know in which drawer your socks are, say in the 3rd drawer. So, you will just search in the 3rd drawer, instead of searching in all drawers. That's what we call O(1), because in the worst scenario you will look in just one drawer.
Sets are implemented using hash tables. Whenever you add an object to a set, the position within the memory of the set object is determined using the hash of the object to be added. When testing for membership, all that needs to be done is basically to look if the object is at the position determined by its hash, so the speed of this operation does not depend on the size of the set. For lists, in contrast, the whole list needs to be searched, which will become slower as the list grows.
This is also the reason that sets do not preserve the order of the objects you add.
Note that sets aren't faster than lists in general -- membership test is faster for sets, and so is removing an element. As long as you don't need these operations, lists are often faster.
I think you need to take a good look at a book on data structures. Basically, Python lists are implemented as dynamic arrays and sets are implemented as hash tables.
The implementation of these data structures gives them radically different characteristics. For instance, a hash table has a very fast lookup time but cannot preserve the order of insertion.
While I have not measured anything performance-related in Python so far, I'd still like to point out that lists are often faster.
Yes, you have O(1) vs. O(n). But always remember that this gives information only about the asymptotic behavior of something. That means O(1) will always be faster if your n is very high - theoretically. In practice, however, the n at which the crossover happens often needs to be much bigger than your usual data set will be.
So sets are not faster than lists per se, but only if you have to handle a lot of elements.
Python uses hashtables, which have O(1) lookup.
Basically, it depends on the operation you are doing...
- For adding an element: a set doesn't need to move any data; all it needs to do is calculate a hash value and add the entry to a table. For a list insertion, there is potentially data to be moved.
- For deleting an element: all a set needs to do is remove the entry from the hash table; a list potentially needs to move data around (on average 1/2 of the data).
- For a search (i.e. an in operator): a set just needs to calculate the hash value of the data item, look that hash value up in the hash table, and if it is there - bingo. For a list, the search has to check each item in turn - on average 1/2 of all the items in the list. Even for many thousands of items, a set will be far quicker to search.
Actually, sets are not faster than lists in every scenario. Generally lists are faster than sets. But in the case of searching for an element in a collection, sets are faster because they are implemented using hash tables, so Python does not have to search the full set, which means the average time complexity is O(1). Lists use dynamic arrays, and Python needs to scan the full array to search, which takes O(n).
So finally we can see that sets are better in some cases and lists are better in others. It's up to us to select the appropriate data structure according to our task.
A list must be searched one item at a time, whereas a set or dictionary has an index for faster searching.
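A quick way to see the difference yourself, as a rough sketch (the collection size and element are arbitrary, and absolute numbers will vary by machine):

import timeit

setup = "items = list(range(100_000)); s = set(items)"
# Worst-case membership test: the element sits at the end of the list.
print(timeit.timeit("99_999 in items", setup=setup, number=1_000))  # O(n) scan
print(timeit.timeit("99_999 in s", setup=setup, number=1_000))      # O(1) hash lookup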

What is the true difference between a dictionary and a hash table?

I've always used dictionaries. I write in Python.
A dictionary is a general concept that maps keys to values. There are many ways to implement such a mapping.
A hashtable is a specific way to implement a dictionary.
Besides hashtables, another common way to implement dictionaries is red-black trees.
Each method has its own pros and cons. A red-black tree can always perform a lookup in O(log N). A hashtable can perform a lookup in O(1) time although that can degrade to O(N) depending on the input.
A dictionary is a data structure that maps keys to values.
A hash table is a data structure that maps keys to values by taking the hash value of the key (by applying some hash function to it) and mapping that to a bucket where one or more values are stored.
IMO this is analogous to asking the difference between a list and a linked list.
For clarity it may be important to note that it MAY be the case that Python currently implements its dictionaries using hash tables, and it MAY be the case in the future that Python changes that without its dictionaries ceasing to be dictionaries.
"A dictionary" has a few different meanings in programming, as wikipedia will tell you -- "associative array", the sense in which Python uses the term (also known as "a mapping"), is one of those meanings (but "data dictionary", and "dictionary attacks" in password guess attempts, are also important).
Hash tables are important data structures; Python uses them to implement two important built-in data types, dict and set.
So, even in Python, you can't consider "hash table" to be a synonym for "dictionary"... since a similar data structure is also used to implement "sets"!-)
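A small demonstration of that shared hash-based machinery: dict and set reject an unhashable (mutable) key in exactly the same way.

key = [1, 2, 3]            # lists are mutable, hence unhashable
try:
    {key: "value"}         # dict keys must be hashable
except TypeError as e:
    print("dict:", e)      # unhashable type: 'list'
try:
    {key}                  # set elements must be hashable too
except TypeError as e:
    print("set:", e)       # unhashable type: 'list'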
A Python dictionary is internally implemented with a hashtable.
Both dictionaries and hash tables pair keys with values in order to get fast big-O behavior for insertion, deletion and lookup; the difference is that a hash table specifically uses hashes to store the (key, value) pairs, which is why access is fast. Python implements dictionaries as hash tables, and you can use any kind of hashable object as a key...
Also, since Python 3.7 dicts preserve insertion order, which makes them a bit more list-like in that respect. Check this for more details:
https://softwaremaniacs.org/blog/2020/02/05/dicts-ordered/en/
A hash table always uses some function operating on the key to determine where a value will be stored. A dictionary (as I believe you intend it) is a more general term, and simply indicates a lookup mechanism, which might be a hash table or might be implemented by a simpler structure that does not consider the key itself in determining its storage location.
A dictionary can be implemented using a hash table. In my opinion the difference between the two can be thought of as the difference between stacks and arrays, where we would be using arrays to implement stacks.
