Is there a Set-like object that doesn't store values? - python

I want a data type that will allow me to efficiently keep track of objects that have been "added" to it, allowing me to test for membership. I don't need any other features.
As far as I can tell, Python does not have such a datatype. The closest to what I want is the set, but a set will always store the values themselves (which I do not need).
Currently the best I can come up with is taking the hash() of each object and storing that in a set, but at a lower level a hash of that hash is being computed, and the hash value is still being stored.
Is there a way to use just the low-level lookup functionality of sets without actually pointing to anything?

Basically, no, because, as I pointed out in my comment, it is perfectly possible for two unequal objects to share the same hash value.
The hash points not to nothing, nor to a single object, but to a bucket which contains zero or more objects. The set implementation then needs to do equality comparisons against each of these to work out whether the object is in the set.
So you always need at least enough information to make an equality comparison. If you've got very large objects whose equality can be decided on a subset of their data, say 2 or 3 fields, you could consider creating a new object with just these fields and storing this in the set instead of the whole object.
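For illustration, here is a minimal sketch of that idea (the Record and make_key names are made up; the assumption is that equality is fully decided by two fields):

from collections import namedtuple

# Small surrogate key holding only the fields that decide equality.
Key = namedtuple("Key", ["field_a", "field_b"])

class Record(object):
    def __init__(self, field_a, field_b, payload):
        self.field_a = field_a
        self.field_b = field_b
        self.payload = payload  # large data that never affects equality

def make_key(record):
    return Key(record.field_a, record.field_b)

seen = set()
seen.add(make_key(Record(1, "x", "lots of data...")))
print(make_key(Record(1, "x", "different payload")) in seen)  # True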

The weakref module implements a number of containers that can test membership without "storing" the value; the downside is that when the last strong reference to the object goes away, the object disappears from the weak container.
If that works for you, WeakSet is what you want.
If it doesn't, then it seems you want a Bloom filter, which is probabilistic (there are false positives) but, for your purpose, robust (it never produces false negatives).
The typical arrangement is: try the filter first; if it says no, it's definitely a no; if it says yes, check the slow way (e.g. a word list in a file).
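A quick sketch of the WeakSet behavior described above, assuming the tracked objects are weakly referenceable (ordinary class instances are; plain ints and strings are not):

import weakref

class Token(object):
    pass

seen = weakref.WeakSet()
t = Token()
seen.add(t)
print(t in seen)   # True

del t              # drop the last strong reference
# In CPython the object is collected immediately, so the entry disappears.
print(len(seen))   # 0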

Strange behavior of python built in hash function [duplicate]

I use Spyder, running Python 2.7.
Just found interesting things:
hash(-1) and hash(-2) both return -2; is there a problem? I thought a hash function on different objects should return different values. I read in previous posts that -1 is reserved as an error in Python.
hash('s') returns 1835142386, then hash(1835142386) returns the same value. Is this another problem?
Thanks.
-1 is not "reserved as an error" in Python in the sense of being unusable as a value; there are a huge number of programs you couldn't write simply and clearly if you weren't allowed to use -1. What is true is that, at the C level, CPython hash functions use -1 to signal an error, so a computed hash of -1 is silently replaced with -2, which is why hash(-1) == hash(-2).
"Is there a problem?" No. Hash functions do not need to return a different hash for every object. In fact, this is not possible, since there are many more possible objects than there are hashes. CPython's hash() has the nice property of returning its argument for non-negative numbers up to sys.maxint, which is why in your second question hash(hash('s')) == hash('s'), but that is an implementation detail.
The fact that -1 and -2 have the same hash simply means that using those values as, for example, dictionary keys will result in a hash conflict. Hash conflicts are an expected situation and are automatically resolved by Python, and the second key added would simply go in the next available slot in the dictionary. Accessing the key that was inserted second would then be slightly slower than accessing the other one, but in most cases, not enough slower that you'd notice.
It is possible to construct a huge number of unequal objects all with the same hash value, which would, when stored in a dictionary or a set, cause the performance of the container to deteriorate substantially because every object added would cause a hash collision, but it isn't something you will run into unless you go looking for it.
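A small illustration of the collision handling described above; Colliding is a made-up class whose instances all share the same hash:

class Colliding(object):
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42                    # every instance collides on purpose
    def __eq__(self, other):
        return isinstance(other, Colliding) and self.name == other.name

s = {Colliding("a"), Colliding("b")}
print(len(s))               # 2: same hash, but unequal, so both are kept
print(Colliding("a") in s)  # True: found via the hash bucket, confirmed via ==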

When you don't specify an order, why does the result order vary?

I am in the very early stages of learning Python. This question has more to do with basic understanding than coding - hopefully I tagged it correctly. I am reading my coursework and it says
"Run the program below that displays the ... The indentation and
spacing of the... key-value pairs simply provides more readability.
Note that order is not maintained in the dict when printed."
I know I can specify the order so that it is the same each time; I can do that. What I want to know is why, when you write a program and run it, the results come back in a different order when no order is specified. Is it because of the way it gets handled in the processor?
Thanks.
The answer has little to do with Python specifically and everything to do with the data structure: this behavior is expected in most languages that implement a similar structure. In Python it's called a dictionary; in other languages it's called a Map, a Hash Map, or a Hash Table, and there are a few other names for the same underlying data structure. (As an aside, since Python 3.7 the built-in dict does preserve insertion order as a language guarantee, but the hash-table reasoning below explains the behavior your coursework describes.)
The Python dictionary is an associative collection, as opposed to a Python list, which is essentially an array whose elements are contiguous in memory.
The big advantage that dictionaries (associative collections) offer is fast, constant-time lookup (O(1)). Arrays also offer fast lookup, since calculating an index is trivial; however, a dictionary consists of key-value pairs where the key can be anything, as long as it is hashable.
Essentially, to determine the "index" where an associated value should go in an associative container, you take the key, hash it, map that hash onto the much smaller range of table slots, and treat the result as an index. As unlikely as it is for two different objects to yield the same hash, it can theoretically happen; what's more likely is that the hash-to-slot mapping sends two distinct hashes to the same slot. In any case, collisions like this can happen, and there are strategies for handling them.
The point is, the hash of a key determines the order in which the associated value appears in the collection - therefore, there is no inherent order.
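A toy illustration of that mapping (not CPython's actual algorithm): each key's hash is reduced modulo the table size to pick a slot, so storage order follows the hashes rather than the order of insertion.

table_size = 8
for key in ("apple", "banana", "cherry"):
    print("{0} -> slot {1}".format(key, hash(key) % table_size))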

Checking for duplicate arrays when I have a huge number of arrays

I am counting various patterns in graphs, and I store the relevant information in a defaultdict of lists of numpy arrays of size N, where the index values are integers.
I want to know efficiently whether I am creating a duplicate array. Not removing duplicates can exponentially grow the number of duplicates, to the point where what I am doing becomes infeasible. But there are potentially hundreds of thousands of arrays, stored in different lists, under different keys. As far as I know, I can't hash an array.
If I simply needed to check for duplicate nonzero indices, I would store the nonzero indices as a bit sequence of ones and then hash that value. But I don't only need to check the indices; I need to also check their integer values. Is there any way to do this short of coming up with a completely new design that uses different structures?
Thanks.
The basic idea is "How can I use my own hash (and perhaps ==) to store things differently in a set/dict?" (where "differently" includes "without raising TypeError for being non-hashable").
The first part of the answer is defining your hash function, for example following myrtlecat’s comment. However, beware the standard non-answer based on it: store the custom hash of each object in a set (or map it to, say, the original object with a dict). That you don’t have to provide an equality implementation is a hint that this is wrong: hash values aren’t always unique! (Exception: if you want to “hash by identity”, and know all your keys will outlive the map, id does provide unique “hashes”.)
The rest of the answer is to wrap your desired keys in objects that expose your hash/equality functions as __hash__ and __eq__. Note that overriding the non-hashability of mutable types comes with an obligation to not alter the (underlying) keys! (C programmers would often call doing so undefined behavior.)
For code, see an old answer by xperroni (which includes the option to increase safety by basing the comparisons on private copies that are less likely to be altered by some other code), though I’d add __slots__ to combat the memory overhead.
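As a hedged sketch of that wrapper idea (ArrayKey is a made-up name, and it assumes the wrapped numpy arrays all share the same dtype and shape and are never mutated while used as keys):

import numpy as np

class ArrayKey(object):
    __slots__ = ("_arr", "_hash")   # keep per-key memory overhead small

    def __init__(self, arr):
        self._arr = arr                    # a private copy (arr.copy()) would be safer
        self._hash = hash(arr.tobytes())   # hash the raw bytes once, up front

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        return isinstance(other, ArrayKey) and np.array_equal(self._arr, other._arr)

seen = set()
seen.add(ArrayKey(np.array([0, 1, 2])))
print(ArrayKey(np.array([0, 1, 2])) in seen)   # True
print(ArrayKey(np.array([0, 1, 3])) in seen)   # False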

Giving unique IDs to all nodes?

I am making a class in Python that relates a lot of nodes and edges together. I also have other operations that can take two separate objects and merge them into a single object of the same type, and so on.
However, I need a way to give every node a unique ID for easy lookup. Is there a "proper way" to do this, or do I just have to keep an external ID variable that I increment and pass into my class methods every time I add more nodes to any object?
I also considered generating a random string for each node upon creation, but there is still a risk of collision (even if the probability is near zero, it still exists and seems like a design flaw, if not a long-winded, over-engineered way of going about it anyway).
If you just need a unique identifier, the built-in Python id() function would do it:
Return the “identity” of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
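A quick illustration: id() stays constant for the lifetime of the object, and holding the object elsewhere (here, as the dict's value) is what keeps that id from being reused.

node_a, node_b = object(), object()
registry = {id(node_a): node_a, id(node_b): node_b}
print(id(node_a) in registry)    # True
print(id(node_a) != id(node_b))  # True while both objects are alive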
You could keep a class variable and use it for ordinal ids:
class Node(object):
    _id = 0
    def __init__(self):
        self._id = Node._id
        Node._id += 1
It also has the benefit that your class will know how many objects have been created altogether.
This is also much cheaper than random ids.
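Usage of the counter-based class above:

a, b = Node(), Node()
print(a._id)     # 0
print(b._id)     # 1
print(Node._id)  # 2, which doubles as a count of nodes created so far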
Pretty much both of your solutions are what is done in practice.
Your first solution, just incrementing a number, will give you uniqueness as long as you don't overflow (with Python's big integers this isn't really a problem). The disadvantage of this approach is that if you start doing concurrency, you have to use locking to prevent data races when incrementing and reading your external value.
The other approach, where you generate a random number, works well in the concurrency situation. The larger the number of bits you use, the less likely it is that you will run into a collision. In fact, you can pretty much guarantee that you won't have collisions if you use, say, 128 bits for your id.
An approach you can use to further guarantee you don't have collisions is to make your unique ids something like TIMESTAMP_HASHEDMACHINENAME_PROCESSID/THREADID_UNIQUEID. Then you pretty much can't have collisions unless you generate two of the same UNIQUEID on the same process/thread within one second. MongoDB does something like this, where they just increment the UNIQUEID. I am not sure what they do in the case of an overflow (which I assume doesn't happen too often in practice). One solution might be just to wait until the next second before generating more ids.
This is probably overkill for what you are trying to do, but it is a somewhat interesting problem indeed.
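A rough sketch of that TIMESTAMP_HASHEDMACHINENAME_PROCESSID/THREADID_UNIQUEID layout; the exact field choices here are an assumption for illustration, not MongoDB's actual ObjectId format.

import hashlib
import itertools
import os
import socket
import threading
import time

_counter = itertools.count()   # per-process incrementing unique id

def composite_id():
    machine = hashlib.md5(socket.gethostname().encode()).hexdigest()[:6]
    return "{0}_{1}_{2}-{3}_{4}".format(
        int(time.time()),        # timestamp in whole seconds
        machine,                 # hashed machine name
        os.getpid(),             # process id
        threading.get_ident(),   # thread id (Python 3's threading.get_ident)
        next(_counter),          # counter that only repeats if it overflows
    )

print(composite_id())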
UUID is good for this sort of thing.
>>> from uuid import uuid4
>>> uuid4().hex
'461dd72c63db4ae9a969978daadc59f0'
Universally unique IDs have a very low collision rate -- unless you are creating billions of nodes, it should do the trick.

Why use hashes instead of testing for true equality?

I've recently been looking into Python's dictionaries (I believe they're called associative arrays in other languages) and was confused by a couple of the restrictions on its keys.
First, dict keys must be immutable. When I looked up the logic behind it, the answer was that dictionaries work like hash tables to look up the values for keys, and thus mutable keys (if they're hashable at all) may change their hash value, causing problems when retrieving the value.
I understood why that was the case just fine, but I was still somewhat confused by what the point of using a hash table would be. If you simply didn't hash the keys and tested for true equality (assuming identically constructed objects compare equal), you could replicate most of the functionality of the dictionary without that limitation just by using two lists.
So, I suppose that's my real question - what's the rationale behind using hashes to look up values instead of equality?
If I had to guess, it's likely simply because comparing integers is incredibly fast and optimized, whereas comparing instances of other classes may not be.
You seem to be missing the whole point of a hash table, which is fast (O(1)) retrieval [1], and which cannot be implemented without hashing, i.e. transformation of the key into some kind of well-distributed integer that is used as an index into a table. Notice that equality is still needed on retrieval to be able to handle hash collisions [2], but you resort to it only when you have already narrowed down the set of elements to those having a certain hash.
With just equality you could replicate similar functionality with parallel arrays or something similar, but that would make retrieval O(n) [3]; if you ask for a strict weak ordering instead, you can implement RB trees, which allow O(log n) retrieval. But O(1) requires hashing.
Have a look at Wikipedia for some more insight on hash tables.
Notes
1. It can become O(n) in pathological scenarios (where all the elements get put in the same bucket), but that's not supposed to happen with a good hashing function.
2. Since different elements may have the same hash, after getting to the bucket corresponding to the hash you must check whether you are actually retrieving the element associated with the given key.
3. Or O(log n) if you keep your arrays sorted, but that complicates insertion, which becomes on average O(n) due to shifting elements around; but then again, if you have ordering you probably want an RB tree or a heap.
By using hash tables you achieve O(1) data retrieval, while comparing against each individual value for equality will take O(n) (in a sequential search) or O(log n) (in a binary search over sorted data).
Also note that the O(1) is an average-case bound, because if several keys hash to the same bucket, a sequential search among those entries is still needed.
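A rough comparison of the two approaches described above; exact timings will vary, but the gap grows with the number of keys:

import timeit

keys = list(range(100000))
pairs = list(zip(keys, keys))   # "two parallel lists" style storage
table = dict(zip(keys, keys))   # hash-table storage

def scan(k):
    # O(n): test each stored key for equality until one matches
    for key, value in pairs:
        if key == k:
            return value

print(timeit.timeit(lambda: scan(99999), number=100))    # linear scan
print(timeit.timeit(lambda: table[99999], number=100))   # hashed lookup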
