I am making a class in Python that relates a lot of nodes and edges together. I also have other operations that can take two separate objects and merge them into a single object of the same type, and so on.
However, I need a way to give every node a unique ID for easy lookup. Is there a "proper way" to do this, or do I just have to keep an external ID variable that I increment and pass into my class methods every time I add more nodes to any object?
I also considered generating a random string for each node upon creation, but there is still a risk of collision (even if that probability is near zero, it still exists and feels like a design flaw, not to mention a long-winded, over-engineered way of going about it).
If you just need a unique identifier, the built-in Python id() function would do it:
Return the “identity” of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
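For instance (a quick illustration; the exact integers are implementation-specific, and in CPython they are memory addresses):

class Node(object):
    pass

a, b = Node(), Node()
print(id(a))           # some integer, unique while a is alive
print(id(b))           # a different integer, unique while b is alive
print(id(a) == id(b))  # False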
You could keep a class variable and use it for ordinal ids:
class Node(object):
    _id = 0

    def __init__(self):
        self._id = Node._id
        Node._id += 1
It also has the benefit that your class will know how many objects have been created altogether.
This is also way cheaper than random ids.
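For example (a quick sketch of how the ids come out):

a = Node()
b = Node()
print(a._id)     # 0
print(b._id)     # 1
print(Node._id)  # 2 -- the total number of nodes created so far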
Both of your solutions are pretty much what is done in practice.
Your first solution, just incrementing a number, will give you uniqueness as long as you don't overflow (with Python's arbitrary-precision integers this isn't really a problem). The disadvantage of this approach is that if you start doing concurrency, you have to make sure you use locking to prevent data races when incrementing and reading your external value.
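A minimal sketch of what that locking could look like (the class and method names here are made up for illustration):

import threading

class IdCounter(object):
    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def next_id(self):
        # hold the lock so the read-and-increment is atomic across threads
        with self._lock:
            value = self._next
            self._next += 1
            return value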
The other approach, where you generate a random number, works well in the concurrency situation. The more bits you use, the less likely you are to run into a collision; in fact, you can pretty much guarantee that you won't have collisions if you use, say, 128 bits for your id.
An approach you can use to further guarantee you don't have collisions is to make your unique ids something like TIMESTAMP_HASHEDMACHINENAME_PROCESSID/THREADID_UNIQUEID. Then you pretty much can't have collisions unless you generate two of the same UNIQUEID on the same process/thread within one second. MongoDB does something like this, where they just increment the UNIQUEID. I am not sure what they do in the case of an overflow (which I assume doesn't happen too often in practice); one solution might be to just wait until the next second before generating more ids.
This is probably overkill for what you are trying to do, but it is a somewhat interesting problem indeed.
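If you're curious, a rough sketch of such an id (loosely in the spirit of MongoDB's ObjectId, not its actual format; every name here is made up):

import hashlib
import itertools
import os
import socket
import time

_counter = itertools.count()
_machine = hashlib.md5(socket.gethostname().encode()).hexdigest()[:6]

def make_id():
    # timestamp _ hashed machine name _ process id _ per-process counter
    return "{}_{}_{}_{}".format(int(time.time()), _machine, os.getpid(), next(_counter))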
UUID is good for this sort of thing.
>>> from uuid import uuid4
>>> uuid4().hex
'461dd72c63db4ae9a969978daadc59f0'
Universally unique IDs have a very low collision rate -- unless you are creating billions of nodes, it should do the trick.
I tried to find an answer here and in the Python docs, but the only things I got were questions about hashing list objects and details about how dicts work.
Background
I'm developing a program that parses over a huge graph (at the moment 44K nodes, 14K of which are of any interest, connected by 15K edges) and I have performance problems. I already optimized my algorithm as far as I could, so now the last resort is to optimize the data structure:
def single_pass_build(nodes):
    for node in nodes:
        if node.__class__ in listOfRequiredClasses:
            children = get_children(node)
            for child in children:
                if child.__class__ in listOfRequiredClasses:
                    add_edge(node, child)

def get_children(node):
    return [attr for attr in node.__dict__.values() if attr.__class__ in listOfRequiredClasses]
I still have to take care of my add_edge function, but even without it my program takes slightly over 10 minutes for nothing but this iteration. For comparison: the module I get the data from generates it from an XML document in no more than 5 seconds.
I have a total of 44K objects, each representing a node in a relation graph. The objects I get have plenty of attributes, so I could either try to optimize get_children to know all relevant attributes for every class, or just speed up the lookup. Membership tests on lists take time linear in the list length, so if a is the number of attributes per object and k the number of classes in my list, I get a total of O(nak + mak) over the nodes and edges. Many of my attribute classes are not in that list, so I am closer to the worst case than to the average. I'd like to speed up the lookup from O(k) to O(1), or at least O(log(k)).
Question
As far as I know, a dict key lookup degrades when there are many hash collisions, and with few to no collisions it becomes (almost) constant time. Since I don't care about any values, I'd like to know: is there a kind of (hash-based) container optimized purely for x in container checks?
I could use a dict with None values, but with a total of 70,000 lookups and larger graphs in the future, every millisecond counts. Space is not the big problem here because I expect ~50 classes total and in no case more than a few hundred, but in other cases space could be an issue too.
I don't expect the answer to be in standard Python, but maybe someone knows a common framework that can help, or can convince me that there is no reason at all why I can't use a dict for the job.
You want the built-in set type: https://docs.python.org/2/library/stdtypes.html#set
And yes, it IS in standard Python ;)
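Applied to the code above it would look something like this (Foo and Bar stand in for your real classes):

class Foo(object): pass
class Bar(object): pass

requiredClasses = {Foo, Bar}   # classes are hashable, so they can go straight into a set

node = Foo()
print(node.__class__ in requiredClasses)   # True -- average O(1) instead of O(k)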
I'm not clear on what goes on behind the scenes of a dictionary lookup. Does key size factor into the speed of lookup for that key?
Current dictionary keys are between 10 and 20 characters long, alphanumeric.
I need to do hundreds of lookups a minute.
If I replace those with smaller key IDs of between 1 & 4 digits will I get faster lookup times? This would mean I would need to add another value in each item the dictionary is holding. Overall the dictionary will be larger.
Also I'll need to change the program to lookup the ID then get the URL associated with the ID.
Am I likely just adding complexity to the program with little benefit?
Dictionaries are hash tables, so looking up a key consists of:
Hash the key.
Reduce the hash to the table size.
Index the table with the result.
Compare the looked-up key with the input key.
Normally, this is amortized constant time, and you don't care about anything more than that. There are two potential issues, but they don't come up often.
Hashing the key takes linear time in the length of the key. For, e.g., huge strings, this could be a problem. However, if you look at the source code for most of the important types, including [str/unicode](https://hg.python.org/cpython/file/default/Objects/unicodeobject.c), you'll see that they cache the hash the first time. So, unless you're inputting (or randomly creating, or whatever) a bunch of strings to look up once and then throw away, this is unlikely to be an issue in most real-life programs.
On top of that, 20 characters is really pretty short; you can probably do millions of such hashes per second, not hundreds.
From a quick test on my computer, hashing 20 random letters takes 973ns, hashing a 4-digit number takes 94ns, and hashing a value I've already hashed takes 77ns. Yes, that's nanoseconds.
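If you want to reproduce numbers like these yourself, here is a rough sketch with timeit (your absolute numbers will differ):

import random
import string
import timeit

s = ''.join(random.choice(string.ascii_letters) for _ in range(20))
n = 1234

# Note: str objects cache their hash, so after the first iteration this mostly
# measures the cached path (the "already hashed" case above).
print(timeit.timeit('hash(s)', globals={'s': s}, number=1000000))
print(timeit.timeit('hash(n)', globals={'n': n}, number=1000000))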
Meanwhile, "Index the table with the result" is a bit of a cheat. What happens if two different keys hash to the same index? Then "compare the looked-up key" will fail, and… what happens next? CPython's implementation uses probing for this. The exact algorithm is explained pretty nicely in the source. But you'll notice that given really pathological data, you could end up doing a linear search for every single element. This is never going to come up—unless someone can attack your program by explicitly crafting pathological data, in which case it will definitely come up.
Switching from 20-character strings to 4-digit numbers wouldn't help here either. If I'm crafting keys to DoS your system via dictionary collisions, I don't care what your actual keys look like, just what they hash to.
More generally, premature optimization is the root of all evil. This is sometimes misquoted to overstate the point; Knuth was arguing that the most important thing to do is find the 3% of the cases where optimization is important, not that optimization is always a waste of time. But either way, the point is: if you don't know in advance where your program is too slow (and if you think you know in advance, you're usually wrong…), profile it, and then find the part where you get the most bang for your buck. Optimizing one arbitrary piece of your code is likely to have no measurable effect at all.
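For example, a minimal way to profile (assuming your program's entry point is a main() function):

import cProfile

cProfile.run('main()', sort='cumtime')   # prints which functions dominate the runtime
# or, from the shell: python -m cProfile -s cumtime your_script.py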
Python dictionaries are implemented as hash maps in the background. The key length might have some impact on performance if, for example, the hash function's complexity depends on the key length. But in general the performance impact will be negligible.
So I'd say there is little to no benefit for the added complexity.
I want a data type that will allow me to efficiently keep track of objects that have been "added" to it, allowing me to test for membership. I don't need any other features.
As far as I can tell, Python does not have such a datatype. The closest to what I want is the set, but a set will always store the values themselves (which I do not need).
Currently the best I can come up with is taking the hash() of each object and storing it in a set, but at a lower level a hash of the hash is being computed, and the hash string is being stored as a value.
Is there a way to use just the low-level lookup functionality of Sets without actually pointing to anything?
Basically, no, because, as I pointed out in my comment, it is perfectly possible for two unequal objects to share the same hash key.
The hash key points, not to either nothing or an object, but to a bucket which contains zero or more objects. The set implementation then needs to do equality comparisons against each of these to work out if the object is in the set.
So you always need at least enough information to make an equality comparison. If you've got very large objects whose equality can be decided on a subset of their data, say 2 or 3 fields, you could consider creating a new object with just these fields and storing this in the set instead of the whole object.
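A sketch of that idea (Big, a and b are made-up names; equality here is decided by a and b only):

class Big(object):
    def __init__(self, a, b, blob):
        self.a, self.b, self.blob = a, b, blob   # blob is the bulky part we don't care about

class Key(object):
    # keeps only the fields that decide equality
    def __init__(self, obj):
        self.fields = (obj.a, obj.b)
    def __hash__(self):
        return hash(self.fields)
    def __eq__(self, other):
        return self.fields == other.fields

seen = set()
seen.add(Key(Big(1, 2, "huge payload")))
print(Key(Big(1, 2, "different payload")) in seen)   # True: only a and b matter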
The weakref module implements a bunch of containers that can test membership without "storing" the value; the downside is that when the last strong reference to the object is removed, the object disappears from the weak container.
If this works for you, WeakSet is what you want.
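A quick illustration:

import weakref

class Thing(object):
    pass

seen = weakref.WeakSet()
t = Thing()
seen.add(t)
print(t in seen)    # True
del t               # the last strong reference is gone...
print(len(seen))    # ...so the entry disappears (0), possibly only after garbage collection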
If this doesn't work for you, then it seems you want a Bloom filter, which is probabilistic (there are false positives) but for your purpose robust (the default is no false negatives).
The typical arrangement is: try the filter first; if it says no, it's a no; if it says yes, check the slow way (e.g. against a word list in a file).
Forgive me for asking in such a general way, as I'm sure their performance depends on how one uses them, but in my case collections.deque was way slower than collections.defaultdict when I wanted to verify the existence of a value.
I used the spelling correction from Peter Norvig in order to verify a user's input against a small set of words. As I had no use for a dictionary with word frequencies I used a simple list instead of defaultdict at first, but replaced it with deque as soon as I noticed that a single word lookup took about 25 seconds.
Surprisingly, that wasn't faster than using a list so I returned to using defaultdict which returned results almost instantaneously.
Can someone explain this difference in performance to me?
Thanks in advance
PS: If one of you wants to reproduce what I was talking about, change the following lines in Norvig's script.
-NWORDS = train(words(file('big.txt').read()))
+NWORDS = collections.deque(words(file('big.txt').read()))
-return max(candidates, key=NWORDS.get)
+return candidates
These three data structures aren't interchangeable, they serve very different purposes and have very different characteristics:
Lists are dynamic arrays; you use them to store items sequentially for fast random access, to use them as a stack (adding and removing at the end), or just to store something and later iterate over it in the same order.
Deques are sequences too, but only for adding and removing elements at both ends, not for random access or stack-like growth.
Dictionaries (providing a default value is just a relatively simple and convenient, but for this question irrelevant, extension) are hash tables; they associate fully-featured keys (instead of an index) with values and provide very fast access to a value by its key and (necessarily) very fast checks for key existence. They don't maintain order and require the keys to be hashable, but well, you can't make an omelette without breaking eggs.
All of these properties are important; keep them in mind whenever you choose one over the other. What hurts you in this particular case is a combination of the last property of dictionaries and the number of possible corrections that have to be checked. Some simple combinatorics would give you a concrete formula for the number of edits this code generates for a given word, but anyone who has mispredicted such things often enough knows it's going to be a surprisingly large number even for average words.
For each of these edits, there is an edit in NWORDS check to weed out edits that result in unknown words. That's not a big problem in Norvig's program, since in checks (key existence checks) are, as mentioned before, very fast. But you swapped the dictionary for a sequence (a deque)! For sequences, in has to iterate over the whole sequence and compare each item with the value searched for (it can stop when it finds a match, but since few of the edits are known words sitting at the beginning of the deque, it usually still searches all or most of the deque). Since there are quite a few words and the test is done for each edit generated, you end up spending 99% of your time doing a linear search in a sequence, where you could just hash a string and compare it once (or at most, in case of collisions, a few times).
If you don't need weights, you can conceptually use bogus values you never look at and still get the performance boost of an O(1) in check. In practice, you should just use a set, which uses pretty much the same algorithms as the dictionaries and just cuts away the part where it stores the value (it was actually first implemented like that; I don't know how far the two have diverged since sets were re-implemented in a dedicated, separate C module).
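Sticking to the names from Norvig's script, the change would look roughly like this (note that without frequencies you lose the ranking that max(candidates, key=NWORDS.get) gave you, so you'd need another way to pick among candidates):

# instead of a deque (or a defaultdict whose values you never read):
NWORDS = set(words(file('big.txt').read()))

def known(candidate_words):
    return set(w for w in candidate_words if w in NWORDS)   # O(1) membership per word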
Is there an article or forum discussion or something somewhere that explains why lists use append/extend, but sets and dicts use add/update?
I frequently find myself converting lists into sets and this difference makes that quite tedious, so for my personal sanity I'd like to know what the rationalization is.
The need to convert between these occurs regularly as we iterate on development. Over time as the structure of the program morphs, various structures gain and lose requirements like ordering and duplicates.
For example, something that starts out as an unordered bunch of stuff in a list might pick up the requirement that there be no duplicates and so need to be converted to a set.
All such changes require finding and changing all places where the relevant structure is added/appended and extended/updated.
So I'm curious to see the original discussion that led to this language choice, but unfortunately I didn't have any luck googling for it.
append has a popular definition of "add to the very end", and extend can be read similarly (in the nuance where it means "...beyond a certain point"); sets have no "end", nor any way to specify some "point" within them or "at their boundaries" (because there are no "boundaries"!), so it would be highly misleading to suggest that these operations could be performed.
x.append(y) always increases len(x) by exactly one (whether y was already in list x or not); no such assertion holds for s.add(z) (s's length may increase or stay the same). Moreover, in these snippets, y can have any value (i.e., the append operation never fails [except for the anomalous case in which you've run out of memory]) -- again no such assertion holds about z (which must be hashable, otherwise the add operation fails and raises an exception). Similar differences apply to extend vs update. Using the same name for operations with such drastically different semantics would be very misleading indeed.
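A small demonstration of those differences:

lst = [1, 2]
lst.append(2)
print(len(lst))     # 3 -- append always grows the list by exactly one

s = set([1, 2])
s.add(2)
print(len(s))       # 2 -- the element was already there, so the length is unchanged

try:
    s.add([3, 4])
except TypeError as e:
    print(e)        # unhashable type: 'list' -- add can fail where append never does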
it seems pythonic to just use a list on the first pass and deal with the performance on a later iteration
Performance is the least of it! lists support duplicate items, ordering, and any item type -- sets guarantee item uniqueness, have no concept of order, and demand item hashability. There is nothing Pythonic in using a list (plus goofy checks against duplicates, etc) to stand for a set -- performance or not, "say what you mean!" is the Pythonic Way;-). (In languages such as Fortran or C, where all you get as a built-in container type are arrays, you might have to perform such "mental mapping" if you need to avoid using add-on libraries; in Python, there is no such need).
Edit: the OP asserts in a comment that they don't know from the start (e.g.) that duplicates are disallowed in a certain algorithm (strange, but, whatever) -- they're looking for a painless way to make a list into a set once they do discover duplicates are bad there (and, I'll add: order doesn't matter, items are hashable, indexing/slicing unneeded, etc). To get exactly the same effect one would have if Python's sets had "synonyms" for the two methods in question:
class somewhatlistlikeset(set):
    def append(self, x): self.add(x)
    def extend(self, x): self.update(x)
Of course, if the only change is at the set creation (which used to be list creation), the code may be much more challenging to follow, having lost the useful clarity whereby using add vs append allows anybody reading the code to know "locally" whether the object is a set vs a list... but this, too, is part of the "exactly the same effect" above-mentioned!-)
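Used like this:

s = somewhatlistlikeset([1, 2])
s.append(2)          # same as s.add(2); still {1, 2}
s.extend([3, 4])     # same as s.update([3, 4])
print(sorted(s))     # [1, 2, 3, 4]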
set and dict are unordered. "Append" and "extend" conceptually only apply to ordered types.
It's written that way to annoy you.
Seriously. It's designed so that one can't simply convert one into the other easily. Historically, sets are based off dicts, so the two share naming conventions. While you could easily write a set wrapper to add these methods ...
class ListlikeSet(set):
    def append(self, x):
        self.add(x)
    def extend(self, xs):
        self.update(xs)
... the greater question is why you find yourself converting lists to sets with such regularity. They represent substantially different models of a collection of objects; if you have to convert between the two a lot, it suggests you may not have a very good handle on the conceptual architecture of your program.