using python hash() values as a refrence number - python

I'm writing a scrapy spider that collects news articles from various online newspapers. The sites in question update at least once a day, and I'm going to run the spider just as often, I need some way to filter out duplicates (i.e articles I've already scraped).
In other cases it'd be as simple as comparing reference numbers, but newspaper articles don't have any reference numbers. I was wondering if it'd be possible to hash the title using pythons hash() function and use the resulting value as a stand-in for an actual reference number, just for comparison purposes?
On the surface it seems possible, but what do you guys think?

Yes, you can do that, but I'd not use hash() for this, as hash() is optimised for a different task and can lead too easily to collisions on larger texts (different inputs resulting in the same hash value).
Use a cryptographic hashing scheme instead; the hashlib module gives you access to MD5 and other algorithms and produce output that is far less likely to produce collisions.
For your purposes, MD5 will do just fine:
article_hash = hashlib.md5(scraped_info).hexdigest()
This has the added advantage that the MD5 hash is always going to be calculated the same regardless of OS or system architecture; hash() can offer no such guarantee.

Related

When you don't specify an order- why does result order vary?

I am in the very early stages of learning Python. This question has more to do with basic understanding than coding- hopefully I tag it correctly. I am reading my coursework and it says
"Run the program below that displays the ... The indentation and
spacing of the... key-value pairs simply provides more readability.
Note that order is not maintained in the dict when printed."
I know I can specify so that the order is the same each time. I can do that. I want to know when you write a program and run it why do the results get returned in a different order when not specified? Is it because of the way it gets handled in the processor?
Thanks.
The answer has nothing to do with Python, and everything to do with data structures - this behavior is universal and expected across all languages that implement a similar data structure. In Python it's called a dictionary, in other languages it's called a Map or a Hash Map or a Hash Table. There are a few other similar names for the same underlying data structure.
The Python dictionary is an Associative collection, as opposed to a Python List (which is just an Array), where its elements are contiguous in memory.
The big advantage that dictionaries (associative collections) offer is fast and constant look up times (O(1)) - arrays also offer fast look up since calculating an index is trivial - however a dictionary consists of key-value pairs where the key can be anything as long as it is hashable.
Essentially, to determine the "index" where an associated value should go in an associative container, you take the key, hash it, devise some way of mapping the hash to a number and treat that number like an index. As unlikely as it is for two different objects to yield the same hash, it could theoretically happen - what's more likely to happen is that your hash-to-number procedure maps two unique hashes to the same number - in any case, collisions like this can happen, and there are strategies for handling these collisions.
The point is, the hash of a key determines the order in which the associated value appears in the collection - therefore, there is no inherent order.

Algorithm to generate 12 byte hash from Web URLs

I am crawling some websites for special items and storing them in MongoDB server. To avoid duplicate items, I am using the hash value of the item link. Here is my code to generate the hash from the link:
import hashlib
from bson.objectid import ObjectId
def gen_objectid(link):
"""Generates objectid from given link"""
return ObjectId(hashlib.shake_128(str(link).encode('utf-8')).digest(12))
# end def
I have no idea how the shake_128 algorithm works. That is where my question comes in.
Is it okay to use this method? Can I safely assume that the probability of a collision is negligible?
What is the better way to do this?
shake_128 is one of the SHA-3 hash algorithms, chosen as the result of a contest to be the next generation of secure hash algorithms. They are not widely used, since SHA-2 is still considered good enough in most cases. Since these algorithms are designed for cryptographically secure hashing, this should be overkill for what you are doing. Also shake_128, as the name implies, should give you a 128-bit value, which is 16 bytes, not 12. This gives you 2^128 = 3.4e38 different hashes. I think you will be just fine. If anything, I would say you could use a faster hashing algorithm since you don't need cryptographic security in this case.

Deterministic, recursive hashing in python

Python 3's default hashing function(s) isn't deterministic (hash(None) varies from run to run), and doesn't even make a best effort to generate unique id's with high probability (hash(-1)==hash(-2) is True).
Is there some other hash function that works well as a checksum (i.e. negligible probability of two data structures hashing to the same value, and returns the same result each run of python), and supports all of python's built-in datatypes, including None?
Ideally it would be in the standard library. I can pickle the object or get a string representation, but that seems unnecessarily hacky, and string representations of floats are probably very bad checksums.
I found the cryptographic hashes (md5,sha256) in the standard library, but they only operate on bytestrings.
Haskell seems to get this ~almost right in their standard library... but "Nothing::Maybe Int" and 0 both hash to 0, so it's not perfect there either.
You can use any hash from hashlib on a pickled object.
pickle.dumps not suitable for hashing.
You can use sorted-keys json with hashlib.
hashlib.md5(json.dumps(data, sort_keys=True)).hexdigest()
Taken from: https://stackoverflow.com/a/10288255/3858507, according to AndrewWagner's comment.
By the way and only for reference as this causes security vulnerabitilies, the PYTHONHASHSEED environment variable can be used to disable randomization of hashes throughout your application.

Optimizing Python Dictionary Lookup Speeds by Shortening Key Size?

I'm not clear on what goes on behind the scenes of a dictionary lookup. Does key size factor into the speed of lookup for that key?
Current dictionary keys are between 10-20 long, alphanumeric.
I need to do hundreds of lookups a minute.
If I replace those with smaller key IDs of between 1 & 4 digits will I get faster lookup times? This would mean I would need to add another value in each item the dictionary is holding. Overall the dictionary will be larger.
Also I'll need to change the program to lookup the ID then get the URL associated with the ID.
Am I likely just adding complexity to the program with little benefit?
Dictionaries are hash tables, so looking up a key consists of:
Hash the key.
Reduce the hash to the table size.
Index the table with the result.
Compare the looked-up key with the input key.
Normally, this is amortized constant time, and you don't care about anything more than that. There are two potential issues, but they don't come up often.
Hashing the key takes linear time in the length of the key. For, e.g., huge strings, this could be a problem. However, if you look at the source code for most of the important types, including [str/unicode](https://hg.python.org/cpython/file/default/Objects/unicodeobject.c, you'll see that they cache the hash the first time. So, unless you're inputting (or randomly creating, or whatever) a bunch of strings to look up once and then throw away, this is unlikely to be an issue in most real-life programs.
On top of that, 20 characters is really pretty short; you can probably do millions of such hashes per second, not hundreds.
From a quick test on my computer, hashing 20 random letters takes 973ns, hashing a 4-digit number takes 94ns, and hashing a value I've already hashed takes 77ns. Yes, that's nanoseconds.
Meanwhile, "Index the table with the result" is a bit of a cheat. What happens if two different keys hash to the same index? Then "compare the looked-up key" will fail, and… what happens next? CPython's implementation uses probing for this. The exact algorithm is explained pretty nicely in the source. But you'll notice that given really pathological data, you could end up doing a linear search for every single element. This is never going to come up—unless someone can attack your program by explicitly crafting pathological data, in which case it will definitely come up.
Switching from 20-character strings to 4-digit numbers wouldn't help here either. If I'm crafting keys to DoS your system via dictionary collisions, I don't care what your actual keys look like, just what they hash to.
More generally, premature optimization is the root of all evil. This is sometimes misquoted to overstate the point; Knuth was arguing that the most important thing to do is find the 3% of the cases where optimization is important, not that optimization is always a waste of time. But either way, the point is: if you don't know in advance where your program is too slow (and if you think you know in advance, you're usually wrong…), profile it, and then find the part where you get the most bang for your buck. Optimizing one arbitrary piece of your code is likely to have no measurable effect at all.
Python dictionaries are implemented as hash-maps in the background. The key length might have some impact on the performance if, for example, the hash-functions complexity depends on the key-length. But in general the performance impacts will be definitely negligable.
So I'd say there is little to no benefit for the added complexity.

Why use hashes instead of test true equality?

I've recently been looking into Python's dictionaries (I believe they're called associate arrays in other languages) and was confused by a couple of the restrictions on its keys.
First, dict keys must be immutable. When I looked up the logic behind it the answer was that dictionaries work like hash tables to look up the values for keys, and thus immutable keys (if they're hashable at all) may change their hash value, causing problems when retrieving the value.
I understood why that was the case just fine, but I was still somewhat confused by what the point of using a hash table would be. If you simply didn't hash the keys and tested for true equality (assuming indentically constructed objects compare equal), you could replicate most of the functionality of the dictionary without that limitation just by using two lists.
So, I suppose that's my real question - what's the rationale behind using hashes to look up values instead of equality?
If I had to guess, it's likely simply because comparing integers is incredibly fast and optimized, whereas comparing instances of other classes may not be.
You seem to be missing the whole point of a hash table, which is fast (O(1))1 retrieval, and which cannot be implemented without hashing, i.e. transformation of the key into some kind of well-distributed integer that is used as an index into a table. Notice that equality is still needed on retrieval to be able to handle hash collisions2, but you resort to it only when you already narrowed down the set of elements to those having a certain hash.
With just equality you could replicate similar functionality with parallel arrays or something similar, but that would make retrieval O(n)3; if you ask for strict weak ordering, instead, you can implement RB trees, that allow for O(log n) retrieval. But O(1) requires hashing.
Have a look at Wikipedia for some more insight on hash tables.
Notes
It can become O(n) in pathological scenarios (where all the elements get put in the same bucket), but that's not supposed to happen with a good hashing function.
Since different elements may have the same hash, after getting to the bucket corresponding to the hash you must check if you are actually retrieving the element associated with the given key.
Or O(log n) if you keep your arrays sorted, but that complicates insertion, which becomes on average O(n) due to shifting elements around; but then again, if you have ordering you probably want an RB tree or a heap.
By using hastables you achieve O(1) retrieval data, while comparing against each independent vale for equality will take O(n) (in a sequential search) or O(log(n)) in a binary search.
Also note that O(1) is amortized time, because if there are several values that hash to the same key, then a sequential search is needed among these values.

Categories

Resources