What is a cryptographic hash, and what are some algorithms? How is it different from a normal hash in Python? How can I determine which to use?
EX:
Cryptographic hash function
hello--aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
helld--44d634fa6b81353bc3ed424879ffd013501ade53
hash function
hash("hello") -1267296259
hash("helld") -1267296266
Please help me
Cryptographic hash functions are different from hash-table hash functions. The main difference is that cryptographic hash functions are designed to resist collisions: it should be computationally infeasible to find two inputs with the same digest. They are also designed to be one-way, so that recovering an input from its digest is infeasible. Hash-table hash functions such as Python's hash() are much faster and are designed for quickly locating items in memory or for cheap comparisons between items.
Consider two different scenarios. If you want to store passwords in a database, you should use something like PBKDF2, which is deliberately slow to compute in order to make brute-force attacks expensive. In the other case, you just want to keep a set of items and check whether an item exists in that set. You can simply store a 32-bit or 64-bit hash of each item (e.g. a class instance) and compare hashes quickly instead of comparing the objects themselves.
For example, for the string "hello", it is much faster to compute and store -1267296259, since it is a 32-bit integer, whereas aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d (its SHA-1 digest) is more secure but slower to compute and larger to store.
P.S. A good example is here.
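To make the three cases above concrete, here is a minimal sketch using only the standard library. The password, the salt size, and the iteration count are illustrative assumptions, not recommendations from the answer.

```python
import hashlib
import os

text = "hello"

# Built-in hash(): a fast, fixed-width integer meant for dict/set lookups.
# For str it is salted per process by default, so the value changes between runs.
table_hash = hash(text)

# Cryptographic digest: deterministic across runs and machines.
sha1_hex = hashlib.sha1(text.encode("utf-8")).hexdigest()

# Password storage: a deliberately slow key-derivation function with a random salt.
# The iteration count below is an assumed example value.
salt = os.urandom(16)
derived = hashlib.pbkdf2_hmac("sha256", b"example password", salt, 100_000)

print(table_hash)
print(sha1_hex)
print(derived.hex())
```

The SHA-1 line reproduces the digest shown in the question, while the hash() value will generally not match the one printed there.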
Related
I am crawling some websites for special items and storing them in MongoDB server. To avoid duplicate items, I am using the hash value of the item link. Here is my code to generate the hash from the link:
import hashlib
from bson.objectid import ObjectId
def gen_objectid(link):
    """Generates objectid from given link"""
    return ObjectId(hashlib.shake_128(str(link).encode('utf-8')).digest(12))
# end def
I have no idea how the shake_128 algorithm works. That is where my question comes in.
Is it okay to use this method? Can I safely assume that the probability of a collision is negligible?
What is the better way to do this?
shake_128 is one of the SHA-3 hash algorithms, which were chosen through a public competition to be the next generation of secure hash algorithms. They are not as widely used as SHA-2, which is still considered good enough in most cases. Since these algorithms are designed for cryptographically secure hashing, this is overkill for what you are doing. Note that the 128 in shake_128 refers to its security strength, not its output size: it is an extendable-output function, so digest(12) returns exactly the 12 bytes (96 bits) that an ObjectId needs. That gives you 2^96 ≈ 7.9e28 different hashes, so I think you will be just fine. If anything, you could use a faster hashing algorithm, since you don't need cryptographic security in this case.
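As a rough check on "negligible", the collision risk for a 96-bit value can be estimated with the standard birthday approximation. This is only a sketch: the helper names link_digest and collision_probability are made up for illustration, and the item count is an assumed example.

```python
import hashlib
import math

def link_digest(link: str, nbytes: int = 12) -> bytes:
    """Return an nbytes-long SHAKE-128 digest of the link (12 bytes fits an ObjectId)."""
    return hashlib.shake_128(link.encode("utf-8")).digest(nbytes)

def collision_probability(n_items: int, bits: int) -> float:
    """Birthday approximation: p is about 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n_items * (n_items - 1) / 2 ** (bits + 1))

digest = link_digest("https://example.com/item/1")
print(len(digest) * 8, "bits")

# Assumed example: one billion distinct links hashed to 96 bits.
print(collision_probability(10**9, 96))
```

Even at a billion items the estimate stays far below one in a billion, so the 12-byte truncation is not the weak point of this scheme.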
Python 3's default hashing function(s) isn't deterministic (hash(None) varies from run to run), and doesn't even make a best effort to generate unique id's with high probability (hash(-1)==hash(-2) is True).
Is there some other hash function that works well as a checksum (i.e. negligible probability of two data structures hashing to the same value, and returns the same result each run of python), and supports all of python's built-in datatypes, including None?
Ideally it would be in the standard library. I can pickle the object or get a string representation, but that seems unnecessarily hacky, and string representations of floats are probably very bad checksums.
I found the cryptographic hashes (md5,sha256) in the standard library, but they only operate on bytestrings.
Haskell seems to get this ~almost right in their standard library... but "Nothing::Maybe Int" and 0 both hash to 0, so it's not perfect there either.
You can use any hash from hashlib on a pickled object.
pickle.dumps is not suitable for hashing.
You can use sorted-keys JSON with hashlib (in Python 3 the string must be encoded to bytes first).
hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()
Taken from: https://stackoverflow.com/a/10288255/3858507, according to AndrewWagner's comment.
By the way, and only for reference since it weakens security: the PYTHONHASHSEED environment variable can be used to disable hash randomization throughout your application (by setting it to 0) or to fix the seed to a given integer.
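The sorted-keys JSON approach can be wrapped in a small helper. This is a sketch under assumptions: stable_checksum is a made-up name, SHA-256 is swapped in for MD5 as a matter of taste, and it only covers what json can serialize (dict, list, str, int, float, bool, None); tuples become lists, and dict keys must be strings or other JSON-compatible scalars.

```python
import hashlib
import json

def stable_checksum(data) -> str:
    """Hash a JSON-serializable structure the same way on every run."""
    # sort_keys makes dict ordering irrelevant; fixed separators avoid
    # whitespace differences in the serialized form.
    payload = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

a = stable_checksum({"name": "x", "value": None, "items": [1, 2.5, True]})
b = stable_checksum({"items": [1, 2.5, True], "value": None, "name": "x"})
c = stable_checksum({"name": "y", "value": None, "items": [1, 2.5, True]})
print(a == b, a == c)
```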
I'm writing a scrapy spider that collects news articles from various online newspapers. The sites in question update at least once a day, and I'm going to run the spider just as often, so I need some way to filter out duplicates (i.e. articles I've already scraped).
In other cases it'd be as simple as comparing reference numbers, but newspaper articles don't have any reference numbers. I was wondering if it'd be possible to hash the title using Python's hash() function and use the resulting value as a stand-in for an actual reference number, just for comparison purposes?
On the surface it seems possible, but what do you guys think?
Yes, you can do that, but I'd not use hash() for this, as hash() is optimised for a different task and can lead too easily to collisions on larger texts (different inputs resulting in the same hash value).
Use a cryptographic hashing scheme instead; the hashlib module gives you access to MD5 and other algorithms, whose output is far less likely to collide.
For your purposes, MD5 will do just fine:
article_hash = hashlib.md5(scraped_info).hexdigest()
This has the added advantage that the MD5 hash is always going to be calculated the same regardless of OS or system architecture; hash() can offer no such guarantee.
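A minimal sketch of the duplicate filter this suggests, assuming titles arrive as str and that an in-memory set is enough. The names article_key, is_new_article, and seen_keys are hypothetical, and the whitespace and case normalization is an added assumption rather than part of the answer.

```python
import hashlib

# Keys of articles already scraped; a real spider would persist this between runs.
seen_keys = set()

def article_key(title: str) -> str:
    """Return a stable MD5 hex digest of a normalized title."""
    normalized = " ".join(title.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def is_new_article(title: str) -> bool:
    """Record the title and report whether it was seen for the first time."""
    key = article_key(title)
    if key in seen_keys:
        return False
    seen_keys.add(key)
    return True
```

Calling is_new_article on each scraped title returns True the first time and False on later runs, provided seen_keys is saved and reloaded between runs.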
I've recently been looking into Python's dictionaries (I believe they're called associative arrays in other languages) and was confused by a couple of the restrictions on their keys.
First, dict keys must be immutable. When I looked up the logic behind it, the answer was that dictionaries work like hash tables to look up the values for keys, and thus mutable keys (if they're hashable at all) may change their hash value, causing problems when retrieving the value.
I understood why that was the case just fine, but I was still somewhat confused by what the point of using a hash table would be. If you simply didn't hash the keys and tested for true equality (assuming identically constructed objects compare equal), you could replicate most of the functionality of the dictionary without that limitation just by using two lists.
So, I suppose that's my real question - what's the rationale behind using hashes to look up values instead of equality?
If I had to guess, it's likely simply because comparing integers is incredibly fast and optimized, whereas comparing instances of other classes may not be.
You seem to be missing the whole point of a hash table, which is fast (O(1)) [1] retrieval, and which cannot be implemented without hashing, i.e. transformation of the key into some kind of well-distributed integer that is used as an index into a table. Notice that equality is still needed on retrieval to be able to handle hash collisions [2], but you resort to it only when you have already narrowed down the set of elements to those having a certain hash.
With just equality you could replicate similar functionality with parallel arrays or something similar, but that would make retrieval O(n) [3]; if you ask for strict weak ordering instead, you can implement RB trees, which allow for O(log n) retrieval. But O(1) requires hashing.
Have a look at Wikipedia for some more insight on hash tables.
Notes
[1] It can become O(n) in pathological scenarios (where all the elements get put in the same bucket), but that's not supposed to happen with a good hashing function.
[2] Since different elements may have the same hash, after getting to the bucket corresponding to the hash you must check if you are actually retrieving the element associated with the given key.
[3] Or O(log n) if you keep your arrays sorted, but that complicates insertion, which becomes on average O(n) due to shifting elements around; but then again, if you have ordering you probably want an RB tree or a heap.
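The interplay of hashing and equality described above can be sketched with a toy chained hash table. This is an illustration only, not how CPython's dict is implemented (which uses open addressing and resizing); ChainedHashTable and its fixed bucket count are assumptions made for the example.

```python
class ChainedHashTable:
    """Toy hash table: the hash picks a bucket, equality picks the entry."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash narrows the search down to a single bucket.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:      # equality resolves collisions
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable()
# In CPython, hash(-1) == hash(-2), so these two keys share a bucket.
table.put(-1, "minus one")
table.put(-2, "minus two")
print(table.get(-1), table.get(-2))
```

With a well-distributed hash and enough buckets, each bucket holds only a few entries, so get() compares against a handful of keys instead of all n.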
By using hash tables you achieve O(1) data retrieval, while comparing against each individual value for equality takes O(n) in a sequential search or O(log(n)) in a binary search.
Also note that O(1) is the average case, because if several keys hash to the same bucket, a sequential search is needed among those entries.
I've always used dictionaries. I write in Python.
A dictionary is a general concept that maps keys to values. There are many ways to implement such a mapping.
A hashtable is a specific way to implement a dictionary.
Besides hashtables, another common way to implement dictionaries is red-black trees.
Each method has its own pros and cons. A red-black tree can always perform a lookup in O(log N). A hashtable can perform a lookup in O(1) time although that can degrade to O(N) depending on the input.
A dictionary is a data structure that maps keys to values.
A hash table is a data structure that maps keys to values by taking the hash value of the key (by applying some hash function to it) and mapping that to a bucket where one or more values are stored.
IMO this is analogous to asking the difference between a list and a linked list.
For clarity it may be important to note that it MAY be the case that Python currently implements their dictionaries using hash tables, and it MAY be the case in the future that Python changes that fact without causing their dictionaries to stop being dictionaries.
"A dictionary" has a few different meanings in programming, as Wikipedia will tell you -- "associative array", the sense in which Python uses the term (also known as "a mapping"), is one of those meanings (but "data dictionary", and "dictionary attacks" in password guess attempts, are also important).
Hash tables are important data structures; Python uses them to implement two important built-in data types, dict and set.
So, even in Python, you can't consider "hash table" to be a synonym for "dictionary"... since a similar data structure is also used to implement "sets"!-)
A Python dictionary is internally implemented with a hashtable.
Both a dictionary and a hash table pair keys with values so that insertion, deletion, and lookup are fast. The difference is that a hash table specifically uses a hash of the key to decide where each (key, value) pair is stored, which is why access is fast. Python implements dictionaries as hash tables. Since Python 3.7, dicts also preserve insertion order (sets do not), and any hashable object can be used as a key.
Because dicts now keep insertion order, they behave somewhat more like lists in Python 3. See this post for details:
https://softwaremaniacs.org/blog/2020/02/05/dicts-ordered/en/
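A short sketch of what insertion order means in practice, assuming Python 3.7 or later, where the ordering is a language guarantee:

```python
d = {}
d["first"] = 1
d["second"] = 2
d["third"] = 3
order_after_insert = list(d)      # keys come back in insertion order

d["first"] = 10                   # updating a value keeps the key's position
del d["second"]
d["second"] = 2                   # deleting and re-inserting moves the key to the end
order_after_reinsert = list(d)

print(order_after_insert)
print(order_after_reinsert)
```

Lookup is still by hash, so the ordering only affects iteration, not how keys are found.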
A hash table always uses some function operating on a value to determine where a value will be stored. A Dictionary (as I believe you intend it) is a more general term, and simply indicates a lookup mechanism, which might be a hash table or might be implemented by a simpler structure which does not consider the value itself in determining its storage location.
Dictionary is implemented using hash tables. In my opinion the difference between the 2 can be thought of as the difference between Stacks and Arrays where we would be using arrays to implement Stacks.