Algorithm to generate 12 byte hash from Web URLs

I am crawling some websites for special items and storing them in MongoDB server. To avoid duplicate items, I am using the hash value of the item link. Here is my code to generate the hash from the link:
import hashlib
from bson.objectid import ObjectId
def gen_objectid(link):
    """Generates objectid from given link"""
    return ObjectId(hashlib.shake_128(str(link).encode('utf-8')).digest(12))
# end def
I have no idea how the shake_128 algorithm works. That is where my question comes in.
Is it okay to use this method? Can I safely assume that the probability of a collision is negligible?
What is the better way to do this?

shake_128 belongs to the SHA-3 family, which was selected through a public competition as the next generation of secure hash algorithms. SHA-3 is less widely used than SHA-2, since SHA-2 is still considered good enough in most cases. These algorithms are designed for cryptographically secure hashing, which is more than you need here. Note that shake_128 is an extendable-output function: the 128 in its name refers to the security strength, not the output size, and digest(12) returns exactly the 12 bytes (96 bits) that ObjectId expects. That gives 2^96 ≈ 7.9e28 possible hashes, so by the birthday bound a collision only becomes likely at around 2^48 ≈ 2.8e14 links. I think you will be just fine. If anything, you could use a faster hashing algorithm, since you don't need cryptographic security in this case.
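To put a rough number on the collision risk of a 12-byte (96-bit) digest, here is a small sketch using the standard birthday-bound approximation. The function name collision_probability and the example URL are made up for illustration, and the formula assumes the hash output behaves like uniformly random bits.

```python
import hashlib
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(exponent)

# shake_128 lets the caller pick the output length; here 12 bytes = 96 bits
digest = hashlib.shake_128(b"https://example.com/item/1").digest(12)
print(len(digest))

# Approximate chance of at least one collision among a billion links
print(collision_probability(10**9, 96))
```

Even at a billion stored links the estimate stays far below one in a billion, so for a crawler the 12-byte digest is unlikely to be the weak point.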

Related

Deterministic, recursive hashing in python

Python 3's default hashing function(s) isn't deterministic (hash(None) varies from run to run), and doesn't even make a best effort to generate unique id's with high probability (hash(-1)==hash(-2) is True).
Is there some other hash function that works well as a checksum (i.e. negligible probability of two data structures hashing to the same value, and returns the same result each run of python), and supports all of python's built-in datatypes, including None?
Ideally it would be in the standard library. I can pickle the object or get a string representation, but that seems unnecessarily hacky, and string representations of floats are probably very bad checksums.
I found the cryptographic hashes (md5,sha256) in the standard library, but they only operate on bytestrings.
Haskell seems to get this ~almost right in their standard library... but "Nothing::Maybe Int" and 0 both hash to 0, so it's not perfect there either.

Hashing a pickled object with hashlib is tempting, but pickle.dumps is not suitable for hashing, because its output is not guaranteed to be identical for equal objects.
Instead, you can use JSON with sorted keys together with hashlib:
hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()
Taken from: https://stackoverflow.com/a/10288255/3858507, according to AndrewWagner's comment.
By the way, and only for reference since it weakens security, the PYTHONHASHSEED environment variable can be used to disable hash randomization throughout your application.
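A slightly fuller sketch of the sorted-keys JSON approach for Python 3 follows. The helper name stable_hash is hypothetical, and SHA-256 is swapped in for MD5 only as an illustration. This only covers JSON-serializable data: tuples and lists serialize identically, and non-string dict keys are coerced to strings, so it is not a general-purpose structural hash.

```python
import hashlib
import json

def stable_hash(data) -> str:
    """Hash JSON-serializable data the same way on every run."""
    # sort_keys makes dict ordering irrelevant; fixed separators avoid
    # whitespace differences in the serialized form
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"x": 1, "y": [None, 2.5, "z"]}
b = {"y": [None, 2.5, "z"], "x": 1}  # same content, different key order

print(stable_hash(a) == stable_hash(b))
```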

one time pad with pad seeded via 'passphrase'

Looking for a theoretical discussion here. I personally would (and will continue to) use GPG or just SCP for simply getting a file somewhere where only I can decrypt it or only I can download it. Still a discussion of where the following falls short (and by how much) would help my curiosity.
Suppose I want to encrypt a file locally, put it on the internet, and be able to grab it later. I want to make sure that only people with a certain password/phrase can decrypt the file ... and I insist on incorporating a one-time-pad.
Assuming it's only used to encrypt a message once, if one were to use a very random passphrase (e.g. Diceware) to seed the pad in a reproducible way, would this be a problem? In python, I would do something like random.seed("hurt coaster lemon swab lincoln") and then generate my pad. I would use the same seed for encryption and decryption.
There are warnings all over the place about how this Mersenne Twister RNG is not suitable for security/cryptography purposes. I see that it has a very long period, and IIUC, that random.seed allows me to choose 16 bytes worth of different seeds (Python: where is random.random() seeded?).
I've heard that the numbers in an OTP should be "truly random", but even if somebody saw, say, the 1st 100 characters of my pad, how much would that help them in determining what the seed of my RNG was (in hopes of decoding the rest)? I suppose they could brute force the seed by generating pads from every possible random seed and seeing which ones match my first 100 random letters. Still, there are quite a few random seeds to try, right?
So, how dangerous is this? And is there a reasonable way to figure out the seed of a sequence generated by common RNGs by peeking at a little bit of the sequence?

A one-time pad's key is truly-random data of the same size as the plaintext, by definition. If you're producing it some other way (e.g. by seeding a PRNG), it isn't a one-time pad, and it doesn't have the one-time pad's unbreakability property.
One-time pads are actually a special type of stream cipher. There are other stream ciphers too, and yes, they can be quite secure if used properly. But stream ciphers can also be completely insecure if used improperly, and your idea of making up your own cipher based on a non-cryptographic PRNG is improper usage from the start.
One-time pads are used when the key must be impossible to brute-force even if the attacker has unlimited computing power. Based on your description, you're just looking for something that's infeasible to brute-force by any realistic attacker, and that's what any other decent cipher will give you. And unless you're protecting nuclear launch codes or something, that's all you need.
Forget the faux-OTP and Mersenne Twister idea and just use something like AES, with something like bcrypt or scrypt to derive the key from your passphrase.
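As a sketch of the key-derivation half of that advice, the standard library's hashlib.scrypt can turn a passphrase into a 32-byte key (it requires a Python build linked against OpenSSL 1.1 or later). The cost parameters below are illustrative assumptions, not a recommendation, and derive_key is a hypothetical helper name. The AES step itself is not shown because it needs a third-party library.

```python
import hashlib
import os

def derive_key(passphrase: str, salt: bytes) -> bytes:
    """Derive a 32-byte key (suitable in size for AES-256) from a passphrase."""
    return hashlib.scrypt(
        passphrase.encode("utf-8"),
        salt=salt,
        n=2**14,  # CPU/memory cost (assumed value for illustration)
        r=8,      # block size
        p=1,      # parallelization
        dklen=32,
    )

# The salt is random but not secret; store it alongside the ciphertext
salt = os.urandom(16)
key = derive_key("hurt coaster lemon swab lincoln", salt)
print(len(key))
```

The same passphrase and salt always reproduce the same key, which is what makes decryption possible later.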
Regarding your specific question about determining the RNG's sequence: Mersenne twister's internal state can be determined by observing 2496 bytes of its output. And in a stream cipher, it's easy to determine the keystream given the plaintext and ciphertext. This means that if an attacker has your ciphertext and can determine the first 2496 bytes of your plaintext, he knows the RNG state and can use it to produce the rest of the keystream and decrypt the whole message.
2496 bytes is not feasible to brute-force, but a sophisticated attacker may be able to significantly narrow down the possibilities using intelligent guessing about the content of your plaintext, such as what you might have written about, or what file formats the data is likely to be in and the known structure of those file formats. This is known as cribbing, and can provide enough of a starting point that the remaining brute-force attack becomes feasible.
Even better is if the attacker can trick you into incorporating some specific content into your plaintext. Then he doesn't even have to guess.
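The known-plaintext point can be sketched in a few lines. This toy example builds the "pad" the way the question describes (a seeded random.Random) and shows that anyone who guesses a prefix of the plaintext gets the matching keystream bytes for free. The helper names and the sample plaintext are made up for illustration, and the sketch does not perform the Mersenne Twister state recovery itself.

```python
import random

def keystream(seed: str, length: int) -> bytes:
    """Toy pad: bytes drawn from a Mersenne Twister seeded with a passphrase."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(length))

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"%PDF-1.7 secret report"
pad = keystream("hurt coaster lemon swab lincoln", len(plaintext))
ciphertext = xor_bytes(plaintext, pad)

# The attacker guesses the file starts with a PDF header (a "crib")
known_prefix = b"%PDF-1.7"
recovered = xor_bytes(ciphertext, known_prefix)

print(recovered == pad[:len(known_prefix)])
```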

what is difference between crypto hash and hashtable hashes in python?

What is a crypto hash and what are some algorithms? How is it different from a normal hash in python? How can I determine which to use?
EX:
Cryptographic hash function
hello--aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
helld--44d634fa6b81353bc3ed424879ffd013501ade53
hash function
hash("hello") -1267296259
hash("helld") -1267296266
Please help me

Cryptographic hash functions are different from hash table hash functions. One main difference is that cryptographic hash functions are designed to resist collision attacks. They are also designed to be secure and practically irreversible. Hash table hash functions like hash() are faster, and are designed for quickly looking up items in memory or comparing items.
Consider two different scenarios. If you want to store passwords in a database, you must use something like pbkdf2, which is deliberately slow to compute in order to hinder brute-force attacks. In another case, you just want to keep a set of items and check whether an item exists in that set. You can simply store a 32-bit or 64-bit hash of the items (e.g. class instances) and compare hashes quickly instead of comparing the items themselves.
For example, for the string "hello", it is much faster to compute and store -1267296259, which is a 32-bit integer, while aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d is more secure but slower to compute and larger to store.
P.S. A good example is here.
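A short sketch of the two kinds of hash side by side. The SHA-1 digest of "hello" matches the value quoted in the question. The built-in hash() of a string is not printed with an expected value, because in Python 3 it is salted per process unless PYTHONHASHSEED is set.

```python
import hashlib

# Cryptographic hash: fixed, portable, 160-bit output for SHA-1
sha1_hello = hashlib.sha1(b"hello").hexdigest()
sha1_helld = hashlib.sha1(b"helld").hexdigest()
print(sha1_hello)  # aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
print(sha1_helld)

# Hash table hash: fast, machine-word sized, may differ between runs
print(hash("hello"))

# Typical use of hash(): set and dict lookups call it behind the scenes
seen = {"hello", "world"}
print("hello" in seen)
```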

using python hash() values as a reference number

I'm writing a scrapy spider that collects news articles from various online newspapers. The sites in question update at least once a day, and I'm going to run the spider just as often, so I need some way to filter out duplicates (i.e. articles I've already scraped).
In other cases it'd be as simple as comparing reference numbers, but newspaper articles don't have any reference numbers. I was wondering if it'd be possible to hash the title using Python's hash() function and use the resulting value as a stand-in for an actual reference number, just for comparison purposes?
On the surface it seems possible, but what do you guys think?

Yes, you can do that, but I'd not use hash() for this, as hash() is optimised for a different task and can lead too easily to collisions on larger texts (different inputs resulting in the same hash value).
Use a cryptographic hashing scheme instead; the hashlib module gives you access to MD5 and other algorithms and produce output that is far less likely to produce collisions.
For your purposes, MD5 will do just fine:
article_hash = hashlib.md5(scraped_info).hexdigest()
This has the added advantage that the MD5 hash is always going to be calculated the same regardless of OS or system architecture; hash() can offer no such guarantee.
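Here is a minimal sketch of how that could look for duplicate filtering in Python 3, where hashlib needs bytes rather than str. The names article_key, is_new and seen_keys are hypothetical, the in-memory set stands in for whatever persistent store the spider uses, and the whitespace/case normalization is an assumption about what should count as "the same title".

```python
import hashlib

def article_key(title: str) -> str:
    """Stable key for an article title (normalization is an assumption)."""
    normalized = " ".join(title.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

seen_keys = set()  # stand-in for a database of already-scraped keys

def is_new(title: str) -> bool:
    """Return True the first time a title is seen, False afterwards."""
    key = article_key(title)
    if key in seen_keys:
        return False
    seen_keys.add(key)
    return True

first = is_new("Council approves budget")
second = is_new("  council approves   BUDGET ")
print(first, second)
```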

What's more random, hashlib or urandom?

I'm working on a project with a friend where we need to generate a random hash. Before we had time to discuss, we both came up with different approaches and because they are using different modules, I wanted to ask you all what would be better--if there is such a thing.
hashlib.sha1(str(random.random())).hexdigest()
or
os.urandom(16).encode('hex')
Typing this question out has got me thinking that the second method is better. Simple is better than complex. If you agree, how reliable is this for 'randomly' generating hashes? How would I test this?

This solution:
os.urandom(16).encode('hex')
is the best since it uses the OS to generate randomness which should be usable for cryptographic purposes (depends on the OS implementation).
random.random() generates pseudo-random values.
Hashing a random value does not add any new randomness.
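Note that .encode('hex') on a byte string only works in Python 2. A sketch of the Python 3 equivalents, using bytes.hex() and the secrets module (Python 3.6+), which draws from the same OS source:

```python
import os
import secrets

# 16 random bytes from the OS, rendered as 32 hex characters
token_a = os.urandom(16).hex()

# secrets.token_hex(16) also returns 32 hex characters
token_b = secrets.token_hex(16)

print(token_a)
print(token_b)
```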

random.random() is a pseudo-random generator, which means the numbers are generated from a deterministic sequence. If you call random.seed(some_number), the generated sequence will always be the same afterwards.
os.urandom() gets its random numbers from the OS's RNG, which uses an entropy pool to collect real random numbers, usually from random events on hardware devices. There are even special hardware entropy generators for systems that need a lot of random numbers.
On Unix systems there are traditionally two random number generators: /dev/random and /dev/urandom. Reads from the first block if there is not enough entropy available, whereas reads from /dev/urandom fall back to a pseudo-RNG and don't block when entropy is short.
So the choice usually depends on what you need: if you need a few uniformly distributed random numbers, the built-in PRNG should be sufficient. For cryptographic use, it's always better to use real random numbers.
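The difference in reproducibility is easy to see in a small sketch. The seed value 12345 is arbitrary, and separate random.Random instances are used here only to avoid touching the global generator.

```python
import os
import random

# Same seed, same sequence: the PRNG is fully determined by its seed
first_run = [random.Random(12345).random() for _ in range(1)]
rng_a = random.Random(12345)
rng_b = random.Random(12345)
seq_a = [rng_a.random() for _ in range(3)]
seq_b = [rng_b.random() for _ in range(3)]
print(seq_a == seq_b)

# os.urandom has no seed to set; each call asks the OS for fresh bytes
chunk = os.urandom(8)
print(len(chunk))
```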

The second solution clearly has more entropy than the first. Assuming the quality of the source of the random bits would be the same for os.urandom and random.random:
In the second solution you are fetching 16 bytes = 128 bits worth of randomness
In the first solution you are fetching a floating point value which has roughly 52 bits of randomness (IEEE 754 double, ignoring subnormal numbers, etc...). Then you hash it around, which, of course, doesn't add any randomness.
More importantly, the quality of the randomness coming from os.urandom is expected and documented to be much better than the randomness coming from random.random. os.urandom's docstring says "suitable for cryptographic use".

Testing randomness is notoriously difficult. However, I would choose the second method, but ONLY (or, only as far as comes to mind) for this case, where the hash is seeded by a random number.
The whole point of hashes is to create a number that is vastly different based on slight differences in input. For your use case, the randomness of the input should do. If, however, you wanted to hash a file and detect one eensy byte's difference, that's when a hash algorithm shines.
I'm just curious, though: why use a hash algorithm at all? It seems that you're looking for a purely random number, and there are lots of libraries that generate uuid's, which have far stronger guarantees of uniqueness than random number generators.

if you want a unique identifier (uuid), then you should use
import uuid
uuid.uuid4().hex
https://docs.python.org/3/library/uuid.html
