Are there any variable length hash functions available for Python? - python

I am looking for a hash function that can generate a digest of a specified bit-size for a cryptographic signature scheme. A related question (https://crypto.stackexchange.com/questions/3558/are-there-hash-algorithms-with-variable-length-output) on the Cryptography SE specifies that algorithms exist for this particular purpose.
Are there any Python libraries that I can use for this?
Currently, my scheme just pads a SHA-256 output to the desired size. I have also tried the Python SHA3 library pysha3 1.0.2; however, it only offers a few predefined digest sizes.
I want a hash function that takes the desired digest size as a parameter and hashes a message to that size accordingly (if possible).

As a cursory answer: you might be interested in the built-in BLAKE2 functions in hashlib in Python 3.6+.
blake2b only outputs up to 64 bytes, but it is "faster than MD5, SHA-1, SHA-2, and SHA-3, yet is at least as secure as the latest standard SHA-3".
Hopefully this is long enough and you don't need external libraries!
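To make the "inbuilt" part concrete: hashlib's blake2b accepts a digest_size argument directly (any value from 1 to 64 bytes; blake2s goes up to 32), so it can serve as the variable-length hash the question asks for, as long as 64 bytes is enough:

```python
import hashlib

# blake2b lets you request any digest size from 1 to 64 bytes
h = hashlib.blake2b(b'Some data', digest_size=20)
print(h.hexdigest())  # 40 hex characters = 20 bytes
```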

Any extendable-output function (XOF) can be used to obtain a digest of a given size. From Wikipedia:
Extendable-output functions (XOFs) are cryptographic hashes which can output an arbitrarily large number of random-looking bits.
SHA-3 provides two such functions: SHAKE128 and SHAKE256. They follow from the general properties of the sponge construction; a sponge function can generate an arbitrary length of output. The 128 and 256 in their names indicate the maximum security level (in bits), as described in Sections A.1 and A.2 of FIPS 202.
In Python, first install the PyCryptodome library:
pip install pycryptodome
A hash of, say, 20 bytes can be generated as follows:
from Crypto.Hash import SHAKE256
from binascii import hexlify
shake = SHAKE256.new()
shake.update(b'Some data')
print(hexlify(shake.read(20)))
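Note that Python 3.6+ also ships SHAKE in the standard library, so the PyCryptodome dependency is optional here; hashlib's shake_128/shake_256 take the desired length as an argument to digest()/hexdigest():

```python
import hashlib

# hashlib's SHAKE objects take the output length at digest time
shake = hashlib.shake_256()
shake.update(b'Some data')
print(shake.hexdigest(20))  # 20-byte digest, rendered as 40 hex characters
```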

Bytes object containing an arbitrary number of bits

I'm currently implementing a platform to compare the execution time of different cryptographic algorithms in Python3.
One of the requirements of this platform is using the test vectors provided in the NESSIE project. While checking the test vectors I realized there are vectors such as
Set 2, vector# 3:
message=3 zero bits
hash=88BAD9D59A0A5195FAF7961BB6625486816C1430
This test vector requires an input of 3 zero bits, i.e. '000', which raises a problem: the library I'm using, Cryptography, implements the SHA-1 function so that it only accepts a bytes object as input.
I've been reading the Python documentation and I wasn't able to find a way to generate a bytes object containing only 3 bits; I can only create them with 8, 16, etc. bits. I know this is the proper structure of a byte, but is there a way to create or slice a bytes object so that it contains only 3 bits?
If there's no way to do this, do you know of another cryptography library where these cases can be handled? (I'm not allowed to implement my own.)
No, you cannot slice a byte. A byte consists of 8 bits, assuming that a byte is an octet, as on all modern systems. Of course, there are ways to represent 3 bits, e.g. using a special data structure, or by simply carrying an integer alongside that specifies how many of the bits are to be used.
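A minimal sketch of such a data structure (the class name and packing scheme are my own, not from any library): pack the bits into bytes and carry the bit count alongside.

```python
class BitMessage:
    """Hold a message of arbitrary bit length as packed bytes plus a bit count."""

    def __init__(self, bits: str):  # e.g. '000'
        self.bit_length = len(bits)
        # pad on the right to a whole number of bytes before packing
        padded = bits + '0' * (-len(bits) % 8)
        self.data = bytes(int(padded[i:i + 8], 2)
                          for i in range(0, len(padded), 8))

msg = BitMessage('000')
print(msg.bit_length, msg.data)  # 3 b'\x00'
```

A bit-oriented hash implementation could then consume self.data together with self.bit_length instead of assuming a whole number of bytes.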
However, if you do not have a use case that requires a bit-oriented implementation, you can simply ignore the test vectors whose lengths are not a multiple of eight bits. Most libraries are byte-oriented, even if the algorithm is specified in bits. A bit-oriented implementation can be created from the spec, so the test vectors need to be there in case they are needed.
Unfortunately I have not found any statement from NIST to this effect, even though it is completely apparent from e.g. FIPS-compliant libraries.

Algorithm to generate 12 byte hash from Web URLs

I am crawling some websites for special items and storing them in MongoDB server. To avoid duplicate items, I am using the hash value of the item link. Here is my code to generate the hash from the link:
import hashlib
from bson.objectid import ObjectId
def gen_objectid(link):
    """Generates an ObjectId from the given link."""
    return ObjectId(hashlib.shake_128(str(link).encode('utf-8')).digest(12))
# end def
I have no idea how the shake_128 algorithm works. That is where my question comes in.
Is it okay to use this method? Can I safely assume that the probability of a collision is negligible?
What is the better way to do this?
shake_128 is one of the SHA-3 family of algorithms, chosen as the result of a contest to be the next generation of secure hash algorithms. They are not widely used, since SHA-2 is still considered good enough in most cases. Since these algorithms are designed for cryptographically secure hashing, this should be overkill for what you are doing.
Also, shake_128 is an extendable-output function (XOF): the 128 in its name refers to its security level in bits, not to a fixed output size, so .digest(12) gives you exactly the 12 bytes (96 bits) you asked for. That is 2^96 ≈ 7.9e28 different hashes, and by the birthday bound you would need on the order of 2^48 distinct links before a collision becomes likely. I think you will be just fine. If anything, I would say you could use a faster, non-cryptographic hashing algorithm, since you don't need cryptographic security in this case.
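A quick sanity check of both points, the 12-byte output and the collision risk, using the standard birthday-bound approximation (the URL and the figure of a billion links are made-up examples):

```python
import hashlib

# shake_128 is an XOF: you choose the output length, here 12 bytes (96 bits)
digest = hashlib.shake_128(b'https://example.com/item/1').digest(12)
print(len(digest))  # 12

# Birthday-bound estimate: probability of any collision among n 96-bit hashes
n = 10**9                      # say, a billion distinct links
p = n * (n - 1) / 2 / 2**96
print(p)                       # roughly 6e-12, i.e. negligible
```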

Deterministic, recursive hashing in python

Python 3's default hash() isn't deterministic across runs (hash(None) varies from run to run), and doesn't even make a best effort to generate unique ids with high probability (hash(-1) == hash(-2) is True).
Is there some other hash function that works well as a checksum (i.e. negligible probability of two data structures hashing to the same value, and returns the same result each run of python), and supports all of python's built-in datatypes, including None?
Ideally it would be in the standard library. I can pickle the object or get a string representation, but that seems unnecessarily hacky, and string representations of floats are probably very bad checksums.
I found the cryptographic hashes (md5,sha256) in the standard library, but they only operate on bytestrings.
Haskell seems to get this ~almost right in their standard library... but "Nothing::Maybe Int" and 0 both hash to 0, so it's not perfect there either.
You can use any hash from hashlib on a pickled object.
pickle.dumps is not suitable for hashing, though: the same value can serialize to different byte strings (the output depends on the pickle protocol and on object identity via memoization), so it is not a canonical representation.
You can use sorted-keys json with hashlib.
hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()
Taken from: https://stackoverflow.com/a/10288255/3858507, according to AndrewWagner's comment.
By the way, and only for reference since this causes security vulnerabilities: the PYTHONHASHSEED environment variable can be used to disable randomization of hashes throughout your application.
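Putting the sorted-keys idea together, here is a minimal sketch of a deterministic checksum for JSON-serializable structures (the function name is my own; it won't cover sets, bytes, or custom classes without extra encoding):

```python
import hashlib
import json

def checksum(obj) -> str:
    """Deterministic SHA-256 checksum of a JSON-serializable structure."""
    # sort_keys fixes dict ordering; compact separators avoid whitespace drift
    payload = json.dumps(obj, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()

# key order does not matter; None, ints, floats, and strings are all covered
print(checksum({'a': [1, 2.5, None], 'b': 'x'}))
```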

one time pad with pad seeded via 'passphrase'

Looking for a theoretical discussion here. I personally would (and will continue to) use GPG or just SCP for simply getting a file somewhere where only I can decrypt it or only I can download it. Still a discussion of where the following falls short (and by how much) would help my curiosity.
Suppose I want to encrypt a file locally, put it on the internet, and be able to grab it later. I want to make sure that only people with a certain password/phrase can decrypt the file ... and I insist on incorporating a one-time-pad.
Assuming it's only used to encrypt a message once, if one were to use a very random passphrase (e.g. Diceware) to seed the pad in a reproducible way, would this be a problem? In python, I would do something like random.seed("hurt coaster lemon swab lincoln") and then generate my pad. I would use the same seed for encryption and decryption.
There are warnings all over the place about how this Mersenne Twister RNG is not suitable for security/cryptography purposes. I see that it has a very long period, and IIUC random.seed allows me to choose 16 bytes' worth of different seeds (Python: where is random.random() seeded?).
I've heard that the numbers in an OTP should be "truly random", but even if somebody saw, say, the 1st 100 characters of my pad, how much would that help them in determining what the seed of my RNG was (in hopes of decoding the rest)? I suppose they could brute force the seed by generating pads from every possible random seed and seeing which ones match my first 100 random letters. Still, there are quite a few random seeds to try, right?
So, how dangerous is this? And is there a reasonable way to figure out the seed of a sequence generated by common RNGs by peeking at a little bit of the sequence?
A one-time pad's key is truly-random data of the same size as the plaintext, by definition. If you're producing it some other way (e.g. by seeding a PRNG), it isn't a one-time pad, and it doesn't have the one-time pad's unbreakability property.
One-time pads are actually a special type of stream cipher. There are other stream ciphers too, and yes, they can be quite secure if used properly. But stream ciphers can also be completely insecure if used improperly, and your idea of making up your own cipher based on a non-cryptographic PRNG is improper usage from the start.
One-time pads are used when the key must be impossible to brute-force even if the attacker has unlimited computing power. Based on your description, you're just looking for something that's infeasible to brute-force by any realistic attacker, and that's what any other decent cipher will give you. And unless you're protecting nuclear launch codes or something, that's all you need.
Forget the faux-OTP and Mersenne Twister idea and just use something like AES, with something like bcrypt or scrypt to derive the key from your passphrase.
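For the key-derivation half of that advice, hashlib has scrypt built in (Python 3.6+, OpenSSL-backed); a minimal sketch, with the passphrase taken from the question:

```python
import hashlib
import os

passphrase = b'hurt coaster lemon swab lincoln'
salt = os.urandom(16)           # store the salt alongside the ciphertext

# n is the CPU/memory cost, r the block size, p the parallelism
key = hashlib.scrypt(passphrase, salt=salt, n=2**14, r=8, p=1, dklen=32)
print(key.hex())                # 32-byte key suitable for AES-256
```

The derived key would then feed an authenticated cipher (e.g. AES-GCM from PyCryptodome or the cryptography package) rather than a homemade keystream.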
Regarding your specific question about determining the RNG's sequence: Mersenne twister's internal state can be determined by observing 2496 bytes of its output. And in a stream cipher, it's easy to determine the keystream given the plaintext and ciphertext. This means that if an attacker has your ciphertext and can determine the first 2496 bytes of your plaintext, he knows the RNG state and can use it to produce the rest of the keystream and decrypt the whole message.
2496 bytes is not feasible to brute-force, but a sophisticated attacker may be able to significantly narrow down the possibilities using intelligent guessing about the content of your plaintext, such as what you might have written about, or what file formats the data is likely to be in and the known structure of those formats. This is known as cribbing, and can provide enough of a starting point that the remaining brute-force attack becomes feasible.
Even better is if the attacker can trick you into incorporating some specific content into your plaintext. Then he doesn't even have to guess.
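The state-recovery attack described above is easy to demonstrate against Python's own Mersenne Twister: each 32-bit output is a state word passed through an invertible "tempering" step, so 624 consecutive outputs (2496 bytes) let you rebuild the full state. This is a well-known technique; the helper names below are my own:

```python
import random

def undo_right(y, shift):
    # invert y ^= y >> shift by fixed-point iteration
    result = y
    for _ in range(32 // shift + 1):
        result = y ^ (result >> shift)
    return result

def undo_left(y, shift, mask):
    # invert y ^= (y << shift) & mask by fixed-point iteration
    result = y
    for _ in range(32 // shift + 1):
        result = y ^ ((result << shift) & mask)
    return result

def untemper(y):
    # undo MT19937's output tempering, in reverse order
    y = undo_right(y, 18)
    y = undo_left(y, 15, 0xEFC60000)
    y = undo_left(y, 7, 0x9D2C5680)
    y = undo_right(y, 11)
    return y & 0xFFFFFFFF

victim = random.Random("hurt coaster lemon swab lincoln")
observed = [victim.getrandbits(32) for _ in range(624)]  # 2496 bytes of keystream

clone = random.Random()
clone.setstate((3, tuple(untemper(y) for y in observed) + (624,), None))

# the clone now predicts the victim's future output exactly
assert [clone.getrandbits(32) for _ in range(10)] == \
       [victim.getrandbits(32) for _ in range(10)]
```

No brute force is involved at all, which is exactly why a non-cryptographic PRNG must never be used as a keystream generator.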

Reproducibility of python pseudo-random numbers across systems and versions?

I need to generate a controlled sequence of pseudo-random numbers, given an initial parameter. For that I'm using the standard python random generator, seeded by this parameter. I'd like to make sure that I will generate the same sequence across systems (Operating system, but also Python version).
In summary: does Python ensure the reproducibility/portability of its pseudo-random number generator across implementations and versions?
No, it doesn't. There's no such promise in the random module's documentation.
What the docs do contain is this remark:
Changed in version 2.3: MersenneTwister replaced Wichmann-Hill as the default generator
So a different RNG was used prior to Python 2.3.
So far, I've been using numpy.random.RandomState for reproducible pseudo-randomness, though it too does not make the formal promise you're after.
If you want full reproducibility, you might want to include a copy of random's source in your program, or hack together a "P²RNG" (pseudo-pseudo-RNG) from hashlib.
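A sketch of such a hashlib-based generator (counter-mode SHA-256; the class name is mine, and this aims only at reproducibility, not at being a drop-in random replacement):

```python
import hashlib

class HashPRNG:
    """Deterministic generator: floats derived from SHA-256(seed || counter)."""

    def __init__(self, seed: bytes):
        self.seed = seed
        self.counter = 0

    def random(self) -> float:
        digest = hashlib.sha256(
            self.seed + self.counter.to_bytes(8, 'big')).digest()
        self.counter += 1
        # keep the top 53 bits, matching the precision of CPython's random()
        return (int.from_bytes(digest[:8], 'big') >> 11) / (1 << 53)

rng = HashPRNG(b'my-seed')
print([rng.random() for _ in range(3)])
```

Because SHA-256 is fully specified, the sequence is identical on every platform and Python version, at the cost of being much slower than the Mersenne Twister.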
Not necessarily.
As described in the documentation, the random module has used the Mersenne twister to generate random numbers since version 2.3, but used Wichmann-Hill before that.
(If a seed is not provided, the method of obtaining the seed also does depend on the operating system, the Python version, and factors such as the system time).
@reubano - 3.2 changed the integer functions in random to produce more evenly distributed (which inevitably means different) output.
That change was discussed in Issue9025, where the team discuss whether they have an obligation to stick to the previous output, even when it was defective. They conclude that they do not. The docs for the module guarantee consistency for random.random() - one might assume that the functions which call it (like random.randrange()) are implicitly covered under that guarantee, but that doesn't seem to be the case.
Just as a heads up: in addition to the 2.3 change, Python 3 gives different numbers than Python 2.x from randrange and probably other functions, even if the numbers from random.random are similar.
I just found out that there is also a difference between Python 3.7 and Python 3.8. The following code behaves the same in both versions:
from random import Random
seed = 317
rand = Random(seed)
rand.getrandbits(64)
but if you use from _random import Random instead, it behaves differently.
