I am having a hard time figuring out a reasonable way to generate a mixed-case hash in Python.
I want to generate something like: aZeEe9E
Right now I'm using MD5, which doesn't generate case-sensitive hashes.
Does anyone know how to generate a hash value consisting of upper- and lowercase characters plus digits?
-
Okay, GregS's advice worked like a charm (on the first try!):
Here is a simple example:
>>> import hashlib, base64
>>> s = 'http://gooogle.com'
>>> hash = hashlib.md5(s).hexdigest()
>>> print hash
46c4f333fae34078a68393213bb9272d
>>> print base64.b64encode(hash)
NDZjNGYzMzNmYWUzNDA3OGE2ODM5MzIxM2JiOTI3MmQ=
You can base64-encode the output of the hash. Base64 uses a couple of additional characters beyond those you mentioned (+ and /, plus = for padding).
Maybe you can use base64-encoded hashes?
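If the extra base64 characters are unwanted, one option is to re-encode the digest over a 62-symbol alphabet instead. The following is only a minimal Python 3 sketch; mixed_case_hash and ALPHABET are illustrative names, not part of hashlib or base64:

```python
import hashlib
import string

# 62 symbols: digits, uppercase letters and lowercase letters.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def mixed_case_hash(text, length=7):
    """Return `length` alphanumeric characters derived from the MD5 of `text`."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    number = int.from_bytes(digest, "big")
    chars = []
    # Peel off one base-62 digit per output character.
    for _ in range(length):
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(chars)

print(mixed_case_hash("http://gooogle.com"))
```

A 128-bit MD5 digest carries enough bits for roughly 21 base-62 characters, so asking for more than that only appends zeros.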
Related
I thought that this would be a fairly common and straightforward problem, but I searched and was not able to find it.
I am a novice Python user, mostly self-taught. I'm trying what I thought would be a fairly straightforward exercise: generating a hash value from an input phrase. Here is my code:
import hashlib
target = input("Give me a phrase: ").encode('utf-8')
hashed_target = hashlib.sha256(target)
print(hashed_target)
I execute this and get the prompt:
Give me a phrase:
I entered the phrase "Give me liberty or give me death!" and got the hash output 0x7f8ed43d6a80.
Just to test, I tried again with the same phrase, but got a different output: 0x7f1cc23bca80.
I thought that was strange, so I copied the original input and pasted it in, and got a third, different hash output: 0x7f358aabea80.
I'm sure there must be a simple explanation. I'm not getting any errors, and the code looks straightforward, but the hashes, while similar, are definitely different.
Can someone help?
You are printing the hash object itself, so what you see is its default __repr__ string, which includes the object's memory address; that address changes from run to run. You need to call the hexdigest or digest method to get the hash:
>>> import hashlib
>>> testing=hashlib.sha256(b"sha256 is much longer than 12 hex characters")
>>> testing
<sha256 HASH object @ 0x7f31c1c64670>
>>> hashed_testing=testing.hexdigest()
>>> hashed_testing
'a0798cfd68c7463937acd7c08e5c157b7af29f3bbe9af3c30c9e62c10d388e80'
>>>
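Applied to the code from the question, the fix could look like the sketch below. hash_phrase is an illustrative helper name, and the phrase is hard-coded here instead of being read with input() so the example runs unattended:

```python
import hashlib

def hash_phrase(phrase):
    """Return the SHA-256 hex digest of a text phrase."""
    return hashlib.sha256(phrase.encode("utf-8")).hexdigest()

first = hash_phrase("Give me liberty or give me death!")
second = hash_phrase("Give me liberty or give me death!")
print(first)
print(first == second)  # the same input always gives the same digest
```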
I need to create an identifier token from a set of nested configuration values.
The token can be part of a URL, so – to make processing easier – it should contain only hexadecimal digits (or something similar).
The config values are nested tuples with elements of hashable types like int, bool, str etc.
My idea was to use the built-in hash() function, as this will continue to work even if the config structure changes.
This is my first attempt:
def token(config):
    h = hash(config)
    return '{:X}'.format(h)
This will produce tokens of variable length, but that doesn't matter.
What bothers me, though, is that the token might contain a leading minus sign, since the return value of hash() is a signed integer.
As a way to avoid the sign, I thought of the following work-around, which is adding a constant to the hash value.
This constant should be half the size of the range the value of hash() can take (which is platform-dependent, e.g. different for 32-bit and 64-bit systems):
HALF_HASH_RANGE = 2**(sys.hash_info.width-1)
Is this a sane and portable solution?
Or will I shoot myself in the foot with this?
I also saw suggestions for using struct.pack() (which returns a bytes object, on which one can call the .hex() method), but it also requires knowing the range of the hash value in advance (for the choice of the right format character).
Addendum:
Encryption strength or collisions by chance are not an issue.
The drawback of the hashlib library in this scenario is that it requires writing a converter that traverses the input structure and converts everything into a bytes representation, which is cumbersome.
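For reference, the offset workaround described above might be put together as follows. This is only a sketch of the idea under discussion, not a recommendation. hash() returns values in the range from -2**(width-1) to 2**(width-1) - 1, so adding half the range always yields a non-negative number:

```python
import sys

# Half the size of the range that hash() can return on this platform.
HALF_HASH_RANGE = 2 ** (sys.hash_info.width - 1)

def token(config):
    """Return an uppercase hex token without a leading minus sign."""
    shifted = hash(config) + HALF_HASH_RANGE
    return '{:X}'.format(shifted)

print(token((1, True, ("nested", 2.5))))
```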
You can use any hash function to get a (practically) unique string. Python currently supports many algorithms out of the box, such as md5, sha1, sha224, sha256, sha384 and sha512. You can read more about them here: https://docs.python.org/2/library/hashlib.html
This example shows how to use library hashlib. (Python 3)
>>> import hashlib
>>> sha = hashlib.sha256()
>>> sha.update('somestring'.encode())
>>> sha.hexdigest()
'63f6fe797026d794e0dc3e2bd279aee19dd2f8db67488172a644bb68792a570c'
You can also try the hashids library. Note, however, that it is not a hash algorithm: you (and anyone who knows the salt) can decode the data.
$ pip install hashids
Basic usage:
>>> from hashids import Hashids
>>> hashids = Hashids()
>>> hashids.encode(123)
'Mj3'
>>> hashids.decode('Mj3')
(123,)
I need to create an identifier token from a set of nested configuration values
I came across this question while trying to solve the same problem, and realizing that some of the calls to hash return negative integers.
Here's how I would implement your token function:
import sys
def token(config) -> str:
    """Generates a hex token that identifies a hashable config."""
    # `sign_mask` is used to make `hash` return unsigned values
    sign_mask = (1 << sys.hash_info.width) - 1
    # Get the hash as a positive hex value, zero-padded to a consistent width
    return f'{hash(config) & sign_mask:0{sys.hash_info.width//4}x}'
In my case I needed it to work with a broad range of inputs for the config. It did not need to be particularly performant (it was not on a hot path), and it was acceptable if it occasionally had collisions (more than what would normally be expected from hash). All it really needed to do is produce short (e.g. 16 chars long) consistent outputs for consistent inputs. So for my case I used the above function with a small modification to ensure the provided config is hashable, at the cost of increased collision risk and processing time:
import json
import sys
def token(config) -> str:
    """Generates a hex token that identifies a config."""
    # `sign_mask` is used to make `hash` return unsigned values
    sign_mask = (1 << sys.hash_info.width) - 1
    # Use `json.dumps` with `repr` to ensure the config is hashable
    json_config = json.dumps(config, default=repr)
    # Get the hash as a positive hex value, zero-padded to a consistent width
    return f'{hash(json_config) & sign_mask:0{sys.hash_info.width//4}x}'
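One caveat: the built-in hash() of a str is randomized per process unless PYTHONHASHSEED is set, so a token built on it can differ between runs. If the token has to be stable across processes, a variant that swaps hash() for hashlib may be worth considering. The sketch below is an assumption about what such a variant could look like; stable_token is a hypothetical name:

```python
import hashlib
import json

def stable_token(config, length=16):
    """Return a hex token that is identical in every Python process."""
    # Nested tuples serialize as JSON arrays; `default=repr` covers other types.
    serialized = json.dumps(config, default=repr, sort_keys=True)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    # Truncating shortens the token at the cost of a higher collision risk.
    return digest[:length]

print(stable_token((1, True, ("nested", 2.5))))
```

Because JSON has no tuple type, a tuple and a list with the same elements produce the same token here.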
I'd recommend using hashlib.
Cast the token to a string, hash it, and then convert the hexdigest to an integer. Below is an example with the sha256 algorithm, but you can use any hashing algorithm hashlib supports:
import hashlib as hl
def shasum(token):
    return int(hl.sha256(str(token).encode('utf-8')).hexdigest(), 16)
I would like to generate a human-readable hash with customized properties -- e.g., a short string of specified length consisting entirely of upper case letters and digits excluding 0, 1, O, and I (to eliminate visual ambiguity):
"arbitrary string" --> "E3Y7UM8"
A 7-character string of the above form could take on over 34 billion unique values which, for my purposes, makes collisions extremely unlikely. Security is also not a major concern.
Is there an existing module or routine that implements something like the above? Alternatively, can someone suggest a straightforward algorithm?
The method you should be using has similarities with password one-way encryption. Of course since you are going for readable, a good password function is probably out of the question.
Here's what I would do:
Take an MD5 hash of the email
Convert it to base32, whose alphabet (A-Z and 2-7) already excludes the digits 0 and 1
Replace any remaining non-readable characters (such as O and I) with readable ones
Here's an example based on the above:
import base64 # base32 is a function in base64
import hashlib
email = "somebody@example.com"
md5 = hashlib.md5()
md5.update(email.encode('utf-8'))
hash_in_bytes = md5.digest()
result = base64.b32encode(hash_in_bytes)
print(result)
# Or you can remove the extra "=" at the end
result = result.strip(b'=')
Since it's a one-way function (hash), you obviously don't need to worry about reversing the process (you can't anyway). You can also replace any other characters you find non-readable with readable ones (I would go for lowercase versions of the characters, e.g. q instead of Q)
More about base32 here: https://docs.python.org/3/library/base64.html
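The replacement step is not shown in the example above. One way to cover it, sketched here under the assumption that the target alphabet is the 24 uppercase letters without I and O plus the digits 2 to 9, is to translate the standard base32 alphabet onto that set. readable_hash, STANDARD and READABLE are illustrative names:

```python
import base64
import hashlib

# Standard base32 alphabet and a same-sized alphabet without 0, 1, O and I.
STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
READABLE = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
TABLE = str.maketrans(STANDARD, READABLE)

def readable_hash(text, length=7):
    """Return `length` unambiguous uppercase characters derived from `text`."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=")
    return encoded.translate(TABLE)[:length]

print(readable_hash("arbitrary string"))
```

An MD5 digest encodes to 26 base32 characters, so length can be at most 26 with this approach.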
You can simply take the first few characters of an MD5 hex digest. They should have approximately the same statistical properties as the whole string anyway:
import md5
m = md5.new()
m.update("arbitrary string")
print(m.hexdigest()[:7])
The same code with the hashlib module (which requires bytes input in Python 3):
import hashlib
m = hashlib.md5()
m.update(b"arbitrary string")
print(m.hexdigest()[:7])
What I need is to hash a string. It doesn't have to be secure because it's just going to be a hidden phrase in the text file (it just must not be recognizable to the human eye).
It should not be just a random string, because when the user types the string I would like to hash it and compare it with an already hashed one (from the text file).
What would be the best for this purpose? Can it be done with the built-in classes?
First off, let me say that you can't guarantee unique results. If you wanted unique results for all the strings in the universe, you're better off storing the string itself (or a compressed version).
More on that in a second. Let's get some hashes first.
hashlib way
You can use any of the main cryptographic hashes to hash a string with a few steps:
>>> import hashlib
>>> sha = hashlib.sha1(b"I am a cat")
>>> sha.hexdigest()
'576f38148ae68c924070538b45a8ef0f73ed8710'
You have a choice between SHA1, SHA224, SHA256, SHA384, SHA512, and MD5 as far as built-ins are concerned.
What's the difference between those hash algorithms?
A hash function works by taking data of variable length and turning it into data of fixed length.
The fixed length, in the case of each of the SHA algorithms built into hashlib, is the number of bits specified in the name (with the exception of sha1 which is 160 bits). If you want better certainty that two strings won't end up in the same bucket (same hash value), pick a hash with a bigger digest (the fixed length).
In sorted order, these are the digest sizes you have to work with:
Algorithm    Digest size (in bits)
md5          128
sha1         160
sha224       224
sha256       256
sha384       384
sha512       512
The bigger the digest the less likely you'll have a collision, provided your hash function is worth its salt.
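The digest sizes can also be read from hashlib itself, since every hash object exposes a digest_size attribute in bytes. A small sketch, where NAMES and sizes are illustrative variable names:

```python
import hashlib

NAMES = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")

# digest_size is reported in bytes, so multiply by 8 to get bits.
sizes = {name: hashlib.new(name).digest_size * 8 for name in NAMES}

for name, bits in sizes.items():
    print(f"{name:<8}{bits}")
```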
Wait, what about hash()?
The built in hash() function returns integers, which could also be easy to use for the purpose you outline. There are problems though.
>>> hash('moo')
6387157653034356308
If your program is going to run on different systems, you can't be sure that hash will return the same thing. In fact, I'm running on a 64-bit box using 64-bit Python. These values are going to be wildly different than for 32-bit Python.
For Python 3.3+, as @gnibbler pointed out, hash() is randomized between runs. It will work for a single run, but almost definitely won't work across runs of your program (pulling from the text file you mentioned).
Why would hash() be built that way? The built-in hash exists for one specific purpose: hash tables, dictionaries and other in-memory lookup tables. It is meant for cheap lookups at runtime, not for cryptographic use.
Don't use hash(), use hashlib.
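For the use case in the question (hash what the user types and compare it with a stored hash), a minimal Python 3 sketch could look like this. hash_phrase and matches are illustrative names, and the stored value is kept in a variable here rather than read from the text file:

```python
import hashlib
import hmac

def hash_phrase(phrase):
    """Return the SHA-256 hex digest of a text phrase."""
    return hashlib.sha256(phrase.encode("utf-8")).hexdigest()

def matches(stored_hex, typed):
    """Check whether the typed phrase hashes to the stored hex digest."""
    return hmac.compare_digest(stored_hex, hash_phrase(typed))

stored = hash_phrase("I am a cat")  # the value that would go into the text file
print(matches(stored, "I am a cat"))  # True
print(matches(stored, "I am a dog"))  # False
```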
You can simply use the base64 module to achieve your goal:
>>> import base64
>>> a = b'helloworld'
>>> encoded_str = base64.b64encode(a)
>>> encoded_str
b'aGVsbG93b3JsZA=='
>>> base64.b64decode(encoded_str)
b'helloworld'
>>>
Of course, you can also use the hashlib module. It is more secure, because a hashed string cannot be decoded later (or only with great difficulty), but for your question base64 is enough: "It doesn't really have to be secure".
Note that Python's string hash is not "defined" - it can, and does, vary across releases and implementations. So storing a Python string hash will create difficulties. CPython's string hash makes no attempt to be "obscure", either.
A standard approach is to use a hash function designed for this kind of thing. Like this:
>>> import hashlib
>>> encoded = hashlib.sha1(b"abcdef") # "abcdef" is the password
>>> encoded.hexdigest()
'1f8ac10f23c5b5bc1167bda84b833e5c057a77d2'
That long string of hexadecimal digits is "the hash". SHA-1 is a "strong" hash function. You can get famous if you find two strings that hash to the same value ;-) And given the same input, it will return the same "hexdigest" on all platforms across all releases and implementations of Python.
Simply use the hash() built-in function, for example:
s = 'a string'
hash(s)
=> -8411828025894108412
What is the easiest way to generate a random hash (MD5) in Python?
An MD5 hash is just a 128-bit value, so if you want a random one:
import random
hash = random.getrandbits(128)
print("hash value: %032x" % hash)
I don't really see the point, though. Maybe you should elaborate why you need this...
I think what you are looking for is a universally unique identifier. The uuid module in Python provides exactly that.
import uuid
uuid.uuid4().hex
uuid4() gives you a random unique identifier that has the same length as an MD5 sum. The hex attribute represents it as a hex string instead of returning a UUID object.
http://docs.python.org/2/library/uuid.html
https://docs.python.org/3/library/uuid.html
The secrets module was added in Python 3.6. It provides cryptographically secure random values with a single call. The functions take an optional nbytes argument; the default is 32 bytes (32 * 8 = 256-bit tokens). MD5 produces 128-bit hashes, so pass 16 for "MD5-like" tokens.
>>> import secrets
>>> secrets.token_hex(nbytes=16)
'17adbcf543e851aa9216acc9d7206b96'
>>> secrets.token_urlsafe(16)
'X7NYIolv893DXLunTzeTIQ'
>>> secrets.token_bytes(128 // 8)
b'\x0b\xdcA\xc0.\x0e\x87\x9b`\x93\\Ev\x1a|u'
This works for both Python 2.x and 3.x:
import os
import binascii
print(binascii.hexlify(os.urandom(16)).decode('ascii'))
4a4d443679ed46f7514ad6dbe3733c3d
Yet another approach. You won't have to format an int to get it.
import random
import string
def random_string(length):
    pool = string.ascii_letters + string.digits
    return ''.join(random.choice(pool) for _ in range(length))
Gives you flexibility on the length of the string.
>>> random_string(64)
'XTgDkdxHK7seEbNDDUim9gUBFiheRLRgg7HyP18j6BZU5Sa7AXiCHP1NEIxuL2s0'
Another approach to this specific question:
import random, string
def random_md5like_hash():
    available_chars = string.hexdigits[:16]
    return ''.join(
        random.choice(available_chars)
        for _ in range(32))
I'm not saying it's faster or preferable to any other answer; just that it's another approach :)
import uuid
import hashlib
print(hashlib.md5(str(uuid.uuid4()).encode('utf-8')).hexdigest())
import os, hashlib
hashlib.md5(os.urandom(32)).hexdigest()
The most direct way is to use the random module:
import random
format(random.getrandbits(128), '032x')
Using secrets is overkill here: it generates cryptographically strong randomness at the cost of performance.
Answers that suggest using a UUID are subtly wrong, because a UUID (even UUID4) is not entirely random. At the very least, it includes a fixed version number that never changes.
>>> import uuid
>>> uuid.uuid4()
UUID('8a107d39-bb30-4843-8607-ce9e480c8339')
>>> uuid.uuid4()
UUID('4ed324e8-08f9-4ea5-bc0c-8a9ad53e2df6')
All MD5s containing something other than 4 at 13th position from the left will be unreachable this way.
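This is easy to check empirically. In a version 4 UUID, 6 of the 128 bits are fixed: four version bits and two variant bits. The sketch below samples a batch of UUIDs and collects the hex digits found at those positions; samples, version_digits and variant_digits are illustrative names:

```python
import uuid

samples = [uuid.uuid4().hex for _ in range(1000)]

# 13th hex digit (index 12) holds the version; 17th (index 16) holds the variant.
version_digits = {s[12] for s in samples}
variant_digits = {s[16] for s in samples}

print(version_digits)          # {'4'}
print(sorted(variant_digits))  # a subset of ['8', '9', 'a', 'b']
```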
from hashlib import md5
plaintext = input('Enter the plaintext data to be hashed: ') # input() returns a str; it is encoded to bytes on the next line
ciphertext = md5(plaintext.encode('utf-8')).hexdigest()
print(ciphertext)
It should also be noted that MD5 is a very weak hash function; collisions have been found (two different plaintext values that result in the same hash).
Just use a random value for plaintext.