What I need is to hash a string. It doesn't have to be secure because it's just going to be a hidden phrase in the text file (it just doesn't have to be recognizable for a human-eye).
It should not be just a random string because when the users types the string I would like to hash it and compare it with an already hashed one (from the text file).
What would be the best for this purpose? Can it be done with the built-in classes?
First off, let me say that you can't guarantee unique results. If you wanted unique results for all the strings in the universe, you're better off storing the string itself (or a compressed version).
More on that in a second. Let's get some hashes first.
hashlib way
You can use any of the main cryptographic hashes to hash a string with a few steps:
>>> import hashlib
>>> sha = hashlib.sha1("I am a cat")
>>> sha.hexdigest()
'576f38148ae68c924070538b45a8ef0f73ed8710'
You have a choice between SHA1, SHA224, SHA256, SHA384, SHA512, and MD5 as far as built-ins are concerned.
What's the difference between those hash algorithms?
A hash function works by taking data of variable length and turning it into data of fixed length.
The fixed length, in the case of each of the SHA algorithms built into hashlib, is the number of bits specified in the name (with the exception of sha1 which is 160 bits). If you want better certainty that two strings won't end up in the same bucket (same hash value), pick a hash with a bigger digest (the fixed length).
In sorted order, these are the digest sizes you have to work with:
Algorithm Digest Size (in bits)
md5 128
sha1 160
sha224 224
sha256 256
sha384 384
sha512 512
The bigger the digest the less likely you'll have a collision, provided your hash function is worth its salt.
Wait, what about hash()?
The built in hash() function returns integers, which could also be easy to use for the purpose you outline. There are problems though.
>>> hash('moo')
6387157653034356308
If your program is going to run on different systems, you can't be sure that hash will return the same thing. In fact, I'm running on a 64-bit box using 64-bit Python. These values are going to be wildly different than for 32-bit Python.
For Python 3.3+, as #gnibbler pointed out, hash() is randomized between runs. It will work for a single run, but almost definitely won't work across runs of your program (pulling from the text file you mentioned).
Why would hash() be built that way? Well, the built in hash is there for one specific reason. Hash tables/dictionaries/look up tables in memory. Not for cryptographic use but for cheap lookups at runtime.
Don't use hash(), use hashlib.
You can simply use the base64 module to achieve your goal:
>>> import base64
>>> a = 'helloworld'
>>> encoded_str = base64.encodestring(a)
>>> encoded_str
'aGVsbG93b3JsZA=='
>>> base64.decodestring(encoded_str)
'helloworld'
>>>
of course you can also use the the hashlib module, it's more secure , because the hashed string cannot(or very very hard) be decoded latter, but for your question base64 is enough -- "It doesn't really have to be secure"
Note that Python's string hash is not "defined" - it can, and does, vary across releases and implementations. So storing a Python string hash will create difficulties. CPython's string hash makes no attempt to be "obscure", either.
A standard approach is to use a hash function designed for this kind of thing. Like this:
>>> import hashlib
>>> encoded = hashlib.sha1("abcdef") # "abcdef" is the password
>>> encoded.hexdigest()
'1f8ac10f23c5b5bc1167bda84b833e5c057a77d2'
That long string of hexadecimal digits is "the hash". SHA-1 is a "strong" hash function. You can get famous if you find two strings that hash to the same value ;-) And given the same input, it will return the same "hexdigest" on all platforms across all releases and implementations of Python.
Simply use the hash() built-in function, for example:
s = 'a string'
hash(s)
=> -8411828025894108412
Related
I have a really long string that I want to shorten into a smaller string of random characters similar to what a hash does. However, I want to be able to undo it later to read it. As far as I know hashes are unable to be undone and thus I could not read it later.
You can use Python's builtin compression library zlib
>>> from zlib import compress, decompress
>>> original = 'A' * 1024
>>> len(original)
1024
>>> compressed = compress(original.encode('utf-8'))
>>> len(compressed)
17
>>> original == decompress(compressed).decode('utf-8')
True
Note that the original string must contain some patterns to be compressed efficiently. In general, the more entropy original has, the longer compressed will be.
I ended up using a database and just stuck with the long strings. Originally, I was going to shorten them and not have a database, but I think a database is better anyways and it allows me to store the long form.
I need to create an identifier token from a set of nested configuration values.
The token can be part of a URL, so – to make processing easier – it should contain only hexadecimal digits (or something similar).
The config values are nested tuples with elements of hashable types like int, bool, str etc.
My idea was to use the built-in hash() function, as this will continue to work even if the config structure changes.
This is my first attempt:
def token(config):
h = hash(config)
return '{:X}'.format(h)
This will produce tokens of variable length, but that doesn't matter.
What bothers me, though, is that the token might contain a leading minus sign, since the return value of hash() is a signed integer.
As a way to avoid the sign, I thought of the following work-around, which is adding a constant to the hash value.
This constant should be half the size of the range the value of hash() can take (which is platform-dependent, eg. different for 32-/64-bit systems):
HALF_HASH_RANGE = 2**(sys.hash_info.width-1)
Is this a sane and portable solution?
Or will I shoot myself in the foot with this?
I also saw suggestions for using struct.pack() (which returns a bytes object, on which one can call the .hex() method), but it also requires knowing the range of the hash value in advance (for the choice of the right format character).
Addendum:
Encryption strength or collisions by chance are not an issue.
The drawback of the hashlib library in this scenario is that it requires writing a converter that traverses the input structure and converts everything into a bytes representation, which is cumbersome.
You can use any of hash functions for getting unique string. Right now python support out of the box many algorithms, like: md5, sha1, sha224, sha256, sha384, sha512. You can read more about it here - https://docs.python.org/2/library/hashlib.html
This example shows how to use library hashlib. (Python 3)
>>> import hashlib
>>> sha = hashlib.sha256()
>>> sha.update('somestring'.encode())
>>> sha.hexdigest()
>>> '63f6fe797026d794e0dc3e2bd279aee19dd2f8db67488172a644bb68792a570c'
Also you can try library hashids. But note that it's not a hash algorithm and you (and anyone who knows salt) can decrypt data.
$ pip install hashids
Basic usage:
>>> from hashids import Hashids
>>> hashids = Hashids()
>>> hashids.encode(123)
'Mj3'
>>> hashids.decode('Mj3')
123
I need to create an identifier token from a set of nested configuration values
I came across this question while trying to solve the same problem, and realizing that some of the calls to hash return negative integers.
Here's how I would implement your token function:
import sys
def token(config) -> str:
"""Generates a hex token that identifies a hashable config."""
# `sign_mask` is used to make `hash` return unsigned values
sign_mask = (1 << sys.hash_info.width) - 1
# Get the hash as a positive hex value with consistent padding without '0x'
return f'{hash(config) & sign_mask:#0{sys.hash_info.width//4}x}'[2:]
In my case I needed it to work with a broad range of inputs for the config. It did not need to be particularly performant (it was not on a hot path), and it was acceptable if it occasionally had collisions (more than what would normally be expected from hash). All it really needed to do is produce short (e.g. 16 chars long) consistent outputs for consistent inputs. So for my case I used the above function with a small modification to ensure the provided config is hashable, at the cost of increased collision risk and processing time:
import sys
def token(config) -> str:
"""Generates a hex token that identifies a config."""
# `sign_mask` is used to make `hash` return unsigned values
sign_mask = (1 << sys.hash_info.width) - 1
# Use `json.dumps` with `repr` to ensure the config is hashable
json_config = json.dumps(config, default=repr)
# Get the hash as a positive hex value with consistent padding without '0x'
return f'{hash(json_config) & sign_mask:#0{sys.hash_info.width//4}x}'[2:]
I'd reccomend using hashlib
cast the token to a string, and then cast the hexdigest to a hex integer. Bellow is an example with the sha256 algorithm but you can use any hashing algorithm hashlib supports
import hashlib as hl
def shasum(token):
return int(hl.sha256(str(token).encode('utf-8')).hexdigest(), 16)
I would like to generate a human-readable hash with customized properties -- e.g., a short string of specified length consisting entirely of upper case letters and digits excluding 0, 1, O, and I (to eliminate visual ambiguity):
"arbitrary string" --> "E3Y7UM8"
A 7-character string of the above form could take on over 34 billion unique values which, for my purposes, makes collisions extremely unlikely. Security is also not a major concern.
Is there an existing module or routine that implements something like the above? Alternatively, can someone suggest a straightforward algorithm?
The method you should be using has similarities with password one-way encryption. Of course since you are going for readable, a good password function is probably out of the question.
Here's what I would do:
Take an MD5 hash of the email
Convert base32 which already eliminates O and I
Replace any non-readable characters with readable ones
Here's an example based on the above:
import base64 # base32 is a function in base64
import hashlib
email = "somebody#example.com"
md5 = hashlib.md5()
md5.update(email.encode('utf-8'))
hash_in_bytes = md5.digest()
result = base64.b32encode(hash_in_bytes)
print(result)
# Or you can remove the extra "=" at the end
result = result.strip(b'=')
Since it's a one-way function (hash), you obviously don't need to worry about reversing the process (you can't anyway). You can also replace any other characters you find non-readable with readable ones (I would go for lowercase versions of the characters, e.g. q instead of Q)
More about base32 here: https://docs.python.org/3/library/base64.html
You can simply truncate the beginning of an MD5sum algorithm. It should have approximately the same statistical properties than the whole string anyway:
import md5
m = md5.new()
m.update("arbitrary string")
print(m.hexdigest()[:7])
Same code with hashlib module:
import hashlib
m = hashlib.md5()
m.update("arbitrary string")
print(m.hexdigest()[:7])
Do you know how to create a ldap compatible password (preferred md5crypt) via python on Windows
I used to write something like this in Linux but the crypt module is not present on Windows
char_set = string.ascii_uppercase + string.digits
salt = ''.join(random.sample(char_set,8))
salt = '$1$' + salt + '$'
pwd = "{CRYPT}" + crypt.crypt(str(old_password),salt)
The Passlib python library contains cross-platform implementations of all the crypt(3) algorithms. In particular, it contains ldap_md5_crypt, which sounds like exactly what you want. Here's how to use it (this code will work on windows or linux):
from passlib.hash import ldap_md5_crypt
#note salt generation is automatically handled
hash = ldap_md5_crypt.encrypt("password")
#hash will be similar to '{CRYPT}$1$wa6OLvW3$uzcIj2Puf3GcFDf2KztQN0'
#to verify a password...
valid = ldap_md5_crypt.verify("password", hash)
I should note that while MD5-Crypt is widely supported (Linux, all the BSDs, internally in openssl), it's none-the-less not the strongest hash available really horribly insecure, and should be avoided if at all possible. If you want the strongest hash that's compatible with linux crypt(), SHA512-Crypt is probably the way to go. It adds variable rounds, as well as some other improvements over MD5-Crypt internally.
From here http://www.openldap.org/faq/data/cache/347.html
One of the variants for generating SHA-hash can be:
import sha
from base64 import b64encode
ctx = sha.new("your_password")
hash = "{SHA}" + b64encode(ctx.digest())
print(hash)
This code is for Python.
# python my_sha.py
{SHA}Vk40DNSEN9Lf6HbuFUzJncTQ0Tc=
I (and not only me) don't recommend to use MD5 anymore.
PS. Follow the link you can try some windows variants.
You'll want to use fcrypt, which is a pure Python implementation of the Unix module crypt. It's a bit slower than crypt but it has the same functionality.
Disclaimer: I know Google, not cryptography.
From the crypt docs:
This module implements an interface to
the crypt(3) routine, which is a
one-way hash function based upon a
modified DES algorithm; see the Unix
man page for further details. Possible
uses include allowing Python scripts
to accept typed passwords from the
user, or attempting to crack Unix
passwords with a dictionary.
You could have a look at md5crypt.py. Alternatively, crypt for Windows is part of GnuWin32. Here's some of the Unix man page; the Windows interface should be similar.
CRYPT(3) Linux
Programmer's Manual
CRYPT(3)
NAME
crypt, crypt_r - password and data encryption
SYNOPSIS
#define _XOPEN_SOURCE
#include <unistd.h>
char *crypt(const char *key, const char *salt);
char *crypt_r(const char *key, const char *salt,
struct crypt_data *data);
Link with -lcrypt.
DESCRIPTION
crypt() is the password encryption function. It is based on
the Data
Encryption Standard algorithm with variations intended (among
other
things) to discourage use of hardware implementations of a key
search.
key is a user's typed password.
salt is a two-character string chosen from the set [a–zA–Z0–9./].
This
string is used to perturb the algorithm in one of 4096 different
ways.
By taking the lowest 7 bits of each of the first eight characters
of
the key, a 56-bit key is obtained. This 56-bit key is used to
encrypt
repeatedly a constant string (usually a string consisting of
all
zeros). The returned value points to the encrypted password, a
series
of 13 printable ASCII characters (the first two characters
represent
the salt itself). The return value points to static data whose
content
is overwritten by each call.
Warning: The key space consists of 2**56 equal 7.2e16 possible
values.
Exhaustive searches of this key space are possible using massively
par‐
allel computers. Software, such as crack(1), is available which
will
search the portion of this key space that is generally used by
humans
for passwords. Hence, password selection should, at minimum,
avoid
common words and names. The use of a passwd(1) program that checks
for
crackable passwords during the selection process is recommended.
The DES algorithm itself has a few quirks which make the use of
the
crypt() interface a very poor choice for anything other than
password
authentication. If you are planning on using the crypt()
interface for
a cryptography project, don't do it: get a good book on encryption
and
one of the widely available DES libraries.
I am having a hard time figuring out a reasonable way to generate a mixed-case hash in Python.
I want to generate something like: aZeEe9E
Right now I'm using MD5, which doesn't generate case-sensitive hashes.
Do any of you know how to generate a hash value consisting of upper- and lower- case characters + numbers?
-
Okay, GregS's advice worked like a charm (on the first try!):
Here is a simple example:
>>> import hashlib, base64
>>> s = 'http://gooogle.com'
>>> hash = hashlib.md5(s).digest()
>>> print hash
46c4f333fae34078a68393213bb9272d
>>> print base64.b64encode(hash)
NDZjNGYzMzNmYWUzNDA3OGE2ODM5MzIxM2JiOTI3MmQ=
you can base64 encode the output of the hash. This has a couple of additional characters beyond those you mentioned.
Maybe you can use base64-encoded hashes?