What is the most dense way (fewest characters) that I can store a complete SHA-256 hash?
Calling .digest() on a hashlib.sha256 object will return a 32-byte string (a bytes object in Python 3) -- the shortest possible way (with 8-bit bytes as the relevant unit) to store 256 bits of data, which is effectively random for compressibility purposes.
Since 8 * 32 == 256, this provably has no wastage -- every bit is used.
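A quick sanity check, using nothing beyond the standard library:

import hashlib

digest = hashlib.sha256(b"hello").digest()
print(type(digest), len(digest))  # <class 'bytes'> 32, i.e. exactly 256 bits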
Charles' answer is absolutely correct. However, I'm assuming that you aren't satisfied with the shortest binary encoding of the SHA-256 hash - the 32-octet string - and want something printable and somewhat human-readable.
Note: this does not apply to barcodes, however. QR codes, at least, encode binary data, so just use the digest() method of your hash - that is the most efficient encoding you can use there. Your QR code generation library most likely supports generating codes from "raw" binary strings - check your library's docs for the correct method/invocation.
SHA hashes (and other hashes) don't produce or operate on characters; they work with binary data. SHA-256 produces 256 bits of data, commonly represented as 32 bytes. In particular, in Python 3 you should notice that hashlib.sha256(b"...").digest() returns bytes, not str.
There is a convenience method hexdigest that produces a hexadecimal (base-16) string representing those bytes. You can use base32, base58, base64, baseEmoji or any other encoding that fits your requirements.
Basically, your problem is actually "I have a number and want a short encoding of it". Decide on how many distinct characters you can use (the encoding base) and use that. There are many libraries on PyPI that could help; python-baseconv may come in handy.
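If you'd rather avoid a dependency, here is a minimal stdlib-only sketch of the same idea: treat the digest as one big integer and re-encode it in whatever alphabet you can afford. Base62 is shown here, but the alphabet choice (and the encode_digest helper name) is an arbitrary illustration:

import hashlib
import string

ALPHABET = string.digits + string.ascii_letters  # 62 distinct characters

def encode_digest(digest, alphabet=ALPHABET):
    # Interpret the 32 digest bytes as one 256-bit integer...
    n = int.from_bytes(digest, "big")
    # ...and write it out in base len(alphabet).
    out = ""
    while n:
        n, rem = divmod(n, len(alphabet))
        out = alphabet[rem] + out
    return out or alphabet[0]

print(encode_digest(hashlib.sha256(b"hello").digest()))  # ~43 chars vs 64 for hex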
Say we have code like this for hashing:
from hashlib import sha256
Hash = sha256(b"hello").hexdigest()
#Hash = '2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
hexdigest seems to be doing the main work, because without it we get the following result:
Hash = sha256(b"hello")
#Hash = <sha256 HASH object @ 0x000001E92939B950>
Using hexdigest is apparently mandatory, because without it we get a different kind of output - but what does it actually do?
The actual digest is a really big number. It is conventionally represented as a sequence of hex digits, as we humans aren't very good at dealing with numbers with more than a handful of digits. (Hex has the added advantage that it reveals some types of binary patterns really well; for example, you'd be hard pressed to reason about a number like 4,262,789,120, whereas its hex representation FE150000 readily reveals that the low 16 bits are all zeros.)

But the object is more than just a number; it's a class instance with methods which allow you, for example, to add more data in chunks, so that you can calculate the digest of a large file or a stream of data successively, without keeping all of it in memory. You can think of the digest object as a collection of states which permit this operation to be repeated many times, and the hexdigest method as a way to query its state at the current point in the input stream.
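A small illustration of that chunked usage (Python 3 assumed):

import hashlib

h = hashlib.sha256()
h.update(b"hel")  # feed data piece by piece...
h.update(b"lo")   # ...the object carries the running state between calls
print(h.hexdigest() == hashlib.sha256(b"hello").hexdigest())  # True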
You could argue that the interface could be different - for example, str(obj) could produce the hex representation; but this only pushes the problem to a different, and arguably more obscure, corner.
I am reading UNICODE Howto in the Python documentation.
It is written that
a Unicode string is a sequence of code points, which are numbers from
0 through 0x10FFFF
which makes it look like the maximum number of bits needed to represent a code point is 24 (because there are 6 hexadecimal digits, and 6*4=24).
But then the documentation states:
The first encoding you might think of is using 32-bit integers as the
code unit
Why is that? The first encoding I could think of is with 24-bit integers, not 32-bit.
Actually you only need 21 bits. But many CPUs use 32-bit registers natively, and most languages have a 32-bit integer type.
If you study the UTF-16 and UTF-8 encodings, you’ll find that their algorithms encode a maximum of a 21-bit code point using two 16-bit code units and four 8-bit code units, respectively.
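You can verify those maxima from Python itself (a quick sketch, Python 3 assumed):

ch = "\U0010FFFF"  # the highest code point
print(len(ch.encode("utf-8")))      # 4 -- four 8-bit code units
print(len(ch.encode("utf-16-le")))  # 4 -- two 16-bit code units (a surrogate pair)
print(len(ch.encode("utf-32-le")))  # 4 -- one 32-bit code unit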
Because it is the standard way. Python uses different internal encodings depending on the content of the string: ASCII/ISO-8859-1, UTF-16 or UTF-32. UTF-32 is a commonly used representation (usually just internal to programs) of Unicode code points. So instead of inventing yet another encoding (e.g. a "UTF-22"), Python just uses the UTF-32 representation. It is also easier for the various interfaces. It is not so efficient on space, but much more efficient for string operations.
Note: in some rare cases Python also uses the surrogate range to encode "wrong" bytes, so you need more than 0x10FFFF distinct code points.
Note: colour encoding went through something similar: 8 bits * 3 channels = 24 bits, but colours are often represented as 32-bit integers (partly for other reasons: a single write instead of 2 reads + 2 writes on the bus). 32 bits is simply much easier and faster to handle.
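Returning to strings: you can observe Python's adaptive internal representation indirectly through object sizes. A rough sketch - the exact byte counts vary by interpreter build; CPython with PEP 393 is assumed:

import sys

print(sys.getsizeof("aaaa"))           # smallest: 1 byte per character (Latin-1 range)
print(sys.getsizeof("aaa\u20ac"))      # wider: 2 bytes per character (BMP)
print(sys.getsizeof("aaa\U0001F600"))  # widest: 4 bytes per character (astral plane)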
The purpose of base64.b64encode() is to convert binary data into ASCII-safe "text". However, the method returns an object of type bytes:
>>> import base64
>>> base64.b64encode(b'abc')
b'YWJj'
It's easy to simply take that output and decode() it, but my question is: what is the significance of base64.b64encode() returning bytes rather than str?
The purpose of the base64.b64encode() function is to convert binary data into ASCII-safe "text"
Python disagrees with that - base64 has been intentionally classified as a binary transform.
It was a design decision in Python 3 to force the separation of bytes and text and prohibit implicit transformations. Python is now so strict about this that bytes.encode doesn't even exist, and so b'abc'.encode('base64') would raise an AttributeError.
The opinion the language takes is that a bytestring object is already encoded. A codec which encodes bytes into text does not fit into this paradigm, because when you want to go from the bytes domain to the text domain it's a decode. Note that rot13 encoding was also banished from the list of standard encodings for the same reason - it didn't fit properly into the Python 3 paradigm.
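You can see that classification in the codecs machinery. A quick sketch (Python 3.4+, where the bytes-to-bytes codecs are exposed through codecs.encode):

import base64
import codecs

print(base64.b64encode(b"abc"))         # b'YWJj' -- bytes in, bytes out
print(codecs.encode(b"abc", "base64"))  # b'YWJj\n' -- a bytes-to-bytes codec
# b"abc".encode("base64")  # AttributeError: 'bytes' object has no attribute 'encode'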
There is also a performance argument to be made: suppose Python automatically decoded the base64 output - an ASCII-encoded binary representation produced by C code in the binascii module - into a Python object in the text domain. If you actually wanted the bytes, you would then have to undo that decoding by encoding to ASCII again. It would be a wasteful round-trip, an unnecessary double negation. Better to opt in to the decode-to-text step.
It's impossible for b64encode() to know what you want to do with its output.
While in many cases you may want to treat the encoded value as text, in many others – for example, sending it over a network – you may instead want to treat it as bytes.
Since b64encode() can't know, it refuses to guess. And since the input is bytes, the output remains the same type, rather than being implicitly coerced to str.
As you point out, decoding the output to str is straightforward:
base64.b64encode(b'abc').decode('ascii')
... as well as being explicit about the result.
As an aside, it's worth noting that although base64.b64decode() (note: decode, not encode) has accepted str since version 3.3, the change was somewhat controversial.
I have arbitrary binary data. I need to store it in a system that expects valid UTF8. It will never be interpreted as text, I just need to put it in there and be able to retrieve it and reconstitute my binary data.
Obviously base64 would work, but I can't have that much inflation.
How can I easily achieve this in python 2.7?
You'll have to express your data using just ASCII characters. Using Base64 is the most efficient method (available in the Python standard library) to do this, in terms of making binary data fit in printable text that is also UTF-8 safe. Sure, it requires 33% more space to express the same data, but other methods take more additional space.
You can combine this with compression to limit how much space this is going to take, but make the compression optional (mark the data) and only actually use it if the data is going to be smaller.
import zlib
import base64

def pack_utf8_safe(data):
    is_compressed = False
    # Only use the compressed form when it actually saves space; the
    # one-byte marker below means it must save at least 2 bytes to pay off.
    compressed = zlib.compress(data)
    if len(compressed) < (len(data) - 1):
        data = compressed
        is_compressed = True
    base64_encoded = base64.b64encode(data)
    if is_compressed:
        # '.' is not in the Base64 alphabet, so it can safely mark compressed data
        base64_encoded = '.' + base64_encoded
    return base64_encoded

def unpack_utf8_safe(base64_encoded):
    decompress = False
    if base64_encoded.startswith('.'):
        base64_encoded = base64_encoded[1:]
        decompress = True
    data = base64.b64decode(base64_encoded)
    if decompress:
        data = zlib.decompress(data)
    return data
The '.' character is not part of the Base64 alphabet, so I used it here to mark compressed data.
You could further shave off the 1 or 2 '=' padding characters from the end of the Base64 encoded data; these can then be re-added when decoding (append '=' * (-len(encoded) % 4) to the end), but I'm not sure that's worth the bother.
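Usage might look like this (Python 2.7, as in the question; on long repetitive input the compressed branch kicks in and the '.' marker appears):

>>> packed = pack_utf8_safe('binary \x00\xff data' * 20)
>>> packed.startswith('.')
True
>>> unpack_utf8_safe(packed) == 'binary \x00\xff data' * 20
True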
You can achieve further savings by switching to the Base85 encoding, a 4-to-5 ratio ASCII-safe encoding for binary data, so a 25% overhead. For Python 2.7 this is only available in an external library (Python 3.4 added it to the base64 module). You can use the python-mom project on 2.7:
from mom.codec import base85
and replace all base64.b64encode() and base64.b64decode() calls with base85.b85encode() and base85.b85decode() calls instead.
If you are 100% certain nothing along the path is going to treat your data as text (possibly altering line separators, or interpret and alter other control codes), you could also use the Base128 encoding, reducing the overhead to a 14.3% increase (8 characters for every 7 bytes). I cannot, however, recommend a pip-installable Python module for you; there is a GitHub hosted module but I have not tested it.
You can decode your bytes as 8859-1 data, which will always produce a valid Unicode string. Then you can encode it to UTF8:
utf8_data = my_bytes.decode('iso8859-1').encode('utf8')
On average, half your data will be in the 0-127 range, which is one byte in UTF8, and half your data will be in the 128-255 range, which is two bytes in UTF8, so your result will be 50% larger than your input data.
If there is any structure to your data at all, then zlib-compressing it, as Martijn suggests, might reduce the size.
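A short sketch showing both the losslessness and the ~50% inflation of the ISO-8859-1 approach:

original = bytes(bytearray(range(256)))  # every possible byte value
utf8_data = original.decode('iso8859-1').encode('utf8')
print(len(utf8_data))  # 384: the 128 high bytes double, the ~50% growth described above
print(utf8_data.decode('utf8').encode('iso8859-1') == original)  # True: fully reversible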
If your application really requires you to be able to represent 256 different byte values in a graphically distinguishable form, all you actually need is 256 Unicode code points. Problem solved.
ASCII codes 33-126 are a no-brainer; Unicode code points 160-255 are also good candidates for representing themselves, but you might want to exclude a few which are hard to distinguish (if you want OCR or humans to handle them reliably, áåä etc. might be too similar). Pick the rest from the set of code points which can be represented in two bytes - quite a large set, but again, many of them are graphically indistinguishable from other glyphs in most renderings.
This scheme does not attempt any form of compression. I imagine you'd get better results by compressing your data prior to encoding it if that's an issue.
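A minimal sketch of such a scheme (Python 3 for brevity; on 2.7 substitute unichr() for chr()). The particular alphabet - printable ASCII, then Latin-1, then Cyrillic to fill out 256 symbols - is an arbitrary illustration, not a standard:

ALPHABET = [chr(c) for c in range(33, 127)]        # 94 printable ASCII characters
ALPHABET += [chr(c) for c in range(160, 256)]      # 96 printable Latin-1 characters
ALPHABET += [chr(c) for c in range(0x410, 0x452)]  # 66 Cyrillic letters to reach 256
REVERSE = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(data):
    # one visible character per input byte; at most 2 UTF-8 bytes each
    return ''.join(ALPHABET[b] for b in bytearray(data))

def decode(text):
    return bytes(bytearray(REVERSE[ch] for ch in text))

print(decode(encode(b'\x00\xffhello')) == b'\x00\xffhello')  # True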
I am using the md5 function to hash a string into a 32 digit string.
import hashlib

str_to_encode = 'this is a test string which I want to encode'
encoded = hashlib.md5(str_to_encode).hexdigest()
I want to be able to decode this string (i.e. encoded in the example above) back to its original value. I don't think this is possible using md5 (but if it is, please let me know), but is there a compression function I can use which will give me a 32 digit string at the end but which can be reversed?
EDIT:
The string being encoded is a url so will only be a couple of hundred characters max although in most cases it will be a lot less.
Thanks
It seems to me that you aren't looking for a hash or encryption; you are looking for compression. Try zlib and base64 encoding:
import zlib

s = 'Hello, world'
# Python 2: str objects support the 'base64' codec directly
encoded = zlib.compress(s).encode('base64')
The length of the encoded data will grow as the input grows, but it may work for you.
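And reversing it (still Python 2, where str carries the 'base64' codec):

decoded = zlib.decompress(encoded.decode('base64'))
assert decoded == s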
Even restricting yourself to URLs, there's no way to reversibly map them to 32-character strings; there are simply too many possible URLs.
You seem to want two things that can't coexist:
Any string of any length is converted to exactly 32 bytes, even if it started as 4 GB
The encoded string is decodable without loss of information
There are only so many bits in an MD5 hash, so by the pigeonhole principle it's impossible to reverse it; if it were reversible, you could use a hash to compress information infinitely. Furthermore, irreversibility is the main point of a hash: they're intended to be one-way functions. Encryption algorithms are reversible, but they require more bytes to store the ciphertext, since decodability means they must be collision-free (two plaintexts can't encrypt to the same ciphertext, or the decryption function wouldn't know which plaintext to output for a given ciphertext).
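The arithmetic makes the pigeonhole point concrete: a 32-hex-digit MD5 value can take only 16**32 == 2**128 distinct values, while even modest inputs vastly outnumber that:

print(16 ** 32 == 2 ** 128)   # True: only 2**128 possible digests
print(256 ** 200 > 2 ** 128)  # True: far more 200-byte URLs than digests, so collisions are unavoidable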