I have the following code for hashing:
from hashlib import sha256
Hash = sha256(b"hello").hexdigest()
#Hash = '2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
hexdigest seems to be doing the real work, because without it we get the following result:
Hash = sha256(b"hello")
#Hash = <sha256 HASH object @ 0x000001E92939B950>
Using hexdigest seems to be mandatory, since omitting it gives a different kind of result entirely. What does it actually do?
The actual digest is a really big number. It is conventionally represented as a sequence of hex digits, as we humans aren't very good at dealing with numbers with more than a handful of digits (and hex has the advantage that it reveals some types of binary patterns really well; for example, you'd be hard pressed to reason about a number like 4,262,789,120, whereas its hex representation FE150000 readily reveals that the low 16 bits are all zeros).

But the object is more than just a number; it's a class instance with methods which allow you, for example, to add more data in chunks, so that you can calculate the digest of a large file or a stream of data successively, without keeping all of it in memory.

You can think of the digest object as a collection of states which permit this operation to be repeated many times, and the hexdigest method as a way to query its state at the current point in the input stream.
You could argue that the interface could be different - for example, str(obj) could produce the hex representation; but this only pushes the problem to a different, and arguably more obscure, corner.
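The chunked use described above can be sketched like this; feeding the data in pieces gives exactly the same digest as hashing it in one go:

```python
from hashlib import sha256

# Feed data in chunks; the digest object carries state between update() calls,
# so a large file or stream can be hashed without holding it all in memory.
h = sha256()
for chunk in (b"he", b"llo"):
    h.update(chunk)

chunked = h.hexdigest()
one_shot = sha256(b"hello").hexdigest()
# Both paths yield the same 64-character hex string.
```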
Currently, I have a system that converts a list of integers to their binary representations. I calculate the number of bytes each number requires and then use the to_bytes() function to convert them to bytes, like so:
import math

with open(outFileName, "wb") as o:
    for n in result:
        numBytes = math.ceil(n.bit_length() / 8)
        o.write(n.to_bytes(numBytes, 'little'))
However, since the encoded numbers occupy varying numbers of bytes, how can an unpacking program/function know how long each number's representation is? I have heard of the struct module, and specifically its pack function, but with a focus on efficiency and keeping the file as small as possible, what would be the best way to encode the list so that an unpacking program can retrieve the exact list of originally encoded integers?
You can't. Your encoding maps different lists of integers to the same sequence of bytes. It is then impossible to know which one was the original input.
You need a different encoding.
Take a look at using the high bit of each byte as a continuation flag. There are other ways that might be better, depending on the distribution of your integers, such as Golomb coding.
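The high-bit idea can be sketched as a LEB128-style variable-length encoding; this is a minimal illustration, not the only possible layout:

```python
def encode_varint(n):
    """Encode a non-negative int, 7 payload bits per byte; a set high bit
    means "more bytes follow" (LEB128-style, little-endian groups)."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set: not the last byte
        else:
            out.append(b)         # high bit clear: final byte of this number
            return bytes(out)

def decode_varints(data):
    """Decode a concatenated stream of varints back into a list of ints."""
    result, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:          # high bit clear: this number is complete
            result.append(n)
            n, shift = 0, 0
    return result
```

Because each number carries its own terminator, no separate length table is needed, and small integers cost only one byte.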
So I am generating a URL in Python for a GET request (it has to be a GET request), and one of my query string parameters is EXTREMELY long (~900 chars). Is there any way I can compress this string and place it in the URL? I have tried zlib, but that produces bytes and the URL needs to be a string. Basically, is there any way to do this?
# On server
x = '900_char_string'
compressed_string = compress(x)
return 'http://whatever?querystring_var=' + compressed_string
# ^ return value is what client requests by clicking link with that url or whatever
# On client
# GET http://whatever?querystring_var=randomcompressedchars<900
# Server receiving request
value = request['querystring_var']
y = decompress(value)
print(y)
>>> 900_char_string # at this point server can work with the uncompressed string
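A concrete version of the compress/decompress sketch above, assuming urlsafe base64 is used to keep the result URL-legal (the function names here are placeholders, not a library API):

```python
import base64
import zlib

def compress_param(s):
    # zlib yields bytes; urlsafe_b64encode maps them onto URL-legal characters
    return base64.urlsafe_b64encode(zlib.compress(s.encode())).decode()

def decompress_param(p):
    return zlib.decompress(base64.urlsafe_b64decode(p)).decode()
```

Note that base64 inflates its input by about 33%, so for a truly random 900-character string the result may not end up shorter at all; the analysis that follows explains why.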
The issue is now fairly clear. I think we need to examine this from a standpoint of information theory.
The input is a string of visible characters, currently represented in 8 bits each.
The "alphabet" for this string is alphanumeric (26+26+10 symbols), plus about 20 special and reserved characters, 80+ characters total.
There is no apparent redundancy in the generated string.
There are three main avenues to shortening a representation, taking advantage of
Frequency of characters (Huffman coding): replace a frequent character with fewer than 8 bits; longer bit strings will then be needed for rare characters.
Frequency of substrings (compression): replace a frequent substring with a single character.
Convert to a different base: ideally, len(alphabet).
The first two methods can lengthen the resulting string, as they require starting with a translation table. Also, since your strings appear to be taken from a uniform random distribution, there will be no redundancy or commonality to leverage. When the Shannon entropy is at or near the maximum over the input tokens, there is nothing to be gained in those methods.
This leaves us with base conversion. We're using 8 bits -- 256 combinations -- to represent an alphabet of only 82 characters. A simple base conversion will save about 20%; the ratio is log(82) / log(256). If you want a cheap conversion, simply map into a 7-bit representation, a saving of 12.5%.
Very simply, define a symbol ordinality on your character set, such as
0123456789ABCDEFGH...YZabcd...yz:/?#[]()#!$%&'*+,;=% (81 chars)
Now, compute the numerical equivalent of a given string, just as if you were hand-coding a conversion from a decimal or hex string. The resulting large integer is the compressed value. Write it out in bytes, or chop it into 32-bit integers, or whatever fits your intermediate storage medium.
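A minimal sketch of that conversion, assuming a hypothetical alphabet (the exact character set and ordering are placeholders for whatever your strings actually use):

```python
# A hypothetical 80-character alphabet; the exact set and ordering are assumptions.
ALPHABET = ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz:/?#[]()!$&'*+,;=%")
BASE = len(ALPHABET)

def string_to_int(s):
    # Read the string as one big base-BASE numeral, most significant digit first.
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n

def int_to_string(n, length):
    # The original length must be carried separately, or leading
    # ALPHABET[0] characters would be lost in the round trip.
    chars = []
    for _ in range(length):
        n, r = divmod(n, BASE)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))
```

The resulting integer can then be written out with int.to_bytes, taking roughly log2(BASE)/8 of a byte per character instead of a full byte.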
What is the most dense way (fewest characters) that I can store a complete SHA-256 hash?
Calling .digest() on a hashlib.sha256 object will return a 32-byte string -- the shortest possible way (with 8-bit bytes as the relevant unit) to store 256 bits of data which is effectively random for compressibility purposes.
Since 8 * 32 == 256, this provably has no wastage -- every bit is used.
Charles' answer is absolutely correct. However, I'm assuming that you aren't after the shortest binary encoding of the SHA-256 hash (the 32-octet string) and want something printable and somewhat human-readable.
Note: this does not exactly apply to barcodes, however. QR codes, at least, encode binary data, so just use the digest() method of your hash; that is the most efficient encoding you can use there. Your QR code generation library should most likely support generating codes from "raw" binary strings; check your library docs and find the correct method/invocation.
SHA hashes (and other hashes) don't produce or operate on characters, they work with binary data. SHA-256 produces 256 bits of data, commonly represented with 32 bytes. In particular, in Python 3 you should notice that hashlib.sha256("...").digest() returns bytes and not str.
There is a convenience method, hexdigest, that produces a hexadecimal (base16) string representing those bytes. You can use base32, base58, base64, baseEmoji or any other encoding that fits your requirements.
Basically, your problem is actually "I have a number and want a short encoding of it". Decide on how many distinct characters you can use (encoding base) and use that. There are many libraries on PyPI that could help. python-baseconv may come handy.
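For a sense of the trade-off, here is how the same 256-bit digest comes out under a few encodings (the padding-stripping trick is a common convention, not required):

```python
import base64
import hashlib

digest = hashlib.sha256(b"hello").digest()   # 32 raw bytes: the densest form
hex_form = digest.hex()                      # base16: 64 characters
b64_form = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()  # 43 characters
```

Higher bases pack more bits per character, which is why base64 needs only 43 printable characters where hex needs 64.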
I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python str type (in 2.x, anyway) is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
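A small sketch of that decoding step; the byte string here is an invented example, chosen so the two interpretations visibly diverge:

```python
raw = b'caf\xc3\xa9'  # bytes as they might arrive off the wire

# If you know the encoding, decode() turns the bytes into real text
# (an in-memory sequence of Unicode code points):
as_utf8 = raw.decode('utf-8')

# The same bytes read with a different mapping produce a different string:
as_latin1 = raw.decode('latin-1')
```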
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]
I am using the md5 function to hash a string into a 32 digit string.
import hashlib

str_to_encode = 'this is a test string which I want to encode'
encoded = hashlib.md5(str_to_encode.encode()).hexdigest()
I want to be able to decode this string (i.e. encoded in the example above) back to its original value. I don't think this is possible using md5 (but if it is, please let me know), but is there a compression function which I can use which will give me a 32 digit string at the end but which can be reverted?
EDIT:
The string being encoded is a url so will only be a couple of hundred characters max although in most cases it will be a lot less.
Thanks
It seems to me that you aren't looking for a hash or encryption, you are looking for compression. Try zlib and base64 encoding:
import zlib
import base64

s = b'Hello, world'
encoded = base64.b64encode(zlib.compress(s))
The length of the encoded data will grow as the input grows, but it may work for you.
Even restricting yourself to URLs, there's no way to reversibly map them to 32-character strings, there are just too many possible URLs.
You seem to want two things that can't coexist:
Any string of any length is converted to exactly 32 bytes, even if it started as 4 GB
The encoded string is decodable without loss of information
There are only so many bits in an MD5 hash, so by the pigeonhole principle it's impossible to reverse it. If it were reversible, you could use a hash to compress information infinitely. Furthermore, irreversibility is the main point of a hash; they're intended to be one-way functions. Encryption algorithms are reversible, but require more bytes to store the ciphertext, since decodability means they must be collision-free (two plaintexts can't encode to the same ciphertext, or the decode function wouldn't know which plaintext to output given that ciphertext).