Using Python, how can I compress a long query string value?

So I am generating a URL in Python for a GET request (it has to be a GET request), and one of my query string parameters is EXTREMELY long (~900 chars). Is there any way I can compress this string and place it in the URL? I have tried zlib, but that works on bytes and the URL needs to be a string. Basically, is there any way to do this?
# On server
x = '900_char_string'
compressed_string = compress(x)
return 'http://whatever?querystring_var=' + compressed_string
# ^ return value is what client requests by clicking link with that url or whatever
# On client
# GET http://whatever?querystring_var=randomcompressedchars<900
# Server receiving request
value = request['querystring_var']
y = decompress(value)
print(y)
>>> 900_char_string # at this point server can work with the uncompressed string

The issue is now fairly clear. I think we need to examine this from a standpoint of information theory.
The input is a string of visible characters, currently represented in 8 bits each.
The "alphabet" for this string is alphanumeric (26+26+10 symbols), plus about 20 special and reserved characters, 80+ characters total.
There is no apparent redundancy in the generated string.
There are three main avenues to shortening the representation, each taking advantage of a different property:
Frequency of characters (Huffman coding): replace a frequent character with fewer than 8 bits; longer bit strings will then be needed for rare characters.
Frequency of substrings (dictionary compression): replace a frequent substring with a single character.
Conversion to a different base: ideally, base len(alphabet).
The first two methods can lengthen the resulting string, as they require shipping a translation table. Also, since your strings appear to be drawn from a uniform random distribution, there will be no redundancy or commonality to leverage: when the Shannon entropy is at or near its maximum over the input tokens, there is nothing to be gained from those methods.
This leaves us base conversion. We're using 8 bits -- 256 combinations -- to represent an alphabet of only 82 characters. A simple base conversion will save about 20%; the ratio is log(82) / log(256). If you want a cheap conversion, simply map into a 7-bit representation, a saving of 12.5%.
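For example, checking that ratio (a quick sketch, standard library only):
import math
print(math.log(82) / math.log(256))   # ~0.795, i.e. roughly a 20% saving over 8-bit characters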
Very simply, define a symbol ordinality on your character set, such as
0123456789ABCDEFGH...YZabcd...yz:/?#[]()#!$%&'*+,;=% (81 chars)
Now, compute the numerical equivalent of a given string, just as if you were hand-coding a conversion from a decimal or hex string. The resulting large integer is the compressed value. Write it out in bytes, or chop it into 32-bit integers, or whatever fits your intermediate storage medium.
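A minimal sketch of that conversion (the alphabet below and the helper names pack/unpack are illustrative assumptions, not part of the answer); the decoder needs the original length, because a string starting with the alphabet's first symbol would otherwise lose those leading "digits":
# Assumed alphabet: the 62 alphanumerics plus some URL special characters -- adjust to your data.
ALPHABET = ('0123456789'
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            'abcdefghijklmnopqrstuvwxyz'
            ":/?#[]@!$&'()*+,;=")
BASE = len(ALPHABET)  # 80

def pack(s):
    # Interpret the string as one big number in base len(ALPHABET)...
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    # ...then write that number out as raw bytes: 8 bits per byte instead of ~6.3.
    return n.to_bytes((n.bit_length() + 7) // 8, 'big')

def unpack(data, length):
    n = int.from_bytes(data, 'big')
    chars = []
    for _ in range(length):
        n, r = divmod(n, BASE)
        chars.append(ALPHABET[r])
    return ''.join(reversed(chars))

assert unpack(pack('Abc123'), 6) == 'Abc123'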

Related

What does hexdigest do in Python?

We have code like this for hashing:
from hashlib import sha256
Hash = sha256(b"hello").hexdigest()
#Hash = '2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
hexdigest seems to be doing the important part, because without it we get the following result instead:
Hash = sha256(b"hello")
#Hash = <sha256 HASH object @ 0x000001E92939B950>
Using hexdigest is apparently mandatory, since without it we get a different kind of output entirely -- but what does it actually do?
The actual digest is a really big number. It is conventionally represented as a sequence of hex digits, since we humans aren't very good at dealing with numbers of more than a handful of digits. Hex also has the advantage that it reveals some kinds of binary patterns really well; for example, you'd be hard pressed to reason about a number like 4,262,789,120, whereas its hex representation FE150000 readily reveals that the low 16 bits are all zeros.
But the object is more than just a number; it's a class instance with methods which allow you, for example, to add more data in chunks, so that you can calculate the digest of a large file or a stream of data successively, without keeping all of it in memory. You can think of the digest object as a collection of state which permits this operation to be repeated many times, and of the hexdigest method as a way to query that state at the current point in the input stream.
You could argue that the interface could be different - for example, str(obj) could produce the hex representation; but this only pushes the problem to a different, and arguably more obscure, corner.
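To make that concrete, here is a small sketch using only the standard hashlib module: data can be fed in chunks, and digest()/hexdigest() are just two views of the same 256-bit state.
import hashlib

h = hashlib.sha256()
h.update(b"hel")      # feed the data in pieces...
h.update(b"lo")       # ...the object keeps its internal state between calls
print(h.hexdigest())  # '2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
print(h.digest())     # the same 256-bit value as 32 raw bytes
print(int.from_bytes(h.digest(), 'big'))  # ...or as one very large integer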

Compressing/encoding strings with limited characters in Python

I've been trying to find a way to encode limited-character strings in order to compress the data, as well as to derive a unique 'ID' for each string.
I have several million strings, each around 280~300 characters long but limited to only four letters (A, T, C and G). It seems they should be easy to encode in 'base four' and stored in less memory, but I don't know the easiest way to do that. I've considered using for loops in Python, where I'd iterate over each string, look up the value for each letter in a dictionary, and multiply it by the corresponding base-four place value. Example:
base_dict = {
    'A' : 0,
    'T' : 1,
    'C' : 2,
    'G' : 3
}  # These are the four bases of DNA, each assigned a different numeric value

strings_list = [
    'ATCG',
    'TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGGACAGCAAGTCTGATATGAAAGGCGGGGGCTCAACCCCCGGACTGCATTGGAAACTGCTGGCCTGGAGTACCGGAGG',
    'GGGGGGGGGG'
]  # A few sample DNA sequences

for string in strings_list:
    encoded_number = 0
    for i in range(len(string)):
        letter = string[i]
        encoded_number += (4**i) * base_dict[letter]
    print('String {} = {}'.format(string, encoded_number))
It seemed to work well, encoding each string as an integer. The problem is that I could not get encoded_number into an actual binary form. The best I could do was:
binary = '{0:b}'.format(encoded_number)
But though that gives me the binary value, it does so as a string. Trying to convert the number itself to bytes always yields an error because of the huge size of the integer when using the actual 280+ character strings; the long string above, for example, results in the huge integer 230124923583823837719192000765784020788478094239354720336304458517780079994251890530919145486338353514167796587078005476564902583371606379793061574009099280577109729494013:
bytes(encoded_number) # trying to turn the encoded number into bytes
OverflowError: cannot fit 'int' into an index-sized integer
I'd like to know if this is the most efficient way to encode limited-character strings like these, or if there's a better way, and also whether there are other compression techniques I could apply on top of the limited-character encoding, while still being able to reverse the final number/binary back into my string. Also, is there any way I can actually store it in a binary format, instead of as an integer or a string? Does doing so help conserve memory?
Also, what would be the most concise way of reducing the integer/binary to a human-readable value (a new, shorter string)? Using integers or binary seems to conserve memory, so I'd be able to store these strings in less space (and also transfer the data faster), but if I want to create concise user-readable strings, what would be the best option? Is there any way I could encode back into a string while making use of the whole ASCII table, so as to use far fewer characters?
It would be very useful to be able to reduce my 300-character strings to shorter, 86-character strings (considering the ASCII table has 128 characters available, and 4^300 ~= 128^86).
I'm trying to do this in Python, as it's the language I'm most familiar with, and also what my code is in already.
TL;DR, summarizing the several questions I'm having trouble with:
1. What is the most efficient way to encode limited-character strings? (There's an example in the code above; is that the best way?)
2. Are there any other ways to compress strings that could be used alongside the encoding of limited characters, to further compress the data?
3. Can large integers (4^300) be converted into binary without resulting in an overflow? How?
4. What's the most efficient way to convert binary values, numbers or limited-character strings (it's basically the same in this situation, as I'm trying to convert one into the other) into small, concise strings (user-readable, so the smaller, the better)?
The conversion you're making is the obvious one: since 4 is a power of 2, the conversion to binary is as compact as you can get for uniformly-distributed sequences. You need only to represent each letter with its 2-bit sequence, and you're done with the conversion.
Your problem seems to be in storing the result. The simplest change is to convert the resulting integer to bytes properly (e.g. with int.to_bytes).
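For example, a minimal sketch of that idea (the helper names encode_dna/decode_dna are made up here, not part of the answer): pack each letter into 2 bits, and keep the original length around so runs of 'A' (value 0) at either end survive the round trip.
BASES = {'A': 0, 'T': 1, 'C': 2, 'G': 3}
LETTERS = 'ATCG'

def encode_dna(seq):
    # Pack each letter into 2 bits, first letter in the most significant position.
    n = 0
    for letter in seq:
        n = (n << 2) | BASES[letter]
    # Four letters fit in each byte, so ~300 letters become ~75 bytes.
    return n.to_bytes((len(seq) + 3) // 4, 'big')

def decode_dna(data, length):
    n = int.from_bytes(data, 'big')
    out = []
    for _ in range(length):
        out.append(LETTERS[n & 3])
        n >>= 2
    return ''.join(reversed(out))

packed = encode_dna('ATCG')            # b'\x1b' -- one byte instead of four characters
assert decode_dna(packed, 4) == 'ATCG'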
Another version of this is to break the string into 8-letter chunks, turning each into a 32-bit integer; then write out the sequence of integers (in binary).
Another is to forget the entire conversion; feed the string to your system's compression algorithm, which will take advantage of frequently repeated substrings of bases.
N.B. your conversion will lose leading zeros -- a string such as "AAAAGCTGA" would reconstitute as "GCTGA". You'll need to include the expected string length.
For doing the simple chunk-convert method, refer to the link I provided.
For compression methods, research compression (which we presume you've done before posting here, per the posting guidelines). On Linux, use the file compression provided with the OS (likely gzip).
Another possibility: group the letters into triples (4^3 = 64 possible triples). If at least two of those triples never appear in your data, you can map each remaining triple to one of the 62 alphanumeric characters (base62 -- do a browser search for documentation), which encodes the data in a text-readable form.

In Python, represent a SHA-256 hash using the fewest characters possible

What is the most dense way (fewest characters) that I can store a complete SHA-256 hash?
Calling .digest() on a hashlib.sha256 object will return a 32-byte string -- the shortest possible way (with 8-bit bytes as the relevant unit) to store 256 bits of data which is effectively random for compressibility purposes.
Since 8 * 32 == 256, this provably has no wastage -- every bit is used.
Charles' answer is absolutely correct. However, I'm assuming that you aren't satisfied with the shortest binary encoding of the SHA-256 hash -- the 32-octet string -- and want something printable and somewhat human-readable.
Note: this does not apply to barcodes, however. QR codes, at least, can encode binary data, so just use the digest() method of your hash -- that is the most efficient encoding you can use there. Your QR code generation library should most likely support generating codes for "raw" binary strings - check your library docs and find the correct method/invocation.
SHA hashes (and other hashes) don't produce or operate on characters, they work with binary data. SHA-256 produces 256 bits of data, commonly represented with 32 bytes. In particular, in Python 3 you should notice that hashlib.sha256("...").digest() returns bytes and not str.
There is a convenience method hexdigest that produces a hexadecimal (base16) string representing those bytes. You can use base32, base58, base64, baseEmoji or any other encoding that fits your requirements.
Basically, your problem is actually "I have a number and want a short encoding of it". Decide on how many distinct characters you can use (the encoding base) and use that. There are many libraries on PyPI that could help; python-baseconv may come in handy.
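As a rough comparison (a sketch using only the standard library), here is how long the same 32-byte digest comes out under a few common encodings:
import base64, hashlib

digest = hashlib.sha256(b"hello").digest()
print(len(digest))                    # 32 raw bytes -- the densest form
print(len(digest.hex()))              # 64 characters in hex (base16)
print(len(base64.b32encode(digest)))  # 56 characters in base32 (with padding)
print(len(base64.b64encode(digest)))  # 44 characters in base64 (with padding)
print(len(base64.b85encode(digest)))  # 40 characters in base85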

String Compression: Output Alphabet Restricted to Alphanumeric Characters

I have a long string and I would like to compress it to a new string with the restriction that the output alphabet only contains [a-z] [A-Z] and [0-9] characters.
How can I do this, specifically in Python?
While many compression algorithms can in principle emit to an arbitrary output alphabet, most implementations can't, and many algorithms are much less efficient if the output alphabet's size isn't a power of 2 (or 16, or 256).
So, you want to split this into two parts: First compress one byte stream to another. Then encode the output byte stream into alphanumeric characters. (If you're starting with something that isn't a byte stream, like a Python 3 string or a Python 2 unicode, then there's a zeroth step of encoding it into a byte stream.)
For example, if you wanted base64, you could do this:
import base64, zlib
compressed_bytes = zlib.compress(plain_bytes)
compressed_text = base64.b64encode(compressed_bytes)
Unfortunately, you don't want base-64, because that includes a few non-alphanumeric characters.
You can use base32, which uses just the capital letters and six digits, and the only change to your code is b32encode instead of b64encode. But that's a bit wasteful, because it only uses 5 out of every 8 bits, when you could in theory use ~5.95 of each 8 bits.
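That is, under the same assumptions as the base64 snippet above:
compressed_text = base64.b32encode(compressed_bytes)  # output uses only A-Z and the digits 2-7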
If you want to do this optimally, and you can't bend the requirement for alphanumeric characters only, base62 is very complicated, because you can't do it byte by byte, but only in chunks of 7936 bytes at a time. That's not going to be fun, or efficient. You can get reasonably close to optimal by chunking, say, 32 bytes at a time and wasting the leftover bits. But you might be better off using base64 plus an escaping mechanism to handle the two characters that don't fit into your scheme. For example:
def b62encode(plain):
    b64 = base64.b64encode(plain)
    # Escape '+' and '/' (and '0' itself, which serves as the escape marker).
    return b64.replace(b'0', b'00').replace(b'+', b'01').replace(b'/', b'02')

def b62decode(data):
    b64 = b'0'.join(part.replace(b'01', b'+').replace(b'02', b'/')
                    for part in data.split(b'00'))
    return base64.b64decode(b64)
For comparison, here's how much each algorithm expands your binary data:
base32: 60.0%
fake base62: 39.2%
realistic base62: ~38%
optimal base62: 34.4%
base64: 33%
The point of partial-byte transfer encodings like base64 is that they're dead-simple and run fast. While you can extend it to partial-bit encodings like base62, you lose all of the advantages… so if the fake base62 isn't good enough, I'd suggest using something completely different instead.
To decode, apply the same steps in reverse order.
Putting it all together, using the fake base62, and using unicode/Python 3 strings:
plain_bytes = plain_text.encode('utf-8')
compressed_bytes = zlib.compress(plain_bytes)
b62_bytes = b62encode(compressed_bytes)
b62_text = b62_bytes.decode('ascii')
b62_bytes = b62_text.encode('ascii')
compressed_bytes = b62decode(b62_bytes)
plain_bytes = zlib.decompress(compressed_bytes)
plain_text = plain_bytes.decode('utf-8')
And that's about as complicated as it can get.
There is a much simpler encoding scheme than base 62 or modifications of base 64 for limiting the output to 62 values. Take your input as a stream of bits (which in fact it is), and then encode either five or six bits as each output character. If the five bits are 00000 or 00001, then encode them as the first two characters from your set of 62. Otherwise, take one more bit, giving you one of 60 possible six-bit values (the 64 combinations minus the four that begin with 00000 or 00001). Use your remaining 60 characters for those. Continue with the remaining bits, and pad with zero bits at the end to fill out your last five or six bits.
Decoding is even simpler. You just emit five or six bits for each character received. You throw away any extra bits at the end that don't make up a full byte.
The expansion resulting from this scheme is 35%, close to the theoretical optimal of 34.36%.
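A possible sketch of that scheme (my own reading of the answer, assuming a digits-then-letters alphabet and most-significant-bit-first order; the function names are made up):
import string

ALPHABET62 = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 symbols

def encode62(data):
    bits = ''.join(format(b, '08b') for b in data)   # the input bytes as a bit stream
    out, i = [], 0
    while i < len(bits):
        five = bits[i:i+5].ljust(5, '0')             # pad the final group with zero bits
        if int(five, 2) < 2:                         # 00000 or 00001: emit one of 2 chars
            out.append(ALPHABET62[int(five, 2)])
            i += 5
        else:                                        # take one more bit: 60 remaining values
            six = bits[i:i+6].ljust(6, '0')
            out.append(ALPHABET62[int(six, 2) - 2])  # six-bit values 4..63 map to chars 2..61
            i += 6
    return ''.join(out)

def decode62(text):
    bits = []
    for ch in text:
        idx = ALPHABET62.index(ch)
        bits.append(format(idx, '05b') if idx < 2 else format(idx + 2, '06b'))
    bitstr = ''.join(bits)
    usable = len(bitstr) // 8 * 8                    # drop pad bits short of a full byte
    return bytes(int(bitstr[j:j+8], 2) for j in range(0, usable, 8))

assert decode62(encode62(b'any bytes at all')) == b'any bytes at all'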

python compression function which returns 32 digit string?

I am using the md5 function to hash a string into a 32 digit string.
import hashlib

str_to_encode = 'this is a test string which I want to encode'
encoded = hashlib.md5(str_to_encode.encode('utf-8')).hexdigest()
I want to be able to decode this string (i.e. encoded in the example above) back to its original value. I don't think this is possible using md5 (but if it is, please let me know), but is there a compression function I can use which will give me a 32-digit string at the end, yet which can be reversed?
EDIT:
The string being encoded is a URL, so it will only be a couple of hundred characters max, although in most cases it will be a lot less.
Thanks
It seems to me that you aren't looking for a hash or encryption; you are looking for compression. Try zlib plus base64 encoding:
import zlib, base64

s = b'Hello, world'
encoded = base64.b64encode(zlib.compress(s))
The length of the encoded data will grow as the input grows, but it may work for you.
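A quick sanity check of that approach in Python 3 (the example URL here is made up): the result is reversible, unlike a hash, but for short inputs the token can actually come out longer than the original, because the zlib header and the base64 expansion cost more than the compression saves.
import base64, zlib

url = b'https://example.com/some/path?with=a&few=query&parameters=here'
token = base64.urlsafe_b64encode(zlib.compress(url)).decode('ascii')
print(len(url), len(token))   # for a short input like this, the token is longer than the original
assert zlib.decompress(base64.urlsafe_b64decode(token)) == url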
Even restricting yourself to URLs, there's no way to reversibly map them to 32-character strings; there are simply too many possible URLs.
You seem to want two things that can't coexist:
Any string of any length is converted to exactly 32 bytes, even if it started as 4 GB.
The encoded string is decodable without loss of information.
There are only so many bits in an MD5 hash, so by the pigeonhole principle it's impossible to reverse it: if it were reversible, you could use a hash to compress information infinitely. Furthermore, irreversibility is the whole point of a hash; hashes are intended to be one-way functions. Encryption algorithms are reversible, but the ciphertext must be at least as large as the plaintext, since decodability means the mapping has to be collision-free (two plaintexts can't encrypt to the same ciphertext, or the decryption function wouldn't know which plaintext to output for that ciphertext).
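A quick counting sketch of that pigeonhole argument (the 64-character URL alphabet is just an assumption for illustration):
# A 32-character hex string can distinguish at most 16**32 values...
print(16**32)             # 340282366920938463463374607431768211456
# ...but there are far more possible 100-character URLs than that,
# so some of them must collide under any 32-character encoding.
print(64**100 > 16**32)   # True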
