String Compression: Output Alphabet Restricted to Alphanumeric Characters

I have a long string and I would like to compress it to a new string with the restriction that the output alphabet only contains [a-z] [A-Z] and [0-9] characters.
How can I do this, specifically in Python?

While many encoding algorithms can in principle use an arbitrary output alphabet, most implementations can't, and many algorithms are much less efficient when the alphabet size isn't a power of 2 (16, 256, etc.).
So, you want to split this into two parts: first compress one byte stream to another, then encode the output byte stream into alphanumeric characters. (If you're starting with something that isn't a byte stream, like a Python 3 str or a Python 2 unicode, then there's a zeroth step of encoding it into a byte stream.)
For example, if you wanted base64, you could do this:
import base64, zlib
compressed_bytes = zlib.compress(plain_bytes)
compressed_text = base64.b64encode(compressed_bytes)
Unfortunately, you don't want base64, because its output includes a few non-alphanumeric characters: '+', '/', and the '=' padding.
You can use base32, which has just the capital letters and six digits (2-7), and the only change to your code is b32encode instead of b64encode. But that's a bit wasteful, because it's only using 5 out of every 8 bits, when you could in theory use ~5.954 of each 8 bits (log2 62).
If you want to do this optimally, and you can't bend the requirement for alphanumeric characters only, base62 is very complicated, because you can't do it byte by byte, but only in chunks of 7936 bytes at a time. That's not going to be fun, or efficient. You can get reasonably close to optimal by chunking, say, 32 bytes at a time and wasting the leftover bits. But you might be better off using base64 plus an escaping mechanism to handle the characters that don't fit into your scheme. For example:
def b62encode(plain):
    # Escape the base64 output characters that aren't alphanumeric.
    # '0' doubles as the escape character, so it's escaped (doubled) first.
    b64 = base64.b64encode(plain)
    return (b64.replace(b'0', b'00').replace(b'+', b'01')
               .replace(b'/', b'02').replace(b'=', b'03'))

def b62decode(data):
    b64 = b'0'.join(part.replace(b'01', b'+').replace(b'02', b'/')
                        .replace(b'03', b'=')
                    for part in data.split(b'00'))
    return base64.b64decode(b64)
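A quick round-trip check of the escaped encoding (using random bytes as stand-in data):
import os

blob = os.urandom(100)
encoded = b62encode(blob)
assert all(chr(c).isalnum() for c in encoded)   # output is alphanumeric only
assert b62decode(encoded) == blob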
For comparison, here's how much each algorithm expands your binary data:
base32: 60.0%
fake base62: 39.2%
realistic base62: ~38%
optimal base62: 34.4%
base64: 33%
The point of partial-byte transfer encodings like base64 is that they're dead-simple and run fast. While you can extend it to partial-bit encodings like base62, you lose all of the advantages… so if the fake base62 isn't good enough, I'd suggest using something completely different instead.
To reverse this, apply the inverse of each step, in reverse order.
Putting it all together, using the fake base62, and using Unicode/Python 3 strings:
# Sender: text -> bytes -> compressed bytes -> "base62" bytes -> text
plain_bytes = plain_text.encode('utf-8')
compressed_bytes = zlib.compress(plain_bytes)
b62_bytes = b62encode(compressed_bytes)
b62_text = b62_bytes.decode('ascii')

# Receiver: undo each step in reverse order
b62_bytes = b62_text.encode('ascii')
compressed_bytes = b62decode(b62_bytes)
plain_bytes = zlib.decompress(compressed_bytes)
plain_text = plain_bytes.decode('utf-8')
And that's about as complicated as it can get.

There is a much simpler encoding scheme than base 62 or modifications of base 64 for limiting the output to 62 values. Take your input as a stream of bits (which in fact it is), and then encode either five or six bits as each output character. If the five bits are 00000 or 00001, encode them as the first two characters of your 62-character set. Otherwise, take one more bit, giving you 60 possible six-bit values, and use your remaining 60 characters for those. Continue with the remaining bits, padding with zero bits at the end to fill out the last five or six bits.
Decoding is even simpler. You just emit five or six bits for each character received. You throw away any extra bits at the end that don't make up a full byte.
The expansion resulting from this scheme is 35%, close to the theoretical optimal of 34.36%.
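Here is a minimal sketch of that scheme (a string-of-bits implementation, chosen for clarity over speed; the particular 62-character ordering is my assumption):
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 symbols

def encode62(data):
    bits = ''.join(format(byte, '08b') for byte in data)
    out, i = [], 0
    while i < len(bits):
        five = bits[i:i + 5].ljust(5, '0')         # pad the tail with zero bits
        if int(five, 2) < 2:                       # 00000 or 00001: 5-bit symbol
            out.append(ALPHABET[int(five, 2)])
            i += 5
        else:                                      # take one more bit: 60 values
            six = bits[i:i + 6].ljust(6, '0')
            out.append(ALPHABET[int(six, 2) - 2])  # 000100..111111 -> symbols 2..61
            i += 6
    return ''.join(out)

def decode62(text):
    bits = []
    for ch in text:
        v = ALPHABET.index(ch)
        bits.append(format(v, '05b') if v < 2 else format(v + 2, '06b'))
    bits = ''.join(bits)
    usable = len(bits) - len(bits) % 8             # drop pad bits at the end
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

assert decode62(encode62(b'hello world')) == b'hello world'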

When to raise UnicodeTranslateError?

The standard library documentation says:
exception UnicodeTranslateError
Raised when a Unicode-related error occurs during translating.
But translation is never defined. Grepping through the CPython source, I can't see any examples of this class being raised as an error by anything. What is this exception used for, and what's the difference between it and UnicodeDecodeError, which seems to be used much more frequently?
Unicode has room for more than 1 million code points. (At the moment "only" about 150,000 of them are assigned to characters, but in theory more than 1M can be used.) The highest code point, written as a binary number, has 21 binary digits, which means you need 21 bits, or at least 3 bytes, to encode all code points.
But the most-used characters have Unicode code points that need fewer than 10 bits, many even fewer than 8 bits. So a 3-byte encoding would waste a lot of space when you use it to encode texts that contain mainly characters with low code points.
On the other hand, a 3-byte encoding has disadvantages in processing, because modern CPUs prefer chunks of 2, 4 or 8 bytes.
And so there are different encodings for Unicode strings:
UTF-32 uses 32-bit fields to encode Unicode characters. This encoding is very fast to process, but wastes a lot of space in memory.
UCS-4 is just another name for UTF-32. The number 4 means: exactly 4 bytes (which are 32 bits).
UCS-2 uses 2-byte and therefore 16-bit fields. You need only half of the memory, but not all existing Unicode code points can be encoded in UCS-2.
UTF-16 also uses 16-bit fields, but here two of these fields can be combined (a surrogate pair) to encode one character. So UTF-16 can be used to encode all possible Unicode code points.
UTF-8 uses 1-byte fields. So in theory you need between 1 and 3 bytes to encode every code point, but you must also add the information whether a byte is the start byte of a code point, and how many bytes long the code point is. And if you add these control bits to the 21 data bits, you get more than 24 bits in total, which means: you need up to 4 bytes to encode every possible Unicode character.
There are even more encodings for Unicode: UTF-1, UTF-7, CESU-8, GB 18030 and many more.
And the fact that there are many different encodings makes it necessary to translate from one encoding to another in some situations. When you want to translate, for example, UTF-8 to UCS-2, you will get into trouble if the original text contains characters with code points outside the range that UCS-2 can encode. And in this case you should raise a UnicodeTranslateError.
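For example, a hypothetical translation routine into UCS-2 might look like this (a sketch; nothing in the stdlib ships this function, but the constructor signature really is UnicodeTranslateError(object, start, end, reason)):
def translate_to_ucs2(text):
    # UCS-2 can only represent code points up to U+FFFF.
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise UnicodeTranslateError(text, i, i + 1,
                                        'code point not representable in UCS-2')
    return text.encode('utf-16-le')   # safe now: no surrogate pairs needed

translate_to_ucs2('hello')            # works fine
try:
    translate_to_ucs2('\U0001F600')   # an emoji, code point U+1F600
except UnicodeTranslateError as exc:
    print(exc)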

Alignment/Packing in Python Struct.Unpack

I have a piece of hardware sending data at a fixed length: 2 bytes, 1 byte, 4 bytes, 4 bytes, 2 bytes, 4 bytes, for a total of 17 bytes. If I change my format to 18 bytes the code works, but the values are incorrect.
import struct

format = '<2s1s4s4s2s4s'
print(struct.calcsize(format))
print(len(hardware_data))
splitdata = struct.unpack(format, hardware_data)
The output is 17, 18, and then an error because of the mismatch. I think this is caused by alignment, but I'm unsure, and nothing I've tried has fixed it. Below are a couple of typical strings; when I print(hardware_data) I notice the 'R' and '\n' characters, but I'm unsure how to handle them.
b'\x18\x06\x00R\x1f\x01\x00\x00\x00\x00\x00\xd8\xff\x00\x00\x00\x00\x80'
b'\x18\x06\x00R\x1f\x01\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x80'
Odds are whatever is sending the data is padding it in some way you're not expecting.
For example, if the first four-byte field is supposed to represent an int, C struct padding rules would require a padding byte after the one-byte field (to align the next four-byte field to four-byte alignment). So just add the padding byte explicitly, changing your format string to:
format = '<2s1sx4s4s2s4s'
The x in there says "I expect a byte here, but it's padding, don't unpack it to anything." It's possible the pad byte belongs elsewhere (I have no idea what your hardware is doing); I notice the third byte is the NUL (\0) byte in both examples, but the spot I assumed would be padding is 'R', so it's possible you want:
format = '<2sx1s4s4s2s4s'
instead. Or it could be somewhere else (without knowing which of the fields is a char array in the hardware struct, and which are larger types with alignment requirements, it's impossible to say). Point is, your hardware is sending 18 bytes; figure out which one is garbage, and put the x pad byte at the appropriate location.
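For instance, with your second sample string and the first candidate format (treating the 'R' byte as the padding, which is an assumption about your hardware):
import struct

data = b'\x18\x06\x00R\x1f\x01\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x80'
fmt = '<2s1sx4s4s2s4s'             # 17 data bytes + 1 pad byte = 18
print(struct.calcsize(fmt))        # 18, matching the data length
print(struct.unpack(fmt, data))    # the padding byte (the 'R') is skipped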
Side-note: The repr of bytes objects will use ASCII or simpler ASCII escapes when available. That's why you see an R and a \n in your output; b'R' and b'\x52' are equivalent literals, as are b'\n' and b'\x0a' and Python chooses to use the "more readable" version (when the bytes is actually just ASCII, this is much more readable).

Compressing/encoding strings with limited characters in Python

I've been trying to find a way to encode limited character strings in order to compress data, as well as to find unique 'IDs' for each string.
I have several million strings, each around 280-300 characters long, but limited to only four letters (A, T, C and G). I've wondered if there isn't an easier way to encode those using less memory, considering that they should be easily encoded using a 'base four', but I don't know the easiest way to do that. I've considered using for loops in Python, where I'd iterate over each string, find the correct value for each letter using a dictionary, and multiply it by the corresponding power of four. Example:
base_dict = {
    'A': 0,
    'T': 1,
    'C': 2,
    'G': 3
}  # These are the four bases of DNA, each assigned a different numeric value

strings_list = [
    'ATCG',
    'TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGGACAGCAAGTCTGATATGAAAGGCGGGGGCTCAACCCCCGGACTGCATTGGAAACTGCTGGCCTGGAGTACCGGAGG',
    'GGGGGGGGGG'
]  # A few sample DNA sequences

for string in strings_list:
    encoded_number = 0
    for i in range(len(string)):
        letter = string[i]
        encoded_number += (4**i) * base_dict[letter]
    print('String {} = {}'.format(string, encoded_number))
It seemed to work well, encoding each string into a single number. The problem is that I could not get the encoded_number turned into binary. The best I could do was to use this:
binary = '{0:b}'.format(encoded_number)
But though it returned me the binary value, it did so as a string. Trying to convert the number to bytes always yields an error because of the huge size of the integer (when using the actual 280+ character strings), as the long string above results in a huge integer (230124923583823837719192000765784020788478094239354720336304458517780079994251890530919145486338353514167796587078005476564902583371606379793061574009099280577109729494013):
bytes(encoded_number) # trying to turn the encoded number into bytes
OverflowError: cannot fit 'int' into an index-sized integer
I'd like to know if this is the most efficient way to encode limited-character strings like this, or if there's some better way, and also whether there are other ways I could compress this data even more, while still being able to reverse the final number/binary back into my string. Also, is there any way I can actually convert it to binary format, instead of an integer or a string? Does doing so help conserve data?
Also, what would be the most concise way of reducing the integer/binary to a human-readable value (a new, shorter string)? Using integers or binaries seems to conserve data, and I'd be able to store these strings using less memory (and also transfer the data faster), but if I want to create concise user-readable strings, what would be the best option? Is there any way I could encode back into a string, but making use of the whole ASCII table, so as to use a lot fewer characters?
It would be very useful to be able to reduce my 300-character strings into smaller, 86-character strings (considering the ASCII table has 128 characters available, and 4^300 ~= 128^86).
I'm trying to do this in Python, as it's the language I'm most familiar with, and also what my code is in already.
TL;DR, summarizing the several questions I'm having trouble with:
1. What is the most efficient way to encode limited-character strings? (There's an example in the code above; is that the best way?)
2. Are there other ways to compress strings that could be used alongside the limited-character encoding, to further compress the data?
3. Can large integers (4^300) be converted into binary without resulting in an Overflow? How?
4. What's the most efficient way to convert binary values, numbers or limited-character strings (it's basically the same in this situation, as I'm trying to convert one into the other) into small, concise strings (user-readable, so the smaller, the better)?
The conversion you're making is the obvious one: since 4 is a power of 2, the conversion to binary is as compact as you can get for uniformly-distributed sequences. You need only to represent each letter with its 2-bit sequence, and you're done with the conversion.
Your problem seems to be in storing the result. The shortest change is likely to upgrade your code using bytes properly.
Another version of this is to break the string into 16-letter chunks, turning each into a 32-bit integer; then write out the sequence of integers (in binary).
Another is to forget the entire conversion; feed the string to your system's compression algorithm, which will take advantage of frequently repeated bases and subsequences.
N.B. your conversion will lose runs of 'A' at one end of the string, since 'A' maps to the digit 0 and an integer doesn't store its leading zero digits; a string such as "AAAAGCTGA" would not round-trip exactly. You'll need to include the expected string length alongside the number.
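As a sketch of that conversion (2 bits per letter, packed with int.to_bytes; the helper names are mine, and the sequence length is carried explicitly to avoid exactly that problem):
BASE_BITS = {'A': 0, 'T': 1, 'C': 2, 'G': 3}
BASES = 'ATCG'

def seq_to_bytes(seq):
    n = 0
    for letter in seq:                  # first letter becomes the most significant digit
        n = (n << 2) | BASE_BITS[letter]
    # int.to_bytes sidesteps the OverflowError from bytes(encoded_number)
    return n.to_bytes((2 * len(seq) + 7) // 8, 'big')

def bytes_to_seq(data, length):         # length must be stored alongside the bytes
    n = int.from_bytes(data, 'big')
    letters = []
    for _ in range(length):
        letters.append(BASES[n & 3])
        n >>= 2
    return ''.join(reversed(letters))

packed = seq_to_bytes('ATCG')           # 4 letters fit in a single byte
assert bytes_to_seq(packed, 4) == 'ATCG'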
For doing the simple chunk-convert method, refer to the link I provided.
For compression methods, research compression (which we presume you've done before posting here, per the posting guidelines). On Linux, use the file compression provided with the OS (likely gzip).
Another possibility: if at least two of the 64 possible three-letter combinations never appear in your data, you can map each of the remaining triples to one of the 62 alphanumeric characters, i.e. base62 (do a browser search for documentation) -- this uses the full range of alphanumeric characters to encode in text-readable form.

Using python, how can I compress a long query string value?

So I am generating a URL in Python for a GET request (it has to be a GET request),
and one of my query string parameters is EXTREMELY long (~900 chars). Is there any way I can compress this string and place it in the URL? I have tried zlib, but that uses bytes and the URL needs to be a string. Basically, is there any way to do this?
# On server
x = '900_char_string'
compressed_string = compress(x)
return 'http://whatever?querystring_var=' + compressed_string
# ^ return value is what client requests by clicking link with that url or whatever
# On client
# GET http://whatever?querystring_var=randomcompressedchars<900
# Server receiving request
value = request['querystring_var']
y = decompress(value)
print(y)  # 900_char_string; at this point the server can work with the uncompressed string
The issue is now fairly clear. I think we need to examine this from a standpoint of information theory.
The input is a string of visible characters, currently represented in 8 bits each.
The "alphabet" for this string is alphanumeric (26+26+10 symbols), plus about 20 special and reserved characters, 80+ characters total.
There is no apparent redundancy in the generated string.
There are three main avenues to shortening a representation, taking advantage of
Frequency of characters (Huffman coding): replace a frequent character with fewer than 8 bits; longer bit strings will then be needed for rare characters.
Frequency of substrings (compression): replace a frequent substring with a single character.
Convert to a different base: ideally, len(alphabet).
The first two methods can lengthen the resulting string, as they require starting with a translation table. Also, since your strings appear to be taken from a uniform random distribution, there will be no redundancy or commonality to leverage. When the Shannon entropy is at or near the maximum over the input tokens, there is nothing to be gained in those methods.
This leaves us base conversion. We're using 8 bits -- 256 combinations -- to represent an alphabet of only 82 characters. A simple base conversion will save about 20%; the ratio is log(82) / log(256). If you want a cheap conversion, simply map into a 7-bit representation, a saving of 12.5%
Very simply, define a symbol ordinality on your character set, such as
0123456789ABCDEFGH...YZabcd...yz:/?#[]()@!$%&'*+,;= (81 chars)
Now, compute the numerical equivalent of a given string, just as if you were hand-coding a conversion from a decimal or hex string. The resulting large integer is the compressed value. Write it out in bytes, or chop it into 32-bit integers, or whatever fits your intermediate storage medium.
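A minimal sketch of that conversion (with the ellipses in the alphabet expanded; the helper names are mine):
import string

# The 81-symbol ordinality: digits, uppercase, lowercase, then the
# reserved/special characters from the ordering above.
ALPHABET = (string.digits + string.ascii_uppercase + string.ascii_lowercase
            + ":/?#[]()@!$%&'*+,;=")

def string_to_int(s):
    n = 0
    for ch in s:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return n

def int_to_string(n, length):
    # The length must be carried separately, or leading zero-symbols vanish.
    out = []
    for _ in range(length):
        n, digit = divmod(n, len(ALPHABET))
        out.append(ALPHABET[digit])
    return ''.join(reversed(out))

q = 'q=some/very:long?value'
value = string_to_int(q)
raw = value.to_bytes((value.bit_length() + 7) // 8 or 1, 'big')  # compact binary form
assert int_to_string(value, len(q)) == q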

Store arbitrary binary data on a system accepting only valid UTF8

I have arbitrary binary data. I need to store it in a system that expects valid UTF8. It will never be interpreted as text, I just need to put it in there and be able to retrieve it and reconstitute my binary data.
Obviously base64 would work, but I can't have that much inflation.
How can I easily achieve this in python 2.7?
You'll have to express your data using just ASCII characters. Using Base64 is the most efficient method (available in the Python standard library) to do this, in terms of making binary data fit in printable text that is also UTF-8 safe. Sure, it requires 33% more space to express the same data, but other methods take more additional space.
You can combine this with compression to limit how much space this is going to take, but make the compression optional (mark the data) and only actually use it if the data is going to be smaller.
import zlib
import base64

def pack_utf8_safe(data):
    is_compressed = False
    compressed = zlib.compress(data)
    if len(compressed) < (len(data) - 1):
        data = compressed
        is_compressed = True
    base64_encoded = base64.b64encode(data)
    if is_compressed:
        base64_encoded = '.' + base64_encoded
    return base64_encoded

def unpack_utf8_safe(base64_encoded):
    decompress = False
    if base64_encoded.startswith('.'):
        base64_encoded = base64_encoded[1:]
        decompress = True
    data = base64.b64decode(base64_encoded)
    if decompress:
        data = zlib.decompress(data)
    return data
The '.' character is not part of the Base64 alphabet, so I used it here to mark compressed data.
You could further shave off the 1 or 2 '=' padding characters from the end of the Base64 encoded data; these can then be re-added when decoding (add '=' * (-len(encoded) % 4) to the end), but I'm not sure that's worth the bother.
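A sketch of that optional padding removal (the helper names are mine):
def strip_padding(encoded):
    return encoded.rstrip('=')

def restore_padding(encoded):
    # Base64 output always comes in groups of 4 characters.
    return encoded + '=' * (-len(encoded) % 4)

assert restore_padding(strip_padding('aGVsbG8=')) == 'aGVsbG8='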
You can achieve further savings by switching to the Base85 encoding, a 4-to-5 ratio ASCII-safe encoding for binary data, so a 25% overhead. For Python 2.7 this is only available in an external library (Python 3.4 added it to the base64 module). You can use the python-mom project in 2.7:
from mom.codec import base85
and replace all base64.b64encode() and base64.b64decode() calls with base85.b85encode() and base85.b85decode() calls instead.
If you are 100% certain nothing along the path is going to treat your data as text (possibly altering line separators, or interpret and alter other control codes), you could also use the Base128 encoding, reducing the overhead to a 14.3% increase (8 characters for every 7 bytes). I cannot, however, recommend a pip-installable Python module for you; there is a GitHub hosted module but I have not tested it.
You can decode your bytes as 8859-1 data, which will always produce a valid Unicode string. Then you can encode it to UTF8:
utf8_data = my_bytes.decode('iso8859-1').encode('utf8')
On average, half your data will be in the 0-127 range, which is one byte in UTF8, and half your data will be in the 128-255 range, which is two bytes in UTF8, so your result will be 50% larger than your input data.
If there is any structure to your data at all, then zlib compressing it as Martijn suggests, might reduce the size.
If your application really requires you to be able to represent 256 different byte values in a graphically distinguishable form, all you actually need is 256 Unicode code points. Problem solved.
ASCII codes 33-126 are a no-brainer. Unicode code points 160-255 are also good candidates for representing themselves, but you might want to exclude a few which are hard to distinguish (if you want OCR or humans to handle them reliably, áåä etc. might be too similar). Pick the rest from the set of code points which can be represented in two bytes -- quite a large set, but again, many of them are graphically indistinguishable from other glyphs in most renderings.
This scheme does not attempt any form of compression. I imagine you'd get better results by compressing your data prior to encoding it if that's an issue.
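For instance, here is one way such a symbol table could be built (a sketch in Python 3; the particular code point choices are mine, picked per the criteria above):
# 94 printable ASCII characters, then graphic code points from U+00A1
# upward (skipping U+00AD, the invisible soft hyphen), spilling into
# Latin Extended-A; all of these fit in two bytes of UTF-8.
SYMBOLS = [chr(c) for c in range(0x21, 0x7F)]
candidate = 0xA1
while len(SYMBOLS) < 256:
    if candidate != 0xAD:
        SYMBOLS.append(chr(candidate))
    candidate += 1

TO_BYTE = {s: i for i, s in enumerate(SYMBOLS)}

def bytes_to_text(data):
    return ''.join(SYMBOLS[b] for b in data)

def text_to_bytes(text):
    return bytes(TO_BYTE[ch] for ch in text)

assert text_to_bytes(bytes_to_text(b'\x00\x7f\xff')) == b'\x00\x7f\xff'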
