Compressing/encoding strings with limited characters in Python

I've been trying to find a way to encode limited character strings in order to compress data, as well as to find unique 'IDs' for each string.
I have several million strings, each around 280~300 characters long but limited to only four letters (A, T, C and G). I've wondered if there isn't an easier way to encode those using less memory, considering that they should be easy to encode in 'base four', but I don't know the easiest way to do that. I've considered using for loops in Python, where I'd iterate over each string, look up the value for each letter in a dictionary, and multiply it by the corresponding power of four. Example:
base_dict = {
    'A': 0,
    'T': 1,
    'C': 2,
    'G': 3
}  # These are the four bases of DNA, each assigned a different numeric value

strings_list = [
    'ATCG',
    'TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGGACAGCAAGTCTGATATGAAAGGCGGGGGCTCAACCCCCGGACTGCATTGGAAACTGCTGGCCTGGAGTACCGGAGG',
    'GGGGGGGGGG'
]  # A few sample DNA sequences

for string in strings_list:
    encoded_number = 0
    for i in range(len(string)):
        letter = string[i]
        encoded_number += (4**i) * base_dict[letter]
    print('String {} = {}'.format(string, encoded_number))
It seemed to work well, encoding each string as a single integer. The problem is that I could not get the encoded_number converted into actual binary data. The best I could do was to use this:
binary = '{0:b}'.format(encoded_number)
But though it returned the binary value, it did so as a string. Trying to convert the number itself to bytes always yields an error because of the huge size of the integer (when using the actual 280+ character strings); the long string above, for example, results in this huge integer (230124923583823837719192000765784020788478094239354720336304458517780079994251890530919145486338353514167796587078005476564902583371606379793061574009099280577109729494013):
bytes(encoded_number) # trying to turn the encoded number into bytes
OverflowError: cannot fit 'int' into an index-sized integer
I'd like to know if this is the most efficient way to encode limited character strings like this, or if there's some better way, and also if there are any other methods I could use to compress this data even more, while still being able to reverse the final number/binary back into my string. Also, is there any way I can actually convert it to binary format, instead of an integer or a string? Does doing so help in conserving data?
Also, what would be the most concise way of reducing the integer/binary to a human-readable value (a new, shorter string)? Using integers or binary seems to conserve data, and I'd be able to store these strings using less memory (and also transfer the data faster), but if I want to create concise user-readable strings, what would be the best option? Is there any way I could encode back into a string, but making use of the whole ASCII table so as to use far fewer characters?
It would be very useful to be able to reduce my 300-character strings to shorter, 86-character strings (considering the ASCII table has 128 characters available, and 4^300 ~= 128^86).
I'm trying to do this in Python, as it's the language I'm most familiar with, and also what my code is in already.
TL;DR, summarizing the several questions I'm having trouble with:
1. What is the most efficient way to encode limited character strings? (There's an example in the code above -- is that the best way?)
2. Are there any other ways to compress strings that could be used alongside the encoding of limited characters, to further compress the data?
3. Can large integers (4^300) be converted into binary without resulting in an Overflow? How?
4. What's the most efficient way to convert binary values, numbers or limited character strings (it's basically the same in this situation, as I'm trying to convert one into the other) into small, concise, user-readable strings (the smaller, the better)?

The conversion you're making is the obvious one: since 4 is a power of 2, the conversion to binary is as compact as you can get for uniformly-distributed sequences. You need only to represent each letter with its 2-bit sequence, and you're done with the conversion.
Your problem seems to be in storing the result. The smallest change is probably to store the integer as actual bytes (int.to_bytes) rather than as a string of '0' and '1' characters.
Another version of this is to break the string into 16-letter chunks (16 letters x 2 bits = 32 bits), turning each into a 32-bit integer; then write out the sequence of integers (in binary).
Another is to forget the explicit conversion entirely; feed the string to your system's compression algorithm, which will take advantage of frequently repeated subsequences.
N.B. your conversion will lose zero-valued letters at one end: 'A' encodes to 0, so a string such as "AAAAGCTGA" would re-constitute as "GCTGA". You'll need to store the expected string length alongside the encoded value.
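Putting those pieces together, here is a minimal sketch of the 2-bit packing (pack_dna/unpack_dna are illustrative helper names, not from any library). int.to_bytes sidesteps the OverflowError from bytes(encoded_number), and carrying the length tells the decoder how many bases to emit, so runs of 'A' (which encode to zero bits) are not lost:
BASE_TO_BITS = {'A': 0, 'T': 1, 'C': 2, 'G': 3}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

def pack_dna(seq):
    """Pack a DNA string into (length, bytes) using 2 bits per base."""
    number = 0
    for i, letter in enumerate(seq):
        number |= BASE_TO_BITS[letter] << (2 * i)   # same as (4**i) * value
    nbytes = (2 * len(seq) + 7) // 8                # round up to whole bytes
    return len(seq), number.to_bytes(nbytes, 'little')

def unpack_dna(length, data):
    """Reverse pack_dna; the stored length says how many bases to emit."""
    number = int.from_bytes(data, 'little')
    return ''.join(BITS_TO_BASE[(number >> (2 * i)) & 0b11] for i in range(length))

length, packed = pack_dna('ATCGGGTA')
assert unpack_dna(length, packed) == 'ATCGGGTA'
print(len(packed))   # 2 bytes for 8 bases, versus 8 bytes as plain text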
For doing the simple chunk-convert method, refer to the link I provided.
For compression methods, research compression (which we presume you've done before posting here, per the posting guidelines). On Linux, use the file compression provided with the OS (likely gzip).
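From within Python, the standard zlib module uses the same DEFLATE algorithm as gzip; a quick sketch (note that this toy string is highly repetitive, so it compresses far better than near-random reads would):
import zlib

sequence = 'TGGGGAATATTGCACAATGGGG' * 13          # stand-in for a ~286-character read
raw = sequence.encode('ascii')
compressed = zlib.compress(raw, 9)

print(len(raw), len(compressed))                  # size before and after compression
assert zlib.decompress(compressed).decode('ascii') == sequence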
Another possibility: if at least two of the 64 possible letter triples never appear in your data, you can code each of the remaining triples as one character using base62 (do a browser search for documentation) -- this uses the full range of alphanumeric characters to encode in text-readable form.

Related

Is json.dumps and json.loads safe to run on a list of any string?

Is there any danger in losing information when JSON serialising/deserialising lists of text in Python?
Given a list of strings lst:
lst = ['str1', 'str2', 'str3', ...]
If I run
lst2 = json.loads(json.dumps(lst))
Will lst always be exactly the same as lst2 (i.e. will lst == lst2 always evaluate to True)? Or are there some special, unusual characters that would break either of these methods?
I'm curious because I'll be dealing with a lot of different and unusual characters from various Unicode ranges and I would like to be absolutely certain that this process is 100% robust.
Depends on what you mean by "exactly the same". We can identify three separate issues:
Semantic identity. What you read in is equivalent in meaning to what you write back out, as long as it's well-defined in the first place. Python (depending on version) might reorder dictionary keys, and will commonly prefer Unicode escapes over literal characters for some code points, and vice versa.
>>> json.loads(json.dumps("\u0050\U0001fea5\U0001f4a9"))
'P\U0001fea5💩'
Lexical identity. Nope. As shown above, the JSON representation of Unicode code points can get normalized in different ways, so that \u0050 gets turned into a literal P, and printable emoji may or may not similarly be turned into Unicode escapes, or vice versa.
(This is distinct from proper Unicode normalization, which makes sure that homoglyphs get turned into the same precise code point.)
Garbage in, same garbage out. Nope. If you have invalid input, Python will often tend to crash rather than pass it through, though you can modify some of this by catching errors and/or passing in flags to request less strict behavior.
>>> json.loads(r'"\u123"')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \uXXXX escape
>>> print(json.loads(r'"\udcff"'))
?
>>> #!? should probably crash or moan instead!
You seem to be asking about the first case, but the third can bite your behind badly if you don't make sure you have decided what to do with invalid Unicode data.
The second case would make a difference if you care about the JSON on disk being equivalent between versions; it doesn't seem to matter to you, but future visitors of this question might well care.
To some degree, yes, it should be safe. Note however that JSON is not defined in terms of byte strings, but rather in terms of Unicode text. That means that before you parse it with json.loads, you need to decode the byte string from whatever text encoding you're using. This Unicode encoding/decoding step may introduce inconsistencies.
The other implicit question is whether this process will round-trip. The answer is that it usually will, but it depends on the encoding/decoding process. Depending on the processing step, you may be normalising characters that are considered equivalent in Unicode but composed from different code points. For example, an accented character like å may be encoded as a composite -- the letter a plus a combining character for the ring -- or as the single canonical code point for that character.
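A small sketch of that point (Python 3, standard library only): json itself preserves the exact code points, so both spellings round-trip, but they are distinct strings until you normalise them yourself:
import json
import unicodedata

decomposed = 'a\u030a'    # 'a' followed by a combining ring above
precomposed = '\xe5'      # 'å' as a single code point

# json round-trips each spelling exactly as given...
assert json.loads(json.dumps([decomposed, precomposed])) == [decomposed, precomposed]

# ...but the two spellings are not equal until normalised.
assert decomposed != precomposed
assert unicodedata.normalize('NFC', decomposed) == precomposed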
There's also the issue of JSON escape sequences, which look like "\u1234". Once decoded, Python doesn't preserve whether a character was originally written as a JSON escape or as a literal Unicode character, so you lose that information as well, and the JSON text may not round-trip byte-for-byte in that case.
Apart from those issues in the deep corners of Unicode nerdery regarding equivalent characters and normalisation, encoding and decoding from/to JSON itself is pretty safe.

File compression/decompression of binary representations of integer lists

Currently, I have a system that converts a list of integers to their binary representations. I calculate the number of bytes each number requires and then use the to_bytes() function to convert them to bytes, like so:
o = open(outFileName, "wb")
for n in result:
    numBytes = math.ceil(n.bit_length()/8)
    o.write(n.to_bytes(numBytes, 'little'))
o.close()
However, since the encoded numbers occupy varying numbers of bytes, how can an unpacking program/function know where each one ends? I have heard of the struct module and specifically its pack function, but with a focus on efficiency and keeping the file as small as possible, what would be the best way to approach this so that an unpacking program can retrieve the exact list of originally encoded integers?
You can't. Your encoding maps different lists of integers to the same sequence of bytes. It is then impossible to know which one was the original input.
You need a different encoding.
Take a look at using the high bit of each byte as a continuation flag (a variable-length, varint-style encoding). There are other schemes that might be better, depending on the distribution of your integers, such as Golomb coding.
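A minimal sketch of that high-bit scheme (the function names are illustrative): each byte carries 7 bits of payload, and the high bit says whether more bytes follow for the same number.
def write_varint(n, out):
    """Append non-negative n to bytearray out, 7 bits per byte."""
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # high bit clear: last byte of this number
            return

def read_varints(data):
    """Decode a bytes object produced by repeated write_varint calls."""
    result, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:           # number complete
            result.append(n)
            n, shift = 0, 0
    return result

buf = bytearray()
for n in [0, 5, 300, 2**40]:
    write_varint(n, buf)
assert read_varints(bytes(buf)) == [0, 5, 300, 2**40]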

Using python, how can I compress a long query string value?

So I am generating a URL in python for a GET request (it has to be a GET request)
and one of my query string parameters is EXTREMELY long (~900 chars). Is there any way I can compress this string and place it in the URL? I have tried zlib, but that produces bytes and the URL needs to be a string. Basically, is there any way to do this?
# On server
x = '900_char_string'
compressed_string = compress(x)
return 'http://whatever?querystring_var=' + compressed_string
# ^ return value is what client requests by clicking link with that url or whatever
# On client
# GET http://whatever?querystring_var=randomcompressedchars<900
# Server receiving request
value = request['querystring_var']
y = decompress(value)
print(y)
# prints: 900_char_string -- at this point the server can work with the uncompressed string
The issue is now fairly clear. I think we need to examine this from a standpoint of information theory.
The input is a string of visible characters, currently represented in 8 bits each.
The "alphabet" for this string is alphanumeric (26+26+10 symbols), plus about 20 special and reserved characters, 80+ characters total.
There is no apparent redundancy in the generated string.
There are three main avenues to shortening a representation, taking advantage of
Frequency of characters (Huffman coding): replace a frequent character with fewer than 8 bits; longer bit strings will then be needed for rare characters.
Frequency of substrings (compression): replace a frequent substring with a single character.
Convert to a different base: ideally, len(alphabet).
The first two methods can lengthen the resulting string, as they require starting with a translation table. Also, since your strings appear to be taken from a uniform random distribution, there will be no redundancy or commonality to leverage. When the Shannon entropy is at or near the maximum over the input tokens, there is nothing to be gained in those methods.
This leaves us with base conversion. We're using 8 bits -- 256 combinations -- to represent an alphabet of only 82 characters. A simple base conversion will save about 20%; the ratio is log(82) / log(256). If you want a cheap conversion, simply map into a 7-bit representation, for a saving of 12.5%.
Very simply, define a symbol ordinality on your character set, such as
0123456789ABCDEFGH...YZabcd...yz:/?#[]()!$%&'*+,;= (80 chars; each symbol must appear exactly once)
Now, compute the numerical equivalent of a given string, just as if you were hand-coding a conversion from a decimal or hex string. The resulting large integer is the compressed value. Write it out in bytes, or chop it into 32-bit integers, or whatever fits your intermediate storage medium.
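A rough sketch of that base conversion (the 80-symbol alphabet and helper names are illustrative; keeping the original length handles leading zero-valued symbols):
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + ":/?#[]()!$%&'*+,;="
ORD = {ch: i for i, ch in enumerate(ALPHABET)}
BASE = len(ALPHABET)                              # 80 symbols

def pack(text):
    """Treat text as one big base-80 number; return (length, bytes)."""
    number = 0
    for ch in text:
        number = number * BASE + ORD[ch]
    nbytes = max(1, (number.bit_length() + 7) // 8)
    return len(text), number.to_bytes(nbytes, 'big')

def unpack(length, data):
    """Reverse pack; the length restores any leading zero-valued symbols."""
    number = int.from_bytes(data, 'big')
    chars = []
    for _ in range(length):
        number, digit = divmod(number, BASE)
        chars.append(ALPHABET[digit])
    return ''.join(reversed(chars))

text = 'name=value&id=42;path=/some/where'
length, packed = pack(text)
assert unpack(length, packed) == text
print(len(text), len(packed))                     # roughly a 20% saving in raw bytes
# Note: to place the packed bytes in a URL you still need a URL-safe text form
# (e.g. base64.urlsafe_b64encode), which gives back about a third of that saving.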

In Python, represent a SHA-256 hash using the fewest characters possible

What is the most dense way (fewest characters) that I can store a complete SHA-256 hash?
Calling .digest() on a hashlib.sha256 object will return a 32-byte string -- the shortest possible way (with 8-bit bytes as the relevant unit) to store 256 bits of data which is effectively random for compressibility purposes.
Since 8 * 32 == 256, this provably has no wastage -- every bit is used.
Charles' answer is absolutely correct. However, I'm assuming that you don't want the shortest binary encoding of the SHA-256 hash -- the 32-octet string -- and instead want something printable and somewhat human-readable.
Note, however, that this does not exactly apply to barcodes. At least QR codes encode binary data, so just use the digest() method of your hash -- that is the most efficient encoding you can use there. Your QR code generation library should most likely support generating codes from "raw" binary strings -- check your library docs for the correct method/invocation.
SHA hashes (and other hashes) don't produce or operate on characters, they work with binary data. SHA-256 produces 256 bits of data, commonly represented with 32 bytes. In particular, in Python 3 you should notice that hashlib.sha256("...").digest() returns bytes and not str.
There is a convenience method hexdigest() that produces a hexadecimal (base16) string representing those bytes. You can use base32, base58, base64, baseEmoji or any other encoding that fits your requirements.
Basically, your problem is actually "I have a number and want a short encoding of it". Decide on how many distinct characters you can use (the encoding base) and go from there. There are many libraries on PyPI that could help; python-baseconv may come in handy.
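For a sense of the sizes involved, a quick sketch using only the standard library:
import base64
import hashlib

digest = hashlib.sha256(b'example data').digest()   # 32 raw bytes -- the densest form

print(len(digest))                     # 32 bytes
print(len(digest.hex()))               # 64 characters as hex (base16)
print(len(base64.b64encode(digest)))   # 44 characters as base64 (with padding)
print(len(base64.b85encode(digest)))   # 40 characters as base85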

Python convert mixed ASCII code to String

I am retrieving a value that was set by another application from memcached, using the python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays those bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
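As a tiny illustration (the byte string and encodings here are made up for demonstration), the same bytes decode to different text depending on which encoding you assume:
data = b'caf\xc3\xa9'                 # some bytes retrieved from elsewhere

print(data.decode('utf-8'))           # café  -- correct if the data really is UTF-8
print(data.decode('latin-1'))         # cafÃ©  -- a wrong assumption silently gives mojibake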
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]
