Compress data into smallest amount of text? - python

I have data (mostly a series of numpy arrays) that I want to convert into text that can be copied, pasted, emailed, etc. I created the following function, which does this:
import base64 as b64
import pickle
import zlib

def convert_to_ascii85(x):
    p = pickle.dumps(x)
    p = zlib.compress(p)
    return b64.b85encode(p)
My issue is that the string it produces is longer than it needs to be, because it only uses a subset of letters, numbers, and symbols. If I were able to encode using Unicode, I feel like it could produce a shorter string because it would have access to more characters. Is there a way to do this?
Edit to clarify:
My goal is NOT the smallest amount of data/information/bytes. My goal is the smallest number of characters. The reason is that the channel I'm sending the data through is capped by characters (100k, to be precise) instead of bytes (strange, I know). I've already tested that I can send 100k Unicode characters; I just don't know how to convert my bytes into Unicode.

UPDATE: I just saw that you changed your question to clarify that you care about character length rather than byte length. This is a really strange constraint. I've never heard of it before, and I don't quite know what to make of it. But if that's your need, and you want predictable blocking behavior, then your problem is pretty simple: pick a compatible character encoding that can represent the most possible unique characters, and then map blocks of your binary across that character set, making each block as long as it can be while still having no more possible values than your encoding has representable characters. Each such block then becomes a single character. Since this constraint is kinda strange, I don't know if there are libraries out there that do this.
UPDATE 2: Being curious about the above myself, I just Googled and found this: https://qntm.org/unicodings. If your tools and communication channels can deal with UTF-16 or UTF-32, then you might be onto something in seeking to use that. If so, I hope this article opens up the solution you're looking for. I think the article is still optimizing for byte length rather than character length, so maybe it won't provide the optimal solution, but it can only help (32 potential bits per char rather than 7 or 8). I couldn't find anything seeking to optimize on character count alone, but maybe a scheme like Base65536 is your answer. Check out https://github.com/qntm/base65536 .
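To make the idea concrete, here is a minimal sketch of that kind of blocking: it packs each 16-bit chunk of the compressed pickle into a single code point in the Supplementary Multilingual Plane (U+10000 and up, which sidesteps the surrogate range). It assumes your channel really does accept arbitrary Unicode characters, including unassigned ones; a library like base65536 does essentially the same thing with a more carefully chosen alphabet.

import pickle
import zlib

def bytes_to_chars(data):
    # Pad to an even number of bytes; the first output character records the padding.
    padded = len(data) % 2
    data += b"\x00" * padded
    # Map each 16-bit block to one code point in the Supplementary Multilingual
    # Plane (U+10000..U+1FFFF), which avoids the surrogate range entirely.
    chars = [chr(0x10000 + ((data[i] << 8) | data[i + 1]))
             for i in range(0, len(data), 2)]
    return chr(0x30 + padded) + "".join(chars)

def chars_to_bytes(text):
    padded = ord(text[0]) - 0x30
    out = bytearray()
    for ch in text[1:]:
        value = ord(ch) - 0x10000
        out += bytes([value >> 8, value & 0xFF])
    return bytes(out[:-1] if padded else out)

payload = zlib.compress(pickle.dumps(list(range(1000))))
encoded = bytes_to_chars(payload)
assert chars_to_bytes(encoded) == payload
print(len(payload), "bytes ->", len(encoded), "characters")

Roughly two bytes per character, so the character count is about half the byte count, versus 1.25 characters per byte for Base85.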
If it is byte length that you care about, and you want to stick with what is usually meant by "printable characters" or "plain printable text", then here's my original answer...
There are options for getting better "readable text" space efficiency than Base85 gives you, and there's also a case to be made for giving up some space efficiency and going with Base64. Here I'll make the case for both Base85 and Base64. If you can use Base85, you only take a 25% hit on the inflation of your binary, and you save yourself a whole lot of headaches in doing so.
Base85 is pretty close to the best you're going to do if you seek to encode arbitrary binary as "plain text", and it is the BEST you can do if you want a "plain text" encoding that you can logically break into meaningful, predictable chunks. You can in theory use a character set that includes printable characters in the high-ASCII range, but experience has shown that many tools and communication channels don't deal well with high-ASCII if they can't handle straight binary. And you don't gain much for the effort: using a roughly 256-character high-ASCII set instead of the 128-character ASCII set only buys you about 5 extra bits per 4 binary bytes or so.
For any BaseXX encoding, the algorithm takes incoming binary bits and encodes them as tightly as it can using the XX printable characters it has at its disposal. Base85 will be more compact than Base64 because it uses more of the printable characters (85) than Base64 does (64 characters).
There are 95 printable characters in standard ASCII, so there is a Base95 that is the most compact encoding possible using all of the printable characters. But trying to use all 95 characters is messy, because it leads to uneven blockings of the incoming bits: each 4 binary bytes maps to some fractional number of characters just under 5.
It turns out that 85 characters is what you need to encode 4 bytes as exactly 5 printable characters. Many will accept a little extra length in exchange for the property that every 4 binary bytes encodes to exactly 5 ASCII characters. This is only a 25% inflation in the size of the binary, which is not bad at all for the headaches it saves. Hence the motivation behind Base85.
Base64 is used to produce longer, but even less problematic, encodings. Characters that cause trouble in various text documents, like HTML, XML, JSON, etc., are not used. In this way, Base64 is usable in almost any context without any escaping. You have to be more careful with Base85, as it doesn't throw out any of these problematic characters. For encoding/decoding efficiency it uses the range 33 ('!') through 117 ('u'), starting at 33 rather than 32 just to avoid the often-problematic space character. The characters above 'u' that it doesn't use are nothing special.
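For a concrete feel of the difference, here's a quick comparison using the standard library (the byte values are arbitrary):

import base64
import os

blob = os.urandom(60)                  # 60 bytes of arbitrary binary
print(len(base64.b64encode(blob)))     # 80 characters: 3 bytes -> 4 chars
print(len(base64.b85encode(blob)))     # 75 characters: 4 bytes -> 5 chars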
So that's pretty much the story on the binary-to-ASCII encoding side. The other question is what you can do to reduce the size of what you're representing before you encode its binary representation to ASCII. You're choosing to use pickle.dumps() and zlib.compress(). Whether those are your best choices is left for another discussion...

Related

What does encoding='latin-1' do when reading a file [duplicate]

I am using a YouTube channel to learn machine learning algorithms. Somewhere in the video, I encountered an argument passed to the pd.read_csv method, encoding='latin-1'. What does this argument do?
Here is the underlying reason for the encoding parameter.
English speakers live in an easy world where the number of characters needed to write any kind of text or computer code is small enough to fit in an 8-bit byte (even in 7 bits, actually, but that's not the point). Therefore 1 character = 1 byte, and everybody agrees on the meaning of each of the 256 possible 8-bit values.
Many other languages, even those that use the same Latin alphabet, need all kinds of accented letters and specialties that do not exist in English. Taken together, the special characters of all those languages don't fit into 256 different byte values. Historically, every language community decided on a specific encoding for the byte values above 127. latin-1, aka iso-8859-1, is one of those encodings, but as you may guess, not the only one. This doesn't scale well, of course, and won't work at all for languages that don't use the Latin alphabet and need far more than 256 different values.
In all modern languages, a character and a byte are two different things.
(read this sentence twice or more, and commit into permanent brain memory)
The computer cannot "guess" the encoding of a byte stream (like a CSV file) that you feed it for processing as text (= strings of characters). Therefore, a function that reads files (I didn't watch the video, but the name of the function is explicit enough to understand its purpose) has to convert bytes (on the disk) into characters (in memory, using whatever internal representation the language happens to use). Conversely, when you have to write something to the disk or the network, which can only accept 8-bit bytes, you have to convert your characters back into bytes.
Those conversions are performed using the particular encoding your file/byte stream/network protocol is using.
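As a small illustration of why the chosen encoding matters, decoding the same bytes with the wrong encoding either fails or mangles the text (the pandas file path below is hypothetical):

raw = "café".encode("latin-1")                   # b'caf\xe9' -- one byte per character
print(raw.decode("latin-1"))                      # 'café' -- correct encoding
print(raw.decode("utf-8", errors="replace"))      # 'caf\ufffd' -- the wrong guess mangles the text

# This is exactly what the parameter tells pandas to do when converting file bytes to strings:
# df = pd.read_csv("some_file.csv", encoding="latin-1")   # hypothetical path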
As a side note, you should consider getting rid of the 8859-* encodings and using Unicode, with the UTF-8 encoding, as much as possible in new developments.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1
Latin-1 is the same as 8859-1. Every character is encoded as a single byte. There are 191 characters total.

ASCII vs UTF-8?

Assuming the storage size is important:
I have a long list of digits (0-9) that I want to write to a file. From a storage standpoint, would it be more efficient to use ASCII or UTF-8 as an encoding?
Is it possible to create a smaller file using something else?
There's no difference between ASCII and UTF-8 when storing digits. A tighter packing would be using 4 bits per digit (BCD).
If you want to go below that, you need to take advantage of the fact that long sequences of base-10 values can be represented as base-2 (binary) values.
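A quick sketch of those three storage levels (note the integer route drops leading zeros, so a real scheme would also need to record the length):

digits = "31415926535897932384626433832795"       # 32 digits

as_text = digits.encode("ascii")                   # 32 bytes: one byte per digit

# BCD-style: pack two digits (4 bits each) into one byte.
padded = digits + "0" * (len(digits) % 2)
bcd = bytes((int(padded[i]) << 4) | int(padded[i + 1])
            for i in range(0, len(padded), 2))     # 16 bytes

# Treat the whole sequence as one base-10 number and store its binary form.
n = int(digits)
as_int = n.to_bytes((n.bit_length() + 7) // 8, "big")   # 14 bytes

print(len(as_text), len(bcd), len(as_int))         # 32 16 14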
There is absolutely no difference in this case; UTF-8 is identical to ASCII in this character range.
If storage is an important consideration, maybe look into compression. A simple Huffman coding will use something like 3-4 bits per digit for this kind of data. If there are periodic patterns, a modern compression algorithm can take it even further.

FontTools: extracting useful UTF information provided by it

FontTools is producing some XML with all sorts of details in this structure
<cmap>
<tableVersion version="0"/>
<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x20" name="space"/><!-- SPACE -->
<!--many, many more characters-->
</cmap_format_4>
<cmap_format_0 platformID="1" platEncID="0" language="0">
<map code="0x0" name=".notdef"/>
<!--many, many more characters again-->
</cmap_format_0>
<cmap_format_4 platformID="0" platEncID="3" language="0"> <!--"cmap_format_4" again-->
<map code="0x20" name="space"/><!-- SPACE -->
<!--more "map" nodes-->
</cmap_format_4>
</cmap>
I'm trying to figure out every character this font supports, so these code attributes are what I'm interested in. I believe I am correct in thinking that all code attributes are UTF-8 values: is this correct? I am also curious why there are two cmap_format_4 nodes (they seem to be identical, but I haven't tested that across a thorough number of fonts, though); if someone familiar with this module knows for certain, that is my first question.
To be assured I am seeing all characters contained in the typeface, do I need to combine all the code attribute values, or just one or two? Will FontTools always produce these three XML nodes, or is the quantity variable? Any idea why? The documentation is a little vague.
The number of cmap_format_N nodes ("cmap subtables") is variable, as is the N (the format). There are several formats; the most common is 4, but there are also formats 12, 0, 6, and a few others.
Fonts may have multiple cmap subtables, but are not required to. The reason for this is the history of the development of TrueType (which has evolved into OpenType). The format was invented before Unicode, at a time when each platform had its own way(s) of character mapping. The different formats and the ability to have multiple mappings were a necessity at the time in order to have a single font file that could map everything without multiple files, duplication, etc. Nowadays most fonts that are produced will only have a single Unicode subtable, but there are many floating around that have multiple subtables.
The code values in the map node are code point values expressed as hexadecimal. They might be Unicode values, but not necessarily (see the next point).
I think your font may be corrupted (or possibly there was a copy/paste mix-up). It is possible to have multiple cmap_format_N entries in the cmap, but each combination of platformID/platEncID/language should be unique. Also, it is important to note that not all cmap subtables map Unicode; some express older, pre-Unicode encodings. You should look at tables where platformID="3" first, then platformID="0", and finally platformID="2" as a last resort. Other platformIDs do not necessarily map Unicode values.
As for discovering "all Unicodes mapped in a font": that can be a bit tricky when there are multiple Unicode subtables, especially if their contents differ. You might get close by taking the union of all code values in all of the subtables that are known to be Unicode maps, but it is important to understand that most platforms will only use one of the maps at a time. Usually there is a preferred picking order similar to what I stated above; when one is found, that is the one used. There's no standardized order of preference that applies to all platforms (that I'm aware of), but most of the popular ones follow an order pretty close to what I listed.
Finally, regarding Unicode vs UTF-8: the code values are Unicode code points; NOT UTF-8 byte sequences. If you're not sure of the difference, spend some time reading about character encodings and byte serialization at Unicode.org.
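If you'd rather not parse the XML dump at all, something along these lines collects the mapped code points directly (a sketch assuming a reasonably recent fontTools; the font path is hypothetical):

from fontTools.ttLib import TTFont

font = TTFont("SomeFont.ttf")                      # hypothetical path
cmap_table = font["cmap"]

# Union of code points from every subtable that maps Unicode values.
codepoints = set()
for subtable in cmap_table.tables:
    if subtable.isUnicode():
        codepoints.update(subtable.cmap.keys())

# Or let fontTools pick the preferred Unicode subtable for you.
best = cmap_table.getBestCmap()

print(len(codepoints), "code points in the union,", len(best), "in the preferred subtable")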

Compressing short English strings in Python?

I would like to fit 80M strings of length < 20 characters in memory and use as little memory as possible.
I would like a compression library that I can drive from Python, that will allow me to compress short (<20 char) English strings. I have about 80M of them, and I would like them to fit in as little memory as possible.
I would like maximum lossless compression. CPU time is not the bottleneck.
I don't want the dictionary stored with each string, because that would be high overhead.
I want to compress to <20% of the original size. This is plausible, given that the upper bound on the entropy of English is 1.75 bits per character (Brown et al., 1992, http://acl.ldc.upenn.edu/J/J92/J92-1002.pdf) = 22% compression (1.75/8).
Edit:
I can't use zlib because the header is too large. (If I have a string that starts out at 20 bytes, there can be NO header if there is to be any good compression. The zlib header = 200 bytes according to Roland Illing. I haven't double-checked, but I know it's bigger than 20.)
Huffman coding sounds nice, except it is based on individual tokens and can't handle n-grams (multiple characters).
smaz has a crappy dictionary, and compresses to only 50%.
I strongly prefer to use existing code, rather than implement a compression algorithm.
I don't want the dictionary stored with each string, because that would be high overhead.
So build a single string with all of the desired contents, and compress it all at once with whichever solution. This solves the "header is too large" problem as well.
You can do this in a variety of ways. Probably the simplest is to create the repr() of a list of the strings; or you can use the pickle, shelve or json modules to create some other sort of serialized form.
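For example, serializing the whole list and compressing it in one shot (json and zlib here; any of the serializers mentioned above would do):

import json
import zlib

strings = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
serialized = json.dumps(strings).encode("utf-8")
blob = zlib.compress(serialized, 9)

print(len(serialized), "->", len(blob), "bytes")
assert json.loads(zlib.decompress(blob)) == strings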
Make a dictionary of all words. Then, convert all words to numbers corresponding to the offset in the dictionary. If needed, you can use the first bit to indicate that the word is capitalized.
How about using zipfile from the standard library?
There are no more than 128 different characters in English strings, so you can describe each character with a 7-bit code. See Compressing UTF-8 (or other 8-bit encoding) to 7 or fewer bits.
First, if you compress each 20-bytes string individually, your compression ratio will be miserable. You need to compress a lot of strings together to really witness some tangible benefits.
Second, 80M strings is a lot, and if you have to decompress them all to extract a single one of them, you'll be displeased with the performance. Chunk your input into smaller but still large enough blocks. A typical value would be 64KB, which translates into about 3200 strings.
Then you can compress each 64KB block independently. When you need to access a single string in a block, you need to decode the entire block.
So there is a trade-off to decide between compression ratio (which prefers larger blocks) and random access speed (which prefers smaller blocks). You'll be the judge of which suits you best.
Quick note: random access to an in-memory structure usually favors fast compression algorithms rather than strong ones. If you compress only once but random-access many times, prefer a highly asymmetric algorithm, such as LZ4-HC:
http://code.google.com/p/lz4hc/
According to the benchmarks, compression speed is only 15MB/s, but decoding speed is about 1GB/s. That translates into 16K blocks of 64KB decoded per second...
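Here's a rough sketch of that blocking scheme using zlib from the standard library (the answer suggests LZ4-HC, but the structure is the same; it also assumes the strings contain no newlines):

import zlib

def build_blocks(strings, block_size=64 * 1024):
    # index[i] = (block number, position inside that block) for strings[i]
    blocks, index, current, size = [], [], [], 0
    for s in strings:
        if size + len(s) > block_size and current:
            blocks.append(zlib.compress("\n".join(current).encode("utf-8")))
            current, size = [], 0
        index.append((len(blocks), len(current)))
        current.append(s)
        size += len(s) + 1
    if current:
        blocks.append(zlib.compress("\n".join(current).encode("utf-8")))
    return blocks, index

def lookup(blocks, index, i):
    # Decompress only the one block that holds strings[i].
    block_no, offset = index[i]
    return zlib.decompress(blocks[block_no]).decode("utf-8").split("\n")[offset]

strings = ["string number %d" % i for i in range(100000)]
blocks, index = build_blocks(strings)
assert lookup(blocks, index, 54321) == "string number 54321"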

Python - letter frequency count and translation

I am using Python 3.1, but I can downgrade if needed.
I have an ASCII file containing a short story written in one of the languages whose alphabet can be represented with upper and/or lower ASCII. I wish to:
1) Detect an encoding to the best of my abilities, get some sort of confidence metric (would vary depending on the length of the file, right?)
2) Automatically translate the whole thing using some free online service or a library.
Additional question: What if the text is written in a language where it takes 2 or more bytes to represent one letter and the byte order mark is not there to help me?
Finally, how do I deal with punctuation and miscellaneous characters such as the space? It will occur more frequently than some letters, right? And what about the fact that punctuation and characters can sometimes be mixed - there might be two representations of a comma, two representations of what looks like an "a", etc.?
Yes, I have read the article by Joel Spolsky on Unicode. Please help me with at least some of these items.
Thank you!
P.S. This is not homework, but it is for self-educational purposes. I prefer a letter-frequency library that is open-source and readable over one that is closed and efficient, even if the latter gets the job done well.
Essentially there are three main tasks to implement the described application:
1a) Identify the character encoding of the input text
1b) Identify the language of the input text
2) Get the text translated by way of one of the online services' APIs
For 1a, you may want to take a look at decodeh.py; aside from the script itself, it provides many very useful resources regarding character sets and encodings at large. chardet, mentioned in another answer, also seems worthy of consideration.
Once the character encoding is known, as you suggest, you may solve 1b by calculating the character-frequency profile of the text and matching it against known frequencies. While simple, this approach typically provides a decent precision ratio, although it may be weak on shorter texts and also on texts which follow particular patterns; for example, a text in French with many references to units in the metric system will have an unusually high proportion of the letters M, K and C.
A complementary and very similar approach uses bi-grams (sequences of two letters) and tri-grams (sequences of three letters) and the corresponding frequency-distribution reference tables for various languages.
Other language detection methods involve tokenizing the text, i.e. considering the words within the text. NLP resources include tables with the most used words in various languages. Such words are typically articles, possessive adjectives, adverbs and the like.
An alternative solution to language detection is to rely on the online translation service to figure this out for us. What is important is to supply the translation service with text in a character encoding it understands; providing it the language may be superfluous.
Finally, as with many practical NLP applications, you may decide to implement multiple solutions. By using a strategy design pattern, one can apply several filters/classifiers/steps in a particular order and exit this logic at different points depending on the situation. For example, if a simple character/bigram frequency check matches the text to English (with a small deviation), one may just stop there. Otherwise, if the guessed language is French or German, perform another test, etc.
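A bare-bones sketch of that early-exit chain (the detector names and the confidence threshold here are hypothetical placeholders):

def detect_language(text, detectors, threshold=0.9):
    # Each detector returns (language, confidence in [0, 1]); stop at the first
    # confident answer, otherwise fall through to the next detector.
    for detect in detectors:
        language, confidence = detect(text)
        if confidence >= threshold:
            return language
    return None

# detectors = [char_frequency_guess, trigram_guess, stopword_guess]  # hypothetical names
# detect_language(some_text, detectors)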
If you have an ASCII file then I can tell you with 100% confidence that it is encoded in ASCII. Beyond that try chardet. But knowing the encoding isn't necessarily enough to determine what language it's in.
As for multibyte encodings, the only reliable way to handle them is to hope that the text has characters in the Latin alphabet and look for which half of each pair holds the NULL. Otherwise treat it as UTF-8 unless you know better (Shift-JIS, GB2312, etc.).
Oh, and UTF-8. UTF-8, UTF-8, UTF-8. I don't think I can stress that enough. And in case I haven't... UTF-8.
Character frequency counting is pretty straightforward.
I just noticed that you are using Python 3.1, so this is even easier:
>>> from collections import Counter
>>> Counter("Μεταλλικα")
Counter({'α': 2, 'λ': 2, 'τ': 1, 'ε': 1, 'ι': 1, 'κ': 1, 'Μ': 1})
For older versions of Python:
>>> from collections import defaultdict
>>> letter_freq=defaultdict(int)
>>> unistring = "Μεταλλικα"
>>> for uc in unistring: letter_freq[uc]+=1
...
>>> letter_freq
defaultdict(<class 'int'>, {'τ': 1, 'α': 2, 'ε': 1, 'ι': 1, 'λ': 2, 'κ': 1, 'Μ': 1})
I have provided some conditional answers; however, your question is a little vague and inconsistent. Please edit your question to provide answers to my questions below.
(1) You say that the file is ASCII but you want to detect an encoding? Huh? Isn't the answer "ascii"?? If you really need to detect an encoding, use chardet
(2) Automatically translate what? encoding? language? If language, do you know what the input language is or are you trying to detect that also? To detect language, try guess-language ... note that it needs a tweak for better detection of Japanese. See this SO topic which notes the Japanese problem and also highlights that for ANY language-guesser, you need to remove all HTML/XML/Javascript/etc noise from your text otherwise it will heavily bias the result towards ASCII-only languages like English (or Catalan!).
(3) You are talking about a "letter-frequency library" ... you are going to use this library to do what? If language guessing, it appears that using frequency of single letters is not much help distinguishing between languages which use the same (or almost the same) character set; one needs to use the frequency of three-letter groups ("trigrams").
(4) Your questions on punctuation and spaces: it depends on your purpose (which we are not yet sure of). If the purpose is language detection, the idea is to standardise the text; e.g. replace all runs of characters that are not letters or apostrophes with a single space, then remove any leading/trailing whitespace, then add 1 leading and 1 trailing space -- more precision is gained by treating start/end-of-word bigrams as trigrams. Note that, as usual in all text processing, you should decode your input into unicode immediately and work with unicode thereafter.
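A small sketch of that normalization plus trigram counting (working on unicode strings throughout, as advised):

from collections import Counter

def trigram_profile(text):
    # Keep letters and apostrophes, collapse everything else into single spaces,
    # then pad so start-of-word and end-of-word pairs become trigrams too.
    kept = "".join(ch if ch.isalpha() or ch == "'" else " " for ch in text.lower())
    normalized = " " + " ".join(kept.split()) + " "
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))

print(trigram_profile("Le renard brun saute par-dessus le chien paresseux").most_common(5))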
