Compressing short English strings in Python?

I would like a compression library that I can drive from Python to compress short (< 20 character) English strings. I have about 80M of them, and I would like them to fit in as little memory as possible.
I would like maximum lossless compression. CPU time is not the bottleneck.
I don't want the dictionary stored with each string, because that would be high overhead.
I want to compress to < 20% of the original size. This is plausible, given that the upper bound on the entropy of English is 1.75 bits per character (Brown et al., 1992, http://acl.ldc.upenn.edu/J/J92/J92-1002.pdf), i.e. 22% of the original 8 bits per character (1.75/8).
Edit:
I can't use zlib because the header is too large. (If I have a string that starts out at 20 bytes, there can be NO header if I want good compression. The zlib header is 200 bytes according to Roland Illing. I haven't double-checked, but I know it's bigger than 20.)
Huffman coding sounds nice, except it is based on individual tokens and can't handle n-grams (multiple characters).
smaz has a crappy dictionary, and compresses to only 50%.
I strongly prefer to use existing code, rather than implement a compression algorithm.

I don't want the dictionary stored with each string, because that would be high overhead.
So build a single string with all of the desired contents, and compress it all at once with whichever solution. This solves the "header is too large" problem as well.
You can do this in a variety of ways. Probably the simplest is to create the repr() of a list of the strings; or you can use the pickle, shelve or json modules to create some other sort of serialized form.
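For example, a minimal sketch of that idea, using json for the serialization and zlib for the single compression pass (the sample strings are stand-ins for the real data):

    import json
    import zlib

    strings = ["the quick brown fox", "jumps over", "the lazy dog"]  # stand-in data

    # Serialize the whole list once, then compress the single blob, so the
    # zlib header and the implicit dictionary are paid for only once.
    blob = json.dumps(strings).encode("utf-8")
    packed = zlib.compress(blob, 9)

    # Getting any string back means decompressing and deserializing the blob.
    assert json.loads(zlib.decompress(packed).decode("utf-8")) == strings

The obvious cost is that you decompress everything to read anything; the block-based answer further down addresses that trade-off.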

Make a dictionary of all words. Then, convert all words to numbers corresponding to the offset in the dictionary. If needed, you can use the first bit to indicate that the word is capitalized.
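A minimal sketch of that word-dictionary encoding (the sample strings and the capitalisation-in-the-low-bit convention are just illustrations):

    strings = ["The quick fox", "the lazy dog"]  # stand-in data

    vocab = {}  # lowercased word -> offset in the dictionary

    def encode(sentence):
        codes = []
        for word in sentence.split():
            offset = vocab.setdefault(word.lower(), len(vocab))
            # Low bit marks capitalisation; the rest is the dictionary offset.
            codes.append((offset << 1) | int(word[:1].isupper()))
        return codes

    def decode(codes, words):
        # `words` is the vocabulary in insertion (= offset) order.
        return " ".join(
            words[c >> 1].capitalize() if c & 1 else words[c >> 1] for c in codes
        )

    encoded = [encode(s) for s in strings]
    words = list(vocab)  # insertion order matches the offsets (Python 3.7+)
    assert decode(encoded[0], words) == "The quick fox"

With a realistic English vocabulary, most offsets fit in two or three bytes, which is where the space saving comes from.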

How about using zipfile from the standard library?
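For example (a quick sketch; writing all the strings into a single member is what makes the deflate dictionary and the per-member headers cheap):

    import io
    import zipfile

    strings = ["first short string", "second short string"]  # stand-in data

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # One member holding all strings, newline-separated.
        zf.writestr("strings.txt", "\n".join(strings))

    print(len(buf.getvalue()), "bytes in the archive")

Note that zipfile adds per-archive and per-member overhead, so this only pays off for the bulk case, not for compressing each 20-character string on its own.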

There are no more than 128 different characters in English strings. Hence you can describe each character with a 7-bit code. See Compressing UTF-8 (or other 8-bit encoding) to 7 or fewer bits.
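A small sketch of that 7-bit packing (pure Python, assuming the strings really are ASCII-only):

    def pack7(text):
        # Pack 7-bit ASCII codes back to back, so 8 characters fit in 7 bytes.
        bits, nbits, out = 0, 0, bytearray()
        for ch in text.encode("ascii"):  # raises UnicodeEncodeError on non-ASCII
            bits = (bits << 7) | (ch & 0x7F)
            nbits += 7
            while nbits >= 8:
                nbits -= 8
                out.append((bits >> nbits) & 0xFF)
        if nbits:
            out.append((bits << (8 - nbits)) & 0xFF)  # pad the final byte
        return bytes(out)

    packed = pack7("hello world")
    print(len("hello world"), "->", len(packed), "bytes")  # 11 -> 10

This alone only gets you to 87.5% of the original size, so it would be a final squeeze on top of something else rather than the main compressor.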

First, if you compress each 20-byte string individually, your compression ratio will be miserable. You need to compress a lot of strings together to see any tangible benefit.
Second, 80M strings is a lot, and if you have to decompress them all to extract a single one, you'll be displeased by performance. Chunk your input into smaller but still large enough blocks. A typical value would be 64 KB, which translates into about 3,200 strings.
Then you can compress each 64 KB block independently. When you need to access a single string in a block, you decode the entire block.
So there is a trade-off to decide between compression ratio (which prefers larger blocks) and random-access speed (which prefers smaller blocks). You'll be the judge of the best balance.
Quick note: random access on an in-memory structure usually favors fast compression algorithms rather than strong ones. If you compress only once but random-access many times, prefer a highly asymmetric algorithm such as LZ4-HC:
http://code.google.com/p/lz4hc/
According to its benchmarks, compression speed is only 15 MB/s, but decoding speed is about 1 GB/s. That translates into 16K blocks of 64 KB decoded per second...
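A hedged sketch of the block scheme (zlib is used here only because it ships with Python; an LZ4 binding such as the third-party lz4 package offers an equivalent compress/decompress pair with much faster decoding). The strings are assumed not to contain newlines:

    import zlib

    BLOCK_SIZE = 64 * 1024

    def build_blocks(strings):
        # Pack strings into ~64 KB newline-separated blocks, compressed independently.
        blocks, index = [], []        # index[i] = (block_no, line_no) of string i
        current, size = [], 0
        for s in strings:
            if current and size + len(s) + 1 > BLOCK_SIZE:
                blocks.append(zlib.compress("\n".join(current).encode("utf-8")))
                current, size = [], 0
            index.append((len(blocks), len(current)))
            current.append(s)
            size += len(s) + 1
        if current:
            blocks.append(zlib.compress("\n".join(current).encode("utf-8")))
        return blocks, index

    def lookup(i, blocks, index):
        block_no, line_no = index[i]
        return zlib.decompress(blocks[block_no]).decode("utf-8").split("\n")[line_no]

    strings = ["string number %d" % n for n in range(10000)]  # stand-in data
    blocks, index = build_blocks(strings)
    assert lookup(1234, blocks, index) == "string number 1234"

The index here costs a couple of integers per string; a flat array (e.g. array.array) would shrink that overhead if it matters.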

Related

ASCII vs UTF-8?

Assuming the storage size is important:
I have a long list of digits (0-9) that I want to write to a file. From a storage standpoint, would it be more efficient to use ASCII or UTF-8 as an encoding?
Is it possible to create a smaller file using something else?
There's no difference between ASCII and UTF-8 when storing digits. A tighter packing would be using 4 bits per digit (BCD).
If you want to go below that, you need to take advantage of the fact that long sequences of 10-base values can be presented as 2-base (binary) values.
There is absolutely no difference in this case; UTF-8 is identical to ASCII in this character range.
If storage is an important consideration, maybe look into compression. A simple Huffman compression will use something like 3 bits per byte for this kind of data. If there are periodicity patterns, a modern compression algorithm can take it even further.
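A tiny sketch of the 4-bits-per-digit (BCD-style) packing suggested above:

    def pack_digits(digits):
        # Two decimal digits per byte; a trailing 'f' nibble pads odd lengths.
        if len(digits) % 2:
            digits += "f"
        return bytes(int(digits[i:i + 2], 16) for i in range(0, len(digits), 2))

    def unpack_digits(data):
        return data.hex().rstrip("f")  # drop the padding nibble, if any

    packed = pack_digits("31415926535")
    print(len("31415926535"), "->", len(packed), "bytes")  # 11 -> 6
    assert unpack_digits(packed) == "31415926535"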

Efficiently processing large binary files in python

I'm currently reading binary files that are 150,000 KB each. They contain roughly 3,000 structured binary messages, and I'm trying to figure out the quickest way to process them. Out of each message, I only need to actually read about 30 lines of data. These messages have headers that allow me to jump to specific portions of the message and find the data I need.
I'm trying to figure out whether it's more efficient to unpack the entire message (50 KB each) and pull my data from the resulting tuple, which includes a lot of data I don't actually need, or whether it would cost less to use seek to go to each line of data I need for every message and unpack each of those 30 lines. Alternatively, is this something better suited to mmap?
Seeking, possibly several times, within just 50 KB is probably not worthwhile: system calls are expensive. Instead, read each message into one bytes object and use slicing to “seek” to the offsets you need and get the right amount of data.
It may be beneficial to wrap the bytes in a memoryview to avoid copying, but for small individual reads it probably doesn’t matter much. If you can use a memoryview, definitely try using mmap, which exposes a similar interface over the whole file. If you’re using struct, its unpack_from can already seek within a bytes or an mmap without wrapping or copying.
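A hedged sketch of the mmap + unpack_from combination (the file name, message size, field offset and field format below are all hypothetical; the point is that unpack_from reads straight out of the mapping at a given offset with no seek and no copy):

    import mmap
    import struct

    MESSAGE_SIZE = 50 * 1024   # ~50 KB per message, as described above
    FIELD_OFFSET = 128         # assumed offset of one field inside a message
    FIELD_FORMAT = "<3d"       # assumed field layout: three little-endian doubles

    with open("capture.bin", "rb") as f:              # hypothetical file name
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for msg_start in range(0, len(mm), MESSAGE_SIZE):
            values = struct.unpack_from(FIELD_FORMAT, mm, msg_start + FIELD_OFFSET)
            # ... repeat for the ~30 fields you actually need ...
        mm.close()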

how to efficiently compress a long list of file path strings one by one?

I have a very long (>1M) list of file path strings. I need to compress these strings individually to save space. A path can be quite long, e.g. 150 chars.
Many of the paths have a common prefix, which would improve compression, if I only could compress them as a bulk.
Experimenting with gzip and zip, I get 16% compression on one string, 85% on 1000 strings, which is expected.
Is there a way to "teach" the algorithm what the data distribution is upfront, or have a "learning" algorithm which improves the compression over subsequent applications?
I need this as a library; I have no time to develop my own at the moment. I suppose a trie would be of help here.
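One route worth sketching, assuming Python 3.3+ and zlib's zdict parameter: seed the compressor with a preset dictionary built from fragments you know are common (the dictionary contents below are purely illustrative), so each path can still be compressed on its own:

    import zlib

    ZDICT = b"/home/user/projects/src/main/java/com/example/"  # assumed common fragments

    def compress_path(path):
        # Raw deflate (wbits=-15): no header, and the preset dictionary is shared
        # out of band instead of being stored with each compressed path.
        c = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=ZDICT)
        return c.compress(path.encode("utf-8")) + c.flush()

    def decompress_path(data):
        d = zlib.decompressobj(-15, zdict=ZDICT)
        return (d.decompress(data) + d.flush()).decode("utf-8")

    p = "/home/user/projects/src/main/java/com/example/Foo.java"
    packed = compress_path(p)
    print(len(p), "->", len(packed), "bytes")
    assert decompress_path(packed) == p

The answer to the "information complexity" question further down discusses the same preset-dictionary mechanism in more detail.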

Using compression library to estimate information complexity of an english sentence?

I'm trying to write an algorithm that can work out the 'unexpectedness' or 'information complexity' of a sentence. More specifically I'm trying to sort a set of sentences so the least complex come first.
My thought was that I could use a compression library (zlib, perhaps?), 'pre-train' it on a large corpus of text in the same language (call this the 'Corpus'), and then append the different sentences to that corpus of text.
That is I could define the complexity measure for a sentence to be how many more bytes it requires to compress the whole corpus with that sentence appended, versus the whole corpus with a different sentence appended. (The fewer extra bytes, the more predictable or 'expected' that sentence is, and therefore the lower the complexity). Does that make sense?
The problem is with trying to find the right library that will let me do this, preferably from python.
I could do this by literally appending sentences to a large corpus and asking a compression library to compress the whole shebang, but if possible, I'd much rather halt the processing of the compression library at the end of the corpus, take a snapshot of the relevant compression 'state', and then, with all that 'state' available try to compress the final sentence. (I would then roll back to the snapshot of the state at the end of the corpus and try a different sentence).
Can anyone help me with a compression library that might be suitable for this need? (Something that lets me 'freeze' its state after 'pre training'.)
I'd prefer to use a library I can call from Python, or Scala. Even better if it is pure python (or pure scala)
All this is going to do is tell you whether the words in the sentence, and maybe phrases in the sentence, are in the dictionary you supplied. I don't see how that's complexity. More like grade level. And there are better tools for that. Anyway, I'll answer your question.
Yes, you can preset the zlib compressor with a dictionary. All it is is up to 32 KB of text. You don't need to run zlib on the dictionary or "freeze a state" -- you simply start compressing the new data, but permit it to look back at the dictionary for matching strings. However, 32 KB isn't very much. That's as far back as zlib's deflate format will look, and you can't load much of the English language into 32 KB.
LZMA2 also allows for a preset dictionary, but it can be much larger, up to 4 GB. There is a Python binding for the LZMA2 library, but you may need to extend it to provide the dictionary preset function.
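As a minimal sketch of the zlib option (standard library, Python 3.3+ for the zdict parameter), the "extra bytes" measure from the question could look like this; the corpus file name and sample sentences are hypothetical:

    import zlib

    def complexity(sentence, corpus_tail):
        # corpus_tail is at most the last 32 KB of the corpus: that is all the
        # deflate window can look back at anyway.
        c = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=corpus_tail)
        return len(c.compress(sentence.encode("utf-8")) + c.flush())

    corpus_tail = open("corpus.txt", "rb").read()[-32768:]  # hypothetical corpus file
    sentences = ["the cat sat on the mat",
                 "colourless green ideas sleep furiously"]
    for s in sorted(sentences, key=lambda s: complexity(s, corpus_tail)):
        print(complexity(s, corpus_tail), s)

Because the dictionary is preset rather than re-compressed, this is effectively the "freeze the state after the corpus" behaviour asked for, limited to zlib's 32 KB window.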

Store large dictionary to file in Python

I have a dictionary with many entries and a huge vector as each value. These vectors can have 60,000 dimensions, and I have about 60,000 entries in the dictionary. To save time, I want to store this after computation. However, using pickle led to a huge file. I have tried storing to JSON, but the file remains extremely large (like 10.5 MB on a sample of 50 entries with fewer dimensions). I have also read about sparse matrices. As most entries will be 0, this is a possibility. Will this reduce the file size? Is there any other way to store this information? Or am I just unlucky?
Update:
Thank you all for the replies. I want to store this data because these are word counts. For example, when given sentences, I store the number of times word 0 (at location 0 in the array) appears in the sentence. There are obviously more words in all sentences than appear in one sentence, hence the many zeros. Then, I want to use this array to train at least three, maybe six, classifiers. It seemed easier to create the arrays with word counts and then run the classifiers overnight to train and test. I use sklearn for this. This format was chosen to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go in this case, please let me know. I am very much aware that I have much to learn about coding efficiently!
I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).
Update 2:
Thank you all for the tips. John Mee was right that I don't need to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up the calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!
See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file.
Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
If you are using numpy arrays, it can be very efficient, as both klepto and joblib understand how to use minimal state representation for an array. If you indeed have most elements of the arrays as zeros, then by all means, convert to sparse matrices... and you will find huge savings in storage size of the array.
As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) -- where HDF5 is coming soon. It can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed).
klepto gives you the option to store the dictionary as an "all-in-one" file or as one file per entry, and it can also leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel.
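A minimal sketch of the sparse-matrix route recommended above, assuming the count vectors are stacked row-wise and scipy is available (save_npz/load_npz need scipy >= 0.19):

    import numpy as np
    from scipy import sparse

    dense = np.zeros((50, 60000), dtype=np.int32)  # stand-in for real word counts
    dense[0, 17] = 3
    dense[1, 42] = 1

    csr = sparse.csr_matrix(dense)                 # only non-zero entries are kept
    sparse.save_npz("counts.npz", csr)
    loaded = sparse.load_npz("counts.npz")
    assert (loaded != csr).nnz == 0

Many of sklearn's classifiers accept CSR matrices directly, so there is no need to densify before training.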
With 60,000 dimensions, do you mean 60,000 elements? If this is the case and the numbers are 1..10, then a reasonably compact but still efficient approach is to use a dictionary of Python array.array objects with 1 byte per element (type 'B').
The size in memory should be about 60,000 entries x 60,000 bytes, totaling 3.35 GB of data.
That data structure pickles to about the same size on disk, too.
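A small sketch of that dict-of-array.array layout (the keys are stand-ins):

    from array import array
    import pickle

    counts = {
        "sentence-0": array("B", [0] * 60000),  # one unsigned byte per element
        "sentence-1": array("B", [0] * 60000),
    }
    counts["sentence-0"][17] = 3

    with open("counts.pkl", "wb") as f:
        pickle.dump(counts, f, protocol=pickle.HIGHEST_PROTOCOL)

The pickle stores each array as its raw machine bytes, which is why the on-disk size stays close to the in-memory size.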
