I'm building a Python program to compress/decompress a text file using a Huffman tree. Previously, I would store the frequency table in a .json file alongside the compressed file. When I read in the compressed data and the .json, I would rebuild the decompression tree from the frequency table. I thought this was a pretty elegant solution.
However, I was running into an odd issue with files of medium length where they would decompress into strings of seemingly random characters. I found that the issue occurred when two characters occurred the same number of times. When I rebuilt my tree, any of those characters with matching frequencies had a chance of getting swapped. For the majority of files, particularly large and small files, this wasn't a problem: most letters occurred slightly more or slightly less often than others. But in some medium-sized files, a large portion of the characters occurred the same number of times as another character, resulting in gibberish.
Is there a unique identifier for my nodes that I can use instead to easily rebuild my tree? Or should I be approaching the tree writing completely differently?
In the Huffman algorithm you need to pick the lowest two frequencies in a deterministic way that is the same on both sides. If there is a tie, you need to use the symbol to break the tie. Without that, you have no assurance that the sorting on both sides will pick the same symbols when faced with equal frequencies.
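For example, a minimal sketch of that kind of deterministic tie-breaking with heapq (the nested-tuple tree and all the names here are mine, not from your code):

```python
import heapq

def build_tree(freqs):
    # Each heap entry is (frequency, tie_breaker, node). The tie_breaker is the
    # smallest symbol under that node, so entries with equal frequencies compare
    # the same way during compression and decompression.
    heap = [(freq, sym, sym) for sym, freq in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, s1, left = heapq.heappop(heap)
        f2, s2, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, min(s1, s2), (left, right)))
    return heap[0][2]   # leaves are symbols, internal nodes are (left, right) pairs
```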
You don't need to send the frequencies. All you need to send is the bit lengths for the symbols. The lengths can be coded much more compactly than the frequencies. You can build a canonical code from just the lengths, using the symbols to order the codes unambiguously.
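A minimal sketch of building a canonical code from the lengths alone (variable names are illustrative):

```python
def canonical_codes(lengths):
    """Assign codes deterministically, ordering symbols by (length, symbol)."""
    codes = {}
    code = 0
    prev_len = 0
    for symbol, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len          # append zeros when the length grows
        codes[symbol] = format(code, '0{}b'.format(length))
        code += 1
        prev_len = length
    return codes

# e.g. {'a': 1, 'b': 2, 'c': 2} -> {'a': '0', 'b': '10', 'c': '11'}
```

The decompressor runs the same loop over the same lengths and gets the same codes, so nothing else needs to be transmitted.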
I am comparing two translations (English to French) of the same novel that are quite different from one another. I am interested in locating any significant matches that exist in both (>3 words in order).
My first instinct was to look at difflib or filecmp, but they seem to mostly output quantitative data when I want qualitative. They also mostly seem suited to line-by-line comparison, but I want to compare the texts in their entirety. Given the large size of the .txt files (novel-length), am I crazy to think this is even possible?
I'm honestly open to any programming language that can solve this, but partial to python.
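To make the word-level idea concrete, here is a rough sketch of what I had in mind with difflib (it may well be too slow on novel-length input, which is part of my question):

```python
import difflib

def common_runs(text_a, text_b, min_words=4):
    """Yield word sequences (>3 words) that appear in both texts in order."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            yield ' '.join(words_a[block.a:block.a + block.size])
```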
thanks!
I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.
I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.
EDIT:
The input is a string which is often a proper name and therefore in no standard (homophone) dictionary. An example could be Google or McDonald's (just to name two popular named entities, but many are much more unpopular).
The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.
How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.
But suppose that you want to roll your own.
The first step is figuring out how to turn the letters that you are given into a representation of what it sounds like. This is a very hard problem with guessing required. (e.g. What sound does "read" make? It depends on whether you are going to read, or you already read!) However, text to phonemes converter suggests that Arpabet has solved this for English.
Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
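A minimal sketch of that step, assuming NLTK's copy of the CMU Pronouncing Dictionary (Arpabet phonemes) is acceptable; it only covers words that are actually in the dictionary:

```python
import nltk
nltk.download('cmudict', quiet=True)     # fetch the CMU Pronouncing Dictionary once
from nltk.corpus import cmudict

pronunciations = cmudict.dict()          # word -> list of phoneme sequences
print(pronunciations['read'])            # e.g. [['R', 'EH1', 'D'], ['R', 'IY1', 'D']]
```

For names like Google that are not in the dictionary, this is where the letter-to-sound guessing from the previous step has to take over.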
Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no different from the sort of algorithms that are used for spelling autocorrect, only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html. Or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
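For instance, a tiny sketch of "edit distance, but over phonemes" (plain Levenshtein on phoneme lists; a real system would make similar-sounding phonemes cheaper to substitute):

```python
def phoneme_distance(p1, p2):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    prev = list(range(len(p2) + 1))
    for i, a in enumerate(p1, 1):
        cur = [i]
        for j, b in enumerate(p2, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (a != b)))    # substitution
        prev = cur
    return prev[-1]
```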
And that is it.
I'm trying to write an algorithm that can work out the 'unexpectedness' or 'information complexity' of a sentence. More specifically I'm trying to sort a set of sentences so the least complex come first.
My thought was that I could use a compression library (zlib, perhaps), 'pre-train' it on a large corpus of text in the same language (call this the 'Corpus'), and then append the different sentences to that corpus of text.
That is, I could define the complexity measure for a sentence to be how many more bytes it requires to compress the whole corpus with that sentence appended, versus the whole corpus with a different sentence appended. (The fewer extra bytes, the more predictable or 'expected' that sentence is, and therefore the lower the complexity.) Does that make sense?
The problem is with trying to find the right library that will let me do this, preferably from python.
I could do this by literally appending sentences to a large corpus and asking a compression library to compress the whole shebang, but if possible, I'd much rather halt the processing of the compression library at the end of the corpus, take a snapshot of the relevant compression 'state', and then, with all that 'state' available try to compress the final sentence. (I would then roll back to the snapshot of the state at the end of the corpus and try a different sentence).
Can anyone help me with a compression library that might be suitable for this need? (Something that lets me 'freeze' its state after 'pre training'.)
I'd prefer to use a library I can call from Python, or Scala. Even better if it is pure python (or pure scala)
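To make the measure concrete, the brute-force version (literally re-compressing the whole shebang, with no state snapshot) would be something like:

```python
import zlib

def extra_bytes(corpus: bytes, sentence: bytes) -> int:
    """How many more bytes the corpus costs to compress with the sentence appended."""
    baseline = len(zlib.compress(corpus, 9))
    return len(zlib.compress(corpus + sentence, 9)) - baseline
```

What I'm looking for is a way to avoid paying for the corpus part on every call.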
All this is going to do is tell you whether the words in the sentence, and maybe phrases in the sentence, are in the dictionary you supplied. I don't see how that's complexity. More like grade level. And there are better tools for that. Anyway, I'll answer your question.
Yes, you can preset the zlib compressor with a dictionary. The dictionary is simply up to 32K bytes of text. You don't need to run zlib on the dictionary or "freeze a state" -- you simply start compressing the new data, but permit it to look back at the dictionary for matching strings. However, 32K isn't very much. That's as far back as zlib's deflate format will look, and you can't load much of the English language into 32K bytes.
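For example (a minimal sketch; the zdict argument has been available on zlib.compressobj since Python 3.3):

```python
import zlib

def sentence_cost(corpus: bytes, sentence: bytes) -> int:
    zdict = corpus[-32768:]                      # deflate can only look back 32K
    comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(sentence) + comp.flush())
```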
LZMA2 also allows for a preset dictionary, but it can be much larger, up to 4 GB. There is a Python binding for the LZMA2 library, but you may need to extend it to provide the dictionary preset function.
Background info: I am teaching myself concurrent programming in python, to do this I am implementing a version of grep that splits the task of searching into work units to be executed on separate cores.
I noticed in this question that grep is able to search quickly due to some optimisations, a key optimisation being that it avoids reading every byte of its input files. An example of this is that the input is read into one buffer rather than being split up based on where newlines are found.
I would like to try out splitting large input files into smaller work units but without reading each byte to find new lines or anything similar to determine split points. My plan is to split the input in half (the splits simply being offsets), then split those halves into halves continuing until they are of manageable (possibly predetermined) sizes - naturally to do this you need to know the size of your input.
The Question: is it possible to calculate or estimate the number of characters in a plain text file, if the size of the file is known and the encoding is also known?
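To illustrate the plan above, here is a rough sketch of the offset-only splitting (all names are mine):

```python
import os

def split_points(path, max_chunk):
    """Recursively halve [start, end) byte ranges until each chunk is small enough."""
    def halve(start, end):
        if end - start <= max_chunk:
            return [(start, end)]
        mid = (start + end) // 2
        return halve(start, mid) + halve(mid, end)
    return halve(0, os.path.getsize(path))
```

My understanding so far is that for a fixed-width encoding the character count would be exact (size // bytes_per_char), while for something like UTF-8 it can only be bounded, somewhere between size // 4 and size characters.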
I am working on making an index for a text file. The index will be a list of every word and every symbol like (~!##$%^&*()_-{}:"<>?/.,';[]1234567890|), counting the number of times each token occurred in the text file, and printing all of this in ascending ASCII value order.
I am going to read a .txt file, split out the words and special characters, and store them in a list. Can anyone throw me an idea on how to use binary search in this case?
If your lookup is small (say, up to 1000 records), then you can use a dict; you can either pickle() it, or write it out to a text file. Overhead (for this size) is fairly small anyway.
If your look-up table is bigger, or there are a small number of lookups per run, I would suggest using a key/value database (e.g. dbm).
If it is too complex to use a dbm, use a SQL (e.g. sqlite3, MySQL, Postgres) or NoSQL database. Depending on your application, you can get a huge benefit from the extra features these provide.
In any of these cases, all the hard work is done for you, and done much better than you could expect to do it yourself. These formats are all standard, so you get simple-to-use tools to read the data.
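A minimal sketch of the first (dict) option using collections.Counter; the whitespace tokenization is deliberately naive and the file names are made up:

```python
import pickle
from collections import Counter

def build_index(path):
    with open(path, encoding='utf-8') as f:
        return Counter(f.read().split())     # token -> number of occurrences

counts = build_index('book.txt')
for token in sorted(counts):                 # ascending ASCII/Unicode order
    print(token, counts[token])

with open('index.pickle', 'wb') as out:      # or pickle it for later lookups
    pickle.dump(counts, out)
```

A dict (or Counter) lookup is a hash lookup, so you don't need to implement binary search yourself for this part.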