I'm trying to write an algorithm that can work out the 'unexpectedness' or 'information complexity' of a sentence. More specifically I'm trying to sort a set of sentences so the least complex come first.
My thought was that I could use a compression library (like zlib), 'pre-train' it on a large corpus of text in the same language (call this the 'Corpus'), and then append the different sentences to that corpus of text.
That is, I could define the complexity measure for a sentence to be how many more bytes it requires to compress the whole corpus with that sentence appended, versus the whole corpus with a different sentence appended. (The fewer extra bytes, the more predictable or 'expected' that sentence is, and therefore the lower its complexity.) Does that make sense?
The problem is with trying to find the right library that will let me do this, preferably from python.
I could do this by literally appending sentences to a large corpus and asking a compression library to compress the whole shebang, but if possible, I'd much rather halt the processing of the compression library at the end of the corpus, take a snapshot of the relevant compression 'state', and then, with all that 'state' available try to compress the final sentence. (I would then roll back to the snapshot of the state at the end of the corpus and try a different sentence).
Can anyone help me with a compression library that might be suitable for this need? (Something that lets me 'freeze' its state after 'pre training'.)
I'd prefer a library I can call from Python or Scala. Even better if it is pure Python (or pure Scala).
All this is going to do is tell you whether the words in the sentence, and maybe phrases in the sentence, are in the dictionary you supplied. I don't see how that's complexity. More like grade level. And there are better tools for that. Anyway, I'll answer your question.
Yes, you can preset the zlib compressor with a dictionary. All it is is up to 32K bytes of text. You don't need to run zlib on the dictionary or "freeze a state" -- you simply start compressing the new data, but permit it to look back at the dictionary for matching strings. However, 32K isn't very much. That's as far back as zlib's deflate format will look, and you can't load much of the English language in 32K bytes.
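For what it's worth, here is a minimal sketch of how the preset dictionary could drive your ranking idea, using the zdict parameter of zlib.compressobj() (available since Python 3.3). The corpus file name and the scoring function are illustrative assumptions, not part of zlib:

import zlib

def make_scorer(corpus_text):
    # deflate only looks back 32K, so only the last 32K bytes of the corpus matter
    zdict = corpus_text.encode('utf-8')[-32768:]

    def score(sentence):
        comp = zlib.compressobj(level=9, zdict=zdict)
        compressed = comp.compress(sentence.encode('utf-8')) + comp.flush()
        return len(compressed)  # fewer bytes -> more 'expected' sentence

    return score

score = make_scorer(open('corpus.txt', encoding='utf-8').read())  # hypothetical corpus file
sentences = ["the cat sat on the mat", "zygomorphic quokkas extemporize"]
print(sorted(sentences, key=score))  # least complex first

As a side note, zlib's Compress objects also have a copy() method, which is the closest thing to the 'snapshot the state and roll back' idea in the question, within the same 32K window limitation.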
LZMA2 also allows for a preset dictionary, but it can be much larger, up to 4 GB. There is a Python binding for the LZMA2 library, but you may need to extend it to provide the dictionary preset function.
I'm trying to extract morphs/similar words in the Sinhala language using fastText.
But fastText takes about 1 second for every 2.64 words. How can I increase the speed without changing the model size?
My code looks like this:
import fasttext
import fasttext.util  # needed for download_model()

fasttext.util.download_model('si', if_exists='ignore')  # Sinhala
ft = fasttext.load_model('cc.si.300.bin')

words_file = open(r'/Datasets/si_words_filtered.txt')
words = words_file.readlines()
words = words[0:300]

synon_dict = dict()

from tqdm import tqdm_notebook
for i in tqdm_notebook(range(len(words))):
    word = words[i].strip()
    synon = ft.get_nearest_neighbors(word)[0][1]  ### takes a lot of time
    if is_strictly_sinhala_word(synon):
        synon_dict[word] = synon

import json
with open("out.json", "w", encoding='utf8') as f:
    json.dump(synon_dict, f, ensure_ascii=False)
To do a fully accurate get_nearest_neighbors()-type of calculation is inherently fairly expensive, requiring a lookup & calculation against every word in the set, for each new word.
As it looks like that set of vectors is near or beyond 2GB in size, when just the word-vectors are loaded, that means a scan of 2GB of addressable memory may be the dominant factor in the runtime.
Some things to try that might help:
Ensure that you have plenty of RAM - if there's any use of 'swap'/virtual-memory, that will make things far slower.
Avoid all unnecessary comparisons - for example, perform your is_strictly_sinhala_word() check before the expensive step, so you can skip the costly step if not interested in the results. Also, you could consider shrinking the full set of word-vectors to eliminate those that you are unlikely to want as responses. This might involve throwing out words you know are not of the language-of-interest, or all lower-frequency words. (If you can throw out half the words as possible nearest-neighbors before even trying the get_nearest_neighbors(), it will go roughly twice as fast.) More on these options below.
Try other word-vector libraries, to see if they offer any improvement. For example, the Python Gensim project can load either plain sets of full-word vectors (eg, the cc.si.300.vec words-only file) or FastText models (the .bin file), and offers a .most_similar() function that has some extra options & might, in some cases, offer different performance. (Though, the official Facebook Fasttext .get_nearest_neighbors() is probably pretty good.)
Use an "approximate nearest neighbors" library to pre-build an index of the word-vector space that can then offer extra-fast nearest-neighbor lookups - although at some risk of not finding the exact right top-N neighbors. There are many such libraries – see this benchmarking project that compares over 20 of them. But, adding this step complicates things & the tradeoff of that complexity & the imperfect result may not be worth the effort & time-savings. So, just remember that it's a possibility if your need is large enough & nothing else helps. (A short sketch using one such library follows this list.)
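To make that last option concrete, here's a hedged sketch using the Annoy library, one of the libraries covered by that benchmark. It assumes ft is the fastText model loaded in the question; the tree count and the 'angular' metric are illustrative choices:

from annoy import AnnoyIndex

DIM = 300  # cc.si.300 vectors are 300-dimensional

index = AnnoyIndex(DIM, 'angular')
id_to_word = ft.get_words()            # vocabulary of the loaded fastText model
for i, w in enumerate(id_to_word):
    index.add_item(i, ft.get_word_vector(w))
index.build(10)                        # more trees -> better accuracy, slower build

def approx_neighbors(word, topn=10):
    ids = index.get_nns_by_vector(ft.get_word_vector(word), topn)
    return [id_to_word[i] for i in ids]

The index is built once up front (which itself takes time and memory), after which each lookup avoids the full scan of all word-vectors.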
With regard to slimming the set of vectors searched:
The Gensim KeyedVectors.load_word2vec_format() function, which can load the .vec words-only file, has an option limit that will only read the specified number of words from the file. It looks like the .vec file for your dataset has over 800k words - but if you chose to load only 400k, your .most_similar() calculations would go about twice as fast. (And, since such files typically front-load the files with the most-common words, the loss of the far-rarer words may not be a concern.)
Similarly, even if you load all the vectors, the Gensim .most_similar() function has a restrict_vocab option that can limit searches to just the first N words (for whatever count you give), which could also speed things up or helpfully drop obscure words that may be of less interest.
The .vec file may be easier to work with if you wanted to pre-filter the words to, for example, eliminate non-Sinhala words. (Note: the usual .load_word2vec_format() text format needs a 1st line that declares the count of words & word-dimensionality, but you may leave that off, then load using the no_header=True option, which instead uses 2 full passes over the file to get the count.)
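Putting the limit, restrict_vocab, and no_header options together, a short sketch (the file names follow the question, and the exact counts are just illustrative):

from gensim.models import KeyedVectors

# Load only the 400k most-frequent vectors from the words-only .vec file.
kv = KeyedVectors.load_word2vec_format('cc.si.300.vec', limit=400000)

# Search only the first 200k of the loaded words; 'word' is a placeholder query.
print(kv.most_similar('word', topn=10, restrict_vocab=200000))

# If you pre-filtered the .vec file yourself and dropped its header line:
# kv = KeyedVectors.load_word2vec_format('filtered.vec', no_header=True)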
I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.
I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.
EDIT:
The input is a string which is often a proper name and therefore in no standard (homophone) dictionary. An example could be Google or McDonald's (just to name two popular named entities, but many are much more unpopular).
The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.
How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.
But suppose that you want to roll your own.
The first step is figuring out how to turn the letters that you are given into a representation of what it sounds like. This is a very hard problem with guessing required. (e.g. What sound does "read" make? Depends on whether you are going to read, or you already read!) However text to phonemes converter suggests that Arpabet has solved this for English.
Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no different from the sort of algorithms that are used for autocorrecting spelling, only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html. Or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
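As a rough sketch of the 'do it for every word in a dictionary and index by sound' step, here is one way to index CMUdict pronunciations with NLTK. The stress-stripping and exact-match lookup are simplifying assumptions; for your out-of-dictionary inputs you would feed in the phonemes from a grapheme-to-phoneme converter instead:

import nltk
from collections import defaultdict

nltk.download('cmudict', quiet=True)
from nltk.corpus import cmudict

prondict = cmudict.dict()   # word -> list of phoneme sequences

# Index dictionary words by pronunciation so exact homophones are easy to find.
by_phonemes = defaultdict(list)
for word, prons in prondict.items():
    for pron in prons:
        key = tuple(p.rstrip('012') for p in pron)   # drop stress digits
        by_phonemes[key].append(word)

def homophones(word):
    results = set()
    for pron in prondict.get(word.lower(), []):
        key = tuple(p.rstrip('012') for p in pron)
        results.update(w for w in by_phonemes[key] if w != word.lower())
    return sorted(results)

print(homophones('read'))   # e.g. ['red', 'reed'], depending on pronunciation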
And that is it.
So, I have written an autocomplete and autocorrect program in Python 2. I have written the autocorrect program using the approach mentioned in Peter Norvig's blog on how to write a spell checker (http://norvig.com/spell-correct.html).
Now, I am using a trie data structure implemented using nested lists. I am using a trie as it can give me all words starting with a particular prefix. At the leaf would be a tuple with the word and a value denoting the frequency of the word. For example, the words bad, bat, cat would be saved as:
['b', ['a', ['d', ('bad', 4), 't', ('bat', 3)]], 'c', ['a', ['t', ('cat', 4)]]]
Where 4, 3, 4 are the number of times the words have been used (their frequency values). Similarly, I have made a trie of about 130,000 words of the English dictionary and stored it using cPickle.
Now, it takes about 3-4 seconds for the entire trie to be read each time. The problem is that each time a word is encountered the frequency value has to be incremented, and then the updated trie needs to be saved again. As you can imagine, it is a big problem to wait 3-4 seconds to read and then again that much time to save the updated trie each time. I will need to perform a lot of update operations each time the program is run and save them.
Is there a faster or efficient way to store a large data structure which repeatedly will be updated? How are the data structures of the autocorrect programs in IDEs and mobile devices saved & retrieved so fast? I am open to different approaches as well.
A few things come to mind.
1) Split the data. Say, use 26 files, each storing the tries starting with a certain character. You can improve on that by splitting on a longer prefix. This way the amount of data you need to write each time is smaller.
2) Don't reflect everything to disk. If you need to perform a lot of operations, do them in RAM (memory) and write them out at the end. If you're afraid of data loss, you can checkpoint your computation after some time X or after a number of operations.
3) Multi-threading. Unless your program only does spellchecking, it's likely there are other things it needs to do. Have a separate thread that does the loading and writing so that it doesn't block everything while it does disk IO. Multi-threading in Python is a bit tricky, but it can be done (see the sketch after this list).
4) Custom structure. Part of the time spent in serialization is invoking serialization functions. Since you have a dictionary for everything that's a lot of function calls. In the perfect case you should have a memory representation that matches exactly the disk representation. You would then simply read a large string and put it into your custom class (and write that string to disk when you need to). This is a bit more advanced and likely the benefits will not be that huge especially since python is not so efficient in playing with bits, but if you need to squeeze the last bit of speed out of it, this is the way to go.
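A minimal sketch of points 2 and 3 combined: keep the counts in memory and checkpoint them from a background thread, so a crash loses at most save_interval seconds of updates. It uses a flat dict in place of the nested-list trie just to keep the example short; the class name, file path, and save interval are illustrative:

import pickle
import threading
import time

class FrequencyStore:
    # Keep counts in RAM; checkpoint them periodically from a background thread.

    def __init__(self, path='freqs.pickle', save_interval=30):
        self.path = path
        self.save_interval = save_interval
        self.lock = threading.Lock()
        try:
            with open(path, 'rb') as f:
                self.counts = pickle.load(f)       # read from disk once, at startup
        except (IOError, EOFError):
            self.counts = {}
        t = threading.Thread(target=self._saver)
        t.daemon = True
        t.start()

    def increment(self, word):
        # Updates only touch memory; nothing is written to disk here.
        with self.lock:
            self.counts[word] = self.counts.get(word, 0) + 1

    def _saver(self):
        while True:
            time.sleep(self.save_interval)
            with self.lock:
                snapshot = dict(self.counts)       # copy under the lock, write outside it
            with open(self.path, 'wb') as f:
                pickle.dump(snapshot, f, pickle.HIGHEST_PROTOCOL)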
I would suggest moving serialization to a separate thread and running it periodically. You don't need to re-read your data each time because you already have the latest version in memory. This way your program stays responsive to the user while the data is being saved to disk. The saved version on disk may lag behind, and the latest updates may get lost in case of a program crash, but this shouldn't be a big issue for your use case, I think.
It depends on the particular use case and environment, but I think most programs that have local data sets sync them using multi-threading.
I'm doing some machine extraction of sometimes-garbled PDF text, which often ends up with words incorrectly split up by spaces, or chunks of words put in incorrect order, resulting in pure gibberish.
I'd like a tool that can scan through and recognize these chunks of pure gibberish while skipping non-dictionary words that are likely to be proper names or simply words in a foreign language.
Not sure if this is even possible, but if it is I imagine something like this could be done using NLTK. I'm just wondering if this has been done before to save me the trouble of reinventing the wheel.
Hm, I imagine you could train an SVM or neural network on character n-grams... but you'd need pretty darn long ones. The problem is that this would probably have a high rate of false negatives (throwing out what you wanted) because you can have drastically different rates of character clusters in various languages.
Take Polish, for example (it's my only second language written in easy-to-type Latin characters). Skrzywdy would be a highly unlikely series of letters in English, but is easily pronounceable in Polish.
A better technique might be to use language detection to detect languages used in a document above a certain probability, and then check the dictionaries for those languages...
This won't help for (for instance) a Linguistics textbook where a large variety of snippets of various languages are frequently used.
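A rough sketch of that language-detection idea, assuming the langdetect package and NLTK's English word list are available; the probability threshold and the single-dictionary check are illustrative simplifications (a fuller version would consult a dictionary per detected language and skip likely proper names):

import nltk
from langdetect import detect_langs

nltk.download('words', quiet=True)
from nltk.corpus import words

ENGLISH = set(w.lower() for w in words.words())

def likely_languages(text, threshold=0.2):
    # Language codes detected in the document with probability above the threshold.
    return [lang.lang for lang in detect_langs(text) if lang.prob >= threshold]

def flag_gibberish(tokens, dictionary=ENGLISH):
    # Tokens that appear in none of the relevant dictionaries are gibberish candidates.
    return [t for t in tokens if t.lower() not in dictionary]

text = "the quick brown fox jlkqz xwrtp over the lazy dog"
print(likely_languages(text))        # e.g. ['en']
print(flag_gibberish(text.split()))  # e.g. ['jlkqz', 'xwrtp']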
** EDIT **
Idea 2:
You say this is bibliographic information. Meta-information like its position in the text, or any font information your OCR software is returning to you, is almost certainly more important than the series of characters you see showing up. If it's in the title, or near the position where the author goes, or in italics, it's worth considering as foreign...
I would like to fit 80M strings of length < 20 characters in memory and use as little memory as possible.
I would like a compression library that I can drive from Python, that will allow me to compress short (<20 char) English strings. I have about 80M of them, and I would like them to fit in as little memory as possible.
I would like maximum lossless compression. CPU time is not the bottleneck.
I don't want the dictionary stored with each string, because that would be high overhead.
I want to compress to <20% of the original size. This is plausible, given that the upper bound on the entropy of English is about 1.75 bits per character (Brown et al., 1992, http://acl.ldc.upenn.edu/J/J92/J92-1002.pdf) = 22% compression (1.75/8).
Edit:
I can't use zlib because the header is too large. (If I have a string that is only 20 bytes, there can be NO header for there to be good compression. zlib header = 200 bytes according to Roland Illing. I haven't double-checked, but I know it's bigger than 20.)
Huffman coding sounds nice, except it is based upon individual tokens, and can't do ngrams (multiple characters).
smaz has a crappy dictionary, and compresses to only 50%.
I strongly prefer to use existing code, rather than implement a compression algorithm.
I don't want the dictionary stored with each string, because that would be high overhead.
So build a single string with all of the desired contents, and compress it all at once with whichever solution. This solves the "header is too large" problem as well.
You can do this in a variety of ways. Probably the simplest is to create the repr() of a list of the strings; or you can use the pickle, shelve or json modules to create some other sort of serialized form.
Make a dictionary of all words. Then, convert all words to numbers corresponding to the offset in the dictionary. If needed, you can use the first bit to indicate that the word is capitalized.
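A minimal sketch of that idea; the fixed 3-byte id width (room for about 16M distinct words) is an illustrative assumption, and the capitalization bit is left out for brevity:

vocab = {}        # word -> id
id_to_word = []   # id -> word

def word_id(w):
    if w not in vocab:
        vocab[w] = len(id_to_word)
        id_to_word.append(w)
    return vocab[w]

def encode(s):
    # Each word becomes a fixed-width 3-byte little-endian id.
    return b''.join(word_id(w).to_bytes(3, 'little') for w in s.split())

def decode(blob):
    ids = (int.from_bytes(blob[i:i+3], 'little') for i in range(0, len(blob), 3))
    return ' '.join(id_to_word[i] for i in ids)

strings = ["the quick brown fox", "the lazy dog", "quick brown dog"]
encoded = [encode(s) for s in strings]
assert all(decode(e) == s for e, s in zip(encoded, strings))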
How about using zipfile from the standard library?
There are no more than 128 different characters in English strings. Hence you can describe each character with a 7-bit code. See Compressing UTF-8(or other 8-bit encoding) to 7 or fewer bits
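A small sketch of that packing, assuming pure-ASCII input (ordinals below 128); it saves exactly one bit per character, i.e. 12.5%:

def pack7(s):
    bits, nbits = 0, 0
    out = bytearray()
    for ch in s.encode('ascii'):
        bits = (bits << 7) | ch        # append 7 bits per character
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)   # pad the final byte
    return bytes(out)

print(len(pack7("twenty-char example!")))   # 18 bytes instead of 20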
First, if you compress each 20-bytes string individually, your compression ratio will be miserable. You need to compress a lot of strings together to really witness some tangible benefits.
Second, 80M strings is a lot, and if you have to decompress them all to extract a single one of them, you'll be displeased by performance. Chunk your input into smaller but still large enough blocks. A typical value would be 64KB, translating into 3200 strings.
Then, you can compress each 64KB block independently. When you need to access a single string in a block, you need to decode the entire block.
So here, there is a trade-off to decide between compression ratio (which prefers larger blocks) and random-access speed (which prefers smaller blocks). You'll be the judge to select the best one.
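A minimal sketch of that block scheme, using zlib from the standard library just because it needs no extra install; an LZ4 binding (see the note below) could be swapped in at the two marked calls. The 64KB block size and the newline-delimited layout are illustrative assumptions:

import zlib

BLOCK_SIZE = 64 * 1024

def build_blocks(strings):
    # Pack strings into ~64KB blocks; return the compressed blocks plus an
    # index mapping string number -> (block number, offset within the block).
    blocks, index = [], []
    current, used = [], 0
    for s in strings:
        data = s.encode('utf-8') + b'\n'
        if used + len(data) > BLOCK_SIZE and current:
            blocks.append(zlib.compress(b''.join(current), 9))   # <- codec call
            current, used = [], 0
        index.append((len(blocks), used))
        current.append(data)
        used += len(data)
    if current:
        blocks.append(zlib.compress(b''.join(current), 9))       # <- codec call
    return blocks, index

def get_string(blocks, index, i):
    block_no, offset = index[i]
    data = zlib.decompress(blocks[block_no])       # the whole block is decoded
    return data[offset:data.index(b'\n', offset)].decode('utf-8')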
Quick note: random access on an in-memory structure usually favors fast compression algorithms, rather than strong ones. If you compress only once, but random-access a lot of times, prefer some highly asymmetric algorithms, such as LZ4-HC:
http://code.google.com/p/lz4hc/
According to benchmark, compression speed is only 15MB/s, but decoding speed is about 1GB/s. That translates into 16K blocks of 64KB decoded per second...