Optimize runtime of pattern matching - python

Suppose you have two lists: the first contains 700 words and the second contains 30,000 possible sentence beginnings. That gives 21,000,000 combinations of sentence beginning and word.
Furthermore, there are about 400 files with results for every possible sentence + word combination. Every file consists of 170,000,000 lines and is structured as follows:
this is the first sentence
1. result for that sentence
2. result for that sentence
...
nth result for that sentence
this is the second sentence
...
this is the nth sentence
...
For every possible sentence + word combination I would like to find the results file that carries some information about the combination (for every combination there is only one results file in which the combination occurs) and read the results out. You could do it in a for loop:
import os

all_results = []
# create combinations
for sentence in sentencelist:
    for word in wordlist:
        combo = str(sentence + ' ' + word)
        # loop through the results files while no result for the combination has been found
        c_result = []
        while not c_result:
            for resultsfilename in os.listdir(resultsdirectory):
                # os.listdir() returns bare names, so join them with the directory
                with open(os.path.join(resultsdirectory, resultsfilename), 'r') as infile:
                    results = infile.read().splitlines()
                    if combo in results:
                        c_result = function_to_find_pattern_based_on_continuation(continuation, results)
        # append results and reset c_result
        all_results.append(c_result)
        c_result = []
However, this algorithm has quite a long runtime, and I'm wondering how it could be improved. For instance, how can I avoid loading the results files over and over again? Furthermore, I would like to create a copy of the results files so that, once the results for a sentence + word combination have been read, they can be deleted from the copy (I don't want to change the files on the drive). However, every results file is about 7 GB, so it would not make sense to store every file in a variable, right?
Are there other things that could be used to improve the runtime?
Edit1: Adapted the size of the lists
Edit2: Add while loop and comments in code

You have two problems here, as I understand it:
1. You need some way to reduce I/O on several large files.
2. You need a way to modify/copy some of these large files.
There are a few ways I think you can solve these issues. Firstly, if it is possible, I would use a database such as SQLite; this removes the problem of repeatedly opening and closing files.
Secondly, you can use Python's yield keyword in your for loop (place it in its own function), iterate over the result as a generator, and edit it like a stream as you go. This lets you store the results (say, in a file) without putting them all in a list, which would run out of memory pretty fast by the sound of it.
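A minimal sketch of how the two suggestions could fit together, under stated assumptions: a line counts as a sentence header exactly when it equals one of the known combinations, and everything up to the next header is its result block. The names iter_result_blocks, store_blocks and lookup, and the table layout, are hypothetical and not from the question.

```python
import sqlite3
from typing import Iterable, Iterator, List, Set, Tuple


def iter_result_blocks(lines: Iterable[str], headers: Set[str]) -> Iterator[Tuple[str, List[str]]]:
    """Stream over lines and yield (combination, result_lines) pairs."""
    current = None
    block: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line in headers:  # assumption: headers are exactly the known combinations
            if current is not None:
                yield current, block
            current, block = line, []
        elif current is not None:
            block.append(line)
    if current is not None:
        yield current, block  # emit the last block of the file


def store_blocks(conn: sqlite3.Connection, blocks) -> None:
    """Write every block into a table keyed by the combination (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS results (combo TEXT PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO results VALUES (?, ?)",
        ((combo, "\n".join(body)) for combo, body in blocks),
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, combo: str) -> List[str]:
    """Return the stored result lines for one combination, or an empty list."""
    row = conn.execute("SELECT body FROM results WHERE combo = ?", (combo,)).fetchone()
    return row[0].split("\n") if row and row[0] else []
```

In use, each results file would be opened once and passed in as lines (a file object iterates line by line), so no 7 GB file is ever held in memory, and later lookups go to the database instead of rescanning the files.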

Related

Optimize Word Generator

I am trying to build a program capable of finding the best word in a scrabble game. In the following code, I am trying to create a list of all the possible words given a set of 7 characters.
import csv
import itertools
with open('dictionary.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
def FindLegalWords(data):
    LegalWords = []
    for i in data:
        if len(i[0]) <= 15:
            LegalWords.append(i[0])
    return LegalWords
PossibleWords = []
def word_generator(chars, start_with, min_len, max_len):
    for i in range(min_len - 1, max_len):
        for s in itertools.product(chars, repeat=i):
            yield start_with + ''.join(s)
for word in word_generator('abcdefg', '', 2, 15):
    if word in FindLegalWords(data):
        PossibleWords.append(word)
I think it is clear that the code above will take days to find all the possible words. What would be a better approach to the problem? Personally, I thought of turning each word into a number and using NumPy to manipulate them, because I have heard that NumPy is very quick. Would this solve the problem, or would it not be enough? I will be happy to answer any questions that arise about my code.
Thank you in advance
There are about 5,539 billion possibilities, and code working with strings is generally pretty slow (partially due to Unicode and allocations). This is huge. Generating a massive amount of data only to filter most of it out is not efficient. This algorithmic problem cannot be fixed using optimized libraries like NumPy. One solution is to directly generate a much smaller subset of all possible values that still fits FindLegalWords. I guess you probably do not want to generate words like "bfddgfbgfgd". Thus, you can generate pronounceable words by concatenating 2 pronounceable word parts. Doing this is a bit tricky though. A much better solution is to retrieve the possible words from an existing dictionary. You can find such lists online. There are also some dictionaries of pronounceable words that can be retrieved from free password databases. AFAIK, some tools like John-the-Ripper can generate such a list of words, which you can store in a text file and then read from your Python program. Note that since the list can be huge, it is better to compress the file and read it directly from the compressed source.
Some notes regarding the update:
Since FindLegalWords(data) is a constant, you can store it so as not to recompute it over and over. You can even compute set(FindLegalWords(data)) to make the membership test faster. Still, the number of possibilities is the main problem, so this will not be enough.
PossibleWords will contain all possible subsets of all strings in FindLegalWords(data). Thus, you can generate it directly from data rather than using a brute-force approach combined with a check. This should be several orders of magnitude faster if data is small. Otherwise, the main problem will be that PossibleWords will be so big that your RAM will certainly not be big enough to contain it anyway...
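One way to read "generate it directly from data" is to walk the dictionary once and keep only the words that can be built from the given letters. The sketch below is an assumption about what that could look like, not the answerer's code: words_from_letters and its parameters are hypothetical, and reuse_letters=True mirrors the question's itertools.product, which allows a letter to be repeated, while reuse_letters=False treats the letters as a rack of tiles.

```python
from collections import Counter


def words_from_letters(dictionary, chars, min_len=2, max_len=15, reuse_letters=True):
    """Return the dictionary words that can be spelled with the given letters."""
    allowed = set(chars)
    rack = Counter(chars)
    found = []
    for word in dictionary:
        if not (min_len <= len(word) <= max_len):
            continue
        if reuse_letters:
            # every letter of the word must be one of the allowed letters
            ok = set(word) <= allowed
        else:
            # the word must not need more copies of a letter than the rack holds
            ok = not (Counter(word) - rack)
        if ok:
            found.append(word)
    return found
```

The work is proportional to the size of the dictionary rather than to the number of letter sequences, which is why no brute-force generation is needed.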

Detect bigram collection on a very large text dataset

I would like to find bigrams in a large corpus in text format. As the corpus cannot be loaded into memory at once and its lines are very long, I load it in chunks of 1 KB each:
def read_in_chunks(filename, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = filename.read(chunk_size)
        if not data:
            break
        yield data
Then I want to go piece by piece through the corpus and find bigrams and I use the gensim Phrases() and Phraser() functions, but while training, my model constantly loses state. Thus, I tried to save and reload the model after each megabyte that I read and then free the memory, but it still loses state. My code is here:
with open("./final/corpus.txt", "r", encoding='utf8') as r:
    max_vocab_size=20000000
    phrases = Phrases(max_vocab_size=max_vocab_size)
    i=1
    j=1024
    sentences = ""
    for piece in read_in_chunks(r):
        if i<=j:
            sentences = sentences + piece
        else:
            phrases.add_vocab(sentences)
            phrases = Phrases(sentences)
            phrases = phrases.save('./final/phrases.txt')
            phrases = Phraser.load('./final/phrases.txt')
            sentences = ""
            j+=1024
        i+=1
print("Done")
Any suggestion?
Thank you.
When you do the two lines...
phrases.add_vocab(sentences)
phrases = Phrases(sentences)
...that 2nd line is throwing away any existing instance inside the phrases variable, and replacing it with a brand new instance (Phrases(sentences)). There's no chance for additive adjustment to the single instance.
Secondarily, there's no way two consecutive lines of .save()-then-immediate-re-.load() can save net memory usage. At the very best, the .load() would be unnecessary, only exactly reproducing what was just .save()d, but wasting a lot of time and temporary memory loading a second copy, then discarding the duplicate that was already in phrases to assign phrases to the new clone.
While these are problems, more generally, the issue is that what you need done doesn't have to be this complicated.
The Phrases class will accept as its corpus of sentences an iterable object where each item is a list-of-string-tokens. You don't have to worry about chunk-sizes, and calling add_vocab() multiple times – you can just provide a single object that itself offers up each item in turn, and Phrases will do the right thing. You do have to worry about breaking up raw lines into the specific words you want considered ('tokenization').
(For a large corpus, you might still run into memory issues related to the number of unique words that Phrases is trying to count. But it won't matter how arbitrarily large the number of items is – because it will only look at one at a time. Only the accumulation of unique words will consume running memory.)
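To illustrate the memory point above, here is a toy bigram counter in plain Python. It is not how gensim's Phrases is implemented; count_bigrams is a hypothetical name used only for this sketch. It consumes one tokenized sentence at a time, so the only thing that grows is the table of distinct bigrams.

```python
from collections import Counter


def count_bigrams(sentences):
    """Count adjacent token pairs from an iterable of token lists."""
    counts = Counter()
    for tokens in sentences:
        # zip pairs each token with its successor; only one sentence is held at a time
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

Passing a generator, for example (line.split() for line in some_file), keeps the corpus itself out of memory regardless of how many lines it has.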
For a good intro to how an iterable object can work in such situations, a good blog post is:
Data streaming in Python: generators, iterators, iterables
If your corpus.txt file is already set up to be one reasonably-sized sentence per line, and all words are already delimited by simple spaces, then an iterable class might be as simple as:
class FileByLineIterable(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        with open(self.filename, 'r', encoding='utf8') as src:
            # iterate over the file object directly so only one line is in memory at a time
            for line in src:
                yield line.split()
Then, your code might just be as simple as...
sentences = FileByLineIterable('./final/corpus.txt')
phrases = Phrases(sentences, max_vocab_size=max_vocab_size)
...because the Phrases class is getting what it wants – a corpus that offers via iteration just one list-of-words item at a time.
Note:
you may want to enable logging at the INFO level to monitor progress and watch for any hints of things going wrong
there's a slightly more advanced line-by-line iterator, which also limits any one line's text to no more than 10,000 tokens (to match an internal implementation limit of gensim Word2Vec), and opens files from places other than local file-paths, available at gensim.models.word2vec.LineSentence. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence
(Despite this being packaged in the word2vec package, it can be used elsewhere.)

Load a part of Glove vectors with gensim

I have a word list like ['like','Python'] and I want to load the pre-trained GloVe word vectors for these words, but the GloVe file is too large. Is there a fast way to do it?
What I tried
I iterated through each line of the file to see if the word is in the list and add it to a dict if True. But this method is a little slow.
import pandas as pd

def readWordEmbeddingVector(Wrd):
    f = open('glove.twitter.27B/glove.twitter.27B.200d.txt','r')
    words = []
    a = f.readline()
    while a!= '':
        vector = a.split()
        if vector[0] in Wrd:
            words.append(vector)
            Wrd.remove(vector[0])
        a = f.readline()
    f.close()
    words_vector = pd.DataFrame(words).set_index(0).astype('float')
    return words_vector
I also tried below, but it loaded the whole file instead of vectors I need
gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('word2vec.twitter.27B.200d.txt')
What I want
Method like gensim.models.keyedvectors.KeyedVectors.load_word2vec_format but I can set a word list to load.
There's no existing gensim support for filtering the words loaded via load_word2vec_format(). The closest is an optional limit parameter, which can be used to limit how many word-vectors are read (ignoring all subsequent vectors).
You could conceivably create your own routine to perform such filtering, using the source code for load_word2vec_format() as a model. As a practical matter, you might have to read the file twice: 1st, to find out exactly how many words in the file intersect with your set-of-words-of-interest (so you can allocate the right-sized array without trusting the declared size at the front of the file), then a second time to actually read the words-of-interest.
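As a rough illustration of such a custom routine, the sketch below is a single-pass variant that collects the vectors into a dict instead of a preallocated array, so it sidesteps the two-pass sizing step described above. It is not part of gensim. load_selected_vectors is a hypothetical name, and the sketch assumes the plain GloVe text layout of one word followed by space-separated floats per line, with no header line.

```python
def load_selected_vectors(lines, wanted):
    """Collect vectors for the wanted words from an iterable of GloVe-style text lines."""
    remaining = set(wanted)  # set membership tests are constant time on average
    vectors = {}
    for line in lines:
        if not remaining:
            break  # stop reading as soon as every wanted word has been found
        word, _, rest = line.rstrip().partition(" ")
        if word in remaining:
            vectors[word] = [float(x) for x in rest.split()]
            remaining.discard(word)
    return vectors
```

A file object can be passed as lines, for example with open(path, encoding='utf8') as f: load_selected_vectors(f, words). Using a set for the wanted words and stopping early are the two changes most likely to help compared with testing membership in a list for every line.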

how to speed up expanding list from map iterator

I have some text data in a pandas column. Each column value is one document, and each document is several sentences long.
I want to split each document into sentences and then get a list of words for each sentence. So if a document is 5 sentences long, I will have a list of 5 lists of words.
I used a mapper function to do some operations on that and got a list of words for each sentence of a text. Here is a mapper code:
def text_to_words(x):
    """ This function converts sentences in a text to a list of words
    """
    nlp=spacy.load('en')
    txt_to_words= [str(doc).replace(".","").split(" ") for doc in nlp(x).sents]
    return txt_to_words
Then I did this:
%%time
txt_to_words=map(text_to_words,pandas_df.log_text_cleaned)
It finished in 70 microseconds and I got a map iterator.
Now I want to add each document's list of lists of words as a value in a new column of the same pandas data frame.
I can simply do this:
txt_to_words=[*map(text_to_words,pandas_df.log_text_cleaned)]
Which will expand the map iterator and store it in txt_to_words as list of list of words.
But this process is very slow.
I even tried looping over the map object :
txt_to_words=map(text_to_words,pandas_df.log_text_cleaned)
txt_to_words_list=[]
for sent in txt_to_words:
txt_to_words_list.append(sent)
But this is similarly slow.
Extracting the output from a map object is very slow, and I have just 67K documents in that pandas data frame column.
Is there a way this can be sped up?
Thanks
The direct answer to your question is that the fastest way to convert an iterator to a list is probably by calling list on it, although that may depend on the size of your lists.
However, this is not going to matter, except to an unnoticeable, barely-measurable degree.
The difference between list(m), [*m], or even an explicit for statement is a matter of microseconds at most, but your code is taking seconds. In fact, you could even eliminate almost all the work done by list by using collections.deque(m, maxlen=0) (which just throws away all of the values without allocating anything or storing them), and you still won't see a difference.
Your real problem is that the work done for each element is slow.
Calling map doesn't actually do that work. All it does is construct a lazy iterator that sets up the work to be done later. When is later? When you convert the iterator to a list (or consume it in some other way).
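The laziness described above can be seen in a small sketch. The function slow_square and the calls list are made up for illustration; the list records when the function actually runs.

```python
calls = []


def slow_square(x):
    """Stand-in for an expensive per-element function; records each call."""
    calls.append(x)
    return x * x


lazy = map(slow_square, [1, 2, 3])  # builds the iterator; slow_square has not run yet
calls_before = list(calls)          # snapshot taken before the iterator is consumed
result = list(lazy)                 # consuming the iterator is what triggers the calls
```

Here calls_before is empty while calls holds all three inputs after list(lazy), which is why the map call itself looked instantaneous in the question.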
So, it's that text_to_words function that you need to speed up.
And there's at least one obvious candidate for how to do that:
def text_to_words(x):
    """ This function converts sentences in a text to a list of words
    """
    nlp=spacy.load('en')
    txt_to_words= [str(doc).replace(".","").split(" ") for doc in nlp(x).sents]
    return txt_to_words
You're loading in an entire English tokenizer/dictionary/etc. for each sentence? Sure, you'll get some benefit from caching after the first time, but I'll bet it's still way too slow to do for every sentence.
If you were trying to speed things up by making it a local variable rather than a global (which probably won't matter, but it might), that's not the way to do it; this is:
nlp=spacy.load('en')
def text_to_words(x, *, _nlp=nlp):
    """ This function converts sentences in a text to a list of words
    """
    txt_to_words= [str(doc).replace(".","").split(" ") for doc in _nlp(x).sents]
    return txt_to_words

Hashtables over large natural language word sets

I'm writing a program in Python to do a unigram (and eventually bigram etc.) analysis of movie reviews. The goal is to create feature vectors to feed into libsvm. I have 50,000-odd unique words in my feature vector (which seems rather large to me, but I am relatively sure I'm right about that).
I'm using the python dictionary implementation as a hashtable to keep track of new words as I meet them, but I'm noticing an enormous slowdown after the first 1000 odd documents are processed. Would I have better efficiency (given the distribution of natural language) if I used several smaller hashtable/dictionaries or would it be the same/worse?
More info:
The data is split into 1500 or so documents, 500-ish words each. There are between 100 and 300 unique words (with respect to all previous documents) in each document.
My current code:
#processes each individual file, tok == filename, v == predefined class
def processtok(tok, v):
    #n is the number of unique words so far,
    #reference is the mapping reference in case I want to add new data later
    #hash is the hashtable
    #statlist is the massive feature vector I'm trying to build
    global n
    global reference
    global hash
    global statlist
    cin=open(tok, 'r')
    statlist=[0]*43990
    statlist[0] = v
    lines = cin.readlines()
    for l in lines:
        line = l.split(" ")
        for word in line:
            if word in hash.keys():
                if statlist[hash[word]] == 0:
                    statlist[hash[word]] = 1
                else:
                    hash[word]=n
                    n+=1
                    ref.write('['+str(word)+','+str(n)+']'+'\n')
                    statlist[hash[word]] = 1
    cin.close()
    return statlist
Also keep in mind that my input data is about 6mb and my output data is about 300mb. I'm simply startled at how long this takes, and I feel that it shouldn't be slowing down so dramatically as it's running.
Slowing down: the first 50 documents take about 5 seconds, the last 50 take about 5 minutes.
@ThatGuy has made the fix, but hasn't actually told you this:
The major cause of your slowdown is the line
if word in hash.keys():
which (in Python 2, where keys() returns a list) laboriously makes a list of all the keys so far, then laboriously searches that list for word. The time taken is proportional to the number of keys, i.e. the number of unique words found so far. That's why it starts fast and becomes slower and slower.
All you need is if word in hash: which in 99.9999999% of cases takes time independent of the number of keys -- one of the major reasons for having a dict.
The faffing about with statlist[hash[word]] doesn't help, either. By the way, the fixed size in statlist=[0]*43990 needs explanation.
More problems
Problem A: Either (1) your code suffered from indentation distortion when you published it, or (2) hash will never be updated by that function. Quite simply, if word is not in hash i.e it's the first time you've seen it, absolutely nothing happens. The hash[word] = n statement (the ONLY code that updates hash) is NOT executed. So no word will ever be in hash.
It looks like this block of code needs to be shifted left 4 columns, so that it's aligned with the outer if:
    else:
        hash[word]=n
        ref.write('['+str(word)+','+str(n)+']'+'\n')
        statlist[hash[word]] = 1
Problem B: There is no code at all to update n (allegedly the number of unique words so far).
I strongly suggest that you take as many of the suggestions that @ThatGuy and I have made as you care to, rip out all the global stuff, fix up your code, chuck in a few print statements at salient points, and run it over say 2 documents each of 3 lines with about 4 words in each. Ensure that it is working properly. THEN run it on your big data set (with the prints suppressed). In any case you may want to put out stats (like number of documents, lines, words, unique words, elapsed time, etc) at regular intervals.
Another problem
Problem C: I mentioned this in a comment on @ThatGuy's answer, and he agreed with me, but you haven't mentioned taking it up:
>>> line = "foo bar foo\n"
>>> line.split(" ")
['foo', 'bar', 'foo\n']
>>> line.split()
['foo', 'bar', 'foo']
>>>
Your use of .split(" ") will lead to spurious "words" and distort your statistics, including the number of unique words that you have. You may well find the need to change that hard-coded magic number.
I say again: There is no code that updates n in the function. Doing hash[word] = n seems very strange, even if n is updated for each document.
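Putting these points together, a version without globals might look like the sketch below. It is an assumption about the intent, not the asker's or @ThatGuy's code: index_words is a hypothetical name, the vocabulary dict is passed in explicitly, and index 0 is assumed to stay reserved for the class label, as in statlist[0] = v.

```python
def index_words(lines, vocab):
    """Return the set of feature indices seen in one document, growing vocab as needed."""
    seen = set()
    for line in lines:
        for word in line.split():  # split() with no argument also drops the newline
            if word not in vocab:  # dict membership, no list of keys is built
                vocab[word] = len(vocab) + 1  # index 0 is reserved for the class label
            seen.add(vocab[word])
    return seen
```

Because the next free index is derived from len(vocab), there is no separate counter to forget to update, and the same vocab dict can be reused across all documents.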
I don't think Python's dictionary has anything to do with your slowdown here, especially when you are saying that the entries are around 100. I am assuming that you are referring to insertion and retrieval, which are both O(1) in a dictionary. The problem could be that you are not using iterators (or loading key-value pairs one at a time) when creating the dictionary, and that you are loading all the words into memory. In that case, the slowdown is due to memory consumption.
I think you've got a few problems going on here. Mostly, I am unsure of what you are trying to accomplish with statlist. It seems to me like it is serving as a poor duplicate of your dictionary. Create it after you have found all of your words.
Here is my guess as to what you want:
def processtok(tok, v):
    global n
    global reference
    global hash
    # text mode, so that str.split() works on each line
    cin=open(tok, 'r')
    for l in cin:
        # split() without arguments also strips the trailing newline
        line = l.split()
        for word in line:
            if word in hash:
                hash[word] += 1
            else:
                hash[word] = 1
                n += 1
                ref.write('['+str(word)+','+str(n)+']'+'\n')
    cin.close()
    return hash
Note that this means you no longer need an "n", as you can get the number of unique words by doing len(hash).
