Optimizing Python algorithm - python

I am running my code off a 10yr old potato computer (i5 with 4GB RAM) and need to do a lot of language processing with NLTK. I cannot afford a new computer yet. I wrote a simple function (as part of a bigger program). The problem is, I do not know which version is more efficient, requires less computing power, and is quicker for processing overall.
This snippet uses more variables:
import nltk
from nltk.tokenize import PunktSentenceTokenizer #Unsupervised machine learning tokenizer.

#This is the custom tagger I created. To use it in future projects, simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    custom_sent_tokenizer = PunktSentenceTokenizer(training_text.read()) #You need to train the tokenizer on sample data.
    tokenized = custom_sent_tokenizer.tokenize(target_text.read()) #Use the trained tokenizer to tag your target file.
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagging = nltk.pos_tag(words)
        tagged.append(tagging)
    training_text.close() #ALWAYS close opened files! This is why I have included the extra code to this function!
    target_text.close() #ALWAYS close opened files! This is why I have included the extra code to this function!
    return tagged
Or is this more efficient? I actually prefer this:
import nltk
from nltk.tokenize import PunktSentenceTokenizer #Unsupervised machine learning tokenizer.

#This is the custom tagger I created. To use it in future projects, simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    #Use the trained tokenizer to tag your target file.
    for i in PunktSentenceTokenizer(training_text.read()).tokenize(target_text.read()): tagged.append(nltk.pos_tag(nltk.word_tokenize(i)))
    training_text.close() #ALWAYS close opened files! This is why I have included the extra code to this function!
    target_text.close() #ALWAYS close opened files! This is why I have included the extra code to this function!
    return tagged
Does anyone have any other suggestions for optimizing code?

It does not matter which one you choose.
The bulk of the computation is likely done by the tokenizer, not by the for loop in the presented code.
Moreover, the two examples do the same thing; one of them just has fewer explicit variables, but the data still needs to be stored somewhere.
Usually, algorithmic speedups come from clever elimination of loop iterations, e.g. in sorting algorithms speedups may come from avoiding value comparisons that will not result in a change to the order of elements (ones that don't advance the sorting). Here the number of loop iterations is the same in both cases.

As mentioned by Daniel, timing functions would be your best way to figure out which method is faster.
I'd recommend using an IPython console to test out the timing for each function:
%timeit custom_tagger(training_file, target_file)
I don't think there will be much of a speed difference between the two functions as the second is merely a refactoring of the first. Having all that text on one line won't speed up your code, and it makes it quite difficult to follow. If you're concerned about code length, I'd first clean up the way you read the files.
For example:
with open(target_file) as f:
    target_text = f.read()
This is much safer, as the file is closed immediately after reading. You could also improve the way you name your variables: in your code target_text is actually a file object, while the name suggests it's a string.
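Putting those suggestions together, here is a minimal sketch of how the whole function could look with context managers and clearer names (same logic as the original, not a performance change):

import nltk
from nltk.tokenize import PunktSentenceTokenizer

def custom_tagger(training_file, target_file):
    # The 'with' blocks close both files automatically, even if an exception is raised.
    with open(training_file, "r") as f:
        training_text = f.read()
    with open(target_file, "r") as f:
        target_text = f.read()

    # Train the Punkt sentence tokenizer on the sample text, then split the target text.
    sentence_tokenizer = PunktSentenceTokenizer(training_text)
    sentences = sentence_tokenizer.tokenize(target_text)

    # POS-tag each sentence; the list comprehension replaces the explicit loop.
    return [nltk.pos_tag(nltk.word_tokenize(sentence)) for sentence in sentences]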

Related

Optimize Word Generator

I am trying to build a program capable of finding the best word in a scrabble game. In the following code, I am trying to create a list of all the possible words given a set of 7 characters.
import csv
import itertools

with open('dictionary.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)

def FindLegalWords(data):
    LegalWords = []
    for i in data:
        if len(i[0]) <= 15:
            LegalWords.append(i[0])
    return LegalWords

PossibleWords = []

def word_generator(chars, start_with, min_len, max_len):
    for i in range(min_len - 1, max_len):
        for s in itertools.product(chars, repeat=i):
            yield start_with + ''.join(s)

for word in word_generator('abcdefg', '', 2, 15):
    if word in FindLegalWords(data):
        PossibleWords.append(word)
I think it is clear that the aforementioned code will take days to find all the possible words. What would be a better approach to the problem? Personally, I thought of turning each word into a number and using NumPy to manipulate them, because I have heard that NumPy is very quick. Would this solve the problem, or would it not be enough? I will be happy to answer any questions that arise about my code.
Thank you in advance
There are about 5,539 billion possibilities, and code working with strings is generally pretty slow (partially due to Unicode and allocations). This is huge. Generating a massive amount of data only to filter most of it out is not efficient. This algorithmic problem cannot be fixed by using optimized libraries like NumPy. One solution is to directly generate a much smaller subset of all possible values that still fits FindLegalWords. I guess you probably do not want to generate words like "bfddgfbgfgd". Thus, you could generate pronounceable words by concatenating 2 pronounceable word parts. Doing this is a bit tricky though. A much better solution is to retrieve the possible words from an existing dictionary. You can find such lists online. There are also dictionaries of pronounceable words that can be retrieved from free password databases. AFAIK, some tools like John-the-Ripper can generate such a list of words, which you can store in a text file and then read from your Python program. Note that since the list can be huge, it is better to compress the file and read it directly from the compressed source.
Some notes regarding the update:
Since FindLegalWords(data) is a constant, you can store it so as not to recompute it over and over. You can even compute set(FindLegalWords(data)) so as to search for words faster in the result. Still, the number of possibilities is the main problem, so this alone will not be enough.
PossibleWords will only ever contain words that are already in FindLegalWords(data). Thus, you can generate it directly from data rather than using a brute-force approach combined with a check (see the sketch below). This should be several orders of magnitude faster if data is small. Otherwise, the main problem will be that PossibleWords is so big that your RAM will certainly not be big enough to hold it anyway...
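To make the "generate directly from data" idea concrete, here is a minimal sketch that checks each dictionary word against the available letters with collections.Counter. It assumes data is the list of rows already read from dictionary.csv, and that each rack letter may be used at most once (note that itertools.product in the question reuses letters, so this is slightly stricter):

from collections import Counter

def find_playable_words(data, rack):
    """Return dictionary words that can be spelled from the letters in 'rack'."""
    rack_counts = Counter(rack)
    playable = []
    for row in data:
        word = row[0]
        if len(word) <= 15:
            word_counts = Counter(word)
            # The word is playable if it needs no letter more often than the rack provides it.
            if all(word_counts[ch] <= rack_counts[ch] for ch in word_counts):
                playable.append(word)
    return playable

possible_words = find_playable_words(data, 'abcdefg')

This loops once over the dictionary (typically a few hundred thousand rows) instead of over the thousands of billions of generated candidate strings.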

Detect bigram collection on a very large text dataset

I would like to find bigrams in a large corpus in text format. As the corpus cannot be loaded into memory at once and its lines are very big, I load it in chunks of 1 kB each:
def read_in_chunks(filename, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = filename.read(chunk_size)
        if not data:
            break
        yield data
Then I want to go piece by piece through the corpus and find bigrams and I use the gensim Phrases() and Phraser() functions, but while training, my model constantly loses state. Thus, I tried to save and reload the model after each megabyte that I read and then free the memory, but it still loses state. My code is here:
with open("./final/corpus.txt", "r", encoding='utf8') as r:
max_vocab_size=20000000
phrases = Phrases(max_vocab_size=max_vocab_size)
i=1
j=1024
sentences = ""
for piece in read_in_chunks(r):
if i<=j:
sentences = sentences + piece
else:
phrases.add_vocab(sentences)
phrases = Phrases(sentences)
phrases = phrases.save('./final/phrases.txt')
phrases = Phraser.load('./final/phrases.txt')
sentences = ""
j+=1024
i+=1
print("Done")
Any suggestion?
Thank you.
When you do the two lines...
phrases.add_vocab(sentences)
phrases = Phrases(sentences)
...that 2nd line is throwing away any existing instance inside the phrases variable, and replacing it with a brand new instance (Phrases(sentences)). There's no chance for additive adjustment to the single instance.
Secondarily, there's no way two consecutive lines of .save()-then-immediate-re-.load() can save net memory. At the very best, the .load() is unnecessary, only exactly reproducing what was just .save()d, while wasting a lot of time and temporary memory loading a second copy, then discarding the duplicate that was already in phrases so that phrases can be assigned to the new clone.
While these are problems, more generally, the issue is that what you need done doesn't have to be this complicated.
The Phrases class will accept as its corpus of sentences an iterable object where each item is a list-of-string-tokens. You don't have to worry about chunk-sizes, and calling add_vocab() multiple times – you can just provide a single object that itself offers up each item in turn, and Phrases will do the right thing. You do have to worry about breaking up raw lines into the specific words you want considered ('tokenization').
(For a large corpus, you might still run into memory issues related to the number of unique words that Phrases is trying to count. But it won't matter how arbitrarily large the number of items is – because it will only look at one at a time. Only the accumulation of unique words will consume running memory.)
For a good intro to how an iterable object can work in such situations, a good blog post is:
Data streaming in Python: generators, iterators, iterables
If your corpus.txt file is already set up to be one reasonably-sized sentence per line, and all words are already delimited by simple spaces, then an iterable class might be as simple as:
class FileByLineIterable(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'r', encoding='utf8') as src:
            for line in src:  # iterate lazily, one line at a time
                yield line.split()
Then, your code might just be as simple as...
sentences = FileByLineIterable('./final/corpus.txt')
phrases = Phrases(sentences, max_vocab_size=max_vocab_size)
...because the Phrases class is getting what it wants – a corpus that offers via iteration just one list-of-words item at a time.
Note:
you may want to enable logging at the INFO level to monitor progress and watch for any hints of things going wrong (see the snippet after these notes)
there's a slightly more advanced line-by-line iterator, which also limits any one line's text to no more than 10,000 tokens (to match an internal implementation limit of gensim Word2Vec), and opens files from places other than local file-paths, available at gensim.models.word2vec.LineSentence. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence
(Despite this being packaged in the word2vec package, it can be used elsewhere.)
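Regarding the logging note above, a minimal sketch using only the standard library's logging module, which is all gensim needs to start reporting progress:

import logging

# Emit gensim's INFO-level progress messages to the console.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
)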

Spacy MemoryError

I managed to install spaCy, but when trying to use nlp I am getting a MemoryError for some weird reason.
The code I wrote is as follows:
import spacy
import re
from nltk.corpus import gutenberg

def clean_text(astring):
    #replace newlines with space
    newstring = re.sub("\n", " ", astring)
    #remove title and chapter headings
    newstring = re.sub("\[[^\]]*\]", " ", newstring)
    newstring = re.sub("VOLUME \S+", " ", newstring)
    newstring = re.sub("CHAPTER \S+", " ", newstring)
    newstring = re.sub("\s\s+", " ", newstring)
    return newstring.lstrip().rstrip()

nlp = spacy.load('en')
alice = clean_text(gutenberg.raw('carroll-alice.txt'))
nlp_alice = list(nlp(alice).sents)
The error I am getting is as follows
[screenshot of the error message]
However, when my code is something like this, it works:
import spacy
nlp=spacy.load('en')
alice=nlp("hello Hello")
If anybody could point out what I am doing wrong I would be very grateful
I'm guessing you truly are running out of memory. I couldn't find an exact number, but I'm sure Carroll's Alice's Adventures in Wonderland has tens of thousands of sentences. This equates to tens of thousands of Span elements from spaCy. Without modification, nlp() determines everything from POS tags to dependencies for the string passed to it. Moreover, the sents property returns an iterator which should be taken advantage of, as opposed to immediately expanding it into a list.
Basically, you're attempting a computation which very likely is running into a memory constraint. How much memory does your machine have? In the comments Joe suggested watching your machine's memory usage; I second this. My recommendations: check whether you are actually running out of memory, or limit the functionality of nlp() (a sketch of that follows the loop below), or consider doing your work with the iterator functionality:
for sentence in nlp(alice).sents:
    pass
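If you go the "limit the functionality of nlp()" route, one common option is to disable pipeline components you don't need when loading the model. A sketch, assuming a spaCy version where 'en' is a valid model shortcut (as in the question) and keeping the parser because .sents relies on it:

import spacy

# Keep the dependency parser (needed for .sents) but skip tagging and named-entity recognition.
nlp = spacy.load('en', disable=['tagger', 'ner'])

for sentence in nlp(alice).sents:
    pass  # process one sentence at a time instead of materialising the whole list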

How to count frequency in gensim.Doc2Vec?

I am training a model with gensim. My corpus consists of many short sentences, and each sentence has a frequency which indicates how many times it occurs in the total corpus. I implemented it as follows; as you can see, I simply chose to repeat each sentence freq times. If the data is small this works, but as the data grows the frequencies can be very large, and it costs too much memory and my machine cannot afford it.
So
1. Can I just use the frequency count for each record instead of repeating it freq times? 2. Are there any other ways to save memory?
class AddressSentences(object):
    def __init__(self, raw_path, path):
        self._path = path

    def __iter__(self):
        with open(self.path) as fi:
            headers = next(fi).split(",")
            i_address, i_freq = headers.index("address"), headers.index("freq")
            index = 0
            for line in fi:
                cols = line.strip().split(",")
                freq = cols[i_freq]
                address = cols[i_address].split()
                # Here I do repeat
                for i in range(int(freq)):
                    yield TaggedDocument(address, [index])
                index += 1

print("START %s" % datetime.datetime.now())
train_corpus = list(AddressSentences("/data/corpus.csv"))
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
print("END %s" % datetime.datetime.now())
corpus is something like this:
address,freq
Cecilia Chapman 711-2880 Nulla St.,1000
The Business Centre,1000
61 Wellfield Road,500
Celeste Slater 606-3727 Ullamcorper. Street,600
Theodore Lowe Azusa New York 39531,700
Kyla Olsen Ap #651-8679 Sodales Av.,300
Two options for your exact question:
(1)
You don't need to reify your corpus iterator into a fully in-memory list, with your line:
train_corpus = list(AddressSentences("/data/corpus.csv"))
The gensim Doc2Vec model can use your iterable object directly as its corpus, since it implements __iter__() (and thus can be iterated over multiple times). So you can just do:
train_corpus = AddressSentences("/data/corpus.csv")
Then each line will be read, and each repeated TaggedDocument re-yield()ed, without requiring the full set in memory.
(2)
Alternatively, in such cases you may sometimes just want to write a separate routine that takes your original file and, rather than directly yielding TaggedDocuments, writes a tangible file to disk which already includes the repetitions. Then, use a simpler iterable reader to stream that (already-repeated) dataset into your model.
A negative of this approach, in this particular case, is that it would increase the amount of (likely relatively laggy) disk-IO. However, if the special processing your iterator is doing is more costly – such as regex-based tokenization – this sort of process-and-rewrite can help avoid duplicate work by the model later. (The model needs to scan your corpus once for vocabulary-discovery, then again iter times for training – so any time-consuming work in your iterator will be done redundantly, and may be the bottleneck that keeps other training threads idle waiting for data.)
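A rough sketch of that second option, assuming the csv layout from the question (the output path is only illustrative, and shuffling the repeated file is left to an external tool as discussed below). Note that this tags every repeated line separately; if repeats must share a tag as in the original code, write the original index into the file as well:

import csv
from gensim.models.doc2vec import TaggedDocument

# One-off preprocessing step: expand each address 'freq' times into a plain text file.
with open("/data/corpus.csv") as src, open("/data/corpus_repeated.txt", "w") as dst:
    reader = csv.DictReader(src)
    for row in reader:
        for _ in range(int(row["freq"])):
            dst.write(row["address"] + "\n")

# Later, a simple streaming reader yields one TaggedDocument per line.
class RepeatedAddresses(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as fi:
            for index, line in enumerate(fi):
                yield TaggedDocument(line.split(), [index])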
But after those two options, some Doc2Vec-specific warnings:
Repeating documents like this may not benefit the Doc2Vec model, as compared to simply iterating over the full diverse set. It's the tug-of-war interplay of contrasting examples which causes the word-vectors/doc-vectors in Word2Vec/Doc2Vec models to find useful relative arrangements.
Repeating exact documents/word-contexts is a plausible way to "overweight" those examples, but even if that's really what you want, and would help your end-goals, it'd be better to shuffle those repeats through the whole set.
Repeating one example consecutively is like applying the word-cooccurrences of that example like a jackhammer on the internal neural-network, without any chance for interleaved alternate examples to find a mutually-predictive weight arrangement. The iterative gradient-descent optimization through all diverse examples ideally works more like gradual water-driven erosion & re-deposition of values.
That suggests another possible reason to take the second approach, above: after writing the file-with-repeats, you could use an external line-shuffling tool (like sort -R or shuf on Linux) to shuffle the file. Then, the 1000 repeated lines of some examples would be evenly spread among all the other (repeated) examples, a friendlier arrangement for dense-vector learning.
In any case, I would try leaving out repetition entirely, or shuffling repetitions, and evaluate which steps are really helping on whatever the true end goal is.

Most efficient way in Python to iterate over a large file (10GB+)

I'm working on a Python script to go through two files - one containing a list of UUIDs, the other containing a large number of log entries, each line containing one of the UUIDs from the other file. The purpose of the program is to create a list of the UUIDs from file1, then increment the associated value each time that UUID is found in the log file.
So long story short, count how many times each UUID appears in the log file.
At the moment, I have a list which is populated with UUID as the key and 'hits' as the value, and another loop which iterates over each line of the log file and checks whether the UUID in the log matches a UUID in the UUID list. If it matches, it increments the value.
for i, logLine in enumerate(logHandle): #start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize): #check progress
        print logFunc.progress(lineCount, logSize) #print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1: #for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1 #if matched, increment the relevant value in the uidHits list
            break #as we've already found the match, don't process the rest
    lineCount += 1
It works as it should - but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading files in chunks rather than line by line would improve performance by reducing the amount of disk I/O time, but the performance difference on a ~200MB test file was negligible. If anyone has any other methods I would be very grateful :)
Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
collections.Counter(map(uuid, open("log.txt")))
This will be pretty much optimally efficient.
A couple comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
Your code is slow because you are using the wrong data structures. A dict is what you want here.
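Putting those two comments together, a sketch that counts only the UUIDs from the first file (the uuid() extractor and the file names are hypothetical; adjust the field index and delimiter to your actual log format):

from collections import Counter

def uuid(log_line):
    # Hypothetical extractor: assumes the UUID is the first whitespace-separated field.
    return log_line.split()[0]

# The set of UUIDs we actually care about, one per line in the rulebase file.
with open("uuids.txt") as f:
    wanted = set(line.strip() for line in f)

# Stream the log once, counting only the wanted UUIDs.
with open("log.txt") as log:
    counts = Counter(u for u in (uuid(line) for line in log) if u in wanted)

# Counter returns 0 for UUIDs never seen in the log.
uid_hits = dict((uid, counts[uid]) for uid in wanted)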
This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.
Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In python 2.x it'll look something like
uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
It sounds like this doesn't actually have to be a python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than python can.
cut -d${SPLIT_CHAR} -f${UUID_FIELD} log_file.txt | sort | uniq -c
Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure you'll see a performance gain, since I've never processed 10GB of data with it, but you might explore this framework.
Try measuring where most time is spent, using a profiler http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data: if the list of uuids isn't very long, you may find, for example, that a large proportion of time is spent on the "if logFunc.progress(lineCount, logSize)". If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more.
In any case, you can eliminate the lineCount variable, and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.
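For the profiler suggestion above, a minimal sketch using the standard library's cProfile and pstats (the main() wrapper is hypothetical; wrap whatever function drives your loop):

import cProfile
import pstats

cProfile.run('main()', 'log_stats')             # profile the hypothetical driver function
stats = pstats.Stats('log_stats')
stats.sort_stats('cumulative').print_stats(20)  # show the 20 most expensive calls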
