I would like to find bigrams in a large corpus in text format. As the corpus cannot be loaded into memory at once and its lines are very long, I load it in chunks of 1 KB each:
def read_in_chunks(filename, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = filename.read(chunk_size)
        if not data:
            break
        yield data
Then I want to go through the corpus piece by piece to find bigrams, using the gensim Phrases() and Phraser() functions, but while training, my model constantly loses its state. I tried saving and reloading the model after each megabyte that I read, and then freeing the memory, but it still loses state. My code is here:
from gensim.models.phrases import Phrases, Phraser

with open("./final/corpus.txt", "r", encoding='utf8') as r:
    max_vocab_size = 20000000
    phrases = Phrases(max_vocab_size=max_vocab_size)
    i = 1
    j = 1024
    sentences = ""
    for piece in read_in_chunks(r):
        if i <= j:
            sentences = sentences + piece
        else:
            phrases.add_vocab(sentences)
            phrases = Phrases(sentences)
            phrases = phrases.save('./final/phrases.txt')
            phrases = Phraser.load('./final/phrases.txt')
            sentences = ""
            j += 1024
        i += 1
    print("Done")
Any suggestion?
Thank you.
When you do the two lines...
phrases.add_vocab(sentences)
phrases = Phrases(sentences)
...that 2nd line is throwing away any existing instance inside the phrases variable, and replacing it with a brand new instance (Phrases(sentences)). There's no chance for additive adjustment to the single instance.
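For contrast, a minimal sketch of purely additive use (assuming token_chunks is a hypothetical iterable whose items are lists of already-tokenized sentences, each a list of string tokens) would keep feeding the same instance:

from gensim.models.phrases import Phrases

phrases = Phrases(max_vocab_size=20000000)
for sentence_batch in token_chunks:    # hypothetical: each batch is a list of token lists
    phrases.add_vocab(sentence_batch)  # keeps accumulating counts in the SAME instance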
Secondarily, there's no way two consecutive lines of .save()-then-immediate-re-.load() can save net memory usage. At best, the .load() would be unnecessary, only exactly reproducing what was just .save()d, but wasting a lot of time and temporary memory loading a second copy, then discarding the duplicate that was already in phrases in order to point phrases at the new clone.
While these are problems, more generally, the issue is that what you need done doesn't have to be this complicated.
The Phrases class will accept as its corpus of sentences an iterable object where each item is a list-of-string-tokens. You don't have to worry about chunk-sizes, and calling add_vocab() multiple times – you can just provide a single object that itself offers up each item in turn, and Phrases will do the right thing. You do have to worry about breaking up raw lines into the specific words you want considered ('tokenization').
(For a large corpus, you might still run into memory issues related to the number of unique words that Phrases is trying to count. But it won't matter how arbitrarily large the number of items is – because it will only look at one at a time. Only the accumulation of unique words will consume running memory.)
For a good intro to how an iterable object can work in such situations, see this blog post:
Data streaming in Python: generators, iterators, iterables
If your corpus.txt file is already set up to be one reasonably-sized sentence per line, and all words are already delimited by simple spaces, then an iterable class might be as simple as:
class FileByLineIterable(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'r', encoding='utf8') as src:
            for line in src:
                yield line.split()
Then, your code might just be as simple as...
sentences = FileByLineIterable('./final/corpus.txt')
phrases = Phrases(sentences, max_vocab_size=max_vocab_size)
...because the Phrases class is getting what it wants – a corpus that offers via iteration just one list-of-words item at a time.
Note:
you may want to enable logging at the INFO level to monitor progress and watch for any hints of things going wrong (a combined usage sketch follows these notes)
there's a slightly more advanced line-by-line iterator, which also limits any one line's text to no more than 10,000 tokens (to match an internal implementation limit of gensim Word2Vec), and opens files from places other than local file-paths, available at gensim.models.word2vec.LineSentence. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence
(Despite this being packaged in the word2vec package, it can be used elsewhere.)
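A minimal sketch combining both notes, reusing the corpus path and max_vocab_size value from the question:

import logging
from gensim.models.phrases import Phrases
from gensim.models.word2vec import LineSentence

# INFO-level logging shows vocab pruning and progress messages as Phrases runs.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

sentences = LineSentence('./final/corpus.txt')   # one space-delimited sentence per line
phrases = Phrases(sentences, max_vocab_size=20000000)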
Consider two lists, the first consisting of 700 words and the second consisting of 30,000 possible sentence beginnings. That makes 21,000,000 combinations of sentence beginning and word.
Furthermore, there are about 400 files with results for the possible sentence + word combinations. Every file consists of 170,000,000 lines and has a structure as follows:
this is the first sentence
1. result for that sentence
2. result for that sentence
...
nth result for that sentence
this is the second sentence
...
this is the nth sentence
...
For every possible sentence + word combination I would like to find the results file that carries some information about the combination (for every combination there is only one results file in which the combination occurs) and read the results out. You could do it in a for loop:
all_results = []
#create combinations
for sentence in sentencelist:
    for word in wordlist:
        combo = str(sentence + ' ' + word)
        #loop through results files while no result for the combination has been found
        c_result = []
        while not c_result:
            for resultsfilename in os.listdir(resultsdirectory):
                with open(resultsfilename, 'r') as infile:
                    results = infile.read().splitlines()
                    if combo in results:
                        c_result = function_to_find_pattern_based_on_continuation(continuation, results)
        #append results and reset c_result
        all_results.append(c_result)
        c_result = []
However, this algorithm has quite a long runtime and I'm wondering how it could be improved. For instance, I'm wondering how I could avoid loading the results files over and over again. Furthermore, I would like to create a copy of the results files so that, after the results of a sentence + word combination have been read out of a file, they could be deleted from the copy (I don't want to change the files on the drive). However, every results file is about 7 GB, so it would not make sense to store every file in a variable, right?
Are there some other things, which could be used to improve the runtime?
Edit 1: Adapted the size of the lists
Edit 2: Added the while loop and comments to the code
You have 2 problems here as I understand it.
you need some way to reduce I/O on several large files.
you need a way to modify/copy some of these large files
There are a few ways I think you can solve these issues. Firstly, if it is possible, I would use a database like SQLite - this removes the problem of opening and closing lots of files over and over.
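A rough sketch of the first idea (resultsdirectory comes from the question; the table layout and database path are just an illustration): load every results file once into an indexed SQLite table, then answer each sentence + word lookup with a query instead of re-reading 7 GB files.

import os
import sqlite3

def build_index(resultsdirectory, db_path='results.db'):
    """Stream every results file once into an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS results (line TEXT, filename TEXT)')
    for resultsfilename in os.listdir(resultsdirectory):
        with open(os.path.join(resultsdirectory, resultsfilename)) as infile:
            rows = ((line.rstrip('\n'), resultsfilename) for line in infile)
            conn.executemany('INSERT INTO results VALUES (?, ?)', rows)
    conn.execute('CREATE INDEX IF NOT EXISTS idx_line ON results (line)')
    conn.commit()
    return conn

def find_file(conn, combo):
    """Return the file(s) whose contents contain this sentence + word combination."""
    cur = conn.execute('SELECT DISTINCT filename FROM results WHERE line = ?', (combo,))
    return [row[0] for row in cur]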
Secondly, you can use Python's yield keyword in your for loop (place the loop in its own function) and then iterate through it as a generator, processing the data as a stream as you go. This lets you store the results (say, in a file) without putting them all in a list, which would run out of memory pretty fast by the sound of it.
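And a minimal sketch of the second idea, again reusing resultsdirectory from the question with a hypothetical output file: a generator yields lines one at a time, and the file in which each combination occurs is written straight to disk rather than everything being accumulated in one giant in-memory list.

import os

def stream_results(resultsdirectory):
    """Yield (filename, line) pairs one at a time, never holding a whole file in memory."""
    for resultsfilename in os.listdir(resultsdirectory):
        with open(os.path.join(resultsdirectory, resultsfilename)) as infile:
            for line in infile:
                yield resultsfilename, line.rstrip('\n')

def find_combo_files(combos, resultsdirectory, out_path):
    """Record, for each sentence + word combination, the file it occurs in."""
    remaining = set(combos)
    with open(out_path, 'w') as out:
        for filename, line in stream_results(resultsdirectory):
            if line in remaining:
                out.write('%s\t%s\n' % (filename, line))
                remaining.discard(line)
            if not remaining:
                break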
I would like to know if it's possible for me to use my own tokenized/segmented documents (with my own vocab file as well) as the input file to the create_pretraining_data.py script (git source: https://github.com/google-research/bert).
The main reason for this question is because the segmentation/tokenization for the Khmer language is different than that of English.
Original:
វាមានមកជាមួយនូវ
Segmented/Tokenized:
វា មាន មក ជាមួយ នូវ
I tried something on my own and managed to get some results after running the create_pretraining_data.py and run_pretraining.py script. However, I'm not sure if what I'm doing can be considered correct.
I also would like to know the method that I should use to verify my model.
Any help is highly appreciated!
Script Modifications
The modifications that I did were:
1. Make input file in a list format
Instead of normal plain text, my input file comes from my custom Khmer tokenization output, which I then convert into a list format, mimicking the output I get when running the sample English text.
[[['ដំណាំ', 'សាវម៉ាវ', 'ជា', 'ប្រភេទ', 'ឈើ', 'ហូប', 'ផ្លែ'],
['វា', 'ផ្តល់', 'ផប្រយោជន៍', 'យ៉ាង', 'ច្រើន', 'ដល់', 'សុខភាព']],
[['cmt', '$', '270', 'នាំ', 'លាភ', 'នាំ', 'សំណាង', 'ហេង', 'ហេង']]]
* The outer bracket indicates a source file, the first nested bracket indicates a document and the second nested bracket indicates a sentence. Exactly the same structure as the variable all_documents inside the create_training_instances() function
2. Vocab file from unique segmented words
This is the part I have serious doubts about. To create my vocab file, all I did was find the unique tokens from the whole set of documents. I then added the core special tokens [CLS], [SEP], [UNK] and [MASK]. I'm not sure if this is the correct way to do it.
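Roughly, the process was the following (a sketch; the vocab.txt filename is illustrative, and all_documents refers to the nested structure shown above):

# Build the vocab file from the unique tokens in the nested list structure
# (document -> sentence -> tokens), then prepend the special tokens mentioned above.
special_tokens = ['[CLS]', '[SEP]', '[UNK]', '[MASK]']

unique_tokens = set()
for document in all_documents:
    for sentence in document:
        unique_tokens.update(sentence)

with open('vocab.txt', 'w', encoding='utf8') as f:
    for token in special_tokens + sorted(unique_tokens):
        f.write(token + '\n')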
Feedback on this part is highly appreciated!
3. Skip tokenization step inside the create_training_instances() function
Since my input file already matches what the variable all_documents is, I skip line 183 to line 207. I replaced it with reading my input as-is:
# Note: this assumes "import ast" has been added at the top of create_pretraining_data.py.
for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
        lines = reader.read()
        all_documents = ast.literal_eval(lines)
Results/Output
The raw input file (before custom tokenization) is from random web-scraping.
Some information on the raw and vocab file:
Number of documents/articles: 5
Number of sentences: 78
Vocabulary size: 649 (including [CLS], [SEP], etc.)
Below is the output (tail end of it) after running the create_pretraining_data.py
And this is what I get after running the run_pretraining.py
As shown in the diagram above, I'm getting a very low accuracy, hence my concern about whether I'm doing this correctly.
First of all, you seem to have very little training data (you mention a vocabulary size of 649). BERT is a huge model which needs a lot of training data. The English models published by Google are trained on at least the whole of Wikipedia. Think about that!
BERT uses something called WordPiece, which guarantees a fixed vocabulary size. Rare words are split up, so that Jet makers feud over seat width with big orders at stake translates to WordPiece as: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake.
WordPieceTokenizer.tokenize(text) takes text pretokenized by whitespace, so you should replace the BasicTokenizer, which is run before the WordPieceTokenizer, with your specific tokenizer, which should separate your tokens by whitespace.
To train your own WordPiece tokenizer, have a look at SentencePiece, which in BPE mode is essentially the same as WordPiece.
You can then export a vocabulary list from your WordPiece model.
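A minimal sketch of that training step, using the sentencepiece Python package with a hypothetical corpus file khmer_segmented.txt (one whitespace-segmented sentence per line) and an arbitrary example vocab size:

import sentencepiece as spm

# Train a BPE model on the whitespace-segmented Khmer corpus
# (path and vocab_size are illustrative, not values from the question).
spm.SentencePieceTrainer.Train(
    '--input=khmer_segmented.txt --model_prefix=khmer_bpe '
    '--vocab_size=8000 --model_type=bpe'
)

# The trainer writes khmer_bpe.model and khmer_bpe.vocab; the .vocab entries
# can then be adapted into a BERT-style vocab.txt, adding the special tokens.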
I did not pretrain a BERT model myself, so I cannot help you on where to change something in the code exactly.
I have a word list like ['like', 'Python'] and I want to load the pre-trained GloVe word vectors for these words, but the GloVe file is too large. Is there any fast way to do it?
What I tried
I iterated through each line of the file to see if the word is in the list and added it if so. But this method is a little slow.
import pandas as pd

def readWordEmbeddingVector(Wrd):
    f = open('glove.twitter.27B/glove.twitter.27B.200d.txt', 'r')
    words = []
    a = f.readline()
    while a != '':
        vector = a.split()
        if vector[0] in Wrd:
            words.append(vector)
            Wrd.remove(vector[0])
        a = f.readline()
    f.close()
    words_vector = pd.DataFrame(words).set_index(0).astype('float')
    return words_vector
I also tried the following, but it loaded the whole file instead of just the vectors I need:
gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('word2vec.twitter.27B.200d.txt')
What I want
A method like gensim.models.keyedvectors.KeyedVectors.load_word2vec_format, but where I can pass a word list so that only those vectors are loaded.
There's no existing gensim support for filtering the words loaded via load_word2vec_format(). The closest is an optional limit parameter, which can be used to limit how many word-vectors are read (ignoring all subsequent vectors).
You could conceivably create your own routine to perform such filtering, using the source code for load_word2vec_format() as a model. As a practical matter, you might have to read the file twice: 1st, to find out exactly how many words in the file intersect with your set-of-words-of-interest (so you can allocate the right-sized array without trusting the declared size at the front of the file), then a second time to actually read the words-of-interest.
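As a rough illustration of that two-pass approach (a minimal sketch assuming a plain-text word2vec-format file whose first line declares the word count and dimensionality; this is not gensim's own implementation):

import numpy as np

def load_filtered_vectors(path, wanted, encoding='utf8'):
    """Read a plain-text word2vec-format file twice, keeping only
    the vectors for words in `wanted`."""
    wanted = set(wanted)

    # Pass 1: read the declared dimensionality and count the matching words,
    # so a right-sized array can be allocated.
    with open(path, encoding=encoding) as f:
        _declared_count, dims = map(int, f.readline().split())
        n_found = sum(1 for line in f if line.split(' ', 1)[0] in wanted)

    # Pass 2: fill the array with just the words of interest.
    vectors = np.empty((n_found, dims), dtype=np.float32)
    index = {}
    with open(path, encoding=encoding) as f:
        f.readline()  # skip the header line
        for line in f:
            parts = line.rstrip('\n ').split(' ')
            if parts[0] in wanted:
                index[parts[0]] = len(index)
                vectors[index[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
    return index, vectors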
I am running my code on a 10-year-old potato computer (an i5 with 4 GB RAM) and need to do a lot of language processing with NLTK. I cannot afford a new computer yet. I wrote a simple function (as part of a bigger program). The problem is, I do not know which version is more efficient, requires less computing power, and is quicker overall.
This snippet uses more variables:
import nltk
from nltk.tokenize import PunktSentenceTokenizer  # Unsupervised machine learning tokenizer.

# This is the custom tagger I created. To use it in future projects, simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    custom_sent_tokenizer = PunktSentenceTokenizer(training_text.read())  # You need to train the tokenizer on sample data.
    tokenized = custom_sent_tokenizer.tokenize(target_text.read())  # Use the trained tokenizer to tag your target file.
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagging = nltk.pos_tag(words)
        tagged.append(tagging)
    training_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    target_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    return tagged
Or is this more efficient? I actually prefer this:
import nltk
from nltk.tokenize import PunktSentenceTokenizer  # Unsupervised machine learning tokenizer.

# This is the custom tagger I created. To use it in future projects, simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    # Use the trained tokenizer to tag your target file.
    for i in PunktSentenceTokenizer(training_text.read()).tokenize(target_text.read()): tagged.append(nltk.pos_tag(nltk.word_tokenize(i)))
    training_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    target_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    return tagged
Does anyone have any other suggestions for optimizing code?
It does not matter which one you choose.
The bulk of the computation is likely done by the tokenizer, not by the for loop in the presented code.
Moreover, the two examples do the same thing; one of them just has fewer explicit variables, but the data still needs to be stored somewhere.
Usually, algorithmic speedups come from clever elimination of loop iterations, e.g. in sorting algorithms speedups may come from avoiding value comparisons that will not result in a change to the order of elements (ones that don't advance the sorting). Here the number of loop iterations is the same in both cases.
As mentioned by Daniel, timing functions would be your best way to figure out which method is faster.
I'd recommend using an IPython console to test the timing of each function:
timeit custom_tagger(training_file, target_file)
I don't think there will be much of a speed difference between the two functions as the second is merely a refactoring of the first. Having all that text on one line won't speed up your code, and it makes it quite difficult to follow. If you're concerned about code length, I'd first clean up the way you read the files.
For example:
with open(target_file) as f:
target_text = f.read()
This is much safer as the file is closed immediately after reading. You could also improve the way you name your variables. In your code, target_text is actually a file object, when its name makes it sound like a string.
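Putting those suggestions together, a rough sketch of the same tagger (same behavior as the question's function, just with context managers and clearer variable names):

import nltk
from nltk.tokenize import PunktSentenceTokenizer

def custom_tagger(training_file, target_file):
    # Read both files up front; the with-blocks close them immediately afterwards.
    with open(training_file) as f:
        training_text = f.read()   # now genuinely a string, matching its name
    with open(target_file) as f:
        target_text = f.read()
    tokenizer = PunktSentenceTokenizer(training_text)   # train on the sample data
    return [nltk.pos_tag(nltk.word_tokenize(sentence))
            for sentence in tokenizer.tokenize(target_text)]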
I am training a model with gensim. My corpus is many short sentences, and each sentence has a frequency indicating how many times it occurs in the total corpus. I implement it as follows; as you can see, I just repeat each sentence freq times. If the data is small this should work, but when the data grows the frequencies can be very large, and it costs too much memory, which my machine cannot afford.
So
1. Can I just use the frequency count of each record instead of repeating it freq times? 2. Are there any other ways to save memory?
import datetime
import gensim
from gensim.models.doc2vec import TaggedDocument

class AddressSentences(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as fi:
            headers = next(fi).strip().split(",")
            i_address, i_freq = headers.index("address"), headers.index("freq")
            index = 0
            for line in fi:
                cols = line.strip().split(",")
                freq = cols[i_freq]
                address = cols[i_address].split()
                # Here I do the repeat
                for i in range(int(freq)):
                    yield TaggedDocument(address, [index])
                index += 1

print("START %s" % datetime.datetime.now())

train_corpus = list(AddressSentences("/data/corpus.csv"))
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)

print("END %s" % datetime.datetime.now())
corpus is something like this:
address,freq
Cecilia Chapman 711-2880 Nulla St.,1000
The Business Centre,1000
61 Wellfield Road,500
Celeste Slater 606-3727 Ullamcorper. Street,600
Theodore Lowe Azusa New York 39531,700
Kyla Olsen Ap #651-8679 Sodales Av.,300
Two options for your exact question:
(1)
You don't need to reify your corpus iterator into a fully in-memory list, with your line:
train_corpus = list(AddressSentences("/data/corpus.csv"))
The gensim Doc2Vec model can use your iterable object directly as its corpus, since it implements __iter__() (and thus can be iterated over multiple times). So you can just do:
train_corpus = AddressSentences("/data/corpus.csv")
Then each line will be read, and each repeated TaggedDocument re-yield()ed, without requiring the full set in memory.
(2)
Alternatively, in such cases you may sometimes just want to write a separate routine that takes your original file and, rather than directly yielding TaggedDocuments, writes a tangible file to disk that already includes the repetitions. Then, use a simpler iterable reader to stream that (already-repeated) dataset into your model.
A negative of this approach, in this particular case, is that it would increase the amount of (likely relatively laggy) disk-IO. However, if the special processing your iterator is doing is more costly – such as regex-based tokenization – this sort of process-and-rewrite can help avoid duplicate work by the model later. (The model needs to scan your corpus once for vocabulary-discovery, then again iter times for training – so any time-consuming work in your iterator will be done redundantly, and may be the bottleneck that keeps other training threads idle waiting for data.)
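A rough sketch of that process-and-rewrite step, assuming the corpus.csv layout shown in the question (the output path and shuffle command are illustrative):

import csv

def expand_corpus(csv_path, out_path):
    """Write each address `freq` times to a plain-text file,
    one space-delimited address per line."""
    with open(csv_path) as fi, open(out_path, 'w') as fo:
        for row in csv.DictReader(fi):
            line = ' '.join(row['address'].split()) + '\n'
            for _ in range(int(row['freq'])):
                fo.write(line)

# expand_corpus('/data/corpus.csv', '/data/corpus_expanded.txt')
# Optionally shuffle the repeats on disk afterwards, e.g.:
#   shuf /data/corpus_expanded.txt > /data/corpus_shuffled.txt
# then stream the shuffled file with a simple line-by-line TaggedDocument reader.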
But after those two options, some Doc2Vec-specific warnings:
Repeating documents like this may not benefit the Doc2Vec model, as compared to simply iterating over the full diverse set. It's the tug-of-war interplay of contrasting examples which causes the word-vectors/doc-vectors in Word2Vec/Doc2Vec models to find useful relative arrangements.
Repeating exact documents/word-contexts is a plausible way to "overweight" those examples, but even if that's really what you want, and would help your end-goals, it'd be better to shuffle those repeats through the whole set.
Repeating one example consecutively is like applying the word-cooccurrences of that example to the internal neural network with a jackhammer, without any chance for interleaved alternate examples to find a mutually-predictive weight arrangement. The iterative gradient-descent optimization through all diverse examples ideally works more like gradual water-driven erosion and re-deposition of values.
That suggests another possible reason to take the second approach, above: after writing the file-with-repeats, you could use an external line-shuffling tool (like sort -R or shuf on Linux) to shuffle the file. Then, the 1000 repeated lines of some examples would be evenly spread among all the other (repeated) examples, a friendlier arrangement for dense-vector learning.
In any case, I would try leaving out repetition entirely, or shuffling repetitions, and evaluate which steps are really helping on whatever the true end goal is.