Load a part of Glove vectors with gensim - python

I have a word list like ['like', 'Python'] and I want to load the pre-trained GloVe word vectors for these words, but the GloVe file is too large. Is there any fast way to do it?
What I tried
I iterated through each line of the file to see whether the word is in the list and, if so, added it to a collection. But this method is a little slow.
import pandas as pd

def readWordEmbeddingVector(Wrd):
    # Wrd is the collection of words whose vectors we want
    f = open('glove.twitter.27B/glove.twitter.27B.200d.txt', 'r')
    words = []
    a = f.readline()
    while a != '':
        vector = a.split()
        if vector[0] in Wrd:
            words.append(vector)
            Wrd.remove(vector[0])
        a = f.readline()
    f.close()
    words_vector = pd.DataFrame(words).set_index(0).astype('float')
    return words_vector
I also tried the following, but it loaded the whole file instead of just the vectors I need:
gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('word2vec.twitter.27B.200d.txt')
What I want
A method like gensim.models.keyedvectors.KeyedVectors.load_word2vec_format, but where I can pass a word list so that only those vectors are loaded.

There's no existing gensim support for filtering the words loaded via load_word2vec_format(). The closest is an optional limit parameter, which can be used to limit how many word-vectors are read (ignoring all subsequent vectors).
You could conceivably create your own routine to perform such filtering, using the source code for load_word2vec_format() as a model. As a practical matter, you might have to read the file twice: first, to find out exactly how many words in the file intersect with your set of words of interest (so you can allocate a right-sized array without trusting the declared size at the front of the file), then a second time to actually read the words of interest.
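A minimal sketch of such a two-pass routine, assuming a plain GloVe-style text file with one word followed by its vector components per line and no header line (the function name and parameters below are hypothetical, not part of gensim):

import numpy as np

def load_filtered_vectors(path, wanted_words):
    wanted = set(wanted_words)
    # Pass 1: learn the vector dimensionality and which wanted words are
    # actually present, so the array can be allocated at the right size.
    present = set()
    dim = None
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if dim is None:
                dim = len(parts) - 1
            if parts[0] in wanted:
                present.add(parts[0])
    # Pass 2: fill the array with only the words of interest.
    vectors = np.zeros((len(present), dim), dtype=np.float32)
    index = {}
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if parts[0] in present and parts[0] not in index:
                index[parts[0]] = len(index)
                vectors[index[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
    return index, vectors

Here index maps each found word to its row in vectors; from there you could build a pandas DataFrame or a gensim KeyedVectors object, whichever is more convenient.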

Related

Optimize runtime of pattern matching

Consider two lists, the first consisting of 700 words and the second consisting of 30,000 possible sentence beginnings. That gives 21,000,000 combinations of sentence beginning and word.
Furthermore, there are about 400 files with results for every possible sentence + word combination. Every file consists of 170,000,000 lines and has a structure as follows:
this is the first sentence
1. result for that sentence
2. result for that sentence
...
nth result for that sentence
this is the second sentence
...
this is the nth sentence
...
For every possible sentence + word combination I would like to find the results file that carries some information about the combination (for every combination there is only one results file in which the combination occurs) and read the results out. You could do it in a for loop:
all_results = []
# create combinations
for sentence in sentencelist:
    for word in wordlist:
        combo = str(sentence + ' ' + word)
        # loop through results files while no result for the combination has been found
        c_result = []
        while not c_result:
            for resultsfilename in os.listdir(resultsdirectory):
                with open(resultsfilename, 'r') as infile:
                    results = infile.read().splitlines()
                    if combo in results:
                        c_result = function_to_find_pattern_based_on_continuation(continuation, results)
        # append results and reset c_result
        all_results.append(c_result)
        c_result = []
However, this algorithm has quite a long runtime and I'm wondering how it could be improved. For instance, I'm wondering how I could avoid loading the results files over and over again. Furthermore, I would like to create a copy of the results files, and after the results of a sentence + word combination have been read out of a results file, they could be deleted in the copy (I don't want to change the files on the drive). However, every results file is about 7 GB in size, so it would not make sense to store every file in a variable, right?
Are there some other things, which could be used to improve the runtime?
Edit 1: Adapted the size of the lists
Edit 2: Added the while loop and comments in the code
You have two problems here, as I understand it:
1. you need some way to reduce I/O on several large files;
2. you need a way to modify/copy some of these large files.
There are a few ways I think you can solve these issues. Firstly, if it is possible, I would use a database like sqlite; this removes the problem of opening and closing lots of files.
Secondly, you can use Python's yield keyword in your for loop (place it in its own function) and then iterate through the file as a generator, processing it like a stream as you go (see the sketch below). This lets you store the results (say, in a file) without putting them all in a list, which by the sound of it would run out of memory pretty fast.
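For illustration, a minimal sketch of the generator idea. The is_result_line() helper is an assumption based on the sample layout above (result lines starting with a counter like "1. "), not something from your code:

import os

def is_result_line(line):
    # Assumption: result lines start with a counter such as "1. ",
    # while sentence lines do not.
    head = line.split('.', 1)[0]
    return head.isdigit()

def iter_sentence_results(resultsdirectory):
    # Yields (sentence, [result lines]) pairs, reading each file exactly once
    # and keeping only one block in memory at a time.
    for resultsfilename in os.listdir(resultsdirectory):
        with open(os.path.join(resultsdirectory, resultsfilename), 'r') as infile:
            sentence, block = None, []
            for line in infile:
                line = line.rstrip('\n')
                if is_result_line(line):
                    block.append(line)
                else:
                    # a new sentence line closes the previous block
                    if sentence is not None:
                        yield sentence, block
                    sentence, block = line, []
            if sentence is not None:
                yield sentence, block

You could then build the set of wanted sentence + word combinations once, walk this generator a single time, and collect results for every combination you encounter, instead of re-reading 7 GB files inside the inner loop.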

Can I pre-train a BERT model from scratch using a tokenized input file and a custom vocabulary file for the Khmer language?

I would like to know if it's possible for me to use my own tokenized/segmented documents (with my own vocab file as well) as the input file to the create_pretraining_data.py script (git source: https://github.com/google-research/bert).
The main reason for this question is that segmentation/tokenization for the Khmer language is different from that of English.
Original:
វា​មាន​មក​ជាមួយ​នូវ
Segmented/Tokenized:
វា មាន មក ជាមួយ នូវ
I tried something on my own and managed to get some results after running the create_pretraining_data.py and run_pretraining.py scripts. However, I'm not sure if what I'm doing can be considered correct.
I also would like to know the method that I should use to verify my model.
Any help is highly appreciated!
Script Modifications
The modifications that I did were:
1. Make input file in a list format
Instead of a normal plain text file, my input file comes from my custom Khmer tokenization output, which I then put into a list format, mimicking the output I get when running the sample English text.
[[['ដំណាំ', 'សាវម៉ាវ', 'ជា', 'ប្រភេទ', 'ឈើ', 'ហូប', 'ផ្លែ'],
['វា', 'ផ្តល់', 'ផប្រយោជន៍', 'យ៉ាង', 'ច្រើន', 'ដល់', 'សុខភាព']],
[['cmt', '$', '270', 'នាំ', 'លាភ', 'នាំ', 'សំណាង', 'ហេង', 'ហេង']]]
* The outer bracket indicates a source file, the first nested bracket indicates a document, and the second nested bracket indicates a sentence. This is exactly the same structure as the variable all_documents inside the create_training_instances() function.
2. Vocab file from unique segmented words
This is the part that I really have serious doubts about. To create my vocab file, all I did was collect the unique tokens from the whole set of documents. I then added the required special tokens [CLS], [SEP], [UNK] and [MASK]. I'm not sure if this is the correct way to do it.
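Roughly, the vocab file was built like this (a simplified sketch; the output filename is just an example, and all_documents has the nested structure shown above):

# Collect every unique token across all documents and sentences.
unique_tokens = set()
for document in all_documents:
    for sentence in document:
        unique_tokens.update(sentence)

# Write the special tokens first, then the unique tokens, one per line.
with open('khmer_vocab.txt', 'w', encoding='utf-8') as f:
    for token in ['[CLS]', '[SEP]', '[UNK]', '[MASK]'] + sorted(unique_tokens):
        f.write(token + '\n')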
Feedback on this part is highly appreciated!
3. Skip tokenization step inside the create_training_instances() function
Since my input file already matches the structure of the variable all_documents, I skip lines 183 to 207 and replace them with code that reads my input as-is (this also requires import ast at the top of the script):
for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
        lines = reader.read()
        all_documents = ast.literal_eval(lines)
Results/Output
The raw input file (before custom tokenization) is from random web-scraping.
Some information on the raw and vocab file:
Number of documents/articles: 5
Number of sentences: 78
Number of vocabs: 649 (including [CLS], [SEP] etc.)
Below is the tail end of the output after running create_pretraining_data.py, followed by what I get after running run_pretraining.py (the screenshots are not reproduced here).
As the run_pretraining.py output showed, I'm getting a very low accuracy, hence my concern about whether I'm doing this correctly.
First of all, you seem to have very little training data (you mention a vocabulary size of 649). BERT is a huge model which needs a lot of training data. The English models published by Google are trained on at least the whole of Wikipedia. Think about that!
BERT uses something called WordPiece, which guarantees a fixed vocabulary size. Rare words are split up, so that Jet makers feud over seat width with big orders at stake is rendered by WordPiece as: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake.
WordPieceTokenizer.tokenize(text) takes text that has already been pre-tokenized by whitespace, so you should replace the BasicTokenizer, which is run before the WordPieceTokenizer, with your specific tokenizer, which should separate your tokens by whitespace.
To train your own WordPiece tokenizer, have a look at SentencePiece, which in BPE mode is essentially the same as WordPiece.
You can then export a vocabulary list from your WordPiece model.
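For instance, a rough sketch with a recent version of the SentencePiece Python package; the file names and vocabulary size below are placeholders, not recommendations:

import sentencepiece as spm

# Train a BPE model on the pre-segmented Khmer corpus:
# one whitespace-separated sentence per line.
spm.SentencePieceTrainer.train(
    input='khmer_segmented.txt',
    model_prefix='khmer_bpe',
    vocab_size=8000,
    model_type='bpe',
)

# Load the trained model and tokenize a sample sentence into pieces.
sp = spm.SentencePieceProcessor(model_file='khmer_bpe.model')
print(sp.encode('វា មាន មក ជាមួយ នូវ', out_type=str))

The resulting khmer_bpe.vocab file should list the learned pieces, which could then be adapted into a BERT-style vocab file.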
I did not pretrain a BERT model myself, so I cannot help you on where to change something in the code exactly.

Detect bigram collocations on a very large text dataset

I would like to find bigrams in a large corpus in text format. As the corpus cannot be loaded into memory at once and its lines are very long, I load it in chunks of 1 KB each:
def read_in_chunks(file_obj, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1 KB."""
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        yield data
Then I want to go through the corpus piece by piece, finding bigrams with the gensim Phrases() and Phraser() classes, but while training, my model constantly loses state. So I tried to save and reload the model after each megabyte read, and then free the memory, but it still loses state. My code is here:
with open("./final/corpus.txt", "r", encoding='utf8') as r:
max_vocab_size=20000000
phrases = Phrases(max_vocab_size=max_vocab_size)
i=1
j=1024
sentences = ""
for piece in read_in_chunks(r):
if i<=j:
sentences = sentences + piece
else:
phrases.add_vocab(sentences)
phrases = Phrases(sentences)
phrases = phrases.save('./final/phrases.txt')
phrases = Phraser.load('./final/phrases.txt')
sentences = ""
j+=1024
i+=1
print("Done")
Any suggestion?
Thank you.
When you do the two lines...
phrases.add_vocab(sentences)
phrases = Phrases(sentences)
...that 2nd line is throwing away any existing instance inside the phrases variable, and replacing it with a brand new instance (Phrases(sentences)). There's no chance for additive adjustment to the single instance.
Secondarily, there's no way two consecutive lines of .save()-then-immediate-re-.load() can save net memory usage. At best, the .load() would be unnecessary, exactly reproducing what was just .save()d, but wasting a lot of time and temporary memory loading a second copy, then discarding the duplicate that was already in phrases when phrases is re-assigned to the new clone.
While these are problems, more generally, the issue is that what you need done doesn't have to be this complicated.
The Phrases class will accept as its corpus of sentences an iterable object where each item is a list-of-string-tokens. You don't have to worry about chunk-sizes, and calling add_vocab() multiple times – you can just provide a single object that itself offers up each item in turn, and Phrases will do the right thing. You do have to worry about breaking up raw lines into the specific words you want considered ('tokenization').
(For a large corpus, you might still run into memory issues related to the number of unique words that Phrases is trying to count. But it won't matter how arbitrarily large the number of items is – because it will only look at one at a time. Only the accumulation of unique words will consume running memory.)
For a good intro to how an iterable object can work in such situations, a good blog post is:
Data streaming in Python: generators, iterators, iterables
If your corpus.txt file is already set up to be one reasonably-sized sentence per line, and all words are already delimited by simple spaces, then an iterable class might be as simple as:
class FileByLineIterable(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        with open(self.filename, 'r', encoding='utf8') as src:
            for line in src:  # iterate lazily, never holding the whole file in memory
                yield line.split()
Then, your code might just be as simple as...
sentences = FileByLineIterable('./final/corpus.txt')
phrases = Phrases(sentences, max_vocab_size=max_vocab_size)
...because the Phrases class is getting what it wants – a corpus that offers via iteration just one list-of-words item at a time.
Note:
you may want to enable logging at the INFO level to monitor progress and watch for any hints of things going wrong
there's a slightly more advanced line-by-line iterator, which also limits any one line's text to no more than 10,000 tokens (to match an internal implementation limit of gensim Word2Vec), and opens files from places other than local file-paths, available at gensim.models.word2vec.LineSentence. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence
(Despite this being packaged in the word2vec package, it can be used elsewhere.)
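Putting those notes together, a minimal sketch, assuming corpus.txt is already one space-delimited sentence per line:

import logging
from gensim.models.phrases import Phrases
from gensim.models.word2vec import LineSentence

# INFO-level logging makes gensim report progress while scanning the corpus.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

sentences = LineSentence('./final/corpus.txt')
phrases = Phrases(sentences, max_vocab_size=20000000)
phrases.save('./final/phrases.model')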

Problem with imgIdx in DMatch class using FlannBasedMatcher in Python

I have the same issue as here:
how to access best image corresponding to best keypoint match using opencv flannbasedmatcher and dmatch
Unfortunately, this post doesn't have an answer.
I have several images (and corresponding descriptors), that I add to the FlannBasedMatcher, using the 'add' method (once for each set of descriptors, corresponding to a single image).
However, when I match an image, the returned imgIdx is way larger than the number of images in the training set. It feels like each descriptor is treated as an image, but this is not what I want.
I want to know which image (or set of descriptors) each feature has been matched to.
Here is a part of my code (I simplified it a bit, and I know 'test' is not a great variable name, but it's temporary).
Also here I read .key files, which are basically files containing keypoints and descriptors of an image (extracted with SIFT).
I should point out that in the following code, featMatch is just a class I created that wraps a FlannBasedMatcher (with its initialization parameters).
with open(os.path.join(ROOT_DIR, "images\\descriptor_list.txt"), 'r') as f:
    for line in f:
        folder_path = os.path.join(ROOT_DIR, "images\\", line[:-1] + "\\", "*.key")
        list_key = glob.glob(folder_path)
        test2 = []
        for key in list_key:
            if os.path.isfile(key):
                feat = Features()
                feat.readFromFile(key)
                test = feat.descriptors
                test2 = test2 + test
        featMatch.add(test2)

# Read submitted picture features
feat = Features()
feat.readFromFile(os.path.join(ROOT_DIR, "submitted_picture\\sub.key"))

matches = []
matches.append(featMatch.knnMatch(np.array(feat.descriptors), k=3))
print(matches)
I was expecting the imgIdx of the matches to tell me which image index the matching feature (trainIdx) corresponds to, based on the descriptor sets I added with the 'add' method.
Under that assumption, imgIdx should never be larger than the number of images (or training sets) in my training set.
However, here I get numbers such as 2960, while I only have about 5 images in my training set.
My guess is that it returns the feature index instead of the image index, but I don't know why.
I noticed that the 'add' method in C++ takes an array of arrays, i.e. a list of descriptor sets (one per image, I guess). But I have a different number of features for each image, so I can't really create a single numpy array with a different number of rows per image.
Thanks.
I finally figured it out after looking at the C++ source code of matcher.cpp:
https://github.com/opencv/opencv/blob/master/modules/features2d/src/matchers.cpp
I'm going to post the answer, in case somebody needs it someday.
I thought that the 'add' method would increment the image count each time it is called, but it does not. So I realized that I have to build a list of Mats (numpy arrays in Python), one per image, and pass it to 'add' once, instead of calling 'add' for each image.
So here is the updated (and working) source code:
with open(os.path.join(ROOT_DIR, "images\\descriptor_list.txt"), 'r') as f:
    list_image_descriptors = []
    for line in f:
        folder_path = os.path.join(ROOT_DIR, "images\\", line[:-1] + "\\", "*.key")
        list_key = glob.glob(folder_path)
        for key in list_key:
            if os.path.isfile(key):
                feat = Features()
                feat.readFromFile(key)
                img_descriptors = np.array(feat.descriptors)
                list_image_descriptors.append(img_descriptors)
    featMatch.add(list_image_descriptors)

# Read submitted picture features
feat = Features()
feat.readFromFile(os.path.join(ROOT_DIR, "submitted_picture\\sub.key"))

matches = []
matches.append(featMatch.knnMatch(np.array(feat.descriptors), k=3))
print(matches)
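As a small follow-up sketch: with the descriptors added this way, imgIdx indexes into list_image_descriptors, so a parallel list of image names, filled in the same loop, lets you recover which image each match came from. The list_image_paths variable here is hypothetical, not part of the code above:

# matches[0] is what knnMatch returned: one list of k DMatch objects
# per query descriptor.
for knn in matches[0]:
    for m in knn:
        print("query feature", m.queryIdx,
              "matched train feature", m.trainIdx,
              "in image", list_image_paths[m.imgIdx])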
Hope this helps.

Lambda Functions in python

In the NLTK toolkit, I try to use the lambda function to filter the results.
I have a text_file and a terms_file.
What I'm doing is using the likelihood_ratio in NLTK to rank the multi-word terms in the terms_file. But the input here is the lemma of each multi-word term, so I created a function which extracts the lemma of each multi-word term, to be used afterwards in the lambda function.
So it looks like this:
text_file = myfile
terms_file = myfile
def lem(file):
    return lemma for each term in the file   # pseudocode: yields the lemma of each term
My problem is here
How can I call this function in the filter? When I do the following, it does not work:
finder = BigramCollocationFinder.from_words(text_file)
finder.apply_ngram_filter(lambda *w: w not in lem(terms_file))
finder.score_ngrams(BigramAssocMeasures.likelihood_ratio)
print(finder)
Using a list comprehension does not work either:
finder.apply_ngram_filter(lambda *w: w not in [x for x in lem(terms_file)])
This is sort of a wild guess, but I'm pretty confident that this is the cause of your problem.
Judging from your pseudo-code, the lem function operates on a file handle, reading some information from that file. You need to understand that a file handle is an iterator, and it will be exhausted when iterated once. That is, the first call to lem works as expected, but then the file is fully read and further calls will yield no results.
Thus, I suggest storing the result of lem in a list. This should also be much faster than reading the file again and again. Try something like this:
all_lemma = lem(terms_file) # temporary variable holding the result of `lem`
finder.apply_ngram_filter(lambda *w: w not in all_lemma)
Your line finder.apply_ngram_filter(lambda *w: w not in [x for x in lem(terms_file)]) does not work, because while this creates a list from the result of lem, it does so each time the lambda is executed, so you end up with the same problem.
(Not sure what apply_ngram_filter does, so there might be more problems after that.)
Update: Judging from your other question, it seems like lem itself is a generator function. In this case, you have to explicitly convert the results to a list; otherwise you will run into just the same problem when that generator is exhausted.
all_lemma = list(lem(terms_file))
If the elements yielded by lem are hashable, you can also create a set instead of a list, i.e. all_lemma = set(lem(terms_file)); this will make the lookup in the filter much faster.
If I understand what you are saying, lem(terms_file) returns a list of lemmas. But what do "lemmas" look like? apply_ngram_filter() will only work if each "lemma" is a tuple of exactly two words. If that is indeed the case, then your code should work after you've fixed the file input as suggested by @tobias_k.
Even if your code works, the output of lem() should be stored as a set, not a list. Otherwise your code will be abysmally slow.
all_lemmas = set(lem(terms_file))
But I'm not too sure the above assumptions are right. Why would all lemmas be exactly two words long? I'm guessing that "lemmas" are one word long, and you intended to discard any ngram containing a word that is not in your list. If that's true you need apply_word_filter(), not apply_ngram_filter(). Note that it expects one argument (a word), so it should be written like this:
finder.apply_word_filter(lambda w: w not in all_lemmas)
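Putting the two answers together, a hedged sketch of how the corrected code might look, assuming lem() yields one-word, hashable lemmas and that text_file is the word sequence expected by from_words():

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Materialize the lemmas once, as a set, so the filter is fast and the
# underlying file/generator is not exhausted on the first call.
all_lemmas = set(lem(terms_file))

finder = BigramCollocationFinder.from_words(text_file)
finder.apply_word_filter(lambda w: w not in all_lemmas)
scored = finder.score_ngrams(BigramAssocMeasures.likelihood_ratio)
print(scored[:10])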
