Iterating Over Numpy Array for NLP Application - python

I'm building a Word2Vec model and have a vocab_list of about 30k words and a list of about 150k sentences (sentences_list). I am trying to remove tokens (words) from the sentences that aren't in vocab_list. The task seemed simple, but the nested for loops and repeated list reallocation below are slow: the run took about an hour, and I don't want to repeat it.
Is there a cleaner way to do this?
import numpy as np
import pandas as pd
from datetime import datetime

start = datetime.now()
timing = []
result = []
counter = 0
for sent in sentences_list:
    counter += 1
    if counter % 1000 == 0 or counter == 1:
        print(counter, 'row of', len(sentences_list), ' Elapsed time: ', datetime.now() - start)
        timing.append([counter, datetime.now() - start])
    final_tokens = []
    for token in sent:
        if token in vocab_list:
            final_tokens.append(token)
    #if len(final_tokens)>0:
    result.append(final_tokens)
print(counter, 'row of', len(sentences_list), ' Elapsed time: ', datetime.now() - start)
timing.append([counter, datetime.now() - start])
sentences = result
del result
timing = pd.DataFrame(timing, columns=['Counter', 'Elapsed_Time'])

Note that typical word2vec implementations (like Google's original word2vec.c or gensim Word2Vec) will often just ignore words in their input that aren't part of their established vocabulary (as specified by vocab_list or enforced via a min_count). So you may not need to perform this filtering at all.
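For example, with gensim you can usually pass the unfiltered sentences straight to the model and let min_count define the vocabulary. A minimal sketch (assuming gensim's Word2Vec API; the min_count value is illustrative):

from gensim.models import Word2Vec

# Words below the frequency cutoff never enter the vocabulary, and
# out-of-vocabulary words in the sentences are silently skipped during training.
model = Word2Vec(sentences_list, min_count=5, workers=4)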
Using a more-idiomatic Python list-comprehension might be noticeably faster (and would certainly be more compact). Your code could simply be:
filtered_sentences = [
    [word for word in sent if word in vocab_list]
    for sent in sentences_list
]
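Most of the remaining cost is the membership test (word in vocab_list), which is O(len(vocab_list)) per token against a plain list. Converting the vocabulary to a set first (a small tweak of my own, not something the original code did) makes each lookup O(1):

vocab_set = set(vocab_list)
filtered_sentences = [
    [word for word in sent if word in vocab_set]
    for sent in sentences_list
]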

Related

Python sklearn TfidfVectorizer: Vectorize documents ahead of query for semantic search

I want to run semantic search using TF-IDF.
This code works, but it is really slow when used on a large corpus of documents:
search_terms = "my query"
documents = ["my","list","of","docs"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform([search_terms] + documents)
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
It seems quite inefficient:
Every new search query triggers a re-vectorizing of the entire corpus.
I am wondering how I can do the bulk work of vectorizing my corpus ahead of time, saving the result in an "index file", so that when I run a query, the only thing left to do is vectorize the few words of the query and then calculate the similarity.
I tried vectorizing query and documents separately:
vec_docs = vectorizer.fit_transform(documents)
vec_query = vectorizer.fit_transform([search_terms])
cosine_similarities = linear_kernel(vec_query, vec_docs).flatten()
But it gives me this error:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 3 while Y.shape[1] == 260541
How can I run the corpus vectorization ahead of time without knowing what the query will be?
My main goal is to get blazing fast results even with a large corpus of documents (say, a few GB worth of text), even on a low-powered server, by doing the bulk of the data-crunching ahead of time.
TF/IDF vectors are high-dimensional and sparse. The basic data structure that supports that is an inverted index. You can either implement it yourself or use a standard index (e.g., Lucene).
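A toy inverted index, just to make the idea concrete (illustrative names, nowhere near a production design):

from collections import defaultdict

docs = {0: "my first doc", 1: "another doc about search", 2: "unrelated text"}

# map each term to the set of document ids that contain it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def candidates(query):
    # union of the postings for the query terms: only documents sharing at
    # least one term with the query are ever touched
    ids = set()
    for term in query.lower().split():
        ids |= index.get(term, set())
    return ids

print(candidates("doc search"))  # {0, 1}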
Nevertheless, if you would like to experiment with modern deep-neural-based vector representations, check out the following semantic search demo. It uses a similarity search service that can handle billions of vectors.
(Note, I am a co-author of this demo.)
You almost have it right.
In this instance, you can get away with fitting (and transforming) your documents and only transforming your search terms. Here is your code, modified accordingly and using the twenty_newsgroups documents (about 11k of them) in place of your document list. You can run it as a script and interactively verify that you get fast results:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

news = fetch_20newsgroups()

search_terms = "my query"
# documents = ["my", "list", "of", "docs"]
documents = news.data

vectorizer = TfidfVectorizer()

# fit_transform does two things: fits the vectorizer and transforms documents
doc_vectors = vectorizer.fit_transform(documents)

# the vectorizer is already fit; just transform search_terms via vectorizer
search_term_vector = vectorizer.transform([search_terms])

cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()

if __name__ == "__main__":
    while True:
        query_str = input("\n\n\n\nquery string (return to quit): ")
        if not query_str:
            print("bye!")
            break
        search_term_vector = vectorizer.transform([query_str])
        cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()
        best_idx = np.argmax(cosine_similarities)
        best_score = cosine_similarities[best_idx]
        best_doc = documents[best_idx]
        if best_score < 0.1:
            print("no good matches")
        else:
            print(
                f"Best match ({round(best_score, 4)}):\n\n", best_doc[0:200] + "...",
            )
Example output:
query string (return to quit): protocol
Best match 0.239 (0.014 sec):
From: ethan#cs.columbia.edu (Ethan Solomita)
Subject: Re: X protocol packet type
Article-I.D.: cs.C52I2q.IFJ
Organization: Columbia University Department of Computer Science
Lines: 7
In article <9309...
Note: this algorithm finds the best match(es) in O(n_documents) time at best, compared to Lucene (which powers Elasticsearch) and uses skip lists to search in O(log(n_documents)). Production search engines also have quite a bit of tuning to optimize performance. The above could be useful with some tweaking but isn't going to topple Google tomorrow :)
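The question also asks about saving the precomputed index to disk. Continuing from the script above (so the imports and fitted objects are already in scope), a minimal persistence sketch using joblib (my own addition, not covered in the original answer) could be:

from joblib import dump, load

# in the indexing process, after fitting
dump((vectorizer, doc_vectors), "tfidf_index.joblib")

# later, in the query process: load instead of re-fitting
vectorizer, doc_vectors = load("tfidf_index.joblib")
query_vec = vectorizer.transform(["my query"])
scores = linear_kernel(doc_vectors, query_vec).flatten()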

How can I find the probability of a sentence using GPT-2?

I'm trying to write a program that, given a list of sentences, returns the most probable one. I want to use GPT-2, but I am quite new to it (as in, I don't really know how to use it). I'm planning to find the probability of a word given the previous words and multiply all the probabilities together to get the overall probability of that sentence occurring; however, I don't know how to find the probability of a word given the previous words. This is my (pseudo)code:
sentences = # my list of sentences
max_prob = 0
best_sentence = sentences[0]

for sentence in sentences:
    prob = 1  # probability of that sentence
    words = sentence.split()
    for idx, word in enumerate(words[1:], start=1):
        prob *= probability(word, " ".join(words[:idx]))  # this is where I need help
    if prob > max_prob:
        max_prob = prob
        best_sentence = sentence

print(best_sentence)
Can I have some help please?
You can also try lm-scorer, a tiny wrapper around transformers that lets you get sentence probabilities using models that support it (only GPT-2 models are implemented at the time of writing).
https://github.com/simonepri/lm-scorer
I just used it myself and it works perfectly.
Warning: If you use other transformers / pipelines in the same environment, things may get messy.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def score(tokens_tensor):
    # exponentiated language-modeling loss: lower means more probable
    loss = model(tokens_tensor, labels=tokens_tensor)[0]
    return np.exp(loss.cpu().detach().numpy())

texts = ['i would like to thank you mr chairman',
         'i would liking to thanks you mr chair in',
         'thnks chair']
for text in texts:
    tokens_tensor = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
    print(text, score(tokens_tensor))
This code snippet could be an example of what you are looking for. You feed the model a list of sentences, and it scores each one, where lower is better.
The output of the code above is:
i would like to thank you mr chairman 122.3066
i would liking to thanks you mr chair in 1183.7637
thnks chair 14135.129
I wrote a set of functions that can do precisely what you're looking for. Recall that GPT-2 parses its input into tokens (not words): the last word in 'Joe flicked the grasshopper' is actually three tokens: ' grass', 'ho', and 'pper'. The cloze_finalword function takes this into account, and computes the probabilities of all tokens (conditioned on the tokens appearing before them). You can adapt part of this function so that it returns what you're looking for. I hope you find the code useful!
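For reference, here is a minimal sketch of computing per-token conditional probabilities with the transformers API. This is my own illustration, not the answerer's cloze_finalword code, and it assumes a transformers version whose model outputs expose a .logits attribute:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(text):
    input_ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)
    # log P(token_i | tokens before i) for every position after the first
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

print(sentence_log_prob("Joe flicked the grasshopper."))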
I think GPT-2 is a bit overkill for what you're trying to achieve. You can build a basic language model which will give you sentence probability using NLTK. A tutorial for this can be found here.
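For example, a small n-gram model with nltk.lm might look like this (a sketch of my own; the toy corpus and the choice of Laplace smoothing are just for illustration):

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = [["i", "would", "like", "to", "thank", "you", "mr", "chairman"],
          ["thank", "you", "mr", "chair"]]
n = 2
train_ngrams, vocab = padded_everygram_pipeline(n, corpus)

lm = Laplace(n)  # add-one smoothing so unseen bigrams don't zero out a sentence
lm.fit(train_ngrams, vocab)

# P(word | previous word); multiply (or sum logs) across a sentence to rank it
print(lm.score("you", ["thank"]))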

Doc2Vec not providing adequate results in most_similar

I'm trying to use Doc2Vec to go through the classic exercise of training on Wikipedia articles, using the article title as the tag.
Here's my code and the results. Is there something I'm missing that would explain why most_similar doesn't give matching results? I followed this tutorial, but used the wiki-english-20171001 dataset that comes with gensim.
import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import re

def cleanText(text):
    text = re.sub(r'\|\|\|', r' ', text)
    text = re.sub(r'http\S+', r'<URL>', text)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

wiki = api.load("wiki-english-20171001")
data = [d for d in wiki]
for i in range(10):
    print(data[i])

def my_create_tagged_docs(data):
    for wikiidx in range(len(data)):
        yield TaggedDocument([i for i in data[wikiidx].get('section_texts') for i in cleanText(i).split()],
                             [data[wikiidx].get('title')])

wiki_data = my_create_tagged_docs(data)
del data
del wiki

model = Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=19, iter=10, epochs=40)
model.build_vocab(wiki_data)
model.train(wiki_data, total_examples=model.corpus_count, epochs=model.epochs)
model.docvecs.most_similar(positive=["Lady Gaga"], topn=10)
[('Chlorothrix', 0.35521823167800903),
("A Child's Garden of Verses", 0.3533579707145691),
('Fish Mooney', 0.35129639506340027),
('2000 Paris–Roubaix', 0.3463437855243683),
('Calvin C. Chaffee', 0.3439667224884033),
('Murders of Eve Stratford and Lynne Weedon', 0.3397218585014343),
('Black Air', 0.3396576941013336),
('Turzyn', 0.3312540054321289),
('Scott Baker', 0.33018186688423157),
('Amongst the Waves', 0.3297169804573059)]
model.docvecs.most_similar(positive=["Machine learning"], topn=10)
[('Wolf Rock, Connecticut', 0.3855834901332855),
('Amália Rodrigues', 0.3349645137786865),
('Victoria Park, Leicester', 0.33312514424324036),
('List of visual anthropology films', 0.3311382532119751),
('Sadqay Teri Mout Tun', 0.3287636637687683),
('T. Damodaran', 0.32876330614089966),
('Urqu Jawira (Aroma)', 0.32281631231307983),
('Tiggy Wiggy', 0.3226730227470398),
('Frédéric Brun (cyclist, born 1988)', 0.32106447219848633),
('Unholy Crusade', 0.3200794756412506)]
It looks like your wiki_data is a single-pass generator, as returned by my_create_tagged_docs(), which can be iterated over only once - not an iterable object capable of many iterations, as the many steps of the Doc2Vec training requires.
You can test your wiki_data object for whether it's multiply-iterable, just after it's been assigned, by executing:
print(sum(1 for _ in wiki_data))
print(sum(1 for _ in wiki_data))
If you see the same number twice – the total number of documents – all's well. If the 2nd number is 0, you've created a single-use iterator instead of a multiple-use iterable.
As a result, the build_vocab() call will work to initialize the known-vocabulary & model – but then the train() will see an empty iterable, completing instantly with no real training happening. (If you run with logging at the INFO level, this may be obvious in the log timestamps for the various steps.)
Two possible fixes:
If you're lucky enough to have enough RAM to hold the whole corpus as Python objects, converting it into an in-memory list would ensure it's multiply-iterable:
wiki_data = list(my_create_tagged_docs(data))
But most won't have that much RAM and shouldn't/needn't take that step. Instead, you can define a class providing an iterable view on the data, which returns a fresh iterator every time one is needed. There's an example with further explanation in a blog post by the founder of the gensim project:
https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
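For example, a restartable wrapper might look like this (an illustrative sketch reusing cleanText() and TaggedDocument from the question, not code from the linked post):

class TaggedWikiCorpus:
    def __init__(self, articles):
        self.articles = articles

    def __iter__(self):
        # a fresh generator is created on every call, so Doc2Vec can iterate
        # once for build_vocab() and again for each training epoch
        for article in self.articles:
            tokens = [w for section in article.get('section_texts')
                      for w in cleanText(section).split()]
            yield TaggedDocument(tokens, [article.get('title')])

wiki_data = TaggedWikiCorpus(data)  # keep `data` around instead of deleting it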

Is there any faster way to check from a words-list with nltk with python?

I am checking a word list of approx. 2.1 million keywords against the nltk module for good English words.
The words are read from a text file, each is checked for being a correct English word, and the good ones are written to another text file.
The script works, but it is ridiculously slow: approx. 7 iterations per second.
Is there any faster way to do this?
Here is my code:
import nltk
from nltk.corpus import words
from tqdm import tqdm

total_size = 2170503
with open('two_words.txt', 'r', encoding='utf-8') as file:
    for word in tqdm(file, total=total_size):
        word = word.strip()
        if all([w in words.words() for w in word.split()]):
            with open('good_two.txt', 'a', encoding='utf-8') as file:
                file.write(word)
                file.write('\n')
        else:
            pass
Is there any faster way of doing the same, e.g. by using WordNet, or do you have any other suggestion?
You can make it much faster by converting words.words() to a set, as the following test shows.
from nltk.corpus import words
import time

# Test text
text = "she sell sea shell by the seashore"

# Original method
start = time.time()
x = all([w in words.words() for w in text.split()])
print("Duration Original Method: ", time.time() - start)

# Time to convert words to set
start = time.time()
set_words = set(words.words())
print("Time to generate set: ", time.time() - start)

# Test using set (single iteration)
start = time.time()
x = all([w in set_words for w in text.split()])
print("Set using 1 iteration: ", time.time() - start)

# Test using set (100,000 iterations)
start = time.time()
for k in range(100000):
    x = all([w in set_words for w in text.split()])
print("Set using 100,000 iterations: ", time.time() - start)
The results show that using a set is roughly 200,000× faster. This is because words.words() has 236,736 elements, so each membership test against the list is O(n) with n ≈ 236,736, whereas the set reduces each lookup to O(1).
Duration Original Method: 0.601 seconds
Time to generate set: 0.131 seconds
Set using 1 iteration: 0.0 seconds
Set using 100,000 iterations: 0.304 seconds
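Applied to the original script, the change is small: build the set once before the loop, and open the output file once instead of once per word (a sketch using the question's file names):

from nltk.corpus import words
from tqdm import tqdm

good_words = set(words.words())
total_size = 2170503

with open('two_words.txt', 'r', encoding='utf-8') as infile, \
     open('good_two.txt', 'a', encoding='utf-8') as outfile:
    for line in tqdm(infile, total=total_size):
        phrase = line.strip()
        if all(w in good_words for w in phrase.split()):
            outfile.write(phrase + '\n')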
I would try using threading, because this algorithm currently runs on a single thread. But be aware that several writable streams on one file could be a problem; write to separate files and merge them once you have all the words you need.
The other problem is that Python itself is slow. If you need the solution to be faster, consider a compiled language (C/C++ is a good choice, for example), or execute that piece of code in another language from Python and then continue in Python.
Writing the data to a binary file could also be faster if you don't require a .txt output file.

Finding closest related words using word2vec

My goal is to find the most relevant words given a set of keywords using word2vec. For example, if I have the set of words [girl, kite, beach], I would like word2vec to output relevant words such as [flying, swimming, swimsuit...].
I understand that word2vec vectorizes a word based on the context of the surrounding words. So what I did was use the following function:
most_similar_cosmul(['girl', 'kite', 'beach'])
However, it seems to give out words not very related to the set of keywords:
['charade', 0.30288437008857727]
['kinetic', 0.3002534508705139]
['shells', 0.29911646246910095]
['kites', 0.2987399995326996]
['7-9', 0.2962781488895416]
['showering', 0.2953910827636719]
['caribbean', 0.294752299785614]
['hide-and-go-seek', 0.2939240336418152]
['turbine', 0.2933803200721741]
['teenybopper', 0.29288050532341003]
['rock-paper-scissors', 0.2928623557090759]
['noisemaker', 0.2927709221839905]
['scuba-diving', 0.29180505871772766]
['yachting', 0.2907838821411133]
['cherub', 0.2905363440513611]
['swimmingpool', 0.290039986371994]
['coastline', 0.28998953104019165]
['Dinosaur', 0.2893030643463135]
['flip-flops', 0.28784963488578796]
['guardsman', 0.28728148341178894]
['frisbee', 0.28687697649002075]
['baltic', 0.28405341506004333]
['deprive', 0.28401875495910645]
['surfs', 0.2839275300502777]
['outwear', 0.28376665711402893]
['diverstiy', 0.28341981768608093]
['mid-air', 0.2829524278640747]
['kickboard', 0.28234976530075073]
['tanning', 0.281939834356308]
['admiration', 0.28123530745506287]
['Mediterranean', 0.281186580657959]
['cycles', 0.2807052433490753]
['teepee', 0.28070521354675293]
['progeny', 0.2775532305240631]
['starfish', 0.2775339186191559]
['romp', 0.27724218368530273]
['pebbles', 0.2771730124950409]
['waterpark', 0.27666303515434265]
['tarzan', 0.276429146528244]
['lighthouse', 0.2756190896034241]
['captain', 0.2755546569824219]
['popsicle', 0.2753356397151947]
['Pohoda', 0.2751699686050415]
['angelic', 0.27499720454216003]
['african-american', 0.27493417263031006]
['dam', 0.2747344970703125]
['aura', 0.2740659713745117]
['Caribbean', 0.2739778757095337]
['necking', 0.27346789836883545]
['sleight', 0.2733519673347473]
This is the code I used to train word2vec
import ast
import codecs
import csv
import logging
import multiprocessing
import os

import nltk
from gensim.models import word2vec as w2v  # imports inferred from the usage below


def train(data_filepath, epochs=300, num_features=300, min_word_count=2, context_size=7, downsampling=1e-3, seed=1,
          ckpt_filename=None):
    """
    Train word2vec model

    data_filepath          path of the data file in csv format
    :param epochs:         number of times to train
    :param num_features:   increase to improve generality, more computationally expensive to train
    :param min_word_count: minimum frequency of word. Word with lower frequency will not be included in training data
    :param context_size:   context window length
    :param downsampling:   reduce frequency for frequent keywords
    :param seed:           make results reproducible for random generator. Same seed means, after training, model produces same results.
    :returns               path of the checkpoint after training
    """
    if ckpt_filename is None:
        data_base_filename = os.path.basename(data_filepath)
        data_filename = os.path.splitext(data_base_filename)[0]
        ckpt_filename = data_filename + ".wv.ckpt"

    num_workers = multiprocessing.cpu_count()
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    nltk.download("punkt")
    nltk.download("stopwords")

    print("Training %s ..." % data_filepath)
    sentences = _get_sentences(data_filepath)

    word2vec = w2v.Word2Vec(
        sg=1,
        seed=seed,
        workers=num_workers,
        size=num_features,
        min_count=min_word_count,
        window=context_size,
        sample=downsampling
    )
    word2vec.build_vocab(sentences)
    print("Word2vec vocab length: %d" % len(word2vec.wv.vocab))
    word2vec.train(sentences, total_examples=len(sentences), epochs=epochs)
    return _save_ckpt(word2vec, ckpt_filename)


def _save_ckpt(model, ckpt_filename):
    if not os.path.exists("checkpoints"):
        os.makedirs("checkpoints")
    ckpt_filepath = os.path.join("checkpoints", ckpt_filename)
    model.save(ckpt_filepath)
    return ckpt_filepath


def _get_sentences(data_filename):
    print("Found Data:")
    sentences = []
    print("Reading '{0}'...".format(data_filename))
    with codecs.open(data_filename, "r") as data_file:
        reader = csv.DictReader(data_file)
        for row in reader:
            sentences.append(ast.literal_eval(row["highscores"]))
    print("There are {0} sentences".format(len(sentences)))
    return sentences


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description='Train Word2vec model')
    parser.add_argument('data_filepath',
                        help='path to training CSV file.')
    args = parser.parse_args()
    data_filepath = args.data_filepath
    train(data_filepath)
This is a sample of training data used for word2vec:
22751473,"[""lover"", ""sweetheart"", ""couple"", ""dietary"", ""meal""]"
28738542,"[""mallotus"", ""villosus"", ""shishamo"", ""smelt"", ""dried"", ""fish"", ""spirinchus"", ""lanceolatus""]"
25163686,"[""Snow"", ""Removal"", ""snow"", ""clearing"", ""female"", ""females"", ""woman"", ""women"", ""blower"", ""snowy"", ""road"", ""operate""]"
32837025,"[""milk"", ""breakfast"", ""drink"", ""cereal"", ""eating""]"
23828321,"[""jogging"", ""female"", ""females"", ""lady"", ""woman"", ""women"", ""running"", ""person""]"
22874156,"[""lover"", ""sweetheart"", ""heterosexual"", ""couple"", ""man"", ""and"", ""woman"", ""consulting"", ""hear"", ""listening""]
For prediction, I simply used the following function for a set of keywords:
most_similar_cosmul
I was wondering whether it is possible to find relevant keywords with word2vec. If it is not, what machine learning model would be more suitable for this? Any insights would be very helpful.
When supplying multiple positive-word examples, like ['girl', 'kite', 'beach'], to most_similar()/most_similar_cosmul(), the vectors for those words are averaged together first, and then a list of words most similar to that average is returned. Those might not be as obviously related to any one of the words as the results of a simple check on a single word. So:
When you try most_similar() (or most_similar_cosmul()) on a single word, what kind of results do you get? Are they words that seem related to the input word, in the way that you care about?
If not, you have deeper problems in your setup that should be fixed before trying a multi-word similarity.
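A quick single-word sanity check might look like this (a sketch; it assumes you keep a handle on the trained model rather than only the checkpoint path returned by train()):

# inspect single-word neighbors before judging the multi-word average
for probe in ['girl', 'kite', 'beach']:
    print(probe, word2vec.wv.most_similar(probe, topn=5))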
Word2Vec gets its usual results from (1) lots of training data and (2) natural-language sentences. With enough data, the typical number of training epochs (and thus the default) is 5. You can sometimes partially make up for less data with more epochs or a smaller vector size, but not always.
It's not clear how much data you have. Also, your example rows aren't real natural-language sentences – they appear to have had some other preprocessing/reordering applied. That may be hurting rather than helping.
Word-vectors often improve by throwing away more low-frequency words (increasing min_count above the default of 5, rather than reducing it to 2). Low-frequency words don't have enough examples to get good vectors, and the few examples they have, even if repeated over many iterations, tend to be idiosyncratic examples of the words' usage, not the generalizable broad representations you'd get from many varied examples. And by keeping these doomed-to-be-weak words in the training data, the training of other more-frequent words is interfered with. (When you get a word that you don't think belongs in a most-similar ranking, it may be a rare word that, given its few occurrence contexts, found its way to those coordinates as the least-bad location among plenty of other unhelpful coordinates.)
If you do get good results from single-word checks, but not from the average-of-multiple-words, the results might improve with more and better data, or adjusted training parameters – but to achieve that you'd need to more rigorously define what you consider good results. (Your existing list doesn't look that bad to me: it includes many words related to sun/sand/beach activities.)
On the other hand, your expectations of Word2Vec may be too high: the average of ['girl', 'kite', 'beach'] may not necessarily be close to those desired words, compared to the individual words themselves, or that may only be achievable with lots of dataset/parameter tweaking.
