I am a PhD researcher and have started using word2vec for my research. I just want to use it for computing sentence similarity. I searched and found a few links, but I couldn't get them to run. I was looking at the following:
import numpy as np
from scipy import spatial

index2word_set = set(model.wv.index2word)

def avg_feature_vector(sentence, model, num_features, index2word_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)
Unfortunately I couldn't run this because I don't know how to obtain "index2word_set". Also, what should I assign to model=? Are there any simpler commands or instructions for implementing this?
Assign model to the model you trained yourself, or to any pretrained word2vec model you want to use.
As for index2word_set, you can build it from the model's vocabulary with set(model.wv.index2word), exactly as in the snippet above.
It should work then.
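For completeness, here is a minimal sketch of how the pieces could fit together; the toy sentences and file name are placeholders, and attribute names differ slightly across gensim versions (noted in the comments):

from gensim.models import Word2Vec

# either train a model on your own tokenized sentences ...
sentences = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'also', 'a', 'sentence']]
model = Word2Vec(sentences, size=300, min_count=1)  # 'size' is called 'vector_size' in gensim >= 4

# ... or load one you saved earlier (path is a placeholder)
# model = Word2Vec.load('my_word2vec.model')

index2word_set = set(model.wv.index2word)  # 'index2word' is 'index_to_key' in gensim >= 4

With model and index2word_set defined this way, the avg_feature_vector calls above run as-is.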
I want to train a FastText model in Python using the "gensim" library. First, I tokenize each sentence into its words, converting each sentence to a list of words. Each list is then appended to a final list, so at the end I have a nested list containing all the tokenized sentences:
import nltk

word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = []
for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        word_tokenized_corpus.append(new)
Then, the model is built as follows:
from gensim.models import FastText

embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2

ft_model = FastText(word_tokenized_corpus,
                    size=embedding_size,
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    iter=100)
However, the number of sentences in "word_tokenized_corpus" is very large and the program can't handle it. Is it possible to train the model by giving it each tokenized sentence one by one, like the following?
for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        ft_model = FastText(new,
                            size=embedding_size,
                            window=window_size,
                            min_count=min_word,
                            sample=down_sampling,
                            sg=1,
                            iter=100)
Does this make any difference to the final results? Is it possible to train the model without having to build such a large list and keep it in memory?
Since the volume of data is very high, it is better to stream the corpus from disk with gensim's corpus_file argument instead of keeping the nested list in memory. The file should contain one sentence per line with space-separated tokens. You can reference it in the following way:
from gensim.test.utils import datapath

# datapath() resolves files bundled with gensim's test data;
# for your own corpus you can simply use its path, e.g. corpus_file = 'sentences.txt'
corpus_file = datapath('sentences.cor')
As for the next step:
model = FastText(size=embedding_size,
                 window=window_size,
                 min_count=min_word,
                 sample=down_sampling,
                 sg=1,
                 iter=100)

model.build_vocab(corpus_file=corpus_file)
total_words = model.corpus_total_words

model.train(corpus_file=corpus_file, total_words=total_words, epochs=5)
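Once training finishes, the model can be queried just like one trained from an in-memory list; a quick sanity check might look like this (the word 'sentence' is just a placeholder from your corpus):

# look up a vector and its nearest neighbours; thanks to FastText's
# subword n-grams this also works for out-of-vocabulary words
vec = model.wv['sentence']
print(model.wv.most_similar('sentence', topn=5))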
If you want to use the official fasttext API instead, here is how you can do it:
import fasttext

root = "path/to/all/the/texts/in/a/single/txt/files.txt"

training_param = {
    'ws': window_size,
    'minCount': min_word,
    'dim': embedding_size,
    't': down_sampling,
    'epoch': 5,
    'seed': 0
}

# for all the parameters: https://fasttext.cc/docs/en/options.html
model = fasttext.train_unsupervised(root, **training_param)
model.save_model("embeddings_300_fr.bin")
The advantages of using the fasttext API are that (1) it is implemented in C++ with a Python wrapper (much faster than Gensim, and also multithreaded) and (2) it handles reading the text better. It is also possible to use it directly from the command line.
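Loading the saved model back and querying it would then look roughly like this (the word "sentence" is a placeholder):

import fasttext

model = fasttext.load_model("embeddings_300_fr.bin")
vec = model.get_word_vector("sentence")          # numpy array of length 'dim'
print(model.get_nearest_neighbors("sentence"))   # list of (score, word) pairs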
I'm trying to implement a similarity function using
N-Grams
TF-IDF
Cosine Similarity
Example
Concept:
words = [...]
word = '...'
similarity = predict(words, word)

def predict(words, word):
    words_ngrams = create_ngrams(words, range=(2, 4))
    word_ngrams = create_ngrams(word, range=(2, 4))
    words_tokenizer = tfidf_tokenizer(words_ngrams)
    word_vec = words_tokenizer.transform(word)
    return cosine_similarity(word_vec, words_tokenizer)
I searched the web for a simple and safe implementation, but I couldn't find one that uses well-known Python packages such as sklearn, nltk, scipy, etc.
Most of them use "self-made" calculations.
I'm trying to avoid coding every step by hand, and I'm guessing there is an easy fix for that whole pipeline.
Any help (and code) would be appreciated. Thanks :)
Eventually I figured it out...
For whoever needs a solution to this question, here's a function I wrote that takes care of it...
'''
### N-Gram & TF-IDF & Cosine Similarity
Using n-gram on 'from column' with TF-IDF to predict the 'to column'.
Adding to the df a 'cosine_similarity' feature with the numeric result.
'''
def add_prediction_by_ngram_tfidf_cosine(from_column_name, ngram_range=(2, 4)):
    global df
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=ngram_range)
    vectorizer.fit(df.FromColumn)

    w = from_column_name
    vec_word = vectorizer.transform([w])

    df['vec'] = df.FromColumn.apply(lambda x: vectorizer.transform([x]))
    df['cosine_similarity'] = df.vec.apply(lambda x: cosine_similarity(x, vec_word)[0][0])
    df = df.drop(['vec'], axis=1)
Note: it's not production ready
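For reference, the same idea as a self-contained sketch without the global DataFrame; the column name FromColumn and the example strings below are made up:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({'FromColumn': ['apple pie', 'apple juice', 'orange juice']})
query = 'aple pie'  # deliberately misspelled to show the fuzziness of char n-grams

vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))
doc_matrix = vectorizer.fit_transform(df.FromColumn)   # one row per string
query_vec = vectorizer.transform([query])

df['cosine_similarity'] = cosine_similarity(doc_matrix, query_vec).ravel()
print(df.sort_values('cosine_similarity', ascending=False))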
I am trying to find the maximum similarity between two words in a pandas DataFrame. Here is my routine:
import pandas as pd
from nltk.corpus import wordnet
import itertools

df = pd.DataFrame({'word_1': ['desk', 'lamp', 'read'], 'word_2': ['call', 'game', 'cook']})

def max_similarity(row):
    word_1 = row['word_1']
    word_2 = row['word_2']
    ret_val = max([(wordnet.wup_similarity(syn_1, syn_2) or 0) for
                   syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))])
    return ret_val

df['result'] = df.apply(lambda x: max_similarity(x), axis=1)
It works fine, but it is too slow, and I am looking for a way to speed it up; wordnet takes the majority of the time. Any suggestions? Cython? I am open to using other packages such as spacy.
Since you said you are open to using spacy as an NLP library, let's consider a simple benchmark. We will use the Brown news corpus to create somewhat arbitrary word pairs by splitting it in half.
from nltk.corpus import brown

brown_corpus = list(brown.words(categories='news'))

brown_df = pd.DataFrame({
    'word_1': brown_corpus[:len(brown_corpus)//2],
    'word_2': brown_corpus[len(brown_corpus)//2:]
})

len(brown_df)
# 50277
The cosine similarity of two tokens/documents can be computed with the Doc.similarity method.
import spacy

nlp = spacy.load('en')

def spacy_max_similarity(row):
    word_1 = nlp(row['word_1'])
    word_2 = nlp(row['word_2'])
    return word_1.similarity(word_2)
Finally, apply both methods to the data frame (nltk_max_similarity is the max_similarity function from the question, just renamed):
nltk_similarity = %timeit -o brown_df.apply(nltk_max_similarity, axis=1)
1 loop, best of 3: 59 s per loop
spacy_similarity = %timeit -o brown_df.apply(spacy_max_similarity, axis=1)
1 loop, best of 3: 8.88 s per loop
Please note that NLTK and spacy use different techniques for measuring similarity: spacy compares pretrained word vectors (GloVe, in the default English model), while NLTK's wup_similarity is based on the WordNet taxonomy. From the docs:
Using word vectors and semantic similarities
[...]
The default English model installs vectors for one million vocabulary
entries, using the 300-dimensional vectors trained on the Common Crawl
corpus using the GloVe algorithm. The GloVe common crawl vectors have
become a de facto standard for practical NLP.
One way to make it faster is to cache the word-pair similarities, so that repeated pairs don't re-run the expensive WordNet search inside the loop.
import pandas as pd
from nltk.corpus import wordnet
import itertools

df = pd.DataFrame({'word_1': ['desk', 'lamp', 'read'], 'word_2': ['call', 'game', 'cook']})

word_similarities = dict()

def max_similarity(row):
    word_1 = row['word_1']
    word_2 = row['word_2']
    key = tuple(sorted([word_1, word_2]))  # symmetric measure :)
    if key not in word_similarities:
        word_similarities[key] = max([
            (wordnet.wup_similarity(syn_1, syn_2) or 0)
            for syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))
        ])
    return word_similarities[key]

df['result'] = df.apply(lambda x: max_similarity(x), axis=1)
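An equivalent way to get the same caching behaviour is functools.lru_cache; a small sketch (the default=0 guard for words with no synsets is an addition, not part of the original code):

from functools import lru_cache
from nltk.corpus import wordnet
import itertools

@lru_cache(maxsize=None)
def pair_similarity(word_1, word_2):
    # the caller sorts the pair, so (a, b) and (b, a) hit the same cache entry
    return max(((wordnet.wup_similarity(s1, s2) or 0)
                for s1, s2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))),
               default=0)  # 0 when a word has no synsets at all

def max_similarity(row):
    word_1, word_2 = sorted([row['word_1'], row['word_2']])
    return pair_similarity(word_1, word_2)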
I have read a description of how to apply random forest regression here. In this example the authors use the following code to create the features:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
I am thinking of combining several possibilities as features and turning them on and off, and I don't know how to do it.
What I have so far is a class where I can turn the features on and off and see whether each one brings something (for example, all unigrams and the 20 most frequent unigrams; it could then be the 10 most frequent adjectives, or tf-idf). But for now I don't understand how to combine them together.
The code looks like this, and in the function part I am lost (the kind of function I have would replicate what they do in the tutorial, but it doesn't seem really helpful the way I do it):
class FeatureGen:  # for example, feat = FeatureGen(unigrams=False) creates a feature set without the turned-off feature
    def __init__(self, unigrams=True, unigrams_freq=True):
        self.unigrams = unigrams
        self.unigrams_freq = unigrams_freq

    def get_features(self, input):
        vectorizer = CountVectorizer(analyzer="word", max_features=5000)
        tokens = input["token"]
        if self.unigrams:
            train_data_features = vectorizer.fit_transform(tokens)
        return train_data_features
What should I do to add one more feature possibility, like "contains the 10 most frequent words"?
if self.unigrams:
    train_data_features = vectorizer.fit_transform(tokens)
if self.unigrams_freq:
    # something else
return features  # and this should be a combination somehow
Looks like you need np.hstack.
However, you need each feature array to have one row per training case.
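A minimal sketch of what that could look like; the corpus and the second feature block (tf-idf) are just placeholders for whichever features you toggle on:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# placeholder corpus: one string per training case
tokens = ["this is a review", "another short review", "one more text"]

unigram_feats = CountVectorizer(analyzer="word", max_features=5000).fit_transform(tokens).toarray()
tfidf_feats = TfidfVectorizer(max_features=1000).fit_transform(tokens).toarray()

# both blocks have shape (n_samples, n_features_block), so they stack column-wise
combined = np.hstack([unigram_feats, tfidf_feats])
print(combined.shape)

If memory becomes an issue, scipy.sparse.hstack does the same thing without densifying the matrices.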
I already have a trained classifier that I load through pickle.
My main question is whether there is anything that can speed up the classification task. It is taking almost 1 minute per text (feature extraction and classification); is that normal? Should I move to multi-threading?
Here are some code fragments to show the overall flow:
for item in items:
    review = ''.join(item['review_body'])
    review_features = getReviewFeatures(review)
    normalized_predicted_rating = getPredictedRating(review_features)
    item_processed['rating'] = str(round(float(normalized_predicted_rating), 1))
def getReviewFeatures(review, verbose=True):

    text_tokens = tokenize(review)
    polarity = getTextPolarity(review)
    subjectivity = getTextSubjectivity(review)

    taggs = getTaggs(text_tokens)
    bigrams = processBigram(taggs)
    freqBigram = countBigramFreq(bigrams)
    sort_bi = sortMostCommun(freqBigram)

    adjectives = getAdjectives(taggs)
    freqAdjectives = countFreqAdjectives(adjectives)
    sort_adjectives = sortMostCommun(freqAdjectives)

    word_features_adj = list(sort_adjectives)
    word_features = list(sort_bi)

    features = {}

    for bigram, freq in word_features:
        features['contains(%s)' % unicode(bigram).encode('utf-8')] = True
        features["count({})".format(unicode(bigram).encode('utf-8'))] = freq

    for word, freq in word_features_adj:
        features['contains(%s)' % unicode(word).encode('utf-8')] = True
        features["count({})".format(unicode(word).encode('utf-8'))] = freq

    features["polarity"] = polarity
    features["subjectivity"] = subjectivity

    if verbose:
        print "Get review features..."

    return features
def getPredictedRating(review_features, verbose=True):
    start_time = time.time()
    classifier = pickle.load(open("LinearSVC5.pickle", "rb"))

    p_rating = classifier.classify(review_features)  # in the form of "# star"

    predicted_rating = re.findall(r'\d+', p_rating)[0]
    predicted_rating = int(predicted_rating)

    best_rating = 5
    worst_rating = 1
    normalized_predicted_rating = 0
    normalized_predicted_rating = round(float(predicted_rating)*float(10.0)/((float(best_rating)-float(worst_rating))+float(worst_rating)))

    if verbose:
        print "Get predicted rating..."
        print "ML_RATING: ", normalized_predicted_rating
        print("---Took %s seconds to predict rating for the review---" % (time.time() - start_time))

    return normalized_predicted_rating
NLTK is a great tool and a good starting point for Natural Language Processing, but it's sometimes not very useful if speed matters, as the authors implicitly said:
NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python," and "an amazing library to play with natural language."
So if your problem only lies in the speed of the toolkit's classifier, you have to use another resource or write the classifier yourself.
scikit-learn might be helpful if you want to use a classifier that is probably faster.
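As a rough sketch of what that could look like with scikit-learn (the texts, labels, and pipeline choices below are made-up placeholders, not your actual setup):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# placeholder training data: raw review texts and their star ratings
train_texts = ["great product, loved it", "terrible, broke after a day"]
train_labels = ["5 star", "1 star"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

print(clf.predict(["pretty decent overall"]))  # vectorisation + prediction in one fast call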
It seems that you are using a dictionary to build the feature vector. I strongly suspect that the problem is there.
The proper way would be to use a numpy ndarray, with examples on rows and features on columns. So, something like:
import numpy as np

# let's suppose 6 different features = 6-dimensional vector (one row per example)
feats = np.zeros((1, 6))

# column 0 contains polarity, column 1 subjectivity, and so on..
feats[:, 0] = polarity
feats[:, 1] = subjectivity
# ....

classifier.classify(feats)
Of course, you must use the same data structure and respect the same convention during training.
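For instance, at training time the feature matrix could be built with the same layout, one row per review and the columns in the same order (train_reviews is a hypothetical list of texts, and the helper functions are the ones from the question):

import numpy as np

X_train = np.zeros((len(train_reviews), 6))
for i, review in enumerate(train_reviews):
    X_train[i, 0] = getTextPolarity(review)       # column 0: polarity
    X_train[i, 1] = getTextSubjectivity(review)   # column 1: subjectivity
    # ... columns 2-5: the remaining features, in the same order used at prediction time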