The program should be returning the second text in the list as most similar, since it is identical word for word. But that's not the case here.
import gensim
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
data = ["I love machine learning. Its awesome.",
"I love coding in python",
"I love building chatbots",
"they chat amagingly well"]
tagged_data = [TaggedDocument(word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                negative=0,
                dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
loaded_model=Doc2Vec.load("d2v.model")
test_data=["I love coding in python".lower()]
v1=loaded_model.infer_vector(test_data)
similar_doc=loaded_model.docvecs.most_similar([v1])
print similar_doc
Output:
[('0', 0.17585766315460205), ('2', 0.055697083473205566), ('3', -0.02361609786748886), ('1', -0.2507985532283783)]
It's showing the first text in the list as most similar instead of the second text. Can you please help with this?
First, you won't get good results from Doc2Vec-style models with toy-sized datasets. Just four documents, and a vocabulary of about 20 unique words, can't create a meaningfully-contrasting "dense embedding" vector model full of 20-dimensional vectors.
Second, by setting negative=0 in your model initialization, you're disabling the default model-training-correction mode (negative=5) without enabling the non-default, less-recommended alternative (hs=1). So no training at all will be occurring. There may be an error shown in the code output – and if you're running with at least INFO-level logging, you'd likely notice other hints of this in the output.
Third, infer_vector() requires a list-of-word-tokens as its argument. You're providing a plain string. That will look like a list of one-character words to the code, so it's like you're asking it to infer on the 23-word sentence:
['i', ' ', 'l', 'o', 'v', 'e', ' ', 'c', ...]
The argument to infer_vector() should be tokenized exactly the same as the training texts were tokenized. (If you used word_tokenize() during training, use it during inference, too.)
infer_vector() will also use a number of repeated inference-passes over the text equal to the 'epochs' value inside the Doc2Vec model, unless you specify another value. Since you didn't specify epochs when creating the model, it will still have the default value (inherited from Word2Vec) of epochs=5. Most Doc2Vec work uses 10-20 epochs during training, and using at least as many during inference seems a good practice.
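For example, a minimal sketch of a corrected inference call, re-using your loaded_model (the epochs keyword assumes a reasonably recent gensim; older releases call it steps):
tokens = word_tokenize("I love coding in python".lower())   # same tokenization as training
v1 = loaded_model.infer_vector(tokens, epochs=50)            # extra passes help with tiny texts
similar_doc = loaded_model.docvecs.most_similar([v1])
print(similar_doc)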
But also:
Don't try to call train() more than once in a loop, or manage alpha in your own code, unless you are an expert.
Whatever online example suggested a code block like your...
for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
...is a bad example. It sends the effective alpha rate down-and-up incorrectly; it's very fragile if you ever want to change the number of epochs; it actually winds up running 500 epochs (100 * model.iter); and it's far more code than is necessary.
Instead, don't change the default alpha options, and specify your desired number of epochs when the model is created. That way the model will have a meaningful epochs value cached, to be used by a later infer_vector().
Then, only call train() once. It will handle all epochs & alpha-management correctly. For example:
model = Doc2Vec(size=vec_size,
                min_count=1,  # not good idea w/ real corpuses but OK
                dm=1,  # not necessary to specify since it's the default but OK
                epochs=max_epochs)
model.build_vocab(tagged_data)
model.train(tagged_data,
            total_examples=model.corpus_count,
            epochs=model.epochs)
Related
I am trying to train a word2vec model on a simple toy dataset of 4 sentences.
The Word2Vec configuration that I need is:
Skip-gram model
no negative sampling
no hierarchical soft-max
no removal or down-scaling of frequent words
vector size of words is 2
Window size of 4, i.e. all the words in a sentence are considered context words of each other.
epochs can be varied from 1 to 500
The problem that I am facing is: no matter how I change the above parameters, the word vectors are not being updated/learned. The word vectors for epochs=1 and epochs=500 are the same.
from gensim.models import Word2Vec
import numpy as np
import matplotlib.pyplot as plt
import nltk
# toy dataset with 4 sentences
sents = ['what is the time',
         'what is the day',
         'what time is the meeting',
         'cancel the meeting']
sents = [nltk.word_tokenize(string) for string in sents]
# model initialization and training
model = Word2Vec(alpha=0.5, min_alpha=0.25, min_count=0, size=2, window=4,
                 workers=1, sg=1, hs=0, negative=0, sample=0, seed=42)
model.build_vocab(sents)
model.train(sents, total_examples=4, epochs=500)
# getting word vectors into array
vocab = model.wv.vocab.keys()
vocab_vectors = model.wv[vocab]
print(vocab)
print(vocab_vectors)
#plotting word vectors
plt.scatter(vocab_vectors[:,0], vocab_vectors[:,1], c ="blue")
for i, word in enumerate(vocab):
    plt.annotate(word, (vocab_vectors[i,0], vocab_vectors[i,1]))
The output of print(vocab) is as below
['what', 'is', 'time', 'cancel', 'the', 'meeting', 'day']
The output of print(vocab_vectors) is as below
[[ 0.08136337 -0.05059118]
[ 0.06549312 -0.22880174]
[-0.08925873 -0.124718 ]
[ 0.05645624 -0.03120007]
[ 0.15067646 -0.14344342]
[-0.12645201 0.06202405]
[-0.22905378 -0.01489289]]
The plotted 2D vectors
Why do I think the vectors are not being learned? I am changing the epochs value to 1, 10, 50, 500... and running the whole code to check the output for each run. For epochs = any value in <1, 10, 50, 500>, the output (vocab, vocab_vectors, and the plot) is the same across all the runs.
By providing the parameters negative=0, hs=0, you've disabled both training modes, and no training is happening.
You should either leave the default non-zero negative value in place, or enable the non-default hierarchical-softmax mode while disabling negative-sampling (with hs=1, negative=0).
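For example, a minimal sketch of an initialization that keeps your other settings but enables hierarchical-softmax training (and, per the notes below, leaves the alpha defaults alone):
model = Word2Vec(min_count=0, size=2, window=4, workers=1, seed=42,
                 sg=1,         # skip-gram
                 hs=1,         # enable hierarchical softmax...
                 negative=0,   # ...with negative-sampling disabled
                 sample=0)     # no downsampling of frequent words
model.build_vocab(sents)
model.train(sents, total_examples=model.corpus_count, epochs=500)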
Other thoughts:
Enabling logging at the INFO level is often helpful, and might have shown progress output that better hinted to you that no real training was happening.
Still, with a tiny toy dataset, the biggest hint that all training was disabled – suspiciously instant completion of training – is nearly indistinguishable from a tiny amount of training. Generally, lots of things will be weird or disappointing with tiny datasets (and tiny vector sizes), as word2vec's usual benefits really depend on large amounts of text.
Lowering min_count is usually a bad idea with any realistic dataset, as word2vec needs multiple varied examples of a word's usage to train useful vectors – and it's usually better to ignore rare words than mix their incomplete info in.
Changing the default alpha/min_alpha is also usually a bad idea – though perhaps here you were just trying extreme values to trigger any change.
I'm doing token-based classification using the pre-trained BERT-model for tensorflow to automatically label cause and effects in sentences.
To access BERT, I'm using the TFBertForTokenClassification-Interface from huggingface: https://huggingface.co/transformers/model_doc/bert.html#tfbertfortokenclassification
The sentences I use for training are all converted to tokens (basically a mapping of words to numbers) according to the BERT tokenizer, and then padded to a certain length before training. So when one sentence has 50 tokens and another one has only 30, the first one is filled up with 50 pad-tokens and the second one with 70 of them, to get a universal input sentence-length of 100.
I then train my model to predict on every token which label this token belongs to; whether it is part of the cause, the effect or none of them.
However, during training and evaluation, my model does predictions on the PAD-tokens as well and they are also included in the accuracy of the model. As PAD-tokens are very easy to predict for the model (they always have the same token and they all have the "none" label which means they neither belong to the cause nor the effect of the sentence), they really distort my model's accuracy.
For example, if you have a sentence which has 30 words -> 30 tokens and you pad all sentences to a length of 100, then this sentence would get a score of 70% even if the model predicted none of the "real" tokens correctly.
This way I'm getting training and validation accuracies of 90+% really quickly, although the model performs poorly on the real (non-pad) tokens.
I thought that the attention-mask is there to solve this problem, but this doesn't seem to be the case.
The input-datasets are created as follows:
def example_to_features(input_ids, attention_masks, token_type_ids, label_ids):
    return {"input_ids": input_ids,
            "attention_mask": attention_masks}, label_ids
train_ds = tf.data.Dataset.from_tensor_slices((input_ids_train,attention_masks_train,token_ids_train,label_ids_train)).map(example_to_features).shuffle(buffer_size=1000).batch(32)
Model creation:
from transformers import TFBertForTokenClassification
num_epochs = 30
model = TFBertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)
model.layers[-1].activation = tf.keras.activations.softmax
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
model.summary()
And then I train it like this:
history = model.fit(train_ds, epochs=num_epochs, validation_data=validate_ds)
Has anyone encountered this problem so far or does know how to exclude the predictions on pad-tokens from the model's accuracy during training and evaluation?
Yes, this is normal.
The output of BERT, of shape [batch_size, max_seq_len = 100, hidden_size], will include values or embeddings for the [PAD] tokens as well. However, you also provide attention_masks to the BERT model so that it does not take these [PAD] tokens into consideration.
Similarly, you need to mask these [PAD] tokens before passing the BERT results to the final fully-connected layer, mask them when you are calculating the loss, and also when calculating metrics like precision and recall.
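As a minimal sketch of the metric part (not your exact pipeline): assuming you assign the padded positions a dedicated sentinel label such as a hypothetical PAD_LABEL_ID = -100, rather than the "none" class, an accuracy that ignores them could look like this; the loss would need the same kind of masking:
import tensorflow as tf

PAD_LABEL_ID = -100  # hypothetical sentinel assigned to [PAD] positions in label_ids

def masked_accuracy(y_true, y_pred):
    # y_true: [batch, seq_len] label ids, y_pred: [batch, seq_len, num_labels] logits
    predictions = tf.argmax(y_pred, axis=-1, output_type=tf.int32)
    y_true = tf.cast(y_true, tf.int32)
    mask = tf.cast(tf.not_equal(y_true, PAD_LABEL_ID), tf.float32)
    matches = tf.cast(tf.equal(predictions, y_true), tf.float32) * mask
    return tf.reduce_sum(matches) / tf.maximum(tf.reduce_sum(mask), 1.0)

model.compile(optimizer=optimizer, loss=loss, metrics=[masked_accuracy])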
I am trying to find similar sentences using doc2vec. What I am not able to find is the actual sentence that matches, from the trained sentences.
Below is the code from this article:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
"I love coding in python",
"I love building chatbots",
"they chat amagingly well"]
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")
model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)
# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)
# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])
But the above code only gives me vectors or numbers. How can I get the actual sentence matched from the training data? For example, in this case I am expecting the result to be "I love building chatbots".
The output of similar_doc is: [('2', 0.991769552230835), ('0', 0.989276111125946), ('3', 0.9854298830032349)]
This shows the similarity score of each document in the data with the requested document, sorted in descending order.
Based on this, index '2' in the data is the closest to the requested data, i.e. test_data.
print(data[int(similar_doc[0][0])])
# prints: I love building chatbots
Note: this code gives different results every time; maybe you need a better model or more training data.
Doc2Vec isn't going to give good results on toy-sized datasets, so you shouldn't expect anything meaningful until using much more data.
But also, a Doc2Vec model doesn't retain within itself the full texts you supply during training. It just remembers the learned vectors for each text's tag – which is usually a unique identifier. So when you get back results from most_similar(), you'll be getting back tag values, which you then need to look-up yourself, in your own code/data, to retrieve full documents.
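For example, a minimal sketch of that lookup, assuming (as in your code) the tags are the stringified indexes into your original data list:
for tag, score in model.docvecs.most_similar([v1], topn=3):
    print(score, data[int(tag)])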
Separately:
Calling train() multiple times in a loop like you're doing is a bad and error-prone idea, as is managing alpha/min_alpha explicitly. You should not follow any tutorial/guide which recommends that approach.
Don't change the defaults for the alpha parameters, and call train() once, with your desired epochs count – and it will do the right number of passes, and right learning-rate management.
To get the actual result you have to pass the text as a vector to the most_similar method. Hard-coding most_similar('1') will always give static results.
similar_doc = model.docvecs.most_similar([v1])
Modified version of your code
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
"I love coding in python",
"I love building chatbots",
"they chat amagingly well"]
def output_sentences(most_similar):
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(most_similar)//2), ('LEAST', len(most_similar) - 1)]:
        print(u'%s %s: %s\n' % (label, most_similar[index][1], data[int(most_similar[index][0])]))
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")
model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)
# to find most similar doc using tags
similar_doc = model.docvecs.most_similar([v1])
print(similar_doc)
# to print similar sentences
output_sentences(similar_doc)
# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])
Semantic “Similar Sentences” with your dataset-NLP
If you are looking for accurate predictions with your dataset, and it is small, you can go for:
pip install similar-sentences
I am trying to get predictions from my sentiment analysis models that classify 500-word news articles. The models' validation loss and training loss are about the same, and their scores are relatively high. However, when I try to make predictions with them, I get the same classification result in all of them, regardless of the text input.
I believe that the problem might be in the way I am trying to make a prediction (I pad my string with spaced characters). I was hoping that someone here could shed some light on this issue (my code below). Thank you for your help.
comment = 'SAMPLE TEXT STRING'
for i in range(300-len(comment.split(' '))):
    apad += ' A'
comment = comment + apad
tok.fit_on_texts([comment])
X = tokenizer.texts_to_sequences([comment])
X = preprocessing.sequence.pad_sequences(X)
yhat = b.predict_classes(X)
print(yhat)
prediction = b.predict(X, batch_size=None, verbose=0, steps=None)
print(prediction)
The output of this script is below. Both the predicted class and the prediction are, regardless of the text input, always 0 for some reason:
[[0]] [[0.00645966]]
The problem seems to be with the tokenizer.
You shouldn't fit the tokenizer again on new text, because that assigns different token indexes to the words than the ones used in training. You should fit the tokenizer only once, before training, and then save it so the same tokens are used with all new text.
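A minimal sketch of that pattern, reusing your comment string and model b (train_texts and maxlen here are stand-ins for your own training texts and padded sequence length):
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# fit once, on the training texts only, and save the fitted tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# at prediction time: load the same tokenizer, never re-fit it on new text
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)
X = pad_sequences(tokenizer.texts_to_sequences([comment]), maxlen=maxlen)
print(b.predict(X))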
I roughly followed this tutorial:
https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
A notable difference is that I use 2 LSTM layers with dropout. My dataset is different (a music dataset in abc notation). I do get some songs generated, but after a certain number of steps (which may range from 30 steps to a couple of hundred) in the generation process, the LSTM keeps generating the exact same sequence over and over again. For example, it once got stuck generating URLs for songs:
F: http://www.youtube.com/watch?v=JPtqU6pipQI
and so on ...
It also once got stuck with generating the same two songs (the two songs are a sequence of about 300 characters). In the beginning it generated 3-4 good pieces but afterwards, it kept regenerating the two songs almost indefinitely.
I am wondering, does anyone have some insight into what could be happening ?
I want to clarify that any sequence generated, whether repeating or non-repeating, seems to be new (the model is not memorising). The validation loss and training loss decrease as expected.
Andrej Karpathy is able to generate a document of thousands of characters and I couldn't find this pattern of getting stuck indefinitely.
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Instead of taking the argmax on the prediction output, try introducing some randomness with something like this:
np.argmax(prediction_output)
to
np.random.choice(len(prediction_output), p=prediction_output)
I've been struggling with this repeating-sequences issue for a while, until I discovered this Colab notebook where I figured out why their model was able to generate some really good samples: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/shakespeare_with_tpu_and_keras.ipynb#scrollTo=tU7M-EGGxR3E
After I changed this single line, my model went from generating a few words over and over to something actually interesting!
To use and train a text generation model, follow these steps:
Drawing from the model a probability distribution over the next character, given the text available so far (these would be our prediction scores)
Reweighting the distribution to a certain "temperature" (See the code below)
Sampling the next character at random according to the reweighted distribution (See the code below)
Adding the new character at the end of the available text
See the sample function:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
You should use the sample function during training as follows:
for epoch in range(1, 60):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)
    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)
        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]
            generated_text += next_char
            generated_text = generated_text[1:]
            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
A low temperature results in extremely repetitive and predictable text, but where local structure is highly realistic: in particular, all words (a word being a local pattern of characters) are real English words. With higher temperatures, the generated text becomes more interesting, surprising, even creative.
See this notebook
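As a quick numeric illustration of the temperature reweighting described above (a minimal sketch on a made-up 4-character distribution, not the model's actual output):
import numpy as np

preds = np.array([0.6, 0.25, 0.1, 0.05])   # hypothetical next-character probabilities
for temperature in (0.2, 1.0, 1.2):
    reweighted = np.exp(np.log(preds) / temperature)
    reweighted /= reweighted.sum()
    print(temperature, np.round(reweighted, 3))
# a low temperature sharpens the distribution toward the most likely character;
# a higher temperature flattens it, so rarer characters get sampled more often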