Python. Gensim Word2Vec. Word similarity - python

I've got a problem/question with Word2Vec.
As I understand it: let's train a model on a corpus of text (in my case the corpus is ~2 GB).
Now let's take one line from this text and calculate a vector for this line (the line's vector = the sum of its word vectors). It will be something like this:
coords = np.zeros(model.vector_size)  # running sum of the word vectors
for w in words:
    coords += model[w]
Then let's calculate the length of this vector, with the standard library:
import numpy as np
vectorLen = np.linalg.norm(coords)
Why do we need Word2Vec? For converting words to vectors AND for contextual proximity (words that occur near each other, or that are close in meaning, get similar coordinates)!
And here is what I expect: if I take some line of the text and add to it a word from the dictionary that is not typical for this line, then recalculate the length of the vector, I should get a value quite different from the length of the original line's vector without the uncharacteristic word.
But in fact the lengths of these vectors (before and after adding the word(s)) are quite similar! Moreover, they are practically the same! Why am I getting this result?
If I understand it right, the coordinates of the words within the line will be quite similar (contextual proximity), but the new words will have rather different coordinates, and that should affect the result (the vector length of the line with the new words)!
E.g., these are my W2V model settings:
#Word2Vec model
model = gensim.models.Word2Vec(
    sg=0,
    size=300,
    window=3,
    min_count=1,
    hs=0,
    negative=5,
    workers=10,
    alpha=0.025,
    min_alpha=0.025,
    sample=1e-3,
    iter=20
)
#prepare the model vocabulary
model.build_vocab(sentences, update=False)
#train model
model.train(sentences, epochs=model.iter, total_examples=model.corpus_count)
OR this:
#Word2Vec model
model = gensim.models.Word2Vec(
    sg=1,
    size=100,
    window=10,
    min_count=1,
    hs=0,
    negative=5,
    workers=10,
    alpha=0.025,
    min_alpha=0.025,
    seed=7,
    sample=1e-3,
    hashfxn=hash,
    iter=20
)
#prepare the model vocabulary
model.build_vocab(sentences, update=False)
What's the problem? And how can I get the result I need?

Why do you need the "vector length" to noticeably change, as a "desired result"?
The length of word-vectors (or sums of same) isn't usually of major interest. In fact, it's common to normalize the word-vectors to unit-length before doing comparisons. (And sometimes, when doing sums/averages as a simple way to create vectors for runs-of-multiple-words, the vectors might be unit-normalized before or after such an operation.)
Instead, it's usually the direction (angle) that's of most interest.
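For illustration only (not from the original answer), a direction-based comparison of two token lists might look like the sketch below, assuming a gensim model like the ones trained in the question:
import numpy as np

def text_vector(model, words):
    # average the word vectors and unit-normalize, so only the direction matters
    vecs = [model[w] for w in words if w in model]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def text_similarity(model, words_a, words_b):
    # cosine similarity of the two unit-length text vectors
    return float(np.dot(text_vector(model, words_a), text_vector(model, words_b)))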
Further, what do you mean when describing the length values as "quite the similar"? Without showing the actual lengths you've seen in your tests, it's unclear if your intuitions about what the change "should" be are correct.
Note that in multi-dimensional spaces – and especially high-dimensional spaces – our intuitions are quite often wrong.
For example, try adding a bunch of pairs of random unit vectors in 2d space, and looking at the norm length of the sum. As you might expect, you'll likely see varied results that range from nearly 0.0 to nearly 2.0 – representing moving closer or further to the origin.
Try instead adding a bunch of pairs of random unit vectors in 500d space. Now, the norm length of the sum is going to almost always be close to 1.4. Essentially, with 500 directions to go, most sums won't significantly move closer or further to the origin, even though they still move 1.0 away from either vector individually.
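A minimal numpy sketch (not part of the original answer) that reproduces this effect:
import numpy as np

rng = np.random.RandomState(0)

def norms_of_unit_pair_sums(dim, trials=10000):
    # draw pairs of random unit vectors and return the norm of each pair's sum
    a = rng.normal(size=(trials, dim))
    b = rng.normal(size=(trials, dim))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return np.linalg.norm(a + b, axis=1)

for d in (2, 500):
    norms = norms_of_unit_pair_sums(d)
    print(d, norms.min(), norms.max(), norms.mean())
# in 2d the norms spread from near 0.0 to near 2.0;
# in 500d they cluster tightly around sqrt(2) ≈ 1.414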
You're likely observing the same thing with your word-vectors. They're fine, but the measure you've chosen to take – the norm of a vector sum – just doesn't change the way you'd expect, in a high-dimensional space.
Separately, unrelated to your main issue, but about your displayed word2vec parameters:
You might think using a non-default min_count=1, by retaining more words/information, results in better vectors. However, it usually hurts word-vector quality to retain such rare words. Word-vector quality requires many varied examples of word usage. Words with just 1, or a few, examples don't get good vectors from those few idiosyncratic usage examples, but do serve as training noise/interference in the improvement of other word-vectors with more examples.
Usual stochastic-gradient-descent optimization relies on the alpha learning-rate decaying to a negligible value over the course of training. Setting the ending min_alpha to the same value as the starting alpha thwarts this. (In general, most users shouldn't change either of the alpha parameters, and if they need to tinker at all, changing the starting value makes more sense.)
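Purely as an illustration of those two points (not the original poster's required settings), a more conventional configuration would discard very rare words and leave the alpha parameters at their defaults, e.g.:
#Word2Vec model with more conventional settings (parameter names per the gensim version used in the question)
import gensim

model = gensim.models.Word2Vec(
    sg=0,
    size=300,
    window=3,
    min_count=5,     # drop words with too few usage examples
    hs=0,
    negative=5,
    workers=10,
    sample=1e-3,
    iter=20
)                    # alpha/min_alpha left at their defaults so the learning rate decays
model.build_vocab(sentences, update=False)
model.train(sentences, epochs=model.iter, total_examples=model.corpus_count)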

Related

Python KMeans Clustering - Handling nan Values

I am trying to cluster a number of words using the KMeans algorithm from scikit learn.
In particular, I use pre-trained word embeddings (300-dimensional vectors) to map each word to a numeric vector, and then I feed these vectors to KMeans and provide the number of clusters.
My issue is that there are certain words in my input corpus which I cannot find in the pretrained word-embeddings dictionary. This means that in these cases, instead of a vector, I get a numpy array full of nan values. This does not work with the kmeans algorithm, and therefore I have to exclude these arrays. However, I am interested in seeing all the cases that were not found in the word embeddings and, what is more, if possible throwing them into a separate cluster that contains only them.
My idea at this point is to set a condition that if a word comes back from the embeddings index as a nan-valued array, then an arbitrary vector is assigned to it. Each dimension of the embedding vectors lies within [-1, 1]. Therefore, if I assign the vector [100000]*300 to all nan words, I have created a set of outliers. In practice, this works as expected, since this particular set of vectors is forced into a separate cluster. However, the initialization of the kmeans centroids is affected by these outlier values, and therefore all the rest of my clusters get messed up as well. As a remedy, I tried to initialize the kmeans using init='k-means++', but first, it takes significantly longer to execute, and second, the improvement is not much better.
Any suggestions as to how to approach this issue?
Thank you.
If you don't have data on a word, then skip it.
You could try to compute a word vector on the fly based on the context, but that essentially is the same as just skipping it.
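A minimal sketch of the skip-them approach (the embeddings name and dict-like lookup are illustrative assumptions, not from the original post):
import numpy as np
from sklearn.cluster import KMeans

def cluster_words(words, embeddings, n_clusters=10):
    # embeddings: dict-like mapping word -> 300-dimensional vector
    in_vocab = [w for w in words if w in embeddings]
    oov = [w for w in words if w not in embeddings]   # keep these aside as their own group

    X = np.array([embeddings[w] for w in in_vocab])
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)
    return dict(zip(in_vocab, labels)), oov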

Doc2vec and word2vec with negative sampling

My current doc2vec code is as follows.
# Train doc2vec model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4, iter = 20)
I also have a word2vec code as below.
# Train word2vec model
model = word2vec.Word2Vec(sentences, size=300, sample = 1e-3, sg=1, iter = 20)
I am interested in using both DM and DBOW in doc2vec AND both Skip-gram and CBOW in word2vec.
In Gensim I found the below mentioned sentence:
"Produce word vectors with deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling"
Thus, I am confused about whether to use hierarchical softmax or negative sampling. Please let me know what the differences between these two methods are.
Also, I am interested in knowing what are the parameters that need to be changed to use hierarchical softmax AND/OR negative sampling with respect to dm, DBOW, Skip-gram and CBOW?
P.s. my application is a recommendation system :)
Skip-gram or CBOW are different ways to choose the input contexts for the neural-network. Skip-gram picks one nearby word, then supplies it as input to try to predict a target word; CBOW averages together a bunch of nearby words, then supplies that average as input to try to predict a target word.
DBOW is most similar to skip-gram, in that a single paragraph-vector for a whole text is used to predict individual target words, regardless of distance and without any averaging. It can mix well with simultaneous skip-gram training, where in addition to using the single paragraph-vector, individual nearby word-vectors are also used. The gensim option dbow_words=1 will add skip-gram training to a DBOW dm=0 training.
DM is most similar to CBOW: the paragraph-vector is averaged together with a number of surrounding words to try to predict a target word.
So in Word2Vec, you must choose between skip-gram (sg=1) and CBOW (sg=0) – they can't be mixed. In Doc2Vec, you must choose between DBOW (dm=0) and DM (dm=1) - they can't be mixed. But you can, when doing Doc2Vec DBOW, also add skip-gram word-training (with dbow_words=1).
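As a rough illustration (illustrative only, using the parameter names from the code in the question), the mode choices would be made roughly like this:
from gensim.models import Doc2Vec, Word2Vec

w2v_cbow = Word2Vec(sentences, sg=0, size=300, iter=20)            # CBOW
w2v_sg   = Word2Vec(sentences, sg=1, size=300, iter=20)            # skip-gram

d2v_dm   = Doc2Vec(docs, dm=1, size=100, iter=20)                  # DM (CBOW-like)
d2v_dbow = Doc2Vec(docs, dm=0, dbow_words=1, size=100, iter=20)    # DBOW plus skip-gram word training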
The choice between hierarchical-softmax and negative-sampling is separate and independent of the above choices. It determines how target-word predictions are read from the neural-network.
With negative-sampling, every possible prediction is assigned a single output-node of the network. In order to improve what prediction a particular input context creates, it checks the output-nodes for the 'correct' word (of the current training example excerpt of the corpus), and for N other 'wrong' words (that don't match the current training example). It then nudges the network's internal weights and the input-vectors to make the 'correct' word output node activation a little stronger, and the N 'wrong' word output node activations a little weaker. (This is called a 'sparse' approach, because it avoids having to calculate every output node, which is very expensive in large vocabularies, instead just calculating N+1 nodes and ignoring the rest.)
You could set negative-sampling with 2 negative-examples with the parameter negative=2 (in Word2Vec or Doc2Vec, with any kind of input-context mode). The default mode, if no negative value is specified, is negative=5, following the default in the original Google word2vec.c code.
With hierarchical-softmax, instead of every predictable word having its own output node, some pattern of multiple output-node activations is interpreted to mean specific words. Which nodes should be closer to 1.0 or 0.0 in order to represent a word is a matter of the word's encoding, which is calculated so that common words have short encodings (involving just a few nodes), while rare words will have longer encodings (involving more nodes). Again, this serves to save calculation time: to check if an input-context is driving just the right set of nodes to the right values to predict the 'correct' word (for the current training-example), just a few nodes need to be checked, and nudged, instead of the whole set.
You enable hierarchical-softmax in gensim with the argument hs=1. By default, it is not used.
You should generally disable negative-sampling, by supplying negative=0, if enabling hierarchical-softmax – typically one or the other will perform better for a given amount of CPU-time/RAM.
(However, following the architecture of the original Google word2vec.c code, it is possible but not recommended to have them both active at once, for example negative=5, hs=1. This will result in a larger, slower model, which might appear to perform better since you're giving it more RAM/time to train, but it's likely that giving equivalent RAM/time to just one or the other would be better.)
Hierarchical-softmax tends to get slower with larger vocabularies (because the average number of nodes involved in each training-example grows); negative-sampling does not (because it's always N+1 nodes). Projects with larger corpora tend to prefer negative-sampling.
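For example, the two output configurations might be selected like this (again using the parameter names from the question; illustrative only):
from gensim.models import Word2Vec

# negative sampling (the default: hs=0, negative=5)
w2v_ns = Word2Vec(sentences, sg=1, hs=0, negative=5, size=300, iter=20)

# hierarchical softmax (disable negative sampling explicitly)
w2v_hs = Word2Vec(sentences, sg=1, hs=1, negative=0, size=300, iter=20)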

How does doc2vec.infer_vector combine across words?

I trained a doc2vec model using train(..) with default settings. That worked, but now I'm wondering how infer_vector combines across input words – is it just the average of the individual word vectors?
model.random.seed(0)
model.infer_vector(['cat', 'hat'])
model.random.seed(0)
model.infer_vector(['cat'])
model.infer_vector(['hat']) #doesn't average up to the ['cat', 'hat'] vector
model.random.seed(0)
model.infer_vector(['hat'])
model.infer_vector(['cat']) #doesn't average up to the ['cat', 'hat'] vector
Those don't add up, so I'm wondering what I'm misunderstanding.
infer_vector() doesn't combine the vectors for your given tokens – and in some modes doesn't consider those tokens' vectors at all.
Rather, it considers the entire Doc2Vec model as being frozen against internal changes, and then assumes the tokens you've provided are an example text, with a previously untrained tag. Let's call this implied but unnamed tag X.
Using a training-like process, it tries to find a good vector for X. That is, it starts with a random vector (as it did for all tags in original training), then sees how well that vector as model-input predicts the text's words (by checking the model neural-network's predictions for input X). Then via incremental gradient descent it makes that candidate vector for X better and better at predicting the text's words.
After enough such inference-training, the vector will be about as good (given the rest of the frozen model) as it possibly can be at predicting the text's words. So even though you're providing that text as an "input" to the method, inside the model, what you've provided is used to pick target "outputs" of the algorithm for optimization.
Note that:
- tiny examples (like one or a few words) aren't likely to give very meaningful results – they are sharp-edged corner cases, and the essential value of these sorts of dense embedded representations usually arises from the marginal balancing of many word-influences
- it will probably help to do far more training-inference cycles than the infer_vector() default steps=5 – some have reported tens or hundreds of steps work best for them, and it may be especially valuable to use more steps with short texts
- it may also help to use a starting alpha for inference more like that used in bulk training (alpha=0.025), rather than the infer_vector() default (alpha=0.1) – see the sketch after this list
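For instance, inference along those lines might look like the call below (keyword names per the gensim version discussed here; in newer releases steps was renamed to epochs):
model.random.seed(0)
vec = model.infer_vector(['cat', 'hat'], alpha=0.025, steps=100)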

Parameter estimation for linear One Class SVM training via libsvm for n-grams

I know there are multiple questions on this topic, but not a single one about my particular problem.
I'll simplify my problem in order to make it more clear.
Let's say I have multiple sentences from an English document and I want to classify them using a one-class SVM (in libsvm) in order to be able to spot anomalies (e.g. a German sentence) afterwards.
For training: I have samples of one class only (let's assume other classes do not exist beforehand). I extract all 3-grams (so the feature space includes at most 16777216 different features) and save them in libsvm format (label=1, just in case that matters).
Now I want to estimate my parameters. I tried to use grid.py with additional parameters; however, the runtime is too long for RBF kernels. So I try using linear kernels (therefore, grid.py may be changed to use only one value of gamma, as it does not matter for linear kernels).
Whatever I do, the smallest c that grid.py tests is shown as the best solution (does -c matter for linear kernels?).
Furthermore, it does not matter how much I change the -n (nu) value, the same relation between the scores is achieved every time (even though the number of support vectors changes). The scores are gathered using the python implementation. ("relation between scores" means that e.g. at first they are -1 and -2, I change nu and afterwards they are e.g. -0.5 and -1, so if I sort them, the same order always appears, as in this example):
# python2
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from svmutil import *
y,x = svm_read_problem("/tmp/english-3-grams.libsvm") # 5000 sentence samples
ym,xm = svm_read_problem("/tmp/german-3-grams.libsvm") # 50 sentence samples
m = svm_train(y,x,"-s 2 -t 2 -n 0.5");
# do the prediction in one or two steps, here is one step:
p_l, p_a, p_v = svm_predict(y[:100]+ym[:100],x[:100]+xm[:100],m)
# p_v are our scores.
# let's plot a roc curve
roc_ret = roc_curve([1]*100+[-1]*100,p_v)
plt.plot(roc_ret[0],roc_ret[1])
plt.show()
Here, the exact same ROC curve is achieved every time (even though -n is varied). Even if there is only 1 support vector, the same curve is shown.
Hence, my question (let's assume a maximum of 50000 samples per training):
- why is -n not changing anything for the one class training process?
- what parameters do i need to change for a one class svm?
- is a linear kernel the best approach (also with regard to runtime)? An RBF-kernel parameter grid search takes ages for such big datasets
- liblinear is not being used because I want to do anomaly detection = one-class SVM
Best regards,
mutilis
The performance impact is a result of your huge feature space of 16777216 elements. This results in very sparse vectors for elements like German sentences.
A study by Yang & Pedersen, A Comparative Study on Feature Selection in Text Categorization, shows that aggressive feature selection does not necessarily decrease classification accuracy. I achieved similar results while performing text classification of (medical) German text documents.
As stated in the comments, LIBLINEAR is fast because it is built for such sparse data. However, you end up with a linear classifier, with all its pitfalls and benefits.
I would suggest the following strategy (a rough sketch follows the list):
1. Perform aggressive feature selection (e.g. with InformationGain) with a remaining feature-space of N.
2. Increase N stepwise, in combination with cross-validation, and find the best matching N for your data.
3. Do a grid-search with the N found in 2.
4. Train your classifier with the best matching parameters found in 3. and the N found in 2.
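A rough sketch of that strategy using scikit-learn instead of the libsvm wrapper, with simplifications that are not part of the answer above (character 3-grams via CountVectorizer, and keeping the most frequent n-grams as a crude stand-in for information gain):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

def train_one_class(train_sentences, n_features, nu=0.1):
    # step 1 (simplified): keep only the n_features most frequent character 3-grams
    vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3),
                                 max_features=n_features, binary=True)
    X = vectorizer.fit_transform(train_sentences)
    model = OneClassSVM(kernel='linear', nu=nu).fit(X)
    return vectorizer, model

# step 2: grow the feature space and compare held-out scores at each size
english_train = ["this is an english sentence", "here is another english sentence",
                 "yet another english sentence", "english text goes here"]   # placeholder data
english_heldout = ["one more english sentence"]
german_heldout = ["das ist ein deutscher satz"]

for n in (1000, 5000, 20000):
    vec, m = train_one_class(english_train, n)
    in_scores = m.decision_function(vec.transform(english_heldout))
    out_scores = m.decision_function(vec.transform(german_heldout))
    print(n, in_scores.mean(), out_scores.mean())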

Very large log probabilities from sklearn's BayesianGaussianMixture

I've been using python to experiment with sklearn's BayesianGaussianMixture (and with GaussianMixture, which shows the same issue).
I fit the model with a number of items drawn from a distribution, then tested the model with a held out data set (some from the distribution, some outside it).
Something like:
X_train = ... # 70x321 matrix
X_in = ... # 20x321 matrix of held out data points from X
X_out = ... # 20x321 matrix of data points drawn from a different distribution
model = BayesianGaussianMixture(n_components=1)
model.fit(X_train)
print(model.score_samples(X_in).mean())
print(model.score_samples(X_out).mean())
outputs:
-1334380148.57
-2953544628.45
The score_samples method returns a per-sample log likelihood of the given data, and "in" samples are much more likely than the "out" samples as expected - I'm just wondering why the absolute values are so high?
The documentation for score_samples states "Compute the weighted log probabilities for each sample" - but I'm unclear what the weights are based on.
Do I need to scale my input first? Is my input dimensionality too high? Do I need to do some additional parameter tuning? Or am I just misunderstanding what the method returns?
The weights are based on the mixture weights.
Do I need to scale my input first?
This is usually not a bad idea, but I can't say for sure without knowing more about your data.
Is my input dimensionality too high?
It seems that, given the amount of data you are fitting, it actually is too high. Remember the curse of dimensionality. You have very few rows of data and 321 features, roughly a 1:4.5 ratio; that's not really going to work in practice.
Do I need to do some additional parameter tuning? Or am I just
misunderstanding what the method returns?
Your outputs are log-probabilities that are very negative. If you raise e to such a large negative power you get a probability that is very close to zero. Your results actually make sense from that perspective. You may want to check the log-probability in areas where you know there is a higher probability of being in that component. You may also want to check the covariances for each component to make sure you don't have a degenerate solution, which is quite likely given the amount of data and dimensionality in this case. Before any of that, you may want to get more data or see if you can reduce the number of dimensions.
I forgot to mention a rather important point: the output is for the density (not a probability), so keep that in mind too.
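For what it's worth, a sketch of the scale-and-reduce suggestion above, using the X_train/X_in/X_out from the question (the StandardScaler/PCA choices and the number of components are my own assumptions, not part of the answer):
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)                      # put the 321 features on a comparable scale
pca = PCA(n_components=10).fit(scaler.transform(X_train))   # far fewer dimensions than the 70 rows

model = BayesianGaussianMixture(n_components=1)
model.fit(pca.transform(scaler.transform(X_train)))

print(model.score_samples(pca.transform(scaler.transform(X_in))).mean())
print(model.score_samples(pca.transform(scaler.transform(X_out))).mean())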
