Python KMeans Clustering - Handling NaN Values

I am trying to cluster a number of words using the KMeans algorithm from scikit learn.
In particular, I use pre-trained word embeddings (300 dimensional vectors) to map each word with a number vector and then I feed these vectors to KMeans and provide the number of clusters.
My issue is that there are certain words in my input corpus which I cannot find in the pre-trained word embeddings dictionary. In these cases, instead of a vector, I get a numpy array full of NaN values. This does not work with the KMeans algorithm, so I have to exclude these arrays. However, I am interested in seeing all the cases that were not found in the word embeddings and, what is more, if possible throwing them into a separate cluster that contains only them.
My idea at this point is to set a condition: if a word comes back from the embeddings index as a NaN-valued array, assign an arbitrary vector to it. Each dimension of the embedding vectors lies within [-1, 1]. Therefore, if I assign the vector [100000]*300 to all NaN words, I have created a set of outliers. In practice this works as expected, since this particular set of vectors is forced into a separate cluster. However, the initialization of the KMeans centroids is affected by these outlier values, and therefore all the rest of my clusters get messed up as well. As a remedy, I tried initializing KMeans with init='k-means++', but first, it takes significantly longer to execute, and second, the results are not much better.
Any suggestions as to how to approach this issue?
Thank you.

If you don't have data on a word, then skip it.
You could try to compute a word vector on the fly based on the context, but that essentially is the same as just skipping it.
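A minimal sketch of the skip-and-separate-cluster approach, assuming words is the list of input tokens, embeddings is a dict mapping each known word to its 300-dimensional vector, and n_clusters is chosen in advance (all three names are placeholders, not from the original post):
import numpy as np
from sklearn.cluster import KMeans

known_words = [w for w in words if w in embeddings]
oov_words = [w for w in words if w not in embeddings]  # keep these for inspection

X = np.array([embeddings[w] for w in known_words])
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X)

# label in-vocabulary words normally and park every OOV word in one extra cluster
labels = dict(zip(known_words, kmeans.labels_))
labels.update({w: n_clusters for w in oov_words})
This keeps the centroids uncontaminated by artificial outlier vectors while still letting you inspect the unmatched words as their own group.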

Related

Could an array of word vectors be handled if I use imbalanced-data techniques?

I have an imbalanced text classification dataset and I used word vectors (word2vec) to embed the text, so the result of the embedding is an array per document. I have a variable X for the word-vector arrays and a variable Y for the class/target of each array. If I use a technique to handle the imbalanced classification dataset, can it handle an array of word vectors too? Or would adding/dropping data with an imbalanced-data technique produce a wrong array? The techniques for handling imbalanced datasets just add/drop data based on the label/target of the word-vector arrays.
I have tried it and got no errors, but my models' accuracy is not the best. I don't know whether that is related.
Can anyone help me with an explanation?
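A minimal sketch of the workflow described above with imbalanced-learn's RandomOverSampler (which simply repeats minority-class rows, leaving the vectors themselves unchanged); X is assumed to be the 2-D array of document vectors and y the class labels, both placeholder names:
from imblearn.over_sampling import RandomOverSampler

# RandomOverSampler duplicates whole minority-class rows, so the word-vector
# arrays are copied as-is rather than altered
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)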

Python. Gensim Word2Vec. Word similarity

I've got a problem/question with Word2Vec.
As I understand it: let's train a model on a corpus of text (in my case a corpus of ~2 GB).
Let's take one line from this text and calculate the vector of this line (the line's vector = the sum of its word vectors). It will be something like this:
import numpy as np
coords = np.zeros(model.vector_size)  # running sum of the word vectors
for w in words:
    coords += model[w]
Then let's calculate the length of this vector with the standard library:
vectorLen = np.linalg.norm(coords)
Why do we need Word2Vec? Yes, for converting words to vectors AND for contextual proximity (words that appear near each other, and words that are close in meaning, get similar coordinates)!
And what I want (what I am expecting): if I take some line of the text, add some word from the dictionary which is not typical for this line, and then calculate the length of this vector again, I should get quite a different value than if I calculate only the vector of this line, without adding any uncharacteristic words from the dictionary.
But in fact, the lengths of these vectors (before and after adding the word(s)) are quite similar! Moreover, they are practically the same! Why am I getting this result?
If I understand correctly, the coordinates of the words in the line will be quite similar (contextual proximity), but the new words will have rather different coordinates, and that should affect the result (the vector length of the line with the new words)!
E.g., these are my W2V model settings:
#Word2Vec model
model = gensim.models.Word2Vec(
    sg=0,
    size=300,
    window=3,
    min_count=1,
    hs=0,
    negative=5,
    workers=10,
    alpha=0.025,
    min_alpha=0.025,
    sample=1e-3,
    iter=20
)
#prepare the model vocabulary
model.build_vocab(sentences, update=False)
#train model
model.train(sentences, epochs=model.iter, total_examples=model.corpus_count)
OR this:
#Word2Vec model
model = gensim.models.Word2Vec(
    sg=1,
    size=100,
    window=10,
    min_count=1,
    hs=0,
    negative=5,
    workers=10,
    alpha=0.025,
    min_alpha=0.025,
    seed=7,
    sample=1e-3,
    hashfxn=hash,
    iter=20
)
#prepare the model vocabulary
model.build_vocab(sentences, update=False)
What's the problem? And how can I get the result I need?
Why do you need the "vector length" to noticeably change, as a "desired result"?
The length of word-vectors (or sums of same) isn't usually of major interest. In fact, it's common to normalize the word-vectors to unit-length before doing comparisons. (And sometimes, when doing sums/averages as a simple way to create vectors for runs-of-multiple-words, the vectors might be unit-normalized before or after such an operation.)
Instead, it's usually the direction (angle) that's of most interest.
Further, what do you mean when describing the length values as "quite similar"? Without showing the actual lengths you've seen in your tests, it's unclear whether your intuitions about what the change "should" be are correct.
Note that in multi-dimensional spaces, and especially high-dimensional spaces, our intuitions are quite often wrong.
For example, try adding a bunch of pairs of random unit vectors in 2d space, and looking at the norm length of the sum. As you might expect, you'll likely see varied results that range from nearly 0.0 to nearly 2.0 – representing moving closer or further to the origin.
Try instead adding a bunch of pairs of random unit vectors in 500d space. Now, the norm length of the sum is going to almost always be close to 1.4. Essentially, with 500 directions to go, most sums won't significantly move closer or further to the origin, even though they still move 1.0 away from either vector individually.
You're likely observing the same thing with your word-vectors. They're fine, but the measure you've chosen to take – the norm of a vector sum – just doesn't change the way you'd expect, in a high-dimensional space.
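A quick way to check this intuition with random unit vectors rather than word vectors (the function name and sample sizes here are just illustrative):
import numpy as np

rng = np.random.default_rng(0)

def sum_norms(dim, n_pairs=10000):
    # norms of sums of random unit-vector pairs in the given dimensionality
    a = rng.normal(size=(n_pairs, dim))
    b = rng.normal(size=(n_pairs, dim))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return np.linalg.norm(a + b, axis=1)

print(sum_norms(2).min(), sum_norms(2).max())      # roughly spans 0.0 .. 2.0
print(sum_norms(500).min(), sum_norms(500).max())  # stays close to sqrt(2) ~= 1.41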
Separately, unrelated to your main issue, but about your displayed word2vec parameters:
You might think using a non-default min_count=1, by retaining more words/information, results in better vectors. However, it usually hurts word-vector quality to retain such rare words. Word-vector quality requires many varied examples of word usage. Words with just 1, or a few, examples don't get good vectors from those few idiosyncratic usage examples, but do serve as training noise/interference in the improvement of other word-vectors with more examples.
Usual stochastic-gradient-descent optimization relies on the alpha learning-rate decaying to a negligible value over the course of training. Setting the ending min_alpha to the same value as the starting alpha thwarts this. (In general, most users shouldn't change either of the alpha parameters, and if they need to tinker at all, changing the starting value makes more sense.)
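A hedged sketch of a more conventional setup following that advice, using the same gensim 3.x-style API as the snippets above (values are illustrative, not tuned): leave min_count and the alpha decay at their defaults.
import gensim

model = gensim.models.Word2Vec(
    sentences,   # passing the corpus here builds the vocab and trains in one step
    sg=0,
    size=300,
    window=5,
    negative=5,
    workers=10,
    iter=20,
)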

Fixed-RAM DBSCAN, or another clustering algorithm without a predefined number of clusters?

I want to cluster 3.5M 300-dimensional word2vec vectors from my custom gensim model to determine whether I can use those clusters to find topic-related words. It is not the same as model.most_similar_..., since I hope to group together quite distant, but still related, words.
The overall size of the model (after normalization of vectors, i.e. model.init_sims(replace=True)) in memory is 4GB:
import sys
import numpy as np

words = sorted(model.wv.vocab.keys())
vectors = np.array([model.wv[w] for w in words])
sys.getsizeof(vectors)  # 4456416112
I tried both scikit's DBSCAN and some other implementations from GitHub, but they seem to consume more and more RAM during processing and crash with std::bad_alloc after some time. I have 32 GB of RAM and 130GB swap.
The metric is Euclidean; I convert my cosine threshold cos=0.48 to eps=sqrt(2-2*0.48) (valid because the vectors are unit-normalized), so all the usual optimizations should apply.
The problem is that I don't know the number of clusters and want to determine them by setting a threshold for closely related words (say cos > 0.48, i.e. d_l2 < sqrt(2-2*0.48)). DBSCAN seems to work on small subsets, but I can't get the computation through on the full data.
Is there any algorithm or workaround in Python which can help with that?
EDIT: The distance matrix, at size(float)=4 bytes, would be 3.5M*3.5M*4/1024(KB)/1024(MB)/1024(GB)/1024(TB) ≈ 44.5 TB, so it's impossible to precompute it.
EDIT2: Currently trying ELKI, but I cannot get it to cluster the data properly even on a toy subset.
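As a hedged illustration of the cos-to-eps conversion mentioned above, and of running scikit-learn's DBSCAN on a subset while memory remains the limit (min_samples and the subset size are arbitrary choices; vectors is the array built in the snippet above):
import numpy as np
from sklearn.cluster import DBSCAN

cos_threshold = 0.48
eps = np.sqrt(2 - 2 * cos_threshold)  # valid because the vectors are unit-normalized

subset = vectors[:200000]             # the full 3.5M rows exhaust RAM in this setup
labels = DBSCAN(eps=eps, min_samples=5, metric='euclidean',
                algorithm='ball_tree', n_jobs=-1).fit_predict(subset)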

How does doc2vec.infer_vector combine across words?

I trained a doc2vec model using train(..) with default settings. That worked, but now I'm wondering how infer_vector combines across the input words: is it just the average of the individual word vectors?
model.random.seed(0)
model.infer_vector(['cat', 'hat'])
model.random.seed(0)
model.infer_vector(['cat'])
model.infer_vector(['hat']) #doesn't average up to the ['cat', 'hat'] vector
model.random.seed(0)
model.infer_vector(['hat'])
model.infer_vector(['cat']) #doesn't average up to the ['cat', 'hat'] vector
Those don't add up, so I'm wondering what I'm misunderstanding.
infer_vector() doesn't combine the vectors for your given tokens – and in some modes doesn't consider those tokens' vectors at all.
Rather, it considers the entire Doc2Vec model as being frozen against internal changes, and then assumes the tokens you've provided are an example text, with a previously untrained tag. Let's call this implied but unnamed tag X.
Using a training-like process, it tries to find a good vector for X. That is, it starts with a random vector (as it did for all tags in original training), then sees how well that vector as model-input predicts the text's words (by checking the model neural-network's predictions for input X). Then via incremental gradient descent it makes that candidate vector for X better and better at predicting the text's words.
After enough such inference-training, the vector will be about as good (given the rest of the frozen model) as it possibly can be at predicting the text's words. So even though you're providing that text as an "input" to the method, inside the model, what you've provided is used to pick target "outputs" of the algorithm for optimization.
Note that:
tiny examples (like one or a few words) aren't likely to give very meaningful results – they are sharp-edged corner cases, and the essential value of these sorts of dense embedded representations usually arises from the marginal balancing of many word-influences
it will probably help to do far more training-inference cycles than the infer_vector() default steps=5 – some have reported tens or hundreds of steps work best for them, and it may be especially valuable to use more steps with short texts
it may also help to use a starting alpha for inference more like that used in bulk training (alpha=0.025), rather than the infer_vector() default (alpha=0.1)
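As a rough example (gensim 3.x API, where the keyword is steps; newer gensim versions renamed it to epochs, and the values here are just a starting point):
# more inference passes and a bulk-training-like starting alpha
vec = model.infer_vector(['cat', 'hat'], steps=100, alpha=0.025)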

Training a Machine Learning predictor

I have been trying to build a prediction model using a user's data. The model's input is documents' metadata (date published, title, etc.) and the document label is that user's preference (like/dislike). I would like to ask some questions that I have come across, hoping for some answers:
There are way more liked documents than disliked ones. I read somewhere that if somebody trains a model using way more inputs of one label than the other, this affects the performance in a bad way (the model tends to classify everything to the label/outcome that has the majority of inputs).
Is it possible for the input to an ML algorithm, e.g. logistic regression, to be hybrid in terms of numbers and words, and how could that be done? Something like:
input = [18, 23, 1, 0, 'cryptography'] with label = ['Like']
Also, can we use a vector (that represents a word, using tf-idf etc.) as an input feature (e.g. a 50-dimensional vector)?
In order to construct a prediction model using textual data, is the only way to do so to derive a dictionary out of every word mentioned in our documents and then construct a binary input that dictates whether a term is mentioned or not? With such a representation, though, we lose the weight of the term in the collection, right?
Can we use something like a word2vec vector as a single input in a supervised learning model?
Thank you for your time.
You either need to under-sample the bigger class (take a small random sample to match the size of the smaller class), over-sample the smaller class (bootstrap sample), or use an algorithm that supports unbalanced data - and for that you'll need to read the documentation.
You need to turn your words into a word vector. Columns are all the unique words in your corpus. Rows are the documents. Cell values are one of: whether the word appears in the document, the number of times it appears, the relative frequency of its appearance, or its TFIDF score. You can then have these columns along with your other non-word columns.
Now you probably have more columns than rows, meaning you'll get a singularity with matrix-based algorithms, in which case you need something like SVM or Naive Bayes.
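A small sketch of that layout with scikit-learn, assuming docs holds the raw text and numeric_features the other metadata columns (both placeholder names, with toy values); class_weight='balanced' is one built-in way to address the like/dislike imbalance mentioned above:
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["intro to cryptography", "a survey of databases"]     # document text/titles
numeric_features = np.array([[18, 23, 1, 0], [30, 5, 0, 1]])  # metadata columns
labels = ["Like", "Dislike"]

# words become TF-IDF columns; the non-word columns are appended afterwards
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(docs)
X = hstack([X_text, csr_matrix(numeric_features)])

clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, labels)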
