What is the output of Spark MLLIB LDA topicsmatrix?

The output of LDAModel.topicsMatrix() is unclear to me.
I think I understand the concept of LDA and that each topic is represented by a distribution over terms.
The output of LDAModel.describeTopics() is clear (I think):
The topic with the highest sum of word likelihoods over a sentence is the topic that the sentence (a tweet, in my case) most likely belongs to.
With n topics, the output of describeTopics() is an n x m matrix, where m is the size of the vocabulary. The values in this matrix are less than or equal to 1.
However, with LDAModel.topicsMatrix() I have no idea what I am looking at, and reading the documentation does not help.
The matrix is m x n, so the dimensions are swapped, and the values are larger than zero but can exceed 1 (a value of 2, for example, is not a probability). What are these values? The number of occurrences of the word in the topic, perhaps?
How do I use these values to calculate the distance of a sentence to a topic?

I think the matrix is m x n, where m is the number of words (the vocabulary size) and n is the number of topics.
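One way to read these numbers, assuming the entries are unnormalized per-topic term weights rather than probabilities (which would explain values above 1): dividing each column by its sum gives one distribution over terms per topic. The matrix below is made up for illustration, and the variable names are hypothetical, not part of the Spark API.

```python
import numpy as np

# Hypothetical 4-term x 2-topic matrix of unnormalized weights (made-up numbers,
# standing in for the m x n output described above).
topics_matrix = np.array([
    [2.0, 0.5],
    [1.0, 0.5],
    [0.5, 3.0],
    [0.5, 1.0],
])

# Divide every column by its sum so each topic becomes a distribution over terms.
topic_term_probs = topics_matrix / topics_matrix.sum(axis=0, keepdims=True)

# Score a sentence by summing, per topic, the probabilities of the terms it contains.
sentence_term_ids = [0, 1]  # assumed vocabulary indices of the sentence's words
scores = topic_term_probs[sentence_term_ids].sum(axis=0)
best_topic = int(scores.argmax())
```

Whether the sum of probabilities is the right score is a modelling choice; the sum of log-probabilities is a common alternative.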

Related

How to cluster large number of strings based on similarity matrix?

I need to cluster 500K+ strings based on their similarity.
I have calculated their pair-wise Levenshtein Distances and made a sparse similarity matrix. This matrix contains binary similarities: values for small distances are set to 1.0 and others are 0.0.
I don't know what kind of clustering is good for me. I don't know the number of clusters in advance but it may be considerably large because the similarity matrix is very sparse (about 0.1% values are non-zero).

Have you considered something like Soundex (https://en.wikipedia.org/wiki/Soundex)? The advantage of such algorithms is that similar words share the same canonical form. For example, both "Robert" and "Rupert" return the same string "R163". Then your clustering boils down to a map like:
clusters = { canonical_form: [list of similar words] }
Naturally, you can tweak the Soundex rules according to your domain.
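A minimal sketch of that idea, assuming the standard American Soundex rules; the function names soundex and cluster_by_soundex are made up for this example:

```python
from collections import defaultdict

# Consonant groups and their Soundex digits; vowels, y, h and w have no digit.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (first.upper() + "".join(digits) + "000")[:4]


def cluster_by_soundex(words):
    """Group words that share the same canonical Soundex form."""
    clusters = defaultdict(list)
    for word in words:
        clusters[soundex(word)].append(word)
    return dict(clusters)


clusters = cluster_by_soundex(["Robert", "Rupert", "Rubin"])
```

This runs in a single pass over the strings, so it avoids building the pairwise distance matrix at all, at the cost of a much cruder notion of similarity than Levenshtein distance.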

In count vectorizer which axis to use?

I want to create a document term matrix. In my case it is not like documents x words but it is sentences x words so the sentences will act as the documents. I am using 'l2' normalization post doc-term matrix creation.
The term count is important for me to create summarization using SVD in further steps.
My query is which axis will be appropriate to apply 'l2' normalization. With sufficient research I understood:
Axis=1 : Will give me the importance of the word in a sentence (column wise normalization)
Axis=0 : Importance of the word in a document (row wise normalization).
Even knowing the theory, I cannot decide which alternative to choose, because the choice will greatly affect my summarization results. Please suggest which option to use and explain why.

By L2 normalization, do you mean division by the total count?
If you normalize along axis=0, then the value of x_{i,j} is the probability of the word j over all sentences i (division by the global word count), which is dependent on the length of the sentence, as longer ones can repeat some words over and over again and will have a much higher probability for this word, as they contribute a lot to the global word count.
If you normalize along axis=1, then you're asking whether sentences have the same composition of words, as you normalize by the length of the sentence.
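To make the two options concrete, here is a small numpy sketch with made-up counts (rows are sentences, columns are words). It shows L2 normalization along each axis next to the "division by the total count" variant; the variable names are only for illustration.

```python
import numpy as np

# Hypothetical sentence-term counts: 2 sentences (rows) x 3 words (columns).
counts = np.array([
    [3.0, 4.0, 0.0],
    [0.0, 6.0, 8.0],
])

# axis=1: the norm is taken over the words of each sentence,
# so every sentence (row) vector ends up with unit L2 length.
per_sentence = counts / np.linalg.norm(counts, axis=1, keepdims=True)

# axis=0: the norm is taken over the sentences for each word,
# so every word (column) vector ends up with unit L2 length.
per_word = counts / np.linalg.norm(counts, axis=0, keepdims=True)

# Division by the total count per sentence (L1 rather than L2):
# each row becomes the fraction of the sentence made up by each word.
row_fraction = counts / counts.sum(axis=1, keepdims=True)
```

With axis=1 a long sentence and a short sentence with the same word mix get the same vector, which is usually what a per-sentence comparison wants.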

What is the meaning of "size" of word2vec vectors [gensim library]?

Assume that we have 1000 words (A1, A2,..., A1000) in a dictionary. As far as I understand, the word embedding or word2vec method aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary. Is it correct to say there should be 999 dimensions in each vector, or the size of each word2vec vector should be 999?
But with Gensim Python, we can modify the value of "size" parameter for Word2vec, let's say size = 100 in this case. So what does "size=100" mean? If we extract the output vector of A1, denoted (x1,x2,...,x100), what do x1,x2,...,x100 represent in this case?

It is not the case that "[word2vec] aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary".
Rather, given a certain target dimensionality, like say 100, the Word2Vec algorithm gradually trains word-vectors of 100-dimensions to be better and better at its training task, which is predicting nearby words.
This iterative process tends to force words that are related to be "near" each other, in rough proportion to their similarity - and even further the various "directions" in this 100-dimensional space often tend to match with human-perceivable semantic categories. So, the famous "wv(king) - wv(man) + wv(woman) ~= wv(queen)" example often works because "maleness/femaleness" and "royalty" are vaguely consistent regions/directions in the space.
The individual dimensions, alone, don't mean anything. The training process includes randomness, and over time just does "whatever works". The meaningful directions are not perfectly aligned with dimension axes, but angled through all the dimensions. (That is, you're not going to find that a v[77] is a gender-like dimension. Rather, if you took dozens of alternate male-like and female-like word pairs, and averaged all their differences, you might find some 100-dimensional vector-dimension that is suggestive of the gender direction.)
You can pick any 'size' you want, but 100-400 are common values when you have enough training data.

Imposing a cap on word count in scikit learn

I'm analyzing song lyrics where repetition doesn't necessarily mean higher importance, so I'd like to cap the word count per document. For example, if a word appears n times in a song, where n > threshold, then I would replace n with threshold.
I've checked the CountVectorizer docs, and there's an option for a min_df and max_df, but these can only disregard words that appear in some m documents, not words that appear n times in a single document.
I was thinking of changing the elements of the sparse matrix (say, find all elements > threshold, then replace), but I couldn't find a way to do that either. Thanks in advance!

I don't know of any prebuilt feature in scikit learn for this, but you could definitely edit your doc-term matrix directly, with numpy.where for example :
x = numpy.where(x < threshold, x, threshold)
where x is your doc-term matrix and threshold is, well, your threshold.
EDIT :
I hadn't realized numpy.where didn't work on scipy sparse matrices. You can use the find function from scipy.sparse that will return all non-0 indices in a sparse matrix in order to access and modify those values directly:
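from scipy.sparse import find
# find() returns the row indices, column indices and values of the non-zero entries
rows, cols, _ = find(x > threshold)
for i, j in zip(rows, cols):
    x[i, j] = threshold  # cap each entry that exceeds the threshold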
It's significantly less elegant but it works.
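A possible alternative without the Python loop, assuming x is a CSR or CSC matrix (CountVectorizer returns CSR) and threshold is positive: the stored non-zero values live in the .data array, so they can be clipped in one vectorized call. The small matrix here is made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

threshold = 3

# Hypothetical doc-term counts: 2 documents x 3 words.
x = csr_matrix(np.array([[1, 5, 0],
                         [0, 2, 7]]))

# Only explicitly stored (non-zero) counts are in x.data; zeros are untouched,
# which is fine as long as the threshold is positive.
x.data = np.minimum(x.data, threshold)

capped = x.toarray()
```

The sparsity structure does not change, so this stays cheap even for a large vocabulary.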

Semantically weighted mean of word embeddings

Given a list of word embedding vectors I'm trying to calculate an average word embedding where some words are more meaningful than others. In other words, I want to calculate a semantically weighted word embedding.
All the stuff I found is on just finding the mean vector (which is quite trivial of course) which represents the average meaning of the list OR some kind of weighted average of words for document representation, however that is not what I want.
For example, given word vectors for ['sunglasses', 'jeans', 'hats'] I would like to calculate such a vector which represents the semantics of those words BUT with 'sunglasses' having a bigger semantic impact. So, when comparing similarity, the word 'glasses' should be more similar to the list than 'pants'.
I hope the question is clear and thank you very much in advance!

Word vectors can be averaged in two ways:
A plain mean of the word vectors, without tf-idf weights.
A mean of the word vectors after multiplying each one by its tf-idf weight.
The weighted version addresses your problem of word importance.
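A minimal sketch of the weighted variant. The 3-dimensional vectors are toy values, not real embeddings, and the weights are hand-picked stand-ins for tf-idf scores (or any other importance score); weighted_mean and cosine are hypothetical helper names.

```python
import numpy as np

# Toy vectors for illustration only; real ones would come from a trained model.
vectors = {
    "sunglasses": np.array([1.0, 0.0, 0.0]),
    "jeans": np.array([0.0, 1.0, 0.0]),
    "hats": np.array([0.0, 0.0, 1.0]),
}

# Assumed importance weights; 'sunglasses' should dominate the result.
weights = {"sunglasses": 3.0, "jeans": 1.0, "hats": 1.0}


def weighted_mean(vectors, weights):
    """Average the word vectors, weighting each word by its importance."""
    words = list(vectors)
    stacked = np.stack([vectors[word] for word in words])
    w = np.array([weights[word] for word in words])
    return np.average(stacked, axis=0, weights=w)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


combined = weighted_mean(vectors, weights)
```

Comparing cosine(combined, v) for candidate vectors v then favours words close to the heavily weighted entries.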
