I want to create a document-term matrix. In my case it is not documents x words but sentences x words, so the sentences will act as the documents. I am applying 'l2' normalization after creating the doc-term matrix.
The term count is important for me to create summarization using SVD in further steps.
My question is which axis is appropriate for applying the 'l2' normalization. From my research I understand:
Axis=1 : Will give me the importance of the word within a sentence (each row, i.e. each sentence, is normalized).
Axis=0 : Importance of the word across all sentences, i.e. in the whole document collection (each column, i.e. each word, is normalized).
Even after knowing the theory I am not able to decide which alternative to choose, because the choice will greatly affect my summarization results. So kindly suggest which one to use, along with the reason for it.
By L2 normalization, do you mean division by the total count? (Strictly speaking, dividing by the total count is L1 normalization; L2 divides by the Euclidean norm, but the intuition below is similar.)
If you normalize along axis=0, then x_{i,j} becomes the probability of word j over all sentences i (division by the global count of that word). This depends on sentence length: longer sentences can repeat a word over and over, contribute a lot to its global count, and so end up with a much higher value for that word.
If you normalize along axis=1, then you are asking whether sentences have the same composition of words, since you normalize along the length of each sentence.
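For concreteness, here is a minimal sketch (my own toy example, not from your pipeline) showing which direction each axis points when the matrix is built with scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

sentences = ["the cat sat on the mat",
             "the dog chased the cat",
             "cats and dogs play outside"]

# Rows are sentences, columns are words
X = CountVectorizer().fit_transform(sentences)

# axis=1: each row (sentence) is scaled to unit L2 norm
X_per_sentence = normalize(X, norm='l2', axis=1)

# axis=0: each column (a word across all sentences) is scaled to unit L2 norm
X_per_word = normalize(X, norm='l2', axis=0)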
I want to use the TfidfVectorizer (or CountVectorizer followed by TfidfTransformer) to get a vector representation of my terms. That is, I want a vector for each term where the documents are the features. That's simply the transpose of a TF-IDF matrix created by the TfidfVectorizer.
>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.transpose()
However, I have 800k documents, which means my term vectors are very sparse and very large (800k dimensions). The max_features flag in the CountVectorizer looks like exactly what I need: I can specify a dimension and the CountVectorizer tries to fit all the information into that dimension. Unfortunately, this option applies to the document vectors rather than to the terms: it reduces the size of my vocabulary, because the terms are the features.
Is there any way to do the opposite? Like performing a transpose on the TfidfVectorizer object before it starts cutting and normalizing everything? And if such an approach exists, how can I do that? Something like this:
>>> countVectorizer = CountVectorizer(input='filename', max_features=300, transpose=True)
I have been looking for such an approach for a while now, but every guide and code example I find talks about the document TF-IDF vectors rather than the term vectors.
Thank you so much in advance!
I am not aware of any straightforward way to do this, but let me propose how it could be achieved.
You are trying to represent each term in your corpus as a vector that uses the documents in your corpus as its component features. Because the number of documents (which are the features in your case) is very large, you would like to limit them in a way similar to what max_features does.
According to the CountVectorizer user guide (the same holds for the TfidfVectorizer):
max_features : int, default=None
    If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
In a similar way, you want to keep the top documents ordered by their "frequency across the terms", as confusing as this may sound. This could be rephrased simplistically as "keep those documents that contain the most unique terms".
One way I can think of doing that is by using inverse_transform and performing the following steps:
from sklearn.feature_extraction.text import TfidfVectorizer

# corpus is your list of 800k documents, as in the question
vectorizer = TfidfVectorizer()
model = vectorizer.fit_transform(corpus)

# We use inverse_transform, which returns the
# terms with nonzero entries per document
inverse_model = vectorizer.inverse_transform(model)

# Each entry in the inverse model corresponds to a document
# and contains a list of feature names (the terms).
# As we want to rank the documents, we transform the list
# of feature names into the number of features
# that each document is represented by.
inverse_model_count = list(map(lambda doc_vec: len(doc_vec), inverse_model))

# As we are going to sort the list, we need to keep track of the
# document id (its index in the corpus), so we create tuples with
# the list index of each item before we sort the list.
inverse_model_count_tuples = list(zip(range(len(inverse_model_count)),
                                      inverse_model_count))

# Then we sort the list by the count of terms
# in each document (the second component)
max_features = 100
top_documents_tuples = sorted(inverse_model_count_tuples,
                              key=lambda item: item[1],
                              reverse=True)[:max_features]

# We are interested only in the document ids (the first tuple component)
top_documents, _ = zip(*top_documents_tuples)

# Having the top_documents ids we can slice the initial model
# to keep only the documents indicated by the top_documents list
# (converted to a list, as scipy sparse matrices do not accept a plain tuple of row indices)
reduced_model = model[list(top_documents)]
Please note that this approach only takes into account the number of distinct terms per document, regardless of their count (CountVectorizer) or weight (TfidfVectorizer).
If the direction of this approach is acceptable to you, then with some more code it is also possible to take the count or weight of the terms into account, as in the sketch below.
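For example, one possible direction (a rough sketch of mine, not tested on your data) would be to rank the documents by their total tf-idf weight instead of by their number of distinct terms:

import numpy as np

# Total tf-idf weight of each document (row) of the model
doc_weights = np.asarray(model.sum(axis=1)).ravel()

max_features = 100
# Ids of the documents with the largest total weight
top_documents = np.argsort(doc_weights)[::-1][:max_features]
reduced_model = model[top_documents]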
I hope this helps!
Assume that we have 1000 words (A1, A2, ..., A1000) in a dictionary. As far as I understand, in word embeddings or the word2vec method, the aim is to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary. Is it correct to say there should be 999 dimensions in each vector, i.e. that the size of each word2vec vector should be 999?
But with Gensim Python, we can modify the value of "size" parameter for Word2vec, let's say size = 100 in this case. So what does "size=100" mean? If we extract the output vector of A1, denoted (x1,x2,...,x100), what do x1,x2,...,x100 represent in this case?
It is not the case that "[word2vec] aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary".
Rather, given a certain target dimensionality, like say 100, the Word2Vec algorithm gradually trains word-vectors of 100-dimensions to be better and better at its training task, which is predicting nearby words.
This iterative process tends to force words that are related to be "near" each other, in rough proportion to their similarity - and even further the various "directions" in this 100-dimensional space often tend to match with human-perceivable semantic categories. So, the famous "wv(king) - wv(man) + wv(woman) ~= wv(queen)" example often works because "maleness/femaleness" and "royalty" are vaguely consistent regions/directions in the space.
The individual dimensions, alone, don't mean anything. The training process includes randomness, and over time just does "whatever works". The meaningful directions are not perfectly aligned with dimension axes, but angled through all the dimensions. (That is, you're not going to find that a v[77] is a gender-like dimension. Rather, if you took dozens of alternate male-like and female-like word pairs, and averaged all their differences, you might find some 100-dimensional vector-dimension that is suggestive of the gender direction.)
You can pick any 'size' you want, but 100-400 are common values when you have enough training data.
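As a small illustration (a toy corpus of mine; note the parameter is called size in gensim 3.x and vector_size in gensim 4.x):

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "chased", "the", "cat"],
             ["cats", "and", "dogs", "play", "outside"]]

# 100-dimensional vectors; use size=100 instead if you are on gensim 3.x
model = Word2Vec(sentences, vector_size=100, min_count=1)

vec = model.wv["cat"]                      # a plain 100-dimensional numpy array
print(vec.shape)                           # (100,)
print(model.wv.similarity("cat", "dog"))   # cosine similarity of the learned vectors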
I have the following code that I found at https://pythonprogramminglanguage.com/kmeans-text-clustering/ on document clustering. While I understand the k-means algorithm as a whole, I have a little trouble wrapping my head around what the top terms per cluster represent and how they are computed. Is it the most frequent words that occur in the cluster? One blog post I read said that the words output at the end represent the "top n words that are nearest to the cluster centroid" (but what does it mean for an actual word to be "closest" to the cluster centroid?). I really want to understand the details and nuances of what is going on. Thank you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

print("Top terms per cluster:")
# Sort each centroid's coordinates from largest to smallest
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()
'Top' in this context is directly related to the way in which the text has been transformed into an array of numerical values. By using TF-IDF you are, for each individual document, assigning each word a score based on how prevalent it is in that document, weighted inversely by its prevalence across the entire set of documents. A word with a high score in a document indicates that it is more significant or more representative of that document than the other words.
Therefore, the top terms generated for each cluster are the words that, on average, are most significant in the documents of that cluster.
The way it has been done here works and is efficient, but I find it difficult to understand myself and not particularly intuitive: it is hard to see why, if cluster_centers_ are the coordinates of the centroids, the features with the highest coordinate values are the top words. I kind of get it but not quite (if anyone wants to explain how this works, that would be great!).
I use a different method to find the top terms for a cluster, which I find more intuitive. I just tested the method you posted against my own on a corpus of 250 documents and the top words are exactly the same. The value of my method is that it works however you cluster the documents, as long as you can provide a list of the cluster assignments (which any clustering algorithm should provide), meaning you are not reliant on the presence of a cluster_centers_ attribute. It's also, I think, more intuitive.
import numpy as np

def term_scorer(doc_term_matrix, feature_name_list, labels=None, target=None, n_top_words=10):
    if target is not None:
        # Keep only the rows (documents) assigned to the target cluster
        filter_bool = np.array(labels) == target
        doc_term_matrix = doc_term_matrix[filter_bool]
    # Sum the score of each term over the selected documents
    # (flattened to a 1-D array so argsort behaves as expected on sparse input)
    term_scores = np.asarray(doc_term_matrix.sum(axis=0)).ravel()
    top_term_indices = np.argsort(term_scores)[::-1]
    return [feature_name_list[term_idx] for term_idx in top_term_indices[:n_top_words]]

term_scorer(X, terms, labels=model.labels_, target=1, n_top_words=10)
The model.labels_ attribute gives you a list of the cluster assignments for each document. In this example I want to find the top words for cluster 1, so I pass target=1; the function filters the X array, keeping only the rows assigned to cluster 1. It then sums the scores for each word across those documents, so it ends up with a single row containing one value per word. It then uses argsort to order that row from the highest values to the lowest, replacing the values with the original index positions of the words. Finally it uses a list comprehension to grab the first n_top_words index numbers and builds a list of words by looking up those indexes in feature_name_list.
When words are converted into vectors, we talk about the closeness of words as how similar they are. For instance, you could use cosine similarity to determine how close two words are to each other. The vectors for "dog" and "puppy" will be similar, so you could say the two words are close to each other.
In other words, closeness is also determined by the context words: a word pair such as (the, cat) can be close because the two words occur together in sentences. That is how word2vec and similar algorithms work to create word vectors.
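As a quick illustration with made-up 3-dimensional vectors (real embeddings have far more dimensions), cosine similarity can be computed like this:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, 0.0 means they are orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog   = np.array([0.8, 0.1, 0.3])
puppy = np.array([0.7, 0.2, 0.4])
car   = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(dog, puppy))   # high: "dog" and "puppy" are close
print(cosine_similarity(dog, car))     # lower: "dog" and "car" are further apart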
Given a list of word embedding vectors I'm trying to calculate an average word embedding where some words are more meaningful than others. In other words, I want to calculate a semantically weighted word embedding.
All the stuff I found is on just finding the mean vector (which is quite trivial of course) which represents the average meaning of the list OR some kind of weighted average of words for document representation, however that is not what I want.
For example, given word vectors for ['sunglasses', 'jeans', 'hats'] I would like to calculate such a vector which represents the semantics of those words BUT with 'sunglasses' having a bigger semantic impact. So, when comparing similarity, the word 'glasses' should be more similar to the list than 'pants'.
I hope the question is clear and thank you very much in advance!
Actually, averaging of word vectors can be done in two ways:
Mean of the word vectors without tf-idf weights.
Mean of the word vectors multiplied by their tf-idf weights.
The second option will solve your problem of word importance; a sketch follows below.
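A minimal sketch of the second option; the vectors and weights below are placeholders, in practice you would look them up from your embedding model and from a fitted tf-idf model:

import numpy as np

# Hypothetical word vectors (e.g. from word2vec) and per-word weights (e.g. tf-idf scores)
word_vectors = {
    "sunglasses": np.array([0.9, 0.1, 0.2]),
    "jeans":      np.array([0.2, 0.8, 0.1]),
    "hats":       np.array([0.3, 0.2, 0.7]),
}
weights = {"sunglasses": 3.0, "jeans": 1.0, "hats": 1.0}

words = ["sunglasses", "jeans", "hats"]
# Words with larger weights pull the averaged vector towards themselves
weighted_mean = sum(weights[w] * word_vectors[w] for w in words) / sum(weights[w] for w in words)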
The output of LDAModel.topicsMatrix() is unclear to me.
I think I understand the concept of LDA and that each topic is represented by a distribution over terms.
In the LDAModel.describeTopics() it is clear (I think):
The highest sum of likelihoods of the words of a sentence per topic indicates the evidence of this tweet belonging to that topic.
With n topics, the output of describeTopics() is an n x m matrix, where m is the size of the vocabulary. The values in this matrix are smaller than or equal to 1.
However in the LDAModel.topicsMatrix(), I have no idea what I am looking at. The same holds when reading the documentation.
That matrix is m x n, so the dimensions are swapped, and the values in it are non-negative but can be larger than 1 (for example 2), so they cannot be probabilities. What are these values? The number of occurrences of the word in the topic, perhaps?
How do I use these values do calculate the distance of a sentence to a topic?
I think the matrix is m x n, where m is the number of words (the vocabulary size) and n is the number of topics.
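If that is the case, one way to get probability distributions out of it (a sketch, assuming you have copied topicsMatrix() into a NumPy array of shape vocabulary size x number of topics) is to normalize each column so it sums to 1:

import numpy as np

# Toy stand-in for topicsMatrix(): shape (m, n) = (vocabulary size, number of topics),
# with non-negative, unnormalized word weights per topic
topics = np.array([[2.0, 0.5],
                   [1.0, 3.0],
                   [1.0, 0.5]])

# Divide each column by its sum so every topic becomes a distribution over the vocabulary
topic_word_dist = topics / topics.sum(axis=0, keepdims=True)
print(topic_word_dist.sum(axis=0))   # each column now sums to 1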