I want to use the TFIDFVectorizer (or CountVectorizer followed by TFIDFTransformer) to get a vector representation of my terms. That means, I want a vector for a term where the documents are the features. That's simply the transpose of a TF-IDF matrix created by the TFIDFVectorizer.
>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.transpose()
However, I have 800k documents which mean my term vectors are very sparse and very large (800k dimensions). The flag max_features in the CountVectorizer would do exactly what I'm looking for. I can specify a dimension and the CountVectorizer tries to fit all information into this dimension. Unfortunately, this option is for the document vectors rather than the terms in the vocabulary. Hence, it reduces the size of my vocabulary because the terms are the features.
Is there any way to do the opposite? Like, perform a transpose on the TFIDFVectorizer object before it starts cutting and normalizing everything? And if such an approach exists, how can I do that? Something like this:
>>> countVectorizer = CountVectorizer(input='filename', max_features=300, transpose=True)
I was looking for such an approach for a while now but every guide, code example, whatever is talking about the document TF-IDF vectors rather than the term vectors.
Thank you so much in advance!
I am not aware of any straight forward way to do this but let me propose a way how this could be achieved.
You are trying to represent each term in your corpus as a vector that uses the documents in your corpus as its component features. Because the number of documents (which are the features in your case) is very large, you would like to limit them in a way similar to what max_features does.
According to CountVectorizer user guide (same for the TfidfVectorizer):
max_features int, default=None
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
In a similar way, you want to keep the top documents ordered by their "frequency across the terms", as confusing as this may sound. This could be rephrased simplistically as "keep those documents that contain the most unique terms".
One way I can think of doing that is by using the inverse_transform performing the following steps:
vectorizer = TfidfVectorizer()
model = vectorizer.fit_transform(corpus)
# We use the inverse_transform which returns the
# terms per document with nonzero entries
inverse_model = vectorizer.inverse_transform(model)
# Each line in the inverse model corresponds to a document
# and contains a list of feature names (the terms).
# As we want to rank the documents we tranform the list
# of feature names to a number of features
# that each document is represented by.
inverse_model_count = list(map(lambda doc_vec: len(doc_vec), inverse_model))
# As we are going to sort the list, we need to keep track of the
# document id (its index in the corpus), so we create tuples with
# the list index of each item before we sort the list.
inverse_model_count_tuples = list(zip(range(len(inverse_model_count)),
inverse_model_count))
# Then we sort the list by the count of terms
# in each document (the second component)
max_features = 100
top_documents_tuples = sorted(inverse_model_count_tuples,
key=lambda item: item[1],
reverse=True)[:max_features]
# We are interested only in the document ids (the first tuple component)
top_documents, _ = zip(*top_documents_tuples)
# Having the top_documents ids we can slice the initial model
# to keep only the documents indicated by the top_documents list
reduced_model = model[top_documents]
Please note that this approach only takes into account the number of terms per document, no matter what is their count (CountVectorizer) or weight (TfidfVectorizer).
If the direction of this approach is acceptable for you then with some more code it could be possible to also take into account the count or weight of the terms.
I hope this helps!
Related
I am in doubt as to whether I use my TD-IDF calculations correctly. I have a large corpus of different documents, each document is stored in it's own row using pandas dataframe. I feed each row to TD-IDF in scikit-learn and store feature_names (words) in a list.
I am using following code:
term_tdidf = []
def tdidf_f(vec, matrix):
f_array = np.array(vec.get_feature_names())
t_sort = np.argsort(matrix.toarray()).flatten()[::-1]
n = 100
top_term = f_array[t_sort][:n]
term_tdidf.append(set(top_term))
for row in df.document:
x = TfidfVectorizer(stop_words='english')
tfidf_matrix = x.fit_transform(row)
terms = x.get_feature_names()
tdidf_f(x, tfidf_matrix)
After that I create new dataframe where each set of tdidf from each document is stored in a separate column.
Is that correct use of TD-IDF? I am running it only on single document, so terms I am getting are only calculated within this one document, correct? As I understand td-idf should be used across all documents to find one set of frequent terms, not multiple sets. Are there any consequences of such application?
My manual review of extracted features from each document indicates that terms I am getting are fitting. Afterwards I am using those terms to calculate similarity between documents and it seems to be correct.
To compute the IDF part of the weighting, you need to count the number of times the term occurs in the whole corpus, so your code is incorrect.
Here is a minimal example of how to use it:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["My first document", "My second document"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X is then a matrix where each row is a document and each column represents a term.
I have a list of 1000 documents, where the first 500 belongs to documents in movies (i.e. list index from 0 to 499) and the remaining 500 belings to documents in tv series (i.e. list index from 500 to 999).
For movies the document tag starts with movie_ (e.g., movie_fast_and_furious) and for tv series the document tag starts with tv_series_ (e.g., tv_series_the_office)
I use these movies and tv series dataset to build a doc2vec model as follows.
from gensim.models import doc2vec
from collections import namedtuple
dataset = json.load(open(input_file))
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
tags = [description[0]]
words = description[1]
docs.append(analyzedDocument(words, tags))
model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 10, min_count = 1, workers = 4, epochs = 20)
Now for each movie, I want to get its nearest 5 tv series along with their cosine similarity.
I know, the function gensim provides model.docvecs.most_similar. However, the results of this include movies as well (which is not my intension). Is it possible to do this in gensim (I assume that the document vectors are creating in the order of the documents list that we provide).
I am happy to provide more details if needed.
All the tags are opaque identifiers to Doc2Vec. So, if there's internal distinctions to your data, you'll need to model and filter on that yourself.
So, my main recommendation would be to ask for a much larger topn than you need, then discard those results of the type you don't want, or in excess of the number you actually need.
(Note that every calculation of most_similar() requires a comparison against the whole known set of doc-tags, and using a smaller topn only saves some computation in sorting of those full results. So using a larger topn, even up to the full size of the known doc-tags, isn't as costly as you might fear.)
With just two categories, to get the 10 tv-shows closest to a query movie, you could make topn equal to the count of movies, minus 1 (the query), plus 10 – then in the absolute worst-case, where all movies are closer than than the 1st tv-show, you'd still get 10 valid tv-show results.
(The most_similar() function also includes a restrict_vocab parameter. It takes an int count, and limits results to only the 1st that-many items, in the internal storage order. So if in fact the 1st 500 documents were all tv-shows, restrict_vocab=500 would give only results from that subset. However, I wouldn't recommend relying on this, as (a) it'd only work for one category that was front-loaded, not any others; (b) ideally for training, you wouldn't clump all similar documents together, but shuffle them to be interspersed with contrasting documents. Generally Word2Vec vector-sets are sorted to put the highest-frequency words 1st – no matter the order-of-appearance in the original data. That makes restrict_vocab more useful there, as often only results for the most-common words with the strongest vectors are most interesting.)
I have the following code that I found from https://pythonprogramminglanguage.com/kmeans-text-clustering/ on document clustering. While I understand the k-means algorithm as a whole, I have a little trouble wrapping my head about what the top terms per cluster represents and how that is computed? Is it the most frequent words that occur in the cluster? One blogpost I read said that the outputted words at the end represent the "top n words that are nearest to the cluster centroid" (but what does it mean for an actual word to be "closest" to the cluster centroid). I really want to understand the details and nuances of what is going on. Thank you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
"Merley has the best squooshy kitten belly.",
"Google Translate app is incredible.",
"If you open 100 tab in google you get a smiley face.",
"Best cat photo I've ever taken.",
"Climbing ninja cat.",
"Impressed with google map feedback.",
"Key promoter extension for Google Chrome."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
print("Cluster %d:" % i),
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind]),
print
'Top' in this context is directly related to the way in which the text has been transformed into an array of numerical values. By using TFIDF you are, for each individual document, assigning each word a score based on how prevalent it is in that document, inverse to the prevalence across the entire set of documents. A word with a high score in a document indicates that it is more significant or more representative of that document than the other words.
Therefore with this generation of top terms for each cluster, they are the words that, on average, are most significant in the documents for that cluster.
The way it has been done here works and is efficient but I find it difficult to understand myself and I don't think it is particularly intuitive as it is difficult to comprehend why, if cluster_centers_ are the co-ordinates for the centroids, then the features with the highest co-ordinate numbers are the top words. I kind of get it but not quite (if anyone wants to explain how this works that would be great!).
I use a different method to find the top terms for a cluster which I find more intuitive. I just tested the method you posted with my own on a corpus of 250 documents and the top words are exactly the same. The value of my method is that it works however you cluster the documents as long as you can provide a list of the cluster assignments (which any clustering algorithm should provide), meaning you're not reliant on the presence of a cluster_centers_ attribute. It's also, I think, more intuitive.
import numpy as np
def term_scorer(doc_term_matrix, feature_name_list, labels=None, target=None, n_top_words=10):
if target is not None:
filter_bool = np.array(labels) == target
doc_term_matrix = doc_term_matrix[filter_bool]
term_scores = np.sum(doc_term_matrix,axis=0)
top_term_indices = np.argsort(term_scores)[::-1]
return [feature_name_list[term_idx] for term_idx in top_term_indices[:n_top_words]]
term_scorer(X, terms, labels=model.labels_, target=1, n_top_words=10)
The model.labels_ attribute gives you a list of the cluster assignments for each document. In this example I want to find the top words for cluster 1 so I assign target=1, the function filters the X array keeping only rows assigned to cluster 1. It then sums all the scores across the documents row wise so it has one single row with a column for each word. It then uses argsort to sort that row by highest values to lowest, replaces the values with the original index positions of the words. Finally it uses a list comprehension to grab index numbers from the top score to n_top_words and then builds a list of words by looking up those indexes in feature_name_list.
When words are converted into vectors, we talk about closeness of words as how similar they are. So for instance, you could use cosine similarity for determining how close two words are to each other. a vector of "dog" and "puppy" will be similar so you could say the two words are close to each other.
In other terms, closeness is also determined by the context words. So, word pair (the, cat) can be close, as per the sentences. That is how word2vec or similar algorithms work to create word vectors.
I am attempting to remove words that occur once in my vocabulary to reduce my vocabulary size. I am using the sklearn TfidfVectorizer() and then the fit_transform function on my data frame.
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))
My first thought is the preprocessor field in the tfidf vectorizer or using the preprocessing package before machine learning.
Any tips or links to further implementation?
you are looking for min_df param (minimum frequency), from the documentation of scikit-learn TfidfVectorizer:
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also
called cut-off in the literature. If float, the parameter represents a
proportion of documents, integer absolute counts. This parameter is
ignored if vocabulary is not None.
# remove words occuring less than 5 times
tfidf = TfidfVectorizer(min_df=5)
you can also remove common words:
# remove words occuring in more than half the documents
tfidf = TfidfVectorizer(max_df=0.5)
you can also remove stopwords like this:
tfidf = TfidfVectorizer(stop_words='english')
ShmulikA's answer will most likely work well but will remove words based on document frequency. Thus, if the specific word occurs 200 times in only 1 document, it will be removed. TF-IDF vectorizer does not provide exactly what you want. You would have to:
Fit the vectorizer to your corpus. Extract the complete vocabulary from the vectorizer
Take the words as keys in a new dictionary.
count every word occurrence:
for every document in corpus: for word in document: vocabulary[word] += 1
Now, find out if there are values = 1, drop these entries from the dictionary. Put the keys into a list and pass the list as parameter to the TF-IDF vectorizer.
It will need a lot of looping, maybe just use min_df, which works well in practice.
I'm implementing feature vectors as bit maps for documents in a corpus. I already have the vocabulary for the entire corpus (as a list/set) and a list of the terms in each document.
For example, if the corpus vocabulary is ['a', 'b', 'c', 'd'] and the terms in document d1 is ['a', 'b', 'd', 'd'], the feature vector for d1 should be [1, 1, 0, 2].
To generate the feature vector, I'd iterate over the corpus vocabulary and check if each term is in the list of document terms, then set the bit in the correct position in the document's feature vector.
What would be the most efficient way to implement this? Here are some things I've considered:
Using a set would make checking vocab membership very efficient but sets have no ordering, and the feature vector bits need to be in the order of the sorted corpus vocabulary.
Using a dict for the corpus vocab (mapping each vocab term to an arbitrary value, like 1) would allow iteration over sorted(dict.keys()) so I could keep track of the index. However, I'd have the space overhead of dict.values().
Using a sorted(list) would be inefficient to check membership.
What would StackOverflow suggest?
I think the most efficient way is to loop over each document's terms, get the position of the term in the (sorted) corpus and set the bit accordingly.
The sorted list of corpus terms can be stored as dictionary with term -> index mapping (basically an inverted index).
You can create it like so:
corpus = dict(((term, index) for index, term in enumerate(sorted(all_words))))
For each document you'd have to generate a list of 0's as feature vector:
num_words = len(corpus)
fvs = [[0]*num_words for _ in docs]
Then building the feature vectors would be:
for i, doc_terms in enumerate(docs):
fv = fvs[i]
for term in doc_terms:
fv[corpus[term]] += 1
There is no overhead in testing membership, you just have to loop over all terms of all documents.
That all said, depending on the size of the corpus, you should have a look at numpy and scipy. It is likely that you will run into memory problems and scipy provides special datatypes for sparse matrices (instead of using a list of lists) which can save a lot of memory.
You can use the same approach as shown above, but instead of adding numbers to list elements, you add it to matrix elements (e.g. the rows will be the documents and the columns the terms of the corpus).
You can also make use of some matrix operations provided by numpy if you want to apply local or global weighting schemes.
I hope this gets you started :)