Remove single occurrences of words in vocabulary TF-IDF - python

I am attempting to remove words that occur once in my vocabulary to reduce my vocabulary size. I am using the sklearn TfidfVectorizer() and then the fit_transform function on my data frame.
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))
My first thought is the preprocessor field in the tfidf vectorizer or using the preprocessing package before machine learning.
Any tips or links to further implementation?

You are looking for the min_df parameter (minimum document frequency). From the scikit-learn TfidfVectorizer documentation:
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also
called cut-off in the literature. If float, the parameter represents a
proportion of documents, integer absolute counts. This parameter is
ignored if vocabulary is not None.
# ignore words that occur in fewer than 5 documents
tfidf = TfidfVectorizer(min_df=5)
You can also remove common words:
# ignore words that occur in more than half of the documents
tfidf = TfidfVectorizer(max_df=0.5)
you can also remove stopwords like this:
tfidf = TfidfVectorizer(stop_words='english')
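To see what min_df does to the vocabulary, here is a minimal sketch on a made-up three-document corpus (the documents and variable names are assumptions for illustration only). Note that "elderberry" occurs twice, but in a single document, so min_df=2 still drops it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this illustration.
docs = [
    "apple banana cherry",
    "apple banana date",
    "apple elderberry elderberry",
]

# The default min_df=1 keeps every term.
full_vocab = sorted(TfidfVectorizer().fit(docs).vocabulary_)

# min_df=2 keeps only terms that appear in at least two documents.
pruned_vocab = sorted(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)

print(full_vocab)
print(pruned_vocab)
```

For the original question (dropping words seen only once), min_df=2 is the closest built-in setting, with the caveat that it counts documents rather than occurrences.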

ShmulikA's answer will most likely work well but will remove words based on document frequency. Thus, if the specific word occurs 200 times in only 1 document, it will be removed. TF-IDF vectorizer does not provide exactly what you want. You would have to:
Fit the vectorizer to your corpus and extract the complete vocabulary from it.
Use the words as keys in a new dictionary, with every count starting at 0.
Count every word occurrence:
for document in corpus:
    for word in document:
        vocabulary[word] += 1
Drop every entry whose count is 1, put the remaining keys into a list, and pass that list as the vocabulary parameter of the TF-IDF vectorizer.
This needs a lot of looping, so it may be simpler to just use min_df, which works well in practice.
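The steps above can be sketched without explicit loops by letting CountVectorizer do the counting. This is only one possible way to do it, not the answer's exact procedure; the toy corpus and the variable names are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for this illustration.
docs = [
    "apple banana cherry",
    "apple banana date",
    "apple elderberry elderberry",
]

# Raw term counts: one row per document, one column per term.
counter = CountVectorizer()
counts = counter.fit_transform(docs)

# Total number of occurrences of each term across the whole corpus.
total_counts = np.asarray(counts.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index; keep terms seen more than once.
kept_terms = sorted(
    term for term, col in counter.vocabulary_.items() if total_counts[col] > 1
)

# Restrict the TF-IDF vectorizer to the surviving terms.
tfidf = TfidfVectorizer(vocabulary=kept_terms)
tfs = tfidf.fit_transform(docs)

print(kept_terms)
```

Unlike min_df=2, this keeps "elderberry", because it occurs twice in total even though only one document contains it.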

Related

Reduce Dimension of word-vectors from TFIDFVectorizer / CountVectorizer

I want to use the TfidfVectorizer (or CountVectorizer followed by TfidfTransformer) to get a vector representation of my terms. That is, I want a vector for each term, where the documents are the features. That is simply the transpose of the TF-IDF matrix created by the TfidfVectorizer.
>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.transpose()
However, I have 800k documents, which means my term vectors are very sparse and very large (800k dimensions). The max_features flag of the CountVectorizer would do exactly what I'm looking for: I can specify a dimension and the CountVectorizer tries to fit all information into this dimension. Unfortunately, this option applies to the document vectors rather than to the term vectors. Hence, it reduces the size of my vocabulary, because the terms are the features.
Is there any way to do the opposite? For example, to transpose the TfidfVectorizer object before it starts cutting and normalizing everything? And if such an approach exists, how can I do that? Something like this:
>>> countVectorizer = CountVectorizer(input='filename', max_features=300, transpose=True)
I have been looking for such an approach for a while now, but every guide and code example I find talks about the document TF-IDF vectors rather than the term vectors.
Thank you so much in advance!
I am not aware of any straightforward way to do this, but let me propose how it could be achieved.
You are trying to represent each term in your corpus as a vector that uses the documents in your corpus as its component features. Because the number of documents (which are the features in your case) is very large, you would like to limit them in a way similar to what max_features does.
According to CountVectorizer user guide (same for the TfidfVectorizer):
max_features int, default=None
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
In a similar way, you want to keep the top documents ordered by their "frequency across the terms", as confusing as this may sound. This could be rephrased simplistically as "keep those documents that contain the most unique terms".
One way I can think of doing that is by using the inverse_transform performing the following steps:
vectorizer = TfidfVectorizer()
model = vectorizer.fit_transform(corpus)
# We use the inverse_transform which returns the
# terms per document with nonzero entries
inverse_model = vectorizer.inverse_transform(model)
# Each line in the inverse model corresponds to a document
# and contains a list of feature names (the terms).
# As we want to rank the documents we transform the list
# of feature names to a number of features
# that each document is represented by.
inverse_model_count = list(map(lambda doc_vec: len(doc_vec), inverse_model))
# As we are going to sort the list, we need to keep track of the
# document id (its index in the corpus), so we create tuples with
# the list index of each item before we sort the list.
inverse_model_count_tuples = list(zip(range(len(inverse_model_count)),
                                      inverse_model_count))
# Then we sort the list by the count of terms
# in each document (the second component)
max_features = 100
top_documents_tuples = sorted(inverse_model_count_tuples,
                              key=lambda item: item[1],
                              reverse=True)[:max_features]
# We are interested only in the document ids (the first tuple component)
top_documents, _ = zip(*top_documents_tuples)
# Having the top_documents ids we can slice the initial model
# to keep only the documents indicated by the top_documents list
# (a list is needed here: a tuple would be read as a (row, column) index)
reduced_model = model[list(top_documents)]
Please note that this approach only takes into account the number of distinct terms per document, regardless of their count (CountVectorizer) or weight (TfidfVectorizer).
If the direction of this approach is acceptable for you, then with some more code it would be possible to also take the count or weight of the terms into account.
I hope this helps!
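As a rough sketch of the weight-based variant mentioned above, the documents could be ranked by the sum of their TF-IDF weights instead of by the number of distinct terms. The corpus, the variable names, and the choice of "sum of weights" as the ranking score are all assumptions made for illustration; other scores may suit the data better.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this illustration.
corpus = [
    "short text",
    "a somewhat longer text with several different words",
    "another document with words",
    "text",
]

vectorizer = TfidfVectorizer()
model = vectorizer.fit_transform(corpus).tocsr()

max_features = 2  # number of documents to keep as features

# Number of stored (nonzero) terms per document, read from the CSR row pointers.
terms_per_doc = np.diff(model.indptr)

# Sum of the TF-IDF weights per document.
weight_per_doc = np.asarray(model.sum(axis=1)).ravel()

# Document ids sorted by descending score, truncated to max_features.
top_by_terms = np.argsort(-terms_per_doc, kind="stable")[:max_features]
top_by_weight = np.argsort(-weight_per_doc, kind="stable")[:max_features]

# Keep only the selected documents, then transpose to get term vectors.
reduced_model = model[top_by_weight]
term_vectors = reduced_model.transpose()

print(term_vectors.shape)
```

Each row of term_vectors is then a term, described by max_features document features.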

Am I using TF-IDF correctly over a corpus of raw documents?

I am in doubt as to whether I am using my TF-IDF calculations correctly. I have a large corpus of different documents, and each document is stored in its own row of a pandas dataframe. I feed each row to TF-IDF in scikit-learn and store the feature names (words) in a list.
I am using the following code:
term_tdidf = []
def tdidf_f(vec, matrix):
    f_array = np.array(vec.get_feature_names())
    t_sort = np.argsort(matrix.toarray()).flatten()[::-1]
    n = 100
    top_term = f_array[t_sort][:n]
    term_tdidf.append(set(top_term))
for row in df.document:
    x = TfidfVectorizer(stop_words='english')
    tfidf_matrix = x.fit_transform(row)
    terms = x.get_feature_names()
    tdidf_f(x, tfidf_matrix)
After that I create a new dataframe where the set of TF-IDF terms from each document is stored in a separate column.
Is that a correct use of TF-IDF? I am running it on a single document at a time, so the terms I am getting are only calculated within that one document, correct? As I understand it, TF-IDF should be used across all documents to find one set of frequent terms, not multiple sets. Are there any consequences of applying it this way?
My manual review of the extracted features from each document indicates that the terms I am getting fit. Afterwards I use those terms to calculate similarity between documents, and that seems to be correct.
To compute the IDF part of the weighting, you need the number of documents in the whole corpus that contain each term. Fitting a separate vectorizer on each document cannot provide that, so your code is incorrect.
Here is a minimal example of how to use it:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["My first document", "My second document"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X is then a matrix where each row is a document and each column represents a term.
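Building on that, a possible way to get the top terms per document is to fit the vectorizer once on the whole corpus and then rank the weights row by row. This is a sketch under assumptions: the documents are made up, and n and the variable names are illustrative. It looks terms up through vocabulary_ so that it does not depend on which feature-name method the installed scikit-learn version provides.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

# Fit once on ALL documents so the IDF reflects the whole corpus.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# vocabulary_ maps term -> column index; invert it into an array of terms.
terms = np.empty(len(vectorizer.vocabulary_), dtype=object)
for term, col in vectorizer.vocabulary_.items():
    terms[col] = term

n = 2  # number of top terms to keep per document
top_terms = []
for i in range(X.shape[0]):
    row = X[i].toarray().ravel()
    nonzero = np.flatnonzero(row)                    # terms present in this document
    best = nonzero[np.argsort(-row[nonzero])][:n]    # highest weights first
    top_terms.append(set(terms[best]))

print(top_terms)
```

Here "cat" appears in two documents, so its IDF is lower and it is ranked below the terms that are unique to each document.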

How to ignore words that occur more than 50% using scikit-learn TfidfVectorizer?

I'm trying to apply the bag-of-words algorithm to a certain column of my dataframe, which has 6723 rows. But when I apply tf-idf to the column, the vocabulary returned is too big: 8357 words, to be precise.
# ...
statements = X_train[:, 0]
tf_idf = TfidfVectorizer()
tf_idf_vectorizer = tf_idf.fit_transform(statements).toarray()
vocabulary = tf_idf.vocabulary_
print(len(vocabulary)) # 8357
print(tf_idf.stop_words_) # set()
print(len(tf_idf.stop_words_)) # 0
After reading the documentation I found that we can add the max_df parameter, which is supposed to ignore words that have a frequency higher than the given threshold, so I did this in order to ignore words that have a frequency higher than 50%:
#...
tf_idf = TfidfVectorizer(max_df=0.5)
print(len(vocabulary)) # 8356
print(tf_idf.stop_words_) # {'the'}
print(len(tf_idf.stop_words_)) # 1
So, as you can see, the results were not very good, and I think that I'm doing something wrong, because there are other words with high frequencies that weren't removed, such as: to, in, of, etc. So am I doing something wrong? How can I fix that?
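One likely explanation is that max_df is a threshold on document frequency (the share of documents that contain a term), not on how often the term occurs overall. A word like "to" is only dropped if it appears in more than half of the rows. The sketch below illustrates this on a made-up corpus; the statements and variable names are assumptions, and stop_words='english' is shown as one alternative for removing function words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this illustration.
statements = [
    "the cat went to the park",
    "the dog went to sleep",
    "the bird sang",
    "a fish swam in the pond",
]

# max_df=0.5 drops terms found in MORE than 50% of the documents.
tf_idf = TfidfVectorizer(max_df=0.5)
matrix = tf_idf.fit_transform(statements)
vocabulary = tf_idf.vocabulary_

print("the" in vocabulary)  # "the" is in 4 of 4 documents
print("to" in vocabulary)   # "to" is in 2 of 4 documents, not above the threshold

# A built-in stop word list removes function words regardless of frequency.
with_stop_words = TfidfVectorizer(stop_words="english").fit(statements)
stop_vocabulary = with_stop_words.vocabulary_
print("to" in stop_vocabulary)
```

Lowering max_df, or combining it with a stop word list, would remove more of the frequent words.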

Check the tf-idf scores of sklearn in python

I am following the example here to calculate the TF-IDF values using sklearn.
My code is as follows.
from sklearn.feature_extraction.text import TfidfVectorizer
myvocabulary = ['life', 'learning']
corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}
tfidf = TfidfVectorizer(vocabulary = myvocabulary, ngram_range = (1,3))
tfs = tfidf.fit_transform(corpus.values())
I want to calculate the tf-idf values for the two words life and learning for the 3 documents in corpus.
According to the article I am referring to (see Table below), I should get the following values for my example.
However, the values I get from my code are completely different. Please help me find what is wrong in my code and how to fix it.
The main point is that you should not restrict the vocabulary to just two words ('life', 'learning') before constructing the term frequency matrix. If you do that, all other words will be ignored and it will affect the term frequency counting.
There are also several other steps that need to be taken into account if one wants to get exactly the same numbers as in the example by using sklearn:
The features in the example are unigrams (single words), so I have set ngram_range=(1,1).
The example uses a different normalization than sklearn for the term frequency part (the term counts are normalized by document lengths in the example, whereas sklearn uses raw term counts by default). Because of this, I have counted and normalized the term frequencies separately before calculating the idf part.
The normalization in the example for the idf part is also not the default for sklearn. This can be adjusted to match the example by setting smooth_idf=False.
Sklearn's vectorizers discard words with just one character by default, but such words are kept in the example. In the code below, I have modified token_pattern to also allow 1-character words.
The final tfidf matrix is obtained by multiplying the normalized counts by the idf vector.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import normalize
import pandas as pd
corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}
cvect = CountVectorizer(ngram_range=(1,1), token_pattern='(?u)\\b\\w+\\b')
counts = cvect.fit_transform(corpus.values())
normalized_counts = normalize(counts, norm='l1', axis=1)
tfidf = TfidfVectorizer(ngram_range=(1,1), token_pattern='(?u)\\b\\w+\\b', smooth_idf=False)
tfs = tfidf.fit_transform(corpus.values())
new_tfs = normalized_counts.multiply(tfidf.idf_)
# get_feature_names_out() needs scikit-learn 1.0+; older releases used get_feature_names()
feature_names = tfidf.get_feature_names_out()
corpus_index = [n for n in corpus]
df = pd.DataFrame(new_tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df.loc[['life', 'learning']])
However, in practice such modifications are rarely needed. One usually obtains good results just by using TfidfVectorizer directly.

Using Sklearn's TfidfVectorizer transform

I am trying to get the tf-idf vector for a single document using Sklearn's TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for any given testing document.
from sklearn.feature_extraction.text import TfidfVectorizer
self.vocabulary = "a list of words I want to look for in the documents".split()
self.vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
stop_words='english')
self.vect.fit_transform(self.vocabulary)
...
doc = "some string I want to get tf-idf vector for"
tfidf = self.vect.transform(doc)
The problem is that this returns a matrix with n rows where n is the size of my doc string. I want it to return just a single vector representing the tf-idf for the entire string. How can I make this see the string as a single document, rather than each character being a document? Also, I am very new to text mining so if I am doing something wrong conceptually, that would be great to know. Any help is appreciated.
If you want to compute tf-idf only for a given vocabulary, pass it as the vocabulary argument of the TfidfVectorizer constructor:
vocabulary = "a list of words I want to look for in the documents".split()
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
stop_words='english', vocabulary=vocabulary)
Then, to fit it (that is, to learn the document frequencies used for the IDF weights) on a given corpus, which is an iterable of documents, use fit:
vect.fit(corpus)
The fit_transform method is shorthand for
vect.fit(corpus)
corpus_tf_idf = vect.transform(corpus)
Lastly, the transform method accepts a corpus, so a single document should be passed as a list; otherwise the string is treated as an iterable of symbols, each symbol being a document.
doc_tfidf = vect.transform([doc])
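Put together, the whole flow might look like the sketch below. The training corpus, the vocabulary, and the test document are made up for illustration, and the extra constructor options from the question are left out to keep it short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus and vocabulary, invented for this illustration.
corpus = [
    "the first training document",
    "a second training document about words",
    "yet another text to look for words in",
]
vocabulary = ["training", "document", "words", "text"]

# Fit on the corpus (a list of documents), not on the vocabulary list.
vect = TfidfVectorizer(vocabulary=vocabulary)
vect.fit(corpus)

# Wrap the single document in a list so it counts as one document.
doc = "some document with words I want a vector for"
doc_tfidf = vect.transform([doc])

print(doc_tfidf.shape)       # one row, one column per vocabulary word
print(doc_tfidf.toarray())
```

The result has a single row, with nonzero weights only in the columns for vocabulary words that occur in the document.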
