understanding top n tfidf features in TfidfVectorizer

understanding top n tfidf features in TfidfVectorizer - python

I am trying to understand the TfidfVectorizer of scikit-learn a bit better. The following code has two documents doc1 = The car is driven on the road,doc2 = The truck is driven on the highway. By calling fit_transform a vectorized matrix of tf-idf weights is generated.
According to the tf-idf value matrix, shouldn't highway,truck,car be the top words instead of highway,truck,driven as highway = truck= car= 0.63 and driven = 0.44?
#testing tfidfvectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(tokenizer= lambda x:x.split(),stop_words = 'english')
response = vectorizer.fit_transform(tn)
feature_array = np.array(vectorizer.get_feature_names()) #list of features
print(feature_array)
print(response.toarray())
sorted_features = np.argsort(response.toarray()).flatten()[:-1] #index of highest valued features
print(sorted_features)
#printing top 3 weighted features
n = 3
top_n = feature_array[sorted_features][:n]
print(top_n)
['car' 'driven' 'highway' 'road' 'truck']
[[0.6316672 0.44943642 0. 0.6316672 0. ]
[0. 0.44943642 0.6316672 0. 0.6316672 ]]
[2 4 1 0 3 0 3 1 2]
['highway' 'truck' 'driven']

As you can see from the result, the tf-idf matrix is indeed giving a higher score to highway,truck,car (and truck):
tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(stop_words = 'english')
response = vectorizer.fit_transform(tn)
terms = vectorizer.get_feature_names()
pd.DataFrame(response.toarray(), columns=terms)
car driven highway road truck
0 0.631667 0.449436 0.000000 0.631667 0.000000
1 0.000000 0.449436 0.631667 0.000000 0.631667
What's wrong is the further check you do by flattening the array. To get the top scores accross all rows, you could instead do something like:
max_scores = response.toarray().max(0).argsort()
np.array(terms)[max_scores[-4:]]
array(['car', 'highway', 'road', 'truck'], dtype='<U7')
Where the highest scores are the feature_names that have a 0.63 score in the dataframe.

Related

How to use sklearn's Matrix factorization to predict new users' recommendation scores

I'm trying to use sklearn.decomposition.NMF to a matrix R that contains data on how users rated items to predict user ratings for items that they have not yet seen.
the matrix's rows being users, columns being items, and values being scores, with 0 score meaning that the user did not rate this item yet.
Now with the code below I have only managed to get the two matrices that when multiplied together give the original matrix back.
import numpy
R = numpy.array([
[5,3,0,1],
[4,0,0,1],
[1,1,0,5],
[1,0,0,4],
[0,1,5,4],
])
from sklearn.decomposition import NMF
model = NMF(n_components=4)
A = model.fit_transform(R)
B = model.components_
n = numpy.dot(A, B)
print(n)
Problem is, that the model does not predict new values in place of 0's, that would be the predicted scores, but instead recreates the matrix as was.
How do I get the model to predict user scores in place of my original matrix's zeros?

That is what is supposed to happen.
However in most of the cases you are not going to have number of components so similar to the number of products and/or customers.
So for instance considering 2 components
model = NMF(n_components=2)
A = model.fit_transform(R)
B = model.components_
R_estimated = np.dot(A, B)
print(np.sum(R-R_estimated))
-1.678873127048393
R_estimated
array([[5.2558264 , 1.99313836, 0. , 1.45512772],
[3.50429478, 1.32891458, 0. , 0.9701988 ],
[1.31294288, 0.94415991, 1.94956896, 3.94609389],
[0.98129195, 0.72179987, 1.52759811, 3.0788454 ],
[0. , 0.65008935, 2.84003662, 5.21894555]])
You can see in this case that many of the previous zeros are now other numbers you could use. Here for a bit of context https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems).
How to select n_components?
I think the question above is answered, but in case the complete procedure could be something as below.
For that we will need to know a the values in R that are real and we want to focus to predict.
In many cases 0 in R are those new cases / scenarios.
It is common to update R with the averages for products or customers and then calculate the decomposition for selecting the ideal n_components. For selection of they maybe a criteria or more to calculate the advantage in a test sample
Create R_with_Averages
Model selection:
2.1) Split R_with_Averages Test and Training
2.2) Compare among different n_components (from 1 and arbitrary number) using a metric (in which you only consider real evaluations in R)
2.3) Select the best model --> best n_components
Predict with the best model.
Perhaps good to see:
Sarwar, B. M., Karypis, G., Konstan, J. A., and Riedl, J. (2000). Application of Dimensionality Reduction in Recommender System—A Case Study. In ACM WebKDD’00 (Web-mining for ECommerce Workshop). This give you and overall view.
http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/. Example with code very similar.

sklearn's implementation of NMF does not seem to support missing values (Nans, here 0 values basically represent unknown ratings corresponding to new users), refer to this issue. However, we can use suprise's NMF implementation, as shown in the following code:
import numpy as np
import pandas as pd
from surprise import NMF, Dataset, Reader
R = np.array([
[5,3,0,1],
[4,0,0,1],
[1,1,0,5],
[1,0,0,4],
[0,1,5,4],
], dtype=np.float)
R[R==0] = np.nan
print(R)
# [[ 5. 3. nan 1.]
# [ 4. nan nan 1.]
# [ 1. 1. nan 5.]
# [ 1. nan nan 4.]
# [nan 1. 5. 4.]]
df = pd.DataFrame(data=R, index=range(R.shape[0]), columns=range(R.shape[1]))
df = pd.melt(df.reset_index(), id_vars='index', var_name='items', value_name='ratings').dropna(axis=0)
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df[['index', 'items', 'ratings']], reader)
k = 2
algo = NMF(n_factors=k)
trainset = data.build_full_trainset()
algo.fit(trainset)
predictions = algo.test(trainset.build_testset()) # predict the known ratings
R_hat = np.zeros_like(R)
for uid, iid, true_r, est, _ in predictions:
R_hat[uid, iid] = est
predictions = algo.test(trainset.build_anti_testset()) # predict the unknown ratings
for uid, iid, true_r, est, _ in predictions:
R_hat[uid, iid] = est
print(R_hat)
# [[4.40762528 2.62138084 3.48176319 0.91649316]
# [3.52973408 2.10913555 2.95701406 0.89922637]
# [0.94977826 0.81254138 4.98449755 4.34497549]
# [0.89442186 0.73041578 4.09958967 3.50951819]
# [1.33811051 0.99007556 4.37795636 3.53113236]]
The NMF implementation is as per the [NMF:2014] paper as described here and shown below:
Note that, here the optimization is performed using the known ratings only, resulting in the predicted values of the known ratings being close to the true ratings (but the predicted values for the unknown ratings are not in general close to 0, as expected).
Again, as usual, we can find the number of factors k using cross-validation.

Get clusters of words using Kmeans and TF-IDF

I am trying to clusters text words.
Let suppose I have a list of text
text=["WhatsApp extends 'confusing' update deadline",
"India begins world's biggest Covid vaccine drive",
"Nepali climbers make history with K2 winter summit"]
I implemented TF-IDF on this data
vec = TfidfVectorizer()
feat = vec .fit_transform(text)
After that, I applied Kmeans
kmeans = KMeans(n_clusters=num).fit(feat)
The thing I am confused about is how I get clusters of words such as
cluster 0
WhatsApp, update,biggest
cluster 1
history,biggest ,world's
etc.

You can use the get_feature_names() method from the TfidfVectorizer class with the predictions from KMeans to inspect the words in each cluster.
Here's a minimal example with two clusters and the three sentence provided by you:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
text = ["WhatsApp extends 'confusing' update deadline",
"India begins world's biggest Covid vaccine drive",
"Nepali climbers make history with K2 winter summit"]
vec = TfidfVectorizer()
feat = vec.fit_transform(text)
kmeans = KMeans(2).fit(feat)
pred = kmeans.predict(feat)
for i in range(2):
print(f"Cluster #{i}:")
words = []
for sentence in np.array(text)[pred==i]:
words += [fn for fn in vec.get_feature_names() if fn in sentence]
print(words)
Result:
Cluster #0:
['confusing', 'deadline', 'extends', 'update', 'begins', 'biggest', 'drive', 'vaccine', 'world']
Cluster #1:
['climbers', 'history', 'make', 'summit', 'winter', 'with']

K-means predict and fit predictions

I have a simple K-means program that extracts 2 clusters and then tries to predict for new sentences. I would like to find the best 'fit' for each cluster.
In my 'example predict'_c0 has a good fit for cluster 0 while 'predict_bad_fit' covers cluster 0 and 1
I suppose I have to calculate the average distatance to the cluster centroid for each predicted sentence. How do I do that?.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
c0_sents = ['cats and dogs','i like cats','cats not like dogs','cats and dogs animals',]
c1_sents = ['computer is for typing','i play games on my computer','programs run on computer','computer has screen']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(c0_sents+c1_sents)
k_means = KMeans(n_clusters=2)
k_means.fit(X)
order_centroids = k_means.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(2):
for ind in order_centroids[i, :4]:
print (i, terms[ind])
#0 computer
#0 on
#0 screen
#0 has
#1 cats
#1 dogs
#1 like
#1 and
predict_c0 = ['cats are not dogs']
predict_c1 = ['typing on computers']
predict_bad_fit = ['cats on computers dogs on screen']
for sent in predict_c0+predict_c1+predict_bad_fit:
X = vectorizer.transform([sent])
predicted = k_means.predict(X)
print (sent,predicted)
#cats are not dogs [0]
#typing on computers [1]
#cats on computers dogs on screen [1]

Get values from K-Means clusters using dataframe

I have this dataframe (text_df):
There are 10 different authors with 13834 rows of text.
I then created a bag of words and used a TfidfVectorizer like so:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_df=0.5,
max_features=13000,
min_df=5,
stop_words='english',
use_idf=True,
norm=u'l2',
smooth_idf=True
)
X = tfidf_v.fit_transform(corpus).toarray() # corpus --> bagofwords
y = text_df.iloc[:,1].values
Shape of X is (13834,2701)
I decided to use 7 clusters for KMeans:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=7,random_state=42)
I'd like to extract the authors of the texts in each cluster to see if the authors are consistently grouped into the same cluster. Not sure about the best way to go about this. Thanks!
Update:
Trying to visualize the author count per cluster using nested dictionary like so:
author_cluster = {}
for i in range(len(y_kmeans)):
# check 20 random predictions
j = np.random.randint(0, 13833, 1)[0]
if y_kmeans[j] not in author_cluster:
author_cluster[y_kmeans[j]] = {}
if y[j] not in author_cluster[y_kmeans[j]]:
author_cluster[y_kmeans[j]][y[j]] = 1
else:
author_cluster[y_kmeans[j]][y[j]] += 1
Output:
There should be a larger count per cluster and probably more than one author per cluster. I'd like to use all of the predictions to get a more accurate count instead of using a subset. But open to alternative solutions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
tfidf_v = TfidfVectorizer(max_df=0.5,
max_features=13000,
min_df=5,
stop_words='english',
use_idf=True,
norm=u'l2',
smooth_idf=True
)
X = tfidf_v.fit_transform(corpus) # I removed .toarray() - not sure why it was there except maybe for print debugging?
y = text_df.iloc[:,1].values
km = KMeans(n_clusters=7,random_state=42)
model = km.fit(X)
result = model.predict(X)
for i in range(20):
# check 20 random predictions
container = np.random.randint(low=0, high=13833, size=1)
j = container[0]
print(f'Author {y[j]} wrote {X[j]} and was put in cluster {result[j]}')

Should TF-IDF Score be Identical for Every Ngram with the Same Frequency in a Document

I am using sklearn to find frequencies and tf-idf scores for bigrams from a set of 50 documents. Let's say one of the cleaned documents is "run fast slow".
The output is:
ngram freq tfidf
run fast 1 .23
fast slow 1 .23
The ngrams in the output for that one document are found in other documents. Let's say "run fast" is found 20 times in the document collection and "fast slow" is found 30 times. Why are the tfidf scores the same for ngrams within a document that have the same frequency?
This doesn't intuitively seem like the correct output since the frequency across the document collection varies.
This is the code I am using to extract the features. It takes a grouped df and a text column from that df:
def extractFeatures(groupedDF, textCol):
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
tv = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
features = pd.DataFrame()
for id, group in tqdm(groupedDF):
freq = cv.fit_transform(group[textCol])
tfidf = tv.fit_transform(group[textCol])
freq = sum(freq).toarray()[0]
tfidf.todense()
tfidf = tfidf.toarray()[0]
freq = pd.DataFrame(freq, columns=['frequency'])
tfidf = pd.DataFrame(tfidf, columns=['tfidf'])
dfinner = pd.DataFrame(cv.get_feature_names(), columns=['ngram'])
dfinner['map'] = id
dfinner = dfinner.join(freq)
results = dfinner.join(tfidf)
features = features.append(results)
return features

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

understanding top n tfidf features in TfidfVectorizer - python

Related

How to use sklearn's Matrix factorization to predict new users' recommendation scores

Get clusters of words using Kmeans and TF-IDF

K-means predict and fit predictions

Get values from K-Means clusters using dataframe

Should TF-IDF Score be Identical for Every Ngram with the Same Frequency in a Document

Categories

Resources