transform a tf-idf pandas dataframe into a tf-idf matrix - python

How can I convert the following pandas dataframe, holding the tf-idf score of each word in several documents, into a matrix named "tfidf" so that I can implement, for instance,
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

# note: avoid naming the variable `str`, which shadows the builtin
s = 'this sentence has unseen text such as computer but also king lord juliet'
response = tfidf.transform([s])

You need to fit the TfidfVectorizer using the original raw documents before being able to use it to transform a new document.
If you cannot access the original documents you can always recover the idf weights of each word by constructing a dictionary:
idfs[word] = log((# documents) / (# documents where word has a nonzero tf-idf weight))
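For instance, a minimal sketch of recovering those idf weights directly from the existing dataframe (here called tfidf_df, one row per document and one column per word; the name is an assumption):
import numpy as np

n_docs = len(tfidf_df)
doc_freq = (tfidf_df > 0).sum(axis=0)   # documents in which each word has a nonzero weight
idfs = dict(np.log(n_docs / doc_freq))  # idf weight per word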
Later you can use that dictionary to calculate the tf-idf weights for the new sentence:
from collections import Counter

# sentence: the new text; idfs: the dictionary built above
words = sentence.split()
s_tfs = Counter(words)                                # term frequencies within the sentence
s_idfs = {word: idfs.get(word, 0) for word in words}  # idf weight of each sentence word
# full-vocabulary tf-idf vector (words absent from the sentence get weight 0)
s_tfidf = {word: s_tfs.get(word, 0) * s_idfs.get(word, 0) for word in idfs}
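If the original documents are available, though, the straightforward path is to fit first and then transform. A minimal sketch (the docs list is a placeholder corpus):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the king and the lord", "romeo loves juliet", "computers compute"]  # placeholder corpus

tfidf = TfidfVectorizer()
tfidf.fit(docs)  # learn vocabulary and idf weights from the original documents

new_sentence = 'this sentence has unseen text such as computer but also king lord juliet'
response = tfidf.transform([new_sentence])  # sparse 1 x |vocabulary| tf-idf row
print(response.toarray())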

Related

TfidfModel "too many values to unpack" error

I am attempting to cluster a group of phrases using TfidfModel from the gensim python package. I am encountering a problem with the input to TfidfModel, specifically: ValueError: too many values to unpack (expected 2)
Here is my code:
import re
from gensim.utils import simple_preprocess
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity
from sklearn.cluster import KMeans

def preprocess_text(text):
    # Remove punctuation and make all characters lowercase
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    # Tokenize the text
    tokens = simple_preprocess(text)
    return tokens

def cluster_phrases(phrases, num_clusters):
    # Preprocess the phrases
    preprocessed_phrases = [preprocess_text(phrase) for phrase in phrases]
    # Create a Tf-Idf model from the preprocessed phrases
    tfidf = TfidfModel(preprocessed_phrases)
    # Compute the similarity matrix between all the phrases
    similarity_matrix = MatrixSimilarity(tfidf[preprocessed_phrases])
    # Cluster the phrases using KMeans
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(similarity_matrix)
    return kmeans.labels_

phrases = ["the dog jumps high", "the dog jumps above", "The duck quacks", "A duck makes a sound", "the dog hops", "the cat is a dog", "the cat is beautiful"]
cluster_labels = cluster_phrases(phrases, num_clusters=3)
for i, label in enumerate(cluster_labels):
    print(f"Phrase '{phrases[i]}' is in Cluster {label}")
I have tried looking up what the correct input should be. Am I supposed to be giving the model 2 lists?
I also tried joining like so:
# Preprocess the phrases
preprocessed_phrases2 = []
preprocessed_phrases = [preprocess_text(phrase) for phrase in phrases]
for x in preprocessed_phrases:
    x = ' '.join(x)
    print(x)
    preprocessed_phrases2.append(x)
# Create a Tf-Idf model from the preprocessed phrases
tfidf = TfidfModel(preprocessed_phrases2)
Question: How can I get TfidfModel to cluster phrases properly?
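For what it's worth, the error comes from TfidfModel expecting a bag-of-words corpus (lists of (token_id, count) tuples), not lists of raw token strings; the usual route is through a gensim Dictionary. A minimal sketch of that conversion (the token lists are toy data):
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

tokenized = [["the", "dog", "jumps"], ["the", "duck", "quacks"]]  # toy token lists
dictionary = Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]  # (token_id, count) pairs

tfidf = TfidfModel(bow_corpus)  # no unpacking error with a bag-of-words corpus
print(tfidf[bow_corpus[0]])     # tf-idf weights for the first phrase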

Gensim for similarities

I have a dataframe in pandas of organisation descriptions and project titles, shown below:
Columns are df['org_name'], df['org_description'], df['proj_title']. I want to add a column with the similarity score between the organisation description and the project title, for each project (each row).
I'm trying to use gensim: https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html. However, I'm not sure how to adapt the tutorial to my use case, because in the tutorial a new query doc = "Human computer interaction" is compared against each document in the corpus individually, and I'm not sure where that choice is made (sims? vec_lsi?).
I want the similarity score for just the two items in a given row of df, not for one item against the whole corpus, and then to append that score to df as a column for each row. How can I do this?
Here is an adaptation of the Gensim LSI tutorial, where the description represents a corpus of sentences and the title is the query made against it.
from gensim.models import LsiModel
from collections import defaultdict
from gensim import corpora

def desc_title_sim(desc, title):
    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())  # add a longer stoplist here
    sents = desc.split('.')  # crude sentence tokenizer
    texts = [
        [word for word in sent.lower().split() if word not in stoplist]
        for sent in sents
    ]
    # remove words that appear only once
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    texts = [
        [token for token in text if frequency[token] > 1]
        for text in texts
    ]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)
    vec_bow = dictionary.doc2bow(title.lower().split())
    vec_lsi = lsi[vec_bow]  # convert the query to LSI space
    return vec_lsi
Apply the function row-wise to get similarity:
df['sim'] = df.apply(lambda row: desc_title_sim(row['org_description'], row['proj_title']), axis=1)
The newly created sim column will be populated with values like
[(0, 0.4618210045327158), (1, 0.07002766527900064)]
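Note that vec_lsi is the title's representation in LSI space, not yet a similarity score. To reduce it to a single number per row, one option (a sketch following the linked tutorial; taking the best-matching sentence is an arbitrary choice) is to query a MatrixSimilarity index built over the description's sentences, i.e. replace the last line of desc_title_sim with:
from gensim.similarities import MatrixSimilarity

index = MatrixSimilarity(lsi[corpus])  # description sentences in LSI space
sims = index[vec_lsi]                  # cosine similarity of the title to each sentence
return float(sims.max())               # score of the best-matching sentence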

Sklearn TfIdfVectorizer remove docs containing all stopwords

I am using sklearn's TfidfVectorizer to vectorize my corpus. In my analysis there are some documents in which all terms are filtered out, because they consist entirely of stopwords. To reduce sparsity, and because it is meaningless to include them in the analysis, I would like to remove such documents.
Looking at the TfidfVectorizer docs, there is no parameter that can be set to do this, so I am thinking of removing these documents manually before passing the corpus to the vectorizer. However, that has a potential issue: the stopword list I have is not the same as the one effectively used by the vectorizer, since I also use the min_df and max_df options to filter out terms.
Is there a better way to achieve what I am looking for (i.e. removing/ignoring documents that contain only stopwords)?
Any help would be greatly appreciated.
You can:
specify your stopwords and then, after TfidfVectorizer,
filter out empty rows
The following code snippet shows a simplified example that should set you in the right direction:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["aa ab", "aa ab ac"]
stop_words = ["aa", "ab"]

tfidf = TfidfVectorizer(stop_words=stop_words)
corpus_tfidf = tfidf.fit_transform(corpus)

# boolean mask of documents whose every term was filtered out
idx = np.array(corpus_tfidf.sum(axis=1) == 0).ravel()
corpus_filtered = corpus_tfidf[~idx]
Feel free to ask questions if you still have any!
So, you can use this:
import re
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    # first tokenize by sentence, then by word, so punctuation becomes its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # keep only tokens containing letters or digits (drops punctuation-only tokens);
    # return a list of tokens, which is what a TfidfVectorizer tokenizer must produce
    return [token for token in tokens if re.search('[a-zA-Z0-9]', token)]

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.01, stop_words='english',
                                   use_idf=True, tokenizer=tokenize)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])

# boolean mask of documents whose every term was filtered out
ids = np.array(tfidf_matrix.sum(axis=1) == 0).ravel()
tfidf_filtered = tfidf_matrix[~ids]
This way you can remove stopwords, empty rows and use min_df and max_df.
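If you also need the original texts to stay aligned with the filtered matrix, the same boolean mask can be applied to the dataframe:
# keep only the rows whose documents survived the filtering
df_filtered = df[~ids]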

Python Tf idf algorithm

I would like to find the most relevant words over a set of documents.
I would like to run a tf-idf algorithm over 3 documents and get back a csv file containing each word and its score.
After that, I will take only the ones with a high number and I will use them.
I found this implementation that does what I need https://github.com/mccurdyc/tf-idf/.
I call that jar using the subprocess library, but there is a huge problem in that code: it makes a lot of mistakes when analyzing words. It merges some words together and has problems with ' and - (I think). I am using it over the text of 3 books (Harry Potter) and, for example, I am obtaining words such as hermiones, hermionell, riddlehermione, thinghermione instead of just hermione in the csv file.
Am I doing something wrong? Can you give me a working implementation of the tf-idf algorithm? Is there a python library that does that?
Here is an implementation of the Tf-idf algorithm using scikit-learn.
Before applying it, you can word_tokenize() and stem your words.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text):
    tokens = word_tokenize(text)
    return [stemmer.stem(item) for item in tokens]

# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
# (use get_feature_names_out() on scikit-learn >= 1.2)
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum the tf-idf scores of each word across documents (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
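Since the goal was a csv file of each word and its score, the resulting Series can be written out directly (the filename is arbitrary):
# one row per word with its summed tf-idf score, highest first
top_words.to_csv("tfidf_scores.csv", header=["tfidf"])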

adding words to stop_words list in TfidfVectorizer in sklearn

I want to add a few more words to the stop_words list in TfidfVectorizer. I followed the solution in Adding words to scikit-learn's CountVectorizer's stop list. My stop word list now contains both the 'english' stop words and the stop words I specified. But TfidfVectorizer still does not accept my list of stop words, and I can still see those words in my feature list. Below is my code:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# my_words: my extra stop words; corpus: my documents (both defined elsewhere;
# the corpus must not be named `text`, which would shadow the module imported above)
my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)
vectorizer = TfidfVectorizer(analyzer='word', max_df=0.95, lowercase=True,
                             stop_words=set(my_stop_words), max_features=15000)
X = vectorizer.fit_transform(corpus)
I have also tried setting stop_words in TfidfVectorizer to stop_words=my_stop_words, but it still does not work. Please help.
This is how you can do it:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

my_stop_words = text.ENGLISH_STOP_WORDS.union(["book"])
vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)
X = vectorizer.fit_transform(["this is a green apple.", "this is a machine learning book."])
idf_values = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

# printing the tfidf vectors
print(X)
# printing the vocabulary
print(vectorizer.vocabulary_)
In this example, I created the tfidf vectors for two sample documents:
"This is a green apple."
"This is a machine learning book."
By default, this, is, and a are all in the ENGLISH_STOP_WORDS list, and I also added book to the stop word list. This is the output:
(0, 1) 0.707106781187
(0, 0) 0.707106781187
(1, 3) 0.707106781187
(1, 2) 0.707106781187
{'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}
As we can see, the word book is removed from the list of features because we listed it as a stop word. So TfidfVectorizer did accept the manually added word as a stop word and ignored it when creating the vectors.
This is answered here: https://stackoverflow.com/a/24386751/732396
Even though sklearn.feature_extraction.text.ENGLISH_STOP_WORDS is a frozenset, you can make a copy of it, add your own words, and then pass that variable to the stop_words argument as a list.
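In other words, something like this (the extra word is a placeholder):
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# copy the frozenset into a list and extend it
my_stop_words = list(text.ENGLISH_STOP_WORDS) + ["myword"]
vectorizer = TfidfVectorizer(stop_words=my_stop_words)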
For use with scikit-learn you can always use a list as well:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop = list(stopwords.words('english'))
stop.extend('myword1 myword2 myword3'.split())

vectorizer = TfidfVectorizer(analyzer='word', stop_words=set(stop))
vectors = vectorizer.fit_transform(corpus)
...
The only downside of this method over a set is that the list may end up containing duplicates, which is why I convert it back to a set when passing it as the stop_words argument to TfidfVectorizer.
