I am attempting to cluster a group of phrases using TfidfModel from the gensim python package. I am running into a problem with the input to TfidfModel, specifically: ValueError: too many values to unpack (expected 2)
Here is my code:
import re
from gensim.utils import simple_preprocess
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity
from sklearn.cluster import KMeans

def preprocess_text(text):
    # Remove punctuation and make all characters lowercase
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    # Tokenize the text
    tokens = simple_preprocess(text)
    return tokens

def cluster_phrases(phrases, num_clusters):
    # Preprocess the phrases
    preprocessed_phrases = [preprocess_text(phrase) for phrase in phrases]
    # Create a Tf-Idf model from the preprocessed phrases
    tfidf = TfidfModel(preprocessed_phrases)
    # Compute the similarity matrix between all the phrases
    similarity_matrix = MatrixSimilarity(tfidf[preprocessed_phrases])
    # Cluster the phrases using KMeans
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(similarity_matrix)
    return kmeans.labels_

phrases = ["the dog jumps high", "the dog jumps above", "The duck quacks", "A duck makes a sound", "the dog hops", "the cat is a dog", "the cat is beautiful"]
cluster_labels = cluster_phrases(phrases, num_clusters=3)

for i, label in enumerate(cluster_labels):
    print(f"Phrase '{phrases[i]}' is in Cluster {label}")
I have tried looking up what the correct input should be. Am I supposed to be giving the model two lists?
I also tried joining the tokens back into strings, like so:
# Preprocess the phrases
preprocessed_phrases2 = []
preprocessed_phrases = [preprocess_text(phrase) for phrase in phrases]
for x in preprocessed_phrases:
    x = ' '.join(x)
    print(x)
    preprocessed_phrases2.append(x)

# Create a Tf-Idf model from the preprocessed phrases
tfidf = TfidfModel(preprocessed_phrases2)
Question: How can I get TfidfModel to cluster phrases properly?
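For reference, here is a minimal sketch of the pipeline TfidfModel seems to expect (a bag-of-words corpus built with corpora.Dictionary rather than raw token lists). It reuses preprocess_text and phrases from the code above; dct and bow_corpus are made-up names.

from gensim import corpora
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity
from sklearn.cluster import KMeans

# Token lists produced by preprocess_text()
preprocessed_phrases = [preprocess_text(phrase) for phrase in phrases]

# Map each token to an integer id, then convert every phrase to a
# bag-of-words list of (token_id, count) pairs
dct = corpora.Dictionary(preprocessed_phrases)
bow_corpus = [dct.doc2bow(tokens) for tokens in preprocessed_phrases]

# TfidfModel takes the (token_id, count) corpus, not raw token lists
tfidf = TfidfModel(bow_corpus)
similarity_matrix = MatrixSimilarity(tfidf[bow_corpus], num_features=len(dct))

# Dense phrase-by-phrase similarity array that KMeans can consume
sims = similarity_matrix[tfidf[bow_corpus]]
kmeans = KMeans(n_clusters=3, n_init=10).fit(sims)
print(kmeans.labels_)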
Related
I'm trying to work with RegEx to make a dictionary with keys that are the bigrams from a text file, and whose values are the number of occurrences of those bigrams in the text.
I've got this code that gets me the bigrams. It's not perfect, because the bigrams should come out like "hello, world", "world, full", "full, of", "of, wonderful", "wonderful, things", but in my printout the bigrams are ordered differently, so I'm not sure I did it right.
I am not sure how to take these bigrams and build a dictionary whose keys are the bigrams and whose values are the number of times each bigram occurs in the original text file. Any help greatly appreciated.
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

with open('/Users/adamstark/PycharmProjects/Computational_Methods_Course/Assignments/Final/Jules_Verne_From_the_Earth_to_Moon.txt') as file:
    txt1 = file.readlines()

# Getting bigrams (remove_string_special_characters is defined elsewhere in my script)
txt1 = [remove_string_special_characters(s) for s in txt1]

vectorizer = CountVectorizer(ngram_range = (2,2))
X1 = vectorizer.fit_transform(txt1)
features = (vectorizer.get_feature_names())
print("\n\nFeatures : \n", features)
print("\n\nX1 : \n", X1.toarray())
I have a dataframe in pandas of organisation descriptions and project titles, shown below:
Columns are df['org_name'], df['org_description'], df['proj_title']. I want to add a column with the similarity score between the organisation description and the project title, for each project (each row).
I'm trying to use gensim: https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html. However, I'm not sure how to adapt the tutorial to my use case, because in the tutorial a new query doc = "Human computer interaction" is compared against each of the documents in the corpus individually. I'm not sure where that choice is made (sims? vec_lsi?).
But I want the similarity score for just the two items in a given row of dataframe df, not one of them against the whole corpus, for each row, and then append that to df as a column. How can I do this?
Here is an adaptation of the Gensim LSI tutorial, where the description represents a corpus of sentences and the title is the query made against it.
from gensim.models import LsiModel
from collections import defaultdict
from gensim import corpora

def desc_title_sim(desc, title):
    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())  # add a longer stoplist here
    sents = desc.split('.')  # crude sentence tokenizer
    texts = [
        [word for word in sent.lower().split() if word not in stoplist]
        for sent in sents
    ]

    # remove words that appear only once
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    texts = [
        [token for token in text if frequency[token] > 1]
        for text in texts
    ]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)

    vec_bow = dictionary.doc2bow(title.lower().split())
    vec_lsi = lsi[vec_bow]  # convert the query to LSI space
    return vec_lsi
Apply the function row-wise to get similarity:
df['sim'] = df.apply(lambda row: desc_title_sim(row['org_description'], row['proj_title']), axis=1)
The newly created sim column will be populated with values like
[(0, 0.4618210045327158), (1, 0.07002766527900064)]
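If a single similarity number per row is preferred over the raw LSI vector, one possible extension is to index the description's sentences and keep, for example, the best-matching one. This is only a sketch; taking the max is just one aggregation choice:

from gensim import similarities

# Inside desc_title_sim(), after vec_lsi is computed:
index = similarities.MatrixSimilarity(lsi[corpus])  # index the description's sentences in LSI space
sims = index[vec_lsi]                                # cosine similarity of the title against each sentence
return float(sims.max())                             # e.g. keep the best-matching sentence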
I would like to find the most relevant words over a set of documents.
I would like to call a Tf Idf algorithm over 3 documents and return a csv file containing each word and its frequency.
After that, I will take only the ones with a high number and I will use them.
I found this implementation that does what I need https://github.com/mccurdyc/tf-idf/.
I call that jar using the subprocess library. But there is a huge problem in that code: it makes a lot of mistakes when analyzing words. It mixes some words together and has problems with ' and - (I think). I am using it over the text of 3 books (Harry Potter) and, for example, I am getting words such as hermiones, hermionell, riddlehermione, thinghermione instead of just hermione in the csv file.
Am I doing something wrong? Can you give me a working implementation of the Tf-idf algorithm? Is there a Python library that does that?
Here is an implementation of the Tf-idf algorithm using scikit-learn.
Before applying it, you can word_tokenize() and stem your words.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

def tokenize(text):
    tokens = word_tokenize(text)
    stems = [PorterStemmer().stem(item) for item in tokens]
    return stems

# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]

# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()

# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())

# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
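Since the question asks for a csv file of words and their scores, the ranked result could then be written out, for example (a small usage sketch; "top_words.csv" is a made-up filename):

# Save the ranked words and their summed tf-idf scores to a csv file
top_words.to_csv("top_words.csv", header=["tfidf_sum"])
print(top_words.head(10))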
How can I convert the following pandas dataframe, with the tf-idf score of each word in several documents, into a matrix named "tfidf", so that I can then run, for instance,
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
str = 'this sentence has unseen text such as computer but also king lord juliet'
response = tfidf.transform([str])
You need to fit the TfidfVectorizer using the original raw documents before being able to use it to transform a new document.
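A minimal sketch of that fit-then-transform flow, assuming the original raw documents are available in a list (docs is a made-up name here):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first original document", "second original document"]  # the raw corpus the scores came from

tfidf = TfidfVectorizer()
tfidf.fit(docs)  # learns the vocabulary and the idf weights

new_sentence = 'this sentence has unseen text such as computer but also king lord juliet'
response = tfidf.transform([new_sentence])  # transform() now works for unseen text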
If you cannot access the original documents you can always recover the idf weights of each word by constructing a dictionary:
idfs[word] = log{(# documents) / (# documents where word has non-zero tf-idf weight)}
Later you can use that dictionary to calculate the tf-idf weights for the new sentence:
from collections import Counter

# term frequencies of the new sentence
words = sentence.split()
s_tfs = Counter(words)

# idf weights looked up from the reconstructed idfs dictionary (0 for unseen words)
s_idfs = {word: idfs.get(word, 0) for word in words}

# tf-idf weight for every word in the vocabulary
s_tfidf = {word: s_tfs.get(word, 0) * s_idfs.get(word, 0) for word in idfs.keys()}
I want to get the 50 most common words from a corpus and then check whether these words are present in each sentence. I want to iterate through all sentences and print a vector (1 if the word is in the sentence and 0 if not). I wrote this code, but it is showing only 0 (False). Any ideas?
import nltk
from nltk import FreqDist
from nltk.corpus import brown

news = brown.words(categories='news')
news_sents = brown.sents(categories='news')

fdist = FreqDist(w.lower() for w in news)
word_features = list(fdist.values())[:50]

num_sents = len(news.sents(fileid))

for i in range(num_sents):
    features = {}
    for word in word_features:
        features[word] = int(word in news_sents[i])
    print(features)
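A possible fix, as a sketch: fdist.values() yields the counts rather than the words, so the feature list ends up holding integers that never match any sentence token, which would explain the all-zero output. Using most_common() instead, and iterating over news_sents directly (names reuse the ones from the question):

from nltk import FreqDist
from nltk.corpus import brown

news = brown.words(categories='news')
news_sents = brown.sents(categories='news')

fdist = FreqDist(w.lower() for w in news)

# most_common() returns (word, count) pairs; keep only the words
word_features = [word for word, count in fdist.most_common(50)]

for sent in news_sents[:10]:  # first few sentences, for brevity
    sent_lower = set(w.lower() for w in sent)
    features = {word: int(word in sent_lower) for word in word_features}
    print(features)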