Gensim for similarities - python

I have a pandas dataframe of organisation descriptions and project titles. The columns are df['org_name'], df['org_description'], and df['proj_title']. I want to add a column with the similarity score between the organisation description and the project title for each project (each row).
I'm trying to use gensim: https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html. However, I'm not sure how to adapt the tutorial to my use case, because in the tutorial a new query doc = "Human computer interaction" is compared against each of the documents in the corpus individually, and I'm not sure where that choice is made (sims? vec_lsi?).
But I want the similarity score for just the two items in a given row of df (the description against the title, not one of them against the whole corpus), computed for every row and then appended to df as a column. How can I do this?

Here is an adaptation of the Gensim LSI tutorial, where the description represents a corpus of sentences and the title is the query made against it.
from gensim.models import LsiModel
from collections import defaultdict
from gensim import corpora
def desc_title_sim(desc, title):
    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())  # add a longer stoplist here
    sents = desc.split('.')  # crude sentence tokenizer
    texts = [
        [word for word in sent.lower().split() if word not in stoplist]
        for sent in sents
    ]
    # remove words that appear only once
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    texts = [
        [token for token in text if frequency[token] > 1]
        for text in texts
    ]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)
    vec_bow = dictionary.doc2bow(title.lower().split())
    vec_lsi = lsi[vec_bow]  # convert the query (the title) to LSI space
    return vec_lsi
Apply the function row-wise:
df['sim'] = df.apply(lambda row: desc_title_sim(row['org_description'], row['proj_title']), axis=1)
The newly created sim column will then hold the title's LSI-space representation, i.e. (topic, weight) pairs such as
[(0, 0.4618210045327158), (1, 0.07002766527900064)]
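To collapse this representation into a single similarity number per row, the linked tutorial's next step is to build a similarity index over the corpus and query it with vec_lsi. Below is a minimal sketch of that extra step, written here as a separate helper that assumes access to the lsi model, corpus and vec_lsi built inside desc_title_sim (in practice you would append these lines to the end of that function and return the score instead of vec_lsi):
from gensim import similarities

def lsi_similarity_score(lsi, corpus, vec_lsi):
    # index the description's sentences in LSI space
    index = similarities.MatrixSimilarity(lsi[corpus])
    # cosine similarity of the title against every sentence of the description
    sims = index[vec_lsi]
    # reduce to a single number per row; taking the best-matching sentence is one possible choice
    return float(sims.max())
Using sims.mean() instead of sims.max() would weight every sentence of the description equally; which reduction fits best depends on how the descriptions are written.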

Related

Retain original document element index of argument passed through sklearn's CountVectorizer() in order to access corresponding part of speech tag

I have a data frame with sentences and the respective part-of-speech tag for each word (the data is taken from the SNLI corpus; an extract of what I'm working with is shown below). For each sentence in my collection I would like to extract the unigrams and the corresponding POS tag of each word.
For instance, if I have the following:
vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words = 'english')
doc = {'sent' : ['Two women are embracing while holding to go packages .'], 'tags' : ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}
sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()
Then I would get the following unigrams output:
array(['embracing', 'holding', 'packages', 'women'], dtype=object)
But I don't know how to retain the part-of-speech tags after this. I tried a lookup based on the unigrams, but since they may differ from the tokens in the sentence (e.g. from sentence.split(' ')), you don't necessarily get the same tokens back. Any suggestions on how I can extract the unigrams and retain the corresponding part-of-speech tags?
After reviewing the source code for sklearn's CountVectorizer class, particularly the fit function, I don't believe the class has any way of tracking the original document element indexes relative to the extracted unigram features, since the unigram features do not necessarily contain the same tokens as the original document. Other than the simple solution below, you might have to rely on another method/library to achieve your desired result. If there is a particular case that fails, I'd suggest adding it to your question, as it might help people come up with solutions to your problem.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words = 'english')
doc = {'sent': ['Two women are embracing while holding to go packages .'],
       'tags': ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}
sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()
sent_token_list = doc['sent'][0].split()
tags_token_list = doc['tags'][0].split()
sentence_tags = []
for unigram in sentence_unigrams:
    for i in range(len(sent_token_list)):
        # compare case-insensitively, since CountVectorizer lowercases its features by default
        if sent_token_list[i].lower() == unigram:
            sentence_tags.append(tags_token_list[i])
print(sentence_unigrams)
# Output: ['embracing' 'holding' 'packages' 'women']
print(sentence_tags)
# Output: ['VERB', 'VERB', 'NOUN', 'NOUN']
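As a small variant of the same lookup (an addition, not part of the original answer): building a lowercased token-to-tag map once makes each unigram lookup constant-time and tolerant of capitalisation differences, though it still assumes the vectorizer's tokens appear verbatim in the whitespace-split sentence.
# map each (lowercased) sentence token to its tag, then look the unigrams up
token_to_tag = {tok.lower(): tag for tok, tag in zip(sent_token_list, tags_token_list)}
sentence_tags = [token_to_tag[u] for u in sentence_unigrams if u in token_to_tag]
print(sentence_tags)
# Expected output for this example: ['VERB', 'VERB', 'NOUN', 'NOUN']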

How to get Bigram/Trigram of word from prelisted unigram from a document corpus / dataframe column

I have a dataframe with text in one of its columns.
I have listed some predefined keywords which I need for the analysis, and I want to find the words associated with them (and later make a word cloud and a counter of occurrences) to understand the topics/context associated with these keywords.
Use case:
df.text_column()
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']
Let's say one of the rows of the text column has the text: 'coca cola is expanding its business in soft drinks and aerated water'.
Another entry is: 'lime soda is the best selling item in fast food stores'.
My objective is to get bigrams/trigrams like:
'coca_cola', 'coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'
Kindly help me do that [Python only].
First, you can optionally load NLTK's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop word list, or even extend NLTK's list with additional words. Then you can use the nltk.bigrams() and nltk.trigrams() methods to get the bigrams and trigrams joined with an underscore _, as you asked. Also, have a look at Collocations.
Edit:
If you haven't already, you need to include the following once in your code, in order to download the stop words list.
nltk.download('stopwords')
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"
# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])
# tokenise the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if not w.lower() in stop_words]
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)
# get trigrams
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)
Update
Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones.
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']
def find_matches(n_grams_list):
    matches = []
    for k in keywordlist:
        matching_list = [s for s in n_grams_list if k in s]
        for m in matching_list:
            if m not in matches:
                matches.append(m)
    return matches
all_matching_bigrams = find_matches(bigrams_list)    # find all matching bigrams
all_matching_trigrams = find_matches(trigrams_list)  # find all matching trigrams
# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)
Output:
['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']
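Since the question also mentions building a counter of occurrences (and later a word cloud), here is a short follow-up sketch, not part of the original answer, that counts the matched n-grams with collections.Counter:
from collections import Counter

# count how often each matching n-gram occurs, e.g. as input to a word cloud
match_counts = Counter(all_matches)
print(match_counts.most_common(10))
For a single sentence, as in this example, every match appears exactly once; the counter becomes useful once the matches from all rows of the text column are accumulated into one list.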

Get the most important words in the corpus using tf-idf (Gensim)

I am calculating tf-idf as follows.
from collections import namedtuple
from gensim import corpora, models

texts = ['human interface computer',
         'survey user computer system response time',
         'eps user interface system',
         'system human system eps',
         'user response time']
texts = [text.split() for text in texts]  # corpora.Dictionary expects tokenised documents
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
analyzedDocument = namedtuple('AnalyzedDocument', 'word tfidf_score')
d=[]
for doc in corpus_tfidf:
    for id, value in doc:
        word = dictionary.get(id)
        score = value
        d.append(analyzedDocument(word, score))
However, I now want to identify the 3 most important words in my corpus, i.e. the words with the highest idf values. How can I do that?
Assuming you are building your list d correctly, you can sort it by score and take the top three entries. At the top:
from operator import itemgetter
Then at the bottom:
e = sorted(d, key=itemgetter(1), reverse=True)  # sort by score, highest first
top3 = e[:3]
print(top3)
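If it is really the raw idf values (rather than the per-document tf-idf scores collected in d) that should rank the words, a hedged alternative: gensim's TfidfModel keeps the per-term idf values in its idfs attribute, a dict mapping token id to idf, which also avoids the same word showing up once per document.
# take the 3 terms with the highest idf and map their ids back to words
top3_idf = sorted(tfidf.idfs.items(), key=lambda kv: kv[1], reverse=True)[:3]
print([(dictionary[token_id], idf) for token_id, idf in top3_idf])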

How do you speed up calculating bigrams/trigrams on a large number (~ 1 million) documents in mongodb?

I have about a million documents in mongodb with large text fields, and I want to extract the most meaningful terms. My current approach is to calculate the bigrams for each week in Python, using logic similar to the code below. The problem is that this logic is slow. Is there a faster way to do it?
from collections import Counter
from nltk import ngrams
from nltk.tokenize import sent_tokenize

# strip_tags, remove_brackets and remove_punctuation are the user's own helpers;
# weeks is a list of week boundaries and collection is the mongodb collection
for week_start, week_end in zip(weeks[:-1], weeks[1:]):
    all_top_words = Counter()
    for post in collection.find({'date': {'$lt': week_end, '$gte': week_start}, 'text': {"$exists": True}}):
        text = strip_tags(post['text'])
        text = remove_brackets(text)
        sentences = sent_tokenize(text)
        for sentence in sentences:
            sentence = sentence.lower()
            sentence = remove_punctuation(sentence)
            top_words = Counter(ngrams(sentence.split(" "), 2))
            all_top_words += top_words

transform tf idf pandas dataframe into a tf idf matrix

How can I convert the following pandas dataframe, containing the tf-idf score of each word in several documents, into a matrix named "tfidf", so that I can run, for instance,
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
str = 'this sentence has unseen text such as computer but also king lord juliet'
response = tfidf.transform([str])
You need to fit the TfidfVectorizer using the original raw documents before being able to use it to transform a new document.
If you cannot access the original documents, you can always recover the idf weight of each word by constructing a dictionary:
idfs[word] = log((# documents) / (# documents where word has a non-zero tf-idf weight))
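As an illustration of that formula, here is a sketch that assumes (since the frame itself is not shown) a hypothetical dataframe df_tfidf with one row per document and one column per word, holding the tf-idf weights:
import numpy as np

n_docs = len(df_tfidf)                        # total number of documents (rows), assumed layout
doc_freq = (df_tfidf > 0).sum(axis=0)         # documents where each word has a non-zero weight
idfs = np.log(n_docs / doc_freq).to_dict()    # word -> idf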
Later you can use that dictionary to calculate the tf-idf weights for the new sentence:
from collections import Counter
words = sentence.split()
s_tfs = Counter(words)
s_idfs = {word: idfs.get(word, 0) for word in words}
s_tfidf = {word: s_tfs.get(word, 0) * s_idfs.get(word, 0) for word in idfs.keys()}
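For completeness, the first option (fitting on the original raw documents before transforming) might look like the sketch below, where raw_docs is a hypothetical placeholder for the real corpus:
from sklearn.feature_extraction.text import TfidfVectorizer

raw_docs = ["first original document", "second original document"]  # placeholder for your real corpus
tfidf = TfidfVectorizer()
tfidf.fit(raw_docs)  # fit on the original documents first
response = tfidf.transform(['this sentence has unseen text such as computer but also king lord juliet'])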
