Extracting Key-Phrases from text based on the Topic with Python

I have a large dataset with 3 columns: text, phrase and topic.
I want to find a way to extract key-phrases (the phrase column) based on the topic.
A key-phrase can be part of the text value or the whole text value.
import pandas as pd

text = ["great game with a lot of amazing goals from both teams",
        "goalkeepers from both teams made misteke",
        "he won all four grand slam championchips",
        "the best player from three-point line",
        "Novak Djokovic is the best player of all time",
        "amazing slam dunks from the best players",
        "he deserved yellow-card for this foul",
        "free throw points"]
phrase = ["goals", "goalkeepers", "grand slam championchips", "three-point line", "Novak Djokovic", "slam dunks", "yellow-card", "free throw points"]
topic = ["football", "football", "tennis", "basketball", "tennis", "basketball", "football", "basketball"]
df = pd.DataFrame({"text": text,
                   "phrase": phrase,
                   "topic": topic})
print(df.text)
print(df.phrase)
I'm having a lot of trouble finding a way to do something like this, because I have more than 50,000 rows in my dataset, around 48,000 unique phrase values, and 3 different topics.
I guess that building a dataset with all football, basketball and tennis topics is not really the best solution. So I was thinking about building some kind of ML model for this, but that would mean I have 2 features (text and topic) and one result (phrase), with more than 48,000 different classes in the result, and that is not a good approach.
I was thinking about using the text column as a feature and applying a classification model in order to find sentiment. After that I could use the predicted sentiment to extract key features, but I do not know how to extract them.
One more problem is that I only get 66% accuracy when I try to classify sentiment using CountVectorizer or TfidfTransformer with Random Forest, Decision Tree, or any other classification algorithm, and also 66% accuracy when using TextBlob for sentiment analysis.
Any help?

It looks like a good approach here would be to use a Latent Dirichlet Allocation (LDA) model, which is an example of what are known as topic models.
LDA is an unsupervised model that finds similar groups among a set of observations, which you can then use to assign a topic to each of them. Here I'll go through what could be one approach to solving this by training a model using the sentences in the text column. If the phrases are representative enough and contain the necessary information to be captured by the model, they could also be a good (possibly better) candidate for training the model, though that you'll be better able to judge yourself.
Before you train the model, you need to apply some preprocessing steps, including tokenizing the sentences, removing stopwords, lemmatizing and stemming. For that you can use nltk:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np  # used later to sort the topic-word scores
import lda
from sklearn.feature_extraction.text import CountVectorizer

ignore = set(stopwords.words('english'))
stemmer = WordNetLemmatizer()

text = []
for sentence in df.text:
    words = word_tokenize(sentence)
    stemmed = []
    for word in words:
        if word not in ignore:
            stemmed.append(stemmer.lemmatize(word))
    text.append(' '.join(stemmed))
Now we have a more appropriate corpus to train the model:
print(text)
['great game lot amazing goal team',
'goalkeeper team made misteke',
'four grand slam championchips',
'best player three-point line',
'Novak Djokovic best player time',
'amazing slam dunk best player',
'deserved yellow-card foul',
'free throw point']
We can then convert the text to a matrix of token counts through CountVectorizer, which is the input LDA will be expecting:
vec = CountVectorizer(analyzer='word', ngram_range=(1,1))
X = vec.fit_transform(text)
Note that you can use the ngram_range parameter to specify the n-gram range you want to consider to train the model. By setting ngram_range=(1,2), for instance, you'd end up with features containing all individual words as well as the 2-grams in each sentence. Here's an example having trained CountVectorizer with ngram_range=(1,2):
vec.get_feature_names()
['amazing',
'amazing goal',
'amazing slam',
'best',
'best player',
....
The advantage of using n-grams is that you could then also find Key-Phrases other than just single words.
Then we can train the LDA with whatever number of topics you want. In this case I'll just be selecting 3 topics (note that this has nothing to do with the topic column), which you can consider to be the Key-Phrases - or words in this case - that you mention. Here I'll be using the lda package, though there are several options such as gensim.
Each topic will have associated with it a set of words from the vocabulary it has been trained with, each word having a score measuring its relevance in the topic.
model = lda.LDA(n_topics=3, random_state=1)
model.fit(X)
Through topic_word_ we can now obtain the scores associated with each topic. We can use argsort to sort the vector of scores, and use it to index the vector of feature names, which we can obtain with vec.get_feature_names():
topic_word = model.topic_word_
vocab = vec.get_feature_names()
n_top_words = 3

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: best player point
Topic 1: amazing team slam
Topic 2: yellow novak card
The printed results don't really represent much in this case, since the model has been trained with the small sample from the question; however, you should see clearer and more meaningful topics by training with your entire corpus.
Also note that for this example I've used the whole vocabulary to train the model. However, it seems that in your case it would make more sense to split the text column into groups according to the different topics you already have, and train a separate model on each group. But hopefully this gives you a good idea of how to proceed.
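As a rough sketch of that last idea (not part of the approach above; it reuses the df, lda and CountVectorizer names already imported, the preprocessing is omitted for brevity, and n_topics=2 is just a toy value), you could group the dataframe by the existing topic column and fit one model per group:
import numpy as np
n_top_words = 3
for topic_name, group in df.groupby('topic'):
    # vectorize only the sentences belonging to this topic group
    vec_group = CountVectorizer(analyzer='word', ngram_range=(1, 2))
    X_group = vec_group.fit_transform(group['text'])
    # fit a small LDA per group (toy number of topics for this sample)
    group_model = lda.LDA(n_topics=2, random_state=1)
    group_model.fit(X_group)
    vocab_group = np.array(vec_group.get_feature_names())
    for i, dist in enumerate(group_model.topic_word_):
        top = vocab_group[np.argsort(dist)][:-(n_top_words + 1):-1]
        print('{} - topic {}: {}'.format(topic_name, i, ' '.join(top)))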

It appears you're looking to group short pieces of text by topic. You will have to tokenize the data in one way or another. There are a variety of encodings that you could consider:
Bag of words, which classifies by counting the frequency of each word in your vocabulary.
TF-IDF, which does the above but makes words that appear in more entries less important.
n-grams / bigrams / trigrams, which essentially do the bag-of-words method but also maintain some context around each word. So you'll have encodings for each word, but you'll also have tokens for "great_game", "game_with" and "great_game_with", etc.
Orthogonal Sparse Bigrams (OSBs), which also create features from words further apart, like "great__with".
Any of these options could be ideal for your dataset (the last two are likely your best bet). If none of these options work, there are a few more options you could try:
First, you could use word embeddings. These are vector representations of each word that, unlike one-hot encoding, intrinsically contain word meaning. You can sum (or average) the words in a sentence together to get a new vector containing the general idea of what the sentence is about, which can then be decoded.
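A minimal sketch of that summing/averaging idea, assuming gensim is available (in practice you would likely load pretrained vectors such as GloVe or fastText rather than train on a toy corpus; parameter names are the gensim 4 ones, gensim 3 uses size/iter):
import numpy as np
from gensim.models import Word2Vec

# toy corpus of tokenized sentences; real use needs far more data
sentences = [["great", "game", "with", "amazing", "goals"],
             ["novak", "djokovic", "is", "the", "best", "player"]]
w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

def sentence_vector(tokens, model):
    # average the vectors of the in-vocabulary tokens
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

v1 = sentence_vector(sentences[0], w2v)
v2 = sentence_vector(sentences[1], w2v)
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # cosine similarity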
You could also use word embeddings alongside a bidirectional LSTM. This is the most computationally intensive option, but if your other options are not working it might be a good choice. biLSTMs try to interpret sentences by looking at the context around words, to understand what a word might mean in that context.
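A very small sketch of what such a biLSTM classifier could look like in Keras (all sizes here are assumptions, and the 3-unit output layer assumes three topic classes as in the question):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),   # assumed vocabulary size
    Bidirectional(LSTM(64)),                     # reads the sequence in both directions
    Dense(3, activation='softmax'),              # one unit per topic class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(padded_token_sequences, topic_labels, epochs=5, validation_split=0.1)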
Hope this helps

I think what you're looking for is called "topic modeling" in NLP.
You should try using LDA for topic modeling. It's one of the easiest methods to apply.
Also, as #Mike mentioned, there are many approaches to converting words to vectors. You should first try simple approaches like CountVectorizer and then gradually move to something like word2vec or GloVe.
I am attaching some links for applying LDA to the corpus.
1. https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
2. https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/

Related

Emotional score of sentences using Spacy

I have a series of 100,000+ sentences and I want to rank how emotional they are.
I am quite new to the NLP world, but this is how I managed to get started (adapted from spaCy 101):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # a model needs to be loaded so nlp is defined
matcher = Matcher(nlp.vocab)

def set_sentiment(matcher, doc, i, matches):
    doc.sentiment += 0.1

myemotionalwordlist = ['you', 'superb', 'great', 'free']

sentence0 = 'You are a superb great free person'
sentence1 = 'You are a great person'
sentence2 = 'Rocks are made o minerals'
sentences = [sentence0, sentence1, sentence2]

pattern2 = [[{"ORTH": emotionalword, "OP": "+"}] for emotionalword in myemotionalwordlist]
matcher.add("Emotional", set_sentiment, *pattern2)  # match one or more emotional words (spaCy v2-style API)

for sentence in sentences:
    doc = nlp(sentence)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
    print("Sentiment", doc.sentiment)
myemotionalwordlist is a list of about 200 words that I've built manually.
My questions are:
(1-a) Counting the number of emotional words does not seem like the best approach. Does anyone have any suggestions for a better way of doing this?
(1-b) In case this approach is good enough, any suggestions on how I can extract emotional words from wordnet?
(2) What's the best way of scaling this up? I am thinking about adding all the sentences to a pandas data frame and then applying the match function to each one of them.
Thanks in advance!
There are going to be two main approaches:
the one you have started: keeping a list of emotional words and counting how often they appear
showing a machine learning model examples of what you consider emotional sentences and what you consider unemotional sentences, and letting it work it out.
The first way will get better as you give it more words, but you will eventually hit a limit. (Simply due to the ambiguity and flexibility of human language, e.g. while "you" is more emotive than "it", there are going to be a lot of unemotional sentences that use "you".)
any suggestions on how I can extract emotional words from wordnet?
Take a look at SentiWordNet, which adds a measure of positivity, negativity or neutrality to each WordNet entry. For "emotional" you could extract just those entries that have either a pos or neg score over, e.g., 0.5. (Watch out for the non-commercial-only licence.)
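As a rough sketch of that (assuming nltk with the wordnet and sentiwordnet corpora downloaded), you could keep words whose most common sense scores above a threshold on either polarity:
import nltk
nltk.download('wordnet')
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

def is_emotional(word, threshold=0.5):
    senses = list(swn.senti_synsets(word))
    if not senses:
        return False
    first = senses[0]  # crude: only check the most common sense
    return first.pos_score() >= threshold or first.neg_score() >= threshold

candidates = ['superb', 'great', 'terrible', 'table', 'free']
print([w for w in candidates if is_emotional(w)])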
The second approach will probably work better if you can feed it enough training data, but "enough" can sometimes be too much. Other downsides are that the models often need much more compute power and memory (a serious issue if you need to be offline or working on a mobile device), and that they are a black box.
I think the 2020 approach would be to start with a pre-trained BERT model (the bigger the better, see the recent GPT-3 paper), and then fine-tune it with a sample of your 100K sentences that you've manually annotated. Evaluate it on another sample, and annotate more training data for the ones it got wrong. Keep doing this until you get the desired level of accuracy.
(Spacy has support for both approaches, by the way. What I called fine-tuning above is also called transfer learning. See https://spacy.io/usage/training#transfer-learning Also googling for "spacy sentiment analysis" will find quite a few tutorials.)

How to embed user names in word2vec model in gensim

I have some volunteer essay writings in the format of:
volunteer_names, essay
["emi", "jenne", "john"], [["lets", "protect", "nature"], ["what", "is", "nature"], ["nature", "humans", "earth"]]
["jenne", "li"], [["lets", "manage", "waste"]]
["emi", "li", "jim"], [["python", "is", "cool"]]
...
...
...
I want to identify similar users based on their essay writing. I feel like word2vec is more suitable for problems like this. However, since I want to embed the user names in the model too, I am not sure how to do it. The examples I found on the internet only use the words (see example code).
import gensim
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
In that case, I am wondering if there is a special way of doing this in word2vec, or if I can simply treat the user names as just more words to input to the model. Please let me know your thoughts on this.
I am happy to provide more details if needed.
Word2vec infers the word representation from surrounding words: words that often appear in similar company end up with similar vectors. Usually, a window of 5 words is considered. So, if you want to hack word2vec, you would need to make sure that the volunteer names appear frequently enough (perhaps at the beginning and at the end of a sentence, or something like that).
Alternatively, you can have a look at Doc2vec. During training, each document gets an ID and learns an embedding for that ID; these embeddings live in a lookup table as if they were word embeddings. If you use the volunteer names as document IDs (gensim lets several documents share the same tag), you would get one embedding per volunteer across all of their essays.
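A minimal sketch of that idea with gensim's Doc2Vec (parameter names below are the gensim 4 ones; gensim 3 uses model.docvecs instead of model.dv):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

data = [
    (["emi", "jenne", "john"], [["lets", "protect", "nature"],
                                ["what", "is", "nature"],
                                ["nature", "humans", "earth"]]),
    (["jenne", "li"], [["lets", "manage", "waste"]]),
    (["emi", "li", "jim"], [["python", "is", "cool"]]),
]

# one TaggedDocument per sentence, tagged with every author of that essay
documents = [TaggedDocument(words=sentence, tags=names)
             for names, sentences in data
             for sentence in sentences]

model = Doc2Vec(documents, vector_size=20, min_count=1, epochs=100)

print(model.dv["emi"])               # embedding for one volunteer
print(model.dv.most_similar("emi"))  # volunteers with similar essays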

Word2Vec Vocab Similarities

I ran a word2vec algorithm on text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all super close to 1. The tenth-closest score is still around .998, so I feel like I'm not getting any significant differences between the similarities of words, which leads to meaningless similar words.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.stopwords('English')]]
...where v is the result of calling read() on a txt file.
Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag-of-words model instead of the skip-gram model. CBOW works better on smaller datasets; the skip-gram model was trained in the official paper on over 1 billion words.
Use min_count=5, which is what they used in the paper, and they had 1 billion words. I think 30 is way too high for your data.
Don't remove the stop words as it will change the neighboring words in the moving window.
Use more iterations like iter=10 for example.
Use gensim.utils.simple_preprocess instead of word_tokenize as the punctuation isn't helpful in this case.
Also, I recommend splitting your dataset into paragraphs instead of sentences, but I don't know whether this is applicable to your dataset or not.
When following these steps, your code should be:
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)

How to Compare Sentences with an idea of the positions of keywords?

I want to compare two sentences. As an example,
sentence1="football is good,cricket is bad"
sentence2="cricket is good,football is bad"
Generally these sentences have no relationship, i.e. they have different meanings. But when I compare them with Python nltk tools, I get 100% similarity. How can I fix this issue? I need help.
Yes, wup_similarity internally uses synsets for single tokens to calculate similarity.
Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
Since the ancestor nodes for cricket and football would be the same, wup_similarity will return 1.
If you want to fix this issue, using wup_similarity is not a good choice.
The simplest token-based way would be to fit a vectorizer and then calculate the similarity.
E.g.:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
corpus = ["football is good,cricket is bad", "cricket is good,football is bad"]
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(corpus)
x1 = vectorizer.transform(["football is good,cricket is bad"])
x2 = vectorizer.transform(["cricket is good,football is bad"])
cosine_similarity(x1, x2)
There are more intelligent methods to measure semantic similarity, though. One that can be tried easily is Google's Universal Sentence Encoder (USE).
See this link
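A hedged sketch of the USE route (this assumes tensorflow and tensorflow_hub are installed; the hub URL below is the commonly used v4 module):
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = embed(["football is good,cricket is bad",
                 "cricket is good,football is bad"]).numpy()
print(cosine_similarity([vectors[0]], [vectors[1]]))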
Semantic similarity is a bit tricky this way, since even if you use context counts (which would be n-grams > 5) you cannot cope well enough with antonyms (e.g. black and white). Before using different methods, you could try using a shallow parser or dependency parser to extract subject-verb or subject-verb-object relations, which you can use as dimensions. If this does not give you the expected similarity (or values adequate for your application), use word embeddings trained on really large data.

Stipulation of "Good"/"Bad"-Cases in an LDA Model (Using gensim in Python)

I am trying to analyze news snippets in order to identify crisis periods.
To do so, I have already downloaded news articles over the past 7 years and have those available.
Now, I am applying an LDA (Latent Dirichlet Allocation) model to this dataset in order to identify which countries show signs of an economic crisis.
I am basing my code on a blog post by Jordan Barber (https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html) – here is my code so far:
import os, csv

#create list with text blocks in rows, based on csv file
list=[]
with open('Testfile.csv', 'r') as csvfile:
    emails = csv.reader(csvfile)
    for row in emails:
        list.append(row)

#create doc_set
doc_set=[]
for row in list:
    doc_set.append(row[0])

#import plugins - need to install gensim and stop_words manually for fresh python install
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)
print(ldamodel.print_topics(num_topics=5, num_words=5))

# map topics to documents
doc_lda = ldamodel[corpus]
with open('doc_lda.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    for row in doc_lda:
        writer.writerow(row)
Essentially, I identify a number of topics (5 in the code above, to be checked), and using the last lines I assign each news article a score indicating the probability that the article is related to each of these topics.
Now, I can only manually make a qualitative assessment of whether a given topic is related to a crisis, which is a bit unfortunate.
What I would much rather do is tell the algorithm whether an article was published during a crisis and use this additional piece of information to identify topics both for my "crisis years" and for my "non-crisis years". Simply splitting my dataset to just consider topics for my "bads" (i.e. crisis years only) won't work in my opinion, as I would still need to manually select which topics are actually related to a crisis, and which topics would show up anyway (sports news, ...).
So, is there a way to adapt the code to a) incorporate the information of "crisis" vs "non-crisis" and b) automatically choose the optimal number of topics/words to optimize the predictive power of the model?
Thanks a lot in advance!
First some suggestions on your specific questions:
a) incorporate the information of “crisis” vs “non-crisis”
To do this with a standard LDA model, I'd probably go for mutual information between the document-topic proportions and whether documents are in a crisis or non-crisis period.
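A rough sketch of that idea (not from the answer's code; it assumes you have already turned doc_lda into a per-document topic-proportion matrix and have a binary crisis label per article - the values below are toy data):
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# rows are documents, columns are topic proportions (toy values)
doc_topics = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.2, 0.1, 0.7],
                       [0.6, 0.3, 0.1],
                       [0.1, 0.7, 0.2],
                       [0.3, 0.1, 0.6]])
crisis_labels = np.array([1, 0, 1, 1, 0, 1])  # 1 = published during a crisis

mi = mutual_info_classif(doc_topics, crisis_labels, discrete_features=False, random_state=0)
for k, score in enumerate(mi):
    print("topic {}: mutual information with crisis label = {:.3f}".format(k, score))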
b) to automatically choose the optimal number of topics / words to optimize the predictive power of the model?
If you want to do this properly, experiment with many settings for the number of topics and try to use the topic models to predict crisis/non-crisis for held-out documents (documents not included in the topic model).
There are many topic model variants that effectively choose the number of topics ("non-parametric" models). It turns out that the Mallet implementation with hyperparameter optimisation effectively does the same, so I'd suggest using that (provide a large number of topics; hyperparameter optimisation will result in many topics with very few assigned words, and those topics are just noise).
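A hedged sketch of the Mallet route via gensim's wrapper (this is the gensim 3.x API - the wrappers module was removed in gensim 4 - and it requires a local Mallet installation; the path below is hypothetical):
from gensim.models.wrappers import LdaMallet

mallet_path = '/path/to/mallet-2.0.8/bin/mallet'  # hypothetical local install
mallet_lda = LdaMallet(mallet_path,
                       corpus=corpus,            # from the code in the question
                       id2word=dictionary,       # from the code in the question
                       num_topics=100,           # deliberately large
                       optimize_interval=10)     # turn on hyperparameter optimisation
print(mallet_lda.show_topics(num_topics=10, num_words=5))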
And some general comments:
There are many topic model variants out there, and in particular a few that incorporate time. These may be a good choice for you (as they'll better resolve topic changes over time than standard LDA - though standard LDA is a good starting point).
One model I particularly like uses Pitman-Yor word priors (better matching Zipf-distributed words than a Dirichlet), accounts for burstiness in topics and provides clues on junk topics: https://github.com/wbuntine/topic-models
