I have the following dataset:
ID Text
12 Coolest fan we’ve ever seen.
12 SHARE this with anyone you know who can use this tip!
31 Time for a Royal Celebration! Save the date.
54 The way to a sports fan’s heart? Behind-the-scenes content from their favourite teams.
419 Start asking your questions now for tomorrow’s LIVE Q&A on careers you can do without going to university.
451 Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned
I have used ngrams to analyse the text and the frequency of words/word sequences.
from nltk import ngrams

text = df.Text.tolist()
list_n = []
for i in text:
    n_grams = ngrams(i.split(), 3)
    for grams in n_grams:
        list_n.append(grams)
list_n
Since I am interested in finding which texts a particular word/word sequence was used in, I need to create an association between the text ID and the ngrams of that text.
For example: I am interested in finding the texts which contain "Save the date", i.e. ID=31 and ID=451.
To find the n-grams for one single word, I have been using this:
def ngram_filter(col, word, n):
    tokens = col.split()
    all_ngrams = ngrams(tokens, n)
    filtered_ngrams = [x for x in all_ngrams if word in x]
    return filtered_ngrams
However, I do not know how to find the ID associated with the text, nor how to pass more than one word to the function above.
How could I do this? Any ideas?
Please feel free to change the tags if needed. Thanks.
I don't have much experience with ngrams, but you could get what you want with str.contains like:
import re

txt = 'save the date'
print(f'The ngrams "{txt}" is in IDs: ',
      df.loc[df['Text'].str.contains(txt, flags=re.IGNORECASE), 'ID'].tolist())
The ngrams "save the date" is in IDs: [31, 451]
Another option could be to do this, but the performance may not be good:
txt = 'save the date'
df_ = (df.assign(ngrams=df.Text.str.replace(r'[^\w\s]+', '', regex=True)  # remove punctuation
                               .str.lower()                               # lower case
                               .str.split()                               # split on whitespace
                               .apply(lambda x: list(ngrams(x, 3))))
         .explode('ngrams'))                                              # create a row per ngram
print(df_.loc[df_['ngrams'].isin(ngrams(txt.lower().split(), 3))])
ID Text ngrams
2 31 Time for a Royal Celebration! Save the date. (save, the, date)
5 451 Save the date, we’re hosting a fabulous & fun ... (save, the, date)
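If you also need the reverse lookup (every ngram mapped to the IDs it appears in), one possible extension is to group the exploded frame; this is just a sketch reusing df_ from the snippet above:

# map every trigram to the list of IDs whose text contains it
ngram_to_ids = (df_.dropna(subset=['ngrams'])   # rows with fewer than 3 words produce no trigrams
                   .groupby('ngrams')['ID']
                   .apply(list)
                   .to_dict())

print(ngram_to_ids[('save', 'the', 'date')])    # [31, 451]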
Related
I have a pandas dataframe called df. It has a column called article. The article column contains 600 strings, and each of the strings represents a news article.
I want to KEEP only those articles whose first four sentences contain the keywords "COVID-19" AND ("China" OR "Chinese"). But I'm unable to find a way to do this on my own.
(in the string, sentences are separated by \n. An example article looks like this:)
\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
First we define a function to return a boolean based on whether your keywords appear in a given sentence:
def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)
Then we create a boolean series by applying this function (using Series.apply) to the sentences of your df.article column.
Note that we use a lambda function in order to truncate the text passed on to contains_covid_kwds at the fifth occurrence of '\n', i.e. your first four sentences:
series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
Then we pass the boolean series to df.loc, in order to localize the rows where the series was evaluated to True:
filtered_df = df.loc[series]
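For illustration, a minimal end-to-end run on a tiny made-up frame (the two sample articles below are placeholders, not your data):

import pandas as pd

df = pd.DataFrame({'article': [
    '\nCOVID-19 cases rise in China.\nSecond sentence.\nThird sentence.\nFourth sentence.\nFifth sentence.',
    '\nUnrelated sports news.\nSecond sentence.\nThird sentence.\nFourth sentence.\nFifth sentence.',
]})

series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
filtered_df = df.loc[series]
print(filtered_df)  # only the first row survives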
You can use the pandas apply method and do it the way I did.
string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
df = pd.DataFrame({'article':[string]})
def findKeys(string):
    string_list = string.strip().lower().split('\n')
    flag = 0
    keywords = ['china', 'covid-19', 'wuhan']
    # Checking if the article has more than 4 sentences
    if len(string_list) > 4:
        # Iterating over string_list, which contains the sentences
        for i in range(4):
            # Iterating over the keywords list
            for key in keywords:
                # Checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    # Else block is executed when the article has four or fewer sentences
    else:
        # Iterating over string_list, which contains the sentences
        for i in range(len(string_list)):
            # Iterating over the keywords list
            for key in keywords:
                # Checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    if flag == 0:
        return False
    else:
        return True
and then call the pandas apply method on df:
df['Contains Keywords?'] = df['article'].apply(findKeys)
First I create a series which contains just the first four sentences from the original df['article'] column, converted to lower case, assuming that the search should be case-insensitive.
articles = df['article'].apply(lambda x: "\n".join(x.lstrip("\n").split("\n", maxsplit=4)[:4])).str.lower()
Then use a simple boolean mask to filter only those rows where the keywords were found in the first four sentences.
df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]
Here:
found = []
s1 = "hello"
s2 = "good"
s3 = "great"
for string in article:
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)
I have a dataset consisting of faculty IDs and student feedback about each faculty member. There are multiple comments for each faculty member, so the comments for each one are stored as a list. I want to apply gensim summarization on the "comments" column of the dataset to generate a summary of each faculty member's performance according to the student feedback.
Just as a trial, I tried to summarize the feedback corresponding to the first faculty ID. There are 8 distinct comments (sentences) in that particular feedback, yet gensim throws the error ValueError: input must have more than one sentence.
df_test.head()
csf_id comments
0 9 [' good subject knowledge.', ' he has good kn...
1 10 [' good knowledge of subject. ', ' good subjec...
2 11 [' good at clearing the concepts interactive w...
3 12 [' clears concepts very nicely interactive wit...
4 13 [' good teaching ability.', ' subject knowledg...
from gensim.summarization import summarize
text = df_test["comments"][0]
print("Text")
print(text)
print("Summary")
print(summarize(text))
ValueError: input must have more than one sentence
What changes should I make so that the summarizer reads all the sentences and summarizes them?
For gensim summarization, a newline or a full stop marks a sentence boundary.
from gensim.summarization.summarizer import summarize
summarize("punctual in time.")
This will throw the same error: ValueError: input must have more than one sentence.
Now, when there is something after the full stop, it is interpreted as more than one sentence:
summarize("punctual in time. good subject knowledge")
# the output will be a blank string since the text is very small, but now you won't receive any error
''
Now, coming to your problem: you need to join all the list elements into one string.
#example
import pandas as pd
df = pd.DataFrame([[["good subject."," punctual in time.","discipline person."]]], columns = ['comment'])
print(df)
comment
0 [good subject., punctual in time, discipline ...
df['comment'] = df['comment'].apply(''.join)
df['comment'].apply(summarize)  # this will work for you, but keep in mind you need long enough text to generate a summary
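Applied to the frame in your question, that would look roughly like this (assuming df_test['comments'] really holds lists of strings, as df_test.head() suggests):

# join each list of comments into one string, then summarize row by row;
# ' ' as the separator keeps a space after every full stop so gensim can split sentences
df_test['joined'] = df_test['comments'].apply(' '.join)
df_test['summary'] = df_test['joined'].apply(summarize)
# note: any row whose joined text still amounts to a single sentence will raise the same ValueError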
I got the solution. Actually, pandas has built-in methods for this. Just follow the piece of code below if you face the same problem.
df["comments"] = df["comments"].str.replace(",","").astype(str)
df["comments"] = df["comments"].str.replace("[","").astype(str)
df["comments"] = df["comments"].str.replace("]","").astype(str)
df["comments"] = df["comments"].str.replace("'","").astype(str)
Doing this removes all the square brackets, commas and quotes, so each feedback is treated as one single string. Then you can summarize the text in a given row of the dataframe using:
from gensim.summarization import summarize
summary = summarize(df["comments"][i])
print(summary)
I have written the following (crude) code to find the association strengths among the words in a given piece of text.
import re
import numpy as np
import pandas as pd
## The first paragraph of Wikipedia's article on itself - you can try with other pieces of text with preferably more words (to produce more meaningful word pairs)
text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopadia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018."
text = re.sub(r"[\[].*?[\]]", "", text) ## Remove brackets and anything inside them.
text=re.sub(r"[^a-zA-Z0-9.]+", ' ', text) ## Remove special characters except spaces and dots
text=str(text).lower() ## Convert everything to lowercase
## Can add other preprocessing steps, depending on the input text, if needed.
from nltk.corpus import stopwords
import nltk
stop_words = stopwords.words('english')
desirable_tags = ['NN'] # We want only nouns - can also add 'NNP', 'NNS', 'NNPS' if needed, depending on the results
word_list = []
for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.append(word)
'''
Construct the association matrix, where we count 2 words as being associated
if they appear in the same sentence.
Later, I'm going to define associations more properly by introducing a
window size (say, if 2 words are separated by at most 5 words in a sentence,
then we consider them to be associated)
'''
table = np.zeros((len(word_list), len(word_list)), dtype=int)

for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i, j] += 1
df = pd.DataFrame(table, columns=word_list, index=word_list)
# Count the number of occurrences of each word from word_list in the text
all_words = pd.DataFrame(np.zeros((len(df), 2)), columns=['Word', 'Count'])
all_words.Word = df.index

for sent in text.split('.'):
    for word in sent.split():
        if word in word_list:
            all_words.loc[all_words.Word == word, 'Count'] += 1
# Sort the word pairs in decreasing order of their association strengths
df.values[np.triu_indices_from(df, 0)] = 0  # Make the upper triangle values 0

assoc_df = pd.DataFrame(columns=['Word 1', 'Word 2', 'Association Strength (Word 1 -> Word 2)'])
for row_word in df:
    for col_word in df:
        '''
        If Word1 occurs 10 times in the text, and Word1 & Word2 occur in the same sentence 3 times,
        the association strength of Word1 and Word2 is 3/10 - Please correct me if this is wrong.
        '''
        assoc_df = assoc_df.append({'Word 1': row_word, 'Word 2': col_word,
                                    'Association Strength (Word 1 -> Word 2)':
                                        df[row_word][col_word] / all_words[all_words.Word == row_word]['Count'].values[0]},
                                   ignore_index=True)

assoc_df.sort_values(by='Association Strength (Word 1 -> Word 2)', ascending=False)
This produces the word associations like so:
Word 1 Word 2 Association Strength (Word 1 -> Word 2)
330 wiki encyclopedia 3.0
895 encyclopadia found 1.0
1317 anyone edit 1.0
754 peer science 1.0
755 peer encyclopadia 1.0
756 peer britannica 1.0
...
...
...
However, the code contains a lot of for loops, which hurts its running time. In particular, the last part (sorting the word pairs in decreasing order of their association strengths) consumes a lot of time, as it computes the association strengths of n^2 word pairs/combinations, where n is the number of words we are interested in (those in word_list in my code above).
So, the following are what I would like some help on:
How do I vectorize the code, or otherwise make it more efficient?
Instead of producing n^2 combinations/pairs of words in the last step, is there any way to prune some of them before producing them? I am going to prune some of the useless/meaningless pairs by inspection after they are produced anyway.
Also, I know this does not fall within the purview of a coding question, but I would love to know if there's any mistake in my logic, especially when calculating the word association strengths.
Since you asked about your specific code, I will not go into alternative libraries. I will mostly concentrate on points 1) and 2) of your question:
Instead of iterating through the whole word list twice (i and j), you can already cut the processing time roughly in half by iterating j from i + 1 to the end of the list. This removes duplicate pairs (index 24 with 42 as well as index 42 with 24) and identical pairs (index 42 with 42).
for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(i + 1, len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i, j] += 1
Be careful with this, though: the in operator will also match partial words (like and in hand).
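For example, a two-line illustration of the pitfall (the strings here are made up):

sent = 'he waved his hand'
print('and' in sent)          # True  -> substring match inside 'hand'
print('and' in sent.split())  # False -> exact token match only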
Of course, you could also remove the j iteration completely by first filtering for all the words in your word list and then pairing them afterwards:
word_list = set()  # Using a set instead of a list makes lookups faster since this is a hashed structure
for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.add(word)

(...)

word_index = {word: idx for idx, word in enumerate(word_list)}  # map each word to its row/column index in table

for sent in text.split('.'):
    found_words = [word for word in sent.split() if word in word_list]  # list comprehensions are usually faster than pure for loops
    # If you want to count duplicate words, then leave the whole line below out.
    found_words = tuple(frozenset(found_words))  # make every word unique using a set, then indexable again by converting it into a tuple
    for i in range(len(found_words)):
        for j in range(i + 1, len(found_words)):
            table[word_index[found_words[i]], word_index[found_words[j]]] += 1
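On point 1), the slowest step, building assoc_df pair by pair with append, can also be replaced by a single numpy division plus a stack. A rough sketch, reusing table, word_list and all_words from your question (and assuming word_list is still the ordered list you built there):

import numpy as np
import pandas as pd

# divide every row of the co-occurrence table by the count of that row's word
counts = all_words.set_index('Word').loc[word_list, 'Count'].to_numpy()
strengths = table / counts[:, None]

# long format: one row per (Word 1, Word 2) pair, sorted by strength
assoc_df = (pd.DataFrame(strengths, index=word_list, columns=word_list)
              .stack()
              .rename_axis(['Word 1', 'Word 2'])
              .reset_index(name='Association Strength (Word 1 -> Word 2)')
              .sort_values('Association Strength (Word 1 -> Word 2)', ascending=False))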
In general, though, you should really think about using external libraries for most of this. As some of the comments on your question already pointed out, splitting on . may get you wrong results; the same goes for splitting on whitespace, for example with words that are separated by a - or followed by a ,.
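For instance, nltk's own tokenizers give you more robust sentence and word splitting than splitting on '.' and whitespace; a small sketch (it assumes the punkt tokenizer data has been downloaded):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # needed once
sample = ("Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger. "
          "Initially an English-language encyclopedia, versions in other languages were quickly developed.")
for sent in sent_tokenize(sample):
    print(word_tokenize(sent))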
I am using negex to find the negation terms in my text along with the negation scope. This is the negex.py:
https://github.com/chapmanbe/negex/blob/master/negex.python/negex.py
This is my wrapper function to call negex on each phrase for each sentence of my input text:
input_text = ('Today the weather is not great for playing baseball. '
              'it is snowing and the wind is strong. '
              'I wish it was sunny but it is not what I want. '
              'Today is Sunday and I have to go to school tomorrow. '
              'Tommorrow is not going to be snowing though.')
wrapper function:
for report in data_samples:
    this_txt, this_sentences = sentences_for_text(report)
    for i in range(len(this_txt)):
        this_string = this_txt[i]
        my_sentences = this_sentences[i]
        for sntc in my_sentences:
            my_ngrams = find_ngrams(sntc)
            for grm in my_ngrams:
                tagger = negTagger(sentence=sntc, phrases=grm, rules=irules, negP=False)
                if 'negated' in tagger.getNegationFlag():
                    print("tagger.getScopes():", tagger.getScopes())
                    output.append([this_string, grm])
                    output.append(tagger.getScopes())
So each report can have more than one segment. I get each segment of the report and break it into sentences, extract all the unigrams, bigrams, trigrams and four-grams for each sntc, and go through all the grams of each sentence to find the negations in it. This code works, but the problem is that it uses so much memory that I get a MemoryError before even finishing one report. I need to run this on thousands of reports; any idea how I can fix this problem, given that I only care about the negated tag?
I have a list of reviews and a list of words, and I am trying to count how many times each word shows up in each review. The list of keywords is roughly 30 and could grow/change. The current population of reviews is roughly 5000, with review word counts ranging from 3 to several hundred words. The number of reviews will definitely grow. Right now the keyword list is static and the number of reviews will not grow too much, so any solution to get the counts of keywords in each review will work, but ideally it will be one without a major performance issue if the number of reviews drastically increases or the keywords change and all the reviews have to be reanalyzed.
I have been reading through different methods on Stack Overflow and haven't been able to get any to work. I know you can use scikit-learn to get the count of each word, but I haven't figured out if there is a way to count a phrase. I have also tried various regex expressions. If the keyword list were all single words, I know I could very easily use scikit-learn, a loop or regex, but I am having issues when the keyword has multiple words.
Two links I have tried
Python - Check If Word Is In A String
Phrase matching using regex and Python
the solution here is close, but it doesn't count all occurrences of the same word
How to return the count of words from a list of words that appear in a list of lists?
Both the list of keywords and the reviews are being pulled from a MySQL DB. All keywords are in lowercase. All text has been made lowercase, and all non-alphanumeric characters except spaces have been stripped from the reviews. My original thought was to use scikit-learn's CountVectorizer to count the words, but not knowing how to handle counting a phrase, I switched. I am currently attempting it with loops and regex, but I am open to any solution.
# Example of what I am currently attempting with regex
keywords = ['test', 'blue sky', 'grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing',
           'this pharse contains test and blue sky and look another test',
           'the grass is green test']

for review in reviews:
    for word in keywords:
        results = re.findall(r'\bword\b', review)  # this returns no results, the variable word is not getting picked up
        # --also tried variations of this to no avail
        # --tried creating the pattern first and passing it
        # pattern = "r'\\b" + word + "\\b'"
        # results = re.findall(pattern, review)  # this errors with the msg: sre_constants.error: multiple repeat at position 9
#The results would be
review1: test=2; 'blue sky'=0;'grass is green'=0
review2: test=2; 'blue sky'=1;'grass is green'=0
review3: test=1; 'blue sky'=0;'grass is green'=1
I would first do it with brute force rather than overcomplicating it, and then try to optimize it later.
keywords = ['test', 'blue sky', 'grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing',
           'this pharse contains test and blue sky and look another test',
           'the grass is green test']

results = dict()
for i in keywords:
    for j in reviews:
        results[i] = results.get(i, 0) + j.count(i)

print(results)
> {'test': 6, 'blue sky': 1, 'grass is green': 1}
It's important that we query the dict with .get: in case a key hasn't been set yet, we don't want to deal with a KeyError exception.
If you want to go the complicated route, you can build your own trie and counter structure to do searches in large text files.
Parsing one terabyte of text and efficiently counting the number of occurrences of each word
None of the options you tried searches for the value of word:
results = re.findall(r'\bword\b', review) checks for the literal word word in the string.
When you try pattern = "r'\\b" + word + "\\b'", you check for the string "r'\b[value of word]\b'".
You can use the first option, but the pattern should be r'\b%s\b' % word. That will search for the value of word.
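Putting that together with the data from the question, a sketch of the per-review counts (re.escape is added so keywords containing regex metacharacters stay literal; the printed numbers match the expected results listed in the question):

import re

keywords = ['test', 'blue sky', 'grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing',
           'this pharse contains test and blue sky and look another test',
           'the grass is green test']

for i, review in enumerate(reviews, start=1):
    counts = {word: len(re.findall(r'\b%s\b' % re.escape(word), review)) for word in keywords}
    print(f'review{i}: {counts}')
# review1: {'test': 2, 'blue sky': 0, 'grass is green': 0}
# review2: {'test': 2, 'blue sky': 1, 'grass is green': 0}
# review3: {'test': 1, 'blue sky': 0, 'grass is green': 1}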