Deleting all the noun phrases from text using TextBlob - Python

I need to delete all the proper nouns from the text.
result is a DataFrame.
I'm using TextBlob. Below is the code.
from textblob import TextBlob
strings = []
for col in result:
    for i in range(result.shape[0]):
        text = result[col][i]
        Txtblob = TextBlob(text)
        for word, pos in Txtblob.noun_phrases:
            print (word, pos)
            if tag != 'NNP'
                print(' '.join(edited_sentence))
It just recognizes one NNP

To remove all words tagged with 'NNP' from the following text (from the documentation), you can do the following:
from textblob import TextBlob
# Sample text
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.'''
text = TextBlob(text)
# Create a list of words that are tagged with 'NNP'
# In this case it will only be 'Blob'
words_to_remove = [word[0] for word in [tag for tag in text.tags if tag[1] == 'NNP']]
# Remove the Words from the sentence, using words_to_remove
edited_sentence = ' '.join([word for word in text.split(' ') if word not in words_to_remove])
# Show the result
print(edited_sentence)
Output:
# Notice the lack of the word 'Blob'
'\nThe titular threat of The has always struck me as the ultimate
movie\nmonster: an insatiably hungry, amoeba-like mass able to
penetrate\nvirtually any safeguard, capable of--as a doomed doctor
chillingly\ndescribes it--"assimilating flesh on contact.\nSnide
comparisons to gelatin be damned, it\'s a concept with the
most\ndevastating of potential consequences, not unlike the grey goo
scenario\nproposed by technological theorists fearful of\nartificial
intelligence run rampant.\n'
Comments for your sample
from textblob import TextBlob
strings = [] # This variable is not used anywhere
for col in result:
    for i in range(result.shape[0]):
        text = result[col][i]
        txt_blob = TextBlob(text)
        # txt_blob.noun_phrases returns a list of noun phrases.
        # To get the position of each one, you need to use the function 'enumerate', like this:
        for pos, phrase in enumerate(txt_blob.noun_phrases):
            # Now you can print the position and the noun phrase
            print(pos, phrase)
            # This will give you something like the following:
            # 0 titular threat
            # 1 blob
            # 2 ultimate movie monster
            # The following line does not make any sense, because tag has not yet been assigned,
            # you are not iterating over the words from the previous step,
            # and the if statement is missing a colon
            if tag != 'NNP'
                # You are not assigning anything to edited_sentence, so this would not work either.
                print(' '.join(edited_sentence))
Your sample with new code
from textblob import TextBlob
for col in result:
    for i in range(result.shape[0]):
        text = result[col][i]
        txt_blob = TextBlob(text)
        # Create a list of words that are tagged with 'NNP'
        words_to_remove = [word[0] for word in [tag for tag in txt_blob.tags if tag[1] == 'NNP']]
        # Remove the words from the sentence, using words_to_remove
        edited_sentence = ' '.join([word for word in text.split(' ') if word not in words_to_remove])
        # Show the result
        print(edited_sentence)
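If the goal is to write the cleaned text back into the DataFrame rather than only print it, here is a minimal sketch along the same lines. It assumes result is a pandas DataFrame whose columns all contain strings; the helper name remove_proper_nouns is made up for illustration.
from textblob import TextBlob

def remove_proper_nouns(text):
    # Drop every word that TextBlob tags as a proper noun (NNP)
    blob = TextBlob(text)
    words_to_remove = {word for word, tag in blob.tags if tag == 'NNP'}
    return ' '.join(word for word in text.split(' ') if word not in words_to_remove)

# Apply the helper to every cell of every column (assumes all columns hold text)
result = result.apply(lambda col: col.map(remove_proper_nouns))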

Related

How to get Bigram/Trigram of word from prelisted unigram from a document corpus / dataframe column

I have a dataframe with text in one of its columns.
I have listed some predefined keywords which I need for analysis and words associated with it (and later make a wordcloud and counter of occurrences) to understand topics /context associated with such keywords.
Use case:
df.text_column()
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']
Let's say one of the rows of the text column has the text: 'coca cola is expanding its business in soft drinks and aerated water'.
Another entry reads: 'lime soda is the best selling item in fast food stores'.
My objective is to get bigrams/trigrams like:
'coca_cola','coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'
Kindly help me to do that [Python only]
First, you can optionally load nltk's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop words list, or even extend nltk's list with additional words. Then, you can use the nltk.bigrams() and nltk.trigrams() methods to get bigrams and trigrams joined with an underscore _, as you asked. Also, have a look at Collocations.
Edit:
If you haven't already, you need to include the following once in your code, in order to download the stop words list.
nltk.download('stopwords')
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"
# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])
# tokenise the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if not w.lower() in stop_words]
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)
# get trigrams
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)
Update
Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones.
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']

def find_matches(n_grams_list):
    matches = []
    for k in keywordlist:
        matching_list = [s for s in n_grams_list if k in s]
        for m in matching_list:
            if m not in matches:
                matches.append(m)
    return matches
all_matching_bigrams = find_matches(bigrams_list) # find all matching bigrams
all_matching_trigrams = find_matches(trigrams_list) # find all matching trigrams
# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)
Output:
['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']
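For the Collocations approach mentioned above, here is a minimal sketch using nltk's collocation finders. It assumes the same clean_word_data list from the code block; the cutoff of 5 is an arbitrary illustration.
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Score candidate bigrams by pointwise mutual information (PMI)
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(clean_word_data)

# Keep the 5 highest-scoring bigrams, joined with an underscore as before
top_pairs = finder.nbest(bigram_measures.pmi, 5)
print(["_".join(pair) for pair in top_pairs])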

Scraping a sentence across many lines | Recursive error unresolved

Goal: if pdf line contains sub-string, then copy entire sentence (across multiple lines).
I am able to print() the line the phrase appears in.
Now, once I find this line, I want to iterate backwards until I find a sentence terminator (. ! ?) from the previous sentence, and then iterate forwards again until the next sentence terminator.
This is so that I can print() the entire sentence the phrase belongs in.
However, I get a RecursionError, with scrape_sentence() stuck running infinitely.
Jupyter Notebook:
# pip install PyPDF2
# pip install pdfplumber
# ---
# import re
import glob
import PyPDF2
import pdfplumber
# ---
phrase = "Responsible Care Company"
# SENTENCE_REGEX = re.pattern('^[A-Z][^?!.]*[?.!]$')
def scrape_sentence(sentence, lines, index, phrase):
    if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
        return sentence.replace('\n', '').strip()
    sentence = scrape_sentence(lines[index-1] + sentence, lines, index-1, phrase) # previous line
    sentence = scrape_sentence(sentence + lines[index+1], lines, index+1, phrase) # following line
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    sentence = sentence[0] # first occurrence
    print(sentence)
    return sentence
# ---
with pdfplumber.open('../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
                sentence = scrape_sentence('', lines, i) # !
                print(sentence) # !
            i += 1
Output:
connection and the linkage to the relevant UN’s 17 SDGs.and Leadership. We have long realized and recognized that there
Phrase:
Responsible Care Company
Sentence (across multiple lines):
"GPIC is a Responsible Care Company certified for RC 14001
since July 2010."
PDF (pg. 2).
Please let me know if there is anything else I can add to this post.
I solved this problem by removing any recursion from scrape_sentence().
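A minimal sketch of one possible iterative version is below. It assumes the sentence simply extends backwards and forwards from the matching line until a terminator is seen; boundary handling is simplified for illustration.
def scrape_sentence(lines, index, phrase):
    terminators = ('.', '!', '?')

    # Walk backwards until the previous line contains a terminator (or we hit the start)
    start = index
    while start > 0 and not any(t in lines[start - 1] for t in terminators):
        start -= 1

    # Walk forwards until the current line contains a terminator (or we hit the end)
    end = index
    while end < len(lines) - 1 and not any(t in lines[end] for t in terminators):
        end += 1

    # Join the collected lines, normalise terminators, and keep the piece with the phrase
    text = ' '.join(lines[start:end + 1]).replace('\n', ' ')
    for part in text.replace('!', '.').replace('?', '.').split('.'):
        if phrase in part:
            return part.strip()
    return ''
Called as scrape_sentence(lines, i, phrase) inside the existing while loop, this avoids the unbounded recursion.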

Using WordNet with nltk to find synonyms that make sense

I want to input a sentence, and output a sentence with hard words made simpler.
I'm using Nltk to tokenize sentences and tag words, but I'm having trouble using WordNet to find a synonym for the specific meaning of a word that I want.
For example:
Input:
"I refuse to pick up the refuse"
Maybe refuse #1 is the easiest word for rejecting, but the refuse #2 means garbage, and there are simpler words that could go there.
Nltk might be able to tag refuse #2 as a noun, but then how do I get synonyms for refuse (trash) from WordNet?
Sounds like you want word synonyms based upon the part of speech of the word (i.e., noun, verb, etc.).
The following creates synonyms for each word in a sentence based upon its part of speech.
References:
Extract Word from Synset using Wordnet in NLTK 3.0
Printing the part of speech along with the synonyms of the word
Code
import nltk; nltk.download('popular')
from nltk.corpus import wordnet as wn

def get_synonyms(word, pos):
    ' Gets word synonyms for part of speech '
    for synset in wn.synsets(word, pos=pos_to_wordnet_pos(pos)):
        for lemma in synset.lemmas():
            yield lemma.name()

def pos_to_wordnet_pos(penntag, returnNone=False):
    ' Mapping from Penn Treebank POS tag to WordNet POS tag '
    morphy_tag = {'NN': wn.NOUN, 'JJ': wn.ADJ,
                  'VB': wn.VERB, 'RB': wn.ADV}
    try:
        return morphy_tag[penntag[:2]]
    except:
        return None if returnNone else ''
Example Usage
# Tokenize text
text = nltk.word_tokenize("I refuse to pick up the refuse")

for word, tag in nltk.pos_tag(text):
    print(f'word is {word}, POS is {tag}')
    # Filter for unique synonyms not equal to word and sort.
    unique = sorted(set(synonym for synonym in get_synonyms(word, tag) if synonym != word))
    for synonym in unique:
        print('\t', synonym)
Output
Note the different sets of synonyms for refuse based upon POS.
word is I, POS is PRP
word is refuse, POS is VBP
decline
defy
deny
pass_up
reject
resist
turn_away
turn_down
word is to, POS is TO
word is pick, POS is VB
beak
blame
break_up
clean
cull
find_fault
foot
nibble
peck
piece
pluck
plunk
word is up, POS is RP
word is the, POS is DT
word is refuse, POS is NN
food_waste
garbage
scraps

Searching text for multiple phrases python

I am currently attempting to search through multiple PDFs for certain pieces of equipment. I have figured out how to parse the PDF file in Python, along with the equipment list. I am currently having trouble with the actual search function. The best way I found to do it online was to tokenize the text and then search through it with the keywords (code below), but unfortunately some of the names of the equipment are multiple words long, causing those names to be tokenized into meaningless words like "blue" and "evaporate", which are found many times in the text and thus saturate the returns. The only way I have thought of to deal with this is to only look for unique words in the names of the equipment and remove the more common ones, but I was wondering if there was a more elegant solution, as even the unique words tend to produce multiple false returns per document.
Mainly, I am looking for a way to search through a text file for phrases of words such as "Blue Transmitter 3", without parsing that phrase into ["Blue", "Transmitter", "3"].
Here is what I have so far:
import PyPDF2
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
import re

#open up pdf and get text
pdfName = 'example.pdf'
read_pdf = PyPDF2.PdfFileReader(pdfName)
text = ""
for i in range(read_pdf.getNumPages()):
    page = read_pdf.getPage(i)
    text += "Page No - " + str(1+read_pdf.getPageNumber(page)) + "\n"
    page_content = page.extractText()
    text += page_content + "\n"

#tokenize pdf text
tokens = word_tokenize(text)
punctuations = ['(',')',';',':','[',']',',','.']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]

#take out the endline symbol and join the whole equipment data set into one long string
lines = [line.rstrip('\n') for line in open('equipment.txt')]
totalEquip = " ".join(lines)
tokens = word_tokenize(totalEquip)
trash = ['Black', 'furnace', 'Evaporation', 'Evaporator', '500', 'Chamber', 'A']
searchWords = [word for word in tokens if not word in stop_words and not word in punctuations and not word in trash]

for i in searchWords:
    for word in splitKeys:
        if i.lower() in word.lower():
            print(i)
            print(word + "\n")
Any help or ideas y'all might have would be much appreciated.
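One way to match multi-word equipment names without tokenizing them is to keep each full name as a phrase and check for it as a substring of the extracted page text. A minimal sketch, assuming equipment.txt has one equipment name per line and reusing the text variable built above (variable names here are made up for illustration):
# Read one equipment name per line and keep each name as a whole phrase
with open('equipment.txt') as f:
    equipment_names = [line.strip() for line in f if line.strip()]

# Check each full phrase against the extracted PDF text, ignoring case
text_lower = text.lower()
for name in equipment_names:
    if name.lower() in text_lower:
        print(name)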

extract a sentence that contains a list of keywords or phrase using python

I have used the following code to extract a sentence from a file (the sentence should contain some or all of the search keywords):
search_keywords = ['mother', 'sing', 'song']
with open('text.txt', 'r') as in_file:
    text = in_file.read()
    sentences = text.split(".")
    for sentence in sentences:
        if (all(map(lambda word: word in sentence, search_keywords))):
            print sentence
The problem with the above code is that it does not print the required sentence if one of the search keywords does not match the sentence's words. I want code that prints the sentence containing some or all of the search keywords. It would be great if the code could also search for a phrase and extract the corresponding sentence.
It seems like you want to count the number of search_keywords in each sentence. You can do this as follows:
sentences = "My name is sing song. I am a mother. I am happy. You sing like my mother".split(".")
search_keywords=['mother','sing','song']
for sentence in sentences:
print("{} key words in sentence:".format(sum(1 for word in search_keywords if word in sentence)))
print(sentence + "\n")
# Outputs:
#2 key words in sentence:
#My name is sing song
#
#1 key words in sentence:
# I am a mother
#
#0 key words in sentence:
# I am happy
#
#2 key words in sentence:
# You sing like my mother
Or if you only want the sentence(s) that have the most matching search_keywords, you can make a dictionary and find the maximum values:
dct = {}
for sentence in sentences:
    dct[sentence] = sum(1 for word in search_keywords if word in sentence)
best_sentences = [key for key, value in dct.items() if value == max(dct.values())]
print("\n".join(best_sentences))
# Outputs:
#My name is sing song
# You sing like my mother
If I understand correctly, you should be using any() instead of all().
if (any(map(lambda word: word in sentence, search_keywords))):
    print sentence
So you want to find sentences that contain at least one keyword. You can use any() instead of all().
EDIT:
If you want to find the sentence which contains the most keywords:
sent_words = []
for sentence in sentences:
    sent_words.append(set(sentence.split()))
num_keywords = [len(sent & set(search_keywords)) for sent in sent_words]

# Find only one sentence
ind = num_keywords.index(max(num_keywords))

# Find all sentences with that number of keywords
ind = [i for i, x in enumerate(num_keywords) if x == max(num_keywords)]
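The question also asks about searching for a whole phrase. Since the checks above are plain substring tests, a multi-word phrase can be handled the same way; a minimal sketch (the phrase values are just examples):
search_phrases = ['sing song', 'my mother']
for sentence in sentences:
    if any(phrase in sentence for phrase in search_phrases):
        print(sentence)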
