Extracting the words following a phrase - Python

I have a text file with a list of phrases. Below is how the file looks:
Filename: KP.txt
From the input paragraph below, I want to extract the next 2 words after any phrase from KP.txt (the phrases could be anything, as shown in my KP.txt file above). All I need is to extract those next 2 words.
Input:
This is Lee. Thanks for contacting me. I wanted to know the exchange policy at Noriaqer hardware services.
In the above example, the phrase "I wanted to know" matches an entry in KP.txt, so extracting the next 2 words after it gives the output "exchange policy".
How could I extract this in Python?

Assuming you already know how to read the input file into a list, it can be done with some help from regex.
>>> import re
>>> wordlist = ['I would like to understand', 'I wanted to know', 'I wish to know', 'I am interested to know']
>>> input_text = 'This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.'
>>> def word_extraction(input_text, wordlist):
...     for word in wordlist:
...         if word in input_text:
...             output = re.search(r'(?<=%s)(.\w*){2}' % word, input_text)
...             print(output.group().lstrip())
...
>>> word_extraction(input_text, wordlist)
exchange policy
>>> input_text = 'This is Lee. Thanks for contacting me. I wish to know where is Noriaqer hardware.'
>>> word_extraction(input_text, wordlist)
where is
>>> input_text = 'This is Lee. Thanks for contacting me. I\'d like to know where is Noriaqer hardware.'
>>> word_extraction(input_text, wordlist)
>>>
First we check whether each phrase we want is in the sentence. It's not the most efficient approach if you have a large list, but it works for now.
Next, if the phrase is in our "dictionary" of phrases, we use regex to extract the keywords we want.
Finally, we strip the leading whitespace in front of our target words.
Regex hint:
(?<=%s) is a lookbehind assertion: the match must be immediately preceded by the key phrase, e.g. "I wanted to know".
(.\w*){2} matches one character (the space) followed by zero or more word characters, repeated twice, i.e. it stops 2 words after the key phrase.
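One caveat: the phrase is interpolated into the pattern verbatim, so a phrase containing regex metacharacters (a question mark, parentheses, etc.) would break the search. A minimal guard, assuming the same setup as above, is to escape the phrase first:
import re

def word_extraction(input_text, wordlist):
    for word in wordlist:
        if word in input_text:
            # re.escape neutralises any regex metacharacters inside the phrase
            output = re.search(r'(?<=%s)(.\w*){2}' % re.escape(word), input_text)
            print(output.group().lstrip())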

I think natural language processing could be a better solution, but this code should help :)
def search_in_text(kp, text):
    for line in kp:
        # if a search phrase is found in the kp lines
        if line in text:
            # the starting index of the two words
            i1 = text.find(line) + len(line)
            # the end index of the following two words (first index + 50 at maximum)
            i2 = (i1 + 50) if len(text) > (i1 + 50) else len(text)
            # split the following text into words (next_words) and remove empty strings
            next_words = [word for word in text[i1:i2].split(' ') if word != '']
            # return only the next two words from next_words
            return next_words[0:2]
    return []  # return an empty list if no phrase matches
# read your kp file as a list of lines
kp = open("kp.txt").read().split("\n")

# input 1
text = 'This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.'
print('input ->>', text)
output = search_in_text(kp, text)
print('output ->>', output)

input ->> This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.
output ->> ['exchange', 'policy']

# input 2
text = 'Boss was very angry and said: I wish to know why you are late?'
print('input ->>', text)
output = search_in_text(kp, text)
print('output ->>', output)

input ->> Boss was very angry and said: I wish to know why you are late?
output ->> ['why', 'you']

You can use this:
with open("KP.txt") as fobj:
phrases = list(map(lambda sentence : sentence.lower().strip(), fobj.readlines()))
paragraph = input("Enter The Whole Paragraph in one line:\t").lower()
for phrase in phrases:
if phrase in paragraph:
temp = paragraph.split(phrase)[1:]
for clause in temp:
print(" ".join(clause.split()[:2]))

Related

Is there a way in Python to count sentences having quotation marks, question marks and full stops?

I have been searching for a solution to this problem. I am writing a custom function to count the number of sentences. I tried nltk and textstat, but both give me different counts.
An example of a sentence is something like this:
Annie said, "Are you sure? How is it possible? you are joking, right?"
NLTK is giving me --> count=3.
['Annie said, "Are you sure?', 'How is it possible?', 'you are joking, right?"']
Another example:
Annie said, "It will work like this! you need to go and confront your friend. Okay!"
NLTK is giving me --> count=3.
Please suggest. The expected count is 1, as it is a single direct sentence.
I have written a simple function that does what you want:
def sentences_counter(text: str):
    end_of_sentence = ".?!…"
    # complete with whatever end-of-sentence punctuation mark I might have forgotten;
    # you might for instance want to add '\n'.
    sentences_count = 0
    sentences = []
    inside_a_quote = False
    start_of_sentence = 0
    last_end_of_sentence = -2
    for i, char in enumerate(text):
        # quote management, to solve your issue
        if char == '"':
            inside_a_quote = not inside_a_quote
            if not inside_a_quote and text[i-1] in end_of_sentence:  # 🚩
                last_end_of_sentence = i  # 🚩
        elif inside_a_quote:
            continue
        # basic management of sentences with the punctuation marks in `end_of_sentence`
        if char in end_of_sentence:
            last_end_of_sentence = i
        elif last_end_of_sentence == i-1:
            sentences.append(text[start_of_sentence:i].strip())
            sentences_count += 1
            start_of_sentence = i
    # same as the last block, in case there is no end punctuation mark in the text
    last_sentence = text[start_of_sentence:]
    if last_sentence:
        sentences.append(last_sentence.strip())
        sentences_count += 1
    return sentences_count, sentences
Consider the following:
text = '''Annie said, "Are you sure? How is it possible? you are joking, right?" No, I'm not... I thought you were'''
To generalize your problem a bit, I added 2 more sentences: one with an ellipsis, and the last one without any end punctuation mark at all. Now, if I execute this:
sentences_count, sentences = sentences_counter(text)
print(f'{sentences_count} sentences detected.')
print(f'The detected sentences are: {sentences}')
I obtain this:
3 sentences detected.
The detected sentences are: ['Annie said, "Are you sure? How is it possible? you are joking, right?"', "No, I'm not...", 'I thought you were']
I think it works fine.
Note: the quote management in my solution works for American-style quotes, where the sentence's end punctuation mark can sit inside the quote. Remove the lines marked with flag emojis 🚩 to disable this.

Extracting a sentence from a dataframe description column based on a phrase

I have a dataframe with a 'description' column with details about the product. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?
So for this case the result will be:
[expected output shown as a screenshot in the original post]
I have used this:
searched_words = ['superb product', 'SUPERB PRODUCT']
print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
The output for this is not suitable, though it works if I put just one word in the searched_words list.
There are lots of methods to do that. #ChootsMagoots gave you a good answer, but spaCy is also very efficient: you can simply choose a pattern that will lead you to the sentence, but before that, you need to define a function that marks the sentence boundaries. Here's the code:
import spacy
from spacy.matcher import Matcher

def product_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # the last token cannot start a sentence
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False  # tell the default sentencizer to ignore this token
    return doc

nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe(product_sentencizer, before="parser")  # insert before the parser builds its own sentences (spaCy v2 API)

text = "This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much."
doc = nlp(text)

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]  # two tokens, matched case-insensitively
matcher.add('SUPERB_PRODUCT', None, pattern)  # register the pattern (spaCy v2 signature)
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)
Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:
for index, paragraph in df['column_name'].iteritems():
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            df['extracted_sentence'][index] = sentence
This is going to be quite slow, but idk if there's a better way.
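A vectorized sketch of the same idea (assuming the column is named description, sentences are period-delimited, and only the first matching sentence per row is wanted):
import re
import pandas as pd

df = pd.DataFrame({'description': [
    "This is a superb product. I so so loved this superb product that I wanna gift to all."
]})

# [^.]* keeps the match inside a single period-delimited sentence;
# expand=False returns a Series so it can be assigned directly to a column.
df['extracted_sentence'] = df['description'].str.extract(
    r'([^.]*superb product[^.]*\.?)', flags=re.IGNORECASE, expand=False)
print(df['extracted_sentence'][0])  # This is a superb product.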

A Python library that accepts some text and replaces phone numbers, names, and so on with tokens

I need a Python library that accepts some text and replaces phone numbers, names, and so on with tokens. Example:
Input: Please call Robert on 0430013454 to discuss this further.
Output: Please call NAME on PHONE to discuss this further.
In other words, I need to take any sentence, run the program on it, remove anything that looks like a name, phone number, or other identifier, and replace it with a token, i.e. NAME or PHONE. The token is just text that replaces the information so it is no longer displayed.
Must be Python 2.7 compatible. Would anybody know how this could be done?
Cheers!
As Harrison pointed out, nltk has named entity recognition, which is what you want for this task. Here is a good sample to get you started.
From the site:
import nltk

sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)
    entity_names.extend(extract_entity_names(tree))

# Print all entity names
# print entity_names

# Print unique entity names
print set(entity_names)
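To get from the extracted entity names to the tokenized output the question asks for, one naive follow-up (a sketch; plain string replacement, so it will also hit substrings of longer names) is:
# replace every detected entity name with the NAME token
for name in set(entity_names):
    text = text.replace(name, 'NAME')
print text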
Not really sure about name recognition. However, if you know the names you would be looking for, it's easy: keep a list of all of the names, check whether each one is in the string, and if so just use string.replace. If the names are arbitrary, you could look into NLTK; I think it has some named entity recognition, but I really don't know much about it.
As for phone numbers, that's easy. You can split the string into a list and check whether any element consists of digits. You could even check the length to make sure it's 10 digits (I'm assuming all numbers will be 10 digits based on your example).
Something like this...
example_input = 'Please call Robert on 0430013454 to discuss this further.'
new_list = example_input.split(' ')
for word in new_list:
    if word.isdigit():
        pos = new_list.index(word)
        new_list[pos] = 'PHONE'
example_output = ' '.join(new_list)
print example_output
This would be the output: 'Please call Robert on PHONE to discuss this further.'
The if statement would be something like if word.isdigit() and len(word) == 10: if you wanted to make sure the number is exactly 10 digits.
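Alternatively, a regular expression can do the digit-and-length check in one step (a sketch, still assuming standalone 10-digit numbers):
import re

example_input = 'Please call Robert on 0430013454 to discuss this further.'
# \b\d{10}\b matches a standalone run of exactly ten digits
print re.sub(r'\b\d{10}\b', 'PHONE', example_input)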

Searching a list in Python for the position of a certain word

So I have a string which I have split, and from that created a list in Python.
I now need to find the location of a certain word in that list.
The problem I'm having: the word I'm looking for appears twice in the list. The code I have brings back the location of the first occurrence, but it doesn't continue and bring back the other location.
Example: The main reason that i play football is because i love football.
It will find the first FOOTBALL but not the second. Help!!
This is the code I have:
sentence = " The main reson that i play football is because i love football"
sentence = sentence.split()
print(sentence.index("football"))
In the snippet below, i will contain the indices of 'football' in the list.
s = 'The main reason that i play football is because i love football.'
words = s.split()
i = [ind for ind, p in enumerate(words) if p.strip('.') == 'football']  # strip('.') so the trailing 'football.' also matches
import re

looking_for = 'football'
in_text = 'The main reason that i play football is because i love football.'
without_punctuation = re.sub('[^a-zA-Z ]', '', in_text)
words = without_punctuation.split(' ')
for i, w in enumerate(words):
    if w == looking_for:
        print(i)
But of course, punctuation is going to be an issue, like here (with 'football.'), so I've stripped most of it above.
Try this:
def findall(list_in, search_str):
    output = []
    last_index = 0
    while True:
        try:
            find = list_in[last_index:].index(search_str) + last_index
            output.append(find)
            last_index = find + 1
        except ValueError:  # no more occurrences
            break
    return output
output is a list of the indices where search_str can be found in list_in
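For example, with the sentence from the question:
sentence = "The main reason that i play football is because i love football".split()
print(findall(sentence, "football"))  # [6, 11]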

Python: How to format large text outputs to be 'prettier' and user defined

Ahoy StackOverflow-ers!
I have a rather trivial question, but it's something I haven't been able to find in other questions here or in online tutorials: how can we format the output of a Python program so that it fits a certain aesthetic format without any extra modules?
The aim is that I have a block of plain text, like that from a newspaper article, which I've already filtered to extract just the words I want. Now I'd like to print it so that each line is at most 70 characters long, and no word is broken if it would normally fall on a line break.
Using .ljust(70) as in stdout.write(article.ljust(70)) doesn't seem to do anything to it.
So instead of words being broken like this:
Latest news tragic m
urder innocent victi
ms family quiet neig
hbourhood
I'd like it to look like this:
Latest news tragic
murder innocent
victims family
quiet neighbourhood
Thank you all kindly in advance!
Check out the Python textwrap module (a standard module):
>>> import textwrap
>>> t="""Latest news tragic murder innocent victims family quiet neighbourhood"""
>>> print "\n".join(textwrap.wrap(t, width=20))
Latest news tragic
murder innocent
victims family quiet
neighbourhood
>>>
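In Python 3, textwrap.fill wraps and joins in one call:
import textwrap

t = "Latest news tragic murder innocent victims family quiet neighbourhood"
print(textwrap.fill(t, width=20))  # same as "\n".join(textwrap.wrap(t, width=20))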
Use the textwrap module:
http://docs.python.org/library/textwrap.html
I'm sure this can be improved on. Without any libraries:
def wrap_text(text, wrap_column=70):
    sentence = ''
    for word in text.split(' '):
        if len(sentence + word) <= wrap_column:  # use the parameter rather than a hard-coded width
            sentence += ' ' + word
        else:
            print sentence
            sentence = word
    print sentence
EDIT: From the comments, if you want to use regular expressions to just pick out words, use this:
import re

def wrap_text(text, wrap_column=70):
    sentence = ''
    for word in re.findall(r'\w+', text):
        if len(sentence + word) <= wrap_column:
            sentence += ' ' + word
        else:
            print sentence
            sentence = word
    print sentence
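A quick check with the question's sample text (the first printed line keeps a leading space, since sentence always grows by ' ' + word):
wrap_text("Latest news tragic murder innocent victims family quiet neighbourhood", 20)
 Latest news tragic
murder innocent
victims family quiet
neighbourhood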
