I have a dataframe with a 'description' column containing details about the product. Each description in the column is a long paragraph, like
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?
So for this case the result should be the sentence(s) containing "superb product", placed in a new column.
I have used this:
searched_words = ['superb product', 'SUPERB PRODUCT']
print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
The output for this is not what I want, though it works if I put just one word in the searched_words list.
There are a lot of methods to do that. #ChootsMagoots gave you a good answer, but spaCy is also very efficient: you can simply define the pattern that leads you to that sentence. Before that, you need a function that splits the text into sentences. Here's the code:
import spacy
from spacy.matcher import Matcher

def product_sentencizer(doc):
    """Look for sentence starts by scanning for periods only."""
    for i, token in enumerate(doc[:-1]):  # the last token cannot start a new sentence
        if token.text == ".":
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False  # tell the parser not to start a sentence here

    return doc

nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe(product_sentencizer, before="parser")  # insert before the parser builds its own sentences (spaCy v2 API)

text = ("This is a superb product. I so so loved this superb product that I wanna gift to all. "
        "This is like the quality and packaging. I like it very much.")
doc = nlp(text)

matcher = Matcher(nlp.vocab)
# "superb product" is two tokens, so the pattern needs one dict per token
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
matcher.add("superb_product", [pattern])  # the pattern must be registered before matching

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)
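To place the matched sentences in a new column, as the question asks, one possible follow-up (a sketch that reuses the nlp pipeline and matcher defined above; the 'matched_sentence' column name is just an example) would be:

def extract_matched_sentences(text):
    doc = nlp(text)
    # collect the sentence of every match, without duplicates, preserving order
    sentences = []
    for match_id, start, end in matcher(doc):
        sent = doc[start:end].sent.text
        if sent not in sentences:
            sentences.append(sent)
    return ' '.join(sentences)

df['matched_sentence'] = df['description'].apply(extract_matched_sentences)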
Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:
for index, paragraph in df['column_name'].items():  # .iteritems() is deprecated in newer pandas
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            df.loc[index, 'extracted_sentence'] = sentence  # .loc avoids chained-assignment issues
This is going to be quite slow, but I don't know if there's a better way.
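If speed matters, a vectorized alternative is possible (a sketch using pandas str.extract; it captures only the first sentence that contains the phrase, and assumes sentences are period-delimited as above):

import re

# Grab everything between the surrounding periods whenever the span contains
# "superb product" (case-insensitive); NaN is returned when there is no match.
df['extracted_sentence'] = df['description'].str.extract(
    r'([^.]*superb product[^.]*\.?)', flags=re.IGNORECASE, expand=False
)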
I am using Python and spaCy as my NLP library, working on a big dataframe that contains feedback about different cars, which looks like this:
'feedback' column contains natural language text to be processed,
'lemmatized' column contains lemmatized version of the feedback text,
'entities' column contains named entities extracted from the feedback text (I've trained the pipeline so that it will recognise car models and brands, labelling these as 'CAR_BRAND' and 'CAR_MODEL')
I then created the following function, which applies the spaCy pipeline to each row of my dataframe and extracts any [noun + verb], [verb + noun], [adj + noun], [adj + proper noun] combinations.
def bi_gram(x):
    doc = nlp_token(x)
    result = []
    text = ''
    for i in range(len(doc)):
        j = i+1
        if j < len(doc):
            if (doc[i].pos_ == "NOUN" and doc[j].pos_ == "VERB") or (doc[i].pos_ == "VERB" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "PROPN"):
                text = doc[i].text + " " + doc[j].text
                result.append(text)
            i = i+1
    return result
Then I applied this function to 'lemmatized' column:
df['bi_gram'] = df['lemmatized'].apply(bi_gram)
This is where I have a problem...
This is producing only one bigram per row maximum. How can I tweak the code so that more than one bigram can be extracted and put in a column? (Also are there more linguistic combinations I should try?)
Is there a possibility to find out what people are saying about 'CAR_BRAND' and 'CAR_MODEL' named entities extracted in the 'entities' column? For example 'Cool Porsche' - Some brands or models are made of more than two words so it's tricky to tackle.
I am very new to NLP. If there is a more efficient way to tackle this, any advice would be super helpful!
Many thanks for your help in advance.
spaCy has a built-in pattern matching engine that's perfect for your application – it's documented in the API reference and in a more extensive usage guide. It allows you to define patterns in a readable and easy-to-maintain way, as lists of dictionaries that define the properties of the tokens to be matched.
Set up the pattern matcher
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # or whatever model you choose
matcher = Matcher(nlp.vocab)

# your patterns
patterns = {
    "noun_verb": [{"POS": "NOUN"}, {"POS": "VERB"}],
    "verb_noun": [{"POS": "VERB"}, {"POS": "NOUN"}],
    "adj_noun": [{"POS": "ADJ"}, {"POS": "NOUN"}],
    "adj_propn": [{"POS": "ADJ"}, {"POS": "PROPN"}],
}

# add the patterns to the matcher
for pattern_name, pattern in patterns.items():
    matcher.add(pattern_name, [pattern])
Extract matches
doc = nlp("The dog chased cats. Fast cats usually escape dogs.")
matches = matcher(doc)
matches is a list of tuples containing
a match id,
the start index of the matched bit and
the end index (exclusive).
This is a test output adapted from the spaCy usage guide:
for match_id, start, end in matches:
    # Get string representation
    string_id = nlp.vocab.strings[match_id]
    # The matched span
    span = doc[start:end]
    print(repr(span.text))
    print(match_id, string_id, start, end)
    print()
Result
'dog chased'
1211260348777212867 noun_verb 1 3
'chased cats'
8748318984383740835 verb_noun 2 4
'Fast cats'
2526562708749592420 adj_noun 5 7
'escape dogs'
8748318984383740835 verb_noun 8 10
Some ideas for improvement
Named entity recognition should be able to detect multi-word expressions, so brand and/or model names that consist of more than one token shouldn't be an issue if everything is set up correctly
Matching dependency patterns instead of linear patterns might slightly improve your results (see the sketch below)
That being said, what you're trying to do (a kind of sentiment analysis) is quite a difficult task that's normally tackled with machine learning approaches and heaps of training data. So don't expect too much from simple heuristics.
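For illustration, here is a minimal DependencyMatcher sketch (assuming spaCy v3; it pairs each noun or proper noun with its adjectival modifier, which handles word order and intervening tokens better than a fixed two-token window):

from spacy.matcher import DependencyMatcher

dep_matcher = DependencyMatcher(nlp.vocab)

# anchor on a noun/proper noun, then require a child token with the 'amod' dependency
pattern = [
    {"RIGHT_ID": "target", "RIGHT_ATTRS": {"POS": {"IN": ["NOUN", "PROPN"]}}},
    {"LEFT_ID": "target", "REL_OP": ">", "RIGHT_ID": "modifier", "RIGHT_ATTRS": {"DEP": "amod"}},
]
dep_matcher.add("adj_modifier", [pattern])

doc = nlp("Fast cats usually escape slow dogs.")
for match_id, token_ids in dep_matcher(doc):
    # token_ids are ordered like the pattern: [target, modifier]
    target, modifier = doc[token_ids[0]], doc[token_ids[1]]
    print(modifier.text, target.text)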
I have one text file with a list of phrases. Below is how the file looks:
Filename: KP.txt
And from the below input (paragraph), I want to extract the next 2 words after the KP.txt phrase (the phrases could be anything as shown in my above KP.txt file). All I need is to extract the next 2 words.
Input:
This is Lee. Thanks for contacting me. I wanted to know the exchange policy at Noriaqer hardware services.
In the above example, the phrase "I wanted to know" matches the KP.txt file content. So if I extract the next 2 words after this phrase, my output will be "exchange policy".
How could I extract this in python?
Assuming you already know how to read the input file into a list, it can be done with some help from regex.
>>> import re
>>> wordlist = ['I would like to understand', 'I wanted to know', 'I wish to know', 'I am interested to know']
>>> input_text = 'This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.'
>>> def word_extraction(input_text, wordlist):
...     for word in wordlist:
...         if word in input_text:
...             output = re.search(r'(?<=%s)(.\w*){2}' % word, input_text)
...             print(output.group().lstrip())
...
>>> word_extraction(input_text, wordlist)
exchange policy
>>> input_text = 'This is Lee. Thanks for contacting me. I wish to know where is Noriaqer hardware.'
>>> word_extraction(input_text, wordlist)
where is
>>> input_text = 'This is Lee. Thanks for contacting me. I\'d like to know where is Noriaqer hardware.'
>>> word_extraction(input_text, wordlist)
>>>
First we need to check whether the phrase we want is in the sentence. It's not the most efficient way if you have a large list, but it works for now.
Next, if it is in our "dictionary" of phrases, we use regex to extract the words we want.
Finally, strip the leading whitespace in front of our target words.
Regex hints:
(?<=%s) is a look-behind assertion: it matches only at a position immediately preceded by the key phrase, e.g. "I wanted to know".
(.\w*){2} matches one character (the space) followed by word characters, repeated twice, i.e. it stops after the 2 words following the key phrase.
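For completeness, a minimal sketch of building wordlist from the file (assuming KP.txt holds one phrase per line):

# Load KP.txt into a list of phrases, one per line; blank lines are skipped.
with open("KP.txt") as f:
    wordlist = [line.strip() for line in f if line.strip()]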
I think natural language processing could be a better solution, but this code should help :)
def search_in_text(kp, text):
    for line in kp:
        # if a search phrase is found in the kp lines
        if line in text:
            # the starting index of the two words
            i1 = text.find(line) + len(line)
            # the end index of the following two words (first index + 50 at maximum)
            i2 = (i1 + 50) if len(text) > (i1 + 50) else len(text)
            # split the following text into words (next_words) and remove empty strings
            next_words = [word for word in text[i1:i2].split(' ') if word != '']
            # return only the next two words from (next_words)
            return next_words[0:2]
    return []  # return an empty list if no phrase matches
#read your kp file as list of lines
kp=open("kp.txt").read().split("\n")
#input 1
text = 'This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.'
print('input ->>',text)
output = search_in_text(kp,text)
print('output ->>',output)
input ->> This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.
output ->> ['exchange', 'policy']
#input 2
text = 'Boss was very angry and said: I wish to know why you are late?'
print('input ->>',text)
output = search_in_text(kp,text)
print('output ->>',output)
input ->> Boss was very angry and said: I wish to know why you are late?
output ->> ['why', 'you']
you can use this:
with open("KP.txt") as fobj:
phrases = list(map(lambda sentence : sentence.lower().strip(), fobj.readlines()))
paragraph = input("Enter The Whole Paragraph in one line:\t").lower()
for phrase in phrases:
if phrase in paragraph:
temp = paragraph.split(phrase)[1:]
for clause in temp:
print(" ".join(clause.split()[:2]))
I want to do fuzzy matching of a multi-word query against a string.
The target string could be something like:
"Hello, I am going to watch a film today."
and the words I want to search for are:
"flim toda".
This hopefully should return "film today" as a search result.
I have used this method, but it seems to work only with one word.
import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    matched_words = []
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            matched_words.append(match)
    return matched_words

large_string = "Hello, I am going to watch a film today"
query_string = "film"
print(list(matches(large_string, query_string, 0.8)))
This only works with one word, and it returns matches only when there is little noise.
Is there any way to do such fuzzy matching with multi-word phrases?
The feature you are thinking of is called "query suggestion". It does rely on spell checking, but it is built on Markov chains trained on search-engine query logs.
That being said, you can use an approach similar to the one described in this answer: https://stackoverflow.com/a/58166648/140837
You can simply use fuzzysearch; please see the example below:
from fuzzysearch import find_near_matches

text_string = "Hello, I am going to watch a film today."
matches = find_near_matches('flim toda', text_string, max_l_dist=2)
print([text_string[m.start:m.end] for m in matches])
This will give you the desired output.
['film toda']
Please note that you can set the max_l_dist parameter based on how much difference you are willing to tolerate.
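If you'd rather stay with the standard library, a window-based variant of the difflib approach from the question is another option (just a sketch; the threshold needs tuning and punctuation stays attached to the words):

import difflib

def fuzzy_phrase_match(large_string, query_string, threshold=0.7):
    """Slide a window as wide as the query (in words) over the text and keep
    windows whose similarity ratio clears the threshold."""
    words = large_string.split()
    n = len(query_string.split())
    hits = []
    for i in range(len(words) - n + 1):
        window = ' '.join(words[i:i + n])
        ratio = difflib.SequenceMatcher(None, window.lower(), query_string.lower()).ratio()
        if ratio >= threshold:
            hits.append(window)
    return hits

print(fuzzy_phrase_match("Hello, I am going to watch a film today.", "flim toda"))
# ['film today.']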
I have a dataframe with two pertinent columns, "rm_word" and "article."
Data Sample:
,grouping,fts,article,rm_word
0,"1",fts,"This is the article. This is a sentence. This is a sentence. This is a sentence. This goes on for awhile and that's super ***crazy***. It goes on and on.",crazy
I want to query the last 100 characters of each "article" to determine whether the row's respective "rm_word" appears. If it does, then I want to delete from the "article" the entire sentence in which "rm_word" appears, as well as all the sentences that follow it.
Desired Result (when "crazy" is the "rm_word"):
,grouping,fts,article,rm_word
0,"1",fts,"This is the article. This is a sentence. This is a sentence. This is a sentence.",crazy
This mask is able to determine when an article contains its "rm_word," but I'm having trouble with the sentence deletion bit.
mask = ([ (str(a) in b[-100:].lower()) for a,b in zip(df["rm_word"], df["article"])])
print (df.loc[mask])
Any help would be much appreciated! Thank you so much.
Does this work?
import pandas as pd

df = pd.DataFrame(
    columns=['article', 'rm_word'],
    data=[["This is the article. This is a sentence. This is a sentence. This is a sentence.", 'crazy'],
          ["This is the article. This is a sentence. This is a sentence. This is a sentence. This goes on for awhile and that's super crazy. It goes on and on.", 'crazy']]
)

def clean_article(x):
    # leave the article untouched if rm_word is not in its last 100 characters
    if x['rm_word'] not in x['article'][-100:].lower():
        return x
    # keep everything before the first occurrence of rm_word
    article = x['article'].rsplit(x['rm_word'])[0]
    # drop the trailing partial sentence, then reassemble
    article = article.split('.')[:-1]
    x['article'] = '.'.join(article) + '.'
    return x

df = df.apply(lambda x: clean_article(x), axis=1)
df['article'].values
Returns
array(['This is the article. This is a sentence. This is a sentence. This is a sentence.',
'This is the article. This is a sentence. This is a sentence. This is a sentence.'],
dtype=object)
I need a python library that accepts some text, and replaces phone numbers, names, and so on with tokens. Example:
Input: Please call Robert on 0430013454 to discuss this further.
Output: Please call NAME on PHONE to discuss this further.
In other words, I need to take a sentence (any sentence), run the program on it, and remove anything that looks like a name, phone number, or any other identifier, replacing it with a token such as NAME or PHONE NUMBER. The token is just text that replaces the info so that it is no longer displayed.
Must be python 2.7 compatible. Would anybody know how this would be done?
Cheers!
As Harrison pointed out, nltk has named entity recognition, which is what you want for this task. Here is a good sample to get you started.
From the site:
import nltk

sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)
    entity_names.extend(extract_entity_names(tree))

# Print all entity names
#print entity_names

# Print unique entity names
print set(entity_names)
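To turn the extracted entity_names into the tokenised sentence the question asks for, a minimal sketch (a hypothetical anonymize helper; plain string replacement for names plus a digit-run regex for the phone number, with the 10-digit length taken from the example):

import re

def anonymize(text, entity_names):
    # Replace each detected entity name with the NAME token
    for name in entity_names:
        text = text.replace(name, 'NAME')
    # Replace any standalone run of 10 digits with the PHONE token
    text = re.sub(r'\b\d{10}\b', 'PHONE', text)
    return text

print(anonymize('Please call Robert on 0430013454 to discuss this further.', ['Robert']))
# Please call NAME on PHONE to discuss this further.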
Not really sure about name recognition. However, if you know the names you are looking for, it's easy: keep a list of all of the names, check whether each one is in the string, and if so use string.replace. If the names are arbitrary, you could look into NLTK; I think it has some named entity recognition, but I don't really know much about it.
As for phone numbers, that's easy. You can split the string into a list and check whether any element consists only of digits. You could even check the length to make sure it's 10 digits (I'm assuming all numbers will be 10 digits based on your example).
Something like this...
example_input = 'Please call Robert on 0430013454 to discuss this further.'
new_list = example_input.split(' ')

for word in new_list:
    if word.isdigit():
        pos = new_list.index(word)
        new_list[pos] = 'PHONE'

example_output = ' '.join(new_list)
print example_output
This would be the output: 'Please call Robert on PHONE to discuss this further'
The if statement would be something like if word.isdigit() and len(word) == 10: if you wanted to make sure the length of the digits is 10.