I'm trying to extract sentences that contain selected keywords using set.intersection().
So far I'm only getting sentences that contain the word 'van'. I can't get sentences with the phrases 'blue tinge' or 'off the road', because the code below can only handle single keywords.
Why is this happening, and what can I do to solve the problem? Thank you.
from textblob import TextBlob
import nltk
nltk.download('punkt')
search_words = set(["off the road", "blue tinge", "van"])
blob = TextBlob("That is the off the road vehicle I had in mind for my adventure. "
                "Which one? The one with the blue tinge. Oh, I'd use the money for a van.")
matches = []
for sentence in blob.sentences:
    blobwords = set(sentence.words)
    if search_words.intersection(blobwords):
        matches.append(str(sentence))
print(matches)
Output: ["Oh, I'd use the money for a van."]
Your intersection only matches single tokens: sentence.words splits each sentence into individual words, so a multi-word phrase such as "blue tinge" can never appear in that set. If you want to check for an exact match of the search keywords, this can be accomplished with a simple substring test:
from nltk.tokenize import sent_tokenize
text = "That is the off the road vehicle I had in mind for my adventure. Which one? The one with the blue tinge. Oh, I'd use the money for a van."
search_words = ["off the road", "blue tinge", "van"]
matches = []
sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)
print(matches)
The output is:
['That is the off the road vehicle I had in mind for my adventure.',
"Oh, I'd use the money for a van.",
'The one with the blue tinge.']
If you want partial matching, you can use fuzzywuzzy for percentage-based matching.
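For example, a minimal sketch assuming the fuzzywuzzy package is installed, reusing text and search_words from above (the 90 threshold is just a guess to tune):

from fuzzywuzzy import fuzz
from nltk.tokenize import sent_tokenize

matches = []
for sentence in sent_tokenize(text):
    # fuzz.partial_ratio scores the best-matching substring on a 0-100 scale
    if any(fuzz.partial_ratio(word, sentence) >= 90 for word in search_words):
        matches.append(sentence)
print(matches)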
Related
I have a dataframe with text in one of its columns.
I have listed some predefined keywords which I need for analysis, along with the words associated with them (to later make a wordcloud and a counter of occurrences), in order to understand the topics/context associated with such keywords.
Use case:
df.text_column()
keywordlist = [coca , food, soft, aerated, soda]
Let's say one of the rows of the text column has the text: 'coca cola is expanding its business in soft drinks and aerated water'.
Another entry: 'lime soda is the best selling item in fast food stores'.
My objective is to get bigrams/trigrams like:
'coca_cola', 'coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'
Kindly help me do that [Python only].
First, you can optionally load NLTK's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop word list, or extend NLTK's list with additional words. Then you can use the nltk.bigrams() and nltk.trigrams() methods to get bigrams and trigrams joined with an underscore _, as you asked. Also, have a look at Collocations.
Edit:
If you haven't already, you need to run the following once in your code in order to download the stop words list.
nltk.download('stopwords')
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"
# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])
# tokenise the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if not w.lower() in stop_words]
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)
# get trigrams
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)
Update
Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones.
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']

def find_matches(n_grams_list):
    matches = []
    for k in keywordlist:
        matching_list = [s for s in n_grams_list if k in s]
        for m in matching_list:
            if m not in matches:
                matches.append(m)
    return matches

all_matching_bigrams = find_matches(bigrams_list)    # find all matching bigrams
all_matching_trigrams = find_matches(trigrams_list)  # find all matching trigrams

# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)
Output:
['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']
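As for the Collocations mentioned above, here is a minimal sketch using NLTK's collocation finder on the same clean_word_data (the PMI measure and the top-10 cutoff are just example choices):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# rank bigrams in the cleaned tokens by pointwise mutual information (PMI)
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(clean_word_data)
print(["_".join(pair) for pair in finder.nbest(bigram_measures.pmi, 10)])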
Issue
I'm trying to make compound words singular (from plural) using spaCy.
However, I can't fix the error below when transforming the plural compound words to singular.
How can I get the preferred output shown below?
cute dog
two or three word
the christmas day
Development Environment
Python 3.9.1
Error
print(str(nlp(word).lemma_))
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'lemma_'
Code
import spacy
nlp = spacy.load("en_core_web_sm")
words = ["cute dogs", "two or three words", "the christmas days"]
for word in words:
    print(str(nlp(word).lemma_))
Trial
import spacy
nlp = spacy.load("en_core_web_sm")
words = ["cute dogs", "two or three words", "the christmas days"]
for word in words:
    word = nlp(word)
    for token in word:
        print(str(token.lemma_))
which prints each token's lemma separately:
cute
dog
two
or
three
word
the
christmas
day
As you've found out, you can't get the lemma of a Doc, only of individual tokens. Multi-word expressions don't have lemmas in English; lemmas exist only for individual words. Conveniently, though, English compound expressions are pluralized by pluralizing just the last word, so you can simply make the last word singular. Here's an example:
import spacy
nlp = spacy.load("en_core_web_sm")
def make_compound_singular(text):
    doc = nlp(text)
    if len(doc) == 1:
        return doc[0].lemma_
    else:
        return doc[:-1].text + doc[-2].whitespace_ + doc[-1].lemma_

texts = ["cute dogs", "two or three words", "the christmas days"]
for text in texts:
    print(make_compound_singular(text))
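Running this should print the preferred output from the question: cute dog, two or three word, the christmas day.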
I have a dataframe with a 'description' column containing details about a product. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?
So for this case the expected result is the sentence(s) containing "superb product", placed in a new column.
I have used this:
searched_words = ['superb product', 'SUPERB PRODUCT']
print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
The output for this is not suitable, though it works if I put just one word in the searched_words list.
There are a lot of ways to do that. @ChootsMagoots gave you a good answer, but spaCy is also efficient: you can simply define a matcher pattern that leads you to the sentence. Before that, you need to define a function that sets the sentence boundaries. Here's the code:
import spacy
from spacy.matcher import Matcher

def product_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False  # Explicitly mark other tokens so the parser leaves them alone
    return doc

nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe(product_sentencizer, before="parser")  # Insert before the parser so it builds sentences from these boundaries

text = "This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much."
doc = nlp(text)

matcher = Matcher(nlp.vocab)
# each dict in the pattern matches a single token, so a two-word phrase needs two dicts
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
matcher.add('SUPERB_PRODUCT', None, pattern)  # spaCy v2 signature: (key, on_match, *patterns)

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)
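Note that nlp.add_pipe(function, ...) is the spaCy v2 pipeline API. On spaCy v3 the component has to be registered by name; a minimal sketch of the same idea (assuming en_core_web_sm is installed for v3):

import spacy
from spacy.language import Language
from spacy.matcher import Matcher

@Language.component("product_sentencizer")
def product_sentencizer(doc):
    # start a new sentence after every period and suppress all other sentence starts
    for i, token in enumerate(doc[:-2]):
        doc[i + 1].is_sent_start = token.text == "."
    return doc

nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("product_sentencizer", before="parser")

matcher = Matcher(nlp.vocab)
matcher.add("SUPERB_PRODUCT", [[{"LOWER": "superb"}, {"LOWER": "product"}]])  # v3 signature: (key, list_of_patterns)

doc = nlp("This is a superb product. I so so loved this superb product that I wanna gift to all.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].sent)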
Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:
for index, paragraph in df['column_name'].items():
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            df.at[index, 'extracted_sentence'] = sentence
This is going to be quite slow, but idk if there's a better way.
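A faster, vectorized alternative is a regex extract on the column itself; a minimal sketch, assuming a 'description' column like the one in the question and a case-insensitive search (it keeps only the first matching sentence per row):

import re
import pandas as pd

# hypothetical dataframe mirroring the question's 'description' column
df = pd.DataFrame({"description": [
    "This is a superb product. I so so loved this superb product that I wanna gift to all. "
    "This is like the quality and packaging. I like it very much."
]})

# capture the first sentence (up to the next period) that contains the phrase
df["extracted_sentence"] = df["description"].str.extract(
    r"([^.]*superb product[^.]*\.)", flags=re.IGNORECASE, expand=False
)
print(df["extracted_sentence"])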
I have a corpus of about 30,000 customer reviews in a csv file (or a txt file). This means each customer review is a line in the text file. Some examples are:
This bike is amazing, but the brake is very poor
This ice maker works great, the price is very reasonable, some bad
smell from the ice maker
The food was awesome, but the water was very rude
I want to change these texts to the following:
This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
This ice maker works great POSITIVE and the price is very reasonable
POSITIVE, some bad NEGATIVE smell from the ice maker
The food was awesome POSITIVE, but the water was very rude NEGATIVE
I have two separate lists (lexicons) of positive words and negative words. For example, a text file contains such positive words as:
amazing
great
awesome
very cool
reasonable
pretty
fast
tasty
kind
And, a text file contains such negative words as:
rude
poor
worst
dirty
slow
bad
So, I want a Python script that reads each customer review: when any of the positive words is found, insert "POSITIVE" after the positive word; when any of the negative words is found, insert "NEGATIVE" after the negative word.
Here is the code I have tested so far. This works (see my comments in the codes below), but it needs improvement to meet my needs described above.
Specifically, my_escaper works (it finds words such as cheap and good and replaces them with cheap POSITIVE and good POSITIVE), but the problem is that I have two files (lexicons), each containing about a thousand positive/negative words. So what I want is for the code to read those word lists from the lexicons, search for them in the corpus, and replace them there (for example, from "good" to "good POSITIVE", from "bad" to "bad NEGATIVE").
# adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings
import re

def multiple_replacer(*key_values):
    replace_dict = dict(key_values)
    replacement_function = lambda match: replace_dict[match.group(0)]
    pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
    return lambda string: pattern.sub(replacement_function, string)

def multiple_replace(string, *key_values):
    return multiple_replacer(*key_values)(string)

# my_escaper works for hardcoded words (cheap/good/avoid), but I need to read the
# positive and negative words from the two lexicon files instead
my_escaper = multiple_replacer(('cheap', 'cheap POSITIVE'), ('good', 'good POSITIVE'), ('avoid', 'avoid NEGATIVE'))

d = []
with open("review.txt", "r") as file:
    for line in file:
        review = line.strip()
        d.append(review)

for line in d:
    print(my_escaper(line))
A straightforward way to code this would be to load your positive and negative words from your lexicons into separate sets. Then, for each review, split the sentence into a list of words and look up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list and then join the list to build the final string.
Example:
import re

reviews = [
    "This bike is amazing, but the brake is very poor",
    "This ice maker works great, the price is very reasonable, some bad smell from the ice maker",
    "The food was awesome, but the water was very rude"
]

positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
negative_words = set(['poor', 'bad', 'rude'])

for sentence in reviews:
    tagged = []
    for word in re.split(r'\W+', sentence):
        tagged.append(word)
        if word.lower() in positive_words:
            tagged.append("POSITIVE")
        elif word.lower() in negative_words:
            tagged.append("NEGATIVE")
    print(' '.join(tagged))
While this approach is straightforward, there is a downside: you lose the punctuation due to the use of re.split().
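If keeping the punctuation matters, a minimal sketch of an alternative is to tag the words in place with re.sub, reusing the same word sets as above:

import re

positive_words = {'amazing', 'great', 'awesome', 'reasonable'}
negative_words = {'poor', 'bad', 'rude'}

def tag_word(match):
    # append the sentiment label right after the matched word; punctuation is left untouched
    word = match.group(0)
    if word.lower() in positive_words:
        return word + " POSITIVE"
    if word.lower() in negative_words:
        return word + " NEGATIVE"
    return word

print(re.sub(r"[A-Za-z']+", tag_word, "This bike is amazing, but the brake is very poor"))
# This bike is amazing POSITIVE, but the brake is very poor NEGATIVE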
If I understood correctly, you need something like:
if word in POSITIVE_LIST:
    pattern.sub(replacement_function, word + " POSITIVE")
if word in NEGATIVE_LIST:
    pattern.sub(replacement_function, word + " NEGATIVE")
Is it OK with you?
I have the same problem that was discussed in this link: Python extract sentence containing word, but the difference is that I want to find 2 words in the same sentence. I need to extract sentences from a corpus that contain 2 specific words. Could anyone help me, please?
If this is what you mean:
import re
txt = "I like to eat apple. Me too. Let's go buy some apples."
define_words = 'some apple'
print(re.findall(r"([^.]*?%s[^.]*\.)" % define_words, txt))
Output: [" Let's go buy some apples."]
You can also try with:
define_words = input("Enter string: ")
Check if the sentence contains the defined words:
import re
txt = "I like to eat apple. Me too. Let's go buy some apples."
words = 'go apples'.split(' ')
sentences = re.findall(r"([^.]*\.)", txt)
for sentence in sentences:
    if all(word in sentence for word in words):
        print(sentence)
This would be simple using the TextBlob package together with Python's built-in sets.
Basically, iterate through the sentences of your text and check whether there is an intersection between the set of words in the sentence and your search words.
from textblob import TextBlob

search_words = set(["buy", "apples"])
blob = TextBlob("I like to eat apple. Me too. Let's go buy some apples.")

matches = []
for sentence in blob.sentences:
    words = set(sentence.words)
    if search_words & words:  # intersection
        matches.append(str(sentence))
print(matches)
# ["Let's go buy some apples."]
Update:
Or, more Pythonically,
from textblob import TextBlob
search_words = set(["buy", "apples"])
blob = TextBlob("I like to eat apple. Me too. Let's go buy some apples.")
matches = [str(s) for s in blob.sentences if search_words & set(s.words)]
print(matches)
# ["Let's go buy some apples."]
I think you want an answer using NLTK. And I guess those 2 words don't need to be consecutive, right?
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text = "I like to eat apple. Me too. Let's go buy some apples."
>>> words = ['like', 'apple']
>>> sentences = sent_tokenize(text)
>>> for sentence in sentences:
...     if all(map(lambda word: word in sentence, words)):
...         print(sentence)
...
I like to eat apple.