Let's say I have a bag of keywords.
Example:
['profit low', 'loss increased', 'profit lowered']
I have a PDF document and I parse the entire text from it;
now I want to get the sentences which match the bag of words.
Let's say one sentence is:
'The profit in the month of November lowered from 5% to 3%.'
This should match, since 'profit lowered' from the bag of words matches this sentence.
What would be the best approach to solve this problem in Python?
# input
checking_words = ['profit low', 'loss increased', 'profit lowered']
checking_string = 'The profit in the month of November lowered from 5% to 3%.'
trans_check_words = checking_string.split()
# output
for word_bug in [st.split() for st in checking_words]:
    if word_bug[0] in trans_check_words and word_bug[1] in trans_check_words:
        print(word_bug)
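This handles two-word phrases. If the phrases can be longer, a minimal generalization of the same idea (my own sketch, still using plain whitespace splitting) is to require that all words of a phrase appear among the sentence's words:

checking_words = ['profit low', 'loss increased', 'profit lowered']
checking_string = 'The profit in the month of November lowered from 5% to 3%.'
trans_check_words = checking_string.split()

# keep every phrase whose words all occur in the sentence
matches = [phrase for phrase in checking_words
           if all(w in trans_check_words for w in phrase.split())]
print(matches)  # ['profit lowered']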
You want to check, for every element of your check-words list, whether it appears inside the long sentence:
sentence = 'The profit in the month of November lowered from 5% to 3%.'
words = ['profit','month','5%']
for element in words:
    if element in sentence:
        # do something with it
        print(element)
If you want something cleaner, you can use a one-liner to collect the matched words into a list:
sentence = 'The profit in the month of November lowered from 5% to 3%.'
words = ['profit','month','5%']
# Collect the matched words into a list with a comprehension:
matched_words = [word for word in words if word in sentence]
print(matched_words)
If the elements of your list contain "spaced" words (multi-word phrases), you want to take care of that by using the split() method:
sentence = 'The profit in the month of November lowered from 5% to 3%.'
words = ['profit low','month high','5% 3%']
single_words = []
for w in words:
    for s in w.split(' '):
        single_words.append(s)

# Collect the matched words into a list:
matched_words = [word for word in single_words if word in sentence]
print(matched_words)
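One caveat with plain substring checks like word in sentence: 'low' from 'profit low' also matches inside 'lowered'. If whole-word matching is wanted, a small sketch with a word-boundary regex (my own addition; note that \b only behaves cleanly for tokens made of word characters, so something like '5%' needs extra care) looks like this:

import re

sentence = 'The profit in the month of November lowered from 5% to 3%.'
single_words = ['profit', 'low', 'month', 'high']

# \b marks a word boundary, so 'low' no longer matches inside 'lowered'
matched_words = [w for w in single_words
                 if re.search(r'\b' + re.escape(w) + r'\b', sentence)]
print(matched_words)  # ['profit', 'month']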
You can try something like the following.
First, convert the bag of words to a single string:
bag_of_words = ['profit low', 'loss increased', 'profit lowered']
bag_of_word_sent = ' '.join(bag_of_words)
Then, with the list of sentences:
list_sents = ['The profit in the month of November lowered from 5% to 3%.']
use the Levenshtein distance:
import distance
for sent in list_sents:
    dist = distance.levenshtein(bag_of_word_sent, sent)
    if dist > len(bag_of_word_sent):
        # do something
        print(dist)
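Note that import distance here refers to the third-party distance package on PyPI (installed with pip install distance), not to anything in the standard library; any other Levenshtein implementation would work the same way.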
I have a dataframe with text in one of its columns.
I have listed some predefined keywords which I need for analysis, along with the words associated with them (to later make a word cloud and a counter of occurrences), in order to understand the topics/context associated with such keywords.
Use case:
df.text_column()
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']
Let's say one of the rows of the text column has the text: 'coca cola is expanding its business in soft drinks and aerated water'.
Another entry is like: 'lime soda is the best selling item in fast food stores'.
My objective is to get bigrams/trigrams like:
'coca_cola','coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'
Kindly help me to do that [Python only]
First, you can optionally load NLTK's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop words list, or even extend NLTK's list with additional words. Then, you can use the nltk.bigrams() and nltk.trigrams() methods to get bigrams and trigrams joined with an underscore _, as you asked. Also, have a look at Collocations.
Edit:
If you haven't already, you need to include the following once in your code, in order to download the stop words list.
nltk.download('stopwords')
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"
# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])
# tokenise the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if w.lower() not in stop_words]
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)
# get trigrams
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)
Update
Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones.
keywordlist = ['coca' , 'food', 'soft', 'aerated', 'soda']
def find_matches(n_grams_list):
    matches = []
    for k in keywordlist:
        matching_list = [s for s in n_grams_list if k in s]
        for m in matching_list:
            if m not in matches:
                matches.append(m)
    return matches
all_matching_bigrams = find_matches(bigrams_list)    # find all matching bigrams
all_matching_trigrams = find_matches(trigrams_list)  # find all matching trigrams
# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)
Output:
['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']
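To apply this to the DataFrame column from the question, one possible sketch (my own addition; the column name text_column and the wrapper function are assumptions) reuses the pieces defined above:

import pandas as pd

def extract_ngrams(text):
    # tokenise, drop stop words, build bigrams/trigrams and keep keyword matches
    tokens = [w for w in word_tokenize(text) if w.lower() not in stop_words]
    bigrams = ["_".join(t) for t in nltk.bigrams(tokens)]
    trigrams = ["_".join(t) for t in nltk.trigrams(tokens)]
    return find_matches(bigrams) + find_matches(trigrams)

df = pd.DataFrame({'text_column': [
    'coca cola is expanding its business in soft drinks and aerated water',
    'lime soda is the best selling item in fast food stores']})
df['matched_ngrams'] = df['text_column'].apply(extract_ngrams)
print(df['matched_ngrams'])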
I am trying to make a regular expression to get all sentences containing two words (order doesn't matter), but I can't find a solution for this.
"Supermarket. This apple costs 0.99."
I want to get back the following sentence:
This apple costs 0.99.
I tried:
([^.]*?(apple)*?(costs)[^.]*\.)
I have problems because the price contains a dot. Also, this expression gives back results containing only one of the words.
Approach: For each Phrase, we have to find the sentences which contain all the words of the phrase. So, for each word in the given phrase, we check if a sentence contains it. We do this for each sentence. This process of searching may become faster if the words in the sentence are stored in a set instead of a list.
Below is the implementation of the above approach in Python:
def getRes(sent, ph):
    sentHash = dict()

    # Loop for adding hashed sentences to sentHash
    for s in range(1, len(sent) + 1):
        sentHash[s] = set(sent[s - 1].split())

    # For each phrase
    for p in range(0, len(ph)):
        print("Phrase" + str(p + 1) + ":")

        # Get the list of words
        wordList = ph[p].split()
        res = []

        # Then check in every sentence
        for s in range(1, len(sentHash) + 1):
            wCount = len(wordList)

            # Every word in the phrase
            for w in wordList:
                if w in sentHash[s]:
                    wCount -= 1

            # If every word in the phrase matches,
            # add the sentence index to the result array
            if wCount == 0:
                res.append(s)

        if len(res) == 0:
            print("NONE")
        else:
            print('%s' % ' '.join(map(str, res)))

# Driver function
def main():
    sent = ["Strings are an array of characters",
            "Sentences are an array of words"]
    ph = ["an array of", "sentences are strings"]
    getRes(sent, ph)

main()
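One thing to note (my own observation, not part of the original answer): the comparison is case-sensitive, so with the driver data above the phrase "sentences are strings" prints NONE, because "Sentences" and "Strings" are capitalised in the input sentences. Lowercasing both sides before splitting, e.g. building sentHash from sent[s - 1].lower().split() and wordList from ph[p].lower().split(), makes the matching case-insensitive.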
You use a negated character class [^.] which matches any character except a dot.
But in your example data, Supermarket. This apple costs 0.99., there are two dots before the dot at the end, so you cannot cross the dot after Supermarket. to reach apple.
You could, for example, match up to the first dot, then assert costs and use a capture group to match the part with apple, making sure the line ends with a dot.
The lookahead assertion for word 1 combined with a direct match for word 2 will match the words in either order.
^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$
Explanation
^[^.]*\. From the start of the string, match until and including the first dot
\s* Match 0+ whitespace characters
(?=.*\bcosts\b) Positive lookahead, assert costs at the right
( Capture group 1 (this has the desired value)
.*\bapple\b.*\. Match the rest of the line that includes apple and ends with a dot
) Close group 1
$ Assert end of string
import re
regex = r"^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$"
test_str = ("Supermarket. This apple costs 0.99.\n"
"Supermarket. This costs apple 0.99.\n"
"Supermarket. This apple is 0.99.\n"
"Supermarket. This orange costs 0.99.")
print(re.findall(regex, test_str, re.MULTILINE))
Output
['This apple costs 0.99.', 'This costs apple 0.99.']
I also suggest first extracting sentences and then finding the ones that contain both words.
However, splitting text into sentences is pretty hard because of the existence of abbreviations, unusual names, etc. One way to do it is with the nltk.tokenize.punkt module.
You'll need to install NLTK and then run this in Python:
import nltk
nltk.download('punkt')
After that you can use the English sentence tokenizer together with two regexes:
TEXT = 'Mr. Bean is in supermarket. iPhone 12 by Apple Inc. costs $999.99.'
WORD1 = 'apple'
WORD2 = 'costs'
import nltk.data, re
# Regex helper
find_word = lambda w, s: re.search(r'(^|\W)' + w + r'(\W|$)', s, re.I)
eng_sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
for sent in eng_sent_detector.tokenize(TEXT):
    if find_word(WORD1, sent) and find_word(WORD2, sent):
        print(sent, "\n----")
Output:
iPhone 12 by Apple Inc. costs $999.99.
----
Notice that it handles numbers and abbreviations for you.
I am working on removing similar phrases in a list, but I have hit a small roadblock.
I have sentences and phrases; the phrases are related to the sentences, and all phrases of a sentence are in a single list.
Let the phrase list be: p = [['This is great','is great','place for drinks','for drinks'],['Tonight is a good','good night','is a good','for movies']]
I want my output to be [['This is great','place for drinks'],['Tonight is a good','for movies']]
Basically, I want to get all the longest unique phrases of a list.
I took a look at the fuzzywuzzy library, but I am unable to arrive at a good solution.
Here is my code:
def remove_dup(arr, threshold=80):
    ret_arr = []
    for item in arr:
        if item[1] < threshold:
            ret_arr.append(item[0])
    return ret_arr

def find_important(sents=sents, phrase=phrase):
    import os, random
    from fuzzywuzzy import process, fuzz

    all_processed = []  # final array to be returned
    for i in range(len(sents)):
        new_arr = []  # reshaped phrases for a single sentence
        for item in phrase[i]:
            new_arr.append(item)
        new_arr.sort(reverse=True, key=lambda x: len(x))  # sort with highest length first

        important = []  # array to store terms
        important = process.extractBests(new_arr[0], new_arr)  # to get Levenshtein distance matches
        to_proc = remove_dup(important)  # remove_dup removes all relatively matching terms
        to_proc.append(important[0][0])  # the term with the highest match is obviously the important term
        all_processed.append(to_proc)  # add non-duplicates to all_processed
    return all_processed
Can someone point out what I am missing, or what is a better way to do this?
Thanks in advance!
I would use the difference between each phrase and all the other phrases.
If a phrase has at least one different word compared to all the other phrases then it's unique and should be kept.
I've also made it robust to exact matches and to added spaces:
sentences = [['This is great','is great','place for drinks','for drinks'],
['Tonight is a good','good night','is a good','for movies'],
['Axe far his favorite brand for deodorant body spray',' Axe far his favorite brand for deodorant spray','Axe is']]
new_sentences = []
s = " "
for phrases in sentences:
    new_phrases = []
    phrases = [phrase.split() for phrase in phrases]
    for i in range(len(phrases)):
        phrase = phrases[i]
        if all([len(set(phrase).difference(phrases[j])) > 0 or i == j for j in range(len(phrases))]):
            new_phrases.append(phrase)
    new_phrases = [s.join(phrase) for phrase in new_phrases]
    new_sentences.append(new_phrases)
print(new_sentences)
Output:
[['This is great', 'place for drinks'],
['Tonight is a good', 'good night', 'for movies'],
['Axe far his favorite brand for deodorant body spray', 'Axe is']]
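For what it's worth, 'good night' survives in the second group because it contains 'night', a word that none of the other phrases in that group have, while the near-duplicate 'Axe far his favorite brand for deodorant spray' in the third group is dropped because every one of its words already appears in the longer phrase.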
I have a dictionary where the keys are words and the values are vectors of those words.
I have a list of sentences which I want to convert into an array. I'm getting an array of all the words, but I would like to have an array of sentences with word vectors so I can feed it into a neural network.
sentences=["For last 8 years life, Galileo house arrest espousing man's theory",
'No. 2: 1912 Olympian; football star Carlisle Indian School; 6 MLB seasons Reds, Giants & Braves',
'The city Yuma state record average 4,055 hours sunshine year'.......]
word_vec={'For': [0.27452874183654785, 0.8040047883987427],
'last': [-0.6316165924072266, -0.2768899202346802],
'years': [-0.2496756911277771, 1.243837594985962],
'life,': [-0.9836481809616089, -0.9561406373977661].....}
I want to convert the above sentences into vectors of their corresponding words from the dictionary.
Try this:
def sentence_to_list(sentence, words_dict):
    return [w for w in sentence.split() if w in words_dict]
So the first of the sentences in your example will be converted to:
['For', 'last', 'years', 'life,'] # words not in the dictionary are not present here
Update.
I guess you need to remove punctuation characters. There are several methods for splitting a string using multiple delimiter characters; check this answer: Split Strings into words with multiple word boundary delimiters
This will create vectors, containing list of lists of vectors (one list per one sentence):
vectors = []
for sentence in sentences:
    sentence_vec = [word_vec[word] for word in sentence.split() if word in word_vec]
    vectors.append(sentence_vec)
If you want to omit punctuation (,.: etc.), use re.findall (import re) instead of .split:
words = re.findall(r"[\w']+", sentence)
sentence_vec = [ word_vec[word] for word in words if word in word_vec ]
If you don't want to skip words not available in word_vec, use:
sentence_vec = [ word_vec[word] if word in word_vec else [0,0] for word in words ]
It will place 0,0 for each missing word.
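Since the goal is to feed this into a neural network, the per-sentence lists usually have to be padded to a common length before they can become a single array. A minimal sketch of that step (my own addition; the zero-padding choice, the 2-dimensional word vectors, and the use of NumPy are assumptions):

import numpy as np

# pad every sentence to the length of the longest one with [0, 0] vectors
max_len = max(len(v) for v in vectors)
padded = [v + [[0.0, 0.0]] * (max_len - len(v)) for v in vectors]

data = np.array(padded)  # shape: (num_sentences, max_len, 2)
print(data.shape)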
I have used the following code to extract a sentence from a file (the sentence should contain some or all of the search keywords):
search_keywords = ['mother', 'sing', 'song']

with open('text.txt', 'r') as in_file:
    text = in_file.read()

sentences = text.split(".")

for sentence in sentences:
    if all(map(lambda word: word in sentence, search_keywords)):
        print(sentence)
The problem with the above code is that it does not print the required sentence if one of the search keywords does not match the sentence words. I want code that prints the sentences containing some or all of the search keywords. It would be great if the code could also search for a phrase and extract the corresponding sentence.
It seems like you want to count the number of search_keywords in each sentence. You can do this as follows:
sentences = "My name is sing song. I am a mother. I am happy. You sing like my mother".split(".")
search_keywords=['mother','sing','song']
for sentence in sentences:
    print("{} key words in sentence:".format(sum(1 for word in search_keywords if word in sentence)))
    print(sentence + "\n")
# Outputs:
#2 key words in sentence:
#My name is sing song
#
#1 key words in sentence:
# I am a mother
#
#0 key words in sentence:
# I am happy
#
#2 key words in sentence:
# You sing like my mother
Or if you only want the sentence(s) that have the most matching search_keywords, you can make a dictionary and find the maximum values:
dct = {}
for sentence in sentences:
    dct[sentence] = sum(1 for word in search_keywords if word in sentence)
best_sentences = [key for key,value in dct.items() if value == max(dct.values())]
print("\n".join(best_sentences))
# Outputs:
#My name is sing song
# You sing like my mother
If I understand correctly, you should be using any() instead of all().
if any(map(lambda word: word in sentence, search_keywords)):
    print(sentence)
So you want to find sentences that contain at least one keyword. You can use any() instead of all().
EDIT:
If you want to find the sentence which contains the most keywords:
sent_words = []
for sentence in sentences:
    sent_words.append(set(sentence.split()))
num_keywords = [len(sent & set(search_keywords)) for sent in sent_words]
# Find only one sentence
ind = num_keywords.index(max(num_keywords))
# Find all sentences with that number of keywords
ind = [i for i, x in enumerate(num_keywords) if x == max(num_keywords)]
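From the index list it is then straightforward to recover the sentences themselves; a small follow-up (my own addition, assuming the sentences list defined above):

# ind holds the indices of the sentences with the most keyword matches
best_sentences = [sentences[i] for i in ind]
print("\n".join(best_sentences))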