Find (possibly multi-word) phrase inside sentence in Python

I'm trying to find keywords within a sentence, where the keywords are usually single words but can be multi-word combos (like "cost in euros"). So if I have a sentence like "cost in euros of bacon", it would find "cost in euros" in that sentence and return true.
For this, I was using this code:
if any(phrase in line for phrase in keyword['aliases']):
where line is the input and aliases is an array of phrases that match a keyword (for cost in euros, it's ['cost in euros', 'euros', 'euro cost']).
However, I noticed that it was also triggering on word parts. For example, I had a match phrase of y and a sentence of trippy cake. I'd not expect this to return true, but it does, since it apparently finds the y in trippy. How do I get this to only check whole words? Originally I was doing this keyword search with a list of words (essentially doing line.split() and checking those), but that doesn't work for multi-word keyword aliases.

This should accomplish what you're looking for:
import re

aliases = [
    'cost.',
    '.cost',
    '.cost.',
    'cost in euros of bacon',
    'rocking euros today',
    'there is a cost inherent to bacon',
    'europe has cost in place',
    'there is a cost.',
    'I was accosted.',
    'dealing with euro costing is painful']
phrases = ['cost in euros', 'euros', 'euro cost', 'cost']
# Keep every alias in which some phrase occurs as a whole word/phrase
matched = list(set([
    alias
    for alias in aliases
    for phrase in phrases
    if re.search(r'\b{}\b'.format(phrase), alias)
]))
print(matched)
Output:
['there is a cost inherent to bacon', '.cost.', 'rocking euros today', 'there is a cost.', 'cost in euros of bacon', 'europe has cost in place', 'cost.', '.cost']
Basically, we grab all matches, using Python's re module with \b word boundaries as the test, including cases where multiple phrases occur in a given alias, via a nested list comprehension; then we use set() to trim duplicates and list() to coerce the set back into a list.
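Applied back to the original any(...) check, a minimal sketch might look like this (re.escape is added here as a precaution of mine in case an alias ever contains regex metacharacters; the sample data above doesn't need it):
import re

def line_has_keyword(line, aliases):
    # True if any alias occurs as a whole word or phrase in the line
    return any(re.search(r'\b{}\b'.format(re.escape(phrase)), line)
               for phrase in aliases)

print(line_has_keyword('trippy cake', ['y']))                         # False: 'y' only occurs inside 'trippy'
print(line_has_keyword('cost in euros of bacon', ['cost in euros']))  # True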
Refs:
Lists: https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
List comprehensions: https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
Sets: https://docs.python.org/3/tutorial/datastructures.html#sets
re (regex): https://docs.python.org/3/library/re.html#module-re

Related

Create a list of alphabetically sorted UNIQUE words and display the first N words in python

I am new to Python; apologies for a simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have a text variable, which contains a lot of text information.
I did
test = text.split()
sorted(test)
As a result, I receive a list which starts with symbols like $ and numbers.
How do I get only the words and print the first N of them?
I'm assuming that by "word" you mean strings that consist of only alphabetical characters. In that case, you can use filter() to first get rid of the unwanted strings, turn the result into a set, sort it, and then print your stuff.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Keep only the strings made up entirely of alphabetic characters
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 unique words in alphabetical order
print(sorted(set(words))[:5])
Output:
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case.
For now, we'll go with this regex - ^[A-Za-z']+$ - which means the string must contain only alphabetic characters and '; you may add more to this regex according to what you deem to be a "word". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re

WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Keep only the strings that match the word pattern
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 unique words in alphabetical order
print(sorted(set(words))[:5])
Output:
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind, however, that this gets tricky when you have a string like hi! What's your name?. Here hi! and name? are words, except they are not fully alphabetic. The trick is to split the text in such a way that you get hi instead of hi! and name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question.
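One pragmatic workaround for the splitting problem, as a sketch, is to extract word-like runs directly with re.findall instead of splitting on spaces (the character class is the same one used above, and remains an assumption about what counts as a "word"):
import re

text = "hi! What's your name?"
# Pull out runs of letters and apostrophes rather than splitting on whitespace
words = re.findall(r"[A-Za-z']+", text)
print(words)  # ['hi', "What's", 'your', 'name']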
I am a newbie here, apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted([j for j in test if j.isalpha()])
print(test2[:5])
You can slice the sorted list up to position 5:
sorted(test)[:5]
or, if looking only for words:
sorted([i for i in test if i.isalpha()])[:5]
or by regex (requires import re):
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
By using a slice of a list you can get all the elements up to a specific index, in this case 5.

How can I detect multiple keywords in python string?

I am looking for a way to create several lists and have the keywords in those lists extracted and matched with a response.
User Input: This is a good day I am heading out for a jog.
List 1 : Keywords : good day, great day, awesome day, best day.
List 2 : Keywords : a run, a swim, a game.
But for a huge database of words, can this be linked to just the lists? Or does it need to be specific words?
Also, would you recommend Python for a huge database of keywords?
The first thing to do is to break the input string up into tokens. A token is just a piece of the string that you want to match. In your case, it looks like your token size is 2 words (but it doesn't have to be). You might also want to strip all punctuation from the input string.
Then for your input, your tokens are
['This is', 'is a', 'a good', 'good day', 'day I', 'I am', 'am heading', 'heading out', 'out for', 'for a', 'a jog']
Then you can iterate over the tokens and check to see if they're contained in each one of the lists. Might look like this:
list1 = ['good day', 'great day', 'awesome day', 'best day']
list2 = ['a run', 'a swim', 'a game']

sentence = 'This is a good day I am heading out for a jog'  # renamed from input to avoid shadowing the builtin
words = sentence.split(' ')
# Build all two-word tokens from consecutive words
tokens = [' '.join(words[i:i+2]) for i in range(len(words) - 1)]
for token in tokens:
    if token in list1:
        print('{} is in list1'.format(token))
    if token in list2:
        print('{} is in list2'.format(token))
One thing you will likely want to do to optimize this is to use sets for list1 and list2, instead of lists.
set1 = set(list1)
Sets offer O(1) average-case lookups, as opposed to O(n) for lists, which is critical if your keyword lists are large.
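As a minimal sketch of the whole approach with sets (the helper name find_keywords and the idea of deriving token sizes from the keyword lists are illustrative additions, and 'a jog' is added to the second list here so the demo produces a second hit):
def find_keywords(sentence, keyword_lists):
    # Pre-build one set per keyword list for O(1) membership tests
    keyword_sets = [set(kws) for kws in keyword_lists]
    words = sentence.split(' ')
    # Use every phrase length that actually occurs in the keyword lists
    sizes = {len(k.split()) for kws in keyword_lists for k in kws}
    hits = []
    for n in sizes:
        for i in range(len(words) - n + 1):
            token = ' '.join(words[i:i+n])
            for idx, s in enumerate(keyword_sets, start=1):
                if token in s:
                    hits.append((token, 'list{}'.format(idx)))
    return hits

print(find_keywords('This is a good day I am heading out for a jog',
                    [['good day', 'great day'], ['a run', 'a jog']]))
# [('good day', 'list1'), ('a jog', 'list2')]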

Finding all the concepts in an input string against a large list of concept strings using python

I have a large set (say, 30 million) of concept strings (maximum 13 words per string) in a database. Given an input string (maybe a maximum of 3 sentences), I would like to find all the concepts from the database that occur in the input string.
I am using Python for this purpose. I loaded all the concepts from the database into a list, loop through the concept list, and check whether each concept occurs in the input string. Since the search is essentially sequential, the process takes long, and I will have to do it for hundreds of input strings.
To prune some iterations, I tokenized the input string and load only the concepts that contain at least one of the tokens and whose length is less than or equal to the length of the input string. This requires an SQL query to load the shortlisted concepts into the list. Still, the list might contain 20 million concepts, and the process is not that fast.
Any idea how this process could be made more efficient?
For better visualization I am giving a little pythonic example:
inputString = "The cow is a domestic animal. It has four legs, one tail, two eyes"
# Load the concept list from the database containing any of the words in the
# input string (after removing stop words). Assume the list is as follows.
concepts = ["cow", "domestic animal", "domestic bird", "domestic cat", "domestic dog", "one eye", "two eyes", "two legs", "four legs", "two ears"]
for c in concepts:
    if c in inputString:
        print('found ' + c + ' in ' + inputString)
It would be great if you could give me some suggestions to make it more efficient.
You should make use of sets, which are much faster than lists (and than full-text search) for finding items.
Put these concepts into a dict of sets, indexed by the number of words. Then split inputString into a list of words, and slide a rolling window of n words over this list, testing each window against the set indexed by n.
So given the following initialization:
from collections import defaultdict
import re
inputString = "The cow is a domestic animal. It has four legs, one tail, two eyes"
concepts = ["cow", "domestic animal", "domestic bird", "forever and ever", "practice makes perfect", "i will be back"]
We break down concepts into a dict of sets, indexed by the number of words of the concepts contained within a set:
concept_sets = defaultdict(set)
for concept in concepts:
    concept_sets[len(concept.split())].add(concept)
So that concept_sets becomes:
{1: {'cow'}, 2: {'domestic bird', 'domestic animal'}, 3: {'practice makes perfect', 'forever and ever'}, 4: {'i will be back'}}
Then we turn inputString into a list of words in lowercase so that the match can be case-insensitive. Note that you may want to refine the regex here so that it includes certain other characters as part of a "word".
input_words = list(map(str.lower, re.findall(r'[a-z]+', inputString, re.IGNORECASE)))
Finally, we loop through each concept set in concept_sets along with its number of words, go through the word list from the input in a rolling window of the same number of words, and test whether the joined words exist in the set.
for num_words, concept_set in concept_sets.items():
    for i in range(len(input_words) - num_words + 1):
        words = ' '.join(input_words[i: i + num_words])
        if words in concept_set:
            print("found '%s' in '%s'" % (words, inputString))
This outputs:
found 'cow' in 'The cow is a domestic animal. It has four legs, one tail, two eyes'
found 'domestic animal' in 'The cow is a domestic animal. It has four legs, one tail, two eyes'
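Wrapped up as a reusable function (a sketch; the name find_concepts is mine), the whole approach looks like this:
from collections import defaultdict
import re

def find_concepts(input_string, concepts):
    # Index the concepts by word count so membership tests are O(1)
    concept_sets = defaultdict(set)
    for concept in concepts:
        concept_sets[len(concept.split())].add(concept)
    words = [w.lower() for w in re.findall(r'[a-z]+', input_string, re.IGNORECASE)]
    found = []
    for num_words, concept_set in concept_sets.items():
        # Rolling window of num_words words over the input
        for i in range(len(words) - num_words + 1):
            candidate = ' '.join(words[i:i + num_words])
            if candidate in concept_set:
                found.append(candidate)
    return found

print(find_concepts("The cow is a domestic animal.", ["cow", "domestic animal"]))
# ['cow', 'domestic animal']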

Delete regex matches within loop and continue with updated string version

I have a string that I want to run through four wordlists: one with four-grams, one with trigrams, one with bigrams, and one with single terms. To avoid a word from the single-term wordlist being counted twice when it also forms part of a bigram or trigram, for example, I start by counting four-grams, then want to update the string by removing the matches so that only the remaining part of the string is checked for matches of trigrams, bigrams, and single terms, respectively. I have used the following code, illustrated here starting with four-grams and then trigrams:
financial_trigrams_count=0
financial_fourgrams_count=0
strn="thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."
pattern_fourgrams=["value to the business", "car and truck sales"]
pattern_trigrams=["cash flow statement", "chief financial officer"]
for i in pattern_fourgrams:
    financial_fourgrams_count=financial_fourgrams_count+strn.count(i)
new_strn=strn
def clean_text1(pattern_fourgrams, new_strn):
    for r in pattern_fourgrams:
        new_strn = re.sub(r, '', new_strn)
    return new_strn
for i in pattern_trigrams:
    financial_trigrams_count=financial_trigrams_count+new_strn.count(i)
new_strn1=new_strn
def clean_text2(pattern_trigrams, new_strn1):
    for r in pattern_trigrams:
        new_strn1 = re.sub(r, '', new_strn1)
    return new_strn1
print(financial_fourgrams_count)
print(financial_trigrams_count)
word_count_wostop=len(strn.split())
print(word_count_wostop)
For four-grams there is no match, so new_strn will be identical to strn. However, there is one match among the trigrams ("chief financial officer"); yet I do not succeed in deleting the match from new_strn1. Instead, I again get the full string, namely strn (or new_strn, which is the same).
Could someone help me find the mistake here?
(As a complement to Tilak Putta's answer)
Note that you are searching the string twice: once when counting the occurrences of the ngrams with .count() and once more when you remove the matches using re.sub().
You can increase performance by counting and removing at the same time.
This can be done using re.subn. This function takes the same parameters as re.sub but returns a tuple containing the cleaned string as well as the number of matches.
Example:
for r in pattern_fourgrams:
    new_strn, n = re.subn(r, '', new_strn)
    financial_fourgrams_count += n
Note that this assumes the n-grams are pairwise different (for fixed n), i.e. they shouldn't share a common word, since subn will delete that word the first time it sees it and thus won't be able to find occurrences of other n-grams containing that particular word.
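Putting that together, a minimal sketch of the count-and-remove pass over both pattern lists (the helper name count_and_remove is mine, and re.escape is added as a precaution since the patterns are handed to the regex engine):
import re

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."
pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

def count_and_remove(patterns, text):
    # Count each pattern's occurrences and delete them from the text in one pass
    total = 0
    for p in patterns:
        text, n = re.subn(re.escape(p), '', text)
        total += n
    return total, text

financial_fourgrams_count, remaining = count_and_remove(pattern_fourgrams, strn)
financial_trigrams_count, remaining = count_and_remove(pattern_trigrams, remaining)
print(financial_fourgrams_count)  # 0
print(financial_trigrams_count)   # 1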
You need to remove the def statements:
import re

financial_trigrams_count=0
financial_fourgrams_count=0
strn="thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."
pattern_fourgrams=["value to the business", "car and truck sales"]
pattern_trigrams=["cash flow statement", "chief financial officer"]
for i in pattern_fourgrams:
    financial_fourgrams_count=financial_fourgrams_count+strn.count(i)
new_strn=strn
for r in pattern_fourgrams:
    new_strn = re.sub(r, '', new_strn)
for i in pattern_trigrams:
    financial_trigrams_count=financial_trigrams_count+new_strn.count(i)
new_strn1=new_strn
for r in pattern_trigrams:
    new_strn1 = re.sub(r, '', new_strn1)
print(new_strn1)
print(financial_fourgrams_count)
print(financial_trigrams_count)
word_count_wostop=len(strn.split())
print(word_count_wostop)

python, re.search / re.split for phrases which look like a title, i.e. starting with an upper case

I have a list of phrases (input by the user) that I'd like to locate in a text file, for example:
titles = ['Blue Team', 'Final Match', 'Best Player',]
text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'
1. I can find all the occurrences of these phrases like so:
titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M)
list = [ t for t in titre.split(text) if titre.search(t) ]
(For simplicity, I am assuming a perfect spacing.)
2. I can also find variants of these phrases, e.g. 'Blue team', 'final Match', 'best player', using re.I, if they ever appear in the text.
But I want to restrict this to finding only variants of the input phrases with their first letter upper-cased, e.g. 'Blue team' in the text, regardless of how they were entered as input, e.g. 'bluE tEAm'.
Is it possible to write something to "block" the re.I flag for a portion of a phrase? In pseudo-code I imagine generating something like '[B]lue Team|[F]inal Match'.
Note: My primary goal is not, for example, calculating frequency of the input phrases in the text but extracting and analyzing the text fragments between or around them.
I would use re.I and modify the list-comp to:
l = [ t for t in titre.split(text) if titre.search(t) and t[0].isupper() ]
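For reference, a runnable version of that suggestion with the question's example data (the variable name l is kept from the answer):
import re

titles = ['Blue Team', 'Final Match', 'Best Player']
text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'
titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M | re.I)
# Keep only the fragments that contain a title and start with an upper-case letter
l = [t for t in titre.split(text) if titre.search(t) and t[0].isupper()]
print(l)  # ['Final match', 'Best player', 'Blue Team']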
I think regular expressions prior to Python 3.6 won't let you specify a region where the ignore-case flag applies (newer versions support scoped inline flags such as (?i:...)). However, you can generate a new version of the text in which every character is lower-cased except the first one of each word:
new_text = ' '.join([word[0] + word[1:].lower() for word in text.split()])
This way, a regular expression without the ignore flag will match taking into account the casing only for the first character of each word.
How about modifying the input so that it is in the correct case before you use it in the regular expression?
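Since Python 3.6, re supports scoped inline flags such as (?i:...), so the '[B]lue Team|[F]inal Match' idea from the question can be built directly. A sketch combining that with normalizing the input phrases (the variable names are illustrative):
import re

titles = ['bluE tEAm', 'Final Match', 'best player']  # however the user typed them
text = 'In today Final match, The Best player is Joe from the Blue Team.'
# First letter must match in upper case; the rest is case-insensitive
pattern = '|'.join(
    '{}(?i:{})'.format(re.escape(t[0].upper()), re.escape(t[1:]))
    for t in titles
)
titre = re.compile(pattern)
print(titre.findall(text))  # ['Final match', 'Best player', 'Blue Team']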
