Python: Get the longest matching keyword mentions in text

I have a list of keywords, including their variations, that I search for in text, like:
keywords = ['US Dollar', 'Australian Dollar', 'Dollar', 'Dollars']
and I want to look up these keywords in texts like:
'Dollar News: The Australian Dollar slumped in the face of a recovering US Dollar'
and get the most comprehensive (i.e. longest) matches, which here are 'Dollar' at the beginning of the sentence, 'Australian Dollar', and 'US Dollar' (and not 'Dollar' in those cases, for instance).
I have so far tried this:
keywords.sort(key = len, reverse=True)
first = lambda text, kws: next((k for k in kws if k in text), None)
first(myText, keywords)
which returns 'Australian Dollar' as it is the longest match. How can I get other matches (here, 'Dollar' in 'Dollar News...' and 'US Dollar') as well?

# -*- coding: utf-8 -*-
"""
Created on Thu Jun 13 14:21:59 2019
@author: jainil
"""
keywords = ['US Dollar', 'Australian Dollar', 'Dollar', 'Dollars']
keywords.sort(key=len, reverse=True)
text = 'The Australian Dollar slumped in the face of a recovering US Dollar'

dictt = {}
for i in keywords:
    dictt[i] = text.count(i)

max_len = 0
max_value = 0
for i in dictt.keys():
    if len(i.split()) > max_len and dictt[i] > 0:
        max_len = len(i.split())
    if dictt[i] > max_value:
        max_value = dictt[i]

for i, j in dictt.items():
    if len(i.split()) == max_len and j == max_value:
        print(i, j)

A solution is to use suffix trees to get the positions of every keyword mention and then handle the overlaps, as suggested by @EricDuminil.
Here is my function for extracting the positions of keywords kws in the source text txt:
from suffix_trees import STree

def findMentions(txt, kws):
    st = STree.STree(txt)
    spans = []
    for kw in kws:
        starts = st.find_all(kw)
        spans.extend([(item, item + len(kw)) for item in starts])
    bounds = handleOverlap(spans)
    return bounds
and here is the function to handle overlapping character positions:
def handleOverlap(spans):
    # Drop the shorter of any two overlapping spans so only the longest
    # match survives. Indices to delete are collected in a set instead of
    # popping from the list while iterating over it, which would skip elements.
    del_in = set()
    for i, x in enumerate(spans):
        if i in del_in:
            continue
        for j, y in enumerate(spans):
            if j in del_in or i == j:
                continue
            # The two character ranges intersect
            if x[0] <= y[1] and y[0] <= x[1]:
                if x[1] - x[0] > y[1] - y[0]:
                    del_in.add(j)
                elif y[1] - y[0] > x[1] - x[0]:
                    del_in.add(i)
                    break
    return [s for i, s in enumerate(spans) if i not in del_in]
I just had to add spaces to both ends of each keyword to avoid matching words that merely contain a keyword, like 'petrodollar'. The results are the non-overlapping start and end positions of the longest matching keyword mentions.
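As a minimal usage sketch, assuming the question's sample text and keyword list (the text is padded with spaces as well, so keywords at either end of the string can still match; the returned positions are relative to the padded string):
text = 'Dollar News: The Australian Dollar slumped in the face of a recovering US Dollar'
keywords = ['US Dollar', 'Australian Dollar', 'Dollar', 'Dollars']

padded = [' ' + kw + ' ' for kw in keywords]
print(findMentions(' ' + text + ' ', padded))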

Specifying word boundaries for multiple string replacement with regex?

I'm trying to mask city names in a list of texts using 'PAddress' tags. To do this, I borrowed thejonny's solution here for how to perform multiple regex substitutions using a dictionary with regex expressions as keys. In my implementation, the cities are keys and the values are tags that correspond to the exact format of the keys (this is important because the format must be preserved down the line). E.g., {East-Barrington: PAddress-PAddress}, so East-Barrington would be replaced by PAddress-PAddress; one tag per word, with punctuation and spacing preserved. Below is my code - sub_mult_regex() is the helper function called by mask_multiword_cities().
import re
import difflib

def sub_mult_regex(text, keys, tag_type):
    '''
    Replaces/masks multiple words at once
    Parameters:
        text: TIU note
        keys: a list of words to be replaced by the regex
        tag_type: string you want the words to be replaced with
    Creates a replacement dictionary of keys and values
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St., PAddress PAddress PAddress.,}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)
    add_vals = []
    for val in keys:
        # To preserve the precise punctuation etc. of the keys, only replace word matches with tags
        add_vals.append(re.sub(r'\w{1,100}', tag_type, val))
    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))
    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("(" + key + ")" for key in add_dict), re.IGNORECASE)
    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text)  # text_sub is masked
    else:
        text_sub = text  # Not all texts have names, so text_sub would've been NoneType and broken the function otherwise
    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)
    case_a = text
    case_b = text_sub
    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]
    return text_sub, diff_list

def mask_multiword_cities(text_string):
    multi_word_cities = list(set(
        [city for city in us_cities_all
         if len(city.split(' ')) > 1 and len(city) > 3
         and "Mc" not in city and "State" not in city and city != 'Mary D']))
    return sub_mult_regex(text_string, multi_word_cities, "PAddress")
The problem is, the keys in the regex dictionary don't have word boundaries specified, so while only exact matches should be tagged (case-insensitively), phrases like 'around others' get tagged because the regex finds the city 'Round O' inside them (it is, technically, a substring of that phrase). Take this example text, run through the mask_multiword_cities function:
add_string = "The cities are Round O , NJ and around others"
mask_multiword_cities(add_string)
#(output): ('The cities are PAddress PAddress NJ , and aPAddress PAddressthers', [' Round', ' O', ' around', ' others'])
The output should only be ('The cities are PAddress PAddress NJ , and around others', [' Round', ' O']). I've tried converting each key to a regex expression like r"\b(?=\w)key\b(?!\w)" at various points in the sub_mult_regex function (lines 26 and 37) but that didn't work as expected.
For testing, assume that:
us_cities_all = ['Great Barrington', 'Round O', 'East Orange'].
Also, if anyone can help make this run faster/be more efficient, that would be great! Right now, it takes about 30 seconds to run on a 1000-word note, likely because us_cities_all contains 5,000 cities. Let me know if it would be more helpful to directly post the cities list; I wasn't sure how to do so.
I figured out a word-boundary based solution that would handle multiple cities, in case anyone might find it helpful in a similar situation:
import re
import difflib

def sub_mult_regex(text, keys, tag_type, city):
    '''
    Replaces/masks multiple words at once
    Parameters:
        text: TIU note
        keys: a list of words to be replaced by the regex
        tag_type: string you want the words to be replaced with
        city: bool, True if replacing cities, False if replacing anything else
    Creates a replacement dictionary of keys and values
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St, PAddress PAddress PAddress}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)
    if city:
        # If we're masking a city, handle word boundaries.
        # Only keeping keys that actually show up in the text speeds the code up a lot,
        # since it's not cross-referencing against thousands of cities, only the ones present.
        keys = [r"\b" + key + r"\b" for key in keys if key in text or key.upper() in text]
        add_vals = []
        for val in keys:
            # Create a city word:PAddress entry by splitting the city on the '\\b'
            # that remains and then adding one tag per word.
            # Ex: '\\bDeer Island\\b' --> split('\\b') --> ['', 'Deer Island', ''] --> ''.join
            #     --> (key) Deer Island : (value) PAddress PAddress
            add_vals.append(re.sub(r'\w{1,100}', tag_type, ''.join(val.split('\\b'))))
        add_vals = [re.sub(r'\\b', "", val) for val in add_vals]
    else:
        # If we're not masking a city, we don't do the word boundary step
        add_vals = []
        for val in keys:
            # To preserve the precise punctuation etc. of the keys, only replace word matches with tags
            add_vals.append(re.sub(r'\w{1,100}', tag_type, val))
    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))
    print("add_dict:", add_dict)
    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("(" + key + ")" for key in add_dict), re.IGNORECASE)
    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text)  # text_sub is masked text
    else:
        text_sub = text  # Not all texts have names, so text_sub would've been NoneType and broken the function otherwise
    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)
    case_a = text
    case_b = text_sub
    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]
    return text_sub, diff_list
# sample call:
add_string = 'The cities are Round O NJ, around others and East Orange'
mask_multiword_cities(add_string)  # this function remained the same
# output:
# add_dict: {'\\bEast Orange\\b': 'PAddress PAddress', '\\bRound O\\b': 'PAddress PAddress'}
# ('The cities are PAddress PAddress NJ, around others and PAddress PAddress', [' Round', ' O', ' East', ' Orange'])
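For reference, the core of the fix can be seen in isolation. This is a minimal sketch with a hypothetical two-city list standing in for us_cities_all: the alternation is wrapped in \b word boundaries, and a callback tags each word of a matched city:
import re

cities = ['Round O', 'East Orange']  # hypothetical stand-in for us_cities_all
# \b on both sides keeps 'Round O' from matching inside 'around others'
pattern = re.compile("|".join(r"\b" + re.escape(c) + r"\b" for c in cities), re.IGNORECASE)

def mask(match):
    # Replace each word of the matched city with one tag, preserving spacing
    return re.sub(r'\w+', 'PAddress', match.group(0))

print(pattern.sub(mask, 'The cities are Round O NJ, around others'))
# -> 'The cities are PAddress PAddress NJ, around others'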

Python program to find if a certain keyword is present in a list of documents (string)

Question: A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word.
The function should meet the following criteria:
Do not include documents where the keyword string shows up only as part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”
She does not want you to distinguish upper case from lower case letters, so the phrase “Closed the case.” would be included when the keyword is “closed”.
Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.
My code:
keywords = ["casino"]

def multi_word_search(document, keywords):
    dic = {}
    z = []
    for word in document:
        i = document.index(word)
        token = word.split()
        new = [j.rstrip(",.").lower() for j in token]
        for k in keywords:
            if k.lower() in new:
                dic[k] = z.append(i)
            else:
                dic[k] = []
    return dic
It should return {'casino': [0]} for document=['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?'] and keywords=['casino'], but it returns {'casino': []} instead. I wonder if someone could help me?
I would first tokenize each document into words and build a set from them to speed up lookups. If you want case-insensitive matching, you need to lowercase both sides:
for k in keywords:
    s = set(new)  # 'new' already holds the cleaned, lowercased tokens
    if k.lower() in s:
        dic.setdefault(k, []).append(i)  # note: z.append(i) returns None, so don't assign it
    # (no else branch: resetting dic[k] to [] would wipe earlier hits)
return dic
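Putting those fixes together, a complete version of the function might look like this sketch (enumerate replaces the fragile document.index() call, and each keyword gets its result list created once up front):
def multi_word_search(document, keywords):
    dic = {k: [] for k in keywords}  # one result list per keyword, created once
    for i, text in enumerate(document):  # enumerate avoids document.index()
        tokens = {w.rstrip(",.").lower() for w in text.split()}
        for k in keywords:
            if k.lower() in tokens:
                dic[k].append(i)
    return dic

document = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
print(multi_word_search(document, ['casino']))  # {'casino': [0]}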
This is not as trivial as it seems. From an NLP (natural language processing) point of view, splitting a text into words is not trivial (it is called tokenisation).
import nltk
# Optional stemming, see the note below:
# stemmer = nltk.stem.PorterStemmer()

def multi_word_search(documents, keywords):
    # Initialize result dictionary
    dic = {kw: [] for kw in keywords}
    for i, doc in enumerate(documents):
        # Preprocess document
        doc = doc.lower()
        tokens = nltk.word_tokenize(doc)
        # tokens = [stemmer.stem(token) for token in tokens]
        # Search each keyword
        for kw in keywords:
            # kw = stemmer.stem(kw.lower())
            kw = kw.lower()
            if kw in tokens:
                # If found, add to result dictionary
                dic[kw].append(i)
    return dic

documents = ['The Learn Python Challenge Casino', 'They bought a car',
             'Casinoville?', 'Some casinos']
keywords = ['casino']
multi_word_search(documents, keywords)
To increase matching you can use stemming (it removes plurals and verb inflections, e.g. running -> run).
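For instance, a quick check with NLTK's PorterStemmer (this is what the commented-out stemmer lines above would apply to both tokens and keywords):
import nltk

stemmer = nltk.stem.PorterStemmer()
print(stemmer.stem('casinos'))  # casino
print(stemmer.stem('running'))  # run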
This should work too:
document = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
keywords = ['casino', 'car']

def findme(term):
    for x in document:
        val = x.split(' ')
        for v in val:
            if term.lower() == v.lower():
                return document.index(x)

for key in keywords:
    n = findme(key)
    print(f'{key}:{n}')

How to get all sentences that contain multiple words in Python

I am trying to write a regular expression to get all sentences containing two words (order doesn't matter), but I can't find a solution.
"Supermarket. This apple costs 0.99."
I want to get back the following sentence:
This apple costs 0.99.
I tried:
([^.]*?(apple)*?(costs)[^.]*\.)
I have problems because the price contains a dot. Also, this expression returns results containing only one of the words.
Approach: for each phrase, we have to find the sentences that contain all of the phrase's words. So, for each word in the given phrase, we check whether a sentence contains it, and we do this for each sentence. This search becomes faster if the words of each sentence are stored in a set instead of a list.
Below is an implementation of the above approach in Python:
def getRes(sent, ph):
    sentHash = dict()
    # Loop for adding hashed sentences to sentHash
    for s in range(1, len(sent) + 1):
        sentHash[s] = set(sent[s - 1].split())
    # For each phrase
    for p in range(0, len(ph)):
        print("Phrase" + str(p + 1) + ":")
        # Get the list of words
        wordList = ph[p].split()
        res = []
        # Then check in every sentence
        for s in range(1, len(sentHash) + 1):
            wCount = len(wordList)
            # Every word in the phrase
            for w in wordList:
                if w in sentHash[s]:
                    wCount -= 1
            # If every word in the phrase matches
            if wCount == 0:
                # Add sentence index to result array
                res.append(s)
        if len(res) == 0:
            print("NONE")
        else:
            print(' '.join(map(str, res)))

# Driver function
def main():
    sent = ["Strings are an array of characters",
            "Sentences are an array of words"]
    ph = ["an array of", "sentences are strings"]
    getRes(sent, ph)

main()
You use a negated character class [^.], which matches any character except a dot.
But in your example data, Supermarket. This apple costs 0.99., there are 2 dots before the dot at the end, so you cannot cross the dot after Supermarket. to reach apple.
You could, for example, match until the first dot, then assert costs and use a capture group to match the part with apple, making sure the line ends with a dot.
The assertion for word 1 combined with a match for word 2 matches the words in both orders.
^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$
Explanation
^[^.]*\. From the start of the string, match until and including the first dot
\s* Match 0+ whitespace characters
(?=.*\bcosts\b) Positive lookahead, assert costs at the right
( Capture group 1 (this has the desired value)
.*\bapple\b.*\. Match the rest of the line that includes apple and ends with a dot
) Close group 1
$ Assert end of string
import re

regex = r"^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$"
test_str = ("Supermarket. This apple costs 0.99.\n"
            "Supermarket. This costs apple 0.99.\n"
            "Supermarket. This apple is 0.99.\n"
            "Supermarket. This orange costs 0.99.")
print(re.findall(regex, test_str, re.MULTILINE))
Output
['This apple costs 0.99.', 'This costs apple 0.99.']
I also suggest first extracting the sentences and then finding the sentences that contain both words.
However, splitting text into sentences is pretty hard because of abbreviations, unusual names, etc. One way to do it is with the nltk.tokenize.punkt module.
You'll need to install NLTK and then run this in Python:
import nltk
nltk.download('punkt')
After that, you can use the English-language sentence tokenizer together with two regex searches:
import nltk.data, re

TEXT = 'Mr. Bean is in supermarket. iPhone 12 by Apple Inc. costs $999.99.'
WORD1 = 'apple'
WORD2 = 'costs'

# Regex helper: whole-word, case-insensitive search
find_word = lambda w, s: re.search(r'(^|\W)' + w + r'(\W|$)', s, re.I)

eng_sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
for sent in eng_sent_detector.tokenize(TEXT):
    if find_word(WORD1, sent) and find_word(WORD2, sent):
        print(sent, "\n----")
Output:
iPhone 12 by Apple Inc. costs $999.99.
----
Notice that it handles numbers and abbreviations for you.

Extract first element from the list that occurs after a particular word

I have a string and a list as follow:
text = 'Sherlock Holmes. PARIS. Address: 221B Baker Street, london. Solving case in Madrid.'
city = ['Paris', 'London', 'Madrid']
I want to extract the first element from the list that occurs after the word Address.
Here's my approach to the problem using nltk:
import nltk

loc = None
flag = False
for word in nltk.word_tokenize(text):
    if word == 'Address':
        flag = True
    if flag:
        if word.capitalize() in city:
            loc = word
            break
print(loc)
I am getting the expected result from the above, which is london. But in the real scenario my text is very large, and so is the list of cities; is there a better way to do this?
The lowest-hanging fruit I see is that you can turn city into a set for constant-time membership checks. Besides that, consider using next with a default argument to return the next city.
city = {'Paris', 'London', 'Madrid'}

while text:
    text = text.partition('Address')[-1].strip()
    print(next((w for w in nltk.word_tokenize(text)
                if w.capitalize() in city), None))
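Run against the sample text and city set above, this prints london on the first pass; on the next pass partition finds no further Address, so the remainder is empty, None is printed, and the loop ends. Set membership makes each in city check constant time, which matters once the city list grows large.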

Python re.findall() purpose in this code

I am currently learning Python and I am trying to decipher code I found online. The point of the code is to compare a data string against a user-input raw-string key and, if it matches, return the data string.
I am having trouble understanding what re.findall() is doing in this code.
So head[0] contains a data string
('2016-12-22 06:28:36', u'Kith x New Era K 59FIFTY Cap - Pink',
u'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')
Key contains a raw string
key=r'Nike|Ultra'
head = self.data
for k in key:
    print k
    flag = re.findall(k, str(head[0]), flags=re.I)
    print len(flag)
    if len(flag) > 4:
        print head[0]
From my understanding, the purpose of the code is to loop through key and see if it matches head[0]; if it matches, it returns head[0]. However, it still returns head[0]
('2016-12-22 06:28:36', u'Kith x New Era K 59FIFTY Cap - Pink',
u'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')
even though it doesn't match.
It is supposed to print items in head if they match the key regex.
Use the following code then:
import re

head = ('2016-12-22 06:28:36', 'nike item', 'ultra item',
        'Kith x New Era K 59FIFTY Cap - Pink',
        'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')
key = r'Nike|Ultra'  # This is a regex pattern that matches `Nike` or `Ultra`

for s in head:  # Iterate the items in head
    if re.search(key, s, flags=re.I):  # Search for a match in each item, case-insensitively
        print(s)  # Print if found
Output: nike item and ultra item.
In your code, you loop through the characters of the pattern with for k in key:. re.findall then searches for all non-overlapping matches of that single character k, and only head[0] is checked; the other items are never considered.
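A quick illustration of that pitfall (a standalone sketch, independent of the original self.data):
key = r'Nike|Ultra'
print(list(key))
# ['N', 'i', 'k', 'e', '|', 'U', 'l', 't', 'r', 'a']
# Each k is a single character, so re.findall(k, str(head[0]), flags=re.I)
# merely counts occurrences of that character in head[0].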
