Specifying word boundaries for multiple string replacement with regex? - python

I'm trying to mask city names in a list of texts using 'PAddress' tags. To do this, I borrowed thejonny's solution here for performing multiple regex substitutions using a dictionary with regex patterns as keys. In my implementation, the cities are the keys and the values are tags that mirror the exact format of the keys (this is important because the format must be preserved down the line). E.g., {East-Barrington: PAddress-PAddress}, so East-Barrington would be replaced by PAddress-PAddress: one tag per word, with punctuation and spacing preserved. Below is my code; sub_mult_regex() is the helper function called by mask_multiword_cities().
import re
import difflib

def sub_mult_regex(text, keys, tag_type):
    '''
    Replaces/masks multiple words at once
    Parameters:
    Text: TIU note
    Keys: a list of words to be replaced by the regex
    Tag_type: string you want the words to be replaced with
    Creates a replacement dictionary of keys and values
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St., PAddress PAddress PAddress.,}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)
    add_vals = []
    for val in keys:
        add_vals.append(re.sub(r'\w{1,100}', tag_type, val)) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags
    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))
    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("("+key+")" for key in add_dict), re.IGNORECASE)
    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text) # text_sub is masked
    else:
        text_sub = text # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise
    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)
    case_a = text
    case_b = text_sub
    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]
    return text_sub, diff_list

def mask_multiword_cities(text_string):
    multi_word_cities = list(set([city for city in us_cities_all if len(city.split(' ')) > 1 and len(city) > 3 and "Mc" not in city and "State" not in city and city != 'Mary D']))
    return sub_mult_regex(text_string, multi_word_cities, "PAddress")
The problem is that the keys in the regex dictionary don't have word boundaries specified. Only exact matches should be tagged (case insensitive), but phrases like 'around others' get tagged because the city 'Round O' occurs inside them as a substring. Take this example text, run through the mask_multiword_cities function:
add_string = "The cities are Round O , NJ and around others"
mask_multiword_cities(add_string)
#(output): ('The cities are PAddress PAddress NJ , and aPAddress PAddressthers', [' Round', ' O', ' around', ' others'])
The output should only be ('The cities are PAddress PAddress NJ , and around others', [' Round', ' O']). I've tried converting each key to a regex expression like r"\b(?=\w)key\b(?!\w)" at various points in the sub_mult_regex function (lines 26 and 37) but that didn't work as expected.
For testing, assume that:
us_cities_all = ['Great Barrington', 'Round O', 'East Orange'].
Also, if anyone can help make this run faster/be more efficient, that would be great! Right now, it takes about 30 seconds to run on a 1000-word note, likely because us_cities_all contains 5,000 cities. Let me know if it would be more helpful to directly post the cities list, I wasn't sure how to do so.

I figured out a word-boundary based solution that would handle multiple cities, in case anyone might find it helpful in a similar situation:
def sub_mult_regex(text, keys, tag_type, city):
    '''
    Replaces/masks multiple words at once
    Parameters:
    text: TIU note
    keys: a list of words to be replaced by the regex
    tag_type: string you want the words to be replaced with
    city: bool, True if replacing cities, False if replacing anything else
    Creates a replacement dictionary of keys and values
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St, PAddress PAddress PAddress}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)
    if city:
        # If we're masking a city, handle word boundaries
        # This step of only including keys if they show up in the text speeds the code up by a lot, since it's not cross-referencing against thousands of cities, only the ones present
        keys = [r"\b"+key+r"\b" for key in keys if key in text or key.upper() in text] # add word boundaries for each key in list
        add_vals = []
        for val in keys:
            # Create dictionary of city word:PAddress by splitting the city on the '\\b' char that remains and then adding one tag per word
            # Ex: '\\bDeer Island\\b' --> split('\\b') --> ['', 'Deer Island', ''] --> ''.join --> (key) Deer Island : (value) PAddress PAddress
            add_vals.append(re.sub(r'\w{1,100}', tag_type, ''.join(val.split('\\b')))) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags
        add_vals = [re.sub(r'\\b', "", val) for val in add_vals]
    elif not city:
        # If we're not masking a city, we don't do the word boundary step
        add_vals = []
        for val in keys:
            add_vals.append(re.sub(r'\w{1,100}', tag_type, val)) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags
    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))
    print("add_dict:", add_dict)
    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("("+key+")" for key in add_dict), re.IGNORECASE)
    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text) # text_sub is masked text
    else:
        text_sub = text # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise
    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)
    case_a = text
    case_b = text_sub
    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]
    return text_sub, diff_list
# sample call:
add_string = 'The cities are Round O NJ, around others and East Orange'
mask_multiword_cities(add_string) # this function remained the same
# output: add_dict: {'\\bEast Orange\\b': 'PAddress PAddress', '\\bRound O\\b': 'PAddress PAddress'} ('The cities are PAddress PAddress NJ, around others and PAddress PAddress', [' Round', ' O', ' East', ' Orange'])
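For comparison, here is a more compact sketch of the same word-boundary idea. It is not the code from the post: mask_phrases is a hypothetical helper name, and it assumes the keys are literal city names (so they are passed through re.escape) and that one combined pattern is acceptable. The replacement is computed from the matched text itself, so no per-key group bookkeeping is needed.

```python
import re

def mask_phrases(text, phrases, tag="PAddress"):
    """Mask whole-word, case-insensitive occurrences of any phrase (hypothetical helper)."""
    if not phrases:
        return text
    # Longest phrases first, so a longer name wins over a shorter one it starts with
    ordered = sorted(set(phrases), key=len, reverse=True)
    # Lookarounds play the role of \b: no word character directly before or after the match
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(p) for p in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    # Replace each word of the matched text with one tag, keeping punctuation and spacing
    return pattern.sub(lambda m: re.sub(r"\w+", tag, m.group(0)), text)

us_cities_all = ["Great Barrington", "Round O", "East Orange"]
masked = mask_phrases("The cities are Round O NJ, around others and East Orange", us_cities_all)
print(masked)
```

Because all names are compiled into a single alternation and the text is scanned once, this may also help with the run time mentioned in the question, though that has not been measured here.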

Related

Python program to find if a certain keyword is present in a list of documents (string)

Question: A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word.
The function should meet the following criteria:
Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”
She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”
Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.
My code:-
keywords=["casino"]
def multi_word_search(document,keywords):
    dic={}
    z=[]
    for word in document:
        i=document.index(word)
        token=word.split()
        new=[j.rstrip(",.").lower() for j in token]
        for k in keywords:
            if k.lower() in new:
                dic[k]=z.append(i)
            else:
                dic[k]=[]
    return dic
It should return {'casino': [0]} when given document=['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?'], keywords=['casino'], but I got {'casino': []} instead.
I wonder if someone could help me?
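I would build a set from the tokens to speed up lookup (new is already a list of tokens, so it does not need another split()).
For a case-insensitive comparison you need to lowercase both sides. Also note that z.append(i) returns None, so append to the list stored in dic instead, and do not reset it in an else branch:
# inside the loop over documents, after building new:
s = set(new)
for k in keywords:
    dic.setdefault(k, [])  # keep earlier hits instead of resetting them
    if k.lower() in s:
        dic[k].append(i)
# after the loop over documents:
return dic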
This is not as trivial as it seems. From an NLP (natural language processing) point of view, splitting a text into words is a task of its own, called tokenisation.
import nltk
# Optional: uncomment the three stemmer lines to also match inflected forms
# stemmer = nltk.stem.PorterStemmer()
def multi_word_search(documents, keywords):
    # Initialize result dictionary
    dic = {kw: [] for kw in keywords}
    for i, doc in enumerate(documents):
        # Preprocess document
        doc = doc.lower()
        tokens = nltk.word_tokenize(doc)  # requires NLTK's tokenizer data to be installed
        # tokens = [stemmer.stem(token) for token in tokens]
        # Search each keyword
        for kw in keywords:
            needle = kw.lower()
            # needle = stemmer.stem(needle)
            if needle in tokens:
                # If found, add to result dictionary under the original keyword
                dic[kw].append(i)
    return dic
documents = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?', 'Some casinos']
keywords=['casino']
multi_word_search(documents, keywords)
To match more forms of a word you can use stemming (it strips plurals and verb inflections, e.g. running -> run).
This should work too..
document=['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
keywords=['casino', 'car']
def findme(term):
    for x in document:
        val = x.split(' ')
        for v in val:
            if term.lower() == v.lower():
                return document.index(x)
for key in keywords:
    n = findme(key)
    print(f'{key}:{n}')
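The three criteria in the question (whole words only, case-insensitive, punctuation ignored) can also be expressed with a regex word boundary. The following is a minimal sketch rather than any of the posted answers; it assumes keywords are plain words, which is why they are passed through re.escape.

```python
import re

def multi_word_search(documents, keywords):
    """Map each keyword to the indices of the documents containing it as a whole word."""
    result = {kw: [] for kw in keywords}
    for kw in keywords:
        # \b rejects "enclosed" or "Casinoville"; re.IGNORECASE handles "Closed"
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for i, doc in enumerate(documents):
            if pattern.search(doc):
                result[kw].append(i)
    return result

documents = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
found = multi_word_search(documents, ['casino', 'car'])
print(found)
```

A trailing period or comma is not a word character, so "It is closed." still matches the keyword "closed" without any explicit stripping.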

Python: Get the longest matching keyword mentions in text

I have a list of keywords including their variations that I search in text like :
keywords = ['US Dollar', 'Australian Dollar', 'Dollar', 'Dollars']
and I want to look up these keywords in texts like :
'Dollar News: The Australian Dollar slumped in the face of a recovering US Dollar'
and get the most comprehensive (i.e. longest) matches, which are 'Dollar' at the beginning of the sentence, plus 'Australian Dollar' and 'US Dollar' (and not just 'Dollar' in those two cases).
I have so far tried this:
keywords.sort(key = len, reverse=True)
first = lambda text, kws: next((k for k in kws if k in text), None)
first(myText, keywords)
which returns 'Australian Dollar' as it is the longest match. How can I get other matches (here, 'Dollar' in 'Dollar News...' and 'US Dollar') as well?
# -*- coding: utf-8 -*-
"""
Created on Thu Jun 13 14:21:59 2019
#author: jainil
"""
keywords = ['US Dollar', 'Australian Dollar', 'Dollar', 'Dollars']
keywords.sort(key = len, reverse=True)
keywords
text='The Australian Dollar slumped in the face of a recovering US Dollar'
dictt={}
for i in keywords:
    dictt[i]=text.count(i)
max_len=0
max_value=0
for i in dictt.keys():
    if len(i.split())>max_len and dictt[i]>0:
        max_len= len(i.split())
        if(dictt[i]>max_value):
            max_value=dictt[i]
for i,j in dictt.items():
    if(len(i.split())==max_len and j==max_value):
        print(i,j)
A solution is to use suffix trees to get the positions of every keyword mention and then handle the overlaps, as suggested by @EricDuminil.
Here is my function for extracting the positions of the keywords kws in the source text txt:
from suffix_trees import STree
def findMentions(txt, kws):
    st = STree.STree(txt)
    spans = []
    for kw in kws:
        starts = st.find_all(kw)
        spans.extend([(item, item+len(kw)) for item in starts])
    bounds = handleOverlap(spans)
    return bounds
and here is the function to handle overlapping character positions:
def handleOverlap(spans):
    del_in = []
    for x in spans:
        if spans.index(x) in del_in: continue
        for y in spans:
            if spans.index(y) in del_in: continue
            if x == y: continue
            if len(set(list(range(x[0],x[1]+1))) & set(list(range(y[0],y[1]+1)))) > 0:
                if len(list(range(x[0],x[1]+1))) > len(list(range(y[0],y[1]+1))):
                    del_in.append(spans.index(y))
                    spans.pop(spans.index(y))
                elif len(list(range(y[0],y[1]+1))) > len(list(range(x[0],x[1]+1))):
                    del_in.append(spans.index(x))
                    spans.pop(spans.index(x))
    return spans
I just had to add spaces to the both ends of each keyword to avoid getting words containing a keyword, like 'petrodollar'. The results are the non-overlapping start and end positions for the longest corresponding mentioned keywords.
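As an alternative to the suffix-tree approach, a regex alternation with the keywords sorted longest first gives non-overlapping matches in a single pass. This is only a sketch: find_longest_mentions is a hypothetical name, and it assumes that taking the leftmost match (and, at a given position, the longest keyword) is the desired way to resolve overlaps.

```python
import re

def find_longest_mentions(text, keywords):
    """Return (start, end, keyword) for non-overlapping, whole-word keyword mentions."""
    # The regex engine tries alternatives left to right, so list longer keywords first
    ordered = sorted(set(keywords), key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b")
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

keywords = ['US Dollar', 'Australian Dollar', 'Dollar', 'Dollars']
text = 'Dollar News: The Australian Dollar slumped in the face of a recovering US Dollar'
mentions = find_longest_mentions(text, keywords)
print([kw for _, _, kw in mentions])
```

The \b anchors serve the same purpose as padding the keywords with spaces: a word such as 'petrodollar' is not reported as a mention of 'Dollar'.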

Remove mirrored duplicate strings in list python?

What is an efficient python algorithm to remove all mirrored text duplicates in a list where the items are in the format as below?
ExList = [' dutch italian english', ' italian english dutch', ' dutch italian german', ' dutch german italian' ]
Required result: [' dutch english italian ', 'dutch german italian' ]
This solution uses the set data structure and focuses on producing compact code, mostly with list/set/generator comprehensions. If this is a homework task for a beginner course and you just copy the result, it will be very obvious that you did not write the code yourself. Try to follow the thought process and reproduce the results yourself.
1) Split each element at " " (space):
for item in ExList:
    splitted = item.split(" ")
2) Remove the elements that are now empty due to superfluous spaces in the input. This can be done in one line together with the step above (empty strings are "falsy") using a list comprehension:
for item in ExList:
    splitted = [lang for lang in item.split(" ") if lang]
3) Put the result in a set, which by definition disregards order and ignores duplicates. For this step we primarily need the property that set equality ignores order, meaning set([1, 2]) == set([2, 1]). This can be combined with the line above using a generator expression:
for item in ExList:
    itemSet = set(lang for lang in item.split(" ") if lang)
Now, within that loop, put all those sets of languages into another set. This time, because all the item sets with the same items in any order are considered equal, the outer set will automatically disregard any duplicates. To be able to put the item set into another set, it needs to be immutable (because mutability might cause a change in identity), which is called a frozenset in python. The code looks like this:
ExList = [' dutch italian english', ' italian english dutch', ' dutch italian german', ' dutch german italian' ]
result = set()
for item in ExList:
    result.add(frozenset(lang for lang in item.split(" ") if lang))
Or, as a set comprehension on one line:
result = {frozenset(lang for lang in item.split(" ") if lang) for item in ExList}
The result is as follows:
>>> print(result)
{frozenset({'italian', 'dutch', 'german'}), frozenset({'italian', 'dutch', 'english'})}
You can turn that back into lists if the printed set output looks confusing to you:
>>> print([list(itemSet) for itemSet in result])
[['italian', 'dutch', 'german'], ['italian', 'dutch', 'english']]
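To illustrate why the inner sets in the answer above have to be frozenset objects, here is a small check (the variable names are only for illustration). Two frozensets with the same members compare equal and hash equally regardless of insertion order, whereas a plain set cannot be placed inside another set at all.

```python
a = frozenset(["dutch", "italian", "english"])
b = frozenset(["italian", "english", "dutch"])

# Same members in a different order: equal, with equal hashes
same = (a == b) and (hash(a) == hash(b))

# A plain (mutable) set is unhashable, so it cannot be an element of a set
try:
    {set(["dutch"])}
    plain_set_hashable = True
except TypeError:
    plain_set_hashable = False

# The outer set therefore keeps only one of the two mirrored entries
outer = {a, b}
print(same, plain_set_hashable, len(outer))
```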
This may work for you:
def unique_list(items):
    # Sort the words of each entry so mirrored entries become identical tuples
    x = set([tuple(sorted(s.split())) for s in items])
    return [" ".join(s) for s in x]
print(unique_list(ExList))
This might not be the most efficient solution, but hope it will be of some help.
Using the property that keys of dictionary are unique.
m_dict = {}
for a in ExList:
    b = a.split()
    b.sort()
    m_dict[' '.join(b)] = None
print(list(m_dict.keys()))

split text by any first item matched from a list

I am looking for an elegant way to find the first match from a list of prepositions in a text so that I can parse a text like "Add shoes behind the window", the result should be ["shoes","behind the window"]
It works as long as there are not multiple prepositions in the text
my keys behind the window | before: my keys | after: behind the window
my keys under the table in the kitchen | before: my keys under the table | after: in the kitchen
my keys in the box under the table in the kitchen | before: my keys | after: in the box under the table in the kitchen
In the 2nd example, the result should be ["my keys","under the table in the kitchen"]
What's an elegant way to find the first match of any of the words in the list?
def get_text_after_preposition_of_place(text):
    """Returns the texts before[0] and after[1] <preposition of place>"""
    prepositions_of_place = ["in front of","behind","in","on","under","near","next to","between","below","above","close to","beside"]
    textres = ["",""]
    for key in prepositions_of_place:
        if textres[0] == "":
            if key in text:
                textres[0] = text.split(key, 1)[0].strip()
                textres[1] = key + " " + text.split(key, 1)[1].strip()
    return textres
You can do that using re.split:
import re
def get_text_after_preposition_of_place(text):
    """Returns the texts before[0] and after[1] <preposition of place>"""
    prepositions_of_place = ["in front of","behind","in","on","under","near","next to","between","below","above","close to","beside"]
    preps_re = re.compile(r'\b(' + '|'.join(prepositions_of_place) + r')\b')
    split = preps_re.split(text, maxsplit=1)
    return split[0], split[1]+split[2]
print(get_text_after_preposition_of_place('The cat in the box on the table'))
# ('The cat ', 'in the box on the table')
First, we create a regex that will look like (in|on|under). Note the parentheses: they will allow us to capture the strings on which we split the string in order to keep them in the output.
Then, we split, allowing 1 split at most, and concatenate the last two parts: the preposition and the rest of the string.
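A variant of the same idea, sketched with re.search instead of re.split so that a text without any preposition does not raise an IndexError. The names PREPOSITIONS_OF_PLACE and split_at_first_preposition are hypothetical, and returning an empty second element when nothing matches is an assumption about the desired behaviour.

```python
import re

PREPOSITIONS_OF_PLACE = ["in front of", "behind", "in", "on", "under", "near",
                         "next to", "between", "below", "above", "close to", "beside"]

# Longest first, so "in front of" is tried before "in" at the same position
_PREP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in
                        sorted(PREPOSITIONS_OF_PLACE, key=len, reverse=True)) + r")\b"
)

def split_at_first_preposition(text):
    """Split text at the earliest preposition; the preposition stays with the second part."""
    match = _PREP_RE.search(text)
    if match is None:
        return text.strip(), ""
    return text[:match.start()].strip(), text[match.start():].strip()

print(split_at_first_preposition("my keys under the table in the kitchen"))
```

Because the search scans the text from left to right, the earliest preposition in the sentence is chosen, independent of its position in the list.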

Python re.findall() purpose in this code

I am currently learning Python and I am trying to decipher some code I found online. The point of the code is to compare the raw string with a user-supplied key and, if it matches, return the raw string.
I am having trouble understanding what re.findall() is doing in this code.
So head[0] contains a data string
('2016-12-22 06:28:36', u'Kith x New Era K 59FIFTY Cap - Pink',
u'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')
Key contains a raw string
key=r'Nike|Ultra'
head = self.data
for k in key:
    print k
    flag=re.findall(k,str(head[0]),flags=re.I)
    print len(flag)
    if len(flag)>4:
        print head[0]
From my understanding, the purpose of the code is to loop through key and see if it matches head[0]. If it matches, it returns head[0]. However, it still prints head[0]
('2016-12-22 06:28:36', u'Kith x New Era K 59FIFTY Cap - Pink',
u'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')
even though it doesn't match.
It is supposed to print the items in head only if they match the key regex.
Use the following code then:
import re
head = ('2016-12-22 06:28:36', 'nike item', 'ultra item', 'Kith x New Era K 59FIFTY Cap - Pink', 'http://kithnyc.com/products/kith-x-new-era-59fifty-cap-pink')
key=r'Nike|Ultra' # This is a regex pattern, matches `Nike` or `Ultra`
for s in head: # Iterate the items in head
    if re.search(key, s, flags=re.I): # Search for a match in each item, case insensitively
        print(s) # Print if found
Output: nike item and ultra item.
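In your code, for k in key: loops through the individual characters of the pattern string rather than using the pattern as a whole. Each re.findall call then returns all non-overlapping matches of that single character in str(head[0]). The character | on its own is a pattern that matches the empty string at every position, so for any string of four or more characters it yields more than 4 matches. In addition, only head[0] is checked; the other items are never considered.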
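The difference between iterating over the pattern string and using it as a whole can be seen in a short experiment. This is only an illustration using the product title from the question; the variable names are made up.

```python
import re

key = r'Nike|Ultra'
title = 'Kith x New Era K 59FIFTY Cap - Pink'

# Iterating over a string yields its characters, not the alternatives of the regex
chars = list(key)

# Number of matches for each single character used as its own pattern
per_char_counts = {k: len(re.findall(k, title, flags=re.I)) for k in key}

# The pattern as a whole finds nothing, because the title has neither "Nike" nor "Ultra"
whole_pattern_matches = re.findall(key, title, flags=re.I)

print(chars)
print(per_char_counts)
print(whole_pattern_matches)
```

The count for '|' equals len(title) + 1, one empty match per position, which is why the len(flag) > 4 test in the original loop succeeds even though the title does not match Nike|Ultra.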
