split text by any first item matched from a list - python

I am looking for an elegant way to find the first match from a list of prepositions in a text, so that I can parse a text like "Add shoes behind the window" into the result ["shoes", "behind the window"].
My code (below) works as long as there are not multiple prepositions in the text; these are the results it produces:
"my keys behind the window" -> before: "my keys", after: "behind the window"
"my keys under the table in the kitchen" -> before: "my keys under the table", after: "in the kitchen"
"my keys in the box under the table in the kitchen" -> before: "my keys", after: "in the box under the table in the kitchen"
In the 2nd example, the result should be ["my keys","under the table in the kitchen"]
What's an elegant way to find the first match of any of the words in the list?
def get_text_after_preposition_of_place(text):
    """Returns the texts before[0] and after[1] <preposition of place>"""
    prepositions_of_place = ["in front of", "behind", "in", "on", "under", "near", "next to", "between", "below", "above", "close to", "beside"]
    textres = ["", ""]
    for key in prepositions_of_place:
        if textres[0] == "":
            if key in text:
                textres[0] = text.split(key, 1)[0].strip()
                textres[1] = key + " " + text.split(key, 1)[1].strip()
    return textres

You can do that using re.split:
import re

def get_text_after_preposition_of_place(text):
    """Returns the texts before[0] and after[1] <preposition of place>"""
    prepositions_of_place = ["in front of", "behind", "in", "on", "under", "near", "next to", "between", "below", "above", "close to", "beside"]
    preps_re = re.compile(r'\b(' + '|'.join(prepositions_of_place) + r')\b')
    split = preps_re.split(text, maxsplit=1)
    return split[0], split[1] + split[2]

print(get_text_after_preposition_of_place('The cat in the box on the table'))
# ('The cat ', 'in the box on the table')
First, we create a regex that will look like \b(in front of|behind|in|...)\b. Note the capturing parentheses: they allow us to keep the preposition we split on in the output. The \b word boundaries keep a short preposition like in from matching inside a word such as window.
Then, we split, allowing 1 split at most, and concatenate the last two parts: the preposition and the rest of the string.
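Note that this assumes at least one preposition is present: if none matches, re.split returns a single-element list and split[1] + split[2] raises an IndexError. A minimal sketch of a guard for that case (my own variation on the function above, returning the whole text as the "before" part):

import re

prepositions_of_place = ["in front of", "behind", "in", "on", "under", "near",
                         "next to", "between", "below", "above", "close to", "beside"]
preps_re = re.compile(r'\b(' + '|'.join(prepositions_of_place) + r')\b')

def split_on_first_preposition(text):
    # Split at most once; the capturing group keeps the preposition in the result.
    parts = preps_re.split(text, maxsplit=1)
    if len(parts) == 1:  # no preposition of place found
        return parts[0].strip(), ''
    return parts[0].strip(), parts[1] + parts[2]

print(split_on_first_preposition('my keys under the table in the kitchen'))
# ('my keys', 'under the table in the kitchen')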


Specifying word boundaries for multiple string replacement with regex?

I'm trying to mask city names in a list of texts using 'PAddress' tags. To do this, I borrowed thejonny's solution here for how to perform multiple regex substitutions using a dictionary with regex expressions as keys. In my implementation, the cities are keys and the values are tags that correspond to the exact format of the keys (this is important because the format must be preserved down the line). E.g., {East-Barrington: PAddress-PAddress}, so East-Barrington would be replaced by PAddress-PAddress; one tag per word, with punctuation and spacing preserved. Below is my code - sub_mult_regex() is the helper function called by mask_multiword_cities().
import difflib
import re

def sub_mult_regex(text, keys, tag_type):
    '''
    Replaces/masks multiple words at once
    Parameters:
        Text: TIU note
        Keys: a list of words to be replaced by the regex
        Tag_type: string you want the words to be replaced with
    Creates a replacement dictionary of keys and values
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St., PAddress PAddress PAddress.,}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)
    add_vals = []
    for val in keys:
        add_vals.append(re.sub(r'\w{1,100}', tag_type, val))  # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags
    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))
    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("(" + key + ")" for key in add_dict), re.IGNORECASE)
    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text)  # text_sub is masked
    else:
        text_sub = text  # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise
    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)
    case_a = text
    case_b = text_sub
    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]
    return text_sub, diff_list

def mask_multiword_cities(text_string):
    multi_word_cities = list(set([city for city in us_cities_all if len(city.split(' ')) > 1 and len(city) > 3 and "Mc" not in city and "State" not in city and city != 'Mary D']))
    return sub_mult_regex(text_string, multi_word_cities, "PAddress")
The problem is that the keys in the regex dictionary don't have word boundaries specified, so while only exact matches should be tagged (case-insensitively), phrases like 'around others' get tagged because the regex finds the city 'Round O' inside them (it is, technically, a substring). Take this example text, run through the mask_multiword_cities function:
add_string = "The cities are Round O , NJ and around others"
mask_multiword_cities(add_string)
#(output): ('The cities are PAddress PAddress NJ , and aPAddress PAddressthers', [' Round', ' O', ' around', ' others'])
The output should only be ('The cities are PAddress PAddress NJ , and around others', [' Round', ' O']). I've tried converting each key to a regex expression like r"\b(?=\w)key\b(?!\w)" at various points in the sub_mult_regex function (lines 26 and 37) but that didn't work as expected.
For testing, assume that:
us_cities_all = ['Great Barrington', 'Round O', 'East Orange'].
Also, if anyone can help make this run faster/be more efficient, that would be great! Right now, it takes about 30 seconds to run on a 1000-word note, likely because us_cities_all contains 5,000 cities. Let me know if it would be more helpful to directly post the cities list, I wasn't sure how to do so.
I figured out a word-boundary based solution that would handle multiple cities, in case anyone might find it helpful in a similar situation:
def sub_mult_regex(text, keys, tag_type, city):
    '''
    Replaces/masks multiple words at once
    Parameters:
        text: TIU note
        keys: a list of words to be replaced by the regex
        tag_type: string you want the words to be replaced with
        city: bool, True if replacing cities, False if replacing anything else
    Creates a replacement dictionary of keys and values
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St, PAddress PAddress PAddress}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)
    if city:
        # If we're masking a city, handle word boundaries
        # This step of only including keys if they show up in the text speeds the code up by a lot, since it's not cross-referencing against thousands of cities, only the ones present
        keys = [r"\b" + key + r"\b" for key in keys if key in text or key.upper() in text]  # add word boundaries for each key in list
        add_vals = []
        for val in keys:
            # Create dictionary of city word:PAddress by splitting the city on the '\\b' char that remains and then adding one tag per word
            # Ex: '\\bDeer Island\\b' --> split('\\b') --> ['', 'Deer Island', ''] --> ''.join --> (key) Deer Island : (value) PAddress PAddress
            add_vals.append(re.sub(r'\w{1,100}', tag_type, ''.join(val.split('\\b'))))  # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags
        add_vals = [re.sub(r'\\b', "", val) for val in add_vals]
    elif not city:
        # If we're not masking a city, we don't do the word boundary step
        add_vals = []
        for val in keys:
            add_vals.append(re.sub(r'\w{1,100}', tag_type, val))  # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags
    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))
    print("add_dict:", add_dict)
    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("(" + key + ")" for key in add_dict), re.IGNORECASE)
    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text)  # text_sub is masked text
    else:
        text_sub = text  # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise
    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)
    case_a = text
    case_b = text_sub
    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]
    return text_sub, diff_list
# sample call:
add_string = 'The cities are Round O NJ, around others and East Orange'
mask_multiword_cities(add_string)  # this function remained the same
# output:
# add_dict: {'\\bEast Orange\\b': 'PAddress PAddress', '\\bRound O\\b': 'PAddress PAddress'}
# ('The cities are PAddress PAddress NJ, around others and PAddress PAddress', [' Round', ' O', ' East', ' Orange'])
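For comparison, here is a shorter sketch of the same word-boundary idea (my own variation, not the code above; it reports the masked words directly instead of going through difflib). It escapes each city, keeps only the ones actually present in the text, wraps the whole alternation in \b...\b and tags each word of a match inside the replacement callback:

import re

def mask_cities(text, cities, tag="PAddress"):
    # Keep only cities that actually occur in the text (case-insensitive), longest first.
    present = [c for c in cities
               if re.search(r"\b" + re.escape(c) + r"\b", text, re.IGNORECASE)]
    if not present:
        return text, []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in sorted(present, key=len, reverse=True)) + r")\b",
        re.IGNORECASE)
    masked = []
    def repl(match):
        words = match.group(0).split()
        masked.extend(words)
        return " ".join(tag for _ in words)  # one tag per word of the city name
    return pattern.sub(repl, text), masked

print(mask_cities("The cities are Round O , NJ and around others",
                  ["Great Barrington", "Round O", "East Orange"]))
# ('The cities are PAddress PAddress , NJ and around others', ['Round', 'O'])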

what is the fast way to match words in text?

I have a list of regexes like:
regex_list = [".+rive.+",".+ll","[0-9]+ blue car.+"......] ## list of length 3000
What is the best method to match all these regexes against my text?
For example:
text : Hello, Owning 2 blue cars for a single driver
So in the output, I want to have a list of matched words:
matched_words = ["Hello","4 blue cars","driver"] ##Hello <==>.+llo
Alright, first of all, you will probably want to adjust your regex_list, because as it stands, matching those patterns will give you the entire text back as the match. This is because of .+, which says that any character may follow, any number of times. What I have done here is the following:
import re
regex_list = [".rive.",".+ll.","[0-9]+ blue car."]
text = "Hello, Owning 2 blue cars for a single driver"
# Returns all the spans of matched regex items in text
spans = [re.search(regex_item,text).span() for regex_item in regex_list]
# Sorts the spans on first occurrence (so, first element in item for every item in span).
spans.sort()
# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]
print(matching_texts)
I adjusted your regex_list slightly, so it does not match the entire text. Then, I retrieve all spans from the matches with the text. Additionally, I sort the spans on first occurrence. Lastly, I retrieve the texts via the indexes of the spans and print those out. What you will get is the following:
['Hello', '2 blue cars', 'driver']
NOTE: I am unsure why you would like to match '4 blue cars', because that is not in your text.
You could also try this, which is a multi-threaded version of @Lexpj's answer:
from concurrent.futures import ThreadPoolExecutor, as_completed
import re

# list of length 3000
regex_list = [".rive.", ".+ll.", "[0-9]+ blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "

def test(text, regex):
    # Returns all the spans of matched regex items in text
    spans = [re.search(regex, text).span()]
    # Sorts the spans on first occurrence (so, first element in item for every item in span).
    spans.sort()
    # Retrieves the text via index of spans in text.
    matching_texts = [text[x[0]:x[1]] for x in spans]
    return matching_texts

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(test, my_string, regex)
               for regex in regex_list}
    # as_completed() gives you the threads once finished
    matched = set()
    for f in as_completed(futures):
        # Get the results
        rs = f.result()
        matched = matched.union(set(rs))

print(matched)
Looking at the desired result, your regexes are not correct. You don't want to match .+, but \w+, and also with the second regex, you'll want to match some letters after ll too.
The main idea is then to make one regex for all, by concatenating them with the | symbol:
import re
regex_list = [r"\w+rive\w+", r"\w+ll\w+", r"\d+ blue car\w+"]
regex = re.compile('|'.join(regex_list))
text = "Hello, Owning 2 blue cars for a single driver "
print(regex.findall(text)) # ["Hello","2 blue cars","driver"]
This still could give undesired effects when there is a part of your string that would match with more than one regex in the list. In that case the first will "win". So make sure that when multiple regexes could match the same text, they are ordered along their desired priority.
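A quick illustration of that ordering effect, using made-up alternatives rather than the ones from the question:

import re

text = "Hello, Owning 2 blue cars for a single driver"
# The leftmost alternative that matches at a given position wins.
print(re.findall(r"blue|blue car\w+", text))   # ['blue']
print(re.findall(r"blue car\w+|blue", text))   # ['blue cars']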

Python to Find-replace a string and Create Two Paragraphs Before String in Words Document

I have a VBA macro. In it, I have:
.Find Text = 'Pollution'
.Replacement Text = '^p^pChemical'
Here, '^p^pChemical' means: replace the word Pollution with Chemical and create two empty paragraphs before the word Chemical.
In the before and after screenshots (omitted here), the word Pollution has been replaced with Chemical and two empty paragraphs precede it. This is how I want it in Python.
My Code so far:
import docx
from docx import Document

document = Document('Example.docx')
for paragraph in document.paragraphs:
    if 'Pollution' in paragraph.text:
        replace(Pollution, Chemical)
        document.add_paragraph(before('Chemical'))
        document.add_paragraph(before('Chemical'))
I want to open a word document to find the word, replace it with another word, and create two empty paragraphs before the replaced word.
You can search through each paragraph to find the word of interest, and call insert_paragraph_before to add the new elements:
def replace(doc, target, replacement):
    for par in list(doc.paragraphs):
        text = par.text
        while (index := text.find(target)) != -1:
            # Move the text before the match into its own paragraph,
            # then insert an empty paragraph above the current one.
            par.insert_paragraph_before(text[:index].rstrip())
            par.insert_paragraph_before('')
            text = replacement + text[index + len(target):]
            par.text = text
list(doc.paragraphs) makes a copy of the list, so that the iteration is not thrown off when you insert elements.
Call this function as many times as you need to replace whatever words you have.
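A minimal usage sketch for the document from the question (the output file name here is just an assumption, chosen so the original stays untouched):

from docx import Document

document = Document('Example.docx')
replace(document, 'Pollution', 'Chemical')
document.save('Example_edited.docx')  # hypothetical output name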
This will take the text from your document, replace the instances of the word Pollution with Chemical and add paragraphs in between, but it doesn't change the first document; instead it creates a copy. This is probably the safer route to go anyway.
import re
from docx import Document

ref = {"Pollution": "Chemicals", "Ocean": "Sea", "Speaker": "Magnet"}

def get_old_text():
    doc1 = Document('demo.docx')
    fullText = []
    for para in doc1.paragraphs:
        fullText.append(para.text)
    text = '\n'.join(fullText)
    return text

def create_new_document(ref, text):
    doc2 = Document()
    lines = text.split('\n')
    for line in lines:
        for k in ref:
            if k.lower() in line.lower():
                parts = re.split(f'{k}', line, flags=re.I)
                doc2.add_paragraph(parts[0])
                for part in parts[1:]:
                    doc2.add_paragraph('')
                    doc2.add_paragraph('')
                    doc2.add_paragraph(ref[k] + " " + part)
    doc2.save('demo.docx')

text = get_old_text()
create_new_document(ref, text)
You need to use \n for new line. Using re should work like so:
import re
before = "The term Pollution means the manifestation of any unsolicited foregin substance in something. When we talk about pollution on earth, we refer to the contamination that is happening of the natural resources by various pollutants"
pattern = re.compile("pollution", re.IGNORECASE)
after = pattern.sub("\n\nChemical", before)
print(after)
Which will output:
The term
Chemical means the manifestation of any unsolicited foregin substance in something. When we talk about
Chemical on earth, we refer to the contamination that is happening of the natural resources by various pollutants

Function to extract company register number from text string using Regex

I have a function which extracts the company register number (German: Handelsregisternummer) from a given text. Although my regex for this particular problem matches the correct format (please see the demo), I cannot extract the correct company register number.
I want to extract HRB 142663 B but I get HRB 142663.
Most numbers are in the format HRB 123456 but sometimes there is the letter B attached to the end.
import re

def get_handelsregisternummer(string, keyword):
    # https://regex101.com/r/k6AGmq/10
    reg_1 = fr'\b{keyword}[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*)(?: B)?'
    match = re.compile(reg_1)
    handelsregisternummer = match.findall(string)  # list of matched words
    if handelsregisternummer:  # not empty
        return handelsregisternummer[0]
    else:  # no match found
        handelsregisternummer = ""
        return handelsregisternummer
Example text scraped from a website (the stripped line breaks have run some words together):
text_impressum = """"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"""
Apply function:
for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    handelsregisternummer = get_handelsregisternummer(text_impressum, keyword=keyword)
    if handelsregisternummer:  # if list is not empty anymore, then do...
        handelsregisternummer = keyword + " " + handelsregisternummer
        break

if not handelsregisternummer:  # if list is empty
    handelsregisternummer = 'not specified'

handelsregisternummer_dict = {'handelsregisternummer': handelsregisternummer}
Afterwards I get:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663'}
But I want this:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663 B'}
You need to use two capturing groups in the regex to capture the keyword and the number, and just match the rest:
reg_1 = fr'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
# |_________| |___________________|
Then, you need to concatenate (join) the capturing groups matched and returned by findall:

if handelsregisternummer:  # if list is not empty anymore, then do...
    handelsregisternummer = " ".join(handelsregisternummer)
    break
See the Python demo.
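For reference, a quick check of the revised pattern against the example text from the question, which should now pick up the trailing B:

import re

text_impressum = """"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"""
reg_1 = r'\b(HRB)[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
print(re.findall(reg_1, text_impressum))               # [('HRB', '142663 B')]
print(" ".join(re.findall(reg_1, text_impressum)[0]))  # HRB 142663 B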

Parsing complicated list of strings using regex, loops, enumerate, to produce a pandas dataframe

I have a long list of many elements, each element being a string. See the sample below:
data = ['BAT.A.100', 'Regulation 2020-1233', 'this is the core text of', 'the regulation referenced ',
'MOC to BAT.A.100', 'this', 'is', 'one method of demonstrating compliance to BAT.A.100',
'BAT.A.120', 'Regulation 2020-1599', 'core text of the regulation ...', ' more free text','more free text',
'BAT.A.145', 'Regulation 2019-3333', 'core text of' ,'the regulation1111',
'MOC to BAT.A.145', 'here is how you can show compliance to BAT.A.145','more free text',
'MOC2 to BAT.A.145', ' here is yet another way of achieving compliance']
My desired output is ultimately a pandas DataFrame with the columns Short_ref, Full_Reg_ref, Reg_text, Moc_text and MOC2 (shown as a table in the original post).
As the strings may have to be concatenated, I first join all the elements into a single string, using ## to separate the pieces of text that have been joined.
I am going for an all-regex approach because there would be a lot of conditions to check otherwise.
import re
import pandas as pd

re_req = re.compile(r'##(?P<Short_ref>BAT\.A\.\d{3})'
                    r'##(?P<Full_Reg_ref>Regulation\s\d{4}-\d{4})'
                    r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))'
                    r'(?:##)?(?:(?P<Moc_text>.*?MOC2 to \1)(?P<MOC2>(?:##)?.*?(?=##BAT\.A\.\d{3})(?!\1)|.+)'
                    r'|(?P<Moc_text_temp>.*?(?=##BAT\.A\.\d{3})(?!\1)))')

final_list = []
for match in re_req.finditer("##" + "##".join(data)):
    inner_list = [match.group('Short_ref').replace("##", " "),
                  match.group('Full_Reg_ref').replace("##", " "),
                  match.group('Reg_text').replace("##", " ")]
    if match.group('Moc_text_temp'):  # just Moc_text is present
        inner_list += [match.group('Moc_text_temp').replace("##", " "), ""]
    elif match.group('Moc_text') and match.group('MOC2'):  # both Moc_text and MOC2 are present
        inner_list += [match.group('Moc_text').replace("##", " "), match.group('MOC2').replace("##", " ")]
    else:  # neither Moc_text nor MOC2 is present
        inner_list += ["", ""]
    final_list.append(inner_list)

final_df = pd.DataFrame(final_list, columns=['Short_ref', 'Full_Reg_ref', 'Reg_text', 'Moc_text', 'MOC2'])
The first and second lines of the regex are the same as what you posted earlier and identify the first two columns.
In the third line of the regex, r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))' matches all text up to MOC to Short_ref, or else all the text before the next Reg_text. The (?=##BAT\.A\.\d{3})(?!\1) part takes the text up to the next Short_ref pattern, provided that Short_ref is not the current one.
The fourth line is for when Moc_text and MOC2 are both present, and it is alternated with the fifth line, which covers the case when just Moc_text is present. This part of the regex is similar to the third line.
Finally, we loop over all the matches using finditer and construct the rows of the DataFrame.
final_df: (the resulting DataFrame, shown as a table in the original post)
