I'm writing a scanner, so I'm matching an arbitrary string against a list of regex rules. It would be useful if I could emulate Java's "hitEnd" functionality: knowing not just when the regular expression didn't match, but when it can't match; that is, when the matcher reached the end of the input before deciding the string was rejected, indicating that a longer input might satisfy the rule.
For example, suppose I'm matching the HTML tag that starts bolding a sentence, of the form "<b>". So I compile my rule
bold_html_rule = re.compile("<b>")
And I run some tests:
good_match = bold_html_rule.match("<b>")
uncertain_match = bold_html_rule.match("<")
bad_match = bold_html_rule.match("goat")
How can I tell the difference between the "bad" match, for which "goat" can never be made valid by more input, and the ambiguous match that isn't a match yet, but could be?
Attempts
It is clear that in the above form there is no way to distinguish, because both the uncertain attempt and the bad attempt return None. If I wrap every rule in "(RULE)?", then any input will return a match, because at the least the empty string is a prefix of every string. However, when I try to see how far the regex progressed before rejecting my string, using the group method or the endpos attribute, it is always just the length of the string.
Does the Python regex package do a lot of extra work and traverse the whole string even if it's an invalid match at the first character? I can see why it would have to if I used search, which checks whether the sequence appears anywhere in the input, but it seems very strange to do so for match.
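For illustration, here is a minimal demonstration of both observations: the wrapped rule either matches fully or matches the empty prefix, and Match.endpos is merely the search bound (the length of the string), not how far the engine got:
import re

wrapped = re.compile("(<b>)?")
for s in ["<b>", "<", "goat"]:
    m = wrapped.match(s)
    print(repr(s), m.end(), m.endpos)
# '<b>' 3 3
# '<' 0 1    <- end() cannot distinguish this from the hopeless case below
# 'goat' 0 4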
I've found the question asked before (in non-Stack Overflow places), like this one:
https://mail.python.org/pipermail/python-list/2012-April/622358.html
but it never really got a response.
I looked at the regular expression package itself but wasn't able to discern its behavior. Could I extend the package to get this result? Is this the wrong way to tackle my task in the first place? (I've built effective Java scanners using this strategy in the past.)
Try this out. It does feel like a hack, but at least it achieves the result you are looking for. Though I am a bit concerned about the PrepareCompileString function: it should be able to handle all the escaped characters, but it cannot handle any wildcards.
import re

# Grouping every single character of the rule in its own optional group
def PrepareCompileString(regexString):
    newstring = ''
    escapeFlag = False
    escapeString = ''
    for char in regexString:
        if escapeFlag:
            # Re-attach the backslash to the escaped character
            char = escapeString + char
            escapeFlag = False
            escapeString = ''
        if char == '\\':
            escapeFlag = True
            escapeString = char
        if not escapeFlag:
            newstring += '({})?'.format(char)
    return newstring
def CheckMatch(match):
    # Count the number of non-matched groups
    count = match.groups().count(None)
    # All groups matched       -> good match
    # No groups matched        -> bad match
    # Only some groups matched -> uncertain match
    if count == 0:
        print('Good Match:', match.string)
    elif count < len(match.groups()):
        print('Uncertain Match:', match.string)
    elif count == len(match.groups()):
        print('Bad Match:', match.string)
regexString = '<b>'
bold_html_rule = re.compile(PrepareCompileString(regexString))
good_match = bold_html_rule.match("<b>")
uncertain_match = bold_html_rule.match("<")
bad_match = bold_html_rule.match("goat")
for match in [good_match, uncertain_match, bad_match]:
    CheckMatch(match)
I got this result:
Good Match: <b>
Uncertain Match: <
Bad Match: goat
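For reference, the trick works because PrepareCompileString wraps every character of the rule in its own optional group:
print(PrepareCompileString('<b>'))  # (<)?(b)?(>)?
As an aside, if a third-party dependency is an option, the regex module (a superset of re) supports this natively via partial matches: regex.fullmatch('<b>', '<', partial=True) returns a match whose partial attribute is True, which is essentially Java's hitEnd.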
Related
I use Tesseract OCR to extract some text from different documents, then I process the extracted text with regex to see if it matches a specific pattern. Unfortunately, OCR extraction makes common mistakes on ambiguous characters, such as 5: S, 1: I, 0: O, 2: Z, 4: A, 8: B, etc. These mistakes are so common that substituting the ambiguous characters would make the text match the pattern perfectly.
Is there a way to postprocess the OCR extraction and substitute ambiguous characters (provided in advance) so that the result follows a specific pattern?
Expected output (and what I could think of so far):
# example: I am extracting car plate numbers that always follow the pattern [A-Z]{2}\d{5}
# patterns might differ for other examples, but will always be some alphanumeric combination
# complex patterns may be ignored with some warning like "unable to parse"
import re

def post_process(pattern, text, ambiguous_dict):
    # get text[0], check pattern
    # in this case it should be a letter: if not, try to replace from dict; if yes, pass
    # continue with the next letters until a match is found or the whole text has been looped over
    if match:
        return match
    else:
        # some error message
        return None
ambiguous_dict = {'2': 'Z', 'B': '8'}
# My plate photo text: AZ45287
# Noise is fairly easy to filter out by thresholding on the tesseract confidence level, although not ideal;
# so, if a function cannot be made that finds a match through the noise,
# the noise can be ignored in favor of a simpler function that can just find a match
ocr_output = "someNoise A2452B7 no1Ze"
# The 2 in position 1 should be replaced by Z, and the B by 8,
# while the 2 in position 5 should remain a 2, as per the pattern.
# Do this iteratively for each element of ocr_output until the pattern is matched, or return None.
# Any other functionally similar (recursive, generator, other) approach is also acceptable.
result = post_process(r"[A-Z]{2}\d{5}", ocr_output, ambiguous_dict)
if result:
    print(result)  # AZ45287
else:  # result is None
    print("failed to clean output")
I hope I explained my problem well, but feel free to request additional info.
As always with OCR, it is hard to come up with a 100% safe, working solution. In this case, what you can do is add the "corrupt" chars to the regex and then "normalize" the matches using dictionaries of replacements.
It means that you can't just use [A-Z]{2}\d{5}, because among the first two uppercase letters there can be a 2 (a misread Z), and among the five digits there can be a B (a misread 8). Thus, you need to change the pattern to ([A-Z2]{2})([\dB]{5}) here. Note the capturing parentheses that create two subgroups. To normalize each, you need two separate replacement dictionaries, as it appears you do not want to replace digits with letters in the numeric part (\d{5}) or letters with digits in the letter part ([A-Z]{2}).
So, here is how it can be implemented in Python:
import re
def post_process(pattern, text, ambiguous_dict_1, ambiguous_dict_2):
    matches = list(re.finditer(pattern, text))
    if len(matches):
        return [f"{x.group(1).translate(ambiguous_dict_1)}{x.group(2).translate(ambiguous_dict_2)}"
                for x in matches]
    else:
        return None
ambiguous_dict_1 = {ord('2'): 'Z'} # For the first group
ambiguous_dict_2 = {ord('B'): '8'} # For the second group
ocr_output = "someNoise A2452B7 no1Ze"
result = post_process(r"([A-Z2]{2})([\dB]{5})", ocr_output, ambiguous_dict_1, ambiguous_dict_2)
if result:
    print(result)  # => ['AZ45287']
else:  # result is None
    print("failed to clean output")
The ambiguous_dict_1 dictionary contains the digit to letter replacements and ambiguous_dict_2 contains the letter to digit replacements.
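Note that str.translate expects a mapping from character code points, which is why the dictionary keys are built with ord():
# str.translate maps code points to replacement characters
print('A2'.translate({ord('2'): 'Z'}))     # AZ
print('452B7'.translate({ord('B'): '8'}))  # 45287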
I have a string and rules/mappings for replacements and non-replacements.
E.g.
"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."
Replacement rules:
replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
Result:
"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
Additional criteria:
Only replace if the case matches, i.e. case matters.
Whole-word replacement only; punctuation should be ignored, but kept after the replacement.
I was wondering: what would be the cleanest way to solve this problem in Python 3.x?
Based on the answer of demongolem.
UPDATE
I am sorry, I missed the fact that only whole words should be replaced. I updated my code and even generalized it for usage in a function.
import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    # Match the token only when it is delimited by quotes, punctuation or spaces
    rx = rf"[\"'.,:; ]({replace_token})[\"'.,:; ]"
    matches = re.finditer(rx, sentence)
    out_sentence = sentence
    found = []
    indices = []
    for m in matches:
        indices.append(m.start(0))
        found.append(m.group())
    context_size = len(dont_replace)
    for i in range(len(indices)):
        context = sentence[indices[i] - context_size:indices[i] + context_size]
        if dont_replace in context:
            continue
        # First replace the word only in the substring found
        to_replace = found[i].replace(replace_token, replace_with)
        # Then replace the word in the context found, so any surrounding token
        # like quotes or '.' is carried over and the context does not change
        replace_val = context.replace(found[i], to_replace)
        # Finally replace the found context within the whole sentence
        out_sentence = out_sentence.replace(context, replace_val)
    return out_sentence
Use regular expressions to find all occurrences and values of your token (as we need to check whether it is a whole word or embedded in some other word), using finditer(). You might need to adjust the rx to match your definition of "whole word". Then get the context around these values, of the size of your no_replace rule, and check whether the context contains your no_replace string.
If not, you may replace it: use replace() on the found word only, then replace the occurrence of the word in the context, then replace the context in the whole text. That way the replacement is nearly unique and no weird behaviour should happen.
Using your examples, this leads to:
replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
and
replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'
After some research, this is what I believe to be the best and cleanest solution to my problem. The solution works by calling match_fun whenever a match has been found, and match_fun only performs the replacement if there is no "no-replace phrase" overlapping with the current match. Let me know if you need more clarification or if you believe something can be improved.
import re

replace_dict = ...  # The code below assumes you already have this
no_replace_dict = ...  # The code below assumes you already have this
text = ...  # The text on input.

def match_fun(match: re.Match):
    str_match: str = match.group()
    if str_match not in no_replace_dict:
        return replace_dict[str_match]
    for no_replace in no_replace_dict[str_match]:
        no_replace_matches_iter = re.finditer(r'\b' + no_replace + r'\b', text)
        for no_replace_match in no_replace_matches_iter:
            # Keep the original token if a no-replace phrase overlaps this match
            if match.start() <= no_replace_match.start() < match.end():
                return str_match
            if match.start() < no_replace_match.end() <= match.end():
                return str_match
    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + replace + r'\b')
    text = pattern.sub(match_fun, text)
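For instance, with the first example from the question (and assuming no_replace_dict maps each token to the phrases it must not be replaced inside), the loop above behaves as required:
replace_dict = {'sentence': 'processed_sentence'}
no_replace_dict = {'sentence': ['example sentence']}
text = "This is an example sentence that needs to be processed into a new sentence."
# after running the loop above, text becomes:
# "This is an example sentence that needs to be processed into a new processed_sentence."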
In Python, I am trying to find the last position in an arbitrary string that matches a given pattern, specified as a negative character set regex pattern. For example, with the string uiae1iuae200 and the pattern of not being a number (the Python regex for this would be [^0-9]), I would need 8 (the position of the last 'e' before the '200') as the result.
What is the most pythonic way to achieve this?
As it's a little tricky to quickly find the best-suited method for something in the Python docs (method documentation sits somewhere in the middle of the corresponding page, like re.search() in the re page), the best way I quickly found myself is using re.search() - but the current form simply must be a suboptimal way of doing it:
import re
string = 'uiae1iuae200' # the string to investigate
len(string) - re.search(r'[^0-9]', string[::-1]).start() - 1  # -> 8
I am not satisfied with this for two reasons:
- a) I need to reverse the string with [::-1] before searching, and
- b) I also need to map the resulting position back (len(string) - pos - 1) because of having reversed the string before.
There must be a better way to do this, likely even with the result of re.search().
I am aware of re.search(...).end() over .start(), but re.search() splits its results into groups, and I did not quickly find a non-cumbersome way to apply it to the last matched group. Without specifying a group, .start(), .end(), etc. always refer to the whole match, which does not carry the position information about the last match. However, selecting a group seems to first require saving the return value in a temporary variable (which prevents neat one-liners), as I would need both to select the last group and then to call .end() on it.
What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.
Update
The solution should also be functional in corner cases, like 123 (no position matches the regex), the empty string, and so on. It should not crash, e.g. by selecting the last index of an empty list. However, as even my ugly attempt above in the question would need more than one line for this, I guess a one-liner might be impossible here (simply because one needs to check the return value of re.search() or re.finditer() before handling it). For this reason, I'll accept pythonic multi-line solutions as well.
You can use re.finditer to extract the start positions of all matches and return the last one from the list. Try this Python code:
import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])
Prints:
8
Edit:
To make the solution a bit more elegant and behave properly for all kinds of inputs, here is the updated code. Now the solution takes two lines, as we have to check whether the list is empty: if it is, None is printed, else the index value:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    lst = [m.start() for m in re.finditer(r'\D', s)]
    print(s, '-->', lst[-1] if len(lst) > 0 else None)
Prints the following; where no such index is found, it prints None instead of the index:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
Edit 2:
As the OP stated in the post, \D was only an example we started with, which is why I came up with a solution that works with any general regex. But if this problem really only has to handle \D, then I can give a better solution that does not require a list comprehension at all: use a regex that finds the last occurrence of a non-digit character and print its position. We can use the .*(\D) regex for this: the greedy .* consumes as much as possible, so the captured \D is necessarily the last non-digit. Its index is then easy to print using the following Python code:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    m = re.match(r'.*(\D)', s)
    print(s, '-->', m.start(1) if m else None)
Prints each string and the corresponding index of its last non-digit char, or None if there isn't any:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
And as you can see, this code doesn't need any list comprehension and is better, as it finds the index with just one regex call to match.
But in case the OP indeed meant it to be written using any general regex pattern, then my code above using the comprehension will be needed. I could even write it as a function that takes the regex (like \D, or even a complex one) as an argument and uses it in the code; a sketch follows below. Let me know if this is indeed needed.
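For completeness, here is a minimal sketch of what such a generalized helper could look like (the helper name is mine, not from the question):
import re

def last_match_start(pattern, s):
    # Start index of the last match of `pattern` in `s`, or None if no match
    starts = [m.start() for m in re.finditer(pattern, s)]
    return starts[-1] if starts else None

print(last_match_start(r'\D', 'uiae1iuae200'))    # 8
print(last_match_start(r'[^abc]', 'abcabc1abc'))  # 6
print(last_match_start(r'\D', '123'))             # None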
To me it seems that you just want the last position that matches a given pattern (in this case, the not-a-number pattern).
This is as pythonic as it gets:
import re
string = 'uiae1iuae200'
pattern = r'[^0-9]'
match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)
Output:
8
Or the exact same as a function and with more test cases:
import re
def last_match(pattern, string):
    match = re.match(fr'.*({pattern})', string)
    return match.end(1) - 1 if match else None
cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]
for pattern, string in cases:
    print(f'{pattern}, {string}: {last_match(pattern, string)}')
Output:
[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4
This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)), but it's pretty straightforward and probably not too inefficient.
import re

def last_match(pattern, string):
    # Try suffixes from shortest to longest; the first suffix whose start
    # matches the pattern begins at the last matching position
    for i in range(1, len(string) + 1):
        substring = string[-i:]
        if re.match(pattern, substring):
            return len(string) - i
The idea is to iterate over the suffixes of string from the shortest to the longest, checking whether each one matches pattern.
Since we're checking from the end, we know for sure that the first matching suffix we meet starts at the last matching position.
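For example:
print(last_match(r'[^0-9]', 'uiae1iuae200'))  # 8
print(last_match(r'[^0-9]', '123'))           # None (no suffix matches)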
I have a custom script I want to extract data from with Python, but the only way I can think of is to take out the marked bits and then keep the unmarked bits, like "go up" and "go down" in this example:
string_a = '[start]go up[wait time=500]go down[p]'
string_b = '#onclick go up[wait time=500]go down active="False"'
In trying to do so, all I managed to do was extract the marked bits, but I can't figure out a way to save the data that isn't marked! It always gets lost when I extract the other bits!
This is the function I'm using to extract them. I call it multiple times in order to whittle away the markers, but I can't choose the order they get extracted in!
class Parsers:
    @staticmethod
    def extract(line, filters='[]'):
        # Returns (list of marked contents, remainder after the last marker)
        substring = line[:]
        contents = []
        for bracket in range(line.count(str(filters[0]))):
            startend = []
            for f in filters:
                now = substring.find(f)
                startend.append(now)
            contents.append(substring[startend[0]+1:startend[1]])
            substring = substring[startend[1]+1:]
        return contents, substring
By the way, the order I'm calling it in at the moment is like this. I think I should put the order back so the # comes first, but I don't want to break it again.
star_string, first = Parsers.extract(string_a, filters='* ')
bracket_string, substring = Parsers.extract(string_a, filters='[]')
at_string, final = Parsers.extract(substring, filters='# ')
Please excuse my bad Python; I learnt this all on my own and I'm still figuring it out.
You are doing some mighty malabarisms with Python string methods above - but if all you want is to extract the content within brackets, and get the remainder of the string, that is an easier thing to do with regular expressions (in Python, the re module):
import re

string_a = "[start]go up[wait time=500]go down[p]"

expr = re.compile(r"\[.*?\]")
contents = expr.findall(string_a)
substring = expr.sub("", string_a)
This simply tells the regexp engine to match a literal [, then whatever characters are there (.*?) up to the following ] (the ? makes the match lazy, so it stops at the next ] and not at the last one) - the findall call gets all such matches as a list of strings, and the sub call replaces all the matches with an empty string.
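With the string_a above, that yields:
print(contents)   # ['[start]', '[wait time=500]', '[p]']
print(substring)  # 'go upgo down'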
For all that is nice about regular expressions, they are less Python than a sub-programming language of their own. Check the documentation on them: https://docs.python.org/2/library/re.html
Still, a simpler way of doing what you had done is to check character by character, keeping some variables to "know" where you are in the string (inside a tag or not, for example) - just like we would think about the problem if we could look at only one character at a time. I will write the code with Python 3.x in mind - if you are still using Python 2.x, please convert your strings to unicode objects before trying something like this:
def extract(line, filters='[]'):
    substring = ""
    contents = []
    inside_tag = False
    partial_tag = ""
    for char in line:
        if char == filters[0] and not inside_tag:
            inside_tag = True
        elif char == filters[1] and inside_tag:
            contents.append(partial_tag)
            partial_tag = ""
            inside_tag = False
        elif inside_tag:
            partial_tag += char
        else:
            substring += char
    if partial_tag:
        print("Warning: unclosed tag '{}' ".format(partial_tag))
    return contents, substring
Notice that there is no need for complicated calculations of where each bracket falls in the line, and so on - you just get them all.
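For example:
contents, substring = extract("[start]go up[wait time=500]go down[p]")
print(contents)   # ['start', 'wait time=500', 'p']
print(substring)  # 'go upgo down'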
Not sure I understand this fully - you want to get [stuff in brackets] and everything else? If you are just parsing flat strings - no recursive brackets-in-brackets - you can do
import re
parse = re.compile(r"\[.*?\]|[^\[]+").findall
then
>>> parse('[start]go up[wait time=500]go down[p]')
['[start]', 'go up', '[wait time=500]', 'go down', '[p]']
>>> parse('#onclick go up[wait time=500]go down active="False"')
['#onclick go up', '[wait time=500]', 'go down active="False"']
The regex translates as "everything between two square brackets OR anything up to but not including an opening square bracket".
If this isn't what you wanted - do you want #word to be a separate chunk? - please show what string_a and string_b should be parsed as!
I have a series of regular expressions that are called in order. I need to check the first one, then the second, then the third, and so on, right the way to the end. I need to do some processing on the matched string, so I'm trying to avoid too much logic, but in Python, unlike Perl, I do not think I can perform assignment in the if-elif-elif... blocks, so I'll end up doing an assignment, then checking for a match, then getting the results of that match. For example:
m = re.search(patternA, string)
if m:
    stripped = m.group(0)
    xyz = stripped[45:67]
else:
    m = re.search(patternB, string)
    if m:
        stripped = m.group(0)
        abc = stripped[5:7]
    else:
        m = re.search(patternC, string)
        if m:
            stripped = m.group(0)
            txt = stripped[4:5]
        else:
            ...
Ideally I'd like to find a better structure that preserves the ordering of the tested regular expressions and also lets me incorporate the assignment into the if-then statements. For example:
if (m = re.search(patternA, string)):
    stripped = m.group(0)
    xyz = stripped[45:67]
elif (m = re.search(patternB, string)):
    stripped = m.group(0)
    abc = stripped[5:7]
...
What is the most pythonic way of dealing with this? Thanks.
The use case is to read old data - very old data. Each string may include information about particular values, and these are only present if the regular expression matches a particular pattern. So the variables extracted are highly dependent upon what matches.
for pattern, slc in zip([patternA, patternB, patternC],
                        [slice(45, 67), slice(5, 7), slice(4, 5)]):
    m = re.search(pattern, string)
    if m:
        value = m.group(0)[slc]
        break
else:
    # Handle no match found for any pattern here
    pass
This iterates over pairs of a regular expression and the relevant slice of its match until a match is found. If no match is found for any pattern, the else clause of the for loop executes. After the loop, the result of the match is in value, regardless of which pattern matched.
Having different variables set based on which "branch" succeeds is not a great idea, since you won't necessarily know which variables are set at any given time. A dictionary would be a better idea if you really want separate labels for each match, since you can query which key or keys are set in a dictionary.
value = {}
for pattern, slc, key in zip([patternA, patternB, patternC],
                             [slice(45, 67), slice(5, 7), slice(4, 5)],
                             ['xyz', 'abc', 'txt']):
    m = re.search(pattern, string)
    if m:
        value[key] = m.group(0)[slc]
        break
The general idea, though, is to note that your chain of if statements is like a hard-coded iteration, so you just need to identify which parts of each if/elif clause vary from the preceding ones and create a list that you can iterate over instead.
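As a side note: if you do prefer the literal if/elif chain from the question, Python 3.8 introduced assignment expressions (the := "walrus" operator), which make exactly the structure the question wished for legal:
# Python 3.8+: assignment expressions allow assigning inside a condition
if m := re.search(patternA, string):
    stripped = m.group(0)
    xyz = stripped[45:67]
elif m := re.search(patternB, string):
    stripped = m.group(0)
    abc = stripped[5:7]
elif m := re.search(patternC, string):
    stripped = m.group(0)
    txt = stripped[4:5]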