How to replace ambiguous characters in words following a specific pattern - python

I use tesseract OCR to extract text from different documents, then I process the extracted text with regex to see if it matches a specific pattern. Unfortunately, OCR extraction makes common mistakes on ambiguous characters, such as: 5: S, 1: I, 0: O, 2: Z, 4: A, 8: B, etc. These mistakes are so common that substituting the ambiguous characters would make the text match the pattern perfectly.
Is there a way to postprocess the OCR extraction and substitute ambiguous characters (provided in advance) so that the result follows a specific pattern?
expected output (and what I could think of so far):
# example: I am extracting car plate numbers that always follow the pattern [A-Z]{2}\d{5}
# patterns might differ for other examples, but will always be some alphanumeric combination
# complex patterns may be ignored with some warning like "unable to parse"
import re

def post_process(pattern, text, ambiguous_dict):
    # get text[0], check pattern
    # in this case, it should be a letter; if not, try to replace from dict, if yes, pass
    # continue with the next letters until a match is found or the whole text has been looped
    if match:
        return match
    else:
        # some error message
        return None

ambiguous_dict = {'2': 'Z', 'B': '8'}

# My plate photo text: AZ45287
# Noise is fairly easy to filter out by filtering on tesseract confidence level, although not ideal
# so, if a function cannot be made that would find a match through the noise,
# the noise can be ignored in favor of a simpler function that can just find a match
ocr_output = "someNoise A2452B7 no1Ze"

# 2 in position 1 is replaced by Z, B is replaced by 8,
# while '2' on pos 5 should remain a 2 as per the pattern.
# It would be acceptable if the function did this iteratively for each element
# of ocr_output until the pattern is matched, or returned None.
# Any other functionally similar (recursive, generator, other) approach is also acceptable.
result = post_process(r"[A-Z]{2}\d{5}", ocr_output, ambiguous_dict)
if result:
    print(result)  # AZ45287
else:  # result is None
    print("failed to clean output")
I hope I explained my problem well, but feel free to request additional info.

As always with OCR, it is hard to come up with a 100% safe and working solution. In this case, what you can do, is to add the "corrupt" chars to the regex and then "normalize" the matches using the dictionaries with replacements.
It means that you just can't use [A-Z]{2}\d{5}, because among the first two uppercase letters there can be a 2, and among the five digits there can be a B. Thus, you need to change the pattern to ([A-Z2]{2})([\dB]{5}) here. Note the capturing parentheses that create two subgroups. To normalize each, you need two separate replacements, as it appears you do not want to replace digits with letters in the numeric part (\d{5}) or letters with digits in the letter part ([A-Z]{2}).
So, here is how it can be implemented in Python:
import re

def post_process(pattern, text, ambiguous_dict_1, ambiguous_dict_2):
    matches = list(re.finditer(pattern, text))
    if matches:
        return [f"{x.group(1).translate(ambiguous_dict_1)}{x.group(2).translate(ambiguous_dict_2)}"
                for x in matches]
    else:
        return None

ambiguous_dict_1 = {ord('2'): 'Z'}  # For the first group
ambiguous_dict_2 = {ord('B'): '8'}  # For the second group
ocr_output = "someNoise A2452B7 no1Ze"
result = post_process(r"([A-Z2]{2})([\dB]{5})", ocr_output, ambiguous_dict_1, ambiguous_dict_2)
if result:
    print(result)  # => ['AZ45287']
else:  # result is None
    print("failed to clean output")
The ambiguous_dict_1 dictionary contains the digit to letter replacements and ambiguous_dict_2 contains the letter to digit replacements.
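As a supplementary note (not part of the original answer): the translation tables above map code points to replacement characters by hand; str.maketrans can build the same tables from plain strings, which may read more clearly once there are several ambiguous pairs.

```python
# str.maketrans builds the {code point: replacement} tables that
# str.translate expects, one table per direction of correction.
letter_fixes = str.maketrans('2510', 'ZSIO')    # digits misread where letters belong
digit_fixes = str.maketrans('ZSIOB', '25108')   # letters misread where digits belong

print('A2'.translate(letter_fixes))    # AZ
print('452B7'.translate(digit_fixes))  # 45287
```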

Related

string matching with interchangeable characters

I am trying to do a simple string matching between two strings, a small string to a bigger string. The only catch is that I want to equate two characters in the small string to be the same. In particular if there is a character 'I' or a character 'L' in the smaller string, then I want it to be considered interchangeably.
For example let's say my small string is
s = 'AKIIMP'
and then the bigger string is:
b = 'MPKGEXAKILMP'
I want to write a function that takes the two strings and checks whether the smaller one is in the big one. In this particular example, even though the smaller string s is not literally a substring of b (there is no exact match), it should still count as a match because, as I mentioned, the characters 'I' and 'L' are used interchangeably.
Any idea of how I could proceed with this?
s.replace('I', 'L') in b.replace('I', 'L')
will evaluate to True in your example.
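This one-liner generalizes naturally: map every interchangeable character to a single canonical representative on both sides before the containment test. A small sketch (the helper name and equivalence table are illustrative, not from the original answer):

```python
# Canonicalize both strings before the substring test: every character
# in an interchangeable group is mapped to one representative of it.
canon = str.maketrans({'I': 'L'})  # treat 'I' and 'L' as the same letter

def contains(small, big):
    return small.translate(canon) in big.translate(canon)

print(contains('AKIIMP', 'MPKGEXAKILMP'))  # True
```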
You could do it with regular expressions:
import re

s = 'AKIIMP'
b = 'MPKGEXAKILMP'
p = re.sub('[IL]', '[IL]', s)
if re.search(p, b):
    print(f'{s!r} is in {b!r}')
else:
    print('Not found')
This is not as elegant as @Deepstop's answer, but it provides a bit more flexibility in terms of what characters you equate.
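One caveat worth noting with the regex variant: if the small string may contain regex metacharacters, it should be escaped before the character-class substitution, or re.search will misinterpret it. A hedged addition (the helper name is illustrative, not from the original answer):

```python
import re

def interchangeable_pattern(s, group='IL'):
    # Escape s so any metacharacters match literally, then let every
    # character of the interchangeable group match any member of it.
    return re.sub('[{}]'.format(group), '[{}]'.format(group), re.escape(s))

print(interchangeable_pattern('AKIIMP'))  # AK[IL][IL]MP
```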

Check if a variable substring is in a string

I receive an input string having values expressed in two possible formats. E.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether a short format is used and, in case, make it extended.
Considering that the string could be composed of multiple values, i.e.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
    # then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
    # extract the number
    # replace ":{number}" with the extended format
I know how to code the replacing part.
I need help implementing the if condition: in my mind, I model it as a variable substring, in which the variable part is the number, while the rigid part is the $(value name) + ":" prefix.
"some_value":19
^^^^^^^^^^^^^ ^^
rigid format  variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and I can convert it into a dictionary, easily accessing then the values.
Indeed, I already have this solution in my code. But I don't like it, since the input string can be multilevel, and I would need to iterate over the leaf values of the resulting dictionary independently of how many levels it has. That is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.
If you replace all keys, except t0 and tf, that are followed by numbers, it should work.
Here is an example on a multilevel string, probably to be put in better shape:
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
gex = '("(?!(t0|tf)")\w+":)\s*(\d+)'
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
"interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}
You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by any number of digits.
Let's test this
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
If you want any Unicode standard word characters and not only interval you can use the special sequence \w and because it could be multiple characters the expression will be \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty", but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+") if it's preceded by {, so we're going to use a negative lookbehind assertion: (?<!{)("\w+"). And after the number part (\d+) we don't want a } or another digit, so we're using negative lookahead assertions here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
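For completeness, the parsing route the asker set aside is shorter than it may look: once the fragment is valid JSON (wrapped in braces), a small recursive walk can expand bare numeric leaves at any depth. A sketch under that assumption; the helper name is illustrative:

```python
import json

def expand(node):
    # Recursively rewrite bare numbers as {"t0": n, "tf": n}; dicts that
    # already have exactly the t0/tf keys are left untouched.
    if isinstance(node, dict):
        if set(node) == {'t0', 'tf'}:
            return node
        return {k: ({'t0': v, 'tf': v} if isinstance(v, (int, float)) else expand(v))
                for k, v in node.items()}
    return node

data = json.loads('{"interval":19,"interval2":{"t0":10,"tf":15}}')
print(json.dumps(expand(data)))
```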

How can I use a recursive regex or another method to recursively validate this BBcode-like markup in Python?

I am attempting to write a program that validates documents written in a markup language similar to BBcode.
This markup language has both matching ([b]bold[/b] text) and non-matching (today is [date]) tags. Unfortunately, using a different markup language is not an option.
However, my regex is not acting the way I want it to. It seems to always stop at the first matching closing tag instead of identifying that nested tag with the recursive (?R).
I am using the regex module, which supports (?R), and not re.
My questions are:
How can I effectively use a recursive regex to match nested tags without terminating on the first tag?
If there's a better method than a regular expression, what is that method?
Here is the regex once I build it:
\[(b|i|u|h1|h2|h3|large|small|list|table|grid)\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]
Here is a test string that doesn't work as expected:
[large]test1 [large]test2[/large] test3[/large] (it should match this whole string but stops before test3)
Here is the regex on regex101.com: https://regex101.com/r/laJSLZ/1
This test doesn't need to finish in milliseconds or even seconds, but it does need to be able to validate about 100 files of 1,000 to 10,000 characters each in a time that is reasonable for a Travis-CI build.
Here is what the logic using this regex looks like, for context:
import io, regex  # https://pypi.org/project/regex/

# All the tags that must have opening and closing tags
matching_tags = 'b', 'i', 'u', 'h1', 'h2', 'h3', 'large', 'small', 'list', 'table', 'grid'

# our first part matches an opening tag:
# \[(b|i|u|h1|h2|h3|large|small|list|table|grid)\]
# our middle part matches the text in the middle, including any properly formed tag sets in between:
# (?:((?!\[\/\1\]).)*?|(?R))*
# our last part matches the closing tag for our first match:
# \[\/\1\]
pattern = r'\[(' + '|'.join(matching_tags) + r')\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]'
myRegex = regex.compile(pattern)

data = ''
with open('input.txt', 'r') as file:
    data = '[br]'.join(file.readlines())

def validate(text):
    valid = True
    for node in all_nodes(text):
        valid = valid and is_valid(node)
    return valid

# (Only important thing here is that I call this on every node, this
# should work fine but the regex to get me those nodes does not.)
# markup should be valid iff opening and closing tag counts are equal
# in the whole file, in each matching top-level pair of tags, and in
# each child all the way down to the smallest unit (a string that has
# no tags at all)
def is_valid(text):
    valid = True
    for tag in matching_tags:
        valid = valid and text.count(f'[{tag}]') == text.count(f'[/{tag}]')
    return valid

# this returns each child of the text given to it
# this call:
# all_nodes('[b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b] to use [b]regex to [i]do stuff[/i][/b]')
# should return a list containing these strings:
# [b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b]
# [large]text to[/large]
# [i]with [u]regex[/u]![/i]
# [u]regex[/u]
# [b]regex to [i]do stuff[/i][/b]
# [i]do stuff[/i]
def all_nodes(text):
    result = []
    matches = myRegex.findall(text)
    if len(matches) > 0:
        for m in matches:
            result += all_nodes(m)
    return result

exit(0 if validate(data) else 1)
Your main issue is within the ((?!\[\/\1\]).)*? tempered greedy token.
First, it is inefficient since you quantified it and then you quantify the whole group it is in, so making the regex engine look for more ways to match a string, and that makes it rather fragile.
Second, you only matched up to the closing tag and did not restrict the opening tag, so the tempered dot can run past a nested opening [tag] and the recursion never gets a chance to run. The first step is to make the / before \1 optional, \/?, so the tempered token stops before both the opening and the closing tag. To add attribute support, add an optional group after \1, (?:\s[^]]*)?: it matches an optional sequence of a whitespace and then any zero or more chars other than ].
A fixed regex will look like
\[([biu]|h[123]|l(?:arge|ist)|small|table|grid)](?:(?!\[/?\1(?:\s[^]]*)?]).|(?R))*\[/\1]
Do not forget to compile it with regex.DOTALL to match across multiple newlines.
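As for the "better method" part of the question: nesting of paired tags can also be validated without a recursive pattern at all, by scanning tags with a flat regex and tracking them on a stack. A stdlib-only sketch (the function name is illustrative, not from the original answer):

```python
import re

TAG = re.compile(r'\[(/?)(b|i|u|h1|h2|h3|large|small|list|table|grid)]')

def is_balanced(text):
    # Scan tags left to right; push names on open, pop and compare on close.
    stack = []
    for m in TAG.finditer(text):
        closing, name = m.groups()
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # stray or mismatched closing tag
    return not stack      # every opened tag must have been closed

print(is_balanced('[large]test1 [large]test2[/large] test3[/large]'))  # True
```

Non-matching tags like [date] never enter the tag alternation, so they are simply ignored, and the scan is linear in the input size, which helps with the 100-file time budget.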

Python Regex Partial Match or "hitEnd"

I'm writing a scanner, so I'm matching an arbitrary string against a list of regex rules. It would be useful if I could emulate the Java "hitEnd" functionality of knowing not just when the regular expression didn't match, but when it can't match; when the regular expression matcher reached the end of the input before deciding it was rejected, indicating that a longer input might satisfy the rule.
For example, maybe I'm matching html tags for starting to bold a sentence of the form "< b >". So I compile my rule
bold_html_rule = re.compile("<b>")
And I run some tests:
good_match = bold_html_rule.match("<b>")
uncertain_match = bold_html_rule.match("<")
bad_match = bold_html_rule.match("goat")
How can I tell the difference between the "bad" match, for which goat can never be made valid by more input, and the ambiguous match that isn't a match yet, but could be.
Attempts
It is clear that in the above form there is no way to distinguish them, because both the uncertain attempt and the bad attempt return None. If I wrap every rule in "(RULE)?", then any input returns a match, because at the least the empty string is a substring of every string. However, when I try to see how far the regex progressed before rejecting my string, using the group method or the endpos field, it is always just the length of the string.
Does the Python regex package do a lot of extra work and traverse the whole string even if it's an invalid match on the first character? I can see what it would have to if I used search, which will verify if the sequence is anywhere in the input, but it seems very strange to do so for match.
I've found the question asked before (on non-stackoverflow places) like this one:
https://mail.python.org/pipermail/python-list/2012-April/622358.html
but he doesn't really get a response.
I looked at the regular expression package itself but wasn't able to discern its behavior; could I extend the package to get this result? Is this the wrong way to tackle my task in the first place? (I've built effective Java scanners using this strategy in the past.)
Try this out. It does feel like a hack, but at least it achieves the result you are looking for. Though I am a bit concerned about the PrepareCompileString function: it should be able to handle all escaped characters, but it cannot handle any wildcards.
import re

# Grouping every single character
def PrepareCompileString(regexString):
    newstring = ''
    escapeFlag = False
    for char in regexString:
        if escapeFlag:
            char = escapeString + char
            escapeFlag = False
            escapeString = ''
        if char == '\\':
            escapeFlag = True
            escapeString = char
        if not escapeFlag:
            newstring += '({})?'.format(char)
    return newstring

def CheckMatch(match):
    # counting the number of non-matched groups
    count = match.groups().count(None)
    # If all groups matched - good match
    # If no groups matched - bad match
    # If only some groups matched - uncertain match
    if count == 0:
        print('Good Match:', match.string)
    elif count < len(match.groups()):
        print('Uncertain Match:', match.string)
    elif count == len(match.groups()):
        print('Bad Match:', match.string)

regexString = '<b>'
bold_html_rule = re.compile(PrepareCompileString(regexString))

good_match = bold_html_rule.match("<b>")
uncertain_match = bold_html_rule.match("<")
bad_match = bold_html_rule.match("goat")

for match in [good_match, uncertain_match, bad_match]:
    CheckMatch(match)
I got this result:
Good Match: <b>
Uncertain Match: <
Bad Match: goat

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re

s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''

pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s)  # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
By wrapping the entire value capture between 2, and ,X in (), you end up capturing that as well. I then used the (?: ) to ignore the inner captured set.
You don't have to use regex:
split the string into a list on commas,
check that item 0 == 'O' and item 1 == '2',
check that the last item == 'X',
check that each of the items in [2:-1] is a number (isdigit),
that's all.
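The checklist above can be sketched as a short function (the name is illustrative, assuming the O,2,...,X pattern from the question):

```python
def matches_plan(line):
    # Implements the split-based checks: fixed prefix 'O', '2',
    # fixed suffix 'X', and only numbers in between.
    items = line.split(',')
    return (len(items) >= 3
            and items[0] == 'O'
            and items[1] == '2'
            and items[-1] == 'X'
            and all(i.isdigit() for i in items[2:-1]))

print(matches_plan('O,2,9,6,7,11,8,X'))    # True
print(matches_plan('O,4,1,8,6,7,9,5,3,X')) # False
```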
