Finding Standardized Text Pattern In String - python

We are looking through a very large set of strings for standard number patterns in order to locate drawing sheet numbers. For example, valid sheet numbers are: A-101, A101, C-101, C102, E-101, A1, C1, A-100-A, etc.
They may be contained in a string such as "The sheet number is A-101 first floor plan".
The sheet numbers are always built from the same character types (letters, digits, and the separators -, space, and _). If we convert each valid number to a pattern indicating the character type of each position (A-101 = ASNNN, A101 = ANNN, A1 = AN, etc.), there are only ~100 valid patterns.
Our plan is to convert each character of the string to its character type and then search for a valid pattern. So the question is: what is the best way to search through "AAASAAAAASAAAAAASAASASNNNSAAAAASAAAAASAAAA" to find one of the 100 valid character type patterns? We considered doing 100 text searches, one per pattern, but it seems there could be a better way to find a candidate pattern and then check whether it is one of the 100 valid patterns.
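For illustration, a minimal sketch of the per-character conversion described above (the helper name and the fallback for unexpected characters are our own choices):
def to_type_string(s):
    """Map each character to its type code: A = letter, N = digit, S = separator."""
    out = []
    for ch in s:
        if ch.isdigit():
            out.append('N')
        elif ch.isalpha():
            out.append('A')
        elif ch in ' _-':
            out.append('S')
        else:
            out.append('?')  # anything else; adjust as needed
    return ''.join(out)

print(to_type_string("The sheet number is A-101 first floor plan"))
# AAASAAAAASAAAAAASAASASNNNSAAAAASAAAAASAAAA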

Solution
Is this what you want?
import re

pattern_dict = {
    'S': r'[ _-]',
    'A': r'[A-Z]',
    'N': r'[0-9]',
}
patterns = [
    'ASNNN',
    'ANNN',
    'AN',
]
text = "A-1 A2 B-345 C678 D900 E80"
for pattern in patterns:
    converted = ''.join(pattern_dict[c] for c in pattern)
    print(pattern, re.findall(rf'\b{converted}\b', text))
Output:
ASNNN ['B-345']
ANNN ['C678', 'D900']
AN ['A2']
Explanation
rf'some\b {string}': a combination of r-string and f-string.
r'some\b': a raw string. It prevents Python string escaping, so it is the same as 'some\\b'.
f'{string}': a formatted string literal (f-string). Python 3.6+ supports this syntax; it is similar to '{}'.format(string).
So you can replace rf'\b{converted}\b' with '\\b' + converted + '\\b'.
\b in regex: it matches a word boundary.
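A quick way to convince yourself of the equivalence:
converted = '[A-Z][ _-][0-9][0-9][0-9]'
assert rf'\b{converted}\b' == '\\b' + converted + '\\b'  # identical strings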

bookmark_strings = []
bookmark_strings.append("I-111 - INTERIOR FINISH PLAN & FINISH SCHEDULE")
bookmark_strings.append("M0.01 SCHEDULES & CALCULATIONS")
bookmark_strings.append("M-1 HVAC PLAN - OH Maple Heights PERMIT")
bookmark_strings.append("P-2 - PLUMBING DEMOLITION")
pattern_dict = {
    'S': r'[. _-]',
    'A': r'[A-Z]',
    'N': r'[0-9]',
}
patterns = [
    'ASNNN',
    'ANSNN',
    'ASN',
    'ANNN',
]
for bookmark in bookmark_strings:
    for pattern in patterns:
        converted = ''.join(pattern_dict[c] for c in pattern)
        matches = re.findall(rf'\b{converted}\b', bookmark)
        if matches:
            print("We found a match for pattern - {}, value = {} in bookmark {}".format(pattern, matches, bookmark))
Output:
We found a match for pattern - ASNNN, value = ['I-111'] in bookmark I-111 - INTERIOR FINISH PLAN & FINISH SCHEDULE
We found a match for pattern - ANSNN, value = ['M0.01'] in bookmark M0.01 SCHEDULES & CALCULATIONS
We found a match for pattern - ASN, value = ['M-1'] in bookmark M-1 HVAC PLAN - OH Maple Heights PERMIT
We found a match for pattern - ASN, value = ['P-2'] in bookmark P-2 - PLUMBING DEMOLITION
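A side note (our suggestion, not part of the original answer): since the same patterns are reused for every bookmark, the conversion and compilation can be hoisted out of the loop, which matters once you have ~100 patterns and a large set of strings:
compiled = {p: re.compile(rf"\b{''.join(pattern_dict[c] for c in p)}\b") for p in patterns}
for bookmark in bookmark_strings:
    for pattern, rx in compiled.items():
        hits = rx.findall(bookmark)
        if hits:
            print("We found a match for pattern - {}, value = {} in bookmark {}".format(pattern, hits, bookmark))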

Or just use a regex:
import re
re.findall("[A-Z][-_ ]?[0-9]+",text)

Related

How to extract all comma-delimited numbers inside () brackets and ignore any text

I am trying to extract the comma-delimited numbers inside () brackets from a string. I can get the numbers if they are alone on a line, but I can't seem to find a solution to get the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I currently use in Python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
    refline = line[line.find('(')+1:line.find(')')]
    if not re.search('[a-zA-Z]', refline):
Remove the ^ and $; they are what's preventing you from getting all the numbers. Also, the g and m flags won't work inside a Python re pattern.
You can change your regex to ([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern matches numbers that begin with a digit 1-9 followed by one or more digits, but only when they are immediately preceded and followed by a comma or a parenthesis.
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: At least one digit, not starting with 0 (would \d+ work too?)
(?:,[1-9]\d*)*: Zero or more additional numbers, each after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
Using the PyPI regex module, you could also use repeated capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Regex demo | Python demo
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for m in matches:
    print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

Extracting data that follows specific string only

I have a file I want to extract data from using regex that looks like this :
RID: RSS-130 SERVICE PAGE: 2
REPORTING FOR: 100019912 SSSE INTSERVICE PROC DATE: 15SEP21
ROLLUP FOR: 100076212 SSSE REPORT REPORT DATE: 15SEP21
ENTITY: 1000208212 SSSE
ACQT
PUR
SAME 10SEP21 120 12,263,518 19,48.5
T PUR 120 12,263,518 19,48.5
The regexes I wrote to extract the data:
regex_1 = r"PROC DATE:\s*(\w+).*?" # to get 15SEP21
regex_2 = r"T PUR\s*([0-9,]*\s*[0-9,]*)" # to get the first two elements of the line after T PUR
This works, but the file contains multiple records just like this one under different RIDs (for example RID: RSS-140 as well as RID: RSS-130). I want to extract only the information that follows RID: RSS-130 and ACQT, and stop when that record is over rather than carry on extracting data from whatever comes after it. How can I do that?
Desired output would be :
[(15SEP21;120;12,263,518)] for the record that comes under RID: RSS-130 and after ACQT only
I suggest leveraging a tempered greedy token here:
(?s)PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)
See the regex demo. Details:
(?s) - an inline re.S / re.DOTALL modifier
PROC DATE: - a literal text
\s* - zero or more whitespaces
(?P<date>\w+) - Group "date": one or more word chars
(?:(?!RID:\s+RSS-\d).)* - any single char that does not start a RID:\s+RSS-\d pattern (the block-start pattern: RID:, one or more whitespaces, RSS- and a digit), repeated zero or more times but as many as possible
T PUR - a literal string
\s+ - one or more whitespaces
(?P<num>\d[.,\d]*) - Group "num": a digit and then zero or more commas, dots and digits
\s+ - one or more whitespaces
(?P<val>\d[\d,]*) - Group "val": a digit and then zero or more commas or digits.
See the Python demo:
import re
text = "RID: RSS-130 SERVICE PAGE: 2 \nREPORTING FOR: 100019912 SSSE INTSERVICE PROC DATE: 15SEP21 \nROLLUP FOR: 100076212 SSSE REPORT REPORT DATE: 15SEP21 \nENTITY: 1000208212 SSSE \n \n \n \n \n ACQT \n \n \n PUR \n SAME 10SEP21 120 12,263,518 19,48.5 \n \n T PUR 120 12,263,518 19,48.5"
rx = r"PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)"
m = re.search(rx, text, re.DOTALL)
if m:
    print(m.groupdict())
# => {'date': '15SEP21', 'num': '120', 'val': '12,263,518'}
If you MUST check for T PUR after ACQT, modify the pattern to
(?s)PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d|ACQT).)*ACQT(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)
See this regex demo.
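For completeness, a quick check of that ACQT-enforcing variant against the same text variable as above (our sketch; the inline (?s) is replaced by the re.DOTALL flag):
rx2 = r"PROC DATE:\s*(?P<date>\w+)(?:(?!RID:\s+RSS-\d|ACQT).)*ACQT(?:(?!RID:\s+RSS-\d).)*T PUR\s+(?P<num>\d[.,\d]*)\s+(?P<val>\d[\d,]*)"
m = re.search(rx2, text, re.DOTALL)
if m:
    print(m.groupdict())
# => {'date': '15SEP21', 'num': '120', 'val': '12,263,518'}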

Returning strings contained within blocks of text, without punctuation attached

I need to match and return any word containing at least one of the strings/combinations of characters below:
- tion (as in navigation, isolation, or mitigation)
- ex (as in explanation, exfiltrate, or expert)
- ph (as in philosophy, philanthropy, or ephemera)
- ost, ist, ast (as in hostel, distribute, past)
My function appears to do this:
TEXT_SAMPLE = """
Striking an average of observations taken at different times-- rejecting those
timid estimates that gave the object a length of 200 feet, and ignoring those
exaggerated views that saw it as a mile wide and three long--you could still
assert that this phenomenal creature greatly exceeded the dimensions of
anything then known to ichthyologists, if it existed at all.
Now then, it did exist, this was an undeniable fact; and since the human mind
dotes on objects of wonder, you can understand the worldwide excitement caused
by this unearthly apparition. As for relegating it to the realm of fiction,
that charge had to be dropped.
In essence, on July 20, 1866, the steamer Governor Higginson, from the
Calcutta & Burnach Steam Navigation Co., encountered this moving mass five
miles off the eastern shores of Australia.
"""

def latin_ish_words(text):
    # Split the input text into a list of words on whitespace
    text_list = text.split()
    # Create an empty list for the matches
    match_list = []
    # List of Latin-ish features to look for
    part_list = ["tion", "ex", "ph", "ost", "ist", "ast"]
    # Add a word to match_list if it contains any Latin-ish feature
    for word in text_list:
        for part in part_list:
            if part in word:
                match_list.append(word)
    # De-duplicate while preserving order
    match_list = list(dict.fromkeys(match_list))
    return match_list

latin_ish_words(TEXT_SAMPLE)
['observations', 'exaggerated', 'phenomenal', 'exceeded', 'ichthyologists,', 'existed', 'exist,', 'excitement', 'apparition.', 'fiction,', 'Navigation', 'eastern']
However, when words have punctuation attached, the function also returns the punctuation, e.g. 'exist,'.
How could one filter out such attached punctuation?
You can use r"\b\w*(?:tion|ex|ph|ost|ist|ast)\w*\b" regex. Explanation (see also docs):
\b ... word boundary
\w ... word character
* ... 0 or more repetition
\w* ... 0 or more word characters
(?:...) ... "plain" parens, not creating a group
| ... or
tion|ex|ph ... tion or ex or ph
Code:
import re
print(re.findall(r"\b\w*(?:tion|ex|ph|ost|ist|ast)\w*\b",TEXT_SAMPLE))
For convenience, you can build the pattern programmatically, adding the parts from a variable:
import re

part_list = [
    "tion",
    "ex",
    "ph",
    "ost",
    "ist",
    "ast",
]
part_re = "|".join(part_list)
pattern = fr"\b\w*(?:{part_re})\w*\b"
# pattern = r"\b\w*(?:{})\w*\b".format(part_re)  # for older versions not allowing f-string syntax
print(re.findall(pattern, TEXT_SAMPLE))
Output:
[
'observations',
'exaggerated',
'phenomenal',
'exceeded',
'ichthyologists',
'existed',
'exist',
'excitement',
'apparition',
'fiction',
'Navigation',
'eastern',
]
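If you would rather keep the original loop-based function instead of switching to regex, a minimal alternative (our sketch) is to strip punctuation from each word before testing it:
import string

def latin_ish_words(text):
    part_list = ["tion", "ex", "ph", "ost", "ist", "ast"]
    match_list = []
    for word in text.split():
        word = word.strip(string.punctuation)  # drop attached punctuation
        if any(part in word for part in part_list):
            match_list.append(word)
    return list(dict.fromkeys(match_list))  # de-duplicate, preserving order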

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or less immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter the length or content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
([A-Z][a-z]+ ?){1,3} - Capitalized word repeating 1 to 3 times separated by space
[A-Z] - capital char (word begining char)
[a-z]+ - non-capital chars (end of word)
_? - a space separating capitalized words (_ stands for a space); the ? allows a single word, or the last word, with no trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
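In Python, group 2 can be pulled out directly (a small sketch; it assumes the sample sentence above is stored in text):
import re
pattern = r'(["\'])((?:[A-Z][a-z]+ ?){1,3})\1'
print([m.group(2).strip() for m in re.finditer(pattern, text)])
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']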
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

Find best substring match

I'm looking for a library or a method using existing libraries (difflib, fuzzywuzzy, python-Levenshtein) to find the closest match of a string (query) in a text (corpus).
I've developed a method based on difflib, where I split my corpus into ngrams of size n (the length of the query).
import difflib
from nltk.util import ngrams

def get_best_match(query, corpus):
    ngs = ngrams(list(corpus), len(query))
    ngrams_text = [''.join(x) for x in ngs]
    return difflib.get_close_matches(query, ngrams_text, n=1, cutoff=0)
It works as I want when the differences between the query and the matched string are just character replacements:
query = "ipsum dolor"
corpus = "lorem 1psum d0l0r sit amet"
match = get_best_match(query, corpus)
# match = "1psum d0l0r"
But when the difference involves character deletion, it does not:
query = "ipsum dolor"
corpus = "lorem 1psum dlr sit amet"
match = get_best_match(query, corpus)
# match = "psum dlr si"
# expected_match = "1psum dlr"
Is there a way to get a more flexible result size (as for expected_match)?
EDIT 1:
The actual use of this script is to match queries (strings) against messy OCR output.
As I said in the question, the OCR can confuse characters, and even miss them.
If possible, also consider the case when a space is missing between words.
The best match is one that does not include characters from words other than those in the query.
EDIT 2:
The solution I use now is to extend the ngrams with (n-k)-grams for k = {1,2,3} to prevent 3 deletions. It's much better than the first version, but not efficient in terms of speed, as we have more than 3 times as many ngrams to check. It is also not a generalizable solution.
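For reference, a sketch of the (n-k)-gram extension described in EDIT 2 (our reconstruction, reusing the imports from the question):
import difflib
from nltk.util import ngrams

def get_best_match(query, corpus, max_deletions=3):
    candidates = []
    for k in range(max_deletions + 1):
        # (n-k)-grams tolerate up to k deleted characters in the corpus
        candidates += [''.join(g) for g in ngrams(list(corpus), len(query) - k)]
    return difflib.get_close_matches(query, candidates, n=1, cutoff=0)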
This function finds best matching substring of variable length.
The implementation considers the corpus as one long string, hence avoiding your concerns with spaces and unseparated words.
Code summary:
1. Scan the corpus for match values in steps of size step to find the approximate location of highest match value, pos.
2. Find the substring in the vicinity of pos with the highest match value, by adjusting the left/right positions of the substring.
from difflib import SequenceMatcher

def get_best_match(query, corpus, step=4, flex=3, case_sensitive=False, verbose=False):
    """Return best matching substring of corpus.

    Parameters
    ----------
    query : str
    corpus : str
    step : int
        Step size of first match-value scan through corpus. Can be thought of
        as a sort of "scan resolution". Should not exceed length of query.
    flex : int
        Max. left/right substring position adjustment value. Should not
        exceed length of query / 2.

    Outputs
    -------
    output0 : str
        Best matching substring.
    output1 : float
        Match ratio of best matching substring. 1 is perfect match.
    """

    def _match(a, b):
        """Compact alias for SequenceMatcher."""
        return SequenceMatcher(None, a, b).ratio()

    def scan_corpus(step):
        """Return list of match values from corpus-wide scan."""
        match_values = []
        m = 0
        while m + qlen - step <= len(corpus):
            match_values.append(_match(query, corpus[m : m-1+qlen]))
            if verbose:
                print(query, "-", corpus[m: m + qlen], _match(query, corpus[m: m + qlen]))
            m += step
        return match_values

    def index_max(v):
        """Return index of max value."""
        return max(range(len(v)), key=v.__getitem__)

    def adjust_left_right_positions():
        """Return left/right positions for best string match."""
        # bp_* is synonym for 'Best Position Left/Right' and are adjusted
        # to optimize bmv_*
        p_l, bp_l = [pos] * 2
        p_r, bp_r = [pos + qlen] * 2
        # bmv_* are declared here in case they are untouched in optimization
        bmv_l = match_values[p_l // step]
        bmv_r = match_values[p_l // step]
        for f in range(flex):
            ll = _match(query, corpus[p_l - f: p_r])
            if ll > bmv_l:
                bmv_l = ll
                bp_l = p_l - f
            lr = _match(query, corpus[p_l + f: p_r])
            if lr > bmv_l:
                bmv_l = lr
                bp_l = p_l + f
            rl = _match(query, corpus[p_l: p_r - f])
            if rl > bmv_r:
                bmv_r = rl
                bp_r = p_r - f
            rr = _match(query, corpus[p_l: p_r + f])
            if rr > bmv_r:
                bmv_r = rr
                bp_r = p_r + f
            if verbose:
                print("\n" + str(f))
                print("ll: -- value: %f -- snippet: %s" % (ll, corpus[p_l - f: p_r]))
                print("lr: -- value: %f -- snippet: %s" % (lr, corpus[p_l + f: p_r]))
                print("rl: -- value: %f -- snippet: %s" % (rl, corpus[p_l: p_r - f]))
                print("rr: -- value: %f -- snippet: %s" % (rr, corpus[p_l: p_r + f]))
        return bp_l, bp_r, _match(query, corpus[bp_l : bp_r])

    if not case_sensitive:
        query = query.lower()
        corpus = corpus.lower()

    qlen = len(query)

    if flex >= qlen/2:
        print("Warning: flex exceeds length of query / 2. Setting to default.")
        flex = 3

    match_values = scan_corpus(step)
    pos = index_max(match_values) * step
    pos_left, pos_right, match_value = adjust_left_right_positions()

    return corpus[pos_left: pos_right].strip(), match_value
Example:
query = "ipsum dolor"
corpus = "lorem i psum d0l0r sit amet"
match = get_best_match(query, corpus, step=2, flex=4)
print(match)
('i psum d0l0r', 0.782608695652174)
Some good heuristic advice is to always keep step < len(query) * 3/4, and flex < len(query) / 3. I also added case sensitivity, in case that's important. It works quite well when you start playing with the step and flex values. Small step values give better results but take longer to compute. flex governs how flexible the length of the resulting substring is allowed to be.
Important to note: This will only find the first best match, so if there are multiple equally good matches, only the first will be returned. To allow for multiple matches, change index_max() to return a list of indices for the n highest values of the input list, and loop over adjust_left_right_positions() for values in that list.
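For the multiple-match variant, index_max() could be generalized along these lines (a sketch; n is the number of candidate positions you want back):
import heapq

def index_top_n(v, n=3):
    """Return indices of the n highest values, best first."""
    return heapq.nlargest(n, range(len(v)), key=v.__getitem__)

Each returned index, multiplied by step, then gives a pos around which to run adjust_left_right_positions().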
The main path to a solution uses finite state automata (FSA) of some kind. If you want a detailed summary of the topic, check this dissertation out (PDF link). Error-based models (including Levenshtein automata and transducers, the former of which Sergei mentioned) are valid approaches to this. However, stochastic models, including various types of machine learning approaches integrated with FSAs, are very popular at the moment.
Since we are looking at edit distances (effectively misspelled words), the Levenshtein approach is good and relatively simple. This paper (as well as the dissertation; also PDF) give a decent outline of the basic idea and it also explicitly mentions the application to OCR tasks. However, I will review some of the key points below.
The basic idea is that you want to build an FSA that computes both the valid string as well as all strings up to some error distance (k). In the general case, this k could be infinite or the size of the text, but this is mostly irrelevant for OCR (if your OCR could even potentially return bl*h where * is the rest of the entire text, I would advise finding a better OCR system). Hence, we can exclude regexes like bl*h from the set of valid answers for the search string blah. A general, simple and intuitive k for your context is probably the length of the string (w) minus 2. This allows b--h to be a valid string for blah. It also allows bla--h, but that's okay. Also, keep in mind that the errors can be any character you specify, including spaces (hence 'multiword' input is solvable).
The next basic task is to set up a simple weighted transducer. Any of the OpenFST Python ports can do this (here's one). The logic is simple: insertions and deletions increment the weight while equality increments the index in the input string. You could also just hand code it as the guy in Sergei's comment link did.
Once you have the weights and associated indexes of the weights, you just sort and return. The computational complexity should be O(n(w+k)), since we will look ahead w+k characters in the worst case for each character (n) in the text.
From here, you can do all sorts of things. You could convert the transducer to a DFA. You could parallelize the system by breaking the text into w+k-grams, which are sent to different processes. You could develop a language model or confusion matrix that defines what common mistakes exist for each letter in the input set (and thereby restrict the space of valid transitions and the complexity of the associated FSA). The literature is vast and still growing so there are probably as many modifications as there are solutions (if not more).
Hopefully that answers some of your questions without giving any code.
I would try to build a regular expression template from the query string. The template could then be used to search the corpus for substrings that are likely to match the query. Then use difflib or fuzzywuzzy to check if the substring does match the query.
For example, a possible template would be to match at least one of the first two letters of the query, at least one of the last two letters of the query, and have approximately the right number of letters in between:
import re
import difflib

query = "ipsum dolor"
corpus = ["lorem 1psum d0l0r sit amet",
          "lorem 1psum dlr sit amet",
          "lorem ixxxxxxxr sit amet"]

first_letter, second_letter = query[:2]
minimum_gap, maximum_gap = len(query) - 6, len(query) - 3
penultimate_letter, ultimate_letter = query[-2:]

fmt = '(?:{}.|.{}).{{{},{}}}(?:{}.|.{})'.format
pattern = fmt(first_letter, second_letter,
              minimum_gap, maximum_gap,
              penultimate_letter, ultimate_letter)
#print(pattern) # for debugging pattern

m = difflib.SequenceMatcher(None, "", query, False)
for c in corpus:
    for match in re.finditer(pattern, c, re.IGNORECASE):
        substring = match.group()
        m.set_seq1(substring)
        ops = m.get_opcodes()
        # EDIT: fixed calculation of the number of edits
        #num_edits = sum(1 for t,_,_,_,_ in ops if t != 'equal')
        num_edits = sum(max(i2-i1, j2-j1) for op,i1,i2,j1,j2 in ops if op != 'equal')
        print(num_edits, substring)
Output:
3 1psum d0l0r
3 1psum dlr
9 ixxxxxxxr
Another idea is to use the characteristics of the OCR when building the regex. For example, if the OCR always gets certain letters correct, then when any of those letters are in the query, use a few of them in the regex. Or if the OCR mixes up '1', '!', 'l', and 'i', but never substitutes something else, then if one of those letters is in the query, use [1!il] in the regex.
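That last idea can be sketched as a small table-driven pattern builder (the confusion sets below are illustrative assumptions, not measured OCR behavior):
import re

# illustrative confusion sets; tune these to your OCR's actual error profile
confusions = {'i': '[1!il]', 'l': '[1!il]', 'o': '[o0]', 's': '[s5]'}

def ocr_pattern(query):
    """Build a regex where confusable letters become character classes."""
    return ''.join(confusions.get(c.lower(), re.escape(c)) for c in query)

print(re.findall(ocr_pattern("ipsum dolor"), "lorem 1psum d0l0r sit amet"))
# ['1psum d0l0r']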
