Say I have one large string and an array of substrings that when joined equal the large string (with small differences).
For example (note the subtle differences between the strings):
large_str = "hello, this is a long string, that may be made up of multiple
substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
"subsrings tat aproimately ", "match the orginal strng"]
How can I best align the strings to produce a new set of sub strings from the original large_str? For example:
["hello, this is a long string", ", that may be made up of multiple",
"substrings that approximately ", "match the original string"]
Additional Info
The use case for this is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. Text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text does not have page breaks. The goal is to accurately page break the original text avoiding the OCR errors of the PDF text.
1. Concatenate the substrings.
2. Align the concatenation with the original string.
3. Keep track of which positions in the original string are aligned with the boundaries between the substrings.
4. Split the original string on the positions aligned with those boundaries.
An implementation using Python's difflib:
from difflib import SequenceMatcher
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = [
    "hello, ths is a lng strin",
    ", that ay be mad up of multiple",
    "subsrings tat aproimately ",
    "match the orginal strng"]

# cumulative end positions of each substring inside the concatenation
sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))

sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk=False)

match_index = 0
matches = [''] * len(sub_strs)
for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
    if tag == 'delete' or tag == 'insert' or tag == 'replace':
        # non-matching block: assign the whole large_str span to the current substring
        matches[match_index] += large_str[i1:i2]
        while j1 < j2:
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            while submatch_len == 0:
                match_index += 1
                submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            j1 += submatch_len
    else:
        # matching block: split the large_str span across substring boundaries
        while j1 < j2:
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            while submatch_len == 0:
                match_index += 1
                submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            matches[match_index] += large_str[i1:i1+submatch_len]
            j1 += submatch_len
            i1 += submatch_len

print(matches)
Output:
['hello, this is a long string',
', that may be made up of multiple ',
'substrings that approximately ',
'match the original string']
You are trying to solve a sequence alignment problem. In your case it is a "local" sequence alignment, which can be solved with the Smith-Waterman approach. One possible implementation is here.
If you run it, you receive:
large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng sin", ", that ay be md up of mulple", "susrings tat aproimately ", "manbvch the orhjgnal strng"]
for sbs in sub_strs:
    water(large_str, sbs)
>>>
Identity = 85.185 percent
Score = 210
hello, this is a long strin
hello, th s is a l ng s in
hello, th-s is a l-ng s--in
Identity = 84.848 percent
Score = 255
, that may be made up of multiple
, that ay be m d up of mul ple
, that -ay be m-d- up of mul--ple
Identity = 83.333 percent
Score = 225
substrings that approximately
su s rings t at a pro imately
su-s-rings t-at a-pro-imately
Identity = 75.000 percent
Score = 175
ma--tch the or-iginal string
ma ch the or g nal str ng
manbvch the orhjg-nal str-ng
The middle line shows the matching characters. If you need the positions, look for the max_i value to get the ending position in the original string. The starting position will be the value of i at the end of the water() function.
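The linked water() implementation is not reproduced here; for reference, a minimal Smith-Waterman scorer looks roughly like the sketch below. The +5/-4/-7 match/mismatch/gap scores are illustrative assumptions, not the linked code's values.

def smith_waterman(a, b, match=5, mismatch=-4, gap=-7):
    """Minimal local-alignment scorer (no traceback).
    Returns (best_score, end_i, end_j): the best local score and the
    (exclusive) end positions of the aligned region in a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # gap in b
            left = H[i][j - 1] + gap  # gap in a
            H[i][j] = max(0, diag, up, left)  # 0 lets the alignment restart locally
            if H[i][j] > best:
                best, best_i, best_j = H[i][j], i, j
    return best, best_i, best_j

# e.g. smith_waterman(large_str, sub_strs[0]) gives the best local score and the
# end positions of the aligned region in large_str and the substring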
(The additional info makes a lot of the following unnecessary. It was written for a situation where the substrings provided might be any permutation of the order in which they occur in the main string)
There will be a dynamic programming solution for a problem very close to this. In the dynamic programming algorithm that gives you edit distance, the state of the dynamic program is (a, b) where a is the offset into the first string and b is the offset into the second string. For each pair (a, b) you work out the smallest possible edit distance that matches the first a characters of the first string with the first b characters of the second string, working out (a, b) from (a-1, b-1), (a-1, b), and (a, b-1).
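For concreteness, a sketch of that standard (a, b) recurrence (plain Python, nothing project-specific):

def edit_distance(x, y):
    # d[a][b] = minimum edits to turn x[:a] into y[:b]
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for a in range(len(x) + 1):
        d[a][0] = a
    for b in range(len(y) + 1):
        d[0][b] = b
    for a in range(1, len(x) + 1):
        for b in range(1, len(y) + 1):
            d[a][b] = min(d[a - 1][b - 1] + (x[a - 1] != y[b - 1]),  # substitute or match
                          d[a - 1][b] + 1,                           # delete from x
                          d[a][b - 1] + 1)                           # insert into x
    return d[len(x)][len(y)]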
You can now write a similar algorithm with state (a, n, m, b) where a is the total number of characters consumed by substrings so far, n is the index of the current substring, m is the position within the current substring, and b is the number of characters matched in the second string. This solves the problem of matching b against a string composed by pasting together any number of copies of any of the available substrings.
This is a different problem, because if you are trying to reconstitute a long string from fragments, you might get a solution that uses the same fragment more than once, but if you are doing this you might hope that the answer is obvious enough that the collection of substrings it produces happens to be a permutation of the collection given to it.
Because the edit distance returned by this method will always be at least as good as the best edit distance when you force a permutation, you could also use this to compute a lower bound on the best possible edit distance for a permutation, and run a branch and bound algorithm to find the best permutation.
Related
I have two things that I would like to replace in my text files:
1. Add spaces between the characters of strings ending with '#' (e.g. ABC# becomes A B C).
2. Ignore strings ending with 'H' or matching 'xx:xx:xx' (e.g. 1111H is ignored); otherwise, if it is 1111, process it into 'one one one one'.
So far, this is my code:
import os
import re

dest1 = r"C:\\Users\CL\\Desktop\\Folder"
files = os.listdir(dest1)

# dictionary to process Int to Str
numbers = {"0":"ZERO ", "1":"ONE ", "2":"TWO ", "3":"THREE ", "4":"FOUR ",
           "5":"FIVE ", "6":"SIX ", "7":"SEVEN ", "8":"EIGHT ", "9":"NINE "}

for f in files:
    with open(dest1 + "\\" + f, "r") as text:
        text_read = text.read()

    # num sub pattern
    text_read = re.sub('[%s]\s?' % ''.join(numbers),
                       lambda x: numbers[x.group().strip()] + ' ', text_read)

    # write result to file
    with open(dest1 + "\\" + f, "w") as out:
        out.write(text_read)
sample .txt
1111H I have 11 ABC# apples
11:12:00 I went to my# room
output required
1111H I have ONE ONE A B C apples
11:12:00 I went to M Y room
Also, I realized that when I write the new result, the format gets 'messy' without the line breaks; not sure why.
#current output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES
ONE ONE ONE TWO H - I WENT TO MY# ROOM
#overwritten output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES ONE ONE ONE TWO H - I WENT TO MY# ROOM
You can use
def process_match(x):
    if x.group(1):
        return " ".join(x.group(1).upper())
    elif x.group(2):
        return f'{numbers[x.group(2)]}'
    else:
        return x.group()

print(re.sub(r'\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b|\b([A-Za-z]+)#|([0-9])', process_match, text_read))
# => 1111H I have ONE ONE A B C apples
#    11:12:00 I went to M Y room
See the regex demo. The main idea behind this approach is to parse the string only once, capturing (or not capturing) parts of it, and to process each match on the go, either returning it as is (if nothing was captured) or returning converted chunks of text (if the text was captured).
Regex details:
\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b - a word boundary, and then either one or more digits and one or more uppercase letters, or three occurrences of colon-separated double digits, and then a word boundary
| - or
\b([A-Za-z]+)# - Group 1: words with # at the end: a word boundary, then one or more letters and a #
| - or
([0-9]) - Group 2: an ASCII digit.
I have a pandas data frame that contains two columns, named Potential Word and Fixed Word. The Potential Word column contains words of different languages, including misspelled words as well as correct words, and the Fixed Word column contains the correct word corresponding to each Potential Word.
Below I have shared some sample data:
Potential Word | Fixed Word
Exemple        | Example
pipol          | People
pimple         | Pimple
Iunik          | unique
My vocab DataFrame contains 600K unique rows.
My Solution:
key = given_word
glob_match_value = 0
potential_fixed_word = ''
match_threshold = 0.65

for each in df['Potential Word']:
    match_value = match(each, key)  # match is a function that returns a
                                    # similarity value of two strings
    if match_value > glob_match_value and match_value > match_threshold:
        glob_match_value = match_value
        potential_fixed_word = each
Problem
The problem with my code is that it takes a lot of time to fix every word, because the loop runs through the large vocab list. When a word is missing from the vocab, it takes almost 5 or 6 seconds to fix a sentence of 10~12 words. The match function performs decently, so it is not the target of the optimization.
I need an optimized solution; please help me here.
From the perspective of Information Retrieval (IR), you need to reduce the search space. Matching the given_word (as key) against all Potential Words is definitely inefficient.
Instead, you need to match against a reasonable number of candidates.
To find such candidates, you need to index Potential Words and Fixed Words.
from whoosh.analysis import StandardAnalyzer
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in

ix = create_in("indexdir", Schema(
    potential=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True),
    fixed=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True)
))

writer = ix.writer()
writer.add_document(potential='E x e m p l e', fixed='Example')
writer.add_document(potential='p i p o l', fixed='People')
writer.add_document(potential='p i m p l e', fixed='Pimple')
writer.add_document(potential='l u n i k', fixed='unique')
writer.commit()
With this index, you can search some candidates.
from whoosh.qparser import SimpleParser

with ix.searcher() as searcher:
    results = searcher.search(SimpleParser('potential', ix.schema).parse('p i p o l'))
    for result in results[:2]:
        print(result)
The output is
<Hit {'fixed': 'People', 'potential': 'p i p o l'}>
<Hit {'fixed': 'Pimple', 'potential': 'p i m p l e'}>
Now, you can match the given_word only against a few candidates, instead of all 600K.
It is not perfect; however, this is an inevitable trade-off and is how IR essentially works. Try this with different numbers of candidates.
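Putting the two pieces together might look like the sketch below. The helper name fix_word, the candidate count, and the threshold are assumptions for illustration; match is your own similarity function from the question.

from whoosh.qparser import SimpleParser

def fix_word(given_word, n_candidates=10, match_threshold=0.65):
    # Hypothetical helper: fetch a handful of candidates from the index,
    # then run the expensive match() only on those.
    spaced = " ".join(given_word)  # the index stores letters space-separated
    best_value, best_fixed = 0, ''
    with ix.searcher() as searcher:
        query = SimpleParser('potential', ix.schema).parse(spaced)
        for hit in searcher.search(query, limit=n_candidates):
            candidate = hit['potential'].replace(' ', '')
            value = match(candidate, given_word)  # your similarity function
            if value > best_value and value > match_threshold:
                best_value, best_fixed = value, hit['fixed']
    return best_fixed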
Without changing much in your implementation (I believe it is somewhat unavoidable to iterate over the list of potential words for each word), my intention here is not to optimize the match function itself but to leverage multiple threads to search in parallel.
import concurrent.futures
import time
from concurrent.futures.thread import ThreadPoolExecutor
from typing import Any, Union, Iterator

import pandas as pd

# Replace your dataframe here for testing this
df = pd.DataFrame({'Potential Word': ["a", "b", "c"], "Fixed Word": ["a", "c", "b"]})

# Replace by your match function
def match(w1, w2):
    # Simulate some work is happening here
    time.sleep(1)
    return 1

# This is mostly your function itself
# Using index to recreate the sentence from the returned values
def matcher(idx, given_word):
    key = given_word
    glob_match_value = 0
    potential_fixed_word = ''
    match_threshold = 0.65
    for each in df['Potential Word']:
        match_value = match(each, key)  # match is a function that returns a
                                        # similarity value of two strings
        if match_value > glob_match_value and match_value > match_threshold:
            glob_match_value = match_value
            potential_fixed_word = each
            return idx, potential_fixed_word
    else:
        # Handling default case, you might want to change this
        return idx, ""

sentence = "match is a function that returns a similarity value of two strings match is a function that returns a " \
           "similarity value of two strings"

start = time.time()

# Using a threadpool executor
# You can increase or decrease the max_workers based on your machine
executor: Union[ThreadPoolExecutor, Any]
with concurrent.futures.ThreadPoolExecutor(max_workers=24) as executor:
    futures: Iterator[Union[str, Any]] = executor.map(matcher, list(range(len(sentence.split()))), sentence.split())

# Joining back the input sentence
out_sentence = " ".join(x[1] for x in sorted(futures, key=lambda x: x[0]))
print(out_sentence)
print(time.time() - start)
Please note that the runtime of this will depend upon
The maximum time taken by a single match call
Number of words in the sentence
Number of worker threads (TIP: Try to see if you can use as many as the number of words in the sentence)
I would use the sortedcollections module. In general, the access time for a SortedList or SortedDict is O(log(n)) instead of O(n); in your case that is roughly 19.2 (log2 of 600,000) if/then checks vs 600,000 if/then checks.
from sortedcollections import SortedDict
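One way to read this suggestion is to use the sorted structure for an exact (or shared-prefix) lookup before falling back to the expensive fuzzy match. The sketch below is illustrative only; the helper name and the two-character prefix are assumptions.

from sortedcollections import SortedDict

# build once from the dataframe: Potential Word -> Fixed Word
vocab = SortedDict(zip(df['Potential Word'], df['Fixed Word']))

def candidates_for(given_word):
    # exact hit: O(log n) membership test
    if given_word in vocab:
        return [given_word]
    # otherwise narrow to keys sharing a short prefix; run the expensive
    # match() only against this much smaller candidate list
    # (words misspelled in their first characters will be missed, so this
    # works best combined with an index like the one above)
    prefix = given_word[:2]
    return list(vocab.irange(prefix, prefix + '\uffff'))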
I want to auto-correct the words which are in my list.
Say I have a list
kw = ['tiger','lion','elephant','black cat','dog']
I want to check whether these words appear in my sentences. If they are wrongly spelled, I want to correct them. I don't intend to touch any words other than those in the given list.
Now I have list of str
s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs"]
Expected output:
['tiger','lion',None,'dog']
My Efforts:
import difflib
op = [difflib.get_close_matches(i,kw,cutoff=0.5) for i in s]
print(op)
My Output:
[[], [], [], ['dog']]
The problem with the above code is that I want to compare the entire sentence, and my kw list can have more than one word (up to 4-5 words).
If I lower the cutoff value, it starts returning words that it should not.
So even if I plan to create bigrams and trigrams from the given sentence, it would consume a lot of time.
So is there a way to implement this?
I have explored a few more libraries like autocorrect, hunspell etc., but no success.
You could implement something based on Levenshtein distance.
It's interesting to note elasticsearch's implementation: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzziness.html
Clearly, bieber is a long way from beaver—they are too far apart to be
considered a simple misspelling. Damerau observed that 80% of human
misspellings have an edit distance of 1. In other words, 80% of
misspellings could be corrected with a single edit to the original
string.
Elasticsearch supports a maximum edit distance, specified with the
fuzziness parameter, of 2.
Of course, the impact that a single edit has on a string depends on
the length of the string. Two edits to the word hat can produce mad,
so allowing two edits on a string of length 3 is overkill. The
fuzziness parameter can be set to AUTO, which results in the following
maximum edit distances:
0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters
I like to use pyxDamerauLevenshtein myself.
pip install pyxDamerauLevenshtein
So you could do a simple implementation like:
keywords = ['tiger','lion','elephant','black cat','dog']
from pyxdameraulevenshtein import damerau_levenshtein_distance
def correct_sentence(sentence):
new_sentence = []
for word in sentence.split():
budget = 2
n = len(word)
if n < 3:
budget = 0
elif 3 <= n < 6:
budget = 1
if budget:
for keyword in keywords:
if damerau_levenshtein_distance(word, keyword) <= budget:
new_sentence.append(keyword)
break
else:
new_sentence.append(word)
else:
new_sentence.append(word)
return " ".join(new_sentence)
Just make sure you use a better tokenizer or this will get messy, but you get the point. Also note that this is unoptimized, and will be really slow with a lot of keywords. You should implement some kind of bucketing to not match all words with all keywords.
Here is one way using difflib.SequenceMatcher. The SequenceMatcher class allows you to measure sentence similarity with its ratio method; you only need to provide a suitable threshold in order to keep words whose ratio falls above the given threshold:
def find_similar_word(s, kw, thr=0.5):
    from difflib import SequenceMatcher
    out = []
    for i in s:
        f = False
        for j in i.split():
            for k in kw:
                if SequenceMatcher(a=j, b=k).ratio() > thr:
                    out.append(k)
                    f = True
                if f:
                    break
            if f:
                break
        else:
            out.append(None)
    return out
Output
find_similar_word(s, kw)
['tiger', 'lion', None, 'dog']
Although this is slightly different from your expected output (it is a list of lists instead of a list of strings), I think it is a step in the right direction. The reason I chose this method is so that you can have multiple corrections per sentence. That is why I added another example sentence.
import difflib
import itertools
kw = ['tiger','lion','elephant','black cat','dog']
s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs", "A tyger is different from a doog"]
op = [[difflib.get_close_matches(j,kw,cutoff=0.5) for j in i.split()] for i in s]
op = [list(itertools.chain(*o)) for o in op]
print(op)
The output generated is:
[['tiger'], ['lion'], [], ['dog'], ['tiger', 'dog']]
The trick is to split all the sentences along the whitespaces.
I am using str.contains for text analytics in Pandas. For the sentence "My latest Data job was an Analyst", I want a combination of the words "Data" & "Analyst", but at the same time I want to specify the number of words between the two words used for the combination (here it is 2 words between "Data" and "Analyst"). Currently I am using (DataFile.XXX.str.contains('job') & DataFile.XXX.str.contains('Analyst')) to get the counts for "job Analyst".
How can I specify the number of words between the 2 words in the str.contains syntax?
Thanks in advance
You can't. At least, not in a simple or standardized way.
Even the basics, like how you define a "word," are a lot more complex than you probably imagine. Both word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.
Let's look at a simple approach. First you need a way to parse a string into words. The following is rough by NLP standards, but will work for simpler cases:
import re

def parse_words(s):
    """
    Simple parser to grab English words from string.

    CAUTION: A simplistic solution to a hard problem.
             Many possibly-important edge- and corner-cases
             not handled. Just one example: Hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g.:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
Then you need a way to find all the indices in a list where a target word is found:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
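For example, reusing the sentence from above:

>>> list_indices('last', parse_words("and don't think this day's last moment won't come at last"))
[5, 10]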
And finally a decision making wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')

    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`

    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]

    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False

    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s)
    actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]

    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
So then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
The only thing left to do is map that back to Pandas:
import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})
df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
This is basically how you'd solve the problem. Keep in mind, it's a rough and simplistic solution. Some simply-posed questions are not simply-answered. NLP questions are often among them.
I'm looking for a library, or a method using existing libraries (difflib, fuzzywuzzy, python-Levenshtein), to find the closest match of a string (query) in a text (corpus).
I've developed a method based on difflib, where I split my corpus into ngrams of size n (the length of the query).
import difflib
from nltk.util import ngrams

def get_best_match(query, corpus):
    ngs = ngrams(list(corpus), len(query))
    ngrams_text = [''.join(x) for x in ngs]
    return difflib.get_close_matches(query, ngrams_text, n=1, cutoff=0)
It works as I want when the difference between the query and the matched string is just character replacements.
query = "ipsum dolor"
corpus = "lorem 1psum d0l0r sit amet"
match = get_best_match(query, corpus)
# match = "1psum d0l0r"
But when the difference is character deletion, it is not.
query = "ipsum dolor"
corpus = "lorem 1psum dlr sit amet"
match = get_best_match(query, corpus)
# match = "psum dlr si"
# expected_match = "1psum dlr"
Is there a way to get a more flexible result size ( as for expected_match ) ?
EDIT 1:
The actual use of this script is to match queries (strings) with a
messy ocr output.
As I said in the question, the ocr can confound characters, and even miss them.
If possible consider also the case when a space is missing between words.
The best match is one that does not include characters from words other than those in the query.
EDIT 2:
The solution I use now is to extend the ngrams with (n-k)-grams for k = {1,2,3}, to cope with up to 3 deletions. It's much better than the first version, but not efficient in terms of speed, as we have more than 3 times the number of ngrams to check. It is also a non-generalizable solution.
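For reference, that (n-k)-gram workaround might look like the sketch below (illustrative only; it reuses the ngrams and difflib imports from the code above, and the function name is made up):

def get_best_match_flexible(query, corpus, max_deletions=3):
    # also consider windows shorter than the query, to absorb deletions
    candidates = []
    for k in range(max_deletions + 1):
        ngs = ngrams(list(corpus), len(query) - k)
        candidates.extend(''.join(x) for x in ngs)
    return difflib.get_close_matches(query, candidates, n=1, cutoff=0)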
This function finds best matching substring of variable length.
The implementation considers the corpus as one long string, hence avoiding your concerns with spaces and unseparated words.
Code summary:
1. Scan the corpus for match values in steps of size step to find the approximate location of highest match value, pos.
2. Find the substring in the vicinity of pos with the highest match value, by adjusting the left/right positions of the substring.
from difflib import SequenceMatcher

def get_best_match(query, corpus, step=4, flex=3, case_sensitive=False, verbose=False):
    """Return best matching substring of corpus.

    Parameters
    ----------
    query : str
    corpus : str
    step : int
        Step size of first match-value scan through corpus. Can be thought of
        as a sort of "scan resolution". Should not exceed length of query.
    flex : int
        Max. left/right substring position adjustment value. Should not
        exceed length of query / 2.

    Outputs
    -------
    output0 : str
        Best matching substring.
    output1 : float
        Match ratio of best matching substring. 1 is perfect match.
    """

    def _match(a, b):
        """Compact alias for SequenceMatcher."""
        return SequenceMatcher(None, a, b).ratio()

    def scan_corpus(step):
        """Return list of match values from corpus-wide scan."""
        match_values = []

        m = 0
        while m + qlen - step <= len(corpus):
            match_values.append(_match(query, corpus[m : m-1+qlen]))
            if verbose:
                print(query, "-", corpus[m: m + qlen], _match(query, corpus[m: m + qlen]))
            m += step

        return match_values

    def index_max(v):
        """Return index of max value."""
        return max(range(len(v)), key=v.__getitem__)

    def adjust_left_right_positions():
        """Return left/right positions for best string match."""
        # bp_* is synonym for 'Best Position Left/Right' and are adjusted
        # to optimize bmv_*
        p_l, bp_l = [pos] * 2
        p_r, bp_r = [pos + qlen] * 2

        # bmv_* are declared here in case they are untouched in optimization
        bmv_l = match_values[p_l // step]
        bmv_r = match_values[p_l // step]

        for f in range(flex):
            ll = _match(query, corpus[p_l - f: p_r])
            if ll > bmv_l:
                bmv_l = ll
                bp_l = p_l - f

            lr = _match(query, corpus[p_l + f: p_r])
            if lr > bmv_l:
                bmv_l = lr
                bp_l = p_l + f

            rl = _match(query, corpus[p_l: p_r - f])
            if rl > bmv_r:
                bmv_r = rl
                bp_r = p_r - f

            rr = _match(query, corpus[p_l: p_r + f])
            if rr > bmv_r:
                bmv_r = rr
                bp_r = p_r + f

            if verbose:
                print("\n" + str(f))
                print("ll: -- value: %f -- snippet: %s" % (ll, corpus[p_l - f: p_r]))
                print("lr: -- value: %f -- snippet: %s" % (lr, corpus[p_l + f: p_r]))
                print("rl: -- value: %f -- snippet: %s" % (rl, corpus[p_l: p_r - f]))
                print("rr: -- value: %f -- snippet: %s" % (rr, corpus[p_l: p_r + f]))

        return bp_l, bp_r, _match(query, corpus[bp_l : bp_r])

    if not case_sensitive:
        query = query.lower()
        corpus = corpus.lower()

    qlen = len(query)

    if flex >= qlen/2:
        print("Warning: flex exceeds length of query / 2. Setting to default.")
        flex = 3

    match_values = scan_corpus(step)
    pos = index_max(match_values) * step

    pos_left, pos_right, match_value = adjust_left_right_positions()

    return corpus[pos_left: pos_right].strip(), match_value
Example:
query = "ipsum dolor"
corpus = "lorem i psum d0l0r sit amet"
match = get_best_match(query, corpus, step=2, flex=4)
print(match)
('i psum d0l0r', 0.782608695652174)
Some good heuristic advice is to always keep step < len(query) * 3/4, and flex < len(query) / 3. I also added case sensitivity, in case that's important. It works quite well when you start playing with the step and flex values. Small step values give better results but take longer to compute. flex governs how flexible the length of the resulting substring is allowed to be.
Important to note: This will only find the first best match, so if there are multiple equally good matches, only the first will be returned. To allow for multiple matches, change index_max() to return a list of indices for the n highest values of the input list, and loop over adjust_left_right_positions() for values in that list.
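A minimal sketch of that index_max() modification (heapq.nlargest keeps it short; the helper name is illustrative):

import heapq

def index_max_n(v, n):
    """Indices of the n highest values in v, best first."""
    return heapq.nlargest(n, range(len(v)), key=v.__getitem__)

# then: for pos in (i * step for i in index_max_n(match_values, n)):
#           ... call adjust_left_right_positions() for each candidate pos ...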
The main path to a solution uses finite state automata (FSA) of some kind. If you want a detailed summary of the topic, check this dissertation out (PDF link). Error-based models (including Levenshtein automata and transducers, the former of which Sergei mentioned) are valid approaches to this. However, stochastic models, including various types of machine learning approaches integrated with FSAs, are very popular at the moment.
Since we are looking at edit distances (effectively misspelled words), the Levenshtein approach is good and relatively simple. This paper (as well as the dissertation; also PDF) gives a decent outline of the basic idea and also explicitly mentions the application to OCR tasks. However, I will review some of the key points below.
The basic idea is that you want to build an FSA that computes both the valid string as well as all strings up to some error distance (k). In the general case, this k could be infinite or the size of the text, but this is mostly irrelevant for OCR (if your OCR could even potentially return bl*h where * is the rest of the entire text, I would advise finding a better OCR system). Hence, we can restrict regex's like bl*h from the set of valid answers for the search string blah. A general, simple and intuitive k for your context is probably the length of the string (w) minus 2. This allows b--h to be a valid string for blah. It also allows bla--h, but that's okay. Also, keep in mind that the errors can be any character you specify, including spaces (hence 'multiword' input is solvable).
The next basic task is to set up a simple weighted transducer. Any of the OpenFST Python ports can do this (here's one). The logic is simple: insertions and deletions increment the weight while equality increments the index in the input string. You could also just hand code it as the guy in Sergei's comment link did.
Once you have the weights and associated indexes of the weights, you just sort and return. The computational complexity should be O(n(w+k)), since we will look ahead w+k characters in the worst case for each character (n) in the text.
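As a rough illustration of hand-coding that logic without OpenFST, the equivalent dynamic program (essentially Sellers' approximate-matching recurrence, which lets a match start anywhere in the text) could look like this sketch:

def best_approx_occurrence(query, text):
    """Substring of text with minimal edit distance to query.
    Returns (start, end, distance)."""
    m, n = len(query), len(text)
    prev = [0] * (n + 1)            # row i-1 of costs; row 0 is all zeros,
    starts = list(range(n + 1))     # so a match may start at any text position
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        cur_starts = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = prev[j - 1] + (query[i - 1] != text[j - 1])  # match or substitute
            dele = prev[j] + 1                                 # unmatched query char
            ins = cur[j - 1] + 1                               # unmatched text char
            best = min(sub, dele, ins)
            cur[j] = best
            if best == sub:
                cur_starts[j] = starts[j - 1]
            elif best == dele:
                cur_starts[j] = starts[j]
            else:
                cur_starts[j] = cur_starts[j - 1]
        prev, starts = cur, cur_starts
    end = min(range(n + 1), key=lambda j: prev[j])
    return starts[end], end, prev[end]

# best_approx_occurrence("blah", "xx blxh yy") -> (3, 7, 1), i.e. "blxh"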
From here, you can do all sorts of things. You could convert the transducer to a DFA. You could parallelize the system by breaking the text into w+k-grams, which are sent to different processes. You could develop a language model or confusion matrix that defines what common mistakes exist for each letter in the input set (and thereby restrict the space of valid transitions and the complexity of the associated FSA). The literature is vast and still growing so there are probably as many modifications as there are solutions (if not more).
Hopefully that answers some of your questions without giving any code.
I would try to build a regular expression template from the query string. The template could then be used to search the corpus for substrings that are likely to match the query. Then use difflib or fuzzywuzzy to check if the substring does match the query.
For example, a possible template would be to match at least one of the first two letters of the query, at least one of the last two letters of the query, and have approximately the right number of letters in between:
import re
import difflib

query = "ipsum dolor"
corpus = ["lorem 1psum d0l0r sit amet",
          "lorem 1psum dlr sit amet",
          "lorem ixxxxxxxr sit amet"]

first_letter, second_letter = query[:2]
minimum_gap, maximum_gap = len(query) - 6, len(query) - 3
penultimate_letter, ultimate_letter = query[-2:]

fmt = '(?:{}.|.{}).{{{},{}}}(?:{}.|.{})'.format
pattern = fmt(first_letter, second_letter,
              minimum_gap, maximum_gap,
              penultimate_letter, ultimate_letter)
#print(pattern) # for debugging pattern

m = difflib.SequenceMatcher(None, "", query, False)

for c in corpus:
    for match in re.finditer(pattern, c, re.IGNORECASE):
        substring = match.group()
        m.set_seq1(substring)
        ops = m.get_opcodes()

        # EDIT: fixed calculation of the number of edits
        #num_edits = sum(1 for t,_,_,_,_ in ops if t != 'equal')
        num_edits = sum(max(i2-i1, j2-j1) for op,i1,i2,j1,j2 in ops if op != 'equal')

        print(num_edits, substring)
Output:
3 1psum d0l0r
3 1psum dlr
9 ixxxxxxxr
Another idea is to use the characteristics of the ocr when building the regex. For example, if the ocr always gets certain letters correct, then when any of those letters are in the query, use a few of them in the regex. Or if the ocr mixes up '1', '!', 'l', and 'i', but never substitutes something else, then if one of those letters is in the query, use [1!il] in the regex.
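As a rough sketch of that idea (the confusion sets below are illustrative assumptions, not measured OCR behaviour):

import re

# substitute each confusable letter in the query with a character class
confusions = {'i': '[1!il]', 'l': '[1!il]', 'o': '[o0]', 's': '[s5]'}

def fuzzy_class(ch):
    return confusions.get(ch.lower(), re.escape(ch))

query = "ipsum dolor"
pattern = ''.join(fuzzy_class(c) for c in query)
# re.search(pattern, corpus_line, re.IGNORECASE) now tolerates those OCR swaps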