Tokenizing unsplit words from OCR using NLTK - python

I'm using NLTK to process some text that is extracted from PDF files. I can recover the text mostly intact, but there are lots of instances where spaces between words are not captured, so I get words like ifI instead of if I, or thatposition instead of that position, or andhe's instead of and he's.
My question is this: how can I use NLTK to look for words it does not recognize/has not learned, and see if there are "nearby" word combinations that are much more likely to occur? Is there a more graceful way to implement this kind of check than simply marching through the unrecognized word, one character at a time, splitting it, and seeing if it makes two recognizable words?
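For concreteness, the brute-force check described in the question can be written in a few lines using NLTK's words corpus as the vocabulary. This is only a naive baseline (it takes the first split that yields two known words and ignores which split is more likely); the answer below takes a more robust route:
from nltk.corpus import words  # one-time setup: nltk.download('words')

vocab = set(w.lower() for w in words.words())

def try_split(token):
    """Return the token split into two known words, if such a split exists."""
    if token.lower() in vocab:
        return [token]
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left.lower() in vocab and right.lower() in vocab:
            return [left, right]
    return [token]

print(try_split("thatposition"))  # ['that', 'position'] with this word list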

I would suggest using pyenchant instead, since it is a more robust solution for this sort of problem. You can install it with pip install pyenchant. Here is an example of how you would obtain your results once it is installed:
>>> text = "IfI am inthat position, Idon't think I will." # note the lack of spaces
>>> from enchant.checker import SpellChecker
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
for suggestion in error.suggest():
if error.word.replace(' ', '') == suggestion.replace(' ', ''): # make sure the suggestion has exact same characters as error in the same order as error and without considering spaces
error.replace(suggestion)
break
>>> checker.get_text()
"If I am in that position, I don't think I will." # text is now fixed

Related

Remove whitespace between two lowercase letters

Trying to find a regex (or different method), that removes whitespace in a string only if it occurs between two lowercase letters. I'm doing this because I'm cleaning noisy text from scans where whitespace was mistakenly added inside of words.
For example, I'd like to turn the string noisy = "Hel lo, my na me is Mark." into clean = "Hello, my name is Mark."
I've tried to capture the group in a regex (see below) but don't know how to then replace only whitespace in between two lowercase letters. Same issue with re.sub.
This is what I've tried, but it doesn't work because it removes all the whitespace from the string:
import re
noisy = "Hel lo my name is Mark"
finder = re.compile("[a-z](\s)[a-z]")
whitesp = finder.search(noisy).group(1)
clean = noisy.replace(whitesp,"")
print(clean)
Any ideas are appreciated, thanks!
EDIT 1:
My use case is for Swedish words and sentences that I have OCR'd from scanned documents.
To correct an entire string, you could try symspellpy.
First, install it using pip:
python -m pip install -U symspellpy
Then, import the required packages, and load dictionaries. Dictionary files shipped with symspellpy can be accessed using pkg_resources. You can pass your string through the lookup_compound function, which will return a list of spelling suggestions (SuggestItem objects). Words that require no change will still be included in this list. max_edit_distance refers to the maximum edit distance for doing lookups (per single word, not entire string). You can maintain casing by setting transfer_casing to True. To get the clean string, a simple join statement with a little list comprehension does the trick.
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

my_str = "Hel lo, my na me is Mark."
sugs = sym_spell.lookup_compound(
    my_str, max_edit_distance=2, transfer_casing=True
)
print(" ".join([sug.term for sug in sugs]))
Output:
Hello my name is Mark
Check out their documentation for other examples and use cases.
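Incidentally, since the problem at the top of this page is missing spaces rather than misspellings, symspellpy's word_segmentation function may also be worth a try with the same unigram dictionary. A small sketch; the attribute name corrected_string follows symspellpy's documentation, so double-check it against the version you install:
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# word_segmentation inserts the missing spaces (and can fix small typos)
result = sym_spell.word_segmentation("IfI am inthat position")
print(result.corrected_string)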
Is this what you want:
In [3]: finder = re.compile(r"([a-z])\s([a-z])")
In [4]: clean = finder.sub(r'\1\2', noisy, 1)
In [5]: clean
Out[5]: 'Hello my name is Mark'
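If you replace every match instead of just the first (for example with a lookaround pattern, so adjacent matches don't interfere), the limitation of a purely character-based rule becomes clear: it cannot tell an OCR-inserted space from a legitimate one, which is why the dictionary-based symspellpy answer above can be preferable. A small illustration on the question's own string:
import re

noisy = "Hel lo my name is Mark"
# zero-width lookarounds: delete a space only when it sits between two
# lowercase letters, without consuming either letter
clean = re.sub(r'(?<=[a-z]) (?=[a-z])', '', noisy)
print(clean)  # -> 'Hellomynameis Mark': real spaces between lowercase words go too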
I think you need a Python module that contains a word list (like an Oxford dictionary) so you can check whether the characters on either side of a space form a valid word. For example, break the string into a list with string.split(), then loop over the list starting at index 1 (range(1, len(your_list))). At each step, join the current and previous items (list[index - 1] + list[index]) into a candidate token and check it against your set of words. If the token is a valid word, append it to a temporary list; if not, append the previous word instead. Once the loop is done, join the temporary list back into a string.
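A rough sketch of that idea, using NLTK's words corpus as the vocabulary (any word list would do):
from nltk.corpus import words  # one-time setup: nltk.download('words')

vocab = set(w.lower() for w in words.words())

def merge_split_words(text):
    tokens = text.split()
    if not tokens:
        return text
    out = [tokens[0]]
    for tok in tokens[1:]:
        joined = out[-1] + tok
        # strip surrounding punctuation before checking the word list
        if joined.strip('.,!?').lower() in vocab:
            out[-1] = joined
        else:
            out.append(tok)
    return ' '.join(out)

print(merge_split_words("Hel lo, my na me is Mark."))
# results depend on the word list, and greedy merging can glue real words
# together (e.g. "my" + "na"), so a frequency-based checker is usually safer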
You can try the Python spelling checker pyenchant, the Python grammar checker language-check, or even use NLTK corpora to build your own checker.

Replacing method for words with boundaries in python (like with regex)

I am looking for a more robust replace method in Python because I am building a
spellchecker for words in an OCR context.
Let's say we have the following text in python:
text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""
It is easy to see that instead of "his text is very difficult" the right phrase would be "this text is very difficult".
And if I do text.replace('his', 'this') then I replace every single 'his' with 'this', so I would get errors like "tthis is a text".
When I do a replacement, I would like to replace the whole word 'his', not the 'his' inside 'this'.
Why not try this?
word_to_replace = 'his'
corrected_word = 'this'
corrected_text = re.sub(r'\b' + word_to_replace + r'\b', corrected_word, text)
corrected_text
Awesome, we did it, but the problem is... what if the word to correct contains a special character like '|'? For example,
'|ights are on' instead of 'lights are on'. Trust me, it happened to me, and re.sub is a disaster in that case.
The question is: have you encountered the same problem? Is there any method to solve this? Replacement seems like the most
robust option.
I tried text.replace(' ' + word_to_replace + ' ', ' ' + corrected_word + ' ') and this solves a lot of cases, but it still
fails for phrases where 'his' sits at the beginning of a sentence, because then it is not surrounded by spaces,
so ' his ' is never matched and never becomes ' this '.
Is there any replacement method in Python that takes the whole word as input, the way \b word_to_correct \b does in regexes?
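As a side note, one way to get \b-like behaviour that also copes with a leading non-word character such as '|' is to escape the literal word and emulate the boundaries with lookarounds. A minimal sketch (replace_word is just an illustrative name):
import re

text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""

def replace_word(old_word, new_word, text):
    # re.escape makes characters such as '|' literal; the lookarounds act
    # like \b but also work when the word starts with a non-word character
    pattern = r'(?<!\w)' + re.escape(old_word) + r'(?!\w)'
    return re.sub(pattern, new_word, text)

print(replace_word('his', 'this', text))  # only the standalone 'his' changes
print(replace_word('|ights', 'lights', '|ights are on'))  # -> 'lights are on'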
After a few days I solved the problem I had. I hope this is helpful for someone else. Let me know if you have any questions.
text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""
# Assume you have already corrected the word via your OCR spellchecker
# and you just have to replace it in the text.
# So we have the following word2correct and corrected_word (the word after spellchecking):
word2correct = 'his'
corrected_word = 'this'
import re

# Now we replace the word together with its context
def context_replace(old_word, new_word, text):
    # Match the word between \b boundaries together with up to 10 characters
    # of context on each side. This captures 'his' and its context, but not 'this'.
    phrase2correct = re.findall('.{1,10}\\b' + old_word + '\\b.{1,10}', text)[0]
    # Once the context is matched, swap the new word into that phrase
    phrase_corrected = phrase2correct.replace(old_word, new_word)
    # Now replace the old phrase (phrase2correct) with the corrected one (phrase_corrected)
    text = text.replace(phrase2correct, phrase_corrected)
    return text
Test if the function works...
print(context_replace(old_word=word2correct,new_word=corrected_word,text=text))
Output:
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, this text is very difficult to work with.
It worked for my purpose. I hope this is helpful for someone else.

Text Segmentation using Python package of wordsegment

Folks,
I have been using the Python library wordsegment by Grant Jenks for the past couple of hours. The library works fine for rejoining split words or separating combined words, such as e nd ==> end and thisisacat ==> this is a cat.
I am working on textual data that involves numbers as well, and using this library on it has the reverse effect. The perfectly fine text increased $55 million or 23.8% for converts to the very weird increased 55millionor238 for (after joining the returned list). Note that this happens unpredictably (it may or may not happen) for any part of the text that involves numbers.
Have anybody worked with this library before?
If yes, have you faced similar situation and found a workaround?
If not, do you know of any other python library that does this trick for us?
Thank you.
Looking at the code, the segment function first runs clean, which removes all non-alphanumeric characters; it then searches for known unigrams and bigrams within the text clump and scores the words it finds based on their frequency of occurrence in English.
'increased $55 million or 23.8% for'
becomes
'increased55millionor238for'
When searching for sub-terms, it finds 'increased' and 'for', but the score for the unknown phrase '55millionor238' is better than the score for breaking it up for some reason.
It seems to do better with unknown text, especially smaller unknown text elements. You could substitute out non-alphabetic character sequences, run it through segment and then substitute back in.
import re
from random import choices

import wordsegment
wordsegment.load()  # recent versions of wordsegment need this before segment()

s = 'increased $55 million or 23.8% for'
CONS = 'bdghjklmpqvwxz'

def sub_map(s, mapping):
    out = s
    for k, v in mapping.items():
        out = out.replace(k, v)
    return out

# map each numeric/currency sequence to a random consonant triple
mapping = {m.group(): ''.join(choices(CONS, k=3))
           for m in re.finditer(r'[0-9\.,$%]+', s)}
revmap = {v: k for k, v in mapping.items()}

word_list = wordsegment.segment(sub_map(s, mapping))
word_list = [revmap.get(w, w) for w in word_list]
word_list
# returns:
# ['increased', '$55', 'million', 'or', '23.8%', 'for']
There are implementations in Ruby and Python at Need help understanding this Python Viterbi algorithm.
The algorithm (and those implementations) are pretty straightforward, and copy & paste may be better than using a library, because (in my experience) this problem almost always needs some customisation to fit the data at hand (i.e. language, specific topics, custom entities, date or currency formats).
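For reference, the core of such a segmenter is small. Here is a minimal sketch of Viterbi-style segmentation over a toy unigram model; word_probs and its values are made up purely for illustration, and a real implementation would estimate them from a corpus and assign a small penalty to unknown words rather than probability zero:
def viterbi_segment(text, word_probs, max_word_len=20):
    # best[i] = (probability, words) for the best segmentation of text[:i]
    best = [(1.0, [])] + [(0.0, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            prob = best[j][0] * word_probs.get(word, 0.0)
            if prob > best[i][0]:
                best[i] = (prob, best[j][1] + [word])
    # returns None if the text cannot be segmented with the given vocabulary
    return best[len(text)][1]

word_probs = {'this': 0.01, 'is': 0.02, 'a': 0.03, 'cat': 0.005}
print(viterbi_segment('thisisacat', word_probs))  # ['this', 'is', 'a', 'cat']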

Python regex: find words and emoticons

I want to find matches between a tweet and a list of strings containing words, phrases, and emoticons. Here is my code:
words = [':)','and i','sleeping','... :)','! <3','facebook']
regex = re.compile(r'\b%s\b|(:\(|:\))+' % '\\b|\\b'.join(words), flags=re.IGNORECASE)
I keep receiving this error:
error: unbalanced parenthesis
Apparently there is something wrong with the code and it cannot match emoticons. Any idea how to fix it?
I tried the below and it stopped throwing the error:
words = [':\)','and i','sleeping','... :\)','! <3','facebook']
The re module has a function escape that takes care of correct escaping of words, so you could just use
words = map(re.escape, [':)','and i','sleeping','... :)','! <3','facebook'])
Note that word boundaries might not work as you expect when used with words that don't start or end with actual word characters.
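Putting that together, a quick sketch of how the escaped list plugs into the original pattern; the sample tweet and the expected matches are illustrative only:
import re

words = [':)', 'and i', 'sleeping', '... :)', '! <3', 'facebook']
escaped = map(re.escape, words)
regex = re.compile(r'\b%s\b|(:\(|:\))+' % r'\b|\b'.join(escaped),
                   flags=re.IGNORECASE)

tweet = "I was sleeping and I missed it :)"
print([m.group() for m in regex.finditer(tweet)])
# expected to print something like ['sleeping', 'and I', ':)']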
While words has all the necessary entries, re treats ( and ) as special characters. You need to write \( and \) so they are interpreted as the literal ASCII characters 40 and 41 rather than as grouping metacharacters. To spell out what @Nicarus was saying, you need to use this:
words = [':\)','and i','sleeping','... :\)','! <3','facebook']
Note: I'm only spelling it out, for anyone inclined to criticize that, because this doesn't seem like a school assignment. Also, look at the documentation before coming to Stack Overflow; it explains everything.

Use concordance to find hyphenated words

I was able to get the expected output from the example in this book, page 4, "Searching Text". When I tried to apply it to my own case I got No matches, which was not the expected output. I think I'm not tokenizing at the proper level (word instead of character), but I'm unsure how to correct that. Any suggestions? The output I want is every hyphen lined up vertically with its surrounding context.
>>> f = open('hyphen.txt')
>>> raw = f.read()
>>> import nltk
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)
>>> text.concordance("-")
No matches
>>> text
<Text: Fog Air-Flow Switch stuck off ? Bubble Tower...>
(Python 3.4.3)
EDIT
I think I'm close by using regular expressions but I don't know how to remove the 'NoneType' objects. Any suggestions?
The output I'd want to see would look like this:
Fog Air-Flow Switch stuck off?
Bubble Tower Check-Valve stuck closed?
Chamber Drain-Trap broken, dry, or missing?
Chamber Exhaust-Vent blocked or restricted?
etc.
It's okay if the context is wider than the sentence with the hyphen - all that matters to me is that the hyphens are lined up vertically with their surrounding context.
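For illustration, here is a minimal plain-Python sketch of one way to get that layout, centering each hyphen in a fixed-width window; hyphen.txt and the window width are assumptions:
width = 30  # characters of context on each side (arbitrary choice)
raw = open('hyphen.txt').read().replace('\n', ' ')
for i, ch in enumerate(raw):
    if ch == '-':
        left = raw[max(0, i - width):i]
        right = raw[i + 1:i + 1 + width]
        print(left.rjust(width) + '-' + right.ljust(width))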
You need to change your code a little bit.
import nltk

f = open("/path/to/file")  # path to the file
raw = f.read()
# pass the raw string itself so that Text works at the character level
text = nltk.Text(raw)
text.concordance("-")
Required Output:
