Ways to clean up photos for Tesseract OCR? - python

I'm new to Tesseract and wanted to know if there were any ways to clean up photos for a simple OCR program to get better results. Thanks in advance for any help!
The code I am using:
import pytesseract as tess
from PIL import Image

#loads tesseract
tess.pytesseract.tesseract_cmd =
#filepath
file_path =
image = Image.open(file_path)
#processes image
text = tess.image_to_string(image, config='')
print(text)

I've used pytesseract in the past and with the following four modifications, I could read almost anything as long as the text font wasn't too small to begin with. pytesseract seems to struggle with small writing, even after resizing.
- Convert to Black & White -
Converting the images to black and white would frequently improve the program's recognition accuracy. I used OpenCV to do so and got the code from the end of this article.
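A minimal sketch of that kind of conversion with OpenCV (not the article's exact code; the file names are placeholders):

import cv2

image = cv2.imread("photo.png")                 # placeholder input file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # drop the colour channels
# Otsu's method picks the black/white cutoff automatically
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("photo_bw.png", bw)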
- Crop -
If all your photos are in a similar format, as in the text you need is always in the same spot, I'd recommend cropping your pictures. If possible, pass only the exact part of the photo you want analyzed to pytesseract; the less the program has to analyze, the better. In my case, I was taking screenshots and would specify the exact region to capture.
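For example, a small sketch of cropping a fixed region with Pillow before handing it to pytesseract (the box coordinates are placeholders):

from PIL import Image
import pytesseract

image = Image.open("screenshot.png")       # placeholder input file
# (left, upper, right, lower) - placeholder coordinates of the text region
region = image.crop((100, 50, 400, 120))
print(pytesseract.image_to_string(region))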
- Resize -
Another thing you can do is play with the scaling of the original photo. Sometimes, after resizing to almost double its initial size, pytesseract could read the text a lot more easily. Generally, bigger text is better, but there's a limit, as the photo can become too pixelated after resizing to be recognizable.
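A hedged sketch of that resizing with Pillow (the 2x factor is just a starting point to experiment with):

from PIL import Image
import pytesseract

image = Image.open("photo.png")            # placeholder input file
w, h = image.size
# roughly double the size; LANCZOS resampling keeps edges reasonably sharp
resized = image.resize((w * 2, h * 2), Image.LANCZOS)
print(pytesseract.image_to_string(resized))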
- Config -
I've noticed that pytesseract can recognize text a lot more easily than numbers. If possible, break the photo down into sections, and whenever you have a trouble spot with numbers you can use:
pytesseract.image_to_string(image, config='digits')
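Putting the cropping and the digits config together, a small sketch (the region box is a placeholder):

from PIL import Image
import pytesseract

image = Image.open("photo.png")                  # placeholder input file
numbers_region = image.crop((10, 10, 200, 60))   # placeholder box around the digits
# restrict recognition to digits for this trouble spot
print(pytesseract.image_to_string(numbers_region, config='digits'))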

Related

OCR text extraction from user interfaces image

I am currently using Pytesseract to extract text from images of e-commerce sites like Amazon, eBay, etc. to observe certain patterns. I do not want to use a web crawler, since this is about recognising certain patterns from the text on such sites. The image example looks like this:
However, every website looks different, so template matching wouldn't help either. Also, the image background is not of the same colour.
The code gives me about 40% accuracy. But if I crop the images into smaller size, it gives me all the text correctly.
Is there a way to take in one image, crop it into multiple parts and then extract text (a sketch of this idea follows after the code below)? The preprocessing of images does not help. What I have tried is: rescaling, removing noise, deskewing, skewing, adaptiveThreshold, grayscale, Otsu, etc., but I am unable to figure out what to do.
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
# import pickle


def ocr_processing(filename):
    """
    This function uses Pillow to open the file and Pytesseract to find string in image.
    """
    text = pytesseract.image_to_data(Image.open(filename), lang='eng', config='--psm 6')
    # text = pytesseract.image_to_string(Image.open(filename), lang='eng', config='--psm 11')
    return text
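As a rough sketch of the crop-into-parts idea asked about above (the tile counts are arbitrary, and text that straddles a tile boundary can still be cut):

from PIL import Image
import pytesseract

def ocr_in_tiles(filename, rows=3, cols=2):
    # split the screenshot into a grid of tiles and OCR each tile separately
    image = Image.open(filename)
    w, h = image.size
    tile_w, tile_h = w // cols, h // rows
    pieces = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            pieces.append(pytesseract.image_to_string(image.crop(box), config='--psm 6'))
    return "\n".join(pieces)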
Just as a recommendation: if you have a lot of text and you want to detect it through OCR (an example image is above), keras-ocr is a very good option, much better than pytesseract or using just EAST. It was a suggestion provided in the comments section, and it was able to trace 98.99% of the text correctly.
Here is the link to the Keras-ocr documentation: https://keras-ocr.readthedocs.io/en/latest/
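A minimal usage sketch based on that documentation (the image path is a placeholder):

import keras_ocr

pipeline = keras_ocr.pipeline.Pipeline()            # downloads detector + recognizer weights
images = [keras_ocr.tools.read("screenshot.png")]   # placeholder input file
predictions = pipeline.recognize(images)            # list of (word, box) pairs per image
for word, box in predictions[0]:
    print(word)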

How to get information from an image of a document, like name, CPF, RG, on python?

I'm sorry if the title of my question doesn't make my problem clear.
I'm trying to get information from an image of a document using Tesseract, but it doesn't work well on pictures (on screenshots of text it works very well). I want to ask if somebody knows a technique that can help me. I think that making the image black and white, where the information I want is in black, would help a lot, but I don't know how to do that.
I will be glad if somebody knows how to help me. (:
Using OpenCV to preprocess the image before passing it to Tesseract might help.
I usually follow these steps:
Convert the image to grayscale
If the text in the image is small, resize the image using cv2.resize()
Blur the image (GaussianBlur or MedianBlur)
Apply a threshold to make the text prominent (cv2.threshold)
Use the Tesseract config to instruct Tesseract to look for specific characters.
For example, if the image contains only upper-case alphanumeric English text, then passing
config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' would help.
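A minimal sketch of those steps with OpenCV and pytesseract (the file path is a placeholder; thresholds and kernel sizes usually need tuning per document):

import cv2
import pytesseract

image = cv2.imread("document.jpg")                              # placeholder input file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)                  # step 1: grayscale
gray = cv2.resize(gray, None, fx=2, fy=2,
                  interpolation=cv2.INTER_CUBIC)                # step 2: enlarge small text
blurred = cv2.medianBlur(gray, 3)                               # step 3: reduce noise
_, thresh = cv2.threshold(blurred, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # step 4: threshold
# step 5: restrict the character set for this document
config = '-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
print(pytesseract.image_to_string(thresh, config=config))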

ocr image cleansing with python opencv

I'm currently learning about computer vision OCR. I have an image that needs to be scanned, and I'm facing a problem during the image cleansing.
I use OpenCV (cv2) in Python to do this. This is the original image:
image = cv2.imread(image_path)
cv2.imshow("imageWindow", image)
I want to clean up the above image; the number in the middle (64) is the area I want to scan. However, the number gets cleaned away as well.
# set every pixel whose B, G, R values are all above (0, 0, 105) to white
image[np.where((image > [0,0,105]).all(axis=2))] = [255,255,255]
cv2.imshow("imageWindow", image)
What should I do to correct the cleansing here? I want to clean up the screen where the number 64 is located because I will perform an OCR scan afterwards.
Please help, thank you in advance.
What you're trying to do is called "thresholding". It looks like your technique is recoloring pixels on one side of a fixed threshold, but the LCD digit darkness varies enough in that image to throw it off.
I'd spend some time reading about thresholding, here's a good starting place:
Thresholding in OpenCV with Python. You're probably going to need an adaptive technique (like Adaptive Gaussian Thresholding), but you may find other ways that work for your images.
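A quick sketch of adaptive Gaussian thresholding with OpenCV (the file name is a placeholder, and the block size and constant usually need tuning per image):

import cv2

image = cv2.imread("display.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder input file
# each pixel is compared against a Gaussian-weighted mean of its 11x11 neighbourhood
cleaned = cv2.adaptiveThreshold(image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 11, 2)
cv2.imshow("imageWindow", cleaned)
cv2.waitKey(0)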

tesseract unable to read image tile

I am writing a program to extract information from Govt. IDs and use contours to extract characters from the image (since passing it as-is into tesseract results in junk output). I have tried this approach with other printed documents and it yields much better results than passing the whole image into tesseract. But for some reason, with the govt ID images, it seems to either give "Error in pixGenHalftoneMask: pix too small" (though the images are at least 100x100) or blank output. I am attaching both images used.
I used psm 10 for the image with an "M" and 8 for the other image, and yet it bore no fruit. I have a feeling the character is just too thick? I tried erosion as well, but the end result was the same :( ... any pointers please?

Preprocessing seven segment image for Tesseract OCR using OpenCV

I'm trying to develop a system which can convert a seven-segment display on an old analog pressure output system to text, such that the data can be processed by LabVIEW. I've been working on image processing to get Tesseract (using v3.02) to recognize the numbers correctly, but have been hitting some roadblocks and don't quite know how to proceed. This is what I've got so far:
Image needs to have a height between 50 and 100 pixels for Tesseract to read it correctly. I've found the best results with a height of 50.
Image needs to be cropped so that there is only one line of text.
Image should be in black and white.
Image should be relatively level from left to right.
I've been using the seven-segment training data 'letsgodigital'. This is the code for the image manipulation I've been doing so far:
ret, i = video.read()
h, width, channels = i.shape  # get dimensions
g = cv2.cvtColor(i, cv2.COLOR_BGR2GRAY)
histeq = cv2.equalizeHist(g)  # spreads pixel values across entire spectrum
_, t = cv2.threshold(histeq, 150, 225, cv2.THRESH_BINARY)  # thresholds histeq
cropped = t[int(0.4*h):int(0.6*h), int(0.1*width):int(0.9*width)]
rotated = imutils.rotate_bound(cropped, angle)
resized = imutils.resize(rotated, height=resizing_height)
Some numbers work better than others - for example, '1' seems to have a lot of trouble. The numbers occurring after the '+' or '-' often don't show up, and the '+' often shows up as a '-'. I've played around with the threshold values a bit, too.
The last three steps are there because the video sample I've been drawing from was slightly askew. I could try taking some better data to work with, and I could also try making my own training data over the standard 'letsgodigital' lang. I feel like I'm not doing the image processing in the best way, though, and would appreciate some guidance.
I plan to use some degree of edge detection to autocrop to the display, but for now I've just been trying to keep it simple and manually get the results I want. I've uploaded sample images with various degrees of image processing applied at http://imgur.com/a/vnqgP. It's difficult because sometimes I get the exact right answer from tesseract, and other times get nothing. The camera or light levels haven't really changed though, which makes me think it's a problem with my training data. Any suggestions or direction on where I should go would be much appreciated!! Thank you
For reading seven segment digits, normal OCR programs like tesseract don't usually work too well because of the space between individual segments. You should try ssocr, which was made specifically for reading seven segment digits. However, your preprocessing will need to be better as ssocr expects the input to be a single row of seven segment digits.
References - https://www.unix-ag.uni-kl.de/~auerswal/ssocr/
Usage example - http://www.instructables.com/id/Raspberry-Pi-Reading-7-Segment-Displays/
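If you call ssocr from Python, a minimal sketch might look like this (it assumes the ssocr binary is on PATH and that the image has already been preprocessed to a single row of seven-segment digits; check the ssocr docs for options that match your setup):

import subprocess

def read_seven_segment(image_path):
    # ssocr prints the recognized digits to stdout
    result = subprocess.run(["ssocr", image_path],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(read_seven_segment("display.png"))   # placeholder input file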
