Tesseract unable to read simple number - python

I have this image and I need tesseract to read the value.
import cv2
import pytesseract
im = cv2.imread("num.png")
print(pytesseract.image_to_string(im))
It does not print anything. Am I doing something wrong? It is pretty clear that the image shows a 7.
Even after scaling the image up 5x with cubic interpolation (cv2.INTER_CUBIC), it still does not work. This is the image now:

As described here:
By default Tesseract expects a page of text when it segments an image. If you’re just seeking to OCR a small region, try a different segmentation mode, using the --psm argument.
In this case, any --psm value from 6 to 10 should work. For example:
pytesseract.image_to_string(im, config='--psm 6')
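Since the best mode depends on the image, one option is to run each mode in that range and compare the output. The following is only a minimal sketch: psm_config and try_psm_modes are hypothetical helper names, and im stands for the image loaded in the question.

```python
def psm_config(psm):
    """Build the config string for one Tesseract page segmentation mode."""
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes run from 0 to 13")
    return f"--psm {psm}"


def try_psm_modes(image, modes=range(6, 11)):
    """Run OCR once per mode and return a {mode: recognized text} dict."""
    # Imported here so psm_config can be used without Tesseract installed.
    import pytesseract

    results = {}
    for psm in modes:
        text = pytesseract.image_to_string(image, config=psm_config(psm))
        results[psm] = text.strip()
    return results
```

Calling try_psm_modes(im) and printing the dict shows which modes return a non-empty string for a given image.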

The code is correct. I think the image of the 7 is not clear enough for pytesseract, so you need to preprocess it first. This link might help.

Related

OCR text extraction from user interfaces image

I am currently using Pytesseract to extract text from images of e-commerce sites such as Amazon and eBay in order to observe certain patterns. I do not want to use a web crawler, since this is about recognising patterns in the text on such sites. The image example looks like this:
However, every website looks different, so template matching wouldn't help either. The image background is also not always the same colour.
The code gives me about 40% accuracy. But if I crop the images into smaller pieces, it gives me all the text correctly.
Is there a way to take in one image, crop it into multiple parts and then extract the text? Preprocessing the images does not help. I have tried rescaling, noise removal, deskewing, skewing, adaptiveThreshold, grayscale conversion, Otsu thresholding, etc., but I am unable to figure out what to do.
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
# import pickle


def ocr_processing(filename):
    """
    This function uses Pillow to open the file and Pytesseract to find string in image.
    """
    text = pytesseract.image_to_data(Image.open(
        filename), lang='eng', config='--psm 6')
    # text = pytesseract.image_to_string(Image.open(
    #     filename), lang='eng', config ='--psm 11')
    return text
Just a recommendation: if you have a lot of text to detect with OCR (like the example image above), keras-ocr is a very good option, much better than pytesseract or EAST alone. It was suggested in the comments section, and it traced 98.99% of the text correctly.
Here is the link to the Keras-ocr documentation: https://keras-ocr.readthedocs.io/en/latest/
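For the cropping idea raised in the question, a rough sketch of splitting one image into a grid of tiles and running pytesseract on each tile might look like the following. The names tile_boxes and ocr_in_tiles, the 2x2 grid, and the 20-pixel overlap are all assumptions for illustration, not something taken from the question.

```python
def tile_boxes(width, height, rows, cols, overlap=0):
    """Return (left, upper, right, lower) crop boxes that cover the image.

    `overlap` pixels are added on every side of a tile (clamped to the image)
    so that text lying on a tile border is less likely to be cut in half.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = max(c * width // cols - overlap, 0)
            upper = max(r * height // rows - overlap, 0)
            right = min((c + 1) * width // cols + overlap, width)
            lower = min((r + 1) * height // rows + overlap, height)
            boxes.append((left, upper, right, lower))
    return boxes


def ocr_in_tiles(filename, rows=2, cols=2, overlap=20):
    """Crop the image into a grid and OCR every tile separately."""
    # Imported here so tile_boxes can be used without Pillow or Tesseract.
    from PIL import Image
    import pytesseract

    image = Image.open(filename)
    texts = []
    for box in tile_boxes(image.width, image.height, rows, cols, overlap):
        tile = image.crop(box)
        texts.append(pytesseract.image_to_string(tile, lang="eng", config="--psm 6"))
    return "\n".join(texts)
```

With a non-zero overlap, text near a tile border can be recognized twice, so the combined output may need de-duplication.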

How to get information from an image of a document, like name, CPF, RG, on python?

I'm sorry if the title of my question doesn't make my problem clear.
I'm trying to get information from an image of a document using Tesseract, but it doesn't work well on photos (on screenshots of text it works very well). I want to ask if somebody knows a technique that can help me. I think that converting the image to black and white, with the information I want in black, would help a lot, but I don't know how to do that.
I will be glad if somebody knows how to help me. (:
Using OpenCV to preprocess the image before passing it to Tesseract might help.
I usually follow these steps:
1. Convert the image to grayscale.
2. If the text in the image is small, resize the image using cv2.resize().
3. Blur the image (GaussianBlur or medianBlur).
4. Apply a threshold to make the text prominent (cv2.threshold).
5. Use the Tesseract config to instruct Tesseract to look for specific characters.
For example, if the image contains only uppercase alphanumeric English text, then passing
config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' would help.
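The steps above can be sketched roughly as follows. This is an illustration under assumptions rather than the answerer's exact code: preprocess and whitelist_config are hypothetical names, and the 2x scale factor, 3x3 blur kernel, and Otsu flag are choices made for this sketch.

```python
import string

# Digits followed by uppercase letters, as in the whitelist above.
UPPER_ALNUM = string.digits + string.ascii_uppercase


def whitelist_config(chars):
    """Build a config string that restricts Tesseract to the given characters."""
    return f"-c tessedit_char_whitelist={chars}"


def preprocess(path, scale=2.0):
    """Grayscale, upscale, blur and threshold an image before OCR."""
    # Imported here so whitelist_config can be used without OpenCV installed.
    import cv2

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(f"could not read image: {path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
    gray = cv2.GaussianBlur(gray, (3, 3), 0)
    # With THRESH_OTSU the threshold argument (0 here) is ignored and
    # a threshold is picked automatically from the histogram.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

The result can then be passed on with pytesseract.image_to_string(preprocess(path), config=whitelist_config(UPPER_ALNUM)). Whether the whitelist is honoured can depend on the Tesseract version and engine in use, so it is worth checking against your installation.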

Ways to clean up photos for Tesseract OCR?

I'm new to Tesseract and wanted to know if there were any ways to clean up photos for a simple OCR program to get better results. Thanks in advance for any help!
The code I am using:
import pytesseract as tess
from PIL import Image

# loads tesseract
tess.pytesseract.tesseract_cmd =
# filepath
file_path =
image = Image.open(file_path)
# processes image
text = tess.image_to_string(image, config='')
print(text)
I've used pytesseract in the past and with the following four modifications, I could read almost anything as long as the text font wasn't too small to begin with. pytesseract seems to struggle with small writing, even after resizing.
- Convert to Black & White -
Converting the images to black and white frequently improved recognition. I used OpenCV to do so and got the code from the end of this article.
- Crop -
If all your photos are in a similar format, with the text you need always in the same spot, I'd recommend cropping your pictures. If possible, pass pytesseract only the exact part of the photo that you want analyzed. The less the program has to analyze, the better. In my case, I was taking screenshots and would specify the exact region to capture.
- Resize -
Another thing you can do is play with the scaling of the original photo. Sometimes, after resizing a photo to almost double its initial size, pytesseract could read the text a lot more easily. Generally, bigger text is better, but there is a limit, as the photo can become too pixelated after resizing to be recognizable.
- Config -
I've noticed that pytesseract recognizes text a lot more easily than numbers. If possible, break the photo down into sections, and whenever you have a trouble spot with numbers you can use:
pytesseract.image_to_string(image, config='digits')
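Put together, the four modifications could look something like this Pillow-based sketch. The names scaled_size and read_region, the fixed cutoff of 127, and the bicubic resampling are assumptions for illustration; the crop box is whatever region you know contains the text.

```python
def scaled_size(size, factor):
    """Return the (width, height) of an image scaled by `factor`."""
    width, height = size
    return (round(width * factor), round(height * factor))


def read_region(path, box, factor=2, digits_only=False):
    """Crop, enlarge and binarize one region of a photo, then OCR it.

    `box` is a (left, upper, right, lower) tuple in pixels.
    """
    # Imported here so scaled_size can be used without Pillow or Tesseract.
    from PIL import Image
    import pytesseract

    region = Image.open(path).crop(box).convert("L")  # crop, then grayscale
    region = region.resize(scaled_size(region.size, factor), Image.BICUBIC)
    # Pure black and white: every pixel above the cutoff becomes white.
    region = region.point(lambda p: 255 if p > 127 else 0)
    config = "digits" if digits_only else ""
    return pytesseract.image_to_string(region, config=config)
```

For a trouble spot that contains only numbers, read_region(path, box, digits_only=True) applies the digits config to just that section.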

Tesseract unable to read math expression

I have this image of a simple math expression that Tesseract fails to read:
I tested a screenshot of the same expression written on an Android phone and it was read pretty well, so I thought it was a font problem.
I considered:
Preprocess the image by inverting or removing the red areas
Training Tesseract with images (StackOverflow question with no answers)
Using WhatFontIs.com to find similar font then training Tesseract with the font file with TrainYourTesseract
But as I was typing the question, I looked around for more.
This answer prompted me to double-check my sanity with the VietOCR software, which outputs 8-3. Close enough!
Then I messed around with the software and found that I could pass --psm 7 (Page Segmentation Mode 7: treat the image as a single text line) to my script, which works well for my math expressions:
pytesseract.image_to_string(img, config='--psm 7')
List of PSMs

Extract text from light text on white background image

I have an image like the following:
and I want to extract the text from it, which should be ws35. I've tried the pytesseract library using this call:
pytesseract.image_to_string(Image.open(path))
but it returns nothing. Am I doing something wrong? How can I get the text back using OCR? Do I need to apply some filter to the image?
You can try the following approach:
Binarize the image with a method of your choice (thresholding at 127 seems to be sufficient in this case).
Use a minimum filter to connect the loose dots into characters. A filter with r=4 seems to work quite well:
If necessary, the result can be further improved by applying a median blur (r=4):
Because I personally do not use Tesseract, I am not able to try it on this picture, but online OCR tools seem to be able to identify the sequence correctly (especially if you use the blurred version).
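A possible Pillow version of this approach is sketched below. It assumes that r is a radius (so r=4 corresponds to a 9x9 window) and that the dots come out darker than the background after thresholding; kernel_size and connect_dots are hypothetical names.

```python
def kernel_size(radius):
    """Convert a filter radius into the odd window size Pillow expects."""
    return 2 * radius + 1


def connect_dots(path, threshold=127, radius=4, smooth=True):
    """Binarize, grow the dark dots with a minimum filter, optionally smooth."""
    # Imported here so kernel_size can be used without Pillow installed.
    from PIL import Image, ImageFilter

    gray = Image.open(path).convert("L")
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    size = kernel_size(radius)
    # A minimum filter replaces each pixel with the darkest value in its
    # window, so neighbouring dark dots merge into solid strokes.
    result = binary.filter(ImageFilter.MinFilter(size))
    if smooth:
        result = result.filter(ImageFilter.MedianFilter(size))
    return result
```

The returned image can be saved and passed to an OCR tool. If the text comes out lighter than the background instead, a maximum filter (ImageFilter.MaxFilter) would be the counterpart to try.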
Similar to @SilverMonkey's suggestion: Gaussian blur followed by Otsu thresholding.
The problem is that this picture is low quality and very noisy!
Even professional and enterprise programs struggle with this.
You have most likely seen a CAPTCHA before. Those exist because your answer is sent back to a database along with the image and then used to train computers to read images like these.
Short answer: pytesseract can't read the text inside this image, and most likely no other module or professional program can read it either.
You may need to apply some image processing or enhancement to it. Look at this post, read the suggestions, and try to apply them.
