I need to extract digits from images (see sample images). I tried pytesseract but it is not working, it produces empty results. Below is the code I am using
Code
import pytesseract
import cv2
img = cv2.imread('image_path')
digits = pytesseract.image_to_string(img)
print(digits)
Sample Images
I have a large pool of images, as shown above. Tesseract is not working on any of them.
Try adding config --psm 7 (meaning Treat the image as a single text line.)
import pytesseract
import cv2
img = cv2.imread('image_path')
digits = pytesseract.image_to_string(img,config='--psm 7')
print(digits)
#'971101004900 1545'
Related
I'm trying to read the digits from this image:
Using pytesseract with these settings:
custom_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(img, config=custom_config)
This is the output:
((E ST7 [71aT6T2 ] THETOGOG5 15 [8)
Whitelisting only integers, as well as changing your psm provides much better results. You also need to remove carriage returns, and white space. Below is code that does that.
import pytesseract
import re
from PIL import Image
#Open image
im = Image.open("numbers.png")
#Define configuration that only whitelists number characters
custom_config = r'--oem 3 --psm 11 -c tessedit_char_whitelist=0123456789'
#Find the numbers in the image
numbers_string = pytesseract.image_to_string(im, config=custom_config)
#Remove all non-number characters
numbers_int = re.sub(r'[a-z\n]', '', numbers_string.lower())
#print the output
print(numbers_int)
The result of the code on your image is: '31477423353'
Unfortunately, a few numbers are still missing. I tried some experimentation, and downloaded your image and erased the grid.
After removing the grid and executing the code again, pytesseract produces a perfect result: '314774628300558'
So you might try to think about how you can remove the grid programmatically. There are alternatives to pytesseract, but regardless you will get better output with the text isolated in the image.
I'm trying to extract texts from some images. It worked for hundreds of other images but in some cases it doesn't find any texts. In order to optimize the images for extraction phase, all images are converted to black and white. All of their backgrounds are white and others are black such as icons, texts etc.
For example it worked for below image and succesfully found 'Sleep Timer' text in the image. I'm not sure if it's relevant but size of the below image with 'Sleep Timer' text is 320 × 351
But for the below image it doesn't find any text at all. Image size for this one is 161 × 320.
Since I couldn't find the reason, I tried to resize the image but it didn't work.
Here is my code:
from pytesseract import Output
import pytesseract
import cv2
image = cv2.imread('imagePath')
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pytesseract.image_to_data(rgb, output_type=Output.DICT)
for i in range(0, len(results["text"])):
text = results["text"][i]
conf = int(results["conf"][i])
print("Confidence: {}".format(conf))
print("Text: {}".format(text))
print("")
It is working for me I tested:
import pytesseract
print(pytesseract.image_to_string('../images/grmgrm.jfif'))
results = pytesseract.image_to_data('../images/grmgrm.jfif', output_type=pytesseract.Output.DICT)
print(results)
Are you getting an error? Show us the error you are getting.
I want to be able to recognize digits from images. So I have been playing around with tesseract and python. I looked into how to prepare the image and tried running tesseract on it and I must say I am pretty disappointed by how badly my digits are recognized. I have tried to prepare my images with OpenCV and thought I did a pretty good job (see examples below) but tesseract has a lot of errors when trying to identify my images. Am I expecting too much here? But when I look at these example images I think that tesseract should easily be able to identify these digits without any problems. I am wondering if the accuracy is not there yet or if somehow my configuration is not optimal. Any help or direction would be gladly appreciated.
Things I tried to improve the digit recognition: (nothing seemed to improved the results significantly)
limit characters: config = "--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789"
Upscale images
add a white border around the image to give the letters more space, as I have read that this improves the recognition process
Threshold image to only have black and white pixels
Examples:
Image 1:
Tesseract recognized: 72
Image 2:
Tesseract recognized: 0
EDIT:
Image 3:
https://ibb.co/1qVtRYL
Tesseract recognized: 1723
I'm not sure what's going wrong for you. I downloaded those images and tesseract interprets them just fine for me. What version of tesseract are you using (I'm using 5.0)?
781429
209441
import pytesseract
import cv2
import numpy as np
from PIL import Image
# set path
pytesseract.pytesseract.tesseract_cmd = r'C:\\Users\\ichu\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe';
# load images
first = cv2.imread("first_text.png");
second = cv2.imread("second_text.png");
images = [first, second];
# convert to pillow
pimgs = [];
for img in images:
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB);
pimgs.append(Image.fromarray(rgb));
# do text
for img in pimgs:
text = pytesseract.image_to_string(img, config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789');
print(text[:-2]); # drops newline + end char
I've this python code which I use to convert a text written in a picture to a string, it does work for certain images which have large characters, but not for the one I'm trying right now which contains only digits.
This is the picture:
This is my code:
import pytesseract
from PIL import Image
img = Image.open('img.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
result = pytesseract.image_to_string(img)
print (result)
Why is it failing at recognising this specific image and how can I solve this problem?
I have two suggestions.
First, and this is by far the most important, in OCR preprocessing images is key to obtaining good results. In your case I suggest binarization. Your images look extremely good so you shouldn't have any problem but if you do, then maybe you should try to binarize your images:
import cv2
from PIL import Image
img = cv2.imread('gradient.png')
# If your image is not already grayscale :
# img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
threshold = 180 # to be determined
_, img_binarized = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
pil_img = Image.fromarray(img_binarized)
And then try the ocr again with the binarized image.
Check if your image is in grayscale and uncomment if needed.
This is simple thresholding. Adaptive thresholding also exists but it is noisy and does not bring anything in your case.
Binarized images will be much easier for Tesseract to handle. This is already done internally (https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) but sometimes things can be messed up and very often it's useful to do your own preprocessing.
You can check if the threshold value is right by looking at the images :
import matplotlib.pyplot as plt
plt.imshow(img, cmap='gray')
plt.imshow(img_binarized, cmap='gray')
Second, if what I said above still doesn't work, I know this doesn't answer "why doesn't pytesseract work here" but I suggest you try out tesserocr. It is a maintained python wrapper for Tesseract.
You could try:
import tesserocr
text_from_ocr = tesserocr.image_to_text(pil_img)
Here is the doc for tesserocr from pypi : https://pypi.org/project/tesserocr/
And for opencv : https://pypi.org/project/opencv-python/
As a side-note, black and white is treated symetrically in Tesseract so having white digits on a black background is not a problem.
I'm trying to read different cropped images from a big file and I manage to read most of them but there are some of them which return an empty string when I try to read them with tesseract.
The code is just this line:
pytesseract.image_to_string(cv2.imread("img.png"), lang="eng")
Is there anything I can try to be able to read these kind of images?
Thanks in advance
Edit:
Thresholding the image before passing it to pytesseract increases the accuracy.
import cv2
import numpy as np
# Grayscale image
img = Image.open('num.png').convert('L')
ret,img = cv2.threshold(np.array(img), 125, 255, cv2.THRESH_BINARY)
# Older versions of pytesseract need a pillow image
# Convert back if needed
img = Image.fromarray(img.astype(np.uint8))
print(pytesseract.image_to_string(img))
This printed out
5.78 / C02
Edit:
Doing just thresholding on the second image returns 11.1. Another step that can help is to set the page segmentation mode to "Treat the image as a single text line." with the config --psm 7. Doing this on the second image returns 11.1 "202 ', with the quotation marks coming from the partial text at the top. To ignore those, you can also set what characters to search for with a whitelist by the config -c tessedit_char_whitelist=0123456789.%. Everything together:
pytesseract.image_to_string(img, config='--psm 7 -c tessedit_char_whitelist=0123456789.%')
This returns 11.1 202. Clearly pytesseract is having a hard time with that percent symbol, which I'm not sure how to improve on that with image processing or config changes.