How to setup Tesseract OCR properly

I am using Tesseract OCR to convert a preprocessed license plate image into text, but I have not had much success with some images that look perfectly fine. The Tesseract setup can be seen in the function definition below. I am running this on Google Colab. The input image shows the plate ZG NIVEA 1. I am not sure whether I am doing something wrong or whether there is a better way to do this: the result I get from this particular image is A.
!sudo apt install -q tesseract-ocr
!pip install -q pytesseract
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
import cv2
import re
def pytesseract_image_to_string(img, oem=3, psm=7) -> str:
    '''
    oem - OCR Engine Mode
        0 = Original Tesseract only.
        1 = Neural nets LSTM only.
        2 = Tesseract + LSTM.
        3 = Default, based on what is available.
    psm - Page Segmentation Mode
        0 = Orientation and script detection (OSD) only.
        1 = Automatic page segmentation with OSD.
        2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
        3 = Fully automatic page segmentation, but no OSD. (Default)
        4 = Assume a single column of text of variable sizes.
        5 = Assume a single uniform block of vertically aligned text.
        6 = Assume a single uniform block of text.
        7 = Treat the image as a single text line.
        8 = Treat the image as a single word.
        9 = Treat the image as a single word in a circle.
        10 = Treat the image as a single character.
        11 = Sparse text. Find as much text as possible in no particular order.
        12 = Sparse text with OSD.
        13 = Raw line. Treat the image as a single text line,
             bypassing hacks that are Tesseract-specific.
    '''
    tess_string = pytesseract.image_to_string(img, config=f'--oem {oem} --psm {psm}')
    regex_result = re.findall(r'[A-Z0-9]', tess_string)  # filter only uppercase alphanumeric symbols
    return ''.join(regex_result)
image = cv2.imread('nivea.png')
print(pytesseract_image_to_string(image))
Edit: The approach in the accepted answer works for the ZGNIVEA1 image, but not for others. Is there a general "font size" that Tesseract OCR works best with, or is there a rule of thumb?

By applying a Gaussian blur before OCR, I ended up with the correct output. You may also be able to drop the regex by adding -c tessedit_char_whitelist=ABC.. to your config string.
The code that produces the correct output for me:
import cv2
import pytesseract
image = cv2.imread("images/tesseract.png")
config = '--oem 3 --psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ'
image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
image = cv2.GaussianBlur(image, (5, 5), 0)
string = pytesseract.image_to_string(image, config=config)
print(string)
Output:
Answer 2:
Sorry for the late reply. I tested the same code on your second image, and it gave me the correct output. Are you sure you removed the whitelist part of the config? My whitelist does not allow digits.
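To allow digits as well as letters, the whitelist can be assembled from Python's string constants. This is a minimal sketch: the helper name build_config is hypothetical, and only the --oem, --psm and -c tessedit_char_whitelist options come from the code above.

```python
import string


def build_config(oem: int = 3, psm: int = 7,
                 whitelist: str = string.ascii_uppercase + string.digits) -> str:
    """Build a Tesseract config string restricted to the given characters."""
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist={whitelist}"


config = build_config()
print(config)
```

The resulting string would be passed as config=config to pytesseract.image_to_string.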
The most accurate solution here is to train your own Tesseract model on the license plate font (FE-Schrift) instead of using Tesseract's default eng.traineddata model. It will definitely increase accuracy, since the model will contain only the characters of your use case as output classes. As for your latter question: Tesseract does some preprocessing before recognition (thresholding, morphological closing, etc.), which is why it is so sensitive to letter size. (In a smaller image the contours are closer to each other, so closing will not separate them.)
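One way to act on this size sensitivity is to rescale each crop so that its capital letters reach a fixed pixel height before OCR. The sketch below only computes the scale factor. The 30-pixel target is an assumption (a commonly quoted rule of thumb, not a value from this answer), and scale_for_text_height is a hypothetical helper.

```python
def scale_for_text_height(measured_height_px: float,
                          target_height_px: float = 30.0) -> float:
    """Return the resize factor that brings letters to the target height.

    target_height_px is an assumed rule-of-thumb value, not an official
    Tesseract constant; tune it on your own images.
    """
    if measured_height_px <= 0:
        raise ValueError("measured_height_px must be positive")
    return target_height_px / measured_height_px


# Letters measured at 12 px tall need to be enlarged 2.5x.
factor = scale_for_text_height(12)
print(factor)
```

The factor can then be used as fx and fy in a cv2.resize call like the one in the code above.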
To train Tesseract on a custom font, you can follow the official docs.
To read more about the theoretical side of Tesseract, you can check these papers:
1 (relatively old)
2 (newer)

Related

(OCR) Tesseract not recognizing simple digits

I am using PyTesseract to extract information from multiple images which contain vertically separated prices (one price per line), horizontally aligned to the right like the following image:
Tesseract cannot reliably extract text from such an image, so some image processing is needed:
Image scaling to 4x;
Binarization;
"Bolding";
Gaussian blur.
Which results in the following image:
Pytesseract is then able to extract the information (using --psm 6), producing a string containing:
96,000,000
94,009,999
90,000,000
85,000,000
78,000,000
70,000,000
66,000,000
However, when Pytesseract is presented with some edge cases like an image with a single digit, recognition fails. Example:
Pre-processed:
post-processed:
This results in an empty extracted string, which is strange because the digit 8 was read successfully in the previous image. What other suggestions should I follow? I've spent endless hours trying to optimize the algorithm for such cases, without success.

I tried the exact same scenario with EasyOCR, a separate OCR library that uses its own deep-learning models rather than the Tesseract engine. I resized the image to a custom size of (600, 600) and fed it to EasyOCR, and it worked.
import easyocr
import cv2
image = cv2.imread('7.png')
image = cv2.resize(image,(600,600))
cv2.imwrite('image.png',image)
reader = easyocr.Reader(['en'])
result = reader.readtext('image.png')
texts = [detection[1] for detection in result if detection[2] > 0.5]
print(texts)
The output for the first image is:
['96,000,000', '94,009,999', '90,000,000', '85,000,000', '78,000,000', '70,000,000', '66,000,000']
The output for the second image is:
['8']
Maybe this alternative solution will work for your case. You can install EasyOCR with pip install easyocr. Happy coding :)
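The confidence filter in the snippet above can be separated from the OCR call so that the threshold is easy to tune. This sketch assumes each detection is a (bounding box, text, confidence) tuple, which is how the code above indexes the result; filter_detections and the sample data are hypothetical.

```python
def filter_detections(detections, min_confidence=0.5):
    """Keep the text of detections whose confidence exceeds the threshold."""
    return [text for _box, text, confidence in detections
            if confidence > min_confidence]


# Made-up detections in the (box, text, confidence) layout.
sample = [
    ([[0, 0], [10, 0], [10, 10], [0, 10]], "8", 0.93),
    ([[0, 20], [10, 20], [10, 30], [0, 30]], "~", 0.12),
]
print(filter_detections(sample))
```

Lowering min_confidence keeps more detections at the cost of more noise, so the value is worth checking against a few known images.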

Using pytesseract to get numbers from an image

I'm trying to take an image that's just a number 1-10 in a bubble font and use pytesseract to get the number.
Picture in question:
Here is an article that makes this process seem straightforward:
https://towardsdatascience.com/read-text-from-image-with-one-line-of-python-code-c22ede074cac
import cv2
import pyautogui
import pytesseract
lives = pyautogui.locateOnScreen('pics/lives.png', confidence = 0.9)
ss = pyautogui.screenshot(region=(lives[0]+lives[2],lives[1],lives[2]-6,lives[3]))
ss.save('pics/tempLives.png')
img = cv2.imread('pics/tempLives.png')
cv2.imwrite('pics/testPic.png',img)
test = pytesseract.image_to_string(img)
print(test)
I know 'img' is the same as the image provided because I've used ss.save and cv2.imwrite to check it.
I suppose my question is why it works so well in the article yet I cannot manage to get anything to print? I suppose the bubble font makes it trickier, but in the article those blue parentheses were read easily, so that makes me think this font wouldn't be too hard. Thanks for any help!

There are many cases in which PyTesseract fails to recognize text, and in some of them we have to give it hints.
For the specific image you posted, it helps to add the config="--psm 6" argument.
According to Tesseract documentation regarding PSM:
6 Assume a single uniform block of text.
Here is a code sample that manages to identify the text from the posted image:
import cv2
import pytesseract
#pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # May be required when using Windows
img = cv2.imread('pics/testPic.png') # Reading the input image (the PNG image from the posted question).
text = pytesseract.image_to_string(img, config=" --psm 6") # Execute PyTesseract OCR with "PSM 6" configuration (Assume a single uniform block of text)
print(text) # Prints the text (prints 10).
Note:
OCR does not always work, and there are many techniques for improving its accuracy.
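Because the right hint varies from image to image, one option is to try several page segmentation modes and keep the first non-empty result. In this sketch the OCR function is passed in as a parameter (for example pytesseract.image_to_string); first_nonempty_ocr and the list of modes to try are assumptions, not part of the answer above.

```python
def first_nonempty_ocr(ocr, image, psm_modes=(6, 7, 8, 10, 13)):
    """Try each PSM in turn and return (psm, text) for the first non-empty text.

    `ocr` is any callable with the signature ocr(image, config=...), such as
    pytesseract.image_to_string. Returns (None, "") if every mode fails.
    """
    for psm in psm_modes:
        text = ocr(image, config=f"--psm {psm}").strip()
        if text:
            return psm, text
    return None, ""


# Stand-in OCR function for demonstration: it only "reads" the image in PSM 8.
def fake_ocr(image, config=""):
    return "10\n" if config == "--psm 8" else ""


print(first_nonempty_ocr(fake_ocr, image=None))
```

A non-empty result is not necessarily a correct one, so this is best combined with a character whitelist or a sanity check on the expected format.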

Python Tesseract not recognising number in my image

I've got this picture (a preprocessed image) from which I want to extract the numeric values on each line. I'm using pytesseract, but it doesn't return any results for this image.
I've tried several config options from other questions, like "--psm 13 --oem 3" or whitelisting digits, but nothing yields results.
I usually get just one or two characters, or about five dots/dashes, but nothing even remotely resembling the size of my input.
I hope someone can help. Cheers in advance for your time.
pytesseract version: 0.3.8
tesseract version: 5.0.0-alpha.20210506

You should consider using --psm 4, which is more appropriate for your image. I also recommend rethinking the image preprocessing. Tesseract is not perfect, and it needs a good input image to work well.
import cv2 as cv
import pytesseract as tsr
img = cv.imread('41DAx.jpg')
img = cv.cvtColor(img, cv.COLOR_BGR2RGB)
config = '--psm 4 -c tessedit_char_whitelist=0123456789,'
text = tsr.image_to_string(img, config=config)
print(text)
The code above did not detect every digit in the image, but it got most of them. With a bit of image preprocessing, you may be able to reach your goal.
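With the whitelist limited to digits and commas, each OCR output line can be converted to an integer, and lines that fail to parse can be flagged for a second look. This is a minimal sketch; parse_number_lines is a hypothetical helper and the sample string is made up.

```python
def parse_number_lines(ocr_text):
    """Convert lines such as '96,000,000' to ints; collect lines that fail."""
    values, rejected = [], []
    for line in ocr_text.splitlines():
        cleaned = line.strip().replace(",", "")
        if not cleaned:
            continue  # skip blank lines and trailing separators
        if cleaned.isdecimal():
            values.append(int(cleaned))
        else:
            rejected.append(line)
    return values, rejected


values, rejected = parse_number_lines("96,000,000\n9-4\n\n8\n")
print(values, rejected)
```

The rejected list gives a quick measure of how many lines still need better preprocessing.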

Tesseract - digit recognition with many errors

I want to be able to recognize digits from images. So I have been playing around with tesseract and python. I looked into how to prepare the image and tried running tesseract on it and I must say I am pretty disappointed by how badly my digits are recognized. I have tried to prepare my images with OpenCV and thought I did a pretty good job (see examples below) but tesseract has a lot of errors when trying to identify my images. Am I expecting too much here? But when I look at these example images I think that tesseract should easily be able to identify these digits without any problems. I am wondering if the accuracy is not there yet or if somehow my configuration is not optimal. Any help or direction would be gladly appreciated.
Things I tried to improve the digit recognition (nothing seemed to improve the results significantly):
Limit characters: config = "--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789"
Upscale the images
Add a white border around the image to give the characters more space, as I have read that this improves recognition
Threshold the image so that it has only black and white pixels
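The white-border step from the list above can be done with NumPy alone. This is a minimal sketch, assuming a uint8 image with dark text on a light background; add_white_border and the border widths used here are hypothetical choices.

```python
import numpy as np


def add_white_border(image: np.ndarray, border: int = 20) -> np.ndarray:
    """Pad a grayscale or colour image with white pixels on all four sides."""
    # Pad only the height and width axes; leave any channel axis untouched.
    pad_width = [(border, border), (border, border)] + [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pad_width, mode="constant", constant_values=255)


digit_crop = np.zeros((30, 50), dtype=np.uint8)  # stand-in for a real crop
padded = add_white_border(digit_crop, border=10)
print(padded.shape)
```

If the text is light on a dark background, the image would need to be inverted first so that the border matches the background.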
Examples:
Image 1:
Tesseract recognized: 72
Image 2:
Tesseract recognized: 0
EDIT:
Image 3:
https://ibb.co/1qVtRYL
Tesseract recognized: 1723

I'm not sure what's going wrong for you. I downloaded those images and tesseract interprets them just fine for me. What version of tesseract are you using (I'm using 5.0)?
781429
209441
import pytesseract
import cv2
from PIL import Image
# set path
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\ichu\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'
# load images
first = cv2.imread("first_text.png")
second = cv2.imread("second_text.png")
images = [first, second]
# convert to pillow
pimgs = []
for img in images:
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    pimgs.append(Image.fromarray(rgb))
# do text
for img in pimgs:
    text = pytesseract.image_to_string(img, config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    print(text.strip())  # drops the trailing newline + end char

Easily readable text not recognized by tesseract

I have used the following PyTorch implementation of EAST ( Efficient and Accurate Scene Text Detector) to identify and draw bounding boxes around text in a number of images and it works very well!
However, the next step, OCR with pytesseract to extract the text from these images and convert it to strings, is failing horribly. Using all possible configurations of --oem and --psm, I am unable to get pytesseract to detect what appears to be very clear text, for example:
The recognized text is below the images. Even though I have applied contrast enhancement, and also tried dilating and eroding, I cannot get tesseract to recognize the text. This is just one example of many images where the text is even larger and clearer. Any suggestions on transformations, configs, or other libraries would be helpful!
UPDATE: After trying Gaussian blur + Otsu thresholding, I am able to get black text on a white background (which is apparently ideal for pytesseract), and I also added the Spanish language, but it still cannot read very plain text; the example image reads as gibberish.
The code I am using:
import cv2
import matplotlib.pyplot as plt
import pytesseract
from PIL import Image
# detect, model and device come from the EAST implementation mentioned above
img_path = './images/fesa.jpg'
img = Image.open(img_path)
boxes = detect(img, model, device)
origbw = cv2.imread(img_path, 0)
for box in boxes:
    box = box[:-1]
    poly = [(box[0], box[1]),(box[2], box[3]),(box[4], box[5]),(box[6], box[7])]
    x = []
    y = []
    for coord in poly:
        x.append(coord[0])
        y.append(coord[1])
    startX = int(min(x))
    startY = int(min(y))
    endX = int(max(x))
    endY = int(max(y))
    #use pre-defined bounding boxes produced by EAST to crop the original image
    cropped_image = origbw[startY:endY, startX:endX]
    #contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=4.0, tileGridSize=(8,8))
    res = clahe.apply(cropped_image)
    # Tesseract 4+ expects a double dash: --psm, not -psm
    text = pytesseract.image_to_string(res, config = "--psm 12")
    plt.imshow(res)
    plt.show()
    print(text)

Use these updated data files.
This guide criticizes out-of-the box performance (and maybe the accuracy could be affected too):
Trained data. On the moment of writing, tesseract-ocr-eng APT package for Ubuntu 18.10 has terrible out of the box performance, likely because of corrupt training data.
According to the following test I did, using the updated data files seems to provide better results. This is the code I used:
import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('farmacias.jpg'), lang='spa', config='--tessdata-dir ./tessdata --psm 7'))
I downloaded spa.traineddata (your example images have Spanish words, right?) to ./tessdata/spa.traineddata. And the result was:
ARMACIAS
And for the second image:
PECIALIZADA:
I used --psm 7 because here it says that it means "Treat the image as a single text line" and I thought that should make sense for your test images.
In this Google Colab you can see the test I did.
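A missing or misplaced traineddata file is easy to overlook, so it can help to check for it before building the config string. This sketch uses only the --tessdata-dir and --psm options shown above. tessdata_config is a hypothetical helper, and the directory layout (one <lang>.traineddata file per language) is assumed from this answer.

```python
from pathlib import Path


def tessdata_config(tessdata_dir, lang, psm=7):
    """Return a config string after checking that <lang>.traineddata exists."""
    model = Path(tessdata_dir) / f"{lang}.traineddata"
    if not model.is_file():
        raise FileNotFoundError(f"missing language data: {model}")
    return f"--tessdata-dir {tessdata_dir} --psm {psm}"
```

The returned string would be passed as config=..., together with lang='spa', to pytesseract.image_to_string. A directory path containing spaces would probably need quoting inside the config string.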
