Tesseract OCR on binary image

Tesseract OCR on binary image - python

I have a binary image like this,
I want to extract the numbers in the image using tesseract ocr in Python. I used pytesseract like this on the image,
txt = pytesseract.image_to_string(img)
But I am not getting any good results.
What can I do in pre-processing or augmentation that can help tesseract do better.?
I tried to localize the text from the image using East Text Detector but it was not able to recognize the text.
How to proceed with this in python.?

I think the page-segmentation-mode is an important factor here.
Since we are trying to read column values, we could use --psm 4 (source)
import cv2
import pytesseract
img = cv2.imread("k7bqx.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
txt = pytesseract.image_to_string(gry, config="--psm 4")
We want to get the text starts with #
txt = sorted([t[:2] for t in txt if "#" in t])
Result:
['#3', '#7', '#9', '#€']
But we miss 4, 5, we could apply adaptive-thresholding:
Result:
['#3', '#4', '#5', '#7', '#9', '#€']
Unfortunately, #2 and #6 are not recognized.
Code:
import cv2
import pytesseract
img = cv2.imread("k7bqx.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.adaptiveThreshold(gry, 252, cv2.ADAPTIVE_THRESH_MEAN_C,
cv2.THRESH_BINARY_INV, blockSize=131, C=100)
bnt = cv2.bitwise_not(thr)
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")
txt = sorted([t[:2] for t in txt if "#" in t])
print(txt)

Related

How to detect if image contains ASCII characters?

I have a dataset of images and I want to filter out all images that contain text (ASCII chars). For example, I have the following cute image of a dog:
As you can see, on right bottom corner there is a text "MAY 18 2003" so it should be filtered out.
After some research, I came across with tesseract OCR. In python I have the following code:
# Attempt 1
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
print(text)
# Attempt 2
import unidecode
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
text = unidecode.unidecode(text)
print(text)
# Attempt 3
import string
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
text = pytesseract.image_to_string(img,lang='eng',
config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)
None of them detected the string (prints whitespaces). How can I detect it?

you should prepare the image for the OCR.
for example, for this image I would do the following:
convert it to Black & White image with threshold that make the text visible (for this image it is 130)
then I would Invert the image (so the text be in black)
now try tesseract OCR

You can use Easy-OCR instead of pytesseract to get directly this output
Kay
10 2003
and as your goal is just to detect ASCII, you don't care about the accurate characters because you just want to filter the images which contain them.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import cv2
import easyocr
path = ""
img = cv2.imread(path+"input.jpg")
# Now apply the Easy-OCR
reader = easyocr.Reader(['en'])
output = reader.readtext(img)
for i in range(len(output)):
print(output[i][-2])

You can use inRange thresholding
The result will be:
If you set psm mode to the 6, the output will be:
<<
‘\
' MAY 18 2003
All the digits are captured correctly, but we have some unwanted characters.
If we add an 'only-alpha numeric' condition, then the result will be:
['M', 'A', 'Y', '1', '8', '2', '0', '0', '3']
First, I've upsampled the image, and then apply tesseract-OCR. The reason is that the date is too small to read.
Code:
import cv2
import pytesseract
from numpy import array
img = cv2.imread("result.png") # Load the upsampled image
img = cv2.cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
msk = cv2.inRange(img, array([0, 103, 171]), array([179, 255, 255]))
krn = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dlt = cv2.dilate(msk, krn, iterations=1)
thr = 255 - cv2.bitwise_and(dlt, msk)
txt = pytesseract.image_to_string(thr, config='--psm 6')
print([t for t in txt if t.isalnum()])
cv2.imshow("", thr)
cv2.waitKey(0)
You can set the new values for the minimum and maximum ranges:
import numpy as np
min_range = np.array([0, 103, 171])
max_range = np.array([179, 255, 255])
msk = cv2.inRange(img, min_range, max_range)
You can also test with different psm parameters:
txt = pytesseract.image_to_string(thr, config='--psm 6')
For more read: Improving the quality of the output

Why will tesseract not detect this letter?

I am trying to detect this letter but it doesn't seem to recognize it.
import cv2
import pytesseract as tess
img = cv2.imread("letter.jpg")
imggray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print(tess.image_to_string(imggray))
this is the image in question:

Preprocessing of the image (e.g. inverting it) should help, and also you could take advantage of pytesseract image_to_string config options.
For instance, something along these lines:
import pytesseract
import cv2 as cv
import requests
import numpy as np
import io
# I read this directly from imgur
response = requests.get('https://i.stack.imgur.com/LGFAu.jpg')
nparr = np.frombuffer(response.content, np.uint8)
img = cv.imdecode(nparr, cv.IMREAD_GRAYSCALE)
# simple inversion as preprocessing
neg_img = cv.bitwise_not(img)
# invoke tesseract with options
text = pytesseract.image_to_string(neg_img, config='--psm 7')
print(text)
should parse the letter correctly.
Have a look at related questions for some additional info about preprocessing and tesseract options:
Why does pytesseract fail to recognise digits from image with darker background?
Why does pytesseract fail to recognize digits in this simple image?
Why does tesseract fail to read text off this simple image?

#Davide Fiocco 's answer is definitely correct.
I just want to show another way of doing it with adaptive-thresholding
When you apply adaptive-thesholding result will be:
Now when you read it:
txt = pytesseract.image_to_string(thr, config="--psm 7")
print(txt)
Result:
B
Code:
import cv2
import pytesseract
img = cv2.imread("LGFAu.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.adaptiveThreshold(gry, 252, cv2.ADAPTIVE_THRESH_MEAN_C,
cv2.THRESH_BINARY_INV, 11, 2)
txt = pytesseract.image_to_string(thr, config="--psm 7")
print(txt)

How can I get text from this image with Tesseract?

Currently I'm using the code below to get text from image and it works fine, but it doesn't work well with these two images, it seems like tesseract cannot scan these types of image. Please show me how to fix it
https://i.ibb.co/zNkbhKG/Untitled1.jpg
https://i.ibb.co/XVbjc3s/Untitled3.jpg
def read_screen():
spinner = Halo(text='Reading screen', spinner='bouncingBar')
spinner.start()
screenshot_file="Screens/to_ocr.png"
screen_grab(screenshot_file)
#prepare argparse
ap = argparse.ArgumentParser(description='HQ_Bot')
ap.add_argument("-i", "--image", required=False,default=screenshot_file,help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh", help="type of preprocessing to be done")
args = vars(ap.parse_args())
# load the image
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
if args["preprocess"] == "thresh":
gray = cv2.threshold(gray, 177, 177,
cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
elif args["preprocess"] == "blur":
gray = cv2.medianBlur(gray, 3)
# store grayscale image as a temp file to apply OCR
filename = "Screens/{}.png".format(os.getpid())
cv2.imwrite(filename, gray)
# load the image as a PIL/Pillow image, apply OCR, and then delete the temporary file
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
#ENG
#text = pytesseract.image_to_string(Image.open(filename))
#VIET
text = pytesseract.image_to_string(Image.open(filename), lang='vie')
os.remove(filename)
os.remove(screenshot_file)
# show the output images
'''cv2.imshow("Image", image)
cv2.imshow("Output", gray)
os.remove(screenshot_file)
if cv2.waitKey(0):
cv2.destroyAllWindows()
print(text)
'''
spinner.succeed()
spinner.stop()
return text

You should try different psm modes instead of default like so:
target = pytesseract.image_to_string(im,config='--psm 4',lang='vie')
Exert from docs:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
So for example for /Untitled3.jpg you could try --psm 4 and failing that you could try --psm 11 for both.
Depending on your version of tesseract you could also try different oem modes:
Use --oem 1 for LSTM, --oem 0 for Legacy Tesseract. Please note that Legacy Tesseract models are only included in traineddata files from tessdata repo.
EDIT
Also as seen in your images there are two languages so if you wish to use lang parameter you need to manually separate image into two to not to confuse tesseract engine and use different lang values for them.
EDIT 2
Below a full working example with Unitiled3. What I noticed was your improper use of thresholding. You should set maxval to something bigger than the value you are thresholding at. Like in my example I set thresh 177 but maxval to 255 so everything above 177 will be black. I didn't even had to do any binarization.
import cv2
import pytesseract
from cv2.cv2 import imread, cvtColor, COLOR_BGR2GRAY, threshold, THRESH_BINARY
image = imread("./Untitled3.jpg")
image = cvtColor(image,COLOR_BGR2GRAY)
_,image = threshold(image,177,255,THRESH_BINARY)
cv2.namedWindow("TEST")
cv2.imshow("TEST",image)
cv2.waitKey()
text = pytesseract.image_to_string(image, lang='eng')
print(text)
Output:
New York, New York
Salzburg, Austria
Hollywood, California

Pytesseract reading receipt

I have tried to read text from image of receipt using pytesseract. But a result text have a lot weird characters and it really looks awful.
There is my code which i used to manipulate image:
import sys
from PIL import Image
import cv2 as cv
import numpy as np
import pytesseract
def manipulate_image(img):
img = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
kernel = np.ones((1,1), dtype = "uint8")
img = cv.erode(img, kernel, iterations = 1)
img = cv.threshold(img, 0, 255,
cv.THRESH_BINARY | cv.THRESH_OTSU)[1]
img = cv.medianBlur(img, 3)
return img
if len(sys.argv) > 2:
print("Please provide only name of image.")
elif len(sys.argv) == 2:
img = cv.imread(sys.argv[1])
img = manipulate_image(img)
cv.imwrite("test.png", img)
text = pytesseract.image_to_string(img)
print text.encode('utf8')
else:
print("Please provide name of image.")
there is my test receipt image:
https://imgur.com/a/RjeQ9dL
and there is output image after manupulate:
https://imgur.com/a/1tFZRdq
and there is text result:
""'9vco4v‘l7
0 .Vt3t00N 00t300N BUNUUS
SKLEP PUU POPUGOH|
UL. JHGIELLUNSKA 25, 70-364 SZCZ[C|N
TEL. 91 4841-20-58
N|P: 955—150-21-B2
dn.19r03.05 Uydr.8534
PARAGON FISKALNY
CIHSTKH 17 0,3 ¥ 16,30 = 4.89 B
Sp.0p.B 4,89 PTU B= 8,00% 0,35
Razem PTU 0,35
ZOP{HCUNU GUTUNKQ PLN
RESZTA PLN
0025/1373 H0103 0N|0 H.
15F H9HF[B9416} 13ﬂ02D6k0[20D4334C
7?? BW 140
Any idea how to perform it in better way to get nicer results?

Applying simple thresholding will not be enough for pyTesseract to properly detect the characters. There is much more preprocessing that can be done to drastically improve your results, such as:
using Tesseract V4, where deep learning is implemented
segmenting characters
using only the part of the receipt where the text is through edge detection
perspective transform to straighten out the text
These are somewhat lengthy topics to write all in one answer, but you can check out some articles on pyImageSearch, where this is talked about in much more depth:
https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/
https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/

extracting text information from a national id

I'm trying to do OCR arabic on the following ID but I get a very noisy picture, and can't extract information from it.
Here is my attempt
import tesserocr
from PIL import Image
import pytesseract
import matplotlib as plt
import cv2
import imutils
import numpy as np
image = cv2.imread(r'c:\ahmed\ahmed.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.bilateralFilter(gray,11,18,18)
gray = cv2.GaussianBlur(gray,(5,5), 0)
kernel = np.ones((2,2), np.uint8)
gray = cv2.adaptiveThreshold(gray,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
cv2.THRESH_BINARY,11,2)
#img_dilation = cv2.erode(gray, kernel, iterations=1)
#cv2.imshow("dilation", img_dilation)
cv2.imshow("gray", gray)
text = pytesseract.image_to_string(gray, lang='ara')
print(text)
with open(r"c:\ahmed\file.txt", "w", encoding="utf-8") as myfile:
myfile.write(text)
cv2.waitKey(0)
result
sample

The text for your id is in black color which makes the extraction process easy. All you need to do is threshold the dark pixels and you should be able to get the text out.
Here is a snip of the code
import cv2
import numpy as np
# load image in grayscale
image = cv2.imread('AVXjv.jpg',0)
# remove noise
dst = cv2.blur(image,(3,3))
# extract dark regions which corresponds to text
val, dst = cv2.threshold(dst,80,255,cv2.THRESH_BINARY_INV)
# morphological close to connect seperated blobs
dst = cv2.dilate(dst,None)
dst = cv2.erode(dst,None)
cv2.imshow("dst",dst)
cv2.waitKey(0)
And here is the result:

This is my output using ImageMagick TextCleaner script:
Script: textcleaner -g -e stretch -f 50 -o 30 -s 1 C:/Users/PC/Desktop/id.jpg C:/Users/PC/Desktop/out.png
Take a look here if you want to install and use TextCleaner script on Windows... It's a tutorial I made as simple as possible after few researches I made when I was in your same situation.
Now it should be very easy to detect the text and (not sure how simple) recognize it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tesseract OCR on binary image - python

Related

How to detect if image contains ASCII characters?

Why will tesseract not detect this letter?

How can I get text from this image with Tesseract?

Pytesseract reading receipt

extracting text information from a national id

Categories

Resources