Unable to OCR alphanumerical image with Tesseract


I'm trying to read some alphanumeric strings in Python with pytesseract.
I pre-process the images to reduce noise and convert them to black and white, but I consistently have trouble reading the digits inside the string.
original:
after cleanup:
Extracted text: WISOMW
Code used:
def convert(path):
    image = cv2.imread(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3, 3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    invert = 255 - thresh
    cv2.imwrite("processed.jpg", invert)
    # Perform text extraction
    return pytesseract.image_to_string(invert, config="--psm 7")
I've tried different configuration options for tesseract:
oem: tried 1, 3
psm: tried different modes
tessedit_char_whitelist: limited to alphanumerical characters
I feel I'm missing something obvious, given that it reliably reads the alphabetic characters. Any ideas what it could be?

You were so close. Dilation grows the white regions and shrinks the black ones. The resolution is low, so a small kernel is used for the dilation. If you remove the _INV from your threshold step, you don't need to do another inversion.
import cv2
import numpy as np
import pytesseract
img = cv2.imread('wis9mw.jpg', cv2.IMREAD_GRAYSCALE )
img = cv2.GaussianBlur(img, (3, 3), 0)
img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
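# Note: a 1x1 kernel leaves the image unchanged; a larger one such as (2, 2) gives a visible dilation
kernel = np.ones((1,1), np.uint8)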
img = cv2.dilate(img, kernel, iterations=1)
cv2.imwrite('processed.jpg', img)
text = pytesseract.image_to_string(img, config="--psm 6")
print(text)
gives
WIS9MW
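One detail worth checking in the snippet above: dilating with a 1x1 structuring element returns the input unchanged, so the kernel size may need to be raised before the step has any effect. The following is a minimal NumPy-only sketch of square-kernel dilation, written so it runs without OpenCV. The helper name dilate_square and the demo array are made up for illustration, and this is not how cv2.dilate is implemented internally.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def dilate_square(img, ksize):
    """Dilate a 2-D uint8 image with a ksize x ksize square kernel (ksize odd).

    Dilation is a local maximum, so white (255) grows and black (0) shrinks.
    """
    pad = ksize // 2
    # Padding with 0 (the minimum value) never changes a local maximum.
    padded = np.pad(img, pad, mode="constant", constant_values=0)
    windows = sliding_window_view(padded, (ksize, ksize))
    return windows.max(axis=(2, 3))

# Hypothetical 5x5 patch: white background with a black plus-shaped stroke.
demo = np.full((5, 5), 255, dtype=np.uint8)
demo[2, 1:4] = 0
demo[1:4, 2] = 0

same = dilate_square(demo, 1)     # 1x1 kernel: identical to the input
thinner = dilate_square(demo, 3)  # 3x3 kernel: the black stroke shrinks

print((demo == 0).sum(), (same == 0).sum(), (thinner == 0).sum())
```

On thin, low-resolution glyphs a kernel that is too large can erase strokes entirely, so it is usually worth inspecting the processed image before running OCR.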

Related

how to improve pytesseract arguments to work properly

I would like to read this captcha using pytesseract:
I follow the advice here: Use pytesseract OCR to recognize text from an image
My code is:
import pytesseract
import cv2
def captcha_to_string(picture):
    image = cv2.imread(picture)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # Morph open to remove noise and invert image
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
    invert = 255 - opening
    cv2.imwrite('thresh.jpg', thresh)
    cv2.imwrite('opening.jpg', opening)
    cv2.imwrite('invert.jpg', invert)
    # Perform text extraction
    text = pytesseract.image_to_string(invert, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text
But my code returns 8\n\x0c which is nonsense.
This is what thresh looks like:
This is what opening looks like:
This is what invert looks like:
Can you help me improve the captcha_to_string function so that it reads the captcha properly? Thanks a lot.

You are on the right track. Removing the noise (small black spots in the inverted image) looks like the way to extract the text successfully.
FYI, the pytesseract configuration only makes the outcome worse, so I removed it.
My approach is as follows:
import pytesseract
import cv2
import matplotlib.pyplot as plt
import numpy as np
def remove_noise(img, threshold):
    """
    remove salt-and-pepper noise in a binary image
    """
    filtered_img = np.zeros_like(img)
    labels, stats = cv2.connectedComponentsWithStats(img.astype(np.uint8), connectivity=8)[1:3]
    label_areas = stats[1:, cv2.CC_STAT_AREA]
    for i, label_area in enumerate(label_areas):
        if label_area > threshold:
            filtered_img[labels == i + 1] = 1
    return filtered_img
def preprocess(img_path):
    """
    convert the grayscale captcha image to a clean binary image
    """
    img = cv2.imread(img_path, 0)
    blur = cv2.GaussianBlur(img, (3,3), 0)
    thresh = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY_INV)[1]
    filtered_img = 255 - remove_noise(thresh, 20) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    erosion = cv2.erode(filtered_img, kernel, iterations=1)
    return erosion
def extract_letters(img):
    text = pytesseract.image_to_string(img)#,config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text
img = preprocess('captcha.jpg')
text = extract_letters(img)
print(text)
plt.imshow(img, 'gray')
plt.show()
This is the processed image.
And, the script returns 18L9R.
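To make the idea behind remove_noise concrete, here is a small sketch of area-based blob filtering that depends only on NumPy and the standard library. It is an illustration of the technique under the assumption of a 2-D binary array, not a replacement for cv2.connectedComponentsWithStats, which is much faster. The function name remove_small_components and the demo array are hypothetical.

```python
from collections import deque
import numpy as np

def remove_small_components(binary, min_area):
    """Keep only 8-connected foreground blobs with more than min_area pixels."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    out = np.zeros_like(binary)
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] == 0 or seen[sy, sx]:
                continue
            # Breadth-first search collects every pixel of one blob.
            blob = [(sy, sx)]
            seen[sy, sx] = True
            queue = deque(blob)
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] != 0 and not seen[ny, nx]):
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                            blob.append((ny, nx))
            # Copy the blob to the output only if it is large enough.
            if len(blob) > min_area:
                for y, x in blob:
                    out[y, x] = binary[y, x]
    return out

# Hypothetical input: one 3x3 "character" and one isolated noise pixel.
demo = np.zeros((8, 8), dtype=np.uint8)
demo[1:4, 1:4] = 255
demo[6, 6] = 255

cleaned = remove_small_components(demo, min_area=2)
print(cleaned)
```

The area threshold (20 in the answer above) depends on the image resolution, so it would need tuning for captchas of a different size.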

Using pytesseract to get text from an image

I'm trying to use pytesseract to convert some images into text. The images are very basic and I tried using some preprocessing:
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.bitwise_not(gray)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
The original image looks like this:
The resulting image looks like this:
I do this for a bunch of numbers with the same font in the same location. Here are the results:
For most of the images, it still gives no text in the output. For a few of them it does, even though the images look nearly identical.
Here is a snippet of the code I'm using:
def checkCurrentState():
    """image = pyautogui.screenshot()
    image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
    cv2.imwrite("screenshot.png", image)"""
    image = cv2.imread("screenshot.png")
    checkNumbers(image)
def checkNumbers(image):
    numbers = []
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.bitwise_not(gray)
    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    for i in storeLocations:
        cropped = gray[i[1]:i[1]+storeHeight, i[0]:i[0]+storeWidth]
        number = pytesseract.image_to_string(cropped)
        numbers.append(number)
        print(number)
        cv2.imshow("Screenshot", cropped)
        cv2.waitKey(0)

To perform OCR on an image, it's important to preprocess the image. The idea is to obtain a processed image where the text to extract is in black with the background in white. Here's a simple approach using OpenCV and Pytesseract OCR.
To do this, we convert to grayscale, apply a slight Gaussian blur, then Otsu's threshold to obtain a binary image. From here, we can apply morphological operations to remove noise. We perform text extraction using the --psm 6 configuration option to assume a single uniform block of text. Take a look here for more options.
Here's a visualization of each step:
Input image
Convert to grayscale -> Gaussian blur
Otsu's threshold -> Morph open to remove noise
Result from Pytesseract OCR
1100
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Morph open to remove noise
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
# Perform text extraction
data = pytesseract.image_to_string(opening, lang='eng', config='--psm 6')
print(data)
cv2.imshow('blur', blur)
cv2.imshow('thresh', thresh)
cv2.imshow('opening', opening)
cv2.waitKey()
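For readers wondering what the cv2.THRESH_OTSU flag does in the code above: Otsu's method picks the grayscale cutoff that maximizes the variance between the dark and bright pixel groups. The sketch below is a NumPy-only illustration of that idea, assuming an 8-bit grayscale array. The name otsu_threshold and the demo values are made up, and OpenCV may return a slightly different cutoff when several values tie.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the cutoff t that maximizes between-class variance (classes: <= t and > t)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)              # number of pixels with value <= t
    w1 = hist.sum() - w0              # number of pixels with value > t
    sum0 = np.cumsum(hist * levels)   # intensity sum of the darker class
    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = sum0 / w0
        mu1 = (sum0[-1] - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
    # Empty classes give NaN; treat them as zero variance.
    between = np.nan_to_num(between)
    return int(np.argmax(between))

# Hypothetical bimodal image: dark pixels at 50, bright pixels at 200.
demo = np.array([[50, 50, 200],
                 [50, 200, 200]], dtype=np.uint8)
t = otsu_threshold(demo)
# Same convention as cv2.THRESH_BINARY: pixels strictly above t become 255.
binary = np.where(demo > t, 255, 0).astype(np.uint8)
print(t)
print(binary)
```

Because the cutoff is derived from the histogram, Otsu's method works best when text and background form two clearly separated intensity groups.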

Pytesseract and OpenCV can't detect digits

Thanks in advance to everyone that will answer.
I am new to OpenCV, Pytesseract and overall very inexperienced about image processing and recognition.
I am trying to detect a digit from a pdf, for the sake of this code I will directly provide the image:
Initial image
My objective is to detect the number in the colored box, which in this case is number 6.
My code for preprocessing is the following:
import cv2
import numpy as np
import pytesseract
from PIL import Image
from PIL import ImageFilter, ImageEnhance
# Raw string so that "\t" in the path is not read as a tab character
pytesseract.pytesseract.tesseract_cmd = r'Tesseract-OCR\tesseract.exe'
# -----Reading the image-----------------------------------------------------
img = cv2.imread('page_image.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.resize(gray, (1028, 720))
thres_gray = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU)[1]
gray_inv = cv2.bitwise_not(thres_gray)
gray_test = cv2.bitwise_not(gray_inv)
out2 = cv2.bitwise_or(gray, gray, mask=gray_inv)
thresh_end = cv2.threshold(out2, 254, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
imageObject = Image.fromarray(thresh_end)
enhancer = ImageEnhance.Sharpness(imageObject)
sharpened1 = imageObject.filter(ImageFilter.SHARPEN)
sharpened2 = sharpened1.filter(ImageFilter.SHARPEN)
# sharpened2.show()
From this I obtain the following picture:
Preprocessed image
At this point, since I am still learning how to detect the region of interest and crop it with OpenCV, I decided to crop the image manually to check whether my script works well enough.
Therefore the image I pass to pytesseract is the following:
Final image to read with pytesseract
I am not really sure if the image is good enough to be read, but this is the best I could get.
From this I try image_to_string:
trial = pytesseract.image_to_string(sharpened2, config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789')
I have tried many different configurations for tesseract, but none of them worked and the final output is always an empty string.
I would be really grateful if you could help me understand whether the image is not good enough or I am doing something wrong with the tesseract configuration.
If you could also help me crop the image correctly, that would be awesome, but even detecting the number is enough for me.
Sorry for the long post and thanks again.

Try this:
import cv2
import pytesseract
import numpy as np
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
img = cv2.imread("form.jpg")
# https://stackoverflow.com/questions/10948589/choosing-the-correct-upper-and-lower-hsv-boundaries-for-color-detection-withcv
ORANGE_MIN = np.array([5, 50, 50], np.uint8)
ORANGE_MAX = np.array([15, 255, 255], np.uint8)
hsv_img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
frame_threshed = cv2.inRange(hsv_img, ORANGE_MIN, ORANGE_MAX)
# cv2.imshow("frame_threshed", frame_threshed)
thresh = cv2.threshold(frame_threshed, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# cv2.imshow("thresh", thresh)
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
# cv2.imshow("dilate", thresh)
for c in cnts:
    x, y, w, h = cv2.boundingRect(c)
    ROI = thresh[y:y + h, x:x + w]
    ratio = 100.0 / ROI.shape[1]
    dim = (100, int(ROI.shape[0] * ratio))
    resizedCubic = cv2.resize(ROI, dim, interpolation=cv2.INTER_CUBIC)
    threshGauss = cv2.adaptiveThreshold(resizedCubic, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 255, 17)
    cv2.imshow("ROI", threshGauss)
    text = int(pytesseract.image_to_string(threshGauss, lang='eng', config="--oem 3 --psm 13"))
    print(f"Detected text: {text}")
cv2.waitKey(0)
I used the HSV method to detect the orange color first. Then, once the ROI was clearly visible, I applied "classic" image pre-processing steps.
Take a look at this link to understand how to select colors other than orange.
I also resized the ROI a bit.
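As a rough illustration of the color-selection step, the sketch below mimics what cv2.inRange computes for an HSV image: a pixel is kept only if all three channels fall inside the given bounds. It uses NumPy only, and the helper name in_range and the three demo pixels are invented for the example. Note that for 8-bit images OpenCV stores hue in the range 0 to 179, so the bounds 5 to 15 correspond to roughly 10 to 30 degrees on the usual color wheel.

```python
import numpy as np

def in_range(hsv, lower, upper):
    """Return a uint8 mask: 255 where every channel lies in [lower, upper], else 0."""
    inside = (hsv >= lower) & (hsv <= upper)
    return np.where(inside.all(axis=-1), 255, 0).astype(np.uint8)

# Same bounds as in the answer above.
ORANGE_MIN = np.array([5, 50, 50], np.uint8)
ORANGE_MAX = np.array([15, 255, 255], np.uint8)

# Hypothetical 1x3 HSV image: orange, blue-ish hue, and a washed-out orange.
hsv_demo = np.array([[[10, 200, 200],
                      [100, 200, 200],
                      [10, 20, 200]]], dtype=np.uint8)

mask = in_range(hsv_demo, ORANGE_MIN, ORANGE_MAX)
print(mask)
```

The third pixel is rejected because its saturation is below the lower bound, which is how the saturation and value limits keep pale or dark regions out of the mask.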

Unable to read text from Image using pytesseract.image_to_string

The problem here is that I need to remove the lines and write code to recognize the characters. So far I have only seen solutions where the characters are solid, but here each character has a double border.

For this specific captcha, there's quite a simple solution. But, there's no guarantee for this approach to work on other, even very similar captchas – due to the "nature" of captchas as already mentioned in the comments, and in general when dealing with image-processing tasks with limited provided input data.
Read the image as grayscale.
Threshold the image at a nearly white cutoff.
Flood fill the "background" with black.
Run pytesseract with the --psm 6 option.
That'd be the whole code:
import cv2
import pytesseract
# Read image as grayscale
img = cv2.imread('FuZEJ.png', cv2.IMREAD_GRAYSCALE)
# Threshold at nearly white cutoff
thr = cv2.threshold(img, 224, 255, cv2.THRESH_BINARY)[1]
# Floodfill "background" with black
ff = cv2.floodFill(thr, None, (0, 0), 0)[1]
# OCR using pytesseract
text = pytesseract.image_to_string(ff, config='--psm 6').replace('\n', '').replace('\f', '')
print(text)
# xwphs
Caveat: I use a special version of Tesseract from the Mannheim University Library.
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.16299-SP0
Python: 3.9.1
PyCharm: 2021.1.1
OpenCV: 4.5.1
pytesseract: 5.0.0-alpha.20201127
----------------------------------------
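The flood-fill step is what turns the outlined characters into solid ones: after thresholding, the page background and the inside of each glyph are both white, and repainting only the region connected to the corner leaves the glyph interiors as the sole white areas. Below is a minimal sketch of a 4-connected flood fill in plain Python and NumPy. The function name flood_fill and the demo array are hypothetical, and the seed is given as (row, column), whereas cv2.floodFill takes the seed as (x, y).

```python
from collections import deque
import numpy as np

def flood_fill(img, seed, new_value):
    """Return a copy of img with the 4-connected region around seed repainted."""
    out = img.copy()
    h, w = out.shape
    sy, sx = seed
    old_value = out[sy, sx]
    if old_value == new_value:
        return out
    out[sy, sx] = new_value
    queue = deque([(sy, sx)])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and out[ny, nx] == old_value:
                out[ny, nx] = new_value
                queue.append((ny, nx))
    return out

# Hypothetical thresholded glyph: white page, black outline, white interior.
demo = np.full((7, 7), 255, dtype=np.uint8)
demo[1:6, 1:6] = 0      # outline and everything inside it
demo[2:5, 2:5] = 255    # hollow interior of the outlined character

filled = flood_fill(demo, (0, 0), 0)
print(filled)
```

This only works when each outline is closed; a gap in the border would let the fill leak into the character.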

I would try a mask:
import cv2
import numpy as np
def process(img): # To process the image
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, img_gray = cv2.threshold(img_gray, 224, 255, cv2.THRESH_TOZERO_INV)
    img_blur = cv2.GaussianBlur(img_gray, (7, 7), 6)
    img_canny = cv2.Canny(img_blur, 0, 100)
    return cv2.dilate(img_canny, np.ones((1, 5)), iterations=1)
def get_mask(img): # To generate the mask
    mask = np.zeros(img.shape[:2], 'uint8')
    contours, _ = cv2.findContours(process(img), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    for cnt in contours:
        cv2.drawContours(mask, [cnt], -1, 255, -1)
    return mask
def crop(img, mask): # To mask an image and use white background
    bg = np.full(img.shape, 255, 'uint8')
    fg = cv2.bitwise_or(img, img, mask=mask)
    fg_back_inv = cv2.bitwise_or(bg, bg, mask=cv2.bitwise_not(mask))
    return cv2.bitwise_or(fg, fg_back_inv)
img = cv2.imread("image.png")
img = cv2.pyrUp(cv2.pyrUp(img)) # To enlarge image by 4x
cv2.imshow("Masked Image", crop(img, get_mask(img)))
cv2.waitKey(0)
Before:
After:
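The compositing done by crop (keep the masked pixels, paint everything else white) can also be written as a single NumPy selection, assuming a 3-channel image and a single-channel mask of the same height and width. This is only an equivalent sketch of that one step; the helper name paste_on_white and the tiny demo arrays are made up for illustration.

```python
import numpy as np

def paste_on_white(img, mask):
    """Keep img where mask is non-zero and paint all other pixels white."""
    keep = mask.astype(bool)[..., None]   # add a channel axis for broadcasting
    return np.where(keep, img, 255).astype(np.uint8)

# Hypothetical 2x2 color image filled with a dark value.
img_demo = np.full((2, 2, 3), 10, dtype=np.uint8)
# Mask selecting the two pixels on the main diagonal.
mask_demo = np.array([[255, 0],
                      [0, 255]], dtype=np.uint8)

result = paste_on_white(img_demo, mask_demo)
print(result[..., 0])
```

A white background matters here because Tesseract generally does better with dark text on a light background.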

How to improve pytesseract function for captcha decoding?

I want to extract the numbers from an image in python. In order to do that, I have chosen pytesseract. When I tried extracting the text from the image, the results weren't satisfactory. I also went through the following code and implemented all the techniques listed with other answers. Yet, it doesn't seem to perform well.
sample images:
and my code is:
import cv2 as cv
import pytesseract
from PIL import Image
import matplotlib.pyplot as plt
pytesseract.pytesseract.tesseract_cmd = r"E:\tesseract\tesseract.exe"
def recognize_text(image):
    # edge preserving filter denoising 10,150
    dst = cv.pyrMeanShiftFiltering(image, sp=10, sr=150)
    plt.imshow(dst)
    # grayscale image
    gray = cv.cvtColor(dst, cv.COLOR_BGR2GRAY)
    # binarization
    ret, binary = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV | cv.THRESH_OTSU)
    # morphological manipulation corrosion expansion
    erode = cv.erode(binary, None, iterations=2)
    dilate = cv.dilate(erode, None, iterations=1)
    # logical operation makes the background white the font is black for easy recognition.
    cv.bitwise_not(dilate, dilate)
    # identify
    test_message = Image.fromarray(dilate)
    custom_config = r'digits'
    text = pytesseract.image_to_string(test_message, config=custom_config)
    print(f' recognition result ：{text}')
src = cv.imread(r'roughh/testt/f.jpg')
recognize_text(src)
My problem with my code is that it only works with the images of '396156' & '436359' and not with any other images. Please suggest some improvement in my code.

I don't know if you've solved your problem, but this kind of image must be pre-processed using this solution. You will need to tweak the parameters. I worked with a similar dataset and the aforementioned solution works well. Let me know your results.
Editing the answer
I'm improving my answer so that it doesn't just show a link for reference.
The key to this kind of problem is image pre-processing. The main idea is to clean up the input image, keeping just the characters.
Given an input image as
We want an output image as
The following code contains the image pre-processing that I used, based on that solution:
import cv2 as cv
import imutils
import numpy as np
import pytesseract as tsr
# loading image and checking the height and width
img = cv.imread('PNgCd.jpg')
(h, w) = img.shape[:2]
print("Height: {} Width:{}".format(h,w))
cv.imshow('Image', img)
cv.waitKey(0)
cv.destroyAllWindows()
# converting to RGB and resizing the image
img = cv.cvtColor(img, cv.COLOR_BGR2RGB) # converting to RGB order
img = imutils.resize(img, width=450) # resizing to a width of 450 px
cv.imshow('Image', img)
cv.waitKey(0)
cv.destroyAllWindows()
#gray scale
gray = cv.cvtColor(img, cv.COLOR_RGB2GRAY)
cv.imshow('Gray', gray)
cv.waitKey(0)
cv.destroyAllWindows()
# image thresholding with Otsu's method and inverse operation
thresh = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV | cv.THRESH_OTSU)[1]
cv.imshow('Thresh Otsu', thresh)
cv.waitKey(0)
cv.destroyAllWindows()
# distance transform
dist = cv.distanceTransform(thresh, cv.DIST_L2, 5)
dist = cv.normalize(dist, dist, 0, 1.0, cv.NORM_MINMAX)
dist = (dist*255).astype('uint8')
cv.imshow('dist', dist)
cv.waitKey(0)
cv.destroyAllWindows()
# image thresholding with binary operation
dist = cv.threshold(dist, 0, 255, cv.THRESH_BINARY | cv.THRESH_OTSU)[1]
cv.imshow('thresh binary', dist)
cv.waitKey(0)
cv.destroyAllWindows()
#morphological operation
kernel = cv.getStructuringElement(cv.MORPH_CROSS, (2,2))
opening = cv.morphologyEx(dist, cv.MORPH_OPEN, kernel)
cv.imshow('Morphological - Opening', opening)
cv.waitKey(0)
cv.destroyAllWindows()
# dilation or erosion (it depends on your image)
kernel = cv.getStructuringElement(cv.MORPH_CROSS, (2,2))
dilation = cv.dilate(opening, kernel, iterations = 1)
cv.imshow('Dilation', dilation)
cv.waitKey(0)
cv.destroyAllWindows()
# find contours and filter them by bounding-box size
cnts = cv.findContours(dilation.copy(), cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
cnts = imutils.grab_contours(cnts)
nums = []
for c in cnts:
    (x, y, w, h) = cv.boundingRect(c)
    if w >= 5 and h > 15:
        nums.append(c)
len(nums)
#Convex hull and image masking
nums = np.vstack([nums[i] for i in range(0, len(nums))])
hull = cv.convexHull(nums)
mask = np.zeros(dilation.shape[:2], dtype='uint8')
cv.drawContours(mask, [hull], -1, 255, -1)
mask = cv.dilate(mask, None, iterations = 2)
cv.imshow('mask', mask)
cv.waitKey(0)
cv.destroyAllWindows()
# bitwise AND with the mask to keep only the characters from the processed image
final = cv.bitwise_and(dilation, dilation, mask=mask)
cv.imshow('final', final)
cv.imwrite('final.jpg', final)
cv.waitKey(0)
cv.destroyAllWindows()
# OCR'ing the pre-processed image
config = "--psm 7 -c tessedit_char_whitelist=0123456789"
text = tsr.image_to_string(final, config=config)
print(text)
The code is an example of how to deal with this kind of image. Keep in mind that Tesseract is not perfect and needs clean images to work well. This code can also fail on other, similar images, in which case you must tweak the parameters or try other pre-processing techniques. You should also know the --psm modes: here I used --psm 7, which treats the image as a single text line. For this kind of image you can also try --psm 8, which treats the image as a single word. This code is just a starting point; you can improve it according to your needs.
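Since trying several --psm values is part of the tuning, it can help to build the config string in one place instead of editing it by hand. The helper below is a hypothetical convenience function, not part of pytesseract. It only assembles the --psm, --oem and tessedit_char_whitelist options that already appear in the snippets on this page.

```python
def build_tesseract_config(psm=7, oem=None, whitelist=None):
    """Assemble a Tesseract config string from the options used above."""
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)

# Single text line, digits only (the configuration used in the answer above).
config_line = build_tesseract_config(psm=7, whitelist="0123456789")
# Single word, explicit engine mode, digits only.
config_word = build_tesseract_config(psm=8, oem=3, whitelist="0123456789")

print(config_line)
print(config_word)
```

The resulting string would be passed as the config argument, for example tsr.image_to_string(final, config=config_line), which makes it easy to loop over a few candidate modes and compare the results.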
