The problem here is that I need to remove the lines and write code to recognize the characters. So far I have only seen solutions where the characters were solid, but here the characters have a double border.
For this specific captcha, there's quite a simple solution. But there's no guarantee that this approach will work on other, even very similar, captchas – due to the "nature" of captchas, as already mentioned in the comments, and in general when dealing with image-processing tasks with limited input data.
Read the image as grayscale.
Threshold the image at a nearly-white cutoff.
Flood fill the "background" with black.
Run pytesseract with the --psm 6 option.
That'd be the whole code:
import cv2
import pytesseract
# Read image as grayscale
img = cv2.imread('FuZEJ.png', cv2.IMREAD_GRAYSCALE)
# Threshold at nearly white cutoff
thr = cv2.threshold(img, 224, 255, cv2.THRESH_BINARY)[1]
# Floodfill "background" with black
ff = cv2.floodFill(thr, None, (0, 0), 0)[1]
# OCR using pytesseract
text = pytesseract.image_to_string(ff, config='--psm 6').replace('\n', '').replace('\f', '')
print(text)
# xwphs
Caveat: I use a special version of Tesseract from the Mannheim University Library.
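If pytesseract cannot find your Tesseract binary on its own, a minimal sketch of pointing it at a specific executable – the path below is only an example and must be adjusted to your install:
import pytesseract
# Example path only – adjust to wherever your Tesseract binary lives
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'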
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.16299-SP0
Python: 3.9.1
PyCharm: 2021.1.1
OpenCV: 4.5.1
pytesseract: 5.0.0-alpha.20201127
----------------------------------------
I would try a mask:
import cv2
import numpy as np
def process(img):  # To process the image
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, img_gray = cv2.threshold(img_gray, 224, 255, cv2.THRESH_TOZERO_INV)
    img_blur = cv2.GaussianBlur(img_gray, (7, 7), 6)
    img_canny = cv2.Canny(img_blur, 0, 100)
    return cv2.dilate(img_canny, np.ones((1, 5)), iterations=1)

def get_mask(img):  # To generate the mask
    mask = np.zeros(img.shape[:2], 'uint8')
    contours, _ = cv2.findContours(process(img), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    for cnt in contours:
        cv2.drawContours(mask, [cnt], -1, 255, -1)
    return mask

def crop(img, mask):  # To mask an image and use white background
    bg = np.full(img.shape, 255, 'uint8')
    fg = cv2.bitwise_or(img, img, mask=mask)
    fg_back_inv = cv2.bitwise_or(bg, bg, mask=cv2.bitwise_not(mask))
    return cv2.bitwise_or(fg, fg_back_inv)
img = cv2.imread("image.png")
img = cv2.pyrUp(cv2.pyrUp(img)) # To enlarge image by 4x
cv2.imshow("Masked Image", crop(img, get_mask(img)))
cv2.waitKey(0)
Before:
After:
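The snippet above stops at displaying the masked image; to actually OCR it, a rough follow-up sketch (assuming pytesseract is installed and configured):
import pytesseract
masked = crop(img, get_mask(img))
# image_to_string accepts a NumPy array directly; --psm 6 treats the image as a uniform block of text
print(pytesseract.image_to_string(masked, config='--psm 6'))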
I would like to read this captcha using pytesseract:
I follow the advice here: Use pytesseract OCR to recognize text from an image
My code is:
import pytesseract
import cv2
def captcha_to_string(picture):
    image = cv2.imread(picture)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3, 3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # Morph open to remove noise and invert image
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
    invert = 255 - opening
    cv2.imwrite('thresh.jpg', thresh)
    cv2.imwrite('opening.jpg', opening)
    cv2.imwrite('invert.jpg', invert)
    # Perform text extraction
    text = pytesseract.image_to_string(invert, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text
But my code returns 8\n\x0c which is nonsense.
This is what thresh looks like:
This is what opening looks like:
This is what invert looks like:
Can you help me improve the captcha_to_string function so that it reads the captcha properly? Thanks a lot.
You are on the right track. Removing the noise (the small black spots in the inverted image) looks like the way to extract the text successfully.
FYI, the pytesseract configuration only makes the outcome worse, so I removed it.
My approach is as follows:
import pytesseract
import cv2
import matplotlib.pyplot as plt
import numpy as np
def remove_noise(img, threshold):
    """
    remove salt-and-pepper noise in a binary image
    """
    filtered_img = np.zeros_like(img)
    labels, stats = cv2.connectedComponentsWithStats(img.astype(np.uint8), connectivity=8)[1:3]
    label_areas = stats[1:, cv2.CC_STAT_AREA]
    for i, label_area in enumerate(label_areas):
        if label_area > threshold:
            filtered_img[labels == i + 1] = 1
    return filtered_img

def preprocess(img_path):
    """
    convert the grayscale captcha image to a clean binary image
    """
    img = cv2.imread(img_path, 0)
    blur = cv2.GaussianBlur(img, (3, 3), 0)
    thresh = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY_INV)[1]
    filtered_img = 255 - remove_noise(thresh, 20) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    erosion = cv2.erode(filtered_img, kernel, iterations=1)
    return erosion

def extract_letters(img):
    text = pytesseract.image_to_string(img)  # , config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text
img = preprocess('captcha.jpg')
text=extract_letters(img)
print(text)
plt.imshow(img,'gray')
plt.show()
This is the processed image.
And, the script returns 18L9R.
Thanks in advance to everyone who answers.
I am new to OpenCV and Pytesseract, and overall very inexperienced with image processing and recognition.
I am trying to detect a digit from a PDF; for the sake of this code I will provide the image directly:
Initial image
My objective is to detect the number in the colored box, which in this case is number 6.
My code for preprocessing is the following:
import cv2
import numpy as np
import pytesseract
from PIL import Image
from PIL import ImageFilter, ImageEnhance
pytesseract.pytesseract.tesseract_cmd = r'Tesseract-OCR\tesseract.exe'  # raw string so the backslash is not treated as a tab escape
# -----Reading the image-----------------------------------------------------
img = cv2.imread('page_image.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.resize(gray, (1028, 720))
thres_gray = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU)[1]
gray_inv = cv2.bitwise_not(thres_gray)
gray_test = cv2.bitwise_not(gray_inv)
out2 = cv2.bitwise_or(gray, gray, mask=gray_inv)
thresh_end = cv2.threshold(out2, 254, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
imageObject = Image.fromarray(thresh_end)
enhancer = ImageEnhance.Sharpness(imageObject)
sharpened1 = imageObject.filter(ImageFilter.SHARPEN)
sharpened2 = sharpened1.filter(ImageFilter.SHARPEN)
# sharpened2.show()
From this I obtain the following picture:
Preprocessed image
At this point, since I am still learning how to detect the region of interest and crop it with OpenCV, I decided to manually crop the image in order to test whether my script works well enough.
Therefore the image I pass to pytesseract is the following:
Final image to read with pytesseract
I am not really sure if the image is good enough to be read, but this is the best I could get.
From this I try image_to_string:
trial = pytesseract.image_to_string(sharpened2, config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789')
I have tried many different configurations for Tesseract, but none of them worked and the final output is always an empty string.
I would be really grateful if you could help me understand whether the image is not good enough or whether I am doing something wrong with the Tesseract configuration.
If you could also help me crop the image correctly, that would be awesome, but even detecting the number is enough for me.
Sorry for the long post and thanks again.
Try this:
import cv2
import pytesseract
import numpy as np
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
img = cv2.imread("form.jpg")
# https://stackoverflow.com/questions/10948589/choosing-the-correct-upper-and-lower-hsv-boundaries-for-color-detection-withcv
ORANGE_MIN = np.array([5, 50, 50], np.uint8)
ORANGE_MAX = np.array([15, 255, 255], np.uint8)
hsv_img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
frame_threshed = cv2.inRange(hsv_img, ORANGE_MIN, ORANGE_MAX)
# cv2.imshow("frame_threshed", frame_threshed)
thresh = cv2.threshold(frame_threshed, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# cv2.imshow("thresh", thresh)
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
# cv2.imshow("dilate", thresh)
for c in cnts:
    x, y, w, h = cv2.boundingRect(c)
    ROI = thresh[y:y + h, x:x + w]
    ratio = 100.0 / ROI.shape[1]
    dim = (100, int(ROI.shape[0] * ratio))
    resizedCubic = cv2.resize(ROI, dim, interpolation=cv2.INTER_CUBIC)
    threshGauss = cv2.adaptiveThreshold(resizedCubic, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 255, 17)
    cv2.imshow("ROI", threshGauss)
    text = int(pytesseract.image_to_string(threshGauss, lang='eng', config="--oem 3 --psm 13"))
    print(f"Detected text: {text}")
cv2.waitKey(0)
I used the HSV method to detect the orange color first. Then, once the ROI was clearly visible, I applied "classic" image pre-processing steps.
Take a look at this link to understand how to select colors other than orange.
I also resized the ROI a bit.
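As an illustration, a minimal sketch of deriving HSV bounds for a different color (a pure green swatch here, purely as an example) by converting a single BGR pixel to HSV:
import cv2
import numpy as np
bgr_green = np.uint8([[[0, 255, 0]]])  # one pure-green pixel in BGR
hue = int(cv2.cvtColor(bgr_green, cv2.COLOR_BGR2HSV)[0][0][0])  # ~60 for green in OpenCV's 0-179 hue range
GREEN_MIN = np.array([max(hue - 10, 0), 50, 50], np.uint8)
GREEN_MAX = np.array([min(hue + 10, 179), 255, 255], np.uint8)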
I want to extract the numbers from an image in Python. In order to do that, I have chosen pytesseract. When I tried extracting the text from the image, the results weren't satisfactory. I also went through the following code and implemented all the techniques listed in the other answers. Yet, it doesn't seem to perform well.
sample images:
and my code is:
import cv2 as cv
import pytesseract
from PIL import Image
import matplotlib.pyplot as plt
pytesseract.pytesseract.tesseract_cmd = r"E:\tesseract\tesseract.exe"
def recognize_text(image):
    # edge-preserving filter for denoising (sp=10, sr=150)
    dst = cv.pyrMeanShiftFiltering(image, sp=10, sr=150)
    plt.imshow(dst)
    # grayscale image
    gray = cv.cvtColor(dst, cv.COLOR_BGR2GRAY)
    # binarization
    ret, binary = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV | cv.THRESH_OTSU)
    # morphological manipulation: erosion then dilation
    erode = cv.erode(binary, None, iterations=2)
    dilate = cv.dilate(erode, None, iterations=1)
    # logical operation makes the background white and the font black for easy recognition
    cv.bitwise_not(dilate, dilate)
    # identify
    test_message = Image.fromarray(dilate)
    custom_config = r'digits'
    text = pytesseract.image_to_string(test_message, config=custom_config)
    print(f'recognition result: {text}')
src = cv.imread(r'roughh/testt/f.jpg')
recognize_text(src)
My problem is that my code only works with the images of '396156' and '436359' and not with any other images. Please suggest some improvements to my code.
I don't know if you've solved your problem, but this kind of image should be pre-processed using this solution. You will need to tweak the parameters. I worked with a similar dataset and the aforementioned solution works well. Let me know your results.
Edit:
I'm improving my answer so that it doesn't just show a link for reference.
The key to this kind of problem is image pre-processing. The main idea is to clean up the input image, preserving just the characters.
Given an input image as
We want an output image as
The following code contains the image pre-processing that I used, based on that solution:
import cv2 as cv
import imutils
import numpy as np
import pytesseract as tsr

# loading image and checking the height and width
img = cv.imread('PNgCd.jpg')
(h, w) = img.shape[:2]
print("Height: {} Width:{}".format(h,w))
cv.imshow('Image', img)
cv.waitKey(0)
cv.destroyAllWindows()
# converting into RGB and resizing the image
img = cv.cvtColor(img, cv.COLOR_BGR2RGB)  # converting into RGB order
img = imutils.resize(img, width=450)  # resizing the width to 450 px
cv.imshow('Image', img)
cv.waitKey(0)
cv.destroyAllWindows()
#gray scale
gray = cv.cvtColor(img, cv.COLOR_RGB2GRAY)
cv.imshow('Gray', gray)
cv.waitKey(0)
cv.destroyAllWindows()
# image thresholding with Otsu method and inverse operation
thresh = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV | cv.THRESH_OTSU)[1]
cv.imshow('Thresh Otsu', thresh)
cv.waitKey(0)
cv.destroyAllWindows()
# distance transform
dist = cv.distanceTransform(thresh, cv.DIST_L2, 5)
dist = cv.normalize(dist, dist, 0, 1.0, cv.NORM_MINMAX)
dist = (dist*255).astype('uint8')
cv.imshow('dist', dist)
cv.waitKey(0)
cv.destroyAllWindows()
#image thresholding with binary operation
dist = cv.threshold(dist, 0, 255, cv.THRESH_BINARY | cv.THRESH_OTSU)[1]
cv.imshow('thresh binary', dist)
cv.waitKey(0)
cv.destroyAllWindows()
#morphological operation
kernel = cv.getStructuringElement(cv.MORPH_CROSS, (2,2))
opening = cv.morphologyEx(dist, cv.MORPH_OPEN, kernel)
cv.imshow('Morphological - Opening', opening)
cv.waitKey(0)
cv.destroyAllWindows()
# dilation or erosion (it depends on your image)
kernel = cv.getStructuringElement(cv.MORPH_CROSS, (2,2))
dilation = cv.dilate(opening, kernel, iterations = 1)
cv.imshow('Dilation', dilation)
cv.waitKey(0)
cv.destroyAllWindows()
# find contours and filter them
cnts = cv.findContours(dilation.copy(), cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
cnts = imutils.grab_contours(cnts)
nums = []
for c in cnts:
    (x, y, w, h) = cv.boundingRect(c)
    if w >= 5 and h > 15:
        nums.append(c)
print(len(nums))  # number of contours kept as characters
#Convex hull and image masking
nums = np.vstack([nums[i] for i in range(0, len(nums))])
hull = cv.convexHull(nums)
mask = np.zeros(dilation.shape[:2], dtype='uint8')
cv.drawContours(mask, [hull], -1, 255, -1)
mask = cv.dilate(mask, None, iterations = 2)
cv.imshow('mask', mask)
cv.waitKey(0)
cv.destroyAllWindows()
# bitwise AND to retrieve the characters from the original image
final = cv.bitwise_and(dilation, dilation, mask=mask)
cv.imshow('final', final)
cv.imwrite('final.jpg', final)
cv.waitKey(0)
cv.destroyAllWindows()
# OCR'ing the pre-processed image
config = "--psm 7 -c tessedit_char_whitelist=0123456789"
text = tsr.image_to_string(final, config=config)
print(text)
The code is an example of how to deal with this kind of image. We must keep in mind that Tesseract is not perfect and requires cleaned-up images to work well. This code can also fail for other images like these; we must tweak the parameters or try other image pre-processing techniques. You must also know the --psm modes: in this case I've used --psm 7, which treats the image as a single text line. For this kind of image, you can also try --psm 8, which treats the image as a single word. This code is just a starting point; you can improve it according to your needs.
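For instance, a quick and purely illustrative way to compare the two page-segmentation modes on the same pre-processed image from the snippet above:
# compare --psm 7 (single text line) and --psm 8 (single word) on the cleaned image
for psm in (7, 8):
    config = f"--psm {psm} -c tessedit_char_whitelist=0123456789"
    print(psm, tsr.image_to_string(final, config=config))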
I want to use OCR (pytesseract) to recognize the text located in images like these:
I have thousands of these arrows. Until now the procedure is as follows: I first resize the image (for another process). Then I crop the image to get rid of the most part of the arrow. Next I draw a white rectangle as a frame to remove further noise but still have distance between text and image borders for better text recognition. I resize the image again to ensure a height of capital letters to ~30 px (https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ). Finally I binarize the image with a threshold of 150.
Full code:
import cv2
image_file = '001.jpg'
# load the input image and grab the image dimensions
image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
(h_1, w_1) = image.shape[:2]
# resize the image and grab the new image dimensions
image = cv2.resize(image, (int(w_1*320/h_1), 320))
(h_1, w_1) = image.shape
# crop image
image_2 = image[70:h_1-70, 20:w_1-20]
# get image_2 height, width
(h_2, w_2) = image_2.shape
# draw white rectangle as a frame around the number -> remove noise
cv2.rectangle(image_2, (0, 0), (w_2, h_2), (255, 255, 255), 40)
# resize image, that capital letters are ~ 30 px in height
image_2 = cv2.resize(image_2, (int(w_2*50/h_2), 50))
# image binarization
ret, image_2 = cv2.threshold(image_2, 150, 255, cv2.THRESH_BINARY)
# save image to file
cv2.imwrite('processed_' + image_file, image_2)
# tesseract part can be commented out
import pytesseract
config_7 = ("-c tessedit_char_whitelist=0123456789AB --oem 1 --psm 7")
text = pytesseract.image_to_string(image_2, config=config_7)
print("OCR TEXT: " + "{}\n".format(text))
The problem is that the text located in the arrow is never centered. Sometimes I remove part of the text with the method described above (e.g. in image 50A).
Is there a method in image processing to get rid of the arrow in a more elegant way? For instance using contour detection and deletion? I am more interested in the OpenCV part than the tesseract part to recognize the text.
Any help is appreciated.
If you look at the pictures, you will see that the white arrow in the image is also the biggest contour (especially if you draw a black border on the image). If you make a blank mask, draw the arrow (the biggest contour in the image) on it, and then erode it a little bit, you can perform a per-element bitwise conjunction of the actual image and the eroded mask. If that is not clear, look at the code and comments below and you will see that it is actually pretty simple.
# imports
import cv2
import numpy as np
img = cv2.imread("number.png") # read image
# you can resize the image here if you like - it should still work for both sizes
h, w = img.shape[:2] # get the actual images height and width
img = cv2.resize(img, (int(w*320/h), 320))
h, w = img.shape[:2]
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # transform to grayscale
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1] # perform OTSU threhold
cv2.rectangle(thresh, (0, 0), (w, h), (0, 0, 0), 2)
contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[0] # search for contours
max_cnt = max(contours, key=cv2.contourArea) # select biggest one
mask = np.zeros((h, w), dtype=np.uint8) # create a black mask
cv2.drawContours(mask, [max_cnt], -1, (255, 255, 255), -1) # draw biggest contour on the mask
kernel = np.ones((15, 15), dtype=np.uint8) # make a kernel with appropriate values - in both cases (resized and original) 15 is ok
erosion = cv2.erode(mask, kernel, iterations=1) # erode the mask with given kernel
reverse = cv2.bitwise_not(img.copy())  # invert the actual image: 0 becomes 255 and 255 becomes 0
img = cv2.bitwise_and(reverse, reverse, mask=erosion) # per-element bit-wise conjunction of the actual image and eroded mask (erosion)
img = cv2.bitwise_not(img)  # invert the image back again
# save image to file and display
cv2.imwrite("res.png", img)
cv2.imshow("img", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
Result:
You can try a simple Python script:
import cv2
import numpy as np
img = cv2.imread('mmubS.png', cv2.IMREAD_GRAYSCALE)
thresh = cv2.threshold(img, 200, 255, cv2.THRESH_BINARY_INV)[1]
im_flood_fill = thresh.copy()
h, w = thresh.shape[:2]
im_flood_fill = cv2.rectangle(im_flood_fill, (0, 0), (w - 1, h - 1), 255, 2)
mask = np.zeros((h + 2, w + 2), np.uint8)
cv2.floodFill(im_flood_fill, mask, (0, 0), 0)
im_flood_fill = cv2.bitwise_not(im_flood_fill)
cv2.imshow('clear text', im_flood_fill)
cv2.imwrite('text.png', im_flood_fill)
Result:
I want to be able to find the bounding boxes of digits in images that may or may not have shadows in it.
To do that, I convert the image to grayscale, then to black and white, and then I find the digits with cv2.findContours().
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.bitwise_not(img)
img = cv2.GaussianBlur(img,(3,3),0)
cv2.threshold(img,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU,img)
contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_NONE)
But in the last example I get this black and white image:
Which doesn't allow the find contours function to work well.
Is there a way to solve this problem?
Otsu's threshold is not the right option here. It is appropriate for a histogram with a bimodal distribution. Read more here.
Among the many alternatives is adaptive thresholding.
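As a quick sanity check of that claim, one way to eyeball whether your grayscale histogram is actually bimodal – matplotlib is an extra dependency here, and the image path is a placeholder:
import cv2
from matplotlib import pyplot as plt
gray = cv2.imread("path/to/image", cv2.IMREAD_GRAYSCALE)
plt.hist(gray.ravel(), bins=256, range=(0, 256))  # a shadowed image typically shows more than two broad peaks
plt.show()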
import cv2
img = cv2.imread("path/to/image")
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.bitwise_not(img)
img = cv2.GaussianBlur(img, (3, 3), 0)
# _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 401, 0)  # blockSize=401, C=0; both may need tuning
contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)  # OpenCV 4 returns (contours, hierarchy)
Here you have to provide the block size of the adaptive kernel (and a constant C that is subtracted from the local weighted mean). I think a block size of 401 worked fine here, but it might not work on your other images.
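If 401 does not work for another image, a rough, self-contained sketch for probing a few candidate block sizes (the image path and the candidate values are only illustrative):
import cv2
gray = cv2.imread("path/to/image", cv2.IMREAD_GRAYSCALE)  # placeholder path
gray = cv2.GaussianBlur(cv2.bitwise_not(gray), (3, 3), 0)  # same inversion and blur as above
for block_size in (201, 301, 401, 501):  # blockSize must be odd and greater than 1
    candidate = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, block_size, 0)
    cv2.imshow(f"blockSize={block_size}", candidate)
cv2.waitKey(0)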
For a little simpler solution, here is one using the OpenCV Wrapper library:
import cv2
import opencv_wrapper as cvw
import numpy as np
img = cv2.imread("masterproject/numbers.jpg")
img = cvw.bgr2gray(img)
img = ~img.astype(np.uint8) # Not part of the library, this is numpy. Only works with uint8
img = cvw.blur_gaussian(img, 3)
img = cvw.threshold_adaptive(img, 401)
contours = cvw.find_external_contours(img)
cvw.draw_contours(img, contours, cvw.Color.GREEN)