Pytesseract not able to extract low contrast text from image


I am trying to extract a date from an image, but it is not working. I have more images with dates written in different colours. I have tried a few preprocessing techniques such as adaptive thresholding, erosion, and dilation.
import cv2
import numpy as np
import matplotlib.pyplot as plt
import pytesseract

def cropright(img):
    # Crop the bottom-right strip of the image, then upscale it 5x
    (h, w) = img.shape[:2]
    crp = img[h-60:h, int((4*w)/7):w]
    crp = cv2.resize(crp, (0, 0), fx=5, fy=5, interpolation=cv2.INTER_CUBIC)
    return crp

def extract_text(img):
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    plt.imshow(img)
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    plt.imshow(img)
    img = cv2.GaussianBlur(img, (5, 5), 0)
    plt.imshow(img)
    text = pytesseract.image_to_string(img, lang='eng')
    return text

file = 'test.jpg'
img = cv2.imread("Request/" + file)
img = cropright(img)
plt.imshow(img)
text2 = extract_text(img)
print(text2)
Here is the image. I have more images with dates written in different colours, so I need a solution that works automatically for all of them.
Image

I think that converting the image to grayscale should let you extract the date regardless of the colour it is written in. I created an InstaFilters application that can apply filters to your images; the image is from that web app. You can access it at https://share.streamlit.io/arkalsekar/instafilters/main/app.py
The code to apply a grayscale filter can be found on GitHub in the filters file:
https://github.com/arkalsekar/instafilters
You can try these filters as well; I believe this should work. If it doesn't, you can read this post, which covers some more interesting preprocessing techniques for OCR.
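To make the grayscale suggestion concrete, here is a minimal NumPy-only sketch. The helper names `to_gray` and `stretch_contrast` are hypothetical, and the min-max contrast stretch is an extra step assumed here rather than something the answer proposes. It shows that coloured text on a similar background can end up only a few gray levels apart, and that stretching the range makes the difference usable before thresholding.

```python
import numpy as np

def to_gray(bgr):
    """Weighted BGR -> gray conversion (Y = 0.299 R + 0.587 G + 0.114 B)."""
    b, g, r = bgr[..., 0], bgr[..., 1], bgr[..., 2]
    gray = 0.114 * b + 0.587 * g + 0.299 * r
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

def stretch_contrast(gray):
    """Linearly rescale the gray levels so they span the full 0..255 range."""
    lo, hi = int(gray.min()), int(gray.max())
    if hi == lo:
        return gray.copy()  # flat image, nothing to stretch
    out = (gray.astype(np.float32) - lo) * (255.0 / (hi - lo))
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# Synthetic low-contrast patch (made-up values): gray background, faint coloured "text"
patch = np.full((4, 4, 3), 120, dtype=np.uint8)
patch[1:3, 1:3] = (100, 110, 140)  # B, G, R

gray = to_gray(patch)
stretched = stretch_contrast(gray)
print(int(gray.min()), int(gray.max()))            # levels only 2 apart
print(int(stretched.min()), int(stretched.max()))  # full range after stretching
```

With real images the same idea would be applied to the cropped date region only, so that the stretch is not dominated by the rest of the picture.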

Related

how to improve pytesseract arguments to work properly

I would like to read this captcha using pytesseract:
I followed the advice here: Use pytesseract OCR to recognize text from an image
My code is:
import pytesseract
import cv2

def captcha_to_string(picture):
    image = cv2.imread(picture)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # Morph open to remove noise and invert image
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
    invert = 255 - opening
    cv2.imwrite('thresh.jpg', thresh)
    cv2.imwrite('opening.jpg', opening)
    cv2.imwrite('invert.jpg', invert)
    # Perform text extraction
    text = pytesseract.image_to_string(invert, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text
But my code returns 8\n\x0c which is nonsense.
This is what thresh looks like:
This is what opening looks like:
This is what invert looks like:
Can you help me improve the captcha_to_string function so that it reads the captcha properly? Thanks a lot.

You are on the right track. Removing the noise (the small black spots in the inverted image) looks like the way to extract the text successfully.
FYI, the pytesseract configuration only made the outcome worse, so I removed it.
My approach is as follows:
import pytesseract
import cv2
import matplotlib.pyplot as plt
import numpy as np

def remove_noise(img, threshold):
    """
    remove salt-and-pepper noise in a binary image
    """
    filtered_img = np.zeros_like(img)
    labels, stats = cv2.connectedComponentsWithStats(img.astype(np.uint8), connectivity=8)[1:3]
    # skip label 0, which is the background component
    label_areas = stats[1:, cv2.CC_STAT_AREA]
    for i, label_area in enumerate(label_areas):
        # keep only components larger than the area threshold
        if label_area > threshold:
            filtered_img[labels == i+1] = 1
    return filtered_img

def preprocess(img_path):
    """
    convert the grayscale captcha image to a clean binary image
    """
    img = cv2.imread(img_path, 0)
    blur = cv2.GaussianBlur(img, (3,3), 0)
    thresh = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY_INV)[1]
    filtered_img = 255 - remove_noise(thresh, 20)*255
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    erosion = cv2.erode(filtered_img, kernel, iterations=1)
    return erosion

def extract_letters(img):
    text = pytesseract.image_to_string(img)#,config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text

img = preprocess('captcha.jpg')
text = extract_letters(img)
print(text)
plt.imshow(img, 'gray')
plt.show()
This is the processed image.
And, the script returns 18L9R.
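The raw string that comes back from OCR can carry layout characters, as in the `8\n\x0c` quoted in the question. A small post-processing step, sketched below under the assumption that only letters and digits are wanted, removes them before the result is compared or stored. The helper name `clean_ocr_text` and the sample strings are hypothetical.

```python
import re

def clean_ocr_text(raw, allowed="A-Za-z0-9"):
    """Drop every character outside the allowed class, including the
    trailing newline and form feed (\\x0c) seen in raw OCR output."""
    return re.sub(f"[^{allowed}]", "", raw)

# Hypothetical raw outputs shaped like the one quoted in the question
print(repr(clean_ocr_text("18L9R\n\x0c")))
print(repr(clean_ocr_text("8\n\x0c", allowed="0-9")))
```

Filtering after the fact is an alternative to the character whitelist in the Tesseract config, which this answer found made the result worse for this captcha.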

How to improve pytesseract function for captcha decoding?

I want to extract the numbers from an image in python. In order to do that, I have chosen pytesseract. When I tried extracting the text from the image, the results weren't satisfactory. I also went through the following code and implemented all the techniques listed with other answers. Yet, it doesn't seem to perform well.
sample images:
and my code is:
import cv2 as cv
import pytesseract
from PIL import Image
import matplotlib.pyplot as plt

pytesseract.pytesseract.tesseract_cmd = r"E:\tesseract\tesseract.exe"

def recognize_text(image):
    # edge preserving filter denoising 10,150
    dst = cv.pyrMeanShiftFiltering(image, sp=10, sr=150)
    plt.imshow(dst)
    # grayscale image
    gray = cv.cvtColor(dst, cv.COLOR_BGR2GRAY)
    # binarization
    ret, binary = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV | cv.THRESH_OTSU)
    # morphological manipulation corrosion expansion
    erode = cv.erode(binary, None, iterations=2)
    dilate = cv.dilate(erode, None, iterations=1)
    # logical operation makes the background white the font is black for easy recognition.
    cv.bitwise_not(dilate, dilate)
    # identify
    test_message = Image.fromarray(dilate)
    custom_config = r'digits'
    text = pytesseract.image_to_string(test_message, config=custom_config)
    print(f' recognition result ：{text}')

src = cv.imread(r'roughh/testt/f.jpg')
recognize_text(src)
The problem is that my code only works on the images of '396156' and '436359' and not on any of the others. Please suggest improvements to my code.

I don't know if you've solved your problem, but this kind of image must be pre-processed using this solution. You will need to tweak the parameters. I worked with a similar dataset and the aforementioned solution works well. Let me know your results.
Editing the answer
I'm improving my answer so that it does not show just a link for reference.
The key to this kind of problem is image pre-processing. The main idea is to clean up the input image so that only the characters remain.
Given an input image as
We want an output image as
The following code contains the image pre-processing that I used, based on the solution:
import cv2 as cv
import imutils
import numpy as np
import pytesseract as tsr  # assumed alias: the snippet below calls tsr.image_to_string

# loading image and checking the height and width
img = cv.imread('PNgCd.jpg')
(h, w) = img.shape[:2]
print("Height: {} Width:{}".format(h,w))
cv.imshow('Image', img)
cv.waitKey(0)
cv.destroyAllWindows()

# converting into RGB and resizing the image
img = cv.cvtColor(img, cv.COLOR_BGR2RGB) # converting into RGB order
img = imutils.resize(img, width=450) # resizing the width to 450 px
cv.imshow('Image', img)
cv.waitKey(0)
cv.destroyAllWindows()

# gray scale
gray = cv.cvtColor(img, cv.COLOR_RGB2GRAY)
cv.imshow('Gray', gray)
cv.waitKey(0)
cv.destroyAllWindows()

# image thresholding with Otsu method and inverse operation
thresh = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV | cv.THRESH_OTSU)[1]
cv.imshow('Thresh Otsu', thresh)
cv.waitKey(0)
cv.destroyAllWindows()

# distance transform
dist = cv.distanceTransform(thresh, cv.DIST_L2, 5)
dist = cv.normalize(dist, dist, 0, 1.0, cv.NORM_MINMAX)
dist = (dist*255).astype('uint8')
cv.imshow('dist', dist)
cv.waitKey(0)
cv.destroyAllWindows()

# image thresholding with binary operation
dist = cv.threshold(dist, 0, 255, cv.THRESH_BINARY | cv.THRESH_OTSU)[1]
cv.imshow('thresh binary', dist)
cv.waitKey(0)
cv.destroyAllWindows()

# morphological operation
kernel = cv.getStructuringElement(cv.MORPH_CROSS, (2,2))
opening = cv.morphologyEx(dist, cv.MORPH_OPEN, kernel)
cv.imshow('Morphological - Opening', opening)
cv.waitKey(0)
cv.destroyAllWindows()

# dilation or erosion (it depends on your image)
kernel = cv.getStructuringElement(cv.MORPH_CROSS, (2,2))
dilation = cv.dilate(opening, kernel, iterations = 1)
cv.imshow('Dilation', dilation)
cv.waitKey(0)
cv.destroyAllWindows()

# find contours and keep only those large enough to be characters
cnts = cv.findContours(dilation.copy(), cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
cnts = imutils.grab_contours(cnts)
nums = []
for c in cnts:
    (x, y, w, h) = cv.boundingRect(c)
    if w >= 5 and h > 15:
        nums.append(c)
print(len(nums))

# convex hull and image masking
nums = np.vstack(nums)
hull = cv.convexHull(nums)
mask = np.zeros(dilation.shape[:2], dtype='uint8')
cv.drawContours(mask, [hull], -1, 255, -1)
mask = cv.dilate(mask, None, iterations = 2)
cv.imshow('mask', mask)
cv.waitKey(0)
cv.destroyAllWindows()

# bitwise AND to retrieve the characters inside the mask
final = cv.bitwise_and(dilation, dilation, mask=mask)
cv.imshow('final', final)
cv.imwrite('final.jpg', final)
cv.waitKey(0)
cv.destroyAllWindows()

# OCR'ing the pre-processed image
config = "--psm 7 -c tessedit_char_whitelist=0123456789"
text = tsr.image_to_string(final, config=config)
print(text)
The code is an example of how to deal with this kind of image. Keep in mind that Tesseract is not perfect and requires clean images to work well. This code can also fail on other images of this kind, in which case you must tweak the parameters or try other image pre-processing techniques. You should also know the --psm modes: in this case I used --psm 7, which treats the image as a single text line. For this kind of image you can also try --psm 8, which treats the image as a single word. This code is just a starting point; you can improve it according to your needs.
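Since the answer suggests trying both --psm 7 and --psm 8, it can help to build the config string in one place. The sketch below is only an illustration: the helper name `build_tesseract_config` is hypothetical, and the flags it emits are the ones already used in the code above.

```python
def build_tesseract_config(psm=7, whitelist="0123456789"):
    """Assemble a Tesseract config string from a page segmentation mode
    and an optional character whitelist."""
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)

# psm 7 = single text line, psm 8 = single word (the two modes named above)
for psm in (7, 8):
    print(build_tesseract_config(psm))
```

Each string can then be passed as `config=` to `image_to_string`, so the two modes can be compared on the same pre-processed image.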

How to improve Tesseract accuracy

I am trying to run OCR on a set of images that are similar but can vary in size. For some reason I cannot get a predictable result. Is there anything I can do to get better results?
Tesseract, with or without cv2 preprocessing, works beautifully on some images and fails on others, and there is no pattern. The images are more or less similar.
The upper image represents the processed image.
import string

import cv2
import numpy as np
import pytesseract
from PIL import Image, ImageOps

def filter_img(img):
    # Read pil image as cv2
    img = np.array(img)
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    # Converting image to grayscale (important for applying threshold)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    # img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Apply blur to smooth out the edges
    img = cv2.GaussianBlur(img, (5, 5), 0)
    # img = cv.medianBlur(img,5)
    # Apply threshold to get image with only b&w (binarization)
    img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    img = Image.fromarray(img)
    img = ImageOps.expand(img, border=2, fill='black')
    # NOTE: `visualize` and `boxes` are not defined in the snippet as posted
    visualize.show_labeled_image(img, boxes)
    return img

# Applying Tesseract OCR
def run_tesseract(img):
    # Tesseract cmd setup
    # pytesseract.pytesseract.tesseract_cmd = "tesseract"
    whitelist = string.ascii_uppercase + string.digits + ".-"
    parameters = '-c load_freq_dawg=0 -c tessedit_char_whitelist="{}"'.format(whitelist)
    psm = 8
    custom_oem_psm_config = "--dpi 300 --oem 3 --psm {psm} {parameters}".format(parameters=parameters, psm=psm)
    try:
        text = pytesseract.image_to_string(img, config=custom_oem_psm_config, timeout=2)
        return text.strip()
    except RuntimeError:
        print("TIMEOUT")
        return ""

If your image format is highly consistent, you might consider cropping the image to the region you need. After running OCR on the crop, apply conditional checks to the error-prone positions, for example deciding from the first character whether it should be a letter or a number, since characters such as 0 and O are easily confused. Of course, all of the above is only valid if the images are highly consistent.
import cv2
import numpy as np
import pytesseract
import matplotlib.pyplot as plt

pytesseract.pytesseract.tesseract_cmd = 'D://Program Files/Tesseract-OCR/tesseract.exe'
img = cv2.imread('vATKQ.png')
img2 = img[100:250, 180:650]  # crop to the region you want
plt.imshow(img2)
text = pytesseract.image_to_string(img2)
print(text)
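The conditional check on error-prone characters could look like the sketch below. It assumes the layout of the label is known in advance and is described by a pattern string, which is an assumption made for this illustration and not part of the answer; the look-alike tables and the name `fix_by_pattern` are hypothetical too.

```python
# Assumed look-alike pairs; extend them for the font in your own images.
TO_DIGIT = str.maketrans({"O": "0", "I": "1", "S": "5", "B": "8"})
TO_LETTER = str.maketrans({"0": "O", "1": "I", "5": "S", "8": "B"})

def fix_by_pattern(text, pattern):
    """Correct look-alike characters position by position.

    pattern uses 'A' where a letter is expected and '9' where a digit is
    expected; any other pattern character leaves that position unchanged.
    """
    if len(text) != len(pattern):
        return text  # layout does not match, so do not guess
    out = []
    for ch, kind in zip(text, pattern):
        if kind == "9":
            out.append(ch.translate(TO_DIGIT))
        elif kind == "A":
            out.append(ch.translate(TO_LETTER))
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical OCR result for a label laid out as two letters, a dash, three digits
print(fix_by_pattern("AO-1O5", "AA-999"))  # AO-105
```

The same O is kept as a letter in the second position and turned into a zero in the fifth, which is the point of deciding by position rather than by character.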

Text Extraction from Image with Single letter in it

I have a poor-quality image with a single letter in it. I need to extract the value from it.
I tried doing this with OpenCV. The code works on a good-quality image, but I need help extracting the letter from this image.
from PIL import Image
import pytesseract
import argparse
import os
import cv2
import numpy as np
img = cv2.imread(r"/home/ubuntu/xyz/xyz.jpg")
img = cv2.resize(img, None, fx=1.5, fy=1.5, interpolation=cv2.INTER_CUBIC)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
img = cv2.GaussianBlur(img, (5, 5), 0)
img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# Save the filtered image
cv2.imwrite(r"/home/ubuntu/xyz/rr.jpg", img)
# Read text with tesseract for python
result = pytesseract.image_to_string(img, lang="eng")
print(result)

Why do you need a Gaussian blur in this situation,
img = cv2.GaussianBlur(img, (5, 5), 0)
and with such a large (5, 5) window?
I think you can add a white border around the image instead of resizing it,
and you can use erosion to remove the noise from the image.
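The white-border suggestion can be sketched with NumPy alone, assuming the image is already a 2-D grayscale array with dark text on a light background. The helper name `add_white_border` and the border width are hypothetical; in OpenCV, `cv2.copyMakeBorder` with `cv2.BORDER_CONSTANT` should give the same result.

```python
import numpy as np

def add_white_border(gray, border=10):
    """Pad a 2-D grayscale image with a white (255) margin on every side."""
    return np.pad(gray, border, mode="constant", constant_values=255)

# Stand-in for a tightly cropped single-letter image
glyph = np.zeros((20, 12), dtype=np.uint8)
padded = add_white_border(glyph, border=10)
print(padded.shape)  # (40, 32)
```

The margin gives the OCR engine some background around the character, which a tight crop does not.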

Is there a way to threshold an image such that it ignores shadows as much as possible with OpenCV?

I want to be able to find the bounding boxes of digits in images that may or may not have shadows in it.
To do that I convert the image to grayscale, then to black and white, and then I find the digits with cv2.findContours():
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.bitwise_not(img)
img = cv2.GaussianBlur(img,(3,3),0)
cv2.threshold(img,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU,img)
contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_NONE)
But in the last example I get this black and white image:
Which doesn't allow the find contours function to work well.
Is there a way to solve this problem?

Otsu's threshold is not the right option here. It only gives a good result when the image histogram is bimodal (two well-separated peaks), and a shadow breaks that assumption. Read more here.
One of many alternatives is an adaptive threshold.
import cv2

img = cv2.imread("path/to/image")
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.bitwise_not(img)
img = cv2.GaussianBlur(img, (3, 3), 0)
# _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# blockSize=401; the original call omitted the required C argument,
# so C=0 here is an assumed value. The result must also be assigned.
img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 401, 0)
# OpenCV 4.x returns (contours, hierarchy)
contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
You have to provide the block size of the adaptive kernel here. I think 401 worked fine for this image, but it might not work on your other images.
For a slightly simpler solution, here is one using the OpenCV Wrapper library:
import cv2
import opencv_wrapper as cvw
import numpy as np
img = cv2.imread("masterproject/numbers.jpg")
img = cvw.bgr2gray(img)
img = ~img.astype(np.uint8) # Not part of the library, this is numpy. Only works with uint8
img = cvw.blur_gaussian(img, 3)
img = cvw.threshold_adaptive(img, 401)
contours = cvw.find_external_contours(img)
cvw.draw_contours(img, contours, cvw.Color.GREEN)
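To show why a local threshold copes with shadows where a single global one does not, here is a NumPy-only sketch of mean adaptive thresholding. It is an illustration under assumptions, not OpenCV's implementation: it uses a plain box mean instead of the Gaussian weighting selected above, and the function name, the synthetic image, and the parameter values are all made up for the example.

```python
import numpy as np

def adaptive_mean_threshold(gray, block_size, c):
    """Set a pixel to 255 if it is brighter than (local mean - c), else 0.

    The local mean is taken over a block_size x block_size window,
    computed with an integral image; borders are edge-replicated.
    """
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be odd and at least 3")
    k = block_size
    h, w = gray.shape
    padded = np.pad(gray.astype(np.float64), k // 2, mode="edge")
    # Integral image with a leading zero row and column
    integral = np.pad(padded, ((1, 0), (1, 0)), mode="constant").cumsum(0).cumsum(1)
    window_sum = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
                  - integral[k:k + h, :w] + integral[:h, :w])
    local_mean = window_sum / (k * k)
    return np.where(gray > local_mean - c, 255, 0).astype(np.uint8)

# Synthetic strip: bright left half, shadowed right half, one darker
# "stroke" pixel in each half
img = np.full((7, 12), 200, dtype=np.uint8)
img[:, 6:] = 100   # shadow
img[3, 2] = 150    # stroke in the lit half
img[3, 9] = 50     # stroke in the shadowed half

global_mask = np.where(img > 125, 255, 0).astype(np.uint8)
local_mask = adaptive_mean_threshold(img, block_size=5, c=10)

print(global_mask[3, 2], global_mask[0, 11])  # global: lit stroke missed, shadow turned black
print(local_mask[3, 2], local_mask[3, 9], local_mask[0, 11])  # local: both strokes black, shadow background white
```

The hard edge of the shadow still shows up in the local result, which is one reason the block size has to be tuned per image set, as the answer notes.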
