Extract specific contents from text using python and Tesseract OCR

Extract specific contents from text using python and Tesseract OCR - python

I am using tesseract OCR to extract text from image file .
Below is the sample text I got from my Image:
Certificate No. Certificate Issued Date Acoount Reference Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA SUBIN-KAKAKSFCL0858710154264833O
How can I extract Certificate No. from this? Any hint or solution will help me here.

If the certificate number is always in the structure it is given here (2 letters, hyphen, 17 digits) you can use regex:
import regex as re
# i took the entire sequence originally but this is just an example
sequence = 'Reference IN-KA047969602415880 18-Feb-2016 01:39'
re.search('[A-Z]{2}-.{17}', seq).group()
#'IN-KA047969602415880'
.search searches for a specific pattern you dictate, and .group() return the first result (in this case there would be only one). You can search for anything like this in a given string, I suggest a review of regex here.

Before throwing the image into Tesseract OCR, it's important to preprocess the image to remove noise and smooth the text. Here's a simple approach using OpenCV
Convert image to grayscale
Otsu's threshold to obtain binary image
Gaussian blur and invert image
After converting to grayscale, we Otsu's threshold to get a binary image
From here we give it a slight blur and invert the image to get our result
Results from Pytesseract
Certificate No. : IN-KA047969602415880
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = cv2.imread('1.png',0)
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]
blur = cv2.GaussianBlur(thresh, (3,3), 0)
result = 255 - blur
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.waitKey()

Related

How to OCR a text with white colour characters on a blue background from a cropped image?

First, I want to crop an image using a mouse event, and then print the text inside the cropped image. I tried OCR scripts but all can't work for this image attached below. I think the reason is that the text has white characters on blue background.
Can you help me with doing this?
Full image:
Cropped image:
An example what I tried is:
import pytesseract
import cv2
import numpy as np
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
img = cv2.imread('D:/frame/time 0_03_.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
adaptiveThresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 35, 30)
inverted_bin=cv2.bitwise_not(adaptiveThresh)
#Some noise reduction
kernel = np.ones((2,2),np.uint8)
processed_img = cv2.erode(inverted_bin, kernel, iterations = 1)
processed_img = cv2.dilate(processed_img, kernel, iterations = 1)
#Applying image_to_string method
text = pytesseract.image_to_string(processed_img)
print(text)

[EDIT]
For anyone wondering, the image in the question was updated after posting my answer. That was the original image:
Thus, the below output in my original answer.
That's the newly posted image:
The specific Turkish characters, especially in the last word, are still not properly detected (since I still can't use lang='tur' right now), but at least the Ö and Ü can be detected using lang='deu', which I have installed:
text = pytesseract.image_to_string(mask, lang='deu').strip().replace('\n', '').replace('\f', '')
print(text)
# GÖKYÜZÜ AVCILARI ILE TEKE TEK KLASIGI
[/EDIT]
I wouldn't use cv2.adaptiveThreshold here, but simple cv2.threshold using cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV. Since, the comma touches the image border, I'd add another, one pixel wide border via cv2.copyMakeBorder to capture the comma properly. So, that would be the full code (replacing \f is due to my pytesseract version only):
import cv2
import pytesseract
img = cv2.imread('n7nET.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
mask = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]
mask = cv2.copyMakeBorder(mask, 1, 1, 1, 1, cv2.BORDER_CONSTANT, 0)
text = pytesseract.image_to_string(mask).strip().replace('\n', '').replace('\f', '')
print(text)
# 2020'DE SALGINI BILDILER, YA 2021'DE?
The output seems correct to me – of course, not for this special (I assume Turkish) capital I character with the dot above. Unfortunately, I can't run pytesseract.image_to_string(..., lang='tur'), since it's simply not installed. Maybe, have a look at that to get the proper characters here as well.
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.16299-SP0
Python: 3.9.1
PyCharm: 2021.1.1
OpenCV: 4.5.1
pytesseract: 5.0.0-alpha.20201127
----------------------------------------

Pytesseract doesn't recognize decimal points

I'm trying to read the text in this image that contains also decimal points and decimal numbers
in this way:
img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))
and what I get is:
73-82
Primo: 50 —
I've tried to specify also the italian language but the result is pretty similar:
73-82 _
Primo: 50
Searching through other questions on stackoverflow I found that the reading of the decimal numbers can be improved by using a whitelist, in this case tessedit_char_whitelist='0123456789.', but I want to read also the words in the image. Any idea on how to improve the reading of decimal numbers?

I would suggest passing tesseract every row of text as separate image.
For some reason it seams to solve the decimal point issue...
Convert image from grayscale to black and white using cv2.threshold.
Use cv2.dilate morphological operation with very long horizontal kernel (merge blocks across horizontal direction).
Use find contours - each merged row is going to be in a separate contour.
Find bounding boxes of the contours.
Sort the bounding boxes according to the y coordinate.
Iterate bounding boxes, and pass slices to pytesseract.
Here is the code:
import numpy as np
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # I am using Windows
path_to_image = 'image.png'
img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE) # Read input image as Grayscale
# Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Dilate thresh for uniting text areas into blocks of rows.
dilated_thresh = cv2.dilate(thresh, np.ones((3,100)))
# Find contours on dilated_thresh
cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2] # Use index [-2] to be compatible to OpenCV 3 and 4
# Build a list of bounding boxes
bounding_boxes = [cv2.boundingRect(c) for c in cnts]
# Sort bounding boxes from "top to bottom"
bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])
# Iterate bounding boxes
for b in bounding_boxes:
x, y, w, h = b
if (h > 10) and (w > 10):
# Crop a slice, and inverse black and white (tesseract prefers black text).
slice = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]
text = pytesseract.image_to_string(slice, config="-c tessedit"
"_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
" --psm 3"
" ")
print(text)
I know it's not the most general solution, but it manages to solve the sample you have posted.
Please treat the answer as a conceptual solution - finding a robust solution might be very challenging.
Results:
Thresholder image after dilate:
First slice:
Second slice:
Third slice:
Output text:
7.3-8.2
Primo:50

You can easily recognize by down-sampling the image.
If you down-sample by 0.5, result will be:
Now if you read:
7.3 - 8.2
Primo: 50
I got the result by using pytesseract 0.3.7 version (current)
Code:
# Load the libraries
import cv2
import pytesseract
# Load the image
img = cv2.imread("s9edQ.png")
# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Down-sample
gry = cv2.resize(gry, (0, 0), fx=0.5, fy=0.5)
# OCR
txt = pytesseract.image_to_string(gry)
print(txt)
Explanation:
The input-image contains a little bit of an artifact. You can see it on the right part of the image. On the other hand, the current image is perfect for OCR recognition. You need to use the pre-preprocessing method when the data from the image is not visible or corrupted. Please read the followings:
Image processing
Page-segmentation-mode

PyTesseract not recognizing decimals

This is not truly a duplicate of How to extract decimal in image with Pytesseract, as those answers did not solve my problem and my use case is different.
I'm using PyTesseract to recognise text in table cells. When it comes to recognising drug doses with decimal points, the OCR fails to recognise the ., though is accurate for everything else. I'm using tesseract v5.0.0-alpha.20200328 on Windows 10.
My pre-processing consists of upscaling by 400% using cubic, conversion to black and white, dilation and erosion, morphology, and blurring. I've tried a decent combination of all of these (as well as each on their own), and nothing has recognized the ..
I've tried --psm of various values as well as a character whitelist. I believe the font is Sergoe UI.
Before processing:
After processing:
PyTesseract output: 25mg »p
Processing code:
import cv2, pytesseract
import numpy as np
image = cv2.imread( '01.png' )
upscaled_image = cv2.resize(image, None, fx = 4, fy = 4, interpolation = cv2.INTER_CUBIC)
bw_image = cv2.cvtColor(upscaled_image, cv2.COLOR_BGR2GRAY)
kernel = np.ones((2, 2), np.uint8)
dilated_image = cv2.dilate(bw_image, kernel, iterations=1)
eroded_image = cv2.erode(dilated_image, kernel, iterations=1)
thresh = cv2.threshold(eroded_image, 205, 255, cv2.THRESH_BINARY)[1]
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
morh_image = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
blur_image = cv2.threshold(cv2.bilateralFilter(morh_image, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
final_image = blur_image
text = pytesseract.image_to_string(final_image, lang='eng', config='--psm 10')

If you haven't made sure of this, check out this link
visit https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/xk2ErJnFBQAJ
One major solution for for many problems is text height, I was facing many issues but wasn't able to figure out why, but seems sending image with correct size letters to tesseract solves many problems.
instead of upscaling to a random % try the number with which your image has letters close to 30- 40 Px.
Also if somehow your preprocessing change "." into a noise like char then too it will get ignored.

I had a similar case that and was able to increase the number of correct decimals by using image processing methods and upscaling of the image. Yet, a small share of the decimals were not recognized correctly.
The solution I found was to change the language setting for pytesseract:
I was using a non-English setting, but changing the config to lang='eng' fixed all remaining issues.
That might not help with the original question, though, as the setting is already eng.

How can I get text from this image with Tesseract?

Currently I'm using the code below to get text from image and it works fine, but it doesn't work well with these two images, it seems like tesseract cannot scan these types of image. Please show me how to fix it
https://i.ibb.co/zNkbhKG/Untitled1.jpg
https://i.ibb.co/XVbjc3s/Untitled3.jpg
def read_screen():
spinner = Halo(text='Reading screen', spinner='bouncingBar')
spinner.start()
screenshot_file="Screens/to_ocr.png"
screen_grab(screenshot_file)
#prepare argparse
ap = argparse.ArgumentParser(description='HQ_Bot')
ap.add_argument("-i", "--image", required=False,default=screenshot_file,help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh", help="type of preprocessing to be done")
args = vars(ap.parse_args())
# load the image
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
if args["preprocess"] == "thresh":
gray = cv2.threshold(gray, 177, 177,
cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
elif args["preprocess"] == "blur":
gray = cv2.medianBlur(gray, 3)
# store grayscale image as a temp file to apply OCR
filename = "Screens/{}.png".format(os.getpid())
cv2.imwrite(filename, gray)
# load the image as a PIL/Pillow image, apply OCR, and then delete the temporary file
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
#ENG
#text = pytesseract.image_to_string(Image.open(filename))
#VIET
text = pytesseract.image_to_string(Image.open(filename), lang='vie')
os.remove(filename)
os.remove(screenshot_file)
# show the output images
'''cv2.imshow("Image", image)
cv2.imshow("Output", gray)
os.remove(screenshot_file)
if cv2.waitKey(0):
cv2.destroyAllWindows()
print(text)
'''
spinner.succeed()
spinner.stop()
return text

You should try different psm modes instead of default like so:
target = pytesseract.image_to_string(im,config='--psm 4',lang='vie')
Exert from docs:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
So for example for /Untitled3.jpg you could try --psm 4 and failing that you could try --psm 11 for both.
Depending on your version of tesseract you could also try different oem modes:
Use --oem 1 for LSTM, --oem 0 for Legacy Tesseract. Please note that Legacy Tesseract models are only included in traineddata files from tessdata repo.
EDIT
Also as seen in your images there are two languages so if you wish to use lang parameter you need to manually separate image into two to not to confuse tesseract engine and use different lang values for them.
EDIT 2
Below a full working example with Unitiled3. What I noticed was your improper use of thresholding. You should set maxval to something bigger than the value you are thresholding at. Like in my example I set thresh 177 but maxval to 255 so everything above 177 will be black. I didn't even had to do any binarization.
import cv2
import pytesseract
from cv2.cv2 import imread, cvtColor, COLOR_BGR2GRAY, threshold, THRESH_BINARY
image = imread("./Untitled3.jpg")
image = cvtColor(image,COLOR_BGR2GRAY)
_,image = threshold(image,177,255,THRESH_BINARY)
cv2.namedWindow("TEST")
cv2.imshow("TEST",image)
cv2.waitKey()
text = pytesseract.image_to_string(image, lang='eng')
print(text)
Output:
New York, New York
Salzburg, Austria
Hollywood, California

How to extract dotted text from image?

I'm working on my bachelor's degree final project and I want to create an OCR for bottle inspection with python. I need some help with text recognition from the image. Do I need to apply the cv2 operations in a better way, train tesseract or should I try another method?
I tried image processing operations on the image and I used pytesseract to recognize the characters.
Using the code bellow I got from this photo:
to this one:
and then to this one:
Sharpen function:
def sharpen(img):
sharpen = iaa.Sharpen(alpha=1.0, lightness = 1.0)
sharpen_img = sharpen.augment_image(img)
return sharpen_img
Image processing code:
textZone = cv2.pyrUp(sharpen(originalImage[y:y + h - 1, x:x + w - 1])) #text zone cropped from the original image
sharp = cv2.cvtColor(textZone, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(sharp, 127, 255, cv2.THRESH_BINARY)
#the functions such as opening are inverted (I don't know why) that's why I did opening with MORPH_CLOSE parameter, dilatation with erode and so on
kernel_open = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
open = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel_open)
kernel_dilate = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(5,7))
dilate = cv2.erode(open,kernel_dilate)
kernel_close = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 5))
close = cv2.morphologyEx(dilate, cv2.MORPH_OPEN, kernel_close)
print(pytesseract.image_to_string(close))
This is the result of pytesseract.image_to_string:
22203;?!)
92:53 a
The expected result is :
22/03/20
02:53 A

"Do I need to apply the cv2 operations in a better way, train tesseract or should I try another method?"
First, kudos for taking this project on and getting this far with it. What you have from the OpenCV/cv2 standpoint looks pretty good.
Now, if you're thinking of Tesseract to carry you the rest of the way, at the very least you'll have to train it. Here you have a tough choice: Invest in training Tesseract, or work up a CNN to recognize a limited alphabet. If you have a way to segment the image, I'd be tempted to go with the latter.

From the result you got and the expected result, you can see that some of the characters are recognized correctly. Assuming you are using a different image from that shown in the tutorial, I recommend you to change the values of threshold and getStructuringElement.
These values work better depending on the image color. The tutorial author must have optimized it for his/her use (by trial and error or some other way).
Here is a video if you want to play around with those value using sliders in opencv. You can also print your result in the same loop to see if you are getting the desired result.

One potential thing you could do to improve recognition on the characters is to dilate the characters so pytesseract gives a better result. Dilating the characters will connect the individual blobs together and can fix the / or the A characters. So starting with your latest binary image:
Original
Dilate with a 3x3 kernel with iterations=1 (left) or iterations=2 (right). You can experiment with other values but don't do it too much or the characters will all connect. Maybe this will provide a better result with you OCR.
import cv2
image = cv2.imread("1.PNG")
thresh = cv2.threshold(image, 115, 255, cv2.THRESH_BINARY_INV)[1]
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
dilate = cv2.dilate(thresh, kernel, iterations=1)
final = cv2.threshold(dilate, 115, 255, cv2.THRESH_BINARY_INV)[1]
cv2.imshow('image', image)
cv2.imshow('dilate', dilate)
cv2.imshow('final', final)
cv2.waitKey(0)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract specific contents from text using python and Tesseract OCR - python

Related

How to OCR a text with white colour characters on a blue background from a cropped image?

Pytesseract doesn't recognize decimal points

PyTesseract not recognizing decimals

How can I get text from this image with Tesseract?

How to extract dotted text from image?

Categories

Resources