How can I get text from this image with Tesseract?


Currently I'm using the code below to get text from an image, and it works fine in general. However, it doesn't work well with these two images; it seems Tesseract cannot read this type of image. Please show me how to fix it.
https://i.ibb.co/zNkbhKG/Untitled1.jpg
https://i.ibb.co/XVbjc3s/Untitled3.jpg
def read_screen():
    spinner = Halo(text='Reading screen', spinner='bouncingBar')
    spinner.start()
    screenshot_file = "Screens/to_ocr.png"
    screen_grab(screenshot_file)
    # prepare argparse
    ap = argparse.ArgumentParser(description='HQ_Bot')
    ap.add_argument("-i", "--image", required=False, default=screenshot_file, help="path to input image to be OCR'd")
    ap.add_argument("-p", "--preprocess", type=str, default="thresh", help="type of preprocessing to be done")
    args = vars(ap.parse_args())
    # load the image
    image = cv2.imread(args["image"])
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    if args["preprocess"] == "thresh":
        gray = cv2.threshold(gray, 177, 177,
                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    elif args["preprocess"] == "blur":
        gray = cv2.medianBlur(gray, 3)
    # store grayscale image as a temp file to apply OCR
    filename = "Screens/{}.png".format(os.getpid())
    cv2.imwrite(filename, gray)
    # load the image as a PIL/Pillow image, apply OCR, and then delete the temporary file
    pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
    # ENG
    # text = pytesseract.image_to_string(Image.open(filename))
    # VIET
    text = pytesseract.image_to_string(Image.open(filename), lang='vie')
    os.remove(filename)
    os.remove(screenshot_file)
    # show the output images
    '''cv2.imshow("Image", image)
    cv2.imshow("Output", gray)
    os.remove(screenshot_file)
    if cv2.waitKey(0):
        cv2.destroyAllWindows()
    print(text)
    '''
    spinner.succeed()
    spinner.stop()
    return text

You should try different psm modes instead of the default, like so:
target = pytesseract.image_to_string(im, config='--psm 4', lang='vie')
Excerpt from the docs:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
So, for example, for Untitled3.jpg you could try --psm 4 and, failing that, you could try --psm 11 for both images.
Depending on your version of Tesseract, you could also try different oem modes:
Use --oem 1 for LSTM, --oem 0 for Legacy Tesseract. Please note that Legacy Tesseract models are only included in traineddata files from the tessdata repo.
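One way to try several segmentation modes in a row is sketched below. It is only an illustration: `build_config` and `try_psm_modes` are hypothetical helper names, not part of pytesseract, and the `ocr` parameter exists only so the loop can be exercised without Tesseract installed. When `ocr` is left out, the helper calls `pytesseract.image_to_string`.

```python
def build_config(psm, oem=None):
    """Build a Tesseract config string such as '--psm 4 --oem 1'."""
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    return " ".join(parts)


def try_psm_modes(image, modes=(4, 11), lang="vie", ocr=None):
    """Run OCR once per page segmentation mode and collect the results."""
    if ocr is None:
        # Imported lazily so the helper can be tested with a stand-in.
        import pytesseract
        ocr = pytesseract.image_to_string
    return {psm: ocr(image, lang=lang, config=build_config(psm)).strip()
            for psm in modes}
```

Calling `try_psm_modes(Image.open("Untitled3.jpg"))` and printing each entry of the returned dictionary should make it easy to compare the modes side by side and keep the one that reads best.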
EDIT
Also, as seen in your images, there are two languages. If you wish to use the lang parameter, you need to manually split the image in two so as not to confuse the Tesseract engine, and use a different lang value for each part.
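A rough sketch of that idea follows, assuming the image is a PIL/Pillow image and that you already know where each language sits. The helper name `ocr_regions`, the box coordinates, and the `ocr` parameter are hypothetical; `crop` is Pillow's `Image.crop((left, upper, right, lower))`.

```python
def ocr_regions(image, regions, ocr=None):
    """OCR each (box, lang) region of a Pillow image with its own language."""
    if ocr is None:
        # Imported lazily so the helper can be tested with a stand-in.
        import pytesseract
        ocr = pytesseract.image_to_string
    texts = []
    for box, lang in regions:
        crop = image.crop(box)  # box = (left, upper, right, lower)
        texts.append(ocr(crop, lang=lang).strip())
    return texts
```

For example, `ocr_regions(img, [((0, 0, 600, 200), 'vie'), ((0, 200, 600, 400), 'eng')])`, with the coordinates adjusted to your screenshot. Tesseract also accepts several languages joined with `+`, such as `lang='vie+eng'`, which may be worth trying before cropping.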
EDIT 2
Below is a full working example with Untitled3. What I noticed was your improper use of thresholding. You should set maxval to something bigger than the value you are thresholding at. In my example I set thresh to 177 but maxval to 255, so everything above 177 becomes white (255) and everything else becomes black (0). I didn't even have to do any further preprocessing.
import cv2
import pytesseract
from cv2 import imread, cvtColor, COLOR_BGR2GRAY, threshold, THRESH_BINARY
image = imread("./Untitled3.jpg")
image = cvtColor(image,COLOR_BGR2GRAY)
_,image = threshold(image,177,255,THRESH_BINARY)
cv2.namedWindow("TEST")
cv2.imshow("TEST",image)
cv2.waitKey()
text = pytesseract.image_to_string(image, lang='eng')
print(text)
Output:
New York, New York
Salzburg, Austria
Hollywood, California
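To make the role of maxval concrete, here is a tiny pure-Python illustration of the rule that cv2.THRESH_BINARY applies to each pixel: values above the threshold become maxval, all others become 0. `threshold_binary` is a hypothetical stand-in written for this note, not the OpenCV function, and the 2x3 "image" is made up.

```python
def threshold_binary(rows, thresh, maxval):
    """Mimic cv2.THRESH_BINARY on a list of pixel rows."""
    return [[maxval if px > thresh else 0 for px in row] for row in rows]


gray = [[10, 177, 178],
        [200, 90, 255]]

# With maxval=255, bright pixels become pure white.
print(threshold_binary(gray, 177, 255))
# With maxval=177, bright pixels only reach mid-grey.
print(threshold_binary(gray, 177, 177))
```

With maxval set to 177, the "white" regions come out as mid-grey, so the contrast between text and background is lower than it needs to be.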

Related

How to remove bad characters or special character in opencv python and improve OCR accuracy?

I have built a Python program that extracts text from an image with OCR. It works, but the output contains some bad characters and the accuracy is not good.
Can I add datasets for the characters that should be recognized?
How can I solve these problems?
This is my image:
And this is the code:
import cv2
import numpy as np
import pytesseract
# Read input image, convert to grayscale
img = cv2.imread('9.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Remove shadows, cf. https://stackoverflow.com/a/44752405/11089932
dilated_img = cv2.dilate(gray, np.ones((7, 7), np.uint8))
bg_img = cv2.medianBlur(dilated_img, 21)
diff_img = 255 - cv2.absdiff(gray, bg_img)
norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255,
norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)
# Threshold using Otsu's
work_img = cv2.threshold(norm_img, 0, 255, cv2.THRESH_OTSU)[1]
# Tesseract
custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(work_img, config=custom_config)
print(text)
And finally, this is the output:
fe
|Urine Analysis
| Urine analysis
| Color Yellow RBC/hpf 4-6
| Appereance Turbid WBC/hpf 2-3
; Specific Gravity 1014 Epithelial cells/Lpf 1-2
PH 7 Bacteria (Few)
| Protein Pos(+) Casts Pos(+)
Glucose Negative Mucous (Few)
Keton. Negative
Blood Pos(+)
Bilirubin Negative
' Urobilinogen Negative
| Nitrite Pos(+)

I had a similar problem. I was trying to extract some information from an image, but I was getting other raw text as well.
What you can do is write some code that extracts only the desired data.
Here is my input image, which is similar to yours:
Input image
This code extracts only the IDs (registration numbers) of students.
# `new` is assumed to hold the lines of text returned by OCR
Regs_No = list(new)
regs_no = []
count = 0
Status = []
# Extracting only registration numbers
for i in range(len(Regs_No)):
    if new[i][1:6] == "8MDSW":
        regs_no.append(new[i])
        Status.append('P')
So the above code extracts only the registration numbers.
In your case, you can use similar code to keep only the desired text.
Hope it works.
Thanks.
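As one possible form of such post-processing for the output in the question, stray border characters such as `|`, `;` or `'` at the start of a line could be stripped, and lines too short to be a real row dropped. This is only a sketch; `clean_ocr_lines` and the `min_length` cutoff are assumptions, not something from the answer above.

```python
import re

# Leading characters that are neither letters nor digits (table borders, specks).
LEADING_NOISE = re.compile(r"^[^A-Za-z0-9]+")


def clean_ocr_lines(text, min_length=3):
    """Strip leading noise from each OCR line and drop very short lines."""
    cleaned = []
    for line in text.splitlines():
        line = LEADING_NOISE.sub("", line).strip()
        if len(line) >= min_length:
            cleaned.append(line)
    return cleaned


raw = ("fe\n|Urine Analysis\n| Color Yellow RBC/hpf 4-6\n"
       "; Specific Gravity 1014\n' Urobilinogen Negative")
print(clean_ocr_lines(raw))
```

A cutoff like this can also discard genuinely short rows, so it is worth checking against a few real pages before relying on it.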

Pytesseract doesn't recognize decimal points

I'm trying to read the text in this image, which also contains decimal points and decimal numbers, in this way:
img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))
and what I get is:
73-82
Primo: 50 —
I've also tried specifying the Italian language, but the result is pretty similar:
73-82 _
Primo: 50
Searching through other questions on Stack Overflow, I found that the reading of decimal numbers can be improved by using a whitelist, in this case tessedit_char_whitelist='0123456789.', but I also want to read the words in the image. Any idea how to improve the reading of decimal numbers?

I would suggest passing Tesseract every row of text as a separate image.
For some reason it seems to solve the decimal point issue...
Convert image from grayscale to black and white using cv2.threshold.
Use cv2.dilate morphological operation with very long horizontal kernel (merge blocks across horizontal direction).
Use find contours - each merged row is going to be in a separate contour.
Find bounding boxes of the contours.
Sort the bounding boxes according to the y coordinate.
Iterate bounding boxes, and pass slices to pytesseract.
Here is the code:
import numpy as np
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # I am using Windows
path_to_image = 'image.png'
img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE) # Read input image as Grayscale
# Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Dilate thresh for uniting text areas into blocks of rows.
dilated_thresh = cv2.dilate(thresh, np.ones((3,100)))
# Find contours on dilated_thresh
cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2] # Use index [-2] to be compatible to OpenCV 3 and 4
# Build a list of bounding boxes
bounding_boxes = [cv2.boundingRect(c) for c in cnts]
# Sort bounding boxes from "top to bottom"
bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])
# Iterate bounding boxes
for b in bounding_boxes:
    x, y, w, h = b
    if (h > 10) and (w > 10):
        # Crop a slice, and invert black and white (tesseract prefers black text).
        slice = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]
        text = pytesseract.image_to_string(slice, config="-c tessedit"
                                           "_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
                                           " --psm 3"
                                           " ")
        print(text)
I know it's not the most general solution, but it manages to solve the sample you have posted.
Please treat the answer as a conceptual solution - finding a robust solution might be very challenging.
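The padding-and-clamping arithmetic used when cropping each row can be isolated in a small helper, sketched here. `padded_bounds` is a hypothetical name, the boxes and image size are made up, and the 10-pixel margin mirrors the value used in the code above.

```python
def padded_bounds(box, image_shape, pad=10):
    """Return (top, bottom, left, right) for a box grown by `pad`, clamped to the image."""
    x, y, w, h = box
    height, width = image_shape[:2]
    top = max(y - pad, 0)
    bottom = min(y + h + pad, height)
    left = max(x - pad, 0)
    right = min(x + w + pad, width)
    return top, bottom, left, right


# Sort boxes top to bottom, then compute the crop bounds for each one.
boxes = [(5, 60, 100, 20), (5, 2, 100, 20)]
for box in sorted(boxes, key=lambda b: b[1]):
    print(padded_bounds(box, (90, 110)))
```

With a NumPy image, the result would be used as `thresh[top:bottom, left:right]`.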
Results:
Thresholded image after dilation:
First slice:
Second slice:
Third slice:
Output text:
7.3-8.2
Primo:50

You can recognize the text easily by down-sampling the image.
If you down-sample by 0.5, the result will be:
Now if you read it:
7.3 - 8.2
Primo: 50
I got this result using pytesseract version 0.3.7 (current at the time of writing).
Code:
# Load the libraries
import cv2
import pytesseract
# Load the image
img = cv2.imread("s9edQ.png")
# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Down-sample
gry = cv2.resize(gry, (0, 0), fx=0.5, fy=0.5)
# OCR
txt = pytesseract.image_to_string(gry)
print(txt)
Explanation:
The input image contains a small artifact; you can see it on the right part of the image. The down-sampled image, on the other hand, is well suited to OCR. You need preprocessing when the data in the image is not clearly visible or is corrupted. Please read the following:
Image processing
Page-segmentation-mode

Python cannot read text from an image [Python OCR with Tesseract]

I have this issue with reading exactly two lines of numbers (each line contains max of 3 digits) from an image.
My Python code has a big problem reading data from images like the ones below:
Most of the time it just prints random numbers.
What should I do to make this work?
This is my Python code:
from PIL import ImageGrab, Image
from datetime import datetime
from pytesseract import pytesseract
import numpy as nm
pytesseract.tesseract_cmd = 'F:\\Tesseract\\tesseract'
while True:
    screenshot = ImageGrab.grab(bbox=(515, 940, 560, 990))
    datetime = datetime.now()
    filename = 'pic_{}.{}.png'.format(datetime.strftime('%H%M_%S'), datetime.microsecond / 500000)
    gray = screenshot.convert('L')
    bw = nm.asarray(gray).copy()
    bw[bw < 160] = 0
    bw[bw >= 160] = 255
    convertedScreenshot = Image.fromarray(bw)
    tesseract = pytesseract.image_to_string(convertedScreenshot, config='digits --psm 6')
    convertedScreenshot.save(filename)
    print(tesseract)
The image has to have white text on the black background or the black text on the white background.
It is also important to save the image afterwards.

Tesseract works best on images with black text on a white background. Invert the image before passing it to Tesseract by adding the line below before the call to Image.fromarray (bw is the NumPy array; a PIL image cannot be subtracted from 255 directly):
bw = 255 - bw
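Because the question mentions that the screenshots can have either polarity, the inversion could be made conditional. The sketch below guesses the polarity by counting dark pixels; it assumes the text covers less than half of the image, and both function names are hypothetical. It is written in plain Python so it works on nested lists as well as on a small 2-D NumPy array of 8-bit values.

```python
def is_dark_background(rows):
    """Guess polarity: True when most pixels are dark (light text on dark)."""
    pixels = [int(px) for row in rows for px in row]
    dark = sum(1 for px in pixels if px < 128)
    return dark > len(pixels) / 2


def ensure_black_on_white(rows):
    """Invert 8-bit grayscale rows only if the background looks dark."""
    if is_dark_background(rows):
        return [[255 - int(px) for px in row] for row in rows]
    return [[int(px) for px in row] for row in rows]
```

For a large image, the same check would be better expressed with vectorized NumPy operations; the loop here is only practical because the captured region is a few dozen pixels wide.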

Hey, I was facing a similar problem. I still am, but passing a few arguments to the image_to_string function helped.
I was using it for single-digit detection:
d = pytesseract.image_to_string(thr, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
This helped me detect the single digits.
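Since the expected result in the question is two lines of at most three digits each, the raw OCR string can also be validated before it is used, so that random output is rejected instead of printed. A minimal sketch, with `parse_number_lines` as a hypothetical helper:

```python
import re


def parse_number_lines(text, max_digits=3, expected_lines=2):
    """Return the numbers found one per line, or None if the shape is wrong."""
    # splitlines() also splits on the form feed that Tesseract appends.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    pattern = re.compile(rf"\d{{1,{max_digits}}}")
    if len(lines) != expected_lines:
        return None
    if not all(pattern.fullmatch(ln) for ln in lines):
        return None
    return [int(ln) for ln in lines]
```

A `None` result can then be treated as "retry with another screenshot or another configuration" rather than as a reading.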

Recognize numbers from an image python

I am trying to extract numbers from in game screenshots.
I'm trying to extract:
98
3430
5/10
from PIL import Image
import pytesseract
image="D:/img/New folder (2)/1.png"
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
text = pytesseract.image_to_string(Image.open(image),lang='eng',config='--psm 5')
print(text)
The output is gibberish:
‘t hl) keteeeees
ek pSlaerenen
JU) pgrenmnreserenny
Rates B
d dali eas. 5
cle aM (Sores
|, S| pgranmrerererecons
a cee 3
pea 3
oS :
(geo eenee
ey
=
es A

Okay, so I tried converting it to grayscale, inverting the contrast, and using different thresholds, but it all seems fairly inaccurate.
The issue seems to be the tilted and smaller numbers. Do you happen to have a higher-resolution image?
The most accurate result I could get was with the following code.
import cv2
import pytesseract
import imutils
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
img = cv2.imread('D:/img/New folder (2)/1.png')  # your original image
img = imutils.resize(img, width=1400)
crop = img[340:530, 100:400]
data = pytesseract.image_to_string(crop,config=' --psm 1 --oem 3 -c tessedit_char_whitelist=0123456789/')
print(data)
cv2.imshow('crop', crop)
cv2.waitKey()
Otherwise, I recommend one of the methods described in the similar question or in this one.

If the text is surrounded by decorative graphics, Tesseract struggles a lot.
Instead of relying on Tesseract alone, try using findContours in OpenCV (after a little blurring and dilating).
You will get bounding boxes, some of which may cover the text as well.

Extract specific contents from text using python and Tesseract OCR

I am using Tesseract OCR to extract text from an image file.
Below is the sample text I got from my Image:
Certificate No. Certificate Issued Date Acoount Reference Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA SUBIN-KAKAKSFCL0858710154264833O
How can I extract Certificate No. from this? Any hint or solution will help me here.

If the certificate number always has the structure shown here (2 letters, a hyphen, then 17 more characters), you can use a regex:
import re
# I took the entire sequence originally, but this is just an example
sequence = 'Reference IN-KA047969602415880 18-Feb-2016 01:39'
re.search('[A-Z]{2}-.{17}', sequence).group()
# 'IN-KA047969602415880'
.search looks for the pattern you specify, and .group() returns the matched text (the first match, which in this case is the only one). You can search for anything like this in a given string; I suggest a review of regex here.
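Applied to the full OCR text from the question, a somewhat stricter pattern might look like the sketch below. The pattern (two capital letters, a hyphen, two capital letters, fifteen digits) is an assumption inferred from the single sample number, and `extract_certificate_no` is a hypothetical helper name.

```python
import re

# Two capital letters, a hyphen, two capital letters, then 15 digits.
CERT_PATTERN = re.compile(r"\b[A-Z]{2}-[A-Z]{2}\d{15}\b")


def extract_certificate_no(text):
    """Return the first certificate-like token in the text, or None."""
    match = CERT_PATTERN.search(text)
    return match.group() if match else None


ocr_text = ("Certificate No. Certificate Issued Date Acoount Reference "
            "Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM "
            "NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA "
            "SUBIN-KAKAKSFCL0858710154264833O")
print(extract_certificate_no(ocr_text))  # IN-KA047969602415880
```

Returning `None` instead of calling `.group()` directly avoids an AttributeError when OCR misreads the number and nothing matches.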

Before throwing the image into Tesseract OCR, it's important to preprocess the image to remove noise and smooth the text. Here's a simple approach using OpenCV:
Convert image to grayscale
Otsu's threshold to obtain binary image
Gaussian blur and invert image
After converting to grayscale, we apply Otsu's threshold to get a binary image
From here we give it a slight blur and invert the image to get our result
Results from Pytesseract
Certificate No. : IN-KA047969602415880
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = cv2.imread('1.png',0)
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]
blur = cv2.GaussianBlur(thresh, (3,3), 0)
result = 255 - blur
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.waitKey()
