I have tried to read text from an image of a receipt using pytesseract, but the resulting text has a lot of weird characters and it really looks awful.
Here is the code I used to manipulate the image:
import sys
from PIL import Image
import cv2 as cv
import numpy as np
import pytesseract
def manipulate_image(img):
    img = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
    kernel = np.ones((1, 1), dtype="uint8")
    img = cv.erode(img, kernel, iterations=1)
    img = cv.threshold(img, 0, 255,
                       cv.THRESH_BINARY | cv.THRESH_OTSU)[1]
    img = cv.medianBlur(img, 3)
    return img
if len(sys.argv) > 2:
    print("Please provide only name of image.")
elif len(sys.argv) == 2:
    img = cv.imread(sys.argv[1])
    img = manipulate_image(img)
    cv.imwrite("test.png", img)
    text = pytesseract.image_to_string(img)
    print(text.encode('utf8'))
else:
    print("Please provide name of image.")
Here is my test receipt image:
https://imgur.com/a/RjeQ9dL
And here is the output image after manipulation:
https://imgur.com/a/1tFZRdq
And here is the text result:
""'9vco4v‘l7
0 .Vt3t00N 00t300N BUNUUS
SKLEP PUU POPUGOH|
UL. JHGIELLUNSKA 25, 70-364 SZCZ[C|N
TEL. 91 4841-20-58
N|P: 955—150-21-B2
dn.19r03.05 Uydr.8534
PARAGON FISKALNY
CIHSTKH 17 0,3 ¥ 16,30 = 4.89 B
Sp.0p.B 4,89 PTU B= 8,00% 0,35
Razem PTU 0,35
ZOP{HCUNU GUTUNKQ PLN
RESZTA PLN
0025/1373 H0103 0N|0 H.
15F H9HF[B9416} 13fl02D6k0[20D4334C
7?? BW 140
Any idea how to do this in a better way to get nicer results?
Applying simple thresholding will not be enough for pyTesseract to properly detect the characters. There is much more preprocessing that can be done to drastically improve your results, such as:
using Tesseract V4, where deep learning is implemented
segmenting characters
using only the part of the receipt where the text is through edge detection
perspective transform to straighten out the text
These are somewhat lengthy topics to write all in one answer, but you can check out some articles on pyImageSearch, where this is talked about in much more depth; a rough sketch of the last two steps follows the links below:
https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/
https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/
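For a concrete starting point, here is a minimal, hedged sketch of the edge-detection and perspective-transform steps, along the lines of the pyImageSearch document scanner linked above. The Canny thresholds and the 0.02 contour-approximation factor are assumptions you would tune for your receipts:
import cv2 as cv
import numpy as np

def straighten_receipt(img):
    # Edge detection to find the receipt outline
    gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
    edges = cv.Canny(cv.GaussianBlur(gray, (5, 5), 0), 75, 200)
    # Assume the largest 4-point contour is the receipt (index [-2] keeps OpenCV 3/4 compatibility)
    cnts = cv.findContours(edges, cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)[-2]
    for c in sorted(cnts, key=cv.contourArea, reverse=True):
        approx = cv.approxPolyDP(c, 0.02 * cv.arcLength(c, True), True)
        if len(approx) == 4:
            quad = approx.reshape(4, 2).astype("float32")
            break
    else:
        return img  # no quadrilateral found; give back the original
    # Order the corners: top-left, top-right, bottom-right, bottom-left
    s, d = quad.sum(axis=1), np.diff(quad, axis=1).ravel()
    src = np.array([quad[np.argmin(s)], quad[np.argmin(d)],
                    quad[np.argmax(s)], quad[np.argmax(d)]], dtype="float32")
    w = int(max(np.linalg.norm(src[0] - src[1]), np.linalg.norm(src[3] - src[2])))
    h = int(max(np.linalg.norm(src[0] - src[3]), np.linalg.norm(src[1] - src[2])))
    dst = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype="float32")
    # Warp so the text runs straight before handing the crop to tesseract
    M = cv.getPerspectiveTransform(src, dst)
    return cv.warpPerspective(img, M, (w, h))
Running your thresholding on the straightened crop, rather than on the raw photo, should already give tesseract much cleaner input.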
I have the following question: I would like to read these types of captcha in Python:
The best code I have come up with is this; however, it is not able to solve all of these captchas:
import pytesseract
import cv2
import numpy as np
import re
def odstran_sum(img, threshold):
    """Removes noise from the image."""
    filtered_img = np.zeros_like(img)
    labels, stats = cv2.connectedComponentsWithStats(img.astype(np.uint8), connectivity=8)[1:3]
    label_areas = stats[1:, cv2.CC_STAT_AREA]
    for i, label_area in enumerate(label_areas):
        if label_area > threshold:
            filtered_img[labels == i + 1] = 1
    return filtered_img

def preprocess(img_path):
    """Converts the image to binary."""
    img = cv2.imread(img_path, 0)
    blur = cv2.GaussianBlur(img, (3, 3), 0)
    thresh = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY_INV)[1]
    filtered_img = 255 - odstran_sum(thresh, 20) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    erosion = cv2.erode(filtered_img, kernel, iterations=1)
    return erosion

def captcha_to_string(obrazek):
    """Returns the text from the captcha."""
    text = pytesseract.image_to_string(obrazek)
    return re.sub(r'[^\x00-\x7F]+', ' ', text).strip()
img = preprocess(CAPTCHA_NAME)
text = captcha_to_string(img)
print(text)
Is it possible to improve my code so that it will be able to read all five of these examples? Thanks a lot.
I don't think there is much to be improved, besides writing your own neural network for image recognition trained on similar captchas. Captchas are designed so that a computer has a hard time decoding them, so I don't think you can get perfect results.
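If you do try the neural-network route, a first step is to cut the captcha into individual characters to build training data. Here is a minimal, hedged sketch building on the cv2.connectedComponentsWithStats call already in your code; it assumes the characters don't touch and that blobs above ~20 px of area are characters:
import cv2
import numpy as np

def segment_characters(binary_img, min_area=20):
    # Label connected blobs; each stats row is [x, y, w, h, area]
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary_img.astype(np.uint8), connectivity=8)
    boxes = [tuple(stats[i, :4]) for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] > min_area]
    boxes.sort(key=lambda b: b[0])  # left-to-right reading order
    return [binary_img[y:y + h, x:x + w] for (x, y, w, h) in boxes]
Each crop can then be labeled by hand and fed to a small classifier instead of tesseract.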
I'm trying to read the text in this image, which also contains decimal points and decimal numbers, in this way:
img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))
and what I get is:
73-82
Primo: 50 —
I've also tried specifying the Italian language, but the result is pretty similar:
73-82 _
Primo: 50
Searching through other questions on Stack Overflow, I found that the reading of decimal numbers can be improved by using a whitelist, in this case tessedit_char_whitelist='0123456789.', but I also want to read the words in the image. Any idea on how to improve the reading of decimal numbers?
I would suggest passing tesseract every row of text as a separate image.
For some reason it seems to solve the decimal point issue...
Convert image from grayscale to black and white using cv2.threshold.
Use cv2.dilate morphological operation with very long horizontal kernel (merge blocks across horizontal direction).
Use find contours - each merged row is going to be in a separate contour.
Find bounding boxes of the contours.
Sort the bounding boxes according to the y coordinate.
Iterate bounding boxes, and pass slices to pytesseract.
Here is the code:
import numpy as np
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # I am using Windows
path_to_image = 'image.png'
img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE) # Read input image as Grayscale
# Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Dilate thresh for uniting text areas into blocks of rows.
dilated_thresh = cv2.dilate(thresh, np.ones((3,100)))
# Find contours on dilated_thresh
cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2] # Use index [-2] to be compatible to OpenCV 3 and 4
# Build a list of bounding boxes
bounding_boxes = [cv2.boundingRect(c) for c in cnts]
# Sort bounding boxes from "top to bottom"
bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])
# Iterate bounding boxes
for b in bounding_boxes:
    x, y, w, h = b
    if (h > 10) and (w > 10):
        # Crop a slice, and invert black and white (tesseract prefers black text).
        slice = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]
        text = pytesseract.image_to_string(slice, config="-c tessedit"
                                           "_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
                                           " --psm 3")
        print(text)
I know it's not the most general solution, but it manages to solve the sample you have posted.
Please treat the answer as a conceptual solution - finding a robust solution might be very challenging.
Results:
Thresholded image after dilation:
First slice:
Second slice:
Third slice:
Output text:
7.3-8.2
Primo:50
You can easily recognize the text by down-sampling the image.
If you down-sample by 0.5, the result will be:
Now if you read it:
7.3 - 8.2
Primo: 50
I got the result using pytesseract version 0.3.7 (the current version).
Code:
# Load the libraries
import cv2
import pytesseract
# Load the image
img = cv2.imread("s9edQ.png")
# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Down-sample
gry = cv2.resize(gry, (0, 0), fx=0.5, fy=0.5)
# OCR
txt = pytesseract.image_to_string(gry)
print(txt)
Explanation:
The input image contains a little bit of an artifact; you can see it on the right part of the image. On the other hand, the current image is perfect for OCR recognition. You need to use preprocessing methods when the data in the image is not visible or is corrupted. Please read the following:
Image processing
Page-segmentation-mode
I am trying to detect this letter, but pytesseract doesn't seem to recognize it.
import cv2
import pytesseract as tess
img = cv2.imread("letter.jpg")
imggray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print(tess.image_to_string(imggray))
This is the image in question:
Preprocessing of the image (e.g. inverting it) should help, and you could also take advantage of pytesseract's image_to_string config options.
For instance, something along these lines:
import pytesseract
import cv2 as cv
import requests
import numpy as np
import io
# I read this directly from imgur
response = requests.get('https://i.stack.imgur.com/LGFAu.jpg')
nparr = np.frombuffer(response.content, np.uint8)
img = cv.imdecode(nparr, cv.IMREAD_GRAYSCALE)
# simple inversion as preprocessing
neg_img = cv.bitwise_not(img)
# invoke tesseract with options
text = pytesseract.image_to_string(neg_img, config='--psm 7')
print(text)
should parse the letter correctly.
Have a look at related questions for some additional info about preprocessing and tesseract options:
Why does pytesseract fail to recognise digits from image with darker background?
Why does pytesseract fail to recognize digits in this simple image?
Why does tesseract fail to read text off this simple image?
@Davide Fiocco's answer is definitely correct.
I just want to show another way of doing it, with adaptive thresholding.
When you apply adaptive thresholding, the result will be:
Now when you read it:
txt = pytesseract.image_to_string(thr, config="--psm 7")
print(txt)
Result:
B
Code:
import cv2
import pytesseract
img = cv2.imread("LGFAu.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.adaptiveThreshold(gry, 252, cv2.ADAPTIVE_THRESH_MEAN_C,
cv2.THRESH_BINARY_INV, 11, 2)
txt = pytesseract.image_to_string(thr, config="--psm 7")
print(txt)
I am trying to write a function that will take a jpg of a floorplan of a house and use OCR to extract the square footage that is written somewhere on the image.
import requests
from PIL import Image
import pytesseract
import pandas as pd
import numpy as np
import cv2
import io
def floorplan_ocr(url):
    """A row-wise function to use pytesseract to scrape the word data from the
    floorplan images; requires tesseract to be installed:
    https://github.com/tesseract-ocr/tesseract/wiki"""
    if pd.isna(url):
        return np.nan
    res = ''
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        img = response.raw
        img = np.asarray(bytearray(img.read()), dtype="uint8")
        img = cv2.imdecode(img, cv2.CV_8UC1)
        img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 11, 2)
        #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
        res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
        del response
        del img
    else:
        return np.nan
    #print(res)
    return res
However I am not getting much success. Only about 1 in 4 images actually outputs text that contains the square footage.
E.g. currently:
floorplan_ocr(https://i.imgur.com/9qwozIb.jpg) outputs 'K\'Fréfiéfimmimmuuéé\n2|; apprnxx 135 max\nGArhaPpmxd1m max\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nTOTAL APPaux noon AREA 523 so Fr, us. a 50. M )\nav .Wzms him "a! m m... mi unwary mmnmrmm mma y“ mum“;\n‘ wmduw: reams m wuhrmmm mm“ .m nanspmmmmy 3 mm :51\nmm" m mmm m; wan wmumw- mm my and mm mm as m by any\nwfmw PM” rmwm mm m .pwmwm m. mum mud ms nu mum.\n(.5 n: ma undammmw an we Ewen\nM vagw‘m Mewpkeem' (and takes a long time to do it)
floorplan_ocr(https://i.imgur.com/sjxMpVp.jpg) outputs ' '.
I think some of the issues I am facing are:
text may be greyscale
Images are low DPI (there appears to be some debate about whether this actually matters, or whether it is the total resolution that matters)
Text is not formatted consistently
I am stuck and am struggling to improve my results. All I want to extract is 'XXX sq ft' (and all the ways that might be written)
Is there a better way to do this?
Many thanks.
By applying these few lines to resize and change the contrast/brightness of your second image, after cropping the bottom quarter of the image:
img = cv2.imread("download.jpg")
img = cv2.resize(img, (0, 0), fx=2, fy=2)
img = cv2.convertScaleAbs(img, alpha=1.2, beta=-40)
text = pytesseract.image_to_string(img, config='-l eng --oem 1 --psm 3')
I managed to get this result:
TOTAL APPROX. FLOOR AREA 528 SQ.FT. (49.0 SQ.M.)
Whilst every attempt has been made to ensure the accuracy of the floor
plan contained here, measurements: of doors, windows, rooms and any
other items are approximate and no responsibility ts taken for any
error, omission, or mis-statement. This plan is for #ustrative
purposes only and should be used as such by any prospective purchaser.
The services, systems and appliances shown have not been tested and no
guarantee a8 to the operability or efficiency can be given Made with
Metropix ©2019
I did not threshold the image, as your images' structures vary from one another, and since the image is not only text, Otsu thresholding does not find the right value.
To answer everything: Tesseract actually works best with grayscale images (black text on a white background).
About the DPI/resolution question, there is indeed some debate, but there is also some empirical truth: the DPI value doesn't really matter (since text size can vary for the same DPI). For Tesseract OCR to work best, your characters need to be (edited:) 30-33 pixels high; being smaller by a few px can make Tesseract almost useless, and bigger characters actually reduce accuracy, though not significantly. (Edit: found the source -> https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ)
Finally, the text format doesn't really change (at least in your examples). So your main problem here is text size, and the fact that you parse a whole page. If the text line you want is consistently at the bottom of the image, just extract (slice) your original image so you only feed Tesseract the relevant data, which will also make it way faster.
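For example, here is a minimal sketch of that slicing idea; the filename, the bottom-quarter assumption, and the 2x scale factor are all placeholders to tune:
import cv2
import pytesseract

# Hypothetical filename; the floor-area line is assumed to sit in the bottom quarter
img = cv2.imread("floorplan.jpg", cv2.IMREAD_GRAYSCALE)
bottom = img[3 * img.shape[0] // 4:, :]
# Upscale so the characters land near the ~30 px height mentioned above
bottom = cv2.resize(bottom, (0, 0), fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
print(pytesseract.image_to_string(bottom, lang='eng'))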
EDIT :
If you were also searching for a way to extract the square footage from your ocr'ed text :
text = "some place holder text 5471 square feet some more text"
# store here all the possible way it can be written
sqft_list = ["sq ft", "square feet", "sqft"]
extracted_value = ""
for sqft in sqft_list:
if sqft in text:
start = text.index(sqft) - 1
end = start + len(sqft) + 1
while text[start - 1] != " ":
start -= 1
extracted_value = text[start:end]
break
print(extracted_value)
5471 square feet
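A regex variant of the same idea, if you prefer; the pattern below is an assumption and would need extending for other phrasings:
import re

text = "some place holder text 5471 square feet some more text"
# One pattern for the common spellings: "sq ft", "sq. ft.", "sqft", "square feet"
m = re.search(r'(\d[\d,.]*)\s*(?:sq\.?\s*ft\.?|square\s+feet)', text, re.IGNORECASE)
if m:
    print(m.group(0))  # -> 5471 square feet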
All of the pixelation around the text makes it harder for Tesseract to do its thing.
I used a simple brightness/contrast algorithm from here to make the dots go away. I didn't do any thresholding/binarization, but I did have to scale the image to get any character recognition.
import pytesseract
import numpy as np
import cv2
img = cv2.imread('floor_original.jpg', 0) # read as grayscale
img = cv2.resize(img, (0,0), fx=2, fy=2) # scale image 2X
alpha = 1.2
beta = -20
img = cv2.addWeighted( img, alpha, img, 0, beta)
cv2.imwrite('output.png', img)
res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
print(res)
Edit
There may be some platform/version dependence in the above code. It runs on my Linux machine, but not on my Windows machine. To get it to run on Windows, I modified the last two lines to:
res = pytesseract.image_to_string(img, lang='eng', config='remove-background')
print(res.encode())
Output from tesseract (bolding added by me to emphasize the sq footage):
TT xs?
IN
Approximate Gross Internal Area = 50.7 sq m / 546 sq ft
All dimensions are estimates only and may not be exact meas ent plans
are subject lo change The sketches. renderngs graph matenala, lava,
apectes
ne developer, the management company, the owners and other affiliates
re rng oo all of ma ther sole discrebon and without enor scbioe
jements Araxs are approximate
Image after processing:
I'm trying to do Arabic OCR on the following ID, but I get a very noisy picture and can't extract information from it.
Here is my attempt
import tesserocr
from PIL import Image
import pytesseract
import matplotlib as plt
import cv2
import imutils
import numpy as np
image = cv2.imread(r'c:\ahmed\ahmed.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.bilateralFilter(gray,11,18,18)
gray = cv2.GaussianBlur(gray,(5,5), 0)
kernel = np.ones((2,2), np.uint8)
gray = cv2.adaptiveThreshold(gray,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
cv2.THRESH_BINARY,11,2)
#img_dilation = cv2.erode(gray, kernel, iterations=1)
#cv2.imshow("dilation", img_dilation)
cv2.imshow("gray", gray)
text = pytesseract.image_to_string(gray, lang='ara')
print(text)
with open(r"c:\ahmed\file.txt", "w", encoding="utf-8") as myfile:
myfile.write(text)
cv2.waitKey(0)
result
sample
The text on your ID is black, which makes the extraction process easy. All you need to do is threshold the dark pixels and you should be able to get the text out.
Here is a snippet of the code:
import cv2
import numpy as np
# load image in grayscale
image = cv2.imread('AVXjv.jpg',0)
# remove noise
dst = cv2.blur(image,(3,3))
# extract dark regions which corresponds to text
val, dst = cv2.threshold(dst,80,255,cv2.THRESH_BINARY_INV)
# morphological close to connect separated blobs
dst = cv2.dilate(dst,None)
dst = cv2.erode(dst,None)
cv2.imshow("dst",dst)
cv2.waitKey(0)
And here is the result:
This is my output using ImageMagick TextCleaner script:
Script: textcleaner -g -e stretch -f 50 -o 30 -s 1 C:/Users/PC/Desktop/id.jpg C:/Users/PC/Desktop/out.png
Take a look here if you want to install and use the TextCleaner script on Windows... It's a tutorial I made as simple as possible, after some research I did when I was in your same situation.
Now it should be very easy to detect the text and (not sure how simple) recognize it.
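From there, a minimal hedged sketch of the recognition step, assuming the Arabic traineddata is installed for tesseract:
import cv2
import pytesseract

# Read the image produced by the textcleaner command above
cleaned = cv2.imread(r'C:/Users/PC/Desktop/out.png', cv2.IMREAD_GRAYSCALE)
# lang='ara' requires the Arabic language pack for tesseract
text = pytesseract.image_to_string(cleaned, lang='ara')
print(text)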