Processing Images with OCR

I am trying to extract all the numerical data from the Mean column of multiple similar pictures like the one attached.
The way I do it is by using a cursor-position script to grab the column of information that I want. However, the first value I get is always a bunch of garbage characters, while the rest are processed correctly. Even when I rearrange the order of the images, the result is the same: the first value comes out as garbage. Is there a better way to do this? I suspect I have not preprocessed the images properly.
import numpy as np
import time
import datetime
import cv2
import pytesseract
from PIL import ImageGrab
import sys
import subprocess
pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files\Teseract-OCR\tesseract.exe'
Tstamp = datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
# raw string so the backslashes are not read as escapes; the name must match Tstamp above
report_fname=r'C:\Test_Automation\excel_file\ocr_'+Tstamp+'.csv'
fid_1=open((report_fname),"a")
filename_set = ['nonsip_read_10M.jpg', 'nonsip_read_200M.jpg', 'nonsip_read_8000M.jpg', 'nonsip_read_8000M_long.jpg', 'nonsip_write_10M.jpg', 'nonsip_write_200M.jpg', 'nonsip_write_8000M.jpg', 'nonsip_write_8000M_long.jpg','nonsip2_read_8000M.jpg','nonsip2_write_8000M.jpg']
while filename_set:
    filename=filename_set.pop(-1)
    print(filename)
    img=cv2.imread(filename,0)
    cv2.namedWindow("window",cv2.WND_PROP_FULLSCREEN)
    cv2.setWindowProperty("window",cv2.WND_PROP_FULLSCREEN,cv2.WINDOW_FULLSCREEN)
    cv2.imshow("window",img)
    cv2.waitKey(1)
    x_start=839
    x_end=927
    y_start=844
    y_end=1057
    x_interval=(x_end - x_start)/8
    y_interval=(y_end - y_start)/8
    x1=x_start
    y1=y_start
    x2=x_end
    y2=y_end
    for i in range(1,9,1):
        y2=int(y_start + i*y_interval)
        print(i,x1,y1,x2,y2)
        # bbox is (left, top, right, bottom), so the right edge must be x2, not x1
        img1=ImageGrab.grab(bbox=(x1,y1,x2,y2))
        print("debug1")
        img1.save('sc.png')
        img1=cv2.imread('sc.png',0)
        img1=np.invert(img1)
        data=pytesseract.image_to_string(img1, lang='eng',config='--psm 6')
        print(data)
        fid_1.write('%s.%s\n'%(filename,data))
        y1=y2

My solution is:
Crop the image
Apply thresholding to the image
Set page-segmentation-mode to the column read (4)
First of all, you want the bottom part of the image, so we can crop the image by ratio:
a. Get image height (h) and width (w) values
h, w = img.shape[:2] # get height and width
b. Set starting x, y coordinates.
x = int(w/3)
y = int((3*h)/4)
c. Crop the image
img = img[y:int(y+h/4), x:x+int(w/5)]
Result:
Apply thresholding to the cropped area:
thr = cv2.threshold(gry, 127, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
Result:
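A note on the thresholding call above: when cv2.THRESH_OTSU is set, OpenCV derives the threshold from the image histogram and the 127 argument is not used. The NumPy sketch below only illustrates the idea behind Otsu's method (pick the level that maximizes the between-class variance). The helper name otsu_threshold and the synthetic image are assumptions made for illustration, this is not OpenCV's implementation, and an 8-bit grayscale array is assumed.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the level that maximizes between-class variance (8-bit input assumed)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)                 # pixel count at or below each level
    w1 = hist.sum() - w0                 # pixel count above each level
    s0 = np.cumsum(hist * levels)        # intensity sum at or below each level
    valid = (w0 > 0) & (w1 > 0)          # both classes must be non-empty
    mu0 = s0[valid] / w0[valid]          # mean of the dark class
    mu1 = (s0[-1] - s0[valid]) / w1[valid]  # mean of the bright class
    between = np.zeros(256)
    between[valid] = w0[valid] * w1[valid] * (mu0 - mu1) ** 2
    return int(np.argmax(between))

# Synthetic example: dark "text" (50) on a bright background (200).
gray = np.full((4, 6), 200, dtype=np.uint8)
gray[1:3, 2:4] = 50
t = otsu_threshold(gray)
# Inverted binary image: dark pixels become 255, similar in spirit to THRESH_BINARY_INV.
binary_inv = np.where(gray > t, 0, 255).astype(np.uint8)
```

On a real crop you would pass the grayscale array (gry in the answer) instead of the synthetic one.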
Set page-segmentation-mode to the column read, since you want the Mean column values
txt = pytesseract.image_to_string(thr, config="--psm 4")
Clean up the txt variable (we only want the Mean values)
txt = txt.strip().split("\n")
for t in txt:
    t = t.split(" ")
    is_cnt_dgt = [i for i in t if i.replace(".", "").isdigit()]
    if len(is_cnt_dgt) != 0:
        print(t[len(t)-2])
Result:
887.958
919.142
846.984
72.1587
897.016
934.200
857.695
76.5089
If you skip the cleaning step, the raw output will be:
‘Current Mean
§88529 mV 887.958 mV
Q2L7B5 mV 919.142 mV
846308 mV 846.984 mV
TSATI mV 72.1587 mV
897.397 mV 897.016 mV
934378mV 934.200 mV
856477 mV 857.695 mV
TI901 mY 76.5089 mV
es
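As an alternative to the token-based cleaning, here is a minimal regex sketch that pulls the last "number mV" value out of each line. It assumes the Mean column is the last one on each line and that its unit is recognized as "mV"; extract_means and MEAN_RE are names made up for this example. The sample text is the raw output shown above.

```python
import re

# Matches the final "<number> mV" on a line; the Mean column is assumed to come last.
MEAN_RE = re.compile(r"(\d+(?:\.\d+)?)\s*mV\s*$")

def extract_means(ocr_text):
    """Return the trailing 'number mV' value of each line as a float."""
    means = []
    for line in ocr_text.splitlines():
        match = MEAN_RE.search(line.strip())
        if match:  # header and junk lines have no trailing "<number> mV"
            means.append(float(match.group(1)))
    return means

raw = """‘Current Mean
§88529 mV 887.958 mV
Q2L7B5 mV 919.142 mV
846308 mV 846.984 mV
TSATI mV 72.1587 mV
897.397 mV 897.016 mV
934378mV 934.200 mV
856477 mV 857.695 mV
TI901 mY 76.5089 mV
es"""
means = extract_means(raw)
```

Converting to float drops trailing zeros (934.200 becomes 934.2); keep match.group(1) as a string if you need the exact digits.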
Code:
import cv2
import pytesseract
img = cv2.imread("imYLS.jpg")
h, w = img.shape[:2] # get height and width
x = int(w/3)
y = int((3*h)/4)
img = img[y:int(y+h/4), x:x+int(w/5)]
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.threshold(gry, 127, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
txt = pytesseract.image_to_string(thr, config="--psm 4")
txt = txt.strip().split("\n")
for t in txt:
    t = t.split(" ")
    is_cnt_dgt = [i for i in t if i.replace(".", "").isdigit()]
    if len(is_cnt_dgt) != 0:
        print(t[len(t)-2])
cv2.imshow("thr", thr)
cv2.waitKey(0)

Related

How to detect if image contains ASCII characters?

I have a dataset of images and I want to filter out all images that contain text (ASCII chars). For example, I have the following cute image of a dog:
As you can see, in the bottom-right corner there is the text "MAY 18 2003", so it should be filtered out.
After some research, I came across Tesseract OCR. In Python I have the following code:
# Attempt 1
from PIL import Image
import pytesseract
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
print(text)
# Attempt 2
import unidecode
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
text = unidecode.unidecode(text)
print(text)
# Attempt 3
import string
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
text = pytesseract.image_to_string(img,lang='eng',
config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)
None of them detected the string (prints whitespaces). How can I detect it?

You should prepare the image for OCR.
For example, for this image I would do the following:
Convert it to a black-and-white image with a threshold that makes the text visible (for this image it is 130)
Then invert the image (so that the text is black)
Now try Tesseract OCR
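The first two steps can be sketched in plain NumPy as below. This is only an illustration on a tiny synthetic array; prepare_for_ocr is a made-up helper name, and 130 is the value suggested above for this particular image. With OpenCV the same two steps would typically be cv2.threshold with cv2.THRESH_BINARY followed by cv2.bitwise_not.

```python
import numpy as np

def prepare_for_ocr(gray, threshold=130):
    """Binarize a grayscale array at `threshold`, then invert it.

    Pixels brighter than the threshold (the bright date stamp, by assumption)
    end up black (0) on a white (255) background.
    """
    bw = np.where(gray > threshold, 255, 0).astype(np.uint8)  # black & white
    return 255 - bw                                           # inverted

# Tiny synthetic example: 200 stands in for bright text, 100 for background.
gray = np.array([[200, 100],
                 [130, 131]], dtype=np.uint8)
prepared = prepare_for_ocr(gray)
```

The prepared array can then be passed to pytesseract.image_to_string in place of the original image.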

You can use EasyOCR instead of pytesseract, which directly gives this output:
Kay
10 2003
Since your goal is just to detect whether ASCII text is present, you don't need the characters to be accurate; you only want to filter out the images that contain them.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import cv2
import easyocr
path = ""
img = cv2.imread(path+"input.jpg")
# Now apply the Easy-OCR
reader = easyocr.Reader(['en'])
output = reader.readtext(img)
for result in output:
    # each result is (bounding box, text, confidence); [-2] is the text
    print(result[-2])
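To turn that output into a yes/no filter, something like the sketch below could work. It assumes the default readtext result format, a list of (bounding box, text, confidence) tuples. The helper name contains_text, both cut-off values, and the confidences in the sample are made up for illustration and would need tuning on your data.

```python
def contains_text(results, min_confidence=0.3, min_chars=2):
    """Return True if any detection looks like real text.

    `results` is assumed to be a list of (bbox, text, confidence) tuples.
    """
    for _bbox, text, confidence in results:
        cleaned = "".join(ch for ch in text if ch.isalnum())
        if confidence >= min_confidence and len(cleaned) >= min_chars:
            return True
    return False

# Hand-written sample in the same shape as the output above
# (the confidence values are invented for the example).
sample = [
    ([[10, 10], [60, 10], [60, 30], [10, 30]], "Kay", 0.41),
    ([[70, 10], [150, 10], [150, 30], [70, 30]], "10 2003", 0.87),
]
has_text = contains_text(sample)
```

An image whose result list is empty, or contains only low-confidence punctuation, would be kept in the dataset.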

You can use inRange thresholding.
The result will be:
If you set the psm mode to 6, the output will be:
<<
‘\
' MAY 18 2003
All the digits are captured correctly, but we have some unwanted characters.
If we add an alphanumeric-only condition, then the result will be:
['M', 'A', 'Y', '1', '8', '2', '0', '0', '3']
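If you would rather keep whole tokens than single characters, a regular expression is one option. The sketch below runs both filters on the raw psm 6 output shown above; the variable names are just for this example.

```python
import re

# Raw OCR output from the psm 6 run above, including the unwanted characters.
raw = "<<\n‘\\\n' MAY 18 2003"

# Character-level filter, as in the answer: keeps letters and digits only.
chars = [c for c in raw if c.isalnum()]

# Token-level variant: keeps runs of ASCII letters and digits together.
tokens = re.findall(r"[A-Za-z0-9]+", raw)
```

For the filtering task, a non-empty tokens list is enough to flag the image as containing text.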
First, I upsampled the image and then applied Tesseract OCR, because the date is otherwise too small to read.
Code:
import cv2
import pytesseract
from numpy import array
img = cv2.imread("result.png") # Load the upsampled image
img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
msk = cv2.inRange(img, array([0, 103, 171]), array([179, 255, 255]))
krn = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dlt = cv2.dilate(msk, krn, iterations=1)
thr = 255 - cv2.bitwise_and(dlt, msk)
txt = pytesseract.image_to_string(thr, config='--psm 6')
print([t for t in txt if t.isalnum()])
cv2.imshow("", thr)
cv2.waitKey(0)
You can set the new values for the minimum and maximum ranges:
import numpy as np
min_range = np.array([0, 103, 171])
max_range = np.array([179, 255, 255])
msk = cv2.inRange(img, min_range, max_range)
You can also test with different psm parameters:
txt = pytesseract.image_to_string(thr, config='--psm 6')
For more read: Improving the quality of the output

How to improve accuracy of OpenCV matching results in Python

I'm trying to configure OpenCV with Python 3.6 to match a character icon (pattern) 1 against a box of characters 2. However, the match score is quite low, especially for shaded characters like 1.
I tried to improve it by using not only matchTemplate but also histogram comparison; the result is still poor.
I tried grayscale, color, matching just the center of the picture (cropped face), matching the whole picture, resizing the pattern to the exact dimensions it would have in the box, and all combinations of these, and the results are still VERY random (see the attached image of correlation results).
Thank you in advance for help!
Here's the code:
import numpy as np
import cv2 as cv
from PIL import Image
import os
box = Image.open("/Users/user/Desktop/dbz/my_box.jpeg")
box.thumbnail((592,1053))
#conditions for each match step
character_threshold = 0.6 #checks in box
hist_threshold = 0.3
for root, dirs, files in os.walk("/Users/user/Desktop/dbz/img/Super/TEQ/"):
    for file in files:
        if not file.startswith("."):
            print("now " + file)
            char = os.path.join(root, file)
            #Opens and generate character's icon
            character = Image.open(char)
            character.thumbnail((153,139))
            #Crops face from the character's icon and converts to grayscale CV object
            face = character.crop((22,22,94,94)) #size 72x72 with centered face (should be 22,22,94,94)
            face_array = np.array(face).astype(np.uint8)
            face_array_gray = cv.cvtColor(face_array, cv.COLOR_RGB2GRAY)
            #Converts the character's icon to grayscale CV object
            character_array = np.array(character).astype(np.uint8)
            character_array_gray = cv.cvtColor(character_array, cv.COLOR_RGB2GRAY)
            #Converts box screen to grayscale CV object
            box_array = np.array(box).astype(np.uint8)
            box_array_gray = cv.cvtColor(box_array, cv.COLOR_RGB2GRAY)
            #Check whether the face is in the box
            character_score = cv.matchTemplate(box_array[:,:,2],face_array[:,:,2],cv.TM_CCOEFF_NORMED)
            if character_score.max() > character_threshold:
                ij = np.unravel_index(np.argmax(character_score),character_score.shape)
                x, y = ij[::-1] #np returns lower-left coordinates, whilst PIL accepts upper, left,lower, right !!!
                w, h = face_array_gray.shape
                face.show()
                found = box.crop((x,y,x+w,y+h)) #expand border to 25 pixels in each size (Best is (x-20,y-5,x+w,y+h+20))
                #found.show()
                #found_character = np.array(found_character).astype(np.uint8)
                #found_character = cv.cvtColor(found_character, cv.COLOR_RGB2GRAY)
                found_array = np.array(found).astype(np.uint8)
                found_array_gray = cv.cvtColor(found_array, cv.COLOR_RGB2GRAY)
                found_hist = cv.calcHist([found_array],[0,1,2],None,[8,8,8],[0,256,0,256,0,256])
                found_hist = cv.normalize(found_hist,found_hist).flatten()
                found_hist_gray = cv.calcHist([found_array_gray],[0],None,[8],[0,256])
                found_hist_gray = cv.normalize(found_hist_gray,found_hist_gray).flatten()
                face_hist = cv.calcHist([face_array],[0,1,2],None,[8,8,8],[0,256,0,256,0,256])
                face_hist = cv.normalize(face_hist,face_hist).flatten()
                face_hist_gray = cv.calcHist([face_array_gray],[0],None,[8],[0,256])
                face_hist_gray = cv.normalize(face_hist_gray,face_hist_gray).flatten()
                character_hist = cv.calcHist([character_array],[0,1,2],None,[8,8,8],[0,256,0,256,0,256])
                character_hist = cv.normalize(character_hist,character_hist).flatten()
                character_hist_gray = cv.calcHist([character_array_gray],[0],None,[8],[0,256])
                character_hist_gray = cv.normalize(character_hist_gray,character_hist_gray).flatten()
                hist_compare_result_CORREL = cv.compareHist(found_hist_gray, character_hist_gray,cv.HISTCMP_CORREL)
                #hist_compare_result_CHISQR = cv.compareHist(found_hist_gray, character_hist_gray,cv.HISTCMP_CHISQR)
                #hist_compare_result_INTERSECT = cv.compareHist(found_hist_gray, character_hist_gray,cv.HISTCMP_INTERSECT)
                #hist_compare_result_BHATTACHARYYA = cv.compareHist(found_hist_gray, character_hist_gray,cv.HISTCMP_BHATTACHARYYA)
                if (hist_compare_result_CORREL+character_score.max()) > 1:
                    print(f"Found {file} with a score:\n match:{character_score.max()}\n hist_correl: {hist_compare_result_CORREL}\n SUM:{hist_compare_result_CORREL+character_score.max()}", file=open("/Users/user/Desktop/dbz/out.log","a+"))
(1)
(2)

Here is a simple example of masked template matching in Python/OpenCV.
Image:
Transparent Template:
Template with alpha removed:
Template alpha channel extracted as mask image:
import cv2
import numpy as np
# read image
img = cv2.imread('logo.png')
# read template with alpha
tmplt = cv2.imread('hat_alpha.png', cv2.IMREAD_UNCHANGED)
hh, ww = tmplt.shape[:2]
# extract template mask as grayscale from alpha channel and make 3 channels
tmplt_mask = tmplt[:,:,3]
tmplt_mask = cv2.merge([tmplt_mask,tmplt_mask,tmplt_mask])
# extract templt2 without alpha channel from tmplt
tmplt2 = tmplt[:,:,0:3]
# do template matching
corrimg = cv2.matchTemplate(img,tmplt2,cv2.TM_CCORR_NORMED, mask=tmplt_mask)
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(corrimg)
max_val_ncc = '{:.3f}'.format(max_val)
print("correlation match score: " + max_val_ncc)
xx = max_loc[0]
yy = max_loc[1]
print('xmatch =',xx,'ymatch =',yy)
# draw red bounding box to define match location
result = img.copy()
pt1 = (xx,yy)
pt2 = (xx+ww, yy+hh)
cv2.rectangle(result, pt1, pt2, (0,0,255), 1)
cv2.imshow('image', img)
cv2.imshow('template2', tmplt2)
cv2.imshow('template_mask', tmplt_mask)
cv2.imshow('result', result)
cv2.waitKey(0)
cv2.destroyAllWindows()
# save results
cv2.imwrite('logo_hat_match2.png', result)
Match location on input:
Match Information:
correlation match score: 1.000
xmatch = 417 ymatch = 44
Without the mask, the large green area in the template would mismatch in the input and lower the match score dramatically.
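To see why the mask matters, here is a small NumPy sketch of normalized cross-correlation at a single location, with and without a mask. It follows the masked TM_CCORR_NORMED formula as I understand it from the OpenCV documentation (template and patch are both multiplied by the mask before correlating). The function name ccorr_normed and the toy pixel values are assumptions for illustration, not OpenCV code.

```python
import numpy as np

def ccorr_normed(patch, template, mask=None):
    """Normalized cross-correlation of one image patch against a template."""
    p = patch.astype(np.float64)
    t = template.astype(np.float64)
    if mask is not None:
        m = mask.astype(np.float64)
        p, t = p * m, t * m              # ignore pixels where the mask is 0
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return float((p * t).sum() / denom) if denom else 0.0

# Toy 4x4 template: the object (200) sits on a filler background (90).
template = np.full((4, 4), 90, dtype=np.uint8)
template[1:3, 1:3] = 200
# Same object in the image, but on a different background (10).
patch = np.full((4, 4), 10, dtype=np.uint8)
patch[1:3, 1:3] = 200
# The mask keeps only the object pixels.
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1

score_plain = ccorr_normed(patch, template)
score_masked = ccorr_normed(patch, template, mask)
```

The unmasked score is pulled down by the mismatching background, while the masked score is a perfect match, which mirrors the behaviour described above for the green area.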

PyTesseract - Text broken by horizontal white lines

This is a classic PyTesseract problem of scanning noisy images. In this case, however, a dot-matrix printer leaves horizontal white lines through the text. Some samples are attached. I am not sure what kind of preprocessing would improve recognition of the text.
Using the command below, I get the following output for the sample below:
tesseract test.png stdout --psm 6 --dpi 120
Output (expected is "RVC 64.80%"):
PRVG
64.5056"
For the image above, pytesseract gives
152.00 KILOGRAW
817.51 USO
and the expected output is: 152.00 KILOGRAM 617.51 USD
I know the images are noisy, so please do not post the obvious answer that the output is bad because the images are noisy. Since I always get the same kind of text from the printer, I can apply the same preprocessing every time.

Code to handle the first picture:
from PIL import Image
import numpy as np
import pytesseract
import time
from collections import Counter
img = Image.open('OCR.png').convert('L')
pixelArray = img.load()
threshold = 240
table = []
for y in range(img.size[1]): # binarization
    List = []
    for x in range(img.size[0]):
        if pixelArray[x,y] < threshold:
            List.append(0)
        else:
            List.append(255) # white; 255 is the maximum for 8-bit pixels
    table.append(List)
# uint8 so that PIL builds a regular 8-bit grayscale image
img = Image.fromarray(np.array(table, dtype=np.uint8))
img.show()
def operation(image):
    resultList = []
    pixelList = image.load()
    flag = False
    for y in range(image.size[1]):
        temp = []
        linePixel = 0
        for x in range(image.size[0]):
            if not pixelList[x,y]:
                linePixel += 1
            temp.append(pixelList[x,y])
        if linePixel >= 35: # judge the black dot in one line
            flag = True
            resultList.append(temp)
        elif flag:
            # the row right after a text row is dropped (this removes the white line)
            # resultList.append([0]*image.size[0]) # to check the handling lines
            flag = False
        else:
            resultList.append([255] * image.size[0])
    return Image.fromarray(np.array(resultList, dtype=np.uint8))
for i in range(6):
    img = operation(img)
img.show()
print(pytesseract.image_to_string(img,config='--psm 6'))
The first step is binarization:
The second step removes the white lines (by counting the black pixels in each row):
And finally, the result is:
"RVC
64.80%"
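The same idea can be written with NumPy array operations instead of per-pixel loops. The sketch below is a variant, not the code above: it drops any short run of low-ink rows that sits between two text rows, in one pass. The name remove_thin_gaps and the max_gap value are assumptions; min_ink=35 is taken from the answer. Note that a normal line spacing shorter than max_gap would be removed as well, so the value has to be chosen per printer.

```python
import numpy as np

def remove_thin_gaps(binary, min_ink=35, max_gap=6):
    """Drop short runs of nearly blank rows that are enclosed by text rows.

    `binary` is a 2-D array with 0 for black ink and 255 for white.
    """
    ink = (binary == 0).sum(axis=1) >= min_ink   # True for rows that contain text
    keep = np.ones(len(ink), dtype=bool)
    row = 0
    while row < len(ink):
        if ink[row]:
            row += 1
            continue
        start = row
        while row < len(ink) and not ink[row]:   # walk to the end of the blank run
            row += 1
        enclosed = start > 0 and row < len(ink)  # text rows above and below
        if enclosed and (row - start) <= max_gap:
            keep[start:row] = False
    return binary[keep]

# Synthetic strip: two text bands split by a one-row white line.
strip = np.full((10, 40), 255, dtype=np.uint8)
strip[2:4] = 0
strip[5:7] = 0
cleaned = remove_thin_gaps(strip)
```

The result can be wrapped with Image.fromarray(cleaned) and passed to pytesseract as before.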

Python - Remove black outline & overlay PNG image on JPEG image

I have two images:
Fragments from painting
Whole painting
I need to solve two issues:
1st. On the first image, I need to remove the black outline from each fragment. I've tried threshold and erosion, but neither of them worked. How can I do that?
2nd. I can't overlay the first image on the second, and I really don't know why. It always results in the first image covering the second completely, with black pixels where the second image should be visible.
I'm using Python3 and OpenCV 3.2, on Ubuntu 18.04.
My program:
from PIL import Image
from matplotlib import pyplot as plt
import numpy as np
import cv2
import sys
plano_f = cv2.imread("Domenichino_Virgin-and-unicorn.jpg")
sobrepor = cv2.imread("Domenichino_Virgin-and-unicorn_img.png")
plano_f = cv2.cvtColor(plano_f, cv2.COLOR_BGR2GRAY, -1)
#sobrepor_BGRA = cv2.cvtColor(sobrepor, cv2.COLOR_BGR2BGRA)
sobrepor_BGRA = cv2.imread("nova_png.png", -1)
plt.imshow(sobrepor_BGRA),plt.show()
rows, cols, han = sobrepor_BGRA.shape
total = rows*cols
#printProgressBar(0, total, prefix="Executando...", suffix="completo", length=50)
'''for i in range(rows):
    for j in range(cols):
        if(sobrepor_BGRA[i, j][0] <= 5 and sobrepor_BGRA[i, j][1] <= 5 and sobrepor_BGRA[i, j][2] <= 5 and sobrepor_BGRA[i, j][3] != 0):
            sobrepor_BGRA[i, j] = (0, 0, 0, 0)
        #printProgressBar(i*j, total, prefix='Executando...', suffix='completo', length=50)
    sys.stdout.write("\rExecutando linha " + str(i) + " de " + str(rows) + "...")
    sys.stdout.flush()
cv2.imwrite("nova_png.png", sobrepor_BGRA)'''
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3,3))
#sobrepor_BGRA = cv2.cvtColor(sobrepor_BGRA, cv2.COLOR_BGRA2GRAY, -1)
sobrepor_BGRA = cv2.erode(sobrepor_BGRA, kernel, iterations=3)
#sobrepor_BGRA = cv2.cvtColor(sobrepor_BGRA, cv2.COLOR_GRAY2BGRA)
cv2.imwrite("nova_png2.png", sobrepor_BGRA)
#sobrepor_RGBA = cv2.cvtColor(sobrepor_BGRA, cv2.COLOR_BGRA2RGBA)
#plt.imshow(sobrepor_RGBA),plt.show()
sys.stdout.write("\nPronto!")
nova_img = cv2.addWeighted(sobrepor_BGRA, 1, plano_f, 0, 0)
cv2.imwrite("combined.png", nova_img)
plt.imshow(nova_img),plt.show()

You can use bitwise operations to do this. The idea is to obtain a mask of the missing sections of the fragments and then bitwise-or the two sections together. Here are two halves of the image: one is the fragments you already have and the other is the missing sections.
We combine both halves to get the whole painting:
import cv2
import numpy as np
fragment = cv2.imread('1.jpg')
whole = cv2.imread('2.jpg')
fragment[np.where((fragment <= [250,250,250]).all(axis=2))] = [0]
result1 = cv2.bitwise_and(whole, fragment)
result2 = cv2.bitwise_and(whole, 255 - fragment)
final = result1 + result2
cv2.imshow('result1', result1)
cv2.imshow('result2', result2)
cv2.imshow('final', final)
cv2.waitKey()

1st - Your image is a JPEG, which means the black lines around the pieces will be imperfect because of compression artifacts; a simple threshold or dilation isn't going to remove them perfectly. You can try saving in a lossless format and cleaning up by hand in Paint or a similar tool. You may want to do this after an erosion has already cleaned up most of the outline.
2nd - Why not just copy with a mask using the copyTo function? Here is an example:
import cv2
img1 = cv2.imread('x2djw.jpg')
img2 = cv2.imread('5RnNh.jpg')
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)
thr, img1_mask = cv2.threshold(img1, 250, 255, cv2.THRESH_BINARY_INV)
img1_mask = img1_mask[:, :, 0] & img1_mask[:, :, 1] & img1_mask[:, :, 2]
el = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
img1_mask = cv2.erode(img1_mask, el)
img2 = cv2.merge((img2, img2, img2))
img2 = cv2.copyTo(img1, img1_mask, img2)
cv2.imwrite('test_result.png', img2)
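If it helps to see what the masked copy does, here is a rough NumPy equivalent on a toy 2x2 image. It assumes a single-channel mask where non-zero means "take the pixel from the source"; copy_with_mask is a name invented for this example and is only meant to illustrate the behaviour of the copyTo call above.

```python
import numpy as np

def copy_with_mask(src, mask, dst):
    """Return a copy of dst with src pixels written wherever mask is non-zero."""
    out = dst.copy()
    selected = mask > 0          # 2-D boolean mask, applied to every channel
    out[selected] = src[selected]
    return out

# Toy 2x2 BGR images: src is all 9s, dst is all 1s.
src = np.full((2, 2, 3), 9, dtype=np.uint8)
dst = np.full((2, 2, 3), 1, dtype=np.uint8)
mask = np.array([[255, 0],
                 [0, 255]], dtype=np.uint8)
combined = copy_with_mask(src, mask, dst)
```

Pixels under the mask come from the fragments image, everything else stays as the background painting.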

extracting text information from a national id

I'm trying to run Arabic OCR on the following ID, but I get a very noisy picture and can't extract information from it.
Here is my attempt
import tesserocr
from PIL import Image
import pytesseract
import matplotlib as plt
import cv2
import imutils
import numpy as np
image = cv2.imread(r'c:\ahmed\ahmed.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.bilateralFilter(gray,11,18,18)
gray = cv2.GaussianBlur(gray,(5,5), 0)
kernel = np.ones((2,2), np.uint8)
gray = cv2.adaptiveThreshold(gray,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
cv2.THRESH_BINARY,11,2)
#img_dilation = cv2.erode(gray, kernel, iterations=1)
#cv2.imshow("dilation", img_dilation)
cv2.imshow("gray", gray)
text = pytesseract.image_to_string(gray, lang='ara')
print(text)
with open(r"c:\ahmed\file.txt", "w", encoding="utf-8") as myfile:
    myfile.write(text)
cv2.waitKey(0)

The text for your id is in black color which makes the extraction process easy. All you need to do is threshold the dark pixels and you should be able to get the text out.
Here is a snippet of the code:
import cv2
import numpy as np
# load image in grayscale
image = cv2.imread('AVXjv.jpg',0)
# remove noise
dst = cv2.blur(image,(3,3))
# extract dark regions which corresponds to text
val, dst = cv2.threshold(dst,80,255,cv2.THRESH_BINARY_INV)
# morphological close to connect separated blobs
dst = cv2.dilate(dst,None)
dst = cv2.erode(dst,None)
cv2.imshow("dst",dst)
cv2.waitKey(0)
And here is the result:

This is my output using the ImageMagick TextCleaner script:
Script: textcleaner -g -e stretch -f 50 -o 30 -s 1 C:/Users/PC/Desktop/id.jpg C:/Users/PC/Desktop/out.png
Take a look here if you want to install and use the TextCleaner script on Windows. It's a tutorial I made as simple as possible, after some research I did when I was in the same situation.
Now it should be very easy to detect the text and (I'm not sure how easily) recognize it.
