I have an image, and from the image I want to extract key and value pair details.
As an example, I want to extract the value of "MASTER-AIRWAYBILL NO:"
I have written to extract the entire text from the image using python opencv and OCR, but I don't have any clue how to extract only the value for "MASTER-AIRWAYBILL NO:" from the entire result text of the image.
Please find the code:
import cv2
import numpy as np
import pytesseract
from PIL import Image
print ("Hello")
src_path = "C:\\Users\Venkatraman.R\Desktop\\alpha_bill.jpg"
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
print (src_path)
# Read image with opencv
img = cv2.imread(src_path)
# Convert to gray
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Apply dilation and erosion to remove some noise
kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
# Write image after removed noise
cv2.imwrite(src_path + "removed_noise.png", img)
# Apply threshold to get image with only black and white
#img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
# Write the image after apply opencv to do some ...
cv2.imwrite(src_path + "thres.png", img)
# Recognize text with tesseract for python
result = pytesseract.image_to_string(Image.open(src_path + "thres.png"))
# Remove template file
#os.remove(temp)
print ('--- Start recognize text from image ---')
print (result)
So output should be like:
MASTER-AIRWAYBILL NO: 157-46637194
You can use pytesseract image_to_string() and a regex to extract the desired text, i.e.:
from PIL import Image
import pytesseract, re
f = "ocr.jpg"
t = pytesseract.image_to_string(Image.open(f))
m = re.findall(r"MASTER-AIRWAYBILL NO: [\d—-]+", t)
if m:
print(m[0])
Output:
MASTER-AIRWAYBILL NO: 157—46637194
I m using python 2.7 and i m also want to finde vendor name from the image
how should i find?
m = re.findall(r"MASTER-AIRWAYBILL NO: [\d—-]+", t)
for the above line its showing error
and if i use m=re.findall(r'Vendor Name:[\d--]+', t) then also its showing error
You can try this after installing tesseract.
from PIL import Image
import pytesseract, re
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
t = pytesseract.image_to_string(Image.open("path"))
m = re.findall(r"Invoice No. [\d—-]+", t)
if m:
print(m[0])
Related
I am trying to run OCR on set of images that are similar but can vary in size. For some reason I cannot get a predictable result. Is there anything I can do do get better results.
Tesseract with or without cv2 preprocessing works beautifully on some images and fails on some and there is no pattern. Images are more or less similar.
Upper image represents processed image
def filter_img(img):
# Read pil image as cv2
img = np.array(img)
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
# Converting image to grayscale (important for applying threshold)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
#Apply dilation and erosion to remove some noise
kernel = np.ones((1, 1), np.uint8)
# img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
# Apply blur to smooth out the edges
img = cv2.GaussianBlur(img, (5, 5), 0)
# img = cv.medianBlur(img,5)
# Apply threshold to get image with only b&w (binarization)
img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
img = Image.fromarray(img)
img = ImageOps.expand(img,border=2,fill='black')
visualize.show_labeled_image(img,boxes)
return img
# Applying Tesseract OCR
def run_tesseract(img):
# Tesseract cmd setup
# pytesseract.pytesseract.tesseract_cmd = "tesseract"
whitelist = string.ascii_uppercase + string.digits + ".-"
parameters = '-c load_freq_dawg=0 -c tessedit_char_whitelist="{}"'.format(whitelist)
psm = 8
custom_oem_psm_config = "--dpi 300 --oem 3 --psm {psm} {parameters}".format(parameters=parameters, psm=psm)
try:
text = pytesseract.image_to_string(img, config=custom_oem_psm_config, timeout=2)
return text.strip()
except RuntimeError:
print ("TIMEOUT")
return ""
If your image format is highly consistent, you might consider using split images. And after ocr the image, use conditional judgments on the first letter or number for error-prone areas, such as 0 and O are confusing. Of course, all of the above is only valid if the image is highly consistent.
enter code here
import cv2
import numpy as np
import pytesseract
import matplotlib.pyplot as plt
pytesseract.pytesseract.tesseract_cmd = 'D://Program Files/Tesseract-
OCR/tesseract.exe'
img = cv2.imread('vATKQ.png')
img2 = img[100:250, 180:650] #split to region you want
plt.imshow(img2)
text=pytesseract.image_to_string(img2)
print(text)
I'm using pytesseract to try extract text numbers from image.
I'm trying to extract the three numbers from this picture.
A straightforward method using pytesseract is:
from PIL import Image
from pytesseract import pytesseract
text = pytesseract.image_to_string(Image.open("uploaded_image.png"))
print(text)
But this prints blank.
Why can't it extract the numbers as it can for normal usual text ?
Your images need some preprocessing in order to be efficiently processed by pytesseract.
The following shows this process using cv2.adaptiveThreshold(), cv2.findContours(), cv2.drawContours() operations before converting image to black and white and invert it:
import numpy as np
import cv2
from PIL import Image
import pytesseract
img = cv2.imread('uploaded_image.png', cv2.IMREAD_COLOR)
img = cv2.blur(img, (5, 5))
#HSV (hue, saturation, value)
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
h, s, v = cv2.split(hsv)
#Applying threshold on pixels' Value (or Brightness)
thresh = cv2.adaptiveThreshold(v, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 11, 2)
#Finding contours
contours, hierarchy = cv2.findContours(thresh,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)
#Filling contours
contours = cv2.drawContours(img,np.array(contours),-1,(255,255,255),-1)
#To black and white
grayImage = cv2.cvtColor(contours, cv2.COLOR_BGR2GRAY)
#And inverting it
#Setting all `dark` pixels to white
grayImage[grayImage > 200] = 0
#Setting relatively clearer pixels to black
grayImage[grayImage < 100] = 255
#Write the temp file
cv2.imwrite('temp.png',grayImage)
#Read it with tesseract
text = pytesseract.image_to_string(Image.open('temp.png'),config='tessedit_char_whitelist=0123456789 -psm 6 ')
#Output
print("#### Raw text ####")
print(text)
print()
print("#### Extracted digits ####")
print([''.join([y for y in x if y.isdigit()]) for x in text.split('\n')])
Output
#### Raw text ####
93
31
92
#### Extracted digits ####
['93', '31', '92']
Processed image :
EDIT
Updated answer using cv2 library and getting all the digits from image
I have python code which does OCR for one tiff file and prints result in python window. I have more number of tiff files in a directory, it will take more hours to OCR all images one by one using my code.
Since I'm a beginner, I'm getting error while adding 'for' loop in code
import cv2
import numpy as np
import pytesseract
from PIL import Image
# Path of working folder on Disk
src_path = "D:/OpenCV and Tesseract/Image Split/Test 1/"
def get_string(img_path):
# Read image with opencv
img = cv2.imread(img_path)
# Convert to gray
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Apply dilation and erosion to remove some noise
kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
# Write image after removed noise
cv2.imwrite(src_path + "removed_noise.png", img)
# Apply threshold to get image with only black and white
#img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
# Write the image after apply opencv to do some ...
cv2.imwrite(src_path + "thres.png", img)
# Recognize text with tesseract for python
result = pytesseract.image_to_string(Image.open(src_path + "thres.png"))
# Remove template file
#os.remove(temp)
return result
print '--- Start recognize text from image ---'
print get_string(src_path + "21.tif")
print "------ Done -------"
Someone help me to modify the python code to do OCR for all images in a directory and store all text in a single .txt file line by line
I'm trying to read off some stats off the cropped (manually) sections of tables in pdf files.
Here is the image I'm trying to process
The current result I get has most of the numbers but not all of the text, as seen below:
Hmuwinu'fg. cm’: -009,d1-I (F -o.761.l= .om,
Tamar wuall ma: 2 1.41(F-o.167
Tao! hr aubgrwp dimes: Nol wvwe
I've tried using interpolations other than inter-cubic during the resizing step, and played around changing the kernel size but 1x1 seems to work the best.
Here is the current code:
# import the packages
from PIL import Image
import pytesseract
import numpy as np
import argparse
import cv2
import os
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,help="path to input image to OCR'd")
ap.add_argument("-p","--preprocess",type=str,default="thresh",help="type of preprocessing to be done")
args = vars(ap.parse_args())
#load the example image
image = cv2.imread(args["image"])
# Rescale image
image = cv2.resize(image,None,fx=1.5,fy=1.5,interpolation=cv2.INTER_CUBIC)
#Apply dilation and erosion to remove some noise
kernel = np.ones((1,1),np.uint8)
image = cv2.dilate(image,kernel,iterations=1)
image = cv2.erode(image,kernel,iterations=1)
#Convert it to grayscale
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
# check to see if we should apply thresholding to process image
if args["preprocess"] == "thresh":
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# make a check to see if median blurring should be applied
elif args["preprocess"] == "blur":
gray = cv2.medianBlur(gray,3)
#write the gray scale image to a disk as a temp file so we can OCR it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename,gray)
#load the image as a PIL/pillow image, apploy OCR, then delete temp file
text = pytesseract.image_to_string(Image.open(filename))
os.remove(filename)
print(text)
# show the output images
cv2.imshow("Image",image)
cv2.imshow("Output",gray)
cv2.waitKey(0)
Any suggestions or methods are really appreciated.
I applied adaptive-threshold + bitwise-not operations and result is:
Now, when I read:
txt = pytesseract.image_to_string(bnt, config="--psm 6")
print(txt)
Result:
Hewrogenedty: Chit «0.09, die 1 (P = 0,78); If 0.0%
Teal for overall ettect: Z = 1.41 (P = 0.16)
Test tor subgroup ditlrenote: Not appliaalle
Not prefect but at least numbers are correct (If I'm not mistaken)
Code:
import cv2
import pytesseract
img = cv2.imread("Q8iIo.png")
img = cv2.resize(img, None, fx=2.5, fy=2.5,
interpolation=cv2.INTER_CUBIC)
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
cv2.THRESH_BINARY_INV, 25, 28)
bnt = cv2.bitwise_not(thr)
txt = pytesseract.image_to_string(bnt, config="--psm 6")
print(txt)
I need to use Pytesseract to extract text from this picture:
and the code:
from PIL import Image, ImageEnhance, ImageFilter
import pytesseract
path = 'pic.gif'
img = Image.open(path)
img = img.convert('RGBA')
pix = img.load()
for y in range(img.size[1]):
for x in range(img.size[0]):
if pix[x, y][0] < 102 or pix[x, y][1] < 102 or pix[x, y][2] < 102:
pix[x, y] = (0, 0, 0, 255)
else:
pix[x, y] = (255, 255, 255, 255)
img.save('temp.jpg')
text = pytesseract.image_to_string(Image.open('temp.jpg'))
# os.remove('temp.jpg')
print(text)
and the "temp.jpg" is
Not bad, but the result of print is ,2 WW
Not the right text2HHH, so how can I remove those black dots?
Here's a simple approach using OpenCV and Pytesseract OCR. To perform OCR on an image, its important to preprocess the image. The idea is to obtain a processed image where the text to extract is in black with the background in white. To do this, we can convert to grayscale, apply a slight Gaussian blur, then Otsu's threshold to obtain a binary image. From here, we can apply morphological operations to remove noise. Finally we invert the image. We perform text extraction using the --psm 6 configuration option to assume a single uniform block of text. Take a look here for more options.
Here's a visualization of the image processing pipeline:
Input image
Convert to grayscale -> Gaussian blur -> Otsu's threshold
Notice how there are tiny specs of noise, to remove them we can perform morphological operations
Finally we invert the image
Result from Pytesseract OCR
2HHH
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening
# Perform text extraction
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imshow('opening', opening)
cv2.imshow('invert', invert)
cv2.waitKey()
Here is my solution:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
im = Image.open("temp.jpg") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
text = pytesseract.image_to_string(Image.open('temp2.jpg'))
print(text)
I have something different pytesseract approach for our community.
Here is my approach
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("temp.jpg"), lang='eng',
config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)
To extract the text directly from the web, you can try the following implementation (making use of the first image):
import io
import requests
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance
response = requests.get('https://i.stack.imgur.com/HWLay.gif')
img = Image.open(io.BytesIO(response.content))
img = img.convert('L')
img = img.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(img)
img = enhancer.enhance(2)
img = img.convert('1')
img.save('image.jpg')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
Here is my small advancement with removing noise and arbitrary line within certain colour frequency range.
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
im = Image.open(img) # img is the path of the image
im = im.convert("RGBA")
newimdata = []
datas = im.getdata()
for item in datas:
if item[0] < 112 or item[1] < 112 or item[2] < 112:
newimdata.append(item)
else:
newimdata.append((255, 255, 255))
im.putdata(newimdata)
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
text = pytesseract.image_to_string(Image.open('temp2.jpg'),config='-c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyz -psm 6', lang='eng')
print(text)
you only need grow up the size of picture by cv2.resize
image = cv2.resize(image,(0,0),fx=7,fy=7)
my picture 200x40 -> HZUBS
resized same picture 1400x300 -> A 1234 (so, this is right)
and then,
retval, image = cv2.threshold(image,200,255, cv2.THRESH_BINARY)
image = cv2.GaussianBlur(image,(11,11),0)
image = cv2.medianBlur(image,9)
and change parameters for enhance results
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
from PIL import Image, ImageEnhance, ImageFilter
import pytesseract
path = 'hhh.gif'
img = Image.open(path)
img = img.convert('RGBA')
pix = img.load()
for y in range(img.size[1]):
for x in range(img.size[0]):
if pix[x, y][0] < 102 or pix[x, y][1] < 102 or pix[x, y][2] < 102:
pix[x, y] = (0, 0, 0, 255)
else:
pix[x, y] = (255, 255, 255, 255)
text = pytesseract.image_to_string(Image.open('hhh.gif'))
print(text)