I am trying to extract data from a scanned form. The form has a standard format similar to the one shown in the image below:
I have tried using pytesseract (tesseract OCR) to detect the image's text and it has done a decent job at finding the text and converting the image to text.
However it essentially just gives me all the detected text without keeping the format of the data.
I would like to be able to do something like the below:
Find a particular piece of text and then find the associated data below or beside it. Similar to this question using opencv Detect text region in image using Opencv
Is there a way that I can essentially do the following:
Either find all text boxes on the form, perform OCR on each box and see which one is the closest match to the "witnesess:" text, then find the sections immediately below it and perform separate OCR on those.
Or if the form is standard and I know the approximate location of the "witness" text section can I specify its general location in opencv and then just extract the below text and perform OCR on it.
EDIT: I have tried the below code to try to detect specific regions of the text. However it is not specifically identifying the text just all regions.
import cv2
img = cv2.imread('t2.jpg')
mser = cv2.MSER_create()
img = cv2.resize(img, (img.shape[1]*2, img.shape[0]*2))
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
vis = img.copy()
regions = mser.detectRegions(gray)
hulls = [cv2.convexHull(p.reshape(-1, 1, 2)) for p in regions[0]]
cv2.polylines(vis, hulls, 1, (0,255,0))
cv2.imshow('img', vis)
Here is the result:
I think you have the answer already in your own post.
I did recently something similar and this is how I did it:
//id_image was loaded with cv2.imread
temp_image = id_image[start_y:end_y,start_x:end_x]
img = Image.fromarray(temp_image)
text = pytesseract.image_to_string(img, config="-psm 7")
So basically, if your format is predefined, you just need to know the location of the fields that you want the text of (which you already know), crop it, and then apply the ocr (tesseract) extraction.
In this case you need import pytesseract, PIL, cv2, numpy.
Related
I have attached a very simple text image that I want text from. It is white with a black background. To the naked eye it seems absolutely legible but apparently to tesseract it is a rubbish. I have tried changing the oem and psm parameters but nothing seems to work. Please note that this works for other images but for this one.
Please try running it on your machine and see if it works. Or else I might have to change my ocr engine altogether.
Note: It was working earlier until I tried to add black pixels around the image to help the extraction process. Also I don't think that tesseract was trained on black text on a white background. It should be able to do this too. Also if this was true why does it work for other text images that have the same format as this one
Edit: Miraculously I tried running the script again and this time it was able to extract Chand properly but failed in the below mentioned case. Also please look at the parameters I have used. I have read the documentation and I feel this would be the right choice. I have added the image for your reference. It is not about just this image. Why is tesseract failing for such simple use cases?
To find the desired result, you need to know the followings:
Page-segmentation-modes
Suggested Image processing methods
The input images are boldly written, we need to shrink the bold font and then assume the output as a single uniform block of text.
To shrink the images we could use erosion
Result will be:
Erode
Result
CHAND
BAKLIWAL
Code:
# Load the library
import cv2
import pytesseract
# Initialize the list
img_lst = ["lKpdZ.png", "ZbDao.png"]
# For each image name in the list
for name in img_lst:
# Load the image
img = cv2.imread(name)
# Convert to gry-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Erode the image
erd = cv2.erode(gry, None, iterations=2)
# OCR with assuming the image as a single uniform block of text
txt = pytesseract.image_to_string(erd, config="--psm 6")
print(txt)
I have used the following PyTorch implementation of EAST ( Efficient and Accurate Scene Text Detector) to identify and draw bounding boxes around text in a number of images and it works very well!
However, the next step of OCR which I am trying with pytesseract in order to extract the text form these images and converting them to strings - is failing horribly. Using all possible configurations of --oem and --psm, I am unable to get pytesseract to detect what appears to be very clear text, for example:
The recognized text is below the images. Even though I have applied contrast enhancement, and also tried dilating and eroding, I cannot get tesseract to recognize the text. This is just one example of many images where the text is even larger and clearer. Any suggestions on transformations, configs, or other libraries would be helpful!
UPDATE: After trying Gaussian blur + Otso thresholding, I am able to get black text on white background (apparently which is ideal for pytesseract), and also added Spanish language, but it still cannot read very plain text - for example:
reads as gibberish.
The processed text images are and
and the code I am using:
img_path = './images/fesa.jpg'
img = Image.open(img_path)
boxes = detect(img, model, device)
origbw = cv2.imread(img_path, 0)
for box in boxes:
box = box[:-1]
poly = [(box[0], box[1]),(box[2], box[3]),(box[4], box[5]),(box[6], box[7])]
x = []
y = []
for coord in poly:
x.append(coord[0])
y.append(coord[1])
startX = int(min(x))
startY = int(min(y))
endX = int(max(x))
endY = int(max(y))
#use pre-defined bounding boxes produced by EAST to crop the original image
cropped_image = origbw[startY:endY, startX:endX]
#contrast enhancement
clahe = cv2.createCLAHE(clipLimit=4.0, tileGridSize=(8,8))
res = clahe.apply(cropped_image)
text = pytesseract.image_to_string(res, config = "-psm 12")
plt.imshow(res)
plt.show()
print(text)
Use these updated data files.
This guide criticizes out-of-the box performance (and maybe the accuracy could be affected too):
Trained data. On the moment of writing, tesseract-ocr-eng APT package for Ubuntu 18.10 has terrible out of the box performance, likely because of corrupt training data.
According to the following test I did, using the updated data files seems to provide better results. This is the code I used:
import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('farmacias.jpg'), lang='spa', config='--tessdata-dir ./tessdata --psm 7'))
I downloaded spa.traineddata (your example images have Spanish words, right?) to ./tessdata/spa.traineddata. And the result was:
ARMACIAS
And for the second image:
PECIALIZADA:
I used --psm 7 because here it says that it means "Treat the image as a single text line" and I thought that should make sense for your test images.
In this Google Colab you can see the test I did.
I am trying to recognize the text in a captcha and it is not possible for me. I am using python3, openCv and tesseract.
The simplified code is:
import cv2
from pytesseract import *
img_path = "path"
img = cv2.imread(img_path)
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
pytesseract.image_to_string(img)
I think I should remove the color lines first, then leave the text alone, and maybe change the brightness and contrast. What filter could apply?
These are some images to recognize.
For recognising captcha text using pytesseract-ocr, you can do the following..
Prepare custom train_set to training your tesseract instance to recognise a specific font [Optional]
The captcha images need some pre-processing(such as * Apply Black & White Filter > Scale(up) > Blur > Morphological Transformation + Adaptive threshold*)to enhance the text part and reduce the noises/lines.
For removing lines: In the sample images only the text can be seen in black color and there is no black line, so you can simply convert the each non-black pixel to white by using PIL or OpenCV, you can even utilize some specific algo like Hough Line Transform to detect and remove lines.
You can learn about all these filters and algos from the official documentation and tutorial on OpenCV website.
I have a very specific scene text detection and parsing problem. I am not even sure if you can say it is an actual scene text.
I have extracted a name field from an identity card photo:
I could immediately start applying some OCR on that image, but i believe a further text localisation could be applied. To achieve this image: Do you know any of such text localisation algorithms? I have already tried 'FASText by Busta', 'EAST by argman' and they work decently. Any algorithms on this specific task?
After the localisation of the text i think now it is the best time to apply OCR. And now i feel lost. Which OCR could you recommend to use? I have already tried 'Tesseract' but it just doesn't work well. Is it a better idea to make your own OCR for document characters by using e.g. Tensorflow?
Try to increase the contrast of the image. You can use:
import matplotlib.pyplot as plt
import cv2
import numpy as np
def cvt_BGR2RGB(img):
return cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
def contrast(img,show=False):
# CLAHE (Contrast Limited Adaptive Histogram Equalization)
clahe=cv2.createCLAHE(clipLimit=3., tileGridSize=(8,8))
lab=cv2.cvtColor(img, cv2.COLOR_BGR2LAB) # convert from BGR to LAB color space
l,a,b=cv2.split(lab) # split on 3 different channels
l2=clahe.apply(l) # apply CLAHE to the L-channel
lab=cv2.merge((l2,a,b)) # merge channels
img2=cv2.cvtColor(lab, cv2.COLOR_LAB2BGR) # convert from LAB to BGR
if show:
#plot the original and contrasted image
f=plt.figure(figsize=(15,15))
ax1=f.add_subplot(121)
img1_cvt=cvt_BGR2RGB(img)
plt.imshow(img1_cvt)
ax2=f.add_subplot(122)
img2_cvt=cvt_BGR2RGB(img2)
plt.imshow(img2_cvt)
plt.show()
return img,img2
And maybe then you can use pyteserract
I would like to create a panoramic image by combining 2 images in which the same feature, a plus sign.
I've used cv2.xfeatures2d.SIFT_create() to find keypoints in the image however it doesn't find the plus symbol very well. Is there some way I can improve this by making it search specifically for a plus-shaped feature?
import cv2
image1 = cv2.imread('example_image.png')
sift = cv2.xfeatures2d.SIFT_create()
kp = sift.detect(grey_image1, None)
kp_image = cv2.drawKeypoints(grey_image1, kp, None)
def showimage(image, name="No name given"):
cv2.imshow(name, image)
cv2.waitKey(0)
cv2.destroyAllWindows()
return
showimage(kp_image)
The source image is here, second image to align is here. Here is the resulting image from the code above. This is an example of the desired output made using GIMP and manually aligning the two images (the second image will need to transformed to fit properly).`
NB I'm open to using other approaches outside of OpenCV/Python to solve this problem.