How to improve OCR results with Google Vision API and Python?

I am working with the Google Vision API in Python, using its text_detection feature (an OCR function that detects the text in an image and returns it as output). My original image is the following:
I have used the following different algorithms:
1) Apply text_detection to the original image
2) Enlarge the original image by 3 times and then apply text_detection
3) Apply Canny, findContours, drawContours on a mask (with OpenCV) and then text_detection to this
4) Enlarge the original image by 3 times, apply Canny, findContours, drawContours on a mask (with OpenCV) and then text_detection to this
5) Sharpen the original image and then apply text_detection
6) Enlarge the original image by 3 times, sharpen the image and then apply text_detection
The ones that fare best are (2) and (5). On the other hand, (3) and (4) are probably the worst of them.
The major problem is that text_detection fails to detect the minus sign in most cases, especially the one in '-1.00'.
Also, I do not know why, but sometimes it does not detect '-1.00' at all, which is quite surprising since it has no significant problem with the other numbers.
What do you suggest I do to accurately detect the minus sign and, more generally, the numbers?
(Keep in mind that I want to apply this algorithm to different boxes so the numbers may not be at the same position as in this image)

I dealt with the same problem. Your end goal is to correctly identify the text, and for the OCR step you are using a third-party service or tool (Google API, Tesseract, etc.).
All of the approaches you describe become largely redundant, because whatever transformations you apply with OpenCV will be repeated by Tesseract anyway. The best thing you can do is supply the input in an easy format.
What worked best for me was breaking the image into parts (boxes, i.e. squares and rectangles, using the sample code for identifying rectangles in all channels from the OpenCV repo examples: https://github.com/opencv/opencv/blob/master/samples/python/squares.py), cropping each box, and then sending the parts to OCR one by one, as sketched below.
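For illustration only (not from the original answer), here is a minimal sketch of that crop-and-OCR-by-parts idea, assuming you already have a list of (x, y, w, h) boxes (e.g. bounding boxes derived from the squares found by squares.py) and some run_ocr() callable wrapping whichever OCR service you use; both names are placeholders:
import cv2

def ocr_by_parts(image_path, boxes, run_ocr):
    """Crop each detected box and send it to OCR separately.

    boxes   -- list of (x, y, w, h) rectangles, e.g. derived from the
               squares returned by the squares.py sample
    run_ocr -- a callable you supply that takes a BGR image crop and
               returns the recognised text (Google Vision, Tesseract, ...)
    """
    img = cv2.imread(image_path)
    results = []
    for (x, y, w, h) in boxes:
        crop = img[y:y + h, x:x + w]
        # A little upscaling often helps small crops.
        crop = cv2.resize(crop, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
        results.append(((x, y, w, h), run_ocr(crop)))
    return results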

Since you are using the Google Vision API, which detects text in the image, it is not obvious that a text-detection API will pick up negative signs in the first place. Assuming you cannot re-train the API for your case, I would recommend writing a simple script that filters the contours on the basis of their shape and size. With this script you can easily segment out the negative signs and then merge them with the output from the Google Vision API:
import cv2
import numpy as np

img = cv2.imread("path/to/img.jpg", 0)
ret, thresh = cv2.threshold(img, 200, 255, cv2.THRESH_BINARY)
# OpenCV 3 returns (image, contours, hierarchy); OpenCV 4 returns (contours, hierarchy),
# so take the last two values to work with either version.
contours, hierarchy = cv2.findContours(thresh.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)[-2:]
# filter the contours: a minus sign is a small blob that is much wider than it is tall.
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)
    if 5 < cv2.contourArea(cnt) < 50 and float(w) / h > 3:
        print("I have detected a minus sign at:", x, y, w, h)
After this filtering step you can make a calculated guess about whether a given digit has a negative sign close to its left side.
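As a rough illustration of that merge step (not from the original answer), assuming you have the minus-sign boxes from the script above and a list of recognised numbers with their bounding boxes from the Vision API, both represented here as simple tuples for the sake of the sketch:
def attach_minus_signs(digit_boxes, minus_boxes, max_gap=15):
    """Prefix a '-' to any recognised number that has a minus-shaped
    contour sitting just to the left of it at roughly the same height.

    digit_boxes -- list of (text, x, y, w, h) from the OCR output
    minus_boxes -- list of (x, y, w, h) from the contour filter above
    max_gap     -- maximum horizontal gap in pixels between the sign and
                   the digit; tune this for your image scale
    """
    merged = []
    for text, dx, dy, dw, dh in digit_boxes:
        for mx, my, mw, mh in minus_boxes:
            horizontally_adjacent = 0 <= dx - (mx + mw) <= max_gap
            vertically_aligned = abs((my + mh / 2) - (dy + dh / 2)) < dh
            if horizontally_adjacent and vertically_aligned:
                text = "-" + text
                break
        merged.append((text, dx, dy, dw, dh))
    return merged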

If the Google Vision API uses Tesseract, which I think it does, then optimization is usually as follows (a rough sketch of such a pipeline is given after the list):
Sharpen
Binarize (or grayscale if you must)
Trim borders (Tesseract likes a smooth background)
Deskew (Tesseract tolerates only a very small skew angle; it likes nice straight text lines)
Reshape and resize (Put it in a page-like shape and resize if necessary)
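A minimal OpenCV sketch of such a preprocessing pipeline, assuming a reasonably clean scan; the kernel, margin, and scale values are illustrative guesses rather than anything from the original answer, and the deskew step is only indicated in a comment:
import cv2
import numpy as np

def preprocess_for_ocr(path, scale=2.0, margin=10):
    img = cv2.imread(path)

    # Sharpen with a simple sharpening kernel.
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    img = cv2.filter2D(img, -1, kernel)

    # Grayscale and binarize (Otsu picks the threshold automatically).
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # Trim borders: crop to the bounding box of the dark (text) pixels,
    # keeping a small smooth white margin around them.
    ys, xs = np.where(binary < 128)
    if len(xs):
        binary = binary[max(ys.min() - margin, 0):ys.max() + margin,
                        max(xs.min() - margin, 0):xs.max() + margin]

    # Resize so the characters are large enough for the OCR engine.
    # (Deskewing, e.g. via cv2.minAreaRect on the text pixels, would go
    # here as well but is omitted from this sketch.)
    return cv2.resize(binary, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)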
As for negative signs, well, use Tesseract directly, if you can.
You will be able to retrain it or to download better trainings.
Or you can correct the errors with an additional algorithm, i.e. implement your recheck as suggested in ZdaR's answer.

Related

Bokeh-like Blur with Mask as Intensity of Blur Radius

I know how to do a Gaussian blur with Pillow, but I can't work out how to vary the blur intensity according to a mask.
I am using the MiDaS package to produce depth maps from 2D images. What I want to do is blur the original image according to the depth mask, as a pseudo depth of field.
Here is a visual demonstration of the result I'm after, with CV2 or Pillow (I don't know which of them can do what I'm after).
Note: I'm sorry if this is considered junk; I've sat on this question for a month. I tried scouring the net for something like this, and all I found was Poor Man's Portrait Mode, which I could not get to work, and which would also be recomputing depth maps when I already have them from my script and use them for the 3D image creation.
Edit:
I did come up with this, using composite. Not sure why I didn't take note of it before. Though I have to say, the results aren't too great. I think I really do need to emulate some sort of shape blur like bokeh.
from PIL import Image, ImageFilter

sharpen = 3
boxBlur = 5
oimg = Image.open('2.png').convert('RGB')
width, height = oimg.size
mimg = Image.open('2_depth.png').resize((width, height)).convert('L')
bimg = oimg.filter(ImageFilter.BoxBlur(int(boxBlur)))
bimg = bimg.filter(ImageFilter.BLUR)
for i in range(sharpen):
    bimg = bimg.filter(ImageFilter.SHARPEN)
rimg = Image.composite(oimg, bimg, mimg)
Basically: get your image and mask, and ensure the mask matches the image (I had an issue where the images didn't line up even though they were the same size, until both were saved the same way).
Blur your image into a new variable, however you like (Gaussian, etc.; Gaussian was too soft for me), and add whatever extra filtering you want.
Composite the results together, using the depth map as the mask for the composite.
Note: If someone knows how to achieve a different sort of blur that mimics bokeh, I'd like to know, and I have adjusted the question title accordingly. I read about a disc blur but couldn't find anything for PIL/CV2.
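For what it's worth, a disc (bokeh-style) blur can be approximated in OpenCV by convolving with a flat circular kernel via cv2.filter2D. This is only a hedged sketch, not from the original post, and the radius is an arbitrary example value:
import cv2
import numpy as np

def disc_blur(image_bgr, radius=9):
    """Convolve with a flat circular (disc) kernel to mimic bokeh."""
    size = 2 * radius + 1
    kernel = np.zeros((size, size), dtype=np.float32)
    cv2.circle(kernel, (radius, radius), radius, 1.0, thickness=-1)
    kernel /= kernel.sum()  # normalise so brightness is preserved
    return cv2.filter2D(image_bgr, -1, kernel)

# Usage: blurred = disc_blur(cv2.imread('2.png'), radius=12)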
I’ve got only a brute-force solution with iteration over pixels: Variable blur intensity.
My code is working but not as efficiently as I want.
You can try. Open your image as input and put your depth map in the variable blur_map.
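For readers without access to that code, here is a hedged sketch of one common way to approximate variable blur intensity: pre-blur the image at a few strengths and blend between them per pixel according to a normalised blur_map. This is an approximation, not the linked per-pixel implementation, and the level count and kernel sizes are arbitrary:
import cv2
import numpy as np

def variable_blur(image_bgr, blur_map, levels=4, max_kernel=21):
    """Blend between progressively blurred copies of the image, weighted
    per pixel by blur_map (float array in [0, 1], same height/width as
    the image; 1.0 means maximum blur)."""
    out = np.zeros_like(image_bgr, dtype=np.float32)
    weight_total = np.zeros(blur_map.shape + (1,), dtype=np.float32)
    for i in range(levels):
        # odd kernel sizes from 1 (no blur) up to max_kernel
        k = 1 + 2 * int(round(i * (max_kernel - 1) / (2 * (levels - 1))))
        blurred = cv2.GaussianBlur(image_bgr, (k, k), 0).astype(np.float32)
        # triangular weight, peaking where blur_map matches this level
        level_pos = i / (levels - 1)
        weight = np.clip(1 - np.abs(blur_map - level_pos) * (levels - 1), 0, 1)
        weight = weight[..., None].astype(np.float32)
        out += blurred * weight
        weight_total += weight
    return (out / np.maximum(weight_total, 1e-6)).astype(np.uint8)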

Image Registration of Scanned Text Forms

We print 500 bubble surveys, get them back, and scan them in a giant batch giving us 500 PNG images.
Each image has slight variations in alignment, but identical size and resolution. We need to register the images so they're all perfectly aligned (with the next step being semi-automated scoring of the bubbles).
If these were 3D-MRI images, I could accomplish this with a single command line utility; But I'm not seeing any such tool for aligning scanned text documents.
I've played around with OpenCV as described in Image Alignment (Feature Based) using OpenCV, and it produces dynamite results when it works, but it often fails spectacularly. That approach is aimed at finding documents hidden within natural scenes, a much harder problem than our case, where the images are just rotated and translated in 2D, not in 3D.
I've also explored imreg_dft, which runs consistently but does a very poor job -- presumably the dft approach is better on photographs than text documents.
Does a solution for Image Registration of Scanned Forms already exist? If not, what's the correct approach? Opencv, imreg_dft, or something else?
Similar prior question: How to find blank field on scanned document image
What you can try is using the red outline of the answer boxes to create a mask from which you can select the outlines. I created a sample below. You could also remove the blue letters by creating a mask for the letters, inverting it, and then applying it as a mask. I didn't do that, because the image from the publisher is low-res and it caused issues. I expect your scans to perform better.
When you have the contours of the boxes you can transform/compare them individually (as the boxes have different sizes), or you can use the biggest contour to create a transform for the entire document.
You can then use minAreaRect to find the corner points of the contours. Threshold on contourArea to exclude noise / non-answer areas.
import cv2
import numpy as np
# load image
img = cv2.imread('Untitled.png')
# convert to hsv colorspace
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# define range of image bachground in HSV
lower_val = np.array([0,0,0])
upper_val = np.array([179,255,237])
# Threshold the HSV image
mask = cv2.inRange(hsv, lower_val, upper_val)
# find external contours in the mask
contours, hier = cv2.findContours(mask, cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
# draw contours
for cnt in contours:
    cv2.drawContours(img, [cnt], 0, (0, 255, 0), 3)
# display image
cv2.imshow('Result', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
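As a hedged sketch of the "use the biggest contour to create a transform" idea (not part of the original answer): take the minAreaRect of the largest mask contour in a reference scan and in each new scan, and derive a rotation plus translation that maps one onto the other. The function names are illustrative, and the angle handling may need adjusting because minAreaRect's angle convention differs between OpenCV versions:
def box_mask(img_bgr):
    """Same HSV threshold as above, returning the binary mask."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, np.array([0, 0, 0]), np.array([179, 255, 237]))

def biggest_rect(mask):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    biggest = max(contours, key=cv2.contourArea)
    return cv2.minAreaRect(biggest)  # ((cx, cy), (w, h), angle)

def register_to_reference(scan_bgr, ref_rect):
    """Rotate/translate a scan so its biggest box lines up with ref_rect."""
    (cx, cy), _, angle = biggest_rect(box_mask(scan_bgr))
    (rcx, rcy), _, ref_angle = ref_rect
    # flip the sign of the angle correction if the rotation goes the wrong way
    M = cv2.getRotationMatrix2D((cx, cy), angle - ref_angle, 1.0)
    M[0, 2] += rcx - cx  # add the translation to the reference centre
    M[1, 2] += rcy - cy
    h, w = scan_bgr.shape[:2]
    return cv2.warpAffine(scan_bgr, M, (w, h), flags=cv2.INTER_LINEAR,
                          borderValue=(255, 255, 255))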

Python OCR: ignore signatures in documents

I'm trying to do OCR of a scanned document which has handwritten signatures in it. See the image below.
My question is simple: is there a way to still extract the names of the people using OCR while ignoring the signatures? When I run Tesseract OCR it fails to retrieve the names. I tried grayscaling/blurring/thresholding using the code below, but without luck. Any suggestions?
import cv2

image = cv2.imread(file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
image = cv2.GaussianBlur(image, (5, 5), 0)
image = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
You can use scikit-image's Gaussian filter to blur the thin lines first (with an appropriate sigma), then binarize the image (e.g. with some thresholding function), and then apply morphological operations (such as remove_small_objects, or opening with an appropriate structuring element) to remove most of the signature. After that you can try classifying the characters with a sliding window (assuming the classifier has already been trained on blurred characters like those in the test image). The following shows an example.
import numpy as np
import matplotlib.pyplot as plt
from skimage.morphology import binary_opening, square
from skimage.filters import threshold_minimum, gaussian
from skimage.io import imread
from skimage.color import rgb2gray

im = gaussian(rgb2gray(imread('lettersig.jpg')), sigma=2)
thresh = threshold_minimum(im)
im = im > thresh
im = im.astype(bool)  # np.bool is deprecated in recent NumPy versions
plt.figure(figsize=(20,20))
im1 = binary_opening(im, square(3))
plt.imshow(im1)
plt.axis('off')
plt.show()
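Since remove_small_objects is mentioned above but not used in the example, here is a hedged variant of the same idea with it; the min_size value is just a starting point to tune:
from skimage.morphology import remove_small_objects

# remove_small_objects drops small connected components of True pixels,
# so invert first if the text is the dark (False) part of the threshold.
letters = remove_small_objects(~im, min_size=64)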
[EDIT]: Use Deep Learning Models
Another option is to pose the problem as an object detection problem, where the letters are the objects. We can use deep learning models such as CNN/RNN/Fast R-CNN (with TensorFlow/Keras) for object detection, or a YOLO model (refer to this article on car detection with a YOLO model).
I suppose the input pictures are grayscale, otherwise maybe the different color of the ink could have a distinctive power.
The problem here is that your training set - I guess - contains almost only 'normal' letters, without the disturbance of a signature, so naturally the classifier won't work on letters with the ink of a signature on them. One way to go would be to extend the training set with letters of this type. Of course, it is quite a job to extract and label these letters one by one.
You can use real letters with different signatures over them, but it might also be possible to generate similar letters artificially: you just need different letters with different snippets of signatures placed over them. This process could be automated.
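A hedged sketch of that augmentation idea (the file names are made up for illustration, and it assumes the signature image is larger than the letter image): overlay random crops of a signature onto clean letter images to build 'disturbed' training samples.
import random
from PIL import Image

def overlay_signature(letter_img, signature_img):
    """Paste a random crop of a signature over a clean letter image.
    Both images are expected in 'L' (grayscale) mode."""
    lw, lh = letter_img.size
    sw, sh = signature_img.size
    # random letter-sized crop of the signature
    x = random.randint(0, max(sw - lw, 0))
    y = random.randint(0, max(sh - lh, 0))
    snippet = signature_img.crop((x, y, x + lw, y + lh))
    # where the snippet is dark (ink), take the snippet; elsewhere keep the letter
    mask = snippet.point(lambda p: 255 if p < 128 else 0)
    return Image.composite(snippet, letter_img, mask)

# Usage (hypothetical file names):
# disturbed = overlay_signature(Image.open('letter_A.png').convert('L'),
#                               Image.open('signature.png').convert('L'))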
You may try to preprocess the image with morphological operations.
You can try opening to remove the thin lines of the signature. The problem is that it may remove the punctuation as well.
import cv2

image = cv2.imread(file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# note: grayscale opening removes thin bright structures; invert or threshold first if the ink is dark on white
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (5, 5))
image = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
You may have to alter the kernel size or shape. Just try different sets.
You can try other OCR providers for the same task, for example Google Cloud Vision: https://cloud.google.com/vision/. You can upload an image and check it for free.
You will get a response from the API from which you can extract the text you need. Documentation for extracting that text is also given on the same page.
Check out this answer; it will help you fetch that text. It is my own answer from when I faced the same problem: Convert Google Vision API response to JSON

How can I extract hand features from these images?

I have two different types of images (which I cannot post due to reputation, so I've linked them):
Image 1 Image 2
I was trying to extract hand features from the images using OpenCV and Python. Which kinda looks like this:
import cv2
image = cv2.imread('image.jpg')
blur = cv2.GaussianBlur(image, (5,5), 0)
gray = cv2.cvtColor(blur, cv2.COLOR_BGR2GRAY)
retval, thresh1 = cv2.threshold(gray, 70, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
cv2.imshow('image', thresh1)
cv2.waitKey(0)
The result of which looks like this:
Image 1 Image 2
The change in background in the second image is messing with the cv2.threshold() function, and it's not getting the skin parts right. Is there a way to do this correctly?
As a follow-up question, what is the best way to extract hand features? I tried a Haar cascade and didn't really get results. Should I train my own cascade? What other options do I have?
It's hard to say based on a sample size of two images, but I would try OpenCV's Integral Channel Features (ChnFtrs), which are like supercharged Haar features that can take cues from colour as well as any other image channels you care to create and provide.
In any case, you are going to have to train your own cascades. Separate cascades for front and profile shots of course.
Take out your thresholding by skin colour, because as you've already noticed, it may throw away some or all of the hands depending on the actual subject's skin colour and lighting. ChnFtrs will do the skin detection for you more robustly than a fixed threshold can. (Though for future reference, all humans are actually orange :))
You could eliminate some false positives by only detecting within a bounding box of where you expect the hands to be.
Try both RGB and YUV channels to see what works best. You could also throw in the results of edge detection (say, Canny, maximised across your 3 colour channels) for good measure. At the end, you could cull channels which are underused to save processing if necessary.
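For example, the "Canny maximised across colour channels" channel could be built roughly like this (a sketch, with arbitrary Canny thresholds):
import cv2
import numpy as np

def max_canny_channel(image_bgr, lo=50, hi=150):
    """Edge channel: per-channel Canny, then the per-pixel maximum."""
    b, g, r = cv2.split(image_bgr)
    edges = [cv2.Canny(c, lo, hi) for c in (b, g, r)]
    return np.maximum.reduce(edges)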
If you have much variation in hand pose, you may need to group similar poses and train a separate ChnFtrs cascade for each group. Individual cascades do not have a branching structure, so they do not cope well when the positive samples are disjoint in parameter space. This is, AFAIK, a bit of an unexplored area.
A correctly trained ChnFtrs cascade (or several) may give you a bounding box for the hands, which will help in extracting hand contours, but it can't exclude invalid contours within the same bounding box. Most other object detection routines will also have this problem.
Another option, which may be better/simpler than ChnFtrs, is LINEMOD (a current favourite of mine). It has the advantage that there's no complex training process, nor any training time needed.

Blur part of an Image and blend it with the Background

I need to blur faces to protect the privacy of people in street view images like Google does in Google Street View. The blur should not make the image aesthetically unpleasant. I read in the paper titled Large-scale Privacy Protection in Google Street View by Google (link) that Google does the following to blur the detected faces.
We chose to apply a combination of noise and aggressive Gaussian blur that we alpha-blend smoothly with the background starting at the edge of the box.
Can someone explain how to perform this task? I understand Gaussian blur, but how do I blend it with the background?
Code would be helpful but is not required.
My question is not how to blur a part of an image; it is how to blend the blurred portion with the background so that the blur is not unpleasant. Please refer to the quote I provided from the paper.
I have large images and a lot of them. An iterative process as in the possible duplicate will be time consuming.
EDIT
If someone ever wants to do something like this, I wrote a Python implementation. It isn't exactly what I was asking for but it does the job.
Link: pyBlur
I'm reasonably sure the general idea is:
Create a shape for the area you want to blur (say a rectangle).
Extend your shape by X pixels outwards.
Apply a gradient on alpha from 0.0 .. 1.0 (or similar) over the extended area.
Apply blur to the extended area (ignoring alpha)
Now use an alpha-blend to apply the modified image to the original image.
Adding noise similar to that of the original image would make it even less obvious that it has been blurred (because the blur will of course also blur away the original noise).
I don't know the exact parameters for how much to grow, what values to use for the alpha gradient, etc, but that's what I understand from the quoted text.
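A minimal OpenCV sketch of those steps, assuming a rectangular face box and a Gaussian fall-off for the alpha gradient; the kernel sizes, the grow margin, and the noise level are illustrative guesses, not values from the paper:
import cv2
import numpy as np

def blend_blur_box(img_bgr, box, grow=20, noise_sigma=8):
    """Blur a rectangular region and alpha-blend it smoothly into the image.

    box -- (x, y, w, h) of the area to obscure (e.g. a detected face)
    """
    x, y, w, h = box
    hgt, wid = img_bgr.shape[:2]

    # 1-2. Build a mask for the box extended by `grow` pixels on each side.
    mask = np.zeros((hgt, wid), dtype=np.float32)
    mask[max(y - grow, 0):min(y + h + grow, hgt),
         max(x - grow, 0):min(x + w + grow, wid)] = 1.0

    # 3. Turn the hard mask edge into a smooth 0..1 alpha gradient.
    alpha = cv2.GaussianBlur(mask, (2 * grow + 1, 2 * grow + 1), 0)[..., None]

    # 4. Aggressive blur plus noise over the whole image (only the masked
    #    part of it will actually be used).
    blurred = cv2.GaussianBlur(img_bgr, (31, 31), 0).astype(np.float32)
    blurred += np.random.normal(0, noise_sigma, blurred.shape)

    # 5. Alpha-blend the blurred version over the original.
    out = alpha * blurred + (1 - alpha) * img_bgr.astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)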
