Removing (cropping) text from image - python

I am trying to implement an image recognition program, and I need to remove (or "crop") all text present in the image, so for example going from this:
to this:
I already tried the Keras OCR method, but firstly, I don't need the background blur (I simply need to delete the text), and secondly, it takes a lot of time and CPU power. Is there an easier way to detect those text regions and simply crop them out of the picture?

One way is to detect the text with findContours: contours with an area below a threshold are letters. Then paint over those areas, or first find their bounding rectangles and paint over one large rectangle.
Text Extraction from image after detecting text region with contours
There is also pytesseract, which can detect letters and their regions, but I suspect it will be heavier than the contour approach.
Here is an example project where I worked with pytesseract: How to obtain the best result from pytesseract?
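A minimal sketch of the contour approach, assuming dark text on a light background; the file names and the area cutoff (MAX_LETTER_AREA) are assumptions you would tune for your images:

import cv2
import numpy as np

img = cv2.imread("input.png")                       # assumed input path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Binarise so the text becomes white blobs on a black background.
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

MAX_LETTER_AREA = 500   # hypothetical cutoff: smaller contours are treated as letters

for c in contours:
    if cv2.contourArea(c) < MAX_LETTER_AREA:
        x, y, w, h = cv2.boundingRect(c)
        # Paint the letter's bounding rectangle with the background colour (white here).
        cv2.rectangle(img, (x, y), (x + w, y + h), (255, 255, 255), -1)

cv2.imwrite("no_text.png", img)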

Related

How can I crop a specific area of an image using Python?

This is an image containing some text, then a text box, then a signature, and after that a bottom line ending the picture.
The second image is what I want as output, using Python.
I have several pictures like this, and I want the same cropped output for all of them.
Here is what I tried: I first ran pytesseract to OCR the image and locate the text box, using that as the starting point, and used thresholding to determine the endpoint of the signature area. I then tried using OpenCV to crop that area and save it to a local directory, but the approach is not very promising.
Can someone help me to solve the problem?
There are several relatively simple approaches you could try.
You could use cv2.findContours() to locate the two big black titles "Additional Information" and "Signatures and Certifications" and scale off their position to deduce where the signature is. Your intermediate image might be like this:
You could use "Hit and Miss" matching to locate the 4 corners of the box "The IRS doesn't require your consent...".
You could do flood-fills in, say, red, starting from white pixels near the centre of the image, until you get a filled red box around the right size for the box "The IRS doesn't require your consent...". Then scale from that.
You could do a flood-fill in black, starting 2/3 of the way down the image and halfway across, then find the remaining white area, which would be the box "The IRS doesn't require...".
You could look for the large, empty white area below the signature (using summation across the rows) and scale off that. I am showing the summation across the rows to the right of the original image. You could invert first and look for black. Do the summation with:
rowTotals = np.sum(img, axis=1)
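A minimal sketch of that row-summation idea, assuming a grayscale image loaded with OpenCV; the "empty row" cutoff and the restriction to the lower half of the page are assumptions to adapt:

import cv2
import numpy as np

img = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)   # assumed input

# Sum each row; mostly-white rows have totals close to 255 * width.
rowTotals = np.sum(img, axis=1)
white_row = rowTotals > 0.99 * 255 * img.shape[1]    # hypothetical "empty row" cutoff

# Find the longest run of empty rows in the lower half of the image,
# which should correspond to the blank band below the signature.
runs, start = [], None
for y in range(img.shape[0] // 2, img.shape[0]):
    if white_row[y] and start is None:
        start = y
    elif not white_row[y] and start is not None:
        runs.append((start, y))
        start = None
if start is not None:
    runs.append((start, img.shape[0]))

if runs:
    top, bottom = max(runs, key=lambda r: r[1] - r[0])
    print("Largest empty band: rows", top, "to", bottom)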

Tesseract OCR: extract dark box from image / finding the y-coordinate to crop at

For pre-processing, I need to crop out any pixels that are not in the dark box.
The intent is to crop out just the text.
Finally, I do additional processing to turn it into a black and white image perfect for Tesseract.
Unfortunately, I've been finding it difficult to find the y-coordinate to crop at.
I've tried:
Adding in contours to find the box (either too many lines or cannot detect the end of the dark box) - contour lines
Using template matching to find the last margin in the image (fails when the background is different) - working example - failing example
Increasing saturation/brightness - doesn't help isolate
Processing the extraneous text instead of trying to crop the image - unfortunately, the extra image data often seems to ruin the accuracy of the text that I want
Assumptions:
The image width, font, line-spacing, and margins are always the same.
The number of lines of text will vary.
The last line of the text will vary.
The background will often be different.
Here are some other images that I would want the algorithm to work with: https://imgur.com/a/Lu4pPBL
Thank you in advance!

Extracting text regions from an image using contours - OpenCV, Python

I have been working on an OCR project for business cards using OpenCV in Python.
So far, I have been able to crop the card out of the image. I am trying to detect text regions in the cropped image using contours
(i.e., taking a Canny edge image, finding contours from those edges, and dilating them to get connected components, which should potentially be text regions).
While detecting closed connected components, some contours cover an extra part (such as symbols) apart from the text, as in this image.
Because of this, applying tesseract-ocr on those text regions gives unwanted (garbage) text along with the required text. This is the result of my OCR:
(P) (972) 656-6074
(F) (972) 656-6077
(M) (214) 505-8473
5910 N. Central Expressway, Suite1625»
Dallas, Texas 75206
ken.shulman#capviewpartners.com
WKw™/"
CAPVIEW
EPARTNERS
Ken Shulman, CRE
Partner
I tried modifying the dilation factor, but the symbol in the image always becomes part of the text region.
I want to make the pre-processing as optimized as possible so that tesseract-ocr doesn't make mistakes. How can I remove those extra parts (symbols) from the text regions, or is there an alternative approach?
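For reference, a minimal sketch of the Canny → contour → dilation pipeline described above; the file name, kernel size, and the width/height filter are assumptions to tune so that letters of a word merge without pulling in nearby symbols:

import cv2
import numpy as np

img = cv2.imread("card.png")                         # assumed path to the cropped card
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

edges = cv2.Canny(gray, 50, 150)

# Dilate the edges so the letters of a word join into one connected component.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3))
dilated = cv2.dilate(edges, kernel, iterations=1)

contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    # Hypothetical filter: keep wide, short regions that look like lines of text.
    if w > 40 and 10 < h < 60:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("text_regions.png", img)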

Removing text while processing the image

I am working on an application where I need a CamScanner-like feature, where a document is detected in an image. For that I am using Canny edge detection followed by a Hough transform.
The results look promising, but the text in the document is creating issues, as shown in the images below:
Original Image
After canny edge detection
After hough transform
My issue lies in the third image: the text near the bottom of the original image has forced the Hough transform to detect a horizontal line (the 2nd cluster from the bottom).
I know I could take the largest quadrilateral, and that would work fine in most cases, but I still want to know other ways to ignore the effect of the text on the edge detection.
Any help would be appreciated.
I solved the text issue with the help of a median filter of size 15 (square) on a 500x700 image.
The median filter doesn't affect the boundaries of the paper, but it can eliminate the text completely.
Using that I was able to get much more effective boundaries.
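A minimal sketch of that median-filter step; the input path is an assumption, and cv2.medianBlur requires an odd kernel size (15 here, matching the answer):

import cv2

img = cv2.imread("document.jpg", cv2.IMREAD_GRAYSCALE)   # assumed input

# A 15x15 median filter removes thin strokes (text) while the long,
# high-contrast paper boundary survives.
smoothed = cv2.medianBlur(img, 15)

edges = cv2.Canny(smoothed, 50, 150)
cv2.imwrite("edges_without_text.png", edges)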
Another approach you could try is to use thresholding to find the paper boundaries. This would create a binary image. You can then examine the blobs of white pixels and see if any are large enough to be the paper and have the right dimensions. If it fits the criteria, you can find the min/max points of this blob to represent the paper.
There are several ways to do the thresholding, including iterative, otsu, and adaptive.
Also, for best results you may have to dilate the binary image to close the black lines in the table as shown in your example.
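A minimal sketch of the thresholding idea using Otsu, plus a dilation to close the black table lines; the file name, area cutoff, and aspect-ratio check are assumptions to adapt to your paper size:

import cv2
import numpy as np

img = cv2.imread("document.jpg", cv2.IMREAD_GRAYSCALE)   # assumed input

# Otsu thresholding: the paper should come out as one large white blob.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Dilate the white regions to close the thin black table lines inside the paper.
binary = cv2.dilate(binary, np.ones((5, 5), np.uint8), iterations=2)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    area = w * h
    # Hypothetical criteria: large enough and roughly page-shaped (portrait).
    if area > 0.3 * img.shape[0] * img.shape[1] and 0.5 < w / h < 1.0:
        print("Paper candidate, min/max points:", (x, y), (x + w, y + h))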

OCR after Bounding Box detection

I am trying to find the frequency at which certain words appear in different books using Python. For this purpose, I have attempted to find the bounding box around each word.
the input:- https://www.dropbox.com/s/ib74y9wh2vrxlwi/textin.jpg
and the output that I get after performing binarisation and other morphological operations for detecting the bounding boxes:-
https://www.dropbox.com/s/9q4x61dyvstu5ub/textout.png
My question is,
I need to perform OCR using pytesser. My current implementation is quite dirty: I am currently saving each detected bounding box into a small PNG file, then running the pytesser code separately, looping through each of these small word images. This process hogs my system.
Is there another way to feed my images (the regions detected by the bounding boxes) directly into pytesser without saving them first?
After my code is run, I have a list of 544 (in this example) bounding boxes like
[minrow, mincol, maxrow, maxcol].
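If switching from pytesser to pytesseract is an option (pytesseract accepts in-memory PIL images, so nothing has to be written to disk), a minimal sketch might look like this; the variable names and the placeholder box list are assumptions:

import cv2
import pytesseract
from PIL import Image

page = cv2.imread("textin.jpg", cv2.IMREAD_GRAYSCALE)   # assumed page image
# boxes is assumed to be your list of [minrow, mincol, maxrow, maxcol] entries.
boxes = [[10, 20, 40, 120]]                              # placeholder example

words = []
for minrow, mincol, maxrow, maxcol in boxes:
    crop = page[minrow:maxrow, mincol:maxcol]            # slice the numpy array directly
    text = pytesseract.image_to_string(Image.fromarray(crop))
    words.append(text.strip())

print(words)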
