I have been working on an OCR project for business cards using OpenCV in Python.
So far, I have been able to crop the card out of the image. I am trying to detect text regions in the cropped image using contours
(i.e., taking a Canny edge image, finding contours from those edges, and dilating them to get connected components, which should be potential text regions).
While detecting closed connected components, some contours cover an extra part (such as symbols) apart from the text, as in this image.
Because of this, applying tesseract-ocr to those text regions gives unwanted text (garbage) along with the required text. This is the result of my OCR:
**(P) (972) 656-6074
(F) (972) 656-6077
(M) (214) 505-8473
5910 N. Central Expressway, Suite1625»
Dallas, Texas 75206
ken.shulman#capviewpartners.com
WKw™/"
CAPVIEW
EPARTNERS
Ken Shulman, CRE
Partner**
I tried modifying the dilation factor, but the part of the symbol in the image always becomes part of the text region.
I want to make the pre-processing as robust as possible so that tesseract-ocr doesn't make any mistakes. How can I remove those extra parts (symbols) from the text regions, or is there an alternative approach?
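For reference, my pipeline looks roughly like the sketch below (the filename, Canny thresholds and kernel size are placeholders, not my exact values):

```python
import cv2
import numpy as np

# Rough sketch of the current pipeline; thresholds and kernel size are placeholders.
card = cv2.imread("card.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(card, 50, 150)

# Dilate the edges so the letters of one line merge into a single connected component.
# The logo/symbol edges get merged into the nearby text regions in this same step,
# which is exactly the problem described above.
dilated = cv2.dilate(edges, np.ones((3, 15), np.uint8), iterations=1)

contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
text_regions = [cv2.boundingRect(c) for c in contours]   # (x, y, w, h) per candidate region
```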
This is an image containing some text, then a text box, then a signature, and after that a bottom line ending the picture.
The second image is what I want as output, using Python.
I have several pictures like this, and I want the same cropped output for all of them.
Here is what I tried: I used pytesseract to OCR the image first, taking the located text box as the starting point, and thresholding to determine the endpoint of the signature area. I then tried using OpenCV to crop that area and save it to a local directory, but the approach is not very promising.
Can someone help me solve this problem?
There are several relatively simple approaches you could try.
You could use cv2.findContours() to locate the two big black titles "Additional Information" and "Signatures and Certifications" and scale off their position to deduce where the signature is. Your intermediate image might be like this:
You could use "Hit and Miss" matching to locate the 4 corners of the box "The IRS doesn't require your consent...".
You could do flood-fills in say, red, starting from white pixels near the centre of the image till you get a red box filled around the right size for the box "The IRS doesn't require your consent...". Then scale from that.
You could do a flood-fill in black, starting 2/3 of the way down the image and half way across, then find the remaining white area which would be the box "The IRS doesn't require...".
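A rough sketch of the first flood-fill variant, assuming a seed near the centre of the page and a hypothetical filename:

```python
import cv2
import numpy as np

# Sketch of the flood-fill idea; the filename, seed point and tolerances are assumptions.
img = cv2.imread("form.png")
h, w = img.shape[:2]

mask = np.zeros((h + 2, w + 2), np.uint8)      # floodFill needs a mask 2 px larger than the image
seed = (w // 2, h // 2)                        # a white pixel near the centre of the page
cv2.floodFill(img, mask, seed, (0, 0, 255),
              loDiff=(10, 10, 10), upDiff=(10, 10, 10))

# The filled pixels are marked in the mask; their bounding box approximates the consent box.
ys, xs = np.nonzero(mask[1:-1, 1:-1])
x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
```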
You could look for the large, empty white area below the signature (using summation across the rows) and scale off that. I am showing the summation across the rows to the right of the original image. You could invert first and look for black. Do the summation with:
rowTotals = np.sum(img, axis=1)
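Put together, the row-summation idea might look something like this (the filename and the 1% blank-row threshold are assumptions):

```python
import cv2
import numpy as np

# Sketch of the row-summation approach; the filename and the 1% threshold are assumptions.
img = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)
inv = 255 - img                                   # invert so ink sums high and blank paper sums low

rowTotals = np.sum(inv, axis=1)
blankRows = np.where(rowTotals < 0.01 * rowTotals.max())[0]

# Blank rows in the lower part of the page correspond to the empty area below the
# signature; the first of them gives a y-coordinate to scale the crop from.
```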
I am trying to implement an image recognition program, and I need to remove (or "crop") all text present in the image, so for example from this:
to this:
I already tried the Keras OCR method, but firstly I don't need the background blur, I simply need to delete the text, and secondly it takes a lot of time and CPU power. Is there an easier way to detect those text regions and simply crop them out of the picture?
One way is to detect the text with findContours: the contours with an area below a threshold are letters. You can then paint over those areas, and/or first find their bounding rectangles and paint over one big rectangle.
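A rough sketch of that idea, assuming a plain background that can simply be painted white (the filename and the 500-pixel area threshold are placeholders to tune):

```python
import cv2

# Sketch of the contour approach; the filename and the area threshold are assumptions.
img = cv2.imread("photo.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    if cv2.contourArea(c) < 500:              # small blobs are assumed to be letters
        x, y, w, h = cv2.boundingRect(c)
        # paint over the letter with the background colour (here plain white)
        cv2.rectangle(img, (x, y), (x + w, y + h), (255, 255, 255), -1)

cv2.imwrite("no_text.png", img)
```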
Text Extraction from image after detecting text region with contours
There is also pytesseract to detect letters and their regions, but I guess it will be heavier than the contours.
Here is an example project where I worked with pytesseract: How to obtain the best result from pytesseract?
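If you do go the pytesseract route, image_to_data gives you a bounding box per detected word, e.g. (the filename is a placeholder):

```python
import cv2
import pytesseract

# Sketch of the pytesseract route; "photo.png" is an assumption.
img = cv2.imread("photo.png")

# image_to_data returns one entry per detected word with its bounding box.
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
for i, text in enumerate(data["text"]):
    if text.strip():                          # skip empty detections
        x, y, w, h = (data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i])
        cv2.rectangle(img, (x, y), (x + w, y + h), (255, 255, 255), -1)
```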
For pre-processing, I need to crop out any pixels that are not in the dark box.
The intent is to crop out just the text.
Finally, I do additional processing to turn it into a black and white image perfect for Tesseract.
Unfortunately, I've been finding it difficult to find the y-coordinate to crop at.
I've tried:
Adding contours to find the box (either too many lines or it cannot detect the end of the dark box) - contour lines
Using template matching to find the last margin in the image (fails when the background is different) - working example - failing example
Increasing saturation/brightness - doesn't help isolate the box
Processing the extraneous text instead of trying to crop the image - unfortunately the extra image data often seems to ruin the accuracy of the text that I want
Assumptions:
The image width, font, line-spacing, and margins are always the same.
The number of lines of text will vary.
The last line of the text will vary.
The background will often be different.
Here are some other images that I would want the algorithm to work with: https://imgur.com/a/Lu4pPBL
Thank you in advance!
I'm trying to denoise an image (photographed text) in order to improve OCR. I'm using Python - skimage for the task, but I'm open to other library recommendations (PIL, cv2, ...)
Example image (should read "i5"):
I used skimage.morphology.erosion and skimage.morphology.remove_small_objects quite successfully, resulting in:
The noise is gone, but so is some part of the 5 and dot on the i.
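Roughly, the code I used looks like this (the filename, footprint size and min_size below are placeholders, not my exact values):

```python
import skimage.io
import skimage.filters
import skimage.morphology

# Sketch of the denoising steps above; filename, footprint size and min_size are placeholders.
img = skimage.io.imread("i5.png", as_gray=True)
binary = img < skimage.filters.threshold_otsu(img)               # True where there is ink

eroded = skimage.morphology.erosion(binary, skimage.morphology.square(2))
clean = skimage.morphology.remove_small_objects(eroded, min_size=30)
```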
Now the question: I had an idea of how to repair the 5. I add the original image to the denoised one, so that some parts end up black and some parts gray:
Then I make all gray parts that are connected to black parts black (propagating over the structure). Finally, by erasing all parts which are still gray, I get a clean image.
But I don't know how to do the propagation part using one of the above libraries. Is there an algorithm for that?
Bonus question: How can I preserve the dot on the i ?
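One possible way to do the propagation is morphological reconstruction in skimage. A sketch, assuming `clean` and `binary` are the arrays from the denoising snippet above:

```python
import numpy as np
import skimage.morphology

# Reconstruction by dilation grows the black (True) seed pixels into every gray region
# they touch and discards gray regions that touch no black pixel.
seed = clean.astype(np.uint8)      # surviving black parts after denoising
mask = binary.astype(np.uint8)     # black + gray parts (all original ink)
repaired = skimage.morphology.reconstruction(seed, mask, method='dilation').astype(bool)

# The dot on the i has no seed pixel left, so reconstruction alone won't restore it;
# keeping small round blobs from `binary` that sit just above a recovered letter is one
# possible (untested) heuristic for the bonus question.
```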
I am working on an application where I need a feature like CamScanner, where a document is to be detected in an image. For that, I am using Canny edge detection followed by a Hough transform.
The results look promising but the text in the document is creating issues as explained via images below:
Original Image
After canny edge detection
After hough transform
My issue lies in the third image: the text in the original image near the bottom has forced the Hough transform to detect a horizontal line (second cluster from the bottom).
I know I can take the largest quadrilateral, and that would work fine in most cases, but I would still like to know other ways in which I can ignore the effect of the text on the edges during this processing.
Any help would be appreciated.
I solved the issue of the text with the help of a median filter of size 15 (square) on an image of 500x700.
The median filter doesn't affect the boundaries of the paper, but it can eliminate the text completely.
Using that I was able to get much more effective boundaries.
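A sketch of that step (the filename and Canny thresholds are assumptions):

```python
import cv2

# Sketch of the median-filter step; "document.jpg" and the Canny thresholds are assumptions.
img = cv2.imread("document.jpg", cv2.IMREAD_GRAYSCALE)   # roughly 500x700 as in the answer
blurred = cv2.medianBlur(img, 15)       # 15x15 median filter removes the text strokes
                                        # but keeps the strong paper boundary

edges = cv2.Canny(blurred, 50, 150)     # Canny + Hough now run on a text-free image
```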
Another approach you could try is to use thresholding to find the paper boundaries. This creates a binary image. You can then examine the blobs of white pixels and see if any are large enough to be the paper and have the right dimensions. If one fits the criteria, you can find the min/max points of that blob to represent the paper.
There are several ways to do the thresholding, including iterative, Otsu, and adaptive.
Also, for best results you may have to dilate the binary image to close the black lines in the table as shown in your example.
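A sketch of that approach, assuming Otsu thresholding and a hypothetical 20%-of-image-area criterion for the paper blob:

```python
import cv2
import numpy as np

# Sketch of the thresholding idea; the filename and the 20% area criterion are assumptions.
img = cv2.imread("document.jpg", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Dilate the white regions to close the black table lines so the paper becomes one solid blob.
binary = cv2.dilate(binary, np.ones((5, 5), np.uint8))

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    if cv2.contourArea(c) > 0.2 * img.shape[0] * img.shape[1]:   # large enough to be the paper
        x, y, w, h = cv2.boundingRect(c)        # min/max points of the blob
        paper = img[y:y + h, x:x + w]
```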