"Newbie" questions about PyTesseract OCR - python

I'm working on an application that extracts information from invoices that the user photographs with their phone (using Flask and pytesseract).
Everything works on the extraction and classification side for my needs, using the image_to_data method of pytesseract.
But the problem is on the pre-processing side.
I refine the image with greyscale filters, binarization, dilation, etc.
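Roughly, that pipeline looks like this (a simplified sketch with standard OpenCV calls; the threshold method and kernel size here are illustrative, not my exact values):

import cv2

img = cv2.imread("invoice.jpg")
# Greyscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Binarization (Otsu picks the threshold automatically)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# Dilation to thicken the strokes
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
processed = cv2.dilate(binary, kernel, iterations=1)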
But sometimes the user will take a picture that has a specific angle, like this:
[image: photo of an invoice taken at an angle]
And then tesseract will return characters that don't make sense, or sometimes it will just return nothing.
At the moment I "scan" the image during pre-processing (I'm largely inspired by this tutorial: https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/), but it's not efficient at all.
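For reference, the core of that tutorial's approach is a four-point perspective transform; a minimal sketch, assuming the four document corners have already been detected (finding those corners reliably is the hard part):

import cv2
import numpy as np

def warp_document(img, corners):
    # corners: 4x2 float32 array ordered top-left, top-right,
    # bottom-right, bottom-left.
    tl, tr, br, bl = corners
    width = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    height = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]],
                   dtype="float32")
    # Map the skewed quadrilateral onto an upright rectangle.
    M = cv2.getPerspectiveTransform(corners, dst)
    return cv2.warpPerspective(img, M, (width, height))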
Does anyone know a way to make it easier for tesseract to work on this type of image?
If not, should I focus on improving this pre-processing "scan" step?
Thank you for your help!

Related

Is there a way to transform a regular, full-resolution image so it looks like a photo of my computer screen taken with my cellphone?

I'm training a computer vision algorithm, and I want to give it some more robust data. For the application of the software I'm building, oftentimes people will take pictures of their computer screen with their phones and use those images, rather than the actual original image file, to run the computer vision on.
Do you know what kind of transformations I can make to my already labeled image dataset to emulate what it would look like if someone used a cellphone to take a picture of a screen?
Like some qualities demonstrated below in the sample image of my screen for this question:
I guess this is what I'm thinking so far conceptually, but I'm not sure what libraries to use in Python (a rough code sketch of these ideas follows the list):
The image resolution will probably drop, so lowering it to be more commensurate with a cellphone camera's granularity
Adding in random color aberrations to the images, because when you take pictures of screens, mini rainbow-like artifacts seem to form
Warping the angle the image is viewed at, since when someone takes a photo, they may not be taking it perfectly square/flat
Adding pixel-looking grids to the images to make them look more like images taken of screens.
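Something like the following might be a starting point; it's a rough sketch with Pillow and NumPy, and every constant in it (scale factor, channel shift, quad corners, grid spacing) is an illustrative guess, not a tuned value:

import numpy as np
from PIL import Image

def screen_photo_effect(img):
    w, h = img.size
    # 1. Resolution drop: downscale, then scale back up.
    img = img.resize((w // 2, h // 2)).resize((w, h))
    # 2. Crude color aberration: shift the R and B channels a couple
    #    of pixels in opposite directions.
    r, g, b = img.convert("RGB").split()
    r = r.transform(r.size, Image.AFFINE, (1, 0, 2, 0, 1, 0))
    b = b.transform(b.size, Image.AFFINE, (1, 0, -2, 0, 1, 0))
    img = Image.merge("RGB", (r, g, b))
    # 3. Perspective warp: sample a slightly skewed quadrilateral
    #    (corner order: upper-left, lower-left, lower-right, upper-right).
    quad = (0, 0, int(0.05 * w), h, w, h, int(0.95 * w), 0)
    img = img.transform((w, h), Image.QUAD, quad)
    # 4. Pixel-grid look: darken every third row and column.
    arr = np.array(img).astype(np.int16)
    arr[::3, :, :] -= 20
    arr[:, ::3, :] -= 20
    return Image.fromarray(arr.clip(0, 255).astype(np.uint8))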
Is there anything I missed, and do y'all have any library recommendations or starting code to help me? I really want to avoid relabelling all of my data...
Thanks in advance!
I found this: https://graphicdesign.stackexchange.com/a/14001
It seems to be exactly what I'm looking for, but how do I translate this into code? Any library recommendations?

How to get information from an image of a document, like name, CPF, RG, in Python?

I'm sorry if the title of my question doesn't make my problem clear.
I'm trying to get information from an image of a document using tesseract, but it doesn't work well on photos (on screenshots of text it works very well). I want to ask if somebody knows a technique that can help me. I think that making the image black and white, so that the information I want is in black, would help a lot, but I don't know how to do that.
I will be glad if somebody knows how to help me. (:
Using opencv might help to preprocess the image before passing it to tesseract.
I usually follow these steps (sketched in code after the list):
Convert the image to grayscale
If the texts in the image are small, resize the image using cv2.resize()
Blur the image (GaussianBlur or MedianBlur)
Apply thresholding to make the text prominent (cv2.threshold)
Use tesseract config to instruct tesseract to look for specific characters.
For example, if the image contains only upper-case alphanumeric English text, then passing
config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' would help.
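Put together, a minimal sketch of those steps might look like this (the file name, resize factor, and blur size are placeholders to adapt to your image):

import cv2
import pytesseract

img = cv2.imread("document.jpg")
# 1. Greyscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# 2. Upscale if the text is small
gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
# 3. Blur to suppress noise
gray = cv2.medianBlur(gray, 3)
# 4. Threshold to make the text prominent
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# 5. Restrict tesseract to the expected character set
text = pytesseract.image_to_string(
    thresh,
    config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ')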

Extract text from light text on white background image

I have an image like the following:
and I want to extract the text from it, which should be ws35. I've tried the pytesseract library using the method:
pytesseract.image_to_string(Image.open(path))
but it returns nothing... Am I doing something wrong? How can I get the text back using OCR? Do I need to apply some filter to it?
You can try the following approach (a code sketch follows these steps):
Binarize the image with a method of your choice (Thresholding with 127 seems to be sufficient in this case)
Use a minimum filter to connect the loose dots into characters. A filter with r=4 seems to work quite well:
If necessary, the result can be further improved by applying a median blur (r=4):
Because I don't personally use tesseract, I am not able to try this picture, but online OCR tools seem to be able to identify the sequence correctly (especially if you use the blurred version).
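A minimal sketch of those three steps with Pillow (a 9x9 window approximates r=4; the file names are placeholders):

from PIL import Image, ImageFilter

img = Image.open("light_text.png").convert("L")
# Binarize with a fixed threshold of 127.
img = img.point(lambda p: 0 if p < 127 else 255)
# Minimum filter: grows the dark dots until they connect into strokes.
img = img.filter(ImageFilter.MinFilter(9))
# Optional median blur to smooth the result.
img = img.filter(ImageFilter.MedianFilter(9))
img.save("preprocessed.png")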
Similar to #SilverMonkey's suggestion: Gaussian blur followed by Otsu thresholding.
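In OpenCV that suggestion is only a few lines (the kernel size is a typical choice, not a tuned one):

import cv2

gray = cv2.imread("light_text.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
# Otsu picks the threshold automatically from the histogram.
_, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)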
The problem is that this picture is low quality and very noisy!
Even professional, enterprise programs struggle with this.
You have most likely seen a captcha before, and the reason for those is that your answer is sent back to a database along with the image and then used to train computers to read images like these.
Short answer: pytesseract can't read the text inside this image, and most likely no module or professional program can read it either.
You may need to apply some image processing/enhancement to it. Look at this post, read the suggestions, and try to apply them.

Get a map of where text was detected using tesseract in Python

I am trying to write some code which, given an image, will run tesseract on the entire image and return a map of all locations where text was detected (as a binary image).
It doesn't have to be pixel-by-pixel, a union of bounding boxes is more than enough.
Is there a way to do this?
Thanks in advance
Yes... (of course). Look at the Python Imaging Library for loading and cropping the image. Then you can apply Tesseract to each piece and check the output.
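As an alternative to manual cropping, here is a sketch that uses pytesseract's image_to_data (the method mentioned in the first question above) to paint the detected word boxes into a binary mask; the file names are placeholders:

import cv2
import numpy as np
import pytesseract
from pytesseract import Output

img = cv2.imread("page.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

# White rectangle for every non-empty word box; black elsewhere.
mask = np.zeros(img.shape[:2], dtype=np.uint8)
for i, word in enumerate(data["text"]):
    if word.strip():
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        mask[y:y + h, x:x + w] = 255

cv2.imwrite("text_mask.png", mask)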
Have a look at the program I answered a while back. It might help you with the elements you need. It lets you manually select an area and OCR it, but that can easily be changed.

CAPTCHAs Image Manipulation using Pillow

As an exercise, I'm attempting to break the following CAPTCHA:
It doesn't seem like it would be too difficult to break, as the edges seem fairly solid and the noise should be relatively easy to remove. Problem is, I have very little experience with image manipulation. Currently I'm using Python with the Pillow library to manipulate the CAPTCHA image, after which it will be passed into Tesseract for OCR.
In the following code I attempt to bring out the edges by sharpening the image and then convert the image to black and white:
from PIL import Image, ImageFilter

try:
    img = Image.open("Captcha.jpg")
except OSError:
    print("Can't load captcha.")
    exit()

# Bring out the edges by sharpening.
out = img.filter(ImageFilter.SHARPEN)
# Convert to greyscale, then to pure black and white with a fixed threshold.
out = out.convert("L")
out = out.point(lambda x: 0 if x < 136 else 255, "1")
# Upscale 5x so Tesseract has more pixels to work with.
width, height = out.size
out = out.resize((width * 5, height * 5), Image.NEAREST)
out.save("captcha_modified.png")
At this point I see the following:
However, Tesseract is still unable to read the characters. As an experiment, I used good ol' mspaint to manually modify the image to a point to where it could be read by Tesseract:
So if I can get the image to that point, I think Tesseract will do a fairly good job of detecting characters. My current thought is that I need to enhance the edges and reduce the noise in the image. Also, I imagine it would be easier for Tesseract to detect the letters if they were filled in rather than outlined, but I have no idea how I'd do this.
Any suggestions on how to go about this? Is there a better way to process the images?
I am short on time, so this answer may not be incredibly useful, but it goes over my own two approaches. There isn't much code, just a few method recommendations. It is a good idea to use code rather than MS Paint. With code it's actually really easy to break a captcha and achieve an above-50% success rate. Behavioral recognition may be a better security mechanism, or an additional one.
A. Edge Detection Method you use:
Edge detection really isn't necessary. In this case, just use the getpixel((x, y)) function and fill in the area between the bounding lines: start filling at outline crossings 1, 3, 5, etc. and turn the fill off after crossings 2, 4, 6, etc. (a toy sketch of this follows). Luckily, you chose an easy captcha, so edge detection is a decent solution without decluttering, rotating, and re-alignment.
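A naive sketch of that even-odd scan-line fill, assuming the black-and-white image produced by the question's code (it will misfire on horizontal strokes, which is where the flow control mentioned at the end comes in):

from PIL import Image

img = Image.open("captcha_modified.png").convert("1")
w, h = img.size
for y in range(h):
    inside = False
    prev_black = False
    for x in range(w):
        black = img.getpixel((x, y)) == 0
        if black and not prev_black:
            # Crossed an outline: toggle whether we are inside a letter.
            inside = not inside
        elif inside and not black:
            img.putpixel((x, y), 0)  # fill between odd and even crossings
        prev_black = black
img.save("captcha_filled.png")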
B. Manipulation Method:
Another method I use utilizes OpenCV and Pillow as well. I am really busy, but am posting a blog article on this later at druid5.wordpress.com/, which will contain code examples of this method. Since it isn't illegal to get through them, at least I am told, I use the method I will post to collect data all the time. Mostly: contrast and detail enhancement from Pillow, some basic clutter removal with stats, re-alignment with a basic DFS, and rotation (doable with OpenCV or easily with a kernel). Tesseract is a good choice for open source, but it isn't too hard to create an OCR with OpenCV either.
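For the Pillow contrast-and-detail step in that list, a minimal sketch (the enhancement factor is a guess to tune per captcha):

from PIL import Image, ImageEnhance, ImageFilter

img = Image.open("Captcha.jpg")
img = ImageEnhance.Contrast(img).enhance(2.0)  # exaggerate contrast
img = img.filter(ImageFilter.DETAIL)           # bring out fine detail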
This exercise is a decent introduction to OpenCV, PIL (Pillow), image manipulation with math, and some other things that help with everything from robotics to AI.
Using flow control to find the failed conditions and try different routes may be necessary but the aim should always be a generic solution.
