Extract text from light text on white background image

Extract text from light text on white background image - python

I have an image like the following:
and I would like to extract the text from it, which should be ws35. I tried the pytesseract library using:
pytesseract.image_to_string(Image.open(path))
but it returns nothing. Am I doing something wrong? How can I recover the text using OCR? Do I need to apply some filter to the image first?

You can try the following approach:
Binarize the image with a method of your choice (thresholding at 127 seems to be sufficient in this case).
Use a minimum filter to connect the loose dots so that they form characters. A filter with r=4 seems to work quite well:
If necessary, the result can be further improved by applying a median blur (r=4):
Because I personally do not use Tesseract, I was not able to try it on this picture, but online OCR tools seem to identify the sequence correctly (especially if you use the blurred version).
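The three steps above could be sketched with Pillow roughly as follows. This is only a sketch under assumptions: the function name connect_dots and its parameters are made up for illustration, and r=4 is interpreted as a 9x9 filter window (2*r + 1), which may not match the tool the answer was produced with.

```python
from PIL import Image, ImageFilter


def connect_dots(img, threshold=127, radius=4, blur=True):
    """Binarize, grow dark dots with a minimum filter, optionally median-blur."""
    gray = img.convert("L")
    # Step 1: fixed threshold, dark pixels become 0 and light pixels 255.
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    # Step 2: a minimum filter spreads dark pixels so nearby dots merge.
    size = 2 * radius + 1  # Pillow rank filters need an odd window size
    out = binary.filter(ImageFilter.MinFilter(size))
    # Step 3 (optional): a median blur smooths the merged strokes.
    if blur:
        out = out.filter(ImageFilter.MedianFilter(size))
    return out
```

Typical use would be pytesseract.image_to_string(connect_dots(Image.open(path))), with the threshold and radius tuned per image.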

Similar to @SilverMonkey's suggestion: Gaussian blur followed by Otsu thresholding.
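A minimal sketch of blur plus Otsu, assuming Pillow only. The answer does not say which library was used; OpenCV offers the same thing through cv2.GaussianBlur and cv2.threshold with THRESH_OTSU. The helper names and the default blur radius here are illustrative assumptions.

```python
from PIL import Image, ImageFilter


def otsu_threshold(gray):
    """Return the gray level that maximizes between-class variance (Otsu)."""
    hist = gray.histogram()  # 256 bins for an "L" image
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = 0
    sum_bg = 0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue  # no pixels at or below t yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the lower class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def blur_and_otsu(img, blur_radius=2):
    """Gaussian-blur the image, then binarize it at the Otsu threshold."""
    gray = img.convert("L").filter(ImageFilter.GaussianBlur(blur_radius))
    t = otsu_threshold(gray)
    return gray.point(lambda p: 255 if p > t else 0)
```

Because Otsu picks the cutoff from the histogram, this avoids hand-tuning a fixed value such as 127 for every image.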

The problem is that this picture is low quality and very noisy!
Even professional and enterprise programs struggle with images like this.
You have most likely seen a CAPTCHA before. The reason they exist is that your answer is sent back to a database together with the image and then used to train computers to read images like these.
Short answer: pytesseract can't read the text inside this image, and most likely no other module or professional program can read it either.

You may need to apply some image processing/enhancement to it. Look at this post, read the suggestions, and try to apply them.

Related

"Newbie" questions about PyTesseract OCR

I'm working on an application that would extract information from invoices that the user takes a picture of with his phone (using flask and pytesseract).
Everything works on the extraction and classification side for my needs, using the image_to_data method of pytesseract.
But the problem is on the pre-processing side.
I refine the image with greyscale filters, binarization, dilation, etc.
But sometimes the user will take a picture at an angle, like this:
invoice
And then tesseract will return characters that don't make sense, or sometimes it will just return nothing.
At the moment I "scan" the image during pre-processing (I'm largely inspired by this tutorial: https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/), but it's not efficient at all.
Does anyone know a way to make it easier for tesseract to work on this type of image?
If not, should I focus on improving this pre-processing scan step?
Thank you for your help!

How can I convert an image to a Bayer pattern and then debayer it again? (demosaicing)

My goal is to blur the picture a bit using a bilinear debayer.
This is to embody the dirty image of the VHS days.
As a graphic major, I tried to reproduce it with various graphic tools, but did not get the desired quality result.
I want that subtle feeling of faded haze when scanned with a scanner.
I decided to emulate a camera sensor.
The process I envisioned is this:
I convert the image I made (TIFF, Targa, PNG, or JPG) into a Bayer-format image. Then I want to restore the original image by debayering it again with a bilinear algorithm.
I chose the bilinear method because its degradation is gentle yet clearly visible.
The link below shows how the image changes depending on the algorithm.
https://www.dpreview.com/forums/post/63514167
I'm not a programmer at all, but I've tried something on my own to get what I want.
https://codegolf.stackexchange.com/questions/86410/reverse-bayer-filter-of-an-image
I succeeded in making an image of the Bayer pattern using the code there.
Then I tried debayering it by running debayer source code downloaded from other places, but it failed because the file extension was not supported.
So that I could switch between different demosaic (debayer) algorithms,
I got programs called darktable and RawTherapee and tried to convert the image with them, but these programs could only recognize raw files.
Besides, the algorithms provided by both programs were so good that it was hard to get the impression that the image was degraded.
How do I make what I want?
What can I look for? I really want to make this.
Please let me know which way I should go.
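One possible direction is to do both steps directly on pixel arrays, which sidesteps the raw-file requirement. The sketch below assumes NumPy, an RGGB sensor layout, and an image at least 2x2 pixels in size. The function names are made up for illustration, and loading and saving the image (for example with Pillow) is left out.

```python
import numpy as np


def to_bayer_rggb(rgb):
    """Keep one color sample per pixel (RGGB layout); returns a 2-D mosaic."""
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w), dtype=np.float64)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # red sites
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # green sites on red rows
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # green sites on blue rows
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # blue sites
    return mosaic


def _conv3(img, kernel):
    """3x3 weighted neighborhood sum; reflect padding keeps the Bayer parity."""
    padded = np.pad(img, 1, mode="reflect")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy][dx] * padded[dy:dy + h, dx:dx + w]
    return out


def demosaic_bilinear(mosaic):
    """Rebuild RGB by averaging the nearest known samples of each color."""
    h, w = mosaic.shape
    masks = np.zeros((h, w, 3))
    masks[0::2, 0::2, 0] = 1
    masks[0::2, 1::2, 1] = 1
    masks[1::2, 0::2, 1] = 1
    masks[1::2, 1::2, 2] = 1
    k_rb = [[0.25, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 0.25]]
    k_g = [[0.0, 0.25, 0.0], [0.25, 1.0, 0.25], [0.0, 0.25, 0.0]]
    out = np.empty((h, w, 3))
    for c, kernel in enumerate((k_rb, k_g, k_rb)):
        # At every site the weights that land on known samples sum to 1.
        out[..., c] = _conv3(mosaic * masks[..., c], kernel)
    return out
```

Running demosaic_bilinear(to_bayer_rggb(rgb)) on a float array of shape (height, width, 3) should give the slightly soft, color-fringed result that bilinear demosaicing is known for. Flat areas come back unchanged, while edges get blurred.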

Removing borders (lines) from image

I have this image (some information was deleted from this on purpose)
What I need is some kind of way to remove the borders(lines) around the text.
I am doing OCR on these images and the lines are really in the way for text recognition.
Also everything has to work automatically, OCR and all other scripts get executed on the server side when someone uploads a document.

You could try using a Hough transform to detect all straight lines in the image, then all you need to do is mask them.
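As a rough illustration of that idea, here is a small Hough accumulator written with NumPy alone. In practice OpenCV's cv2.HoughLinesP would be the more usual choice. The function name, the min_votes parameter, and the choice to clear every pixel near a detected line (including text that the line crosses) are assumptions of this sketch.

```python
import numpy as np


def remove_lines_hough(binary, min_votes, n_theta=180, thickness=1):
    """Clear pixels lying on strong straight lines.

    binary: 2-D bool array, True where there is ink.
    min_votes: minimum number of ink pixels on a line for it to count.
    """
    ys, xs = np.nonzero(binary)
    thetas = np.deg2rad(np.arange(n_theta))  # 0..179 degrees
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    diag = int(np.ceil(np.hypot(*binary.shape)))
    # rho = x*cos(theta) + y*sin(theta), shifted so indices are non-negative.
    rhos = np.rint(np.outer(xs, cos_t) + np.outer(ys, sin_t)).astype(int) + diag
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    theta_idx = np.broadcast_to(np.arange(n_theta), rhos.shape)
    np.add.at(acc, (rhos, theta_idx), 1)  # one vote per (pixel, theta)

    out = binary.copy()
    yy, xx = np.indices(binary.shape)
    for rho_i, t_i in zip(*np.nonzero(acc >= min_votes)):
        # Distance of every pixel from the detected line.
        dist = np.abs(xx * cos_t[t_i] + yy * sin_t[t_i] - (rho_i - diag))
        out[dist <= thickness] = False
    return out
```

Setting min_votes well above the length of any text stroke (for example half the page width) keeps characters from being mistaken for lines.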

You can use Leptonica to remove lines.
http://www.leptonica.com/line-removal.html
https://github.com/DanBloomberg/leptonica/blob/master/prog/lineremoval_reg.c

Get a map of where text was detected using tesseract in Python

I am trying to write code which, given an image, will run Tesseract on the entire image and return a map of all locations where text was detected (as a binary image).
It doesn't have to be pixel-by-pixel, a union of bounding boxes is more than enough.
Is there a way to do this?
Thanks in advance

Yes... (of course). Look at the Python imaging library for loading the image, and cropping it. Then you can apply Tesseract on each piece and check the output.
Have a look at the program I answered with a while back. It might help you with the elements you need. It lets you manually select an area and OCR it, but that can easily be changed.
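As an alternative to cropping and running OCR piece by piece, pytesseract's image_to_data reports a bounding box for each detected word, and the union of those boxes can be drawn into a mask. The sketch below assumes Pillow and pytesseract with a working Tesseract install. The helper names are illustrative, and treating any non-empty text entry as a detection is an assumption.

```python
from PIL import Image, ImageDraw


def boxes_to_mask(size, boxes):
    """Return an "L" image: 255 inside any (left, top, width, height) box."""
    mask = Image.new("L", size, 0)
    draw = ImageDraw.Draw(mask)
    for left, top, width, height in boxes:
        draw.rectangle(
            [left, top, left + width - 1, top + height - 1], fill=255
        )
    return mask


def text_mask(image):
    """Run Tesseract once on the whole image and mask every word box."""
    import pytesseract  # imported lazily; needs the Tesseract binary

    data = pytesseract.image_to_data(
        image, output_type=pytesseract.Output.DICT
    )
    boxes = [
        (left, top, width, height)
        for left, top, width, height, text in zip(
            data["left"], data["top"], data["width"], data["height"],
            data["text"],
        )
        if text.strip()  # skip layout-only entries with no recognized text
    ]
    return boxes_to_mask(image.size, boxes)
```

Calling text_mask(Image.open(path)) would then return a black image with white rectangles where words were found.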

CAPTCHAs Image Manipulation using Pillow

As an exercise, I'm attempting to break the following CAPTCHA:
It doesn't seem like it would be too difficult to break, as the edges seem to be fairly solid and the noise should be relatively easy to remove. The problem is that I have very little experience with image manipulation. Currently I'm using Python with the Pillow library to manipulate the CAPTCHA image, after which it will be passed into Tesseract for OCR.
In the following code I attempt to bring out the edges by sharpening the image and then convert the image to black and white:
from PIL import Image, ImageFilter
try:
    img = Image.open("Captcha.jpg")
except OSError:
    print("Can't load captcha.")
    raise SystemExit(1)
# Bring out the edges by sharpening.
out = img.filter(ImageFilter.SHARPEN)
# Convert to grayscale, then to pure black and white (cutoff at 136).
out = out.convert("L")
out = out.point(lambda x: 0 if x < 136 else 255, "1")
# Upscale 5x with nearest-neighbour resampling so the edges stay hard.
width, height = out.size
out = out.resize((width * 5, height * 5), Image.NEAREST)
out.save("captcha_modified.png")
At this point I see the following:
However, Tesseract is still unable to read the characters. As an experiment, I used good ol' mspaint to manually modify the image to a point to where it could be read by Tesseract:
So if I can get the image to that point, I think Tesseract will do a fairly good job of detecting the characters. My current thinking is that I need to enhance the edges and reduce the noise in the image. Also, I imagine it would be easier for Tesseract to detect the letters if they were filled in rather than outlined, but I have no idea how I'd do this.
Any suggestions on how to go about this? Is there a better way to process the images?

I am short on time, so this answer may not be incredibly useful, but it goes over my own two algorithms. There isn't much code, just a few method recommendations. It is a good idea to use code rather than MS Paint. With code it is actually really easy to break a CAPTCHA and achieve a success rate above 50%. Behavioral recognition may be a better security mechanism, or an additional one.
A. Edge Detection Method you use:
Edge detection really isn't necessary. In this case, just use the getpixel((x,y)) function and fill in the area between the bounding lines: turn the fill on after crossing lines 1, 3, 5, etc. and turn it off after crossings 2, 4, 6, etc. Luckily, you chose an easy CAPTCHA, so edge detection is a decent solution without decluttering, rotating, and re-alignment.
B. Manipulation Method:
Another method I use utilizes OpenCV and Pillow as well. I am really busy but will post a blog article on this later at druid5.wordpress.com/ which will contain code examples of this method. Since it isn't illegal to get through them, at least so I am told, I use the method I will post to collect data all the time. Mostly it is contrast and detail from Pillow, some basic clutter removal with stats, re-alignment with a basic DFS, and rotation (performable with OpenCV or easily with a kernel). Tesseract is a good choice for open source, but it isn't too hard to create an OCR with OpenCV either.
This exercise is a decent introduction to OpenCV, PIL (Pillow), image manipulation with math, and some other things that help with everything from robotics to AI.
Using flow control to find the failed conditions and try different routes may be necessary but the aim should always be a generic solution.
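The scanline fill described under method A might look roughly like this with Pillow. It is a sketch, not the answerer's code: the function name and threshold are assumptions, pixel access goes through load() rather than repeated getpixel() calls, and rows where the scanline only grazes a horizontal stroke can break the on/off parity.

```python
from PIL import Image


def scanline_fill(img, threshold=128):
    """Fill outlined shapes row by row.

    Filling starts after the 1st, 3rd, 5th... dark run in a row and stops
    at the 2nd, 4th, 6th... one, as in method A.
    """
    src = img.convert("L")
    out = src.copy()
    src_px, out_px = src.load(), out.load()
    width, height = src.size
    for y in range(height):
        # Collect runs of dark (edge) pixels as (start, end) with end exclusive.
        runs, x = [], 0
        while x < width:
            if src_px[x, y] < threshold:
                start = x
                while x < width and src_px[x, y] < threshold:
                    x += 1
                runs.append((start, x))
            else:
                x += 1
        # Fill the gaps between runs 1-2, 3-4, ...; an unpaired run is ignored.
        for i in range(0, len(runs) - 1, 2):
            for fx in range(runs[i][1], runs[i + 1][0]):
                out_px[fx, y] = 0
    return out
```

This would be applied to the thresholded CAPTCHA before handing it to Tesseract, so the letters arrive filled in rather than outlined.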
