I have this image (some information was removed from it on purpose).
What I need is some way to remove the borders (lines) around the text.
I am doing OCR on these images, and the lines really get in the way of text recognition.
Also, everything has to work automatically: the OCR and all other scripts are executed on the server side when someone uploads a document.
You could try using a Hough transform to detect all the straight lines in the image; then all you need to do is mask them out.
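For instance, a rough sketch of that idea with OpenCV; the file name, Canny thresholds and Hough parameters are placeholders that would need tuning on real documents:

# Hough-based line removal sketch: detect long straight segments and paint
# them in the background colour. All numeric parameters are guesses to tune.
import cv2
import numpy as np

img = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)

# Edge map as input for the probabilistic Hough transform.
edges = cv2.Canny(img, 50, 150)

# Detect long, mostly straight segments (table borders), not text strokes.
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                        minLineLength=100, maxLineGap=10)

# Paint the detected segments white so they blend into the background.
cleaned = img.copy()
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(cleaned, (x1, y1), (x2, y2), 255, thickness=3)

cv2.imwrite("form_no_lines.png", cleaned)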
You can use Leptonica to remove lines.
http://www.leptonica.com/line-removal.html
https://github.com/DanBloomberg/leptonica/blob/master/prog/lineremoval_reg.c
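If you would rather stay in Python than call Leptonica, a comparable line-removal effect can be achieved with plain OpenCV morphology; this is a different tool than the linked example, and the kernel lengths below are assumptions that depend on how long the rules are compared with the text:

# OpenCV alternative: long, thin morphological kernels isolate horizontal
# and vertical rules, which are then subtracted from the binarized image.
# Kernel sizes (40 px) are placeholders to tune per document.
import cv2

gray = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)
# Invert and binarize so that ink is white (255) on black.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))

# Opening keeps only structures at least as long as the kernel, i.e. the rules.
horiz_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horiz_kernel)
vert_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vert_kernel)

# Remove the detected rules from the binarized text image.
text_only = cv2.subtract(binary, cv2.bitwise_or(horiz_lines, vert_lines))
cv2.imwrite("form_no_lines.png", 255 - text_only)  # back to dark-on-light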
I'm working on an application that extracts information from invoices that the user photographs with their phone (using Flask and pytesseract).
Everything works on the extraction and classification side for my needs, using the image_to_data method of pytesseract.
But the problem is on the pre-processing side.
I refine the image with greyscale filters, binarization, dilation, etc.
But sometimes the user will take a picture that has a specific angle, like this:
[photo of an invoice taken at an angle]
And then tesseract will return characters that don't make sense, or sometimes it will just return nothing.
At the moment I "scan" the image during pre-processing (largely inspired by this tutorial: https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/), but it's not efficient at all.
Does anyone know a way to make it easier for Tesseract to work on these types of images?
If not, should I focus on improving this pre-processing "scan" step?
Thank you for your help!
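For reference, a minimal sketch of the kind of pipeline described in the question (grayscale, binarization, dilation) plus a simple rotation-only deskew via cv2.minAreaRect. This only corrects in-plane tilt, not the full perspective warp of a phone photo, and every parameter here is a placeholder:

# Pre-processing sketch: grayscale -> Otsu binarization -> light dilation ->
# rotation-only deskew -> pytesseract. Note that the angle convention of
# cv2.minAreaRect changed across OpenCV versions, so the correction below
# may need adjusting.
import cv2
import numpy as np
import pytesseract

img = cv2.imread("invoice.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
binary = cv2.dilate(binary, np.ones((2, 2), np.uint8), iterations=1)

# Estimate the dominant tilt from the bounding box of all ink pixels.
coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
    angle = -(90 + angle)
else:
    angle = -angle

(h, w) = gray.shape
M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
deskewed = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

data = pytesseract.image_to_data(deskewed, output_type=pytesseract.Output.DICT)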
I am working on an automation program using TensorFlow, and I need some data to bypass text-based CAPTCHAs, so I am trying to gather data (images) from sites. How can I take "clean" screenshots with the help of OpenCV? By "clean" I mean images without blank white areas.
Note: I know that we can take a screenshot of a specific web element using Selenium (see: https://www.lambdatest.com/blog/screenshots-with-selenium-webdriver/), but on this site there are two text-based CAPTCHAs, so the element screenshot also includes blank white areas, which I don't want. I also tried capturing the images manually, but my hands are not steady enough, so those images include blank white areas as well.
When I grabbed the web element using Selenium, I was not satisfied with the result because it contains blank white areas that I don't want in my dataset.
Normally the images look like this. All I want is two separate images, without the blank white area between them, to use as training data. Could you please help me?
You could use Playwright and take an element screenshot with omitBackground enabled: https://playwright.dev/#version=v1.0.2&path=docs%2Fapi.md&q=elementhandlescreenshotoptions
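A sketch of that suggestion with Playwright's Python API, taking one screenshot per CAPTCHA element so no surrounding white area ends up in the crop; the URL and the CSS selector are made up and would need to match the real page:

# One element screenshot per CAPTCHA. "img.captcha" and the URL are
# placeholders; omit_background only matters if the element background
# is transparent.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/login")
    for i, handle in enumerate(page.query_selector_all("img.captcha")):
        handle.screenshot(path=f"captcha_{i}.png", omit_background=True)
    browser.close()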
So the idea here is that I have text in Devanagari characters, such as संस्थानका कर्मचारी, and I want to convert that text to an image. Here is what I have attempted.
from PIL import Image, ImageDraw, ImageFont
import cv2

def draw_image(myString):
    width = 500
    height = 100
    back_ground_color = (255, 255, 255)  # white background
    font_size = 10
    font_color = (0, 0, 0)               # black text
    unicode_text = myString
    im = Image.new("RGB", (width, height), back_ground_color)
    draw = ImageDraw.Draw(im)
    unicode_font = ImageFont.truetype("arial.ttf", font_size)  # Arial has no Devanagari glyphs, hence the boxes
    draw.text((10, 10), unicode_text, font=unicode_font, fill=font_color)
    im.save("text.jpg")
    # These two lines only do something if the image is also shown with cv2.imshow().
    if cv2.waitKey(0) == ord('q'):
        cv2.destroyAllWindows()
But the font does not recognize these characters, so the image consists of boxes and other unintelligible glyphs. Which font should I use to get the correct image? Or is there a better approach to converting text in characters like these to an image?
So I had a similar problem when I wanted to write text in Urdu onto images. Firstly, you need the correct font, since writing purely with PIL or even OpenCV requires the appropriate Unicode characters; and even when you get the appropriate font, the letters of one word come out disjointed, so you don't get the correct results.
To resolve this you have to stray a bit from the traditional Python-only approach. Since I was creating artificial datasets for an OCR, I needed to print large sets of such words onto a white background, so I decided to use graphics software for this; some, like Photoshop, even allow you to write scripts to automate the process.
The software I went for was GIMP, which allows you to quickly write and run extensions/scripts to automate the process. It lets you write an extension in Python, or more accurately a modified version of Python known as Python-Fu. Documentation was limited, so it was difficult to get started, but with some persistence I was able to write functions that read text from a text file, place it on white backgrounds, and save the results to disk.
I was able to generate around 300k images this way in a matter of hours. If you too are aiming to render large amounts of text, I would suggest relying on Python-Fu and GIMP.
For more info you may refer to the GIMP Python Documentation.
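For illustration only, a very rough Python-Fu fragment of the kind of function described above, meant to be pasted into GIMP's console (Filters > Python-Fu > Console, where gimp, pdb and the constants are already defined). The font name is just an example of an installed Devanagari-capable font, and constant names such as FILL_BACKGROUND and NORMAL_MODE differ slightly between GIMP 2.8 and 2.10, so treat this as a starting point rather than finished code:

# Render one word on a white 500x100 canvas and save it as PNG.
# "Lohit Devanagari" is an assumed font; use any Devanagari-capable font.
def render_word(text, path, font="Lohit Devanagari", size=32):
    img = gimp.Image(500, 100, RGB)
    bg = gimp.Layer(img, "bg", 500, 100, RGB_IMAGE, 100, NORMAL_MODE)
    img.add_layer(bg, 0)
    pdb.gimp_context_set_background((255, 255, 255))
    pdb.gimp_edit_fill(bg, FILL_BACKGROUND)        # BACKGROUND_FILL on GIMP 2.8
    pdb.gimp_context_set_foreground((0, 0, 0))
    pdb.gimp_text_fontname(img, None, 10, 10, text, 0, True, size, PIXELS, font)
    pdb.gimp_image_flatten(img)
    pdb.file_png_save(img, img.active_drawable, path, path, 0, 9, 1, 1, 1)
    pdb.gimp_image_delete(img)

render_word(u"संस्थानका", "/tmp/word_000.png")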
I have an image like the following:
and I want to extract the text from it, which should be ws35. I've tried the pytesseract library using the method:
pytesseract.image_to_string(Image.open(path))
but it returns nothing... Am I doing something wrong? How can I get the text back using OCR? Do I need to apply some filter to it?
You can try the following approach:
Binarize the image with a method of your choice (thresholding at 127 seems to be sufficient in this case).
Use a minimum filter to connect the loose dots into characters; a filter with r = 4 seems to work quite well here:
If necessary, the result can be improved further by applying a median blur (r = 4):
Because I personally do not use Tesseract, I was not able to try this picture with it, but online OCR tools seem to identify the sequence correctly (especially if you use the blurred version).
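A small sketch of those steps with Pillow; interpreting "r = 4" as a 9x9 window (2*r + 1) is an assumption about what the radius means:

# 1) fixed threshold at 127, 2) minimum filter so the dark dots grow and
# merge into strokes, 3) optional median filter to smooth the edges.
from PIL import Image, ImageFilter

img = Image.open("ws35.png").convert("L")
binary = img.point(lambda p: 255 if p > 127 else 0)
connected = binary.filter(ImageFilter.MinFilter(9))
smoothed = connected.filter(ImageFilter.MedianFilter(9))
smoothed.save("ws35_clean.png")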
Similar to @SilverMonkey's suggestion: Gaussian blur followed by Otsu thresholding.
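The same idea expressed with OpenCV; the blur kernel size is a guess to tune:

# Gaussian blur to merge the dots, then Otsu picks the threshold automatically.
import cv2
import pytesseract

gray = cv2.imread("ws35.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (9, 9), 0)
_, otsu = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(pytesseract.image_to_string(otsu))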
The problem is that this picture is low quality and very noisy!
Even professional and enterprise programs struggle with it.
You have most likely seen a CAPTCHA before; the reason those exist is that your answer, together with the image, is sent back to a database and then used to train computers to read images like these.
Short answer: pytesseract can't read the text inside this image, and most likely no module or professional program can read it either.
You may need to apply some image processing/enhancement to it. Look at this post, read the suggestions, and try to apply them.
I am trying to write code which, given an image, will run Tesseract on the entire image and return a map of all locations where text was detected (as a binary image).
It doesn't have to be pixel by pixel; a union of bounding boxes is more than enough.
Is there a way to do this?
Thanks in advance
Yes... (of course). Look at the Python Imaging Library for loading and cropping the image. Then you can apply Tesseract to each piece and check the output.
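Instead of cropping pieces by hand, one way to get the union of bounding boxes directly is pytesseract's image_to_data (the same call used in the invoice question above), which returns word-level boxes that can be painted into a binary mask; the confidence cut-off of 0 below is an arbitrary placeholder:

# Build a binary mask of all word bounding boxes reported by Tesseract.
import cv2
import numpy as np
import pytesseract

img = cv2.imread("page.png")
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

mask = np.zeros(img.shape[:2], dtype=np.uint8)
for i, conf in enumerate(data["conf"]):
    if float(conf) > 0:  # skip entries where no text was recognized
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        cv2.rectangle(mask, (x, y), (x + w, y + h), 255, thickness=-1)

cv2.imwrite("text_mask.png", mask)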
Have a look at the program I answered with a while back. It might help you with the elements you need. It lets you manually select an area and OCR it, but that can be easily changed.