I am building a custom OCR pipeline for some documents. After extracting the ROIs I pass them to Tesseract. To improve accuracy I want to remove the background of the image. I am observing that when there are images like this:
Tesseract is not able to read anything (because of the lines in the image).
But for images like this: it gives correct results.
Can anyone suggest how to remove everything from the image except the text?
I am trying to implement an image recognition program and I need to remove (or "crop out") all the text present in the image, so for example from this:
to that:
I already tried the Keras OCR method, but firstly I don't need the background blur, I simply need to delete the text, and secondly it takes a lot of time and CPU power. Is there an easier way to detect those text regions and simply crop them out of the picture?
One way is to detect the text with findContours: contours whose area is below a threshold are letters. Then paint over those areas, and/or first find their bounding rectangles and paint over one big rectangle.
Text Extraction from image after detecting text region with contours
There is also pytesseract, which can detect letters and their regions, but I expect it to be heavier than the contour approach.
Here is an example project where I worked with pytesseract: How to obtain the best result from pytesseract?
I am currently using pytesseract to extract text from images of e-commerce sites such as Amazon and eBay to observe certain patterns. I do not want to use a web crawler, since this is about recognising certain patterns in the text on such sites. An example image looks like this:
However, every website looks different, so template matching wouldn't help either. Also, the image background is not always the same colour.
The code gives me about 40% accuracy, but if I crop the image into smaller pieces, it gives me all the text correctly.
Is there a way to take in one image, crop it into multiple parts and then extract the text? Preprocessing the images does not help: I have tried rescaling, noise removal, deskewing, skewing, adaptiveThreshold, greyscale, Otsu, etc., but I am unable to figure out what to do.
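The crop-into-parts step being asked about can be sketched like this; the 2x2 grid is an arbitrary illustration, and the commented OCR call assumes pytesseract is installed:

```python
import numpy as np

def tile_image(img, rows, cols):
    """Split an image array into a rows x cols grid of crops."""
    h, w = img.shape[:2]
    tiles = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            tiles.append(img[y0:y1, x0:x1])
    return tiles

# Demo on a blank page-sized array; a real screenshot would come from
# cv2.imread(...) or np.array(Image.open(...))
page = np.zeros((100, 200, 3), dtype=np.uint8)
tiles = tile_image(page, 2, 2)

# OCR each tile and join the results (pytesseract assumed installed):
# import pytesseract
# text = "\n".join(pytesseract.image_to_string(t, config='--psm 6') for t in tiles)
```

Splitting along whitespace rows (projection profiles) instead of a fixed grid avoids cutting through lines of text, at the cost of more code.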
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
# import pickle


def ocr_processing(filename):
    """
    This function uses Pillow to open the file and pytesseract
    to find the string in the image.
    """
    text = pytesseract.image_to_data(
        Image.open(filename), lang='eng', config='--psm 6')
    # text = pytesseract.image_to_string(
    #     Image.open(filename), lang='eng', config='--psm 11')
    return text
Just as a recommendation: if you have a lot of text and you want to detect it through OCR (example image above), keras-ocr is a very good option, much better than pytesseract or using EAST alone. It was a suggestion provided in the comments section, and it was able to trace 98.99% of the text correctly.
Here is the link to the Keras-ocr documentation: https://keras-ocr.readthedocs.io/en/latest/
I'm new to Tesseract and wanted to know if there are any ways to clean up photos so that a simple OCR program gets better results. Thanks in advance for any help!
The code I am using:
import pytesseract as tess
from PIL import Image

# loads tesseract
tess.pytesseract.tesseract_cmd =
# filepath
file_path =
image = Image.open(file_path)
# processes image
text = tess.image_to_string(image, config='')
print(text)
I've used pytesseract in the past and with the following four modifications, I could read almost anything as long as the text font wasn't too small to begin with. pytesseract seems to struggle with small writing, even after resizing.
- Convert to Black & White -
Converting the images to black and white would frequently improve the program's recognition. I used OpenCV to do so and got the code from the end of this article.
- Crop -
If all your photos are in a similar format, i.e. the text you need is always in the same spot, I'd recommend cropping your pictures. If possible, pass pytesseract only the exact part of the photo you want analyzed; the less the program has to analyze, the better. In my case, I was taking screenshots and would specify the exact region where to capture one.
- Resize -
Another thing you can do is play with the scaling of the original photo. Sometimes, after resizing to almost double its initial size, pytesseract could read the text a lot more easily. Generally, bigger text is better, but there's a limit: the photo can become too pixelated after resizing to be recognizable.
- Config -
I've noticed that pytesseract can recognize text a lot more easily than numbers. If possible, break the photo down into sections, and whenever you have a trouble spot with numbers you can use:
pytesseract.image_to_string(image, config='digits')
I'm trying to compare two images of usernames and check whether both are the same. I can't use Tesseract OCR because the usernames can contain letters from two or three different languages, and because of this Tesseract is not able to parse the text from the image. I used ImageHash to try to figure out whether the images are similar.
But when I try to compare this image:
then ImageHash tells me that the username Mustang1202 is more similar to this image than Mustang1203 is.
Is there a different way I can detect similar text in images?
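One alternative to perceptual hashing, which smooths away exactly the fine strokes that distinguish similar usernames, is to binarize both images and compare the ink masks directly, e.g. with a Dice overlap score. This is only a sketch; the 128 threshold, the Dice measure, and the rendered demo usernames are illustrative choices:

```python
import numpy as np
from PIL import Image, ImageDraw

def text_image_similarity(a, b):
    """Binarize both images and compare pixel overlap (Dice coefficient)."""
    a = np.array(a.convert("L").resize(b.size)) < 128   # True where ink is
    b = np.array(b.convert("L")) < 128
    inter = np.logical_and(a, b).sum()
    return 2 * inter / max(a.sum() + b.sum(), 1)

# Synthetic demo: the same word should match itself better than a near-miss
def render(word):
    img = Image.new("L", (120, 20), 255)
    ImageDraw.Draw(img).text((2, 2), word, fill=0)
    return img

same = text_image_similarity(render("Mustang1203"), render("Mustang1203"))
diff = text_image_similarity(render("Mustang1202"), render("Mustang1203"))
```

This only works if both screenshots use the same font and the images are aligned; otherwise the crops need to be registered first.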
I'm trying to recognize the page number in a colored image, but pytesseract recognizes nothing with the command:
print(pytesseract.image_to_string(Image.open("C:\\Users\\user\\Desktop\\test\\colorimage.jpg")))
Then I tried binarizing the same image: same result.
Finally I decided to adjust the binarized image by creating a "unique" background.
Check this results map: https://i.stack.imgur.com/sMD3M.jpg
How can I recognize the text in Python without editing the original image?