OCR text extraction from user interfaces image - python

I am currently using Pytesseract to extract text from screenshots of e-commerce sites such as Amazon and eBay in order to observe certain patterns. I do not want to use a web crawler, since this is about recognising patterns in the text shown on such sites. The image example looks like this:
However, every website looks different, so template matching wouldn't help either. The image backgrounds are also not all the same colour.
The code gives me about 40% accuracy, but if I crop the images into smaller pieces, it extracts all the text correctly.
Is there a way to take in one image, crop it into multiple parts and then extract the text? Preprocessing the images does not help. I have tried rescaling, removing noise, deskewing, skewing, adaptiveThreshold, greyscale, Otsu thresholding, etc., but I am unable to figure out what to do.
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
# import pickle


def ocr_processing(filename):
    """
    This function uses Pillow to open the file and Pytesseract to find string in image.
    """
    text = pytesseract.image_to_data(Image.open(
        filename), lang='eng', config='--psm 6')
    # text = pytesseract.image_to_string(Image.open(
    #     filename), lang='eng', config='--psm 11')
    return text
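
Since cropping is what improves the results, one way to automate it is to split the image into a grid of tiles and run Tesseract on each tile. The following is only a rough sketch under assumptions: the helper names `tile_boxes` and `ocr_in_tiles`, the 2x2 grid and the 20-pixel overlap are all hypothetical choices, not something taken from the question. Text that straddles a tile border may still be cut off or read twice, which is why the tiles overlap slightly.

```python
def tile_boxes(width, height, rows=2, cols=2, overlap=0):
    """Return (left, upper, right, lower) crop boxes that cover the image in a grid.

    Each box is widened by `overlap` pixels on every side (clamped to the image)
    so that text lying on a tile border is less likely to be cut in half.
    """
    boxes = []
    for r in range(rows):
        upper = max(0, r * height // rows - overlap)
        lower = min(height, (r + 1) * height // rows + overlap)
        for c in range(cols):
            left = max(0, c * width // cols - overlap)
            right = min(width, (c + 1) * width // cols + overlap)
            boxes.append((left, upper, right, lower))
    return boxes


def ocr_in_tiles(filename, rows=2, cols=2, overlap=20):
    """Crop the image into tiles and return one OCR string per tile."""
    # Imported here so tile_boxes() can be used without Pillow or Tesseract.
    from PIL import Image
    import pytesseract

    image = Image.open(filename)
    texts = []
    for box in tile_boxes(image.width, image.height, rows, cols, overlap):
        tile = image.crop(box)
        texts.append(pytesseract.image_to_string(tile, lang='eng', config='--psm 6'))
    return texts
```

Calling `ocr_in_tiles(filename)` returns the tile texts in row-major order (left to right, top to bottom). The grid size that works best will depend on the layout of each page, so it would need tuning per site.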

Just as a recommendation: if you have a lot of text and want to detect it through OCR (as in the example image above), keras-ocr is a very good option, much better than pytesseract or using EAST alone. It was suggested in the comments section, and it was able to recognise 98.99% of the text correctly.
Here is the link to the Keras-ocr documentation: https://keras-ocr.readthedocs.io/en/latest/
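
For reference, a minimal sketch of how keras-ocr might be called on a single screenshot. It uses the library's pretrained pipeline, which returns each word together with the four corner points of its box, in no guaranteed order. The helper `words_in_reading_order` and its `line_tolerance` value are hypothetical additions for grouping those words into lines; they are not part of keras-ocr.

```python
def words_in_reading_order(predictions, line_tolerance=10):
    """Group (word, box) pairs into lines, top to bottom, then left to right.

    `box` is a sequence of four (x, y) corner points. Words whose top edge is
    within `line_tolerance` pixels of the first word of a line join that line.
    """
    items = []
    for word, box in predictions:
        xs = [float(point[0]) for point in box]
        ys = [float(point[1]) for point in box]
        items.append((min(ys), min(xs), word))
    items.sort()  # by top edge first, then by left edge

    lines = []
    for top, left, word in items:
        if lines and top - lines[-1]["top"] <= line_tolerance:
            lines[-1]["words"].append((left, word))
        else:
            lines.append({"top": top, "words": [(left, word)]})
    return [" ".join(w for _, w in sorted(line["words"])) for line in lines]


def recognize_with_keras_ocr(filename):
    """Run the pretrained keras-ocr pipeline on one image and return text lines."""
    # Imported here because keras-ocr is a heavy dependency.
    import keras_ocr

    pipeline = keras_ocr.pipeline.Pipeline()  # fetches pretrained weights on first use
    image = keras_ocr.tools.read(filename)
    predictions = pipeline.recognize([image])[0]  # list of (word, box) for this image
    return words_in_reading_order(predictions)
```

The line grouping is deliberately simple and assumes roughly horizontal text; pages with several columns would need a smarter layout step.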

Related

How to find and remove watermarks in pdf using python?

I am currently using Python to remove watermarks from PDF files. For example, I have a file like this:
The green shape in the center of the page is the watermark. I think it's not stored in the PDF as text, because I can't find that text by simply searching in the Edge browser (which can read PDF files).
I also cannot find the watermark as an image. I extracted all images from the PDF using PyMuPDF, and the watermark (which is supposed to appear on each page) was not among them.
The code I used for extracting is like this:
import fitz  # PyMuPDF

document = fitz.open(self.input)
for each_page in document:
    image_list = each_page.getImageList()
    for image_info in image_list:
        pix = fitz.Pixmap(document, image_info[0])
        png = pix.tobytes()  # return picture in png format
        if png == watermark_image:
            document._deleteObject(image_info[0])
document.save(out_filename)
So how do I find and remove the watermark using python's libraries? How is the watermark stored inside a PDF?
Are there any other "magic" libraries that can do this task, other than PyMuPDF?
For anyone interested in details see the solution provided here.
Removal of the type of watermark used in this file works with PyMuPDF's low-level code interface. There is no direct, specialized high-level API for doing this.

Ways to clean up photos for Tesseract OCR?

I'm new to Tesseract and wanted to know if there were any ways to clean up photos for a simple OCR program to get better results. Thanks in advance for any help!
The code I am using:
import pytesseract as tess
from PIL import Image

# loads tesseract
tess.pytesseract.tesseract_cmd =
# filepath
file_path =
image = Image.open(file_path)
# processes image
text = tess.image_to_string(image, config='')
print(text)
I've used pytesseract in the past and with the following four modifications, I could read almost anything as long as the text font wasn't too small to begin with. pytesseract seems to struggle with small writing, even after resizing.
- Convert to Black & White -
Converting the images to black and white frequently improved recognition. I used OpenCV to do so and got the code from the end of this article.
- Crop -
If all your photos are in a similar format, as in the text you need is always in the same spot, I'd recommend cropping your pictures. If possible, pass only the exact part of the photo that you want analyzed to pytesseract. The less the program has to analyze, the better. In my case, I was taking screenshots and would specify the exact region to capture.
- Resize -
Another thing you can do is play with the scaling of the original photo. Sometimes, after resizing to almost double its initial size, pytesseract could read the text a lot more easily. Generally, bigger text is better, but there's a limit, as the photo can become too pixelated after resizing to be recognizable.
- Config -
I've noticed that pytesseract recognizes text a lot more easily than numbers. If possible, break the photo down into sections, and whenever you have a trouble spot with numbers you can use:
pytesseract.image_to_string(image, config='digits')
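
Putting the four steps together, here is a rough sketch using Pillow rather than OpenCV (a swap made only to keep the example short). The function names `preprocess` and `read_text`, the fixed threshold of 128 and the default scale factor of 2 are assumptions for illustration and will likely need tuning for your photos.

```python
from PIL import Image


def preprocess(image, crop_box=None, scale=2.0, threshold=128):
    """Crop, enlarge and binarize a Pillow image before passing it to Tesseract.

    crop_box is (left, upper, right, lower) in pixels, or None to keep everything.
    """
    if crop_box is not None:
        image = image.crop(crop_box)          # Crop: keep only the region of interest
    gray = image.convert("L")                 # greyscale, one 0-255 value per pixel
    new_size = (max(1, round(gray.width * scale)),
                max(1, round(gray.height * scale)))
    gray = gray.resize(new_size, Image.BICUBIC)   # Resize: enlarge small text
    # Black & white: every pixel becomes either 0 (black) or 255 (white).
    return gray.point(lambda p: 255 if p >= threshold else 0)


def read_text(image, digits_only=False):
    """OCR a preprocessed image; use the 'digits' config for number-only regions."""
    # Imported here so preprocess() can be tried without Tesseract installed.
    import pytesseract

    config = "digits" if digits_only else ""
    return pytesseract.image_to_string(image, config=config)
```

For example, `read_text(preprocess(Image.open(file_path), crop_box=(0, 0, 400, 100)), digits_only=True)` would OCR only the top-left 400x100 region as digits; the box values here are placeholders.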

OCR Tesseract - Get Image Font Attributes

I have been using Pytesseract to extract text from images. I am currently working on a restoration task for an image document. Aside from extracting text from an image, I also want to identify each word's font, font size, whether the characters are capitalized or not, italicized or not, bold or not, and so on. Is this currently possible with Tesseract? I have read the Pytesseract documentation but found nothing about it. If this is not possible, how can I make it happen? Are there any open source font recognition APIs? Thanks.

Can OpenCV or PyTesseract recognize fonts

Using the below code I am able to read all text in an image:
import cv2
import pytesseract

img = cv2.imread(r'/<path_to_image>/text.png')
print(pytesseract.image_to_string(img))
What I want to know is whether OpenCV or PyTesseract supports text extraction based on font name. For example, if some of the text is in Times New Roman and the rest is in Arial, extract only the Times New Roman text. Something like this:
print(pytesseract.image_to_string(img, lang='font'))
Of course not. Tesseract can hardly tell a G from a 6, and OpenCV is a computer vision library.

Remove background of image containing text

I am building a custom OCR pipeline for some documents. After extracting the ROIs, I pass them to Tesseract. To improve accuracy, I want to remove the background of the image. I observe that when there are images like this:
Tesseract is not able to read anything (because of the lines in the image).
But for images like this, it gives correct results.
Can anyone suggest how to remove everything from the image except the text?
