I have been using Pytesseract to extract text from image. I am currently in a restoration task of an image document. Aside from extracting text from an image, I also wanted to identify each words font, font size, whether the character is capital or not, italicized or not, bold or not and so and so forth. Is this currently possible with Tesseract? I have read the documentation of Pytesseract, but found none about it. If this is not possible, how can I make it happen? Is there any open source font recognition API's? Thanks.
Related
I am currently using Pytesseract to extract text from images like Amazon, ebay, (e-commerce) etc to observe certain patterns. I do not want to use a web crawler since this is about recognising certain patterns from the text on such sites. The image example looks like this:
However every website looks different so template matching wouldn't help as well. Also the image background is not of the same colour.
The code gives me about 40% accuracy. But if I crop the images into smaller size, it gives me all the text correctly.
Is there a way to take in one image, crop it into multiple parts and then extract text? The preprocessing of images does not help. What I have tried is using: rescaling, removing noise, deskewing, skewing, adaptiveThreshold, grey scale,otsu, etc but I am unable to figure out what to do.
try:
from PIL import Image
except ImportError:
import Image
import pytesseract
# import pickle
def ocr_processing(filename):
"""
This function uses Pillow to open the file and Pytesseract to find string in image.
"""
text = pytesseract.image_to_data(Image.open(
filename), lang='eng', config='--psm 6')
# text = pytesseract.image_to_string(Image.open(
# filename), lang='eng', config ='--psm 11')
return text
Just for a recommendation if you have a lot of text and you want to detect it through OCR (example image is above), "Keras" is a very good option. Much much better than pytesseract or using just EAST. It was a suggestion provided in the comments section. It was able to trace 98.99% of the text correctly.
Here is the link to the Keras-ocr documentation: https://keras-ocr.readthedocs.io/en/latest/
I'm new to Tesseract and wanted to know if there were any ways to clean up photos for a simple OCR program to get better results. Thanks in advance for any help!
The code I am using:
#loads tesseract
tess.pytesseract.tesseract_cmd =
#filepath
file_path =
image = Image.open(file_path)
#processes image
text = tess.image_to_string(image, config='')
print(text)
I've used pytesseract in the past and with the following four modifications, I could read almost anything as long as the text font wasn't too small to begin with. pytesseract seems to struggle with small writing, even after resizing.
- Convert to Black & White -
Converting the images to black and white would frequently improve the recognition of the program. I used OpenCV to do so and got the code from the end of this article.
- Crop -
If all your photos are in similar format, as in the text you need is always in the same spot, I'd recommend cropping your pictures. If possible, pass only the exact part of the photo to pytesseract that you want analyzed, the less the program has to analyze, the better. In my case, I was taking screenshots and would specify the exact region of where to take one.
- Resize -
Another thing you can do is to play with the scaling of the original photo. Sometimes after resizing to almost double it's initial size pytesseract could read the text a lot easier. Generaly, bigger text is better but there's a limit as the photo can become too pixelated after resizing to be recognizable.
- Config -
I've noticed that pytesseract can recognize text a lot easier than numbers. If possible, break the photo down into sections and whenever you have a trouble spot with numbers you can use:
pytesseract.image_to_string(image, config='digits')
im trying to get the font size of text in an image, is there any library in python or can it be done using opencv? Thanks in advance
You can first use a text detector or a contour detector to get the height of one line of text. That can be used to get the size of text in pixels. But after that, you need some reference to convert it to font size (which is usually defined in points).
All of this can be done in OpenCV. If you post an image with text, some of us will be able to provide more detailed answers.
So, the idea here is that the given text, which happens to be Devanagari character such as संस्थानका कर्मचारी and I want to convert given text to image. Here is what I have attempted.
def draw_image(myString):
width=500
height=100
back_ground_color=(255,255,255)
font_size=10
font_color=(0,0,0)
unicode_text = myString
im = Image.new ( "RGB", (width,height), back_ground_color )
draw = ImageDraw.Draw (im)
unicode_font = ImageFont.truetype("arial.ttf", font_size)
draw.text ( (10,10), unicode_text, font=unicode_font, fill=font_color )
im.save("text.jpg")
if cv2.waitKey(0)==ord('q'):
cv2.destroyAllWindows()
But the font is not recognized, so the image consists of boxes, and other characters that are not understandable. So, which font should I use to get the correct image? Or is there any better approach to convert, the given text in character such as those, to image?
So I had a similar problem when I wanted to write text in Urdu onto images, firstly you need the correct font since writing purely with PIL or even openCV requires the appropriate Unicode characters, and even when you get the appropriate font the letters of one word are disjointed, and you don't get the correct results.
To resolve this you have to stray a bit from the traditional python-only approach since I was creating artificial datasets for an OCR, i needed to print large sets of such words onto a white background. I decided to use graphics software for this. Since some like photoshop even allows you to write scripts to automate processes.
The software I went for was GIMP, which allows you to quickly write and run extensions.scripts to automate the process. It allows you to write an extension in python, or more accurately a modified version of python, known as python-fu. Documentation was limited so it was difficult to get started, but with some persistence, I was able to write functions that would read text from a text file, and place them on white backgrounds and save to disk.
I was able to generate around 300k images from this in a matter of hours. I would suggest if you too are aiming for large amounts of text writing then you too rely on python-fu and GIMP.
For more info you may refer to the GIMP Python Documentation
i want to detect the font of text in an image so that i can do better OCR on it. searching for a solution i found this post. although it may seem that it is the same as my question, it does not exactly address my problem.
background
for OCR i am using tesseract, which uses trained data for recognizing text. training tesseract with lots of fonts reduces the accuracy which is natural and understandable. one solution is to build multiple trained data - one per few similar fonts - and then automatically use the appropriate data for each image. for this to work we need to be able to detect the font in image.
number 3 in this answer uses OCR to isolate image of characters along with their recognized character and then generates the same character's image with each font and compare them with the isolated image. in my case the user should provide a bounding box and the character associated with it. but because i want to OCR Arabic script(which is cursive and character shapes may vary depending on what other characters are adjacent to it) and because the bounding box may not be actually the minimal bounding box, i am not sure how i can do the comparing.
i believe Hausdorff distance is not applicable here. am i right?
shape context may be good(?) and there is a shapeContextDistanceExtractor class in opencv but i am not sure how i can use it in opencv-python
thank you
sorry for bad English