I have N number of scanned images. the images contain different languages like Chinese, Arabic and Japanese. I tried to do OCR process for the files using OCRMYPDF and tesseract. both required the language of the input file. But I don't know the language of the file. So, Now my question is how to identify the languages used in a image ?
I have been using Pytesseract to extract text from image. I am currently in a restoration task of an image document. Aside from extracting text from an image, I also wanted to identify each words font, font size, whether the character is capital or not, italicized or not, bold or not and so and so forth. Is this currently possible with Tesseract? I have read the documentation of Pytesseract, but found none about it. If this is not possible, how can I make it happen? Is there any open source font recognition API's? Thanks.
So, the idea here is that the given text, which happens to be Devanagari character such as संस्थानका कर्मचारी and I want to convert given text to image. Here is what I have attempted.
def draw_image(myString):
width=500
height=100
back_ground_color=(255,255,255)
font_size=10
font_color=(0,0,0)
unicode_text = myString
im = Image.new ( "RGB", (width,height), back_ground_color )
draw = ImageDraw.Draw (im)
unicode_font = ImageFont.truetype("arial.ttf", font_size)
draw.text ( (10,10), unicode_text, font=unicode_font, fill=font_color )
im.save("text.jpg")
if cv2.waitKey(0)==ord('q'):
cv2.destroyAllWindows()
But the font is not recognized, so the image consists of boxes, and other characters that are not understandable. So, which font should I use to get the correct image? Or is there any better approach to convert, the given text in character such as those, to image?
So I had a similar problem when I wanted to write text in Urdu onto images, firstly you need the correct font since writing purely with PIL or even openCV requires the appropriate Unicode characters, and even when you get the appropriate font the letters of one word are disjointed, and you don't get the correct results.
To resolve this you have to stray a bit from the traditional python-only approach since I was creating artificial datasets for an OCR, i needed to print large sets of such words onto a white background. I decided to use graphics software for this. Since some like photoshop even allows you to write scripts to automate processes.
The software I went for was GIMP, which allows you to quickly write and run extensions.scripts to automate the process. It allows you to write an extension in python, or more accurately a modified version of python, known as python-fu. Documentation was limited so it was difficult to get started, but with some persistence, I was able to write functions that would read text from a text file, and place them on white backgrounds and save to disk.
I was able to generate around 300k images from this in a matter of hours. I would suggest if you too are aiming for large amounts of text writing then you too rely on python-fu and GIMP.
For more info you may refer to the GIMP Python Documentation
I have to do the following task on Python and I have not idea where
to begin:
OCR of handwritten dates
Page/document orientation detection for pretreatment
Stamp and Logo Detection and classification
a. Orientation variation included
b. Quality degradation to be considered
c. Overlying Primary Content
Anybody could help me?
THANKS IN ADVANCE¡¡
You can ocrmypdf to extract text from a pdf. It will extract text from the page and return a pdf same like original pdf with text on it. For detection of logos, you need to implement a computer vision-based model. if you need more details then please specify your requirement in details
I want to extract table information from OCR data, I have raw text and it's text.
I tried pytesseract but couldn't find the actual Implementation.
Here is an image: https://drive.google.com/open?id=1CGJwbmf5snoXvwlQAsRAxIRRixbT_Q8l
I tried this: https://github.com/WZBSocialScienceCenter/pdftabextract
this method didn't work for me at all.
I want a tabular structure of this table from OCR data for my further processing.
pdftabextract is not an OCR. It requires scanned pages with OCR
information, i.e. a "sandwich PDF" that contains both the scanned
images and the recognized text. You need software like tesseract or
ABBYY Finereader for OCR.
Please try tesseract it has a relatively easier implementation.