I have N scanned images. The images contain different languages, like Chinese, Arabic, and Japanese. I tried to run OCR on the files using OCRmyPDF and tesseract; both require the language of the input file, but I don't know the language of each file. So my question is: how do I identify the languages used in an image?
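One way to narrow this down is Tesseract's own orientation and script detection (OSD). It reports the writing script (Latin, Arabic, Han, and so on) rather than the exact language, so mapping the script to a tesseract language code is still up to you. A minimal sketch, assuming pytesseract and the osd traineddata are installed (the file name is a placeholder):

import pytesseract
from PIL import Image

# OSD returns page orientation plus the detected script and a confidence
osd = pytesseract.image_to_osd(Image.open("scan_001.png"))
print(osd)  # contains lines like "Script: Arabic" and "Script confidence: ..."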
So the idea here is that I have text in Devanagari characters, such as संस्थानका कर्मचारी, and I want to convert the given text to an image. Here is what I have attempted.
from PIL import Image, ImageDraw, ImageFont
import cv2

def draw_image(myString):
    width = 500
    height = 100
    back_ground_color = (255, 255, 255)
    font_size = 10
    font_color = (0, 0, 0)
    unicode_text = myString
    im = Image.new("RGB", (width, height), back_ground_color)
    draw = ImageDraw.Draw(im)
    # arial.ttf carries no Devanagari glyphs, which is why boxes show up
    unicode_font = ImageFont.truetype("arial.ttf", font_size)
    draw.text((10, 10), unicode_text, font=unicode_font, fill=font_color)
    im.save("text.jpg")
    # show the saved image; waitKey only works with an open OpenCV window
    cv2.imshow("text", cv2.imread("text.jpg"))
    if cv2.waitKey(0) == ord('q'):
        cv2.destroyAllWindows()
But the font is not recognized, so the image consists of boxes and other characters that are not understandable. So which font should I use to get the correct image? Or is there a better approach to convert text in characters such as those to an image?
So I had a similar problem when I wanted to write text in Urdu onto images. Firstly, you need the correct font, since writing purely with PIL or even OpenCV requires the appropriate Unicode glyphs; and even when you get the appropriate font, the letters of one word come out disjointed, so you don't get the correct results.
To resolve this you have to stray a bit from the traditional Python-only approach. Since I was creating artificial datasets for an OCR, I needed to print large sets of such words onto a white background, so I decided to use graphics software for this; some, like Photoshop, even allow you to write scripts to automate processes.
The software I went for was GIMP, which allows you to quickly write and run extensions/scripts to automate the process. It lets you write an extension in Python, or more accurately a modified version of Python known as Python-Fu. Documentation was limited, so it was difficult to get started, but with some persistence I was able to write functions that would read text from a text file, place it on white backgrounds, and save the results to disk.
I was able to generate around 300k images from this in a matter of hours. If you too are aiming for large amounts of text writing, I would suggest you also rely on Python-Fu and GIMP.
For more info you may refer to the GIMP Python Documentation.
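For a rough idea of what such a Python-Fu script can look like, here is a hedged sketch for GIMP 2.8-era Python-Fu, run from GIMP's Python-Fu console; the canvas size, font size, and font name are assumptions to adjust for your own setup:

from gimpfu import *

def render_words(text_path, out_dir):
    with open(text_path) as f:
        words = [w.strip() for w in f if w.strip()]
    for i, word in enumerate(words):
        image = pdb.gimp_image_new(400, 100, RGB)
        layer = pdb.gimp_layer_new(image, 400, 100, RGB_IMAGE,
                                   "bg", 100, NORMAL_MODE)
        pdb.gimp_image_insert_layer(image, layer, None, 0)
        pdb.gimp_drawable_fill(layer, WHITE_FILL)
        pdb.gimp_context_set_foreground((0, 0, 0))
        # "Jameel Noori Nastaleeq" is a hypothetical Urdu font name;
        # pass any font GIMP lists on your machine
        pdb.gimp_text_fontname(image, None, 10, 10, word,
                               0, True, 48, PIXELS, "Jameel Noori Nastaleeq")
        drawable = pdb.gimp_image_flatten(image)
        pdb.file_png_save(image, drawable,
                          "%s/word_%06d.png" % (out_dir, i), "raw",
                          0, 9, 1, 1, 1, 1, 1)
        pdb.gimp_image_delete(image)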
I want to extract table information from OCR data; I have the raw recognized text.
I tried pytesseract but couldn't work out an actual implementation.
Here is an image: https://drive.google.com/open?id=1CGJwbmf5snoXvwlQAsRAxIRRixbT_Q8l
I tried this: https://github.com/WZBSocialScienceCenter/pdftabextract
This method didn't work for me at all.
I want to recover a tabular structure for this table from the OCR data, for my further processing.
pdftabextract is not an OCR tool. It requires scanned pages with OCR information, i.e. a "sandwich PDF" that contains both the scanned images and the recognized text. You need software like tesseract or ABBYY FineReader for the OCR step.
Please try tesseract itself; it is relatively easier to work with.
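As a starting point for the table itself, pytesseract's image_to_data gives word-level bounding boxes that you can group into rows by vertical position. A rough sketch, assuming pytesseract is installed; the file name and the 10 px row tolerance are assumptions to tune:

import pytesseract
from PIL import Image

data = pytesseract.image_to_data(Image.open("table.png"),
                                 output_type=pytesseract.Output.DICT)

rows = {}
for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    # bucket words into rows by their top coordinate (10 px tolerance is a guess)
    row_key = data["top"][i] // 10
    rows.setdefault(row_key, []).append((data["left"][i], word))

for _, cells in sorted(rows.items()):
    print("\t".join(w for _, w in sorted(cells)))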
I need to read text using Tesseract OCR, and I need to get the character positions from the image. Is there any way to achieve this task? Please help me.
I got the answer: I am using Tesseract with hOCR.
hOCR is an open standard of data representation for formatted text obtained from optical character recognition. The definition encodes text, style, layout information, recognition confidence metrics, and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML.
The command-line syntax looks like this (the second argument is the output base name, so this writes someimage.hocr):
tesseract someimage.jpg someimage hocr
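The word boxes then live in the title attribute of span elements with class ocrx_word in the output file (someimage.hocr here; older tesseract versions write a .html extension). A minimal parsing sketch, assuming beautifulsoup4 is installed:

from bs4 import BeautifulSoup

with open("someimage.hocr", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

for word in soup.find_all("span", class_="ocrx_word"):
    # title looks like: bbox 393 432 518 462; x_wconf 95
    bbox = [int(v) for v in word["title"].split(";")[0].split()[1:]]
    print(word.get_text(strip=True), bbox)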
I want to detect the font of the text in an image so that I can do better OCR on it. Searching for a solution, I found this post. Although it may seem the same as my question, it does not exactly address my problem.
Background
For OCR I am using tesseract, which uses trained data for recognizing text. Training tesseract with lots of fonts reduces accuracy, which is natural and understandable. One solution is to build multiple sets of trained data, one per few similar fonts, and then automatically use the appropriate data for each image. For this to work we need to be able to detect the font in the image.
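The dispatch step itself is simple once font detection exists; here is a sketch under that assumption, where detect_font is a hypothetical stub standing in for exactly the missing piece and the traineddata names are made up:

import pytesseract
from PIL import Image

# assumed names for per-font traineddata files
FONT_TO_LANG = {"naskh": "ara_naskh", "nastaliq": "ara_nastaliq"}

def detect_font(img):
    # hypothetical placeholder: this is the part the question asks about
    return "naskh"

def ocr(path):
    img = Image.open(path)
    lang = FONT_TO_LANG.get(detect_font(img), "ara")
    return pytesseract.image_to_string(img, lang=lang)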
Number 3 in this answer uses OCR to isolate the image of each character along with its recognized character, then generates the same character's image with each font and compares them with the isolated image. In my case the user should provide a bounding box and the character associated with it. But because I want to OCR Arabic script (which is cursive, and character shapes may vary depending on which other characters are adjacent) and because the bounding box may not actually be the minimal bounding box, I am not sure how I can do the comparison.
I believe the Hausdorff distance is not applicable here. Am I right?
Shape context may be good(?), and there is a ShapeContextDistanceExtractor class in OpenCV, but I am not sure how I can use it from opencv-python (a sketch of a possible call sequence appears below).
Thank you, and sorry for my bad English.
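For reference, a minimal sketch of driving ShapeContextDistanceExtractor from opencv-python on two isolated character images; the file names and the 200-point sample size are assumptions:

import cv2
import numpy as np

def sample_points(path, n=200):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, bw = cv2.threshold(img, 0, 255,
                          cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # [-2] keeps this working across OpenCV 3.x and 4.x return signatures
    contours = cv2.findContours(bw, cv2.RETR_LIST,
                                cv2.CHAIN_APPROX_NONE)[-2]
    pts = np.vstack([c.reshape(-1, 2) for c in contours])
    idx = np.linspace(0, len(pts) - 1, n).astype(int)
    return pts[idx].reshape(-1, 1, 2).astype(np.float32)

a = sample_points("isolated_char.png")   # crop from the user's bounding box
b = sample_points("rendered_char.png")   # same character rendered in a candidate font
extractor = cv2.createShapeContextDistanceExtractor()
print(extractor.computeDistance(a, b))   # lower distance means more similar shapes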
I want to write a script that converts unknown images (JPG, PNG, GIF, BMP, TIFF, etc.) to a specific resolution and format, as well as generating a thumbnail.
The problem is that a compression level that is totally fine for photos produces crap for exports of presentations, for example; so I want to vary the conversion settings based on the contents of the image.
Does anyone have experience doing that kind of stuff in Python (or shell scripts whose output is easily parseable)?
My ideas are:
increase the contrast and check the histogram for whether only single spikes are left (a rough sketch of this follows at the end of this question)
do a high-pass filtering of the image and check... what, exactly?
do face recognition of known letters
The goal is that the recognition should be quite fast (approx. 10 images/second) and fairly easy to implement.
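For the histogram idea, a cheap proxy is counting distinct colors after downscaling: presentation-style graphics collapse to a small palette, while photos do not. A rough sketch with PIL; the 64-color threshold is a guess to tune on your own data:

from PIL import Image

def looks_like_graphics(path, max_colors=64):
    im = Image.open(path).convert("RGB").resize((128, 128))
    # getcolors() returns None once the image has more than max_colors colors
    return im.getcolors(maxcolors=max_colors) is not None

if looks_like_graphics("slide_export.png"):
    print("use lossless / high-quality settings")
else:
    print("use normal photo compression")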
This is a pretty trivial machine learning problem. I would research the MNIST dataset problems that teach you how to recognize handwritten characters; this process should be very similar. Check out this tutorial and see if you can modify it to recognize graphics vs. pictures. If your error rate ends up too high, you'll have to try more advanced machine learning techniques.
http://mxnet.io/tutorials/python/mnist.html