Detecting Bangla character using pytesseract


I am trying to detect Bangla characters in an image using Python, so I decided to use pytesseract. For this purpose I used the code below:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
im = Image.open("input.png") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('temp2.png'),lang="ben")
print(text)
The problem is that when I give it an image of English characters, it detects them. But when I set lang="ben" and run it on an image of Bengali characters, my code runs forever.
P.S.: I have downloaded the Bengali trained data to the tessdata folder, and I am running this in PyCharm.
Can anyone help me solve this problem?
sample of input.png

I added the Bangla (India) language to Windows and downloaded ben.traineddata to TESSDATA_PREFIX, which is C:\Program Files\Tesseract 4.0.0\tessdata on my PC. Then I ran
> tesseract -l ben bangla.jpg bangla_out
in a command prompt and got the result below in 2 seconds. The result looks fine, even though I don't understand the language.
Have you tried running tesseract in a command prompt to verify that it works with -l ben?
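A rough way to do the same check from Python, in case it helps. The sketch below first looks for the trained-data file and then runs the command-line tool with a timeout, so a hang shows up as a timeout instead of a script that never returns. The helper names (has_traineddata, build_command, run_with_timeout) are made up for illustration, and the paths in the usage comment are assumed from my setup above.

```python
import subprocess
from pathlib import Path


def has_traineddata(tessdata_dir, lang):
    """Return True if <lang>.traineddata exists in the given tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


def build_command(exe, image, out_base, lang):
    """Build the argument list: tesseract <image> <output base> -l <lang>."""
    return [str(exe), str(image), str(out_base), "-l", lang]


def run_with_timeout(cmd, seconds):
    """Run cmd and return its exit code, or None if it did not finish in time."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode


# Example usage (paths are assumptions, adjust them to your install):
#   tessdata = r"C:\Program Files\Tesseract 4.0.0\tessdata"
#   exe = r"C:\Program Files\Tesseract 4.0.0\tesseract.exe"
#   if not has_traineddata(tessdata, "ben"):
#       print("ben.traineddata is missing from", tessdata)
#   else:
#       code = run_with_timeout(build_command(exe, "bangla.jpg", "bangla_out", "ben"), 60)
#       print("timed out" if code is None else "exit code %d" % code)
```

If the command finishes quickly here but pytesseract still hangs, the problem is more likely in how pytesseract is pointed at the executable than in the language data.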
EDIT:
I used Spyder, which comes with Anaconda and is similar to PyCharm, to test it. I modified your code to call Tesseract as below.
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
Test Code in Spyder:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
import os
im = Image.open("bangla.jpg") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save("bangla_pp.jpg")
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
text = pytesseract.image_to_string(Image.open("bangla_pp.jpg"),lang="ben")
print(text)
It works and produced the result below on the processed image. Apparently, the OCR result for the processed image is not as good as for the original one.
Result from the processed bangla_pp.jpg:
প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে
-~~-<~~~~--
প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
= পাবেন তার
Result from the original image, fed directly to Tesseract.
Code:
from PIL import Image
import pytesseract as tess
print(tess.image_to_string(Image.open('bangla.jpg'), lang='ben'))
Output:
প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে
প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
পাবেন তার
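To see exactly what the preprocessing added, the two outputs can be compared line by line. This is only a small sketch using the two results shown above; noise_lines is a made-up helper name, and exact string matching is a crude measure that I am assuming is good enough here.

```python
def noise_lines(candidate, reference):
    """Return the lines of candidate that do not appear verbatim in reference."""
    ref = set(reference)
    return [line for line in candidate if line not in ref]


# OCR output from the preprocessed image (bangla_pp.jpg), as shown above.
processed = [
    "প্রত্যাবর্তনকারীরা",
    "তাঁদের দেশে গিয়ে",
    "-~~-<~~~~--",
    "প্রত্যাবর্তন-পরবর্তী",
    "আর্থিক সহায়তা",
    "= পাবেন তার",
]

# OCR output from the original image (bangla.jpg), as shown above.
original = [
    "প্রত্যাবর্তনকারীরা",
    "তাঁদের দেশে গিয়ে",
    "প্রত্যাবর্তন-পরবর্তী",
    "আর্থিক সহায়তা",
    "পাবেন তার",
]

# Lines that only the preprocessed run produced.
extra = noise_lines(processed, original)
```

The extra lines are the stray symbols, which suggests the median filter plus 1-bit conversion is turning background texture into marks that Tesseract tries to read.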

I installed some fonts in Windows from here:
https://www.omicronlab.com/bangla-fonts.html
After that, it worked perfectly for me in PyCharm.

Related

How can I fix my python code to save and resize images using glob on linux

I wrote the following code to resize the images in a folder to 100x100 and save them in the same folder using a for loop. I am wondering why it isn't working.
The following is the code I have written:
import cv2
import glob
images=glob.glob("*.jpg")
for image in images:
    img=cv2.imread(image,0)
    re=cv2.resize(img,(100,100))
    cv2.imshow("Hey",re)
    cv2.waitKey(500)
    cv2.destroyAllWindows()
    cv2.imwrite("resized_"+image,re)
I executed this:
nsu@NSU:~/Desktop/cryptology$ python3 img2.py
I got no error:
nsu@NSU:~/Desktop/cryptology$ python3 img2.py
nsu@NSU:~/Desktop/cryptology$
But the folder where I saved the images and the code is unchanged.
What should I do?
**A viewer posted an answer, which gave the same result.
**The problem MAY not be with the code.
**Please consider this.
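One thing worth ruling out (a guess, not confirmed anywhere in this thread): glob.glob("*.jpg") only looks in the current working directory, and on Linux the match is case-sensitive, so files named .JPG or .jpeg give an empty list. The loop body then never runs and no error is reported. A minimal sketch to list what is actually there; find_jpegs is a hypothetical helper name:

```python
import os


def find_jpegs(folder="."):
    """Return paths of JPEG files in folder, accepting .jpg/.jpeg in any letter case."""
    matches = []
    for name in sorted(os.listdir(folder)):
        ext = os.path.splitext(name)[1].lower()
        if ext in (".jpg", ".jpeg"):
            matches.append(os.path.join(folder, name))
    return matches


# Example usage:
#   files = find_jpegs(".")
#   print(len(files), "JPEG files found in", os.getcwd())
```

If this prints 0, the script is either running in a different directory than the images or the extensions differ from what the pattern expects.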

If you like, you can use Pillow, a very robust Python module for image manipulation. First run pip3 install pillow,
then run this code:
from PIL import Image
import glob
images=glob.glob("*.jpg")
for im in images:
    # print(im)
    img = Image.open(im)
    # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
    img = img.resize((100, 100), Image.LANCZOS)
    img.save(im+'_resized.jpg')
Hope it helps. You will need to improve the code if you want to keep the aspect ratio of the image.
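For the aspect ratio part, one possible approach is to compute the target size yourself before resizing. This is just a sketch; fit_within is a name I made up, and the 100x100 box is taken from the question.

```python
def fit_within(size, box=(100, 100)):
    """Scale (width, height) so it fits inside box while keeping the aspect ratio."""
    width, height = size
    # Use the smaller of the two scale factors so neither side exceeds the box.
    scale = min(box[0] / width, box[1] / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Example usage with Pillow (not run here):
#   img = img.resize(fit_within(img.size), Image.LANCZOS)
```

Pillow's Image.thumbnail((100, 100)) does something similar in place, but it only shrinks images and never enlarges them, whereas the helper above also scales small images up.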

python pytesseract.image_to_string unable to read clear text in image

I am using Python 3.6 and Tesseract-OCR on my Mac. I have pictures containing clearly readable text. However, even though the text is perfectly clear to the human eye, Tesseract can't extract it correctly. The attached image is an extreme case in which nothing is returned.
Below is the code I am using:
import cv2
import pytesseract
img = cv2.imread('frame40.jpg')
img = cv2.resize(img, (600, 450))
text = pytesseract.image_to_string(img)
print(text)
What am I missing here?

Not able to work on face detection of OpenCV in python

I am currently learning image detection using CNNs and related techniques. I found a good article here which explains the face detection steps using OpenCV. I followed every step, but I have been stuck for hours trying to test a single sample image. Below is the code I used in Google Colab:
import cv2
import matplotlib.pyplot as plt
import dlib
import os
from imutils import face_utils
font = cv2.FONT_HERSHEY_SIMPLEX
cascPath=r'C:\Users\randomUser\Desktop\haarcascade_frontalface_default.xml'
eyePath = r'C:\Users\randomUser\Desktop\haarcascade_eye.xml'
smilePath = r'C:\Users\randomUser\Desktop\haarcascade_smile.xml'
faceCascade = cv2.CascadeClassifier(cascPath)
eyeCascade = cv2.CascadeClassifier(eyePath)
smileCascade = cv2.CascadeClassifier(smilePath)
# even if I use the below path, I am still getting the error.
path = r'C:\Users\randomUser\Desktop\imagedata.jpeg'
gray = cv2.imread('imagedata.jpeg')
plt.figure(figsize=(12,8))
plt.imshow(gray, cmap='gray')
plt.show()
I have downloaded all the default files mentioned above to my directory, along with the test image imagedata.
However, when I run the first few steps, I get the error below :(
I have tried giving the physical path, but I don't understand what I am missing.
I went through several articles that explain the nature of the error, but none of them helped, so I thought of asking here directly.

You should run:
print( gray.shape )
right after you read the image. Most likely you're reading a non-existent file, which renders all the code below it moot.
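To make that failure obvious, something like the following could be used. It is only a sketch: load_gray is a name I made up, and it relies on the fact that cv2.imread returns None instead of raising when it cannot read a file, which is why the original code only fails later.

```python
import os


def load_gray(path):
    """Read an image as grayscale, failing loudly if it cannot be loaded."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            "No such image file: %r (current directory is %r)" % (path, os.getcwd())
        )
    import cv2  # imported here so the path check works even without OpenCV installed

    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        # The file exists but OpenCV could not decode it.
        raise ValueError("OpenCV could not decode %r" % path)
    return img


# Example usage:
#   gray = load_gray("imagedata.jpeg")
#   print(gray.shape)
```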

The issue I was facing was caused by the Google Drive path. After some research on the image path, I found that when you use Colab and mount Google Drive, /content is added at the beginning of the path even if you give an absolute path. Because of this, the path was not correct.

I think I found the error:
# This is what you have
path = r'C:\Users\randomUser\Desktop\imagedata.jpeg'
gray = cv2.imread('imagedata.jpeg')
# This is what you should have
path = r'C:\Users\randomUser\Desktop\imagedata.jpeg'
gray = cv2.imread(path) # <-- you weren't using the path of the image
Opening the image with PIL?
from PIL import Image
path = r'C:\Users\randomUser\Desktop\imagedata.jpeg'
gray = Image.open(path).convert("L") # L to open the image in gray scale

pytesseract output is so weird without any error

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
pytesseract.pytesseract.tesseract_cmd="C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe"
im = Image.open("C:\\1.png") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.png')
#im.show()
text = pytesseract.image_to_string(Image.open('temp2.png'),config='-psm 8')
print(text)
Hi all,
I'm trying to extract text from an image (a captcha), and the code above is what I have so far.
I don't think there is a problem in the code, since there is no error when I run it, but the output is very poor.
When I run this it shows nothing, but if I change -psm 8 to -psm 5, it shows ';«'.
Could you give me some advice on how to fix it?

It's done.
I switched to another picture for testing and there was no problem, at least for that picture.
But I think this module is too poor; it would be much better to find another module.

Tesseract could not recognize text from a pdf file

I tried to use Tesseract in Python to OCR some PDFs. The workflow is to first convert a PDF to a series of images using wand, then send them to Tesseract, based on this example. I applied this to 5 PDFs and found that it failed completely on one of them. Converting the PDF to TIFF works fine, so I guess something needs to be tuned in the OCR process? Or are there other tools I should use for this situation? I tried xpdfbin-win-3.04, which worked on this PDF but did not work as well as Tesseract on the other PDFs...
Screenshot of failed PDF
Output text
Code
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[2]
pth_str = "C:/Users/TH/Desktop/OCR_test/"
fname_list = ["999437-Asb_1-34.pdf"]
for each_file in fname_list:
    print(each_file)
    req_image = []
    final_text = []
    # convert to tiff
    image_pdf = Image(filename=pth_str+each_file, resolution=600)
    image_tif = image_pdf.convert('tiff')
    for img in image_tif.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('tiff'))
    # begin OCR
    for img in req_image:
        txt = tool.image_to_string(
            PI.open(io.BytesIO(img)),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        final_text.append(txt.encode('ascii','ignore'))
