Tesseract could not recognize text from a pdf file

Tesseract could not recognize text from a pdf file - python

I tried to use Tesseract in Python to OCR some PDFs. The workflow is to convert a PDF to a series of images first using wand, then send them to Tesseract based on this example. I applied this to 5 PDFs but found it failed to convert one (completely failed). It works fine to convert PDF to Tiff. Thus, I guess maybe something needs to be tuned in the OCR process? Or any other tools I should use to deal with this situation? I tried xpdfbin-win-3.04 which worked on this PDF but did not work as well as Tesseract on the other PDFs...
Screenshot of failed PDF
Output text
Code
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()2
pth_str = "C:/Users/TH/Desktop/OCR_test/"
fname_list = ["999437-Asb_1-34.pdf"]
for each_file in fname_list:
print each_file
req_image = []
final_text = []
# convert to tiff
image_pdf = Image(filename=pth_str+each_file, resolution=600)
image_tif = image_pdf.convert('tiff')
for img in image_tif.sequence:
img_page = Image(image=img)
req_image.append(img_page.make_blob('tiff'))
# begin OCR
for img in req_image:
txt = tool.image_to_string(
PI.open(io.BytesIO(img)),
lang=lang,
builder=pyocr.builders.TextBuilder()
)
final_text.append(txt.encode('ascii','ignore'))

Related

Python with pytesseract - How to get the same output for pytesseract.image_to_data in a searchable PDF?

I have this piece of code in Python that makes use of pytesseract (method pytesseract.image_to_data).
This gives me great text information and coordinates that are saved in a text file that is fed to a third party software. It works perfectly for PDF files that have been scanned
data = pytesseract.image_to_data(Image.open('file-001-page-001.png')))
The issue now is that I have a demand for output in the exact same structure for PDFs that already contain text. It's possible to keep the same code and continue as if the PDF had no text, extracting images and doing OCR, but it doesn't seem like the right solution...
Is it possible to achieve this with pytesseract?
Suggestions are welcome

You can use this:
import pytesseract
from PIL import Image
# Open the PDF file
with open('file.pdf', 'rb') as f:
# Extract text from the PDF file and save it to a variable
text = pytesseract.image_to_pdf_or_hocr(f, extension='hocr', lang='eng', config='--oem 3 --psm 6')
# Save the extracted text to a file in the desired format
with open('output.hocr', 'w')as f:
f.write(text)

Convert png/jpg to word file using python

I need to convert lots of jpg/png files to docx files & then to pdf. My sole concern is to write the data in an image to a pdf file & if I need to edit any text manually, I can do that in word & save it in the corresponding pdf file.
I've tried using API but failed as the text is not correctly matching.
My image files contain only texts & not anything else.
I already have docx to pdf conversion code in Python.
from docx2pdf import convert
input = 'INPUT_FILE_NAME.docx'
output = 'OUTPUT_FILE_NAME.pdf'
convert(input)
convert(input, output)
convert("Output")
Kindly suggest me how to convert a png/jpg file to docx. Thanks.
EDIT --------------
I've successfully made this code run. I've uploaded in my github repo.

from PIL import Image
from pytesseract import pytesseract
#Define path to tessaract.exe
path_to_tesseract = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
#Define path to image
path_to_image = 'texttoimage.png'
#Point tessaract_cmd to tessaract.exe
pytesseract.tesseract_cmd = path_to_tesseract
#Open image with PIL
img = Image.open(path_to_image)
#Extract text from image
text = pytesseract.image_to_string(img)
print(text)

gif generating using imageio.mimwrite

I have the following code to generate GIFs from images, the code works fine and the gif is saved locally, what I want to do rather the saving the GIF locally, I want for example data URI that I could return to my project using a request. How can I generate the GIF and return it without saving it?
my code to generate the GIF
import os
import imageio as iio
import imageio
png_dir='./'
images=[]
for file_name in url:
images.append(file_name)
imageio.imwrite('movie.gif', images, format='gif')

I found I can save it as bytes with the following code
gif_encoded = iio.mimsave("<bytes>", images, format='gif')
it will save the GIF as bytes then you can encode it.
encoded_string = base64.b64encode(gif_encoded)
encoded_string = b'data:image/gif;base64,'+encoded_string
decoded_string = encoded_string.decode()
for more examples check this out
https://imageio.readthedocs.io/en/stable/examples.html#read-from-fancy-sources

pdf2image - convert_from_path returns an empty image for pdfs with colour

I have a collection of pdfs, each containing a scan of an A4 paper, that are different in size. I would like to convert them to an image and fix the resolution of the outgoing image.
My code to convert to jpg (without resizing):
from pdf2image import convert_from_path
filename_in = 'myfile.pdf'
filename_out = 'myfile.jpg'
jpeg = convert_from_path( filename_in )
jpeg[0].save( filename_out , 'JPEG' )
If the pdf I am trying to convert has any colour in it, the above does not work and the outgoing image is completely white (with non-zero dimensions). Is this a known problem and does a solution exist?
I am using Python 3.7.3.
I am unable to share the pdf files as they contain private information.

You can try to extract the images and correct resolutions instead of converting PDFs.
Try pdfreader, here is a sample code extracting all images (the both inline and XObject) from a doc.
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
images = []
try:
while True:
viewer.render()
images.extend(viewer.canvas.inline_images)
images.extend(viewer.canvas.images.values())
viewer.next()
except PageDoesNotExist:
pass
Then you can convert images to PIL/Pillow object and save (or do whatever you need)
for i, img in enumerate(images):
img.to_Pillow().save("{}.png".format(i))

Detecting Bangla character using pytesseract

I am trying to detect bangla character from image using python, so i decided to use pytesseract. For this purpose i have used below code:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
im = Image.open("input.png") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('temp2.png'),lang="ben")
print text
The problem is that if i gave a image of english character is detects. But when i am writing lang="ben" and detecting from image of bengali characters my code is running for endless time or like forever.
P.S: I have downloaded bengali language train data to tessdata folder and i am trying to run it in PyCharm.
Can anyone help me to solve this problem?
sample of input.png

I added Bangla(india) language to Windows. Downloaded ben.traineddata to TESSDATA_PREFIX which equals to C:\Program Files\Tesseract 4.0.0\tessdata in my PC. Then run,
> tesseract -l ben bangla.jpg bangla_out
in command prompt and got the result below in 2 seconds. The result looks fine even I don't understand the language.
Have you tried to run tesseract in command prompt to verify if it works for -l ben?
EDIT:
Used Spyder, similar to PyCharm, which comes with Anaconda to test
it. Modified your code to call Tesseract as below.
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
Test Code in Spyder:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
import os
im = Image.open("bangla.jpg") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save("bangla_pp.jpg")
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
text = pytesseract.image_to_string(Image.open("bangla_pp.jpg"),lang="ben")
print text
It works and produced result below on the processed image. Apparently, the OCR result of the processed image is not as good as the original one.
Result from the processed bangla_pp.jpg:
প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে
-~~-<~~~~--
প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
= পাবেন তার
Result from original image, directly feed to Tesseract.
Code:
from PIL import Image
import pytesseract as tess
print tess.image_to_string(Image.open('bangla.jpg'), lang='ben')
Output:
প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে
প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
পাবেন তার

I have installed some fonts in windows from here
https://www.omicronlab.com/bangla-fonts.html
After that, it worked perfectly fine for me in Pycharm.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tesseract could not recognize text from a pdf file - python

Related

Python with pytesseract - How to get the same output for pytesseract.image_to_data in a searchable PDF?

Convert png/jpg to word file using python

gif generating using imageio.mimwrite

pdf2image - convert_from_path returns an empty image for pdfs with colour

Detecting Bangla character using pytesseract

Categories

Resources