unable to extract text from tif image using pytesseract in python

I am unable to extract text from a .tif image file using pytesseract and PIL in Python.
It works well for .png and .jpg image files; the error occurs only with .tif files.
I am using Python 3.7.1.
Running the code on a .tif image file gives the error below. Please let me know what I am doing wrong.
Fax3SetupState: Bits/sample must be 1 for Group 3/4 encoding/decoding.
Traceback (most recent call last):
File "C:/Users/u88ltuc/PycharmProjects/untitled1/Image Processing/Prog1.py", line 13, in <module>
image_to_text = pytesseract.image_to_string(image, lang='eng')
File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\pytesseract\pytesseract.py", line 347, in image_to_string
}[output_type]()
File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\pytesseract\pytesseract.py", line 346, in <lambda>
Output.STRING: lambda: run_and_get_output(*args),
File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\pytesseract\pytesseract.py", line 246, in run_and_get_output
with save(image) as (temp_name, input_filename):
File "C:\Program Files\Python37\lib\contextlib.py", line 112, in __enter__
return next(self.gen)
File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\pytesseract\pytesseract.py", line 171, in save
image.save(input_file_name, format=extension, **image.info)
File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\PIL\Image.py", line 2102, in save
save_handler(self, fp, filename)
File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\PIL\TiffImagePlugin.py", line 1626, in _save
raise OSError("encoder error %d when writing image file" % s)
OSError: encoder error -2 when writing image file
Below is the Python code for it.
#Import modules
from PIL import Image
import pytesseract
# Include tesseract executable in your path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Create an image object of PIL library
image = Image.open(r'C:\Users\u88ltuc\Desktop\12110845-e001.tif')
# pass image into pytesseract module
image_to_text = pytesseract.image_to_string(image, lang='eng')
# Print the text
print(image_to_text)
The tif image is available at the link below:
https://ecat.aptiv.com/docs/default-source/ecatalog-documents/12110845-e001-tif.tif?sfvrsn=3ee3b8a1_0

First, convert the image to a different format instead of passing the TIFF directly.
This may solve your problem:
from PIL import Image
from io import BytesIO
import pytesseract
img = Image.open(r"C:\Users\u88ltuc\Desktop\12110845-e001.tif")
TempIO = BytesIO()
img.save(TempIO,format="JPEG")
img = Image.open(BytesIO(TempIO.getvalue()))
print(pytesseract.image_to_string(img))
Or, if you don't mind having a second copy of the picture on your desktop, you don't need BytesIO and can save the converted file to disk:
from PIL import Image
import pytesseract
img = Image.open(r"C:\Users\u88ltuc\Desktop\12110845-e001.tif")
img.save(r"C:\Users\u88ltuc\Desktop\12110845-e001.jpg")
img = Image.open(r"C:\Users\u88ltuc\Desktop\12110845-e001.jpg")
print(pytesseract.image_to_string(img))
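
A variation on the two snippets above, offered as a sketch rather than a confirmed fix: re-encode the image as PNG in memory instead of JPEG. PNG is lossless, so the image is not degraded by JPEG compression before OCR. The helper name reencode_as_png is made up for this example, and the sketch assumes Pillow is installed.

```python
from io import BytesIO

from PIL import Image


def reencode_as_png(img):
    """Return a copy of img that has been re-encoded as PNG in memory.

    reencode_as_png is a hypothetical helper name, not part of Pillow
    or pytesseract.
    """
    buffer = BytesIO()
    img.save(buffer, format="PNG")  # PNG is lossless
    buffer.seek(0)
    png = Image.open(buffer)
    png.load()  # decode the pixels now, while the buffer is still in scope
    return png


# Usage with the file from the question (needs pytesseract and Tesseract):
# import pytesseract
# img = reencode_as_png(Image.open(r"C:\Users\u88ltuc\Desktop\12110845-e001.tif"))
# print(pytesseract.image_to_string(img))
```

The returned image reports its format as PNG, so anything that re-saves it based on that format will write PNG rather than TIFF. Image modes that PNG cannot store, such as CMYK, would need an img.convert(...) call first.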

Related

How to work around skimage.io.imread() error and re-save image correctly?

The below error is thrown when trying to read the image URL referenced below. (Note, I can't even upload the image to SO because it throws an error when I try to upload it.)
https://s3.amazonaws.com/comicgeeks/comics/covers/large-7441962.jpg
image = imread('https://s3.amazonaws.com/comicgeeks/comics/covers/large-7441962.jpg', as_gray=True)
This is the stack trace.
Traceback (most recent call last):
....
return imread(image_or_path, as_gray=True)
File "skimage/io/_io.py", line 48, in imread
img = call_plugin('imread', fname, plugin=plugin, **plugin_args)
File "skimage/io/manage_plugins.py", line 209, in call_plugin
return func(*args, **kwargs)
File "skimage/io/_plugins/imageio_plugin.py", line 10, in imread
return np.asarray(imageio_imread(*args, **kwargs))
File "imageio/__init__.py", line 86, in imread
return imread_v2(uri, format=format, **kwargs)
File "imageio/v2.py", line 159, in imread
with imopen(uri, "ri", plugin=format) as file:
File "imageio/core/imopen.py", line 333, in imopen
raise err_type(err_msg)
ValueError: Could not find a backend to open `/var/folders/82/rky4yjcx75n1zskhy5570v0m0000gn/T/tmpy9xg7dvb.jpg`` with iomode `ri`.
After looking into this further I believe the image was originally a TIFF file that was just renamed to a .jpg file manually, but I'm not sure. If I download the file and try to open it with Photoshop I get the following message.
Could not open “large-7441962.jpeg” because an unknown or invalid JPEG marker type is found.
If I simply change the extension to a .tiff file it will not open as it states it is an invalid tiff file.
The only way I can open it with photoshop is if I open it with the preview.app and then save a copy of the image as a .tiff file. Then I can open it in photoshop.
This is an issue with a potentially large number of images so re-saving them one-by-one is not an option.
Are there any possible ways to re-save this file when this error is thrown? Or somehow figure out how to handle it even though imread() is failing?

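I was able to work around this by using the following.
import requests
from PIL import Image
from skimage.io import imread

# load_image is an illustrative wrapper name so that the return statements are valid
def load_image(url):
    try:
        # Fast path: let scikit-image fetch and decode the URL
        return imread(url, as_gray=True)
    except Exception:
        # Fallback: download with requests and let PIL decode the raw stream.
        # Note that this returns a PIL Image, not a grayscale array.
        return Image.open(requests.get(url, stream=True).raw)
However, it is worth noting that the PIL fallback is significantly slower.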
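
The question suspects that the file extension does not match the file contents. One way to check this before choosing a reader is to look at the first bytes of the file. The sketch below uses only the standard library and the well-known signatures of a few common formats; the function names are invented for this example, and it only identifies a file, it does not repair one.

```python
def sniff_image_format(data):
    """Guess an image format from the first bytes of a file.

    Returns "jpeg", "png", "gif", "tiff", "webp", or None if the
    signature is not one of those.
    """
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data.startswith((b"II*\x00", b"MM\x00*")):
        return "tiff"  # little-endian or big-endian TIFF header
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def sniff_file(path):
    """Read only the first 12 bytes of path and guess its format."""
    with open(path, "rb") as handle:
        return sniff_image_format(handle.read(12))
```

Running sniff_file over a folder of downloads would show which files carry a misleading extension, so that only those need the slower fallback.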

How to fix a python tesseract error?

I need to use python tesseract to extract text from a photo:
import pytesseract
from PIL import Image
img = Image.open('stest.png')
pytesseract.pytesseract.tesseract_cmd = 'D:\\python\\venv\\Scripts\\pytesseract.exe'
file_name = img.filename
file_name = file_name.split(".")[0]
text = pytesseract.image_to_string(img,lang=None, config='')
print(text)
with open(f'{file_name}.txt', 'w') as text_file:
    text_file.write(text)
But the error appears as if it is not related to my code:
Traceback (most recent call last):
File "D:\python\pybotavito\main.py", line 13, in <module>
text = pytesseract.image_to_string(img,lang=None, config='')
File "D:\python\venv\lib\site-packages\pytesseract\pytesseract.py", line 413, in image_to_string
return {
File "D:\python\venv\lib\site-packages\pytesseract\pytesseract.py", line 416, in <lambda>
Output.STRING: lambda: run_and_get_output(*args),
File "D:\python\venv\lib\site-packages\pytesseract\pytesseract.py", line 284, in run_and_get_output
run_tesseract(**kwargs)
File "D:\python\venv\lib\site-packages\pytesseract\pytesseract.py", line 260, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (2, 'Usage: pytesseract [-l lang] input_file')

From the README page of pytesseract,
# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'
This line must point to the tesseract executable, NOT to the pytesseract Python executable. Set it to the correct executable path; the comment below that line gives an example of where the executable can be found.
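
A quick sanity check for this mistake can be sketched with the standard library alone. The function names below are hypothetical and not part of pytesseract; the check only looks at the file name in the configured path.

```python
import shutil
from pathlib import PureWindowsPath


def points_at_tesseract(cmd):
    """Return True if the file name in cmd is tesseract (or tesseract.exe)."""
    # PureWindowsPath splits on both "\" and "/", so the check behaves
    # the same no matter which OS runs it.
    return PureWindowsPath(cmd).stem.lower() == "tesseract"


def find_tesseract():
    """Return the full path of a tesseract executable on PATH, or None."""
    return shutil.which("tesseract")
```

With the path from the question, points_at_tesseract returns False because the file name is pytesseract.exe. If find_tesseract returns a path, tesseract is already on PATH and tesseract_cmd does not need to be set at all.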

PIL "OSError: cannot identify image file " while opening an Image

I am trying to load a dataset of images using keras.load_img, which returns a PIL Image instance, but for some reason I am unable to do so. Each image was generated by the imgaug module of Python, which returns a numpy array, and then saved with the following lines of code:
img = augmenter(image=image)
img = Image.fromarray(img)
img.save(os.path.join(os.path.dirname(__file__), 'Roll 1_Augmented', 'PaperPrintImgAug' + str(i+1) + '_' + str(j) + '.jpg'))
The images seem to be saved successfully and I am able to open them with any image viewer, so I do not think they are corrupted. But I get the following error whenever I try to open the images using load_img:
Traceback (most recent call last):
File "ae.py", line 61, in readImages
img = load_img(folder_path[i], target_size=inputShape)
File "/home/ies/billa/miniconda3/envs/pfprint/lib/python3.6/site-packages/keras_preprocessing/image/utils.py", line 114, in load_img
img = pil_image.open(io.BytesIO(f.read()))
File "/home/ies/billa/miniconda3/envs/pfprint/lib/python3.6/site-packages/PIL/Image.py", line 2896, in open
"cannot identify image file %r" % (filename if filename else fp)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f7350df9e08>
I tried various solutions found online and on Stack Overflow, but those situations don't seem to match mine. After downloading the latest version of Pillow, the traceback changed to the following:
File "/home/ies/billa/miniconda3/envs/pfprint/lib/python3.6/site-packages/keras_preprocessing/image/utils.py", line 114, in load_img
img = pil_image.open(io.BytesIO(f.read()))
File "/home/ies/billa/miniconda3/envs/pfprint/lib/python3.6/site-packages/PIL/Image.py", line 2006, in open
raise IOError("cannot identify image file")
OSError: cannot identify image file
I suspect it might be some issue with the image encoding.
Can someone help me with this please?
EDIT:
The issue was solved by saving the images as png files.
For the issue with JPEG files, the following is the code I used:
from glob import glob

import numpy as np
from tensorflow.python.keras.preprocessing.image import load_img

def readImages(folder_path):
    images = []
    for i in range(len(folder_path)):
        # inputShape is defined elsewhere in the script (not shown here)
        img = load_img(folder_path[i], target_size=inputShape)
        img = np.array(img)
        images.append(img)
    images = np.array(images)
    return images

augdata = glob('/home/ies/billa/Roll 1_Augmented/*')
augmentedImages = readImages(augdata)

Open process and save specific images in related folder

I'm looking for a way to open and crop several tiff images and then save the newly created cropped images in the same folder (relative to my script folder).
My current code looks like this:
from PIL import Image
import os,platform
filespath = os.path.join(os.environ['USERPROFILE'],"Desktop\Python\originalImagesfolder")
for file in os.listdir(filespath):
    if file.endswith(".tif"):
        im = Image.open(file)
        im.crop((3000, 6600, 3700, 6750)).save(file+"_crop.tif")
This script returns the following error:
Traceback (most recent call last):
File "C:\Users...\Desktop\Python\script.py", line 22, in <module>
im = Image.open(file)
File "C:\Python34\lib\site-packages\PIL\Image.py", line 2219, in open
fp = builtins.open(fp, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'Image1Name.tif'
'Image1Name.tif' is the first tif image I'm trying to process in the folder. I don't understand how the script can report the file's name without being able to find the file. Any help?
PS: I have two days of experience with Python and with code in general. Sorry if the answer is obvious.
[EDIT/Update]
After modifying my initial code thanks to vttran's and ChrisGuest's answers, turning it into this:
from PIL import Image
import os,platform
filespath = os.path.join(os.environ['USERPROFILE'],"Desktop\Python\originalImagesfolder")
for file in os.listdir(filespath):
    if file.endswith(".tif"):
        filepath = os.path.join(filespath, file)
        im = Image.open(filepath)
        im.crop((3000, 6600, 3700, 6750)).save("crop"+file)
the script returns a new error message:
Traceback (most recent call last):
File "C:/Users/.../Desktop/Python/script.py", line 11, in <module>
im.crop((3000, 6600, 3700, 6750)).save("crop"+file)
File "C:\Python34\lib\site-packages\PIL\Image.py", line 986, in crop
self.load()
File "C:\Python34\lib\site-packages\PIL\ImageFile.py", line 166, in load
self.load_prepare()
File "C:\Python34\lib\site-packages\PIL\ImageFile.py", line 250, in load_prepare
self.im = Image.core.new(self.mode, self.size)
ValueError: unrecognized mode
One possibly useful detail: it's a Landsat 8 image in GeoTIFF format, so the TIFF file includes geoposition, projection... information. The script works perfectly fine if I first open and re-save the images with software such as Photoshop (16int tiff format).

When you search for the file names, you use filespath to specify the directory.
But when you then open a file, you use only the base filename, so Python looks for it in the current working directory instead.
So you could replace
im = Image.open(file)
with
filepath = os.path.join(filespath, file)
im = Image.open(filepath)
Also consider using the glob module, as you can do glob.glob(r'path\*.tif'), which returns each match with the directory already included.
It is also good practice to avoid using the names of builtins, such as file, as variable names.
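
The advice above can be sketched with pathlib, which keeps the directory attached to every file name. This is only an illustration under assumptions: the helper name crop_jobs and the "_crop" suffix are made up for the example, and the function only builds paths, it does not open any image.

```python
from pathlib import Path


def crop_jobs(folder, suffix="_crop"):
    """Return (source, destination) path pairs for every .tif file in folder.

    Destinations sit next to their sources, e.g. a.tif -> a_crop.tif.
    """
    jobs = []
    for source in sorted(Path(folder).glob("*.tif")):
        if source.stem.endswith(suffix):
            continue  # skip outputs of an earlier run
        destination = source.with_name(source.stem + suffix + source.suffix)
        jobs.append((source, destination))
    return jobs


# Usage with the names from the question (needs Pillow):
# for source, destination in crop_jobs(filespath):
#     Image.open(source).crop((3000, 6600, 3700, 6750)).save(destination)
```

Because both paths in each pair are full paths, the script behaves the same regardless of the directory it is started from.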

Pillow not loading image -cannot identify image file

What's wrong with the following snippet?
It's not related to the image format, I tried both with jpg and png.
import Image
from cStringIO import StringIO
with open('/path/to/file/image.png') as f:
    data = f.read()
img = Image.open(StringIO(data))
img.load()
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "/usr/lib64/python2.7/site-packages/PIL/Image.py", line 2030, in open
raise IOError("cannot identify image file")
IOError: cannot identify image file
EDIT:
This also happens with a randomly downloaded picture from the internet and the following most basic snippet:
import Image
im = Image.open('WicZW.jpg')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/PIL/Image.py", line 2030, in open
raise IOError("cannot identify image file")
IOError: cannot identify image file

The problem was that both the PIL and Pillow libraries were installed on the machine:
# pip freeze | grep -E '(Pillow|PIL)'
PIL==1.1.7
Pillow==2.1.0

I solved this by using
from PIL import Image
instead of just doing
import Image
