Error setting psm for pytesseract - python

I'm trying to use a psm of 0 with pytesseract, but I'm getting an error. My code is:
import pytesseract
from PIL import Image
img = Image.open('pathToImage')
pytesseract.image_to_string(img, config='-psm 0')
The error that comes up is
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 126, in image_to_string
f = open(output_file_name, 'rb')
IOError: [Errno 2] No such file or directory:
'/var/folders/m8/pkg0ppx11m19hwn71cft06jw0000gp/T/tess_uIaw2D.txt'
When I go into '/var/folders/m8/pkg0ppx11m19hwn71cft06jw0000gp/T', there's a file called tess_uIaw2D.osd that seems to contain the output information I was looking for. It seems like tesseract is saving a file as .osd, then looking for that file but with a .txt extension. When I run tesseract through the command line with --psm 0, it saves the output file as .osd instead of .txt.
Is it correct that pytesseract's image_to_string() works by saving an output file somewhere and then automatically reading that output file? And is there any way to either set tesseract to save the file as .txt, or to set it to look for a .osd file? I'm having no issues just running the image_to_string() function when I don't set the psm.

You have a couple of questions here:
PSM error
In your question you mention that you run "--psm 0" on the command line. However, in your code snippet you have "-psm 0".
Using the double dash, config="--psm 0", will fix that issue.
The tesseract command-line documentation also describes how to specify where the text read from the image is written. I suggest you start there.
Is it correct that pytesseract's image_to_string() works by saving an output file somewhere and then automatically reading that output file?
From my usage of pytesseract, you do not have to handle an output file yourself.
pytesseract.image_to_string() returns the string found in the image by default. This is controlled by the parameter output_type=Output.STRING, which you can see in the signature of image_to_string.
The other return options include (1) Output.BYTES and (2) Output.DICT.
I usually have something like text = pytesseract.image_to_string(img).
I then write that text to a log file.
Here is an example:
import datetime
import io
import pytesseract
import cv2
img = cv2.imread("pathToImage")
text = pytesseract.image_to_string(img, config="--psm 0")
ocr_log = "C:/foo/bar/output.txt"
timestamp_fmt = "%Y-%m-%d_%H-%M-%S-%f"
# ...
# DO SOME OTHER STUFF BEFORE WRITING TO LOG FILE
# ...
with io.open(ocr_log, "a") as ocr_file:
    timestamp = datetime.datetime.now().strftime(timestamp_fmt)
    ocr_file.write(f"{timestamp}:\n====OCR-START===\n")
    ocr_file.write(text)
    ocr_file.write("\n====OCR-END====\n")
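The question observes that Tesseract writes a .osd file rather than a .txt file when run with page segmentation mode 0. As a rough sketch of that "write a file, then read it back" pattern, a wrapper could look for either extension. This uses only the standard library and is not pytesseract's actual code; the helper name read_tesseract_output and the extension order are assumptions made for illustration.

```python
import os


def read_tesseract_output(base_path, extensions=(".txt", ".osd")):
    """Return the text of the first existing output file for base_path.

    base_path is the output name without an extension. This is a
    hypothetical helper, not part of pytesseract.
    """
    for ext in extensions:
        candidate = base_path + ext
        if os.path.exists(candidate):
            with open(candidate, "r", encoding="utf-8") as handle:
                return handle.read()
    raise FileNotFoundError("no output file found for %r" % base_path)
```

Recent pytesseract releases also provide image_to_osd(), which may be the more direct route when only orientation and script detection is wanted.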

Related

python-pptx: Dealing with password-protected PowerPoint files

I'm using a slightly modified version of the "Extract all text from slides in presentation" example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html to extract text from some PowerPoint slides.
I'm getting a PackageNotFoundError when I try to use the Presentation() method to open some of the PowerPoint files to read the text.
This appears to be due to the fact that, unbeknownst to me before I started this project, a few of the PowerPoint files are password protected.
I obviously don't expect to be able to read text from a password-protected file but is there a recommended way of dealing with password-protected PowerPoint files? Having my Python script die every time it runs into one is annoying.
I'd be fine with something that basically went: "Hi! The file you're trying to read may be password-protected. Skipping."
I tried using a try/except block to catch the PackageNotFoundError but then I got "NameError: name 'PackageNotFoundError' is not defined".
EDIT1: Here's a minimal case that generates the error:
EDIT2: See below for a working try/catch block, thanks to TheGamer007's suggestion.
import pptx
from pptx import Presentation
password_protected_file = r"C:\Users\J69401\Documents\password_protected_file.pptx"
prs = Presentation(password_protected_file)
And here's the error that is generated:
Traceback (most recent call last):
File "T:/W/Wintermute/50 Sandbox/Pownall/Python/copy files/minimal_case_opening_file.py", line 6, in <module>
prs = Presentation(password_protected_file)
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\api.py", line 28, in Presentation
presentation_part = Package.open(pptx).main_document_part
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\package.py", line 125, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\pkgreader.py", line 33, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\phys_pkg.py", line 32, in __new__
raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at 'C:\Users\J69401\Documents\password_protected_file.pptx'
Here's the minimal case again but with a working try/catch block.
import pptx
from pptx import Presentation
import pptx.exc
from pptx.exc import PackageNotFoundError
password_protected_file = r"C:\Users\J69401\Documents\password_protected_file.pptx"
try:
    prs = Presentation(password_protected_file)
except PackageNotFoundError:
    print("PackageNotFoundError generated - possible password-protected file.")
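When many files have to be processed, a pre-check can complement the try/except block. A .pptx file is normally a ZIP package, whereas password-protected Office files are usually wrapped in an encrypted OLE container, so the standard library's zipfile.is_zipfile() can flag likely candidates before python-pptx is called. This is only a heuristic sketch: split_openable is a hypothetical helper, and a file that fails the check may be corrupt or missing rather than password-protected.

```python
import zipfile


def split_openable(paths):
    """Split paths into (zip_packages, skipped) using a ZIP check only.

    Hypothetical helper: it does not open the presentations, it only
    tests whether each file looks like a ZIP archive.
    """
    packages, skipped = [], []
    for path in paths:
        if zipfile.is_zipfile(path):
            packages.append(path)
        else:
            print("Skipping %s: not a ZIP package (possibly password-protected)." % path)
            skipped.append(path)
    return packages, skipped
```

The paths in the first list can then be passed to Presentation() inside the try/except block shown above.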

Is there a method to write metadata to an image in PyExifTool?

I used ExifTool's '-tagsFromFile' to copy Exif metadata from a source image to a destination image from the command line. Wanted to do the same in a python script, which is when I got to know of PyExifTool. However, I didn't find any command to copy or write to a destination image. Am I missing something? Is there a way I can fix this?
I found user5008949's answer to a similar question which suggested doing this:
import exiftool
filename = '/home/radha/src.JPG'
with exiftool.ExifTool() as et:
    et.execute("-tagsFromFile", filename , "dst.JPG")
However, it gives me the following error:
Traceback (most recent call last):
File "metadata.py", line 9, in <module>
et.execute("-tagsFromFile", filename , "dst.JPG")
File "/home/radha/venv/lib/python3.6/site-packages/exiftool.py", line 221, in execute
self._process.stdin.write(b"\n".join(params + (b"-execute\n",)))
TypeError: sequence item 0: expected a bytes-like object, str found
The execute() method requires bytes as input, but you are passing strings. That is why it fails.
In your case, the code should look like this:
import exiftool
filename = b"/home/radha/src.JPG"
with exiftool.ExifTool() as et:
    et.execute(b"-tagsFromFile", filename, b"dst.JPG")
See this answer for reference.
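If the arguments start out as ordinary strings, they can be converted in one place instead of being written as bytes literals. The following is a minimal sketch using os.fsencode from the standard library; to_bytes_args is a hypothetical helper name and is not part of PyExifTool.

```python
import os


def to_bytes_args(*args):
    """Encode str or path-like arguments to bytes; leave bytes untouched.

    Hypothetical helper for an execute() that only accepts bytes.
    """
    return tuple(a if isinstance(a, bytes) else os.fsencode(a) for a in args)
```

With the version from the question, the call would then read et.execute(*to_bytes_args("-tagsFromFile", filename, "dst.JPG")).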

Python Image to Text

I'm trying to write a Python script that takes an image as input and prints whatever text is in the image to the terminal or a file. I have Python 2.7 and 3.7.
I have PIL and pytesseract installed on my Kali Linux machine,
but I'm getting this error:
Traceback (most recent call last):
File "imgtotxt.py", line 8, in <module>
img =Image.open("/home/Desktop/ITT/1.jpeg")
File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2609, in open
fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/home/Desktop/ITT/1.jpeg'
Here is my code:
#!/usr/bin/python
from PIL import Image
from pytesseract import image_to_string
img =Image.open("/home/Desktop/ITT/1.jpeg")
text =image_to_string(img)
print (text)
Something is wrong with how you typed the filename.
Try this in your python code:
import os
print(os.listdir("/home/Desktop/ITT/"))
You should see your filename printed. Copy the filename from there instead.
If this fails, go up a directory (e.g. /home/Desktop) and try that.
Make sure the file exists at the exact location you specified. The system isn't finding the file. Perhaps it's at /home/YOUR_USER/Desktop/ITT/1.jpeg?
Put the script in the same folder as the image and change the path to just the image's file name; then you will see whether something is REALLY wrong.
EDIT:
Try this then:
import cv2
import numpy as np
from PIL import Image
image = cv2.imread('1.jpeg')  # alternatively, use the full path under /home/Desktop/ITT/
img = Image.fromarray(image.astype(np.uint8))
....
Also check that your image is not corrupted. This is pretty strange.
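The path checks suggested in these answers can be done up front, before the image is handed to PIL. Below is a small sketch using pathlib from the standard library; resolve_existing is a hypothetical helper. It expands a leading ~ to the current user's home directory and, if the file is missing, reports the nearest directory that does exist, which shows where the path goes wrong.

```python
from pathlib import Path


def resolve_existing(path_text):
    """Return the expanded path if the file exists, else raise with a hint.

    Hypothetical helper: the error names the nearest existing ancestor
    directory so the faulty path component is easy to spot.
    """
    path = Path(path_text).expanduser()
    if path.is_file():
        return path
    ancestor = path.parent
    while not ancestor.is_dir() and ancestor != ancestor.parent:
        ancestor = ancestor.parent
    raise FileNotFoundError(
        "%s not found; nearest existing directory is %s" % (path, ancestor)
    )
```

For example, resolve_existing("~/Desktop/ITT/1.jpeg") looks under the current user's home directory rather than directly under /home.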

Python PyMuPDF Fitz insertImage

Have been trying to put an image into a PDF file using PyMuPDF / Fitz and everywhere I look on the internet I get the same syntax, but when I use it I'm getting a runtime error.
>>> doc = fitz.open("NewPDF.pdf")
>>> page = doc[1]
>>> rect = fitz.Rect(0,0,880,1080)
>>> page.insertImage(rect, filename = "Image01.jpg")
error: object is not a stream
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\fitz\fitz.py", line 1225, in insertImage
return _fitz.Page_insertImage(self, rect, filename, pixmap, overlay)
RuntimeError: object is not a stream
>>> page
page 1 of NewPDF.pdf
I've tried a few different variations on this, with pixmap and without, with overlay value set, and without. The PDF file exists and can be opened with Adobe Acrobat Reader, and the image file exists - I have tried PNG and JPG.
Thank you in advance for any help.
Just some hints to try:
Ensure that your "Image01.jpg" file can be opened, and use the full path.
from PIL import Image
image_path = "/full/path/to/Image01.jpg"
image_file = Image.open(open(image_path, 'rb'))
# side-note: generally it is better to use the open with syntax, see link below
# https://stackoverflow.com/questions/9282967/how-to-open-a-file-using-the-open-with-statement
To ensure that you are actually on the PDF page you expect, try this. This code inserts the image only on the first page:
for page in doc:
    page.insertImage(rect, filename=image_path)
    break  # Without this, the image will appear on each page of your pdf

How to compress a text file?

I have a text file created and I want to compress it.
How would I accomplish this?
I have done some research around the forum and found a similar question, but when I tried its code it did not work for me, because it compresses text typed into the script rather than a file. For example:
import zlib, base64
text = 'STACK OVERFLOW'
code = base64.b64encode(zlib.compress(text,9))
print code
Source: (Compressing a file in python and keep the grammar exact when opening it again)
When I tried it out with a file, this error came up:
Traceback (most recent call last):
File "C:\Users\Shahid\Desktop\Suhail\Task 3.py", line 3, in <module>
code = base64.b64encode(zlib.compress(text,9))
TypeError: must be string or read-only buffer, not file
Here is the code that I have used:
import zlib, base64
text = open('Suitable.txt','r')
code = base64.b64encode(zlib.compress(text,9))
print code
But what I want is for a text file to be compressed.
There is a section entitled "Example of how to GZIP compress an existing file" at the bottom of https://docs.python.org/2/library/gzip.html
You should use this code to do what you tried:
import zlib, base64
with open('Suitable.txt', 'r') as file:
    text = file.read()
code = base64.b64encode(zlib.compress(text.encode('utf-8'), 9))
code = code.decode('utf-8')
print(code)
However, the result may not actually be smaller: for a short input, code can be longer than text.
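For compressing the file itself rather than a string, the gzip approach mentioned above could look roughly like this. It is a sketch using the standard-library gzip and shutil modules under Python 3; gzip_file is a hypothetical helper name, and the default of appending ".gz" to the source name is an assumption.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Write a gzip-compressed copy of src_path and return its path.

    Hypothetical helper: dst_path defaults to src_path + ".gz".
    """
    if dst_path is None:
        dst_path = src_path + ".gz"
    # Copy in binary mode so the bytes are stored exactly as they are.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

Calling gzip_file('Suitable.txt') would write Suitable.txt.gz next to the original. Reading it back with gzip.open(path, 'rt') returns the original text.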
