python-pptx: Dealing with password-protected PowerPoint files


I'm using a slightly modified version of the "Extract all text from slides in presentation" example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html to extract text from some PowerPoint slides.
I'm getting a PackageNotFoundError when I try to use the Presentation() method to open some of the PowerPoint files to read the text.
This appears to be due to the fact that, unbeknownst to me before I started this project, a few of the PowerPoint files are password protected.
I obviously don't expect to be able to read text from a password-protected file but is there a recommended way of dealing with password-protected PowerPoint files? Having my Python script die every time it runs into one is annoying.
I'd be fine with something that basically went: "Hi! The file you're trying to read may be password-protected. Skipping."
I tried using a try/except block to catch the PackageNotFoundError but then I got "NameError: name 'PackageNotFoundError' is not defined".
EDIT2: See the end of this question for a working try/except block, thanks to TheGamer007's suggestion.
EDIT1: Here's a minimal case that generates the error:
import pptx
from pptx import Presentation
password_protected_file = r"C:\Users\J69401\Documents\password_protected_file.pptx"
prs = Presentation(password_protected_file)
And here's the error that is generated:
Traceback (most recent call last):
  File "T:/W/Wintermute/50 Sandbox/Pownall/Python/copy files/minimal_case_opening_file.py", line 6, in <module>
    prs = Presentation(password_protected_file)
  File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\api.py", line 28, in Presentation
    presentation_part = Package.open(pptx).main_document_part
  File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\package.py", line 125, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\pkgreader.py", line 33, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\phys_pkg.py", line 32, in __new__
    raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at 'C:\Users\J69401\Documents\password_protected_file.pptx'
Here's the minimal case again, this time with a working try/except block:
from pptx import Presentation
# The exception lives in pptx.exc, so it has to be imported before it can be caught
from pptx.exc import PackageNotFoundError
password_protected_file = r"C:\Users\J69401\Documents\password_protected_file.pptx"
try:
    prs = Presentation(password_protected_file)
except PackageNotFoundError:
    print("PackageNotFoundError generated - possible password-protected file.")
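The except block above cannot tell a password-protected file from a path that simply does not exist, since both end in the same exception. A minimal sketch of a pre-check follows. It assumes that an encrypted Office file is stored as an OLE compound file rather than as a ZIP package, so it fails the ZIP test and starts with the OLE magic bytes. The helper name classify_office_file is hypothetical and not part of python-pptx. A legacy .ppt file renamed to .pptx would look the same, so "ole-container" means "possibly password-protected", not "certainly".

```python
import os
import zipfile

# Magic bytes found at the start of every OLE2 / Compound File Binary container.
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def classify_office_file(path):
    """Return 'missing', 'ooxml-zip', 'ole-container', or 'unknown'.

    Hypothetical helper for illustration; not part of python-pptx.
    """
    if not os.path.isfile(path):
        return "missing"
    # A normal .pptx is a ZIP package.
    if zipfile.is_zipfile(path):
        return "ooxml-zip"
    # Assumption: encrypted Office files (and legacy .ppt) are OLE containers.
    with open(path, "rb") as f:
        if f.read(len(OLE_MAGIC)) == OLE_MAGIC:
            return "ole-container"
    return "unknown"
```

A script could call classify_office_file(path) before Presentation(path) and print a "may be password-protected, skipping" message only for the "ole-container" result.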

Related

Manipulating docm files with Python

I was looking into how to manipulate docm files with Python and found a library named python-docx-docm.
I followed the documentation and tried a simple program:
import docx
doc = docx.Document(r'C:\Users\clemdcz\Desktop\my_file.docm')
all_paras = doc.paragraphs
for para in all_paras:
    print(para.text)
    print("-----------")
which gives me the following error:
Traceback (most recent call last):
  File "c:\Users\clemdcz\Desktop\Projet\intoWORD.py", line 3, in <module>
    doc = docx.Document(r'C:\Users\clemdcz\Desktop\my_file.docm')
  File "C:\Users\clemdcz\AppData\Local\Programs\Python\Python310\lib\site-packages\docx\api.py", line 36, in Document
    return document_part.document
AttributeError: 'Part' object has no attribute 'document'
If I try the same code with a docx file, it works fine and shows me the correct data. How can I fix this error?
The documentation doesn't seem to give any information on docm files, but I read that it was supposed to work the same for both docm and docx. I couldn't find any other libraries that could manipulate docx files with Python.
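This thread does not include an answer. As a starting point for investigation, here is a sketch that lists the content types a package declares. It rests on an assumption the thread does not confirm: that the error occurs because the main part of a .docm declares a macro-enabled content type which the loader does not map to a document part. A .docx or .docm file is a ZIP package whose [Content_Types].xml records these types. The helper name part_content_types is hypothetical, and the code uses only the standard library.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by [Content_Types].xml in Open Packaging Convention files.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def part_content_types(path):
    """Map each explicitly declared part name to its content type.

    Hypothetical helper for illustration; not part of python-docx.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    return {
        el.get("PartName"): el.get("ContentType")
        for el in root.iter(CT_NS + "Override")
    }
```

Comparing the entry for /word/document.xml between a working .docx and the failing .docm would show whether the two files declare different types for the main document part.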

OpenAI retrieve file content

I am unable to retrieve the content of a file that has already been uploaded.
What is going wrong? I have tried each file purpose: search, classification, answers, and fine-tune. The files upload successfully, but retrieving their content raises an error.
import openai
openai.api_key = "sk-bbjsjdjsdksbndsndksbdksbknsndksd" # this is wrong key
# Replace file_id with the file's id whose file content is required
content = openai.File.download("file-5Xs86wEDO5gx8fOitMYArV8r")
print(content)
Error:
Traceback (most recent call last):
  File "main.py", line 6, in <module>
    content = openai.File.download("file-5Xs86wEDO5gx8fOitMYArV8r")
  File "/usr/local/lib/python3.8/dist-packages/openai/api_resources/file.py", line 61, in download
    raise requestor.handle_error_response(
openai.error.InvalidRequestError: Not allowed to download files of purpose: classifications

Answer from OpenAI community
Currently, we only allow downloads of the results of fine-tuning runs and not of the input files to the fine-tuning run. We also don't allow downloads of search-related files.

Simple PyPDF exercise - AttributeError: 'NullObject' object has no attribute 'get'

I'm working on a simple PyPDF-related exercise: I need to take a PDF file and apply a watermark to it.
Here's my code:
# We need to build a program that will watermark all of our PDF files
# Use the wtr.pdf and apply it to all of the pages of our PDF file
import PyPDF2
# Open the file we want to add the watermark to
with open("combined.pdf", mode="rb") as file:
    reader = PyPDF2.PdfFileReader(file)
    # Open the watermark file and get the watermark
    with open("wtr.pdf", mode="rb") as watermark_file:
        watermark_reader = PyPDF2.PdfFileReader(watermark_file)
        # Create a writer object for the output file
        writer = PyPDF2.PdfFileWriter()
        for i in range(reader.numPages):
            page = reader.getPage(i)
            # Merge the watermark page object into our current page
            page.mergePage(watermark_reader.getPage(0))
            # Append this new page into our writer object
            writer.addPage(page)
        with open("watermarked.pdf", mode="wb") as output_file:
            writer.write(output_file)
I am unclear as to why I get this error:
$ python watermark.py
Traceback (most recent call last):
  File "watermark.py", line 20, in <module>
    page.mergePage(watermark_reader.getPage(0))
  File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2239, in mergePage
    self._mergePage(page2)
  File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2260, in _mergePage
    new, newrename = PageObject._mergeResources(originalResources, page2Resources, res)
  File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2170, in _mergeResources
    newRes.update(res1.get(resource, DictionaryObject()).getObject())
AttributeError: 'NullObject' object has no attribute 'get'
I would appreciate any insights. I have been staring at this for a while.

For some reason your PDF file doesn't contain "/Resources". PyPDF2 tries to get it at line 2314 of https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/pdf.py#L2314
You can try another PDF file to check whether the error persists. It may be a bug in the library, or the library may not support such files.
Another thing I noticed is that the line numbers in the master branch of the library do not match the line numbers in your stack trace, so you may need to install a more recent version of the library and hope that the problem is fixed there.
From a brief look at the PDF file structure, it seems that /Resources is optional. If that is the case, PyPDF2 doesn't handle it, and it should probably be reported as a bug at https://github.com/mstamy2/PyPDF2/issues

Error setting psm for pytesseract

I'm trying to use a psm of 0 with pytesseract, but I'm getting an error. My code is:
import pytesseract
from PIL import Image
img = Image.open('pathToImage')
pytesseract.image_to_string(img, config='-psm 0')
The error that comes up is
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 126, in image_to_string
    f = open(output_file_name, 'rb')
IOError: [Errno 2] No such file or directory: '/var/folders/m8/pkg0ppx11m19hwn71cft06jw0000gp/T/tess_uIaw2D.txt'
When I go into '/var/folders/m8/pkg0ppx11m19hwn71cft06jw0000gp/T', there's a file called tess_uIaw2D.osd that seems to contain the output information I was looking for. It seems like tesseract is saving a file as .osd, then looking for that file but with a .txt extension. When I run tesseract through the command line with --psm 0, it saves the output file as .osd instead of .txt.
Is it correct that pytesseract's image_to_string() works by saving an output file somewhere and then automatically reading that output file? And is there any way to either set tesseract to save the file as .txt, or to set it to look for a .osd file? I'm having no issues just running the image_to_string() function when I don't set the psm.

You have a couple of questions here:
PSM error
In your question you mention that you run "--psm 0" on the command line. However, your code snippet has "-psm 0".
Using the double dash, config="--psm 0", will fix that issue.
If you read the tesseract command-line documentation, you can specify where to output the text read from the image. I suggest you start there.
"Is it correct that pytesseract's image_to_string() works by saving an output file somewhere and then automatically reading that output file?"
From my usage of tesseract, this is not how it works.
By default, pytesseract.image_to_string() returns the string found in the image. This is controlled by the parameter output_type=Output.STRING, which you can see in the definition of image_to_string.
The other return options include (1) Output.BYTES and (2) Output.DICT.
I usually have something like text = pytesseract.image_to_string(img)
and then write that text to a log file.
Here is an example:
import datetime
import io
import pytesseract
import cv2
img = cv2.imread("pathToImage")
text = pytesseract.image_to_string(img, config="--psm 0")
ocr_log = "C:/foo/bar/output.txt"
timestamp_fmt = "%Y-%m-%d_%H-%M-%S-%f"
# ...
# DO SOME OTHER STUFF BEFORE WRITING TO LOG FILE
# ...
with io.open(ocr_log, "a") as ocr_file:
    timestamp = datetime.datetime.now().strftime(timestamp_fmt)
    ocr_file.write(f"{timestamp}:\n====OCR-START===\n")
    ocr_file.write(text)
    ocr_file.write("\n====OCR-END====\n")

How to compress a text file?

I have a text file created and I want to compress it.
How would I accomplish this?
I have done some research around the forum and found a similar question, but when I tried its code it did not work for me, because that code compresses text typed into the script, not a file. For example:
import zlib, base64
text = 'STACK OVERFLOW'
code = base64.b64encode(zlib.compress(text,9))
print code
Source: (Compressing a file in python and keep the grammar exact when opening it again)
When I tried it out with a file, this error came up:
Traceback (most recent call last):
  File "C:\Users\Shahid\Desktop\Suhail\Task 3.py", line 3, in <module>
    code = base64.b64encode(zlib.compress(text,9))
TypeError: must be string or read-only buffer, not file
Here is the code that I have used:
import zlib, base64
text = open('Suitable.txt','r')
code = base64.b64encode(zlib.compress(text,9))
print code
But what I want is for a text file to be compressed.

There is a section entitled "Example of how to GZIP compress an existing file" at the bottom of https://docs.python.org/2/library/gzip.html
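A minimal sketch of that approach, using only the standard library. The function name gzip_file and the output file name in the usage note are assumptions made for illustration; the file is read in binary mode so its contents are copied byte for byte.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path):
    """Write a gzip-compressed copy of src_path to dst_path.

    Hypothetical helper name, for illustration only.
    """
    # Stream from the source file into the gzip writer without
    # loading the whole file into memory.
    with open(src_path, "rb") as f_in, gzip.open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```

For the file in the question this would be called as gzip_file('Suitable.txt', 'Suitable.txt.gz'), and the original text can be read back with gzip.open('Suitable.txt.gz', 'rb').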

You should use this code to do what you tried:
import zlib, base64
# zlib.compress() needs the file's contents as bytes, not the file object itself
with open('Suitable.txt', 'r') as file:
    text = file.read()
code = base64.b64encode(zlib.compress(text.encode('utf-8'), 9))
code = code.decode('utf-8')
print(code)
But the result won't actually be smaller if code ends up longer than text, which is what happens with short inputs.
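The size effect can be checked directly. The sketch below compares the length of the raw bytes, the zlib output, and the Base64-encoded zlib output for a short string and for a long repetitive one. The function name sizes and the sample inputs are made up for illustration.

```python
import base64
import zlib


def sizes(data):
    """Return (raw, compressed, compressed-then-Base64) lengths in bytes."""
    compressed = zlib.compress(data, 9)
    encoded = base64.b64encode(compressed)
    return len(data), len(compressed), len(encoded)


# A short string: the zlib header and checksum outweigh any savings.
short = sizes(b"STACK OVERFLOW")
# A long, repetitive input: compression pays off even after Base64 encoding.
long_ = sizes(b"STACK OVERFLOW " * 1000)
print(short)
print(long_)
```

Base64 turns every 3 bytes into 4 characters, so it is only worth applying when the compressed data has to be stored or sent as text.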
