I'm working on a simple PyPDF2 exercise: I need to take a PDF file and apply a watermark to it.
Here's my code:
# We need to build a program that will watermark all of our PDF files
# Use the wtr.pdf and apply it to all of the pages of our PDF file
import PyPDF2

# Open the file we want to add the watermark to
with open("combined.pdf", mode="rb") as file:
    reader = PyPDF2.PdfFileReader(file)

    # Open the watermark file and get the watermark
    with open("wtr.pdf", mode="rb") as watermark_file:
        watermark_reader = PyPDF2.PdfFileReader(watermark_file)

        # Create a writer object for the output file
        writer = PyPDF2.PdfFileWriter()

        for i in range(reader.numPages):
            page = reader.getPage(i)
            # Merge the watermark page object into our current page
            page.mergePage(watermark_reader.getPage(0))
            # Append this new page into our writer object
            writer.addPage(page)

        with open("watermarked.pdf", mode="wb") as output_file:
            writer.write(output_file)
I am unclear as to why I get this error:
$ python watermark.py
Traceback (most recent call last):
File "watermark.py", line 20, in <module>
page.mergePage(watermark_reader.getPage(0))
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2239, in mergePage
self._mergePage(page2)
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2260, in _mergePage
new, newrename = PageObject._mergeResources(originalResources, page2Resources, res)
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2170, in _mergeResources
newRes.update(res1.get(resource, DictionaryObject()).getObject())
AttributeError: 'NullObject' object has no attribute 'get'
I would appreciate any insights. I have been staring at this for a while.
For some reason your PDF file doesn't contain "/Resources". PyPDF2 tries to read it at line 2314 of pdf.py: https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/pdf.py#L2314
You can try another PDF file to check whether the error persists. It may be a bug in the library, or the library may not support such files.
I also noticed that the line numbers in the master branch of the library do not match the line numbers in your stack trace, so you may need to install a more recent version of the library and hope the problem is fixed there.
From a brief look at the PDF file structure, it seems that /Resources is optional. If that is the case, PyPDF2 doesn't handle it, and it should probably be reported as a bug at https://github.com/mstamy2/PyPDF2/issues
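Until the library handles this, one possible workaround is to test each page before merging. This is only a minimal sketch, assuming the PyPDF2 1.x page objects seen in the traceback, which behave like dictionaries and expose getObject(). The name has_usable_resources is a hypothetical helper, not part of PyPDF2:

```python
def has_usable_resources(page):
    """Return True if the page's /Resources entry resolves to a dictionary-like object."""
    resources = page.get("/Resources")
    # PyPDF2 1.x may return an indirect reference; resolve it when possible.
    if hasattr(resources, "getObject"):
        resources = resources.getObject()
    # A missing entry (None) or a PDF null object has no .get(), which is
    # what _mergeResources trips over in the traceback above.
    return hasattr(resources, "get")
```

In the loop from the question you could call it on both page and watermark_reader.getPage(0), and skip or report any page where it returns False instead of letting mergePage raise.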
Related
I am processing a set of DICOM files, some of which have image information and some of which don't. If a file has image information, the following code works fine.
file_reader = sitk.ImageFileReader()
file_reader.SetFileName(fileName)
file_reader.ReadImageInformation()
However, if the file does not have image information, I get the following error.
Traceback (most recent call last):
File "<ipython-input-61-d187aed107ed>", line 5, in <module>
file_reader.ReadImageInformation()
File "/home/peter/anaconda3/lib/python3.7/site-packages/SimpleITK/SimpleITK.py", line 8673, in ReadImageInformation
return _SimpleITK.ImageFileReader_ReadImageInformation(self)
RuntimeError: Exception thrown in SimpleITK ImageFileReader_ReadImageInformation: /tmp/SimpleITK/Code/IO/src/sitkImageReaderBase.cxx:107:
sitk::ERROR: Unable to determine ImageIO reader for "/path/115.dcm"
If the DICOM file has no image information, I would like to just skip the file rather than calling ReadImageInformation(). Is there a way to check whether ReadImageInformation() will work before it is called? I tried the following, and they behave no differently between files where ReadImageInformation() works and files where it does not.
file_reader.GetImageIO()
file_reader.GetMetaDataKeys() # Crashes
file_reader.GetDimension()
I would just put an exception handler around it to catch the error. So it'd look something like this:
file_reader = sitk.ImageFileReader()
file_reader.SetFileName(fileName)
try:
    file_reader.ReadImageInformation()
except RuntimeError:
    # SimpleITK reports the unreadable file as a RuntimeError (see the traceback above)
    print(fileName, "has no image information")
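If you are looping over many files, the same idea can be wrapped in a small filter. This is a hedged sketch: files_with_image_info and read_info are hypothetical names, and the reading step is passed in as a function so the filter itself does not depend on SimpleITK:

```python
def files_with_image_info(file_names, read_info):
    """Return the file names for which read_info(name) does not raise RuntimeError."""
    readable = []
    for name in file_names:
        try:
            read_info(name)
        except RuntimeError as err:
            print(f"Skipping {name}: {err}")
        else:
            readable.append(name)
    return readable


# With SimpleITK, read_info could look like this (assumes `import SimpleITK as sitk`):
#
# def read_info(name):
#     reader = sitk.ImageFileReader()
#     reader.SetFileName(name)
#     reader.ReadImageInformation()
```

Catching only RuntimeError keeps unrelated problems, such as a typo in a variable name, from being silently swallowed.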
I've used PyMuPDF library to parse the content of any specific page of a pdf file locally and found it working. However, when I try to apply the same logic while parsing the content of any specific page of a pdf file available online, I encounter an error.
I got success using the following script (local pdf):
import fitz
path = r'C:\Users\WCS\Desktop\pymupdf\Regular Expressions Cookbook.pdf'
doc = fitz.open(path)
page1 = doc.loadPage(5)
page1text = page1.getText("text")
print(page1text)
The script below throws an error (pdf that is available online):
import fitz
import requests
URL = 'https://buildmedia.readthedocs.org/media/pdf/pdfminer-docs/latest/pdfminer-docs.pdf'
res = requests.get(URL)
doc = fitz.open(res.content)
page1 = doc.loadPage(5)
page1text = page1.getText("text")
print(page1text)
Error that the script encounters:
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\general_demo.py", line 8, in <module>
doc = fitz.open(res.content)
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\fitz\fitz.py", line 2010, in __init__
_fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))
RuntimeError: cannot open b'%PDF-1.5\n%\xd0\xd4\xc5\xd8\n1 0 obj\n<<\n/Length 843 \n/Filter /FlateDecode\n>>\nstream\nx\xdamUMo\xe20\x10\xbd\xe7Wx\x0f\x95\xda\x03\xc5N\xc8W\x85\x90\x9c\x84H\x1c\xb6\xad\nZ\xed\x95&\xa6\x8bT\x12\x14\xe0\xd0\x7f\xbf~3\x13\xda\xae\xf
How can I read the content directly from online?
Looks like you need to initialize the object with stream:
>>> # from memory
>>> doc = fitz.open(stream=mem_area, filetype="pdf")
Here mem_area holds the document's data (in your case, res.content).
https://pymupdf.readthedocs.io/en/latest/document.html#Document
I think you were missing a call to read(), which returns the file's contents as bytes that PyMuPDF can then consume through the stream argument.
with fitz.open(stream=uploaded_pdf.read(), filetype="pdf") as doc:
    text = ""
    for page in doc:
        text += page.getText()
    print(text)
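Putting the two answers together for the URL case in the question, a rough sketch might look like the following. It assumes requests and PyMuPDF are installed and keeps the loadPage/getText names used in the question (newer PyMuPDF releases spell them load_page and get_text). Both looks_like_pdf and page_text_from_url are hypothetical helper names:

```python
def looks_like_pdf(data):
    """Cheap sanity check: PDF data normally starts with the '%PDF-' marker."""
    return data.startswith(b"%PDF-")


def page_text_from_url(url, page_number):
    """Download a PDF and return the text of one page (0-based index)."""
    # Third-party imports are kept local so the helper above works without them.
    import requests
    import fitz  # PyMuPDF

    response = requests.get(url)
    response.raise_for_status()
    if not looks_like_pdf(response.content):
        raise ValueError("response body does not look like a PDF")

    # Pass the bytes through stream=; a positional argument is treated as a file name.
    with fitz.open(stream=response.content, filetype="pdf") as doc:
        return doc.loadPage(page_number).getText("text")
```

With the URL from the question, print(page_text_from_url(URL, 5)) would then replace the last four lines of the failing script.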
I'm using a slightly modified version of the "Extract all text from slides in presentation" example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html to extract text from some PowerPoint slides.
I'm getting a PackageNotFoundError when I try to use the Presentation() method to open some of the PowerPoint files to read the text.
This appears to be due to the fact that, unbeknownst to me before I started this project, a few of the PowerPoint files are password protected.
I obviously don't expect to be able to read text from a password-protected file but is there a recommended way of dealing with password-protected PowerPoint files? Having my Python script die every time it runs into one is annoying.
I'd be fine with something that basically went: "Hi! The file you're trying to read may be password-protected. Skipping."
I tried using a try/except block to catch the PackageNotFoundError but then I got "NameError: name 'PackageNotFoundError' is not defined".
EDIT1: Here's a minimal case that generates the error:
EDIT2: See below for a working try/catch block, thanks to TheGamer007's suggestion.
import pptx
from pptx import Presentation
password_protected_file = r"C:\Users\J69401\Documents\password_protected_file.pptx"
prs = Presentation(password_protected_file)
And here's the error that is generated:
Traceback (most recent call last):
File "T:/W/Wintermute/50 Sandbox/Pownall/Python/copy files/minimal_case_opening_file.py", line 6, in <module>
prs = Presentation(password_protected_file)
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\api.py", line 28, in Presentation
presentation_part = Package.open(pptx).main_document_part
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\package.py", line 125, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\pkgreader.py", line 33, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Anaconda3\lib\site-packages\python_pptx-0.6.18-py3.6.egg\pptx\opc\phys_pkg.py", line 32, in __new__
raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at 'C:\Users\J69401\Documents\password_protected_file.pptx'
Here's the minimal case again but with a working try/catch block.
import pptx
from pptx import Presentation
import pptx.exc
from pptx.exc import PackageNotFoundError
password_protected_file = r"C:\Users\J69401\Documents\password_protected_file.pptx"
try:
    prs = Presentation(password_protected_file)
except PackageNotFoundError:
    print("PackageNotFoundError generated - possible password-protected file.")
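The exception alone does not say why the package was not found. As a hedged extra check: a normal .pptx is a ZIP archive, while Office stores password-encrypted documents in an OLE compound file, so the first bytes of the file hint at which case you hit. container_kind and read_header below are hypothetical helpers, and "ole" can also mean a legacy .ppt file saved with the wrong extension:

```python
ZIP_MAGIC = b"PK\x03\x04"                          # start of a typical ZIP archive
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")      # start of an OLE compound file


def container_kind(header):
    """Classify a file by its first bytes: 'zip', 'ole', or 'unknown'."""
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(OLE_MAGIC):
        return "ole"
    return "unknown"


def read_header(path, size=8):
    """Read the first few bytes of a file."""
    with open(path, "rb") as handle:
        return handle.read(size)
```

Inside the except block you could then print a more specific message when container_kind(read_header(password_protected_file)) returns "ole".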
I am trying to write a Python program that loops through all files in a folder, selects those with the extension '.pdf', and encrypts them with restricted permissions. I am using this version of the PyPDF2 library:
https://github.com/vchatterji/PyPDF2 (a modification of the original PyPDF2 that also allows setting permissions). I have tested it with a single PDF file and it works fine. I want the original PDF file to be deleted and the encrypted one to remain under the same name.
Here is my code:
import os
import PyPDF2
directory = './'
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        pdfFile = open(filename, 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFile)
        pdfWriter = PyPDF2.PdfFileWriter()
        for pageNum in range(pdfReader.numPages):
            pdfWriter.addPage(pdfReader.getPage(pageNum))
        pdfFile.close()
        os.remove(filename)
        pdfWriter.encrypt('', 'ispat', perm_mask=-3904)
        resultPdf = open(filename, 'wb')
        pdfWriter.write(resultPdf)
        resultPdf.close()
        continue
    else:
        continue
It gives the following error:
C:\Users\manul\Desktop\ghh>python encrypter.py
Traceback (most recent call last):
File "encrypter.py", line 9, in <module>
pdfReader = PyPDF2.PdfFileReader(pdfFile)
File "C:\Users\manul\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 1153, in __init__
self.read(stream)
File "C:\Users\manul\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 1758, in read
stream.seek(-1, 2)
OSError: [Errno 22] Invalid argument
I have some PDFs stored in 'ghh' folder on Desktop. Any help is greatly appreciated.
Using pdfReader = PyPDF2.PdfFileReader(filename) will make the reader work, but this specific error is caused by your files being empty. You can check the file sizes with os.path.getsize(filename). Your files were probably wiped by an earlier run: the script deletes the original file, creates a new file with open(filename, 'wb'), and then terminates with an error in pdfWriter.write(resultPdf), leaving an empty file under the original name.
Passing a file name instead of a file object to PdfFileReader, as mentioned, resolves the error raised by pdfWriter.write (I don't know why), but to get rid of the OSError you'll need to replace any empty files in your directory with copies of the original PDFs.
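To avoid losing the originals if the write step fails again, one option is to write the result to a temporary file and only swap it in once writing has succeeded. The sketch below is generic and uses only the standard library. rewrite_file_safely and build_output are hypothetical names; in your case build_output would create the reader and writer, encrypt, and write to the temporary file. It also keeps the source open while writing, on the assumption (not confirmed here) that the writer still needs to read page data from it at that point:

```python
import os
import tempfile


def rewrite_file_safely(path, build_output):
    """Replace `path` with new content, keeping the original if anything fails.

    build_output(source, tmp_file) receives the original opened for reading
    and a temporary file opened for binary writing.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp_file, open(path, "rb") as source:
            build_output(source, tmp_file)
        # Both files are closed here, so the swap also works on Windows.
        os.replace(tmp_path, path)
    except BaseException:
        # Something went wrong: drop the partial output, leave the original alone.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the original is only replaced after a successful write, a crash in the middle leaves you with the untouched PDF rather than an empty file.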
Have been trying to put an image into a PDF file using PyMuPDF / Fitz and everywhere I look on the internet I get the same syntax, but when I use it I'm getting a runtime error.
>>> doc = fitz.open("NewPDF.pdf")
>>> page = doc[1]
>>> rect = fitz.Rect(0,0,880,1080)
>>> page.insertImage(rect, filename = "Image01.jpg")
error: object is not a stream
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\fitz\fitz.py", line 1225, in insertImage
return _fitz.Page_insertImage(self, rect, filename, pixmap, overlay)
RuntimeError: object is not a stream
>>> page
page 1 of NewPDF.pdf
I've tried a few different variations on this, with pixmap and without, with overlay value set, and without. The PDF file exists and can be opened with Adobe Acrobat Reader, and the image file exists - I have tried PNG and JPG.
Thank you in advance for any help.
Just some hints to try:
Make sure your "Image01.jpg" file can actually be opened, and use its full path.
from PIL import Image  # assumption: Image here is Pillow's Image module
image_path = "/full/path/to/Image01.jpg"
image_file = Image.open(open(image_path, 'rb'))
# side-note: generally it is better to use the open with syntax, see link below
# https://stackoverflow.com/questions/9282967/how-to-open-a-file-using-the-open-with-statement
To make sure you are actually on the PDF page you expect, try this. Note that page numbers are 0-based, so doc[1] in your code is the second page. This code will insert the image only on the first page:
for page in doc:
    page.insertImage(rect, filename=image_path)
    break  # Without this, the image will appear on each page of your pdf