Encrypting many PDFs by python using PyPDF2

Encrypting many PDFs by python using PyPDF2 - python

I am trying to make a python program which loops through all files in a folder, selects those which have extension '.pdf', and encrypt them with restricted permissions. I am using this version of PyPDF2 library:
https://github.com/vchatterji/PyPDF2. (A modification of the original PyPDF2 which also allows to set permissions). I have tested it with a single pdf file and it works fine. I want that the original pdf file should be deleted and the encrypted one should remain with the same name.
Here is my code:
import os
import PyPDF2
directory = './'
for filename in os.listdir(directory):
if filename.endswith(".pdf"):
pdfFile = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFile)
pdfWriter = PyPDF2.PdfFileWriter()
for pageNum in range(pdfReader.numPages):
pdfWriter.addPage(pdfReader.getPage(pageNum))
pdfFile.close()
os.remove(filename)
pdfWriter.encrypt('', 'ispat', perm_mask=-3904)
resultPdf = open(filename, 'wb')
pdfWriter.write(resultPdf)
resultPdf.close()
continue
else:
continue
It gives the following error:
C:\Users\manul\Desktop\ghh>python encrypter.py
Traceback (most recent call last):
File "encrypter.py", line 9, in <module>
pdfReader = PyPDF2.PdfFileReader(pdfFile)
File "C:\Users\manul\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 1153, in __init__
self.read(stream)
File "C:\Users\manul\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 1758, in read
stream.seek(-1, 2)
OSError: [Errno 22] Invalid argument
I have some PDFs stored in 'ghh' folder on Desktop. Any help is greatly appreciated.

Using pdfReader = PyPDF2.PdfFileReader(filename) will make the reader work, but this specific error is caused by your files being empty. You can check the file sizes with os.path.getsize(filename). Your files were probably wiped because the script deletes the original file, then creates a new file with open(filepath, "wb"), and then it terminates incorrectly due to an error that occurs with pdfWriter.write(resultPdf), leaving an empty file with the original file name.
Passing a file name instead of a file object to PdfFileReader as mentioned resolves the error that occurs with pdfWriter (I don't know why), but you'll need to replace any empty files in your directory with copies of the original pdfs to get rid of the OSError.

Related

Simple PyPDF exercise - AttributeError: 'NullObject' object has no attribute 'get'

Working on a simple PyPDF related exercise - I basically need to take a PDF file and apply a watermark to to it.
Here's my code:
# We need to build a program that will watermark all of our PDF files
# Use the wtr.pdf and apply it to all of the pages of our PDF file
import PyPDF2
# Open the file we want to add the watermark to
with open("combined.pdf", mode="rb") as file:
reader = PyPDF2.PdfFileReader(file)
# Open the watermark file and get the watermark
with open("wtr.pdf", mode="rb") as watermark_file:
watermark_reader = PyPDF2.PdfFileReader(watermark_file)
# Create a writer object for the output file
writer = PyPDF2.PdfFileWriter()
for i in range(reader.numPages):
page = reader.getPage(i)
# Merge the watermark page object into our current page
page.mergePage(watermark_reader.getPage(0))
# Append this new page into our writer object
writer.addPage(page)
with open("watermarked.pdf", mode="wb") as output_file:
writer.write(output_file)
I am unclear as to why I get this error:
$ python watermark.py
Traceback (most recent call last):
File "watermark.py", line 20, in <module>
page.mergePage(watermark_reader.getPage(0))
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2239, in mergePage
self._mergePage(page2)
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2260, in _mergePage
new, newrename = PageObject._mergeResources(originalResources, page2Resources, res)
File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 2170, in _mergeResources
newRes.update(res1.get(resource, DictionaryObject()).getObject())
AttributeError: 'NullObject' object has no attribute 'get'
I would appreciate any insights. I have been staring at this for a while.

For some reason your pdf file doesn't contain "/Resources". PyPDF2 tries to get it in line 2314 in https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/pdf.py#L2314
You can try another pdf file to check if the error persists. May be it is a bug in the library or the library doesn't support such files.
Another thing I noticed is that line numbers in master branch of the library do not match line numbers in your stack trace, so may be you need to get more recent version of the library and hope that the problem is fixed there.
By briefly looking at pdf file structure it seems that /Resources are optional. If this is a case, then PyPDF2 doesn't handle this case and it should be probably reported as a bug at https://github.com/mstamy2/PyPDF2/issues

How do I know my file is attached in my PDF using PyPDF2?

I am trying to attach an .exe file into a PDF using PyPDF2.
I ran the code below, but my PDF file is still the same size.
I don't know if my file was attached or not.
from PyPDF2 import PdfFileWriter, PdfFileReader
writer = PdfFileWriter()
reader = PdfFileReader("doc1.pdf")
# check it's whether work or not
print("doc1 has %d pages" % reader.getNumPages())
writer.addAttachment("doc1.pdf", "client.exe")
What am I doing wrong?

First of all, you have to use the PdfFileWriter class properly.
You can use appendPagesFromReader to copy pages from the source PDF ("doc1.pdf") to the output PDF (ex. "out.pdf"). Then, for addAttachment, the 1st parameter is the filename of the file to attach and the 2nd parameter is the attachment data (it's not clear from the docs, but it has to be a bytes-like sequence). To get the attachment data, you can open the .exe file in binary mode, then read() it. Finally, you need to use write to actually save the PdfFileWriter object to an actual PDF file.
Here is a more working example:
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
with open("client.exe", "rb") as exe:
writer.addAttachment("client.exe", exe.read())
with open("out.pdf", "wb") as f:
writer.write(f)
Next, to check if attaching was successful, you can use os.stat.st_size to compare the file size (in bytes) before and after attaching the .exe file.
Here is the same example with checking for file sizes:
(I'm using Python 3.6+ for f-strings)
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
with open("client.exe", "rb") as exe:
writer.addAttachment("client.exe", exe.read())
with open("out.pdf", "wb") as f:
writer.write(f)
# Check result
print(f"size of SOURCE: {os.stat('doc1.pdf').st_size}")
print(f"size of EXE: {os.stat('client.exe').st_size}")
print(f"size of OUTPUT: {os.stat('out.pdf').st_size}")
The above code prints out
size of SOURCE: 42942
size of EXE: 989744
size of OUTPUT: 1031773
...which sort of shows that the .exe file was added to the PDF.
Of course, you can manually check it by opening the PDF in Adobe Reader:
As a side note, I am not sure what you want to do with attaching exe files to PDF, but it seems you can attach them but Adobe treats them as security risks and may not be possible to be opened. You can use the same code above to attach another PDF file (or other documents) instead of an executable file, and it should still work.

Merge 2 pdf files giving me an empty pdf

I am using the following standard code:
# importing required modules
import PyPDF2
def PDFmerge(pdfs, output):
# creating pdf file merger object
pdfMerger = PyPDF2.PdfFileMerger()
# appending pdfs one by one
for pdf in pdfs:
with open(pdf, 'rb') as f:
pdfMerger.append(f)
# writing combined pdf to output pdf file
with open(output, 'wb') as f:
pdfMerger.write(f)
def main():
# pdf files to merge
pdfs = ['example.pdf', 'rotated_example.pdf']
# output pdf file name
output = 'combined_example.pdf'
# calling pdf merge function
PDFmerge(pdfs = pdfs, output = output)
if __name__ == "__main__":
# calling the main function
main()
But when I call this with my 2 pdf files (which just contain some text), it produces an empty pdf file, I am wondering how this may be caused?

The problem is that you're closing the files before the write.
When you call pdfMerger.append, it doesn't actually read and process the whole file then; it only does so later, when you call pdfMerger.write. Since the files you've appended are closed, it reads no data from each of them, and therefore outputs an empty PDF.
This should actually raise an exception, which would have made the problem and the fix obvious. Apparently this is a bug introduced in version 1.26, and it will be fixed in the next version. Unfortunately, while the fix was implemented in July 2016, there hasn't been a next version since May 2016. (See this issue.)
You could install directly off the github master (and hope there aren't any new bugs), or you could continue to wait for 1.27, or you could work around the bug. How? Simple: just keep the files open until the write is done:
with contextlib.ExitStack() as stack:
pdfMerger = PyPDF2.PdfFileMerger()
files = [stack.enter_context(open(pdf, 'rb')) for pdf in pdfs]
for f in files:
pdfMerger.append(f)
with open(output, 'wb') as f:
pdfMerger.write(f)

The workaround I have found that works uses an instance of PdfFileReader as the object to append.
from PyPDF2 import PdfFileMerger
from PyPDF2 import PdfFileReader
merger = PdfFileMerger()
for f in ['file1.pdf', 'file2.pdf', 'file3.pdf']:
merger.append(PdfFileReader(f), 'rb')
with open('finished_copy.pdf', 'wb') as new_file:
merger.write(new_file)
Hope that helps!

CSV Should Return Strings, Not Bytes Error

I am trying to read CSV files from a directory that is not in the same directory as my Python script.
Additionally the CSV files are stored in ZIP folders that have the exact same names (the only difference being one ends with .zip and the other is a .csv).
Currently I am using Python's zipfile and csv libraries to open and get the data from the files, however I am getting the error:
Traceback (most recent call last): File "write_pricing_data.py", line 13, in <module>
for row in reader:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
My code:
import os, csv
from zipfile import *
folder = r'D:/MarketData/forex'
localFiles = os.listdir(folder)
for file in localFiles:
zipArchive = ZipFile(folder + '/' + file)
with zipArchive.open(file[:-4] + '.csv') as csvFile:
reader = csv.reader(csvFile, delimiter=',')
for row in reader:
print(row[0])
How can I resolve this error?

It's a bit of a kludge and I'm sure there's a better way (that just happens to elude me right now). If you don't have embedded new lines, then you can use:
import zipfile, csv
zf = zipfile.ZipFile('testing.csv.zip')
with zf.open('testing.csv', 'r') as fin:
# Create a generator of decoded lines for input to csv.reader
# (the csv module is only really happy with ASCII input anyway...)
lines = (line.decode('ascii') for line in fin)
for row in csv.reader(lines):
print(row)

PyPDF2 IOError: [Errno 22] Invalid argument on PyPdfFileReader Python 2.7

Goal = Open file, encrypt file, write encrypted file.
Trying to use the PyPDF2 module to accomplish this. I have verified theat "input" is a file type object. I have researched this error and it translates to "file not found". I believe that it is linked somehow to the file/file path but am unsure how to debug or troubleshoot. and getting the following error:
Traceback (most recent call last):
File "CommissionSecurity.py", line 52, in <module>
inputStream = PyPDF2.PdfFileReader(input)
File "build\bdist.win-amd64\egg\PyPDF2\pdf.py", line 1065, in __init__
File "build\bdist.win-amd64\egg\PyPDF2\pdf.py", line 1660, in read
IOError: [Errno 22] Invalid argument
Below is the relevant code. I'm not sure how to correct this issue because I'm not really sure what the issue is. Any guidance is appreciated.
for ID in FileDict:
if ID in EmailDict :
path = "C:\\Apps\\CorVu\\DATA\\Reports\\AlliD\\Monthly Commission Reports\\Output\\pdcom1\\"
#print os.listdir(path)
file = os.path.join(path + FileDict[ID])
with open(file, 'rb') as input:
print type(input)
inputStream = PyPDF2.PdfFileReader(input)
output = PyPDF2.PdfFileWriter()
output = inputStream.encrypt(EmailDict[ID][1])
with open(file, 'wb') as outputStream:
output.write(outputStream)
else : continue

I think your problem might be caused by the fact that you use the same filename to both open and write to the file, opening it twice:
with open(file, 'rb') as input :
with open(file, 'wb') as outputStream :
The w mode will truncate the file, thus the second line truncates the input.
I'm not sure what you're intention is, because you can't really try to read from the (beginning) of the file, and at the same time overwrite it. Even if you try to write to the end of the file, you'll have to position the file pointer somewhere.
So create an extra output file that has a different name; you can always rename that output file to your input file after both files are closed, thus overwriting your input file.
Or you could first read the complete file into memory, then write to it:
with open(file, 'rb') as input:
inputStream = PyPDF2.PdfFileReader(input)
output = PyPDF2.PdfFileWriter()
output = input.encrypt(EmailDict[ID][1])
with open(file, 'wb') as outputStream:
output.write(outputStream)
Notes:
you assign inputStream, but never use it
you assign PdfFileWriter() to output, and then assign something else to output in the next line. Hence, you never used the result from the first output = line.
Please check carefully what you're doing, because it feels there are numerous other problems with your code.
Alternatively, here are some other tips that may help:
The documentation suggests that you can also use the filename as first argument to PdfFileReader:
stream – A File object or an object that supports the standard read
and seek methods similar to a File object. Could also be a string
representing a path to a PDF file.
So try:
inputStream = PyPDF2.PdfFileReader(file)
You can also try to set the strict argument to False:
strict (bool) – Determines whether user should be warned of all
problems and also causes some correctable problems to be fatal.
Defaults to True.
For example:
inputStream = PyPDF2.PdfFileReader(file, strict=False)

Using open(file, 'rb') was causing the issue becuase PdfFileReader() does that automagically. I just removed the with statement and that corrected the problem.
with open(file, 'rb') as input:
inputStream = PyPDF2.PdfFileReader(input)

This error raised up because of PDF file is empty.
My PDF file was empty that's why my error was raised up. So First of all i fill my PDF file with some data and Then start reeading it using PyPDF2.PdfFileReader,
And it solved my Problem!!!

Late but, you may be opening an invalid PDF file or an empty file that's named x.pdf and you think it's a PDF file

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Encrypting many PDFs by python using PyPDF2 - python

Related

Simple PyPDF exercise - AttributeError: 'NullObject' object has no attribute 'get'

How do I know my file is attached in my PDF using PyPDF2?

Merge 2 pdf files giving me an empty pdf

CSV Should Return Strings, Not Bytes Error

PyPDF2 IOError: [Errno 22] Invalid argument on PyPdfFileReader Python 2.7

Categories

Resources