Merge 2 pdf files giving me an empty pdf - python

I am using the following standard code:
# importing required modules
import PyPDF2
def PDFmerge(pdfs, output):
    # creating pdf file merger object
    pdfMerger = PyPDF2.PdfFileMerger()
    # appending pdfs one by one
    for pdf in pdfs:
        with open(pdf, 'rb') as f:
            pdfMerger.append(f)
    # writing combined pdf to output pdf file
    with open(output, 'wb') as f:
        pdfMerger.write(f)
def main():
    # pdf files to merge
    pdfs = ['example.pdf', 'rotated_example.pdf']
    # output pdf file name
    output = 'combined_example.pdf'
    # calling pdf merge function
    PDFmerge(pdfs = pdfs, output = output)
if __name__ == "__main__":
    # calling the main function
    main()
But when I call this with my 2 PDF files (which just contain some text), it produces an empty PDF file. What could be causing this?

The problem is that you're closing the files before the write.
When you call pdfMerger.append, it doesn't actually read and process the whole file then; it only does so later, when you call pdfMerger.write. Since the files you've appended are closed, it reads no data from each of them, and therefore outputs an empty PDF.
This should actually raise an exception, which would have made the problem and the fix obvious. Apparently this is a bug introduced in version 1.26, and it will be fixed in the next version. Unfortunately, while the fix was implemented in July 2016, there hasn't been a next version since May 2016. (See this issue.)
You could install directly off the github master (and hope there aren't any new bugs), or you could continue to wait for 1.27, or you could work around the bug. How? Simple: just keep the files open until the write is done:
import contextlib
import PyPDF2
with contextlib.ExitStack() as stack:
    pdfMerger = PyPDF2.PdfFileMerger()
    # every input file stays open until the ExitStack block ends
    files = [stack.enter_context(open(pdf, 'rb')) for pdf in pdfs]
    for f in files:
        pdfMerger.append(f)
    with open(output, 'wb') as f:
        pdfMerger.write(f)
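To make the failure mode above concrete without depending on PyPDF2, here is a minimal sketch using a hypothetical stand-in class, `LazyJoiner`, that (like the merger described above) only stores the handle in `append` and reads it later in `write`. The class and the `join_files` helper are illustrative names, not part of any library.

```python
import contextlib


class LazyJoiner:
    """Hypothetical stand-in for a merger that defers reading until write()."""

    def __init__(self):
        self._streams = []

    def append(self, stream):
        # Only remember the handle; nothing is read yet.
        self._streams.append(stream)

    def write(self, out):
        # The actual reads happen here, so every stream must still be open.
        for stream in self._streams:
            out.write(stream.read())


def join_files(paths, output):
    """Concatenate the files at `paths` into `output`, keeping inputs open."""
    with contextlib.ExitStack() as stack:
        joiner = LazyJoiner()
        for path in paths:
            # enter_context keeps the file open until the ExitStack exits.
            joiner.append(stack.enter_context(open(path, "rb")))
        with open(output, "wb") as out:
            joiner.write(out)
```

With plain file objects, reading a handle after its `with` block has closed it raises ValueError, which is the kind of loud failure the answer says PyPDF2 should have produced instead of an empty output.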

A workaround that works for me is to append an instance of PdfFileReader instead of a file object.
from PyPDF2 import PdfFileMerger
from PyPDF2 import PdfFileReader
merger = PdfFileMerger()
for f in ['file1.pdf', 'file2.pdf', 'file3.pdf']:
    # PdfFileReader accepts a path; a second argument to append would be a bookmark name
    merger.append(PdfFileReader(f))
with open('finished_copy.pdf', 'wb') as new_file:
    merger.write(new_file)
Hope that helps!
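Another way to avoid depending on open file handles is to copy each input into memory first. The sketch below shows only that buffering step with the standard library; `load_into_memory` is a hypothetical helper name, and it assumes the inputs are small enough to hold in memory.

```python
import io


def load_into_memory(paths):
    """Return one in-memory, seekable copy (io.BytesIO) per input file."""
    buffers = []
    for path in paths:
        with open(path, "rb") as f:
            # The file is read fully here, so closing it afterwards is safe.
            buffers.append(io.BytesIO(f.read()))
    return buffers
```

Each returned buffer supports read and seek, so it could be handed to merger.append in place of an open file.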

Related

How do I know my file is attached in my PDF using PyPDF2?

I am trying to attach an .exe file into a PDF using PyPDF2.
I ran the code below, but my PDF file is still the same size.
I don't know if my file was attached or not.
from PyPDF2 import PdfFileWriter, PdfFileReader
writer = PdfFileWriter()
reader = PdfFileReader("doc1.pdf")
# check whether it works or not
print("doc1 has %d pages" % reader.getNumPages())
writer.addAttachment("doc1.pdf", "client.exe")
What am I doing wrong?
First of all, you have to use the PdfFileWriter class properly.
You can use appendPagesFromReader to copy pages from the source PDF ("doc1.pdf") to the output PDF (ex. "out.pdf"). Then, for addAttachment, the 1st parameter is the filename of the file to attach and the 2nd parameter is the attachment data (it's not clear from the docs, but it has to be a bytes-like sequence). To get the attachment data, you can open the .exe file in binary mode, then read() it. Finally, you need to use write to actually save the PdfFileWriter object to an actual PDF file.
Here is a working example:
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
with open("client.exe", "rb") as exe:
    writer.addAttachment("client.exe", exe.read())
with open("out.pdf", "wb") as f:
    writer.write(f)
Next, to check if attaching was successful, you can use os.stat(path).st_size to compare the file sizes (in bytes) before and after attaching the .exe file.
Here is the same example with checking for file sizes:
(I'm using Python 3.6+ for f-strings)
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
with open("client.exe", "rb") as exe:
    writer.addAttachment("client.exe", exe.read())
with open("out.pdf", "wb") as f:
    writer.write(f)
# Check result
print(f"size of SOURCE: {os.stat('doc1.pdf').st_size}")
print(f"size of EXE: {os.stat('client.exe').st_size}")
print(f"size of OUTPUT: {os.stat('out.pdf').st_size}")
The above code prints out
size of SOURCE: 42942
size of EXE: 989744
size of OUTPUT: 1031773
...which suggests that the .exe file was added to the PDF.
Of course, you can also check manually by opening the PDF in Adobe Reader.
As a side note, I am not sure why you want to attach .exe files to a PDF. It seems you can attach them, but Adobe treats them as security risks, so it may not be possible to open them. You can use the same code above to attach another PDF file (or other documents) instead of an executable file, and it should still work.
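The size comparison described above can be wrapped in a small helper. This is a sketch using only os.stat; the function name `size_report` and the dictionary keys are illustrative choices, and a matching size is a heuristic, not proof that the attachment is readable.

```python
import os


def size_report(source, attachment, output):
    """Return the three file sizes and how much the output grew over the source."""
    source_size = os.stat(source).st_size
    attachment_size = os.stat(attachment).st_size
    output_size = os.stat(output).st_size
    return {
        "source": source_size,
        "attachment": attachment_size,
        "output": output_size,
        # Growth close to the attachment size suggests the data was embedded.
        "growth": output_size - source_size,
    }
```

For the numbers printed above, the growth is 1031773 - 42942 = 988831 bytes, close to the 989744-byte attachment.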

Don't understand this PdfReadError: EOF marker not found

I am downloading multiple PDFs. I have a list of urls and the code is written to download them and also create one big pdf with them all in. The code works for the first 144 pdfs then it throws this error:
PdfReadError: EOF marker not found
I've tried making all the pdfs end in %%EOF but that doesn't work - it still reaches the same point then I get the error again.
Here's my code:
# my file and converting to list for python to read each separately
with open('minutelinks.txt', 'r') as file:
    data = file.read()
links = data.split()
# download pdfs
from PyPDF2 import PdfFileMerger
import requests
urls = links
merger = PdfFileMerger()
for url in urls:
    response = requests.get(url)
    title = url.split("/")[-1]
    with open(title, 'wb') as f:
        f.write(response.content)
    merger.append(title)
merger.write("allminues.pdf")
merger.close()
I want to be able to download all of them and create one big pdf - which it appears to do until it throws this error. I have about 750 pdfs and it only gets to 144.
This is how I changed my code so that it now downloads all of the PDFs and skips any that are corrupted. I also had to add the self argument to the function.
from PyPDF2 import PdfFileMerger
from PyPDF2.utils import PdfReadError
import requests
urls = links
def download_pdfs(self):
    merger = PdfFileMerger()
    for url in urls:
        response = requests.get(url)
        title = url.split("/")[-1]
        with open(title, 'wb') as f:
            f.write(response.content)
        try:
            # appending is the step that raises PdfReadError for a bad file
            merger.append(title)
        except PdfReadError:
            # report the unreadable file and carry on with the rest
            print(title)
    merger.write("allminues.pdf")
    merger.close()
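The skip-and-continue idea can be isolated from PyPDF2 and requests. In this sketch, `append_skipping_errors` is a hypothetical helper: `append` stands for any callable such as merger.append, and `error_type` for the exception to tolerate (PdfReadError in the case above).

```python
def append_skipping_errors(items, append, error_type=Exception):
    """Call append(item) for each item; return the items that raised error_type."""
    skipped = []
    for item in items:
        try:
            append(item)
        except error_type as exc:
            # Record the failure and move on to the next item.
            skipped.append((item, str(exc)))
    return skipped
```

Returning the skipped items, rather than only printing them, makes it possible to retry or inspect those downloads afterwards.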
The end-of-file marker '%%EOF' is meant to be the very last line of the file. It tells the PDF parser that the document ends here.
My solution is to force this marker to stay at the end by dropping anything that follows it:
def reset_eof(self, pdf_file):
    with open(pdf_file, 'rb') as p:
        txt = (p.readlines())
    # find the last line that contains %%EOF (assumes the file has one)
    for i, x in enumerate(txt[::-1]):
        if b'%%EOF' in x:
            actual_line = len(txt)-i-1
            break
    # keep everything before that line and end the file with the marker
    txtx = txt[:actual_line] + [b'%%EOF']
    with open(pdf_file, 'wb') as f:
        f.writelines(txtx)
    return PyPDF4.PdfFileReader(pdf_file)
I read that EOF is a kind of tag included in PDF files (link in Portuguese).
However, I guess some PDF files do not have the EOF marker, and PyPDF2 does not recognize those.
So, what I did to fix "PdfReadError: EOF marker not found" was to open my PDF with Google Chrome and print it to PDF once more, so that the file is rewritten by Chrome, hopefully with the EOF marker.
I ran my script with the new .pdf file produced by Chrome and it worked fine.
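Before merging, it may help to find out which downloaded files are affected. The following is a rough pre-check, assuming a well-formed file starts with %PDF- and carries %%EOF near its end. The helper name `looks_like_pdf` and the 1024-byte tail window are assumptions for illustration, and a file that fails (for example an HTML error page saved with a .pdf name) still needs to be inspected by hand.

```python
def looks_like_pdf(path, tail_size=1024):
    """Rough check: file starts with %PDF- and has %%EOF in its last bytes."""
    with open(path, "rb") as f:
        head = f.read(5)
        f.seek(0, 2)                      # jump to the end to learn the size
        size = f.tell()
        f.seek(max(0, size - tail_size))  # read at most tail_size bytes
        tail = f.read()
    return head == b"%PDF-" and b"%%EOF" in tail
```

Running this over the downloaded titles and printing the ones that return False would point at the files worth re-downloading or repairing.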

PyPDF2 - merging pages from two different PDF files is not working

I'm trying to merge pages from two PDF files into a single PDF with a single page. So I tried the code below that uses PyPDF2:
from PyPDF2 import PdfFileReader,PdfFileWriter
import sys
f = sys.argv[1]
k = sys.argv[2]
print f,k
file1 = PdfFileReader(file(f, "rb"))
file2 = PdfFileReader(file(k, "rb"))
output = PdfFileWriter()
page = file1.getPage(0)
page.mergePage(file2.getPage(0))
output.addPage(page)
outputStream = file("join.pdf", "wb")
output.write(outputStream)
outputStream.close()
It produces a single file and single page with the contents of page 1 from file 1, but I don't find any data from page 1 of file2. Seems like it didn't get merged.
Using your exact code, I am able to merge the two PDFs into one page, with the second one overlapping the first. I referred to this link for detailed information.
Also, it is better to use open() instead of file(), as per the Python documentation, so I did that.
I made slight changes to your code, but it behaves the same way and works correctly on my machine. I am using Ubuntu 16.04 with Python 2.7.
Here is the code:
from PyPDF2 import PdfFileReader,PdfFileWriter
import sys
f = sys.argv[1]
k = sys.argv[2]
print f, k
file1 = PdfFileReader(open(f, "rb"))
file2 = PdfFileReader(open(k, "rb"))
output = PdfFileWriter()
page = file1.getPage(0)
page.mergePage(file2.getPage(0))
output.addPage(page)
with open("join.pdf", "wb") as outputStream:
    output.write(outputStream)
I hope this helps.
UPDATE:
Here is the code which is working for me and merging the two pdf's page as single page.
from pyPdf import PdfFileWriter, PdfFileReader
from pdfnup import generateNup
initial_output = PdfFileWriter()
input1 = PdfFileReader(open("landscape1.pdf", "rb"))
input2 = PdfFileReader(open("landscape2.pdf", "rb"))
initial_output.addPage(input1.getPage(0))
initial_output.addPage(input2.getPage(0))
# creates a new pdf file with required pages as separate pages.
initial_output.write(file("final.pdf", "wb"))
# merges newly created pdf file pages as one.
generateNup("final.pdf", 2, "intermediate.pdf")
# overwrite and rotates the final.pdf
final_output = PdfFileWriter()
final_output.addPage(PdfFileReader(open("intermediate.pdf", "rb")).getPage(0).rotateClockwise(90))
final_output.write(open("final.pdf", "wb"))
I have added new code that also rotates the final PDF. The output PDF that you need is final.pdf.
And here is the Google Drive link to my drive for the PDF files. Also, I made slight changes to pdfnup.py (for Immutableset) for compatibility with my system; if you want to use the same file, you can find it in the drive link above too.
def merge_page(self, output_pdf, *input_pdfs):
    a = len(input_pdfs)
    print(a)
    merge = PyPDF2.PdfFileMerger()
    outputStream = open(output_pdf, "wb")
    if a < 2:
        raise Exception("Need at least two PDFs for merging")
    else:
        for x in input_pdfs:
            merge.append(open(x, "rb"))
        merge.write(outputStream)
        outputStream.close()
For me this code works in PyCharm. It can take any number of PDF files and merge them into a single PDF file, but there must be at least two; fewer than that raises an error.

How can I set the PDF version with PyPDF2?

I'm using PyPDF2 1.4 and Python 2.7:
How can I change the PDF version of a file?
What I tried
my_input_filename.pdf is PDF version 1.5, but _my_output_filename.pdf is a 1.3 PDF. I want to keep 1.5 in the output:
from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import NameObject, createStringObject
input_filename = 'my_input_filename.pdf'
# Read input PDF file
inputPDF = PdfFileReader(open(input_filename, 'rb'))
info = inputPDF.documentInfo
for i in xrange(inputPDF.numPages):
    # Create output PDF
    outputPDF = PdfFileWriter()
    # Create dictionary for output PDF
    infoDict = outputPDF._info.getObject()
    # Update output PDF metadata with input PDF metadata
    for key in info:
        infoDict.update({NameObject(key): createStringObject(info[key])})
    outputPDF.addPage(inputPDF.getPage(i))
    with open(output_filename , 'wb') as outputStream:
        outputPDF.write(outputStream)
PyPDF2 in its current versions can't produce anything but files with a PDF 1.3 header; from the official source code:
class PdfFileWriter(object):
    """
    This class supports writing PDF files out, given pages produced by another
    class (typically :class:`PdfFileReader<PdfFileReader>`).
    """
    def __init__(self):
        self._header = b_("%PDF-1.3")
        ...
Whether that is legal is questionable, considering that it lets you feed in content that needs a version newer than 1.3.
If you just want to fix the version string in the header (I don't know what consequences that would have, so I assume you know more about the PDF standard than I do!), you can replace it. Note that replace returns a new bytes object, so the result has to be assigned back:
from PyPDF2.utils import b_
...
outputPDF._header = outputPDF._header.replace(b_("PDF-1.3"), b_("PDF-1.5"))
or something of the like.
Going to add to Marcus' answer above:
There's (currently - I can't speak for when Marcus wrote his post) nothing stopping you from specifying the version in the metadata using the standard PyPDF2 addMetadata function. The example below uses PdfFileMerger (as I've recently been doing some cleanup of PDF metadata on existing files), but PdfFileWriter has the same function:
from PyPDF2 import PdfFileMerger
# Define file input/output, and metadata containing version string.
# Using separate input/output files, since it's worth keeping a copy of the originals!
fileIn = 'foo.pdf'
fileOut = 'bar.pdf'
metadata = {
    u'/Version': 'PDF-1.5'
}
# Set up PDF file merger, copy existing file contents into merger object.
merger = PdfFileMerger()
with open( fileIn, 'rb') as fh_in:
    merger.append(fh_in)
# Append metadata to PDF content in merger.
merger.addMetadata(metadata)
# Write new PDF file with appended metadata to output
# CAUTION: This will overwrite any existing files without prompt!
with open( fileOut, 'wb' ) as fh_out:
    merger.write(fh_out)
If I open the output of a file I ran through PyPDF2, it shows version 1.7 8x; changing 1.3 to 1.5 or anything else doesn't make a difference.
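To see which version a file actually declares in its header, the first bytes can be read directly. This is a minimal sketch with the standard library only; `header_version` is a hypothetical helper name. It looks only at the %PDF-x.y header line, so a viewer may still report something different, for example when the document catalog carries its own /Version entry.

```python
import re


def header_version(path):
    """Return the version from the %PDF-x.y header, or None if absent."""
    with open(path, "rb") as f:
        first_bytes = f.read(16)
    # The header is expected at the very start of the file.
    match = re.match(rb"%PDF-(\d+\.\d+)", first_bytes)
    return match.group(1).decode("ascii") if match else None
```

Calling it on both the input and the output file shows whether a header change took effect.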

Sending multiple .CSV files to .ZIP without storing to disk in Python

I'm working on a reporting application for my Django powered website. I want to run several reports and have each report generate a .csv file in memory that can be downloaded in batch as a .zip. I would like to do this without storing any files to disk. So far, to generate a single .csv file, I am following the common operation:
mem_file = StringIO.StringIO()
writer = csv.writer(mem_file)
writer.writerow(["My content", my_value])
mem_file.seek(0)
response = HttpResponse(mem_file, content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename=my_file.csv'
This works fine, but only for a single, unzipped .csv. If I had, for example, a list of .csv files created with a StringIO stream:
firstFile = StringIO.StringIO()
# write some data to the file
secondFile = StringIO.StringIO()
# write some data to the file
thirdFile = StringIO.StringIO()
# write some data to the file
myFiles = [firstFile, secondFile, thirdFile]
How could I return a compressed file that contains all objects in myFiles and can be properly unzipped to reveal three .csv files?
zipfile is a standard library module that does exactly what you're looking for. For your use-case, the meat and potatoes is a method called "writestr" that takes a name of a file and the data contained within it that you'd like to zip.
In the code below, I've used a sequential naming scheme for the files when they're unzipped, but this can be switched to whatever you'd like.
import zipfile
import StringIO
zipped_file = StringIO.StringIO()
with zipfile.ZipFile(zipped_file, 'w') as zf:
    for i, csv_file in enumerate(myFiles):
        csv_file.seek(0)
        zf.writestr("{}.csv".format(i), csv_file.read())
zipped_file.seek(0)
If you want to future-proof your code (hint hint Python 3 hint hint), you might want to switch over to using io.BytesIO instead of StringIO, since Python 3 is all about the bytes. Another bonus is that explicit seeks are not necessary before reading when you use getvalue(), which returns the whole buffer regardless of the current position (I haven't tested this behavior with Django's HttpResponse, so I've left that final seek in there just in case).
import io
import zipfile
zipped_file = io.BytesIO()
with zipfile.ZipFile(zipped_file, 'w') as f:
    for i, csv_file in enumerate(myFiles):
        f.writestr("{}.csv".format(i), csv_file.getvalue())
zipped_file.seek(0)
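For reference, here is an end-to-end sketch for Python 3 that builds the CSVs and the archive entirely in memory. The helper names `csv_buffer` and `zip_csv_buffers` and the file names in the usage are illustrative. It assumes the CSV text lives in io.StringIO buffers (the csv module writes text in Python 3), while the archive itself goes into io.BytesIO.

```python
import csv
import io
import zipfile


def csv_buffer(rows):
    """Write rows to an in-memory text buffer in CSV format."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    return buf


def zip_csv_buffers(named_buffers):
    """Pack (name, io.StringIO) pairs into an in-memory ZIP archive."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        for name, buf in named_buffers:
            # writestr encodes str data as UTF-8 before storing it.
            zf.writestr(name, buf.getvalue())
    archive.seek(0)  # rewind so callers can read from the start
    return archive
```

The returned buffer can be opened again with zipfile.ZipFile to verify its contents, or its bytes (archive.getvalue()) can be used as the body of a response.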
The stdlib comes with the module zipfile, and the main class, ZipFile, accepts a file or file-like object:
import StringIO
from zipfile import ZipFile
temp_file = StringIO.StringIO()
zipped = ZipFile(temp_file, 'w')
# create temp csv_files = [(name1, data1), (name2, data2), ... ]
for name, data in csv_files:
    data.seek(0)
    zipped.writestr(name, data.read())
zipped.close()
temp_file.seek(0)
# etc. etc.
I'm not a user of StringIO so I may have the seek and read out of place, but hopefully you get the idea.
import zipfile
from StringIO import StringIO  # use io.BytesIO for Python 3
def zip_files(files):
    outfile = StringIO()
    with zipfile.ZipFile(outfile, 'w') as zf:
        for n, f in enumerate(files):
            zf.writestr("{}.csv".format(n), f.getvalue())
    return outfile.getvalue()
zipped_file = zip_files(myFiles)
response = HttpResponse(zipped_file, content_type='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename=my_file.zip'
StringIO has a getvalue method which returns the entire contents. You can compress the zip file by passing zipfile.ZIP_DEFLATED: zipfile.ZipFile(outfile, 'w', zipfile.ZIP_DEFLATED). The default compression value is ZIP_STORED, which creates the zip file without compressing.
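The difference between the two compression settings can be measured directly. This sketch uses a hypothetical helper, `zipped_size`, and an arbitrary entry name; it assumes the zlib module is available, which ZIP_DEFLATED requires.

```python
import io
import zipfile


def zipped_size(data, compression):
    """Return the size in bytes of an in-memory ZIP holding `data` once."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", compression) as zf:
        zf.writestr("report.csv", data)
    return len(archive.getvalue())
```

Comparing zipped_size(data, zipfile.ZIP_STORED) with zipped_size(data, zipfile.ZIP_DEFLATED) on repetitive CSV data shows the stored archive is at least as large as the data itself, while the deflated one is smaller.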
