PyPDF2 corrupts file when watermarking

PyPDF2 corrupts file when watermarking - python

I have been trying to speed up our date stamping process by adding a stamp as a watermark to PDFs through PyPDF2. I found the code below online as I'm pretty new to coding.
When I run this it seems to work, but the file is corrupted and won't open. Does anyone have any ideas where I am going wrong?
from PyPDF2 import PdfFileWriter, PdfFileReader
def create_watermark(input_pdf, output_pdf, watermark):
watermark_obj = PdfFileReader(watermark,False,)
watermark_page = watermark_obj.getPage(0)
pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()
# Watermark all the pages
for page in range(pdf_reader.getNumPages()):
page = pdf_reader.getPage(page)
page.mergePage(watermark_page)
pdf_writer.addPage(page)
with open(input_pdf, 'wb') as out:
pdf_writer.write(out)
if __name__ == '__main__':
input_pdf = "C:\\Users\\A***\\OneDrive - ***\\Desktop\\Invoice hold\\Test\\1.pdf"
output_pdf = "C:\\Users\\A***\\OneDrive - ***\\Desktop\\Invoice hold\\Test\\1 WM.pdf"
watermark = "C:\\Users\\A***\\OneDrive - ***\\Desktop\\Invoice hold\\WM.pdf"
create_watermark(input_pdf,output_pdf,watermark)

If you want to save pdf file under the name of output_pdf,
try this :
result = open(output_pdf, 'wb')
pdf_writer.write(result)
your code :
with open(input_pdf, 'wb') as out:
pdf_writer.write(out)
Your code is to overwrite input_pdf.
And if there is a problem while working, the pdf file will be damaged.
I succeeded in inserting the watermark by applying your code and my proposed method.
I recommend checking if the pdf file is not damaged.

Related

PyPDF2 is not working in watermarking Could you help me with this?

import PyPDF2
template = PyPDF2.PdfFileReader(open('super.pdf', 'rb'))
watermark = PyPDF2.PdfFileReader(open('wtr.pdf', 'rb'))
output = PyPDF2.PdfFileWriter()
for i in range(template.getNumPages()):
page = template.getPage(i)
page.mergePage(watermark.getPage(0))
output.addPage(page)
with open('watermarked_output.pdf', 'wb') as file:
output.write(file)
It is not showing an error but not showing a result as well

Beautifulsoup - why arent the images im scraping saving?

Im iterating through and scraping images off a website... but for some reason the "write" isn't working and saving the image. Am I supposed to declare a directory to save them to or something? here's my request. Im using python 2.7
for img in imgs:
image = img['href']
img_url = my_url + image
resource = urllib.urlretrieve(img_url)
resource = resource[0]
output = open(resource, "wb")
output.write(resource)
output.close()

You're working too hard! urlretrieve will already have written the file to disk, all you need to do is copy it to somewhere more permanent.
filename,headers = urllib.urlretreive(img_url)
import shutil
shutil.copy(filename, "/path/to/somewhere")
But to answer your question about what is going on...
resource = urllib.urlretrieve(img_url) # the file is on disk at /tmp/foobar
resource = resource[0] # resource now contains "/tmp/foobar"
output = open(resource, "wb") # oops! You just opened "/tmp/foobar" for writing, which clears the file

Don't understand this PdfReadError: EOF marker not found

I am downloading multiple PDFs. I have a list of urls and the code is written to download them and also create one big pdf with them all in. The code works for the first 144 pdfs then it throws this error:
PdfReadError: EOF marker not found
I've tried making all the pdfs end in %%EOF but that doesn't work - it still reaches the same point then I get the error again.
Here's my code:
my file and converting to list for python to read each separately
with open('minutelinks.txt', 'r') as file:
data = file.read()
links = data.split()
download pdfs
from PyPDF2 import PdfFileMerger
import requests
urls = links
merger = PdfFileMerger()
for url in urls:
response = requests.get(url)
title = url.split("/")[-1]
with open(title, 'wb') as f:
f.write(response.content)
merger.append(title)
merger.write("allminues.pdf")
merger.close()
I want to be able to download all of them and create one big pdf - which it appears to do until it throws this error. I have about 750 pdfs and it only gets to 144.

This is how I changed my code so it now downloads all of the pdfs and skips the one (or more) that may be correupted. I also had to add the self argument to the function.
from PyPDF2 import PdfFileMerger
import requests
import sys
urls = links
def download_pdfs(self):
merger = PdfFileMerger()
for url in urls:
try:
response = requests.get(url)
title = url.split("/")[-1]
with open(title, 'wb') as f:
f.write(response.content)
except PdfReadError:
print(title)
sys.exit()
merger.append(title)
merger.write("allminues.pdf")
merger.close()

The end of file marker '%%EOF' is meant to be the very last line. It is a kind of marker where the pdf parser knows, that the PDF document ends here.
My solution is to force this marker to stay at the end:
def reset_eof(self, pdf_file):
with open(pdf_file, 'rb') as p:
txt = (p.readlines())
for i, x in enumerate(txt[::-1]):
if b'%%EOF' in x:
actual_line = len(txt)-i-1
break
txtx = txt[:actual_line] + [b'%%EOF']
with open(pdf_file, 'wb') as f:
f.writelines(txtx)
return PyPDF4.PdfFileReader(pdf_file)

I read that EOF is a kind of tag included in PDF files. link in portuguese
However, I guess some kinds of PDF files do not have the 'EOF marker' and PyPDF2 do not recognizes those ones.
So, what I did to fix "PdfReadError: EOF marker not found" was opening my PDF with Google Chromer and print it as .pdf once more, so that the file is converted to .pdf by Chromer and hopefully with the EOF marker.
I ran my script with the new .pdf file converted by Chromer and it worked fine.

Add metadata with PyPDF2

The following code adds a watermark on every page and also adds metadata (or better should do).
The watermarking works perfectly fine, but there is no metadata in the output document and no error
from PyPDF2 import PdfFileReader as PdfReader, PdfFileWriter as PdfWriter
nf = []
sources = ["example.pdf", "example2.pdf"]
for i in sources:
# new pdf file name
new_file_name = i + " " + row["name"] + ".pdf"
nf.append(new_file_name)
print(f"Registering {i} watermarked version for {row['name']}.")
reader = PdfReader(i)
writer = PdfWriter()
# adding watermark to each page
for page in reader.pages:
# creating watermarked page object
wmpageObj = add_watermark(packet, page)
# adding watermarked page object to pdf writer
writer.addPage(wmpageObj)
# Write Metadate
writer.addMetadata({"/Registered to": row["name"]})
writer.addMetadata({"/ATC": "ACME Inc."})
# writing watermarked pages to new file
with open(new_file_name, "wb") as newFile:
writer.write(newFile)
Adding print(pdfWriter._info) before and after adding the metadata gives me only:
IndirectObject(2, 0)
IndirectObject(2, 0)
Also interesting: I tried Adobe Acrobat Reader DC on Mac and Windows and it's not possible to show the metadata of the output file (the window just won't open), but works fine with the source file, i.e. before adding watermark and metadata.

PyPDF2 - merging pages from two different PDF files is not working

I'm trying to merge pages from two PDF files into a single PDF with a single page. So I tried the code below that uses PyPDF2:
from PyPDF2 import PdfFileReader,PdfFileWriter
import sys
f = sys.argv[1]
k = sys.argv[2]
print f,k
file1 = PdfFileReader(file(f, "rb"))
file2 = PdfFileReader(file(k, "rb"))
output = PdfFileWriter()
page = file1.getPage(0)
page.mergePage(file2.getPage(0))
output.addPage(page)
outputStream = file("join.pdf", "wb")
output.write(outputStream)
outputStream.close()
It produces a single file and single page with the contents of page 1 from file 1, but I don't find any data from page 1 of file2. Seems like it didn't get merged.

On using your exact same code, I am able to get two PDF as merged PDF in one page with the second one overlapping the first one, I referred this link for detailed information.
And, instead of file() it is better to use open() as per this Python Documentation, so I did that.
Also, I made slight changes in your code but still, the working is same and correct on my machine. I am using Ubuntu 16.04 with python 2.7.
Here is the code:
from PyPDF2 import PdfFileReader,PdfFileWriter
import sys
f = sys.argv[1]
k = sys.argv[2]
print f, k
file1 = PdfFileReader(open(f, "rb"))
file2 = PdfFileReader(open(k, "rb"))
output = PdfFileWriter()
page = file1.getPage(0)
page.mergePage(file2.getPage(0))
output.addPage(page)
with open("join.pdf", "wb") as outputStream:
output.write(outputStream)
I hope this helps.
UPDATE:
Here is the code which is working for me and merging the two pdf's page as single page.
from pyPdf import PdfFileWriter, PdfFileReader
from pdfnup import generateNup
initial_output = PdfFileWriter()
input1 = PdfFileReader(open("landscape1.pdf", "rb"))
input2 = PdfFileReader(open("landscape2.pdf", "rb"))
initial_output.addPage(input1.getPage(0))
initial_output.addPage(input2.getPage(0))
# creates a new pdf file with required pages as separate pages.
initial_output.write(file("final.pdf", "wb"))
# merges newly created pdf file pages as one.
generateNup("final.pdf", 2, "intermediate.pdf")
# overwrite and rotates the final.pdf
final_output = PdfFileWriter()
final_output.addPage(PdfFileReader(open("intermediate.pdf", "rb")).getPage(0).rotateClockwise(90))
final_output.write(open("final.pdf", "wb"))
I have added a new code and now it is also rotating the final pdf. Output PDF that you need is final.pdf
And here is the Google Drive link to my drive for PDF files. Also, I made slight changes into pdfnup.py for compatibility with my system for Immutableset if you want to use the same file then, you can find it too in the drive link above.

def merge_page(self, output_pdf,*input_pdfs):
a=len(input_pdfs)
print (a)
merge = PyPDF2.PdfFileMerger()
outputStream = open(output_pdf, "wb")
if a<2:
raise Exception ("Need Atleast Two Pdf for Merging")
else:
for x in input_pdfs:
merge.append(open(x,"rb"))
merge.write(outputStream)
outputStream.close()
For me this code is working in PyCharm and it can take n no of pdf files for merging into single pdf file but the no should be 2 or more less than that will give error.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

PyPDF2 corrupts file when watermarking - python

Related

PyPDF2 is not working in watermarking Could you help me with this?

Beautifulsoup - why arent the images im scraping saving?

Don't understand this PdfReadError: EOF marker not found

Add metadata with PyPDF2

PyPDF2 - merging pages from two different PDF files is not working

Categories

Resources