I'm trying to do the following:
1. Download a PDF file from S3 to my Heroku app.
2. Process the PDF.
3. Email the PDF as an attachment.
Is it possible? If so, could you give me a tip on how to do it?
I'm running Django, and the PDF is about 1 MB.
This is my processing part:
from PyPDF2 import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from reportlab.lib.colors import HexColor
import os, sys
import requests
from io import BytesIO

URL = "https://domainname.com/sample.pdf"
response=requests.get(URL)
p = BytesIO(response.content)
p.seek(0, os.SEEK_END)

def watermark_product(watermark_text, input_file_path, output_file_path):
    c = canvas.Canvas("watermark.pdf")
    c.setFont("Helvetica", 24)
    c.setFillGray(0.5,0.5)
    c.saveState()
    c.translate(500,100)
    c.rotate(45)
    c.drawCentredString(0, 300, watermark_text)
    c.restoreState()
    c.save()
    input_file = PdfFileReader(input_file_path)
    output_writer = PdfFileWriter()
    total_pages = input_file.getNumPages()
    for single_page in range(total_pages):
        page = input_file.getPage(single_page)
        watermark = PdfFileReader("watermark.pdf")
        page.mergePage(watermark.getPage(0))
        output_writer.addPage(page)
    with open(output_file_path, "wb") as outputStream:
        output_writer.write(outputStream)
    os.remove("watermark.pdf")

watermark_product('testtesatd', p, 'w1.pdf')
EDIT:
I've managed to keep the pdf file in memory.
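With the PDF held in memory, the emailing step only needs the bytes, a file name, and a MIME type. The sketch below uses the standard library's email package to show the shape of that step; the function name build_pdf_email and the example addresses are made up for illustration. In a Django project, django.core.mail.EmailMessage.attach(filename, content, mimetype) takes the same three pieces, and the watermarking function above could write to a BytesIO instead of 'w1.pdf' to supply the bytes.

```python
from email.message import EmailMessage


def build_pdf_email(pdf_bytes, filename, sender, recipient, subject="Your PDF"):
    """Build an email message carrying the given PDF bytes as an attachment.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content("The processed PDF is attached.")
    # Attach the in-memory bytes; nothing is written to disk.
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=filename
    )
    return msg
```

Sending the message is a separate concern (for example smtplib.SMTP.send_message, or Django's own EmailMessage.send() when using the Django class instead).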
I have hundreds of PDF files on my Google Drive, and I want to extract page 6 from each of them, keeping the original file name in the output name, using a Jupyter notebook on Google Colab.
I used the code below to extract a page without changing the original file name, and it worked just fine:
from PyPDF2 import PdfFileReader, PdfFileWriter
path = '/content/drive/Shareddrives/2022 | ICT 4014 | Group M2N2/datasets/frazer/1.pdf'
file_ext = path.replace('.pdf', '')
pdf = PdfFileReader(path)
pdfpage = [6]
PdfWriter = PdfFileWriter() #Creating pdfWriter instance
for page_num in pdfpage:
    PdfWriter.addPage(pdf.getPage(page_num))
with open('{0}_1.pdf'.format(file_ext), 'wb') as a:
    PdfWriter.write(a)
    a.close()
Output:
1_1.pdf
I then tried to add a loop so that I can extract page 6 from every PDF in the specified directory, and I got an error:
import PyPDF2
import os
import re
import sys
import glob
import PyPDF2 as pdf
from PyPDF2 import PdfFileReader, PdfFileWriter
path = glob.glob(os.path.join('/content/drive/Shareddrives/2022 | ICT 4014 | Group M2N2/datasets/unza_etd_pdfs','*.pdf'))
for pdf_files in path:
    file_ext = path.replace('.pdf', '')
    pdf = PdfFileReader(path)
    pdfpage = [6]
    PdfWriter = PdfFileWriter() #Creating pdfWriter instance
    for page_num in pdfpage:
        PdfWriter.addPage(pdf.getPage(page_num))
    with open('{0}_1.pdf'.format(file_ext), 'wb') as a:
        PdfWriter.write(a)
        a.close()
Output Error:
AttributeError: 'list' object has no attribute 'replace'
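The error comes from calling .replace() on path, which is the list returned by glob.glob; inside the loop, the individual file path is the loop variable pdf_files. A minimal sketch of the corrected loop follows. It keeps the legacy PyPDF2 names used in the question (PyPDF2 3.x renamed them to PdfReader, PdfWriter, reader.pages and add_page), and the helper names output_path_for, extract_page and extract_from_directory are invented for illustration. Note that getPage is zero-based, so index 6 is the seventh page; use 5 if the sixth page is wanted.

```python
import glob
import os


def output_path_for(pdf_path, suffix="_1"):
    """Return '<name>_1.pdf' next to the input file (the suffix mirrors the question)."""
    root, ext = os.path.splitext(pdf_path)
    return root + suffix + ext


def extract_page(pdf_path, page_index=6):
    """Write one page of pdf_path to a sibling file and return the new path."""
    # Imported lazily so the path helper works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(pdf_path)
    if page_index >= reader.getNumPages():
        return None  # document is too short; skip it
    writer = PdfFileWriter()
    writer.addPage(reader.getPage(page_index))
    out_path = output_path_for(pdf_path)
    with open(out_path, "wb") as fh:
        writer.write(fh)
    return out_path


def extract_from_directory(directory, page_index=6):
    """Apply extract_page to every .pdf file directly inside directory."""
    results = []
    for pdf_path in sorted(glob.glob(os.path.join(directory, "*.pdf"))):
        results.append(extract_page(pdf_path, page_index))
    return results
```

Calling extract_from_directory with the shared-drive folder from the question would then process each file in turn.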
I need help splitting a PDF file into parts that are each smaller than 10 MB. I have already managed to split the file into individual pages, but I could not split it by the size of the output file.
Below is the code I used to split into pages with the PyPDF2 library, based on information I collected here on Stack Overflow.
Thank you for your help.
from PyPDF2 import PdfFileWriter, PdfFileReader
from tkinter.filedialog import askopenfilename as procArq
url = procArq ()
arquivo = PdfFileReader(open(url, "rb"))
for i in range(arquivo.numPages):
    saida = PdfFileWriter()
    saida.addPage(arquivo.getPage(i))
    with open("document-page%s.pdf" % i, "wb") as arquivo_de_saida:
        saida.write(arquivo_de_saida)
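One possible way to split by output size, sketched under assumptions: add pages greedily to the current part, measure the serialized size by writing the part to an in-memory buffer, and start a new part when the limit would be exceeded. The function names group_pages_by_size and split_pdf_by_size and the output name pattern are invented here. The approach re-serializes the part for every page, so it is slow on large files, and a single page that is larger than the limit still ends up alone in an oversized part.

```python
from io import BytesIO


def group_pages_by_size(page_count, measure, max_bytes):
    """Greedily group page indices so that measure(group) stays <= max_bytes.

    measure(indices) must return the size in bytes of a PDF with those pages.
    """
    groups, current = [], []
    for i in range(page_count):
        candidate = current + [i]
        if current and measure(candidate) > max_bytes:
            groups.append(current)
            current = [i]
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups


def split_pdf_by_size(path, max_bytes=10 * 1024 * 1024):
    """Split the PDF at path into parts of at most max_bytes where possible."""
    # Imported lazily so the grouping logic is usable without PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(path)

    def build(indices):
        # Serialize the chosen pages into memory and return the bytes.
        writer = PdfFileWriter()
        for i in indices:
            writer.addPage(reader.getPage(i))
        buf = BytesIO()
        writer.write(buf)
        return buf.getvalue()

    groups = group_pages_by_size(
        reader.getNumPages(), lambda idx: len(build(idx)), max_bytes
    )
    for n, indices in enumerate(groups):
        with open("document-part%s.pdf" % n, "wb") as out:
            out.write(build(indices))
    return groups
```

Keeping the grouping separate from the PDF handling makes the size logic easy to check with a fake measure function before running it on real files.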
I have a function which generates a PDF and I have a Flask website. I would like to combine the two so that when you visit the site, a PDF is generated and downloaded. I am working on combining various bits of code that I don't fully understand. The PDF is generated and downloaded but fails to ever load when I try to open it. What am I missing?
import cStringIO
from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from flask import make_response, Flask
from reportlab.pdfgen import canvas
app = Flask(__name__)

@app.route('/')
def pdf():
    output = cStringIO.StringIO()
    doc = SimpleDocTemplate("test.pdf",pagesize=letter)
    Story=[]
    styles=getSampleStyleSheet()
    styles.add(ParagraphStyle(name='Justify', alignment=TA_JUSTIFY))
    ptext = '<font size=12>test</font>'
    Story.append(Paragraph(ptext, styles["Justify"]))
    doc.build(Story)
    pdf_out = output.getvalue()
    output.close()
    response = make_response(pdf_out)
    response.headers['Content-Disposition'] = "attachment; filename='test.pdf"
    response.mimetype = 'application/pdf'
    return response

app.run()
You can use Flask's send_file for serving files:
from flask import send_file
return send_file('test.pdf', as_attachment=True)
With as_attachment=True you can force the client to download the file instead of viewing it inside the browser.
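Note that the view in the question creates an in-memory buffer but passes the file name "test.pdf" to SimpleDocTemplate, so the document is written to disk and the buffer that is returned stays empty. If the goal is to avoid the file on disk altogether, one option is to build the document into the buffer itself. The following is a sketch on Python 3 with io.BytesIO; the names build_pdf_bytes and create_app are illustrative and not from the original posts.

```python
from io import BytesIO


def build_pdf_bytes(text="test"):
    """Render a one-paragraph PDF into memory and return its bytes."""
    # Imported lazily so the module loads even where reportlab is absent.
    from reportlab.lib.pagesizes import letter
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate

    buffer = BytesIO()
    # Pass the buffer, not a file name, so nothing is written to disk.
    doc = SimpleDocTemplate(buffer, pagesize=letter)
    styles = getSampleStyleSheet()
    doc.build([Paragraph(text, styles["Normal"])])
    return buffer.getvalue()


def create_app():
    """Hypothetical app factory wrapping the view from the question."""
    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/")
    def pdf():
        response = make_response(build_pdf_bytes())
        response.headers["Content-Disposition"] = 'attachment; filename="test.pdf"'
        response.mimetype = "application/pdf"
        return response

    return app
```

Running create_app().run() starts the development server; visiting the root URL should then download a PDF that opens normally.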
How can I add a watermark to the PDF file generated by this code?
import xhtml2pdf
from xhtml2pdf import pisa
def delivery_cancel(request, did):
    d_instance = get_object_or_404(Delivery, pk=did, user=request.user)
    users = request.user.get_profile()
    user = request.user
    contents = render_to_string('delivery_cancel.html', {'delivery':d_instance,'users':users,'user':user})
    response = HttpResponse(mimetype='application/pdf')
    response['Content-Disposition'] = 'inline; filename=mypdf.pdf'
    result = StringIO.StringIO()
    pdf = pisa.pisaDocument(StringIO.StringIO(contents.encode('utf-8')), result, show_error_as_pdf=True, encoding='UTF-8')
    response.write(result.getvalue())
    result.close()
    return response
I tried to use reportlab but I failed so I'm asking for another solution.
The input to xhtml2pdf is XHTML, so you probably want to specify your watermark there. The documentation says to use a background-image on @page.
Alternatively, you can create a single-page PDF that just contains the watermark and apply it to your generated file after the fact using something like pdftk's background option.
My approach is a longer one, but it should solve most of the problems you are facing.
This script reads a list of email addresses from an .xlsx sheet and, for each address, stamps it as a watermark on every page of the PDF you supply.
# Importing all required packages
import xlrd
from reportlab.pdfgen import canvas
from reportlab.lib.units import inch, cm
from PyPDF2 import PdfFileWriter, PdfFileReader
from reportlab.lib.colors import HexColor

# create watermarked booklet
def final_booklets(file_name,booklet):
    watermark_obj = PdfFileReader(file_name)
    watermark_page = watermark_obj.getPage(0)
    pdf_reader = PdfFileReader(booklet)
    pdf_writer = PdfFileWriter()
    # Watermark all the pages
    for page in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page)
        page.mergePage(watermark_page)
        pdf_writer.addPage(page)
    output = file_name+"_booklet.pdf"
    with open(output, 'wb') as out:
        pdf_writer.write(out)

# Create a watermark pdf for each email address
def watermark_pdf(target,booklet):
    file_name = (target + ".pdf")
    c = canvas.Canvas(file_name)
    c.saveState()
    c.setFillColor(HexColor('#dee0ea'))
    c.setFont("Helvetica", 40)
    c.translate(15*cm, 20*cm )
    c.rotate(45)
    c.drawRightString(0,0,target)
    c.restoreState()
    c.showPage()
    c.save()
    final_booklets(file_name,booklet)

# Read the sheet to get everyone's email address
def read_xlsx(fn):
    book = xlrd.open_workbook(fn)
    sheet = book.sheet_by_index(0)
    booklet = "book.pdf"
    for cell in range(1,sheet.nrows):
        target = sheet.cell(cell,1).value
        watermark_pdf(target,booklet)

# main controller
if __name__ == "__main__":
    fn = "Test.xlsx"
    read_xlsx(fn)
Original Github link: https://github.com/manojitballav/python_watermark/blob/master/master.py
I'm attempting to combine a few PDF files into a single PDF file using Python. I've tried both PyPDF and PyPDF2 - on some files, they both throw this same error:
PdfReadError: EOF marker not found
Here's my code (the list in the loop holds the paths of the PDF files to combine):
from PyPDF2 import PdfReader, PdfWriter
writer = PdfWriter()
for path in ["example1.pdf", "example2.pdf"]:
    reader = PdfReader(path)
    for page in reader.pages:
        writer.add_page(page)
with open("out.pdf", "wb") as fp:
    writer.write(fp)
I've read a few StackOverflow threads on the topic, but none contain a solution that works. If you've successfully combined PDF files using Python, I'd love to hear how.
You ran into a PyPDF2 issue that was fixed by PR #321. The fix was released in PyPDF2==1.27.8 (released on 2022-04-21).
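If upgrading is not immediately possible, it can help to find out which input files trigger the error. The following is only a diagnostic sketch and not what the PR changed: it checks whether the %%EOF marker appears near the end of each file. The 1024-byte window is an assumption based on how PDF readers commonly search for the marker, and the function names are made up for illustration.

```python
def has_eof_marker(data, window=1024):
    """Return True if the %%EOF marker appears in the last `window` bytes."""
    return b"%%EOF" in data[-window:]


def files_missing_eof(paths, window=1024):
    """Return the paths whose trailing bytes lack the %%EOF marker."""
    missing = []
    for path in paths:
        with open(path, "rb") as fh:
            fh.seek(0, 2)  # jump to the end to learn the file size
            size = fh.tell()
            fh.seek(max(0, size - window))
            if not has_eof_marker(fh.read(), window):
                missing.append(path)
    return missing
```

Files reported by files_missing_eof are often truncated downloads or have trailing data after the marker, so re-downloading or re-saving them is worth trying before merging.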
In case anyone is still looking for a way to merge a list of PDFs:
Note:
Use glob to get the correct file list. <- this will really save your day ^^
Check this out: glob module reference
from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter
import os
import glob

class MergeAllPDF:
    def __init__(self):
        self.mergelist = []

    def create(self, filepath, outpath, outfilename):
        self.outfilname = outfilename
        self.filepath = filepath
        self.outpath = outpath
        self.pdfs = glob.glob(self.filepath)
        self.myrange = len(self.pdfs)
        for _ in range(self.myrange):
            if self.pdfs:
                self.mergelist.append(self.pdfs.pop(0))
        self.merge()

    def merge(self):
        if self.mergelist:
            self.merger = PdfFileMerger()
            for pdf in self.mergelist:
                self.merger.append(open(pdf, 'rb'))
            self.merger.write(self.outpath + "%s.pdf" % (self.outfilname))
            self.merger.close()
            self.mergelist = []
        else:
            print("mergelist is empty please check your input path")

# example how to use
#update your path here:
inpath = r"C:\Users\Fabian\Desktop\mergeallpdfs\scan\*.pdf" #here are your single page pdfs stored
outpath = r"C:\Users\Fabian\Desktop\mergeallpdfs\output\\" #here your merged pdf will be stored
b = MergeAllPDF()
b.create(inpath, outpath, "mergedpdf")
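One caveat when merging from glob: glob.glob returns matches in arbitrary order, so the pages of the merged file may come out shuffled unless the list is sorted first. A minimal sketch of a sorted merge follows, assuming file names with numeric parts such as scan1.pdf, scan2.pdf, scan10.pdf. The names natural_key and merge_pdfs are invented here, and PdfFileMerger is the legacy PyPDF2 class used above.

```python
import glob
import re


def natural_key(name):
    """Sort key that orders 'scan2.pdf' before 'scan10.pdf'."""
    # re.split with a capture group alternates text and digit runs,
    # so odd positions are always numbers.
    tokens = re.split(r"(\d+)", name)
    return [int(tok) if i % 2 else tok.lower() for i, tok in enumerate(tokens)]


def merge_pdfs(pattern, out_file):
    """Merge every PDF matching pattern into out_file, in natural order."""
    # Imported lazily so natural_key can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileMerger

    paths = sorted(glob.glob(pattern), key=natural_key)
    if not paths:
        raise FileNotFoundError("no PDFs match %r" % pattern)
    merger = PdfFileMerger()
    for path in paths:
        merger.append(path)
    merger.write(out_file)
    merger.close()
    return paths
```

A plain sorted() would place scan10.pdf before scan2.pdf, which is why the key splits out the numeric parts and compares them as integers.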