Python: How to set BaseUrl in pdf metadata?

I would like to add metadata to a PDF in Python.
I have used PyPDF2, and it works fine for metadata (tags) except for "BaseUrl".
Code:
dict_1['/BaseUrl'] = 'http://bud-arch.pollub.pl/wp-content/uploads/Bud-arch_171_2018_005-012_Boruci%C5%84ska%E2%80%93Bie%C5%84kowska_Maciejko.pdf'
writer = PyPDF2.PdfFileWriter()
writer.addMetadata(dict_1)
As a result, "BaseUrl" ends up in the section for free-form (custom) tags, not in the "Advanced" section. How can I solve this?
Łukasz

Related

How to extract the title, authors, creation date of a PDF in Python

I manage papers locally and rename each PDF file in the form of "creationdate_authors_title.pdf". Hence, extracting the title, authors, creation date of each paper from the PDF file automatically is required.
I have written a Python script that uses the pdfminer package to extract this info. However, for certain files, the file info stored in the dictionary doc.info[0] after parsing with PDFDocument may lack some keys such as "Author", or the values of these keys may be empty.
I'm wondering how I can locate the required info, such as the paper's title, directly in the PDF file using a function like "extract_pages". Or, more generally, how can I accurately and efficiently extract the info I require?
Any hint would be appreciated! Many thanks in advance.

You can use this script to extract all the metadata using the PyPDF2 library:
from PyPDF2 import PdfFileReader

def get_info(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
        print(info)
        author = info.author
        creator = info.creator
        producer = info.producer
        subject = info.subject
        title = info.title

if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    get_info(path)
As you can see, the info variable contains everything you need. See the PyPDF2 documentation for details.
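The question notes that keys such as "Author" may be missing or empty. A minimal sketch of building the "creationdate_authors_title.pdf" name with fallbacks follows. It assumes the info object is dict-like and uses the standard PDF keys /Title, /Author and /CreationDate, and that the date is in the PDF "D:YYYYMMDD..." form. The helper name build_filename and the "unknown" placeholder are hypothetical, not part of any library.

```python
import re


def build_filename(info, placeholder="unknown"):
    """Return 'creationdate_authors_title.pdf' from a PDF info mapping.

    `info` is any dict-like object with PDF-style keys ('/Title', ...).
    Missing or empty values fall back to `placeholder` (an assumption).
    """
    def clean(value):
        # Treat None and whitespace-only strings as missing.
        text = str(value).strip() if value is not None else ""
        # Collapse runs of characters that are awkward in file names.
        text = re.sub(r"[^\w\-]+", "-", text).strip("-")
        return text or placeholder

    raw_date = str(info.get("/CreationDate") or "")
    # PDF dates look like D:YYYYMMDDHHmmSS...; keep only the date part.
    match = re.match(r"(?:D:)?(\d{4})(\d{2})?(\d{2})?", raw_date)
    if match:
        # Month and day are optional in PDF dates; default them to "01".
        date = "".join(part or "01" for part in match.groups())
    else:
        date = placeholder

    return "{}_{}_{}.pdf".format(
        date, clean(info.get("/Author")), clean(info.get("/Title"))
    )
```

When a field is missing, the placeholder makes the gap visible in the file name, so those papers can be found and fixed by hand later.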

pypdf mergepage issue

I want to add a watermark to a PDF using the pypdf library. My code is below:
def add_wm(pdf_in, pdf_out):
    wm_file = open("watermark.pdf", "rb")
    pdf_wm = PdfFileReader(wm_file)
    pdf_output = PdfFileWriter()
    input_stream = open(pdf_in, "rb")
    pdf_input = PdfFileReader(input_stream)
    pageNum = pdf_input.getNumPages()
    #print pageNum
    for i in range(pageNum):
        page = pdf_input.getPage(i)
        page.mergePage(pdf_wm.getPage(0))  # !! fails here if the page contains Chinese characters
        page.compressContentStreams()
        pdf_output.addPage(page)
    output_stream = open(pdf_out, "wb")
    pdf_output.write(output_stream)
    output_stream.close()
    input_stream.close()
    wm_file.close()
    return True
The issue is that if the page returned by pdf_input.getPage(i) contains Chinese characters, page.mergePage raises an exception and the watermarking fails. How do I work around this?

The Python library pdfrw also supports watermarking. If it does not work for your particular PDF, please email it to me (address at github) and I will investigate -- I am the pdfrw author.

I had the same problem when I was watermarking with PyPDF2 1.25.1.
pdfrw, as Patrick suggested, doesn't work for my PDF (it works for Word documents exported as PDF, but apparently not for scanned documents).
Updating to the newest version of PyPDF2 (for me, 1.26.0) fixed this bug.
For more information, see PyPDF2 issue #176.

How can I set the PDF version with PyPDF2?

I'm using PyPDF2 1.4 and Python 2.7.
How can I change the PDF version of a file?
What I tried
my_input_filename.pdf is PDF version 1.5, but _my_output_filename.pdf comes out as PDF 1.3. I want to keep 1.5 in the output:
from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import NameObject, createStringObject

input_filename = 'my_input_filename.pdf'
output_filename = '_my_output_filename.pdf'
# Read input PDF file
inputPDF = PdfFileReader(open(input_filename, 'rb'))
info = inputPDF.documentInfo
for i in xrange(inputPDF.numPages):
    # Create output PDF
    outputPDF = PdfFileWriter()
    # Create dictionary for output PDF
    infoDict = outputPDF._info.getObject()
    # Update output PDF metadata with input PDF metadata
    for key in info:
        infoDict.update({NameObject(key): createStringObject(info[key])})
    outputPDF.addPage(inputPDF.getPage(i))
    with open(output_filename, 'wb') as outputStream:
        outputPDF.write(outputStream)

PyPDF2 in its current versions can't produce anything but files with a PDF 1.3 header; from the official source code:
class PdfFileWriter(object):
    """
    This class supports writing PDF files out, given pages produced by another
    class (typically :class:`PdfFileReader<PdfFileReader>`).
    """
    def __init__(self):
        self._header = b_("%PDF-1.3")
        ...
Whether that is legal, considering it lets you feed in content that needs a version above 1.3, is questionable.
If you just want to fix the version string in the header (I don't know what consequences that would have, so I assume you know more about the PDF standard than I do!), you could try:
from PyPDF2.utils import b_
...
# replace() returns a new value, so assign it back to the header
outputPDF._header = outputPDF._header.replace(b_("PDF-1.3"), b_("PDF-1.5"))
or something of the like.
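Because bytes objects are immutable, replace() returns a new value rather than changing the header in place, so the result has to be assigned back. A minimal, library-independent sketch of the same idea applied to the raw bytes of a file is below. The function name set_header_version is hypothetical, and, as cautioned above, changing the header says nothing about whether the rest of the file conforms to that version.

```python
import re


def set_header_version(data, version):
    """Return a copy of `data` whose leading '%PDF-x.y' header names `version`.

    Only the header comment at the very start is touched; nothing else in
    the file is validated or changed.
    """
    if not re.fullmatch(r"\d\.\d", version):
        raise ValueError("version must look like '1.5'")
    new_data, count = re.subn(
        rb"\A%PDF-\d\.\d", b"%PDF-" + version.encode("ascii"), data, count=1
    )
    if count == 0:
        raise ValueError("data does not start with a %PDF-x.y header")
    return new_data


# Made-up sample bytes standing in for the start of a real file.
sample = b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
patched = set_header_version(sample, "1.5")
```

The original bytes are left untouched; only the returned copy carries the new header.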

Going to add to Marcus' answer above:
There's currently (I can't speak for when Marcus wrote his post) nothing stopping you from specifying the version in the metadata using the standard PyPDF2 addMetadata function. The example below uses PdfFileMerger (as I've recently been doing some cleanup of PDF metadata on existing files), but PdfFileWriter has the same function:
from PyPDF2 import PdfFileMerger

# Define file input/output, and metadata containing version string.
# Using separate input/output files, since it's worth keeping a copy of the originals!
fileIn = 'foo.pdf'
fileOut = 'bar.pdf'
metadata = {
    u'/Version': 'PDF-1.5'
}
# Set up PDF file merger, copy existing file contents into merger object.
merger = PdfFileMerger()
with open(fileIn, 'rb') as fh_in:
    merger.append(fh_in)
# Append metadata to PDF content in merger.
merger.addMetadata(metadata)
# Write new PDF file with appended metadata to output
# CAUTION: This will overwrite any existing files without prompt!
with open(fileOut, 'wb') as fh_out:
    merger.write(fh_out)

When I open a file that I ran through PyPDF2, it shows version 1.7 (Acrobat 8.x); changing 1.3 to 1.5 or anything else doesn't make a difference.
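To check what a written file actually declares, the header can be read directly. A minimal sketch, assuming only that a PDF begins with a "%PDF-x.y" comment; the function name header_version is hypothetical. A viewer may report a different version than the header, for example when the document catalog carries its own /Version entry, so this shows the header value only.

```python
import re


def header_version(path_or_bytes):
    """Return the 'x.y' version string from a PDF header, or None."""
    if isinstance(path_or_bytes, (bytes, bytearray)):
        head = bytes(path_or_bytes[:1024])
    else:
        # Only the start of the file is needed to find the header.
        with open(path_or_bytes, "rb") as fh:
            head = fh.read(1024)
    match = re.search(rb"%PDF-(\d\.\d)", head)
    return match.group(1).decode("ascii") if match else None
```

Running it on the input and output files side by side shows whether the header itself changed, independently of what a viewer displays.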

HTML to PDF conversion in app engine python

My website has a lot of dynamically generated HTML content, and I would like to give my users a way to save the data in PDF format. Any ideas on how this can be done? I tried the xhtml2pdf library but couldn't get it to work. I even tried ReportLab, but there we have to enter the PDF details manually. Is there any library that converts HTML content to PDF and works on App Engine?

You need to copy all dependencies into your GAE project folder:
html5lib
reportlab
six
xhtml2pdf
Then you can use it like this:
from xhtml2pdf import pisa
from cStringIO import StringIO
content = StringIO('html goes here')
output = StringIO()
pisa.log.setLevel('DEBUG')
pdf = pisa.CreatePDF(content, output, encoding='utf-8')
pdf_data = pdf.dest.getvalue()
Some useful info that I googled just for you:
http://www.prahladyeri.com/2013/11/how-to-generate-pdf-in-python-for-google-app-engine/
https://github.com/danimajo/pineapple_pdf

Merge Existing PDF into new ReportLab PDF via flowables

I have a ReportLab SimpleDocTemplate that I return as a dynamic PDF. I generate its content based on some Django model metadata. Here's my template setup:
buff = StringIO()
doc = SimpleDocTemplate(buff, pagesize=letter,
                        rightMargin=72, leftMargin=72,
                        topMargin=72, bottomMargin=18)
Story = []
I can easily add textual metadata from the Entry model into the Story list to be built later:
ptext = '<font size=20>%s</font>' % entry.title.title()
paragraph = Paragraph(ptext, custom_styles["Custom"])
Story.append(paragraph)
And then generate the PDF to be returned in the response by calling build on the SimpleDocTemplate:
doc.build(Story, onFirstPage=entry_page_template, onLaterPages=entry_page_template)
pdf = buff.getvalue()
resp = HttpResponse(mimetype='application/x-download')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
resp.write(pdf)
return resp
One metadata field on the model is a file attachment. When those file attachments are PDFs, I'd like to merge them into the Story that I am generating, i.e. as a PDF of ReportLab "flowable" type.
I'm attempting to do so using pdfrw, but haven't had any luck. Ideally I'd love to just call:
from pdfrw import PdfReader
pdf = PdfReader(entry.document.file.path)
Story.append(pdf)
and append the pdf to the existing Story list to be included in the generation of the final document, as noted above.
Anyone have any ideas? I tried something similar using pagexobj to create the pdf, trying to follow this example:
http://code.google.com/p/pdfrw/source/browse/trunk/examples/rl1/subset.py
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
pdf = pagexobj(PdfReader(entry.document.file.path))
But didn't have any luck either. Can someone explain to me the best way to merge an existing PDF file into a reportlab flowable? I'm no good with this stuff and have been banging my head on pdf-generation for days now. :) Any direction greatly appreciated!

I just had a similar task in a project. I used reportlab (open source version) to generate pdf files and pyPDF to facilitate the merge. My requirements were slightly different in that I just needed one page from each attachment, but I'm sure this is probably close enough for you to get the general idea.
from datetime import datetime

from django.conf import settings
from pyPdf import PdfFileReader, PdfFileWriter

def create_merged_pdf(user):
    basepath = settings.MEDIA_ROOT + "/"
    # following block calls the function that uses reportlab to generate a pdf
    coversheet_path = basepath + "%s_%s_cover_%s.pdf" % (user.first_name, user.last_name, datetime.now().strftime("%f"))
    create_cover_sheet(coversheet_path, user, user.performancereview_set.all())
    # now use the cover sheet and all of the performance reviews to create a merged pdf
    merged_path = basepath + "%s_%s_merged_%s.pdf" % (user.first_name, user.last_name, datetime.now().strftime("%f"))
    # for merged file result
    output = PdfFileWriter()
    # for each pdf file to add, open in a PdfFileReader object and add page to output
    cover_pdf = PdfFileReader(file(coversheet_path, "rb"))
    output.addPage(cover_pdf.getPage(0))
    # iterate through attached files and merge. I only needed the first page, YMMV
    for review in user.performancereview_set.all():
        review_pdf = PdfFileReader(file(review.pdf_file.file.name, "rb"))
        output.addPage(review_pdf.getPage(0))  # only first page of attachment
    # write out the merged file
    outputStream = file(merged_path, "wb")
    output.write(outputStream)
    outputStream.close()

I used the class from the answer to "Is there a matplotlib flowable for ReportLab?" to solve my issue. It inserts the PDFs as vector PDF images.
It works great because I needed a table of contents; the flowable object allowed the built-in TOC functionality to work like a charm.
Note: If you have multiple pages in the file, you have to modify the class slightly. The sample class is designed to just read the first page of the PDF.

I know the question is a bit old but I'd like to provide a new solution using the latest PyPDF2.
You now have access to the PdfFileMerger, which can do exactly what you want, append PDFs to an existing file. You can even merge them in different positions and choose a subset or all the pages!
The official docs are here: https://pythonhosted.org/PyPDF2/PdfFileMerger.html
An example from the code in your question:
import tempfile
import PyPDF2
from django.core.files import File
# Using a temporary file rather than a buffer in memory is probably better
temp_base = tempfile.TemporaryFile()
temp_final = tempfile.TemporaryFile()
# Create document, add what you want to the story, then build
doc = SimpleDocTemplate(temp_base, pagesize=letter, ...)
...
doc.build(...)
# Now, this is the fancy part. Create merger, add extra pages and save
merger = PyPDF2.PdfFileMerger()
merger.append(temp_base)
# Add any extra document, you can choose a subset of pages and add bookmarks
merger.append(entry.document.file, bookmark='Attachment')
merger.write(temp_final)
# Write the final file in the HTTP response
django_file = File(temp_final)
resp = HttpResponse(django_file, content_type='application/pdf')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
if django_file.size is not None:
    resp['Content-Length'] = django_file.size
return resp
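On the choice between a temporary file and an in-memory buffer: the standard library's tempfile.SpooledTemporaryFile sits in between, keeping data in memory until it grows past a threshold and then spilling to disk. A minimal sketch, independent of ReportLab and PyPDF2, is below. The function name write_then_read, the 1 MiB threshold and the payload are placeholders, and whether a given PDF library accepts a spooled file wherever it accepts a regular one is not verified here.

```python
import tempfile


def write_then_read(payload, max_size=1024 * 1024):
    """Write `payload` to a spooled temp file and read it back."""
    with tempfile.SpooledTemporaryFile(max_size=max_size) as tmp:
        tmp.write(payload)
        # After writing, the position is at the end; rewind before reading.
        tmp.seek(0)
        return tmp.read()
```

The rewind step matters for any file-like object that is written and then handed to a reader which starts from the current position.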

Use this custom flowable:
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
from reportlab.platypus import Flowable, PageBreak

class PDF_Flowable(Flowable):
    #----------------------------------------------------------------------
    def __init__(self, P, page_no):
        Flowable.__init__(self)
        self.P = P
        self.page_no = page_no
    #----------------------------------------------------------------------
    def draw(self):
        """
        Draw the selected PDF page onto the canvas.
        """
        canv = self.canv
        pages = self.P
        page_no = self.page_no
        x, y = 0, 0  # assumed offsets; the original snippet left x and y undefined
        canv.saveState()  # added so that restoreState() below has a matching save
        canv.translate(x, y)
        canv.doForm(makerl(canv, pages[page_no]))
        canv.restoreState()
Then, after opening the existing PDF, add one flowable per page to elements[], e.g.:
pages = PdfReader(BASE_DIR + "/out3.pdf").pages
pages = [pagexobj(x) for x in pages]
for i in range(0, len(pages)):
    F = PDF_Flowable(pages, i)
    elements.append(F)
    elements.append(PageBreak())
