Add text to existing PDF document in Python

Add text to existing PDF document in Python - python

I'm trying to convert a pdf to the same size as my pdf which is an A4 page.
convert my_pdf.pdf -density 300x300 -page A4 my_png.png
The resulting png file, however, is 595px × 842px which should be the resolution at 72 dpi.
I was thinking of using PIL to write some text on some of the pdf fields and convert it back to PDF. But currently the image is coming out wrong.
Edit: I was approaching the problem from the wrong angle. The correct approach didn't include imagemagick at all.

After searching around some I finally found the solution:
It turns out that this was the correct approach after all.
Yet, i feel that it wasn't verbose enough.
It appears that the poster probably took it from here (same variable names etc).
The idea: create new blank PDF with Reportlab which only contains a text string.
Then merge/add it as a watermark using pyPdf.
from pyPdf import PdfFileWriter, PdfFileReader
import StringIO
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
packet = StringIO.StringIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.drawString(100,100, "Hello world")
can.save()
#move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(file("mypdf.pdf", "rb"))
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(0)
page.mergePage(new_pdf.getPage(0))
output.addPage(page)
# finally, write "output" to a real file
outputStream = file("/home/joe/newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
Hope this helps somebody else.

I just tried the solution above, but I had quite some troubles to get it running in Python3. So, I would like to share my modifications. The adapted code looks as follows:
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.drawString(100, 100, "Hello world")
can.save()
# move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(open("mypdf.pdf", "rb"))
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(0)
page2 = new_pdf.getPage(0)
page.mergePage(page2)
output.addPage(page)
# finally, write "output" to a real file
outputStream = open("newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
Now the page.mergePage throws an error. Turns out to be a porting error in pypdf2. Please refer to this question for the solution: Porting to Python3: PyPDF2 mergePage() gives TypeError

You should look at Add text to Existing PDF using Python and also Python as PDF Editing and Processing Framework. These will point you in the right direction.
If you do what you've proposed in the question, when you export back to .pdf, it will really just be an image file embedded in a .pdf, it won't be text.

pdfrw will let you take existing PDFs and place them as form XObjects (similar to images) on a reportlab canvas. There are some examples for this in the pdfrw examples/rl1 subdirectory on github. Disclaimer -- I am the pdfrw author.

Related

pypdf gives output with incorrect PDF format

I am using the following code to resize pages in a PDF:
from pypdf import PdfReader, PdfWriter, Transformation, PageObject, PaperSize
from pypdf.generic import RectangleObject
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
A4_w = PaperSize.A4.width
A4_h = PaperSize.A4.height
# resize page to fit *inside* A4
h = float(page.mediabox.height)
w = float(page.mediabox.width)
scale_factor = min(A4_h/h, A4_w/w)
transform = Transformation().scale(scale_factor,scale_factor).translate(0, A4_h/2 - h*scale_factor/2)
page.add_transformation(transform)
page.cropbox = RectangleObject((0, 0, A4_w, A4_h))
# merge the pages to fit inside A4
# prepare A4 blank page
page_A4 = PageObject.create_blank_page(width = A4_w, height = A4_h)
page.mediabox = page_A4.mediabox
page_A4.merge_page(page)
writer.add_page(page_A4)
writer.write('output.pdf')
Source: https://stackoverflow.com/a/75274841/11501160
While this code works fine for the resizing part, I have found that most input files work fine but some input files do not work fine.
I am providing download links to input.pdf and output.pdf files for testing and review. The output file is completely different from the input file. The images are missing, the background colour is different, even the pure text on first page has only the first line visible.
What is interesting is that these difference are only seen when I open the output pdf in Adobe Acrobat, or look at the physically printed pages.
The PDF looks perfect when i open in Preview (on MacOS) or open the PDF in my Chrome Browser.
and
The origin of the input pdf is that I created it in Preview (on MacOS) by mixing pages from different PDFs and dragging image files into the thumbnails as per these instructions:
https://support.apple.com/en-ca/HT202945
I've never had a problem before while making PDFs like this and even Adobe Acrobat reads the input pdf properly. Only the output pdf is problematic in Acrobat and in printers.
Is this a bug with pypdf or am I doing something wrong ?
How can i get the output PDF to be proper in Adobe Acrobat and printers etc ?

This is a valid bug with pypdf and the fix is due to be released in the next version.
Refer:
https://github.com/py-pdf/pypdf/issues/1607

The following is what PyMuPDF has to offer here. The output displays correctly in all PDF readers:
import fitz # import PyMuPDF
src = fitz.open("input.pdf")
doc = fitz.open()
for i in range(len(src)):
page = doc.new_page() # this is A4 portrait by default
page.show_pdf_page(page.rect, src, i) # scaling will happen automatically
doc.save("fitz-output.pdf",garbage=3,deflate=True)
The above method show_pdf_page() supports many more options, like selecting sub-rectangles form the source page, rotating it by arbitrary angles, and of course freely select the target page's sub-rectangle to receive the content.

How do I add an image from a list in python using docx?

I wrote a code that takes a screenshot that I want to paste into a word document using docx. So far I have to save the image as a png file. The relevant part of my code is:
from docx import Document
import pyautogui
import docx
doc = Document()
images = []
img = pyautogui.screenshot(region = (some region))
images.append(img)
img.save(imagepath.png)
run =doc.add_picture(imagepath.png)
run
I would like to be able to add the image without saving it. Is it possible to do this using docx?

Yes, according to add_picture — Document objects — python-docx 0.8.10 documentation, add_picture can import data from a stream as well.
As per Screenshot Functions — PyAutoGUI 1.0.0 documentation, screenshot() produces a PIL/Pillow image object which can be save()'d with a BytesIO() as destination to produce a compressed image data stream in memory.
So that'll be:
import io
imdata = io.BytesIO()
img.save(imdata, format='png')
imdata.seek(0)
doc.add_picture(imdata)
del imdata # cannot reuse it for other pictures, you need a clean buffer each time
# can use .truncate(0) then .seek(0) instead but this is probably easier

using pytesseract to generate a PDF from image

I am using the following code to generate a PDF from image.
PDF=pytesseract.image_to_pdf_or_hocr(test_image,lang='dan',config='',nice=0,extension='pdf')
and the type of PDF variable is being shown as BYTES.
HOw Do i publish or get the PDF generated?

I have found the answer. Just to close the thread, posting the same.
f = open("demofile.pdf", "w+b")
f.write(bytearray(pdf))
f.close()
demofile.pdf happens to be resultant pdf which gets published in the workspace.

From Pytesseract-PYPI:
Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
f.write(pdf) # pdf type is bytes by default

Substitute string in .docx file by jpg with python

Sorry for my bad English.
I'm trying to substitute a string in .docx file by a .jpg file. First I convert JPEG to BMP and move it to clipboard follow Copy PIL/PILLOW Image to Windows Clipboard, then use Find.Execute with "^c" to substitute a special string in docx file.
The substitution works well, but it paste a image with width of 15.42cm to the .docx file. I've tried to resize it with im.resize, but it ends with a large blur image rather than a small one. How could I make it smaller?
I'm using python2.7.2 and Win7. Thanks a lot.
from win32com.client import Dispatch
from cStringIO import StringIO
import win32clipboard
import win32com
from PIL import Image
def setImageToClipboard(clip_type, data):
win32clipboard.OpenClipboard()
win32clipboard.EmptyClipboard()
win32clipboard.SetClipboardData(clip_type, data)
win32clipboard.CloseClipboard()
filepath = 'd:/tmp.jpg'
im = Image.open(filepath)
#im = im.resize((10, 10))
output = StringIO()
im.convert("RGB").save(output, "BMP")
data = output.getvalue()[14:]
output.close()
w = win32com.client.Dispatch('Word.Application')
w.Visible = 1
w.DisplayAlerts = 0
doc = w.Documents.Open("d:/clipboard_test.docx")
search = "TEST"
setImageToClipboard(win32clipboard.CF_DIB, data)
w.Selection.Find.ClearFormatting()
w.Selection.Find.Replacement.ClearFormatting()
w.Selection.Find.Execute("TEST", False, True, False, False, False, True, 1, True, ReplaceWith="^c", Replace=2000)
doc.SaveAs("d:/clipboard_test2.docx")
doc.Close()
w.Quit()

You can do this much more cleanly and without actually needing MS-Word installed or starting it up by using python-docx.
You will need to read in your existing document, find the text that you need to replace and then use document.add_picture('monty-truth.png', width=Inches(1.25))
rather than using word and pretending to paint.
The only other option is to have python select the image that you have pasted and set its properties within word.

Issue writing temp images to temp pdf in pyramid with reportlabs

I am using python 3, with pyramid and reportlabs to generate dynamic pdfs.
I am having a issue writing images in to a pdf. I am using Reportlab in a web to generate a pdf with images, by my images are not stored locally, they are on a remote server. I am downloading them locally into a temp directory ( they are saving, I have checked) When i go to add the images to the pdf, they space is allocating but image is not showing up.
Here is my relevant code (simplified):
# creates pdf in memory
doc = SimpleDocTemplate(pdfName, pagesize=A4)
elements = []
for item in model['items']:
# image goes here:
if item['IMAGENAME']:
response = getImageFromRemoteServer(item['IMAGENAME'])
dir_filename = directory + item['IMAGENAME']
if response.status_code == 200:
with open(dir_filename, 'wb') as f:
for chunk in response.iter_content():
f.write(chunk)
questions.append(Image(dir_filename, width=2*inch, height=2*inch))
# create and save the pdf
doc.build(elements,canvasmaker=NumberedCanvas)
I have followed the user guide here https://www.reportlab.com/docs/reportlab-userguide.pdf and have tried the above way, plus embedded images (as the user guide says in the paragraph section) and putting the image in the table.
I also looked here: and it did not help me.
My question is really, what is the right what to download an image and put in a pdf?
EDIT: fixed code indentation
EDIT 2:
Answered, I was finally about to get the images in the PDF. I am not sure what was the trigger to get it to work. The only thing that know I change was now I am using urllib to do the request and before i was not. Here is the my working code (simplified for the question only, this is more abstracted and encapsulated in my code.):
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
# image goes here:
if question['IMAGEFILE']:
filename = question['IMAGEFILE']
dir_filename = directory + filename
url = get_url(settings, filename)
response = urllib.request.urlopen(url)
raw_data = response.read()
f = open(dir_filename, 'wb')
f.write(raw_data)
f.close()
response.close()
myImage = Image(dir_filename)
myImage.drawHeight = 2* inch
myImage.drawWidth = 2* inch
myImage.hAlign = "LEFT"
elements.append(myImage)
# create and save the pdf
doc.build(elements)

Make your code independent from where the files come from. Separate file/resource retrieval from document generation. Ensure that your toolset is working with local files. Encapsulate the code to load files in a loader class or function. The encapsulation is what matters. Noticed this again this week while looking at thumbor loader classes.
If that works, you know reportlab, PIL and your application basically work.
Then make your code work with remote files using URI like http://path/to/remote/files.
Afterwards you can switch from using your fileloader or your httploader depending on environment or use case.
Another option to go would be to make your code work with local files using URI like file://path/to/file
This way the only thing that changes when switching from local to remote is the URL. Probably you need a python library supporting this. requests library is well suited for downloading things, most probably it supports URL scheme file:// as well.

Most probably the lazy parameter was responsible that your first code sample did not render the images. Triggering reportlab PDF rendering outside of the context managers for temporary files could have lead to this behaviour.
reportlab.platypus.flowables.py (using version 3.1.8)
class Image(Flowable):
"""an image (digital picture). Formats supported by PIL/Java 1.4 (the Python/Java Imaging Library
are supported. At the present time images as flowables are always centered horozontally
in the frame. We allow for two kinds of lazyness to allow for many images in a document
which could lead to file handle starvation.
lazy=1 don't open image until required.
lazy=2 open image when required then shut it.
"""
_fixedWidth = 1
_fixedHeight = 1
def __init__(self, filename, width=None, height=None, kind='direct', mask="auto", lazy=1):
"""If size to draw at not specified, get it from the image."""
self.hAlign = 'CENTER'
self._mask = mask
fp = hasattr(filename,'read')
if fp:
self._file = filename
self.filename = repr(filename)
...
The last three lines of the code example tell you that you can pass an object that has a read method. This is exactly what a call to urllib.request.urlopen(url) returns. Using that memory buffer you create an instance of Image. No need to have write access to filesystem, no need to delete these files after PDF rendering. Applying our new knowledge to improve code readability. Since your use-case includes retrieval of remote resources using memory buffers that support python file API could be a much cleaner approach to assemble your PDF files.
from contextlib import closing
import urllib.request
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
# download image and create Image from file-like object
if question['IMAGEFILE']:
filename = question['IMAGEFILE']
image_url = get_url(settings, filename)
with closing(urllib.request.urlopen(image_url)) as image_file:
myImage = Image(image_file, width=2*inch, height=2*inch)
myImage.hAlign = "LEFT"
elements.append(myImage)
# create and save the pdf
doc.build(elements)
References
Coding with context managers

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Add text to existing PDF document in Python - python

pdfrw will let you take existing PDFs and place them as form XObjects (similar to images) on a reportlab canvas. There are some examples for this in the pdfrw examples/rl1 subdirectory on github. Disclaimer -- I am the pdfrw author.

Related

pypdf gives output with incorrect PDF format

How do I add an image from a list in python using docx?

using pytesseract to generate a PDF from image

Substitute string in .docx file by jpg with python

Issue writing temp images to temp pdf in pyramid with reportlabs

Categories

Resources