I have a couple of PDFs I want to add a few inches on one side to give myself more room for handwritten comments in a notes app. Basically, I want to give myself more room to scribble on the sides of the pages (lecture scripts).
The pages should not be scaled, I simply want the contents to stay at the same spot from the upper left corner, but add more space at the right and maybe at the bottom.
Is there a good way to do this, either using one of the Python PDF libs or using a command line tool?
Can I simply add extra space to the Media box, or do I need to do something else?
OK, the following code seems to work.
Had to set mediaBox and cropBox to get the desired result.
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf = PdfFileReader("org.pdf")
writer = PdfFileWriter()
factor = 1.3
for page in pdf.pages:
    x, y = page.mediaBox.lowerRight
    page.mediaBox.lowerRight = (factor * float(x), float(y))
    x, y = page.cropBox.lowerRight
    page.cropBox.lowerRight = (factor * float(x), float(y))
    writer.addPage(page)
with open("out.pdf", "wb") as out_f:
    writer.write(out_f)
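The code above widens the page by a factor. Since the question asks for a few extra inches on the right and maybe at the bottom, here is a minimal sketch of just the box arithmetic, assuming the default PDF unit of 1/72 inch and a box given as (lower-left x, lower-left y, upper-right x, upper-right y). The helper name `extend_box` and its parameters are made up for illustration and are not part of any PDF library.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def extend_box(box, extra_right_in=0.0, extra_bottom_in=0.0):
    """Return a new (llx, lly, urx, ury) box with space added on the right and bottom.

    The upper and left edges are left alone, so the content keeps its
    position relative to the upper left corner. Hypothetical helper.
    """
    llx, lly, urx, ury = (float(v) for v in box)
    return (
        llx,
        lly - extra_bottom_in * POINTS_PER_INCH,  # the y axis points up, so "bottom" means a smaller y
        urx + extra_right_in * POINTS_PER_INCH,
        ury,
    )


letter = (0, 0, 612, 792)  # US Letter in points
wider = extend_box(letter, extra_right_in=2, extra_bottom_in=1)
print(wider)
```

The resulting four numbers would then have to be written back to both the media box and the crop box, as in the code above.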
I wrote a little script which shall blank out the lower half of a PDF document. The document itself shall remain the same size, but the lower half shall be just white.
(This is to remove the "instructions" part from parcel labels of German parcel companies like DHL and Hermes.)
To do this, I take the PDF page, adjust the Mediabox, and then merge this page onto a new, blank page.
Fortunately, this works as intended with the PDFs I need it for. However, I also tried a few other PDFs and for some, it just does not work. It copies over the complete PDF. This happens for example, when my code is given this file: https://www.veeam.com/veeam_backup_product_overview_ds.pdf
Here is the code:
import pypdf  # PyPDF2, 3 and 4 are deprecated. PyPDF is currently in active development
reader = pypdf.PdfReader(source_filename)
writer = pypdf.PdfWriter()
# get first page
page = reader.pages[0]
# create new page
new_page = pypdf.PageObject.create_blank_page(None, width=page.mediabox.width, height=page.mediabox.height)
# crop original
page.mediabox.bottom = (page.mediabox.top - page.mediabox.bottom) / 2 + page.mediabox.bottom
# merge original into empty new page
new_page.merge_page(page)
writer.add_page(new_page)
with open(output_file, "wb") as fp:
    writer.write(fp)
Can anyone explain why it does not work sometimes?
I need to replace K words with K other words in every PDF file within a certain directory, and on top of this I need to replace every logo with another logo. I have around 1000 PDF files, so I do not want to use Adobe Acrobat and edit one file at a time. How can I start this?
Replacing words seems at least doable as long as there is a decent PDF reader one can access through Python ( Note I want to do this task in Python ), however replacing an image might be more difficult. I will most likely have to find the dimension of the current image and resize the image being used to replace the current image dynamically, whilst the program runs through these PDF files.
Hi, so I've written down some code regarding this:
from pikepdf import Pdf, PdfImage, Name
import os
import glob
from PIL import Image
import zlib
example = Pdf.open(r'...\Likelihood.pdf')
PagesWithImages = []
ImageCodesForPages = []
# Grab all the pages and all the images in every page.
for i in example.pages:
    if len(list(i.images.keys())) >= 1:
        PagesWithImages.append(i)
        ImageCodesForPages.append(list(i.images.keys()))
pdfImages = []
for i, j in zip(PagesWithImages, ImageCodesForPages):
    for x in j:
        pdfImages.append(i.images[x])
# Replace every single page using random image, ensure that the dimensions remain the same?
for i in pdfImages:
    pdfimage = PdfImage(i)
    rawimage = pdfimage.obj
    im = Image.open(r'...\panda.jpg')
    pillowimage = pdfimage.as_pil_image()
    print(pillowimage.height)
    print(pillowimage.width)
    im = im.resize((pillowimage.width, pillowimage.height))
    im.show()
    rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
    rawimage.ColorSpace = Name("/DeviceRGB")
So just one problem, it doesn't actually replace anything. If you're wondering why and how I wrote this code I actually got it from this documentation:
https://buildmedia.readthedocs.org/media/pdf/pikepdf/latest/pikepdf.pdf
Start at Page 53
I essentially put all the pdfImages into a list, as one page can have multiple images. The last for loop then tries to replace each of these images while keeping the same width and height. Also note that I changed the file paths here; they are definitely not the issue.
Again Thank You
I have figured out what I was doing wrong. So for anyone that wants to actually replace an image with another image in place on a PDF file what you do is:
from pikepdf import Pdf, PdfImage, Name
from PIL import Image
import zlib
# filepath is the PDF to edit, imagePath is the replacement image
example = Pdf.open(filepath, allow_overwriting_input=True)
# Grab all the pages and all the images in every page.
for i in example.pages:
    imagelists = list(i.images.keys())
    if len(imagelists) >= 1:
        for x in imagelists:
            rawimage = i.images[x]
            pdfimage = PdfImage(rawimage)
            rawimage = pdfimage.obj
            pillowimage = pdfimage.as_pil_image()
            # /DeviceRGB expects 3 bytes per pixel, so force RGB mode
            im = Image.open(imagePath).convert("RGB")
            im = im.resize((pillowimage.width, pillowimage.height))
            rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
            rawimage.ColorSpace = Name("/DeviceRGB")
            rawimage.Width, rawimage.Height = pillowimage.width, pillowimage.height
# Without a filename, save() writes back to the file that was opened
example.save()
Essentially, I changed the arguments in the first line so that I explicitly allow overwriting the input file. I also added the last line, which actually saves the changes.
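For anyone wondering what the `write(...)` call is actually given: it is just the raw pixel bytes, zlib-compressed (FlateDecode is zlib/deflate compression). A minimal sketch using only the standard library, assuming 8 bits per channel and RGB order; the helper name `solid_rgb_stream` is made up for illustration and is not part of pikepdf or Pillow.

```python
import zlib


def solid_rgb_stream(width, height, rgb):
    """Build a zlib-compressed raw RGB buffer for a solid-colour image (hypothetical helper)."""
    raw = bytes(rgb) * (width * height)  # 3 bytes per pixel, row after row
    return zlib.compress(raw)


stream = solid_rgb_stream(4, 3, (255, 0, 0))
raw = zlib.decompress(stream)
print(len(raw))  # 4 * 3 pixels * 3 bytes per pixel = 36
```

This is why the width, height and colour space written to the image object have to agree with the bytes: the viewer expects exactly width * height * 3 bytes after decompression for an 8-bit RGB image.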
Given a PDF file, is there any way to find its page dimensions and orientation (horizontal or vertical)? The PyPDF2 library gives a function to check the number of pages, but how can I extract other info? Is it possible to use the library linked below to find information about the file, such as date of creation, number of pages, title, or anything else?
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("document1.pdf", "rb"))
# print how many pages input1 has:
print("document1.pdf has %d pages." % input1.getNumPages())
https://pythonhosted.org/PyPDF2/
You can use the /Rotate in order to get a page's rotation.
import PyPDF2
pdf = PyPDF2.PdfFileReader(open('document1.pdf', 'rb'))
# pagenumber is the zero-based index of the page to inspect
orientation = pdf.getPage(pagenumber).get('/Rotate')
It will yield a value in degrees. Though this may be useful for some documents, note that the page rotation by itself does not denote the orientation, as @mkl pointed out in the comments.
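To make that caveat concrete, here is a minimal sketch of how the rotation could be combined with the page size to decide the orientation. It assumes the width and height come from the page's media box and that the rotation is a multiple of 90 (or missing); the function name `page_orientation` is made up for illustration and is not a PyPDF2 function.

```python
def page_orientation(width, height, rotate=0):
    """Classify a page as portrait, landscape or square (hypothetical helper).

    width, height: page size, e.g. taken from the media box
    rotate: value of /Rotate in degrees, or None if the key is absent
    """
    rotate = (rotate or 0) % 360
    if rotate in (90, 270):
        # A quarter turn swaps the displayed width and height
        width, height = height, width
    if width > height:
        return "landscape"
    if width < height:
        return "portrait"
    return "square"


print(page_orientation(612, 792))      # unrotated Letter page
print(page_orientation(612, 792, 90))  # same page, rotated a quarter turn
```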
As to other metadata, there are many things you can pull out. You can look into PyPDF2.pdf.DocumentInformation methods for all of them.
My original goal was to remove the extensive white margins on my PDF pages.
Then I found this purpose can be achieved by scaling the page using the code below, but annotations are not scaled.
import PyPDF2
# This works fine
with open('old.pdf', 'rb') as pdf_obj:
    pdf = PyPDF2.PdfFileReader(pdf_obj)
    out = PyPDF2.PdfFileWriter()
    for page in pdf.pages:
        page.scale(2, 2)
        out.addPage(page)
    with open('new.pdf', 'wb') as f:
        out.write(f)
# This attempts to remove annotations
with open('old.pdf', 'rb') as pdf_obj:
    pdf = PyPDF2.PdfFileReader(pdf_obj)
    page = pdf.pages[2]
    print(page['/Annots'], '\n\n\n\n')
    page.Annots = []
    print(page['/Annots'])
Is there a way to remove annotations? Or any suggestion that can help me to get rid of the white margin.
The method PdfFileWriter.removeLinks() removes links and annotations. So, if you are okay with losing both you can add out.removeLinks() in your first block of code, the one that's working fine.
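As for why the second snippet in the question has no visible effect, one plausible reason is that the page object behaves like a dictionary keyed by PDF names, so assigning to `page.Annots` only creates a Python attribute and leaves the `'/Annots'` entry alone. The sketch below shows this with a plain dict subclass as a stand-in; `FakePage` is hypothetical and is not PyPDF2 itself.

```python
class FakePage(dict):
    """Hypothetical stand-in for a dictionary-like PDF page object."""


page = FakePage({"/Annots": ["link-1", "note-2"]})

page.Annots = []  # sets a Python attribute; the "/Annots" key is untouched
still_there = "/Annots" in page

page.pop("/Annots", None)  # removing the key is what actually drops the entry
removed = "/Annots" not in page

print(still_there, removed)
```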
I've written a script with Python's xlrd and pptx to read each workbook in a directory and pull information from each sheet into a table in a PowerPoint slide. It works okay if the Excel table is small, but I don't know what will be in these Excel files. It becomes illegible when there are too many rows and columns. My main problem arose when an Excel file had graphs instead of cells and the script couldn't read it. So I tried using pyscreenshot to open the document and take a screenshot, but this seems slow and unnecessary. I'd like to make a slide in the PowerPoint look exactly as it would in Excel, but with the ability to add and change things.
# import libraries and modules
import xlrd
from pptx import Presentation
from pptx.util import Inches, Pt
import time
import glob
import os
start = time.time()
prs = Presentation()
title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)
shapes = slide.shapes
title = slide.shapes.title
subtitle = slide.placeholders[1]
title.text = "Dashboard Generator"
subtitle.text = "made with Python-pptx and xlrd"
for filename in glob.glob(os.path.join("C:/Users/penelope/Desktop/PMO/myfiles/", '*.xlsx')):
    print(filename)
    file_location = filename
    try:
        workbook = xlrd.open_workbook(file_location)
        nsheets = workbook.nsheets
        for n in range(0, nsheets):
            sheet = workbook.sheet_by_index(n)
            print("sheet:", sheet)
            rows = sheet.nrows
            cols = sheet.ncols
            c = cols
            r = rows
            if c > 0:
                print(c, r)
                slide = prs.slides.add_slide(prs.slide_layouts[5])
                shapes = slide.shapes
                title = slide.shapes.title
                title.text = "Table testing"
                left = Inches(0.0)
                top = Inches(2.0)
                width = Inches(6.0)
                height = Inches(4.0)
                num = 10.0 / c
                table = shapes.add_table(rows, cols, left, top, width, height).table
                for i in range(0, c):
                    table.columns[i].width = Inches(num)
                for i in range(0, r):
                    for e in range(0, c):
                        table.cell(i, e).text = str(sheet.cell_value(i, e))
                        cell = table.rows[i].cells[e]
                        paragraph = cell.text_frame.paragraphs[0]
                        paragraph.font.size = Pt(11)
    except Exception as exc:
        # report which file failed and why, then continue with the next one
        print("Error!", filename, exc)
prs.save('powerpointfile1.pptx')
end = time.time()
print(end - start)
And this is my screenshot script:
import os
import time
import pyscreenshot as ImageGrab
from PIL import Image
if __name__ == "__main__":
    os.system('start excel.exe "C:/Users/penelope/Desktop/PMO/TestCase.xlsx"')
    time.sleep(3)
    im = ImageGrab.grab(bbox=(24, 210, 1800, 990))
    im.save("image7.png")
    img = Image.open('image7.png')
    img.show()
Well, you've chosen a hard problem. Certainly all the times I've attempted this sort of thing I've ended up abandoning the effort.
The fundamental explanation I formed was that Excel (and Word) are "flowed" document environments. That is, when you run out of room on one page, it flows to the next. PowerPoint, on the other hand, is a page-by-page exhibit layout environment. Each slide is independent of the rest (evidenced by the ability to reorder slides freely), each meant to be shown all at once, and not scrolled. This leads to each slide being self-contained, which means constrained to a single "page".
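To picture what that "flow" step amounts to, here is a minimal sketch that splits a sheet's rows into slide-sized chunks, repeating a header row on each chunk. It is only an illustration of the idea: the function name `paginate_rows` and the rows-per-slide limit are assumptions, and it says nothing about whether the resulting slides would communicate well.

```python
def paginate_rows(rows, max_rows_per_slide, header=None):
    """Split rows into chunks of at most max_rows_per_slide (hypothetical helper).

    If a header row is given, it is repeated at the top of every chunk
    and does not count towards the limit.
    """
    if max_rows_per_slide < 1:
        raise ValueError("max_rows_per_slide must be at least 1")
    pages = []
    for start in range(0, len(rows), max_rows_per_slide):
        chunk = rows[start:start + max_rows_per_slide]
        pages.append(([header] if header is not None else []) + chunk)
    return pages


rows = [[f"r{i}", i] for i in range(7)]
pages = paginate_rows(rows, 3, header=["name", "value"])
print([len(p) for p in pages])
```

Each chunk could then be written to its own slide's table instead of forcing the whole sheet onto one slide.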
There's a limit to how much information one can place on a slide and still have it communicate. Generally less is better. So, perhaps it's not a surprise all my early efforts there ended in frustration :) I also concluded that an effective "dashboard" slide would require very skillful layout, and extreme restraint on content length, probably requiring specific (human) summarization effort (not just copying from a "database").
Regarding the charts bit, those theoretically can be moved to PowerPoint and I've even seen it done, but it's technically quite challenging. There is no API support for it in python-pptx. This historical issue on the GitHub repo may give some idea what was involved. Not for the faint of heart I expect :)