pypdf gives output with incorrect PDF format

I am using the following code to resize pages in a PDF:
from pypdf import PdfReader, PdfWriter, Transformation, PageObject, PaperSize
from pypdf.generic import RectangleObject
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    A4_w = PaperSize.A4.width
    A4_h = PaperSize.A4.height
    # resize page to fit *inside* A4
    h = float(page.mediabox.height)
    w = float(page.mediabox.width)
    scale_factor = min(A4_h/h, A4_w/w)
    transform = Transformation().scale(scale_factor,scale_factor).translate(0, A4_h/2 - h*scale_factor/2)
    page.add_transformation(transform)
    page.cropbox = RectangleObject((0, 0, A4_w, A4_h))
    # merge the pages to fit inside A4
    # prepare A4 blank page
    page_A4 = PageObject.create_blank_page(width = A4_w, height = A4_h)
    page.mediabox = page_A4.mediabox
    page_A4.merge_page(page)
    writer.add_page(page_A4)
writer.write('output.pdf')
Source: https://stackoverflow.com/a/75274841/11501160
The resizing itself works, and most input files come out fine, but some do not.
I am providing download links to input.pdf and output.pdf for testing and review. The output file looks completely different from the input file: the images are missing, the background colour is different, and even on the first page, which is pure text, only the first line is visible.
Interestingly, these differences appear only when I open the output PDF in Adobe Acrobat or look at the physically printed pages.
The PDF looks perfect when I open it in Preview (on macOS) or in the Chrome browser.
I created the input PDF in Preview (on macOS) by mixing pages from different PDFs and dragging image files into the thumbnails, as per these instructions:
https://support.apple.com/en-ca/HT202945
I've never had a problem making PDFs this way, and even Adobe Acrobat reads the input PDF properly. Only the output PDF is problematic in Acrobat and on printers.
Is this a bug in pypdf, or am I doing something wrong?
How can I get the output PDF to display properly in Adobe Acrobat, on printers, etc.?

This is a valid bug with pypdf and the fix is due to be released in the next version.
Refer:
https://github.com/py-pdf/pypdf/issues/1607

The following is what PyMuPDF has to offer here. The output displays correctly in all PDF readers:
import fitz # import PyMuPDF
src = fitz.open("input.pdf")
doc = fitz.open()
for i in range(len(src)):
    page = doc.new_page() # this is A4 portrait by default
    page.show_pdf_page(page.rect, src, i) # scaling will happen automatically
doc.save("fitz-output.pdf",garbage=3,deflate=True)
The show_pdf_page() method used above supports many more options, such as selecting sub-rectangles from the source page, rotating it by arbitrary angles, and of course freely selecting the target page's sub-rectangle to receive the content.
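For reference, the scale-to-fit arithmetic that the question's loop performs by hand can be sketched in plain Python, without any PDF library. The function name fit_inside is hypothetical, the A4 size of 595 x 842 points is the rounded value that pypdf's PaperSize.A4 reports, and the horizontal offset is an addition (the question's code only centers vertically).

```python
# A4 in PDF points (1 pt = 1/72 inch), rounded as pypdf's PaperSize.A4 reports it.
A4_WIDTH, A4_HEIGHT = 595, 842


def fit_inside(width, height, box_width=A4_WIDTH, box_height=A4_HEIGHT):
    """Return (scale, dx, dy) that fits a width x height page inside the box, centered."""
    # The smaller ratio guarantees that neither dimension overflows the box.
    scale = min(box_width / width, box_height / height)
    # Offsets that center the scaled page inside the box.
    dx = (box_width - width * scale) / 2
    dy = (box_height - height * scale) / 2
    return scale, dx, dy


# US Letter (612 x 792 pt) is limited by its width when placed on A4.
scale, dx, dy = fit_inside(612, 792)
print(scale, dx, dy)
```

The returned values map onto the question's code as Transformation().scale(scale, scale).translate(dx, dy); with dx fixed at 0 this is the transform used there.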

Related

Cropping the Mediabox does not work for some pdfs

I wrote a little script to blank out the lower half of a PDF document. The document itself should keep its size, but the lower half should be plain white.
(This is to remove the "instructions" part from parcel labels of German parcel companies like DHL and Hermes.)
To do this, I take the PDF page, adjust the Mediabox, and then merge this page onto a new, blank page.
Fortunately, this works as intended with the PDFs I need it for. However, I also tried a few other PDFs and for some, it just does not work. It copies over the complete PDF. This happens for example, when my code is given this file: https://www.veeam.com/veeam_backup_product_overview_ds.pdf
Here is the code:
import pypdf # PyPDF2, 3 and 4 are deprecated. PyPDF is currently in active development
reader = pypdf.PdfReader(source_filename)
writer = pypdf.PdfWriter()
# get first page
page = reader.pages[0]
# create new page
new_page = pypdf.PageObject.create_blank_page( None, width = page.mediabox.width, height = page.mediabox.height )
# crop original
page.mediabox.bottom = ( page.mediabox.top - page.mediabox.bottom ) / 2 + page.mediabox.bottom
# merge original into empty new page
new_page.merge_page( page )
writer.add_page(new_page)
with open(output_file, "wb") as fp:
    writer.write(fp)
Can anyone explain why it does not work sometimes?

Extracting Powerpoint background images using python-pptx

I have several powerpoints that I need to shuffle through programmatically and extract images from. The images then need to be converted into OpenCV format for later processing/analysis. I have done this successfully for images in the pptx, using:
for slide in presentation:
    for shape in slide.shapes:
        if 'Picture' in shape.name:
            pic_list.append(shape)
for extraction, and:
img = cv2.imdecode(np.frombuffer(page[i].image.blob, np.uint8), cv2.IMREAD_COLOR)
for python-pptx Picture to OpenCV conversion. However, I am having a lot of trouble extracting and manipulating the backgrounds in a similar fashion.
slide.background
is sufficient to extract a "_Background" object, but I have not found a good way to convert it into an OpenCV object similar to Pictures. Does anyone know how to do this? I am using python-pptx for extraction, but am not averse to other packages if it's not possible with that package.

After a fair bit of work I discovered how to do this: you don't. As far as I can tell, there is no way to directly extract the backgrounds with either python-pptx or Aspose. A PowerPoint file, which as it turns out is an archive that can be unzipped with 7zip, keeps its backgrounds disassembled in ppt/media (pics) and in ppt/slideLayouts and ppt/slideMasters (text, formatting), and they are only put together by the PowerPoint renderer. This means that to extract the backgrounds as displayed, you basically need to run PowerPoint and take pictures of the slides after removing text, pictures, etc. from the foreground.
I did not need to do this, as I just needed to extract text from the backgrounds. This can be done by checking slideLayouts and slideMasters XMLs using BeautifulSoup, at the <a:t> tag. The code to do this is pretty simple:
import zipfile
with zipfile.ZipFile(pptx_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_directory)
This will extract the .pptx into its component files.
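import os
from glob import glob
# build the paths from components so they work on any OS (no backslash escapes)
layouts = glob(os.path.join(extraction_directory, 'ppt', 'slideLayouts', '*.xml'))
masters = glob(os.path.join(extraction_directory, 'ppt', 'slideMasters', '*.xml'))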
files = layouts + masters
This gets you the paths for slide layouts/masters.
from bs4 import BeautifulSoup
text_list = []
for file in files:
    with open(file) as f:
        data = f.read()
    bs_data = BeautifulSoup(data, "xml")
    bs_a_t = bs_data.find_all('a:t')
    for a_t in bs_a_t:
        text_list.append(str(a_t.contents[0]))
This will get you the actual text from the XMLs.
Hopefully this will be useful to someone else in the future.
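The same idea can also be sketched with the standard library alone, reading the layout and master XML straight out of the archive instead of extracting it to disk. This swaps BeautifulSoup for xml.etree.ElementTree; the function name layout_and_master_text and the tiny demo archive are made up for illustration, and the demo is not a valid presentation.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; the "a" prefix in <a:t> is bound to this URI.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def layout_and_master_text(pptx_file):
    """Collect the text of every <a:t> element in slide layouts and masters."""
    texts = []
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            # ZIP member names use forward slashes on every platform.
            if not name.endswith(".xml"):
                continue
            if name.startswith(("ppt/slideLayouts/", "ppt/slideMasters/")):
                root = ET.fromstring(archive.read(name))
                for node in root.iter(f"{{{A_NS}}}t"):
                    if node.text:
                        texts.append(node.text)
    return texts


# Tiny stand-in archive (not a real presentation) to show the call.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr(
        "ppt/slideMasters/slideMaster1.xml",
        f'<p:sldMaster xmlns:p="urn:demo" xmlns:a="{A_NS}"><a:t>Company footer</a:t></p:sldMaster>',
    )
print(layout_and_master_text(demo))  # ['Company footer']
```

With a real file, pass the path to the .pptx instead of the in-memory buffer.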

Text extraction libraries don't return text for non-empty page

I wrote a program which extracts text from PDF documents, but one PDF file yields empty text. I can open the PDF file in Acrobat Reader and it works fine. My code works great with other PDF files, so I want to know what is causing this issue. I tried PyPDF2 and pdfplumber, with the same result. So there must be something wrong with the file: https://drive.google.com/file/d/1kNqWmf0zb_Q53WnKKZ817B7h9n5bRJ50/view?usp=sharing
My Code
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
My real code does a lot more than this; it is just a glimpse.

The PDF is made of images, and doesn't contain any text :)
Cheers

You need to distinguish 3 types of PDF files:
1. Digitally-created PDFs (aka "pure" or digitally-born PDFs): created via software like Microsoft Word or LaTeX. Text in these files can be read with PyPDF2 / PyMuPDF / Tika / Pdfium. The mistakes here are mostly around whitespace, ligatures, font encodings, and text linearization.
2. Scanned PDFs: essentially just images. You need Optical Character Recognition (OCR) software like Tesseract to read text from the images. This is prone to mistakes such as confusing similar-looking characters (o / O / 0).
3. OCRed PDFs (layered PDFs): the image is in the foreground, with a text layer in the background. You can select and copy the text, and PyPDF2 / PyMuPDF / Tika / Pdfium can read the text in the background.
Tesseract is open source and is used, for example, by OCRmyPDF.
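A rough way to tell these cases apart in code is to look at what text extraction returns and whether the page carries images. The sketch below is only a heuristic under that assumption; classify_page and classify_pdf are hypothetical names, and classify_pdf relies on pypdf's page.extract_text() and page.images.

```python
def classify_page(text, image_count):
    """Rough guess at the page type from its extracted text and image count."""
    has_text = bool(text and text.strip())
    if not has_text:
        # No text layer: OCR is needed if there is anything to read at all.
        return "scanned" if image_count > 0 else "empty"
    # Text plus images fits both an OCRed scan and a digital page with figures;
    # these two inputs alone cannot tell them apart.
    return "text-with-images" if image_count > 0 else "digital"


def classify_pdf(path):
    """Classify each page of a PDF; needs the third-party pypdf package."""
    from pypdf import PdfReader  # imported here so classify_page works without pypdf

    reader = PdfReader(path)
    return [classify_page(page.extract_text(), len(page.images)) for page in reader.pages]


print(classify_page("", 1))       # scanned
print(classify_page("Hello", 0))  # digital
```

A page reported as "scanned" is the situation in the question: the file opens fine in a viewer, yet extract_text() has nothing to return.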

Issue writing temp images to temp pdf in pyramid with reportlabs

I am using Python 3 with Pyramid and ReportLab to generate dynamic PDFs.
I am having an issue writing images into a PDF. I am using ReportLab in a web app to generate a PDF with images, but my images are not stored locally; they are on a remote server. I download them into a local temp directory (they are saved, I have checked). When I add the images to the PDF, the space is allocated but the image does not show up.
Here is my relevant code (simplified):
# creates pdf in memory
doc = SimpleDocTemplate(pdfName, pagesize=A4)
elements = []
for item in model['items']:
    # image goes here:
    if item['IMAGENAME']:
        response = getImageFromRemoteServer(item['IMAGENAME'])
        dir_filename = directory + item['IMAGENAME']
        if response.status_code == 200:
            with open(dir_filename, 'wb') as f:
                for chunk in response.iter_content():
                    f.write(chunk)
            # append to the list that is later passed to doc.build()
            elements.append(Image(dir_filename, width=2*inch, height=2*inch))
# create and save the pdf
doc.build(elements,canvasmaker=NumberedCanvas)
I have followed the user guide here https://www.reportlab.com/docs/reportlab-userguide.pdf and have tried the above way, plus embedded images (as the user guide says in the paragraph section) and putting the image in a table.
I also looked here: and it did not help me.
My question is really: what is the right way to download an image and put it in a PDF?
EDIT: fixed code indentation
EDIT 2:
Answered: I was finally able to get the images into the PDF. I am not sure what triggered it to work. The only thing I know I changed is that I am now using urllib for the request, which I was not doing before. Here is my working code (simplified for the question; the real code is more abstracted and encapsulated):
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
    # image goes here:
    if question['IMAGEFILE']:
        filename = question['IMAGEFILE']
        dir_filename = directory + filename
        url = get_url(settings, filename)
        response = urllib.request.urlopen(url)
        raw_data = response.read()
        f = open(dir_filename, 'wb')
        f.write(raw_data)
        f.close()
        response.close()
        myImage = Image(dir_filename)
        myImage.drawHeight = 2* inch
        myImage.drawWidth = 2* inch
        myImage.hAlign = "LEFT"
        elements.append(myImage)
# create and save the pdf
doc.build(elements)

Make your code independent from where the files come from. Separate file/resource retrieval from document generation. Ensure that your toolset is working with local files. Encapsulate the code to load files in a loader class or function. The encapsulation is what matters. Noticed this again this week while looking at thumbor loader classes.
If that works, you know reportlab, PIL and your application basically work.
Then make your code work with remote files using URIs like http://path/to/remote/files.
Afterwards you can switch between your file loader and your HTTP loader depending on environment or use case.
Another option would be to make your code work with local files using URIs like file:///path/to/file
This way the only thing that changes when switching from local to remote is the URL. You will probably need a Python library that supports this. The requests library is well suited for downloading things, but it does not handle the file:// scheme out of the box; the standard library's urllib.request.urlopen handles both.
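A minimal sketch of such a loader, assuming the standard library's urllib.request is acceptable for the download: urlopen accepts file:// as well as http:// and https:// URIs, so the calling code never needs to know where the bytes come from. The name load_resource and the sample file are hypothetical.

```python
import tempfile
import urllib.request
from contextlib import closing
from pathlib import Path


def load_resource(uri):
    """Return the raw bytes behind a file://, http:// or https:// URI."""
    with closing(urllib.request.urlopen(uri)) as response:
        return response.read()


# Local demo: write a few bytes to a temporary file and read them back by URI.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"not really an image")
    data = load_resource(sample.as_uri())

print(data)  # b'not really an image'
```

Switching to a remote image then only means passing an http(s) URL to the same function.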

Most probably the lazy parameter was the reason your first code sample did not render the images. Triggering ReportLab's PDF rendering outside of the context managers for the temporary files could have led to this behaviour.
reportlab.platypus.flowables.py (using version 3.1.8)
class Image(Flowable):
    """an image (digital picture). Formats supported by PIL/Java 1.4 (the Python/Java Imaging Library
    are supported. At the present time images as flowables are always centered horozontally
    in the frame. We allow for two kinds of lazyness to allow for many images in a document
    which could lead to file handle starvation.
    lazy=1 don't open image until required.
    lazy=2 open image when required then shut it.
    """
    _fixedWidth = 1
    _fixedHeight = 1
    def __init__(self, filename, width=None, height=None, kind='direct', mask="auto", lazy=1):
        """If size to draw at not specified, get it from the image."""
        self.hAlign = 'CENTER'
        self._mask = mask
        fp = hasattr(filename,'read')
        if fp:
            self._file = filename
            self.filename = repr(filename)
        ...
The last three lines of the code example tell you that you can pass an object that has a read method. This is exactly what a call to urllib.request.urlopen(url) returns. Using that memory buffer, you create an instance of Image. There is no need for write access to the filesystem, and no need to delete these files after PDF rendering. Let's apply this new knowledge to improve code readability: since your use case includes retrieval of remote resources, using memory buffers that support the Python file API could be a much cleaner approach to assembling your PDF files.
from contextlib import closing
import urllib.request
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
    # download image and create Image from file-like object
    if question['IMAGEFILE']:
        filename = question['IMAGEFILE']
        image_url = get_url(settings, filename)
        with closing(urllib.request.urlopen(image_url)) as image_file:
            myImage = Image(image_file, width=2*inch, height=2*inch)
        myImage.hAlign = "LEFT"
        elements.append(myImage)
# create and save the pdf
doc.build(elements)
References
Coding with context managers

Add text to existing PDF document in Python

I'm trying to convert a PDF to a PNG of the same size as my PDF, which is an A4 page.
convert my_pdf.pdf -density 300x300 -page A4 my_png.png
The resulting PNG file, however, is 595px × 842px, which is the A4 resolution at 72 dpi.
I was thinking of using PIL to write some text on some of the PDF fields and convert it back to PDF. But currently the image is coming out wrong.
Edit: I was approaching the problem from the wrong angle. The correct approach didn't include imagemagick at all.
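The pixel sizes involved follow directly from the A4 dimensions (210 mm x 297 mm) and the resolution. A small worked calculation, with a4_pixels as a hypothetical helper name:

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # ISO 216 A4, width x height in millimetres


def a4_pixels(dpi):
    """Pixel size of an A4 page rasterized at the given resolution."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in A4_MM)


print(a4_pixels(72))   # (595, 842)
print(a4_pixels(300))  # (2480, 3508)
```

So a 595 x 842 result means the page was rasterized at 72 dpi rather than 300. A possible explanation, not verified against this exact command, is that ImageMagick applies -density only to inputs read after the option, so it would have to precede my_pdf.pdf.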

After searching around some, I finally found the solution:
It turns out that this was the correct approach after all.
Yet I feel that it wasn't verbose enough.
It appears that the poster probably took it from here (same variable names etc).
The idea: create a new blank PDF with ReportLab that contains only a text string.
Then merge it into the existing page as a watermark using pyPdf.
from pyPdf import PdfFileWriter, PdfFileReader
import StringIO
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
packet = StringIO.StringIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.drawString(100,100, "Hello world")
can.save()
#move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(file("mypdf.pdf", "rb"))
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(0)
page.mergePage(new_pdf.getPage(0))
output.addPage(page)
# finally, write "output" to a real file
outputStream = file("/home/joe/newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
Hope this helps somebody else.

I just tried the solution above, but I had quite some troubles to get it running in Python3. So, I would like to share my modifications. The adapted code looks as follows:
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.drawString(100, 100, "Hello world")
can.save()
# move to the beginning of the BytesIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(open("mypdf.pdf", "rb"))
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(0)
page2 = new_pdf.getPage(0)
page.mergePage(page2)
output.addPage(page)
# finally, write "output" to a real file
outputStream = open("newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
Now page.mergePage throws an error, which turns out to be a porting error in PyPDF2. Please refer to this question for the solution: Porting to Python3: PyPDF2 mergePage() gives TypeError
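For current installations, the same approach can be sketched with pypdf, the successor to PyPDF2, where PdfFileReader, PdfFileWriter, getPage, mergePage and addPage became PdfReader, PdfWriter, pages[...], merge_page and add_page. This is a sketch, not a tested drop-in replacement; stamp_text and its parameters are hypothetical names, and both pypdf and reportlab must be installed to call it.

```python
import io


def stamp_text(src_path, dst_path, text, x=100, y=100):
    """Overlay `text` on the first page of src_path and write it to dst_path."""
    # Third-party imports live here so the module loads without them installed.
    from pypdf import PdfReader, PdfWriter
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    # Draw the text on an otherwise empty one-page PDF held in memory.
    packet = io.BytesIO()
    overlay = canvas.Canvas(packet, pagesize=letter)
    overlay.drawString(x, y, text)
    overlay.save()
    packet.seek(0)

    # Merge the overlay onto the first page of the existing PDF.
    page = PdfReader(src_path).pages[0]
    page.merge_page(PdfReader(packet).pages[0])

    writer = PdfWriter()
    writer.add_page(page)
    with open(dst_path, "wb") as out:
        writer.write(out)
```

As in the snippets above, only the first page ends up in the output; a call would look like stamp_text("mypdf.pdf", "newpdf.pdf", "Hello world").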

You should look at Add text to Existing PDF using Python and also Python as PDF Editing and Processing Framework. These will point you in the right direction.
If you do what you've proposed in the question, when you export back to .pdf, it will really just be an image file embedded in a .pdf, it won't be text.

pdfrw will let you take existing PDFs and place them as form XObjects (similar to images) on a reportlab canvas. There are some examples for this in the pdfrw examples/rl1 subdirectory on github. Disclaimer -- I am the pdfrw author.
