How to extract page containing image in a pdf with python? - python

I have 4000 scanned documents as pdf's. Each pdf contains a kyc form that i want to extract .Each pdf has 40 pages.What techniques can we use to get the page number of image ,since i can extract the page using pdf2image provided i have page number.
The kyc form will be similar and there will be images as posted. I have blurred the image but it will be of better quality

This is a simplistic approach that scans all bookmarks to find the matching object and then scans each page until it matches the same object. Possibly not the most elegant approach, but should get the job done.
from PyPDF2 import PdfFileReader
reader = PdfFileReader('D:\\Downloads\Sample.pdf')
# Scan outlines for bookmark containing KYC
outlines = reader.outlines
print(outlines)
for bookmark in outlines:
print(bookmark['/Title'])
print(bookmark['/Page'])
if bookmark['/Title'] == 'KYC':
mypage = bookmark['/Page']
# Scan page looking for the matching object
print(reader.getNumPages())
for x in range(0, reader.getNumPages()):
apage = reader.getPage(x)
print(apage)
if apage == mypage:
print('Eureka on page', x + 1)

Related

pypdf gives output with incorrect PDF format

I am using the following code to resize pages in a PDF:
from pypdf import PdfReader, PdfWriter, Transformation, PageObject, PaperSize
from pypdf.generic import RectangleObject
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
A4_w = PaperSize.A4.width
A4_h = PaperSize.A4.height
# resize page to fit *inside* A4
h = float(page.mediabox.height)
w = float(page.mediabox.width)
scale_factor = min(A4_h/h, A4_w/w)
transform = Transformation().scale(scale_factor,scale_factor).translate(0, A4_h/2 - h*scale_factor/2)
page.add_transformation(transform)
page.cropbox = RectangleObject((0, 0, A4_w, A4_h))
# merge the pages to fit inside A4
# prepare A4 blank page
page_A4 = PageObject.create_blank_page(width = A4_w, height = A4_h)
page.mediabox = page_A4.mediabox
page_A4.merge_page(page)
writer.add_page(page_A4)
writer.write('output.pdf')
Source: https://stackoverflow.com/a/75274841/11501160
While this code works fine for the resizing part, I have found that most input files work fine but some input files do not work fine.
I am providing download links to input.pdf and output.pdf files for testing and review. The output file is completely different from the input file. The images are missing, the background colour is different, even the pure text on first page has only the first line visible.
What is interesting is that these difference are only seen when I open the output pdf in Adobe Acrobat, or look at the physically printed pages.
The PDF looks perfect when i open in Preview (on MacOS) or open the PDF in my Chrome Browser.
and
The origin of the input pdf is that I created it in Preview (on MacOS) by mixing pages from different PDFs and dragging image files into the thumbnails as per these instructions:
https://support.apple.com/en-ca/HT202945
I've never had a problem before while making PDFs like this and even Adobe Acrobat reads the input pdf properly. Only the output pdf is problematic in Acrobat and in printers.
Is this a bug with pypdf or am I doing something wrong ?
How can i get the output PDF to be proper in Adobe Acrobat and printers etc ?
This is a valid bug with pypdf and the fix is due to be released in the next version.
Refer:
https://github.com/py-pdf/pypdf/issues/1607
The following is what PyMuPDF has to offer here. The output displays correctly in all PDF readers:
import fitz # import PyMuPDF
src = fitz.open("input.pdf")
doc = fitz.open()
for i in range(len(src)):
page = doc.new_page() # this is A4 portrait by default
page.show_pdf_page(page.rect, src, i) # scaling will happen automatically
doc.save("fitz-output.pdf",garbage=3,deflate=True)
The above method show_pdf_page() supports many more options, like selecting sub-rectangles form the source page, rotating it by arbitrary angles, and of course freely select the target page's sub-rectangle to receive the content.

How to extract images, video and audio from a pdf file using python

I need a python program that can extract videos audio and images from a pdf. I have tried using libraries such as PyPDF2 and Pillow, but I was unable to get all three to work let alone one.
I think you could achieve this using pymupdf.
To extract images see the following: https://pymupdf.readthedocs.io/en/latest/recipes-images.html#how-to-extract-images-pdf-documents
For Sound and Video these are essentially Annotation types.
The following "annots" function would get all the annotations of a specific type for a PDF page:
https://pymupdf.readthedocs.io/en/latest/page.html#Page.annots
Annotation types are as follows:
https://pymupdf.readthedocs.io/en/latest/vars.html#annotationtypes
Once you have acquired an annotation I think you can use the get_file method to extract the content ( see: https://pymupdf.readthedocs.io/en/latest/annot.html#Annot.get_file)
Hope this helps!
#George Davis-Diver can you please let me have an example PDF with video?
Sounds and videos are embedded in their specific annotation types. Both are no FileAttachment annotation, so the respective mathods cannot be used.
For a sound annotation, you must use `annot.get_sound()`` which returns a dictionary where one of the keys is the binary sound stream.
Images on the other hand may for sure be embedded as FileAttachment annotations - but this is unusual. Normally they are displayed on the page independently. Find out a page's images like this:
import fitz
from pprint import pprint
doc=fitz.open("your.pdf")
page=doc[0] # first page - use 0-based page numbers
pprint(page.get_images())
[(1114, 0, 1200, 1200, 8, 'DeviceRGB', '', 'Im1', 'FlateDecode')]
# extract the image stored under xref 1114:
img = doc.extract_image(1114)
This is a dictionary with image metadata and the binary image stream.
Note that PDF stores transparency data of an image separately, which therefore needs some additional care - but let us postpone this until actually happening.
Extracting video from RichMedia annotations is currently possible in PyMuPDF low-level code only.
#George Davis-Diver - thanks for example file!
Here is code that extracts video content:
import sys
import pathlib
import fitz
doc = fitz.open("vid.pdf") # open PDF
page = doc[0] # load desired page (0-based)
annot = page.first_annot # access the desired annot (first one in example)
if annot.type[0] != fitz.PDF_ANNOT_RICH_MEDIA:
print(f"Annotation type is {annot.type[1]}")
print("Only support RichMedia currently")
sys.exit()
cont = doc.xref_get_key(annot.xref, "RichMediaContent/Assets/Names")
if cont[0] != "array": # should be PDF array
sys.exit("unexpected: RichMediaContent/Assets/Names is no array")
array = cont[1][1:-1] # remove array delimiters
# jump over the name / title: we will get it later
if array[0] == "(":
i = array.find(")")
else:
i = array.find(">")
xref = array[i + 1 :] # here is the xref of the actual video stream
if not xref.endswith(" 0 R"):
sys.exit("media contents array has more than one entry")
xref = int(xref[:-4]) # xref of video stream file
video_filename = doc.xref_get_key(xref, "F")[1]
video_xref = doc.xref_get_key(xref, "EF/F")[1]
video_xref = int(video_xref.split()[0])
video_stream = doc.xref_stream_raw(video_xref)
pathlib.Path(video_filename).write_bytes(video_stream)

Cropping the Mediabox does not work for some pdfs

I wrote a little script which shall blank out the lower half of a PDF document. The document itself shall remain the same size, but the lower half shall be just white.
(This is to remove the "instructions" part from parcel labels of German parcel comanies like DHL and Hermes.)
To do this, I take the PDF page, adjust the Mediabox, and then merge this page onto a new, blank page.
Fortunately, this works as intended with the PDFs I need it for. However, I also tried a few other PDFs and for some, it just does not work. It copies over the complete PDF. This happens for example, when my code is given this file: https://www.veeam.com/veeam_backup_product_overview_ds.pdf
Here is the code:
import pypdf # PyPDF2, 3 and 4 are deprecated. PyPDF is currently in active development
reader = pypdf.PdfReader(source_filename)
writer = pypdf.PdfWriter()
# get first page
page = reader.pages[0]
# create new page
new_page = pypdf.PageObject.create_blank_page( None, width = page.mediabox.width, height = page.mediabox.height )
# crop original
page.mediabox.bottom = ( page.mediabox.top - page.mediabox.bottom ) / 2 + page.mediabox.bottom
# merge original into empty new page
new_page.merge_page( page )
writer.add_page(new_page)
with open(output_file, "wb") as fp:
writer.write(fp)
Can anyone explain why it does not work sometimes?

extract metadata of a pdf file (dimensions or orientation)

Given a pdf file, is there any way to find its page dimensions and orientations (horizontal or vertical) etc? The pypdf2 library gives a function to check for number of pages but how can I extract other info? Is it possible to use this link to find information about the file. Date of creation, number of pages, title etc? Or anything else that is possible.
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("document1.pdf", "rb"))
# print how many pages input1 has:
print "document1.pdf has %d pages." % input1.getNumPages()
https://pythonhosted.org/PyPDF2/
You can use the /Rotate in order to get a page's rotation.
pdf = PyPDF2.PdfFileReader(open('document1.pdf', 'rb'))
orientation = pdf.getPage(pagenumber).get('/Rotate')
It will yield a value in degrees. Though it may be useful for some documents, you should note, that the page rotation by itself does not denote the orientation. As was contributed by #mkl in the comments.
As to other metadata, there are many things you can pull out. You can look into PyPDF2.pdf.DocumentInformation methods for all of them.

Can I convert PDF blob to image using Python and Wand?

I'm trying to convert a PDF's first page to an image. However, the PDF is coming straight from the database in a base64 format. I then convert it to a blob. I want to know if it's possible to convert the first page of the PDF to an image within my Python code.
I'm familiar with being able to use filename in the Image object:
Image(filename="test.pdf[0]") as img:
The issue I'm facing is there is not an actual filename, just a blob. This is what I have so far, any suggestions would be appreciated.
x = object['file']
fileBlob = base64.b64decode('x')
with Image(**what do I put here for pdf blob?**) as img:
more code
It works for me
all_pages = Image(blob=blob_pdf) # PDF will have several pages.
single_image = all_pages.sequence[0] # Just work on first page
with Image(single_image) as i:
...
Documentation says something about blobs.
So it should be:
with Image(blob=fileBlob):
#etc etc
I didn't test that but I think this is what you are after.

Categories

Resources