Can I convert PDF blob to image using Python and Wand? - python

I'm trying to convert a PDF's first page to an image. However, the PDF is coming straight from the database in a base64 format. I then convert it to a blob. I want to know if it's possible to convert the first page of the PDF to an image within my Python code.
I'm familiar with being able to use filename in the Image object:
with Image(filename="test.pdf[0]") as img:
The issue I'm facing is there is not an actual filename, just a blob. This is what I have so far, any suggestions would be appreciated.
x = object['file']
fileBlob = base64.b64decode(x)
with Image(**what do I put here for pdf blob?**) as img:
    more code

This works for me:
from wand.image import Image
all_pages = Image(blob=blob_pdf)  # PDF will have several pages.
single_image = all_pages.sequence[0]  # Just work on first page
with Image(single_image) as i:
    ...

The Wand documentation describes reading an image from a blob, so it should be:
with Image(blob=fileBlob) as img:
    # etc.
I didn't test that, but I think this is what you are after.
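The blob route discussed above can be sketched as follows. This is a minimal sketch, not a tested recipe: the helper names decode_pdf_blob and first_page_png are hypothetical, and it assumes the database field holds plain base64 text of a PDF whose data starts with the usual %PDF- header. The Wand part follows the two answers above (Image(blob=...) and sequence[0]) and needs Wand plus ImageMagick with PDF support installed.

```python
import base64


def decode_pdf_blob(b64_text):
    """Decode base64 text to raw bytes and check that it looks like a PDF."""
    blob = base64.b64decode(b64_text)
    # Assumption: the PDF data starts directly with the "%PDF-" header.
    if not blob.startswith(b"%PDF-"):
        raise ValueError("decoded data does not look like a PDF")
    return blob


def first_page_png(pdf_blob):
    """Render the first PDF page to PNG bytes using Wand."""
    # Imported here so that decode_pdf_blob works without Wand installed.
    from wand.image import Image

    with Image(blob=pdf_blob) as pages:
        # sequence[0] is the first page; copy it into its own Image.
        with Image(image=pages.sequence[0]) as first:
            return first.make_blob("png")
```

Checking the header before handing the bytes to Wand makes a bad decode (for example, decoding the wrong value) fail early with a clear message.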

Related

How to extract images, video and audio from a pdf file using python

I need a Python program that can extract videos, audio and images from a PDF. I have tried using libraries such as PyPDF2 and Pillow, but I was unable to get even one of the three to work.
I think you could achieve this using pymupdf.
To extract images see the following: https://pymupdf.readthedocs.io/en/latest/recipes-images.html#how-to-extract-images-pdf-documents
For Sound and Video these are essentially Annotation types.
The following "annots" function would get all the annotations of a specific type for a PDF page:
https://pymupdf.readthedocs.io/en/latest/page.html#Page.annots
Annotation types are as follows:
https://pymupdf.readthedocs.io/en/latest/vars.html#annotationtypes
Once you have acquired an annotation I think you can use the get_file method to extract the content ( see: https://pymupdf.readthedocs.io/en/latest/annot.html#Annot.get_file)
Hope this helps!
@George Davis-Diver can you please let me have an example PDF with video?
Sounds and videos are embedded in their own specific annotation types. Neither is a FileAttachment annotation, so the FileAttachment methods cannot be used.
For a sound annotation, you must use `annot.get_sound()`, which returns a dictionary where one of the keys is the binary sound stream.
Images on the other hand may for sure be embedded as FileAttachment annotations - but this is unusual. Normally they are displayed on the page independently. Find out a page's images like this:
import fitz
from pprint import pprint
doc = fitz.open("your.pdf")
page = doc[0]  # first page - use 0-based page numbers
pprint(page.get_images())
# prints: [(1114, 0, 1200, 1200, 8, 'DeviceRGB', '', 'Im1', 'FlateDecode')]
# extract the image stored under xref 1114:
img = doc.extract_image(1114)
This is a dictionary with image metadata and the binary image stream.
Note that PDF stores transparency data of an image separately, which therefore needs some additional care - but let us postpone this until actually happening.
Extracting video from RichMedia annotations is currently possible in PyMuPDF low-level code only.
@George Davis-Diver - thanks for the example file!
Here is code that extracts video content:
import sys
import pathlib
import fitz
doc = fitz.open("vid.pdf") # open PDF
page = doc[0] # load desired page (0-based)
annot = page.first_annot # access the desired annot (first one in example)
if annot.type[0] != fitz.PDF_ANNOT_RICH_MEDIA:
    print(f"Annotation type is {annot.type[1]}")
    print("Only support RichMedia currently")
    sys.exit()
cont = doc.xref_get_key(annot.xref, "RichMediaContent/Assets/Names")
if cont[0] != "array":  # should be PDF array
    sys.exit("unexpected: RichMediaContent/Assets/Names is no array")
array = cont[1][1:-1]  # remove array delimiters
# jump over the name / title: we will get it later
if array[0] == "(":
    i = array.find(")")
else:
    i = array.find(">")
xref = array[i + 1 :]  # here is the xref of the actual video stream
if not xref.endswith(" 0 R"):
    sys.exit("media contents array has more than one entry")
xref = int(xref[:-4]) # xref of video stream file
video_filename = doc.xref_get_key(xref, "F")[1]
video_xref = doc.xref_get_key(xref, "EF/F")[1]
video_xref = int(video_xref.split()[0])
video_stream = doc.xref_stream_raw(video_xref)
pathlib.Path(video_filename).write_bytes(video_stream)
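The string handling in the middle of that script (skip the name, then read the xref from "N 0 R") can be isolated into a small pure-Python helper. This is only a sketch: the function name parse_single_asset_xref is hypothetical, the sample strings are assumptions about what such a one-entry name array looks like, and names containing escaped parentheses are not handled.

```python
import re

# One entry: "(name)" or "<hexname>", followed by an indirect reference "N 0 R".
_SINGLE_ASSET = re.compile(
    r"^\s*(?:\([^)]*\)|<[0-9A-Fa-f\s]*>)\s*(?P<xref>\d+)\s+0\s+R\s*$"
)


def parse_single_asset_xref(array_text):
    """Return the xref number from a one-entry PDF name array."""
    inner = array_text.strip()
    # Remove the array delimiters if they are still present.
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    match = _SINGLE_ASSET.match(inner)
    if match is None:
        raise ValueError("expected exactly one '(name) N 0 R' entry")
    return int(match.group("xref"))
```

Raising an error for anything other than a single entry mirrors the script's "more than one entry" exit, but lets the caller decide how to react.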

Extracting Powerpoint background images using python-pptx

I have several powerpoints that I need to shuffle through programmatically and extract images from. The images then need to be converted into OpenCV format for later processing/analysis. I have done this successfully for images in the pptx, using:
for slide in presentation.slides:
    for shape in slide.shapes:
        if 'Picture' in shape.name:
            pic_list.append(shape)
for extraction, and:
img = cv2.imdecode(np.frombuffer(page[i].image.blob, np.uint8), cv2.IMREAD_COLOR)
for python-pptx Picture to OpenCV conversion. However, I am having a lot of trouble extracting and manipulating the backgrounds in a similar fashion.
slide.background
is sufficient to extract a "_Background" object, but I have not found a good way to convert it into an OpenCV object similar to Pictures. Does anyone know how to do this? I am using python-pptx for extraction, but am not averse to other packages if it's not possible with that package.
After a fair bit of work I discovered how to do this -- i.e., you don't. As far as I can tell, there is no way to directly extract the backgrounds with either python-pptx or Aspose. A PowerPoint .pptx file -- which, as it turns out, is an archive that can be unzipped with 7zip -- keeps its backgrounds disassembled in ppt/media (pictures) and in ppt/slideLayouts and ppt/slideMasters (text, formatting), and the pieces are only put together by the PowerPoint renderer. This means that to extract the backgrounds as displayed, you basically need to run PowerPoint and take pictures of the slides after removing text/pictures/etc. from the foreground.
I did not need to do this, as I just needed to extract text from the backgrounds. This can be done by checking slideLayouts and slideMasters XMLs using BeautifulSoup, at the <a:t> tag. The code to do this is pretty simple:
import zipfile
with zipfile.ZipFile(pptx_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_directory)
This will extract the .pptx into its component files.
import os
from glob import glob
layouts = glob(os.path.join(extraction_directory, 'ppt', 'slideLayouts', '*.xml'))
masters = glob(os.path.join(extraction_directory, 'ppt', 'slideMasters', '*.xml'))
files = layouts + masters
This gets you the paths for slide layouts/masters.
from bs4 import BeautifulSoup
text_list = []
for file in files:
    with open(file, encoding="utf-8") as f:
        data = f.read()
    bs_data = BeautifulSoup(data, "xml")
    bs_a_t = bs_data.find_all('a:t')
    for a_t in bs_a_t:
        text_list.append(str(a_t.contents[0]))
This will get you the actual text from the XMLs.
Hopefully this will be useful to someone else in the future.
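For comparison, the same idea can be sketched with only the standard library, reading the layout and master XML straight out of the archive instead of extracting it to disk first. This swaps in xml.etree.ElementTree for BeautifulSoup and is an untested-on-real-decks sketch; the function name background_text is hypothetical, while the DrawingML namespace URI is the standard one behind the a: prefix.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace that the "a:" prefix maps to in PresentationML files.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def background_text(pptx_file):
    """Collect the text of every <a:t> element in slide layouts and masters."""
    texts = []
    with zipfile.ZipFile(pptx_file) as zf:
        for name in zf.namelist():
            in_layouts = name.startswith("ppt/slideLayouts/")
            in_masters = name.startswith("ppt/slideMasters/")
            # The .rels files in these folders do not end in ".xml".
            if (in_layouts or in_masters) and name.endswith(".xml"):
                root = ET.fromstring(zf.read(name))
                for node in root.iter(f"{{{A_NS}}}t"):
                    if node.text:
                        texts.append(node.text)
    return texts
```

Because zipfile.ZipFile accepts a path or a file-like object, background_text can be called with a .pptx path or with an io.BytesIO holding the file contents.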

I have a folder full of PDFs and want to write code that lists all PDFs that contain the color blue

Like the title says, a bunch of pdfs that need to be gone through and a list made showing the pdfs that have the color blue in them.
I tried using a snippet of code from a similar post to get a list of colors from one document. My thinking was that I could then loop over all documents, export the output to Excel and filter for a specific color. However, I can't even get it to work for a single PDF:
#!/usr/bin/env python
# -*- Encoding: UTF-8 -*-
import minecart
colors = set()
with open("F://Prints/0-25162.PDF", "rb") as file:
    document = minecart.Document(file)
    page = document.get_page(1)
    for shape in page.shapes:
        if shape.outline:
            colors.add(shape.outline.color.as_rgb())
for color in colors:
    print(color)
Any help or direction would be appreciated.
I would try to render each PDF page into PNG or a similar bitmap format, then load it as a Python pixel array (using Pillow or similar), and look for blue pixels. For the rasterizing step, pdf2image might do the job (Pillow itself can write PDFs but cannot read them). Alternatively, you can rasterize with ImageMagick prior to the Python processing.
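The "look for blue pixels" step could look roughly like this once a page has been rasterized. This is a sketch under assumptions: the function names and the thresholds (min_blue, margin) are arbitrary choices of mine, and the pixels are expected as an iterable of (R, G, B) tuples, however they were obtained from the rendered page.

```python
def is_blue(rgb, min_blue=150, margin=50):
    """Return True when the blue channel clearly dominates red and green."""
    r, g, b = rgb[:3]  # ignore an alpha channel if present
    return b >= min_blue and b - max(r, g) >= margin


def has_blue(pixels, min_count=1):
    """Return True as soon as at least min_count blue pixels were seen."""
    count = 0
    for px in pixels:
        if is_blue(px):
            count += 1
            if count >= min_count:
                return True
    return False
```

Raising min_count above 1 is one way to ignore a few stray bluish pixels caused by anti-aliasing or compression; white and gray pixels never match because their channels are nearly equal.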

Wrong image src url when using Python scrapy

I am trying to scrape images from an e-commerce website using Scrapy, but for some of the items (5-10 out of 180) the image src output looks like this:
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A .
For the rest of the items I get the correct image URL.
Can someone help me with this?
My code for image src extraction is:
image = response.css('.productimage img::attr(src)').extract()
Because of this, I get an error while downloading the image to my local system.
This
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A
is actually an image: its bytes are encoded as a string using base64. You can use the built-in base64 module to save it as a file. Consider the following simple example:
import base64
txt = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A"
content = base64.b64decode(txt.split(',')[-1])
with open('image.png', 'wb') as f:
    f.write(content)
It will create an image.png file in the current working directory.
This base64 data:
iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A
is an empty PNG image (it is not a relevant image).
This kind of base64 data usually occurs on e-commerce websites for products that don't have images.
I recommend treating products whose src is base64 data like this as products without an image.
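One way to act on that recommendation is to separate real URLs from inline data: URIs before downloading anything. A minimal sketch, with the hypothetical helper name split_image_sources; it only inspects the strings and does not depend on Scrapy.

```python
def split_image_sources(srcs):
    """Split src values into downloadable URLs and inline data: URIs."""
    urls, inline = [], []
    for src in srcs:
        src = src.strip()
        # Inline images start with the "data:" scheme instead of http(s).
        if src.lower().startswith("data:"):
            inline.append(src)
        else:
            urls.append(src)
    return urls, inline
```

In the spider, the list returned by the .extract() call could be passed through this function, so that only the first list is handed to the image download step and items with nothing but inline placeholders are treated as having no image.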

Using Pillow and img2pdf to convert images to pdf

I have a task that requires me to get data from an image upload (jpg or png), resize it based on the requirement, and then transform it into pdf and then store in s3.
The file comes in as a BytesIO object.
I have Pillow available, so I can resize the image with it.
After resizing, the object is of class 'PIL.Image.Image' and I don't know how to proceed.
I found img2pdf library on https://gitlab.mister-muffin.de/josch/img2pdf, but I don't know how to use it when I have PIL format (use tobytes()?)
The s3 upload also looks like a file-like object, so I don't want to save it into a temp file before loading it again. Do I even need img2pdf in this case then?
How do I achieve this goal?
EDIT: I tried using tobytes() and uploading to s3 directly. The upload was successful. However, when I download the file to see the content, it shows an empty page. It seems like the file data is not written into the pdf file.
EDIT 2: I actually went on s3 and checked the stored file. When I download it and open it, it says the file cannot be opened.
EDIT 3: I don't really have working code as I'm still experimenting with what could work, but here's a gist:
data = request.FILES['file'].file # where the data is
im = Image.open(data)
(width, height) = (im.width // 2, im.height // 2)  # example action I wanna take with Pillow
im_resized = im.resize((width, height))
data = im_resized.tobytes()
# potential step for using img2pdf here but I don't know how
# img2pdf.convert(data) # this fails because "ImageOpenError: cannot read input image (not jpeg2000). PIL: error reading image: cannot identify image file <_io.BytesIO..."
# img2pdf.convert(im_resized) # this also fails because "TypeError: Neither implements read() nor is str or bytes"
upload_to_s3(data) # some function that utilizes boto3 to upload to s3
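The problem is that you are passing a PIL Image.Image object (or its raw pixel bytes) instead of an encoded image such as JPEG or PNG.
Try this:
import io
import img2pdf
bytes_io = io.BytesIO()
image.save(bytes_io, 'PNG')  # image is the resized PIL image
# img2pdf.convert() returns the PDF as bytes
with open(output_pdf, "wb") as f:
    f.write(img2pdf.convert(bytes_io.getvalue()))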
