I have the following code to convert a PDF to JPG. I had to resize the images because of their large size, and I lose the original PDF format (A4, A3, etc.):
with Img(filename=pdfName, resolution=self.resolution) as document:
    reader = PyPDF2.PdfFileReader(pdfName.replace('[0]', ''))
    for page_number, page in enumerate(document.sequence):
        pdfSize = reader.getPage(page_number).mediaBox
        width = pdfSize[2]
        height = pdfSize[3]
        with Img(page, resolution=self.resolution) as img:
            # Do not resize the first page, which is used to find useful information
            if not get_first_page:
                img.resize(int(width), int(height))
            img.compression_quality = self.compressionQuality
            img.background_color = Color("white")
            img.alpha_channel = 'remove'
            if get_first_page:
                filename = output
            else:
                filename = tmpPath + '/' + 'tmp-' + str(page_number) + '.jpg'
            img.save(filename=filename)
So, for each page, I read the PDF page size and resize the output produced by wand. My problem is that the quality of the JPG is really poor.
My resolution is 300 (I tried higher values, without success) and compressionQuality is 100.
Any ideas?
Thanks
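One possible explanation, offered as an assumption rather than a confirmed diagnosis: the mediaBox values are in PDF points (1/72 inch by default), so resizing a 300 DPI render to those numbers shrinks it back to roughly 72 DPI. The helper below is a hypothetical sketch (points_to_pixels is not part of wand or PyPDF2) showing the pixel size a page would need to keep its format at a given resolution:

```python
# PDF page sizes are expressed in points; one point is 1/72 inch by default.
POINTS_PER_INCH = 72


def points_to_pixels(width_pt, height_pt, dpi):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return round(float(width_pt) * scale), round(float(height_pt) * scale)


# An A4 page is roughly 595 x 842 points.
a4_at_72 = points_to_pixels(595, 842, 72)    # identical to the mediaBox numbers
a4_at_300 = points_to_pixels(595, 842, 300)  # pixel size that preserves 300 DPI
print(a4_at_72, a4_at_300)
```

If this is the cause, passing the scaled values to img.resize() (or skipping the resize entirely) should keep the A4/A3 proportions without throwing away detail.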
I've been trying to merge some pictures into a video using OpenCV. OpenCV does create the .avi file, and it opens in VLC Media Player, but when I try to watch it, nothing is shown and the duration is not displayed. The file size is only around 6 kB, even though there are 18 images with an average size of 252 kB, which seems inconsistent.
Here's a look at my code:
import cv2
import os
frameSize = (500, 500)
with os.scandir('Taylor') as Taylor_Images:
    for image in Taylor_Images:
        print(image.name)
with os.scandir('Taylor_5') as Taylor5_Images:
    for image in Taylor5_Images:
        print(image.name)
image_Taylor = []
image_Taylor_5 = []
ext_2 = '.png'
adding = '_5'
path_2 = 'Taylor_5/'
ext = '.png'
path = 'Taylor/'
for i in range(1,10):
    img_name = str(i) + ext
    img = path + str(i) + ext
    image_Taylor.append(img)
    img_name_2 = str(i) + ext_2
    img_2 = path_2 + str(i) + adding + ext_2
    image_Taylor_5.append(img_2)
out = cv2.VideoWriter('output_video2.avi',cv2.VideoWriter_fourcc(*'DIVX'), 60, frameSize)
for i in range(0,9):
    img = cv2.imread(image_Taylor[i])
    out.write(img)
    img_2 = cv2.imread(image_Taylor_5[i])
    out.write(img_2)
out.release()
Update 1
As Vatsal suggested, I checked the sizes of all the images and found that some of them did not have the same dimensions (width, height), so I resized them all to the same size (400, 800).
Regarding what Tim said, the images do have that file size. I attach a picture of what I did following your suggestion of making their size a multiple of 4.
Checking the size of the images
This is the kind of pictures I'm using
However, the video generated from the script still has a size of 6kB.
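A frequent cause of a tiny, unplayable file is a mismatch between the frameSize given to VideoWriter and the real dimensions of the frames: as far as I know, frames of a different size are dropped without an error. Here the images were resized to (400, 800) while frameSize is still (500, 500). The sketch below is only an illustration under that assumption; interleaved_paths and write_video are hypothetical helper names, and every frame is forced to the writer's size before being written:

```python
def interleaved_paths(count):
    """Return paths in the order used above: Taylor/1.png, Taylor_5/1_5.png, ..."""
    paths = []
    for i in range(1, count + 1):
        paths.append('Taylor/%d.png' % i)
        paths.append('Taylor_5/%d_5.png' % i)
    return paths


def write_video(paths, output, frame_size, fps=60):
    """Write the images at `paths` to `output`, forcing each frame to `frame_size`."""
    import cv2  # imported here so the path helper works without OpenCV installed

    out = cv2.VideoWriter(output, cv2.VideoWriter_fourcc(*'DIVX'), fps, frame_size)
    if not out.isOpened():
        raise RuntimeError('could not open %s for writing' % output)
    try:
        for path in paths:
            img = cv2.imread(path)
            if img is None:  # imread returns None instead of raising
                raise FileNotFoundError('could not read %s' % path)
            # frame_size is (width, height), the same order cv2.resize expects.
            out.write(cv2.resize(img, frame_size))
    finally:
        out.release()


# Example call (needs OpenCV and the image folders from the question):
# write_video(interleaved_paths(9), 'output_video2.avi', (400, 800))
```

Checking for a None result from imread also rules out the other common cause, a wrong path that silently yields no frame.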
I have the following function to convert PDF to images :
from wand.color import Color
from wand.image import Image as Img
with Img(filename=pdfName, resolution=self.resolution) as pic:
    pic.compression_quality = self.compressionQuality
    pic.background_color = Color("white")
    pic.alpha_channel = 'remove'
    pic.save(filename='full_' + random + '-%03d.jpg')
If the PDF has multiple pages, the JPG files are named like this:
file-000.jpg, file-001.jpg
My question is: is it possible to have the counter start at 1?
file-001.jpg, file-002.jpg
Thanks in advance
Update
Another option is to set Image.scene:
from wand.image import Image as Img
from wand.api import library
with Img(filename=pdfName, resolution=self.resolution) as pic:
    # Move the internal iterator to first frame.
    library.MagickResetIterator(pic.wand)
    pic.scene = 1
    pic.compression_quality = self.compressionQuality
    pic.background_color = Color("white")
    pic.alpha_channel = 'remove'
    pic.save(filename='full_' + random + '-%03d.jpg')
Original
The only thing I can think of is putting an empty image on the stack before reading.
with Img(filename='NULL:') as pic:
    pic.read(filename=pdfName, resolution=self.resolution)
    pic.compression_quality = self.compressionQuality
    pic.background_color = Color("white")
    pic.alpha_channel = 'remove'
    pic.save(filename='full_' + random + '-%03d.jpg')
This would create a file-000.jpg, but it would be empty, and you can quickly remove it with os.remove().
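A third route, not taken from the answer above and shown only as a sketch, is to leave wand's numbering alone and shift the file names afterwards with the standard library. shift_numbering is a hypothetical helper; it assumes the files follow the '<prefix>-NNN.jpg' pattern produced by the '-%03d.jpg' template:

```python
import os
import re


def shift_numbering(directory, prefix, offset=1):
    """Rename '<prefix>-NNN.jpg' files in `directory` so NNN grows by `offset`."""
    pattern = re.compile(r'^%s-(\d{3})\.jpg$' % re.escape(prefix))
    numbered = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    # Rename the highest number first so no file overwrites its neighbour.
    for number, name in sorted(numbered, reverse=True):
        new_name = '%s-%03d.jpg' % (prefix, number + offset)
        os.rename(os.path.join(directory, name), os.path.join(directory, new_name))
    return sorted('%s-%03d.jpg' % (prefix, n + offset) for n, _ in numbered)
```

Calling shift_numbering(output_dir, 'full_' + random) after pic.save() would turn -000, -001 into -001, -002 (output_dir being wherever the files were written).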
Yes I hate myself for asking a pretty simple question.
I was hoping to get some advice for the best python library to extract images (of varying type) from a PDF.
I'm trying to take a PDF drawing, save an image from it along with its position on the page, then place the saved image at the same position on a set of other PDFs.
I have tried a few so far but got stuck on various errors, and the research I've done indicates there is no clear and obvious choice.
I have tried PyPDF2 but got an error around PNG filter 3 being unsupported.
I have tried PDFMiner, but it's limited to JPEGs. That isn't a deal breaker, but I still can't get it to extract a JPEG.
I have also tried the fitz module from PyMuPDF and got 1 of the 3 images in my PDF, but it had inverted colours and was mirrored and upside down. Though I'm sure there is post-processing for this.
The code I have used, to be honest, is examples that people far smarter than me have come up with and I have modified them as necessary.
Fitz below
doc = fitz.open(pdf)
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5: # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else: # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
PyPDF2 below
if __name__ == '__main__':
    input1 = PyPDF2.PdfFileReader(pdf)
    page0 = input1.getPage(0)
    if '/XObject' in page0['/Resources']:
        xObject = page0['/Resources']['/XObject'].getObject()
        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                data = xObject[obj].getData()
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                else:
                    mode = "P"
                if '/Filter' in xObject[obj]:
                    if xObject[obj]['/Filter'] == '/FlateDecode':
                        img = Image.frombytes(mode, size, data)
                        img.save(obj[1:] + ".png")
                    elif xObject[obj]['/Filter'] == '/DCTDecode':
                        img = open(obj[1:] + ".jpg", "wb")
                        img.write(data)
                        img.close()
                    elif xObject[obj]['/Filter'] == '/JPXDecode':
                        img = open(obj[1:] + ".jp2", "wb")
                        img.write(data)
                        img.close()
                    elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                        img = open(obj[1:] + ".tiff", "wb")
                        img.write(data)
                        img.close()
                else:
                    img = Image.frombytes(mode, size, data)
                    img.save(obj[1:] + ".png")
If you're reading this and you wrote either of the above, thanks for getting me this far haha.
I'm looking more for advice on which library is best to proceed with than for someone to hold my hand with the code.
I'd appreciate any wisdom you can impart.
Pete
PyPDF2 can (now) do this. Straight from the docs:
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
count = 0
for image_file_object in page.images:
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1
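The docs snippet only covers the first page. As a rough sketch, assuming a PyPDF2 release recent enough to provide page.images, the same idea can be looped over every page. output_name and extract_all_images are hypothetical helpers, and the 'p<page>-<index>-<name>' naming is just a convention picked for this example. Note that this does not give the position of each image on the page, which the question also asked about.

```python
import os


def output_name(page_index, image_index, original_name):
    """Build a unique file name such as 'p0-1-Im1.png' (naming chosen for this sketch)."""
    return 'p%d-%d-%s' % (page_index, image_index, os.path.basename(original_name))


def extract_all_images(pdf_path, output_dir='.'):
    """Save every image exposed through page.images and return the written paths."""
    from PyPDF2 import PdfReader  # imported lazily; needs a release with page.images

    written = []
    reader = PdfReader(pdf_path)
    for page_index, page in enumerate(reader.pages):
        for image_index, image in enumerate(page.images):
            target = os.path.join(
                output_dir, output_name(page_index, image_index, image.name))
            with open(target, 'wb') as fp:
                fp.write(image.data)
            written.append(target)
    return written
```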
I have a small Python program that edits an RTF template.
I need to embed an image at a specific position in the RTF file.
I've found this snippet of code for PNG images (I think it was originally in C#):
mpic = "{\pict\pngblip\picw" + img_Width + "\pich" + img_Height + "\picwgoal" + width + #"\pichgoal" + height + "\bin " + str + "}"
I don't know which library would let me convert the image to the right format.
Can someone give me some tips?
Thanks a lot,
Willy
It's a little intricate but works :D
...
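import binascii
from PIL import Image
filename = 'temp.png'
# Read the pixel dimensions for the \picw / \pich control words
im = Image.open(filename)
width, height = im.size
im.close()
with open(filename, 'rb') as f:
    content = f.read()
# hexlify returns bytes; decode so it can be joined with str on Python 3
hex_content = binascii.hexlify(content).decode('ascii')
# Raw strings keep the RTF backslashes from being read as escape sequences
file_content = file_content.replace("<#LEAKS_IMAGE>", r'{\pict\pngblip\picw'+str(width)+r'\pich'+str(height)+r'\picwgoal10000\pichgoal8341 '+hex_content+'}')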
...
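For what it's worth, the same \pict group can be built with the standard library alone, since a PNG stores its width and height in the IHDR chunk right after the 8-byte signature. This is a sketch, not the method used above: png_size and rtf_picture are hypothetical names, and the goal size assumes 96 pixels per inch (15 twips per pixel), which you may want to change.

```python
import binascii
import struct

TWIPS_PER_PIXEL = 15  # 1440 twips per inch / 96 pixels per inch (assumed DPI)


def png_size(png_bytes):
    """Read (width, height) from the IHDR chunk that follows the PNG signature."""
    if png_bytes[:8] != b'\x89PNG\r\n\x1a\n' or png_bytes[12:16] != b'IHDR':
        raise ValueError('not a PNG file')
    return struct.unpack('>II', png_bytes[16:24])


def rtf_picture(png_bytes):
    """Return an RTF {\\pict ...} group embedding the PNG as hexadecimal text."""
    width, height = png_size(png_bytes)
    header = r'{\pict\pngblip\picw%d\pich%d\picwgoal%d\pichgoal%d ' % (
        width, height, width * TWIPS_PER_PIXEL, height * TWIPS_PER_PIXEL)
    return header + binascii.hexlify(png_bytes).decode('ascii') + '}'
```

Usage would look like file_content.replace("<#LEAKS_IMAGE>", rtf_picture(open('temp.png', 'rb').read())).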
I have a Django app that allows users to upload images that are saved in Azure blob storage. When a user uploads an image, I want to both store the original image and create a thumbnail (100x100).
Currently, the site uploads the original from request.FILES perfectly, but for creating the thumbnail I can't figure out how to upload to Azure without first saving the edited image to a holding folder, reading that saved image, and then deleting it. That doesn't seem like the most efficient way to save a thumbnail.
Here is the current upload script:
# original upload
f = request.FILES['pic']
extension = f.name.split('.')[-1]
new_name, thumb_name = rename_file(extension)
upload(blob_service, 'pic-container', new_name, f)
# create thumbnail
f.seek(0)
im = Image.open(f)
width = im.size[0]
height = im.size[1]
if width > height:
    left_crop = (width - height)/2
    right_crop = (width + height)/2
    cropped = im.crop((left_crop, 0, right_crop, height))
else:
    upper_crop = (height - width)/2
    lower_crop = (height + width)/2
    cropped = im.crop((0, upper_crop, width, lower_crop))
std_size = 100, 100
cropped.thumbnail(std_size, Image.ANTIALIAS)
# save to holder, upload thumbnail, then delete from holder
cropped.save('path/to/holder/' + thumb_name)
upload(blob_service, 'media-pic', thumb_name, open('path/to/holder/' + thumb_name))
os.unlink('path/to/holder/' + thumb_name)
# upload script slightly modified from http://www.windowsazure.com/en-us/documentation/articles/storage-python-how-to-use-blob-storage/#_large-blobs
def upload(blob_service, container_name, blob_name, file_path):
    blob_service.put_blob(container_name, blob_name, '', 'BlockBlob')
    block_ids = []
    index = 0
    with file_path as f:
        while True:
            data = f.read(chunk_size)
            if data:
                length = len(data)
                block_id = base64.b64encode(str(index))
                blob_service.put_block(container_name, blob_name, data, block_id)
                block_ids.append(block_id)
                index += 1
            else:
                break
        blob_service.put_block_list(container_name, blob_name, block_ids)
If I don't save the thumbnail first and just run upload(blob_service, 'pic-container', thumb_name, cropped), I get a pretty uninformative error that just says __exit__ and points to the line with file_path as f: in def upload.
Is there any way to upload the file directly from cropped without having to save it first, either by revising the upload function or by changing cropped somehow?
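The __exit__ error most likely comes from the with statement: upload() expects a file-like object that supports with and read(), and the cropped image object apparently provides neither in the way upload() needs. One way around the holding folder, sketched here under that assumption, is to save the image into an in-memory io.BytesIO buffer and pass the buffer instead. image_to_buffer and iter_chunks are hypothetical helpers, and 'JPEG' is only a default; use whatever format matches the thumbnail's extension.

```python
import io


def image_to_buffer(image, image_format='JPEG'):
    """Serialise a PIL-style image into an in-memory, file-like buffer."""
    buffer = io.BytesIO()
    image.save(buffer, format=image_format)  # Image.save accepts a file object
    buffer.seek(0)  # rewind so the consumer reads from the start
    return buffer


def iter_chunks(fileobj, chunk_size=4 * 1024 * 1024):
    """Yield successive chunks from `fileobj`, mirroring the read loop in upload()."""
    while True:
        data = fileobj.read(chunk_size)
        if not data:
            break
        yield data
```

Because BytesIO supports both with and read(), the existing function should work unchanged with upload(blob_service, 'media-pic', thumb_name, image_to_buffer(cropped)), with no file written to disk.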