Python: change the file counter on save with Wand

I have the following function to convert a PDF to images:
from wand.color import Color
from wand.image import Image as Img
with Img(filename=pdfName, resolution=self.resolution) as pic:
    pic.compression_quality = self.compressionQuality
    pic.background_color = Color("white")
    pic.alpha_channel = 'remove'
    pic.save(filename='full_' + random + '-%03d.jpg')
If the PDF has multiple pages, the JPG files are named like this:
file-000.jpg, file-001.jpg
My question is: is it possible to have the counter start at 1?
file-001.jpg, file-002.jpg
Thanks in advance

Update
Another option is to set Image.scene:
from wand.api import library
from wand.color import Color
from wand.image import Image as Img
with Img(filename=pdfName, resolution=self.resolution) as pic:
    # Move the internal iterator to the first frame.
    library.MagickResetIterator(pic.wand)
    pic.scene = 1
    pic.compression_quality = self.compressionQuality
    pic.background_color = Color("white")
    pic.alpha_channel = 'remove'
    pic.save(filename='full_' + random + '-%03d.jpg')
Original
The only thing I can think of is putting an empty image on the stack before reading.
with Img(filename='NULL:') as pic:
    pic.read(filename=pdfName, resolution=self.resolution)
    pic.compression_quality = self.compressionQuality
    pic.background_color = Color("white")
    pic.alpha_channel = 'remove'
    pic.save(filename='full_' + random + '-%03d.jpg')
This would still write a file-000.jpg, but it would be empty, and you can quickly remove it with os.remove().
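A third route, not taken from either answer, is to let Wand write its zero-based names and renumber the files afterwards. The sketch below uses only the standard library; the helper name shift_numbering and its parameters are hypothetical, and it assumes the files follow the prefix-NNN.jpg pattern from the question.

```python
import os
import re


def shift_numbering(directory, prefix, offset=1, width=3, ext=".jpg"):
    """Rename prefix-000.jpg, prefix-001.jpg, ... so the numbering is shifted by `offset`."""
    pattern = re.compile(r"^%s-(\d{%d})%s$" % (re.escape(prefix), width, re.escape(ext)))
    numbered = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    renamed = []
    # Rename from the highest number down so no file overwrites another.
    for number, name in sorted(numbered, reverse=True):
        new_name = "%s-%0*d%s" % (prefix, width, number + offset, ext)
        os.rename(os.path.join(directory, name), os.path.join(directory, new_name))
        renamed.append(new_name)
    return sorted(renamed)
```

Called right after pic.save(...), for example as shift_numbering('.', 'full_' + random), it would turn full_x-000.jpg into full_x-001.jpg and so on.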

Related

Video created in OpenCv does open, but shows nothing

I've been trying to merge some pictures into a video using OpenCV. OpenCV does create the .avi file and it opens in VLC Media Player, but when I try to watch the video it shows nothing and doesn't report a duration. The file size is only around 6 KB, even though there are 18 images with an average size of 252 KB, which seems inconsistent.
Here's a look at my code:
import cv2
import os
frameSize = (500, 500)
with os.scandir('Taylor') as Taylor_Images:
    for image in Taylor_Images:
        print(image.name)
with os.scandir('Taylor_5') as Taylor5_Images:
    for image in Taylor5_Images:
        print(image.name)
image_Taylor = []
image_Taylor_5 = []
ext_2 = '.png'
adding = '_5'
path_2 = 'Taylor_5/'
ext = '.png'
path = 'Taylor/'
for i in range(1,10):
    img_name = str(i) + ext
    img = path + str(i) + ext
    image_Taylor.append(img)
    img_name_2 = str(i) + ext_2
    img_2 = path_2 + str(i) + adding + ext_2
    image_Taylor_5.append(img_2)
out = cv2.VideoWriter('output_video2.avi',cv2.VideoWriter_fourcc(*'DIVX'), 60, frameSize)
for i in range(0,9):
    img = cv2.imread(image_Taylor[i])
    out.write(img)
    img_2 = cv2.imread(image_Taylor_5[i])
    out.write(img_2)
out.release()
Update 1
As Vatsal suggested, I checked the sizes of all the images and found that some of them didn't have the same size (width, height), so I resized them all to the same size (400, 800).
With regard to what Tim said, the images do have that file size. I attach a picture of what I did following your suggestion of making their size a multiple of 4.
However, the video generated from the script still has a size of 6kB.
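One thing that may be worth checking, offered as a guess and not a confirmed diagnosis: the script passes frameSize = (500, 500) to cv2.VideoWriter, while the update says the images were resized to (400, 800). As far as I know, VideoWriter.write does not raise when a frame's size differs from the size the writer was created with, and the frame is simply not encoded, which would be consistent with a tiny output file. Note also that frameSize is (width, height), whereas img.shape is (height, width, channels). The helper below is a hypothetical sketch of that comparison.

```python
def matches_frame_size(image_shape, frame_size):
    """Return True if an image with this shape matches the writer's frame size.

    image_shape is img.shape from OpenCV: (height, width[, channels]).
    frame_size is the (width, height) tuple passed to cv2.VideoWriter.
    """
    height, width = image_shape[:2]
    return (width, height) == tuple(frame_size)


# Hypothetical use inside the writing loop of the script above:
# img = cv2.imread(path)
# if img is None:
#     raise IOError("could not read " + path)
# if not matches_frame_size(img.shape, frameSize):
#     img = cv2.resize(img, frameSize)  # dsize is (width, height)
# out.write(img)
```

The check for None is there because cv2.imread returns None instead of raising when a file cannot be read.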

Python WAND resize poor quality

I have the following code to convert a PDF to JPG. I had to resize the image because of its large size; otherwise I lose the original PDF format (A4, A3, etc.):
with Img(filename=pdfName, resolution=self.resolution) as document:
    reader = PyPDF2.PdfFileReader(pdfName.replace('[0]', ''))
    for page_number, page in enumerate(document.sequence):
        pdfSize = reader.getPage(page_number).mediaBox
        width = pdfSize[2]
        height = pdfSize[3]
        with Img(page, resolution=self.resolution) as img:
            # Do not resize the first page, which is used to find useful information
            if not get_first_page:
                img.resize(int(width), int(height))
            img.compression_quality = self.compressionQuality
            img.background_color = Color("white")
            img.alpha_channel = 'remove'
            if get_first_page:
                filename = output
            else:
                filename = tmpPath + '/' + 'tmp-' + str(page_number) + '.jpg'
            img.save(filename=filename)
So, for each page, I read the PDF page size and resize the output made with Wand. But my problem is the quality of the JPG, which is really poor...
My resolution is 300 (I tried higher values, without success) and compressionQuality is 100.
Any ideas?
Thanks
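A possible explanation, offered as a sketch and not a verified fix: the mediaBox values are in PDF points (by default 1 point = 1/72 inch), so passing them straight to img.resize() shrinks a 300 dpi render down to roughly 72 dpi. If the goal is to keep the page proportions at the chosen resolution, the target size in pixels can be computed from the points first. The helper name points_to_pixels is hypothetical.

```python
POINTS_PER_INCH = 72  # default PDF user unit: 1 point = 1/72 inch


def points_to_pixels(width_pt, height_pt, dpi):
    """Convert a page size in PDF points to pixel dimensions at the given dpi."""
    scale = dpi / POINTS_PER_INCH
    return round(float(width_pt) * scale), round(float(height_pt) * scale)


# An A4 page is about 595 x 842 points.
print(points_to_pixels(595, 842, 300))
```

With these numbers the target would be 2479 x 3508 pixels, compared with the 595 x 842 pixels the original resize call produces.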

Batch generating barcodes using ReportLab

Yesterday, I asked a question that was perhaps too broad.
Today, I've acted on my ideas in an effort to implement a solution.
Using ReportLab, pdfquery and PyPDF2, I'm trying to automate the process of generating barcodes on hundreds of pages in a PDF document.
Each page needs to have one barcode. However, if a page has a letter in the top right ('A' through 'E') then it needs to use the same barcode as the previous page. The files with letters on the top right are duplicate forms with similar information.
If there is no letter present, then a unique barcode number (incremented by one is fine) should be used on that page.
My code seems to work, but I'm having two issues:
The barcode moves around ever so slightly (minor issue).
The barcode value will not change (major issue). Only the first barcode number is set on all pages.
I can't seem to tell why the value isn't changing. Does anyone have a clue?
Code is here:
import pdfquery
import os
from io import BytesIO
from PyPDF2 import PdfFileWriter, PdfFileReader
from reportlab.graphics.barcode import eanbc
from reportlab.graphics.shapes import Drawing
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import mm
from reportlab.pdfgen import canvas
from reportlab.graphics import renderPDF
pdf = pdfquery.PDFQuery("letters-test.pdf")
total_pages = pdf.doc.catalog['Pages'].resolve()['Count']
print("Total pages", total_pages)
barcode_value = 12345670
output = PdfFileWriter()
for i in range(0, total_pages):
    pdf.load(i) # Load page i into memory
    duplicate_letter = pdf.pq('LTTextLineHorizontal:in_bbox("432,720,612,820")').text()
    if duplicate_letter != '':
        print("Page " + str(i+1) + " letter " + str(duplicate_letter))
        print(barcode_value)
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=letter)
        # draw the eanbc8 code
        barcode_eanbc8 = eanbc.Ean8BarcodeWidget(str(barcode_value))
        bounds = barcode_eanbc8.getBounds()
        width = bounds[2] - bounds[0]
        height = bounds[3] - bounds[1]
        d = Drawing(50, 10)
        d.add(barcode_eanbc8)
        renderPDF.draw(d, c, 400, 700)
        c.save()
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # read existing PDF
        existing_pdf = PdfFileReader(open("letters-test.pdf", "rb"))
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(i)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
    else:
        # increment barcode value
        barcode_value += 1
        print("Page " + str(i+1) + " isn't a duplicate.")
        print(barcode_value)
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=letter)
        # draw the eanbc8 code
        barcode_eanbc8 = eanbc.Ean8BarcodeWidget(str(barcode_value))
        bounds = barcode_eanbc8.getBounds()
        width = bounds[2] - bounds[0]
        height = bounds[3] - bounds[1]
        d = Drawing(50, 10)
        d.add(barcode_eanbc8)
        renderPDF.draw(d, c, 420, 710)
        c.save()
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # read existing PDF
        existing_pdf = PdfFileReader(open("letters-test.pdf", "rb"))
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(i)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
    # Clear page i from memory and re load.
    # pdf = pdfquery.PDFQuery("letters-test.pdf")
outputStream = open("newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
And here is letters-test.pdf
As Kamil Nicki's answer pointed out, Ean8BarcodeWidget limits the effective digits to 7:
class Ean8BarcodeWidget(Ean13BarcodeWidget):
    _digits=7
    ...
        self.value=max(self._digits-len(value),0)*'0'+value[:self._digits]
You may change your encoding scheme or use an EAN-13 barcode with Ean13BarcodeWidget, which has 12 usable digits.
The reason your barcode is not changing is that you passed too long an integer to eanbc.Ean8BarcodeWidget.
According to the EAN standard, EAN-8 barcodes are 8 digits long (7 digits + check digit).
Solution:
If you change barcode_value from 12345670 to 1234560 and run your script, you will see that the barcode value increases as you want and the check digit is appended as the eighth digit.
With that information in hand, you should use only 7 digits to encode information in the barcode.
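To make the effect concrete, here is a small standalone sketch that mirrors the truncation line quoted from the widget and adds the usual EAN-8 check digit calculation (weights 3, 1, 3, 1, 3, 1, 3 from the left). The function names are hypothetical and this is not ReportLab's own code.

```python
EAN8_DIGITS = 7  # mirrors _digits in the widget excerpt quoted above


def ean8_payload(value):
    """Pad with zeros or truncate to seven digits, as the quoted widget line does."""
    value = str(value)
    return max(EAN8_DIGITS - len(value), 0) * "0" + value[:EAN8_DIGITS]


def ean8_check_digit(payload):
    """EAN-8 check digit for seven data digits (weights 3,1,3,1,3,1,3)."""
    total = sum(int(d) * (3 if i % 2 == 0 else 1) for i, d in enumerate(payload))
    return (10 - total % 10) % 10


# Eight-digit inputs lose their last digit, so incrementing them changes nothing.
print(ean8_payload(12345670), ean8_payload(12345671))
# Seven-digit inputs survive intact.
print(ean8_payload(1234560), ean8_payload(1234561))
```

Both eight-digit values collapse to the same seven-digit payload, which is why every page ended up with the same barcode in the question's script.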

Resize multiple images in a folder (Python)

I already saw the examples suggested but some of them don't work.
So, I have this code which seems to work fine for one image:
im = Image.open('C:\\Users\\User\\Desktop\\Images\\2.jpg') # image extension *.png,*.jpg
new_width = 1200
new_height = 750
im = im.resize((new_width, new_height), Image.ANTIALIAS)
im.save('C:\\Users\\User\\Desktop\\resized.tif') # .jpg is deprecated and raise error....
How can I iterate over a folder and resize more than one image? The aspect ratio needs to be maintained.
Thank you
# Resize all images in a directory to half the size.
#
# Save on a new file with the same name but with "small_" prefix
# on high quality jpeg format.
#
# If the script is in /images/ and the files are in /images/2012-1-1-pics
# call with: python resize.py 2012-1-1-pics
from PIL import Image
import os
import sys
directory = sys.argv[1]
for file_name in os.listdir(directory):
    print("Processing %s" % file_name)
    image = Image.open(os.path.join(directory, file_name))
    x,y = image.size
    new_dimensions = (x // 2, y // 2) # dimensions set here (must be integers)
    output = image.resize(new_dimensions, Image.ANTIALIAS)
    output_file_name = os.path.join(directory, "small_" + file_name)
    output.save(output_file_name, "JPEG", quality = 95)
print("All done")
Where it says
new_dimensions = (x // 2, y // 2)
you can set any dimensions you want. For example, if you want 300x300, change that line to:
new_dimensions = (300, 300)
I assume that you want to iterate over images in a specific folder.
You can do this:
import os
from datetime import datetime
from PIL import Image
for image_file_name in os.listdir('C:\\Users\\User\\Desktop\\Images\\'):
    if image_file_name.endswith(".tif"):
        now = datetime.now().strftime('%Y%m%d-%H%M%S-%f')
        im = Image.open('C:\\Users\\User\\Desktop\\Images\\'+image_file_name)
        new_width = 1282
        new_height = 797
        im = im.resize((new_width, new_height), Image.ANTIALIAS)
        im.save('C:\\Users\\User\\Desktop\\test_resize\\resized' + now + '.tif')
datetime.now() is only there to make the image names unique. It is just the first hack that came to mind; you can do something else. It is needed so that the output files do not overwrite each other.
I assume that you have a list of images in some folder and you want to resize all of them:
from PIL import Image
import os
source_folder = 'path/to/where/your/images/are/located/'
destination_folder = 'path/to/where/you/want/to/save/your/images/after/resizing/'
directory = os.listdir(source_folder)
for item in directory:
    img = Image.open(source_folder + item)
    imgResize = img.resize((new_image_width, new_image_height), Image.ANTIALIAS)
    imgResize.save(destination_folder + item[:-4] +'.tif', quality = 90)
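None of the snippets above keeps the aspect ratio, which the question asks for. One way to handle that, sketched here with a hypothetical helper named fit_within, is to compute the largest size that fits inside the target box and pass that to resize(). Pillow's Image.thumbnail() does something similar in place, though it only shrinks.

```python
def fit_within(size, max_size):
    """Scale (width, height) to fit inside max_size while keeping the aspect ratio."""
    width, height = size
    max_width, max_height = max_size
    # The smaller of the two ratios is the one that keeps both sides inside the box.
    scale = min(max_width / width, max_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Hypothetical use with the loop above:
# img = Image.open(source_folder + item)
# imgResize = img.resize(fit_within(img.size, (1200, 750)), Image.ANTIALIAS)
print(fit_within((2400, 1000), (1200, 750)))
```

Note that this helper also enlarges images that are smaller than the box; clamp scale to 1.0 if that is not wanted.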

Extract images from PDF without resampling, in python?

How might one extract all images from a PDF document, at native resolution and format? (Meaning extract TIFF as TIFF, JPEG as JPEG, etc., without resampling.) Layout is unimportant; I don't care where the source image is located on the page.
I'm using python 2.7 but can use 3.x if required.
You can use the module PyMuPDF. This outputs all images as .png files, but worked out of the box and is fast.
import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5: # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else: # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
see here for more resources
Here is a modified version for fitz 1.19.6:
import os
import fitz # pip install --upgrade pip; pip install --upgrade pymupdf
from tqdm import tqdm # pip install tqdm
workdir = "your_folder"
for each_path in os.listdir(workdir):
    if ".pdf" in each_path:
        doc = fitz.Document((os.path.join(workdir, each_path)))
        for i in tqdm(range(len(doc)), desc="pages"):
            for img in tqdm(doc.get_page_images(i), desc="page_images"):
                xref = img[0]
                image = doc.extract_image(xref)
                pix = fitz.Pixmap(doc, xref)
                pix.save(os.path.join(workdir, "%s_p%s-%s.png" % (each_path[:-4], i, xref)))
print("Done!")
In Python with PyPDF2 and Pillow libraries it is simple:
PyPDF2>=2.10.0
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
for page in reader.pages:
    for image in page.images:
        with open(image.name, "wb") as fp:
            fp.write(image.data)
PyPDF2<2.10.0
from PIL import Image
from PyPDF2 import PdfReader
def extract_image(pdf_file_path):
    reader = PdfReader(pdf_file_path)
    page = reader.pages[0]
    x_object = page["/Resources"]["/XObject"].getObject()
    for obj in x_object:
        if x_object[obj]["/Subtype"] == "/Image":
            size = (x_object[obj]["/Width"], x_object[obj]["/Height"])
            data = x_object[obj].getData()
            if x_object[obj]["/ColorSpace"] == "/DeviceRGB":
                mode = "RGB"
            else:
                mode = "P"
            if x_object[obj]["/Filter"] == "/FlateDecode":
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif x_object[obj]["/Filter"] == "/DCTDecode":
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif x_object[obj]["/Filter"] == "/JPXDecode":
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()
Often in a PDF, the image is simply stored as-is. For example, a PDF with a jpg inserted will have a range of bytes somewhere in the middle that when extracted is a valid jpg file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.
In Python with PyPDF2 for CCITTFaxDecode filter:
import PyPDF2
import struct
"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""
def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
    return struct.pack(tiff_header_struct,
                       b'II', # Byte order indication: little-endian
                       42, # Version number (always 42)
                       8, # Offset to first IFD
                       8, # Number of tags in IFD
                       256, 4, 1, width, # ImageWidth, LONG, 1, width
                       257, 4, 1, height, # ImageLength, LONG, 1, length
                       258, 3, 1, 1, # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group, # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0, # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct), # StripOffsets, LONG, 1, len of header
                       278, 4, 1, height, # RowsPerStrip, LONG, 1, length
                       279, 4, 1, img_size, # StripByteCounts, LONG, 1, size of image
                       0 # Offset of next IFD (0 = this is the last IFD)
                       )
pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
for i in range(0, cond_scan_reader.getNumPages()):
    page = cond_scan_reader.getPage(i)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            """
            The CCITTFaxDecode filter decodes image data that has been encoded using
            either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
            designed to achieve efficient compression of monochrome (1 bit per pixel) image
            data at relatively low resolutions, and so is useful only for bitmap image data, not
            for color images, grayscale images, or general data.
            K < 0 --- Pure two-dimensional encoding (Group 4)
            K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
            K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
            """
            if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                if xObject[obj]['/DecodeParms']['/K'] == -1:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                width = xObject[obj]['/Width']
                height = xObject[obj]['/Height']
                data = xObject[obj]._data # sorry, getData() does not work for CCITTFaxDecode
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                img_name = obj[1:] + '.tiff'
                with open(img_name, 'wb') as img_file:
                    img_file.write(tiff_header + data)
                #
                # import io
                # from PIL import Image
                # im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()
Libpoppler comes with a tool called "pdfimages" that does exactly this.
(On ubuntu systems it's in the poppler-utils package)
http://poppler.freedesktop.org/
http://en.wikipedia.org/wiki/Pdfimages
Windows binaries: http://blog.alivate.com.au/poppler-windows/
I prefer minecart as it is extremely easy to use. The snippet below shows how to extract images from a PDF:
#pip install minecart
import minecart
pdffile = open('Invoices.pdf', 'rb')
doc = minecart.Document(pdffile)
page = doc.get_page(0) # getting a single page
#iterating through all pages
for page in doc.iter_pages():
    im = page.images[0].as_pil() # requires pillow
    display(im)
PikePDF can do this with very little code:
from pikepdf import Pdf, PdfImage
filename = "sample-in.pdf"
example = Pdf.open(filename)
for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")
extract_to will automatically pick the file extension based on how the image
is encoded in the PDF.
If you want, you could also print some detail about the images as they get extracted:
# Optional: print info about image
w = raw_image.stream_dict.Width
h = raw_image.stream_dict.Height
f = raw_image.stream_dict.Filter
size = raw_image.stream_dict.Length
print(f"Wrote {name} {w}x{h} {f} {size:,}B {image.colorspace} to {out}")
which can print something like
Wrote /Im1 150x150 /DCTDecode 5,952B /ICCBased to sample2.pdf-page000-img000.jpg
Wrote /Im10 32x32 /FlateDecode 36B /ICCBased to sample2.pdf-page000-img001.png
...
See the docs for
more that you can do with images, including replacing them in the PDF file.
While this usually works pretty well, note that there are a number of images that won’t be extracted this way:
Vector graphics, such as embedded SVG/PS/PDF; you can crop the original PDF, but I’m not aware of an easy way to do this programmatically
Certain monochrome images compressed inside the PDF using “CCITTFaxDecode, type G4, with the /EncodedByteAlign set to true”
Non-RGB/CMYK images, aka ProcessColorModel/DeviceN/HiFi, used for colour separations (Thanks mara004)
Here is my version from 2019 that recursively gets all images from PDF and reads them with PIL.
Compatible with Python 2/3. I also found that sometimes image in PDF may be compressed by zlib, so my code supports decompression.
#!/usr/bin/env python3
try:
    from StringIO import StringIO
except ImportError:
    from io import BytesIO as StringIO
from PIL import Image
from PyPDF2 import PdfFileReader, generic
import zlib
def get_color_mode(obj):
    try:
        cspace = obj['/ColorSpace']
    except KeyError:
        return None
    if cspace == '/DeviceRGB':
        return "RGB"
    elif cspace == '/DeviceCMYK':
        return "CMYK"
    elif cspace == '/DeviceGray':
        return "P"
    if isinstance(cspace, generic.ArrayObject) and cspace[0] == '/ICCBased':
        color_map = obj['/ColorSpace'][1].getObject()['/N']
        if color_map == 1:
            return "P"
        elif color_map == 3:
            return "RGB"
        elif color_map == 4:
            return "CMYK"
def get_object_images(x_obj):
    images = []
    for obj_name in x_obj:
        sub_obj = x_obj[obj_name]
        if '/Resources' in sub_obj and '/XObject' in sub_obj['/Resources']:
            images += get_object_images(sub_obj['/Resources']['/XObject'].getObject())
        elif sub_obj['/Subtype'] == '/Image':
            zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
            if zlib_compressed:
                sub_obj._data = zlib.decompress(sub_obj._data)
            images.append((
                get_color_mode(sub_obj),
                (sub_obj['/Width'], sub_obj['/Height']),
                sub_obj._data
            ))
    return images
def get_pdf_images(pdf_fp):
    images = []
    try:
        pdf_in = PdfFileReader(open(pdf_fp, "rb"))
    except:
        return images
    for p_n in range(pdf_in.numPages):
        page = pdf_in.getPage(p_n)
        try:
            page_x_obj = page['/Resources']['/XObject'].getObject()
        except KeyError:
            continue
        images += get_object_images(page_x_obj)
    return images
if __name__ == "__main__":
    pdf_fp = "test.pdf"
    for image in get_pdf_images(pdf_fp):
        (mode, size, data) = image
        try:
            img = Image.open(StringIO(data))
        except Exception as e:
            print ("Failed to read image with PIL: {}".format(e))
            continue
        # Do whatever you want with the image
I started from @sylvain's code.
There were some flaws, like the exception NotImplementedError: unsupported filter /DCTDecode raised by getData, or the fact that the code failed to find images on some pages because they were nested at a deeper level than the page.
Here is my code:
import PyPDF2
from PIL import Image
import sys
from os import path
import warnings
warnings.filterwarnings("ignore")
number = 0
def recurse(page, xObject):
    global number
    xObject = xObject['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj]._data
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"
            imagename = "%s - p. %s - %s"%(abspath[:-4], p, obj[1:])
            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(imagename + ".png")
                number += 1
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(imagename + ".jpg", "wb")
                img.write(data)
                img.close()
                number += 1
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(imagename + ".jp2", "wb")
                img.write(data)
                img.close()
                number += 1
        else:
            recurse(page, xObject[obj])
try:
    _, filename, *pages = sys.argv
    *pages, = map(int, pages)
    abspath = path.abspath(filename)
except BaseException:
    print('Usage :\nPDF_extract_images file.pdf page1 page2 page3 …')
    sys.exit()
file = PyPDF2.PdfFileReader(open(filename, "rb"))
for p in pages:
    page0 = file.getPage(p-1)
    recurse(p, page0)
print('%s extracted images'% number)
Well, I struggled with this for many weeks. Many of these answers helped me along, but there was always something missing; apparently no one here has had problems with JBIG2-encoded images.
In the batch of PDFs that I have to scan, JBIG2-encoded images are very common.
As far as I understand, many copy/scan machines scan paper and turn it into PDF files full of JBIG2-encoded images.
So after many days of tests I decided to go with the answer proposed here by dkagedal a long time ago.
Here is my step-by-step on Linux (if you have another OS, I suggest using a Linux Docker container; it's going to be much easier):
First step:
apt-get install poppler-utils
Then I was able to run command line tool called pdfimages like this:
pdfimages -all myfile.pdf ./images_found/
With the above command you will be able to extract all the images contained in myfile.pdf and you will have them saved inside images_found (you have to create images_found before)
In the list you will find several types of images, png, jpg, tiff; all these are easily readable with any graphic tool.
Then you will have some files named like: -145.jb2e and -145.jb2g.
These 2 files contain ONE IMAGE encoded in jbig2 saved in 2 different files one for the header and one for the data
Again I have lost many days trying to find out how to convert those files into something readable and finally I came across this tool called jbig2dec
So first you need to install this magic tool:
apt-get install jbig2dec
then you can run:
jbig2dec -t png -145.jb2g -145.jb2e
You are going to finally be able to get all extracted images converted into something useful.
good luck!
Much easier solution:
Use the poppler-utils package. To install it, use Homebrew (Homebrew is macOS-specific, but you can find the poppler-utils package for Windows or Linux here: https://poppler.freedesktop.org/). The first line of code below installs poppler using Homebrew. After installation, the second line (run from the command line) extracts the images from a PDF file and names them "image*". To run this program from within Python, use the os or subprocess module. The third snippet uses the os module; beneath that is an example with subprocess (Python 3.5 or later for the run() function). More info here: https://www.cyberciti.biz/faq/easily-extract-images-from-pdf-file/
brew install poppler
pdfimages file.pdf image
import os
os.system('pdfimages file.pdf image')
or
import subprocess
subprocess.run('pdfimages file.pdf image', shell=True)
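If the PDF path can contain spaces or comes from user input, it may be safer to pass the command as a list without shell=True. The sketch below is one way to do that; the function names are hypothetical, and the -all flag is the one used in the pdfimages example earlier on this page (it asks pdfimages to write images in their native formats where possible).

```python
import subprocess


def pdfimages_command(pdf_path, prefix, native_formats=True):
    """Build the argument list for pdfimages without going through a shell."""
    command = ["pdfimages"]
    if native_formats:
        command.append("-all")
    command += [pdf_path, prefix]
    return command


def extract_images(pdf_path, prefix):
    """Run pdfimages; raises CalledProcessError if it exits with an error."""
    subprocess.run(pdfimages_command(pdf_path, prefix), check=True)


print(pdfimages_command("my file.pdf", "image"))
```

extract_images() raises FileNotFoundError if pdfimages is not installed or not on the PATH.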
I did this for my own program, and found that the best library to use was PyMuPDF. It lets you find out the "xref" numbers of each image on each page, and use them to extract the raw image data from the PDF.
import fitz
from PIL import Image
import io
filePath = "path/to/file.pdf"
#opens doc using PyMuPDF
doc = fitz.Document(filePath)
#loads the first page
page = doc.loadPage(0)
#[First image on page described thru a list][First attribute on image list: xref n], check PyMuPDF docs under getImageList()
xref = page.getImageList()[0][0]
#gets the image as a dict, check docs under extractImage
baseImage = doc.extractImage(xref)
#gets the raw string image data from the dictionary and wraps it in a BytesIO object before using PIL to open it
image = Image.open(io.BytesIO(baseImage['image']))
#Displays image for good measure
image.show()
Definitely check out the docs, though.
After some searching I found the following script, which works really well with my PDFs. It only handles JPG, but it worked perfectly with my unprotected files. It also does not require any external libraries.
Not to take any credit, the script originates from Ned Batchelder, and not me.
Python3 code: extract jpg's from pdf's. Quick and dirty
import sys
with open(sys.argv[1],"rb") as file:
    file.seek(0)
    pdf = file.read()
startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0
njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")
    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)
    njpg += 1
    i = iend
This is based on the posts above that use PyPDF2.
The error raised by @sylvain's code, NotImplementedError: unsupported filter /DCTDecode, must come from the method .getData(). It is solved by using ._data instead, as suggested by @Alex Paramonov.
So far I have only met "DCTDecode" cases, but I am sharing the adapted code, which includes remarks from the different posts: zlib handling from @Alex Paramonov, and sub_obj['/Filter'] being a list, from @mxl.
I hope it helps PyPDF2 users. The code follows:
import sys
import PyPDF2, traceback
import zlib
try:
    from PIL import Image
except ImportError:
    import Image
pdf_path = 'path_to_your_pdf_file.pdf'
input1 = PyPDF2.PdfFileReader(open(pdf_path, "rb"))
nPages = input1.getNumPages()
for i in range(nPages) :
    page0 = input1.getPage(i)
    if '/XObject' in page0['/Resources']:
        try:
            xObject = page0['/Resources']['/XObject'].getObject()
        except :
            xObject = []
        for obj_name in xObject:
            sub_obj = xObject[obj_name]
            if sub_obj['/Subtype'] == '/Image':
                zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
                if zlib_compressed:
                    sub_obj._data = zlib.decompress(sub_obj._data)
                size = (sub_obj['/Width'], sub_obj['/Height'])
                data = sub_obj._data#sub_obj.getData()
                try :
                    if sub_obj['/ColorSpace'] == '/DeviceRGB':
                        mode = "RGB"
                    elif sub_obj['/ColorSpace'] == '/DeviceCMYK':
                        mode = "CMYK"
                        # will cause errors when saving (might need convert to RGB first)
                    else:
                        mode = "P"
                    fn = 'p%03d-%s' % (i + 1, obj_name[1:])
                    if '/Filter' in sub_obj:
                        if '/FlateDecode' in sub_obj['/Filter']:
                            img = Image.frombytes(mode, size, data)
                            img.save(fn + ".png")
                        elif '/DCTDecode' in sub_obj['/Filter']:
                            img = open(fn + ".jpg", "wb")
                            img.write(data)
                            img.close()
                        elif '/JPXDecode' in sub_obj['/Filter']:
                            img = open(fn + ".jp2", "wb")
                            img.write(data)
                            img.close()
                        elif '/CCITTFaxDecode' in sub_obj['/Filter']:
                            img = open(fn + ".tiff", "wb")
                            img.write(data)
                            img.close()
                        elif '/LZWDecode' in sub_obj['/Filter'] :
                            img = open(fn + ".tif", "wb")
                            img.write(data)
                            img.close()
                        else :
                            print('Unknown format:', sub_obj['/Filter'])
                    else:
                        img = Image.frombytes(mode, size, data)
                        img.save(fn + ".png")
                except:
                    traceback.print_exc()
    else:
        print("No image found for page %d" % (i + 1))
I installed ImageMagick on my server and then ran command-line calls through Popen:
#!/usr/bin/python
import sys
import os
import subprocess
import settings
IMAGE_PATH = os.path.join(settings.MEDIA_ROOT , 'pdf_input' )
def extract_images(pdf):
    output = 'temp.png'
    cmd = 'convert ' + os.path.join(IMAGE_PATH, pdf) + ' ' + os.path.join(IMAGE_PATH, output)
    subprocess.Popen(cmd.split(), stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
This will create an image for every page and store them as temp-0.png, temp-1.png ....
This is only 'extraction' if you got a pdf with only images and no text.
I added all of those together in PyPDFTK here.
My own contribution is handling of /Indexed files as such:
for obj in xObject:
    if xObject[obj]['/Subtype'] == '/Image':
        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
        color_space = xObject[obj]['/ColorSpace']
        if isinstance(color_space, pdf.generic.ArrayObject) and color_space[0] == '/Indexed':
            color_space, base, hival, lookup = [v.getObject() for v in color_space] # pg 262
        mode = img_modes[color_space]
        if xObject[obj]['/Filter'] == '/FlateDecode':
            data = xObject[obj].getData()
            img = Image.frombytes(mode, size, data)
            if color_space == '/Indexed':
                img.putpalette(lookup.getData())
                img = img.convert('RGB')
            img.save("{}{:04}.png".format(filename_prefix, i))
Note that when /Indexed files are found, you can't just compare /ColorSpace to a string, because it comes as an ArrayObject. So, we have to check the array and retrieve the indexed palette (lookup in the code) and set it in the PIL Image object, otherwise it stays uninitialized (zero) and the whole image shows as black.
My first instinct was to save them as GIFs (which is an indexed format), but my tests showed that PNGs were smaller and looked the same.
I found those types of images when printing to PDF with Foxit Reader PDF Printer.
As of February 2019, the solution given by @sylvain does not work (at least on my setup) without a small modification: xObject[obj]['/Filter'] is not a single value but a list, so to make the script work I had to modify the format checks as follows:
import PyPDF2, traceback
from PIL import Image
# src is the path to the input PDF
input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
print(nPages)
for i in range(nPages):
    print(i)
    page0 = input1.getPage(i)
    try:
        xObject = page0['/Resources']['/XObject'].getObject()
    except:
        # Page has no XObjects, so there is nothing to extract
        xObject = []
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            try:
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                elif xObject[obj]['/ColorSpace'] == '/DeviceCMYK':
                    mode = "CMYK"
                    # will cause errors when saving
                else:
                    mode = "P"
                fn = 'p%03d-%s' % (i + 1, obj[1:])
                print('\t', fn)
                if '/FlateDecode' in xObject[obj]['/Filter']:
                    img = Image.frombytes(mode, size, data)
                    img.save(fn + ".png")
                elif '/DCTDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jpg", "wb")
                    img.write(data)
                    img.close()
                elif '/JPXDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jp2", "wb")
                    img.write(data)
                    img.close()
                elif '/LZWDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".tif", "wb")
                    img.write(data)
                    img.close()
                else:
                    print('Unknown format:', xObject[obj]['/Filter'])
            except:
                traceback.print_exc()
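Since /Filter can arrive either as a single name or as a list of names, one way to keep the format check in one place is to normalize it first. This is only a sketch: normalize_filters, extension_for and the EXTENSIONS table are hypothetical helpers, not part of PyPDF2, and the table simply mirrors the four filters handled in the script above.

```python
def normalize_filters(filter_value):
    """Return the /Filter entry as a list of names, whatever form it came in."""
    if filter_value is None:
        return []
    if isinstance(filter_value, (list, tuple)):
        return [str(name) for name in filter_value]
    return [str(filter_value)]


# Hypothetical mapping that mirrors the filters handled in the script above.
EXTENSIONS = {
    "/FlateDecode": ".png",
    "/DCTDecode": ".jpg",
    "/JPXDecode": ".jp2",
    "/LZWDecode": ".tif",
}


def extension_for(filter_value):
    """Pick a file extension for the first known filter, or None if unknown."""
    for name in normalize_filters(filter_value):
        if name in EXTENSIONS:
            return EXTENSIONS[name]
    return None
```

In the script this could be called as extension_for(xObject[obj].get('/Filter')), so the same code path handles both the single-value and the list form.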
You can also use the pdfimages command on Ubuntu.
Install the poppler tools with the commands below.
sudo apt install poppler-utils
sudo apt-get install python-poppler
pdfimages -png file.pdf image
The files created are listed below (for example, when the PDF contains two images):
image-000.png
image-001.png
It works! You can now call it from Python with subprocess.run.
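Calling pdfimages from Python could look roughly like this. It is a sketch that assumes pdfimages from poppler-utils is on the PATH; build_pdfimages_command and extract_images are hypothetical helper names, and -png is the pdfimages option that writes PNG files instead of the default PPM/PBM output.

```python
import subprocess


def build_pdfimages_command(pdf_path, prefix, as_png=True):
    """Build the argument list for poppler's pdfimages tool."""
    command = ["pdfimages"]
    if as_png:
        command.append("-png")  # write PNG instead of the default PPM/PBM
    command += [str(pdf_path), str(prefix)]
    return command


def extract_images(pdf_path, prefix):
    """Run pdfimages and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdfimages_command(pdf_path, prefix), check=True)
```

For example, extract_images("file.pdf", "image") runs the same command as shown above. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.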
Try the code below. It extracts the JPEG (/DCTDecode) images from a PDF.
import sys
import PyPDF2
from PIL import Image
pdf=sys.argv[1]
print(pdf)
input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
for x in range(0,input1.numPages):
    xObject=input1.getPage(x)
    xObject = xObject['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            print(size)
            data = xObject[obj]._data
            #print(data)
            print(xObject[obj]['/Filter'])
            if xObject[obj]['/Filter'][0] == '/DCTDecode':
                img_name=str(x)+".jpg"
                print(img_name)
                img = open(img_name, "wb")
                img.write(data)
                img.close()
    print(str(x)+" is done")
I rewrote the solutions as a single Python class.
It should be easy to work with. If you come across a new "/Filter" or "/ColorSpace", just add it to the internal dictionaries.
https://github.com/survtur/extract_images_from_pdf
Requirements:
Python 3.6+
PyPDF2
PIL
With pypdfium2 (v4):
import pypdfium2.__main__ as pdfium_cli
pdfium_cli.api_main(["extract-images", "input.pdf", "-o", "output_dir"])
There are some options to choose between different extraction strategies (see pypdfium2 extract-images --help).
Actual non-CLI Python APIs are available as well. The CLI's implementation demonstrates them (see the docs for details):
# assuming `args` is a given options set (e.g. an argparse namespace)
import traceback
import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c
pdf = pdfium.PdfDocument(args.input)
images = []
for i in args.pages:
    page = pdf.get_page(i)
    obj_searcher = page.get_objects(
        filter = (pdfium_c.FPDF_PAGEOBJ_IMAGE, ),
        max_depth = args.max_depth,
    )
    images += list(obj_searcher)
n_digits = len(str(len(images)))
for i, image in enumerate(images):
    prefix = args.output_dir / ("%s_%0*d" % (args.input.stem, n_digits, i+1))
    try:
        if args.use_bitmap:
            pil_image = image.get_bitmap(render=args.render).to_pil()
            pil_image.save("%s.%s" % (prefix, args.format))
        else:
            image.extract(prefix, fb_format=args.format, fb_render=args.render)
    except pdfium.PdfiumError:
        traceback.print_exc()
Note: Unfortunately, PDFium's public image extraction APIs are quite limited, so PdfImage.extract() is not nearly as smart as pikepdf. If you only need the image bitmap and do not intend to save the image, PdfImage.get_bitmap() should work fine, though.
(Disclaimer: I'm the author of pypdfium2)
The following code uses the updated PyMuPDF API:
import fitz  # PyMuPDF
doc = fitz.open("/Users/vignesh/Downloads/ViewJournal2244.pdf")
Images_per_page={}
for i, page in enumerate(doc):
    images=[]
    for image_box in page.get_images():
        # Rectangles where this image is displayed on the page
        rect=page.get_image_rects(image_box)
        if not rect:
            continue
        pix=page.get_pixmap(matrix=fitz.Identity,clip=rect[0],dpi=None,colorspace=fitz.csRGB,alpha=True, annots=True)
        string=pix.tobytes()
        images.append(string)
    Images_per_page[i]=images
This worked for me:
from PyPDF2 import PdfFileReader
# Open the PDF file
pdf_file = open(r"C:\Users\file.pdf", 'rb')
pdf_reader = PdfFileReader(pdf_file)
# Iterate through each page
for page_num in range(pdf_reader.numPages):
    page = pdf_reader.getPage(page_num)
    xObject = page['/Resources']['/XObject'].getObject()
    # Iterate through each image on the page
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            # You can now save the image data to a file
            # (obj starts with "/", so strip it from the file name)
            with open(f'C:\\Users\\filepath\\{obj[1:]}.jpg', 'wb') as img_file:
                img_file.write(data)
# Close the PDF file
pdf_file.close()
First, install pdf2image:
pip install pdf2image==1.14.0
Use the code below to convert the pages of a PDF to images.
from pdf2image import convert_from_path, pdfinfo_from_path
file_path="file path of PDF"
image_path="output directory for the images"
info = pdfinfo_from_path(file_path, userpw=None, poppler_path=None)
maxPages = info["Pages"]
image_counter = 0
if maxPages > 10:
    # Convert 10 pages at a time; the + 1 makes sure the last page is included
    for first_page in range(1, maxPages + 1, 10):
        pages = convert_from_path(file_path, dpi=300, first_page=first_page,
                                  last_page=min(first_page+10-1, maxPages))
        for page in pages:
            page.save(image_path+'/' + str(image_counter) + '.png', 'PNG')
            image_counter += 1
else:
    pages = convert_from_path(file_path, 300)
    for i, j in enumerate(pages):
        j.save(image_path+'/' + str(i) + '.png', 'PNG')
I hope this helps anyone looking for an easy way to convert a PDF to images, one image per page.
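The batching arithmetic in the pdf2image loop is easy to get off by one, so it can help to isolate it. Below is a small sketch of the page-range computation on its own; page_batches is a hypothetical helper, not part of pdf2image, and it uses the same 1-based, inclusive first_page/last_page convention as convert_from_path.

```python
def page_batches(total_pages, batch_size=10):
    """Return inclusive, 1-based (first_page, last_page) pairs covering every page."""
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


batches = page_batches(25)
```

Each pair can then be passed to convert_from_path as first_page and last_page. Note that the range has to stop at total_pages + 1; stopping at total_pages would skip the final page whenever it starts a new batch (for example, page 11 of an 11-page PDF).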
