Extract images from PDF without resampling, in Python?
How might one extract all images from a PDF document, at native resolution and format? (Meaning extract TIFF as TIFF, JPEG as JPEG, etc., and without resampling.) Layout is unimportant; I don't care where the source image is located on the page.
I'm using Python 2.7 but can use 3.x if required.
You can use the PyMuPDF module. This outputs all images as .png files, but it worked out of the box and is fast.
import fitz

doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:  # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
See here for more resources.
Here is a modified version for fitz 1.19.6:
import os

import fitz  # pip install --upgrade pip; pip install --upgrade pymupdf
from tqdm import tqdm  # pip install tqdm

workdir = "your_folder"

for each_path in os.listdir(workdir):
    if ".pdf" in each_path:
        doc = fitz.Document(os.path.join(workdir, each_path))
        for i in tqdm(range(len(doc)), desc="pages"):
            for img in tqdm(doc.get_page_images(i), desc="page_images"):
                xref = img[0]
                image = doc.extract_image(xref)
                pix = fitz.Pixmap(doc, xref)
                pix.save(os.path.join(workdir, "%s_p%s-%s.png" % (each_path[:-4], i, xref)))

print("Done!")
In Python, with the PyPDF2 and Pillow libraries, it is simple:
PyPDF2>=2.10.0
from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    for image in page.images:
        with open(image.name, "wb") as fp:
            fp.write(image.data)
PyPDF2<2.10.0
from PIL import Image
from PyPDF2 import PdfReader

def extract_image(pdf_file_path):
    reader = PdfReader(pdf_file_path)
    page = reader.pages[0]
    x_object = page["/Resources"]["/XObject"].getObject()

    for obj in x_object:
        if x_object[obj]["/Subtype"] == "/Image":
            size = (x_object[obj]["/Width"], x_object[obj]["/Height"])
            data = x_object[obj].getData()
            if x_object[obj]["/ColorSpace"] == "/DeviceRGB":
                mode = "RGB"
            else:
                mode = "P"

            if x_object[obj]["/Filter"] == "/FlateDecode":
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif x_object[obj]["/Filter"] == "/DCTDecode":
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif x_object[obj]["/Filter"] == "/JPXDecode":
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()
Often in a PDF, the image is simply stored as-is. For example, a PDF with a jpg inserted will have a range of bytes somewhere in the middle that when extracted is a valid jpg file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.
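For illustration, here is a bare-bones sketch of that byte-range idea. It only catches plain JPEG streams and can misfire on unusual files, so treat it as a demonstration rather than a robust extractor:

# scan the raw PDF bytes for JPEG start-of-image / end-of-image markers
with open("file.pdf", "rb") as f:
    data = f.read()

pos = 0
count = 0
while True:
    start = data.find(b"\xff\xd8", pos)  # JPEG SOI marker
    if start < 0:
        break
    end = data.find(b"\xff\xd9", start)  # JPEG EOI marker
    if end < 0:
        break
    with open("extracted_%d.jpg" % count, "wb") as out:
        out.write(data[start:end + 2])
    count += 1
    pos = end + 2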
In Python with PyPDF2 for CCITTFaxDecode filter:
import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""

def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       8,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, height
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, length of header
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, height
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # last IFD
                       )

pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)

for i in range(0, cond_scan_reader.getNumPages()):
    page = cond_scan_reader.getPage(i)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            """
            The CCITTFaxDecode filter decodes image data that has been encoded using
            either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
            designed to achieve efficient compression of monochrome (1 bit per pixel) image
            data at relatively low resolutions, and so is useful only for bitmap image data, not
            for color images, grayscale images, or general data.

            K < 0 --- Pure two-dimensional encoding (Group 4)
            K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
            K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
            """
            if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                if xObject[obj]['/DecodeParms']['/K'] == -1:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                width = xObject[obj]['/Width']
                height = xObject[obj]['/Height']
                data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                img_name = obj[1:] + '.tiff'
                with open(img_name, 'wb') as img_file:
                    img_file.write(tiff_header + data)
                #
                # import io
                # from PIL import Image
                # im = Image.open(io.BytesIO(tiff_header + data))

pdf_file.close()
Libpoppler comes with a tool called "pdfimages" that does exactly this.
(On ubuntu systems it's in the poppler-utils package)
http://poppler.freedesktop.org/
http://en.wikipedia.org/wiki/Pdfimages
Windows binaries: http://blog.alivate.com.au/poppler-windows/
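For example, to dump every image in its native format (in recent poppler versions the -all flag keeps JPEG, JPEG 2000, JBIG2 and CCITT data as-is instead of converting everything to PNM):

pdfimages -all input.pdf output_prefix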
I prefer minecart, as it is extremely easy to use. The snippet below shows how to extract images from a PDF:
# pip install minecart
import minecart

pdffile = open('Invoices.pdf', 'rb')
doc = minecart.Document(pdffile)

page = doc.get_page(0)  # getting a single page

# iterating through all pages
for page in doc.iter_pages():
    im = page.images[0].as_pil()  # requires pillow
    display(im)  # display() assumes a Jupyter/IPython environment
PikePDF can do this with very little code:
from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")
extract_to will automatically pick the file extension based on how the image is encoded in the PDF.
If you want, you could also print some detail about the images as they get extracted:
# Optional: print info about image
w = raw_image.stream_dict.Width
h = raw_image.stream_dict.Height
f = raw_image.stream_dict.Filter
size = raw_image.stream_dict.Length
print(f"Wrote {name} {w}x{h} {f} {size:,}B {image.colorspace} to {out}")
which can print something like
Wrote /Im1 150x150 /DCTDecode 5,952B /ICCBased to sample2.pdf-page000-img000.jpg
Wrote /Im10 32x32 /FlateDecode 36B /ICCBased to sample2.pdf-page000-img001.png
...
See the docs for more that you can do with images, including replacing them in the PDF file.
While this usually works pretty well, note that there are a number of images that won’t be extracted this way:
Vector graphics, such as embedded SVG/PS/PDF; you can crop the original PDF, but I’m not aware of an easy way to do this programmatically
Certain monochrome images compressed inside the PDF using “CCITTFaxDecode, type G4, with the /EncodedByteAlign set to true”
Non-RGB/CMYK images, aka ProcessColorModel/DeviceN/HiFi, used for colour separations (Thanks mara004)
Here is my version from 2019, which recursively gets all images from a PDF and reads them with PIL.
It is compatible with Python 2/3. I also found that sometimes images in a PDF are zlib-compressed, so my code supports decompression.
#!/usr/bin/env python3

try:
    from StringIO import StringIO
except ImportError:
    from io import BytesIO as StringIO

from PIL import Image
from PyPDF2 import PdfFileReader, generic
import zlib


def get_color_mode(obj):
    try:
        cspace = obj['/ColorSpace']
    except KeyError:
        return None

    if cspace == '/DeviceRGB':
        return "RGB"
    elif cspace == '/DeviceCMYK':
        return "CMYK"
    elif cspace == '/DeviceGray':
        return "P"

    if isinstance(cspace, generic.ArrayObject) and cspace[0] == '/ICCBased':
        color_map = obj['/ColorSpace'][1].getObject()['/N']
        if color_map == 1:
            return "P"
        elif color_map == 3:
            return "RGB"
        elif color_map == 4:
            return "CMYK"


def get_object_images(x_obj):
    images = []
    for obj_name in x_obj:
        sub_obj = x_obj[obj_name]

        if '/Resources' in sub_obj and '/XObject' in sub_obj['/Resources']:
            images += get_object_images(sub_obj['/Resources']['/XObject'].getObject())

        elif sub_obj['/Subtype'] == '/Image':
            zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
            if zlib_compressed:
                sub_obj._data = zlib.decompress(sub_obj._data)

            images.append((
                get_color_mode(sub_obj),
                (sub_obj['/Width'], sub_obj['/Height']),
                sub_obj._data
            ))

    return images


def get_pdf_images(pdf_fp):
    images = []
    try:
        pdf_in = PdfFileReader(open(pdf_fp, "rb"))
    except:
        return images

    for p_n in range(pdf_in.numPages):
        page = pdf_in.getPage(p_n)
        try:
            page_x_obj = page['/Resources']['/XObject'].getObject()
        except KeyError:
            continue
        images += get_object_images(page_x_obj)

    return images


if __name__ == "__main__":
    pdf_fp = "test.pdf"

    for image in get_pdf_images(pdf_fp):
        (mode, size, data) = image
        try:
            img = Image.open(StringIO(data))
        except Exception as e:
            print("Failed to read image with PIL: {}".format(e))
            continue
        # Do whatever you want with the image
I started from the code of @sylvain.
There were some flaws, like the exception NotImplementedError: unsupported filter /DCTDecode from getData, or the fact that the code failed to find images on some pages because they were nested deeper than the page's top-level resources.
Here is my code:
import PyPDF2
from PIL import Image
import sys
from os import path
import warnings

warnings.filterwarnings("ignore")

number = 0

def recurse(page, xObject):
    global number

    xObject = xObject['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj]._data
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            imagename = "%s - p. %s - %s" % (abspath[:-4], p, obj[1:])

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(imagename + ".png")
                number += 1
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(imagename + ".jpg", "wb")
                img.write(data)
                img.close()
                number += 1
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(imagename + ".jp2", "wb")
                img.write(data)
                img.close()
                number += 1
        else:
            recurse(page, xObject[obj])

try:
    _, filename, *pages = sys.argv
    *pages, = map(int, pages)
    abspath = path.abspath(filename)
except BaseException:
    print('Usage :\nPDF_extract_images file.pdf page1 page2 page3 …')
    sys.exit()

file = PyPDF2.PdfFileReader(open(filename, "rb"))

for p in pages:
    page0 = file.getPage(p - 1)
    recurse(p, page0)

print('%s extracted images' % number)
Well, I have been struggling with this for many weeks. Many of these answers helped me through, but there was always something missing: apparently no one here has ever had problems with jbig2-encoded images.
In the batch of PDFs that I have to scan, images encoded in jbig2 are very popular.
As far as I understand, there are many copy/scan machines that scan papers and transform them into PDF files full of jbig2-encoded images.
So after many days of tests I decided to go for the answer proposed here by dkagedal a long time ago.
Here is my step by step on Linux (if you have another OS, I suggest using a Linux Docker container; it's going to be much easier):
First step:
apt-get install poppler-utils
Then I was able to run the command-line tool called pdfimages like this:
pdfimages -all myfile.pdf ./images_found/
With the above command you will be able to extract all the images contained in myfile.pdf, and they will be saved inside images_found (you have to create images_found first).
In the list you will find several types of images: png, jpg, tiff; all of these are easily readable with any graphic tool.
Then you will have some files named like -145.jb2e and -145.jb2g.
These two files contain ONE IMAGE encoded in jbig2, saved in two different files: one for the header and one for the data.
Again I lost many days trying to find out how to convert those files into something readable, and finally I came across a tool called jbig2dec.
So first you need to install this magic tool:
apt-get install jbig2dec
then you can run:
jbig2dec -t png -145.jb2g -145.jb2e
You are finally going to be able to get all the extracted images converted into something useful.
Good luck!
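If you have to do this for many files, a rough Python wrapper around the two tools might look like the following. This is only a sketch: it assumes pdfimages and jbig2dec are on the PATH and that the output uses the paired .jb2g/.jb2e naming shown above.

import glob
import os
import subprocess

def extract_with_pdfimages(pdf_path, out_dir="images_found"):
    os.makedirs(out_dir, exist_ok=True)
    prefix = os.path.join(out_dir, "img")
    # dump everything in native format, including the jbig2 pairs
    subprocess.run(["pdfimages", "-all", pdf_path, prefix], check=True)
    # convert each .jb2g/.jb2e pair to a viewable png
    for jb2g in glob.glob(prefix + "*.jb2g"):
        jb2e = jb2g[:-1] + "e"  # the matching header file
        png = jb2g[:-5] + ".png"
        subprocess.run(["jbig2dec", "-t", "png", "-o", png, jb2g, jb2e], check=True)

extract_with_pdfimages("myfile.pdf")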
Much easier solution:
Use the poppler-utils package. To install it, use Homebrew (Homebrew is macOS-specific, but you can find the poppler-utils package for Windows or Linux here: https://poppler.freedesktop.org/). The first line of code below installs poppler-utils using Homebrew. After installation, the second line (run from the command line) extracts images from a PDF file and names them "image*". To run this from within Python, use the os or subprocess module. The third line is code using the os module; beneath that is an example with subprocess (Python 3.5 or later for the run() function). More info here: https://www.cyberciti.biz/faq/easily-extract-images-from-pdf-file/
brew install poppler
pdfimages file.pdf image
import os
os.system('pdfimages file.pdf image')
or
import subprocess
subprocess.run('pdfimages file.pdf image', shell=True)
I did this for my own program, and found that the best library to use was PyMuPDF. It lets you find out the "xref" numbers of each image on each page, and use them to extract the raw image data from the PDF.
import fitz
from PIL import Image
import io

filePath = "path/to/file.pdf"

# opens doc using PyMuPDF
doc = fitz.Document(filePath)

# loads the first page
page = doc.loadPage(0)

# the first item of each entry in the page's image list is the xref;
# check the PyMuPDF docs under getImageList()
xref = page.getImageList()[0][0]

# gets the image as a dict, check docs under extractImage
baseImage = doc.extractImage(xref)

# gets the raw image data from the dictionary and wraps it in a
# BytesIO object before using PIL to open it
image = Image.open(io.BytesIO(baseImage['image']))

# displays image for good measure
image.show()
Definitely check out the docs, though.
After some searching I found the following script, which works really well with my PDFs. It only tackles JPG, but it worked perfectly with my unprotected files. Also, it does not require any outside libraries.
Not to take any credit: the script originates from Ned Batchelder, not me.
Python 3 code: extract JPGs from PDFs. Quick and dirty.
import sys

with open(sys.argv[1], "rb") as file:
    file.seek(0)
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
After reading the posts using PyPDF2:
The error NotImplementedError: unsupported filter /DCTDecode when using @sylvain's code must come from the method .getData(); it is solved by using ._data instead, as @Alex Paramonov did.
So far I have only met "DCTDecode" cases, but I am sharing the adapted code, which includes remarks from the different posts: zlib handling from @Alex Paramonov, and sub_obj['/Filter'] being a list, from @mxl.
Hope it can help the PyPDF2 users. Here is the code:
import sys
import PyPDF2, traceback
import zlib

try:
    from PIL import Image
except ImportError:
    import Image

pdf_path = 'path_to_your_pdf_file.pdf'
input1 = PyPDF2.PdfFileReader(open(pdf_path, "rb"))
nPages = input1.getNumPages()

for i in range(nPages):
    page0 = input1.getPage(i)

    if '/XObject' in page0['/Resources']:
        try:
            xObject = page0['/Resources']['/XObject'].getObject()
        except:
            xObject = []

        for obj_name in xObject:
            sub_obj = xObject[obj_name]
            if sub_obj['/Subtype'] == '/Image':
                zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
                if zlib_compressed:
                    sub_obj._data = zlib.decompress(sub_obj._data)

                size = (sub_obj['/Width'], sub_obj['/Height'])
                data = sub_obj._data  # sub_obj.getData()
                try:
                    if sub_obj['/ColorSpace'] == '/DeviceRGB':
                        mode = "RGB"
                    elif sub_obj['/ColorSpace'] == '/DeviceCMYK':
                        mode = "CMYK"
                        # will cause errors when saving (might need convert to RGB first)
                    else:
                        mode = "P"

                    fn = 'p%03d-%s' % (i + 1, obj_name[1:])
                    if '/Filter' in sub_obj:
                        if '/FlateDecode' in sub_obj['/Filter']:
                            img = Image.frombytes(mode, size, data)
                            img.save(fn + ".png")
                        elif '/DCTDecode' in sub_obj['/Filter']:
                            img = open(fn + ".jpg", "wb")
                            img.write(data)
                            img.close()
                        elif '/JPXDecode' in sub_obj['/Filter']:
                            img = open(fn + ".jp2", "wb")
                            img.write(data)
                            img.close()
                        elif '/CCITTFaxDecode' in sub_obj['/Filter']:
                            img = open(fn + ".tiff", "wb")
                            img.write(data)
                            img.close()
                        elif '/LZWDecode' in sub_obj['/Filter']:
                            img = open(fn + ".tif", "wb")
                            img.write(data)
                            img.close()
                        else:
                            print('Unknown format:', sub_obj['/Filter'])
                    else:
                        img = Image.frombytes(mode, size, data)
                        img.save(fn + ".png")
                except:
                    traceback.print_exc()
    else:
        print("No image found for page %d" % (i + 1))
I installed ImageMagick on my server and then ran command-line calls through Popen:
#!/usr/bin/python

import sys
import os
import subprocess
import settings

IMAGE_PATH = os.path.join(settings.MEDIA_ROOT, 'pdf_input')

def extract_images(pdf):
    output = 'temp.png'
    cmd = 'convert ' + os.path.join(IMAGE_PATH, pdf) + ' ' + os.path.join(IMAGE_PATH, output)
    subprocess.Popen(cmd.split(), stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
This will create an image for every page and store them as temp-0.png, temp-1.png, and so on.
This is only 'extraction' if you have a PDF with only images and no text.
I added all of those together in PyPDFTK here.
My own contribution is handling of /Indexed files as such:
for obj in xObject:
    if xObject[obj]['/Subtype'] == '/Image':
        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
        color_space = xObject[obj]['/ColorSpace']
        if isinstance(color_space, pdf.generic.ArrayObject) and color_space[0] == '/Indexed':
            color_space, base, hival, lookup = [v.getObject() for v in color_space]  # pg 262
        # img_modes maps color spaces to PIL modes (defined elsewhere in PyPDFTK)
        mode = img_modes[color_space]

        if xObject[obj]['/Filter'] == '/FlateDecode':
            data = xObject[obj].getData()
            img = Image.frombytes(mode, size, data)
            if color_space == '/Indexed':
                img.putpalette(lookup.getData())
                img = img.convert('RGB')
            img.save("{}{:04}.png".format(filename_prefix, i))
Note that when /Indexed files are found, you can't just compare /ColorSpace to a string, because it comes as an ArrayObject. So, we have to check the array and retrieve the indexed palette (lookup in the code) and set it in the PIL Image object, otherwise it stays uninitialized (zero) and the whole image shows as black.
My first instinct was to save them as GIFs (which is an indexed format), but my tests showed that PNGs were smaller and looked the same.
I found those types of images when printing to PDF with Foxit Reader PDF Printer.
As of February 2019, the solution given by @sylvain (at least on my setup) does not work without a small modification: xObject[obj]['/Filter'] is not a value but a list, so to make the script work I had to modify the format checking as follows:
import PyPDF2, traceback
from PIL import Image

input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
print nPages

for i in range(nPages):
    print i
    page0 = input1.getPage(i)
    try:
        xObject = page0['/Resources']['/XObject'].getObject()
    except:
        xObject = []

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            try:
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                elif xObject[obj]['/ColorSpace'] == '/DeviceCMYK':
                    mode = "CMYK"
                    # will cause errors when saving
                else:
                    mode = "P"

                fn = 'p%03d-%s' % (i + 1, obj[1:])
                print '\t', fn
                if '/FlateDecode' in xObject[obj]['/Filter']:
                    img = Image.frombytes(mode, size, data)
                    img.save(fn + ".png")
                elif '/DCTDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jpg", "wb")
                    img.write(data)
                    img.close()
                elif '/JPXDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jp2", "wb")
                    img.write(data)
                    img.close()
                elif '/LZWDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".tif", "wb")
                    img.write(data)
                    img.close()
                else:
                    print 'Unknown format:', xObject[obj]['/Filter']
            except:
                traceback.print_exc()
You could use the pdfimages command in Ubuntu as well.
Install the poppler lib using the commands below.
sudo apt install poppler-utils
sudo apt-get install python-poppler
pdfimages file.pdf image
The files created are (e.g., if there are two images in the pdf):
image-000.png
image-001.png
It works! Now you can use subprocess.run to run this from Python.
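For instance (a minimal sketch; passing the arguments as a list avoids shell quoting issues):

import subprocess

subprocess.run(["pdfimages", "-all", "file.pdf", "image"], check=True)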
Try the code below; it will extract all images from the PDF.
import sys
import PyPDF2
from PIL import Image

pdf = sys.argv[1]
print(pdf)
input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))

for x in range(0, input1.numPages):
    xObject = input1.getPage(x)
    xObject = xObject['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            print(size)
            data = xObject[obj]._data
            # print(data)
            print(xObject[obj]['/Filter'])
            if xObject[obj]['/Filter'][0] == '/DCTDecode':
                img_name = str(x) + ".jpg"
                print(img_name)
                img = open(img_name, "wb")
                img.write(data)
                img.close()
    print(str(x) + " is done")
I rewrote these solutions as a single Python class.
It should be easy to work with. If you notice a new "/Filter" or "/ColorSpace", just add it to the internal dictionaries (a sketch of that idea follows the requirements list below).
https://github.com/survtur/extract_images_from_pdf
Requirements:
Python3.6+
PyPDF2
PIL
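The dictionary idea, in a hypothetical minimal form (these names are illustrative, not the repo's actual API), looks like this:

# hypothetical sketch of the filter/colorspace dispatch tables
FILTER_TO_EXT = {
    "/DCTDecode": ".jpg",
    "/JPXDecode": ".jp2",
    "/CCITTFaxDecode": ".tiff",
}
COLORSPACE_TO_MODE = {
    "/DeviceRGB": "RGB",
    "/DeviceCMYK": "CMYK",
    "/DeviceGray": "L",
}

def pick_extension(pdf_filter):
    # unknown filters fall back to decoding + PNG re-encoding
    return FILTER_TO_EXT.get(pdf_filter, ".png")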
With pypdfium2 (v4):
import pypdfium2.__main__ as pdfium_cli
pdfium_cli.api_main(["extract-images", "input.pdf", "-o", "output_dir"])
There are some options to choose between different extraction strategies (see pypdfium2 extract-images --help).
Actual non-CLI Python APIs are available as well. The CLI's implementation demonstrates them (see the docs for details):
# assuming `args` is a given options set (e.g. an argparse namespace)

import traceback
import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c

pdf = pdfium.PdfDocument(args.input)

images = []
for i in args.pages:
    page = pdf.get_page(i)
    obj_searcher = page.get_objects(
        filter = (pdfium_c.FPDF_PAGEOBJ_IMAGE, ),
        max_depth = args.max_depth,
    )
    images += list(obj_searcher)

n_digits = len(str(len(images)))
for i, image in enumerate(images):
    prefix = args.output_dir / ("%s_%0*d" % (args.input.stem, n_digits, i+1))
    try:
        if args.use_bitmap:
            pil_image = image.get_bitmap(render=args.render).to_pil()
            pil_image.save("%s.%s" % (prefix, args.format))
        else:
            image.extract(prefix, fb_format=args.format, fb_render=args.render)
    except pdfium.PdfiumError:
        traceback.print_exc()
Note: Unfortunately, PDFium's public image extraction APIs are quite limited, so PdfImage.extract() is by far not as smart as pikepdf. If you only need the image bitmap and do not intend to save the image, PdfImage.get_bitmap() should be quite fine, though.
(Disclaimer: I'm the author of pypdfium2)
The following code is an updated version for PyMuPDF:
import fitz

doc = fitz.open("/Users/vignesh/Downloads/ViewJournal2244.pdf")
images_per_page = {}

for page_index in range(len(doc)):
    page = doc[page_index]
    images = []
    for image_info in page.get_images():
        rects = page.get_image_rects(image_info)
        pix = page.get_pixmap(matrix=fitz.Identity, clip=rects[0], dpi=None,
                              colorspace=fitz.csRGB, alpha=True, annots=True)
        images.append(pix.tobytes())
    images_per_page[page_index] = images
This worked for me:
import PyPDF2
from PyPDF2 import PdfFileReader

# Open the PDF file
pdf_file = open(r"C:\Users\file.pdf", 'rb')
pdf_reader = PdfFileReader(pdf_file)

# Iterate through each page
for page_num in range(pdf_reader.numPages):
    page = pdf_reader.getPage(page_num)
    xObject = page['/Resources']['/XObject'].getObject()

    # Iterate through each image on the page
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()

            # You can now save the image data to a file
            with open(f'C:\\Users\\filepath\\{obj}.jpg', 'wb') as img_file:
                img_file.write(data)

# Close the PDF file
pdf_file.close()
First, install pdf2image:
pip install pdf2image==1.14.0
Then follow the code below to extract PDF pages as images.
file_path="file path of PDF"
info = pdfinfo_from_path(file_path, userpw=None, poppler_path=None)
maxPages = info["Pages"]
image_counter = 0
if maxPages > 10:
for page in range(1, maxPages, 10):
pages = convert_from_path(file_path, dpi=300, first_page=page,
last_page=min(page+10-1, maxPages))
for page in pages:
page.save(image_path+'/' + str(image_counter) + '.png', 'PNG')
image_counter += 1
else:
pages = convert_from_path(file_path, 300)
for i, j in enumerate(pages):
j.save(image_path+'/' + str(i) + '.png', 'PNG')
Hope this helps anyone looking for an easy way to convert PDF files to images, page by page.