I have two PDFs, file.pdf and BACKGROUND.pdf. I want to use BACKGROUND.pdf as the background for file.pdf with Python.
I don't know where to start; I'm a beginner Python developer.
Maybe you could convert your first PDF to an image, as described here:
https://www.geeksforgeeks.org/convert-pdf-to-image-using-python/
# Import the converter
from pdf2image import convert_from_path
# Render every page of the PDF as a PIL image
images = convert_from_path('example.pdf')
for i in range(len(images)):
    # Save each page as its own JPEG file
    images[i].save('page'+ str(i) +'.jpg', 'JPEG')
Then use the image as background for the second file.
Or use another utility:
How to programmatically add a background to a pdf?
qpdf --underlay "background.pdf" -- file.pdf output.pdf
Should work for most cases.
Python users can use pikepdf (https://github.com/pikepdf/pikepdf), a wrapper around qpdf.
The documentation is at https://pikepdf.readthedocs.io/en/latest/
See especially the section "Overlays, underlays, watermarks, n-up", which covers this case.
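If you would rather not learn a new library yet, the qpdf command above can also be launched from a Python script. This is only a minimal sketch: it assumes the qpdf executable is installed and on your PATH, and the function names build_underlay_command and add_background are made up for illustration.

```python
import subprocess

def build_underlay_command(content_pdf, background_pdf, output_pdf):
    """Return the argument list for the qpdf call shown above."""
    return ["qpdf", "--underlay", background_pdf, "--", content_pdf, output_pdf]

def add_background(content_pdf, background_pdf, output_pdf):
    """Run qpdf; raises CalledProcessError if qpdf exits with a non-zero status."""
    command = build_underlay_command(content_pdf, background_pdf, output_pdf)
    subprocess.run(command, check=True)
```

With the file names from the question, the call would be add_background("file.pdf", "BACKGROUND.pdf", "output.pdf").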
Context:
I have PDF files I'm working with.
I'm using OCR to extract the text from these documents, and to do that I have to convert my PDF files to images.
I currently use the convert_from_path function of the pdf2image module, but it is very slow (9 minutes for a 9-page PDF).
Problem:
I am looking for a way to accelerate this process or another way to convert my PDF files to images.
Additional info:
I am aware that there is a thread_count parameter in the function but after several tries it doesn't seem to make any difference.
This is the whole function I am using:
import os
from pdf2image import convert_from_path

def pdftoimg(fic, output_folder):
    # Render all the pages of the PDF (intermediate .ppm files are written to output_folder)
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=r'C:\Users\Vincent\Documents\PDF\poppler-21.02.0\Library\bin')
    image_counter = 0
    # Iterate through all the pages stored above and save each one as a JPEG
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')
        image_counter = image_counter + 1
    # Remove the intermediate .ppm files
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))
Link to the convert_from_path reference.
I found an answer to that problem using another module called fitz, which is the Python binding for MuPDF (distributed as PyMuPDF).
First of all install PyMuPDF:
The documentation can be found here but for windows users it's rather simple:
pip install PyMuPDF
Then import the fitz module:
import fitz
print(fitz.__doc__)
>>>PyMuPDF 1.18.13: Python bindings for the MuPDF 1.18.0 library.
>>>Version date: 2021-05-05 06:32:22.
>>>Built for Python 3.7 on win32 (64-bit).
Open your file and save every page as images:
The get_pixmap() method accepts different parameters that allow you to control the image (variation, resolution, color...), so I suggest that you read the documentation here.
def convert_pdf_to_image(fic):
    # Open your file
    doc = fitz.open(fic)
    # Iterate through the pages of the document and create an RGB image of each page
    for page in doc:
        pix = page.get_pixmap()
        pix.save("page-%i.png" % page.number)
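As one example of those parameters, the resolution can be raised by passing a scaling matrix to get_pixmap(). The following is a minimal sketch rather than part of the original answer: it assumes PyMuPDF is installed and relies on an unscaled page rendering at 72 dpi. The names zoom_for_dpi and convert_pdf_to_image_dpi, and the 300 dpi default, are illustrative choices.

```python
def zoom_for_dpi(dpi):
    """Scale factor relative to the 72 dpi that an unscaled render produces."""
    return dpi / 72.0

def convert_pdf_to_image_dpi(fic, dpi=300):
    # PyMuPDF is imported here so zoom_for_dpi can be used without it installed
    import fitz
    zoom = zoom_for_dpi(dpi)
    # Same zoom factor horizontally and vertically
    matrix = fitz.Matrix(zoom, zoom)
    doc = fitz.open(fic)
    for page in doc:
        pix = page.get_pixmap(matrix=matrix)
        pix.save("page-%i.png" % page.number)
```

For OCR a higher dpi usually helps recognition, but rendering time and file size grow with it, so it is worth trying a few values.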
Hope this helps anyone else.
I want to accept an image through a form submission and validate server side, where Python is running, that the submitted file really is an image. Is there a simple way to do this in pure Python?
A simple and naive way to do it would be with libmagic (for example the bindings at https://github.com/ahupp/python-magic). A better way, although it is not pure Python and is a very extensive library, would be to use PIL: http://www.pythonware.com/products/pil/.
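The signature check that libmagic performs can be approximated in pure Python for a handful of formats. This is a minimal sketch, not a replacement for libmagic or PIL: it only inspects the first bytes, so a file that passes can still be truncated or corrupt. The function name sniff_image_type and the choice of formats are illustrative.

```python
# Leading bytes ("magic numbers") of a few common image formats
SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)

def sniff_image_type(data):
    """Return a format name if data starts with a known signature, else None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

On the server you would pass it the first few bytes of the upload and reject the submission when it returns None.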
Use PIL:
import sys
from PIL import Image
for infile in sys.argv[1:]:
    try:
        im = Image.open(infile)
        print(infile, im.format, "%dx%d" % im.size, im.mode)
    except IOError:
        # PIL could not identify this file as an image; skip it
        pass
From the docs:
The Python Imaging Library supports a wide variety of image file
formats. To read files from disk, use the open function in the Image
module. You don't have to know the file format to open a file. The
library automatically determines the format based on the contents of
the file.
I want to convert some multi-page .tif or .pdf files to individual .png images. From the command line (using ImageMagick) I just do:
convert multi_page.pdf file_out.png
And I get all the pages as individual images (file_out-0.png, file_out-1.png, ...)
I would like to handle this file conversion within Python. Unfortunately, PIL cannot read .pdf files, so I want to use PythonMagick. I tried:
import PythonMagick
im = PythonMagick.Image('multi_page.pdf')
im.write("file_out%d.png")
or just
im.write("file_out.png")
But I only get 1 page converted to png.
Of course I could load each page individually and convert them one by one, but there must be a way to do them all at once?
ImageMagick is not memory efficient, so if you try to read a large PDF (100 pages or so), the memory requirement will be huge and it might crash or seriously slow down your system. So reading all pages at once with PythonMagick is a bad idea after all; it's not safe.
So for PDFs, I ended up doing it page by page. For that I first need the number of pages, which I get using pyPdf; it's reasonably fast:
import pyPdf
import PythonMagick
pdf_im = pyPdf.PdfFileReader(open('multi_page.pdf', "rb"))
npage = pdf_im.getNumPages()
for p in range(npage):
    # "multi_page.pdf[p]" selects a single page (0-based)
    im = PythonMagick.Image('multi_page.pdf[' + str(p) + ']')
    im.write('file_out-' + str(p) + '.png')
A more complete example based on the answer by Ivo Flipse and http://p-s.co.nz/wordpress/pdf-to-png-using-pythonmagick/
This uses a higher resolution and uses PyPDF2 instead of the older pyPdf.
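import sys
import PyPDF2
import PythonMagick
pdffilename = sys.argv[1]
# PdfFileReader/getNumPages is the older PyPDF2 API; PyPDF2 3.x uses PdfReader and len(reader.pages)
pdf_im = PyPDF2.PdfFileReader(open(pdffilename, "rb"))
npage = pdf_im.getNumPages()
print('Converting %d pages.' % npage)
for p in range(npage):
    im = PythonMagick.Image()
    # Set the density (dpi) before reading so the page is rasterized at 300 dpi
    im.density('300')
    # "[p]" selects a single page (0-based)
    im.read(pdffilename + '[' + str(p) + ']')
    im.write('file_out-' + str(p) + '.png')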
I had the same problem, and as a workaround I called ImageMagick directly:
import subprocess
params = ['convert', 'src.pdf', 'out.png']
subprocess.check_call(params)
I am using gfx to convert a particular page of a PDF to a .png image, but the resulting image is of very poor quality. I need to use gfx and can't use any other module. The code used is:
import gfx
# Raw strings keep "\n" in the Windows paths from being read as a newline
pdf_loc = r"C:\new.pdf"
page_number = 12
doc = gfx.open('pdf', pdf_loc)
page = doc.getPage(page_number)
img = gfx.ImageList()
img.setparameter("antialise", "1") # turn on antialiasing
img.startpage(page.width, page.height)
page.render(img)
img.endpage()
output_loc = r"C:\newimg.png"
img.save(output_loc)
You can use swfrender.
Add this:
gfx.setparameter("zoom", "400")
You can learn more on http://wiki.swftools.org/wiki/Python_gfx_tutorial
So the state I'm in released a bunch of data in PDF form, but to make matters worse, most (all?) of the PDFs appear to be letters typed in Office, printed/faxed, and then scanned (our government at its best eh?). At first I thought I was crazy, but then I started seeing numerous PDFs that are 'tilted', like someone didn't get them on the scanner properly. So, I figured the next best thing to getting the actual text out of them would be to turn each page into an image.
Obviously this needs to be automated, and I'd prefer to stick with Python if possible. If Ruby or Perl have some form of implementation that's just too awesome to pass up, I can go that route. I've tried pyPDF for text extraction, that obviously didn't do me much good. I've tried swftools, but the images I'm getting from that are just shy of completely unusable. It just seems like the fonts get ruined in the conversion. I also don't even really care about the image format on the way out, just as long as they're relatively lightweight, and readable.
If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image, you should extract the image from the PDF. Most likely, all of the data in the PDF is essentially one giant image, wrapped in PDF verbosity to make it readable in Acrobat.
You should try the simple expedient of simply finding the image in the PDF, and copying the bytes out: Extracting JPGs from PDFs. The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.
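The idea of copying the bytes out can be sketched as follows. This is not the code from the linked post, just an illustration of the same approach: it assumes the scans are stored as unmodified JPEG data, and it scans the raw file for a JPEG start marker right after a "stream" keyword and a JPEG end marker before the matching "endstream". The function name extract_jpegs is made up for this sketch.

```python
JPEG_START = b"\xff\xd8"
JPEG_END = b"\xff\xd9"

def extract_jpegs(pdf_bytes):
    """Return a list of byte strings that look like embedded JPEG streams."""
    images = []
    pos = 0
    while True:
        stream_at = pdf_bytes.find(b"stream", pos)
        if stream_at < 0:
            break
        # A JPEG stream begins with the start marker shortly after "stream"
        start = pdf_bytes.find(JPEG_START, stream_at, stream_at + 20)
        if start < 0:
            pos = stream_at + len(b"stream")
            continue
        end_stream = pdf_bytes.find(b"endstream", start)
        if end_stream < 0:
            break
        # The JPEG end marker is the last one before "endstream"
        end = pdf_bytes.rfind(JPEG_END, start, end_stream)
        if end >= 0:
            images.append(pdf_bytes[start:end + len(JPEG_END)])
        pos = end_stream + len(b"endstream")
    return images
```

You would read the PDF with open(path, "rb").read(), pass the bytes in, and write each returned item to its own .jpg file. Images stored with other compression filters will not be found this way.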
You could call e.g. pdftoppm from the command-line (or using Python's subprocess module) and then convert the resulting PPM files to the desired format using e.g. ImageMagick (again, using subprocess or some bindings if they exist).
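A minimal sketch of the subprocess route, assuming the poppler version of pdftoppm is installed and on the PATH. That version has a -png switch, so the separate ImageMagick step can be skipped when PNG output is acceptable. The function names and the 150 dpi default are illustrative.

```python
import subprocess

def build_pdftoppm_command(pdf_path, output_prefix, dpi=150):
    """Argument list asking pdftoppm to write one PNG per page at the given dpi."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, output_prefix]

def pdf_to_png_pages(pdf_path, output_prefix, dpi=150):
    """Run pdftoppm; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True)
```

pdftoppm appends the page number to output_prefix, so pdf_to_png_pages("scan.pdf", "page") leaves a numbered series of page files in the current directory.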
Ghostscript is ideal for converting PDF files to images. It is reliable and has many configurable options. It's also available under either the GPL license or a commercial license. You can call it from the command line or use its native API. For more information:
Ghostscript Main Website
Ghostscript docs on Command line usage
Another stackoverflow thread that provides some examples of invoking Ghostscript's command line interface from Python
Ghostscript API Documentation
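For reference, a typical command-line invocation can be assembled in Python like this. It is a sketch under assumptions: the executable is called gs (on Windows it is usually gswin64c or gswin32c), and the png16m device, the 300 dpi resolution and the output pattern are example choices rather than recommendations from the answer above. The helper name build_gs_command is made up.

```python
import subprocess

def build_gs_command(pdf_path, output_pattern="page-%03d.png", dpi=300, executable="gs"):
    """Argument list for rendering every page of a PDF to a 24-bit PNG."""
    return [
        executable,
        "-dSAFER", "-dBATCH", "-dNOPAUSE",  # non-interactive batch run
        "-sDEVICE=png16m",                  # 24-bit colour PNG output device
        "-r%d" % dpi,                       # resolution in dpi
        "-sOutputFile=" + output_pattern,   # %03d is replaced by the page number
        pdf_path,
    ]

def pdf_to_png_with_gs(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_gs_command(pdf_path, **options), check=True)
```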
Here's an alternative approach to turning a .pdf file into images: use an image printer. I've successfully used the function below to "print" PDFs to JPEG images with ImagePrinter Pro. However, there are MANY image printers out there; pick the one you like. Some of the code may need to be altered slightly based on the image printer you pick and the standard file saving format that image printer uses.
import os
import time
import win32api

def pdf_to_jpg(pdfPath, pages):
    # print pdf using jpg printer
    # 'pages' is the number of pages in the pdf
    filepath = pdfPath.rsplit('/', 1)[0]
    filename = pdfPath.rsplit('/', 1)[1]
    # print pdf to jpg using jpg printer
    tempprinter = "ImagePrinter Pro"
    printer = '"%s"' % tempprinter
    win32api.ShellExecute(0, "printto", filename, printer, ".", 0)
    # Poll for the last page's jpg (up to waitTime seconds) so the pdf
    # has finished printing to file before we return
    fileFound = False
    if pages > 1:
        jpgName = filename.split('.')[0] + '_' + str(pages - 1) + '.jpg'
    else:
        jpgName = filename.split('.')[0] + '.jpg'
    jpgPath = filepath + '/' + jpgName
    waitTime = 30
    for i in range(waitTime):
        if os.path.isfile(jpgPath):
            fileFound = True
            break
        else:
            time.sleep(1)
    # print Error if the file was never found
    if not fileFound:
        print("ERROR: " + jpgName + " wasn't found after " + str(waitTime)
              + " seconds")
    return jpgPath
The resulting jpgPath variable tells you the path of the last JPEG page printed from the PDF. If you need another page, you can easily add some logic that modifies the path to get earlier pages.
in pdf_to_jpg(pdfPath)
6 # 'pages' is the number of pages in the pdf
7 filepath = pdfPath.rsplit('/', 1)[0]
----> 8 filename = pdfPath.rsplit('/', 1)[1]
9
10 #print pdf to jpg using jpg printer
IndexError: list index out of range
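The IndexError in this traceback occurs when pdfPath contains no '/' (a bare file name, or a Windows path written with backslashes), because rsplit then returns a one-element list. A minimal sketch of a more tolerant split using os.path.split follows; the helper name split_pdf_path is made up here, and falling back to '.' for a missing directory is an assumption about the behavior you want.

```python
import os

def split_pdf_path(pdfPath):
    """Split a path into (directory, file name) without assuming a '/' is present."""
    filepath, filename = os.path.split(pdfPath)
    # A bare file name has no directory part; treat it as the current directory
    return (filepath or ".", filename)
```

Inside pdf_to_jpg, the two rsplit lines could then become filepath, filename = split_pdf_path(pdfPath).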
With Wand there are now excellent imagemagick bindings for Python that make this a very easy task.
Here is the code necessary for converting a single PDF file into a sequence of PNG images:
from wand.image import Image
input_path = "name_of_file.pdf"
output_name = "name_of_outfile_{index}.png"
# Open the PDF from input_path, rasterizing it at 300 dpi
source = Image(filename=input_path, resolution=300, width=2200)
images = source.sequence
for i in range(len(images)):
    # Save page i under its own numbered file name
    Image(images[i]).save(filename=output_name.format(index=i))