PDF to PNG in Python with pdftocairo


I have been looking for a good PDF-to-image converter for a long time. I need to convert a PDF to images so that I can print it with Qt. I'm programming in Python/PySide, so if I can convert the PDF to a series of (PNG) images with subprocess, I can print them without problems.
I managed to do this by calling convert.exe from ImageMagick. It works quite well, but it relies on Ghostscript, a big package that I want to avoid because it is more complex to integrate.
I also tried MuPDF, from the makers of Ghostscript, but it does not seem to have stdin and stdout options. That's a pity, because I first have to save my file, open it with MuPDF, convert and save it, and then load the result back into my Python application. It should be possible without all those steps!
Today I started experimenting with Poppler's pdftocairo. I assumed it could convert my (multi-page) PDF to a series of images and pipe them to stdout. Unfortunately it doesn't, and I ran into two problems:
It complains that it can only export to stdout when the -singlefile argument is also used. How can I export all pages to stdout?
When I export to stdout I get the error: 'Error opening output file fd://0.png\r\n'
Converting a PDF from stdin to image files is no problem at all.
This is my code, which also triggers the error about opening the output file:
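import subprocess
# Open the PDF in binary mode so the bytes reach pdftocairo unchanged
with open('test.pdf', 'rb') as pdf:
    p = subprocess.Popen(['pop/pdftocairo.exe', '-singlefile', '-png', '-', '-'], stdin=pdf, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() drains both pipes, which avoids a deadlock on large output
    data, error = p.communicate()
print(error)
print(data)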
I downloaded a pre-compiled pdftocairo from: http://blog.alivate.com.au/poppler-windows/
The documentation of pdftocairo's command-line options can be found here: http://manpages.ubuntu.com/manpages/precise/man1/pdftocairo.1.html
Hopefully you can help me out to make this work!
Update
As you can see in the answers below, pdftocairo is buggy and does not work correctly when writing to stdout. pdftoppm does work: it returns the converted pages as a single bytes object:
import subprocess
with open('test.pdf', 'rb') as pdf:
    p = subprocess.Popen(['pop/pdftoppm.exe', '-png'], stdin=pdf, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    data, error = p.communicate()
The only thing I still need to do is split the bytes object into multiple files.
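One possible way to do that split is sketched below. It assumes that pdftoppm writes the PNG files for the pages back to back into stdout, with nothing in between; that is an assumption about the tool's output, not something confirmed here. The function name split_png_stream is made up for this example. Instead of searching for the PNG signature (which could in theory occur inside compressed image data), it walks the PNG chunk structure until it reaches each IEND chunk.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def split_png_stream(data):
    """Split a bytes object holding back-to-back PNG files into a list of PNGs."""
    images = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 8] != PNG_SIGNATURE:
            raise ValueError("no PNG signature at offset %d" % pos)
        start = pos
        pos += 8
        while True:
            if pos + 8 > len(data):
                raise ValueError("truncated PNG chunk header")
            # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC
            length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
            pos += 8 + length + 4
            if pos > len(data):
                raise ValueError("truncated PNG chunk body")
            if chunk_type == b"IEND":
                break  # IEND is always the last chunk of a PNG file
        images.append(data[start:pos])
    return images
```

With the data object from the snippet above, each element of split_png_stream(data) could then be written to its own file opened in 'wb' mode, or loaded straight into Qt from memory.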

It's a bug in pdftocairo.
The output filename is first passed to getOutputFilename, which returns the special string fd://0 as placeholder for stdout.
But later that string is passed to getImageFilename, which unconditionally appends an extension to the filename, so the later comparison fails and the program tries to open the literal file fd://0.png instead of using stdout.
Unfortunately, the only thing you can do is file a bug report.
As for exporting a multipage document to stdout, that's not supported at all, and it wouldn't work with file types like png or jpeg anyway, because these formats don't support multipage documents. It does work for svg, pdf, eps and ps output files, as these formats do support multipage documents (and the filename is processed correctly for these).

I thought it would be easier to just use os.system and pass the whole command string.
This assumes there are "pdfs" and "imgs" folders; change accordingly.
import os
import glob
for pdf_file in glob.glob(os.path.join("pdfs", "*.pdf")):
    # Output root: imgs/<PDF name without extension>
    out_root = os.path.join("imgs", os.path.splitext(os.path.basename(pdf_file))[0])
    cmd_str = 'pdftocairo.exe -jpeg "%s" "%s"' % (pdf_file, out_root)
    print(cmd_str)
    os.system(cmd_str)
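As a variation on the same idea, the command can be passed to subprocess as a list of arguments, which sidesteps manual quoting of file names that contain spaces. This is only a sketch: the helper names build_pdftocairo_command and convert_all are invented here, and the executable name "pdftocairo" is assumed to be on the PATH.

```python
import glob
import os
import subprocess


def build_pdftocairo_command(pdf_file, out_dir, exe="pdftocairo"):
    """Return the argument list that converts one PDF to JPEG files in out_dir."""
    stem = os.path.splitext(os.path.basename(pdf_file))[0]
    return [exe, "-jpeg", pdf_file, os.path.join(out_dir, stem)]


def convert_all(pdf_dir="pdfs", out_dir="imgs"):
    """Run pdftocairo once for every PDF found in pdf_dir."""
    for pdf_file in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
        # check=True raises CalledProcessError if pdftocairo exits with an error
        subprocess.run(build_pdftocairo_command(pdf_file, out_dir), check=True)
```

Calling convert_all() with no arguments uses the same "pdfs" and "imgs" folders as above.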

Related

How do you use Python Ghostscript's high-level interface to convert a .pdf file into multiple .png files?

I am trying to convert a .pdf file into several .png files using Ghostscript in Python. The other answers on here were pretty old hence this new thread.
The following code was given as an example on pypi.org of the 'high level' interface, and I am trying to model my code after the example code below.
import sys
import locale
import ghostscript
args = [
    "ps2pdf", # actual value doesn't matter
    "-dNOPAUSE", "-dBATCH", "-dSAFER",
    "-sDEVICE=pdfwrite",
    "-sOutputFile=" + sys.argv[1],
    "-c", ".setpdfwrite",
    "-f", sys.argv[2]
]
# arguments have to be bytes, encode them
encoding = locale.getpreferredencoding()
args = [a.encode(encoding) for a in args]
ghostscript.Ghostscript(*args)
Can someone explain what this code is doing? And can it be used somehow to convert a .pdf into .png files?
I am new to this and am truly confused. Thanks so much!

That's calling Ghostscript, obviously. Judging from the arguments, it's not spawning a process; it's linked (either dynamically or statically) to the Ghostscript library.
The args are Ghostscript arguments. These are documented in the Ghostscript documentation, which is available online. Because it mimics the command line interface, where the first argument is the calling program, the first argument here is meaningless and can be anything you want (as the comment says).
The next three arguments set NOPAUSE so the entire input is processed without pausing between pages, set BATCH so that on completion Ghostscript exits instead of returning to the interactive prompt, and turn on SAFER (which prevents some potentially dangerous operations and is now the default anyway).
Then it selects a device. In Ghostscript (due to the PostScript language) devices are what actually output stuff. In this case the device selected is the pdfwrite device, which outputs PDF.
Then there's the OutputFile, you can probably guess that this is the name (and path) of the file where the output is to be written.
The next three arguments, -c .setpdfwrite -f, are frankly archaic and pointless. They were once recommended when using the pdfwrite device (and only the pdfwrite device), but they have no useful effect these days.
The very last argument is, of course, the input file.
Certainly you can use Ghostscript to render PDF files to PNG. You want to use one of the PNG devices; there are several, depending on what colour depth you want to support. Unless you have some unusual requirement, just use png16m. If your input file contains more than one page, you'll want the OutputFile to contain %d so that it writes one file per page.
More details on all of this can, of course, be found in the documentation.
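To make that concrete, here is a rough sketch of the switches for PNG output, run through the Ghostscript executable with subprocess. The helper names, the default output pattern, the 150 dpi resolution and the executable name "gs" (it would be something like gswin64c on Windows) are all assumptions for illustration.

```python
import subprocess


def build_gs_png_args(pdf_path, out_pattern="page-%03d.png", dpi=150):
    """Return Ghostscript switches that render each page of pdf_path to a PNG."""
    return [
        "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",              # 24-bit RGB PNG device
        "-r%d" % dpi,                   # resolution in dots per inch
        "-sOutputFile=" + out_pattern,  # %03d is replaced by the page number
        pdf_path,
    ]


def render_pdf_to_pngs(pdf_path, gs_executable="gs", **kwargs):
    """Run the Ghostscript executable; raises CalledProcessError on failure."""
    subprocess.run([gs_executable] + build_gs_png_args(pdf_path, **kwargs), check=True)
```

The same switches should also work with the Python ghostscript package from the question, placed after the dummy first element of args and encoded to bytes in the same way.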

How to open and close a PDF file from within python

I can open a PDF file from within Python using subprocess.Popen(), but I am having trouble closing it again. How can I close an open PDF file using Python? My code is:
# open the PDF file
plot = subprocess.Popen('open %s' % filename, shell=True)
# user inputs a comment (will subsequently be saved in a file)
comment = raw_input('COMMENT: ')
# close the PDF file
#psutil.Process(plot.pid).get_children()[0].kill()
plot.kill()
Edit: I can close the PDF immediately after opening it (using plot.kill()) but this does not work if there is another command between opening the PDF and 'killing' it. Any help would be great - thanks in advance.

For me, this one works fine (inspired by this). Perhaps, instead of using 'open,' you can use a direct command for the PDF reader? Commands like 'open' tend to make a new process and then shut down immediately. I don't know your environment or anything, but for me, on Linux, this worked:
import subprocess
import signal
import os
filename="test.pdf"
plot = subprocess.Popen("evince '%s'" % filename, stdout=subprocess.PIPE,
                        shell=True, preexec_fn=os.setsid)
input("test")
os.killpg(os.getpgid(plot.pid), signal.SIGTERM)
On macOS, this will probably work if you change 'evince' to the path of the executable of your PDF reader. On Windows, os.setsid and os.killpg are not available, so a different approach is needed there.
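The process-group idea can also be written without a shell, as in the sketch below. It is POSIX only, the two function names are made up for this example, and start_new_session=True is used as the subprocess equivalent of preexec_fn=os.setsid.

```python
import os
import signal
import subprocess


def start_in_new_group(cmd):
    """Start cmd as the leader of a new session and process group (POSIX only)."""
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, timeout=5):
    """Send SIGTERM to every process in proc's group and return its exit status."""
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    return proc.wait(timeout=timeout)
```

For the question's case this might look like viewer = start_in_new_group(["evince", filename]), then reading the comment from the user, then terminate_group(viewer).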

Using python to carry out actions in the windows cmd

I have a program, which saves a canvas into a postscript file. The program then opens the file with IrfanView, where I can manually save it as a .png and then I can run another function from python, which does another operation with it and saves it as a .png again. My question is whether there is a way to either cut out the middle manual bit (where I have to click the save as button) or whether the saving from IrfanView can all be done through python code?
So far I've found out that I cannot save the canvas directly; whatever is on it (I'm using turtles) can only be saved as PostScript.
Also converting postscript to png or jpeg from within python also seems to be a bit of a tall order.
Note: Essentially I use Irfan to do the postscript to .png conversion, but I would like to hide this step of the process from the user, so it would be nice if the program could do it for me.
New Note: I have tried to use the python subprocess module to make a call to the cmd and use that to convert, but whenever I attempt to run the .Popen or the .call function I get an error - Access denied or file not found, either way the commands don't want to run from the python program. I even tried just opening a file from python, through the cmd only to get an error (the same command works when typed directly into the cmd):
WindowsError: [Error 193] %1 is not a valid Win32 application

Assuming that you have a postscript file named saved.ps that you want to convert to a png file with Ghostscript using the device pngalpha, you could do:
import subprocess
gspath = "/path/to/gs" # would be gspath = r"c:\path\to\gswin32c" on Windows...
infile = "saved.ps"
outfile = "output.png"
# -o sets the output file and also implies -dBATCH and -dNOPAUSE
gs = subprocess.Popen(["gs", "-o", outfile, "-sDEVICE=pngalpha",
                       "-dBATCH", infile], executable=gspath,
                      stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = gs.communicate()
if gs.returncode != 0:
    # do error processing, at least display out and err
    raise RuntimeError("Ghostscript failed: %r %r" % (out, err))

Writing PDFs to STDOUT with Python

I want to merge two PDF documents with Python (prepend a pre-made cover sheet to an existing document) and present the result to a browser. I'm currently using the PyPDF2 library which can perform the merge easily enough, but the PdfFileWriter class write() method only seems to support writing to a file object (must support write() and tell() methods). In this case, there is no reason to touch the filesystem; the merged PDF is already in memory and I just want to send a Content-type header and then the document to STDOUT (the browser via CGI). Is there a Python library better suited to writing a document to STDOUT than PyPDF2? Alternately, is there a way to pass STDIO as an argument to PdfFileWriter's write() method in such a way that it appears to write() as though it were a file handle?
Letting write() write the document to the filesystem and then opening the resulting file and sending it to the browser works, but is not an option in this case (aside from being terribly inelegant).
solution
Using mgilson's advice, this is how I got it to work in Python 2.7:
#!/usr/bin/python
import cStringIO
import sys
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()
###
# Actual PDF open/merge code goes here
###
output = cStringIO.StringIO()
merger.write(output)
print("Content-type: application/pdf\n")
sys.stdout.write(output.getvalue())
output.close()

Python supports an "in-memory" filetype via cStringIO.StringIO (or io.BytesIO, ... depending on python version). In your case, you could create an instance of one of those classes, pass that to the method which expects a file and then you can use the .getvalue() method to return the contents as a string (or bytes depending on python version). Once you have the contents as a string, you can simply print them or use sys.stdout.write to write the string to standard output.
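On Python 3 the same approach might look roughly like the sketch below, using io.BytesIO and the binary sys.stdout.buffer. The function names are invented for illustration, and write_func stands in for whatever method expects a file object (merger.write in the question).

```python
import io
import sys


def render_to_bytes(write_func):
    """Call write_func with an in-memory binary file and return what it wrote."""
    buffer = io.BytesIO()
    write_func(buffer)  # e.g. merger.write from the question
    return buffer.getvalue()


def build_cgi_response(body, content_type="application/pdf"):
    """Prepend minimal CGI headers to the body bytes."""
    headers = "Content-Type: %s\r\nContent-Length: %d\r\n\r\n" % (content_type, len(body))
    return headers.encode("ascii") + body


def send(response):
    """Write raw bytes to standard output, bypassing text encoding."""
    sys.stdout.buffer.write(response)
    sys.stdout.buffer.flush()
```

A CGI script would then call send(build_cgi_response(render_to_bytes(merger.write))).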

Send args to subprocess while using stdin

I'm trying to take a screenshot then run a command on that screenshot without saving to disk.
The actual command I want to run is visgrep image.png pattern.pat
visgrep must have two args: the image file and a .pat file.
Here is what I have so far.
p = subprocess.Popen(['import', '-crop', '305x42+1328+281', '-window', 'root', '-depth', '8', 'png:' ], stdout=subprocess.PIPE,)
cmd = ['visgrep']
subprocess.call(cmd, stdin=p.stdout)
Obviously this fails as visgrep must have two args.
So how can I do visgrep image.png pattern.pat but substituting 'image.png' with the output of ImageMagick's import?
Do I need to use xargs? Is there a better way to accomplish what I'm trying?

On Linux you can use /dev/stdin as the file name, but it does not always work. If it does not work with visgrep, you must use a temporary file (and there is no shame in that).
PS. shouldn't png: be png:-?
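A minimal sketch of the temporary-file route follows. The function name and the "{input}" placeholder convention are made up for this example; whether visgrep accepts the resulting path in that argument position is assumed from the question.

```python
import os
import subprocess
import tempfile


def run_on_captured_output(producer_cmd, consumer_cmd, suffix=".png"):
    """Run producer_cmd, save its stdout to a temp file, then run consumer_cmd on it.

    Every "{input}" item in consumer_cmd is replaced by the temp file path.
    Returns the CompletedProcess of the consumer, with stdout captured.
    """
    produced = subprocess.run(producer_cmd, stdout=subprocess.PIPE, check=True).stdout
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "capture" + suffix)
        with open(path, "wb") as handle:
            handle.write(produced)
        cmd = [path if arg == "{input}" else arg for arg in consumer_cmd]
        # The temp directory is removed only after the consumer has finished
        return subprocess.run(cmd, stdout=subprocess.PIPE)
```

For the question this could be called with the import command (ending in 'png:-') as producer_cmd and ['visgrep', '{input}', 'pattern.pat'] as consumer_cmd.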

According to this answer, changing the argument png: to png:- will cause the import command to output to standard out instead of a file. I am unfamiliar with visgrep, so I'm not sure how to tell it to read the source image from stdin.

From the ImageMagick documentation:
STDIN, STDOUT, and file descriptors
Unix and Windows permit the output of one command to be piped to the
input of another. ImageMagick permits image data to be read and
written from the standard streams STDIN (standard in) and STDOUT
(standard out), respectively, using a pseudo-filename of -. In this
example we pipe the output of convert to the display program:
$ convert logo: gif:- | display gif:-
The second explicit format "gif:" is optional in the preceding
example. The GIF image format has a unique signature within the image
so ImageMagick's display command can readily recognize the format as
GIF. The convert program also accepts STDIN as input in this way:
$ convert rose: gif:- | convert - -resize "200%" bigrose.jpg
You can use the same filename convention with the import command.
So, try:
p = subprocess.Popen(['import', '-crop', '305x42+1328+281',
                      '-window', 'root', '-depth', '8', 'png:-'],
                     stdout=subprocess.PIPE)
