How can I convert PDF pages to images? - python

I need to save PDF pages as images.
Is this possible with pypdf?

As far as I know, there is no good way to do this with pyPdf or any other library I've seen. PIL supports writing PDFs but not reading them, so it doesn't help here either. Such support would be quite nice to have. I'd recommend using ImageMagick as a workaround: you can call it with subprocess from your script and have it handle the conversion.

ImageMagick also has Python bindings available, so you could output your images without having to use subprocess.
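A minimal sketch of the subprocess approach described above. The function names, the output pattern, and the default resolution are illustrative assumptions, not part of any library. It also assumes the ImageMagick `convert` executable is on the PATH (newer installs may expose it as `magick` instead) and that Ghostscript is installed, since ImageMagick delegates PDF rendering to it.

```python
import subprocess


def build_convert_command(pdf_path, output_pattern, dpi=150):
    """Return the ImageMagick argument list that renders PDF pages to images.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    # -density is placed before the input so the PDF is rasterized at that resolution.
    return ["convert", "-density", str(dpi), pdf_path, output_pattern]


def pdf_to_images(pdf_path, output_pattern="page-%03d.png", dpi=150):
    """Run ImageMagick; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_convert_command(pdf_path, output_pattern, dpi), check=True)
```

Calling `pdf_to_images("input.pdf")` should then write one numbered PNG per page, provided ImageMagick and Ghostscript are both available.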

Related

Python Audio Edit

I am looking for a way to write a simple Python program that performs an automatic edit on an audio file.
I have already used PIL to resize pictures automatically to a predefined size.
I would like to do the same for re-encoding audio files to a predefined bitrate.
Similarly, I would like to write a Python program that can stretch an audio file and re-encode it.
Do I have to parse MP3s myself, or is there a library that can be used for this?
Rather than doing this natively in Python, I strongly recommend leaving the heavy lifting up to FFMPEG, by executing it from your script.
It can chop, encode, and decode just about anything you throw at it. You can find a list of common parameters here: http://howto-pages.org/ffmpeg/
This way, you can leave your Python program to figure out the logic of what you want to cut and where, and not spend a decade writing code to deal with all of the audio formats available.
If you don't like the idea of directly executing it, there is also a Python wrapper available for FFMPEG.
There is also pydub, an easy-to-use library.
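As a rough sketch of the "execute FFmpeg from your script" suggestion: the helper names below are hypothetical, and the sketch assumes an `ffmpeg` executable is on the PATH. It uses the `-b:a` option to set the audio bitrate and the `atempo` audio filter to change the playback speed without changing pitch.

```python
import subprocess


def build_reencode_command(src, dst, bitrate="128k", tempo=None):
    """Return an ffmpeg argument list that re-encodes src to dst.

    Hypothetical helper; tempo (e.g. 1.5) is optional and maps to the atempo filter.
    """
    cmd = ["ffmpeg", "-y", "-i", src]  # -y overwrites dst without prompting
    if tempo is not None:
        cmd += ["-filter:a", f"atempo={tempo}"]
    cmd += ["-b:a", bitrate, dst]
    return cmd


def reencode(src, dst, bitrate="128k", tempo=None):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(build_reencode_command(src, dst, bitrate, tempo), check=True)
```

For example, `reencode("talk.wav", "talk.mp3", bitrate="96k", tempo=1.25)` would re-encode at 96 kbit/s while speeding the audio up by 25%. Keeping the command construction separate from the call makes the logic easy to check without running FFmpeg.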

Using Imagemagick without making files?

I'm working in Python to create images from text. I've already been back and forth with PIL and frankly, its font and alignment options need a lot of work.
I can subprocess Imagemagick and it works great, except that it seems to always need to write a file to disk. I would like to subprocess the image creation and just get the data returned to Python, keeping everything in memory.
I've looked into a number of supposed Python wrappers for ImageMagick, but they're all either years out of date or not documented at all. Even searching extensively on SO doesn't seem to point clearly to a de facto way to use ImageMagick with Python. So I think subprocessing is the best way forward.
convert and the other ImageMagick commands can output image data to stdout if you specify format:- as the output file. You can capture that output in Python using the subprocess module.
For instance:
import subprocess
cmd = ["convert", "test.bmp", "jpg:-"]
# stdout is a file-like object; call .read() on it to get the JPEG bytes
output_stream = subprocess.Popen(cmd, stdout=subprocess.PIPE).stdout
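Applied to the original goal of rendering text, the same idea might look like the sketch below. The helper names and the styling defaults are assumptions for illustration. It relies on ImageMagick's `label:` input to render a string and on `png:-` to send the result to stdout, so nothing is written to disk.

```python
import subprocess


def build_label_command(text, image_format="png", pointsize=24):
    """Return an ImageMagick argument list that renders text to stdout.

    Hypothetical helper; colors and point size are arbitrary defaults.
    """
    return [
        "convert",
        "-background", "white",
        "-fill", "black",
        "-pointsize", str(pointsize),
        f"label:{text}",       # render the text as an image
        f"{image_format}:-",   # "-" means write the encoded image to stdout
    ]


def render_text(text, image_format="png", pointsize=24):
    """Return the encoded image as bytes, kept entirely in memory."""
    result = subprocess.run(
        build_label_command(text, image_format, pointsize),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if convert fails
    )
    return result.stdout
```

The returned bytes can be wrapped in `io.BytesIO` and handed to any image library that accepts file-like objects.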
It would be a lot more work than piping data to ImageMagick, but there are several Pango based solutions. I used pango and pygtk awhile back, and I am pretty sure you could develop a headless gtk or gdk application to render text to a pixbuf.
A simpler solution might be to use the Python cairo bindings.
Pango works at a pretty low level, so simple stuff can be a lot more complicated, but rendering quality is hard to beat, and it gives you a lot of fine grained control over the layout.

Creating a multi-page TIFF with Python

This has already been asked here, but I was looking for a solution that would work on Linux. Is tiffcp the only way?
Looks like ImageMagick can do it. The solution is essentially the same; call it from the command line.
Specifically, you want the -adjoin option (which is on by default). The command will look something like:
convert *.tiff my_combined_file.tiff
Haven't tried it, but there is pylibtiff, a python wrapper for the libtiff library on which tiffcp is implemented.
I know this is an old question, but convert has the drawback that it recompresses the images. You can use the python tifftools package to do this without recompressing the images: tifftools merge *.tiff combined_file.tiff.
Disclaimer: I am the author of the tifftools package.
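When the `convert` command above is called from Python without a shell, the `*.tiff` wildcard is not expanded automatically. A minimal sketch of one way to handle that, with hypothetical helper names, is to expand the pattern with `glob` and pass the sorted file list explicitly:

```python
import glob
import subprocess


def build_merge_command(input_paths, output_path):
    """Return an ImageMagick argument list that combines TIFFs into one file.

    Hypothetical helper; pages are ordered by sorted file name.
    """
    if not input_paths:
        raise ValueError("no input files to merge")
    return ["convert", *sorted(input_paths), output_path]


def merge_tiffs(pattern, output_path):
    """Expand the glob pattern in Python, then run ImageMagick on the result."""
    subprocess.run(build_merge_command(glob.glob(pattern), output_path), check=True)
```

Writing the output to a different directory, or using a name the pattern does not match, avoids picking up a previous result as an input on the next run.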

Check whether a PDF file is valid with Python

I get a file via an HTTP upload and need to make sure it's a PDF file. The programming language is Python, but this should not matter.
I thought of the following solutions:
1. Check if the first bytes of the string are %PDF. This is not a good check but prevents the user from uploading other files accidentally.
2. Use libmagic (the file command in bash uses it). This does exactly the same check as in (1).
3. Use a library to try to read the page count out of the file. If the lib is able to read a page count it should be a valid PDF file. Problem: I don't know a Python library that can do this.
Are there solutions using a library or another trick?
The current solution (as of 2023) is to use pypdf and catch exceptions (and possibly analyze reader.metadata):
from pypdf import PdfReader
from pypdf.errors import PdfReadError
with open("testfile.txt", "w") as f:
    f.write("hello world!")
try:
    PdfReader("testfile.txt")
except PdfReadError:
    print("invalid PDF file")
else:
    pass
The two most commonly used PDF libraries for Python are:
pyPdf
ReportLab
Both are pure Python, so they should be easy to install as well as cross-platform.
With pyPdf it would probably be as simple as doing:
from pyPdf import PdfFileReader
doc = PdfFileReader(file("upload.pdf", "rb"))
This should be enough, but doc will now have getDocumentInfo() and getNumPages() methods if you want to do further checking.
As Carl answered, pdftotext is also a good solution, and would probably be faster on very large documents (especially ones with many cross-references). However it might be a little slower on small PDF's due to system overhead of forking a new process, etc.
In a project of mine I need to check the MIME type of an uploaded file. I simply use the file command like this:
from subprocess import Popen, PIPE
filetype = Popen("/usr/bin/file -b --mime -", shell=True, stdout=PIPE, stdin=PIPE).communicate(file.read(1024))[0].strip()
You might of course want to move the actual command into a configuration file, as the command-line options vary among operating systems (e.g. Mac).
If you just need to know whether it's a PDF or not and do not need to process it anyway I think the file command is a faster solution than a lib. Doing it by hand is of course also possible but the file command gives you maybe more flexibility if you want to check for different types.
If you're on a Linux or OS X box, you could use Pdftotext (part of Xpdf, found here). If you pass a non-PDF to pdftotext, it will certainly bark at you, and you can use commands.getstatusoutput to get the output and parse it for these warnings.
If you're looking for a platform-independent solution, you might be able to make use of pyPdf.
Edit: It's not elegant, but it looks like pyPdf's PdfFileReader will throw an IOError(22) if you attempt to load a non-PDF.
I ran into the same problem but was not forced to use a programming language for this task. I tried pyPdf, but it was not efficient for me because it hangs indefinitely on some corrupted files.
However, I have found this software useful so far.
Good luck with it.
https://sourceforge.net/projects/corruptedpdfinder/
Here is a solution using pdfminersix, which can be installed with pip install pdfminer.six:
from pdfminer.high_level import extract_text
def is_pdf(path_to_file):
    try:
        extract_text(path_to_file)
        return True
    except Exception:  # any parsing failure means the file is not a readable PDF
        return False
You can also use filetype (pip install filetype):
import filetype
def is_pdf(path_to_file):
    kind = filetype.guess(path_to_file)  # returns None when the type is unknown
    return kind is not None and kind.mime == 'application/pdf'
Neither of these solutions is ideal.
The problem with the filetype solution is that it doesn't tell you if the PDF itself is readable or not. It will tell you if the file is a PDF, but it could be a corrupt PDF.
The pdfminer solution should only return True if the PDF is actually readable. But it is a big library and seems like overkill for such a simple function.
I've started another thread here asking how to check if a file is a valid PDF without using a library (or using a smaller one).
By valid do you mean that it can be displayed by a PDF viewer, or that the text can be extracted? They are two very different things.
If you just want to check that it really is a PDF file that has been uploaded then the pyPDF solution, or something similar, will work.
If, however, you want to check that the text can be extracted then you have found a whole world of pain! Using pdftotext would be a simple solution that would work in a majority of cases but it is by no means 100% successful. We have found many examples of PDFs that pdftotext cannot extract from but Java libraries such as iText and PDFBox can.
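The header check from the question needs no third-party library. The sketch below uses a hypothetical function name and looks for the `%PDF-` marker within the first 1024 bytes instead of only at offset 0. That window size is an assumption based on the leniency of common PDF viewers. As noted above, this only shows that the file claims to be a PDF, not that it is intact or readable.

```python
def looks_like_pdf(path, search_window=1024):
    """Return True if the file starts with (or nearly starts with) a PDF header.

    Hypothetical helper: a cheap sanity check, not a validity check.
    """
    with open(path, "rb") as f:
        head = f.read(search_window)
    return b"%PDF-" in head
```

For uploads, this makes a cheap first filter before handing the file to a real parser such as pypdf.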

Editing Photoshop PSD text layers programmatically

I have a multi-layered PSD, with one specific layer being non-rasterized text. I'm trying to figure out a way I can, from a bash/perl/python/whatever-else program:
load the PSD
edit the text in said layer
flatten all layers in the image
save as a web-friendly format like PNG or JPG
I immediately thought of ImageMagick, but I don't think I can edit the text layer through IM. If I can accomplish the first two steps some other programmatic way, I can always use ImageMagick to perform the last two steps.
After a couple of hours of googling and searching CPAN and PyPI, I still have found nothing promising. Does anyone have advice or ideas on the subject?
If you don't like to use the officially supported AppleScript, JavaScript, or VBScript, then there is also the possibility to do it in Python. This is explained in the article Photoshop scripting with Python, which relies on Photoshop's COM interface.
I have not tried it, so in case it does not work for you:
If your text is preserved after conversion to SVG then you can simply replace it by whatever tool you like. Afterwards, convert it to PNG (eg. by inkscape --export-png=...).
The only way I can think of to automate the changing of text inside of a PSD would be to use a regex based substitution.
Create a very simple picture in Photoshop, perhaps a white background and a text layer, with the text being a known length.
Search the file for your text, and with a hex editor, search nearby for the length of the text (which may or may not be part of the file format).
Try changing the text, first to a string of the same length, then to something shorter/longer.
Open in Photoshop after each change to see if the file is corrupt.
This method, if viable, will only work if the layer in question contains a known string, which can be substituted for your other value. Note that I have no idea whether this will work, as I don't have Photoshop on this computer to try this method out. Perhaps you can make it work?
As for converting to PNG, I am at a loss. If the replacing script is in Python, you may be able to do it with the Python Imaging Library (PIL, which seems to support it), but otherwise you may just have to open Photoshop to do the conversion. That means it probably wouldn't be worth changing the text programmatically in the first place.
Have you considered opening and editing the image in The GIMP? It has very good PSD support, and can be scripted in several languages.
Which one you use depends in part on your platform: the Perl interface didn't work on Windows the last I knew. I believe Scheme is supported in all ports.
You can use Photoshop itself to do this with OLE. You will need to install Photoshop, of course. Win32::OLE in Perl or similar module in Python. See http://www.adobe.com/devnet/photoshop/pdfs/PhotoshopScriptingGuide.pdf
If you're going to automate Photoshop, you pretty much have to use Photoshop's own scripting systems. I don't think there's a way around that.
Looking at the problem a different way, can you export from Photoshop to some other format which supports layers, like PNG, which is editable by ImageMagick?
You can also try this using Node.js. I made a PSD command-line tool.
It installs with a one-line command (Node.js and npm must be installed):
npm install -g psd-cli
You can then use it by typing in your terminal:
psd myfile.psd -t
You can check out the code to use it from another Node script, or call it through your shell from another Bash/Perl/whatever script.
