Check whether a PDF file is valid with Python - python

I receive a file via an HTTP upload and need to make sure it's a PDF file. The programming language is Python, but this should not matter.
I thought of the following solutions:
1. Check if the first bytes of the string are %PDF. This is not a good check, but it prevents the user from uploading other files accidentally.
2. Use libmagic (the file command in bash uses it). This does exactly the same check as in (1).
3. Use a library to try to read the page count out of the file. If the library is able to read a page count, it should be a valid PDF file. Problem: I don't know a Python library that can do this.
Are there solutions using a library or another trick?
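The first-bytes check described in the question can be sketched with the standard library alone. This is only a heuristic, not a validity check. The helper name looks_like_pdf and the 1024-byte search windows are assumptions for illustration, chosen because many PDF readers tolerate a little junk before the %PDF- header and look for the %%EOF marker near the end of the file.

```python
def looks_like_pdf(data, window=1024):
    """Heuristic: True if the bytes carry a PDF header and an end-of-file marker.

    `data` is the raw content of the upload as bytes. A file can pass this
    check and still be corrupt, so treat it as a cheap first filter only.
    """
    # The header is searched for near the start of the data.
    has_header = b"%PDF-" in data[:window]
    # The end-of-file marker is searched for near the end of the data.
    has_trailer = b"%%EOF" in data[-window:]
    return has_header and has_trailer
```

A web handler could call looks_like_pdf(upload.read()) before passing the file on to a real PDF library.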

The current solution (as of 2023) is to use pypdf and catch exceptions (and possibly analyze reader.metadata):
from pypdf import PdfReader
from pypdf.errors import PdfReadError
with open("testfile.txt", "w") as f:
    f.write("hello world!")
try:
    PdfReader("testfile.txt")
except PdfReadError:
    print("invalid PDF file")
else:
    pass

The two most commonly used PDF libraries for Python are:
pyPdf
ReportLab
Both are pure Python, so they should be easy to install as well as cross-platform.
With pyPdf it would probably be as simple as doing:
from pyPdf import PdfFileReader
doc = PdfFileReader(open("upload.pdf", "rb"))
This should be enough, but doc also exposes getDocumentInfo() and getNumPages() methods (also available as the documentInfo and numPages properties) if you want to do further checking.
As Carl answered, pdftotext is also a good solution, and would probably be faster on very large documents (especially ones with many cross-references). However, it might be a little slower on small PDFs due to the system overhead of forking a new process, etc.

In a project of mine I need to check the MIME type of an uploaded file. I simply use the file command like this:
from subprocess import Popen, PIPE
filetype = Popen("/usr/bin/file -b --mime -", shell=True, stdout=PIPE, stdin=PIPE).communicate(file.read(1024))[0].strip()
You might of course want to move the actual command into a configuration file, since command-line options vary among operating systems (e.g. Mac).
If you just need to know whether it's a PDF or not and do not need to process it anyway, I think the file command is a faster solution than a library. Doing it by hand is of course also possible, but the file command gives you more flexibility if you want to check for different types.
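A minimal Python 3 sketch of the same idea, passing the arguments as a list so that no shell is involved. The function name mime_type_of and the file_cmd parameter are hypothetical, and the --mime-type flag is assumed to be supported by the installed file command; the function returns None when the command is missing or fails.

```python
import shutil
import subprocess


def mime_type_of(data, file_cmd="file"):
    """Return the MIME type that `file` reports for the given bytes, or None."""
    if shutil.which(file_cmd) is None:
        return None  # the file command is not installed on this system
    result = subprocess.run(
        [file_cmd, "-b", "--mime-type", "-"],  # "-" makes file read from stdin
        input=data[:1024],  # the first kilobyte is enough for magic detection
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        return None
    return result.stdout.decode("ascii", "replace").strip()
```

With this helper, the upload check becomes mime_type_of(upload.read(1024)) == "application/pdf".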

If you're on a Linux or OS X box, you could use pdftotext (part of Xpdf). If you pass a non-PDF to pdftotext, it will certainly bark at you, and you can use commands.getstatusoutput (subprocess.getstatusoutput in Python 3) to get the output and parse it for these warnings.
If you're looking for a platform-independent solution, you might be able to make use of pyPdf.
Edit: It's not elegant, but it looks like pyPdf's PdfFileReader will throw an IOError(22) if you attempt to load a non-PDF.
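A rough Python 3 sketch of the pdftotext approach, relying on the exit status instead of parsing the warnings. The function name pdftotext_accepts and the pdftotext_cmd parameter are hypothetical, and it is an assumption that the installed pdftotext exits with a non-zero status when it cannot open the file as a PDF.

```python
import shutil
import subprocess


def pdftotext_accepts(path, pdftotext_cmd="pdftotext"):
    """Return True or False from pdftotext's exit status, or None if it is not installed."""
    if shutil.which(pdftotext_cmd) is None:
        return None
    # The trailing "-" sends the extracted text to stdout, which is discarded here.
    result = subprocess.run(
        [pdftotext_cmd, path, "-"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

A True result only means pdftotext could process the file, which is not the same as the file being displayable in every viewer.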

I ran into the same problem, but was not forced to use a programming language for this task. I tried pyPDF, but it was not efficient for me, as it hangs indefinitely on some corrupted files.
However, I have found this software useful so far.
Good luck with it.
https://sourceforge.net/projects/corruptedpdfinder/

Here is a solution using pdfminer.six, which can be installed with pip install pdfminer.six:
from pdfminer.high_level import extract_text
def is_pdf(path_to_file):
    try:
        # Parsing raises an exception if the file is not a readable PDF
        extract_text(path_to_file)
        return True
    except Exception:
        return False
You can also use filetype (pip install filetype):
import filetype
def is_pdf(path_to_file):
    # guess() returns None when the file type is not recognized
    kind = filetype.guess(path_to_file)
    return kind is not None and kind.mime == 'application/pdf'
Neither of these solutions is ideal.
The problem with the filetype solution is that it doesn't tell you if the PDF itself is readable or not. It will tell you if the file is a PDF, but it could be a corrupt PDF.
The pdfminer solution should only return True if the PDF is actually readable. But it is a big library and seems like overkill for such a simple function.
I've started another thread here asking how to check if a file is a valid PDF without using a library (or using a smaller one).

By valid do you mean that it can be displayed by a PDF viewer, or that the text can be extracted? They are two very different things.
If you just want to check that it really is a PDF file that has been uploaded then the pyPDF solution, or something similar, will work.
If, however, you want to check that the text can be extracted then you have found a whole world of pain! Using pdftotext would be a simple solution that would work in a majority of cases but it is by no means 100% successful. We have found many examples of PDFs that pdftotext cannot extract from but Java libraries such as iText and PDFBox can.

Related

How to change Title (not file-name) of .mp3 file on Mac using Python?

I want to rename mp3 files on my mac using python before importing them to iTunes. So I need to change the "Title" of the file, not the file's name. As in, I want to change "Al-Fatihah" in the picture below to "new_title".
Most online resources and questions that I found suggest using either external libraries or os.stat(), which only gives info about the modification and creation of the file (second picture below), unless I'm misunderstanding something. I was wondering if there is a way to do so without having to download extra libraries, as I'm not always sure which libraries are safe.
Thanks!
If you don't use a library, you're gonna have to go in and manually edit the bytes yourself. The 'title' you're referring to is an ID3 tag, which is a standard defining which parts of the mp3 file contain data about the track.
In the case of ID3v1, the last 128 bytes of the file are reserved for metadata: the first 3 bytes of that block hold the marker TAG, and the next 30 bytes (bytes 4 to 33 of the block) are reserved for the title.
Manually writing bytes in python is an absolute pain, so I strongly, strongly recommend using a library for this menial task. eyeD3 is a library that can do this for you. If you are not "sure which libraries are safe", why don't you have a look at the source code for these libraries to check that they're safe yourself?
If you really, absolutely must edit them using only Python, you'd have to go about it like this. I'm pasting this answer from another question about manipulating bytes here. This is not an exact solution, more of a guideline of what manually editing the bytes would entail:
with open("filename.mp3", "r+b") as f:
    fourbytes = bytearray(f.read(4))  # mutable copy of the first four bytes
    fourbytes[0] = fourbytes[1]  # whatever, manipulate your bytes here
    f.seek(0)
    f.write(fourbytes)
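To make the ID3v1 layout concrete, here is a minimal sketch that writes the title field with the standard library only. The function names set_id3v1_title and get_id3v1_title are hypothetical, and the Latin-1 encoding is an assumption based on ID3v1 convention. Players that prefer an ID3v2 tag at the start of the file may ignore this change, so a library such as eyeD3 remains the more reliable route.

```python
import os

ID3V1_SIZE = 128  # an ID3v1 tag occupies the last 128 bytes of the file
TITLE_OFFSET = 3  # the title follows the 3-byte "TAG" marker
TITLE_SIZE = 30


def set_id3v1_title(path, title):
    """Write `title` into the ID3v1 tag, appending a blank tag if none exists."""
    encoded = title.encode("latin-1", errors="replace")[:TITLE_SIZE]
    encoded = encoded.ljust(TITLE_SIZE, b"\x00")  # pad the field with NUL bytes
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        has_tag = False
        if f.tell() >= ID3V1_SIZE:
            f.seek(-ID3V1_SIZE, os.SEEK_END)
            has_tag = f.read(3) == b"TAG"
        if not has_tag:
            # No tag yet: append an empty one (genre byte 255 means "unset").
            f.seek(0, os.SEEK_END)
            f.write(b"TAG" + b"\x00" * 124 + b"\xff")
        f.seek(-ID3V1_SIZE + TITLE_OFFSET, os.SEEK_END)
        f.write(encoded)


def get_id3v1_title(path):
    """Return the ID3v1 title, or None if the file has no ID3v1 tag."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < ID3V1_SIZE:
            return None
        f.seek(-ID3V1_SIZE, os.SEEK_END)
        tag = f.read(ID3V1_SIZE)
    if tag[:3] != b"TAG":
        return None
    title = tag[TITLE_OFFSET:TITLE_OFFSET + TITLE_SIZE]
    return title.rstrip(b"\x00 ").decode("latin-1")
```

Calling set_id3v1_title("filename.mp3", "new_title") modifies the file in place, so try it on a copy first.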

conversion from rtf to docx from the commandline on a Mac in Python

Does anyone know of a Python script to convert rtf to docx on a Mac?
Textutil does not properly process headers and footers, so that is not an option. I need a commandline utility because I have a large number of files in different locations to process. The textutil interface is fine... it is just mind-boggling that it doesn't handle the header (which contains important information in my documents).
A Python script is preferable.
-Joe
I don't know the exact answer to your question, but I can suggest another way of doing it: since you are doing this on OS X, why not use the built-in Automator? For an example, look at http://aseriesoftubes.com/articles/how-to-batch-convert-doc-files-to-pdf-format-using-mac-osx-automator/ or https://discussions.apple.com/thread/3050596?tstart=0

How to open Excel instance in python on MAC?

I think this question has been asked before, but the answer is not clear: in the original question the user referred to excel.exe, which is a Windows executable and not for Mac.
I need to open a new Excel instance in Python on a Mac.
Which module should I import?
I'm a newbie. I have finished learning the Python language, but I have trouble understanding documentation.
If all you need to do is launch Excel, the best way is to use LaunchServices.
If you have PyObjC (which you do if you're using the Python that Apple pre-installs on 10.6 and later; otherwise, you may have to install it):
import AppKit  # NSWorkspace is part of AppKit, not Foundation
ws = AppKit.NSWorkspace.sharedWorkspace()
ws.launchApplication_('Microsoft Excel')
If not, you can always use the open tool:
import subprocess
subprocess.check_call(['open', '-a', 'Microsoft Excel'])
Either way, you're effectively launching Excel the same way as if the user double-clicked the app icon in Finder.
If you want to make Excel do something simple like open a specific document, that's not much harder. Look at the NSWorkspace or open documentation to see how to do whatever you want.
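As a small sketch of the "open a specific document" case with the open tool: passing a file path after -a and the application name asks that application to open the file. The helper names build_open_command and launch, and the file name report.xlsx, are made up for illustration.

```python
import subprocess
import sys


def build_open_command(app_name, document=None):
    """Build the argument list for the macOS `open` tool."""
    command = ["open", "-a", app_name]
    if document is not None:
        command.append(document)  # the application is asked to open this file
    return command


def launch(app_name, document=None):
    """Launch the application, optionally with a document (macOS only)."""
    if sys.platform != "darwin":
        raise RuntimeError("the open tool is only available on macOS")
    subprocess.check_call(build_open_command(app_name, document))
```

For example, launch('Microsoft Excel', 'report.xlsx') would run open -a 'Microsoft Excel' report.xlsx.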
If you actually want to control Excel—e.g., open a document, make some changes, and save it—you'll want to use its AppleScript interface.
Apple's recommended way of doing that is via ScriptingBridge, or using a dual-language approach (write AppleScripts and execute them via NSAppleScript—which, in Python, you do through PyObjC). However, I'd probably use appscript (get the code from here). Despite the fact that it's been abandoned by its original creator, and is only being sparsely maintained, and will probably eventually stop working with some future OS X version, it's still much better than the official solutions.
Here's a sample (untested, because I don't have Excel here):
import appscript
excel = appscript.app('Microsoft Excel')
excel.workbooks[1].column[2].row[2].formula.set('=A2+1')
From the comments it is not completely clear if you need to 'update' an Excel file with data, and just assume that you need Excel to do so, or that you need to change some excel files to include new data.
It is usually much easier, and certainly faster (in terms of execution speed), to 'update' an Excel file without starting Excel. However, updating is not the right word: you have to read in the file and write it out anew. You can of course overwrite the original file, so it looks like an update.
For 'updating' you can use the trio xlrd, xlwt, xlutils if the files you work with are .xls files (Excel 2003). IIRC xlwt does not support .xlsx for writing (xlrd could read those files, but dropped .xlsx support in version 2.0).
For .xlsx files I use openpyxl.
Both are good enough for writing things like data, formula and basic formatting.
If you have existing Excel files which you use as 'templates' with information that would get lost if you read/write using one of the above packages, then you have to go with updating the file in Excel. I had to do so because I had no easy way to include Visual Basic macros and very specific formatting specified by a client. And sometimes it is just easier to visually setup a spreadsheet and then just fill the cells programmatically. But this was all done on Windows.
If you really have to drive Excel on Mac, because you need to use existing files as templates, I suggest you look at AppleScript. Or, if it is an option, look at the OpenOffice/LibreOffice PyUNO interface.

Which format should I save my python script output?

I have an executable (converted to exe from python using py2exe) that outputs lists of numbers that could be from 0-50K lines long or a little bit more.
While developing, I just saved them to a TXT file using simple f.write.
The person wants to print this output on paper! (don't ask why lol)
So, I'm wondering if I can output it to something like HTML? XML? Something that could display tables of 50K lines and maybe 3 columns and that would also run in any PC without additional programs?
Suggestions?
EDIT:
Regarding CSV:
In most situations the best way, in my opinion, would be to make a CSV. I'm not opposing it in any way; rather, I think others might find Lott's answer useful for their cases. Sorry I didn't explain my constraints that well in my question.
My constraints are: the user doesn't have an office suite, no python installed. Just think of a PC that has the bare minimum after a clean windows xp/vista installation, maybe Internet Explorer 7 or 8. This PC has to be able to open my output file and allow for reasonable viewing, searching, and printing.
CSV.
http://docs.python.org/library/csv.html
http://en.wikipedia.org/wiki/Comma-separated_values
They can load a spreadsheet and print anything they want.
If you can't install anything on the computer, then you might be best off outputting an HTML file with the data in a <table> that the user could view/search/print in IE.
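A minimal sketch of the HTML table idea, assuming Python 3 (html.escape is not available in Python 2). The function name write_html_table and the column headers in the usage note are made up for illustration.

```python
import html


def write_html_table(rows, path, headers=None):
    """Write rows (an iterable of sequences) to `path` as a plain HTML table."""
    with open(path, "w", encoding="utf-8") as f:
        f.write('<!DOCTYPE html>\n<html><head><meta charset="utf-8">')
        f.write("<title>Output</title></head><body>\n")
        f.write('<table border="1">\n')
        if headers:
            cells = "".join("<th>%s</th>" % html.escape(str(h)) for h in headers)
            f.write("<tr>%s</tr>\n" % cells)
        for row in rows:
            # Escape every value so that characters like < and & display correctly.
            cells = "".join("<td>%s</td>" % html.escape(str(v)) for v in row)
            f.write("<tr>%s</tr>\n" % cells)
        f.write("</table>\n</body></html>\n")
```

For example, write_html_table(data, "output.html", headers=["index", "value", "note"]) produces a single file that opens in a browser, where the built-in find and print functions apply.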
You could use LaTeX to produce a PDF, maybe? But why exactly isn't a text file good enough?
You can produce a PDF using Reportlab. After all if you really want full control of the printed output, there's nothing that beats PDF.
Does 50k lines make too large a file? If not, just continue writing text files. Otherwise an easy solution would be to continue spitting out text files and compress them, e.g. with zip. You could use the zipfile library in Python. Most computers have no trouble reading zip files.
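The compress-the-text-file suggestion could look roughly like this with the zipfile module. The function name write_zipped_text and the default member name output.txt are assumptions for illustration.

```python
import zipfile


def write_zipped_text(lines, zip_path, inner_name="output.txt"):
    """Store the given lines as one text file inside a new zip archive."""
    text = "\n".join(str(line) for line in lines) + "\n"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(inner_name, text)
```

The recipient then only needs to extract output.txt and open it in any text editor.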

viewing files in python?

I am creating a sort of "Command line" in Python. I already added a few functions, such as changing login/password, executing, etc., But is it possible to browse files in the directory that the main file is in with a command/module, or will I have to make the module myself and use the import command? Same thing with changing directories to view, too.
Browsing files is as easy as using the standard os module. If you want to do something with those files, that's entirely different.
import os
all_files = os.listdir('.') # gets all files in current directory
To change directories you can issue os.chdir('path/to/change/to'). In fact there are plenty of useful functions found in the os module that facilitate the things you're asking about. Making them pretty and user-friendly, however, is up to you!
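As a sketch of how os.listdir and os.chdir could be wired into a home-made command line, the function below handles one line of input. The command names ls and cd and the function name run_command are assumptions for illustration, not part of the question.

```python
import os


def run_command(line):
    """Handle one line of user input and return the text to display."""
    parts = line.split(None, 1)  # command name, then the rest as one argument
    if not parts:
        return ""
    name = parts[0]
    argument = parts[1] if len(parts) > 1 else None
    if name == "ls":
        # List the given directory, or the current one if none was given.
        return "\n".join(sorted(os.listdir(argument or ".")))
    if name == "cd":
        if argument is None:
            return "usage: cd <directory>"
        try:
            os.chdir(argument)
        except OSError as error:
            return "cd: %s" % error
        return os.getcwd()
    return "unknown command: %s" % name
```

A read loop would then call print(run_command(input("> "))) repeatedly until the user quits.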
I'd like to see someone write a semantic file-browser, i.e. one that auto-generates tags for files according to their input and then allows views and searching accordingly.
Think about it... take an MP3, lookup the lyrics, run it through Zemanta, bam! a PDF file, a OpenOffice file, etc., that'd be pretty kick-butt! probably fairly intensive too, but it'd be pretty dang cool!
Cheers,
-C
