Compress PDFs using Python

Compress PDFs using Python - python

So I have a gazillion pdfs in a folder, I want to recursively (using os.path.walk) shrink them. I see that adobe pro has a save as reduced size. Would I be able to use this / how do you suggest I do it otherwise.
Note: Yes, I would like them to stay as pdfs because I find that to be the most commonly used and installed fileviewer.

From the project's GitHub page for pdfsizeopt, which is written in Python:
pdfsizeopt is a program for converting large PDF files to small ones. More specifically, pdfsizeopt is a free, cross-platform command-line application (for Linux, Mac OS X, Windows and Unix) and a collection of best practices to optimize the size of PDF files, with focus on PDFs created from TeX and LaTeX documents. pdfsizeopt is written in Python..."
You can probably easily adapt this to your specific needs.

Realize this is an old question. Thought I would suggest an alternative to pdfsizeopt, as I have experienced quality loss using it for PDFs of maps. PDFTron offers a comprehensive set of functionality. Here is a snippet modified from their web-page (see "example 1"):
import site
site.addsitedir(r"...pathToPDFTron\PDFNetWrappersWin32\PDFNetC\Lib")
from PDFNetPython import PDFDoc, Optimizer, SDFDoc
doc = PDFDoc(inPDF_Path)
doc.InitSecurityHandler()
Optimizer.Optimize(doc)
doc.Save(outPDF_Path, SDFDoc.e_linearized)
doc.Close()

Related

Creating PDFs from HTML/Javascript in Python with no OS dependencies

Is there any way to use Python to create PDF documents from HTML/CSS/Javascript, without introducing any OS-level dependencies?
It seems every existing solution requires special supplemental software, but upon reviewing PDF formatting specifications and HTML/CSS/Javascript rendering, there doesn't appear to be a reason why a Python solution can't exist without them. Some solutions come close, such as pyppeteer, but it still leans on a headless Chrome installation locally. These dependencies mean that microservices can't be leveraged, even though PDF generation would otherwise seem to be a viable use case for them.
While similar questions have come up many times over on SO, there doesn't appear to have been a viable technique shown without having to install specialized dependencies on the OS.
Some similar questions which routinely recommend wkhtmltopdf or are otherwise out of date (e.g., moving PDF printing support outside of Chrome is dead now):
How to convert webpage into PDF by using Python
How to convert a local HTML file to PDF using Python in Windows
HTML to PDF conversion using Chrome pdfium
How does Chrome render PDFs from HTML so well?
Convert a HTML/CSS/Javascript file to PDF using Python?
If I've somehow missed a viable approach, please feel free to mark this as a duplicate with my thanks!
Edit February 2021: It appears that the cefpython project may meet these demands - PDF printing support seems like it could be implemented in the near future.

So to clarify and formalize what others have said:
If you want to create PDF documents from HTML/CSS/javascript content, you will necessarily need a javascript engine (because you obviously need to execute the javascript if it affects the visuals of the document). This is the most complex component that you need.
As for now, there is no ECMAscript compliant engine written in pure python that is well-maintained (that would be a huge project)... There will probably never be one, since compilers and VMs for languages need to be performant and are thus usually written in a performant low-level language.
So you will always need compiled binaries for that and the HTML renderers which are less complex but also need to be performant if used in browsers, so usually they're also C++ or the likes.
The javascript engine and HTML renderer are the major part of a browser, so a headless browser is a good solution to this requirement.

Try this library: xhtml2pdf
It worked for me. Here is the documentation: doc
Some sample code:
from xhtml2pdf import pisa
def convert_html_to_pdf(source_html, output_filename):
# open output file for writing (truncated binary)
result_file = open(output_filename, "w+b")
# convert HTML to PDF
pisa_status = pisa.CreatePDF(
source_html, # the HTML to convert
dest=result_file) # file handle to recieve result
# close output file
result_file.close() # close output file
# return False on success and True on errors
return pisa_status.err
# Define your data
source_html = open('2020-06.html')
output_filename = "test.pdf"
convert_html_to_pdf(source_html, output_filename)

How to open .fif file format?

I want to open a .fif file of size around 800MB. I googled and found that these kind of files can be opened with photoshop. Is there a way to extract the images and store in some other standard format using python or c++.

This is probably an EEG or MEG data file. The full specification is here, and it can be read in with the MNE package in Python.
import mne
raw = mne.io.read_raw_fif('filename.fif')

FIF stands for Fractal Image Format and seems to be output of the Genuine Fractals Plugin for Adobe's Photoshop. Unfortunately, there is no format specification available and the plugin claims to use patented algorithms so you won't be able to read these files from within your own software.
There however are other tools which can do fractal compression. Here's some information about one example. While this won't allow you to open FIF files from the Genuine Fractals Plugin, it would allow you to compress the original file, if still available.

XnView seems to handle FIF files, but it's windows-only. There is a MP or Multiplatform version, but it seems less complete and didn't work when I tried to view a FIF file.
Update: XnView MP, which does work on Linux and OSX claims to support FIF, but I couldn't get it to work.
Update2: There's also an open source project:Fiasco that can work with fractal images, but not sure it's compatible with the proprietary FIF format.

Is there a Python library to create thumbnails for various document file formats?

I'd like to generate thumbnails from various "document" file formats such as odt, doc(x) and ppt(x) but also mp4, psd, tiff (and possibly others) from a Python application. As far as I know for each of these formats there is at least one open source application which can generate preview images/thumbnails (e.g. LibreOffice, ffmpeg) or at least extract embedded thumbnails (e.g. imagemagick).
My main problem is that each of these applications/libraries use different command line options so I'm looking for a Python library (or a unified CLI tool) which provides a high-level API to generate a thumbnail with specified dimensions, quality level given a filename and calls the appropriate external tool (ideally including catching exceptions, segfaults and timeouts). Bonus points if it can generate multiple thumbnails if requested (e.g. one per page, page X-Y, every Z seconds but at most N images).
Does anyone know such a library/utility? (Boundary condition: The files may contain sensitive material or might be quite big so this must work without any network communication, using an external web service is not possible.)
If there is no such thing in Python, a locally installable web service would be fine as well.

I ended up writing my own library (named anythumbnailer, MIT license) which worked well enough for my immediate needs. The library is not what I envisioned (only basic thumbnailing, no support for dimensions, …) but it can generate thumbnails for doc(x), xls(x), ppt(x), videos and pdf on Linux with the help of ffmpeg, LibreOffice and ffmpeg.

you can look at Preview generator. preview-generator is a library for generating preview - thumbnails, pdf, text and json overview for all your file-based content. This module gives you access to jpeg, pdf, text, htlm and json preview of virtually any kind of file. It also includes a cache mechanism so you do not have to care about preview storage.

Script to search for text from PDF

Problem
On the Mac OS X platform, I would like to write a script, either in Python or Tcl to search for text within a PDF file and extract the relevant parts. I appreciate any help.
Background
I am writing scripts to look inside a PDF to determine if it is a bill, from what company, and for what period. Based on these information, I rename the PDF and move it to an appropriate directory. For example, file such as Statement_03948293929384.pdf might become 2012-07-15 Water Bill.pdf and moved to my Utilities folder.
What have I done so far?
I have searched for PDF-to-plain-text tools, but not found anything yet
I have looked into the Tcl wiki and found an example, but could not get it to work (I searched for text in PDF, but not found).
I am looking into pdf-parser.py by Didier Stevens
I heard of a Python package called pyPdf and will look at it next.
Update
I have found a command-line tool called pdftotext written by Glyph & Cog, LLC; built and packaged by Carsten Bluem. This tool is straight forward and it solves my problem. I am still looking out for those tools that can search PDF directly, without having to convert to text file.

I have successfully used PyODConverter to convert to/from PDFs (there is also a more powerful Java version). Once you have the PDF converted to text it should be trivial to do the searching. Also I believe iText should be capable of doing similar things, but I haven't tested it.

Dynamic generation of .doc files

How can you dynamically generate a .doc file using AJAX? Python? Adobe AIR? I'm thinking of a situation where an online program/desktop app takes in user feedback in article form (a la wiki) in Icelandic character encoding and then upon pressing a button releases a .doc file containing the user input for the webpage. Any solutions/suggestions would be much appreciated.
PS- I don't want to go the C#/Java way with this.

The problem with the *.doc MS word format is, that it isn't documented enough, therefor it can't have a very good support like, for example, PDF, which is a standard.
Except of the problems with generating the doc, you're users might have problems reading the doc files. For example users on linux machines.
You should consider producing RTF on the server. It is more standard, and thus more supported both for document generation, and for reading the document afterwards. Unless you need very specific features, it should suffice for most of documents types, and MS word opens it by default, just like it opens its own native format.
PyRTF is an project you can use for RTF generation with python.

It don't have to do much with ajax(in th sense that ajax is generally used for dynamic client side interactions)
You need a server side script which takes the input and converts it to doc.
You may use something like openoffice and python if it has some interface
see http://wiki.services.openoffice.org/wiki/Python
or on windows you can directly use Word COM objects to create doc using win32apis
but it is less probable, that a windows server serving python :)
I think better alternative is to generate PDF which would be nicer and easier.
Reportlab has a wonderful pdf generation library and it works like charm from python.
Once you have pdf you may use some pdf to doc converter, but I think PDF would be good enough.
Edit: Doc generation
On second thought if you are insisting on DOC you may have windows server in that case
you can use COM objets to generate DOC, xls or whatever see
http://win32com.goermezer.de/content/view/173/284/

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Compress PDFs using Python - python

Related

Creating PDFs from HTML/Javascript in Python with no OS dependencies

How to open .fif file format?

Is there a Python library to create thumbnails for various document file formats?

Script to search for text from PDF

Dynamic generation of .doc files

Categories

Resources