Creating PDFs from HTML/Javascript in Python with no OS dependencies - python

Is there any way to use Python to create PDF documents from HTML/CSS/Javascript, without introducing any OS-level dependencies?
It seems every existing solution requires special supplemental software, but upon reviewing PDF formatting specifications and HTML/CSS/Javascript rendering, there doesn't appear to be a reason why a Python solution can't exist without them. Some solutions come close, such as pyppeteer, but it still leans on a headless Chrome installation locally. These dependencies mean that microservices can't be leveraged, even though PDF generation would otherwise seem to be a viable use case for them.
While similar questions have come up many times over on SO, there doesn't appear to have been a viable technique shown without having to install specialized dependencies on the OS.
Some similar questions which routinely recommend wkhtmltopdf or are otherwise out of date (e.g., moving PDF printing support outside of Chrome is dead now):
How to convert webpage into PDF by using Python
How to convert a local HTML file to PDF using Python in Windows
HTML to PDF conversion using Chrome pdfium
How does Chrome render PDFs from HTML so well?
Convert a HTML/CSS/Javascript file to PDF using Python?
If I've somehow missed a viable approach, please feel free to mark this as a duplicate with my thanks!
Edit February 2021: It appears that the cefpython project may meet these demands - PDF printing support seems like it could be implemented in the near future.

So to clarify and formalize what others have said:
If you want to create PDF documents from HTML/CSS/JavaScript content, you necessarily need a JavaScript engine, because the JavaScript has to be executed whenever it affects the visuals of the document. This is the most complex component that you need.
As of now, there is no well-maintained ECMAScript-compliant engine written in pure Python (that would be a huge project). There will probably never be one, since compilers and VMs for languages need to be performant and are thus usually written in a performant low-level language.
So you will always need compiled binaries for that, and for the HTML renderer as well: renderers are less complex, but they also need to be performant when used in browsers, so they are usually written in C++ or the like.
The JavaScript engine and the HTML renderer make up the major part of a browser, so a headless browser is a good solution to this requirement.
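To illustrate the headless browser route, here is a minimal sketch that shells out to a Chromium-family browser using its --headless and --print-to-pdf flags. The executable names in CANDIDATE_BROWSERS and the helper names are my own assumptions for illustration, and this of course still needs the browser installed on the OS, which is exactly the dependency discussed above.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional

# Candidate executable names; which one exists depends on the machine (assumption).
CANDIDATE_BROWSERS = ("chromium", "chromium-browser", "google-chrome", "chrome")


def find_browser() -> Optional[str]:
    """Return the path of the first Chromium-family binary found on PATH, if any."""
    for name in CANDIDATE_BROWSERS:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_print_command(browser: str, source: str, output_pdf: str) -> List[str]:
    """Build the argv that asks a headless browser to print `source` to a PDF."""
    return [
        browser,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output_pdf}",
        source,
    ]


def html_to_pdf(source: str, output_pdf: str) -> bool:
    """Render a URL or local file to PDF; returns False if no browser was found."""
    browser = find_browser()
    if browser is None:
        return False
    subprocess.run(build_print_command(browser, source, output_pdf), check=True)
    return Path(output_pdf).exists()
```

Calling html_to_pdf("https://example.com", "page.pdf") lets the browser run the page's JavaScript before printing, which is the part a pure Python library cannot do.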

Try this library: xhtml2pdf
It worked for me. Here is the documentation: doc
Some sample code:
from xhtml2pdf import pisa

def convert_html_to_pdf(source_html, output_filename):
    # open output file for writing (truncated binary)
    with open(output_filename, "w+b") as result_file:
        # convert HTML to PDF
        pisa_status = pisa.CreatePDF(
            source_html,       # the HTML to convert
            dest=result_file)  # file handle to receive result
    # falsy on success, truthy on errors
    return pisa_status.err

# Define your data
with open('2020-06.html') as html_file:
    source_html = html_file.read()
output_filename = "test.pdf"
convert_html_to_pdf(source_html, output_filename)

Related

Python + Linux - Excel to HTML (keeping format)

I'm looking for a way to convert excel to html while preserving formatting.
I know this is doable on Windows thanks to the availability of some underlying win32 libraries (e.g. via xlwings, see "Python - Excel to HTML (keeping format)").
But I'm looking for a solution on Linux.
I've also come by Aspose Cells but this requires a paid license or else it will add a lot of extra junk to the output that needs to be scrubbed out.
And lastly I tried the python lib xlsx2html but it does a very poor job at preserving formatting.
Are there any suggestions for a Linux based solution? I'd also be interested in tools written in other languages that can be easily wrapped around via python.
Thanks in advance!
Update:
Here is an example of a random excel sheet I converted via excel itself that I would like to reproduce. It has some colors, some border variations, some merged cells and some font sizes to see if they all work.
You can use LibreOffice to convert an Excel file to a HTML file using the command line:
# --convert-to implies --headless so it's not mandatory to specify --headless
soffice --headless --convert-to html data.xlsx
You can refer to the documentation to know more about other CLI parameters.
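Since the question asks for something that can be wrapped from Python, here is a rough sketch of calling that command via subprocess. The function names are hypothetical, and I'm assuming soffice is on PATH and that the output keeps the input's base name with an .html extension inside the --outdir directory.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(xlsx_path: str, out_dir: str, soffice: str = "soffice") -> List[str]:
    """Build the soffice argv for converting one spreadsheet to HTML."""
    return [soffice, "--headless", "--convert-to", "html", "--outdir", out_dir, xlsx_path]


def expected_output(xlsx_path: str, out_dir: str) -> Path:
    """Path where the converted file is assumed to land (same stem, .html suffix)."""
    return Path(out_dir) / (Path(xlsx_path).stem + ".html")


def excel_to_html(xlsx_path: str, out_dir: str = ".") -> Path:
    """Run the conversion and return the assumed path of the generated HTML file."""
    subprocess.run(build_convert_command(xlsx_path, out_dir), check=True)
    return expected_output(xlsx_path, out_dir)
```

For example, excel_to_html("data.xlsx", "out") should leave the result at out/data.html, provided LibreOffice is installed.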
I think you should search for Excel to HTML in the JS world rather than in Python. I am not saying it is impossible in Python, but it is more common in JS, and I expect you will get better results there.
In my opinion, finding a JS-based solution and writing a Python wrapper around it will be more helpful, because the JS community has put more effort than other communities into importing and working with Excel files.
Another idea is to change your approach: look into embedding the Excel file in an HTML page (for example in an iframe) with JS and then exporting it.
But again, I highly recommend checking JS libraries or GitHub repositories; some of them do care about formatting.

What's a good document standard to use programmatically?

I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is XML packed in a zip file, so I can edit that one, but there are no command line utilities or any other way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However it loses formatting features and looks awful. I'm surprised there isn't a well known library which lets you edit a variety of document formats and convert them into other formats, am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson
Have you looked into using LaTeX documents?
They are perfect to use programmatically (compiling documents? You gotta love that...), and you have several Python frameworks you can use such as plasTeX and PyTex.
Exporting a LaTeX document to PDF is almost immediate.
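As a rough sketch of the "replace a few values, insert a table, convert to PDF" workflow with LaTeX, using only the standard library: the template text, the function names and the sample fields are all made up for illustration, and compile_pdf assumes a pdflatex binary is installed.

```python
import subprocess
from string import Template
from typing import Sequence

# Hypothetical template; $name, $columns and $rows are placeholders filled in below.
LATEX_TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
Dear $name,

\begin{tabular}{$columns}
$rows
\end{tabular}
\end{document}
""")

_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(text: str) -> str:
    """Escape characters that have a special meaning in LaTeX."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def render_document(name: str, rows: Sequence[Sequence[str]]) -> str:
    """Fill the template with a name and a table (one left-aligned column per cell)."""
    columns = "l" * len(rows[0]) if rows else "l"
    body = "\n".join(
        " & ".join(latex_escape(cell) for cell in row) + r" \\" for row in rows
    )
    return LATEX_TEMPLATE.substitute(name=latex_escape(name), columns=columns, rows=body)


def compile_pdf(tex_path: str) -> None:
    """Compile a .tex file to PDF; requires pdflatex to be installed."""
    subprocess.run(["pdflatex", "-interaction=nonstopmode", tex_path], check=True)
```

You would write the string returned by render_document to a .tex file and pass that path to compile_pdf.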
Since you're already using PyQt anyway, it might be worth looking at Qt's built-in RTF processing module which looks decent. Here's the documentation on detailed content manipulation including inserting tables. Also the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
edit: included more links.
You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.
I don't know what kind of audience your program has; TeX is good and I would go with it.
Another possible choice is the Excel format, parsed with xlrd.
I've used it a couple of times and it's pretty straightforward.
An Excel file is a good choice for the following reasons:
It is a well-known format that is easy to edit
You could prepare a predefined template with constraints and tables
Creating XML documents, transforming them to XSL/fo and rendering with Fop or RenderX. If you use docbook as the primary input, there are toolchains freely available for converting that to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but it does deliver and can be embedded in an application, AFAICT.
Creating docbook is very straightforward as it has a wide range of semantic tags, table support etc to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free flow documents with lots of text.
For fill-in-the-blanks kinds of documents, a regular reporting engine may be a better fit, or some straightforward XSL stylesheets spitting out the XSL-FO directly.

Generate ODT/DOC(X) and convert to PDF, without OO.o/MS

I have a WSGI application that generates invoices and stores them as PDF.
So far I have solved similar problems with FPDF (or equivalents), generating the PDF from scratch like a GUI. Sadly this means the entire formatting logic (positioning headers, footers and content, styling) is in the application, where it really shouldn't be.
As the templates already exist in Office formats (ODT, DOC, DOCX), I would prefer to simply use those as a basis and fill in the actual content. I've found the Appy framework, which does pretty much that with annotated ODT files.
That still leaves the bigger problem open, tho: converting ODT (or DOC, or DOCX) to PDF. On a server. Running Linux. Without GUI libraries. And thus, without OO.o or MS Office.
Is this at all possible or am I better off keeping the styling in my code?
The actual content that would be filled in is actually quite restricted: a few paragraphs, some of which may be optional, a headline or two, always at the same place, and a few rows of a table. In HTML this would be trivial.
EDIT: Basically, I want a library that can generate ODT files from ODF files acting as templates and a library that can convert the result into PDF (which is probably the crux).
I don't know how to go about automatic ODT -> PDF conversion, but a simpler route might be to generate your invoices as HTML and convert them to PDF using http://www.xhtml2pdf.com/. I haven't tried the library myself, but it definitely seems promising.
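For the HTML route, a minimal sketch of building the invoice markup with the standard library could look like this. The function name, the markup and the field names are hypothetical, and the resulting string would still have to be handed to whichever HTML-to-PDF converter you pick.

```python
from decimal import Decimal
from html import escape
from typing import Sequence, Tuple


def render_invoice_html(customer: str, items: Sequence[Tuple[str, Decimal]]) -> str:
    """Return a small HTML invoice: a headline, one table row per item, and a total."""
    rows = "\n".join(
        f"<tr><td>{escape(description)}</td><td>{amount:.2f}</td></tr>"
        for description, amount in items
    )
    total = sum((amount for _, amount in items), Decimal("0"))
    return (
        "<html><body>\n"
        f"<h1>Invoice for {escape(customer)}</h1>\n"
        "<table>\n"
        f"{rows}\n"
        f"<tr><td>Total</td><td>{total:.2f}</td></tr>\n"
        "</table>\n"
        "</body></html>"
    )
```

This keeps the layout in the markup (and its CSS) instead of in positioning code, which was the complaint about generating the PDF from scratch.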
You can use QTextDocument, QTextCursor and QTextDocumentWriter in PyQt4. A simple example to show how to write to an odt file:
>>> from PyQt4 import QtGui
>>> # Create a document object
>>> doc = QtGui.QTextDocument()
>>> # Create a cursor pointing to the beginning of the document
>>> cursor = QtGui.QTextCursor(doc)
>>> # Insert some text
>>> cursor.insertText('Hello world')
>>> # Create a writer to save the document
>>> writer = QtGui.QTextDocumentWriter()
>>> writer.supportedDocumentFormats()
[PyQt4.QtCore.QByteArray(b'HTML'), PyQt4.QtCore.QByteArray(b'ODF'), PyQt4.QtCore.QByteArray(b'plaintext')]
>>> odf_format = writer.supportedDocumentFormats()[1]
>>> writer.setFormat(odf_format)
>>> writer.setFileName('hello_world.odt')
>>> writer.write(doc)  # Returns True if successful
True
I'm not sure of the difference between odt and odf in this case. I checked the file type and it reported 'application/vnd.oasis.opendocument.text', so I assume it is odt. You can print to a PDF file by using QPrinter.
More information at:
http://qt-project.org/doc/qt-4.8/

Dynamic generation of .doc files

How can you dynamically generate a .doc file using AJAX? Python? Adobe AIR? I'm thinking of a situation where an online program/desktop app takes in user feedback in article form (a la wiki) in Icelandic character encoding and then upon pressing a button releases a .doc file containing the user input for the webpage. Any solutions/suggestions would be much appreciated.
PS- I don't want to go the C#/Java way with this.
The problem with the *.doc MS Word format is that it isn't documented well enough, so it can't be supported as well as, for example, PDF, which is a standard.
Apart from the problems with generating the doc, your users might have problems reading the doc files, for example users on Linux machines.
You should consider producing RTF on the server. It is more standard, and thus more supported both for document generation, and for reading the document afterwards. Unless you need very specific features, it should suffice for most of documents types, and MS word opens it by default, just like it opens its own native format.
PyRTF is a project you can use for RTF generation with Python.
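To give an idea of how simple RTF is for plain text, here is a hand-rolled sketch that doesn't use PyRTF at all. It escapes the RTF control characters and writes non-ASCII characters (such as Icelandic Þ or ð) as \uN? Unicode escapes. The function names and the font choice are my own assumptions.

```python
from pathlib import Path


def rtf_escape(text: str) -> str:
    """Escape text for RTF: control characters, newlines and non-ASCII characters."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\par\n")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # RTF expects signed 16-bit decimal UTF-16 code units, each followed
            # by a fallback character ("?") for readers without Unicode support.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append(f"\\u{unit}?")
    return "".join(out)


def build_rtf(text: str) -> str:
    """Wrap escaped text in a minimal RTF document with a single font."""
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0\\fs24 "
    return header + rtf_escape(text) + "}"


def write_rtf(path: str, text: str) -> None:
    """Write the document to disk; after escaping, the output is pure ASCII."""
    Path(path).write_text(build_rtf(text), encoding="ascii")
```

On the server you would return the output of build_rtf to the browser as a download, or save it with write_rtf.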
This doesn't have much to do with AJAX (in the sense that AJAX is generally used for dynamic client-side interactions).
You need a server-side script that takes the input and converts it to a .doc file.
You could use something like OpenOffice with Python, if it has a suitable interface;
see http://wiki.services.openoffice.org/wiki/Python
Or, on Windows, you can use Word COM objects directly to create the .doc via the win32 APIs,
but it is less likely that a Windows server is serving Python :)
I think a better alternative is to generate a PDF, which would be nicer and easier.
ReportLab has a wonderful PDF generation library and it works like a charm from Python.
Once you have the PDF you could use a PDF-to-DOC converter, but I think PDF would be good enough.
Edit: Doc generation
On second thought, if you insist on DOC and you have a Windows server,
you can use COM objects to generate DOC, XLS or whatever; see
http://win32com.goermezer.de/content/view/173/284/

Pure python solution to convert XHTML to PDF

I am after a pure Python solution (for the GAE) to convert webpages to pdf.
I had a look at reportlab but the documentation focuses on generating pdfs from scratch, rather than converting from HTML.
What do you recommend? - pisa?
Edit:
My use case is I have a HTML report that I want to make available in PDF too. I will make updates to this report structure so I don't want to maintain a separate PDF version, but (hopefully) convert automatically.
Also because I generate the report HTML I can ensure it is well formed XHTML to make the PDF conversion easier.
Pisa claims to support what I want to do:
pisa is a html2pdf converter using the ReportLab Toolkit, the HTML5lib and pyPdf. It supports HTML 5 and CSS 2.1 (and some of CSS 3). It is completely written in pure Python so it is platform independent. The main benefit of this tool that a user with Web skills like HTML and CSS is able to generate PDF templates very quickly without learning new technologies. Easy integration into Python frameworks like CherryPy, KID Templating, TurboGears, Django, Zope, Plone, Google AppEngine (GAE) etc.
So I will investigate it further
Have you considered pyPdf? I doubt it has anywhere like the functional richness you require, but, it IS a start, and is in pure Python. The PdfFileWriter class would be the one to generate PDF output, unfortunately it requires PageObject instances and doesn't provide real ways to put those together, except extracting them from existing PDF documents. Unfortunately all richer pdf page-generation packages I can find do appear to depend on reportlab or other non-pure-Python libraries:-(.
What you're asking for is a pure Python HTML renderer, which is a big task to say the least ('real' renderers like webkit are the product of thousands of hours of work). As far as I'm aware, there aren't any.
Instead of looking for an HTML to PDF converter, what I'd suggest is building your report in a format that's easily converted to both - for example, you could build it as a DOM (a set of linked objects), and write converters for both HTML and PDF output. This is a much more limited problem than converting HTML to PDF, and hence much easier to implement.
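A toy sketch of that idea: the report is a list of block objects, with one converter producing HTML and another producing positioned text operations that a PDF-drawing backend could consume. The class names, font sizes and coordinates are invented for illustration, and the PDF side stops short of writing an actual file.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Tuple, Union


@dataclass
class Heading:
    text: str


@dataclass
class Paragraph:
    text: str


Block = Union[Heading, Paragraph]

# (font size, line height) per block type, in points; arbitrary example values.
_STYLES = {Heading: (18, 24), Paragraph: (11, 14)}


def to_html(blocks: List[Block]) -> str:
    """Convert the report model to HTML markup."""
    tags = {Heading: "h1", Paragraph: "p"}
    return "\n".join(
        f"<{tags[type(b)]}>{escape(b.text)}</{tags[type(b)]}>" for b in blocks
    )


def to_draw_ops(
    blocks: List[Block], top: int = 800, left: int = 50
) -> List[Tuple[int, int, int, str]]:
    """Convert the model to (x, y, font_size, text) operations, laid out top-down."""
    ops = []
    y = top
    for b in blocks:
        size, line_height = _STYLES[type(b)]
        ops.append((left, y, size, b.text))
        y -= line_height
    return ops
```

Both outputs come from the same model, so a change to the report structure only has to be made once. A real version would also need line wrapping and page breaks on the PDF side.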
