What's a good document standard to use programmatically? - python

I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is zipped in an xml file, so I can edit that one but there's no command line utilities or any way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However it loses formatting features and looks awful. I'm surprised there isn't a well known library which lets you edit a variety of document formats and convert them into other formats, am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson

Have you looked into using LaTeX documents?
They are perfect to use programatically (compiling documents? You gotta love that...), and you have several Python frameworks you can use such as plasTeX and PyTex.
Exporting a LaTeX documents to PDF is almost immediate.

Since you're already using PyQt anyway, it might be worth looking at Qt's built-in RTF processing module which looks decent. Here's the documentation on detailed content manipulation including inserting tables. Also the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
edit: included more links.

You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.

I don't know the kind of odience of your program, Tex is good and i would go with it.
Another possible choice is Excel format, parsing it with xlrd.
I've used it a couple of time and it's pretty straightforward.
Excel file is a good for the following reasons:
Well known format easy to edit
You could prepare a predefined template with constrains and table

Creating XML documents, transforming them to XSL/fo and rendering with Fop or RenderX. If you use docbook as the primary input, there are toolchains freely available for converting that to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but is does deliver and can be embedded in an application, AFAICT.
Creating docbook is very straightforward as it has a wide range of semantic tags, table support etc to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free flow documents with lots of text.
For filling in the blanks kind of documents, a regular reporting engine may be a better fit, or some straighforward XSL stylesheets spitting out the XSL-fo directly.

Related

How to convert pdf to xml /json using python code

Can any one help me on how to convert pdf file to xml file using python code? My pdf contains:
Unstructured data
It has images
Mathematical equations
Chemical Equations
Table Data
Logo's tag's etc.
I tried using PDFMiner, but my pdf data was not converted into .xml/json file format. Are there any libraries other than PDFMiner? PyPDF2, Tabula-py, PDFQuery, comelot, PyMuPDF, pdf to dox, pandas- these other libraries/utilities all not suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend you trying is GROBID (see here for the full documentation). You can play with an online demo here to see if fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this from the Allen Institute (it is based on GROBID and has a handy function for converting TEI.XML to JSON).
The other package which--obviously--does a good job is the Adobe PDF Extract API (see here). It's of course a paid service but when you register for an account you get 1.000 document transactions for free. It's easy to implement in Python, well documented, and a good way for experimenting and getting a feel for the difficulties of reliable data extraction from PDF.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.

Search office documents for string python

I've been looking for a fast and relatively easy way of searching (grep-ish) for user-defined strings in files of varying formats, i.e xlsx, docx, pptx, pdf using Python.
My research has led me to believe that there might not be a convenient way of doing this, as per a single module or similar. Am I forced to use a separate module for each file type? And if so are these approriate?
docx
openpyxl
pptx
slate
I also looked at forms of decompression to get to the xml-files containing actual text but it seems unwieldy. I just want to be sure that there is no simple, uniform way of handling all of these different filetypes.
Well, I've mostly figured it out. In the end I decided to use powershell combined with "itextsharp.dll" to process the files. It turned out to be simpler than using portable python. Thanks for the answers:-)

Is there a reliable python library for taking a BibTex entry and outputting it into specific formats?

I'm developing using Python and Django for a website. I want to take a BibTex entry and output it in a view in 3 different formats, MLA, APA, and Chicago. Is there a library out there that already does this or am I going to have to manually do the string formatting?
There are the following projects:
BibtexParser
Pybtex
Pybliographer
BabyBib
If you need complex parsing and output, Pybtex is recommended. Example:
>>> from pybtex.database.input import bibtex
>>> parser = bibtex.Parser()
>>> bib_data = parser.parse_file('examples/foo.bib')
>>> bib_data.entries.keys()
[u'ruckenstein-diffusion', u'viktorov-metodoj', u'test-inbook', u'test-booklet']
>>> print bib_data.entries['ruckenstein-diffusion'].fields['title']
Predicting the Diffusion Coefficient in Supercritical Fluids
Good luck.
Having tried them, all of these projects are bad, for various reasons: terrible APIs, bad documentation, and a failure to parse valid BibTeX files. The implementation you want doesn't show up in most Google searches, from my own searching: it's biblib. This text from the README should sell it:
There are a lot of BibTeX parsers out there. Most of them are complete nonsense based on some imaginary grammar made up by the module's author that is almost, but not quite, entirely unlike BibTeX's actual grammar. BibTeX has a grammar. It's even pretty simple, though it's probably not what you think it is. The hardest part of BibTeX's grammar is that it's only written down in one place: the BibTeX source code.
The accepted answer of using pybtex is fraught with danger as Pybtex does not preserve the bibtex format of even simple bibtex files. (https://bitbucket.org/pybtex-devs/pybtex/issues/130/need-to-specially-represent-bibtex-markup)
Pybtex is therefore losing bibtex information when reading and re-writing a simple .bib file without making any changes. Users should be very careful following the recommendations to use pybtex.
I will try biblib as well and report back but the accepted answer should be edited to not recommend pybtex.
Edit:
I was able to import the data using Bibtex Parser, without any loss of data. However, I had to compile from https://github.com/sciunto-org/python-bibtexparser as the version installed via pip was bugged at the time. Users should verify that pip is getting the latest version.
As for exporting, once the data has been imported via BibTex Parser, it's in a dictionary, and can be exported as the user desires. BibTex Parser does not have built in functions for exporting in common formats. As I did not need this functionality, I didn't specifically test it. However, once imported into a dictionary, the string output can be converted to any citation format rather easily.
Here, pybtex and a custom style file can help. I used the style file provided by the journal and compiled in LaTeX instead, but PyBtex has python style files (but also allows ingesting .sty files). So I would recommend taking the Bibtex Parser input and transferring it to PyBtex (or similar) for outputting in a certain style.
The closest thing I know of is the pybtex package

Dynamic generation of .doc files

How can you dynamically generate a .doc file using AJAX? Python? Adobe AIR? I'm thinking of a situation where an online program/desktop app takes in user feedback in article form (a la wiki) in Icelandic character encoding and then upon pressing a button releases a .doc file containing the user input for the webpage. Any solutions/suggestions would be much appreciated.
PS- I don't want to go the C#/Java way with this.
The problem with the *.doc MS word format is, that it isn't documented enough, therefor it can't have a very good support like, for example, PDF, which is a standard.
Except of the problems with generating the doc, you're users might have problems reading the doc files. For example users on linux machines.
You should consider producing RTF on the server. It is more standard, and thus more supported both for document generation, and for reading the document afterwards. Unless you need very specific features, it should suffice for most of documents types, and MS word opens it by default, just like it opens its own native format.
PyRTF is an project you can use for RTF generation with python.
It don't have to do much with ajax(in th sense that ajax is generally used for dynamic client side interactions)
You need a server side script which takes the input and converts it to doc.
You may use something like openoffice and python if it has some interface
see http://wiki.services.openoffice.org/wiki/Python
or on windows you can directly use Word COM objects to create doc using win32apis
but it is less probable, that a windows server serving python :)
I think better alternative is to generate PDF which would be nicer and easier.
Reportlab has a wonderful pdf generation library and it works like charm from python.
Once you have pdf you may use some pdf to doc converter, but I think PDF would be good enough.
Edit: Doc generation
On second thought if you are insisting on DOC you may have windows server in that case
you can use COM objets to generate DOC, xls or whatever see
http://win32com.goermezer.de/content/view/173/284/

Pure python solution to convert XHTML to PDF

I am after a pure Python solution (for the GAE) to convert webpages to pdf.
I had a look at reportlab but the documentation focuses on generating pdfs from scratch, rather than converting from HTML.
What do you recommend? - pisa?
Edit:
My use case is I have a HTML report that I want to make available in PDF too. I will make updates to this report structure so I don't want to maintain a separate PDF version, but (hopefully) convert automatically.
Also because I generate the report HTML I can ensure it is well formed XHTML to make the PDF conversion easier.
Pisa claims to support what I want to do:
pisa is a html2pdf converter using the
ReportLab Toolkit, the HTML5lib and
pyPdf. It supports HTML 5 and CSS 2.1
(and some of CSS 3). It is completely
written in pure Python so it is
platform independent. The main benefit
of this tool that a user with Web
skills like HTML and CSS is able to
generate PDF templates very quickly
without learning new technologies.
Easy integration into Python
frameworks like CherryPy, KID
Templating, TurboGears, Django, Zope,
Plone, Google AppEngine (GAE) etc.
So I will investigate it further
Have you considered pyPdf? I doubt it has anywhere like the functional richness you require, but, it IS a start, and is in pure Python. The PdfFileWriter class would be the one to generate PDF output, unfortunately it requires PageObject instances and doesn't provide real ways to put those together, except extracting them from existing PDF documents. Unfortunately all richer pdf page-generation packages I can find do appear to depend on reportlab or other non-pure-Python libraries:-(.
What you're asking for is a pure Python HTML renderer, which is a big task to say the least ('real' renderers like webkit are the product of thousands of hours of work). As far as I'm aware, there aren't any.
Instead of looking for an HTML to PDF converter, what I'd suggest is building your report in a format that's easily converted to both - for example, you could build it as a DOM (a set of linked objects), and write converters for both HTML and PDF output. This is a much more limited problem than converting HTML to PDF, and hence much easier to implement.

Categories

Resources