Dynamic generation of .doc files - python

How can you dynamically generate a .doc file using AJAX? Python? Adobe AIR? I'm thinking of a situation where an online program/desktop app takes in user feedback in article form (a la wiki) in Icelandic character encoding and then upon pressing a button releases a .doc file containing the user input for the webpage. Any solutions/suggestions would be much appreciated.
PS- I don't want to go the C#/Java way with this.

The problem with the MS Word *.doc format is that it isn't documented well enough, so it can't be supported as well as, for example, PDF, which is a standard.
Apart from the problems with generating the .doc file, your users might have problems reading it, for example users on Linux machines.
You should consider producing RTF on the server. It is more standard, and thus better supported both for generating the document and for reading it afterwards. Unless you need very specific features, it should suffice for most document types, and MS Word opens it by default, just like it opens its own native format.
PyRTF is a project you can use for RTF generation with Python.
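To make the RTF suggestion concrete, here is a minimal sketch that writes RTF by hand using only the standard library, without PyRTF. It shows how Icelandic (or any non-ASCII) characters can be written as RTF `\uN?` escapes so the file itself stays plain ASCII. The names `rtf_escape` and `build_rtf`, the font and the output filename are assumptions made for illustration.

```python
def rtf_escape(text: str) -> str:
    """Escape text for RTF; non-ASCII characters become \\uN? escapes."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax and must be escaped.
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\par\n")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # RTF expects signed 16-bit UTF-16 code units, each followed by
            # a fallback character ("?") for readers without Unicode support.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append(f"\\u{unit}?")
    return "".join(out)


def build_rtf(title: str, body: str) -> str:
    """Return a small RTF document with a bold title and one body block."""
    return (
        "{\\rtf1\\ansi\\deff0"
        "{\\fonttbl{\\f0 Times New Roman;}}"
        "\\f0\\fs32\\b " + rtf_escape(title) + "\\b0\\par\n"
        "\\fs24 " + rtf_escape(body) + "\\par}"
    )


if __name__ == "__main__":
    # Hypothetical user input; on a server this would come from the form.
    document = build_rtf("Þjóðsaga", "Það var einu sinni...\nEndir.")
    with open("article.rtf", "w", encoding="ascii") as handle:
        handle.write(document)
```

Because every non-ASCII character is escaped, the result can be served with any encoding and should still open correctly in Word or LibreOffice.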

It doesn't have much to do with AJAX (in the sense that AJAX is generally used for dynamic client-side interactions).
You need a server-side script which takes the input and converts it to .doc.
You may use something like OpenOffice with Python, if it has a suitable interface;
see http://wiki.services.openoffice.org/wiki/Python
Or on Windows you can use Word COM objects directly to create the .doc through the win32 APIs,
but it is less likely that a Windows server is serving Python :)
I think a better alternative is to generate PDF, which would be nicer and easier.
ReportLab has a wonderful PDF generation library and it works like a charm from Python.
Once you have the PDF you may use some PDF-to-DOC converter, but I think PDF would be good enough.
Edit: Doc generation
On second thought, if you insist on DOC, you may well have a Windows server; in that case
you can use COM objects to generate DOC, XLS or whatever; see
http://win32com.goermezer.de/content/view/173/284/

Related

How to convert pdf to xml /json using python code

Can anyone help me convert a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
It has images
Mathematical equations
Chemical Equations
Table Data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into .xml/.json format. Are there any libraries other than PDFMiner? PyPDF2, Tabula-py, PDFQuery, Camelot, PyMuPDF, pdf to docx, pandas: none of these other libraries/utilities are suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this from the Allen Institute (it is based on GROBID and has a handy function for converting TEI.XML to JSON).
The other package which--obviously--does a good job is the Adobe PDF Extract API (see here). It's of course a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to implement in Python, well documented, and a good way to experiment and get a feel for the difficulties of reliable data extraction from PDF.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.
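Once GROBID (or a similar tool) has produced TEI XML, turning it into JSON is mostly a matter of walking the tree. The sketch below uses only the standard library. The sample document is a hand-written stand-in, not real GROBID output, and the `tei_to_dict` name and the chosen fields (title, section headings, paragraphs) are assumptions for illustration; real output carries much more structure (figures, formulas, references).

```python
import json
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written stand-in for a TEI file; not actual GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Sample paper</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head>
      <p>First paragraph.</p><p>Second paragraph.</p></div>
    <div><head>Methods</head>
      <p>We mixed <hi>A</hi> and B.</p></div>
  </body></text>
</TEI>"""


def tei_to_dict(tei_xml: str) -> dict:
    """Extract the title and the body sections from a TEI XML string."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    sections = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        sections.append({
            "heading": div.findtext("tei:head", default="", namespaces=TEI_NS),
            # itertext() also collects text nested in inline elements.
            "paragraphs": [
                "".join(p.itertext()).strip()
                for p in div.iterfind("tei:p", TEI_NS)
            ],
        })
    return {"title": title.strip(), "sections": sections}


if __name__ == "__main__":
    print(json.dumps(tei_to_dict(SAMPLE_TEI), indent=2, ensure_ascii=False))
```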

Creating PDFs from HTML/Javascript in Python with no OS dependencies

Is there any way to use Python to create PDF documents from HTML/CSS/Javascript, without introducing any OS-level dependencies?
It seems every existing solution requires special supplemental software, but upon reviewing PDF formatting specifications and HTML/CSS/Javascript rendering, there doesn't appear to be a reason why a Python solution can't exist without them. Some solutions come close, such as pyppeteer, but it still leans on a headless Chrome installation locally. These dependencies mean that microservices can't be leveraged, even though PDF generation would otherwise seem to be a viable use case for them.
While similar questions have come up many times over on SO, there doesn't appear to have been a viable technique shown without having to install specialized dependencies on the OS.
Some similar questions which routinely recommend wkhtmltopdf or are otherwise out of date (e.g., moving PDF printing support outside of Chrome is dead now):
How to convert webpage into PDF by using Python
How to convert a local HTML file to PDF using Python in Windows
HTML to PDF conversion using Chrome pdfium
How does Chrome render PDFs from HTML so well?
Convert a HTML/CSS/Javascript file to PDF using Python?
If I've somehow missed a viable approach, please feel free to mark this as a duplicate with my thanks!
Edit February 2021: It appears that the cefpython project may meet these demands - PDF printing support seems like it could be implemented in the near future.
So to clarify and formalize what others have said:
If you want to create PDF documents from HTML/CSS/javascript content, you will necessarily need a javascript engine (because you obviously need to execute the javascript if it affects the visuals of the document). This is the most complex component that you need.
As for now, there is no ECMAscript compliant engine written in pure python that is well-maintained (that would be a huge project)... There will probably never be one, since compilers and VMs for languages need to be performant and are thus usually written in a performant low-level language.
So you will always need compiled binaries for that, and for the HTML renderers too: they are less complex, but they also need to be performant if used in browsers, so they are usually written in C++ or the like.
The javascript engine and HTML renderer are the major part of a browser, so a headless browser is a good solution to this requirement.
Try this library: xhtml2pdf
It worked for me. Here is the documentation: doc
Some sample code:
from xhtml2pdf import pisa


def convert_html_to_pdf(source_html, output_filename):
    # Open the output file for writing (truncated binary); the with-block
    # closes it even if the conversion raises.
    with open(output_filename, "w+b") as result_file:
        # Convert HTML to PDF.
        pisa_status = pisa.CreatePDF(
            source_html,       # the HTML to convert
            dest=result_file)  # file handle to receive the result
    # pisa_status.err is falsy on success and truthy on errors.
    return pisa_status.err


# Define your data: read the HTML source into a string.
with open("2020-06.html") as html_file:
    source_html = html_file.read()
output_filename = "test.pdf"
convert_html_to_pdf(source_html, output_filename)

Write Microsoft Word Doc with MySQL data

I'm programming a MySQL database with a Web interface for remote access. I used Django as a framework. But now I want to generate some reports using the MySQL data and modify them after generating. So I naturally thought of exporting the data to Word, or importing it from Word. The thing is, how do I do this?
I have seen several options. One of them is python-docx, a library to generate docx documents in Python. I could have a problem with this, because the generated reports will be large, with lots of images, tables, pages, etc. I worked with xlsxwriter, and when the files were large it took a long time to generate the xlsx. I don't know whether python-docx would be the better solution.
The other option is to import the data directly from Microsoft Word, using some software built for this purpose or a VBA macro. I have written some example VBA code that imports MySQL data through ODBC connectors, and that works right away, but there are thousands of Word VBA objects and classes to learn.
That is the problem; any tips or suggestions? Thanks in advance!
Another option is to generate HTML and open it as a Word document.
If you take a document similar to what you want to generate and save it as HTML, you will see what Word does. Take this file as a template for your documents.
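A minimal sketch of the HTML approach, assuming the report is a single table: build the HTML from the query rows and save it with a .doc extension so Word opens it. The `build_report_html` name, the column names and the hard-coded rows are hypothetical; in the Django app the rows would come from a queryset or `cursor.fetchall()`, and the markup would be based on the HTML that Word itself saved.

```python
from html import escape


def build_report_html(title, headers, rows):
    """Return an HTML page with one table; all values are HTML-escaped."""
    head_cells = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body_rows = "\n".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return (
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head><body>\n"
        f"<h1>{escape(title)}</h1>\n"
        f'<table border="1"><tr>{head_cells}</tr>\n{body_rows}\n</table>\n'
        "</body></html>"
    )


if __name__ == "__main__":
    # Stand-in for rows fetched from MySQL.
    rows = [("Widget", 12, "4.50"), ("Gadget & Co", 3, "19.99")]
    html = build_report_html("Monthly report", ["Item", "Qty", "Price"], rows)
    # Word opens HTML content saved under a .doc name.
    with open("report.doc", "w", encoding="utf-8") as handle:
        handle.write(html)
```

Since this is plain string building, generation time should grow roughly linearly with the number of rows, which may matter for the large reports mentioned in the question.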

What python modules should I use to edit a word document then turn it to a pdf?

I want users to be able to create the report template in Microsoft Word; I'll then probably add document fields. The script then evaluates a number of things, adds the appropriate text to the fields, and creates a PDF of the filled-in form.
So which modules would be best for this? I've looked at reportlab but I need to work from a pre-generated template and that doesn't seem feasible.
If you will only use it under Windows with Word installed, you could use PyWin32, which lets you access the API of the Office suite. You could also try IronPython, as suggested here.
If you need to read a docx template regardless of the platform you could try this outdated extension.
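For the platform-independent case, it may help to know that a .docx file is a ZIP archive whose body text is stored in word/document.xml, so a simple fill can be sketched with the standard library alone. The `{{name}}` placeholder syntax and the `fill_docx` function are assumptions, not part of any library. The sketch also assumes each placeholder sits in a single text run (Word sometimes splits typed text across several runs, in which case the plain string replace will not find it), and it does not cover the PDF conversion step.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_docx(template_path, output_path, values):
    """Copy a .docx, replacing {{key}} placeholders in word/document.xml."""
    with zipfile.ZipFile(template_path) as src, \
            zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the result stays valid XML.
                    xml = xml.replace("{{" + key + "}}", escape(str(value)))
                data = xml.encode("utf-8")
            # All other archive members are copied unchanged.
            dst.writestr(item, data)


if __name__ == "__main__":
    # Hypothetical file names and field values.
    fill_docx("template.docx", "filled.docx",
              {"customer": "Smith & Sons", "total": "1250.00"})
```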
If it suits your application to use a cloud service to populate Doc/DocX files, there is a commercial system called Docmosis that can populate plain-text (or merge) fields and stream back populated PDF documents to your Python system, or deliver them via email etc.
You would upload your "template" Doc files to Docmosis via the Website (or api calls) then invoke Docmosis using a https post from your Python code.
Please note I work for the company that created Docmosis.
Hope that helps.

What's a good document standard to use programmatically?

I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is XML in a zip file, so I can edit that one, but there are no command line utilities or any other way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However it loses formatting features and looks awful. I'm surprised there isn't a well known library which lets you edit a variety of document formats and convert them into other formats, am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson
Have you looked into using LaTeX documents?
They are perfect to use programmatically (compiling documents? You gotta love that...), and there are several Python frameworks you can use, such as plasTeX and PyTex.
Exporting a LaTeX document to PDF is almost immediate.
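A minimal sketch of the LaTeX route, assuming a `pdflatex` binary is installed and on the PATH: fill a template by plain string replacement, escape LaTeX special characters in the inserted values, and compile. The `@TITLE@`/`@ROWS@` markers, the function names and the `build` directory are assumptions for illustration; plasTeX and PyTex are not used here.

```python
import shutil
import subprocess
from pathlib import Path

# Characters with special meaning in LaTeX and their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

TEMPLATE = r"""\documentclass{article}
\begin{document}
\section*{@TITLE@}
\begin{tabular}{ll}
@ROWS@
\end{tabular}
\end{document}
"""


def latex_escape(text):
    """Escape one value character by character (no double escaping)."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in str(text))


def render_tex(title, rows):
    """Insert the title and the two-column table rows into the template."""
    body = "\n".join(
        " & ".join(latex_escape(cell) for cell in row) + r" \\"
        for row in rows
    )
    return TEMPLATE.replace("@TITLE@", latex_escape(title)).replace("@ROWS@", body)


def compile_pdf(tex_source, workdir="build"):
    """Write report.tex into workdir and run pdflatex on it."""
    if shutil.which("pdflatex") is None:
        raise RuntimeError("pdflatex not found on PATH")
    out = Path(workdir)
    out.mkdir(exist_ok=True)
    (out / "report.tex").write_text(tex_source, encoding="utf-8")
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", "report.tex"],
        cwd=out, check=True,
    )
    return out / "report.pdf"


if __name__ == "__main__":
    print(compile_pdf(render_tex("R&D costs", [("Item_1", "50%")])))
```

Note that this still requires a TeX distribution on the client machine, which is the same kind of dependency the question raised about Open Office.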
Since you're already using PyQt anyway, it might be worth looking at Qt's built-in RTF processing module which looks decent. Here's the documentation on detailed content manipulation including inserting tables. Also the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
edit: included more links.
You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.
I don't know what kind of audience your program has; TeX is good and I would go with it.
Another possible choice is the Excel format, parsing it with xlrd.
I've used it a couple of times and it's pretty straightforward.
An Excel file is a good choice for the following reasons:
It is a well-known format that is easy to edit
You could prepare a predefined template with constraints and tables
Creating XML documents, transforming them to XSL/fo and rendering with Fop or RenderX. If you use docbook as the primary input, there are toolchains freely available for converting that to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but it does deliver and can be embedded in an application, AFAICT.
Creating docbook is very straightforward as it has a wide range of semantic tags, table support etc to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free flow documents with lots of text.
For fill-in-the-blanks kinds of documents, a regular reporting engine may be a better fit, or some straightforward XSL stylesheets spitting out the XSL-FO directly.
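To illustrate how simple the DocBook input can be to produce, here is a sketch that builds a small article with a table using only the standard library. It uses unnamespaced DocBook 4 style element names (`article`, `para`, `informaltable`, `tgroup`, `row`, `entry`) and omits the DOCTYPE declaration, which a real toolchain may require; the `build_docbook` name is an assumption. The XSL stylesheets and FOP/RenderX steps are not shown.

```python
import xml.etree.ElementTree as ET


def build_docbook(title, paragraphs, table_rows):
    """Return a DocBook-style article as an XML string."""
    article = ET.Element("article")
    ET.SubElement(article, "title").text = title
    for text in paragraphs:
        ET.SubElement(article, "para").text = text
    if table_rows:
        table = ET.SubElement(article, "informaltable")
        # tgroup must declare the number of columns.
        tgroup = ET.SubElement(table, "tgroup", cols=str(len(table_rows[0])))
        tbody = ET.SubElement(tgroup, "tbody")
        for row in table_rows:
            row_el = ET.SubElement(tbody, "row")
            for cell in row:
                ET.SubElement(row_el, "entry").text = str(cell)
    # ElementTree escapes <, > and & in text content for us.
    return ET.tostring(article, encoding="unicode")


if __name__ == "__main__":
    xml = build_docbook(
        "Quarterly report",
        ["Values below are examples."],
        [("Region", "Sales"), ("North", 120), ("South", 95)],
    )
    with open("report.xml", "w", encoding="utf-8") as handle:
        handle.write(xml)
```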
