I have a Python 2.6 app running on Linux that creates a CSV file. From the app, I need to create an HTML report, as a single HTML file, that presents the data from the CSV (probably as a table) and also highlights fields where the values meet certain criteria. Charting type functionality would be a nice to have.
What's the best way to do this?
No GPL stuff please.
Choose a Python csv library from here. Now that you have the data mapped to Python data structures you can iterate on it and create the html. I would use the Jinja2 templating engine which is nicely documented. Highlighting rows/cells would work by setting certain css classes on the respective tr/td elements in the table.
Related
I want to use Document.ai to extract data from tables in my pdf.
I was following this code snippet https://cloud.google.com/document-ai/docs/handle-response#code_samples_2
But my table array is always empty. I tried to do it with pdf provided by google or with mine. But it doesn't work.
When I put this documents into drag'n'drop example it work flawlessly https://cloud.google.com/document-ai/docs/drag-and-drop
Is there some trick to make it work with python client?
Be sure that you're sending the document to a Form Parser processor, as that's the processor type that can extract tables.
If you're already using the form parser, it would be helpful to provide the document you're using and the sample output from Document AI.
Update: The Demo has been updated to show that Tables are extracted using the Form Parser.
I want to enable a user to export some data to a web application I am building. The data from the legacy application can be accessed through MS Acces (ODBC). The web application is written in Django/Python, but that is not very relevant.
The user would have to export data from time to time and import it into the web app. The table structure in the web app more-or-less mirrors the one in the legacy application.
My question of how to get the data from Access to a format that is easily parseable in the web app. The data is from 5 different tables and interrelated. Is there a way to serialise the data from Access into an XML / JSON file? I know that you can do an XML export, but as far as I know that is limited to a query, so I wouldn't have the hierarchy... Is there a VBA library to help with the task?
You can reference Microsoft XML, v5.0 (or whatever version) in the Visual Basic Editor and create XML programmatically.
See
- Simple example
- Introduction to XML in Microsoft Windows (in depth example)
Answering my own question here. I did some googling and it looks like you can export data from a table together with selected other tables. For that, it is necessary to draw the relationships within Access.
That might also solve my problem (and without composing the XML manually). Will find out if this works and check back later.
source: http://msdn.microsoft.com/en-us/library/office/aa167823(v=office.11).aspx#odc_accessnewxmlfeatures_includingrelatedtableswhenexportingxml
I'm programming a MySQL database with Web interface for remote access. I used Django as a framework. But now, I want to generate some reports using the MySQL data and modify them after generating. Therefore, I automatically think of exporting data to or importing from Word. The thing is, how I do this?
I have seen several options. One of them, using Python-docx, a library to generate docx documents in Python. I could have a problem with this, because the generated reports will be large, with lots of images, tables, pages, etc. I worked with xlsxwriter, and when the files were large it took long time to generate de xlsx. I don't know if Python-docx would be the better solution.
Other option is to import data directly from Microsoft Word, using some software for this concrete purpose or using a macro VBA. I have programmed some example code with VBA to import data of MySQL using connectors ODBC and it's immediately possible, but there is thousand of objects and classes of VBA Word to learn.
Exposed the problem, any tips or suggestions??? Thanks in advance!
Another option is to generate HTML & open as a word document.
If you take a document similar to what you want to generate & save as HTML you will see what word does. Take this file as a template for your documents
I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is zipped in an xml file, so I can edit that one but there's no command line utilities or any way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However it loses formatting features and looks awful. I'm surprised there isn't a well known library which lets you edit a variety of document formats and convert them into other formats, am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson
Have you looked into using LaTeX documents?
They are perfect to use programatically (compiling documents? You gotta love that...), and you have several Python frameworks you can use such as plasTeX and PyTex.
Exporting a LaTeX documents to PDF is almost immediate.
Since you're already using PyQt anyway, it might be worth looking at Qt's built-in RTF processing module which looks decent. Here's the documentation on detailed content manipulation including inserting tables. Also the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
edit: included more links.
You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.
I don't know the kind of odience of your program, Tex is good and i would go with it.
Another possible choice is Excel format, parsing it with xlrd.
I've used it a couple of time and it's pretty straightforward.
Excel file is a good for the following reasons:
Well known format easy to edit
You could prepare a predefined template with constrains and table
Creating XML documents, transforming them to XSL/fo and rendering with Fop or RenderX. If you use docbook as the primary input, there are toolchains freely available for converting that to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but is does deliver and can be embedded in an application, AFAICT.
Creating docbook is very straightforward as it has a wide range of semantic tags, table support etc to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free flow documents with lots of text.
For filling in the blanks kind of documents, a regular reporting engine may be a better fit, or some straighforward XSL stylesheets spitting out the XSL-fo directly.
I'm making a web interface to autofill pdf forms with user data from a database. The admin needs to be able to upload a pdf (right now targeted at IRS pdf forms) and then associate the fields in the pdf with data fields in the database.
I need a way to help the admin associate the field names (stuff like "topmostSubform[0].Page2[0].p2-t66[0]") with the the data fields in the database. I'm looking for a way to modify the PDF programatically to in some way provide this information.
Basically I'm open to suggestions on how I might make the field names appear in an obvious manner on a modified version of the original pdf. The closest I've gotten is being able to insert Tooltips into the fields in the pdf by just editting the raw pdf line by line. However when editting the pdf in this manner the field names are gibberish, and so I can't just use them.
An optimal solution would be anything that could automatically parse a pdf and set each field's tooltip to be the fields name. Anything that can be run from the command line, or any python tool, or just a basic how to correctly parse a field's name from a raw pdf file would be amazing.
There may be an easier solution than this, but you could definitely get the job done with http://www.reportlab.com/software/opensource/rl-toolkit/'>ReportLab.
If you can save the current tax forms as an image, you could determine where each of the items need to be written and develop your code so that it automatically layers the appropriate values from the database on top of the image (the tax form, or whatever it might be).
Once you've determined 1) What fields need to be pulled from the database, and 2) where they 're supposed to go on within the form...
this is essentially what you'd be doing:
from reportlab.pdfgen import canvas
report_string_values = ['Alex',500,500],['Guido',400,400],
c = canvas.Canvas('hello.pdf')
c.drawImage(background_image,x_pos,y_pos) # x_pos and w_pos are # pixels from bl origin
for rsv in report_string_values:
c.drawString(rsv.x_pos,rsv.,rsv.text)
c.showPage()
c.save()
A postscript parser lives here: https://github.com/haxwithaxe/py-ps-parser
I've been interested in playing with it, but haven't yet.
This may be way off your intended track; but, it might be worth a think. I've been working on parsing scanned structured documents into Django model instances. Using tesseract and unpaper to do the pre-processing and OCR, I get over 99% accuracy. That lets me parse the OCR output text with the Levenshtein and re modules and do a simple new_instance = MyModel(parsed1, parsed2, ...).
It seems that you are trying to do something similar. Looking at the forms at http://www.irs.gov/formspubs/ They tend to have text labels left-adjacent to the fields. Using something like py-tesseract, you should be able to OCR the labels, overlay the OCR text over the form image and allow the user to select/edit the field labels.
There is a nice little tool, ocrfeeder https://live.gnome.org/OCRFeeder, that is written in python and should give you a basic idea of how the process works in a desktop app. Good luck.
Government Forms are usually not a standard PDF but a JavaScript driven XFA in a PDF wrapper, thus to enter the data programmatically you need a lookup table as the order is rarely the visual order.
Here the first field "single yes or no" "topmostSubform[0].Page1[0].c1_01[0]" is a checkbox designated well down the list of entries. of course none in this Form are "topmostSubform[0].Page2[0].p2-t66[0]" so you need totally different look-up table for each XFA. Otherwise follow the entries (luckilly there is some sence of sequencing in this form) so free format field "topmostSubform[0].Page1[0].f1_01[0]" is near "dependent:" etc.
There are XFA dedicated applications that can extract the positions of static fields, but if the fields are dynamically adapting then the page position would be a moving feast.
For XFA you need an intelligent dedicated Adobe listing (often xml / xlsx input output supplied on request from the relevant department), or build your own if Acrobat Pro does not block the attempt.
I may be interpreting the question wrong but I have a lot of experience in pdf generation with python/django because of the site that I worked on for 5 months. I would suggest using texlive. Basically what I did was built a generic tex template for a document and then used django templating to insert the fields. I rendered the template as if it were html using render_to_string and then generated it using the pdflatex command. I ran pdflatex using pythons subprocess module and a little extra. To do the generating I used this guys pdflatex module http://bit.ly/KaDMBp , with some modifications. All the things you need are in the core.py inside of the pdflatex directory.
Ex tex document (test.tex) )
\begin{document}
my name is {{input_name}} and i live in {{input_location}}.
\end{document}
Ex rendering template with django templating and render_to_string )
params={input_name:"andrew",input_location:"nyc"}
tex_doc = render_to_string('test.tex', params)
Ex generating as pdf)
pdflatex = PDFLatex(texfile=tex_path,outputdir=pdf_path)
pdflatex.transform()
Latex has a somewhat annoying, difficult learning curve but if you put in the time you can learn what you need to know in order to create these pdfs.
Hope this helps.
The SDAPS framework was designed for scenarios like this: It aids in batch-processing PDF-based forms, extract contents from designated fields and e.g. funnel those into a database for further processing.