I'm making a web interface to autofill pdf forms with user data from a database. The admin needs to be able to upload a pdf (right now targeted at IRS pdf forms) and then associate the fields in the pdf with data fields in the database.
I need a way to help the admin associate the field names (stuff like "topmostSubform[0].Page2[0].p2-t66[0]") with the data fields in the database. I'm looking for a way to modify the PDF programmatically to provide this information in some way.
Basically, I'm open to suggestions on how I might make the field names appear in an obvious manner on a modified version of the original pdf. The closest I've gotten is inserting tooltips into the fields in the pdf by editing the raw pdf line by line. However, when editing the pdf in this manner the field names are gibberish, so I can't just use them.
An optimal solution would be anything that could automatically parse a pdf and set each field's tooltip to be the field's name. Anything that can be run from the command line, any python tool, or just a basic how-to on correctly parsing a field's name from a raw pdf file would be amazing.
There may be an easier solution than this, but you could definitely get the job done with ReportLab (http://www.reportlab.com/software/opensource/rl-toolkit/).
If you can save the current tax forms as an image, you could determine where each of the items needs to be written and develop your code so that it automatically layers the appropriate values from the database on top of the image (the tax form, or whatever it might be).
Once you've determined 1) what fields need to be pulled from the database, and 2) where they're supposed to go within the form...
this is essentially what you'd be doing:
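from reportlab.pdfgen import canvas

# (text, x_pos, y_pos) entries; positions are in points from the bottom-left origin
report_string_values = [('Alex', 500, 500), ('Guido', 400, 400)]
background_image = 'form.png'  # placeholder path to the form saved as an image

c = canvas.Canvas('hello.pdf')
c.drawImage(background_image, 0, 0)  # x and y are points from the bottom-left origin
for text, x_pos, y_pos in report_string_values:
    c.drawString(x_pos, y_pos, text)
c.showPage()
c.save()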
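One thing to watch with this approach: positions measured on the saved image are usually in pixels from the top-left corner, while ReportLab draws in points (1/72 inch) from the bottom-left. A minimal sketch of the conversion, assuming the image was rendered at a known DPI and the page is US Letter; `px_to_pt` and the 150 DPI figure are made up for illustration:

```python
# Letter page height in points (11 in * 72 pt/in); an assumption, adjust per form.
LETTER_HEIGHT_PT = 792.0


def px_to_pt(x_px, y_px, dpi, page_height_pt=LETTER_HEIGHT_PT):
    """Convert top-left pixel coordinates to bottom-left point coordinates."""
    x_pt = x_px * 72.0 / dpi
    # Flip the vertical axis: image y grows downward, PDF y grows upward.
    y_pt = page_height_pt - y_px * 72.0 / dpi
    return x_pt, y_pt


# A spot 300 px right and 150 px down on a 150 DPI rendering of the page.
x, y = px_to_pt(300, 150, dpi=150)
```

The resulting `x` and `y` can be passed straight to `drawString`.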
A postscript parser lives here: https://github.com/haxwithaxe/py-ps-parser
I've been interested in playing with it, but haven't yet.
This may be way off your intended track; but, it might be worth a think. I've been working on parsing scanned structured documents into Django model instances. Using tesseract and unpaper to do the pre-processing and OCR, I get over 99% accuracy. That lets me parse the OCR output text with the Levenshtein and re modules and do a simple new_instance = MyModel(parsed1, parsed2, ...).
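For the matching step, here is a rough sketch of the idea using only the standard library. Note that `difflib` stands in for the Levenshtein module here (it computes a similarity ratio, not an edit distance), and the label table, field names and cutoff are invented for the example:

```python
import difflib
import re

# Hypothetical labels expected on the form, mapped to model field names.
KNOWN_LABELS = {
    "first name": "first_name",
    "last name": "last_name",
    "social security number": "ssn",
}


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_label(ocr_text, cutoff=0.8):
    """Return the model field whose label is closest to the OCR text, or None."""
    candidates = difflib.get_close_matches(
        normalize(ocr_text), list(KNOWN_LABELS), n=1, cutoff=cutoff
    )
    return KNOWN_LABELS[candidates[0]] if candidates else None


print(match_label("F1rst  Name:"))
print(match_label("Soc1al Secur1ty Number"))
print(match_label("Telephone"))
```

The cutoff is the knob to tune: too low and unrelated labels start matching, too high and ordinary OCR noise gets rejected.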
It seems that you are trying to do something similar. Looking at the forms at http://www.irs.gov/formspubs/, they tend to have text labels left-adjacent to the fields. Using something like py-tesseract, you should be able to OCR the labels, overlay the OCR text on the form image, and allow the user to select/edit the field labels.
There is a nice little tool, ocrfeeder https://live.gnome.org/OCRFeeder, that is written in python and should give you a basic idea of how the process works in a desktop app. Good luck.
Government forms are usually not standard PDFs but JavaScript-driven XFA forms in a PDF wrapper, so to enter the data programmatically you need a lookup table, as the field order is rarely the visual order.
Here the first field, "single yes or no" ("topmostSubform[0].Page1[0].c1_01[0]"), is a checkbox that appears well down the list of entries. Of course, none of the fields in this form is "topmostSubform[0].Page2[0].p2-t66[0]", so you need a totally different lookup table for each XFA. Otherwise, follow the entries (luckily there is some sense of sequencing in this form), so the free-format field "topmostSubform[0].Page1[0].f1_01[0]" is near "dependent:", etc.
There are dedicated XFA applications that can extract the positions of static fields, but if the fields adapt dynamically then the page position is a moving target.
For XFA you need an intelligent, dedicated Adobe listing (often XML / XLSX input and output, supplied on request by the relevant department), or you can build your own if Acrobat Pro does not block the attempt.
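As a sketch of what such a lookup table could look like in Python: the two PDF field names are the ones quoted above, but the form key and the database column names are hypothetical, so treat this as an illustration of the shape rather than a real mapping.

```python
# One lookup table per form, because field names differ between XFA forms.
# The form key and the database column names are hypothetical.
FIELD_MAPS = {
    "example_form": {
        "topmostSubform[0].Page1[0].c1_01[0]": "filing_status_single",
        "topmostSubform[0].Page1[0].f1_01[0]": "dependent_name",
    },
}


def build_fill_data(form_key, db_row):
    """Translate a database row into {pdf_field_name: value} for one form."""
    mapping = FIELD_MAPS[form_key]
    return {
        pdf_field: db_row[column]
        for pdf_field, column in mapping.items()
        if column in db_row  # skip columns the row does not provide
    }


row = {"filing_status_single": True, "dependent_name": "Ada", "unused": 1}
fill_data = build_fill_data("example_form", row)
```

Keeping the table as plain data means the admin interface only has to edit one dictionary per uploaded form.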
I may be interpreting the question wrong, but I have a lot of experience in pdf generation with python/django from a site that I worked on for 5 months. I would suggest using texlive. Basically, what I did was build a generic tex template for a document and then use django templating to insert the fields. I rendered the template as if it were html using render_to_string and then generated the pdf using the pdflatex command. I ran pdflatex using Python's subprocess module and a little extra. To do the generating I used this guy's pdflatex module, http://bit.ly/KaDMBp, with some modifications. All the things you need are in the core.py inside the pdflatex directory.
Example tex document (test.tex):
\begin{document}
my name is {{input_name}} and i live in {{input_location}}.
\end{document}
Example of rendering the template with django templating and render_to_string:
from django.template.loader import render_to_string
params = {"input_name": "andrew", "input_location": "nyc"}  # keys must be strings
tex_doc = render_to_string('test.tex', params)
Example of generating the pdf:
pdflatex = PDFLatex(texfile=tex_path, outputdir=pdf_path)
pdflatex.transform()
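If you'd rather not depend on that module, the same step can be sketched with subprocess directly. This is a minimal version, not what the module above does internally; the function names are made up, and it assumes `pdflatex` is on the PATH:

```python
import subprocess
from pathlib import Path


def pdflatex_command(tex_path, output_dir):
    """Build the argument list for a non-interactive pdflatex run."""
    return [
        "pdflatex",
        "-interaction=nonstopmode",  # never stop to prompt for input
        "-halt-on-error",            # fail fast instead of limping on
        f"-output-directory={output_dir}",
        str(tex_path),
    ]


def compile_tex(tex_source, output_dir, name="document"):
    """Write rendered TeX to disk, compile it, and return the PDF path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    tex_path = output_dir / f"{name}.tex"
    tex_path.write_text(tex_source, encoding="utf-8")
    # check=True raises CalledProcessError if pdflatex exits non-zero.
    subprocess.run(
        pdflatex_command(tex_path, output_dir), check=True, capture_output=True
    )
    return output_dir / f"{name}.pdf"
```

Here `compile_tex(tex_doc, "/tmp/out")` would take the string produced by render_to_string. Passing the arguments as a list (no shell) avoids quoting problems with paths.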
LaTeX has a somewhat annoying, steep learning curve, but if you put in the time you can learn what you need to know in order to create these pdfs.
Hope this helps.
The SDAPS framework was designed for scenarios like this: it aids in batch-processing PDF-based forms, extracting contents from designated fields and, for example, funneling those into a database for further processing.
Related
I am currently developing a proprietary PDF parser that can read multiple types of documents with various types of data. Before starting, I was wondering whether reading PowerPoint slides is possible. My employer uses presentation guidelines that require imagery and background designs. Is it possible to build a parser that can read the data from these PowerPoint PDFs without the slide decor getting in the way?
So the workflow would basically be this:
At the end of a project, the project report is delivered in the form of a presentation.
The presentation would be converted to PDF.
The PDF would be submitted to my application.
The application would read the slides and create a data-focused report for quick review.
The goal of the application is to cut down on the amount of reading that needs to be done by a significant amount as some of these presentation reports can be many pages long with not enough time in the day.
Parsing PDFs into structured data is always tricky, as the format is geared towards precise printing rather than ease of editing or data extraction.
A PDF contains information like "there's a label with such text at such (x,y) position on a certain page", or things like that.
You will very likely need some heuristics to turn that into structured data.
In effect, it is a form of scraping.
Searching your favorite search engine for "PDF scraping", or something like that, would be a good start.
Also, you may want to look at these similar posts:
PDF Data and Table Scraping to Excel
How to extract table as text from the PDF using Python?
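To give an idea of the kind of heuristic mentioned above, here is a small sketch that turns positioned text fragments back into lines. It assumes some extraction library has already produced `(x, y, text)` tuples in PDF coordinates (y growing upward); the function name and the tolerance are made up:

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text, top to bottom.

    A fragment joins the current line when its y is within y_tolerance
    of the first fragment placed on that line.
    """
    lines = []  # each entry: (y of the line, [(x, text), ...])
    # Highest y first (top of the page), then left to right.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within each line, order the fragments by x before joining.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


fragments = [(300, 700.5, "42"), (72, 700, "Total:"), (72, 720, "Invoice")]
lines = group_into_lines(fragments)
```

Real documents need more than this (columns, rotated text, tables), which is why this kind of work tends to end up as a pile of document-specific rules.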
A PowerPoint PDF isn't a type of PDF.
There isn't going to be anything natively in the PDF that identifies elements on the page as being 'slide' graphics that originated from a PowerPoint file, for example.
You could try building an algorithm that makes decisions about which content to drop from the created PDF, but that would be tricky and seems like the wrong approach to me.
A better approach would be to "Export" the PPT to text first, e.g. in Microsoft PowerPoint export it to an RTF file so you get all of the text out, and then use that directly or convert it to PDF.
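If exporting is not an option and you do go the content-dropping route, one simple starting point is to discard text that repeats on most pages, since slide decor such as headers and footers tends to do that. A sketch under that assumption; it only covers text, not background images, and the names and the threshold are made up:

```python
from collections import Counter


def drop_repeated_lines(pages, max_share=0.6):
    """Remove lines that occur on more than max_share of the pages."""
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each line once per page
    limit = max_share * len(pages)
    return [[line for line in page if counts[line] <= limit] for page in pages]


pages = [
    ["ACME Corp - Confidential", "Project goals", "Reduce churn by 5%"],
    ["ACME Corp - Confidential", "Results", "Churn fell by 6%"],
    ["ACME Corp - Confidential", "Next steps", "Roll out to all regions"],
]
cleaned = drop_repeated_lines(pages)
```

Each inner list is the text of one page, already split into lines by whatever PDF library you use.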
I have an original text that I want to translate. I normally do it manually, but I know I could save a lot of time by automatically translating the most frequent words and expressions.
I will find out how to translate simple words; that is not the problem. I have read some books on python and I think it can be done with string manipulation.
But I am lost about how to create the output file.
The output file will contain:
short empty forms ready to be filled wherever there is text that has not been translated
the translated words wherever they were in the original file
In the output file I will fill in the empty forms manually; after pressing Tab, the cursor should jump to the next empty form.
I am lost here. I know how to do forms in html, but the language I am used to is Python.
I would like to know what modules from Python I could use. I need some guidance on this.
Can you recommend me a book or a tool that explains how to do something similar to this?
This is what I want to do, assuming I have managed to create a simple database to translate colors from Spanish to English.
The first step contains the original file.
The second step contains the automatic translation.
In the third step I complete the manual translation.
After finishing everything is grouped into a normal txt file ready to be used.
I think it is quite clear. I don't expect people to tell me the code to do this, I just need to know what tools could be used to achieve my goal.
Thanks for editing.
To create an interface that works with a web browser, Flask for Python is a good method for creating webforms. There are tutorials available.
One method for storing data would be an SQLite file. That may be more than you need, so I'd recommend starting with a CSV file. Libraries exist in Python for both CSVs and SQLite.
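To make that concrete, here is a small sketch of the CSV idea using only the standard library: known words are translated, and every unknown word becomes an empty text input (in a browser, Tab moves from one input to the next by default). The glossary contents, function names and input naming are invented for the example, and it handles single whitespace-separated words only:

```python
import csv
import html
import io

# Stand-in for a glossary file; in practice this would be read from disk.
GLOSSARY_CSV = "spanish,english\nrojo,red\nazul,blue\nverde,green\n"


def load_glossary(csv_text):
    """Read a two-column CSV into a {source_word: translation} dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["spanish"]: row["english"] for row in reader}


def to_form(text, glossary):
    """Translate known words; emit an empty text input for unknown ones."""
    parts = []
    for index, word in enumerate(text.split()):
        translation = glossary.get(word.lower())
        if translation is not None:
            parts.append(html.escape(translation))
        else:
            # The original word is shown as a placeholder hint.
            hint = html.escape(word, quote=True)
            parts.append(f'<input name="w{index}" placeholder="{hint}">')
    return "<form>" + " ".join(parts) + "</form>"


glossary = load_glossary(GLOSSARY_CSV)
page = to_form("rojo amarillo azul", glossary)
```

A Flask view could return `page` inside a template and, on submit, stitch the typed values back together with the translated words into the final txt file.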
I am creating a dynamic blog. I use Django's admin to add posts, and I have created some simple tags that python then substitutes for the actual html and css that are needed by the browser. This makes each blog easier to create and easier to read while creating.
Before Django saves the new blog, I've coded my model to send the text to a python script, which parses the code and creates the finished html.
This all works great, but I would also like to be able to parse the code before Django loads it. That way I can remove the html/css programmatically, changing it back to the easier-to-read tags and making it easier to edit an already created blog.
Is there a way to capture control of Django admin BEFORE it loads model data into the form for editing?
A simpler solution is to have two fields: the original and the generated HTML.
Use the original as you are using it now and save the generated HTML to the other field.
Use the other field for your templates.
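A framework-free sketch of the two-field idea, just to show the data flow. The `[b]...[/b]` tag and the `render_tags` function are made up and stand in for your own parser; in Django the two attributes would be two text fields on the model, with the regeneration done in an overridden `save()`:

```python
import re
from dataclasses import dataclass


def render_tags(source):
    """Expand a made-up [b]...[/b] tag into HTML; stands in for the real parser."""
    return re.sub(r"\[b\](.*?)\[/b\]", r"<strong>\1</strong>", source)


@dataclass
class Post:
    source: str     # what the admin edits (simple tags)
    html: str = ""  # what the templates display (generated)

    def save(self):
        # Regenerate the HTML from the source on every save, so the
        # source never has to be reconstructed from the HTML.
        self.html = render_tags(self.source)


post = Post(source="Hello [b]world[/b]")
post.save()
```

Because the source is stored untouched, the admin form always shows the easy-to-read tags and there is nothing to convert back.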
I store document files (.TXT) in the database. How can I create a preview of the documents and make their content editable, just like in a text editor? Does anyone know a general approach or an existing Django application?
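from django.contrib.auth.models import User
from django.db import models

class FileDb(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)  # on_delete is required since Django 2.0
    src = models.FileField(upload_to="src_after_ocr")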
There is no general Django algorithm for this.
Django is Python, so you can use ordinary Python code for opening and reading files. A quick web search turns up many examples.
After reading the file, you can use its contents as the value of a text field in a form.
Then add a script that loads TinyMCE or some other WYSIWYG editor, and you have a nice frontend for editing the content.
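The file-handling part can be as small as the sketch below. The function names are made up and it assumes the files are UTF-8 text; with a FileField on the default file storage, the path would come from something like `instance.src.path`:

```python
from pathlib import Path


def read_document(path, preview_lines=None):
    """Return the file's text, or only its first preview_lines lines."""
    text = Path(path).read_text(encoding="utf-8")
    if preview_lines is None:
        return text
    return "\n".join(text.splitlines()[:preview_lines])


def write_document(path, content):
    """Overwrite the file with the edited content submitted by the form."""
    Path(path).write_text(content, encoding="utf-8")
```

`read_document(path, preview_lines=10)` would feed the preview, `read_document(path)` the initial value of the edit form, and `write_document` the save step.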
Alan
What I am trying to accomplish is to let users view information in the Django admin console and save and print a PDF of the information in front of them, based on however they sorted/filtered the data.
I have seen a lot of documentation on ReportLab, but mostly for just drawing lines and whatnot. How can I simply output the admin results to a PDF, if that is even possible? I am open to other suggestions if ReportLab is not the ideal way to get this done.
Thanks in advance.
Better to use some kind of html2pdf tool, because you already have the HTML there.
If html2pdf doesn't do what you need, you can do everything you want to do with ReportLab. Have a look at the ReportLab manual, in particular the parts on Platypus. This is a part of the ReportLab library that allows you to build PDFs out of objects representing page parts (paragraphs, tables, frames, layouts, etc.).