pdfminer doesn't extract data from filled-out pdf form

I'm trying to use pdfminer to extract the filled-out contents of a pdf form. To get the pdf:
1. Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1073757&parDT_END=99991231
2. Click "Create Report" next to the fourth report from the top (i.e., Banking Organization Systemic Risk Report (FR Y-15)).
3. Click "Your request for a financial report is ready".
To extract the contents in blue, I copied code from this post:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
filename = 'FRY15_1073757_20160630.PDF'
with open(filename, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    # top-level entries of the AcroForm field array
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print('{0}: {1}'.format(name, value))
This didn't extract the data fields as expected -- nothing was printed. I tried the same code on another pdf and it worked so I suspect the failure might have to do with the security setting of the first pdf, which is shown below
For the second pdf on which the code worked, the security setting shows "Allowed" for all the actions. I also tried using pdfminer's pdf2txt.py functionality (see here) but the filled-out data in the fields in the original pdf form (which is what I want) was not in the converted text file; only the "flat" non-fillable part of the pdf was converted. Interestingly, if I use Adobe Reader's Save As Text to convert the pdf to a text file, the fillable part was in the converted text file. This is what I've been doing to get around the failed code.
Any idea how I can extract data directly from the pdf form? Thanks.

I can only explain what the problem is; I cannot present a solution because I have no working Python knowledge.
Your code iterates over the immediate children of the AcroForm Fields array and expects them to represent the form fields.
While this expectation is often fulfilled, it only covers a special case: form fields are arranged in a tree structure with the Fields array as its root, and in your sample document that tree is large.
Thus, you have to descend into the structure, not merely iterate over the immediate children of Fields, to find all form fields.
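The descent described above could be sketched roughly as follows. This is not from the original answer: walk_fields and print_form_fields are hypothetical helper names, and the sketch assumes that child fields sit under the Kids key and that kids without a T entry are widget annotations rather than fields. Field names are joined with dots to give a fully qualified name.

```python
def walk_fields(fields, resolve=lambda obj: obj, prefix=""):
    """Yield (qualified_name, value) for every terminal field in the tree.

    `resolve` turns an indirect reference into the object it points to;
    with pdfminer that would be pdfminer.pdftypes.resolve1.
    """
    for ref in fields or []:
        field = resolve(ref)
        name = field.get("T")
        if isinstance(name, bytes):
            # naive decoding; some PDFs store field names as UTF-16
            name = name.decode("latin-1")
        full_name = ".".join(part for part in (prefix, name) if part)
        kids = [resolve(k) for k in (resolve(field.get("Kids")) or [])]
        # assumption: kids carrying a /T entry are child fields,
        # kids without one are widget annotations of this field
        child_fields = [k for k in kids if "T" in k]
        if child_fields:
            yield from walk_fields(child_fields, resolve, full_name)
        else:
            yield full_name, resolve(field.get("V"))


def print_form_fields(path):
    # imported here so walk_fields can be tried without pdfminer installed
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        acroform = resolve1(doc.catalog["AcroForm"])
        for name, value in walk_fields(resolve1(acroform["Fields"]), resolve1):
            print("{0}: {1}".format(name, value))


# plain dictionaries standing in for a small field tree
demo_tree = [{"T": "form", "Kids": [{"T": "page1", "Kids": [
    {"T": "name", "V": "Alice"},
    {"T": "amount", "V": "42", "Kids": [{"Subtype": "Widget"}]},
]}]}]
demo_fields = list(walk_fields(demo_tree))
```

With pdfminer installed, print_form_fields('FRY15_1073757_20160630.PDF') would apply the same walk to the form from the question.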

Related

Edit powerpoint .pptx template by python

I am trying to automate report generating in PowerPoint by python. I wanted to know if there is any way to detect an existing textbox from a PowerPoint template and then fill it with some text in python?

The main task is to find the placeholders that the template provides by default, as well as text boxes on non-template slides. The data used to fill the placeholders and text boxes can come from different sources, such as a text file or web scraping. Here the data is taken from a list object, list_data.
1. If the presentation has n slides and we want slide 1, we can access it with this code:
pptx.Presentation(inout_pptx).slides[0]
2. To reach the placeholders provided by the template, iterate over all the shapes on the slide:
slide.shapes
3. To update a particular placeholder, use this:
shape.text_frame.text = data
CODE:
import pptx
inout_pptx = r"C:\Users\lenovo\Desktop\StackOverFlow\python_pptx.pptx"
list_data = [
    'Quantam Computing dsfsf ',
    'Welcome to Quantam Computing Tutorial, hope you will get new thing',
    'User_Name sd',
    '<Enrollment Number>']
# open the file
prs = pptx.Presentation(inout_pptx)
# get to the required slide
slide = prs.slides[0]
# find the shapes that can hold text and fill them
# (a shape without a text frame is skipped together with its paired data item)
for shape, data in zip(slide.shapes, list_data):
    if not shape.has_text_frame:
        continue
    shape.text_frame.text = data
# save the file
prs.save(inout_pptx)
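One thing to watch in the loop above: zip pairs each shape with a data item before the has_text_frame check, so skipping a shape also skips the data item paired with it. If the intent is to fill the text-capable shapes in order, a variant like the following may be closer. fill_text_shapes is a hypothetical helper, and the stand-in objects below only mimic the two attributes the code touches, so the sketch runs without python-pptx.

```python
from types import SimpleNamespace


def fill_text_shapes(shapes, values):
    """Assign values, in order, to the shapes that have a text frame.

    Returns the number of shapes that were filled.
    """
    text_shapes = [shape for shape in shapes if shape.has_text_frame]
    filled = 0
    for shape, value in zip(text_shapes, values):
        shape.text_frame.text = value
        filled += 1
    return filled


def _fake_shape(has_text):
    # stand-in with only the attributes fill_text_shapes uses
    frame = SimpleNamespace(text="") if has_text else None
    return SimpleNamespace(has_text_frame=has_text, text_frame=frame)


shapes = [_fake_shape(True), _fake_shape(False), _fake_shape(True)]
filled = fill_text_shapes(shapes, ["Title", "Subtitle"])
```

With a real presentation the call would be fill_text_shapes(prs.slides[0].shapes, list_data), followed by prs.save(...).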

If I understand correctly, your presentation contains placeholders for text filling. The following code example shows you how to fill a footer on the first slide with Aspose.Slides for Python via .NET:
import aspose.slides as slides
with slides.Presentation("example.pptx") as presentation:
    firstSlide = presentation.slides[0]
    for shape in firstSlide.shapes:
        # AutoShape objects have text frames
        if isinstance(shape, slides.AutoShape) and shape.placeholder is not None:
            if shape.placeholder.type == slides.PlaceholderType.FOOTER:
                shape.text_frame.text = "My footer text"
    presentation.save("example_out.pptx", slides.export.SaveFormat.PPTX)
I work as a Support Developer at Aspose.

is there a method to read pdf files line by line?

I have a pdf file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the pdf file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast, but it does not preserve the gaps between elements. Results are jumbled.
Pdfminer - PDFPageInterpreter with LAParams. This works well but is slow. At least 2 seconds per page and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re
# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(), filetypes=[("PDF files", "*.pdf")])
print(file_path)  # print that path
file_data = parser.from_file(file_path)  # parse data from file
text = file_data['content']  # get the file's text content
# split the document into pages by a string that always appears at the top of each page
by_page = text.split('... Information')
for i in range(1, len(by_page)):  # loop page by page
    info = by_page[i]  # get one page worth of data from the pdf
    reformated = info.replace("\n", "&")  # replace the new lines with "&" to make it more readable
    print("Page: ", i)  # print page number
    print(reformated, "\n\n")  # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).

I did this for a .docx file with the code below, where txt is the text of the .docx. Hope this helps.
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)
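For illustration, here is the same substitution applied to a short made-up string. Note that it only breaks already-extracted text into sentences; it does not change the order in which the text came out of the PDF. split_sentences and the sample text are assumptions added for this sketch.

```python
import re

# same pattern as above: sentence-ending punctuation, an optional closing
# quote, then one whitespace character
SENTENCE_END = re.compile(r'(\.|\?|\!)(\'|\")?\s')


def split_sentences(txt):
    """Put a blank line after each sentence end matched by SENTENCE_END."""
    return SENTENCE_END.sub(r'\1\2\n\n', txt)


sample = 'It works. "Really?" Yes! Done.'
result = split_sentences(sample)
print(result)
```

The final "Done." stays as it is because the pattern requires whitespace after the punctuation.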

How to erase text from PDF using Python

I'm creating a python script to edit text from PDFs.
I have this Python code which allows me to add text into specific positions of a PDF file.
import PyPDF2
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
import sys
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
# Insert code into specific position
can.drawString(300, 115, "Hello world")
can.save()
# move to the beginning of the BytesIO buffer
packet.seek(0)
new_pdf = PyPDF2.PdfFileReader(packet)
# read your existing PDF
existing_pdf = PyPDF2.PdfFileReader(open("original.pdf", "rb"))
num_pages = existing_pdf.numPages
output = PyPDF2.PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(num_pages-1) # get the last page of the original pdf
page.mergePage(new_pdf.getPage(0)) # merges my created text with my PDF.
x = existing_pdf.getNumPages()
#add all pages from original pdf into output pdf
for n in range(x):
    output.addPage(existing_pdf.getPage(n))
# finally, write "output" to a real file
outputStream = open("output.pdf", "wb")
output.write(outputStream)
outputStream.close()
My problem: I want to replace the text in a specific position of my original PDF with my custom text. A way of writing blank characters would do the trick but I couldn't find anything that does this.
PS.: It must be Python code because I will need to deploy this as a .exe file later and I only know how to do that using Python code.

A general-purpose algorithm for replacing text in a PDF is a difficult problem. I'm not saying it can't ever be done, because I've demonstrated doing so with the Adobe PDF Library, albeit with a very simple input file with no complications, but I'm not sure that PyPDF2 has the facilities required to do so. In part, just finding the text can be a challenge.
You (or more realistically your PDF library) have to parse the page contents and keep track of the changes to the graphics state, specifically changes to the current transformation matrix in case the text is in a Form XObject, changes to the text transformation matrix, and changes to the font. You have to use the font resource to get character widths to figure out where the text cursor may be positioned after inserting a string. You may need to handle standard-14 fonts, which don't contain that information in their font resources (the application, i.e. your program, is expected to know their metrics).
After all that, removing the text is easy if you don't need to break up a Tj or TJ (show text) instruction into different parts. Preventing the text after from shifting, if that's what's desired, may require inserting a new Tm instruction to reposition the text after to where it would have been.
Inserting new text can be challenging. If you want to stay consistent with the font being used and it is embedded and subset, it may not necessarily contain the glyphs you need for your text insertion. And after insertion, you then have to decide whether you need to reflow the text that comes after the text you inserted.
And lastly, you will need your PDF library to save all the changes. Quite frankly, using Adobe Acrobat's Redaction features would likely be a more cost-effective way of doing this than trying to program it from scratch.

If you want to do a poor man's redaction with ReportLab and PyPDF2, you would create your replacement content with ReportLab.
Given a Canvas c, a rectangle indicating an area, a text string, and a point where the text string would be inserted, you would then:
# x, y, width, height, text_x, text_y and text are placeholders for your own values
# set the fill color to white
c.setFillColorRGB(1, 1, 1)
# draw a filled rectangle over the area (stroke=0 avoids a black outline)
c.rect(x, y, width, height, stroke=0, fill=1)
# change the color back to black for the replacement text
c.setFillColorRGB(0, 0, 0)
c.drawString(text_x, text_y, text)
Save the PDF document you've created to a temporary file.
Open this PDF document and the document you want to modify using PyPDF2's PdfFileReader. Create a PdfFileWriter object and call it modifiedDoc. Get page 0 of the temporary PDF and call it updatePage. Get page n of the other document and call it toModifyPage.
toModifyPage.mergePage(updatePage)
After you are done updating pages:
modifiedDoc.cloneDocumentFromReader(srcDoc)
modifiedDoc.write(outStream)
Again, if you go this route, a user might still see the original text before it gets covered up with the new content, and text extraction would likely pull out both the original and new text for that area, and possibly intermingle it to something unintelligible.

updating metadata for feature classes programatically using arcpy

I would like to be able to take an excel file that contains a record for each feature class, and some metadata fields, like summary, description, etc., and convert that to the feature class metadata. From the research I've done it seems like I need to convert each record in the excel table to xml, and then from there I may be able to import the xml file as metadata. Looks like I could use ElementTree, but I'm a little unsure of how to execute. Has anyone done this before and if so could you provide some guidance?

Man, this can be quite a process! I had to update some metadata information for a project at work the other day, so here goes nothing. It would be helpful to store all of the metadata information from the excel table as a list of dictionaries or another data structure of your choosing (I work with csvs and try to stay away from excel spreadsheets, based on past experience).
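As a rough sketch of building such a list from a CSV export of the spreadsheet: the column names (featureClass, abstract, description, tags), the semicolon used to separate tags, and the helper name load_meta_info are all assumptions made for this example. The result has the same shape as the hand-written list that follows.

```python
import csv
import io


def load_meta_info(csv_file, tag_sep=";"):
    """Read one metadata record per CSV row into a list of dictionaries."""
    records = []
    for row in csv.DictReader(csv_file):
        records.append({
            "featureClass": row["featureClass"],
            "abstract": row["abstract"],
            "description": row["description"],
            # assumed convention: tags stored in one cell, separated by tag_sep
            "tags": [t.strip() for t in row["tags"].split(tag_sep) if t.strip()],
        })
    return records


# an in-memory stand-in for open("metadata.csv", newline="")
sample = io.StringIO(
    "featureClass,abstract,description,tags\n"
    "fc1,Roads summary,All public roads,roads;transport\n"
    "fc2,Parcels summary,Tax parcels,parcels\n"
)
meta_info = load_meta_info(sample)
```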
metaInfo = [{"featureClass":"fc1",
             "abstract":"text goes here",
             "description":"text goes here",
             "tags":["tag1","tag2","tag3"]},
            {"featureClass":"fc2",
             "abstract":"text goes here",
             "description":"text goes here",
             "tags":["tag1","tag2","tag3"]},...]
From there, I would export the feature class's current metadata with the Export Metadata function, which converts it into an xml file using an FGDC schema. Here is a code example:
import os
import arcpy
# Directory containing ArcGIS install files
installDir = arcpy.GetInstallInfo("desktop")["InstallDir"]
# Path to XML schema for FGDC
translator = os.path.join(installDir, "Metadata/Translator/ARCGIS2FGDC.xml")
# Export your metadata
arcpy.ExportMetadata_conversion(featureClassPath, translator, tempXmlExportPath)
From there, you can use the xml module to access the ElementTree class. However, I would recommend using the lxml module (http://lxml.de/index.html#download) because it allows you to incorporate html code into your metadata through the CDATA factory if you needed special elements like line breaks in your metadata. From there, assuming that you have imported lxml, parse your local xml document:
import lxml.etree as ET
tree = ET.parse(tempXmlExportPath)
root = tree.getroot()
If you want to update the tags, use the code below:
idinfo = root[0]
# Create keywords element
keywords = ET.SubElement(idinfo, "keywords")
# Create theme child
theme = ET.SubElement(keywords, "theme")
# Create themekt and themekey grandchildren and insert the tag info
themekt = ET.SubElement(theme, "themekt")
for tag in tags:  # tags list from your dictionary
    themekey = ET.SubElement(theme, "themekey")
    themekey.text = tag
# Write the file once, after all elements have been added
tree.write(tempXmlExportPath)
To update the summary (abstract), use this code:
# Create descript tag
descript = ET.SubElement(idinfo, "descript")
# Create abstract child
abstract = ET.SubElement(descript, "abstract")
text = metaInfo[0]["abstract"]  # abstract string from your dictionary (first record shown here)
abstract.text = text
tree.write(tempXmlExportPath)
If a tag in the xml already exists, store the tag as an object using the parent.find("child") method, and update the text similar to the code examples above. Once you have updated your local xml file, use the Import Metadata method to import the xml file back into the feature class and remove the local xml file.
arcpy.ImportMetadata_conversion(tempXmlExportPath, "FROM_FGDC", featureClassPath)
os.remove(tempXmlExportPath)  # delete the temporary xml file
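The "find the tag if it exists, otherwise create it" step mentioned above could be wrapped in a small helper, sketched here with the standard library's xml.etree.ElementTree (lxml.etree offers the same find and SubElement calls). get_or_create and set_abstract are hypothetical names, and the tiny XML string stands in for an exported metadata file.

```python
import xml.etree.ElementTree as ET


def get_or_create(parent, tag):
    """Return the first child named `tag`, creating it if it is missing."""
    child = parent.find(tag)
    # compare with None: an element without children is falsy
    if child is None:
        child = ET.SubElement(parent, tag)
    return child


def set_abstract(root, text):
    """Set idinfo/descript/abstract without duplicating existing elements."""
    idinfo = get_or_create(root, "idinfo")
    descript = get_or_create(idinfo, "descript")
    get_or_create(descript, "abstract").text = text


root = ET.fromstring(
    "<metadata><idinfo><descript><abstract>old</abstract></descript></idinfo></metadata>"
)
set_abstract(root, "new summary")
set_abstract(root, "newer summary")  # second call updates, does not append

empty_root = ET.fromstring("<metadata/>")
set_abstract(empty_root, "first summary")
```

Running the update twice leaves a single abstract element, which avoids the duplicated tags that repeated SubElement calls would produce.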
Keep in mind that these tools in Arc are only for 32 bit, so if you are scripting through the 64 bit background geoprocessor, this will not work. I am working off of ArcMap 10.1. If you have any questions, please let me know or consult the documentation below:
lxml module
http://lxml.de/index.html#documentation
Export Metadata arcpy
http://resources.arcgis.com/en/help/main/10.1/index.html#//00120000000t000000
Import Metadata arcpy
http://resources.arcgis.com/en/help/main/10.1/index.html#//00120000000w000000

pypdf not extracting tables from pdf

I am using pypdf to extract text from pdf files. The problem is that the tables in the pdf files are not extracted. I have also tried using pdfminer, but I am having the same issue.

The problem is that tables in PDFs are generally made up of absolutely positioned lines and characters, and it is non-trivial to convert this into a sensible table representation.
In Python, PDFMiner is probably your best bet. It gives you a tree structure of layout objects, but you will have to do the table interpreting yourself by looking at the positions of lines (LTLine) and text boxes (LTTextBox). There's a little bit of documentation here.
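As a very rough sketch of the "interpret the positions yourself" step: the function below groups text fragments into rows by their vertical position and orders each row left to right. It works on plain (x0, y0, text) tuples; with PDFMiner these could be taken from the x0 and y0 attributes and get_text() of the text boxes. group_into_rows and the tolerance value are assumptions for this example, and real tables (merged cells, wrapped text) would need more care.

```python
def group_into_rows(boxes, y_tolerance=3.0):
    """Group (x0, y0, text) tuples into rows, top to bottom, left to right.

    PDF coordinates grow upwards, so a larger y0 means higher on the page.
    """
    rows = []
    for x0, y0, text in sorted(boxes, key=lambda box: -box[1]):
        if rows and abs(rows[-1]["y"] - y0) <= y_tolerance:
            # close enough vertically: same row as the previous fragment
            rows[-1]["cells"].append((x0, text))
        else:
            rows.append({"y": y0, "cells": [(x0, text)]})
    # order each row's cells by their x position
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# made-up positions for a 3x2 table
boxes = [
    (200, 700.5, "Qty"), (50, 701, "Item"),
    (50, 680, "Apple"), (200, 679.2, "3"),
    (50, 660, "Pear"), (200, 660, "5"),
]
table = group_into_rows(boxes)
```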
Alternatively, PDFX attempts this (and often succeeds), but you have to use it as a web service (not ideal, but fine for the occasional job). To do this from Python, you could do something like the following:
import urllib.request
import xml.etree.ElementTree as ET
# Make request to PDFX
with open('example.pdf', 'rb') as f:
    pdfdata = f.read()
request = urllib.request.Request('http://pdfx.cs.man.ac.uk', data=pdfdata, headers={'Content-Type': 'application/pdf'})
response = urllib.request.urlopen(request).read()
# Parse the response
tree = ET.fromstring(response)
for tbox in tree.findall('.//region[@class="DoCO:TableBox"]'):
    src = ET.tostring(tbox.find('content/table'))
    info = ET.tostring(tbox.find('region[@class="TableInfo"]'))
    caption = ET.tostring(tbox.find('caption'))
