Python PDF Parsing with Camelot and Extract the Table Title

Camelot is a fantastic Python library for extracting the tables from a PDF file as data frames. However, I'm looking for a solution that also returns the description text written right above each table.
The code I'm using to extract tables from the PDF is this:
import camelot
tables = camelot.read_pdf('test.pdf', pages='all',lattice=True, suppress_stdout = True)
I'd like to extract the text written above the table, i.e. THE PARTICULARS, as shown in the image below.
What would be the best approach for doing this? I'd appreciate any help. Thank you.

You can create the Lattice parser directly:
from camelot.parsers import Lattice

tables = []
parser = Lattice(**kwargs)
for p in pages:
    t = parser.extract_tables(p, suppress_stdout=suppress_stdout,
                              layout_kwargs=layout_kwargs)
    tables.extend(t)
Then you have access to parser.layout, which contains all the components on the page. These components all have a bbox (x0, y0, x1, y1), and the extracted tables also have a bbox. You can find the component closest to the top of the table and extract its text.
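As a rough illustration of the bounding-box idea above, the sketch below picks the nearest text box that sits above a table. It deliberately uses plain (x0, y0, x1, y1) tuples in PDF coordinates (origin at the bottom-left, so a larger y is higher on the page) rather than Camelot or pdfminer objects. The function name closest_text_above and the sample boxes are made up for the example.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows upward


def closest_text_above(table_bbox: BBox,
                       text_boxes: List[Tuple[str, BBox]]) -> Optional[str]:
    """Return the text whose bottom edge is nearest above the table's top edge."""
    t_x0, _, t_x1, t_top = table_bbox
    best_text, best_gap = None, float("inf")
    for text, (x0, y0, x1, _y1) in text_boxes:
        overlaps_horizontally = x0 < t_x1 and x1 > t_x0
        gap = y0 - t_top  # vertical gap between text bottom and table top
        if overlaps_horizontally and 0 <= gap < best_gap:
            best_text, best_gap = text.strip(), gap
    return best_text


# Hypothetical layout: a page header, a title just above the table, a footer.
boxes = [
    ("Page header", (50, 780, 300, 795)),
    ("THE PARTICULARS ", (60, 610, 250, 625)),
    ("Footer note", (50, 20, 300, 35)),
]
print(closest_text_above((50, 300, 500, 600), boxes))
```

Requiring a non-negative gap and a horizontal overlap is one possible way to avoid picking up text below or beside the table. With real layouts the thresholds would probably need tuning.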

Here's my hilariously bad implementation just so that someone can laugh and get inspired to do a better one and contribute to the great camelot package :)
Caveats:
Will only work for non-rotated tables
It's a heuristic
The code is bad
import math
import os

import camelot
from camelot.handlers import PDFHandler


# Helper methods for _bbox
def top_mid(bbox):
    return ((bbox[0] + bbox[2]) / 2, bbox[3])


def bottom_mid(bbox):
    return ((bbox[0] + bbox[2]) / 2, bbox[1])


def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)


def get_closest_text(table, htext_objs):
    min_distance = 999  # Cause 9's are big :)
    best_guess = None
    table_mid = top_mid(table._bbox)  # Middle of the TOP of the table
    for obj in htext_objs:
        text_mid = bottom_mid(obj.bbox)  # Middle of the BOTTOM of the text
        d = distance(text_mid, table_mid)
        if d < min_distance:
            best_guess = obj.get_text().strip()
            min_distance = d
    return best_guess


def get_tables_and_titles(pdf_filename):
    """Here's my hacky code for grabbing tables and guessing at their titles"""
    my_handler = PDFHandler(pdf_filename)
    tables = camelot.read_pdf(pdf_filename, pages='2,3,4')
    print('Extracting {:d} tables...'.format(tables.n))
    titles = []
    with camelot.utils.TemporaryDirectory() as tempdir:
        for table in tables:
            # Write the table's page to its own PDF so its layout can be re-read
            my_handler._save_page(pdf_filename, table.page, tempdir)
            tmp_file_path = os.path.join(tempdir, f'page-{table.page}.pdf')
            layout, dim = camelot.utils.get_page_layout(tmp_file_path)
            htext_objs = camelot.utils.get_text_objects(layout, ltype="horizontal_text")
            titles.append(get_closest_text(table, htext_objs))  # Might be None
    return titles, tables
See: https://github.com/atlanhq/camelot/issues/395

Related

PDF - Fitz makes the merged page as flipped

Below is the code I use to add a watermark onto PDF pages. On some pages the watermark comes out flipped (rotated 180 degrees and mirrored).
import fitz

doc_report = fitz.open(report_pdf_path)
doc_watermark = fitz.open(watermark_pdf_path)
for i in range(doc_report.pageCount):
    page = doc_report.loadPage(i)
    page_front = fitz.open()
    page_front.insertPDF(doc_watermark, from_page=i, to_page=i)
    page.showPDFpage(page.rect, page_front, pno=0, keep_proportion=True, overlay=True, rotate=0, clip=None)
doc_report.save(save_path, encryption=fitz.PDF_ENCRYPT_KEEP)
doc_report.close()
doc_watermark.close()
While debugging, I compared the rotation and transformation properties of the target page and the watermark page, and they look identical.
Could you please advise how I can resolve this?

Thanks to K J, here is the updated code that resolves the issue.
import fitz

doc_report = fitz.open(report_pdf_path)
doc_watermark = fitz.open(watermark_pdf_path)
for i in range(doc_report.pageCount):
    page = doc_report.loadPage(i)
    page_front = fitz.open()
    # Added: wrap the page's existing contents so that any graphics state
    # they leave behind (e.g. a transformation) does not affect the overlay.
    if not page._isWrapped:
        page._wrapContents()
    page_front.insertPDF(doc_watermark, from_page=i, to_page=i)
    page.showPDFpage(page.rect, page_front, pno=0, keep_proportion=True, overlay=True, rotate=0, clip=None)
doc_report.save(save_path, encryption=fitz.PDF_ENCRYPT_KEEP)
doc_report.close()
doc_watermark.close()

reportlab: Smartest way to position page with bookmarkPage

I've been using python and reportlab to auto-generate long documents and want to use the PDF outline tree for easy navigation through the document. According to the docs, canvas.bookmarkPage comes with multiple options to adjust the document view after jumping to the destination page. The standard one is a simple page Fit to the window. From a user perspective, I would prefer FitH (as wide as possible, with destination at the top of the screen) or XYZ (keep user zoom level with destination at the top of the screen). When using any fit option except the basic Fit, the function call must be provided with the coordinates to arrange the view accordingly.
However, I could not find any explanations, examples, code snippets, or anything on how to figure this out, and it took me a good while to come up with a solution. So, I want to share this solution here and ask if this is really the best way to do it or if I overlooked something basic.
The key thing here is SmartParagraph which remembers its position after it was drawn. First, I used flowable.canv.absolutePosition(0,0) in the afterFlowable() method because this is where I needed this information to pass it to bookmarkPage(). However, the position was always reported as 0, 0, so apparently the flowable and/or the canvas have forgotten everything about the position when afterFlowable() is reached. So I thought there has to be some point in time when a Flowable knows its position and after investigating the source code I found out that after draw(), it still knows where it is.
So: SmartParagraph is a subclass of Paragraph that stores its position after it is drawn, so that later in the document building process this can be used by any external element for whatever.
The example will create a dummy pdf with 2 headings that do a nice FitH zoom and two headings that do the basic Fit zoom.
Does anyone have a better idea on how to solve this?
import typing

from reportlab.lib.styles import ParagraphStyle as PS
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import cm
from reportlab.platypus import PageBreak, Spacer
from reportlab.platypus.doctemplate import SimpleDocTemplate
from reportlab.platypus.flowables import Flowable
from reportlab.platypus.paragraph import Paragraph


class SmartParagraph(Paragraph):
    def __init__(self, text, *args, **kwds):
        """This paragraph remembers its position on the canvas"""
        super(SmartParagraph, self).__init__(text, *args, **kwds)
        self._pos: typing.Optional[typing.Tuple[float, float]] = None

    def draw(self):
        super(SmartParagraph, self).draw()
        # The canvas still knows the flowable's position right after drawing
        self._pos = self.canv.absolutePosition(0, 0)

    def get_pos(self) -> typing.Optional[typing.Tuple[float, float]]:
        return self._pos


class CustomDocTemplate(SimpleDocTemplate):
    def __init__(self, filename, outline_levels: int = 4, **kwargs):
        super(CustomDocTemplate, self).__init__(filename, **kwargs)
        self._bookmark_keys = list()
        if not isinstance(outline_levels, int) or outline_levels < 1:
            raise ValueError("Outline levels must be integer and at least 1")
        self._outline_levels = {f'Heading{level+1}': level for level in range(outline_levels)}
        # Map of kind: Heading1 -> 0
        # Heading 1 is level 0, I don't make the rules

    def afterFlowable(self, flowable: Flowable):
        """Registers TOC entries."""
        if isinstance(flowable, Paragraph):
            flowable: Paragraph
            text = flowable.getPlainText()
            style = flowable.style.name
            if style in self._outline_levels:
                level = self._outline_levels[style]
            else:
                return
            if text not in self._bookmark_keys:
                key = text
                self._bookmark_keys.append(key)
            else:
                # There might be headings with identical text, yet they need a different key
                # Keys are stored in a list and incremented if a duplicate is found
                cnt = 1
                while True:
                    key = text + str(cnt)
                    if key not in self._bookmark_keys:
                        self._bookmark_keys.append(key)
                        break
                    cnt += 1
            if isinstance(flowable, SmartParagraph):
                # Only smart paragraphs know their own position
                x, y = flowable.get_pos()
                y += flowable.style.fontSize + 15
                self.canv.bookmarkPage(key, fit="FitH", top=y)
            else:
                # Dumb paragraphs need to show the whole page
                self.canv.bookmarkPage(key)
            self.canv.addOutlineEntry(title=text, key=key, level=level)

    def _endBuild(self):
        """Override of parent function. Shows outline tree by default when opening PDF."""
        super(CustomDocTemplate, self)._endBuild()
        self.canv.showOutline()


story = list()
story.append(SmartParagraph('First Smart Heading', getSampleStyleSheet()['h1']))
story.append(Paragraph('Text in first heading'))
story.append(Spacer(1, 0.5 * cm))
story.append(SmartParagraph('First Sub Smart Heading', getSampleStyleSheet()['h2']))
story.append(Paragraph('Text in first sub heading'))
story.append(Spacer(1, 0.5 * cm))
story.append(Paragraph('Second Sub Dumb Heading', getSampleStyleSheet()['h2']))
story.append(Paragraph('Text in second sub heading'))
story.append(PageBreak())
story.append(Paragraph('Last Dumb Heading', getSampleStyleSheet()['h1']))
story.append(Paragraph('Text in last heading', PS('body')))
doc = CustomDocTemplate('mintoc.pdf')
doc.multiBuild(story)
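The duplicate-heading handling inside afterFlowable() can be pulled out and checked on its own, without reportlab. The helper below is only a sketch of that idea: unique_key is a made-up name, and it uses a set instead of the list in the code above because membership tests on a set are cheaper.

```python
from typing import Set


def unique_key(text: str, used: Set[str]) -> str:
    """Return `text`, or `text` plus the first free counter, and record it."""
    key, cnt = text, 1
    while key in used:
        key = f"{text}{cnt}"
        cnt += 1
    used.add(key)
    return key


used_keys: Set[str] = set()
headings = ["Intro", "Results", "Intro", "Intro"]
keys = [unique_key(h, used_keys) for h in headings]
print(keys)  # ['Intro', 'Results', 'Intro1', 'Intro2']
```

The outline title can stay the plain heading text. Only the bookmark key has to be unique.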

pdfplumber extract_text function also extracts text from the table. Only want to extract text outside of the table

I have a PDF that contains text and tables. I want to extract both, but when I use the extract_text function it also extracts the content inside the tables. I only want extract_text to return the text outside the tables; the tables themselves can be extracted with the extract_tables function.
I have also tested with a PDF that contains only tables, and extract_text still returns the table contents that I want to get from extract_tables instead.

You can try the following code:
import pdfplumber

# Open the PDF.
pdf = pdfplumber.open("file.pdf")
# Load the first page.
p = pdf.pages[0]
# Table settings.
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}
# Get the bounding boxes of the tables on the page.
bboxes = [table.bbox for table in p.find_tables(table_settings=ts)]


def not_within_bboxes(obj):
    """Return True if the object lies outside every table's bbox."""

    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)

    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)


print("Text outside the tables:")
print(p.filter(not_within_bboxes).extract_text())
I am using the .filter() method provided by pdfplumber to drop any objects that fall inside the bounding box of any of the tables. This creates a filtered version of the page, from which the text is then extracted.
Since you haven't shared the PDF, the table settings I have used may not work, but you can change them to suit your needs.
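The midpoint test used in the filter can be tried out without a PDF. The sketch below applies the same comparison to hand-written dictionaries shaped like pdfplumber character objects (keys x0, x1, top, bottom). The helper names and the sample coordinates are invented for illustration.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom), top-left origin


def midpoint_in_bbox(obj: Dict, bbox: BBox) -> bool:
    """True if the centre of the object falls inside the bounding box."""
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    x0, top, x1, bottom = bbox
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def outside_all(obj: Dict, bboxes: List[BBox]) -> bool:
    """True if the object's centre is outside every bounding box."""
    return not any(midpoint_in_bbox(obj, b) for b in bboxes)


table_bboxes = [(100, 200, 400, 300)]  # one hypothetical table
chars = [
    {"text": "A", "x0": 50, "x1": 60, "top": 100, "bottom": 112},    # above the table
    {"text": "B", "x0": 150, "x1": 160, "top": 220, "bottom": 232},  # inside the table
    {"text": "C", "x0": 395, "x1": 409, "top": 250, "bottom": 262},  # straddles the right edge
]
kept = [c["text"] for c in chars if outside_all(c, table_bboxes)]
print(kept)  # ['A', 'C']
```

Because only the midpoint is tested, a character that straddles a table border is assigned to whichever side its centre falls on.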

How to find textual differences between revisions on Wikipedia pages with mwclient?

I'm trying to find the textual differences between two revisions of a given Wikipedia page using mwclient. I have the following code:
import mwclient
import difflib
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Bowdoin College']
texts = [rev for rev in page.revisions(prop='content')]
if texts[-1]['*'] != texts[0]['*']:
    # show me the differences between the pages
    pass
Thank you!

It's not clear whether you want a difflib-generated diff or a MediaWiki-generated diff obtained through mwclient.
In the first case, you have two strings (the text of the two revisions) and you want to get the diff using difflib:
...
t1 = texts[-1][u'*']
t2 = texts[0][u'*']
print('\n'.join(difflib.unified_diff(t1.splitlines(), t2.splitlines())))
(difflib can also generate an HTML diff, refer to the documentation for more info.)
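For the difflib route, a self-contained sketch with two short made-up strings standing in for the revision texts might look like this. It shows both the unified diff and the HTML table that difflib.HtmlDiff can produce. The variable names and sample lines are placeholders, not real page content.

```python
import difflib

# Placeholder texts standing in for two revisions of a page.
old_text = "first line\nsecond line\nthird line\n"
new_text = "first line\nsecond line, edited\nthird line\n"

old_lines = old_text.splitlines()
new_lines = new_text.splitlines()

# Plain-text unified diff; lineterm="" because the lines carry no newlines.
unified = list(difflib.unified_diff(
    old_lines, new_lines, fromfile="old", tofile="new", lineterm=""))
print("\n".join(unified))

# HTML <table> with side-by-side changes, ready to embed in a page.
html_table = difflib.HtmlDiff().make_table(
    old_lines, new_lines, fromdesc="old", todesc="new")
```

make_table returns only the table markup. HtmlDiff.make_file wraps the same table in a complete HTML document if a standalone file is wanted.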
But if you want the MediaWiki-generated HTML diff using mwclient you'll need revision ids:
# TODO: Loading all revisions is slow,
# try to load only as many as required.
revisions = list(page.revisions(prop='ids'))
last_revision_id = revisions[-1]['revid']
first_revision_id = revisions[0]['revid']
Then use the compare action to compare the revision ids:
compare_result = site.get('compare', fromrev=last_revision_id, torev=first_revision_id)
html_diff = compare_result['compare']['*']

How do you keep table rows together in python-docx?

As an example, I have a generic script that outputs the default table styles using python-docx (this code runs fine):
import docx

d = docx.Document()
type_of_table = docx.enum.style.WD_STYLE_TYPE.TABLE
list_table = [['header1', 'header2'], ['cell1', 'cell2'], ['cell3', 'cell4']]
numcols = max(map(len, list_table))
numrows = len(list_table)
styles = (s for s in d.styles if s.type == type_of_table)
for stylenum, style in enumerate(styles, start=1):
    label = d.add_paragraph('{}) {}'.format(stylenum, style.name))
    label.paragraph_format.keep_with_next = True
    label.paragraph_format.space_before = docx.shared.Pt(18)
    label.paragraph_format.space_after = docx.shared.Pt(0)
    table = d.add_table(numrows, numcols)
    table.style = style
    for r, row in enumerate(list_table):
        for c, cell in enumerate(row):
            table.row_cells(r)[c].text = cell
d.save('tablestyles.docx')
Next, I opened the document, highlighted a split table and under paragraph format, selected "Keep with next," which successfully prevented the table from being split across a page:
Here is the XML code of the non-broken table:
You can see the highlighted line shows the paragraph property that should be keeping the table together. So I wrote this function and stuck it in the code above the d.save('tablestyles.docx') line:
def no_table_break(document):
    tags = document.element.xpath('//w:p')
    for tag in tags:
        ppr = tag.get_or_add_pPr()
        ppr.keepNext_val = True


no_table_break(d)
When I inspect the XML code the paragraph property tag is set properly and when I open the Word document, the "Keep with next" box is checked for all tables, yet the table is still split across pages. Am I missing an XML tag or something that's preventing this from working properly?

OK, I also needed this. I think we were all making the incorrect assumption that the setting in Word's table properties (or the equivalent ways to achieve this in python-docx) was about keeping the table from being split across pages. It's not. It only controls whether an individual table row can be split across pages.
Given that we know how to do this successfully in python-docx, we can prevent tables from being split across pages by putting each table within a row of a larger master table. The code below does this. I'm using Python 3.6 and python-docx 0.8.6.
import os
import sys

import docx
from docx.oxml.shared import OxmlElement


def prevent_document_break(document):
    """https://github.com/python-openxml/python-docx/issues/245#event-621236139
    Globally prevent table rows from splitting across pages.
    """
    tags = document.element.xpath('//w:tr')
    for tag in tags:  # Each tag is a <w:tr> (table row) element
        child = OxmlElement('w:cantSplit')  # Create the "can't split" tag
        tag.append(child)  # Append the new tag to the row


d = docx.Document()
type_of_table = docx.enum.style.WD_STYLE_TYPE.TABLE
list_table = [['header1', 'header2'], ['cell1', 'cell2'], ['cell3', 'cell4']]
numcols = max(map(len, list_table))
numrows = len(list_table)
styles = (s for s in d.styles if s.type == type_of_table)
big_table = d.add_table(1, 1)
big_table.autofit = True
for stylenum, style in enumerate(styles, start=1):
    # Each styled table lives in its own row of the master table
    cells = big_table.add_row().cells
    label = cells[0].add_paragraph('{}) {}'.format(stylenum, style.name))
    label.paragraph_format.keep_with_next = True
    label.paragraph_format.space_before = docx.shared.Pt(18)
    label.paragraph_format.space_after = docx.shared.Pt(0)
    table = cells[0].add_table(numrows, numcols)
    table.style = style
    for r, row in enumerate(list_table):
        for c, cell in enumerate(row):
            table.row_cells(r)[c].text = cell
prevent_document_break(d)
d.save('tablestyles.docx')
# because I'm lazy...
openers = {'linux': 'libreoffice tablestyles.docx',
           'linux2': 'libreoffice tablestyles.docx',
           'darwin': 'open tablestyles.docx',
           'win32': 'start tablestyles.docx'}
os.system(openers[sys.platform])

I struggled with this problem for some hours and finally found a solution that works fine for me. I just changed the XPath in the question's code, so now it looks like this:
def keep_table_on_one_page(doc):
    tags = doc.element.xpath('//w:tr[position() < last()]/w:tc/w:p')
    for tag in tags:
        ppr = tag.get_or_add_pPr()
        ppr.keepNext_val = True
The key part is this selector:
[position() < last()]
We want the paragraphs in all but the last row of each table to keep with the next one.
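To see which paragraphs that selector picks, here is a small sketch on a toy, namespace-free XML tree. It uses the standard library's xml.etree.ElementTree, which implements only a subset of XPath, so the "all rows but the last" part is written as a list slice instead of the position() < last() predicate. The element layout and the helper name are simplified stand-ins for the real WordprocessingML.

```python
import xml.etree.ElementTree as ET

# Toy document: two tables with three and two rows (no w: namespace).
xml = """
<body>
  <tbl>
    <tr><tc><p>a1</p></tc></tr>
    <tr><tc><p>a2</p></tc></tr>
    <tr><tc><p>a3</p></tc></tr>
  </tbl>
  <tbl>
    <tr><tc><p>b1</p></tc></tr>
    <tr><tc><p>b2</p></tc></tr>
  </tbl>
</body>
"""
root = ET.fromstring(xml)


def paragraphs_to_keep_with_next(tbl):
    """Paragraphs in every row of `tbl` except its last row."""
    rows = tbl.findall("./tr")
    return [p for row in rows[:-1] for p in row.findall("./tc/p")]


selected = [p.text for tbl in root.iter("tbl")
            for p in paragraphs_to_keep_with_next(tbl)]
print(selected)  # ['a1', 'a2', 'b1']
```

The last row of each table (a3 and b2 here) is left out, so the table is not tied to whatever paragraph follows it.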

I would have left this as a comment under @DeadAd's answer, but I don't have enough reputation.
In case anyone is looking to stop a specific table from breaking, rather than all tables in a doc, change the XPath to the following:
tags = table._element.xpath('./w:tr[position() < last()]/w:tc/w:p')
where table refers to the instance of <class 'docx.table.Table'> which you want to keep together.
"//" selects all nodes that match the XPath (regardless of relative location), while "./" starts the selection from the current node.
