Python PDFMiner : How to link outlines to underlying text


I am trying to parse a PDF and create some kind of a hierarchical structure. Consider the input
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text
Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text
Here is how I can extract the outline/titles:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print(level, title)
This gives me
(1, u'Title 1')
(2, u'Title 1.1')
(1, u'Title 2')
which is perfect, as the levels are aligned with the text hierarchy. Now I can extract the text as follows
from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
text_from_pdf = open('textFromPdf.txt', 'w')
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    for element in layout:
        if isinstance(element, LTTextBox):
            # Replace non-ASCII characters with spaces before writing.
            text_from_pdf.write(''.join([i if ord(i) < 128 else ' ' for i in element.get_text()]))
text_from_pdf.close()
which gives me
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text
Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text
which is ok as far as the order goes, but now I have lost all sense of hierarchy. How do I know where one title ends and another begins? Also, which title/heading is the parent of a given one, if any?
Is there a way to connect the outline information to the layout elements? It would be great to be able to parse all the information while iterating through the levels.
Another problem is that if there are any citations at the bottom of a page, then the citation text gets mixed in with the results. Is there a way to ignore the headers, footers and citations when parsing a PDF?

I wish this were possible, but the pdfminer documentation clearly states:
Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page. Since PDF does not have a logical structure, and it does not provide a way to refer to any in-page object from the outside, there’s no way to tell exactly which part of text these destinations are referring to.
https://pdfminer-docs.readthedocs.io/programming.html#:~:text=Some%20PDF%20documents,are%20referring%20to.
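Given that limitation, one heuristic workaround is to match the outline titles against the extracted text lines and treat everything between two matched titles as the body of the first. Below is a minimal sketch, assuming each title appears verbatim, in outline order, on its own line of the extracted text (which will not hold for every PDF). The function name build_sections and its dictionary layout are hypothetical, chosen for illustration, and are not part of pdfminer.

```python
def build_sections(outlines, lines):
    """Attach body text to outline entries by matching title lines.

    outlines: list of (level, title) pairs, e.g. from document.get_outlines()
    lines: extracted text lines in reading order
    Assumption: titles appear verbatim, in outline order, on their own line.
    """
    sections = []
    stack = []                 # chain of currently open ancestor sections
    pending = list(outlines)   # outline entries not yet seen in the text
    current = None
    for line in lines:
        text = line.strip()
        if pending and text == pending[0][1].strip():
            level, title = pending.pop(0)
            current = {"level": level, "title": title, "parent": None, "text": []}
            # Close sections that sit at the same or a deeper level.
            while stack and stack[-1]["level"] >= level:
                stack.pop()
            if stack:
                current["parent"] = stack[-1]["title"]
            stack.append(current)
            sections.append(current)
        elif current is not None and text:
            current["text"].append(text)
    return sections


outlines = [(1, "Title 1"), (2, "Title 1.1"), (1, "Title 2")]
lines = [
    "Title 1", "some text some text",
    "Title 1.1", "some more text",
    "Title 2", "some final text",
]
sections = build_sections(outlines, lines)
for s in sections:
    print(s["level"], s["title"], "parent:", s["parent"], "lines:", len(s["text"]))
```

Body text that happens to repeat a title exactly would be misread as a heading, so the result should be treated as a best guess rather than a reliable mapping.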
Thanks

Related

How can I extract the outline of a PDF using Python?

I wish to read the outline of a .pdf format paper. The expected output is a list of section titles like ['abstract', 'Introduction', ...], The section titles can be identified by the following characteristics: 1) bold and larger font size, 2) all nouns starting with capital letters, and 3) appearing immediately after a line break \n.
The solutions I have tried include:
PyPDF2 with reader.outline
import PyPDF2
reader = PyPDF2.PdfReader('path/to/my/pdf')
print(reader.outline)
PyMuPDF with doc.get_toc()
import fitz
doc = fitz.open('path/to/my/pdf')
toc = doc.get_toc()
However, both give me an empty list.
I am currently using the re library to extract the section titles, but the results include additional information such as references and table contents.
import re
re.findall(r'(\[turnpage\]|\n)([A-Z][^.]*?)(\n[A-Z0-9][^\s]*?)', text)
For a clearer understanding of the results produced by the code, please refer to this link

If reader.outline by pypdf gives an empty result, there is no outline specified as metadata.
There can still be an outline specified as normal text. However, detecting / parsing that would require custom work on your side. You can use the text extraction as a basis:
https://pypdf.readthedocs.io/en/latest/user/extract-text.html
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
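As one possible starting point for that custom work, headings can be guessed from font properties instead of regular expressions. The sketch below is not pypdf code. It assumes the page has already been converted to the nested blocks/lines/spans dictionary that PyMuPDF's page.get_text("dict") returns, and that bit 4 (value 16) of a span's flags marks bold text. The function name find_headings and the min_size threshold are hypothetical and would need tuning per document.

```python
BOLD_FLAG = 16  # bit 4 of a span's "flags" in PyMuPDF's dict output (assumed)


def find_headings(page_dict, min_size=12.0):
    """Return the text of lines whose spans are all bold and at least min_size."""
    headings = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            spans = [s for s in line["spans"] if s["text"].strip()]
            if spans and all(
                s["size"] >= min_size and s["flags"] & BOLD_FLAG for s in spans
            ):
                headings.append("".join(s["text"] for s in spans).strip())
    return headings


# Hand-built sample in the same shape, so the sketch runs without a PDF.
sample = {
    "blocks": [
        {"lines": [{"spans": [{"text": "Introduction", "size": 14.0, "flags": 20}]}]},
        {"lines": [{"spans": [{"text": "Body text here.", "size": 10.0, "flags": 4}]}]},
        {"type": 1},  # stands in for an image block
    ]
}
print(find_headings(sample))
```

With PyMuPDF installed, the same function could be fed page.get_text("dict") for each page of an opened document.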

How to spread text on multiple pages depending on text size?

What I tried:
import fitz
doc = fitz.open()
page = doc.new_page()
text = 'Long text'
tw = fitz.TextWriter(page.rect)
tw.append((20,40), text, small_caps=True)
tw.write_text(page)
doc.ez_save('test.pdf')
How to spread text on multiple pages depending on text size?
Font, text size should stay the same, it should extend to the amount of pages needed and write the appropriate pieces there.
I did not find the solution here.

Delete text from pdf using PyMUPDF

I need to remove the text "DRAFT" from a pdf document using Python. I can find the text box containing the text but can't find an example of how to edit the pdf text element using pymupdf.
In the example below the draft object contains the coords and text for the DRAFT text element.
import fitz
fname = r"original.pdf"
doc = fitz.open(fname)
page = doc.load_page(0)
draft = page.search_for("DRAFT")
# insert code here to delete the DRAFT text or replace it with an empty string
out_fname = r"final.pdf"
doc.save(out_fname)
Added 4/28/2022
I found a way to delete the text but unfortunately it also deletes any overlapping text underneath the box around DRAFT. I really just want to delete the DRAFT letters without modifying underlying layers
# insert code here to delete the DRAFT text or replace it with an empty string
rl = page.search_for("DRAFT", quads = True)
page.add_redact_annot(rl[0])
page.apply_redactions()

You can try this.
import fitz
doc = fitz.open("xxxx")
for page in doc:
    for xref in page.get_contents():
        stream = doc.xref_stream(xref).replace(b'The string to delete', b'')
        doc.update_stream(xref, stream)
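This byte-level replacement only works when the target string appears literally in the page's content stream. The small illustration below uses hand-written content-stream fragments (made-up examples, no PDF library involved) to show why it can silently do nothing when a word is split across a TJ array:

```python
def remove_literal(stream: bytes, target: bytes) -> bytes:
    """Drop target wherever it occurs literally, like the replace() call above."""
    return stream.replace(target, b"")


# Text shown with a single Tj operator: the word is one literal string.
simple = b"BT /F1 12 Tf 72 720 Td (DRAFT) Tj ET"
# The same word shown with TJ and a kerning adjustment: it is split in two.
kerned = b"BT /F1 12 Tf 72 720 Td [(DR) -20 (AFT)] TJ ET"

print(remove_literal(simple, b"DRAFT"))  # the word is gone, an empty () Tj remains
print(remove_literal(kerned, b"DRAFT"))  # unchanged, the bytes never match
```

Hex-encoded strings and fonts with custom encodings have the same effect, so it is worth checking that the stream actually changed before saving.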

How to replace/delete text from a pdf using python?

I have code that hides parts of the pdf (by just covering it with a white polygon) but the issue with this is, the text is still there, if you ctrl-f you can still find it.
My goal is to actually remove the text from the pdf itself. Using pdfminer I managed to extract the text from the pdf but I don't know if its possible to actually "replace" the text with say just some empty spaces. Is such a thing possible using python? Extracting it isn't enough. I need the text to be removed from the PDF

Is such a thing possible? Yes, although it is not recommended. In my opinion, your best bet is to open and read your existing file, convert it to an editable format, remove whatever text you don't want, and then convert it back.
However, you could extract the data and remove it from memory by using:
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Line by line, this program would:
pdfFileObj = open('example.pdf', 'rb')
Open the example.pdf and save the file object as pdfFileObj.
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
Create a PdfFileReader object, passing it the PDF file object, to get a PDF reader object.
print(pdfReader.numPages)
Give the number of pages.
pageObj = pdfReader.getPage(0)
Create an object of the PageObject class. The PDF reader object has a function getPage() which takes the page number (starting from index 0) as an argument and returns the page object.
print(pageObj.extractText())
Extract text from the PDF page.
pdfFileObj.close()
Close the PDF file object.
The replacement text would simply be "", as you want to remove all instances / cases of a certain piece of text.

I used pdf-redactor in one of my projects and it works pretty well.
Here is an example of how to redact Social Security Numbers from the text layer.

This is kind of memory intensive but you can copy the rest of the pdf apart from the part you are removing and then overwrite the file with the new version which does not contain the part you wish to remove. You can do this using PyPDF by retrieving a content stream and finding and removing the relevant parts.
PyPDF docs https://pythonhosted.org/PyPDF2/PageObject.html?highlight=getcontents#PyPDF2.pdf.PageObject.getContents;
PDF standard https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf pg 78, pg 81;

I know I am late but for future readers here is a workaround I found to resolve this using pymupdf. This solution successfully deletes text from pdf.
import fitz
doc = fitz.open("original.pdf")  # input file name is a placeholder
page = doc.load_page(0)
draft = page.search_for("Invoice")
# mark every hit for redaction
for rect in draft:
    page.add_redact_annot(rect)
# apply all redactions once, leaving images untouched
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
# then save the doc to a new PDF:
doc.save("new.pdf", garbage=3, deflate=True)

How to erase text from PDF using Python

I'm creating a python script to edit text from PDFs.
I have this Python code which allows me to add text into specific positions of a PDF file.
import PyPDF2
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
import sys
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
# Insert code into specific position
can.drawString(300, 115, "Hello world")
can.save()
#move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PyPDF2.PdfFileReader(packet)
# read your existing PDF
existing_pdf = PyPDF2.PdfFileReader(open("original.pdf", "rb"))
num_pages = existing_pdf.numPages
output = PyPDF2.PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(num_pages-1) # get the last page of the original pdf
page.mergePage(new_pdf.getPage(0)) # merges my created text with my PDF.
x = existing_pdf.getNumPages()
#add all pages from original pdf into output pdf
for n in range(x):
    output.addPage(existing_pdf.getPage(n))
# finally, write "output" to a real file
outputStream = open("output.pdf", "wb")
output.write(outputStream)
outputStream.close()
My problem: I want to replace the text in a specific position of my original PDF with my custom text. A way of writing blank characters would do the trick but I couldn't find anything that does this.
PS.: It must be Python code because I will need to deploy this as a .exe file later and I only know how to do that using Python code.

A general purpose algorithm for replacing text in a PDF is a difficult problem. I'm not saying it can't ever be done, because I've demonstrated doing so with the Adobe PDF Library albeit with a very simple input file with no complications, but I'm not sure that pyPDF2 has the facilities required to do so. In part, just finding the text can be a challenge.
You (or, more realistically, your PDF library) have to parse the page contents and keep track of changes to the graphics state: specifically, changes to the current transformation matrix (in case the text is in a Form XObject), to the text transformation matrix, and to the font. You have to use the font resource to get character widths to figure out where the text cursor may be positioned after inserting a string. You may also need to handle standard-14 fonts, which don't contain that information in their font resources (the application, i.e. your program, is expected to know their metrics).
After all that, removing the text is easy if you don't need to break up a Tj or TJ (show text) instruction into different parts. Preventing the text after from shifting, if that's what's desired, may require inserting a new Tm instruction to reposition the text after to where it would have been.
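To make the width bookkeeping concrete, here is a minimal sketch of how the horizontal advance of a shown string can be computed for a simple (single-byte) font, following the text-space arithmetic of the PDF specification: glyph widths are given in thousandths of a unit and scaled by the font size, then character spacing, word spacing and horizontal scaling are applied. The function name text_advance and the widths table are made up for illustration. A real implementation would read the widths from the font resource.

```python
def text_advance(text, widths, font_size, char_spacing=0.0,
                 word_spacing=0.0, horiz_scale=1.0):
    """Horizontal distance the text cursor moves after showing text.

    widths maps each character to its glyph width in 1/1000 text-space units.
    word_spacing is applied to the space character only.
    """
    total = 0.0
    for ch in text:
        advance = widths[ch] / 1000.0 * font_size + char_spacing
        if ch == " ":
            advance += word_spacing
        total += advance * horiz_scale
    return total


# Illustrative widths only; real values come from the font in use.
widths = {"D": 722, "R": 722, "A": 722, "F": 611, "T": 611, " ": 250}
removed = text_advance("DRAFT", widths, font_size=12)
print(removed)
```

That value is how far the following text would shift if the string were deleted, which is the offset a compensating Tm instruction would have to account for.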
Inserting new text can be challenging. If you want to stay consistent with the font being used and it is embedded and subset, it may not necessarily contain the glyphs you need for your text insertion. And after insertion, you then have to decide whether you need to reflow the text that comes after the text you inserted.
And lastly, you will need your PDF library to save all the changes. Quite frankly, using Adobe Acrobat's Redaction features would likely be a more cost-effective way of doing this than trying to program it from scratch.

If you want to do a poor man's redaction with ReportLab and PyPDF2,
you would create your replacement content with ReportLab.
Given a Canvas, a rectangle indicating an area, a text string and a point where the text string would be inserted you would then:
#set a fill color to white:
c.setFillColorRGB(1,1,1)
# draw a rectangle
c.rect([your rectangle], fill=1)
# change color
c.setFillColorRGB(0,0,0)
c.drawString([text insert position], [text string])
Save this PDF document you've created to a temporary file.
Open this PDF document and the document you want to modify using PyPDF2's PdfFileReader. Create a PdfFileWriter object, call it modifiedDoc. Get page 0 of the temporary PDF, call it updatePage. Get page n of the other document, call it toModifyPage.
toModifyPage.mergePage(updatePage)
After you are done updating pages:
modifiedDoc.cloneDocumentFromReader(srcDoc)
modifiedDoc.write(outStream)
Again, if you go this route, a user might still see the original text before it gets covered up with the new content, and text extraction would likely pull out both the original and new text for that area, and possibly intermingle it to something unintelligible.
