Delete text from PDF using PyMuPDF


I need to remove the text "DRAFT" from a pdf document using Python. I can find the text box containing the text but can't find an example of how to edit the pdf text element using pymupdf.
In the example below the draft object contains the coords and text for the DRAFT text element.
import fitz
fname = r"original.pdf"
doc = fitz.open(fname)
page = doc.load_page(0)
draft = page.search_for("DRAFT")
# insert code here to delete the DRAFT text or replace it with an empty string
out_fname = r"final.pdf"
doc.save(out_fname)
Added 4/28/2022
I found a way to delete the text but unfortunately it also deletes any overlapping text underneath the box around DRAFT. I really just want to delete the DRAFT letters without modifying underlying layers
# insert code here to delete the DRAFT text or replace it with an empty string
rl = page.search_for("DRAFT", quads = True)
page.add_redact_annot(rl[0])
page.apply_redactions()

You can try this.
import fitz
doc = fitz.open("xxxx")
for page in doc:
    for xref in page.get_contents():
        stream = doc.xref_stream(xref).replace(b'The string to delete', b'')
        doc.update_stream(xref, stream)
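The replacement above only matches when the phrase is stored in the content stream as one literal byte string, and it also removes the phrase wherever it occurs inside longer strings. As a rough sketch of a narrower variant, the helper below blanks only complete `(...) Tj` show-text operations whose string equals the target. `remove_tj_text` is a hypothetical name, and the sketch assumes the text is a literal string (not hex-encoded), is not split across a `TJ` array, and contains no escaped parentheses.

```python
import re


def remove_tj_text(stream: bytes, target: bytes) -> bytes:
    """Blank out "(target) Tj" operations in a decompressed content stream.

    Assumption: the text is stored as one literal string shown with Tj.
    Hex strings, TJ arrays and custom font encodings are not handled.
    """
    pattern = re.compile(rb"\(" + re.escape(target) + rb"\)\s*Tj")
    # Keep the operator but show an empty string, so the rest of the
    # stream stays syntactically valid.
    return pattern.sub(b"() Tj", stream)


# Hypothetical content stream: a large DRAFT stamp and a normal sentence.
sample = (b"BT /F1 48 Tf 100 400 Td (DRAFT) Tj ET\n"
          b"BT /F1 12 Tf 72 700 Td (Final DRAFT report) Tj ET")
cleaned = remove_tj_text(sample, b"DRAFT")
print(cleaned.decode("latin-1"))
```

With PyMuPDF, the cleaned bytes could then be written back with doc.update_stream(xref, cleaned), as in the loop above. If nothing changes, the text is probably stored in one of the forms this sketch does not handle.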

Related

How can I extract the outline of a PDF using Python?

I wish to read the outline of a paper in .pdf format. The expected output is a list of section titles like ['abstract', 'Introduction', ...]. The section titles can be identified by the following characteristics: 1) bold and larger font size, 2) all nouns starting with capital letters, and 3) appearing immediately after a line break \n.
The solutions I have tried include:
pypdf2 with reader.outline
reader = PyPDF2.PdfReader('path/to/my/pdf')
print(reader.outline)
pymupdf with doc.get_toc()
doc = fitz.open('path/to/my/pdf')
toc = doc.get_toc()
However, both give me an empty list.
I am currently using the re library to extract the section titles, but the results include additional information such as references and table contents.
import re
re.findall(r'(\[turnpage\]|\n)([A-Z][^.]*?)(\n[A-Z0-9][^\s]*?)', text)
For a clearer understanding of the results produced by the code, please refer to this link

If reader.outline by pypdf gives an empty result, there is no outline specified as metadata.
There can still be an outline specified as normal text. However, detecting / parsing that would require custom work on your side. You can use the text extraction as a basis:
https://pypdf.readthedocs.io/en/latest/user/extract-text.html
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
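As a minimal sketch of the "custom work" mentioned above, the function below picks heading-like lines out of already extracted text. The rules (short line, optional section number, capitalised first word, no sentence-ending punctuation) are assumptions chosen for illustration, and `guess_headings` is a hypothetical helper, not part of pypdf.

```python
import re


def guess_headings(text: str, max_words: int = 8) -> list:
    """Return lines that look like section titles (heuristic only)."""
    headings = []
    for line in text.splitlines():
        line = line.strip()
        if not line or len(line.split()) > max_words:
            continue
        # Optional numbering such as "1" or "2.3", then a capitalised word,
        # and no sentence-ending punctuation anywhere in the line.
        if re.fullmatch(r"(\d+(\.\d+)*\.?\s+)?[A-Z][^.!?]*", line):
            headings.append(line)
    return headings


# Hypothetical extracted text standing in for page.extract_text() output.
sample = (
    "Abstract\n"
    "We study PDF parsing in detail.\n"
    "1 Introduction\n"
    "PDF files store glyphs, not structure.\n"
    "2.1 Related Work\n"
    "prior tools exist.\n"
)
print(guess_headings(sample))
```

A heuristic like this will still produce false positives, for example a short last line of a wrapped paragraph, so font size or boldness from a layout-aware extractor is usually needed to narrow the candidates further.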

how to extract only main text with pdfplumber and ignore image text and tables?

I am trying to parse any non-scanned PDF and extract only the text, without tables and their comments or pictures and their comments: just the main text of the PDF, if such text exists. I tried pdfplumber.
When I try this piece of code, it extracts all text, including tables and their comments.
import pdfplumber
with pdfplumber.open("somePDFname.pdf") as pdf:
    for pdf_page in pdf.pages:
        single_page_text = pdf_page.extract_text()
        print( single_page_text )
I saw this solution, "How to ignore table and its content while extracting text from pdf", but if I understood correctly it was specific to a certain table, so it did not work for me, as I don't know the dimensions of the tables/images I'm scanning.
I also read the related pdfplumber issue (https://github.com/jsvine/pdfplumber/issues/242).
I also saw this solution: https://stackoverflow.com/questions/66293939/how-i-can-extract-only-text-without-tables-inside-a-pdf-file-using-pdfplumber
but I would rather use pdfplumber for later parsing.
Is there a more general solution to the problem?

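Hello, you can filter the page's objects before extracting text. Note that filter is a method of a pdfplumber page, not of the extracted string:
clean_page = page.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
clean_text = clean_page.extract_text()
You can also specify the font size in the filter.
import pdfplumber
with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])
The code above prints the attributes of the first character on the page, including its fontname and size, which you can use to choose the filter conditions.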
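The same filtering idea can be applied to position instead of font. The sketch below is a predicate that keeps only objects whose centre lies outside a set of bounding boxes. `outside_all` is a hypothetical helper, and the sample dictionaries only mimic the x0, x1, top and bottom keys that pdfplumber objects carry.

```python
def outside_all(obj: dict, bboxes: list) -> bool:
    """True if the object's centre lies outside every (x0, top, x1, bottom) box."""
    cx = (obj["x0"] + obj["x1"]) / 2
    cy = (obj["top"] + obj["bottom"]) / 2
    return not any(x0 <= cx <= x1 and top <= cy <= bottom
                   for x0, top, x1, bottom in bboxes)


# Hypothetical sample data standing in for page.chars and table bounding boxes.
table_boxes = [(50, 100, 300, 200)]
chars = [
    {"text": "A", "x0": 60, "x1": 66, "top": 120, "bottom": 130},  # inside the table
    {"text": "B", "x0": 60, "x1": 66, "top": 300, "bottom": 310},  # body text
]
kept = [c["text"] for c in chars if outside_all(c, table_boxes)]
print(kept)
```

With pdfplumber, the boxes could come from [t.bbox for t in page.find_tables()] and the predicate could be passed as page.filter(lambda obj: outside_all(obj, boxes)).extract_text(). This only helps when the table finder actually detects the tables, and it does nothing about captions that sit outside the detected boxes.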

Python PDFMiner : How to link outlines to underlying text

I am trying to parse a PDF and create some kind of a hierarchical structure. Consider the input
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text
Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text
here is how i can extract the outline/titles
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
path='myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
outlines = document.get_outlines()
for (level,title,dest,a,se) in outlines:
    print (level, title)
this gives me
(1, u'Title 1')
(2, u'Title 1.1')
(1, u'Title 2')
which is perfect, as the levels are aligned with the text hierarchy. Now I can extract the text as follows
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
text_from_pdf = open('textFromPdf.txt','w')
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    for element in layout:
        if isinstance(element, LTTextBox):
            text_from_pdf.write(''.join([i if ord(i) < 128 else ' ' for i in element.get_text()]))
which gives me
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text
Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text
which is OK as far as the order goes, but now I have lost all sense of hierarchy. How do I know where one title ends and another begins? Also, which title/heading, if any, is the parent of a given one?
Is there a way to connect the outline information to the layout elements? It would be great to be able to parse all the information while iterating through the levels.
Another problem is that if there are any citations at the bottom of a page, then the citation text gets mixed in with the results. Is there a way to ignore the headers, footers and citations when parsing a PDF?

I wish it were possible, but the pdfminer documentation clearly states the following:
Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page. Since PDF does not have a logical structure, and it does not provide a way to refer to any in-page object from the outside, there’s no way to tell exactly which part of text these destinations are referring to.
https://pdfminer-docs.readthedocs.io/programming.html#:~:text=Some%20PDF%20documents,are%20referring%20to.
Thanks
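Given that limitation, one workaround is to match outline titles against the extracted text itself. The sketch below assumes every title appears verbatim on a line of its own and in the same order as the outline, which holds for the sample in the question but not for PDFs in general. `split_by_outline` is a hypothetical helper, not a pdfminer function.

```python
def split_by_outline(outline, lines):
    """Group text lines under outline titles.

    outline: list of (level, title) pairs, in document order.
    lines:   extracted text lines, in reading order.
    Assumption: every title appears verbatim on a line of its own.
    """
    sections = []
    pending = list(outline)
    for line in lines:
        stripped = line.strip()
        if pending and stripped == pending[0][1]:
            # The next expected title was found: start a new section.
            level, title = pending.pop(0)
            sections.append({"level": level, "title": title, "body": []})
        elif sections and stripped:
            # Any other non-empty line belongs to the most recent section.
            sections[-1]["body"].append(stripped)
    return sections


outline = [(1, "Title 1"), (2, "Title 1.1"), (1, "Title 2")]
lines = ["Title 1", "some text some text", "Title 1.1",
         "some more text", "Title 2", "some final text"]
for section in split_by_outline(outline, lines):
    print("  " * (section["level"] - 1) + section["title"], section["body"])
```

The parent of a section is then the nearest preceding section with a lower level. Footers and citations would still land in whichever section precedes them, so they need separate filtering, for example by position on the page.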

How to erase text from PDF using Python

I'm creating a python script to edit text from PDFs.
I have this Python code which allows me to add text into specific positions of a PDF file.
import PyPDF2
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
import sys
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
# Insert code into specific position
can.drawString(300, 115, "Hello world")
can.save()
# move to the beginning of the BytesIO buffer
packet.seek(0)
new_pdf = PyPDF2.PdfFileReader(packet)
# read your existing PDF
existing_pdf = PyPDF2.PdfFileReader(open("original.pdf", "rb"))
num_pages = existing_pdf.numPages
output = PyPDF2.PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(num_pages-1) # get the last page of the original pdf
page.mergePage(new_pdf.getPage(0)) # merges my created text with my PDF.
x = existing_pdf.getNumPages()
#add all pages from original pdf into output pdf
for n in range(x):
    output.addPage(existing_pdf.getPage(n))
# finally, write "output" to a real file
outputStream = open("output.pdf", "wb")
output.write(outputStream)
outputStream.close()
My problem: I want to replace the text in a specific position of my original PDF with my custom text. A way of writing blank characters would do the trick but I couldn't find anything that does this.
PS.: It must be Python code because I will need to deploy this as a .exe file later and I only know how to do that using Python code.

A general-purpose algorithm for replacing text in a PDF is a difficult problem. I'm not saying it can't ever be done, because I've demonstrated doing so with the Adobe PDF Library, albeit with a very simple input file with no complications, but I'm not sure that PyPDF2 has the facilities required to do so. In part, just finding the text can be a challenge.
You (or, more realistically, your PDF library) have to parse the page contents and keep track of the changes to the graphics state, specifically changes to the current transformation matrix (in case the text is in a Form XObject), to the text transformation matrix, and to the font. You also have to use the font resource to get character widths to figure out where the text cursor will be positioned after showing a string. You may need to handle the standard 14 fonts, which don't contain that information in their font resources (the application, that is, your program, is expected to know their metrics).
After all that, removing the text is easy if you don't need to break up a Tj or TJ (show text) instruction into different parts. Preventing the text after from shifting, if that's what's desired, may require inserting a new Tm instruction to reposition the text after to where it would have been.
Inserting new text can be challenging. If you want to stay consistent with the font being used and it is embedded and subset, it may not necessarily contain the glyphs you need for your text insertion. And after insertion, you then have to decide whether you need to reflow the text that comes after the text you inserted.
And lastly, you will need your PDF library to save all the changes. Quite frankly, using Adobe Acrobat's Redaction features would likely be a more cost-effective way of doing this than trying to program it from scratch.
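To make the character-width bookkeeping concrete, here is a small sketch of how far the text cursor moves horizontally after a string is shown, following the text-showing rule in the PDF specification: glyph width in 1/1000 units times font size, plus character spacing, plus word spacing for spaces, all scaled horizontally. The function name and the width values are hypothetical, and kerning adjustments inside `TJ` arrays are left out.

```python
def text_advance(text, widths, font_size, char_spacing=0.0,
                 word_spacing=0.0, h_scale=1.0):
    """Horizontal advance after showing `text`, in text space units.

    widths maps each character to its glyph width in 1/1000 units, as in a
    simple font's /Widths array. Kerning adjustments from TJ arrays are ignored.
    """
    total = 0.0
    for ch in text:
        advance = widths[ch] / 1000 * font_size + char_spacing
        if ch == " ":
            advance += word_spacing  # word spacing applies to the space character only
        total += advance * h_scale
    return total


# Hypothetical widths; real values come from the font resource or, for the
# standard 14 fonts, from their published metrics.
widths = {"D": 700, "R": 700, "A": 650, "F": 600, "T": 600}
print(text_advance("DRAFT", widths, font_size=10))
```

If a string is removed from a show-text instruction, this advance is the amount by which the following text would shift unless it is repositioned.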

If you want to do a poor man's redaction with ReportLab and PyPDF2, you would create your replacement content with ReportLab.
Given a Canvas, a rectangle indicating an area, a text string, and a point where the text string would be inserted, you would then:
#set a fill color to white:
c.setFillColorRGB(1,1,1)
# draw a rectangle
c.rect([your rectangle], fill=1)
# change color
c.setFillColorRGB(0,0,0)
c.drawString([text insert position], [text string])
Save the PDF document you've created to a temporary file.
Open this PDF document and the document you want to modify using PyPDF2's PdfFileReader. Create a PdfFileWriter object and call it modifiedDoc. Get page 0 of the temporary PDF and call it updatePage. Get page n of the other document and call it toModifyPage.
toModifyPage.mergePage(updatePage)
After you are done updating pages:
modifiedDoc.cloneDocumentFromReader(srcDoc)
modifiedDoc.write(outStream)
Again, if you go this route, a user might still see the original text before it gets covered up with the new content, and text extraction would likely pull out both the original and new text for that area, and possibly intermingle it to something unintelligible.

How can I open a docx file, find a particular string occurring at multiple places and add two lines after that in the whole document using python?

As I am new to Python programming, I want to open a .docx file, parse it, find every occurrence of a particular string, and then add two lines after each occurrence throughout the document. How can I do this using a Python script?

You could do this as follows using win32com:
import win32com.client
search_for = "This is some text in my file"
add_lines = "^pLine1^pLine2^pLine3" # ^p creates a new line
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
word.DisplayAlerts = False
doc = word.Documents.Open(r'c:\my_folder\my_file.docx')
const = win32com.client.constants
find = doc.Content.Find
find.ClearFormatting()
find.Replacement.ClearFormatting()
find.Execute(Forward=True, Replace=const.wdReplaceAll, FindText=search_for, ReplaceWith=search_for+add_lines)
word.ActiveDocument.SaveAs(r'c:\my_folder\my_file_output.docx')
word.Quit()
To change the colour of the replacement text you could also add the following before the find.Execute:
find.Replacement.Font.Color = 255 # wdColorRed
A full list of standard Word colours is listed on the Microsoft site.
