How to replace/delete text from a pdf using python?

I have code that hides parts of the PDF by covering them with a white polygon, but the text is still there: if you press Ctrl+F, you can still find it.
My goal is to actually remove the text from the PDF itself. Using pdfminer I managed to extract the text from the PDF, but I don't know if it's possible to actually "replace" the text with, say, some empty spaces. Is such a thing possible using Python? Extracting it isn't enough; I need the text to be removed from the PDF.

Is such a thing possible? Yes, although it is not recommended. In my opinion, your best bet is to open and read your existing file, convert it to an editable format, remove whatever text you don't want, and then convert it back.
However, you could extract the text and remove it in memory by using:
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Line by line, this program would:
pdfFileObj = open('example.pdf', 'rb')
Open the example.pdf and save the file object as pdfFileObj.
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
Create a PdfFileReader object, passing it the PDF file object, to get a PDF reader object.
print(pdfReader.numPages)
Give the number of pages.
pageObj = pdfReader.getPage(0)
Create an object of the PageObject class. The PDF reader object has a getPage() function, which takes a page number (starting from index 0) as an argument and returns the page object.
print(pageObj.extractText())
Extract text from the PDF page.
pdfFileObj.close()
Close the PDF file object.
The replacement text would simply be "", as you want to remove all instances / cases of a certain piece of text.
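For illustration, here is a minimal sketch of that replacement step applied to an already-extracted string. The `remove_text` helper and the sample text are hypothetical. Note that this only changes the string held in memory; the PDF file on disk still contains the original text.

```python
def remove_text(extracted, unwanted):
    """Return the extracted text with every occurrence of `unwanted` removed."""
    # The replacement text is simply the empty string.
    return extracted.replace(unwanted, "")


# Stand-in for the string returned by pageObj.extractText()
page_text = "Invoice 2021-07\nTotal due: 40 EUR"
cleaned = remove_text(page_text, "Invoice ")
print(cleaned)
```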

I used pdf-redactor in one of my projects and it works pretty well.
Here is an example of how to redact Social Security Numbers from the text layer.
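The linked example is not reproduced here. As a rough sketch of the idea, the snippet below shows only the regex side of such a redaction, using Python's standard `re` module on a plain string. The `SSN_PATTERN` name, the `redact_ssns` helper, and the placeholder value are illustrative assumptions; wiring such a filter into pdf-redactor is left to that library's own documentation.

```python
import re

# Illustrative pattern: US Social Security Numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text, placeholder="XXX-XX-XXXX"):
    """Return `text` with every SSN-shaped match replaced by `placeholder`."""
    return SSN_PATTERN.sub(placeholder, text)


sample = "Applicant SSN: 123-45-6789, phone 555-0100."
print(redact_ssns(sample))
```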

This is kind of memory intensive but you can copy the rest of the pdf apart from the part you are removing and then overwrite the file with the new version which does not contain the part you wish to remove. You can do this using PyPDF by retrieving a content stream and finding and removing the relevant parts.
PyPDF docs: https://pythonhosted.org/PyPDF2/PageObject.html?highlight=getcontents#PyPDF2.pdf.PageObject.getContents
PDF standard: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf (pp. 78 and 81)
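To make the "find and remove the relevant parts" step more concrete, here is a small sketch that works on an already-decoded content stream using only the standard library. It assumes the stream is uncompressed, that the text is stored as literal strings in a simple one-byte encoding, and that the target is not split across several show-text operators or placed in a TJ array. The `drop_text` helper is hypothetical, and writing the modified stream back into the PDF is not shown.

```python
import re

# A literal-string operand followed by the Tj (show text) operator.
# Escaped characters such as \( and \) are allowed inside the string;
# unescaped nested parentheses are not handled by this sketch.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj\b", re.DOTALL)


def drop_text(content, target):
    """Blank out every Tj operation whose string contains `target`."""
    def blank(match):
        if target in match.group(1):
            return b"() Tj"  # keep the operator, show an empty string
        return match.group(0)
    return TJ_PATTERN.sub(blank, content)


# Hand-written stand-in for a decoded page content stream
stream = b"BT /F1 12 Tf 72 720 Td (Invoice 42) Tj 0 -14 Td (Total due) Tj ET"
cleaned = drop_text(stream, b"Invoice")
print(cleaned)
```

Real-world streams are usually compressed and often use subset fonts with custom encodings, so the target bytes may not appear literally; in that case a dedicated redaction tool is likely the safer route.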

I know I am late, but for future readers, here is a workaround I found to resolve this using PyMuPDF. This solution successfully deletes text from the PDF.
import fitz  # PyMuPDF
doc = fitz.open("input.pdf")  # placeholder name for your source file
page = doc.load_page(0)
# find every occurrence of the text on the page
draft = page.search_for("Invoice")
for rect in draft:
    page.add_redact_annot(rect)
# remove the marked text; PDF_REDACT_IMAGE_NONE leaves overlapping images untouched
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
# then save the doc to a new PDF:
doc.save("new.pdf", garbage=3, deflate=True)

Related

Replace a Specific page in a PDF with a page from another PDF in python 3

I am using PyPDF2 to highlight text on a particular page in PDF files. As a result, I get only a single page with highlighted text as output. Now I want to replace this page in the original PDF file.
I have also tried the "search=" parameter from Adobe to highlight in the same file itself, but it is not working.
I am new to working with PDFs. Sorry if the question sounds a bit naive.

Get a word processor or page layout software. Convert PDF to format your software can read. Edit document. Write new PDF.
Anything else sounds nefarious. Nobody should help you do it any other way.

To replace a page in a given file, you can try the following approach:
Get the page (the object) that you want to modify or replace.
Change it (PyPDF2 returns the object itself rather than a copy, so you can change it in place).
In the code below, I'm adding a "watermark" to my document:
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
# inFile, outFile, fixPage, startPage and pages are defined elsewhere;
# WaterMark._page is my own helper that writes the watermark PDF to tmp_name.
tmp_name = "__tmp.pdf"
output_file = PdfFileWriter()
with open(inFile, 'rb') as f:
    # Read the pdf (create a pdf stream)
    pdf_original = PdfFileReader(f, strict=False)
    # put all buffer in a single file
    output_file.appendPagesFromReader(pdf_original)
    # create new PDF with water mark
    WaterMark._page(fixPage, tmp_name)
    # Open the created pdf
    with open(tmp_name, 'rb') as ftmp:
        # Read the temp pdf (create a pdf stream obj)
        temp_pdf = PdfFileReader(ftmp)
        for p in range(startPage, startPage+pages):
            original_page = output_file.getPage(p)
            temp_page = temp_pdf.getPage(0)
            original_page.mergePage(temp_page)
        # write result while both source files are still open
        if output_file.getNumPages():
            # newpath = inFile[:-4] + "_numbered.pdf"
            with open(outFile, 'wb') as f:
                output_file.write(f)
    os.remove(tmp_name)

How to erase text from PDF using Python

I'm creating a python script to edit text from PDFs.
I have this Python code which allows me to add text into specific positions of a PDF file.
import PyPDF2
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
import sys
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
# Insert code into specific position
can.drawString(300, 115, "Hello world")
can.save()
# move to the beginning of the BytesIO buffer
packet.seek(0)
new_pdf = PyPDF2.PdfFileReader(packet)
# read your existing PDF
existing_pdf = PyPDF2.PdfFileReader(open("original.pdf", "rb"))
num_pages = existing_pdf.numPages
output = PyPDF2.PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(num_pages-1) # get the last page of the original pdf
page.mergePage(new_pdf.getPage(0)) # merges my created text with my PDF.
x = existing_pdf.getNumPages()
#add all pages from original pdf into output pdf
for n in range(x):
    output.addPage(existing_pdf.getPage(n))
# finally, write "output" to a real file
outputStream = open("output.pdf", "wb")
output.write(outputStream)
outputStream.close()
My problem: I want to replace the text in a specific position of my original PDF with my custom text. A way of writing blank characters would do the trick but I couldn't find anything that does this.
PS.: It must be Python code because I will need to deploy this as a .exe file later and I only know how to do that using Python code.

A general purpose algorithm for replacing text in a PDF is a difficult problem. I'm not saying it can't ever be done, because I've demonstrated doing so with the Adobe PDF Library albeit with a very simple input file with no complications, but I'm not sure that pyPDF2 has the facilities required to do so. In part, just finding the text can be a challenge.
You (or more realistically your PDF library) have to parse the page contents and keep track of the changes to the graphics state, specifically changes to the current transformation matrix (in case the text is in a Form XObject), to the text transformation matrix, and to the font. You have to use the font resource to get character widths to figure out where the text cursor will be positioned after showing a string. You may also need to handle the standard-14 fonts, which don't contain that information in their font resources (the application, that is, your program, is expected to know their metrics).
After all that, removing the text is easy if you don't need to break up a Tj or TJ (show text) instruction into different parts. Preventing the text that follows from shifting, if that's what's desired, may require inserting a new Tm instruction to reposition the following text to where it would have been.
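The cursor-tracking step can be sketched as follows, assuming a simple single-byte font whose glyph widths are given in thousandths of a text-space unit, as they are in PDF font dictionaries. Per glyph, the horizontal advance is (w0 × font size + character spacing + word spacing) × horizontal scaling, with word spacing applied only to the space character. The `text_advance` name and the width table are illustrative; real code would read the widths from the font resource or from the standard-14 metrics.

```python
def text_advance(text, widths, font_size,
                 char_spacing=0.0, word_spacing=0.0, horizontal_scale=1.0):
    """Return how far the text cursor moves horizontally after showing `text`.

    `widths` maps each character to its glyph width in 1/1000 text-space units.
    """
    total = 0.0
    for ch in text:
        w0 = widths[ch] / 1000.0
        # Word spacing applies only to the single-byte space character.
        tw = word_spacing if ch == " " else 0.0
        total += (w0 * font_size + char_spacing + tw) * horizontal_scale
    return total


# Illustrative width table (values in 1/1000 units), not read from a real font
widths = {"H": 722, "i": 222, " ": 278}
advance = text_advance("Hi ", widths, font_size=10)
print(round(advance, 2))
```

Knowing this advance is what lets you work out where the text after a removed string would have started, so that it can be repositioned there.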
Inserting new text can be challenging. If you want to stay consistent with the font being used and it is embedded and subset, it may not necessarily contain the glyphs you need for your text insertion. And after insertion, you then have to decide whether you need to reflow the text that comes after the text you inserted.
And lastly, you will need your PDF library to save all the changes. Quite frankly, using Adobe Acrobat's redaction features would likely be a cheaper and more effective way of doing this than trying to program it from scratch.

If you want to do a poor man's redaction with ReportLab and PyPDF2, you would create your replacement content with ReportLab.
Given a Canvas c, a rectangle indicating an area, a text string, and a point where the text string would be inserted, you would then:
# set the fill color to white
c.setFillColorRGB(1, 1, 1)
# draw a filled rectangle with no outline (x, y, width, height describe your rectangle)
c.rect(x, y, width, height, fill=1, stroke=0)
# change the fill color back to black
c.setFillColorRGB(0, 0, 0)
# draw the replacement text at your insert position
c.drawString(text_x, text_y, text_string)
Save the PDF document you've created to a temporary file.
Open this PDF document and the document you want to modify using PyPDF2's PdfFileReader. Create a PdfFileWriter object; call it modifiedDoc. Get page 0 of the temporary PDF; call it updatePage. Get page n of the other document; call it toModifyPage.
toModifyPage.mergePage(updatePage)
After you are done updating pages:
modifiedDoc.cloneDocumentFromReader(srcDoc)
modifiedDoc.write(outStream)
Again, if you go this route, a user might still see the original text before it gets covered up with the new content, and text extraction would likely pull out both the original and the new text for that area, possibly intermingling them into something unintelligible.

Copying file contents to clipboard and pasting into a plain text file automatically in python

All I am trying to accomplish with this little script that I wrote is to parse data from a PDF file.
However, I seem to have run into an issue with python, more specifically the PyPDF2 module not able to read the text from a pdf file. The data printed out is all fuzzy and basically not readable. However, when I open up the pdf file that I am trying to read I can simply click drag and ctrl+c to copy contents after which when I paste it into a plain txt document it works flawlessly. The data is readable when I go through this process of copying and pasting manually.
So what I'm trying to do is mimic that exact step, however automate it instead of having me go through all the pages within the pdf file performing the above steps.
Or, if there are any suggestions as to what else I can do to achieve this, I would greatly appreciate it. I have tried converting the PDF file into docx and plain text files; however, the contents of the file had their formatting completely rearranged.
import PyPDF2
pdfFileObj = open('sjsuclassdata.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
pageObj = pdfReader.getPage(4)
print(pageObj.extractText())
EDIT
Essentially, what I'm trying to do now is simply write a script that performs the following actions:
1.) Read pdf file
2.) copy contents of whole page (ctrl+a)
3.) paste contents of whole page into plain text file (ctrl+v)
4.) read pdf till end of file

I would give slate a try:
import slate
output_prefix = 'foobar'
file_ext = 'txt'
# PDFs are binary files, so open in 'rb' mode
with open('example.pdf', 'rb') as f:
    doc = slate.PDF(f)
# write the text of each page to its own file, e.g. foobar_0.txt
for page_number, page in enumerate(doc):
    with open('%s_%s.%s' % (output_prefix, page_number, file_ext), 'w') as out:
        out.write(page)
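If a single text file for the whole document is preferred, as in the copy-and-paste workflow described in the question, the per-page strings can be joined before writing. This is a minimal sketch using only the standard library; it assumes the page texts have already been extracted into a list of strings by whichever library works for your file, and the `write_pages_as_text` helper and sample data are hypothetical.

```python
import os
import tempfile


def write_pages_as_text(pages, out_path, separator="\n\n"):
    """Join the per-page strings and write them to one plain text file."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(separator.join(pages))
    return out_path


# Stand-in for the per-page text produced by a PDF text extractor
pages = ["Page one text", "Page two text"]
out_file = os.path.join(tempfile.mkdtemp(), "output.txt")
write_pages_as_text(pages, out_file)
```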

How to insert annotation into pdf with Python

I want to add text or an annotation to an existing PDF file to explain some key words.
At first I tried pyPdf and ReportLab to merge the original PDF file with a newly generated PDF containing the explanations, but it doesn't work: the original file covers all the words of the explanation PDF and makes the new content invisible. I don't know why. If I test merging two newly generated explanation PDF files into one, it works well.
So I am thinking of trying another way: inserting just an annotation into the existing PDF file with Python. Does anybody with related experience have a suggestion? Thanks!

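Adding a watermark to an existing PDF using PyPDF certainly works for me:
from pyPdf import PdfFileReader, PdfFileWriter
template = PdfFileReader(open("template.pdf", "rb"))  # template pdf
# "new" is the reader for the PDF to overlay; the file name here is a placeholder
new = PdfFileReader(open("new.pdf", "rb"))
output = PdfFileWriter()  # writer for the merged pdf
for i in range(new.getNumPages()):
    page = template.getPage(i)
    page.mergePage(new.getPage(i))
    output.addPage(page)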

how to insert a string to pdf using pypdf?

Sorry, I'm a noob in Python.
I need to create a PDF file without using an existing PDF file (purely create a new one).
I have been googling, and most results are about merging two PDFs or creating a new file from a particular page copied from another file. What I want to achieve is a report page (with a chart), but as a first, simple step: how do I insert a string into my PDF file ("hello world", maybe)?
This is my code to make a new PDF file with a single blank page:
from pyPdf import PdfFileReader, PdfFileWriter
op = PdfFileWriter()
# here to add blank page
op.addBlankPage(200,200)
#how to add string here, and insert it to my blank page ?
ops = file("document-output.pdf", "wb")
op.write(ops)
ops.close()

You want "pisa" or "reportlab" for generating arbitrary PDF documents, not "pypdf".
http://www.xhtml2pdf.com/doc/pisa-en.html
http://www.reportlab.org

Also check out the pyfpdf library. I've used the php port of this library for a few years and it's quite flexible, allowing you to work with flowable text, lines, rectangles, and images.
http://code.google.com/p/pyfpdf
