I want to add text or annotations to an existing PDF file to explain some keywords.
At first I tried pyPdf and ReportLab to merge the original PDF with a newly generated PDF containing the explanations, but it doesn't work: the original page covers all the words of the explanation PDF, so the new content is invisible in the merged file. I don't know why. If I merge two newly generated explanation PDFs into one, it works fine.
So I am thinking of trying another way: inserting just annotations into the existing PDF file with Python. Does anybody with related experience have a suggestion? Thanks!
Adding a watermark to existing pdf using PyPDF certainly works for me:
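from pyPdf import PdfFileReader, PdfFileWriter

template = PdfFileReader(file("template.pdf", "rb"))  # template pdf
new = PdfFileReader(file("new.pdf", "rb"))  # pdf to lay on top (file name assumed)
output = PdfFileWriter()  # writer for the merged pdf
for i in range(new.getNumPages()):
    page = template.getPage(i)
    page.mergePage(new.getPage(i))  # draw the new page over the template page
    output.addPage(page)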
Read my other SO answer for reference.
Read my complete article to know more about pdf generation and merging in python.
I have a folder of PDFs that I'm currently merging using PyPDF2.
import os
import datetime as dt
from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
for file in os.listdir('****'):
    if file.endswith(".pdf"):
        merger.append('****' + file)
merger.write('****' + str(dt.date.today()) + '.pdf')
merger.close()
The files contain graphs and the titles are very specific. What I would like to be able to do is:
Based on string in title, merge multiple PDFs to the same page of the new PDF (preferably split into two columns) - I know this isn't correct syntax but something like:
if 'dogs' in file:
    merger.write(...,page=1,cols=2)
elif 'cats' in file:
    merger.write(...,page=2,cols=2)
Not sure if this is possible, have looked at other answers and read through the documentation but can't figure it out. Would also like to be able to have a fair amount of graphs (I guess up to 6?) on a single page.
If the PDF titles are consistently located in the same place in each file, you could call extractText() to retrieve the title text from each PDF and then do your comparison on that.
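As far as I know, PdfFileMerger only appends whole pages, so placing several graphs on one page needs a different tool. Below is a rough sketch that swaps in PyMuPDF (`fitz`), whose `Page.show_pdf_page` draws a page of another PDF into a rectangle. The keywords, the A4 page size, the margin, and the helper names `group_by_keyword`, `grid_cells`, and `build_sheet` are all assumptions made for illustration.

```python
import os


def group_by_keyword(filenames, keywords):
    """Map each keyword to the PDF file names that contain it (case-insensitive)."""
    groups = {key: [] for key in keywords}
    for name in sorted(filenames):
        if not name.lower().endswith(".pdf"):
            continue
        for key in keywords:
            if key.lower() in name.lower():
                groups[key].append(name)
                break  # a file goes into the first matching group only
    return groups


def grid_cells(count, cols=2, width=595, height=842, margin=20):
    """Return (x0, y0, x1, y1) boxes for `count` items, filled row by row."""
    rows = max(1, -(-count // cols))  # ceiling division, at least one row
    cell_w = (width - 2 * margin) / cols
    cell_h = (height - 2 * margin) / rows
    cells = []
    for i in range(count):
        row, col = divmod(i, cols)
        x0 = margin + col * cell_w
        y0 = margin + row * cell_h
        cells.append((x0, y0, x0 + cell_w, y0 + cell_h))
    return cells


def build_sheet(folder, groups, out_path):
    """Write one output page per keyword, with that group's graphs in a grid."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    out = fitz.open()  # new, empty PDF
    sources = []       # keep source documents open until the output is saved
    for names in groups.values():
        if not names:
            continue
        page = out.new_page(width=595, height=842)  # A4 in points (assumed)
        for name, cell in zip(names, grid_cells(len(names))):
            src = fitz.open(os.path.join(folder, name))
            sources.append(src)
            page.show_pdf_page(fitz.Rect(*cell), src, 0)  # first page of each graph
    out.save(out_path)
    out.close()
    for src in sources:
        src.close()
```

With six files in a group, `grid_cells(6)` gives three rows of two cells, which matches the "up to 6 graphs on a page" case. Something like `build_sheet(folder, group_by_keyword(os.listdir(folder), ["dogs", "cats"]), "merged.pdf")` would then produce one page per keyword.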
I need help from the knowledgeable people on this forum. I just want to embed one PDF file in another, so that when I open the attachments section of the second file I can see and open the first one. I would like to do this with PyMuPDF. I found the embeddedFileAdd method for this, but I am not sure how to use it.
Just solved it with this code:
import fitz
pdf1=r'C:\Users\Amit PC\Desktop\pdf1.pdf'
pdf2=r'C:\Users\Amit PC\Desktop\pdf2.pdf'
outfile=r'C:\Users\Amit PC\Desktop\test2.pdf'
img= bytearray(open(pdf2,'rb').read())
doc1=fitz.open(pdf1)
doc1.embeddedFileAdd(img,'attach.pdf')
doc1.save(outfile, deflate = True)
doc1.close()
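In more recent PyMuPDF releases the camelCase method names were replaced by snake_case ones, and to the best of my knowledge the attachment call is now `embfile_add(name, buffer)`, with the entry name first. A minimal sketch under that assumption; the function name `embed_pdf` and its parameters are placeholders, not part of the library.

```python
def embed_pdf(host_path, attachment_path, out_path, name="attach.pdf"):
    """Save a copy of host_path with attachment_path embedded under `name`."""
    import fitz  # PyMuPDF

    with open(attachment_path, "rb") as fh:
        payload = fh.read()  # raw bytes of the file to embed

    doc = fitz.open(host_path)
    doc.embfile_add(name, payload)  # entry name first, then the data
    doc.save(out_path, deflate=True)
    names = doc.embfile_names()  # attachment names now present in the document
    doc.close()
    return names
```

Calling `embed_pdf("pdf1.pdf", "pdf2.pdf", "test2.pdf")` should leave `pdf2.pdf` visible in the attachments pane of `test2.pdf`, the same result as the code above.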
I have code that hides parts of the PDF by covering them with a white polygon, but the issue is that the text is still there: if you ctrl-f you can still find it.
My goal is to actually remove the text from the PDF itself. Using pdfminer I managed to extract the text from the PDF, but I don't know if it's possible to actually "replace" the text with, say, just some empty spaces. Is such a thing possible using Python? Extracting it isn't enough; I need the text to be removed from the PDF.
Is such a thing possible? Yes, although it is not recommended. In my opinion, your best bet is to open and read your existing file, convert it to an editable format, remove whatever text you don't want, and then convert it back.
However, you could extract the text with PyPDF2 and then remove the unwanted parts from the extracted string in memory:
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Line by line, this program would:
pdfFileObj = open('example.pdf', 'rb')
Open the example.pdf and save the file object as pdfFileObj.
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
Create a PdfFileReader object by passing it the PDF file object; this gives you a PDF reader object.
print(pdfReader.numPages)
Give the number of pages.
pageObj = pdfReader.getPage(0)
Create an object of the PageObject class. The PDF reader object has a function getPage() which takes the page number (starting from index 0) as an argument and returns the page object.
print(pageObj.extractText())
Extract text from the PDF page.
pdfFileObj.close()
Close the PDF file object.
The replacement text would simply be "", since you want to remove every occurrence of a certain piece of text. Note that this changes only the extracted string; extractText() does not modify the PDF file itself.
I used pdf-redactor in one of my projects and it works pretty well.
Here is an example of how to redact Social Security Numbers from the text layer.
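A rough idea of what such a filter looks like: as I understand pdf-redactor's README, text-layer redaction is configured through `content_filters`, a list of (compiled regex, replacement function) pairs. Treat that attribute name as an assumption to check against the library. The sketch below does not call pdf-redactor at all; it only applies one such pair to a plain string, using a deliberately simple SSN pattern, and `apply_filters` is a made-up helper.

```python
import re

# One (pattern, replacement function) pair; the pattern is a simplified
# illustration and will not catch every way an SSN can be written.
SSN_FILTER = (
    re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
    lambda match: "XXX-XX-XXXX",
)


def apply_filters(text, filters):
    """Run each (regex, replacement) pair over the text in order."""
    for pattern, replacement in filters:
        text = pattern.sub(replacement, text)
    return text


redacted = apply_filters("SSN: 123-45-6789, phone 555-1234", [SSN_FILTER])
print(redacted)  # SSN: XXX-XX-XXXX, phone 555-1234
```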
This is kind of memory intensive but you can copy the rest of the pdf apart from the part you are removing and then overwrite the file with the new version which does not contain the part you wish to remove. You can do this using PyPDF by retrieving a content stream and finding and removing the relevant parts.
PyPDF docs https://pythonhosted.org/PyPDF2/PageObject.html?highlight=getcontents#PyPDF2.pdf.PageObject.getContents;
PDF standard https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf pg 78, pg 81;
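To make the content-stream idea concrete: text is drawn by operators such as `Tj`, which takes a string operand. The sketch below blanks matching `Tj` strings in an already decoded stream using only a regular expression. It is a toy under strong assumptions: real streams are usually compressed, may use `TJ` arrays, hex strings, nested parentheses, or font-specific encodings, and the sample bytes and the name `blank_text` are made up for illustration.

```python
import re

# A literal string operand followed by the Tj (show text) operator.
# Escaped characters such as \( and \) inside the string are allowed.
SHOW_TEXT = re.compile(rb"\((?P<text>(?:[^()\\]|\\.)*)\)\s*Tj")


def blank_text(stream, needle):
    """Replace every Tj string that contains `needle` with an empty string."""
    def replace(match):
        if needle in match.group("text"):
            return b"() Tj"
        return match.group(0)
    return SHOW_TEXT.sub(replace, stream)


sample = (
    b"BT /F1 12 Tf 72 720 Td (Invoice 42) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (Total due) Tj ET"
)
cleaned = blank_text(sample, b"Invoice")
print(cleaned.decode("ascii"))
```

In the sample, only the first text object loses its string; the positioning and font operators are left alone, so the rest of the page would render as before.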
I know I am late, but for future readers, here is a workaround I found using PyMuPDF. It actually deletes the text from the PDF.
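import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # input file name is assumed
page = doc.load_page(0)
draft = page.search_for("Invoice")  # rectangles around each hit
for rect in draft:
    page.add_redact_annot(rect)  # mark each hit for redaction
# apply all redactions once; PDF_REDACT_IMAGE_NONE leaves overlapping images untouched
# (call page.apply_redactions() with no arguments to use the default image handling)
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
# then save the doc to a new PDF:
doc.save("new.pdf", garbage=3, deflate=True)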
All I am trying to accomplish with this little script that I wrote is to parse data from a PDF file.
However, I seem to have run into an issue with python, more specifically the PyPDF2 module not able to read the text from a pdf file. The data printed out is all fuzzy and basically not readable. However, when I open up the pdf file that I am trying to read I can simply click drag and ctrl+c to copy contents after which when I paste it into a plain txt document it works flawlessly. The data is readable when I go through this process of copying and pasting manually.
So what I'm trying to do is mimic that exact step, however automate it instead of having me go through all the pages within the pdf file performing the above steps.
Or if there are any suggestions as to what else I can do to achieve this, I would greatly appreciate them. I have tried converting the PDF file into docx and plain text files, but the contents came out with their formatting completely rearranged.
import PyPDF2
pdfFileObj = open('sjsuclassdata.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
pageObj = pdfReader.getPage(4)
print(pageObj.extractText())
EDIT
Essentially what I'm trying to do now is simply write a script that performs the following actions.
1.) Read pdf file
2.) copy contents of whole page (ctrl+a)
3.) paste contents of whole page into plain text file (ctrl+v)
4.) read pdf till end of file
I would give slate a try:
import slate
output_prefix = 'foobar'
file_ext = 'txt'
with open('example.pdf', 'rb') as f:
    doc = slate.PDF(f)
for page_number, page in enumerate(doc):
    # write each page's text to its own file, e.g. foobar_0.txt
    with open('%s_%s.%s' % (output_prefix, page_number, file_ext), 'w') as out:
        out.write(page)
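If slate is not an option, the four steps in the edit above can also be sketched with PyMuPDF (`fitz`), which is swapped in here for PyPDF2. Whether the text comes out readable still depends on how the PDF encodes its fonts, so this is something to try rather than a guaranteed fix. The helper names `extract_pages` and `write_pages` and the form-feed separator are my own choices.

```python
def extract_pages(pdf_path):
    """Yield the plain text of each page, first to last."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            yield page.get_text()
    finally:
        doc.close()


def write_pages(pages, out_path):
    """Write page texts to one plain-text file, separated by form feeds."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for text in pages:
            if count:
                out.write("\f")  # page break marker between pages
            out.write(text)
            count += 1
    return count  # number of pages written
```

For the file in the question this would be `write_pages(extract_pages("sjsuclassdata.pdf"), "sjsuclassdata.txt")`.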
Sorry, I'm a noob in Python.
I need to create a PDF file without using any existing PDF file (purely create a new one).
I have been googling, and most of the results are about merging two PDFs or creating a new file by copying a particular page from another file. What I want to achieve is a report page (with a chart), but as a first, simple step: how do I insert a string into my PDF file ("hello world", maybe)?
This is my code to make a new PDF file with a single blank page:
from pyPdf import PdfFileReader, PdfFileWriter
op = PdfFileWriter()
# here to add blank page
op.addBlankPage(200,200)
#how to add string here, and insert it to my blank page ?
ops = file("document-output.pdf", "wb")
op.write(ops)
ops.close()
You want "pisa" or "reportlab" for generating arbitrary PDF documents, not "pypdf".
http://www.xhtml2pdf.com/doc/pisa-en.html
http://www.reportlab.org
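For the "hello world" step with ReportLab, a minimal sketch might look like this. The coordinates and the function name `hello_pdf` are arbitrary choices for illustration.

```python
def hello_pdf(path, text="Hello world"):
    """Create a one-page PDF containing a single line of text."""
    from reportlab.pdfgen import canvas  # requires the reportlab package

    c = canvas.Canvas(path)       # default page size
    c.drawString(100, 750, text)  # x, y in points from the bottom-left corner
    c.showPage()                  # finish the current page
    c.save()                      # write the file to disk
```

`hello_pdf("document-output.pdf")` writes a new file from scratch, with no existing PDF involved.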
Also check out the pyfpdf library. I've used the php port of this library for a few years and it's quite flexible, allowing you to work with flowable text, lines, rectangles, and images.
http://code.google.com/p/pyfpdf