How do I change hyperlinks inside a PDF using Python?

How do I change the hyperlinks in a PDF using Python? I am currently using PyPDF2 to open up and loop through the pages. How do I actually scan for hyperlinks and then proceed to change them?

So I couldn't get what you want using the pyPDF2 library.
I did, however, get something working with another library: pdfrw. It installed fine for me using pip on Python 3.6:
pip install pdfrw
Note: for the following I have been using this example pdf I found online which contains multiple links. Your mileage may vary with this.
import pdfrw

pdf = pdfrw.PdfReader("pdf.pdf")  # Load the pdf
new_pdf = pdfrw.PdfWriter()       # Create an empty pdf

for page in pdf.pages:            # Go through the pages
    # Links are in Annots, but some pages don't have links so Annots returns None
    for annot in page.Annots or []:
        old_url = annot.A.URI
        # >Here you put logic for replacing the URLs<
        # Use the PdfString object to do the encoding for us
        # Note the brackets around the URL here
        new_url = pdfrw.objects.pdfstring.PdfString("(http://www.google.com)")
        # Override the URL with ours
        annot.A.URI = new_url
    new_pdf.addpage(page)

new_pdf.write("new.pdf")
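The marked line is where your own replacement rules go. A minimal sketch of such a pass, assuming a hypothetical url_map of old-to-new URLs (remember from the note above that a PdfString literal keeps its surrounding parentheses, and PdfString is a str subclass):
import pdfrw

url_map = {"http://old.example.com/": "http://new.example.com/"}  # hypothetical mapping

pdf = pdfrw.PdfReader("pdf.pdf")
new_pdf = pdfrw.PdfWriter()
for page in pdf.pages:
    for annot in page.Annots or []:
        if annot.A and annot.A.URI:         # only link annotations have an /A action
            plain = str(annot.A.URI)[1:-1]  # strip the ( ) delimiters
            if plain in url_map:
                annot.A.URI = pdfrw.objects.pdfstring.PdfString("(%s)" % url_map[plain])
    new_pdf.addpage(page)
new_pdf.write("new.pdf")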

I managed to get it working with PyPDF2.
If you just want to remove all annotations from a page, you only have to do:
if '/Annots' in page: del page['/Annots']
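Spelled out as a full pass over the document, using the same legacy PyPDF2 reader/writer API as the code below, that looks like:
import PyPDF2

reader = PyPDF2.PdfFileReader("input.pdf")
writer = PyPDF2.PdfFileWriter()
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    if '/Annots' in page:
        del page['/Annots']  # drop every annotation on the page, links included
    writer.addPage(page)
with open("no_annotations.pdf", "wb") as f:
    writer.write(f)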
Otherwise, here is how you change each link:
import PyPDF2

new_link = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"  # great video by the way
pdf_reader = PyPDF2.PdfFileReader("input.pdf")
pdf_writer = PyPDF2.PdfFileWriter()

for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    if '/Annots' not in page:
        continue
    for annot in page['/Annots']:
        annot_obj = annot.getObject()
        if '/A' not in annot_obj:
            continue  # not a link
        # you have to wrap the key and value with a TextStringObject:
        key = PyPDF2.generic.TextStringObject("/URI")
        value = PyPDF2.generic.TextStringObject(new_link)
        annot_obj['/A'][key] = value
    pdf_writer.addPage(page)

with open('output.pdf', 'wb') as f:
    pdf_writer.write(f)
An equivalent one-liner for a given page index i and annotation index j would be:
pdf_reader.getPage(i)['/Annots'][j].getObject()['/A'][PyPDF2.generic.TextStringObject("/URI")] = PyPDF2.generic.TextStringObject(new_link)

Related

Extract URLs from PDF - text doesn't match URL

I'm using the following code to extract URLs from a PDF. It works fine to extract the anchor, but does not work when the anchor text is different from the URL behind it.
For example: 'www.page.com/A' is used as a short URL in the text, but the actual URL behind it is a longer (full) version.
The code I'm using is:
import urllib.request

import PyPDF2

urllib.request.urlretrieve(url, "remoteFile")
pdfFile = PyPDF2.PdfFileReader("remoteFile", strict=False)
key = "/Annots"
uri = "/URI"
ank = "/A"
mylist = []
for page_no in range(pdfFile.numPages):
    page = pdfFile.getPage(page_no)
    text = page.extractText()
    pageObject = page.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            try:
                u = a.getObject()
                if uri in u[ank].keys():
                    mylist.append(u[ank][uri])
                    print(u[ank][uri])
            except KeyError:
                pass
As I said, it works OK if the anchor and the link are the same. If the link is different, it saves the anchor. Ideally I would save both (or just the link).
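One way to get both the visible anchor text and the real target is to read the page text inside each link's rectangle. A sketch with PyMuPDF (fitz) rather than PyPDF2, assuming a recent version with the get_links()/get_textbox() API:
import fitz  # PyMuPDF

doc = fitz.open("remoteFile")
for page in doc:
    for link in page.get_links():           # one dict per link annotation
        if link.get("uri"):                 # keep external URI links only
            anchor = page.get_textbox(link["from"])  # text under the link rectangle
            print(repr(anchor.strip()), "->", link["uri"])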

Is it possible to capture specific parts of a PDF text with AWS Textract?

I need to extract text from a PDF, but I don't want the entire PDF to be parsed. I wonder if it's possible to get only specific parts of the parsed PDF. For example, I have a PDF with information about address, city and country. I don't want everything returned, just the Address field, not the other information.
Code that returns the text to me:
from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import get_lines_string
response = call_textract(input_document="s3://my-bucket/myfile.pdf")
print(get_lines_string(response))
Try this method (it doesn't use AWS Textract, but works as well):
import PyPDF2

def extract_text(filename, page_number):
    # Returns the content of a given page
    pdf_file_object = open(filename, 'rb')
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_object)
    # page_number - 1 below because in Python, page 1 is considered page 0
    page_object = pdf_reader.getPage(page_number - 1)
    text = page_object.extractText()
    pdf_file_object.close()
    return text
This function extracts the text from one single PDF page.
If you haven't got PyPDF2 yet, install it through the command line with 'pip install PyPDF2'.
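If you do want the selection to happen on the Textract side, here is a hedged sketch using its FORMS feature. I'm assuming the address is laid out as a key-value pair Textract can detect, and that amazon-textract-response-parser (trp) is installed alongside amazon-textract-caller:
from textractcaller.t_call import call_textract, Textract_Features
from trp import Document  # amazon-textract-response-parser

response = call_textract(input_document="s3://my-bucket/myfile.pdf",
                         features=[Textract_Features.FORMS])
doc = Document(response)
for page in doc.pages:
    for field in page.form.fields:
        # keep only the field(s) you care about, e.g. "Address"
        if field.key and "address" in field.key.text.lower():
            print(field.key.text, "->", field.value.text if field.value else "")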

Extracting comments/annotations from PDF sequentially - Python

I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested:
One using PyPDF2:
import pandas as pd
import PyPDF2

src = 'xxxx.pdf'
input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
df_comments = pd.DataFrame()
for i in range(nPages):
    annotation = []
    page = []
    page0 = input1.getPage(i)
    try:
        for annot in page0['/Annots']:
            annotation.append(annot.getObject())
        page = [i + 1] * len(annotation)
        page = pd.DataFrame(page)
        annotation = pd.DataFrame(annotation)
        df_temp = pd.concat([page, annotation], axis=1)
        df_comments = pd.concat([df_comments, df_temp], ignore_index=True)
    except KeyError:
        # there are no annotations on this page
        pass
and the other using fitz:
import fitz

doc = fitz.open(src)
for i in range(doc.pageCount):
    page = doc[i]
    for annot in page.annots():
        print(annot.info)
The comments are getting extracted; however, when I check the PDF I see that the comments are not being extracted sequentially. I have tried to check other parameters like creation date and modification date, but that is not helping me.
Is there a way I can extract them serially, as they appear in the PDF? Or can I also extract the text from the PDF against which each comment has been tagged?
I'm the current maintainer of PyPDF2.
The annotations are currently extracted in the order they appear in the annotations dictionary.
If you have a sensible way to sort them, feel free to open a feature request in the PyPDF2 issue tracker on github.
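Until such a feature exists, one workaround (my assumption of what "sequential" means here: top-to-bottom, left-to-right reading order) is to sort each page's annotations by their /Rect before collecting them:
import PyPDF2

reader = PyPDF2.PdfFileReader("xxxx.pdf")
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    if '/Annots' not in page:
        continue
    annots = [a.getObject() for a in page['/Annots']]
    # /Rect is [x0, y0, x1, y1] with the origin at the bottom-left of the page,
    # so sort by descending top edge, then by ascending left edge
    annots.sort(key=lambda a: (-float(a['/Rect'][3]), float(a['/Rect'][0])))
    for a in annots:
        print(i + 1, a.get('/Contents'))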

Python Blog RSS Feed Scraping BeautifulSoup Output to .txt Files

Apologies in advance for the long block of code following. I'm new to BeautifulSoup, but found there were some useful tutorials using it to scrape RSS feeds for blogs. Full disclosure: this is code adapted from this video tutorial which has been immensely helpful in getting this off the ground: http://www.youtube.com/watch?v=Ap_DlSrT-iE.
Here's my problem: the video does a great job of showing how to print the relevant content to the console. I need to write out each article's text to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've tried to flag it in a comment for people to see quickly; it's the last comment beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.
Currently, the program takes the text from the last article read in and writes it out to the number of .txt files indicated by the variable listIterator. So, in this case, I believe 20 .txt files get written out, but they all contain the text of the last article that was looped over. What I want the program to do is loop over each article and print each article's text to a separate .txt file. Sorry for the verbosity, but any insight would be really appreciated.
from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

# On RSS Feed site, find tags for title of articles and
# tags for article links to be downloaded.
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))

for i in listIterator:
    # Print each title to console to ensure program is working.
    print findPatTitle[i]

    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()

    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    # Define article variable that will contain all the content between the
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]

    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)

    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')

    # Create empty string to eventually convert items in paragList to string to
    # be written to .txt files.
    para_string = ''

    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)

    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i]) + '.txt', 'w')
        ofile.write(para_string)
        ofile.close()
The reason it seems that only the last article is written down is that all the articles are written to the same 20 separate files over and over again. Let's have a look at the following:
for i in paragList:
    para_string = para_string + str(i)

newlist = range(len(findPatTitle))
for i in newlist:
    ofile = open(str(listIterator[i]) + '.txt', 'w')
    ofile.write(para_string)
    ofile.close()
You are writing para_string over and over again to the same 20 files on every iteration. What you need to do instead is append each para_string to a separate list, say paraStringList, and then write its contents out to separate files, like so:
for i, var in enumerate(paraStringList):  # enumerate yields (index, item) pairs
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
Note that this loop needs to sit outside of your main loop, i.e. for i in listIterator: (...). This is a working version of the program:
from urllib import urlopen
from bs4 import BeautifulSoup
import re

webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = []
listIterator[:] = range(len(findPatTitle))

paraStringList = []

for i in listIterator:
    print findPatTitle[i]

    articlePage = urlopen(findPatLink[i]).read()

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    article = articlePage[divBegin:divEnd]

    soup = BeautifulSoup(article)
    paragList = soup.findAll('p')

    para_string = ''
    for i in paragList:
        para_string += str(i)

    paraStringList.append(para_string)

for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)

Merge Existing PDF into new ReportLab PDF via flowables

I have a reportlab SimpleDocTemplate and am returning it as a dynamic PDF. I am generating its content based on some Django model metadata. Here's my template setup:
buff = StringIO()
doc = SimpleDocTemplate(buff, pagesize=letter,
                        rightMargin=72, leftMargin=72,
                        topMargin=72, bottomMargin=18)
Story = []
I can easily add textual metadata from the Entry model into the Story list to be built later:
ptext = '<font size=20>%s</font>' % entry.title.title()
paragraph = Paragraph(ptext, custom_styles["Custom"])
Story.append(paragraph)
And then generate the PDF to be returned in the response by calling build on the SimpleDocTemplate:
doc.build(Story, onFirstPage=entry_page_template, onLaterPages=entry_page_template)
pdf = buff.getvalue()
resp = HttpResponse(mimetype='application/x-download')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
resp.write(pdf)
return resp
One metadata field on the model is a file attachment. When those file attachments are PDFs, I'd like to merge them into the Story that I am generating, i.e. as a reportlab "flowable" type.
I'm attempting to do so using pdfrw, but haven't had any luck. Ideally I'd love to just call:
from pdfrw import PdfReader

pdf = PdfReader(entry.document.file.path)
Story.append(pdf)
and append the pdf to the existing Story list to be included in the generation of the final document, as noted above.
Anyone have any ideas? I tried something similar using pagexobj to create the pdf, trying to follow this example:
http://code.google.com/p/pdfrw/source/browse/trunk/examples/rl1/subset.py
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
pdf = pagexobj(PdfReader(entry.document.file.path))
But didn't have any luck either. Can someone explain to me the best way to merge an existing PDF file into a reportlab flowable? I'm no good with this stuff and have been banging my head on pdf-generation for days now. :) Any direction greatly appreciated!
I just had a similar task in a project. I used reportlab (open source version) to generate pdf files and pyPDF to facilitate the merge. My requirements were slightly different in that I just needed one page from each attachment, but I'm sure this is probably close enough for you to get the general idea.
from datetime import datetime

from django.conf import settings
from pyPdf import PdfFileReader, PdfFileWriter

def create_merged_pdf(user):
    basepath = settings.MEDIA_ROOT + "/"
    # following block calls the function that uses reportlab to generate a pdf
    coversheet_path = basepath + "%s_%s_cover_%s.pdf" % (user.first_name, user.last_name, datetime.now().strftime("%f"))
    create_cover_sheet(coversheet_path, user, user.performancereview_set.all())

    # now use the cover sheet and all of the performance reviews to create a merged pdf
    merged_path = basepath + "%s_%s_merged_%s.pdf" % (user.first_name, user.last_name, datetime.now().strftime("%f"))

    # for merged file result
    output = PdfFileWriter()

    # for each pdf file to add, open it in a PdfFileReader object and add its page to the output
    cover_pdf = PdfFileReader(file(coversheet_path, "rb"))
    output.addPage(cover_pdf.getPage(0))

    # iterate through attached files and merge. I only needed the first page, YMMV
    for review in user.performancereview_set.all():
        review_pdf = PdfFileReader(file(review.pdf_file.file.name, "rb"))
        output.addPage(review_pdf.getPage(0))  # only first page of attachment

    # write out the merged file
    outputStream = file(merged_path, "wb")
    output.write(outputStream)
    outputStream.close()
I used the following class to solve my issue. It inserts the PDFs as vector PDF images.
It works great because I needed a table of contents; the flowable object allowed the built-in TOC functionality to work like a charm.
Is there a matplotlib flowable for ReportLab?
Note: If you have multiple pages in the file, you have to modify the class slightly. The sample class is designed to just read the first page of the PDF.
I know the question is a bit old but I'd like to provide a new solution using the latest PyPDF2.
You now have access to the PdfFileMerger, which can do exactly what you want: append PDFs to an existing file. You can even merge them at different positions and choose a subset or all of the pages!
The official docs are here: https://pythonhosted.org/PyPDF2/PdfFileMerger.html
An example from the code in your question:
import tempfile
import PyPDF2
from django.core.files import File
# Using a temporary file rather than a buffer in memory is probably better
temp_base = tempfile.TemporaryFile()
temp_final = tempfile.TemporaryFile()
# Create document, add what you want to the story, then build
doc = SimpleDocTemplate(temp_base, pagesize=letter, ...)
...
doc.build(...)
# Now, this is the fancy part. Create merger, add extra pages and save
merger = PyPDF2.PdfFileMerger()
merger.append(temp_base)
# Add any extra document, you can choose a subset of pages and add bookmarks
merger.append(entry.document.file, bookmark='Attachment')
merger.write(temp_final)
# Write the final file in the HTTP response
django_file = File(temp_final)
resp = HttpResponse(django_file, content_type='application/pdf')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
if django_file.size is not None:
    resp['Content-Length'] = django_file.size
return resp
Use this custom flowable:
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
from reportlab.platypus import Flowable, PageBreak

class PDF_Flowable(Flowable):
    #----------------------------------------------------------------------
    def __init__(self, P, page_no):
        Flowable.__init__(self)
        self.P = P
        self.page_no = page_no
    #----------------------------------------------------------------------
    def draw(self):
        """
        Draw the imported PDF page on this flowable's canvas.
        """
        canv = self.canv
        pages = self.P
        page_no = self.page_no
        canv.saveState()
        canv.translate(0, 0)  # x/y offset of the drawn page; adjust as needed
        canv.doForm(makerl(canv, pages[page_no]))
        canv.restoreState()
and then, after opening the existing PDF:
pages = PdfReader(BASE_DIR + "/out3.pdf").pages
pages = [pagexobj(x) for x in pages]
for i in range(0, len(pages)):
    F = PDF_Flowable(pages, i)
    elements.append(F)
    elements.append(PageBreak())
This adds one custom flowable per page (plus a page break) to elements[].
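For completeness, a sketch of how those pieces feed a document build (the output filename here is an assumption):
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate

elements = []
# ... append PDF_Flowable instances (and your other flowables) as shown above ...
doc = SimpleDocTemplate("with_attachments.pdf", pagesize=letter)
doc.build(elements)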
