Loop through PDF and save all pages into a DataFrame

I am fairly new to Python and trying the PyPDF2 package for the first time. I simply want to loop through my PDF document (66 pages) and extract all the text into a DataFrame.
I have followed a blog post (http://echrislynch.com/2018/07/13/turning-a-pdf-into-a-pandas-dataframe/) and ended up with the following code. Unlike the blog post, I am not interested in any data cleansing or transformation at this point; I simply want the pages stored in a DataFrame:
import PyPDF2
import os
import pandas as pd
# Open PDF as an object and read it into PyPDF2
pdfFileObj = open('MyReport.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# loop through pages
pages=list()
for i in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(i)
    page = pageObj.extractText()
    page = page[0:]
    pages.append(page[0:])
    for i in range(1,len(page)):
        pages=[page[2:] for page in pages]
    # Create dataframe
    page_df = pd.DataFrame([page])
    # Concat with dbn_df
    MyNewReport= pd.DataFrame([page])
    page_df = page_df.iloc[0:]
    MyNewReport= pd.concat([MyNewReport,page_df], axis=0,
                           ignore_index=True, sort=False)
I am encountering an error :
File "<ipython-input-78-729b84e346f9>", line 16, in <module>
page[i] = page[i][2:]
TypeError: 'str' object does not support item assignment
So I know the issue lies with my loop, although looking at the variable explorer, my dataframe contains the text from the last page of my pdf...so it is looping through something!
Can anyone help or recommend some further reading to understand the error and resolution?
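The TypeError itself comes from Python strings being immutable: extractText() returns a single str, so page[i] = ... tries to overwrite one character in place. Below is a minimal sketch of one way around it, assuming the text of each page has already been extracted. The helper names trim_lines and pages_to_rows are hypothetical and not part of PyPDF2 or pandas, and the sample strings are stand-ins for real page text.

```python
def trim_lines(page_text, n=2):
    """Drop the first n characters of every line of one page's text.

    Splitting first gives a list, which (unlike a str) supports item assignment.
    """
    lines = page_text.split("\n")
    for i in range(len(lines)):
        lines[i] = lines[i][n:]
    return "\n".join(lines)


def pages_to_rows(page_texts):
    """Build one record per page, numbering pages from 1."""
    return [{"page": number, "text": text}
            for number, text in enumerate(page_texts, start=1)]


# Stand-in strings for what extractText() would return for two pages
sample_pages = ["1 first line\n2 second line", "3 only line"]
rows = pages_to_rows(trim_lines(text) for text in sample_pages)
print(rows)
```

Passing rows to pd.DataFrame(rows) would then give one row per page. Note also that range(1, pdfReader.numPages) starts at the second page, because page indices begin at 0.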

Related

Extracting comments/annotations from PDF sequentially - Python

I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested:
One using PyPDF2:
import PyPDF2
import pandas as pd
src = 'xxxx.pdf'
input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
df_comments = pd.DataFrame()
for i in range(nPages) :
    annotation = []
    page = []
    page0 = input1.getPage(i)
    try :
        for annot in page0['/Annots'] :
            annotation.append(annot.getObject())
        page = [i+1] * len(annotation)
        page = pd.DataFrame(page)
        annotation = pd.DataFrame(annotation)
        df_temp = pd.concat([page, annotation], axis=1)
        df_comments = pd.concat([df_comments, df_temp], ignore_index=True)
    except :
        # there are no annotations on this page
        pass
and the other using fitz:
import fitz
doc = fitz.open(src)
for i in range(doc.pageCount):
    page = doc[i]
    for annot in page.annots():
        print(annot.info)
The comments are getting extracted, however when I check the PDF I see that the comments are not being extracted sequentially. I have tried to check other parameters like creation date, modification date but that is not helping me.
Is there a way to extract them serially, in the order they appear in the PDF? Or can I also extract the text in the PDF that each comment is attached to?

I'm the current maintainer of PyPDF2.
The annotations are currently extracted in the order they appear in the annotations dictionary.
If you have a sensible way to sort them, feel free to open a feature request in the PyPDF2 issue tracker on GitHub.
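One possible ordering, sketched under assumptions rather than taken from PyPDF2, is to sort by page, then top to bottom and left to right using each annotation's /Rect. The sketch assumes every annotation has already been reduced to a plain dict with a page number and a rect list [x0, y0, x1, y1] in PDF user space (origin at the bottom-left, so a larger y is higher on the page). reading_order and the sample records are hypothetical.

```python
def reading_order(annotations):
    """Sort annotation records by page, then top-to-bottom, then left-to-right."""
    def key(annot):
        x0, y0, x1, y1 = (float(v) for v in annot["rect"])
        top = max(y0, y1)    # upper edge of the rectangle
        left = min(x0, x1)   # left edge of the rectangle
        return (annot["page"], -top, left)
    return sorted(annotations, key=key)


# Hypothetical records, e.g. built from annot.getObject()['/Rect'] for each page
comments = [
    {"page": 1, "rect": [50, 100, 150, 120], "text": "lower"},
    {"page": 2, "rect": [50, 700, 150, 720], "text": "next page"},
    {"page": 1, "rect": [50, 700, 150, 720], "text": "upper"},
]
for annot in reading_order(comments):
    print(annot["page"], annot["text"])
```

This is only a heuristic: multi-column layouts would need a more careful key, and a library that reports rectangles with the origin at the top-left would need the sign on top flipped.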

PyPDF4 not reading certain characters

I'm compiling some data for a project and I've been using PyPDF4 to read this data from its source PDF file, but I've been having trouble with certain characters not showing up correctly. Here's my code:
from PyPDF4 import PdfFileReader
import pandas as pd
import numpy as np
import os
import xml.etree.cElementTree as ET
# File name
pdf_path = "PV-9-2020-10-23-RCV_FR.pdf"
# Results storage
results = {}
# Start page
page = 5
# Lambda to assign votes
serify = lambda voters, vote: pd.Series({voter.strip(): vote for voter in voters})
with open(pdf_path, 'rb') as f:
    # Get PDF reader for PDF file f
    pdf = PdfFileReader(f)
    while page < pdf.numPages:
        # Get text of page in PDF
        text = pdf.getPage(page).extractText()
        proposal = text.split("\n+\n")[0].split("\n")[3]
        # Collect all relevant pages
        while text.find("\n0\n") == -1:
            page += 1
            text += "\n".join(pdf.getPage(page).extractText().split("\n")[3:])
        # Remove corrections
        text, corrections = text.split("CORRECCIONES")
        # Grab relevant text !!! This is where the missing characters show up.
        text = "\n, ".join([n[:n.rindex("\n")] for n in text.split("\n:")])
        for_list = "".join(text[text.index("\n+\n")+3:text.index("\n-\n")].split("\n")[:-1]).split(", ")
        nay_list = "".join(text[text.index("\n-\n")+3:text.index("\n0\n")].split("\n")[:-1]).split(", ")
        abs_list = "".join(text[text.index("\n0\n")+3:].split("\n")[:-1]).split(", ")
        # Store data in results
        results.update({proposal: dict(pd.concat([serify(for_list, 1), serify(nay_list, -1), serify(abs_list, 0)]).items())})
        page += 1
        print(page)
results = pd.DataFrame(results)
The characters I'm having difficulty with don't show up in the text extracted using extractText. Ždanoka, for instance, becomes "danoka, and Štefanec becomes -tefanc. Most of the affected characters seem to be Eastern European, which makes me think I need one of the Latin decoders.
I've looked through some of PyPDF4's capabilities, it seems like it has plenty of relevant codecs, including latin1. I've attempted decoding the file using different functions from the PyPDF4.generic.codecs module, and either the characters don't show still, or the code throws an error at an unrecognised byte.
I haven't yet attempted using multiple codecs on different bytes from the same file, that seems like it would take some time. Am I missing something in my code that can easily fix this? Or is it more likely I will have to tailor fit a solution using PyPDF4's functions?

Use pypdf instead of PyPDF2/PyPDF3/PyPDF4. You will need to apply the migrations (renamed classes and methods).
pypdf received a lot of updates in December 2022, especially to text extraction.
To give you a minimal full example for text extraction:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())

How to extract Table from PDF in Python? [duplicate]

I have thousands of PDF files, composed only of tables, with this structure:
[image: pdf file]
However, despite being fairly structured, I cannot read the tables without losing the structure.
I tried PyPDF2, but the data comes completely messed up.
import PyPDF2
pdfFileObj = open('pdf_file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
print(pageObj.extractText().split('\n')[0])
print(pageObj.extractText().split('/')[0])
I also tried Tabula, but it only reads the header (and not the content of the tables)
from tabula import read_pdf
pdfFile1 = read_pdf('pdf_file.pdf', output_format = 'json') #Option 1: reads all the headers
pdfFile2 = read_pdf('pdf_file.pdf', multiple_tables = True) #Option 2: reads only the first header and few lines of content
Any thoughts?

After struggling a little bit, I found a way.
For each page of the file, it was necessary to define into tabula's read_pdf function the area of the table and the limits of the columns.
Here is the working code:
import pypdf
from tabula import read_pdf
# Get the number of pages in the file
pdf_reader = pypdf.PdfReader(pdf_file)
n_pages = len(pdf_reader.pages)
# For each page the table can be read with the following code
table_pdf = read_pdf(
    pdf_file,
    guess=False,
    pages=1,
    stream=True,
    encoding="utf-8",
    area=(96, 24, 558, 750),
    columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),
)
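To cover the whole file, the same call can be repeated for every page. Here is a minimal sketch, assuming the same area and columns apply to each page (in practice they may need adjusting per page) and that the reader returns a list of tables, as recent tabula-py versions do. read_all_pages and fake_read_pdf are hypothetical helpers; the reader function is passed in so that tabula's read_pdf can be supplied by the caller.

```python
def read_all_pages(pdf_file, n_pages, read_pdf, area, columns):
    """Call read_pdf once per page and collect the returned tables.

    read_pdf is expected to behave like tabula.read_pdf and return
    a list of tables for the requested page.
    """
    tables = []
    for page_number in range(1, n_pages + 1):  # page numbers start at 1
        tables.extend(
            read_pdf(
                pdf_file,
                guess=False,
                pages=page_number,
                stream=True,
                encoding="utf-8",
                area=area,
                columns=columns,
            )
        )
    return tables


# Stand-in reader so the sketch runs without tabula or Java installed;
# in real use, pass tabula's read_pdf instead.
def fake_read_pdf(path, pages, **options):
    return [f"table from page {pages} of {path}"]


print(read_all_pages("report.pdf", 3, fake_read_pdf,
                     area=(96, 24, 558, 750), columns=(24, 127, 220)))
```

With the real library this would be called as read_all_pages(pdf_file, n_pages, read_pdf, area=..., columns=...), using the values found for the document at hand.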

Try this: pip install tabula-py
from tabula import read_pdf
df = read_pdf("file_name.pdf")

Use the tabula library (the package on PyPI is tabula-py):
pip install tabula-py
Then extract the tables:
import tabula
# this reads page 63
dfs = tabula.read_pdf(url, pages=63, stream=True)
# if you want to read all pages
dfs = tabula.read_pdf(url, pages="all")
dfs[1]
By the way, I also tried reading PDF files another way, and it worked better than tabula. I will post it soon.

@fmarques
You could also try a new Python package (SLICEmyPDF), developed by StatCan specifically for extracting tabular data from PDFs:
https://github.com/StatCan/SLICEmyPDF
In my experience, SLICEmyPDF outperforms other free Python or R packages.
The catch is that it requires installing a few extra pieces of free software. The installation instructions can be found at
https://dataworldofredhairedgirl.blogspot.com/2022/04/how-to-install-statcan-slicemypdf-on.html

how do i change hyperlinks inside pdf using python?

How do I change the hyperlinks in a PDF using Python? I am currently using PyPDF2 to open the file and loop through the pages. How do I actually scan for hyperlinks and then change them?

So I couldn't get what you want using the pyPDF2 library.
I did however get something working with another library: pdfrw. This installed fine for me using pip in Python 3.6:
pip install pdfrw
Note: for the following I have been using this example pdf I found online which contains multiple links. Your mileage may vary with this.
import pdfrw
pdf = pdfrw.PdfReader("pdf.pdf") # Load the pdf
new_pdf = pdfrw.PdfWriter() # Create an empty pdf
for page in pdf.pages: # Go through the pages
    # Links are in Annots, but some pages don't have links so Annots returns None
    for annot in page.Annots or []:
        old_url = annot.A.URI
        # >Here you put logic for replacing the URLs<
        # Use the PdfString object to do the encoding for us
        # Note the brackets around the URL here
        new_url = pdfrw.objects.pdfstring.PdfString("(http://www.google.com)")
        # Override the URL with ours
        annot.A.URI = new_url
    new_pdf.addpage(page)
new_pdf.write("new.pdf")
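The replacement logic left open in the loop above could look something like the following sketch. It assumes the URI arrives as a simple PDF literal string wrapped in brackets, such as (http://old.example.com/docs), with no escaped characters inside. swap_host and the host names are hypothetical and not part of pdfrw.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_host(pdf_uri, old_host, new_host):
    """Replace old_host with new_host in a bracket-wrapped URI.

    URIs pointing at any other host are returned unchanged.
    """
    wrapped = pdf_uri.startswith("(") and pdf_uri.endswith(")")
    url = pdf_uri[1:-1] if wrapped else pdf_uri
    parts = urlsplit(url)
    if parts.netloc != old_host:
        return pdf_uri  # leave unrelated links untouched
    # Re-wrap in brackets, which is the form the PdfString call above expects
    return "(" + urlunsplit(parts._replace(netloc=new_host)) + ")"


print(swap_host("(http://old.example.com/docs?id=1)",
                "old.example.com", "new.example.com"))
```

Inside the loop, the result of swap_host(old_url, ...) could then be passed to PdfString in place of the hard-coded Google URL.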

I managed to get it working with PyPDF2.
If you just want to remove all annotations for a page, you just have to do:
if '/Annots' in page: del page['/Annots']
Else, here is how you change each link:
import PyPDF2
new_link = "https://www.youtube.com/watch?v=dQw4w9WgXcQ" # great video by the way
pdf_reader = PyPDF2.PdfFileReader("input.pdf")
pdf_writer = PyPDF2.PdfFileWriter()
for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    # Pages without annotations have no links, but are still written out below
    if '/Annots' in page:
        for annot in page['/Annots']:
            annot_obj = annot.getObject()
            if '/A' not in annot_obj: continue # not a link
            # you have to wrap the key and value with a TextStringObject:
            key = PyPDF2.generic.TextStringObject("/URI")
            value = PyPDF2.generic.TextStringObject(new_link)
            annot_obj['/A'][key] = value
    pdf_writer.addPage(page)
with open('output.pdf', 'wb') as f:
    pdf_writer.write(f)
An equivalent one-liner for a given page index i and annotation index j would be:
pdf_reader.getPage(i)['/Annots'][j].getObject()['/A'][PyPDF2.generic.TextStringObject("/URI")] = PyPDF2.generic.TextStringObject(new_link)

Merge Existing PDF into new ReportLab PDF via flowables

I have a ReportLab SimpleDocTemplate and am returning it as a dynamic PDF. I am generating its content based on some Django model metadata. Here's my template setup:
buff = StringIO()
doc = SimpleDocTemplate(buff, pagesize=letter,
                        rightMargin=72,leftMargin=72,
                        topMargin=72,bottomMargin=18)
Story = []
I can easily add textual metadata from the Entry model into the Story list to be built later:
ptext = '<font size=20>%s</font>' % entry.title.title()
paragraph = Paragraph(ptext, custom_styles["Custom"])
Story.append(paragraph)
And then generate the PDF to be returned in the response by calling build on the SimpleDocTemplate:
doc.build(Story, onFirstPage=entry_page_template, onLaterPages=entry_page_template)
pdf = buff.getvalue()
resp = HttpResponse(mimetype='application/x-download')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
resp.write(pdf)
return resp
One metadata field on the model is a file attachment. When those file attachments are PDFs, I'd like to merge them into the Story that I am generating, i.e. as a PDF of ReportLab "flowable" type.
I'm attempting to do so using pdfrw, but haven't had any luck. Ideally I'd love to just call:
from pdfrw import PdfReader
pdf = PdfReader(entry.document.file.path)
Story.append(pdf)
and append the pdf to the existing Story list to be included in the generation of the final document, as noted above.
Anyone have any ideas? I tried something similar using pagexobj to create the pdf, trying to follow this example:
http://code.google.com/p/pdfrw/source/browse/trunk/examples/rl1/subset.py
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
pdf = pagexobj(PdfReader(entry.document.file.path))
But didn't have any luck either. Can someone explain to me the best way to merge an existing PDF file into a reportlab flowable? I'm no good with this stuff and have been banging my head on pdf-generation for days now. :) Any direction greatly appreciated!

I just had a similar task in a project. I used reportlab (open source version) to generate pdf files and pyPDF to facilitate the merge. My requirements were slightly different in that I just needed one page from each attachment, but I'm sure this is probably close enough for you to get the general idea.
from datetime import datetime
from django.conf import settings
from pyPdf import PdfFileReader, PdfFileWriter
def create_merged_pdf(user):
    basepath = settings.MEDIA_ROOT + "/"
    # following block calls the function that uses reportlab to generate a pdf
    coversheet_path = basepath + "%s_%s_cover_%s.pdf" %(user.first_name, user.last_name, datetime.now().strftime("%f"))
    create_cover_sheet(coversheet_path, user, user.performancereview_set.all())
    # now use the cover sheet and all of the performance reviews to create a merged pdf
    merged_path = basepath + "%s_%s_merged_%s.pdf" %(user.first_name, user.last_name, datetime.now().strftime("%f"))
    # for merged file result
    output = PdfFileWriter()
    # for each pdf file to add, open in a PdfFileReader object and add page to output
    # (file() is the Python 2 built-in; pyPdf dates from the Python 2 era)
    cover_pdf = PdfFileReader(file( coversheet_path, "rb"))
    output.addPage(cover_pdf.getPage(0))
    # iterate through attached files and merge. I only needed the first page, YMMV
    for review in user.performancereview_set.all():
        review_pdf = PdfFileReader(file(review.pdf_file.file.name, "rb"))
        output.addPage(review_pdf.getPage(0)) # only first page of attachment
    # write out the merged file
    outputStream = file(merged_path, "wb")
    output.write(outputStream)
    outputStream.close()

I used the following class to solve my issue. It inserts the PDFs as vector PDF images.
It works great because I needed to have a table of contents. The flowable object allowed the built in TOC functionality to work like a charm.
See the question "Is there a matplotlib flowable for ReportLab?"
Note: If you have multiple pages in the file, you have to modify the class slightly. The sample class is designed to just read the first page of the PDF.

I know the question is a bit old but I'd like to provide a new solution using the latest PyPDF2.
You now have access to the PdfFileMerger, which can do exactly what you want, append PDFs to an existing file. You can even merge them in different positions and choose a subset or all the pages!
The official docs are here: https://pythonhosted.org/PyPDF2/PdfFileMerger.html
An example from the code in your question:
import tempfile
import PyPDF2
from django.core.files import File
# Using a temporary file rather than a buffer in memory is probably better
temp_base = tempfile.TemporaryFile()
temp_final = tempfile.TemporaryFile()
# Create document, add what you want to the story, then build
doc = SimpleDocTemplate(temp_base, pagesize=letter, ...)
...
doc.build(...)
# Now, this is the fancy part. Create merger, add extra pages and save
merger = PyPDF2.PdfFileMerger()
merger.append(temp_base)
# Add any extra document, you can choose a subset of pages and add bookmarks
merger.append(entry.document.file, bookmark='Attachment')
merger.write(temp_final)
# Write the final file in the HTTP response
django_file = File(temp_final)
resp = HttpResponse(django_file, content_type='application/pdf')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
if django_file.size is not None:
    resp['Content-Length'] = django_file.size
return resp

Use this custom flowable:
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
from reportlab.platypus import Flowable, PageBreak
class PDF_Flowable(Flowable):
    #----------------------------------------------------------------------
    def __init__(self,P,page_no):
        Flowable.__init__(self)
        self.P = P
        self.page_no = page_no
    #----------------------------------------------------------------------
    def draw(self):
        """
        draw the PDF page
        """
        canv = self.canv
        pages = self.P
        page_no = self.page_no
        x, y = 0, 0  # offset was not defined in the original snippet; (0, 0) is an assumption
        canv.saveState()  # paired with restoreState() below
        canv.translate(x, y)
        canv.doForm(makerl(canv, pages[page_no]))
        canv.restoreState()
Then, after opening the existing PDF, add the custom flowable to elements[] like this:
pages = PdfReader(BASE_DIR + "/out3.pdf").pages
pages = [pagexobj(x) for x in pages]
for i in range(0, len(pages)):
    F = PDF_Flowable(pages,i)
    elements.append(F)
    elements.append(PageBreak())
