PyPDF4 not reading certain characters

PyPDF4 not reading certain characters - python

I'm compiling some data for a project and I've been using PyPDF4 to read this data from it's source PDF file, but I've been having trouble with certain characters not showing up correctly. Here's my code:
from PyPDF4 import PdfFileReader
import pandas as pd
import numpy as np
import os
import xml.etree.cElementTree as ET
# File name
pdf_path = "PV-9-2020-10-23-RCV_FR.pdf"
# Results storage
results = {}
# Start page
page = 5
# Lambda to assign votes
serify = lambda voters, vote: pd.Series({voter.strip(): vote for voter in voters})
with open(pdf_path, 'rb') as f:
# Get PDF reader for PDF file f
pdf = PdfFileReader(f)
while page < pdf.numPages:
# Get text of page in PDF
text = pdf.getPage(page).extractText()
proposal = text.split("\n+\n")[0].split("\n")[3]
# Collect all pages relevant pages
while text.find("\n0\n") is -1:
page += 1
text += "\n".join(pdf.getPage(page).extractText().split("\n")[3:])
# Remove corrections
text, corrections = text.split("CORRECCIONES")
# Grab relevant text !!! This is where the missing characters show up.
text = "\n, ".join([n[:n.rindex("\n")] for n in text.split("\n:")])
for_list = "".join(text[text.index("\n+\n")+3:text.index("\n-\n")].split("\n")[:-1]).split(", ")
nay_list = "".join(text[text.index("\n-\n")+3:text.index("\n0\n")].split("\n")[:-1]).split(", ")
abs_list = "".join(text[text.index("\n0\n")+3:].split("\n")[:-1]).split(", ")
# Store data in results
results.update({proposal: dict(pd.concat([serify(for_list, 1), serify(nay_list, -1), serify(abs_list, 0)]).items())})
page += 1
print(page)
results = pd.DataFrame(results)
The characters I'm having difficulty don't show up in the text extracted using extractText. Ždanoka for instance becomes "danoka, Štefanec becomes -tefanc. It seems like most of the characters are Eastern European, which makes me think I need one of the latin decoders.
I've looked through some of PyPDF4's capabilities, it seems like it has plenty of relevant codecs, including latin1. I've attempted decoding the file using different functions from the PyPDF4.generic.codecs module, and either the characters don't show still, or the code throws an error at an unrecognised byte.
I haven't yet attempted using multiple codecs on different bytes from the same file, that seems like it would take some time. Am I missing something in my code that can easily fix this? Or is it more likely I will have to tailor fit a solution using PyPDF4's functions?

Use pypdf instead of PyPDF2/PyPDF3/PyPDF4. You will need to apply the migrations.
pypdf has received a lot of updates in December 2022. Especially the text extraction.
To give you a minimal full example for text extraction:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
for page in reader.pages:
print(page.extract_text())

Related

Is it possible to capture specific parts of a PDF text with AWS Textract?

I need to extract the text from the PDF, but I don't want the entire PDF to be parsed. I wonder if it's possible to get specific parts of the parsed PDF. For example, I have a PDF with information about: Address, city and country. I don't want everything returned, just the Address field, not the other information.
Code that returns the text to me:
from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import get_lines_string
response = call_textract(input_document="s3://my-bucket/myfile.pdf")
print(get_lines_string(response))

Try this method (it doesn't use AWS Textract, but works as well):
import PyPDF2
def extract_text(filename, page_number):
# Returns the content of a given page
pdf_file_object = open(filename, 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_file_object)
# page_number - 1 below because in python, page 1 is considered as page 0
page_object = pdf_reader.getPage(page_number - 1)
text = page_object.extractText()
pdf_file_object.close()
return text
This function extracts the text from one single PDF page.
If you haven't got PyPDF2 yet, install it through the command line with 'pip install PyPDF2'.

Decoding problem with fitz.Document in Python 3.7

I want to extract the text of a PDF and use some regular expressions to filter for information.
I am coding in Python 3.7.4 using fitz for parsing the pdf. The PDF is written in German. My code looks as follows:
doc = fitz.open(pdfpath)
pagecount = doc.pageCount
page = 0
content = ""
while (page < pagecount):
p = doc.loadPage(page)
page += 1
content = content + p.getText()
Printing the content, I realized that the first (and important) half of the document is decoded as a strange mix of Japanese (?) signs and others, like this: ｮ｡ｵｳｷ･ｩｴｵｮｧ＠ｭ.
I tried to solve it with different decodings (latin-1, iso-8859-1), encoding is definitely in utf-8.
content= content+p.getText().encode("utf-8").decode("utf-8")
I also have tried to get the text using minecart:
import minecart
file = open(pdfpath, 'rb')
document = minecart.Document(file)
for page in document.iter_pages():
for lettering in page.letterings :
print(lettering)
which results in the same problem.
Using textract, the first half is an empty string:
import textract
text = textract.process(pdfpath)
print(text.decode('utf-8'))
Same thing with PyPDF2:
import PyPDF2
pdfFileObj = open(pdfpath, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
for index in range(0, pdfReader.numPages) :
pageObj = pdfReader.getPage(index)
print(pageObj.extractText())
I don't understand the problem as it's looking like a normal PDF with normal text. Also some of the PDFs don't have this problem.

How to extract text from Pdf using Pypdf2 excluding the text content from Charts and Tables

I want to extract text content from PPT and PDF files in python.
While using PPTX works fine for extracting the text, using PyPDF2 extracts the text content from charts and tables as well from PDF when using extract_text() which I don't want.
I have tried different things but can't figure out a way to achieve this. Is there any way that this can be done? Pfb the code for the same.
import ntpath
import os
import glob
import PyPDF2
import pandas as pd from pptx import Presentation
df_header=pd.DataFrame(columns=['Document_Name', 'Document_Type', 'Page_No', 'Text', 'Report Name'])
df_header.to_csv('Downloads\\\\FinalSample.csv', mode='a', header=True)
for eachfile in glob.glob("D:\\CP US People-Centric Hub (19-SCP-3063)\\Reports\\/*\\\\/*"):
file1 = eachfile.split("\\")
report_name = file1[3]
if eachfile.endswith(".pptx"):
data=[]
prs = Presentation(eachfile)
for slide in prs.slides:
text_runs = ''
slide_num = prs.slides.index(slide) + 1
for shape in slide.shapes:
if not shape.has_text_frame:
continue
for paragraph in shape.text_frame.paragraphs:
text_runs = text_runs + ' ' + paragraph.text
data.append([ntpath.basename(eachfile), 'PPT', slide_num, text_runs,report_name])
df_ppt=pd.DataFrame(data)
df_ppt.to_csv('Downloads\\\\FinalSample.csv', mode='a', header=False)
elif eachfile.endswith(".pdf"):
data1=[]
pdfFileObj = open(eachfile, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
outlines = pdfReader.getOutlines()
for pageNum in range(pdfReader.numPages):
data1.append([ntpath.basename(eachfile), 'PDF', pageNum + 1,pdfReader.getPage(pageNum).extractText(),report_name])
df_pdf=pd.DataFrame(data1)
df_pdf.to_csv('Downloads\\\\FinalSample.csv', mode='a', header=False)
pdfFileObj.close()

No, sorry: extracting just the body text from a PDF and omitting figure titles, footnotes, headers, footers, page numbers etc. isn't possible in general. This is because "body text" isn't really a defined concept in the PDF format.
You could however dig into the library and add some heuristics targeting figure captions e.g. to discard blocks of text that follow a large gap without text, or are too short (but what about titles?), or perhaps where the font size is much smaller than the mean.

Python Blog RSS Feed Scraping BeautifulSoup Output to .txt Files

Apologies in advance for the long block of code following. I'm new to BeautifulSoup, but found there were some useful tutorials using it to scrape RSS feeds for blogs. Full disclosure: this is code adapted from this video tutorial which has been immensely helpful in getting this off the ground: http://www.youtube.com/watch?v=Ap_DlSrT-iE.
Here's my problem: the video does a great job of showing how to print the relevant content to the console. I need to write out each article's text to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies i the scope of the two for-loops near the end of the code (I've tried to comment this for people to see quickly--it's the last comment beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.
Currently what the program does is takes the text from the last article read in by the program and writes that out to the number of .txt files that are indicated in the variable listIterator. So, in this case I believe there are 20 .txt files that get written out, but they all contain the text of the last article that's looped over. What I want the program to do is loop over each article and print the text of each article out to a separate .txt file. Sorry for the verbosity, but any insight would be really appreciated.
from urllib import urlopen
from bs4 import BeautifulSoup
import re
# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()
# On RSS Feed site, find tags for title of articles and
# tags for article links to be downloaded.
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')
# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)
# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))
for i in listIterator:
# Print each title to console to ensure program is working.
print findPatTitle[i]
# Read in the linked-to article.
articlePage = urlopen(findPatLink[i]).read()
# Find the beginning and end of articles using tags listed below.
divBegin = articlePage.find("<div class='story-teaser'>")
divEnd = articlePage.find("<footer class='article-footer'>")
# Define article variable that will contain all the content between the
# beginning of the article to the end as indicated by variables above.
article = articlePage[divBegin:divEnd]
# Parse the page using BeautifulSoup
soup = BeautifulSoup(article)
# Compile list of all <p> tags for each article and store in paragList
paragList = soup.findAll('p')
# Create empty string to eventually convert items in paragList to string to
# be written to .txt files.
para_string = ''
# Here's where I'm lost and have some sort of scope issue with my for-loops.
for i in paragList:
para_string = para_string + str(i)
newlist = range(len(findPatTitle))
for i in newlist:
ofile = open(str(listIterator[i])+'.txt', 'w')
ofile.write(para_string)
ofile.close()

The reason why it seems that only the last article is written down, is because all the articles are writer to 20 separate files over and over again. Lets have a look at the following:
for i in paragList:
para_string = para_string + str(i)
newlist = range(len(findPatTitle))
for i in newlist:
ofile = open(str(listIterator[i])+'.txt', 'w')
ofile.write(para_string)
ofile.close()
You are writing parag_string over and over again to the same 20 files for each iteration. What you need to be doing is this, append all your parag_strings to a separate list, say paraStringList, and then write all its contents to separate files, like so:
for i, var in enumerate(paraStringList): # Enumerate creates a tuple
with open("{0}.txt".format(i), 'w') as writer:
writer.write(var)
Now that this needs to be outside of your main loop i.e. for i in listIterator:(...). This is a working version of the program:
from urllib import urlopen
from bs4 import BeautifulSoup
import re
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')
findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]
listIterator = []
listIterator[:] = range(len(findPatTitle))
paraStringList = []
for i in listIterator:
print findPatTitle[i]
articlePage = urlopen(findPatLink[i]).read()
divBegin = articlePage.find("<div class='story-teaser'>")
divEnd = articlePage.find("<footer class='article-footer'>")
article = articlePage[divBegin:divEnd]
soup = BeautifulSoup(article)
paragList = soup.findAll('p')
para_string = ''
for i in paragList:
para_string += str(i)
paraStringList.append(para_string)
for i, var in enumerate(paraStringList):
with open("{0}.txt".format(i), 'w') as writer:
writer.write(var)

Merge Existing PDF into new ReportLab PDF via flowables

I have a reportlab SimpleDocTemplate and returning it as a dynamic PDF. I am generating it's content based on some Django model metadata. Here's my template setup:
buff = StringIO()
doc = SimpleDocTemplate(buff, pagesize=letter,
rightMargin=72,leftMargin=72,
topMargin=72,bottomMargin=18)
Story = []
I can easily add textual metadata from the Entry model into the Story list to be built later:
ptext = '<font size=20>%s</font>' % entry.title.title()
paragraph = Paragraph(ptext, custom_styles["Custom"])
Story.append(paragraph)
And then generate the PDF to be returned in the response by calling build on the SimpleDocTemplate:
doc.build(Story, onFirstPage=entry_page_template, onLaterPages=entry_page_template)
pdf = buff.getvalue()
resp = HttpResponse(mimetype='application/x-download')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
resp.write(pdf)
return resp
One metadata field on the model is a file attachment. When those file attachments are PDFs, I'd like to merge them into the Story that I am generating; IE meaning a PDF of reportlab "flowable" type.
I'm attempting to do so using pdfrw, but haven't had any luck. Ideally I'd love to just call:
from pdfrw import PdfReader
pdf = pPdfReader(entry.document.file.path)
Story.append(pdf)
and append the pdf to the existing Story list to be included in the generation of the final document, as noted above.
Anyone have any ideas? I tried something similar using pagexobj to create the pdf, trying to follow this example:
http://code.google.com/p/pdfrw/source/browse/trunk/examples/rl1/subset.py
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
pdf = pagexobj(PdfReader(entry.document.file.path))
But didn't have any luck either. Can someone explain to me the best way to merge an existing PDF file into a reportlab flowable? I'm no good with this stuff and have been banging my head on pdf-generation for days now. :) Any direction greatly appreciated!

I just had a similar task in a project. I used reportlab (open source version) to generate pdf files and pyPDF to facilitate the merge. My requirements were slightly different in that I just needed one page from each attachment, but I'm sure this is probably close enough for you to get the general idea.
from pyPdf import PdfFileReader, PdfFileWriter
def create_merged_pdf(user):
basepath = settings.MEDIA_ROOT + "/"
# following block calls the function that uses reportlab to generate a pdf
coversheet_path = basepath + "%s_%s_cover_%s.pdf" %(user.first_name, user.last_name, datetime.now().strftime("%f"))
create_cover_sheet(coversheet_path, user, user.performancereview_set.all())
# now user the cover sheet and all of the performance reviews to create a merged pdf
merged_path = basepath + "%s_%s_merged_%s.pdf" %(user.first_name, user.last_name, datetime.now().strftime("%f"))
# for merged file result
output = PdfFileWriter()
# for each pdf file to add, open in a PdfFileReader object and add page to output
cover_pdf = PdfFileReader(file( coversheet_path, "rb"))
output.addPage(cover_pdf.getPage(0))
# iterate through attached files and merge. I only needed the first page, YMMV
for review in user.performancereview_set.all():
review_pdf = PdfFileReader(file(review.pdf_file.file.name, "rb"))
output.addPage(review_pdf.getPage(0)) # only first page of attachment
# write out the merged file
outputStream = file(merged_path, "wb")
output.write(outputStream)
outputStream.close()

I used the following class to solve my issue. It inserts the PDFs as vector PDF images.
It works great because I needed to have a table of contents. The flowable object allowed the built in TOC functionality to work like a charm.
Is there a matplotlib flowable for ReportLab?
Note: If you have multiple pages in the file, you have to modify the class slightly. The sample class is designed to just read the first page of the PDF.

I know the question is a bit old but I'd like to provide a new solution using the latest PyPDF2.
You now have access to the PdfFileMerger, which can do exactly what you want, append PDFs to an existing file. You can even merge them in different positions and choose a subset or all the pages!
The official docs are here: https://pythonhosted.org/PyPDF2/PdfFileMerger.html
An example from the code in your question:
import tempfile
import PyPDF2
from django.core.files import File
# Using a temporary file rather than a buffer in memory is probably better
temp_base = tempfile.TemporaryFile()
temp_final = tempfile.TemporaryFile()
# Create document, add what you want to the story, then build
doc = SimpleDocTemplate(temp_base, pagesize=letter, ...)
...
doc.build(...)
# Now, this is the fancy part. Create merger, add extra pages and save
merger = PyPDF2.PdfFileMerger()
merger.append(temp_base)
# Add any extra document, you can choose a subset of pages and add bookmarks
merger.append(entry.document.file, bookmark='Attachment')
merger.write(temp_final)
# Write the final file in the HTTP response
django_file = File(temp_final)
resp = HttpResponse(django_file, content_type='application/pdf')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
if django_file.size is not None:
resp['Content-Length'] = django_file.size
return resp

Use this custom flowable:
class PDF_Flowable(Flowable):
#----------------------------------------------------------------------
def __init__(self,P,page_no):
Flowable.__init__(self)
self.P = P
self.page_no = page_no
#----------------------------------------------------------------------
def draw(self):
"""
draw the line
"""
canv = self.canv
pages = self.P
page_no = self.page_no
canv.translate(x, y)
canv.doForm(makerl(canv, pages[page_no]))
canv.restoreState()
and then after opening existing pdf i.e.
pages = PdfReader(BASE_DIR + "/out3.pdf").pages
pages = [pagexobj(x) for x in pages]
for i in range(0, len(pages)):
F = PDF_Flowable(pages,i)
elements.append(F)
elements.append(PageBreak())
use this code to add this custom flowable in elements[].

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

PyPDF4 not reading certain characters - python

Related

Is it possible to capture specific parts of a PDF text with AWS Textract?

Decoding problem with fitz.Document in Python 3.7

How to extract text from Pdf using Pypdf2 excluding the text content from Charts and Tables

Python Blog RSS Feed Scraping BeautifulSoup Output to .txt Files

Merge Existing PDF into new ReportLab PDF via flowables

Categories

Resources