Decoding problem with fitz.Document in Python 3.7 - python

I want to extract the text of a PDF and use some regular expressions to filter for information.
I am coding in Python 3.7.4 using fitz for parsing the pdf. The PDF is written in German. My code looks as follows:
doc = fitz.open(pdfpath)
pagecount = doc.pageCount
page = 0
content = ""
while page < pagecount:
    p = doc.loadPage(page)
    page += 1
    content = content + p.getText()
Printing the content, I realized that the first (and important) half of the document is decoded as a strange mix of Japanese (?) signs and others, like this: ョ。オウキ・ゥエオョァ@ュ.
I tried to solve it with different decodings (latin-1, iso-8859-1); the encoding is definitely UTF-8.
content= content+p.getText().encode("utf-8").decode("utf-8")
I also have tried to get the text using minecart:
import minecart

file = open(pdfpath, 'rb')
document = minecart.Document(file)
for page in document.iter_pages():
    for lettering in page.letterings:
        print(lettering)
which results in the same problem.
Using textract, the first half is an empty string:
import textract
text = textract.process(pdfpath)
print(text.decode('utf-8'))
Same thing with PyPDF2:
import PyPDF2

pdfFileObj = open(pdfpath, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
for index in range(0, pdfReader.numPages):
    pageObj = pdfReader.getPage(index)
    print(pageObj.extractText())
I don't understand the problem, as it looks like a normal PDF with normal text. Also, some of the PDFs don't have this problem.
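Garbled CJK-looking output like this usually means the embedded font has no usable ToUnicode mapping, so extractors map glyph IDs to arbitrary code points. As a diagnostic, a sketch using PyMuPDF's "dict" text extraction to see which font each garbled span uses (the helper name is mine; PyMuPDF is imported lazily so the sketch stays self-contained):

```python
def list_span_fonts(pdf_path):
    """List (font name, text sample) pairs for every span in the PDF.

    If the garbled spans all share one embedded font, that font's text
    layer is likely unrecoverable by any extractor, and OCR is the
    usual fallback.
    """
    import fitz  # PyMuPDF
    doc = fitz.open(pdf_path)
    spans = []
    for page in doc:
        info = page.get_text("dict")
        for block in info.get("blocks", []):
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    spans.append((span["font"], span["text"][:40]))
    return spans
```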

Related

Is it possible to capture specific parts of a PDF text with AWS Textract?

I need to extract text from the PDF, but I don't want the entire PDF parsed. Is it possible to get only specific parts of the parsed PDF? For example, I have a PDF containing an address, a city, and a country. I only want the Address field returned, not the other information.
Code that returns the text to me:
from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import get_lines_string
response = call_textract(input_document="s3://my-bucket/myfile.pdf")
print(get_lines_string(response))
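Since `get_lines_string` returns plain text, one way to keep only the Address field is a stdlib regex over the lines. A minimal sketch, assuming the PDF renders fields as "Label: value" lines (the label format is my assumption, not from the question):

```python
import re

def extract_field(lines_text, field_name):
    """Return the value after 'field_name:' on any line, or None.

    Assumes 'Label: value' lines; adjust the pattern to the real layout.
    """
    pattern = re.compile(rf"^\s*{re.escape(field_name)}\s*:\s*(.+)$",
                         re.IGNORECASE | re.MULTILINE)
    match = pattern.search(lines_text)
    return match.group(1).strip() if match else None

sample = "Name: Jane Doe\nAddress: 1 Main St\nCity: Berlin\nCountry: Germany"
print(extract_field(sample, "Address"))  # -> 1 Main St
```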
Try this method (it doesn't use AWS Textract, but works as well):
import PyPDF2

def extract_text(filename, page_number):
    # Returns the content of a given page
    pdf_file_object = open(filename, 'rb')
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_object)
    # page_number - 1 below because in Python, page 1 is considered page 0
    page_object = pdf_reader.getPage(page_number - 1)
    text = page_object.extractText()
    pdf_file_object.close()
    return text
This function extracts the text from one single PDF page.
If you haven't got PyPDF2 yet, install it through the command line with 'pip install PyPDF2'.
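Note that PyPDF2's `PdfFileReader`/`getPage`/`extractText` API is deprecated; a sketch of the same one-page helper using the current pypdf API (assuming pypdf is installed; the function name is mine):

```python
def extract_page_text(filename, page_number):
    """Return the text of a single 1-indexed page using pypdf."""
    from pypdf import PdfReader  # lazy import; pip install pypdf
    reader = PdfReader(filename)
    # pages are 0-indexed internally, so subtract 1
    return reader.pages[page_number - 1].extract_text()
```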

PyPDF4 not reading certain characters

I'm compiling some data for a project and I've been using PyPDF4 to read this data from its source PDF file, but I've been having trouble with certain characters not showing up correctly. Here's my code:
from PyPDF4 import PdfFileReader
import pandas as pd
import numpy as np
import os
import xml.etree.cElementTree as ET

# File name
pdf_path = "PV-9-2020-10-23-RCV_FR.pdf"
# Results storage
results = {}
# Start page
page = 5
# Lambda to assign votes
serify = lambda voters, vote: pd.Series({voter.strip(): vote for voter in voters})

with open(pdf_path, 'rb') as f:
    # Get PDF reader for PDF file f
    pdf = PdfFileReader(f)
    while page < pdf.numPages:
        # Get text of page in PDF
        text = pdf.getPage(page).extractText()
        proposal = text.split("\n+\n")[0].split("\n")[3]
        # Collect all relevant pages
        while text.find("\n0\n") == -1:
            page += 1
            text += "\n".join(pdf.getPage(page).extractText().split("\n")[3:])
        # Remove corrections
        text, corrections = text.split("CORRECCIONES")
        # Grab relevant text !!! This is where the missing characters show up.
        text = "\n, ".join([n[:n.rindex("\n")] for n in text.split("\n:")])
        for_list = "".join(text[text.index("\n+\n")+3:text.index("\n-\n")].split("\n")[:-1]).split(", ")
        nay_list = "".join(text[text.index("\n-\n")+3:text.index("\n0\n")].split("\n")[:-1]).split(", ")
        abs_list = "".join(text[text.index("\n0\n")+3:].split("\n")[:-1]).split(", ")
        # Store data in results
        results.update({proposal: dict(pd.concat([serify(for_list, 1), serify(nay_list, -1), serify(abs_list, 0)]).items())})
        page += 1
        print(page)

results = pd.DataFrame(results)
The characters I'm having difficulty with don't show up in the text extracted using extractText. Ždanoka, for instance, becomes "danoka; Štefanec becomes -tefanc. Most of the problem characters seem to be Eastern European, which makes me think I need one of the Latin decoders.
I've looked through some of PyPDF4's capabilities, and it seems to have plenty of relevant codecs, including latin1. I've attempted decoding the file using different functions from the PyPDF4.generic.codecs module, and either the characters still don't show up, or the code throws an error at an unrecognised byte.
I haven't yet attempted using multiple codecs on different bytes from the same file, as that seems like it would take some time. Am I missing something in my code that can easily fix this? Or is it more likely I will have to tailor-fit a solution using PyPDF4's functions?
Use pypdf instead of PyPDF2/PyPDF3/PyPDF4. You will need to apply the migrations.
pypdf received a lot of updates in December 2022, especially to text extraction.
To give you a minimal full example for text extraction:
from pypdf import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())

GET table of contents from a PDF with python

I'm trying to get the Table of Contents from a PDF. I'm using PyMuPDF for that purpose, but it only extracts the ToC if the PDF contains bookmarks. Otherwise it just returns an empty list.
def get_Table_Of_Contents(doc):
    toc = doc.getToC()
    return toc

toc = get_Table_Of_Contents(file)
toc
Usually the ToC is represented as regular text on a page.
Try pdfreader to extract the texts and/or the PDF "markdown".
Here is a sample code extracting all the above from a page:
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
# navigate to TOC
viewer.navigate(toc_page_number)
viewer.render()
pdf_markdown = viewer.canvas.text_content
plain_text = "".join(viewer.canvas.strings)
then you can parse plain_text or pdf_markdown as regular strings.
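For parsing, a minimal stdlib sketch that pulls entries of the common "Title ..... 12" shape out of the plain text; the dot-leader pattern is an assumption about the ToC layout, not something the answer specifies:

```python
import re

def parse_toc(plain_text):
    """Parse 'Title ..... 12'-style lines into (title, page) tuples.

    Assumes dot leaders between title and page number; adapt the
    pattern to the actual ToC layout.
    """
    entries = []
    for line in plain_text.splitlines():
        m = re.match(r"^(.*?)\s*\.{2,}\s*(\d+)\s*$", line)
        if m:
            entries.append((m.group(1).strip(), int(m.group(2))))
    return entries

print(parse_toc("Introduction ....... 1\nChapter 1 .... 5"))
# -> [('Introduction', 1), ('Chapter 1', 5)]
```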
Convert the PDF to HTML using a pdf-to-html converter. You can then parse the HTML to extract whatever data you want, using a parser like BeautifulSoup.
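One concrete way to do this, using pdfminer.six as the converter and BeautifulSoup as the parser (the answer doesn't name specific libraries, so this pairing is my choice; both are imported lazily):

```python
import io

def pdf_to_soup(pdf_path):
    """Convert a PDF to HTML with pdfminer.six, then parse it with BeautifulSoup."""
    from pdfminer.high_level import extract_text_to_fp  # pip install pdfminer.six
    from bs4 import BeautifulSoup                       # pip install beautifulsoup4
    html_buf = io.BytesIO()
    with open(pdf_path, "rb") as fin:
        extract_text_to_fp(fin, html_buf, output_type="html")
    return BeautifulSoup(html_buf.getvalue(), "html.parser")
```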

can't read pdf document using PyPDF2

I am trying to read some text from a PDF file. I am using the code below; however, when I try to get the text (ptext), all that is returned is a string variable of size 1, and it's empty.
Why is no text being returned? I have tried other pages and another PDF book, but the same thing happens; I can't seem to read any text.
import PyPDF2
file = open(r'C:/Users/pdfs/test_file.pdf', 'rb')
fileReader = PyPDF2.PdfFileReader(file)
pageObj = fileReader.getPage(445)
ptext = pageObj.extractText()
I also had the same issue; I thought something was wrong with my code or whatnot. After some intense researching, debugging and investigation, it seems that the PyPDF2, PyPDF3 and PyPDF4 packages can't handle large files. I tried with a 20-page PDF and it ran seamlessly, but with a 50+ page PDF, PyPDF crashes.
My only suggestion would be to use a different package altogether. pdftotext is a good recommendation; install it with pip install pdftotext.
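A minimal sketch of the pdftotext API for this use case (lazy import; the helper name is mine):

```python
def read_pdf_text(pdf_path):
    """Extract all page texts with the pdftotext package (Poppler-based)."""
    import pdftotext  # pip install pdftotext
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    # pdftotext.PDF behaves like a sequence of one string per page
    return "\n\n".join(pdf)
```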
I have faced a similar issue while reading my pdf files. Hope the below solution helps.
The reason I faced this issue: the PDF I was selecting was actually a scanned image. I created my resume using a third-party site, which returned a PDF. When parsing this type of file, I was not able to extract text directly.
Below is the tested, working code:
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os

def readPdfFile(filePath):
    # Part 1: Convert the PDF pages to images
    pages = convert_from_path(filePath, 500)
    image_counter = 1
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(filename, 'JPEG')
        image_counter = image_counter + 1

    # Part 2: Recognize text from the images using OCR
    filelimit = image_counter - 1  # Total number of pages
    text = ""
    for i in range(1, filelimit + 1):
        filename = "page_" + str(i) + ".jpg"
        # Accumulate each page's text (using += so earlier pages aren't overwritten)
        text += str(pytesseract.image_to_string(Image.open(filename)))
        text = text.replace('-\n', '')

    # Part 3: Remove the temporary image files
    image_counter = 1
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        os.remove(filename)
        image_counter = image_counter + 1

    return text

pypdf mergepage issue

I want to add a PDF watermark using the pypdf lib; code below:
def add_wm(pdf_in, pdf_out):
    wm_file = open("watermark.pdf", "rb")
    pdf_wm = PdfFileReader(wm_file)
    pdf_output = PdfFileWriter()
    input_stream = open(pdf_in, "rb")
    pdf_input = PdfFileReader(input_stream)
    pageNum = pdf_input.getNumPages()
    #print pageNum
    for i in range(pageNum):
        page = pdf_input.getPage(i)
        page.mergePage(pdf_wm.getPage(0))  # !! this fails if the page has Chinese characters
        page.compressContentStreams()
        pdf_output.addPage(page)
    output_stream = open(pdf_out, "wb")
    pdf_output.write(output_stream)
    output_stream.close()
    input_stream.close()
    wm_file.close()
    return True
The issue is that if the page returned by pdf_input.getPage(i) contains Chinese characters, page.mergePage raises an exception and the merge fails. How do I work around this?
The Python library pdfrw also supports watermarking. If it does not work for your particular PDF, please email it to me (address at github) and I will investigate -- I am the pdfrw author.
I had the same problem when I was watermarking with PyPDF2 1.25.1.
pdfrw, as Patrick suggested, isn't working for my PDF (it works for Word documents exported as PDF, but not for scanned documents, I guess).
Updating to the newest version of PyPDF2 (for me, 1.26.0) fixed this bug.
For more information see PyPDF2 issue #176
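For reference, the same watermark flow written against the current pypdf API, which replaced PyPDF2's camelCase names (mergePage becomes merge_page, and so on). A sketch assuming a one-page watermark.pdf exists; the function name is mine:

```python
def add_watermark(pdf_in, pdf_out, wm_path="watermark.pdf"):
    """Stamp the first page of wm_path onto every page of pdf_in."""
    from pypdf import PdfReader, PdfWriter  # pip install pypdf
    wm_page = PdfReader(wm_path).pages[0]
    reader = PdfReader(pdf_in)
    writer = PdfWriter()
    for page in reader.pages:
        page.merge_page(wm_page)  # modern name for mergePage
        writer.add_page(page)
    with open(pdf_out, "wb") as out:
        writer.write(out)
```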
