PyPDF2 does not always extract text from pdf file - python

I am trying to extract text from a PDF file using the Python package PyPDF2. Below is the code I wrote; it works fine for most PDF files. For this file, however, I get the following warning and no text is extracted at all: PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
import PyPDF2
# Import and read pdf file
file_name = 'path/to/file.pdf'
pdfFileObj = open(file_name, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=True)
# Extract text from pdf file (getPage is zero-indexed, so 1 is the second page)
pageObj = pdfReader.getPage(1)
text = pageObj.extractText()
print(text)

Related

Getting "Xref table not zero-indexed. ID numbers for objects will be corrected" warning

I have the following code (comments explain what is occurring):
import os
from io import StringIO
from PyPDF2 import PdfFileReader
# Path to the directory containing the PDF files
pdf_dir = '/path/to/pdf/files'
# Iterate over the files in the directory
for filename in os.listdir(pdf_dir):
    # Check if the file is a PDF file
    if filename.endswith('.pdf'):
        # Construct the full path to the file
        filepath = os.path.join(pdf_dir, filename)
        # Open the PDF file and read its contents
        with open(filepath, 'rb') as f:
            pdf = PdfFileReader(f)
            # Extract the text from the PDF file
            text = ''
            for page in pdf.pages:
                text += page.extractText()
        # Construct the name of the output text file
        txt_filename = filename[:-4] + '.txt'
        # Write the text to the output file
        with open(txt_filename, 'w') as f:
            f.write(text)
When I run the code, it produces an "Xref table not zero-indexed. ID numbers for objects will be corrected" warning. It is not a hard error, but it makes me wonder whether there is a different way I should be doing this.
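One way to keep the output quiet, as a sketch only: if the installed PyPDF2 version emits this message through Python's standard warnings module (the "PdfReadWarning:" prefix suggests it does, but that is an assumption about the version in use), a message filter can hide just this one warning and leave all others visible. emit_xref_warning below is a hypothetical stand-in for the PdfFileReader call, so the snippet runs without any PDF.

```python
import warnings

XREF_MESSAGE = "Xref table not zero-indexed"


def emit_xref_warning():
    # Hypothetical stand-in for a PdfFileReader call that triggers the warning.
    warnings.warn(
        "Xref table not zero-indexed. ID numbers for objects will be corrected.",
        UserWarning,
    )
    return "parsed"


def run_quietly(func):
    """Call func() while ignoring only warnings whose text starts with XREF_MESSAGE."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings("ignore", message=XREF_MESSAGE)
        return func()


result = run_quietly(emit_xref_warning)
print(result)
```

The filter is scoped to the with block, so warnings raised elsewhere in the program are unaffected.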
Thanks for any suggestions.

Extracting PDF text into a text file using Python - Extraction error

I first want to extract all the text from one PDF file and store it in one text file.
Here is my code:
import PyPDF2
from pathlib import Path
with Path('C:/Users/Lui/Desktop/Test/file1.pdf').open(mode='rb') as pdf_file, open('Extracted/extractPDF.txt', 'w') as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    print(number_of_pages)
    for page_number in range(number_of_pages):  # use xrange in Py2
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        print(page_content)
        text_file.write(page_content)
The pdf looks like this:
However, the text file that is created looks different: words and spacing are missing:
What am I doing wrong? My goal is to then loop through 1,000 PDFs, so I'm trying to get one example working first.
Try using pdftotext:
import pdftotext
# Load your PDF (filename is the path to your file)
with open(filename, "rb") as f:
    pdf = pdftotext.PDF(f)
# If it's password-protected
#with open("secure.pdf", "rb") as f:
#    pdf = pdftotext.PDF(f, "secret")
# How many pages?
#print(len(pdf))
# Iterate over all the pages
#for page in pdf:
#    print(page)
# Read all the text into one string
data = "\n\n".join(pdf)
print(data)
pdftotext is built on Poppler and tends to preserve words and spacing better than PyPDF2's extractText(), so it should help here.
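For the stated goal of looping over 1,000 PDFs, here is a minimal sketch. The loop takes the extraction step as a function argument, so either extractText() from the question or pdftotext from this answer can be plugged in. The names extract_all and extract_with_pdftotext, and the choice to write one .txt file per PDF, are assumptions made for illustration and are not part of either library.

```python
from pathlib import Path


def extract_with_pdftotext(pdf_path):
    # Assumes the third-party pdftotext package from the answer is installed.
    import pdftotext

    with open(pdf_path, "rb") as f:
        return "\n\n".join(pdftotext.PDF(f))


def extract_all(pdf_dir, out_dir, extract):
    """Write one .txt per PDF in pdf_dir; return the names of files that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            text = extract(pdf_path)
        except Exception:
            # One unreadable file should not stop the whole batch.
            failed.append(pdf_path.name)
            continue
        (out_dir / (pdf_path.stem + ".txt")).write_text(text, encoding="utf-8")
    return failed
```

A call such as extract_all("pdfs", "Extracted", extract_with_pdftotext) would then process a whole folder (the folder names are placeholders) and return the names of the files that could not be read, so they can be inspected separately.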

PyPDF2 split pdf keeping original pdf format

I am splitting a PDF file by page ranges. My code works fine, but the problem is that the new PDF file does not preserve the format of the original PDF.
How can I split a PDF file without losing the original format?
from PyPDF2 import PdfFileReader, PdfFileWriter
# split range
pgi = 30  # start
pgf = 37  # end
pdf_document = "test.pdf"
pdf = PdfFileReader(pdf_document)
pdf_writer = PdfFileWriter()
for page in range(pgi-1, pgf):
    current_page = pdf.getPage(page)
    pdf_writer.addPage(current_page)
with open(f'test-{pgi}-{pgf}.pdf', "wb") as out:
    pdf_writer.write(out)
PS: I don't have this problem when I use Adobe or MasterPdfEditor software.
The image below shows the original and new pdf format.

Convert the page data extracted from pdf file into csv file using PyPDF2

I have extracted data from a PDF file, but I am unable to convert it into a CSV file.
import PyPDF2 as PDF
PDFfile = open("path", "rb")
pdfread = PDF.PdfFileReader(PDFfile)
page = pdfread.getPage(0)
Page_content = page.extractText()
After extracting the text from a particular page, I want to export it to a CSV file. How can I do this?
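A possible approach, sketched with the standard csv module: split the extracted text into lines, split each line into fields, and write the rows. How a line maps to columns depends entirely on the PDF; splitting on runs of two or more spaces is an assumption here, as are the function name text_to_csv, the output file name, and the sample text.

```python
import csv
import re


def text_to_csv(page_text, csv_path):
    """Write each non-empty line of page_text as one CSV row; return the rows."""
    rows = []
    for line in page_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: columns are separated by two or more spaces.
        rows.append(re.split(r"\s{2,}", line))
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return rows


# Made-up sample standing in for Page_content from the question.
sample = "Name   Qty   Price\nWidget   2   3.50\n\nGadget   10   12.00\n"
rows = text_to_csv(sample, "page0.csv")
print(rows)
```

If the extracted text has no reliable column separator, the splitting rule inside the loop is the part to adapt; the writing step stays the same.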

issues in extracting text from PyPDF2

I am trying to read a PDF using PyPDF2 in Python.
I have been using the code below to read it.
import PyPDF2
filename = r'./abc.pdf'
pdf = open(filename,'rb')
pdfReader= PyPDF2.PdfFileReader(pdf)
page = pdfReader.getPage(0)
text = page.extractText()
print(text)
It displays the text if I execute it on the Python command line.
It does not display the text if I save it as a script (with a .py file extension) and run the script.
