I am trying to extract text from a PDF file using the Python package PyPDF2. Below is the code I wrote for it, and it works fine for most PDF files. However, for this file I get the following warning and no text is extracted at all. Here is the warning: PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
import PyPDF2
# Import and read pdf file
file_name = 'path/to/file.pdf'
pdfFileObj = open(file_name, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=True)
# Extract text from pdf file
pageObj = pdfReader.getPage(1)
text = pageObj.extractText()
print(text)
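One detail worth checking in the code above: `getPage` takes a zero-based index, so `getPage(1)` returns the second page, not the first. The sketch below wraps the lookup in a range check. It relies only on the `getNumPages`, `getPage` and `extractText` calls that appear in these snippets; `page_text` and the stub classes are hypothetical names added so the example runs without a PDF, and it does not address the Xref warning itself.

```python
def page_text(reader, page_number):
    """Return the text of a page given a 1-based page number.

    `reader` is any object exposing getNumPages() and getPage(index),
    as the legacy PyPDF2.PdfFileReader used in the question does.
    """
    total = reader.getNumPages()
    if not 1 <= page_number <= total:
        raise ValueError(f"page {page_number} is outside 1..{total}")
    # getPage() is zero-based, so page 1 lives at index 0.
    return reader.getPage(page_number - 1).extractText()


# Hypothetical stand-ins so the sketch runs without a PDF on disk.
class _StubPage:
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class _StubReader:
    def __init__(self, texts):
        self._pages = [_StubPage(t) for t in texts]

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, index):
        return self._pages[index]


demo = _StubReader(["first page", "second page"])
print(page_text(demo, 1))
```

With a real file, the `pdfReader` object from the question would be passed in place of `demo`.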
I have the following code (comments explain what is occurring):
import os
from io import StringIO
from PyPDF2 import PdfFileReader
# Path to the directory containing the PDF files
pdf_dir = '/path/to/pdf/files'
# Iterate over the files in the directory
for filename in os.listdir(pdf_dir):
    # Check if the file is a PDF file
    if filename.endswith('.pdf'):
        # Construct the full path to the file
        filepath = os.path.join(pdf_dir, filename)
        # Open the PDF file and read its contents
        with open(filepath, 'rb') as f:
            pdf = PdfFileReader(f)
            # Extract the text from the PDF file
            text = ''
            for page in pdf.pages:
                text += page.extractText()
        # Construct the name of the output text file
        txt_filename = filename[:-4] + '.txt'
        # Write the text to the output file
        with open(txt_filename, 'w') as f:
            f.write(text)
When I run the code, it produces an "Xref table not zero-indexed. ID numbers for objects will be corrected." warning. It is not a hard error, but it makes me wonder whether there is a different way I should be doing this.
Thanks for any suggestions.
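Since the message is a warning rather than an error, one option is to filter it with the standard `warnings` module. The sketch below assumes the library emits the message through `warnings.warn`, as older PyPDF2 releases do; newer releases may route messages through `logging` instead, in which case this filter would have no effect. `noisy_read` is a hypothetical stand-in for the PyPDF2 call, and `quiet_call` is a hypothetical helper name.

```python
import warnings

XREF_MESSAGE = "Xref table not zero-indexed"


def noisy_read():
    """Hypothetical stand-in for a PyPDF2 call that emits the warning."""
    warnings.warn(
        "Xref table not zero-indexed. ID numbers for objects will be corrected."
    )
    return "page text"


def quiet_call(func):
    """Run func() with the Xref warning filtered out, other warnings kept."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings("ignore", message=XREF_MESSAGE)
        return func()


print(quiet_call(noisy_read))
```

Filtering by message rather than silencing all warnings keeps any other diagnostics from the library visible.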
I want to first extract all the text from one PDF file and store it in a single text file.
Here is my code:
import PyPDF2
from pathlib import Path
with Path('C:/Users/Lui/Desktop/Test/file1.pdf').open(mode='rb') as pdf_file, open('Extracted/extractPDF.txt', 'w') as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    print(number_of_pages)
    for page_number in range(number_of_pages):  # use xrange in Py2
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        print(page_content)
        text_file.write(page_content)
The pdf looks like this:
However, the text file that is created looks different, with missing words and spacing:
What am I doing wrong? My goal is to then loop through 1,000 PDFs, so I'm trying to get one example working first.
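For the eventual loop over 1,000 PDFs, a minimal sketch of a batch driver follows. It uses only the standard library; `convert_folder` and the `extract` callable are hypothetical names, and `extract` is assumed to be whatever single-file routine ends up working, wrapped as a function that takes a path and returns a string. A failure on one file is recorded and the loop continues.

```python
from pathlib import Path


def convert_folder(pdf_dir, out_dir, extract):
    """Write one .txt file per PDF found in pdf_dir.

    `extract` is a caller-supplied function: extract(path) -> str.
    Returns (done, failed): a list of converted file names and a list
    of (file name, error message) pairs.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    done, failed = [], []
    for pdf_path in sorted(pdf_dir.iterdir()):
        # Match the extension case-insensitively (.pdf, .PDF, ...).
        if pdf_path.suffix.lower() != ".pdf":
            continue
        try:
            text = extract(pdf_path)
        except Exception as exc:  # keep going if one file is unreadable
            failed.append((pdf_path.name, str(exc)))
            continue
        out_path = out_dir / (pdf_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        done.append(pdf_path.name)
    return done, failed
```

Returning the failures instead of raising makes it easier to see afterwards which of the files need a different extraction approach.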
Try using pdftotext
import pdftotext
# Load your PDF
with open(filename, "rb") as f:
    pdf = pdftotext.PDF(f)
# If it's password-protected
#with open("secure.pdf", "rb") as f:
#    pdf = pdftotext.PDF(f, "secret")
# How many pages?
#print(len(pdf))
# Iterate over all the pages
#for page in pdf:
#    print(page)
# Read all the text into one string
data = "\n\n".join(pdf)
print(data)
This package works far better and should help you out.
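If the goal is a text file rather than console output, the pages can be written out directly. This is a minimal sketch, assuming only what the snippet above shows, namely that iterating over the `pdf` object yields one string per page; `save_pages` is a hypothetical helper name.

```python
def save_pages(pages, out_path, separator="\n\n"):
    """Write an iterable of page strings to out_path; return the page count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for text in pages:
            if count:
                # Put the separator between pages, not after the last one.
                out.write(separator)
            out.write(text)
            count += 1
    return count
```

A call such as `save_pages(pdf, 'Extracted/extractPDF.txt')` would then replace the `join` and `print` steps.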
I am splitting a PDF file by page ranges. My code works fine, but the problem is that the new PDF file does not preserve the format of the original PDF.
How can I split a PDF file without losing the original format?
from PyPDF2 import PdfFileReader, PdfFileWriter
# split range
pgi = 30  # start
pgf = 37  # end
pdf_document = "test.pdf"
pdf = PdfFileReader(pdf_document)
pdf_writer = PdfFileWriter()
for page in range(pgi-1, pgf):
    current_page = pdf.getPage(page)
    pdf_writer.addPage(current_page)
with open(f'test-{pgi}-{pgf}.pdf', "wb") as out:
    pdf_writer.write(out)
PS: I don't have this problem when I use Adobe or MasterPdfEditor software.
The image below shows the original and new pdf format.
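The range arithmetic in the loop (`range(pgi-1, pgf)`) converts 1-based, inclusive page numbers into the zero-based indices that `getPage` expects. The small sketch below makes that conversion explicit and validates it; `page_indices` is a hypothetical helper name, and it does not by itself address the formatting difference.

```python
def page_indices(first, last, num_pages):
    """Map a 1-based inclusive page range to zero-based indices."""
    if not 1 <= first <= last <= num_pages:
        raise ValueError(
            f"invalid range {first}-{last} for a {num_pages}-page document"
        )
    return range(first - 1, last)


# Pages 30..37 of a 100-page file map to indices 29..36.
print(list(page_indices(30, 37, 100)))
```

In the question's code, the loop header would become `for page in page_indices(pgi, pgf, pdf.getNumPages()):`.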
I have extracted data from a PDF file, but I am unable to convert it into a CSV file.
import PyPDF2 as PDF
PDFfile = open("path", "rb")
pdfread = PDF.PdfFileReader(PDFfile)
page = pdfread.getPage(0)
Page_content = page.extractText()
After extracting the text from a particular page, I want to export it to a CSV file. How can I do this?
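One possible approach uses the standard `csv` module. The sketch below assumes each line of the extracted text is one record and that fields are separated by runs of whitespace; real PDF text often does not split this cleanly, so the splitting rule would need adapting to the actual layout. `text_to_rows` and `write_csv` are hypothetical helper names, and `sample` is made-up data.

```python
import csv
import io


def text_to_rows(text):
    """Split text into rows, one per non-empty line, of whitespace-separated fields."""
    return [line.split() for line in text.splitlines() if line.strip()]


def write_csv(rows, stream):
    """Write rows to an open text stream in CSV format."""
    csv.writer(stream, lineterminator="\n").writerows(rows)


# Made-up sample standing in for the extracted page text.
sample = "name qty\napple 3\n\npear 5\n"
buffer = io.StringIO()
write_csv(text_to_rows(sample), buffer)
print(buffer.getvalue())
```

To write to disk instead, the file would be opened with `newline=""`, for example `with open("page.csv", "w", newline="") as fh: write_csv(text_to_rows(Page_content), fh)`.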
I am trying to read a PDF using PyPDF2 in Python.
I have been using the code below to read the PDF.
import PyPDF2
filename = r'./abc.pdf'
pdf = open(filename,'rb')
pdfReader= PyPDF2.PdfFileReader(pdf)
page = pdfReader.getPage(0)
text = page.extractText()
print(text)
It displays the text if I execute it on the Python command line.
It does not display the text if I save it as a script (with a .py file extension) and run the script.