Python PDF (PyPDF2 module) - How to split/merge this?

I was trying to split and merge PDF files so that I can remove the first page of each file. Here's the code:
#python3
#split and merge pdf files!
import os, PyPDF2
pdfFiles = []
os.chdir('C:\\Users\\Cyber\\Downloads\\5-111-fall-2008\\5-111-fall-2008\\contents\\readings-and-lecture-notes')
for filename in os.listdir('.'):
    if filename.endswith('pdf'):
        pdfFiles.append(filename)
pdfWriter = PyPDF2.PdfFileWriter()
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Start at index 1 to skip the first page of each file
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)
pdfOutput = open('Merged.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()
And then I get the following message:
: PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
I searched for that message and found that it indicates there may have been an issue with the creation of the PDF itself.
Although I get my Merged.pdf file as I wanted, I want to know what exactly it means and how to avoid it.

This warning means that the first section of the xref table does not begin with object zero. There may have been an error in writing the PDF. If strict=False, PyPDF2 will try to correct the object ID numbers. If strict=True, they will not be corrected. The default is True. Try PyPDF2.PdfFileReader(pdfFileObj, False), where the second argument is strict.
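A generic way to keep one known, harmless warning out of the output is Python's standard `warnings` filters. The sketch below does not use PyPDF2 at all: `FakePdfReadWarning` and `read_pdf_stub` are hypothetical stand-ins that emit the same message text, and whether such a filter behaves the same way with your installed PyPDF2 version is an assumption to verify.

```python
import warnings


class FakePdfReadWarning(UserWarning):
    """Hypothetical stand-in for the library's warning class."""


def read_pdf_stub():
    # Hypothetical stand-in for opening a PDF: emits the xref warning
    # plus one unrelated warning, then returns a dummy result.
    warnings.warn(
        "Xref table not zero-indexed. ID numbers for objects will be corrected.",
        FakePdfReadWarning,
    )
    warnings.warn("some unrelated problem", UserWarning)
    return "ok"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The message argument is a regex matched against the start of the text.
    warnings.filterwarnings("ignore", message="Xref table not zero-indexed")
    result = read_pdf_stub()

remaining = [str(w.message) for w in caught]
```

Only the warning whose text starts with the given pattern is dropped; other warnings still come through, so real problems are not hidden.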

Related

Getting "Xref table not zero-indexed. ID numbers for objects will be corrected" warning

I have the following code (comments explain what is occurring):
import os
from io import StringIO
from PyPDF2 import PdfFileReader
# Path to the directory containing the PDF files
pdf_dir = '/path/to/pdf/files'
# Iterate over the files in the directory
for filename in os.listdir(pdf_dir):
    # Check if the file is a PDF file
    if filename.endswith('.pdf'):
        # Construct the full path to the file
        filepath = os.path.join(pdf_dir, filename)
        # Open the PDF file and read its contents
        with open(filepath, 'rb') as f:
            pdf = PdfFileReader(f)
            # Extract the text from the PDF file
            text = ''
            for page in pdf.pages:
                text += page.extractText()
        # Construct the name of the output text file
        txt_filename = filename[:-4] + '.txt'
        # Write the text to the output file
        with open(txt_filename, 'w') as f:
            f.write(text)
When I run the code, it produces an "Xref table not zero-indexed. ID numbers for objects will be corrected" warning. It is not a hard error, but it makes me wonder if there's a different way I should be doing this.
Thanks for any suggestions.

Python does not print PDF with pyPDF2

I tried to print pages of a pdf document:
import PyPDF2
FILE_PATH = 'my.pdf'
with open(FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    page = reader.getPage(0) # I tried also other pages e.g 1,2,..
    print(page.extractText())
But I only get a lot of blank space and no error message. Could it be that this pdf version (my.pdf) is not supported by PyPDF2?
This solved it (prints all pages of the document). Thanks
from pdfreader import SimplePDFViewer
fd = open("my.pdf", "rb")
viewer = SimplePDFViewer(fd)
for i in range(1,16): # need range from 1 - max number of pages +1
    viewer.navigate(i)
    viewer.render()
    page_1_content=viewer.canvas.text_content
    page_1_text = "".join(viewer.canvas.strings)
    print (page_1_text)
Try pdfreader
from pdfreader import SimplePDFViewer
fd = open("my.pdf", "rb")
viewer = SimplePDFViewer(fd)
viewer.render()
page_0_content=viewer.canvas.text_content
page_0_text = "".join(viewer.canvas.strings)
If the output is blank, the PDF is probably being opened but its text is stored in a form that PyPDF2 can't extract, so you just get blank output. You could also try an absolute file path instead of a relative one. If all else fails, try different PDFs; if some of them work and yours doesn't, you might need to convert yours to the type that works.

How to compress pdf without losing quality using PyPDF2 [duplicate]

I am struggling to compress my merged PDFs using the PyPDF2 module. This is my attempt, based on http://www.blog.pythonlibrary.org/2012/07/11/pypdf2-the-new-fork-of-pypdf/
import PyPDF2
path = open('path/to/hello.pdf', 'rb')
path2 = open('path/to/another.pdf', 'rb')
merger = PyPDF2.PdfFileMerger()
merger.append(fileobj=path2)
merger.append(fileobj=path)
pdf.filters.compress(merger)
merger.write(open("test_out2.pdf", 'wb'))
The error I receive is
TypeError: must be string or read-only buffer, not file
I have also tried compressing the PDF after the merging is complete. I am judging my compression as failed by comparing against the file size I got from PDFSAM with compression.
Any thoughts? Thanks.
PyPDF2 doesn't have a reliable compression method. That said, there's a compress_content_streams() method with the following description:
Compresses the size of this page by joining all content streams and applying a FlateDecode filter.
However, it is possible that this function will perform no action if content stream compression becomes "automatic" for some reason.
Again, this won't make any difference in most cases but you can try this code:
from PyPDF2 import PdfReader, PdfWriter
writer = PdfWriter()
for pdf in ["path/to/hello.pdf", "path/to/another.pdf"]:
    reader = PdfReader(pdf)
    for page in reader.pages:
        page.compress_content_streams()
        writer.add_page(page)
with open("test_out2.pdf", "wb") as f:
    writer.write(f)
Your error says that the argument must be a string or read-only buffer, not a file.
So it's better to write your merger output to an in-memory bytes buffer first.
import PyPDF2
from io import BytesIO
tmp = BytesIO()
path = open('path/to/hello.pdf', 'rb')
path2 = open('path/to/another.pdf', 'rb')
merger = PyPDF2.PdfFileMerger()
merger.append(fileobj=path2)
merger.append(fileobj=path)
merger.write(tmp)
PyPDF2.filters.compress(tmp.getvalue())
merger.write(open("test_out2.pdf", 'wb'))
The initial approach isn't that wrong. Just add the pages to your writer and compress them before writing to a file:
...
for i in range(reader.numPages):
    page = reader.getPage(i)
    writer.addPage(page)
for i in range(writer.getNumPages()):
    # Fetch each page from the writer; otherwise only the last page is compressed
    page = writer.getPage(i)
    page.compressContentStreams()
...
pypdf offers several ways to reduce the file size: https://pypdf.readthedocs.io/en/latest/user/file-size.html
compress_content_streams is one of them. Its only disadvantage is that it might take a long time, depending on the PDF (think of it as ZIP-for-PDF):
from pypdf import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.compress_content_streams() # This is CPU intensive!
    writer.add_page(page)
with open("out.pdf", "wb") as f:
    writer.write(f)
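To see why compressing content streams shrinks some PDFs a lot and others hardly at all, it can help to look at the deflate step in isolation. FlateDecode is the PDF filter name for zlib/deflate data, so the sketch below uses only the standard `zlib` module. The sample "content stream" is made up for illustration, and random bytes stand in for data that is already compressed (such as embedded images).

```python
import os
import zlib

# Made-up, highly repetitive text resembling PDF drawing operators.
text_stream = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 200

# Random bytes as a stand-in for data that is already compressed.
random_like = os.urandom(len(text_stream))

compressed_text = zlib.compress(text_stream)
compressed_random = zlib.compress(random_like)

text_ratio = len(compressed_text) / len(text_stream)
random_ratio = len(compressed_random) / len(random_like)

print("repetitive stream ratio:", round(text_ratio, 3))
print("random-like data ratio:", round(random_ratio, 3))
```

Repetitive operator text compresses to a small fraction of its size, while data that is already dense gains nothing, which matches the observation that the method often makes little difference on image-heavy files.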

Unable to iterate through a list -pyPDF2

Running the code below throws an error at the pdfReader line:
pdf=['/somepath/a.pdf','/somepath/b.pdf']
for count in range(len(pdf)):
    name=pdf[count]
    pdfFileObj = open(name, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj) #Error at this line
    pages=pdfReader.numPages
Error- PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
However, when I pass a single PDF location directly, as below, it works. But I need a loop so that every PDF can be processed.
pdfFileObj = open(pdf[0], 'rb')
I also tried a loop like this, but it fails again at PdfFileReader:
for p in pdf:
    pdfFileObj = open(p, 'rb')
According to this site, this error means that the first section of the xref table does not begin with object zero. You can overcome this by passing the option strict=False, and PyPDF2 will automatically correct the object ID numbers. Usually this is not a big problem, and Adobe will still read your PDFs. Cheers.

How to extract text from a directory of PDF files efficiently with OCR?

I have a large directory of PDF files (scanned images). How can I efficiently extract the text from all the files in the directory? So far I have tried:
import multiprocessing
import textract
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))
However, it is not working: it takes a lot of time (some of my documents have 600 pages). Additionally: a) I do not know how to handle the directory transformation part efficiently. b) I would like to add a page separator, let's say: <start/page = 1> ... page content ... <end/page = 1>, but I have no idea how to do this.
Thus, how can I apply the extract_txt function to every file in a directory that ends with .pdf, write the results to another directory as .txt files, and add a page separator to the OCR-extracted text?
Also, I was curious about using Google Docs for this task. Is it possible to use Google Docs programmatically to solve the text extraction problem described above?
UPDATE
Regarding the "adding a page separator" issue (<start/page = 1> ... page content ... <end/page = 1>), after reading Roland Smith's answer I tried:
import os
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        # Write page i to its own temporary single-page PDF
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile: # I presume you need `wb`.
            w.write(outfile)
        print('\n<begin page pos =' , i, '>\n')
        text = textract.process(str(outfname), method='tesseract')
        os.remove(outfname) # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =' , i, '>\n')
extract_text('/Users/user/Downloads/ImageOnly.pdf')
However, I still have issues with the print() part: instead of printing, it would be more useful to save all the output to a file. Thus, I tried to redirect the output to a file:
sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname) # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()
Any idea of how to make the page extraction/separator trick and saving everything into a file?...
In your code, you are extracting the text, but you don't do anything with it.
Try something like this:
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt' # assuming filenames end with '.pdf'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path
This writes the text to a file that has the same name but a .txt extension.
It also returns the path of the original file to let the parent know that this file is done.
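Slicing off the last four characters only works when the name really ends in a three-letter extension. A slightly more defensive sketch uses `os.path.splitext` from the standard library; the helper name `txt_path_for` is made up here for illustration.

```python
import os


def txt_path_for(pdf_path):
    """Return pdf_path with its extension replaced by '.txt'."""
    # splitext splits at the last dot of the final path component.
    root, _extension = os.path.splitext(pdf_path)
    return root + ".txt"


print(txt_path_for("/Users/user/Desktop/sample.pdf"))
print(txt_path_for("scan.PDF"))
```

This keeps the directory part untouched and does not depend on the case or length of the extension.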
So I would change the mapping code to:
p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
    print('completed file:', fn)
You don't need to give an argument when creating a Pool. By default it will create as many workers as there are CPU cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returned the filename, you can print it to let the user know that this file is done.
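The completion-order behavior can be tried without any OCR. The sketch below swaps in `multiprocessing.pool.ThreadPool`, which offers the same `imap_unordered` method as `Pool`, only so that it runs as a plain script on any platform; `fake_extract` and the file names are hypothetical stand-ins for the real worker and inputs.

```python
from multiprocessing.pool import ThreadPool


def fake_extract(file_path):
    # Hypothetical stand-in for the OCR worker: does no real work and
    # returns the path so the caller knows which file finished.
    return file_path


paths = ["a.pdf", "b.pdf", "c.pdf"]
completed = []

with ThreadPool() as pool:
    # Results arrive as workers finish, not necessarily in input order.
    for done in pool.imap_unordered(fake_extract, paths):
        print("completed file:", done)
        completed.append(done)
```

For CPU-bound OCR a process-based `Pool` is the better fit, as in the answer; with a process pool the driver code should normally sit under an `if __name__ == "__main__":` guard.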
Edit 1:
The additional question is if it is possible to mark page boundaries. I think it is.
A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
Edit 2:
If you need a file, write a file:
import os
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    outfname = pdf_file[:-4] + '.txt' # Assuming PDF file name ends with ".pdf"
    with open(outfname, 'w') as textfile:
        for i in range(inputpdf.numPages):
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            # Separate name for the temporary single-page PDF
            pagefname = 'page{:03d}.pdf'.format(i)
            with open(pagefname, 'wb') as outfile: # I presume you need `wb`.
                w.write(outfile)
            print('page', i)
            # textract returns bytes; decode before joining with str.
            text = textract.process(pagefname, method='tesseract').decode('utf8')
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the output file.
            textfile.write(text)
            os.remove(pagefname) # clean up.
            print(text)
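The separator format itself can be checked separately from PyPDF2 and textract. In the minimal sketch below, the helper names `wrap_page` and `join_pages` are made up, and the page texts are plain strings standing in for OCR output.

```python
def wrap_page(page_number, page_text):
    """Surround one page's text with begin/end markers."""
    return "\n<begin page pos = {0}>\n{1}\n<end page pos = {0}>\n".format(
        page_number, page_text
    )


def join_pages(page_texts):
    """Concatenate all pages, numbering them from 0 like the loop above."""
    return "".join(wrap_page(i, text) for i, text in enumerate(page_texts))


combined = join_pages(["first page", "second page"])
print(combined)
```

Keeping the formatting in a small pure function makes it easy to change the marker style later without touching the OCR loop.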
