Unable to iterate through a list - PyPDF2 - Python

Running the code below throws an error at the pdfReader line:
import PyPDF2
pdf = ['/somepath/a.pdf', '/somepath/b.pdf']
for count in range(len(pdf)):
    name = pdf[count]
    pdfFileObj = open(name, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # Error at this line
    pages = pdfReader.numPages
Error: PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
However, when I pass a single PDF location directly, as below, it works. I need a loop so that every PDF gets processed.
pdfFileObj = open(pdf[0], 'rb')
I also tried looping like this, but it fails again at PdfFileReader:
for p in pdf:
    pdfFileObj = open(p, 'rb')

According to this site, this message means that the first section of the xref table does not begin with object zero. Note that it is a PdfReadWarning, a warning rather than an exception. You can work around it by passing strict=False to PdfFileReader, and PyPDF2 will automatically correct the object ID numbers. Usually this is not a big problem, and Adobe will still read your PDFs. Cheers.
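As a rough sketch of that suggestion, the loop from the question could pass strict=False on every iteration. This assumes the legacy PyPDF2 API used in the question (PdfFileReader, numPages). The helper name page_counts is made up for illustration, and the import sits inside the function only so that the helper can be defined without the package installed.

```python
def page_counts(paths):
    """Return {path: number of pages} for each PDF, read in non-strict mode."""
    # Third-party dependency; the legacy PyPDF2 API from the question is assumed.
    import PyPDF2

    counts = {}
    for path in paths:
        # The with-block closes each file once its page count has been read.
        with open(path, "rb") as f:
            reader = PyPDF2.PdfFileReader(f, strict=False)
            counts[path] = reader.numPages
    return counts
```

Calling page_counts(['/somepath/a.pdf', '/somepath/b.pdf']) would return a dictionary that maps each path to its page count.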

Related

Need to extract text from a PDF file with python

I am trying to extract text from a PDF file, but it gives an error
PdfReadError: Could not read malformed PDF file
Can anyone guide me with how to proceed with this?
Here is the code
import os
import PyPDF2
dir_name='path to folder'
files=os.listdir(dir_name)
os.chdir(dir_name)
for j in files:
    print(j)
    print("In file")
    pdfFileObj = open(j, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pdfFile = pdfReader.getPage(0)
    #page_lines=pdfFile.extractText()
    print(pdfFile.extractText())
    pdfFileObj.close()
This might be happening because of the files in the directory you chdir into.
Make sure it contains nothing other than PDF files, or select files by extension and open only those ending in .pdf.
Here is similar code. Try running it on just the files that were reported as malformed.
import PyPDF2
pdfFileObj = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
pdfFileObj.close()
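One way to follow the advice about extensions is to build the file list from names ending in .pdf instead of looping over everything os.listdir returns. This is a minimal sketch using only the standard library; list_pdfs is a hypothetical helper name.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside folder, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        # Skip sub-directories and compare the extension case-insensitively.
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The returned paths already include the folder, so they can be passed straight to open(p, 'rb') without calling os.chdir first.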
Update
It has been observed that PyPDF2 does not always work properly. The module is only reliable for its numPages attribute. Other methods may or may not work as expected, and sometimes return nothing.
Try pdfminer for more robust extraction. It has a lot of options to explore.

Python does not print PDF with pyPDF2

I tried to print pages of a pdf document:
import PyPDF2
FILE_PATH = 'my.pdf'
with open(FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    page = reader.getPage(0)  # I also tried other pages, e.g. 1, 2, ...
    print(page.extractText())
But I only get a lot of blank space and no error message. Could it be that this pdf version (my.pdf) is not supported by PyPDF2?
This solved it (prints all pages of the document). Thanks
from pdfreader import SimplePDFViewer
fd = open("my.pdf", "rb")
viewer = SimplePDFViewer(fd)
for i in range(1, 16):  # range must run from 1 to the number of pages + 1
    viewer.navigate(i)
    viewer.render()
    page_content = viewer.canvas.text_content
    page_text = "".join(viewer.canvas.strings)
    print(page_text)
Try pdfreader
from pdfreader import SimplePDFViewer
fd = open("my.pdf", "rb")
viewer = SimplePDFViewer(fd)
viewer.render()
page_0_content=viewer.canvas.text_content
page_0_text = "".join(viewer.canvas.strings)
If the output is blank, the PDF is probably being opened, but its format cannot be read by PyPDF2, so extractText() returns nothing. Try an absolute file path instead of a relative one. If all else fails, try different PDFs; if some of them work and yours doesn't, you may need to convert yours to the format that works.

How to compress pdf without losing quality using PyPDF2 [duplicate]

I am struggling to compress my merged PDFs using the PyPDF2 module. This is my attempt, based on http://www.blog.pythonlibrary.org/2012/07/11/pypdf2-the-new-fork-of-pypdf/
import PyPDF2
path = open('path/to/hello.pdf', 'rb')
path2 = open('path/to/another.pdf', 'rb')
merger = PyPDF2.PdfFileMerger()
merger.append(fileobj=path2)
merger.append(fileobj=path)
pdf.filters.compress(merger)
merger.write(open("test_out2.pdf", 'wb'))
The error I receive is
TypeError: must be string or read-only buffer, not file
I have also tried compressing the PDF after the merge is complete. I judge my compression as failed by comparing it with the file size I got from PDFSAM with compression enabled.
Any thoughts? Thanks.
PyPDF2 doesn't have a reliable compression method. That said, there's a compress_content_streams() method with the following description:
Compresses the size of this page by joining all content streams and applying a FlateDecode filter.
However, it is possible that this function will perform no action if content stream compression becomes "automatic" for some reason.
Again, this won't make any difference in most cases but you can try this code:
from PyPDF2 import PdfReader, PdfWriter
writer = PdfWriter()
for pdf in ["path/to/hello.pdf", "path/to/another.pdf"]:
    reader = PdfReader(pdf)
    for page in reader.pages:
        page.compress_content_streams()
        writer.add_page(page)
with open("test_out2.pdf", "wb") as f:
    writer.write(f)
Your error says that it must be a string or read-only buffer, not a file.
So it's better to write your merger to an in-memory bytes buffer first.
import PyPDF2
from io import BytesIO
tmp = BytesIO()
path = open('path/to/hello.pdf', 'rb')
path2 = open('path/to/another.pdf', 'rb')
merger = PyPDF2.PdfFileMerger()
merger.append(fileobj=path2)
merger.append(fileobj=path)
merger.write(tmp)
PyPDF2.filters.compress(tmp.getvalue())
merger.write(open("test_out2.pdf", 'wb'))
The initial approach isn't that wrong. Just add the pages to your writer and compress them before writing to a file:
...
for i in range(reader.numPages):
    page = reader.getPage(i)
    writer.addPage(page)
for i in range(writer.getNumPages()):
    # Compress each page held by the writer, not just the last one read.
    writer.getPage(i).compressContentStreams()
...
pypdf offers several ways to reduce the file size: https://pypdf.readthedocs.io/en/latest/user/file-size.html
compress_content_streams is one of them. Its only disadvantage is that it can take a long time, depending on the PDF (think of it as ZIP for PDF):
from pypdf import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.compress_content_streams()  # This is CPU intensive!
    writer.add_page(page)
with open("out.pdf", "wb") as f:
    writer.write(f)
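Because content-stream compression often changes little, it may be worth measuring the result rather than assuming it helped. The sketch below compares the combined size of the input files with the size of the output, using only os.path.getsize. size_reduction is a hypothetical helper, not part of PyPDF2 or pypdf.

```python
import os


def size_reduction(input_paths, output_path):
    """Return the fraction of bytes saved by output_path relative to the inputs.

    A negative value means the output is larger than the inputs combined.
    """
    before = sum(os.path.getsize(p) for p in input_paths)
    if before == 0:
        raise ValueError("input files are empty")
    after = os.path.getsize(output_path)
    return (before - after) / before
```

For example, size_reduction(['path/to/hello.pdf', 'path/to/another.pdf'], 'test_out2.pdf') gives 0.25 when the merged file is a quarter smaller than the two inputs together.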

Reading and returning jpg file in Python, ASP.NET?

I have an image.asp file, using Python, in which I am attempting to open a JPEG image and write it to the response so that it can be retrieved from the relevant link. What I have currently:
<%@ LANGUAGE = Python%>
<%
path = "path/to/image.jpg"
with open(path, 'rb') as f:
    jpg = f.read()
Response.ContentType = "image/jpeg"
Response.WriteBinary(jpg)
%>
In a browser, this returns the following error:
The image "url/to/image.asp" cannot be displayed because it contains errors.
I suspect the issue is that I am just not writing contents of the jpg file correctly. What do I need to fix?
Your issue is here:
with open(url, 'rb') as f:
The variable that holds the path is named path, not url.
Change it to:
with open(path, 'rb') as f:
and it will work better.

python pdf (PyPDF2 module) - How to split/merge this?

I was trying to split and merge PDF files so that I can remove the first page of each file. Here's the code.
#python3
#split and merge pdf files!
import os, PyPDF2
pdfFiles = []
os.chdir('C:\\Users\\Cyber\\Downloads\\5-111-fall-2008\\5-111-fall-2008\\contents\\readings-and-lecture-notes')
for filename in os.listdir('.'):
    if filename.endswith('pdf'):
        pdfFiles.append(filename)
pdfWriter = PyPDF2.PdfFileWriter()
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)
pdfOutput = open('Merged.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()
And then I get the following error:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
I searched for that error and found that it indicates there may have been an issue with the creation of the PDF itself.
Though I get my Merged.pdf file as I wanted, I want to know what exactly the message means and how to avoid it.
This warning means that the first section of the xref table does not begin with object zero. There may have been an error in writing the PDF. If strict=False, PyPDF2 will try to correct the object ID numbers. If strict=True, they will not be corrected. The default is True. Try PyPDF2.PdfFileReader(pdfFileObj, strict=False).
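If the goal is only to keep the message out of the output, another option (a different technique from strict=False) is to filter it with Python's standard warnings module. This sketch assumes the message is emitted through warnings.warn, as the PdfReadWarning prefix suggests. silence_xref_warning is a hypothetical helper name.

```python
import re
import warnings

# Start of the message text shown in the question.
XREF_MESSAGE = "Xref table not zero-indexed"


def silence_xref_warning():
    """Ignore any warning whose text starts with XREF_MESSAGE."""
    # filterwarnings treats `message` as a regex matched at the start of the text.
    warnings.filterwarnings("ignore", message=re.escape(XREF_MESSAGE))
```

Call silence_xref_warning() once before the loop that creates the readers. The filter matches only on the start of the message text, so other warnings are still shown.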
