pypdf for lists of pdfs

I have pyPdf working fine for a single PDF file, but I cannot get it to work for a list of files, or in a for loop over multiple PDFs: it fails with "'str' object is not callable". Any ideas for a workaround?
def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
#print getPDFContent(r"Z:\GIS\MasterPermits\12300983.pdf").encode("ascii", "ignore")
#find pdfs
for root, dirs, files in os.walk(folder1):
    for file in files:
        if file.endswith(('.pdf')):
            d=os.path.join(root, file)
            print getPDFContent(d).encode("ascii", "ignore")
Traceback (most recent call last):
  File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 50, in <module>
    print getPDFContent(d).encode("ascii", "ignore")
  File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 32, in getPDFContent
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
TypeError: 'str' object is not callable
I was using a list before, but I got the exact same error. I didn't think this would be a big deal, but it is becoming one. I was able to work around similar issues in arcpy, but this is nothing like those.

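Try not to use built-in names as variable names. The loop variable file rebinds the name file to a string, so the later call file(path, "rb") tries to call that string instead of the built-in.
Don't do this:
    for file in files:
Do this instead:
    for myfile in files: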
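To see the mechanism in isolation, here is a minimal sketch. It is written for Python 3, where the built-in is open rather than file, and the names shadowing_demo and find_pdfs are made up for illustration; they are not from the question.

```python
import os


def shadowing_demo():
    """Reproduce the error by hiding a built-in behind a string."""
    open = "report.pdf"  # this local name now hides the built-in open()
    try:
        open("report.pdf", "rb")  # calls the string, not the built-in
    except TypeError as exc:
        return str(exc)


def find_pdfs(folder):
    """Return the sorted full paths of all .pdf files under folder."""
    pdf_paths = []
    for root, dirs, filenames in os.walk(folder):
        for filename in filenames:  # a name that shadows no built-in
            if filename.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(root, filename))
    return sorted(pdf_paths)
```

shadowing_demo() returns the same "'str' object is not callable" message as in the traceback, while find_pdfs leaves every built-in intact for whatever reads the files afterwards.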

Related

Encrypting many PDFs by python using PyPDF2

I am trying to write a Python program that loops through all files in a folder, selects those with the extension '.pdf', and encrypts them with restricted permissions. I am using this version of the PyPDF2 library:
https://github.com/vchatterji/PyPDF2 (a modification of the original PyPDF2 that also allows setting permissions). I have tested it with a single PDF file and it works fine. I want the original PDF file to be deleted and the encrypted one to remain under the same name.
Here is my code:
import os
import PyPDF2
directory = './'
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        pdfFile = open(filename, 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFile)
        pdfWriter = PyPDF2.PdfFileWriter()
        for pageNum in range(pdfReader.numPages):
            pdfWriter.addPage(pdfReader.getPage(pageNum))
        pdfFile.close()
        os.remove(filename)
        pdfWriter.encrypt('', 'ispat', perm_mask=-3904)
        resultPdf = open(filename, 'wb')
        pdfWriter.write(resultPdf)
        resultPdf.close()
        continue
    else:
        continue
It gives the following error:
C:\Users\manul\Desktop\ghh>python encrypter.py
Traceback (most recent call last):
  File "encrypter.py", line 9, in <module>
    pdfReader = PyPDF2.PdfFileReader(pdfFile)
  File "C:\Users\manul\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 1153, in __init__
    self.read(stream)
  File "C:\Users\manul\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 1758, in read
    stream.seek(-1, 2)
OSError: [Errno 22] Invalid argument
I have some PDFs stored in 'ghh' folder on Desktop. Any help is greatly appreciated.

Using pdfReader = PyPDF2.PdfFileReader(filename) will make the reader work, but this specific error is caused by your files being empty. You can check the file sizes with os.path.getsize(filename). Your files were probably wiped because the script deletes the original file, creates a new file with open(filename, "wb"), and then terminates on an error raised by pdfWriter.write(resultPdf), leaving an empty file under the original name.
Passing a file name instead of a file object to PdfFileReader, as mentioned, resolves the error that occurs with pdfWriter (I don't know why), but you'll need to replace any empty files in your directory with copies of the original PDFs to get rid of the OSError.
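One way to avoid wiping files when the write step fails is to write to a temporary file first and only swap it in on success. The sketch below shows that pattern in Python 3 with no PDF library involved. rewrite_in_place and its transform argument are hypothetical names; in the question's setting, transform would be whatever reads the source PDF and writes the encrypted one.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Replace path with transform's output, but only if transform succeeds.

    transform(src, dst) reads from the binary file object src and writes
    the new content to the binary file object dst.
    Returns False (and does nothing) for empty input files.
    """
    if os.path.getsize(path) == 0:
        return False  # an empty file cannot be a valid PDF; skip it
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            transform(src, dst)
        os.replace(tmp_path, path)  # the original is swapped out only here
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up; the original stays untouched
        raise
    return True
```

If transform raises, the original file still has its old content, so a failed run can simply be repeated.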

saving a variable filename in reportlab

I keep getting an error ("no attribute for string .pdf") with my code. It works with a static filename, but not with a variable. I have read the other examples of opening a file by name, but none of them are inline the way reportlab needs.
testid = datetime.datetime.now().strftime("%y%m%d_%H%M%S")
basename = str("cert")
filename = "_".join([basename, testid])
print filename
canvas = canvas.Canvas(filename.pdf, pagesize=letter)

Here's a working example that fails on the last line, but should demonstrate the problem:
#!python2
#coding=utf-8
import datetime
testid = datetime.datetime.now().strftime("%y%m%d_%H%M%S")
basename = str("cert")
filename = "_".join([basename, testid])
print filename
print "filename.pdf"
print filename.pdf
Output:
cert_170508_152300
filename.pdf
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    print filename.pdf
AttributeError: 'str' object has no attribute 'pdf'
filename.pdf is not a string. Because filename exists (it is a string), Python treats .pdf as an attribute lookup on it, and strings have no pdf attribute, just as the error says. Canvas needs a string as its parameter.
This should work:
canvas = canvas.Canvas(filename, pagesize=letter)
This appends the .pdf suffix:
canvas = canvas.Canvas(filename + '.pdf', pagesize=letter)

If the variable doesn't already contain the extension, combine filename with the .pdf extension:
canvas = canvas.Canvas(filename+'.pdf', pagesize=letter)
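The name-building part can be checked without reportlab. This is a small Python 3 sketch; with_pdf_suffix and make_pdf_name are illustrative helper names, and the timestamp is passed in so the result is reproducible.

```python
import datetime


def with_pdf_suffix(name):
    """Append '.pdf' unless the name already ends with it."""
    return name if name.lower().endswith(".pdf") else name + ".pdf"


def make_pdf_name(basename, when):
    """Build names such as cert_170508_152300.pdf from a datetime."""
    testid = when.strftime("%y%m%d_%H%M%S")
    return with_pdf_suffix("_".join([basename, testid]))


print(make_pdf_name("cert", datetime.datetime(2017, 5, 8, 15, 23, 0)))
# prints cert_170508_152300.pdf
```

The returned string is what would be passed as the first argument to Canvas.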

How to get pyPdf to work with os or glob

My goal is to read a directory with several PDF files and return the number of pages in each file using Python. I'm trying to use the pyPdf library but it fails.
If I do this:
from pyPdf import PdfFileReader
testFile = "C:\\path\\file.pdf"
pdfFile = PdfFileReader(file(testFile, 'rb'))
print pdfFile.getNumPages()
I get a result.
If I do this, it fails:
pdfList = []
for root, dirs, files in os.walk("C:\\path"):
    for file in files:
        pdfList.append(os.path.join(root, file))
for item in pdfList:
    targetPdf = PdfFileReader(file(item,'rb'))
    numPages = targetPdf.getNumPages()
    print item, numPages
This always results in:
TypeError: 'str' object is not callable
If I try to recreate a pyPdf object manually, I get the same thing.
What am I doing wrong?

The issue is caused by using the name file as a variable.
You use file as the loop variable in the first for loop, and then call it as a function in the statement targetPdf = PdfFileReader(file(item,'rb')). By then, file refers to a string (the last file name), not the built-in.
Try renaming the loop variable in the first for loop from file to fileName.
Hope that helps.
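Since the title mentions glob, here is a hedged Python 3 sketch of the same loop using glob instead of os.walk. page_counts is a made-up helper, and the page counter is passed in as a callable so the sketch does not depend on any particular PDF library. With a current pypdf release, something like lambda p: len(pypdf.PdfReader(p).pages) could serve as count_pages, but that is an assumption about the installed version, not something taken from the question.

```python
import glob
import os


def page_counts(folder, count_pages):
    """Map each .pdf path under folder to count_pages(path)."""
    # '**' with recursive=True also matches files directly in folder
    pattern = os.path.join(folder, "**", "*.pdf")
    results = {}
    for pdf_path in sorted(glob.glob(pattern, recursive=True)):
        results[pdf_path] = count_pages(pdf_path)
    return results
```

No loop variable here is named after a built-in, so the reader call inside count_pages is free to use open or any other built-in it needs.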

Python Minidom Parsing File Objects

I wrote some code using minidom that takes an XML file, opens it as a file object, and then parses that file object. Beyond that, I want the script to open multiple files that are all contained in a folder and parse each one individually.
An example of the xml script is:
<?xml version="1.0"?>
<Data>
    <data1>1</data1>
    <data2>2</data2>
    <data3>3</data3>
    <Sub_data>
        <sub_data1>0.1111111111111</sub_data1>
        <sub_data2>0.2222222222222</sub_data2>
... and so on.
i.e., it's pretty standard.
Now, my code looks like this:
import os
import io
from xml.dom import minidom
#folder where xml files are located
indir = '/foo/bar/docs/'
masterlist = []
for root, dirs, filenames in os.walk(indir):
    for f in filenames:
        row = []
        fsock = io.open(indir + f, mode = 'rt', encoding = 'cp1252')
        xmldoc = minidom.parse(fsock)
        ...
and the error I am getting is:
Traceback (most recent call last):
  File "kgp_2.py", line 34, in <module>
    xmldoc = minidom.parse(fsock)
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 211, in parseFile
    parser.Parse("", True)
xml.parsers.expat.ExpatError: no element found: line 203, column 1381
Now, when I make the change:
fsock = io.open(indir + filenames[0], mode = 'rt', encoding = 'cp1252')
this works fine, that is, it opens the first file in the folder; but I want to parse all the files in the folder. When I do a loop like:
m = 0
... in loop:
fsock = io.open(indir + filenames[m], mode = 'rt', encoding = 'cp1252')
...
m = m+1
I get the original error.
The reason I am using the io library instead of the usual file open function is that a previous stack overflow article recommended it. Using:
fsock = open(indir + filenames[0])
like before, gets no error, but:
fsock = open(indir + f)
or
#with a loop over m, like above
fsock = open(indir + filenames[m])
get the same error as above.
A strange problem. When I print the filenames they are correct, and the files are being opened without error. It's the parser that won't parse the file objects, even with filenames[m] where m = 0. Surely this should be no problem?
EDIT:
In the post "Parsing document with python minidom" they had a similar problem; the resolution was to use
xmldoc.seek(0)
However, for me this returns
Traceback (most recent call last):
  File "kgp_2.py", line 45, in <module>
    xmldoc.seek(0)
AttributeError: Document instance has no attribute 'seek'
EDIT 2: THIS HAS BEEN RESOLVED. IT WAS A CASE OF A CORRUPTED INPUT XML FILE.

Are you sure the XML data in all of the files is correct? Perhaps one is empty, and you have to handle the resulting exception. In any case, I recommend using xml.etree instead (see its documentation).
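To find out which file is the bad one, the exception can be caught per file. A minimal Python 3 sketch follows; parse_folder is a hypothetical helper name, and it passes the path straight to minidom.parse instead of opening the file by hand.

```python
import os
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_folder(indir):
    """Parse every .xml file under indir.

    Returns (parsed, failed): parsed maps path -> Document, and
    failed maps path -> parser error message.
    """
    parsed, failed = {}, {}
    for root, dirs, filenames in os.walk(indir):
        for name in filenames:
            if not name.lower().endswith(".xml"):
                continue
            path = os.path.join(root, name)  # join with root, not indir
            try:
                parsed[path] = minidom.parse(path)
            except ExpatError as exc:
                failed[path] = str(exc)  # empty or corrupted file
    return parsed, failed
```

Printing the keys of failed after a run shows exactly which inputs are empty or corrupted, which matches how the question was eventually resolved.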

"'NoneType' object is not iterable" error

I just wrote my first Python program! I get zip files as mail attachments, which are saved in a local folder. The program checks whether there is a new file and, if there is one, extracts the zip file to a different folder depending on the file name. When I run my code I get the following error:
Traceback (most recent call last):
  File "C:/Zip/zipauto.py", line 28, in <module>
    for file in new_files:
TypeError: 'NoneType' object is not iterable
Can anyone please tell me where I am going wrong?
Thanks a lot for your time,
Navin
Here is my code:
import zipfile
import os
ROOT_DIR = 'C://Zip//Zipped//'
destinationPath1 = "C://Zip//Extracted1//"
destinationPath2 = "C://Zip//Extracted2//"
def check_for_new_files(path=ROOT_DIR):
    new_files=[]
    for file in os.listdir(path):
        print "New file found ... ", file
def process_file(file):
    sourceZip = zipfile.ZipFile(file, 'r')
    for filename in sourceZip.namelist():
        if filename.startswith("xx") and filename.endswith(".csv"):
            sourceZip.extract(filename, destinationPath1)
        elif filename.startswith("yy") and filename.endswith(".csv"):
            sourceZip.extract(filename, destinationPath2)
    sourceZip.close()
if __name__=="__main__":
    while True:
        new_files=check_for_new_files(ROOT_DIR)
        for file in new_files: # fails here
            print "Unzipping files ... ", file
            process_file(ROOT_DIR+"/"+file)

check_for_new_files has no return statement, and therefore implicitly returns None. As a result,
new_files=check_for_new_files(ROOT_DIR)
sets new_files to None, and you cannot iterate over None.
Return the files that were read in check_for_new_files:
def check_for_new_files(path=ROOT_DIR):
    new_files = os.listdir(path)
    for file in new_files:
        print "New file found ... ", file
    return new_files

Here is the answer to your NEXT 2 questions:
(1) while True:: your code will loop forever.
(2) your function check_for_new_files doesn't check for new files, it checks for any files. You need to either move each incoming file to an archive directory after it's been processed, or use some kind of timestamp mechanism.
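The archive-directory option from point (2) could look roughly like this in Python 3. process_and_archive, inbox, archive, and handler are hypothetical names; handler stands in for the question's process_file.

```python
import os
import shutil


def process_and_archive(inbox, archive, handler):
    """Run handler on each file in inbox, then move it to archive.

    Returns the names that were processed in this pass, so a second
    call with no new arrivals returns an empty list.
    """
    os.makedirs(archive, exist_ok=True)
    processed = []
    for name in sorted(os.listdir(inbox)):
        path = os.path.join(inbox, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        handler(path)
        shutil.move(path, os.path.join(archive, name))
        processed.append(name)
    return processed
```

Called from a loop that sleeps between passes, each zip file is then handled once, because it is no longer in the inbox on the next pass.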

For example, with student_grade = dict(zip(names, grades)), make sure names and grades are both iterables such as lists, and that neither is None; passing None to zip raises a TypeError. This helped me.
