How to get pyPdf to work with os or glob - python

My goal is to read a directory with several PDF files and return the number of pages in each file using Python. I'm trying to use the pyPdf library but it fails.
If I do this:
from pyPdf import PdfFileReader
testFile = "C:\\path\\file.pdf"
pdfFile = PdfFileReader(file(testFile, 'rb'))
print pdfFile.getNumPages()
I get a result.
But if I do this, it fails:
import os
pdfList = []
for root, dirs, files in os.walk("C:\\path"):
    for file in files:
        pdfList.append(os.path.join(root, file))
for item in pdfList:
    targetPdf = PdfFileReader(file(item,'rb'))
    numPages = targetPdf.getNumPages()
    print item, numPages
This always results in:
TypeError: 'str' object is not callable
If I try to recreate a pyPdf object manually, I get the same thing.
What am I doing wrong?

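The problem is that you use the name file as a variable.
In the first for loop, file is the loop variable, so it is rebound to a string (the last file name).
Later, in the statement targetPdf = PdfFileReader(file(item,'rb')), you call file as a function, but by then it refers to that string rather than to the built-in.
Try renaming the loop variable in the first for loop from file to fileName.
Hope that helps.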
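To see why the call fails, here is a minimal sketch of the same shadowing effect. It assumes Python 3, where the file built-in from the question no longer exists, so open stands in for it. The function name call_shadowed_builtin is hypothetical and only serves the illustration.

```python
def call_shadowed_builtin(names):
    """Rebind a built-in name in a loop, then try to call it (names must be non-empty)."""
    # The loop variable is deliberately called 'open', mirroring
    # 'for file in files:' in the question. After the loop, 'open'
    # refers to the last string in names, not to the built-in function.
    for open in names:
        pass
    try:
        open("anything.pdf", "rb")  # this now calls a str object
    except TypeError as exc:
        return str(exc)
    return None


print(call_shadowed_builtin(["a.pdf", "b.pdf"]))
```

Renaming the loop variable (for example to file_name) leaves the built-in untouched, which is the fix suggested above.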

Related

python3 - how to produce output files whose names are modifications of the input files names?

Hello, I'm using a Python script that consists of the following code:
from Bio import SeqIO
# set input file, output file to write to
gbk_file = "bin.10.gbk"
tsv_file = "results.bin_10.tsv"
cluster_out = open(tsv_file, "w")
# Extract cluster info and write it to the file
for seq_record in SeqIO.parse(gbk_file, "genbank"):
    for seq_feat in seq_record.features:
        if seq_feat.type == "protocluster":
            cluster_number = seq_feat.qualifiers["protocluster_number"][0].replace(" ","_").replace(":","")
            cluster_type = seq_feat.qualifiers["product"][0]
            cluster_out.write("#"+cluster_number+"\tCluster Type:"+cluster_type+"\n")
The issue is that I want to automate this script for multiple files in a directory: gbk_file should cover every file with the .gbk suffix, and tsv_file should be a matching output file for each input file.
So if an input file is named "bin.10.gbk", the output will be "results.bin_10.tsv".
I tried using Python's glob module, but I don't know how to create a tsv_file variable that stores modified strings derived from the input file names:
import glob
# setting variables
gbk_files = glob.glob("*.gbk")
tsv_files = gbk_files.replace(".gbk",".results.tsv")
cluster_out = open(tsv_files, "w")
After making those changes, I got the following error:
AttributeError: 'list' object has no attribute 'replace'
So how can I deal with this?
Thanks for reading :)
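Hope the following function helps. glob.glob returns a list, so each file name has to be handled inside a loop:
import glob

def processfiles():
    for file in glob.glob("*.gbk"):
        # "bin.10.gbk" splits into ['bin', '10', 'gbk']
        names = file.split('.')
        tsv_file = f'results.{names[-3]}_{names[-2]}.tsv'
        # the with block closes the file, so no explicit close() is needed
        with open(tsv_file, 'w') as tsv:
            tsv.write('write your content here')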
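As an alternative to indexing into the split name, the output name can be derived from the input name with os.path.splitext. This is only a sketch: tsv_name_for and plan_outputs are hypothetical helper names, and it assumes the naming rule from the question ("bin.10.gbk" becomes "results.bin_10.tsv").

```python
import glob
import os


def tsv_name_for(gbk_name):
    """Derive the output name for one input file (hypothetical helper)."""
    # Drop any directory part and the final ".gbk" extension.
    stem, _ext = os.path.splitext(os.path.basename(gbk_name))
    # Remaining dots become underscores: "bin.10" -> "bin_10".
    return "results." + stem.replace(".", "_") + ".tsv"


def plan_outputs(pattern="*.gbk"):
    """Pair every file matching the pattern with its derived output name."""
    return [(gbk, tsv_name_for(gbk)) for gbk in sorted(glob.glob(pattern))]


print(tsv_name_for("bin.10.gbk"))
```

Each (input, output) pair returned by plan_outputs could then be fed to the SeqIO loop from the question, opening the output file once per input.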

Encrypting many PDFs by python using PyPDF2

I am trying to write a Python program that loops through all files in a folder, selects those with the extension '.pdf', and encrypts them with restricted permissions. I am using this version of the PyPDF2 library:
https://github.com/vchatterji/PyPDF2 (a modification of the original PyPDF2 that also allows setting permissions). I have tested it with a single PDF file and it works fine. I want the original PDF file to be deleted and the encrypted one to remain under the same name.
Here is my code:
import os
import PyPDF2
directory = './'
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        pdfFile = open(filename, 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFile)
        pdfWriter = PyPDF2.PdfFileWriter()
        for pageNum in range(pdfReader.numPages):
            pdfWriter.addPage(pdfReader.getPage(pageNum))
        pdfFile.close()
        os.remove(filename)
        pdfWriter.encrypt('', 'ispat', perm_mask=-3904)
        resultPdf = open(filename, 'wb')
        pdfWriter.write(resultPdf)
        resultPdf.close()
        continue
    else:
        continue
It gives the following error:
C:\Users\manul\Desktop\ghh>python encrypter.py
Traceback (most recent call last):
  File "encrypter.py", line 9, in <module>
    pdfReader = PyPDF2.PdfFileReader(pdfFile)
  File "C:\Users\manul\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 1153, in __init__
    self.read(stream)
  File "C:\Users\manul\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 1758, in read
    stream.seek(-1, 2)
OSError: [Errno 22] Invalid argument
I have some PDFs stored in 'ghh' folder on Desktop. Any help is greatly appreciated.
Using pdfReader = PyPDF2.PdfFileReader(filename) will make the reader work, but this specific error is caused by your files being empty. You can check the file sizes with os.path.getsize(filename). Your files were probably wiped because the script deletes the original file, then creates a new file with open(filepath, "wb"), and then it terminates incorrectly due to an error that occurs with pdfWriter.write(resultPdf), leaving an empty file with the original file name.
Passing a file name instead of a file object to PdfFileReader as mentioned resolves the error that occurs with pdfWriter (I don't know why), but you'll need to replace any empty files in your directory with copies of the original pdfs to get rid of the OSError.
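One way to avoid losing the originals when the write step fails is to write to a temporary file first and only swap it in once writing has succeeded. The sketch below is library-agnostic and makes some assumptions: replace_file_safely and non_empty_pdfs are hypothetical helper names, and write_new_content is a callback you would supply (for example, one that calls pdfWriter.write on the file object it receives).

```python
import os
import tempfile


def replace_file_safely(path, write_new_content):
    """Write new content to a temp file, then swap it in for path.

    write_new_content is a hypothetical callback that receives a binary
    file object opened for writing.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            write_new_content(tmp)
        # The original is only replaced after the write has succeeded.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, discard the partial temp file and keep the original.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def non_empty_pdfs(directory):
    """Yield paths of .pdf files in directory that are not zero bytes long."""
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if name.lower().endswith(".pdf") and os.path.getsize(full_path) > 0:
            yield full_path
```

With this pattern there is no need to call os.remove on the original beforehand, and skipping zero-byte files avoids the OSError described above.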

Error when trying to read and write multiple files

I modified the code based on the comments from experts in this thread. The script now reads and writes all the individual files: it iterates over the matches, highlights them, and writes the output. The remaining issue is that, after highlighting the last instance of a search term, the script drops all the content that follows that last instance in each file's output.
Here is the modified code:
import os
import sys
import re
source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)
for f in listfiles:
    filepath = source+'\\'+f
    infile = open(filepath, 'r+')
    source_content = infile.read()
    color = ('red')
    regex = re.compile(r"(\b be \b)|(\b by \b)|(\b user \b)|(\bmay\b)|(\bmight\b)|(\bwill\b)|(\b's\b)|(\bdon't\b)|(\bdoesn't\b)|(\bwon't\b)|(\bsupport\b)|(\bcan't\b)|(\bkill\b)|(\betc\b)|(\b NA \b)|(\bfollow\b)|(\bhang\b)|(\bbelow\b)", re.I)
    i = 0; output = ""
    for m in regex.finditer(source_content):
        output += "".join([source_content[i:m.start()],
                           "<strong><span style='color:%s'>" % color[0:],
                           source_content[m.start():m.end()],
                           "</span></strong>"])
        i = m.end()
    outfile = open(filepath, 'w+')
    outfile.seek(0)
    outfile.write(output)
    print "\nProcess Completed!\n"
    infile.close()
    outfile.close()
raw_input()
The error message tells you what the error is:
No such file or directory: 'sample1.html'
Make sure the file exists, or wrap the call in a try statement to give it a default behavior.
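A rough sketch of the try approach, written for Python 3 and not taken from the original thread: it builds each path with os.path.join and skips entries that cannot be opened. The function name read_all is hypothetical.

```python
import os


def read_all(source_dir):
    """Return {file name: text} for every readable entry in source_dir."""
    contents = {}
    for name in os.listdir(source_dir):
        # Join the directory and the bare name; os.listdir returns names only.
        path = os.path.join(source_dir, name)
        try:
            with open(path, "r") as handle:
                contents[name] = handle.read()
        except OSError as exc:
            # Default behavior: report the problem and move on.
            print("Skipping %s: %s" % (path, exc))
    return contents
```

Subdirectories and files that vanish between listing and opening both raise an OSError subclass, so they are skipped rather than stopping the loop.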
You get that error because the Python script does not know where the files you want to open are located.
You have to provide the full file path to open(), as I have done below. I simply concatenated the source path, '\\', and the file name, and saved the result in a variable named filepath. Pass this variable to open().
import os
import sys
source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)
for f in listfiles:
    filepath = source+'\\'+f # This is the file path
    infile = open(filepath, 'r')
There are also a couple of other problems with your code. If you want to open the file for both reading and writing, you have to use r+ mode. Moreover, on Windows, if you open a file in r+ mode, you may have to call file.seek() before file.write() to avoid another issue.
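As for the truncated output described in the updated question: the loop in the posted code only copies text up to each match and never appends what follows the last match. A minimal Python 3 sketch of the same highlighting with the tail included is shown here. The function name highlight and the shortened pattern are illustrative assumptions, not the original code.

```python
import re


def highlight(text, pattern, color="red"):
    """Wrap every match of pattern in a colored <strong><span> pair."""
    position = 0
    parts = []
    for match in pattern.finditer(text):
        # Text between the previous match and this one, unchanged.
        parts.append(text[position:match.start()])
        parts.append(
            "<strong><span style='color:%s'>%s</span></strong>"
            % (color, match.group(0))
        )
        position = match.end()
    # The step missing from the question: keep everything after the last match.
    parts.append(text[position:])
    return "".join(parts)


# Shortened, illustrative pattern; the question uses a longer alternation.
words = re.compile(r"\bmay\b|\bwill\b", re.I)
print(highlight("It may rain today", words))
```

In the original loop, the equivalent change would be adding output += source_content[i:] after the for m in ... loop finishes.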

pypdf for lists of pdfs

I have got pyPdf to work just fine for a single PDF file, but I cannot get it to work for a list of files, or in a for loop over multiple PDFs, without it failing because the string is not callable. Any ideas for a workaround?
def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
#print getPDFContent(r"Z:\GIS\MasterPermits\12300983.pdf").encode("ascii", "ignore")
#find pdfs
for root, dirs, files in os.walk(folder1):
    for file in files:
        if file.endswith(('.pdf')):
            d=os.path.join(root, file)
            print getPDFContent(d).encode("ascii", "ignore")
Traceback (most recent call last):
  File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 50, in <module>
    print getPDFContent(d).encode("ascii", "ignore")
  File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 32, in getPDFContent
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
TypeError: 'str' object is not callable
I was using a list, but I got the exact same error. I didn't think this would be a big deal, but it is becoming one. I know I was able to work around similar issues in arcpy, but this is nothing like those.
Try not to use built-in types for variable names:
Don't do this:
for file in files:
Do this instead:
for myfile in files:
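Put together, the directory walk with a non-shadowing loop variable might look like the following Python 3 sketch. collect_pdf_paths is a hypothetical helper name. Each returned path could then be passed to getPDFContent from the question, whose call to the file built-in (or to open) would no longer be shadowed.

```python
import os


def collect_pdf_paths(top):
    """Return the paths of all .pdf files below top, sorted."""
    pdf_paths = []
    for root, _dirs, file_names in os.walk(top):
        # 'file_name' does not collide with any built-in name.
        for file_name in file_names:
            if file_name.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(root, file_name))
    return sorted(pdf_paths)
```

The case-insensitive suffix check is an addition of this sketch; the question's code matches '.pdf' in lowercase only.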

"'NoneType' object is not iterable" error

I just wrote my first Python program! I get zip files as mail attachments, which are saved in a local folder. The program checks whether there is any new file and, if there is one, extracts the zip file to a different folder depending on the file name. When I run my code, I get the following error:
Traceback (most recent call last):
  File "C:/Zip/zipauto.py", line 28, in <module>
    for file in new_files:
TypeError: 'NoneType' object is not iterable
Can anyone please tell me where I am going wrong?
Thanks a lot for your time,
Navin
Here is my code:
import zipfile
import os
ROOT_DIR = 'C://Zip//Zipped//'
destinationPath1 = "C://Zip//Extracted1//"
destinationPath2 = "C://Zip//Extracted2//"
def check_for_new_files(path=ROOT_DIR):
    new_files=[]
    for file in os.listdir(path):
        print "New file found ... ", file
def process_file(file):
    sourceZip = zipfile.ZipFile(file, 'r')
    for filename in sourceZip.namelist():
        if filename.startswith("xx") and filename.endswith(".csv"):
            sourceZip.extract(filename, destinationPath1)
        elif filename.startswith("yy") and filename.endswith(".csv"):
            sourceZip.extract(filename, destinationPath2)
    sourceZip.close()
if __name__=="__main__":
    while True:
        new_files=check_for_new_files(ROOT_DIR)
        for file in new_files: # fails here
            print "Unzipping files ... ", file
            process_file(ROOT_DIR+"/"+file)
check_for_new_files has no return statement, and therefore implicitly returns None. As a result,
new_files=check_for_new_files(ROOT_DIR)
sets new_files to None, and you cannot iterate over None.
Return the files that were read in check_for_new_files:
def check_for_new_files(path=ROOT_DIR):
    new_files = os.listdir(path)
    for file in new_files:
        print "New file found ... ", file
    return new_files
Here is the answer to your NEXT 2 questions:
(1) while True:: your code will loop forever.
(2) your function check_for_new_files doesn't check for new files, it checks for any files. You need to either move each incoming file to an archive directory after it's been processed, or use some kind of timestamp mechanism.
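A third option for point (2), alongside an archive directory or timestamps, is to remember which names have already been handled. The following Python 3 sketch assumes a single long-running process; find_new_files and the seen set are hypothetical names, and the set is lost when the program restarts.

```python
import os


def find_new_files(path, seen):
    """Return names in path that are not yet in seen, and record them."""
    new_files = [name for name in sorted(os.listdir(path)) if name not in seen]
    # Remember these names so the next call does not report them again.
    seen.update(new_files)
    return new_files
```

In the main loop, seen = set() would be created once before while True:, and a short time.sleep between iterations would keep the loop from spinning at full speed.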
For example, with student_grade = dict(zip(names, grades)), make sure that names and grades are both actual lists rather than None, each with items to iterate over. This helped me.
