urllib urlretrieve only saving final image in list of urls - python

I'm fairly new to using Python. I have been trying to set up a very basic web scraper to help speed up my workday; it is supposed to download images from a section of a website and save them.
I have a list of URLs, and I am trying to use urllib.request.urlretrieve to download all the images.
The output location (savepath) updates so it adds 1 to the current highest number in the folder.
I've tried a bunch of different ways, but urlretrieve only saves the image from the last URL in the list. Is there a way to download all the images in the URL list?
to_download = ['url1', 'url2', 'url3', 'url4']
for t in to_download:
    urllib.request.urlretrieve(t, savepath)
This is the code I was trying to use to update the savepath every time
def getNextFilePath(photos):
    highest_num = 0
    for f in os.listdir(photos):
        if os.path.isfile(os.path.join(photos, f)):
            file_name = os.path.splitext(f)[0]
            try:
                file_num = int(file_name)
                if file_num > highest_num:
                    highest_num = file_num
            except ValueError:
                print('The file name "%s" is not an integer. Skipping' % file_name)
    output_file = os.path.join(output_folder, str(highest_num + 1))
    return output_file

As suggested by @vks, you need to update savepath (otherwise you save each URL onto the same file). One way to do so is to use enumerate:
from urllib import request

to_download = ['https://edition.cnn.com/', 'https://edition.cnn.com/', 'https://edition.cnn.com/', 'https://edition.cnn.com/']
for i, url in enumerate(to_download):
    save_path = f'website_{i}.txt'
    print(save_path)
    request.urlretrieve(url, save_path)
which you may want to contract into:
from urllib import request
to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']
[request.urlretrieve(url, f'website_{i}.txt') for i, url in enumerate(to_download)]
See: the Python 3 documentation for enumerate, an example of enumerate, and an example of using an f-string with a {variable}.
FOR THE SECOND PART OF THE QUESTION:
Not sure what you are trying to achieve, but:
def getNextFilePath(photos):
    file_list = os.listdir(photos)
    file_list = [int(s) for s in file_list if s.isdigit()]
    print(file_list)
    max_id_file = max(file_list)
    print(f'max id:{max_id_file}')
    output_file = os.path.join(output_folder, str(max_id_file + 1))
    print(f'output file path:{output_file}')
    return output_file
This will hopefully find all files that are named with digits (IDs), find the highest ID, and return a new file name as max_id + 1. I guess that this will replace the save_path in your example.
Quickly coding, and modifying the above function so that it returns the max_id and not the path, the below code should be a working example using the iterator:
import os
from urllib import request

photo_folder = os.path.curdir

def getNextFilePath(photos):
    file_list = os.listdir(photos)
    print(file_list)
    file_list = [int(os.path.splitext(s)[0]) for s in file_list if os.path.splitext(s)[0].isdigit()]
    if not file_list:
        return 0
    print(file_list)
    max_id_file = max(file_list)
    #print(f'max id:{max_id_file}')
    #output_file = os.path.join(photo_folder, str(max_id_file + 1))
    #print(f'output file path:{output_file}')
    return max_id_file

def download_pic(to_download):
    start_id = getNextFilePath(photo_folder)
    for i, url in enumerate(to_download):
        # +1 so the first download does not overwrite the current highest-numbered file
        save_path = f'{i + start_id + 1}.png'
        output_file = os.path.join(photo_folder, save_path)
        print(output_file)
        request.urlretrieve(url, output_file)
You should add exception handling, etc., but this seems to be working, if I understood correctly.
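For instance, a minimal sketch of adding that exception handling (the URLError catch and the download_pic_safe name are my own additions, not part of the original answer; getNextFilePath and photo_folder are as defined above):
import os
from urllib import request, error

def download_pic_safe(to_download):
    start_id = getNextFilePath(photo_folder)
    for i, url in enumerate(to_download):
        output_file = os.path.join(photo_folder, f'{i + start_id + 1}.png')
        try:
            request.urlretrieve(url, output_file)
        except error.URLError as exc:
            # skip URLs that fail to download instead of aborting the whole run
            print(f'could not download {url}: {exc}')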

Are you updating savepath? If you pass the same savepath to each loop iteration, it is likely just overwriting the same file over and over.
Hope that helps, happy coding!


Get PDF attachments using Python

I admit that I am new to Python.
We have to process PDF files with attachments or annotated attachments. I am trying to extract attachments from a PDF file using PyPDF2 library.
The only (!) example found on GitHub contains the following code:
import PyPDF2

def getAttachments(reader):
    catalog = reader.trailer["/Root"]
    # VK
    print(catalog)
    #
    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
And the call is:
rootdir = "C:/Users/***.pdf" # My file path
handler = open(rootdir, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
I am getting a KeyError: '/EmbeddedFiles'
A print of the catalog indeed does not contain EmbeddedFiles:
{'/Extensions': {'/ADBE': {'/BaseVersion': '/1.7', '/ExtensionLevel': 3}}, '/Metadata': IndirectObject(2, 0), '/Names': IndirectObject(5, 0), '/OpenAction': IndirectObject(6, 0), '/PageLayout': '/OneColumn', '/Pages': IndirectObject(3, 0), '/PieceInfo': IndirectObject(7, 0), '/Type': '/Catalog'}
This particular PDF contains 9 attachments. How can I get them?
Too long for comments, and I have not personally tested this code, which looks very similar to your outline in the question; however, I am adding it here for others to test. It is the subject of a pull request, https://github.com/mstamy2/PyPDF2/pull/440, and the full updated sequence is described by Kevin M Loeffler in https://kevinmloeffler.com/2018/07/08/how-to-extract-pdf-file-attachments-using-python-and-pypdf2/
Viewable at https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38#file-extract_pdf_attachments-py
Download as:
https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38/raw/acdc194058f9fa2c4d2619a4c623d0efeec32555/extract_pdf_attachments.py
It always helps if you can provide an example input of the type you have problems with so that others can adapt the extraction routine to suit.
In response to getting an error:
"I’m guessing the script is breaking because the embedded files section of the PDF doesn’t always exist, so trying to access it throws an error."
"Something I would try is to put everything after the ‘catalog’ line in the get_attachments method in a try-catch."
Unfortunately there are many pending pull requests not yet merged into PyPDF2 (https://github.com/mstamy2/PyPDF2/pulls), and others may also be relevant or needed to address this and other shortcomings, so you need to see if any of those may also help.
For one pending example of a try/except that you might be able to include and adapt for other use cases, see https://github.com/mstamy2/PyPDF2/pull/551/commits/9d52ef517319b538f007669631ba6b778f8ec3a3
Associated keywords for embedded files, apart from /Type /EmbeddedFiles, include /Type /Filespec and /Subtype /FileAttachment; note the pairs may not always have spaces, so perhaps see if those can be interrogated for the attachments.
Again, on that last point: the example searches for /EmbeddedFiles as indexed in the plural, whilst any individual entry itself is identified in the singular.
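As a rough sketch of that try/except advice (my own untested adaptation of the question's code, not the linked pull request):
import PyPDF2

def getAttachments(reader):
    catalog = reader.trailer["/Root"]
    attachments = {}
    try:
        # not every PDF has an /EmbeddedFiles name tree, so guard the lookup
        fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
    except KeyError:
        # no embedded files section: return an empty dict instead of crashing
        return attachments
    # the /Names array alternates between a file name and its file spec
    for i in range(0, len(fileNames), 2):
        name = fileNames[i]
        fDict = fileNames[i + 1].getObject()
        attachments[name] = fDict['/EF']['/F'].getData()
    return attachments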
This can be improved but it was tested to work (using PyMuPDF).
It detects corrupted PDF files, encryption, attachments, annotations and portfolios.
I am yet to compare the output with our internal classification.
It produces a semicolon-separated file that can be imported into Excel.
import fitz  # = PyMuPDF
import os

outfile = open("C:/Users/me/Downloads/testPDF3.txt", "w", encoding="utf-8")
folder = "C:/Users/me/Downloads"
print("filepath;", "encrypted;", "pages;", "embedded;", "attachments;", "annotations;", "portfolio", file=outfile)
enc = pages = count = names = annots = collection = ''
for subdir, dirs, files in os.walk(folder):
    for file in files:
        #print(os.path.join(subdir, file))
        filepath = subdir + os.sep + file
        if filepath.endswith(".pdf"):
            #print(filepath, file=outfile)
            try:
                doc = fitz.open(filepath)
                enc = doc.is_encrypted
                #print("Encrypted? ", enc, file=outfile)
                pages = doc.page_count
                #print("Number of pages: ", pages, file=outfile)
                count = doc.embfile_count()
                #print("Number of embedded files:", count, file=outfile)  # shows number of embedded files
                names = doc.embfile_names()
                #print("Embedded files:", str(names), file=outfile)
                #if count > 0:
                #    for emb in names:
                #        print(doc.embfile_info(emb), file=outfile)
                annots = doc.has_annots()
                #print("Has annots?", annots, file=outfile)
                links = doc.has_links()
                #print("Has links?", links, file=outfile)
                trailer = doc.pdf_trailer()
                #print("Trailer: ", trailer, file=outfile)
                xreflen = doc.xref_length()  # length of objects table
                for xref in range(1, xreflen):  # skip item 0!
                    #print("", file=outfile)
                    #print("object %i (stream: %s)" % (xref, doc.is_stream(xref)), file=outfile)
                    #print(doc.xref_object(xref, compressed=False), file=outfile)
                    if "Collection" in doc.xref_object(xref, compressed=False):
                        #print("Portfolio", file=outfile)
                        collection = 'True'
                        break
                    else:
                        collection = "False"
                    #print(doc.xref_object(xref, compressed=False), file=outfile)
            except Exception:
                #print("Not a valid PDF", file=outfile)
                enc = pages = count = names = annots = collection = "Not a valid PDF"
            print(filepath, ";", enc, ";", pages, ";", count, ";", names, ";", annots, ";", collection, file=outfile)
outfile.close()
I was also running into the same problem with several PDFs that I have. I was able to make these changes to the referenced code, which got it to work for me:
import PyPDF2

def getAttachments(reader):
    """
    Retrieves the file attachments of the PDF as a dictionary of file names
    and the file data as a bytestring.

    :return: dictionary of filenames and bytestrings
    """
    attachments = {}
    # First, get those that are pdf attachments
    catalog = reader.trailer["/Root"]
    if "/EmbeddedFiles" in catalog["/Names"]:
        fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
        for f in fileNames:
            if isinstance(f, str):
                name = f
                dataIndex = fileNames.index(f) + 1
                fDict = fileNames[dataIndex].getObject()
                fData = fDict['/EF']['/F'].getData()
                attachments[name] = fData
    # Next, go through all pages and all annotations to those pages
    # to find any attached files
    for pagenum in range(0, reader.getNumPages()):
        page_object = reader.getPage(pagenum)
        if "/Annots" in page_object:
            for annot in page_object['/Annots']:
                annotobj = annot.getObject()
                if annotobj['/Subtype'] == '/FileAttachment':
                    fileobj = annotobj["/FS"]
                    attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].getData()
    return attachments

handler = open(filename, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
for fName, fData in dictionary.items():
    with open(fName, 'wb') as outfile:
        outfile.write(fData)
I know it is a late reply, but I only started looking into this yesterday. I have used the PyMuPDF library to extract the embedded files. Here is my code:
import os
import fitz

def get_embedded_pdfs(input_pdf_path, output_path=None):
    input_path = "/".join(input_pdf_path.split('/')[:-1])
    if not output_path:
        output_path = input_pdf_path.split(".")[0] + "_embeded_files/"
    # create the output folder if it does not exist yet
    if not os.path.isdir(output_path):
        os.mkdir(output_path)
    doc = fitz.open(input_pdf_path)
    item_name_dict = {}
    for each_item in doc.embfile_names():
        item_name_dict[each_item] = doc.embfile_info(each_item)["filename"]
    for item_name, file_name in item_name_dict.items():
        out_pdf = output_path + file_name
        ## get embedded file in bytes
        fData = doc.embeddedFileGet(item_name)
        ## save embedded file
        with open(out_pdf, 'wb') as outfile:
            outfile.write(fData)
disclaimer: I am the author of borb (the library used in this answer)
borb is an open-source, pure Python PDF library. It abstracts away most of the unpleasantness of dealing with PDF (such as having to deal with dictionaries and having to know PDF-syntax and structure).
There is a huge repository of examples, containing a section on dealing with embedded files.
I'll repeat the relevant example here for completeness:
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF

def main():
    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we have read a Document
    assert doc is not None

    # retrieve all embedded files and their bytes
    for k, v in doc.get_embedded_files().items():
        # display the file name, and the size
        print("%s, %d bytes" % (k, len(v)))

if __name__ == "__main__":
    main()
After the Document has been read, you can simply ask it for a dict mapping the filenames onto the bytes.
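For example, a minimal follow-up sketch (my own addition, building on the example above) that writes each embedded file to disk:
# assumes doc was read as in the example above
for file_name, file_bytes in doc.get_embedded_files().items():
    with open(file_name, "wb") as fh:
        fh.write(file_bytes)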

Merging PDFs while retaining custom page numbers (aka pagelabels) and bookmarks

I'm trying to automate merging several PDF files and have two requirements: a) existing bookmarks AND b) pagelabels (custom page numbering) need to be retained.
Retaining bookmarks when merging happens by default with PyPDF2 and pdftk, but not with pdfrw.
Pagelabels are consistently not retained in PyPDF2, pdftk or pdfrw.
I am guessing, after having searched a lot, that there is no straightforward approach to doing what I want. If I'm wrong, then I hope someone can point to this easy solution. But if there is no easy solution, any tips on how to get this going in Python will be much appreciated!
Some example code:
1) With PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
tmp1 = PdfFileReader('file1.pdf', 'rb')
tmp2 = PdfFileReader('file2.pdf', 'rb')
#extracting pagelabels is easy
pl1 = tmp1.trailer['/Root']['/PageLabels']
pl2 = tmp2.trailer['/Root']['/PageLabels']
#but PdfFileWriter or PdfFileMerger does not support writing from what I understand
So I don't know how to proceed from here.
2) With pdfrw (has more promise)
from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
#read 1st file
tmp1 = PdfReader('file1')
#add the pages
writer.addpages(tmp1.pages)
#copy bookmarks to writer
writer.trailer.Root.Outlines = tmp1.Root.Outlines
#copy pagelabels to writer
writer.trailer.Root.PageLabels = tmp1.Root.PageLabels
#read second file
tmp2 = PdfReader('file2')
#append pages
writer.addpages(tmp2.pages)
# so far so good
Page numbers of bookmarks from the 2nd file need to be offset before adding them, but when reading outlines I almost always get (IndirectObject, XXX) instead of page numbers. It's unclear how to get page numbers for each label and bookmark using pdfrw. So, I'm stuck again.
As mentioned in my comment, I'm posting a generic solution to merge several PDFs that works in PyPDF2. I don't know what is different to make this work in PyPDF2, other than initializing pls as ArrayObject().
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
import PyPDF2.pdf as PDF

# pls holds all the pagelabels as we iterate through multiple pdfs
pls = PDF.ArrayObject()
# used to offset bookmarks
pageCount = 0
cpdf = PdfFileMerger()
# pdffiles is a list of all files to be merged
for i in range(len(pdffiles)):
    tmppdf = PdfFileReader(pdffiles[i], 'rb')
    cpdf.append(tmppdf)
    # copy all the pagelabels which I assume is present in all files
    # you could use 'try' in case no pagelabels are present
    plstmp = tmppdf.trailer['/Root']['/PageLabels']['/Nums']
    # sometimes keys are indirect objects
    # so, iterate through each pagelabel and...
    for j in range(len(plstmp)):
        # ... get the actual values
        plstmp[j] = plstmp[j].getObject()
        # offset pagenumbers by current count of pages
        if isinstance(plstmp[j], int):
            plstmp[j] = PDF.NumberObject(plstmp[j] + pageCount)
    # once all the pagelabels are processed I append to pls
    pls += plstmp
    # increment pageCount
    pageCount += tmppdf.getNumPages()

# rest follows KevinM's answer
pagenums = PDF.DictionaryObject()
pagenums.update({PDF.NameObject('/Nums'): pls})
pagelabels = PDF.DictionaryObject()
pagelabels.update({PDF.NameObject('/PageLabels'): pagenums})
cpdf.output._root_object.update(pagelabels)
cpdf.write("filename.pdf")
You need to iterate through the existing PageLabels and add them to the merged output, taking care to add an offset to the page index entry, based on the number of pages already added.
This solution also requires PyPDF4, since PyPDF2 produces a weird error (see bottom).
from PyPDF4 import PdfFileWriter, PdfFileMerger, PdfFileReader
# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF
import logging

def add_nums(num_entry, page_offset, nums_array):
    for num in num_entry['/Nums']:
        if isinstance(num, int):
            logging.debug("Found page number %s, offset %s: ", num, page_offset)
            # Add the physical page information
            nums_array.append(PDF.NumberObject(num + page_offset))
        else:
            # {'/S': '/r'}, or {'/S': '/D', '/St': 489}
            keys = num.keys()
            logging.debug("Found page label, keys: %s", keys)
            number_type = PDF.DictionaryObject()
            # Always copy the /S entry
            s_entry = num['/S']
            number_type.update({PDF.NameObject("/S"): PDF.NameObject(s_entry)})
            logging.debug("Adding /S entry: %s", s_entry)
            if '/St' in keys:
                # If there is an /St entry, fetch it
                pdf_label_offset = num['/St']
                # and add the new offset to it
                logging.debug("Found /St %s", pdf_label_offset)
                number_type.update({PDF.NameObject("/St"): PDF.NumberObject(pdf_label_offset)})
            # Add the label information
            nums_array.append(number_type)
    return nums_array

def write_merged(pdf_readers):
    # Output
    merger = PdfFileMerger()
    # For PageLabels information
    page_labels = []
    page_offset = 0
    nums_array = PDF.ArrayObject()
    # Iterate through all the inputs
    for pdf_reader in pdf_readers:
        try:
            # Merge the content
            merger.append(pdf_reader)
            # Handle the PageLabels
            # Fetch page information
            old_page_labels = pdf_reader.trailer['/Root']['/PageLabels']
            page_count = pdf_reader.getNumPages()
            # Add PageLabel information
            add_nums(old_page_labels, page_offset, nums_array)
            page_offset = page_offset + page_count
        except Exception as err:
            print("ERROR: %s" % err)
    # Add PageLabels
    page_numbers = PDF.DictionaryObject()
    page_numbers.update({PDF.NameObject("/Nums"): nums_array})
    page_labels = PDF.DictionaryObject()
    page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})
    root_obj = merger.output._root_object
    root_obj.update(page_labels)
    # Write output
    merger.write('merged.pdf')

pdf_readers = []
tmp1 = PdfFileReader('file1.pdf', 'rb')
tmp2 = PdfFileReader('file2.pdf', 'rb')
pdf_readers.append(tmp1)
pdf_readers.append(tmp2)
write_merged(pdf_readers)
Note: PyPDF2 produces this weird error:
...
...
File "/usr/lib/python3/dist-packages/PyPDF2/pdf.py", line 552, in _sweepIndirectReferences
    data[key] = value
File "/usr/lib/python3/dist-packages/PyPDF2/generic.py", line 507, in __setitem__
    raise ValueError("key must be PdfObject")
ValueError: key must be PdfObject

How do I apply my python code to all of the files in a folder at once, and how do I create a new name for each subsequent output file?

The code I am working with takes in a .pdf file and outputs a .txt file. My question is: how do I create a loop (probably a for loop) which runs the code over and over again on all files in a folder which end in ".pdf"? Furthermore, how do I change the output each time the loop runs so that I can write a new file each time that has the same name as the input file (i.e. 1_pet.pdf > 1_pet.txt, 2_pet.pdf > 2_pet.txt, etc.)?
Here is the code so far:
path="2_pet.pdf"
content = getPDFContent(path)
encoded = content.encode("utf-8")
text_file = open("Output.txt", "w")
text_file.write(encoded)
text_file.close()
The following script solves your problem:
import os

sourcedir = 'pdfdir'
dl = os.listdir(sourcedir)
for f in dl:
    fs = f.split(".")
    if fs[-1] == "pdf":
        # join with the directory name, not the listing itself
        path_in = os.path.join(sourcedir, f)
        content = getPDFContent(path_in)
        encoded = content.encode("utf-8")
        path_out = os.path.join(sourcedir, fs[0] + ".txt")
        text_file = open(path_out, 'w')
        text_file.write(encoded)
        text_file.close()
Create a function that encapsulates what you want to do to each file.
import os.path

def parse_pdf(filename):
    "Parse a pdf into text"
    content = getPDFContent(filename)
    encoded = content.encode("utf-8")
    ## split off the pdf extension to add .txt instead
    (root, _) = os.path.splitext(filename)
    text_file = open(root + ".txt", "w")
    text_file.write(encoded)
    text_file.close()

Then apply this function to a list of filenames, like so:
for f in files:
    parse_pdf(f)
One way to operate on all PDF files in a directory is to invoke glob.glob() and iterate over the results:
import glob

for path in glob.glob('*.pdf'):
    content = getPDFContent(path)
    encoded = content.encode("utf-8")
    # note: this writes every result to the same Output.txt;
    # derive the name from path if you want one .txt per input
    text_file = open("Output.txt", "w")
    text_file.write(encoded)
    text_file.close()
Another way is to allow the user to specify the files:
import sys

for path in sys.argv[1:]:
    ...
Then the user runs your script like python foo.py *.pdf.
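A minimal sketch of that approach (my own, filling in the loop body along the lines of the question's Python 2 code; getPDFContent is the helper assumed in the question):
import os.path
import sys

for path in sys.argv[1:]:
    content = getPDFContent(path)  # assumed helper from the question
    encoded = content.encode("utf-8")
    root, _ = os.path.splitext(path)  # 1_pet.pdf -> 1_pet
    text_file = open(root + ".txt", "w")
    text_file.write(encoded)
    text_file.close()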
You could use a recursive function to search the folders and all subfolders for files that end with .pdf, then take those files and create a text file for each.
It could be something like:
import os

def convert_PDF(path, func):
    d = os.path.basename(path)
    if os.path.isdir(path):
        [convert_PDF(os.path.join(path, x), func) for x in os.listdir(path)]
    elif d[-4:] == '.pdf':
        func(path)

# based entirely on your example code
def convert_to_txt(path):
    content = getPDFContent(path)
    encoded = content.encode("utf-8")
    file_path = os.path.dirname(path)
    # replace pdf with txt extension
    file_name = os.path.basename(path)[:-4] + '.txt'
    text_file = open(file_path + '/' + file_name, "w")
    text_file.write(encoded)
    text_file.close()

convert_PDF('path/to/files', convert_to_txt)
Because the actual operation is changeable, you can replace the function with whatever operation you need to perform (like using a different library, converting to a different type, etc.)
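For instance, a hypothetical alternative callback (my own illustration, not from the answer) that reports each PDF's size instead of converting it could be passed in the same way:
import os

# hypothetical callback: print each PDF's size instead of converting it
def report_size(path):
    print(path, os.path.getsize(path), "bytes")

convert_PDF('path/to/files', report_size)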

Using urllib.urlretrieve to download files over HTTP not working

I'm still working on my mp3 downloader, but now I'm having trouble with the files being downloaded. I have two versions of the part that's tripping me up. The first gives me a proper file but causes an error. The second gives me a file that is way too small, but no error. I've tried opening the file in binary mode, but that didn't help. I'm pretty new to doing any work with HTML, so any help would be appreciated.
import urllib
import urllib2

def milk():
    SongList = []
    SongStrings = []
    SongNames = []
    earmilk = urllib.urlopen("http://www.earmilk.com/category/pop")
    reader = earmilk.read()
    # gets the position of the playlist
    PlaylistPos = reader.find("var newPlaylistTracks = ")
    # finds the number of songs in the playlist
    NumberSongs = reader[reader.find("var newPlaylistIds = "): PlaylistPos].count(",") + 1
    initPos = PlaylistPos
    # goes through the playlist and records the html address and name of the song
    for song in range(0, NumberSongs):
        songPos = reader[initPos:].find("http:") + initPos
        namePos = reader[songPos:].find("name") + songPos
        namePos += reader[namePos:].find(">")
        nameEndPos = reader[namePos:].find("<") + namePos
        SongStrings.append(reader[songPos: reader[songPos:].find('"') + songPos])
        SongNames.append(reader[namePos + 1: nameEndPos])
        initPos = nameEndPos
    for correction in range(0, NumberSongs):
        SongStrings[correction] = SongStrings[correction].replace('\\/', "/")
    # downloading songs
    fileName = ''.join([a.isalnum() and a or '_' for a in SongNames[0]])
    fileName = fileName.replace("_", " ") + ".mp3"
    # This version writes a file that can be played but gives an error saying: "TypeError: expected a character buffer object"
    ## songDL = open(fileName, "wb")
    ## songDL.write(urllib.urlretrieve(SongStrings[0], fileName))
    # This version creates the file but it cannot be played (file size is much smaller than it should be)
    ## url = urllib.urlretrieve(SongStrings[0], fileName)
    ## url = str(url)
    ## songDL = open(fileName, "wb")
    ## songDL.write(url)
    songDL.close()
    earmilk.close()
Re-read the documentation for urllib.urlretrieve:

    Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached).
You appear to be expecting it to return the bytes of the file itself. The point of urlretrieve is that it handles writing to a file for you, and returns the filename it was written to (which will generally be the same thing as your second argument to the function if you provided one).
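A minimal sketch (my own, in the question's Python 2 style) that lets urlretrieve do the writing:
import urllib

# urlretrieve downloads the URL into fileName itself and returns (filename, headers),
# so there is no need to open and write the file manually
filename, headers = urllib.urlretrieve(SongStrings[0], fileName)
print "Saved to", filename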

renaming a list of pdf files with for loop

I am trying to rename a list of PDF files by extracting the name from the file using pyPdf. I tried to use a for loop to rename the files, but I always get an error with code 32 saying that the file is being used by another process. I am using Python 2.7.
Here's my code:
import os, glob
from pyPdf import PdfFileWriter, PdfFileReader

# this function extracts the name of the file
def getName(filepath):
    output = PdfFileWriter()
    input = PdfFileReader(file(filepath, "rb"))
    output.addPage(input.getPage(0))
    outputStream = file(filepath + '.txt', 'w')
    output.write(outputStream)
    outputStream.close()
    outText = open(filepath + '.txt', 'rb')
    textString = outText.read()
    outText.close()
    nameStart = textString.find('default">')
    nameEnd = textString.find('_SATB', nameStart)
    nameEnd2 = textString.find('</rdf:li>', nameStart)
    if nameStart:
        testName = textString[nameStart + 9:nameEnd]
        if len(testName) <= 100:
            name = testName + '.pdf'
        else:
            name = textString[nameStart + 9:nameEnd2] + '.pdf'
        return name

pdfFiles = glob.glob('*.pdf')
m = len(pdfFiles)
for each in pdfFiles:
    newName = getName(each)
    os.rename(each, newName)
Consider using the with statement of Python. With it you do not need to handle closing the file yourself:
def getName(filepath):
    output = PdfFileWriter()
    with file(filepath, "rb") as pdfFile:
        input = PdfFileReader(pdfFile)
        ...
You're not closing the input stream (the file) used by the pdf reader.
Thus, when you try to rename the file, it's still open.
So, instead of this:
input = PdfFileReader(file(filepath, "rb"))
Try this:
inputStream = file(filepath, "rb")
input = PdfFileReader(inputStream)
# ... when done with this file ...
inputStream.close()
It does not look like you close the file object associated with the PDF reader object. Maybe it is closed automatically at the end of the function, but to be sure you might want to create a separate file object which you pass to the PdfFileReader, and then close the file handle when done. Then rename.
The below was from SO: How to close pyPDF "PdfFileReader" Class file handle
import os.path
from pyPdf import PdfFileReader
fname = 'my.pdf'
fh = file(fname, "rb")
input = PdfFileReader(fh)
fh.close()
os.rename(fname, 'my_renamed.pdf')
