Recursively merge PDFs in subfolders using the PyPDF2 module in Python

I'm a novice developer learning Python. I'm trying to recursively walk folders and subfolders containing multiple PDFs and merge them into one PDF per subfolder, named after the subfolder.
I have the following folder and subfolder structure.
Folder before the merge:
dummy
    ball
        ball_baseball.pdf
        ball_basketball.pdf
        ball_volleyball.pdf
    ice
        ice_skating.pdf
        ice_curling.pdf
        ice_hockey.pdf
The ideal result that I'd like to see is:
dummy
    ball
        ball.pdf (containing 3 pages)
    ice
        ice.pdf (containing 3 pages)
There is a previously answered question about doing this for CSV files using pandas, but I'm using PyPDF2 to merge the PDFs.
Here is the code I have tried so far.
It seems to work, but I may have messed up the for loop that is meant to recursively append and merge the PDFs in each subfolder.
import sys, os, PyPDF2
from PyPDF2 import PdfFileMerger, PdfFileReader
dirs = r"path to the folder directory"
for root, dirs, files in os.walk(dirs):
    merger = PdfFileMerger()
    for filename in files:
        if filename.endswith(".pdf"):
            filepath = os.path.join(root, filename)
            merger.append(PdfFileReader(open(filepath, 'rb')))
    merger.write(str(filename))
Any advice will be greatly appreciated.
Thanks in advance.

I know this is quite an old question, but I had the same problem myself. I tried the solution by C. Taylor but ended up with some errors. Anyway, the following code worked for me.
import sys, os, PyPDF2
from PyPDF2 import PdfFileMerger, PdfFileReader
print("testing")
hdir = os.getcwd()
for root, dirs, files in os.walk(hdir):
    merger = PdfFileMerger()  # a fresh merger for every folder
    for filename in files:
        if filename.endswith(".pdf"):
            print(filename)
            filepath = os.path.join(root, filename)
            merger.append(PdfFileReader(open(filepath, 'rb')))
    # name the merged file after the current folder and write it to the main directory
    merger.write(os.path.join(hdir, os.path.basename(os.path.normpath(root)) + '.pdf'))
Each merged PDF is named after its folder and written to the main directory.

If what you want is the merged files to be written to the folder containing your python script rather than the subfolder, you'd need to make a couple of tweaks:
import sys, os, PyPDF2
from PyPDF2 import PdfFileMerger, PdfFileReader
hdir = os.getcwd()  # or the path to the folder directory
for root, dirs, files in os.walk(hdir):
    # process each subdirectory rather than the directory os.walk started from
    for subdir in dirs:
        merger = PdfFileMerger()
        subdir_path = os.path.join(root, subdir)
        for filename in os.listdir(subdir_path):
            if filename.endswith(".pdf"):
                filepath = os.path.join(subdir_path, filename)
                merger.append(PdfFileReader(open(filepath, 'rb')))
        # writes to the main directory, names the merged file after the subdirectory
        merger.write(os.path.join(hdir, subdir + '.pdf'))

I have figured out how to run them in a loop:
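import os
from PyPDF2 import PdfFileMerger, PdfFileReader
rootDir = r"path to your directory"
for dirName, subDir, fileList in os.walk(rootDir, topdown=False):
    merger = PdfFileMerger()  # a fresh merger for every folder
    for fname in fileList:
        if fname.endswith(".pdf"):  # skip anything that is not a PDF
            merger.append(PdfFileReader(open(os.path.join(dirName, fname), 'rb')))
    merger.write(str(dirName) + ".pdf")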
Bringing merger = PdfFileMerger() inside the loop did the trick!
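For anyone who wants to check the grouping before touching any PDFs, here is a minimal sketch that only plans the merges and uses nothing but the standard library. The function name plan_merges and its out_dir parameter are made up for illustration and are not part of PyPDF2. It assumes one merged file per subfolder, named after that subfolder, and it leaves out a previously written output file so a second run does not merge the result into itself.

```python
import os


def plan_merges(top, out_dir=None):
    """Map each output PDF path to the sorted list of input PDFs that feed it.

    Hypothetical helper: it only builds the plan, it does not merge anything.
    """
    top = os.path.abspath(top)
    out_dir = os.path.abspath(out_dir) if out_dir else None
    plan = {}
    for root, dirs, files in os.walk(top):
        if root == top:
            continue  # only subfolders get a merged file
        folder_name = os.path.basename(root)
        # write next to the inputs unless an output directory was given
        output = os.path.join(out_dir or root, folder_name + ".pdf")
        inputs = [
            os.path.join(root, name)
            for name in sorted(files)
            if name.lower().endswith(".pdf")
        ]
        # leave out a merged file left over from an earlier run
        inputs = [path for path in inputs if path != output]
        if inputs:
            plan[output] = inputs
    return plan
```

Each entry of the returned dictionary could then be handed to the merger used in the answers above: create a new merger, append every input path, and write to the output path. Note that if out_dir is set and two subfolders at different depths share a name, they would map to the same output path, so this sketch is only safe for uniquely named subfolders.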

Related

How to modify this script so that all of my files are not deleted when trying to delete files that do not have XML files with them?

I am trying to delete all .JPG files that do not have .xml files with the same name attached to them. However, when I run this script, all of my files are deleted in my directory and not just the desired images. How can I change this script so that I can just delete the images without corresponding .xml files?
Note: The only files I have in the directory are .JPG and .XML
import os
from tqdm import tqdm
path = 'C:\\users\\my_username\\path_to_directory_with_xml_and_jpg_images'
files = os.listdir(path)
for file in tqdm(files):
    filename, filetype = file.split('.')
    if filetype == 'xml':
        continue
    imgfile = os.path.join(path, file)
    xmlfile = os.path.join(path, filename + '.xml')
    if not os.path.exists(xmlfile):
        print('{} deleted.'.format(imgfile))
        os.remove(imgfile)
It's hard to tell why your code doesn't work as we don't know the exact contents of the directory. But a simpler way to do what you want could be to use the amazing pathlib library (Python >= 3.4). The method Path.with_suffix() will make the task quite easy, together with Path.glob():
from pathlib import Path
path = Path('C:\\users\\my_username\\path_to_directory_with_xml_and_jpg_images')
for imgfile in path.glob("*.jpg"):
    xmlfile = imgfile.with_suffix(".xml")
    if not xmlfile.exists():
        imgfile.unlink()
        print(imgfile, 'deleted.')
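Since a wrong condition here deletes files for good, it may help to list the candidates first and only delete once the list looks right. Below is a rough sketch of that idea. The function name images_without_xml is hypothetical, and comparing extensions case-insensitively is my assumption, based on the question mentioning .JPG and .XML in upper case.

```python
from pathlib import Path


def images_without_xml(folder, image_suffix=".jpg"):
    """Return the image files in folder that have no XML file with the same stem.

    Hypothetical helper: nothing is deleted here, the caller decides what to do.
    """
    folder = Path(folder)
    entries = [p for p in folder.iterdir() if p.is_file()]
    # stems of all XML files, matched case-insensitively on the extension
    xml_stems = {p.stem for p in entries if p.suffix.lower() == ".xml"}
    return sorted(
        p for p in entries
        if p.suffix.lower() == image_suffix and p.stem not in xml_stems
    )
```

A cautious way to use it is to print the returned paths, check them by eye, and then call unlink() on each one in a separate step.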

How to open all files in a folder in Python? [duplicate]

How do I open all files in a folder in python? I need to open all files in a folder, so I can index the files for language processing.
Here is an example. This is what it does:
os.listdir('yourBasebasePath') returns a list of the entries in your directory
with open(os.path.join('yourBasebasePath', filename), 'r') opens the current file read-only (you will not be able to write to it)
import os
for filename in os.listdir('yourBasebasePath'):
    with open(os.path.join('yourBasebasePath', filename), 'r') as f:
        pass  # do your stuff
How to open every file in a folder
I would recommend looking at the pathlib library: https://docs.python.org/3/library/pathlib.html
You could do something like:
from pathlib import Path
folder = Path('<folder to index>')
# get all the files in the folder
files = folder.glob('**/*.csv')  # assuming the files are csv
for file in files:
    with open(file, 'r') as f:
        print(f.readlines())
You can use os.walk to list all the files in your folder.
See the os.walk documentation for details.
import os
folderpath = r'folderpath'
for root, dirs, files in os.walk(folderpath, topdown=False):
    for name in files:
        print(os.path.join(root, name))
    for name in dirs:
        print(os.path.join(root, name))
You can use os.walk(), which takes the top-level folder as its argument:
import os
os.walk('yourBasebasePath')
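Since the goal in the question is to index the files for language processing, a small generator that yields each path together with its text may be a convenient shape. This is only a sketch: the name iter_documents, the default "*.txt" pattern and the UTF-8 encoding are assumptions on my part, so adjust them to your data.

```python
from pathlib import Path


def iter_documents(folder, pattern="*.txt", encoding="utf-8"):
    """Yield (path, text) for every matching file under folder, recursively.

    Hypothetical helper; undecodable bytes are replaced instead of raising.
    """
    for path in sorted(Path(folder).rglob(pattern)):
        if path.is_file():  # skip directories that happen to match the pattern
            yield path, path.read_text(encoding=encoding, errors="replace")
```

Because it is a generator, only one file's text is produced at a time, which may matter if the folder is large.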

Python script to merge PDF files works fine in PyCharm but not as a standalone EXE built with PyInstaller

My code is below. As stated in the title, it works as intended in PyCharm but not outside of it. Could it be because I used the PyPDF2 library? Thank you, any help would be much appreciated.
import os
from PyPDF2 import PdfFileMerger
def main():
    print("PDF Merger Initialized")
    pdfs = [pdf_file for pdf_file in os.listdir() if pdf_file.endswith(".pdf")]  # list of all files with the .pdf extension
    merger = PdfFileMerger()
    for pdf in pdfs:
        merger.append(pdf)
    merger.write("merged_bills.pdf")
    merger.close()
    print("PDF Merger Completed")
main()
Try this instead:
import glob
import os
pdfs = [pdf_file for pdf_file in glob.glob(os.path.join(os.getcwd(), "*.pdf"))]
This way glob returns the full path to each file.
Alternatively, try:
pdfs = [os.path.join(os.getcwd(), pdf_file) for pdf_file in os.listdir(os.getcwd()) if pdf_file.endswith(".pdf")]
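A related detail that might be worth checking, whatever the cause of the EXE problem turns out to be: the script writes merged_bills.pdf into the same folder it reads from, so a second run would pick up the previous output as an input. The sketch below builds absolute paths as suggested above and skips the output file. The function name list_input_pdfs is made up, and excluding the old output is my own assumption about what is wanted.

```python
import os


def list_input_pdfs(folder, output_name="merged_bills.pdf"):
    """Return absolute paths of the PDFs in folder, excluding the merged output.

    Hypothetical helper; the extension check is case-insensitive.
    """
    folder = os.path.abspath(folder)
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf") and name != output_name
    )
```

The returned list could be used in place of the pdfs list in the question, with merger.append called on each path.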

How can I delete files by extension in subfolders of a folder?

I have a folder, which contains many subfolders, each containing some videos and .srt files. I want to loop over the main folder so that all the .srt files from all subfolders are deleted.
Here is something I tried-
import sys
import os
import glob
main_dir = '/Users/Movies/Test'
folders = os.listdir(main_dir)
for (dirname, dirs, files) in os.walk(main_dir):
    for file in files:
        if file.endswith('.srt'):
            os.remove(file)
However, I get an error as follows-
FileNotFoundError: [Errno 2] No such file or directory: 'file1.srt'
Is there any way I can solve this? I am still a beginner so sorry I may have overlooked something obvious.
You need to join the filename with the location.
import sys
import os
import glob
main_dir = '/Users/Movies/Test'
folders = os.listdir(main_dir)
for (dirname, dirs, files) in os.walk(main_dir):
    for file in files:
        if file.endswith('.srt'):
            source_file = os.path.join(dirname, file)
            os.remove(source_file)
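The same thing can be written with pathlib, where rglob already returns paths that include the folder, so the join step cannot be forgotten. This is a minimal sketch, not a drop-in replacement: the function name delete_by_extension and the dry_run flag are hypothetical additions so the matches can be inspected before anything is removed.

```python
from pathlib import Path


def delete_by_extension(main_dir, extension=".srt", dry_run=True):
    """Find files with the given extension under main_dir, recursively.

    Hypothetical helper: deletes the matches only when dry_run is False,
    and returns the list of matched paths either way.
    """
    matches = sorted(p for p in Path(main_dir).rglob("*" + extension) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling it once with the default dry_run=True and printing the result shows what would be deleted; calling it again with dry_run=False performs the deletion.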

renaming files in a directory + subdirectories in python

I have some files that I'm working with in a python script. The latest requirement is that I go into a directory that the files will be placed in and rename all files by adding a datestamp and project name to the beginning of the filename while keeping the original name.
i.e. foo.txt becomes 2011-12-28_projectname_foo.txt
Building the new tag was easy enough, it's just the renaming process that's tripping me up.
Can you post what you have tried?
I think you should just need to use os.walk with os.rename.
Something like this:
import os
from os.path import join
foo = '2011-12-28_projectname_'  # the datestamp and project tag you already built
for root, dirs, files in os.walk('path/to/dir'):
    for name in files:
        newname = foo + name
        os.rename(join(root, name), join(root, newname))
I know this is an older post of mine, but seeing as how it's been viewed quite a few times I figure I'll post what I did to resolve this.
import datetime
import os
sv_name = "(whatever it's named)"
today = datetime.date.today()
survey = sv_name.replace(" ", "_")
date = str(today).replace(" ", "_")
namedate = survey + str(date)
# rename every non-hidden entry in the current directory
for f in os.listdir('.'):
    if not f.startswith('.'):
        os.rename(f, str(namedate + "_" + f))
import os
dir_name = os.path.realpath('your directory')
cnt = 0
for root, dirs, files in os.walk(dir_name, topdown=False):
    for file in files:
        cnt = cnt + 1
        file_name = os.path.splitext(file)[0]  # file name without extension
        extension = os.path.splitext(file)[1]
        folder_name = os.path.basename(root)
        try:
            os.rename(os.path.join(root, file), os.path.join(root, folder_name + extension))
        except FileExistsError:
            # the name is already taken, so add the counter to keep it unique
            os.rename(os.path.join(root, file), os.path.join(root, folder_name + str(cnt) + extension))
This takes care of the case where a single folder contains more than one file, by giving the files an incrementing number.
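Putting the pieces from the question together (datestamp, project name, original name kept, subdirectories included), a sketch could look like the one below. The function name add_prefix and its parameters are hypothetical. It assumes that hidden files should be skipped, that files already carrying the prefix should be left alone, and that an existing target name should never be overwritten.

```python
import datetime
import os


def add_prefix(top, project, day=None, dry_run=True):
    """Prefix every file under top with '<date>_<project>_'.

    Hypothetical helper: returns the list of (old, new) path pairs and
    only renames when dry_run is False.
    """
    day = day or datetime.date.today()
    prefix = "{}_{}_".format(day.isoformat(), project.replace(" ", "_"))
    renames = []
    for root, dirs, files in os.walk(top):
        for name in sorted(files):
            if name.startswith(".") or name.startswith(prefix):
                continue  # hidden file, or already renamed on an earlier run
            old = os.path.join(root, name)
            new = os.path.join(root, prefix + name)
            if os.path.exists(new):
                continue  # never overwrite an existing file
            renames.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return renames
```

With day set to 28 December 2011 and project set to "projectname", a file called foo.txt would be mapped to 2011-12-28_projectname_foo.txt, which matches the example in the question.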
