Iterate through many directories in Python [Python3]

Iterate through many directories in Python [Python3] - python

So I have been working on a program that needs to run through each file in a particular directory and do things depending on the file. That bit is done (code below), however I really need to expand this so that I can parse in as many directories as needed and have the program cycle through them all sequentially. My code is as follows (apologies for really clunky, bad code):
def createTemplate( self, doType = 0, bandNameL = None, bandName430 = None ):
'''
Loads the archive of each file in a directory.
Depending on the choice made on initialization, a template will be
created for one band (of the user's choosing) or all bands.
Templates are saved in self.directory (either CWD or whatever the user
wishes) as 1D numpy arrays, .npy. If arguments are not provided for the
template names (without the .npy suffix), default names will be used.
Useage of the doType parameter: 0 (default) does both bands and returns a tuple.
1 does only the L-band.
2 does only the 430-band.
'''
print( "Beginning template creation..." )
# Initialize the file names if given
nameL = str( bandNameL ) + ".npy"
name430 = str( bandName430 ) + ".npy"
# Set the templates to empty arrays
self.templateProfile430, self.templateProfileL = [], []
# Set the call counters for the creation scripts to 0
self._templateCreationScript430.__func__.counter = 0
self._templateCreationScriptL.__func__.counter = 0
# Cycle through each file in the stored directory
for file in os.listdir( self.directory ):
# Set the file to be a global variable in the class for use elsewhere
self.file = file
# Check whether the file is a fits file
if self.file.endswith( ".fits" ) or self.file.endswith( ".refold" ):
# Check if the file is a calibration file (not included in the template)
if self.file.find( 'cal' ) == -1:
# Open the fits file header
hdul = fits.open( self.directory + self.file )
# Get the frequency band used in the observation.
frequencyBand = hdul[0].header[ 'FRONTEND' ]
# Close the header once it's been used or the program becomes very slow.
hdul.close()
# Check which band the fits file belongs to
if frequencyBand == 'lbw' or frequencyBand == 'L_Band':
if doType == 0 or doType == 1:
self.templateProfileL = self._templateCreationScriptL()
# Check if a save name was provided
if bandNameL == None:
np.save( self.directory + "Lbandtemplate.npy", self.templateProfileL )
else:
np.save( self.directory + nameL, self.templateProfileL )
else:
print( "L-Band file" )
elif frequencyBand == '430':
if doType == 0 or doType == 2:
self.templateProfile430 = self._templateCreationScript430()
# Check if a save name was provided
if bandName430 == None:
np.save( self.directory + "430bandtemplate.npy", self.templateProfile430 )
else:
np.save( self.directory + name430, self.templateProfile430 )
else:
print( "430-Band file" )
else:
print( "Frontend is neither L-Band, nor 430-Band..." )
else:
print( "Skipping calibration file..." )
else:
print( "{} is not a fits file...".format( self.file ) )
# Decide what to return based on doType
if doType == 0:
print( "Template profiles created..." )
return self.templateProfileL, self.templateProfile430
elif doType == 1:
print( "L-Band template profile created..." )
return self.templateProfileL
else:
print( "430-Band template profile created..." )
return self.templateProfile430
So currently, it works perfectly for one directory but just need to know how to modify for multiple directories. Thank you anyone who can help.
EDIT: self.directory is initialised in the class initialisation, so maybe there's something that needs to be changed there instead:
class Template:
'''
Class for the creation, handling and analysis of pulsar template profiles.
Templates can be created for each frequency band of data in a folder which
can either be the current working directory or a folder of the user's
choosing.
'''
def __init__( self, directory = None ):
# Check if the directory was supplied by the user. If not, use current working directory.
if directory == None:
self.directory = str( os.getcwd() )
else:
self.directory = str( directory )

Here is how you can run your logic in different directories:
>>> import os
>>> path = './python'
>>> for name in os.listdir(path) :
... newpath = path+'/'+name
... if os.path.isdir(newpath) :
... for filename in os.listdir(newpath) :
... # do the work
... filepath = newpath + '/' + filename
... print(filepath)
...

Related

How do I Check a Folder's 'hidden' Attribute in Python?

I am working on a load/save module (a GUI written in Python) that will be used with and imported to future programs. My operating system is Windows 10. The problem I've run into is that my get_folders() method is grabbing ALL folder names, including ones that I would rather ignore, such as system folders and hidden folders (best seen on the c-drive).
I have a work around using a hard-coded exclusion list. But this only works for folders already on the list, not for hidden folders that my wizard may come across in the future. I would like to exclude ALL folders that have their 'hidden' attribute set. I would like to avoid methods that require installing new libraries that would have to be re-installed whenever I wipe my system. Also, if the solution is non-Windows specific, yet will work with Windows 10, all the better.
I have searched SO and the web for an answer but have come up empty. The closest I have found is contained in Answer #4 of this thread: Check for a folder, then create a hidden folder in Python, which shows how to set the hidden attribute when creating a new directory, but not how to read the hidden attribute of an existing directory.
Here is my Question: Does anyone know of a way to check if a folder's 'hidden' attribute is set, using either native python, pygame, or os. commands? Or, lacking a native answer, I would accept an imported method from a library that achieves my goal.
The following program demonstrates my get_folders() method, and shows the issue at hand:
# Written for Windows 10
import os
import win32gui
CLS = lambda :os.system('cls||echo -e \\\\033c') # Clear-Command-Console function
def get_folders(path = -1, exclude_list = -1):
if path == -1: path = os.getcwd()
if exclude_list == -1: exclude_list = ["Config.Msi","Documents and Settings","System Volume Information","Recovery","ProgramData"]
dir_list = [entry.name for entry in os.scandir(path) if entry.is_dir()] if exclude_list == [] else\
[entry.name for entry in os.scandir(path) if entry.is_dir() and '$' not in entry.name and entry.name not in exclude_list]
return dir_list
def main():
HWND = win32gui.GetForegroundWindow() # Get Command Console Handle.
win32gui.MoveWindow(HWND,100,50,650,750,False) # Size and Position Command Console.
CLS() # Clear Console Screen.
print(''.join(['\n','Folder Names'.center(50),'\n ',('-'*50).center(50)]))
# Example 1: Current Working Directory
dirs = get_folders() # Get folder names in current directory (uses exclude list.)
for elm in dirs:print(' ',elm) # Show the folder names.
print('','-'*50)
# Examle 2: C Drive, All Folders Included
dirs = get_folders('c:\\', exclude_list = []) # Get a list of folders in the root c: drive, nothing excluded.
for elm in dirs: print(' ',elm) # Show the fiolder names
print('','-'*50)
# Example 3: C Drive, Excluded Folder List Work-Around
dirs = get_folders('c:\\') # Get a list of folders in the root c: drive, excluding sytem dirs those named in the exclude_list.
for elm in dirs:print(' ',elm) # Show the folder names.
print("\n Question: Is there a way to identify folders that have the 'hidden' attribute\n\t set to True, rather than using a hard-coded exclusion list?",end='\n\n' )
# ==========================
if __name__ == "__main__":
main()
input(' Press [Enter] to Quit: ')
CLS()

Here is my revised get_folders() method, with thanks to Alexander.
# Written for Windows 10
# key portions of this code borrowed from:
# https://stackoverflow.com/questions/40367961/how-to-read-or-write-the-a-s-h-r-i-file-attributes-on-windows-using-python-and-c/40372658#40372658
# with thanks to Alexander Goryushkin.
import os
from os import scandir, stat
from stat import (
FILE_ATTRIBUTE_ARCHIVE as A,
FILE_ATTRIBUTE_SYSTEM as S,
FILE_ATTRIBUTE_HIDDEN as H,
FILE_ATTRIBUTE_READONLY as R,
FILE_ATTRIBUTE_NOT_CONTENT_INDEXED as I
)
from ctypes import WinDLL, WinError, get_last_error
import win32gui
CLS = lambda :os.system('cls||echo -e \\\\033c') # Clear-Command-Console function
def read_or_write_attribs(kernel32, entry, a=None, s=None, h=None, r=None, i=None, update=False):
# Get the file attributes as an integer.
if not update: attrs = entry.stat(follow_symlinks=False).st_file_attributes# Fast because we access the stats from the entry
else:
# Notice that this will raise a "WinError: Access denied" on some entries,
# for example C:\System Volume Information\
attrs = stat(entry.path, follow_symlinks=False).st_file_attributes
# A bit slower because we re-read the stats from the file path.
# Construct the new attributes
newattrs = attrs
def setattrib(attr, value):
nonlocal newattrs
# Use '{0:032b}'.format(number) to understand what this does.
if value is True: newattrs = newattrs | attr
elif value is False:
newattrs = newattrs & ~attr
setattrib(A, a)
setattrib(S, s)
setattrib(H, h)
setattrib(R, r)
setattrib(I, i if i is None else not i) # Because this attribute is True when the file is _not_ indexed
# Optional add more attributes here.
# See https://docs.python.org/3/library/stat.html#stat.FILE_ATTRIBUTE_ARCHIVE
# Write the new attributes if they changed
if newattrs != attrs:
if not kernel32.SetFileAttributesW(entry.path, newattrs):
raise WinError(get_last_error())
# Return an info tuple consisting of bools
return ( bool(newattrs & A),
bool(newattrs & S),
bool(newattrs & H),
bool(newattrs & R),
not bool(newattrs & I) )# Because this attribute is true when the file is _not_ indexed)
# Get the file attributes as an integer.
if not update:
# Fast because we access the stats from the entry
attrs = entry.stat(follow_symlinks=False).st_file_attributes
else:
# A bit slower because we re-read the stats from the file path.
# Notice that this will raise a "WinError: Access denied" on some entries,
# for example C:\System Volume Information\
attrs = stat(entry.path, follow_symlinks=False).st_file_attributes
return dir_list
def get_folders(path = -1, show_hidden = False):
if path == -1: path = os.getcwd()
dir_list = []
kernel32 = WinDLL('kernel32', use_last_error=True)
for entry in scandir(path):
a,s,hidden,r,i = read_or_write_attribs(kernel32,entry)
if entry.is_dir() and (show_hidden or not hidden): dir_list.append(entry.name)
return dir_list
def main():
HWND = win32gui.GetForegroundWindow() # Get Command Console Handle.
win32gui.MoveWindow(HWND,100,50,650,750,False) # Size and Position Command Console.
CLS() # Clear Console Screen.
line_len = 36
# Example 1: C Drive, Exclude Hidden Folders
print(''.join(['\n','All Folders Not Hidden:'.center(line_len),'\n ',('-'*line_len).center(line_len)]))
dirs = get_folders('c:\\') # Get a list of folders on the c: drive, exclude hidden.
for elm in dirs:print(' ',elm) # Show the folder names.
print('','='*line_len)
# Examle 2: C Drive, Include Hidden Folders
print(" All Folders Including Hidden\n "+"-"*line_len)
dirs = get_folders('c:\\', show_hidden = True) # Get a list of folders on the c: drive, including hidden.
for elm in dirs: print(' ',elm) # Show the fiolder names
print('','-'*line_len)
# ==========================
if __name__ == "__main__":
main()
input(' Press [Enter] to Quit: ')
CLS()

Updating Appending List to a txt file

Hello currently i am studying python and i wanted to know on how you can have a list that is being appended if there is a change constantly to a txtfile. Wording is a hard here is the code anyways
list=[]
random_number=0
file_handler=open("history.txt","w")
file_handler.write(str(list))
lenght_cumulative_data=len(list)
confirmed.append(random_number)
Now what i want to accomplish is that the list variable of the number 0 would be shown in history.txt but that doesnt happen and lets just imagine that random_number is always changing I want the list variable to be able to always update itself. Like if let say random_number changes to 1 and then 2 I want list to be updated to [0,1,2]. How do you do that? I've been searching on youtube and all they gave me is this write function is there anyway someone could refrence it or have any ideas?

from os import stat
from _thread import start_new_thread
from time import sleep
List = []
class WatchFileForChanges:
def __init__(self, filename):
self.file = filename
self.cached_file = stat(self.file).st_mtime
def watch(self):
num = 0
while 1:
status = stat(self.file).st_mtime
if status != self.cached_file:
self.cached_file = status
#file changed
List.append(num)
num += 1
def main():
Watcher = WatchFileForChanges("file.txt")
start_new_thread(Watcher.watch, ())
while 1:
print(List)
sleep(1)
if __name__ == '__main__':
main()
This will do what you want.
If I understood you correctly, you want to append to the list every time a file changes.

Note: this answer will only work on Windows
changes.py:
# Adapted from http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
import threading
import os
import win32file
import win32con
ACTIONS = {
1 : "Created",
2 : "Deleted",
3 : "Updated",
4 : "Renamed from something",
5 : "Renamed to something"
}
# Thanks to Claudio Grondi for the correct set of numbers
FILE_LIST_DIRECTORY = 0x0001
def monitor_changes(callback, path, filenames):
path = path or ""
if type(filenames) == "str":
filenames = (filenames,)
thread = threading.Thread(target=_monitor, args=(callback, path, filenames))
thread.start()
return thread
def _monitor(callback, path, filenames):
hDir = win32file.CreateFile (
path,
FILE_LIST_DIRECTORY,
win32con.FILE_SHARE_READ | win32con.FILE_SHARE_WRITE | win32con.FILE_SHARE_DELETE,
None,
win32con.OPEN_EXISTING,
win32con.FILE_FLAG_BACKUP_SEMANTICS,
None
)
while True:
#
# ReadDirectoryChangesW takes a previously-created
# handle to a directory, a buffer size for results,
# a flag to indicate whether to watch subtrees and
# a filter of what changes to notify.
#
# NB Tim Juchcinski reports that he needed to up
# the buffer size to be sure of picking up all
# events when a large number of files were
# deleted at once.
#
results = win32file.ReadDirectoryChangesW (
hDir,
1024,
True,
win32con.FILE_NOTIFY_CHANGE_LAST_WRITE,
None,
None
)
for action, file in results:
if filenames and file not in filenames and os.path.basename(file) not in filenames:
continue
callback(action, file)
if __name__ == '__main__':
# monitor by printing
t = monitor_changes(print, ".", None)
And in your main.py:
import changes
import os
my_list = []
def callback(action_id, filename):
# the function running means
# that the file has been modified
action_desc = changes.ACTIONS[action_id]
print(action_desc, filename)
with open(filename) as f:
my_list.append(f.read())
thread = changes.monitor_changes(callback, ".", "my_file_that_I_want_to_monitor.txt")
If you want to monitor all files in the directory, call monitor_changes with None as the third argument.
Note: this will monitor all subdirectories, so files with the same name but in different folders will trigger the callback. If you want to avoid this, then check the filename passed to your callback function is exactly what you want to monitor.

File Not Found Error Django

Hello I'm new to django and I'm trying to make a web app. I have a running back end, but the problem is it's only running on cli and I have to turn it into a web app.
def testing(request):
ksize = 6
somsize= 10
csvname="input.csv"
testcap = "testing.pcap"
pl.csv5("chap/a",testcap)
tmparr=[]
for filename in os.listdir("chap"):
if filename.endswith(".csv"):
tmparr.append(filename)
continue
else:
continue
tmparr.sort()
visual_list = natsort.natsorted(tmparr)
csv = sl.opencsv(csvname)
norm = sl.normalize(csv)
weights = sl.som(norm,somsize)
label = sl.kmeans(ksize,weights)
#for x in range (2,21):
# label = sl.kmeans(x,weights)
# print("K is", x, "Score is ", label[1])
lblarr = np.reshape(label,(somsize,somsize))
#sl.dispcolor(lblarr)
classess = sl.cluster_coloring(weights,norm,csv)
classpercluster = sl.determine_cluster(classess,lblarr,ksize)
classpercent = sl.toperc(classpercluster)
print (classpercent)
#print(classpercluster)
for x in visual_list:
temp = ("chap/"+x)
tests = sl.opencsv(temp)
print(tests)
hits = sl.som_hits(weights, tests)
name = ("img/" + x + ".png")
sl.disp(lblarr,name,hits)
return render(request,'visualization/detail.html')
The system cannot find the path specified: 'chap', I'm not sure if I should put the chap folder inside the templates folder or in the app folder. Thank you in advance!

You're doing relative paths here it looks like. Change it to an absolute path.
dirpath = os.path.dirname(os.path.abspath(__file__))
chap_dirpath = os.path.join(dirpath, chap_dirpath)

Python: why is my method using os.walk does not return all available paths?

Question as in title and here's the method:
def walkThroughPath(self , sBasePath, blFolders = True, blFiles = True ):
aPaths = []
for sRootDir, aSubFolders, aFiles in os.walk( sBasePath ):
for sFolder in aSubFolders:
if blFolders == True:
aPaths.append( sRootDir )
for sFileName in aFiles:
if blFiles == True:
aPaths.append( sRootDir + "/" + sFileName )
return aPaths
The method returns a big amount of subfolders and files but definetly not all that I've found.
What's wrong with my method (or is it a wrong usage of os.walk)?
For those who are interested in the Background:
http://www.playonlinux.com/en/topic-10962-centralized_wineprefix_as_preparation_for_debpackages.html

Here are two possibilities:
You don't have permission to read a certain directory.
By default, os.walk does not follow symbolic links. Use the
followlinks=True keyword to follow symbolic links:
os.walk( sBasePath, followlinks=True )
Having skimmed the link you provided, it looks like followlinks=True may be the solution.

both of your hints brought the final solution that looks like that now:
def walkThroughPath(self , sBasePath, blFolders = True, blFiles = True, blFollowSymlinks = True ):
aPaths = []
for sRootDir, aSubFolders, aFiles in os.walk( sBasePath, blFollowSymlinks ):
for sFolder in aSubFolders:
if blFolders == True:
try:
aPaths.index( sRootDir )
blPathExists = True
except:
blPathExists = False
pass
if blPathExists == False:
aPaths.append( sRootDir )
self.logDebug("Append: " + sRootDir )
self.logDebug("Current content of aPaths: \n" + pprint.pformat(aPaths) )
for sFileName in aFiles:
self.logDebug("Current root dir: " + sRootDir )
if blFiles == True:
try:
aPaths.index( sRootDir + "/" + sFileName )
blPathExists = True
except:
blPathExists = False
pass
if blPathExists == False:
aPaths.append( sRootDir + "/" + sFileName )
if blFolders == True:
try:
aPaths.index( sRootDir )
blPathExists = True
except:
blPathExists = False
pass
if blPathExists == False:
aPaths.append( sRootDir )
self.logDebug("Append: " + sRootDir )
self.logDebug("Current content of aPaths: \n" + pprint.pformat(aPaths) )
self.logDebug("Folders: " + str(blFolders) )
self.logDebug("Files : " + str(blFiles) )
self.logDebug("Paths found in " + sBasePath + " : \n" + pprint.pformat(aPaths) )
return aPaths
First I indented incorrectly as Steven said.
os.walk seems to handle the lists not as I expected them to be. Folders of files not nessessarily appear in folders list. This caused many folders I left out just because those folder pathes have been in the files list. Additionally I only checked files only in this limited folders list.
Next I added the follow symlinks flag optional as unutbu suggested. Maybe in my case they could be needed as well eventually.
Those method above is surely a candidate for improvement, but it's at least working :-)
Best,
André

I know that you already solved the problem. But for further references: If you want to walk through dirs as an 32 bit application running on a 64 bit Windows make sure that you check for the redirected directories.
The %windir%\System32 directory is reserved for 64-bit applications. Most DLL file names were not changed when 64-bit versions of the DLLs were created, so 32-bit versions of the DLLs are stored in a different directory. WOW64 hides this difference by using a file system redirector.
File System Redirector on MSDN

split a pdf based on outline

i would like to use pyPdf to split a pdf file based on the outline where each destination in the outline refers to a different page within the pdf.
example outline:
main --> points to page 1
sect1 --> points to page 1
sect2 --> points to page 15
sect3 --> points to page 22
it is easy within pyPdf to iterate over each page of the document or each destination in the document's outline; however, i cannot figure out how to get the page number where the destination points.
does anybody know how to find the referencing page number for each destination in the outline?

I figured it out:
class Darrell(pyPdf.PdfFileReader):
def getDestinationPageNumbers(self):
def _setup_outline_page_ids(outline, _result=None):
if _result is None:
_result = {}
for obj in outline:
if isinstance(obj, pyPdf.pdf.Destination):
_result[(id(obj), obj.title)] = obj.page.idnum
elif isinstance(obj, list):
_setup_outline_page_ids(obj, _result)
return _result
def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
if _result is None:
_result = {}
if pages is None:
_num_pages = []
pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
t = pages["/Type"]
if t == "/Pages":
for page in pages["/Kids"]:
_result[page.idnum] = len(_num_pages)
_setup_page_id_to_num(page.getObject(), _result, _num_pages)
elif t == "/Page":
_num_pages.append(1)
return _result
outline_page_ids = _setup_outline_page_ids(self.getOutlines())
page_id_to_page_numbers = _setup_page_id_to_num()
result = {}
for (_, title), page_idnum in outline_page_ids.iteritems():
result[title] = page_id_to_page_numbers.get(page_idnum, '???')
return result
pdf = Darrell(open(PATH-TO-PDF, 'rb'))
template = '%-5s %s'
print template % ('page', 'title')
for p,t in sorted([(v,k) for k,v in pdf.getDestinationPageNumbers().iteritems()]):
print template % (p+1,t)

This is just what I was looking for. Darrell's additions to PdfFileReader should be part of PyPDF2.
I wrote a little recipe that uses PyPDF2 and sejda-console to split a PDF by bookmarks. In my case there are several Level 1 sections that I want to keep together. This script allows me to do that and give the resulting files meaningful names.
import operator
import os
import subprocess
import sys
import time
import PyPDF2 as pyPdf
# need to have sejda-console installed
# change this to point to your installation
sejda = 'C:\\sejda-console-1.0.0.M2\\bin\\sejda-console.bat'
class Darrell(pyPdf.PdfFileReader):
...
if __name__ == '__main__':
t0= time.time()
# get the name of the file to split as a command line arg
pdfname = sys.argv[1]
# open up the pdf
pdf = Darrell(open(pdfname, 'rb'))
# build list of (pagenumbers, newFileNames)
splitlist = [(1,'FrontMatter')] # Customize name of first section
template = '%-5s %s'
print template % ('Page', 'Title')
print '-'*72
for t,p in sorted(pdf.getDestinationPageNumbers().iteritems(),
key=operator.itemgetter(1)):
# Customize this to get it to split where you want
if t.startswith('Chapter') or \
t.startswith('Preface') or \
t.startswith('References'):
print template % (p+1, t)
# this customizes how files are renamed
new = t.replace('Chapter ', 'Chapter')\
.replace(': ', '-')\
.replace(': ', '-')\
.replace(' ', '_')
splitlist.append((p+1, new))
# call sejda tools and split document
call = sejda
call += ' splitbypages'
call += ' -f "%s"'%pdfname
call += ' -o ./'
call += ' -n '
call += ' '.join([str(p) for p,t in splitlist[1:]])
print '\n', call
subprocess.call(call)
print '\nsejda-console has completed.\n\n'
# rename the split files
for p,t in splitlist:
old ='./%i_'%p + pdfname
new = './' + t + '.pdf'
print 'renaming "%s"\n to "%s"...'%(old, new),
try:
os.remove(new)
except OSError:
pass
try:
os.rename(old, new)
print' succeeded.\n'
except:
print' failed.\n'
print '\ndone. Spliting took %.2f seconds'%(time.time() - t0)

Small update to #darrell class to be able to parse UTF-8 outlines, which I post as answer because comment would be hard to read.
Problem is in pyPdf.pdf.Destination.title which may be returned in two flavors:
pyPdf.generic.TextStringObject
pyPdf.generic.ByteStringObject
so that output from _setup_outline_page_ids() function returns also two different types for title object, which fails with UnicodeDecodeError if outline title contains anything then ASCII.
I added this code to solve the problem:
if isinstance(title, pyPdf.generic.TextStringObject):
title = title.encode('utf-8')
of whole class:
class PdfOutline(pyPdf.PdfFileReader):
def getDestinationPageNumbers(self):
def _setup_outline_page_ids(outline, _result=None):
if _result is None:
_result = {}
for obj in outline:
if isinstance(obj, pyPdf.pdf.Destination):
_result[(id(obj), obj.title)] = obj.page.idnum
elif isinstance(obj, list):
_setup_outline_page_ids(obj, _result)
return _result
def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
if _result is None:
_result = {}
if pages is None:
_num_pages = []
pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
t = pages["/Type"]
if t == "/Pages":
for page in pages["/Kids"]:
_result[page.idnum] = len(_num_pages)
_setup_page_id_to_num(page.getObject(), _result, _num_pages)
elif t == "/Page":
_num_pages.append(1)
return _result
outline_page_ids = _setup_outline_page_ids(self.getOutlines())
page_id_to_page_numbers = _setup_page_id_to_num()
result = {}
for (_, title), page_idnum in outline_page_ids.iteritems():
if isinstance(title, pyPdf.generic.TextStringObject):
title = title.encode('utf-8')
result[title] = page_id_to_page_numbers.get(page_idnum, '???')
return result

Darrell's class can be modified slightly to produce a multi-level table of contents for a pdf (in the manner of pdftoc in the pdftk toolkit.)
My modification adds one more parameter to _setup_page_id_to_num, an integer "level" which defaults to 1. Each invocation increments the level. Instead of storing just the page number in the result, we store the pair of page number and level. Appropriate modifications should be applied when using the returned result.
I am using this to implement the "PDF Hacks" browser-based page-at-a-time document viewer with a sidebar table of contents which reflects LaTeX section, subsection etc bookmarks. I am working on a shared system where pdftk can not be installed but where python is available.

A solution 10 years later for newer python and PyPDF:
from PyPDF2 import PdfReader, PdfWriter
filename = "main.pdf"
with open(filename, "rb") as f:
r = PdfReader(f)
bookmarks = list(map(lambda x: (x.title, r.get_destination_page_number(x)), r.outline))
print(bookmarks)
for i, b in enumerate(bookmarks):
begin = b[1]
end = bookmarks[i+1][1] if i < len(bookmarks) - 1 else len(r.pages)
# print(len(r.pages[begin:end]))
name = b[0] + ".pdf"
print(f"{name=}: {begin=}, {end=}")
with open(name, "wb") as f:
w = PdfWriter(f)
for p in r.pages[begin:end]:
w.add_page(p)
w.write(f)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Iterate through many directories in Python [Python3] - python

Related

How do I Check a Folder's 'hidden' Attribute in Python?

Updating Appending List to a txt file

File Not Found Error Django

Python: why is my method using os.walk does not return all available paths?

split a pdf based on outline

Categories

Resources