I have some code that looks at a single folder and pulls out files,
but now the folder structure has changed and I need to trawl through the folders looking for files that match.
What the old code looks like:
GSB_FOLDER = r'D:\Games\Gratuitous Space Battles Beta'

def get_module_data():
    module_folder = os.path.join(GSB_FOLDER, 'data', 'modules')
    filenames = [os.path.join(module_folder, f)
                 for f in os.listdir(module_folder)]
    data = [parse_file(f) for f in filenames]
    return data
But now the folder structure has changed to be like this
GSB_FOLDER\data\modules
\folder1\data\modules
\folder2\data\modules
\folder3\data\modules
where folder1, folder2 or folder3 could be any text string.
How do I rewrite the code above to do this?
I have been told about os.walk, but I'm just learning Python... so any help is appreciated.
Nothing much changes: you just call os.walk and it will recursively go through the directory and return the files, e.g.

for root, dirs, files in os.walk('/tmp'):
    if os.path.basename(root) != 'modules':
        continue
    data = [parse_file(os.path.join(root, f)) for f in files]
Here I am checking files only in folders named 'modules'; you can change that check to do something else, e.g. match paths which have 'modules' somewhere in them with root.find('/modules') >= 0.
os.walk is a nice easy way to get the directory structure of everything inside a dir you pass it; in your example, you could do something like this:

for dirpath, dirnames, filenames in os.walk(GSB_FOLDER):
    if "/data/modules/" in dirpath:
        # whatever you want to do with these folders
        print(dirpath, dirnames, filenames)

Try that out; it should be fairly self-explanatory how it works.
I created a function that serves the general purpose of crawling through a directory structure and returning files and/or paths that match a pattern.
import os
import re

def directory_spider(input_dir, path_pattern="", file_pattern="", maxResults=500):
    file_paths = []
    if not os.path.exists(input_dir):
        raise FileNotFoundError("Could not find path: %s" % (input_dir))
    for dirpath, dirnames, filenames in os.walk(input_dir):
        if re.search(path_pattern, dirpath):
            file_list = [item for item in filenames if re.search(file_pattern, item)]
            file_path_list = [os.path.join(dirpath, item) for item in file_list]
            file_paths += file_path_list
            if len(file_paths) > maxResults:
                break
    return file_paths[0:maxResults]
Example usages:
directory_spider('/path/to/find') --> Finds the first 500 files under the path, if it exists
directory_spider('/path/to/find',path_pattern="",file_pattern=".py$", maxResults=10)
You can use os.walk like @Anurag has detailed, or you can try my small pathfinder library:

data = [parse_file(f) for f in pathfinder.find(GSB_FOLDER, just_files=True)]
I am trying to use the Python os library to loop through all the subdirectories in a root directory, target files with a specific name and rename them.
Just to make it clear, this is my tree structure:
My Python file is located at the root level.
What I am trying to do is target the directory 942ba, loop through all the subdirectories, locate the file 000000 and rename it to 000000.csv.
The current code I have is as follows:
import os

root = '<path-to-dir>/942ba956-8967-4bec-9540-fbd97441d17f/'
for dirs, subdirs, files in os.walk(root):
    for f in files:
        print(dirs)
        if f == '000000':
            dirs = dirs.strip(root)
            f_new = f + '.csv'
            os.rename(os.path.join(r'{}'.format(dirs), f), os.path.join(r'{}'.format(dirs), f_new))
But this is not working, because when I run my code, for some reason strip removes characters from the subdirectory paths.
Can anyone help me understand how to solve this issue?
A more efficient way to iterate through the folders and only select the files you are looking for is below:
source_folder = '<path-to-dir>/942ba956-8967-4bec-9540-fbd97441d17f/'
files = [os.path.normpath(os.path.join(root, f))
         for root, dirs, files in os.walk(source_folder)
         for f in files
         if '000000' in f and not f.endswith('.gz')]
for f in files:
    os.rename(f, f"{f}.csv")
The list comprehension stores the full path to the files you are looking for. You can change the condition inside the comprehension to anything you need. I use this code snippet a lot to find just images of certain type, or remove unwanted files from the selected files.
In the for loop, each file is renamed by adding the .csv extension.
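The same idea can be sketched with pathlib; rename_matches is a hypothetical helper name, and the '000000'/.gz conditions are taken from the answer above:

```python
from pathlib import Path

def rename_matches(root):
    # collect matches first so renaming doesn't disturb the directory scan
    targets = [p for p in Path(root).rglob('*')
               if p.is_file() and '000000' in p.name and p.suffix != '.gz']
    for p in targets:
        # add the .csv extension to each matching file
        p.rename(p.with_name(p.name + '.csv'))
```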
I would use glob to find the files.
import os, glob
zdir = '942ba956-8967-4bec-9540-fbd97441d17f'
files = glob.glob('*{}/000000'.format(zdir))
for fly in files:
    os.rename(fly, '{}.csv'.format(fly))
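If the file can sit more than one level down, glob also supports recursive ** patterns (Python 3.5+); rename_recursive is just an illustrative name:

```python
import glob
import os

def rename_recursive(root):
    # '**' with recursive=True matches at any depth under root
    for path in glob.glob(os.path.join(root, '**', '000000'), recursive=True):
        os.rename(path, path + '.csv')
```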
I want to delete only the files, not the folders and subfolders.
I tried this, but I don't want to test for specific characters in a condition:

for i in glob('path' + '**/*', recursive=True):
    if '.' in i:
        os.remove(i)

I don't like this because some folder names have '.' in them. Also, there are many types of files there, so building a list of extensions and checking against it would not be efficient. What ways do you suggest?
You can use os.walk:
import os

for root, _, files in os.walk('path'):
    for file in files:
        os.remove(os.path.join(root, file))
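The same walk-and-delete can be sketched with pathlib if you prefer it; remove_files is a hypothetical name:

```python
from pathlib import Path

def remove_files(root):
    # unlink every regular file under root, leaving the folder tree intact
    for p in Path(root).rglob('*'):
        if p.is_file():
            p.unlink()
```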
Try something like this:

def get_file_paths(folder_path):
    paths = []
    for root, directories, filenames in os.walk(folder_path):
        for filename in filenames:
            paths.append(os.path.join(root, filename))
    return paths
Hey, this is my first post (!)
Just looking for a recursive solution to a headache in my little project :)
I am trying to collect the paths of all folders (recursively) that contain a specific file, into a list of paths.
Example: my (root) path is c:/test. The folder test contains the file 'test.txt' and some folders: '1', '2', '3'. Each of them contains 'test.txt' too! (If 'test.txt' is not found, just break the loop and don't search the subfolders!)
Right now my function looks for 'test.txt' and then collects all the folders into my folders list:

if os.path.exists(os.path.join(path, 'test.txt')):
    full_list = os.listdir(path)
    folderslist = []
    for folder in full_list:
        if os.path.isfile(os.path.join(path, folder)) == 0:
            folderslist.append(os.path.join(path, folder))

It works, but it's not recursive. I really don't know how to call the function again with the same list and make it change the current path. I'm also not sure a list is the best data structure to pass along.
My goal is to perform some operations in every folder on this list:
c:/test c:/test/1 c:/test/2 c:/test/3
But if there are more folders that do not contain 'test.txt', just don't add them to my folder list and don't look inside them.
Hope my first post was clear enough :X
You can use os.walk to traverse the subfolders, and if test.txt is not found, clear the directory list so os.walk won't traverse its subfolders any further:
import os

folderslist = []
for root, dirs, files in os.walk('c:/test'):
    if 'test.txt' in files:
        folderslist.append(root)
    else:
        dirs.clear()
I have a directory containing folders and subfolders. At the end of each path there are files. I want to make a txt file containing the path to all the files, but excluding the path to folders.
I tried this suggestion from Getting a list of all subdirectories in the current directory, and my code looks like this:
import os
myDir = '/path/somewhere'
print([x[0] for x in os.walk(myDir)])
And it gives the path of all elements (files AND folders), but I want only the paths to the files. Any ideas for it?
os.walk(path) yields a 3-tuple for each directory it visits: the parent folder, its subdirectories and its files.
So you can do it like this:

for dir, subdir, files in os.walk(path):
    for file in files:
        print(os.path.join(dir, file))
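Since the goal was a txt file of the paths, the same loop can write each path out as it goes (a sketch; write_file_paths and the output filename are just illustrative names):

```python
import os

def write_file_paths(my_dir, out_txt):
    # walk the tree and write one full file path per line
    with open(out_txt, 'w') as out:
        for dirpath, dirnames, filenames in os.walk(my_dir):
            for name in filenames:
                out.write(os.path.join(dirpath, name) + '\n')
```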
The os.walk method gives you dirs, subdirs and files in each iteration, so when you are looping through os.walk, you will have to then iterate over the files and combine each file with "dir".
In order to perform this combination, what you want to do is do an os.path.join between the directory and the files.
Here is a simple example to help illustrate how traversing with os.walk works
from os import walk
from os.path import join

# in your loop, specify in order: dir, subdirectory, files for each level
for dir, subdir, files in walk('path'):
    # iterate over each file
    for file in files:
        # join will put together the directory and the file
        print(join(dir, file))
If you just want the paths, then add a filter to your list comprehension as follows:
import os
myDir = '/path/somewhere'
print([dirpath for dirpath, dirnames, filenames in os.walk(myDir) if filenames])
This would then only add the path for folders which contain files.
def get_paths(path, depth=None):
    for name in os.listdir(path):
        full_path = os.path.join(path, name)
        if os.path.isfile(full_path):
            yield full_path
        else:
            # decrement the remaining depth; None means unlimited
            d = depth - 1 if depth is not None else None
            if d is None or d >= 0:
                for sub_path in get_paths(full_path, d):
                    yield sub_path
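A quick sketch of how the generator behaves, with the function restated so the example is self-contained:

```python
import os
import tempfile

def get_paths(path, depth=None):
    # yield file paths, recursing up to `depth` levels (None = unlimited)
    for name in os.listdir(path):
        full_path = os.path.join(path, name)
        if os.path.isfile(full_path):
            yield full_path
        else:
            d = depth - 1 if depth is not None else None
            if d is None or d >= 0:
                for sub_path in get_paths(full_path, d):
                    yield sub_path

# build a tiny tree: root/a.txt and root/sub/b.txt
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
for rel in ('a.txt', os.path.join('sub', 'b.txt')):
    open(os.path.join(root, rel), 'w').close()

print(len(list(get_paths(root))))           # 2 -- both files
print(len(list(get_paths(root, depth=0))))  # 1 -- depth 0 stops before sub/
```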
This link is using a custom method, but I just want to see if there is a single method to do it in Python 2.6?
There isn't a built-in function to list only directories, but it's easy enough to define in a couple of lines:

def listdirs(directory):
    return [f for f in os.listdir(directory)
            if os.path.isdir(os.path.join(directory, f))]
EDIT: fixed, thanks Stephan202
If a_directory is the directory you want to inspect, then:

next(os.walk(a_directory))[1]

gives the names of its immediate subdirectories.
From the os.walk() reference:
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
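A tiny demonstration of what that first yielded 3-tuple contains:

```python
import os
import tempfile

# make a directory holding one subfolder and one file
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, 'subdir'))
open(os.path.join(top, 'file.txt'), 'w').close()

dirpath, dirnames, filenames = next(os.walk(top))
print(dirnames)    # ['subdir'] -- only the immediate subdirectories
print(filenames)   # ['file.txt']
```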
I don't believe there is. Since directories are also files, you have to ask for all the files, then ask each one if it is a directory.
def listdirs(path):
    ret = []
    for cur_name in os.listdir(path):
        full_path = os.path.join(path, cur_name)
        if os.path.isdir(full_path):
            ret.append(cur_name)
    return ret

onlydirs = listdirs("/tmp/")
print(onlydirs)
...or as a list comprehension:

path = "/tmp/"
onlydirs = [x for x in os.listdir(path) if os.path.isdir(os.path.join(path, x))]
print(onlydirs)