I am a total Python Newb
I need to loop through a directory looking for .txt files, and then read and process them individually. I would like to set this up so that whatever directory the script is in is treated as the root of this action. For example if the script is in /bsepath/workDir, then it would loop over all of the files in workDir and its children.
What I have so far is:
#!/usr/bin/env python
import os

scrptPth = os.path.realpath(__file__)
for file in os.listdir(scrptPth)
    with open(file) as f:
        head,sub,auth = [f.readline().strip() for i in range(3)]
        data=f.read()
        #data.encode('utf-8')
        pth = os.getcwd()
        print head,sub,auth,data,pth
This code is giving me an invalid syntax error and I suspect that is because os.listdir does not like file paths in standard string format. Also, I don't think that I am doing the looped action right. How do I reference a specific file in the looped action? Is it packaged as a variable?
Any help is appreciated.
import os, fnmatch

def findFiles(path, filter):
    for root, dirs, files in os.walk(path):
        for file in fnmatch.filter(files, filter):
            yield os.path.join(root, file)
Use it like this, and it will find all text files somewhere within the given path (recursively):
for textFile in findFiles(r'C:\Users\poke\Documents', '*.txt'):
    print(textFile)
os.listdir expects a directory as input. So, to get the directory in which the script resides use:
scrptPth = os.path.dirname(os.path.realpath(__file__))
Also, os.listdir returns just the filenames, not the full path.
So open(file) will not work unless the current working directory happens to be the directory where the script resides. To fix this, use os.path.join:
import os

scrptPth = os.path.dirname(os.path.realpath(__file__))
for file in os.listdir(scrptPth):
    with open(os.path.join(scrptPth, file)) as f:
        ...  # read/process f here
Finally, if you want to recurse through subdirectories, use os.walk:
import os

scrptPth = os.path.dirname(os.path.realpath(__file__))
for root, dirs, files in os.walk(scrptPth):
    for filename in files:
        filename = os.path.join(root, filename)
        with open(filename, 'r') as f:
            head, sub, auth = [f.readline().strip() for i in range(3)]
            data = f.read()
            #data.encode('utf-8')
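Since the question specifically asks for .txt files, here is a minimal sketch that adds that filter to the walk above (same three-header-line processing as in the question):

import os

scrptPth = os.path.dirname(os.path.realpath(__file__))
for root, dirs, files in os.walk(scrptPth):
    for filename in files:
        if not filename.endswith('.txt'):  # skip everything that is not a .txt file
            continue
        with open(os.path.join(root, filename)) as f:
            head, sub, auth = [f.readline().strip() for i in range(3)]
            data = f.read()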
How can I find all the files in a directory having the extension .txt in Python?
You can use glob:
import glob, os

os.chdir("/mydir")
for file in glob.glob("*.txt"):
    print(file)
or simply os.listdir:
import os

for file in os.listdir("/mydir"):
    if file.endswith(".txt"):
        print(os.path.join("/mydir", file))
or if you want to traverse the directory tree, use os.walk:
import os

for root, dirs, files in os.walk("/mydir"):
    for file in files:
        if file.endswith(".txt"):
            print(os.path.join(root, file))
Use glob.
>>> import glob
>>> glob.glob('./*.txt')
['./outline.txt', './pip-log.txt', './test.txt', './testingvim.txt']
Something like this should do the job:

import os

for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith('.txt'):
            print(file)
You can simply use pathlib's glob¹:
import pathlib
list(pathlib.Path('your_directory').glob('*.txt'))
or in a loop:
for txt_file in pathlib.Path('your_directory').glob('*.txt'):
    ...  # do something with "txt_file"
If you want it recursive, you can use .glob('**/*.txt').
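For example, a recursive version of the same loop (a sketch, using the same hypothetical 'your_directory'):

import pathlib

# '**/' makes the glob descend into subdirectories at any depth.
for txt_file in pathlib.Path('your_directory').glob('**/*.txt'):
    print(txt_file)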
¹ The pathlib module was included in the standard library in Python 3.4. But you can install back-ports of that module even on older Python versions (e.g. using conda or pip): pathlib and pathlib2.
Something like this will work:
>>> import os
>>> path = '/usr/share/cups/charmaps'
>>> text_files = [f for f in os.listdir(path) if f.endswith('.txt')]
>>> text_files
['euc-cn.txt', 'euc-jp.txt', 'euc-kr.txt', 'euc-tw.txt', ... 'windows-950.txt']
import os
path = 'mypath/path'
files = os.listdir(path)
files_txt = [i for i in files if i.endswith('.txt')]
I like os.walk():
import os

for root, dirs, files in os.walk(dir):
    for f in files:
        if os.path.splitext(f)[1] == '.txt':
            fullpath = os.path.join(root, f)
            print(fullpath)
Or with generators:
import os

fileiter = (os.path.join(root, f)
            for root, _, files in os.walk(dir)
            for f in files)
txtfileiter = (f for f in fileiter if os.path.splitext(f)[1] == '.txt')
for txt in txtfileiter:
    print(txt)
Here are more versions of the same that produce slightly different results:
glob.iglob()
import glob

for f in glob.iglob("/mydir/*/*.txt"):  # generator, searches immediate subdirectories
    print f
glob.glob1()
print glob.glob1("/mydir", "*.tx?") # literal_directory, basename_pattern
fnmatch.filter()
import fnmatch, os
print fnmatch.filter(os.listdir("/mydir"), "*.tx?") # include dot-files
Try this; it will find all your files recursively:
import glob, os

os.chdir("H:\\wallpaper")  # use whatever directory you want; note the double backslash, not a single one
for file in glob.glob("**/*.txt", recursive=True):
    print(file)
Python v3.5+
Fast method using os.scandir in a recursive function. Searches for all files with a specified extension in folder and sub-folders. It is fast, even for finding 10,000s of files.
I have also included a function to convert the output to a Pandas DataFrame.
import os
import re
import pandas as pd

def findFilesInFolderYield(path, extension, containsTxt='', subFolders=True, excludeText=''):
    r""" Recursive function to find all files of an extension type in a folder (and optionally in all subfolders too)

    path:        Base directory to find files
    extension:   File extension to find, e.g. 'txt'. Treated as a regular expression, so 'ls\d' matches ls1, ls2, ls3 etc.
    containsTxt: List of strings; only find a file if it contains this text. Ignored if '' (or blank)
    subFolders:  Bool. If True, find files in all subfolders under path. If False, only search the specified folder
    excludeText: Text string. Ignored if ''. Will exclude a file if this text string is in its path.
    """
    if isinstance(containsTxt, str):  # if a single string rather than a list
        containsTxt = [containsTxt]

    myregexobj = re.compile(r'\.' + extension + '$')  # makes sure the file extension is at the end and is preceded by a .

    try:  # trapping a FileNotFoundError or OSError: a file permissions problem, I believe
        for entry in os.scandir(path):
            if entry.is_file() and myregexobj.search(entry.path):
                bools = [True for txt in containsTxt if txt in entry.path and (excludeText == '' or excludeText not in entry.path)]
                if len(bools) == len(containsTxt):
                    yield entry.stat().st_size, entry.stat().st_atime_ns, entry.stat().st_mtime_ns, entry.stat().st_ctime_ns, entry.path
            elif entry.is_dir() and subFolders:  # if it's a directory, recurse into it
                yield from findFilesInFolderYield(entry.path, extension, containsTxt, subFolders, excludeText)
    except FileNotFoundError as fnf:  # must come before OSError, since FileNotFoundError is its subclass
        print(path + ' not found ', fnf)
    except OSError as ose:
        print('Cannot access ' + path + '. Probably a permissions error ', ose)
def findFilesInFolderYieldandGetDf(path, extension, containsTxt, subFolders=True, excludeText=''):
    r""" Converts the data returned by findFilesInFolderYield into a Pandas DataFrame.

    Recursive function to find all files of an extension type in a folder (and optionally in all subfolders too)

    path:        Base directory to find files
    extension:   File extension to find, e.g. 'txt'. Treated as a regular expression, so 'ls\d' matches ls1, ls2, ls3 etc.
    containsTxt: List of strings; only find a file if it contains this text. Ignored if '' (or blank)
    subFolders:  Bool. If True, find files in all subfolders under path. If False, only search the specified folder
    excludeText: Text string. Ignored if ''. Will exclude a file if this text string is in its path.
    """
    fileSizes, accessTimes, modificationTimes, creationTimes, paths = zip(*findFilesInFolderYield(path, extension, containsTxt, subFolders, excludeText))
    df = pd.DataFrame({
        'FLS_File_Size': fileSizes,
        'FLS_File_Access_Date': accessTimes,
        'FLS_File_Modification_Date': modificationTimes,
        'FLS_File_Creation_Date': creationTimes,
        'FLS_File_PathName': paths,
    })

    # The st_*_ns values are integer nanoseconds since the epoch, which pd.to_datetime converts directly.
    df['FLS_File_Modification_Date'] = pd.to_datetime(df['FLS_File_Modification_Date'])
    df['FLS_File_Creation_Date'] = pd.to_datetime(df['FLS_File_Creation_Date'])
    df['FLS_File_Access_Date'] = pd.to_datetime(df['FLS_File_Access_Date'])

    return df
ext = 'txt'  # regular expression
containsTxt = []
path = r'C:\myFolder'
df = findFilesInFolderYieldandGetDf(path, ext, containsTxt, subFolders=True)
path.py is another alternative: https://github.com/jaraco/path.py
from path import path

p = path('/path/to/the/directory')
for f in p.files(pattern='*.txt'):
    print f
To get all '.txt' file names inside 'dataPath' folder as a list in a Pythonic way:
from os import listdir
from os.path import isfile, join
path = "/dataPath/"
onlyTxtFiles = [f for f in listdir(path) if isfile(join(path, f)) and f.endswith(".txt")]
print onlyTxtFiles
Python has all the tools to do this:

import os

the_dir = 'the_dir_that_want_to_search_in'
all_txt_files = list(filter(lambda x: x.endswith('.txt'), os.listdir(the_dir)))  # list() so this also works on Python 3, where filter is lazy
I did a test (Python 3.6.4, W7x64) to see which solution is the fastest for one folder, no subdirectories, to get a list of complete file paths for files with a specific extension.
To make it short, for this task os.listdir() is the fastest and is 1.7x as fast as the next best: os.walk() (with a break!), 2.7x as fast as pathlib, 3.2x faster than os.scandir() and 3.3x faster than glob.
Please keep in mind that those results will change when you need recursive results. If you copy/paste one of the methods below, please add a .lower(), otherwise .EXT would not be found when searching for .ext.
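For example, the case-insensitive tweak to the os.listdir() method might look like this (a sketch; '.sqlite' is simply the extension used in the benchmark below):

import os

path = os.getcwd()
# f.lower() makes the comparison case-insensitive, so FILE.SQLITE is found as well.
list_files = [os.path.join(path, f) for f in os.listdir(path) if f.lower().endswith(".sqlite")]

The benchmark code: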
import os
import pathlib
import timeit
import glob

def a():
    path = pathlib.Path().cwd()
    list_sqlite_files = [str(f) for f in path.glob("*.sqlite")]

def b():
    path = os.getcwd()
    list_sqlite_files = [f.path for f in os.scandir(path) if os.path.splitext(f)[1] == ".sqlite"]

def c():
    path = os.getcwd()
    list_sqlite_files = [os.path.join(path, f) for f in os.listdir(path) if f.endswith(".sqlite")]

def d():
    path = os.getcwd()
    os.chdir(path)
    list_sqlite_files = [os.path.join(path, f) for f in glob.glob("*.sqlite")]

def e():
    path = os.getcwd()
    list_sqlite_files = [os.path.join(path, f) for f in glob.glob1(str(path), "*.sqlite")]

def f():
    path = os.getcwd()
    list_sqlite_files = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".sqlite"):
                list_sqlite_files.append(os.path.join(root, file))
        break

print(timeit.timeit(a, number=1000))
print(timeit.timeit(b, number=1000))
print(timeit.timeit(c, number=1000))
print(timeit.timeit(d, number=1000))
print(timeit.timeit(e, number=1000))
print(timeit.timeit(f, number=1000))
Results (Python 3.6.4):

a (pathlib):     0.431
b (os.scandir):  0.515
c (os.listdir):  0.161
d (glob.glob):   0.548
e (glob.glob1):  0.537
f (os.walk):     0.274
import os
import sys

if len(sys.argv) < 3:  # the script needs two arguments: a directory and a mask
    print('no params')
    sys.exit(1)

dir = sys.argv[1]
mask = sys.argv[2]
files = os.listdir(dir)
res = filter(lambda x: x.endswith(mask), files)
print(list(res))  # list() so the lazy filter object prints its contents on Python 3
To get an array of ".txt" file names from a folder called "data" in the same directory I usually use this simple line of code:
import os
fileNames = [fileName for fileName in os.listdir("data") if fileName.endswith(".txt")]
This code makes my life simpler.
import os

fnames = [file for root, dirs, files in os.walk(dir)
          for file in files
          if file.endswith('.txt')]  # or file.endswith('.png') or file.endswith('.pdf')

for fname in fnames:
    print(fname)
Use fnmatch: https://docs.python.org/2/library/fnmatch.html
import fnmatch
import os

for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        print file
A copy-pastable solution, similar to ghostdog's:
def get_all_filepaths(root_path, ext):
    """
    Search all files which have a given extension within root_path.

    This ignores the case of the extension and searches subdirectories, too.

    Parameters
    ----------
    root_path : str
    ext : str

    Returns
    -------
    list of str

    Examples
    --------
    >>> get_all_filepaths('/run', '.lock')
    ['/run/unattended-upgrades.lock',
     '/run/mlocate.daily.lock',
     '/run/xtables.lock',
     '/run/mysqld/mysqld.sock.lock',
     '/run/postgresql/.s.PGSQL.5432.lock',
     '/run/network/.ifstate.lock',
     '/run/lock/asound.state.lock']
    """
    import os
    all_files = []
    for root, dirs, files in os.walk(root_path):
        for filename in files:
            if filename.lower().endswith(ext):
                all_files.append(os.path.join(root, filename))
    return all_files
You can also use yield to create a generator and thus avoid assembling the complete list:
def get_all_filepaths(root_path, ext):
    import os
    for root, dirs, files in os.walk(root_path):
        for filename in files:
            if filename.lower().endswith(ext):
                yield os.path.join(root, filename)
I suggest using fnmatch and the upper method. This way you can find any of the following:
Name.txt
Name.TXT
Name.Txt
import fnmatch
import os

for file in os.listdir("/Users/Johnny/Desktop/MyTXTfolder"):
    if fnmatch.fnmatch(file.upper(), '*.TXT'):
        print(file)
Here's one with extend():

import glob
import os

path = '.'  # assumption: the directory to search in
types = ('*.jpg', '*.png')
images_list = []
for files in types:
    images_list.extend(glob.glob(os.path.join(path, files)))
Functional solution with sub-directories:
from fnmatch import filter
from functools import partial
from itertools import chain
from os import path, walk
print(*chain(*(map(partial(path.join, root), filter(filenames, "*.txt"))
               for root, _, filenames in walk("mydir"))))
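An unrolled equivalent of the same chain, for readability (a sketch using the same hypothetical "mydir"; it prints one match per line instead of all at once):

from fnmatch import filter
from os import path, walk

# Walk the tree, glob-filter each directory's filenames, and join each match onto its root.
for root, _, filenames in walk("mydir"):
    for name in filter(filenames, "*.txt"):
        print(path.join(root, name))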
In case the folder contains a lot of files or memory is a constraint, consider using generators:
import os

def yield_files_with_extensions(folder_path, file_extension):
    for _, _, files in os.walk(folder_path):
        for file in files:
            if file.endswith(file_extension):
                yield file
Option A: Iterate

for f in yield_files_with_extensions('.', '.txt'):
    print(f)

Option B: Get all

files = [f for f in yield_files_with_extensions('.', '.txt')]
Use Python's os module to find files with a specific extension.
A simple example is here:

import os

# This is the path where you want to search
path = r'd:'
# This is the extension you want to detect
extension = '.txt'  # this can be: .jpg .png .xls .log ...

for root, dirs_list, files_list in os.walk(path):
    for file_name in files_list:
        if os.path.splitext(file_name)[-1] == extension:
            file_name_path = os.path.join(root, file_name)
            print file_name
            print file_name_path  # This is the full path of the filtered file
Many users have replied with os.walk answers, but os.walk doesn't just list the files in the given directory; it also recurses into all subdirectories and lists their files. The generator below prunes the walk so that only the top-level directory is inspected.
import os

def files_in_dir(path, extension=''):
    r"""
    Generator: yields all of the files in <path> ending with
    <extension>

    \param path       Absolute or relative path to inspect,
    \param extension  [optional] Only yield files matching this,

    \yield            [filenames]
    """
    for _, dirs, files in os.walk(path):
        dirs[:] = []  # do not recurse into directories.
        yield from [f for f in files if f.endswith(extension)]

# Example: print all the .py files in './python'
for filename in files_in_dir('./python', '.py'):
    print("-", filename)
Or for a one-off where you don't need a generator:

path, ext = "./python", ".py"
for _, _, dirfiles in os.walk(path):
    matches = (f for f in dirfiles if f.endswith(ext))
    break

for filename in matches:
    print("-", filename)
If you are going to use matches for something else, you may want to make it a list rather than a generator expression:
matches = [f for f in dirfiles if f.endswith(ext)]
I have a C++/Obj-C background and I am just discovering Python (been writing it for about an hour).
I am writing a script to recursively read the contents of text files in a folder structure.
The problem I have is that the code I have written only works one folder deep. I can see why in the code (see # hardcoded path); I just don't know how I can move forward with Python since my experience with it is brand new.
Python Code:
import os
import sys

rootdir = sys.argv[1]

for root, subFolders, files in os.walk(rootdir):
    for folder in subFolders:
        outfileName = rootdir + "/" + folder + "/py-outfile.txt"  # hardcoded path
        folderOut = open(outfileName, 'w')
        print "outfileName is " + outfileName

        for file in files:
            filePath = rootdir + '/' + file
            f = open(filePath, 'r')
            toWrite = f.read()
            print "Writing '" + toWrite + "' to" + filePath
            folderOut.write(toWrite)
            f.close()

        folderOut.close()
Make sure you understand the three return values of os.walk:
for root, subdirs, files in os.walk(rootdir):
has the following meaning:
root: Current path which is "walked through"
subdirs: Directories in root
files: Files in root (directly in it, not in subdirs)
And please use os.path.join instead of concatenating with a slash! Your problem is filePath = rootdir + '/' + file - you must concatenate the currently "walked" folder instead of the topmost folder. So that must be filePath = os.path.join(root, file). BTW, "file" is a builtin, so you don't normally use it as a variable name.
Another problem is your loops, which should look like this, for example:
import os
import sys

walk_dir = sys.argv[1]

print('walk_dir = ' + walk_dir)

# If your current working directory may change during script execution, it's recommended to
# immediately convert program arguments to an absolute path. Then the variable root below will
# be an absolute path as well. Example:
# walk_dir = os.path.abspath(walk_dir)
print('walk_dir (absolute) = ' + os.path.abspath(walk_dir))

for root, subdirs, files in os.walk(walk_dir):
    print('--\nroot = ' + root)
    list_file_path = os.path.join(root, 'my-directory-list.txt')
    print('list_file_path = ' + list_file_path)

    with open(list_file_path, 'wb') as list_file:
        for subdir in subdirs:
            print('\t- subdirectory ' + subdir)

        for filename in files:
            file_path = os.path.join(root, filename)
            print('\t- file %s (full path: %s)' % (filename, file_path))

            with open(file_path, 'rb') as f:
                f_content = f.read()
                list_file.write(('The file %s contains:\n' % filename).encode('utf-8'))
                list_file.write(f_content)
                list_file.write(b'\n')
If you didn't know, the with statement for files is a shorthand:
with open('filename', 'rb') as f:
    dosomething()

# is effectively the same as

f = open('filename', 'rb')
try:
    dosomething()
finally:
    f.close()
If you are using Python 3.5 or above, you can get this done in 1 line.
import glob

# root_dir needs a trailing slash (i.e. /root/dir/)
for filename in glob.iglob(root_dir + '**/*.txt', recursive=True):
    print(filename)
As mentioned in the documentation
If recursive is true, the pattern '**' will match any files and zero or more directories and subdirectories.
If you want every file, you can use
import glob

for filename in glob.iglob(root_dir + '**/**', recursive=True):
    print(filename)
I agree with Dave Webb: os.walk will yield an item for each directory in the tree. The fact is, you just don't have to care about subFolders.
Code like this should work:
import os
import sys

rootdir = sys.argv[1]

for folder, subs, files in os.walk(rootdir):
    with open(os.path.join(folder, 'python-outfile.txt'), 'w') as dest:
        for filename in files:
            with open(os.path.join(folder, filename), 'r') as src:
                dest.write(src.read())
TL;DR: This is the equivalent of find -type f to go over all files in all folders below and including the current one:

import os

for currentpath, folders, files in os.walk('.'):
    for file in files:
        print(os.path.join(currentpath, file))
As already mentioned in other answers, os.walk() is the answer, but it could be explained better. It's quite simple! Let's walk through this tree:
docs/
└── doc1.odt
pics/
todo.txt
With this code:
for currentpath, folders, files in os.walk('.'):
    print(currentpath)
The currentpath is the current folder it is looking at. This will output:
.
./docs
./pics
So it loops three times, because there are three folders: the current one, docs, and pics. In every loop, it fills the variables folders and files with all folders and files. Let's show them:
for currentpath, folders, files in os.walk('.'):
    print(currentpath, folders, files)
This shows us:
# currentpath   folders           files
.               ['pics', 'docs']  ['todo.txt']
./pics          []                []
./docs          []                ['doc1.odt']
So in the first line, we see that we are in folder ., that it contains two folders namely pics and docs, and that there is one file, namely todo.txt. You don't have to do anything to recurse into those folders, because as you see, it recurses automatically and just gives you the files in any subfolders. And any subfolders of that (though we don't have those in the example).
If you just want to loop through all files, the equivalent of find -type f, you can do this:
for currentpath, folders, files in os.walk('.'):
    for file in files:
        print(os.path.join(currentpath, file))
This outputs:
./todo.txt
./docs/doc1.odt
The pathlib library is really great for working with files. You can do a recursive glob on a Path object like so.
from pathlib import Path

for elem in Path('/path/to/my/files').rglob('*.*'):
    print(elem)
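Note that '*.*' only matches names containing a dot; to keep only .txt files (the topic of the question), a sketch with the same hypothetical path:

from pathlib import Path

# rglob('*.txt') is equivalent to glob('**/*.txt'): recurse, keeping only .txt files.
for elem in Path('/path/to/my/files').rglob('*.txt'):
    print(elem)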
import glob
import os

root_dir = <root_dir_here>

for filename in glob.iglob(root_dir + '**/**', recursive=True):
    if os.path.isfile(filename):
        with open(filename, 'r') as file:
            print(file.read())
**/** is used to get all files recursively, including directories.
if os.path.isfile(filename) is used to check whether filename refers to a file or a directory; if it is a file, then we can read it. Here I am printing the file's contents.
If you want a flat list of all paths under a given dir (like find . in the shell):
files = [
    os.path.join(parent, name)
    for (parent, subdirs, files) in os.walk(YOUR_DIRECTORY)
    for name in files + subdirs
]
To only include full paths to files under the base dir, leave out + subdirs.
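For example, the files-only variant (same hypothetical YOUR_DIRECTORY placeholder, plus the import the snippet assumes):

import os

YOUR_DIRECTORY = '.'  # placeholder: any base directory

# Same walk, but collecting file paths only - the find . -type f equivalent.
files = [
    os.path.join(parent, name)
    for (parent, subdirs, files) in os.walk(YOUR_DIRECTORY)
    for name in files
]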
I've found the following to be the easiest
from glob import glob
import os
files = [f for f in glob('rootdir/**', recursive=True) if os.path.isfile(f)]
Using glob('some/path/**', recursive=True) gets all files, but also includes directory names. Adding the if os.path.isfile(f) condition filters this list to existing files only.
For my taste os.walk() is a little too complicated and verbose. You can do what the accepted answer does more cleanly with:

import pathlib

# dir_path and outfile are placeholders for your input directory and output file.
all_files = [str(f) for f in pathlib.Path(dir_path).glob("**/*") if f.is_file()]

with open(outfile, 'wb') as fout:
    for f in all_files:
        with open(f, 'rb') as fin:
            fout.write(fin.read())
            fout.write(b'\n')
Use os.path.join() to construct your paths; it's neater:

import os
import sys

rootdir = sys.argv[1]

for root, subFolders, files in os.walk(rootdir):
    for folder in subFolders:
        outfileName = os.path.join(root, folder, "py-outfile.txt")
        folderOut = open(outfileName, 'w')
        print "outfileName is " + outfileName

        for file in files:
            filePath = os.path.join(root, file)
            toWrite = open(filePath).read()
            print "Writing '" + toWrite + "' to" + filePath
            folderOut.write(toWrite)

        folderOut.close()
os.walk does a recursive walk by default. For each directory, starting from the root, it yields a 3-tuple (dirpath, dirnames, filenames).
from os import walk
from os.path import splitext, join

def select_files(root, files):
    """
    Simple logic here to filter out interesting files
    (.py files in this example).
    """
    selected_files = []

    for file in files:
        # do concatenation here to get the full path
        full_path = join(root, file)
        ext = splitext(file)[1]
        if ext == ".py":
            selected_files.append(full_path)

    return selected_files

def build_recursive_dir_tree(path):
    """
    path - where to begin the folder scan
    """
    selected_files = []

    for root, dirs, files in walk(path):
        selected_files += select_files(root, files)

    return selected_files
I think the problem is that you're not processing the output of os.walk correctly.
Firstly, change:
filePath = rootdir + '/' + file
to:
filePath = root + '/' + file
rootdir is your fixed starting directory; root is a directory returned by os.walk.
Secondly, you don't need to indent your file processing loop, as it makes no sense to run this for each subdirectory. You'll get root set to each subdirectory. You don't need to process the subdirectories by hand unless you want to do something with the directories themselves.
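Putting both fixes together, a minimal sketch of the corrected structure (keeping the question's sys.argv setup; the processing is just a placeholder):

import os
import sys

rootdir = sys.argv[1]

for root, subFolders, files in os.walk(rootdir):
    for filename in files:
        # Join against root, the directory currently being walked,
        # not against the fixed starting directory rootdir.
        filePath = os.path.join(root, filename)
        with open(filePath, 'r') as f:
            contents = f.read()  # process the file's contents here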
Try this:

import os
import sys

path = sys.argv[1]  # assumption: the directory to walk is passed as the first argument

for root, subdirs, files in os.walk(path):
    for file in os.listdir(root):
        filePath = os.path.join(root, file)
        if os.path.isdir(filePath):
            pass
        else:
            f = open(filePath, 'r')
            # Do Stuff
If you prefer an (almost) one-liner:

from pathlib import Path

lookuppath = '.'  # use your path
filelist = [str(item) for item in Path(lookuppath).glob("**/*") if item.is_file()]
In this case you will get a list with just the paths of all files located recursively under lookuppath.
Without str(), you would get PosixPath objects in the list instead of plain string paths.
This worked for me:
import glob

root_dir = "C:\\Users\\Scott\\"  # Don't forget the trailing (last) slash
for filename in glob.iglob(root_dir + '**/*.jpg', recursive=True):
    print(filename)
    # do stuff
If just the file names are not enough, it's easy to implement a Depth-first search on top of os.scandir():
import os

stack = ['.']
files = []
total_size = 0

while stack:
    dirname = stack.pop()
    with os.scandir(dirname) as it:
        for e in it:
            if e.is_dir():
                stack.append(e.path)
            else:
                size = e.stat().st_size
                files.append((e.path, size))
                total_size += size
The docs have this to say:
The scandir() function returns directory entries along with file attribute information, giving better performance for many common use cases.
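A minimal sketch of what that means in practice (assumption: scanning the current directory for .txt files):

import os

# On a DirEntry, is_file() usually needs no extra system call because
# scandir already fetched the entry type; stat() results are cached on
# the entry after the first call.
with os.scandir('.') as it:
    for entry in it:
        if entry.is_file() and entry.name.endswith('.txt'):
            print(entry.name, entry.stat().st_size)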