Count number of files with certain extension in Python

I am fairly new to Python and I am trying to figure out the most efficient way to count the number of .TIF files in a particular sub-directory.
Doing some searching, I found one example (which I have not tested) that claims to count all of the files in a directory:
file_count = sum((len(f) for _, _, f in os.walk(myPath)))
This is fine, but I need to count only TIF files. My directory will contain other file types, but I only want to count TIFs.
Currently I am using the following code:
import os

tifCounter = 0
for root, dirs, files in os.walk(myPath):
    for file in files:
        if file.endswith('.tif'):
            tifCounter += 1
It works fine, but the looping seems excessive/expensive to me. Is there any way to do this more efficiently?
Thanks.

Something has to iterate over all files in the directory, and look at every single file name - whether that's your code or a library routine. So no matter what the specific solution, they will all have roughly the same cost.
If you think it's too much code, and if you don't actually need to search subdirectories recursively, you can use the glob module:
import glob
tifCounter = len(glob.glob1(myPath,"*.tif"))
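Note that glob.glob1 is an undocumented internal helper, so it may not be available in every Python version. A minimal sketch of the same count using only the documented glob.glob (assuming myPath as above):

import glob
import os

# Join the directory onto the pattern instead of relying on glob.glob1
tifCounter = len(glob.glob(os.path.join(myPath, "*.tif")))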

For this particular use case, if you don't want to recursively search in the subdirectory, you can use os.listdir:
len([f for f in os.listdir(myPath)
     if f.endswith('.tif') and os.path.isfile(os.path.join(myPath, f))])

Your code is fine.
Yes, you're going to need to loop over those files to filter out the .tif files, but looping over a small in-memory list is negligible compared to the work of scanning the directory to find those files in the first place, which you have to do anyway.
I wouldn't worry about optimizing this code.

If you do need to search recursively, or for some other reason don't want to use the glob module, you could use
file_count = sum(len([f for f in fs if f.lower().endswith('.tif')]) for _, _, fs in os.walk(myPath))
This is the "Pythonic" way to adapt the example you found for your purposes. But it's not going to be significantly faster or more efficient than the loop you've been using; it's just a really compact syntax for more or less the same thing.

Try using the fnmatch module:
https://docs.python.org/2/library/fnmatch.html
import fnmatch, os
num_files = len(fnmatch.filter(os.listdir(your_dir),'*.tif'))
print(num_files)
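If you do need the recursive count after all, fnmatch.filter composes naturally with os.walk; a sketch, assuming myPath as in the question:

import fnmatch
import os

# Filter each directory's file list against the pattern and sum the matches
num_files = sum(len(fnmatch.filter(files, '*.tif')) for _, _, files in os.walk(myPath))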


How to open files in a particular folder with randomly generated names?

I have a folder named 2018 and the files within that folder are named randomly. I want to iterate through all of the files and open them up.
I will post three names of the files as an example but note that there are over a thousand files in this folder so it has to work on a large scale without any hard coding.
0a2ec2da-628d-417d-9520-b0889886e2ac_1.xml
00a6b260-951d-46b5-ab27-b2e8729e664d_1.xml
00a6b260-951d-46b5-ab27-b2e8729e664d_2.xml
You're looking for os.walk().
In general, if you want to do something with files, it's worth glancing at the os, os.path, pathlib and other built-in modules. They're all documented.
You could also use glob expansion to expand "folder/*" into a list of all the filenames, but os.walk is probably better.
With os.listdir() or os.walk(), depending on whether you want to do it recursively or not.
You can go through the Python docs:
https://docs.python.org/3/library/os.html#os.walk
https://docs.python.org/3/library/os.html#os.listdir
Once you have the list of files, you can read each one simply:
for file in files:
    with open(file, "r") as f:
        contents = f.read()  # perform file operations here
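Putting the pieces together, a minimal sketch for the question as asked (assuming the folder is literally named 2018 and sits in the current working directory):

import os

folder = "2018"
for entry in os.listdir(folder):
    path = os.path.join(folder, entry)
    if os.path.isfile(path):  # skip any subdirectories
        with open(path, "r") as f:
            contents = f.read()
            # process contents here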

prevent getfiles from seeing .DS and other hidden files

I am currently working on a Python project on my Mac, and from time to time I get unexpected errors because .DS or other hidden files, which are not visible to a normal user, are found in folders. I am using the following command
filenames = getfiles.getfiles(file_directory)
to retrieve information about the number and names of the files in a folder. So I was wondering if there is a possibility to prevent the getfiles command from seeing these types of files, for example by limiting its permissions or the extensions it can see (all files are of .txt format).
Many thanks in advance!
In your case, I would recommend switching to the Python standard library module glob.
If all your files are of .txt format and are located in the directory /sample/directory/, you can use the following script to get the list of files you want:

import glob

filenames = glob.glob("/sample/directory/*.txt")
You can use glob's shell-style wildcard patterns to match files and filter out the ones you do not need; more details can be found in the documentation for the glob module.
Keep in mind that wildcard patterns can express more complicated matching than the above example, which should cover your future needs.
glob matches one pattern at a time, so matching multiple extensions usually means combining the results of several glob calls.
If you only want to get the basenames of those files, you can always use standard library os to extract basenames from the full paths.
import os
file_basenames = [os.path.basename(full_path) for full_path in filenames]
There isn't an option to filter within getfiles, but you could filter the list after.
Most-likely you will want to skip all "dot files" ("system files", those with a leading .), which you can accomplish with code like the following.
import os

filenames = [f for f in getfiles.getfiles(file_directory)
             if not os.path.basename(f).startswith('.')]
Welcome to Stack Overflow.
You might find the glob module useful. The glob.glob function takes a path including wildcards and returns a list of the filenames that match.
This would allow you to either select the files you want, like
filenames = glob.glob(os.path.join(file_directory, "*.txt"))
Alternatively, select the files you don't want, and ignore them:
exclude_files = glob.glob(os.path.join(file_directory, ".*"))
for filename in getfiles.getfiles(file_directory):
    if filename in exclude_files:
        continue
    # process the file
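One small refinement, assuming the paths returned by glob and by getfiles are in the same form: storing the exclusions in a set makes the membership test constant-time, which matters if the folder is large.

import glob
import os

# A set gives O(1) lookups instead of scanning a list on every iteration
exclude_files = set(glob.glob(os.path.join(file_directory, ".*")))
for filename in getfiles.getfiles(file_directory):
    if filename in exclude_files:
        continue
    # process the file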

os.listdir() not printing out all files

I've got a bunch of files and a few folders. I'm trying to append the zips to a list so I can extract those files in another part of the code. It never finds the zips.
for file in os.listdir(path):
    print(file)
    if file.split(".")[1] == 'zip':
        reg_zips.append(file)
The path is fine or it wouldn't print out anything. It picks up the same files each time but will not pick up any others. It picks up about 1/5th of the files in the directory.
At a complete loss. I've made sure that some weird race condition with the file availability isn't the problem by putting a time.sleep(3) in the code. Didn't solve it.
It's possible your files have more than one period in them, in which case file.split(".")[1] returns the piece after the first period rather than the real extension (and a name with no period at all would raise an IndexError). Try using str.endswith:
reg_zips = []
for file in os.listdir(path):
    if file.endswith('.zip'):
        reg_zips.append(file)
Another good idea (thanks, Jean-François Fabre!) is to use os.path.splitext, which handles the extension quite nicely:
if os.path.splitext(file)[-1] == '.zip':
    ...
An even better solution: use the glob.glob function, joining the pattern onto path so it searches the right directory:
import glob
import os

reg_zips = glob.glob(os.path.join(path, '*.zip'))
reg_zips = [z for z in os.listdir(path) if z.endswith(".zip")]

Finding File Path in Python

I'd like to find the full path of any given file, but when I tried to use
os.path.abspath("file")
it would only give me the file location as being in the directory where the program is running. Does anyone know why this is or how I can get the true path of the file?
What you are looking to accomplish here is ultimately a search over your filesystem. This does not work out too well, because you may well have multiple files with the same name, so you aren't going to know with certainty whether the first match you get is in fact the file that you want.
I will give you an example of how you can start yourself off with something simple that will allow you traverse through directories to be able to search.
You will have to give some kind of base path to initiate the search for where this file resides. Keep in mind that the broader your base path, the more expensive the search is going to be.
You can do this with the os.walk method.
Here is a simple example of using os.walk; it collects all the file paths with a matching filename.
Using os.walk
from os import walk
from os.path import join

d = 'some_file.txt'
paths = []
for i in walk('/some/base_path'):
    if d in i[2]:
        paths.append(join(i[0], d))
So, for each iteration over os.walk you are going to get a tuple that holds:
(path, directories, files)
So that is why I am checking against location i[2] to look at files. Then I join with i[0], which is the path, to put together the full filepath name.
Finally, you can actually put the above code all into one line and do:
paths = [join(i[0], d) for i in walk('/some/base_path') if d in i[2]]
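For comparison, a pathlib-based sketch of the same search (assuming Python 3.5+; Path.rglob does the recursion for you):

from pathlib import Path

# rglob matches the name anywhere under the base path
paths = [str(p) for p in Path('/some/base_path').rglob('some_file.txt')]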

Python - walk through a huge set of files but in a more efficient manner

I have a huge set of files that I want to traverse through using Python. I am using os.walk(source) for this, and it works, but since I have a huge set of files it is taking too much time and memory, since it gets the complete list of names for each directory at once. How can I optimize this to use fewer resources, perhaps walking through one directory at a time or in some other efficient manner, while still iterating over the complete set of files? Thanks
for dir, dirnames, filenames in os.walk(START_FOLDER):
    for name in dirnames:
        #if PRIVATE_FOLDER not in name:
        for keyword in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST:
            if keyword in name.lower():
                ignoreList.append(name)
If the issue is that the directory simply has too many files in it, this will hopefully be solved in Python 3.5.
Until then, you may want to check out scandir.
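For reference, a sketch of the scandir approach (assuming Python 3.6+ for the context-manager form; on older versions the third-party scandir package offers the same interface):

import os

# os.scandir yields entries lazily and caches file-type information,
# usually avoiding a separate stat() call per name
with os.scandir(START_FOLDER) as entries:
    subdirs = [entry.name for entry in entries if entry.is_dir()]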
You can use the in operator together with any() to test whether a directory name matches any keyword.
for _, dirnames, _ in os.walk(START_FOLDER):
    for name in dirnames:
        if any(k in name.lower() for k in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST):
            ignoreList.append(name)
If your ignoreList is too big, you may want to think about creating an acceptedList and using that instead.
