I am trying to create a program in Python that will search through a directory of files and create a dictionary whose keys are the various file extensions in the directory, and whose values constitute lists containing the number of times that extension appears in the directory, the size of the largest file with that extension, the size of the smallest, and the average size of files with that extension.
I have written the following so far:
    for root, dirs, files in os.walk('.'):
        contents={}
        for name in files:
            size=os.path.getsize(name)
            title, extension=os.path.splitext(name)
            if extension not in contents:
                contents[extension]=[1, size, size, size]
            else:
                contents[extension][0]=contents[extension][0]+1
                contents[extension][3]=contents[extension][3]+size
                if size>=contents[extension][1]:
                    contents[extension][1]=size
                elif size<contents[extension][2]:
                    contents[extension][2]=size
    contents[extension][3]=contents[extension][3]/contents[extension][0]
    print(contents)
If I import os and use os.chdir() to enter the directory I want to explore, this script mostly works: it returns a dictionary whose keys are the extensions in the directory, and whose values are lists that correctly report how many times each extension appears, the size of the largest file with that extension, and the size of the smallest. Where it goes wrong is the average: it is calculated correctly in one case, but in the others it is incorrect, and inconsistently so.
Any advice for fixing this? I'd like the dictionary to show the proper average in each case. I'm new to Python, and to programming, and am clearly missing something!
Thanks in advance.
In your last step,

    contents[extension][3]=contents[extension][3]/contents[extension][0]

you're only performing this division for a single extension (whichever one `extension` happens to refer to after the loop ends). You need to loop through all of your extensions:

    for extension in contents:
        contents[extension][3]=contents[extension][3]/contents[extension][0]
One thing that's certainly a problem is that to get the size of a file, you need to use the correct relative path. When os.walk() recurses into a subdirectory, the relative path is os.path.join(root, name) -- not just name. So you should be getting the size like this:

    size=os.path.getsize(os.path.join(root, name))

(Your variable root is not actually the "root" of the directory tree; it is each directory whose files are being listed in files.)
Will this fix the problem? Who knows. The way your code is now it should be raising an exception, so either you don't have any subdirectories or you are not showing us your complete code.
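To see the joined-path pattern in isolation, here is a minimal sketch that just totals file sizes under the current directory; summing is a stand-in operation, not part of the original question:

```python
import os

# Joining root and name yields a path that is valid no matter how
# deep os.walk() has recursed below the starting directory.
total = 0
for root, dirs, files in os.walk('.'):
    for name in files:
        total += os.path.getsize(os.path.join(root, name))
print(total)
```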
Try:

    contents={}
    for root, dirs, files in os.walk('.'):
        for name in files:
            size=os.path.getsize(os.path.join(root, name))
            title, extension=os.path.splitext(name)
            if extension not in contents:
                contents[extension]=[1, size, size, size]
            else:
                contents[extension][0]=contents[extension][0]+1
                contents[extension][3]=contents[extension][3]+size
                if size>=contents[extension][1]:
                    contents[extension][1]=size
                elif size<contents[extension][2]:
                    contents[extension][2]=size
    for k in contents:
        contents[k][3]=contents[k][3] / float(contents[k][0])
    print(contents)

(Note that contents={} is moved out of the os.walk() loop so the totals aren't reset for every subdirectory, and os.path.join(root, name) is used so sizes can be read from files inside subdirectories.)
You are calculating the average for only one of the extensions: the last.
Also, use float: under Python 2's integer division, the answer is otherwise truncated and not exact.
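The float point is easy to demonstrate with toy numbers (this only matters under Python 2; in Python 3, / is already true division):

```python
# In Python 2, / between two ints truncates (7 / 2 == 3), so a size total
# divided by a count loses the fractional part. Coercing one operand to
# float gives the exact average under both Python 2 and 3.
total_size = 7
count = 2
average = total_size / float(count)
print(average)  # 3.5
```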
How to get Absolute file path within a specified directory and ignore dot(.) directories and dot(.)files
I have the solution below, which provides the full path of every file within the directory recursively.
Help me find the fastest way to list files with their full paths while ignoring .directories/ and .files
(the directory may contain 100 to 500 million files).
    import os

    def absoluteFilePath(directory):
        for dirpath, _, filenames in os.walk(directory):
            for f in filenames:
                yield os.path.abspath(os.path.join(dirpath, f))

    for files in absoluteFilePath("/my-huge-files"):
        # use some starts-with-dot logic here? or any better solution
        ...
Example:
/my-huge-files/project1/file{1..100} # Consider all files from file1 to 100
/my-huge-files/.project1/file{1..100} # ignore .project1 directory and its files (Do not need any files under .(dot) directories)
/my-huge-files/project1/.file1000 # ignore .file1000, it is starts with dot
os.walk by definition visits every file in a hierarchy, but you can select which ones you actually print with a simple textual filter.
    for file in absoluteFilePath("/my-huge-files"):
        if '/.' not in file:
            print(file)
When your starting path is already absolute, calling os.path.abspath on it is redundant, but I guess in the great scheme of things, you can just leave it in.
Don't use os.walk(), as it will visit every file.
Instead, fall back to os.scandir() or os.listdir() and write your own implementation.
You can use pathlib.Path(test_path).expanduser().resolve() to fully expand a path.
    import os
    from pathlib import Path

    def walk_ignore(search_root, ignore_prefixes=(".",)):
        """ recursively walk directories, ignoring files with some prefix
        pass search_root as an absolute directory to get absolute results
        """
        for dir_entry in os.scandir(Path(search_root)):
            if dir_entry.name.startswith(ignore_prefixes):
                continue
            if dir_entry.is_dir():
                yield from walk_ignore(dir_entry, ignore_prefixes=ignore_prefixes)
            else:
                yield Path(dir_entry)
You may be able to save some overhead with a closure, coercing to Path once, yielding only .name, etc., but that's really up to your needs.
Also, not strictly your question but related to it: if the files are very small, you'll likely find that packing them together (several files in one) or tuning the filesystem block size gives tremendously better performance.
Finally, some filesystems come with bizarre caveats specific to them, and you can likely break this with oddities like symlink loops.
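A self-contained demonstration of the generator, with its definition repeated and an invented throwaway tree standing in for /my-huge-files:

```python
import os
import tempfile
from pathlib import Path

def walk_ignore(search_root, ignore_prefixes=(".",)):
    # Skip any entry whose name starts with one of the prefixes;
    # recurse into non-dot directories, yield non-dot files as Paths.
    for dir_entry in os.scandir(Path(search_root)):
        if dir_entry.name.startswith(ignore_prefixes):
            continue
        if dir_entry.is_dir():
            yield from walk_ignore(dir_entry, ignore_prefixes=ignore_prefixes)
        else:
            yield Path(dir_entry)

# Build a tiny tree mirroring the example: one normal project directory,
# one dot directory, and one dot file inside the normal directory.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "project1"))
os.makedirs(os.path.join(base, ".project1"))
open(os.path.join(base, "project1", "file1"), "w").close()
open(os.path.join(base, ".project1", "file1"), "w").close()
open(os.path.join(base, "project1", ".file1000"), "w").close()

found = sorted(p.name for p in walk_ignore(base))
print(found)  # ['file1'] -- the dot directory and dot file are skipped
```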
Let
my_dir = "/raid/user/my_dir"
be a folder on my filesystem, which is not the current folder (i.e., it's not the result of os.getcwd()). I want to retrieve the absolute paths of all files at the first level of hierarchy in my_dir (i.e., the absolute paths of all files which are in my_dir, but not in a subfolder of my_dir) as a list of strings absolute_paths. I need it, in order to later delete those files with os.remove().
This is nearly the same use case as
Get absolute paths of all files in a directory
but the difference is that I don't want to traverse the folder hierarchy: I only need the files at the first level of hierarchy (at depth 0? not sure about terminology here).
It's easy to adapt that solution: Call os.walk() just once, and don't let it continue:
root, dirs, files = next(os.walk(my_dir, topdown=True))
files = [ os.path.join(root, f) for f in files ]
print(files)
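The same first-level listing can also be written with pathlib, whose Path.iterdir() does not recurse; a sketch, using the current directory as a stand-in for my_dir:

```python
from pathlib import Path

my_dir = "."  # stand-in for "/raid/user/my_dir"
# iterdir() yields only direct children; is_file() filters out subfolders.
absolute_paths = [str(p.resolve()) for p in Path(my_dir).iterdir() if p.is_file()]
print(absolute_paths)
```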
You can use the os.path module and a list comprehension.

    import os
    absolute_paths = [os.path.abspath(os.path.join(my_dir, f)) for f in os.listdir(my_dir) if os.path.isfile(os.path.join(my_dir, f))]

(Note that os.listdir() returns bare names, so each one must be joined with my_dir before testing isfile() or resolving the absolute path; otherwise the names are interpreted relative to the current working directory.)
You can use os.scandir which returns an os.DirEntry object that has a variety of options including the ability to distinguish files from directories.
    with os.scandir(somePath) as it:
        paths = [entry.path for entry in it if entry.is_file()]
    print(paths)
If you want to list directories as well, you can, of course, remove the condition from the list comprehension if you want to see them in the list.
The documentation also has this note under listdir():
See also The scandir() function returns directory entries along with file attribute information, giving better performance for many common use cases.
I need to solve a trivial task, running a sequence of commands in a loop:
1) take an input .dcd file from the folder
2) perform some operations on the file
3) save the results in a list
My code (which is not working!) looks like
    # make LIST OF THE input DCD FILES
    path="./inputs/"
    dirs=os.listdir(path)
    for traj in dirs:
        trajectory = command(traj)

It correctly determines the name of each input, but reports that every file is empty.
Alternatively, I've used the script below to loop through the files using a digit variable assigned to the name of each file (which is not good for my current task because I need to keep the name of each input file and avoid using digits!):

    # number of input files
    n=3
    for i in xrange (1,n+1):
        trajectory = command('./inputs/file_%d.dcd' %(i))

In this last case all the dcd files were loaded correctly (in contrast to the first example)! So the question is: what should I fix in the first example?
os.listdir() gives you only the base filenames relative to the directory. No path is included.
Prefix your filenames with the path:
    for traj in dirs:
        trajectory = command(os.path.join(path, traj))
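The difference is easy to see in isolation; here is a small demonstration using an invented temporary folder and filename:

```python
import os
import tempfile

path = tempfile.mkdtemp()  # stand-in for "./inputs/"
open(os.path.join(path, "file_1.dcd"), "w").close()

names = os.listdir(path)                       # bare names only, no directory part
joined = [os.path.join(path, n) for n in names]
print(names[0])   # file_1.dcd -- opening this fails unless cwd happens to be path
print(joined[0])  # full path that works from anywhere
```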
What is the most efficient and fastest way to get a single file from a directory using Python?
More details on my specific problem:
I have a directory containing a lot of pregenerated files, and I just want to pick a random one. Since I know that there's no really efficient way of picking a random file from a directory other than listing all the files first, my files are generated with an already random name, thus they are already randomly sorted, and I just need to pick the first file from the folder.
So my question is: how can I pick the first file from my folder, without having to load the whole list of files from the directory (nor having the OS do that; my optimal goal would be to force the OS to return me just a single file and then stop!).
Note: I have a lot of files in my directory, hence why I would like to avoid listing all the files to just pick one.
Note2: each file is only picked once, then deleted to ensure that only new files are picked the next time (thus ensuring some kind of randomness).
SOLUTION
I finally chose to use an index file that will store:
the index of the current file to be picked (eg: 1 for file1.ext, 2 for file2.ext, etc..)
the index of the last file generated (eg: 1999 for file1999.ext)
Of course, this means that my files are not generated with a random name anymore, but using a deterministic incrementable pattern (eg: "file%s.ext" % ID)
Thus I have a near constant time for my two main operations:
Accessing the next file in the folder
Counting the number of files that are left (so that I can generate new files in a background thread when needed).
This is a specific solution for my problem, for more generic solutions, please read the accepted answer.
Also, you might be interested in these two other solutions I've found to optimize file access and directory walking in Python:
os.walk optimized
Python FAM (File Alteration Monitor)
Don't keep a lot of pregenerated files in one directory; divide them over subdirectories once a directory holds more than some threshold 'n' of files.
When creating the files, add the name of the newest file to a list stored in a text file. When you want to read/process/delete a file:
Open the text file
Set filename to the name on the top of the list.
Delete the name from the top of the list
Close the text file
Process filename.
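The steps above can be sketched as follows; the index filename and the one-name-per-line layout are assumptions, not from the original post:

```python
import os
import tempfile

def pop_next_filename(index_path):
    """Remove and return the first name in the index file, or None when empty."""
    with open(index_path) as fh:
        names = fh.read().splitlines()
    if not names:
        return None
    # Rewrite the index without the first entry, then return that entry.
    with open(index_path, "w") as fh:
        fh.write("\n".join(names[1:]))
    return names[0]

# Demo: seed an index with two generated names, then pop them in order.
index_path = os.path.join(tempfile.mkdtemp(), "index.txt")
with open(index_path, "w") as fh:
    fh.write("file1.ext\nfile2.ext")
first = pop_next_filename(index_path)
print(first)  # file1.ext
```

A real version would also need locking if the generator thread and the consumer touch the index concurrently.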
Just use random.choice() on the os.listdir() result:
import random
import os
randomfilename = random.choice(os.listdir(path_to_directory))
os.listdir() returns results in the ordering given by the OS. Using random filenames does not change that ordering, only adding items to or removing items from the directory can influence that ordering.
If you fear that you'll have too many files, do not use a single directory. Instead, set up a tree of directories with pre-generated names, pick one of those at random, then pick a file from there.
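A minimal sketch of the two-level random pick; the shard directory names are invented, and every subdirectory is assumed to be non-empty:

```python
import os
import random
import tempfile

def pick_random_file(base):
    """Pick a random subdirectory of base, then a random file inside it."""
    subdirs = [d for d in os.listdir(base)
               if os.path.isdir(os.path.join(base, d))]
    subdir = random.choice(subdirs)
    name = random.choice(os.listdir(os.path.join(base, subdir)))
    return os.path.join(base, subdir, name)

# Demo on a small throwaway tree with two shards of one file each.
base = tempfile.mkdtemp()
for d in ("shard0", "shard1"):
    os.makedirs(os.path.join(base, d))
    open(os.path.join(base, d, "f.bin"), "w").close()
picked = pick_random_file(base)
print(picked)
```

This way each os.listdir() call only touches one small directory instead of the whole collection.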
I am still very new to Python, but I am trying to create a program which will, among other things, copy the contents of a directory into a set of directories that will each fit onto a disc (I have set up the following variables for the size capacities I want, and an input prompt to choose which one applies):
BluRayCap = 25018184499
DVDCap = 4617089843
CDCap = 681574400
So basically I want to copy the contents of a beginning directory into another directory, and as needed, create another directory in order for the contents to fit into discs.
I kind of hit a roadblock here. Thanks!
You can use os.path.getsize to get the size of a file, and os.walk to walk a directory tree, so something like the following (I'll let you implement CreateOutputDirectory and CopyFileToDirectory):

    current_destination = CreateOutputDirectory()
    current_size = 0
    for root, folders, files in os.walk(input_directory):
        for file in files:
            file_size = os.path.getsize(os.path.join(root, file))
            if current_size + file_size > limit:
                current_destination = CreateOutputDirectory()
                current_size = 0
            CopyFileToDirectory(root, file, current_destination)
            current_size += file_size

(The running total is kept in a variable because calling os.path.getsize() on a directory only returns the size of the directory entry itself, not of its contents; also note that files must be joined with root so sizes can be read from subdirectories.)
Also, you may find the Python Search extension for Chrome helpful for looking up this documentation.
Michael Aaron Safyan's answer is good.
Besides, you can use the shutil module to implement CreateOutputDirectory and CopyFileToDirectory.
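A possible sketch of those two helpers with shutil; the disc_N naming scheme and the temporary output base are assumptions for the demo:

```python
import os
import shutil
import tempfile

output_base = tempfile.mkdtemp()  # stand-in for the real destination root
disc_count = [0]                  # mutable counter shared across calls

def CreateOutputDirectory():
    """Create output_base/disc_0, disc_1, ... and return the new path."""
    path = os.path.join(output_base, "disc_%d" % disc_count[0])
    disc_count[0] += 1
    os.makedirs(path)
    return path

def CopyFileToDirectory(root, name, destination):
    """Copy one file into destination, preserving metadata with shutil.copy2."""
    shutil.copy2(os.path.join(root, name), destination)

# Demo: copy a single generated file into a fresh disc directory.
src_root = tempfile.mkdtemp()
with open(os.path.join(src_root, "a.txt"), "w") as fh:
    fh.write("hello")
dest = CreateOutputDirectory()
CopyFileToDirectory(src_root, "a.txt", dest)
print(os.listdir(dest))  # ['a.txt']
```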