Pythonic way to delete files/folders older than X days

In my Python3 program I need to delete files and folders that are older than X days. I know there are many similar questions here, but in my case I don't need to check the modification times of these folders and files. Instead I have the following folder structure:
/root_folder/<year>/<month>/<day>/<files>
So for example something like this:
.
└── 2020
    ├── 04
    │   └── 30
    │       ├── file.1
    │       └── file.2
    └── 05
        ├── 14
        │   ├── file.1
        │   └── file.2
        ├── 19
        ├── 21
        │   └── file.1
        └── 22
            ├── file.1
            ├── file.2
            └── file.3
What I want now is to delete all the folders and their files that represent a date older than X days. I have created a solution, but coming from Java it seems to me that it is not very Pythonic, or that it might be easier to solve in Python. Can you Python experts guide me a bit here, of course taking into account "jumps" over months and years?

Not a python expert here either, but here's something simple:
Find the oldest date that you want to keep. Anything older than this will be deleted. Let's say it is 28/04/2020.
From that date, you can build a string "/root_folder/2020/04/28".
List all the files. If a file's path (as a string) is less than the string from the previous step, you can delete it. This works because the year, month and day components are zero-padded, so string order matches date order.
Example:
import os

files = []
# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if '.txt' in file:
            files.append(os.path.join(r, file))
Source of that snippet: https://mkyong.com/python/python-how-to-list-all-files-in-a-directory/
Now, you can do:
for f in files:
    if f < date_limit:
        os.remove(f)
Note: This is not optimal
It deletes file by file, but the moment you enter the if you could just delete the whole folder where this file is (but then the list of files points to files that have been deleted).
You actually don't care about the files. You could apply the logic to folders alone and remove them recursively.
Update: doing both steps as we browse the folders:
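import shutil

for r, d, f in os.walk(path):
    # Only test day-level folders: a parent such as ".../2020" is a prefix of date_limit and would compare as "older".
    if os.path.relpath(r, path).count(os.sep) == 2 and r < date_limit:
        print(f"Deleting {r}")
        shutil.rmtree(r)
        d.clear()  # do not descend into the folder that was just removed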
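The "jumps" over months and years mentioned in the question can also be handled by letting datetime do the arithmetic instead of comparing strings. The following is only a minimal sketch, not the answer's own code: the function name delete_older_than and its today parameter are hypothetical, and it assumes the /root_folder/<year>/<month>/<day> layout with numeric folder names.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path


def delete_older_than(root, days, today=None):
    """Remove <root>/<year>/<month>/<day> folders dated before today - days."""
    today = today or date.today()
    cutoff = today - timedelta(days=days)  # timedelta handles month/year jumps
    removed = []
    for day_dir in sorted(Path(root).glob("*/*/*")):
        if not day_dir.is_dir():
            continue
        try:
            year, month, day = (int(p) for p in day_dir.relative_to(root).parts)
            folder_date = date(year, month, day)
        except ValueError:
            continue  # not a date-shaped folder, leave it alone
        if folder_date < cutoff:
            shutil.rmtree(day_dir)
            removed.append(day_dir)
    return removed
```

Note that this sketch leaves month and year folders in place even when they end up empty; removing those would be a separate pass.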

Glob your paths to get your filepaths in an array then run it something like this (below), good luck!
import os
import time

def is_file_access_older_than(file_path, seconds, from_time=None):
    """based on st_atime --> https://docs.python.org/3/library/os.html#os.stat_result.st_atime"""
    if not from_time:
        from_time = time.time()
    if (from_time - os.stat(file_path).st_atime) > seconds:
        return file_path
    return False
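As a rough sketch of how the glob step and an access-time check like the one above could fit together: the name files_accessed_before and its now parameter are hypothetical. Keep in mind that this relies on access times, which the question said it does not need, and that st_atime may not be updated on filesystems mounted with options such as noatime.

```python
import time
from pathlib import Path


def files_accessed_before(root, seconds, now=None):
    """Return files under root whose last access is more than `seconds` ago."""
    now = time.time() if now is None else now
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and now - p.stat().st_atime > seconds
    )
```

The returned paths could then be passed to Path.unlink() one by one.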

Related

Recursive Function to get all files from main folder and subdirectories inside it in Python

I have a file directory that looks something like this. I have a larger directory, but showing this one just for explanation purposes:
.
├── a.txt
├── b.txt
├── foo
│   ├── w.txt
│   └── a.txt
└── moo
    ├── cool.csv
    ├── bad.csv
    └── more
        └── wow.csv
I want to write a recursive function to get year counts for files within each subdirectory within this directory.
I want the code to basically check if it's a directory or file. If it's a directory then I want to call the function again and get counts until there's no more subdirectories.
I have the following code (which keeps breaking my kernel when I test it). There's probably some logic error as well I would think..
import os
import pandas as pd

dir_path = 'S:\\Test'

def getFiles(dir_path):
    contents = os.listdir(dir_path)
    # check if content is directory or not
    for file in contents:
        if os.path.isdir(os.path.join(dir_path, file)):
            # get everything inside subdirectory
            getFiles(dir_path = os.path.join(dir_path, file))
        # it's a file
        else:
            # do something to get the year of the file and put it in a list or something
            pass
    # at the end create pandas data frame and return
Expected output would be a pandas dataframe that looks something like this..
Subdir  2020  2021  2022
foo        0     1     1
moo        0     2     0
more       1     0     0
How can I do this in Python?
EDIT:
Just realized os.walk() is probably extremely useful for my case here.
Trying to figure out a solution with os.walk() instead of doing it the long way..
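A minimal sketch of the os.walk() idea from the edit above. The function name year_counts is hypothetical, and it assumes that the "year of the file" means the year of its last modification time, which the question does not specify.

```python
import os
from collections import Counter, defaultdict
from datetime import datetime


def year_counts(top):
    """Map each directory (relative to top) to a Counter of file years."""
    counts = defaultdict(Counter)
    for root, _dirs, files in os.walk(top):
        subdir = os.path.relpath(root, top)  # "." for the top folder itself
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            counts[subdir][datetime.fromtimestamp(mtime).year] += 1
    return dict(counts)
```

The resulting dict of per-folder counters could probably be handed to pandas.DataFrame.from_dict(..., orient="index") to get a table shaped like the expected output, with missing years filled in as 0.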

Getting string between a character and a symbol, without caring about what follows the symbol

In order to sort some files into folders, I have to get the number (a sort of ID) of both the folders (named p.X, where "p." is fixed and X is a number that can range from 1 to 200150) and the files (named PX_N.gmspr, where P is fixed, X is the ID number of the folder and N is an identifier of the file, which can be 2, 3, 6, 8, 9, A or H).
An example would be p.24 and P24_2.gmspr, P24_3.gmspr, P24_6.gmspr, P24_8.gmspr, P24_9.gmspr, P24_A.gmspr and P24_H.gmspr, in order to move all P24_N.gmspr to p.24
The PX_N.gmspr files are in a different folder than the target folders p.X . A little of os.chdir and os.rename and the files can be moved easily so I believe that is not a problem.
What I want is to obtain the X number of the filename to compare with the folder number, forgetting about both the P and the _N.gmspr string.
Whereas I can obtain the folder number via
foldername.split(".",1)[1] I don't really know how to do it for the file number.
To sum up, I want to move some files called PX_N.gmspr to another folder identified almost the same p.X
Any idea? Thank you!!!
EDIT:
Regarding the answer given, I have to clarify what I am trying to do, especially with the file and folder format:
Mother folder
├── Unclassified
│ └── All PX_N.gmspr, PX being certain files that have to be moved to other folders, X a number that ranges from 1 to 200150 (but not exactly 200150, it is just a number ID) and N can be only 2, 3, 6, 9, A or H, nothing more. In total 15435 elements, with each of the X having one of the 6 possible N gmspr.
├──First Folder
│ └── p.X folders (X from 1 to 151), the aim is to select all the PX_N.gmspr files that agree with the X number that matches the p.X of the folder and move it to each folder.
├──Second Folder
│ └── p.X folders (X from 152 to 251, plus p.602 to p.628, p.823, p.824,
│ p.825, p.881 and p.882)
└──Third Folder
└── p.X folders (X from 252 to 386, plus p.585, p.586 and p. 587)
There are some other folders in order to order some more of the 15435 files.
I am currently searching about regex; unluckily for me, it is the first time I actually have to use them.
EDIT (solved): the point was to play with regex and get only the numbers, but since nested lists appeared, only the first number was useful.

This is the perfect job for regexes.
First, let's create a temporary dir and fill it with some files to demonstrate.
from pathlib import Path
from random import choices, randint
from string import ascii_letters
from tempfile import TemporaryDirectory

tmpdir = TemporaryDirectory()
for i in range(4):
    n = randint(1, 999)
    for j in range(randint(1, 5)):
        Path(
            tmpdir.name, f"P{n}.{''.join(choices(ascii_letters, k=10))}"
        ).touch()
Now we have 4 types of file (PN.<suffix>), with between 1 and 5 files of each type.
Then we just need to iterate through those files, extract the N from the file name with the regex P(\d+)\..+, and finally create the destination dir and move the file.
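from pathlib import Path
import re

dir_re = re.compile(r"P(\d+)\..+")
# Snapshot the listing first, since folders are created in the same directory while looping.
for filepath in list(Path(tmpdir.name).iterdir()):
    m = dir_re.match(filepath.name)
    if m is None:
        continue  # skip anything that is not a P<number>.<suffix> file
    dirpath = filepath.parent / f"p.{m.group(1)}"
    if not dirpath.is_dir():
        dirpath.mkdir()
    filepath.rename(dirpath / filepath.name)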
For instance, starting from a flat temp directory, we now have the following sorted tree.
/var/folders/lf/z7ftpkws0vn7svq8n212czm40000gn/T/tmppve5_m1u/
├── p.413
│   └── P413.yJvxPtuzfz
├── p.705
│   ├── P705.DbwPyiFxum
│   ├── P705.FVwMuSqFms
│   ├── P705.PZyGIQEqSG
│   ├── P705.baRrkcNaZR
│   └── P705.tZKFTKwDah
├── p.794
│   ├── P794.CQTBgXOckQ
│   ├── P794.JNoKsUtgRU
│   └── P794.iSdrdohKYq
└── p.894
    └── P894.XbzFxnqYOY
And finally, clean up the temporary directory.
tmpdir.cleanup()
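The demo above uses made-up PN.<suffix> names. For the PX_N.gmspr names described in the question, the same idea might look like the sketch below. The names FILE_RE and target_folder are hypothetical, and the pattern assumes that N is always a single digit or uppercase letter.

```python
import re

# P, then the folder number X, an underscore, one identifier character N, then .gmspr
FILE_RE = re.compile(r"P(\d+)_([0-9A-Z])\.gmspr")


def target_folder(filename):
    """Return the p.X folder name for a PX_N.gmspr file, or None if it does not match."""
    m = FILE_RE.fullmatch(filename)
    return f"p.{m.group(1)}" if m else None
```

With this, P24_2.gmspr and P24_H.gmspr both map to p.24, and files that do not follow the naming scheme are skipped rather than raising an error.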

How to find and delete folders that are empty / contains empty folders only

If folder X is empty, I would like to delete X.
If folder Y only contains folder X and other folders that are empty, I would like to delete Y.
If folder Z only contains folders like X and/or Y, I would like to delete Z.
How do I do this recursively for anything under a specific dir, with Python?
I tried something like the following, but it is only able to identify folders like X, not folders like Y or Z.
from pathlib import Path

folder = '/home/abc'
for path in Path(folder).glob("**/*"):
    if path.is_dir() and len(list(path.iterdir())) == 0:
        logger.info(f"remove {path}")

May be a bit verbose, but this seems to do the job.
def dir_empty(path):
    empty = True
    for item in path.glob('*'):
        if item.is_file():
            empty = False
        if item.is_dir() and not dir_empty(item):
            empty = False
    if empty:
        path.rmdir()  # Remove if you just want to have the result
    return empty

from pathlib import Path
dir_empty(Path('Z'))

os.rmdir() will fail on any directory with contents. So one method here is to just rmdir every directory from bottom to top while suppressing the OSError exception which is thrown when attempting to remove a non-empty directory. All empty ones will be removed, all with contents will remain. Technically, checking if a directory is empty before attempting the removal is a race condition (though typically, a harmless one).
Let's take this filesystem with 2 files in it.
testtree/
├── a
│   └── aa
│       └── filea
├── b
│   ├── bb
│   └── fileb
└── c
    └── cc
And run this:
import os
from pathlib import Path
from contextlib import suppress
for root, dirs, _ in os.walk("testtree", topdown=False):
    for d in dirs:
        with suppress(OSError):
            os.rmdir(Path(root, d))
Then the tree is transformed to
testtree/
├── a
│   └── aa
│       └── filea
└── b
    └── fileb
bb, cc, and c were all removed.

The problem here is that glob("**/*") appears to do a preorder traversal. In other words, it returns a parent folder before returning any children.
You need to do a postorder traversal instead. In your example, if you delete a folder like X, then Y just becomes the same as X.
You could do this with manual recursion. But if you want to use glob(), you need to reverse the items returned. Since glob() returns a generator, collect it into a list first: reversed(list(Path(folder).glob("**/*"))).
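A minimal sketch of the children-before-parents idea. The function name remove_empty_dirs is hypothetical. Instead of relying on the order in which glob() happens to yield paths, this version sorts the directories by depth so that the deepest ones are handled first.

```python
from pathlib import Path


def remove_empty_dirs(folder):
    """Remove directories under folder that are empty or contain only empty directories."""
    removed = []
    dirs = [p for p in Path(folder).glob("**/*") if p.is_dir()]
    # Deepest paths first, so a parent is checked only after its children.
    for path in sorted(dirs, key=lambda p: len(p.parts), reverse=True):
        if not any(path.iterdir()):
            path.rmdir()
            removed.append(path)
    return removed
```

The top-level folder itself is never removed here, because glob("**/*") only yields entries below it.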

Append Excel Files in Multiple Directories in Python

My goal is to append 9 excel files together that exist in different directories. I have a directory tree with the following structure:
Big Folder
|
├── folder_1/
| ├── file1.xls
| ├── file2.xls
| └── file3.xls
|
├── folder_2/
| ├── file4.xls
| ├── file5.xls
| └── file6.xls
|
├── folder_3/
| ├── file7.xls
| ├── file8.xls
| └── file9.xls
I successfully wrote a loop that appends file1, file2, and file3 together within folder_1. My idea is to nest this loop into another loop that flows through each folder as a list. I'm currently trying to use os.walk to accomplish this but am running into the following error in folder_1
[Errno 2] No such file or directory
Do community members have recommendations on how to extend this loop to execute in each directory? Thanks!

It is hard for me to know how you have implemented the program without being given some code to work with. However, I believe you have misused the os.walk() method, please read about it here.
I would use the os.walk() method the following way for getting the path to various files in a current directory and subdirectories.
import os
all_files = [(path, files) for path, dirs, files in os.walk(".")]
and then get all the files which end with ".xls" like so
all_xls_files = [
    os.path.join(path, xls_file)
    for (path, xls_files_list) in all_files
    for xls_file in xls_files_list
    if xls_file.endswith(".xls")
]
this is equivalent to
all_xls_files = []
for (path, xls_files_list) in all_files:
    for xls_file in xls_files_list:
        if xls_file.endswith(".xls"):
            all_xls_files.append(os.path.join(path, xls_file))
Once you obtain a list of excel files with their path
you can open them by
with open("my_output_file", "w") as output_file:
    for file in all_xls_files:
        with open(file) as f:
            # Do your append here
            pass
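For comparison, here is a sketch of the same path-collection step using pathlib instead of os.walk(). The function name find_xls_files is hypothetical. Because every returned path already includes its folder, opening the files should not run into the "No such file or directory" error that occurs when bare file names from another directory are used. Actually reading and concatenating the sheets (for example with pandas) is left out.

```python
from pathlib import Path


def find_xls_files(top="."):
    """Return every .xls file below top, each with its folder included in the path."""
    return sorted(Path(top).rglob("*.xls"))
```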

What does this path mean: /**/*.jpg?

I came across this in IBM Applied AI course:
path_for_license_plates = os.getcwd() + "/license-plates/**/*.jpg"
What does **/*.jpg mean in the above path?

From https://docs.python.org/3/library/glob.html :
glob.glob(pathname, *, recursive=False)
If recursive is true, the
pattern “**” will match any files and zero or more directories,
subdirectories and symbolic links to directories. If the pattern is
followed by an os.sep or os.altsep then files will not match.

It is apparently meant to be a glob pattern in "recursive" mode, as
the "**" suggests.
Given the directory tree
license-plates/
├── a
│   ├── b
│   │   └── x.jpg
│   └── x.jpg
└── x.jpg
The function glob.glob() works as follows:
>>> import glob
>>> glob.glob('license-plates/**/*.jpg', recursive=True)
['license-plates/x.jpg', 'license-plates/a/x.jpg', 'license-plates/a/b/x.jpg']
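The recursive=True argument matters here. As a small sketch (the helper name find_jpgs is hypothetical), the same pattern can be tried with and without it. Without recursive=True, "**" behaves like a plain "*" and matches exactly one directory level.

```python
import glob
import os


def find_jpgs(top, recursive):
    """Glob top/**/*.jpg, with or without recursive matching of '**'."""
    pattern = os.path.join(top, "**", "*.jpg")
    return sorted(glob.glob(pattern, recursive=recursive))
```

On the tree above, the recursive call finds all three x.jpg files, while the non-recursive call only finds the one directly inside license-plates/a.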

path_for_license_plates is literally a string.
That's it.
It can be used to say "get me all the jpg files, looking recursively in all the directories under license-plates".
A better question is "how is it going to be used later in the program?"
It's likely (since they used the os module) that this is an older program. Such programs tend to use the glob module, as others have shown. But if you are changing this program, you can modernize it.
With modern Python (3.6+), here is how you can ask for the same information:
from pathlib import Path
path_for_license_plates = Path('license-plates').glob("**/*.jpg")
for license_plate_file_location in path_for_license_plates:
    print(license_plate_file_location)
This assumes that license-plates is in the current working directory and gives you a generator of matching paths. The code is much shorter, and it works on the major platforms as well (Windows/Linux/macOS).
