How to delete directory trees safely

How to delete directory trees safely - python

I want my python program to delete directory trees which were created by the program during a previous execution. I worry that if there is a bug in the program using shutil.rmtree migth delete much more then intended with catastrophic consequences. Are there any best practises on how to avoid that?
I had the following ideas but they dont seem very elegant to me:
Exactly determine every file created by the program and delete them one-by-one and delete empty directories afterwards.
Determining the size or number of files of a directory and putting an upper bound on them. (E.g. only delete folders with size less than 5MB and less then 100 files.)

Same issue here:
a third party is writing files in folders I have to provide
it fails if the folders are not empty
there are "re-runs": I am likely to provide again the same folder
Some modest trick ideas I'm using:
there is one and only one method in the whole codebase that performs shutil.rmtree and it has some basic safety checks:
check the depth against a safety value if len(pathlib.Path(dir_to_delete).parents) < SAFETY_RMTREE_DEPTH:
add an optional prompt boolean that issues a blocking prompt before delete, with default to True (or anything similar that blocks execution as GUI prompts might not always be available): def delete_folder(folder: Path, prompt: bool = True). If you forget to set the prompt to False, then it forces you to think again about it the first time it gets executed.
Might not work if the first ever execution happens in production...
upper bounds on file size or number would be OK as you suggested if the checks are not too costly
Each time I am about to use the main delete_folder I try to instead define a specific delete method def delete_xxx that first checks that the target looks like what I want to delete. Specific checks could be tags or extensions in filenames, etc.
In some cases it is possible to move to a garbage instead of deleting
It might be OK to always add a timestamp to a specific folder you need to provide to the third party. Many issues can come from timestamping but for simple applications that might be acceptable.
Also, an online search with "criteria-based deletion" seems to give some results with similar tricks.

Related

Time trial version of a program

I want to create a trial version of a program for my customer. I want to give him/her some time to test the program (7 days in this case).
I have this command in the application (in *.py file):
if os.path.isfile('some_chars.txt') or datetime.now()<datetime.strptime('30-8-2015','%d-%m-%Y'):
# DO WHAT application HAS TO DO
else:
print 'TRIAL EXPIRED'
quit()
I'm curious whether is this approach enough for common customer or whether I have to change it. The thing is that the application has to find a file which name is, let's say, 'some_chars.txt'. If the file was found, the application works as it has to do, if not, it returns a text 'Trial version expired'.
So the main question is - is it enough for common customer? Can it be found somewhere or is it compiled to machine code so he would had to disassemble it?
EDIT: I forgot to mention very important thing, I'm using py2exe to make an executable file (main) with unnecessary files and folders.

Of course it has everything to do with the target (population) you're aiming: there are some cases when security is an offense (that involves lots of money so it's not our case);
Let's take an example:
Have a program that reads plain data from a file(registry,...); e.g. :the date (the program converts the date does a comparison and depending on the trial period close or lets the user go on)
Have everything from previous step, but the data is not in plain text, it is encrypted (e.g.: 1 is added to every char in the data so it is not immediately readable)
Use some well known encryption algorithms (would make the data unreadable to the user)
But, no matter the method you choose, it's just a matter of time til it will be broken.
A "hard to beat" way would be to have an existing server where the client could connect and "secretly talk" (I'm talking about a SSLed connecion anyway), even for trial period.
"Hiding the obvious info"(delivering a "compiled" .py script) is no longer the way (the most common Google search will point to a Python "decompiler")

Python is interpreted, so all they have to do is look at the source code to see time limiting section.
There are some options of turning a python script into an executable. I would try this and don't use any external files to set the date, keep it in the script.

How to traverse directory in order of modification date without visiting all files in directory at least once in Python

I have a directory that is full of potentially millions of files. These files "mark" themselves when used, and then my Python program wants to find the "marked" ones then record that they were marked and unmark them. They are individual html files so they can't easily communicate with the python program themselves during this marking process (the user will just open whatever ones they choose).
Because they are marked when used, if I access them by modification date, one at a time, once I reach one that isn't marked I can stop (or at least once I get to one that was modified a decent amount of time in the future). However, all ways I've seen of doing this so far require accessing every file's metadata at least once, and then sorting this data, which isn't ideal with the magnitude of files I have. Note that this check occurs during an update step which occurs every 5 seconds or so combined with other work and so the time ideally needs to be independent of the number of files in the directory.
So is there a way to traverse a directory in order of modification date without visiting all files's medatada's at least once in Python?

No, I don't think there is a way to fetch file names in chunks sorted by modification dates.
You should use file system notifications to know about modified files.
For example use https://github.com/gorakhargosh/watchdog or https://github.com/seb-m/pyinotify/wiki

Force directory to be created in Python

I am running Python with MPI on a supercomputing cluster. I am getting strange nondeterministic behavior that I think is a result of I/O complications that are not present on the single machines I'm used to working with.
One of the things my code does is to create directories using os.makedirs somewhat frequently. I know also that I generally should not write small amounts of data to the filesystem-- this can end up with the data getting stuck in some buffer and not written for a long time. I suspect this may be happening with my directory creation calls, and then later code tries to write to files inside the directory before it exists. Two questions:
is creating a new directory effectively the same thing as writing a small amount of data?
When forcing data to be written, I use flush and os.fsync. These require a file object. Is there an equivalent to make sure the directory has been created?

Creating a new directory is effectively the same as writing small amount of data. It adds an inode.
The only way mkdir (or os.mkdirs) should fail is if the directory exists - otherwise the directory will always be created. In terms of the data being buffered - it's unlikely that this would happen - even journaled filesystems will sync out pretty regularly.
If you're having non-deterministic behavior, just wrap your directory creation / writing a file into that directory inside a try / except / finally that makes a few efforts? But really - the need for such code hints at something much more sinister and is likely a bigger issue.

update the list of directories with only the new ones

I am creating a list (a text file) of all the directories recursively using the code below. Since there are a thousands of sub-directories, I don't want to create the list again and again, but would like to update/insert only the newly created ones from the last time I had listed them.
Is there a good way to do this ?
import os, sys
rootdir ="/store/user/"
myusers=['u1','u2','u3','u4','u5','u6','u7']
for myuser in myusers:
rootuserdir=os.path.join(rootdir, myuser)
for myRoot, mySubFolders, myFiles in os.walk(rootuserdir):
for mySubFolder in mySubFolders:
dirpath = os.path.join(myRoot, mySubFolder)
print dirpath

You don't save anything by trying to incrementally update your list of folders. There is no efficient way to delete a line from the middle of a file, nor to insert a line. Simply writing the whole list again is the most efficient approach, and also the easiest one.

Trying to locate a specific entry in a file would be more resource intensive than just repopulating the list each time.
For performance optimization always try to determine where the true bottlenecks lie before focusing on one specific area. More times than none, your focus will be in the wrong place when not employing this approach.
Determining the bottlenecks or hotspots should always be one of the first focus areas when refactoring your code. By doing so, you'll ensure that you are focusing on the areas that have the highest ROI with the least amount of LOE. A rule of thumb is that you should only attempt to refactor code if you can make the entire program or at least a significant part of it at least twice as fast. more...

You could run a one-off process to cache the information in some sort of database (possibly a document orientated one for simplicity), then use pyinotify in a daemon process to keep the database in sync.

Using an index to recursively get all files in a directory really fast

Attempt #2:
People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:
1) Reading a list of files is much faster than walking a directory.
2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.
3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.
Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.
Original post:
I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.
Here's my idea for how to go about this:
I have a function called all_files:
def all_files(dir_path, ...parms...):
...
The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.
This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path affected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. It will take these recent events, find any that apply to the current directory, and update the .index accordingly.
Finally, I'll take the list of all files, filter it according to various parameters, and return the result.
What do you think of my approach? Is there a better way to do this?
Edit:
Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.
import os
from os.path import join, exists
import cProfile, pstats
dir_name = "temp_dir"
index_path = ".index"
def create_test_files():
os.mkdir(dir_name)
index_file = open(index_path, 'w')
for i in range(10):
print "creating dir: ", i
sub_dir = join(dir_name, str(i))
os.mkdir(sub_dir)
for i in range(100):
file_path = join(sub_dir, str(i))
open(file_path, 'w').close()
index_file.write(file_path + "\n")
index_file.close()
#
# 0.238 seconds
def test_walk():
for info in os.walk("temp_dir"):
pass
# 0.001 seconds
def test_read():
open(index_path).readlines()
if not exists("temp_dir"):
create_test_files()
def profile(s):
cProfile.run(s, 'profile_results.txt')
p = pstats.Stats('profile_results.txt')
p.strip_dirs().sort_stats('cumulative').print_stats(10)
profile("test_walk()")
profile("test_read()")

Do not try to duplicate the work that the filesystem already does. You are not going to do better than it already does.
Your scheme is flawed in many ways and it will not get you an order-of-magnitude improvement.
Flaws and potential problems:
You are always going to be working with a snapshot of the file system. You will never know with any certainty that it is not significantly disjoint from reality. If that is within the working parameters of your application, no sweat.
The filesystem monitor program still has to recursively walk the file system, so the work is still being done.
In order to increase the accuracy of the cache, you have to increase the frequency with which the filesystem monitor runs. The more it runs, the less actual time that you are saving.
Your client application likely won't be able to read the index file while it is being updated by the filesystem monitor program, so you'll lose time while the client waits for the index to be readable.
I could go on.
If, in fact, you don't care about working with a snapshot of the filesystem that may be very disjoint from reality, I think that you'd be much better off with keeping the index in memory and updating from with the application itself. That will scrub any file contention issues that will otherwise arise.

The best answer came from Michał Marczyk toward the bottom of the comment list on the initial question. He pointed out that what I'm describing is very close to the UNIX locate program. I found a Windows version here: http://locate32.net/index.php. It solved my problem.
Edit: Actually the Everything search engine looks even better. Apparently Windows keeps journals of changes to the filesystem, and Everything uses that to keep the database up to date.

Doesn't Windows Desktop Search provide such an index as a byproduct? On the mac the spotlight index can be queried for filenames like this: mdfind -onlyin . -name '*'.
Of course it's much faster than walking the directory.

The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.
"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" to avoid the big traversal. (After all, the filesystem itself is already a data indexer.)
There are operating system features that are capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.
To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS or filesystem level tools.
Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that. You traverse the directory structure of your sample files while you're building the same files and create the index, and then later just read that file.
There are two lessons, here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially says, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly this is not an apples to oranges comparison.
However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster if you use an already existing index. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.
Make no mistake, in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely going to be the fastest way you can visit every file. If the question is "can I write Python code to do exactly what os.walk does but be an order of magnitude faster than os.walk" the answer is a resounding no. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file" then the answer is still no.
(Edit in response to "I don't think you understand what I'm trying to do")
Let's be clear here, virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.
Let's look at this from another angle. File systems have been an essential component to modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering and many of the most brilliant minds in computer science are working on it constantly.
You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task utilized constantly in computing. It's likely happening several times a second even on the computer you're working with right now.
If it were as simple to speed up this process by an order of magnitude(!) by simply keeping a text file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?
That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.
But your question, as you asked it, has a clear answer: No. In the general case, you're not going to get an order of magnitude faster reimplementing os.walk. You may be able to get a better amortized runtime by caching, but you're not going to be beat it by an order of magnitude if you properly include the work to build the cache in your profiling.

I would like to recommend you just use a combination of os.walk (to get directory trees) & os.stat (to get file information) for this. Using the std-lib will ensure it works on all platforms, and they do the job nicely. And no need to index anything.
As other have stated, I don't really think you're going to buy much by attempting to index and re-index the filesystem, especially if you're already limiting your functionality by path and parameters.

I'm new to Python, but I'm using a combination of list comprehensions, iterator and a generator should scream according to reports I've read.
class DirectoryIterator:
def __init__(self, start_dir, pattern):
self.directory = start_dir
self.pattern = pattern
def __iter__(self):
[([DirectoryIterator(dir, self.pattern) for dir in dirnames], [(yield os.path.join(dirpath, name)) for name in filenames if re.search(self.pattern, name) ]) for dirpath, dirnames, filenames in os.walk(self.directory)]
###########
for file_name in DirectoryIterator(".", "\.py$"): print file_name

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.