I'm trying to write a Python script which will monitor an rsync transfer, and provide a (rough) estimate of percentage progress. For my first attempt, I looked at an rsync --progress command and saw that it prints messages such as:
1614 100% 1.54MB/s 0:00:00 (xfer#5, to-check=4/10)
I wrote a parser for such messages and used the to-check part to produce a percentage progress; in this example, that would be 60% complete.
However, there are two flaws in this:
In large transfers, the "numerator" of the to-check fraction doesn't seem to monotonically decrease, so the percentage completeness can jump backwards.
Such a message is not printed for all files, meaning that the progress can jump forwards.
I've looked at alternative messages to use, but haven't managed to find anything suitable. Does anyone have any ideas?
Thanks in advance!
The current version of rsync (at the time of editing 3.1.2) has an option --info=progress2 which will show you progress of the entire transfer instead of individual files.
From the man page:
There is also a --info=progress2 option that outputs statistics based on the whole transfer, rather than individual files. Use this flag, without outputting a filename (e.g. avoid -v or specify --info=name0), if you want to see how the transfer is doing without scrolling the screen with a lot of names. (You don't need to specify the --progress option in order to use --info=progress2.)
So, if possible on your system you could upgrade rsync to a current version which contains that option.
You can disable the incremental recursion with the argument --no-inc-recursive. rsync will do a pre-scan of the entire directory structure, so it knows the total number of files it has to check.
This is actually the old way it recursed. Incremental recursion, the current default, was added for speed.
Note the caveat that even --info=progress2 is not entirely reliable, since the percentage is based on the number of files rsync knows about at the time the progress is displayed. That is not necessarily the total number of files that need to be synced (for instance, if rsync later discovers a large number of files in a deeply nested directory).
One way to ensure that --info=progress2 doesn't jump backwards in its progress indication is to force rsync to scan all the directories recursively before starting the sync (instead of its default incrementally recursive scan) by also providing the --no-inc-recursive option. Note, however, that this option also increases rsync's memory usage and run time.
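If it helps, here is a rough sketch of how you might consume that output from Python; the regex, the byte-at-a-time reading, and the source/destination paths are assumptions on my part, not an rsync-documented interface:

import re
import subprocess

def rsync_with_progress(src, dest):
    # Run rsync with whole-transfer progress; rsync redraws the progress line,
    # separating updates with carriage returns.
    proc = subprocess.Popen(
        ["rsync", "-a", "--info=progress2", src, dest],
        stdout=subprocess.PIPE,
        bufsize=0,
    )
    percent_re = re.compile(rb"\s(\d{1,3})%\s")
    buf = b""
    while True:
        chunk = proc.stdout.read(1)
        if not chunk:
            break
        buf += chunk
        if chunk in (b"\r", b"\n"):
            match = percent_re.search(buf)
            if match:
                print("overall progress: %s%%" % match.group(1).decode())
            buf = b""
    return proc.wait()

# rsync_with_progress("/src/dir/", "/dest/dir/")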
For full control over the transfer you should use a more low-level diff tool and manage directory listing and data transfer yourself.
Based on librsync, there is either the command-line rdiff tool or the Python module pysync.
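If you go that route, the basic cycle with the rdiff command line looks something like this (driven here from Python via subprocess; all the file names are placeholders):

import subprocess

def rdiff_roundtrip(basis, newfile, rebuilt):
    # 1. Summarize the receiver's current copy of the file.
    subprocess.check_call(["rdiff", "signature", basis, "basis.sig"])
    # 2. On the sending side, compute a delta against that signature.
    subprocess.check_call(["rdiff", "delta", "basis.sig", newfile, "update.delta"])
    # 3. Back on the receiving side, apply the delta to rebuild the new file.
    subprocess.check_call(["rdiff", "patch", basis, "update.delta", rebuilt])

# rdiff_roundtrip("old.bin", "new.bin", "new-rebuilt.bin")

Managing your own file list around calls like these is what gives you exact knowledge of, and full control over, the transfer progress.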
Related
I want my Python program to delete directory trees which were created by the program during a previous execution. I worry that if there is a bug in the program, shutil.rmtree might delete much more than intended, with catastrophic consequences. Are there any best practices on how to avoid that?
I had the following ideas, but they don't seem very elegant to me:
Exactly determine every file created by the program, delete them one by one, and remove empty directories afterwards.
Determining the size or number of files of a directory and putting an upper bound on them (e.g. only delete folders with size less than 5MB and fewer than 100 files).
Same issue here:
a third party is writing files in folders I have to provide
it fails if the folders are not empty
there are "re-runs": I am likely to provide the same folder again
Some modest trick ideas I'm using:
there is one and only one method in the whole codebase that performs shutil.rmtree and it has some basic safety checks:
check the depth against a safety value, e.g. refuse to delete when len(pathlib.Path(dir_to_delete).parents) < SAFETY_RMTREE_DEPTH
add an optional prompt boolean that issues a blocking prompt before deleting, defaulting to True (or anything similar that blocks execution, since GUI prompts might not always be available): def delete_folder(folder: Path, prompt: bool = True). If you forget to set the prompt to False, it forces you to think again the first time the delete gets executed (a minimal sketch combining these checks appears below).
Might not work if the first ever execution happens in production...
upper bounds on file size or number would be OK as you suggested if the checks are not too costly
Each time I am about to use the main delete_folder I try to instead define a specific delete method def delete_xxx that first checks that the target looks like what I want to delete. Specific checks could be tags or extensions in filenames, etc.
In some cases it is possible to move the folder to a garbage/trash location instead of deleting it
It might be OK to always add a timestamp to a specific folder you need to provide to the third party. Many issues can come from timestamping but for simple applications that might be acceptable.
Also, an online search with "criteria-based deletion" seems to give some results with similar tricks.
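For what it's worth, a minimal sketch of such a guarded delete_folder, with the depth limit and the prompt default as assumptions you would tune for your own layout:

import shutil
from pathlib import Path

SAFETY_RMTREE_DEPTH = 3  # refuse to delete anything this close to the filesystem root

def delete_folder(folder: Path, prompt: bool = True) -> None:
    folder = Path(folder).resolve()
    # Safety check 1: the path must be deep enough in the tree.
    if len(folder.parents) < SAFETY_RMTREE_DEPTH:
        raise RuntimeError("refusing to delete %s: path is too shallow" % folder)
    # Safety check 2: block until a human confirms, unless explicitly disabled.
    if prompt:
        answer = input("Really delete %s and everything under it? [y/N] " % folder)
        if answer.strip().lower() != "y":
            return
    shutil.rmtree(folder)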
I want to monitor a folder and see if any new files are added, or existing files are modified. The problem is, it's not guaranteed that my program will be running all the time (so, inotify based solutions may not be suitable here). I need to cache the status of the last scan and then with the next scan I need to compare it with the last scan before processing the files.
What are the alternatives for achieving this in Python 2.7?
Note1: Processing the files is expensive, so I want to avoid re-processing files that have not been modified in the meantime. Also, if a file has only been renamed (as opposed to having its contents changed), I would like to detect that and skip the processing.
Note2: I'm only interested in a Linux solution, but I wouldn't complain if answers for other platforms are added.
There are several ways to detect changes in files. Some are easier to fool than others. It doesn't sound like this is a security issue; it sounds more like good faith is assumed, and you just need to detect changes without having to outwit an adversary.
You can look at timestamps. If files are not renamed, this is a good way to detect changes. If they are renamed, timestamps alone wouldn't suffice to reliably tell one file from another. os.stat will tell you the time a file was last modified.
You can look at inodes, e.g., ls -li. A file's inode number may change if changes involve creating a new file and removing the old one; this is how emacs typically changes files, for example. Try changing a file with the standard tool your organization uses and compare inodes before and after, but bear in mind that even if the inode doesn't change this time, it might change under some circumstances. os.stat will tell you inode numbers.
You can look at the content of the files. cksum computes a small CRC checksum on a file; it's easy to beat if someone wants to. Programs such as sha256sum compute a secure hash; it's infeasible to change a file without changing such a hash. This can be slow if the files are large. The hashlib module will compute several kinds of secure hashes.
If a file is renamed and changed, and its inode number changes, it would be potentially very difficult to match it up with the file it used to be, unless the data in the file contains some kind of immutable and unique identifier.
Think about concurrency. Is it possible that someone will be changing a file while the program runs? Beware of race conditions.
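A small sketch of those ideas in Python (the cache layout is my own invention; only os.stat and hashlib are standard):

import hashlib
import os

def file_fingerprint(path):
    # Cheap metadata from os.stat: inode, modification time, and size.
    st = os.stat(path)
    return {"inode": st.st_ino, "mtime": st.st_mtime, "size": st.st_size}

def file_sha256(path, chunk_size=1024 * 1024):
    # Slower but content-based: a secure hash of the whole file, read in chunks.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

Comparing fingerprints between scans catches most changes cheaply; the hash is the fallback when metadata alone is ambiguous.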
I would probably go with some kind of sqlite solution, such as recording the last polling time.
Then, on each poll, sort the files by last modified time (mtime) and take all the ones whose mtime is greater than the time of your previous poll (with that value read from sqlite, or from some kind of file if you would rather not depend on such a db).
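Something along these lines, as a sketch (the table and column names are made up for illustration):

import os
import sqlite3
import time

def files_changed_since_last_poll(directory, db_path="poll_state.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value REAL)")
    row = conn.execute("SELECT value FROM state WHERE key = 'last_poll'").fetchone()
    last_poll = row[0] if row else 0.0

    # Collect files whose mtime is newer than the previous poll.
    changed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) > last_poll:
            changed.append(path)

    # Remember when this poll happened for next time.
    conn.execute("INSERT OR REPLACE INTO state VALUES ('last_poll', ?)", (time.time(),))
    conn.commit()
    conn.close()
    return sorted(changed, key=os.path.getmtime)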
Monitoring for new files isn't hard -- just keep a list or database of inodes for all files in the directory. A new file will introduce a new inode. This will also help you avoid processing renamed files, since inode doesn't change on rename.
The harder problem is monitoring for file changes. If you also store file size per inode, then obviously a changed size indicates a changed file and you don't need to open and process the file to know that. But for a file that has (a) a previously recorded inode, and (b) is the same size as before, you will need to process the file (e.g. compute a checksum) to know if it has changed.
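As a rough sketch of that bookkeeping (a single flat directory is assumed here to keep it short):

import os

def snapshot(directory):
    # Map inode -> (name, size) for every regular file in the directory.
    result = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            st = os.stat(path)
            result[st.st_ino] = (name, st.st_size)
    return result

def needs_processing(old, new):
    # New inodes, or known inodes whose size changed, are worth processing.
    # A rename keeps its inode and size, so it is skipped here; same-size
    # content changes still need a checksum, as noted above.
    return [name for ino, (name, size) in new.items()
            if ino not in old or old[ino][1] != size]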
I suggest cheating and using the system find command. For example, the following finds all Python files that have been modified or created in the last 60 minutes. From the -ls output you can determine whether further checking is needed.
$ echo beer > zoot.py
$ find . -name '*.py' -mmin -60 -type f -ls
1973329 4 -rw-r--r-- 1 johnm johnm 5 Aug 30 15:17 ./zoot.py
I have to archive a large amount of data off of CDs and DVDs, and I thought it was an interesting problem that people might have useful input on. Here's the setup:
The script will be running on multiple boxes on multiple platforms, so I thought python would be the best language to use. If the logic creates a bottleneck, any other language works.
We need to archive ~1000 CDs and ~500 DVDs, so speed is a critical issue
The data is very valuable, so verification would be useful
The discs are pretty old, so a lot of them will be hard or impossible to read
Right now, I was planning on using shutil.copytree to dump the files into a directory, and compare file trees and sizes. Maybe throw in a quick hash, although that will probably slow things down too much.
So my specific questions are:
What is the fastest way to copy files off a slow medium like CD/DVDs? (or does the method even matter)
Any suggestions of how to deal with potentially failing discs? How do you detect discs that have issues?
When you read file by file, you're seeking randomly around the disc, which is a lot slower than a bulk transfer of contiguous data. And, since the fastest CD drives are several dozen times slower than the slowest hard drives (and that's not even counting the speed hit for doing multiple reads on each bad sector for error correction), you want to get the data off the CD as soon as possible.
Also, of course, having an archive as a .iso file or similar means that, if you improve your software later, you can re-scan the filesystem without needing to dig out the CD again (which may have further degraded in storage).
Meanwhile, trying to recover damaged CDs, and damaged filesystems, is a lot more complicated than you'd expect.
So, here's what I'd do:
Block-copy the discs directly to .iso files (whether in Python, or with dd), and log all the ones that fail (see the sketch after this list).
Hash the .iso files, not the filesystems. If you really need to hash the filesystems, keep in mind that the common optimization of compressing the data before hashing (that is, tar czf - | shasum instead of just tar cf - | shasum) usually slows things down, even for easily compressible data, but you might as well test it both ways on a couple of discs. If you need your verification to be legally useful you may have to use a timestamped signature provided by an online service instead, in which case compressing probably will be worthwhile.
For each successful .iso file, mount it and use basic file copy operations (whether in Python, or with standard Unix tools), and again log all the ones that fail.
Get a free or commercial CD recovery tool like IsoBuster (not an endorsement, just the first one that came up in a search, although I have used it successfully before) and use it to manually recover all of the damaged discs.
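A rough sketch of that first step, with the device path, block size, and log format as assumptions:

import subprocess

def rip_disc_to_iso(device, iso_path, log_path="failed_discs.log"):
    # conv=noerror,sync keeps dd going past read errors, padding bad sectors.
    cmd = ["dd", "if=" + device, "of=" + iso_path, "bs=2048", "conv=noerror,sync"]
    result = subprocess.call(cmd)
    if result != 0:
        with open(log_path, "a") as log:
            log.write("%s -> %s failed (exit %d)\n" % (device, iso_path, result))
    return result == 0

# rip_disc_to_iso("/dev/sr0", "disc0001.iso")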
You can do a lot of this work in parallel—when each block copy finishes, kick off the filesystem dump in the background while you're block-copying the next drive.
Finally, if you've got 1500 discs to recover, you might want to invest in a DVD jukebox or auto-loader. I'm guessing new ones are still pretty expensive, but there must be people out there selling older ones for a lot cheaper. (From a quick search online, the first thing that came up was $2500 new and $240 used…)
Writing your own backup system is not fun. Have you considered looking at ready-to-use backup solutions? There are plenty, many free ones...
If you are still bound to write your own... Answering your specific questions:
With CD/DVD you typically first have to master the image (using a tool like mkisofs), then write the image to the medium. There are tools that wrap both operations for you (genisoimage, I believe), but this is typically the process.
To verify the backup quality, you'll have to read back all written files (by mounting a newly written CD) and compare their checksums against those of the original files. In order to do incremental backups, you'll have to keep archives of checksums for each file you save (with backup date etc).
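For the read-back check, something like this sketch would do (the mount points are placeholders; filecmp with shallow=False compares actual contents rather than just stat data):

import filecmp
import os

def verify_copy(original_root, mounted_root):
    # Walk the originals and compare each file against its copy on the disc.
    mismatches = []
    for dirpath, dirnames, filenames in os.walk(original_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, original_root)
            dst = os.path.join(mounted_root, rel)
            if not os.path.exists(dst) or not filecmp.cmp(src, dst, shallow=False):
                mismatches.append(rel)
    return mismatches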
Attempt #2:
People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:
1) Reading a list of files is much faster than walking a directory.
2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.
3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.
Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.
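To make the idea concrete, here is a sketch of what the read path could look like; both file formats (one path per index line, tab-separated timestamp/event/path records in the monitor log) are invented for illustration:

import os

def load_index(index_path):
    with open(index_path) as handle:
        return set(line.strip() for line in handle if line.strip())

def apply_monitor_log(files, log_path, since):
    # Replay only the events that happened after the index was written.
    with open(log_path) as handle:
        for line in handle:
            timestamp, event, path = line.rstrip("\n").split("\t", 2)
            if float(timestamp) <= since:
                continue
            if event == "create":
                files.add(path)
            elif event == "delete":
                files.discard(path)
    return files

def all_files(dir_path, index_path=".index", log_path="mod_log.txt"):
    files = load_index(index_path)
    index_mtime = os.path.getmtime(index_path)
    return apply_monitor_log(files, log_path, since=index_mtime)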
Original post:
I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.
Here's my idea for how to go about this:
I have a function called all_files:
def all_files(dir_path, ...parms...):
...
The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.
This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path affected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. It will take these recent events, find any that apply to the current directory, and update the .index accordingly.
Finally, I'll take the list of all files, filter it according to various parameters, and return the result.
What do you think of my approach? Is there a better way to do this?
Edit:
Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.
import os
from os.path import join, exists
import cProfile, pstats

dir_name = "temp_dir"
index_path = ".index"

def create_test_files():
    # Build 10 sub-directories of 100 empty files each, writing every path
    # to the index file as we go.
    os.mkdir(dir_name)
    index_file = open(index_path, 'w')
    for i in range(10):
        print "creating dir: ", i
        sub_dir = join(dir_name, str(i))
        os.mkdir(sub_dir)
        for j in range(100):
            file_path = join(sub_dir, str(j))
            open(file_path, 'w').close()
            index_file.write(file_path + "\n")
    index_file.close()

# ~0.238 seconds
def test_walk():
    for info in os.walk(dir_name):
        pass

# ~0.001 seconds
def test_read():
    open(index_path).readlines()

if not exists(dir_name):
    create_test_files()

def profile(s):
    cProfile.run(s, 'profile_results.txt')
    p = pstats.Stats('profile_results.txt')
    p.strip_dirs().sort_stats('cumulative').print_stats(10)

profile("test_walk()")
profile("test_read()")
Do not try to duplicate the work that the filesystem already does. You are not going to do better than it already does.
Your scheme is flawed in many ways and it will not get you an order-of-magnitude improvement.
Flaws and potential problems:
You are always going to be working with a snapshot of the file system. You will never know with any certainty that it is not significantly disjoint from reality. If that is within the working parameters of your application, no sweat.
The filesystem monitor program still has to recursively walk the file system, so the work is still being done.
In order to increase the accuracy of the cache, you have to increase the frequency with which the filesystem monitor runs. The more it runs, the less actual time that you are saving.
Your client application likely won't be able to read the index file while it is being updated by the filesystem monitor program, so you'll lose time while the client waits for the index to be readable.
I could go on.
If, in fact, you don't care about working with a snapshot of the filesystem that may be very disjoint from reality, I think you'd be much better off keeping the index in memory and updating it from within the application itself. That will avoid any file contention issues that would otherwise arise.
The best answer came from Michał Marczyk toward the bottom of the comment list on the initial question. He pointed out that what I'm describing is very close to the UNIX locate program. I found a Windows version here: http://locate32.net/index.php. It solved my problem.
Edit: Actually the Everything search engine looks even better. Apparently Windows keeps journals of changes to the filesystem, and Everything uses that to keep the database up to date.
Doesn't Windows Desktop Search provide such an index as a byproduct? On the Mac, the Spotlight index can be queried for filenames like this: mdfind -onlyin . -name '*'.
Of course it's much faster than walking the directory.
The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.
"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" to avoid the big traversal. (After all, the filesystem itself is already a data indexer.)
There are operating system features that are capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.
To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS or filesystem level tools.
Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that: the index is written while the sample files are being created (as part of setup), and the test then merely reads that index back.
There are two lessons here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially asks, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly this is not an apples-to-apples comparison.
However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster if you use an already existing index. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.
Make no mistake, in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely going to be the fastest way you can visit every file. If the question is "can I write Python code to do exactly what os.walk does but be an order of magnitude faster than os.walk" the answer is a resounding no. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file" then the answer is still no.
(Edit in response to "I don't think you understand what I'm trying to do")
Let's be clear here, virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.
Let's look at this from another angle. File systems have been an essential component to modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering and many of the most brilliant minds in computer science are working on it constantly.
You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task utilized constantly in computing. It's likely happening several times a second even on the computer you're working with right now.
If it were as simple to speed up this process by an order of magnitude(!) by simply keeping a text file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?
That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.
But your question, as you asked it, has a clear answer: no. In the general case, you're not going to get an order-of-magnitude speedup by reimplementing os.walk. You may be able to get a better amortized runtime by caching, but you're not going to beat it by an order of magnitude if you properly include the work of building the cache in your profiling.
I would recommend just using a combination of os.walk (to get directory trees) and os.stat (to get file information) for this. Using the standard library ensures it works on all platforms, and it does the job nicely. There is no need to index anything.
As others have stated, I don't really think you're going to buy much by attempting to index and re-index the filesystem, especially if you're already limiting your functionality by path and parameters.
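For example, a sketch of such a filtered walk (the particular filters, minimum size and extension, are just placeholders for whatever parameters you need):

import os

def filtered_files(root, min_size=0, extension=None):
    # Walk the tree and yield only the paths that pass the filters.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if extension and not name.endswith(extension):
                continue
            path = os.path.join(dirpath, name)
            if os.stat(path).st_size >= min_size:
                yield path

# for path in filtered_files(".", min_size=1024, extension=".py"):
#     print(path)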
I'm new to Python, but a combination of list comprehensions, iterators, and generators should scream, according to reports I've read.
import os, re

class DirectoryIterator:
    def __init__(self, start_dir, pattern):
        self.directory = start_dir
        self.pattern = pattern

    def __iter__(self):
        # os.walk already recurses into sub-directories, so just yield matching paths.
        for dirpath, dirnames, filenames in os.walk(self.directory):
            for name in filenames:
                if re.search(self.pattern, name):
                    yield os.path.join(dirpath, name)

###########
for file_name in DirectoryIterator(".", r"\.py$"): print file_name
I want to create a user group using python on CentOS system. When I say 'using python' I mean I don't want to do something like os.system and give the unix command to create a new group. I would like to know if there is any python module that deals with this.
Searching on the net did not reveal much about what I want, except for Python user groups, so I had to ask this.
I learned about the grp module by searching here on SO, but couldn't find anything about creating a group.
EDIT: I don't know if I have to start a new question for this, but I would also like to know how to add (existing) users to the newly created group.
Any help appreciated.
Thank you.
I don't know of a python module to do it, but the /etc/group and /etc/gshadow format is pretty standard, so if you wanted you could just open the files, parse their current contents and then add the new group if necessary.
Before you go doing this, consider:
What happens if you try to add a group that already exists on the system
What happens when multiple instances of your program try to add a group at the same time
What happens to your code when an incompatible change is made to the group format a couple releases down the line
NIS, LDAP, Kerberos, ...
If you're not willing to deal with these kinds of problems, just use the subprocess module and run groupadd. It will be far less likely to break your customers' machines.
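A minimal sketch of that approach, using the grp module to check for the group first (both commands need root privileges, and error handling is kept to a minimum):

import grp
import subprocess

def ensure_group(name):
    try:
        grp.getgrnam(name)            # group already exists, nothing to do
    except KeyError:
        subprocess.check_call(["groupadd", name])

def add_user_to_group(user, group):
    # -a -G appends the group to the user's supplementary groups.
    subprocess.check_call(["usermod", "-a", "-G", group, user])

# ensure_group("mygroup")
# add_user_to_group("alice", "mygroup")

This also covers the edit in the question: usermod -a -G is the usual way to add an existing user to a group.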
Another thing you could do that would be less fragile than writing your own would be to wrap the code in groupadd.c (in the shadow package) in Python and do it that way. I don't see this buying you much versus just exec'ing it, though, and it would add more complexity and fragility to your build.
I think you should use the command-line programs from your program; a lot of care has gone into making sure that they don't break the groups file if something goes wrong.
However, the file format is quite straightforward if you choose to write something yourself.
There are no library calls for creating a group. This is because there's really no such thing as creating a group. A GID is simply a number assigned to a process or a file. All these numbers exist already - there is nothing you need to do to start using a GID. With the appropriate privileges, you can call chown(2) to set the GID of a file to any number, or setgid(2) to set the GID of the current process (there's a little more to it than that, with effective IDs, supplementary IDs, etc).
Giving a name to a GID is done by an entry in /etc/group on basic Unix/Linux/POSIX systems, but that's really just a convention adhered to by the Unix/Linux/POSIX userland tools. Other network-based directories also exist, as mentioned by Jack Lloyd.
The man page group(5) describes the format of the /etc/group file, but it is not recommended that you write to it directly. Your distribution will have policies on how unnamed GIDs are allocated, such as reserving certain spaces for different purposes (fixed system groups, dynamic system groups, user groups, etc). The range of these number spaces differs on different distributions. These policies are usually encoded in the command-line tools that a sysadmin uses to assign unnamed GIDs.
This means the best way to add a group locally is to use the command-line tools.
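To illustrate the point that a GID is just a number: with the right privileges you can hand any numeric GID straight to the os module (the path and the GID 5000 below are arbitrary placeholders):

import os

os.chown("data.txt", -1, 5000)   # -1 leaves the owning UID unchanged
# os.setgid(5000)                # would switch the current process's GID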
If you are looking at Python, then try this program. It's fairly simple to use, and the code can easily be customized: http://aleph-null.tv/downloads/mpb-adduser-1.tgz