I have a folder called raw_files. Very large files (~100GB each) from several sources will be uploaded to this folder.
I need to get file information from videos that have finished uploading to the folder. What is the best way to determine whether a file is still being uploaded to the folder (skip it) or has finished uploading (run the script on it)? Thank you.
The most reliable way is to modify the uploading software if you can.
A typical scheme would be to first upload each file into a temporary directory on the same filesystem, and move to the final location when the upload is finished. Such a "move" operation is cheap and atomic.
A variation on this theme is to upload each file under a temporary name (e.g. file.dat.incomplete instead of file.dat) and then rename it. Your script will simply need to skip files called *.incomplete.
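If you do control the uploader, a minimal sketch of the rename-on-completion pattern might look like this (the directory name, suffix, and function names are all illustrative):

import os

RAW_DIR = "raw_files"     # illustrative path
SUFFIX = ".incomplete"    # illustrative temporary suffix

def finish_upload(temp_path):
    """Uploader side: atomically rename once the transfer is complete."""
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(temp_path, temp_path[: -len(SUFFIX)])

def finished_files():
    """Consumer side: list only files that no longer carry the suffix."""
    return [os.path.join(RAW_DIR, name)
            for name in os.listdir(RAW_DIR)
            if not name.endswith(SUFFIX)]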
If you cannot change the uploader and have to check the files yourself, store each file's size somewhere. If, on the next round, the size is still the same, you can pretty much consider the file finished (depending on how much time passes between the first and second check). The interval could, for example, be set to the timeout interval of your uploading service (FTP or whatever); see the sketch below.
There is no special sign or content showing that a file is complete.
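A rough sketch of that size-stability check (the directory name and poll interval are assumptions; tune the interval to your service's timeout):

import os
import time

WATCH_DIR = "raw_files"   # illustrative path
INTERVAL = 60             # seconds; tune to your upload service's timeout

def stable_files():
    """Yield each file once, after its size stops changing between two polls."""
    previous = {}
    reported = set()
    while True:
        current = {}
        for name in os.listdir(WATCH_DIR):
            path = os.path.join(WATCH_DIR, name)
            if os.path.isfile(path):
                current[name] = os.path.getsize(path)
        for name, size in current.items():
            if name not in reported and previous.get(name) == size:
                reported.add(name)
                yield os.path.join(WATCH_DIR, name)
        previous = current
        time.sleep(INTERVAL)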
Related
There is a similar question in the .NET version, but I'm asking specifically about the Python os.walk function. I am using the latest Python 3.
Suppose I have a directory containing many files. I am performing an operation on each file that takes a while, leaving ample time for the root directory to be changed mid-loop by another process. Here is an example, for clarity's sake:
for root, dirs, files in os.walk(rootDirectory):
    for file in files:
        doAReallyLongBlockingCall(file)
Would os.walk detect such changes? If I add files, will they be found by os.walk?
This is not really about os.walk but about what the OS does.
On Linux, once you open a directory (there is a syscall for that), the list of entries you read will not change under your hands.
So if your loop takes 10 minutes to process each file and one of the unprocessed files gets deleted, walk will still give you its name; likewise, any file created after the directory was opened will not be listed.
You will need exception handling for the cases where you lose a file before you get to it.
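As an illustration (the processing function and root path are placeholders), a defensive version of the loop from the question might look like this:

import os

def process(path):
    """Placeholder for the long-running call from the question."""
    with open(path, "rb") as f:
        f.read()

# The listing os.walk hands you for each directory is a snapshot taken when
# that directory was read; entries may be gone by the time you open them.
for root, dirs, files in os.walk("rootDirectory"):
    for name in files:
        try:
            process(os.path.join(root, name))
        except FileNotFoundError:
            # The file was deleted by another process after the snapshot.
            continue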
I'm trying to write a Python script that runs on Windows. Files are copied to a folder every few seconds, and I'm polling that folder every 30 seconds for names of the new files that were copied to the folder after the last poll.
What I have tried is to use one of the os.path.getXtime(folder_path) functions and compare that with the timestamp of my previous poll. If the getXtime value is larger than the timestamp, then I work on those files.
I have tried to use the function os.path.getctime(folder_path), but that didn't work because the files were created before I wrote the script. I tried os.path.getmtime(folder_path) too but the modified times are usually smaller than the poll timestamp.
Finally, I tried os.path.getatime(folder_path), which works for the first time the files were copied over. The problem is I also read the files once they were in the folder, so the access time keeps getting updated and I end up reading the same files over and over again.
I'm not sure what is a better way or function to do this.
You've got a bit of an XY problem here. You want to know when files in a folder change, you tried a home-rolled solution, it didn't work, and now you want to fix that home-rolled solution.
Can I suggest that instead of terrible hackery, you use an existing package designed for monitoring for file changes? One that is not a polling loop, but actually gets notified of changes as they happen? While inotify is Linux-only, there are other options for Windows.
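For example, the third-party watchdog package (pip install watchdog) wraps the native change-notification APIs on Windows, macOS, and Linux. A minimal sketch, with the watched path and handler name as assumptions:

import time

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

WATCH_DIR = r"C:\incoming"   # illustrative path

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            print("new file:", event.src_path)   # kick off your processing here

observer = Observer()
observer.schedule(NewFileHandler(), WATCH_DIR, recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()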
I have some bash code which moves files and directories to /tmp/rmf rather than deleting them, for safety purposes.
I am migrating the code to Python to add some functionality. One of the added features is checking the available size on /tmp and asserting that the moved directory can fit in /tmp.
Checking for available space is done using os.statvfs, but how can I measure the disk usage of the moved directory?
I could either call du using subprocess, or recursively iterate over the directory tree and sum the sizes of each file. Which approach would be better?
I think you might want to reconsider your strategy. Two reasons:
Checking whether you can move a file, asserting that you can, and then moving it builds a race condition into the operation: a big file can be created in /tmp/ after you've asserted but before you've moved your file. Doh.
Moving the file across filesystems will result in a huge amount of overhead. This is why on OS X each volume has its own 'Trash' directory. Instead of moving the blocks that make up the file, you just add a directory entry that points at the existing data.
I'd consider how long the file needs to be available and who the consumers of the files are. If it's all automated stuff happening on the back end, renaming a file to 'hide' it from computer and human consumers is easy enough in most cases and has the added benefit of being an atomic operation.
Occasionally scan the filesystem for 'old' hidden files and rm them after some grace period. No drama. This also makes restoring files a lot easier, since a restore is just another rename.
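A rough sketch of that rename-to-hide-plus-cull approach (the suffix and grace period are arbitrary choices):

import os
import time

GRACE_PERIOD = 7 * 24 * 3600   # arbitrary: keep "deleted" files for a week

def soft_delete(path):
    """Hide a file from consumers with an in-place rename (atomic, no data copied)."""
    os.rename(path, path + ".deleted")

def cull(directory):
    """Permanently remove hidden files older than the grace period."""
    now = time.time()
    for name in os.listdir(directory):
        if name.endswith(".deleted"):
            full = os.path.join(directory, name)
            if now - os.path.getmtime(full) > GRACE_PERIOD:
                os.remove(full)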
This should do the trick:
import os

path = 'THE PATH OF THE DIRECTORY YOU WANT TO FETCH'
stats = os.statvfs(path)
# free bytes available to unprivileged processes on that filesystem
free_bytes = stats.f_bavail * stats.f_frsize
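For the other half of the question - measuring how much the directory itself occupies - a sketch of the "recursively iterate and sum file sizes" approach mentioned above could look like this (it reports apparent sizes, not on-disk block usage, so it only approximates what du prints):

import os

def directory_size(path):
    """Sum the apparent sizes of all regular files under path."""
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def free_space(path):
    """Free bytes available to unprivileged processes on path's filesystem."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

# e.g. check directory_size(source_dir) <= free_space("/tmp") before moving,
# keeping in mind the race condition described above.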
In short:
I need to split one or more files into multiple max-sized archives using a dummy-safe format (e.g. zip or rar; anything that works will do!).
I would love to know when a certain part is done (callback?) so I could start shipping it away.
I would rather not do it using rar or zip command line utilities unless impossible otherwise.
I'm trying to make it OS-independent for the future, but right now I can live with the compression working only on Linux (my main PC). I still need the archives to be easily opened on Windows (my wife's PC).
In long:
I'm writing a hopefully-to-be-awesome backup utility that scans my pictures folder, zips each folder, and uploads it to whatever uploading class is registered (be it mail-sending, FTP-uploading, or HTTP-uploading).
I used zipfile to create a gigantic archive for each folder, but since my upload speed is really bad I only let it run at night, and my internet goes off occasionally so the whole thing gets messed up. So I decided to split the data into ~10MB pieces. I found no way of doing that with zipfile, so I just added files to the zip until it reached >10MB.
The problem is that there are often 200-300MB (and sometimes larger) videos in there, and again we hit the middle-of-the-night cutoffs.
I am using subprocess with rar right now to create the split archives, but since the directories are so big and I'm using heavy compression, this takes ages even though the first parts are already finished - this is why I'd love to know when a part is ready to be sent.
So, short story long: I need a good way to split the data into max-sized archives.
I am looking at making it somewhat generic and as dummy-proof as possible, as eventually I'm planning on turning it into an awesome extensible backup library.
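For what it's worth, a rough sketch of the "add files to a zip part until it passes the size cap, then close it and hand it off via a callback" approach described above (all names are illustrative):

import os
import zipfile

MAX_PART_SIZE = 10 * 1024 * 1024   # ~10MB per part, as in the question

def zip_in_parts(folder, on_part_ready):
    """Group files into sequential zip parts, closing a part once the
    uncompressed data added to it exceeds MAX_PART_SIZE and calling
    on_part_ready(path) as soon as that part is finished."""
    base = os.path.basename(os.path.normpath(folder))
    part_num, written, archive, part_path = 0, 0, None, None
    for root, dirs, files in os.walk(folder):
        for name in files:
            if archive is None:
                part_num += 1
                part_path = "%s.part%03d.zip" % (base, part_num)
                archive = zipfile.ZipFile(part_path, "w", zipfile.ZIP_DEFLATED)
                written = 0
            full = os.path.join(root, name)
            archive.write(full, os.path.relpath(full, folder))
            written += os.path.getsize(full)
            if written > MAX_PART_SIZE:
                archive.close()
                on_part_ready(part_path)   # e.g. queue this part for upload
                archive = None
    if archive is not None:
        archive.close()
        on_part_ready(part_path)

Note that a single video larger than the cap will still blow past the limit with this scheme, which is why true volume splitting (the kind rar does, cutting in the middle of a file) is what the question ultimately asks for.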
I want to run a script to check whether certain files in my Dropbox folder have changed. I am currently using os.path.getmtime() to check that the modified time is in some window of time.time(). The problem is that if I modify a file in my Dropbox folder from a different computer than where the script is set to run, the modified time does not change on that latter computer. Is there a good way to watch shared files that doesn't run into this problem?
Thanks for any help! I am just getting into python.
*******UPDATE*******
I have been playing more with how Dropbox handles file timestamping. It only updates the mtime if the file's contents change. If you open a file and save it again without actually changing it, the mtime stays the same.
It looks like Dropbox preserves mtime when synchronizing files. Try detecting changed files by file size and/or checksum (MD5, SHA-1, or similar) instead of modification time. Or just ask Dropbox :) (I don't know if it has an API for this).
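A minimal sketch of checksum-based change detection (the helper names are made up; swap in SHA-1 if you prefer):

import hashlib
import os

def file_digest(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks to keep memory bounded."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def changed_files(paths, known_digests):
    """Return the paths whose content hash differs from the stored value,
    updating the stored hashes in place."""
    changed = []
    for path in paths:
        digest = file_digest(path)
        if known_digests.get(path) != digest:
            changed.append(path)
            known_digests[path] = digest
    return changed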