An application A (out of my control) writes a file into a directory.
After the file is written I want to back it up somewhere else with a python script of mine.
Question: how can I be sure that the file is complete, rather than that application A is still writing it, in which case I should wait until it finishes? I am worried I could copy a partial file.
I wanted to use shutil.copyfile(src, dst), but I don't know whether that is safe or whether I should check the file in some other way first.
In general, you can't.
Because you don't have the information needed to solve the problem.
If you have to know that a file was completely transferred/created/written/whatever successfully, the creator has to send you a signal somehow, because only the creator has that information. From the receiving side, there's in general no way to infer that a file has been completely transferred. You can try to guess, but that's all it is. You can't in general tell a complete transfer from one where the connection was lost, for example.
So you need a signal of some sort from the sender.
One common way is to use a rename operation from something like filename.xfr to filename, or from an "in work" directory to the one you're watching. Since most operating systems implement such rename operations atomically, if the sender only does the rename when the transfer is successfully done, you'll only process complete files that have been successfully transferred.
Another common signal is to send a "done" flag file, such as sending filename.done once filename has been successfully sent.
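If the sender could be made to cooperate, the receiving side might look like this minimal sketch; the .done convention, the names, and the polling interval are assumptions, not anything application A guarantees:

import os
import shutil
import time

def copy_when_done(src, dst, poll=1.0):
    # Wait for the sender's "done" marker before copying; this only
    # works if the sender really writes src + ".done" after it has
    # finished writing src.
    done_marker = src + ".done"
    while not os.path.exists(done_marker):
        time.sleep(poll)
    shutil.copyfile(src, dst)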
Since you don't control the sender, you can't reliably solve this problem by watching for files.
Related
I have a logger where I append a line for each downloaded file, because I need to monitor that.
But then I end up with a log full of these lines. I would like a solution where, when downloading 50,000 files from the server, the last line just updates the count of finished downloads and the last file downloaded, like this:
[timestamp] Started downloading 50 000 files.
[timestamp] Downloaded 1002th file - filename.csv
[timestamp] <Error downloading this file> # shown only on error, of course
[timestamp] Download finished.
This is not a terminal log, it is a log file, which I read actively with tail -f.
How can I make the line Downloaded 1002th file - filename.csv dynamic?
The easiest solution would be to write the whole file at once after each download completes, truncating it before each such write. Otherwise you would have to work at a rather low level, using the seek and tell methods of Python file objects (https://docs.python.org/3/tutorial/inputoutput.html), which would be overkill just to save a few lines. Either way, such in-place changes may not play well with tail -f: if the file size does not change, tail may not update its position in the file, and if you reopen the file in Python, the file descriptor will change and you may have to use tail -F instead. Maybe it would be enough to use watch cat?
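A minimal sketch of the truncate-and-rewrite approach; the file name and message format are illustrative:

STATUS_PATH = "download_status.log"  # illustrative name

def write_status(done, total, last_file):
    # Mode "w" truncates on every open, so the file always holds
    # only the latest state instead of one line per download.
    with open(STATUS_PATH, "w") as f:
        f.write("Downloaded %d of %d files (last: %s)\n" % (done, total, last_file))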
Attempting to modify the log file is:
Very hard, as you're trying to modify it while continuously writing to it from Python:
If you do it from an external program, you'll have two writers on the same file section, which can cause big issues.
If you do it from Python, you won't actually be able to use the logging module as-is; you'd need to start creating custom file handlers and flags.
You'll cause issues with tail -F actually reading it.
Discouraged. A log is a log; you shouldn't go into random sections and modify them.
If you wish to easily monitor this, you have multiple different solutions:
Write the "successfully downloaded file" using logging.debug instead of logging.info. Then monitor on logging.INFO level. I believe this is the best course of action. You can even write the debug to one log and info to another, and monitor the info.
Send a "successfully downloaded 10/100/1000 files". That'll batch the logging.info rows.
Use any type of external monitoring. More of a custom solution, a bit out of scope for the question.
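A rough sketch of the first option, splitting DEBUG and INFO into two files; the logger and file names are illustrative:

import logging

logger = logging.getLogger("downloader")
logger.setLevel(logging.DEBUG)

formatter = logging.Formatter("[%(asctime)s] %(message)s")

# Per-file lines go to a verbose debug log...
debug_handler = logging.FileHandler("downloads_debug.log")
debug_handler.setLevel(logging.DEBUG)
debug_handler.setFormatter(formatter)

# ...while the log you tail -f only sees INFO and above.
info_handler = logging.FileHandler("downloads.log")
info_handler.setLevel(logging.INFO)
info_handler.setFormatter(formatter)

logger.addHandler(debug_handler)
logger.addHandler(info_handler)

logger.info("Started downloading 50 000 files.")
logger.debug("Downloaded file - filename.csv")  # appears in the debug log only
logger.info("Download finished.")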
A python process is writing to a file and the file has been deleted/moved by an external process (a cron job in my case).
The python process continues to execute without any errors (expected, as it is writing to a buffer rather than the file, and the buffer is flushed on f.close()). Yet no new file is created in this case and the buffer is silently discarded (correct me if I'm wrong).
Is there any pythonic way to handle this, other than checking whether the file exists and creating it if not before every write operation?
There is no "pythonic" way to do this because the question isn't about a specific language. It's an operating system question. So the answer is going to be different for MS Windows than it is for a UNIX like OS such as Linux or macOS. To do this efficiently requires using a facility such as the Linux inotify API. A simpler approach that will work on any UNIX like OS is to open the file then call os.fstat() and remember the st_ino member of the returned object. Then periodically call os.stat() on the path name and compare its st_ino value to the one you saved earlier. If it changes, or the os.stat() call fails, then you know the file name you are writing to is no longer the same file.
I am in the process of writing a program and need some guidance. Essentially, I am trying to determine whether a file has some marker or flag attached to it, sort of like the attributes in an HTTP header.
If such a marker exists, that file will be manipulated in some way (moved to another directory).
My question is:
Where exactly should I be storing this flag/marker? Do files have a system similar to HTTP headers? I don't want to access or manipulate the contents of the file, just some property of the file that can be edited without corrupting it, and it must be fairly universal across file types, as my potential domain of file types is unbounded. I have some experience with web APIs, so I am familiar with HTTP headers and JSON. Does any similar system exist for local files on Windows? I am especially interested in anyone with professional/industry knowledge of common techniques programmers use to store 'metadata' in files in order to access it later. Or point me in the right direction, as I am unsure what I should be researching.
For the record, I am going to write the program for Windows, probably in Golang or Python, and the files I am going to manipulate will potentially be all common types (.docx, .txt, .pdf, etc.).
Metadata you wish to add is best kept in a separate file or database covering all files.
Or in a companion file with the same name and a different extension or prefix, which you can make hidden.
Relying on the file system is very tricky: your data will be bound by the restrictions and capabilities of the file system your file is stored on.
And you cannot count on your data remaining intact, as any application may change these flags.
Some of them have a very specific, clearly defined use, such as creation time, modification time, access time...
For example, if you only need to flag the document, you might abuse the creation time, which stays unchanged throughout the life of the document (until it is copied), to store your flags. :D
Very dirty business: unprofessional, unreliable and all that.
But it's a solution. A poor one, but it exists.
I don't know of any extra bits in the FAT32 or NTFS file systems for flagging, beyond those already used by the OS.
The Unix EXT family of file systems does support some extra bits, and even then you should be careful in case some other important application makes use of them for something.
macOS may support some metadata by itself, but I am not 100% sure.
On Windows you have one more option for associating extra data with a file, but I wouldn't use it either.
The NTFS file system (FAT doesn't support this) has a feature called alternate data streams.
In essence, the same file can have multiple data streams under it: more than one set of file contents under the same file node. To put it plainly, the same file contains two different files.
When you open the file normally, only the main stream is visible to the application. Applications must check whether other streams are present and choose the one they want to read.
So you may choose to store metadata under a second stream of the file.
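For what it's worth, Python can read and write such a stream with ordinary open() calls, since NTFS addresses streams as filename:streamname. A minimal sketch, where the stream name "metadata" is an arbitrary choice and this only works on NTFS volumes:

# Write an alternate data stream; NTFS treats "name:stream" as a
# second set of contents attached to the same file node.
with open("document.docx:metadata", "w") as stream:
    stream.write("flag=move-to-archive")

# Read it back; opening "document.docx" normally still shows only
# the main stream.
with open("document.docx:metadata") as stream:
    print(stream.read())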
But what if the stream you want is already taken by some other application?
Worse, anti-virus programs may block your access to the extra streams out of paranoia, or at least ask for permission.
I don't know why MS included this option, probably for file duplication or something, but attackers made use of the fact that you can store data under an existing regular file without anybody being aware of it.
Imagine a virus writing a copy of itself into another stream of one of the programs already there.
All it needs to start, instead of your old program the next time you run it, is a batch script added to the task scheduler that flips the two streams, making the virus data the main one.
A nasty trick! So when this feature started to be abused, anti-virus software began restricting files with multiple streams, to the point that it's as if the feature didn't exist.
If you want to add some metadata using the OS's own technology, you could use the Windows registry, but even that is unwise.
What can I tell you?
Don't add metadata to the files themselves. Keep a separate file, or index your data in special files with the same name as the file you are referring to, in the same folder.
If you are dealing with binary files like docx and pdf, you're best off storing the metadata in separate files or in an SQLite file.
Metadata is usually stored separately from files, in data structures called inodes (at least on Unix systems; Windows has something similar). But you probably don't want to go that deep into the rabbit hole.
If your goal is to query the system based on metadata, it is easier and more efficient to use something like SQLite. Having the metadata inside each file would mean you would need to open the file, read it into memory from disk, and then check the metadata, i.e. slower queries.
If you don't need to query based on metadata, then storing it in the file might make sense. It would reduce the dependencies in your application, but in order to open the file in Word or Adobe Reader you'd need to strip the metadata before handing it off. Not worth the hassle, usually.
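For example, a minimal sketch of such a sidecar SQLite store; the table layout, file names, and flag values are illustrative:

import sqlite3

conn = sqlite3.connect("file_metadata.db")
conn.execute("CREATE TABLE IF NOT EXISTS metadata (path TEXT PRIMARY KEY, flag TEXT)")

def set_flag(path, flag):
    conn.execute("INSERT OR REPLACE INTO metadata (path, flag) VALUES (?, ?)",
                 (path, flag))
    conn.commit()

def files_with_flag(flag):
    # The query never opens the files themselves, so it stays fast.
    rows = conn.execute("SELECT path FROM metadata WHERE flag = ?", (flag,))
    return [path for (path,) in rows]

set_flag(r"C:\docs\report.docx", "move-to-archive")
print(files_with_flag("move-to-archive"))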
I've been using WatchedFileHandler as my Python logging file handler so that I can rotate my logs with logrotate (on Ubuntu 14.04), which, as the docs say, is exactly what it's for. My logrotate config file looks like:
/path_to_logs/*.log {
    daily
    rotate 365
    size 10M
    compress
    delaycompress
    missingok
    notifempty
    su root root
}
Everything seemed to be working just fine. I'm using logstash to ship my logs to my elasticsearch cluster and everything is great. I added a second log file for my debug logs, which gets rotated but is not watched by logstash. I noticed that when that file is rotated, python just keeps writing to /path_to_debug_logs/*.log.1 and never starts writing to the new file. If I manually tail /path_to_debug_logs/*.log.1, it switches over instantly and starts writing to /path_to_debug_logs/*.log.
This seems REALLY weird to me.
I believe what is happening is that logstash is always tailing my non-debug logs, which somehow triggers the switch-over to the new file after logrotate is called. If logrotate is called twice without a switch-over, the log.1 file gets moved and compressed to log.2.gz, which python can no longer log to, and logs are lost.
Clearly there are a bunch of hacky solutions to this (such as a cronjob that tails all my logs every now and then), but I feel like I must be doing something wrong.
I'm using WatchedFileHandler and logrotate instead of RotatingFileHandler for a number of reasons, but mainly because it will nicely compress my logs for me after rotation.
UPDATE:
I tried the horrible hack of adding a manual tail to the end of my log rotation config script:
sharedscripts
postrotate
    /usr/bin/tail -n 1 path_to_logs/*.log.1
endscript
Sure enough, this works most of the time, but it randomly fails sometimes for no clear reason, so it isn't a solution. I've also tried a number of less hacky approaches where I modified the way WatchedFileHandler checks whether the file has changed, but no luck.
I'm fairly sure the root of my problem is that the logs are stored on a network drive, which is somehow confusing the file system.
I'm moving my rotation to python with RotatingFileHandler, but if anyone knows the proper way to handle this I'd love to know.
Use the copytruncate option of logrotate. From the docs:
copytruncate
Truncate the original log file in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
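Applied to the config from the question, that just means adding one line (everything else unchanged):

/path_to_logs/*.log {
    daily
    rotate 365
    size 10M
    copytruncate
    compress
    delaycompress
    missingok
    notifempty
    su root root
}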
WatchedFileHandler does a rollover when a device and/or inode change is detected in the log file just before writing to it. Perhaps the file which isn't being watched by logstash doesn't see a change in its device/inode? That would explain why the handler keeps on writing to it.
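Conceptually, the check the handler performs before each write looks something like this simplified sketch (not the actual stdlib code):

import os

def same_file(stream, path):
    # Compare the (device, inode) pair of the open stream with
    # whatever the path currently points at; WatchedFileHandler
    # reopens its file when these stop matching.
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    fst = os.fstat(stream.fileno())
    return (st.st_dev, st.st_ino) == (fst.st_dev, fst.st_ino)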
I need to run a python program in the background. I have given the script one input file; the code processes that file and creates a new output file. Now, if I change the input file's content, I don't want to have to run the code again by hand: it should keep running in the background and regenerate the output file. If someone knows the answer to this, please let me know.
Thank you.
Basically, you have to set up a so-called file watcher, i.e. some mechanism that looks out for changes to a file.
There are several techniques for watching file/directory changes in Python. Have a look at this question: Monitoring contents of files/directories?. Another link is here; it is about directory changes, but file changes are handled in a similar way. You could also google "watch file changes python" to get a lot of answers :)
Note: if you're programming on Windows, you should probably implement your program as a Windows service; look here for how to do that.
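If a simple polling loop is enough, here is a minimal sketch; process is a hypothetical stand-in for your existing code:

import os
import time

def watch(path, on_change, interval=1.0):
    # Re-run on_change whenever the file's modification time
    # advances. Libraries such as watchdog do this with OS change
    # notifications instead, but plain polling needs nothing extra.
    last_mtime = None
    while True:
        try:
            mtime = os.path.getmtime(path)
        except FileNotFoundError:
            mtime = None
        if mtime is not None and mtime != last_mtime:
            last_mtime = mtime
            on_change(path)
        time.sleep(interval)

# watch("input.txt", process)  # process(path) would regenerate the output file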