Python: check whether a file is still being uploaded

Python 2.6
My script needs to monitor some 1 GB files on an FTP server; whenever one of them is changed or modified, the script will download it to another place. The file names remain unchanged: people delete the original file on the FTP server first, then upload a newer version. My script checks file metadata, such as the file size and date modified, to see if there is any difference.
The problem is that when the script checks the metadata, the new file may still be in the middle of being uploaded. How should I handle this situation? Is there any file attribute that indicates the upload status (for example, that the file is locked)? Thanks.

There is no such attribute. You may be unable to GET such a file, but that depends on the server software. Also, the file access flags may be set one way while the file is being uploaded and then changed when the upload is complete, or the incomplete file may have a modified name (e.g. original_filename.ext.part) -- it all depends on the server-side software used for the upload.
If you control the server, make your own metadata, e.g. create an empty flag file alongside the newly uploaded file when upload is finished.
In the general case, I'm afraid, the best you can do is monitor the file size and consider the file completely uploaded if its size has not changed for a while. Make this interval sufficiently large (on the order of minutes).
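Here is a minimal sketch of that approach using ftplib, assuming the server supports the SIZE command; the host, credentials, path, and both intervals are placeholders to adapt:

import time
from ftplib import FTP

def wait_until_stable(host, user, password, path, settle=300, poll=30):
    # Treat the file as completely uploaded once its reported size has not
    # changed for `settle` seconds.  All parameters are placeholders.
    last_size = None
    stable_since = None
    while True:
        ftp = FTP(host)
        ftp.login(user, password)
        try:
            ftp.voidcmd('TYPE I')      # SIZE is only reliable in binary mode
            size = ftp.size(path)
        finally:
            ftp.quit()
        if size is not None and size == last_size:
            if stable_since is None:
                stable_since = time.time()
            if time.time() - stable_since >= settle:
                return size            # unchanged for `settle` seconds
        else:
            last_size = size
            stable_since = None
        time.sleep(poll)

Reconnecting on every poll is deliberately simple; it avoids the control connection timing out between checks.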

Your question leaves out a few details, but I'll try to answer.
If you're running your status-checker program on the same server that's running the FTP service:
1) Depending on your operating system: if you're using Linux and you've built inotify into your kernel, you could use pyinotify to watch your upload directory -- inotify distinguishes between open, modify, and close events and lets you watch filesystem events asynchronously, so you're not polling constantly (a pyinotify sketch appears at the end of this answer). OS X and Windows both have similar but differently implemented facilities.
2) You could pythonically tail -f the server's log to see when a new file is put on the server (if you're even logging that) and just update when you see the related log messages.
If you're running your program remotely:
3) If your status-checking utility has to run on a host remote from the FTP server, you'd have to poll the file and build in some logic to detect size changes. You can use the FTP SIZE command for this; it returns an easily parseable string.
You'd have to put some logic into it such that if the file size gets smaller you assume the file is being replaced, and then wait for it to grow until it stops growing and stays the same size for some duration. If the archive is compressed in a way that lets you verify a checksum, you could then download it, verify the checksum, and re-upload it to the remote site.
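For option 1 above, a minimal pyinotify sketch might look like the following; the watched directory is a placeholder, and it assumes the FTP server writes uploads in place under their final name and closes the file only when the transfer ends:

import pyinotify

class UploadHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # The uploading process has closed the file -- safe to copy it now.
        print('upload finished: %s' % event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/srv/ftp/incoming', pyinotify.IN_CLOSE_WRITE)
pyinotify.Notifier(wm, UploadHandler()).loop()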

Related

Lock files while process is running

I have skeleton code where the user specifies a scripts.txt file that contains service names. Whenever the application is installed, the code automatically generates .service files for every entry in scripts.txt. These services are placed in /lib/systemd/system, so that whenever the machine crashes the application gets restarted.
I would like to add an uninstallation feature that removes the .service files that were created. For that, the code has to look into scripts.txt again, but if the user has changed this file in the meantime, the application will not know which services to remove.
Therefore I would like to know: is there a way to lock a file so that it cannot be edited while a certain process is running?
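As an illustration of one common approach (not an answer from the original thread): on Linux you can take an advisory lock with fcntl.flock, though it only protects against programs that also take the lock; a plain text editor can still modify scripts.txt.

import fcntl

with open('scripts.txt') as f:
    fcntl.flock(f, fcntl.LOCK_EX)          # held for the duration of the block
    services = [line.strip() for line in f if line.strip()]
    # ... create or remove the corresponding .service files here ...
    fcntl.flock(f, fcntl.LOCK_UN)

A more robust alternative is to copy scripts.txt to a private location at install time and read that copy during uninstallation.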

External input to Python program during runtime

I am creating a test automation tool that drives an application without any interfaces. However, the application calls a batch script when it changes modes, so I am able to catch the mode transitions.
What I want is for the batch script to give an input to my Python script (I have a state machine running in Python) during runtime, so that I can monitor the state of the application with Python instead of the batch file.
I am using a state machine similar to the one by Karn Saheb:
https://dev.to/karn/building-a-simple-state-machine-in-python
However, instead of changing states statically like:
device.on_event('event')
I want the python script to do something similar to:
while True:
    device.on_event(input())  # where the input is passed from the batch script:
REM state.bat
set CurrentState=%1
"magic code to pass CurrentState to python input()" %CurrentState%
I see that one solution would be to start the Python script from the batch file every time it is called with the "event", and then save the current event in another file upon termination of the Python script... But I want to avoid that kind of handling and instead feed the events in during runtime.
Thank you in advance!
A reasonably portable way of doing this without ugly polling on temporary files is to use a socket: have the main process listen and have the batch file(s) start a small program that connects to the server and writes a message.
There are security considerations here: you can start by listening only to the loopback interface, with further authentication if the local machine should not be trusted.
If you have more than one of these processes, or if you need to handle the child dying before it issues its next report, you’ll have to use threads or something like select to unify the news from different input channels (e.g., waiting on the child to exit vs. waiting on news from the next batch file).
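A minimal sketch of that suggestion, assuming a hypothetical port 5555 and a one-message-per-connection protocol (neither is part of the original answer):

import socket

def serve(device, port=5555):
    # Feed every message received on the loopback interface to the state machine.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(('127.0.0.1', port))          # loopback only, as suggested
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        state = conn.recv(1024).decode().strip()
        conn.close()
        if state:
            device.on_event(state)

On the batch side, the "magic code" line would become something like python send_state.py %CurrentState%, where send_state.py (a hypothetical helper) simply connects to 127.0.0.1:5555, sends its argument, and exits.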

File flush needed after process exit?

I'm writing files from one process using open and write (i.e. direct kernel calls). After the write, I simply close the file and exit the application without flushing. The application is started from a Python wrapper which reads the files immediately after the application exits. Sometimes, however, the Python wrapper reads incorrect data, as if it were still seeing an old version of the file (i.e. the wrapper reads stale data).
I thought that, regardless of whether the file metadata and contents had been written to disk, the user-visible contents would always be valid and consistent (i.e. the buffers get flushed to memory at least, so subsequent reads return the same content even if it hasn't been committed to disk yet). What's going on here? Do I need to sync on close in my application, or can I simply issue a sync command after running my application from the Python script to guarantee that everything has been written correctly? This is running on ext4.
On the Python side:
# Called for lots of files
o = subprocess.check_output(['./App.BitPacker', inputFile])  # Writes indices.bin and dict.bin
indices = open('indices.bin', 'rb').read()
dictionary = open('dict.bin', 'rb').read()
with open('output-file', 'wb') as output:
    output.write(dictionary)  # Invalid content in output-file ...
# output-file is a placeholder, one output file per inputFile of course
I've never had your problem and have always found a call to close() to be sufficient. However, from the man page for close(2):
A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)
As you haven't, at the time of writing, included the code for the writing process, I can only suggest adding a call to fsync in that process and seeing if that makes a difference.
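As an illustration of the suggested ordering only (the actual writer is not a Python program), here is what the fsync-before-close pattern looks like with Python's os-level calls; the file name and payload are placeholders:

import os

data = b'example payload'    # placeholder for whatever the application produces
fd = os.open('indices.bin', os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, data)
os.fsync(fd)                 # force file contents to stable storage before closing
os.close(fd)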

How to download part of a file over SFTP connection?

So I have a Python program that pulls access logs from remote servers and processes them. There are separate log files for each day. The files on the servers are in this format:
access.log
access.log-20130715
access.log-20130717
The file "access.log" is the log file for the current day, and is modified throughout the day with new data. The files with the timestamp appended are archived log files, and are not modified. If any of the files in the directory are ever modified, it is either because (1) data is being added to the "access.log" file, or (2) the "access.log" file is being archived, and an empty file takes its place. Every minute or so, my program checks for the most recent modification time of any files in the directory, and if it changes it pulls down the "access.log" file and any newly archived files
All of this currently works fine. However, if a lot of data is added to the log file throughout the day, downloading the whole thing over and over just to get some of the data at the end of the file will create a lot of traffic on the network, and I would like to avoid that. Is there any way to only download a part of the file? If I have already processed, say 1 GB of the file, and another 500 bytes suddenly get added to the log file, is there a way to only download the 500 bytes at the end?
I am using Python 3.2, my local machine is running Windows, and the remote servers all run Linux. I am using Chilkat for making SSH and SFTP connections. Any help would be greatly appreciated!
Call ResumeDownloadFileByName. Here's the description of the method in the Chilkat reference documentation:
Resumes an SFTP download. The size of the localFilePath is checked and the download begins at the appropriate position in the remoteFilePath. If localFilePath is empty or non-existent, then this method is identical to DownloadFileByName. If the localFilePath is already fully downloaded, then no additional data is downloaded and the method will return True.
See http://www.chilkatsoft.com/refdoc/pythonCkSFtpRef.html
You could do that, or you could massively reduce your complexity by splitting the latest log file down into hours, or tens of minutes.
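As an illustration of the same seek-past-what-you-have idea with a different library (Paramiko rather than the Chilkat API used in the question), a sketch might look like this; the host, credentials, and offset bookkeeping are placeholders:

import paramiko

def fetch_tail(host, user, password, remote_path, already_have):
    # Stat the remote file, seek past the bytes already processed, and read
    # only the newly appended tail.
    transport = paramiko.Transport((host, 22))
    transport.connect(username=user, password=password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        remote_size = sftp.stat(remote_path).st_size
        if remote_size <= already_have:
            return b''                 # nothing new (or the file was rotated)
        f = sftp.open(remote_path, 'rb')
        try:
            f.seek(already_have)
            return f.read(remote_size - already_have)
        finally:
            f.close()
    finally:
        sftp.close()
        transport.close()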

Special considerations when performing file I/O on an NFS share via a Python-based daemon?

I have a python-based daemon that provides a REST-like interface over HTTP to some command line tools. The general nature of the tool is to take in a request, perform some command-line action, store a pickled data structure to disk, and return some data to the caller. There's a secondary thread spawned on daemon startup that looks at that pickled data on disk periodically and does some cleanup based on what's in the data.
This works just fine if the disk where the pickled data resides happens to be local disk on a Linux machine. If you switch to an NFS-mounted disk, the daemon starts life just fine, but over time the NFS-mounted share "disappears" and the daemon can no longer tell where it is on disk with calls like os.getcwd(). You'll start to see it log errors like:
2011-07-13 09:19:36,238 INFO Retrieved submit directory '/tech/condor_logs/submit'
2011-07-13 09:19:36,239 DEBUG CondorAgent.post_submit.do_submit(): handler.path: /condor/submit?queue=Q2%40scheduler
2011-07-13 09:19:36,239 DEBUG CondorAgent.post_submit.do_submit(): submitting from temporary submission directory '/tech/condor_logs/submit/tmpoF8YXk'
2011-07-13 09:19:36,240 ERROR Caught un-handled exception: [Errno 2] No such file or directory
2011-07-13 09:19:36,241 INFO submitter - - [13/Jul/2011 09:19:36] "POST /condor/submit?queue=Q2%40scheduler HTTP/1.1" 500 -
The unhandled exception resolves to the daemon being unable to see the disk any more. Any attempt to figure out the daemon's current working directory with os.getcwd() at this point will fail. Even trying to change to the root of the NFS mount, /tech, will fail. All the while, the logger.logging.* methods are happily writing log and debug messages to a log file located on the NFS-mounted share at /tech/condor_logs/logs/CondorAgentLog.
The disk is most definitely still available. There are other, C++-based daemons reading from and writing to this share at a much higher frequency at the same time as the Python-based daemon.
I've come to an impasse debugging this problem. Since it works on local disk, the general structure of the code must be good, right? There's something about NFS-mounted shares that is incompatible with my code, but I can't tell what it might be.
Are there special considerations one must implement when dealing with a long-running Python daemon that will be reading and writing frequently to an NFS-mounted file share?
If anyone wants to see the code, the portion that handles the HTTP request and writes the pickled object to disk is on GitHub here, and the portion that the sub-thread uses to do periodic cleanup by reading the pickled objects from disk is here.
I have the answer to my problem, and it had nothing to do with the fact that I was doing file I/O on an NFS share. It turns out the problem just showed up faster when the I/O was over an NFS mount rather than local disk.
A key piece of information is that the code was running threaded via the SocketServer.ThreadingMixIn and HTTPServer classes.
My handler code was doing something close to the following:
base_dir = getBaseDirFromConfigFile()
current_dir = os.getcwd()
temporary_dir = tempfile.mkdtemp(dir=base_dir)
chdir(temporary_dir)
doSomething()
chdir(current_dir)
cleanUp(temporary_dir)
That's the flow, more or less.
The problem wasn't that the I/O was being done over NFS. The problem was that os.getcwd() isn't thread-local; it's process-global. So as one thread issued a chdir() to move into the temporary directory it had just created under base_dir, the next thread calling os.getcwd() would get that thread's temporary_dir instead of the static base directory where the HTTP server was started.
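A tiny demonstration of this behaviour (nothing from the original daemon, just an illustration):

import os
import tempfile
import threading
import time

def worker():
    os.chdir(tempfile.mkdtemp())   # affects every thread in the process
    time.sleep(1)

threading.Thread(target=worker).start()
time.sleep(0.5)
print(os.getcwd())                 # prints the worker's temporary directory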
There are other people reporting similar issues here and here.
The solution was to get rid of the chdir() and getcwd() calls: start up in one directory, stay there, and access everything else through absolute paths.
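A sketch of what that fix looks like against the flow shown above; getBaseDirFromConfigFile() and doSomething() are the same placeholders as before, and passing the directory as an argument is an assumed change:

import shutil
import tempfile

base_dir = getBaseDirFromConfigFile()
temporary_dir = tempfile.mkdtemp(dir=base_dir)   # already an absolute path
try:
    doSomething(temporary_dir)                   # work inside it via the path, never chdir()
finally:
    shutil.rmtree(temporary_dir)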
The NFS vs. local-disk behaviour threw me for a loop. It turns out my block:
chdir(temporary_dir)
doSomething()
chdir(current_dir)
cleanUp(temporary_dir)
was running much more slowly when the filesystem was NFS rather than local disk. It made the problem occur much sooner because it increased the chances that one thread was still in doSomething() while another thread was running the current_dir = os.getcwd() part of the code block. On local disk the threads moved through the entire code block so quickly that they rarely interleaved like that. But give it enough time (about a week) and the problem would crop up on local disk as well.
So, a big lesson learned about thread-safe operations in Python!
To answer the question literally, yes there are some gotchas with NFS. E.g.:
NFS is not cache coherent, so if several clients are accessing a file they might get stale data.
In particular, you cannot rely on O_APPEND to atomically append to files.
Depending on the NFS server, O_CREAT|O_EXCL might not work properly (it does work properly on modern Linux, at least); a sketch follows after this list.
Especially older NFS servers have deficient or completely non-working locking support. Even on more modern servers, lock recovery can be a problem after server and/or client reboot. NFSv4, a stateful protocol, ought to be more robust here than older protocol versions.
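To illustrate the O_CREAT|O_EXCL point, here is a sketch of atomic lock-file creation, the operation that older NFS servers historically got wrong; the lock path is a placeholder:

import errno
import os

def try_lock(path='/tech/condor_logs/cleanup.lock'):
    # Atomically create the lock file; fail cleanly if it already exists.
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False               # another process already holds the lock
        raise
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True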
All this being said, it sounds like your problem isn't really related to any of the above. In my experience, the Condor daemons will, depending on the configuration, at some point clean up files left over from finished jobs. My guess would be to look for the culprit there.
