Wait till file copy/upload completes - Python

I need to wait until a file copy/upload finishes completely, using Python (preferred approach); bash/shell is also fine (I will call it from Python).
I have a shared NFS directory /data/files_in/. If somebody copies/uploads a file to /data/files_in/, I should notify another application, but only after the file copy/upload is completely done.
My current code to check whether the file copy is complete:
import time
from pathlib import Path

while True:
    current_size = Path(file_path).stat().st_size
    time.sleep(5)
    result_size = Path(file_path).stat().st_size
    if result_size == current_size:
        break
# Notify your application
It works only with small files; for large files, like 100 GB files, it does not work reliably.
I have increased the timer, but it still fails sometimes, and a timer-based approach does not seem like a good idea to rely on.
Is there another way I can implement this to fix the issue?
OS: Linux (CentOS)
Python version: 3.9

I can't comment, so I will ask here: shouldn't the resulting size be larger (or at least different) from the current one in order for the file to be done uploading, and therefore stop the loop?

I assume you cannot establish any kind of direct communication with the other process, i.e. the one that is copying/uploading the file.
One common approach in these cases is to have the other process write/erase a "semaphore" file. It may create the semaphore just before it begins copying and erase it just after finishing, so the semaphore means "don't do anything, I'm still running"; or the other way round, it creates the semaphore just after finishing and erases it just before starting the next time, so the semaphore means "your data is ready to use".
That said, I'm amazed your approach doesn't work if you allow enough time; 5 seconds should be more than enough on any network.
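A minimal sketch of the second convention, where the uploader drops a marker file after it has finished and closed the data file; the '<name>.done' naming and polling interval are assumptions that both sides would have to agree on, not something your current uploader does:

import time
from pathlib import Path

WATCH_DIR = Path("/data/files_in")   # shared NFS directory from the question
POLL_SECONDS = 5

def wait_for_completed_files():
    # yield data files whose '<name>.done' marker has appeared (hypothetical convention)
    seen = set()
    while True:
        for marker in WATCH_DIR.glob("*.done"):
            data_file = marker.with_suffix("")   # strip the .done suffix
            if data_file.exists() and data_file not in seen:
                seen.add(data_file)
                yield data_file                  # safe to notify the other application now
        time.sleep(POLL_SECONDS)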

Related

Good practice for parallel tasks in python

I have one python script which is generating data and one which is training a neural network with tensorflow and keras on this data. Both need an instance of the neural network.
Since I haven't set the "allow growth" flag, each process takes the full GPU memory. Therefore I simply give each process its own GPU. (Maybe not a good solution for people with only one GPU... yet another unsolved problem.)
The actual problem is as follows: both instances need access to the network's weights file. I recently had a bunch of crashes because both processes tried to access the weights. A flag or something similar should stop each process from accessing the file while the other process is accessing it. Hopefully this doesn't create a bottleneck.
I tried to come up with a solution like semaphores in C, but today I found this post on Stack Exchange.
The idea with renaming seems quite simple and effective to me. Is this good practice in my case? I'll just create the weights file with my own function
self.model.save_weights(filepath='weights.h5$$$')
in the learning process, rename them after saving with
os.rename('weights.h5$$$', 'weights.h5')
and load them in my data generating process with function
self.model.load_weights(filepath='weights.h5')
?
Will this renaming overwrite the old file? And what happens if the other process is currently loading? I would appreciate other ideas on how I could multithread/multiprocess my script. I just realized that generating data, learning, generating data, ... in a sequential script is not really performant.
EDIT 1: Forgot to mention that the weights are stored in a .h5 file by Keras' save function.
The multiprocessing module has an RLock class that you can use to regulate access to a shared resource. This also works for files if you remember to acquire the lock before reading and writing, and release it afterwards. Using a lock implies that some of the time one of the processes cannot read or write the file. How much of a problem this is depends on how much both processes have to access the file.
Note that for this to work, one of the scripts has to start the other script as a Process after creating the lock.
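A minimal sketch of that arrangement, assuming the training script is the parent that spawns the data generator; the function bodies and the weights.h5 path are placeholders, not your actual Keras calls:

import multiprocessing as mp

WEIGHTS_PATH = "weights.h5"   # shared weights file (placeholder)

def generate_data(weights_lock):
    # data-generating process: read the weights only while holding the lock
    with weights_lock:
        pass  # e.g. model.load_weights(filepath=WEIGHTS_PATH)

def train(weights_lock):
    # training process: write the weights only while holding the lock
    with weights_lock:
        pass  # e.g. model.save_weights(filepath=WEIGHTS_PATH)

if __name__ == "__main__":
    lock = mp.RLock()                              # created before the child starts
    child = mp.Process(target=generate_data, args=(lock,))
    child.start()
    train(lock)                                    # the parent keeps training here
    child.join()

Passing the lock as an argument to Process is what ties both workers to the same lock object, which is why one script has to start the other.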
If the weights are a Python data structure, you could put that under control of a multiprocessing.Manager. That will manage access to the objects under its control for you. Note that a Manager is not meant for use with files, just in-memory objects.
Additionally on UNIX-like operating systems Python has os.lockf to lock (part of) a file. Note that this is an advisory lock only. That is, if another process calls lockf, the return value indicates that the file is already locked. It does not actually prevent you from reading the file.
Note:
Files can be read and written. Only when two processes are reading the same file (read/read) does this work well. Every other combination (read/write, write/read, write/write) can and eventually will result in undefined behavior and data corruption.
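A minimal sketch of the os.lockf call mentioned above; the helper is illustrative only, and because the lock is advisory it only helps if every process that touches the file uses it:

import os

def locked_update(path):
    # lockf needs a descriptor opened for writing
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.lockf(fd, os.F_LOCK, 0)    # blocks until the whole file can be locked
        # ... read or rewrite the weights file here ...
    finally:
        os.lockf(fd, os.F_ULOCK, 0)   # release the lock
        os.close(fd)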
Note2:
Another possible solution involves inter process communication.
Process 1 writes a new h5 file (with a random filename), closes it, and then sends a message (using a Pipe or Queue) to Process 2: "I've written a new parameter file \path\to\file".
Process 2 then reads the file and deletes it. This can work both ways, but requires that both processes check for and process messages every so often. It prevents file corruption because the writing process only notifies the reading process after it has finished writing the file.
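A minimal sketch of that hand-off using a multiprocessing.Queue; the random file naming and the stand-in read/write calls are assumptions for illustration:

import multiprocessing as mp
import os
import uuid

def writer(queue):
    # Process 1: write a freshly named weights file, close it, then announce it
    path = "weights-%s.h5" % uuid.uuid4().hex
    with open(path, "wb") as f:
        f.write(b"...")              # stand-in for model.save_weights(path)
    queue.put(path)                  # announced only after the file is closed

def reader(queue):
    # Process 2: load the announced file, then delete it
    path = queue.get()               # blocks until a new file is announced
    with open(path, "rb") as f:
        _ = f.read()                 # stand-in for model.load_weights(path)
    os.remove(path)

if __name__ == "__main__":
    q = mp.Queue()
    p1 = mp.Process(target=writer, args=(q,))
    p2 = mp.Process(target=reader, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()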

exiting a program with a cached exit code

I have a "healthchecker" program, that calls a "prober" every 10 seconds to check if a service is running. If the prober exits with return code 0, the healthchecker considers the tested service fine. Otherwise, it considers it's not working.
I can't change the healthchecker (I can't make it check with a bigger interval, or using a better communication protocol than spawning a process and checking its exit code).
That said, I don't want to really probe the service every 10 seconds because it's overkill. I just wanna probe it every minute.
My solution to that is to make the prober keep a "cache" of the last answer valid for 1 minute, and then just really probe when this cache expires.
That seems fine, but I'm having trouble thinking on a decent approach to do that, considering the program must exit (to return an exit code). My best bet so far would be to transform my prober in a daemon (that will keep the cache in memory) and create a client to just query it and exit with its response, but it seems too much work (and dealing with threads, and so on).
Another approach would be to use SQLite/memcached/redis.
Any other ideas?
Since no one has really proposed anything, I'll drop my idea here. If you need an example, let me know and I'll include one.
The easiest thing to do would be to serialize a dictionary that contains the system health and the last time.time() it was checked. At the beginning of your program, unpickle the dictionary and check the time; if less than your 60-second interval has passed, exit with the cached status. Otherwise check the health as normal and cache it (with the time).
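A minimal sketch of that idea, assuming a probe() function that performs the real check and a cache file location that is only a placeholder:

import pickle
import sys
import time

CACHE_FILE = "/tmp/prober_cache.pkl"   # placeholder location
CACHE_TTL = 60                         # seconds the last answer stays valid

def probe():
    # real check of the service; replace with the actual probing logic
    return 0

def main():
    try:
        with open(CACHE_FILE, "rb") as f:
            cache = pickle.load(f)
        if time.time() - cache["checked_at"] < CACHE_TTL:
            sys.exit(cache["status"])      # reuse the cached answer
    except (OSError, KeyError, pickle.UnpicklingError):
        pass                               # no usable cache: probe for real

    status = probe()
    with open(CACHE_FILE, "wb") as f:
        pickle.dump({"checked_at": time.time(), "status": status}, f)
    sys.exit(status)

if __name__ == "__main__":
    main()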

Implementing distributed lock using files

I have a network drive (Z:\) which is shared by multiple Windows computers. Is it possible to implement a cross-machine lock by simply creating/deleting files on this network drive?
For example, two computers, A and B, want to write to a shared resource with an ID of 123 at the same time.
One of the computers, say A, locks the resource first by creating an empty file Z:\locks\123. When B sees there is the lock file with the name of "123", B knows the resource 123 is being used by someone else, so it has to wait for Z:\locks\123 to be deleted by A before it can access the resource.
It's like a critical section in multithreading, but I want to do it across multiple machines.
I'm trying to implement in Python. Here's what I came up with:
import os
import time

def lock_it(lock_id):
    lock_path = "Z:\\locks\\" + lock_id
    while os.path.exists(lock_path):
        time.sleep(5)  # wait for 5 seconds
    # create the lock file
    lock_file = open(lock_path, "w")
    lock_file.close()

def unlock_it(lock_id):
    # delete the lock file
    lock_path = "Z:\\locks\\" + lock_id
    if os.path.exists(lock_path):
        os.remove(lock_path)
This won't work because more than one process could leave the waiting loop and create the lock file at the same time.
So again, the question is: Is it possible to implement a cross-machine locking mechanism on a shared storage?
... sort of.
First, you should create a lock directory instead of a lock file. Creating a directory (see os.mkdir) will fail if the directory already exists, so you can acquire the lock like this:
import errno
import os
import time

def acquire_lock():
    while True:
        try:
            os.mkdir(r"z:\my_lock")
            return
        except OSError as e:
            if e.errno != errno.EEXIST:  # some other error: re-raise instead of waiting
                raise
            time.sleep(5)
Second (and this is where the "sort of" comes in), you'll want some way to notice when the person holding the lock has died. One simple way to do this might be to have them occasionally update a file inside the lock directory. Then if clients notice that file hasn't been updated in a while, they can remove the directory and try to acquire the lock themselves.
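A minimal sketch of that heartbeat idea; the directory name, heartbeat interval and staleness threshold are all assumptions:

import os
import shutil
import time

LOCK_DIR = r"z:\my_lock"
HEARTBEAT = os.path.join(LOCK_DIR, "heartbeat")
STALE_AFTER = 60      # seconds without an update before the holder is considered dead

def refresh_heartbeat():
    # called periodically by the process holding the lock
    with open(HEARTBEAT, "w") as f:
        f.write(str(time.time()))

def break_stale_lock():
    # called by waiters: remove the lock directory if its holder looks dead
    try:
        if time.time() - os.path.getmtime(HEARTBEAT) > STALE_AFTER:
            shutil.rmtree(LOCK_DIR, ignore_errors=True)
    except OSError:
        pass              # heartbeat missing or already removed by someone else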
This will not work nearly as well as you might hope. You'll have other issues, such as the network drive going away, in which case all your processes will either be stuck or think that no one is holding a lock.
I suggest you look into something like ZooKeeper. You will be able to create synchronized locks and recover in the event of network failures. The framework behind distributed locks is much more complex than creating a file on a network drive.
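For comparison, a ZooKeeper lock is only a few lines with a client library such as kazoo; this sketch assumes an ensemble reachable at zk-host:2181, which is not part of the original setup:

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-host:2181")      # assumed ZooKeeper address
zk.start()

lock = zk.Lock("/locks/123", "computer-A")  # resource ID 123 from the question
with lock:                                  # blocks until the lock is acquired
    pass  # ... write to the shared resource ...

zk.stop()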

How to pause a python script running in terminal

I have a web-crawling Python script that has been running in a terminal for several hours, continuously populating my database. It has several nested for loops. For some reason I need to restart my computer and continue my script from exactly the place where I left off. Is it possible to preserve the pointer state and resume the previously running script in the terminal?
I am looking for a solution that works without altering the Python script. Modifying the code is a lower priority, as that would mean relaunching the program and reinvesting time.
Update:
Thanks for the VM suggestion. I'll take that. For the sake of completeness, what generic modifications should be made to the script to make it pausable and resumable?
Update2:
Porting to a VM works fine. I have also modified the script to make it failsafe against network failures. Code written below.
You might try suspending your computer or running in a virtual machine which you can subsequently suspend. But since your script works with network connections, chances are it won't resume from the point you left off once you bring the system back up. Suspending a computer and restoring it, or saving a virtual machine and restoring it, means you need to re-establish the network connection. This is true for any element that is external to your system, and the network is one of them. If you are using a dynamic network, there is a high chance that the next time you boot you will get a new IP, and the network state you were working with previously will be void.
If you are planning to modify the script, there are a few things you need to keep in mind.
Add serializing and deserializing capabilities. Python has the pickle module and the faster cPickle module to do this.
Add restart points. The best way to do this is to save the state at regular intervals and, when restarting your script, restart from the last saved state after re-establishing all the transient elements like the network connection (see the sketch after this list).
This would not be an easy task, so consider investing a considerable amount of time :-)
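A minimal sketch of pickle-based checkpointing, assuming the crawler's progress can be captured in a dictionary; the state fields and file name are placeholders:

import os
import pickle

CHECKPOINT = "crawler_state.pkl"   # placeholder file name

def load_state():
    # return the last saved state, or a fresh one if no checkpoint exists
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"pending_urls": [], "done_urls": []}

def save_state(state):
    # write to a temporary file first so a crash never leaves a half-written checkpoint
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)    # atomic rename

state = load_state()
# ... crawl, updating state, and call save_state(state) at regular intervals ...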
Note: On second thought, there is one alternative to changing your script. You can try cloud virtualization solutions like Amazon EC2.
I ported my script to a VM and launched it from there. However, there were network connection glitches after resuming from hibernation. Here's how I solved it by tweaking the Python script:
import logging
import socket
import time
import urllib2                      # the pages are fetched with urllib2

# `parse` (the HTML parser) and `myurl` are defined elsewhere in the script;
# logging configuration is also assumed to happen elsewhere
logger = logging.getLogger(__name__)

socket.setdefaulttimeout(30)        # set timeout in secs

maxretry = 10                       # set max retries
sleeptime_between_retry = 1         # waiting time between retries
erroroccured = 0

while True:
    try:
        domroot = parse(urllib2.urlopen(myurl)).getroot()
    except Exception as e:
        erroroccured += 1
        if erroroccured > maxretry:
            logger.info("Maximum retries reached. Quitting this leg.")
            break
        time.sleep(sleeptime_between_retry)
        logger.info("Network error occurred. Retrying %d time..." % erroroccured)
        continue
    finally:
        # common code to execute after try or except block, if any
        pass
    break
This modification made my script resilient to network failures.
As others have commented, unless you are running your script in a virtual machine that can be suspended, you would need to modify your script to track its state.
Since you're populating a database with your data, I suggest using it as a way to track the progress of the script (get the latest URL parsed, keep a list of pending URLs, etc.).
If the script is terminated abruptly, you don't have to worry about saving its state, because database transactions will come to the rescue and only the data that you've committed will be saved.
When the script is restarted, only the data for the URLs that you completely processed will be stored, and it can resume by just picking up the next URL according to the database.
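A minimal sketch of tracking progress in the database itself, using a hypothetical SQLite progress table; the schema and helper names are assumptions, not taken from the original script:

import sqlite3

conn = sqlite3.connect("crawler.db")   # placeholder database file
conn.execute("CREATE TABLE IF NOT EXISTS progress (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")
conn.commit()

def next_pending_url():
    row = conn.execute("SELECT url FROM progress WHERE done = 0 LIMIT 1").fetchone()
    return row[0] if row else None

def mark_done(url):
    # committed together with the crawled data, so a crash never loses the marker
    conn.execute("UPDATE progress SET done = 1 WHERE url = ?", (url,))
    conn.commit()

while True:
    url = next_pending_url()
    if url is None:
        break
    # ... crawl `url` and insert its data here ...
    mark_done(url)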
If this problem is important enough to warrant this kind of financial investment, you could run the script on a virtual machine. When you need to shut down, suspend the virtual machine, and then shut down the computer. When you want to start again, start the computer, and then wake up your virtual machine.
WinPDB is a Python debugger that supports remote debugging. I never used it, and I don't know whether remote debugging a running process requires a modification to the script (which is very likely, otherwise it'd be a security issue); but if remote debugging without modifying the script is possible, then you may be able to dump the current state of the script to a file and figure out later how to load it. I don't think it would work, though.

Python: Lock a file

I have a Python app running on Linux. It is called every minute from cron. It checks a directory for files and if it finds one it processes it; this can take several minutes. I don't want the next cron job to pick up the file currently being processed, so I lock it using the code below, which calls portalocker. The problem is it doesn't seem to work. The next cron job manages to get a file handle returned for the file already being processed.
import sys
import portalocker

def open_and_lock(full_filename):
    file_handle = open(full_filename, 'r')
    try:
        portalocker.lock(file_handle,
                         portalocker.LOCK_EX | portalocker.LOCK_NB)
        return file_handle
    except IOError:
        sys.exit(-1)
Any ideas what I can do to lock the file so no other process can get it?
UPDATE
Thanks to @Winston Ewert I checked through the code and found the file handle was being closed way before the processing had finished. It seems to be working now, except that the second process blocks on portalocker.lock rather than throwing an exception.
After fumbling with many schemes, this works in my case. I have a script that may be executed multiple times simultaneously. I need these instances to wait their turn to read/write to some files. The lockfile does not need to be deleted, so you avoid blocking all access if one script fails before deleting it.
import fcntl

def acquireLock():
    ''' acquire exclusive lock file access '''
    locked_file_descriptor = open('lockfile.LOCK', 'w+')
    fcntl.lockf(locked_file_descriptor, fcntl.LOCK_EX)
    return locked_file_descriptor

def releaseLock(locked_file_descriptor):
    ''' release exclusive lock file access '''
    locked_file_descriptor.close()

lock_fd = acquireLock()
# ... do stuff with exclusive access to your file(s)
releaseLock(lock_fd)
You're using the LOCK_NB flag, which means that the call is non-blocking and will just return immediately on failure. That is presumably what is happening in the second process. The reason why it is still able to read the file is that portalocker ultimately uses flock(2) locks, and, as mentioned in the flock(2) man page:
flock(2) places advisory locks only; given suitable permissions on a file, a process is free to ignore the use of flock(2) and perform I/O on the file.
To fix it you could use the fcntl.flock function directly (portalocker is just a thin wrapper around it on Linux) and handle the exception that is raised when the lock cannot be acquired.
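A minimal sketch of that direct approach, keeping the non-blocking behaviour from the question; the exit code is a placeholder:

import fcntl
import sys

def open_and_lock(full_filename):
    file_handle = open(full_filename, 'r')
    try:
        # raises BlockingIOError (a subclass of OSError) if another process holds the lock
        fcntl.flock(file_handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return file_handle
    except OSError:
        file_handle.close()
        sys.exit(-1)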
Don't use cron for this. Linux has inotify, which can notify applications when a filesystem event occurs. There is a Python binding for inotify called pyinotify.
Thus, you don't need to lock the file -- you just need to react to IN_CLOSE_WRITE events (i.e. when a file opened for writing was closed). (You also won't need to spawn a new process every minute.)
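A minimal sketch of that pyinotify setup, assuming the files land in /data/incoming (a placeholder directory):

import pyinotify

WATCH_DIR = "/data/incoming"   # placeholder directory to watch

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # a file opened for writing has just been closed: safe to process it now
        print("ready:", event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch(WATCH_DIR, pyinotify.IN_CLOSE_WRITE)
notifier = pyinotify.Notifier(wm, Handler())
notifier.loop()   # blocks, dispatching events to the handler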
An alternative to using pyinotify is incron, which allows you to write an incrontab (very much in the same style as a crontab) to interact with the inotify system.
What about manually creating an old-fashioned .lock file next to the file you want to lock?
Just check if it's there: if not, create it; if it is, exit prematurely. After finishing, delete it.
I think fcntl.lockf is what you are looking for.
