Ensuring that my program is not doing a concurrent file write - python

I am writing a script that is required to perform safe writes to any given file, i.e. append to a file only if no other process is known to be writing into it. My understanding of the theory was that concurrent writes were prevented by write locks on the file system, but that seems not to be the case in practice.
Here's how I set up my test case:
I am redirecting the output of a ping command:
ping 127.0.0.1 > fileForSafeWrites.txt
On the other end, I have the following python code attempting to write to the file:
handle = open('fileForSafeWrites.txt', 'w')
handle.write("Probing for opportunity to write")
handle.close()
When both processes run concurrently, they both complete gracefully, and fileForSafeWrites.txt ends up containing binary content, instead of the first process holding a write lock that prevents the Python code from writing into the file.
How do I force my concurrent processes not to interfere with each other? I have read advice that successfully obtaining a write file handle is evidence that a file is safe to write to, such as in https://stackoverflow.com/a/3070749/1309045
Is this behavior specific to my operating system and Python version? I use Python 2.7 in an Ubuntu 12.04 environment.

Use the lockfile module as shown in Locking a file in Python
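For example, a minimal sketch using the lockfile package (depending on the version the class is exported as FileLock or LockFile); note this only helps if every writer cooperates by taking the same lock:

# Sketch only: both processes must agree to take the same lock for this to help.
from lockfile import FileLock

lock = FileLock('fileForSafeWrites.txt')  # creates fileForSafeWrites.txt.lock alongside the file
with lock:                                # blocks until the lock is acquired
    with open('fileForSafeWrites.txt', 'a') as handle:
        handle.write("Text written while holding the lock\n")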

Inspired by a solution described for concurrency checks, I came up with the following snippet of code. It works if one can reasonably predict the frequency at which the file in question is written. The approach relies on file-modification times.
import os
import time

def isFileBeingWrittenInto(filename,
                           writeFrequency=180, overheadTimePercentage=20):
    '''Find if a file was modified in the last x seconds given by writeFrequency.'''
    overhead = 1 + float(overheadTimePercentage) / 100   # Add some buffer time
    maxWriteFrequency = writeFrequency * overhead
    modifiedTimeStart = os.stat(filename).st_mtime       # Time file last modified
    time.sleep(writeFrequency)                           # Wait writeFrequency seconds
    modifiedTimeEnd = os.stat(filename).st_mtime         # File modification time again
    if 0 < (modifiedTimeEnd - modifiedTimeStart) <= maxWriteFrequency:
        return True
    else:
        return False

if not isFileBeingWrittenInto('fileForSafeWrites.txt'):
    handle = open('fileForSafeWrites.txt', 'a')
    handle.write("Text written safely when no one else is writing to the file")
    handle.close()
This does not do a true concurrency check, but for practical purposes it can be combined with a variety of other methods to write into a file safely without having to worry about garbled text. I hope it helps the next person searching for a way to do this.
EDIT UPDATE:
Upon further testing, I encountered a high-frequency write process that required the conditional logic to be modified from
if 0 < (modifiedTimeEnd - modifiedTimeStart) < maxWriteFrequency
to
if 0 < (modifiedTimeEnd - modifiedTimeStart) <= maxWriteFrequency
That makes a better answer, in theory and in practice.

Related

Why aren't `pathlib.Path("xxx/yyy").unlink / mkdir / rmdir` synchronous operations?

I am using Python's pathlib.Path module together with PyTorch's DistributedDataParallel.
When running multiple processes with DistributedDataParallel, I delete or create a file with Path("xxx/yyy").rmdir / .unlink / .mkdir only in process 0, i.e. local_rank 0.
Then something weird happens: the Path("xxx/yyy").rmdir / .unlink / .mkdir operations do not appear to be synchronous, meaning rank 0 keeps running without waiting for the file operation to finish. So if there is a file check after that operation, for example Path("xxx/yyy").parent.iterdir(), the results in rank 0 and rank 1 (or other ranks) are not equal: rank 0 finds files that the other ranks do not, or vice versa.
I have worked around the problem by adding a synchronization wait; the code is simple and as follows:
from time import sleep

def wait_to_success(fn, no=False):
    while True:
        sleep(0.01)
        if (no and not fn()) or (not no and fn()):
            break
And this function is used as follows:
remove_file_operation(filepath)
wait_to_success(filepath.exists, no=True)
or
create_file_operation(filepath)
wait_to_success(filepath.exists, no=False)
Note that the torch.distributed.barrier() function does not solve this problem.
So I am very confused: why does a pathlib.Path operation return before the underlying file operation has finished?

Why does implementing threading in Python cause I/O to become really slow?

I have a function that saves a large list of dictionaries into files of 100 items each. This worked flawlessly in a normal environment, but after changing nothing except running it via threading, I experienced rather significant slowdowns. To my knowledge, simply adding threading or multiprocessing shouldn't cause slowdowns in I/O. If I am missing something trivial, please let me know; if not, how can I make this run faster?
def savePlayerQueueToDisk(saveCopy):
    print("="*10)
    print("Application exit signal received. Caching loaded players...")
    import pickle
    i = 0
    for l in chunks(saveCopy, 100):
        with open(CACHED_PLAYERS_DIR / f"{str(i)}", 'wb') as filehandle:
            pickle.dump(saveCopy, filehandle)
        i += 1
        print(f"saved chunk {i}")
    import sys
    print("="*10)
    sys.exit()

def mainFunction():
    # Calls a main program, where on KeyboardException, it calls the savePlayerQueueToDisk function.
    t2 = threading.Thread(target=main, kwargs={'ignore_exceptions': False})
    t2.start()
Edit: After some more testing, by using multiprocessing instead of threading, using a second process for only one of the threads, and keeping mainFunction on the main thread, I experienced no slowdowns. Why is this the case?
Edit: After more testing and debugging, I found that the issue is not actually tied to multiprocessing or I/O bounding at all. There is a logic error on line 8 of my savePlayerQueueToDisk() function: it reads pickle.dump(saveCopy, filehandle) when it should be pickle.dump(l, filehandle). The I/O just got slower and slower the more I ran the function, because I was saving the entire list into over 100 files, and when all those files were loaded back in, I would save 100 copies of the data again into over 100 files. Loading and saving all of that, obviously, got out of hand.
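For reference, a sketch of the corrected save loop (chunks and CACHED_PLAYERS_DIR are the same helpers used in the question above):

import pickle

def savePlayerQueueToDisk(saveCopy):
    i = 0
    for l in chunks(saveCopy, 100):
        with open(CACHED_PLAYERS_DIR / f"{str(i)}", 'wb') as filehandle:
            pickle.dump(l, filehandle)  # dump the 100-item chunk l, not the whole list
        i += 1
        print(f"saved chunk {i}")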

Keeping Python Variables between Script Calls

I have a Python script that needs to load a large file from disk into a variable. This takes a while. The script will be called many times from another application (still unknown), with different options, and its stdout will be used. Is there any way to avoid reading the large file for each single call of the script?
I guess I could have one long-running script in the background that holds the variable. But then, how can I call the script with different options and read its stdout from another application?
Make it a (web) microservice: formalize the different CLI arguments as HTTP endpoints and send requests to it from the main application.
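A minimal sketch using only the standard library; the file name, port, and the 'option' query parameter are placeholders standing in for your real CLI options:

# Hypothetical sketch: load the large file once, then serve each "call" as an HTTP request.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

with open('large_file.dat', 'rb') as f:   # expensive load happens once at startup
    big_data = f.read()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        option = query.get('option', [''])[0]          # stands in for a CLI argument
        body = f"{option}: {len(big_data)} bytes loaded\n".encode()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == '__main__':
    HTTPServer(('127.0.0.1', 8000), Handler).serve_forever()

The calling application then replaces a call like "python script.py --option foo" with an HTTP GET to http://127.0.0.1:8000/?option=foo and reads the response body instead of stdout.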
(I misunderstood the original question, but the first answer I wrote addresses a different scenario and might still be useful to someone, so I am keeping it as-is and proposing a second solution.)
For a single machine, OS-provided pipes are the best solution for what you are looking for.
Essentially, you create a forever-running Python process which reads from a pipe, processes the commands entering the pipe, and then prints to stdout.
Reference: http://kblin.blogspot.com/2012/05/playing-with-posix-pipes-in-python.html
From above mentioned source
Workload
In order to simulate my workload, I came up with the following simple script called pipetest.py that takes an output file name and then writes some text into that file.
#!/usr/bin/env python
import sys

def main():
    pipename = sys.argv[1]
    with open(pipename, 'w') as p:
        p.write("Ceci n'est pas une pipe!\n")

if __name__ == "__main__":
    main()
The Code
In my test, this "file" will be a FIFO created by my wrapper code. The implementation of the wrapper code is as follows; I will go over the code in detail further down this post:
#!/usr/bin/env python
import tempfile
import os
from os import path
import shutil
import subprocess

class TemporaryPipe(object):
    def __init__(self, pipename="pipe"):
        self.pipename = pipename
        self.tempdir = None

    def __enter__(self):
        self.tempdir = tempfile.mkdtemp()
        pipe_path = path.join(self.tempdir, self.pipename)
        os.mkfifo(pipe_path)
        return pipe_path

    def __exit__(self, type, value, traceback):
        if self.tempdir is not None:
            shutil.rmtree(self.tempdir)

def call_helper():
    with TemporaryPipe() as p:
        script = "./pipetest.py"
        subprocess.Popen(script + " " + p, shell=True)
        with open(p, 'r') as r:
            text = r.read()
        return text.strip()

def main():
    call_helper()

if __name__ == "__main__":
    main()
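Applied to the question, a rough sketch of the forever-running process described above might look like this; the pipe path, file name, and per-command processing are placeholders:

#!/usr/bin/env python
# Sketch: load the large file once, then serve commands arriving on a named pipe.
import os

PIPE_PATH = '/tmp/bigfile_commands'            # assumed FIFO location
with open('large_file.dat') as f:              # expensive load happens only once
    big_data = f.read()

if not os.path.exists(PIPE_PATH):
    os.mkfifo(PIPE_PATH)

while True:
    with open(PIPE_PATH) as pipe:              # blocks until a client opens the pipe for writing
        for line in pipe:
            option = line.strip()              # one "call" per line written to the pipe
            # Replace this with the real per-option processing and output handling.
            print(option, len(big_data))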
Since you can already read the data into a variable, you might consider memory-mapping the file using mmap. This is safe if multiple processes are only reading it; supporting a writer would require a locking protocol.
Assuming you are not familiar with memory-mapped objects, I'll wager you use them every day: this is how the operating system loads and maintains executable files. Essentially your file becomes part of the paging system, although it does not have to be in any special format.
When you read a file into memory, it is unlikely to all be loaded into RAM; it will be paged out when "real" RAM becomes over-subscribed. Often this paging is a considerable overhead. A memory-mapped file is just your data "ready paged". There is no overhead in reading it into memory (virtual memory, that is); it is there as soon as you map it.
When you try to access the data, a page fault occurs and a subset (page) is loaded into RAM, all done by the operating system; the programmer is unaware of this.
While a file remains mapped it is connected to the paging system. Another process mapping the same file will access the same object, provided changes have not been made (see MAP_SHARED).
You need a daemon to keep the memory-mapped object current in the kernel, but other than creating the object linked to the physical file, it does not need to do anything else; it can sleep or wait on a shutdown signal.
Other processes open the file (use os.open()) and map the object.
See the examples in the documentation, here and also Giving access to shared memory after child processes have already started
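A minimal read-only sketch (the file name is a placeholder; this is safe as long as every process only reads):

import mmap

with open('large_file.dat', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # map the whole file read-only
    header = mm[:16]             # pages are faulted in lazily, only as they are touched
    offset = mm.find(b'needle')  # behaves like a bytes-like object for searching/slicing
    mm.close()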
You can store the processed values in a file, and then read the values from that file in another script.
>>> import pickle as p
>>> mystr="foobar"
>>> p.dump(mystr,open('/tmp/t.txt','wb'))
>>> mystr2=p.load(open('/tmp/t.txt','rb'))
>>> mystr2
'foobar'

Error message when using Python multiprocessing

I need to convert a very large .bam file to a .bed file. I found a solution using bedops's bam2bed parallel, but that parallel version supports SGE and GNU Parallel, while the two clusters I can access only support the Slurm and Torque schedulers, and since I do not know much about tcsh I cannot even modify the script to meet the requirements of Slurm and Torque.
Since I know a little Python, I plan to use Python's multiprocessing module to do this; however, the following code raises a weird message:
"Python quit unexpectedly while using the calignmentfile.so plug-in"
# The code here is just a test code, ignore its real meaning.
import multiprocessing as mp
import pysam

def work(read):
    return read.query
    # return read.split()[0]

if __name__ == '__main__':
    cpu = mp.cpu_count()
    pool = mp.Pool(cpu)
    sam = pysam.AlignmentFile('foo.bam', 'rb')
    read = sam.fetch(until_eof=True)
    # f = open('foo.text', 'rb')
    # results = pool.map(work, f, cpu)
    results = pool.map(work, read, cpu)
    print(results)
Does this message mean that the reads from pysam.AlignmentFile() do not support parallelism, or that Python doesn't support this kind of parallelism? I tested this piece of code with a regular text file (the commented-out lines) and it works well.
pysam indeed has some problems with concurrency. If you look at the source code for fetch, you'll see there's a problem with concurrency when iterating its return types.
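One common workaround (a sketch, not tested against your data) is to keep the AlignmentFile and its iterator in the parent process and hand the workers only plain, picklable values such as the query names:

import multiprocessing as mp
import pysam

def work(query_name):
    # Operates on a plain string, so nothing pysam-specific crosses process boundaries.
    return query_name

if __name__ == '__main__':
    sam = pysam.AlignmentFile('foo.bam', 'rb')
    names = (read.query_name for read in sam.fetch(until_eof=True))  # plain strings only
    with mp.Pool(mp.cpu_count()) as pool:
        results = pool.map(work, names, chunksize=10000)
    print(len(results))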

Python: Waiting for a file to reach a size limit in a CPU friendly manner

I am monitoring a file in Python and triggering an action when it reaches a certain size. Right now I am sleeping and polling but I'm sure there is a more elegant way to do this:
from os import stat
from time import sleep

POLLING_PERIOD = 10
SIZE_LIMIT = 1 * 1024 * 1024

while True:
    sleep(POLLING_PERIOD)
    if stat(file).st_size >= SIZE_LIMIT:
        break  # do something
The thing is, if I have a big POLLING_PERIOD, my file limit is not accurate if the file grows quickly, but if I have a small POLLING_PERIOD, I am wasting CPU.
How can I do this?
Thanks!
Linux Solution
You want to look at using pyinotify, which is a Python binding for inotify.
Here is an example of watching for close events; it isn't a big jump from there to listening for size changes.
#!/usr/bin/env python
import os, sys
from pyinotify import WatchManager, Notifier, ProcessEvent, EventsCodes

def Monitor(path):
    class PClose(ProcessEvent):
        def process_IN_CLOSE(self, event):
            f = event.name and os.path.join(event.path, event.name) or event.path
            print 'close event: ' + f

    wm = WatchManager()
    notifier = Notifier(wm, PClose())
    wm.add_watch(path, EventsCodes.IN_CLOSE_WRITE | EventsCodes.IN_CLOSE_NOWRITE)

    try:
        while 1:
            notifier.process_events()
            if notifier.check_events():
                notifier.read_events()
    except KeyboardInterrupt:
        notifier.stop()
        return

if __name__ == '__main__':
    try:
        path = sys.argv[1]
    except IndexError:
        print 'use: %s dir' % sys.argv[0]
    else:
        Monitor(path)
Windows Solution
pywin32 has bindings for file system notifications for the Windows file system.
What you want to look at is FindFirstChangeNotification, tying into that and listening for FILE_NOTIFY_CHANGE_SIZE. This example listens for file name changes; it isn't a big leap to listen for size changes instead.
import os
import win32file
import win32event
import win32con

path_to_watch = os.path.abspath(".")

#
# FindFirstChangeNotification sets up a handle for watching
# file changes. The first parameter is the path to be
# watched; the second is a boolean indicating whether the
# directories underneath the one specified are to be watched;
# the third is a list of flags as to what kind of changes to
# watch for. We're just looking at file additions / deletions.
#
change_handle = win32file.FindFirstChangeNotification(
    path_to_watch,
    0,
    win32con.FILE_NOTIFY_CHANGE_FILE_NAME
)

#
# Loop forever, listing any file changes. The WaitFor... will
# time out every half a second allowing for keyboard interrupts
# to terminate the loop.
#
try:
    old_path_contents = dict([(f, None) for f in os.listdir(path_to_watch)])
    while 1:
        result = win32event.WaitForSingleObject(change_handle, 500)

        #
        # If the WaitFor... returned because of a notification (as
        # opposed to timing out or some error) then look for the
        # changes in the directory contents.
        #
        if result == win32con.WAIT_OBJECT_0:
            new_path_contents = dict([(f, None) for f in os.listdir(path_to_watch)])
            added = [f for f in new_path_contents if not f in old_path_contents]
            deleted = [f for f in old_path_contents if not f in new_path_contents]
            if added: print "Added: ", ", ".join(added)
            if deleted: print "Deleted: ", ", ".join(deleted)
            old_path_contents = new_path_contents
            win32file.FindNextChangeNotification(change_handle)
finally:
    win32file.FindCloseChangeNotification(change_handle)
OSX Solution
There are equivalent hooks into the OS X file system via PyKQueue as well; if you can understand these examples, you can Google for the OS X solution too.
Here is a good article about Cross Platform File System Monitoring.
You're correct: "Polling is Evil". The more often you poll, the more CPU you waste if nothing has happened. If you poll less frequently, you delay handling the event when it does occur.
The only alternative, however, is to "block" until you receive some kind of "signal".
If you're on Linux, you can use "inotify":
http://linux.die.net/man/7/inotify
You're right that polling is generally a sub-optimal solution compared to other ways of accomplishing something. However, sometimes it's the simplest solution, especially if you are trying to write something that will work on both Windows and Linux/UNIX.
Fortunately, modern hardware is quite fast. On my machine, I was able to run your loop, polling ten times a second, without any visible impact on the CPU usage in the Windows Task Manager. 100 times a second did produce some tiny humps on the usage graph and the CPU usage would occasionally reach 1%.
Ten seconds between polling, as in your example, is utterly trivial in terms of CPU usage.
You can also give your script a lower priority to make sure it doesn't affect the performance of other tasks on the machine, if that's important.
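On Unix-like systems, lowering the priority is a one-liner; Windows would need something like pywin32's SetPriorityClass instead:

import os

os.nice(10)  # raise the niceness, i.e. lower this process's scheduling priority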
A bit of a kludge, but given the CPU-load vs. size-accuracy conundrum, would a variable polling period be appropriate?
In a nutshell, the polling period would decrease as the file size approaches the limit and/or as the current growth rate rises above a certain level. In this fashion, at least most of the time, the CPU cost of the polling would be somewhat reduced.
For example, until the current file size reaches a certain threshold (or a tier of thresholds), you'd allow for a longer period.
Another heuristic: if the file size didn't change since last time, indicating no current activity on the file, you could further boost the period by a fraction of a second. (Or comparing the file's timestamp with the current time might do?)
The specific parameters of the function that determines the polling period will depend on the particulars of how the file grows in general; a sketch of the approach follows the list of questions below.
What is the absolute minimal amount of time for the file to grow to say 10% of the size limit?
What is the average amount of data written to these files in 24 hours?
Is there much difference in the file growth rate at different times of day and night ?
What is the frequency at which the file size changes?
Is it a "continuous" logging-like activity, whereby on average several dozen of writes take place each second, or are the writes taking place less frequently but putting much more data at once when they do?
etc.
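As a concrete illustration, here is a rough sketch of such a variable polling period; all the constants are guesses that would need tuning based on the answers to the questions above:

import os
import time

SIZE_LIMIT = 1 * 1024 * 1024
MIN_PERIOD, MAX_PERIOD = 0.5, 10.0   # seconds; tune to the file's growth pattern

def wait_for_size(path, limit=SIZE_LIMIT):
    last_size = 0
    while True:
        size = os.stat(path).st_size
        if size >= limit:
            return size
        # Poll faster as the file approaches the limit...
        period = MIN_PERIOD + (MAX_PERIOD - MIN_PERIOD) * (1.0 - float(size) / limit)
        # ...and back off when the file is not growing at all.
        if size == last_size:
            period = min(period * 2, MAX_PERIOD)
        last_size = size
        time.sleep(period)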
Alternatively, at the risk of introducing OS-specific logic into your program, you can look into file/directory change notification systems. Both the Win32 API and Linux offer their own flavor of this kind of feature. I'm unaware of the specific implementations the OSes use, and they may well rely internally on some kind of polling similar to yours; still, the OS has various hooks into the file system that may allow for a much less intrusive implementation.
