Concurrent access to one file by several unrelated processes on macOS - python

I need to get several processes to communicate with each other on a macOS system. These processes will be spawned several times per day at different times, and I cannot predict when they will be up at the same time (if ever). These programs are written in Python or Swift.
How can I safely allow these programs to all write to the same file?
I have explored a few different options:
I thought of using sqlite3; however, I couldn't find an answer in the documentation on whether it is safe to write to it concurrently from several processes. This question is old and not very definitive, and I would ideally like a more authoritative answer.
I thought of using multiprocessing, as it supports locks. However, as far as I could see in the documentation, you need a meta-process that spawns the children and stays up for the duration of the longest child process. I am fine having a meta-spawner process, but it feels wasteful to have a meta-process basically staying up all day long just to resolve conflicting access.
Along the same lines, I thought of having a process that stays up all day long, receives messages from all other processes, and is solely responsible for writing to the file. It feels a little wasteful; how worried should I be about the resource cost of having a program up all day doing very little? Are memory footprint and CPU usage (as shown in Activity Monitor) the only things to worry about, or could there be other significant costs, e.g. context switching?
I have come across flock on Linux, which seems to be an OS-provided locking mechanism for file access. This seems like a good solution, but it does not seem to exist on macOS?
Any idea for solving this requirement in a robust manner (so that I don't have to debug it every other day - I know concurrency can be a pain) is most welcome!

Since you are in control of the source code of all such processes, you can use flock. It places an advisory lock on the file, so another writer is blocked only if it also accesses the file the same way. This is fine for you if only your own processes will ever need to write to the shared file.
I've tested flock on Big Sur; it is still implemented and works fine.
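As a minimal sketch (the file path and record text are placeholders), Python's fcntl module, which is also available on macOS, exposes flock like this:

import fcntl

# Append one record under an exclusive advisory lock.
# Any other process doing the same flock() call blocks until we release it.
with open("/tmp/shared.log", "a") as f:      # placeholder path
    fcntl.flock(f, fcntl.LOCK_EX)            # blocks until the lock is available
    try:
        f.write("one record\n")
        f.flush()
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)

The lock is also released automatically when the file is closed, so the explicit unlock is mostly there for clarity.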
You can also do it in any other common manner: create a temporary .lock file in a known location (this is what git does) and remove it after the current writer is done with the main file; use semaphores; etc.
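If you go the .lock-file route instead, a rough sketch could look like the following (the lock path, timeout, and polling interval are arbitrary placeholders):

import os
import time

LOCK_PATH = "/tmp/myapp.lock"   # placeholder: agreed-upon location for the lock file

def acquire_lock(timeout=30.0):
    # O_CREAT | O_EXCL makes creation atomic: exactly one process can win.
    deadline = time.time() + timeout
    while True:
        try:
            fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.time() > deadline:
                raise TimeoutError("could not acquire %s" % LOCK_PATH)
            time.sleep(0.1)

def release_lock():
    os.remove(LOCK_PATH)

One caveat of this scheme is that a crashed writer leaves a stale lock file behind, so you may want to write the owner's PID into the file and clean up stale locks.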

Related

Problems Using the Berkeley DB Transactional Processing

I'm writing a set of programs that have to operate on a common database, possibly concurrently. For the sake of simplicity (for the user), I didn't want to require the setup of a database server. Therefore I settled on Berkeley DB, where one can just fire up a program and let it create the DB if it doesn't exist.
In order to let programs work concurrently on a database, one has to use the transactional features present in the 5.x release (here I use python3-bsddb3 6.1.0-1+b2 with libdb5.3 5.3.28-12): the documentation clearly says that it can be done. However, I quickly ran into trouble, even with some basic tasks:
Program 1 initializes records in a table
Program 2 has to scan the records previously added by program 1 and updates them with additional data.
To speed things up, there is an index for said additional data. When program 1 creates the records, the additional data isn't present, so the pointer to that record is added to the index under an empty key. Program 2 can then just quickly seek to the not-yet-updated records.
Even when not run concurrently, the record updating program crashes after a few updates. First it complained about insufficient space in the mutex area. I had to resolve this with an obscure DB_CONFIG file and then run db_recover.
Next, again after a few updates it complained 'Cannot allocate memory -- BDB3017 unable to allocate space from the buffer cache'. db_recover and relaunching the program did the trick, only for it to crash again with the same error a few records later.
I'm not even mentioning concurrent use: when one of the programs is launched while the other is running, they almost instantly crash with deadlock, panic about corrupted segments and ask to run recover. I made many changes, so I went through a wide spectrum of errors which often yield irrelevant matches when searched for. I even rewrote the db calls to use lmdb, which in fact works quite well and is really quick, which tends to indicate my program logic isn't at fault. Unfortunately it seems the datafile produced by lmdb is quite sparse, and quickly grew to unacceptable sizes.
From what I said, it seems that maybe some resources are being leaked somewhere. I'm hesitant to rewrite all this directly in C to check if the problem can come from the Python binding.
I can and will update the question with code, but for the moment it is long enough. I'm looking for people who have used the transactional features of BDB for similar purposes and could point me to some of the gotchas.
Thanks
RPM (see http://rpm5.org) uses Berkeley DB in transactional mode. There's a fair number of gotchas, depending on what you are attempting.
You have already found DB_CONFIG: you MUST configure the sizes for mutexes and locks; the defaults are invariably too small.
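For illustration only, a DB_CONFIG in the environment home directory raising those limits could look roughly like this; the numbers are placeholders you will need to size for your own workload:

set_cachesize 0 268435456 1
set_lk_max_locks 100000
set_lk_max_lockers 100000
set_lk_max_objects 100000
mutex_set_max 1000000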
Needing to run db_recover while developing is quite painful too. The best fix (imho) is to automate recovery while opening by checking the return code for DB_RUNRECOVERY, and then reopening the dbenv with DB_RECOVER.
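With bsddb3, that automated recovery could look roughly like the sketch below; the flag set and the environment home path are assumptions, not a drop-in implementation:

from bsddb3 import db

ENV_FLAGS = (db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_INIT_LOCK |
             db.DB_INIT_LOG | db.DB_INIT_TXN)

def open_env(home):
    # Try a normal open first; if BDB reports DB_RUNRECOVERY,
    # reopen the environment with DB_RECOVER to run recovery automatically.
    env = db.DBEnv()
    try:
        env.open(home, ENV_FLAGS)
    except db.DBRunRecoveryError:
        env = db.DBEnv()
        env.open(home, ENV_FLAGS | db.DB_RECOVER)
    return env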
Deadlocks are usually design/coding errors: run db_stat -CA to see what is deadlocked (or what locks are held) and adjust your program. "Works with lmdb" isn't sufficient to claim working code ;-)
Leaks can be seen with valgrind and/or by compiling BDB with -fsanitize=address. Note that valgrind will report false uninitializations unless you use overrides and/or compile BDB to initialize.

file.read() multiprocessing and the GIL

I've read that certain Python functions implemented in C, which I assume includes file.read(), can release the GIL while they're working and then get it back on completion and by doing so make use of multiple cores if they're available.
I'm using multiprocessing to parallelize some code, and currently I've got three processes: the parent, one child that reads data from a file, and one child that generates a checksum from the data passed to it by the first child.
Now if I'm understanding this right, it seems that creating a new process to read the file, as I'm currently doing, is unnecessary and I should just call it in the main process. The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?
So given my function to read and pipe the data to be processed:
def read(file_path, pipe_out):
    # block_size is assumed to be defined elsewhere (e.g. a module-level constant)
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            pipe_out.send(block)
    pipe_out.close()
I reckon that this will definitely make use of multiple cores, but also introduces some overhead:
multiprocessing.Process(target=read, args=args).start()
But now I'm wondering if just doing this will also use multiple cores, minus the overhead:
read(*args)
Any insights anybody has as to which one would be faster and for what reason would be much appreciated!
Okay, as came out in the comments, the actual question is:
Does (C)Python create threads on its own, and if so, how can I make use of that?
Short answer: No.
But the reason why these C functions are nevertheless interesting for Python programmers is the following. By default, no two snippets of Python code running in the same interpreter can execute in parallel; this is due to the evil called the Global Interpreter Lock, aka the GIL. The GIL is held whenever the interpreter is executing Python code, which implies the statement above: no two pieces of Python code can run in parallel in the same interpreter.
Nevertheless, you can still make use of multithreading in python, namely when you're doing a lot of I/O or make a lot of use of external libraries like numpy, scipy, lxml and so on, which all know about the issue and release the GIL whenever they can (i.e. whenever they do not need to interact with the python interpreter).
I hope that cleared up the issue a bit.
I think this is the main part of your question:
The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?
I assume your goal is to read and process the file as fast as possible. File reading is in any case I/O bound and not CPU bound. You cannot process data faster than you are able to read it. So file I/O clearly limits the performance of your software. You cannot increase the read data rate by using concurrent threads/processes for file reading. Also 'low level' CPython is not doing this. As long as you read the file in one process or thread (even in case of CPython with its GIL a thread is fine), you will get as much data per time as you can get from the storage device. It is also fine if you do the file reading in the main thread as long as there are no other blocking calls that would actually slow down the file reading.
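To make that concrete, here is a rough sketch of reading in the main process and handing only the CPU-bound checksum work to a child; the file path, block size and hash choice are placeholders, not the asker's actual code:

import hashlib
import multiprocessing

def checksum_worker(pipe_in):
    # Fold incoming blocks into one digest; a None block is the end-of-data sentinel.
    digest = hashlib.sha256()
    while True:
        block = pipe_in.recv()
        if block is None:
            break
        digest.update(block)
    print(digest.hexdigest())

if __name__ == "__main__":
    parent_end, child_end = multiprocessing.Pipe()
    worker = multiprocessing.Process(target=checksum_worker, args=(child_end,))
    worker.start()
    with open("data.bin", "rb") as f:        # placeholder input file
        while True:
            block = f.read(1 << 20)          # 1 MiB blocks, arbitrary
            if not block:
                break
            parent_end.send(block)
    parent_end.send(None)
    worker.join()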

Multithreading / Multiprocessing in Python

I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to be able to create a process/thread that searches each directory concurrently; it would greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords.)
I have done various experiments using the multiprocessing API and condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but a: I'm not sure if this is the appropriate route to go, and b: my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googling, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.
What to do depends on what's slowing down the process.
If you're reading on a single disk, and disk I/O is slowing you down, multiple threads/processes will probably just slow you down further, as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.
If you're reading on a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.
If you're reading from multiple disks, and processing is slowing you down, then you'll need to go with multiprocessing.
Multiprocessing is easier to understand/use than multithreading (IMO). For the reasoning behind this, I suggest reading this section of TAOUP. Basically, everything a thread does, a process can do too, but with threads the programmer has to do everything that the OS would otherwise handle for processes. Sharing resources (memory/files/CPU cycles)? Learn locking/mutexes/semaphores and so on for threads. The OS does this for you if you use processes.
I would suggest building 4+ processes: one to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together.
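A hedged sketch of that layout is below; the names, queue size and keyword list are made up, and the naive `in` test stands in for the real Boyer-Moore-Horspool search:

import multiprocessing

def reader(paths, work_queue, n_workers):
    # Single reader: pull file contents off the disk sequentially to avoid seek thrashing.
    for path in paths:
        with open(path, "rb") as f:
            work_queue.put((path, f.read()))
    for _ in range(n_workers):
        work_queue.put(None)  # one "no more work" sentinel per worker

def searcher(work_queue, keywords):
    # The CPU-bound scan runs in its own process, so the GIL does not serialize it.
    while True:
        item = work_queue.get()
        if item is None:
            break
        path, data = item
        for kw in keywords:
            if kw in data:
                print(path, kw)

if __name__ == "__main__":
    paths = []               # placeholder: list of files to scan
    keywords = [b"example"]  # placeholder: keyword list
    queue = multiprocessing.Queue(maxsize=8)  # bounded so the reader can't outrun the workers
    workers = [multiprocessing.Process(target=searcher, args=(queue, keywords))
               for _ in range(3)]
    for w in workers:
        w.start()
    reader(paths, queue, len(workers))
    for w in workers:
        w.join()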
This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.
Also on the same site, check out the coroutines section.

same python interpreter instance running multiple scripts simultaneously?

6-7 years ago I saw an initiative for a way to run Python in a tight-resource environment by running the interpreter only once, while allowing several scripts to use it at the same time.
The idea was both to save the interpreter startup overhead and to save RAM.
Does something like that exist?
This question, Python: Execute multiple Scripts simultaneously from same Interpreter, doesn't address concurrency: the answers there were about sequential running, but I need the scripts to run simultaneously :)
ideas?
Yes and no. Python itself uses a Global Interpreter Lock (GIL), which you can read a lot about, if you care to. To make a long story short, however, it ensures the interpreter is basically single-threaded. You can create (and run) more than one thread in your Python program, but when/if they use the Python interpreter, only one can do so at a time. If, however, you have threads running mostly code from something like SciPy or NumPy (which is native code that doesn't get interpreted) then you can run several concurrently.
Most operating systems, however, have a Copy On Write mechanism for process memory pages, which means that (as long as the code isn't modified) most of the code used by the interpreter will be shared without any extra work on your part (or the interpreter's) at all. IOW, when you run two or more copies of the interpreter, the second and subsequent will share most of the memory (at least for executable code) with the first, so resource usage will not rise (anywhere close to) linearly as you run more instances. Startup time will also be substantially reduced -- the OS has to create a new page table mapping the memory pages to the new process, but does not need to reread those pages from disk or anything like that.
Python supports threading via the thread and threading modules (one is low-level, the other high-level; in Python 3 the low-level one is called _thread).
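For completeness, here is a tiny sketch using the high-level threading module (the file list is a placeholder); the threads overlap only while they are blocked in I/O, since the GIL serializes the Python-level work:

import threading

def worker(path):
    # The blocking read releases the GIL, so several of these can wait on I/O at once.
    with open(path, "rb") as f:
        data = f.read()
    print(path, len(data))

paths = []   # placeholder: list of files to read
threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
for t in threads:
    t.start()
for t in threads:
    t.join()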

Python - How to check if a file is used by another application?

I want to open a file which is periodically written to by another application. This application cannot be modified. I'd therefore like to only open the file when I know it is not being written to by another application.
Is there a pythonic way to do this? Otherwise, how do I achieve this in Unix and Windows?
edit: I'll try and clarify. Is there a way to check if the current file has been opened by another application?
I'd like to start with this question. Whether those other applications read or write is irrelevant for now.
I realize it is probably OS dependent, so this may not really be python related right now.
Will your Python script want to open the file for writing or for reading? Is the legacy application opening and closing the file between writes, or does it keep it open?
It is extremely important that we understand what the legacy application is doing, and what your python script is attempting to achieve.
This area of functionality is highly OS-dependent, and the fact that you have no control over the legacy application only makes things harder unfortunately. Whether there is a pythonic or non-pythonic way of doing this will probably be the least of your concerns - the hard question will be whether what you are trying to achieve will be possible at all.
UPDATE
OK, so knowing (from your comment) that:
the legacy application is opening and closing the file every X minutes, but I do not want to assume that at t = t_0 + n*X + eps it already closed the file.
then the problem's parameters are changed. It can actually be done in an OS-independent way given a few assumptions, or as a combination of OS-dependent and OS-independent techniques. :)
OS-independent way: if it is safe to assume that the legacy application keeps the file open for at most some known quantity of time, say T seconds (e.g. opens the file, performs one write, then closes the file), and re-opens it more or less every X seconds, where X is larger than 2*T, then:
stat the file
subtract file's modification time from now(), yielding D
if T <= D < X then open the file and do what you need with it
This may be safe enough for your application. Safety increases as T/X decreases. On *nix you may have to double-check /etc/ntp.conf for proper time-stepping vs. slew configuration (see tinker). For Windows, see MSDN.
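In Python, the stat-based check might look like this; T, X and the path are placeholders for your own values:

import os
import time

T = 60        # placeholder: legacy app holds the file open for at most T seconds
X = 600       # placeholder: legacy app re-opens the file roughly every X seconds

def probably_safe_to_open(path):
    # D = seconds since the file was last modified
    d = time.time() - os.stat(path).st_mtime
    return T <= d < X

path = "/path/to/the/file"           # placeholder
if probably_safe_to_open(path):
    with open(path, "rb") as f:
        data = f.read()              # do what you need with it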
Windows: in addition (or in-lieu) of the OS-independent method above, you may attempt to use either:
sharing (locking): this assumes that the legacy program also opens the file in shared mode (usually the default in Windows apps); moreover, if your application acquires the lock just as the legacy application is attempting the same (race condition), the legacy application will fail.
this is extremely intrusive and error prone. Unless both the new application and the legacy application need synchronized access for writing to the same file and you are willing to handle the possibility of the legacy application being denied opening of the file, do not use this method.
attempting to find out what files are open in the legacy application, using the same techniques as ProcessExplorer (the equivalent of *nix's lsof)
you are even more vulnerable to race conditions than the OS-independent technique
Linux/etc.: in addition (or in-lieu) of the OS-independent method above, you may attempt to use the same technique as lsof or, on some systems, simply check which file the symbolic link /proc/<pid>/fd/<fdes> points to
you are even more vulnerable to race conditions than the OS-independent technique
it is highly unlikely that the legacy application uses locking, but if it does, locking is not a real option unless the legacy application can handle a locked file gracefully (by blocking, not by failing), and unless your own application can guarantee that the file will not remain locked, blocking the legacy application for extended periods of time.
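On Linux, the /proc-based check could be sketched as follows; it is nowhere near lsof's completeness and is still racy, as noted above:

import os

def pids_with_file_open(target):
    # Walk /proc/<pid>/fd and collect the PIDs whose descriptors point at target.
    target = os.path.realpath(target)
    holders = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = os.path.join("/proc", pid, "fd")
        try:
            for fd in os.listdir(fd_dir):
                if os.path.realpath(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(pid))
                    break
        except OSError:
            # the process exited in the meantime, or we lack permission
            continue
    return holders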
UPDATE 2
If you favour the "check whether the legacy application has the file open" approach (intrusive and prone to race conditions), then you can solve said race condition by:
checking whether the legacy application has the file open (a la lsof or ProcessExplorer)
suspending the legacy application process
repeating the check in step 1 to confirm that the legacy application did not open the file between steps 1 and 2; delay and restart at step 1 if so, otherwise proceed to step 4
doing your business on the file -- ideally simply renaming it for subsequent, independent processing in order to keep the legacy application suspended for a minimal amount of time
resuming the legacy application process
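A hedged sketch of steps 2-5 on a *nix system, reusing the hypothetical pids_with_file_open() helper from above (the PID and paths are placeholders):

import os
import signal

legacy_pid = 1234                              # placeholder: PID of the legacy application
path = "/path/to/the/file"                     # placeholder

os.kill(legacy_pid, signal.SIGSTOP)            # step 2: suspend the legacy process
try:
    if legacy_pid not in pids_with_file_open(path):    # step 3: re-check
        os.rename(path, path + ".processing")          # step 4: grab the file by renaming it
finally:
    os.kill(legacy_pid, signal.SIGCONT)        # step 5: resume the legacy process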
Unix does not enforce file locking by default (locks are advisory). The best suggestion I have for a Unix environment would be to look at the sources of the lsof command. It has deep knowledge of which processes have which files open. You could use that as the basis of your solution. Here are the Ubuntu sources for lsof.
One thing I've done is have Python very temporarily rename the file. If we're able to rename it, then no other process is using it. I only tested this on Windows.
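A hedged sketch of that rename trick; on Windows the rename typically fails with a sharing violation while another process has the file open (which is what the trick relies on), whereas on Unix it would usually succeed regardless:

import os

def file_seems_free(path):
    # Try to rename the file and immediately rename it back.
    tmp = path + ".renametest"
    try:
        os.rename(path, tmp)
        os.rename(tmp, path)
        return True
    except OSError:
        return False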
