The use case is as follows: I have a script that runs a series of non-Python executables to reduce (pulsar) data. Right now I use subprocess.Popen(..., shell=True) and then subprocess's communicate() to capture the standard output and standard error from the non-Python executables, and I log the captured output using the Python logging module.
The problem is that only one of the eight available cores gets used most of the time. I want to spawn multiple processes, each handling a part of the data set in parallel, and I want to keep track of their progress. It is a script/program to analyze data from a low-frequency radio telescope (LOFAR). The easier it is to install, manage and test, the better.
I was about to build code to manage all this, but I'm sure it must already exist in some easy library form.
The subprocess module can start multiple processes for you just fine, and keep track of them. The problem, though, is reading the output from each process without blocking any of the others. Depending on the platform there are several ways of doing this: using the select module to see which process has data to be read, setting the output pipes non-blocking using the fcntl module, or using a thread to read each process's data (which is what subprocess.Popen.communicate() itself does on Windows, since the other two options aren't available there). In each case the devil is in the details, though.
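For illustration, here is a minimal sketch of the select-based variant (POSIX only); the command lines ("some_reduction_step", the chunk names) and the logging setup are placeholders, not anything from your pipeline:

import logging
import os
import select
import subprocess

logging.basicConfig(level=logging.INFO)

commands = [["some_reduction_step", "chunk0"],
            ["some_reduction_step", "chunk1"]]
procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
         for cmd in commands]

# Map each stdout pipe back to its process and poll until they all hit EOF.
streams = {p.stdout: p for p in procs}
while streams:
    readable, _, _ = select.select(list(streams), [], [])
    for stream in readable:
        data = os.read(stream.fileno(), 4096)
        if data:
            logging.info("pid %d: %s", streams[stream].pid,
                         data.decode(errors="replace"))
        else:
            del streams[stream]   # EOF: this process closed its output

for p in procs:
    p.wait()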
Something that handles all this for you is Twisted, which can spawn as many processes as you want and call your callbacks with the data they produce (as well as on other events, such as process exit).
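As a rough sketch rather than a drop-in solution, one ProcessProtocol per external job could look like the following; the executable name and chunk arguments are placeholders:

from twisted.internet import protocol, reactor

class LogProtocol(protocol.ProcessProtocol):
    running = 0   # crude counter so the reactor stops when all jobs are done

    def __init__(self, name):
        self.name = name

    def connectionMade(self):
        LogProtocol.running += 1

    def outReceived(self, data):
        print("%s stdout: %r" % (self.name, data))

    def errReceived(self, data):
        print("%s stderr: %r" % (self.name, data))

    def processEnded(self, reason):
        print("%s ended: %s" % (self.name, reason.value))
        LogProtocol.running -= 1
        if LogProtocol.running == 0:
            reactor.stop()

for i in range(4):
    reactor.spawnProcess(LogProtocol("job%d" % i), "some_reduction_step",
                         ["some_reduction_step", "chunk%d" % i])
reactor.run()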
Maybe Celery will serve your needs.
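If you go that route, a minimal sketch might look like the following; the broker URL, module name and command arguments are assumptions, and you would run workers with "celery -A tasks worker --concurrency=8" and enqueue jobs with reduce_chunk.delay([...]):

# tasks.py -- assumes a broker (Redis here) is running at the given URL,
# and "some_reduction_step" is a placeholder executable name.
import subprocess
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def reduce_chunk(cmd):
    # Run one external reduction command and hand back whatever it printed.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return {"returncode": proc.returncode,
            "stdout": out.decode(errors="replace"),
            "stderr": err.decode(errors="replace")}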
If I understand correctly what you are doing, I might suggest a slightly different approach. Try establishing a single unit of work as a function and then layer on the parallel processing after that. For example:
1. Wrap the current functionality (calling subprocess and capturing output) into a single function. Have the function create a result object that can be returned; alternatively, the function could write out to files as you see fit.
2. Create an iterable (list, etc.) that contains an input for each chunk of data for step 1.
3. Create a multiprocessing Pool and then capitalize on its map() functionality to execute your function from step 1 for each of the items in step 2. See the Python multiprocessing docs for details.
You could also use a worker/Queue model. The key, I think, is to encapsulate the current subprocess/output capture stuff into a function that does the work for a single chunk of data (whatever that is). Layering on the parallel processing piece is then quite straightforward using any of several techniques, only a couple of which were described here.
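Putting the three steps together, a rough sketch could look like the following; the executable name and chunk files are placeholders, and the argument-list form of Popen stands in for your shell=True call:

import logging
import subprocess
from multiprocessing import Pool

def process_chunk(chunk_path):
    # Step 1: the single unit of work -- run the external tool on one chunk.
    proc = subprocess.Popen(["some_reduction_step", chunk_path],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return chunk_path, proc.returncode, out, err

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    chunks = ["chunk0.dat", "chunk1.dat", "chunk2.dat"]   # step 2: the inputs
    pool = Pool(processes=8)                              # step 3: map over them
    for chunk, code, out, err in pool.imap_unordered(process_chunk, chunks):
        logging.info("%s finished with exit code %d", chunk, code)
    pool.close()
    pool.join()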
Related
I need to get several processes to communicate with each others on a macOS system. These processes will be spawned several times per day at different times, and I cannot predict when they will be up at the same time (if ever). These programs are in python or swift.
How can I safely allow these programs to all write to the same file?
I have explored a few different options:
I thought of using sqlite3, however I couldn't find an answer in the documentation on whether it is safe to write concurrently across processes. This question is old and not very definitive, and I would ideally like a more authoritative answer.
I thought of using multiprocessing, as it supports locks. However, as far as I could see in the documentation, you need a meta-process that spawns the children and stays up for the duration of the longest child process. I am fine with having a meta-spawner process, but it feels wasteful to have a meta-process basically staying up all day long just to resolve conflicting access.
Along the same lines, I thought of having a process that stays up all day long, receives messages from all other processes, and is solely responsible for writing to the file. It feels a little wasteful; how worried should I be about the resource cost of having a program up all day that does very little? Are memory footprint and CPU usage (as shown in Activity Monitor) the only things to worry about, or could there be other significant costs, e.g. context switching?
I have come across flock on Linux, which seems to be a file-locking mechanism provided by the OS. This seems like a good solution, but it does not seem to exist on macOS?
Any idea for solving this requirement in a robust manner (so that I don't have to debug every other day; I know concurrency can be a pain) is most welcome!
As long as you are in control of the source code of all such processes, you can use flock. It puts an advisory lock on the file, so another writer is only blocked if it also accesses the file the same way. That is fine for you if only your own processes will ever write to the shared file.
I've tested flock on Big Sur; it is still implemented and works fine.
You can also do it in any other common manner: create a temporary .lock file in a known location (this is what git does) and remove it once the current writer is done with the main file; use semaphores; etc.
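For reference, a minimal sketch of the flock approach from Python; fcntl.flock is available on macOS as well as Linux, and the file path below is just a placeholder:

import fcntl

# Every cooperating process must open the same file and take the same
# flock() for the advisory lock to mean anything.
SHARED_FILE = "/Users/Shared/app.log"

def append_record(line):
    with open(SHARED_FILE, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)       # blocks until we hold the lock
        try:
            f.write(line + "\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)   # closing the file also releases it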
I have a main program X which is getting feed from my webcam.
I want to configure X in real-time while it's executing.
I understand that one of the common ways of doing this is using IPC such as named pipes/Unix sockets/Internet sockets, etc. But I want to avoid requiring each caller to separately open a socket/named pipe and communicate each time.
In short, I want a helper program by the name Y, which I can use in the following manner:
Y set-fps=15
Y show-frame=true
Y get-fps (should return 15)
I would want to place this helper program Y in /usr/bin (or rather in one of the $PATH directories) so that it's executable from the command line.
What are my options for obtaining this functionality? My constraints are as follows:
(i) Program X could be either C++/Python.
(ii) Multiple clients could call Y simultaneously.
I guess such systems are common on Linux, where you have programs like nmcli interacting with services like NetworkManager?
One way I can see to do this is the following: since your program X takes frames from the webcam, there is almost certainly a main loop (or a recurring state machine).
You could allocate a block of memory shared between the processes (in practice a SysV/POSIX shared-memory segment, which is managed by the kernel), with one register per parameter.
Then, on each iteration of the loop, your X program reads the current parameter values from that shared block. On the other side, each client that wants to modify the parameters maps the same shared block and writes the required values there.
You then have to protect the shared block with a semaphore for any read or write operation (a mutex is fine, though you may want a reader/writer implementation).
Of course, this is not optimal in performance compared to protecting each parameter individually, but this way you only have to handle one semaphore, and a simple mutex does the job. It is certainly also less elegant than communicating over pipes or sockets, but with those you would still have to protect the reads and writes of your parameters...
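To make the idea concrete in Python terms, here is a rough sketch that swaps the raw kernel registers for a named POSIX shared-memory segment (multiprocessing.shared_memory, Python 3.8+) and uses a lock file in place of the semaphore. The segment name, lock path and parameter layout ("fps" and "show_frame" packed as two integers) are all assumptions for illustration:

import fcntl
import struct
from multiprocessing import shared_memory

SEGMENT = "webcam_params"                 # assumed segment name
LOCKFILE = "/tmp/webcam_params.lock"      # assumed lock-file path
LAYOUT = "ii"                             # fps, show_frame
SIZE = struct.calcsize(LAYOUT)

def write_params(fps, show_frame):
    shm = shared_memory.SharedMemory(name=SEGMENT)
    with open(LOCKFILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        shm.buf[:SIZE] = struct.pack(LAYOUT, fps, int(show_frame))
        fcntl.flock(lock, fcntl.LOCK_UN)
    shm.close()

def read_params():
    shm = shared_memory.SharedMemory(name=SEGMENT)
    with open(LOCKFILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_SH)
        fps, show_frame = struct.unpack(LAYOUT, bytes(shm.buf[:SIZE]))
        fcntl.flock(lock, fcntl.LOCK_UN)
    shm.close()
    return fps, bool(show_frame)

# Program X would create the segment once at startup:
#   shared_memory.SharedMemory(name=SEGMENT, create=True, size=SIZE)
# and call read_params() on each loop iteration; the helper Y maps the same
# segment by name and calls write_params().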
Premise:
I've created a main window. One of the drop-down menus has a 'ProcessData' item. When it's selected, I create a QProgressDialog. I then do a lot of processing in the main loop and periodically update the label and percentage in the QProgressDialog.
My processing looks like: read a large amount of data from a file (numpy memmapped array), do some signal processing, write the output to a common h5py file. I iterate over the available input files, and all of the output is stored in a common h5py hdf5 file. The entire process takes about two minutes per input file and pins one CPU to 100%.
Goal:
How do I make this process non-blocking, so that the UI is still responsive? I'd still like my processing function to be able to update the QProgressDialog and its associated label.
Can I extend this to process more than one dataset concurrently and retain the ability to update the progressbar info?
Can I write into h5py from more than one thread/process/etc.? Will I have to implement locking on the write operation?
Software Versions:
I use Python 3.3+ with numpy/scipy/etc. The UI is in PyQt4 4.11 / Qt 4.8, although I'd be interested in solutions that use Python 3.4 (and therefore asyncio) or PyQt5.
This is quite a complex problem to solve, and this format is not really suited to providing complete answers to all your questions. However, I'll attempt to put you on the right track.
How do I make this process non-blocking, so that the UI is still responsive? I'd still like my processing function to be able to update the QProgressDialog and its associated label.
To make it non-blocking, you need to offload the processing into a Python thread or QThread. Better yet, offload it into a subprocess that communicates progress back to the main program via a thread in the main program.
I'll leave you to implement (or ask another question on) creating subprocesses or threads. However, you need to be aware that only the main (GUI) thread can access GUI methods. This means you need to emit a signal if using a QThread, or use QApplication.postEvent() from a Python thread (I've wrapped the latter up into a library for Python 2.7 here; Python 3 compatibility will come one day).
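As a sketch of the QThread-plus-signal variant (PyQt4 new-style signals; process_one_file() below is a placeholder for your numpy/h5py work, not part of any real API):

from PyQt4 import QtCore

def process_one_file(path):
    # Placeholder for the real processing of one input file.
    pass

class Worker(QtCore.QThread):
    # percent complete, label text
    progress = QtCore.pyqtSignal(int, str)

    def __init__(self, input_files, parent=None):
        super(Worker, self).__init__(parent)
        self.input_files = input_files

    def run(self):
        for i, path in enumerate(self.input_files):
            process_one_file(path)
            percent = int(100.0 * (i + 1) / len(self.input_files))
            self.progress.emit(percent, "Processed %s" % path)

# In the main window, connect the signal to a slot that updates the
# QProgressDialog -- the slot runs in the GUI thread, so touching widgets
# there is safe:
#   self.worker = Worker(files)
#   self.worker.progress.connect(self.on_progress)
#   self.worker.start()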
Can I extend this to process more than one dataset concurrently and retain the ability to update the progressbar info?
Yes. One example would be to spawn many subprocesses. Each subprocess can be configured to send messages back to an associated thread in the main process, which communicates the progress information to the GUI via the method described for the above point. How you display this progress information is up to you.
Can I write into h5py from more than one thread/process/etc.? Will I have to implement locking on the write operation?
You should not write to a hdf5 file from more than one thread at a time. You will need to implement locking. I think possibly even read access should be serialised.
A colleague of mine has produced something along these lines for Python 2.7 (see here and here), you are welcome to look at it or fork it if you wish.
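As a rough illustration of the locking point (not a recommendation of a particular design): all workers can funnel their writes to the shared hdf5 file through a single multiprocessing.Lock. The file and dataset names below are placeholders.

import h5py
from multiprocessing import Lock, Pool

h5_lock = Lock()

def init_worker(lock):
    # Give each worker process a handle on the shared lock.
    global h5_lock
    h5_lock = lock

def process_and_store(args):
    name, data = args                 # 'data' stands in for a processed result
    with h5_lock:                     # one process touches the file at a time
        with h5py.File("results.h5", "a") as f:
            f.create_dataset(name, data=data)

if __name__ == "__main__":
    jobs = [("dataset_%d" % i, list(range(10))) for i in range(4)]
    pool = Pool(2, initializer=init_worker, initargs=(h5_lock,))
    pool.map(process_and_store, jobs)
    pool.close()
    pool.join()

An often simpler alternative is to have the workers return their results and let a single process do all the h5py writing.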
I've read that certain Python functions implemented in C, which I assume includes file.read(), can release the GIL while they're working and then get it back on completion and by doing so make use of multiple cores if they're available.
I'm using multiprocess to parallelize some code and currently I've got three processes, the parent, one child that reads data from a file, and one child that generates a checksum from the data passed to it by the first child process.
Now, if I'm understanding this right, it seems that creating a new process to read the file, as I'm currently doing, is unnecessary and I should just call it in the main process. The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?
So given my function to read and pipe the data to be processed:
def read(file_path, pipe_out, block_size=64 * 1024):
    # Read the file in fixed-size blocks and push each block down the pipe.
    # block_size was undefined in the original snippet; it is made an explicit
    # parameter here so the function is self-contained.
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            pipe_out.send(block)
    pipe_out.close()
I reckon that this will definitely make use of multiple cores, but also introduces some overhead:
multiprocess.Process(target=read, args=args).start()
But now I'm wondering if just doing this will also use multiple cores, minus the overhead:
read(*args)
Any insights anybody has as to which one would be faster and for what reason would be much appreciated!
Okay, as came out by the comments, the actual question is:
Does (C)Python create threads on its own, and if so, how can I make use of that?
Short answer: No.
But the reason why these C functions are nevertheless interesting for Python programmers is the following. By default, no two snippets of Python code running in the same interpreter can execute in parallel; this is due to the evil known as the Global Interpreter Lock, aka the GIL. The GIL is held whenever the interpreter is executing Python code, which implies the statement above: no two pieces of Python code can run in parallel in the same interpreter.
Nevertheless, you can still make use of multithreading in Python, namely when you're doing a lot of I/O or making heavy use of external libraries like numpy, scipy, lxml and so on, which all know about the issue and release the GIL whenever they can (i.e. whenever they do not need to interact with the Python interpreter).
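For instance, a couple of plain threads computing checksums with hashlib (which releases the GIL while hashing large buffers, as does file I/O) will happily overlap; the file names here are placeholders:

import hashlib
import threading

def checksum(path, results):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)          # GIL released while hashing the block
    results[path] = h.hexdigest()

results = {}
threads = [threading.Thread(target=checksum, args=(p, results))
           for p in ("file_a.bin", "file_b.bin")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)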
I hope that cleared up the issue a bit.
I think this is the main part of your question:
The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?
I assume your goal is to read and process the file as fast as possible. File reading is in any case I/O-bound, not CPU-bound: you cannot process data faster than you are able to read it, so file I/O clearly limits the performance of your software. You cannot increase the read data rate by using concurrent threads/processes for file reading, and 'low-level' CPython is not doing this behind the scenes either. As long as you read the file in one process or thread (even in the case of CPython with its GIL, a thread is fine), you will get as much data per unit time as the storage device can deliver. It is also fine to do the file reading in the main thread, as long as there are no other blocking calls that would actually slow down the reading.
Is there a way to generically retrieve process stats using Perl or Python? We could keep it Linux specific.
There are a few problems: I won't know the PID ahead of time, but I can run the process in question from the script itself. For example, I'd have no problem doing:
./myscript.pl some/process/I/want/to/get/stats/for
Basically, I'd like to, at the very least, get the memory consumption of the process, but the more information I can get the better (like run time of the process, average CPU usage of the process, etc.)
Thanks.
Have a look at the Proc::ProcessTable module which returns quite a bit of information on the processes in the system. Call the "fields" method to get a list of details that you can extract from each process.
I recently discovered the above module, which has pretty much replaced the Process module I had written for a Perl kill program for Linux. You can have a look at my script here.
It can easily be extended to extract further information from the ps command. For example, the 'getbycmd' method returns a list of PIDs whose command-line invocation matches the passed argument. You can then retrieve a specific process's details by passing its PID to 'getdetail', like so:
my $psTable = Process->new();

# Get the list of processes owned by 'root'
for my $pid ( $psTable->getbyuser("root") ) {
    my $psDetail = $psTable->getdetail( $pid );
    # Do something with $psDetail...
}
If you are fork()ing the child, you will know its PID.
From within the parent you can then parse the files in /proc/<PID>/ to check the memory and CPU usage, albeit only for as long as the child process is running.
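For instance, in Python (the same idea works from Perl), you could start the command you were handed on the command line and sample /proc/<pid>/status for its memory fields. This sketch reads the values only once; for continuous monitoring you would poll in a loop while the child runs:

import subprocess
import sys

# Launch the process named on our command line, then peek at its status file.
proc = subprocess.Popen(sys.argv[1:])

with open("/proc/%d/status" % proc.pid) as f:
    for line in f:
        if line.startswith(("VmRSS", "VmPeak")):
            print(line.strip())

proc.wait()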
A common misconception is that reading /proc is like reading /home. /proc is designed to give you the same information with one open() that 20 similar syscalls filling some structure could provide. Reading it does not pollute buffers, send innocent programs to paging hell or otherwise contribute to the death of kittens.
Accessing /proc/foo is just telling the kernel "give me information on foo that I can process in a language agnostic way"
If you need more details on what is in /proc/{pid}/, update your question and I'll post them.