subprocess.run() completes after first step of nextflow pipeline - python

I wrote a Nextflow workflow that performs five steps. It is meant to be used by me and my colleagues, but not everyone is comfortable with Nextflow, so I decided to write a small Python 3 wrapper that runs it for them.
The wrapper is very simple: it reads the options with argparse and builds a command that is run with subprocess.run(). The issue with the wrapper is that, once the first step of the pipeline has completed, subprocess.run() considers the process finished.
I tried using shell=True, and I tried subprocess.Popen() with a while loop that waits for an output file, but neither solved it.
How can I either tell subprocess.run() to wait until the very end, or tell nextflow run not to emit an exit code until the last step has finished? Is there a way, or am I better off giving my colleagues a Nextflow tutorial instead?
EDIT:
The reason I prefer the wrapper is that Nextflow creates lots of temporary files which one has to know how to clean up. The wrapper does that for them, saving disk space and my time.
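For context, my wrapper boils down to something like this (the option names and the pipeline path are simplified placeholders, not my actual code):
import argparse
import subprocess

parser = argparse.ArgumentParser(description="Run the Nextflow pipeline")
parser.add_argument("--input", required=True, help="input directory")        # placeholder option
parser.add_argument("--outdir", default="results", help="output directory")  # placeholder option
args = parser.parse_args()

cmd = [
    "nextflow", "run", "main.nf",   # main.nf stands in for the actual workflow script
    "--input", args.input,
    "--outdir", args.outdir,
]
# subprocess.run() blocks until the nextflow process itself exits;
# check=True raises CalledProcessError on a non-zero exit code.
subprocess.run(cmd, check=True)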

The first part of your question is a bit tricky to answer without the details, but we know subprocess.run() should wait for the command specified to complete. If your nextflow command is actually exiting before all of your tasks/steps have completed, then there could be a problem with the workflow or with the version of Nextflow itself. Since this occurs after the first process completes, I would suspect the former. My guess is that there might be some plumbing issue somewhere. For example, if your second task/step definition is conditional in any way then this could allow an early exit from your workflow.
I would avoid the wrapper here. Running Nextflow pipelines should be easy, and the documentation that accompanies your workflow should be sufficient to get it up and running quickly. If you need to set multiple params on the command line, you could include one or more configuration profiles to make it easy for your colleagues to get started. The section on pipeline sharing is also worth reading if you haven't seen it already.
If the workflow does create lots of temporary files, just ensure these are all written to the working directory, so that upon successful completion all you need to clean up is a simple rm -rf ./work. I tend to avoid automating destructive commands like this, to avoid accidental deletes. A line in your workflow's documentation saying that the working directory can be removed (following successful completion of the pipeline) should be sufficient in my opinion; just leave it up to the users to clean up after themselves.
EDIT: You may also be interested in this project: https://github.com/goodwright/nextflow.py

Related

Starting process in Google Colab with Prefix "!" vs. "subprocess.Popen(..)"

I've been using Google Colab for a few weeks now and I've been wondering what the difference is between the two following commands (for example):
!ffmpeg ...
subprocess.Popen(['ffmpeg', ...
I was wondering because I ran into some issues when I started either of the commands above and then tried to stop execution midway. Both of them cancel on KeyboardInterrupt, but I noticed that afterwards the runtime needs a factory reset because it somehow got stuck. Checking ps aux in the Linux console listed a process [ffmpeg] <defunct>, which apparently was still running, or at least blocking some resources.
I then did some research and came across some similar posts asking how to terminate a subprocess correctly (1, 2, 3). Based on those posts I came to the conclusion that the subprocess.Popen(..) variant obviously provides more flexibility when it comes to handling the subprocess: defining how stdout is handled, reacting to the returncode, etc. But I'm still unsure what exactly the first command above, with the ! prefix, does under the hood.
Using the first command is much easier and requires way less code to start the process. And assuming I don't need a lot of logic for handling the process flow, it would be a nice way to execute something like ffmpeg - if I were able to terminate it as expected. Even following the answers from the other posts, using the second command never got me to a point where I could fully terminate the process once started (even when using shell=False, process.kill() or process.wait() etc.). This got me frustrated, because restarting and re-initializing the Colab instance can take several minutes every time.
So, finally, I'd like to understand in more general terms what the difference is and was hoping that someone could enlighten me. Thanks!
! commands are executed by the notebook (or, more specifically, by the IPython interpreter), and are not valid Python commands. If the code you are writing needs to work outside of the notebook environment, you cannot use ! commands.
As you correctly note, you are unable to interact with a subprocess you launch via !, so it's also less flexible than an explicit subprocess call, though similar in this regard to subprocess.call.
Like the documentation mentions, you should generally avoid the bare subprocess.Popen unless you specifically need the detailed flexibility it offers, at the price of having to duplicate the higher-level functionality which subprocess.run et al. already implement. The code to run a command and wait for it to finish is simply
subprocess.check_call(['ffmpeg', ... ])
with variations for capturing its output (check_output) and the more modern run which can easily replace all three of the legacy high-level calls, albeit with some added verbosity.
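For example, the modern equivalents with run look roughly like this (the ffmpeg arguments are placeholders):
import subprocess

# run() waits for the command to finish; check=True makes a non-zero exit
# code raise CalledProcessError, mirroring check_call's behaviour.
subprocess.run(['ffmpeg', '-i', 'in.mp4', 'out.avi'], check=True)

# capture_output=True (Python 3.7+) collects stdout/stderr, mirroring check_output.
result = subprocess.run(['ffmpeg', '-version'], check=True, capture_output=True, text=True)
print(result.stdout)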

Python Communicate/Wait with a shell subprocess

I tried searching for a solution to this problem, but because of the shell=True argument (which I don't think is related to what I'm doing, though I could well be wrong), the search gets lots of hits that aren't particularly useful.
OK, so the problem I have is basically this:
I'm running a Python script on a cluster. On the cluster, the normal thing to do is to launch all code etc. via a shell script which is used to request the appropriate resources (maximum run time, nodes, processors per node, etc.) needed to run the job. This shell script then calls the script and away it goes.
This isn't an issue, but the problem I have is that my 'parent' code needs to wait for its 'children' to run fully (and generate their data to be used by the parent) before continuing. This isn't a problem when I don't have the shell between it and the script, but as it stands .communicate() and .wait() are 'satisfied' when the shell script is done. I need it to wait until the script(s) called by the shell are done.
I could botch it by putting in a while loop that waits for certain files to exist before breaking, but this seems messy to me.
So my question is: is there a way I can get .communicate() (ideally), .wait(), or some other (clean/nice) method to pause the parent code until the shell, and everything called by the shell, finishes running? Ideally (nearly essential, to be honest) this would be done in the parent code alone.
I might not be explaining this very well, so I'm happy to provide more details if needed; and if this is answered somewhere else, I'm sorry, just point me that way!
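For what it's worth, the 'botch' I mentioned would look roughly like this (the script and sentinel file names are made up; the child script would have to create the sentinel when it finishes):
import os
import subprocess
import time

# This returns as soon as the shell script itself exits, even though the
# jobs it launched may still be running.
subprocess.call(['./submit_job.sh'])        # placeholder script name

# Poll for a sentinel file that the child script writes when it is done.
sentinel = 'child_finished.flag'            # placeholder file name
while not os.path.exists(sentinel):
    time.sleep(30)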

Python Code Coverage and Multiprocessing

I use coveralls in combination with coverage.py to track python code coverage of my testing scripts. I use the following commands:
coverage run --parallel-mode --source=mysource --omit=*/stuff/idont/need.py ./mysource/tests/run_all_tests.py
coverage combine
coveralls --verbose
This works quite nicely with the exception of multiprocessing. Code executed by worker pools or child processes is not tracked.
Is there a possibility to also track multiprocessing code? Any particular option I am missing? Maybe adding wrappers to the multiprocessing library to start coverage every time a new process is spawned?
EDIT:
I (and jonrsharpe, also :-) found a monkey-patch for multiprocessing.
However, this does not work for me: my Travis CI build is killed almost right after the start. I checked the problem on my local machine and apparently adding the patch to multiprocessing blows up my memory usage. Tests that take much less than 1 GB of memory need more than 16 GB with this fix.
EDIT2:
The monkey-patch does work after a small modification: removing the config_file parsing (config_file=os.environ['COVERAGE_PROCESS_START']) did the trick. This solved the issue of the bloated memory. Accordingly, the corresponding line simply becomes:
cov = coverage(data_suffix=True)
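For reference, the modified patch boils down to something like the following. This is a sketch from memory rather than the exact gist, so treat the details (in particular which attribute is wrapped on your Python version) as approximate:
import multiprocessing.process
from coverage import coverage

# Keep a reference to the original bootstrap method that runs in each child process.
_original_bootstrap = multiprocessing.process.BaseProcess._bootstrap

def _bootstrap_with_coverage(self, *args, **kwargs):
    # Start a separate collector in the child, without parsing a config file.
    cov = coverage(data_suffix=True)
    cov.start()
    try:
        return _original_bootstrap(self, *args, **kwargs)
    finally:
        cov.stop()
        cov.save()   # writes a .coverage.<suffix> file picked up by 'coverage combine'

multiprocessing.process.BaseProcess._bootstrap = _bootstrap_with_coverage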
Coverage 4.0 includes a command-line option --concurrency=multiprocessing to deal with this. You must use coverage combine afterward. For instance, if your tests are in regression_tests.py, then you would simply do this at the command line:
coverage run --concurrency=multiprocessing regression_tests.py
coverage combine
I spent some time trying to make sure coverage works with multiprocessing.Pool, but it never worked.
I finally made a fix that makes it work; I'd be happy if someone pointed out whether I am doing something wrong.
https://gist.github.com/andreycizov/ee59806a3ac6955c127e511c5e84d2b6
One of the possible causes of missing coverage data from forked processes, even with --concurrency=multiprocessing, is the way multiprocessing.Pool is shut down. For example, the with statement leads to a terminate() call (see __exit__ here). As a consequence, pool workers have no time to save their coverage data. I had to use a close(), timed join() (in a thread), terminate sequence instead of with to get the coverage results saved.
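A rough sketch of that shutdown sequence, with the timed join simplified to a plain join() (worker_fn and jobs are placeholders):
import multiprocessing

def worker_fn(x):           # placeholder task
    return x * x

if __name__ == '__main__':
    jobs = range(10)        # placeholder work items
    pool = multiprocessing.Pool(processes=4)
    try:
        results = pool.map(worker_fn, jobs)
    finally:
        pool.close()   # stop accepting new work
        pool.join()    # wait for workers to exit so they can save their coverage data
        # (the sequence described above also uses a timed join in a thread, with terminate() as a fallback)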

subprocess.call does not wait for the process to complete

Per the Python documentation, subprocess.call should block and wait for the subprocess to complete. In this code I am trying to convert a few xls files to a new format by calling LibreOffice on the command line. I assumed that the call to subprocess.call is blocking, but it seems like I need to add an artificial delay after each call, otherwise I miss a few files in the out directory.
What am I doing wrong? And why do I need the delay?
from subprocess import call
for i in range(0, len(sorted_files)):
    args = ['libreoffice', '-headless', '-convert-to', 'xls',
            "%s/%s.xls" % (sorted_files[i]['filename'], sorted_files[i]['filename']),
            '-outdir', 'out']
    call(args)
    var = raw_input("Enter something: ")  # if I comment this line out, I don't get all the files in the out directory
EDIT: It might be hard to find the answer in the comments below. I used unoconv for document conversion, which is blocking and easy to work with from a script.
It's likely that libreoffice is implemented as some sort of daemon/intermediary process. The "daemon" will (effectively [1]) parse the command line and then farm the work off to some other process, possibly detaching it so that it can exit immediately. (Based on the -invisible option in the documentation, I strongly suspect that this is indeed the case you have.)
If this is the case, then your subprocess.call does do what it is advertised to do: it waits for the daemon to complete before moving on. However, it doesn't do what you want, which is to wait for all of the work to be completed. The only option you have in that scenario is to see whether the daemon has a -wait option or similar.
[1] It is likely that we don't have an actual daemon here, only something which behaves similarly. See the comments by abernert.
The problem is that the soffice command-line tool (which libreoffice is either just a link to, or a further wrapper around) is just a "controller" for the real program, soffice.bin. It finds a running copy of soffice.bin and/or creates one, tells it to do some work, and then quits.
So, call is doing exactly the right thing: it waits for libreoffice to quit.
But you don't want to wait for libreoffice to quit, you want to wait for soffice.bin to finish doing the work that libreoffice asked it to do.
It looks like what you're trying to do isn't possible to do directly. But it's possible to do indirectly.
The docs say that headless mode:
… allows using the application without user interface.
This special mode can be used when the application is controlled by external clients via the API.
In other words, the app doesn't quit after running some UNO strings/doing some conversions/whatever else you specify on the command line; it sits around waiting for more UNO commands from outside, while the launcher quits as soon as it has sent the appropriate commands to the app.
You probably have to use that above-mentioned external control API (UNO) directly.
See Scripting LibreOffice for the basics (although there's more info there about internal scripting than external), and the API documentation for details and examples.
But there may be an even simpler answer: unoconv is a simple command-line tool written using the UNO API that does exactly what you want. It starts up LibreOffice if necessary, sends it some commands, waits for the results, and then quits. So if you just use unoconv instead of libreoffice, call is all you need.
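For example, something along these lines (a sketch; the -f flag selects the output format, and the file names are placeholders):
from subprocess import check_call

# unoconv drives LibreOffice over UNO and only returns once the
# conversion has actually finished, so a plain blocking call is enough.
check_call(['unoconv', '-f', 'xls', '-o', 'out', 'input_file.xls'])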
Also notice that unoconv is written in Python, and is designed to be used as a module. If you just import it, you can write your own (simpler, and use-case-specific) code to replace the "Main entrance" code, and not use subprocess at all. (Or, of course, you can tear apart the module and use the relevant code yourself, or just use it as a very nice piece of sample code for using UNO from Python.)
Also, the unoconv page linked above lists a variety of other similar tools, some that work via UNO and some that don't, so if it doesn't work for you, try the others.
If nothing else works, you could consider, e.g., creating a sentinel file and using a filesystem watch, so at least you'll be able to detect exactly when it's finished its work, instead of having to guess at a timeout. But that's a real last-ditch workaround that you shouldn't even consider until eliminating all of the other options.
If libreoffice is using an intermediary (daemon), as mentioned by @mgilson, then one solution is to find out what program it's invoking, and then invoke that program directly yourself.

How to access a data structure from a currently running Python process on Linux?

I have a long-running Python process that is generating more data than I planned for. My results are stored in a list that will be serialized (pickled) and written to disk when the program completes -- if it gets that far. But at this rate, it's more likely that the list will exhaust all 1+ GB free RAM and the process will crash, losing all my results in the process.
I plan to modify my script to write results to disk periodically, but I'd like to save the results of the currently-running process if possible. Is there some way I can grab an in-memory data structure from a running process and write it to disk?
I found code.interact(), but since I don't have this hook in my code already, it doesn't seem useful to me (Method to peek at a Python program running right now).
I'm running Python 2.5 on Fedora 8. Any thoughts?
Thanks a lot.
Shahin
There is not much you can do for a running program. The only thing I can think of is to attach the gdb debugger, stop the process and examine the memory. Alternatively, make sure that your system is set up to save core dumps, then kill the process with kill -SIGSEGV <pid>. You should then be able to open the core dump with gdb and examine it at your leisure.
There are some gdb macros that will let you examine python data structures and execute python code from within gdb, but for these to work you need to have compiled python with debug symbols enabled and I doubt that is your case. Creating a core dump first then recompiling python with symbols will NOT work, since all the addresses will have changed from the values in the dump.
Here are some links for introspecting python from gdb:
http://wiki.python.org/moin/DebuggingWithGdb
http://chrismiles.livejournal.com/20226.html
or google for 'python gdb'
N.B. To configure Linux to create core dumps, use the ulimit command.
ulimit -a will show you what the current limits are set to.
ulimit -c unlimited will enable core dumps of any size.
While certainly not very pretty, you could try to access your process's data through the proc filesystem: /proc/[pid-of-your-process]. The proc filesystem stores a lot of per-process information, such as currently open file descriptors, memory maps and whatnot. With a bit of digging you might be able to access the data you need.
Still, I suspect you should rather look at this from within Python and do some runtime logging and debugging.
+1 Very interesting question.
I don't know how well this might work for you (especially since I don't know if you'll reuse the pickled list in the program), but I would suggest this: as you write to disk, print out the list to STDOUT. When you run your python script (I'm guessing also from command line), redirect the output to append to a file like so:
python myScript.py >> logFile
This should store all the lists in logFile.
This way, you can always take a look at what's in logFile and you should have the most up to date data structures in there (depending on where you call print).
Hope this helps
This answer has info on attaching gdb to a python process, with macros that will get you into a pdb session in that process. I haven't tried it myself but it got 20 votes. Sounds like you might end up hanging the app, but also seems to be worth the risk in your case.
