Snakemake rule stops with 'MissingOutputException' after successful processing of first input - python

I wrote my first snakemake rule that uses a python script for processing files:
rule sanitize_labels:
    input:
        "data/raw/labels/rois_essence_31_10_2019_final.shp",
        "data/raw/labels/pts_carte_auto_final.shp"
    output:
        "data/interim/labels/rois_essence_31_10_2019_final.csv",
        "data/interim/labels/pts_carte_auto_final.csv"
    params:
        crs = 32189,
        log = True
    script:
        "../../scripts/data/sanitize_labels.py"
It runs successfully for the first file, then stops with this message:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 9 of E:\code\projects\essences\workflow\rules\preprocessing.smk:
Missing files after 5 seconds:
data/interim/labels/pts_carte_auto_final.csv
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Removing output files of failed job sanitize_labels since they might be corrupted:
data/interim/labels/rois_essence_31_10_2019_final.csv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: E:\code\projects\essences\.snakemake\log\2020-02-10T025157.458955.snakemake.log
I tried swapping file order both in input and output; always only the first file gets processed.
In my Python script, I refer to input and output as snakemake.input[0] and snakemake.output[0]. If I understand correctly, snakemake.input[0] is assigned the current input in each call of the script (no matter how many inputs the rule has), and the same goes for snakemake.output[0]. Is that correct?
Do you have other hints at what can cause this error?
I'm running snakemake version 5.10.0 (installed as snakemake-minimal from the bioconda channel).
Thanks a lot for any hint.

Adding to Maarten's answer: once you have the generic rule he provided, you then request the final outputs you want in a rule 'all', placed as your first rule:
rule all:
    input:
        expand("data/interim/labels/{name}.csv", name=DATASETS)
If you hard-code the filenames in the input directive of your generic sanitize_labels rule, it is no longer generic: Snakemake expands the input you provide and ends up with the same rule as in your question.
Go through the tutorial again if it's still not clear. While you may think and write your rules from start to finish, snakemake evaluates from finish to start. You request the final outputs in all (as inputs) and snakemake decides what needs to be run. It's confusing at first, but just remember to request your final output in all and keep your rules generic.
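For concreteness, a complete Snakefile following this pattern could look like the sketch below. The DATASETS list is my own illustration, built from the base names in the question; treat this as a sketch, not a drop-in file:

```
DATASETS = ["rois_essence_31_10_2019_final", "pts_carte_auto_final"]

rule all:
    input:
        expand("data/interim/labels/{name}.csv", name=DATASETS)

rule sanitize_labels:
    input:
        "data/raw/labels/{name}.shp"
    output:
        "data/interim/labels/{name}.csv"
    params:
        crs = 32189,
        log = True
    script:
        "../../scripts/data/sanitize_labels.py"
```

With this layout, snakemake runs sanitize_labels once per dataset, and inside each run snakemake.input[0] and snakemake.output[0] refer to that single dataset's files.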

I think you need to take another look at the "idea" behind snakemake. Probably what you need is something like this:
rule sanitize_labels:
    input:
        "data/raw/labels/{name}.shp"
    output:
        "data/interim/labels/{name}.csv"
    params:
        crs = 32189,
        log = True
    script:
        "../../scripts/data/sanitize_labels.py"
Here you do not specify the exact filenames; instead you tell Snakemake how it can generate a certain output from a certain input. In this case, if something downstream needs both data/interim/labels/rois_essence_31_10_2019_final.csv and data/interim/labels/pts_carte_auto_final.csv, Snakemake "understands" how to make these files and knows which inputs it needs.

Related

max_iters doesn't seem to work with GLPK_MI solver in Python

I'm debugging my code right now, and since it runs with some data and not with others, I wanted to set the 'max_iters' option to 1 to see whether it finishes in one iteration or needs more. I realised the solver doesn't even seem to use it: I tried passing the string "hello" instead of an int and it still ran. Does someone know if this is a known problem?
self.prob.solve(solver="GLPK_MI", max_iters=1)
I'm using the CVXPY module with CVXOPT.
EDIT:
I want to do this because I don't get an error, it just continues to run forever. And with the project I'm working on it can take a lot of time to run so I wonder if it's really not working or if it's just a question of time.
Wouldn't it be better to set the max iterations as a variable? (Just a suggestion.)
In any case, in CVXOPT you need to set the maximum number of iterations as
'maxiters' : 1
or you can set it as a variable and then call the solver as below:
opts = {'maxiters' : 1}
self.prob.solve(solver="GLPK_MI", options = opts)

Design pattern for a data parsing&feature engineering pipeline

I know this question has been asked a few times on these boards in the past, but I have a more generic version of it, which might be applicable to someone else's projects in the future as well.
In short - I am building an ML system (with Python, but the language choice in this case is not very critical) which has its ML model at the end of a pipeline of actions:
Data upload
Data parsing
Feature engineering
Feature engineering (in a different logical bracket from prev. step)
Feature engineering (in a different logical bracket from prev. steps)
... (more steps like the last 3)
Data passed to ML model
Each of the above steps has its own series of sub-steps it must take in order to build a proper output, which is then used as input to the next step, and so on. These sub-steps, in turn, can either be completely decoupled from one another, or some of them may require other sub-steps within the same big step to complete first, because they consume the data those earlier sub-steps produce.
The thing right now is that I need to build a custom pipeline which makes it super easy to add new steps into the mix (both big and small) without upsetting the existing ones.
So far, I have a concept of how this might look from an architecture perspective: BIG STEPS 1 to n run in sequence, and each contains its own NO_REQ and REQ sub-steps (the architecture diagram from the original post is not reproduced here).
While looking at this architecture, I am immediately thinking about a Chain of Responsibility Design Pattern, which manages BIG STEPS (1, 2, ..., n), and each of these BIG STEPS having their own small version of Chain of Responsibility happening inside of their guts, which happen independently for NO_REQ steps, and then for REQ steps (with REQ steps looping-over until they are all done). With a shared interface for running logic inside of big and small steps, it would probably run rather neatly.
Yet, I am wondering if there is any better way of doing it. Moreover, what I do not like about a Chain of Responsibility is that it would require a person adding a new BIG/SMALL step to always edit the "guts" of the logic setting up the step bags, to manually include the newly added step. I would love to build something which instead would just scan a folder specific to the steps under each BIG STEP, and build a list of NO_REQ and REQ steps on its own (to uphold the Open/Closed SOLID principle).
I would be grateful for any ideas.
I did something similar recently. The challenge is to allow "steps" to be plugged in easily without many changes. So what I did was something like below. The core idea is to define a 'process' interface and fix the input and output format.
Code:
class Factory:
    def process(self, input):
        raise NotImplementedError

class Extract(Factory):
    def process(self, input):
        print("Extracting...")
        output = {}
        return output

class Parse(Factory):
    def process(self, input):
        print("Parsing...")
        output = {}
        return output

class Load(Factory):
    def process(self, input):
        print("Loading...")
        output = {}
        return output

pipeline = {
    "Extract": Extract(),
    "Parse": Parse(),
    "Load": Load(),
}

input_data = {}  # vanilla input
for process_name, process_instance in pipeline.items():
    output = process_instance.process(input_data)
    input_data = output
Output:
Extracting...
Parsing...
Loading...
So, in case you need to add a step, say 'append_headers' after parsing, all you need to do is:
# Defining a new step.
class AppendHeaders(Factory):
    def process(self, input):
        print("adding headers...")
        output = {}
        return output

pipeline = {
    "Extract": Extract(),
    "Append headers": AppendHeaders(),  # adding the new step
    "Parse": Parse(),
    "Load": Load(),
}
New output:
Extracting...
adding headers...
Parsing...
Loading...
The additional requirement in your case, where you want to scan a specific folder for REQ/NO_REQ steps, can be handled by adding a flag as a field in a JSON config and loading it into the pipeline, i.e. creating those "step" objects only if the flag is set to REQ.
Not sure how much this idea helps, but I thought I'd share it.
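On the Open/Closed concern about manually editing the pipeline registry, one option is to let steps register themselves. Below is a minimal sketch of that idea; the names Step, order, and build_pipeline are my own, not from the answer above:

```python
class Step:
    """Base class for pipeline steps. Subclasses register themselves
    simply by existing -- no central pipeline dict to edit."""
    order = 0  # position in the pipeline; lower numbers run earlier

    def process(self, data):
        raise NotImplementedError

class Extract(Step):
    order = 10
    def process(self, data):
        data["extracted"] = True
        return data

class Parse(Step):
    order = 20
    def process(self, data):
        data["parsed"] = True
        return data

def build_pipeline():
    # Discover every Step subclass that has been defined (or imported)
    # and sort the classes into execution order.
    return [cls() for cls in sorted(Step.__subclasses__(), key=lambda c: c.order)]

data = {}
for step in build_pipeline():
    data = step.process(data)
print(data)  # {'extracted': True, 'parsed': True}
```

To pick steps up from a folder, the same idea combines with pkgutil.iter_modules and importlib.import_module: importing every module in a "steps" package is enough for its Step subclasses to be discovered. Whether a step is REQ or NO_REQ could likewise become a class attribute instead of a JSON flag.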

How to link interactive problems (w.r.t. CodeJam)?

I'm not sure if it's allowed to ask for help here (if not, I don't mind not getting an answer until the competition period is over).
I was solving the Interactive Problem (Dat Bae) on CodeJam. On my local machine, I can run the judge (testing_tool.py) and my program (<name>.py) separately and copy-paste the I/O manually. However, I assume I need to find a way to make this automatic.
Edit: To make it clear, I want every output of x file to be input in y file and vice versa.
Some details:
I've used sys.stdout.write / sys.stdin.readline instead of print / input throughout my program
I tried running interactive_runner.py, but I can't figure out how to use it.
I tried running it on their server, my program in the first tab and the judge file in the second. It always throws a TLE error.
I can't find any tutorial for this either; any help will be appreciated! :/
The usage is documented in comments inside the scripts:
interactive_runner.py
# Run this as:
# python interactive_runner.py <cmd_line_judge> -- <cmd_line_solution>
#
# For example:
# python interactive_runner.py python judge.py 0 -- ./my_binary
#
# This will run the first test set of a python judge called "judge.py" that
# receives the test set number (starting from 0) via command line parameter
# with a solution compiled into a binary called "my_binary".
testing_tool.py
# Usage: `testing_tool.py test_number`, where the argument test_number
# is 0 for Test Set 1 or 1 for Test Set 2.
So use them like this:
python interactive_runner.py python testing_tool.py 0 -- ./dat_bae.py
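Under the hood, a runner like interactive_runner.py essentially cross-connects the two processes' standard streams. A minimal sketch of that idea follows; the inline toy judge/solution programs and their "reply with the double" protocol are invented for illustration:

```python
import subprocess
import sys
import threading

def pump(src, dst):
    # Copy every line one process writes to the other process's stdin.
    for line in src.stdout:
        try:
            dst.stdin.write(line)
            dst.stdin.flush()
        except (BrokenPipeError, ValueError):
            break
    try:
        dst.stdin.close()
    except (BrokenPipeError, ValueError):
        pass

# Toy judge: sends a number, expects its double back, exit code is the verdict.
judge = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; print(5); sys.stdout.flush(); sys.exit(0 if input() == '10' else 1)"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
# Toy solution: reads the number and replies with its double.
solution = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; print(int(input()) * 2); sys.stdout.flush()"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

threads = [threading.Thread(target=pump, args=(judge, solution)),
           threading.Thread(target=pump, args=(solution, judge))]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("judge verdict:", judge.wait())  # 0 means the interaction succeeded
```

This also shows why flushing matters (the sys.stdout.write/sys.stdin.readline approach mentioned in the question): without a flush after each message, both sides can deadlock waiting for buffered output, which typically shows up as TLE.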

How to check python code including libraries?

I'm working on some machine learning code, and today I lost about six hours because of a simple typo.
It was this:
numpy.empty(100,100)
instead of
numpy.empty([100,100])
As I'm not really used to numpy, I forgot the brackets. The code happily crunched the numbers, and at the end, just before saving the results to disk, it crashed on that line.
Just to put things in perspective: I code on a remote machine in a shell, so an IDE is not really an option. Also, I doubt an IDE would catch this.
Here's what I already tried:
running pylint - well, pylint kinda works. After I disabled everything apart from errors and warnings, it even seems to be useful. But pylint has a serious issue with imported modules. As seen on the official bug tracker, the devs know about it but cannot/won't do anything about it. There is a suggested workaround, but ignoring the whole module would not help in my case.
running pychecker - if I create a code snippet with the mistake I made, pychecker reports the error - the same error as the python interpreter. However, if I run pychecker on the actual source file (~100 LOC), it reports other errors (unused vars, unused imports, etc.), but the faulty numpy line is skipped.
At last I tried pyflakes, but it does even less checking than the pychecker/pylint combo.
So is there any reliable method which can check code in advance? Without actually running it.
A language with stronger type checking would have been able to save you from this particular error, but not from errors in general. There are plenty of ways to go wrong that pass static type checking. So if you have computations that take a long time, it makes sense to adopt the following strategies:
Test the code from end to end on small examples (that run in a few seconds or minutes) before running it on big data that will consume hours.
Structure long-running computations so that intermediate results are saved to files on disk at appropriate points in the computation. This means that when something breaks, you can fix the problem and restart the computation from the last save point.
Run the code from the interactive interpreter, so that in the event of an exception you are returned to the interactive session, giving you a chance of being able to recover the data using a post-mortem debugging session. For example, suppose I have some long-running computation:
def work(A, C):
    B = scipy.linalg.inv(A)  # takes a long time when A is big
    return B.dot(C)
I run this from the interactive interpreter and it raises an exception:
>>> D = work(A, C)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "q22080243.py", line 6, in work
return B.dot(C)
ValueError: matrices are not aligned
Oh no! I forgot to transpose C! Do I have to do the inversion of A again? Not if I call pdb.pm():
>>> import pdb
>>> pdb.pm()
> q22080243.py(6)work()
-> return B.dot(C)
(Pdb) B
array([[-0.01129249, 0.06886091, ..., 0.08530621, -0.03698717],
[ 0.02586344, -0.04872148, ..., -0.04853373, 0.01089163],
...,
[-0.11463087, 0.15048804, ..., 0.0722889 , -0.12388141],
[-0.00467437, -0.13650975, ..., -0.13894875, 0.02823997]])
Now, unlike in Lisp, I can't just set things right and continue the execution. But at least I can recover the intermediate results:
(Pdb) D = B.dot(C.T)
(Pdb) numpy.savetxt('result.txt', D)
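The second strategy above (saving intermediate results) can be wrapped in a small helper. The names checkpointed and slow_inverse below are illustrative, and pickle stands in for whatever serialization suits your data:

```python
import os
import pickle

def checkpointed(path, compute):
    """Load a previously saved result from disk, or compute and save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def slow_inverse(x):      # stands in for an expensive computation
    calls.append(x)
    return 1.0 / x

b1 = checkpointed("inv.pkl", lambda: slow_inverse(4))
b2 = checkpointed("inv.pkl", lambda: slow_inverse(4))  # reloaded, not recomputed
print(b1, b2, len(calls))  # 0.25 0.25 1
os.remove("inv.pkl")       # clean up the checkpoint for this demo
```

After a crash later in the pipeline, rerunning the script reloads the saved result instead of redoing the expensive work, so you only pay for the steps after the last checkpoint.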
Do you use unit tests? There is really no better way.

What does windows error 0 "ERROR_SUCCESS" mean?

I've written a python program which reads the stdout of another process through a pipe redirection.
However, the program gets stuck on this line:
print "[Input Thread] ", self.inputPipe.readline(ii)
The error is IOError: [Errno 0] Error
I found the explanation of windows errno 0. It confuses me because it is defined as:
The operation completed successfully.
Why does an operation that completed successfully lead to an error?
The name can trick you, but ERROR_SUCCESS actually means there was no error.
From https://msdn.microsoft.com/en-us/library/windows/desktop/ms681382.aspx:
ERROR_SUCCESS
0 (0x0)
The operation completed successfully.
I know this is kind of old, but I spent a fair amount of time trying to find a complete answer without success, so I figured I'd share what I found out.
The complete answer of how this happens is when the P/Invoke method you called "fails", but not because of an error.
huh you think
For example, let's say you need to unhook a windows hook, but it gets called twice due to a bit of spaghetti or a paranoid level of defensive programming in your object architecture.
// hook assigned earlier
// now we call our clean up code
if (NativeMethods.UnhookWindowsHookEx(HookHandle) == 0)
{
    // method succeeds normally so we do not get here
    Log.ReportWin32Error("Error removing hook", Marshal.GetLastWin32Error());
}

// other code runs, but the hook is never reattached;
// due to paranoid defensive programming you call your clean up code twice
if (NativeMethods.UnhookWindowsHookEx(HookHandle) == 0)
{
    // the P/Invoke method failed (returned zero) because there was no hook
    // to remove; however there was no error, the hook was already gone,
    // thus ERROR_SUCCESS (0) is our last error
    Log.ReportWin32Error("Error removing hook", Marshal.GetLastWin32Error());
}
The windows API can be tricky. Most likely, the error number was not properly retrieved by the second program you mentioned. It was either overwritten or not pulled in at all; so it defaulted to 0.
You didn't say what the other program was; but for example, in .net, it is easy to omit the 'SetLastError' flag when declaring your external calls.
[DllImport("kernel32.dll", SetLastError = true)]
https://www.medo64.com/2013/03/error-the-operation-completed-successfully/
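You can see how an "error" carrying errno 0 arises in Python itself: if code raises an OSError/IOError from an error code that was never actually set (or was already cleared), the "success" message ends up in the exception. A minimal illustration (the exact message text differs between POSIX and Windows):

```python
import os

# An OSError built from error code 0 carries the "success" message --
# exactly what happens when a stale or never-set last-error value is used.
err = OSError(0, os.strerror(0))
print(err)  # e.g. "[Errno 0] Success" on POSIX
```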
