Cerberus: How to ignore a field based on a YAML comment? - python

Overview
I have a lot of .yaml files and a schema to validate them.
Sometimes an "incorrect" value is in fact correct.
I need some way to ignore certain fields, so that no validation is performed on them.
Example
## file -- a.yaml
some_dict:
  some_key: some_valid_value

## file -- b.yaml
some_dict:
  some_key: some_INVALID_value  # cerberus: ignore
How can I do this?

Quick Answer (TL;DR)
The "composite validation" approach allows for conditional (context-aware) validation rules.
The python cerberus package supports composite validation "out of the box".
YAML comments cannot be used for composite validation, however YAML fields can.
Detailed Answer
Context
python 2.7
cerberus validation package
Problem
Developer PabloPajamasCreator wishes to apply conditional validation rules.
The conditional validation rules are activated based on the presence or value of other fields in the dataset.
The conditional validation rules need to be flexible enough to change "on the fly" based on arbitrary states or relationships in the source data.
Solution
This approach can be accomplished with composite data validation.
Under this use case, composite validation simply means creating a sequential list of validation rules, such that:
- each individual rule operates on a composite data variable,
- each individual rule specifies a "triggering condition" for when the rule applies, and
- each individual rule produces one of three mutually exclusive validation outcomes: validation-success, validation-fail, or validation-skipped.
Example
Sample validation rules
- rule_caption: check-required-fields
  rule_vpath: "@"
  validation_schema:
    person_fname:
      type: string
      required: true
    person_lname:
      type: string
      required: true
    person_age:
      type: string
      required: true

- rule_caption: check-age-range
  rule_vpath: '@|@.person_age'
  validation_schema:
    person_age:
      "min": 2
      "max": 120

- rule_caption: check-underage-minor
  rule_vpath: '[@]|[? @.person_age < `18`]'
  validation_schema:
    prize_category:
      type: string
      allowed: ['pets','toys','candy']
    prize_email:
      type: string
      regex: '[\w]+@.*'
The code above is a YAML-formatted representation of multiple validation rules.
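A minimal sketch of how such a rule list might be driven, assuming the rules above live in a file named rules.yaml; the run_composite_validation() driver, the file name, and the sample document are illustrative, not part of cerberus:

import jmespath
import yaml
from cerberus import Validator

def run_composite_validation(rules, document):
    """Apply each rule only when its jmespath trigger matches; otherwise skip it."""
    results = []
    for rule in rules:
        # Evaluate the trigger expression (rule_vpath) against the document.
        triggered = jmespath.search(rule["rule_vpath"], document)
        if not triggered:
            results.append((rule["rule_caption"], "validation-skipped", {}))
            continue
        validator = Validator(rule["validation_schema"], allow_unknown=True)
        outcome = "validation-success" if validator.validate(document) else "validation-fail"
        results.append((rule["rule_caption"], outcome, validator.errors))
    return results

with open("rules.yaml") as handle:   # assumed file holding the rules shown above
    rules = yaml.safe_load(handle)

doc = {"person_fname": "Pablo", "person_lname": "Pajamas", "person_age": "30"}
for caption, outcome, errors in run_composite_validation(rules, doc):
    print(caption, outcome, errors)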
Rationale
This approach can be extended to arbitrary levels of complexity: any set of conditions and constraints can be expressed as a sequence of such rules.
This approach is easily comprehensible by humans (although the jmespath syntax can be a challenge).
Pitfalls
The above example uses jmespath syntax to specify rule_vpath, which tells the system when to trigger specific rules; this adds a dependency on jmespath.
See also
complete code example on github

Related

Adding wildcards to a workflow -- best practices

I have quite a complex Snakemake bioinformatics workflow consisting of >200 rules. It basically starts from a set of FASTQ files, from which variables are inferred, like so:
(WC1, WC2, WC3, WC4) = glob_wildcards(FASTQPATH + "{wc1}_{wc2}_{wc3}_{wc4}.fastq.gz")
Those are then expanded to generate the target files, for example (I am skipping intermediate rules for brevity):
rule all:
    input:
        expand("mappings/{wc1}_{wc2}_{wc3}_{wc4}.bam", wc1=WC1, wc2=WC2, wc3=WC3, wc4=WC4),
Over the course of a project, metadata can evolve and wildcards need to be added, e.g. wc5:
(WC1, WC2, WC3, WC4, WC5) = glob_wildcards(FASTQPATH + "{wc1}_{wc2}_{wc3}_{wc4}_{wc5}.fastq.gz")
This results in manually editing ~200 workflow rules to comply with the new input. I wonder if anyone in the community has come up with a more elegant, less cumbersome solution (using input functions perhaps?), or is it just a Snakemake limitation we all have to live with?
Thanks in advance
I have a workflow for ChIP-seq data, and my fastq files are named in the format MARK_TISSUE_REPLICATE.fastq.gz, so for example H3K4me3_Liver_B.fastq.gz. For many of my rules, I don't need to have separate wildcards for the mark, tissue, and replicate. I can just write my rules this way:
rule example:
    input: "{library}.fq.gz"
    output: "{library}.bam"
Then for the rules where I need to have multiple inputs, maybe to combine by replicates together or to do something across all tissues, I have a function I called "libraries" that returns a list of libraries with certain criteria. For example libraries(mark="H3K4me3") would return all libraries for that mark, or libraries(tissue="Liver", replicate="A") would return the libraries for all marks from that specific tissue sample. I can use this to write rules that need to combine multiple libraries, such as:
rule example2:
    input: lambda wildcards: expand("{library}.bam", library=libraries(mark=wildcards.mark))
    output: "{mark}_Heatmap_Clustering.png"
To fix some weird or ambiguous rule problems, I found it helpful to set some wildcard constraints like this:
wildcard_constraints:
    mark="[^_/]+",
    tissue="[^_/]+",
    replicate="[^_/]+",
    library="[^_/]+_[^_/]+_[^_/]+"
Hopefully you can apply some of these ideas to your own workflow.
I think @Colin is on the right (most snakemake-ish) path here. However, if you want to make use of the wildcards, e.g. in the log, or if they dictate certain parameters, then you could try replacing the wildcards with a variable and injecting it into the input and output of rules:
metadata = "{wc1}_{wc2}_{wc3}_{wc4}"
WC1, WC2, WC3, WC4 = glob_wildcards(FASTQPATH + metadata + ".fastq.gz")

rule map:
    input:
        expand(f"unmapped/{metadata}.fq")
    output:
        expand(f"mappings/{metadata}.fq")
    shell:
        """
        echo {wildcards.wc1};
        mv {input} {output}
        """

rule all:
    input:
        expand("mappings/{wc1}_{wc2}_{wc3}_{wc4}.bam", wc1=WC1, wc2=WC2, wc3=WC3, wc4=WC4)
This way, changing to more or fewer wildcards is relatively easy.
Disclaimer: I haven't tested whether any of this actually works :)

Hypothesis equivalent of QuickCheck frequency generator?

As a learning project I am translating some Haskell code (which I'm unfamiliar with) into Python (which I know well)...
The Haskell library I'm translating has tests which make use of QuickCheck property-based testing. On the Python side I am using Hypothesis as the property-based testing library.
The Haskell tests make use of a helper function which looks like this:
mkIndent' :: String -> Int -> Gen String
mkIndent' val size = concat <$> sequence [indent, sym, trailing]
  where
    whitespace_char = elements " \t"
    trailing = listOf whitespace_char
    indent = frequency [(5, vectorOf size whitespace_char), (1, listOf whitespace_char)]
    sym = return val
My question is specifically about the frequency generator in this helper.
http://hackage.haskell.org/package/QuickCheck-2.12.6.1/docs/Test-QuickCheck-Gen.html#v:frequency
I understand it to mean that most of the time it will return vectorOf whitespace_char with the expected size, but about 1 time in 6 (the weights are 5:1) it will return listOf whitespace_char, which could be any length, including zero.
In the context of the library, an indent which does not respect the size would model bad input data for the function under test. So I see the point of occasionally producing such an input.
What I currently don't understand is why the 5:1 ratio in favour of valid inputs? I would have expected the property-based test framework to generate various valid and invalid inputs. For now I assume that this is sort of like an optimisation, so that it doesn't spend most of its time generating invalid examples?
The second part of my question is how to translate this into Hypothesis. AFAICT Hypothesis does not have any equivalent of the frequency generator.
I am wondering whether I should attempt to build a frequency strategy myself from existing Hypothesis strategies, or if the idiom itself is not worth translating and I should just let the framework generate valid & invalid examples alike?
What I have currently is:
from hypothesis import strategies as st
@st.composite
def make_indent_(draw, val, size):
    """
    Indent `val` by `size` using either space or tab.
    Will sometimes randomly ignore `size` param.
    """
    whitespace_char = st.text(' \t', min_size=1, max_size=1)
    trailing = draw(st.text(draw(whitespace_char)))
    indent = draw(st.one_of(
        st.text(draw(whitespace_char), min_size=size, max_size=size),
        st.text(draw(whitespace_char)),
    ))
    return ''.join([indent, val, trailing])
If I generate a few examples in a shell this seems to be doing exactly what I think it should.
But this is my first use of Hypothesis or property-based testing and I am wondering if I am losing something vital by replacing the frequency distribution with a simple one_of?
As far as I can see, you've correctly understood the purpose of using frequency here. It is used to allow the occasional mis-sized indent instead of either (1) only generating correctly sized indents which would never test bad indent sizes; or (2) generating randomly sized indents which would test bad indents over and over again but only generate a fraction of cases with good indents to test other aspects of the code.
Now, the 5:1 ratio of good to (potentially) bad indent sizes is probably quite arbitrary, and it's hard to know if 1:1 or 10:1 would have been better choices without seeing the details of what's being tested.
Luckily though, with respect to porting this to Hypothesis, the answer to "Have a Strategy that does not uniformly choose between different strategies" includes a deleted comment:
Hypothesis doesn't actually support user-specific probabilities - we start with a uniform distribution, but bias it based on coverage from observed inputs. [...] – Zac Hatfield-Dodds Apr 15 '18 at 3:43
This suggests that the "hypothesis" package automatically adjusts weights when using one_of to increase coverage, meaning that it may automatically up-weight the case with correct size in your make_indent_ implementation, making it a sort of automatic version of frequency.
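If you do decide to build a frequency-style helper from existing strategies anyway, a rough sketch is possible with a composite strategy; the frequency name and the (weight, strategy) list format below simply mirror QuickCheck and are not part of Hypothesis:

from hypothesis import strategies as st

@st.composite
def frequency(draw, weighted_strategies):
    """Draw from one of the given strategies, weighted QuickCheck-style.

    `weighted_strategies` is a list of (weight, strategy) pairs.
    """
    total = sum(weight for weight, _ in weighted_strategies)
    # Pick a point in the cumulative weight range, then find the matching strategy.
    pick = draw(st.integers(min_value=1, max_value=total))
    cumulative = 0
    for weight, strategy in weighted_strategies:
        cumulative += weight
        if pick <= cumulative:
            return draw(strategy)

# Example: roughly 5 correctly sized indents for every arbitrary-length one.
indent = frequency([
    (5, st.text(' \t', min_size=4, max_size=4)),
    (1, st.text(' \t')),
])

Note that Hypothesis does not promise a uniform distribution over the drawn integer, so the weights are only approximate, which is consistent with the quoted comment above.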

Detect empty string in numeric field using Cerberus

I am using the python library cerberus (http://docs.python-cerberus.org/en/stable/) and I want to check if a JSON field is a number (integer) or an empty string.
I tried using the condition:
{"empty": True, "type": "intenger"}
But when the field is an empty string (""), I get the following error:
'must be of integer type'
Is there a way of using the basic validation rules so that an empty string is also accepted in a numeric field? I know it can be done by using extended validation functions, but I want to avoid that solution for the moment.
Try something like this:
{"anyof":[
{"type":"string","allowed":[""]},
{"anyof_type":["float","integer"]}
]},
I would advise not to overcomplicate schemas. 1) Multiple types can be declared for the type rule. 2) The empty rule is only applied to sizable values, so it ignores any given integer. Hence this is the simplest possible rule set for your constraints:
{'type': ('integer', 'string'),
'empty': True}
Mind that this doesn't enforce the value to be an empty string, but merely allows it to be; in other words, a non-empty string would also pass. You may want to use the max_length rule with 0 as the constraint instead.
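A quick check of that rule set with cerberus (the field name value and the sample documents are just for illustration):

from cerberus import Validator

schema = {"value": {"type": ("integer", "string"), "empty": True}}
v = Validator(schema)

print(v.validate({"value": 42}))     # True  -- an integer passes
print(v.validate({"value": ""}))     # True  -- an empty string passes
print(v.validate({"value": "abc"}))  # True  -- a non-empty string also passes, per the caveat above
print(v.validate({"value": 3.5}))    # False -- a float is neither integer nor string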

How can I define an application's Validation Rules in an XML file

I have an application that validates a CSV file against a set of rules. The application checks whether certain columns/fields in the CSV are mandatory, and for others it checks whether their mandatory status depends on another field. E.g. column 2 has a conditional check against column 5, such that if column 5 has a value, then column 2 must also have a value.
I have already implemented this using VB and Python. The problem is that this logic is hard-coded in the application. What I want is to move these rules into, say, an XML file that the application reads and then uses to process the CSV. If the rules for processing change (and they change often), the application remains the same and only the XML changes.
Here are two sample rules in python:
Sample One
current_column_data = 5  # data from the current position in the CSV
if validate_data_type(current_column_data, expected_data_type) == False:
    return error_message
index_to_check_against = 10  # column against which there is a "logical" test
text_to_check = get_text(index_to_check_against)
if validate_data_type(text_to_check, expected_data_type) == False:
    return error_message
if current_column_data > 10:  # this could be comparing string vs string, so keep that in mind to avoid errors, since current_column_data could be a string value
    if text_to_check <= 0:
        return "Text to check should be greater than 0 if current column data is greater than 10"
Sample Two
current_column_data = "Self Employed" #data from the current position in the CSV
if validate_data_type(current_column_data, expected_data_type) == False:
return error_message
index_to_check_against = 10 #Column against which there is a "logical" test
text_to_check = get_text(index_to_check_against)
if validate_data_type(text_to_check, expected_data_type) == False:
return error_message
if text_to_check == "A": #Here we expect if A is provided in the index to check, then current column should have a value hence we raise an error message
if len(current_column_data) = 0:
return "Current column is mandatory given that "A" is provided in Column_to_check""
Note: for each column in the CSV, we already know the data type to expect, the expected length of that field, whether it's mandatory, optional or conditional, and, if it's conditional, which other column the condition is based on.
Now I just need some guidance on how I can possibly do this in XML so that the application reads the XML and knows what to do with each column.
Someone suggested the following sample elsewhere, but I still can't wrap my head around the concept:
<check left="" right="9" operation="GTE" value="3" error_message="logical failure for something" />
#Meaning: Column 9 should be "GTE" (greater than or equal to) the value 3
Is there a different way to go about achieving this kind of logic or even a way to improve what I have here?
Suggestions and pointers welcome
This concept is called a Domain Specific Language (DSL) - you are effectively creating a mini-programming language for validating your CSV files. Your DSL allows you to express succinctly the rules for a valid CSV file.
This DSL could be expressed using XML, or an alternative approach would be to develop a library of functions in python instead. Then your DSL could be expressed as a mini-python program which is a sequence of these functions. This approach is called an in-language or "internal" DSL - and has the benefit that you have the full power of python at your disposal within your language.
Looking at your samples - you're very close to this already. When I read them, they're almost like an English description of the CSV validation rules.
- Don't feel you have to go down the XML route - there's nothing wrong with keeping everything in Python.
- You can split your code so that you have a Python file with the "CSV validation rules" expressed in your DSL, which you need to update/redistribute frequently, and separate files which define your DSL functions, which will change less frequently. A sketch of this split follows below.
- In some cases it's even possible to develop the DSL to the point where non-programmers can update/maintain "programs" written in it.
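A minimal sketch of that split, where the rule file is plain Python written in the DSL and the engine is a separate module; the helper names, file names, and column indices below are illustrative assumptions, not an existing library:

# validation_dsl.py -- reusable building blocks (changes rarely)
def mandatory_if(col, trigger_col):
    """Column `col` must be non-empty whenever `trigger_col` is non-empty."""
    def check(row):
        return bool(row[col]) or not bool(row[trigger_col])
    return check, f"column {col} is mandatory when column {trigger_col} has a value"

def at_least(col, minimum):
    """Column `col` must be numeric and >= `minimum`."""
    def check(row):
        try:
            return float(row[col]) >= minimum
        except ValueError:
            return False
    return check, f"column {col} must be a number >= {minimum}"

# rules.py -- the frequently updated "program" written in the DSL
RULES = [
    mandatory_if(2, trigger_col=5),
    at_least(9, minimum=3),
]

# engine.py -- applies the rules to every CSV row (assumes rows have enough columns)
import csv

def validate_file(path, rules):
    with open(path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle), start=1):
            for check, message in rules:
                if not check(row):
                    print(f"line {line_no}: {message}")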
The problem you are solving is not necessarily tied to XML. Yes, you can validate XML with XSD, but that means your data would need to be XML, and I'm not sure you can take it to the extent of "if A > 3, the following rule applies".
A little less elegant, but maybe simpler, approach than Ross's answer is to simply define the set of rules as data and have a specific function process them, which is basically what your XML example does: storing (i.e. serializing) the rules using XML. But you can use any other serialization format, such as JSON, YAML, INI or even CSV (not that the last would be advisable).
So you could concentrate on the data model of the rules. I'll try to illustrate that with XML (but not using properties):
<cond name="some explanatory name">
  <if>
    <expr>...</expr>
    <and>
      <expr>
        <left><column>9</column></left>
        <op>ge</op>
        <right>3</right>
      </expr>
      <expr>
        <left><column>1</column></left>
        <op>true</op>
        <right></right>
      </expr>
    </and>
  </if>
</cond>
Then you can load that into Python and traverse it for each row, raising a nice explanatory exception as appropriate (a sketch of such a traversal follows the YAML example below).
Edit: You mentioned that the file might need to be human-writable. Note that YAML has been designed for that.
A similar (not identical; changed to better illustrate the language) structure:
# comments, explanations...
conds:
  - name: some explanatory name
    # seen that? no quotes needed (unless you include some of
    # quite limited set of special chars)
    if:
      expr:
        # "..."
      and:
        expr:
          left:
            column: 9
          op: ge
          right: 3
        expr:
          left:
            column: 1
          op: true
  - name: some other explanatory name
    # i'm using alternative notation for columns below just to have
    # them better indented (not sure about support in libs)
    if:
      expr:
        # "..."
      and:
        expr:
          left: { column: 9 }
          op: ge
          right: 3
        expr:
          left: { column: 1 }
          op: true
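A rough sketch of the "load and traverse for each row" step, assuming a simplified structure in which each condition holds a flat list of expressions; the file names, key names, and operator table are illustrative:

import csv
import operator
import yaml

OPS = {"ge": operator.ge, "gt": operator.gt, "le": operator.le,
       "lt": operator.lt, "eq": operator.eq}

def check_row(row, conds):
    """Raise ValueError naming the first condition the row violates."""
    for cond in conds:
        for expr in cond["exprs"]:        # simplified: a list of expressions per condition
            left = float(row[expr["column"]])
            compare = OPS[expr["op"]]
            if not compare(left, expr["right"]):
                raise ValueError(f"row violates condition: {cond['name']}")

with open("rules.yaml") as handle:        # assumed file holding the serialized rules
    rules = yaml.safe_load(handle)

with open("data.csv", newline="") as handle:
    for row in csv.reader(handle):
        check_row(row, rules["conds"])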

Does Pyparsing Support Context-Sensitive Grammars?

Forgive me if I have the incorrect terminology; perhaps just getting the "right" words to describe what I want is enough for me to find the answer on my own.
I am working on a parser for ODL (Object Description Language), an arcane language that as far as I can tell is now used only by NASA PDS (Planetary Data Systems; it's how NASA makes its data available to the public). Fortunately, PDS is finally moving to XML, but I still have to write software for a mission that fell just before the cutoff.
ODL defines objects in something like the following manner:
OBJECT = TABLE
  ROWS = 128
  ROW_BYTES = 512
END_OBJECT = TABLE
I am attempting to write a parser with pyparsing, and I was doing fine right up until I came to the above construction.
I have to create some rule that is able to ensure that the right-hand-value of the OBJECT line is identical to the RHV of END_OBJECT. But I can't seem to put that into a pyparsing rule. I can ensure that both are syntactically valid values, but I can't go the extra step and ensure that the values are identical.
Am I correct in my intuition that this is a context-sensitive grammar? Is that the phrase I should be using to describe this problem?
Whatever kind of grammar this is in the theoretical sense, is pyparsing able to handle this kind of construction?
If pyparsing is not able to handle it, is there another Python tool capable of doing so? How about ply (the Python implementation of lex/yacc)?
It is in fact a grammar for a context-sensitive language, classically abstracted as wcw, where w is in (a|b)* (note that wcw', where ' indicates reversal, is context-free).
Parsing Expression Grammars are capable of parsing wcw-type languages by using semantic predicates. PyParsing provides the matchPreviousExpr() and matchPreviousLiteral() helper methods for this very purpose, e.g.
w = Word("ab")
s = w + "c" + matchPreviousExpr(w)
So in your case you'd probably do something like
table_name = Word(alphas, alphanums)
object = (Literal("OBJECT") + "=" + table_name + ... +
          Literal("END_OBJECT") + "=" + matchPreviousExpr(table_name))
As a general rule, parsers are built as context-free parsing engines. If there is context sensitivity, it is grafted on after parsing (or at least after the relevant parsing steps are completed).
In your case, you want to write context-free grammar rules:
head = 'OBJECT' '=' IDENTIFIER ;
tail = 'END_OBJECT' '=' IDENTIFIER ;
element = IDENTIFIER '=' value ;
element_list = element ;
element_list = element_list element ;
block = head element_list tail ;
The check that the head and tail constructs have matching identifiers isn't technically done by the parser.
Many parsers, however, allow a semantic action to occur when a syntactic element is recognized, often for the purpose of building tree nodes. In your case, you want to use this to enable additional checking. For element, you want to make sure the IDENTIFIER isn't a duplicate of something already in the block; this means for each element encountered, you'll want to capture the corresponding IDENTIFIER and make a block-specific list to enable duplicate checking. For block, you want to capture the head IDENTIFIER and check that it matches the tail IDENTIFIER.
This is easiest if you build a tree representing the parse as you go along, and hang the various context-sensitive values on the tree in various places (e.g., attach the actual IDENTIFIER value to the tree node for the head clause). At the point where you are building the tree node for the tail construct, it should be straightforward to walk up the tree, find the head tree, and then compare the identifiers.
This is easier to think about if you imagine the entire tree being built first, and then a post-processing pass over the tree is used to this checking. Lazy people in fact do it this way :-} All we are doing is pushing work that could be done in the post processing step, into the tree-building steps attached to the semantic actions.
None of these concepts is python specific, and the details for PyParsing will vary somewhat.
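A language-agnostic illustration of that post-processing idea, sketched here in Python with a hypothetical Block node standing in for whatever tree your parser builds:

class Block:
    def __init__(self, head_identifier, elements, tail_identifier):
        self.head_identifier = head_identifier
        self.elements = elements            # list of (identifier, value) pairs
        self.tail_identifier = tail_identifier

def check_block(block):
    """Context-sensitive checks performed after (or during) tree building."""
    if block.head_identifier != block.tail_identifier:
        raise ValueError(f"END_OBJECT = {block.tail_identifier} does not match "
                         f"OBJECT = {block.head_identifier}")
    seen = set()
    for identifier, _ in block.elements:
        if identifier in seen:
            raise ValueError(f"duplicate element {identifier} in {block.head_identifier}")
        seen.add(identifier)

check_block(Block("TABLE", [("ROWS", 128), ("ROW_BYTES", 512)], "TABLE"))   # passes silently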
