I have many large collections of objects that need to be filtered. I need the filters to be flexible and user definable.
Hard coded, it might look something like:
selected = set()
for o in objects:
    if o.someproperty == 'ab':
        selected.add(o)
    if o.a == 'a' and 'something' in o.b:
        selected.add(o)
But I need something I can store in the db.
I've seen this referred to as the Criteria (or Filter) pattern (http://www.tutorialspoint.com/design_pattern/filter_pattern.htm), but I can't find much information on it.
I'd like the rules to be flexible and serializable in a simple format.
Maybe the above could look something like:
someproperty == 'ab'
a == 'a', 'something' in b
Each line of the ruleset is a set of conditions that must all be met. If any line in the ruleset matches, the object is included. But should the boolean logic be the other way around (with "and" between the lines and "or" within them)? Does this give the flexibility to handle various negations? How would I parse it?
What simple approaches are there to this problem?
A couple of my example uses
# example 1
for o in objects:
    if o.width < 1 and o.height < 1:
        selected.add(o)
# example 2
for o in objects:
    if o.type == 'i' or o.type == 't':
        continue
    if not o.color:
        selected.add(o)
        continue
    if o.color == 'ffffff':
        selected.add(o)
        continue
    if o.color == '000000':
        continue
    grey = (o.color[1:3] == o.color[3:5] and o.color[1:3] == o.color[5:7])
    if grey:
        selected.add(o)
Well, if you want a safe method you don't want to store code in your db.
What you want is a way for the user to specify the properties that can be parsed efficiently and applied as a filter.
I believe it's useless to invent your own language for describing properties. Just use an existing one. For example XML (though I'm not a fan...).
A filter may look like:
<filter name="something">
    <or>
        <isEqual attrName="someproperty" type="str" value="ab" />
        <and>
            <isEqual attrName="a" type="str" value="a" />
            <contains attrName="b" type="str" value="something" />
        </and>
    </or>
</filter>
You can then parse the XML to obtain an object representation of the filter.
It shouldn't be hard from this to obtain a piece of code that performs the actions.
For each kind of check you'll have a class that implements that check, and when parsing you replace each node with an instance of the corresponding class. It should be a very simple thing to do.
In this way the filter is easily modified by anyone who has a little knowledge of XML. Moreover you can extend its logic and you don't have to change the parser for every new construct that you allow.
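To make the parsing step concrete, here is a minimal sketch that turns a filter like the one above into a callable. It uses closures rather than one class per check, which is a simplification of what is described above. It assumes every check names its attribute with attrName; the CASTS table, the <not> element, and the names build and load_filter are hypothetical additions for illustration only.

```python
import xml.etree.ElementTree as ET
from types import SimpleNamespace

FILTER_XML = """
<filter name="something">
  <or>
    <isEqual attrName="someproperty" type="str" value="ab" />
    <and>
      <isEqual attrName="a" type="str" value="a" />
      <contains attrName="b" type="str" value="something" />
    </and>
  </or>
</filter>
"""

# Hypothetical converters for the "type" attribute.
CASTS = {"str": str, "int": int, "float": float}


def build(node):
    """Turn one XML node into a function that maps an object to True/False."""
    if node.tag in ("and", "or"):
        children = [build(child) for child in node]
        combine = all if node.tag == "and" else any
        return lambda obj: combine(check(obj) for check in children)
    if node.tag == "not":  # assumed extension, not part of the filter above
        inner = build(node[0])
        return lambda obj: not inner(obj)
    name = node.get("attrName")
    value = CASTS[node.get("type", "str")](node.get("value"))
    if node.tag == "isEqual":
        return lambda obj: getattr(obj, name, None) == value
    if node.tag == "contains":
        return lambda obj: value in (getattr(obj, name, None) or "")
    raise ValueError("unknown filter element: " + node.tag)


def load_filter(xml_text):
    """Parse the XML and build a predicate from the single child of <filter>."""
    root = ET.fromstring(xml_text)
    return build(root[0])


matches = load_filter(FILTER_XML)
objects = [
    SimpleNamespace(someproperty="ab", a="x", b=""),
    SimpleNamespace(someproperty="zz", a="a", b="has something"),
    SimpleNamespace(someproperty="zz", a="a", b="nothing"),
]
selected = [o for o in objects if matches(o)]
```

Because the XML is only ever interpreted by build, an unknown element raises an error instead of executing anything, which is the safety property that storing code in the db would lose.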
Given a document containing a paragraph
d = docx.Document()
p = d.add_paragraph()
I expected the following technique to work every time:
if len(p._element) == 0:
    # p is empty
OR
if len(p._p) == 0:
    # p is empty
(Side question, what's the difference there? It seems that p._p is p._element in every case I've seen in the wild.)
If I add a style to my paragraph, the check no longer works:
>>> p2 = d.add_paragraph(style="Normal")
>>> print(len(p2._element))
1
Explicitly setting text=None doesn't help either, not that I would expect it to.
So how do I check if a paragraph is empty of content (specifically text and images, although more generic is better)?
Update
I messed around a little and found that setting the style apparently adds a single pPr element:
>>> p2._element.getchildren()
[<CT_PPr '<w:pPr>' at 0x7fc9a2b64548>]
The element itself is empty:
>>> len(p2._element.getchildren()[0])
0
But more importantly, it is not a run.
So my test now looks like this:
def isempty(par):
    return sum(len(run) for run in par._element.xpath('w:r')) == 0
I don't know enough about the underlying system to have any idea if this is a reasonable solution or not, and what the caveats are.
More Update
Seems like I need to be able to handle a few different cases here:
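def isempty(par):
    p = par._p
    # Runs that contain anything other than run properties (w:rPr).
    runs = p.xpath('./w:r[./*[not(self::w:rPr)]]')
    # Any other direct child, ignoring paragraph properties and bookmarks.
    others = p.xpath('./*[not(self::w:pPr) and not(self::w:r) and '
                     'not(contains(local-name(), "bookmark"))]')
    return len(runs) + len(others) == 0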
This skips all w:pPr elements and runs with nothing but w:rPr elements. Any other element, except bookmarks, whether in the paragraph directly or in a run, will make the result non-empty.
The <w:p> element can have any of a large number of children, as you can see from the XML Schema excerpt here: http://python-docx.readthedocs.io/en/latest/dev/analysis/schema/ct_p.html (see the CT_P and EG_PContent definitions).
In particular, it often has a w:pPr child, which is where the style setting goes.
So your test isn't very reliable against false positives (if being empty is considered positive).
I'd be inclined to use paragraph.text == '', which parses through the runs.
A run can be empty (of text), so the mere presence of a run is not proof enough. The actual text is held in a w:t (text) element, which can also be empty. The .text approach avoids all those low-level complications for you, and because it is part of the public API it is much less likely to change in a future release.
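To illustrate why counting children is unreliable, here is a small standard-library sketch that works on hand-written w:p fragments instead of a real document. The three sample paragraphs and the paragraph_text helper are assumptions made up for this illustration; the helper only approximates what a text-based check looks at and is not python-docx's implementation.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_text(p):
    """Join the text of every w:t that sits inside a run of this paragraph."""
    return "".join(t.text or "" for t in p.iterfind("w:r/w:t", {"w": W}))


# A paragraph with only a style, one with an empty run, and one with text.
styled_empty = ET.fromstring(
    '<w:p xmlns:w="%s"><w:pPr><w:pStyle w:val="Normal"/></w:pPr></w:p>' % W)
empty_run = ET.fromstring(
    '<w:p xmlns:w="%s"><w:r><w:rPr><w:b/></w:rPr><w:t></w:t></w:r></w:p>' % W)
with_text = ET.fromstring(
    '<w:p xmlns:w="%s"><w:r><w:t>Hello</w:t></w:r></w:p>' % W)

samples = (styled_empty, empty_run, with_text)
child_counts = [len(p) for p in samples]       # every paragraph has one child
texts = [paragraph_text(p) for p in samples]   # only the last one has text
```

All three paragraphs report one child element, yet only the last contains any text. Note that a text-only check says nothing about pictures, so if images matter they would need a separate test.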
xml file snapshot
From the above .xml file I am extracting the article-id, article-title, abstract and keywords. For plain text inside a single tag I get correct results, but not for text containing nested tags, such as:
<title-group>
    <article-title>
        Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
        <italic>Rapidithrix thailandica</italic>
    </article-title>
</title-group>
.
.
The same applies to the abstract.
The output I get is:
OrderedDict([(u'italic', u'Rapidithrix thailandica'), ('#text', u'Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,')])
The code has treated the tag name as a key, and the generated output is not in document order either.
How can I simply extract the text from such input as "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium, Rapidithrix thailandica"?
I am using the Python code below to perform this task:
import xmltodict
import os
from os.path import basename
import re
with open('2630847.nxml') as fd:
    doc = xmltodict.parse(fd.read())
pmc_id = doc['article']['front']['article-meta']['article-id'][1]['#text']
article_title = doc['article']['front']['article-meta']['title-group']['article-title']
y = doc['article']['front']['article-meta']['abstract']
y = y.items()[0]
article_abstract = [g.encode('ascii','ignore') for g in y][1]
z = doc['article']['front']['article-meta']['kwd-group']['kwd']
zz = [g.encode('ascii','ignore') for g in z]
article_keywords = ",".join(zz).replace(","," ")
fout = open(str(pmc_id)+".txt","w")
fout.write(str(pmc_id)+"\n"+str(article_title)+". "+str(article_abstract)+". "+str(article_keywords))
Can somebody please suggest corrections?
xmltodict will likely be hard to use for your data. PMC journal articles are definitely not what its authors had in mind. Putting any but the most trivial XML into xmltodict is like pounding a round peg into a square hole: you might succeed, but it won't be pretty. I explain this further below under "tldr".
Instead, I suggest you use a library whose data model fits your data better, such as xml.dom.minidom or a recent version of BeautifulSoup. In many such libraries you load the document with one call and then call some function like innerText() to get all of its text content. You could even load the document into a browser and read the JavaScript innerText property to get what you want. If the tool you choose doesn't already provide an innertext() function, it is easy to write:
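from xml.dom.minidom import Element, Text

def innertext(node):
    """Concatenate all text beneath node, in document order."""
    t = ""
    for curNode in node.childNodes:
        if isinstance(curNode, Text):
            t += curNode.nodeValue
        elif isinstance(curNode, Element):
            # Recurse into child elements such as <italic>.
            t += innertext(curNode)
    return t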
You could tweak that to put spaces between the text nodes, depending on your data.
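As a rough illustration, here is the same idea applied to the title markup from the question using xml.dom.minidom from the standard library. The SNIPPET string is copied from the question by hand rather than read from the .nxml file, and the inner_text helper name is just a choice made for this sketch.

```python
import xml.dom.minidom

SNIPPET = """<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>"""


def inner_text(node):
    """Collect all text beneath node in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(inner_text(child))
    return "".join(parts)


doc = xml.dom.minidom.parseString(SNIPPET)
title_node = doc.getElementsByTagName("article-title")[0]
# Collapse the newlines and indentation left over from the markup.
article_title = " ".join(inner_text(title_node).split())
```

Splitting on whitespace and rejoining with single spaces is one way to handle the spacing between text nodes; whether that is appropriate depends on how the source files are laid out.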
Hope that helps.
==tldr==
xmltodict makes an admirable attempt at making XML "as simple as possible"; but IMHO it errs in making it simpler than possible.
xmltodict basically works by turning every element into a dict, with its children as the dict items, keyed by their element names. But in many cases (such as yours), XML data isn't very much like that at all. For example, an element can have many children with the same name, but a dict can't.
So xmltodict has to do something special. It turns adjacent instances of the same element type into an array (without the element type). Here's an example (excerpted from https://github.com/martinblech/xmltodict):
<and>
    <many>elements</many>
    <many>more elements</many>
</and>
becomes:
"and": {
    "many": [
        "elements",
        "more elements"
    ]
},
First off, this means that xmltodict always loses the ordering information about child elements unless they are of the same type. So a section that contains a mix of paragraphs, lists, blockquotes, and so on, will either fail to load in xmltodict, or have all the scattered instances of each kind of child gathered together, completely losing their order.
The xmltodict approach also introduces frequent special-cases -- for example, you can't just get a list of all the children, or use len() to find out how many there are, etc. etc., because at every step you have to check whether you're really at a child element, or at a list of them.
Looking at xmltodict's own examples, you'll see that they mostly consist of walking down the tree by element names, but every now and then there's an integer subscript -- that's for the cases where these arrays are needed. But unless the data is unusually simple (which yours isn't), you won't know where that is. For example, if one DIV in an HTML document happens to contain only one P, the code to access the P needs one fewer subscript than with another DIV that happens to have more than one P.
It seems to me undesirable that the number of subscripts to get to something depends on how many siblings it has, and their types.
Alas, the structure still isn't good enough. Since child elements may have their own child elements, just making them strings in that extra array won't be enough. Sometimes they'll have to be dicts again, with some of their items in turn perhaps being arrays, some of whose items may be dicts, and so on. Writing the correct traversal algorithm to gather up the text is significantly harder than the DOM one shown above.
To be completely fair, there is some XML in which the order doesn't matter logically -- for example, you could export a SQL table into an XML file, using a container element for each record with a child element for each field. The order of fields is not information, so if you load such XML into xmltodict, losing the order doesn't matter. Likewise if you serialized Python data that was already just a dict. But those are very specialized edge cases. xmltodict might be an excellent choice for a case like that -- but the articles you're looking at are very far from that.
I'm building an app that has a grid of 20x20 labels which I've set as ids id: x1y1, id: x1y2, ... , id: x20y20 etc. I want to be able to reference the id by passing a string, and modify the text within, using something like;
self.ids.name("x1y1").text = "Anything"
I've had a look at dir(self.ids), dir(self.ids.items()) etc but can't seem to figure it out. Is it even possible?
I could create a list of if statements like;
if str(passedString) == "x1y1":
    self.ids.x1y1.text = "Anything"
if str(passedString) == "x1y2":
    self.ids.x1y2.text = "Anything"
Although I feel that this is incredibly bad practice, also considering I'd need 400 if statements! - Sure I could write a little helper-script to write all this code out for me, but again, it's not ideal.
EDIT:
I had a look at;
for key, val in self.ids.items():
    print("key={0}, val={1}".format(key, val))
Which printed;
key=x1y2, val=<kivy.uix.label.Label object at 0x7f565a34e6d0>
key=x1y3, val=<kivy.uix.label.Label object at 0x7f565a2c1e88>
Might give you/someone an idea of where to go and/or what to do.
You just want to get the reference via a string of the id? You can use normal dictionary lookup (as ids is really a dict).
the_string = 'x1y1'
the_reference = self.ids[the_string]
the_reference.text = 'the_new_text'
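For the 20x20 grid, the id string can be built from the coordinates instead of being written out 400 times. The sketch below runs without Kivy: FakeLabel and the plain ids dict are stand-ins for the real Label widgets and for self.ids, and set_cell is a hypothetical helper name.

```python
class FakeLabel:
    """Stand-in for kivy.uix.label.Label; only the text attribute matters here."""

    def __init__(self):
        self.text = ""


# Stand-in for self.ids: a plain dict keyed by the id strings from the kv file.
ids = {"x%dy%d" % (x, y): FakeLabel()
       for x in range(1, 21) for y in range(1, 21)}


def set_cell(ids, x, y, text):
    """Build the id string from grid coordinates and update that label."""
    ids["x%dy%d" % (x, y)].text = text


set_cell(ids, 1, 1, "Anything")
set_cell(ids, 20, 20, "Corner")
```

Inside a widget method the same lookup would be written against self.ids rather than the stand-in dict.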
I have some XML that I am parsing in python via lxml.
I am encountering situations where some elements have attributes and some don't.
I need to extract them if they exist, but skip them if they don't - I'm currently landing with errors (as my approach is wrong...)
I have deployed a testfornull, but that doesn't work in all cases:
Code:
if root[0][a][b].attrib == '<>':
    ByteSeqReference = "NULL"
else:
    ByteSeqReference = (attributes["Reference"])
XML A:
<ByteSequence Reference="BOFoffset">
XML B:
<ByteSequence Endianness = "little-endian" Reference="BOFoffset">
XML C:
<ByteSequence Endianness = "little-endian">
XML D:
<ByteSequence>
My current method can only deal with A, B or D. It can not cope with C.
I'm surprised that a test for null values on an attribute which often won't exist ever works -- what you should be doing is checking whether it exists, not whether it's empty:
if 'Reference' in current_element.attrib:
    ...do something with it...
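Here is a small sketch of that existence check run against the four cases from the question. It uses xml.etree.ElementTree from the standard library so that it runs as-is; lxml elements expose the same attrib mapping and get() method. The sample tags are self-closed here so each parses on its own, and byte_seq_reference is a name chosen for this sketch.

```python
import xml.etree.ElementTree as ET

SAMPLES = {
    "A": '<ByteSequence Reference="BOFoffset"/>',
    "B": '<ByteSequence Endianness="little-endian" Reference="BOFoffset"/>',
    "C": '<ByteSequence Endianness="little-endian"/>',
    "D": '<ByteSequence/>',
}


def byte_seq_reference(element):
    """Return the Reference attribute if present, otherwise the string 'NULL'."""
    if "Reference" in element.attrib:
        return element.attrib["Reference"]
    return "NULL"


results = {name: byte_seq_reference(ET.fromstring(xml))
           for name, xml in SAMPLES.items()}

# element.get() with a default value does the same thing in one call.
shortcut = {name: ET.fromstring(xml).get("Reference", "NULL")
            for name, xml in SAMPLES.items()}
```

Both versions handle case C, because neither depends on whether the element has other attributes.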
I'm using django-tagging and I've got an array of tag objects. What's the best way to determine whether a given tag is among them?
def is_new (self):
    tags = Tag.objects.get_for_object(self)
    tagged = False
    for tag in tags:
        if tag = 'new':
            tagged = True
    return tagged
I have never really used django-tagging, but from a quick look through the source, .get_for_object returns a queryset of the tags for that object, not an actual list.
I'm not sure if your code is working [apart from the assignment/comparison issue] or if you just want to improve it. But since you are getting a queryset back, couldn't you continue filtering it? For instance:
Tag.objects.get_for_object(self).filter(name='new')
or to be able to use JamesO's example of:
if 'new' in tags:
    return True
I think you need to turn the queryset into a list first.
list(tags)
And then it should work.
See documentation for forcing list evaluation - and note the memory concerns of doing that.
So my recommendation would be testing filtering first - and let us know if it works, because now I got curious.
tags = Tag.objects.get_for_object(self)
if 'new' in tags:
    return True