How to change the structure of an XML - python

From this string:
label_config = {
    "label1": [
        "modality1",
        "modality2",
        "modality3"],
    "choice": "single",
    "required": "true",
    "name": "sentiment"}, {
    "label2": [
        "modality1",
        "modality2"],
    "name": "price"
}
I created this XML, which is printed:
Does anyone know how, using this library (from lxml import etree), I can move the slashes of the yellow elements from the end to the beginning?
Here is the code that generates it:
from lxml import etree
import sys
def topXML(dictAttrib = None):
    root : {lxml.etree._Element}
    root = etree.Element("View")
    textEl = etree.SubElement(root, "Text")
    if dictAttrib == None:
        dictAttrib = {
            "name":"text",
            "value":"$text"
        }
    for k_,v_ in dictAttrib.items():
        textEl.set(k_,v_)
    return root
def choiceXML(root,locChoice):
    headerEl = etree.SubElement(root, "Header")
    choisesEl = etree.SubElement(root, "Choices")
    for k_,v_ in locChoice.items():
        if (isinstance(k_,str) & isinstance(v_,list)):
            choices = v_
            headerEl.set("value",k_)
            if locChoice.get("toName") == None:
                choisesEl.set("toName","text")
            for op_ in choices:
                opEl = etree.SubElement(root, "Choice")
                opEl.set("value",op_)
        else :
            choisesEl.set(k_,v_)
    choisesEl = etree.SubElement(root, "Choices")
    return root
def checkConfig(locChoice):
    if locChoice.get("name") == None :
        sys.exit("Warning : label_config needs a parameter called 'name' assigned")
def xmlConstructor(label_config):
    root = topXML()
    for ch_ in label_config:
        checkConfig(ch_)
        root = choiceXML(root,ch_)
    return root
The generated code will be used on this site: https://labelstud.io/playground/. They use some type of XML to create the code. Unfortunately, the output I get from etree does not work there, and I found out that it works if I make the changes described above.
In the meantime, I am contacting their team to get more info, but if someone here has any idea how to make it work, please come forward.

To nest the <Choice> nodes under their parent, <Choices>, make two changes to your choiceXML function: add the opEl sub-elements under the choisesEl element (not root), and remove the redundant second choisesEl line at the end.
def choiceXML(root, locChoice):
    headerEl = etree.SubElement(root, "Header")
    choisesEl = etree.SubElement(root, "Choices")
    for k_,v_ in locChoice.items():
        if (isinstance(k_,str) & isinstance(v_,list)):
            choices = v_
            headerEl.set("value",k_)
            if locChoice.get("toName") == None:
                choisesEl.set("toName","text")
            for op_ in choices:
                opEl = etree.SubElement(choisesEl, "Choice") # CHANGE root to choisesEl
                opEl.set("value",op_)
        else :
            choisesEl.set(k_,v_)
    #choisesEl = etree.SubElement(root, "Choices") # REMOVE THIS LINE
    return root
Full Process
label_config = {
    "label1": [
        "modality1",
        "modality2",
        "modality3"],
    "choice": "single",
    "required": "true",
    "name": "sentiment"}, {
    "label2": [
        "modality1",
        "modality2"],
    "name": "price"
}
def topXML(dictAttrib = None):
    # ...NO CHANGE...
def choiceXML(root,locChoice):
    # ...ABOVE CHANGE...
def checkConfig(locChoice):
    # ...NO CHANGE...
def xmlConstructor(label_config):
    # ...NO CHANGE...
output = xmlConstructor(label_config)
Output
print(etree.tostring(output, pretty_print=True).decode("utf-8"))
# <View>
#   <Text name="text" value="$text"/>
#   <Header value="label1"/>
#   <Choices toName="text" choice="single" required="true" name="sentiment">
#     <Choice value="modality1"/>
#     <Choice value="modality2"/>
#     <Choice value="modality3"/>
#   </Choices>
#   <Header value="label2"/>
#   <Choices toName="text" name="price">
#     <Choice value="modality1"/>
#     <Choice value="modality2"/>
#   </Choices>
# </View>
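The same nesting can be sketched with the standard library's xml.etree.ElementTree, whose Element and SubElement calls behave like lxml's for this purpose. This is only an illustration under assumptions: build_view and the shortened label_config are hypothetical names and data, not the asker's code.

```python
import xml.etree.ElementTree as ET

def build_view(label_config):
    """Build a <View> tree with every <Choice> nested under its <Choices>."""
    root = ET.Element("View")
    ET.SubElement(root, "Text", {"name": "text", "value": "$text"})
    for config in label_config:
        header = ET.SubElement(root, "Header")
        choices = ET.SubElement(root, "Choices", {"toName": "text"})
        for key, value in config.items():
            if isinstance(value, list):
                header.set("value", key)
                for option in value:
                    # The parent is `choices`, not `root`.
                    ET.SubElement(choices, "Choice", {"value": option})
            else:
                choices.set(key, value)
    return root

# Shortened, hypothetical config in the same shape as the question's.
label_config = (
    {"label1": ["modality1", "modality2"], "name": "sentiment"},
    {"label2": ["modality1"], "name": "price"},
)
view = build_view(label_config)
print(ET.tostring(view, encoding="unicode"))
```

Because each Choice is created with choices as its parent, no Choice element ends up as a direct child of View.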

<Choices/> is short for <Choices></Choices> (XML spec). If you turn it into a bare closing tag, it probably has no matching opening tag, and the result will be invalid XML. Any program trying to read or parse it will error out.
Notice that you have trailing slashes on all your <Choices> elements, including the non-empty ones.
If you don't want the empty <Choices/> elements, you may need to look into how you generate the XML from the dict. Since you don't provide an MCVE, we can't answer that part.
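A small sketch of the point about self-closing tags, using the standard library's xml.etree.ElementTree rather than lxml (an assumption made so the snippet runs without extra packages): an element without children is serialized in the short, self-closing form, while an element with children gets separate opening and closing tags.

```python
import xml.etree.ElementTree as ET

# An element with no children and no text.
empty = ET.Element("Choices", {"toName": "text"})

# The same element, but with one child.
filled = ET.Element("Choices", {"toName": "text"})
ET.SubElement(filled, "Choice", {"value": "modality1"})

empty_xml = ET.tostring(empty, encoding="unicode")
filled_xml = ET.tostring(filled, encoding="unicode")
print(empty_xml)   # self-closing form
print(filled_xml)  # opening tag, child, closing tag
```

So the trailing slash is a consequence of the element being empty, not something to be moved by hand.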

This is more a comment than an answer, but it's a bit too long for a comment. Looking at what you provide, it seems like the problem is not that your xml is too well formed (there's no such thing) or that the playground has some sort of weird xml structure. I believe the xml you generated is not what they are looking for.
If you look at your 2nd <Choices> element, it reads
<Choices toName="text" name="price"/>
Try dropping the closing / so it reads:
<Choices toName="text" name="price">
It will then be closed with the following <Choices/> and maybe it will work.

Related

Get parent node?

I write a script to delete unwanted objects from huge datasets by their id-prefix.
That's how these objects are structured:
<wfsext:Replace vendorId="AdV" safeToIgnore="false">
<AX_Anschrift gml:id="DENWAEDA0000001G20161222T083308Z">
<gml:identifier codeSpace="http://www.adv-online.de/">urn:adv:oid:DENWAEDA0000001G</gml:identifier>
...
</AX_Anschrift>
<ogc:Filter>
<ogc:FeatureId fid="DENWAEDA0000001G20161222T083308Z" />
</ogc:Filter>
</wfsext:Replace>
I would like to delete the full snippet within <wfsext:Replace>...</wfsext:Replace>.
Here is a code snippet from my script:
file = etree.parse(portion_file)
root = file.getroot()
nsmap = root.nsmap.copy()
nsmap['adv'] = nsmap.pop(None)
node = root.xpath(".//adv:geaenderteObjekte/wfs:Transaction", namespaces=nsmap)[0]
for t in node:
    for obj in t:
        objecttype = str(etree.QName(obj.tag).localname)
        if objecttype == 'Filter':
            pass
        else:
            objid = (obj.xpath('@gml:id', namespaces=nsmap))[0][:16]
            if debug:
                print('{} - {}'.format(objid[:16], objecttype))
            if objid[:6] != prefix:
                #parent = obj.getparent()
                t.remove(obj)
The t.remove(obj) removes <AX_Anschrift>..</AX_Anschrift> but not the rest of the object. I tried to get the parent node by using obj.getparent() but this gives me an error. How to catch it?
obj.getparent() is t, so you don't actually need to call getparent(), simply remove the entire object with:
node.remove(t)
or, if you want to remove the entire wfs:Transaction,
node.getparent().remove(node)
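A minimal sketch of removing whole blocks by id prefix. It uses the standard library's xml.etree.ElementTree and simplified, namespace-free tag names (Transaction, Replace, Obj), all of which are assumptions for illustration; the real data uses wfs:, wfsext: and gml: names. One detail matters in either library: iterate over a copy of the children, because removing from a parent while looping over it directly can skip siblings.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the real document (hypothetical tags and ids).
xml_text = """
<Transaction>
  <Replace><Obj id="AAA001"/><Filter/></Replace>
  <Replace><Obj id="BBB002"/><Filter/></Replace>
  <Replace><Obj id="BBB003"/><Filter/></Replace>
</Transaction>
"""
prefix = "AAA"
transaction = ET.fromstring(xml_text)

# list(...) takes a snapshot so that removals do not disturb the loop.
for replace in list(transaction):
    obj = replace.find("Obj")
    if obj is not None and not obj.get("id", "").startswith(prefix):
        transaction.remove(replace)  # drops the whole <Replace> block

remaining = [r.find("Obj").get("id") for r in transaction]
print(remaining)
```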

Python xml parsing etree find element X by position

I'm trying to parse the following xml to pull out certain data then eventually edit the data as needed.
Here is the xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CHECKLIST>
<VULN>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Num</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>V-38438</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Title</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>More text.</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Discuss</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Some text here</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>IA_Controls</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA></ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Ver</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Gen000000</ATTRIBUTE_DATA>
</STIG_DATA>
<STATUS>NotAFinding</STATUS>
<FINDING_DETAILS></FINDING_DETAILS>
<COMMENTS></COMMENTS>
<SEVERITY_OVERRIDE></SEVERITY_OVERRIDE>
<SEVERITY_JUSTIFICATION></SEVERITY_JUSTIFICATION>
</VULN>
The data I want to pull from this is the STATUS, the COMMENTS, and the ATTRIBUTE_DATA directly following the VULN_ATTRIBUTE that equals Rule_Ver.
So in this example I should get the following:
Gen000000 NotAFinding None
What I have so far is that I can get the Status and Comments easily, but I can't figure out the ATTRIBUTE_DATA portion. I can find the first one (Vuln_Num), but when I tried to add an index I got a "list index out of range" error.
This is where I'm at now.
import xml.etree.ElementTree as ET
doc = ET.parse('test.ckl')
root=doc.getroot()
TagList = doc.findall("./VULN")
for curTag in TagList:
    StatusTag = curTag.find("STATUS")
    CommentTag = curTag.find("COMMENTS")
    DataTag = curTag.find("./STIG_DATA/ATTRIBUTE_DATA")
    print("GEN:[%s] Status:[%s] Comments: %s" % (DataTag.text, StatusTag.text, CommentTag.text))
This gives the following output:
GEN:[V-38438] Status:[NotAFinding] Comments: None
I want:
GEN:[Gen000000] Status:[NotAFinding] Comments: None
So the end goal is to be able to parse hundreds of these and edit the comments field as needed. I don't think the editing part will be that hard once I get the right element.
Logically I see two ways of doing this. Either go to the ATTRIBUTE_DATA[5] and grab the text or find VULN_ATTRIBUTE == Rule_Ver then grab the next ATTRIBUTE_DATA.
I have tried doing this:
DataTag = curTag.find(".//STIG_DATA//ATTRIBUTE_DATA")[5]
and
DataTag[5].text
and both give me IndexError: list index out of range
I saw lxml had get_element_by_id and xpath, but I can't add modules to this system so it is etree for me.
Thanks in advance.
One can find an element by position, but you've used the incorrect XPath syntax. Either of the following lines should work:
DataTag = curTag.find("./STIG_DATA[5]/ATTRIBUTE_DATA") # Note: 5, not 4
DataTag = curTag.findall("./STIG_DATA/ATTRIBUTE_DATA")[4] # Note: 4, not 5
However, I strongly recommend against using that. There is no guarantee that the Rule_Ver instance of STIG_DATA is always the fifth item.
If you could change to lxml, then this works:
DataTag = curTag.xpath(
'./STIG_DATA/VULN_ATTRIBUTE[text()="Rule_Ver"]/../ATTRIBUTE_DATA')[0]
Since you can't use lxml, you must iterate the STIG_DATA elements by hand, like so:
def GetData(curTag):
    for stig in curTag.findall('STIG_DATA'):
        if stig.find('VULN_ATTRIBUTE').text == 'Rule_Ver':
            return stig.find('ATTRIBUTE_DATA')
Here is a complete program with error checking added to GetData():
import xml.etree.ElementTree as ET
doc = ET.parse('test.ckl')
root=doc.getroot()
TagList = doc.findall("./VULN")
def GetData(curTag):
    for stig in curTag.findall('STIG_DATA'):
        vuln = stig.find('VULN_ATTRIBUTE')
        if vuln is not None and vuln.text == 'Rule_Ver':
            data = stig.find('ATTRIBUTE_DATA')
            return data
for curTag in TagList:
    StatusTag = curTag.find("STATUS")
    CommentTag = curTag.find("COMMENTS")
    DataTag = GetData(curTag)
    print("GEN:[%s] Status:[%s] Comments: %s" % (DataTag.text, StatusTag.text, CommentTag.text))
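As a possible alternative to the hand-written loop, the limited XPath support in xml.etree.ElementTree (Python 2.7 and later) includes a [tag='text'] predicate, which selects elements that have a child with exactly that text. The sketch below assumes a shortened copy of the checklist as inline sample data; it is not the asker's file.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the question's XML.
xml_text = """
<CHECKLIST>
  <VULN>
    <STIG_DATA>
      <VULN_ATTRIBUTE>Vuln_Num</VULN_ATTRIBUTE>
      <ATTRIBUTE_DATA>V-38438</ATTRIBUTE_DATA>
    </STIG_DATA>
    <STIG_DATA>
      <VULN_ATTRIBUTE>Rule_Ver</VULN_ATTRIBUTE>
      <ATTRIBUTE_DATA>Gen000000</ATTRIBUTE_DATA>
    </STIG_DATA>
    <STATUS>NotAFinding</STATUS>
    <COMMENTS></COMMENTS>
  </VULN>
</CHECKLIST>
"""
root = ET.fromstring(xml_text)
results = []
for vuln in root.findall("./VULN"):
    # Pick the STIG_DATA whose VULN_ATTRIBUTE child has the text Rule_Ver.
    data = vuln.find("./STIG_DATA[VULN_ATTRIBUTE='Rule_Ver']/ATTRIBUTE_DATA")
    gen = data.text if data is not None else None
    results.append((gen, vuln.findtext("STATUS"), vuln.find("COMMENTS").text))
print(results)
```

This does not depend on Rule_Ver being the fifth STIG_DATA entry, and the None check covers a VULN that has no Rule_Ver at all.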
References:
https://stackoverflow.com/a/10836343/8747
http://lxml.de/xpathxslt.html#xpath

Python ElementTree

Having trouble with XML config files using ElementTree. I want to have an easy way to find the text of an element regardless of where it is in the XML Tree. From what the documentation says, I should be able to do this with findtext(), but no matter what, I get a return of None. Where am I going wrong here? Everyone was telling me XML is so simple to handle in Python, yet I have had nothing but troubles.
import os
import xml.etree.ElementTree as ET
configFileName = 'file.xml'
def configSet (x):
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        return root.findtext(x)
hiTemp = configSet('hiTemp')
print(hiTemp)
and the XML
<configData>
<units>
<temp>F</temp>
</units>
<pins>
<lights>1</lights>
<fan>2</fan>
<co2>3</co2>
</pins>
<events>
<airTemps>
<hiTemp>80</hiTemp>
<lowTemp>72</lowTemp>
<hiTempAlarm>84</hiTempAlarm>
</airTemps>
<CO2>
<co2Hi>1500</co2Hi>
<co2Low>1400</co2Low>
<co2Alarm>600</co2Alarm>
</CO2>
</events>
<settings>
<apikeys>
<prowl>
<apikey>None</apikey>
</prowl>
</apikeys>
</settings>
expected result
80
actual result
None
findtext with a bare tag name only searches the direct children of the element it is called on. hiTemp is nested several levels below the root, so root.findtext('hiTemp') returns None.
You can either provide a path that reaches the element, or modify your code to try every element in the tree:
def configSet(x):
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        for e in root.iter():  # iter() replaces the deprecated getiterator()
            t = e.findtext(x)
            if t is not None:
                return t
Update 1:
If you want to have all matched text as a list, the code is a bit different.
def configSet(x):
    matches = []
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        for e in root.iter():  # iter() replaces the deprecated getiterator()
            t = e.findtext(x)
            if t is not None:
                matches.append(t)
    return matches
You can use xpath to get to your desired element.
return root.find('./events/airTemps/hiTemp').text
There's easy to follow documentation here.
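A short sketch of the difference, assuming a trimmed copy of the config as inline sample data: a bare tag name only matches direct children, while the './/' prefix searches all descendants, so no manual loop is needed.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the question's config file.
xml_text = """
<configData>
  <events>
    <airTemps>
      <hiTemp>80</hiTemp>
      <lowTemp>72</lowTemp>
    </airTemps>
  </events>
</configData>
"""
root = ET.fromstring(xml_text)

direct = root.findtext("hiTemp")       # only looks at direct children of <configData>
anywhere = root.findtext(".//hiTemp")  # searches every level below the root
print(direct, anywhere)
```

With this, configSet could return root.findtext('.//' + x), at the cost of returning only the first match when a tag name occurs more than once.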

docutils/sphinx custom directive creating sibling section rather than child

Consider a reStructuredText document with this skeleton:
Main Title
==========
text text text text text
Subsection
----------
text text text text text
.. my-import-from:: file1
.. my-import-from:: file2
The my-import-from directive is provided by a document-specific Sphinx extension, which is supposed to read the file provided as its argument, parse reST embedded in it, and inject the result as a section in the current input file. (Like autodoc, but for a different file format.) The code I have for that, right now, looks like this:
class MyImportFromDirective(Directive):
    required_arguments = 1
    def run(self):
        src, srcline = self.state_machine.get_source_and_line()
        doc_file = os.path.normpath(os.path.join(os.path.dirname(src),
                                                 self.arguments[0]))
        self.state.document.settings.record_dependencies.add(doc_file)
        doc_text = ViewList()
        try:
            doc_text = extract_doc_from_file(doc_file)
        except EnvironmentError as e:
            raise self.error(e.filename + ": " + e.strerror) from e
        doc_section = nodes.section()
        doc_section.document = self.state.document
        # report line numbers within the nested parse correctly
        old_reporter = self.state.memo.reporter
        self.state.memo.reporter = AutodocReporter(doc_text,
                                                   self.state.memo.reporter)
        nested_parse_with_titles(self.state, doc_text, doc_section)
        self.state.memo.reporter = old_reporter
        if len(doc_section) == 1 and isinstance(doc_section[0], nodes.section):
            doc_section = doc_section[0]
        # If there was no title, synthesize one from the name of the file.
        if len(doc_section) == 0 or not isinstance(doc_section[0], nodes.title):
            doc_title = nodes.title()
            doc_title.append(make_title_text(doc_file))
            doc_section.insert(0, doc_title)
        return [doc_section]
This works, except that the new section is injected as a child of the current section, rather than a sibling. In other words, the example document above produces a TOC tree like this:
Main Title
    Subsection
        File1
        File2
instead of the desired
Main Title
    Subsection
    File1
    File2
How do I fix this? The Docutils documentation is ... inadequate, particularly regarding control of section depth. One obvious thing I have tried is returning doc_section.children instead of [doc_section]; that completely removes File1 and File2 from the TOC tree (but does make the section headers in the body of the document appear to be for the right nesting level).
I don't think it is possible to do this by returning the section from the directive (without doing something along the lines of what Florian suggested), as it will get appended to the 'current' section. You can, however, add the section via self.state.section as I do in the following (handling of options removed for brevity)
class FauxHeading(object):
    """
    A heading level that is not defined by a string. We need this to work with
    the mechanics of
    :py:meth:`docutils.parsers.rst.states.RSTState.check_subsection`.
    The important thing is that the length can vary, but it must be equal to
    any other instance of FauxHeading.
    """
    def __init__(self, length):
        self.length = length
    def __len__(self):
        return self.length
    def __eq__(self, other):
        return isinstance(other, FauxHeading)
class ParmDirective(Directive):
    required_arguments = 1
    optional_arguments = 0
    has_content = True
    option_spec = {
        'type': directives.unchanged,
        'precision': directives.nonnegative_int,
        'scale': directives.nonnegative_int,
        'length': directives.nonnegative_int}
    def run(self):
        variableName = self.arguments[0]
        lineno = self.state_machine.abs_line_number()
        secBody = None
        block_length = 0
        # added for some space
        lineBlock = nodes.line('', '', nodes.line_block())
        # parse the body of the directive
        if self.has_content and len(self.content):
            secBody = nodes.container()
            block_length += nested_parse_with_titles(
                self.state, self.content, secBody)
        # keeping track of the level seems to be required if we want to allow
        # nested content. Not sure why, but fits with the pattern in
        # :py:meth:`docutils.parsers.rst.states.RSTState.new_subsection`
        myLevel = self.state.memo.section_level
        self.state.section(
            variableName,
            '',
            FauxHeading(2 + len(self.options) + block_length),
            lineno,
            [lineBlock] if secBody is None else [lineBlock, secBody])
        self.state.memo.section_level = myLevel
        return []
I don't know how to do it directly inside your custom directive. However, you can use a custom transform to raise the File1 and File2 nodes in the tree after parsing. For example, see the transforms in the docutils.transforms.frontmatter module.
In your Sphinx extension, use the Sphinx.add_transform method to register the custom transform.
Update: You can also directly register the transform in your directive by returning one or more instances of the docutils.nodes.pending class in your node list. Make sure to call the note_pending method of the document in that case (in your directive you can get the document via self.state_machine.document).

Editing XML as a dictionary in python?

I'm trying to generate customized xml files from a template xml file in python.
Conceptually, I want to read in the template xml, remove some elements, change some text attributes, and write the new xml out to a file. I wanted it to work something like this:
conf_base = ConvertXmlToDict('config-template.xml')
conf_base_dict = conf_base.UnWrap()
del conf_base_dict['root-name']['level1-name']['leaf1']
del conf_base_dict['root-name']['level1-name']['leaf2']
conf_new = ConvertDictToXml(conf_base_dict)
Now I want to write to a file, but I don't see how to get to ElementTree.ElementTree.write():
conf_new.write('config-new.xml')
Is there some way to do this, or can someone suggest doing this a different way?
This will get you a dict minus attributes. I don't know if this is useful to anyone, but I was looking for an XML-to-dict solution myself when I came up with it.
import xml.etree.ElementTree as etree
tree = etree.parse('test.xml')
root = tree.getroot()
def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    children = list(el)  # getchildren() was removed in Python 3.9
    if children:
        # a list rather than map(), so the result prints the same on Python 3
        d[el.tag] = [xml_to_dict(child) for child in children]
    return d
This: http://www.w3schools.com/XML/note.xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Would equal this:
{'note': [{'to': 'Tove'},
{'from': 'Jani'},
{'heading': 'Reminder'},
{'body': "Don't forget me this weekend!"}]}
I'm not sure if converting the info set to nested dicts first is easier. Using ElementTree, you can do this:
import xml.etree.ElementTree as ET
doc = ET.parse("template.xml")
lvl1 = doc.findall("level1-name")[0]
lvl1.remove(lvl1.find("leaf1"))
lvl1.remove(lvl1.find("leaf2"))
# or use del lvl1[idx]
doc.write("config-new.xml")
ElementTree was designed so that you don't have to convert your XML trees to lists and attributes first, since it uses exactly that internally.
It also supports a small subset of XPath.
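A self-contained sketch of that approach, assuming a small inline template with the tag names from the question (root-name, level1-name, leaf1, leaf2) plus a hypothetical leaf3 to show a text change. It guards against a missing leaf, since remove() fails when find() returns None.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for config-template.xml (leaf3 is a hypothetical extra element).
template = """
<root-name>
  <level1-name>
    <leaf1>a</leaf1>
    <leaf2>b</leaf2>
    <leaf3>c</leaf3>
  </level1-name>
</root-name>
"""
root = ET.fromstring(template)
level1 = root.find("level1-name")

for name in ("leaf1", "leaf2"):
    leaf = level1.find(name)
    if leaf is not None:       # find() returns None when the tag is absent
        level1.remove(leaf)

level1.find("leaf3").text = "changed"  # edit text in place

result = ET.tostring(root, encoding="unicode")
print(result)
# To save to disk instead: ET.ElementTree(root).write("config-new.xml")
```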
For easy manipulation of XML in python, I like the Beautiful Soup library. It works something like this:
Sample XML File:
<root>
<level1>leaf1</level1>
<level2>leaf2</level2>
</root>
Python code:
from BeautifulSoup import BeautifulStoneSoup, Tag, NavigableString
soup = BeautifulStoneSoup(open('config-template.xml')) # parse the xml file (pass a file object, not the file name)
soup.contents[0].name
# u'root'
You can use the node names as methods:
soup.root.contents[0].name
# u'level1'
It is also possible to use regexes:
import re
tags_starting_with_level = soup.findAll(re.compile('^level'))
for tag in tags_starting_with_level: print tag.name
# level1
# level2
Adding and inserting new nodes is pretty straightforward:
# build and insert a new level with a new leaf
level3 = Tag(soup, 'level3')
level3.insert(0, NavigableString('leaf3'))
soup.root.insert(2, level3)
print soup.prettify()
# <root>
# <level1>
# leaf1
# </level1>
# <level2>
# leaf2
# </level2>
# <level3>
# leaf3
# </level3>
# </root>
My modification of Daniel's answer, to give a marginally neater dictionary:
def xml_to_dictionary(element):
    l = len(namespace)
    dictionary = {}
    tag = element.tag[l:]
    if element.text:
        if (element.text == ' '):
            dictionary[tag] = {}
        else:
            dictionary[tag] = element.text
    children = list(element)  # getchildren() was removed in Python 3.9
    if children:
        subdictionary = {}
        for child in children:
            for k,v in xml_to_dictionary(child).items():
                if k in subdictionary:
                    if ( isinstance(subdictionary[k], list)):
                        subdictionary[k].append(v)
                    else:
                        subdictionary[k] = [subdictionary[k], v]
                else:
                    subdictionary[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = subdictionary
        else:
            dictionary[tag] = [dictionary[tag], subdictionary]
    if element.attrib:
        attribs = {}
        for k,v in element.attrib.items():
            attribs[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = attribs
        else:
            dictionary[tag] = [dictionary[tag], attribs]
    return dictionary
namespace is the xmlns string, including braces, that ElementTree prepends to all tags. I strip it here because the entire document uses a single namespace.
Note that I adjusted the raw XML too, so that 'empty' tags produce at most a ' ' text property in the ElementTree representation:
spacepattern = re.compile(r'\s+')
mydictionary = xml_to_dictionary(ElementTree.XML(spacepattern.sub(' ', content)))
would give for instance
{'note': {'to': 'Tove',
'from': 'Jani',
'heading': 'Reminder',
'body': "Don't forget me this weekend!"}}
It's designed for specific XML that is basically equivalent to JSON. It should also handle element attributes such as
<elementName attributeName='attributeContent'>elementContent</elementName>
There's the possibility of merging the attribute dictionary and the subtag dictionary, similarly to how repeated subtags are merged, although nesting the lists seems kind of appropriate :-)
Adding this line
d.update(('#' + k, v) for k, v in el.attrib.items())
to user247686's code lets you capture node attributes too.
Found it in this post https://stackoverflow.com/a/7684581/1395962
Example:
import xml.etree.ElementTree as etree
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
xml_file = "http://your_xml_url"
tree = etree.parse(urlopen(xml_file))
root = tree.getroot()
def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    children = list(el)  # getchildren() was removed in Python 3.9
    if children:
        d[el.tag] = [xml_to_dict(child) for child in children]
    d.update(('#' + k, v) for k, v in el.attrib.items())
    return d
Call as
xml_to_dict(root)
Have you tried this?
print xml.etree.ElementTree.tostring( conf_new )
The most direct way, to me:
root = ET.parse(xh)
data = root.getroot()
xdic = {}
if data is not None:
    for part in data:
        xdic[part.tag] = part.text
XML has a rich infoset, and it takes some special tricks to represent that in a Python dictionary. Elements are ordered, attributes are distinguished from element bodies, etc.
One project to handle round-trips between XML and Python dictionaries, with some configuration options to handle the tradeoffs in different ways is XML Support in Pickling Tools. Version 1.3 and newer is required. It isn't pure Python (and in fact is designed to make C++ / Python interaction easier), but it might be appropriate for various use cases.
