Working with xml and exporting names of nodes

I wrote this code below. In my XML file I have nodes:
Assembly_1, Detail_1, Detail_2, Assembly_2, Detail_3
What I am trying to do is to get the name of the assembly for each detail (Detail_1 and 2 would be in Assembly_1, etc.)
I have a lot of details, more than 200. This function works, but it takes a long time because the XML file is loaded on every call.
How can I make it run faster?
def CorrectAssembly(detail):
    from xml.dom import minidom
    xml_path = r"C:\Users\vblagoje\test_python_s2k\Load_Independent_Results\HSB53111-01-D_2008_v2-Final-Test-Cases_All_1.1.xml"
    mydoc=minidom.parse(xml_path)
    root = mydoc.getElementsByTagName("FEST2000")
    assembly=""
    for node in root:
        for childNodes in node.childNodes:
            if childNodes.nodeType == childNodes.TEXT_NODE: continue
            if childNodes.nodeName == "ASSEMBLY":
                assembly = childNodes.getAttribute("NAME")
            if childNodes.nodeName == "DETAIL":
                if detail == childNodes.getAttribute("NAME"):
                    break
    return assembly

One solution is to read the XML file once, before looking up any details.
Something along these lines:
from xml.dom import minidom

def CorrectAssembly(detail, root):
    assembly=""
    for node in root:
        for childNodes in node.childNodes:
            if childNodes.nodeType == childNodes.TEXT_NODE: continue
            if childNodes.nodeName == "ASSEMBLY":
                assembly = childNodes.getAttribute("NAME")
            if childNodes.nodeName == "DETAIL":
                if detail == childNodes.getAttribute("NAME"):
                    break
    return assembly

# parse the file once and reuse the result for every lookup
xml_path = r"C:\Users\vblagoje\test_python_s2k\Load_Independent_Results\HSB53111-01-D_2008_v2-Final-Test-Cases_All_1.1.xml"
mydoc=minidom.parse(xml_path)
root = mydoc.getElementsByTagName("FEST2000")
aDetail = "myDetail"
assembly = CorrectAssembly(aDetail, root)
anotherDetail = "myDetail2"
assembly = CorrectAssembly(anotherDetail , root)
# and so on
You still go through (part of) the loaded XML every time you call the function, though. It may be beneficial to build a dictionary that maps each detail to its assembly once, and then simply look details up when you need them:
from xml.dom import minidom
# read the xml
xml_path = r"C:\Users\vblagoje\test_python_s2k\Load_Independent_Results\HSB53111-01-D_2008_v2-Final-Test-Cases_All_1.1.xml"
mydoc=minidom.parse(xml_path)
root = mydoc.getElementsByTagName("FEST2000")
detail_assembly_map = {}
assembly = ""  # used if a DETAIL appears before any ASSEMBLY
# fill the dictionary
for node in root:
    for childNodes in node.childNodes:
        if childNodes.nodeType == childNodes.TEXT_NODE: continue
        if childNodes.nodeName == "ASSEMBLY":
            assembly = childNodes.getAttribute("NAME")
        if childNodes.nodeName == "DETAIL":
            detail_assembly_map[childNodes.getAttribute("NAME")] = assembly
# use it
aDetail = "myDetail"
assembly = detail_assembly_map[aDetail]
From your post it is not really clear how the XML is structured, but if the details are children of the assemblies, the mapping could be built differently: iterate over the assembly nodes first, and within each one over its detail children. Then you would not rely on the elements being in the proper order.
This post could be helpful too, depending on the structure of your XML-tree.
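For the nested layout mentioned above (details as children of their assembly), the mapping might be built roughly like this. This is only a sketch: the sample document and the helper name build_detail_assembly_map are made up for illustration, and the real file may be structured differently.

```python
from xml.dom import minidom

# Hypothetical sample: each DETAIL is nested inside its ASSEMBLY.
SAMPLE = """<FEST2000>
  <ASSEMBLY NAME="Assembly_1">
    <DETAIL NAME="Detail_1"/>
    <DETAIL NAME="Detail_2"/>
  </ASSEMBLY>
  <ASSEMBLY NAME="Assembly_2">
    <DETAIL NAME="Detail_3"/>
  </ASSEMBLY>
</FEST2000>"""

def build_detail_assembly_map(doc):
    """Map every detail name to the name of the assembly that contains it."""
    mapping = {}
    for assembly in doc.getElementsByTagName("ASSEMBLY"):
        assembly_name = assembly.getAttribute("NAME")
        # getElementsByTagName searches all descendants of this assembly
        for detail in assembly.getElementsByTagName("DETAIL"):
            mapping[detail.getAttribute("NAME")] = assembly_name
    return mapping

detail_assembly_map = build_detail_assembly_map(minidom.parseString(SAMPLE))
print(detail_assembly_map["Detail_3"])
```

Because each detail is read from inside its own assembly element, the result no longer depends on the order in which siblings appear.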

Related

lxml (etree) - Pretty Print attributes of root tag

Is it possible in python to pretty print the root's attributes?
I used etree to extend the attributes of the child tag and then I had overwritten the existing file with the new content. However during the first generation of the XML, we were using a template where the attributes of the root tag were listed one per line and now with the etree I don't manage to achieve the same result.
I found similar questions but they were all referring to the tutorial of etree, which I find incomplete.
Hopefully someone has found a solution for this using etree.
EDIT: This is for custom XML so HTML Tidy (which was proposed in the comments), doesn't work for this.
Thanks!
generated_descriptors = list_generated_files(generated_descriptors_folder)
counter = 0
for g in generated_descriptors:
    if counter % 20 == 0:
        print "Extending Descriptor # %s out of %s" % (counter, len(descriptor_attributes))
    with open(generated_descriptors_folder + "\\" + g, 'r+b') as descriptor:
        root = etree.XML(descriptor.read(), parser=parser)
        # Go through every ContextObject to check if the block is mandatory
        for context_object in root.findall('ContextObject'):
            for attribs in descriptor_attributes:
                if attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'false')
                elif attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] not in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'true')
        # Sort the ContextObjects based on allow-null and their name
        context_objects = root.findall('ContextObject')
        context_objects_sorted = sorted(context_objects, key=lambda c: (c.attrib['allow-null'], c.attrib['name']))
        root[:] = context_objects_sorted
        # Remove mandatoryobjects from Descriptor attributes and pretty print
        root.attrib.pop("mandatoryobjects", None)
        # paste new line here
        # Convert to string in order to write the enhanced descriptor
        xml = etree.tostring(root, pretty_print=True, encoding="UTF-8", xml_declaration=True)
        # Write the enhanced descriptor
        descriptor.seek(0) # Set cursor at beginning of the file
        descriptor.truncate(0) # Make sure that file is empty
        descriptor.write(xml)
        descriptor.close()
    counter+=1

Python ElementTree

Having trouble with XML config files using ElementTree. I want to have an easy way to find the text of an element regardless of where it is in the XML Tree. From what the documentation says, I should be able to do this with findtext(), but no matter what, I get a return of None. Where am I going wrong here? Everyone was telling me XML is so simple to handle in Python, yet I have had nothing but troubles.
import os
import xml.etree.ElementTree as ET

configFileName = 'file.xml'
def configSet (x):
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        return root.findtext(x)
hiTemp = configSet('hiTemp')
print(hiTemp)
and the XML
<configData>
  <units>
    <temp>F</temp>
  </units>
  <pins>
    <lights>1</lights>
    <fan>2</fan>
    <co2>3</co2>
  </pins>
  <events>
    <airTemps>
      <hiTemp>80</hiTemp>
      <lowTemp>72</lowTemp>
      <hiTempAlarm>84</hiTempAlarm>
    </airTemps>
    <CO2>
      <co2Hi>1500</co2Hi>
      <co2Low>1400</co2Low>
      <co2Alarm>600</co2Alarm>
    </CO2>
  </events>
  <settings>
    <apikeys>
      <prowl>
        <apikey>None</apikey>
      </prowl>
    </apikeys>
  </settings>
</configData>
expected result
80
actual result
None

findtext() with a bare tag name only searches the direct children of the element it is called on. hiTemp is nested two levels below the root (events/airTemps/hiTemp), so root.findtext('hiTemp') returns None.
You can either pass a path that reaches the element (for example './/hiTemp', which searches all descendants) or modify your code to check every element in the tree:
def configSet(x):
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        # iter() walks the whole tree; getiterator() was removed in Python 3.9
        for e in root.iter():
            t = e.findtext(x)
            if t is not None:
                return t
Update 1:
If you want to have all matched text as a list, the code is a bit different.
def configSet(x):
    matches = []
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        for e in root.iter():
            t = e.findtext(x)
            if t is not None:
                matches.append(t)
    return matches
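The same lookups can also be written with ElementTree's limited path syntax instead of an explicit loop. The sketch below assumes a trimmed copy of the config from the question, parsed from a string so it runs on its own; find_text_anywhere is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, inline copy of the config from the question (assumption: same nesting).
CONFIG = """<configData>
  <events>
    <airTemps>
      <hiTemp>80</hiTemp>
      <lowTemp>72</lowTemp>
    </airTemps>
  </events>
</configData>"""

def find_text_anywhere(root, tag):
    """Return the text of the first descendant named `tag`, or None if absent."""
    # './/' tells ElementTree to search all levels below the current element.
    return root.findtext(".//" + tag)

root = ET.fromstring(CONFIG)
hi_temp = find_text_anywhere(root, "hiTemp")
# iter(tag) yields every matching element, which covers the "all matches" case.
all_hi_temps = [element.text for element in root.iter("hiTemp")]
print(hi_temp, all_hi_temps)
```

Note that the returned text is a string, so it would still need int() or float() before being used as a temperature.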

You can use xpath to get to your desired element.
return root.find('./events/airTemps/hiTemp').text
There's easy to follow documentation here.

python reporting line/column of origin of XML node

I'm currently using xml.dom.minidom to parse some XML in python. After parsing, I'm doing some reporting on the content, and would like to report the line (and column) where the tag started in the source XML document, but I don't see how that's possible.
I'd like to stick with xml.dom / xml.dom.minidom if possible, but if I need to use a SAX parser to get the origin info, I can do that -- ideal in that case would be using SAX to track node location, but still end up with a DOM for my post-processing.
Any suggestions on how to do this? Hopefully I'm just overlooking something in the docs and this is extremely easy.

By monkeypatching the minidom content handler I was able to record line and column number for each node (as the 'parse_position' attribute). It's a little dirty, but I couldn't see any "officially sanctioned" way of doing it :) Here's my test script:
from xml.dom import minidom
import xml.sax

doc = """\
<File>
  <name>Name</name>
  <pos>./</pos>
</File>
"""

def set_content_handler(dom_handler):
    # Wrap the DOM builder's startElementNS so every new element
    # records the parser position at which it was opened.
    def startElementNS(name, tagName, attrs):
        orig_start_cb(name, tagName, attrs)
        cur_elem = dom_handler.elementStack[-1]
        cur_elem.parse_position = (
            parser._parser.CurrentLineNumber,
            parser._parser.CurrentColumnNumber
        )

    orig_start_cb = dom_handler.startElementNS
    dom_handler.startElementNS = startElementNS
    orig_set_content_handler(dom_handler)

parser = xml.sax.make_parser()
orig_set_content_handler = parser.setContentHandler
parser.setContentHandler = set_content_handler

dom = minidom.parseString(doc, parser)
pos = dom.firstChild.parse_position
print("Parent: '{0}' at {1}:{2}".format(
    dom.firstChild.localName, pos[0], pos[1]))
for child in dom.firstChild.childNodes:
    if child.localName is None:
        # text nodes have no localName and no recorded position
        continue
    pos = child.parse_position
    print("Child: '{0}' at {1}:{2}".format(child.localName, pos[0], pos[1]))
It outputs the following:
Parent: 'File' at 1:0
Child: 'name' at 2:2
Child: 'pos' at 3:2

A different way to hack around the problem is by patching line number information into the document before parsing it. Here's the idea:
import re
from xml.dom import minidom

LINE_DUMMY_ATTR = '_DUMMY_LINE' # Make sure this string is unique!

def parseXml(filename):
    content = []
    # open() is the built-in; there is no file.open()
    with open(filename, 'r') as f:
        for l, line in enumerate(f, 1):
            # add the line number as an attribute to every opening tag on this line
            content.append(re.sub(r'<(\w+)', r'<\1 ' + LINE_DUMMY_ATTR + '="' + str(l) + '"', line))
    return minidom.parseString("".join(content))
Then you can retrieve the line number of an element with
int(element.getAttribute(LINE_DUMMY_ATTR))
Quite clearly, this approach has its own set of drawbacks, and if you really need column numbers too, patching that in will be somewhat more involved. Also, if you want to extract text nodes or comments, or use Node.toxml(), you'll have to make sure to strip out LINE_DUMMY_ATTR from any accidental matches there.
The one advantage of this solution over aknuds1's answer is that it does not require messing with minidom internals.
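The stripping step mentioned above might look something like the following. It is a sketch that works on an in-memory string rather than a file; the names LINE_ATTR, parse_with_line_numbers and strip_line_numbers are hypothetical, and the regular expression carries the same caveat about accidental matches in text, comments or CDATA.

```python
import re
from xml.dom import minidom

LINE_ATTR = "_DUMMY_LINE"  # hypothetical helper attribute; must not clash with real ones

def parse_with_line_numbers(text):
    """Tag every opening element with the line it starts on, then parse."""
    tagged = []
    for lineno, line in enumerate(text.splitlines(True), start=1):
        replacement = r'<\1 %s="%d"' % (LINE_ATTR, lineno)
        tagged.append(re.sub(r"<(\w+)", replacement, line))
    return minidom.parseString("".join(tagged))

def strip_line_numbers(dom):
    """Remove the helper attribute so it does not leak into serialized output."""
    for element in dom.getElementsByTagName("*"):
        if element.hasAttribute(LINE_ATTR):
            element.removeAttribute(LINE_ATTR)

SOURCE = "<File>\n  <name>Name</name>\n  <pos>./</pos>\n</File>\n"
dom = parse_with_line_numbers(SOURCE)
# Collect the recorded line for every element before cleaning up.
lines = {e.tagName: int(e.getAttribute(LINE_ATTR))
         for e in dom.getElementsByTagName("*")}
strip_line_numbers(dom)
print(lines)
```

Collecting the positions into a separate dictionary first means the DOM can be cleaned before any post-processing or call to toxml().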

XML Parsing in Python using document builder factory

I am working with STAF and STAX, where Python is used for scripting. I am new to Python.
Basically, my task is to parse an XML file in Python using the DocumentBuilderFactory parser.
The XML file I am trying to parse is:
<?xml version="1.0" encoding="utf-8"?>
<operating_system>
  <unix_80sp1>
    <tests type="quick_sanity_test">
      <prerequisitescript>preparequicksanityscript</prerequisitescript>
      <acbuildpath>acbuildpath</acbuildpath>
      <testsuitscript>test quick sanity script</testsuitscript>
      <testdir>quick sanity dir</testdir>
    </tests>
    <machine_name>u80sp1_L004</machine_name>
    <machine_name>u80sp1_L005</machine_name>
    <machine_name>xyz.pxy.dxe.cde</machine_name>
    <vmware id="155.35.3.55">144.35.3.90</vmware>
    <vmware id="155.35.3.56">144.35.3.91</vmware>
  </unix_80sp1>
</operating_system>
I need to read all the tags.
For the machine_name tags, I need to read them into a list named machname, so after reading the tags machname should be [u80sp1_L004, u80sp1_L005, xyz.pxy.dxe.cde].
I also need all the vmware tags:
all id attributes should end up in vmware_attr = [155.35.3.55, 155.35.3.56]
all vmware values should end up in vmware_value = [144.35.3.90, 144.35.3.91]
I am able to read all tags properly except the vmware and machine_name tags.
I am using the following code (I am new to XML and VMware) and it needs to be modified. Help required.
factory = DocumentBuilderFactory.newInstance();
factory.setValidating(1)
factory.setIgnoringElementContentWhitespace(0)
builder = factory.newDocumentBuilder()
document = builder.parse(xmlFileName)
vmware_value = None
vmware_attr = None
machname = None
# Get the text value for the element with tag name "vmware"
nodeList = document.getElementsByTagName("vmware")
for i in range(nodeList.getLength()):
    node = nodeList.item(i)
    if node.getNodeType() == Node.ELEMENT_NODE:
        children = node.getChildNodes()
        for j in range(children.getLength()):
            thisChild = children.item(j)
            if (thisChild.getNodeType() == Node.TEXT_NODE):
                vmware_value = thisChild.getNodeValue()
                # vmware_attr = ???  (which method should be used here?)
# Get the text value for the element with tag name "machine_name"
nodeList = document.getElementsByTagName("machine_name")
for i in range(nodeList.getLength()):
    node = nodeList.item(i)
    if node.getNodeType() == Node.ELEMENT_NODE:
        children = node.getChildNodes()
        for j in range(children.getLength()):
            thisChild = children.item(j)
            if (thisChild.getNodeType() == Node.TEXT_NODE):
                machname = thisChild.getNodeValue()
Also, how do I check whether a tag exists at all? I need to code the parsing properly.

You need to initialize vmware_value, vmware_attr and machname as lists rather than as None, so instead of this:
vmware_value = None
vmware_attr = None
machname = None
do this:
vmware_value = []
vmware_attr = []
machname = []
Then, to add items to the list, use the append method on your lists. E.g.:
factory = DocumentBuilderFactory.newInstance();
factory.setValidating(1)
factory.setIgnoringElementContentWhitespace(0)
builder = factory.newDocumentBuilder()
document = builder.parse(xmlFileName)
vmware_value = []
vmware_attr = []
machname = []
# Get the text value for the element with tag name "vmware"
nodeList = document.getElementsByTagName("vmware")
for i in range(nodeList.getLength()):
    node = nodeList.item(i)
    # collect the id attribute of every vmware element
    vmware_attr.append(node.attributes["id"].value)
    if node.getNodeType() == Node.ELEMENT_NODE:
        children = node.getChildNodes()
        for j in range(children.getLength()):
            thisChild = children.item(j)
            if (thisChild.getNodeType() == Node.TEXT_NODE):
                vmware_value.append(thisChild.getNodeValue())
I've also edited the code to something I think should work to append the correct values to vmware_attr and vmware_value.
I had to assume that STAX uses xml.dom syntax. If that isn't the case, you will have to adapt my suggestion; with the Java DOM API that DocumentBuilderFactory belongs to, for example, the attribute would be read with node.getAttribute("id").
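For comparison, here is how the three lists and the "does this tag exist" check might look with the standard library's xml.dom.minidom. This is a sketch under the assumption that plain Python is available; it does not use the Java DocumentBuilderFactory API from the question, the sample is a trimmed copy of the posted XML, and element_text is a hypothetical helper.

```python
from xml.dom import minidom

# Trimmed copy of the XML from the question (the <tests> block is left out on purpose).
SAMPLE = """<operating_system>
  <unix_80sp1>
    <machine_name>u80sp1_L004</machine_name>
    <machine_name>u80sp1_L005</machine_name>
    <machine_name>xyz.pxy.dxe.cde</machine_name>
    <vmware id="155.35.3.55">144.35.3.90</vmware>
    <vmware id="155.35.3.56">144.35.3.91</vmware>
  </unix_80sp1>
</operating_system>"""

def element_text(node):
    """Concatenate the direct text children of an element."""
    return "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)

doc = minidom.parseString(SAMPLE)
machname = [element_text(n) for n in doc.getElementsByTagName("machine_name")]
vmware_nodes = doc.getElementsByTagName("vmware")
vmware_attr = [n.getAttribute("id") for n in vmware_nodes]
vmware_value = [element_text(n) for n in vmware_nodes]

# A tag "exists" if getElementsByTagName returns at least one element.
has_tests = len(doc.getElementsByTagName("tests")) > 0
print(machname, vmware_attr, vmware_value, has_tests)
```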

Editing XML as a dictionary in python?

I'm trying to generate customized xml files from a template xml file in python.
Conceptually, I want to read in the template xml, remove some elements, change some text attributes, and write the new xml out to a file. I wanted it to work something like this:
conf_base = ConvertXmlToDict('config-template.xml')
conf_base_dict = conf_base.UnWrap()
del conf_base_dict['root-name']['level1-name']['leaf1']
del conf_base_dict['root-name']['level1-name']['leaf2']
conf_new = ConvertDictToXml(conf_base_dict)
now I want to write to file, but I don't see how to get to
ElementTree.ElementTree.write()
conf_new.write('config-new.xml')
Is there some way to do this, or can someone suggest doing this a different way?

This'll get you a dict minus attributes. I don't know, if this is useful to anyone. I was looking for an xml to dict solution myself, when I came up with this.
import xml.etree.ElementTree as etree
tree = etree.parse('test.xml')
root = tree.getroot()
def xml_to_dict(el):
    d={}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    # list(el) replaces el.getchildren(), which was removed in Python 3.9
    children = list(el)
    if children:
        d[el.tag] = [xml_to_dict(child) for child in children]
    return d
This: http://www.w3schools.com/XML/note.xml
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
Would equal this:
{'note': [{'to': 'Tove'},
          {'from': 'Jani'},
          {'heading': 'Reminder'},
          {'body': "Don't forget me this weekend!"}]}

I'm not sure if converting the info set to nested dicts first is easier. Using ElementTree, you can do this:
import xml.etree.ElementTree as ET
doc = ET.parse("template.xml")
lvl1 = doc.findall("level1-name")[0]
lvl1.remove(lvl1.find("leaf1"))
lvl1.remove(lvl1.find("leaf2"))
# or use del lvl1[idx]
doc.write("config-new.xml")
ElementTree was designed so that you don't have to convert your XML trees to lists and attributes first, since it uses exactly that internally.
It also supports a small subset of XPath.
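The remove-and-edit workflow from the question can be tried out on a small inline template. The element names below follow the question's example; the template contents and the output filename are assumptions, so treat this as a sketch rather than the exact structure of the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical template using the element names from the question.
TEMPLATE = """<root-name>
  <level1-name>
    <leaf1>a</leaf1>
    <leaf2>b</leaf2>
    <leaf3>c</leaf3>
  </level1-name>
</root-name>"""

root = ET.fromstring(TEMPLATE)
lvl1 = root.find("level1-name")

for tag in ("leaf1", "leaf2"):
    leaf = lvl1.find(tag)
    if leaf is not None:  # find() returns None when the tag is absent
        lvl1.remove(leaf)

# Change the text of an element that stays in the tree.
lvl1.find("leaf3").text = "changed"

remaining = [child.tag for child in lvl1]
print(ET.tostring(root, encoding="unicode"))
# To write a file instead: ET.ElementTree(root).write("config-new.xml")
```

Checking for None before calling remove() avoids an error when a template does not contain one of the leaves.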

For easy manipulation of XML in python, I like the Beautiful Soup library. It works something like this:
Sample XML File:
<root>
<level1>leaf1</level1>
<level2>leaf2</level2>
</root>
Python code:
from BeautifulSoup import BeautifulStoneSoup, Tag, NavigableString
soup = BeautifulStoneSoup(open('config-template.xml')) # parse the xml file (a bare filename string would be parsed as markup)
soup.contents[0].name
# u'root'
You can use the node names as methods:
soup.root.contents[0].name
# u'level1'
It is also possible to use regexes:
import re
tags_starting_with_level = soup.findAll(re.compile('^level'))
for tag in tags_starting_with_level: print tag.name
# level1
# level2
Adding and inserting new nodes is pretty straightforward:
# build and insert a new level with a new leaf
level3 = Tag(soup, 'level3')
level3.insert(0, NavigableString('leaf3'))
soup.root.insert(2, level3)
print soup.prettify()
# <root>
# <level1>
# leaf1
# </level1>
# <level2>
# leaf2
# </level2>
# <level3>
# leaf3
# </level3>
# </root>

My modification of Daniel's answer, to give a marginally neater dictionary:
def xml_to_dictionary(element):
    l = len(namespace)
    dictionary={}
    tag = element.tag[l:]
    if element.text:
        if (element.text == ' '):
            dictionary[tag] = {}
        else:
            dictionary[tag] = element.text
    # list(element) replaces element.getchildren(), removed in Python 3.9
    children = list(element)
    if children:
        subdictionary = {}
        for child in children:
            for k,v in xml_to_dictionary(child).items():
                if k in subdictionary:
                    # repeated subtags are collected into a list
                    if ( isinstance(subdictionary[k], list)):
                        subdictionary[k].append(v)
                    else:
                        subdictionary[k] = [subdictionary[k], v]
                else:
                    subdictionary[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = subdictionary
        else:
            dictionary[tag] = [dictionary[tag], subdictionary]
    if element.attrib:
        attribs = {}
        for k,v in element.attrib.items():
            attribs[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = attribs
        else:
            dictionary[tag] = [dictionary[tag], attribs]
    return dictionary
namespace is the xmlns string, including braces, that ElementTree prepends to all tags. I strip it here because there is one namespace for the entire document.
Note that I adjusted the raw XML too, so that 'empty' tags produce at most a ' ' text property in the ElementTree representation:
spacepattern = re.compile(r'\s+')
mydictionary = xml_to_dictionary(ElementTree.XML(spacepattern.sub(' ', content)))
This would give, for instance:
{'note': {'to': 'Tove',
          'from': 'Jani',
          'heading': 'Reminder',
          'body': "Don't forget me this weekend!"}}
It is designed for specific XML that is basically equivalent to JSON, and should also handle element attributes such as
<elementName attributeName='attributeContent'>elementContent</elementName>
There is the possibility of merging the attribute dictionary and the subtag dictionary in the same way repeated subtags are merged, although nesting the lists seems kind of appropriate :-)

Adding this line
d.update(('#' + k, v) for k, v in el.attrib.items())
to user247686's code gives you node attributes too.
Found it in this post https://stackoverflow.com/a/7684581/1395962
Example:
import xml.etree.ElementTree as etree
from urllib.request import urlopen  # Python 3; in Python 2 this was "from urllib import urlopen"
xml_file = "http://your_xml_url"
tree = etree.parse(urlopen(xml_file))
root = tree.getroot()
def xml_to_dict(el):
    d={}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    children = list(el)
    if children:
        d[el.tag] = [xml_to_dict(child) for child in children]
    # attributes are stored next to the tag, prefixed with '#'
    d.update(('#' + k, v) for k, v in el.attrib.items())
    return d
Call as
xml_to_dict(root)
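To see what the attribute-aware version produces without fetching a URL, the same function can be run on an inline string. The sample XML below is made up for illustration.

```python
import xml.etree.ElementTree as ET

def xml_to_dict(el):
    """Convert an element to a dict; attributes are stored under '#name' keys."""
    d = {}
    d[el.tag] = el.text if el.text else {}
    children = list(el)
    if children:
        d[el.tag] = [xml_to_dict(child) for child in children]
    d.update(("#" + k, v) for k, v in el.attrib.items())
    return d

# Hypothetical sample with attributes on the root and on one child.
SAMPLE = '<note id="n1"><to lang="en">Tove</to><from>Jani</from></note>'
result = xml_to_dict(ET.fromstring(SAMPLE))
print(result)
```

Since attributes share a dictionary with the tag key, the '#' prefix is what keeps an attribute from colliding with an element of the same name.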

Have you tried this?
print(xml.etree.ElementTree.tostring(conf_new))

The most direct way, to me:
root = ET.parse(xh)
data = root.getroot()
xdic = {}
if data is not None:
    # iterating an element yields its direct children
    for part in data:
        xdic[part.tag] = part.text

XML has a rich infoset, and it takes some special tricks to represent that in a Python dictionary. Elements are ordered, attributes are distinguished from element bodies, etc.
One project to handle round-trips between XML and Python dictionaries, with some configuration options to handle the tradeoffs in different ways is XML Support in Pickling Tools. Version 1.3 and newer is required. It isn't pure Python (and in fact is designed to make C++ / Python interaction easier), but it might be appropriate for various use cases.
