Empty XML element handling in Python - python

I'm puzzled by how the minidom parser handles an empty element, as shown in the following code.
import xml.dom.minidom
doc = xml.dom.minidom.parseString('<value></value>')
print(repr(doc.firstChild.nodeValue))
# Out: None
print(doc.firstChild.toxml())
# Out: <value/>
doc = xml.dom.minidom.Document()
v = doc.appendChild(doc.createElement('value'))
v.appendChild(doc.createTextNode(''))
print(repr(v.firstChild.nodeValue))
# Out: ''
print(doc.firstChild.toxml())
# Out: <value></value>
How can I get consistent behavior? I'd like to receive an empty string as the value of an empty element (which IS what I put in the XML structure in the first place).

Cracking open xml.dom.minidom and searching for "/>", we find this:
# Method of the Element(Node) class.
def writexml(self, writer, indent="", addindent="", newl=""):
    # [snip]
    if self.childNodes:
        writer.write(">%s"%(newl))
        for node in self.childNodes:
            node.writexml(writer,indent+addindent,addindent,newl)
        writer.write("%s</%s>%s" % (indent,self.tagName,newl))
    else:
        writer.write("/>%s"%(newl))
We can deduce from this that the short-end-tag form only occurs when childNodes is an empty list. Indeed, this seems to be true:
>>> doc = Document()
>>> v = doc.appendChild(doc.createElement('v'))
>>> v.toxml()
'<v/>'
>>> v.childNodes
[]
>>> v.appendChild(doc.createTextNode(''))
<DOM Text node "''">
>>> v.childNodes
[<DOM Text node "''">]
>>> v.toxml()
'<v></v>'
As pointed out by Lloyd, the XML spec makes no distinction between the two. If your code does make the distinction, you need to rethink how you want to serialize your data.
xml.dom.minidom simply displays the two cases differently because it's easier to code. You can, however, get consistent output. Inherit from the Element class and override the writexml method (the one shown above, which toxml ultimately calls) so that it prints the short-end-tag form when there are no child nodes with non-empty text content. Then monkeypatch the module to use your new Element class.
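A lighter alternative to subclassing, offered here as a sketch rather than as what the answer above proposes: minidom's normalize() method discards empty text nodes, so calling it before serializing should make a hand-built tree print the same way as a parsed one. The variable names below are illustrative only.

```python
import xml.dom.minidom

# Element obtained by parsing: it has no child nodes at all.
parsed = xml.dom.minidom.parseString("<value></value>").documentElement

# Element built by hand with an explicit empty text node.
doc = xml.dom.minidom.Document()
built = doc.appendChild(doc.createElement("value"))
built.appendChild(doc.createTextNode(""))

before = built.toxml()  # the empty text node forces the long form
built.normalize()       # drops empty text nodes and merges adjacent ones
after = built.toxml()

print(before, after, parsed.toxml())
```

This changes the tree in place, so it only fits if nothing else relies on the empty text node being present.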

value = thing.firstChild.nodeValue or ''

The XML spec does not distinguish between these two cases.
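Note that the one-liner above raises AttributeError when thing is an element with no children, because firstChild is then None. A minimal sketch of a reader that returns an empty string in both cases follows; element_text is a hypothetical helper name, not part of minidom.

```python
import xml.dom.minidom


def element_text(element):
    """Join the direct text children of a minidom element ('' if there are none)."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


# Case 1: parsed from text, the element has no child nodes.
parsed = xml.dom.minidom.parseString("<value></value>").documentElement

# Case 2: built by hand with an explicit empty text node.
doc = xml.dom.minidom.Document()
built = doc.appendChild(doc.createElement("value"))
built.appendChild(doc.createTextNode(""))

print(repr(element_text(parsed)), repr(element_text(built)))
```

Reading the value through one helper keeps the calling code independent of how the element was created.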

Related

Python xml.etree - how to search for n-th element in an xml with namespaces?

EDIT
It looks like I wasn't clear enough below. The problem is that XPath expressions that combine node positions (e.g. /element[1]) with namespaces do not work in xml.etree. I have found a partial answer: lxml handles them well, so I can use it instead of xml.etree, but I'm leaving the question open for future reference.
So, to be clear, the problem statement is:
XPath expressions with positions and namespaces do not work in xml.etree. At least not for me.
Original question below:
I'm trying to use positions in XPath expressions processed by the findall function of the xml.etree.ElementTree.Element class. For some reason, findall does not work when namespaces and positions are used together.
See the following example:
Works with no namespaces
>>> from xml.etree import ElementTree as ET
>>> xml = """
... <element>
... <system_name>TEST</system_name>
... <id_type>tradeseq</id_type>
... <id_value>31359936123</id_value>
... </element>
... """
>>> root = ET.fromstring(xml)
>>> list = root.findall('./system_name')
>>> list
[<Element 'system_name' at 0x0000023825CDB9F0>]
>>> list[0].tag
'system_name'
>>> list[0].text
'TEST'
###Here is the lookup with position - works well, returns one element
>>> list = root.findall('./system_name[1]')
>>> list
[<Element 'system_name' at 0x0000023825CDB9F0>]
>>> list[0].text
'TEST'
Does not work with namespaces
>>> xml = """
... <element xmlns="namespace">
... <system_name>TEST</system_name>
... <id_type>tradeseq</id_type>
... <id_value>31359936123</id_value>
... </element>
... """
>>> root = ET.fromstring(xml)
>>> list = root.findall(path='./system_name', namespaces={'': 'namespace'})
>>> list
[<Element '{namespace}system_name' at 0x0000023825CDBD60>]
>>> list[0].text
'TEST'
###Lookup with position and namespace: I'm expecting here one element, as it was in the no-namespace example, but it returns empty list
>>> list = root.findall(path='./system_name[1]', namespaces={'': 'namespace'})
>>> list
[]
Am I missing something, or is this a bug? If I should use any other library that better processes xml, could you name one, please?
It works as documented.
Please try this syntax:
ns = {'xmlns': 'namespace'}
for elem in root.findall(".//xmlns:system_name", ns):
    print(elem.tag)
Remark:
This also works with an empty key, but I assume that is not the correct usage.
ns = {'': 'namespace'}
for elem in root.findall(".//system_name", ns):
    print(elem.tag)
If you have only one namespace definition, you can also use {*}tag_name:
for elem in root.findall(".//{*}system_name"):
    print(elem.tag)
Searching the direct children (without a position predicate) also works fine:
ns = {'': 'namespace'}
for elem in root.findall("./system_name", ns):
    print(elem.tag)
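For the positional lookup the question actually asks about, here is a small sketch under one assumption: the empty-prefix mapping seems to be applied to the index inside the brackets as well, so giving the namespace an explicit prefix, or writing the qualified name inline, avoids the problem. The prefix name ns is arbitrary.

```python
from xml.etree import ElementTree as ET

xml_text = """
<element xmlns="namespace">
  <system_name>TEST</system_name>
  <id_type>tradeseq</id_type>
  <id_value>31359936123</id_value>
</element>
"""
root = ET.fromstring(xml_text)

# Variant 1: map the namespace to an explicit (arbitrary) prefix.
ns = {"ns": "namespace"}
first_prefixed = root.findall("./ns:system_name[1]", ns)

# Variant 2: spell out the qualified name, no mapping needed.
first_qualified = root.findall("./{namespace}system_name[1]")

print([e.text for e in first_prefixed], [e.text for e in first_qualified])
```

Both variants leave the position predicate as a plain number, which is what findall expects.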

How to pull a value out of an element in a nested XML document in Python?

I'm querying an API to look up part numbers that a user scans with a barcode scanner. The API returns a much longer document than the one below; I trimmed a number of unnecessary empty elements, but the structure is the same. On each run, my program generates a list of part numbers and loops over it, asking the API about each one, and each request returns a large document, as expected. I need to build a dictionary that maps each part number to the text inside its <mfgr> element. I'm stuck on parsing the XML to get only the text inside <mfgr> and saving it to the dictionary under the part number it belongs to. My loop over the list is shown below the XML document.
<ArrayOfitem xmlns="WhereDataComesFrom.com" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<item>
<associateditem_swap/>
<bulk>false</bulk>
<category>Memory</category>
<clei>false</clei>
<createddate>5/11/2021 7:34:58 PM</createddate>
<description>sample description</description>
<heci/>
<imageurl/>
<item_swap/>
<itemid>1640</itemid>
<itemnumber>**sample part number**</itemnumber>
<listprice>0.0000</listprice>
<manufactureritem/>
<maxavailable>66</maxavailable>
<mfgr>**sample manufacturer**</mfgr>
<minreorderqty>0</minreorderqty>
<noninventory>false</noninventory>
<primarylocation/>
<reorderpoint>0</reorderpoint>
<rep>AP</rep>
<type>Memory </type>
<updateddate>2/4/2022 2:22:51 PM</updateddate>
<warehouse>MAIN</warehouse>
</item>
</ArrayOfitem>
Below is my Python code that loops through the part number list and asks the API to look up each part number.
import http.client
import xml.etree.ElementTree as etree
raw_xml = None
pn_list=["samplepart1","samplepart2"]
api_key= **redacted lol**
def getMFGR():
    global raw_xml
    for part_number in pn_list:
        conn = http.client.HTTPSConnection("api.website.com")
        payload = ''
        headers = {
            'session-token': 'api_key',
            'Cookie': 'firstpartofmycookie; secondpartofmycookie'
        }
        conn.request("GET", "/webapi.svc/MI/XML/GetItemsByItemNumber?ItemNumber="+part_number, payload, headers)
        res = conn.getresponse()
        data = res.read()
        raw_xml = data.decode("utf-8")
        print(raw_xml)
        print()
getMFGR()
Here is some code I tried while trying to get the mfgr. It will go inside the getMFGR() method inside the for loop so that it saves the manufacturer to a variable with each loop. Once the code works I want to have the dictionary look like this: {"samplepart1": "manufacturer1", "samplepart2": "manufacturer2"}.
root = etree.fromstring(raw_xml)
my_ns = {'root': 'WhereDataComesFrom.com'}
mfgr = root.findall('root:mfgr',my_ns)[0].text
The code above gives me a list index out of range error when I run it. I don't think it's searching past the namespaces node but I'm not sure how to tell it to search further.
This is where an interactive session becomes very useful. Drop your XML data into a file (say, data.xml), and then start up a Python REPL:
>>> import xml.etree.ElementTree as etree
>>> with open('data.xml') as fd:
...     raw_xml = fd.read()
...
>>> root = etree.fromstring(raw_xml)
>>> my_ns = {'root': 'WhereDataComesFrom.com'}
Let's first look at your existing xpath expression:
>>> root.findall('root:mfgr',my_ns)
[]
That returns an empty list, which is why you're getting an "index out of range" error. You're getting an empty list because there is no mfgr element at the top level of the document; it's contained in an <item> element. So this will work:
>>> root.findall('root:item/root:mfgr',my_ns)
[<Element '{WhereDataComesFrom.com}mfgr' at 0x7fa5a45e2b60>]
To actually get the contents of that element:
>>> [x.text for x in root.findall('root:item/root:mfgr',my_ns)]
['**sample manufacturer**']
Hopefully that's enough to point you in the right direction.
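To go from there to the dictionary described in the question, one possible sketch follows. It assumes each response has the structure shown above; manufacturers_by_part is a hypothetical helper name and the inline sample is a trimmed stand-in for a real API response.

```python
import xml.etree.ElementTree as etree

NS = {"root": "WhereDataComesFrom.com"}


def manufacturers_by_part(raw_xml):
    """Map each <itemnumber> to the text of its sibling <mfgr> element."""
    root = etree.fromstring(raw_xml)
    result = {}
    for item in root.findall("root:item", NS):
        number = item.findtext("root:itemnumber", default="", namespaces=NS)
        mfgr = item.findtext("root:mfgr", default="", namespaces=NS)
        result[number] = mfgr
    return result


# Trimmed stand-in for one API response.
sample = """<ArrayOfitem xmlns="WhereDataComesFrom.com">
  <item>
    <itemnumber>samplepart1</itemnumber>
    <mfgr>manufacturer1</mfgr>
  </item>
</ArrayOfitem>"""

print(manufacturers_by_part(sample))
```

Inside the request loop, the result for each response could be merged into one overall dictionary with dict.update.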
I suggest using pandas for XML with this structure:
import pandas as pd
# Read XML row into DataFrame
ns = {"xmlns":"WhereDataComesFrom.com", "xmlns:i":"http://www.w3.org/2001/XMLSchema-instance"}
df = pd.read_xml("parNo_plant.xml", xpath=".//xmlns:item", namespaces=ns)
# Print only columns of interest
df_of_interest = df[['itemnumber', 'mfgr']]
print(df_of_interest,'\n')
# Print the dictionary from DataFrame
print(df_of_interest.to_dict(orient='records'))
# If I understood correctly, this is the layout you are looking for:
dictionary = dict(zip(df.itemnumber, df.mfgr))
print(dictionary)
Result (Pandas dataframe or dictionary):
               itemnumber                     mfgr
0  **sample part number**  **sample manufacturer**
[{'itemnumber': '**sample part number**', 'mfgr': '**sample manufacturer**'}]
{'**sample part number**': '**sample manufacturer**'}

Python XML: how to treat a node content as a string?

I have the following code:
from xml.etree import ElementTree
tree = ElementTree.parse(file)
my_val = tree.find('./abc').text
and here is an xml snippet:
<item>
<abc>
<a>hello</a>
<b>world</b>
awesome
</abc>
</item>
I need my_val of type string to contain
<a>hello</a>
<b>world</b>
awesome
But it obviously resolves to None
Iterating over findall will give you a list of subtree elements.
>>> elements = [ElementTree.tostring(x) for x in tree.findall('./abc/')]
['<a>hello</a>\n ', '<b>world</b>\n awesome\n ']
The problem with this is that text without its own tags is appended to the previous tag. So you need to clean that up too:
>>> split_elements = [x.split() for x in elements]
[['<a>hello</a>'], ['<b>world</b>', 'awesome']]
Now we have a list of lists that needs to be flattened:
>>> from itertools import chain
>>> flatten_list = list(chain(*split_elements))
['<a>hello</a>', '<b>world</b>', 'awesome']
Finally, you can print it one per line with:
>>> print("\n".join(flatten_list))
One way could be to start by getting the root element:
from xml.etree import ElementTree
tree = ElementTree.parse(file)
rootElem = tree.getroot()
Then we can get the abc element from the root and iterate over its children, formatting each into a string using the child's attributes. Note that text following a child element (such as "awesome") is stored in that child's tail, not in abcElem.text:
abcElem = rootElem.find("abc")
my_list = ["<{0.tag}>{0.text}</{0.tag}>".format(child) for child in abcElem]
my_list.extend(child.tail.strip() for child in abcElem if child.tail and child.tail.strip())
my_val = "\n".join(my_list)
I'm sure some other helpful soul knows a way to print these elements out using ElementTree or some other xml utility rather than formatting them yourself but this should start you off.
Answering my own question:
This might be not the best solution but it worked for me
my_val = ElementTree.tostring(tree.find('./abc'), 'utf-8', 'xml').decode('utf-8')
my_val = my_val.replace('<abc>', '').replace('</abc>', '')
my_val = my_val.strip()
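A variation that avoids string replacement on the tag name, sketched under the assumption that only the standard library is used: join the element's leading text with the serialized form of each child, since tostring includes a child's trailing text. inner_xml is a hypothetical helper name.

```python
from xml.etree import ElementTree


def inner_xml(element):
    """Return the markup between an element's start and end tags."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the child's tail, e.g. the text "awesome".
        parts.append(ElementTree.tostring(child, encoding="unicode"))
    return "".join(parts).strip()


item = ElementTree.fromstring(
    "<item><abc><a>hello</a><b>world</b>awesome</abc></item>"
)
my_val = inner_xml(item.find("./abc"))
print(my_val)
```

Unlike the replace-based version, this does not touch a nested <abc> element or literal "<abc>" text that might appear inside the content.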

Appending an xml-node which is read from file breaks pretty_print for adjacent nodes

I'm generating an XML file with Python's lxml.etree library. One node in the generated file is read from an existing XML file. Adding this element breaks the pretty_print output for the nodes directly before and after it.
import xml.etree.cElementTree as ET
from lxml import etree
root = etree.Element("startNode")
subnode1 = etree.SubElement(root, "SubNode1")
subnode1Child1 = etree.SubElement(subnode1, "subNode1Child1")
etree.SubElement(subnode1Child1, "Child1")
etree.SubElement(subnode1Child1, "Child2")
f = open('/xml_testdata/ext_file.xml','r')
ext_xml = etree.fromstring(f.read())
ext_subnode = ext_xml.find("ExtNode")
subnode1.append(ext_subnode)
subnode1Child2 = etree.SubElement(subnode1, "subNode1Child2")
etree.SubElement(subnode1Child2, "Child1")
etree.SubElement(subnode1Child2, "Child2")
tree = etree.ElementTree(root)
tree.write("testfile.xml", xml_declaration=True, pretty_print=True)
which gives this result:
<startNode>
<SubNode1><subNode1Child1><Child1/><Child2/></subNode1Child1><ExtNode>
<NodeFromExt>
<SubNodeFromExt1/>
</NodeFromExt>
<NodeFromExt>
<SubNodeFromExt2/>
<AnotherSubNodeFromExt2>
<SubSubNode/>
<AllPrettyHere>
<Child/>
</AllPrettyHere>
</AnotherSubNodeFromExt2>
</NodeFromExt>
</ExtNode>
<subNode1Child2><Child1/><Child2/></subNode1Child2></SubNode1>
</startNode>
Not very readable, is it? Even worse when "subNodeChild" contains a lot more subnodes than this example!
Without appending the external elements, it looks like this:
<startNode>
<SubNode1>
<subNode1Child1>
<Child1/>
<Child2/>
</subNode1Child1>
<subNode1Child2>
<Child1/>
<Child2/>
</subNode1Child2>
</SubNode1>
</startNode>
So the problem is caused by appending the external elements!
Is there a way to append the external elements without breaking the pretty_print-output?
You can get nicer pretty-printed output by using a parser object that removes ignorable whitespace when parsing the existing XML file.
Instead of this:
f = open('/xml_testdata/ext_file.xml','r')
ext_xml = etree.fromstring(f.read())
Use this:
f = open('/xml_testdata/ext_file.xml', 'r')
parser = etree.XMLParser(remove_blank_text=True)
ext_xml = etree.fromstring(f.read(), parser)
See also:
http://lxml.de/api/lxml.etree.XMLParser-class.html
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
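For comparison, a standard-library alternative (not lxml, and not what the answer above uses): xml.etree.ElementTree.indent rewrites whitespace-only text in an existing tree, so it can re-indent the combined tree after the external node has been appended. This sketch assumes Python 3.9 or later, and the inline XML is a stand-in for ext_file.xml.

```python
import xml.etree.ElementTree as ET

root = ET.Element("startNode")
subnode1 = ET.SubElement(root, "SubNode1")
ET.SubElement(subnode1, "subNode1Child1")

# Stand-in for the external file; it carries its own whitespace-only text nodes.
ext_xml = ET.fromstring("<Ext>\n<ExtNode>\n<NodeFromExt/>\n</ExtNode>\n</Ext>")
subnode1.append(ext_xml.find("ExtNode"))

# Overwrite whitespace-only text and tail values with consistent indentation.
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

The stored whitespace of the appended node is what disturbs the layout in the first place, which is why replacing it restores a uniform result.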
I've been able to somewhat mitigate the effect by creating "ExtNode" with etree.SubElement and appending the elements inside it:
ext_node = etree.SubElement(subnode1, "ExtNode")
for element in ext_xml.findall("ExtNode/NodeFromExt"):
    ext_node.append(element)
which has this result:
<startNode>
<SubNode1>
<subNode1Child1>
<Child1/>
<Child2/>
</subNode1Child1>
<ExtNode><NodeFromExt>
<SubNodeFromExt1/>
</NodeFromExt>
<NodeFromExt>
<SubNodeFromExt2/>
<AnotherSubNodeFromExt2>
<SubSubNode/>
<AllPrettyHere>
<Child/>
</AllPrettyHere>
</AnotherSubNodeFromExt2>
</NodeFromExt>
</ExtNode>
<subNode1Child2>
<Child1/>
<Child2/>
</subNode1Child2>
</SubNode1>
</startNode>
Not perfect, but at least human readable (Which is the whole point of pretty_print, right?)
To satisfy my OCD, I'd still be interested if there is a way to get a flawlessly formatted file!

Element Tree: How to parse subElements of child nodes

I have an XML tree that I'd like to parse using ElementTree. My XML looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<GetOrdersResponse xmlns="urn:ebay:apis:eBLBaseComponents">
<Ack>Success</Ack>
<Version>857</Version>
<Build>E857_INTL_APIXO_16643800_R1</Build>
<PaginationResult>
<TotalNumberOfPages>1</TotalNumberOfPages>
<TotalNumberOfEntries>2</TotalNumberOfEntries>
</PaginationResult>
<HasMoreOrders>false</HasMoreOrders>
<OrderArray>
<Order>
<OrderID>221362908003-1324471823012</OrderID>
<CheckoutStatus>
<eBayPaymentStatus>NoPaymentFailure</eBayPaymentStatus>
<LastModifiedTime>2014-02-03T12:08:51.000Z</LastModifiedTime>
<PaymentMethod>PaisaPayEscrow</PaymentMethod>
<Status>Complete</Status>
<IntegratedMerchantCreditCardEnabled>false</IntegratedMerchantCreditCardEnabled>
</CheckoutStatus>
</Order>
<Order> ...
</Order>
<Order> ...
</Order>
</OrderArray>
</GetOrdersResponse>
I want to parse the 6th child of the XML (<OrderArray>). I am able to get the values of subelements by index; e.g. if I want the OrderID of the first order, I can use root[5][0][0].text. But I would like to get the values of subelements by name. I tried the following code, but it does not print anything:
tree = ET.parse('response.xml')
root = tree.getroot()
for child in root:
    try:
        for ids in child.find('Order').find('OrderID'):
            print(ids.text)
    except:
        continue
Could someone please help me with this? Thanks.
Since the XML document has a namespace declaration (xmlns="urn:ebay:apis:eBLBaseComponents"), you have to use universal names when referring to elements in the document. For example, you need {urn:ebay:apis:eBLBaseComponents}OrderID instead of just OrderID.
This snippet prints all OrderIDs in the document:
from xml.etree import ElementTree as ET
NS = "urn:ebay:apis:eBLBaseComponents"
tree = ET.parse('response.xml')
for elem in tree.iter("*"):  # Use tree.getiterator("*") in Python 2.5 and 2.6
    if elem.tag == '{%s}OrderID' % NS:
        print(elem.text)
See http://effbot.org/zone/element-namespaces.htm for details about ElementTree and namespaces.
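The same lookup can also be written with a prefix-to-URI mapping passed to findall, which avoids comparing tag strings by hand. The sketch below uses a shortened inline sample instead of response.xml; the prefix e is arbitrary and the second order ID is a made-up placeholder.

```python
from xml.etree import ElementTree as ET

NS = {"e": "urn:ebay:apis:eBLBaseComponents"}

# Shortened stand-in for response.xml.
sample = """<GetOrdersResponse xmlns="urn:ebay:apis:eBLBaseComponents">
  <Ack>Success</Ack>
  <OrderArray>
    <Order><OrderID>221362908003-1324471823012</OrderID></Order>
    <Order><OrderID>hypothetical-second-id</OrderID></Order>
  </OrderArray>
</GetOrdersResponse>"""

root = ET.fromstring(sample)
order_ids = [
    order.findtext("e:OrderID", default="", namespaces=NS)
    for order in root.findall("e:OrderArray/e:Order", NS)
]
print(order_ids)
```

findtext returns the default instead of raising when an order has no OrderID child, so no try/except is needed.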
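Try to avoid chaining your finds. If the first find does not find anything, it returns None, and calling find on None raises an AttributeError. The names also have to be namespace-qualified for this document:
NS = '{urn:ebay:apis:eBLBaseComponents}'
for child in root:
    order = child.find(NS + 'Order')
    if order is not None:
        ids = order.find(NS + 'OrderID')
        if ids is not None:
            print(ids.text)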
You can find the OrderArray first and then iterate over its children by name. Because the document declares a default namespace, the names have to be namespace-qualified:
ns = {'e': 'urn:ebay:apis:eBLBaseComponents'}
tree = ET.parse('response.xml')
root = tree.getroot()
order_array = root.find('e:OrderArray', ns)
for order in order_array.findall('e:Order', ns):
    order_id_element = order.find('e:OrderID', ns)
    if order_id_element is not None:
        print(order_id_element.text)
A side note: never use a bare except: continue. It hides every exception you get and makes debugging really hard.
