Python XML: how to treat a node content as a string? - python

I have the following code:
from xml.etree import ElementTree
tree = ElementTree.parse(file)
my_val = tree.find('./abc').text
and here is an xml snippet:
<item>
<abc>
<a>hello</a>
<b>world</b>
awesome
</abc>
</item>
I need my_val of type string to contain
<a>hello</a>
<b>world</b>
awesome
But it obviously resolves to None

Iterating over findall will give you a list of subtree elements.
>>> elements = [ElementTree.tostring(x, encoding='unicode') for x in tree.findall('./abc/*')]
['<a>hello</a>\n ', '<b>world</b>\n awesome\n ']
The problem with this is that text without tags is appended to the previous tag (as its tail). So you need to clean that up too:
>>> split_elements = [x.split() for x in elements]
[['<a>hello</a>'], ['<b>world</b>', 'awesome']]
Now we have a list of lists that needs to be flattened:
>>> from itertools import chain
>>> flatten_list = list(chain(*split_elements))
['<a>hello</a>', '<b>world</b>', 'awesome']
Finally, you can print it one per line with:
>>> print("\n".join(flatten_list))

One way could be to start by getting the root element
from xml.etree import ElementTree
tree = ElementTree.parse(file)
rootElem = tree.getroot()
Then we can get the abc element from the root and iterate over its children, formatting each into a string using the child's attributes:
abcElem = rootElem.find("abc")
my_list = ["<{0.tag}>{0.text}</{0.tag}>".format(child) for child in abcElem]
# Trailing text such as "awesome" is stored as the tail of the last child, not in abcElem.text
my_list.append((abcElem[-1].tail or "").strip())
my_val = "\n".join(my_list)
I'm sure some other helpful soul knows a way to print these elements out using ElementTree or some other xml utility rather than formatting them yourself but this should start you off.

Answering my own question:
This might not be the best solution, but it worked for me:
my_val = ElementTree.tostring(tree.find('./abc'), 'utf-8', 'xml').decode('utf-8')
my_val = my_val.replace('<abc>', '').replace('</abc>', '')
my_val = my_val.strip()
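An alternative that avoids string replacement on the serialized parent is to join the element's leading text with the serialized children. This is only a minimal sketch, assuming the XML snippet from the question; `inner_xml` is a hypothetical helper name, not something ElementTree provides.

```python
from xml.etree import ElementTree

# The sample document from the question, inlined so the sketch is self-contained.
XML = """<item>
<abc>
<a>hello</a>
<b>world</b>
awesome
</abc>
</item>"""

def inner_xml(elem):
    """Return the serialized content of elem without its own start and end tags."""
    # elem.text holds the text before the first child. tostring() on a child
    # also emits that child's tail, so trailing text such as "awesome" is kept.
    parts = [elem.text or ""]
    parts.extend(ElementTree.tostring(child, encoding="unicode") for child in elem)
    return "".join(parts).strip()

root = ElementTree.fromstring(XML)
my_val = inner_xml(root.find("./abc"))
print(my_val)
```

Because each child is serialized separately, this does not break if the text itself happens to contain the string `<abc>`.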

Related

Python xml.etree - how to search for n-th element in an xml with namespaces?

EDIT
Looks like I wasn't clear enough below. The problem is that if I use node positions (e.g. /element[1]) together with namespaces, XPath expressions do not work in xml.etree. I partially found my answer: lxml handles them well, so I can use it instead of xml.etree, but I'm leaving the question open for future reference.
So to be clear, problem statement is:
XPath expressions with positions and namespaces do not work in xml.etree. At least not for me.
Original question below:
I'm trying to use positions in xpath expressions processed by findall function of xml.etree.ElementTree.Element class. For some reason findall does not work with both namespaces and positions.
See the following example:
Works with no namespaces
>>> from xml.etree import ElementTree as ET
>>> xml = """
... <element>
... <system_name>TEST</system_name>
... <id_type>tradeseq</id_type>
... <id_value>31359936123</id_value>
... </element>
... """
>>> root = ET.fromstring(xml)
>>> list = root.findall('./system_name')
>>> list
[<Element 'system_name' at 0x0000023825CDB9F0>]
>>> list[0].tag
'system_name'
>>> list[0].text
'TEST'
###Here is the lookup with position - works well, returns one element
>>> list = root.findall('./system_name[1]')
>>> list
[<Element 'system_name' at 0x0000023825CDB9F0>]
>>> list[0].text
'TEST'
Does not work with namespaces
>>> xml = """
... <element xmlns="namespace">
... <system_name>TEST</system_name>
... <id_type>tradeseq</id_type>
... <id_value>31359936123</id_value>
... </element>
... """
>>> root = ET.fromstring(xml)
>>> list = root.findall(path='./system_name', namespaces={'': 'namespace'})
>>> list
[<Element '{namespace}system_name' at 0x0000023825CDBD60>]
>>> list[0].text
'TEST'
###Lookup with position and namespace: I'm expecting here one element, as it was in the no-namespace example, but it returns empty list
>>> list = root.findall(path='./system_name[1]', namespaces={'': 'namespace'})
>>> list
[]
Am I missing something, or is this a bug? If I should use any other library that better processes xml, could you name one, please?
It works as defined in the documentation.
Please try this syntax:
ns = {'xmlns': 'namespace'}
for elem in root.findall(".//xmlns:system_name", ns):
    print(elem.tag)
Remark:
It even works with an empty key, although I assume this is not the intended usage.
ns = {'': 'namespace'}
for elem in root.findall(".//system_name", ns):
    print(elem.tag)
If you have only one namespace definition, you can also use {*}tag_name:
for elem in root.findall(".//{*}system_name"):
    print(elem.tag)
A search of the direct children (without a position index) also works fine:
ns = {'': 'namespace'}
for elem in root.findall("./system_name", ns):
    print(elem.tag)
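For the case the question actually asks about (a position predicate combined with a namespace), one workaround that stays within xml.etree is to avoid the empty-prefix mapping. The sketch below assumes the sample document from the question; the prefix name `ns` is an arbitrary choice made here.

```python
from xml.etree import ElementTree as ET

xml = """
<element xmlns="namespace">
<system_name>TEST</system_name>
<id_type>tradeseq</id_type>
<id_value>31359936123</id_value>
</element>
"""
root = ET.fromstring(xml)

# Option 1: map the namespace URI to an explicit (arbitrary) prefix.
by_prefix = root.findall("./ns:system_name[1]", {"ns": "namespace"})

# Option 2: spell the namespace out in {uri}tag form, so no mapping is needed.
by_uri = root.findall("./{namespace}system_name[1]")

print([e.text for e in by_prefix])
print([e.text for e in by_uri])
```

Both forms keep the `[1]` predicate a plain index, so each call returns the first matching child.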

How to pull a value out of an element in a nested XML document in Python?

I'm querying an API to look up part numbers that I get from a user with a barcode scanner. The API returns a much longer document than the code block below; I trimmed a bunch of unnecessary empty elements, but the structure of the document is still the same. I need to put each part number in a dictionary where the value is the text inside the <mfgr> element. With each run of my program, I generate a list of part numbers and loop over it, asking the API about each item; each request returns a huge document, as expected. I'm a bit stuck on parsing the XML to get only the text inside the <mfgr> element and then saving it to a dictionary under the part number it belongs to. My loop over the list is shown below the XML document.
<ArrayOfitem xmlns="WhereDataComesFrom.com" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<item>
<associateditem_swap/>
<bulk>false</bulk>
<category>Memory</category>
<clei>false</clei>
<createddate>5/11/2021 7:34:58 PM</createddate>
<description>sample description</description>
<heci/>
<imageurl/>
<item_swap/>
<itemid>1640</itemid>
<itemnumber>**sample part number**</itemnumber>
<listprice>0.0000</listprice>
<manufactureritem/>
<maxavailable>66</maxavailable>
<mfgr>**sample manufacturer**</mfgr>
<minreorderqty>0</minreorderqty>
<noninventory>false</noninventory>
<primarylocation/>
<reorderpoint>0</reorderpoint>
<rep>AP</rep>
<type>Memory </type>
<updateddate>2/4/2022 2:22:51 PM</updateddate>
<warehouse>MAIN</warehouse>
</item>
</ArrayOfitem>
Below is my Python code that loops through the part number list and asks the API to look up each part number.
import http.client
import xml.etree.ElementTree as etree
raw_xml = None
pn_list = ["samplepart1", "samplepart2"]
api_key = "**redacted**"  # real key redacted
def getMFGR():
    global raw_xml
    for part_number in pn_list:
        conn = http.client.HTTPSConnection("api.website.com")
        payload = ''
        headers = {
            'session-token': api_key,
            'Cookie': 'firstpartofmycookie; secondpartofmycookie'
        }
        conn.request("GET", "/webapi.svc/MI/XML/GetItemsByItemNumber?ItemNumber="+part_number, payload, headers)
        res = conn.getresponse()
        data = res.read()
        raw_xml = data.decode("utf-8")
        print(raw_xml)
        print()
getMFGR()
Here is some code I tried while trying to get the mfgr. It will go inside the getMFGR() method inside the for loop so that it saves the manufacturer to a variable with each loop. Once the code works I want to have the dictionary look like this: {"samplepart1": "manufacturer1", "samplepart2": "manufacturer2"}.
root = etree.fromstring(raw_xml)
my_ns = {'root': 'WhereDataComesFrom.com'}
mfgr = root.findall('root:mfgr',my_ns)[0].text
The code above gives me a list index out of range error when I run it. I don't think it's searching past the namespaces node but I'm not sure how to tell it to search further.
This is where an interactive session becomes very useful. Drop your XML data into a file (say, data.xml), and then start up a Python REPL:
>>> import xml.etree.ElementTree as etree
>>> with open('data.xml') as fd:
...     raw_xml = fd.read()
...
>>> root = etree.fromstring(raw_xml)
>>> my_ns = {'root': 'WhereDataComesFrom.com'}
Let's first look at your existing xpath expression:
>>> root.findall('root:mfgr',my_ns)
[]
That returns an empty list, which is why you're getting an "index out of range" error. You're getting an empty list because there is no mfgr element at the top level of the document; it's contained in an <item> element. So this will work:
>>> root.findall('root:item/root:mfgr',my_ns)
[<Element '{WhereDataComesFrom.com}mfgr' at 0x7fa5a45e2b60>]
To actually get the contents of that element:
>>> [x.text for x in root.findall('root:item/root:mfgr',my_ns)]
['**sample manufacturer**']
Hopefully that's enough to point you in the right direction.
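To go from there to the dictionary described in the question, the lookup can be wrapped in a small function. This is a rough sketch under stated assumptions: `first_mfgr` and `build_mfgr_dict` are hypothetical names, the sample XML is trimmed to two elements, and the responses are passed in as already-decoded strings rather than fetched from the API.

```python
import xml.etree.ElementTree as etree

NS = {"root": "WhereDataComesFrom.com"}

# Trimmed stand-in for one API response (an assumption, not real API output).
SAMPLE_XML = """<ArrayOfitem xmlns="WhereDataComesFrom.com" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<item>
<itemnumber>samplepart1</itemnumber>
<mfgr>manufacturer1</mfgr>
</item>
</ArrayOfitem>"""

def first_mfgr(raw_xml):
    """Return the text of the first item/mfgr element, or None if there is none."""
    root = etree.fromstring(raw_xml)
    return root.findtext("root:item/root:mfgr", default=None, namespaces=NS)

def build_mfgr_dict(responses):
    """Map each part number to its manufacturer, given {part_number: raw_xml}."""
    return {part_number: first_mfgr(raw_xml) for part_number, raw_xml in responses.items()}

result = build_mfgr_dict({"samplepart1": SAMPLE_XML})
print(result)
```

Inside the original loop, the same idea would amount to storing `first_mfgr(raw_xml)` under `part_number` on each iteration.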
I suggest using pandas for XML with this structure:
import pandas as pd
# Read XML row into DataFrame
ns = {"xmlns":"WhereDataComesFrom.com", "xmlns:i":"http://www.w3.org/2001/XMLSchema-instance"}
df = pd.read_xml("parNo_plant.xml", xpath=".//xmlns:item", namespaces=ns)
# Print only columns of interest
df_of_interest = df[['itemnumber', 'mfgr']]
print(df_of_interest,'\n')
#Print the dictionary from DataFrame
print(df_of_interest.to_dict(orient='records'))
# If I understood right, you search this layout:
dictionary = dict(zip(df.itemnumber, df.mfgr))
print(dictionary)
Result (Pandas dataframe or dictionary):
itemnumber mfgr
0 **sample part number** **sample manufacturer**
[{'itemnumber': '**sample part number**', 'mfgr': '**sample manufacturer**'}]
{'**sample part number**': '**sample manufacturer**'}

python get xml tag list

I have this XML file:
<root>
<discovers>
<discover>
<zoulou>zag</zoulou>
<yotta>bob</yotta>
<alpha>ned</alpha>
</discover>
<discover>
<beta>Zorro</beta>
<omega>Danseur</omega>
</discover>
</discovers>
</root>
In Python 3.6 I want to get this output:
[[zoulou,yotta,alpha],[beta,omega]]
Currently I can get all the tags with this code:
tree = etree.parse("./file.xml")
[elt.tag for elt in tree.findall("discovers/discover/*")]
which gives this output:
['zoulou', 'yotta', 'alpha', 'beta', 'omega']
I haven't found a function to separate the tag list by parent node. Can you help me?
I don't know how to separate my discover nodes.
This can be accomplished by nesting list comprehensions. One option is to find all 'discover'-elements in the outer comprehension and then find any child elements.
[[ch.tag for ch in elt.findall('*')] for elt in tree.findall("discovers/discover")]
[['zoulou', 'yotta', 'alpha'], ['beta', 'omega']]
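For reference, here is the same nested comprehension as a self-contained sketch using only the standard library. It assumes the document from the question, inlined as a string instead of being read from `./file.xml`; iterating over an element directly yields its children, so `findall('*')` is not needed.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<discovers>
<discover>
<zoulou>zag</zoulou>
<yotta>bob</yotta>
<alpha>ned</alpha>
</discover>
<discover>
<beta>Zorro</beta>
<omega>Danseur</omega>
</discover>
</discovers>
</root>"""

root = ET.fromstring(XML)

# Outer loop: one list per <discover>; inner loop: the tags of its children.
tags_by_parent = [
    [child.tag for child in discover]
    for discover in root.findall("discovers/discover")
]
print(tags_by_parent)
```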
The best way to achieve what you need, and in general one of the best ways to parse XML, is to use BeautifulSoup4:
from bs4 import BeautifulSoup
result = """<root>
<discovers>
<discover>
<zoulou>zag</zoulou>
<yotta>bob</yotta>
<alpha>ned</alpha>
</discover>
<discover>
<beta>Zorro</beta>
<omega>Danseur</omega>
</discover>
</discovers>
</root>"""
soup = BeautifulSoup(result, "lxml")
findName = lambda child: child.name
print([list(map(findName, x.findChildren())) for x in soup.findAll('discover')])  # [['zoulou', 'yotta', 'alpha'], ['beta', 'omega']]

Custom sort python

I have a question:
This is list of lists, formed by ElementTree library.
[['word1', <Element tag at b719a4cc>], ['word2', <Element tag at b719a6cc>], ['word3', <Element tag at b719a78c>], ['word4', <Element tag at b719a82c>]]
word1..4 may contain Unicode characters, e.g. (â, ü, ç).
I want to sort this list of lists by my custom alphabet.
I know how to sort by custom alphabet from here
sorting words in python
I also know how to sort by key from here http://wiki.python.org/moin/HowTo/Sorting
The problem is that I couldn't find a way to apply these two methods to sort my "list of lists".
Your first link more or less solves the problem. You just need to have the lambda function only look at the first item in your list:
alphabet = "zyxwvutsrqpomnlkjihgfedcba"
new_list = sorted(inputList, key=lambda word: [alphabet.index(c) for c in word[0]])
One modification I might suggest, if you're sorting a reasonably large list, is to change the alphabet structure into a dict first, so that index lookup is faster:
alphabet_dict = {c: i for i, c in enumerate(alphabet)}
new_list = sorted(inputList, key=lambda word: [alphabet_dict[c] for c in word[0]])
If I'm understanding you correctly, you want to know how to apply the key sorting technique when the key should apply to an element of your object. In other words, you want to apply the key function to 'wordx', not the ['wordx', ...] element you are actually sorting. In that case, you can do this:
my_alphabet = "..."
def my_key(elem):
    word = elem[0]
    return [my_alphabet.index(c) for c in word]
my_list.sort(key=my_key)
or using the style in your first link:
my_alphabet = "..."
my_list.sort(key=lambda elem: [my_alphabet.index(c) for c in elem[0]])
Keep in mind that my_list.sort will sort in place, actually modifying your list. sorted(my_list, ...) will return a new sorted list.
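Putting the pieces together, the following sketch sorts a list of `[word, element]` pairs by a custom alphabet using a precomputed rank dictionary. The short alphabet and the placeholder strings standing in for Element objects are assumptions made for illustration only.

```python
# Hypothetical custom alphabet; any character missing from it raises KeyError.
ALPHABET = "aâbcçuü"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def custom_key(pair):
    """Sort key for a [word, element] pair: the rank of each character in the word."""
    word = pair[0]
    return [RANK[ch] for ch in word]

# Placeholder strings stand in for the Element objects from the question.
pairs = [["ç", "elem1"], ["ab", "elem2"], ["â", "elem3"], ["a", "elem4"]]

sorted_pairs = sorted(pairs, key=custom_key)  # returns a new list
words = [pair[0] for pair in sorted_pairs]
print(words)
```

Lists of ranks compare element by element, so a word that is a prefix of another (here "a" and "ab") sorts first.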
Works great! Thank you for your help.
Here is my story:
I have a Turkish-Russian dictionary in XDXF format. The problem was how to sort it.
I found a solution here http://effbot.org/zone/element-sort.htm but it didn't sort Unicode characters correctly.
Here is the final source code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
import codecs
alphabet = u"aâbcçdefgğhiıjklmnoöpqrstuüvwxyz"
tree = ET.parse("dict.xml")
# this element holds the phonebook entries
container = tree.find("entries")
data = []
for elem in container:
    keyd = elem.findtext("k")
    data.append([keyd, elem])
data.sort(key=lambda data: [alphabet.index(c) for c in data[0]])
container[:] = [item[-1] for item in data]
tree.write("new-dict.xml", encoding="utf-8")
sample content of dict.xml
<cont>
<entries>
<ar><k>â</k>def1</ar>
<ar><k>a</k>def1</ar>
<ar><k>g</k>def1</ar>
<ar><k>w</k>def1</ar>
<ar><k>n</k>def1</ar>
<ar><k>u</k>def1</ar>
<ar><k>ü</k>def1</ar>
<ar><k>âb</k>def1</ar>
<ar><k>ç</k>def1</ar>
<ar><k>v</k>def1</ar>
<ar><k>ac</k>def1</ar>
</entries>
</cont>
Thanks to all.

Empty XML element handling in Python

I'm puzzled by the minidom parser's handling of empty elements, as shown in the following code.
import xml.dom.minidom
doc = xml.dom.minidom.parseString('<value></value>')
print(repr(doc.firstChild.nodeValue))
# Out: None
print(doc.firstChild.toxml())
# Out: <value/>
doc = xml.dom.minidom.Document()
v = doc.appendChild(doc.createElement('value'))
v.appendChild(doc.createTextNode(''))
print(repr(v.firstChild.nodeValue))
# Out: ''
print(doc.firstChild.toxml())
# Out: <value></value>
How can I get consistent behavior? I'd like to receive an empty string as the value of an empty element (which IS what I put in the XML structure in the first place).
Cracking open xml.dom.minidom and searching for "/>", we find this:
# Method of the Element(Node) class.
def writexml(self, writer, indent="", addindent="", newl=""):
    # [snip]
    if self.childNodes:
        writer.write(">%s"%(newl))
        for node in self.childNodes:
            node.writexml(writer,indent+addindent,addindent,newl)
        writer.write("%s</%s>%s" % (indent,self.tagName,newl))
    else:
        writer.write("/>%s"%(newl))
We can deduce from this that the short-end-tag form only occurs when childNodes is an empty list. Indeed, this seems to be true:
>>> from xml.dom.minidom import Document
>>> doc = Document()
>>> v = doc.appendChild(doc.createElement('v'))
>>> v.toxml()
'<v/>'
>>> v.childNodes
[]
>>> v.appendChild(doc.createTextNode(''))
<DOM Text node "''">
>>> v.childNodes
[<DOM Text node "''">]
>>> v.toxml()
'<v></v>'
As pointed out by Lloyd, the XML spec makes no distinction between the two. If your code does make the distinction, that means you need to rethink how you want to serialize your data.
xml.dom.minidom simply displays something differently because it's easier to code. You can, however, get consistent output. Simply inherit the Element class and override the toxml method such that it will print out the short-end-tag form when there are no child nodes with non-empty text content. Then monkeypatch the module to use your new Element class.
value = thing.firstChild.nodeValue or ''
The XML spec does not distinguish between these two cases.
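If the goal is simply to read an empty string from an empty element regardless of how the document was built, a small helper can normalize both cases. This is a minimal sketch; `element_text` is a hypothetical helper name, not part of minidom.

```python
import xml.dom.minidom

def element_text(element):
    """Join the data of the element's direct text children; '' if it has none."""
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
    )

# Case 1: parsed from a string, so the element ends up with no child nodes.
parsed = xml.dom.minidom.parseString("<value></value>")
parsed_text = element_text(parsed.firstChild)

# Case 2: built by hand with an explicit empty text node.
built = xml.dom.minidom.Document()
v = built.appendChild(built.createElement("value"))
v.appendChild(built.createTextNode(""))
built_text = element_text(v)

print(repr(parsed_text), repr(built_text))
```

This only normalizes reading; the serialized form (`<value/>` versus `<value></value>`) still depends on whether the element has child nodes.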
