How to modify the value under a designated path with Python?

There is an XML file like the one below:
<aa>
<bb>BB</bb>
<cc>
<dd>Tom</dd>
</cc>
<cc>
<dd>David</dd>
</cc>
</aa>
I'm trying to modify the values "Tom" and "David", but I can't get any value from <dd>. I then tried to get the value of <bb>, but my code returned None.
My code is below:
import xml.etree.ElementTree as ET
tree = ET.parse("abc.xml")
root = tree.getroot()
a = root.find('aa/bb')
print(a)
Could someone help me correct my code so that I can get and modify the value of <dd>? Many thanks.

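Your top-level element is aa, so root is already the aa element. The path 'aa/bb' looks for an aa child of root, which does not exist, so find returns None.
To get bb, just do root.find('bb'):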
>>> root
<Element 'aa' at 0x7fb1df5f0278>
>>> a = root.find('bb')
>>> a
<Element 'bb' at 0x7fb1df5f0228>
So to edit the names, try something like this:
for dd in root.findall('cc/dd'):
    if dd.text in ["Tom", "David"]:
        dd.text = "something else"
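As a self-contained sketch of the same idea, the snippet below parses the XML from the question out of a string (an assumption made so that it runs without abc.xml) and shows both the failing and the working lookup:

```python
import xml.etree.ElementTree as ET

# The XML from the question, inlined so the sketch runs without abc.xml.
xml_text = """<aa>
<bb>BB</bb>
<cc><dd>Tom</dd></cc>
<cc><dd>David</dd></cc>
</aa>"""

root = ET.fromstring(xml_text)  # root is already the <aa> element

# 'aa/bb' looks for an <aa> child of root, which does not exist.
missing = root.find('aa/bb')
# Paths are relative to root, so 'bb' is enough.
bb = root.find('bb')

# Replace the text of every <dd> that sits directly under a <cc>.
for dd in root.findall('cc/dd'):
    if dd.text in ("Tom", "David"):
        dd.text = "something else"

new_values = [dd.text for dd in root.findall('cc/dd')]
print(missing, bb.text, new_values)
```

Changes made this way only live in memory; they reach a file once the tree is written back out.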

Using ElementTree
Demo:
import xml.etree.ElementTree
filename = "abc.xml"  # the file from the question
et = xml.etree.ElementTree.parse(filename)
root = et.getroot()
for cc in root.findall('cc'):  # Find all cc tags
    print(cc.find("dd").text)  # Print current text
    cc.find("dd").text = "NewValue"  # Update dd tags with new value
et.write(filename)  # Write back to xml
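To check that the write step really persists the change, here is a hedged round-trip sketch. It uses a temporary file as a stand-in for abc.xml (the temporary file and the variable names are assumptions for illustration), so it does not touch any real data:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Write the sample document to a temporary file (a stand-in for abc.xml).
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write("<aa><bb>BB</bb><cc><dd>Tom</dd></cc><cc><dd>David</dd></cc></aa>")
    path = tmp.name

tree = ET.parse(path)
for dd in tree.getroot().findall('cc/dd'):
    dd.text = "NewValue"
tree.write(path)  # overwrite the file with the modified tree

# Parse the file again to confirm that the change reached the disk.
reloaded = [dd.text for dd in ET.parse(path).getroot().findall('cc/dd')]
os.remove(path)  # clean up the temporary file
print(reloaded)
```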

If you don't mind using BeautifulSoup, you can modify your XML through it:
data = """<aa>
<bb>BB</bb>
<cc>
<dd>Tom</dd>
</cc>
<cc>
<dd>David</dd>
</cc>
</aa>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
for dd in soup.select('cc > dd'):  # using CSS selectors
    dd.clear()
    dd.append('XXX')
print(soup.prettify())
Output:
<?xml version="1.0" encoding="utf-8"?>
<aa>
<bb>
BB
</bb>
<cc>
<dd>
XXX
</dd>
</cc>
<cc>
<dd>
XXX
</dd>
</cc>
</aa>

Related

How do I access elements in an XML when multiple default namespaces are used?

I would expect this code to produce a non-empty list:
import xml.etree.ElementTree as et
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<A
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="a:namespace">
<B xmlns="b:namespace">
<C>"Stuff"</C>
</B>
</A>
'''
namespaces = {'a' : 'a:namespace', 'b' : 'b:namespace'}
xroot = et.fromstring(xml)
res = xroot.findall('b:C', namespaces)
Instead, res is an empty list. Why?
When I inspect the contents of xroot I can see that the C item is within b:namespace as expected:
for x in xroot.iter():
    print(x)
# result:
<Element '{a:namespace}A' at 0x7f56e13b95e8>
<Element '{b:namespace}B' at 0x7f56e188d2c8>
<Element '{b:namespace}C' at 0x7f56e188def8>
To check whether something was wrong with my namespacing, I also tried xroot.findall('{b:namespace}C'), but the result was an empty list as well.
Your findall path 'b:C' only searches the direct children of the root element, and C is a grandchild. Change it to './/b:C' so that the tag is found anywhere in the tree, e.g.:
import xml.etree.ElementTree as et
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<A
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="a:namespace">
<B xmlns="b:namespace">
<C>"Stuff"</C>
</B>
</A>
'''
namespaces = {'a' : 'a:namespace', 'b' : 'b:namespace'}
xroot = et.fromstring(xml)
######## changed the xpath to start with .//
res = xroot.findall('.//b:C', namespaces)
print( f"{res=}" )
for x in xroot.iter():
    print(x)
Output:
res=[<Element '{b:namespace}C' at 0x00000222DFCAAA40>]
<Element '{a:namespace}A' at 0x00000222DFCAA9A0>
<Element '{b:namespace}B' at 0x00000222DFCAA9F0>
<Element '{b:namespace}C' at 0x00000222DFCAAA40>
See here for some useful examples of ElementTree's XPath support: https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xpath#xpath-support
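For comparison, here is a small sketch of several ways to reach C in the same document (trimmed to the two default namespaces). The variable names are only illustrative, and the `{*}` wildcard form needs Python 3.8 or later:

```python
import xml.etree.ElementTree as et

xml = '''<A xmlns="a:namespace">
  <B xmlns="b:namespace">
    <C>"Stuff"</C>
  </B>
</A>'''
namespaces = {'a': 'a:namespace', 'b': 'b:namespace'}
xroot = et.fromstring(xml)

direct = xroot.findall('b:C', namespaces)        # direct children of A only
anywhere = xroot.findall('.//b:C', namespaces)   # any depth below A
stepwise = xroot.findall('b:B/b:C', namespaces)  # explicit path through B
clark = xroot.findall('.//{b:namespace}C')       # full name, no prefix mapping
wildcard = xroot.findall('.//{*}C')              # any namespace (Python 3.8+)

print(len(direct), len(anywhere), len(stepwise), len(clark), len(wildcard))
```

The explicit 'b:B/b:C' path is the strictest of these, since it only matches a C that sits directly under a B child of the root.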

Getting the node attribute of an XML file with lxml parsing

I can't get my head around why this isn't working properly:
from lxml import etree
data='''<?xml version="1.0" encoding="UTF-8"?>\n<div type="docs" xml:base="/kime-api/prod/api/emi/2" xml:lang="ja" xml:id="39532e30"> <div n="0001" type="doc" xml:id="_5738d00002"></div></div>'''
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
# I tried with and without this following line
#data = data.replace('<?xml version="1.0" encoding="UTF-8"?>','')
XML_tree = etree.fromstring(data.encode() , parser=parser)
lang = XML_tree.xpath('.//div[@xml:lang]')
lang
lang is an empty list, yet there is ONE element with xml:lang="ja" in the XML.
What am I doing wrong, please?
You could just do xpath('@xml:lang'):
XML_tree = etree.fromstring(data.encode() , parser=parser)
lang = XML_tree.xpath('@xml:lang')
print(lang)
Output:
['ja']
XML_tree represents the root element (the <div> with an xml:lang attribute), so './/div[@xml:lang]' only searches the elements below it and never tests the root itself.
If you want to get the language, use the following:
lang = XML_tree.xpath('@xml:lang')
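An alternative to XPath is to read the attribute directly through its fully qualified name. The sketch below deliberately uses the standard library's ElementTree rather than lxml so that it runs without extra packages, and it trims the sample document slightly; lxml elements accept the same .get() call.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Trimmed version of the document from the question.
data = ('<div type="docs" xml:lang="ja" xml:id="39532e30">'
        '<div n="0001" type="doc" xml:id="_5738d00002"></div></div>')
root = ET.fromstring(data)

# Read xml:lang from the root using the {namespace}name form.
lang = root.get(f"{{{XML_NS}}}lang")

# List the xml:id of every <div> (root included) that carries xml:lang.
with_lang = [el.get(f"{{{XML_NS}}}id")
             for el in root.iter("div")
             if el.get(f"{{{XML_NS}}}lang") is not None]

print(lang, with_lang)
```

Only the root <div> ends up in with_lang, which is why a search restricted to the elements below the root comes back empty.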

How to add attribute to lxml Element

I would like to add an attribute to an lxml Element, like this:
<outer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header>
<field1 name="blah">some value1</field1>
<field2 name="asdfasd">some value2</field2>
</Header>
</outer>
Here is what I have
E = lxml.builder.ElementMaker()
outer = E.outer
header = E.Header
FIELD1 = E.field1
FIELD2 = E.field2
the_doc = outer(
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance",
XML_2_HEADER(
FIELD1('some value1', name='blah'),
FIELD2('some value2', name='asdfasd'),
),
)
It seems like this line is causing the problem:
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance",
Even if I replace it with
'xmlns:xsi'="http://www.w3.org/2001/XMLSchema-instance",
it won't work.
How can I add an attribute to an lxml Element?
That's a namespace declaration, not an ordinary XML attribute. You can pass namespace information to ElementMaker() as a dictionary, for example:
from lxml import etree as ET
import lxml.builder
nsdef = {'xsi':'http://www.w3.org/2001/XMLSchema-instance'}
E = lxml.builder.ElementMaker(nsmap=nsdef)
doc = E.outer(
    E.Header(
        E.field1('some value1', name='blah'),
        E.field2('some value2', name='asdfasd'),
    ),
)
# tostring returns bytes in Python 3, so decode before printing
print(ET.tostring(doc, pretty_print=True).decode())
Output:
<outer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header>
<field1 name="blah">some value1</field1>
<field2 name="asdfasd">some value2</field2>
</Header>
</outer>
Link to the docs: http://lxml.de/api/lxml.builder.ElementMaker-class.html
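If the tree is built without ElementMaker, the same result can be sketched with etree.Element and SubElement. Ordinary attributes go in as keyword arguments or through .set(), while the namespace declaration goes through nsmap. The xsi:noNamespaceSchemaLocation attribute and its value example.xsd are hypothetical, added only to show how a namespaced attribute is written:

```python
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# The namespace declaration is passed via nsmap, not as an attribute.
outer = etree.Element("outer", nsmap={"xsi": XSI})
header = etree.SubElement(outer, "Header")

# Ordinary attributes can be given as keyword arguments...
field1 = etree.SubElement(header, "field1", name="blah")
field1.text = "some value1"

# ...or added later with .set().
field2 = etree.SubElement(header, "field2")
field2.text = "some value2"
field2.set("name", "asdfasd")

# Hypothetical namespaced attribute, written as {namespace-uri}localname.
outer.set(f"{{{XSI}}}noNamespaceSchemaLocation", "example.xsd")

result = etree.tostring(outer, pretty_print=True).decode()
print(result)
```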

Keep lxml from creating self-closing tags

I have an (old) tool that does not understand self-closing tags like <STATUS/>. So we need to serialize our XML files with explicit opening and closing tags, like this: <STATUS></STATUS>.
Currently I have:
>>> from lxml import etree
>>> para = """<ERROR>The status is <STATUS></STATUS>.</ERROR>"""
>>> tree = etree.XML(para)
>>> etree.tostring(tree)
'<ERROR>The status is <STATUS/>.</ERROR>'
How can I serialize with opened/closed tags?
<ERROR>The status is <STATUS></STATUS>.</ERROR>
Solution
Given by wildwilhelm, below:
>>> from lxml import etree
>>> para = """<ERROR>The status is <STATUS></STATUS>.</ERROR>"""
>>> tree = etree.XML(para)
>>> for status_elem in tree.xpath("//STATUS[string() = '']"):
...     status_elem.text = ""
>>> etree.tostring(tree)
'<ERROR>The status is <STATUS></STATUS>.</ERROR>'
It seems like the <STATUS> tag gets assigned a text attribute of None:
>>> tree[0]
<Element STATUS at 0x11708d4d0>
>>> tree[0].text
>>> tree[0].text is None
True
If you set the text attribute of the <STATUS> tag to an empty string, you should get what you're looking for:
>>> tree[0].text = ''
>>> etree.tostring(tree)
'<ERROR>The status is <STATUS></STATUS>.</ERROR>'
With this in mind, you can probably walk the tree and fix up the text attributes before writing out your XML. Something like this:
# prevent creation of self-closing tags
for node in tree.iter():
    if node.text is None:
        node.text = ''
If the lxml tree you are serializing is HTML, you can use
etree.tostring(html_dom, method='html')
to prevent self-closing tags like <a />.
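If switching libraries is an option, the standard library's ElementTree (not lxml, so this is a different serializer from the one in the question) has a short_empty_elements flag on tostring that controls this directly. A minimal sketch, assuming Python 3.4 or later:

```python
import xml.etree.ElementTree as ET

para = "<ERROR>The status is <STATUS></STATUS>.</ERROR>"
tree = ET.fromstring(para)

# By default, empty elements are written in the short self-closing form.
default = ET.tostring(tree, encoding="unicode")

# With short_empty_elements=False, every element gets an explicit end tag.
expanded = ET.tostring(tree, encoding="unicode", short_empty_elements=False)

print(default)
print(expanded)
```

Unlike the text = '' approach, this leaves the tree itself unmodified; the choice is made only at serialization time.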

How do I get a list of all parent tags in BeautifulSoup?

Let's say I have a structure like this:
<folder name="folder1">
<folder name="folder2">
<bookmark href="link.html">
</folder>
</folder>
If I point to bookmark, what would be the command to just extract all of the folder lines?
For example,
bookmarks = soup.findAll('bookmark')
then beautifulsoupcommand(bookmarks[0]) would return:
[<folder name="folder1">,<folder name="folder2">]
I'd also want to know when the ending tags hit too. Any ideas?
Thanks in advance!
Here is my stab at it:
>>> from BeautifulSoup import BeautifulSoup
>>> html = """<folder name="folder1">
<folder name="folder2">
<bookmark href="link.html">
</folder>
</folder>
"""
>>> soup = BeautifulSoup(html)
>>> bookmarks = soup.find_all('bookmark')
>>> [p.get('name') for p in bookmarks[0].find_all_previous(name = 'folder')]
[u'folder2', u'folder1']
The key difference from @eumiro's answer is that I am using find_all_previous instead of find_parents. When I tested @eumiro's solution, I found that find_parents only returned the first (immediate) parent, because the parent and grandparent have the same tag name.
>>> [p.get('name') for p in bookmarks[0].find_parents('folder')]
[u'folder2']
>>> [p.get('name') for p in bookmarks[0].find_parents()]
[u'folder2', None]
It does return two generations of parents if the parent and grandparent have different tag names:
>>> html = """<folder name="folder1">
<folder_parent name="folder2">
<bookmark href="link.html">
</folder_parent>
</folder>
"""
>>> soup = BeautifulSoup(html)
>>> bookmarks = soup.find_all('bookmark')
>>> [p.get('name') for p in bookmarks[0].find_parents()]
[u'folder2', u'folder1', None]
bookmarks[0].findParents('folder') will return a list of all parent folder nodes. You can then iterate over them and read their name attribute.
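As a hedged sketch with the current bs4 package and the built-in html.parser (both are assumptions; the session above imports the older BeautifulSoup package), find_parents is the newer spelling of findParents. On this setup I would expect it to return both folders, nearest first, so the single-parent result reported above may stem from a different version or parser:

```python
from bs4 import BeautifulSoup

html = """<folder name="folder1">
<folder name="folder2">
<bookmark href="link.html">
</folder>
</folder>
"""

# html.parser ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")
bookmark = soup.find("bookmark")

# Walk up the tree and collect the name attribute of each <folder> ancestor.
ancestors = [p.get("name") for p in bookmark.find_parents("folder")]
print(ancestors)
```

Reversing the list gives the folders in document order, from the outermost one down to the bookmark's immediate parent.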
