How to modify the value under a designated path with Python?


There is an XML file like the one below:
<aa>
<bb>BB</bb>
<cc>
<dd>Tom</dd>
</cc>
<cc>
<dd>David</dd>
</cc>
</aa>
I'm trying to modify the values "Tom" and "David", but I can't get any value from <dd>. I then tried to get the value of <bb> instead, but my code returned None.
My code is below:
import xml.etree.ElementTree as ET
tree = ET.parse("abc.xml")
root = tree.getroot()
a = root.find('aa/bb')
print(a)
Could someone help me correct my code so that I can get and modify the value of <dd>? Many thanks.

Your top-level element is aa, so root is already the aa element.
To get bb, search relative to it: root.find('bb')
>>> root
<Element 'aa' at 0x7fb1df5f0278>
>>> a = root.find('bb')
>>> a
<Element 'bb' at 0x7fb1df5f0228>
So to edit the names, try something like this:
for dd in root.findall('cc/dd'):
    if dd.text in ["Tom", "David"]:
        dd.text = "something else"
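
To make the difference between the two paths concrete, here is a minimal, self-contained sketch. It parses the sample from a string instead of abc.xml, and the replacements mapping is a made-up example, not something from the question.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, inlined so the sketch runs on its own.
SAMPLE = """<aa>
  <bb>BB</bb>
  <cc><dd>Tom</dd></cc>
  <cc><dd>David</dd></cc>
</aa>"""

root = ET.fromstring(SAMPLE)   # root is the <aa> element itself

# 'aa/bb' asks for an <aa> child of <aa>, which does not exist.
missing = root.find("aa/bb")
# Paths are relative to the element you call find() on.
bb = root.find("bb")

# Hypothetical mapping of old values to new values.
replacements = {"Tom": "Thomas", "David": "Dave"}
for dd in root.findall("cc/dd"):
    dd.text = replacements.get(dd.text, dd.text)

names = [dd.text for dd in root.findall("cc/dd")]
print(missing, bb.text, names)
```

The same relative-path rule applies to findall() and iterfind().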

Using ElementTree
Demo:
import xml.etree.ElementTree
et = xml.etree.ElementTree.parse(filename)
root = et.getroot()
for cc in root.findall('cc'):  # Find all cc tags
    print(cc.find("dd").text)  # Print current text
    cc.find("dd").text = "NewValue"  # Update dd tags with new value
et.write(filename)  # Write back to xml
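
If you want to check the whole round trip (parse, modify, write, read again), something like the following sketch should work. It writes the sample into a temporary directory so nothing on disk is touched; the file name abc.xml comes from the question, everything else (SAMPLE, reread) is just for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

SAMPLE = "<aa><bb>BB</bb><cc><dd>Tom</dd></cc><cc><dd>David</dd></cc></aa>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "abc.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(SAMPLE)

    tree = ET.parse(path)
    for dd in tree.getroot().iter("dd"):   # every <dd>, at any depth
        dd.text = "NewValue"

    # Write once, after all changes have been made.
    tree.write(path, encoding="utf-8", xml_declaration=True)

    # Parse the file again to confirm the changes were saved.
    reread = [dd.text for dd in ET.parse(path).getroot().iter("dd")]

print(reread)
```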

If you don't mind using BeautifulSoup, you can modify your XML through it:
data = """<aa>
<bb>BB</bb>
<cc>
<dd>Tom</dd>
</cc>
<cc>
<dd>David</dd>
</cc>
</aa>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
for dd in soup.select('cc > dd'):  # using CSS selectors
    dd.clear()
    dd.append('XXX')
print(soup.prettify())
Output:
<?xml version="1.0" encoding="utf-8"?>
<aa>
<bb>
BB
</bb>
<cc>
<dd>
XXX
</dd>
</cc>
<cc>
<dd>
XXX
</dd>
</cc>
</aa>

Related

How do I access elements in an XML when multiple default namespaces are used?

I would expect this code to produce a non-empty list:
import xml.etree.ElementTree as et
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<A
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="a:namespace">
<B xmlns="b:namespace">
<C>"Stuff"</C>
</B>
</A>
'''
namespaces = {'a' : 'a:namespace', 'b' : 'b:namespace'}
xroot = et.fromstring(xml)
res = xroot.findall('b:C', namespaces)
Instead, res is an empty list. Why?
When I inspect the contents of xroot, I can see that the C element is in b:namespace as expected:
for x in xroot.iter():
    print(x)
# result:
<Element '{a:namespace}A' at 0x7f56e13b95e8>
<Element '{b:namespace}B' at 0x7f56e188d2c8>
<Element '{b:namespace}C' at 0x7f56e188def8>
To check whether something was wrong with my namespace mapping, I also tried xroot.findall('{b:namespace}C'), but the result was an empty list as well.

Your findall path 'b:C' only matches direct children of the root element. Change it to './/b:C' so that the tag is found anywhere in the tree, e.g.:
import xml.etree.ElementTree as et
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<A
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="a:namespace">
<B xmlns="b:namespace">
<C>"Stuff"</C>
</B>
</A>
'''
namespaces = {'a' : 'a:namespace', 'b' : 'b:namespace'}
xroot = et.fromstring(xml)
######## changed the xpath to start with .//
res = xroot.findall('.//b:C', namespaces)
print( f"{res=}" )
for x in xroot.iter():
    print(x)
Output:
res=[<Element '{b:namespace}C' at 0x00000222DFCAAA40>]
<Element '{a:namespace}A' at 0x00000222DFCAA9A0>
<Element '{b:namespace}B' at 0x00000222DFCAA9F0>
<Element '{b:namespace}C' at 0x00000222DFCAAA40>
See here for some useful examples of ElementTree xpath support https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xpath#xpath-support
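
As a rough side-by-side, this sketch runs the different path forms against a trimmed copy of the document from the question (the unused xsi declaration and the quotes around Stuff are dropped). The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

xml = """<A xmlns="a:namespace">
  <B xmlns="b:namespace">
    <C>Stuff</C>
  </B>
</A>"""

ns = {"a": "a:namespace", "b": "b:namespace"}
root = ET.fromstring(xml)

# Only direct children of <A> are considered, and <C> is a grandchild.
direct = root.findall("b:C", ns)

# Spell out the full path from the root ...
by_path = root.findall("b:B/b:C", ns)

# ... or search the whole subtree.
anywhere = root.findall(".//b:C", ns)

# The {uri}tag form needs no prefix mapping, but it still needs './/'.
clark = root.findall(".//{b:namespace}C")

print(len(direct), len(by_path), len(anywhere), len(clark))
```

This also explains why '{b:namespace}C' returned nothing in the question: the namespace was right, but the path still only looked at direct children of the root.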

getting the node attribute of an XML file with LXML parsing

I can't get my head around why this isn't working properly:
data='''<?xml version="1.0" encoding="UTF-8"?>\n<div type="docs" xml:base="/kime-api/prod/api/emi/2" xml:lang="ja" xml:id="39532e30"> <div n="0001" type="doc" xml:id="_5738d00002"></div></div>'''
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
# I tried with and without this following line
#data = data.replace('<?xml version="1.0" encoding="UTF-8"?>','')
XML_tree = etree.fromstring(data.encode() , parser=parser)
lang = XML_tree.xpath('.//div[@xml:lang]')
lang
lang is an empty list and there is ONE element like: xml:lang="ja" in the XML.
What am I doing wrong please?

You could just do xpath('@xml:lang').
XML_tree = etree.fromstring(data.encode() , parser=parser)
lang = XML_tree.xpath('@xml:lang')
print(lang)
Output:
['ja']

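XML_tree is the root element (the <div> that carries the xml:lang attribute). The path './/div[@xml:lang]' only searches its descendants, and the nested <div> has no xml:lang, so the result is empty.
To get the language, query the attribute on the root directly:
lang = XML_tree.xpath('@xml:lang')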
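
For comparison, here is a minimal sketch of the same lookup with the standard library's xml.etree.ElementTree instead of lxml (a swapped-in library, not what the question uses). It assumes a shortened copy of the sample data without the XML declaration and xml:base; the names XML_NS, lang_key, root_lang and nested_with_lang are made up for illustration. ElementTree exposes xml:lang under the predefined XML namespace URI.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the reserved "xml" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"
lang_key = f"{{{XML_NS}}}lang"

# Shortened copy of the data from the question (assumption: no declaration).
data = (
    '<div type="docs" xml:lang="ja" xml:id="39532e30">'
    '<div n="0001" type="doc" xml:id="_5738d00002"></div>'
    '</div>'
)
root = ET.fromstring(data)

# The attribute lives on the root element itself.
root_lang = root.get(lang_key)

# Descendant <div> elements (root excluded) that carry xml:lang: none here,
# which mirrors why the './/div[...]' query in the question came back empty.
nested_with_lang = [
    el for el in root.iter("div")
    if el is not root and lang_key in el.attrib
]

print(root_lang, nested_with_lang)
```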

How to add attribute to lxml Element

I would like to add an attribute to an lxml Element, like this:
<outer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header>
<field1 name="blah">some value1</field1>
<field2 name="asdfasd">some value2</field2>
</Header>
</outer>
Here is what I have
E = lxml.builder.ElementMaker()
outer = E.outer
header = E.Header
FIELD1 = E.field1
FIELD2 = E.field2
the_doc = outer(
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance",
    XML_2_HEADER(
        FIELD1('some value1', name='blah'),
        FIELD2('some value2', name='asdfasd'),
    ),
)
It seems this line is causing the problem:
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance",
Even if I replace it with
'xmlns:xsi'="http://www.w3.org/2001/XMLSchema-instance",
it doesn't work.
How can I add an attribute to an lxml Element?

That's a namespace declaration, not an ordinary XML attribute. You can pass namespace information to ElementMaker() as a dictionary, for example:
from lxml import etree as ET
import lxml.builder
nsdef = {'xsi':'http://www.w3.org/2001/XMLSchema-instance'}
E = lxml.builder.ElementMaker(nsmap=nsdef)
doc = E.outer(
    E.Header(
        E.field1('some value1', name='blah'),
        E.field2('some value2', name='asdfasd'),
    ),
)
print(ET.tostring(doc, pretty_print=True).decode())
Output:
<outer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Header>
    <field1 name="blah">some value1</field1>
    <field2 name="asdfasd">some value2</field2>
  </Header>
</outer>
Link to the docs: http://lxml.de/api/lxml.builder.ElementMaker-class.html

Keep lxml from creating self-closing tags

I have an (old) tool which does not understand self-closing tags like <STATUS/>. So, we need to serialize our XML files with separate opening and closing tags, like this: <STATUS></STATUS>.
Currently I have:
>>> from lxml import etree
>>> para = """<ERROR>The status is <STATUS></STATUS>.</ERROR>"""
>>> tree = etree.XML(para)
>>> etree.tostring(tree)
'<ERROR>The status is <STATUS/>.</ERROR>'
How can I serialize with opened/closed tags?
<ERROR>The status is <STATUS></STATUS>.</ERROR>
Solution
Given by wildwilhelm, below:
>>> from lxml import etree
>>> para = """<ERROR>The status is <STATUS></STATUS>.</ERROR>"""
>>> tree = etree.XML(para)
>>> for status_elem in tree.xpath("//STATUS[string() = '']"):
...     status_elem.text = ""
>>> etree.tostring(tree)
'<ERROR>The status is <STATUS></STATUS>.</ERROR>'

It seems like the <STATUS> tag gets assigned a text attribute of None:
>>> tree[0]
<Element STATUS at 0x11708d4d0>
>>> tree[0].text
>>> tree[0].text is None
True
If you set the text attribute of the <STATUS> tag to an empty string, you should get what you're looking for:
>>> tree[0].text = ''
>>> etree.tostring(tree)
'<ERROR>The status is <STATUS></STATUS>.</ERROR>'
With this in mind, you can probably walk the tree and fix up the text attributes before writing out your XML. Something like this:
# prevent creation of self-closing tags
for node in tree.iter():
    if node.text is None:
        node.text = ''

If the lxml tree you are serializing is HTML, you can use
etree.tostring(html_dom, method='html')
to prevent self-closing tags like <a />.
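
If switching to the standard library is an option, xml.etree.ElementTree (not lxml) has a serializer flag for this, short_empty_elements, available since Python 3.4. A minimal sketch, assuming the same sample string as in the question:

```python
import xml.etree.ElementTree as ET

para = "<ERROR>The status is <STATUS></STATUS>.</ERROR>"
tree = ET.fromstring(para)

# Default behaviour: empty elements are collapsed into a self-closing tag.
collapsed = ET.tostring(tree, encoding="unicode")

# short_empty_elements=False keeps explicit open/close tags for every element.
expanded = ET.tostring(tree, encoding="unicode", short_empty_elements=False)

print(collapsed)
print(expanded)
```

Note that this is a different library from the one in the answers above. In the standard library the flag is what controls the output; setting text to an empty string does not prevent self-closing tags there.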

How do I get a list of all parent tags in BeautifulSoup?

Let's say I have a structure like this:
<folder name="folder1">
<folder name="folder2">
<bookmark href="link.html">
</folder>
</folder>
If I point to bookmark, what would be the command to just extract all of the folder lines?
For example,
bookmarks = soup.findAll('bookmark')
then beautifulsoupcommand(bookmarks[0]) would return:
[<folder name="folder1">,<folder name="folder2">]
I'd also want to know when the ending tags hit too. Any ideas?
Thanks in advance!

Here is my stab at it:
>>> from BeautifulSoup import BeautifulSoup
>>> html = """<folder name="folder1">
<folder name="folder2">
<bookmark href="link.html">
</folder>
</folder>
"""
>>> soup = BeautifulSoup(html)
>>> bookmarks = soup.find_all('bookmark')
>>> [p.get('name') for p in bookmarks[0].find_all_previous(name = 'folder')]
[u'folder2', u'folder1']
The key difference from @eumiro's answer is that I am using find_all_previous instead of find_parents. When I tested @eumiro's solution, I found that find_parents returned only the first (immediate) parent, because the parent and grandparent have the same tag name.
>>> [p.get('name') for p in bookmarks[0].find_parents('folder')]
[u'folder2']
>>> [p.get('name') for p in bookmarks[0].find_parents()]
[u'folder2', None]
It does return two generations of parents if the parent and grandparent are differently named.
>>> html = """<folder name="folder1">
<folder_parent name="folder2">
<bookmark href="link.html">
</folder_parent>
</folder>
"""
>>> soup = BeautifulSoup(html)
>>> bookmarks = soup.find_all('bookmark')
>>> [p.get('name') for p in bookmarks[0].find_parents()]
[u'folder2', u'folder1', None]

bookmarks[0].findParents('folder') returns a list of all parent folder nodes. You can then iterate over them and read their name attribute.
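
If the file is well-formed XML, a rough alternative without BeautifulSoup is to build a child-to-parent map with the standard library's ElementTree, which has no built-in parent pointers. This is only a sketch: it assumes the <bookmark> tag is closed (unlike the snippet in the question), and parent_of and ancestors are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Sample from the question, with <bookmark> closed so it parses as XML.
xml = """<folder name="folder1">
  <folder name="folder2">
    <bookmark href="link.html"/>
  </folder>
</folder>"""

root = ET.fromstring(xml)

# Map every element to its parent by walking the tree once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Return the ancestors of node, nearest first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain


bookmark = root.find(".//bookmark")
names = [a.get("name") for a in ancestors(bookmark)]
print(names)
```

The list is ordered from the closest folder outwards; reverse it if you want the outermost folder first.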
