I have the following xml - "file.xml"
<?xml version="1.0"?>
-<data>
-<dataset>
<ID>001</ID>
<A>5</A>
<B>2</B>
<C>1</C>
</dataset>
-<dataset>
<ID>002</ID>
<A>6</A>
<B>4</B>
<C>2</C>
</dataset>
-<dataset>
<ID>003</ID>
<A>3</A>
</dataset>
-<dataset>
<ID>004</ID>
<A>2</A>
<C>5</C>
</dataset>
</data>
I want to keep all elements with children A and B. Child C doesn't matter at all. My approach is to delete those elements without child A or B. Say, missing of either A or B will trigger the deletion of that element.
Here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
for element in root.findall('.//dataset'):
if element.tag != 'A' and element.tag != 'B':
root.remove(element)
This doesn't seem to be working.
Desired output:
<?xml version="1.0"?>
-<data>
-<dataset>
<ID>001</ID>
<A>5</A>
<B>2</B>
<C>1</C>
</dataset>
-<dataset>
<ID>002</ID>
<A>6</A>
<B>4</B>
<C>2</C>
</dataset>
</data>
Thank you!
I got it.
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
#Get a list of the parent elements 'dataset' that have both element 'A' and 'B'
both =[]
for i in tree.findall(".//dataset/A/.."):
if i in tree.findall(".//dataset/B/.."):
both.append(i)
#Remove elements that are not in the above list
for i in root:
if i not in both:
root.remove(i)
Related
I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export to a file the list of person names along with the city names
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
person_name = node.attrib.get("name")
rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
ElementTree.dump(child)
Is there a good way to get this information?
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
data.append({'parson': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
parson city
0 John New York
1 Mary Los Angeles
I have one xml file that looks like this, XML1:
<?xml version='1.0' encoding='utf-8'?>
<report>
</report>
And the other one that is like this,
XML2:
<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla" >
<child1>
<child2>
....
</child2>
</child1>
</report>
I need to replace and put root element of XML2 without its children, so XML1 looks like this:
<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla">
</report>
Currently my code looks like this but it won't remove children but put whole tree inside:
source_tree = ET.parse('XML2.xml')
source_root = source_tree.getroot()
report = source_root.findall('report')
for child in list(report):
report.remove(child)
source_tree.write('XML1.xml', encoding='utf-8', xml_declaration=True)
Anyone has ide how can I achieve this?
Thanks!
Try the below (just copy attrib)
import xml.etree.ElementTree as ET
xml1 = '''<?xml version='1.0' encoding='utf-8'?>
<report>
</report>'''
xml2 = '''<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla" >
<child1>
<child2>
</child2>
</child1>
</report>'''
root1 = ET.fromstring(xml1)
root2 = ET.fromstring(xml2)
root1.attrib = root2.attrib
ET.dump(root1)
output
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla">
</report>
So here is working code:
source_tree = ET.parse('XML2.xml')
source_root = source_tree.getroot()
dest_tree = ET.parse('XML1.xml')
dest_root = dest_tree.getroot()
dest_root.attrib = source_root.attrib
dest_tree.write('XML1.xml', encoding='utf-8', xml_declaration=True)
I am searching for a way to remove a specific tag <e> that has value as mmm within xml file (i.e <e>mmm</e>. I am referring to this thread as staring guide: How to remove elements from XML using Python without using lxml library instead of using ElementTree with python v2.6.6. I was trying to connect a dot with the thread and reading upon ElementTree api doc but I haven't been successful.
I appreciate your advice and thought on this.
<?xml version='1.0' encoding='UTF-8'?>
<parent>
<first>
<a>123</a>
<c>987</c>
<d>
<e>mmm</e>
<e>yyy</e>
</d>
</first>
<second>
<a>456</a>
<c>345</c>
<d>
<e>mmm</e>
<e>hhh</e>
</d>
</second>
</parent>
It took a while for me to realise all <e> tags are subnodes of <d>.
If we can assume the above is true for all your target nodes (<e> nodes with value mmm), you can use this script. (I added some extra nodes to check if it worked
import xml.etree.ElementTree as ET
xml_string = """<?xml version='1.0' encoding='UTF-8'?>
<parent>
<first>
<a>123</a>
<c>987</c>
<d>
<e>mmm</e>
<e>aaa</e>
<e>mmm</e>
<e>yyy</e>
</d>
</first>
<second>
<a>456</a>
<c>345</c>
<d>
<e>mmm</e>
<e>hhh</e>
</d>
</second>
</parent>"""
# this is how I create my root, if you choose to do it in a different way the end of this script might not be useful
root = ET.fromstring(xml_string)
target_node_first_parent = 'd'
target_node = 'e'
target_text = 'mmm'
# find all <d> nodes
for node in root.iter(target_node_first_parent):
# find <e> subnodes of <d>
for subnode in node.iter(target_node):
if subnode.text == target_text:
node.remove(subnode)
# output the result
tree = ET.ElementTree(root)
tree.write('output.xml')
I tried to just remove nodes found by root.iter(yourtag) but apparently it's not possible from the root (apparently it was not that easy)
The answer by #Queuebee is exactly correct but incase you want to read from a file, the code below provides a way to do that.
import xml.etree.ElementTree as ET
file_loc = " "
xml_tree_obj = ET.parse(file_loc)
xml_roots = xml_tree_obj.getroot()
target_node_first_parent = 'd'
target_node = 'e'
target_text = 'mmm'
# find all <d> nodes
for node in xml_roots.iter(target_node_first_parent):
# find <e> subnodes of <d>
for subnode in node.iter(target_node):
if subnode.text == target_text:
node.remove(subnode)
out_tree = ET.ElementTree(xml_roots)
out_tree.write('output.xml')
I have a number of 'root' tags with children 'name'. I want to sort the 'root' blocks, ordered alphabetically by the 'name' element. Have tried lxml / etree / minidom but can't get it working...
I can't get it to parse the value inside the tags, and then sort the parent root tags.
<?xml version='1.0' encoding='UTF-8'?>
<roots>
<root>
<path>//1.1.1.100/Alex</path>
<name>Alex Space</name>
</root>
<root>
<path>//1.1.1.101/Steve</path>
<name>Steve Space</name>
</root>
<root>
<path>//1.1.1.150/Bethany</path>
<name>Bethanys</name>
</root>
</roots>
Here is what I have tried:
import xml.etree.ElementTree as ET
def sortchildrenby(parent, child):
parent[:] = sorted(parent, key=lambda child: child)
tree = ET.parse('data.xml')
root = tree.getroot()
sortchildrenby(root, 'name')
for child in root:
sortchildrenby(child, 'name')
tree.write('output.xml')
If you want to put the name nodes first:
x = """
<roots>
<root>
<path>//1.1.1.100/Alex</path>
<name>Alex Space</name>
</root>
<root>
<path>//1.1.1.101/Steve</path>
<name>Bethanys</name>
</root>
<root>
<path>//1.1.1.150/Bethany</path>
<name>Steve Space</name>
</root>
</roots>"""
import lxml.etree as et
tree = et.fromstring(x)
for r in tree.iter("root"):
r[:] = sorted(r, key=lambda ch: -(ch.tag == "name"))
print(et.tostring(tree).decode("utf-8"))
Which would give you:
<roots>
<root>
<name>Alex Space</name>
<path>//1.1.1.100/Alex</path>
</root>
<root>
<name>Bethanys</name>
<path>//1.1.1.101/Steve</path>
</root>
<root>
<name>Steve Space</name>
<path>//1.1.1.150/Bethany</path>
</root>
</roots>
But there is no need to sort if you just want to add them first, you can just remove and reinsert the name into index 0:
import lxml.etree as et
tree = et.fromstring(x)
for r in tree.iter("root"):
ch = r.find("name")
r.remove(ch)
r.insert(0, ch)
print(et.tostring(tree).decode("utf-8"))
If the nodes are actually not in sorted order and you want to rearrange the roots node alphabetically:
x = """
<roots>
<root>
<path>//1.1.1.100/Alex</path>
<name>Alex Space</name>
</root>
<root>
<path>//1.1.1.101/Steve</path>
<name>Steve Space</name>
</root>
<root>
<path>//1.1.1.150/Bethany</path>
<name>Bethanys</name>
</root>
</roots>"""
import lxml.etree as et
tree = et.fromstring(x)
tree[:] = sorted(tree, key=lambda ch: ch.xpath("name/text()"))
print(et.tostring(tree).decode("utf-8"))
Which would give you:
<roots>
<root>
<path>//1.1.1.100/Alex</path>
<name>Alex Space</name>
</root>
<root>
<path>//1.1.1.150/Bethany</path>
<name>Bethanys</name>
</root>
<root>
<path>//1.1.1.101/Steve</path>
<name>Steve Space</name>
</root>
</roots>
You can also combine with either of the first two approach two also rearrange the root nodes putting name first.
Try this:
import xml.etree.ElementTree as ET
xml="<?xml version='1.0' encoding='UTF-8'?><roots><root><path>//1.1.1.100/Alex</path><name>Alex Space</name></root><root><path>//1.1.1.101/Steve</path><name>Steve Space</name></root><root><path>//1.1.1.150/Bethany</path><name>Bethanys</name></root></roots>"
oldxml = ET.fromstring(xml)
names = []
for rootobj in oldxml.findall('root'):
names.append(rootobj.find('name').text)
newxml = ET.Element('roots')
for name in sorted(names):
for rootobj in oldxml.findall('root'):
if name == rootobj.find('name').text:
newxml.append(rootobj)
ET.dump(oldxml)
ET.dump(newxml)
I'm reading from a variable and dumpin it on screen.
You can change it read from file and dump it to a file like you need.
Suppose you have an lmxl.etree element with the contents like:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</sublement2>
</element2>
</root>
I can use find or xpath methods to get something an element rendering something like:
<element1>
<subelement1>blabla</subelement1>
</element1>
Is there a way simple to get:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
i.e The element of interest plus all it's ancestors up to the document root?
I am not sure there is something built-in for it, but here is a terrible, "don't ever use it in real life" type of a workaround using the iterancestors() parent iterator:
from lxml import etree as ET
data = """<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>"""
root = ET.fromstring(data)
element = root.find(".//subelement1")
result = ET.tostring(element)
for node in element.iterancestors():
result = "<{name}>{text}</{name}>".format(name=node.tag, text=result)
print(ET.tostring(ET.fromstring(result), pretty_print=True))
Prints:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
The following code removes elements that don't have any subelement1 descendants and are not named subelement1.
from lxml import etree
tree = etree.parse("input.xml") # First XML document in question
for elem in tree.iter():
if elem.xpath("not(.//subelement1)") and not(elem.tag == "subelement1"):
if elem.getparent() is not None:
elem.getparent().remove(elem)
print etree.tostring(tree)
Output:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>