How to create a subset of document using lxml?

Suppose you have an lxml.etree element with contents like:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>
I can use the find or xpath methods to get an element that renders as something like:
<element1>
<subelement1>blabla</subelement1>
</element1>
Is there a simple way to get:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
i.e., the element of interest plus all its ancestors up to the document root?

I am not sure there is something built-in for it, but here is a terrible, "don't ever use it in real life" type of a workaround using the iterancestors() parent iterator:
from lxml import etree as ET
data = """<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>"""
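root = ET.fromstring(data)
element = root.find(".//subelement1")
# encoding="unicode" returns str rather than bytes, so it can be passed to str.format()
result = ET.tostring(element, encoding="unicode")
# Wrap the serialized element in each ancestor's tag, innermost ancestor first
for node in element.iterancestors():
    result = "<{name}>{text}</{name}>".format(name=node.tag, text=result)
print(ET.tostring(ET.fromstring(result), pretty_print=True, encoding="unicode"))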
Prints:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
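
A less hacky variant of the same idea is to build the wrapper elements directly instead of formatting strings. The sketch below is a minimal version under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, so it has to build its own child-to-parent map (with lxml, element.iterancestors() would supply the ancestors directly), and subset_with_ancestors is a hypothetical helper name, not part of either library.

```python
import copy
import xml.etree.ElementTree as ET

data = """<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>"""


def subset_with_ancestors(root, target):
    """Return a copy of target wrapped in shallow copies of its ancestors."""
    # ElementTree elements do not know their parent, so map child -> parent first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = copy.deepcopy(target)
    node.tail = None  # drop the whitespace that followed the element in the source
    current = target
    while current in parent_of:
        current = parent_of[current]
        # Copy only the tag and attributes of the ancestor, not its other children.
        wrapper = ET.Element(current.tag, current.attrib)
        wrapper.append(node)
        node = wrapper
    return node


root = ET.fromstring(data)
subset = subset_with_ancestors(root, root.find(".//subelement1"))
print(ET.tostring(subset, encoding="unicode"))
```

Because the target is deep-copied, the original tree is left unchanged.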

The following code removes elements that don't have any subelement1 descendants and are not named subelement1.
from lxml import etree
tree = etree.parse("input.xml") # First XML document in question
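# Iterate over a snapshot of the tree, because elements are removed inside the loop
for elem in list(tree.iter()):
    if elem.xpath("not(.//subelement1)") and elem.tag != "subelement1":
        parent = elem.getparent()
        if parent is not None:
            parent.remove(elem)
print(etree.tostring(tree).decode())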
Output:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>

Related

How to change sub element in lxml

My xml file:
<?xml version='1.0' encoding='UTF-8'?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<CtgyPurp>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</CtgyPurp> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>
I want to make this change in the XML file:
<CtgyPurp></CtgyPurp> should become <newName></newName>
I know how to change the value within a tag, but not how to rename the tag itself with lxml.

Something like this should work - note the treatment of namespaces:
from lxml import etree
ctg = """[your XML above]"""  # placeholder: paste the XML document from the question here
doc = etree.XML(ctg.encode())
ns = {"xx": "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"}
target = doc.xpath('//xx:CtgyPurp', namespaces=ns)[0]
target.tag = "newName"
print(etree.tostring(doc).decode())
Output:
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<newName>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</newName> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>
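
One thing to keep in mind: a plain "newName" tag carries no namespace, whereas the sibling and child elements keep the document's namespace. If the renamed element should stay in that namespace, a possible sketch is to assign the tag in "{uri}localname" form. This example assumes the standard library's xml.etree.ElementTree and a trimmed copy of the document from the question; newName is just the illustrative name used above.

```python
import xml.etree.ElementTree as ET

NS_URI = "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"

# Trimmed copy of the document from the question (inline comments removed).
xml_text = f"""<Document xmlns="{NS_URI}">
  <CstmrCdtTrfInitn>
    <CtgyPurp>
      <Cd>SALA</Cd>
    </CtgyPurp>
  </CstmrCdtTrfInitn>
</Document>"""

doc = ET.fromstring(xml_text)
ns = {"xx": NS_URI}  # "xx" is an arbitrary prefix, as in the answer above

target = doc.find(".//xx:CtgyPurp", ns)
# "{uri}localname" keeps the renamed element in the document's namespace.
target.tag = f"{{{NS_URI}}}newName"

renamed = doc.find(".//xx:newName", ns)
print(renamed.tag)
print(renamed.find("xx:Cd", ns).text)
```

The child element Cd is untouched; only the tag of its parent changes.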

Python XML remove elements if a child is not in it

I have the following XML file, "file.xml":
<?xml version="1.0"?>
<data>
<dataset>
<ID>001</ID>
<A>5</A>
<B>2</B>
<C>1</C>
</dataset>
<dataset>
<ID>002</ID>
<A>6</A>
<B>4</B>
<C>2</C>
</dataset>
<dataset>
<ID>003</ID>
<A>3</A>
</dataset>
<dataset>
<ID>004</ID>
<A>2</A>
<C>5</C>
</dataset>
</data>
I want to keep all elements that have both children A and B. Child C doesn't matter at all. My approach is to delete the elements that lack child A or B; that is, a missing A or B should trigger the deletion of that element.
Here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
for element in root.findall('.//dataset'):
    if element.tag != 'A' and element.tag != 'B':
        root.remove(element)
This doesn't seem to be working.
Desired output:
<?xml version="1.0"?>
<data>
<dataset>
<ID>001</ID>
<A>5</A>
<B>2</B>
<C>1</C>
</dataset>
<dataset>
<ID>002</ID>
<A>6</A>
<B>4</B>
<C>2</C>
</dataset>
</data>
Thank you!

I got it.
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
# Get a list of the parent 'dataset' elements that have both an 'A' and a 'B' child
both = []
for i in tree.findall(".//dataset/A/.."):
    if i in tree.findall(".//dataset/B/.."):
        both.append(i)
# Remove elements that are not in the above list; iterate over a copy,
# because removing from root while iterating over it skips elements
for i in list(root):
    if i not in both:
        root.remove(i)
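
The same filtering can also be written as a direct check on each dataset, without building the intermediate list. This is only a sketch: the XML is inlined from the question so that the example runs on its own, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<data>
<dataset><ID>001</ID><A>5</A><B>2</B><C>1</C></dataset>
<dataset><ID>002</ID><A>6</A><B>4</B><C>2</C></dataset>
<dataset><ID>003</ID><A>3</A></dataset>
<dataset><ID>004</ID><A>2</A><C>5</C></dataset>
</data>"""

root = ET.fromstring(xml_text)

# findall() returns a list, so removing from root inside the loop is safe.
for dataset in root.findall("dataset"):
    # find() returns None when the child is missing; compare with None explicitly,
    # because an element without children of its own is falsy.
    if dataset.find("A") is None or dataset.find("B") is None:
        root.remove(dataset)

kept_ids = [dataset.findtext("ID") for dataset in root]
print(kept_ids)
```

With a file on disk, ET.parse("file.xml").getroot() would take the place of ET.fromstring(xml_text).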

Sorting XML tags by child elements Python

I have a number of 'root' tags, each with a child 'name'. I want to sort the 'root' blocks alphabetically by the 'name' element. I have tried lxml / etree / minidom but can't get it working.
I can't get it to read the value inside the tags and then sort the parent 'root' tags.
<?xml version='1.0' encoding='UTF-8'?>
<roots>
<root>
<path>//1.1.1.100/Alex</path>
<name>Alex Space</name>
</root>
<root>
<path>//1.1.1.101/Steve</path>
<name>Steve Space</name>
</root>
<root>
<path>//1.1.1.150/Bethany</path>
<name>Bethanys</name>
</root>
</roots>
Here is what I have tried:
import xml.etree.ElementTree as ET
def sortchildrenby(parent, child):
    parent[:] = sorted(parent, key=lambda child: child)
tree = ET.parse('data.xml')
root = tree.getroot()
sortchildrenby(root, 'name')
for child in root:
    sortchildrenby(child, 'name')
tree.write('output.xml')

If you want to put the name nodes first:
x = """
<roots>
<root>
<path>//1.1.1.100/Alex</path>
<name>Alex Space</name>
</root>
<root>
<path>//1.1.1.101/Steve</path>
<name>Bethanys</name>
</root>
<root>
<path>//1.1.1.150/Bethany</path>
<name>Steve Space</name>
</root>
</roots>"""
import lxml.etree as et
tree = et.fromstring(x)
for r in tree.iter("root"):
    r[:] = sorted(r, key=lambda ch: -(ch.tag == "name"))
print(et.tostring(tree).decode("utf-8"))
Which would give you:
<roots>
<root>
<name>Alex Space</name>
<path>//1.1.1.100/Alex</path>
</root>
<root>
<name>Bethanys</name>
<path>//1.1.1.101/Steve</path>
</root>
<root>
<name>Steve Space</name>
<path>//1.1.1.150/Bethany</path>
</root>
</roots>
But there is no need to sort if you just want to add them first, you can just remove and reinsert the name into index 0:
import lxml.etree as et
tree = et.fromstring(x)
for r in tree.iter("root"):
    ch = r.find("name")
    r.remove(ch)
    r.insert(0, ch)
print(et.tostring(tree).decode("utf-8"))
If the nodes are not actually in sorted order and you want to rearrange the root nodes alphabetically by name:
x = """
<roots>
<root>
<path>//1.1.1.100/Alex</path>
<name>Alex Space</name>
</root>
<root>
<path>//1.1.1.101/Steve</path>
<name>Steve Space</name>
</root>
<root>
<path>//1.1.1.150/Bethany</path>
<name>Bethanys</name>
</root>
</roots>"""
import lxml.etree as et
tree = et.fromstring(x)
tree[:] = sorted(tree, key=lambda ch: ch.xpath("name/text()"))
print(et.tostring(tree).decode("utf-8"))
Which would give you:
<roots>
<root>
<path>//1.1.1.100/Alex</path>
<name>Alex Space</name>
</root>
<root>
<path>//1.1.1.150/Bethany</path>
<name>Bethanys</name>
</root>
<root>
<path>//1.1.1.101/Steve</path>
<name>Steve Space</name>
</root>
</roots>
You can also combine this with either of the first two approaches to rearrange the children of each root node as well, putting name first.

Try this:
import xml.etree.ElementTree as ET
xml="<?xml version='1.0' encoding='UTF-8'?><roots><root><path>//1.1.1.100/Alex</path><name>Alex Space</name></root><root><path>//1.1.1.101/Steve</path><name>Steve Space</name></root><root><path>//1.1.1.150/Bethany</path><name>Bethanys</name></root></roots>"
oldxml = ET.fromstring(xml)
names = []
for rootobj in oldxml.findall('root'):
    names.append(rootobj.find('name').text)
newxml = ET.Element('roots')
for name in sorted(names):
    for rootobj in oldxml.findall('root'):
        if name == rootobj.find('name').text:
            newxml.append(rootobj)
ET.dump(oldxml)
ET.dump(newxml)
I'm reading from a variable and dumping it to the screen.
You can change it to read from a file and write to a file as needed.
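
For the standard-library xml.etree.ElementTree used in the question, a minimal sketch of the same sort could use findtext() as the sort key. The sample data is inlined from the question; the comparison is plain string comparison, so it is case-sensitive, and a root without a name would sort first because of the assumed empty-string default.

```python
import xml.etree.ElementTree as ET

xml_text = """<roots>
<root><path>//1.1.1.100/Alex</path><name>Alex Space</name></root>
<root><path>//1.1.1.101/Steve</path><name>Steve Space</name></root>
<root><path>//1.1.1.150/Bethany</path><name>Bethanys</name></root>
</roots>"""

roots = ET.fromstring(xml_text)

# Slice assignment replaces the children of <roots> with the sorted sequence.
roots[:] = sorted(roots, key=lambda r: r.findtext("name", default=""))

names = [r.findtext("name") for r in roots]
print(names)
```

The key function receives each root element and returns the text of its name child, which is the step missing from the attempt in the question.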

How to extract data from xml file that is deep down the tag

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
I am only interested in 'accelx' and 'accely' value in this data and need to create a csv out of it.
Update: The code given below breaks when I replace the second line of the XML with the following; nothing is displayed:
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas">
The following code works:
import xml.etree.ElementTree as etree
tree = etree.parse(r"C:\Users\data.xml")
root = tree.getroot()
val_of_interest = root.findall("./Values/SensorValue")
for sensor_val in val_of_interest:
    print(sensor_val.find('accelx').text)
    print(sensor_val.find('accely').text)
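
The snippet above looks up tags without a namespace, which is presumably why nothing is found once the document declares a default namespace (xmlns="http://schemas" in the update). Below is a sketch of one way to handle that and produce the CSV, assuming the namespace URI really is http://schemas and that the CSV should have accelx and accely columns. The prefix s is an arbitrary choice, and the XML is a trimmed copy of the question's data.

```python
import csv
import io
import xml.etree.ElementTree as ET

xml_text = """<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas">
  <Values>
    <SensorValue>
      <accelx>-3.593643144</accelx>
      <accely>7.316485176</accely>
    </SensorValue>
    <SensorValue>
      <accelx>0.31103436</accelx>
      <accely>7.70408184</accely>
    </SensorValue>
  </Values>
</ArrayOfRecord>"""

root = ET.fromstring(xml_text)
ns = {"s": "http://schemas"}  # map a prefix of our choice to the default namespace

# Every step of the path needs the prefix, because all elements are namespaced.
rows = [
    (sv.findtext("s:accelx", namespaces=ns), sv.findtext("s:accely", namespaces=ns))
    for sv in root.findall("./s:Values/s:SensorValue", ns)
]

# Write to an in-memory buffer so the example does not touch the file system.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["accelx", "accely"])
writer.writerows(rows)
print(buffer.getvalue())
```

To write a real file instead, open it with newline="" and pass the file object to csv.writer in place of the buffer.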

Get attribute of first element using lxml

Trying to parse an XML file using lxml in Python, how do I simply get the value of an element's attribute? Example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<item id="123">
<sub>ABC</sub>
</item>
I'd like to get the result 123, and store it as a variable.

When using etree.parse(), simply call .getroot() to get the root element; the .attrib attribute is a dictionary of all attributes, use that to get the value:
>>> from lxml import etree
>>> tree = etree.parse('test.xml')
>>> tree.getroot().attrib['id']
'123'
If you used etree.fromstring() the object returned is the root object already, so no .getroot() call is needed:
>>> tree = etree.fromstring(b'''\
... <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
... <item id="123">
... <sub>ABC</sub>
... </item>
... ''')
>>> tree.attrib['id']
'123'

Alternatively, you could use an XPath selector:
>>> from lxml import etree
>>> tree = etree.fromstring(b'''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
... <item id="123">
... <sub>ABC</sub>
... </item>''')
>>> tree.xpath('/item/@id')
['123']

I think Martijn has answered your question. Building on his answer, you can also use the items() method to get a list of tuples with the attributes and values. This may be useful if you need the values of multiple attributes. Like so:
>>> from lxml import etree
>>> tree = etree.parse('test.xml')
>>> item = tree.xpath('/item')[0]
>>> item.items()
[('id', '123')]
Or, when parsing from a string:
>>> tree = etree.fromstring(b"""\
... <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
... <item id="123">
... <sub>ABC</sub>
... </item>
... """)
>>> tree.items()
[('id', '123')]
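
If the attribute might be absent, indexing .attrib raises a KeyError, whereas the element's get() method returns None or a default of your choice. A minimal sketch, shown with the standard library's xml.etree.ElementTree so it runs without extra packages; the attribute name "name" and the default "n/a" are made up for illustration.

```python
import xml.etree.ElementTree as ET

item = ET.fromstring('<item id="123"><sub>ABC</sub></item>')

item_id = item.get("id")            # attribute exists: returns its value as a string
missing = item.get("name")          # attribute absent: returns None
fallback = item.get("name", "n/a")  # attribute absent: returns the given default

print(item_id, missing, fallback)
```

Attribute values always come back as strings, so convert with int(item_id) if a number is needed.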
