How to change sub element in lxml - python

My xml file:
<?xml version='1.0' encoding='UTF-8'?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<CtgyPurp>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</CtgyPurp> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>
I want to make one change in the XML file:
rename <CtgyPurp></CtgyPurp> to <newName></newName>
I know how to change the value inside a tag, but not how to rename the tag itself with lxml.

Something like this should work - note the treatment of namespaces:
from lxml import etree
ctg = """[your xml above]"""
doc = etree.XML(ctg.encode())
ns = {"xx": "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"}
target = doc.xpath('//xx:CtgyPurp',namespaces=ns)[0]
target.tag = "newName"
print(etree.tostring(doc).decode())
Output:
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<newName>. // ---->i want to change this tag
<Cd>SALA</Cd> //-----> no change
</newName> // ----> i want to change this tag
</CstmrCdtTrfInitn>
</Document>
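One caveat on the namespace treatment: as far as I can tell, assigning a bare name like "newName" gives the element a tag with no namespace, while its parent and children stay in the pain.001 namespace. If the renamed element should remain in the document's namespace, the tag can be written in Clark notation ("{uri}local"). The sketch below uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages; the tag assignment itself is the same in lxml. The XML is a trimmed copy of the question's file.

```python
import xml.etree.ElementTree as ET

NS_URI = "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"

# Trimmed copy of the document from the question.
XML = (
    '<Document xmlns="' + NS_URI + '">'
    "<CstmrCdtTrfInitn><CtgyPurp><Cd>SALA</Cd></CtgyPurp></CstmrCdtTrfInitn>"
    "</Document>"
)

root = ET.fromstring(XML)
ns = {"xx": NS_URI}

target = root.find(".//xx:CtgyPurp", ns)
# Clark notation keeps the renamed element in the same namespace as before.
target.tag = "{%s}newName" % NS_URI

# The element is now found under its new, still namespace-qualified name.
renamed = root.findall(".//xx:newName", ns)
print(len(renamed))
print(renamed[0].find("xx:Cd", ns).text)
```

With a namespace-qualified tag, later namespace-aware lookups such as //xx:newName keep working.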

Related

filter non-nested tag values from XML

I have an xml that looks like this.
<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>
I am parsing it using BeautifulSoup. Upon doing this:
soup = BeautifulSoup(file, 'xml')
pos = []
for i in soup.find_all('pos'):
    pos.append(i.text)
I get a list of all pos tag values, even the ones that are nested within the kat_pis tag.
So I get (697, 112, 099, 113).
However, I only want the pos values of the non-nested tags.
The desired result is (697, 099).
How can I achieve this?
Here is one way of getting those first level pos:
from bs4 import BeautifulSoup as bs
xml_doc = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>'''
soup = bs(xml_doc, 'xml')
pos = []
for i in soup.select('offer > pos'):
    pos.append(i.text)
print(pos)
Result in terminal:
['697', '099']
I think the best solution would be to abandon BeautifulSoup for an XML parser with XPath support, like lxml. Using XPath expressions, you can ask for only those pos elements that are direct children of offer elements:
from lxml import etree
doc = etree.parse('data.xml')  # lxml accepts a filename directly
pos = []
for ele in doc.xpath('//offer/pos'):
    pos.append(ele.text)
print(pos)
Given your example input, the above code prints:
['697', '099']
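If installing lxml is not an option, the same child-only selection also seems to work with the standard library's xml.etree.ElementTree, whose limited XPath subset supports the .//offer/pos path. This is a sketch rather than a replacement for either answer above; the XML is the question's sample with the declaration dropped.

```python
import xml.etree.ElementTree as ET

# The sample document from the question.
XML = """<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis><pos kat="2">112</pos></kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis><pos kat="2">113</pos></kat_pis>
</offer>
</details>
</main_heading>"""

root = ET.fromstring(XML)

# ".//offer/pos" selects pos elements that are direct children of an offer,
# so the pos elements nested inside kat_pis are skipped.
pos = [el.text for el in root.findall(".//offer/pos")]
print(pos)
```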

how to find and edit tags in XML files with namespaces using ElementTree

I would like to find specific tags in my XML document and edit their text or attributes. My XML file contains namespaces (and, if I understand correctly, nested namespaces). The tool I'd like to use for this is ElementTree. I managed to read the XML file with iterparse, but I don't know how to save the edited XML, because iterparse doesn't have a write method. I need either a way to read the XML file with parse and strip its namespaces (including nested ones), or a way to save the iterparsed file.
For this case, let's edit the "Rating" tag text.
import xml.etree.ElementTree as ET
it = ET.iterparse(adiPath)
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    for at in list(el.attrib):  # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root
# Search for the Rating tag and edit its value
for rating in root.iter('Rating'):
    print(rating.text)  # Prints 18
    rating.text = "999"
    print(rating.text)  # Prints 999
However, in this case the XML file itself remains unchanged.
Here is XML file:
<?xml version="1.0" encoding="utf-8"?>
<ADI3 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:content="urn:cablelabs:md:xsd:content:3.0" xmlns:core="urn:cablelabs:md:xsd:core:3.0" xmlns:offer="urn:cablelabs:md:xsd:offer:3.0" xmlns:terms="urn:cablelabs:md:xsd:terms:3.0" xmlns:title="urn:cablelabs:md:xsd:title:3.0" xmlns:adb="urn:adb:md:xsd:adb:01" xmlns:schemaLocation="urn:adb:md:xsd:adb:01 ADB-EXT-C01.xsd urn:cablelabs:md:xsd:core:3.0 MD-SP-CORE-C01.xsd urn:cablelabs:md:xsd:content:3.0 MD-SP-CONTENT-C01.xsd urn:cablelabs:md:xsd:offer:3.0 MD-SP-OFFER-C01.xsd urn:cablelabs:md:xsd:terms:3.0 MD-SP-TERMS-C01.xsd urn:cablelabs:md:xsd:title:3.0 MD-SP-TITLE-C01.xsd" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns="urn:cablelabs:md:xsd:core:3.0">
<Asset xsi:type="title:TitleType" uriId="ab://cc.com" providerVersionNum="1" internalVersionNum="0" creationDateTime="2020-01-28T08:55:19Z" startDateTime="2019-05-20T00:00:00Z" endDateTime="2028-08-20T23:59:00Z">
<AlternateId identifierSystem="VOD1.1">ab://cc.com</AlternateId>
<Ext>
<adb:ExtensionType>
<adb:TitleExt>
<adb:SeriesInfo episodeNumber="6">
<adb:series seriesId="GOT" seasonCount="8"></adb:series>
<adb:season seasonId="GOTS08" number="8" episodeCount="6"></adb:season>
</adb:SeriesInfo>
</adb:TitleExt>
</adb:ExtensionType>
</Ext>
<title:LocalizableTitle xml:lang="pol">
<title:TitleLong>Game of Thrones VIII</title:TitleLong>
<title:SummaryLong>Long summary, long summary, long summary...</title:SummaryLong>
<title:Actor fullName="Peter Dinklage" firstName="Peter" lastName="Dinklage" />
<title:Actor fullName="Nikolaj Coster-Waldau" firstName="Nikolaj" lastName="Coster-Waldau" />
<title:Actor fullName="Emilia Clarke" firstName="Emilia" lastName="Clarke" />
<title:Actor fullName="Lena Headey" firstName="Lena" lastName="Headey" />
<title:Director fullName="David Nutter" firstName="David" lastname="Nutter" />
</title:LocalizableTitle>
<title:Rating ratingSystem="PL">18</title:Rating>
<title:Audience>General</title:Audience>
<title:DisplayRunTime>01:15</title:DisplayRunTime>
<title:Year>2019</title:Year>
<title:CountryOfOrigin>US</title:CountryOfOrigin>
<title:Genre>Film fantasy</title:Genre>
<title:ShowType>Movie</title:ShowType>
</Asset>
<Asset xsi:type="offer:CategoryType" uriId="cc.com/XX">
<AlternateId identifierSystem="VOD1.1">cc.com/XX</AlternateId>
<offer:CategoryPath>VOD/GOT/Season 8</offer:CategoryPath>
</Asset>
<Asset xsi:type="content:MovieType" uriId="GraoTronVIII_0_1080mp4">
<AlternateId identifierSystem="VOD1.1">GraoTronVIII_0_1080mp4</AlternateId>
<content:SourceUrl>GOTS08E06.mp4</content:SourceUrl>
<content:Resolution>1080p</content:Resolution>
<content:Duration>PT1H15M20S</content:Duration>
<content:Language>pol</content:Language>
<content:Language>eng</content:Language>
</Asset>
<Asset xsi:type="content:PreviewType" uriId="GraoTronVIII_1_1080mp4">
<AlternateId identifierSystem="VOD1.1">GraoTronVIII_1_1080mp4</AlternateId>
<content:SourceUrl>GOTS08E06_trailer.mp4</content:SourceUrl>
<content:Resolution>1080p</content:Resolution>
<content:Duration>PT0H01M48S</content:Duration>
<content:Language>pol</content:Language>
<content:Language>eng</content:Language>
</Asset>
<Asset xsi:type="content:PosterType" uriId="GraoTronVIIIPoster">
<AlternateId identifierSystem="VOD1.1">GraoTronVIIIPoster</AlternateId>
<content:SourceUrl>GOTS08E06.jpg</content:SourceUrl>
<content:X_Resolution>600</content:X_Resolution>
<content:Y_Resolution>900</content:Y_Resolution>
<content:Language>pol</content:Language>
</Asset>
<Asset xsi:type="offer:ContentGroupType" uriId="abc">
<AlternateId identifierSystem="VOD1.1">abc</AlternateId>
<offer:TitleRef uriId="abc" />
<offer:MovieRef uriId="GraoTronVIII_0_1080mp4" />
</Asset>
<Asset xsi:type="offer:ContentGroupType" uriId="abc">
<AlternateId identifierSystem="VOD1.1">abc</AlternateId>
<offer:TitleRef uriId="abc" />
<offer:MovieRef uriId="GraoTronVIII_1_1080mp4" />
</Asset>
<Asset xsi:type="offer:ContentGroupType" uriId="abc">
<AlternateId identifierSystem="VOD1.1">abc</AlternateId>
<offer:TitleRef uriId="abc" />
<offer:MovieRef uriId="GraoTronVIIIPoster" />
</Asset>
</ADI3>
Instead of stripping out the namespaces, I suggest using namespace wildcards. Support for this was added in Python 3.8.
from xml.etree import ElementTree as ET
tree = ET.parse(adiPath)
rating = tree.find(".//{*}Rating") # Find the Rating element in any namespace
rating.text = "999"
Note that you have to use find() (or findall()). Wildcards do not work with iter().
The following workaround can be used to preserve the original namespace prefixes when serializing the XML document (see also https://stackoverflow.com/a/42372404/407651 and https://stackoverflow.com/a/54491129/407651).
namespaces = dict([elem for _, elem in ET.iterparse("test1.xml", events=['start-ns'])])
for ns in namespaces:
    ET.register_namespace(ns, namespaces[ns])
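Since the question also asks how to save the result, here is a minimal end-to-end sketch that combines the two snippets above and adds the write step. It assumes Python 3.8+ (for the {*} wildcard). SAMPLE is a shortened stand-in for the ADI3 file, set_rating is a hypothetical helper name, and the in-memory io.BytesIO objects stand in for real file paths.

```python
import io
from xml.etree import ElementTree as ET

# Shortened stand-in for the ADI3 file in the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0" xmlns:title="urn:cablelabs:md:xsd:title:3.0">
<Asset uriId="ab://cc.com">
<title:Rating ratingSystem="PL">18</title:Rating>
</Asset>
</ADI3>"""


def set_rating(xml_bytes, new_value):
    """Return the document with the first Rating element's text replaced."""
    # Register every prefix declared in the document so that write()
    # reuses it instead of inventing ns0, ns1, ...
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns"]):
        ET.register_namespace(prefix, uri)

    tree = ET.parse(io.BytesIO(xml_bytes))
    rating = tree.find(".//{*}Rating")  # Rating in any namespace
    rating.text = new_value

    # ElementTree.write() is the missing "save" step.
    out = io.BytesIO()
    tree.write(out, encoding="utf-8", xml_declaration=True)
    return out.getvalue()


result = set_rating(SAMPLE, "999")
print(result.decode("utf-8"))
```

With a real file, ET.parse and tree.write take the path directly, for example tree.write(adiPath, encoding="utf-8", xml_declaration=True).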

How to create a subset of document using lxml?

Suppose you have an lxml.etree element with contents like:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>
I can use the find or xpath methods to get an element that renders something like:
<element1>
<subelement1>blabla</subelement1>
</element1>
Is there a simple way to get:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
i.e., the element of interest plus all its ancestors up to the document root?
I am not sure there is something built-in for it, but here is a terrible, "don't ever use it in real life" type of a workaround using the iterancestors() parent iterator:
from lxml import etree as ET
data = """<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>"""
root = ET.fromstring(data)
element = root.find(".//subelement1")
result = ET.tostring(element, encoding="unicode")  # str, so it can be formatted into the wrappers below
for node in element.iterancestors():
    result = "<{name}>{text}</{name}>".format(name=node.tag, text=result)
print(ET.tostring(ET.fromstring(result), pretty_print=True).decode())
Prints:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
The following code removes elements that don't have any subelement1 descendants and are not named subelement1.
from lxml import etree
tree = etree.parse("input.xml") # First XML document in question
for elem in list(tree.iter()):  # snapshot first: removing nodes while iterating can skip elements
    if elem.xpath("not(.//subelement1)") and elem.tag != "subelement1":
        if elem.getparent() is not None:
            elem.getparent().remove(elem)
print(etree.tostring(tree).decode())
Output:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
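A third option that leaves the original tree untouched is to copy the element and wrap the copy in fresh, empty copies of its ancestors. The sketch below uses the standard library's xml.etree.ElementTree, which has no getparent(), so it builds a child-to-parent map first. With lxml, iterancestors() could supply the ancestors instead. subset_with_ancestors is a hypothetical helper name.

```python
import copy
import xml.etree.ElementTree as ET


def subset_with_ancestors(root, target):
    """Return a new tree holding a copy of target wrapped in copies of its ancestors."""
    # ElementTree elements do not know their parent, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    subtree = copy.deepcopy(target)
    node = target
    while node in parent_of:
        node = parent_of[node]
        # Copy only the ancestor's tag and attributes, not its other children.
        wrapper = ET.Element(node.tag, dict(node.attrib))
        wrapper.append(subtree)
        subtree = wrapper
    return subtree


root = ET.fromstring(
    "<root><element1><subelement1>blabla</subelement1></element1>"
    "<element2><subelement2>blibli</subelement2></element2></root>"
)
result = subset_with_ancestors(root, root.find(".//subelement1"))
print(ET.tostring(result, encoding="unicode"))
```

Because only copies are modified, the source document can still be queried afterwards.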

How to extract data from xml file that is deep down the tag

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
I am only interested in 'accelx' and 'accely' value in this data and need to create a csv out of it.
Update: The code given below breaks when I replace the second line with the following; nothing is displayed:
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas">
The following code works:
import xml.etree.ElementTree as etree
tree = etree.parse(r"C:\Users\data.xml")
root = tree.getroot()
val_of_interest = root.findall("./Values/SensorValue")
for sensor_val in val_of_interest:
    print(sensor_val.find('accelx').text)
    print(sensor_val.find('accely').text)
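Regarding the update: once the root element declares xmlns="http://schemas", every unprefixed element belongs to that namespace, so a plain "./Values/SensorValue" path no longer matches anything. One way to handle it, sketched below, is to pass a prefix-to-URI mapping to findall() and findtext(), then write the rows with the csv module. The prefix "s", the helper name sensor_rows, and the in-memory CSV buffer are illustrative choices; the XML is a trimmed copy of the question's data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the question's data, with the default namespace from the update.
XML = """<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas">
<Values>
<SensorValue><accelx>-3.593643144</accelx><accely>7.316485176</accely></SensorValue>
<SensorValue><accelx>0.31103436</accelx><accely>7.70408184</accely></SensorValue>
</Values>
</ArrayOfRecord>"""

# The prefix "s" is arbitrary; only the URI has to match the document.
NS = {"s": "http://schemas"}


def sensor_rows(root):
    """Collect (accelx, accely) pairs from the top-level Values element."""
    rows = []
    for sensor in root.findall("./s:Values/s:SensorValue", NS):
        rows.append((
            sensor.findtext("s:accelx", namespaces=NS),
            sensor.findtext("s:accely", namespaces=NS),
        ))
    return rows


rows = sensor_rows(ET.fromstring(XML))

# Write the rows as CSV; replace the buffer with open("out.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["accelx", "accely"])
writer.writerows(rows)
print(buffer.getvalue())
```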

Python XML check next item

Here is a little xml example:
<?xml version="1.0" encoding="UTF-8"?>
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
...
...
</list>
Now I need all Persons with a name and city.
I tried:
#!/usr/bin/python
# coding: utf8
import xml.dom.minidom as dom
tree = dom.parse("test.xml")
for listItems in tree.firstChild.childNodes:
    for personItems in listItems.childNodes:
        if personItems.nodeName == "name" and personItems.nextSibling == "city":
            print(personItems.firstChild.data.strip())
But the output is empty. Without the "and" condition I get all names. How can I check that the next tag after "name" is "city"?
You can do this in minidom:
import xml.dom.minidom as minidom
def getChild(n, v):
    for child in n.childNodes:
        if child.localName == v:
            yield child
xmldoc = minidom.parse('test.xml')
person = getChild(xmldoc, 'list')
for p in person:
    for v in getChild(p, 'person'):
        attr = v.getAttributeNode('id')
        if attr:
            print(attr.nodeValue.strip())
This prints the id of each person node:
1
2
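As for why the condition in the question never matches: nextSibling returns a node object (in indented XML usually a whitespace text node), so comparing it to the string "city" is always false. A minimal minidom sketch that skips non-element siblings and then compares nodeName could look like this. next_element and names_followed_by_city are hypothetical helper names, and the XML is the question's sample without the "..." lines.

```python
import xml.dom.minidom as minidom

# The sample from the question, without the elided "..." lines.
XML = """<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>"""


def next_element(node):
    """Return the next sibling that is an element, skipping text and comments."""
    sibling = node.nextSibling
    while sibling is not None and sibling.nodeType != sibling.ELEMENT_NODE:
        sibling = sibling.nextSibling
    return sibling


def names_followed_by_city(doc):
    """Collect the text of every name element whose next element sibling is city."""
    found = []
    for name in doc.getElementsByTagName("name"):
        following = next_element(name)
        if following is not None and following.nodeName == "city":
            found.append(name.firstChild.data.strip())
    return found


names = names_followed_by_city(minidom.parseString(XML))
print(names)
```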
Use ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('a.xml')
root = tree.getroot()
for person in root.findall('person'):
    name = person.find('name').text
    try:
        city = person.find('city').text
    except AttributeError:  # find() returned None: this person has no city
        continue
    print(name, city)
To get the id, use person.get('id').
Output: Smith New York
Using lxml, you can use xpath to get in one step what you need:
from lxml import etree
xmlstr = """
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>
"""
xml = etree.fromstring(xmlstr)
xp = "//person[city]"
for person in xml.xpath(xp):
    print(etree.tostring(person).decode())
lxml is an external Python package, but it is so useful that I always find it worth installing.
The XPath expression matches any (//) person element that has (the condition inside []) a city child element.
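For completeness, the child-existence predicate is also part of the XPath subset that the standard library's ElementTree supports, so the same filter can be sketched without lxml. The XML is the sample from the answer above.

```python
import xml.etree.ElementTree as ET

XML = """<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>"""

root = ET.fromstring(XML)

# "[city]" keeps only person elements that have a city child.
matches = [
    (person.get("id"), person.findtext("name"), person.findtext("city"))
    for person in root.findall("./person[city]")
]
print(matches)
```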
