Using lxml to add a string as a sub element - python

I have an lxml element with children built like this:
xml = etree.Element('presentation')
format_xml = etree.SubElement(xml, 'format')
content_xml = etree.SubElement(xml, 'slides')
I then have several strings that I would like to iterate over, adding each one as a child element of slides. Each string will be something like this:
<slide1>
<title>My Presentation</title>
<subtitle>A sample presentation</subtitle>
<phrase>Some sample text
<subphrase>Some more text</subphrase>
</phrase>
</slide1>
How can I append these strings as children to the slides element?

Just append:
import lxml.etree as etree
xml = etree.Element('presentation')
format_xml = etree.SubElement(xml, 'format')
content_xml = etree.SubElement(xml, 'slides')
new = """<slide1>
<title>My Presentation</title>
<subtitle>A sample presentation</subtitle>
<phrase>Some sample text
<subphrase>Some more text</subphrase>
</phrase>
</slide1>"""
content_xml.append(etree.fromstring(new))
print(etree.tostring(xml, pretty_print=True).decode())
Which will give you:
<presentation>
<format/>
<slides>
<slide1>
<title>My Presentation</title>
<subtitle>A sample presentation</subtitle>
<phrase>Some sample text
<subphrase>Some more text</subphrase>
</phrase>
</slide1>
</slides>
</presentation>

The fromstring() function loads an XML string directly into an Element instance, which you can then append:
from lxml import etree as ET
slide = ET.fromstring(xml_string)
content_xml.append(slide)
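When there are several such strings, the same idea extends to a loop. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree so that it runs without extra packages (lxml.etree offers the same Element, SubElement, fromstring and append calls), and the slide_strings list is made up for illustration.

```python
import xml.etree.ElementTree as etree  # lxml.etree provides the same calls used here

xml = etree.Element('presentation')
format_xml = etree.SubElement(xml, 'format')
content_xml = etree.SubElement(xml, 'slides')

# Hypothetical input: each string must be a well-formed XML fragment
# with exactly one root element.
slide_strings = [
    "<slide1><title>My Presentation</title></slide1>",
    "<slide2><title>Second slide</title></slide2>",
]

for slide_string in slide_strings:
    # fromstring() parses the text into an Element; append() attaches it
    # as the last child of <slides>.
    content_xml.append(etree.fromstring(slide_string))

result = etree.tostring(xml, encoding='unicode')
print(result)
```

A string that is not well-formed XML makes fromstring() raise a parse error, so untrusted input is probably best wrapped in a try/except.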

Related

How to find if there are empty attributes in XML?

Having a XML like this one (located in /home/user/):
<?xml version="1.0" ?>
<DataClient xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cnmc="http://www.example.com/Tipos_DataClient" xmlns="http://www.example.com/DataClient">
<PersonalData Operation="3" Date="2022-09-06">
<ExtendedData>
<Person Code="XXX" OtherCode="Y12354"/>
</ExtendedData>
<Home Type="Street" Num="10" Code="12003" Poblation="Imaginary street"/>
</PersonalData>
</DataClient>
How could I identify if the "Num" attribute is empty? And then generate a list of all those elements that have the "Num" empty...
I tried to count all those with "None" as value, but it always returns 0:
#! /usr/bin/python3
import xml.etree.ElementTree as ET
tree = ET.parse('/home/user/file.xml')
root = tree.getroot()
b = None
a = sum(1 for s in root.findall('./DataClient/PersonalData/ExtendedData/Num') if s.b)
print (a)
Since Python's etree API maps attributes to dictionaries, consider dict.get to check for a specific attribute. You also need to pass the namespaces argument to findall, since the XML declares a default namespace. Note that the path is evaluated relative to the root element (DataClient), so it must not name the root itself.
import xml.etree.ElementTree as ET
tree = ET.parse('/home/user/file.xml')
nmsp = {"doc": "http://www.example.com/DataClient"}
xpath = "./doc:PersonalData/doc:Home"
# get() returns None when the attribute is missing and "" when it is empty;
# "not" treats both cases as empty
a = sum(1 for node in tree.findall(xpath, nmsp) if not node.attrib.get("Num"))
print(a)
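To get a list of the matching elements rather than a count, and to tell an empty attribute apart from a missing one, something like the following sketch may help. The inline sample document is invented for the example (it is not the asker's file), and it is parsed from a string so that the snippet is self-contained.

```python
import xml.etree.ElementTree as ET

# Sample document (made up for this sketch): one Home with an empty Num,
# one with a value, and one with no Num attribute at all.
xml_text = """<DataClient xmlns="http://www.example.com/DataClient">
  <PersonalData Operation="3" Date="2022-09-06">
    <Home Type="Street" Num="" Code="12003"/>
    <Home Type="Street" Num="10" Code="12004"/>
    <Home Type="Street" Code="12005"/>
  </PersonalData>
</DataClient>"""

root = ET.fromstring(xml_text)
nmsp = {"doc": "http://www.example.com/DataClient"}

# The path is relative to the root element (DataClient).
homes = root.findall("./doc:PersonalData/doc:Home", nmsp)

# Attribute present but empty: get() returns "".
empty_num = [h for h in homes if h.get("Num") == ""]
# Attribute absent: get() returns None.
missing_num = [h for h in homes if h.get("Num") is None]

empty_codes = [h.get("Code") for h in empty_num]
missing_codes = [h.get("Code") for h in missing_num]
print(empty_codes, missing_codes)
```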

How to update value between specific xml tags, where input is string, Python?

Suppose I have a string like the one below. Its type is string, but it always represents an XML document. I'm researching the available Python libraries for XML. How can I update the value between two specific tags, and which library should I use for that?
<?xml version="1.0"?>
<PostTelemetryRequest xmlns:ns2="urn:com:onstar:global:common:schema:PostTelemetryData:1">
<ns2:PartnerVehicles>
<ns2:PartnerVehicle>
<ns2:partnerNotificationID>251029655</ns2:partnerNotificationID>
</ns2:PartnerVehicle>
</ns2:PartnerVehicles>
</PostTelemetryRequest>
For instance, if the input is the string above how can I update the value between <ns2:partnerNotificationID> and </ns2:partnerNotificationID> tags to a new value?
This is the base code:
>>> from xml.etree import ElementTree
>>> s = """<?xml version="1.0"?>
<PostTelemetryRequest xmlns:ns2="urn:com:onstar:global:common:schema:PostTelemetryData:1">
<ns2:PartnerVehicles>
<ns2:PartnerVehicle>
<ns2:partnerNotificationID>251029655</ns2:partnerNotificationID>
</ns2:PartnerVehicle>
</ns2:PartnerVehicles>
</PostTelemetryRequest>
"""
>>> root = ElementTree.fromstring(s)
>>> for e in root.iter():
...     if e.tag == '{urn:com:onstar:global:common:schema:PostTelemetryData:1}partnerNotificationID':
...         e.text = 'mytext'
...
>>> ElementTree.tostring(root)
b'<PostTelemetryRequest xmlns:ns0="urn:com:onstar:global:common:schema:PostTelemetryData:1">\n <ns0:PartnerVehicles>\n <ns0:PartnerVehicle>\n <ns0:partnerNotificationID>mytext</ns0:partnerNotificationID>\n </ns0:PartnerVehicle>\n </ns0:PartnerVehicles>\n</PostTelemetryRequest>'
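As an alternative to comparing the fully qualified tag by hand, find() accepts a prefix-to-URI mapping. This is a rough sketch of that variant, not the answerer's code; the NS_URI constant and the "ns2" key in the mapping are names chosen for the example.

```python
import xml.etree.ElementTree as ET

NS_URI = "urn:com:onstar:global:common:schema:PostTelemetryData:1"

s = """<?xml version="1.0"?>
<PostTelemetryRequest xmlns:ns2="urn:com:onstar:global:common:schema:PostTelemetryData:1">
<ns2:PartnerVehicles>
<ns2:PartnerVehicle>
<ns2:partnerNotificationID>251029655</ns2:partnerNotificationID>
</ns2:PartnerVehicle>
</ns2:PartnerVehicles>
</PostTelemetryRequest>
"""

root = ET.fromstring(s)

# ".//" searches all descendants; the dict maps the prefix used in the
# path expression to the namespace URI.
node = root.find(".//ns2:partnerNotificationID", {"ns2": NS_URI})
if node is not None:
    node.text = "mytext"

# As in the output above, ElementTree may rewrite the prefix when serializing.
updated = ET.tostring(root, encoding="unicode")
print(updated)
```

The prefix in the mapping only has to match the one used in the path expression; it does not need to match the prefix in the document.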

Adding subElement at a specific location with xml.dom.minidom (appendChild)

I intend to insert a sub element at a specified location. However, I do not know how to do that using appendChild in xml.dom
Here is my xml code:
<?xml version='1.0' encoding='UTF-8'?>
<VOD>
<root>
<ab>sdsd
<pp>pras</pp>
<ps>sinha</ps>
</ab>
<ab>prashu</ab>
<ab>sakshi</ab>
<cd>dfdf</cd>
</root>
<root>
<ab>pratik</ab>
</root>
<root>
<ab>Mum</ab>
</root>
</VOD>
I would like to insert another sub element "new" in first "root" element just before the "cd" tag. The result should look like this:
<ab>prashu</ab>
<ab>sakshi</ab>
<new>Anydata</new>
<cd>dfdf</cd>
The code I used for this is:
import xml.dom.minidom as m
doc = m.parse("file_notes.xml")
root=doc.getElementsByTagName("root")
valeurs = doc.getElementsByTagName("root")[0]
element = doc.createElement("new")
element.appendChild(doc.createTextNode("Anydata"))
valeurs.appendChild(element)
doc.writexml(open("newxmlfile.xml","w"))
In what way can I achieve my goal?
Thank you in advance..!!
Try using insertBefore instead. Something along these lines:
element = doc.createElement("new")
element.appendChild(doc.createTextNode("Anydata"))
cd = doc.getElementsByTagName("cd")[0]
cd.parentNode.insertBefore(element, cd)
To insert new nodes based on an index you can just do:
cd_list = doc.getElementsByTagName("cd")
cd_list[0].parentNode.insertBefore(element, cd_list[0])
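Put together as a runnable snippet, the insertBefore approach could look like the sketch below. It parses a shortened, whitespace-free copy of the document from a string (an assumption made so the example is self-contained) and limits the search for cd to the first root element.

```python
import xml.dom.minidom as m

# Compact sample (no whitespace between tags), so the tree holds only
# element nodes and their text.
doc = m.parseString(
    "<VOD><root><ab>prashu</ab><ab>sakshi</ab><cd>dfdf</cd></root>"
    "<root><ab>pratik</ab></root></VOD>"
)

first_root = doc.getElementsByTagName("root")[0]

element = doc.createElement("new")
element.appendChild(doc.createTextNode("Anydata"))

# Look for <cd> only inside the first <root>, then insert the new node before it.
cd = first_root.getElementsByTagName("cd")[0]
cd.parentNode.insertBefore(element, cd)

result = first_root.toxml()
print(result)
```

insertBefore(newChild, refChild) is called on the parent of the reference node, which is why the code goes through cd.parentNode.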

Extract all the text from xml data with python

I'm new to xml data processing. I want to extract the text data in the following xml file:
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
so that expected result is:
['12345','45667', 'abcde']
Currently I have tried:
tree = ET.parse('data.xml')
data = tree.getiterator()
text = [data[i].text for i in range(0, len(data))]
But the result only shows ['12345','45667'] . 'abcde' is missing. Can someone help me? Thanks in advance!
Try doing this using xpath and lxml :
import lxml.etree as etree
string = '''
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
'''
tree = etree.fromstring(string)
print(tree.xpath('//p//text()'))
The XPath expression means: "select every text node beneath any p element, at any depth."
OUTPUT:
['12345', '45667', 'abcde']
getiterator() (or its replacement, iter()) iterates over elements only, while abcde is a text node: it is stored as the tail of the strong element.
You can use itertext() method:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
print(list(tree.find('p').itertext()))
Prints:
['12345', '45667', 'abcde']
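To see where each piece of text lives in the tree, here is a small sketch with the standard library that parses the sample from a string instead of a file. It prints the text and tail attributes separately and then the itertext() result.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<data><p>12345<strong>45667</strong>abcde</p></data>")
p = root.find("p")
strong = p.find("strong")

print(p.text)       # text before the first child of <p>
print(strong.text)  # text inside <strong>
print(strong.tail)  # text after </strong>, still inside <p>

# itertext() walks the element and yields text and tail values in document order.
texts = list(p.itertext())
print(texts)
```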

How would one remove the CDATA tags from but preserve the actual data in Python using LXML or BeautifulSoup

I have some XML I am parsing in which I am using BeautifulSoup as the parser. I pull the CDATA out with the following code, but I only want the data and not the CDATA TAGS.
myXML = open("c:\myfile.xml", "r")
soup = BeautifulSoup(myXML)
data = soup.find(text=re.compile("CDATA"))
print data
<![CDATA[TEST DATA]]>
What I would like to see is the following output:
TEST DATA
I don't care if the solution is in LXML or BeautifulSoup. Just want the best or easiest way to get the job done. Thanks!
Here is a solution:
parser = etree.XMLParser(strip_cdata=False)
root = etree.parse(self.param1, parser)
data = root.findall('./config/script')
for item in data:  # iterate through the list to find the text of elements containing CDATA
    print(item.text)
Based on the lxml docs:
>>> from lxml import etree
>>> parser = etree.XMLParser(strip_cdata=False)
>>> root = etree.XML('<root><data><![CDATA[test]]></data></root>', parser)
>>> data = root.findall('data')
>>> for item in data: # iterate through list to find text contained in elements containing CDATA
...     print(item.text)
...
test # just the text of <![CDATA[test]]>
This might be the best way to get the job done, depending on how amenable your xml structure is to this approach.
Based on BeautifulSoup:
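>>> from bs4 import BeautifulSoup
>>> xml_str = '<xml> <MsgType><![CDATA[text]]></MsgType> </xml>'
>>> soup = BeautifulSoup(xml_str, "xml")
>>> soup.MsgType.get_text()
'text'
>>> soup.MsgType.string
'text'
>>> soup.MsgType.text
'text'
Each of these returns only the text inside MsgType, without the CDATA markers. Note that BeautifulSoup's "xml" parser requires lxml to be installed.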
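If neither lxml nor BeautifulSoup is available, the standard library's xml.etree.ElementTree also exposes CDATA content as ordinary element text. The sketch below swaps in that parser (it is not one of the libraries the question asked about) and uses a made-up one-line document.

```python
import xml.etree.ElementTree as ET

# Made-up sample: a single element whose content is a CDATA section.
root = ET.fromstring("<root><data><![CDATA[TEST DATA]]></data></root>")

# The parser hands back the CDATA content as plain text, without the markers.
data_text = root.find("data").text
print(data_text)
```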
