etree data extraction from xml with odd tree structure - python

Here is a piece of the XML data before I go any further:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<sequence id="episode1">
<media>
<video>
<track>
<generatoritem id="Gen Subtitle1">
<effect>
<name>Gen Subtitle</name>
<effectid>Gen Subtitle</effectid>
<effectcategory>Text</effectcategory>
<effecttype>generator</effecttype>
<mediatype>video</mediatype>
<parameter>
<parameterid>part1</parameterid>
<name>Text Settings</name>
<value/>
</parameter>
<parameter>
<parameterid>str</parameterid>
<name>Text</name>
<value>You're a coward for picking on people
who are weaker than you.</value>
</parameter>
<parameter>
<parameterid>font</parameterid>
<name>Font</name>
<value>Arial</value>
</parameter>
</effect>
</media>
</sequence>
</xmeml>
As you can see, the tree contains an <effect> element with multiple <parameter> children, but I am only after the <value> of the <parameter> that also contains
<parameterid>str</parameterid>
<name>Text</name>
so I can get an output of "That child is so cute.
And he's smart."
Here is my code:
lst = tree.findall('xmeml/sequence/media/video/track/generatoritem/effect/parameter/value')
counts = tree.findall('.//value')
for each in counts:
    print(each.text)
And this is what I get:
And he's smart.
Arial

See below
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<sequence id="episode1">
<effect>
<name>Gen Subtitle</name>
<effectid>Gen Subtitle</effectid>
<effectcategory>Text</effectcategory>
<effecttype>generator</effecttype>
<mediatype>video</mediatype>
<parameter>
<parameterid>part1</parameterid>
<name>Text Settings</name>
<value/>
</parameter>
<parameter>
<parameterid>str</parameterid>
<name>Text</name>
<value>That child is so cute. And he's smart</value>
</parameter>
<parameter>
<parameterid>font</parameterid>
<name>Font</name>
<value>Arial</value>
</parameter>
</effect>
</sequence>
</xmeml>'''
root = ET.fromstring(xml)
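# Select every <parameter> that has a <parameterid> child with the text "str"
str_params = root.findall('.//parameter[parameterid="str"]')
for param in str_params:
    # Keep only the parameter whose <name> is "Text"
    if param.find('./name').text == 'Text':
        print('The text: {}'.format(param.find('./value').text))
        break
Output: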
The text: That child is so cute. And he's smart
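As a variation on the answer above, both conditions can probably be folded into the path itself, because ElementTree's limited XPath support accepts chained [child='text'] predicates. This is only a minimal sketch on a trimmed copy of the sample; the variable names (path, texts) are illustrative and not from the original posts.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's sample, kept only for illustration
xml = """<xmeml version="5">
<sequence id="episode1">
<effect>
<parameter>
<parameterid>part1</parameterid>
<name>Text Settings</name>
<value/>
</parameter>
<parameter>
<parameterid>str</parameterid>
<name>Text</name>
<value>That child is so cute.
And he's smart.</value>
</parameter>
<parameter>
<parameterid>font</parameterid>
<name>Font</name>
<value>Arial</value>
</parameter>
</effect>
</sequence>
</xmeml>"""

root = ET.fromstring(xml)

# Both predicates must hold for the same <parameter>; then step down to <value>
path = ".//parameter[parameterid='str'][name='Text']/value"
texts = [value.text for value in root.findall(path)]

for text in texts:
    print(text)
```

The text of a <value> element keeps its embedded line break, so a two-line subtitle comes back as a single string.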

Related

Python xml parsing from string return none attrib

I am parsing the following XML from a string:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://www.nexacroplatform.com/platform/dataset">
<Parameters>
<Parameter id="ErrorCode" type="string">-1</Parameter>
<Parameter id="ErrorMsg" type="string"> 정원을 초과하였습니다..!</Parameter>
<Parameter id="O_RESULT" type="string">1</Parameter>
<Parameter id="O_RESULT_STR" type="string"> 정원을 초과하였습니다..!</Parameter>
</Parameters>
</Root>'''
tree=ET.fromstring(xml)
tree.findall('Parameter')
tree.findall('Parameter') returns an empty list.
tree has an empty attrib and its tag is '{http://www.nexacroplatform.com/platform/dataset}Root'.
Why does this XML not work?
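The root element declares a default namespace, so every tag name has to be qualified with that namespace when searching. See below (no external library is involved in the solution):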
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://www.nexacroplatform.com/platform/dataset">
<Parameters>
<Parameter id="ErrorCode" type="string">-1</Parameter>
<Parameter id="ErrorMsg" type="string"> 정원을 초과하였습니다..!</Parameter>
<Parameter id="O_RESULT" type="string">1</Parameter>
<Parameter id="O_RESULT_STR" type="string"> 정원을 초과하였습니다..!</Parameter>
</Parameters>
</Root>'''
tree = ET.fromstring(xml)
# The default namespace must be part of the tag name in the search path
for entry in tree.findall('.//{http://www.nexacroplatform.com/platform/dataset}Parameter'):
    print(f'id={entry.attrib["id"]}, type={entry.attrib["type"]}, data={entry.text}')
Output:
id=ErrorCode, type=string, data=-1
id=ErrorMsg, type=string, data= 정원을 초과하였습니다..!
id=O_RESULT, type=string, data=1
id=O_RESULT_STR, type=string, data= 정원을 초과하였습니다..!
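Writing the full URI in every path gets noisy, so a prefix mapping passed to findall() may be easier to read. The sketch below assumes a shortened copy of the XML; the prefix "ds" and the names NS_URI, namespaces and params are arbitrary choices made for this example.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://www.nexacroplatform.com/platform/dataset"

# Shortened copy of the question's document, for illustration only
xml = f"""<Root xmlns="{NS_URI}">
<Parameters>
<Parameter id="ErrorCode" type="string">-1</Parameter>
<Parameter id="O_RESULT" type="string">1</Parameter>
</Parameters>
</Root>"""

root = ET.fromstring(xml)

# "ds" is a prefix chosen here; only the URI has to match the document
namespaces = {"ds": NS_URI}
params = {p.get("id"): p.text for p in root.findall(".//ds:Parameter", namespaces)}
print(params)
```

The prefix lives only in the search call; the document itself does not need to use it.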
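You can use BeautifulSoup with the lxml parser to achieve what you want. The example prints the ids of the <Parameter> tags. Note that the lxml HTML parser lowercases tag names, which is why the lookups below use lowercase.
Here is the code: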
import bs4 as bs
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://www.nexacroplatform.com/platform/dataset">
<Parameters>
<Parameter id="ErrorCode" type="string">-1</Parameter>
<Parameter id="ErrorMsg" type="string"> 정원을 초과하였습니다..!</Parameter>
<Parameter id="O_RESULT" type="string">1</Parameter>
<Parameter id="O_RESULT_STR" type="string"> 정원을 초과하였습니다..!</Parameter>
</Parameters>
</Root>'''
# Create a soup object with lxml parser
soup = bs.BeautifulSoup(xml, 'lxml')
# Select all the parameter tags
params = soup.find('root').find('parameters').find_all('parameter')
# Print the ids of all parameter tags
for i in params:
    print(i['id'])
Output:
ErrorCode
ErrorMsg
O_RESULT
O_RESULT_STR

Adding an array of tags to xml root

I have a list of xml.etree.ElementTree.Element objects. I need to append them to the root tag, which already contains a few elements (i.e. xml.etree.ElementTree.Element objects).
For example:
<MxGraphModel>
<root>
<mxCell id="0"></mxCell>
<mxCell id="1"></mxCell>
</root>
</MxGraphModel>
My list: ['<mxCell id="3"></mxCell>','<mxCell id="4"></mxCell>']
My final output needs to be:
<MxGraphModel>
<root>
<mxCell id="0"></mxCell>
<mxCell id="1"></mxCell>
<mxCell id="3"></mxCell>
<mxCell id="4"></mxCell>
</root>
</MxGraphModel>
Try this:
from xml.etree import ElementTree as ET
data = ['<mxCell id="3"></mxCell>','<mxCell id="4"></mxCell>']
root = ET.parse('test.xml').getroot()
nodes = root.find('root')
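for x in data:
    # Parse each string into an Element and append it to <root>
    nodes.append(ET.fromstring(x))
# encoding='unicode' returns a str instead of bytes
print(ET.tostring(root, encoding='unicode'))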
Output:
<MxGraphModel>
<root>
<mxCell id="0" />
<mxCell id="1" />
<mxCell id="3" />
<mxCell id="4" />
</root>
</MxGraphModel>
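The same idea can be sketched without a file on disk, using Element.extend() to add all parsed fragments in one call. The document string and the names doc, model, cells and ids are assumptions made for this self-contained example.

```python
import xml.etree.ElementTree as ET

# In-memory stand-in for test.xml, for illustration only
doc = """<MxGraphModel>
<root>
<mxCell id="0"></mxCell>
<mxCell id="1"></mxCell>
</root>
</MxGraphModel>"""
data = ['<mxCell id="3"></mxCell>', '<mxCell id="4"></mxCell>']

model = ET.fromstring(doc)
cells = model.find("root")

# Parse every fragment first, then append them all at once
cells.extend([ET.fromstring(fragment) for fragment in data])

ids = [cell.get("id") for cell in cells]
print(ids)
```

If the list already holds Element objects rather than strings, the ET.fromstring() step can be dropped and the list passed to extend() directly.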

How to retain XML version and comments while writing in XML file using python?

I have to add one element at runtime to an XML file using Python.
My original XML file has the content below:
<?xml version='1.0' encoding='utf-8'?>
<!--
Some comments.
-->
<rootTag>
<childTag className="org.Tiger" SSLEngine="on" />
<childTag name="serv1">
<Connector port="8001" SSLEnabled="true"
maxThreads="800"
URIEncoding="UTF-8"
clientAuth="false" />
<Track name="Pacific" defaultHost="localhost">
<Realm className="Realm" appName="kernel"
userClassNames="User"
roleClassNames="Role"/>
<Host name="localhost"
createDirs="false">
<Value className="Remote"
httpsServerPort="223" />
</Host>
</Track>
</childTag>
</rootTag>
Below is the code I wrote to add the (Value) element at runtime:
import xml.etree.ElementTree as ET
myTree = ET.parse("new2.xml")
myRoot = myTree.getroot()
# Check whether a Value element with className="Error" already exists
x = myTree.findall('.//Value[@className="Error"]')
print(len(x))
if len(x) == 0:
    for a in myRoot.findall('childTag'):
        for b in a.findall('Track'):
            for c in b.findall('Host'):
                ele = ET.Element('Value')
                ele.set("className", "Error")
                ele.set("showReport", "false")
                ele.set("showServerInfo", "false")
                c.append(ele)
    myTree.write("new2.xml")
The output I got is this:
<rootTag>
<childTag className="org.Tiger" SSLEngine="on" />
<childTag name="serv1">
<Connector port="8001" SSLEnabled="true" maxThreads="800" URIEncoding="UTF-8" clientAuth="false" />
<Track name="Pacific" defaultHost="localhost">
<Realm className="Realm" appName="kernel" userClassNames="User" roleClassNames="Role" />
<Host name="localhost" autoDeploy="false" createDirs="false">
<Value className="Remote" httpsServerPort="223" />
<Value className="Error" showReport="false" showServerInfo="false" /></Host>
</Track>
</childTag>
</rootTag>
The problem is that it removes the XML declaration and the comments from the file, and it also changes the indentation of the file.
How can I add only the subelement, with correct indentation, without changing anything else in the file?
The output should be like this:
<?xml version='1.0' encoding='utf-8'?>
<!--
Some comments.
-->
<rootTag>
<childTag className="org.Tiger" SSLEngine="on" />
<childTag name="serv1">
<Connector port="8001" SSLEnabled="true"
maxThreads="800"
URIEncoding="UTF-8"
clientAuth="false" />
<Track name="Pacific" defaultHost="localhost">
<Realm className="Realm" appName="kernel"
userClassNames="User"
roleClassNames="Role"/>
<Host name="localhost"
createDirs="false">
<Value className="Remote"
httpsServerPort="223" />
<Value className="Error"
showReport="false" showServerInfo="false" />
</Host>
</Track>
</childTag>
</rootTag>
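One possible direction, offered only as a sketch: the standard library's xml.dom.minidom keeps comments (including the one before the root element) and writes a declaration again when serialising. It still normalises the layout of attributes inside a tag and re-emits its own form of the declaration, so a byte-for-byte identical file should not be expected. The shortened document and the names source, doc, new_value and result are assumptions for this example.

```python
from xml.dom import minidom

# Shortened copy of the question's file, for illustration only
source = """<?xml version='1.0' encoding='utf-8'?>
<!--
Some comments.
-->
<rootTag>
<childTag name="serv1">
<Track name="Pacific" defaultHost="localhost">
<Host name="localhost" createDirs="false">
<Value className="Remote" httpsServerPort="223" />
</Host>
</Track>
</childTag>
</rootTag>"""

doc = minidom.parseString(source)

for host in doc.getElementsByTagName("Host"):
    existing = [v.getAttribute("className") for v in host.getElementsByTagName("Value")]
    if "Error" in existing:
        continue  # already present, nothing to add
    new_value = doc.createElement("Value")
    new_value.setAttribute("className", "Error")
    new_value.setAttribute("showReport", "false")
    new_value.setAttribute("showServerInfo", "false")
    host.appendChild(new_value)
    # Add a line break so the closing </Host> tag stays on its own line
    host.appendChild(doc.createTextNode("\n"))

result = doc.toxml()
print(result)
```

To write the result back, result can be saved with an ordinary open(..., "w", encoding="utf-8") call. If exact preservation of the original layout matters, inserting the new line as text rather than re-serialising the whole document may be the only reliable route.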

Python etree - find exact match

I have the following XML file:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE TaskDefinition PUBLIC "xxx" "yyy">
<TaskDefinition created="time_stamp" formPath="path/sometask.xhtml" id="sample_id" modified="timestamp_b" name="sample_task" resultAction="Delete" subType="subtype_sample_task" type="sample_type">
<Attributes>
<Map>
<entry key="applications" value="APP_NAME"/>
<entry key="aaa" value="true"/>
<entry key="bbb" value="true"/>
<entry key="ccc" value="true"/>
<entry key="ddd" value="true"/>
<entry key="eee" value="Disabled"/>
<entry key="fff"/>
<entry key="ggg"/>
</Map>
</Attributes>
<Description>Description.</Description>
<Owner>
<Reference class="sample_owner_class" id="sample_owner_id" name="sample__owner_name"/>
</Owner>
<Parent>
<Reference class="sample_parent_class" id="sample_parent_id" name="sample_parent_name"/>
</Parent>
</TaskDefinition>
I want to search for:
<entry key="applications" value="APP_NAME"/>
and change the value to e.g. APP_NAME_2.
I know I can extract this value like this:
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='sample.xml')
root = tree.getroot()
print(root[0][0][0].tag, root[0][0][0].attrib)
but in this case I have to know the exact position of this entry in the tree, so it is not flexible, and I have no idea how to change it.
I also tried something like this:
for app in root.attrib:
    if 'applications' in root.attrib:
        print(app)
but I can't figure out why this returns nothing.
The Python docs have the following example:
for rank in root.iter('rank'):
    new_rank = int(rank.text) + 1
    rank.text = str(new_rank)
    rank.set('updated', 'yes')
tree.write('output.xml')
but I have no idea how to adjust this to my example.
I don't want to use regex for this case.
Any help appreciated.
You can locate the specific entry element with XPath.
import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")
# Find the element that has a 'key' attribute with a value of 'applications'
entry = tree.find(".//entry[@key='applications']")
# Change the value of the 'value' attribute
entry.set("value", "APP_NAME_2")
tree.write("output.xml")
Result (output.xml):
<TaskDefinition created="time_stamp" formPath="path/sometask.xhtml" id="sample_id" modified="timestamp_b" name="sample_task" resultAction="Delete" subType="subtype_sample_task" type="sample_type">
<Attributes>
<Map>
<entry key="applications" value="APP_NAME_2" />
<entry key="aaa" value="true"/>
<entry key="bbb" value="true"/>
<entry key="ccc" value="true"/>
<entry key="ddd" value="true"/>
<entry key="eee" value="Disabled"/>
<entry key="fff"/>
<entry key="ggg"/>
</Map>
</Attributes>
<Description>Description.</Description>
<Owner>
<Reference class="sample_owner_class" id="sample_owner_id" name="sample__owner_name"/>
</Owner>
<Parent>
<Reference class="sample_parent_class" id="sample_parent_id" name="sample_parent_name"/>
</Parent>
</TaskDefinition>
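find() returns None when nothing matches, so calling set() directly would raise an AttributeError for a missing key. A small guard can be sketched as follows; set_entry_value is a hypothetical helper written for this example, and the XML is a shortened copy of the sample.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's file, for illustration only
xml = """<TaskDefinition name="sample_task">
<Attributes>
<Map>
<entry key="applications" value="APP_NAME"/>
<entry key="aaa" value="true"/>
</Map>
</Attributes>
</TaskDefinition>"""


def set_entry_value(root, key, new_value):
    """Hypothetical helper: update the matching <entry>, report whether it existed."""
    entry = root.find(f".//entry[@key='{key}']")
    if entry is None:
        return False
    entry.set("value", new_value)
    return True


root = ET.fromstring(xml)
changed = set_entry_value(root, "applications", "APP_NAME_2")
missing = set_entry_value(root, "no_such_key", "x")
print(changed, missing)
```

The helper interpolates the key into the path, so it is only suitable for keys that contain no quote characters.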

how to get the index of a child node under a parent node using python?

My XML file looks like this:
<?xml version="1.0"?>
<BCPFORMAT
xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="12"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="20" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="30" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="age" xsi:type="SQLINT"/>
<COLUMN SOURCE="2" NAME="firstname" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="3" NAME="lastname" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>
I need to know the index of the child node with ID="1" within its parent node RECORD (i.e. the index is 0 in this case).
Please help me solve this.
Thanks. :)
Using xml.etree.ElementTree:
import xml.etree.ElementTree as ET
root = ET.fromstring('''<?xml version="1.0"?>
<BCPFORMAT
...
</BCPFORMAT>''')
# Accessing parent node: http://effbot.org/zone/element.htm#accessing-parents
# Elements keep no reference to their parent, so build a child -> parent map
parent_map = {c: p for p in root.iter() for c in p}
child = root.find('.//*[@ID="1"]')
print(list(parent_map[child]).index(child)) # => 0
Using lxml:
import lxml.etree as ET
root = ET.fromstring('''<?xml version="1.0"?>
<BCPFORMAT
...
</BCPFORMAT>''')
child = root.find('.//*[@ID="1"]')
print(child.getparent().index(child)) # => 0
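With only the standard library, the parent and the position can also be found in a single pass, without building a map first. This is a sketch on a shortened copy of the file; child_index is a hypothetical helper introduced for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's file, for illustration only
xml = """<BCPFORMAT
xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" MAX_LENGTH="12"/>
<FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="20"/>
</RECORD>
</BCPFORMAT>"""


def child_index(root, attr, value):
    """Hypothetical helper: return (parent, index) of the first child whose attribute matches."""
    for parent in root.iter():
        for index, child in enumerate(parent):
            if child.get(attr) == value:
                return parent, index
    return None, -1


root = ET.fromstring(xml)
parent, index = child_index(root, "ID", "2")
print(parent.tag, index)
```

Because the document declares a default namespace, parent.tag comes back in the {namespace}RECORD form; the unprefixed ID attribute is not affected by it.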
