etree data extraction from xml with odd tree structure

Here is a piece of the XML data before I go any further:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<sequence id="episode1">
<media>
<video>
<track>
<generatoritem id="Gen Subtitle1">
<effect>
<name>Gen Subtitle</name>
<effectid>Gen Subtitle</effectid>
<effectcategory>Text</effectcategory>
<effecttype>generator</effecttype>
<mediatype>video</mediatype>
<parameter>
<parameterid>part1</parameterid>
<name>Text Settings</name>
<value/>
</parameter>
<parameter>
<parameterid>str</parameterid>
<name>Text</name>
<value>You're a coward for picking on people
who are weaker than you.</value>
</parameter>
<parameter>
<parameterid>font</parameterid>
<name>Font</name>
<value>Arial</value>
</parameter>
</effect>
</generatoritem>
</track>
</video>
</media>
</sequence>
</xmeml>
As you can see, each <effect> contains several <parameter> elements, but I am only after the <value> of the <parameter> that also contains
<parameterid>str</parameterid>
<name>Text</name>
so that I get output like "That child is so cute.
And he's smart."
Here is my code:
lst = tree.findall('xmeml/sequence/media/video/track/generatoritem/effect/parameter/value')
counts = tree.findall('.//value')
for each in counts:
    print(each.text)
And this is what I get:
And he's smart.
Arial

See below
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<sequence id="episode1">
<effect>
<name>Gen Subtitle</name>
<effectid>Gen Subtitle</effectid>
<effectcategory>Text</effectcategory>
<effecttype>generator</effecttype>
<mediatype>video</mediatype>
<parameter>
<parameterid>part1</parameterid>
<name>Text Settings</name>
<value/>
</parameter>
<parameter>
<parameterid>str</parameterid>
<name>Text</name>
<value>That child is so cute. And he's smart</value>
</parameter>
<parameter>
<parameterid>font</parameterid>
<name>Font</name>
<value>Arial</value>
</parameter>
</effect>
</sequence>
</xmeml>'''
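root = ET.fromstring(xml)
# Select every <parameter> whose <parameterid> child has the text "str"
str_params = root.findall('.//parameter[parameterid="str"]')
for param in str_params:
    # Keep only the parameter whose <name> is "Text"
    if param.find('./name').text == 'Text':
        print('The text: {}'.format(param.find('./value').text))
        break
Output: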
The text: That child is so cute. And he's smart
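If the check on <name> should be part of the query itself, ElementTree's limited XPath support accepts several child-text predicates in a row. The following is a minimal sketch, assuming a trimmed, well-formed copy of the question's XML (closing tags added); the helper name `subtitle_texts` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's document (closing tags added).
XML = """<xmeml version="5">
  <sequence id="episode1"><media><video><track>
    <generatoritem id="Gen Subtitle1">
      <effect>
        <parameter>
          <parameterid>part1</parameterid>
          <name>Text Settings</name>
          <value/>
        </parameter>
        <parameter>
          <parameterid>str</parameterid>
          <name>Text</name>
          <value>You're a coward for picking on people
who are weaker than you.</value>
        </parameter>
        <parameter>
          <parameterid>font</parameterid>
          <name>Font</name>
          <value>Arial</value>
        </parameter>
      </effect>
    </generatoritem>
  </track></video></media></sequence>
</xmeml>"""


def subtitle_texts(root):
    """Return the <value> text of every 'str'/'Text' parameter (hypothetical helper)."""
    # Both predicates must match: <parameterid> text is "str" and <name> text is "Text".
    path = ".//parameter[parameterid='str'][name='Text']/value"
    return [value.text for value in root.findall(path)]


root = ET.fromstring(XML)
texts = subtitle_texts(root)
for text in texts:
    print(text)
```

Because the path starts with `.//`, it does not depend on how deeply the <effect> element is nested, and it returns one entry per subtitle rather than stopping at the first match.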

Related

Python xml parsing from string return none attrib

I am parsing the following XML from a string:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://www.nexacroplatform.com/platform/dataset">
<Parameters>
<Parameter id="ErrorCode" type="string">-1</Parameter>
<Parameter id="ErrorMsg" type="string"> 정원을 초과하였습니다..!</Parameter>
<Parameter id="O_RESULT" type="string">1</Parameter>
<Parameter id="O_RESULT_STR" type="string"> 정원을 초과하였습니다..!</Parameter>
</Parameters>
</Root>'''
tree = ET.fromstring(xml)
tree.findall('Parameter')
tree.findall('Parameter') returns an empty list.
tree has an empty attrib and the tag '{http://www.nexacroplatform.com/platform/dataset}Root'.
Why does this XML not work?

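The document declares a default namespace, so every tag name has to be qualified with that namespace when searching. See below (no external library is involved in the solution):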
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://www.nexacroplatform.com/platform/dataset">
<Parameters>
<Parameter id="ErrorCode" type="string">-1</Parameter>
<Parameter id="ErrorMsg" type="string"> 정원을 초과하였습니다..!</Parameter>
<Parameter id="O_RESULT" type="string">1</Parameter>
<Parameter id="O_RESULT_STR" type="string"> 정원을 초과하였습니다..!</Parameter>
</Parameters>
</Root>'''
tree = ET.fromstring(xml)
# Qualify the tag with the default namespace declared on <Root>
for entry in tree.findall('.//{http://www.nexacroplatform.com/platform/dataset}Parameter'):
    print(f'id={entry.attrib["id"]}, type={entry.attrib["type"]}, data={entry.text}')
Output:
id=ErrorCode, type=string, data=-1
id=ErrorMsg, type=string, data= 정원을 초과하였습니다..!
id=O_RESULT, type=string, data=1
id=O_RESULT_STR, type=string, data= 정원을 초과하였습니다..!
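Writing the full namespace URI into every path gets noisy. As an alternative sketch, findall also accepts a prefix-to-URI mapping; the prefix "ds" below is an arbitrary choice for this example and the XML is shortened to two parameters.

```python
import xml.etree.ElementTree as ET

XML = """<Root xmlns="http://www.nexacroplatform.com/platform/dataset">
  <Parameters>
    <Parameter id="ErrorCode" type="string">-1</Parameter>
    <Parameter id="O_RESULT" type="string">1</Parameter>
  </Parameters>
</Root>"""

# "ds" is an arbitrary prefix chosen for this sketch; only the URI has to match the document.
NS = {"ds": "http://www.nexacroplatform.com/platform/dataset"}

root = ET.fromstring(XML)
# Collect id -> text for every namespaced <Parameter> element.
params = {p.get("id"): p.text for p in root.findall(".//ds:Parameter", NS)}
print(params)
```

The prefix only exists inside the query; the document itself keeps using its default namespace.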

You can use BeautifulSoup with the lxml parser to achieve what you want. This parser lowercases tag names, which is why the lookups below use lowercase. Here I print the ids of the <Parameter> tags.
Here is the code:
import bs4 as bs
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://www.nexacroplatform.com/platform/dataset">
<Parameters>
<Parameter id="ErrorCode" type="string">-1</Parameter>
<Parameter id="ErrorMsg" type="string"> 정원을 초과하였습니다..!</Parameter>
<Parameter id="O_RESULT" type="string">1</Parameter>
<Parameter id="O_RESULT_STR" type="string"> 정원을 초과하였습니다..!</Parameter>
</Parameters>
</Root>'''
# Create a soup object with lxml parser
soup = bs.BeautifulSoup(xml, 'lxml')
# Select all the parameter tags
params = soup.find('root').find('parameters').find_all('parameter')
# Print the ids of all parameter tags
for i in params:
    print(i['id'])
Output:
ErrorCode
ErrorMsg
O_RESULT
O_RESULT_STR

Adding an array of tags to xml root

I have a list of xml.etree.ElementTree.Element objects. I need to append them to the <root> tag, which already contains a few elements (also xml.etree.ElementTree.Element).
For example:
<MxGraphModel>
<root>
<mxCell id="0"></mxCell>
<mxCell id="1"></mxCell>
</root>
</MxGraphModel>
My array: ['<mxCell id="3"></mxCell>','<mxCell id="4"></mxCell>']
My final output needs to be:
<MxGraphModel>
<root>
<mxCell id="0"></mxCell>
<mxCell id="1"></mxCell>
<mxCell id="3"></mxCell>
<mxCell id="4"></mxCell>
</root>
</MxGraphModel>

Try this:
from xml.etree import ElementTree as ET
data = ['<mxCell id="3"></mxCell>','<mxCell id="4"></mxCell>']
root = ET.parse('test.xml').getroot()
nodes = root.find('root')
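for x in data:
    # Parse each string into an Element before appending it
    nodes.append(ET.fromstring(x))
# encoding='unicode' makes tostring return str instead of bytes
print(ET.tostring(root, encoding='unicode'))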
Output:
<MxGraphModel>
<root>
<mxCell id="0" />
<mxCell id="1" />
<mxCell id="3" />
<mxCell id="4" />
</root>
</MxGraphModel>
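The same idea can be tried without a test.xml file on disk. This sketch parses the question's document from a string and adds all fragments with a single extend call; the variable names are illustrative only. If the list already holds Element objects rather than strings, as the question text suggests, the ET.fromstring step can be dropped.

```python
import xml.etree.ElementTree as ET

DOC = """<MxGraphModel>
<root>
<mxCell id="0"></mxCell>
<mxCell id="1"></mxCell>
</root>
</MxGraphModel>"""
fragments = ['<mxCell id="3"></mxCell>', '<mxCell id="4"></mxCell>']

model = ET.fromstring(DOC)
container = model.find("root")
# Parse every fragment into an Element, then add them all in one call.
container.extend([ET.fromstring(fragment) for fragment in fragments])

# List the ids of the children to confirm the order.
ids = [cell.get("id") for cell in container.findall("mxCell")]
print(ids)
```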

How to retain XML version and comments while writing in XML file using python?

I need to add one element to an XML file at runtime using Python.
My original XML file looks like this:
<?xml version='1.0' encoding='utf-8'?>
<!--
Some comments.
-->
<rootTag>
<childTag className="org.Tiger" SSLEngine="on" />
<childTag name="serv1">
<Connector port="8001" SSLEnabled="true"
maxThreads="800"
URIEncoding="UTF-8"
clientAuth="false" />
<Track name="Pacific" defaultHost="localhost">
<Realm className="Realm" appName="kernel"
userClassNames="User"
roleClassNames="Role"/>
<Host name="localhost"
createDirs="false">
<Value className="Remote"
httpsServerPort="223" />
</Host>
</Track>
</childTag>
</rootTag>
Below is the code I wrote to add the <Value> element at runtime:
import xml.etree.ElementTree as ET
myTree = ET.parse("new2.xml")
myRoot = myTree.getroot()
# Check whether the element has already been added
x = myTree.findall('.//Value[@className="Error"]')
print(len(x))
if len(x) == 0:
    for a in myRoot.findall('childTag'):
        for b in a.findall('Track'):
            for c in b.findall('Host'):
                ele = ET.Element('Value')
                ele.set("className", "Error")
                ele.set("showReport", "false")
                ele.set("showServerInfo", "false")
                c.append(ele)
myTree.write("new2.xml")
The output I got is this:
<rootTag>
<childTag className="org.Tiger" SSLEngine="on" />
<childTag name="serv1">
<Connector port="8001" SSLEnabled="true" maxThreads="800" URIEncoding="UTF-8" clientAuth="false" />
<Track name="Pacific" defaultHost="localhost">
<Realm className="Realm" appName="kernel" userClassNames="User" roleClassNames="Role" />
<Host name="localhost" autoDeploy="false" createDirs="false">
<Value className="Remote" httpsServerPort="223" />
<Value className="Error" showReport="false" showServerInfo="false" /></Host>
</Track>
</childTag>
</rootTag>
The problem is that this removes the XML declaration and the comments from the file, and it also changes the file's indentation.
How can I add only the subelement, with correct indentation, without changing anything else in the file?
The output should look like this:
<?xml version='1.0' encoding='utf-8'?>
<!--
Some comments.
-->
<rootTag>
<childTag className="org.Tiger" SSLEngine="on" />
<childTag name="serv1">
<Connector port="8001" SSLEnabled="true"
maxThreads="800"
URIEncoding="UTF-8"
clientAuth="false" />
<Track name="Pacific" defaultHost="localhost">
<Realm className="Realm" appName="kernel"
userClassNames="User"
roleClassNames="Role"/>
<Host name="localhost"
createDirs="false">
<Value className="Remote"
httpsServerPort="223" />
<Value className="Error"
showReport="false" showServerInfo="false" />
</Host>
</Track>
</childTag>
</rootTag>
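One possible direction with the standard library, offered as a rough sketch rather than a complete answer: a TreeBuilder created with insert_comments=True (Python 3.8+) keeps comments as nodes, and xml_declaration=True writes the declaration back out. The shortened document below is an assumption made for illustration, and its comment sits inside the root element. Whether a comment placed before the root tag, as in the question's file, survives should be checked separately; a library such as lxml may be needed for that. The original line breaks between attributes are not retained by ElementTree in any case.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's file; the comment is inside the root here.
SOURCE = """<rootTag>
  <!-- a comment inside the root element -->
  <childTag name="serv1">
    <Track name="Pacific" defaultHost="localhost">
      <Host name="localhost" createDirs="false">
        <Value className="Remote" httpsServerPort="223" />
      </Host>
    </Track>
  </childTag>
</rootTag>"""

# insert_comments=True (Python 3.8+) keeps comments as nodes in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SOURCE, parser=parser)

for host in root.findall("./childTag/Track/Host"):
    # Only add the element if this <Host> does not have it yet.
    if host.find("Value[@className='Error']") is None:
        ET.SubElement(host, "Value", {
            "className": "Error",
            "showReport": "false",
            "showServerInfo": "false",
        })

# xml_declaration=True (Python 3.8+ for tostring) emits the <?xml ...?> line.
result = ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")
print(result)
```

For tidier indentation of the new element, ET.indent (Python 3.9+) can be applied before serializing, at the cost of rewriting the whitespace of the whole tree.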

Python etree - find exact match

I have the following XML file:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE TaskDefinition PUBLIC "xxx" "yyy">
<TaskDefinition created="time_stamp" formPath="path/sometask.xhtml" id="sample_id" modified="timestamp_b" name="sample_task" resultAction="Delete" subType="subtype_sample_task" type="sample_type">
<Attributes>
<Map>
<entry key="applications" value="APP_NAME"/>
<entry key="aaa" value="true"/>
<entry key="bbb" value="true"/>
<entry key="ccc" value="true"/>
<entry key="ddd" value="true"/>
<entry key="eee" value="Disabled"/>
<entry key="fff"/>
<entry key="ggg"/>
</Map>
</Attributes>
<Description>Description.</Description>
<Owner>
<Reference class="sample_owner_class" id="sample_owner_id" name="sample__owner_name"/>
</Owner>
<Parent>
<Reference class="sample_parent_class" id="sample_parent_id" name="sample_parent_name"/>
</Parent>
</TaskDefinition>
I want to search for:
<entry key="applications" value="APP_NAME"/>
and change the value to, e.g., APP_NAME_2.
I know I can extract this value like this:
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='sample.xml')
root = tree.getroot()
print(root[0][0][0].tag, root[0][0][0].attrib)
but in this case I have to know the exact position of this entry in the tree, so it is not flexible, and I have no idea how to change it.
I also tried something like this:
for app in root.attrib:
    if 'applications' in root.attrib:
        print(app)
but I can't figure out why this returns nothing.
The Python docs have the following example:
for rank in root.iter('rank'):
    new_rank = int(rank.text) + 1
    rank.text = str(new_rank)
    rank.set('updated', 'yes')
tree.write('output.xml')
but I have no idea how to adjust this to my example.
I don't want to use regex for this case.
Any help appreciated.

You can locate the specific entry element with XPath.
import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")
# Find the element that has a 'key' attribute with a value of 'applications'
entry = tree.find(".//entry[@key='applications']")
# Change the value of the 'value' attribute
entry.set("value", "APP_NAME_2")
tree.write("output.xml")
Result (output.xml):
<TaskDefinition created="time_stamp" formPath="path/sometask.xhtml" id="sample_id" modified="timestamp_b" name="sample_task" resultAction="Delete" subType="subtype_sample_task" type="sample_type">
<Attributes>
<Map>
<entry key="applications" value="APP_NAME_2" />
<entry key="aaa" value="true"/>
<entry key="bbb" value="true"/>
<entry key="ccc" value="true"/>
<entry key="ddd" value="true"/>
<entry key="eee" value="Disabled"/>
<entry key="fff"/>
<entry key="ggg"/>
</Map>
</Attributes>
<Description>Description.</Description>
<Owner>
<Reference class="sample_owner_class" id="sample_owner_id" name="sample__owner_name"/>
</Owner>
<Parent>
<Reference class="sample_parent_class" id="sample_parent_id" name="sample_parent_name"/>
</Parent>
</TaskDefinition>
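The loop from the question prints nothing because root.attrib only holds the attributes of <TaskDefinition> itself, never those of nested elements. A small sketch on a shortened copy of the file (the shortening is an assumption for brevity) shows this, and also adapts the docs' iter() example as an alternative to the XPath predicate:

```python
import xml.etree.ElementTree as ET

XML = """<TaskDefinition name="sample_task" type="sample_type">
  <Attributes>
    <Map>
      <entry key="applications" value="APP_NAME"/>
      <entry key="aaa" value="true"/>
      <entry key="fff"/>
    </Map>
  </Attributes>
</TaskDefinition>"""

root = ET.fromstring(XML)

# root.attrib only holds the attributes of <TaskDefinition> itself,
# so 'applications' (an attribute value on a nested <entry>) is never in it.
print(sorted(root.attrib))

# Walk every <entry> at any depth and compare its 'key' attribute exactly.
for entry in root.iter("entry"):
    if entry.get("key") == "applications":
        entry.set("value", "APP_NAME_2")

updated = root.find(".//entry[@key='applications']").get("value")
print(updated)
```

Using entry.get("key") instead of entry.attrib["key"] avoids a KeyError on elements that lack the attribute.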

how to get the index of a child node under a parent node using python?

My XML file looks like this:
<?xml version="1.0"?>
<BCPFORMAT
xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="12"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="20" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="30" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="age" xsi:type="SQLINT"/>
<COLUMN SOURCE="2" NAME="firstname" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="3" NAME="lastname" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>
I need to know the index of the child node with ID="1" within its parent node RECORD (i.e. the index is 0 in this case).
Please help me solve this.
Thanks.

Using xml.etree.ElementTree:
import xml.etree.ElementTree as ET
root = ET.fromstring('''<?xml version="1.0"?>
<BCPFORMAT
...
</BCPFORMAT>''')
# ElementTree elements do not keep a reference to their parent, so build a child -> parent map
# (see http://effbot.org/zone/element.htm#accessing-parents)
parent_map = {c: p for p in root.iter() for c in p}
child = root.find('.//*[@ID="1"]')
print(list(parent_map[child]).index(child)) # => 0
Using lxml:
import lxml.etree as ET
root = ET.fromstring('''<?xml version="1.0"?>
<BCPFORMAT
...
</BCPFORMAT>''')
child = root.find('.//*[@ID="1"]')
print(child.getparent().index(child)) # => 0
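When only one lookup is needed, the parent map can be skipped by walking the tree once and enumerating each parent's children. This is a sketch only: the helper name `child_index` is hypothetical, and the RECORD section is shortened from the question (COLLATION attributes and the ROW block are left out).

```python
import xml.etree.ElementTree as ET

# Raw string so that \t and \r\n stay as literal backslash sequences in the attributes.
XML = r"""<BCPFORMAT
    xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="12"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="20"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="30"/>
  </RECORD>
</BCPFORMAT>"""


def child_index(root, field_id):
    """Return (parent, index) for the first element whose ID attribute equals field_id."""
    for parent in root.iter():
        for index, child in enumerate(parent):
            if child.get("ID") == field_id:
                return parent, index
    # Nothing matched anywhere in the tree.
    return None, -1


root = ET.fromstring(XML)
parent, index = child_index(root, "3")
# Tags carry the default namespace in {uri}name form, so strip it for display.
print(parent.tag.split("}")[-1], index)
```

Because the attribute is read with get("ID"), the default namespace on the elements does not get in the way of the lookup.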
