How to edit pmml model file using xml parser - python

I want to remove some nodes from a pmml file that I generated. So I tried to use xml parser in python:
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse('treedemo.pmml')
for inter in tree.findall('DataDictionary'):
    print(inter)
It turns out that nothing is printed, so the parser does not seem to find the element. The pmml file is here. Suppose I want to delete
<Interval closure="closedClosed" leftMargin="21.0" rightMargin="46.0"/>
from
<DataField name="fk_057_nearcontact_auth_expire_time" optype="continuous" dataType="float">
<Interval closure="closedClosed" leftMargin="21.0" rightMargin="46.0"/>
</DataField>
Can a pmml file be edited and modified with python?

Rather than developing custom XML manipulation code, you should learn about an existing technology called XSL Transformations (XSLT).
In brief, you need to create an XSL document, which specifies XML manipulation actions. You can then apply this XSL document to one or more XML documents (including PMML documents) using command-line XSLT tools. For example, on GNU/Linux systems you can use the xsltproc tool.
An XSL document for deleting Interval elements:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:pmml="http://www.dmg.org/PMML-4_2">
<!-- By default, copy all -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- However, in case of the PMML Interval element, take no (copy-) action -->
<xsl:template match="pmml:Interval"/>
</xsl:stylesheet>
Be sure to configure the value of the pmml namespace prefix to match that of your PMML documents. The above example applies to PMML schema version 4.2 documents.
Then, apply the stylesheet to PMML files (command syntax xsltproc <XSL file> <PMML file(s)>):
$ xsltproc --output test-mod.pmml test.xsl test.pmml

In general, editing a pmml file by hand is risky, so be careful.
You can use BeautifulSoup.
For your specific goal, the tag 'Interval' appears only once, so you can find it in one step and then extract it:
# import BeautifulSoup
from bs4 import BeautifulSoup
# open and read the file
inf = open(r'treedemo.xml', 'r')
txt = inf.read()
inf.close()
# prepare the soup
soup = BeautifulSoup(txt, 'xml')
# now find the tag you want to remove, in this case it's easy, since the tag 'Interval' is unique across your pmml file:
interval = soup.find('Interval')
# remove the tag
interval.extract()
# write the updated pmml file
with open(r'treedemo_clean.xml', "w") as outf:
    outf.write(str(soup))
The output will have no indentation unless you use outf.write(str(soup.prettify())).
I do not recommend using prettify, as it might mess up the pmml.
If the tag is not unique, you have to locate it carefully to avoid deleting the wrong tag and breaking your pmml.
There is nothing wrong with the field you want to remove; it shows statistics from your training data set. You can set the flag with_statistics=False
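The empty findall('DataDictionary') result in the question is most likely a namespace issue: PMML elements belong to a namespace (the XSLT answer above uses http://www.dmg.org/PMML-4_2), so ElementTree needs the qualified tag name. Below is a minimal standard-library sketch, assuming a PMML 4.2 document. The inline sample and the helper name remove_intervals are illustrative and not taken from the original file.

```python
import xml.etree.ElementTree as ET

# Assumption: the document uses the PMML 4.2 namespace; adjust to match your file.
PMML_NS = "http://www.dmg.org/PMML-4_2"

# Tiny stand-in for the real PMML file (illustrative only).
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <DataDictionary>
    <DataField name="fk_057_nearcontact_auth_expire_time" optype="continuous" dataType="float">
      <Interval closure="closedClosed" leftMargin="21.0" rightMargin="46.0"/>
    </DataField>
  </DataDictionary>
</PMML>"""


def remove_intervals(root, ns=PMML_NS):
    """Remove every Interval child of every DataField; return how many were removed."""
    removed = 0
    # Materialize the iterator first so the tree is not modified while it is walked.
    for field in list(root.iter("{%s}DataField" % ns)):
        for interval in field.findall("{%s}Interval" % ns):
            field.remove(interval)
            removed += 1
    return removed


# Keep the default namespace on output instead of generated ns0: prefixes.
ET.register_namespace("", PMML_NS)
root = ET.fromstring(SAMPLE)
removed_count = remove_intervals(root)
cleaned = ET.tostring(root, encoding="unicode")
print(removed_count)
print(cleaned)
```

For a real file, ET.parse('treedemo.pmml') followed by tree.write(...) would replace the fromstring/tostring pair.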

Related

Parse deeply nested XML to pandas dataframe

I'm trying to fetch particular parts of a XML file and move it into a pandas dataframe. Following some tutorials from xml.etree I'm still stuck at getting the output. So far, I've managed to find the child nodes, but I can't access them (i.e. can't get the actual data out of it). So, here is what I've got so far.
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
# find all nodes within xml data
tree.findall(".//")
# access the node
tree.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
What I want is to get the data from the node programDescriptions and specifically the child programDescriptionText xml:lang="nl", and of course a couple extra. But first focus on this one.
Some data to work with:
<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://someUrl.nl/schema/enterprise/program http://someUrl.nl/schema/enterprise/program.xsd">
<customizableOnRequest>true</customizableOnRequest>
<editor>webmaster@url</editor>
<expires>2019-04-21</expires>
<format>Edu-dex 1.0</format>
<generator>www.Url.com</generator>
<includeInCatalog>Catalogs</includeInCatalog>
<inPublication>true</inPublication>
<lastEdited>2019-04-12T20:03:09Z</lastEdited>
<programAdmission>
<applicationOpen>true</applicationOpen>
<applicationType>individual</applicationType>
<maxNumberOfParticipants>12</maxNumberOfParticipants>
<minNumberOfParticipants>8</minNumberOfParticipants>
<paymentDue>up-front</paymentDue>
<requiredLevel>academic bachelor</requiredLevel>
<startDateDetermination>fixed starting date</startDateDetermination>
</programAdmission>
<programCurriculum>
<instructionMode>training</instructionMode>
<teacher>
<id>{D83FFC12-0863-44A6-BDBB-ED618627F09D}</id>
<name>SomeName</name>
<summary xml:lang="nl">
Long text of the summary. Not needed.
</summary>
</teacher>
<studyLoad period="hour">26</studyLoad>
</programCurriculum>
<programDescriptions>
<programName xml:lang="nl">Program Course Name</programName>
<programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
<programSummaryHtml xml:lang="nl">short Program Course Name summary in HTML format</programSummaryHtml>
<programDescriptionText xml:lang="nl">This part is needed from the XML.
Big program description text. This part is needed to parse from the XML file.
</programDescriptionText>
<programDescriptionHtml xml:lang="nl">Not needed;
Not needed as well;
</programDescriptionHtml>
<subjectText>
<subject>curriculum</subject>
<header1 xml:lang="nl">Beschrijving</header1>
<descriptionHtml xml:lang="nl">Yet another HTML desscription;
Not necessarily needed;</descriptionHtml>
</subjectText>
<searchword xml:lang="nl">search word</searchword>
<webLink xml:lang="nl">website-url</webLink>
</programDescriptions>
<programSchedule>
<programRun>
<id>PR-019514</id>
<status>application opened</status>
<startDate isFinal="true">2019-06-26</startDate>
<endDate isFinal="true">2020-02-11</endDate>
</programRun>
</programSchedule>
</program>
</programs>
Try the code below: (55703748.xml contains the xml you have posted)
import xml.etree.ElementTree as ET
tree = ET.parse('55703748.xml')
root = tree.getroot()
nodes = root.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
for node in nodes:
    print(node.text)
Output
short Program Course Name summary
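The same approach extends to programDescriptionText filtered by xml:lang. Here is a minimal sketch, assuming the structure shown in the question. The trimmed inline sample and the helper name program_rows are illustrative. ElementTree exposes xml:lang under the XML namespace, and a prefix map keeps the paths readable.

```python
import xml.etree.ElementTree as ET

# Prefix map for the document's default namespace (the prefix "p" is arbitrary).
NS = {"p": "http://someUrl.nl/schema/enterprise/program"}
# ElementTree stores xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed stand-in for the XML posted in the question.
SAMPLE = """<programs xmlns="http://someUrl.nl/schema/enterprise/program">
  <program>
    <programDescriptions>
      <programName xml:lang="nl">Program Course Name</programName>
      <programDescriptionText xml:lang="nl">This part is needed from the XML.</programDescriptionText>
    </programDescriptions>
  </program>
</programs>"""


def program_rows(root, lang="nl"):
    """Return one dict per program holding the wanted description fields."""
    rows = []
    for program in root.findall("p:program", NS):
        row = {}
        for tag in ("programName", "programDescriptionText"):
            for node in program.findall("p:programDescriptions/p:%s" % tag, NS):
                if node.get(XML_LANG) == lang:
                    row[tag] = (node.text or "").strip()
        rows.append(row)
    return rows


rows = program_rows(ET.fromstring(SAMPLE))
print(rows)
```

A list of dicts like this can be passed directly to pandas.DataFrame(rows) to get one row per program.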

Accessing non tree structured xml data in python

I have several xml files that I want to parse in python. I am aware of the ElementTree package in python, however my xml files aren't stored in a tree like structure. Below is an example
<tag1 attribute1="at1" attribute2="at2">My files are text that I annotated with a tool
to create these xml files.</tag1>
Some parts of the text are enclosed in an xml tag, whereas others are not.
<tag1 attribute1="at1" attribute2="at2"><tag2 attribute3="at3" attribute4="at4">Some
are even enclosed in multiple tags.</tag1></tag2>
And some have overlapping tags:
<tag1 attribute1="at1" attribute2="at2">This is an example sentence
<tag3 attribute5="at5">containing a nested example sentence</tag3></tag1>
Whenever I use an ElementTree like function to parse the file, I can only access the very first tag. I am looking for a way to parse all the tags and don't want a tree like structure. Any help is greatly appreciated.
If you have one XML fragment per line, just parse each line individually.
import xml.etree.ElementTree as ET
for line in some_file:
    root = ET.fromstring(line)  # parse the fragment on this line
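When fragments span several lines or are mixed with untagged text, one option is to wrap the whole text in a synthetic root element and iterate over every element. This is a minimal sketch, assuming the tags are properly nested (a parser will reject mis-ordered closing tags). The sample text and the helper name list_annotations are illustrative.

```python
import xml.etree.ElementTree as ET

# Illustrative annotated text: tagged spans mixed with untagged text.
TEXT = """<tag1 attribute1="at1">Annotated text.</tag1>
Plain text outside any tag.
<tag1 attribute1="at1"><tag3 attribute5="at5">Nested text</tag3></tag1>"""


def list_annotations(text):
    """Return (tag, attributes, enclosed text) for every element, in document order."""
    # A single synthetic root makes the fragment sequence a well-formed document.
    root = ET.fromstring("<root>%s</root>" % text)
    found = []
    for elem in root.iter():
        if elem is root:
            continue  # skip the wrapper itself
        enclosed = "".join(elem.itertext()).strip()
        found.append((elem.tag, dict(elem.attrib), enclosed))
    return found


annotations = list_annotations(TEXT)
for item in annotations:
    print(item)
```

Because iter() flattens the tree, every tag is visited regardless of nesting depth, which matches the wish for a flat list rather than a tree.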

Python Parsing XML with a complex hierarchy - Nuke9.0v8

I am working with NukeX9.0v8, Adobe Premiere Pro CC 2015 and Nuke's internal Python interpreter.
# Result: 2.7.3 (default, Jul 24 2013, 15:50:23)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
I am a vfx artist and I'm trying to wrap my brain around the best method to parse xml files in order to: create a folder structure, batch create .nk comp files and plug in the data within specific parts as I make my .nk comps. I have a bit of a grasp of how to do each of these things in isolation, but plugging it all together, and trying to find tutorials on such a complex parse, has ground me to a halt.
I know the scope of this is big but any small pieces of advice are appreciated.
Right now I have a nuke comp that has a node tree that takes in camera inputs and stitches them into a latlong image for 360 video, I am going to wrap that up into a gizmo for each different kind of rig configuration. This just simplifies the .nk files that are created and I can expose the parts of that gizmo I can feed data into.
Every day we receive a ton of footage from a shoot and we have to make a new .nk comp for each shot and set it to render right away. What I want to do is have the guys on set create a premiere project and organize the files based on this folder structure. That premiere project will be exported as an .xml file.
The design of the structure in premiere.
Day_01 (the day of the shoot)
-^-R001 (Roll number for the shots. R referring to camera type)
--^-R001_C001 (The name of the shot)
---^-Acamera clip (path to file name, video in point as frame#)
---^-Bcamera clip (path to file name, video in point as frame#)
---^-Ccamera clip (path to file name, video in point as frame#)
Right now in my script panel inside Nuke I can enter where the xml for the day is and which day to look for. It is then supposed to look at each folder name for the roll, use the first letter (R for RED camera), and look inside for the clip folder. It then uses the pathurl directory for the camera files on the drive and can also take in data like the in and out points if present in the xml. I also have fields to enter the template version in case I update a stitch process. That tells the nuke comp which gizmo to use.
Here is my panel in Nuke.
import os
import nuke

def sesquixmlparse():
    '''
    This imports the xml file from premiere. It looks for the bin that it is working for today and starts looking in what is inside the bins
    It then sees the bins inside and uses them to create nuke scripts with these as inputs
    It asks what template version to use for the rig. things change or maybe even get better
    '''
    # Lets build the Nuke Panel that tells us our inputs
    p = nuke.Panel("Sesqui XML Parse for Dailies")
    xml_file = 'Daily XML'
    daynumber = 'Day_##'
    nk_output_dir = 'Directory to build VFX folder structure'
    dnx_render_dir = 'Directory for write nodes'
    r_template_vr = 'VER1'
    g_template_vr = 'VER1'
    c_template_vr = 'VER1'
    p.addFilenameSearch("Daily XML", xml_file)
    p.addSingleLineInput("Bin to process", daynumber)
    p.addFilenameSearch("Directory to build VFX folder structure", nk_output_dir)
    p.addFilenameSearch("Directory to render from write nodes", dnx_render_dir)
    p.addSingleLineInput("3 Red stmap version", r_template_vr)
    p.addSingleLineInput("6 Gopro stmap version", g_template_vr)
    p.addSingleLineInput("5 Canon stmap version", c_template_vr)
    p.setWidth(600)
    print "Panel created"
    if not p.show():
        return
    # Assign var from nuke panel user-entered data
    xml_file = p.value("Daily XML")
    daynumber = p.value("Bin to process")
    nk_output_dir = p.value("Directory to build VFX folder structure")
    dnx_render_dir = p.value("Directory to render from write nodes")
    r_template_vr = p.value("3 Red stmap version")
    g_template_vr = p.value("6 Gopro stmap version")
    c_template_vr = p.value("5 Canon stmap version")
    print "var's assigned from panel"
    # Create paths for render directory if it does not exist
    if not os.path.isdir(dnx_render_dir):
        os.mkdir(dnx_render_dir)
        print dnx_render_dir + " directory created"
    if not os.path.isdir(nk_output_dir):
        os.mkdir(nk_output_dir)
        print nk_output_dir + " directory created"
I am at a loss on how to best read the xml file. All the tutorials I have seen on both DOM and elementtree are very basic and deal with direct code to read known XML tags and break data down to a simple str output.
I need to enter variables, which then constrain the parsing to a specific part of the tree, and go into an unknown hierarchy setup and seeing what is inside, and then make decisions on what to do with what it finds.
Here is a sample of my test XML file. The eventual plan is to have other different roll types that reference different camera types but for now I'm just working with 3 camera red rigs.
It's a very big file so here is a pastebin: http://pastebin.com/vLaRA0X8
Basically I am wanting to constrain the script to looking within my variable <bin><name>'daynumber'</name>~~~~</bin>. In this case looking in the Day_00 bin. If there is anything else in the root hierarchy I want to ignore it as sequences, unused clips and other data can get very very huge. I then want to create the directory of daynumber in the nk_output_dir & dnx_render_dir so that everything for this shoot day is contained in that folder.
An annoying part of the XML file is that the name of a bin is a child of the <bin> itself, so once a bin name is found, any <children> of that bin are at the same level of the tree as the <name>. I can't find sample code for locating a tag and then working with the tags that are in the same branch instead of its children.
Now that it has found the bin for the day I want it to start to look for all the bins in <children></children>. Example being <bin><name>R001</name>~~~</bin> and create directories inside the Day_00 folder I made in nk_output_dir & dnx_render_dir for each bin it finds in this part of the structure. Every time the camera reloads that will roll up to R002, R003, etc etc. Also different camera types like Gopros will create G001, G002, G003.
Then I want to look for in the <children> of the above bins and find all the bins inside like <bin><name>R001_C001</name>~~~</bin> and create folders in the nk_output_dir\daynumber\~whatever bin this is contained~\~name of this bin~\. Which is user created of the roll number and clip number. (R001_C001, R001_C002, etc etc) This will be the new clip name, the name of the .nk comp that will be generated and the file name of the render on the write node.
The goal here is to recreate the bin folder structure in the directory I've chosen for nk_output_dir.
The dnx_render_dir that is for being plugged into the write nodes of my nuke scripts later to where the files should be rendered to. It's separate because I'd have a different RAID drive that it will go to that will change as they fill up. The renders just need to be put in a directory for the daynumber\~rollnumber~ but doesn't need to be constrained into a folder for the clipname.
Here is where I am really lost. Now, because I have to account for user error, I can't be entirely sure how deep in the tree I need to go. I know I want the <pathurl>~</pathurl> which I can plug into the .nk (nuke) scripts I make. With red camera files they can either be directly here as .R3D files or inside a folder structure which can be 2-3 bins deep. I know that I can't 100% rely on the guys on set to be consistent on how they make this bin.
All I can trust them to do is make sure they are in correct alphabetic order. If you look at the xml, the order of them is important. I also know that if I am looking at a R### roll bin I need 3 <pathurl></pathurl>, if I'm looking inside G### I need 6, and for C### only 5.
The order of them is important as they can rename the name tag inside `~~~~ to rename cameras that were the wrong setting without renaming source files. (which breaks important metadata that is needed in other programs)
While in this part of the tree I'd also like to grab the <clip id=~><in>###</in> to grab the in marker frame offset. If the cameras have gone out of sync and their start points can be set. But of course this tag is not child to the <pathurl></pathurl> and is actually 3 parents up! Also this tag won't be on every clip so I can't look for it first!
<clip id="masterclip-40" explodedTracks="true" frameBlend="FALSE">
<uuid>85f87acc-308f-401e-bf82-55e8ea41e55a</uuid>
<masterclipid>masterclip-40</masterclipid>
<ismasterclip>TRUE</ismasterclip>
<duration>5355</duration>
<rate>
<timebase>30</timebase>
<ntsc>TRUE</ntsc>
</rate>
<in>876</in>
<name>B002_C002_0216AM_002.R3D</name>
<media>
<video>
<track>
<clipitem id="clipitem-118" frameBlend="FALSE">
<masterclipid>masterclip-40</masterclipid>
<name>B002_C002_0216AM_002.R3D</name>
<rate>
<timebase>30</timebase>
<ntsc>TRUE</ntsc>
</rate>
<alphatype>none</alphatype>
<pixelaspectratio>square</pixelaspectratio>
<anamorphic>FALSE</anamorphic>
<file id="file-40">
<name>B002_C002_0216AM_002.R3D</name>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/B002/B002_0216G4.RDM/B002_C002_0216AM.RDC/B002_C002_0216AM_002.R3D</pathurl>
So once I've parsed all this the information I'd like to have is.
The original bin folder structure of the XML contained in the daynumber. Take the names of the bins and construct the same folder structure in the nk_output_dir (Day_00/R001/R001_C001 etc etc)
I also want to make a daynumber directory in the dnx_render_dir folder and a directory for each bin referencing a camera roll.
Based on whether the clip name starts with R, G or C, I want to be able to select what kind of .nk to make.
I want the pathurl information for each bin that is referring to a clip and plug. I also want any <in> information if there is any for that clip. That way I can plug it into the read node information for my nuke gizmo.
I think once I figure out how to parse such a complicated xml tree I'll be able to fuss and fumble through the rest of the process.
I am just really struggling with finding examples of parsing a complicated XML file like this.
Whenever faced with a complex XML, consider an XSLT script to transform your XML into a simpler structure. As information, XSLT is a special-purpose, declarative language (same type as SQL) designed to transform XML into various structures for end use needs. Python, like other general purpose languages, can run XSLT 1.0, specifically through the third-party lxml module.
While this transformation does not address your entire needs, you can parse the simpler structure for your Nuke application needs. Directories and names are simplified and labeled for daynumber, rollnumber, shotnames, and clip with pathurls.
XSLT script (save as .xsl or .xslt to be referenced in .py script below)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:key name="idkey" match="ctype" use="@id" />
<xsl:template match="/">
<root>
<xsl:apply-templates select="*"/>
</root>
</xsl:template>
<xsl:template match="xmeml/bin">
<daynumber><xsl:value-of select="name"/></daynumber>
<xsl:apply-templates select="children/bin"/>
</xsl:template>
<xsl:template match="xmeml/bin/children/bin">
<roll>
<rollnumber><xsl:value-of select="name"/></rollnumber>
<rollnumberdir><xsl:value-of select="concat(ancestor::bin/name,
'/', name)"/></rollnumberdir>
<xsl:apply-templates select="children/bin"/>
</roll>
</xsl:template>
<xsl:template match="xmeml/bin/children/bin/children/bin">
<shot>
<shotname><xsl:value-of select="name"/></shotname>
<shotnamedir><xsl:value-of select="concat(/xmeml/bin/name, '/',
/xmeml/bin/children/bin/name, '/', name)"/></shotnamedir>
<xsl:apply-templates select="descendant::clip[position() &lt; 4]"/>
</shot>
</xsl:template>
<xsl:template match="clip">
<clip>
<clipname><xsl:value-of select="descendant::name"/></clipname>
<xsl:copy-of select="in"/>
<pathurl><xsl:value-of select="descendant::pathurl"/></pathurl>
</clip>
</xsl:template>
</xsl:transform>
Python script (transform, parse, and export simpler structure)
#!/usr/bin/python
import lxml.etree as ET
# LOAD INPUT XML AND XSLT
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
# TRANSFORM XML (SIMPLER NEWDOM CAN BE FURTHER PARSED: ITER(), FINDALL(), XPATH())
transform = ET.XSLT(xslt)
newdom = transform(dom)
# XPATH EXPRESSIONS (LIST OUTPUTS)
daynumber = newdom.xpath('//daynumber/text()')
# ['Day_00']
rolls = newdom.xpath('//rollnumber/text()')
# ['R001', 'R002']
shots = newdom.xpath('//shotname/text()')
# ['R001_C001', 'R002_C001', 'R002_C002']
# CONVERT TO STRING (IF NEEDED)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out.decode("utf-8"))
# OUTPUT TO FILE (IF NEEDED)
xmlfile = open('Output.xml', 'wb')
xmlfile.write(tree_out)
xmlfile.close()
TRANSFORMED XML (contained in newdom object in .py script)
<?xml version='1.0' encoding='UTF-8'?>
<root>
<daynumber>Day_00</daynumber>
<roll>
<rollnumber>R001</rollnumber>
<rollnumberdir>Day_00/R001</rollnumberdir>
<shot>
<shotname>R001_C001</shotname>
<shotnamedir>Day_00/R001/R001_C001</shotnamedir>
<clip>
<clipname>A002_C001_0216MW_001.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R001/A002/A002_0216FE.RDM/A002_C001_0216MW.RDC/A002_C001_0216MW_001.R3D</pathurl>
</clip>
<clip>
<clipname>A002_C001_0216MW_002.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R001/A002/A002_0216FE.RDM/A002_C001_0216MW.RDC/A002_C001_0216MW_002.R3D</pathurl>
</clip>
<clip>
<clipname>A002_C001_0216MW_003.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R001/A002/A002_0216FE.RDM/A002_C001_0216MW.RDC/A002_C001_0216MW_003.R3D</pathurl>
</clip>
</shot>
</roll>
<roll>
<rollnumber>R002</rollnumber>
<rollnumberdir>Day_00/R002</rollnumberdir>
<shot>
<shotname>R002_C001</shotname>
<shotnamedir>Day_00/R001/R002_C001</shotnamedir>
<clip>
<clipname>A003_C001_0216XI_001.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/A003/A003_0216XO.RDM/A003_C001_0216XI.RDC/A003_C001_0216XI_001.R3D</pathurl>
</clip>
<clip>
<clipname>B002_C001_02169H_002.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/B002/B002_0216G4.RDM/B002_C001_02169H.RDC/B002_C001_02169H_002.R3D</pathurl>
</clip>
<clip>
<clipname>C002_C001_02168R_001.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/C002/C002_0216RL.RDM/C002_C001_02168R.RDC/C002_C001_02168R_001.R3D</pathurl>
</clip>
</shot>
<shot>
<shotname>R002_C002</shotname>
<shotnamedir>Day_00/R001/R002_C002</shotnamedir>
<clip>
<clipname>C002_C002_0216M9_001.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/C002/C002_0216RL.RDM/C002_C002_0216M9.RDC/C002_C002_0216M9_001.R3D</pathurl>
</clip>
<clip>
<clipname>C002_C002_0216M9_002.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/C002/C002_0216RL.RDM/C002_C002_0216M9.RDC/C002_C002_0216M9_002.R3D</pathurl>
</clip>
<clip>
<clipname>C002_C002_0216M9_003.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/C002/C002_0216RL.RDM/C002_C002_0216M9.RDC/C002_C002_0216M9_003.R3D</pathurl>
</clip>
</shot>
</roll>
</root>
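If lxml is not available inside Nuke, the bin hierarchy can also be walked recursively with the standard-library ElementTree. The sketch below assumes the bin / name / children layout described in the question. The miniature sample XML and the helper names find_bin and walk_bins are illustrative, not taken from the pastebin file. Because name and children are both direct children of a bin, findtext("name") and findall("children/bin") cover the sibling relationship without any upward navigation.

```python
import xml.etree.ElementTree as ET

# Miniature stand-in for the Premiere export (illustrative only).
SAMPLE = """<xmeml><bin><name>Day_00</name><children>
  <bin><name>R001</name><children>
    <bin><name>R001_C001</name><children>
      <clip id="masterclip-1"><in>876</in><name>A.R3D</name>
        <media><video><track><clipitem><file>
          <pathurl>file://localhost/A.R3D</pathurl>
        </file></clipitem></track></video></media>
      </clip>
    </children></bin>
  </children></bin>
</children></bin></xmeml>"""


def find_bin(root, bin_name):
    """Return the first <bin> whose direct <name> child equals bin_name, else None."""
    for candidate in root.iter("bin"):
        if candidate.findtext("name") == bin_name:
            return candidate
    return None


def walk_bins(bin_elem, parent_path=""):
    """Yield (folder path, clips) for bin_elem and every bin nested below it."""
    name = bin_elem.findtext("name", default="")
    path = parent_path + "/" + name if parent_path else name
    clips = []
    for clip in bin_elem.findall("children/clip"):
        clips.append({
            # pathurl sits several levels below the clip, so search all descendants
            "pathurl": clip.findtext(".//pathurl"),
            # <in> is a direct child of the clip and may be absent (None)
            "in": clip.findtext("in"),
        })
    yield path, clips
    for child in bin_elem.findall("children/bin"):
        for item in walk_bins(child, path):
            yield item


root = ET.fromstring(SAMPLE)
result = list(walk_bins(find_bin(root, "Day_00")))
for path, clips in result:
    print(path, clips)
```

Each yielded path can be joined onto nk_output_dir to create the folder, and the first letter of the roll name can select the template.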

Python script to change an attribute value in .tcx file (XML)

I have a .tcx (XML) file, with the following schema:
<Activities>
<Activity>
<Lap StartTime="2015-12-24T08:12:18.969Z">
<TotalTimeSeconds>4069.0</TotalTimeSeconds>
<DistanceMeters>30458.794921875</DistanceMeters>
<MaximumSpeed>43.36123275756836</MaximumSpeed>
<Calories>2286</Calories>
<AverageHeartRateBpm><Value>144</Value></AverageHeartRateBpm><MaximumHeartRateBpm><Value>169</Value></MaximumHeartRateBpm>
<Intensity>Active</Intensity>
<Cadence>87</Cadence>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2015-12-24T08:12:19.969Z</Time>
<Position><LatitudeDegrees>45.4917</LatitudeDegrees><LongitudeDegrees>9.16198</LongitudeDegrees></Position>
<AltitudeMeters>124.018</AltitudeMeters>
<DistanceMeters>0.0</DistanceMeters>
<SensorState>Present</SensorState>
<Extensions><TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2"><Watts>0</Watts></TPX></Extensions></Trackpoint>
...
</Track>
</Lap>
</Activity>
</Activities>
and need to change (double) the Watts attribute.
Would like a simple python script
Simply run an XSLT script. No Python loops or expensive XPaths (//) are needed. As information, XSLT is a declarative, special-purpose programming language used specifically to restructure, redesign, or re-format XML documents to various end use needs. Like most general purpose languages such as Java, C#, Perl, PHP, and VB, Python can run XSLT 1.0, in its case through the third-party lxml module.
Below runs an identity transform to copy entire document as is and then multiplies the current value in any Watts node by 2. I declare a namespace doc in XSLT to reference the Watts element.
XSLT (save as .xsl or .xslt)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="doc:Watts">
<xsl:copy>
<xsl:value-of select=". * 2"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
Python Script
import lxml.etree as ET
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open('Output.xml', 'wb')
xmlfile.write(tree_out)
xmlfile.close()
Your last two element tags need to be closing tags, and you have a Watts element not an attribute. Here is how to do it with your file structure.
Python provides the ElementTree library for this. The following script will accomplish what you want:
import xml.etree.ElementTree as ET
tree = ET.parse("test.tcx")
tpxns = "http://www.garmin.com/xmlschemas/ActivityExtension/v2"
for watts in tree.iter("{%s}Watts"%tpxns):
    watts.text = str(2*int(watts.text))
tree.write("testnew.tcx")
Here I import the ElementTree library and use a simpler name for it. The parse function creates an ElementTree object from your file. I walk through the file to find all Watts elements (as these occur in a namespace, I actually need to look for {http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts, which I build up using string formatting).
When I find such an element, I set the inner text to be twice what the previous value was (converting to an int first and then back to a string).
Finally, I write the new xml file out. I could have overwritten the original file here if I wanted to.
Look over the documentation for the ElementTree module if you need to do anything fancier. It provides very powerful tools for working with XML. There are even more powerful libraries out there if you need more features (I like lxml for example).

Updating an Existing XML Document in Python

I have a large XML file whose structure is approximately as follows:
<GROUNDTRUTH>
<thing fileName="1" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="2" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="3" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
</GROUNDTRUTH>
I don't think I was clear enough in the original posting of this question. I have an xml document called GROUNDTRUTH, and inside of that I have several thousand "things". I want to search through all of the things in the document via filename and then change an attribute. So if I was searching for fileName="2", I would change its attribute to attrib=x. And for some thing, perhaps I'd go down to the sub level and change moreStuff.
My plan is to store into a csv file the names of the 'things' I need to change, and what I want to change the value of 'attrib' to. What function or module will provide this kind of functionality? Or am I just missing an easy/obvious approach? Ultimately I'd like to have a working script that will take a csv file with the thing identifier, and value to be updated, and take the xml file to make those changes onto.
Thanks for your help and suggestions!
First, you can transform the original xml file into an outputted xml file using an xslt stylesheet which can modify xml files in any way, shape, or form such as modifying, re-structuring, re-ordering attributes, elements, etc. Do note xsl is a declarative special-purpose language to transform and render XML documents.
Then, you can use Python's lxml library to run the transformation:
#!/usr/bin/python
import lxml.etree as ET
dom = ET.parse('originalfile.xml')
xslt = ET.parse('transformfile.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
xmlfile = open('finalfile.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
By the way, PHP, Java, C, VB, or pretty much any language, even your everyday browser can run transformations! To have the browser run it, simply add stylesheet in header:
<?xml-stylesheet type="text/xsl" href="transformfile.xsl"?>
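For the CSV-driven update described in the question, a plain ElementTree loop may be simpler than a stylesheet. This is a minimal sketch, assuming a CSV with header columns fileName and attrib. The column names, the inline sample data, and the helper name apply_updates are assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the large GROUNDTRUTH file.
XML = """<GROUNDTRUTH>
<thing fileName="1" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
<thing fileName="2" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
</GROUNDTRUTH>"""

# Assumed CSV layout: the thing identifier and the new value for 'attrib'.
CSV_TEXT = "fileName,attrib\n2,x\n"


def apply_updates(root, csv_file):
    """Set 'attrib' on each <thing> whose fileName appears in the CSV; return the count."""
    updates = {row["fileName"]: row["attrib"] for row in csv.DictReader(csv_file)}
    changed = 0
    for thing in root.iter("thing"):
        new_value = updates.get(thing.get("fileName"))
        if new_value is not None:
            thing.set("attrib", new_value)
            changed += 1
    return changed


root = ET.fromstring(XML)
changed = apply_updates(root, io.StringIO(CSV_TEXT))
print(changed)
print(ET.tostring(root, encoding="unicode"))
```

Building the lookup dict once keeps the pass over several thousand things linear. With real files, ET.parse(...) and tree.write(...) would replace the in-memory strings, and thing.find("SUBSUB").set("moreStuff", ...) would reach the sub level.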
