Python Parsing XML with a complex hierarchy - Nuke9.0v8


I am working with NukeX 9.0v8, Adobe Premiere Pro CC 2015, and Nuke's internal Python interpreter.
# Result: 2.7.3 (default, Jul 24 2013, 15:50:23)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
I am a VFX artist trying to wrap my brain around the best way to parse XML files in order to create a folder structure, batch-create .nk comp files, and plug the parsed data into specific parts of those comps as I make them. I have a bit of a grasp of how to do each of these things in isolation, but plugging it all together, and trying to find tutorials on such a complex parse, has ground me to a halt.
I know the scope of this is big but any small pieces of advice are appreciated.
Right now I have a nuke comp that has a node tree that takes in camera inputs and stitches them into a latlong image for 360 video, I am going to wrap that up into a gizmo for each different kind of rig configuration. This just simplifies the .nk files that are created and I can expose the parts of that gizmo I can feed data into.
Every day we receive a ton of footage from a shoot and we have to make a new .nk comp for each shot and set it to render right away. What I want to do is have the guys on set create a premiere project and organize the files based on this folder structure. That premiere project will be exported as an .xml file.
The design of the structure in premiere.
Day_01 (the day of the shoot)
-^-R001 (Roll number for the shots. R referring to camera type)
--^-R001_C001 (The name of the shot)
---^-Acamera clip (path to file name, video in point as frame#)
---^-Bcamera clip (path to file name, video in point as frame#)
---^-Ccamera clip (path to file name, video in point as frame#)
Right now, in my script panel inside Nuke, I can enter where the XML for the day is and which day to look for. The script is then supposed to look at each roll folder name, use the first letter to identify the camera type (R for RED camera), and look inside for the clip folders. It then uses the pathurl for the camera files on the drive and can also take in data like the in and out points if they are present in the XML. I also have fields for the template version, in case I update a stitch process; that tells the Nuke comp which gizmo to use.
Here is my panel in Nuke.
import os
import nuke

def sesquixmlparse():
    '''
    This imports the XML file from Premiere. It looks for the bin that it is working on today and looks at what is inside that bin.
    It then sees the bins inside and uses them to create Nuke scripts with these as inputs.
    It asks what template version to use for the rig, since things change or maybe even get better.
    '''
    # Let's build the Nuke panel that tells us our inputs
    p = nuke.Panel("Sesqui XML Parse for Dailies")
    xml_file = 'Daily XML'
    daynumber = 'Day_##'
    nk_output_dir = 'Directory to build VFX folder structure'
    dnx_render_dir = 'Directory for write nodes'
    r_template_vr = 'VER1'
    g_template_vr = 'VER1'
    c_template_vr = 'VER1'
    p.addFilenameSearch("Daily XML", xml_file)
    p.addSingleLineInput("Bin to process", daynumber)
    p.addFilenameSearch("Directory to build VFX folder structure", nk_output_dir)
    p.addFilenameSearch("Directory to render from write nodes", dnx_render_dir)
    p.addSingleLineInput("3 Red stmap version", r_template_vr)
    p.addSingleLineInput("6 Gopro stmap version", g_template_vr)
    p.addSingleLineInput("5 Canon stmap version", c_template_vr)
    p.setWidth(600)
    print "Panel created"
    if not p.show():
        return
    # Assign variables from the data the user entered in the panel
    xml_file = p.value("Daily XML")
    daynumber = p.value("Bin to process")
    nk_output_dir = p.value("Directory to build VFX folder structure")
    dnx_render_dir = p.value("Directory to render from write nodes")
    r_template_vr = p.value("3 Red stmap version")
    g_template_vr = p.value("6 Gopro stmap version")
    c_template_vr = p.value("5 Canon stmap version")
    print "var's assigned from panel"
    # Create the render and output directories if they do not exist
    if not os.path.isdir(dnx_render_dir):
        os.mkdir(dnx_render_dir)
        print dnx_render_dir + " directory created"
    if not os.path.isdir(nk_output_dir):
        os.mkdir(nk_output_dir)
        print nk_output_dir + " directory created"
I am at a loss on how to best read the xml file. All the tutorials I have seen on both DOM and elementtree are very basic and deal with direct code to read known XML tags and break data down to a simple str output.
I need to enter variables that constrain the parsing to a specific part of the tree, then go into a hierarchy whose layout I don't know in advance, see what is inside, and make decisions about what to do with what it finds.
Here is a sample of my test XML file. The eventual plan is to have other different roll types that reference different camera types but for now I'm just working with 3 camera red rigs.
It's a very big file so here is a pastebin: http://pastebin.com/vLaRA0X8
Basically I am wanting to constrain the script to looking within my variable <bin><name>'daynumber'</name>~~~~</bin>. In this case looking in the Day_00 bin. If there is anything else in the root hierarchy I want to ignore it as sequences, unused clips and other data can get very very huge. I then want to create the directory of daynumber in the nk_output_dir & dnx_render_dir so that everything for this shoot day is contained in that folder.
An annoying part of the XML file is that the name of a bin is a child of the <bin> itself, so once a bin name is found, any <children> of that bin are at the same level of the tree as the <name>. I can't find sample code for locating a tag and then working with the tags that are in the same branch instead of its children.
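For what it's worth, ElementTree elements expose their direct children through find/findtext, so both the <name> and the <children> siblings can be reached from the <bin> element itself. A minimal sketch, assuming the day bins sit directly under the xmeml root (the inline XML and the helper name find_bin are made up for illustration, not taken from the real export):

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the Premiere export; the real file is much larger.
SAMPLE = """
<xmeml>
  <sequence><name>ignore me</name></sequence>
  <bin>
    <name>Day_00</name>
    <children>
      <bin><name>R001</name><children/></bin>
      <bin><name>R002</name><children/></bin>
    </children>
  </bin>
</xmeml>
"""

def find_bin(parent, wanted_name):
    """Return the direct child <bin> of `parent` whose <name> matches, or None."""
    for bin_elem in parent.findall('bin'):
        # <name> is a direct child of <bin>, so it is read from the bin itself
        if bin_elem.findtext('name') == wanted_name:
            return bin_elem
    return None

root = ET.fromstring(SAMPLE)
day_bin = find_bin(root, 'Day_00')
# <children> sits next to <name>, so it is also looked up on the bin element
roll_names = [b.findtext('name') for b in day_bin.findall('children/bin')]
print(roll_names)
```

Anything else at the root level (sequences, other days) is never visited, because only direct child bins are compared.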
Now that it has found the bin for the day I want it to start to look for all the bins in <children></children>. Example being <bin><name>R001</name>~~~</bin> and create directories inside the Day_00 folder I made in nk_output_dir & dnx_render_dir for each bin it finds in this part of the structure. Every time the camera reloads that will roll up to R002, R003, etc etc. Also different camera types like Gopros will create G001, G002, G003.
Then I want to look in the <children> of the above bins, find all the bins inside, like <bin><name>R001_C001</name>~~~</bin>, and create folders in nk_output_dir\daynumber\~whatever bin this is contained in~\~name of this bin~\. The bin name is user-created from the roll number and clip number (R001_C001, R001_C002, etc.). This will be the new clip name, the name of the .nk comp that will be generated, and the file name of the render on the write node.
The goal here is to recreate the bin folder structure in the directory I've chosen for nk_output_dir.
The dnx_render_dir is what gets plugged into the write nodes of my Nuke scripts later, to say where the files should be rendered. It's separate because the renders go to a different RAID drive, which will change as drives fill up. The renders just need to be put in a directory for daynumber\~rollnumber~; they don't need a folder for each clip name.
Here is where I am really lost. Because I have to account for user error, I can't be entirely sure how deep in the tree I need to go. I know I want the <pathurl>~</pathurl>, which I can plug into the .nk (Nuke) scripts I make. With RED camera files, the .R3D files can either sit directly here or inside the camera's folder structure, which can be 2-3 bins deep. I know that I can't 100% rely on the guys on set to be consistent in how they make this bin.
All I can trust them to do is make sure the clips are in correct alphabetical order, so, as you can see in the XML, their order is important. I also know that if I am looking at an R### roll bin I need 3 <pathurl></pathurl>, if I'm looking inside G### I need 6, and for C### only 5.
The order of them is important as they can rename the name tag inside `~~~~ to rename cameras that were on the wrong setting without renaming the source files (which would break important metadata that is needed in other programs).
While in this part of the tree I'd also like to grab the <clip id=~><in>###</in> to get the in-marker frame offset, so that if the cameras have gone out of sync their start points can be set. But of course this tag is not a child of the <pathurl></pathurl>; it is actually 3 parents up! Also, this tag won't be on every clip, so I can't look for it first!
<clip id="masterclip-40" explodedTracks="true" frameBlend="FALSE">
<uuid>85f87acc-308f-401e-bf82-55e8ea41e55a</uuid>
<masterclipid>masterclip-40</masterclipid>
<ismasterclip>TRUE</ismasterclip>
<duration>5355</duration>
<rate>
<timebase>30</timebase>
<ntsc>TRUE</ntsc>
</rate>
<in>876</in>
<name>B002_C002_0216AM_002.R3D</name>
<media>
<video>
<track>
<clipitem id="clipitem-118" frameBlend="FALSE">
<masterclipid>masterclip-40</masterclipid>
<name>B002_C002_0216AM_002.R3D</name>
<rate>
<timebase>30</timebase>
<ntsc>TRUE</ntsc>
</rate>
<alphatype>none</alphatype>
<pixelaspectratio>square</pixelaspectratio>
<anamorphic>FALSE</anamorphic>
<file id="file-40">
<name>B002_C002_0216AM_002.R3D</name>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/B002/B002_0216G4.RDM/B002_C002_0216AM.RDC/B002_C002_0216AM_002.R3D</pathurl>
So once I've parsed all this, the information I'd like to have is:
The original bin folder structure of the XML contained in the daynumber. Take the names of the bins and construct the same folder structure in the nk_output_dir (Day_00/R001/R001_C001, etc.).
I also want to make a daynumber directory in the dnx_render_dir folder and a directory for each bin referencing a camera roll.
Based on whether the clip name starts with an R, G or C, I want to be able to select what kind of .nk to make.
I want the pathurl information for each bin that refers to a clip. I also want any <in> information if there is any for that clip. That way I can plug it into the read node information for my Nuke gizmo.
I think once I figure out how to parse such a complicated XML tree I'll be able to fuss and fumble through the rest of the process.
I am just really struggling to find examples of parsing a complicated XML file like this.
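The walk described above (day bin, then roll bins, then shot bins, then clips at an unknown depth) might be sketched with the standard library alone. This is only an illustration under stated assumptions: the day bin is a direct child of the xmeml root, the inline XML is a made-up miniature of the export, and the names collect_shots and CLIPS_PER_RIG are hypothetical.

```python
import xml.etree.ElementTree as ET

# Expected number of camera files per rig, keyed by the first letter of the roll name
CLIPS_PER_RIG = {'R': 3, 'G': 6, 'C': 5}

SAMPLE = """
<xmeml><bin><name>Day_00</name><children>
  <bin><name>R001</name><children>
    <bin><name>R001_C001</name><children>
      <bin><name>extra level</name><children>
        <clip id="masterclip-1"><in>876</in><name>A.R3D</name>
          <media><video><track><clipitem><file>
            <pathurl>file://localhost/Volumes/X/A.R3D</pathurl>
          </file></clipitem></track></video></media></clip>
        <clip id="masterclip-2"><name>B.R3D</name>
          <media><video><track><clipitem><file>
            <pathurl>file://localhost/Volumes/X/B.R3D</pathurl>
          </file></clipitem></track></video></media></clip>
      </children></bin>
    </children></bin>
  </children></bin>
</children></bin></xmeml>
"""

def collect_shots(root, daynumber):
    """Return one dict per shot bin found under the requested day bin."""
    shots = []
    for day in root.findall('bin'):
        if day.findtext('name') != daynumber:
            continue  # skip other days; sequences and loose clips are never visited
        for roll in day.findall('children/bin'):
            roll_name = roll.findtext('name')
            wanted = CLIPS_PER_RIG.get(roll_name[:1])  # None means "keep all"
            for shot in roll.findall('children/bin'):
                clips = []
                # iter() descends to any depth and yields clips in document order
                for clip in shot.iter('clip'):
                    pathurl = clip.findtext('.//pathurl')
                    if pathurl is None:
                        continue
                    # <in> is optional: findtext returns None when it is absent
                    clips.append({'pathurl': pathurl, 'in': clip.findtext('in')})
                shots.append({'roll': roll_name,
                              'shot': shot.findtext('name'),
                              'clips': clips[:wanted]})
    return shots

shots = collect_shots(ET.fromstring(SAMPLE), 'Day_00')
print(shots[0]['shot'], len(shots[0]['clips']))
```

Starting from each clip and reading both <in> and the nested <pathurl> from it avoids having to climb back up the tree from the <pathurl>.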

Whenever faced with a complex XML file, consider an XSLT script to transform the XML into a simpler structure. For background, XSLT is a special-purpose, declarative language (the same type as SQL) designed to transform XML into various structures for end-use needs. Python, like other general-purpose languages, can run XSLT 1.0 scripts, specifically through the third-party lxml module.
While this transformation does not address all of your needs, you can parse the simpler structure for your Nuke application. Directories and names are simplified and labeled for daynumber, rollnumber, shotnames, and clips with pathurls.
XSLT script (save as .xsl or .xslt to be referenced in .py script below)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:key name="idkey" match="ctype" use="@id" />
<xsl:template match="/">
<root>
<xsl:apply-templates select="*"/>
</root>
</xsl:template>
<xsl:template match="xmeml/bin">
<daynumber><xsl:value-of select="name"/></daynumber>
<xsl:apply-templates select="children/bin"/>
</xsl:template>
<xsl:template match="xmeml/bin/children/bin">
<roll>
<rollnumber><xsl:value-of select="name"/></rollnumber>
<rollnumberdir><xsl:value-of select="concat(ancestor::bin/name,
'/', name)"/></rollnumberdir>
<xsl:apply-templates select="children/bin"/>
</roll>
</xsl:template>
<xsl:template match="xmeml/bin/children/bin/children/bin">
<shot>
<shotname><xsl:value-of select="name"/></shotname>
<shotnamedir><xsl:value-of select="concat(ancestor::bin[2]/name, '/',
ancestor::bin[1]/name, '/', name)"/></shotnamedir>
<xsl:apply-templates select="descendant::clip[position() &lt; 4]"/>
</shot>
</xsl:template>
<xsl:template match="clip">
<clip>
<clipname><xsl:value-of select="descendant::name"/></clipname>
<xsl:copy-of select="in"/>
<pathurl><xsl:value-of select="descendant::pathurl"/></pathurl>
</clip>
</xsl:template>
</xsl:transform>
Python script (transform, parse, and export simpler structure)
#!/usr/bin/python
import lxml.etree as ET
# LOAD INPUT XML AND XSLT
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
# TRANSFORM XML (SIMPLER NEWDOM CAN BE FURTHER PARSED: ITER(), FINDALL(), XPATH())
transform = ET.XSLT(xslt)
newdom = transform(dom)
# XPATH EXPRESSIONS (LIST OUTPUTS)
daynumber = newdom.xpath('//daynumber/text()')
# ['Day_00']
rolls = newdom.xpath('//rollnumber/text()')
# ['R001', 'R002']
shots = newdom.xpath('//shotname/text()')
# ['R001_C001', 'R002_C001', 'R002_C002']
# CONVERT TO STRING (IF NEEDED)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out.decode("utf-8"))
# OUTPUT TO FILE (IF NEEDED)
xmlfile = open('Output.xml', 'wb')
xmlfile.write(tree_out)
xmlfile.close()
TRANSFORMED XML (contained in newdom object in .py script)
<?xml version='1.0' encoding='UTF-8'?>
<root>
<daynumber>Day_00</daynumber>
<roll>
<rollnumber>R001</rollnumber>
<rollnumberdir>Day_00/R001</rollnumberdir>
<shot>
<shotname>R001_C001</shotname>
<shotnamedir>Day_00/R001/R001_C001</shotnamedir>
<clip>
<clipname>A002_C001_0216MW_001.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R001/A002/A002_0216FE.RDM/A002_C001_0216MW.RDC/A002_C001_0216MW_001.R3D</pathurl>
</clip>
<clip>
<clipname>A002_C001_0216MW_002.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R001/A002/A002_0216FE.RDM/A002_C001_0216MW.RDC/A002_C001_0216MW_002.R3D</pathurl>
</clip>
<clip>
<clipname>A002_C001_0216MW_003.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R001/A002/A002_0216FE.RDM/A002_C001_0216MW.RDC/A002_C001_0216MW_003.R3D</pathurl>
</clip>
</shot>
</roll>
<roll>
<rollnumber>R002</rollnumber>
<rollnumberdir>Day_00/R002</rollnumberdir>
<shot>
<shotname>R002_C001</shotname>
<shotnamedir>Day_00/R002/R002_C001</shotnamedir>
<clip>
<clipname>A003_C001_0216XI_001.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/A003/A003_0216XO.RDM/A003_C001_0216XI.RDC/A003_C001_0216XI_001.R3D</pathurl>
</clip>
<clip>
<clipname>B002_C001_02169H_002.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/B002/B002_0216G4.RDM/B002_C001_02169H.RDC/B002_C001_02169H_002.R3D</pathurl>
</clip>
<clip>
<clipname>C002_C001_02168R_001.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/C002/C002_0216RL.RDM/C002_C001_02168R.RDC/C002_C001_02168R_001.R3D</pathurl>
</clip>
</shot>
<shot>
<shotname>R002_C002</shotname>
<shotnamedir>Day_00/R002/R002_C002</shotnamedir>
<clip>
<clipname>C002_C002_0216M9_001.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/C002/C002_0216RL.RDM/C002_C002_0216M9.RDC/C002_C002_0216M9_001.R3D</pathurl>
</clip>
<clip>
<clipname>C002_C002_0216M9_002.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/C002/C002_0216RL.RDM/C002_C002_0216M9.RDC/C002_C002_0216M9_002.R3D</pathurl>
</clip>
<clip>
<clipname>C002_C002_0216M9_003.R3D</clipname>
<pathurl>file://localhost/Volumes/REDLAB_3A/SESQUI/MASTER_FILES/DAY_00/RED/R002/C002/C002_0216RL.RDM/C002_C002_0216M9.RDC/C002_C002_0216M9_003.R3D</pathurl>
</clip>
</shot>
</roll>
</root>
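Once the structure is this flat, reading it back is short. The sketch below uses the standard-library ElementTree on a shortened copy of the transformed XML (the pathurl is abbreviated for the example) and turns each file://localhost URL into a plain path; the helper name url_to_path and the jobs list are hypothetical, and the two-way import is only there because Nuke 9 runs Python 2.

```python
import xml.etree.ElementTree as ET
try:
    from urllib.parse import urlparse, unquote   # Python 3
except ImportError:
    from urlparse import urlparse                # Python 2 (e.g. Nuke 9)
    from urllib import unquote

# Shortened stand-in for the transformed XML
SIMPLE = """
<root>
  <daynumber>Day_00</daynumber>
  <roll>
    <rollnumber>R001</rollnumber>
    <rollnumberdir>Day_00/R001</rollnumberdir>
    <shot>
      <shotname>R001_C001</shotname>
      <shotnamedir>Day_00/R001/R001_C001</shotnamedir>
      <clip>
        <clipname>A002_C001_0216MW_001.R3D</clipname>
        <pathurl>file://localhost/Volumes/REDLAB_3A/A002_C001_0216MW_001.R3D</pathurl>
      </clip>
    </shot>
  </roll>
</root>
"""

def url_to_path(pathurl):
    """Turn a file://localhost/... URL into a plain filesystem path."""
    return unquote(urlparse(pathurl).path)

root = ET.fromstring(SIMPLE)
jobs = []
for shot in root.iter('shot'):
    clips = shot.findall('clip')
    jobs.append({
        'dir': shot.findtext('shotnamedir'),
        'files': [url_to_path(c.findtext('pathurl')) for c in clips],
        # <in> is only copied when the source clip had one, so None is possible
        'in_points': [c.findtext('in') for c in clips],
    })

print(jobs[0]['dir'])
print(jobs[0]['files'][0])
```

Each entry in jobs then carries what one .nk comp needs: the relative folder to create and the ordered list of camera files.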

Related

Parse deeply nested XML to pandas dataframe

I'm trying to fetch particular parts of an XML file and move them into a pandas dataframe. Following some tutorials on xml.etree, I'm still stuck at getting the output. So far, I've managed to find the child nodes, but I can't access them (i.e. can't get the actual data out of them). So, here is what I've got so far.
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
# find all nodes within the XML data
tree.findall(".//*")
# access the node
tree.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
What I want is to get the data from the node programDescriptions and specifically the child programDescriptionText xml:lang="nl", and of course a couple extra. But first focus on this one.
Some data to work with:
<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://someUrl.nl/schema/enterprise/program http://someUrl.nl/schema/enterprise/program.xsd">
<customizableOnRequest>true</customizableOnRequest>
<editor>webmaster@url</editor>
<expires>2019-04-21</expires>
<format>Edu-dex 1.0</format>
<generator>www.Url.com</generator>
<includeInCatalog>Catalogs</includeInCatalog>
<inPublication>true</inPublication>
<lastEdited>2019-04-12T20:03:09Z</lastEdited>
<programAdmission>
<applicationOpen>true</applicationOpen>
<applicationType>individual</applicationType>
<maxNumberOfParticipants>12</maxNumberOfParticipants>
<minNumberOfParticipants>8</minNumberOfParticipants>
<paymentDue>up-front</paymentDue>
<requiredLevel>academic bachelor</requiredLevel>
<startDateDetermination>fixed starting date</startDateDetermination>
</programAdmission>
<programCurriculum>
<instructionMode>training</instructionMode>
<teacher>
<id>{D83FFC12-0863-44A6-BDBB-ED618627F09D}</id>
<name>SomeName</name>
<summary xml:lang="nl">
Long text of the summary. Not needed.
</summary>
</teacher>
<studyLoad period="hour">26</studyLoad>
</programCurriculum>
<programDescriptions>
<programName xml:lang="nl">Program Course Name</programName>
<programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
<programSummaryHtml xml:lang="nl">short Program Course Name summary in HTML format</programSummaryHtml>
<programDescriptionText xml:lang="nl">This part is needed from the XML.
Big program description text. This part is needed to parse from the XML file.
</programDescriptionText>
<programDescriptionHtml xml:lang="nl">Not needed;
Not needed as well;
</programDescriptionHtml>
<subjectText>
<subject>curriculum</subject>
<header1 xml:lang="nl">Beschrijving</header1>
<descriptionHtml xml:lang="nl">Yet another HTML desscription;
Not necessarily needed;</descriptionHtml>
</subjectText>
<searchword xml:lang="nl">search word</searchword>
<webLink xml:lang="nl">website-url</webLink>
</programDescriptions>
<programSchedule>
<programRun>
<id>PR-019514</id>
<status>application opened</status>
<startDate isFinal="true">2019-06-26</startDate>
<endDate isFinal="true">2020-02-11</endDate>
</programRun>
</programSchedule>
</program>
</programs>

Try the code below: (55703748.xml contains the xml you have posted)
import xml.etree.ElementTree as ET
tree = ET.parse('55703748.xml')
root = tree.getroot()
nodes = root.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
for node in nodes:
    print(node.text)
Output
short Program Course Name summary
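The same idea can be pointed at programDescriptionText and filtered on xml:lang. A hedged sketch, using a namespace map instead of spelling out the URI in every path; the prefix p, the rows list, and the trimmed inline XML are choices made for this example only:

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this example; the URI is the default namespace of the document
NS = {'p': 'http://someUrl.nl/schema/enterprise/program'}
# The xml: prefix is bound to this fixed namespace by the XML specification
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

SAMPLE = """
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
  <program>
    <programDescriptions>
      <programName xml:lang="nl">Program Course Name</programName>
      <programDescriptionText xml:lang="nl">This part is needed from the XML.</programDescriptionText>
    </programDescriptions>
  </program>
</programs>
"""

root = ET.fromstring(SAMPLE)
rows = []
for program in root.findall('p:program', NS):
    for text in program.findall('p:programDescriptions/p:programDescriptionText', NS):
        # Keep only the Dutch description
        if text.get(XML_LANG) == 'nl':
            rows.append({
                'programName': program.findtext(
                    'p:programDescriptions/p:programName', namespaces=NS),
                'programDescriptionText': (text.text or '').strip(),
            })

print(rows)
```

A list of dictionaries like rows can be passed straight to pandas.DataFrame(rows) to get one row per program.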

How to edit pmml model file using xml parser

I want to remove some nodes from a pmml file that I generated. So I tried to use xml parser in python:
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse('treedemo.pmml')
for inter in tree.findall('DataDictionary'):
    print(inter)
It turns out that nothing is printed, which suggests the search did not match anything. The pmml file is here. Suppose I want to delete
<Interval closure="closedClosed" leftMargin="21.0" rightMargin="46.0"/>
from
<DataField name="fk_057_nearcontact_auth_expire_time" optype="continuous" dataType="float">
<Interval closure="closedClosed" leftMargin="21.0" rightMargin="46.0"/>
</DataField>
Can a PMML file be edited and modified with Python?

Rather than developing custom XML manipulation code, you should learn about an existing technology called XSL Transformations (XSLT).
In brief, you need to create an XSL document, which specifies XML manipulation actions. You can then apply this XSL document to one or more XML documents (including PMML documents) using command-line XSLT tools. For example, on GNU/Linux systems you can use the xsltproc tool.
An XSL document for deleting Interval elements:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:pmml="http://www.dmg.org/PMML-4_2">
<!-- By default, copy all -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:copy>
</xsl:template>
<!-- However, in case of the PMML Interval element, take no (copy-) action -->
<xsl:template match="pmml:Interval"/>
</xsl:stylesheet>
Be sure to configure the value of the pmml namespace prefix to match that of your PMML documents. The above example applies to PMML schema version 4.2 documents.
Then, apply the stylesheet to PMML files (command syntax xsltproc <XSL file> <PMML file(s)>):
$ xsltproc --output test-mod.pmml test.xsl test.pmml
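If you would rather stay in Python, the empty result in the question is most likely a namespace issue: PMML elements live in a default namespace, so ElementTree only matches them by their qualified name. A minimal sketch, assuming a PMML 4.2 file (adjust PMML_NS to whatever xmlns your file declares) and using an inline document in place of treedemo.pmml:

```python
import xml.etree.ElementTree as ET

PMML_NS = 'http://www.dmg.org/PMML-4_2'  # assumption: must match the file's xmlns
ET.register_namespace('', PMML_NS)       # keep it as the default namespace on output

SAMPLE = """
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <DataDictionary>
    <DataField name="fk_057_nearcontact_auth_expire_time" optype="continuous" dataType="float">
      <Interval closure="closedClosed" leftMargin="21.0" rightMargin="46.0"/>
    </DataField>
  </DataDictionary>
</PMML>
"""

root = ET.fromstring(SAMPLE)
interval_tag = '{%s}Interval' % PMML_NS

# Elements do not know their parent, so visit every element and drop matching children.
# list(...) snapshots the tree first so removal does not disturb the iteration.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == interval_tag:
            parent.remove(child)

result = ET.tostring(root, encoding='unicode')  # Python 3; returns a str
print(result)
```

For a real file, ET.parse('treedemo.pmml') followed by tree.write(...) would replace the fromstring/tostring pair.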

In general, it's risky to edit a PMML file by hand, so be careful.
You can use BeautifulSoup.
For your specific goal, the 'Interval' tag appears only once, so you can find it in one step and then extract it:
# import BeautifulSoup
from bs4 import BeautifulSoup
# open and read the file
inf = open(r'treedemo.xml', 'r')
txt = inf.read()
inf.close()
# prepare the soup
soup = BeautifulSoup(txt, 'xml')
# now find the tag you want to remove, in this case it's easy, since the tag 'Interval' is unique across your pmml file:
interval = soup.find('Interval')
# remove the tag
interval.extract()
# write the updated pmml file
with open(r'treedemo_clean.xml', "w") as outf:
    outf.write(str(soup))
The output will have no indentation unless you use outf.write(str(soup.prettify())).
I do not recommend using prettify; it might mess up the PMML.
If the tag is not unique, you have to find it carefully to avoid deleting the wrong tag and breaking your PMML.
There is nothing wrong with the field you want to remove; it shows statistics from your training data set. You can set the flag with_statistics=False.

Python in XML: Getting text from grandchildren

I'm quite novice in programming but I believe the data I'm looking for is quite easy to get, however I can't seem to wrap my head around it.
My XML has several parents and each have of course their children with siblings and siblings with children. I am trying to reach a specific grandchild where one of its siblings has a specific word in a certain tag.
The XML (actually a KML) looks like this:
<Folder>
<name> Run-1</name>
<Placemark>
<name> run 1</name>
<Snippet></Snippet>
<styleUrl>#flightline</styleUrl>
<LineString>
<extrude>0</extrude>
<altitudeMode>clampToGround</altitudeMode>
<coordinates>54.72664746,24.91070844,2008 54.76968330,24.91068150,2008
</coordinates>
</LineString>
</Placemark>
</Folder>
Each folder named Run-X can have an infinite number of placemarks.
I want the name of each folder and the coordinates in the placemark (there is only one) that contains <styleUrl>#flightline</styleUrl> ONLY.
That would build me a list of the run number and the 'flight line' coordinates.
Of course I am trying the python and w3 schools tutorials and I understand the basics but I can't seem to put it all together. Do I need a for loop to reach each child and a nested loop to reach every sub-child? Or can I just look for tags throughout the tree and get the coordinates value IF there is a <styleUrl>#flightline</styleUrl> tag?
I have been playing around with root.iter and root.findall but I can't seem to get any kind of result.

How about following? Assuming your kml data resides in data.xml
from collections import OrderedDict
from xml.etree import ElementTree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
result = OrderedDict()
for folder in root.iter('Folder'):
    for placemark in folder.findall('Placemark'):
        if placemark.find('styleUrl').text == '#flightline':
            result[folder.find('name').text.strip()] = placemark.find('LineString/coordinates').text.strip()
print(result)

Thanks so much for your help. I found a solution based on your code:
for folder in root.iter('Folder'):
    for placemark in folder.findall('Placemark'):
        if placemark.find('styleUrl').text == '#flightline':
            runLine = folder.find('name').text[5:]
            startLat = placemark.find('LineString/coordinates').text[:11]
            startLong = placemark.find('LineString/coordinates').text[12:23]
            endLat = placemark.find('LineString/coordinates').text[29:40]
            endLong = placemark.find('LineString/coordinates').text[41:52]
            print ('Flightline: ' + runLine + ', coordinates start: ' + startLat + ' ' + startLong + '. Coordinates end: ' + endLat + ' ' + endLong + '.')
In case you are wondering, I'm trying to read files outputted by an aerial survey program (flightlines are lines flown to take pictures) and create a csv and flight plan file for the GPS in the aircraft to read so it can fly them automatically.
Now I need to find a way to remove the <kml> </kml> tags from the initial .kml file (on whatever line they might be) and only then open and parse it, output the line number and coordinates (with a custom name) for each flightline to a CSV, and also output another flight plan file in a Garmin-specific format. At least now I know how to scan the file. Thanks again, Sir!
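One caveat about the fixed slice offsets used above: they only work while every number has exactly the same width. A sketch that splits the coordinates text on whitespace and commas instead (the function name parse_coordinates is made up for this example):

```python
def parse_coordinates(text):
    """Split a KML <coordinates> string into tuples of floats, one per point."""
    points = []
    for chunk in text.split():        # points are separated by whitespace
        parts = chunk.split(',')      # fields within a point are separated by commas
        points.append(tuple(float(p) for p in parts))
    return points

coords = "54.72664746,24.91070844,2008 54.76968330,24.91068150,2008\n"
start, end = parse_coordinates(coords)
print(start[0], start[1], end[0], end[1])
```

KML documents the field order as longitude, latitude, altitude, so it may be worth double-checking which of the first two values ends up labelled as latitude in the output.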

Updating an Existing XML Document in Python

I have a large XML file whose structure is approximately as follows:
<GROUNDTRUTH>
<thing fileName="1" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="2" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="3" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
</GROUNDTRUTH>
I don't think I was clear enough in the original posting of this question. I have an xml document called GROUNDTRUTH, and inside of that I have several thousand "things". I want to search through all of the things in the document via filename and then change an attribute. So if I was searching for fileName="2", I would change its attribute to attrib=x. And for some thing, perhaps I'd go down to the sub level and change moreStuff.
My plan is to store into a csv file the names of the 'things' I need to change, and what I want to change the value of 'attrib' to. What function or module will provide this kind of functionality? Or am I just missing an easy/obvious approach? Ultimately I'd like to have a working script that will take a csv file with the thing identifier, and value to be updated, and take the xml file to make those changes onto.
Thanks for your help and suggestions!

First, you can transform the original XML file into an output XML file using an XSLT stylesheet, which can modify XML in almost any way: changing, restructuring, or reordering attributes, elements, and so on. Note that XSLT is a declarative, special-purpose language for transforming and rendering XML documents.
Then, you can use Python's lxml library to run the transformation:
#!/usr/bin/python
import lxml.etree as ET
dom = ET.parse('originalfile.xml')
xslt = ET.parse('transformfile.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
xmlfile = open('finalfile.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
By the way, PHP, Java, C, VB, or pretty much any language, even your everyday browser, can run transformations! To have the browser run it, simply add the stylesheet processing instruction at the top of the XML file:
<?xml-stylesheet type="text/xsl" href="transformfile.xsl"?>
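For the CSV-driven update described in the question, plain ElementTree plus the csv module may already be enough. This is a sketch, not the approach from the answer above; the CSV column names (fileName, attrib) and the inline data are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_TEXT = """
<GROUNDTRUTH>
  <thing fileName="1" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
  <thing fileName="2" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
  <thing fileName="3" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
</GROUNDTRUTH>
"""
# Hypothetical CSV layout: which thing to change, and the new value of attrib
CSV_TEXT = "fileName,attrib\n2,x\n3,7\n"

root = ET.fromstring(XML_TEXT)
# Index every <thing> by fileName once, so each CSV row is a dictionary lookup
things = {t.get('fileName'): t for t in root.iter('thing')}

for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    thing = things.get(row['fileName'])
    if thing is not None:
        thing.set('attrib', row['attrib'])
        # A nested attribute could be changed the same way, e.g.
        # thing.find('SUBSUB').set('moreStuff', ...)

print(things['2'].get('attrib'))
```

With real files, ET.parse(path) and tree.write(output_path) would replace the inline strings, and the CSV would be opened from disk.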

Find and Replace tags in XML using Python

I have asked a similar question before, but this one is slightly different. I want to find and replace XML tags using Python. I am using the XML files as metadata for some GIS shapefiles. In the metadata editor, I have options to choose dates for when certain data was collected. The options are 'single date', 'multiple dates' and 'range of dates'. In the first XML, which contains tags for a range of dates, you will see the tag 'rngdates' with the subelements 'begdate', 'begtime', 'enddate' and 'endtime'. I want to edit these tags so that the result looks like the second XML, which contains multiple single dates. The new tags are 'mdattim', 'sngdate' and 'caldate'. I hope this is clear enough, but please ask for more info if needed. XML is a weird beast, and I still don't fully understand it.
Thanks,
Mike
First XML:
<idinfo>
<citation>
<citeinfo>
<origin>My Company Name</origin>
<pubdate>05/04/2009</pubdate>
<title>Feature Class Name</title>
<edition>0</edition>
<geoform>vector digital data</geoform>
<onlink>.</onlink>
</citeinfo>
</citation>
<descript>
<abstract>This dataset represents the GPS location of inspection points collected in the field for the Site Name</abstract>
<purpose>This dataset was created to accompany the clients Assessment Plan. This point feature class represents the location within the area that the field crews collected related data.</purpose>
</descript>
<timeperd>
<timeinfo>
<rngdates>
<begdate>7/13/2010</begdate>
<begtime>unknown</begtime>
<enddate>7/15/2010</enddate>
<endtime>unknown</endtime>
</rngdates>
</timeinfo>
<current>ground condition</current>
</timeperd>
Second XML:
<idinfo>
<citation>
<citeinfo>
<origin>My Company Name</origin>
<pubdate>03/07/2011</pubdate>
<title>Feature Class Name</title>
<edition>0</edition>
<geoform>vector digital data</geoform>
<onlink>.</onlink>
</citeinfo>
</citation>
<descript>
<abstract>This dataset represents the GPS location of inspection points collected in the field for the Site Name</abstract>
<purpose>This dataset was created to accompany the clients Assessment Plan. This point feature class represents the location within the area that the field crews collected related data.</purpose>
</descript>
<timeperd>
<timeinfo>
<mdattim>
<sngdate>
<caldate>08-24-2009</caldate>
<time>unknown</time>
</sngdate>
<sngdate>
<caldate>08-26-2009</caldate>
</sngdate>
<sngdate>
<caldate>08-26-2009</caldate>
</sngdate>
<sngdate>
<caldate>07-07-2010</caldate>
</sngdate>
</mdattim>
</timeinfo>
This is my Python code so far:
import glob
import os
from xml.etree.ElementTree import ElementTree

folderPath = r"Z:\ESRI\Figure_Sourcing\Figures\Metadata\IOR_Run_Metadata_2009"
for filename in glob.glob(os.path.join(folderPath, "*.xml")):
    fullpath = os.path.join(folderPath, filename)
    if os.path.isfile(fullpath):
        basename, filename2 = os.path.split(fullpath)
        root = ElementTree(file=r"Z:\ESRI\Figure_Sourcing\Figures\Metadata\Run_Metadata_2009\\" + filename2)
        # Iterate
        for element in root.iter():
            print(element.tag)
            if element.tag == "begdate":
                element.tag.replace("begdate", "sngdate")
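One reason the loop above changes nothing: str.replace returns a new string and leaves element.tag untouched, so the result has to be assigned back. A small sketch on an inline fragment (the fragment is a cut-down stand-in for the metadata file):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<timeinfo><rngdates><begdate>7/13/2010</begdate>"
    "<enddate>7/15/2010</enddate></rngdates></timeinfo>"
)

for element in root.iter('begdate'):
    # Assigning to .tag is what actually renames the element
    element.tag = element.tag.replace('begdate', 'sngdate')

renamed = ET.tostring(root).decode('utf-8')
print(renamed)
```

Renaming alone does not restructure the tree, so building the mdattim/sngdate/caldate nesting still needs elements to be created or moved.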

I believe I succeeded in making the code work. This will allow you to edit certain tags if you need to change them from an existing XML file. I needed to do this to create metadata for some GIS shapefiles in a batch processing script to change certain date values depending on if they were single dates, multiple dates or a range of dates.
This webpage helped a lot: http://lxml.de/tutorial.html
I have some more work to do, but this was the answer I was looking for from my original question :) I'm sure this can be used in many other applications.
import glob
import os
import xml.etree.ElementTree as ET

# Set workspace location for XML files
folderPath = r"Z:\ESRI\Figure_Sourcing\Figures\Metadata\IOR_Run_Metadata_2009"
# Loop through each file with an .xml extension (glob already returns full paths)
for fullpath in glob.glob(os.path.join(folderPath, "*.xml")):
    if os.path.isfile(fullpath):
        # Parse the XML file
        root = ET.ElementTree(file=fullpath)
        # Find the "timeinfo" tag
        tree = root.find(".//timeinfo")
        if tree is not None:
            # Clear all tags below the "timeinfo" tag
            tree.clear()
            # Append the new "mdattim" element
            mdattim = ET.SubElement(tree, "mdattim")
            # Create SubElements under the new parent tag
            child1 = ET.SubElement(mdattim, "sngdate")
            child2 = ET.SubElement(child1, "caldate")
            child3 = ET.SubElement(child1, "time")
            # Set text values for tags
            child2.text = "08-24-2009"
            child3.text = "unknown"
