Trying to extract xml element using python 2.7 - python

I am trying to extract the name elements under the sequence in xml files. I have pasted in the top of a sample xml to illustrate. With this I want to get the text from 01 Interview_been successful through mentorship and write it to a file. There are multiple sequence tags in the xml and I am trying to figure out how to go through it and extract it. I have tried to figure out how to use xml.etree and xml.dom.minidom but I can't seem to wrap my brain around it. I was able to get all of the id values from the sequence tags but not the name elements. I'm pasting in my code before the xml.
from xml.etree import ElementTree
file = open("xmldump.txt", "r")
filedata = file.read()
file.close()
with open('test.xml', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.iter('name'):
    sequenceid = node.attrib.get('name')
    print ' %s' % (sequenceid)
    newLine = sequenceid + "\n"
    file = open("xmldump.txt", "w")
    file.write(newLine)
    file.close()
Here is the XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
  <bin>
    <uuid>0F5D72FA-54E4-4DE8-81D7-CC33F5C43836</uuid>
    <updatebehavior>add</updatebehavior>
    <name>Logged</name>
    <children>
      <sequence id="01 Interview_been successful through mentorship">
        <uuid>12FB944D-83EA-4527-9A54-2130A42E3A06</uuid>
        <updatebehavior>add</updatebehavior>
        <name>01 Interview_been successful through mentorship</name>
        <duration>1195</duration>
        <rate>
          <ntsc>TRUE</ntsc>
          <timebase>24</timebase>
        </rate>
        <timecode>

Well, I'm not sure whether you want the "id" attribute or the name tag (your code is confusing: it tries to read a "name" attribute, but the "sequence" tag only has an "id" attribute). Below is code that extracts both, which should help you get started on figuring out how ElementTree works:
from xml.etree import ElementTree
with open('test.xml', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.iter('sequence'):
    sequenceid = node.attrib.get('id')
    name = node.findtext('name')
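For completeness, here is a minimal sketch of the whole task: collect the name text of every sequence and write one name per line. The sample document, the ids in it, the helper name sequence_names, and the temporary output location are all assumptions made for illustration, not part of the original files.

```python
import os
import tempfile
from xml.etree import ElementTree

# Hypothetical sample that mirrors the structure shown in the question.
SAMPLE = """<xmeml version="5">
  <bin>
    <name>Logged</name>
    <children>
      <sequence id="seq-1"><name>First sequence</name></sequence>
      <sequence id="seq-2"><name>Second sequence</name></sequence>
    </children>
  </bin>
</xmeml>"""


def sequence_names(root):
    """Return the text of the <name> child of every <sequence> element."""
    return [seq.findtext('name') for seq in root.iter('sequence')]


root = ElementTree.fromstring(SAMPLE)
names = sequence_names(root)

# Open the output file once, outside the loop. Reopening it with "w"
# for every name would truncate it each time and keep only the last line.
out_path = os.path.join(tempfile.mkdtemp(), 'xmldump.txt')
with open(out_path, 'w') as out:
    for name in names:
        out.write(name + '\n')
```

Because findtext('name') only looks at direct children of each sequence, the name of the enclosing bin ("Logged" in the sample) is not picked up.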

Related

Parse XML and Re-write the Filename Using an XML Element

I am trying to parse an XML and re-name the original XML using one of its child elements, specifically as a prefix for the filename of an XML to be overwritten. In the sample XML below, I want to extract the "to" element and insert its name "Tove" into a newly written XML filename. If the original file was named "reminder.xml", could the name "Tove" be parsed and inserted into a newly written file called "Tove_reminder.xml"? Is this possible with XMLs?
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
It seems that Python has more flexibility extracting text and strings in other file formats, but I cannot find much that pertains to XML. Any help is most appreciated!
You can use beautifulsoup4 to extract the attributes and inner text of an XML document.
First, install beautifulsoup4:
pip install beautifulsoup4
Then, assuming the text you wrote in your question is loaded in a variable named xml_text, you can do the following
from bs4 import BeautifulSoup
file_name = "reminder.xml"
xml_file = open(file_name, 'r')
xml_text = xml_file.read()
xml_file.close()
soup = BeautifulSoup(xml_text, "html.parser")
To extract a text from a tag, you can then use
to = soup.find("to")
name = to.text #contains Tove now
Finally, you can use the "name" variable to save the file
file_name = name + "_" + file_name
xml_file = open(file_name, "w")
xml_file.write(xml_text)
xml_file.close()
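If installing a third-party package is not an option, the same idea can probably be done with the standard library's ElementTree. This is only a sketch: the helper name prefixed_copy and the temporary working directory are assumptions for illustration, and it copies the file byte for byte rather than re-serialising it.

```python
import os
import shutil
import tempfile
from xml.etree import ElementTree

SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""


def prefixed_copy(path, tag='to'):
    """Copy the XML file at `path` to '<text of tag>_<original file name>'."""
    prefix = ElementTree.parse(path).getroot().findtext(tag)
    directory, base = os.path.split(path)
    new_path = os.path.join(directory, prefix + '_' + base)
    shutil.copyfile(path, new_path)  # byte-for-byte copy, declaration included
    return new_path


# Demo in a throwaway directory so that no real file is touched.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, 'reminder.xml')
with open(original, 'wb') as f:
    f.write(SAMPLE)

new_path = prefixed_copy(original)
print(os.path.basename(new_path))
```

The original reminder.xml is left in place; use os.rename instead of shutil.copyfile if the file should be renamed rather than copied.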

How to determine what the root tag name is for a XML document

I was wondering how I would go about determining what the root tag of an XML document is using xml.dom.minidom.
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <child1></child1>
    <child2></child2>
    <child3></child3>
</root>
In the example XML above, my root tag could be 3 or 4 different things. All I want to do is pull the tag, and then use that value to get the elements by tag name.
import re
from xml.dom.minidom import parseString

def import_from_XML(self, file_name):
    file = open(file_name)
    document = file.read()
    if re.compile(r'^<\?xml').match(document):
        xml = parseString(document)
        root = '' # <-- THIS IS WHERE IM STUCK
        elements = xml.getElementsByTagName(root)
I tried searching through the documentation for xml.dom.minidom, but it is a little hard for me to wrap my head around, and I couldn't find anything that answered this question outright.
I'm using Python 3.6.x, and I would prefer to keep with the standard library if possible.
For the line you commented as Where I am stuck, the following should assign the value of the root tag of the XML document to the variable theNameOfTheRootElement:
theNameOfTheRootElement = xml.documentElement.tagName
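A small self-contained sketch of that lookup, using the sample document from the question (the variable names are only for illustration):

```python
from xml.dom.minidom import parseString

DOCUMENT = """<?xml version="1.0" encoding="UTF-8"?>
<root>
<child1></child1>
<child2></child2>
<child3></child3>
</root>"""

dom = parseString(DOCUMENT)
root_element = dom.documentElement   # the single top-level element
root_name = root_element.tagName     # its tag name as a string

# childNodes also contains the whitespace text nodes between the tags,
# so keep only the element nodes.
child_names = [
    node.tagName
    for node in root_element.childNodes
    if node.nodeType == node.ELEMENT_NODE
]
print(root_name, child_names)
```

Since documentElement already is the root element, you can work with it directly and may not need a second lookup by tag name at all.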
This is what I did the last time I processed XML. It is not the approach you used, but I hope it helps.
import urllib2  # Python 2 only; Python 3 uses urllib.request instead
from xml.etree import ElementTree as ET
req = urllib2.Request(site)  # site: URL of the XML document
file = None
try:
    file = urllib2.urlopen(req)
except urllib2.URLError as e:
    print e.reason
data = file.read()
file.close()
root = ET.fromstring(data)
print("root", root)
for child in root.findall('parent element'):
    print(child.text, child.attrib)

Find and replacing text in elementtree

I am very new to programming and Python. I am trying to find and replace text in an XML file. Here is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<!--Arbortext, Inc., 1988-2008, v.4002-->
<!DOCTYPE doc PUBLIC "-//MYCOMPANY//DTD XSEIF 1/FAD 110 05 R5//EN"
"XSEIF_R5.dtd">
<doc version="XSEIF R5"
xmlns="urn:x-mycompany:r2:reg-doc:1551-fad.110.05:en:*">
<meta-data></meta-data>
<front></front>
<body>
<chl1><title xml:id="id_881i">Installation</title>
<p>To install SDK, perform the tasks mentioned in the following
table.</p>
<p><input>ln -s /sim/<var>user_id</var>/.VirtualBox $home/.VirtualBox</input
></p>
</chl1>
</body>
</doc>
<?Pub *0000021917 0?>
I need to replace every occurrence of "VirtualBox" with "Xen". I tried ElementTree for this, but I don't know how to replace the text and write it back to the file. Here is my attempt:
import xml.etree.ElementTree as ET
tree=ET.parse('C:/My_location/1_1531-CRA 119 1364_2.xml')
doc=tree.getroot()
iterator=doc.getiterator()
for body in iterator:
    old_text=body.replace("Virtualbox", "Xen")
The text occurs in many sub-tags under body. I found a way to remove a subelement and append a new element, but not a way to replace only the text.
Replace the string in each element's text and tail attributes:
import lxml.etree as ET
with open('1.xml', 'rb+') as f:
    tree = ET.parse(f)
    root = tree.getroot()
    # iter() replaces the deprecated getiterator()
    for elem in root.iter():
        if elem.text:
            elem.text = elem.text.replace('VirtualBox', 'Xen')
        if elem.tail:
            elem.tail = elem.tail.replace('VirtualBox', 'Xen')
    # rewind, overwrite the file in place, and cut off any leftover bytes
    f.seek(0)
    f.write(ET.tostring(tree, encoding='UTF-8', xml_declaration=True))
    f.truncate()
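The same text/tail loop should also work with the standard library's xml.etree.ElementTree. Here is a minimal sketch under a few assumptions: Python 3, a shortened sample document without the namespace and DOCTYPE of the real file, and a hypothetical helper name replace_in_tree.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the document from the question.
SAMPLE = (
    "<doc><body>"
    "<p>Install VirtualBox first.</p>"
    "<p><input>ln -s /sim/<var>user_id</var>/.VirtualBox $home/.VirtualBox</input></p>"
    "</body></doc>"
)


def replace_in_tree(root, old, new):
    """Replace `old` with `new` in the text and tail of every element."""
    for elem in root.iter():
        if elem.text:
            elem.text = elem.text.replace(old, new)
        if elem.tail:
            # tail is the text that follows an element's closing tag,
            # e.g. "/.VirtualBox $home/.VirtualBox" after </var>
            elem.tail = elem.tail.replace(old, new)


root = ET.fromstring(SAMPLE)
replace_in_tree(root, 'VirtualBox', 'Xen')
result = ET.tostring(root, encoding='unicode')
print(result)
```

Checking tail as well as text matters here because the string after the var element belongs to that element's tail, not to the text of input.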
Probably the simplest way is to do:
ifile = open('input_file','r')
ofile = open('output_file','w')
for line in ifile.readlines():
    ofile.write(line.replace('VirtualBox','Xen'))
ifile.close()
ofile.close()

Modifying and rewriting XML file with Python ElementTree

I have a XML file that starts like this:
<?xml version="1.0" encoding="utf-8"?>
<Recipe xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
I need to read it in, modify it, then write it back out. Here is a code snippet:
from xml.etree import ElementTree
with open('base.xml', 'rt') as f:
    tree = ElementTree.parse(f)
recipe = tree.find('')
t = recipe.find('Targets_Params/Target_Table/Target_Name')
t.text = "new Value"
output_file = open('new.xml', 'w' )
output_file.write(ElementTree.tostring(recipe))
output_file.close()
My problem is that when I write the file out I do not get the first line at all, and the second line comes out with just:
<Recipe>
How can I read in the file, modify it, and write it out while preserving the original structure?
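One possible direction, offered as a sketch rather than a full answer: writing through the tree's own write() method with xml_declaration=True should bring back the first line. The sample document below is hypothetical (only the element path comes from the snippet above), and it is written to an in-memory buffer instead of new.xml so the example stays self-contained.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document; only the element path is taken from the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<Recipe xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Targets_Params><Target_Table><Target_Name>old</Target_Name></Target_Table></Targets_Params>
</Recipe>"""

tree = ET.parse(io.BytesIO(SAMPLE))
recipe = tree.getroot()
recipe.find('Targets_Params/Target_Table/Target_Name').text = 'new Value'

# Serialise the whole tree, asking for the XML declaration explicitly.
buffer = io.BytesIO()
tree.write(buffer, encoding='utf-8', xml_declaration=True)
output = buffer.getvalue()
print(output.decode('utf-8'))
```

This does not solve the second half of the problem: ElementTree only writes namespace declarations that some element or attribute actually uses, so the unused xmlns:xsi and xmlns:xsd on Recipe are still dropped. As far as I know, lxml keeps them, which may make it the easier choice if they must survive.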

How to add an element to xml file by using elementtree

I have an XML file, and I'm trying to add an additional element to it.
The XML has the following structure:
<root>
    <OldNode/>
</root>
What I'm looking for is:
<root>
    <OldNode/>
    <NewNode/>
</root>
but what I actually get is:
<root>
    <OldNode/>
</root>
<root>
    <OldNode/>
    <NewNode/>
</root>
My code looks like this:
file = open("/tmp/" + executionID +".xml", 'a')
xmlRoot = xml.parse("/tmp/" + executionID +".xml").getroot()
child = xml.Element("NewNode")
xmlRoot.append(child)
xml.ElementTree(root).write(file)
file.close()
Thanks.
You opened the file for appending, which adds data to the end. Open the file for writing instead, using the w mode. Better still, just use the .write() method on the ElementTree object:
tree = xml.parse("/tmp/" + executionID +".xml")
xmlRoot = tree.getroot()
child = xml.Element("NewNode")
xmlRoot.append(child)
tree.write("/tmp/" + executionID +".xml")
Using the .write() method has the added advantage that you can set the encoding, force the XML prolog to be written if you need it, etc.
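A runnable sketch of that approach, assuming the module is imported as ET and using a temporary file in place of the /tmp path from the question:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Create a small input file to work on (hypothetical stand-in for the
# "/tmp/" + executionID + ".xml" file in the question).
path = os.path.join(tempfile.mkdtemp(), 'example.xml')
with open(path, 'w') as f:
    f.write('<root><OldNode/></root>')

tree = ET.parse(path)
tree.getroot().append(ET.Element('NewNode'))

# write() replaces the file contents; the optional arguments set the
# encoding and force the XML declaration to be written.
tree.write(path, encoding='utf-8', xml_declaration=True)

# Read the file back to confirm there is one root with both children.
child_tags = [child.tag for child in ET.parse(path).getroot()]
print(child_tags)
```

Reading the file back parses cleanly, which it would not if a second root element had been appended to it.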
If you must use an open file to prettify the XML, use the 'w' mode; 'a' opens a file for appending, which leads to the behaviour you observed. Note that prettify expects an Element, so pass it the root rather than the tree:
with open("/tmp/" + executionID +".xml", 'w') as output:
    output.write(prettify(xmlRoot))
where prettify is something along the lines of:
from xml.etree import ElementTree
from xml.dom import minidom
def prettify(elem):
    """Return a pretty-printed XML string for the Element.
    """
    rough_string = ElementTree.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent=" ")
i.e. the minidom prettifying trick.
