Setting value for a node in XML document in Python

I have an XML document "abc.xml":
I need to write a function replace(name, newvalue) that replaces the value of the node with tag 'name' with the new value and writes the document back to disk. Is this possible in Python? How should I do this?

import xml.dom.minidom
filename = 'abc.xml'
doc = xml.dom.minidom.parse(filename)
print(doc.toxml())
c = doc.getElementsByTagName("c")
print(c[0].toxml())
# The text of <c> is stored in its first child, a text node
c[0].childNodes[0].nodeValue = 'zip'
print(doc.toxml())
def replace(tagname, newvalue):
    '''doc is global, first occurrence of tagname gets it!'''
    doc.getElementsByTagName(tagname)[0].childNodes[0].nodeValue = newvalue
replace('c', 'zit')
print(doc.toxml())
See minidom primer and API Reference.
# cat abc.xml
<root>
<a>
<c>zap</c>
</a>
<b>
</b>
</root>
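The minidom snippet above only changes the document in memory. Here is a minimal sketch of the write-back step the question asks for, assuming Python 3 and that the target element already contains a text node. The function name replace_and_save is made up for illustration.

```python
import xml.dom.minidom

def replace_and_save(filename, tagname, newvalue):
    """Set the text of the first <tagname> element and rewrite the file."""
    doc = xml.dom.minidom.parse(filename)
    matches = doc.getElementsByTagName(tagname)
    if not matches:
        raise ValueError("no <%s> element in %s" % (tagname, filename))
    # Assumption: the element already holds a text node, e.g. <c>zap</c>
    matches[0].firstChild.nodeValue = newvalue
    with open(filename, "w", encoding="utf-8") as f:
        # Document.writexml serializes the whole tree to the open file
        doc.writexml(f, encoding="utf-8")
```

Calling replace_and_save('abc.xml', 'c', 'zip') would then update the file in place.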

Sure it is possible.
The xml.etree.ElementTree module will help you with parsing XML, finding tags and replacing values.
If you know a little bit more about the XML file you want to change, you can probably make the task a bit easier than if you need to write a generic function that will handle any XML file.
If you are already familiar with DOM parsing, there's an xml.dom package you can use instead of ElementTree.
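As a rough sketch of the ElementTree route, assuming Python 3 and that only the first matching tag should change, something like this could work. The name replace_text and the extra filename parameter are my own additions, not part of the question.

```python
import xml.etree.ElementTree as ET

def replace_text(filename, tagname, newvalue):
    """Replace the text of the first <tagname> element and save the file."""
    tree = ET.parse(filename)
    # iter() walks the tree in document order, starting with the root
    elem = next(tree.iter(tagname), None)
    if elem is None:
        raise ValueError("no <%s> element in %s" % (tagname, filename))
    elem.text = newvalue
    # Overwrite the original file with the modified tree
    tree.write(filename, encoding="utf-8", xml_declaration=True)
```

Usage would be replace_text('abc.xml', 'c', 'zip'). If every occurrence should change, loop over tree.iter(tagname) instead of taking only the first one.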

Related

Choosing a specific XML node using lxml in Python

I'm from a VBScript background and new to lxml with Python.
In VBScript, to choose a specific node, I would simply do something like:
Set myNode = xmlDoc.selectSingleNode("/node1/node2/myNode").
What I have done with Python:
from lxml import etree
xmlDoc = etree.parse(fileName)
myNode =
Question: What should be written after "myNode =" to select that node?
Preferably without using XPath, and still using lxml.
You could use something like:
myNode = xmlDoc.find('node2/myNode')
The etree.parse function returns an ElementTree whose searches start at the root element (i.e. your node1), so you don't need to use an absolute path.
Example
content = '''
<root>
<div>
<p>content 1</p>
</div>
</root>
'''
from lxml import etree
xmlDoc = etree.fromstring(content)
paragraph_element = xmlDoc.find('div/p')
print(paragraph_element)
Output
<Element p at 0x9f54bc8>
Note:
For my example I have used the function etree.fromstring. This is purely for demonstration purposes, so you can see a workable example using a string. The function etree.parse should generate the same result when working with files rather than strings.
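To make the parse versus fromstring point concrete, here is a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in so it runs without lxml installed (an assumption on my part; lxml.etree offers the same fromstring, parse and find calls), and io.StringIO plays the role of the file.

```python
import io
import xml.etree.ElementTree as ET  # stand-in; lxml.etree has the same calls

content = "<root><div><p>content 1</p></div></root>"

root = ET.fromstring(content)           # returns the root Element
tree = ET.parse(io.StringIO(content))   # returns an ElementTree wrapper

# Both searches are relative to the root element
from_string = root.find("div/p")
from_file = tree.find("div/p")
print(from_string.text, from_file.text)
```

Either way the relative path 'div/p' finds the same paragraph element.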
Aside: Why not use XPath? It is extremely powerful!

In Python, Parsing Custom XML Tags Without Parsing HTML

I'm new to Python 2.7, and I'm trying to parse an XML file that contains HTML. I want to parse the custom XML tags without parsing any HTML content whatsoever. What's the best way to do this? (If it's helpful, my list of custom XML tags is small, so if there's an XML parser that has an option to only parse specified tags that would probably work fine.)
E.g. I have an XML file that looks like
<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>
I'd like to be able to parse apart everything except the HTML, and in particular to extract the value of myTag2 as un-parsed HTML.
EDIT: Here's more information to answer a question below. I had previously tried using ElementTree. This is what happened:
root = ET.fromstring(xmlstring)
root.tag # returns 'myTag1'
root[0].tag # returns 'myTag2'
root[0].text # returns None, but I want it to return the HTML string
The HTML string I want has been parsed and is stored as a tag and text:
root[0][0].tag # returns 'p', but I don't even want root[0][0] to exist
root[0][0].text # returns 'My ... day.'
But really I'd like to be able to do something like this...
root[0].unparsedtext # returns '<p>My ... day.</p>'
SOLUTION:
har07's answer works great. I modified that code slightly to account for an edge case. Here's what I'm implementing:
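def _getInner(element):
    # element.text is None when the element starts with a child tag
    if element.text is None:
        textStr = ''
    else:
        textStr = element.text
    # ET.tostring(e) includes the tail text that follows each child
    return textStr + ''.join(ET.tostring(e) for e in element)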
Then if
element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')
the original code will only return the text starting with the first XML-formatted tag, but the modified version will capture the desired text:
''.join(ET.tostring(e) for e in element) # returns '<b>gratuitous</b> with tags'
_getInner(element) # returns 'Let us be <b>gratuitous</b> with tags'
I don't think there is an easy way to modify an XML parser's behavior to ignore some predefined tags. A much easier way is to let the parser parse the XML normally, then write a function that returns the unparsed content of an element, for example:
import xml.etree.ElementTree as ET
def getUnparsedContent(element):
    return ''.join(ET.tostring(e) for e in element)
xmlstring = """<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>"""
root = ET.fromstring(xmlstring)
print(getUnparsedContent(root[0]))
output :
<p>My what a lovely day.</p>
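One caveat if you run this on Python 3 rather than 2.7: ET.tostring returns bytes by default there, so joining the pieces with a str fails. A hedged variant, assuming Python 3, passes encoding="unicode" to get text back. The name get_inner_xml is mine.

```python
import xml.etree.ElementTree as ET

def get_inner_xml(element):
    """Return the element's content as markup, without its own tags."""
    parts = [element.text or ""]  # leading text, if any
    for child in element:
        # encoding="unicode" makes tostring return str instead of bytes;
        # the serialized child also carries its tail text
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')
print(get_inner_xml(element))
```

This keeps the leading text as well as the nested tags, in the same spirit as the modified function in the question.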
You should be able to implement this through the built-in minidom xml parser.
from xml.dom import minidom
xmldoc = minidom.parse("document.xml")
rootNode = xmldoc.firstChild
firstNode = rootNode.childNodes[0]
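Note that with the line breaks in your example, childNodes[0] is actually the whitespace text node before <myTag2>; the HTML sits in the element children further down. Calling toxml() on that <p> element gives: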
<p>My what a lovely day.</p>
Note that minidom (and probably any other xml-parsing library you might use) won't recognize HTML by default. This is by design, because XML documents do not have predefined tags.
You could then use a series of if or try statements to determine whether you have reached an HTML-formatted node while extracting data:
for rowNode in rootNode.childNodes:
    # only element nodes have a tagName; whitespace text nodes are skipped
    if rowNode.nodeType == rowNode.ELEMENT_NODE and rowNode.tagName == "p":
        # this is an html-formatted node: extract the value and continue
        htmlString = rowNode.toxml()
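For what it's worth, here is a self-contained sketch of the minidom idea applied to the XML from the question, assuming the goal is the raw markup inside myTag2. It parses from a string with parseString rather than from document.xml, and the variable names are illustrative.

```python
from xml.dom import minidom

xmlstring = """<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>"""

doc = minidom.parseString(xmlstring)
tag2 = doc.getElementsByTagName("myTag2")[0]

# Serialize every child (text nodes and elements alike), then trim the
# whitespace-only text that surrounds the <p> element
inner = "".join(child.toxml() for child in tag2.childNodes).strip()
print(inner)
```

The join over childNodes avoids guessing which index the HTML ends up at, since whitespace between tags becomes text nodes of its own.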

Parse xml from file using etree works when reading string, but not a file

I am a relative newbie to Python and SO. I have an XML file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:
from xml.etree import ElementTree as etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
from xml.etree import ElementTree as etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
Now my result is "None None". I have a feeling I'm either not getting the file in right or something is wrong with the output. Here are the contents of test3.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
node.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
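A small sketch that puts the three lookups side by side, using the standard library. The XML here is a cut-down stand-in for test3.xml, and the identifier value abc123 is a placeholder rather than real data.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for test3.xml; "abc123" is a made-up placeholder
xml_text = (
    '<response xmlns="http://www.eol.org/transfer/content/0.3">'
    '<identifier>abc123</identifier>'
    '<description>This is a paragraph about a shark.</description>'
    '</response>'
)
root = ET.fromstring(xml_text)
EOL = 'http://www.eol.org/transfer/content/0.3'

# 1. Unqualified name: no match, because the element lives in a namespace
plain = root.findtext('identifier')

# 2. Fully qualified name in {uri}tag form
qualified = root.findtext('{%s}identifier' % EOL)

# 3. Prefix of your choosing, resolved through the namespaces mapping
prefixed = root.findtext('eol:identifier', namespaces={'eol': EOL})

print(plain, qualified, prefixed)
```

The first lookup comes back as None, which is the "None None" symptom from the question; the other two find the element.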
Have you thought of trying BeautifulSoup to parse your XML with Python?
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online group, so support is quite good.

Using Python ElementTree to Extract Text in XML Tag

I have a corpus with tens of thousands of small XML files, and I'm trying to use Python to extract the text contained in one of the XML tags, for example, everything between the body tags for something like:
<body> sample text here with <bold> nested </bold> tags in this paragraph </body>
and then write a text document that contains this string, and move on down the list of XML files.
I'm using effbot's ElementTree but couldn't find the right commands/syntax to do this. I found a website that uses miniDOM's dom.getElementsByTagName but I'm not sure what the corresponding method is for ElementTree. Any ideas would be greatly appreciated.
A better answer, showing how to actually use XML parsing to do this:
import xml.etree.ElementTree as ET
stringofxml = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"
def extractTextFromElement(elementName, stringofxml):
    tree = ET.fromstring(stringofxml)
    # only the direct children of the root element are checked
    for child in tree:
        if child.tag == elementName:
            return child.text.strip()
print(extractTextFromElement('bold', stringofxml))
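Since the question asks for everything between the body tags, nested tags included, a sketch with Element.itertext() may be closer to what is wanted. This assumes Python 3 and that collapsing runs of whitespace is acceptable; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

stringofxml = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"
body = ET.fromstring(stringofxml)

# itertext() yields the element's text, each descendant's text and the
# tail text after each descendant, in document order
raw_text = "".join(body.itertext())

# Collapse the whitespace left behind where the tags were removed
clean_text = " ".join(raw_text.split())
print(clean_text)
```

For a whole corpus, the same few lines could run inside a loop over the file names, using ET.parse(path).getroot() in place of ET.fromstring.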
I would just use re:
import re
body_txt = re.match('<body>(.*)</body>',body_txt).groups()[0]
then to remove the inner tags:
body_txt = re.sub('<.*?>','',body_txt)
You shouldn't use regular expressions when they are not needed, it's true, but there's nothing wrong with using them when they are.

How to retrieve attributes of xml tag in Python?

I'm looking for a way to add attributes to XML tags in Python, or to create a new tag with new attributes.
For example, I have the following XML file:
<types name='character' shortName='chrs'>
....
...
</types>
and I want to add an attribute to make it look like this:
<types name='character' shortName='chrs' fullName='MayaCharacters'>
....
...
</types>
How do I do that with Python? By the way, I use Python and minidom for this.
Please help. Thanks in advance.
You can use the attributes property of the respective Node object.
For example:
from xml.dom.minidom import parseString
documentNode = parseString("<types name='character' shortName='chrs'></types>")
typesNode = documentNode.firstChild
# Getting an attribute
print(typesNode.attributes["name"].value) # will print "character"
# Setting an attribute
typesNode.attributes["mynewattribute"] = u"mynewvalue"
print(documentNode.toprettyxml())
The last print statement will output this XML document:
<?xml version="1.0" ?>
<types mynewattribute="mynewvalue" name="character" shortName="chrs"/>
I learned XML parsing in Python from this great article. It has an attributes section that can't be linked to directly, but search for "Attributes" on that page and you'll find the information you need.
But in short (snippet stolen from said page):
>>> building_element.setAttributeNS("http://www.boddie.org.uk/paul/business", "business:name", "Ivory Tower")
>>> building_element.getAttributeNS("http://www.boddie.org.uk/paul/business", "name")
'Ivory Tower'
You probably want to skip the handling of namespaces, to make the code cleaner, unless you need them.
It would seem that you just call setAttribute for the parsed dom objects.
http://developer.taboca.com/cases/en/creating_a_new_xml_document_with_python/
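A minimal sketch of the setAttribute route, applied to the element from the question and parsed from a string for brevity. The attribute name and value come from the question; the variable names are mine.

```python
from xml.dom.minidom import parseString

doc = parseString("<types name='character' shortName='chrs'></types>")
types_node = doc.documentElement  # the <types> element

# setAttribute creates the attribute, or overwrites it if it already exists
types_node.setAttribute("fullName", "MayaCharacters")

# getAttribute reads it back; it returns "" for a missing attribute
print(types_node.getAttribute("fullName"))
print(doc.toxml())
```

To save the result, the document can be written back out with doc.writexml on an open file.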
