How to set element's id in Python's xml.dom.minidom? - python

How do I do this? I created a document and an element:
import xml.dom.minidom as d
a=d.Document()
b=a.createElement('test')
setIdAttribute doesn't work :(
b.setIdAttribute('something')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/xml/dom/minidom.py", line 835, in setIdAttribute
self.setIdAttributeNode(idAttr)
File "/usr/lib/python2.6/xml/dom/minidom.py", line 843, in setIdAttributeNode
raise xml.dom.NotFoundErr()
xml.dom.NotFoundErr
And if I set this by hand, getElementById can't find it.
b.setAttribute('id', 'something')
a.getElementById('something')
What do I have to do?

Two things are wrong here.
Document.getElementById will only find elements that are actually in the document. Here you've created b but not actually added it to the document. (It's exactly the same in JavaScript.)
You have to mark id as an ID attribute using setIdAttribute. (There's no need to do this in JavaScript because in HTML documents, attributes named id are automatically considered to be ID attributes, logically enough. But XML does not automatically treat attributes named id as IDs; you can either explicitly declare that they are in your DTD or call setIdAttribute individually for every ID attribute. And I am not sure the DTD thing will work with minidom, which is not a full DOM implementation.)
Like so:
import xml.dom.minidom as d
a = d.Document()
b = a.createElement('test')
a.appendChild(b)
b.setAttribute('id', 'x')
b.setIdAttribute('id')
After that, getElementById works:
>>> a.getElementById('x')
<DOM Element: test at 0xb77712ec>
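To see both conditions from this answer side by side, here is a minimal sketch. The helper name `build_doc_with_id` is made up for illustration; it only uses the minidom calls already shown above, and it checks the lookup before and after the attribute is flagged as an ID.

```python
import xml.dom.minidom as d

def build_doc_with_id(id_value):
    """Hypothetical helper: return (lookup before flagging, lookup after)."""
    doc = d.Document()
    elem = doc.createElement('test')
    doc.appendChild(elem)                  # the element must be in the document
    elem.setAttribute('id', id_value)      # a plain attribute that happens to be named "id"
    before = doc.getElementById(id_value)  # the attribute is not an ID attribute yet
    elem.setIdAttribute('id')              # mark the existing attribute as an ID
    after = doc.getElementById(id_value)
    return before, after

before, after = build_doc_with_id('x')
print(before)         # None
print(after.tagName)  # test
```

Note that setIdAttribute takes the attribute's name ('id'), not its value, which is why the call in the question raised NotFoundErr.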

Adding the name of the id attribute to the DTD should help. For example, if you want the id attribute to be treated as an ID for all <div> elements, you can set up your DTD as follows:
<!DOCTYPE div [<!ATTLIST div id ID #IMPLIED>]>
This is a working example:
>>> from xml.dom.minidom import parse, parseString
>>> data='<!DOCTYPE div [<!ATTLIST div id ID #IMPLIED>]><div><div id="foo">FOO word</div><div id="bar">BAR word</div></div>'
>>> x=parseString(data)
>>> x.getElementById('foo')
<DOM Element: div at 0x1126440>
>>> x.getElementById('foo').toxml()
u'<div id="foo">FOO word</div>'

Related

Get parent attributes based on the text of particular tag - XML

I have a XML file which is as follows:
<customer>
<customerdetails id="80">
<account id="910">
<attributes>
<premium>true</premium>
<type>mpset</type>
</attributes>
</account>
<account id="911">
<attributes>
<premium>true</premium>
<type>spset</type>
</attributes>
</account>
</customerdetails>
</customer>
I need to parse the file and extract details from it, and for that I am using Python's lxml library.
With it I can read details from the XML file; for example, I can get the text of a particular tag:
from lxml import etree

# Method of the ConfigParserLxml class (the class definition is not shown in the question)
def read_a_customer_section(self):
    root = self.parser.getroot()
    customer_id = "80"
    account_id = "910"
    # XPath attribute tests are written with @, e.g. [@id='80']
    type_details = root.findtext("customerdetails[@id='" + customer_id + "']/account[@id='" + account_id + "']/attributes/type")
    print(type_details)

dd = ConfigParserLxml("dummy1.xml").read_a_customer_section()
This gives me the text of the particular tag, as expected.
But now I need to get the attributes of the parent tags based on that text.
For example, if I give the type "mpset" as input, I should get the
attributes of the enclosing "account" element, and also the attributes
of the "customerdetails" element.
Can someone help me with this?
I hope this is clear; if not, let me know and I will try to clarify.
In [3]: tree.xpath('//account[.//type="mpset"]/@id')
Out[3]: ['910']
OR:
In [4]: tree.xpath('//*[.//type="mpset"]/@id')
Out[4]: ['80', '910'] # the id attribute of every element that contains such a type element
// descendant-or-self of the root node
. the current node
.// descendant-or-self of the current node
.//type="mpset" true when a descendant type element of the current node has the string value mpset
@id selects the id attribute
* is a wildcard that matches any tag
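If lxml is not available, a similar lookup can be sketched with the standard library. This swaps in a different technique from the XPath answer above: xml.etree.ElementTree elements do not know their parent, so the sketch builds a child-to-parent map and walks upward. The function name `ancestors_with_attributes` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

XML = (
    '<customer><customerdetails id="80">'
    '<account id="910"><attributes><premium>true</premium><type>mpset</type></attributes></account>'
    '<account id="911"><attributes><premium>true</premium><type>spset</type></attributes></account>'
    '</customerdetails></customer>'
)

def ancestors_with_attributes(root, tag, text):
    """Hypothetical helper: (tag, attributes) of every ancestor that has attributes."""
    # Build a child -> parent map, because ElementTree has no parent pointer.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for node in root.iter(tag):
        if node.text != text:
            continue
        current = parent_of.get(node)
        while current is not None:       # walk up until the root is passed
            if current.attrib:
                results.append((current.tag, dict(current.attrib)))
            current = parent_of.get(current)
    return results

root = ET.fromstring(XML)
print(ancestors_with_attributes(root, "type", "mpset"))
# [('account', {'id': '910'}), ('customerdetails', {'id': '80'})]
```

With lxml itself, the XPath expressions in the answer are shorter and are probably the better choice.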

Parsing an XML file that depends on tags which may or may not exist

I'm trying to parse an XML file based on a tag which may or may not exist.
How can I avoid this IndexError without using an exception handler?
python script:
#!/usr/bin/python3
from xml.dom import minidom
doc = minidom.parse("Data.xml")
persons = doc.getElementsByTagName("person")
for person in persons:
    print(person.getElementsByTagName("phone")[0].firstChild.data)
Data.xml :
<?xml version="1.0" encoding="UTF-8"?>
<obo>
<Persons>
<person>
<id>XA123</id>
<first_name>Adam</first_name>
<last_name>John</last_name>
<phone>01-12322222</phone>
</person>
<person>
<id>XA7777</id>
<first_name>Anna</first_name>
<last_name>Watson</last_name>
<relationship>
<type>Friends</type>
<to>XA123</to>
</relationship>
<!--<phone>01-12322222</phone>-->
</person>
</Persons>
</obo>
and I get an IndexError:
01-12322222
Traceback (most recent call last):
File "XML->Neo4j-try.py", line 29, in <module>
print(person.getElementsByTagName("phone")[0].firstChild.data)
IndexError: list index out of range
First, check whether the current person has phone data, and proceed only if it does. It is also slightly better to store the result of getElementsByTagName() in a variable to avoid running the same query repeatedly, especially when the actual XML has much more content in each person element:
for person in persons:
    phones = person.getElementsByTagName("phone")
    if phones:
        print(phones[0].firstChild.data)
The error occurs because, for a person without a phone element, getElementsByTagName returns an empty list, so indexing it with [0] fails. Check that the list is non-empty first:
from xml.dom import minidom
doc = minidom.parse("Data.xml")
persons = doc.getElementsByTagName("person")
for person in persons:
    if person.getElementsByTagName("phone"):
        print(person.getElementsByTagName("phone")[0].firstChild.data)
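When several optional tags need the same check, it may be convenient to wrap it in a small function. The following is only a sketch: `first_child_text` is a hypothetical helper (not part of minidom), and the inline XML is a shortened version of Data.xml from the question.

```python
from xml.dom import minidom

def first_child_text(parent, tag, default=None):
    """Hypothetical helper: text of the first <tag> descendant, or default."""
    matches = parent.getElementsByTagName(tag)
    # Guard against a missing tag and against an empty element such as <phone/>.
    if not matches or matches[0].firstChild is None:
        return default
    return matches[0].firstChild.data

doc = minidom.parseString(
    "<Persons>"
    "<person><id>XA123</id><phone>01-12322222</phone></person>"
    "<person><id>XA7777</id></person>"
    "</Persons>"
)
phones = [first_child_text(p, "phone", "n/a") for p in doc.getElementsByTagName("person")]
print(phones)  # ['01-12322222', 'n/a']
```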

How to change element text of only one element with python elementtree

I have the following xml example file:
<Book>
<Location>page10</Location>
<Chapter>
<Location>page11</Location>
</Chapter>
</Book>
I want to change the text value of the <Location> element directly beneath <Book>.
Using findall gives both 'Location' elements.
Using find gives the first match, which could be right, but if the 'Chapter' element is placed before Location then I get the wrong element.
Any suggestions?
Use paths:
>>> import xml.etree.ElementTree as etree
>>> frag = '<Book><Chapter><Location>page11</Location></Chapter><Location>page10</Location></Book>'
>>> tree = etree.fromstring(frag)
>>> tree.findall('./Location')[0].text
'page10'
>>> tree.findall('./Location')[1].text
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
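The session above shows that './Location' matches only the direct child. To actually change its text, as the question asks, something like the following sketch should work; the new value "page12" is made up for illustration.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<Book><Chapter><Location>page11</Location></Chapter>"
    "<Location>page10</Location></Book>"
)

# './Location' only considers direct children of <Book>,
# so the <Location> nested inside <Chapter> is never selected.
direct = book.find("./Location")
if direct is not None:
    direct.text = "page12"

print(ET.tostring(book, encoding="unicode"))
```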

Parse xsd with values [python]

I'm trying to examine and extract some data from an xml file using python. I'm doing this by parsing with etree then looping through the elements:
import xml.etree.ElementTree as etree
root = etree.fromstring(xml_string)
for element in root.iter():
    print("%s , %s , %s" % (element.tag, element.attrib, element.text))
This works fine for some test data, but the actual xml files that I'm working with seem to contain xsd tags along with the data. Below is an example
<wdtf:observationMember>
<wdtf:TimeSeriesObservation gml:id="ts1">
<gml:description>Reading using DTW (Depth To Water) from TOC</gml:description>
<gml:name codeSpace="http://www.bom.gov.au/std/water/xml/wio0.2/feature/TimeSeriesObservation/w00066/12/A/GroundWaterLevel/">1</gml:name>
<om:procedure xlink:href="#gwTOC12" />
<om:observedProperty xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/property//bom/GroundWaterLevel_m" />
<om:featureOfInterest xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/feature/BorePipeSamplingInterval/w00066/12" />
<wdtf:metadata>
<wdtf:TimeSeriesObservationMetadata>
<wdtf:regulationProperty>Reg200806.s3.2a</wdtf:regulationProperty>
<wdtf:status>validated</wdtf:status>
</wdtf:TimeSeriesObservationMetadata>
</wdtf:metadata>
<wdtf:result>
<wdtf:TimeSeries>
<wdtf:defaultInterpolationType>InstVal</wdtf:defaultInterpolationType>
<wdtf:defaultUnitsOfMeasure>m</wdtf:defaultUnitsOfMeasure>
<wdtf:defaultQuality>quality-A</wdtf:defaultQuality>
<wdtf:timeValuePair time="1915-12-09T12:00:00+10:00">51.82</wdtf:timeValuePair>
<wdtf:timeValuePair time="1917-12-18T12:00:00+10:00">41.38</wdtf:timeValuePair>
<wdtf:timeValuePair time="1924-05-23T12:00:00+10:00">21.95</wdtf:timeValuePair>
<wdtf:timeValuePair time="1988-02-02T12:00:00+10:00">7.56</wdtf:timeValuePair>
</wdtf:TimeSeries>
</wdtf:result>
</wdtf:TimeSeriesObservation>
</wdtf:observationMember>
Using this XML in the code above causes etree to raise an error:
Traceback (most recent call last):
File "xml_test2.py", line 38, in <module>
root = etree.fromstring(xml_string)
File "<string>", line 124, in XML
ParseError: unbound prefix: line 1, column 4
Is there a different parser I should be using? Or can I remove the xsd tags somehow?
Thanks
From what I can see in your post, your parser is namespace aware and is complaining that XML namespace aliases are not resolved. Assuming that <wdtf:observationMember> is your topmost element, then you have to have the following at least:
<wdtf:observationMember xmlns:wdtf="some-uri">
The same applies for all other prefixes, such as gml, om, etc.
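A minimal sketch of the difference, using a cut-down version of the document. The URI http://example.com/wdtf is a placeholder chosen for this example, not the real WDTF namespace; the actual URIs should be taken from the root element of the original file.

```python
import xml.etree.ElementTree as ET

# Prefix used without a declaration: the parser cannot resolve it.
undeclared = "<wdtf:observationMember><wdtf:status>validated</wdtf:status></wdtf:observationMember>"

# Same fragment with the prefix bound to a (placeholder) namespace URI.
declared = (
    '<wdtf:observationMember xmlns:wdtf="http://example.com/wdtf">'
    "<wdtf:status>validated</wdtf:status></wdtf:observationMember>"
)

try:
    ET.fromstring(undeclared)
    error = None
except ET.ParseError as exc:
    error = str(exc)          # an "unbound prefix" message

root = ET.fromstring(declared)
ns = {"wdtf": "http://example.com/wdtf"}   # prefix map used only for lookups
status = root.find("wdtf:status", ns)

print(error)
print(root.tag)       # {http://example.com/wdtf}observationMember
print(status.text)    # validated
```

Once the prefixes are declared, ElementTree reports tags in the {uri}localname form, so the loop in the question will print names like that rather than wdtf:observationMember.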

Python lxml: Items without .text attribute returned when querying for nodes()

I am trying to parse out certain tags from an XML document and it is raising an AttributeError: '_ElementStringResult' object has no attribute 'text' error.
Here is the xml document:
<?xml version='1.0' encoding='ASCII'?>
<Root>
<Data>
<FormType>Log</FormType>
<Submitted>2012-03-19 07:34:07</Submitted>
<ID>1234</ID>
<LAST>SJTK4</LAST>
<Latitude>36.7027777778</Latitude>
<Longitude>-108.046111111</Longitude>
<Speed>0.0</Speed>
</Data>
</Root>
Here is the code I am using
from lxml import etree
from StringIO import StringIO
import MySQLdb
import glob
import os
import shutil
import logging
import sys
localPath = r"C:\data"
xmlFiles = glob.glob1(localPath,"*.xml")
for file in xmlFiles:
    a = os.path.join(localPath,file)
    element = etree.parse(a)
    Data = element.xpath('//Root/Data/node()')
    parsedData = [{field.tag: field.text for field in Data} for action in Data]
    print parsedData #AttributeError: '_ElementStringResult' object has no attribute 'text'
'//Root/Data/node()' returns a list of all the child nodes, which includes text nodes as strings, and those have no text attribute. If you put a print right after the Data = ... line you will see something like ['\n ', <Element FormType at 0x10675fdc0>, '\n ', ....
I would filter first, for example:
Data = [f for f in element.xpath('//Root/Data/node()') if hasattr(f, 'text')]
Then I think the following line could be rewritten as:
parsedData = {field.tag: field.text for field in Data}
which will give the element tag and text dictionary which I believe is what you want.
Instead of querying for //Root/Data/node(), query for /Root/Data/* if you want only elements (as opposed to text nodes) to be returned. (Also, using only a single leading / rather than // allows the engine to do a cheaper search, rather than needing to look through the whole subtree for an additional Root.)
Also -- are you sure you really want to loop through the entire list of subelements of Data inside your inner loop, rather than looping over only the subelements of a single Data element selected by your outer loop? I think your logic is broken, though it would only be visible if you had a file with more than one Data element under Root.
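To illustrate the point about looping per Data element, here is a sketch that builds one dictionary for each Data element. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages, and the second Data element (ID 5678) is invented for the example. Iterating over an element yields only its child elements, so no whitespace text nodes show up.

```python
import xml.etree.ElementTree as ET

xml_text = """<Root>
  <Data><FormType>Log</FormType><ID>1234</ID></Data>
  <Data><FormType>Log</FormType><ID>5678</ID></Data>
</Root>"""

root = ET.fromstring(xml_text)

# Outer loop: one Data element at a time.
# Inner loop: only the child elements of that Data element.
parsed = [
    {field.tag: field.text for field in data}
    for data in root.findall("Data")
]
print(parsed)
# [{'FormType': 'Log', 'ID': '1234'}, {'FormType': 'Log', 'ID': '5678'}]
```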
