How to change element text of only one element with python elementtree - python

I have the following xml example file:
<Book>
<Location>page10</Location>
<Chapter>
<Location>page11</Location>
</Chapter>
</Book>
I want to change the text value of the <Location> element directly beneath <Book>.
Using findall gives both 'Location' elements.
Using find gives the first one, which could be right, but if the 'Chapter' element is placed before 'Location' then I get the wrong element.
Does anyone have any suggestions?

Use a path that selects only the direct children of the root element:
>>> import xml.etree.ElementTree as etree
>>> frag = '<Book><Chapter><Location>page11</Location></Chapter><Location>page10</Location></Book>'
>>> tree = etree.fromstring(frag)
>>> tree.findall('./Location')[0].text
'page10'
>>> tree.findall('./Location')[1].text
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
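Building on the session above, here is a minimal sketch of actually changing the text. The path './Location' matches only direct children of <Book>, so the order of <Chapter> and <Location> does not matter. The new value 'page42' is just a placeholder for illustration.

```python
import xml.etree.ElementTree as ET

frag = "<Book><Chapter><Location>page11</Location></Chapter><Location>page10</Location></Book>"
book = ET.fromstring(frag)

# './Location' only looks at direct children of <Book>, so the
# <Location> nested inside <Chapter> is never matched.
location = book.find("./Location")
location.text = "page42"  # placeholder value, not from the original question

print(ET.tostring(book, encoding="unicode"))
```

The nested <Location> inside <Chapter> keeps its original text.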

Related

extract a specific tag from xml file using beautiful soup in python

I have an xml file (let's call it abc.xml) which looks like this.
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<product name="XYZ" version="123"/>
<application-links>
<application-links>
<id>111111111111111</id>
<name>Link_1</name>
<primary>true</primary>
<type>applinks.ABC</type>
<display-url>http://ABC.displayURL</display-url>
<rpc-url>http://ABC.displayURL</rpc-url>
</application-links>
</application-links>
</properties>
My Python code looks like this:
f = open ('file.xml', 'r')
from bs4 import BeautifulSoup
soup = BeautifulSoup(f,'lxml')
print(soup.product)
for applinks in soup.application-links:
    print(applinks)
which prints the following:
<product name="XYZ" version="123"></product>
Traceback (most recent call last):
File "parse.py", line 7, in <module>
for applinks in soup.application-links:
NameError: name 'links' is not defined
Can you please help me understand how to print the tags whose names include a dash/hyphen '-'?
I don't know if beautifulsoup is the best option here, but I really suggest using the ElementTree module in python like so:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('file.xml').getroot()
>>> for app in root.findall('*/application-links/'):
...     print(app.text)
111111111111111
Link_1
true
applinks.ABC
http://ABC.displayURL
http://ABC.displayURL
To print only the value inside the <name> tag, extend the path:
>>> for app in root.findall('*/application-links/name'):
...     print(app.text)
Link_1
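The same idea as a self-contained sketch: a hyphen is harmless inside a path string, whereas in attribute-style access Python reads it as a minus sign (which is where the NameError on 'links' comes from). The inline XML below is a shortened copy of the file from the question, and the explicit path is one possible way to write it, not the only one.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question, inlined so the sketch is runnable.
xml_text = """<properties>
  <product name="XYZ" version="123"/>
  <application-links>
    <application-links>
      <id>111111111111111</id>
      <name>Link_1</name>
      <primary>true</primary>
    </application-links>
  </application-links>
</properties>"""

root = ET.fromstring(xml_text)

# Hyphenated tag names are ordinary text inside a path string.
path = "./application-links/application-links/*"
fields = {child.tag: child.text for child in root.findall(path)}

print(fields)
```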

ExpatError: junk after document element xml python error

I have a project that needs to convert XML to a dict in Python. I am using the xmltodict library; however, when I convert the XML to a dict it raises this error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/Users/deanchristianarmada/Desktop/projects/asian_gaming/radar/lib/python2.7/site-packages/xmltodict.py", line 311, in parse
parser.Parse(xml_input, True)
ExpatError: junk after document element: line 2, column 0
my code is:
import xmltodict
xml = '<row dataType="TR" ID="3B6B408870BA7AC3E05381010A0A5849" agentCode="690001001001001" transferId="G87_AGIN160901115820S441XB" tradeNo="160831287638239" platformType="AGIN" playerName="mubuuvu2" transferType="IN" transferAmount="28" previousAmount="0" currentAmount="28" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:16" gameCode="" />\r\n<row dataType="TR" ID="3B6B408870BB7AC3E05381010A0A5849" agentCode="690001001001001" transferId="160831231227456_Hunter_Out" tradeNo="160831287639025" platformType="AGIN" playerName="zxh123" transferType="OUT" transferAmount="-50" previousAmount="50" currentAmount="0" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:18" gameCode="" />\r\n<row dataType="TR" ID="3B6B408870BC7AC3E05381010A0A5849" agentCode="690001001001001" transferId="160831231227452_Hunter_In" tradeNo="160831287639507" platformType="AGIN" playerName="qqq19qq32b" transferType="IN" transferAmount="71" previousAmount="0" currentAmount="71" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:19" gameCode="" />\r\n'
_dict = xmltodict.parse(xml, attr_prefix="")
I can't seem to find a way to fix it. I'm not used to XML; I'm used to JSON.
If you add an opening root tag at the beginning and a closing root tag at the end of the xml string, it should work.
import xmltodict
xml = 'xml string here'
xml = '<root>'+xml+'</root>'
_dict = xmltodict.parse(xml, attr_prefix="")
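Basically, the string contains several top-level <row> elements, while an XML document must have exactly one root element, so it's just missing a wrapping <root> tag.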
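The effect can be reproduced with the standard library alone. This is a sketch using xml.etree.ElementTree rather than xmltodict (a substitution made here so the example has no third-party dependency), with a shortened, made-up pair of <row> elements standing in for the real data.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the data in the question: two top-level elements.
fragment = '<row ID="a" transferType="IN" />\r\n<row ID="b" transferType="OUT" />\r\n'

# Without a single root element the parser rejects the input.
try:
    ET.fromstring(fragment)
    parsed_unwrapped = True
except ET.ParseError:
    parsed_unwrapped = False

# Wrapping the fragment in one root element makes it a well-formed document.
root = ET.fromstring("<root>" + fragment + "</root>")
rows = [dict(row.attrib) for row in root.findall("row")]

print(parsed_unwrapped)
print(rows)
```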

lxml: Converting XML to HTML through XSLT and get HtmlElements

I have data that comes as an XML file. I have also been provided an XSLT to transform the XML to HTML. I can use lxml to perform the conversion; however, I want to alter some of the HTML tags after the transformation. How do I convert this new etree into HtmlElements so that I can use methods such as .cssselect()?
>>> import lxml.etree
>>> import lxml.html
>>>
>>> xmlstring = '''\
... <?xml version='1.0' encoding='ASCII'?>
... <root><a class="here">link1</a><a class="there">link2</a></root>
... '''
>>> root = lxml.etree.fromstring(xmlstring)
>>> root.cssselect('a.here')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'lxml.etree._Element' object has no attribute 'cssselect'
Serialize the tree with lxml.etree.tostring(root), then re-parse the result with lxml.html.fromstring(...):
>>> root = lxml.html.fromstring(lxml.etree.tostring(root))
>>> root.cssselect('a.here')
[<Element a at 0x2989308>]
Get XML output:
>>> print lxml.etree.tostring(root, xml_declaration=True)
<?xml version='1.0' encoding='ASCII'?>
<root><a class="here">link1</a><a class="there">link2</a></root>

Parse xsd with values [python]

I'm trying to examine and extract some data from an xml file using python. I'm doing this by parsing with etree then looping through the elements:
import xml.etree.ElementTree as etree
root = etree.fromstring(xml_string)
for element in root.iter():
    print("%s , %s , %s" % (element.tag, element.attrib, element.text))
This works fine for some test data, but the actual xml files that I'm working with seem to contain xsd tags along with the data. Below is an example:
<wdtf:observationMember>
<wdtf:TimeSeriesObservation gml:id="ts1">
<gml:description>Reading using DTW (Depth To Water) from TOC</gml:description>
<gml:name codeSpace="http://www.bom.gov.au/std/water/xml/wio0.2/feature/TimeSeriesObservation/w00066/12/A/GroundWaterLevel/">1</gml:name>
<om:procedure xlink:href="#gwTOC12" />
<om:observedProperty xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/property//bom/GroundWaterLevel_m" />
<om:featureOfInterest xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/feature/BorePipeSamplingInterval/w00066/12" />
<wdtf:metadata>
<wdtf:TimeSeriesObservationMetadata>
<wdtf:regulationProperty>Reg200806.s3.2a</wdtf:regulationProperty>
<wdtf:status>validated</wdtf:status>
</wdtf:TimeSeriesObservationMetadata>
</wdtf:metadata>
<wdtf:result>
<wdtf:TimeSeries>
<wdtf:defaultInterpolationType>InstVal</wdtf:defaultInterpolationType>
<wdtf:defaultUnitsOfMeasure>m</wdtf:defaultUnitsOfMeasure>
<wdtf:defaultQuality>quality-A</wdtf:defaultQuality>
<wdtf:timeValuePair time="1915-12-09T12:00:00+10:00">51.82</wdtf:timeValuePair>
<wdtf:timeValuePair time="1917-12-18T12:00:00+10:00">41.38</wdtf:timeValuePair>
<wdtf:timeValuePair time="1924-05-23T12:00:00+10:00">21.95</wdtf:timeValuePair>
<wdtf:timeValuePair time="1988-02-02T12:00:00+10:00">7.56</wdtf:timeValuePair>
</wdtf:TimeSeries>
</wdtf:result>
</wdtf:TimeSeriesObservation>
</wdtf:observationMember>
Using this xml in the code above causes etree to raise an error:
Traceback (most recent call last):
File "xml_test2.py", line 38, in <module>
root = etree.fromstring(xml_string)
File "<string>", line 124, in XML
ParseError: unbound prefix: line 1, column 4
Is there a different parser I should be using? Or can I remove the xsd tags somehow?
Thanks
From what I can see in your post, your parser is namespace-aware and is complaining that the namespace prefixes (wdtf, gml, om, xlink) are not bound to any namespace URI. Assuming that <wdtf:observationMember> is your topmost element, it needs at least a declaration like the following:
<wdtf:observationMember xmlns:wdtf="some-uri">
The same applies for all other prefixes, such as gml, om, etc.
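A minimal sketch of what this looks like once the prefixes are declared. The namespace URIs below ('urn:example:wdtf', 'urn:example:gml') are placeholders invented for illustration; the real URIs have to come from the original document or its schema. The XML is a shortened version of the sample in the question.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions for illustration only); use the
# URIs that the real document declares.
xml_string = """<wdtf:observationMember
    xmlns:wdtf="urn:example:wdtf"
    xmlns:gml="urn:example:gml">
  <wdtf:TimeSeriesObservation gml:id="ts1">
    <gml:description>Reading using DTW</gml:description>
    <wdtf:timeValuePair time="1915-12-09T12:00:00+10:00">51.82</wdtf:timeValuePair>
  </wdtf:TimeSeriesObservation>
</wdtf:observationMember>"""

root = ET.fromstring(xml_string)

# Map prefixes to URIs so paths can use the familiar prefixed names.
ns = {"wdtf": "urn:example:wdtf", "gml": "urn:example:gml"}
pairs = [
    (el.get("time"), float(el.text))
    for el in root.iterfind(".//wdtf:timeValuePair", ns)
]

print(root.tag)
print(pairs)
```

After parsing, ElementTree reports tags in the expanded form {uri}localname, which is why a prefix map is passed to iterfind.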

How to set element's id in Python's xml.dom.minidom?

How do I do this? I created a document and an element:
import xml.dom.minidom as d
a=d.Document()
b=a.createElement('test')
setIdAttribute doesn't work :(
b.setIdAttribute('something')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/xml/dom/minidom.py", line 835, in setIdAttribute
self.setIdAttributeNode(idAttr)
File "/usr/lib/python2.6/xml/dom/minidom.py", line 843, in setIdAttributeNode
raise xml.dom.NotFoundErr()
xml.dom.NotFoundErr
And if I set this by hand, getElementById can't find it.
b.setAttribute('id', 'something')
a.getElementById('something')
What do I have to do?
Two things are wrong here.
Document.getElementById will only find elements that are actually in the document. Here you've created b but not actually added it to the document. (It's exactly the same in JavaScript.)
You have to mark id as an ID attribute using setIdAttribute. (There's no need to do this in JavaScript because in HTML documents, attributes named id are automatically considered to be ID attributes, logically enough. But XML does not automatically treat attributes named id as IDs; you can either explicitly declare that they are in your DTD or call setIdAttribute individually for every ID attribute. And I am not sure the DTD thing will work with minidom, which is not a full DOM implementation.)
Like so:
import xml.dom.minidom as d
a = d.Document()
b = a.createElement('test')
a.appendChild(b)
b.setAttribute('id', 'x')
b.setIdAttribute('id')
After that, getElementById works:
>>> a.getElementById('x')
<DOM Element: test at 0xb77712ec>
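To see the effect of marking the attribute separately from setting it, here is a small sketch that looks the element up before and after the setIdAttribute call. The variable names are chosen here for illustration.

```python
import xml.dom.minidom as minidom

doc = minidom.Document()
elem = doc.createElement("test")
doc.appendChild(elem)          # the element must be part of the document
elem.setAttribute("id", "x")

# The attribute exists, but it has not been marked as an ID yet.
before = doc.getElementById("x")

# Mark the attribute named "id" as an ID attribute, then look it up again.
elem.setIdAttribute("id")
after = doc.getElementById("x")

print(before)
print(after is elem)
```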
Declaring the id attribute as an ID in the DTD should also help. For example, if you want the id attribute to be treated as an ID on all <div> elements, you can set up your DTD as follows:
<!DOCTYPE div [<!ATTLIST div id ID #IMPLIED>]>
This is a working example:
>>> from xml.dom.minidom import parse, parseString
>>> data='<!DOCTYPE div [<!ATTLIST div id ID #IMPLIED>]><div><div id="foo">FOO word</div><div id="bar">BAR word</div></div>'
>>> x=parseString(data)
>>> x.getElementById('foo')
<DOM Element: div at 0x1126440>
>>> x.getElementById('foo').toxml()
u'<div id="foo">FOO word</div>'
