Parse xsd with values [python] - python

I'm trying to examine and extract some data from an xml file using python. I'm doing this by parsing with etree then looping through the elements:
import xml.etree.ElementTree as etree
root = etree.fromstring(xml_string)
for element in root.iter():
    print("%s , %s , %s" % (element.tag, element.attrib, element.text))
This works fine for some test data, but the actual XML files that I'm working with seem to contain xsd tags along with the data. Below is an example:
<wdtf:observationMember>
  <wdtf:TimeSeriesObservation gml:id="ts1">
    <gml:description>Reading using DTW (Depth To Water) from TOC</gml:description>
    <gml:name codeSpace="http://www.bom.gov.au/std/water/xml/wio0.2/feature/TimeSeriesObservation/w00066/12/A/GroundWaterLevel/">1</gml:name>
    <om:procedure xlink:href="#gwTOC12" />
    <om:observedProperty xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/property//bom/GroundWaterLevel_m" />
    <om:featureOfInterest xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/feature/BorePipeSamplingInterval/w00066/12" />
    <wdtf:metadata>
      <wdtf:TimeSeriesObservationMetadata>
        <wdtf:regulationProperty>Reg200806.s3.2a</wdtf:regulationProperty>
        <wdtf:status>validated</wdtf:status>
      </wdtf:TimeSeriesObservationMetadata>
    </wdtf:metadata>
    <wdtf:result>
      <wdtf:TimeSeries>
        <wdtf:defaultInterpolationType>InstVal</wdtf:defaultInterpolationType>
        <wdtf:defaultUnitsOfMeasure>m</wdtf:defaultUnitsOfMeasure>
        <wdtf:defaultQuality>quality-A</wdtf:defaultQuality>
        <wdtf:timeValuePair time="1915-12-09T12:00:00+10:00">51.82</wdtf:timeValuePair>
        <wdtf:timeValuePair time="1917-12-18T12:00:00+10:00">41.38</wdtf:timeValuePair>
        <wdtf:timeValuePair time="1924-05-23T12:00:00+10:00">21.95</wdtf:timeValuePair>
        <wdtf:timeValuePair time="1988-02-02T12:00:00+10:00">7.56</wdtf:timeValuePair>
      </wdtf:TimeSeries>
    </wdtf:result>
  </wdtf:TimeSeriesObservation>
</wdtf:observationMember>
Using this XML in the code above causes etree to raise an error:
Traceback (most recent call last):
File "xml_test2.py", line 38, in <module>
root = etree.fromstring(xml_string)
File "<string>", line 124, in XML
ParseError: unbound prefix: line 1, column 4
Is there a different parser I should be using? Or can I remove the xsd tags somehow?
Thanks

From what I can see in your post, your parser is namespace-aware and is complaining that the XML namespace prefixes are not bound to any URI. Assuming that <wdtf:observationMember> is your topmost element, it needs at least the following declaration:
<wdtf:observationMember xmlns:wdtf="some-uri">
The same applies to every other prefix used in the document, such as gml, om, and xlink.
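To illustrate the point, here is a minimal sketch of the same kind of document parsing cleanly once the prefixes are declared on the root element. The URIs below are placeholders (assumptions, not the real WDTF or GML namespace URIs); the actual values should be copied from the full source file or its schema.

```python
import xml.etree.ElementTree as ET

# NOTE: both namespace URIs are made-up placeholders for illustration.
xml_string = """<wdtf:observationMember
    xmlns:wdtf="http://example.com/wdtf"
    xmlns:gml="http://example.com/gml">
  <wdtf:TimeSeriesObservation gml:id="ts1">
    <wdtf:timeValuePair time="1915-12-09T12:00:00+10:00">51.82</wdtf:timeValuePair>
    <wdtf:timeValuePair time="1917-12-18T12:00:00+10:00">41.38</wdtf:timeValuePair>
  </wdtf:TimeSeriesObservation>
</wdtf:observationMember>"""

root = ET.fromstring(xml_string)  # no "unbound prefix" error now

# Map the prefix used in the search path to the same (placeholder) URI.
ns = {"wdtf": "http://example.com/wdtf"}
pairs = [
    (el.get("time"), float(el.text))
    for el in root.iterfind(".//wdtf:timeValuePair", ns)
]
print(pairs)
```

If the snippet was cut out of a larger file, the declarations are probably on that file's root element, so parsing the whole file rather than the fragment may be enough.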

Related

Python xml etree add sub element containing prefix to xml file

I need to add an element to an existing XML file. The file defines a namespace like this:
<Document xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging" ></Document>
The element that I need to add:
<idPkg:Story src="Stories/Story_main.xml" />
To obtain:
<Document xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging">
  <idPkg:Story src="Stories/Story_main.xml" />
</Document>
I've tried this:
designmap = ET.parse(os.path.join(path, "designmap.xml"))
designmap.getroot().append(ET.fromstring(f"<idPkg:Story src=\"Stories/Story_{self.id}.xml\" />"))
designmap.write(os.path.join(path, "designmap.xml"))
But I get this error:
xml.etree.ElementTree.ParseError: unbound prefix: line 1, column 0
because the parser cannot resolve the idPkg prefix when the fragment is parsed on its own.
Is there a workaround?
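One possible workaround, sketched here under the assumption that the standard-library ElementTree is in use: build the element with its fully qualified name instead of parsing a prefixed fragment, and register the prefix so it is reused on output. The document string is a stand-in for designmap.xml.

```python
import xml.etree.ElementTree as ET

IDPKG_NS = "http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging"

# Ask ElementTree to serialize this namespace with the idPkg prefix
# instead of an auto-generated one such as ns0.
ET.register_namespace("idPkg", IDPKG_NS)

# Stand-in for the parsed designmap.xml root element.
root = ET.fromstring('<Document xmlns:idPkg="%s"></Document>' % IDPKG_NS)

# "{uri}local-name" is how ElementTree spells a namespaced tag.
ET.SubElement(root, "{%s}Story" % IDPKG_NS, {"src": "Stories/Story_main.xml"})

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

With a file on disk, the same SubElement call would be applied to designmap.getroot() before calling designmap.write().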

ExpatError: junk after document element xml python error

I have a project that needs to convert XML to a dict in Python. I am using the xmltodict library, but when I convert the XML to a dict it raises this error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/Users/deanchristianarmada/Desktop/projects/asian_gaming/radar/lib/python2.7/site-packages/xmltodict.py", line 311, in parse
parser.Parse(xml_input, True)
ExpatError: junk after document element: line 2, column 0
My code is:
import xmltodict
xml = '<row dataType="TR" ID="3B6B408870BA7AC3E05381010A0A5849" agentCode="690001001001001" transferId="G87_AGIN160901115820S441XB" tradeNo="160831287638239" platformType="AGIN" playerName="mubuuvu2" transferType="IN" transferAmount="28" previousAmount="0" currentAmount="28" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:16" gameCode="" />\r\n<row dataType="TR" ID="3B6B408870BB7AC3E05381010A0A5849" agentCode="690001001001001" transferId="160831231227456_Hunter_Out" tradeNo="160831287639025" platformType="AGIN" playerName="zxh123" transferType="OUT" transferAmount="-50" previousAmount="50" currentAmount="0" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:18" gameCode="" />\r\n<row dataType="TR" ID="3B6B408870BC7AC3E05381010A0A5849" agentCode="690001001001001" transferId="160831231227452_Hunter_In" tradeNo="160831287639507" platformType="AGIN" playerName="qqq19qq32b" transferType="IN" transferAmount="71" previousAmount="0" currentAmount="71" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:19" gameCode="" />\r\n'
_dict = xmltodict.parse(xml, attr_prefix="")
I can't seem to find a way to fix it, and I'm not used to XML; I'm used to JSON.
If you add an opening root tag at the beginning and a closing root tag at the end of the XML string, it should work.
import xmltodict
xml = 'xml string here'
xml = '<root>'+xml+'</root>'
_dict = xmltodict.parse(xml, attr_prefix="")
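Basically, it's just missing the <root> tag: an XML document must have exactly one top-level element, and the string contains several sibling <row> elements, so the parser reports everything after the first one as junk.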
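The single-root rule can be checked without xmltodict. The sketch below uses the standard-library ElementTree, which sits on the same expat parser, with a shortened stand-in for the row data; the wrapper name root is arbitrary.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the original string: two sibling elements, no root.
fragment = (
    '<row playerName="mubuuvu2" transferAmount="28" />\r\n'
    '<row playerName="zxh123" transferAmount="-50" />\r\n'
)

error = None
try:
    ET.fromstring(fragment)
except ET.ParseError as exc:
    error = str(exc)  # the parser rejects the second top-level element

# Wrapping the fragment in one enclosing element makes it well-formed.
wrapped = "<root>" + fragment + "</root>"
rows = [dict(row.attrib) for row in ET.fromstring(wrapped)]
print(error)
print(rows)
```

After wrapping, xmltodict should return the rows nested under the wrapper key, so the data would be reached through that key rather than at the top level.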

parsing XML file depends on tags which may or may not be existed

I'm trying to parse an XML file based on a tag that may or may not exist.
How can I avoid this IndexError without using an exception handler?
Python script:
#!/usr/bin/python3
from xml.dom import minidom
doc = minidom.parse("Data.xml")
persons = doc.getElementsByTagName("person")
for person in persons:
    print(person.getElementsByTagName("phone")[0].firstChild.data)
Data.xml :
<?xml version="1.0" encoding="UTF-8"?>
<obo>
  <Persons>
    <person>
      <id>XA123</id>
      <first_name>Adam</first_name>
      <last_name>John</last_name>
      <phone>01-12322222</phone>
    </person>
    <person>
      <id>XA7777</id>
      <first_name>Anna</first_name>
      <last_name>Watson</last_name>
      <relationship>
        <type>Friends</type>
        <to>XA123</to>
      </relationship>
      <!--<phone>01-12322222</phone>-->
    </person>
  </Persons>
</obo>
and I get an IndexError:
01-12322222
Traceback (most recent call last):
File "XML->Neo4j-try.py", line 29, in <module>
print(person.getElementsByTagName("phone")[0].firstChild.data)
IndexError: list index out of range
First, you need to check whether the current person has phone data, and proceed only if it does. It is also slightly better to store the result of getElementsByTagName() in a variable to avoid running the same query repeatedly, especially when the actual XML has a lot more content in each person element:
for person in persons:
    phones = person.getElementsByTagName("phone")
    if phones:
        print(phones[0].firstChild.data)
It raises the error because when a person has no phone element, getElementsByTagName("phone") returns an empty list, so indexing it with [0] fails. Check the list before indexing:
from xml.dom import minidom
doc = minidom.parse("Data.xml")
persons = doc.getElementsByTagName("person")
for person in persons:
    if person.getElementsByTagName("phone"):
        print(person.getElementsByTagName("phone")[0].firstChild.data)
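The emptiness check can be tried without a file on disk. This is a minimal sketch that parses a trimmed copy of Data.xml from a string and records None for a person without a phone; the phones_by_id dictionary is an illustrative choice, not part of the original script.

```python
from xml.dom import minidom

# Trimmed copy of Data.xml; the second person's phone is commented out.
data = """<obo><Persons>
<person><id>XA123</id><phone>01-12322222</phone></person>
<person><id>XA7777</id><!--<phone>01-12322222</phone>--></person>
</Persons></obo>"""

doc = minidom.parseString(data)

phones_by_id = {}
for person in doc.getElementsByTagName("person"):
    person_id = person.getElementsByTagName("id")[0].firstChild.data
    phones = person.getElementsByTagName("phone")  # empty list if absent
    phones_by_id[person_id] = phones[0].firstChild.data if phones else None

print(phones_by_id)
```

An empty element such as <phone/> would still pass this check but has no firstChild, so data with empty tags would need one more guard.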

Python XPath SyntaxError: invalid predicate

I am trying to parse XML like this:
<document>
  <pages>
    <page>
      <paragraph>XBV</paragraph>
      <paragraph>GHF</paragraph>
    </page>
    <page>
      <paragraph>ash</paragraph>
      <paragraph>lplp</paragraph>
    </page>
  </pages>
</document>
Here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse("../../xml/test.xml")
root = tree.getroot()
path="./pages/page/paragraph[text()='GHF']"
print root.findall(path)
But I get an error:
print root.findall(path)
File "X:\Anaconda2\lib\xml\etree\ElementTree.py", line 390, in findall
return ElementPath.findall(self, path, namespaces)
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 293, in findall
return list(iterfind(elem, path, namespaces))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 263, in iterfind
selector.append(ops[token[0]](next, token))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 224, in prepare_predicate
raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
What is wrong with my XPath?
Follow up
Thanks falsetru, your solution worked. I have a follow-up. Now I want to get all the paragraph elements that come before the paragraph with the text GHF. So in this case I only need the XBV element; I want to ignore ash and lplp. I guess one way to do this would be:
result = []
for para in root.findall('./pages/page/'):
    t = para.text.encode("utf-8", "ignore")
    if t == "GHF":
        break
    else:
        result.append(para)
But is there a better way to do this?
ElementTree's XPath support is limited. Use another library such as lxml:
import lxml.etree
root = lxml.etree.parse('test.xml')
path = "./pages/page/paragraph[text()='GHF']"
print(root.xpath(path))
As @falsetru mentioned, ElementTree doesn't support the text() predicate, but it does support matching a child element by its text. So in this example it is possible to search for a page that has a paragraph with specific text, using the path ./pages/page[paragraph='GHF']. The problem here is that there are multiple paragraph tags in a page, so one would still have to iterate to find the specific paragraph. In my case, I needed to find the version of a dependency in a Maven pom.xml, and there is only a single version child, so the following worked:
In [1]: import xml.etree.ElementTree as ET
In [2]: ns = {"pom": "http://maven.apache.org/POM/4.0.0"}
In [3]: ET.parse("pom.xml").findall(".//pom:dependencies/pom:dependency[pom:artifactId='some-artifact-with-hardcoded-version']/pom:version", ns)[0].text
Out[3]: '1.2.3'
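For the original document, the child-text predicate and the follow-up question can be combined using only the standard library. This is a minimal sketch: the predicate selects the page, and a plain loop collects the paragraphs that precede the GHF one.

```python
import xml.etree.ElementTree as ET

doc = """<document><pages>
<page><paragraph>XBV</paragraph><paragraph>GHF</paragraph></page>
<page><paragraph>ash</paragraph><paragraph>lplp</paragraph></page>
</pages></document>"""
root = ET.fromstring(doc)

# [paragraph='GHF'] matches page elements that have a paragraph child
# whose text is exactly 'GHF'; it selects the page, not the paragraph.
pages = root.findall("./pages/page[paragraph='GHF']")

# Within each matching page, keep the paragraphs that come before 'GHF'.
before = []
for page in pages:
    for para in page.findall("paragraph"):
        if para.text == "GHF":
            break
        before.append(para.text)

print(before)
```

lxml's full XPath support could express the same selection in a single expression, at the cost of a third-party dependency.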

Want to delete multiple tags with same tag name in XML using python elementtree?

I am new to Python. As per a project requirement, I want to send web requests for different test cases. For example (see Employee_req.xml below), for one test case I want to call the web service with all organizations, but for another I want to call it with all first-name tags removed. I am using ElementTree in Python to work with the XML; the relevant code is below. Modifying tag and attribute values works without any issue, but removing certain tags throws an error. I am not good with XPath, so could you please suggest possible approaches?
Emp_req.xml
<request>
  <orgaqnization>
    <name>org1</name>
    <employee>
      <first-name>abc</first-name>
      <last-name>def</last-name>
      <dob>19870909</dob>
    </employee>
  </orgaqnization>
  <orgaqnization>
    <name>org2</name>
    <employee>
      <first-name>abc2</first-name>
      <last-name>def2</last-name>
      <dob>19870909</dob>
    </employee>
  </orgaqnization>
  <orgaqnization>
    <name>org3</name>
    <employee>
      <first-name>abc3</first-name>
      <last-name>def3</last-name>
      <dob>19870909</dob>
    </employee>
  </orgaqnization>
</request>
Python:: Test.py
modifiy_query("Remove", tagname=".//first-name")
import xml.etree.ElementTree as query_xml
def modifiy_query(self, *args, **kwargs):
    root = query_tree.getroot()
    operation_type = args[0]
    tag_name = kwargs['tagname']
    try:
        if operation_type == "Remove":
            logger.info("Removing %s Tag from XML" % tag_name)
            root.remove(tag_name)
        elif operation_type == "Insert":
            logger.info("Inserting %s tag to xml" % tag_name)
        else:
            raise InvalidXMLOperationError("Operation " + operation_type + " is invalid")
    except InvalidXMLOperationError as e:
        logger.error("Invalid XML operation %s" % operation_type)
The error message (the call flow may differ because I am running this code from another program):
File "Test.py", line 161, in <module> testsuite.scheduler()
File "Test.py", line 91, in scheduler self.launched_query_with("Without_date_range")
File "Test.py", line 55, in launched_query_with test.modifiy_query("Remove",tagname='.//first-name')
File "/home/XXX/YYYY/common.py", line 287, in modifiy_query parent.remove(child)
File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 337, in remove self._children.remove(element)
ValueError: list.remove(x): x not in list
Thanks,
Priyank Shah
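remove() takes an element as its parameter, not an XPath, and it only removes direct children of the element it is called on.
Instead of:
root.remove(tag_name)
you should find each parent and remove the matching children from it (here tag_name is a plain tag name such as "first-name", not a path):
for parent in list(root.iter()):
    for element in parent.findall(tag_name):
        parent.remove(element)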
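Because Element.remove() only detaches direct children, a nested tag such as first-name has to be removed from its own parent; calling root.remove() on a grandchild raises the same "x not in list" ValueError shown in the traceback. A minimal sketch on a shortened copy of the request, where remove_all is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the request document from the question.
request_xml = """<request>
  <orgaqnization><name>org1</name>
    <employee><first-name>abc</first-name><last-name>def</last-name></employee>
  </orgaqnization>
  <orgaqnization><name>org2</name>
    <employee><first-name>abc2</first-name><last-name>def2</last-name></employee>
  </orgaqnization>
</request>"""


def remove_all(root, tag):
    """Remove every element named `tag` below root; return how many were removed."""
    removed = 0
    # Snapshot the tree first so removals cannot disturb the traversal.
    for parent in list(root.iter()):
        # findall with a bare tag name returns only direct children of parent.
        for child in parent.findall(tag):
            parent.remove(child)
            removed += 1
    return removed


root = ET.fromstring(request_xml)
count = remove_all(root, "first-name")
remaining = root.findall(".//first-name")
print(count, remaining)
```

ElementTree elements keep no reference to their parent, which is why the sketch walks every element and asks each one for matching children.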
