Parsing XML with SAX/Python + no validation - python

I am new to Python and I'm trying to parse an XML file with SAX without validating it.
The head of my xml file is:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE n:document SYSTEM "schema.dtd">
<n:document....
and I've tried to parse it with python 2.5.2:
from xml.sax import make_parser, handler
import sys
parser = make_parser()
parser.setFeature(handler.feature_namespaces,True)
parser.setFeature(handler.feature_validation,False)
parser.setContentHandler(handler.ContentHandler())
parser.parse(sys.argv[1])
but I got an error:
python doc.py document.xml
(...)
File "/usr/lib/python2.5/urllib2.py", line 244, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: schema.dtd
I don't want the SAX parser to look for a schema. Where am I wrong ?
Thanks !

expatreader considers the DTD external subset as an external general entity. So the feature you want is:
parser.setFeature(handler.feature_external_ges, False)
However, it's a bit dodgy to point the DTD external subset at a non-existent URL; as this shows, it's not only validating parsers that read it.
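A minimal Python 3 sketch of the same idea, assuming an in-memory document instead of a file on disk. The namespace URI `urn:example:n` and the `NameCollector` handler are made up for illustration, because the original document is elided after `<n:document`. Recent Python 3 releases already leave external general entities off by default, so setting the feature explicitly mainly documents the intent.

```python
import io
from xml.sax import make_parser, handler

# Hypothetical document: the DOCTYPE points at a DTD that does not exist.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE n:document SYSTEM "schema.dtd">
<n:document xmlns:n="urn:example:n"><n:item/></n:document>"""


class NameCollector(handler.ContentHandler):
    """Illustrative handler that records (namespace URI, local name) pairs."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        self.names.append(name)


def parse_without_dtd(data):
    parser = make_parser()
    parser.setFeature(handler.feature_namespaces, True)
    parser.setFeature(handler.feature_validation, False)
    # Do not fetch external general entities, which for expatreader
    # includes the DTD external subset.
    parser.setFeature(handler.feature_external_ges, False)
    collector = NameCollector()
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(data))
    return collector.names


names = parse_without_dtd(XML)
print(names)
```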

Related

Parse XML with Python resolving an external ENTITY reference

In my S1000D xml, it specifies a DOCTYPE with a reference to a public URL that contains references to a number of other files that contain all the valid character entities. I've used xml.etree.ElementTree and lxml to try to parse it and get a parse error with both indicating:
undefined entity &minus;: line 82, column 652
Even though &minus; is a valid entity according to the ENTITY reference specified.
The xml top is as follow:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dmodule [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
If you go out and get http://www.s1000d.org/S1000D_4-1/ent/ISOEntities, it will include 20 other ent files with one called iso-tech.ent which contains the line:
<!ENTITY minus "&#x2212;"> <!-- MINUS SIGN -->
In line 82 of the XML file, near column 652, is the following:
....Refer to 70&minus;41....
How can I run a Python script to parse this file without getting the undefined entity error?
Sorry, I don't want to specify parser.entity['minus'] = chr(0x2212), for example. I did that as a quick fix, but there are many character entity references.
I would like the parser to check Entity reference that is specified in the xml.
I'm surprised but I've gone around the sun and back and haven't found how to do this (or maybe I have but couldn't follow it).
If I update my XML file and add
<!ENTITY minus "&#x2212;">
it won't fail, so it's not the XML.
It fails on the parse. Here's code I use for ElementTree
import os
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError

fl = os.path.join(pth, fn)
try:
    root = ET.parse(fl)
except ParseError as p:
    print("ParseError : ", p)
Here's the code I use for lxml
import os
from lxml import etree

fl = os.path.join(pth, fn)
try:
    parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
    root = etree.parse(fl, parser=parser)
except etree.XMLSyntaxError as pe:
    print("lxml XMLSyntaxError: ", pe)
I would like the parser to load the ENTITY reference so that it knows that &minus; and all the other character entities specified in all the files are valid entity characters.
Thank you so much for your advice and help.
I'm going to answer for lxml. No reason to consider ElementTree if you can use lxml.
I think the piece you're missing is no_network=False in the XMLParser; it's True by default.
Example...
XML Input (test.xml)
<!DOCTYPE doc [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
<doc>
<test>Here's a test of minus: &minus;</test>
</doc>
Python
from lxml import etree
parser = etree.XMLParser(load_dtd=True,
no_network=False)
tree = etree.parse("test.xml", parser=parser)
etree.dump(tree.getroot())
Output
<doc>
<test>Here's a test of minus: −</test>
</doc>
If you wanted the entity reference retained, add resolve_entities=False to the XMLParser.
Also, instead of going out to an external location to resolve the parameter entity, consider setting up an XML Catalog. This will let you resolve public and/or system identifiers to local versions.
Example using same XML input above...
XML Catalog ("catalog.xml" in the directory "catalog test" (space used in directory name for testing))
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.1//EN" "http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<!-- The path in #uri is relative to this file (catalog.xml). -->
<uri name="http://www.s1000d.org/S1000D_4-1/ent/ISOEntities" uri="./ents/ISOEntities_stackoverflow.ent"/>
</catalog>
Entity File ("ISOEntities_stackoverflow.ent" in the directory "catalog test/ents". Changed the value to "BAM!" for testing)
<!ENTITY minus "BAM!">
Python (Changed no_network to True for additional evidence that the local version of http://www.s1000d.org/S1000D_4-1/ent/ISOEntities is being used.)
import os
from urllib.request import pathname2url
from lxml import etree
# The XML_CATALOG_FILES environment variable is used by libxml2 (which is used by lxml).
# See http://xmlsoft.org/catalog.html.
try:
    xcf_env = os.environ['XML_CATALOG_FILES']
except KeyError:
    # Path to catalog must be a url.
    catalog_path = f"file:{pathname2url(os.path.join(os.getcwd(), 'catalog test/catalog.xml'))}"
    # Temporarily set the environment variable.
    os.environ['XML_CATALOG_FILES'] = catalog_path
parser = etree.XMLParser(load_dtd=True,
no_network=True)
tree = etree.parse("test.xml", parser=parser)
etree.dump(tree.getroot())
Output
<doc>
<test>Here's a test of minus: BAM!</test>
</doc>
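If lxml is not available, a rough standard-library fallback is to fill the `entity` dictionary of `xml.etree.ElementTree.XMLParser` in bulk rather than one name at a time. This is a different technique from the catalog approach above: it never reads the entity files named in the DOCTYPE. The sketch below assumes that the HTML entity table in `html.entities` covers the ISO names the document uses; that holds for `minus`, but it is an assumption for the rest of the S1000D entity sets.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Same test input as above, with the entity reference left unresolved.
XML = """<!DOCTYPE doc [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
<doc>
<test>Here's a test of minus: &minus;</test>
</doc>"""

parser = ET.XMLParser()
# Assumption: the HTML names stand in for the ISO entity names.
# The external entity files are NOT consulted here.
parser.entity.update((name, chr(cp)) for name, cp in name2codepoint.items())

root = ET.fromstring(XML, parser=parser)
print(root.find("test").text)
```

Any entity name missing from the table would still raise the same "undefined entity" error, so this only helps when the names in use overlap with the HTML set.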

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program that reads an XML file and changes it. But when I try to access any part of it, I get an extra prefix in the tag name.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that url and how to stop it from showing up?
You have defined a default namespace (Wikipedia, lxml tutorial). When defined, it is a part of every child tag.
If you want to print the tag without the namespace, it's easy
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
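As a small sketch of both options (stripping the namespace for display, or keeping it and searching with a prefix map), here is a version using the standard library's `xml.etree.ElementTree`, which reports tags in the same `{uri}name` form as lxml. The helper name `local_name` and the prefix `md` are arbitrary choices for this example.

```python
import xml.etree.ElementTree as ET

PACKAGE_XML = """<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>"""

# The prefix "md" is only a label used in the search paths below.
NS = {"md": "http://soap.sforce.com/2006/04/metadata"}


def local_name(tag):
    """Return the tag without its '{namespace}' part, if it has one."""
    return tag.rpartition("}")[2]


root = ET.fromstring(PACKAGE_XML)
print(local_name(root[0][0].tag))

# Namespace-aware search: map a prefix to the URI instead of stripping it.
members = [m.text for m in root.findall("md:types/md:members", NS)]
print(members)
```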

Error parsing a RDF XML file with rdflib in Python

I am trying to parse this RDF:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://sentic.net/api/en/concept/celebrate_special_occasion/polarity">
<rdf:type rdf:resource="http://sentic.net/api/concept/polarity"/>
<polarity xmlns="http://sentic.net" rdf:datatype="http://w3.org/2001/XMLSchema#float">0.551</polarity>
</rdf:Description>
</rdf:RDF>
I am loading it from the URL: http://sentic.net/api/en/concept/celebrate_special_occasion/polarity
To do this I am using this code:
import rdflib
g = rdflib.Graph()
g.parse("http://sentic.net/api/en/concept/celebrate_special_occasion/polarity", format='xml')
However, the code returns this error:
ParserError: http://sentic.net/api/en/concept/celebrate_special_occasion/polarity:4:67: Repeat node-elements inside property elements: http://w3.org/1999/02/22-rdf-syntax-ns#type
Does anyone know what is happening? Which element is repeated? How can I solve this?
It doesn't appear to be valid RDF. The W3C validator fails.
I loaded it with [rapper] and got a more descriptive error message.
rapper: Parsing URI http://sentic.net/api/en/concept/celebrate_special_occasion/polarity with parser rdfxml
rapper: Serializing with serializer turtle
rapper: Error - URI http://sentic.net/api/en/concept/celebrate_special_occasion/polarity:5 - property element 'Description' has multiple object node elements, skipping.

xml parsing error special characters

I have following xml that I want to parse with xml.dom.minidom module
<?xml version="1.0" encoding="UTF-8"?>
<RootTag>
<InnerTag>
<MyValue>"< here is special char."</MyValue>
</InnerTag>
</RootTag>
I have following snippet for parsing above xml
import xml.dom.minidom
xml.dom.minidom.parse('input_xml')
But I get following error:
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 26
The error above occurs only when the MyValue element contains '&' or '<'.
So,
how can I resolve this issue?
I don't want to change my XML by using escape sequences such as &lt;
and I want to keep the "" (quotes).
Your example is not well-formed XML. A literal < is not allowed in XML character data; it can only start markup. Your data needs to be wrapped in CDATA or escaped as &lt;
<![CDATA[< here is special char.]]>
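A short sketch of both fixes, assuming the XML is generated by code you control. `xml.sax.saxutils.escape` handles `&`, `<` and `>`; the helper `read_value` is just for this example. Either way, the parser hands back the original text, quotes included.

```python
import xml.dom.minidom
from xml.sax.saxutils import escape

raw_value = '"< here is special char."'
template = "<RootTag><InnerTag><MyValue>%s</MyValue></InnerTag></RootTag>"

# Option 1: escape the markup characters when writing the value.
escaped_doc = template % escape(raw_value)
# Option 2: wrap the value in a CDATA section
# (the value must not itself contain "]]>").
cdata_doc = template % ("<![CDATA[%s]]>" % raw_value)


def read_value(doc_text):
    """Parse the document and return the text content of <MyValue>."""
    dom = xml.dom.minidom.parseString(doc_text)
    node = dom.getElementsByTagName("MyValue")[0]
    # Text and CDATA nodes both expose their content as .data
    return "".join(child.data for child in node.childNodes)


print(read_value(escaped_doc))
print(read_value(cdata_doc))
```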

Parse xsd with values [python]

I'm trying to examine and extract some data from an xml file using python. I'm doing this by parsing with etree then looping through the elements:
import xml.etree.ElementTree as etree
root = etree.fromstring(xml_string)
for element in root.iter():
print("%s , %s , %s" % (element.tag, element.attrib, element.text))
This works fine for some test data, but the actual xml files that I'm working with seem to contain xsd tags along with the data. Below is an example
<wdtf:observationMember>
<wdtf:TimeSeriesObservation gml:id="ts1">
<gml:description>Reading using DTW (Depth To Water) from TOC</gml:description>
<gml:name codeSpace="http://www.bom.gov.au/std/water/xml/wio0.2/feature/TimeSeriesObservation/w00066/12/A/GroundWaterLevel/">1</gml:name>
<om:procedure xlink:href="#gwTOC12" />
<om:observedProperty xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/property//bom/GroundWaterLevel_m" />
<om:featureOfInterest xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/feature/BorePipeSamplingInterval/w00066/12" />
<wdtf:metadata>
<wdtf:TimeSeriesObservationMetadata>
<wdtf:regulationProperty>Reg200806.s3.2a</wdtf:regulationProperty>
<wdtf:status>validated</wdtf:status>
</wdtf:TimeSeriesObservationMetadata>
</wdtf:metadata>
<wdtf:result>
<wdtf:TimeSeries>
<wdtf:defaultInterpolationType>InstVal</wdtf:defaultInterpolationType>
<wdtf:defaultUnitsOfMeasure>m</wdtf:defaultUnitsOfMeasure>
<wdtf:defaultQuality>quality-A</wdtf:defaultQuality>
<wdtf:timeValuePair time="1915-12-09T12:00:00+10:00">51.82</wdtf:timeValuePair>
<wdtf:timeValuePair time="1917-12-18T12:00:00+10:00">41.38</wdtf:timeValuePair>
<wdtf:timeValuePair time="1924-05-23T12:00:00+10:00">21.95</wdtf:timeValuePair>
<wdtf:timeValuePair time="1988-02-02T12:00:00+10:00">7.56</wdtf:timeValuePair>
</wdtf:TimeSeries>
</wdtf:result>
</wdtf:TimeSeriesObservation>
</wdtf:observationMember>
Using this XML in the code above causes etree to return an error:
Traceback (most recent call last):
File "xml_test2.py", line 38, in <module>
root = etree.fromstring(xml_string)
File "<string>", line 124, in XML
ParseError: unbound prefix: line 1, column 4
Is there a different parser I should be using? Or can I remove the xsd tags somehow?
Thanks
From what I can see in your post, your parser is namespace aware and is complaining that XML namespace aliases are not resolved. Assuming that <wdtf:observationMember> is your topmost element, then you have to have the following at least:
<wdtf:observationMember xmlns:wdtf="some-uri">
The same applies for all other prefixes, such as gml, om, etc.
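A minimal sketch that reproduces the error and the fix on a shortened fragment. The URI `urn:example:wdtf` is a placeholder, not the real WDTF namespace; the actual URIs should come from the root element of the full file.

```python
import xml.etree.ElementTree as ET

# Fragment cut out of a larger file: the prefix is used but never declared.
unbound = "<wdtf:observationMember><wdtf:status>validated</wdtf:status></wdtf:observationMember>"

try:
    ET.fromstring(unbound)
    error_message = None
except ET.ParseError as exc:
    error_message = str(exc)
print(error_message)

# Same fragment with the prefix bound to a (placeholder) namespace URI.
bound = (
    '<wdtf:observationMember xmlns:wdtf="urn:example:wdtf">'
    "<wdtf:status>validated</wdtf:status>"
    "</wdtf:observationMember>"
)
root = ET.fromstring(bound)
for element in root.iter():
    print("%s , %s , %s" % (element.tag, element.attrib, element.text))
```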
