Validate XML with namespaces against Schematron using lxml in Python - python

I am not able to get lxml Schematron validator to recognize namespaces. Validation works fine in code without namespaces.
This is for Python 3.7.4 and lxml 4.4.0 on MacOS 10.15
Here is the schematron file
<?xml version='1.0' encoding='UTF-8'?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
xmlns:ns1="http://foo">
<pattern>
<rule context="//ns1:bar">
<assert test="number(.) = 2">
bar must be 2
</assert>
</rule>
</pattern>
</schema>
and here is the xml file
<?xml version="1.0" encoding="UTF-8"?>
<zip xmlns:ns1="http://foo">
<ns1:bar>3</ns1:bar>
</zip>
here is the python code
from lxml import etree, isoschematron
from plumbum import local
schematron_doc = etree.parse(local.path('rules.sch'))
schematron = isoschematron.Schematron(schematron_doc)
xml_doc = etree.parse(local.path('test.xml'))
is_valid = schematron.validate(xml_doc)
assert not is_valid
What I get: lxml.etree.XSLTParseError: xsltCompilePattern : failed to compile '//ns1:bar'
If I remove ns1 from both the XML file and the Schematron file, the example works perfectly-- no error message.
There must be a trick to registering namespaces in lxml Schematron that I am missing. Has anyone done this?

As it turns out, there is a specific way to register namespaces in Schematron. It is described in the Schematron ISO standard
It only required a small change to the Schematron file, adding the "ns" element in as follows:
<?xml version='1.0' encoding='UTF-8'?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
<ns uri="http://foo" prefix="ns1"/>
<pattern>
<rule context="//ns1:bar">
<assert test="number(.) = 2">
bar must be 2
</assert>
</rule>
</pattern>
</schema>
I won't remove the question, since there is a dearth of examples of Schematron rules using namespaces. Hopefully it can be helpful to someone.

Related

Reading xml with lxml lib geting strange string from xmlns tag

I am writing program to work on xml file and change it. But when I try to get to any part of it I get some extra part.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that url and how to stop it from showing up?
You have defined a default namespace (Wikipedia, lxml tutorial). When defined, it is a part of every child tag.
If you want to print the tag without the namespace, it's easy
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.

Beautiful Soup fails to recognize UTF-8 encoding on Python 3, IPython 6 console

I am trying to read an xml document using Beautiful Soup on Python 3.6.2, IPython 6.1.0, Windows 10, and I can't get the encoding right.
Here's my test xml, saved as a file in UTF8-encoding:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<info name="愛よ">ÜÜÜÜÜÜÜ</info>
<items>
<item thing="ÖöÖö">"23Äßßß"</item>
</items>
</root>
First check the XML using ElementTree:
import xml.etree.ElementTree as ET
def printXML(xml,indent=''):
print(indent+str(xml.tag)+': '+(xml.text if xml.text is not None else '').replace('\n',''))
if len(xml.attrib) > 0:
for k,v in xml.attrib.items():
print(indent+'\t'+k+' - '+v)
if xml.getchildren():
for child in xml.getchildren():
printXML(child,indent+'\t')
xml0 = ET.parse("test.xml").getroot()
printXML(xml0)
The output is correct:
root:
info: ÜÜÜÜÜÜÜ
name - 愛よ
items:
item: "23Äßßß"
thing - ÖöÖö
Now read the same file with Beautiful Soup and pretty-print it:
import bs4
with open("test.xml") as ff:
xml = bs4.BeautifulSoup(ff,"html5lib")
print(xml.prettify())
Output:
<!--?xml version="1.0" encoding="UTF-8"?-->
<html>
<head>
</head>
<body>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
</body>
</html>
This is just wrong. Doing the call with explicite encoding specified bs4.BeautifulSoup(ff,"html5lib",from_encoding="UTF-8") doesn't change the result.
Doing
print(xml.original_encoding)
outputs
None
So Beautiful Soup is apparently unable to detect the original encoding even though the file is encoded in UTF8 (according to Notepad++) and the header information says UTF-8 as well, and I do have chardet installed as the doc recommends.
Am I making a mistake here? What could be causing this?
EDIT:
When I invoke the code without the html5lib I get this warning:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib").
This usually isn't a problem, but if you run this code on another system, or in a different virtual environment,
it may use a different parser and behave differently.
The code that caused this warning is on line 241 of the file C:\Users\My.Name\AppData\Local\Continuum\Anaconda2\envs\Python3\lib\site-packages\spyder\utils\ipython\start_kernel.py.
To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html5lib")
markup_type=markup_type))
EDIT 2:
As suggested in a comment I tried bs4.BeautifulSoup(ff,"html.parser"), but the problem remains.
Then I installed lxml and tried bs4.BeautifulSoup(ff,"lxml-xml"), still the same output.
What also strikes me as odd is that even when specifying an encoding like bs4.BeautifulSoup(ff,"lxml-xml",from_encoding='UTF-8') the value of xml.original_encoding is None contrary to what is written in the doc.
EDIT 3:
I put my xml contents into a string
xmlstring = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"
And used bs4.BeautifulSoup(xmlstring,"lxml-xml"), now I'm getting the correct output:
<?xml version="1.0" encoding="utf-8"?>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
So it seems something is wrong with the file after all.
Found the error, I have to specify the encoding when opening the file:
with open("test.xml",encoding='UTF-8') as ff:
xml = bs4.BeautifulSoup(ff,"html5lib")
As I'm on Python 3 I thought the value of encoding was UTF-8 by default, but it turned out it's system-dependent and on my system it's cp1252.

Remove namespace with xmltodict in Python

xmltodict converts XML to a Python dictionary. It supports namespaces. I can follow the example on the homepage and successfully remove a namespace. However, I cannot remove the namespace from my XML and cannot identify why? Here is my XML:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<section1
mystatus:field1="data1"
mystatus:field2="data2" />
<section2
mystatus:lineA="outputA"
mystatus:lineB="outputB" />
</status>
And using:
xmltodict.parse(xml,process_namespaces=True,namespaces={'http://localhost/mystatus':None})
I get:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'#http://localhost/mystatus:field1', u'data1'), (u'#http://localhost/mystatus:field2', u'data2')])), (u'section2', OrderedDict([(u'#http://localhost/mystatus:lineA', u'outputA'), (u'#http://localhost/mystatus:lineB', u'outputB')]))]))])
instead of:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'field1', u'data1'), (u'field2', u'data2')])), (u'section2', OrderedDict([(u'lineA', u'outputA'), (u'#lineB', u'outputB')]))]))])
Am I making some simple mistake, or is there something about my XML that prevents the process_namespace modification from working correctly?
xmltodict is based on expat, so namespaces should applied to the class name, not attribute names:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<mystatus:section1 field1="data1" field2="data2" />
<mystatus:section2 lineA="outputA" lineB="outputB" />
</status>
When parsed with:
foo = xmltodict.parse(xml,
process_namespaces=True,
namespaces={'http://localhost/mystatus':None})
outputs:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'#field1', u'data1'), (u'#field2', u'data2')])), (u'section2', OrderedDict([(u'#lineA', u'outputA'), (u'#lineB', u'outputB')]))]))])
Accessing it is easy:
# Get attribute 'lineA' from class 'section2' from class 'status'
>>> foo.get('status').get('section2').get('#lineA')
u'outputA'
Attribute namespaces are only required when you have multiple attributes of the same name (e.g. multiple id's or multiple prices, etc), in which case, I couldn't get expat or xmltodict to parse it correctly. YMMV though.

Cannot write XML file with default namespace [duplicate]

This question already has answers here:
Saving XML files using ElementTree
(5 answers)
Closed 7 years ago.
I'm writing a Python script to update Visual Studio project files. They look like this:
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" DefaultTargets="Build"
xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PropertyGroup>
...
The following code reads and then writes the file:
import xml.etree.ElementTree as ET
tree = ET.parse(projectFile)
root = tree.getroot()
tree.write(projectFile,
xml_declaration = True,
encoding = 'utf-8',
method = 'xml',
default_namespace = "http://schemas.microsoft.com/developer/msbuild/2003")
Python throws an error at the last line, saying:
ValueError: cannot use non-qualified names with default_namespace option
This is surprising since I'm just reading and writing, with no editing in between. Visual Studio refuses to load XML files without a default namespace, so omitting it is not optional.
Why does this error occur? Suggestions or alternatives welcome.
This is a duplicate to Saving XML files using ElementTree
The solution is to define your default namespace BEFORE parsing the project file.
ET.register_namespace('',"http://schemas.microsoft.com/developer/msbuild/2003")
Then write out your file as
tree.write(projectFile,
xml_declaration = True,
encoding = 'utf-8',
method = 'xml')
You have successfully round-tripped your file. And avoided the creation of ns0 tags everywhere.
I think that lxml does a better job handling namespaces. It aims for an ElementTree-like interface but uses xmllib2 underneath.
>>> import lxml.etree
>>> doc=lxml.etree.fromstring("""<?xml version="1.0" encoding="utf-8"?>
... <Project ToolsVersion="4.0" DefaultTargets="Build"
... xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
... <PropertyGroup>
... </PropertyGroup>
... </Project>""")
>>> print lxml.etree.tostring(doc, xml_declaration=True, encoding='utf-8', method='xml', pretty_print=True)
<?xml version='1.0' encoding='utf-8'?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" ToolsVersion="4.0" DefaultTargets="Build">
<PropertyGroup>
</PropertyGroup>
</Project>
This was the closest answer I could find to my problem. Putting the:
ET.register_namespace('',"http://schemas.microsoft.com/developer/msbuild/2003")
just before the parsing of my file did not work.
You need to find the specific namespace the xml file you are loading is using. To do that, I printed out the Element of the ET tree node's tag which gave me my namespace to use and the tag name, copy that namespace into:
ET.register_namespace('',"XXXXX YOUR NAMESPACEXXXXXX")
before you start parsing your file then that should remove all the namespaces when you write.

How can lxml validate some XML against both an XSD file while also loading an inline schema too?

I'm having problems getting lxml to successfully validate some xml. The XSD schema and XML file are both from Amazon documentation so should be compatible. But the XML itself refers to another schema that's not being loaded.
Here is my code, which is based on the lxml validation tutorial:
xsd_doc = etree.parse('ProductImage.xsd')
xsd = etree.XMLSchema(xsd_doc)
xml = etree.parse('ProductImage_sample.xml')
xsd.validate(xml)
print xsd.error_log
"ProductImage_sample.xml:2:0:ERROR:SCHEMASV:SCHEMAV_CVC_ELT_1: Element 'AmazonEnvelope': No matching global declaration available for the validation root."
I get no errors if I validate against amzn-envelope.xsd instead of ProductImage.xsd, but that defeats the point of seeing if a given Image feed is valid. All xsd & xml files mentioned are in my working directory along with my python script by the way.
Here is a snippet of the sample xml, which should definately be valid:
<?xml version="1.0"?>
<AmazonEnvelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="amzn-envelope.xsd">
<Header>
<DocumentVersion>1.01</DocumentVersion>
<MerchantIdentifier>Q_M_STORE_123</MerchantIdentifier>
</Header>
<MessageType>ProductImage</MessageType>
<Message>
<MessageID>1</MessageID>
<OperationType>Update</OperationType>
<ProductImage>
<SKU>1234</SKU>
Here is a snippet of the schema (this file is not public so I can't show all of it):
<?xml version="1.0"?>
<!-- Revision="$Revision: #5 $" -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xsd:include schemaLocation="amzn-base.xsd"/>
<xsd:element name="ProductImage">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="SKU"/>
I can say that following the include to amzn-base.xsd does not end up reaching a definition of the AmazonEnvelope tag. So my questions is: can lxml load schemas via a tag like <AmazonEnvelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="amzn-envelope.xsd">. And if not, how can I validate my Image feed?
The answer is I should validate by the parent schema file, which as mentioned at the top of the XML file is amzn-envelope.xsd as this contains the line:
<xsd:include schemaLocation="ProductImage.xsd"/>
In general then, lxml won't read such a declaration as xsi:noNamespaceSchemaLocation="amzn-envelope.xsd" but if you can find the parent schema to validate against then this should hopefully include the specific schema you're interested in.

Categories

Resources