lxml parse xsd file without Schema URL - python

I am using lxml to parse an xsd file and am looking for an easy way to remove the URL namespace attached to each element name. Here's the xsd file:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="rootelement">
<xs:complexType>
<xs:choice maxOccurs="unbounded">
<xs:element minOccurs="1" maxOccurs="1" name="element1">
<xs:complexType>
<xs:all>
<xs:element name="subelement1" type="xs:string" />
<xs:element name="subelement2" type="xs:integer" />
<xs:element name="subelement3" type="xs:dateTime" />
</xs:all>
<xs:attribute name="id" type="xs:integer" use="required" />
</xs:complexType>
</xs:element>
</xs:choice>
<xs:attribute fixed="2.0" name="version" type="xs:decimal" use="required" />
</xs:complexType>
</xs:element>
</xs:schema>
and using this code:
from lxml import etree
parser = etree.XMLParser()
data = etree.parse(open("testschema.xsd"),parser)
root = data.getroot()
rootelement = root.getchildren()[0]
rootelementattribute = rootelement.getchildren()[0].getchildren()[1]
print "root element tags"
print rootelement[0].tag
print rootelementattribute.tag
elements = rootelement.getchildren()[0].getchildren()[0].getchildren()
elements_attribute = elements[0].getchildren()[0].getchildren()[1]
print "element tags"
print elements[0].tag
print elements_attribute.tag
subelements = elements[0].getchildren()[0].getchildren()[0].getchildren()
print "subelements"
print subelements
I get the following output
root element tags
{http://www.w3.org/2001/XMLSchema}complexType
{http://www.w3.org/2001/XMLSchema}attribute
element tags
{http://www.w3.org/2001/XMLSchema}element
{http://www.w3.org/2001/XMLSchema}attribute
subelements
[<Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb16e0>, <Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb1780>, <Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb17d0>]
I don't want "{http://www.w3.org/2001/XMLSchema}" to appear at all when I pull the tag data (altering the xsd file is not an option). The reason I need the xsd tag info is that I am using it to validate column names from a series of flat files. At the "element" level I am pulling multiple elements as well as their subelements, and I use a dictionary of those names to validate the columns. Also, any suggestions on improving the code above would be greatly appreciated, such as a way to use fewer "getchildren" calls, or just a way to make it more organized.

I'd use:
print elem.tag.split('}')[-1]
But you could also use the xpath function local-name():
print elem.xpath('local-name()')
As for fewer getchildren() calls: just leave them out. getchildren() is a deprecated way of making a list of the direct children (you should just use list(elem) instead if you actually want this).
You can iterate over an element, or index into it, directly. For example, rootelement[0] gives you the first child element of rootelement (and is more efficient than rootelement.getchildren()[0], because that would act like list(rootelement) and create a new list first).
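As a minimal sketch of both suggestions (stripping the namespace and indexing or iterating instead of calling getchildren()), the snippet below uses a shortened copy of the schema from the question. It relies on the standard library's xml.etree.ElementTree so that it runs without lxml installed; the tag handling shown applies to lxml elements as well. The helper name local_name is hypothetical, not part of either library.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the schema from the question (assumption: trimmed for brevity).
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="rootelement">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="element1">
          <xs:complexType>
            <xs:all>
              <xs:element name="subelement1" type="xs:string"/>
              <xs:element name="subelement2" type="xs:integer"/>
            </xs:all>
            <xs:attribute name="id" type="xs:integer" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

XS = "{http://www.w3.org/2001/XMLSchema}"


def local_name(tag):
    """Hypothetical helper: drop a leading '{uri}' from a tag, if present."""
    return tag.split('}')[-1]


root = ET.fromstring(XSD)
rootelement = root[0]            # index the element directly, no getchildren()
complex_type = rootelement[0]

# Iterate over an element directly to visit its children.
child_tags = [local_name(child.tag) for child in complex_type]

# iter() walks the whole subtree, so no chain of child lookups is needed.
names = [el.get("name") for el in root.iter(XS + "element")]

print(local_name(complex_type.tag))
print(child_tags)
print(names)
```

Collecting the name attributes with iter() gives a flat list of element names in document order, which may be enough for checking column names against a dictionary.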

I wonder why etree.XMLParser(ns_clean=True) doesn't work. It did not work for me, so I did it by taking the namespace from root.nsmap, wrapping it in braces, and replacing that with an empty string:
print rootelement[0].tag.replace('{%s}' %root.nsmap['xs'], '')

The easiest thing to do is to just use string slicing to remove namespace prefix:
>>> print rootelement[0].tag[34:]
complexType

If the URI might change in the future (for some unknown reason or you're truly paranoid), consider the following:
print "root element tags"
tag, nsmap, prefix = rootelement[0].tag, rootelement[0].nsmap, rootelement[0].prefix
tag = tag[len(nsmap[prefix]) + 2:]
print tag
This is a very unlikely case, but who knows?

Related

Zeep create xs:choice element

I have wsdl with ArrayOfVEHICLE type:
<xs:complexType name="ArrayOfVEHICLE">
<xs:sequence>
<xs:choice maxOccurs="unbounded" minOccurs="0">
<xs:element name="VEHICLE" nillable="true" type="tns:VEHICLE"/>
<xs:element name="VEHICLEV2" nillable="true" type="tns:VEHICLEV2"/>
</xs:choice>
</xs:sequence>
</xs:complexType>
I am trying to create element with that type with zeep:
vehicle_v2_type = client.get_type("ns0:ArrayOfVEHICLE")
vehicle_v2 = vehicle_v2_type(VEHICLEV2={...})
And I get an error:
TypeError: {http://www.vsk.ru}ArrayOfVEHICLE() got an unexpected keyword argument 'VEHICLE2'. Signature: `({VEHICLE: {http://www.vsk.ru}VEHICLE} | {VEHICLEV2: {http://www.vsk.ru}VEHICLEV2})[]`
I have tried using _value_1 method from zeep docs like this:
vehicle_v2 = vehicle_v2_type(_value_1={"VEHICLEV2": {...}})
And I get another error:
TypeError: No complete xsd:Sequence found for the xsd:Choice '_value_1'.
The signature is: ({VEHICLE: {http://www.vsk.ru}VEHICLE} | {VEHICLEV2: {http://www.vsk.ru}VEHICLEV2})[]
Anybody knows how to create that element with zeep?
OK, I got it. My WSDL says that the choice element has to be a list, because of this signature:
<xs:choice maxOccurs="unbounded" minOccurs="0">
The easy way is to create a nested list using _value_1, without factories in my case:
client.service.SomeService(
    ...
    vehicles={  # Element with ArrayOfVEHICLE type
        "_value_1": [
            {
                "VEHICLEV2": {...}
            }
        ]
    }
)
Hope this will help someone.

Very Strange XML Schema Issue

I'm trying to parse custom XML file formats with PyXB. So, I first wrote the following XML schema:
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="outertag" minOccurs="0" maxOccurs="1">
<xs:complexType>
<xs:all>
<xs:element name="innertag0"
minOccurs="0"
maxOccurs="unbounded"/>
<xs:element name="innertag1"
minOccurs="0"
maxOccurs="unbounded"/>
</xs:all>
</xs:complexType>
</xs:element>
</xs:schema>
I used the following pyxbgen command to generate the Python module's source, py_schema_module.py:
pyxbgen -m py_schema_module -u schema.xsd
I then wrote the following script for parsing an XML file I call example.xml:
#!/usr/bin/env python2.7
import py_schema_module

if __name__ == "__main__":
    with open("example.xml", "r") as f:
        py_schema_module.CreateFromDocument(f.read())
I use that script to determine the legality of example.xml's syntax. For instance, the following example.xml file has legal syntax per the schema:
<outertag>
<innertag0></innertag0>
<innertag1></innertag1>
</outertag>
So does this:
<outertag>
<innertag1></innertag1>
<innertag0></innertag0>
</outertag>
However, the following syntax is illegal:
<outertag>
<innertag1></innertag1>
<innertag0></innertag0>
<innertag1></innertag1>
</outertag>
So is this:
<outertag>
<innertag0></innertag0>
<innertag1></innertag1>
<innertag0></innertag0>
</outertag>
I am able to write innertag0 and then innertag1. I am also able to write innertag1 and then innertag0. I can also repeat the instances of innertag0 and innertag1 arbitrarily (examples not shown for the sake of brevity). However, what I cannot do is switch between innertag0 and innertag1.
Let's assume I want the format to support this functionality. How should I alter my XML schema file?
The following XML Schema (XSD) 1.0 should cover your use case regardless of the sequential order of the innertag(0|1) element. Default value for both minOccurs and maxOccurs is 1.
Useful link: XML schema, why xs:group can't be child of xs:all?
XML
<outertag>
<innertag1></innertag1>
<innertag0></innertag0>
</outertag>
XSD
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:element name="outertag">
<xs:complexType>
<xs:all>
<xs:element name="innertag0" type="xs:string"/>
<xs:element name="innertag1" type="xs:string"/>
</xs:all>
</xs:complexType>
</xs:element>
</xs:schema>
Your schema processor doesn't seem to be doing very careful checking against the spec.
If I try to process your schema as an XSD 1.0 schema with Saxon, it tells me there are four errors:
Error at xs:element on line 3 column 59 of test.xsd:
Attribute @minOccurs is not allowed on element <xs:element>
Error at xs:element on line 3 column 59 of test.xsd:
Attribute @maxOccurs is not allowed on element <xs:element>
Error at xs:all on line 5 column 15 of test.xsd:
Within <xs:all>, an <xs:element> must have @maxOccurs equal to 0 or 1
Error at xs:all on line 5 column 15 of test.xsd:
Within <xs:all>, an <xs:element> must have @maxOccurs equal to 0 or 1
Schema processing failed: 4 errors were found while processing the schema
The first two say that minOccurs and maxOccurs are not allowed on a global element declaration.
The last two say that maxOccurs must be 0 or 1 within xs:all: XSD 1.0 doesn't allow an element to repeat when the content model is xs:all. Your processor told you it was an error in the XML instance, but it's actually an error in your schema.
XSD 1.1 does allow multiple occurrences within xs:all. If I correct the global element declaration by deleting @minOccurs and @maxOccurs, the schema is now valid under XSD 1.1, and allows the interleaved instance examples that you were having trouble with.
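If you are limited to an XSD 1.0 processor, one common alternative (not taken from the answers above) is to replace xs:all with a repeatable xs:choice, which accepts the two elements in any order and any number of times. The sketch below checks that idea with lxml rather than PyXB, purely as an assumption for illustration; the schema text is a rewrite of the one in the question, not the asker's final file.

```python
from lxml import etree

# XSD 1.0 schema: a repeatable choice instead of xs:all (illustrative rewrite).
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="outertag">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="innertag0"/>
        <xs:element name="innertag1"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))

# The four instance documents from the question, in the same order.
DOCS = [
    "<outertag><innertag0/><innertag1/></outertag>",
    "<outertag><innertag1/><innertag0/></outertag>",
    "<outertag><innertag1/><innertag0/><innertag1/></outertag>",
    "<outertag><innertag0/><innertag1/><innertag0/></outertag>",
]

# validate() returns True or False for each parsed document.
results = [schema.validate(etree.fromstring(doc)) for doc in DOCS]
print(results)
```

The trade-off is that a repeatable choice cannot require that each element appears at least once; it only constrains which elements may appear.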

Parsing XSD files does not work -> Cannot find any tags

I am currently trying to parse an XSD file in Python using the lxml library.
For testing purposes I pieced together the following file:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.w3schools.com" elementFormDefault="qualified">
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:simpleType name="BaselineShiftValueType">
<xs:annotation>
<xs:documentation>The actual definition is
baseline | sub | super | &lt;percentage&gt; | &lt;length&gt; | inherit
not sure that union can do this
</xs:documentation>
</xs:annotation>
<xs:restriction base="string"/>
</xs:simpleType>
</xs:schema>
Now I tried to get the children of the root (schema), which would be: xs:element and xs:simpleType.
By iterating over the children of the root, everything works fine:
root = self.XMLTree.getroot()
for child in root:
    print("{}: {}".format(child.tag, child.attrib))
This leads to the output:
{http://www.w3.org/2001/XMLSchema}element: {'name': 'note'}
{http://www.w3.org/2001/XMLSchema}simpleType: {'name': 'BaselineShiftValueType'}
But when I want to have only children of a certain type, it does not work:
root = self.XMLTree.getroot()
element = self.XMLTree.find("element")
print(str(element))
This gives me the following output:
None
Also using findall or writing ./element or .//element does not change the result.
I am quite sure I am missing something. What is the right way to do this?
You are missing the namespace. Unprefixed names in a search path are considered as belonging to no namespace, while your elements are in the XML Schema namespace. (register_namespace will not help here: it is a module-level function that only controls the prefixes used when serializing.) Qualify the name with the namespace URI in braces instead:
element = self.XMLTree.find("{http://www.w3.org/2001/XMLSchema}element")
To follow up on @helderdarocha's answer, you can also define your namespace in a dictionary and pass it to the search functions, as in the Python xml.etree.ElementTree docs:
ns = {'xs': "http://www.w3.org/2001/XMLSchema"}
element = self.XMLTree.find("xs:element", ns)
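A small sketch of the prefix-mapping approach, using the standard library's xml.etree.ElementTree (lxml's find and findall accept the same kind of mapping through their namespaces argument). The schema is a shortened version of the one in the question, with the xs prefix declared; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened version of the schema from the question (assumption: trimmed).
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.w3schools.com"
           elementFormDefault="qualified">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="from" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="BaselineShiftValueType">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>"""

# The prefix used here only has to match the prefix in the search path;
# what identifies the elements is the URI.
NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

root = ET.fromstring(XSD)

unprefixed = root.find("element")              # no namespace, so nothing matches
top_level = root.find("xs:element", NS)        # direct child of the root
nested = [e.get("name") for e in root.findall(".//xs:element", NS)]
simple_types = root.findall("xs:simpleType", NS)

print(unprefixed)
print(top_level.get("name"))
print(nested)
print([s.get("name") for s in simple_types])
```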

End-to-end example with PyXB. From an XSD schema to an XML document

I am having a hard time getting started with PyXB.
Say I have an XSD file (an XML schema). I would like to:
Use PyXB to define Python objects according to the schema.
Save those objects to disk as XML files that satisfy the schema.
How can I do this with PyXB? Below is a simple example of an XSD file (from Wikipedia) that encodes an address, but I am having a hard time even getting started.
<?xml version="1.0" encoding="utf-8"?>
<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="Address">
<xs:complexType>
<xs:sequence>
<xs:element name="FullName" type="xs:string" />
<xs:element name="House" type="xs:string" />
<xs:element name="Street" type="xs:string" />
<xs:element name="Town" type="xs:string" />
<xs:element name="County" type="xs:string" minOccurs="0" />
<xs:element name="PostCode" type="xs:string" />
<xs:element name="Country" minOccurs="0">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="IN" />
<xs:enumeration value="DE" />
<xs:enumeration value="ES" />
<xs:enumeration value="UK" />
<xs:enumeration value="US" />
</xs:restriction>
</xs:simpleType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Update
Once I run
pyxbgen -u example.xsd -m example
I get a example.py that has the following classes:
example.Address example.STD_ANON
example.CTD_ANON example.StringIO
example.CreateFromDOM example.pyxb
example.CreateFromDocument example.sys
example.Namespace
I think I understand what CreateFromDocument does (it presumably reads an XML document and creates the corresponding Python object), but which class do I use to create a new object and then save it as XML?
A simple google search brings this: http://pyxb.sourceforge.net/userref_pyxbgen.html#pyxbgen
In particular the part that says:
Translate this into Python with the following command:
pyxbgen -u po1.xsd -m po1
The -u parameter identifies a schema
document describing contents of a namespace. The parameter may be a
path to a file on the local system, or a URL to a network-accessible
location like
http://www.weather.gov/forecasts/xml/DWMLgen/schema/DWML.xsd. The -m
parameter specifies the name to be used by the Python module holding
the bindings generated for the namespace in the preceding schema.
After running this, the Python bindings will be in a file named
po1.py.
EDIT Following your update:
Now that you have your generated Address class and all the associated helpers, look at http://pyxb.sourceforge.net/userref_usebind.html in order to learn how to use them. For your specific question, you want to study the "Creating Instances in Python Code" paragraph. Basically to generate XML from your application data you simply do:
import example

address = example.Address()
address.FullName = "Jo La Banane"
# fill other members of address
# ...

# toxml("utf-8") returns encoded data, so write the file in binary mode
with open('myoutput.xml', 'wb') as f:
    f.write(address.toxml("utf-8"))
Now it's up to you to be curious and read the code being generated, pyxb's doc, call the various generated methods and experiment!

How do I require that an element has either one set of attributes or another in an XSD schema?

I'm working with an XML document where a tag must either have one set of attributes or another. For example, it needs to either look like <tag foo="hello" bar="kitty" /> or <tag spam="goodbye" eggs="world" /> e.g.
<root>
<tag foo="hello" bar="kitty" />
<tag spam="goodbye" eggs="world" />
</root>
So I have an XSD schema where I use the xs:choice element to choose between two different attribute groups:
<xsi:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="qualified">
<xs:element name="root">
<xs:complexType>
<xs:sequence>
<xs:element maxOccurs="unbounded" name="tag">
<xs:choice>
<xs:complexType>
<xs:attribute name="foo" type="xs:string" use="required" />
<xs:attribute name="bar" type="xs:string" use="required" />
</xs:complexType>
<xs:complexType>
<xs:attribute name="spam" type="xs:string" use="required" />
<xs:attribute name="eggs" type="xs:string" use="required" />
</xs:complexType>
</xs:choice>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xsi:schema>
However, when using lxml to attempt to load this schema, I get the following error:
>>> from lxml import etree
>>> etree.XMLSchema( etree.parse("schema_choice.xsd") )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "xmlschema.pxi", line 85, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:118685)
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}element': The content is not valid. Expected is (annotation?, ((simpleType | complexType)?, (unique | key | keyref)*))., line 7
Since the error is with the placement of my xs:choice element, I've tried putting it in different places, but no matter what I try, I can't seem to use it to define a tag to have either one set of attributes (foo and bar) or another (spam and eggs).
Is this even possible? And if so, then what is the correct syntax?
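Unfortunately, it is not possible to use xs:choice with attributes in XML Schema 1.0, which is the version lxml validates against: xs:choice groups element particles only, and an element declaration can have just one type. You will need to implement this validation at a higher level.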
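One way to read "at a higher level" is a check in application code after parsing. The sketch below is only an illustration under that assumption: it uses the standard library's xml.etree.ElementTree, and the names ALLOWED_ATTRIBUTE_SETS and invalid_tags are hypothetical.

```python
import xml.etree.ElementTree as ET

# Each <tag> must carry exactly one of these attribute sets.
ALLOWED_ATTRIBUTE_SETS = [{"foo", "bar"}, {"spam", "eggs"}]


def invalid_tags(xml_text):
    """Return the <tag> elements whose attribute names match neither set."""
    root = ET.fromstring(xml_text)
    return [
        tag for tag in root.iter("tag")
        if set(tag.attrib) not in ALLOWED_ATTRIBUTE_SETS
    ]


GOOD = """<root>
  <tag foo="hello" bar="kitty" />
  <tag spam="goodbye" eggs="world" />
</root>"""

# Mixes one attribute from each set, so it should be rejected.
BAD = """<root>
  <tag foo="hello" eggs="world" />
</root>"""

print(len(invalid_tags(GOOD)))
print(len(invalid_tags(BAD)))
```

A schema that declares all four attributes as optional could still be used for the structural checks, with this function covering the either/or rule.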
