I am trying to parse this RDF:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://sentic.net/api/en/concept/celebrate_special_occasion/polarity">
<rdf:type rdf:resource="http://sentic.net/api/concept/polarity"/>
<polarity xmlns="http://sentic.net" rdf:datatype="http://w3.org/2001/XMLSchema#float">0.551</polarity>
</rdf:Description>
</rdf:RDF>
I am loading it from the URL: http://sentic.net/api/en/concept/celebrate_special_occasion/polarity
To do this I am using this code:
import rdflib
g = rdflib.Graph()
g.parse("http://sentic.net/api/en/concept/celebrate_special_occasion/polarity", format='xml')
However, the code returns this error:
ParserError: http://sentic.net/api/en/concept/celebrate_special_occasion/polarity:4:67: Repeat node-elements inside property elements: http://w3.org/1999/02/22-rdf-syntax-ns#type
Does anyone know what is happening? Which element is repeated? How can I solve this?
It doesn't appear to be valid RDF; the W3C validator fails on it. (Note that the namespaces in the document use http://w3.org/… rather than the standard http://www.w3.org/…, so none of the elements are actually in the RDF namespace.)
I loaded it with [rapper] and got a more descriptive error message:
rapper: Parsing URI http://sentic.net/api/en/concept/celebrate_special_occasion/polarity with parser rdfxml
rapper: Serializing with serializer turtle
rapper: Error - URI http://sentic.net/api/en/concept/celebrate_special_occasion/polarity:5 - property element 'Description' has multiple object node elements, skipping.
I have an XML file like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<product_info article_id="0006303562330" group_id="0006303562310" vendor_id="0006303562321"
subgroup_id="0006303562313">
<available>
...
Using pure Python, I want to end up with this:
<product_info article_id="0006303562330" group_id="0006303562310" vendor_id="0006303562321"
subgroup_id="0006303562313">
<available>
...
I get my XML in response_xml.text (response_xml is a Response (200) object), and I have tried this:
response_xml = response_xml.text.replace('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>','')
but I get:
AttributeError: 'str' object has no attribute 'text'
What the error is telling you is that your sample XML is already a plain string, so it has no .text attribute. You need to parse it first to get at the structure. You can use a parser like BeautifulSoup or ElementTree and work with its output.
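For instance, a minimal sketch with the stdlib ElementTree (the `<available>` content is invented here, since the original is elided):

```python
import xml.etree.ElementTree as ET

# The XML arrives as a string (requests' response.text); parse it rather
# than editing it with str.replace. The <available> content is invented.
xml_text = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<product_info article_id="0006303562330" group_id="0006303562310"
              vendor_id="0006303562321" subgroup_id="0006303562313">
    <available>yes</available>
</product_info>'''

root = ET.fromstring(xml_text)       # the XML declaration is consumed here
print(root.tag)                      # product_info
print(root.attrib['article_id'])     # 0006303562330

# Re-serializing with ET.tostring() omits the declaration by default,
# which is what the original replace() call was trying to achieve.
print(ET.tostring(root, encoding='unicode').splitlines()[0])
```

Note that once the string is parsed, stripping the `<?xml …?>` declaration by hand is unnecessary: the serializer simply doesn't emit it.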
I've got a KML file - I'm using the wikipedia 'default' as a sample:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>New York City</name>
<description>New York City</description>
<Point>
<coordinates>-74.006393,40.714172,0</coordinates>
</Point>
</Placemark>
</Document>
</kml>
And I'm trying to extract the coordinates.
Now, I've got a snippet working that embeds the namespace to search:
#!/usr/python/python3.4/bin/python3
from lxml import etree as ET
tree = ET.parse('sample.kml')
root = tree.getroot()
print (root.find('.//{http://www.opengis.net/kml/2.2}coordinates').text)
This works fine.
However having found this:
Parsing XML with namespace in Python via 'ElementTree'
I'm trying to do it via reading the namespace from the document, using 'root.nsmap'.
print (root.nsmap)
Gives me:
{None: 'http://www.opengis.net/kml/2.2'}
So I think I should be able to do this:
print ( root.find('.//coordinates',root.nsmap).text )
Or something very similar, using the None namespace (i.e. no prefix). But this doesn't work; I get an error:
AttributeError: 'NoneType' object has no attribute 'text'
I assume that means that my 'find' didn't find anything in this instance.
What am I missing here?
This code,
root.find('.//coordinates', root.nsmap)
does not find anything, because ElementPath does not support the default (None) prefix in the namespace mapping. See http://lxml.de/xpathxslt.html#namespaces-and-prefixes.
Below are two options that work.
Define another nsmap with a real prefix as key:
nsmap2 = {"k": root.nsmap[None]}
print (root.find('.//k:coordinates', nsmap2).text)
Don't bother with prefixes. Put the namespace URI inside curly braces ("Clark notation") to form a universal element name:
ns = root.nsmap[None]
print (root.find('.//{{{0}}}coordinates'.format(ns)).text)
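Both options can be seen end-to-end in a self-contained sketch. The stdlib ElementTree is used here, and its find() accepts the same prefix mapping and Clark notation as lxml; only root.nsmap itself is lxml-specific, so the URI is written out explicitly:

```python
import xml.etree.ElementTree as ET

kml = '''<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <Point><coordinates>-74.006393,40.714172,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>'''

root = ET.fromstring(kml)
ns = 'http://www.opengis.net/kml/2.2'    # with lxml: ns = root.nsmap[None]

# Option 1: bind the default namespace to a real (non-empty) prefix.
print(root.find('.//k:coordinates', {'k': ns}).text)

# Option 2: Clark notation -- embed the URI in curly braces.
print(root.find('.//{{{0}}}coordinates'.format(ns)).text)
```

Both print `-74.006393,40.714172,0`.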
I'm having problems getting lxml to successfully validate some XML. The XSD schema and XML file are both from Amazon's documentation, so they should be compatible. But the XML itself refers to another schema that isn't being loaded.
Here is my code, which is based on the lxml validation tutorial:
xsd_doc = etree.parse('ProductImage.xsd')
xsd = etree.XMLSchema(xsd_doc)
xml = etree.parse('ProductImage_sample.xml')
xsd.validate(xml)
print xsd.error_log
"ProductImage_sample.xml:2:0:ERROR:SCHEMASV:SCHEMAV_CVC_ELT_1: Element 'AmazonEnvelope': No matching global declaration available for the validation root."
I get no errors if I validate against amzn-envelope.xsd instead of ProductImage.xsd, but that defeats the point of seeing whether a given Image feed is valid. All XSD and XML files mentioned are in my working directory along with my Python script, by the way.
Here is a snippet of the sample XML, which should definitely be valid:
<?xml version="1.0"?>
<AmazonEnvelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="amzn-envelope.xsd">
<Header>
<DocumentVersion>1.01</DocumentVersion>
<MerchantIdentifier>Q_M_STORE_123</MerchantIdentifier>
</Header>
<MessageType>ProductImage</MessageType>
<Message>
<MessageID>1</MessageID>
<OperationType>Update</OperationType>
<ProductImage>
<SKU>1234</SKU>
Here is a snippet of the schema (this file is not public so I can't show all of it):
<?xml version="1.0"?>
<!-- Revision="$Revision: #5 $" -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xsd:include schemaLocation="amzn-base.xsd"/>
<xsd:element name="ProductImage">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="SKU"/>
I can say that following the include to amzn-base.xsd does not lead to a definition of the AmazonEnvelope element. So my question is: can lxml load schemas via an attribute like xsi:noNamespaceSchemaLocation="amzn-envelope.xsd"? And if not, how can I validate my Image feed?
The answer is to validate against the parent schema file, which (as referenced at the top of the XML file) is amzn-envelope.xsd, since it contains the line:
<xsd:include schemaLocation="ProductImage.xsd"/>
In general, lxml does not automatically follow a hint like xsi:noNamespaceSchemaLocation="amzn-envelope.xsd"; but if you can find the parent schema and validate against it, it should in turn include the specific schema you're interested in.
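As a runnable illustration of this include relationship (using hypothetical miniature schemas, since the real Amazon files aren't public), validating against the parent schema also covers elements declared in the included child:

```python
from lxml import etree
import os
import tempfile

# Miniature stand-ins for ProductImage.xsd / amzn-envelope.xsd.
# Names and content here are invented for illustration only.
child_xsd = '''<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="ProductImage" type="xsd:string"/>
</xsd:schema>'''

parent_xsd = '''<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:include schemaLocation="child.xsd"/>
  <xsd:element name="Envelope">
    <xsd:complexType>
      <xsd:sequence><xsd:element ref="ProductImage"/></xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>'''

with tempfile.TemporaryDirectory() as d:
    for name, text in [('child.xsd', child_xsd), ('parent.xsd', parent_xsd)]:
        with open(os.path.join(d, name), 'w') as f:
            f.write(text)
    # xsd:include is resolved relative to the parent schema's location.
    schema = etree.XMLSchema(etree.parse(os.path.join(d, 'parent.xsd')))
    doc = etree.fromstring(
        '<Envelope><ProductImage>img.jpg</ProductImage></Envelope>')
    print(schema.validate(doc))   # True: the root element lives in the parent
```

Validating the same document against child.xsd alone would fail with the same "No matching global declaration available for the validation root" error, because Envelope is only declared in the parent.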
I'm trying to write some unit tests in Python 2.7 to validate against some extensions I've made to the OAI-PMH schema: http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd
The problem I'm running into involves multiple nested namespaces, and it is caused by this specification in the above-mentioned XSD:
<complexType name="metadataType">
<annotation>
<documentation>Metadata must be expressed in XML that complies
with another XML Schema (namespace=#other). Metadata must be
explicitly qualified in the response.</documentation>
</annotation>
<sequence>
<any namespace="##other" processContents="strict"/>
</sequence>
</complexType>
Here's a snippet of the code I'm using:
from lxml import etree
import urllib2
query = "http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm"
schema_file = file("../schemas/OAI/2.0/OAI-PMH.xsd", "r")
schema_doc = etree.parse(schema_file)
oaischema = etree.XMLSchema(schema_doc)
request = urllib2.Request(query, headers=xml_headers)
response = urllib2.urlopen(request)
body = response.read()
response_doc = etree.fromstring(body)
try:
oaischema.assertValid(response_doc)
except etree.DocumentInvalid as e:
line = 1
for i in body.split("\n"):
print "{0}\t{1}".format(line, i)
line += 1
print(e.message)
I end up with the following error:
AssertionError: http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm
Element '{http://www.openarchives.org/OAI/2.0/oai_dc/}oai_dc': No matching global element declaration available, but demanded by the strict wildcard., line 22
I understand the error: the schema requires that the child element of the metadata element be strictly validated, which is what the sample XML provides.
I've written a validator in Java that works; however, it would be helpful for this to be in Python, since the rest of the solution I'm building is Python-based. To make my Java variant work, I had to make my DocumentFactory namespace-aware; otherwise I got the same error. I haven't found any working example in Python that performs this validation correctly.
Does anyone have an idea how I can get an XML document with multiple nested namespaces as my sample doc validate with Python?
Here is the sample XML document that I'm trying to validate:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request verb="GetRecord" identifier="oai:arXiv.org:cs/0112017"
metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
<GetRecord>
<record>
<header>
<identifier>oai:arXiv.org:cs/0112017</identifier>
<datestamp>2001-12-14</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
<oai_dc:dc
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Using Structural Metadata to Localize Experience of
Digital Content</dc:title>
<dc:creator>Dushay, Naomi</dc:creator>
<dc:subject>Digital Libraries</dc:subject>
<dc:description>With the increasing technical sophistication of
both information consumers and providers, there is
increasing demand for more meaningful experiences of digital
information. We present a framework that separates digital
object experience, or rendering, from digital object storage
and manipulation, so the rendering can be tailored to
particular communities of users.
</dc:description>
<dc:description>Comment: 23 pages including 2 appendices,
8 figures</dc:description>
<dc:date>2001-12-14</dc:date>
</oai_dc:dc>
</metadata>
</record>
</GetRecord>
</OAI-PMH>
I found this in lxml's documentation on validation:
>>> schema_root = etree.XML('''\
... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
... <xsd:element name="a" type="xsd:integer"/>
... </xsd:schema>
... ''')
>>> schema = etree.XMLSchema(schema_root)
>>> parser = etree.XMLParser(schema = schema)
>>> root = etree.fromstring("<a>5</a>", parser)
So perhaps what you need is this (see the last two lines):
schema_doc = etree.parse(schema_file)
oaischema = etree.XMLSchema(schema_doc)
request = urllib2.Request(query, headers=xml_headers)
response = urllib2.urlopen(request)
body = response.read()
parser = etree.XMLParser(schema = oaischema)
response_doc = etree.fromstring(body, parser)
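Putting that together as a minimal self-contained sketch (reusing the toy one-element schema from the lxml docs, since the real OAI-PMH schema pulls in further files): attaching the schema to the parser makes validation happen during parsing.

```python
from lxml import etree

# Toy schema from the lxml documentation: a single integer element <a>.
schema = etree.XMLSchema(etree.XML(
    '''<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
         <xsd:element name="a" type="xsd:integer"/>
       </xsd:schema>'''))

parser = etree.XMLParser(schema=schema)

root = etree.fromstring('<a>5</a>', parser)   # parses: document is valid
print(root.text)                              # 5

try:
    etree.fromstring('<a>not-a-number</a>', parser)
except etree.XMLSyntaxError as e:
    # An invalid document fails at parse time rather than in a later
    # validate() step.
    print('invalid:', e)
```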
I am new to Python and I'm trying to parse an XML file with SAX without validating it.
The head of my xml file is:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE n:document SYSTEM "schema.dtd">
<n:document....
and I've tried to parse it with python 2.5.2:
from xml.sax import make_parser, handler
import sys
parser = make_parser()
parser.setFeature(handler.feature_namespaces,True)
parser.setFeature(handler.feature_validation,False)
parser.setContentHandler(handler.ContentHandler())
parser.parse(sys.argv[1])
but I got an error:
python doc.py document.xml
(...)
File "/usr/lib/python2.5/urllib2.py", line 244, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: schema.dtd
I don't want the SAX parser to look for a schema. Where am I going wrong?
Thanks!
expatreader treats the DTD external subset as an external general entity. So the feature you want is:
parser.setFeature(handler.feature_external_ges, False)
However, it's a bit dodgy pointing the DTD external subset at a non-existent URL; as this shows, it's not only validating parsers that read it.
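A self-contained sketch of the fix (the xmlns:n URI and element content are invented, since the original document is elided): write out a document whose DOCTYPE points at a DTD that doesn't exist, then parse it with the external-general-entities feature switched off.

```python
from xml.sax import make_parser, handler
import os
import tempfile

# Document with a DOCTYPE whose system id ("schema.dtd") does not exist
# on disk. The xmlns:n URI is invented for this example.
doc = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE n:document SYSTEM "schema.dtd">
<n:document xmlns:n="urn:example">ok</n:document>'''

class Text(handler.ContentHandler):
    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

with tempfile.NamedTemporaryFile('w', suffix='.xml', delete=False) as f:
    f.write(doc)
    path = f.name

parser = make_parser()
parser.setFeature(handler.feature_namespaces, True)
parser.setFeature(handler.feature_external_ges, False)  # skip the external DTD
h = Text()
parser.setContentHandler(h)
parser.parse(path)          # no error: schema.dtd is never fetched
print(''.join(h.chunks))    # ok
os.remove(path)
```

With feature_external_ges left at its default (True), the same parse attempts to resolve "schema.dtd" and fails with the "unknown url type" error shown above.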