XML (.xsd) feed validation against a schema - python

I have a XML file and I have a XML schema. I want to validate the file against that schema and check if it adheres to that. I am using python but am open to any language for that matter if there is no such useful library in python.
What would be my best options here? I would worry about the how fast I can get this up and running.

Definitely lxml.
Define an XMLParser with a predefined schema, load the the file fromstring() and catch any XML Schema errors:
from lxml import etree
def validate(xmlparser, xmlfilename):
try:
with open(xmlfilename, 'r') as f:
etree.fromstring(f.read(), xmlparser)
return True
except etree.XMLSchemaError:
return False
schema_file = 'schema.xsd'
with open(schema_file, 'r') as f:
schema_root = etree.XML(f.read())
schema = etree.XMLSchema(schema_root)
xmlparser = etree.XMLParser(schema=schema)
filenames = ['input1.xml', 'input2.xml', 'input3.xml']
for filename in filenames:
if validate(xmlparser, filename):
print("%s validates" % filename)
else:
print("%s doesn't validate" % filename)
Note about encoding
If the schema file contains an xml tag with an encoding (e.g. <?xml version="1.0" encoding="UTF-8"?>), the code above will generate the following error:
Traceback (most recent call last):
File "<input>", line 2, in <module>
schema_root = etree.XML(f.read())
File "src/lxml/etree.pyx", line 3192, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
A solution is to open the files in byte mode: open(..., 'rb')
[...]
def validate(xmlparser, xmlfilename):
try:
with open(xmlfilename, 'rb') as f:
[...]
with open(schema_file, 'rb') as f:
[...]

The python snippet is good, but an alternative is to use xmllint:
xmllint -schema sample.xsd --noout sample.xml

import xmlschema
def get_validation_errors(xml_file, xsd_file):
schema = xmlschema.XMLSchema(xsd_file)
validation_error_iterator = schema.iter_errors(xml_file)
errors = list()
for idx, validation_error in enumerate(validation_error_iterator, start=1):
err = validation_error.__str__()
errors.append(err)
print(err)
return errors

Related

Python Minidom Parsing File Objects

I wrote a code using minidom which takes an xml script, opens it as a file object and then parses that file object. Not only that, but I want the script to open multiple files that are all contained in a folder, and parse each one individually.
An example of the xml script is:
<?xml version="1.0"?>
<Data>
<data1>1</data1>
<data2>2</data2>
<data3>3</data3>
<Sub_data>
<sub_data1>0.1111111111111</sub_data1>
<sub_data2>0.2222222222222</sub_data2>
... and so on.
i.e., it's pretty standard.
Now, my code looks like this:
import os
import io
from xml.dom import minidom
#folder where xml files are located
indir = '/foo/bar/docs/'
masterlist = []
for root, dirs, filenames in os.walk(indir):
for f in filenames:
row = []
fsock = io.open(indir + f, mode = 'rt', encoding = 'cp1252')
xmldoc = minidom.parse(fsock)
...
and the error I am getting is:
Traceback (most recent call last): File "kgp_2.py", line 34, in
<module> xmldoc = minidom.parse(fsock) File
"/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse return
expatbuilder.parse(file) File
"/usr/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file) File
"/usr/lib/python2.7/xml/dom/expatbuilder.py", line 211, in parseFile
parser.Parse("", True) xml.parsers.expat.ExpatError: no element found:
line 203, column 1381
Now, when I make the change:
fsock = io.open(indir + filenames[0], mode = 'rt', encoding = 'cp1252')
this works fine, that is, it opens the first file in the folder; but I want to parse all the files in the folder. When I do a loop like:
m = 0
... in loop:
fsock = io.open(indir + filenames[m], mode = 'rt', encoding = 'cp1252')
...
m = m+1
I get the original error.
The reason I am using the io library instead of the usual file open function is that a previous stack overflow article recommended it. Using:
fsock = open(indir + filenames[0])
like before, gets no error, but:
fsock = open(indir + f)
or
#with a loop over m, like above
fsock = open(infir + filenames[m])
get the same error as above.
A strange problem. When I print the filenames they are correct. And they are being opened, there's no error there. It's the parser that just won't parse the object files, even with filenames[m] where m = 0, surely this should be no problem?
EDIT:
Parsing document with python minidom
in this post they had a similar problem, the resolution was to use
xmldoc.seek(0)
however, for me this returns
Traceback (most recent call last):
File "kgp_2.py", line 45, in <module>
xmldoc.seek(0)
AttributeError: Document instance has no attribute 'seek'
EDIT 2: THIS HAS BEEN RESOLVED. IT WAS A CASE OF A CORRUPTED INPUT XML FILE.
Are you sure the XML data contained in all XML files is correct? Perhaps one is empty an you have to handle such Exception. Anyhow I recommend you to use xml.etree doc.

ElementTree errors, html files will not parse using Python/Sublime

I am trying to parse a few thousand html files and dump the variables into a csv file (excel spreadsheet). I've come up against several roadblocks, but the first one is this: I can not get it to properly parse the file. Below is a brief explanation, the python code and the traceback info.
Using Python & Sublime to parse html files, I am getting several errors. What IS working: it runs fine up until if '.html' in file:. It does not execute that loop. It will iterate through print allFiles just fine. It also creates the csv file and creates the headers (though not in separate columns, but I can ask about that later).
It seems that the problem is in the if tree = ET.parse(HTML_PATH+"/"+file) piece. I've written this several different ways (without "/" and/or "file", for example)--so far I have yet to resolve this problem.
If I can provide more information or if anyone can direct me to other documenation, it would be greatly appreciated. So far I have yet to find anything that addresses this issue.
Many thanks for your thoughts.
//C
# Parses out data from crawled html files under "html files"
# and places the output in output.csv.
import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv
# TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'
f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])
# redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''
print "headers created"
allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"
for file in allFiles:
#print allFiles
if '.html' in file:
print "in html loop"
tree = ET.parse(HTML_PATH+"/"+file)
print '===================='
print 'Parsing file: '+file
print '===================='
for node in tree.iter():
print "tbody"
# The tbody attribute spells it all (or does it):
name = node.attrib.get('/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font')
# Check common header stuff
if name=='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
#print ' ------------------'
#print ' Category:'
category=node.text
print "category"
f.close()
Traceback:
File "/Users/C/data/Folder_NS/data_parse.py", line 34, in
tree = ET.parse(HTML_PATH+"/"+file)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 63, column 2
You are trying to parse HTML with an XML parser, and valid HTML is not always valid XML. You would be better off using the HTML parsing library in the lxml package.
import xml.etree.ElementTree as ET
# ...
tree = ET.parse(HTML_PATH + '/' + file)
would be changed to
import lxml.html
# ...
tree = lxml.html.parse(HTML_PATH + '/' + file)

python xml.etree.ElementTree tostring() fromstring() roundtrip fails for unicode code points

I'm using xml.etree.ElementTree in python 2.7 and having problems round-tripping to-and-from strings. Calling ET.fromstring() on ET.tostring() fails if there are non-ascii Unicode characters in the tree.
Why doesn't this work? Given that ElementTree wants bytestreams and to do its own decoding, why then does it default to a ASCII parser? Is this determined by something I've overlooked, like the encoding of the python file or locale?
ASCII only chars work:
import xml.etree.ElementTree as ET
t1 = ET.Element('test')
t1.text = u'hello world'
t1_roundtrip = ET.fromstring(ET.tostring(t1, encoding='utf8', method='xml'))
# ET.dump(t1) == ET.dump(t1_roundtrip)
Unicode Code points fail:
import xml.etree.ElementTree as ET
t2 = ET.Element('test')
t2.text = u'\u2603'
t2_roundtrip = ET.fromstring(ET.tostring(t2, encoding='utf8', method='xml'))
>>> t2_roundtrip = ET.fromstring(ET.tostring(t2, encoding='utf8', method='xml'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/rh/python27/root/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1300, in XML
parser.feed(text)
File "/opt/rh/python27/root/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/opt/rh/python27/root/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 6
You've specified an illegal encoding. Quoting the ElementTree doc:
The encoding string included in XML output should conform to the appropriate standards. For example, “UTF-8” is valid, but “UTF8” is not. See http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EncodingDecl and http://www.iana.org/assignments/character-sets.
Found two ways around this:
don't include encoding for tostring():
import xml.etree.ElementTree as ET
t3 = ET.Element('test')
t3.text = u'\u2603'
t3_roundtrip = ET.fromstring(ET.tostring(t3, method='xml'))
specify XMLParser with utf-8 encoding:
import xml.etree.ElementTree as ET
t4 = ET.Element('test')
t4.text = u'\u2603'
t4_roundtrip_utf = ET.fromstring(
ET.tostring(t3, encoding='utf8', method='xml'),
parser=ET.XMLParser(encoding='utf-8'))
Why do I need either? Aren't XML files utf-8 unless specified otherwise?

Saving tables from html document

I tried to save tables from a document as a file under a directory as follows:
for table in tables:
tableString = html.tostring(table)
fileref=open('c:\\Users\\ahn_133\\Desktop\\appleTables\\Apple-' + str(count) + '.htm', 'w')
fileref.write(tableString)
fileref.close()
count+=1
But, I keep getting an error as follows:
Traceback (most recent call last):
File "<pyshell#27>", line 4, in <module>
fileref.write(tableString)
TypeError: must be str, not bytes
I am using Python 3.3 and installed lxml-3.0.1.win32-py3.3.‌exe
How can I fix this error?
lxml's tostring method returns a bytestring (bytes), because it is already encoded. This is necessary because the XML/HTML document can specify its own encoding, and that better be right!
Simply open the file in binary mode:
for table in tables:
tableString = html.tostring(table)
filename = r'c:\Users\ahn_133\Desktop\appleTables\Apple-' +str(count)+ '.htm'
with open(filename, 'wb') as fileref:
# ^
fileref.write(tableString)
count+=1

Writing XML to file corrupts files in python

I'm attempting to write contents from xml.dom.minidom object to file. The simple idea is to use 'writexml' method:
import codecs
def write_xml_native():
# Building DOM from XML
xmldoc = minidom.parse('semio2.xml')
f = codecs.open('codified.xml', mode='w', encoding='utf-8')
# Using native writexml() method to write
xmldoc.writexml(f, encoding="utf=8")
f.close()
The problem is that it corrupts the non-latin-encoded text in the file. The other way is to get the text string and write it to file explicitly:
def write_xml():
# Building DOM from XML
xmldoc = minidom.parse('semio2.xml')
# Opening file for writing UTF-8, which is XML's default encoding
f = codecs.open('codified3.xml', mode='w', encoding='utf-8')
# Writing XML in UTF-8 encoding, as recommended in the documentation
f.write(xmldoc.toxml("utf-8"))
f.close()
This results in the following error:
Traceback (most recent call last):
File "D:\Projects\Semio\semioparser.py", line 45, in <module>
write_xml()
File "D:\Projects\Semio\semioparser.py", line 42, in write_xml
f.write(xmldoc.toxml(encoding="utf-8"))
File "C:\Python26\lib\codecs.py", line 686, in write
return self.writer.write(data)
File "C:\Python26\lib\codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 2064: ordinal not in range(128)
How do I write an XML text to file? What is it I'm missing?
EDIT. Error is fixed by adding decode statement:
f.write(xmldoc.toxml("utf-8").decode("utf-8"))
But russian symbols are still corrupted.
The text is not corrupted when viewed in an interpreter, but when it's written in file.
Hmm, though this should work:
xml = minidom.parse("test.xml")
with codecs.open("out.xml", "w", "utf-8") as out:
xml.writexml(out)
you may alternatively try:
with codecs.open("test.xml", "r", "utf-8") as inp:
xml = minidom.parseString(inp.read().encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
xml.writexml(out)
Update: In case you construct xml out of string object, you should encode it before passing to minidom parser, like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import xml.dom.minidom as minidom
xml = minidom.parseString(u"<ru>Тест</ru>".encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
xml.writexml(out)
Try this:
with open("codified.xml", "w") as f:
f.write(xmldoc.toxml("utf-8").decode("utf-8"))
This works for me (under Python 3, though).

Categories

Resources