How to validate xml using python without third-party libs? - python

I have some xml pieces like this:
<!DOCTYPE mensaje SYSTEM "record.dtd">
<record>
<player_birthday>1979-09-23</player_birthday>
<player_name>Orene Ai'i</player_name>
<player_team>Blues</player_team>
<player_id>453</player_id>
<player_height>170</player_height>
<player_position>F&W</player_position> <---- a '&' here.
<player_weight>75</player_weight>
</record>
Is there any way to check whether the XML pieces are well-formed?
Is there any way to validate the XML against a DTD or XML Schema?
For various reasons I can't use any third-party packages.
E.g. the XML above is not correct since it has a bare '&' in it. Note that the DOCTYPE declaration refers to a DTD.

Just try to parse it with ElementTree (xml.etree.ElementTree.fromstring) - it will raise an error if the XML is not well formed.
>>> a = """<record>
... <player_birthday>1979-09-23</player_birthday>
... <player_name>Orene Ai'i</player_name>
... <player_team>Blues</player_team>
... <player_id>453</player_id>
... <player_height>170</player_height>
... <player_position>F&W</player_position> <---- a '&' here.
... <player_weight>75</player_weight>
... </record>"""
>>>
>>> from xml.etree import ElementTree as ET
>>> x = ET.fromstring(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1282, in XML
parser.feed(text)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 24

You can use python's xml.dom.minidom XML parser (which is in the standard library, but isn't as powerful as alternatives such as lxml).
Just do:
import xml.dom.minidom
xml.dom.minidom.parseString('<My><XML><String/><XML/><My/>')
You will get an xml.parsers.expat.ExpatError if the XML is not well formed (the string above is not, because <My> and <XML> are never closed).
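Both answers above can be wrapped in a small helper. This is a minimal sketch, assuming Python 3; the name is_well_formed is made up for illustration. Note that the standard-library parsers are built on Expat, which is non-validating, so this checks well-formedness only and does not validate against record.dtd.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message).

    Hypothetical helper: it only checks well-formedness and does not
    validate against a DTD or schema.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, None


# The bare '&' from the question, and the same content with it escaped.
bad = "<record><player_position>F&W</player_position></record>"
good = "<record><player_position>F&amp;W</player_position></record>"

print(is_well_formed(bad))
print(is_well_formed(good))
```

Returning the parser's message alongside the flag keeps the line and column information that ElementTree reports.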

Related

XPath SyntaxError: invalid predicate

I have an XML file like this:
$ cat sample.xml
<Requests>
<Request>
<ID>123</ID>
<Items>
<Item>a item</Item>
<Item>b item</Item>
<Item>c item</Item>
</Items>
</Request>
<Request>
<ID>456</ID>
<Items>
<Item>d item</Item>
<Item>e item</Item>
</Items>
</Request>
</Requests>
I simply want to extract the XML of the Request elements that have a certain value in their grandchild Item element. Here is the code:
bash-4.2$ cat xsearch.py
import sys
import xml.etree.ElementTree as ET
if __name__ == '__main__':
    tree = ET.parse(sys.argv[1])
    root = tree.getroot()
    for request in root.findall(".//Item[.='c item']/../.."):
    #for request in root.findall(".//Request[Items/Item = 'c item']"):
        print(request)
I got "invalid predicate" error:
bash-4.2$ python3 xsearch.py sample.xml
Traceback (most recent call last):
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 263, in iterfind
selector = _cache[cache_key]
KeyError: (".//Item[.='c item']/../..", None)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "xsearch.py", line 8, in <module>
for request in root.findall(".//Item[.='c item']/../.."):
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 304, in findall
return list(iterfind(elem, path, namespaces))
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 277, in iterfind
selector.append(ops[token[0]](next, token))
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 233, in prepare_predicate
raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
Could anyone point out where I went wrong?
In general, an XPath invalid predicate error means something is syntactically wrong with one of the XPath's predicates, the code between the [ and ].
Specifically in your case, there are two issues:
The SyntaxError("invalid predicate") is because there's an extra ) in the predicate:
for request in root.findall(".//Item[.='c item')]/../.."):
                                               ^
Note also that you can hoist the predicate to avoid navigating down and then back up (../..):
Instead of
.//Item[.='c item']/../..
consider
.//Request[Items/Item = 'c item']
to select the Request element with the targeted Item.
The XPath library you're using, ElementTree, is not a full implementation of the XPath standard. You can waste a lot of time trying to identify what ElementTree does support (".//Items[Item='c item']/.." happens to work here) and does not support, but it'd be better to just use a more compliant library such as lxml.
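For anyone who has to stay within the standard library, the working expression mentioned above can be tried on the sample document. This is a minimal sketch, assuming Python 3 and the sample.xml content inlined as a string. As far as I can tell from the ElementTree documentation, the [.='text'] predicate form was only added in Python 3.7, which may explain the error on 3.6. ET.tostring is used to get the XML of each match rather than the element object.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Requests>
  <Request>
    <ID>123</ID>
    <Items>
      <Item>a item</Item>
      <Item>b item</Item>
      <Item>c item</Item>
    </Items>
  </Request>
  <Request>
    <ID>456</ID>
    <Items>
      <Item>d item</Item>
      <Item>e item</Item>
    </Items>
  </Request>
</Requests>"""

root = ET.fromstring(SAMPLE)

# [Item='c item'] matches Items elements that have a child Item with that
# text; the trailing /.. then steps up to the enclosing Request.
matches = root.findall(".//Items[Item='c item']/..")

# Plain-Python alternative that avoids XPath predicates altogether.
matches_loop = [
    request
    for request in root.iter("Request")
    if any(item.text == "c item" for item in request.iter("Item"))
]

for request in matches:
    print(ET.tostring(request, encoding="unicode"))
```

The loop version is more verbose but does not depend on which XPath subset a given Python release supports.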

parse Dutch NDW xml

I am trying to parse the XML file from the Dutch NDW, which reports the traffic speed on many Dutch motorways every minute. I use this example file: http://www.ndw.nu/downloaddocument/e838c62446e862f5b6230be485291685/Reistijden.zip
I am trying to parse the travel time data into variables with Python, but I am struggling.
from xml.etree import ElementTree
import urllib2
url = "http://weburloffile.nl/ndw/Reistijden.xml"
response = urllib2.urlopen(url)
namespaces = {
    'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
    'a': 'http://datex2.eu/schema/2/2_0'
}
dom = ElementTree.fromstring(response.read)
names = dom.findall(
    'soap:Envelope'
    '/a:duration',
    namespaces,
)
#print names
for duration in names:
    print(duration.text)
I get this new error
Traceback (most recent call last):
File "test.py", line 9, in <module>
dom = ElementTree.fromstring(response.read)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1311, in XML
parser.feed(text)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1651, in feed
self._parser.Parse(data, 0)
TypeError: Parse() argument 1 must be string or read-only buffer, not instancemethod
How to parse this (complex) xml correctly?
-- changed it into read as suggested by comment
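The problem isn't the XML parsing; it's that you are using the response object incorrectly. urllib2.urlopen returns a file-like object that does not have a content attribute, and passing response.read without parentheses hands the parser the bound method instead of the data (hence the "not instancemethod" TypeError). Instead, you should call read() on it: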
dom = ElementTree.fromstring(response.read())
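Once the bytes are read, the path in the question may still return nothing, because findall paths are relative to the element they are called on (here the parsed root, which is already the Envelope). This is a minimal sketch, assuming Python 3; the tiny envelope below is invented for illustration and is not the real NDW layout. Only the two namespace URIs come from the question.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "a": "http://datex2.eu/schema/2/2_0",
}

# Hypothetical stand-in for the downloaded file; the real document is
# much larger and its element layout is not shown in the question.
SAMPLE = b"""<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:a="http://datex2.eu/schema/2/2_0">
  <soap:Body>
    <a:measurement><a:duration>63</a:duration></a:measurement>
    <a:measurement><a:duration>120</a:duration></a:measurement>
  </soap:Body>
</soap:Envelope>"""

# With a live response this would be ET.fromstring(response.read()).
root = ET.fromstring(SAMPLE)

# ".//" searches all descendants of the root, so the path does not need
# to spell out every intermediate element.
durations = [el.text for el in root.findall(".//a:duration", NAMESPACES)]
print(durations)
```

Using ".//" trades precision for convenience; a fully spelled-out path is safer when the same tag name appears in several places.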

Import xml in another xml with Python ElementTree parser

Is it possible to load an XML file that imports another XML file with Python's ElementTree.parse?
For example:
I have file test.xml which contains:
<TestXml>
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
</TestXml>
and I also have test_1.xml which contains:
<test>it works!</test>
and I want to load test.xml in my python script:
from xml.etree.ElementTree import parse
a = parse('test.xml')
print a.find('test').text
and I expect it to output:
it works!
but instead I have:
Traceback (most recent call last):
File "D:/Work/depot/WIP/olex/Python/test/test.py", line 3, in <module>
a = parse('test.xml')
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
parser.feed(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 6
Does somebody know what I am doing wrong, or is it just impossible to load such an XML file with the Python ElementTree parser?
The specific problem you are having is that your xml is malformed. Your DOCTYPE declaration should not be inside your root element. Rather, it should precede your root element:
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
<TestXml>
some content . . .
</TestXml>
That said, you will face a larger problem once you solve that issue. How do you use Python to parse the DOCTYPE declaration? Should you use the xml module, the lxml module, or the bs4 module?
That's a tough question. From what I have seen, people have (recently) had to do dtd parsing themselves. See the SO threads here and here for some possible leads.
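If the including file can be changed, one standard-library alternative to DTD entities is XInclude, which xml.etree.ElementInclude processes. This swaps in a different mechanism from the ENTITY declaration in the question, so treat it as a sketch under that assumption. The in-memory FILES dict and the in_memory_loader name are made up so the example runs without files on disk (the default loader reads href from the filesystem instead).

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory stand-in for test_1.xml.
FILES = {"test_1.xml": "<test>it works!</test>"}

# test.xml rewritten to use an XInclude element instead of an ENTITY.
MAIN = (
    '<TestXml xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="test_1.xml" parse="xml"/>'
    "</TestXml>"
)


def in_memory_loader(href, parse, encoding=None):
    """Custom loader with the same signature as ElementInclude.default_loader."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]


root = ET.fromstring(MAIN)
# Replaces each xi:include element in place with the loaded element.
ElementInclude.include(root, loader=in_memory_loader)

text = root.find("test").text
print(text)
```

With real files, omitting the loader argument lets ElementInclude open test_1.xml itself.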

Upper limit of fromstring function in ElementTree

I'm using Python 2.4 version on a Windows 32-bit PC. I'm trying to parse through a very large XML file using the ElementTree module. I downloaded version 1.2.6 of this module from effbot.org.
I followed the below code for my purpose:
import elementtree.ElementTree as ET
input = ''' 001 Chuck 009 Brent '''
stuff = ET.fromstring(input)
lst = stuff.findall("users/user")
print len(lst)
for item in lst:
    print item.attrib["x"]
item = lst[0]
ET.dump(item)
item.get("x") # get works on attributes
item.find("id").text
item.find("id").tag
for user in stuff.getiterator('user'):
    print "User" , user.attrib["x"]
    ET.dump(user)
If the content of input is too large, more than 10,000 lines, the fromstring function raises an error (below). Can anyone help me out in rectifying this error?
This is the error generated:
Traceback (most recent call last):
File "C:\Documents and Settings\hariprar\My Documents\My files\Python Try\xml_try1.py", line 16, in -toplevel-
stuff = ET.fromstring(input)
File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 1012, in XML
return api.fromstring(text)
File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 182, in fromstring
parser.feed(text)
File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 1292, in feed
self._parser.Parse(data, 0)
ExpatError: not well-formed (invalid token): line 2445, column 39
Take a look at the iterparse function. It will let you parse your input incrementally rather than reading it into memory as one big chunk.
It's described here: http://effbot.org/zone/element-iterparse.htm
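A minimal sketch of the incremental approach, assuming Python 3 and the standard-library xml.etree.ElementTree rather than the separate elementtree 1.2.6 package; the sample data and the count_users name are invented for illustration. It also catches ParseError and returns its position, since the traceback above points at a specific line and column (2445, 39), which suggests the data at that spot is worth inspecting as well.

```python
import io
import xml.etree.ElementTree as ET


def count_users(source):
    """Stream through source and count <user> elements.

    Returns (count, None) on success, or (count_so_far, (line, column))
    if the input stops being well-formed part-way through.
    """
    count = 0
    try:
        for _event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == "user":
                count += 1
                elem.clear()  # release the element's contents to save memory
    except ET.ParseError as exc:
        return count, exc.position
    return count, None


# Hypothetical sample data; the attribute values are made up.
GOOD = b'<stuff><users><user x="1"/><user x="2"/></users></stuff>'
BAD = b"<stuff><users><user>A & B</user></users></stuff>"

print(count_users(io.BytesIO(GOOD)))
print(count_users(io.BytesIO(BAD)))
```

For a file on disk, passing the filename to count_users works as well, because iterparse accepts either a path or a file object.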

Wikipedia with Python

I have this very simple python code to read xml for the wikipedia api:
import urllib
from xml.dom import minidom
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500")
xmldoc=minidom.parse(usock)
usock.close()
print xmldoc.toxml()
But this code returns with these errors:
Traceback (most recent call last):
File "/home/user/workspace/wikipediafoundations/src/list.py", line 5, in <module>
xmldoc=minidom.parse(usock)
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 62
I have no clue, as I am just learning Python. Is there a way to get an error with more detail? Does anyone know the solution? Also, please recommend a better language to do this in.
Thank You,
Venkat Rao
The URL you're requesting is an HTML representation of the XML that would be returned:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500
So the XML parser fails. You can see this by pasting the above in a browser. Try adding a format=xml at the end:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml
as documented on the linked page:
http://en.wikipedia.org/w/api.php
