Import xml in another xml with Python ElementTree parser - python

Is it possible to load an xml file which imports another xml file with Python ElementTree.parse ?
For example:
I have file test.xml which contains:
<TestXml>
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
</TestXml>
and I also have test_1.xml which contains:
<test>it works!</test>
and I want to load test.xml in my python script:
from xml.etree.ElementTree import parse
a = parse('test.xml')
print a.find('test').text
and I expect it to output:
it works!
but instead I have:
Traceback (most recent call last):
File "D:/Work/depot/WIP/olex/Python/test/test.py", line 3, in <module>
a = parse('test.xml')
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
parser.feed(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 6
Does anybody know what I am doing wrong, or is it simply impossible to load such an xml file with the Python ElementTree parser?

The specific problem you are having is that your xml is malformed. Your DOCTYPE declaration should not be inside your root element. Rather, it should precede your root element:
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
<TestXml>
some content . . .
</TestXml>
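The placement rule can be checked with the standard library alone. The sketch below is a minimal illustration, assuming in-memory strings rather than the asker's files; the helper name `parses` and the sample strings are made up for the example. The entity is only declared here, never referenced.

```python
import xml.etree.ElementTree as ET

# Illustrative strings (assumptions), not the asker's actual files.
DOCTYPE = '<!DOCTYPE doc [\n<!ENTITY otherFile SYSTEM "test_1.xml">\n]>\n'
inside_root = "<TestXml>\n" + DOCTYPE + "</TestXml>"
before_root = DOCTYPE + "<TestXml>some content</TestXml>"


def parses(text):
    """Return True if ElementTree accepts the text, False on ParseError."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(parses(inside_root))   # DOCTYPE inside the root element
print(parses(before_root))   # DOCTYPE before the root element
```

The first call should report a failure of the same kind as the traceback in the question, while the second should parse.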
That said, you will face a larger problem once you solve that issue: how do you get Python to act on the DOCTYPE declaration? Should you use the xml module, the lxml module, or the bs4 module?
That's a tough question. The standard library's xml.etree.ElementTree does not expand external entities, so a reference such as &otherFile; raises a ParseError rather than pulling in test_1.xml. From what I have seen, people have (recently) had to do DTD parsing themselves. See the SO threads here and here for some possible leads.
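If changing the outer file is an option, one standard-library alternative to external entities is XInclude, handled by xml.etree.ElementInclude. This is a different technique from the DOCTYPE approach in the question, offered only as a sketch: the `FILES` dictionary and `memory_loader` are hypothetical stand-ins so the example runs without touching the disk.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory stand-ins for the files on disk.
FILES = {
    "test_1.xml": "<test>it works!</test>",
}
MAIN = (
    '<TestXml xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="test_1.xml" parse="xml"/>'
    "</TestXml>"
)


def memory_loader(href, parse, encoding=None):
    """Loader with the signature ElementInclude expects; reads from FILES."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]


root = ET.fromstring(MAIN)
# Replace every xi:include element in place with the loaded content.
ElementInclude.include(root, loader=memory_loader)
print(root.find("test").text)
```

With real files, leaving out the `loader` argument makes ElementInclude open each `href` itself.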

Related

Python XML Parser: junk after document element

I'm learning Python at work. I've got a large XML file with data similar to this:
testData3.xml File
<r><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c></c><c></c><c>something1</c><c>something1</c></r>
<r><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c></c><c></c><c>something2</c><c>something2</c></r>
I have copied an XML parser out of one of my Python books that works in gathering the data when the data file contains only one line. As soon as I add a second line of data, the script fails when it runs.
Python script that I'm running (xmlReader.py):
from xml.dom.minidom import parse, Node
xmltree = parse('testData3.xml')
for node1 in xmltree.getElementsByTagName('c'):
    for node2 in node1.childNodes:
        if node2.nodeType == Node.TEXT_NODE:
            print(node2.data)
I'm looking for some help on how to write the loop so that my xmlReader.py continues through the entire file instead of just one line. I get the following errors when I run this script:
Errors during execution:
xxxx#xxxx:~/xxxx/xxxx> python xmlReader.py
Traceback (most recent call last):
File "xmlReader.py", line 2, in <module>
xmltree = parse('testData3.xml')
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/minidom.py", line 1915, in parse
return expatbuilder.parse(file)
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 926, in parse
result = builder.parseFile(fp)
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: junk after document element: line 2, column 0
xxxx#xxxx:~/xxxx/xxxx>
The problem is that your example data is not well-formed XML. A well-formed document must have exactly one root element. That holds for a single line of the file, where <r> is the root, but not once you add a second line: each line is its own <r> element and nothing encloses them all, so the parser reports everything after the first </r> as junk.
Either construct valid XML, for example:
<root>
<r><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c></c><c></c><c>something1</c><c>something1</c></r>
<r><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c></c><c></c><c>something2</c><c>something2</c></r>
</root>
or parse the file line by line:
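from xml.dom.minidom import parseString
# Each line holds one complete <r>...</r> document, so parse the lines one at a time.
with open('testData3.xml') as f:
    for line in f:
        if not line.strip():
            continue  # a blank line is not an XML document
        xmltree = parseString(line)
        ...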
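Putting the line-by-line idea together with the asker's text-extraction loop might look like the sketch below. The function name `cell_texts` and the shortened sample rows are assumptions made for illustration.

```python
from xml.dom.minidom import parseString, Node


def cell_texts(lines):
    """Collect the text of every <c> element, treating each line as one document."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines, which are not XML documents
            continue
        doc = parseString(line)
        for cell in doc.getElementsByTagName("c"):
            for child in cell.childNodes:
                if child.nodeType == Node.TEXT_NODE:
                    texts.append(child.data)
    return texts


# Shortened stand-ins for the rows in testData3.xml.
sample = [
    "<r><c>something1</c><c></c><c>something1</c></r>",
    "<r><c>something2</c><c></c><c>something2</c></r>",
]
print(cell_texts(sample))
```

Because an open file iterates line by line, `cell_texts(f)` inside a `with open('testData3.xml') as f:` block would work the same way.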

etree generating error when using urllib

I am trying to parse an HTML table in Python (2.7) with the solutions in this post.
When I try either of the first two with a string (as in the example), it works perfectly.
But when I try to use etree.XML on an HTML page that I read with urllib, I get an error. I checked for each of the solutions that the variable I pass is a str as well.
For the following code:
from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)
I get this error:
File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 9, in <module>
table = etree.XML(s)
File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 8 and head, line 8, column 48
and for this code:
from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)
I get this error:
Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module>
table = ET.XML(s)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111
Although they look like the same kind of markup, HTML is far less strict than XML about being well-formed (closing every opened element, escaping entities, etc.). Hence, what passes for HTML may be rejected as XML; here the page's unclosed <link> tag inside <head> is what triggers the mismatch error.
Therefore, consider using etree's HTML() function to parse the page. Additionally, you can use XPath to target the particular area you intend to extract. Below is an example that attempts to pull the main page's table. Do note that the webpage uses quite a few nested tables, and that the example is written for Python 3 (urllib.request); on Python 2.7 keep urllib.urlopen.
from lxml import etree
import urllib.request as rq
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))
# PARSE PAGE
htmlpage = etree.HTML(s)
# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")
for row in htmltable:
    print(row)
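The difference in strictness can be reproduced without lxml or network access. The sketch below swaps in the standard library's html.parser for lxml's HTML() purely so it runs anywhere; the page fragment and the `CellCollector` class are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up page fragment: the unclosed <link> is acceptable HTML but not XML.
PAGE = (
    '<html><head><link rel="stylesheet" href="style.css"></head>'
    "<body><table><tr><td>Rank</td><td>Title</td></tr></table></body></html>"
)


class CellCollector(HTMLParser):
    """Gather the text found inside <td> elements."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data)


# The XML parser rejects the fragment because <link> is never closed.
try:
    ET.fromstring(PAGE)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

# The HTML parser tolerates it and still reaches the table cells.
collector = CellCollector()
collector.feed(PAGE)
collector.close()
print(xml_ok, collector.cells)
```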

How to validate xml using python without third-party libs?

I have some xml pieces like this:
<!DOCTYPE mensaje SYSTEM "record.dtd">
<record>
<player_birthday>1979-09-23</player_birthday>
<player_name>Orene Ai'i</player_name>
<player_team>Blues</player_team>
<player_id>453</player_id>
<player_height>170</player_height>
<player_position>F&W</player_position> <---- a '&' here.
<player_weight>75</player_weight>
</record>
Is there any way to check whether an xml piece is well-formed?
Is there any way to validate the xml against a DTD or XML Schema?
For various reasons I can't use any third-party packages.
e.g. the xml above is not correct since it has a bare '&' in it. Note that the DOCTYPE declaration refers to a DTD.
Just try to parse it with ElementTree (xml.etree.ElementTree.fromstring) - it will raise an error if the XML is not well formed.
>>> a = """<record>
... <player_birthday>1979-09-23</player_birthday>
... <player_name>Orene Ai'i</player_name>
... <player_team>Blues</player_team>
... <player_id>453</player_id>
... <player_height>170</player_height>
... <player_position>F&W</player_position> <---- a '&' here.
... <player_weight>75</player_weight>
... </record>"""
>>>
>>> from xml.etree import ElementTree as ET
>>> x = ET.fromstring(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1282, in XML
parser.feed(text)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 24
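You can use Python's xml.dom.minidom parser (it is in the standard library, though less powerful than alternatives such as lxml).
Just do:
import xml.dom.minidom
xml.dom.minidom.parseString('<My><XML><String/></XML></My>')
You will get an xml.parsers.expat.ExpatError if the XML is not well-formed. Note that this checks well-formedness only: the standard library's parsers are non-validating, so they will not check the document against record.dtd or an XML Schema.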
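Both answers come down to "try to parse and catch the error". A minimal sketch of that check as a reusable function follows; the name `is_well_formed` is an assumption, and `xml.sax.saxutils.escape` is shown as one standard-library way to produce the escaped form of the offending value.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """Return True if the text parses as XML; this is not DTD validation."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


bad = "<player_position>F&W</player_position>"
# escape() turns the bare '&' into '&amp;'.
good = "<player_position>" + escape("F&W") + "</player_position>"

print(is_well_formed(bad))
print(is_well_formed(good))
```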

Lxml can't parse gzipped XML?

I have this Gzipped XML-file:
http://cdon.com/xml_files/cdon_games_SE.xml.gz
According to lxml http://lxml.de/parsing.html lxml can parse gzipped XML-files:
"lxml can parse from a local file, an HTTP URL or an FTP URL. It also auto-detects and reads gzip-compressed XML files (.gz)."
This code:
from lxml import etree
tree = urllib.urlopen('http://cdon.com/xml_files/cdon_games_SE.xml.gz')
parser = etree.XMLParser(recover=True)
tree = etree.parse(tree, parser)
tree = tree.xpath(//product)
Gives error:
tree = tree.xpath(//product)
File "lxml.etree.pyx", line 2038, in lxml.etree._ElementTree.xpath (src/lxml\lxml.etree.c:47529)
File "lxml.etree.pyx", line 1709, in lxml.etree._ElementTree._assertHasRoot (src/lxml\lxml.etree.c:44508)
AssertionError: ElementTree not initialized, missing root
What is wrong? Can't lxml parse gzipped XML files? If I save the file uncompressed (without gzip) on the local server, it works.
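No answer is recorded for this question here. One plausible explanation (an assumption, not confirmed in this thread) is that the gzip auto-detection applies when lxml opens a file itself, whereas a file-like object from urlopen hands it the raw compressed bytes. Under that assumption, decompressing first is a possible workaround. The sketch below uses only the standard library and a made-up payload; the function name `parse_gzipped_xml` is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def parse_gzipped_xml(raw_bytes):
    """Decompress gzip data and parse the result as XML."""
    return ET.fromstring(gzip.decompress(raw_bytes))


# Made-up payload standing in for the downloaded .xml.gz content.
payload = gzip.compress(b"<products><product>demo</product></products>")
root = parse_gzipped_xml(payload)
print([p.text for p in root.findall("product")])
```

The bytes returned by `urlopen(...).read()` could be passed to the same function.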

Wikipedia with Python

I have this very simple python code to read xml for the wikipedia api:
import urllib
from xml.dom import minidom
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500")
xmldoc=minidom.parse(usock)
usock.close()
print xmldoc.toxml()
But this code returns with these errors:
Traceback (most recent call last):
File "/home/user/workspace/wikipediafoundations/src/list.py", line 5, in <module>
xmldoc=minidom.parse(usock)
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 62
I have no clue, as I am just learning Python. Is there a way to get an error with more detail? Does anyone know the solution? Also, please recommend a better language to do this in.
Thank You,
Venkat Rao
The URL you're requesting is an HTML representation of the XML that would be returned:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500
So the XML parser fails. You can see this by pasting the above in a browser. Try adding a format=xml at the end:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml
as documented on the linked page:
http://en.wikipedia.org/w/api.php
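Building the query string programmatically makes it harder to forget the format parameter. The sketch below is written for Python 3 and assumes the same endpoint and parameters as the URLs above; the function names are made up, and `fetch_links_xml` is defined but not called because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
from xml.dom import minidom

API = "http://en.wikipedia.org/w/api.php"


def build_query_url(title, fmt="xml"):
    """Build the links query, always asking for a machine-readable format."""
    params = [
        ("action", "query"),
        ("titles", title),
        ("prop", "links"),
        ("pllimit", "500"),
        ("format", fmt),
    ]
    return API + "?" + urlencode(params)


def fetch_links_xml(title):
    """Download and parse the response; requires network access."""
    with urlopen(build_query_url(title)) as response:
        return minidom.parse(response)


print(build_query_url("Fractal"))
```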
