Lxml can't parse gzipped XML? - python

I have this Gzipped XML-file:
http://cdon.com/xml_files/cdon_games_SE.xml.gz
According to the lxml documentation (http://lxml.de/parsing.html), lxml can parse gzipped XML files:
"lxml can parse from a local file, an HTTP URL or an FTP URL. It also auto-detects and reads gzip-compressed XML files (.gz)."
This code:
import urllib
from lxml import etree
tree = urllib.urlopen('http://cdon.com/xml_files/cdon_games_SE.xml.gz')
parser = etree.XMLParser(recover=True)
tree = etree.parse(tree, parser)
tree = tree.xpath('//product')
Gives error:
tree = tree.xpath('//product')
File "lxml.etree.pyx", line 2038, in lxml.etree._ElementTree.xpath (src/lxml\lxml.etree.c:47529)
File "lxml.etree.pyx", line 1709, in lxml.etree._ElementTree._assertHasRoot (src/lxml\lxml.etree.c:44508)
AssertionError: ElementTree not initialized, missing root
What is wrong? Can't lxml parse gzipped XML files? If I save the uncompressed XML as a file on the local server, it works.
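One possible explanation, offered as an assumption rather than something confirmed in this thread: the gzip auto-detection applies when lxml opens a filename or URL itself, whereas a file-like object such as the one returned by urlopen is read as raw bytes. A minimal sketch of a workaround is to decompress explicitly before parsing. It uses the standard library's ElementTree and an in-memory sample so that it runs offline; parse_gzipped_xml and the sample payload are hypothetical names made up for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def parse_gzipped_xml(raw_bytes):
    """Decompress gzip-compressed bytes and parse the result as XML."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as unzipped:
        return ET.parse(unzipped)


# Stand-in for the bytes downloaded from the .xml.gz URL (made-up data).
sample = gzip.compress(
    b"<products><product>A</product><product>B</product></products>"
)

tree = parse_gzipped_xml(sample)
names = [product.text for product in tree.getroot().iter("product")]
print(names)
```

lxml's etree.parse also accepts a file-like object, so the same GzipFile wrapper should carry over to the lxml version of the code.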

Related

Read multiple xml file from a folder using ElementTree

I am very new to coding in Python, and there is an issue I have been trying to solve for some hours:
I have 1600+ XML files (0000.xml, 0001.xml, etc.) that need to be parsed for a text-mining project.
But I get an error when I run the following code:
from os import listdir, path
import xml.etree.ElementTree as ET
mypath = '../project/content'
files = [f for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    tree = ET.parse("../project/content/"+file)
    root = tree.getroot()
The error message is the following:
Traceback (most recent call last):
File "/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-13-cdc3ee6c3989>", line 6, in <module>
tree = ET.parse("../project/content/"+file)
File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
tree.parse(source, parser)
File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
self._root = parser._parse_whole(source)
File "<string>", line unknown
ParseError: no element found: line 1, column 0
Where did I make a mistake?
Also, I only want to extract the text from one element of each XML file. Is it sufficient to add the line below to the code? And how can I save each of the results to a txt file?
maintext = root.find("mainText").text
Thank you very much!
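Build paths with os.path.join rather than string concatenation.
Add a print call before you create the tree, so you can see which file fails.
Is the XML you are trying to parse valid? "no element found: line 1, column 0" means the parser reached the end of the input without finding any element, as happens with an empty file.
Once you solve the parsing issue, you can use multiprocessing to parse many files at the same time.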
from os import listdir, path
import xml.etree.ElementTree as ET
mypath = '../project/content'
files = [path.join(mypath, f) for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    print(file)
    tree = ET.parse(file)
    root = tree.getroot()
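For the second part of the question (extracting one element and saving it to a txt file), here is a minimal sketch. It assumes mainText is a direct child of the root element, as in the asker's root.find("mainText") line. The function name extract_main_text, the output folder, and the two demo files are hypothetical. Files that fail to parse are collected instead of stopping the loop.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_main_text(xml_dir, out_dir):
    """Save the text of each file's <mainText> element as a .txt file.

    Returns a list of (file name, error message) for files that fail to parse.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        try:
            root = ET.parse(str(xml_path)).getroot()
        except ET.ParseError as err:
            failed.append((xml_path.name, str(err)))
            continue
        node = root.find("mainText")
        if node is None or node.text is None:
            continue  # nothing to save for this file
        target = out_dir / (xml_path.stem + ".txt")
        target.write_text(node.text, encoding="utf-8")
    return failed


# Demo with two made-up files: one valid, one empty.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "content")
    src.mkdir()
    (src / "0000.xml").write_text(
        "<doc><mainText>hello</mainText></doc>", encoding="utf-8"
    )
    (src / "0001.xml").write_text("", encoding="utf-8")
    failed = extract_main_text(src, Path(tmp, "out"))
    saved = Path(tmp, "out", "0000.txt").read_text(encoding="utf-8")

print(failed)
print(saved)
```

The list of failures shows which of the 1600+ files need a closer look, which is usually quicker than reading print output for every file.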

parse Dutch NDW xml

I am trying to parse the XML file from the Dutch NDW, which reports the traffic speed on many Dutch motorways every minute. I use this example file: http://www.ndw.nu/downloaddocument/e838c62446e862f5b6230be485291685/Reistijden.zip
I am trying to read the travel-time data into variables with Python, but I am struggling.
from xml.etree import ElementTree
import urllib2
url = "http://weburloffile.nl/ndw/Reistijden.xml"
response = urllib2.urlopen(url)
namespaces = {
    'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
    'a': 'http://datex2.eu/schema/2/2_0'
}
dom = ElementTree.fromstring(response.read)
names = dom.findall(
    'soap:Envelope'
    '/a:duration',
    namespaces,
)
#print names
for duration in names:
    print(duration.text)
I get this new error
Traceback (most recent call last):
File "test.py", line 9, in <module>
dom = ElementTree.fromstring(response.read)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1311, in XML
parser.feed(text)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1651, in feed
self._parser.Parse(data, 0)
TypeError: Parse() argument 1 must be string or read-only buffer, not instancemethod
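How do I parse this (complex) XML correctly?
Edit: I changed it to read, as suggested in a comment.
The problem isn't the XML parsing; it's that you are using the response object incorrectly. urllib2.urlopen returns a file-like object that does not have a content attribute. Instead, you should call read on it, with parentheses. Passing response.read without calling it hands the parser a bound method, which is what the TypeError above reports. The corrected line is: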
dom = ElementTree.fromstring(response.read())
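Once the read call is fixed, the namespace lookup still has to match the document. A minimal sketch, assuming the root element is the SOAP envelope itself and the duration elements sit somewhere below it in the DATEX II namespace: the sample document here is made up and much smaller than the real NDW feed, whose structure may differ. The ".//a:duration" path searches all descendants instead of spelling out the full path.

```python
import xml.etree.ElementTree as ET

namespaces = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "a": "http://datex2.eu/schema/2/2_0",
}

# Made-up stand-in for the downloaded feed.
sample = b"""<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <d2LogicalModel xmlns="http://datex2.eu/schema/2/2_0">
      <duration>42.0</duration>
      <duration>57.5</duration>
    </d2LogicalModel>
  </soap:Body>
</soap:Envelope>"""

root = ET.fromstring(sample)

# ".//" matches at any depth below the root; the "a" prefix is resolved
# through the namespaces mapping, not through the prefixes in the document.
durations = [float(el.text) for el in root.findall(".//a:duration", namespaces)]
print(durations)
```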

python - Parse XML with unicode characters into ElementTree

I'm using PDFminer, but it contains a bug and I get the following invalid XML file:
<?xml version="1.1" encoding="UTF-8"?>
<string size="16">ô‚ÌfƇ*š]Ö[</string>
When I try to parse it with ElementTree, I get the following error:
bookXml = xml.etree.ElementTree.parse(filename)
File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 1196, in parse
tree.parse(source, parser)
File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 597, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 36
I think the best way to handle this case is to fix the XML first, but how?
I would wrap the offending XML string in CDATA. Confirmed working as soon as I did this. Example:
<?xml version="1.1" encoding="UTF-8"?>
<string><![CDATA[ô‚ÌƇ*šÖ]]></string>
More about CDATA here.
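A small sketch of why the CDATA wrapper helps, assuming the original file contained numeric character references to characters that XML does not allow (which is what the "reference to invalid character number" message suggests). The two sample strings are made up. Inside CDATA the parser treats the references as literal text instead of trying to resolve them.

```python
import xml.etree.ElementTree as ET

# A character reference to a control character is rejected by the parser.
broken = '<string size="16">&#1;&#2;</string>'
try:
    ET.fromstring(broken)
    parsed_broken = True
except ET.ParseError:
    parsed_broken = False

# Wrapped in CDATA, the same characters are kept as literal text.
wrapped = '<string size="16"><![CDATA[&#1;&#2;]]></string>'
text = ET.fromstring(wrapped).text

print(parsed_broken)
print(text)
```

Note that the wrapped version preserves the reference syntax as plain text; it does not recover the characters that PDFminer meant to emit.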

Import xml in another xml with Python ElementTree parser

Is it possible to load an XML file that imports another XML file with Python's ElementTree.parse?
For example:
I have file test.xml which contains:
<TestXml>
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
</TestXml>
and I also have test_1.xml which contains:
<test>it works!</test>
and I want to load test.xml in my python script:
from xml.etree.ElementTree import parse
a = parse('test.xml')
print a.find('test').text
and I expect it to output:
it works!
but instead I have:
Traceback (most recent call last):
File "D:/Work/depot/WIP/olex/Python/test/test.py", line 3, in <module>
a = parse('test.xml')
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
parser.feed(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 6
Does somebody know what I am doing wrong, or is it simply impossible to load such an XML file with Python's ElementTree parser?
The specific problem you are having is that your xml is malformed. Your DOCTYPE declaration should not be inside your root element. Rather, it should precede your root element:
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
<TestXml>
some content . . .
</TestXml>
That said, you will face a larger problem once you solve that issue. How do you use Python to parse the DOCTYPE declaration? Should you use the xml module, the lxml module, or the bs4 module?
That's a tough question. From what I have seen, people have (recently) had to do dtd parsing themselves. See the SO threads here and here for some possible leads.
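The placement rule from the answer can be checked with the standard library. This is a sketch under one loud assumption: an internal entity with inline text is swapped in for the external SYSTEM entity from the question, so that the example runs without a second file. It shows only that the DOCTYPE must come before the root element; it does not show how to load test_1.xml.

```python
import xml.etree.ElementTree as ET

# DOCTYPE before the root element: well formed.
# Assumption: an internal entity stands in for the external test_1.xml.
good = """<!DOCTYPE doc [
<!ENTITY otherFile "it works!">
]>
<TestXml><test>&otherFile;</test></TestXml>"""

# DOCTYPE inside the root element, as in the question: not well formed.
bad = """<TestXml>
<!DOCTYPE doc [
<!ENTITY otherFile "it works!">
]>
</TestXml>"""

result = ET.fromstring(good).find("test").text

try:
    ET.fromstring(bad)
    bad_error = None
except ET.ParseError as err:
    bad_error = str(err)

print(result)
print(bad_error)
```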

How to validate xml using python without third-party libs?

I have some xml pieces like this:
<!DOCTYPE mensaje SYSTEM "record.dtd">
<record>
<player_birthday>1979-09-23</player_birthday>
<player_name>Orene Ai'i</player_name>
<player_team>Blues</player_team>
<player_id>453</player_id>
<player_height>170</player_height>
<player_position>F&W</player_position> <---- a '&' here.
<player_weight>75</player_weight>
</record>
Is there any way to check whether the XML pieces are well formed?
Is there any way to validate the XML against a DTD or XML Schema?
For various reasons I can't use any third-party packages.
E.g. the XML above is not correct since it has a '&' in it. Note that the DOCTYPE declaration refers to a DTD.
Just try to parse it with ElementTree (xml.etree.ElementTree.fromstring) - it will raise an error if the XML is not well formed.
>>> a = """<record>
... <player_birthday>1979-09-23</player_birthday>
... <player_name>Orene Ai'i</player_name>
... <player_team>Blues</player_team>
... <player_id>453</player_id>
... <player_height>170</player_height>
... <player_position>F&W</player_position> <---- a '&' here.
... <player_weight>75</player_weight>
... </record>"""
>>>
>>> from xml.etree import ElementTree as ET
>>> x = ET.fromstring(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1282, in XML
parser.feed(text)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 24
You can use python's xml.dom.minidom XML parser (which is in the standard library, but isn't as powerful as alternatives such as lxml).
Just do:
import xml.dom.minidom
xml.dom.minidom.parseString('<My><XML><String/><XML/><My/>')
You will get an xml.parsers.expat.ExpatError if the XML is not well formed, as is the case for the string above.
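Both answers boil down to "try to parse and catch the error". A minimal sketch that wraps this in a helper (is_well_formed is a hypothetical name) follows. It checks well-formedness only; the standard library parsers used here do not validate against a DTD or XML Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A bare '&' is not allowed in XML text; it must be written as '&amp;'.
bad = "<record><player_position>F&W</player_position></record>"
good = "<record><player_position>F&amp;W</player_position></record>"

print(is_well_formed(bad))
print(is_well_formed(good))
```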
