Unknown error when using lxml module - parsing XML - python

I am currently learning Python with Python 101, and one of the examples raises an error that I have no clue how to fix. My code is 100% the same as in the book (I have checked it three times already), and it still outputs this error.
Here is the code:
from lxml import etree

def parseXML(xmlFile):
    """
    Parse the xml
    """
    with open(xmlFile) as fobj:
        xml = fobj.read()

    root = etree.fromstring(xml)

    for appt in root.getchildren():
        for elem in appt.getchildren():
            if not elem.text:
                text = 'None'
            else:
                text = elem.text
            print(elem.tag + ' => ' + text)

if __name__ == '__main__':
    parseXML('example.xml')
and here is xml file (it's the same as in the book):
<?xml version="1.0" ?>
<zAppointments reminder-"15">
<appointment>
<begin>1181251600</begin>
<uid>0400000008200E000</uid>
<alarmTime>1181572063</alarmTime>
<state></state>
<location></location>
<duration>1800</duration>
<subject>Bring pizza home</subject>
</appointment>
<appointment>
<begin>1234567890</begin>
<duration>1800</duration>
<subject>Check MS office webstie for updates</subject>
<state>dismissed</state>
<location></location>
<uid>502fq14-12551ss-255sf2</uid>
</appointment>
</zAppointments>
EDITED: Sorry, I got so excited about my first post that I forgot to include the error message.
Traceback (most recent call last):
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 21, in <module>
parseXML('example.xml')
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 10, in parseXML
root = etree.fromstring(xml)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: Specification mandate value for attribute reminder-, line 2, column 25
Thanks for help!!

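The only error in the XML is in the root tag: <zAppointments reminder-"15"> should be <zAppointments reminder="15">. An attribute needs an equals sign between its name and its value, which is what the parser means by "Specification mandate value for attribute reminder-".
For the future, there are useful online tools for validating XML.
Here, for example: https://www.xmlvalidation.com/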
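The same check can also be run locally from Python. The following is only a minimal sketch: the helper name check_well_formed and the two inline sample strings are made up for illustration, while etree.fromstring and etree.XMLSyntaxError are the lxml calls already seen in the traceback.

```python
from lxml import etree

def check_well_formed(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        etree.fromstring(xml_text)
    except etree.XMLSyntaxError as err:
        # The message carries the line and column of the first problem.
        return str(err)
    return None

# Hypothetical one-line versions of the root tag from the question.
broken = '<zAppointments reminder-"15"></zAppointments>'
fixed = '<zAppointments reminder="15"></zAppointments>'

print(check_well_formed(broken))  # prints the parser's complaint
print(check_well_formed(fixed))
```

Running this on the contents of example.xml before handing the file to the rest of the script gives the position of the typo without reading the full traceback.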

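The error is probably in this line:
<zAppointments reminder-"15">
To check a file next time, try xmllint:
xmllint --valid --noout example.xml
(--valid additionally validates against a DTD. example.xml declares none, so drop that flag if you only want the well-formedness check.)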

Related

Parsing bad XHTML

My new project is to extract data from the Naxos Glossary of Musical Terms, a great resource whose text data I want to process and extract to a database to use on another, simpler website I'll create.
My only problem is the awful XHTML formatting. The W3C XHTML validation raises 318 errors and 54 warnings. Even an HTML tidier I found can't fix it all.
I'm using Python 3.67 and the page I'm parsing was ASP. I've tested LXML and Python XML modules, but both fail.
Can anyone suggest any other tidiers or modules? Or will I have to use some sort of raw text manipulation (yuck!)?
My code:
LXML:
from lxml import etree
file = open("glossary.asp", "r", encoding="ISO-8859-1")
parsed = etree.parse(file)
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
parsed = etree.parse(file)
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128
>>>
Python XML (using the tidied XHTML):
import xml.etree.ElementTree as ET
file = open("tidy.html", "r", encoding="ISO-8859-1")
root = ET.fromstring(file.read())
# Top-level elements
print(root.findall("."))
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
root = ET.fromstring(file.read())
File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33
lxml treats the input as XML when you call etree.parse that way, so every HTML quirk becomes a syntax error.
Use its HTML parser instead:
from lxml import html
# doc.cssselect() needs the separate cssselect package to be installed
with open("glossary.asp", "r", encoding="ISO-8859-1") as file:
    doc = html.document_fromstring(file.read())
print(doc.cssselect('title')[0].text_content())
Also, instead of "HTML tidiers", you can open the page in Chrome and copy the HTML from the Elements panel.
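To see the difference without the real glossary page, here is a small sketch. The page string is a made-up stand-in containing the kinds of problems described in the question (an unescaped "&", an HTML-only entity, an unclosed tag), and it uses findtext and XPath so that the optional cssselect package is not required.

```python
from lxml import html

# Hypothetical stand-in for the glossary page: the bare "&", the &nbsp;
# entity and the unclosed <p> would all trip up an XML parser.
page = """<html><head><title>Glossary</title></head>
<body><p>Allegro & Andante<p>Fast&nbsp;tempo</body></html>"""

doc = html.document_fromstring(page)

# findtext() and xpath() come with lxml itself.
title = doc.findtext(".//title")
paragraphs = [p.text_content() for p in doc.xpath("//p")]

print(title)
print(paragraphs)
```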

How to include long base64 string in XML written by lxml element factory?

I'm using the lxml element factory on Python 3 to create an XML file that contains base64-encoded PDF files. The XML file will be used to import data into database software, so the schema cannot be changed.
When creating the XML file, lxml complains about the length of the base64 string:
article = E.article(
E.galley(
E.label('PDF'),
E.file(
ET.XML("<embed filename=\"" + row['galley'] + ".pdf\""
+ " encoding=\"base64\" mime_type=\"application/pdf\" >"
+ str(base64fulltext)
+ "</embed>")
), self.LOCALE(row['language']),
), self.LANGUAGE(row['language'])
)
When running the whole script, the error message ('line 45') points to the line where it says str(base64fulltext) in the code snippet above. The error message is as follows:
(lxml) vboxadmin#linux-x3el:~/repos/x> python3 test-csvFileImport.py
Traceback (most recent call last):
File "test-csvFileImport.py", line 65, in <module>
articlePdfBase64)
File "/home/vboxadmin/repos/x/y/writer.py", line 45, in exportArticle
+ "</embed>")
File "src/lxml/etree.pyx", line 3192, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1757, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1067, in lxml.etree._BaseParser._parseUnicodeDoc
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10027189
The expected result is that the base64 string is written to the XML file.
So far, I have only found the "huge_tree" option of lxml.etree.iterparse (http://lxml.de/api/lxml.etree.iterparse-class.html), but I am not sure whether or how I can use it to solve my problem.
As a workaround, I am considering using string replacement to insert the base64 string into the XML after it has been written to file. However, I would be happier with a proper lxml solution if anyone could suggest one. Thanks!
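Two possible directions, offered as a sketch rather than a confirmed fix for this script. The function names, the sample payload and the file name are hypothetical. The first function lets the element factory set the text directly, so no parser is involved. The second keeps the string-parsing approach but passes a parser created with huge_tree=True, an option that etree.XMLParser accepts as well as iterparse.

```python
import base64
from lxml import etree
from lxml.builder import E

def embed_via_builder(b64_text, filename):
    # Text given to the element factory is assigned to the element as-is;
    # nothing is parsed, so the parser's text-node limit does not come up.
    return E.embed(b64_text, filename=filename,
                   encoding="base64", mime_type="application/pdf")

def embed_via_parser(b64_text, filename):
    # huge_tree=True disables libxml2's security limits on very large nodes.
    # Sketch only: attribute values are not escaped here.
    parser = etree.XMLParser(huge_tree=True)
    xml = ('<embed filename="%s" encoding="base64" '
           'mime_type="application/pdf">%s</embed>' % (filename, b64_text))
    return etree.fromstring(xml, parser)

# Hypothetical small payload standing in for a real PDF.
payload = base64.b64encode(b"placeholder PDF bytes").decode("ascii")

built = embed_via_builder(payload, "article.pdf")
parsed = embed_via_parser(payload, "article.pdf")
print(built.get("filename"), len(built.text), len(parsed.text))
```

The builder variant also avoids assembling XML by string concatenation, which is fragile when a file name contains quotes or ampersands.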

LXML issue parsing XML schema in Python 3

I'm attempting to use the XRDTools library to convert Panalytical XRDML files into a more database-friendly format, such as a pandas dataframe.
The XRDTools library is described here: https://github.com/paruch-group/xrdtools. It imports the XRDML file into a Python dictionary. I'm totally new to LXML, so I apologize if this is a simple question.
I've used Anaconda to create Python 2.7 and 3.6 environments specifically to work with the XRDTools package. I'd like to run it in Python 3.6.
In Python 2.7, this code runs smoothly:
import xrdtools
xrd = xrdtools.read_xrdml('filename.xrdml')
Output is a dict:
{u'2Theta': array([63. , 63.00334225, 63.00668449, ..., 67.99331551,
67.99665775, 68. ]),
u'Lambda': 1.540598,
u'Omega': array([31. , 31.00200535, 31.0040107 , ..., 33.9959893 ,
33.99799465, 34. ]), ...
I can then use the dictionary like any other Python object.
In Python 3.6, that same code generates this error message:
Traceback (most recent call last):
File "...\AppData\Local\Continuum\Anaconda2\envs\py36xrd\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-b6f5409b8bf9>", line 1, in <module>
xrd = xrdtools.read_xrdml('filename.xrdml')
File "...\XRDTools\xrdtools\xrdtools\io.py", line 297, in read_xrdml
valid = validate_xrdml_schema(filename)
File ...\XRDTools\xrdtools\xrdtools\io.py", line 43, in validate_xrdml_schema
xmlschema_doc = etree.parse(f)
File "src\lxml\etree.pyx", line 3444, in lxml.etree.parse (src\lxml\etree.c:83171)
File "src\lxml\parser.pxi", line 1855, in lxml.etree._parseDocument (src\lxml\etree.c:121011)
File "src\lxml\parser.pxi", line 1875, in lxml.etree._parseFilelikeDocument (src\lxml\etree.c:121294)
File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFilelike (src\lxml\etree.c:120078)
File "src\lxml\parser.pxi", line 1185, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\etree.c:114806)
File "src\lxml\parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\etree.c:107724)
File "src\lxml\parser.pxi", line 709, in lxml.etree._handleParseResult (src\lxml\etree.c:109433)
File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError (src\lxml\etree.c:108287)
File "...\XRDTools\xrdtools\xrdtools\data\schemas\XRDMeasurement15.xsd", line 1
<?xml version="1.0" encoding="UTF-8"?>
^
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
Digging into io.py, there is this function:
def validate_xrdml_schema(filename):
    """Validate the xml schema of a given file.

    Parameters
    ----------
    filename : str
        The Filename of the `.xrdml` file to test.

    Returns
    -------
    float or None
        Returns the version number as float or None if
        the file was not matching any provided xml schema.
    """
    schemas = [(1.5, 'data/schemas/XRDMeasurement15.xsd'),
               (1.4, 'data/schemas/XRDMeasurement14.xsd'),
               (1.3, 'data/schemas/XRDMeasurement13.xsd'),
               (1.2, 'data/schemas/XRDMeasurement12.xsd'),
               (1.1, 'data/schemas/XRDMeasurement11.xsd'),
               (1.0, 'data/schemas/XRDMeasurement10.xsd'),
               ]
    schemas = [(v, os.path.join(package_path, schema)) for v, schema in schemas]
    with open(filename, 'r') as f:
        data_xml = etree.parse(f)
    for version, schema in schemas:
        with open(schema, 'r') as f:
            xmlschema_doc = etree.parse(f)
        xmlschema = etree.XMLSchema(xmlschema_doc)
        valid = xmlschema.validate(data_xml)
        if valid:
            return version
    return None
From what I've read, xmlschema_doc = etree.parse(f) is causing the issues. If I change that line to etree.parse(filename), it'll run without an error, but I'm not sure if that matters at all. I also haven't been able to apply that fix to anything other than a small self-contained cell in a Jupyter notebook.
What causes the error? Is there a way to fix it for Python 3? What's the best way to implement that fix?
Would love to get this resolved. TIA!
Most related problem I could find:
Python 3.4 lxml.etree: Start tag expected, '<' not found, line 1, column 1
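Try opening the file with an explicit encoding:
with io.open(filename, 'r', encoding='utf8') as f:
    data_xml = etree.parse(f)
(io.open is used because the same call works in both Python 2 and Python 3; it needs import io.) The traceback points at the schema file, so the open(schema, 'r') call a few lines further down needs the same change.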
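An alternative to decoding the file yourself is to give lxml the raw bytes, either by opening the file in binary mode or by passing the path, which the question reports as working. The sketch below assumes the failure comes from how text mode decodes the start of the file. The helper name parse_xml_file and the sample file with a byte-order mark are made up for illustration.

```python
import os
import tempfile
from lxml import etree

def parse_xml_file(path):
    # Binary mode hands lxml the undecoded bytes, so the parser itself deals
    # with a byte-order mark and the encoding declared in the XML prolog.
    with open(path, "rb") as f:
        return etree.parse(f)

# Hypothetical sample: a UTF-8 file that starts with a byte-order mark.
sample = b'\xef\xbb\xbf<?xml version="1.0" encoding="UTF-8"?><schema/>'
with tempfile.NamedTemporaryFile(suffix=".xsd", delete=False) as tmp:
    tmp.write(sample)

try:
    root_tag = parse_xml_file(tmp.name).getroot().tag
    # Passing the path instead of a file object is the other option.
    same_tag = etree.parse(tmp.name).getroot().tag
finally:
    os.remove(tmp.name)

print(root_tag, same_tag)
```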

etree generating error when using urllib

I am trying to parse an HTML table in Python (2.7) with the solutions in this post.
When I try either of the first two with a string (as in the example), it works perfectly.
But when I try to use etree.XML on an HTML page I read with urllib, I get an error. I checked this for each of the solutions, and the variable I pass is a str as well.
For the following code:
from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)
I get this error:
File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line
9, in table = etree.XML(s)
File "lxml.etree.pyx", line 2723, in lxml.etree.XML
(src/lxml/lxml.etree.c:52448)
File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument
(src/lxml/lxml.etree.c:79932)
File "parser.pxi", line 1452, in lxml.etree._parseDoc
(src/lxml/lxml.etree.c:78774)
File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc
(src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 564, in
lxml.etree._ParserContext._handleParseResultDoc
(src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult
(src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError
(src/lxml/lxml.etree.c:71955) lxml.etree.XMLSyntaxError: Opening and
ending tag mismatch: link line 8 and head, line 8, column 48
and for this code:
from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)
I get this error:
Traceback (most recent call last): File
"C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in
table = ET.XML(s)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in
_raiseerror
raise err xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111
While they may look like the same kind of markup, HTML is not as strict as XML about being well-formed and following markup rules (opening/closing nodes, escaping entities, etc.). Hence, what passes for HTML may not be allowed in XML.
Therefore, consider using etree's HTML() function to parse the page. Additionally, you can use XPath to target the particular area you intend to extract or use. Below is an example attempting to pull the main page's table. Do note the webpage uses quite a few nested tables.
from lxml import etree
import urllib.request as rq
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))
# PARSE PAGE
htmlpage = etree.HTML(s)
# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")
for row in htmltable:
    print(row)
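Since the live page may change or be unavailable, the same contrast can be tried offline. This is only an illustrative sketch: the page bytes below are invented, with an unclosed link tag similar to the one the original error message complains about.

```python
from lxml import etree

# Hypothetical page fragment: <link> is never closed, which is legal HTML
# but a tag mismatch for an XML parser.
page = b"""<html><head><link rel="stylesheet" href="style.css"></head>
<body><table>
<tr><td><b>Rank</b></td><td><b>Title</b></td></tr>
<tr><td>1</td><td>Example Movie</td></tr>
</table></body></html>"""

# The XML parser rejects the fragment...
try:
    etree.XML(page)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# ...while the HTML parser accepts it and XPath can walk the table.
root = etree.HTML(page)
rows = [["".join(cell.itertext()) for cell in tr.xpath("./td")]
        for tr in root.xpath("//table//tr")]

print(xml_ok)
print(rows)
```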

how to pass an xml file to lxml to parse?

I'm trying to parse an xml file using lxml. xml.etree allowed me to simply pass the file name as a parameter to the parse function, so I attempted to do the same with lxml.
My code:
from lxml import etree
from lxml import objectify
file = "C:\Projects\python\cb.xml"
tree = etree.parse(file)
but I get the error:
Traceback (most recent call last):
File "cb.py", line 5, in <module>
tree = etree.parse(file)
File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:4
9590)
File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etre
e.c:71205)
File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lx
ml.etree.c:71488)
File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.e
tree.c:70583)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/
lxml/lxml.etree.c:67736)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDo
c (src/lxml/lxml.etree.c:63820)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.e
tree.c:64741)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etr
ee.c:64084)
lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 26
What am I doing wrong?
What you are doing wrong is (1) not checking whether you get the same outcome by using xml.etree on the same file, and (2) not reading the error message, which indicates a syntax error in line 2 of the file, well downstream from any file-opening issue.
I stumbled across a similar error message this morning, and for me the answer was a malformed DTD. In my DTD, there was an Attribute definition with a default value that was not enclosed in quotes - as soon as I changed that, the error didn't happen anymore.
You have a syntax error in your XML Markup. You aren't doing anything wrong.
lxml lets you load broken XML by creating a parser instance with recover=True and passing it to the parse call:
parser = etree.XMLParser(recover=True)
tree = etree.parse(file, parser)
While this is not ideal, I use it to load XML for schema/DTD/Schematron validation.
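A short sketch of what recovery looks like in practice. The broken document is made up for illustration, and how much of a damaged file survives depends on the kind of damage, so the parser's error log is worth checking afterwards.

```python
from lxml import etree

# Hypothetical broken document: the second <item> is never closed.
broken = b"<root><item>one</item><item>two</root>"

parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken, parser)

# The problems the parser worked around are kept in its error log.
for entry in parser.error_log:
    print(entry.line, entry.column, entry.message)

print(etree.tostring(root))
```

Because recovery silently repairs or drops content, it is safer for inspection and validation than for data that is written back unchanged.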
