LXML issue parsing XML schema in Python 3 - python

I'm attempting to use the XRDTools library to convert Panalytical XRDML files into a more database-friendly format, such as a pandas dataframe.
The XRDTools library is described here: https://github.com/paruch-group/xrdtools. It imports the XRDML file into a Python dictionary. I'm totally new to LXML, so I apologize if this is a simple question.
I've used Anaconda to create Python 2.7 and 3.6 environments specifically to work with the XRDTools package. I'd like to run it in Python 3.6.
In Python 2.7, this code runs smoothly:
import xrdtools
xrd = xrdtools.read_xrdml('filename.xrdml')
Output is a dict:
{u'2Theta': array([63. , 63.00334225, 63.00668449, ..., 67.99331551,
67.99665775, 68. ]),
u'Lambda': 1.540598,
u'Omega': array([31. , 31.00200535, 31.0040107 , ..., 33.9959893 ,
33.99799465, 34. ]), ...
I can then use the dictionary like any other Python object.
In Python 3.6, that same code generates this error message:
Traceback (most recent call last):
File "...\AppData\Local\Continuum\Anaconda2\envs\py36xrd\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-b6f5409b8bf9>", line 1, in <module>
xrd = xrdtools.read_xrdml('filename.xrdml')
File "...\XRDTools\xrdtools\xrdtools\io.py", line 297, in read_xrdml
valid = validate_xrdml_schema(filename)
File "...\XRDTools\xrdtools\xrdtools\io.py", line 43, in validate_xrdml_schema
xmlschema_doc = etree.parse(f)
File "src\lxml\etree.pyx", line 3444, in lxml.etree.parse (src\lxml\etree.c:83171)
File "src\lxml\parser.pxi", line 1855, in lxml.etree._parseDocument (src\lxml\etree.c:121011)
File "src\lxml\parser.pxi", line 1875, in lxml.etree._parseFilelikeDocument (src\lxml\etree.c:121294)
File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFilelike (src\lxml\etree.c:120078)
File "src\lxml\parser.pxi", line 1185, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\etree.c:114806)
File "src\lxml\parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\etree.c:107724)
File "src\lxml\parser.pxi", line 709, in lxml.etree._handleParseResult (src\lxml\etree.c:109433)
File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError (src\lxml\etree.c:108287)
File "...\XRDTools\xrdtools\xrdtools\data\schemas\XRDMeasurement15.xsd", line 1
<?xml version="1.0" encoding="UTF-8"?>
^
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
Digging into io.py, there is this function:
def validate_xrdml_schema(filename):
    """Validate the xml schema of a given file.
    Parameters
    ----------
    filename : str
        The Filename of the `.xrdml` file to test.
    Returns
    -------
    float or None
        Returns the version number as float or None if
        the file was not matching any provided xml schema.
    """
    schemas = [(1.5, 'data/schemas/XRDMeasurement15.xsd'),
               (1.4, 'data/schemas/XRDMeasurement14.xsd'),
               (1.3, 'data/schemas/XRDMeasurement13.xsd'),
               (1.2, 'data/schemas/XRDMeasurement12.xsd'),
               (1.1, 'data/schemas/XRDMeasurement11.xsd'),
               (1.0, 'data/schemas/XRDMeasurement10.xsd'),
               ]
    schemas = [(v, os.path.join(package_path, schema)) for v, schema in schemas]
    with open(filename, 'r') as f:
        data_xml = etree.parse(f)
    for version, schema in schemas:
        with open(schema, 'r') as f:
            xmlschema_doc = etree.parse(f)
        xmlschema = etree.XMLSchema(xmlschema_doc)
        valid = xmlschema.validate(data_xml)
        if valid:
            return version
    return None
From what I've read, xmlschema_doc = etree.parse(f) is causing the issues. If I change that line to etree.parse(filename), it'll run without an error, but I'm not sure if that matters at all. I also haven't been able to apply that fix to anything other than a small self-contained cell in a Jupyter notebook.
What causes the error? Is there a way to fix it for Python 3? What's the best way to implement that fix?
Would love to get this resolved. TIA!
Most related problem I could find:
Python 3.4 lxml.etree: Start tag expected, '<' not found, line 1, column 1

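Try opening the file with an explicit encoding:
import io
with io.open(filename, 'r', encoding='utf8') as f:
    data_xml = etree.parse(f)
(io.open is used because the call is the same in both Python 2 and Python 3.) The traceback points at the schema file, so the open(schema, 'r') call a few lines further down needs the same change.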
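As an alternative to opening the file in text mode, and consistent with the asker's observation that passing a path to etree.parse works, lxml can be left to read the bytes itself so that the encoding is detected from the file. The sketch below is only an illustration: the helper name load_schema and the tiny schema written to a temporary file are assumptions, not part of XRDTools, and the byte-order mark is just one plausible reason a text-mode read could go wrong.

```python
import os
import tempfile

from lxml import etree

# Hypothetical minimal schema, used only for this illustration.
SCHEMA = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">\n'
    '  <xs:element name="note" type="xs:string"/>\n'
    '</xs:schema>\n'
)


def load_schema(path):
    """Parse an XSD from a path so lxml handles the decoding itself."""
    return etree.XMLSchema(etree.parse(path))


with tempfile.TemporaryDirectory() as tmp:
    schema_path = os.path.join(tmp, "note.xsd")
    # "utf-8-sig" writes a byte-order mark in front of the XML declaration.
    with open(schema_path, "w", encoding="utf-8-sig") as f:
        f.write(SCHEMA)

    schema = load_schema(schema_path)
    is_valid = schema.validate(etree.fromstring("<note>hello</note>"))
    is_invalid = schema.validate(etree.fromstring("<other/>"))

print(is_valid, is_invalid)
```

Opening the file with open(path, 'rb') and passing the file object to etree.parse should behave the same way, since lxml again receives raw bytes.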

Related

Parsing bad XHTML

My new project is to extract data from the Naxos Glossary of Musical Terms, a great resource whose text data I want to process and extract to a database to use on another, simpler website I'll create.
My only problem is awful XHTML formatting. The W3C XHTML validation raises 318 errors and 54 warnings. Even an HTML Tidier I found can't fix it all.
I'm using Python 3.67 and the page I'm parsing was ASP. I've tested LXML and Python XML modules, but both fail.
Can anyone suggest any other tidiers or modules? Or will I have to use some sort of raw text manipulation (yuck!)?
My code:
LXML:
from lxml import etree
file = open("glossary.asp", "r", encoding="ISO-8859-1")
parsed = etree.parse(file)
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
parsed = etree.parse(file)
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128
>>>
Python XML (using the tidied XHTML):
import xml.etree.ElementTree as ET
file = open("tidy.html", "r", encoding="ISO-8859-1")
root = ET.fromstring(file.read())
# Top-level elements
print(root.findall("."))
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
root = ET.fromstring(file.read())
File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33
etree.parse uses lxml's XML parser, so it treats the page as strict XML and fails on the malformed markup.
Use the lenient HTML parser in lxml.html instead:
from lxml import html
from cssselect import GenericTranslator, SelectorError
file = open("glossary.asp", "r", encoding="ISO-8859-1")
doc = html.document_fromstring(file.read())
print(doc.cssselect('title')[0].text_content())
Also, instead of an HTML tidier, you can open the page in Chrome and copy the HTML from the Elements panel.
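A self-contained sketch of the difference between the two parsers, using a made-up fragment rather than the Naxos page (the sample markup is an assumption for illustration). It uses findtext and iter so that the optional cssselect package is not needed.

```python
from lxml import etree, html

# Hypothetical malformed markup: a bare ampersand and an unclosed <p>.
BROKEN = (
    "<html><head><title>Glossary</title></head>"
    "<body><p>Allegro & Andante<p>Adagio</body></html>"
)

# The strict XML parser rejects the fragment.
try:
    etree.fromstring(BROKEN)
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print("XML parser failed:", exc)

# The HTML parser repairs what it can and still returns a tree.
doc = html.document_fromstring(BROKEN)
title = doc.findtext(".//title")
paragraphs = [p.text_content().strip() for p in doc.iter("p")]
print(title, paragraphs)
```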

How to include long base64 string in XML written by lxml element factory?

I'm using the lxml element factory on Python 3 to create an XML file that contains base64-encoded pdf files. The XML file will be used to import data into a database software, so the schema can not be changed.
When creating the XML file, lxml complains about the length of the base64 string:
article = E.article(
    E.galley(
        E.label('PDF'),
        E.file(
            ET.XML("<embed filename=\"" + row['galley'] + ".pdf\""
                   + " encoding=\"base64\" mime_type=\"application/pdf\" >"
                   + str(base64fulltext)
                   + "</embed>")
        ), self.LOCALE(row['language']),
    ), self.LANGUAGE(row['language'])
)
When running the whole script, the error message ('line 45') points to the line where it says str(base64fulltext) in the code snippet above. The error message is as follows:
(lxml) vboxadmin#linux-x3el:~/repos/x> python3 test-csvFileImport.py
Traceback (most recent call last):
File "test-csvFileImport.py", line 65, in <module>
articlePdfBase64)
File "/home/vboxadmin/repos/x/y/writer.py", line 45, in exportArticle
+ "</embed>")
File "src/lxml/etree.pyx", line 3192, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1757, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1067, in lxml.etree._BaseParser._parseUnicodeDoc
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10027189
The expected result is for the base64 string to be written to the XML file.
So far, I could only find that there is the option "huge_tree" in lxml.etree.iterparse (http://lxml.de/api/lxml.etree.iterparse-class.html), but I am not sure whether/how I can use this to solve my problem.
As a workaround, I am considering using string replace to insert the base64 string into the XML after it has been written to file. However, I would be happier with a proper lxml solution if anyone could suggest one. Thanks!
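One possible approach, sketched under assumptions: instead of assembling an XML string and parsing it with ET.XML, pass the base64 text to the element factory directly, so that no parser (and no parser limit on text node size) is involved when the element is built. If such a document has to be parsed later, lxml's XMLParser also accepts huge_tree=True. The payload and file name below are placeholders, not the asker's data.

```python
import base64

from lxml import etree
from lxml.builder import E

# Stand-in payload, large enough that its base64 form exceeds 10 million characters.
pdf_bytes = b"x" * 8_000_000
b64_text = base64.b64encode(pdf_bytes).decode("ascii")

# Attributes are given as keyword arguments and the payload as the text child,
# so no XML string has to be parsed to create the element.
embed = E.embed(
    b64_text,
    filename="article.pdf",
    encoding="base64",
    mime_type="application/pdf",
)
galley = E.galley(E.label("PDF"), E.file(embed))
xml_bytes = etree.tostring(galley)

# Reading the result back in requires a parser that allows very large text nodes.
huge_parser = etree.XMLParser(huge_tree=True)
reparsed = etree.fromstring(xml_bytes, huge_parser)
print(len(reparsed.find(".//embed").text))
```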

Inserting data to impala table using Ibis python

I'm trying to insert a DataFrame into a partitioned Impala table that was created with Ibis. I am running this on a remote kernel, using Spyder 3.2.4 on a Windows 10 machine and Python 3.6.2 on an edge node machine running CentOS.
I get the following error:
Writing DataFrame to temporary file
Writing CSV to: /tmp/ibis/pandas_0032f9dd1916426da62c8b4d8f4dfb92/0.csv
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
insert = target_table.insert(df3)
File "/usr/local/lib/python3.6/site-packages/ibis/impala/client.py", line 1674, in insert
writer, expr = write_temp_dataframe(self._client, obj)
File "/usr/local/lib/python3.6/site-packages/ibis/impala/pandas_interop.py", line 225, in write_temp_dataframe
return writer, writer.delimited_table(path)
File "/usr/local/lib/python3.6/site-packages/ibis/impala/pandas_interop.py", line 188, in delimited_table
schema = self.get_schema()
File "/usr/local/lib/python3.6/site-packages/ibis/impala/pandas_interop.py", line 184, in get_schema
return pandas_to_ibis_schema(self.df)
File "/usr/local/lib/python3.6/site-packages/ibis/impala/pandas_interop.py", line 219, in pandas_to_ibis_schema
return schema(pairs)
File "/usr/local/lib/python3.6/site-packages/ibis/expr/api.py", line 105, in schema
return Schema.from_tuples(pairs)
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 109, in from_tuples
return Schema(names, types)
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 55, in __init__
self.types = [validate_type(typ) for typ in types]
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 55, in <listcomp>
self.types = [validate_type(typ) for typ in types]
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 1040, in validate_type
return TypeParser(t).parse()
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 901, in parse
t = self.type()
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 1033, in type
raise SyntaxError('Type cannot be parsed: {}'.format(self.text))
File "", line unknown
SyntaxError: Type cannot be parsed: integer
The error was caused by the structure and security settings of our Hadoop system. The Ibis package tries to create a temporary database and a temporary HDFS location at __ibis_tmp and /tmp/ibis/ respectively. On our system these default locations are not open to any user other than root or the system admin, so the insert command failed when moving data from /tmp/ibis/ into the actual database (the exact path is still not clear, but it may go through the __ibis_tmp database). Once we edited the config_init.py file of the ibis package to point to an allowed temporary location and database, it worked like a charm.
Instead of editing config_init.py as described in
https://stackoverflow.com/a/47543691/5485370
it is easier to assign the temporary database and path through ibis.options:
ibis.options.impala.temp_db = 'your_temp_db'
ibis.options.impala.temp_hdfs_path = 'your_temp_hdfs_path'

Unknown error when using lxml module - parsing XML

I am currently learning with Python 101, and in one of the examples I'm getting an error that I have no clue how to fix. My code is 100% the same as in the book (I've checked it 3 times already) and it still outputs this error.
Here is the code:
from lxml import etree
def parseXML(xmlFile):
    """
    Parse the xml
    """
    with open(xmlFile) as fobj:
        xml = fobj.read()
    root = etree.fromstring(xml)
    for appt in root.getchildren():
        for elem in appt.getchildren():
            if not elem.text:
                text = 'None'
            else:
                text = elem.text
            print(elem.tag + ' => ' + text)
if __name__ == '__main__':
    parseXML('example.xml')
and here is xml file (it's the same as in the book):
<?xml version="1.0" ?>
<zAppointments reminder-"15">
<appointment>
<begin>1181251600</begin>
<uid>0400000008200E000</uid>
<alarmTime>1181572063</alarmTime>
<state></state>
<location></location>
<duration>1800</duration>
<subject>Bring pizza home</subject>
</appointment>
<appointment>
<begin>1234567890</begin>
<duration>1800</duration>
<subject>Check MS office webstie for updates</subject>
<state>dismissed</state>
<location></location>
<uid>502fq14-12551ss-255sf2</uid>
</appointment>
</zAppointments>
EDITED: Sry, got so excited about my first post that I actually forgot to put the error code.
Traceback (most recent call last):
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 21, in <module>
parseXML('example.xml')
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 10, in parseXML
root = etree.fromstring(xml)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: Specification mandate value for attribute reminder-, line 2, column 25
Thanks for help!!
The only error in the XML is here: <zAppointments reminder-"15"> should be <zAppointments reminder="15">.
In the future, useful tools for validating XML can be found online.
Here, for example: https://www.xmlvalidation.com/
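The error is probably in
<zAppointments reminder-"15">
where the attribute needs = instead of -. Next time, try checking the file with xmllint (the --valid option is only needed when the document declares a DTD to validate against):
xmllint --noout example.xml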
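A similar well-formedness check can be sketched in Python by catching lxml's XMLSyntaxError. The helper name well_formed and the shortened sample documents are illustrative assumptions.

```python
from lxml import etree

# Shortened versions of the document from the question.
BROKEN = '<zAppointments reminder-"15"><appointment/></zAppointments>'
FIXED = '<zAppointments reminder="15"><appointment/></zAppointments>'


def well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, error message)."""
    try:
        etree.fromstring(xml_text)
    except etree.XMLSyntaxError as exc:
        return False, str(exc)
    return True, None


broken_ok, broken_msg = well_formed(BROKEN)
fixed_ok, _ = well_formed(FIXED)
print(broken_ok, broken_msg)
print(fixed_ok)
```

The error message includes the line and column of the first problem, which is usually enough to find a typo like this one.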

how to pass an xml file to lxml to parse?

I'm trying to parse an xml file using lxml. xml.etree allowed me to simply pass the file name as a parameter to the parse function, so I attempted to do the same with lxml.
My code:
from lxml import etree
from lxml import objectify
file = r"C:\Projects\python\cb.xml"
tree = etree.parse(file)
but I get the error:
Traceback (most recent call last):
File "cb.py", line 5, in <module>
tree = etree.parse(file)
File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:4
9590)
File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etre
e.c:71205)
File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lx
ml.etree.c:71488)
File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.e
tree.c:70583)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/
lxml/lxml.etree.c:67736)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDo
c (src/lxml/lxml.etree.c:63820)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.e
tree.c:64741)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etr
ee.c:64084)
lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 26
What am I doing wrong?
What you are doing wrong is (1) not checking whether you get the same outcome by using xml.etree on the same file, and (2) not reading the error message, which indicates a syntax error in line 2 of the file, well downstream of any file-opening issue.
I stumbled across a similar error message this morning, and for me the answer was a malformed DTD. In my DTD, there was an Attribute definition with a default value that was not enclosed in quotes - as soon as I changed that, the error didn't happen anymore.
You have a syntax error in your XML Markup. You aren't doing anything wrong.
lxml allows you to load broken XML by creating a parser instance with recover=True:
etree.XMLParser(recover=True)
While this is not ideal, I use this to load an xml for schema/dtd/schematron validation.
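A minimal sketch of the recovering parser in use, with a made-up broken document (an assumption for illustration, not the asker's file). The parser's error_log shows what had to be repaired, which is worth checking because recovery can silently drop content.

```python
from lxml import etree

BROKEN = "<root><a>text</root>"  # hypothetical input: <a> is never closed

# The default parser refuses the document.
strict_failed = False
try:
    etree.fromstring(BROKEN)
except etree.XMLSyntaxError:
    strict_failed = True

# A recovering parser builds whatever tree it can.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(BROKEN, parser)

# Messages for the problems the parser encountered while recovering.
problems = [entry.message for entry in parser.error_log]
print(root.tag, problems)
```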
