Parsing bad XHTML

Parsing bad XHTML - python

My new project is to extract data from the Naxos Glossary of Musical Terms, a great resource whose text data I want to process and extract to a database to use on another, simpler website I'll create.
My only problem is awful XHTML formatting. The
W3C XHTML validation raises 318 errors and 54 warnings. Even a HTML Tidier I found can't fix it all.
I'm using Python 3.67 and the page I'm parsing was ASP. I've tested LXML and Python XML modules, but both fail.
Can anyone suggest any other tidiers or modules? Or will I have to use some sort of raw text manipulation (yuck!)?
My code:
LXML:
from lxml import etree
file = open("glossary.asp", "r", encoding="ISO-8859-1")
parsed = etree.parse(file)
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
parsed = etree.parse(file)
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128
>>>
Python XML (using the tidied XHTML):
import xml.etree.ElementTree as ET
file = open("tidy.html", "r", encoding="ISO-8859-1")
root = ET.fromstring(file.read())
# Top-level elements
print(root.findall("."))
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
root = ET.fromstring(file.read())
File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33

Lxml likely thinks you're giving it xml that way.
Try it like this:
from lxml import html
from cssselect import GenericTranslator, SelectorError
file = open("glossary.asp", "r", encoding="ISO-8859-1")
doc = html.document_fromstring(file.read())
print(doc.cssselect('title')[0].text_content())
Also instead of "HTML Tidiers" just open it in chrome and copy the html in the elements panel.

Related

How to include long base64 string in XML written by lxml element factory?

I'm using the lxml element factory on Python 3 to create an XML file that contains base64-encoded pdf files. The XML file will be used to import data into a database software, so the schema can not be changed.
When creating the XML file, lxml complains about the length of the base64 string:
article = E.article(
E.galley(
E.label('PDF'),
E.file(
ET.XML("<embed filename=\"" + row['galley'] + ".pdf\""
+ " encoding=\"base64\" mime_type=\"application/pdf\" >"
+ str(base64fulltext)
+ "</embed>")
), self.LOCALE(row['language']),
), self.LANGUAGE(row['language'])
)
When running the whole script, the error message ('line 45') points to the line where it says str(base64fulltext) in the code snippet above. The error message is as follows:
(lxml) vboxadmin#linux-x3el:~/repos/x> python3 test-csvFileImport.py
Traceback (most recent call last):
File "test-csvFileImport.py", line 65, in <module>
articlePdfBase64)
File "/home/vboxadmin/repos/x/y/writer.py", line 45, in exportArticle
+ "</embed>")
File "src/lxml/etree.pyx", line 3192, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1757, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1067, in lxml.etree._BaseParser._parseUnicodeDoc
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10027189
The expected result would have been to have the base64 string to be written to the xml file.
So far, I could only find that there is the option "huge_tree" in lxml.etree.iterparse (http://lxml.de/api/lxml.etree.iterparse-class.html), but I am not sure whether/how I can use this to solve my problem.
As a workaround, I am considering using string replace to insert the base64 string to the xml after it has been written to file. However, I would be more happy to use a proper lxml solution if anyone could suggest one. Thanks!

Unknown error when using lxml module - parsing XML

am curently learning using Python 101 and in one of examples I'm getting an error and have no clue how to fix it - my code is 100% same as in the book (checked it 3 times already) and it still outputs this error.
Here is the code:
from lxml import etree
def parseXML(xmlFile):
"""
Parse the xml
"""
with open(xmlFile) as fobj:
xml = fobj.read()
root = etree.fromstring(xml)
for appt in root.getchildren():
for elem in appt.getchildren():
if not elem.text:
text = 'None'
else:
text = elem.text
print(elem.tag + ' => ' + text)
if __name__ == '__main__':
parseXML('example.xml')
and here is xml file (it's the same as in the book):
<?xml version="1.0" ?>
<zAppointments reminder-"15">
<appointment>
<begin>1181251600</begin>
<uid>0400000008200E000</uid>
<alarmTime>1181572063</alarmTime>
<state></state>
<location></location>
<duration>1800</duration>
<subject>Bring pizza home</subject>
</appointment>
<appointment>
<begin>1234567890</begin>
<duration>1800</duration>
<subject>Check MS office webstie for updates</subject>
<state>dismissed</state>
<location></location>
<uid>502fq14-12551ss-255sf2</uid>
</appointment>
</zAppointments>
EDITED: Sry, got so excited about my first post that I actually forgot to put the error code.
Traceback (most recent call last):
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 21, in <module>
parseXML('example.xml')
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 10, in parseXML
root = etree.fromstring(xml)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: Specification mandate value for attribute reminder-, line 2, column 25
Thanks for help!!

The only error in the xml can be found here: <zAppointments reminder-"15">, should be: <zAppointments reminder="15">.
In the future useful tools for validating xml can be found online.
Here for example: https://www.xmlvalidation.com/

Error may be in
<zAppointments reminder-"15">
For next validation try to use xmllint:
xmllint --valid --noout example.xml

parse xml file and output to text file

Trying to parse an xml file (config.xml) with ElementTree and output to a text file. I looked at other similar ques here but none helped me. Using Python 2.7.9
import xml.etree.ElementTree as ET
tree = ET.parse('config.xml')
notags = ET.tostring(tree,encoding='us-ascii',method='text')
print(notags)
OUTPUT
Traceback (most recent call last):
File "./python_element", line 9, in <module>
notags = ET.tostring(tree,encoding='us-ascii',method='text')
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
ElementTree(element).write(file, encoding, method=method
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 814, in write
_serialize_text(write, self._root, encoding)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1005, in _serialize_text
for part in elem.itertext():
AttributeError:
> 'ElementTree' object has no attribute 'itertext'

Instead of tree (ElementTree object), pass an Element object. You can get an root element using .getroot() method:
notags = ET.tostring(tree.getroot(), encoding='utf-8',method='text')

etree generating error when using urlib

I am trying to parse an HTML table into python (2.7) with the solutions in this post.
When I try either one of the first two with a string (as in the example) it works perfect.
But when I try to to use the etree.xml on HTML page I read with urlib I get an error. I did a check for each one of solutions, and the variable I pass is a str as well.
For the following code:
from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)
I get this error:
File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line
9, in table = etree.XML(s)
File "lxml.etree.pyx", line 2723, in lxml.etree.XML
(src/lxml/lxml.etree.c:52448)
File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument
(src/lxml/lxml.etree.c:79932)
File "parser.pxi", line 1452, in lxml.etree._parseDoc
(src/lxml/lxml.etree.c:78774)
File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc
(src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 564, in
lxml.etree._ParserContext._handleParseResultDoc
(src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult
(src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError
(src/lxml/lxml.etree.c:71955) lxml.etree.XMLSyntaxError: Opening and
ending tag mismatch: link line 8 and head, line 8, column 48
and for this code:
from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)
I get this error:
Traceback (most recent call last): File
"C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in
table = ET.XML(s)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in
_raiseerror
raise err xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111

While they may seem the same markup types, HTML is not as stringent as XML to be well-formed and follow markup rules (opening/closing nodes, escaping entities, etc.). Hence, what passes for HTML may not be allowed for XML.
Therefore, consider using etree's HTML() function to parse the page. Additionally, you can use XPath to target the particular area you intend to extract or use. Below is an example attempting to pull the main page's table. Do note the webpage uses a quite a bit of nested tables.
from lxml import etree
import urllib.request as rq
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))
# PARSE PAGE
htmlpage = etree.HTML(s)
# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")
for row in htmltable:
print(row)

how to pass an xml file to lxml to parse?

I'm trying to parse an xml file using lxml. xml.etree allowed me to simply pass the file name as a parameter to the parse function, so I attempted to do the same with lxml.
My code:
from lxml import etree
from lxml import objectify
file = "C:\Projects\python\cb.xml"
tree = etree.parse(file)
but I get the error:
Traceback (most recent call last):
File "cb.py", line 5, in <module>
tree = etree.parse(file)
File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:4
9590)
File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etre
e.c:71205)
File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lx
ml.etree.c:71488)
File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.e
tree.c:70583)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/
lxml/lxml.etree.c:67736)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDo
c (src/lxml/lxml.etree.c:63820)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.e
tree.c:64741)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etr
ee.c:64084)
lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 26
What am I doing wrong?

What you are doing wrong is (1) not checking whether you got the same outcome by using xml.etree on the same file (2) not reading the error message, which indicates a syntax error in line 2 of the file, way down stream from any file-opening issue

I stumbled across a similar error message this morning, and for me the answer was a malformed DTD. In my DTD, there was an Attribute definition with a default value that was not enclosed in quotes - as soon as I changed that, the error didn't happen anymore.

You have a syntax error in your XML Markup. You aren't doing anything wrong.

lxml allows you load a broken xml by creating a parser instance with recover=True
etree.XMLParser(recover=True)
While this is not ideal, I use this to load an xml for schema/dtd/schematron validation.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing bad XHTML - python

Related

How to include long base64 string in XML written by lxml element factory?

Unknown error when using lxml module - parsing XML

parse xml file and output to text file

etree generating error when using urlib

how to pass an xml file to lxml to parse?

Categories

Resources