Python SAX parser says XML file is not well-formed

I stripped some tags that I thought were unnecessary from an XML file. Now when I try to parse it, my SAX parser throws an error and says my file is not well-formed. However, I know every start tag has an end tag. The file's opening tag has a link to an XML schema. Could this be causing the trouble? If so, then how do I fix it?
Edit: I think I've found the problem. My character data contains "&lt" and "&gt" sequences, presumably from HTML tags. After being parsed, these are converted to "<" and ">" characters, which seems to bother the SAX parser. Is there any way to prevent this from happening?

I would suggest putting those tags back in and making sure it still works. Then, if you want to take them out, do it one at a time until it breaks.
However, I question the wisdom of taking them out. If it's your XML file, you should understand it better. If it's a third-party XML file, you really shouldn't be fiddling with it (until you understand it better :-).

Does the sax parser not give you details about where it thinks it's not well-formed?
Have you tried loading the file into an XML editor and checking it there? Do other XML parsers accept it?
The schema shouldn't change whether or not the XML is well-formed; it may well change whether it's valid or not. See the Wikipedia entry on XML well-formedness for a little more, or the XML specs for a lot more detail :)
EDIT: To represent "&" in text, you should escape it as "&amp;"
So:
&lt
should be
&amp;lt
(assuming you really want ampersand, l, t).
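For illustration, here is a small sketch of my own (not from the answer above) showing how xml.sax reports the exact position of a well-formedness error; "broken.xml" is a placeholder filename.
import xml.sax

try:
    # A bare ContentHandler is enough when we only care about well-formedness.
    xml.sax.parse("broken.xml", xml.sax.ContentHandler())
except xml.sax.SAXParseException as e:
    print("line", e.getLineNumber(), "column", e.getColumnNumber(), "-", e.getMessage())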

I would second the recommendation to try parsing it with another XML parser. That should give an indication as to whether it's the document that's wrong or the parser.
Also, the actual error message would be useful. One fairly common problem, for example, is that the XML declaration (if one is used; it's optional) must be the very first thing in the file -- not even whitespace is allowed before it.

You could load it into Firefox, if you don't have an XML editor. Firefox shows you the error.

Related

How to ignore mismatched tags while parsing xml in Python

I want to parse an XML file with Python. I don't need the hierarchical tag structure -- all I want is a simple SAX or Expat-based parser. However, they both fail with mismatched tag-related error messages when the XML file is not well formed.
Is there a way to tell the parser to ignore these errors? I tried
parser.setFeature(sax.handler.feature_validation, False)
but that didn't help either.
Is there a solution? Either SAX/Expat would do.
You should give Beautiful Soup a try. Its main purpose is to parse HTML even in the presence of malformations. You may find it parses your invalid XML without much trouble.
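A rough sketch of that idea, assuming the bs4 package and its lenient built-in html.parser backend (the filename is a placeholder):
from bs4 import BeautifulSoup

with open("broken.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

for tag in soup.find_all(True):  # True matches every tag, however mangled the input
    print(tag.name, tag.get_text(strip=True))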
You could also use lxml. It has a function called iterparse, which does event-driven parsing in (according to the documentation) a "SAX-like fashion", and it has a parameter to force parsing of broken input. It's fairly easy to use, too. See the sketch after the links below.
lxml iterparse tutorial
lxml iterparse class definition
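A minimal sketch of the lxml route, assuming iterparse's recover=True option is acceptable for your input (the filename is a placeholder):
from lxml import etree

for event, elem in etree.iterparse("broken.xml", events=("end",), recover=True):
    print(event, elem.tag, elem.text)
    elem.clear()  # free elements as you go, SAX-style, to keep memory flat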

Search text for valid Python code

I have chunks of text that may or may not contain Python code. I need a way to search the text for code and if it is there, do something. I could easily search for specific strings that match a regex, but I need something general.
One thought I had would be to run the text through ast, but that would require parsing out all possible substrings and submitting each of them to ast.
To be clear, the text comes from a Q&A forum for Python. Users frequently post code in their questions, but the code is all smushed into one long, incoherent line when it should be formatted to be displayed properly. I need to check if there is code included in the text and if there is and it isn't formatted properly, complain to the user. Checking formatting is something I can handle, I just need to check the text for any Python code.
Any help would be appreciated.
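For what it's worth, a very crude sketch of the ast idea mentioned in the question: treat a chunk of text as code-like if it parses as Python. This is only a heuristic of mine (single English words also parse as valid expressions), not a known solution.
import ast

def looks_like_python(text):
    try:
        tree = ast.parse(text)
    except SyntaxError:
        return False
    # Require at least one statement so empty or whitespace-only text doesn't count.
    return bool(tree.body)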

python sax parser skipping over exception

Is there a way to "skip" a line using a SAX XML parser?
I've got a non-conforming XML document which is a concatenation of valid XML documents, and thus the <?xml ...?> declaration appears for each document. Also note I need to use a SAX parser since the input documents are huge.
I tried crafting a "custom stream" class to feed the parser, but quickly realized that SAX uses the read method and thus reads the input in byte chunks, which blows up the complexity of this project.
thanks!
UPDATE: I know there is a way around this using csplit, but I am after a Python-based solution if at all possible within reasonable limits.
Update2: Maybe I should have said "skipping to next document", that would have made more sense. Anyhow, that's what I need: a way of parsing multiple documents from a single input stream.
When you are concatenating the documents together, just replace each declaration's <? and ?> with <!-- and -->; this will comment out the XML declarations.
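A different sketch (not the substitution above, and only a starting point, since it reads the whole stream into memory): split the concatenated input on each XML declaration and hand every document to the SAX parser separately. The handler class and filename are placeholders.
import xml.sax

class EchoHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        print("start:", name)

with open("combined.xml", "rb") as f:
    data = f.read()

# Each embedded document starts with its own "<?xml ... ?>" declaration.
for chunk in data.split(b"<?xml"):
    if chunk.strip():
        xml.sax.parseString(b"<?xml" + chunk, EchoHandler())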

Determining charset from html meta tags w/python

I have a script that needs to determine the charset before the file is read by lxml.HTML() for parsing. I will assume ISO-8859-1 (that's the normal assumed charset for this, right?) if it can't be found, and search the HTML for the meta tag with the charset attribute. However, I'm not sure of the best way to do that. I could try to create an etree with lxml, but I don't want to read the whole file since I may run into encoding problems. However, if I don't read the whole file, I can't build an etree, since some tags will not have been closed.
Should I just find the meta tag with some fancy string subscripting and break out of the loop once it's found or a certain number of lines have been read? Maybe use a low-level HTML parser, e.g. html.parser? Using Python 3, by the way; thanks.
You should first try to extract the encoding from the HTTP headers. If it is not present there, you should parse the document with lxml. This can be tricky, since lxml throws parse errors if the charset does not match. A workaround is to decode and re-encode the data, ignoring the unknown characters:
html_data = html_data.decode("UTF-8", "ignore")
html_data = html_data.encode("UTF-8", "ignore")
After this, you can parse by invoking lxml.HTML() with UTF-8 encoding.
This way, you'll be able to find the correct encoding defined in the HTML headers.
After finding the encoding, you'll have to re-parse the HTML document with the proper encoding.
Unfortunately, sometimes you might not find a character encoding even in the HTML headers. In that case, I'd suggest using the chardet module to find the proper encoding, but only after these steps fail.
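A loose sketch of those steps; the helper name is mine, and the exact lxml calls are an assumption rather than the answer's own code:
import lxml.html

def sniff_charset(raw_bytes, http_content_type=""):
    # 1. HTTP header, e.g. "text/html; charset=ISO-8859-1"
    if "charset=" in http_content_type:
        return http_content_type.split("charset=")[-1].strip()
    # 2. Decode leniently, then look for <meta charset=...> or http-equiv content
    tree = lxml.html.fromstring(raw_bytes.decode("utf-8", "ignore"))
    for meta in tree.iter("meta"):
        if meta.get("charset"):
            return meta.get("charset")
        content = meta.get("content", "")
        if "charset=" in content:
            return content.split("charset=")[-1].strip()
    # 3. Last resort: statistical detection with chardet, else the question's default
    try:
        import chardet
        return chardet.detect(raw_bytes)["encoding"] or "ISO-8859-1"
    except ImportError:
        return "ISO-8859-1"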
Determining the character encoding of an HTML file correctly is actually quite a complex matter, but the HTML5 spec defines exactly how a processor should do it. You can find the algorithm here: http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

Filter out HTML tags and resolve entities in python

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
Use lxml, which is the best XML/HTML library for Python.
import lxml.html
t = lxml.html.fromstring("...")  # parse the markup (the "..." is your HTML string)
t.text_content()  # plain text with tags stripped and entities resolved
And if you just want to sanitize the HTML, look at the lxml.html.clean module.
Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the strings, and join them.
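A minimal sketch of that approach; the bs4 import and get_text() call are my choice of API, not taken from the answer:
from bs4 import BeautifulSoup

def strip_html(markup):
    # get_text() drops the tags and resolves entities such as &amp; along the way
    return BeautifulSoup(markup, "html.parser").get_text()

print(strip_html("<p>Fish &amp; chips</p>"))  # Fish & chips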
While I agree with Lucas that regular expressions are not all that scary, I still think that you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box.
You should also check out the python bindings for TidyLib which can clean up broken HTML, making the success rate of any HTML parsing much higher.
How about parsing the HTML data and extracting the data with the help of the parser?
I'd try something like what the author describes in chapter 8.3 of the Dive Into Python book.
If you use Django, you might also use
http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags
;)
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
<div>5 < 7</div>
Stripping the tags with a regex will return the string "5 " and treat
< 7</div>
as a single tag and strip it out.
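A quick illustration of that pitfall, using a deliberately naive pattern of my own rather than anything from the answer:
import re

html = "<div>5 < 7</div>"
print(re.sub(r"<[^>]*>", "", html))  # prints "5 " -- the "< 7</div>" part is eaten as a "tag"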
I suggest looking for already-written code that does this for you. I did a search and found this, which can also resolve HTML entities: http://zesty.ca/python/scrape.html
Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.
Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.
