reference to invalid character number (Python ElementTree parse) [duplicate]

The GET service I try to parse using ElementTree, and whose content I don't control, contains a non-UTF8 special character:
respXML = response.content.decode("utf-8")
respRoot = ET.fromstring(respXML)
The second line throws
xml.etree.ElementTree.ParseError: reference to invalid character number: line 3591, column 39
How can I make sure that the XML gets parsed regardless of the character set, which I can later run a replacement against if I find illegal characters? For example, is there an encoding which includes everything? I understand I can do a search and replace of the input XML string but I would prefer to parse it first because my parsing converts it into a data structure which is more easily searchable.
The special character in question is  but I would like to be able to ingest any character. The whole tag is <literal>Alzheimers disease</literal>.

With a little help from @tdelaney, I was able to get past this hurdle by scrubbing the numeric character references out of the input XML as a string:
respXML = response.content.decode("utf-8")
scrubbedXML = re.sub(r'&#[0-9]+;', '', respXML)
respRoot = ET.fromstring(scrubbedXML)
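A minimal end-to-end sketch of that scrub-then-parse approach, using a made-up response body (the `&#55357;` reference is a lone surrogate, which is exactly the kind of invalid character number that makes expat raise this ParseError):

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical response body: &#55357; is a lone surrogate, an invalid
# character number in XML, so ET.fromstring would raise ParseError on it.
respXML = '<root><literal>Alzheimer&#55357;s disease</literal></root>'

# Strip every numeric character reference. Note this is a blunt tool:
# it also removes *valid* references such as &#8217;.
scrubbedXML = re.sub(r'&#[0-9]+;', '', respXML)

respRoot = ET.fromstring(scrubbedXML)
print(respRoot.find('literal').text)  # Alzheimers disease
```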

Related

Python XML Compatible String

I am writing an XML file using lxml and am having issues with control characters. I am reading text from a file to assign to an element, and the text contains control characters. When I run the script I receive this error:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
So I wrote a small function to replace the control characters with a '?'. When I look at the generated XML it appears that the control characters are new lines, 0x0A. With this knowledge I wrote a function to encode these control characters:
def encodeXMLText(text):
    text = text.replace("&", "&amp;")
    text = text.replace("\"", "&quot;")
    text = text.replace("'", "&apos;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("\n", "&#10;")
    text = text.replace("\r", "&#13;")
    return text
This still returns the same error as before. I want to preserve the new lines so simply stripping them isn't a valid option for me. No idea what I am doing wrong at this point. I am looking for a way to do this with lxml, similar to this:
ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = encodeXMLText(titleText)
The other questions I have read either don't use lxml or don't address newline (\n) and carriage return (\r) characters as control characters.
I printed out the string to see what specific characters were causing the issue and noticed these characters: \xe2\x80\x99 in the text. So the issue was the encoding; changing the code to look like this fixed my issue:
ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = titleText.decode('UTF-8')
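The same fix can be sketched with the standard-library ElementTree (lxml.etree exposes the same SubElement/tostring API for this): once the bytes are decoded to a proper text string, no manual escaping is needed, because the serializer escapes markup characters itself. The title bytes here are made up for illustration.

```python
import xml.etree.ElementTree as ET

# \xe2\x80\x99 is the UTF-8 encoding of the right single quote (U+2019)
titleText = b'Team\xe2\x80\x99s & rules'

rule = ET.Element('rule')
ruleTitle = ET.SubElement(rule, 'title')
ruleTitle.text = titleText.decode('utf-8')  # decode bytes -> text, don't escape by hand

# The serializer escapes & itself; the curly apostrophe is kept as-is.
print(ET.tostring(rule, encoding='unicode'))
```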


Why does this regular expression not work?

I have a function that parses HTML code so it is easy to read and write with. In order to do this I must split the string with multiple delimiters and, as you can see, I have used re.split() and cannot find a better solution. However, when I submit some HTML such as this, it has absolutely no effect. This has led me to believe that my regular expression is incorrectly written. What should be there instead?
def parsed(data):
    """Removes junk from the data so it can be easily processed."""
    data = str(data)
    # This checks for a cruft and removes it if it exists.
    if re.search("b'", data):
        data = data[2:-1]
    lines = re.split(r'\r|\n', data)  # This clarifies the lines for writing.
    return lines
This isn't a duplicate of any similar question you may find; I've been crawling around for ages and it still doesn't work.
You are converting a bytes value to string:
data = str(data)
# This checks for a cruft and removes it if it exists.
if re.search("b'", data):
    data = data[2:-1]
which means that all line delimiters have been converted to their Python escape codes:
>>> str(b'\n')
"b'\\n'"
That is a literal b, literal quote, literal \ backslash, literal n, literal quote. You would have to split on r'(\\n|\\r)' instead, but most of all, you shouldn't turn bytes values to string representations here. Python produced the representation of the bytes value as a literal string you can paste back into your Python interpreter, which is not the same thing as the value contained in the object.
You want to decode to string instead:
if isinstance(data, bytes):
    data = data.decode('utf8')
where I am assuming that the data is encoded with UTF8. If this is data from a web request, the response headers quite often include the character set used to encode the data in the Content-Type header, look for the charset= parameter.
A response produced by the urllib.request module has an .info() method, and the character set can be extracted (if provided) with:
charset = response.info().get_param('charset')
where the return value is None if no character set was provided.
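Since response.info() returns an email.message.Message-style object, the charset lookup can be sketched offline with a hand-built header (the Content-Type value below is made up for illustration):

```python
from email.message import Message

# Simulate the headers of a urllib.request response (hypothetical values).
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'

charset = headers.get_param('charset')  # None if the server omitted it
print(charset)  # ISO-8859-1
```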
You don't need to use a regular expression to split lines, the str type has a dedicated method, str.splitlines():
Return a list of the lines in the string, breaking at line boundaries. This method uses the universal newlines approach to splitting lines. Line breaks are not included in the resulting list unless keepends is given and true.
For example, 'ab c\n\nde fg\rkl\r\n'.splitlines() returns ['ab c', '', 'de fg', 'kl'], while the same call with splitlines(True) returns ['ab c\n', '\n', 'de fg\r', 'kl\r\n'].
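Putting both points together on a made-up response body: decode the bytes first, then let splitlines() handle every newline convention at once.

```python
data = b'<p>one</p>\r\n<p>two</p>\r<p>three</p>\n'

if isinstance(data, bytes):
    data = data.decode('utf8')   # decode, don't str()

lines = data.splitlines()        # handles \n, \r and \r\n uniformly
print(lines)  # ['<p>one</p>', '<p>two</p>', '<p>three</p>']
```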

latexcodec stripping slashes but not translating characters (Python)

I'm trying to process some Bibtex entries converted to an XML tree via Pybtex. I'd like to go ahead and process all the special characters from their LaTeX specials to unicode characters, via latexcodec. Via question Does pybtex support accent/special characters in .bib file? and the documentation I have checked the syntax, however, I am not getting the correct output.
>>> import latexcodec
>>> name = 'Br\"{u}derle'
>>> name.decode('latex')
u'Br"{u}derle'
I have tested this across different strings and special characters and always it just strips off the first slash without translating the character. Should I be using latexcodec differently to get the correct output?
Your backslash is not included in the string at all because it is treated as a string escape, so the codec never sees it:
>>> print 'Br\"{u}derle'
Br"{u}derle
Use a raw string:
name = r'Br\"{u}derle'
Alternatively, try reading actual data from a file, in which case the raw/non-raw distinction will not matter. (The distinction only applies to literal strings in Python source code.)
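The raw/non-raw difference can be demonstrated without latexcodec at all (Python 3 syntax here, while the original snippets are Python 2):

```python
plain = 'Br\"{u}derle'   # \" is just an escaped quote: no backslash in the value
raw = r'Br\"{u}derle'    # raw string: the backslash survives

print(plain)  # Br"{u}derle
print(raw)    # Br\"{u}derle
print(len(raw) - len(plain))  # 1 -- exactly the backslash
```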

Yet another unicode mess in Python

I'm tagging some unicode text with Python NLTK.
The issue is that the text is from data sources that are badly encoded, and do not specify the encoding. After some messing, I figured out that the text must be in UTF-8.
Given the input string:
s = u"The problem isn&#8217;t getting to Huancavelica from Huancayo to the north."
I want process it with NLTK, for example for POS tagging, but the special characters are not resolved, and I get output like:
The/DT problem/NN isn&#8217;t/NN getting/VBG
Instead of:
The/DT problem/NN isn't/VBG getting/VBG
How do I clean the text of these special characters?
Thanks for any feedback,
Mulone
UPDATE: If I run HTMLParser().unescape(s), I get:
u'The problem isn\u2019t getting to Huancavelica from Huancayo to the north.'
In other cases, I still get things like &amp; and &#13; in the text.
What do I need to do to translate this into something that NLTK will understand?
This is not a character/Unicode encoding issue. The text you have contains XML/HTML numeric character reference entities, which are markup. Whatever library you're using to parse the file should provide some function to dereference &#8217; to the appropriate character.
If you're not bound to any library, see Decode HTML entities in Python string?
The resulting string includes a special apostrophe instead of an ascii single-quote. You can just replace it in the result:
In [6]: s = u"isn&#8217;t"
In [7]: print HTMLParser.HTMLParser().unescape(s)
isn’t
In [8]: print HTMLParser.HTMLParser().unescape(s).replace(u'\u2019', "'")
isn't
Unescape will take care of the rest of the characters. For example, &amp; is the & symbol itself. &#13; is a CR symbol (\r) and can be either ignored or converted into a newline depending on where the original text comes from (old Macs used it for newlines).
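In Python 3 the same two steps are html.unescape plus the replacement (HTMLParser.unescape was deprecated and removed in Python 3.9):

```python
from html import unescape

s = "The problem isn&#8217;t getting to Huancavelica from Huancayo to the north."

# Dereference the entity, then swap the curly apostrophe for an ASCII one.
clean = unescape(s).replace('\u2019', "'")
print(clean)  # The problem isn't getting to Huancavelica from Huancayo to the north.
```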
