I am writing an XML file using lxml and am having issues with control characters. I am reading text from a file and assigning it to an element, and the text contains control characters. When I run the script I receive this error:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
So I wrote a small function to replace the control characters with a '?'. When I look at the generated XML, it appears that the control characters are newlines (0x0A). With this knowledge I wrote a function to encode these control characters:
def encodeXMLText(text):
    text = text.replace("&", "&amp;")
    text = text.replace("\"", "&quot;")
    text = text.replace("'", "&apos;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("\n", "&#10;")
    text = text.replace("\r", "&#13;")
    return text
This still returns the same error as before. I want to preserve the newlines, so simply stripping them isn't a valid option for me. I have no idea what I am doing wrong at this point. I am looking for a way to do this with lxml, similar to this:
ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = encodeXMLText(titleText)
The other questions I have read either don't use lxml or don't address newline (\n) and carriage return (\r) characters as control characters.
I printed out the string to see which specific characters were causing the issue and noticed these bytes in the text: \xe2\x80\x99 (the UTF-8 encoding of U+2019, a right single quotation mark). So the issue was the encoding; changing the code to look like this fixed my issue:
ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = titleText.decode('UTF-8')
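Put together, a minimal sketch of the working pattern (assuming Python 2 with lxml.etree imported as ET; the byte string below is a made-up stand-in for titleText as read from the file):

from lxml import etree as ET

rule = ET.Element('rule')
titleText = "A title with a \xe2\x80\x99 in it"  # UTF-8 bytes, as read from the file
ruleTitle = ET.SubElement(rule, 'title')
ruleTitle.text = titleText.decode('UTF-8')       # hand lxml a unicode string, not raw bytes
print ET.tostring(rule, encoding='UTF-8')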
Related
The GET service I try to parse using ElementTree, and whose content I don't control, contains a non-UTF8 special character:
respXML = response.content.decode("utf-8")
respRoot = ET.fromstring(respXML)
The second line throws
xml.etree.ElementTree.ParseError: reference to invalid character number: line 3591, column 39
How can I make sure that the XML gets parsed regardless of the character set, which I can later run a replacement against if I find illegal characters? For example, is there an encoding which includes everything? I understand I can do a search and replace of the input XML string but I would prefer to parse it first because my parsing converts it into a data structure which is more easily searchable.
The special character in question is an invalid numeric character reference, but I would like to be able to ingest any character. The whole tag is <literal>Alzheimer&#…;s disease</literal>.
With a little help from @tdelaney, I was able to get past this hurdle by scrubbing the input XML as a string:
respXML = response.content.decode("utf-8")
# strip numeric character references such as &#1234; or &#xABCD; before parsing
scrubbedXML = re.sub(r'&#x?[0-9a-fA-F]+;', '', respXML)
respRoot = ET.fromstring(scrubbedXML)
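A minimal end-to-end sketch of the scrub (the &#2; below is a hypothetical stand-in for the offending reference, chosen because 2 is not a legal XML character number):

import re
import xml.etree.ElementTree as ET

respXML = u'<resp><literal>Alzheimer&#2;s disease</literal></resp>'
# ET.fromstring(respXML) would raise "reference to invalid character number"
scrubbedXML = re.sub(r'&#x?[0-9a-fA-F]+;', '', respXML)
respRoot = ET.fromstring(scrubbedXML)
print respRoot.find('literal').text  # Alzheimers disease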
I made a scraping script with Python and Selenium. It scrapes data from a Spanish-language website:
for i, line in enumerate(browser.find_elements_by_xpath(xpath)):
    tds = line.find_elements_by_tag_name('td')  # takes <td> tags from the line
    print tds[0].text  # FIRST PRINT
    if len(tds) % 2 == 0:  # takes data only from lines with an even number of cells
        data.append([u"".join(tds[0].text), u"".join(tds[1].text), ])
        print data  # SECOND PRINT
The first print statement gives me a normal Spanish string. But the second print gives me a string like this: "Data de Distribui\u00e7\u00e3o".
What's the reason for this?
You are mixing string types:

u'' # unicode string
b'' # byte string

The text property of tds[0] is a byte string, which is encoding-agnostic, while in the second print you are operating on unicode strings, thus mixing the two.
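A minimal illustration of the two types (Python 2; assumes a UTF-8 source file and terminal):

# -*- coding: utf-8 -*-
raw = "Distribuição"    # byte string: UTF-8 bytes
text = u"Distribuição"  # unicode string: a sequence of code points
print repr(raw)         # 'Distribui\xc3\xa7\xc3\xa3o'
print repr(text)        # u'Distribui\xe7\xe3o'
print raw.decode('utf-8') == text  # True: decode bytes before mixing the two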
To use any kind of accented character, we have to encode or decode it first:

# -*- coding: utf-8 -*-
accent_char = "ôâ"                  # UTF-8 byte string (Python 2)
name = accent_char.decode('utf-8')  # now a unicode string
print(name)

The above code will decode the characters.
I'm tagging some unicode text with Python NLTK.
The issue is that the text is from data sources that are badly encoded, and do not specify the encoding. After some messing, I figured out that the text must be in UTF-8.
Given the input string:
s = u"The problem isn’t getting to Huancavelica from Huancayo to the north."
I want to process it with NLTK, for example for POS tagging, but the special characters are not resolved, and I get output like:
The/DT problem/NN isn&#8217;t/NN getting/VBG
Instead of:
The/DT problem/NN isn't/VBG getting/VBG
How do I clean the text of these special characters?
Thanks for any feedback,
Mulone
UPDATE: If I run HTMLParser().unescape(s), I get:
u'The problem isn\u2019t getting to Huancavelica from Huancayo to the north.'
In other cases, I still get things like &amp; and &#13; in the text.
What do I need to do to translate this into something that NLTK will understand?
This is not a character/Unicode encoding issue. The text you have contains XML/HTML numeric character reference entities, which are markup. Whatever library you're using to parse the file should provide some function to dereference &#8217; to the appropriate character.
If you're not bound to any library, see Decode HTML entities in Python string?
The resulting string includes a special apostrophe instead of an ascii single-quote. You can just replace it in the result:
In [6]: s = u"isn’t"
In [7]: print HTMLParser.HTMLParser().unescape(s)
isn’t
In [8]: print HTMLParser.HTMLParser().unescape(s).replace(u'\u2019', "'")
isn't
Unescape will take care of the rest of the characters. For example, &amp; is the & symbol itself. &#13; is a CR symbol (\r) and can be either ignored or converted into a newline, depending on where the original text comes from (old Macs used it for newlines).
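A quick check of both cases, as a minimal sketch using the same unescape call as above (the sample string is made up):

from HTMLParser import HTMLParser

s = u"fish &amp; chips&#13;"
print repr(HTMLParser().unescape(s))  # u'fish & chips\r'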
I'm using this function to unescape the HTML entities:
import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.
def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
but when I try to process some text, Python throws me this error (most of the text works):
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 3
48: character maps to <undefined>
I have tried encoding the text string a million different ways; nothing is working so far: ascii, utf, unicode... all that stuff which I really don't understand.
Based on the error message, it looks like you may be attempting to convert a unicode string into CP 437 (an IBM PC character set). This doesn't appear to be occurring in your function, but could happen when attempting to print the resulting string to your console. I ran a quick test with the input string "&#174; some text" and was able to reproduce the failure when printing the resulting string:

print unescape("&#174; some text")

You can avoid this by specifying the encoding you want to convert the unicode string to:

print unescape("&#174; some text").encode('utf-8')
You'll see non-ascii characters if you attempt to print this string to the console; however, if you write it to a file and read it in a viewer that supports utf-8 encoded documents, you should see the characters you expect.
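A minimal sketch of that write-to-file route (the filename is made up):

import codecs

# Write the unescaped text as UTF-8; no console encoding is involved
with codecs.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(unescape("&#174; some text"))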
You need to post the FULL traceback so that we can see where in YOUR code the error happens. You also need to show us repr(a SMALL piece of data that has this problem) -- your data is at least 348 bytes long.
Based on the initially-supplied information:
You are crashing trying to encode a unicode character using cp437 ...
Either (1) the error is happening somewhere in your displayed code and somebody has kludged your default encoding to be cp437 (don't do that)
or (2) the error is not happening anywhere in the code that you have shown us, it is happening when you try to print some of the results of your function, you are running in a Windows "Command Prompt" window, and so your sys.stdout.encoding is set to some legacy MS-DOS encoding which doesn't support the U+00AE character.
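A quick way to check which case you are in is to inspect the console's encoding from the same session:

import sys
print sys.stdout.encoding  # e.g. 'cp437' in a Windows Command Prompt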
You need to convert the result using the encode method; apply an encoding like 'utf-8'.
For example:

strdata = result.encode('utf-8')
print strdata