I am receiving a JSON string, passing it through json.loads, and ending up with a list of unicode strings. That's all well and good. One of the strings in the list is:
u'\xc3\x85sum'
This should translate into 'Åsum' when decoded with decode('utf8'), but instead I get an error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
To test what's wrong I did the following:
>>> 'Åsum'.encode('utf8')
'\xc3\x85sum'
>>> print '\xc3\x85sum'.decode('utf8')
Åsum
So that worked fine, but if I make it a unicode string, as json.loads does, I get the same error:
>>> print u'\xc3\x85sum'.decode('utf8')
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
I tried doing json.loads(jsonstring, encoding='utf8'), but that changes nothing.
Is there a way to solve this? Can I make json.loads not produce unicode, or make it decode using 'utf8' as I ask it to?
Edit:
The original string I receive looks like this (or at least the part that causes trouble):
"\\u00c3\\u0085sum"
You already have a Unicode value, so trying to decode it forces an encode first, using the default codec.
It looks like you received malformed JSON instead; JSON values are already unicode. If you have UTF-8 data in your Unicode values, the only way to recover is to encode to Latin-1 (which maps the first 256 codepoints to bytes one-to-one), then decode the result as UTF-8:
>>> print u'\xc3\x85sum'.encode('latin1').decode('utf8')
Åsum
The better solution is to fix the JSON source, however; it should not doubly-encode to UTF-8. The correct representation would be:
>>> json.dumps(u'Åsum')
'"\\u00c5sum"'
Related
I am running into a problem where decoding a string gives me one error and encoding gives me another (both errors are below). Is there a permanent solution for this?
P.S. Please note that you may not be able to reproduce the encoding error with the string I provided, as I couldn't copy/paste some of the characters.
text = "sometext"
string = '\n'.join(list(set(text)))
try:
print "decode"
text = string.decode('UTF-8')
except Exception as e:
print e
text = string.encode('UTF-8')
Errors:
Error while using string.decode('UTF-8'):
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8'):
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The First Error
The code you have provided will work, since text there is a plain bytestring (you are using Python 2). But the traceback shows a UnicodeEncodeError raised from a call to .decode(): that happens when the value is already a unicode string, because Python 2 implicitly encodes it with the default ASCII codec before decoding. That implicit encode only succeeds if the string contains nothing but ASCII characters (codepoints 0 through 127). In your case it encounters a character (specifically ☂, u'\u2602') that has no ASCII equivalent. You can get around this behaviour by skipping the decode when the value is already unicode, or by dropping the non-ASCII characters explicitly:
string.encode('ascii', 'ignore')
which will just ignore (i.e. replace with nothing) the characters that cannot be encoded into ASCII.
The Second Error
This error is more interesting. Despite the wording, it is not raised by the UTF-8 codec itself: NULL bytes and control characters are perfectly valid UTF-8. The message ("All strings must be XML compatible...") comes from a library further down the pipeline (its wording matches lxml/BeautifulSoup) that refuses NULL bytes and control characters in its input. Again, the code you have actually provided works, but something in the real text you are processing violates that restriction. You can try the same trick as above:
string.encode('UTF-8', 'ignore')
but note that because NULL bytes and control characters encode without error, 'ignore' will not remove them here; you will need to strip them yourself, or look into what it is in your specific text input that is causing the problem.
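If you do need to strip them, here is a minimal sketch (the helper name and the character range are my assumptions, based on what XML 1.0 permits):

import re

# Hypothetical helper: drop NULL bytes and C0 control characters, keeping
# tab (\x09), newline (\x0a) and carriage return (\x0d), which XML allows.
CONTROL_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_control_chars(text):
    return CONTROL_CHARS.sub(u'', text)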
I'm querying a DB for jokes and am getting back Python strs. I want to use them as Unicode objects, so I do:
joke = unicode(joke, 'utf-8')
This works for all my DB results and does not cause any issues.
Then I try to hash each word in each joke like this:
result = mmh3.hash(joke)
and I get back:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
I inspected the text and it's Japanese. Does this mean I should drop all non-ASCII characters before hashing, or is there a better way to handle this?
Thanks!
The mmh3.hash(...) function appears to require either bytes or ASCII-convertible text.
The easiest way (if you're dealing entirely with unicode objects) is to convert them to bytes as you call mmh3.hash:
result = mmh3.hash(joke.encode('UTF-8'))
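For example (a small sketch, assuming the mmh3 package is installed; the Japanese word is just an illustrative stand-in for your DB text):

import mmh3

joke = u'冗談'  # unicode, as after your unicode(joke, 'utf-8') call
result = mmh3.hash(joke.encode('utf-8'))  # hash the UTF-8 bytes; returns a 32-bit signed int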
I have a byte array as a str and want to send it (when I look at it in a debugger, it shows me that File.body is a str). To do that I have to build the message to send:
request_text += '\n'.join([
    '',
    '--%s' % boundary_id,
    attachment_headers,
    File.body,
])
But as soon as it tries to join in the file body, I get an exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
Still, the example I took this from did it exactly this way. How do I get the Python string to work with the byte string? Should I decode it somehow? But how, if it is not text, just bytes?
You are probably getting the error because your string has non-ASCII characters. The following is an example of how to encode/decode a string containing non-ASCII characters:
1) Convert the string to unicode:
string = "helloé"
u = unicode(string, 'utf-8')
2) Encode the string to UTF-8 before sending it on the network:
encoded = u.encode('utf-8')
3) Decode it from UTF-8 back to unicode on the other side:
encoded.decode('utf-8')
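Applied to the multipart message in the question, the same idea means keeping every part a bytestring, so that join never has to decode File.body (a sketch using the names from the question; it assumes request_text is itself a str):

# Encode any unicode parts to UTF-8 up front; File.body stays raw bytes
# and is never decoded.
if isinstance(attachment_headers, unicode):
    attachment_headers = attachment_headers.encode('utf-8')

request_text += '\n'.join([
    '',
    '--%s' % boundary_id,
    attachment_headers,
    File.body,
])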
I am getting an error at this line:
logger.info(u"Data: {}".format(data))
The error is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 4: ordinal not in range(128)
Before that line, I tried adding data = data.decode('utf8') and I still get the same error.
I tried data = data.encode('utf8') and it says UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
How do I fix this? I don't know if I should encode or decode but neither works.
Use a plain str (bytestring) literal for the template, and encode the data to UTF-8 first:
if isinstance(data, unicode):
    data = data.encode('utf8')
logger.info("Data: {}".format(data))
The logging module needs you to pass in str values, because these values are passed on unaltered to the formatters and the handlers; otherwise, when log messages are written to a file, unicode values get encoded with the default (ASCII) codec. But you also need to pass in a bytestring value when formatting.
Passing in a str value into a unicode .format() template leads to decoding errors, passing in a unicode value into a str .format() template leads to encoding errors, and passing a formatted unicode value to logger.info() leads to encoding errors too.
Better not to mix; encode explicitly beforehand.
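For instance, a quick interactive session showing the first failure mode (the bytes here are the UTF-8 encoding of the ’ character from your traceback; a sketch, not part of the original answer):

>>> u"Data: {}".format('\xe2\x80\x99')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)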
You could do something such as:
data.decode('utf-8').encode('ascii', 'ignore')
This will "ignore" the non-ASCII characters. (Note that in Python 2 encode() takes no keyword arguments, so the errors value is passed positionally.)
Edit: data.encode('ascii', 'ignore') may be enough, but I'm not in a position to test this currently.
I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.
However, for some characters it explodes. I get complaints like this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)
Is there some way I can convert the chars to proper unicode versions? Or strip them out?
Once you have the string of bytes s, instead of using it as a unicode obj directly, convert it explicitly with the right codec, e.g.:
u = s.decode('latin-1')
and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out (given the 0x97 byte, Windows cp1252 is a likely candidate) ;-).
As a general rule, I suggest: don't process any text in your applications as encoded byte strings -- decode them to unicode objects right after input, and, if necessary, encode them back to byte strings right before output.
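A sketch of that rule applied to this pipeline (the cursor usage, the Joke model, and the cp1252 guess are illustrative assumptions, not something I can know from here):

# Decode right after reading from MSSQL; use whatever codec the column
# really holds (cp1252 is common for MSSQL text and maps 0x97 to an em dash).
row = cursor.fetchone()
joke = row[0].decode('cp1252')  # a unicode object from here on

# Django's ORM and the SQLite backend accept unicode directly, so no
# explicit encode is needed when saving.
Joke.objects.create(text=joke)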
When you decode, just pass 'ignore' to strip those characters. There are more ways of stripping/converting them:
'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'
'ignore': ignore malformed data and continue without further notice
'backslashreplace': replace with backslashed escape sequences (for encoding only)
Test
>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'
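And 'replace' on the same input, for comparison:
>>> "abcd\x97".decode("ascii","replace")
u'abcd\ufffd'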