Unicode encoding error in Python

I have characters \u002d, \u2019, \u2022, \u25ba, \u2013, etc. coming in my data.
I have to do json.loads(data)
I tried doing
data1 = data.encode('utf-8')
json.loads(data1)
I still get an error.
I also tried the following, but it ended up with an error as well:
b1 = data.encode('ascii', 'ignore')
b2 = json.loads(b1)
It works if I replace the characters in my data, like '\u002d' with '-', but I do not know what other characters might creep in. So I am looking for a solution that would encode these characters.

There is no need to encode the data.
Feed it directly to json.loads(); the JSON standard uses \u.... escape codes to denote Unicode values too.
The values are not encoded in UTF-8; the Python json module will handle them for you.
Even if the data were encoded in UTF-8, the json module would handle that for you as well. And even if it didn't, you'd use str.decode(), not encode.
UTF-8 data looks different as well; the U+2019 code point looks like:
>>> u'\u2019'.encode('utf8')
'\xe2\x80\x99'
when encoded to UTF-8.
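For example, a quick sketch using the escapes from the question (the final print assumes a terminal that can display the characters):
>>> import json
>>> data = r'{"text": "\u2019 \u2022 \u25ba"}'
>>> json.loads(data)['text']
u'\u2019 \u2022 \u25ba'
>>> print json.loads(data)['text']
’ • ►
No encode() call is needed anywhere; json.loads() turns the \u escapes into Unicode characters on its own.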

Related

Decode UTF-8 encoding in JSON string

I have a JSON file which contains strings encoded as follows:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However, I am not able to decode this string correctly.
What I get after decoding the JSON using the .load() method is 'HornÃ\xadková'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understand that after \u there should be four hexadecimal digits specifying the Unicode number of the character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this, and how can I correctly parse it in Python 3?
Is this type of JSON file even a valid JSON file according to the specification?
Your text is effectively already encoded: the \u escapes carry UTF-8 bytes rather than code points. Normally you would tell Python this with a b prefix, but since you're using json and its input needs to be a string, you have to undo the bogus encoding manually. Because your input is not bytes, you can use the 'raw_unicode_escape' codec to convert the string to bytes without re-encoding it, and to keep open() from applying its own default encoding. Then you can simply decode those bytes as UTF-8 to get the desired result.
Note that since you need to do the encoding and decoding yourself, you have to read the file content and perform the conversion on the loaded string, which means using json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
The JSON you are reading was written incorrectly, so the Unicode strings decoded from it will have to be re-encoded with the wrong encoding that was used, then decoded with the correct encoding.
Here's an example:
# Python 3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
Re-encode to bytes, and then decode back to text.
>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'
Is this type of JSON file even a valid JSON file according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.

coding (unicode) hex to octal

I read a text file which has some characters like '\260' (it means '°'), and then I add it to the DB (sqlite3).
After that, I try to get the information from the DB, but the sql-query is built with '\xb0' (it means '°' too), because I get this information from an XML file.
I try to replace the hex characters with octal characters: text = text.replace(r'\xb0', '\260') but it doesn't work. Why? I cannot build a correct sql-query.
Maybe there are some solutions for this problem, e.g. encode, decode, etc.?
\260 is the same thing as \xb0:
>>> '\xb0'
'\xb0'
>>> '\260'
'\xb0'
You probably want to decode your input to unicode and insert that instead. If your data is encoded as Latin-1, then decode it:
>>> print '\xb0'.decode('latin1')
°
sqlite3 can handle unicode just fine, and by decoding you make sure you are handling text values, not byte values, which can differ from codec to codec.
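A minimal sketch of that (Python 2, with a made-up readings table): decode the bytes once, then use the same unicode value for both the INSERT and the later query:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE readings (symbol TEXT)')   # hypothetical table

symbol = '\xb0'.decode('latin1')                      # u'\xb0', the degree sign
conn.execute('INSERT INTO readings VALUES (?)', (symbol,))

row = conn.execute('SELECT symbol FROM readings WHERE symbol = ?',
                   (symbol,)).fetchone()
print row[0] == symbol                                # True
Because both sides of the comparison are unicode, there is no \260-versus-\xb0 mismatch to worry about.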

Python, Encoding output to UTF-8

I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try to x.write(string) I get the UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'). But I need to use a variable, and the quotation marks in u'_' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially, all this is supposed to be doing is building a large number of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest opening the input and/or output file with the encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that should only matter for the input file. But if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, as Python's default UTF-8 codec interprets BOMs as regular characters, so the input would not raise an error but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
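To see the unicode-in, unicode-out behavior of codecs.open() concretely, here is a self-contained sketch (Python 2; the file name is made up):
import codecs

out = codecs.open('demo_out.csv', 'w+', 'utf-8')   # hypothetical file name
# write() expects a unicode string and encodes it to UTF-8 on the way out
out.write(u'[\u65e5\u6728\u66dc Deliverables]= CASE WHEN things = 11\n')
out.seek(0)
line = out.read()        # read() decodes and hands back unicode
print type(line)         # <type 'unicode'>
out.close()
This is why building x with str() fails: str() tries to encode the unicode result as ASCII first, while unicode() leaves it in the form codecs.open() expects.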
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes; the second, Unicode code points. It is easy to determine which type a string is: a unicode string literal starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' characters are the same because they are in the ASCII range. \u0430 is a Unicode code point; it is outside the ASCII range. A code point is Python's internal representation of a Unicode character, and code points can't be saved to a file directly: they need to be encoded to bytes first. Here is what an encoded unicode string looks like (once encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string now can be written to file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
...     f.write(s.encode('utf8'))
Now, it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
...     content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as from s.encode('utf8'). To decode, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got our unicode string with its Unicode code points back.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and it's a holdover from the early days of Unicode, when people couldn't agree on a byte order. Many Unicode documents are prefaced with a BOM; on disk its bytes appear as FE FF or FF FE depending on the byte order chosen, but both spellings encode the same code point, U+FEFF.
If you hit an error on that character at the very start of the data, you can be sure the issue is an unstripped BOM: decode using 'utf-8-sig' instead of plain 'utf-8', and the file is probably still fine.
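A quick demonstration of the difference (the three bytes EF BB BF are the UTF-8 encoding of the BOM):
>>> '\xef\xbb\xbfabc'.decode('utf-8')
u'\ufeffabc'
>>> '\xef\xbb\xbfabc'.decode('utf-8-sig')
u'abc'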

How to Handle JSON with escaped Unicode characters using python json module?

EDIT: The error doesn't appear at the interactive prompt, but it does appear in the Google App Engine environment.
I have the following JSON:
>>> dat = r"""{"name":"Something", "data":"For youth \n\nBe a hero! Donate blood!\n\u091c\u092f \u0939\u093f\u0902\u0926! \u0935\u0928\u094d\u0926\u0947 \u092e\u093e\u0924\u0930\u092e\u094d"}"""
It contains unicode escaped characters.
I want to parse this. So I did
>>> jsDat = json.loads(dat)
Then following works
>>> name = jsDat.get('name')
>>> name = name.encode('ascii')  # because the json module returns unicode
>>> print name
Something
But when I try the same for the field with Unicode data, that is "data", an error is displayed:
>>> data = jsDat.get('data')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 366-367: ordinal not in range(128)
How should I parse the data?
You can't encode unicode to ASCII if the characters exceed the ASCII character set. If you want to force the conversion, and lose data, you can do this:
data = jsDat.get('data')
data = data.encode('ascii', 'ignore')
See the docs for str.encode for more details about 'ignore'.
As an aside, I'm not sure why you're trying to encode to ASCII - the JSON module seems to handle that raw string just fine?
The error is coming from your 'print' line, and only because you're trying to print to a 'terminal' that doesn't understand the encoding. Doing anything else with the JSON object shouldn't produce errors.
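To illustrate (a sketch with a stand-in value like the one in the question):
data = u'\u091c\u092f \u0939\u093f\u0902\u0926'  # like jsDat.get('data')
print data.encode('utf-8')             # safe wherever the output accepts UTF-8
print data.encode('ascii', 'replace')  # lossy, but never raises
The unicode object itself is fine; only the implicit ASCII conversion done by print on a limited terminal raises, so encoding explicitly sidesteps the error.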

how to reliably decode various encodings to system default encoding

I am trying to work with several documents that all have various encodings - some UTF-8, some ISO-8859-2, some ASCII, etc. Is there a reliable way of decoding to a standard encoding for processing?
I have tried the following:
import chardet
encoding = chardet.detect(text)
text = unicode(text,encoding['encoding']).decode(sys.getdefaultencoding(),'ignore')
With the above code I still get UnicodeEncodeError errors
Use decode to convert bytes to unicode, and encode to convert unicode to bytes:
text.decode(encoding['encoding'], 'ignore').encode(sys.getdefaultencoding(), 'ignore')
Although I would recommend doing your processing on the unicode objects themselves, or UTF-8 encoded strings if you absolutely need to work with bytes. sys.getdefaultencoding() is 'ascii', which provides a very limited character set. See also: http://wiki.python.org/moin/DefaultEncoding
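As a sketch of that workflow (the file name is hypothetical; chardet.detect() returns a dict with 'encoding' and 'confidence' keys):
import chardet

raw = open('document.txt', 'rb').read()   # hypothetical input file
guess = chardet.detect(raw)
# fall back to UTF-8 if detection fails; detect() can return None for 'encoding'
text = raw.decode(guess['encoding'] or 'utf-8', 'ignore')
# ... process `text` as a unicode object; encode only at the output boundary
output = text.encode('utf-8')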
You probably mean encode:
u = unicode(text, encoding['encoding'], 'ignore')
text = u.encode(sys.getdefaultencoding(), 'ignore')
or equivalently and more commonly,
u = text.decode(encoding['encoding'], 'ignore')
text = u.encode(sys.getdefaultencoding(), 'ignore')
You may want ignore on both, as above: the incoming text may have invalid characters in it, causing it to fail to decode to Unicode, and it may have characters which can't be represented in the default encoding, causing it to fail to encode. (You may not actually want to ignore errors, though, since it looks like you were just trying to work around using the wrong function.)
