Encoding problem when converting csv to json in python [duplicate] - python

Sample code (in a REPL):
import json
json_string = json.dumps("ברי צקלה")
print(json_string)
Output:
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"
The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).
Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:
>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"
If you are writing to a file, just use json.dump() and leave it to the file object to encode:
with open('filename', 'w', encoding='utf8') as json_file:
json.dump("ברי צקלה", json_file, ensure_ascii=False)
Caveats for Python 2
For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:
with io.open('filename', 'w', encoding='utf8') as json_file:
json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(u"ברי צקלה", ensure_ascii=False)
# unicode(data) auto-decodes data to unicode if str
json_file.write(unicode(data))
In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה

To write to a file
import codecs
import json
with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)
To print to stdout
import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))

This is the wrong answer, but it's still useful to understand why it's wrong. See comments.
Use unicode-escape:
>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}

Pieters' Python 2 workaround fails on an edge case:
d = {u'keyword': u'bad credit \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False).decode('utf8')
try:
json_file.write(data)
except TypeError:
# Decode data to Unicode first
json_file.write(data.decode('utf8'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)
It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler by avoiding that step as well as the special casing of ASCII:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False, encoding='utf8')
json_file.write(unicode(data))
cat filename
{"keyword": "bad credit çredit cards"}

As of Python 3.7 the following code works fine:
from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)
Output:
{"symbol": "ƒ"}

Thanks for the original answer here. With Python 3 the following line of code:
print(json.dumps(result_dict,ensure_ascii=False))
was ok. Consider trying not writing too much text in the code if it's not imperative.
This might be good enough for the Python console. However, to satisfy a server, you might need to set the locale as explained here (if it is on Apache 2)
Setting LANG and LC_ALL when using mod_wsgi
Basically, install he_IL or whatever language locale on Ubuntu.
Check it is not installed:
locale -a
Install it, where XX is your language:
sudo apt-get install language-pack-XX
For example:
sudo apt-get install language-pack-he
Add the following text to /etc/apache2/envvrs
export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'
Then you would hopefully not get Python errors on from Apache like:
print (js)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-45: ordinal not in range(128)
Also in Apache, try to make UTF the default encoding as explained here:
How to change the default encoding to UTF-8 for Apache
Do it early because Apache errors can be pain to debug and you can mistakenly think it's from Python which possibly isn't the case in that situation.

The following is my understanding var reading answer above and google.
# coding:utf-8
r"""
#update: 2017-01-09 14:44:39
#explain: str, unicode, bytes in python2to3
#python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
#1.reload
#importlib,sys
#importlib.reload(sys)
#sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
#not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
#2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
#too complex
#3.control by your own (best)
#==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
#see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes
#how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
#http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""
from __future__ import print_function
import json
a = {"b": u"中文"} # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')
# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')
# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'
u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'
#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
b'\xe4\xb8\xad\xe6\x96\x87'
'中文'
'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""

If you are loading a JSON string from a file and the file content is Arabic texts, then this will work.
Assume a file like arabic.json
{
"key1": "لمستخدمين",
"key2": "إضافة مستخدم"
}
Get the Arabic contents from the arabic.json file
with open(arabic.json, encoding='utf-8') as f:
# Deserialises it
json_data = json.load(f)
f.close()
# JSON formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)
To use JSON data in a Django template follow the below steps:
# If have to get the JSON index in a Django template file, then simply decode the encoded string.
json.JSONDecoder().decode(json_data2)
Done! Now we can get the results as a JSON index with Arabic values.

Use unicode-escape to solve the problem
>>>import json
>>>json_string = json.dumps("ברי צקלה")
>>>json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'
Explanation
>>>s = '漢 χαν хан'
>>>print('Unicode: ' + s.encode('unicode-escape').decode('utf-8'))
Unicode: \u6f22 \u03c7\u03b1\u03bd \u0445\u0430\u043d
>>>u = s.encode('unicode-escape').decode('utf-8')
>>>print('Original: ' + u.encode("utf-8").decode('unicode-escape'))
Original: 漢 χαν хан
Original resource:Python3 使用 unicode-escape 处理 unicode 16进制字符串编解码问题

Here's my solution using json.dump():
def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
with codecs.open(p, 'wb', 'utf_8') as fileobj:
json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)
where SYSTEM_ENCODING is set to:
locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]

Use codecs if possible,
with codecs.open('file_path', 'a+', 'utf-8') as fp:
fp.write(json.dumps(res, ensure_ascii=False))

Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)
You need extra settings in either site.py or sitecustomize.py to set your sys.getdefaultencoding() correct. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.
If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that Python will use your operation system's locale.
If you prefer to use sitecustomize.py, which may not exist if you haven't created it, simply add these lines:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Then you can do some Chinese JSON output in UTF-8 format, such as:
name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)
You will get an UTF-8 encoded string, rather than a \u escaped JSON string.
To verify your default encoding:
print sys.getdefaultencoding()
You should get "utf-8" or "UTF-8" to verify your site.py or sitecustomize.py settings.
Please note that you could not do sys.setdefaultencoding("utf-8") at an interactive Python console.

Related

String read from JSON cannot be decoded [duplicate]

Sample code (in a REPL):
import json
json_string = json.dumps("ברי צקלה")
print(json_string)
Output:
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"
The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).
Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?
Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:
>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"
If you are writing to a file, just use json.dump() and leave it to the file object to encode:
with open('filename', 'w', encoding='utf8') as json_file:
json.dump("ברי צקלה", json_file, ensure_ascii=False)
Caveats for Python 2
For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:
with io.open('filename', 'w', encoding='utf8') as json_file:
json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(u"ברי צקלה", ensure_ascii=False)
# unicode(data) auto-decodes data to unicode if str
json_file.write(unicode(data))
In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה
To write to a file
import codecs
import json
with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)
To print to stdout
import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))
This is the wrong answer, but it's still useful to understand why it's wrong. See comments.
Use unicode-escape:
>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}
Pieters' Python 2 workaround fails on an edge case:
d = {u'keyword': u'bad credit \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False).decode('utf8')
try:
json_file.write(data)
except TypeError:
# Decode data to Unicode first
json_file.write(data.decode('utf8'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)
It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler by avoiding that step as well as the special casing of ASCII:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False, encoding='utf8')
json_file.write(unicode(data))
cat filename
{"keyword": "bad credit çredit cards"}
As of Python 3.7 the following code works fine:
from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)
Output:
{"symbol": "ƒ"}
Thanks for the original answer here. With Python 3 the following line of code:
print(json.dumps(result_dict,ensure_ascii=False))
was ok. Consider trying not writing too much text in the code if it's not imperative.
This might be good enough for the Python console. However, to satisfy a server, you might need to set the locale as explained here (if it is on Apache 2)
Setting LANG and LC_ALL when using mod_wsgi
Basically, install he_IL or whatever language locale on Ubuntu.
Check it is not installed:
locale -a
Install it, where XX is your language:
sudo apt-get install language-pack-XX
For example:
sudo apt-get install language-pack-he
Add the following text to /etc/apache2/envvrs
export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'
Then you would hopefully not get Python errors on from Apache like:
print (js)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-45: ordinal not in range(128)
Also in Apache, try to make UTF the default encoding as explained here:
How to change the default encoding to UTF-8 for Apache
Do it early because Apache errors can be pain to debug and you can mistakenly think it's from Python which possibly isn't the case in that situation.
The following is my understanding var reading answer above and google.
# coding:utf-8
r"""
#update: 2017-01-09 14:44:39
#explain: str, unicode, bytes in python2to3
#python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
#1.reload
#importlib,sys
#importlib.reload(sys)
#sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
#not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
#2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
#too complex
#3.control by your own (best)
#==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
#see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes
#how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
#http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""
from __future__ import print_function
import json
a = {"b": u"中文"} # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')
# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')
# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'
u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'
#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
b'\xe4\xb8\xad\xe6\x96\x87'
'中文'
'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""
If you are loading a JSON string from a file and the file content is Arabic texts, then this will work.
Assume a file like arabic.json
{
"key1": "لمستخدمين",
"key2": "إضافة مستخدم"
}
Get the Arabic contents from the arabic.json file
with open(arabic.json, encoding='utf-8') as f:
# Deserialises it
json_data = json.load(f)
f.close()
# JSON formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)
To use JSON data in a Django template follow the below steps:
# If have to get the JSON index in a Django template file, then simply decode the encoded string.
json.JSONDecoder().decode(json_data2)
Done! Now we can get the results as a JSON index with Arabic values.
Use unicode-escape to solve the problem
>>>import json
>>>json_string = json.dumps("ברי צקלה")
>>>json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'
Explanation
>>>s = '漢 χαν хан'
>>>print('Unicode: ' + s.encode('unicode-escape').decode('utf-8'))
Unicode: \u6f22 \u03c7\u03b1\u03bd \u0445\u0430\u043d
>>>u = s.encode('unicode-escape').decode('utf-8')
>>>print('Original: ' + u.encode("utf-8").decode('unicode-escape'))
Original: 漢 χαν хан
Original resource:Python3 使用 unicode-escape 处理 unicode 16进制字符串编解码问题
Here's my solution using json.dump():
def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
with codecs.open(p, 'wb', 'utf_8') as fileobj:
json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)
where SYSTEM_ENCODING is set to:
locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]
Use codecs if possible,
with codecs.open('file_path', 'a+', 'utf-8') as fp:
fp.write(json.dumps(res, ensure_ascii=False))
Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)
You need extra settings in either site.py or sitecustomize.py to set your sys.getdefaultencoding() correct. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.
If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that Python will use your operation system's locale.
If you prefer to use sitecustomize.py, which may not exist if you haven't created it, simply add these lines:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Then you can do some Chinese JSON output in UTF-8 format, such as:
name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)
You will get an UTF-8 encoded string, rather than a \u escaped JSON string.
To verify your default encoding:
print sys.getdefaultencoding()
You should get "utf-8" or "UTF-8" to verify your site.py or sitecustomize.py settings.
Please note that you could not do sys.setdefaultencoding("utf-8") at an interactive Python console.

How to this error ? utf-8' codec can't decode byte 0xef in position 32887: invalid continuation byte

enter image description here
Hello. I am trying to open this file which is in .txt format but it gives me an error.
Sometimes when you don't have uniform files you have to by specific with the correct encoding,
You should indicate it in function open for example,
with open(‘file.txt’, encoding = ‘utf-8’) as f:
etc
also you can detect the file encoding like this:
from chardet import detect
with open(file, 'rb') as f:
rawdata = f.read()
enc = detect(rawdata)['encoding']
with open(‘file.txt’, encoding = enc) as f:
etc
Result:
>>> from chardet import detect
>>>
>>> with open('test.c', 'rb') as f:
... rawdata = f.read()
... enc = detect(rawdata)['encoding']
...
>>> print(enc)
ascii
Python 3.7.0

How to solve Flask jsonfify encoding/decoding? [duplicate]

Sample code (in a REPL):
import json
json_string = json.dumps("ברי צקלה")
print(json_string)
Output:
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"
The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).
Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?
Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:
>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"
If you are writing to a file, just use json.dump() and leave it to the file object to encode:
with open('filename', 'w', encoding='utf8') as json_file:
json.dump("ברי צקלה", json_file, ensure_ascii=False)
Caveats for Python 2
For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:
with io.open('filename', 'w', encoding='utf8') as json_file:
json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(u"ברי צקלה", ensure_ascii=False)
# unicode(data) auto-decodes data to unicode if str
json_file.write(unicode(data))
In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה
To write to a file
import codecs
import json
with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)
To print to stdout
import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))
This is the wrong answer, but it's still useful to understand why it's wrong. See comments.
Use unicode-escape:
>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}
Pieters' Python 2 workaround fails on an edge case:
d = {u'keyword': u'bad credit \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False).decode('utf8')
try:
json_file.write(data)
except TypeError:
# Decode data to Unicode first
json_file.write(data.decode('utf8'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)
It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler by avoiding that step as well as the special casing of ASCII:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False, encoding='utf8')
json_file.write(unicode(data))
cat filename
{"keyword": "bad credit çredit cards"}
As of Python 3.7 the following code works fine:
from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)
Output:
{"symbol": "ƒ"}
Thanks for the original answer here. With Python 3 the following line of code:
print(json.dumps(result_dict,ensure_ascii=False))
was ok. Consider trying not writing too much text in the code if it's not imperative.
This might be good enough for the Python console. However, to satisfy a server, you might need to set the locale as explained here (if it is on Apache 2)
Setting LANG and LC_ALL when using mod_wsgi
Basically, install he_IL or whatever language locale on Ubuntu.
Check it is not installed:
locale -a
Install it, where XX is your language:
sudo apt-get install language-pack-XX
For example:
sudo apt-get install language-pack-he
Add the following text to /etc/apache2/envvrs
export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'
Then you would hopefully not get Python errors on from Apache like:
print (js)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-45: ordinal not in range(128)
Also in Apache, try to make UTF the default encoding as explained here:
How to change the default encoding to UTF-8 for Apache
Do it early because Apache errors can be pain to debug and you can mistakenly think it's from Python which possibly isn't the case in that situation.
The following is my understanding var reading answer above and google.
# coding:utf-8
r"""
#update: 2017-01-09 14:44:39
#explain: str, unicode, bytes in python2to3
#python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
#1.reload
#importlib,sys
#importlib.reload(sys)
#sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
#not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
#2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
#too complex
#3.control by your own (best)
#==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
#see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes
#how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
#http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""
from __future__ import print_function
import json
a = {"b": u"中文"} # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')
# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')
# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'
u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'
#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
b'\xe4\xb8\xad\xe6\x96\x87'
'中文'
'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""
If you are loading a JSON string from a file and the file content is Arabic texts, then this will work.
Assume a file like arabic.json
{
"key1": "لمستخدمين",
"key2": "إضافة مستخدم"
}
Get the Arabic contents from the arabic.json file
with open(arabic.json, encoding='utf-8') as f:
# Deserialises it
json_data = json.load(f)
f.close()
# JSON formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)
To use JSON data in a Django template follow the below steps:
# If have to get the JSON index in a Django template file, then simply decode the encoded string.
json.JSONDecoder().decode(json_data2)
Done! Now we can get the results as a JSON index with Arabic values.
Use unicode-escape to solve the problem
>>>import json
>>>json_string = json.dumps("ברי צקלה")
>>>json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'
Explanation
>>>s = '漢 χαν хан'
>>>print('Unicode: ' + s.encode('unicode-escape').decode('utf-8'))
Unicode: \u6f22 \u03c7\u03b1\u03bd \u0445\u0430\u043d
>>>u = s.encode('unicode-escape').decode('utf-8')
>>>print('Original: ' + u.encode("utf-8").decode('unicode-escape'))
Original: 漢 χαν хан
Original resource:Python3 使用 unicode-escape 处理 unicode 16进制字符串编解码问题
Here's my solution using json.dump():
def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
with codecs.open(p, 'wb', 'utf_8') as fileobj:
json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)
where SYSTEM_ENCODING is set to:
locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]
Use codecs if possible,
with codecs.open('file_path', 'a+', 'utf-8') as fp:
fp.write(json.dumps(res, ensure_ascii=False))
Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)
You need extra settings in either site.py or sitecustomize.py to set your sys.getdefaultencoding() correct. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.
If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that Python will use your operation system's locale.
If you prefer to use sitecustomize.py, which may not exist if you haven't created it, simply add these lines:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Then you can do some Chinese JSON output in UTF-8 format, such as:
name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)
You will get an UTF-8 encoded string, rather than a \u escaped JSON string.
To verify your default encoding:
print sys.getdefaultencoding()
You should get "utf-8" or "UTF-8" to verify your site.py or sitecustomize.py settings.
Please note that you could not do sys.setdefaultencoding("utf-8") at an interactive Python console.

Python: read json, filter content, write back to json [duplicate]

Sample code (in a REPL):
import json
json_string = json.dumps("ברי צקלה")
print(json_string)
Output:
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"
The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).
Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?
Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:
>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"
If you are writing to a file, just use json.dump() and leave it to the file object to encode:
with open('filename', 'w', encoding='utf8') as json_file:
json.dump("ברי צקלה", json_file, ensure_ascii=False)
Caveats for Python 2
For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:
with io.open('filename', 'w', encoding='utf8') as json_file:
json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(u"ברי צקלה", ensure_ascii=False)
# unicode(data) auto-decodes data to unicode if str
json_file.write(unicode(data))
In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה
To write to a file
import codecs
import json
with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)
To print to stdout
import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))
This is the wrong answer, but it's still useful to understand why it's wrong. See comments.
Use unicode-escape:
>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}
Pieters' Python 2 workaround fails on an edge case:
d = {u'keyword': u'bad credit \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False).decode('utf8')
try:
json_file.write(data)
except TypeError:
# Decode data to Unicode first
json_file.write(data.decode('utf8'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)
It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler by avoiding that step as well as the special casing of ASCII:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False, encoding='utf8')
json_file.write(unicode(data))
cat filename
{"keyword": "bad credit çredit cards"}
As of Python 3.7 the following code works fine:
from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)
Output:
{"symbol": "ƒ"}
Thanks for the original answer here. With Python 3 the following line of code:
print(json.dumps(result_dict,ensure_ascii=False))
was ok. Consider trying not writing too much text in the code if it's not imperative.
This might be good enough for the Python console. However, to satisfy a server, you might need to set the locale as explained here (if it is on Apache 2)
Setting LANG and LC_ALL when using mod_wsgi
Basically, install he_IL or whatever language locale on Ubuntu.
Check it is not installed:
locale -a
Install it, where XX is your language:
sudo apt-get install language-pack-XX
For example:
sudo apt-get install language-pack-he
Add the following text to /etc/apache2/envvrs
export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'
Then you would hopefully not get Python errors on from Apache like:
print (js)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-45: ordinal not in range(128)
Also in Apache, try to make UTF the default encoding as explained here:
How to change the default encoding to UTF-8 for Apache
Do it early because Apache errors can be pain to debug and you can mistakenly think it's from Python which possibly isn't the case in that situation.
The following is my understanding var reading answer above and google.
# coding:utf-8
r"""
#update: 2017-01-09 14:44:39
#explain: str, unicode, bytes in python2to3
#python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
#1.reload
#importlib,sys
#importlib.reload(sys)
#sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
#not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
#2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
#too complex
#3.control by your own (best)
#==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
#see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes
#how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
#http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""
from __future__ import print_function
import json
a = {"b": u"中文"} # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')
# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')
# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'
u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'
#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
b'\xe4\xb8\xad\xe6\x96\x87'
'中文'
'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""
If you are loading a JSON string from a file and the file content is Arabic texts, then this will work.
Assume a file like arabic.json
{
"key1": "لمستخدمين",
"key2": "إضافة مستخدم"
}
Get the Arabic contents from the arabic.json file
with open(arabic.json, encoding='utf-8') as f:
# Deserialises it
json_data = json.load(f)
f.close()
# JSON formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)
To use JSON data in a Django template follow the below steps:
# If have to get the JSON index in a Django template file, then simply decode the encoded string.
json.JSONDecoder().decode(json_data2)
Done! Now we can get the results as a JSON index with Arabic values.
Use unicode-escape to solve the problem
>>>import json
>>>json_string = json.dumps("ברי צקלה")
>>>json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'
Explanation
>>>s = '漢 χαν хан'
>>>print('Unicode: ' + s.encode('unicode-escape').decode('utf-8'))
Unicode: \u6f22 \u03c7\u03b1\u03bd \u0445\u0430\u043d
>>>u = s.encode('unicode-escape').decode('utf-8')
>>>print('Original: ' + u.encode("utf-8").decode('unicode-escape'))
Original: 漢 χαν хан
Original resource:Python3 使用 unicode-escape 处理 unicode 16进制字符串编解码问题
Here's my solution using json.dump():
def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
with codecs.open(p, 'wb', 'utf_8') as fileobj:
json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)
where SYSTEM_ENCODING is set to:
locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]
Use codecs if possible,
with codecs.open('file_path', 'a+', 'utf-8') as fp:
fp.write(json.dumps(res, ensure_ascii=False))
Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)
You need extra settings in either site.py or sitecustomize.py to set your sys.getdefaultencoding() correct. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.
If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that Python will use your operation system's locale.
If you prefer to use sitecustomize.py, which may not exist if you haven't created it, simply add these lines:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Then you can do some Chinese JSON output in UTF-8 format, such as:
name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)
You will get an UTF-8 encoded string, rather than a \u escaped JSON string.
To verify your default encoding:
print sys.getdefaultencoding()
You should get "utf-8" or "UTF-8" to verify your site.py or sitecustomize.py settings.
Please note that you could not do sys.setdefaultencoding("utf-8") at an interactive Python console.

Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence

Sample code (in a REPL):
import json
json_string = json.dumps("ברי צקלה")
print(json_string)
Output:
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"
The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).
Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?
Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:
>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"
If you are writing to a file, just use json.dump() and leave it to the file object to encode:
with open('filename', 'w', encoding='utf8') as json_file:
json.dump("ברי צקלה", json_file, ensure_ascii=False)
Caveats for Python 2
For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:
with io.open('filename', 'w', encoding='utf8') as json_file:
json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(u"ברי צקלה", ensure_ascii=False)
# unicode(data) auto-decodes data to unicode if str
json_file.write(unicode(data))
In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה
To write to a file
import codecs
import json
with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)
To print to stdout
import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))
This is the wrong answer, but it's still useful to understand why it's wrong. See comments.
Use unicode-escape:
>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}
Pieters' Python 2 workaround fails on an edge case:
d = {u'keyword': u'bad credit \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False).decode('utf8')
try:
json_file.write(data)
except TypeError:
# Decode data to Unicode first
json_file.write(data.decode('utf8'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)
It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler by avoiding that step as well as the special casing of ASCII:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(d, ensure_ascii=False, encoding='utf8')
json_file.write(unicode(data))
cat filename
{"keyword": "bad credit çredit cards"}
As of Python 3.7 the following code works fine:
from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)
Output:
{"symbol": "ƒ"}
Thanks for the original answer here. With Python 3 the following line of code:
print(json.dumps(result_dict,ensure_ascii=False))
was ok. Consider trying not writing too much text in the code if it's not imperative.
This might be good enough for the Python console. However, to satisfy a server, you might need to set the locale as explained here (if it is on Apache 2)
Setting LANG and LC_ALL when using mod_wsgi
Basically, install he_IL or whatever language locale on Ubuntu.
Check it is not installed:
locale -a
Install it, where XX is your language:
sudo apt-get install language-pack-XX
For example:
sudo apt-get install language-pack-he
Add the following text to /etc/apache2/envvrs
export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'
Then you would hopefully not get Python errors on from Apache like:
print (js)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-45: ordinal not in range(128)
Also in Apache, try to make UTF the default encoding as explained here:
How to change the default encoding to UTF-8 for Apache
Do it early because Apache errors can be pain to debug and you can mistakenly think it's from Python which possibly isn't the case in that situation.
The following is my understanding var reading answer above and google.
# coding:utf-8
r"""
#update: 2017-01-09 14:44:39
#explain: str, unicode, bytes in python2to3
#python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
#1.reload
#importlib,sys
#importlib.reload(sys)
#sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
#not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
#2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
#too complex
#3.control by your own (best)
#==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
#see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes
#how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
#http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""
from __future__ import print_function
import json
a = {"b": u"中文"} # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')
# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')
# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'
u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'
#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
b'\xe4\xb8\xad\xe6\x96\x87'
'中文'
'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""
If you are loading a JSON string from a file and the file content is Arabic texts, then this will work.
Assume a file like arabic.json
{
"key1": "لمستخدمين",
"key2": "إضافة مستخدم"
}
Get the Arabic contents from the arabic.json file
with open(arabic.json, encoding='utf-8') as f:
# Deserialises it
json_data = json.load(f)
f.close()
# JSON formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)
To use JSON data in a Django template follow the below steps:
# If have to get the JSON index in a Django template file, then simply decode the encoded string.
json.JSONDecoder().decode(json_data2)
Done! Now we can get the results as a JSON index with Arabic values.
Use unicode-escape to solve the problem
>>>import json
>>>json_string = json.dumps("ברי צקלה")
>>>json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'
Explanation
>>>s = '漢 χαν хан'
>>>print('Unicode: ' + s.encode('unicode-escape').decode('utf-8'))
Unicode: \u6f22 \u03c7\u03b1\u03bd \u0445\u0430\u043d
>>>u = s.encode('unicode-escape').decode('utf-8')
>>>print('Original: ' + u.encode("utf-8").decode('unicode-escape'))
Original: 漢 χαν хан
Original resource:Python3 使用 unicode-escape 处理 unicode 16进制字符串编解码问题
Here's my solution using json.dump():
def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
with codecs.open(p, 'wb', 'utf_8') as fileobj:
json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)
where SYSTEM_ENCODING is set to:
locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]
Use codecs if possible,
with codecs.open('file_path', 'a+', 'utf-8') as fp:
fp.write(json.dumps(res, ensure_ascii=False))
Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)
You need extra settings in either site.py or sitecustomize.py to set your sys.getdefaultencoding() correct. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.
If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that Python will use your operation system's locale.
If you prefer to use sitecustomize.py, which may not exist if you haven't created it, simply add these lines:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Then you can do some Chinese JSON output in UTF-8 format, such as:
name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)
You will get an UTF-8 encoded string, rather than a \u escaped JSON string.
To verify your default encoding:
print sys.getdefaultencoding()
You should get "utf-8" or "UTF-8" to verify your site.py or sitecustomize.py settings.
Please note that you could not do sys.setdefaultencoding("utf-8") at an interactive Python console.

Categories

Resources