UTF-8 characters in python string even after decoding from UTF-8?

I'm working on converting portions of XHTML to JSON objects. I finally got everything in JSON form, but some UTF-8 character codes are being printed.
Example:
{
"p": {
"#class": "para-p",
"#text": "I\u2019m not on Earth."
}
}
This should be:
{
"p": {
"#class": "para-p",
"#text": "I'm not on Earth."
}
}
This is just one example of UTF-8 codes coming through. How can I go through the string and replace every instance of a UTF-8 code with the character it represents?

\u2019 is not a UTF-8 character, but a Unicode escape code. It's valid JSON and when read back via json.load will become ’ (RIGHT SINGLE QUOTATION MARK).
If you want to write the actual character, use ensure_ascii=False to prevent escape codes from being written for non-ASCII characters:
import json
with open('output.json', 'w', encoding='utf8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
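
To make the difference concrete, here is a minimal sketch assuming Python 3. The dictionary mirrors the example from the question, and the temporary file path is only an illustration. Both forms are valid JSON and load back to the same object; only the bytes on disk differ.

```python
import json
import os
import tempfile

# The same structure as in the question; \u2019 here is a Python escape
# for the single character RIGHT SINGLE QUOTATION MARK.
data = {"p": {"#class": "para-p", "#text": "I\u2019m not on Earth."}}

escaped = json.dumps(data)                      # default: ASCII-only output
literal = json.dumps(data, ensure_ascii=False)  # keeps the real character

# Write the literal form to a (temporary, illustrative) file and read it back.
path = os.path.join(tempfile.mkdtemp(), "output.json")
with open(path, "w", encoding="utf8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
with open(path, encoding="utf8") as f:
    restored = json.load(f)

print(escaped)
print(literal)
```

Whichever form you write, json.load gives you back the same string, so the escape codes only matter if a human or a non-JSON tool reads the file.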

You didn't paste your code, so I don't know how you converted the XHTML to JSON. I assume that you ended up with escape sequences in your output. \u2019 represents a single character by its 16-bit hexadecimal code point. The json module handles this by default. For example, json.loads decodes it:
import json
x = '''{
"p": {
"#class": "para-p",
"#text": "I\\u2019m not on Earth."
}
}'''
print(x)
x_json=json.loads(x)
print(x_json)
Output shows:
{
"p": {
"#class": "para-p",
"#text": "I\u2019m not on Earth."
}
}
{'p': {'#class': 'para-p', '#text': 'I’m not on Earth.'}}

Related

using symbols in json in python

Recently, I got a problem while working with json in python. Actually that is about special symbols in json. The problem is defined with code below:
import json
app = {
"text": "°"
}
print(json.dumps(app, indent=2))
but when I run this I get:
{
"text": "\u00b0"
}
Here the ° sign is replaced with \u00b0. But I want the output to be exactly the same as my input. How can I do it?
Thanks in advance.

According to https://pynative.com/python-json-encode-unicode-and-non-ascii-characters-as-is/, you want to set ensure_ascii=False:
>>> import json
>>> app={"text": "°"}
>>> print(json.dumps(app, indent=2, ensure_ascii=False))
{
"text": "°"
}

Python multi-line JSON and variables

I'm trying to encode a somewhat large JSON in Python (v2.7) and I'm having trouble putting in my variables!
As the JSON is multi-line and to keep my code neat I've decided to use the triple double quotation mark to make it look as follows:
my_json = """{
"settings": {
"serial": "1",
"status": "2",
"ersion": "3"
},
"config": {
"active": "4",
"version": "5"
}
}"""
To encode this, and output it works well for me, but I'm not sure how I can change the numbers I have there and replace them by variable strings. I've tried:
"settings": {
"serial": 'json_serial',
but to no avail. Any help would be appreciated!

Why don't you make it a dictionary, set the variables, and then use the json library to turn it into JSON?
import json
json_serial = "123"
my_json = {
    'settings': {
        "serial": json_serial,
        "status": '2',
        "ersion": '3',
    },
    'config': {
        'active': '4',
        'version': '5'
    }
}
print(json.dumps(my_json))

If you absolutely insist on generating JSON with string concatenation -- and, to be clear, you absolutely shouldn't -- the only way to be entirely certain that your output is valid JSON is to generate the substrings being substituted with a JSON generator. That is:
'''"settings" : {{
    "serial" : {serial},
    "version" : {version}
}}'''.format(serial=json.dumps("5"), version=json.dumps(1))
But don't. Really, really don't. The answer by @davidejones is the Right Thing for this scenario.
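
A small sketch of why hand-built JSON is fragile, assuming Python 3. The serial value is hypothetical, chosen only because it contains a double quote; any value with a quote, backslash, or newline would cause the same problem.

```python
import json

# Hypothetical value that contains a double quote.
serial = 'A"1'

# Naive interpolation: the quote inside the value is not escaped,
# so the result is not valid JSON.
naive = '{"settings": {"serial": "%s"}}' % serial
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# Building a dict and serialising it lets json.dumps do the escaping.
safe = json.dumps({"settings": {"serial": serial}})

print(naive_ok)
print(safe)
```

The dictionary version keeps working no matter what ends up in the variable, which is the main reason to prefer it.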

Access JSON data from API

I am trying to write a script to download images from an API. I have set up a loop as follows:
response = requests.get(url, params=query)
json_data = json.dumps(response.text)
pythonVal = json.loads(json.loads(json_data))
print(pythonVal)
The print(pythonVal) returns:
{
"metadata": {
"code": 200,
"message": "OK",
"version": "v2.0"
},
"data": {
"_links": {
"self": {
"href": "redactedLink"
}
},
"id": "123456789",
"_fixed": true,
"type": "IMAGE",
"source": "social media",
"source_id": "1234567890_1234567890",
"original_source": "link",
"caption": "caption",
"video_url": null,
"share_url": "link",
"date_submitted": "2016-07-11T09:34:35+00:00",
"date_published": "2016-09-11T16:30:26+00:00",
I keep getting an error that reads:
UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in
position 527: ordinal not in range(128)
For the pythonVal variable, if I just have it set to json.loads(json_data), it prints out the JSON response, but then when I try doing pythonVal['data'] I get another error that reads:
TypeError: string indices must be integers
Ultimately I'd like to be able to get data from it by doing something like
pythonVal['data']['_embedded']['uploader']['username']
Thanks for your input!

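Why call json.dumps() and then json.loads() twice? response.text is already a JSON document, so json.dumps() only wraps it in one more layer of string quoting, which the extra json.loads() then has to undo. Change:
json_data = json.dumps(response.text)
pythonVal = json.loads(json.loads(json_data))
to:
pythonVal = json.loads(response.text)
and it should work.
As for the error TypeError: string indices must be integers on pythonVal['data']: it means that pythonVal is still a str and not a dict. A single json.loads(json_data) only reverses the json.dumps() call and hands back the original response text, so the JSON itself has not been parsed yet.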
Please also mention the sample JSON content with the question, if you want better help from others :)

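Put the following at the top of your code. It declares that the source file itself is encoded as UTF-8, so non-ASCII characters in string literals are read correctly. Note that it does not change the encoding print uses for output, which comes from your terminal or locale settings (for example the PYTHONIOENCODING environment variable).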
# -*- coding: utf-8 -*-
The second error is because you have already gotten the string, and you need integer indices to get the characters of the string.
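
To see why the indexing fails, here is a minimal sketch assuming Python 3. The response_text string is a made-up stand-in for response.text, so nothing is actually fetched.

```python
import json

# Stand-in for response.text: the API already returns a JSON document.
response_text = '{"metadata": {"code": 200}, "data": {"id": "123456789", "caption": "\\u00c4"}}'

# json.dumps on a string wraps it in another layer of JSON quoting...
wrapped = json.dumps(response_text)
# ...so a single json.loads only gives the original string back.
once = json.loads(wrapped)

# Parsing the response text directly yields the dict you can index.
parsed = json.loads(response_text)

print(type(once).__name__)   # still a string
print(parsed["data"]["id"])
```

With requests you can also call response.json(), which does this parsing step for you.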

json.dumps \u escaped unicode to utf8

I came from this old discussion, but the solution didn't help much as my original data was encoded differently:
My original data was already encoded in unicode, I need to output as UTF-8
data={"content":u"\u4f60\u597d"}
When I try to convert to utf:
json.dumps(data, indent=1, ensure_ascii=False).encode("utf8")
the output I get is
"content": "ä½ å¥½"
and the expected output should be
"content": "你好"
I tried without ensure_ascii=False and the output becomes the plain escaped "content": "\u4f60\u597d"
How can I convert the previously \u escaped json to UTF-8?

You have UTF-8 JSON data:
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> json.dumps(data, indent=1, ensure_ascii=False)
u'{\n "content": "\u4f60\u597d"\n}'
>>> json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
{
"content": "你好"
}
My terminal just happens to be configured to handle UTF-8, so printing the UTF-8 bytes to my terminal produced the desired output.
However, if your terminal is not set up for such output, it is your terminal that then shows 'wrong' characters:
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8').decode('latin1')
{
"content": "ä½ å¥½"
}
Note how I decoded the data to Latin-1 to deliberately mis-read the UTF-8 bytes.
This isn't a Python problem; this is a problem with how you are handling the UTF-8 bytes in whatever tool you used to read these bytes.
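
The misreading can be reproduced without a terminal at all. This is a minimal sketch assuming Python 3, where str is always Unicode and encode returns bytes. The last step only works because Latin-1 maps every byte to a character, so nothing is lost along the way.

```python
# The two characters from the question, written as escapes.
text = "\u4f60\u597d"

# Correct UTF-8 bytes: three bytes per character.
utf8_bytes = text.encode("utf8")

# A tool that wrongly assumes Latin-1 turns each byte into its own character.
misread = utf8_bytes.decode("latin1")

# Reversing the wrong decode recovers the original text.
repaired = misread.encode("latin1").decode("utf8")

print(utf8_bytes)
print(len(text), len(misread))
print(repaired == text)
```

So the bytes themselves are fine. What needs fixing is the encoding assumed by whatever reads them.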

In Python 2, it works; however, in Python 3 print will output something like:
b'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
Do not use encode('utf8'):
>>> print(json.dumps(data, indent=1, ensure_ascii=False))
{
"content": "你好"
}
or use sys.stdout.buffer.write instead of print:
>>> import sys
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> sys.stdout.buffer.write(json.dumps(data, indent=1,
ensure_ascii=False).encode('utf8') + b'\n')
{
"content": "你好"
}
See "Write UTF-8 to stdout, regardless of the console's encoding".

Read JSON as a variable from a file

Trying to read a multi-line variable (JSON) from a file in python, but getting an error.
#config.py
A = {
"query" : {
"match_all" : { }
}
}
#client.py
from config import *
print A
I get {'query':{'match_all':{}}} <--- double quotes are replaced with single quotes. Is there a way to preserve the original?
Thanks,

The single quotes are there as a result of Python's representation of strings. If you really want the double quotes, you can do a trivial str.replace:
>>> print str(A).replace("'",'"')
{"query": {"match_all": {}}}
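
If what you actually need is JSON text, a more robust option is to serialise the dict with json.dumps instead of editing its repr. This is a minimal sketch assuming Python 3. The tricky dict is a hypothetical example showing where the quote replacement breaks.

```python
import json

A = {"query": {"match_all": {}}}

# json.dumps always emits double-quoted JSON strings.
as_json = json.dumps(A)

# Hypothetical value with an apostrophe: the replace trick corrupts it.
tricky = {"name": "O'Brien"}
hacked = str(tricky).replace("'", '"')
try:
    json.loads(hacked)
    hack_is_valid = True
except json.JSONDecodeError:
    hack_is_valid = False

print(as_json)
print(hacked, hack_is_valid)
print(json.dumps(tricky))
```

The replace approach is fine for a quick look at simple data, but json.dumps is the safer choice once the output has to be parsed by something else.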
