I came from this old discussion, but the solution didn't help much, as my original data was encoded differently:
My data is already a Unicode string; I need to output it as UTF-8.
data={"content":u"\u4f60\u597d"}
When I try to convert to UTF-8:
json.dumps(data, indent=1, ensure_ascii=False).encode("utf8")
the output I get is
"content": "ä½ å¥½" and the expected out put should be
"content": "你好"
Without ensure_ascii=False, the output stays as the escaped "content": "\u4f60\u597d".
How can I convert the previously \u escaped json to UTF-8?
Your code already produces valid UTF-8 JSON data:
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> json.dumps(data, indent=1, ensure_ascii=False)
u'{\n "content": "\u4f60\u597d"\n}'
>>> json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
{
"content": "你好"
}
My terminal just happens to be configured to handle UTF-8, so printing the UTF-8 bytes to my terminal produced the desired output.
However, if your terminal is not set up for such output, it is your terminal that then shows 'wrong' characters:
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8').decode('latin1')
{
"content": "ä½ å¥½"
}
Note how I decoded the data to Latin-1 to deliberately mis-read the UTF-8 bytes.
This isn't a Python problem; this is a problem with how you are handling the UTF-8 bytes in whatever tool you used to read these bytes.
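If the destination is a file rather than a terminal, you can sidestep the console's encoding entirely by writing with an explicit encoding. A minimal sketch ('output.json' is an illustrative name), using io.open so it behaves the same on Python 2 and 3:

```python
import io
import json

data = {'content': u'\u4f60\u597d'}

# io.open takes an explicit encoding on both Python 2 and 3,
# so the terminal's configuration never comes into play.
with io.open('output.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(data, indent=1, ensure_ascii=False))
```

Whatever tool later reads the file just needs to read it back as UTF-8.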
In Python 2 this works; in Python 3, however, print will output the bytes object:
>>> b'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
In Python 3, do not call encode('utf8'); print the string directly:
>>> print(json.dumps(data, indent=1, ensure_ascii=False))
{
"content": "你好"
}
or use sys.stdout.buffer.write instead of print:
>>> import sys
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> sys.stdout.buffer.write(json.dumps(data, indent=1,
ensure_ascii=False).encode('utf8') + b'\n')
{
"content": "你好"
}
See Write UTF-8 to stdout, regardless of the console's encoding.
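On Python 3.7+, yet another option is to reconfigure stdout itself to emit UTF-8 (a sketch assuming sys.stdout.reconfigure, which only exists since 3.7):

```python
import json
import sys

data = {'content': '\u4f60\u597d'}

# Python 3.7+ only: make stdout emit UTF-8 regardless of the
# console's detected encoding (older versions lack this method).
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')

print(json.dumps(data, indent=1, ensure_ascii=False))
```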
Related
I'm working on converting portions of XHTML to JSON objects. I finally got everything in JSON form, but some UTF-8 character codes are being printed.
Example:
{
"p": {
"#class": "para-p",
"#text": "I\u2019m not on Earth."
}
}
This should be:
{
"p": {
"#class": "para-p",
"#text": "I'm not on Earth."
}
}
This is just one example of UTF-8 codes coming through. How can I go through the string and replace every instance of a UTF-8 code with the character it represents?
\u2019 is not a UTF-8 character, but a Unicode escape code. It's valid JSON and when read back via json.load will become ’ (RIGHT SINGLE QUOTATION MARK).
If you want to write the actual character, use ensure_ascii=False to prevent escape codes from being written for non-ASCII characters:
with open('output.json', 'w', encoding='utf8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
You didn't paste your code, so I don't know how you converted XHTML to JSON. I assume you ended up with escape sequences in your Python strings. \u2019 is an escape sequence denoting a single character by its 16-bit hex code point. The json module handles this by default; for example, json.loads can fix it:
x = '''{
"p": {
"#class": "para-p",
"#text": "I\\u2019m not on Earth."
}
}'''
print(x)
x_json = json.loads(x)
print(x_json)
Output shows:
{
"p": {
"#class": "para-p",
"#text": "I\u2019m not on Earth."
}
}
{'p': {'#class': 'para-p', '#text': 'I’m not on Earth.'}}
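Putting the two pieces together, the full round trip looks like this (a minimal sketch): json.loads decodes the \u2019 escape into the real character, and ensure_ascii=False keeps it literal when re-serializing.

```python
import json

raw = '{"#text": "I\\u2019m not on Earth."}'

obj = json.loads(raw)                      # escape decoded to the real character
out = json.dumps(obj, ensure_ascii=False)  # character written literally, not escaped

print(out)  # {"#text": "I’m not on Earth."}
```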
Recently I ran into a problem while working with JSON in Python, specifically with special symbols in JSON. The problem is shown by the code below:
import json
app = {
    "text": "°"
}
print(json.dumps(app, indent=2))
but running this, I get:
{
"text": "\u00b0"
}
Here the ° sign is replaced with \u00b0, but I want the output to match my input exactly. How can I do that?
Thanks in advance.
According to https://pynative.com/python-json-encode-unicode-and-non-ascii-characters-as-is/, you want to set ensure_ascii=False:
>>> import json
>>> app={"text": "°"}
>>> print(json.dumps(app, indent=2, ensure_ascii=False))
{
"text": "°"
}
{
"Sponge": {
"orientation": "Straight",
"gender": "Woman",
"age": 23,
"rel_status": "Single",
"summary": " Bonjour! Je m'appelle Jacqueline!, Enjoy cooking, reading and traveling!, Love animals, languages and nature :-) ",
"location": "Kao-hsiung-k’a",
"id": "6693397339871"
}
}
I have the JSON above and I'm trying to read it, except there are some special characters in it, for example the "’" in location. This raises an error when I try to read the JSON:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 27-28: character maps to <undefined>
I'm using Python 3.5 with the following code:
with open('test.json') as json_data:
    users = json.load(json_data)
    print(users)
Use the codecs module to open the file with an explicit encoding as a quick fix:
import codecs

with codecs.open('test.json', 'r', 'utf-8') as json_data:
    users = json.load(json_data)
    print(users)
The answer to this question can also be found easily on the web (hint: that's how I learned about this module).
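In Python 3 the built-in open also accepts an encoding argument directly, so codecs isn't strictly required. A self-contained sketch (the file name and contents are illustrative, echoing the question's data):

```python
import json

sample = {"Sponge": {"location": "Kao-hsiung-k\u2019a"}}

# Write the file as UTF-8, then read it back with an explicit encoding;
# no codecs.open needed on Python 3.
with open('test.json', 'w', encoding='utf-8') as f:
    json.dump(sample, f, ensure_ascii=False)

with open('test.json', 'r', encoding='utf-8') as f:
    users = json.load(f)

print(users['Sponge']['location'])  # Kao-hsiung-k’a
```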
OK, I found my solution: it's a problem with the Windows terminal. You have to type this in the terminal: chcp 65001
Then launch your program!
More explanation here: Why doesn't Python recognize my utf-8 encoded source file?
I am trying to write a script to download images from an API. I have set up a loop as follows:
response = requests.get(url, params=query)
json_data = json.dumps(response.text)
pythonVal = json.loads(json.loads(json_data))
print(pythonVal)
The print(pythonVal) returns:
{
"metadata": {
"code": 200,
"message": "OK",
"version": "v2.0"
},
"data": {
"_links": {
"self": {
"href": "redactedLink"
}
},
"id": "123456789",
"_fixed": true
,
"type": "IMAGE",
"source": "social media",
"source_id": "1234567890_1234567890",
"original_source": "link",
"caption": "caption",
"video_url": null,
"share_url": "link",
"date_submitted": "2016-07-11T09:34:35+00:00",
"date_published": "2016-09-11T16:30:26+00:00",
I keep getting an error that reads:
UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in
position 527: ordinal not in range(128)
For the pythonVal variable, if I just have it set to json.loads(json_data), it prints out the JSON response, but then when I try doing pythonVal['data'] I get another error that reads:
TypeError: string indices must be integers
Ultimately I'd like to be able to get data from it by doing something like
pythonVal['data']['_embedded']['uploader']['username']
Thanks for your input!
Why call json.loads() twice? response.text is already a JSON string, and json.dumps(response.text) wraps it in an extra layer of quoting, so a single loads only gets you back to that string. Drop the dumps and parse once:
pythonVal = json.loads(response.text)
or use the decoder built into requests:
pythonVal = response.json()
This also explains the TypeError: string indices must be integers: with a single json.loads on the double-encoded data, pythonVal is still a string, not a dict, so pythonVal['data'] tries to index a string with a string key.
Please also mention the sample JSON content with the question, if you want better help from others :)
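To see why two loads calls were needed in the first place, here is a minimal sketch simulating the situation (the JSON payload is illustrative):

```python
import json

# Simulate requests: response.text is already a JSON string.
raw = '{"data": {"id": "123456789"}}'

wrapped = json.dumps(raw)    # wraps the JSON string in another layer of quoting
once = json.loads(wrapped)   # back to the original string, not a dict
twice = json.loads(once)     # now an actual dict

print(type(once).__name__)   # str
print(twice['data']['id'])   # 123456789

# The simpler path: parse the response text exactly once.
parsed = json.loads(raw)
print(parsed['data']['id'])  # 123456789
```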
Put the following at the top of your source file. In Python 2, this declares the source file's encoding as UTF-8, so Python can decode non-ASCII characters in the file itself (it does not change the runtime default encoding):
# -*- coding: utf-8 -*-
The second error occurs because you already have a string, and strings can only be indexed with integers (character positions), not with keys like 'data'.
I'm trying to scrape from a non-English website using Scrapy. The scraped results as JSON look something like this:
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"},
This is the code I'm using:
def parse(self, response):
    for sel in response.xpath('//section[@class="items-box"]'):
        item = ShopItem()
        item['name'] = sel.xpath('a/div/h3/text()').extract_first()
        item['price'] = sel.xpath('a/div/div/div[1]/text()').extract_first().replace("$", "")
        yield item
How would I output unescaped Unicode characters onto the JSON?
Edit (2016-10-19):
With Scrapy 1.2+, you can set FEED_EXPORT_ENCODING to the character encoding you need for the output JSON file, e.g. FEED_EXPORT_ENCODING = 'utf-8' (the default value being None, which means \uXXXX escaping).
Note: I'm adapting what I wrote on GitHub for a similar issue I linked to in the question's comments.
Note that there's an open issue on Scrapy to make the output encoding a parameter: https://github.com/scrapy/scrapy/issues/1965
Scrapy's default JSON exporter uses (the default) ensure_ascii=True argument, so it outputs Unicode characters as \uXXXX sequences before writing to file. (This is what is used when doing -o somefile.json)
Setting ensure_ascii=False in the exporter will output Unicode strings, which will end up as UTF-8 encoded on file. See custom exporter code at the bottom here.
To illustrate, let's read your input JSON string back into some data to work on:
>>> import json
>>> test = r'''{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'''
>>> json.loads(test)
{u'price': u'13,000', u'name': u'\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc'}
The input with \uXXXX sequences is valid JSON for Python (as it should), and loads() produces a valid Python dict.
Now let's serialize to JSON again:
>>> # dumping the dict back to JSON, with default ensure_ascii=True
>>> json.dumps(json.loads(test))
'{"price": "13,000", "name": "\\u58c1\\u6bb4\\u308a\\u4ee3\\u884c\\u69d8\\u5c02\\u7528\\u2605 \\u30c6\\u30ec\\u30d3\\u672c\\u4f53 20v\\u578b \\u767d \\u9001\\u6599\\u8fbc"}'
>>>
And now with ensure_ascii=False
>>> # now dumping with ensure_ascii=False, you get a Unicode string
>>> json.dumps(json.loads(test), ensure_ascii=False)
u'{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'
>>>
Let's print to see the difference:
>>> print json.dumps(json.loads(test))
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}
>>> print json.dumps(json.loads(test), ensure_ascii=False)
{"price": "13,000", "name": "壁殴り代行様専用★ テレビ本体 20v型 白 送料込"}
If you want to write JSON items as UTF-8, you can do it like this:
1. Define a custom item exporter, e.g. in an exporters.py file in your project:
$ cat myproject/exporters.py
from scrapy.exporters import JsonItemExporter

class Utf8JsonItemExporter(JsonItemExporter):
    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)
2. Replace the default JSON item exporter in your settings.py:
FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}
Use the codecs module for text-to-text decoding. (In Python 2 it's not strictly necessary, but in Python 3 str has no decode method, because the str methods convert str to bytes and back, not str to str.) Decoding with the unicode_escape codec will get you the correct data back:
import codecs
somestr = codecs.decode(strwithescapes, 'unicode-escape')
So to fix the names you're getting (note that extract() returns a list of strings, so decode each one):
item['name'] = [codecs.decode(name, 'unicode_escape')
                for name in sel.xpath('a/div/h3/text()').extract()]
If the problem is in the JSON you're producing, make sure the json module isn't escaping non-ASCII characters. It does so by default, because not all JSON parsers can handle true Unicode characters (many assume data is sent as ASCII bytes with escapes). Wherever you call json.dump/json.dumps (or create a json.JSONEncoder), explicitly pass ensure_ascii=False.
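A minimal sketch of the unicode_escape decode (the string is illustrative; note this codec treats the input as Latin-1 while resolving escapes, so apply it only to strings whose non-ASCII content is entirely in backslash escapes):

```python
import codecs

# A str containing a literal backslash-u escape, as scraped text often does.
escaped = 'price: 13,000 \\u2605'
decoded = codecs.decode(escaped, 'unicode_escape')

print(decoded)  # price: 13,000 ★
```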