{
  "Sponge": {
    "orientation": "Straight",
    "gender": "Woman",
    "age": 23,
    "rel_status": "Single",
    "summary": " Bonjour! Je m'appelle Jacqueline!, Enjoy cooking, reading and traveling!, Love animals, languages and nature :-) ",
    "location": "Kao-hsiung-k’a",
    "id": "6693397339871"
  }
}
I have this JSON above and I'm trying to read it, except there are some special characters in it, for example the "’" in "location". This raises an error when I try to read the JSON:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 27-28: character maps to <undefined>
I'm using Python 3.5 and I have the following code:
import json

with open('test.json') as json_data:
    users = json.load(json_data)
print(users)
Use the codecs module to open the file for a quick fix.
import codecs
import json

with codecs.open('test.json', 'r', 'utf-8') as json_data:
    users = json.load(json_data)
print(users)
Also, the answer to this question can be found easily on the web. (Hint: that's how I learned about this module.)
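On Python 3 you can also skip codecs entirely and pass the encoding straight to the built-in open(). A minimal self-contained sketch (the file contents below are a stand-in for the real test.json):

```python
import json

# Create a sample file containing a non-ASCII character (stands in for test.json).
with open('test.json', 'w', encoding='utf-8') as f:
    f.write('{"location": "Kao-hsiung-k\u2019a"}')

# Python 3's built-in open() accepts an encoding argument directly,
# so the codecs module is not needed.
with open('test.json', encoding='utf-8') as json_data:
    users = json.load(json_data)
print(users)
```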
OK, I found my solution: it's a problem with the Windows terminal. You have to type this in the terminal: chcp 65001
After that, launch your program!
More explanation here: Why doesn't Python recognize my utf-8 encoded source file?
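If you'd rather not touch the console settings, you can also force UTF-8 output from inside the script. A sketch, assuming Python 3.7+ for reconfigure() (on older versions, setting the PYTHONIOENCODING=utf-8 environment variable before launching has a similar effect):

```python
import sys

# Python 3.7+: force UTF-8 output regardless of the console code page.
# The hasattr guard keeps this harmless when stdout has been replaced
# by an object without reconfigure() (e.g. in some test harnesses).
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')

print('Kao-hsiung-k\u2019a')  # no UnicodeEncodeError once stdout is UTF-8
```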
I got a UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-4: character maps to <undefined> when using the esy.osmfilter package (version 1.0.7) to filter an OSM *.pbf file and then save it to a *.json file with the following code:
import os
from esy.osmfilter import Node, Way, Relation
from esy.osmfilter import run_filter

PBF_inputfile = os.path.join(os.getcwd(), 'liechtenstein-latest.osm.pbf')
JSON_outputfile = os.path.join(os.getcwd(), 'liechtenstein-latest_river.json')

prefilter = {Node: {}, Way: {'waterway': ['river', ], }, Relation: {}}
whitefilter = []
blackfilter = []

[Data, _] = run_filter('noname',
                       PBF_inputfile,
                       JSON_outputfile,
                       prefilter,
                       whitefilter,
                       blackfilter,
                       NewPreFilterData=True,
                       CreateElements=False,
                       LoadElements=False,
                       verbose=True)

print(len(Data['Node']))
print(len(Data['Relation']))
print(len(Data['Way']))
I followed the tutorial and used tags like {'waterway': ['stream', ], }, {'waterway': ['canal', ], }, {'waterway': ['dam', ], }, etc. in the prefilter, and they were all error-free. Then I found that the tag {'waterway': ['river', ], } causes the error mentioned above. I ran into the same situation with the Berlin data. Then I tried the Delaware data, which was error-free. So I thought it might be related to the German words? My default encoding is 'utf-8'.
I believe this is a pure Windows bug. Please use esy-osmfilter on a Linux machine for the moment. The error comes from an external library; however, I will fix this within the next few days.
This error is fixed with version 1.0.11.
I am trying to read Cyrillic characters from a JSON file and then output them to the console using Python 3.4.3 on Windows. A plain print('Russian smth буквы') works as intended.
But when I print the JSON contents, it seems to print in Windows-1251 - "СЂСѓСЃСЃРєРёРµ Р±СѓРєРІС‹" (though my console, my JSON file and my .py file (with a coding comment) are all in UTF-8).
I've tried re-encoding it to Win-1251 and setting the console to Win-1251, but still no luck.
My JSON (Encoded in UTF-8):
{
  "русские буквы": "что-то ещё на русском",
  "english letters": "и что-то на великом"
}
My code to load dictionary:
def load_dictionary():
    global Dictionary, isFatal
    try:
        with open(DictionaryName) as f:
            Dictionary = json.load(f)
    except Exception as e:
        logging.critical('Error loading dictionary: ' + str(e))
        isFatal = True
        return
    logging.info('Dictionary was loaded successfully')
I am trying to output it in 2 ways (both show the same gibberish):
print(helper.Dictionary.get('rly'))
print(helper.Dictionary)
An interesting add-on: I added the whole Russian alphabet to my JSON file and it seems to get stuck at the letter "С с" (Error loading dictionary: 'charmap' codec can't decode byte 0x81 in position X: character maps to <undefined>). If I remove this one letter, no exception is raised, but the problem above remains.
"But when I print JSON contents …"
If you print it using the type command, you get mojibake СЂСѓСЃСЃРєРёРµ … instead of русские … under the CHCP 1251 scope.
Try type under the CHCP 65001 (i.e. UTF-8) scope.
Follow nauer's advice, use open(DictionaryName, encoding="utf8").
Example (39755662.json is saved with UTF-8 encoding):
==> chcp 866
Active code page: 866
==> type 39755662.json
{
"╤А╤Г╤Б╤Б╨║╨╕╨╡ ╨▒╤Г╨║╨▓╤Л": "╤З╤В╨╛-╤В╨╛ ╨╡╤Й╤С ╨╜╨░ ╤А╤Г╤Б╤Б╨║╨╛╨╝",
"rly": "╤А╤Г╤Б╤Б╨║╨╕╨╣"
}
==> chcp 1251
Active code page: 1251
==> type 39755662.json
{
"русские буквы": "что-то ещё на русском",
"rly": "СЂСѓСЃСЃРєРёР№"
}
==> chcp 65001
Active code page: 65001
==> type 39755662.json
{
"русские буквы": "что-то ещё на русском",
"rly": "русский"
}
==>
I am trying to write a script to download images from an API. I have set up a loop that is as follows:
response = requests.get(url, params=query)
json_data = json.dumps(response.text)
pythonVal = json.loads(json.loads(json_data))
print(pythonVal)
The print(pythonVal) returns:
{
    "metadata": {
        "code": 200,
        "message": "OK",
        "version": "v2.0"
    },
    "data": {
        "_links": {
            "self": {
                "href": "redactedLink"
            }
        },
        "id": "123456789",
        "_fixed": true,
        "type": "IMAGE",
        "source": "social media",
        "source_id": "1234567890_1234567890",
        "original_source": "link",
        "caption": "caption",
        "video_url": null,
        "share_url": "link",
        "date_submitted": "2016-07-11T09:34:35+00:00",
        "date_published": "2016-09-11T16:30:26+00:00",
I keep getting an error that reads:
UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in
position 527: ordinal not in range(128)
For the pythonVal variable, if I just set it to json.loads(json_data), it prints out the JSON response, but then when I try pythonVal['data'] I get another error that reads:
TypeError: string indices must be integers
Ultimately I'd like to be able to get data from it by doing something like
pythonVal['data']['_embedded']['uploader']['username']
Thanks for your input!
Why do json.loads() twice? The root cause is this line:
json_data = json.dumps(response.text)
response.text is already serialized JSON, so dumps() wraps it into a JSON-encoded string; a single json.loads() then merely undoes the dumps() and hands you back a str, which is why pythonVal['data'] raises TypeError: string indices must be integers. Drop the dumps() and parse once:
pythonVal = json.loads(response.text)
or, simpler still, let requests do the parsing:
pythonVal = response.json()
Please also mention the sample JSON content in the question if you want better help from others :)
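A minimal reproduction of the double-encoding problem, using a literal string as a stand-in for response.text:

```python
import json

raw = '{"data": {"id": "123456789"}}'  # stand-in for response.text (already JSON)

wrapped = json.dumps(raw)    # dumps() re-encodes the JSON text as a JSON string
once = json.loads(wrapped)   # a single parse only undoes the dumps(): still a str
assert isinstance(once, str)

pythonVal = json.loads(raw)  # skip the dumps() and one parse yields a dict
print(pythonVal['data']['id'])  # → 123456789
```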
Put the following at the top of your source file. It declares the encoding of the source file itself as UTF-8 (note that it only affects how Python reads your .py file, not how print encodes its output):
# -*- coding: utf-8 -*-
The second error occurs because you already have a string, and indexing a string requires integers (it selects individual characters).
I came from this old discussion, but the solution didn't help much, as my original data was encoded differently:
My original data is already Unicode-escaped, and I need to output it as UTF-8:
data = {"content": u"\u4f60\u597d"}
When I try to convert to UTF-8:
json.dumps(data, indent=1, ensure_ascii=False).encode("utf8")
the output I get is
"content": "ä½ å¥½"
and the expected output should be
"content": "你好"
I tried without ensure_ascii=False, and the output becomes the plain escaped "content": "\u4f60\u597d".
How can I convert the previously \u-escaped JSON to UTF-8?
You have UTF-8 JSON data:
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> json.dumps(data, indent=1, ensure_ascii=False)
u'{\n "content": "\u4f60\u597d"\n}'
>>> json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
{
"content": "你好"
}
My terminal just happens to be configured to handle UTF-8, so printing the UTF-8 bytes to my terminal produced the desired output.
However, if your terminal is not set up for such output, it is your terminal that then shows 'wrong' characters:
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8').decode('latin1')
{
"content": "ä½ å¥½"
}
Note how I decoded the data to Latin-1 to deliberately mis-read the UTF-8 bytes.
This isn't a Python problem; this is a problem with how you are handling the UTF-8 bytes in whatever tool you used to read these bytes.
In Python 2 this works; in Python 3, however, print will output the bytes object:
>>> b'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
Do not use encode('utf8'):
>>> print(json.dumps(data, indent=1, ensure_ascii=False))
{
"content": "你好"
}
or use sys.stdout.buffer.write instead of print:
>>> import sys
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> sys.stdout.buffer.write(json.dumps(data, indent=1,
ensure_ascii=False).encode('utf8') + b'\n')
{
"content": "你好"
}
see Write UTF-8 to stdout, regardless of the console's encoding
I'm trying to scrape from a non-English website using Scrapy. The scraped results as JSON look something like this:
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"},
This is the code I'm using:
def parse(self, response):
    for sel in response.xpath('//section[@class="items-box"]'):
        item = ShopItem()
        item['name'] = sel.xpath('a/div/h3/text()').extract()
        item['price'] = sel.xpath('a/div/div/div[1]/text()').extract_first().replace("$", "")
        yield item
How would I output unescaped Unicode characters onto the JSON?
Edit (2016-10-19):
With Scrapy 1.2+, you can set FEED_EXPORT_ENCODING to the character encoding you need for the output JSON file, e.g. FEED_EXPORT_ENCODING = 'utf-8' (the default value being None, which means \uXXXX escaping).
Note: I'm adapting what I wrote on GitHub for a similar issue I linked to in the question's comments.
Note that there's an open issue on Scrapy to make the output encoding a parameter: https://github.com/scrapy/scrapy/issues/1965
Scrapy's default JSON exporter uses (the default) ensure_ascii=True argument, so it outputs Unicode characters as \uXXXX sequences before writing to file. (This is what is used when doing -o somefile.json)
Setting ensure_ascii=False in the exporter will output Unicode strings, which will end up as UTF-8 encoded on file. See custom exporter code at the bottom here.
To illustrate, let's read your input JSON string back into some data to work on:
>>> import json
>>> test = r'''{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'''
>>> json.loads(test)
{u'price': u'13,000', u'name': u'\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc'}
The input with \uXXXX sequences is valid JSON for Python (as it should be), and loads() produces a valid Python dict.
Now let's serialize to JSON again:
>>> # dumping the dict back to JSON, with default ensure_ascii=True
>>> json.dumps(json.loads(test))
'{"price": "13,000", "name": "\\u58c1\\u6bb4\\u308a\\u4ee3\\u884c\\u69d8\\u5c02\\u7528\\u2605 \\u30c6\\u30ec\\u30d3\\u672c\\u4f53 20v\\u578b \\u767d \\u9001\\u6599\\u8fbc"}'
>>>
And now with ensure_ascii=False
>>> # now dumping with ensure_ascii=False, you get a Unicode string
>>> json.dumps(json.loads(test), ensure_ascii=False)
u'{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'
>>>
Let's print to see the difference:
>>> print json.dumps(json.loads(test))
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}
>>> print json.dumps(json.loads(test), ensure_ascii=False)
{"price": "13,000", "name": "壁殴り代行様専用★ テレビ本体 20v型 白 送料込"}
If you want to write JSON items as UTF-8, you can do it like this:
1. Define a custom item exporter, e.g. in an exporters.py file in your project:
$ cat myproject/exporters.py
from scrapy.exporters import JsonItemExporter

class Utf8JsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)
2. Replace the default JSON item exporter in your settings.py:
FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}
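Outside Scrapy, the same two ingredients (ensure_ascii=False plus a UTF-8 file handle) apply to plain json as well. A sketch, with a hypothetical output file name:

```python
import json

item = {'price': '13,000', 'name': '\u58c1\u6bb4\u308a\u4ee3\u884c'}

# ensure_ascii=False makes dump() emit the characters themselves; the file
# handle's encoding then determines the bytes actually written (UTF-8 here).
with open('items.json', 'w', encoding='utf-8') as f:
    json.dump(item, f, ensure_ascii=False)

with open('items.json', encoding='utf-8') as f:
    print(f.read())  # {"price": "13,000", "name": "壁殴り代行"}
```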
Use the codecs module for text -> text decoding (In Python 2 it's not strictly necessary, but in Python 3 str doesn't have a decode method, because the methods are for str -> bytes and back, not str -> str). Using the unicode_escape codec for decoding will get you the correct data back:
import codecs
somestr = codecs.decode(strwithescapes, 'unicode-escape')
So to fix the names you're getting (extract() returns a list, so take the first match with extract_first()):
item['name'] = codecs.decode(sel.xpath('a/div/h3/text()').extract_first(), 'unicode-escape')
If the problem is in JSON you're producing, you'd want to just make sure the json module isn't forcing strings to be ASCII with character encodings; it does so by default because not all JSON parsers can handle true Unicode characters (they often assume data is sent as ASCII bytes with escapes). So wherever you call json.dump/json.dumps (or create a json.JSONEncoder), make sure to explicitly pass ensure_ascii=False.
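A short illustration of the ensure_ascii flag:

```python
import json

data = {'name': '\u4f60\u597d'}

print(json.dumps(data))                      # {"name": "\u4f60\u597d"}  (escaped)
print(json.dumps(data, ensure_ascii=False))  # {"name": "你好"}
```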