JSON printed to console shows wrong encoding

I am trying to read Cyrillic characters from some JSON file and then output it to console using Python 3.4.3 on Windows. Normal print('Russian smth буквы') works as intended.
But when I print JSON contents it seems to print in Windows-1251 - "СЂСѓСЃСЃРєРёРµ Р±СѓРєРІС‹" (though my console, my JSON file and my .py (with coding comment) are in UTF-8).
I've tried re-encoding it to Win-1251 and setting console to Win-1251, but still no luck.
My JSON (Encoded in UTF-8):
{
    "русские буквы": "что-то ещё на русском",
    "english letters": "и что-то на великом"
}
My code to load dictionary:
def load_dictionary():
    global Dictionary, isFatal
    try:
        with open(DictionaryName) as f:
            Dictionary = json.load(f)
    except Exception as e:
        logging.critical('Error loading dictionary: ' + str(e))
        isFatal = True
        return
    logging.info('Dictionary was loaded successfully')
I am trying to output it in 2 ways (both show the same gibberish):
print(helper.Dictionary.get('rly'))
print(helper.Dictionary)
An interesting add-on: I've added the whole Russian alphabet to my JSON file and it seems to get stuck at the "С с" letter (Error loading dictionary: 'charmap' codec can't decode byte 0x81 in position X: character maps to <undefined>). If I remove this one letter, no exception is raised, but the problem above remains.

"But when I print JSON contents …"
If you print the file using the type command while code page 1251 is active (CHCP 1251), you get the mojibake СЂСѓСЃСЃРєРёРµ … instead of русские ….
Try type with CHCP 65001 (i.e. UTF-8) active.
Follow nauer's advice and use open(DictionaryName, encoding="utf8").
Example (39755662.json is saved with UTF-8 encoding):
==> chcp 866
Active code page: 866
==> type 39755662.json
{
"╤А╤Г╤Б╤Б╨║╨╕╨╡ ╨▒╤Г╨║╨▓╤Л": "╤З╤В╨╛-╤В╨╛ ╨╡╤Й╤С ╨╜╨░ ╤А╤Г╤Б╤Б╨║╨╛╨╝",
"rly": "╤А╤Г╤Б╤Б╨║╨╕╨╣"
}
==> chcp 1251
Active code page: 1251
==> type 39755662.json
{
"СЂСѓСЃСЃРєРёРµ Р±СѓРєРІС‹": "С‡С‚Рѕ-С‚Рѕ РµС‰С‘ РЅР° СЂСѓСЃСЃРєРѕРј",
"rly": "СЂСѓСЃСЃРєРёР№"
}
==> chcp 65001
Active code page: 65001
==> type 39755662.json
{
"русские буквы": "что-то ещё на русском",
"rly": "русский"
}
==>
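The same effect can be reproduced inside Python, without involving the console at all. The sketch below is not the asker's code: it writes a small UTF-8 JSON file to a temporary directory (the file name sample.json and the sample data are made up for illustration), then reads it once as UTF-8 and once, deliberately wrongly, as Windows-1251.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample data, modelled on the JSON in the question.
data = {"русские буквы": "что-то ещё на русском"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"  # made-up file name
    # Write the file as UTF-8, keeping the Cyrillic text unescaped.
    path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

    # Correct: name the encoding explicitly instead of relying on the
    # platform default, which on Windows is usually an ANSI code page.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

    # Wrong on purpose: decode the same bytes as Windows-1251.
    with open(path, encoding="cp1251") as f:
        garbled = json.load(f)

garbled_key = next(iter(garbled))
# ascii() keeps the demo printable on any console.
print(ascii(next(iter(loaded))))
print(ascii(garbled_key))
```

The garbled key is the same kind of text that type shows under code page 1251. Encoding it back to cp1251 and decoding as UTF-8 recovers the original, which suggests the bytes on disk were fine and only the decoding step was wrong.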

Related

Convert base64 encoded google service account key to JSON file using Python

Hello, I'm trying to convert a Google service account JSON key (contained in a base64-encoded field named privateKeyData in the file foo.json; more context here) into the actual JSON file. I need that format because Ansible only accepts that.
The foo.json file is obtained using this Google Python API method.
What I'm trying to do (though I am using Python) is also described in this thread, which by the way does not work for me (tried on OS X and Linux).
#!/usr/bin/env python3
import json
import base64
with open('/tmp/foo.json', 'r') as f:
    ymldict = json.load(f)
b64encodedCreds = ymldict['privateKeyData']
b64decodedBytes = base64.b64decode(b64encodedCreds, validate=True)
outputStr = b64decodedBytes
print(outputStr)
#issue
outputStr = b64decodedBytes.decode('UTF-8')
print(outputStr)
yields
./test.py
b'0\x82\t\xab\x02\x01\x030\x82\td\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\tU\x04\x82\tQ0\x82\tM0\x82\x05q\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\x05b\x04\x82\x05^0\x82\x05Z0\x82\x05V\x06\x0b*\x86H\x86\xf7\r\x01\x0c\n\x01\x02\xa0\x82\x#TRUNCATING HERE
Traceback (most recent call last):
  File "./test.py", line 17, in <module>
    outputStr = b64decodedBytes.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1: invalid start byte
I think I have run out of ideas, and I have now spent more than a day on this :(
What am I doing wrong?

Your base64 decoding logic looks fine to me. The problem you are facing is probably due to a character encoding mismatch. The response body you received after calling create (your foo.json file) is probably not encoded with UTF-8. Check out the response header's Content-Type field. It should look something like this:
Content-Type: text/javascript; charset=Shift_JIS
Try decoding your base64-decoded bytes with the encoding named in the Content-Type header:
b64decodedBytes.decode('Shift_JIS')
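One thing worth checking before picking a codec: the decoded bytes printed in the question start with 0x30 0x82, which is how a DER-encoded ASN.1 structure begins, so the payload may be a binary key file rather than text in any encoding. The sketch below is an illustration under that assumption, not documented API behaviour. The helper name save_decoded_key and the output names key.json and key.bin are invented. It writes the decoded bytes in binary mode and treats them as JSON only when they really parse as UTF-8 JSON.

```python
import base64
import json
from pathlib import Path


def save_decoded_key(b64_text, out_dir):
    """Decode a base64 field and save it; return the path written.

    Hypothetical helper: content that parses as UTF-8 JSON is saved as
    key.json, anything else is saved unchanged as key.bin.
    """
    raw = base64.b64decode(b64_text, validate=True)
    out_dir = Path(out_dir)
    try:
        json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        # Not UTF-8 JSON: keep the bytes exactly as they are.
        target = out_dir / "key.bin"
    else:
        target = out_dir / "key.json"
    # Binary mode, so no text codec is involved when writing.
    target.write_bytes(raw)
    return target
```

If the result comes out as key.bin, forcing a text codec such as Shift_JIS onto it would at best yield garbage. In that case it is probably worth checking which key type was requested when the key was created.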

UnicodeDecodeError while decoding a json with python3.5

{
    "Sponge": {
        "orientation": "Straight",
        "gender": "Woman",
        "age": 23,
        "rel_status": "Single",
        "summary": " Bonjour! Je m'appelle Jacqueline!, Enjoy cooking, reading and traveling!, Love animals, languages and nature :-) ",
        "location": "Kao-hsiung-k’a",
        "id": "6693397339871"
    }
}
I have the JSON above and I'm trying to read it, but it contains some special characters, for example the "’" in location. This raises an error when I try to read the JSON:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 27-28: character maps to <undefined>
I'm using Python 3.5 and I have written the following code:
with open('test.json') as json_data:
    users = json.load(json_data)
print(users)

Use the codecs module to open the file for a quick fix.
import codecs
import json
with codecs.open('test.json', 'r', 'utf-8') as json_data:
    users = json.load(json_data)
print(users)
An answer to this question can also be found easily on the web. (Hint: that's how I learned about this module.)

OK, I found my solution. It's a problem with the Windows terminal: you have to type this in the terminal first: chcp 65001
After that, launch your program!
More explanation here: Why doesn't Python recognize my utf-8 encoded source file?
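If changing the console code page is not an option, another approach is to read the file as UTF-8 and escape whatever the output encoding cannot represent. This is a minimal sketch, assuming the file really is UTF-8. printable_in is a made-up helper name, the raw bytes stand in for the contents of test.json, and 'ascii' stands in for a limited console encoding.

```python
import json


def printable_in(text, encoding):
    """Return text with characters the encoding lacks shown as escapes."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Hypothetical stand-in for the bytes of test.json, encoded as UTF-8.
raw = '{"location": "Kao-hsiung-k\u2019a"}'.encode("utf-8")
users = json.loads(raw.decode("utf-8"))

# U+2019 has no ASCII equivalent, so it is printed as an escape
# sequence instead of raising UnicodeEncodeError.
print(printable_in(users["location"], "ascii"))
```

The escaped output is less pretty than the real character, but the program no longer depends on how the terminal is configured.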

Getting "UnicodeDecodeError at /admin/login/" when following the Django documentation tutorial

I have a problem with my Django admin site, 127.0.0.1:8000/admin/. Beforehand, I started the server by typing python manage.py runserver. I am now at part 2 of the Django docs tutorial. I followed it, and when I open the admin site I get this error:
UnicodeDecodeError at /admin/login/
'utf-8' codec can't decode byte 0xee in position 394374: invalid continuation byte
Request Method: GET
Request URL: http://127.0.0.1:8000/admin/login/?next=/admin/
Django Version: 1.8.3
Exception Type: UnicodeDecodeError
Exception Value:
'utf-8' codec can't decode byte 0xee in position 394374: invalid continuation byte
Exception Location: c:\Python34\lib\codecs.py in decode, line 319
Python Executable: c:\Python34\python.exe
Python Version: 3.4.3
Python Path:
['C:\\Python34\\Scripts\\mypoll',
'C:\\WINDOWS\\system32\\python34.zip',
'c:\\Python34\\DLLs',
'c:\\Python34\\lib',
'c:\\Python34',
'c:\\Python34\\lib\\site-packages']
Server time: Sat, 1 Aug 2015 22:19:09 +0700
The Traceback :
Traceback:
File "c:\Python34\lib\site-packages\django\core\handlers\base.py" in get_response
  164. response = response.render()
File "c:\Python34\lib\site-packages\django\template\response.py" in render
  158. self.content = self.rendered_content
File "c:\Python34\lib\site-packages\django\template\response.py" in rendered_content
  133. template = self._resolve_template(self.template_name)
File "c:\Python34\lib\site-packages\django\template\response.py" in _resolve_template
  88. new_template = self.resolve_template(template)
File "c:\Python34\lib\site-packages\django\template\response.py" in resolve_template
  80. return loader.get_template(template, using=self.using)
File "c:\Python34\lib\site-packages\django\template\loader.py" in get_template
  35. return engine.get_template(template_name, dirs)
File "c:\Python34\lib\site-packages\django\template\backends\django.py" in get_template
  30. return Template(self.engine.get_template(template_name, dirs))
File "c:\Python34\lib\site-packages\django\template\engine.py" in get_template
  167. template, origin = self.find_template(template_name, dirs)
File "c:\Python34\lib\site-packages\django\template\engine.py" in find_template
  141. source, display_name = loader(name, dirs)
File "c:\Python34\lib\site-packages\django\template\loaders\base.py" in __call__
  13. return self.load_template(template_name, template_dirs)
File "c:\Python34\lib\site-packages\django\template\loaders\base.py" in load_template
  17. template_name, template_dirs)
File "c:\Python34\lib\site-packages\django\template\loaders\app_directories.py" in load_template_source
  39. return fp.read(), filepath
File "c:\Python34\lib\codecs.py" in decode
  319. (result, consumed) = self._buffer_decode(data, self.errors, final)
Exception Type: UnicodeDecodeError at /admin/login/
Exception Value: 'utf-8' codec can't decode byte 0xee in position 394374: invalid continuation byte
At the bottom of the page, it states:
You're seeing this error because you have DEBUG = True in your Django settings file. Change that to False, and Django will display a standard page generated by the handler for this status code.
Then I changed DEBUG = True to False. The server then stops, because the command prompt states: CommandError: You must set settings.ALLOWED_HOSTS if DEBUG is False.
After that, I can't do anything. What should I do? Is there any solution? I have searched Google and Stack Overflow but I can't find anything similar to my problem. I hope someone can help me with this.
Thanks

Breakdown:
The exception is raised by Python itself. It happens when attempting to decode a raw byte stream into a string. If you are new to Python, you should know that Python 3 makes a clear distinction between strings, aka str (which contain characters), and raw data, aka bytes (which just contain bytes, potentially binary data).
The exception raised here means that, for some reason, Python was told to decode some bytes into text using the UTF-8 encoding, yet the data is not valid UTF-8-encoded text.
Assuming you come from a Western country, my bet is that the text is encoded in ANSI (Windows-1252) or ISO-8859-1 and has an “î” in it. That character is encoded as the single byte 0xEE in those encodings, but should be encoded as 0xC3 0xAE in UTF-8.
There are several reasons this could happen. Here, from the traceback, it happened while rendering a template. More specifically, while rendering a template from an app's directory. So you have in one of your apps a template that's not properly encoded.
How did it happen? Well, I see you are running a Windows box. The Windows environment is somewhat of a mess when it comes to text encoding. Every piece of software comes with its own opinion of what to use as the default (when it can be changed at all). For instance, Notepad still encodes in ANSI by default, which in Western Europe means Windows-1252, a close relative of ISO-8859-1.
It is very likely that one of the programs you use for editing your templates is saving your files in some other encoding. You have two options from here:
Check the options of your tools and make sure they are all configured to use UTF-8 encoding.
Or configure Django to use the same encoding as your tools. You would do that by adding a FILE_CHARSET = 'iso-8859-1' line to your settings, with whatever encoding your tools use.
In any case, you must be sure that all of your tools agree on the encoding used, or you will either get other decoding errors, or some characters will get mangled (and show up as strange Ã® or ? symbols).
Not useful for Django tutorial, but worth reading at some point in your python life: Unicode handling in Python
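To find the offending template, one option is to scan the template directories for files that are not valid UTF-8. The following is a rough sketch and not part of Django: find_non_utf8_files is a made-up helper, and the *.html pattern is an assumption about how the templates are named.

```python
from pathlib import Path


def find_non_utf8_files(root, pattern="*.html"):
    """Return (path, byte_offset) for each file that is not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte,
            # comparable to the "position" in the error message.
            bad.append((path, exc.start))
    return bad


# The same character: one byte in ISO-8859-1, two bytes in UTF-8.
print("î".encode("iso-8859-1"), "î".encode("utf-8"))
```

Pointing the helper at the project directory (and, if needed, at site-packages) should list candidate files to re-save as UTF-8.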

Python messing up the scandinavian characters (Ö -> Ã)

I know everyone's fed up with encoding questions, but I can't figure this out.
I'm getting data from an XML file (API) in Python. Everything is fine, but when I print the values that contain Scandinavian characters, such as Ö or Ä, they get messed up:
Ö -> Ã
Ä -> Ã¤
The XML-document is encoded in UTF-8.
Here's my code. Sorry for the inconvenience.
# Get the data
from urllib2 import urlopen
ur = urlopen("http://www.leffatykki.com/xml/leffat")
data = ur.read()
# Replace ampersands (triggers an error)
data = data.replace('&', '&amp;')
# Loop XML
from lxml import etree
from cStringIO import StringIO
def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
def process_element(elem):
    try:
        name = elem.xpath('name/text( )')[0]
        year = elem.xpath('year/text( )')[0]
        print name
    except IndexError:
        temp = '...'
context = etree.iterparse(StringIO(data), tag='movie')
fast_iter(context, process_element)

In your call to "etree.iterparse", try filling out the encoding value:
context = etree.iterparse(StringIO(data), tag='movie', encoding="utf-8")
From the etree.iterparse documentation:
"""
| Other keyword arguments:
| - encoding: override the document encoding
| - schema: an XMLSchema to validate against
"""
Better yet, forget that:
I've downloaded your file and played around, and it seems to be working, at least for the first movie. Maybe you have badly encoded characters in the file itself? It is either that, or everything is just fine and the mess is only in your print statement.
Try using "print name.encode("utf-8")", or the correct encoding of your terminal, instead of letting Python try to guess it.
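The Ö -> Ã pattern itself is what UTF-8 bytes look like when they are interpreted as ISO-8859-1 (or a similar single-byte encoding). The sketch below reproduces that in isolation. It is written for Python 3, unlike the Python 2 code in the question, and the two function names are made up for illustration.

```python
def utf8_read_as_latin1(text):
    """Simulate UTF-8 bytes being interpreted as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(mojibake):
    """Undo the mistake above by reversing the two steps."""
    return mojibake.encode("latin-1").decode("utf-8")


for ch in "ÖÄä":
    broken = utf8_read_as_latin1(ch)
    # ascii() makes the invisible second character of each pair visible.
    print(ascii(ch), "->", ascii(broken), "->", ascii(repair(broken)))
```

For Ö and Ä the second character of the pair is a non-printing control character, which would explain why only Ã shows up on screen.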

How can I understand this python error message?

Hi can you help me decode this message and what to do:
main.py", line 1278, in post
message.body = "%s %s/%s/%s" % (msg, host, ad.key().id(), slugify(ad.title.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Thanks
UPDATE: Having removed the encode call, it appears to work:
class Recommend(webapp.RequestHandler):
    def post(self, key):
        ad= db.get(db.Key(key))
        email = self.request.POST['tip_email']
        host = os.environ.get("HTTP_HOST", os.environ["SERVER_NAME"])
        senderemail = users.get_current_user().email() if users.get_current_user() else 'info#monton.cl' if host.endswith('.cl') else 'info#monton.com.mx' if host.endswith('.mx') else 'info#montao.com.br' if host.endswith('.br') else 'admin#koolbusiness.com'
        message = mail.EmailMessage(sender=senderemail, subject="%s recommends %s" % (self.request.POST['tip_name'], ad.title) )
        message.to = email
        message.body = "%s %s/%s/%s" % (self.request.POST['tip_msg'],host,ad.key().id(),slugify(ad.title))
        message.send()
        matched_images=ad.matched_images
        count = matched_images.count()
        if ad.text:
            p = re.compile(r'(www[^ ]*|http://[^ ]*)')
            text = p.sub(r'\1',ad.text.replace('http://',''))
        else:
            text = None
        self.response.out.write("Message sent<br>")
        path = os.path.join(os.path.dirname(__file__), 'market', 'market_ad_detail.html')
        self.response.out.write(template.render(path, {'user_url':users.create_logout_url(self.request.uri) if users.get_current_user() else users.create_login_url(self.request.uri),
            'user':users.get_current_user(), 'ad.user':ad.user,'count':count, 'ad':ad, 'matched_images': matched_images,}))

The problem here is that your underlying model (message.body) only wants ASCII text, but you're trying to give it a Unicode string with non-ASCII characters.
Since an ASCII string is what's needed here, you can have Python substitute the '?' character wherever the string contains a non-ASCII character.
"UNICODE STRING".encode('ascii','replace').decode('ascii')
So, for your example above:
message.body = "%s %s/%s/%s" % \
    (msg.encode('ascii','replace').decode('ascii'),
     host.encode('ascii','replace').decode('ascii'),
     ad.key().id(),  # a numeric id, nothing to encode
     slugify(ad.title).encode('ascii','replace').decode('ascii'))
Or just encode/decode the variable that has the non-ASCII character.
But this isn't an optimal solution. The best idea is to make message.body a unicode string. If that isn't feasible (I'm not familiar with GAE), you can use this to at least avoid the errors.
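For reference, here is what the 'replace' error handler does, as a small Python 3 sketch (the question's code is Python 2, where str and unicode behave differently). to_ascii and the sample title are made up for illustration.

```python
def to_ascii(text):
    """Replace every non-ASCII character with '?'."""
    return text.encode("ascii", "replace").decode("ascii")


title = "anuncio de ocasión"  # hypothetical ad title
print(to_ascii(title))

# Decoding non-ASCII bytes with the ASCII codec fails with the same
# reason as the traceback in the question.
try:
    "ó".encode("utf-8").decode("ascii")
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(reason)
```

Note that the replacement is lossy: once a character has become '?', the original cannot be recovered from the result.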

You've got a non-ASCII character in a place where it isn't expected. Most often I find this error is caused by MS Word-style slanted quotes.

One of these fields has some characters that cannot be encoded. If you switch to Python 3 (it has better Unicode support), or change the encoding of the entire script, the problem should stop. The best way to change the encoding in 2.x is the encoding comment line. http://evanjones.ca/python-utf8.html gives more of an explanation of using Python with UTF-8. The best suggestion is to add # -*- coding: utf-8 -*- to the top of your script, and handle strings like this:
s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )

I had a similar problem when using Django-nonrel and Google App Engine.
The problem was with the folder containing the application. This probably isn't the problem described in this question, but maybe it helps someone avoid wasting time like I did.
First try moving your application folder, maybe to /home/, and run it again. If that doesn't work, try something else.
