Decode unicode in json - python

I have this code:
# -*- coding: utf8 -*-
import re, json
from pprint import pprint
jsonStr = '{"type":"\u041f\u0435\u043d\u0438","values":{"v1":"222"}}'
data = json.loads(jsonStr)
pprint(data)
Output:
{u'type': u'\u041f\u0435\u043d\u0438', u'values': {u'v1': u'222'}}
How do I get the normal data in 'type'?
Thanks to all. This gives readable output in the console:
jsonStr = '{"type":"\u041f\u0435\u043d\u0438","values":{"v1":"222"}}'
data = json.loads(jsonStr.decode("utf-8"))
print json.dumps(data, sort_keys=True, indent=2).decode("unicode_escape")
Output:
{
  "type": "Пени",
  "values": {
    "v1": "222"
  }
}
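On Python 3 the same readable output should not need a decode step. A minimal sketch, assuming Python 3 and using the ensure_ascii=False option of json.dumps (a swapped-in technique, not the unicode_escape round trip used above):

```python
import json

# Raw string so the \uXXXX sequences stay JSON escapes, as in the question.
json_str = r'{"type":"\u041f\u0435\u043d\u0438","values":{"v1":"222"}}'
data = json.loads(json_str)

# ensure_ascii=False writes non-ASCII characters as themselves instead of \uXXXX.
pretty = json.dumps(data, sort_keys=True, indent=2, ensure_ascii=False)
print(pretty)
```

Unlike decoding with unicode_escape, this leaves other backslash escapes in the JSON text (such as \n inside a value) untouched, so the result is still valid JSON.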

You have normal data:
>>> import json
>>> jsonStr = '{"type":"\u041f\u0435\u043d\u0438","values":{"v1":"222"}}'
>>> data = json.loads(jsonStr)
>>> print data['type']
Пени
Python containers such as dictionaries and lists show their contents using the repr() function; you are looking at debugger-friendly output, which is ASCII-safe. To keep it ASCII-safe, any non-ASCII and non-printable codepoints are shown as escape sequences, so you can copy that output into a Python interpreter and re-create the value without having to worry about codecs.
Just use the data as you normally would. I printed the string itself, so Python encoded it to my terminal's codec, and my terminal decoded it and showed the Russian text (Cyrillic characters).
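In Python 3 the repr() of a str shows non-ASCII letters directly, so the escaped view above is mostly a Python 2 matter. The built-in ascii() reproduces it; a small sketch, assuming Python 3:

```python
import json

# Raw string so the \uXXXX sequences stay JSON escapes, as in the question.
json_str = r'{"type":"\u041f\u0435\u043d\u0438","values":{"v1":"222"}}'
data = json.loads(json_str)

value = data["type"]
escaped_view = ascii(value)  # ASCII-safe debugging form, like Python 2's repr()
print(escaped_view)          # the escape sequences, in quotes
print(value)                 # the text itself
print(len(value))            # counts code points, not escape characters
```

Either way the value in memory is the same four-character string; only the way it is displayed differs.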

Related

Python - Inserting special characters into JSON string to print them on bash terminal

I am constructing a JSON dictionary with some special characters to print words in bold and various colors on the bash terminal, like this:
# 'field' is a part of a bigger JSON document 'data'
field["value"] = '\033[1m' + string_to_print_in_bold + '\033[0m'
Later on, I'm calling dumps to create and print out my JSON:
print(json.dumps(data, indent=4, ensure_ascii=False))
However, on the terminal I see this:
"value": "\u001b[1mstring_to_print_in_bold\u001b[0m"
instead of
"value": "string_to_print_in_bold"
Note that ensure_ascii=False!
What am I missing?
From a design perspective, you should separate formatting from your data.
If you only want to pretty-print json with color, pygments provides a terminal text formatter to prettify your json output:
import json
from pygments import highlight
from pygments.lexers import JsonLexer
from pygments.formatters import TerminalFormatter
data = {"value": "myvalue"}
json_str = json.dumps(data, indent=4, sort_keys=True)
print(highlight(json_str, JsonLexer(), TerminalFormatter()))
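As for why \u001b shows up even with ensure_ascii=False: JSON does not allow raw control characters inside strings, so json.dumps escapes them regardless of that flag. A minimal sketch of the separation suggested above, assuming the styling is applied to the finished text rather than stored in the data (the variable names are illustrative):

```python
import json

BOLD, RESET = "\033[1m", "\033[0m"  # ANSI escape sequences from the question

# JSON forbids raw control characters inside strings, so ESC is always escaped.
escaped = json.dumps({"value": BOLD + "x" + RESET}, ensure_ascii=False)
print(escaped)

# Keep the data plain, serialize it, then style the finished text for display.
data = {"value": "string_to_print_in_bold"}
plain = json.dumps(data, indent=4, ensure_ascii=False)
target = json.dumps(data["value"], ensure_ascii=False)  # the quoted JSON form
styled = plain.replace(target, BOLD + target + RESET)
print(styled)
```

The styled text is meant for the terminal only; plain remains valid JSON that other programs can parse.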

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 6233: ordinal not in range(128)

I'm working on a new project but I can't fix the error in the title.
Here's the code:
#!/usr/bin/env python3.5.2
import urllib.request, urllib.parse
def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code)
    print(info)
start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
The error occurred because of .encode which works on a unicode object. So we need to convert the byte string to unicode string using
.decode('unicode_escape')
So the code will be:
#!/usr/bin/env python3.5.2
import urllib.request, urllib.parse
def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code.decode('unicode_escape'))
    print(info)
start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
Try this
source_code = urllib.request.urlopen(url).read().decode('utf-8')
The error message is self-explanatory: there is a byte 0xf0 in an input string that is expected to be an ASCII string.
You should have given the exact error message and the line on which it happened, but I can guess that it happened on info = urllib.parse.parse_qs(source_code), because parse_qs expects either a unicode string or an ASCII byte string.
The first question is why you call parse_qs on data coming from YouTube, because the doc for the Python Standard Library says:
Parse a query string given as a string argument (data of type application/x-www-form-urlencoded). Data are returned as a dictionary. The dictionary keys are the unique query variable names and the values are lists of values for each name.
So you are going to split this on the = and & characters and interpret it as a query string of the form key1=value11&key2=value2&key1=value12, which gives {'key1': ['value11', 'value12'], 'key2': ['value2']}.
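To make that concrete, here is a short sketch of what parse_qs is designed for. It parses the example string from the paragraph above and then the query part of the URL from the question, obtained with urlsplit:

```python
from urllib.parse import parse_qs, urlsplit

# The example from the text: repeated keys are collected into one list.
example = parse_qs("key1=value11&key2=value2&key1=value12")
print(example)

# A query string is the part of a URL after "?", not the body of the page.
url = "https://www.youtube.com/watch?v=YfRLJQlpMNw"
params = parse_qs(urlsplit(url).query)
print(params)
```

No download is involved here; the function only looks at the text it is given.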
If you know why you want that, you should first decode the byte string into a unicode string, using the proper encoding, or if unsure Latin1 which is able to accept any byte:
def start(url):
    source_code = urllib.request.urlopen(url).read().decode('latin1')
    info = urllib.parse.parse_qs(source_code)
    print(info)
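The difference between the codecs can be seen without any network access. The sketch below uses a made-up byte string as a stand-in for the downloaded page; it reproduces the ASCII failure on a 0xf0 byte and compares UTF-8 with Latin-1:

```python
# Stand-in for downloaded bytes: "watch " followed by a 4-byte UTF-8 sequence.
raw = "watch \U0001F600".encode("utf-8")

try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
print(ascii_error)

utf8_text = raw.decode("utf-8")     # correct when the page really is UTF-8
latin1_text = raw.decode("latin1")  # never fails, but yields one character per byte
print(len(raw), len(utf8_text), len(latin1_text))
```

Latin-1 is lossless in the sense that encoding the result back to Latin-1 returns the original bytes, but multi-byte characters come out as several unrelated characters, so the proper encoding is preferable when it is known.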
This code is rather weird indeed. You are using query parser to parse contents of a web page.
So instead of using parse_qs you should be using something like this.

Convert JSON string of unicode characters to dict Python 3

I am trying to count the number of Unicode characters in the JSON data. I am using requests to get the data from the feed.
import requests
r = requests.get('https://venmo.com/api/v5/public?since=1438578858&until=1438578958')
j_data = r.text
Now, I need to convert the j_data into a dictionary to get the message items alone. If I just use json.loads(j_data), I get UnicodeEncodeError: 'charmap' codec can't encode character.
Therefore, I am encoding the j_data and then trying to convert to dict using loads. I am getting this error
TypeError: the JSON object must be str, not 'bytes'
How to approach this problem?
Code:
import requests
import json
r = requests.get('https://venmo.com/api/v5/public?since=1438578858&until=1438578958')
j_data = r.text
encoded = j_data.encode()
b = json.loads(encoded)
print(b)
It seems to work fine in Python 2.7.6
import requests
import json
req = requests.get('https://venmo.com/api/v5/public?since=1438578858&until=1438578958')
contentJ = json.loads(req.content)
and I get a dict named contentJ
As I see it, you are trying to encode something that does not need to be encoded. After removing the line with the encoding, everything works fine in Python 3.4.
import requests
import json
r = requests.get('https://venmo.com/api/v5/public?since=1438578858&until=1438578958')
j_data = r.text
b = json.loads(j_data)
print(type(b))
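For the counting part of the question, a sketch that runs without the network. The payload and its "data"/"message" layout are hypothetical stand-ins for r.text, and "Unicode characters" is assumed to mean code points outside ASCII:

```python
import json

# Hypothetical payload standing in for r.text; the real feed's layout may differ.
j_data = r'{"data": [{"message": "caf\u00e9 \ud83c\udf55"}, {"message": "rent"}]}'

feed = json.loads(j_data)  # str in, dict out; no encode step needed
messages = [item["message"] for item in feed["data"]]

# Count the code points outside ASCII across all messages.
non_ascii = sum(1 for text in messages for ch in text if ord(ch) > 127)
print(type(feed).__name__, len(messages), non_ascii)
```

Since Python 3.6 json.loads also accepts bytes, so the TypeError from the question is specific to older Python 3 versions.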
To get json, use r.json():
import requests # $ pip install requests
r = requests.get(url)
data = r.json()
Your error, UnicodeEncodeError: 'charmap' codec can't encode character, is unrelated to the JSON parsing. Most likely it happens when you are trying to print Unicode to the Windows console. Configure a console font that can display the desired characters and install the win-unicode-console package:
T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script_that_prints_unicode.py
See What's the deal with Python 3.4, Unicode, different languages and Windows?
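The 'charmap' error can be reproduced on any platform by encoding by hand. A sketch, assuming cp1252 as an example of a legacy Windows code page (the actual console code page varies by system):

```python
text = "pizza \U0001F355"

# cp1252 has no mapping for this character, which is what the console hits.
try:
    text.encode("cp1252")
    error = None
except UnicodeEncodeError as exc:
    error = exc
print(error)

# For logging, unencodable characters can be replaced by escapes instead of failing.
safe = text.encode("cp1252", errors="backslashreplace").decode("cp1252")
print(safe)
```

This only avoids the exception; to see the real characters the console still needs a suitable font and encoding as described above.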

Convert unicode with utf-8 string as content to str

I'm using pyquery to parse a page:
dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()
but what I get in content is a unicode string with utf-8 encoded content:
u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'
How can I convert it to str without losing the content?
To make it clear:
I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
If you have a unicode value with UTF-8 bytes, encode to Latin-1 to preserve the 'bytes':
content = content.encode('latin1')
because the Unicode codepoints U+0000 to U+00FF all map one-on-one with the latin-1 encoding; this encoding thus interprets your data as literal bytes.
For your example this gives me:
>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
PyQuery uses either requests or urllib to retrieve the HTML, and in the case of requests, uses the .text attribute of the response. This auto-decodes the response data based on the encoding set in a Content-Type header alone, or if that information is not available, uses latin-1 for this (for text responses, but HTML is a text response). You can override this by passing in an encoding argument:
dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')
at which point you'd not have to re-encode at all.
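The same Latin-1 round trip works in Python 3, where the intermediate value is a bytes object. A short sketch using the string from the question:

```python
# Text that was decoded as Latin-1 although the bytes were UTF-8 (from the question).
content = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

raw = content.encode('latin1')  # recovers the original bytes one-to-one
fixed = raw.decode('utf8')      # decodes them with the codec they were written in
print(len(content), len(raw), len(fixed))
```

Fifteen mis-decoded characters become fifteen bytes and then five Chinese characters.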

How to use pycurl when url contain non-English language?

This is the example from pycurl's SourceForge page. What should we do if the URL contains non-English text such as Chinese, given that pycurl does not support unicode?
import pycurl
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.python.org/")
c.setopt(pycurl.HTTPHEADER, ["Accept:"])
import StringIO
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
print b.getvalue()
Here's a script that demonstrates three separate issues:
non-ascii characters in Python source code
non-ascii characters in the url
non-ascii characters in the html content
# -*- coding: utf-8 -*-
import urllib
from StringIO import StringIO
import pycurl
title = u"UNIX时间" # 1
url = "https://zh.wikipedia.org/wiki/" + urllib.quote(title.encode('utf-8')) # 2
c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPHEADER, ["Accept:"])
b = StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
data = b.getvalue() # bytes
print len(data), repr(data[:200])
html_page_charset = "utf-8" # 3
html_text = data.decode(html_page_charset)
print html_text[:200] # 4
Note: all the utf-8 occurrences in the code are completely independent of each other.
1. Unicode literals use whatever character encoding you defined at the top of the file. Make sure your text editor respects that setting.
2. The path in the URL should be encoded using utf-8 before it is percent-encoded (urlencoded).
3. There are several ways to find out an HTML page's charset. See Character encodings in HTML. Some libraries, such as requests mentioned by @Oz123, do it automatically:
# -*- coding: utf-8 -*-
import requests
r = requests.get(u"https://zh.wikipedia.org/wiki/UNIX时间")
print len(r.content), repr(r.content[:200]) # bytes
print r.encoding
print r.text[:200] # Unicode
4. To print Unicode to the console you could use the PYTHONIOENCODING environment variable to set a character encoding that your terminal understands.
See also The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Python-specific Pragmatic Unicode.
Try urllib.quote, which will replace non-ASCII characters by an escape sequence:
import urllib
url_to_fetch = urllib.quote(unicode_url)
Edit: only the path should be quoted. You will have to split the complete URL with urlparse, quote the path, and then use urlunparse to obtain the final URL to fetch.
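The split, quote, and rejoin steps can be sketched as follows, assuming Python 3, where these functions live in urllib.parse. The helper name quote_url_path is made up for illustration, and urlsplit/urlunsplit are used in place of urlparse/urlunparse:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_url_path(url):
    """Percent-encode only the path of `url` (hypothetical helper name)."""
    parts = urlsplit(url)
    # quote() encodes the text as UTF-8 first and leaves "/" untouched by default.
    return urlunsplit(parts._replace(path=quote(parts.path)))

ascii_url = quote_url_path("https://zh.wikipedia.org/wiki/UNIX时间")
print(ascii_url)
```

The result contains only ASCII characters. Two caveats: a path that is already percent-encoded would be encoded a second time, and a non-ASCII host name is not handled by this sketch.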
Just encode your URL as UTF-8 and everything should be fine. From the docs [1]:
Under Python 3, the bytes type holds arbitrary encoded byte strings. PycURL will accept bytes values for all options where libcurl specifies a “string” argument:
>>> import pycurl
>>> c = pycurl.Curl()
>>> c.setopt(c.USERAGENT, b'Foo\xa9')
# ok
The str type holds Unicode data. PycURL will accept str values containing ASCII code points only:
>>> c.setopt(c.USERAGENT, 'Foo')
# ok
>>> c.setopt(c.USERAGENT, 'Foo\xa9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 3: ordinal not in range(128)
>>> c.setopt(c.USERAGENT, 'Foo\xa9'.encode('iso-8859-1'))
# ok
[1] http://pycurl.io/docs/latest/unicode.html
