voluptuous unable to handle unicode string? - python

I'm trying to use voluptuous to validate JSON input from an HTTP request. However, it doesn't seem to handle unicode strings too well.
from voluptuous import Schema, Required
from pprint import pprint
schema = Schema({
    Required('name'): str,
    Required('www'): str,
})
data = {
    'name': 'Foo',
    'www': u'http://www.foo.com',
}
pprint(data)
schema(data)
The above code generates the following error:
voluptuous.MultipleInvalid: expected str for dictionary value # data['www']
However, if I remove the u notation from the URL, everything works fine. Is this a bug or am I doing it wrong?
P.S. I'm using Python 2.7, in case that is relevant.

There are two string types in Python 2.7: str and unicode. In Python 2.7 the str type is not a Unicode string, it is a byte string.
So the value u'http://www.foo.com' indeed is not an instance of type str and you're getting that error. If you wish to support both str and Unicode strings in Python 2.7 you'd need to change your schema to be:
from voluptuous import Any, Schema, Required
schema = Schema({
    Required('name'): Any(str, unicode),
    Required('www'): Any(str, unicode),
})
Or, for simplicity, if you always receive Unicode strings then you can use:
schema = Schema({
    Required('name'): unicode,
    Required('www'): unicode,
})
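As a rough illustration of why the check fails, the sketch below reproduces the kind of isinstance test involved, using only the standard library. The names string_types and is_text are hypothetical helpers, not part of voluptuous, and the version switch is an assumption added so that the same snippet also runs on Python 3, where str is already a Unicode string.

```python
import sys

# Hypothetical helper tuple: the text types accepted on this interpreter.
if sys.version_info[0] == 2:
    string_types = (str, unicode)  # noqa: F821 (only evaluated on Python 2)
else:
    string_types = (str,)


def is_text(value):
    """Return True if value is any of the accepted string types."""
    return isinstance(value, string_types)


# On Python 2, u'...' is a unicode object, so isinstance(value, str) alone
# would be False for it, which is what the schema error reports.
url = u'http://www.foo.com'
print(is_text(url))
```

Under the same assumption, the tuple could be unpacked into the schema as Any(*string_types) instead of spelling out Any(str, unicode).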

Related

unable to parse string to json in python

I have a string, which I evaluate as:
import ast
def parse(s):
    return ast.literal_eval(s)
print parse(string)
{'_meta': {'name': 'foo', 'version': 0.2},
'clientId': 'google.com',
'clip': False,
'cts': 1444088114,
'dev': 0,
'uuid': '4375d784-809f-4243-886b-5dd2e6d2c3b7'}
But when I validate the above output as JSON on jsonlint.com, it reports a schema error.
If I try json.loads, like this:
json.loads(str(parse(string)))
I see the following error:
ValueError: Expecting property name: line 1 column 1 (char 1)
I am basically trying to convert this JSON to Avro; see "How to covert json string to avro in python?"
ast.literal_eval() loads Python syntax. It won't parse JSON; that's what the json.loads() function is for.
Converting a Python object to a string with str() still produces Python syntax, not JSON syntax; producing JSON is what json.dumps() is for.
JSON is not Python syntax. Python uses None where JSON uses null; Python uses True and False for booleans, JSON uses true and false. JSON strings always use " double quotes; Python uses either single or double quotes, depending on the contents. When using Python 2, strings contain bytes unless you use unicode objects (recognisable by the u prefix on their literal notation), but JSON strings are fully Unicode aware. Python will use \xhh for Unicode characters in the Latin-1 range outside ASCII and \Uhhhhhhhh for non-BMP Unicode code points, but JSON only ever uses \uhhhh codes. JSON integers should generally be viewed as limited to the range representable by the C double type (since JavaScript numbers are always floating point numbers), whereas Python integers have no limits other than what fits in your memory.
As such, JSON and Python syntax are not interchangeable. You cannot use str() on a Python object and expect to parse it as JSON. You cannot use json.dumps() and parse it with ast.literal_eval(). Don't confuse the two.
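A small sketch of the difference, using only the standard library. The sample dictionary and the helper names parses_as_json and parses_as_python are made up for illustration; the snippet is written so that it should behave the same on Python 2.7 and Python 3.

```python
import ast
import json

data = {"clip": False, "meta": None, "name": "foo"}

python_text = str(data)       # Python literal syntax: False, None, single quotes
json_text = json.dumps(data)  # JSON syntax: false, null, double quotes


def parses_as_json(text):
    """Return True if json.loads() accepts the text."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def parses_as_python(text):
    """Return True if ast.literal_eval() accepts the text."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False


print(python_text)
print(json_text)
print(parses_as_json(python_text), parses_as_python(json_text))
```

Each parser accepts only the text produced by its own serializer for this value, which is why str() output cannot be fed to json.loads().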

How to encode Chinese character as 'gbk' in json, to format a url request parameter String?

I want to dump a dict that contains some Chinese characters as a JSON string, and use that string as a URL request parameter.
Here is my Python code:
import httplib
import simplejson as json
import urllib
d={
    "key":"上海",
    "num":1
}
jsonStr = json.dumps(d,encoding='gbk')
url_encode=urllib.quote_plus(jsonStr)
conn = httplib.HTTPConnection("localhost",port=8885)
conn.request("GET","/?json="+url_encode)
res = conn.getresponse()
What I expected the request string to be is this:
GET /?json=%7B%22num%22%3A+1%2C+%22key%22%3A+%22%C9%CF%BA%A3%22%7D
Here "%C9%CF%BA%A3" represents "上海" encoded as 'gbk' in the URL.
But what I got is this:
GET /?json=%7B%22num%22%3A+1%2C+%22key%22%3A+%22%5Cu6d93%5Cu5a43%5Cu6363%22%7D
Here %5Cu6d93%5Cu5a43%5Cu6363 is some other representation of the Chinese characters "上海".
I also tried to dump the JSON with the ensure_ascii=False option:
jsonStr = json.dumps(d,ensure_ascii=False,encoding='gbk')
but had no luck.
So, how can I make this work? Thanks.
You almost got it with ensure_ascii=False. This works:
jsonStr = json.dumps(d, encoding='gbk', ensure_ascii=False).encode('gbk')
You need to tell json.dumps() that the strings it will read are GBK, and that it should not try to ASCII-fy them. Then you must re-specify the output encoding, because json.dumps() has no separate option for that.
This solution is similar to another answer here: https://stackoverflow.com/a/18337754/4323
So this does what you want, though I should note that the standard for URIs seems to say that they should be in UTF-8 whenever possible. For more on this, see here: https://stackoverflow.com/a/14001296/4323
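For comparison, here is a sketch of the same idea on Python 3, which is an assumption, since the question uses Python 2 with simplejson. In Python 3 the standard library json.dumps() has no encoding parameter and strings are already Unicode, so the GBK step can be left to urllib.parse.quote_plus(). The variable names are illustrative only.

```python
import json
from urllib.parse import quote_plus

d = {"key": "上海", "num": 1}

# Keep the Chinese characters as-is instead of \uXXXX escapes.
json_text = json.dumps(d, ensure_ascii=False)

# Percent-encode the JSON text using GBK bytes rather than the UTF-8 default.
url_param = quote_plus(json_text, encoding="gbk")

print("/?json=" + url_param)
```

The receiving application then has to decode the parameter as GBK, which is the coupling the UTF-8 recommendation tries to avoid.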
"key":"上海",
You saved your source code as UTF-8, so this is the byte string '\xe4\xb8\x8a\xe6\xb5\xb7'.
jsonStr = json.dumps(d,encoding='gbk')
The JSON format supports only Unicode strings. The encoding parameter can be used to force json.dumps into allowing byte strings, automatically decoding them to Unicode using the given encoding.
However, the byte string's encoding is actually UTF-8 not 'gbk', so json.dumps decodes incorrectly, giving u'涓婃捣'. It then produces the incorrect JSON output "\u6d93\u5a43\u6363", which gets URL-encoded to %22%5Cu6d93%5Cu5a43%5Cu6363%22.
To fix this you should make the input to json.dumps a proper Unicode (u'') string:
# coding: utf-8
d = {
    "key": u"上海", # or u'\u4e0a\u6d77' if you don't want to rely on the coding decl
    "num":1
}
jsonStr = json.dumps(d)
...
This will get you JSON "\u4e0a\u6d77", encoding to URL %22%5Cu4e0a%5Cu6d77%22.
If you really don't want the \u escapes in your JSON you can indeed use ensure_ascii=False and then .encode() the output before URL-encoding. But I wouldn't recommend it, as you would then have to worry about what encoding the target application wants in its URL parameters, which is a source of some pain. The \u version is accepted by all JSON parsers, and is not typically much longer once URL-encoded.
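The two effects described in this answer can be reproduced with a short sketch. It assumes Python 3 (the question itself is Python 2), and the variable names are made up for illustration.

```python
import json
from urllib.parse import quote_plus

text = "上海"

# Decoding UTF-8 bytes with the wrong codec yields different characters,
# which is the kind of mistake described above.
utf8_bytes = text.encode("utf-8")
misdecoded = utf8_bytes.decode("gbk")

# With the default ensure_ascii=True the JSON text is pure ASCII, so the
# URL-encoded form does not depend on any byte encoding choice.
escaped_json = json.dumps({"key": text})
url_param = quote_plus(escaped_json)

print(misdecoded)
print(escaped_json)
print(url_param)
```

Any JSON parser on the receiving side turns the \u escapes back into the original characters, regardless of the encoding it expects for URL parameters.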

Python return a string by json

I have a problem. I made an API that accepts a POST request and responds with JSON as a result.
The POST data was encoded; I received it and read the data correctly, then I responded with the new data in JSON format.
But when I returned the JSON, I found that the string is in escaped Unicode format, e.g.
{
    'a':'\u00e3\u0080'
}
But I want to get a format like this:
{
    'a':"ã"
}
I want this format because I found that the escaped Unicode format didn't work well in IE8.
Yes, IE8.
What can I do about this issue?
Thanks!
If you're using the standard library json module, specifying ensure_ascii=False gives you what you want.
For example:
>>> print json.dumps({'a': u'ã'})
{"a": "\u00e3"}
>>> print json.dumps({'a': u'ã'}, ensure_ascii=False)
{"a": "ã"}
According to the json.dump documentation:
If ensure_ascii is True (the default), all non-ASCII characters in the
output are escaped with \uXXXX sequences, and the result is a str
instance consisting of ASCII characters only. If ensure_ascii is
False, some chunks written to fp may be unicode instances. This
usually happens because the input contains unicode strings or the
encoding parameter is used. ...
BTW, what do you mean by "unicode format didn't work well in IE8"?

Python3 json.dumps gives TypeError: keys must be a string

I've got a simple web server written in Python3 (using classes from http.server) that I'm porting from 2 to 3.
I have the following code:
# More code here...
postvars = cgi.parse_qs(self.rfile.read(length),
                        keep_blank_values=1)
something.json_run(json.dumps(postvars))
Which throws:
TypeError: keys must be a string
By inspecting the data, I've determined that parse_qs seems to encode the keys as bytes, which is what's throwing the error (json doesn't like bytes, apparently).
import json
json.dumps({b'Throws error' : [b"Keys must be a string"]})
json.dumps({'Also throws error': [b'TypeError, is not JSON serializable']})
json.dumps({'This works': ['No bytes!']})
What is the best solution here? With Python 2, the code works fine because parse_qs uses str instead of bytes. My initial thought is that I probably need to write a JSON serializer. Not that it's difficult for something so simple, but I'd prefer not to if I can do it some other way (e.g. translate the dictionary to using strings instead of bytes).
Firstly, the cgi.parse_qs function is deprecated and merely an alias for urllib.parse.parse_qs, so you may want to adjust your import path.
Secondly, you are passing a byte string to the parse method. If you pass in a regular (unicode) string instead, the parse_qs method returns regular strings:
>>> from urllib.parse import parse_qs
>>> parse_qs(b'a_byte_string=foobar')
{b'a_byte_string': [b'foobar']}
>>> parse_qs('a_unicode_string=foobar')
{'a_unicode_string': ['foobar']}
So you'll need to decode your file-read byte string to a regular string first.
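A minimal sketch of that decode step, assuming the request body is UTF-8 (the encoding is an assumption; use whatever the client actually sends). The raw_body value stands in for self.rfile.read(length) from the question.

```python
import json
from urllib.parse import parse_qs

# Stand-in for the bytes returned by self.rfile.read(length).
raw_body = b"name=Foo&tag=a&tag=b&empty="

# Decode first, so that parse_qs returns str keys and values.
postvars = parse_qs(raw_body.decode("utf-8"), keep_blank_values=True)

# A dict of str keys and lists of str values serializes without a custom encoder.
payload = json.dumps(postvars)
print(payload)
```

Decoding once at the boundary keeps bytes out of the rest of the handler, so no custom JSON serializer is needed.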

Python AJAX response string literal

I have an AJAX response that returns a JSON object. One of the dictionary values is supposed to read:
"image\/jpeg"
But instead it reads:
"images\\/jpeg"
I've gone through the documentation on string literals and how to ignore escape sequences, and I've tried to prefix the string with 'r', but so far no luck.
My JSON encoded dictionary looks like this:
response.append({
    'name' : i.pk,
    'size' : False,
    'type' : 'image/jpeg'
})
Help would be greatly appreciated!
According to the JSON spec, the \ character should be escaped as \\ in JSON.
So the Python json library is correct:
>>> import json
>>> json.dumps({"type": r"image\/jpeg", "size": False})
'{"type": "image\\\\/jpeg", "size": false}'
When the JSON is parsed/evaluated in the browser, the type attribute will have the correct value image\/jpeg.
The Python JSON parser of course handles the escaping as well:
>>> print(json.loads(json.dumps({"type": r"image\/jpeg", "size": False}))["type"])
image\/jpeg
I find it very strange that your JavaScript library requires that particular value for something that looks like it is used to identify a resource's MIME type.
