Unable to parse string to JSON in Python

I have a string, which I evaluate as:
import ast
def parse(s):
    return ast.literal_eval(s)
print parse(string)
{'_meta': {'name': 'foo', 'version': 0.2},
'clientId': 'google.com',
'clip': False,
'cts': 1444088114,
'dev': 0,
'uuid': '4375d784-809f-4243-886b-5dd2e6d2c3b7'}
But when I use jsonlint.com to validate the above JSON, it throws a schema error.
If I try json.loads instead:
json.loads(str(parse(string)))
I see the following error:
ValueError: Expecting property name: line 1 column 1 (char 1)
I am basically trying to convert this JSON to Avro (see How to convert JSON string to Avro in Python?).

ast.literal_eval() parses Python syntax. It won't parse JSON; that's what the json.loads() function is for.
Converting a Python object to a string with str() likewise produces Python syntax, not JSON syntax; that is what json.dumps() is for.
JSON is not Python syntax:

- Python uses None where JSON uses null.
- Python uses True and False for booleans; JSON uses true and false.
- JSON strings always use " double quotes; Python uses either single or double quotes, depending on the contents.
- In Python 2, strings contain bytes unless you use unicode objects (recognisable by the u prefix on their literal notation), but JSON strings are fully Unicode aware.
- Python will use \xhh escapes for Unicode characters in the Latin-1 range outside ASCII and \Uhhhhhhhh for non-BMP code points, but JSON only ever uses \uhhhh escapes.
- JSON numbers should generally be treated as limited to the range and precision of the C double type (since JavaScript numbers are always floating point numbers); Python integers have no limit other than available memory.
As such, JSON and Python syntax are not interchangeable. You cannot use str() on a Python object and expect to parse it as JSON. You cannot use json.dumps() and parse it with ast.literal_eval(). Don't confuse the two.
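A quick illustration of the mismatch, using a dict similar to the one in the question (a minimal sketch, standard library only):

import json

data = {'clip': False, 'cts': 1444088114, 'clientId': 'google.com'}

print(str(data))         # {'clip': False, ...}  -- Python syntax: single quotes, False
print(json.dumps(data))  # {"clip": false, ...}  -- JSON syntax: double quotes, false

json.loads(json.dumps(data))  # round-trips cleanly
# json.loads(str(data))       # raises ValueError: single quotes are not valid JSON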

Related

What encoding is used by json.dumps?

When I use json.dumps in Python 3.8, special characters are "escaped", like:
>>> import json
>>> json.dumps({'Crêpes': 5})
'{"Cr\\u00eapes": 5}'
What kind of encoding is this? Is this an "escape encoding"? And why is this kind of encoding not part of the encodings module? (Also see codecs, I think I tried all of them.)
To put it another way, how can I convert the string 'Crêpes' to the string 'Cr\\u00eapes' using Python encodings, escaping, etc.?
You are probably confused by the fact that this is a JSON string, not directly a Python string.
Python would encode this string as "Cr\u00eapes", where \u00ea represents a single Unicode character using its hexadecimal code point. In other words, in Python, len("\u00ea") == 1.
JSON requires the same sort of encoding, but embedding the JSON-encoded value in a Python string requires you to double the backslash; so in Python's representation, this becomes "Cr\\u00eapes" where you have a literal backslash (which has to be escaped by another backslash), two literal zeros, a literal e character, and a literal a character. Thus, len("\\u00ea") == 6
If you have JSON in a file, the absolutely simplest way to load it into Python is the json module: json.load() reads and decodes a file object, and json.loads() does the same for a string, either way giving you a native Python data structure.
If you need to decode the hexadecimal sequence separately, the unicode-escape function does that on a byte value:
>>> b"Cr\\u00eapes".decode('unicode-escape')
'Crêpes'
This is sort of coincidental, and works simply because the JSON representation happens to be identical to the Python unicode-escape representation. You still need a b'...' aka bytes input for that. ("Crêpes".encode('unicode-escape') produces a slightly different representation. "Cr\\u00eapes".encode('us-ascii') produces a bytes string with the Unicode representation b"Cr\\u00eapes".)
It is not a Python encoding. It is the way JSON encodes Unicode non-ASCII characters. It is independent of Python and is used exactly the same for example in Java or with a C or C++ library.
The rule is that a non-ASCII character in the Basic Multilingual Plane (i.e. one whose code point fits in 16 bits) is encoded as \uxxxx, where xxxx is the hexadecimal code point value.
This explains why ê is written as \u00ea: its Unicode code point is U+00EA.
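A short round trip in both directions with the standard json module (a minimal sketch):

>>> import json
>>> json.loads('"Cr\\u00eapes"')             # the JSON \u escape decodes to one character
'Crêpes'
>>> json.dumps('Crêpes')                     # default ensure_ascii=True escapes it again
'"Cr\\u00eapes"'
>>> json.dumps('Crêpes', ensure_ascii=False)
'"Crêpes"'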

Parsing json string representation with delimited double quote marks and booleans in python

This seems so basic that I can't believe I haven't found a simple solution. I am trying to evaluate a JSON string representation in Python which contains backslash-escaped double quote marks and boolean values,
e.g. s = "{\"data\":true}".
This is valid json in javascript as JSON.parse("{\"data\":true}") returns a valid object.
I've tried multiple methods, none of which provide the desired solution of a pythonised string representation of a json object:
Using eval(s) gives NameError: name 'true' is not defined.
Using ast.literal_eval(s) gives ValueError: malformed string.
Using json.dumps(json.loads(s)) returns the same format of string.
The output I want is "{'data':True}", as my neo4j database does not recognise \ as a delimiter for storing purposes and therefore produces an error when attempting to store the initial format. I am trying to avoid a hard replace of \" with ' or of true with True, as there must be a better, faster and easier solution.
Use json.loads:
import json
result = str(json.loads("{\"data\":true}"))
print(result, type(result))
Output
{'data': True} <class 'str'>
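If the stored string later needs to come back as a dict, ast.literal_eval can parse that Python-style representation again (a small sketch of the reverse step):

>>> import ast
>>> ast.literal_eval("{'data': True}")
{'data': True}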

Is converting Python unicode by casting to str reversible?

The proper way to convert a unicode string u to a (byte)string in Python is by calling u.encode(someencoding).
Unfortunately, I didn't know that before and I had used str(u) for conversion. In particular, I called str(u) to coerce u to be a string so that I can make it a valid shelve key (which must be a str).
Since I didn't encounter any UnicodeEncodeError, I wonder if this process is reversible/lossless. That is, can I do u = str(converted_unicode) (or u = bytes(converted_unicode) in Python 3) to get the original u?
In Python 2, if the conversion with str() was successful, then you can reverse the result. Using str() on a unicode value is the equivalent of using unicode_value.encode('ascii') and the reverse is to simply use str_value.decode('ascii'). Using unicode(str_value) will use the same implicit ASCII codec to decode.
In Python 3, calling str() on a unicode value simply gives you the same object back, since in Python 3 str() is the Unicode type. Using bytes() on a Unicode value without an encoding fails, you always have to use explicit codecs in Python 3 to convert between str and bytes.
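A minimal Python 2 sketch of the round trip described above; it only succeeds because the value is pure ASCII (anything else makes str(u) raise UnicodeEncodeError):

u = u'shelve-key'
s = str(u)               # equivalent to u.encode('ascii')
assert unicode(s) == u   # implicit ASCII decode reverses it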

Parsing invalid Unicode JSON in Python

I have a problematic JSON string that contains some funky unicode characters:
{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}
and if I try to convert it using Python:
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
json.loads(s)
# Error..
If I can accept to skip/lose the value of these unicode characters, what is the best way to make my json.loads(s) works?
If the rest of the string apart from the invalid \x5C escapes is valid JSON, then you could use the string-escape codec (Python 2 only) to decode the \x5C sequences into backslashes:
>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> json.loads(s.decode('string-escape'))
{u'test': {u'foo': u'Ig0s/k/4jRk'}}
You don't have JSON; that can be interpreted directly as Python instead. Use ast.literal_eval():
>>> import ast
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> ast.literal_eval(s)
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}
The \x5C is a single backslash, doubled in the Python literal string representation here. The actual string value is:
>>> print _['test']['foo']
Ig0s\/k\/4jRk
This parses the input as Python source, but only allows literal values: strings, None, True, False, numbers and containers (lists, tuples, dictionaries).
This method is slower than json.loads() because it does part of the parse-tree processing in pure Python code.
Another approach would be to use a regular expression to replace the \xhh escape codes with JSON \uhhhh codes:
import re
escape_sequence = re.compile(r'\\x([a-fA-F0-9]{2})')
def repair(string):
    return escape_sequence.sub(r'\\u00\1', string)
Demo:
>>> import json
>>> json.loads(repair(s))
{u'test': {u'foo': u'Ig0s\\/k\\/4jRk'}}
If you can repair the source producing this value to output actual JSON instead that'd be a much better solution.
I'm a bit late to the party, but we were seeing a similar issue, to be precise this one: Logstash JSON input with escaped double quote, just with \xXX instead.
There, JSON.stringify created such (per the specification) invalid JSON texts.
The solution is to simply replace \x with \u00, since Unicode escape sequences are allowed in JSON while ASCII \x escapes are not.
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
s = s.replace("\\x", "\\u00")
json.loads(s)
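A quick interactive check of the same repair (Python 3 shown; the result matches the ast.literal_eval output above):

>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> json.loads(s.replace("\\x", "\\u00"))
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}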

Returning a string as JSON in Python

I have a problem. First I made an API that accepts a POST request and responds with JSON as a result.
The POST request data had been encoded; I accepted the post data and read it correctly, then I responded with the new data in JSON format.
But when I returned the JSON, I found that the strings are in escaped Unicode format, e.g.
{
'a':'\u00e3\u0080'
}
but I want to get a format like this:
{
'a':"ã"
}
I want this format because I found that the escaped Unicode format didn't work well in IE8.
Yes, IE8.
What can I do about this?
Thanks!
If you're using the standard library json module, specifying ensure_ascii=False gives you what you want.
For example:
>>> print json.dumps({'a': u'ã'})
{"a": "\u00e3"}
>>> print json.dumps({'a': u'ã'}, ensure_ascii=False)
{"a": "ã"}
According to the json.dump() documentation:
If ensure_ascii is True (the default), all non-ASCII characters in the
output are escaped with \uXXXX sequences, and the result is a str
instance consisting of ASCII characters only. If ensure_ascii is
False, some chunks written to fp may be unicode instances. This
usually happens because the input contains unicode strings or the
encoding parameter is used. ...
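In Python 3, json.dumps(..., ensure_ascii=False) returns a str, so if the response body has to be bytes (as with most HTTP frameworks), one workable pattern is to encode it as UTF-8 yourself (a minimal sketch):

import json

payload = json.dumps({'a': u'ã'}, ensure_ascii=False).encode('utf-8')
# send payload with Content-Type: application/json; charset=utf-8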
BTW, what do you mean by "unicode format didn't work well in IE8"?
