Parsing invalid Unicode JSON in Python

I have a problematic JSON string that contains some funky escaped characters:
{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}
and if I try to parse it in Python:
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
json.loads(s)
# Error..
If I'm willing to skip/lose the value of these characters, what is the best way to make json.loads(s) work?

If the rest of the string, apart from the invalid \x5C escapes, is valid JSON, then you could use the string-escape codec (Python 2) to decode the \x5C sequences into backslashes:
>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> json.loads(s.decode('string-escape'))
{u'test': {u'foo': u'Ig0s/k/4jRk'}}
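The string-escape codec is Python 2 only. On Python 3, a comparable effect (my adaptation, not part of the original answer) can be had from codecs.escape_decode, which is undocumented and internal, so treat this as a sketch:

```python
import codecs
import json

s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
# escape_decode interprets \xhh (and similar) escapes in a bytes string
raw = codecs.escape_decode(s.encode('ascii'))[0]
print(json.loads(raw.decode('ascii')))
# {'test': {'foo': 'Ig0s/k/4jRk'}}
```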

You don't have JSON; what you have can be interpreted directly as Python instead. Use ast.literal_eval():
>>> import ast
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> ast.literal_eval(s)
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}
The \x5C is a single backslash, doubled in the Python literal string representation here. The actual string value is:
>>> print _['test']['foo']
Ig0s\/k\/4jRk
This parses the input as Python source, but only allows literal values: strings, None, True, False, numbers and containers (lists, tuples, dictionaries).
This method is slower than json.loads() because it does part of the parse-tree processing in pure Python code.
Another approach would be to use a regular expression to replace the \xhh escape codes with JSON \uhhhh codes:
import re
escape_sequence = re.compile(r'\\x([a-fA-F0-9]{2})')
def repair(string):
    return escape_sequence.sub(r'\\u00\1', string)
Demo:
>>> import json
>>> json.loads(repair(s))
{u'test': {u'foo': u'Ig0s\\/k\\/4jRk'}}
If you can repair the source producing this value to output actual JSON instead that'd be a much better solution.

I'm a bit late to the party, but we were seeing a similar issue, to be precise this one: Logstash JSON input with escaped double quote, just for \xXX.
There, JSON.stringify created such (per the specification) invalid JSON texts.
The solution is to simply replace \x with \u00, since Unicode escape sequences are allowed in JSON strings while hex escape sequences are not.
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
s = s.replace("\\x", "\\u00")
json.loads(s)
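Worth noting (my observation, not part of the original answer): the blanket replace will also rewrite a legitimate escaped backslash that happens to be followed by an x, silently changing the value rather than raising an error:

```python
import json

s2 = r'{"a": "\\x"}'           # valid JSON: the value is a backslash plus x
print(json.loads(s2))          # {'a': '\\x'}
mangled = s2.replace("\\x", "\\u00")
print(json.loads(mangled))     # {'a': '\\u00'} -- silently corrupted
```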

Related

How to tell Python that a string is actually a bytes object? Not converting

I have a txt file which contains a line:
' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
The content in the double quotes is actually octal-escaped bytes, but with doubled escape characters.
After the line has been read in, I used regex to extract the contents in the double quotes.
c = re.search(r': "(.+)"', line).group(1)
After that, I have two problems:
First, I need to replace the doubled escape characters with single ones.
Second, tell Python that the str object c is actually a bytes object.
Neither has worked.
I have tried:
re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)
All failed.
A bytes object can easily be defined with a b prefix:
c = b'\351\231\220\346\227\266\345\205\215\350\264\271'
How do I change the type of the string to bytes? I think this is not an encode-and-decode thing.
I googled a lot, but found no answers. Maybe I used the wrong keywords.
Does anyone know how to do these? Or other way to get what I want?
This is always a little confusing. I assume your bytes object should represent a string like:
b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'
To get that with your escaped string, you could use the codecs library and try:
import re
import codecs
line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'
giving the same result.
The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
Step-by-Step:
>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s) # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1') # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes... result is Latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to a byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module's literal_eval function to turn the text directly into a Python object, then fix up the mis-decoded value afterwards:
>>> # Python dictionary-like text
d='{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d) # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'
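The encode/decode chain above can be wrapped in a small helper (the function name and default encoding are my choices, not from the answer):

```python
def decode_octal_escapes(s, encoding='utf8'):
    """Turn literal \\351-style escape text into real characters.

    Latin-1 maps code points 0-255 one-to-one onto bytes, so it serves
    as a lossless bridge between text and bytes in both directions.
    """
    return (s.encode('latin1')          # literal escape text -> bytes
             .decode('unicode-escape')  # \351 etc. -> Latin-1 chars
             .encode('latin1')          # back to the raw bytes
             .decode(encoding))         # finally decode as UTF-8

s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
print(decode_octal_escapes(s))  # 限时免费
```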

unable to parse string to json in python

I have a string, which I evaluate as:
import ast
def parse(s):
    return ast.literal_eval(s)
print parse(string)
{'_meta': {'name': 'foo', 'version': 0.2},
'clientId': 'google.com',
'clip': False,
'cts': 1444088114,
'dev': 0,
'uuid': '4375d784-809f-4243-886b-5dd2e6d2c3b7'}
But when I use jsonlint.com to validate the above JSON, it throws a schema error.
And if I try to use json.loads:
json.loads(str(parse(string)))
I see the following error:
ValueError: Expecting property name: line 1 column 1 (char 1)
I am basically trying to convert this JSON to Avro, as in: How to convert json string to avro in python?
ast.literal_eval() loads Python syntax. It won't parse JSON, that's what the json.loads() function is for.
Converting a Python object to a string with str() is still Python syntax, not JSON syntax, that is what json.dumps() is for.
JSON is not Python syntax. Python uses None where JSON uses null; Python uses True and False for booleans, JSON uses true and false. JSON strings always use " double quotes, Python uses either single or double, depending on the contents. When using Python 2, strings contain bytes unless you use unicode objects (recognisable by the u prefix on their literal notation), but JSON strings are fully Unicode aware. Python will use \xhh for Unicode characters in the Latin-1 range outside ASCII and \Uhhhhhhhh for non-BMP unicode points, but JSON only ever uses \uhhhh codes. JSON integers should generally be viewed as limited to the range representable by the C double type (since JavaScript numbers are always floating point numbers), Python integers have no limits other than what fits in your memory.
As such, JSON and Python syntax are not interchangeable. You cannot use str() on a Python object and expect to parse it as JSON. You cannot use json.dumps() and parse it with ast.literal_eval(). Don't confuse the two.
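A minimal sketch of the mismatch (my example, not from the answer):

```python
import json

d = {'clip': False, 'cts': None}

print(str(d))         # {'clip': False, 'cts': None}  <- Python syntax
print(json.dumps(d))  # {"clip": false, "cts": null}  <- JSON syntax

try:
    json.loads(str(d))  # single quotes, False and None are not JSON
except ValueError as e:
    print('not JSON:', e)

assert json.loads(json.dumps(d)) == d  # round-trip through json works
```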

Python: Converting a dictionary to JSON and back appends "u" to each entry

I'm trying to store translations of spanish words as json. But the process of converting back and forth between python dictionaries and json strings is messing up my data.
Here's the code:
import json
text={"hablar":"reden"}
print(text) # {'hablar': 'reden'}
data=json.dumps(text)
text=json.loads(data)
print(text) # {u'hablar': u'reden'}
Why has the letter "u" been added ?
Strings in JSON are loaded into unicode strings in Python.
If you print a dictionary, you are printing the repr() form of its keys and values. But the string itself still contains only reden.
>>> print(repr(text["hablar"]))
u'reden'
>>> print(text["hablar"])
reden
Everything should be OK. Unicode is the preferred way to work with human-readable strings. JSON does not natively support binary data, so parsing JSON strings into Python unicode makes sense.
You can read more about Unicode in Python here: http://docs.python.org/2/howto/unicode.html
In Python source code, Unicode literals are written as strings prefixed with the ‘u’ or ‘U’ character: u'abcdefghijk'.
json.loads reads strings as unicode1. In general, this shouldn't hurt anything.
1The u is only present in the unicode object's representation -- try print(text['hablar']) and no u will be present.
The u character means the string is unicode. If you want to avoid it, you can use yaml instead of json,
so the code will be:
import yaml
text={"hablar":"reden"}
print(text) # {'hablar': 'reden'}
data=yaml.dump(text)
text=yaml.load(data)
print(text) # {'hablar': 'reden'}
This question has more details about yaml:
What is the difference between YAML and JSON? When to prefer one over the other

Unescape unicode-escapes, but not carriage returns and line feeds, in Python

I have an ASCII-encoded JSON file with Unicode escapes (e.g., \u201cquotes\u201d) and newlines escaped within strings (e.g., "foo\r\nbar"). Is there a simple way in Python to generate a UTF-8 encoded file by un-escaping the Unicode escapes, but leaving the newline escapes intact?
Calling decode('unicode-escape') on the string will decode the unicode escapes (which is what I want) but it will also decode the carriage returns and newlines (which I don't want).
Sure there is, use the right tool for the job and ask the json module to decode the data to Python unicode; then encode the result to UTF-8:
import json
json.loads(input).encode('utf8')
Use unicode-escape only for actual Python string literals. JSON strings are not the same as Python strings, even though they may, at first glance, look very similar.
Short demo (take into account the python interactive interpreter echoes strings as literals):
>>> json.loads(r'"\u201cquotes\u201d"').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> json.loads(r'"foo\r\nbar"').encode('utf8')
'foo\r\nbar'
Note that the JSON decoder decodes \r and \n just like a Python literal would.
If you absolutely have to only process the \uabcd unicode literals in the JSON input but leave the rest intact, then you need to resort to a regular expression:
import re
codepoint = re.compile(r'(\\u[0-9a-fA-F]{4})')
def replace(match):
    return unichr(int(match.group(1)[2:], 16))
codepoint.sub(replace, text).encode('utf8')
which gives:
>>> codepoint.sub(replace, r'\u201cquotes\u201d').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> codepoint.sub(replace, r'"foo\r\nbar"').encode('utf8')
'"foo\\r\\nbar"'
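On Python 3, where unichr is gone and str is already Unicode, the same idea looks like this (my adaptation of the answer's regex):

```python
import re

codepoint = re.compile(r'\\u([0-9a-fA-F]{4})')

def replace(match):
    # group(1) is the four hex digits after the literal \u
    return chr(int(match.group(1), 16))

print(codepoint.sub(replace, r'\u201cquotes\u201d'))  # “quotes”
print(codepoint.sub(replace, r'"foo\r\nbar"'))        # "foo\r\nbar" (unchanged)
```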

How do convert unicode escape sequences to unicode characters in a python string

When I tried to get the content of a tag using unicode(head.contents[3]), I got output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be returned as a proper string. How do I do that in Python?
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling it is unicode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use the encode method to turn the unicode into e.g. a UTF-8 byte string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
I suspect that it's actually working correctly. By default, Python displays strings in ASCII encoding, since not all terminals support Unicode. If you actually print the string, though, it should work. See the following example:
>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
Given a byte string with Unicode escapes, b"\N{SNOWMAN}".decode('unicode-escape') will produce the expected Unicode string u'\u2603'.
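A quick check of that one-liner (Python 3 shown; note the backslash must be doubled in the bytes literal there, since \N is not a recognised bytes escape):

```python
# \N{...} name escapes are handled by the unicode-escape codec
result = b"\\N{SNOWMAN}".decode('unicode-escape')
print(result)            # ☃
print(hex(ord(result)))  # 0x2603
```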
