How to fix mixed JSON encoded strings

I am facing the following problem. I have JSON strings in which inner arrays or objects are sometimes written as an escaped string and sometimes not. For instance, a good one:
{ "author": "Jack",
"meta": ["a", "b"]}
and a bad one:
{ "author": "Jack",
"meta": "[\"a\", \"b\"]"}
If I parse the latter one, I only get a string for the meta property. This can be fixed by passing the meta property through a JSON parser a second time. The problem, however, is that when I pass a value through JSON.parse (Ruby) or json.loads (Python), it may not be an escaped array or object at all, but a simple number such as "15.3", which results in an error.
So how can I intelligently detect whether a value needs to go through JSON.parse again? Should I simply try-catch this situation?

It really depends on the kind of double-encoded data you're dealing with, but testing the first character might be sufficient. If it's [ or {, you could try to decode the string as JSON and, if that succeeds, substitute the decoded value for the string.
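The first-character check can be sketched roughly as follows. This is only an illustration: the helper name decode_nested is hypothetical, and it assumes that only arrays and objects were double-encoded, so a plain string such as "15.3" is left untouched.

```python
import json


def decode_nested(value):
    """Recursively re-parse strings that look like encoded JSON arrays/objects."""
    if isinstance(value, str):
        candidate = value.strip()
        # Only strings starting with [ or { are treated as possibly double-encoded.
        if candidate[:1] in ("[", "{"):
            try:
                return decode_nested(json.loads(candidate))
            except json.JSONDecodeError:
                return value  # it merely looked like JSON; keep the original string
        return value
    if isinstance(value, dict):
        return {key: decode_nested(item) for key, item in value.items()}
    if isinstance(value, list):
        return [decode_nested(item) for item in value]
    return value


good = decode_nested(json.loads('{"author": "Jack", "meta": ["a", "b"]}'))
bad = decode_nested(json.loads('{"author": "Jack", "meta": "[\\"a\\", \\"b\\"]"}'))
print(good == bad)
```

The try/except is still needed, because a string can start with [ or { without being valid JSON; in that case the original string is kept.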

Related

Eliminating " " from a JSON file so that they don't interrupt the string [duplicate]

While trying to parse JSON from an AJAX request, the string returned contains invalid JSON.
Although the best practice would be to change the server to reply with valid JSON, as suggested in multiple related answers, this is not an option.
Trying to solve this problem using Python, I looked at regular expressions.
The main problem is elements like the following (which I currently use as a test string):
testStr = '{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}'
I currently use the following code:
jsonString = re.sub(r'(?<=\w)\"(?=[^\(\:\}\,])','\\"',testStr)
jsonString = re.sub(r'\"\"(?![,}:])','\"\\\"',jsonString)
with very limited success.
If I were using C, I would parse the string and simply escape all double quotes within the element (i.e. between the double quotes that are adjacent to one of [ : { } , ).
There must be a Pythonic way to parse this without resorting to a for loop, looking ahead, and keeping history.
EDIT:
Assuming that strings do not contain: [ : { } ]
and also assuming that the unescaped double quotes occur only within the value, not in the key,
I assume that the following (or something similar) should solve the problem:
import re
re.sub(r'(?<![\[\:])\"(?![,\}])','\"',testString)
But it still does not work.

Seems I needed a break to solve this.
The following regular expression seems to replace only the double quotes that are contained within the element string (under the assumptions I stated in the question):
output = re.sub(r'(?<![\[\:\{\,])\"(?![\:\}\,])','\\\"', stringName)
I have created a sandbox here: https://repl.it/vNK
Example Output:
Original String:
{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}
Modified String:
{"KEY1":"THIS IS \"AN\" ELEMENT","KEY2":"\"\"THIS IS ANOTHER \"ELEMENT\""}
Parsed JSON:
{
"KEY1": "THIS IS \"AN\" ELEMENT",
"KEY2": "\"\"THIS IS ANOTHER \"ELEMENT\""
}
Any suggestions are welcome.
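For reference, here is a self-contained sketch of the same approach that also parses the result. The pattern is the one from the answer with the redundant escapes removed; the names INNER_QUOTE and escape_inner_quotes are made up for this illustration, and the assumptions stated in the question still apply.

```python
import json
import re

# A double quote is treated as structural when it directly follows [ : { ,
# or directly precedes : } , and every other double quote gets escaped.
INNER_QUOTE = re.compile(r'(?<![\[:{,])"(?![:},])')


def escape_inner_quotes(raw):
    """Escape double quotes that appear inside string values."""
    # A function replacement avoids escape processing of the replacement text.
    return INNER_QUOTE.sub(lambda match: '\\"', raw)


test_str = '{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}'
parsed = json.loads(escape_inner_quotes(test_str))
print(parsed["KEY1"])
print(parsed["KEY2"])
```

Note that this only works for compact JSON. If there is whitespace between the punctuation and a quote (for example a space after the colon), the structural quote is escaped as well and parsing fails.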

JSONDecodeError: Invalid \escape when parsing from Python

After my object detection model runs, it outputs a .json file with the results. To actually use the results in my Python code I need to parse the .json file, but nothing I have tried works. I tried simply opening the file and printing the results, but I got this error:
json.decoder.JSONDecodeError: Invalid \escape: line 4 column 41 (char 60)
If you have any idea what I did wrong, the help would be very much appreciated. My code:
import json
with open(r'C:\Yolo_v4\darknet\build\darknet\x64\result.json') as result:
    data = json.load(result)
    result.close()
print(data)
My .json file
[
{
"frame_id":1,
"filename":"C:\Yolo_v4\darknet\build\darknet\x64\f047.png",
"objects": [
{"class_id":32, "name":"right", "relative_coordinates":{"center_x":0.831927, "center_y":0.202225, "width":0.418463, "height":0.034752}, "confidence":0.976091},
{"class_id":19, "name":"h", "relative_coordinates":{"center_x":0.014761, "center_y":0.873551, "width":0.041723, "height":0.070544}, "confidence":0.484339},
{"class_id":24, "name":"left", "relative_coordinates":{"center_x":0.285694, "center_y":0.200752, "width":0.619584, "height":0.032149}, "confidence":0.646595},
]
}
]
(There are several more detected objects, but I did not include them.)

The other responders are of course right. This is not valid JSON. But sometimes you don't have the option to change the format, e.g. because you are working with a broken data dump where the original source is no longer available.
The only way to deal with that is to sanitize it somehow. This is of course not ideal, because you have to build a lot of assumptions into your sanitizer code, i.e. you need to know exactly what kinds of errors the JSON file contains.
However, a solution using regular expressions could look like this:
import json
import re
class LazyDecoder(json.JSONDecoder):
    def decode(self, s, **kwargs):
        regex_replacements = [
            # double a single backslash that sits between two other characters
            (re.compile(r'([^\\])\\([^\\])'), r'\1\\\\\2'),
            # drop a trailing comma before a closing bracket
            (re.compile(r',(\s*])'), r'\1'),
        ]
        for regex, replacement in regex_replacements:
            s = regex.sub(replacement, s)
        return super().decode(s, **kwargs)
with open(r'C:\Yolo_v4\darknet\build\darknet\x64\result.json') as result:
    data = json.load(result, cls=LazyDecoder)
print(data)
This works by subclassing the standard JSONDecoder and passing that subclass to json.load via the cls argument.
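The same sanitizing idea can be tried on a string without touching a file. The sketch below is a variant, not the decoder from the answer: it uses lookarounds instead of capture groups so that backslashes separated by a single character are both doubled, and it also strips a trailing comma before a closing brace. The names sanitize_json_text, SINGLE_BACKSLASH and TRAILING_COMMA are hypothetical, and the sample is a shortened stand-in for the file in the question. It assumes every lone backslash is meant literally, so a legitimate escape such as \n in the input would be turned into a backslash followed by n.

```python
import json
import re

# A backslash with no backslash directly before or after it.
SINGLE_BACKSLASH = re.compile(r'(?<!\\)\\(?!\\)')
# A comma followed only by whitespace and a closing bracket or brace.
TRAILING_COMMA = re.compile(r',(\s*[\]}])')


def sanitize_json_text(text):
    """Double lone backslashes and drop trailing commas before parsing."""
    text = SINGLE_BACKSLASH.sub(lambda match: '\\\\', text)
    return TRAILING_COMMA.sub(r'\1', text)


# Shortened, made-up sample with the same two defects as the file in the question.
broken = r'[{"filename":"C:\Yolo_v4\x64\f047.png","objects":[{"class_id":32},]}]'
data = json.loads(sanitize_json_text(broken))
print(data[0]["filename"])
```

Like any regex-based repair, this can also alter text inside string values (for example a comma followed by a closing bracket within a string), so it is only suitable when the defects in the data are well understood.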

Hi, you need to use double backslashes (\\) and remove the last comma in the objects property. Also, you don't need to close the file inside the with block.
[
{
"frame_id":1,
"filename":"C:\\Yolo_v4\\darknet\\build\\darknet\\x64\\f047.png",
"objects": [
{"class_id":32, "name":"right", "relative_coordinates":{"center_x":0.831927, "center_y":0.202225, "width":0.418463, "height":0.034752}, "confidence":0.976091},
{"class_id":19, "name":"h", "relative_coordinates":{"center_x":0.014761, "center_y":0.873551, "width":0.041723, "height":0.070544}, "confidence":0.484339},
{"class_id":24, "name":"left", "relative_coordinates":{"center_x":0.285694, "center_y":0.200752, "width":0.619584, "height":0.032149}, "confidence":0.646595}
]
}
]

The "\" character is used not only in Windows filepaths but also as an escape character for things like newlines (you can use "\n" instead of an actual newline, for example). To escape the escape, you simply have to put a second backslash before it, like this:
"C:\\Yolo_v4\\darknet\\build\\darknet\\x64\\f047.png"
As someone said in the comments, json.dump should do this automatically for you, so it sounds like something internal is messed up (unless this wasn't created using that).
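A quick way to see that the standard library escapes backslashes on its own is a round trip through json.dumps and json.loads. The record below is just an illustrative value built from the path in the question:

```python
import json

record = {"filename": r"C:\Yolo_v4\darknet\build\darknet\x64\f047.png"}

encoded = json.dumps(record)   # backslashes are doubled in the JSON text
decoded = json.loads(encoded)  # and restored to single backslashes on load

print(encoded)
print(decoded == record)
```

If a file contains single backslashes inside strings, it was most likely not written by a JSON serializer.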

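The function below uses the encode method to encode the string as UTF-8 and then the decode method with the unicode_escape codec, which interprets any escape sequences in the string (for example, the two characters \n become a real newline). Note that unicode_escape treats the bytes as Latin-1, so non-ASCII characters will be garbled.
def remove_escape_sequences(string):
    return string.encode('utf-8').decode('unicode_escape')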

Parsing string for further use

I have a string like "d/\"". I want to split it with split('/') to get \", then format the result as a raw string and append it to a list.
When I print or write that string from the list, I want to get \".
I have tried many approaches (replacing the \ character, using the repr() function, and formatting the string as a raw string with r'%s'%string).
None of them worked correctly. Maybe someone can help me find a solution.
Thank you in advance,
Greetings
EDIT: Minimal reproducible example:
JSON Object:
{
"name": "d/\"",
"name_encoding": "utf8",
"value": "/\/\",
}
I want a list with the raw strings, like ['\"', 'utf8', '/\/\'], and when I write the list to a file or print it, the output should look like \", utf8, /\/\.
Minimal code I use (for reproducing the problem):
with open(...) as f:
    for object in ijson.items(f, "item"):
        temp_list = []
        for attribute_key in object.keys():
            if attribute_key == 'name':
                name_parsed = object[attribute_key].split('/', maxsplit=1)[1]
                temp_list.append(repr(name_parsed)[1:-1])

Saving python variable with new lines in JSON with pretty print

I am reading this text from a CSV file in Python.
Hi there,
This is a test.
and storing it into a variable text.
I am trying to write this variable in a JSON file with json.dump(), but it is being transformed into:
' \ufeffHi there,\n\n\xa0\n\nThis is a test.
How can I make my JSON file look like the one below?
{
"text": "Hi there,
This is a test."
}

JSON does not allow real line breaks inside strings. If you still want to use them, you will have to write your own "json" writer.
Edit: Here's a function that takes a Python dict (which you can get using json.loads()) and prints it the way you need:
def print_wrong_json(dict_object):
    print('{')
    print(',\n'.join('"{}": "{}"'.format(key, dict_object[key]) for key in dict_object))
    print('}')

Well, it can be done, as user1308345 shows in his answer, but it wouldn't be valid JSON anymore, and you would probably run into issues later when deserializing it.
But if you really want to do it and still want valid JSON, you could split the string at the line breaks and serialize the pieces as an array, as suggested in this answer: https://stackoverflow.com/a/7744658/1466757
Then your JSON would look similar to this:
{
"text": [
"Hi there,",
"",
"",
"",
"this is a test."
]
}
After deserializing it, you would have to put the line breaks back in.
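A minimal sketch of that round trip could look like this. The function names are hypothetical, and the "text" key is taken from the example above; splitting on "\n" (rather than using splitlines) is a choice made here so that joining the lines again reproduces the original string exactly.

```python
import json


def dump_text_as_lines(text):
    """Serialize a multi-line string as a JSON array of its lines."""
    return json.dumps({"text": text.split("\n")}, indent=4)


def load_text_from_lines(encoded):
    """Rebuild the original string by joining the lines with newlines."""
    return "\n".join(json.loads(encoded)["text"])


original = "Hi there,\n\nThis is a test."
encoded = dump_text_as_lines(original)
print(encoded)
print(load_text_from_lines(encoded) == original)
```

The file stays valid JSON and readable line by line, at the cost of one extra join step whenever the text is loaded.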
