Encoding in messenger JSON [duplicate] - python

This question already has answers here:
Facebook JSON badly encoded
(9 answers)
Closed 3 years ago.
I'm learning to work with JSON by writing a simple Python program that analyzes Facebook messages I downloaded in JSON format, but these messages contain plenty of Unicode characters that appear in the JSON file like this:
pom\u00c3\u00b4\u00c5\u00bee
The example above is supposed to be the word
pomôže
However, when I try to work with the string and print out the word, it comes out like this:
'pomÃ´Å¾e'
Most online converters printed it out the same way too, except this one: https://github.com/mathiasbynens/utf8.js
Is there any way to fix this?
EDIT:
Alright, so I'm sorry for not being clear enough. Hopefully, this will make things clearer:
I have a JSON file that looks like this, when opened in Notepad++:
{
    "participants": [
        {
            "name": "Person1"
        },
        {
            "name": "Person2"
        }
    ],
    "messages": [
        {
            "sender_name": "Person1",
            "timestamp_ms": 1521492166805,
            "content": "D\u00c3\u00bafam, \u00c5\u00bee pom\u00c3\u00b4\u00c5\u00bee",
            "type": "Generic"
        }
    ]
}
When I try to print or work with the content of the message:
import json

with open("messages.json", "r") as f:
    messages = json.load(f)

print(messages["messages"][0]["content"])
the string looks like this:
DÃºfam, Å¾e pomÃ´Å¾e
How do I get the text into readable form?

It took me a while to understand, but the reason is quite simple: the same bytes can be read through different character tables. In your case the escape sequences in the file spell out the UTF-8 bytes of each character as separate code points, so printing them as-is gives the garbled text; you have to work with the actual Unicode code points of the characters instead.
I'll give you some examples.
In JavaScript:
console.log("pom\u{00f4}\u{017E}e");
In Python 3:
print("pom" + u"\u00F4" + u"\u017E" + "e")
In Python 2:
print("pom" + u"\u00F4".encode('utf-8') + u"\u017E".encode('utf-8') + "e")
Python 2.x docs
Python 3.x docs
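As an aside (this is my own sketch, not part of the answer above): for this particular Facebook-export mojibake, another commonly suggested route is to re-encode the already-decoded string as Latin-1 and decode it again as UTF-8, which reverses the double encoding:
import json

# Minimal sketch, assuming the messages.json file shown in the question.
# The escapes \u00c3\u00b4\u00c5\u00be are the UTF-8 bytes of "ô" and "ž" read as
# Latin-1 code points, so a Latin-1 round trip undoes the damage.
with open("messages.json", "r") as f:
    messages = json.load(f)

content = messages["messages"][0]["content"]
fixed = content.encode("latin-1").decode("utf-8")
print(fixed)  # Dúfam, že pomôže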

Related

Eliminating stray " characters from a JSON file so that they don't interrupt the string [duplicate]

While trying to parse JSON from an AJAX request, the string returned contains invalid JSON.
Although the best practice would be to change the server to reply with valid JSON, as suggested in multiple related answers, this is not an option.
Trying to solve this problem using python, I looked at regular expressions.
The main problem is elements like the following (which I currently use as a test string):
testStr = '{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}'
I currently use the following code:
jsonString = re.sub(r'(?<=\w)\"(?=[^\(\:\}\,])','\\"',testStr)
jsonString = re.sub(r'\"\"(?![,}:])','\"\\\"',jsonString)
with very limited success.
If I were using C, I would parse the string and simply escape all double quotes within the element (i.e. between all double quotes which are preceded by [:{},]).
There must be a Pythonic way to parse this without resorting to a for loop, looking ahead, and keeping history.
EDIT:
Assuming that strings do not contain: [ : { } ]
And also assuming that the unescaped double quotes are only within the value, and not in the key,
Then I assume that the following (or something similar) should solve the problem:
import re
re.sub(r'(?<![\[\:])\"(?![,\}])', r'\"', testString)
But it still does not work.
Seems I needed a break to solve this.
The following regular expression seems to replace only double quotes that are contained within the element string (with the assumptions I stated in the question):
output = re.sub(r'(?<![\[\:\{\,])\"(?![\:\}\,])','\\\"', stringName)
I have created a sandbox here: https://repl.it/vNK
Example Output:
Original String:
{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}
Modified String:
{"KEY1":"THIS IS \"AN\" ELEMENT","KEY2":"\"\"THIS IS ANOTHER \"ELEMENT\""}
Parsed JSON:
{
    "KEY1": "THIS IS \"AN\" ELEMENT",
    "KEY2": "\"\"THIS IS ANOTHER \"ELEMENT\""
}
Any suggestions are welcome.
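A runnable sketch of that final expression, using the test string from the question (my own illustration):
import json
import re

testStr = '{"KEY1":"THIS IS "AN" ELEMENT","KEY2":"""THIS IS ANOTHER "ELEMENT""}'

# Escape only the double quotes that sit inside element values, i.e. the ones
# not adjacent to the structural characters [ : { } ,
output = re.sub(r'(?<![\[\:\{\,])\"(?![\:\}\,])', '\\\"', testStr)

print(output)
print(json.loads(output))
# {'KEY1': 'THIS IS "AN" ELEMENT', 'KEY2': '""THIS IS ANOTHER "ELEMENT""'}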

Convert serialized protobuf output to python dictionary

Given a serialized protobuf (protocol buffer) output in string format, I want to convert it to a Python dictionary.
Suppose this is the serialized protobuf, given as a Python string:
person {
    info {
        name: John
        age: 20
        website: "https://mywebsite.com"
        eligible: True
    }
}
I want to convert the above Python string to a Python dictionary, given as:
data = {
    "person": {
        "info": {
            "name": "John",
            "age": 20,
            "website": "https://mywebsite.com",
            "eligible": True,
        }
    }
}
I can write a python script to do the conversion, as follows:
Append commas on every line not ending with curly brackets.
Add an extra colon before the opening curly bracket.
Surround every individual key and value pair with quotes.
Finally, use the json.loads() method to convert it to a Python dictionary.
I wonder whether this conversion can be achieved using a simpler or a standard method, already available in protocol buffers. So, apart from the manual scripting using the steps I mentioned above, is there a better or a standard method available to convert the serialized protobuf output to a python dictionary?
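If the original .proto definition is available, the protobuf library's own helpers can do this without manual string munging. A minimal sketch, assuming a generated message class (here called ExampleMessage, a hypothetical name) that matches the text above:
from google.protobuf import text_format
from google.protobuf.json_format import MessageToDict

# "example_pb2" and "ExampleMessage" are hypothetical; they stand in for the
# module and message class generated by protoc from the original .proto file.
from example_pb2 import ExampleMessage

serialized = """
person {
  info {
    name: "John"
    age: 20
  }
}
"""

msg = ExampleMessage()
text_format.Parse(serialized, msg)  # parse the text-format protobuf
data = MessageToDict(msg)           # convert the message to a plain dict
print(data)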

JSONDecodeError; Invalid /escape when parsing from Python

After running my object detection model, it outputs a .json file with the results. In order to actually use the results of the model in my Python code I need to parse the .json file, but nothing I have tried works. When I simply open it and try to print the results, I get the error:
json.decoder.JSONDecodeError: Invalid \escape: line 4 column 41 (char 60)
If you have any idea what I did wrong, the help would be very much appreciated. My code:
import json

with open(r'C:\Yolo_v4\darknet\build\darknet\x64\result.json') as result:
    data = json.load(result)
    result.close()
print(data)
My .json file
[
    {
        "frame_id":1,
        "filename":"C:\Yolo_v4\darknet\build\darknet\x64\f047.png",
        "objects": [
            {"class_id":32, "name":"right", "relative_coordinates":{"center_x":0.831927, "center_y":0.202225, "width":0.418463, "height":0.034752}, "confidence":0.976091},
            {"class_id":19, "name":"h", "relative_coordinates":{"center_x":0.014761, "center_y":0.873551, "width":0.041723, "height":0.070544}, "confidence":0.484339},
            {"class_id":24, "name":"left", "relative_coordinates":{"center_x":0.285694, "center_y":0.200752, "width":0.619584, "height":0.032149}, "confidence":0.646595},
        ]
    }
]
(There are several more detected objects, but I did not include them.)
The other responders are of course right. This is not valid JSON. But sometimes you don't have the option to change the format, e.g. because you are working with a broken data dump where the original source is no longer available.
The only way to deal with that is to sanitize it somehow. This is of course not ideal, because you have to put a lot of expectations into your sanitizer code, i.e. you need to know exactly what kind of errors the json file has.
However, a solution using regular expressions could look like this:
import json
import re

class LazyDecoder(json.JSONDecoder):
    def decode(self, s, **kwargs):
        regex_replacements = [
            # double up lone backslashes so they become valid JSON escapes
            (re.compile(r'([^\\])\\([^\\])'), r'\1\\\\\2'),
            # drop a trailing comma before a closing bracket
            (re.compile(r',(\s*])'), r'\1'),
        ]
        for regex, replacement in regex_replacements:
            s = regex.sub(replacement, s)
        return super().decode(s, **kwargs)

with open(r'C:\Yolo_v4\darknet\build\darknet\x64\result.json') as result:
    data = json.load(result, cls=LazyDecoder)

print(data)
This works by subclassing the standard JSONDecoder and passing it to json.load via the cls argument.
Hi, you need to use double backslashes, remove the last comma in the objects property, and finally you don't need to close the file inside the with block:
[
    {
        "frame_id":1,
        "filename":"C:\\Yolo_v4\\darknet\\build\\darknet\\x64\\f047.png",
        "objects": [
            {"class_id":32, "name":"right", "relative_coordinates":{"center_x":0.831927, "center_y":0.202225, "width":0.418463, "height":0.034752}, "confidence":0.976091},
            {"class_id":19, "name":"h", "relative_coordinates":{"center_x":0.014761, "center_y":0.873551, "width":0.041723, "height":0.070544}, "confidence":0.484339},
            {"class_id":24, "name":"left", "relative_coordinates":{"center_x":0.285694, "center_y":0.200752, "width":0.619584, "height":0.032149}, "confidence":0.646595}
        ]
    }
]
The "\" character is used not only in Windows filepaths but also as an escape character for things like newlines (you can use "\n" instead of an actual newline, for example). To escape the escape, you simply have to put a second backslash before it, like this:
"C:\\Yolo_v4\\darknet\\build\\darknet\\x64\\f047.png"
As someone said in the comments, json.dump should do this automatically for you, so it sounds like something internal is messed up (unless this wasn't created using that).
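For illustration (a quick sketch of that point, not from the original answer): json.dumps escapes the backslashes automatically when it writes the string out:
import json

# The Python string below holds single backslashes at runtime; json.dumps
# doubles them in the output so the result is valid JSON.
record = {"filename": "C:\\Yolo_v4\\darknet\\build\\darknet\\x64\\f047.png"}
print(json.dumps(record))
# {"filename": "C:\\Yolo_v4\\darknet\\build\\darknet\\x64\\f047.png"}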
This function uses the encode method to encode the string as UTF-8, and the decode method with the unicode_escape codec to decode it, resolving any escape sequences in the process:
def remove_escape_sequences(string):
    return string.encode('utf-8').decode('unicode_escape')
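A possible usage sketch (my own illustration; the input is a made-up string that literally contains \uXXXX escape text):
# The doubled backslashes build a string that literally contains \u escapes.
print(remove_escape_sequences("pom\\u00f4\\u017ee"))  # pomôže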

Saving python variable with new lines in JSON with pretty print

I am reading this text from a CSV file in Python.
Hi there,
This is a test.
and storing it in a variable text.
I am trying to write this variable to a JSON file with json.dump(), but it gets transformed into:
' \ufeffHi there,\n\n\xa0\n\nThis is a test.
How can I make my JSON file look like the one below?:
{
    "text": "Hi there,
This is a test."
}
JSON does not allow real line breaks inside strings. If you still want to use them, you will have to write your own "json" writer.
Edit: Here's a function that will take a Python dict (which you can get using json.loads()) and print it the way you need:
def print_wrong_json(dict_object):
    print('{')
    print(',\n'.join(['"{}": "{}"'.format(key, dict_object[key]) for key in dict_object]))
    print('}')
Well, it can be done, as user1308345 shows in his answer, but it wouldn't be valid JSON anymore and you would probably run into issues later when deserializing it.
But if you really want to keep the line structure and still have valid JSON, you could split the string on the newlines (removing them) and serialize the lines as an array, as suggested in this answer: https://stackoverflow.com/a/7744658/1466757
Your JSON would then look similar to this:
{
    "text": [
        "Hi there,",
        "",
        "",
        "",
        "this is a test."
    ]
}
After deserializing it, you would have to put the line breaks back in.
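A minimal sketch of that array-based approach (my own illustration; the file name out.json is arbitrary):
import json

text = "Hi there,\n\nThis is a test."

# Store the lines as a JSON array so the file stays valid JSON.
with open("out.json", "w", encoding="utf-8") as f:
    json.dump({"text": text.split("\n")}, f, indent=4)

# When reading it back, join the lines to restore the line breaks.
with open("out.json", encoding="utf-8") as f:
    restored = "\n".join(json.load(f)["text"])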

How to fix mixed JSON encoded strings

I face the following problem: I have JSON strings where inner arrays/objects are sometimes written as an escaped string and sometimes not. For instance, I have
{ "author": "Jack",
"meta": ["a", "b"]}
and a bad one:
{ "author": "Jack",
"meta": "[\"a\", \"b\"]"}
If I parse the latter one, I will only get a string for the meta property. This can be fixed by passing the meta property through a JSON parser again. The problem, however, is that if I pass it through JSON.parse (Ruby) or json.loads (Python), the value might not be an escaped JSON string at all but something like the plain string "15.3", which results in an error.
So how can I intelligently detect whether a value needs to go through JSON.parse again? Should I simply try-catch this situation?
It really depends on the kind of double-encoded data you're dealing with, but testing the first character might be sufficient: if it's [ or {, you could try to decode it with JSON and, if successful, substitute the decoded value.
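A minimal sketch of that idea in Python (my own illustration, not taken from the answer itself):
import json

def maybe_decode(value):
    # Only string values that look like serialized JSON containers get a second pass.
    if isinstance(value, str) and value[:1] in ("[", "{"):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            pass  # not double-encoded after all; keep the original string
    return value

doc = json.loads('{"author": "Jack", "meta": "[\\"a\\", \\"b\\"]"}')
doc = {key: maybe_decode(val) for key, val in doc.items()}
print(doc)  # {'author': 'Jack', 'meta': ['a', 'b']}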
