How to Fix JSON Key Values without double-quotes? - python

I currently have JSON in the format below.
Some of the keys are NOT properly formatted: they are missing double quotes (").
How do I fix these keys so that they are double-quoted?
{
Name: "test",
Address: "xyz",
"Age": 40,
"Info": "test"
}
Required:
{
"Name": "test",
"Address": "xyz",
"Age": 40,
"Info": "test"
}
Using the below post, I was able to find such key values in the above INVALID JSON.
However, I could NOT find an efficient way to replace these found values with double-quotes.
import re
s = "Example: String"
out = re.findall(r'\w+:', s)
How to Escape Double Quote inside JSON

Using Regex:
import re
data = """{ Name: "test", Address: "xyz"}"""
print(re.sub(r"(\w+):", r'"\1":', data))
Output:
{ "Name": "test", "Address": "xyz"}
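Note that the pattern above quotes every word followed by a colon, including ones inside string values (a time such as "12:30", for instance). Here is a minimal sketch of a somewhat stricter variant, assuming keys are plain identifiers that directly follow { or a comma. The names KEY_PATTERN and quote_bare_keys are made up for this example.

```python
import json
import re

# Quote a bare identifier only when it sits in key position:
# right after '{' or ',' and right before ':'.
KEY_PATTERN = re.compile(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)')

def quote_bare_keys(text):
    """Wrap unquoted keys in double quotes; already quoted keys are left alone."""
    return KEY_PATTERN.sub(r'\1"\2"\3', text)

data = '{ Name: "test", Address: "xyz", "Age": 40, "Info": "12:30" }'
fixed = quote_bare_keys(data)
parsed = json.loads(fixed)
print(parsed)
# {'Name': 'test', 'Address': 'xyz', 'Age': 40, 'Info': '12:30'}
```

It can still misfire on a string value that itself contains text like ", word:", so treat it as a stopgap rather than a parser.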

You can use PyYAML. Since JSON is (very nearly) a subset of YAML, a YAML parser can often cope with the missing quotes.
Example
import yaml
dirty_json = """
{
key: "value",
"key2": "value"
}
"""
yaml.load(dirty_json, yaml.SafeLoader)

I ran into a few more issues with my JSON, so I thought I would share the final solution that worked for me.
import re
jsonStr = re.sub(r"((?=\D)\w+):", r'"\1":', jsonStr)
jsonStr = re.sub(r": ((?=\D)\w+)", r':"\1"', jsonStr)
The first line fixes the missing double quotes around a key, e.g.
Name: "test"
The second line fixes the missing double quotes around a value, e.g. "Info": test
Because both patterns require the word to start with a non-digit, they also leave purely numeric timestamps that contain : (colon), such as 14:59:58, untouched.
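To see the two substitutions working together, here is a small self-contained run on a made-up input (the sample string and its "When" key are assumptions for illustration only). Two limits are worth knowing: the second pattern also quotes bare true, false and null values, and an ISO timestamp with a T separator (for example T14:) would still be caught by the first pattern.

```python
import json
import re

# Made-up sample: one unquoted key, one unquoted value, one numeric timestamp.
json_str = '{Name: "test", "Info": test, "When": "2021-01-05 14:59:58"}'

# Quote keys: a word that starts with a non-digit and is followed by ':'.
json_str = re.sub(r"((?=\D)\w+):", r'"\1":', json_str)
# Quote values: a word that starts with a non-digit and follows ': '.
json_str = re.sub(r": ((?=\D)\w+)", r':"\1"', json_str)

parsed = json.loads(json_str)
print(parsed)
# {'Name': 'test', 'Info': 'test', 'When': '2021-01-05 14:59:58'}
```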

You can use an online formatter. I know most of them throw an error when the double quotes are missing, but the one below seems to handle it nicely!
JSON Formatter

The regex approach can be brittle. I suggest you find a library that can parse the JSON text that is missing quotes.
For example, in Kotlin 1.4, the standard way to parse a JSON string is using Json.decodeFromString. However, you can use Json { isLenient = true }.decodeFromString to relax the requirements for quotes. Here is a complete example in JUnit.
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json
import org.junit.jupiter.api.Assertions
import org.junit.jupiter.api.Test
@Serializable
data class Widget(val x: Int, val y: String)
class JsonTest {
    @Test
    fun `Parsing Json`() {
        val w: Widget = Json.decodeFromString("""{"x":123, "y":"abc"}""")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }
    @Test
    fun `Parsing Json missing quotes`() {
        // Json.decodeFromString("{x:123, y:abc}") failed to decode due to missing quotes
        val w: Widget = Json { isLenient = true }.decodeFromString("{x:123, y:abc}")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }
}

Related

How to convert JSON string (with double quotes in its values) to python dictionary

I have some JSON files like this:
{
"@context": "http://schema.org",
"@type": "Product",
"name": "ADIZERO ADIOS PRO 2 Löparskor",
"@id": "adidas-adizero-adios-pro-2-loparskor",
"color": "Lila",
"description": "Example text "Best Comfort" an other example text.",
"brand": {
"@type": "Thing",
"name": "adidas"
},
"audience": {
"@type": "Audience",
"name": "Herr, Dam"
}
}
I know this is not valid JSON, since the description field contains unescaped double quotes (" "), but how can I manipulate this string with Python so that I can use json.loads()?
I'm thinking about using a regular expression to remove (or escape) these inner double quotes. Is that possible?
BTW: It's not possible to change the source JSON files.
The Right Answer -- do it right, GIGO, etc.
Definitely fix the problem at the source. Or else you're just plugging holes in a leaking dam. Who knows what other special characters might appear in the text, in the future if the program receives new input?
The Pragmatic Answer -- you asked for it
Here's an example Python script that takes the "bad JSON" on stdin and produces (hopefully) valid JSON output on stdout:
import sys
import re
def main():
    for line in sys.stdin:
        # Replace content where the property value has invalid
        # double quotes that were supposed to be part of the string
        # with properly quoted double quotes.
        # Only lines that end in `",` are touched.
        line = re.sub(r'(: ")(.*)(",)$', replacer, line)
        sys.stdout.write(line)
def replacer(match):
    before = match.group(1)
    string_to_fix = match.group(2)
    after = match.group(3)
    return before + escape_quotes(string_to_fix) + after
def escape_quotes(s):
    # Escape backslashes first, so the backslashes added for the quotes are not doubled.
    return s.replace('\\', '\\\\').replace('"', '\\"')
main()
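If the text is already in memory rather than arriving on stdin, the same idea can be sketched as a function applied before json.loads(). This is only an illustration under assumptions: escape_inner_quotes is a made-up name, each property is assumed to sit on its own line, and the trailing comma is made optional here (unlike the script above) so that the last property of an object is covered too.

```python
import json
import re

def escape_inner_quotes(text):
    """Escape backslashes and double quotes inside one-line string values."""
    def fix(match):
        value = match.group(2).replace('\\', '\\\\').replace('"', '\\"')
        return match.group(1) + value + match.group(3)
    # Per line: `: "` opens the value, the last `"` (plus optional comma) closes it.
    return re.sub(r'(: ")(.*)(",?)$', fix, text, flags=re.MULTILINE)

bad = '''{
"color": "Lila",
"description": "Example text "Best Comfort" an other example text.",
"name": "adidas"
}'''

parsed = json.loads(escape_inner_quotes(bad))
print(parsed["description"])
# Example text "Best Comfort" an other example text.
```

Values that are already correctly escaped would be escaped a second time, so this only suits input where the quotes are known to be raw.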

How to remove all the "$oid" and "$date" in a .json file?

I have a .json file saved in my computer that contains things like $oid or $date which will later cause me trouble in BigQuery. For example:
{
"_id": {
"$oid": "5e7511c45cb29ef48b8cfcff"
},
"about": "some text",
"creationDate": {
"$date": "2021-01-05T14:59:58.046Z"
}
}
I want it to look like (so it’s not just removing some letters from the string):
{
"_id": "5e7511c45cb29ef48b8cfcff",
"about": "some text",
"creationDate": "2021-01-05T14:59:58.046Z"
}
With PyMongo, one can do something like:
my_file['_id']=my_file['_id']['$oid']
my_file['creationDate']=my_file['creationDate']['$date']
How would this look without using Pymongo, since I want to first find such keys and remove all the problematic $oid or $date?
Edit: sorry for the bad wording, what I meant to say was whether it was possible to find the keys that contain these problematic $ without writing down every key in the dictionary. In reality, there are more files with huge tables and many of them can contain this.
The $oid and $date fields appear when you use the default encoder using bson.json_util.dumps().
If you have control over where these files come from, you might want to fix the "problem" at source rather than having to code around it. The following code snippet shows how you can implement a custom encoder to format the output how you need it:
import json
import datetime
from pymongo import MongoClient
class MyJsonEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        if hasattr(obj, '__str__'): # This will handle ObjectIds
            return str(obj)
        return super(MyJsonEncoder, self).default(obj)
db = MongoClient()['mydatabase']
db.mycollection.insert_one({'Date': datetime.datetime.now()})
record = db.mycollection.find_one()
print(json.dumps(record, indent=4, cls=MyJsonEncoder))
prints:
{
"_id": "60a55e3cea5bf57c79177871",
"Date": "2021-05-19T19:51:40.808000"
}
I would try something as shown below.
import json
with open('data.json', 'r') as file:
    data = json.load(file)
for k, v in data.items():
    # check if the key has a non-empty dict value
    if isinstance(v, dict) and v:
        # take the first key of the nested dict
        r = list(v.keys())[0]
        # replace the nested dict with its value if that key starts with $
        if r.startswith('$'):
            data[k] = v[r]
print(data)
This produces the following output:
{'_id': '5e7511c45cb29ef48b8cfcff', 'about': 'some text', 'creationDate': '2021-01-05T14:59:58.046Z'}
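The loop above only looks at top-level keys. For files where the $-wrapped values may be nested at any depth, a recursive walk is one option. This is a minimal sketch, assuming that every single-key dict whose key starts with $ should be replaced by its value; unwrap_extended_json is a made-up name, and the "items" field in the sample is invented to show nesting.

```python
import json

def unwrap_extended_json(value):
    """Recursively replace {"$something": x} wrappers with x."""
    if isinstance(value, dict):
        if len(value) == 1:
            key, inner = next(iter(value.items()))
            if key.startswith("$"):
                return unwrap_extended_json(inner)
        return {k: unwrap_extended_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap_extended_json(item) for item in value]
    return value

raw = '''{
  "_id": {"$oid": "5e7511c45cb29ef48b8cfcff"},
  "about": "some text",
  "creationDate": {"$date": "2021-01-05T14:59:58.046Z"},
  "items": [{"ref": {"$oid": "abc"}}]
}'''

clean = unwrap_extended_json(json.loads(raw))
print(json.dumps(clean, indent=2))
```

No key has to be listed by hand, which matters when there are many files with different layouts.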

Getting TypeError while splitting JSON data?

I have JSON data like this:
json_data = '{"data":"[{"Date":"3/17/2017","Steam Total":60},{"Date":"3/18/2017","Steam Total":15},{"Date":"3/19/2017","Steam Total":1578},{"Date":"3/20/2017","Steam Total":1604}]", "data_details": "{"data_key":"Steam Total", "given_high":"1500", "given_low":"1000", "running_info": []}"}'
json_input_data = json_data["data"]
json_input_additional_info = json_data["data_details"]
I am getting an error:
TypeError: string indices must be integers, not str
I think there is an error in the JSON data. Can someone help me with this?
Your code has two issues.
In the line json_input_data = json_data["data"], the variable json_data is a string, not a JSON object, so you are trying to index a string with a string key. To get an object from a JSON string, use the json module (json.loads).
Your JSON string also isn't valid. This is a valid version:
{"data":[{"Date":"3/17/2017","Steam Total":60},{"Date":"3/18/2017","Steam Total":15},{"Date":"3/19/2017","Steam Total":1578},{"Date":"3/20/2017","Steam Total":1604}], "data_details": {"data_key":"Steam Total", "given_high":"1500", "given_low":"1000", "running_info": []}}
With this string parsed by json.loads, your code works fine.
Try parsing your json_data first (in Python, with json.loads(json_data)). Currently its type is string, which is exactly what your error says.
As Pongpira Upra pointed out, your JSON is not well formed and should be something like this.
{
"data":[
{
"Date":"3/17/2017",
"Steam Total":60
},
{
"Date":"3/18/2017",
"Steam Total":15
},
{
"Date":"3/19/2017",
"Steam Total":1578
},
{
"Date":"3/20/2017",
"Steam Total":1604
}
],
"data_details":{
"data_key":"Steam Total",
"given_high":"1500",
"given_low":"1000",
"running_info":[]
}
}
In order to retrieve information from the parsed object, you should write
json_data["data"][0]["Date"]
This would give "3/17/2017"
You declare a string called json_data and, well, then it acts like a string. That is what the exception tells you. Like others here tried to say - you do also have an error in your data, but the exception you supplied is due to accessing the string as if it was a dictionary. You need to add a missing call to e.g. json.loads(...).
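Putting both points together, a minimal sketch could look like the following. It assumes the corrected JSON text (nested values not wrapped in extra quotes) and uses a new name, parsed, for the decoded object.

```python
import json

# Corrected JSON text: the list and the object are no longer wrapped in quotes.
json_data = (
    '{"data": [{"Date": "3/17/2017", "Steam Total": 60},'
    ' {"Date": "3/18/2017", "Steam Total": 15},'
    ' {"Date": "3/19/2017", "Steam Total": 1578},'
    ' {"Date": "3/20/2017", "Steam Total": 1604}],'
    ' "data_details": {"data_key": "Steam Total", "given_high": "1500",'
    ' "given_low": "1000", "running_info": []}}'
)

# json.loads turns the string into a dict, which can be indexed by key.
parsed = json.loads(json_data)
json_input_data = parsed["data"]
json_input_additional_info = parsed["data_details"]

print(json_input_data[0]["Date"])
# 3/17/2017
```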
You were right. Your JSON is indeed wrong.
Can you try using this json?
{
"data":[
{
"Date":"3/17/2017",
"Steam Total":60
},
{
"Date":"3/18/2017",
"Steam Total":15
},
{
"Date":"3/19/2017",
"Steam Total":1578
},
{
"Date":"3/20/2017",
"Steam Total":1604
}
],
"data_details":{
"data_key":"Steam Total",
"given_high":"1500",
"given_low":"1000",
"running_info":[]
}
}

How can I load a string that looks like json? [duplicate]

I wonder if there is a way to decode a JSON-like string.
I have this string:
'{ hotel: { id: "123", name: "hotel_name"} }'
It's not a valid JSON string, so I can't decode it directly with the python API.
Python will only accept a stringified JSON string like:
'{ "hotel": { "id": "123", "name": "hotel_name"} }'
where properties are quoted to be a string.
Use the demjson module, which can decode in non-strict mode.
In [1]: import demjson
In [2]: demjson.decode('{ hotel: { id: "123", name: "hotel_name"} }')
Out[2]: {u'hotel': {u'id': u'123', u'name': u'hotel_name'}}
You could try and use a wrapper for a JavaScript engine, like pyv8.
import PyV8
ctx = PyV8.JSContext()
ctx.enter()
# Note that we need to insert an assignment here ('a ='), or syntax error.
js = 'a = ' + '{ hotel: { id: "123", name: "hotel_name"} }'
a = ctx.eval(js)
a.hotel.id
>> '123' # Prints
@vartec has already pointed out demjson, which works well for slightly invalid JSON. For data that's even less JSON compliant I've written barely_json:
from barely_json import parse
print(parse('[no, , {complete: yes, where is my value?}]'))
prints
[False, '', {'complete': True, 'where is my value?': ''}]
Not very elegant and not robust (and easy to break), but it may be possible to kludge it with something like:
kludged = re.sub('(?i)([a-z_].*?):', r'"\1":', string)
# { "hotel": { "id": "123", "name": "hotel_name"} }
You may find that using pyparsing and the parsePythonValue.py example could do what you want as well... (or modified fairly easily to do so) or the jsonParser.py could be modified to not require quoted key values.
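If installing an extra module is not an option, a standard-library sketch that is a little less fragile than the kludge above is to let the pattern consume whole string literals first, so colons inside values are never touched. This assumes keys are plain identifiers and string values use double quotes; TOKEN and quote_keys are made-up names for this example.

```python
import json
import re

# Alternative 1 matches a complete double-quoted string (kept as is).
# Alternative 2 matches a bare identifier that is followed by ':' (gets quoted).
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|([A-Za-z_]\w*)(?=\s*:)')

def quote_keys(text):
    def fix(match):
        if match.group(1):
            return '"%s"' % match.group(1)
        return match.group(0)
    return TOKEN.sub(fix, text)

text = '{ hotel: { id: "123", name: "hotel_name", note: "a, b: c" } }'
parsed = json.loads(quote_keys(text))
print(parsed["hotel"]["note"])
# a, b: c
```

Unquoted string values, single-quoted strings and trailing commas are still not handled; for those, a lenient parser such as the ones mentioned above is the better fit.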

