How to remove all the “$oid” and "$date" in a .json file? - python

I have a .json file saved on my computer that contains things like $oid or $date which will later cause me trouble in BigQuery. For example:
{
    "_id": {
        "$oid": "5e7511c45cb29ef48b8cfcff"
    },
    "about": "some text",
    "creationDate": {
        "$date": "2021-01-05T14:59:58.046Z"
    }
}
I want it to look like this (so it’s not just removing some letters from the string):
{
    "_id": "5e7511c45cb29ef48b8cfcff",
    "about": "some text",
    "creationDate": "2021-01-05T14:59:58.046Z"
}
With Pymongo, one can do something like:
my_file['_id'] = my_file['_id']['$oid']
my_file['creationDate'] = my_file['creationDate']['$date']
How would this look without using Pymongo, since I want to first find such keys and remove all the problematic $oid or $date?
Edit: sorry for the bad wording. What I meant to ask was whether it is possible to find the keys that contain these problematic $ fields without writing down every key in the dictionary. In reality, there are more files with huge tables, and many of them can contain these fields.

The $oid and $date fields appear when you use the default encoder using bson.json_util.dumps().
If you have control over where these files come from, you might want to fix the "problem" at source rather than having to code around it. The following code snippet shows how you can implement a custom encoder to format the output how you need it:
import json
import datetime
from pymongo import MongoClient
class MyJsonEncoder(json.JSONEncoder):
    def default(self, obj):
        # Dates become ISO 8601 strings
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        # Every object defines __str__, so this is a catch-all that also handles ObjectIds
        if hasattr(obj, '__str__'):
            return str(obj)
        return super(MyJsonEncoder, self).default(obj)
db = MongoClient()['mydatabase']
db.mycollection.insert_one({'Date': datetime.datetime.now()})
record = db.mycollection.find_one()
print(json.dumps(record, indent=4, cls=MyJsonEncoder))
prints:
{
"_id": "60a55e3cea5bf57c79177871",
"Date": "2021-05-19T19:51:40.808000"
}

I would try something like this:
import json
with open('data.json', 'r') as file:
    data = json.load(file)
for k, v in data.items():
    # check whether the value is a non-empty dict
    if isinstance(v, dict) and v:
        # take the first key of the nested dict
        r = list(v.keys())[0]
        # replace the wrapper with its value if the key starts with $
        if r[0] == '$':
            data[k] = v[r]
print(data)
This gives the following output:
{'_id': '5e7511c45cb29ef48b8cfcff', 'about': 'some text', 'creationDate': '2021-01-05T14:59:58.046Z'}
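The loop above only looks at top-level keys. Since the question mentions huge tables, here is a minimal recursive sketch that unwraps any single-key dict whose key starts with $, at any depth and inside lists. The function name strip_dollar_wrappers and the "comments" field in the sample are made up for illustration, and it assumes every $-wrapper has exactly one key.

```python
import json

def strip_dollar_wrappers(value):
    """Recursively replace {"$something": x} wrappers with x."""
    if isinstance(value, dict):
        if len(value) == 1:
            key, inner = next(iter(value.items()))
            if key.startswith("$"):
                return strip_dollar_wrappers(inner)
        return {k: strip_dollar_wrappers(v) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_dollar_wrappers(item) for item in value]
    return value  # strings, numbers, booleans, None

# Sample data; the "comments" list is a hypothetical nested case.
raw = '''
{
  "_id": {"$oid": "5e7511c45cb29ef48b8cfcff"},
  "about": "some text",
  "creationDate": {"$date": "2021-01-05T14:59:58.046Z"},
  "comments": [
    {"author": {"$oid": "5e7511c45cb29ef48b8cfd00"}, "text": "nested"}
  ]
}
'''
cleaned = strip_dollar_wrappers(json.loads(raw))
print(json.dumps(cleaned, indent=2))
```

To process a file, load it with json.load(), pass the result through the function, and write it back with json.dump().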

Related

How do I automate finding and replacing a JSON attribute?

This is an example of a JSON database that I will work with in my Python code.
{
    "name1": {
        "file": "abc",
        "delimiter": "n"
    },
    "name2": {
        "file": "def",
        "delimiter": "n"
    }
}
Pretend that a user of my code presses a GUI button that is supposed to change the name of "name1" to whatever the user typed into a textbox.
How do I change "name1" to a custom string without manually copying and pasting the entire JSON database into my actual code? I want the code to load the JSON database and change the name by itself.
Load the JSON object into a dict. Grab the name1 entry. Create a new entry with the desired key and the same value. Delete the original entry. Dump the dict back to your JSON file.
This is likely not the best way to perform the task. Use sed on Linux or its Windows equivalent (depending on your loaded apps) to make the simple stream-edit change.
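If I understand the task correctly, here is an example:
import json
user_input = input('Name: ')
# read the database, closing the file when done
with open("db.json") as f:
    db = json.load(f)
# move the value stored under 'name1' to the new key
db[user_input] = db.pop('name1')
with open("db.json", 'w') as f:
    json.dump(db, f)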
You can use the object_hook parameter that json.loads() accepts to detect JSON objects (dictionaries) that have an entry associated with the old key and re-associate its value with the new key as they are encountered.
This can be implemented as a function as follows:
import json
def replace_key(json_repr, old_key, new_key):
    def decode_dict(a_dict):
        try:
            entry = a_dict.pop(old_key)
        except KeyError:
            pass  # Old key not present - no change needed.
        else:
            a_dict[new_key] = entry
        return a_dict
    return json.loads(json_repr, object_hook=decode_dict)
data = '''{
    "name1": {
        "file": "abc",
        "delimiter": "n"
    },
    "name2": {
        "file": "def",
        "delimiter": "n"
    }
}
'''
new_data = replace_key(data, 'name1', 'custom string')
print(json.dumps(new_data, indent=4))
Output:
{
    "name2": {
        "file": "def",
        "delimiter": "n"
    },
    "custom string": {
        "file": "abc",
        "delimiter": "n"
    }
}
I got the basic idea from @Mike Brennan's answer to another JSON-related question: How to get string objects instead of Unicode from JSON?

Python conditional get value from dict?

Given this json string, how can I pull out the value of id if code equals 4003?
error_json = '''{
'error': {
'meta': {
'code': 4003,
'message': 'Tracking already exists.',
'type': 'BadRequest'
},
'data': {
'tracking': {
'slug': 'fedex',
'tracking_number': '442783308929',
'id': '5b59ea69a9335baf0b5befcf',
'created_at': '2018-07-26T15:36:09+00:00'
}
}
}
}'''
I can't assume that anything other than the error element exists at the beginning, so the meta and code elements may or may not be there. The data, tracking, and id may or may not be there either.
The question is how to extract the value of id if the elements are all there. If any of the elements are missing, the value of id should be None
A python dictionary has a get(key, default) method that supports returning a default value if a key is not found. You can chain empty dictionaries to reach nested elements.
# Assumes error_json has already been parsed into a dict (e.g. with json.loads()).
# Use the get method to access keys without raising exceptions
# see https://docs.quantifiedcode.com/python-anti-patterns/correctness/not_using_get_to_return_a_default_value_from_a_dictionary.html
code = error_json.get('error', {}).get('meta', {}).get('code', None)
_id = None
if code == 4003:
    # named _id rather than id to avoid shadowing the builtin id()
    # see https://docs.quantifiedcode.com/python-anti-patterns/correctness/assigning_to_builtin.html
    _id = error_json.get('error', {}).get('data', {}).get('tracking', {}).get('id', None)
Now, given a string that looks like a JSON you can parse it into a dictionary using json.loads(), as shown in Parse JSON in Python
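If the chained .get() calls get repetitive, the same idea can be wrapped in a small helper. This is only a sketch: dig and tracking_id are names I made up, and the sample payload is trimmed down from the one in the question (with double quotes so that json.loads() accepts it).

```python
import json

def dig(obj, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj

def tracking_id(parsed):
    # Only return the id when the error code is 4003
    if dig(parsed, "error", "meta", "code") == 4003:
        return dig(parsed, "error", "data", "tracking", "id")
    return None

payload = json.loads(
    '{"error": {"meta": {"code": 4003},'
    ' "data": {"tracking": {"id": "5b59ea69a9335baf0b5befcf"}}}}'
)
print(tracking_id(payload))        # id is present and the code matches
print(tracking_id({"error": {}}))  # missing elements give None
```

Unlike the plain .get() chain, the isinstance check also covers the case where an intermediate value is not a dictionary.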
I would try this:
import json
error_json = '''{
"error": {
"meta": {
"code": 4003,
"message": "Tracking already exists.",
"type": "BadRequest"
},
"data": {
"tracking": {
"slug": "fedex",
"tracking_number": "442783308929",
"id": "5b59ea69a9335baf0b5befcf",
"created_at": "2018-07-26T15:36:09+00:00"
}
}
}
}'''
parsed_json = json.loads(error_json)
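try:
    # only print the id if code is an integer
    if parsed_json["error"]["meta"]["code"] == int(parsed_json["error"]["meta"]["code"]):
        print(str(parsed_json["error"]["data"]["tracking"]["id"]))
except (KeyError, TypeError, ValueError):
    # a key is missing, or a value has an unexpected type
    print("no soup for you")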
Output:
5b59ea69a9335baf0b5befcf
A lot of Python code follows the idea that it's easier to ask for forgiveness than permission. You could check whether each key is in the dictionary first, but it's easier to just try. I'm specifically checking that code is an int, but you could change that any way you'd like; if it can be other things, you'd have to adjust it. There are several different solutions to this, so it really comes down to whatever you feel most comfortable writing and maintaining.
You can add a check for key errors to the whole operation:
def get_id(j):
    try:
        if j['error']['meta']['code'] == 4003:
            return j['error']['data']['tracking']['id']
    except KeyError:
        pass
    return None
If any element is missing, this function will quietly return None. If all the required elements are present, it will return the required ID. The only time it could really fail is if one of the intermediate keys does not refer to a dictionary. That could potentially result in a TypeError.
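If you also want to cover the TypeError case mentioned above, one option is to catch both exceptions. A small sketch of that variant, with made-up inputs to show the difference:

```python
def get_id(j):
    try:
        if j['error']['meta']['code'] == 4003:
            return j['error']['data']['tracking']['id']
    except (KeyError, TypeError):
        # KeyError: a key is missing
        # TypeError: an intermediate value is not a dict (e.g. None or a string)
        pass
    return None

complete = {
    'error': {
        'meta': {'code': 4003},
        'data': {'tracking': {'id': '5b59ea69a9335baf0b5befcf'}},
    }
}
print(get_id(complete))                   # all elements present
print(get_id({'error': {'meta': None}}))  # 'meta' is not a dict
```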

How to Fix JSON Key Values without double-quotes?

I currently have JSON in the below format.
Some of the Key values are NOT properly formatted as they are missing double quotes (")
How do I fix these key values to have double-quotes on them?
{
Name: "test",
Address: "xyz",
"Age": 40,
"Info": "test"
}
Required:
{
"Name": "test",
"Address": "xyz",
"Age": 40,
"Info": "test"
}
Using the below post, I was able to find such key values in the above INVALID JSON.
However, I could NOT find an efficient way to replace these found values with double-quotes.
s = "Example: String"
out = re.findall(r'\w+:', s)
How to Escape Double Quote inside JSON
Using Regex:
import re
data = """{ Name: "test", Address: "xyz"}"""
# raw strings keep \w and \1 from being treated as string escapes
print(re.sub(r"(\w+):", r'"\1":', data))
Output:
{ "Name": "test", "Address": "xyz"}
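One thing to watch out for: the pattern above matches any word followed by a colon, including text inside string values. Here is a small sketch that shows the problem with a made-up "Time" field, together with a somewhat stricter pattern that only quotes a bare word when it directly follows { or a comma. The stricter pattern is my own assumption about what a key looks like; it can still be fooled by string values that contain text like ", word:".

```python
import json
import re

data = '{ Name: "test", Time: "12:30", "Age": 40 }'

# Naive pattern: quotes every word that is followed by a colon
naive = re.sub(r"(\w+):", r'"\1":', data)

# Stricter pattern: only quote an identifier that follows { or ,
stricter = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', data)

print(naive)
print(stricter)

try:
    json.loads(naive)
    naive_is_valid = True
except json.JSONDecodeError:
    naive_is_valid = False

parsed = json.loads(stricter)
print(naive_is_valid, parsed)
```

The naive result is no longer valid JSON because the 12: inside the time value gets quoted as well, while the stricter result parses.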
You can use PyYaml. Since JSON is a subset of Yaml, pyyaml may overcome the lack of quotes.
Example
import yaml
dirty_json = """
{
key: "value",
"key2": "value"
}
"""
yaml.load(dirty_json, yaml.SafeLoader)
I ran into a few more issues in my JSON, so I thought I'd share the final solution that worked for me.
jsonStr = re.sub(r"((?=\D)\w+):", r'"\1":', jsonStr)
jsonStr = re.sub(r": ((?=\D)\w+)", r':"\1"', jsonStr)
The first line fixes missing double quotes around a key, e.g.
Name: "test"
The second line fixes missing double quotes around a value, e.g. "Info": test
The above will also exclude double-quoting within date timestamps, which have : (colon) in them.
You can use an online formatter. I know most of them throw an error when double quotes are missing, but the one below seems to handle it nicely!
JSON Formatter
The regex approach can be brittle. I suggest you find a library that can parse the JSON text that is missing quotes.
For example, in Kotlin 1.4, the standard way to parse a JSON string is using Json.decodeFromString. However, you can use Json { isLenient = true }.decodeFromString to relax the requirements for quotes. Here is a complete example in JUnit.
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json
import org.junit.jupiter.api.Assertions
import org.junit.jupiter.api.Test
@Serializable
data class Widget(val x: Int, val y: String)
class JsonTest {
    @Test
    fun `Parsing Json`() {
        val w: Widget = Json.decodeFromString("""{"x":123, "y":"abc"}""")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }
    @Test
    fun `Parsing Json missing quotes`() {
        // Json.decodeFromString("{x:123, y:abc}") failed to decode due to missing quotes
        val w: Widget = Json { isLenient = true }.decodeFromString("{x:123, y:abc}")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }
}

Use Python and JSON to recursively get all keys associated with a value

Given data organized in JSON format (example below), how can we get the path of keys and sub-keys associated with a given value?
For example, given the input "23314" we need to return a list with:
Fanerozoico, Cenozoico, Quaternario, Pleistocenico, Superior.
Since the data is a JSON file, we decoded it using Python and the json library:
import json
def decode_crono(crono_file):
    with open(crono_file) as json_file:
        data = json.load(json_file)
From here on we do not know how to process it to get what we need.
We can access keys like this:
k = data["Fanerozoico"]["Cenozoico"]["Quaternario"]["Pleistocenico"].keys()
or values like this:
v = data["Fanerozoico"]["Cenozoico"]["Quaternario"]["Pleistocenico"]["Superior"].values()
but this is still far from what we need.
{
    "Fanerozoico": {
        "id": "20000",
        "Cenozoico": {
            "id": "23000",
            "Quaternario": {
                "id": "23300",
                "Pleistocenico": {
                    "id": "23310",
                    "Superior": {
                        "id": "23314"
                    },
                    "Medio": {
                        "id": "23313"
                    },
                    "Calabriano": {
                        "id": "23312"
                    },
                    "Gelasiano": {
                        "id": "23311"
                    }
                }
            }
        }
    }
}
It's a little hard to understand exactly what you are after here, but it seems like for some reason you have a bunch of nested json and you want to search it for an id and return a list that represents the path down the json nesting. If so, the quick and easy path is to recurse on the dictionary (that you got from json.load) and collect the keys as you go. When you find an 'id' key that matches the id you are searching for you are done. Here is some code that does that:
def all_keys(search_dict, key_id):
    def _all_keys(search_dict, key_id, keys=None):
        if not keys:
            keys = []
        for i in search_dict:
            if search_dict[i] == key_id:
                return keys + [i]
            if isinstance(search_dict[i], dict):
                potential_keys = _all_keys(search_dict[i], key_id, keys + [i])
                if 'id' in potential_keys:
                    keys = potential_keys
                    break
        return keys
    return _all_keys(search_dict, key_id)[:-1]
The reason for the nested function is to strip off the 'id' key that would otherwise be on the end of the list.
This is really just to give you an idea of what a solution might look like. Beware the python recursion limit!
Based on the assumption that you need the full dictionary path until a key named id has a particular value, here's a recursive solution that iterates the whole dict. Bear in mind that:
The code is not optimized at all
For huge json objects it might yield StackOverflow :)
It will stop at first encountered value found (in theory there shouldn't be more than 1 if the json is semantically correct)
The code:
import json
SEARCH_KEY_NAME = "id"
FOUND_FLAG = ()
CRONO_FILE = "a.jsn"
def decode_crono(crono_file):
    with open(crono_file) as json_file:
        return json.load(json_file)
def traverse_dict(dict_obj, value):
    for key in dict_obj:
        key_obj = dict_obj[key]
        if key == SEARCH_KEY_NAME and key_obj == value:
            return FOUND_FLAG
        # types.DictType exists only in Python 2; plain dict works everywhere
        elif isinstance(key_obj, dict):
            inner = traverse_dict(key_obj, value)
            if inner is not None:
                return (key,) + inner
    return None
if __name__ == "__main__":
    value = "23314"
    json_dict = decode_crono(CRONO_FILE)
    result = traverse_dict(json_dict, value)
    print(result)
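If the recursion depth is a concern for very deeply nested files, the same search can be done with an explicit stack instead of recursive calls. This is a rough sketch, not a drop-in replacement: find_path is a name I made up, and it assumes the ids live under a key called "id" as in the sample data.

```python
def find_path(tree, target, id_key="id"):
    """Return the list of keys leading to the dict whose id equals target."""
    # Each stack entry is (dict to inspect, path of keys that led to it)
    stack = [(tree, ())]
    while stack:
        node, path = stack.pop()
        if id_key in node and node[id_key] == target:
            return list(path)
        for key, child in node.items():
            if isinstance(child, dict):
                stack.append((child, path + (key,)))
    return None  # no dict with that id

crono = {
    "Fanerozoico": {
        "id": "20000",
        "Cenozoico": {
            "id": "23000",
            "Quaternario": {
                "id": "23300",
                "Pleistocenico": {
                    "id": "23310",
                    "Superior": {"id": "23314"},
                    "Medio": {"id": "23313"},
                },
            },
        },
    }
}
path = find_path(crono, "23314")
print(path)
```

Because nothing is called recursively, the nesting depth is limited only by memory rather than by the interpreter's recursion limit.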

how to parse json where key is variable in python?

I am parsing a log file in JSON format that contains data in the form of key: value pairs.
I am stuck at a place where the key itself is variable. Please look at the attached code.
In this code I am able to access keys like username, event_type, ip, etc.
The problem is accessing the values inside the "submission" key, where
i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1 is a variable key that changes for different users.
How can I access it as a variable?
{
    "username": "batista",
    "event_type": "problem_check",
    "ip": "127.0.0.1",
    "event": {
        "submission": {
            "i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {
                "input_type": "choicegroup",
                "question": "",
                "response_type": "multiplechoiceresponse",
                "answer": "MenuInflater.inflate()",
                "variant": "",
                "correct": true
            }
        },
        "success": "correct",
        "grade": 1,
        "correct_map": {
            "i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {
                "hint": "",
                "hintmode": null,
                "correctness": "correct",
                "npoints": null,
                "msg": "",
                "queuestate": null
            }
        }
    }
}
This is how I am currently parsing it:
import json
import pprint
with open("log.log") as infile:
    # Loop until we have parsed all the lines.
    for line in infile:
        # Keep appending lines until we have a complete object
        while True:
            try:
                json_data = json.loads(line)
                username = json_data['username']
                print("username :- " + username)
                break  # object parsed; move on to the next one
            except ValueError:
                line += next(infile)
How can I access the i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1 key and
the data inside it?
You don't need to know the key in advance, you can simply iterate over the dictionary:
for k, v in obj['event']['submission'].items():
    print(k, v)
Suppose you have a dictionary d = {"a": "b"}; then d.popitem() gives you the tuple ("a", "b"), which is (key, value). Using this, you can access key-value pairs without knowing the key.
In your case, if j is the main dictionary, then j["event"]["submission"].popitem() gives you the tuple
("i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1", {
    "input_type": "choicegroup",
    "question": "",
    "response_type": "multiplechoiceresponse",
    "answer": "MenuInflater.inflate()",
    "variant": "",
    "correct": True
})
Hope this is what you were asking.
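Note that popitem() also removes the returned pair from the dictionary. If you need to keep the data intact, a non-destructive alternative is to take the first item from an iterator. A small sketch, using a trimmed-down copy of the submission data:

```python
submission = {
    "i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {
        "answer": "MenuInflater.inflate()",
        "correct": True,
    }
}

# Take the first (key, value) pair without modifying the dictionary
key, value = next(iter(submission.items()))
print(key)
print(value["answer"])
print(len(submission))  # the entry is still there
```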
Using Python's json module, you end up with a dictionary of parsed values from the JSON data above:
import json
parsed = json.loads(this_sample_data_in_question)
# parsed is a dictionary, and so are the values under "correct_map" and "submission" within "event"
So you can iterate over its keys and values like any normal dictionary, for example:
for k, v in parsed.items():
    print(k, v)
Now you can find the (possibly different) values of the "i4x-IITB-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1" key like this:
import json
parsed = json.loads(the_data_in_question_as_string)
event = parsed['event']
for key, val in event.items():
    if key in ('correct_map', 'submission'):
        section = event[key]
        for possible_variable_key, its_value in section.items():
            print(possible_variable_key, its_value)
Of course there may be better ways to iterate over the dictionary; which one you choose depends on your coding taste, or on performance if your data is much larger than the sample posted here.
