Convert multiple lines of strings into a JSON tree based on a delimiter - Python

I am out of ideas for converting multiple lines of strings into a JSON tree structure.
I have multiple lines of strings like the ones below under a particular Excel column:
/abc/a[2]/a/x[1]
/abc/a[2]/a/x[2]
Since the above strings share the delimiter /, I could use it to derive a parent-child relationship and convert them into a Python dictionary (or JSON) like below:
{
    "tag": "abc",
    "child": [
        {
            "tag": "a[2]",
            "child": [
                {
                    "tag": "a",
                    "child": [
                        {
                            "tag": "x[1]"
                        },
                        {
                            "tag": "x[2]"
                        }
                    ]
                }
            ]
        }
    ]
}
I am unable to come up with the logic for this part of my project, since I need to look for the presence of [1], [2], assign them to a common parent, and do this recursively so that it works for paths of any length. Please help me out with any code logic in Python, or offer suggestions. Much appreciated!
Additional Update:
Just wondering if it would also be possible to include other columns' data along with the JSON structure.
For example, if the Excel sheet contains the below three columns with two rows:
tag                text    type
/abc/a[2]/a/x[1]   Hello   string
/abc/a[2]/a/x[2]   World   string
Along with the JSON from the original question, is it possible to add the other columns' information as key-value attributes (on the corresponding innermost child nesting), like below?
These columns do not follow the same '/' delimiter format, hence I am unsure how to approach this.
{
    "tag": "abc",
    "child": [
        {
            "tag": "a[2]",
            "child": [
                {
                    "tag": "a",
                    "child": [
                        {
                            "tag": "x[1]",
                            "text": "Hello",
                            "type": "string"
                        },
                        {
                            "tag": "x[2]",
                            "text": "World",
                            "type": "string"
                        }
                    ]
                }
            ]
        }
    ]
}
P.S.: I felt it would be appropriate to attach the data to the innermost child to avoid redundant information. Please feel free to suggest other appropriate ways of including these columns as well.

You can use recursion with collections.defaultdict:
import collections, json

def to_tree(d):
    # group rows by the first path component
    v = collections.defaultdict(list)
    for [a, *b], *c in d:
        v[a].append([b, *c])
    # recurse while every grouped row still has path components left;
    # otherwise attach the row's trailing columns as 'text' and 'type'
    return [{'tag': a, **({'child': to_tree(b)} if all(j for j, *_ in b)
                          else dict(zip(['text', 'type'], b[0][-2:])))}
            for a, b in v.items()]

col_data = [['/abc/a[2]/a/x[1]', 'Hello', 'string'],
            ['/abc/a[2]/a/x[2]', 'World', 'string']]  # read in from your excel file
result = to_tree([[[*filter(None, a.split('/'))], b, c] for a, b, c in col_data])
print(json.dumps(result, indent=4))
Output:
[
    {
        "tag": "abc",
        "child": [
            {
                "tag": "a[2]",
                "child": [
                    {
                        "tag": "a",
                        "child": [
                            {
                                "tag": "x[1]",
                                "text": "Hello",
                                "type": "string"
                            },
                            {
                                "tag": "x[2]",
                                "text": "World",
                                "type": "string"
                            }
                        ]
                    }
                ]
            }
        ]
    }
]
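col_data above is hard-coded; one hedged way to build it from the spreadsheet (assuming the sheet is exported to CSV with header columns tag, text, type — the filename and column names here are illustrative) is the stdlib csv module:

```python
import csv
import io

# Stand-in for open("columns.csv"): the Excel sheet exported as CSV
csv_text = """tag,text,type
/abc/a[2]/a/x[1],Hello,string
/abc/a[2]/a/x[2],World,string
"""

reader = csv.DictReader(io.StringIO(csv_text))
col_data = [[row["tag"], row["text"], row["type"]] for row in reader]
print(col_data)
# [['/abc/a[2]/a/x[1]', 'Hello', 'string'], ['/abc/a[2]/a/x[2]', 'World', 'string']]
```

Libraries such as pandas or openpyxl can read the .xlsx file directly instead, if exporting to CSV is not an option.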

def to_dict(data, old_data=None):
    result = {}
    if old_data is not None:
        result["child"] = old_data
    for row in data.split("\n"):
        path = result
        for element in row.strip("/").split("/"):
            if "child" not in path:
                path["child"] = []
            path = path["child"]
            for path_el in path:
                if element == path_el["tag"]:
                    path = path_el
                    break
            else:
                path.append({"tag": element})
                path = path[-1]
    return result["child"]

# tests:
column = """/abc/a[2]/a/x[1]
/abc/a[2]/a/x[2]"""
print(to_dict(column))
print(to_dict(column, [{'tag': 'abc', 'child': [{'tag': 'old entry'}]}]))

Related

Flatten deeply nested JSON vertically and convert to pandas

Hi, I am trying to flatten a JSON file but am unable to. My JSON has 3 repeating nested levels; a sample is below:
"floors": [
  {
    "uuid": "8474",
    "name": "some value",
    "areas": [
      {
        "uuid": "xyz",
        "**name**": "qwe",
        "roomType": "Name1",
        "templateUuid": "sdklfj",
        "templateName": "asdf",
        "templateVersion": "2.7.1",
        "Required1": [
          {
            "**uuid**": "asdf",
            "description": "asdf3",
            "categoryName": "asdf",
            "familyName": "asdf",
            "productName": "asdf3",
            "Required2": [
              {
                "**deviceId**": "asdf",
                "**deviceUuid**": "asdf-asdf"
              }
            ]
          }
For each area I want the corresponding values in the nested Required1, and for each Required1 the corresponding Required2 (highlighted in **).
I have tried json_normalize as below, and other free libraries, but failed.
Attempts:
import json
import pandas as pd
from pandas.io.json import json_normalize

with open('Filename.json') as data_file:
    data_item = json.load(data_file)

Raw_Areas = json_normalize(data_item['floors'], 'areas', errors='ignore', record_prefix='Area_')
No area value is displayed; only Required1 and Required2, still nested.
K = json_normalize(data_item['floors'][0], record_path=['Required1', 'Required2'], errors='ignore', record_prefix='Try_')

from flatten_json import flatten_json
Flat_J1 = pd.DataFrame([flatten_json(data_item)])
Looking to get the values as below, side by side. Columns expected:
floors.areas.Required1.Required2.deviceUuid
floors.areas.name
Please help; am I missing anything in my attempt? I am fairly new to working with JSON.
Assuming the following JSON (as multiple people pointed out, yours is incomplete, so I completed it based on the bracket openings you had):
dct = {"floors": [
    {
        "uuid": "8474",
        "name": "some value",
        "areas": [
            {
                "uuid": "xyz",
                "name": "qwe",
                "roomType": "Name1",
                "templateUuid": "sdklfj",
                "templateName": "asdf",
                "templateVersion": "2.7.1",
                "Required1": [
                    {
                        "uuid": "asdf",
                        "description": "asdf3",
                        "categoryName": "asdf",
                        "familyName": "asdf",
                        "productName": "asdf3",
                        "Required2": [
                            {
                                "deviceId": "asdf",
                                "deviceUuid": "asdf-asdf"
                            }
                        ]
                    }
                ]
            }
        ]
    }
]}
You can do the following (requires pandas 0.25.0 or later, for DataFrame.explode):
import pandas as pd

df = pd.io.json.json_normalize(
    dct, record_path=['floors', 'areas', 'Required1'], meta=[['floors', 'areas', 'name']])
# expand each single-element Required2 list into rows, then into columns
df = df.explode('Required2')
df = pd.concat([df, df["Required2"].apply(pd.Series)], axis=1)
df = df[['floors.areas.name', 'uuid', 'deviceId', 'deviceUuid']]
Which gives,
  floors.areas.name  uuid deviceId  deviceUuid
0               qwe  asdf     asdf   asdf-asdf

How to count the number of dictionaries in a set of dictionaries

I am trying to convert a very complex JSON into CSV, and now I am stuck somewhere in the middle. My JSON file is nested with a combination of many lists and dictionaries (the dictionaries also have sub-dictionaries).
While iterating through the complete JSON, I am getting two dictionaries from a for loop. My problem is that when I loop through this set to append keys (Zip1) and values into my fin_data dictionary, which starts empty, then because dictionary keys are unique I can only keep one value, i.e. Zip1, 34567:
{'type': 'Zip1', 'value': '12345'}
{'type': 'Zip1', 'value': '34567'}
fin_data = {}
dict1 is the outcome of some for loop in my code and holds the values
{'type': 'Zip1', 'value': '12345'}
{'type': 'Zip1', 'value': '34567'}
for key, value in dict1.items():
    for data in value:
        print(data)
        fin_data.update({key: data['value']})
Is there any way I can iterate through the sets of dictionaries in dict1, so that on the first iteration I copy one set of data into the CSV, and on the second iteration the other values?
The output I am getting is:
{Zip1: 34567}
The actual output required is both values.
A sample of the JSON I am working on; I need to extract data from all of the value attributes:
{
  "updatedTime": 1562215101843,
  "attributes": {
    "ActiveFlag": [
      {
        "value": "Y"
      }
    ],
    "CountryCode": [
      {
        "value": "United States"
      }
    ],
    "LastName": [
      {
        "value": "Giers"
      }
    ],
    "MatchFirstNames": [
      {
        "value": "Morgan"
      }
    ],
    "Address": [
      {
        "value": {
          "Zip": [
            {
              "value": {
                "Zip5": [
                  {
                    "type": "Zip1",
                    "value": "12345"
                  }
                ]
              }
            }
          ],
          "Country": [
            {
              "value": "United States"
            }
          ]
        }
      },
      {
        "value": {
          "City": [
            {
              "value": "Tempe"
            }
          ],
          "Zip": [
            {
              "value": {
                "Zip5": [
                  {
                    "type": "Zip1",
                    "value": "85287"
                  }
                ]
              }
            }
          ]
        }
      }
    ]
  }
}
Expected Result :
updatedTime, ActiveFlag, CountryCode, LastName, MatchFirstNames, Address_Zip_Zip5, Address_City, Address_Country
1562215101843,Y,United States,Giers,Morgan,12345,,United States
1562215101843,Y,United States,Giers,Morgan,85287,Tempe,
At the first step, for each person accumulate all the zip codes into one line/list/record, say space-separated or in an array.
Then, after everything is extracted, split each row into several rows.
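A minimal sketch of that idea in plain Python, assuming the repeated {'type': ..., 'value': ...} dicts arrive as a list (the variable names here are illustrative): accumulate every value per type in a defaultdict(list) so nothing is overwritten, then emit one row per accumulated value.

```python
import collections

# Hypothetical input: the repeated dicts produced by the question's loop
records = [{'type': 'Zip1', 'value': '12345'},
           {'type': 'Zip1', 'value': '34567'}]

# Step 1: accumulate every value under its type instead of overwriting
accumulated = collections.defaultdict(list)
for rec in records:
    accumulated[rec['type']].append(rec['value'])

# Step 2: split into one row per value for the CSV
rows = [{t: v} for t, values in accumulated.items() for v in values]
print(rows)  # [{'Zip1': '12345'}, {'Zip1': '34567'}]
```

Each entry of rows can then be written out as its own CSV line.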

How to cut desired level of unlimited depth of JSON data in Python?

I have a large JSON file with the following structure, with varying depth under each node. I want to delete levels below an assigned depth.
So far I have tried to cut some levels manually, but it still doesn't remove them correctly, since I remove them by index and the indexes shift after each deletion:
content = json.load(file)
content_copy = content.copy()
for j, i in enumerate(content):
    if 'children' in i:
        for xx, x in enumerate(i['children']):
            if 'children' in x:
                for zz, z in enumerate(x['children']):
                    if 'children' in z:
                        del content_copy[j]['children'][xx]['children'][zz]
Input:
[
    {
        "name": "1",
        "children": [
            {
                "name": "3",
                "children": "5"
            },
            {
                "name": "33",
                "children": "51"
            },
            {
                "name": "13",
                "children": [
                    {
                        "name": "20",
                        "children": "30"
                    },
                    {
                        "name": "40",
                        "children": "50"
                    }
                ]
            }
        ]
    },
    {
        "name": "2",
        "children": [
            {
                "name": "7",
                "children": "6"
            },
            {
                "name": "3",
                "children": "521"
            },
            {
                "name": "193",
                "children": "292"
            }
        ]
    }
]
Output:
In which, for "name": "13", the children were removed.
[
    {
        "name": "1",
        "children": [
            {
                "name": "3",
                "children": "5"
            },
            {
                "name": "33",
                "children": "51"
            },
            {
                "name": "13"
            }
        ]
    },
    {
        "name": "2",
        "children": [
            {
                "name": "7",
                "children": "6"
            },
            {
                "name": "3",
                "children": "521"
            },
            {
                "name": "193",
                "children": "292"
            }
        ]
    }
]
Not a Python answer, but in the hope it's useful to someone, here is a one-liner using the jq tool:
<file jq 'del(.[][][]?[]?[]?)'
It simply deletes all elements that have a depth of more than 5.
The question mark ? is used to avoid iterating over elements that have a depth of less than 3.
One way to prune is to pass depth+1 in a recursive function call.
You are asking for different behaviors for different types: if the grandchild is just a string, you want to keep it, but if it is a list then you want to prune. This seems inconsistent; "13" should have children ["20", "30"], but then they wouldn't have the same node structure, so I can see your dilemma.
I would convert them to a tree of node objects and then just prune nodes, but to get the exact output you listed, I can selectively prune based on whether a child is a string or a list.
import pprint
import json

data = """[{
    "name": "1", "children":
    [
        {"name": "3", "children": "5"}, {"name": "33", "children": "51"},
        {"name": "13", "children": [{"name": "20", "children": "30"},
                                    {"name": "40", "children": "50"}]}
    ]
},
{
    "name": "2", "children":
    [
        {"name": "7", "children": "6"},
        {"name": "3", "children": "521"},
        {"name": "193", "children": "292"}
    ]
}]"""
content = json.loads(data)

def pruned_nodecopy(content, prune_level, depth=0):
    if not isinstance(content, list):
        return content
    result = []
    for node in content:
        node_copy = {'name': node['name']}
        if 'children' in node:
            children = node['children']
            if not isinstance(children, list):
                node_copy['children'] = children
            elif depth + 1 < prune_level:
                node_copy['children'] = pruned_nodecopy(children, prune_level, depth + 1)
        result.append(node_copy)
    return result

content_copy = pruned_nodecopy(content, 2)
pprint.pprint(content_copy)
Note that this copies only the attributes you use. I had to hard-code the attribute names because you're asking for specific (and different) behaviors on them.

Get "path" of parent keys and indices in dictionary of nested dictionaries and lists

I am receiving a large JSON from Google Assistant and I want to retrieve some specific details from it. The JSON is the following:
{
  "responseId": "************************",
  "queryResult": {
    "queryText": "actions_intent_DELIVERY_ADDRESS",
    "action": "delivery",
    "parameters": {},
    "allRequiredParamsPresent": true,
    "fulfillmentMessages": [
      {
        "text": {
          "text": [
            ""
          ]
        }
      }
    ],
    "outputContexts": [
      {
        "name": "************************/agent/sessions/1527070836044/contexts/actions_capability_screen_output"
      },
      {
        "name": "************************/agent/sessions/1527070836044/contexts/more",
        "parameters": {
          "polar": "no",
          "polar.original": "No",
          "cardinal": 2,
          "cardinal.original": "2"
        }
      },
      {
        "name": "************************/agent/sessions/1527070836044/contexts/actions_capability_audio_output"
      },
      {
        "name": "************************/agent/sessions/1527070836044/contexts/actions_capability_media_response_audio"
      },
      {
        "name": "************************/agent/sessions/1527070836044/contexts/actions_intent_delivery_address",
        "parameters": {
          "DELIVERY_ADDRESS_VALUE": {
            "userDecision": "ACCEPTED",
            "#type": "type.googleapis.com/google.actions.v2.DeliveryAddressValue",
            "location": {
              "postalAddress": {
                "regionCode": "US",
                "recipients": [
                  "Amazon"
                ],
                "postalCode": "NY 10001",
                "locality": "New York",
                "addressLines": [
                  "450 West 33rd Street"
                ]
              },
              "phoneNumber": "+1 206-266-2992"
            }
          }
        }
      },
      {
        "name": "************************/agent/sessions/1527070836044/contexts/actions_capability_web_browser"
      }
    ],
    "intent": {
      "name": "************************/agent/intents/86fb2293-7ae9-4bed-adeb-6dfe8797e5ff",
      "displayName": "Delivery"
    },
    "intentDetectionConfidence": 1,
    "diagnosticInfo": {},
    "languageCode": "en-gb"
  },
  "originalDetectIntentRequest": {
    "source": "google",
    "version": "2",
    "payload": {
      "isInSandbox": true,
      "surface": {
        "capabilities": [
          {
            "name": "actions.capability.MEDIA_RESPONSE_AUDIO"
          },
          {
            "name": "actions.capability.SCREEN_OUTPUT"
          },
          {
            "name": "actions.capability.AUDIO_OUTPUT"
          },
          {
            "name": "actions.capability.WEB_BROWSER"
          }
        ]
      },
      "inputs": [
        {
          "rawInputs": [
            {
              "query": "450 West 33rd Street"
            }
          ],
          "arguments": [
            {
              "extension": {
                "userDecision": "ACCEPTED",
                "#type": "type.googleapis.com/google.actions.v2.DeliveryAddressValue",
                "location": {
                  "postalAddress": {
                    "regionCode": "US",
                    "recipients": [
                      "Amazon"
                    ],
                    "postalCode": "NY 10001",
                    "locality": "New York",
                    "addressLines": [
                      "450 West 33rd Street"
                    ]
                  },
                  "phoneNumber": "+1 206-266-2992"
                }
              },
              "name": "DELIVERY_ADDRESS_VALUE"
            }
          ],
          "intent": "actions.intent.DELIVERY_ADDRESS"
        }
      ],
      "user": {
        "lastSeen": "2018-05-23T10:20:25Z",
        "locale": "en-GB",
        "userId": "************************"
      },
      "conversation": {
        "conversationId": "************************",
        "type": "ACTIVE",
        "conversationToken": "[\"more\"]"
      },
      "availableSurfaces": [
        {
          "capabilities": [
            {
              "name": "actions.capability.SCREEN_OUTPUT"
            },
            {
              "name": "actions.capability.AUDIO_OUTPUT"
            },
            {
              "name": "actions.capability.WEB_BROWSER"
            }
          ]
        }
      ]
    }
  },
  "session": "************************/agent/sessions/1527070836044"
}
This large JSON returns to my back-end, amongst other things, the delivery address details of the user (here I use Amazon's NY location details as an example). I want to retrieve the location dictionary which is near the end of this large JSON. The location details also appear near the start, but I specifically want the second location dictionary, near the end.
For this reason, I had to read through this JSON myself and manually test some possible "paths" of the location dictionary within it, to finally find out that I had to write the following line to retrieve the second location dictionary:
location = json['originalDetectIntentRequest']['payload']['inputs'][0]['arguments'][0]['extension']['location']
Therefore, my question is the following: is there any concise way to automatically retrieve the "path" of the parent keys and indices of the second location dictionary within this large JSON?
Hence, I expect the general format of the output from a function which does this for all occurrences of the location dictionary in any JSON to be the following:
[["path" of first `location` dictionary], ["path" of second `location` dictionary], ["path" of third `location` dictionary], ...]
where for the JSON above it will be
[["path" of first `location` dictionary], ["path" of second `location` dictionary]]
as there are two occurrences of the location dictionary, with
["path" of second `location` dictionary] = ['originalDetectIntentRequest', 'payload', 'inputs', 0, 'arguments', 0, 'extension', 'location']
I have in mind relevant posts on StackOverflow (Python--Finding Parent Keys for a specific value in a nested dictionary), but I am not sure these apply exactly to my problem, since they are about parent keys in nested dictionaries, whereas here I am talking about parent keys and indices in a dictionary with nested dictionaries and lists.
I solved this by using a recursive search:
from copy import copy

# result and path live outside the scope of find_path so that their values
# persist across the recursive calls to the function
result = []
path = []

# i is the index of the list that dict_obj is part of
def find_path(dict_obj, key, i=None):
    for k, v in dict_obj.items():
        # add key to path
        path.append(k)
        if isinstance(v, dict):
            # continue searching
            find_path(v, key, i)
        if isinstance(v, list):
            # search through the list of dictionaries
            for i, item in enumerate(v):
                # add the index of the list that the item dict is part of to path
                path.append(i)
                if isinstance(item, dict):
                    # continue searching in the item dict
                    find_path(item, key, i)
                # if we reach here, the last added index was incorrect, so remove it
                path.pop()
        if k == key:
            # add path to our result
            result.append(copy(path))
        # remove the key added at the top of the loop
        if path != []:
            path.pop()

# di is the question's large JSON, already parsed into a dict
find_path(di, "location")
print(result)
# [['queryResult', 'outputContexts', 4, 'parameters', 'DELIVERY_ADDRESS_VALUE', 'location'], ['originalDetectIntentRequest', 'payload', 'inputs', 0, 'arguments', 0, 'extension', 'location']]
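A hedged variant of the same search that avoids module-level state by returning the paths instead (the function name and the small sample structure below are illustrative, not the full Assistant payload):

```python
def find_paths(obj, key, prefix=()):
    """Recursively collect the key-paths (dict keys and list indices)
    at which `key` occurs anywhere inside nested dicts/lists."""
    paths = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                paths.append(list(prefix) + [k])
            paths.extend(find_paths(v, key, prefix + (k,)))
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            paths.extend(find_paths(item, key, prefix + (i,)))
    return paths

# Small illustrative structure with two occurrences of "location"
sample = {"a": [{"location": {"x": 1}}], "b": {"c": {"location": {"y": 2}}}}
print(find_paths(sample, "location"))
# [['a', 0, 'location'], ['b', 'c', 'location']]
```

Because the state is threaded through the prefix argument, the function can be called repeatedly on different documents without resetting globals.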

How to convert special characters to normal text when exporting to file in Python?

I have some input data from a website that I have gathered using BeautifulSoup.
After I have collected the relevant information from the site, I want to export it to JSON.
This is what some of my output data looks like:
[
    {
        "time": "30\/3",
        "tag": "I\u00c3\u00b8"
    },
    {
        "time": "12\/4",
        "tag": "Da"
    }
]
It should be:
[
    {
        "time": "30/3",
        "tag": "Iø"
    },
    {
        "time": "12/4",
        "tag": "Da"
    }
]
Why does it look like that and how do I fix it?
I don't know about the code around it, but this issue arises because your code is using ASCII-safe encoding, so it escapes the special characters.
To handle special characters with json, you can just set ensure_ascii to False:
import json

a = [
    {
        "time": "30/3",
        "tag": "Iø"
    },
    {
        "time": "12/4",
        "tag": "Da"
    }
]
print(json.dumps(a, ensure_ascii=False, indent=4))
output:
[
{
"time": "30/3",
"tag": "Iø"
},
{
"time": "12/4",
"tag": "Da"
}
]
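Since the question is about exporting to a file, the same flag applies to json.dump; a minimal sketch (the filename is illustrative), opening the file with an explicit UTF-8 encoding so the non-ASCII characters can be stored as-is:

```python
import json

data = [{"time": "30/3", "tag": "Iø"}, {"time": "12/4", "tag": "Da"}]

# ensure_ascii=False keeps "ø" literal instead of a \uXXXX escape;
# encoding="utf-8" lets the file store the non-ASCII byte sequence
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

with open("output.json", encoding="utf-8") as f:
    print("Iø" in f.read())  # True
```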
The issue is that the slashes and non-ASCII characters are being escaped. One way to decode them is to use the json library like so:
>>> import json
>>> s = """[
... {
... "time": "30\/3",
... "tag": "I\u00c3\u00b8"
... },
... {
... "time": "12\/4",
... "tag": "Da"
... }
... ]"""
>>> json.loads(s)
[{'time': '30/3', 'tag': 'Iø'}, {'time': '12/4', 'tag': 'Da'}]
