I need to flatten a JSON with different levels of nested JSON arrays in Python
Part of my JSON looks like:
{
"data": {
"workbooks": [
{
"projectName": "TestProject",
"name": "wkb1",
"site": {
"name": "site1"
},
"description": "",
"createdAt": "2020-12-13T15:38:58Z",
"updatedAt": "2020-12-13T15:38:59Z",
"owner": {
"name": "user1",
"username": "John"
},
"embeddedDatasources": [
{
"name": "DS1",
"hasExtracts": false,
"upstreamDatasources": [
{
"projectName": "Data Sources",
"name": "DS1",
"hasExtracts": false,
"owner": {
"username": "user2"
}
}
],
"upstreamTables": [
{
"name": "table_1",
"schema": "schema_1",
"database": {
"name": "testdb",
"connectionType": "redshift"
}
},
{
"name": "table_2",
"schema": "schema_2",
"database": {
"name": "testdb",
"connectionType": "redshift"
}
},
{
"name": "table_3",
"schema": "schema_3",
"database": {
"name": "testdb",
"connectionType": "redshift"
}
}
]
},
{
"name": "DS2",
"hasExtracts": false,
"upstreamDatasources": [
{
"projectName": "Data Sources",
"name": "DS2",
"hasExtracts": false,
"owner": {
"username": "user3"
}
}
],
"upstreamTables": [
{
"name": "table_4",
"schema": "schema_1",
"database": {
"name": "testdb",
"connectionType": "redshift"
}
}
]
}
]
}
]
}
}
The output should look like this:
[image: sample output]
I tried using json_normalize but couldn't make it work. Currently I parse the JSON by looping over the nested arrays and reading values by key. I'm looking for a better way to normalize the JSON.
Here's a partial solution:
First save your data in the same directory as the script as a JSON file called data.json.
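import json
import pandas as pd

# Load the JSON document saved next to the script
with open('data.json') as json_file:
    json_data = json.load(json_file)

new_data = json_data['data']['workbooks']
# One row per upstream table; workbook-level fields are repeated as metadata.
# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize (pandas >= 1.0).
result = pd.json_normalize(new_data, ['embeddedDatasources', 'upstreamTables'], ['projectName', 'name', 'createdAt', 'updatedAt', 'owner', 'site'], record_prefix='_')
result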
Output:
|   | _name | _schema | _database.name | _database.connectionType | projectName | name | createdAt | updatedAt | owner | site |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | table_1 | schema_1 | testdb | redshift | TestProject | wkb1 | 2020-12-13T15:38:58Z | 2020-12-13T15:38:59Z | {'name': 'user1', 'username': 'John'} | {'name': 'site1'} |
| 1 | table_2 | schema_2 | testdb | redshift | TestProject | wkb1 | 2020-12-13T15:38:58Z | 2020-12-13T15:38:59Z | {'name': 'user1', 'username': 'John'} | {'name': 'site1'} |
| 2 | table_3 | schema_3 | testdb | redshift | TestProject | wkb1 | 2020-12-13T15:38:58Z | 2020-12-13T15:38:59Z | {'name': 'user1', 'username': 'John'} | {'name': 'site1'} |
| 3 | table_4 | schema_1 | testdb | redshift | TestProject | wkb1 | 2020-12-13T15:38:58Z | 2020-12-13T15:38:59Z | {'name': 'user1', 'username': 'John'} | {'name': 'site1'} |
What next?
I think if you re-structure the data a bit in advance (for example flattening 'database': {'name': 'testdb', 'connectionType': 'redshift'}) you will be able to add more fields to the meta parameter.
As you see in the documentation of json_normalize, the four parameters that are used here are:
data (dict or list of dicts): Unserialized JSON objects.
record_path (str or list of str, default None): Path in each object to list of records. If not passed, data will be assumed to be an array of records.
meta (list of paths (str or list of str), default None): Fields to use as metadata for each record in resulting table.
record_prefix (str, default None): If True, prefix records with dotted (?) path, e.g. foo.bar.field if path to records is ['foo', 'bar'].
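To make these parameters concrete, here is a small sketch on made-up data. The team/members structure and the member_ prefix are invented for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical input: one parent object with a nested list of records
data = [
    {
        "team": "A",
        "info": {"city": "Oslo"},
        "members": [
            {"name": "x", "age": 1},
            {"name": "y", "age": 2},
        ],
    }
]

df = pd.json_normalize(
    data,
    record_path=["members"],          # the nested list that becomes the rows
    meta=["team", ["info", "city"]],  # parent fields repeated on every row
    record_prefix="member_",          # prefix applied to the record columns
)
print(df)
```

A nested meta path such as ['info', 'city'] shows up as a dotted column name (info.city), which is one way to pull individual values out of dict-valued fields like owner and site.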
tl;dr: Your final output, along with detailed steps, is described here.
Details:
To answer this question you need a thorough understanding of pandas.json_normalize, in particular its record_path and meta parameters, as well as DataFrame.explode and JSON parsing in general.
import json
import pandas as pd
data = {
"data":
{
"workbooks":
[
{
"projectName": "TestProject",
"name": "wkb1",
"site":
{
"name": "site1"
},
"description": "",
"createdAt": "2020-12-13T15:38:58Z",
"updatedAt": "2020-12-13T15:38:59Z",
"owner":
{
"name": "user1",
"username": "John"
},
"embeddedDatasources":
[
{
"name": "DS1",
"hasExtracts": False,
"upstreamDatasources":
[
{
"projectName": "Data Sources",
"name": "DS1",
"hasExtracts": False,
"owner":
{
"username": "user2"
}
}
],
"upstreamTables":
[
{
"name": "table_1",
"schema": "schema_1",
"database":
{
"name": "testdb",
"connectionType": "redshift"
}
},
{
"name": "table_2",
"schema": "schema_2",
"database":
{
"name": "testdb",
"connectionType": "redshift"
}
},
{
"name": "table_3",
"schema": "schema_3",
"database":
{
"name": "testdb",
"connectionType": "redshift"
}
}
]
},
{
"name": "DS2",
"hasExtracts": False,
"upstreamDatasources":
[
{
"projectName": "Data Sources",
"name": "DS2",
"hasExtracts": False,
"owner":
{
"username": "user3"
}
}
],
"upstreamTables":
[
{
"name": "table_4",
"schema": "schema_1",
"database":
{
"name": "testdb",
"connectionType": "redshift"
}
}
]
}
]
}
]
}
}
First you need to drill down to the list of workbook dicts.
data_list = data['data']['workbooks']
I did some data massaging by renaming some columns as per requirements.
data_list_pd = pd.DataFrame(data_list)
# Rename the workbook-level columns in a single call
data_list_pd = data_list_pd.rename(
    columns={'name': 'wkb', 'createdAt': 'wkb_createdDt', 'updatedAt': 'wkb_updatedDt', 'projectName': 'prj'},
    errors='ignore')
data_list_pd
# Convert back to a list of dicts so that json_normalize can consume it
data_list = json.loads(data_list_pd.to_json(orient="records"))
data_list
Next comes the core of your problem. You need to flatten the JSON by passing record_path, which is essentially the nested list of records you want to expand, along with meta, which lists the remaining fields you want to carry along as metadata columns. After that you need to explode the columns that contain lists. You can do that by chaining the explode method a couple of times.
flattened_dataframe= pd.json_normalize(data_list,
record_path = 'embeddedDatasources',
meta = ['prj','wkb','wkb_createdDt', 'wkb_updatedDt',['site','name'],['owner','name'],['owner','username']],
errors='ignore').explode('upstreamDatasources').explode('upstreamTables')
flattened_dataframe
You can repeat this process a couple of times to reach your desired result. Since json_normalize works on dicts (or lists of dicts) rather than on DataFrames, you have to convert the DataFrame back into JSON records after each iteration. You can do that as follows:
flattened_json = json.loads(flattened_dataframe.to_json(orient="records"))
Also read about to_json.
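The steps described above (normalize with record_path and meta, explode the list column, then normalize the resulting dicts) might be combined roughly as follows. This is a minimal sketch on a trimmed-down, made-up version of the workbook data; the prefixes wkb_, ds_ and table_ are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Trimmed-down, hypothetical version of the workbook data
workbooks = [{
    "name": "wkb1",
    "embeddedDatasources": [
        {"name": "DS1", "upstreamTables": [
            {"name": "table_1", "database": {"name": "testdb"}},
            {"name": "table_2", "database": {"name": "testdb"}},
        ]},
        {"name": "DS2", "upstreamTables": [
            {"name": "table_4", "database": {"name": "testdb"}},
        ]},
    ],
}]

# Step 1: one row per embedded datasource; prefixes avoid the "name" clash
df = pd.json_normalize(
    workbooks,
    record_path="embeddedDatasources",
    meta=["name"],
    record_prefix="ds_",
    meta_prefix="wkb_",
)

# Step 2: one row per upstream table (each cell now holds a single dict)
df = df.explode("ds_upstreamTables").reset_index(drop=True)

# Step 3: expand those dicts into columns and attach them to the other columns
tables = pd.json_normalize(df["ds_upstreamTables"].tolist()).add_prefix("table_")
flat = pd.concat([df.drop(columns="ds_upstreamTables"), tables], axis=1)
print(flat)
```

Resetting the index after explode keeps the rows aligned when the expanded table columns are concatenated back on.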
Related
I would like to store two-dimensional arrays of numbers in Avro.
I have tried the following:
{
"namespace": "com.company",
"type": "record",
"name": "MyName",
"doc" : "...",
"fields": [
{
"name": "MyArray",
"type": {
"type": "array",
"items": {
"type": {"type": "array","items": "int"}
}
}
}
]
}
But when I tried to read it with the parser:
import avro.schema
schema = avro.schema.parse(open("my_schema.avsc", "r").read())
I get the following error:
avro.errors.SchemaParseException: Type property "{'type': 'array', 'items': {'type': {'type': 'array', 'items': 'int'}}}"
not a valid Avro schema: Items schema ({'type': {'type': 'array', 'items': 'int'}}) not
a valid Avro schema: Undefined type: {'type': 'array', 'items': 'int'}
(known names: dict_keys(['com.algoint.ecg_frame_file.EcgFrameFile']))
It looks like you have one too many type keys.
Your schema should be this instead:
{
"namespace": "com.company",
"type": "record",
"name": "MyName",
"doc" : "...",
"fields": [
{
"name": "MyArray",
"type": {
"type": "array",
"items": {"type": "array","items": "int"}
}
}
]
}
I have a large amount of data in a collection in mongodb which I need to analyze, using pandas and pymongo in jupyter. I am trying to import specific data in a dataframe.
Sample data.
{
"stored": "2022-04-xx",
...
...
"completedQueues": [
"STATEMENT_FORWARDING_QUEUE",
"STATEMENT_PERSON_QUEUE",
"STATEMENT_QUERYBUILDERCACHE_QUEUE"
],
"activities": [
"https://example.com"
],
"hash": "xxx",
"agents": [
"mailto:example#example.com"
],
"statement": { <=== I want to import the data from "statement"
"authority": {
"objectType": "Agent",
"name": "xxx",
"mbox": "mailto:example#example.com"
},
"stored": "2022-04-xxx",
"context": {
"platform": "Unknown",
"extensions": {
"http://example.com",
"xxx.com": {
"user_agent": "xxx"
},
"http://example.com": ""
}
},
"actor": {
"objectType": "xxx",
"name": "xxx",
"mbox": "mailto:example#example.com"
},
"timestamp": "2022-04-xxx",
"version": "1.0.0",
"id": "xxx",
"verb": {
"id": "http://example.com",
"display": {
"en-US": "viewed"
}
},
"object": {
"objectType": "xxx",
"id": "https://example.com",
"definition": {
"type": "http://example.com",
"name": {
"en-US": ""
},
"description": {
"en-US": "Viewed"
}
}
}
}, <=== up to here
"hasGeneratedId": true,
...
...
}
Notice that I am only interested in the data nested under "statement", and not in any data that merely contains that string, i.e. the "STATEMENT_FORWARDING_QUEUE" above it.
What I am trying to accomplish is import the data from "statement" (as indicated above) in a dataframe, and arrange them in a manner, like:
| id | authority objectType | authority name | authority mbox | stored | context platform | context extensions | actor objectType | actor name | ... |
|---|---|---|---|---|---|---|---|---|---|
| 00 | Agent | xxx | mailto | 2022- | Unknown | http://1 | xxx | xxx | ... |
| 01 | Agent | yyy | mailto | 2022- | Unknown | http://2 | yyy | yyy | ... |
The idea is to be able to access any data like "authority name" or "actor objectType".
I have tried:
df = pd.DataFrame(list(collection.find(query)(filters)))
df = json_normalize(list(collection.find(query)(filters)))
with various queries, filter and slices, and also aggregate and map/reduce, but nothing produces the correct output.
I would also like to sort (newest to oldest) based on the "stored" field (sort('$natural',-1) ?), and maybe apply limit(xx) to the dataframe as well.
Any ideas?
Thanks in advance.
Try this
df = json_normalize(list(
collection.aggregate([
{
"$match": query
},
{
"$replaceRoot": {
"newRoot": "$statement"
}
}
])
))
Thanks for the answer, @pavel. It is spot on and pretty much solves the problem.
I also added sorting and limit, so if anyone is interested, the final code looks like this:
df = json_normalize(list(
statements_coll.aggregate([
{
"$match": query
},
{
"$replaceRoot": {
"newRoot": "$statement"
}
},
{
"$sort": {
"stored": -1
}
},
{
"$limit": 10
}
])
))
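For anyone without a MongoDB instance at hand, the effect of this pipeline can be imitated in plain Python and pandas. The sketch below is only an approximation on made-up documents shaped like the sample; it does not touch pymongo at all. The list comprehension plays the role of $replaceRoot, and sort_values/head stand in for $sort and $limit.

```python
import pandas as pd

# Hypothetical documents shaped like the sample in the question
docs = [
    {"hash": "a", "completedQueues": ["STATEMENT_FORWARDING_QUEUE"],
     "statement": {"id": "00", "stored": "2022-04-01",
                   "authority": {"objectType": "Agent", "name": "xxx"},
                   "actor": {"objectType": "Agent", "name": "xxx"}}},
    {"hash": "b", "completedQueues": [],
     "statement": {"id": "01", "stored": "2022-04-02",
                   "authority": {"objectType": "Agent", "name": "yyy"},
                   "actor": {"objectType": "Agent", "name": "yyy"}}},
]

# Rough equivalent of $replaceRoot: keep only the "statement" sub-document
statements = [doc["statement"] for doc in docs]

# Nested dicts become dotted columns such as "authority.name"
df = pd.json_normalize(statements)

# Rough equivalent of $sort (newest first) followed by $limit
df = df.sort_values("stored", ascending=False).head(10).reset_index(drop=True)
print(df[["id", "stored", "authority.name", "actor.objectType"]])
```

Doing the sort and limit inside the aggregation, as in the code above, is usually preferable for large collections because only the limited result is transferred to the client.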
I'm trying to figure out how to perform a Merge or Join on a nested field in a DataFrame. Below is some example data:
df_all_groups = pd.read_json("""
[
{
"object": "group",
"id": "group-one",
"collections": [
{
"id": "111-111-111",
"readOnly": false
},
{
"id": "222-222-222",
"readOnly": false
}
]
},
{
"object": "group",
"id": "group-two",
"collections": [
{
"id": "111-111-111",
"readOnly": false
},
{
"id": "333-333-333",
"readOnly": false
}
]
}
]
""")
df_collections_with_names = pd.read_json("""
[
{
"object": "collection",
"id": "111-111-111",
"externalId": null,
"name": "Cats"
},
{
"object": "collection",
"id": "222-222-222",
"externalId": null,
"name": "Dogs"
},
{
"object": "collection",
"id": "333-333-333",
"externalId": null,
"name": "Fish"
}
]
""")
I'm trying to add the name field from df_collections_with_names to each df_all_groups['collections'][<index>] by joining on df_all_groups['collections'][<index>].id. The output I'm trying to get to is:
[
{
"object": "group",
"id": "group-one",
"collections": [
{
"id": "111-111-111",
"readOnly": false,
"name": "Cats" // See Collection name was added
},
{
"id": "222-222-222",
"readOnly": false,
"name": "Dogs" // See Collection name was added
}
]
},
{
"object": "group",
"id": "group-two",
"collections": [
{
"id": "111-111-111",
"readOnly": false,
"name": "Cats" // See Collection name was added
},
{
"id": "333-333-333",
"readOnly": false,
"name": "Fish" // See Collection name was added
}
]
}
]
I've tried to use the merge method, but can't seem to get it to run on the collections nested field as I believe it's a series at that point.
Here's one approach:
First convert the JSON string used to construct df_all_groups (I named it all_groups here) to a list of dictionaries using json.loads. Then use json_normalize to construct a DataFrame from it.
Then merge the DataFrame constructed above with df_collections_with_names; we now have a "name" column.
The rest is constructing the desired dictionary from the result obtained above; groupby + apply(to_dict) + reset_index + to_dict will produce the desired outcome:
import json
out = (pd.json_normalize(json.loads(all_groups), ['collections'], ['object', 'id'], meta_prefix='_')
.merge(df_collections_with_names, on='id', suffixes=('','_'))
.drop(columns=['object','externalId']))
out = (out.groupby(['_object','_id']).apply(lambda x: x[['id','readOnly','name']].to_dict('records'))
.reset_index(name='collections'))
out.rename(columns={c: c.strip('_') for c in out.columns}).to_dict('records')
Output:
[{'object': 'group',
'id': 'group-one',
'collections': [{'id': '111-111-111', 'readOnly': False, 'name': 'Cats'},
{'id': '222-222-222', 'readOnly': False, 'name': 'Dogs'}]},
{'object': 'group',
'id': 'group-two',
'collections': [{'id': '111-111-111', 'readOnly': False, 'name': 'Cats'},
{'id': '333-333-333', 'readOnly': False, 'name': 'Fish'}]}]
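Since all_groups is not defined in the snippet above, here is a self-contained variant of the same idea (normalize, merge, regroup). It is a sketch rather than a drop-in replacement: the names lookup frame is a shortened, hypothetical version of df_collections_with_names, and the regrouping uses a plain loop over groupby instead of apply.

```python
import json
import pandas as pd

all_groups = """
[
  {"object": "group", "id": "group-one",
   "collections": [{"id": "111-111-111", "readOnly": false},
                   {"id": "222-222-222", "readOnly": false}]},
  {"object": "group", "id": "group-two",
   "collections": [{"id": "111-111-111", "readOnly": false},
                   {"id": "333-333-333", "readOnly": false}]}
]
"""

# Shortened lookup table: collection id -> name
names = pd.DataFrame([
    {"id": "111-111-111", "name": "Cats"},
    {"id": "222-222-222", "name": "Dogs"},
    {"id": "333-333-333", "name": "Fish"},
])

# One row per collection; the group's fields become _object and _id
flat = pd.json_normalize(json.loads(all_groups), record_path="collections",
                         meta=["object", "id"], meta_prefix="_")

# Attach the collection name by joining on the collection id
merged = flat.merge(names, on="id", how="left")

# Rebuild the nested structure, one dict per group
result = [
    {"object": obj, "id": group_id,
     "collections": rows[["id", "readOnly", "name"]].to_dict("records")}
    for (obj, group_id), rows in merged.groupby(["_object", "_id"], sort=False)
]
print(result)
```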
I want to merge many JSON files with the same nested structure, using jsonmerge, but have been unsuccessful so far. For example, I want to merge base and head:
base = {
"data": [
{
"author_id": "id1",
"id": "1"
},
{
"author_id": "id2",
"id": "2"
}
],
"includes": {
"users": [
{
"id": "user1",
"name": "user1"
},
{
"id": "user2",
"name": "user2"
}
]
}
}
head = {
"data": [
{
"author_id": "id3",
"id": "3"
},
{
"author_id": "id4",
"id": "4"
}
],
"includes": {
"users": [
{
"id": "user3",
"name": "user3"
},
{
"id": "user4",
"name": "user4"
}
]
}
}
The resulting JSON should be:
final_result = {
"data": [
{
"author_id": "id1",
"id": "1"
},
{
"author_id": "id2",
"id": "2"
},
{
"author_id": "id3",
"id": "3"
},
{
"author_id": "id4",
"id": "4"
}
],
"includes": {
"users": [
{
"id": "user1",
"name": "user1"
},
{
"id": "user2",
"name": "user2"
},
{
"id": "user3",
"name": "user3"
},
{
"id": "user4",
"name": "user4"
}
]
}
}
However, I've only managed to merge the data fields correctly, while for users it doesn't seem to work. This is my code:
from jsonmerge import merge
from jsonmerge import Merger
schema = { "properties": {
"data": {
"mergeStrategy": "append"
},
"includes": {
"users": {
"mergeStrategy": "append"
}
}
}
}
merger = Merger(schema)
result = merger.merge(base, head)
The end result is:
{'data': [{'author_id': 'id1', 'id': '1'},
{'author_id': 'id2', 'id': '2'},
{'author_id': 'id3', 'id': '3'},
{'author_id': 'id4', 'id': '4'}],
'includes': {'users': [{'id': 'user3', 'name': 'user3'},
{'id': 'user4', 'name': 'user4'}]}}
The issue is with the definition of the schema, but I do not know if it is possible to do it like that with jsonmerge. Any help is appreciated!
Thank you!
jsonmerge is based on JSON Schema. So when you have an object within an object (e.g. "users" within "includes"), you need to tell the schema that it is dealing with another object, like so:
schema = {
"properties": {
"data": {
"mergeStrategy": "append"
},
"includes": {
"type": "object",
"properties": {
"users": {
"mergeStrategy": "append"
}
}
}
}
}
Note that the same applies to your top-level object, which is why there is a "properties" key at the highest level.
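To see what the nested "append" strategy is expected to produce, the behaviour can be imitated in plain Python. The function below is a hypothetical helper written for illustration only; it is not part of jsonmerge and ignores schemas entirely, simply appending every list and recursing into every dict.

```python
def merge_append(base, head):
    """Merge head into a copy of base, appending lists and recursing into dicts."""
    if isinstance(base, dict) and isinstance(head, dict):
        merged = dict(base)
        for key, value in head.items():
            # Recurse where both sides have the key, otherwise take head's value
            merged[key] = merge_append(base[key], value) if key in base else value
        return merged
    if isinstance(base, list) and isinstance(head, list):
        return base + head
    return head  # scalars and mismatched types: head overwrites base


base = {"data": [{"id": "1"}], "includes": {"users": [{"id": "user1"}]}}
head = {"data": [{"id": "3"}], "includes": {"users": [{"id": "user3"}]}}

result = merge_append(base, head)
print(result)
```

Because the recursion descends into "includes" before it reaches "users", the nested list is appended rather than overwritten, which is the effect the "properties" nesting achieves in the schema.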
I am trying to convert following CSV to JSON below. Any help will be appreciated.
Sample of the CSV file (the real file would contain many network groups, each with network and host attributes):
Type,Value,Name
Network,10.0.0.0/8,network_group_3
Host,10.0.0.27,network_group_3
Host,10.0.0.28,network_group_3
Network,10.10.10.0/24,network_group_4
Network,10.10.20.0/24,network_group_4
Host,10.10.10.6,network_group_4
Required JSON output:
netgroup = [
{
"literals": [
{
"type": "Network",
"value": "10.0.0.0/8"
},
{
"type": "Host",
"value": "10.0.0.27"
},
{
"type": "Host",
"value": "10.0.0.28"
}
],
"name": "network_group_3"
},
{
"literals": [
{
"type": "Network",
"value": "10.10.10.0/24"
},
{
"type": "Network",
"value": "10.10.20.0/24"
},
{
"type": "Host",
"value": "10.10.10.6"
}
],
"name": "network_group_4"
}
]
Here is a good explanation of converting CSV to JSON in Python:
http://www.idiotinside.com/2015/09/18/csv-json-pretty-print-python/
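For the grouping asked about in this question specifically, a minimal sketch using only the standard library could look like this. It assumes the header row is exactly Type,Value,Name and reads the CSV from an in-memory string; with a real file, pass an open file object to csv.DictReader instead.

```python
import csv
import io
import json

csv_text = """Type,Value,Name
Network,10.0.0.0/8,network_group_3
Host,10.0.0.27,network_group_3
Host,10.0.0.28,network_group_3
Network,10.10.10.0/24,network_group_4
Network,10.10.20.0/24,network_group_4
Host,10.10.10.6,network_group_4
"""

# Collect the literals of each group, keyed by group name (insertion-ordered)
groups = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    groups.setdefault(row["Name"], []).append(
        {"type": row["Type"], "value": row["Value"]}
    )

# Reshape into the list-of-groups layout from the question
netgroup = [{"literals": literals, "name": name} for name, literals in groups.items()]
print(json.dumps(netgroup, indent=2))
```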
Here is a solution using jq
If the file filter.jq contains
[
split("\n") # split string into lines
| (.[0] | split(",")) as $headers # split header
| (.[1:][] | split(",")) # split data rows
| select(length>0) # get rid of empty lines
]
| [
group_by(.[2])[]
| {
name: .[0][2],
literals: map({type:.[0], value:.[1]})
}
]
and your data is in a file called data then
jq -M -R -s -r -f filter.jq data
will generate
[
{
"name": "network_group_3",
"literals": [
{
"type": "Network",
"value": "10.0.0.0/8"
},
{
"type": "Host",
"value": "10.0.0.27"
},
{
"type": "Host",
"value": "10.0.0.28"
}
]
},
{
"name": "network_group_4",
"literals": [
{
"type": "Network",
"value": "10.10.10.0/24"
},
{
"type": "Network",
"value": "10.10.20.0/24"
},
{
"type": "Host",
"value": "10.10.10.6"
}
]
}
]
Better late than never: here is a solution using the convtools library:
from convtools import conversion as c
from convtools.contrib.tables import Table
# store converter somewhere if it needs to be reused
converter = (
c.group_by(c.item("Name"))
.aggregate(
{
"literals": c.ReduceFuncs.Array(
{
"type": c.item("Type"),
"value": c.item("Value"),
}
),
"name": c.item("Name"),
}
)
.gen_converter()
)
# iterable of rows and it can only be consumed once
rows = Table.from_csv("tmp2.csv", header=True).into_iter_rows(dict)
assert converter(rows) == [
{'literals': [{'type': 'Network', 'value': '10.0.0.0/8'},
{'type': 'Host', 'value': '10.0.0.27'},
{'type': 'Host', 'value': '10.0.0.28'}],
'name': 'network_group_3'},
{'literals': [{'type': 'Network', 'value': '10.10.10.0/24'},
{'type': 'Network', 'value': '10.10.20.0/24'},
{'type': 'Host', 'value': '10.10.10.6'}],
'name': 'network_group_4'}]