I'm trying to figure out how to perform a Merge or Join on a nested field in a DataFrame. Below is some example data:
import pandas as pd

df_all_groups = pd.read_json("""
[
{
"object": "group",
"id": "group-one",
"collections": [
{
"id": "111-111-111",
"readOnly": false
},
{
"id": "222-222-222",
"readOnly": false
}
]
},
{
"object": "group",
"id": "group-two",
"collections": [
{
"id": "111-111-111",
"readOnly": false
},
{
"id": "333-333-333",
"readOnly": false
}
]
}
]
""")
df_collections_with_names = pd.read_json("""
[
{
"object": "collection",
"id": "111-111-111",
"externalId": null,
"name": "Cats"
},
{
"object": "collection",
"id": "222-222-222",
"externalId": null,
"name": "Dogs"
},
{
"object": "collection",
"id": "333-333-333",
"externalId": null,
"name": "Fish"
}
]
""")
I'm trying to add the name field from df_collections_with_names to each df_all_groups['collections'][<index>] by joining on df_all_groups['collections'][<index>].id. The output I'm trying to get to is:
[
{
"object": "group",
"id": "group-one",
"collections": [
{
"id": "111-111-111",
"readOnly": false,
"name": "Cats" // See Collection name was added
},
{
"id": "222-222-222",
"readOnly": false,
"name": "Dogs" // See Collection name was added
}
]
},
{
"object": "group",
"id": "group-two",
"collections": [
{
"id": "111-111-111",
"readOnly": false,
"name": "Cats" // See Collection name was added
},
{
"id": "333-333-333",
"readOnly": false,
"name": "Fish" // See Collection name was added
}
]
}
]
I've tried to use the merge method, but can't seem to get it to run on the collections nested field, as I believe it's a Series at that point.
Here's one approach:
First convert the JSON string used to construct df_all_groups (I named it all_groups here) to a list of dicts using json.loads, then use json_normalize to construct a flat DataFrame from it.
Then merge the DataFrame constructed above with df_collections_with_names; we have the "name" column now.
The rest is constructing the desired dictionary from the result obtained above; groupby + apply(to_dict) + reset_index + to_dict will fetch the desired outcome:
import json
import pandas as pd

# all_groups is the raw JSON string that was passed to pd.read_json above.
out = (pd.json_normalize(json.loads(all_groups), ['collections'], ['object', 'id'], meta_prefix='_')
       .merge(df_collections_with_names, on='id', suffixes=('', '_'))
       .drop(columns=['object', 'externalId']))
# Re-nest the collection records under each group, then strip the meta prefix.
out = (out.groupby(['_object', '_id'])
       .apply(lambda x: x[['id', 'readOnly', 'name']].to_dict('records'))
       .reset_index(name='collections'))
out.rename(columns={c: c.strip('_') for c in out.columns}).to_dict('records')
Output:
[{'object': 'group',
'id': 'group-one',
'collections': [{'id': '111-111-111', 'readOnly': False, 'name': 'Cats'},
{'id': '222-222-222', 'readOnly': False, 'name': 'Dogs'}]},
{'object': 'group',
'id': 'group-two',
'collections': [{'id': '111-111-111', 'readOnly': False, 'name': 'Cats'},
{'id': '333-333-333', 'readOnly': False, 'name': 'Fish'}]}]
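If you need the exact JSON string from the question rather than a list of dicts, a small follow-up sketch (using the json module already imported above) serializes the records:
records = out.rename(columns={c: c.strip('_') for c in out.columns}).to_dict('records')
print(json.dumps(records, indent=2))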
I am trying to convert an object/dictionary to a pandas DataFrame using the following code:
sr = pd.Series(object)
df = pd.DataFrame(sr.values.tolist())
display(df)
It works well but some of the output columns are of object/dictionary type, and I would like to break them up to multiple columns, for example, if column "Items" produces the following value in a cell:
obj = {
"item1": {
"id": "item1",
"relatedItems": [
{
"id": "1111",
"category": "electronics"
},
{
"id": "9999",
"category": "electronics",
"subcategory": "computers"
},
{
"id": "2222",
"category": "electronics",
"subcategory": "computers",
"additionalData": {
"createdBy": "Doron",
"inventory": 100
}
}
]
},
"item2": {
"id": "item2",
"relatedItems": [
{
"id": "4444",
"category": "furniture",
"subcategory": "sofas"
},
{
"id": "5555",
"category": "books",
},
{
"id": "6666",
"category": "electronics",
"subcategory": "computers",
"additionalData": {
"createdBy": "Joe",
"inventory": 5,
"condition": {
"name": "new",
"inspectedBy": "Doron"
}
}
}
]
}
}
The desired output (shown as an image in the original post) keeps a single row, with each nested key split out into its own column.
I tried using df.explode, but it multiplies the row into multiple rows; I am looking for a way to achieve the same split into columns while retaining a single row.
Any suggestions?
You can use the pd.json_normalize function to flatten the nested dictionary into multiple columns, with the keys joined with a dot (.).
sr = pd.Series({
    'Items': {
        'item_name': 'name',
        'item_value': 'value'
    }
})
# json_normalize expects a dict (or list of dicts), so convert the Series first.
df = pd.json_normalize(sr.to_dict(), sep='.')
display(df)
This will give you the following df
Items.item_name Items.item_value
0 name value
You can also flatten only one branch by passing the record_path parameter to pd.json_normalize. Note that record_path must point to a key whose value is a list of records rather than a dict, so it would not apply to the 'Items' example above as-is; see the sketch below.
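A minimal working sketch of record_path, assuming a hypothetical input where 'Items' maps to a list of records:
import pandas as pd

data = {'Items': [{'item_name': 'name', 'item_value': 'value'}]}
# record_path points at the list; each list element becomes one row.
df = pd.json_normalize(data, record_path='Items', sep='.')
print(df)
#   item_name item_value
# 0      name      value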
Seems like you're looking for pandas.json_normalize, which has a sep parameter:
obj = {
'name': 'Doron Barel',
'items': {
'item_name': 'name',
'item_value': 'value',
'another_item_prop': [
{
'subitem1_name': 'just_another_name',
'subitem1_value': 'just_another_value',
},
{
'subitem2_name': 'one_more_name',
'subitem2_value': 'one_more_value',
}
]
}
}
df = pd.json_normalize(obj, sep='.')
# Pop the list-valued column and explode it into one row per sub-dict.
ser = df.pop('items.another_item_prop').explode()
out = (df.join(pd.DataFrame(ser.tolist(), index=ser.index)
               .rename(columns=lambda x: ser.name + "." + x))
       .groupby("name", as_index=False).first()
      )
Output:
print(out)
name items.item_name items.item_value items.another_item_prop.subitem1_name items.another_item_prop.subitem1_value items.another_item_prop.subitem2_name items.another_item_prop.subitem2_value
0 Doron Barel name value just_another_name just_another_value one_more_name one_more_value
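The final groupby("name", as_index=False).first() collapses the exploded rows back into a single row per name, taking the first non-null value in each of the new sub-item columns.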
I need to flatten a JSON with different levels of nested JSON arrays in Python
Part of my JSON looks like:
{
"data": {
"workbooks": [
{
"projectName": "TestProject",
"name": "wkb1",
"site": {
"name": "site1"
},
"description": "",
"createdAt": "2020-12-13T15:38:58Z",
"updatedAt": "2020-12-13T15:38:59Z",
"owner": {
"name": "user1",
"username": "John"
},
"embeddedDatasources": [
{
"name": "DS1",
"hasExtracts": false,
"upstreamDatasources": [
{
"projectName": "Data Sources",
"name": "DS1",
"hasExtracts": false,
"owner": {
"username": "user2"
}
}
],
"upstreamTables": [
{
"name": "table_1",
"schema": "schema_1",
"database": {
"name": "testdb",
"connectionType": "redshift"
}
},
{
"name": "table_2",
"schema": "schema_2",
"database": {
"name": "testdb",
"connectionType": "redshift"
}
},
{
"name": "table_3",
"schema": "schema_3",
"database": {
"name": "testdb",
"connectionType": "redshift"
}
}
]
},
{
"name": "DS2",
"hasExtracts": false,
"upstreamDatasources": [
{
"projectName": "Data Sources",
"name": "DS2",
"hasExtracts": false,
"owner": {
"username": "user3"
}
}
],
"upstreamTables": [
{
"name": "table_4",
"schema": "schema_1",
"database": {
"name": "testdb",
"connectionType": "redshift"
}
}
]
}
]
}
]
}
}
The output should look like this:
[sample output was shown as an image in the original post]
Tried using json_normalize but couldn't make it work. Currently I'm parsing it by looping through the nested arrays and reading values by key. Looking for a better way of normalizing the JSON.
Here's a partial solution:
First save your data in the same directory as the script as a JSON file called data.json.
import json
import pandas as pd

with open('data.json') as json_file:
    json_data = json.load(json_file)

new_data = json_data['data']['workbooks']

# Flatten every upstreamTables record, keeping workbook-level fields as meta
# columns; record fields are prefixed with '_' to avoid name clashes.
result = pd.json_normalize(new_data, ['embeddedDatasources', 'upstreamTables'],
                           ['projectName', 'name', 'createdAt', 'updatedAt', 'owner', 'site'],
                           record_prefix='_')
result
Output:
   _name    _schema  _database.name  _database.connectionType  projectName  name             createdAt             updatedAt                                  owner               site
0  table_1  schema_1  testdb          redshift                  TestProject  wkb1  2020-12-13T15:38:58Z  2020-12-13T15:38:59Z  {'name': 'user1', 'username': 'John'}  {'name': 'site1'}
1  table_2  schema_2  testdb          redshift                  TestProject  wkb1  2020-12-13T15:38:58Z  2020-12-13T15:38:59Z  {'name': 'user1', 'username': 'John'}  {'name': 'site1'}
2  table_3  schema_3  testdb          redshift                  TestProject  wkb1  2020-12-13T15:38:58Z  2020-12-13T15:38:59Z  {'name': 'user1', 'username': 'John'}  {'name': 'site1'}
3  table_4  schema_1  testdb          redshift                  TestProject  wkb1  2020-12-13T15:38:58Z  2020-12-13T15:38:59Z  {'name': 'user1', 'username': 'John'}  {'name': 'site1'}
What next?
I think if you restructure the data a bit in advance (for example, flattening 'database': {'name': 'testdb', 'connectionType': 'redshift'}) you will be able to add more fields to the meta parameter; nested dict fields can also be addressed directly, as in the sketch below.
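A minimal sketch of that idea, assuming the same new_data list: meta also accepts nested list paths, which replaces the dict-valued owner/site columns with plain scalar columns such as owner.name:
result = pd.json_normalize(
    new_data,
    record_path=['embeddedDatasources', 'upstreamTables'],
    # Nested meta paths pull individual fields out of the owner/site dicts.
    meta=['projectName', 'name', 'createdAt', 'updatedAt',
          ['owner', 'name'], ['owner', 'username'], ['site', 'name']],
    record_prefix='_',
)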
As you see in the documentation of json_normalize, the four parameters that are used here are:
data: dict or list of dicts :
Unserialized JSON objects.
record_path: str or list of str : default None
Path in each object to list of records. If not passed, data will be assumed to be an array of records.
meta: list of paths (str or list of str) : default None
Fields to use as metadata for each record in resulting table.
record_prefix: str : default None
If True, prefix records with dotted (?) path, e.g. foo.bar.field if path to records is [‘foo’, ‘bar’].
tl;dr: your final output, along with the detailed steps, is worked through below.
Details:
To answer this question you need a thorough understanding of pandas.json_normalize: its record_path and meta parameters, explode, and JSON parsing in general.
import json
import pandas as pd
data = {
"data":
{
"workbooks":
[
{
"projectName": "TestProject",
"name": "wkb1",
"site":
{
"name": "site1"
},
"description": "",
"createdAt": "2020-12-13T15:38:58Z",
"updatedAt": "2020-12-13T15:38:59Z",
"owner":
{
"name": "user1",
"username": "John"
},
"embeddedDatasources":
[
{
"name": "DS1",
"hasExtracts": False,
"upstreamDatasources":
[
{
"projectName": "Data Sources",
"name": "DS1",
"hasExtracts": False,
"owner":
{
"username": "user2"
}
}
],
"upstreamTables":
[
{
"name": "table_1",
"schema": "schema_1",
"database":
{
"name": "testdb",
"connectionType": "redshift"
}
},
{
"name": "table_2",
"schema": "schema_2",
"database":
{
"name": "testdb",
"connectionType": "redshift"
}
},
{
"name": "table_3",
"schema": "schema_3",
"database":
{
"name": "testdb",
"connectionType": "redshift"
}
}
]
},
{
"name": "DS2",
"hasExtracts": False,
"upstreamDatasources":
[
{
"projectName": "Data Sources",
"name": "DS2",
"hasExtracts": False,
"owner":
{
"username": "user3"
}
}
],
"upstreamTables":
[
{
"name": "table_4",
"schema": "schema_1",
"database":
{
"name": "testdb",
"connectionType": "redshift"
}
}
]
}
]
}
]
}
}
First you need to drill down to the list of workbook dicts.
data_list = data['data']['workbooks']
I did some data massaging by renaming some columns as per requirements.
data_list_pd = pd.DataFrame(data_list)
data_list_pd = data_list_pd.rename(
    columns={'name': 'wkb', 'createdAt': 'wkb_createdDt',
             'updatedAt': 'wkb_updatedDt', 'projectName': 'prj'},
    errors='ignore')
data_list_pd
data_list = json.loads(data_list_pd.to_json(orient="records"))
data_list
Next is where the core of your problem statement lies. You need to flatten the JSON by specifying record_path, which is essentially the nested list you want to expand, along with meta, the remaining metadata columns you want to display. After that you need to explode the columns which still hold lists; you can achieve that by chaining the explode method a couple of times.
flattened_dataframe = pd.json_normalize(
    data_list,
    record_path='embeddedDatasources',
    meta=['prj', 'wkb', 'wkb_createdDt', 'wkb_updatedDt',
          ['site', 'name'], ['owner', 'name'], ['owner', 'username']],
    errors='ignore').explode('upstreamDatasources').explode('upstreamTables')
flattened_dataframe
You can repeat this process a couple of times to reach your final goal/desired result. Since json_normalize works on JSON/dict input, you will have to convert the DataFrame back to JSON after each iteration. You can follow these steps:
flattened_json = json.loads(flattened_dataframe.to_json(orient="records"))
Also read about to_json.
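As a sketch of how one further pass might look (an assumption about how to continue, not part of the original answer): after the two explode calls, each row's upstreamDatasources and upstreamTables cells hold a single dict, so a plain json_normalize with no record_path flattens them into dotted columns:
final_df = pd.json_normalize(flattened_json)
final_df.columns  # now includes e.g. 'upstreamTables.name', 'upstreamTables.database.name'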
I want to merge many JSON files with the same nested structure, using jsonmerge, but have been unsuccessful so far. For example, I want to merge base and head:
base = {
"data": [
{
"author_id": "id1",
"id": "1"
},
{
"author_id": "id2",
"id": "2"
}
],
"includes": {
"users": [
{
"id": "user1",
"name": "user1"
},
{
"id": "user2",
"name": "user2"
}
]
}
}
head = {
"data": [
{
"author_id": "id3",
"id": "3"
},
{
"author_id": "id4",
"id": "4"
}
],
"includes": {
"users": [
{
"id": "user3",
"name": "user3"
},
{
"id": "user4",
"name": "user4"
}
]
}
}
The resulting JSON should be:
final_result = {
"data": [
{
"author_id": "id1",
"id": "1"
},
{
"author_id": "id2",
"id": "2"
},
{
"author_id": "id3",
"id": "3"
},
{
"author_id": "id4",
"id": "4"
}
],
"includes": {
"users": [
{
"id": "user1",
"name": "user1"
},
{
"id": "user2",
"name": "user2"
},
{
"id": "user3",
"name": "user3"
},
{
"id": "user4",
"name": "user4"
}
]
}
}
However, I've only managed to merge the data fields correctly, while for users it doesn't seem to work. This is my code:
from jsonmerge import merge
from jsonmerge import Merger
schema = { "properties": {
"data": {
"mergeStrategy": "append"
},
"includes": {
"users": {
"mergeStrategy": "append"
}
}
}
}
merger = Merger(schema)
result = merger.merge(base, head)
The end result is:
{'data': [{'author_id': 'id1', 'id': '1'},
{'author_id': 'id2', 'id': '2'},
{'author_id': 'id3', 'id': '3'},
{'author_id': 'id4', 'id': '4'}],
'includes': {'users': [{'id': 'user3', 'name': 'user3'},
{'id': 'user4', 'name': 'user4'}]}}
The issue is with the definition of the schema, but I do not know if it is possible to do it like that with jsonmerge. Any help is appreciated!
Thank you!
jsonmerge is based on jsonschema. So when you have an object within an object (e.g. "users" within "includes"), you'll need to tell jsonmerge it is dealing with another object, like so:
schema = {
"properties": {
"data": {
"mergeStrategy": "append"
},
"includes": {
"type": "object",
"properties": {
"users": {
"mergeStrategy": "append"
}
}
}
}
}
Note that this also applies to your top-level objects, hence the "properties" key at the highest level.
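As a quick check (a sketch, assuming jsonmerge is installed), rerunning the merge from the question with this corrected schema appends both lists:
from jsonmerge import Merger

merger = Merger(schema)  # schema as defined above
result = merger.merge(base, head)
# result['includes']['users'] now contains user1 through user4.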
How can we merge two DataFrames whose columns contain nested dictionaries, updating df1 with df2 in the "actions" column? Is there any way to achieve this using available methods like concat, append, or merge?
df1 = pd.DataFrame([
{
"id": "87c4b5a0db9f49c49f766436c9582297",
"actions": {
"sample": [
{
"tagvalue": "test",
"status": "created"
},
{
"tagvalue": "test2",
"status": "created"
}
]
}
},
{
"id": "87c4b5a0db9f49c49f766436c9582298",
"actions": {
"sample": [
{
"tagvalue": "test2",
"status": "created"
}
]
}
}
])
df2 = pd.DataFrame([
{
"id": "87c4b5a0db9f49c49f766436c9582297",
"actions": {
"sample": [
{
"tagvalue": "test",
"status": "updated"
}
]
}
}
])
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
# Need to merge the data based on id
# TODO : Right way to merge to get the following output
finalOutputExpectaion = [
{
"id": "87c4b5a0db9f49c49f766436c9582297",
"actions": {
"sample": [
{
"tagvalue": "test",
"status": "updated"
},
{
"tagvalue": "test2",
"status": "created"
}
]
}
},
{
"id": "87c4b5a0db9f49c49f766436c9582298",
"actions": {
"sample": [
{
"tagvalue": "test2",
"status": "created"
}
]
}
}
]
Note: finalOutputExpectaion is the updated DataFrame as a dict (we'll get it by using to_dict(orient='records')).
Python Version : 3.7,
Pandas Version : 1.1.0
First join the DataFrames df1 and df2 on id, then inside a list comprehension zip the actions columns from the left and right DataFrames and use a custom merge function to update the dictionaries:
def merge(d1, d2):
    # d2 is NaN when the id exists only in df1; keep df1's dict in that case.
    if pd.isna(d1) or pd.isna(d2):
        return d1
    # Carry over every entry from d1 whose tagvalue was not updated in d2.
    tags = set(d['tagvalue'] for d in d2['sample'])
    d2['sample'] += [d for d in d1['sample'] if d['tagvalue'] not in tags]
    return d2

m = df1.join(df2, lsuffix='', rsuffix='_r')
df1['actions'] = [merge(*v) for v in zip(m['actions'], m['actions_r'])]
Result:
actions
id
87c4b5a0db9f49c49f766436c9582297 {'sample': [{'tagvalue': 'test', 'status': 'updated'}, {'tagvalue': 'test2', 'status': 'created'}]}
87c4b5a0db9f49c49f766436c9582298 {'sample': [{'tagvalue': 'test2', 'status': 'created'}]}
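To get back the list-of-dicts shape of finalOutputExpectaion (a small follow-up sketch):
final_output = df1.reset_index().to_dict(orient='records')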
I have a heavily nested set of JSON that I would like to turn into a table.
I would like to turn the JSON response below into a table; under "steps" I just want to extract "name" and "options" and their values.
"data": {
"activities": [
{
"sections": [
{
"steps": [
{
"blocking": false,
"actionable": true,
"document": null,
"name": "Site",
"options": [
"RKM",
"Meridian"
],
"description": null,
"id": "036c3090-95c4-4162-a746-832ed43a2805",
"type": "DROPDOWN"
},
{
"blocking": false,
"actionable": true,
"document": null,
"name": "Location",
"options": [
"Field",
"Station"
],
Assuming that you want a pandas DataFrame:
# 'json' here is the parsed response dict (note the name shadows the json module).
df = pd.DataFrame(json['data']['activities'][0]['sections'][0]['steps'])[['name', 'options']]
print(df)
Output:
name options
0 Site [RKM, Meridian]
1 Location [Field, Station]
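If the response can contain several activities or sections, a nested record_path covers them all (a sketch, under the same assumption that json holds the parsed response dict):
df = pd.json_normalize(json['data']['activities'],
                       record_path=['sections', 'steps'])[['name', 'options']]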