How can we merge two dataframes whose columns contain nested dictionaries? I want to update df1 with df2 in the "actions" column. Is there any way to achieve this using available methods like concat, append, or merge?
df1 = pd.DataFrame([
{
"id": "87c4b5a0db9f49c49f766436c9582297",
"actions": {
"sample": [
{
"tagvalue": "test",
"status": "created"
},
{
"tagvalue": "test2",
"status": "created"
}
]
}
},
{
"id": "87c4b5a0db9f49c49f766436c9582298",
"actions": {
"sample": [
{
"tagvalue": "test2",
"status": "created"
}
]
}
}
])
df2 = pd.DataFrame([
{
"id": "87c4b5a0db9f49c49f766436c9582297",
"actions": {
"sample": [
{
"tagvalue": "test",
"status": "updated"
}
]
}
}
])
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
# Need to merge the data based on id
# TODO : Right way to merge to get the following output
finalOutputExpectation = [
{
"id": "87c4b5a0db9f49c49f766436c9582297",
"actions": {
"sample": [
{
"tagvalue": "test",
"status": "updated"
},
{
"tagvalue": "test2",
"status": "created"
}
]
}
},
{
"id": "87c4b5a0db9f49c49f766436c9582298",
"actions": {
"sample": [
{
"tagvalue": "test2",
"status": "created"
}
]
}
}
]
Note: finalOutputExpectation is the updated dataframe as a dict (we'll get it using to_dict(orient='records')).
Python version: 3.7, pandas version: 1.1.0
First join the dataframes df1 and df2 on id; then, inside a list comprehension, zip the actions columns from the left and right dataframes and use a custom merge function to update the dictionaries:
def merge(d1, d2):
    # if either side is missing (e.g. no matching id in df2 after the left join), keep df1's value
    if pd.isna(d1) or pd.isna(d2):
        return d1
    # tagvalues already present in df2's dict take precedence...
    tags = set(d['tagvalue'] for d in d2['sample'])
    # ...then append df1's entries that aren't covered yet
    d2['sample'] += [d for d in d1['sample'] if d['tagvalue'] not in tags]
    return d2
m = df1.join(df2, lsuffix='', rsuffix='_r')  # left join on the shared 'id' index
df1['actions'] = [merge(*v) for v in zip(m['actions'], m['actions_r'])]
Result:
actions
id
87c4b5a0db9f49c49f766436c9582297 {'sample': [{'tagvalue': 'test', 'status': 'updated'}, {'tagvalue': 'test2', 'status': 'created'}]}
87c4b5a0db9f49c49f766436c9582298 {'sample': [{'tagvalue': 'test2', 'status': 'created'}]}
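To get the finalOutputExpectation records back (per the note in the question), reset the index and convert to records; a minimal sketch:
result = df1.reset_index().to_dict(orient='records')
# [{'id': '87c4b5a0...', 'actions': {'sample': [...]}}, ...]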
I am trying to convert an object/dictionary to a pandas DataFrame using the following code:
sr = pd.Series(object)
df = pd.DataFrame(sr.values.tolist())
display(df)
It works well, but some of the output columns are of object/dictionary type, and I would like to break them up into multiple columns. For example, column "Items" might produce the following value in a cell:
obj = {
"item1": {
"id": "item1",
"relatedItems": [
{
"id": "1111",
"category": "electronics"
},
{
"id": "9999",
"category": "electronics",
"subcategory": "computers"
},
{
"id": "2222",
"category": "electronics",
"subcategory": "computers",
"additionalData": {
"createdBy": "Doron",
"inventory": 100
}
}
]
},
"item2": {
"id": "item2",
"relatedItems": [
{
"id": "4444",
"category": "furniture",
"subcategory": "sofas"
},
{
"id": "5555",
"category": "books",
},
{
"id": "6666",
"category": "electronics",
"subcategory": "computers",
"additionalData": {
"createdBy": "Joe",
"inventory": 5,
"condition": {
"name": "new",
"inspectedBy": "Doron"
}
}
}
]
}
}
The desired output is a single row, with the nested keys split out into separate columns.
I tried using df.explode, but it multiplies the row into multiple rows; I am looking for a way to achieve the same split into columns while retaining a single row.
Any suggestions?
You can use the pd.json_normalize function to flatten the nested dictionary into multiple columns, with the keys joined with a dot (.).
sr = pd.Series({
'Items': {
'item_name': 'name',
'item_value': 'value'
}
})
df = pd.json_normalize(sr.to_dict(), sep='.')  # json_normalize expects a dict or list of dicts, not a Series
display(df)
This will give you the following df
Items.item_name Items.item_value
0 name value
You can also flatten just one branch by passing the record_path parameter to pd.json_normalize; note that record_path must point at a key whose value is a list of records, so it would raise an error on the dict-valued 'Items' key above.
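For instance, with the obj from the question, relatedItems is such a list; a sketch, where meta/meta_prefix keep the parent id without clashing with each record's own id column:
flat = pd.json_normalize(obj['item1'], record_path='relatedItems',
                         meta=['id'], meta_prefix='item.', sep='.')
# one row per related item; nested keys are dot-joined
# (e.g. 'additionalData.createdBy') and the parent id appears as 'item.id'
display(flat)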
Seems like you're looking for pandas.json_normalize, which has a sep parameter:
obj = {
'name': 'Doron Barel',
'items': {
'item_name': 'name',
'item_value': 'value',
'another_item_prop': [
{
'subitem1_name': 'just_another_name',
'subitem1_value': 'just_another_value',
},
{
'subitem2_name': 'one_more_name',
'subitem2_value': 'one_more_value',
}
]
}
}
df = pd.json_normalize(obj, sep='.')
ser = df.pop('items.another_item_prop').explode()
out = (df.join(pd.DataFrame(ser.tolist(), index=ser.index)
               .rename(columns=lambda x: ser.name + "." + x))
       .groupby("name", as_index=False).first()
      )
Output :
print(out)
name items.item_name items.item_value items.another_item_prop.subitem1_name items.another_item_prop.subitem1_value items.another_item_prop.subitem2_name items.another_item_prop.subitem2_value
0 Doron Barel name value just_another_name just_another_value one_more_name one_more_value
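Note on the final step: explode duplicates the index, so the join yields two rows for the single original record; groupby("name", as_index=False).first() collapses them back into one row, because first() takes the first non-null value per column within each group.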
I'm trying to figure out how to perform a Merge or Join on a nested field in a DataFrame. Below is some example data:
df_all_groups = pd.read_json("""
[
{
"object": "group",
"id": "group-one",
"collections": [
{
"id": "111-111-111",
"readOnly": false
},
{
"id": "222-222-222",
"readOnly": false
}
]
},
{
"object": "group",
"id": "group-two",
"collections": [
{
"id": "111-111-111",
"readOnly": false
},
{
"id": "333-333-333",
"readOnly": false
}
]
}
]
""")
df_collections_with_names = pd.read_json("""
[
{
"object": "collection",
"id": "111-111-111",
"externalId": null,
"name": "Cats"
},
{
"object": "collection",
"id": "222-222-222",
"externalId": null,
"name": "Dogs"
},
{
"object": "collection",
"id": "333-333-333",
"externalId": null,
"name": "Fish"
}
]
""")
I'm trying to add the name field from df_collections_with_names to each df_all_groups['collections'][<index>], joining on df_all_groups['collections'][<index>].id. The output I'm trying to get is:
[
{
"object": "group",
"id": "group-one",
"collections": [
{
"id": "111-111-111",
"readOnly": false,
"name": "Cats" // See Collection name was added
},
{
"id": "222-222-222",
"readOnly": false,
"name": "Dogs" // See Collection name was added
}
]
},
{
"object": "group",
"id": "group-two",
"collections": [
{
"id": "111-111-111",
"readOnly": false,
"name": "Cats" // See Collection name was added
},
{
"id": "333-333-333",
"readOnly": false,
"name": "Fish" // See Collection name was added
}
]
}
]
I've tried to use the merge method, but can't seem to get it to run on the nested collections field, as I believe it's a Series at that point.
Here's one approach:
First convert the json string used to construct df_all_groups (I named it all_groups here) to a dictionary using json.loads, then use json_normalize to construct a DataFrame from it.
Then merge the DataFrame constructed above with df_collections_with_names; now we have the "name" column.
The rest is reshaping the result into the desired dictionary: groupby + apply(to_dict) + reset_index + to_dict produce the desired outcome:
import json
out = (pd.json_normalize(json.loads(all_groups), ['collections'], ['object', 'id'], meta_prefix='_')
.merge(df_collections_with_names, on='id', suffixes=('','_'))
.drop(columns=['object','externalId']))
out = (out.groupby(['_object','_id']).apply(lambda x: x[['id','readOnly','name']].to_dict('records'))
.reset_index(name='collections'))
out.rename(columns={c: c.strip('_') for c in out.columns}).to_dict('records')
Output:
[{'object': 'group',
'id': 'group-one',
'collections': [{'id': '111-111-111', 'readOnly': False, 'name': 'Cats'},
{'id': '222-222-222', 'readOnly': False, 'name': 'Dogs'}]},
{'object': 'group',
'id': 'group-two',
'collections': [{'id': '111-111-111', 'readOnly': False, 'name': 'Cats'},
{'id': '333-333-333', 'readOnly': False, 'name': 'Fish'}]}]
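Alternatively, since only the nested dicts need the extra key, you can skip the normalize/rebuild round trip with a plain-Python pass over the column; a minimal sketch, assuming every collection id in the groups appears in df_collections_with_names:
# build an id -> name lookup from the collections frame
names = df_collections_with_names.set_index('id')['name'].to_dict()

# add the name to each nested collection dict
df_all_groups['collections'] = df_all_groups['collections'].apply(
    lambda cols: [{**c, 'name': names[c['id']]} for c in cols]
)

result = df_all_groups.to_dict('records')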
I've been wrestling with this for many days now and would appreciate any help.
I'm importing an Excel file to a pandas DataFrame, resulting in the following dataframe [record]:
account_id  name    timestamp            value
A0001C      Fund_1  1588618800000000000  1
B0001B      Dev_2   1601578800000000000  1
I'm looking to produce nested JSON output (it will be used to submit data to an API), including adding "records" and "metrics" labels for the arrays.
Here is the output I'm looking for:
{
"records": [
{
"name": "Fund_1",
"account_id": "A0001C",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"name": "Dev_2",
"account_id": "B0001B",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
I've gotten a non-nested JSON data set as output, but I'm not able to split out the timestamp and value to add the metrics part.
for record in df.to_dict(orient='records'):
    record_data = {'records': [record]}
    payload_json = json.dumps(record_data)
    print(payload_json)
I get the following output:
{"records": [{"account_id": "A0001C", "name": "Fund_1", "Date Completed": 1588618800000000000, "Count": "1"}]}
{"records": [{"account_id": "B0001B", "name": "Dev_2", "Date Completed": 1601578800000000000, "Count": "1"}]}
Any help on how I can modify my code to add the metrics label and nest the data would be appreciated.
Thanks in advance.
One approach is to use DataFrame.apply, which lets you apply a function to the rows or columns of your dataframe.
In your particular case you want to apply the function row by row, so use apply with axis=1:
records = list(df.apply(lambda row: {"name": row["name"],
"account_id": row["account_id"],
"metrics": [{
"timestamp": row["timestamp"],
"value": row["value"]}]
},
axis=1).values)
payload = {"records": records}
Alternatively, you could introduce an auxiliary "metrics" column in which you store your metrics, then reshape with to_dict(orient="records"):
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
Here's a full example applying option 2:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep="\t")
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
Edit: The second approach also makes grouping by accounts (in case you want to do that) rather easy. Below is a small example and output:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
A0001C Fund_1 1588618900000000000 2
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep="\t")
# add a metrics column as above, but with a single dict per row;
# the groupby below collects them into per-account lists
df["metrics"] = df.apply(lambda e: {"timestamp": e.timestamp,
                                    "value": e.value},
                         axis=1)
# group metrics by account
df_grouped = df.groupby(by=["name", "account_id"]).metrics.agg(list).reset_index()
records = df_grouped[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
},
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
},
{
"timestamp": 1588618900000000000,
"value": 2
}
]
}
]
}
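One detail in the grouped output above: groupby sorts by the grouping keys by default, which is why Dev_2 now comes before Fund_1; pass sort=False to groupby if you want to preserve the original row order.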
I am trying to load json from a url and convert to a Pandas dataframe, so that the dataframe would look like the sample below.
I've tried json_normalize, but it duplicates the columns, one for each data type (value and stringValue). Is there a simpler way than this method and then dropping and renaming columns after creating the dataframe? I want to keep the stringValue.
Person ID Position ID Job ID Manager
0 192 936 93 Tom
my_json = {
"columns": [
{
"alias": "c3",
"label": "Person ID",
"dataType": "integer"
},
{
"alias": "c36",
"label": "Position ID",
"dataType": "string"
},
{
"alias": "c40",
"label": "Job ID",
"dataType": "integer",
"entityType": "job"
},
{
"alias": "c19",
"label": "Manager",
"dataType": "integer"
},
],
"data": [
{
"c3": {
"value": 192,
"stringValue": "192"
},
"c36": {
"value": "936",
"stringValue": "936"
},
"c40": {
"value": 93,
"stringValue": "93"
},
"c19": {
"value": 12412453,
"stringValue": "Tom"
}
}
]
}
If c19 were of type "string", this would work:
# map each alias to its display label, and record whether it is declared as a string
alias_to_label = {x['alias']: x['label'] for x in my_json["columns"]}
is_str = {x['alias']: ('string' == x['dataType']) for x in my_json["columns"]}

data = []
for x in my_json["data"]:
    # take stringValue for string columns, the raw value otherwise
    data.append({
        k: v["stringValue" if is_str[k] else 'value']
        for k, v in x.items()
    })
df = pd.DataFrame(data).rename(columns=alias_to_label)
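Since c19 is actually declared "integer" in the posted JSON, Manager would come out as 12412453 rather than "Tom". If you want stringValue for specific columns regardless of their declared dataType, a small sketch with a hypothetical force_string override set:
# force_string is a hypothetical override: aliases listed here always
# take stringValue, whatever their declared dataType says
force_string = {'c19'}
data = []
for x in my_json["data"]:
    data.append({
        k: v["stringValue" if (is_str[k] or k in force_string) else 'value']
        for k, v in x.items()
    })
df = pd.DataFrame(data).rename(columns=alias_to_label)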
I have the following code:
from itertools import groupby

resdata = dict()
rows = result.rows.all()
# itertools.groupby only groups consecutive items, so rows must already be sorted by space
for key, group in groupby(rows, lambda x: x.space):
    row = list()
    for item in group:
        cell = {
            'time': item.time,
            'value': item.value
        }
        row.append(cell)
    resdata[key] = row
a sample resdata would be:
resdata = [
{
"skl": "nn_skl:5608",
"cols": [
{
"value": 115.396956868,
"time": "2012-06-02 00:00:00"
},
{
"value": 112.501399874,
"time": "2012-06-03 00:00:00"
},
{
"value": 106.528068506,
"time": "2012-06-18 00:00:00"
}
],
"len": 226
},
{
"skl": "nn_skl:5609",
"cols": [
{
"value": 114.541167284,
"time": "2012-06-02 00:00:00"
},
],
"len": 226
},
{
"skl": "nn_skl:5610",
"cols": [
{
"value": 105.887267189,
"time": "2012-06-18 00:00:00"
}
],
"len": 225
}
]
What I want to do is to get the maximum 'value' and the maximum 'time' among all the cells.
Assuming you've converted the JSON into a Python object with json.loads or whatnot, then you want something like:
max(b["time"] for a in resdata for b in a["cols"])
(Note the clause order: the outer loop over resdata has to come first, otherwise a is referenced before it's defined.)