I am trying to convert an object/dictionary to a pandas DataFrame using the following code:
sr = pd.Series(object)
df = pd.DataFrame(sr.values.tolist())
display(df)
It works well, but some of the output columns hold objects/dictionaries, and I would like to break those up into multiple columns. For example, suppose the column "Items" produces the following value in a cell:
obj = {
"item1": {
"id": "item1",
"relatedItems": [
{
"id": "1111",
"category": "electronics"
},
{
"id": "9999",
"category": "electronics",
"subcategory": "computers"
},
{
"id": "2222",
"category": "electronics",
"subcategory": "computers",
"additionalData": {
"createdBy": "Doron",
"inventory": 100
}
}
]
},
"item2": {
"id": "item2",
"relatedItems": [
{
"id": "4444",
"category": "furniture",
"subcategory": "sofas"
},
{
"id": "5555",
"category": "books",
},
{
"id": "6666",
"category": "electronics",
"subcategory": "computers",
"additionalData": {
"createdBy": "Joe",
"inventory": 5,
"condition": {
"name": "new",
"inspectedBy": "Doron"
}
}
}
]
}
}
The desired output is:
I tried using df.explode, but it turns the single row into multiple rows. I am looking for a way to achieve the same split, but into columns, while retaining a single row.
Any suggestions?
You can use the pd.json_normalize function to flatten the nested dictionary into multiple columns, with the keys joined with a dot (.).
sr = pd.Series({
'Items': {
'item_name': 'name',
'item_value': 'value'
}
})
# json_normalize expects a dict (or a list of dicts), so convert the Series first
df = pd.json_normalize(sr.to_dict(), sep='.')
display(df)
This will give you the following df
Items.item_name Items.item_value
0 name value
Note that the record_path parameter does not control the nesting level: it names the path to a list of records, so it cannot be pointed at the 'Items' dictionary. To flatten only the dictionary stored under 'Items', pass that dictionary directly:
df = pd.json_normalize(sr['Items'], sep='.')
display(df)
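How deep json_normalize flattens is governed by its max_level parameter. The following is a minimal sketch on made-up data (the `record` dictionary and its keys are illustrative, not from the question):

```python
import pandas as pd

# Made-up nested record, three levels deep (illustrative data only).
record = {
    "id": 1,
    "Items": {
        "item_name": "name",
        "details": {"color": "red", "size": "L"},
    },
}

# Default: every nested dict is flattened down to its leaves.
full = pd.json_normalize(record, sep=".")

# max_level=1: flatten one level only; deeper dicts stay as dict cells.
shallow = pd.json_normalize(record, sep=".", max_level=1)

print(sorted(full.columns))
print(sorted(shallow.columns))
```

With max_level=1 the "Items.details" column still holds a dictionary, which can be handy when only the outer layer should become columns.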
It seems you're looking for pandas.json_normalize, which has a sep parameter:
obj = {
'name': 'Doron Barel',
'items': {
'item_name': 'name',
'item_value': 'value',
'another_item_prop': [
{
'subitem1_name': 'just_another_name',
'subitem1_value': 'just_another_value',
},
{
'subitem2_name': 'one_more_name',
'subitem2_value': 'one_more_value',
}
]
}
}
df = pd.json_normalize(obj, sep='.')
ser = df.pop('items.another_item_prop').explode()
# ser (not the undefined name s) supplies the index, so the rows align with df
out = (df.join(pd.DataFrame(ser.tolist(), index=ser.index)
                 .rename(columns=lambda x: ser.name + "." + x))
         .groupby("name", as_index=False).first()
      )
Output :
print(out)
name items.item_name items.item_value items.another_item_prop.subitem1_name items.another_item_prop.subitem1_value items.another_item_prop.subitem2_name items.another_item_prop.subitem2_value
0 Doron Barel name value just_another_name just_another_value one_more_name one_more_value
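For data shaped like the question's obj, where lists of dictionaries are nested to varying depths, one possible alternative is a small recursive helper that keys list elements by position, so nothing turns into extra rows. This is only a sketch: flatten_record is a hypothetical helper, the data is a trimmed-down version of the question's, and the "relatedItems.0.id" naming scheme is an assumption because the desired output is not shown.

```python
import pandas as pd


def flatten_record(value, prefix="", sep="."):
    """Hypothetical helper: flatten nested dicts/lists into one flat dict.

    List elements are keyed by their position, so nothing becomes a new row.
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        # Leaf value: store it under the accumulated key.
        return {prefix: value}
    flat = {}
    for key, child in items:
        child_key = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_record(child, child_key, sep))
    return flat


# Trimmed-down version of the question's data (one item, two related items).
obj = {
    "item1": {
        "id": "item1",
        "relatedItems": [
            {"id": "1111", "category": "electronics"},
            {"id": "2222", "additionalData": {"createdBy": "Doron", "inventory": 100}},
        ],
    }
}

# One row per top-level item, one column per leaf value.
df = pd.DataFrame([flatten_record(item) for item in obj.values()])
print(df.T)
```

Items with fewer related entries simply get NaN in the columns they lack.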
Would you please help me parse JSON that contains two arrays, using Python and json_normalize?
Here is the code:
import json
import pandas as pd
from pandas import json_normalize
data5 = {
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}
df2 = json_normalize(data5
, record_path = ['topping']
, meta = ['id', 'type', 'name', 'ppu', 'batters']
, record_prefix='_'
, errors='ignore'
)
This parses the "topping" array but doesn't parse "batters".
To parse "batters", the following code can be used:
# parse that part of the JSON into another dataframe
df3 = json_normalize(data5
,record_path = ['batters', 'batter'])
# cross join 2 dataframes
df2['key_'] = 1
df3['key_'] = 1
result = pd.merge(df2, df3, on='key_').drop(columns="key_")
But this looks complicated.
Is it possible to combine the two steps above into one call? E.g.:
df2 = json_normalize(data5
, record_path = ['topping', ['batters', 'batter']]
, meta = ['id', 'type', 'name', 'ppu', ]
, record_prefix='_'
, errors='ignore'
)
Thank you.
I don't think you can specify that within json_normalize. However, you can avoid creating the key_ column by passing how="cross" to pd.merge (available since pandas 1.2); there is also no need to keep batters in df2:
import pandas as pd
df2 = pd.json_normalize(data5
, record_path = ['topping']
, meta = ['id', 'type', 'name', 'ppu']
, record_prefix='_'
)
df3 = pd.json_normalize(data5
,record_path = ['batters', 'batter'])
pd.merge(df2, df3, how="cross")
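To see what how="cross" does in isolation, here is a minimal sketch on two tiny made-up frames (the names toppings and batters and their contents are illustrative only):

```python
import pandas as pd

toppings = pd.DataFrame({"topping": ["Glazed", "Sugar", "Maple"]})
batters = pd.DataFrame({"batter": ["Regular", "Chocolate"]})

# how="cross" pairs every row of the left frame with every row of the right one.
pairs = pd.merge(toppings, batters, how="cross")
print(pairs)
print(len(pairs))  # 3 * 2 rows
```

The result has len(left) * len(right) rows, which is the same shape the key_ trick produces, just without the helper column.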
I've been wrestling with this for many days now and would appreciate any help.
I'm importing an Excel file to a Pandas data frame resulting in the following dataframe [record]:
account_id  name    timestamp            value
A0001C      Fund_1  1588618800000000000  1
B0001B      Dev_2   1601578800000000000  1
I'm looking to produce nested JSON output (it will be used to submit data to an API), including adding "records" and "metrics" labels for the arrays.
Here is the output I'm looking for:
{
"records": [
{
"name": "Fund_1",
"account_id": "A0001C",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"name": "Dev_2",
"account_id": "B0001B",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
I've gotten non-nested JSON output, but I'm not able to split out the timestamp and value to add the metrics part.
for record in df.to_dict(orient='records'):
    record_data = {'records': [record]}
    payload_json = json.dumps(record_data)
    print(payload_json)
I get the following output:
{"records": [{"account_id": "A0001C", "name": "Fund_1", "Date Completed": 1588618800000000000, "Count": "1"}]}
{"records": [{"account_id": "B0001B", "name": "Dev_2", "Date Completed": 1601578800000000000, "Count": "1"}]}
Any help on how I can modify my code to add the metrics label and nest the data would be appreciated.
Thanks in advance.
One approach is to use DataFrame.apply. This allows you to apply a function to each Series (either column- or row-wise) in your dataframe.
In your particular case, you want to apply the function row-by-row, so you have to use apply with axis=1:
records = list(df.apply(lambda row: {"name": row["name"],
"account_id": row["account_id"],
"metrics": [{
"timestamp": row["timestamp"],
"value": row["value"]}]
},
axis=1).values)
payload = {"records": records}
Alternatively, you could introduce an auxiliary column "metrics" in which you store your metrics (and subsequently call to_dict):
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
Here's a full example applying option 2:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
B0001B Dev_2 1601578800000000000 1""")
# split on any whitespace, so the sample parses whether it uses tabs or spaces
df = pd.read_csv(data, sep=r"\s+")
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
Edit: The second approach also makes grouping by accounts (in case you want to do that) rather easy. Below is a small example and output:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
A0001C Fund_1 1588618900000000000 2
B0001B Dev_2 1601578800000000000 1""")
# split on any whitespace, so the sample parses whether it uses tabs or spaces
df = pd.read_csv(data, sep=r"\s+")
# adding the metrics column as above
df["metrics"] = df.apply(lambda e: {"timestamp": e.timestamp,
"value": e.value},
axis=1)
# group metrics by account
df_grouped = df.groupby(by=["name", "account_id"]).metrics.agg(list).reset_index()
records = df_grouped[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
},
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
},
{
"timestamp": 1588618900000000000,
"value": 2
}
]
}
]
}
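If you would rather avoid apply and groupby, the same nesting can be built with a plain loop. The sketch below is one possible variant, not the approach from the answer above; it rebuilds the sample data inline and casts values with int(), since json.dumps cannot serialize NumPy integer types.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "account_id": ["A0001C", "A0001C", "B0001B"],
    "name": ["Fund_1", "Fund_1", "Dev_2"],
    "timestamp": [1588618800000000000, 1588618900000000000, 1601578800000000000],
    "value": [1, 2, 1],
})

# Collect metrics per (account_id, name), keeping first-seen order.
grouped = {}
for row in df.itertuples(index=False):
    key = (row.account_id, row.name)
    record = grouped.setdefault(
        key, {"name": row.name, "account_id": row.account_id, "metrics": []}
    )
    # int() turns NumPy integers into plain ints, which json.dumps accepts.
    record["metrics"].append(
        {"timestamp": int(row.timestamp), "value": int(row.value)}
    )

payload = {"records": list(grouped.values())}
print(json.dumps(payload, indent=4))
```

Unlike groupby, which sorts by the group keys, this keeps the accounts in the order they first appear in the dataframe.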
How can we merge two dataframes whose columns contain nested dictionaries? I want to update df1 with df2 in the "actions" column. Is there any way to achieve this using available methods like concat, append and merge?
df1 = pd.DataFrame([
{
"id": "87c4b5a0db9f49c49f766436c9582297",
"actions": {
"sample": [
{
"tagvalue": "test",
"status": "created"
},
{
"tagvalue": "test2",
"status": "created"
}
]
}
},
{
"id": "87c4b5a0db9f49c49f766436c9582298",
"actions": {
"sample": [
{
"tagvalue": "test2",
"status": "created"
}
]
}
}
])
df2 = pd.DataFrame([
{
"id": "87c4b5a0db9f49c49f766436c9582297",
"actions": {
"sample": [
{
"tagvalue": "test",
"status": "updated"
}
]
}
}
])
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
# Need to merge the data based on id
# TODO : Right way to merge to get the following output
finalOutputExpectaion = [
{
"id": "87c4b5a0db9f49c49f766436c9582297",
"actions": {
"sample": [
{
"tagvalue": "test",
"status": "updated"
},
{
"tagvalue": "test2",
"status": "created"
}
]
}
},
{
"id": "87c4b5a0db9f49c49f766436c9582298",
"actions": {
"sample": [
{
"tagvalue": "test2",
"status": "created"
}
]
}
}
]
Note: finalOutputExpectaion is the updated dataframe as a list of dicts (we'll get it by using to_dict(orient="records")).
Python version: 3.7
pandas version: 1.1.0
First join the dataframes df1 and df2 on id, then inside a list comprehension zip the column actions from left and right dataframe and use a custom defined merge function to update the dictionaries:
def merge(d1, d2):
    if pd.isna(d1) or pd.isna(d2):
        return d1
    tags = set(d['tagvalue'] for d in d2['sample'])
    d2['sample'] += [d for d in d1['sample'] if d['tagvalue'] not in tags]
    return d2
m = df1.join(df2, lsuffix='', rsuffix='_r')
df1['actions'] = [merge(*v) for v in zip(m['actions'], m['actions_r'])]
Result:
actions
id
87c4b5a0db9f49c49f766436c9582297 {'sample': [{'tagvalue': 'test', 'status': 'updated'}, {'tagvalue': 'test2', 'status': 'created'}]}
87c4b5a0db9f49c49f766436c9582298 {'sample': [{'tagvalue': 'test2', 'status': 'created'}]}
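The merge function above extends d2['sample'] in place, so df2's dictionaries are modified as a side effect. If that matters, a non-mutating variant could look like the sketch below; merge_samples is a hypothetical helper that works on the two "sample" lists directly and assumes tagvalue is unique within each list.

```python
def merge_samples(old, new):
    """Hypothetical helper: overlay `new` sample entries on `old` by tagvalue."""
    merged = {entry["tagvalue"]: entry for entry in old}
    # Entries from `new` replace old ones that share the same tagvalue.
    merged.update({entry["tagvalue"]: entry for entry in new})
    return list(merged.values())


old = [
    {"tagvalue": "test", "status": "created"},
    {"tagvalue": "test2", "status": "created"},
]
new = [{"tagvalue": "test", "status": "updated"}]

result = merge_samples(old, new)
print(result)
```

Because a dictionary keeps the position of a key that gets overwritten, updated entries stay where the original entry was, and neither input list is changed.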
I want to use an API and need to put my dataframe into a dictionary format first.
The dataframe df looks like this:
OrigC OrigZ OrigN Weigh DestC DestZ DestN
0 PL 97 TP 59 DE 63 SN
Expected output for the first row:
{"section":[
{"location":
{
"zipCode":
{"OrigC": "PL",
"OrigZ":"97"},
"location": {"id": "1"},
"OrigN": "TP"
},
"carriageParameter":
{"road":
{"truckLoad": "Auto"}
},
"load":
{"Weigh": "59",
"unit": "ton",
"showEmissionsAtResponse": "true"
}
},
{"location":
{
"zipCode":
{"DestC": "DE",
"DestZ":"63"},
"location": {"id": "2"},
"DestN": "SN"
},
"carriageParameter":
{"road":
{"truckLoad":"Auto"}
},
"unload":
{"WEIGHTTONS":"59",
"unit": "ton",
"showEmissionsAtResponse": "true"
}
}]}
Note that there is static information in the dictionary that doesn't require any change.
How can this be done in Python?
You can use iterrows.
dic = {}
dic['section'] = []
for ix, row in df.iterrows():
    in_dict = {
        'location': {
            'zipCode': {
                'OrigC': row['OrigC'],
                'OrigZ': row['OrigZ'],
            },
            'location': {'id': ix+1},  # I am guessing here.
            'OrigN': row['OrigN'],  # taken from the row instead of hard-coding 'TP'
        },
        'carriageParameter': {
            'road': {
                'truckLoad': 'Auto'}
        },
        'load': {
            'Weigh': str(row['Weigh']),
        }
    }
    dic['section'].append(in_dict)
Note that this is not the entire entry, but I think it is clear enough to illustrate the idea.
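To cover both the origin and the destination entry, the per-row construction can be factored into a function. The following is a rough sketch under a few assumptions: build_stop and row_to_payload are hypothetical helpers, the ids are fixed at "1" and "2", all values are sent as strings, and the weight key is "Weigh" for both entries (the question's expected output uses "WEIGHTTONS" for the unload part, so adjust as the API requires).

```python
import pandas as pd


def build_stop(row, prefix, stop_id, action):
    """Hypothetical helper: build one entry of the "section" list.

    prefix is "Orig" or "Dest"; action is "load" or "unload".
    """
    return {
        "location": {
            "zipCode": {
                f"{prefix}C": row[f"{prefix}C"],
                f"{prefix}Z": str(row[f"{prefix}Z"]),
            },
            "location": {"id": str(stop_id)},
            f"{prefix}N": row[f"{prefix}N"],
        },
        # Static part, identical for every entry.
        "carriageParameter": {"road": {"truckLoad": "Auto"}},
        action: {
            "Weigh": str(row["Weigh"]),  # assumed key name for both entries
            "unit": "ton",
            "showEmissionsAtResponse": "true",
        },
    }


def row_to_payload(row):
    """Hypothetical helper: one request dictionary per dataframe row."""
    return {
        "section": [
            build_stop(row, "Orig", 1, "load"),
            build_stop(row, "Dest", 2, "unload"),
        ]
    }


df = pd.DataFrame([{
    "OrigC": "PL", "OrigZ": 97, "OrigN": "TP", "Weigh": 59,
    "DestC": "DE", "DestZ": 63, "DestN": "SN",
}])

payloads = [row_to_payload(row) for _, row in df.iterrows()]
print(payloads[0])
```

Each element of payloads corresponds to one dataframe row and can be passed to json.dumps before sending.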
I am trying to load json from a url and convert to a Pandas dataframe, so that the dataframe would look like the sample below.
I've tried json_normalize, but it duplicates the columns, one for each data type (value and stringValue). Is there a simpler way than this method and then dropping and renaming columns after creating the dataframe? I want to keep the stringValue.
Person ID Position ID Job ID Manager
0 192 936 93 Tom
my_json = {
"columns": [
{
"alias": "c3",
"label": "Person ID",
"dataType": "integer"
},
{
"alias": "c36",
"label": "Position ID",
"dataType": "string"
},
{
"alias": "c40",
"label": "Job ID",
"dataType": "integer",
"entityType": "job"
},
{
"alias": "c19",
"label": "Manager",
"dataType": "integer"
},
],
"data": [
{
"c3": {
"value": 192,
"stringValue": "192"
},
"c36": {
"value": "936",
"stringValue": "936"
},
"c40": {
"value": 93,
"stringValue": "93"
},
"c19": {
"value": 12412453,
"stringValue": "Tom"
}
}
]
}
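The code below picks stringValue for columns whose dataType is "string" and value otherwise, so it returns "Tom" for Manager only if c19 is of type string: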
alias_to_label = {x['alias']: x['label'] for x in my_json["columns"]}
is_str = {x['alias']: ('string' == x['dataType']) for x in my_json["columns"]}
data = []
for x in my_json["data"]:
    data.append({
        k: v["stringValue" if is_str[k] else 'value']
        for k, v in x.items()
    })
df = pd.DataFrame(data).rename(columns=alias_to_label)
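If the goal is simply to keep stringValue for every column, as the question states, the dataType check can be dropped. A minimal sketch, using a shortened copy of the question's my_json (only two of the four columns, to keep it brief):

```python
import pandas as pd

my_json = {
    "columns": [
        {"alias": "c3", "label": "Person ID", "dataType": "integer"},
        {"alias": "c19", "label": "Manager", "dataType": "integer"},
    ],
    "data": [
        {
            "c3": {"value": 192, "stringValue": "192"},
            "c19": {"value": 12412453, "stringValue": "Tom"},
        }
    ],
}

alias_to_label = {col["alias"]: col["label"] for col in my_json["columns"]}

# Keep only stringValue for every cell, renaming alias -> label on the way.
rows = [
    {alias_to_label[alias]: cell["stringValue"] for alias, cell in record.items()}
    for record in my_json["data"]
]
df = pd.DataFrame(rows)
print(df)
```

All columns come out as strings this way; numeric columns can be converted afterwards with pd.to_numeric if needed.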