Mapping pandas df to JSON Schema - python

Here is my df:
   text                   date                  channel    sentiment  product  segment
0  I like the new layout  2021-08-30T18:15:22Z  Snowflake  predict    Skills   EMEA
I need to convert this to JSON output that matches the following:
[
{
"text": "I like the new layout",
"date": "2021-08-30T18:15:22Z",
"channel": "Snowflake",
"sentiment": "predict",
"fields": [
{
"field": "product",
"value": "Skills"
},
{
"field": "segment",
"value": "EMEA"
}
]
}
]
I'm getting stuck with mapping the keys of the columns to the values in the first dict and mapping the column and row to new keys in the final dict. I've tried various options using df.groupby with .apply() but am coming up short.
Samples of what I've tried:
df.groupby(['text', 'date','channel','sentiment','product','segment']).apply(
lambda r: r[['27cf2f]].to_dict(orient='records')).unstack('text').apply(lambda s: [
{s.index.name: idx, 'fields': value}
for idx, value in s.items()]
).to_json(orient='records')
Any and all help is appreciated!

One option is to use a nested list comprehension:
import pandas as pd
# Start with your example data
d = {'text': ['I like the new layout'],
'date': ['2021-08-30T18:15:22Z'],
'channel': ['Snowflake'],
'sentiment': ['predict'],
'product': ['Skills'],
'segment': ['EMEA']}
df = pd.DataFrame(d)
# Specify field column names
fieldcols = ['product', 'segment']
# Build a dict for each group as a Series named `fields`
res = (df.groupby(['text', 'date','channel','sentiment'])
.apply(lambda s: [{'field': field,
'value': value}
for field in fieldcols
for value in s[field].values])
).rename('fields')
# Convert Series to DataFrame and then to_json
res = res.reset_index().to_json(orient='records')
# Print result
import json
print(json.dumps(json.loads(res), indent=2))
[
{
"text": "I like the new layout",
"date": "2021-08-30T18:15:22Z",
"channel": "Snowflake",
"sentiment": "predict",
"fields": [
{
"field": "product",
"value": "Skills"
},
{
"field": "segment",
"value": "EMEA"
}
]
}
]
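If each row should simply become one JSON object, the same shape can also be built without groupby. The following is a minimal sketch under that assumption; the name `basecols` is introduced here for illustration and is not part of the answer above. Unlike the groupby version, it emits one record per row and does not merge rows that share the same text/date/channel/sentiment.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "text": ["I like the new layout"],
    "date": ["2021-08-30T18:15:22Z"],
    "channel": ["Snowflake"],
    "sentiment": ["predict"],
    "product": ["Skills"],
    "segment": ["EMEA"],
})

# Columns to nest under "fields"; every other column stays at the top level.
fieldcols = ["product", "segment"]
basecols = [c for c in df.columns if c not in fieldcols]

records = []
for row in df.to_dict(orient="records"):
    # Copy the top-level columns, then add the list of field/value pairs.
    rec = {c: row[c] for c in basecols}
    rec["fields"] = [{"field": c, "value": row[c]} for c in fieldcols]
    records.append(rec)

print(json.dumps(records, indent=2))
```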

Related

Flattening Multi-Level Nested Object to DataFrame

I am trying to convert an object/dictionary to a Python DataFrame using the following code:
sr = pd.Series(object)
df = pd.DataFrame(sr.values.tolist())
display(df)
It works well but some of the output columns are of object/dictionary type, and I would like to break them up to multiple columns, for example, if column "Items" produces the following value in a cell:
obj = {
"item1": {
"id": "item1",
"relatedItems": [
{
"id": "1111",
"category": "electronics"
},
{
"id": "9999",
"category": "electronics",
"subcategory": "computers"
},
{
"id": "2222",
"category": "electronics",
"subcategory": "computers",
"additionalData": {
"createdBy": "Doron",
"inventory": 100
}
}
]
},
"item2": {
"id": "item2",
"relatedItems": [
{
"id": "4444",
"category": "furniture",
"subcategory": "sofas"
},
{
"id": "5555",
"category": "books",
},
{
"id": "6666",
"category": "electronics",
"subcategory": "computers",
"additionalData": {
"createdBy": "Joe",
"inventory": 5,
"condition": {
"name": "new",
"inspectedBy": "Doron"
}
}
}
]
}
}
The desired output is:
I tried using df.explode, but it multiplies the row to multiple rows, I am looking for a way to achieve the same but split into columns and retain a single row.
Any suggestions?
You can use the pd.json_normalize function to flatten the nested dictionary into multiple columns, with the keys joined with a dot (.).
sr = pd.Series({
'Items': {
'item_name': 'name',
'item_value': 'value'
}
})
df = pd.json_normalize(sr.to_dict(), sep='.')
display(df)
This will give you the following df
Items.item_name Items.item_value
0 name value
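The record_path parameter selects a nested list of records to flatten instead of the top level. It only applies when the value under that key is a list of dicts, so it cannot be used on the 'Items' dict above. With the obj from the question, for example, it can flatten the relatedItems list of one item (this produces one row per related item):
df = pd.json_normalize(obj['item1'], record_path='relatedItems', sep='.')
display(df)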
Seems like you're looking for pandas.json_normalize, which has a sep parameter:
obj = {
'name': 'Doron Barel',
'items': {
'item_name': 'name',
'item_value': 'value',
'another_item_prop': [
{
'subitem1_name': 'just_another_name',
'subitem1_value': 'just_another_value',
},
{
'subitem2_name': 'one_more_name',
'subitem2_value': 'one_more_value',
}
]
}
}
df = pd.json_normalize(obj, sep='.')
ser = df.pop('items.another_item_prop').explode()
out = (df.join(pd.DataFrame(ser.tolist(), index=ser.index)
.rename(columns= lambda x: ser.name+"."+x))
.groupby("name", as_index=False).first()
)
Output :
print(out)
name items.item_name items.item_value items.another_item_prop.subitem1_name items.another_item_prop.subitem1_value items.another_item_prop.subitem2_name items.another_item_prop.subitem2_value
0 Doron Barel name value just_another_name just_another_value one_more_name one_more_value
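For deeper structures such as the obj in the question, where lists contain dicts that contain further dicts, one option is to flatten everything recursively before building the DataFrame. This is a rough sketch, not a pandas feature: the helper name `flatten` and the choice to use list positions (0, 1, ...) as path segments are assumptions made for illustration, and `sample` is a shortened version of the question's data.

```python
import pandas as pd


def flatten(value, prefix="", sep="."):
    """Recursively flatten dicts and lists into one {path: scalar} dict."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        # List positions become path segments: relatedItems.0, relatedItems.1, ...
        items = enumerate(value)
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten(child, path, sep))
    return flat


sample = {
    "item1": {
        "id": "item1",
        "relatedItems": [
            {"id": "1111", "category": "electronics"},
            {"id": "2222", "additionalData": {"createdBy": "Doron", "inventory": 100}},
        ],
    }
}

row = flatten(sample)
df = pd.DataFrame([row])  # a single row, one column per leaf value
print(df.columns.tolist())
```

Empty dicts and empty lists produce no columns with this sketch, which may or may not be what you want.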

Pandas select rows from a DataFrame based on column values?

I have the JSON string below loaded into a DataFrame. Now I want to filter the records based on ossId.
The condition I wrote raises an error. What is the correct way to filter by ossId?
import pandas as pd
data = """
{
"components": [
{
"ossId": 3946,
"project": "OALX",
"licenses": [
{
"name": "BSD 3",
"status": "APPROVED"
}
]
},
{
"ossId": 3946,
"project": "OALX",
"version": "OALX.client.ALL",
"licenses": [
{
"name": "GNU Lesser General Public License v2.1 or later",
"status": "APPROVED"
}
]
},
{
"ossId": 2550,
"project": "OALX",
"version": "OALX.webservice.ALL" ,
"licenses": [
{
"name": "MIT License",
"status": "APPROVED"
}
]
}
]
}
"""
df = pd.read_json(data)
print(df)
df1 = df[df["components"]["ossId"] == 2550]
I think your issue is due to the JSON structure. pd.read_json(data) gives you a single column, components, whose cells each hold a whole component dict, so there is no ossId column to filter on.
You should instead pass the list of records to the DataFrame constructor. Something like:
import json
json_data = json.loads(data)
df = pd.DataFrame(json_data["components"])
filtered_data = df[df["ossId"] == 2550]
You need to go into the cell's data and get the correct key:
df[df['components'].apply(lambda x: x.get('ossId')==2550)]
Use str
df[df.components.str['ossId']==2550]
Out[89]:
components
2 {'ossId': 2550, 'project': 'OALX', 'version': ...
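If the nested licenses are needed as columns too, pd.json_normalize with record_path and meta can expand them while carrying ossId along. The following is a small sketch on a shortened copy of the question's data; the variable names `licenses` and `filtered` are chosen here for illustration.

```python
import json

import pandas as pd

data = """
{"components": [
  {"ossId": 3946, "project": "OALX",
   "licenses": [{"name": "BSD 3", "status": "APPROVED"}]},
  {"ossId": 2550, "project": "OALX", "version": "OALX.webservice.ALL",
   "licenses": [{"name": "MIT License", "status": "APPROVED"}]}
]}
"""

components = json.loads(data)["components"]

# One row per license; ossId and project are repeated from the parent component.
licenses = pd.json_normalize(components, record_path="licenses",
                             meta=["ossId", "project"])

filtered = licenses[licenses["ossId"] == 2550]
print(filtered)
```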

Convert PANDAS dataframe to nested JSON + add array name

I've been wrestling with this for many days now and would appreciate any help.
I'm importing an Excel file into a pandas DataFrame, resulting in the following DataFrame [record]:
account_id  name    timestamp            value
A0001C      Fund_1  1588618800000000000  1
B0001B      Dev_2   1601578800000000000  1
I'm looking to produce nested JSON output (it will be used to submit data to an API), including "records" and "metrics" labels for the arrays.
Here is the output I'm looking for:
{
"records": [
{
"name": "Fund_1",
"account_id": "A0001C",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"name": "Dev_2",
"account_id": "B0001B",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
I've gotten a non-nested JSON data set as output, but I'm not able to split out the timestamp and value to add the metrics part.
for record in df.to_dict(orient='records'):
    record_data = {'records': [record]}
    payload_json = json.dumps(record_data)
    print(payload_json)
I get the following output:
{"records": [{"account_id": "A0001C", "name": "Fund_1", "Date Completed": 1588618800000000000, "Count": "1"}]}
{"records": [{"account_id": "B0001B", "name": "Dev_2", "Date Completed": 1601578800000000000, "Count": "1"}]}
Any help on how I can modify my code to add the metrics label and nest the data would be appreciated.
Thanks in advance.
One approach is to use DataFrame.apply. This lets you apply a function to each Series (either column-wise or row-wise) in your DataFrame.
In your particular case, you want to apply the function row by row, so you have to call apply with axis=1:
records = list(df.apply(lambda row: {"name": row["name"],
"account_id": row["account_id"],
"metrics": [{
"timestamp": row["timestamp"],
"value": row["value"]}]
},
axis=1).values)
payload = {"records": records}
Alternatively, you could introduce an auxiliary column "metrics" in which you store your metrics, and then convert the relevant columns with DataFrame.to_dict:
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
Here's a full example applying option 2:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep=r"\s+")
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
Edit: The second approach also makes grouping by accounts (in case you want to do that) rather easy. Below is a small example and output:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
A0001C Fund_1 1588618900000000000 2
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep=r"\s+")
# adding the metrics column as above
df["metrics"] = df.apply(lambda e: {"timestamp": e.timestamp,
"value": e.value},
axis=1)
# group metrics by account
df_grouped = df.groupby(by=["name", "account_id"]).metrics.agg(list).reset_index()
records = df_grouped[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
},
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
},
{
"timestamp": 1588618900000000000,
"value": 2
}
]
}
]
}
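The grouping step can also be wrapped in a small reusable helper. This is only a sketch: the function name `nest_records` and its parameters are hypothetical, and it assumes `keys` is a list of at least two column names so that each group key is a tuple.

```python
import json

import pandas as pd


def nest_records(df, keys, nested_cols, label):
    """Return one dict per group of `keys`, with `nested_cols` nested under `label`."""
    records = []
    # sort=False keeps groups in order of first appearance.
    for key_values, group in df.groupby(keys, sort=False):
        rec = dict(zip(keys, key_values))
        rec[label] = group[nested_cols].to_dict(orient="records")
        records.append(rec)
    return records


df = pd.DataFrame({
    "account_id": ["A0001C", "A0001C", "B0001B"],
    "name": ["Fund_1", "Fund_1", "Dev_2"],
    "timestamp": [1588618800000000000, 1588618900000000000, 1601578800000000000],
    "value": [1, 2, 1],
})

records = nest_records(df, ["name", "account_id"], ["timestamp", "value"], "metrics")
payload = {"records": records}
print(json.dumps(payload, indent=4))
```

Because of sort=False, Fund_1 comes before Dev_2 here, unlike the sorted groupby output above.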

Getting 0 records while parsing json file , if the Key Attribute does not exists

I have a few static key columns (EmployeeId, type) and a few columns that come from the first for loop.
In the second for loop, values should be appended to the existing DataFrame columns only if a specific key is present; otherwise the columns fetched by the first for loop should remain unchanged.
First for loop output:
"EmployeeId","type","KeyColumn","Start","End","Country","Target","CountryId","TargetId"
"Emp1","Metal","1212121212","2000-06-17","9999-12-31","","","",""
After the second for loop I have the output below:
"EmployeeId","type","KeyColumn","Start","End","Country","Target","CountryId","TargetId"
"Emp1","Metal","1212121212","2000-06-17","9999-12-31","","AMAZON","1",""
"Emp1","Metal","1212121212","2000-06-17","9999-12-31","","FLIPKART","2",""
With this code, when the Employee tag is available I get the 2 records above. However, some JSON files may not have the Employee tag; in that case the output should be the same as the first loop's output, with all key fields populated and the remaining columns null.
Instead, I am getting 0 records. Please help me if my approach is wrong.
I am new to Python, so I apologize if the question is unclear. Please find the sample data at the link below.
Please find the code below:
for i in range(len(json_file['enty'])):
    temp = {}
    temp['EmployeeId'] = json_file['enty'][i]['id']
    temp['type'] = json_file['enty'][i]['type']
    for key in json_file['enty'][i]['data']['attributes'].keys():
        try:
            temp[key] = json_file['enty'][i]['data']['attributes'][key]['values'][0]['value']
        except:
            temp[key] = None
    for key in json_file['enty'][i]['data']['attributes'].keys():
        if(key == 'Employee'):
            for j in range(len(json_file['enty'][i]['data']['attributes']['Employee']['group'])):
                for key in json_file['enty'][i]['data']['attributes']['Employee']['group'][j].keys():
                    try:
                        temp[key] = json_file['enty'][i]['data']['attributes']['Employee']['group'][j][key]['values'][0]['value']
                    except:
                        temp[key] = None
                temp_df = pd.DataFrame([temp])
                df = pd.concat([df, temp_df], sort=True)
# Rearranging columns
df = df[['EmployeeId', 'type'] + [col for col in df.columns if col not in ['EmployeeId', 'type']]]
# Writing the dataset
df[columns_list].to_csv("Test22.csv", index=False, quotechar='"', quoting=1)
If the Employee tag is not available, I get 0 records as output, but I expect 1 record, as from the first for loop.
The JSON structure is quite complicated, so I tried to simplify the data collection from it. The result is a list of flat dicts. The code handles the case where 'Employee' is not found.
import copy
d = {
"enty": [
{
"id": "Emp1",
"type": "Metal",
"data": {
"attributes": {
"KeyColumn": {
"values": [
{
"value": 1212121212
}
]
},
"End": {
"values": [
{
"value": "2050-12-31"
}
]
},
"Start": {
"values": [
{
"value": "2000-06-17"
}
]
},
"Employee": {
"group": [
{
"Target": {
"values": [
{
"value": "AMAZON"
}
]
},
"CountryId": {
"values": [
{
"value": "1"
}
]
}
},
{
"Target": {
"values": [
{
"value": "FLIPKART"
}
]
},
"CountryId": {
"values": [
{
"value": "2"
}
]
}
}
]
}
}
}
}
]
}
emps = []
for e in d['enty']:
    entry = {'id': e['id'], 'type': e['type']}
    for x in ["KeyColumn", "Start", "End"]:
        entry[x] = e['data']['attributes'][x]['values'][0]['value']
    if e['data']['attributes'].get('Employee'):
        for grp in e['data']['attributes']['Employee']['group']:
            clone = copy.deepcopy(entry)
            for x in ['Target', 'CountryId']:
                clone[x] = grp[x]['values'][0]['value']
            emps.append(clone)
    else:
        # No 'Employee' attribute: keep the base record as-is
        emps.append(entry)
# TODO write to csv
for emp in emps:
    print(emp)
Output:
{'End': '2050-12-31', 'Target': 'AMAZON', 'KeyColumn': 1212121212, 'Start': '2000-06-17', 'CountryId': '1', 'type': 'Metal', 'id': 'Emp1'}
{'End': '2050-12-31', 'Target': 'FLIPKART', 'KeyColumn': 1212121212, 'Start': '2000-06-17', 'CountryId': '2', 'type': 'Metal', 'id': 'Emp1'}
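The remaining "write to csv" step could look roughly like the sketch below, which uses csv.DictWriter from the standard library. The second record (Emp2, without Target and CountryId) and the `columns` list are made up here to illustrate an entry that has no 'Employee' attribute; restval fills its missing columns with empty strings. The sketch writes to an in-memory buffer; replace it with open("Test22.csv", "w", newline="") to write a file.

```python
import csv
import io

# Flat dicts as produced by the loop above; Emp2 is a hypothetical entry
# without an 'Employee' attribute, so it lacks Target and CountryId.
emps = [
    {"id": "Emp1", "type": "Metal", "KeyColumn": 1212121212,
     "Start": "2000-06-17", "End": "2050-12-31",
     "Target": "AMAZON", "CountryId": "1"},
    {"id": "Emp2", "type": "Metal", "KeyColumn": 3434343434,
     "Start": "2001-01-01", "End": "2050-12-31"},
]

columns = ["id", "type", "KeyColumn", "Start", "End", "Target", "CountryId"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                        quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(emps)  # keys missing from a record are written as restval

csv_text = buffer.getvalue()
print(csv_text)
```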

Django filtering based on list value

My JSON data is in this format..
[
{
"id": "532befe4ee434047ff968a6e",
"company": "528458c4bbe7823947b6d2a3",
"values" : [
{
"Value":"11",
"uniqueId":true
},
{
"Value":"14",
"uniqueId":true
},
]
},
{
"id": "532befe4ee434047ff968a",
"company": "528458c4bbe7823947b6d",
"values" : [
{
"Value":"1111",
"uniqueId":true
},
{
"Value":"10",
"uniqueId":true
},
]
}
]
If I want to filter on the company field, I can do it this way:
qaresults = QAResult.objects.filter(company= comapnyId)
and it gives me the first dictionary of the list.
But what if I want to filter on the "Value" key of the first dictionary in the values list?
I am not 100% sure what you want, but from what I understand of your question,
try this solution:
import json
json_dict = json.loads('[{"id": "532befe4ee434047ff968a6e","company": "528458c4bbe7823947b6d2a3","values": [{"Value": "11","uniqueId": true},{"Value": "14","uniqueId": true}]},{"id": "532befe4ee434047ff968a","company": "528458c4bbe7823947b6d","values": [{"Value": "1111","uniqueId": true},{"Value": "10","uniqueId": true}]}]')
expected_values = []
js = json_dict[0]
for key, value in js.items():
    if key == 'values':
        expected_values.append(value[0]['Value'])
And then
qaresults = QAResult.objects.filter(company__id__in = expected_values)
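If the goal is to pick the records whose values list contains a given "Value", that can be done on the parsed JSON in plain Python. This is only a sketch on the data as posted; the helper `has_value` is hypothetical, and it does not touch the ORM, since how QAResult stores these values is not shown in the question.

```python
import json

raw = '''[
  {"id": "532befe4ee434047ff968a6e", "company": "528458c4bbe7823947b6d2a3",
   "values": [{"Value": "11", "uniqueId": true}, {"Value": "14", "uniqueId": true}]},
  {"id": "532befe4ee434047ff968a", "company": "528458c4bbe7823947b6d",
   "values": [{"Value": "1111", "uniqueId": true}, {"Value": "10", "uniqueId": true}]}
]'''

records = json.loads(raw)


def has_value(record, target):
    """True if any entry in the record's values list has the given Value."""
    return any(item.get("Value") == target for item in record.get("values", []))


# Keep only the records that contain Value "10" somewhere in their values list.
matches = [r for r in records if has_value(r, "10")]
print([r["id"] for r in matches])
```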
