This is my first question on here. I have searched around on here and throughout the web and I seem unable to find the answer to my question. I'm trying to explode a list in a JSON file out into multiple columns and rows. Everything I have tried so far has proven unsuccessful.
I am doing this over multiple JSON files within a directory, with the goal of having the dataframe print out like this.
Goal:
did  Version  Nodes  rds  time  c  sc  f  uc
did  Version  Nodes  rds  time  c  sc  f  uc
did  Version  Nodes  rds  time  c  sc  f  uc
did  Version  Nodes  rds  time  c  sc  f  uc
Instead I get this in my dataframe:
did  Version  Nodes  rds  fusage
did  Version  Nodes  rds  everything in fusage
did  Version  Nodes  rds  everything in fusage
did  Version  Nodes  rds  everything in fusage
Example of the JSON I'm working with (the JSON structure will not change):
{
    "did": "123456789",
    "mId": "1a2b3cjsks",
    "timestamp": "2021-11-26T11:10:58.322000",
    "beat": {
        "did": "123456789",
        "collectionTime": "2010-05-26 11:10:58.004783+00",
        "Nodes": 6,
        "Version": "v1.4.6-2",
        "rds": "0.00B",
        "fusage": [
            {
                "time": "2010-05-25",
                "c": "string",
                "sc": "string",
                "f": "string",
                "uc": "int"
            },
            {
                "time": "2010-05-19",
                "c": "string",
                "sc": "string",
                "f": "string",
                "uc": "int"
            },
            {
                "t": "2010-05-23",
                "c": "string",
                "sc": "string",
                "f": "string",
                "uc": "int"
            },
            {
                "time": "2010-05-23",
                "c": "string",
                "sc": "string",
                "f": "string",
                "uc": "int"
            }
        ]
    }
}
My end goal is to get the dataframe out to a CSV so it can be ingested. I appreciate everyone's help looking at this.
Using Python 3.8.10 & pandas 1.3.4.
Python code below:
import glob
import json
import os

import pandas as pd

tempdir = '/dir/to/files/json_temp'
json_files = os.path.join(tempdir, '*.json')
file_list = glob.glob(json_files)

dfs = []
for file in file_list:
    with open(file) as f:
        data = pd.json_normalize(json.loads(f.read()))
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)
df.explode('fusage')
print(df)
If you're going to use the explode function, then afterwards apply pd.Series over the column containing the fusage list (beat.fusage) to obtain a Series for each list item. Note that json_normalize flattens the nested key to beat.fusage, which is why df.explode('fusage') finds no such column.
/dir/to/files
├── example-v1.4.6-2.json
└── example-v2.2.2-2.json
...
for file in file_list:
    with open(file) as f:
        data = pd.json_normalize(json.loads(f.read()))
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

fusage_list = df.explode('beat.fusage')['beat.fusage'].apply(pd.Series)
df = pd.concat([df, fusage_list], axis=1)

# show desired columns
df = df[['did', 'beat.Version', 'beat.Nodes', 'beat.rds', 'time', 'c', 'sc', 'f', 'uc']]
print(df)
Output from df
did beat.Version beat.Nodes beat.rds time c sc f uc
0 123456789 v1.4.6-2 6 0.00B 2010-05-25 string string string int
0 123456789 v1.4.6-2 6 0.00B 2010-05-19 string string string int
0 123456789 v1.4.6-2 6 0.00B NaN string string string int
0 123456789 v1.4.6-2 6 0.00B 2010-05-23 string string string int
1 123777777 v2.2.2-2 4 0.00B 2010-05-25 string string string int
1 123777777 v2.2.2-2 4 0.00B 2010-05-19 string string string int
1 123777777 v2.2.2-2 4 0.00B NaN string string string int
1 123777777 v2.2.2-2 4 0.00B 2010-05-23 string string string int
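On newer pandas versions, pd.json_normalize can also do the flattening in one call via record_path and meta, skipping the explode/apply step; a minimal sketch under the same directory layout as above (since the third fusage entry uses "t" instead of "time", its time value still comes out NaN):
import glob
import json
import os

import pandas as pd

tempdir = '/dir/to/files/json_temp'
file_list = glob.glob(os.path.join(tempdir, '*.json'))

dfs = []
for file in file_list:
    with open(file) as f:
        record = json.load(f)
    # record_path walks into beat.fusage (one row per list item);
    # meta carries the per-file fields onto every row
    dfs.append(pd.json_normalize(
        record,
        record_path=['beat', 'fusage'],
        meta=['did', ['beat', 'Version'], ['beat', 'Nodes'], ['beat', 'rds']],
    ))

df = pd.concat(dfs, ignore_index=True)
df.to_csv('out.csv', index=False)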
I have a pandas dataframe which I would like to convert to JSON format for my source system to utilize, which requires a very specific JSON format.
I can't seem to get to the exact format shown in the expected output section using simple dictionary loops.
Is there any way I can convert a CSV/pd.DataFrame to nested JSON?
Any python package specifically built for this?
Input Dataframe:
# Create input dataframe
data = {
    'col6': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col7': [1, 1, 2, 1, 2, 2],
    'col8': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col10': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col14': [1, 1, 1, 1, 1, 2],
    'col15': [1, 2, 1, 1, 1, 1],
    'col16': [9, 10, 26, 9, 12, 4],
    'col18': [1, 1, 2, 1, 2, 3],
    'col1': ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx'],
    'col2': [2.02011E+13, 2.02011E+13, 2.02011E+13, 2.02011E+13, 2.02011E+13, 2.02011E+13],
    'col3': ['xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012'],
    'col4': ['yyyy', 'yyyy', 'yyyy', 'yyyy', 'yyyy', 'yyyy'],
    'col5': [0, 0, 0, 0, 0, 0],
    'col9': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col11': [0, 0, 0, 0, 0, 0],
    'col12': [0, 0, 0, 0, 0, 0],
    'col13': [0, 0, 0, 0, 0, 0],
    'col17': [51, 63, 47, 59, 53, 56]
}
df = pd.DataFrame(data)
Expected Output:
{
    "header1": {
        "col1": "xxxx",
        "col2": "20201107023012",
        "col3": "xxxx20201107023012",
        "col4": "yyyy",
        "col5": "0"
    },
    "header2": {
        "header3": [
            {
                "col6": "A",
                "col7": 1,
                "header4": [
                    {
                        "col8": "A",
                        "col9": 1,
                        "col10": "A",
                        "col11": 0,
                        "col12": 0,
                        "col13": 0,
                        "header5": [
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 1,
                                "col17": 51,
                                "col18": 1
                            },
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 2,
                                "col17": 63,
                                "col18": 2
                            }
                        ]
                    },
                    {
                        "col8": "A",
                        "col9": 1,
                        "col10": "A",
                        "col11": 0,
                        "col12": 0,
                        "col13": 0,
                        "header5": [
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 1,
                                "col17": 51,
                                "col18": 1
                            },
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 2,
                                "col17": 63,
                                "col18": 2
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
Maybe this will get you started. I'm not aware of a current Python module that will do what you want, but this is the basis of how I'd start it, making assumptions based on what you've provided.
As each successive nest is based on some criteria, you'll need to loop through filtered dataframes. Depending on the size of your dataframes, using groupby may be a better option than what I have here, but the theory is the same. Also, you'll have to create your key-value pairs correctly; this just creates the data to support what you are building.
# assume header1 is constant, so take the first row to create its dictionary
header1 = dict(df.iloc[0][['col1', 'col2', 'col3', 'col4', 'col5']])
print('header1', header1)

# for header3 you need the unique combinations, so create a dataframe
# and then iterate through it to get all the header3 dictionaries
header3_dicts = []
dfh3 = df[['col6', 'col7']].drop_duplicates().reset_index(drop=True)
for i in range(dfh3.shape[0]):
    header3_dicts.append(dict(dfh3.iloc[i][['col6', 'col7']]))
print('header3', header3_dicts)

# iterate over header3 to get header4
for i in range(dfh3.shape[0]):
    dfh4 = df.loc[(df['col6'] == dfh3.iat[i, 0]) & (df['col7'] == dfh3.iat[i, 1])]
    header4_dicts = []
    for j in range(dfh4.shape[0]):
        # index into the filtered dfh4, not the full df
        header4_dicts.append(dict(dfh4.iloc[j][['col8', 'col9', 'col10', 'col11', 'col12', 'col13']]))
    print('header4', header4_dicts)

# next level: repeat similar to above
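For the groupby route mentioned above, a minimal sketch of the full nesting, assuming the header names and grouping keys from the expected output (default=str is passed to json.dumps so numpy integer types serialize):
import json

# header1 is taken from the first row, as in the loop version above
result = {
    "header1": df.iloc[0][['col1', 'col2', 'col3', 'col4', 'col5']].to_dict(),
    "header2": {"header3": []},
}

for (col6, col7), g67 in df.groupby(['col6', 'col7']):
    header4 = []
    for keys, g8 in g67.groupby(['col8', 'col9', 'col10', 'col11', 'col12', 'col13']):
        entry = dict(zip(['col8', 'col9', 'col10', 'col11', 'col12', 'col13'], keys))
        # each header5 list holds the remaining columns, one dict per row
        entry['header5'] = g8[['col14', 'col15', 'col16', 'col17', 'col18']].to_dict('records')
        header4.append(entry)
    result["header2"]["header3"].append({'col6': col6, 'col7': col7, 'header4': header4})

# default=str so numpy integer types serialize cleanly
print(json.dumps(result, indent=2, default=str))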
I am working on getting data from an API using Python. The API returns data in the form of JSON, which is normalised and written to a dataframe, which is then written to a CSV file.
The API can return any number of columns, and this differs between records. I need only a fixed number of columns, which I am defining in the code.
In the scenario where a required column is not returned, my code fails.
I need a solution where, even though required columns are not present in the dataframe, the column header still gets created in the CSV and all its rows get populated with null.
Required CSV structure:
name address phone
abc bcd 1214
bcd null null
I'm not sure if I understood you correctly, but I hope the following code solves your problem:
import json

import pandas as pd

# Declare JSON with missing values:
# - the first element doesn't contain the "phone" field
# - the second element doesn't contain the "married" field
api_data = """
{ "sentences":
    [
        { "name": "abc", "address": "bcd", "married": true },
        { "name": "def", "address": "ghi", "phone": 7687 }
    ]
}
"""

json_data = json.loads(api_data)

df = pd.DataFrame(
    data=json_data["sentences"],
    # Explicitly declare which columns should be present in the DataFrame.
    # If the value for a given column is absent it will be populated with NaN.
    columns=["name", "address", "married", "phone"],
)

# Save result to csv:
df.to_csv("tmp.csv", index=False)
The content of resulting csv:
name,address,married,phone
abc,bcd,True,
def,ghi,,7687.0
P.S.:
It should work even if columns are absent in all the records. Here is another example:
# Both elements lack the "married" and "phone" fields
api_data = """
{ "sentences":
    [
        { "name": "abc", "address": "bcd" },
        { "name": "def", "address": "ghi" }
    ]
}
"""

json_data = json.loads(api_data)

df = pd.DataFrame(
    data=json_data["sentences"],
    # Explicitly declare which columns should be present in the DataFrame.
    # If the value for a given column is absent it will be populated with NaN.
    columns=["name", "address", "married", "phone"],
)

# Print first rows of DataFrame:
df.head()
# Expected output:
#   name address  married  phone
# 0  abc     bcd      NaN    NaN
# 1  def     ghi      NaN    NaN

df.to_csv("tmp.csv", index=False)
In this case the resulting csv file will contain the following text:
name,address,married,phone
abc,bcd,,
def,ghi,,
The last two commas in the 2nd and 3rd lines mean "an empty/missing value", and if you create a DataFrame from the resulting csv with pd.read_csv, the "married" and "phone" columns will be populated with NaN values.
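If the dataframe is instead built with pd.json_normalize, as in the question, reindex gives the same guarantee; a minimal sketch with the question's column names assumed:
import pandas as pd

records = [{"name": "abc", "address": "bcd"}, {"name": "def", "address": "ghi"}]

# reindex adds any missing required columns, filling them with NaN
df = pd.json_normalize(records).reindex(columns=["name", "address", "phone"])
df.to_csv("tmp.csv", index=False)
# name,address,phone
# abc,bcd,
# def,ghi,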
I have a pandas dataframe with rows created from dicts using pd.io.json.json_normalize(). The values (not the keys/column names) in the dataframe have been modified. I want to retrieve a dict, with the same nested format the original dict had, from a row of the dataframe.
sample = {
    "A": {
        "a": 7
    },
    "B": {
        "a": "name",
        "z": {
            "dD": 20,
            "f_f": 3,
        }
    }
}
df = pd.io.json.json_normalize(sample, sep='__')
As expected, df.columns returns:
Index(['A__a', 'B__a', 'B__z__dD', 'B__z__f_f'], dtype='object')
I want to "reverse" the process now.
I can guarantee that no string in the original dict (key or value) contains '__' as a substring, and none starts or ends with '_'.
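Given that guarantee, one way to reverse the flattening is to split each column name on the separator and rebuild the nesting level by level; a minimal sketch (the helper name is illustrative):
def row_to_nested_dict(row, sep='__'):
    # Rebuild a nested dict from a flattened row/Series:
    # each column name is split on sep and used as a path of keys.
    result = {}
    for flat_key, value in row.items():
        *parents, leaf = flat_key.split(sep)
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result

print(row_to_nested_dict(df.iloc[0]))
# {'A': {'a': 7}, 'B': {'a': 'name', 'z': {'dD': 20, 'f_f': 3}}}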
I have some JSON, returned from an API call, that looks something like this:
{
    "result": {
        "code": "OK",
        "msg": ""
    },
    "report_name": "FAMOUS_DICTATORS",
    "columns": [
        "rank",
        "name",
        "deaths"
    ],
    "data": [
        {
            "row": [
                1,
                "Mao Zedong",
                63000000
            ]
        },
        {
            "row": [
                2,
                "Jozef Stalin",
                23000000
            ]
        }
    ]
}
I'd like to convert the JSON into a Pandas DataFrame:
rank name deaths
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
I wrote this and it works, but looks a bit ugly:
import pandas as pd
import json

columns = eval(r.content)['columns']
df = pd.DataFrame(columns=eval(r.content)['columns'])
for row in eval(r.content)['data']:
    df.loc[len(df) + 1] = row['row']
Is there a more elegant/Pythonic way to do this (e.g. possibly using pandas.io.json.read_json)?
The read_json function of pandas is a very tricky method to use. If you don't know with certainty the validity of your JSON object, or whether its initial structure is sane enough to build a dataframe around, it's much better to stick to tried and tested methods to break your data down into something that pandas can use without issues.
In your case, I suggest breaking your data down into a list of lists. Out of all that JSON, the only parts you really need are under the data and columns keys.
Try this:
import json

import pandas as pd

with open("test.json") as f:
    js = json.load(f)

rows = [row["row"] for row in js["data"]]  # Transform the 'row' keys into a list of lists.
df = pd.DataFrame(rows, columns=js["columns"])
print(df)
This gives me the desired result:
rank name deaths
0 1 Mao Zedong 63000000
1 2 Jozef Stalin 23000000
See pandas.io.json.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None):
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html
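For completeness, a minimal sketch of read_json on records-oriented JSON (it will not unpack the nested data/row structure from this question by itself):
import io

import pandas as pd

# orient='records' expects a JSON array of objects, one object per row
simple = '[{"rank": 1, "name": "Mao Zedong", "deaths": 63000000}]'
df = pd.read_json(io.StringIO(simple), orient='records')
print(df)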