I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
"general_info": {
"name": "xxx",
"description": "xxx",
"language": "xxx",
"prefix": "xxx",
"version": "xxx"
},
"element_count": {
"folders": 23,
"conditions": 72,
"listeners": 1,
"outputs": 47
},
"external_resource_count": {
"total": 9,
"extensions": {
"jar": 8,
"json": 1
},
"paths": {
"/lib": 9
}
},
"complexity": {
"over_1_transition": {
"number": 4,
"percentage": 30.769
},
"over_1_trigger": {
"number": 2,
"percentage": 15.385
},
"over_1_output": {
"number": 4,
"percentage": 30.769
}
}
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the column labelled 0 to a meaningful column name.
import pandas as pd

# this function pulls the listed keys out of your nested dicts, one Series per key
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if isinstance(x, dict) else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')
# let's separate the rows where a dict is present and explode only those rows
mask = data_frame.somecol.apply(lambda x: isinstance(x, dict))
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})
# here we concat the exploded series into a frame
exploded_df = (pd.concat(expp, axis=1).stack().to_frame(name='somecol2')
               .reset_index(level=2).rename(columns={'level_2': 'somecol'}))
# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
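As an alternative sketch (not the approach used in the answer above), the nested dictionary can be flattened in plain Python before it reaches pandas, which sidesteps the "incomplete columns" problem from the question. The helper name flatten, the column name value and the trimmed-down extracted_metrics are assumptions made for illustration. Shorter key paths are padded with empty strings so that every row has the same number of index levels.

```python
import pandas as pd

# Trimmed-down stand-in for the extracted_metrics dictionary from the question.
extracted_metrics = {
    "general_info": {"name": "xxx", "version": "xxx"},
    "element_count": {"folders": 23, "outputs": 47},
    "external_resource_count": {
        "total": 9,
        "extensions": {"jar": 8, "json": 1},
        "paths": {"/lib": 9},
    },
}


def flatten(nested, path=()):
    """Yield (key_path, value) for every leaf of an arbitrarily nested dict."""
    for key, value in nested.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (key,))
        else:
            yield path + (key,), value


leaves = list(flatten(extracted_metrics))
depth = max(len(path) for path, _ in leaves)

# Pad shorter key paths with "" so every row has the same number of index levels.
index = pd.MultiIndex.from_tuples(
    [path + ("",) * (depth - len(path)) for path, _ in leaves]
)
table = pd.DataFrame({"value": [value for _, value in leaves]}, index=index)

# header=False leaves out the row of column labels when rendering to HTML.
html = table.to_html(header=False)
```

Because the padding happens before the frame is built, jar and json end up as their own rows under extensions, and the header=False argument should take care of the row with the column number.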
I'm trying to parse nested json results.
data = {
"results": [
{
"components": [
{
"times": {
"periods": [
{
"fromDayOfWeek": 0,
"fromHour": 12,
"fromMinute": 0,
"toDayOfWeek": 4,
"toHour": 21,
"toMinute": 0,
"id": 156589,
"periodId": 20855
}
],
}
}
],
}
],
}
I can get to and create dataframes for "results" and "components" lists, but cannot get to "periods" due to the "times" dict. So far I have this:
df = pd.json_normalize(data, record_path = ['results','components'])
I need a separate "periods" dataframe with the included column names and values. I would appreciate your help on this. Thank you!
I results
II components
III times
IV periods
json_normalize should be the right tool for this:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
There are 4 levels of nesting. There can be x components in results and y times in components, although that much nesting may be overengineered.
The simplest way of getting the data is to index into each level in turn:
print(data['a']['b']['c']['d'])  # and so on for deeper levels
In your case, results and components are lists, so each of them also needs a list index:
print(data['results'][0]['components'][0]['times']['periods'])
You can collect one property from every period with this piece of code:
def get_property_from_periods(prop):
    property_list = []
    for result in data['results']:
        for component in result['components']:
            for period in component['times']['periods']:
                property_list.append(period[prop])
    return property_list
This gives you access to one property inside periods (fromDayOfWeek, fromHour, fromMinute).
After converting the JSON values, transform them into a pandas dataframe:
print(pd.DataFrame(data, columns=["columnA", "columnB"]))
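To tie this back to the question, here is a minimal sketch of building the separate "periods" dataframe. It assumes the structure shown in the question (lists under results, components and periods, with times as a plain dict) and walks the levels in ordinary Python before handing the flat list to pandas. The variable names periods and periods_df are made up for illustration.

```python
import pandas as pd

# Same shape as the JSON in the question.
data = {
    "results": [
        {
            "components": [
                {
                    "times": {
                        "periods": [
                            {
                                "fromDayOfWeek": 0,
                                "fromHour": 12,
                                "fromMinute": 0,
                                "toDayOfWeek": 4,
                                "toHour": 21,
                                "toMinute": 0,
                                "id": 156589,
                                "periodId": 20855,
                            }
                        ],
                    }
                }
            ],
        }
    ],
}

# Walk the two list levels, step through the "times" dict, and collect every period.
periods = [
    period
    for result in data["results"]
    for component in result["components"]
    for period in component["times"]["periods"]
]

# Each period dict becomes one row; its keys become the column names.
periods_df = pd.DataFrame(periods)
print(periods_df)
```

The comprehension keeps working when there are several results, components or periods, since it simply appends every period it finds.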
If stuck:
How to Create a table with data from JSON output in Python
Python - How to convert JSON File to Dataframe
pandas documentation:
pandas.DataFrame.from_dict
pandas.json_normalize
I have a pandas dataframe which I would like to convert to JSON format for my source system to utilize, which requires a very specific JSON format.
I can't seem to get the exact format shown in the expected output section using simple dictionary loops.
Is there any way I can convert a CSV/pd.DataFrame to nested JSON?
Is there any Python package specifically built for this?
Input Dataframe:
#Create Input Dataframe
data = {
'col6':['A','A','A','B','B','B'],
'col7':[1, 1, 2, 1, 2, 2],
'col8':['A','A','A','B','B','B'],
'col10':['A','A','A','B','B','B'],
'col14':[1,1,1,1,1,2],
'col15':[1,2,1,1,1,1],
'col16':[9,10,26,9,12,4],
'col18':[1,1,2,1,2,3],
'col1':['xxxx','xxxx','xxxx','xxxx','xxxx','xxxx'],
'col2':[2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13],
'col3':['xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012'],
'col4':['yyyy','yyyy','yyyy','yyyy','yyyy','yyyy'],
'col5':[0,0,0,0,0,0],
'col9':['A','A','A','B','B','B'],
'col11':[0,0,0,0,0,0],
'col12':[0,0,0,0,0,0],
'col13':[0,0,0,0,0,0],
'col17':[51,63,47,59,53,56]
}
df = pd.DataFrame(data)
Expected Output:
{
"header1": {
"col1": "xxxx"
"col2": "20201107023012"
"col3": "xxxx20201107023012"
"col4": "yyyy",
"col5": "0"
},
"header2":
{
"header3":
[
{
col6: A,
col7: 1,
header4:
[
{
col8: "A",
col9: 1,
col10: "A",
col11: 0,
col12: 0,
col13: 0,
"header5":
[
{
col14: "1",
col15: 1,
col16: 1,
col17: 51,
col18: 1
},
{
col14: "1",
col15: 1,
col16: 2,
col17: 63,
col18: 2
}
]
},
{
col8: "A",
col9: 1,
col10: "A",
col11: 0,
col12: 0,
col13: 0,
"header5":
[
{
col14: "1",
col15: 1,
col16: 1,
col17: 51,
col18: 1
},
{
col14: "1",
col15: 1,
col16: 2,
col17: 63,
col18: 2
}
]
}
]
}
]
}
}
Maybe this will get you started. I'm not aware of a current Python module that will do what you want, but this is how I'd start, making assumptions based on what you've provided.
As each successive nest is based on some criteria, you'll need to loop through filtered dataframes. Depending on the size of your dataframes, using groupby may be a better option than what I have here, but the theory is the same. Also, you'll have to create your key-value pairs correctly; this just creates the data to support what you are building.
# assume header 1 is constant so take first row and use .T to transpose to create dictionaries
header1 = dict(df.iloc[0].T[['col1','col2','col3','col4','col5']])
print('header1', header1)

# for header three, looks like you need the unique combinations so create dataframe
# and then iterate through to get all the header3 dictionaries
header3_dicts = []
dfh3 = df[['col6', 'col7']].drop_duplicates().reset_index(drop=True)
for i in range(dfh3.shape[0]):
    header3_dicts.append(dict(dfh3.iloc[i].T[['col6','col7']]))
print('header3', header3_dicts)

# iterate over header3 to get header 4
for i in range(dfh3.shape[0]):
    #print(dfh3.iat[i,0], dfh3.iat[i,1])
    dfh4 = df.loc[(df['col6']==dfh3.iat[i,0]) & (df['col7']==dfh3.iat[i,1])]
    header4_dicts = []
    for j in range(dfh4.shape[0]):
        # index into the filtered frame dfh4, not the full df
        header4_dicts.append(dict(dfh4.iloc[j].T[['col8','col9','col10','col11','col12','col13']]))
    print('header4', header4_dicts)
# next level repeat similar to above
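The groupby variant mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the frame is a small hypothetical one reusing a few column names from the question, the helper records is made up, and only two nesting levels are built. Going through to_json and json.loads is one way to make sure the values are plain Python types that json.dumps accepts.

```python
import json
import pandas as pd

# Small hypothetical frame reusing a few column names from the question.
df = pd.DataFrame({
    "col1": ["xxxx"] * 4,
    "col6": ["A", "A", "A", "B"],
    "col7": [1, 1, 2, 1],
    "col14": [1, 1, 1, 1],
    "col17": [51, 63, 47, 59],
})


def records(frame, columns):
    """Return the chosen columns as a list of dicts holding plain Python values."""
    return json.loads(frame[columns].to_json(orient="records"))


header3 = []
# Each unique (col6, col7) pair becomes one header3 entry with its own child rows.
for (col6, col7), group in df.groupby(["col6", "col7"], sort=False):
    header3.append({
        "col6": col6,
        "col7": int(col7),  # numpy integers are not JSON serialisable
        "header5": records(group, ["col14", "col17"]),
    })

nested = {
    "header1": records(df.iloc[[0]], ["col1"])[0],  # header1 taken from the first row
    "header2": {"header3": header3},
}
print(json.dumps(nested, indent=2))
```

Deeper levels such as header4 would follow the same pattern, with another groupby inside the loop over the current group.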
I have the following json object:
{
"Name": "David",
"Gender": "M",
"Date": "2014-01-01",
"Address": {
"Street": "429 Ford",
"City": "Oxford",
"State": "DE",
"Zip": 1009
}
}
How would I load this into a pandas dataframe so that it orients itself as:
name gender date address
David M 2014-01-01 {...}
What I'm trying now is:
pd.read_json(file)
But it orients it as four records instead of one.
You should read it as a Series and then (optionally) convert to a DataFrame:
df = pd.DataFrame(pd.read_json(file, typ='series')).T
df.shape
#(1, 4)
If your JSON file is composed of one JSON object per line (not an array, not a pretty-printed JSON object), then you can use:
df = pd.read_json(file, lines=True)
and it will do what you want.
If the file contains:
{"Name": "David","Gender": "M","Date": "2014-01-01","Address": {"Street": "429 Ford","City": "Oxford","State": "DE","Zip": 1009}}
on one line, then you get:
If you use
df = pd.read_json(file, orient='records')
you can load as 1 key per column, but the sub-keys will be split up into multiple rows.
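Another option, shown here only as a sketch, is to parse the JSON with the standard json module and pass the resulting dict to pandas wrapped in a list, so that it is treated as a single record. The variable names raw, record and flat are assumptions, and the JSON is inlined as a string to keep the example self-contained. The second call uses pd.json_normalize in case spreading the Address keys into separate columns is preferred.

```python
import json
import pandas as pd

raw = (
    '{"Name": "David", "Gender": "M", "Date": "2014-01-01", '
    '"Address": {"Street": "429 Ford", "City": "Oxford", "State": "DE", "Zip": 1009}}'
)
record = json.loads(raw)

# Wrapping the dict in a list makes pandas treat it as one row;
# the nested Address stays a dict inside a single cell.
df = pd.DataFrame([record])
print(df.shape)

# json_normalize instead spreads the nested keys into dotted column names
# such as Address.Street and Address.City.
flat = pd.json_normalize(record)
print(flat.columns.tolist())
```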
I have some JSON, returned from an API call, that looks something like this:
{
"result": {
"code": "OK",
"msg": ""
},
"report_name": "FAMOUS_DICTATORS",
"columns": [
"rank",
"name",
"deaths"
],
"data": [
{
"row": [
1,
"Mao Zedong",
63000000
]
},
{
"row": [
2,
"Jozef Stalin",
23000000
]
}
]
}
I'd like to convert the JSON into a Pandas DataFrame:
rank name deaths
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
I wrote this and it works, but looks a bit ugly:
import pandas as pd
import json
columns = eval(r.content)['columns']
df = pd.DataFrame(columns = eval(r.content)['columns'])
for row in eval(r.content)['data']:
    df.loc[len(df)+1] = row['row']
Is there a more elegant/Pythonic way to do this (e.g. possibly using pandas.io.json.read_json)?
The read_json function of pandas is tricky to use. If you aren't certain that your JSON object is valid, or that its initial structure is sane enough to build a dataframe around, it's much better to stick to tried-and-tested methods and break your data down into something that pandas can use without issues.
In your case, I suggest breaking down your data to a list of lists. Out of all that JSON, the only parts you really need are under the data and columns keys.
Try this:
import pandas as pd
import json

# read the JSON from a local file (Python 3)
with open("test.json") as f:
    js = json.load(f)
data = js["data"]
rows = [row["row"] for row in data]  # Transform the 'row' keys to list of lists.
df = pd.DataFrame(rows, columns=js["columns"])
print(df)
This gives me the desired result:
rank name deaths
0 1 Mao Zedong 63000000
1 2 Jozef Stalin 23000000
see pandas.io.json.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html