Python - How to convert an array of json objects to a Dataframe?

I'm completely new to Python and need a little help filtering my JSON.
json = {
"selection":[
{
"person_id":105894,
"position_id":1,
"label":"Work",
"description":"A description",
"startDate":"2017-07-16T19:20:30+01:00",
"stopDate":"2017-07-16T20:20:30+01:00"
},
{
"person_id":945123,
"position_id":null,
"label":"Illness",
"description":"A description",
"startDate":"2017-07-17T19:20:30+01:00",
"stopDate":"2017-07-17T20:20:30+01:00"
}
]
}
Concretely, I'm trying to transform the JSON above into a DataFrame so that I can use query methods on it, such as:
selected_person_id = 105894
query_person_id = json[(json['person_id'] == selected_person_id)]
or
json.query('person_id <= 105894')
The columns must be:
cols = ['person_id', 'position_id', 'label', 'description', 'startDate', 'stopDate']
How can I do it?

Pass the list stored under 'selection' to the DataFrame constructor:
import pandas as pd
df = pd.DataFrame(json['selection'])
print(df)
description label person_id position_id startDate \
0 A description Work 105894 1.0 2017-07-16T19:20:30+01:00
1 A description Illness 945123 NaN 2017-07-17T19:20:30+01:00
stopDate
0 2017-07-16T20:20:30+01:00
1 2017-07-17T20:20:30+01:00
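EDIT: if the JSON lives in a file, load it first:
import json
with open('file.json') as data_file:
    data = json.load(data_file)  # a name other than `json` avoids shadowing the module
df = pd.DataFrame(data['selection'])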
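
As a minimal sketch of the whole round trip, assuming the question's JSON is available as text: the `raw`, `payload`, `by_mask` and `by_query` names below are illustrative and not from the original post. Note that `null` is valid JSON but not valid Python, which is one reason to parse the text with `json.loads` rather than paste it as a dict literal.

```python
import json

import pandas as pd

# The question's JSON as text (hypothetical variable name `raw`).
raw = """
{"selection": [
  {"person_id": 105894, "position_id": 1, "label": "Work",
   "description": "A description",
   "startDate": "2017-07-16T19:20:30+01:00",
   "stopDate": "2017-07-16T20:20:30+01:00"},
  {"person_id": 945123, "position_id": null, "label": "Illness",
   "description": "A description",
   "startDate": "2017-07-17T19:20:30+01:00",
   "stopDate": "2017-07-17T20:20:30+01:00"}
]}
"""

cols = ["person_id", "position_id", "label", "description", "startDate", "stopDate"]

payload = json.loads(raw)  # JSON null becomes Python None
df = pd.DataFrame(payload["selection"], columns=cols)  # columns= fixes the order

# The two filters from the question, applied to the DataFrame.
selected_person_id = 105894
by_mask = df[df["person_id"] == selected_person_id]
by_query = df.query("person_id <= 105894")

print(by_mask)
print(by_query)
```

Passing `columns=cols` is optional; it is used here only to make the column order explicit.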

For more complicated examples, where the structure needs to be flattened, use json_normalize:
>>> data = [{'state': 'Florida',
... 'shortname': 'FL',
... 'info': {
... 'governor': 'Rick Scott'
... },
... 'counties': [{'name': 'Dade', 'population': 12345},
... {'name': 'Broward', 'population': 40000},
... {'name': 'Palm Beach', 'population': 60000}]},
... {'state': 'Ohio',
... 'shortname': 'OH',
... 'info': {
... 'governor': 'John Kasich'
... },
... 'counties': [{'name': 'Summit', 'population': 1234},
... {'name': 'Cuyahoga', 'population': 1337}]}]
>>> from pandas import json_normalize  # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize
>>> result = json_normalize(data, 'counties', ['state', 'shortname',
... ['info', 'governor']])
>>> result
name population info.governor state shortname
0 Dade 12345 Rick Scott Florida FL
1 Broward 40000 Rick Scott Florida FL
2 Palm Beach 60000 Rick Scott Florida FL
3 Summit 1234 John Kasich Ohio OH
4 Cuyahoga 1337 John Kasich Ohio OH
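
For the simpler case where records only contain nested dictionaries (no nested lists), a small sketch may help show what "flattening" means. It assumes pandas 1.0 or newer, where the function is available as `pd.json_normalize`; the `records` list is a trimmed, illustrative version of the data above.

```python
import pandas as pd

# Trimmed records: one nested dict per record, no nested lists.
records = [
    {"state": "Florida", "info": {"governor": "Rick Scott"}},
    {"state": "Ohio", "info": {"governor": "John Kasich"}},
]

# Nested keys are joined with "." by default.
flat = pd.json_normalize(records)

# The separator can be changed with the `sep` argument.
flat_underscore = pd.json_normalize(records, sep="_")

print(flat.columns.tolist())
print(flat_underscore.columns.tolist())
```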

Related

json_normalise issue when "record_path" variable uses nested data & "meta" variable has nested data, works fine without nested data for "record_path"

I am running into a problem with json_normalize that I can't wrap my head around. The issue arises when the "record_path" argument points at nested data (['extra', 'students']): I can't seem to use nested data in the "meta" argument as well. It works fine when the record path is not nested. Any ideas?
json_list = [
{
'class': 'Year 1',
'student count': 20,
'room': 'Yellow',
'info': {
'teachers': {
'math': 'Rick Scott',
'physics': 'Elon Mask'
}
},
'extra': {
'students': [
{
'name': 'Tom',
'sex': 'M',
'grades': { 'math': 66, 'physics': 77 }
},
{
'name': 'James',
'sex': 'M',
'grades': { 'math': 80, 'physics': 78 }
},
]
}
},
{
'class': 'Year 2',
'student count': 25,
'room': 'Blue',
'info': {
'teachers': {
'math': 'Alan Turing',
'physics': 'Albert Einstein'
}
},
'extra': {
'students': [
{ 'name': 'Tony', 'sex': 'M' },
{ 'name': 'Jacqueline', 'sex': 'F' },
]
}
},
]
print(pd.json_normalize(
json_list,
record_path = ['extra', 'students'],
meta=['class', 'room', ['info', 'teachers', 'math']]
) )

You can transform json_list before creating the dataframe, or modify the dataframe step by step with .apply(pd.Series) and .explode:
df = pd.DataFrame(json_list)
df = pd.concat(
[df, df.pop("info").apply(pd.Series), df.pop("extra").apply(pd.Series)],
axis=1,
).explode("students")
df = pd.concat(
[
df,
df.pop("teachers").apply(pd.Series).add_prefix("teachers_"),
df.pop("students").apply(pd.Series),
],
axis=1,
)
df = pd.concat(
[df, df.pop("grades").apply(pd.Series).add_prefix("grades_")],
axis=1,
).drop(columns="grades_0")
print(df)
Prints:
class student count room teachers_math teachers_physics name sex grades_math grades_physics
0 Year 1 20 Yellow Rick Scott Elon Mask Tom M 66.0 77.0
0 Year 1 20 Yellow Rick Scott Elon Mask James M 80.0 78.0
1 Year 2 25 Blue Alan Turing Albert Einstein Tony M NaN NaN
1 Year 2 25 Blue Alan Turing Albert Einstein Jacqueline F NaN NaN
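
The first suggestion (transforming json_list beforehand) is not shown above. One possible reading of it, sketched here under stated assumptions, is to hoist the nested list to a top-level key so that record_path is a single key and the nested meta path is looked up from the top of each record. The `hoisted` name and the trimmed data are illustrative, and the resulting column names (`grades.math`, `info.teachers.math`) differ from the prefixed names printed above.

```python
import pandas as pd

# Trimmed version of the question's data.
json_list = [
    {
        "class": "Year 1",
        "room": "Yellow",
        "info": {"teachers": {"math": "Rick Scott", "physics": "Elon Mask"}},
        "extra": {"students": [
            {"name": "Tom", "sex": "M", "grades": {"math": 66, "physics": 77}},
            {"name": "James", "sex": "M", "grades": {"math": 80, "physics": 78}},
        ]},
    },
    {
        "class": "Year 2",
        "room": "Blue",
        "info": {"teachers": {"math": "Alan Turing", "physics": "Albert Einstein"}},
        "extra": {"students": [
            {"name": "Tony", "sex": "M"},
            {"name": "Jacqueline", "sex": "F"},
        ]},
    },
]

# Copy each record and expose the nested list under a top-level key.
hoisted = [{**rec, "students": rec["extra"]["students"]} for rec in json_list]

flat = pd.json_normalize(
    hoisted,
    record_path="students",  # now a single, top-level key
    meta=["class", "room", ["info", "teachers", "math"]],
)
print(flat)
```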

How to access data and handle missing data in dictionaries within a dataframe

Given, df:
import pandas as pd
import numpy as np
data =\
{'Col1': [1, 2, 3],
'Person': [{'ID': 10001,
'Data': {'Address': {'Street': '1234 Street A',
'City': 'Houston',
'State': 'Texas',
'Zip': '77002'}},
'Age': 30,
'Income': 50000},
{'ID': 10002,
'Data': {'Address': {'Street': '7892 Street A',
'City': 'Greenville',
'State': 'Maine',
'Zip': np.nan}},
'Age': np.nan,
'Income': 63000},
{'ID': 10003, 'Data': {'Address': np.nan}, 'Age': 56, 'Income': 85000}]}
df = pd.DataFrame(data)
Input Dataframe:
Col1 Person
0 1 {'ID': 10001, 'Data': {'Address': {'Street': '1234 Street A', 'City': 'Houston', 'State': 'Texas', 'Zip': '77002'}}, 'Age': 30, 'Income': 50000}
1 2 {'ID': 10002, 'Data': {'Address': {'Street': '7892 Street A', 'City': 'Greenville', 'State': 'Maine', 'Zip': nan}}, 'Age': nan, 'Income': 63000}
2 3 {'ID': 10003, 'Data': {'Address': nan}, 'Age': 56, 'Income': 85000}
My expected output dataframe is df[['Col1', 'Income', 'Age', 'Street', 'Zip']] where Income, Age, Street, and Zip come from within Person:
Col1 Income Age Street Zip
0 1 50000 30.0 1234 Street A 77002
1 2 63000 NaN 7892 Street A nan
2 3 85000 56.0 NaN nan

Using list comprehension, we can create most of these columns.
df['Income'] = [x.get('Income') for x in df['Person']]
df['Age'] = [x.get('Age') for x in df['Person']]
df['Age']
Output:
0 30.0
1 NaN
2 56.0
Name: Age, dtype: float64
However, dealing with np.nan values inside a nested dictionary is a real pain. Let's look at getting data from a nested dictionary where one of the values is nan.
df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
We get an AttributeError, because the Address of the third person is np.nan (a float), which has no .get method:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-80-cc2f92bfe95d> in <module>
1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
2
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
4
5 #We get and AttributeError because NoneType object has no get method
<ipython-input-80-cc2f92bfe95d> in <listcomp>(.0)
1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
2
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
4
5 #We get and AttributeError because NoneType object has no get method
AttributeError: 'float' object has no attribute 'get'
Instead, let's use the .str accessor with dictionary keys to fetch this data.
The pandas documentation says little about it, but .str.get and .str[] can also fetch values from dictionary objects in a dataframe column (a pandas Series). As the output below shows, an element that is NaN rather than a dictionary comes back as NaN.
df['Street'] = df['Person'].str['Data'].str['Address'].str['Street']
Output:
0 1234 Street A
1 7892 Street A
2 NaN
Name: Street, dtype: object
Likewise for Zip:
df['Zip'] = df['Person'].str['Data'].str['Address'].str['Zip']
This leaves us with all the columns needed to build the desired dataframe from the dictionaries:
df[['Col1', 'Income', 'Age', 'Street', 'Zip']]
Output:
Col1 Income Age Street Zip
0 1 50000 30.0 1234 Street A 77002
1 2 63000 NaN 7892 Street A NaN
2 3 85000 56.0 NaN NaN
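
If you prefer to stay with list comprehensions, the failing chained `.get` calls can be wrapped in a small helper that stops as soon as it meets a non-dictionary value. This is only a sketch: `nested_get` is a hypothetical helper written for this example, not a pandas function, and `people` is a trimmed copy of the Person data.

```python
import numpy as np


def nested_get(obj, *keys, default=np.nan):
    """Walk nested dicts key by key; return `default` if a level is not a dict."""
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key, default)
    return obj


# Trimmed copy of the Person dictionaries from the question.
people = [
    {"ID": 10001, "Data": {"Address": {"Street": "1234 Street A", "Zip": "77002"}}},
    {"ID": 10002, "Data": {"Address": {"Street": "7892 Street A", "Zip": np.nan}}},
    {"ID": 10003, "Data": {"Address": np.nan}},
]

streets = [nested_get(p, "Data", "Address", "Street") for p in people]
zips = [nested_get(p, "Data", "Address", "Zip") for p in people]
print(streets)
print(zips)
```

With the dataframe from the question, the same helper could be applied as `[nested_get(x, 'Data', 'Address', 'Street') for x in df['Person']]`.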

import pandas as pd
import numpy as np
df = pd.DataFrame({
"Col1": [1, 2, 3],
"Person": [
{
"ID": 10001,
"Data": {
"Address": {
"Street": "1234 Street A",
"City": "Houston",
"State": "Texas",
"Zip": "77002",
}
},
"Age": 30,
"Income": 50000,
},
{
"ID": 10002,
"Data": {
"Address": {
"Street": "7892 Street A",
"Zip": np.nan,
"City": "Greenville",
"State": "Maine",
}
},
"Age": np.nan,
"Income": 63000,
},
{
"ID": 10003,
"Data": {"Address": np.nan},
"Age": 56, "Income": 85000
},
],
})
row_dic_list = df.to_dict(orient='records') # convert to dict
# remain = ['Col1', 'Income', 'Age', 'Street', 'Zip']
new_row_dict_list = []
# Iterate over each row to generate new data
for row_dic in row_dic_list:
    col1 = row_dic['Col1']
    person_dict = row_dic['Person']
    age = person_dict['Age']
    income = person_dict['Income']
    address = person_dict["Data"]["Address"]
    street = np.nan
    zip_v = np.nan
    if isinstance(address, dict):
        street = address["Street"]
        zip_v = address["Zip"]
    new_row_dict = {
        'Col1': col1,
        'Income': income,
        'Age': age,
        'Street': street,
        'Zip': zip_v,
    }
    new_row_dict_list.append(new_row_dict)
# Generate a dataframe from each new row of data
new_df = pd.DataFrame(new_row_dict_list)
print(new_df)
"""
Col1 Income Age Street Zip
0 1 50000 30.0 1234 Street A 77002
1 2 63000 NaN 7892 Street A NaN
2 3 85000 56.0 NaN NaN
"""
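
Another option, in keeping with the rest of this page, is to let `pd.json_normalize` flatten the Person dictionaries and then pick the columns of interest. The sketch below assumes pandas 1.0 or newer and uses a trimmed copy of the data; the `flat` and `out` names are illustrative.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's dataframe.
df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Person": [
        {"ID": 10001, "Age": 30, "Income": 50000,
         "Data": {"Address": {"Street": "1234 Street A", "Zip": "77002"}}},
        {"ID": 10002, "Age": np.nan, "Income": 63000,
         "Data": {"Address": {"Street": "7892 Street A", "Zip": np.nan}}},
        {"ID": 10003, "Age": 56, "Income": 85000,
         "Data": {"Address": np.nan}},
    ],
})

# Flatten every Person dict; nested keys become dotted column names.
flat = pd.json_normalize(df["Person"].tolist())

wanted = ["Income", "Age", "Data.Address.Street", "Data.Address.Zip"]
out = pd.concat([df[["Col1"]], flat[wanted]], axis=1).rename(
    columns={"Data.Address.Street": "Street", "Data.Address.Zip": "Zip"}
)
print(out)
```

Both frames share the default 0..n-1 index here, which is what lets `pd.concat(..., axis=1)` line the rows up; with a custom index on `df`, the index of `flat` would need to be set to match first.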

json_normalize with multiple record paths

I'm using the example from the pandas.json_normalize documentation (pandas 1.0.3). Unfortunately I can't paste my actual JSON, but this example illustrates the problem. Pasted from the documentation:
data = [{'state': 'Florida',
'shortname': 'FL',
'info': {'governor': 'Rick Scott'},
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': {'governor': 'John Kasich'},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
result = json_normalize(data, 'counties', ['state', 'shortname',
['info', 'governor']])
result
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
What if the JSON looked like the one below, where info is an array rather than a dict?
data = [{'state': 'Florida',
'shortname': 'FL',
'info': [{'governor': 'Rick Scott'},
{'governor': 'Rick Scott 2'}],
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': [{'governor': 'John Kasich'},
{'governor': 'John Kasich 2'}],
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
How would you get the following output using json_normalize:
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Dade 12345 Florida FL Rick Scott 2
2 Broward 40000 Florida FL Rick Scott
3 Broward 40000 Florida FL Rick Scott 2
4 Palm Beach 60000 Florida FL Rick Scott
5 Palm Beach 60000 Florida FL Rick Scott 2
6 Summit 1234 Ohio OH John Kasich
7 Summit 1234 Ohio OH John Kasich 2
8 Cuyahoga 1337 Ohio OH John Kasich
9 Cuyahoga 1337 Ohio OH John Kasich 2
Or if there is another way to do it, please do let me know.

json_normalize is designed for convenience rather than flexibility. It can't handle every form of JSON out there (JSON is simply too flexible to write a universal parser for).
How about calling json_normalize twice and then merging the results? This assumes each state appears only once in your JSON:
counties = json_normalize(data, 'counties', ['state', 'shortname'])
governors = json_normalize(data, 'info', ['state'])
result = counties.merge(governors, on='state')
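
A self-contained version of the two-call approach might look like the sketch below. It assumes pandas 1.0 or newer (`pd.json_normalize`) and adds one step that is not in the answer: renaming the `governor` column to `info.governor` so the result matches the column name requested in the question.

```python
import pandas as pd

data = [
    {"state": "Florida", "shortname": "FL",
     "info": [{"governor": "Rick Scott"}, {"governor": "Rick Scott 2"}],
     "counties": [{"name": "Dade", "population": 12345},
                  {"name": "Broward", "population": 40000},
                  {"name": "Palm Beach", "population": 60000}]},
    {"state": "Ohio", "shortname": "OH",
     "info": [{"governor": "John Kasich"}, {"governor": "John Kasich 2"}],
     "counties": [{"name": "Summit", "population": 1234},
                  {"name": "Cuyahoga", "population": 1337}]},
]

# One frame per record path, each carrying `state` as the join key.
counties = pd.json_normalize(data, "counties", ["state", "shortname"])
governors = pd.json_normalize(data, "info", ["state"]).rename(
    columns={"governor": "info.governor"}  # rename is an addition for this sketch
)

# Every county row is paired with every governor row of the same state.
result = counties.merge(governors, on="state")
print(result)
```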

pretty nested dictionary as a table

Is there any way to pretty print a nested dictionary in a table format? My data structure looks like this:
data = {'01/09/16': {'In': ['Jack'], 'Out': ['Lisa', 'Tom', 'Roger', 'Max', 'Harry', 'Same', 'Joseph', 'Luke', 'Mohammad', 'Sammy']},
'02/09/16': {'In': ['Jack', 'Lisa', 'Rache', 'Allan'], 'Out': ['Lisa', 'Tom']},
'03/09/16': {'In': ['James', 'Jack', 'Nowel', 'Harry', 'Timmy'], 'Out': ['Lisa', 'Tom']}}
I'm trying to print it as something like the table below, with the names in each cell listed one below another:
+----------------------------------+-------------+-------------+-------------+
| Status | 01/09/16 | 02/09/16 | 03/09/16 |
+----------------------------------+-------------+-------------+-------------+
| In | Jack Tom Tom
| Lisa | Jack |
+----------------------------------+-------------+-------------+-------------+
| Out | Lisa
Tom | Jack | Lisa |
+----------------------------------+-------------+-------------+-------------+
I've tried using pandas with this code:
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated
df = pd.DataFrame(role_assignment)
df.fillna('None', inplace=True)
print(df)
The problem is that pandas prints each list of names on a single line, which doesn't look good, especially when there are a lot of names:
01/09/16 \
In [Jack]
Out [Lisa, Tom, Roger, Max, Harry, Same, Joseph, Luke, Mohammad, Sammy]
02/09/16 03/09/16
In [Jack, Lisa, Rache, Allan] [James, Jack, Nowel, Harry, Timmy]
Out [Lisa, Tom] [Lisa, Tom]
I prefer this layout, but with the names listed one below another:
01/09/16 02/09/16 03/09/16
In [Jack] [Jack] [James]
Out [Lisa] [Lisa] [Lisa]
Is there a way to print it neater using pandas or another tool?

This is hackery, and it is for display purposes only.
data = {
'01/09/16': {
'In': ['Jack'],
'Out': ['Lisa', 'Tom', 'Roger',
'Max', 'Harry', 'Same',
'Joseph', 'Luke', 'Mohammad', 'Sammy']
},
'02/09/16': {
'In': ['Jack', 'Lisa', 'Rache', 'Allan'],
'Out': ['Lisa', 'Tom']
},
'03/09/16': {
'In': ['James', 'Jack', 'Nowel', 'Harry', 'Timmy'],
'Out': ['Lisa', 'Tom']
}
}
df = pd.DataFrame(data)
d1 = df.stack().apply(pd.Series).stack().unstack(1).fillna('')
# Keep the In/Out label on the first row of each group and blank the rest.
d1.index = [status if pos == 0 else '' for status, pos in d1.index]
print(d1)
01/09/16 02/09/16 03/09/16
In Jack Jack James
Lisa Jack
Rache Nowel
Allan Harry
Timmy
Out Lisa Lisa Lisa
Tom Tom Tom
Roger
Max
Harry
Same
Joseph
Luke
Mohammad
Sammy
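
The same layout can also be built without stack/unstack, which may be easier to follow. The sketch below pads the name lists with `itertools.zip_longest`; `names_table` is a hypothetical helper written for this example, and the `("In", "Out")` status order is an assumption taken from the question's data.

```python
from itertools import zip_longest

import pandas as pd

data = {
    "01/09/16": {"In": ["Jack"],
                 "Out": ["Lisa", "Tom", "Roger", "Max", "Harry", "Same",
                         "Joseph", "Luke", "Mohammad", "Sammy"]},
    "02/09/16": {"In": ["Jack", "Lisa", "Rache", "Allan"],
                 "Out": ["Lisa", "Tom"]},
    "03/09/16": {"In": ["James", "Jack", "Nowel", "Harry", "Timmy"],
                 "Out": ["Lisa", "Tom"]},
}


def names_table(data, statuses=("In", "Out")):
    """Return a display-only frame with one name per cell."""
    frames = []
    for status in statuses:
        # One list of names per date, padded with "" to equal length.
        columns = [data[day].get(status, []) for day in data]
        rows = list(zip_longest(*columns, fillvalue=""))
        # Show the status label only on the first row of its group.
        index = [status if i == 0 else "" for i in range(len(rows))]
        frames.append(pd.DataFrame(rows, columns=list(data), index=index))
    return pd.concat(frames)


table = names_table(data)
print(table)
```

Like the answer above, this produces a frame meant for printing only: the blank index labels make it unsuitable for further lookups.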

how to convert this nested JSON in columnar form into Pandas dataframe

How can I read this nested JSON into pandas in columnar (tabular) form?
JSON Scheme
JSON scheme format
Python script
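import json
import pandas as pd
import requests
req = requests.get(REQUEST_API)  # REQUEST_API: the endpoint URL, not shown in the question
returned_data = json.loads(req.text)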
# status
print("status: {0}".format(returned_data["status"]))
# api version
print("version: {0}".format(returned_data["version"]))
data_in_columnar_form = pd.DataFrame(returned_data["data"])
data = data_in_columnar_form["data"]
UPDATE
I want to convert the following JSON scheme into a tabular format. How can I do that?
JSON Scheme
"data":[
{
"year":"2009",
"values":[
{
"Actual":"(0.2)"
},
{
"Upper End of Range":"-"
},
{
"Upper End of Central Tendency":"-"
},
{
"Lower End of Central Tendency":"-"
},
{
"Lower End of Range":"-"
}
]
},
{
"year":"2010",
"values":[
{
"Actual":"2.8"
},
{
"Upper End of Range":"-"
},
{
"Upper End of Central Tendency":"-"
},
{
"Lower End of Central Tendency":"-"
},
{
"Lower End of Range":"-"
}
]
},...
]

Pandas has a JSON normalization function (as of 0.13), straight out of the docs:
In [205]: from pandas import json_normalize  # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize
In [206]: data = [{'state': 'Florida',
.....: 'shortname': 'FL',
.....: 'info': {
.....: 'governor': 'Rick Scott'
.....: },
.....: 'counties': [{'name': 'Dade', 'population': 12345},
.....: {'name': 'Broward', 'population': 40000},
.....: {'name': 'Palm Beach', 'population': 60000}]},
.....: {'state': 'Ohio',
.....: 'shortname': 'OH',
.....: 'info': {
.....: 'governor': 'John Kasich'
.....: },
.....: 'counties': [{'name': 'Summit', 'population': 1234},
.....: {'name': 'Cuyahoga', 'population': 1337}]}]
.....:
In [207]: json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
Out[207]:
name population info.governor state shortname
0 Dade 12345 Rick Scott Florida FL
1 Broward 40000 Rick Scott Florida FL
2 Palm Beach 60000 Rick Scott Florida FL
3 Summit 1234 John Kasich Ohio OH
4 Cuyahoga 1337 John Kasich Ohio OH
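
The docs example has a different shape from the JSON in the question, where "values" is a list of single-key dictionaries. For that shape, one possible approach (a sketch, not part of the answer above) is to merge the small dictionaries of each year into one row before building the dataframe. The `returned_data` literal below stands in for the API response and contains only the two years shown in the question.

```python
import pandas as pd

# Stand-in for the parsed API response (only the two years shown in the question).
returned_data = {
    "data": [
        {"year": "2009",
         "values": [{"Actual": "(0.2)"},
                    {"Upper End of Range": "-"},
                    {"Upper End of Central Tendency": "-"},
                    {"Lower End of Central Tendency": "-"},
                    {"Lower End of Range": "-"}]},
        {"year": "2010",
         "values": [{"Actual": "2.8"},
                    {"Upper End of Range": "-"},
                    {"Upper End of Central Tendency": "-"},
                    {"Lower End of Central Tendency": "-"},
                    {"Lower End of Range": "-"}]},
    ]
}

rows = []
for entry in returned_data["data"]:
    row = {"year": entry["year"]}
    # Each item in "values" is a one-key dict; fold them all into the row.
    for pair in entry["values"]:
        row.update(pair)
    rows.append(row)

df = pd.DataFrame(rows).set_index("year")
print(df)
```

The values stay as strings (for example "(0.2)" and "-"); converting them to numbers is a separate step that depends on what those markers mean in your data.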
