I am having trouble extracting data from nested JSON in Python. I want to create a one-column pandas DataFrame of all the values of "bill", e.g.
bill
----
a1
a2
a3
Using the output from an API formatted like this:
{
"status": "succeeded",
"travels": [
{
"jobs": [
{
"bill": "a1"
},
{
"bill": "a2"
},
{
"bill": "a3"
}
],
"vehicle": {
"plate": "xyz123"
}
}
]
}
Loading the JSON directly into pandas gives me only the first instance of 'bill'. I have tried json_normalize() on 'jobs', but it raises a KeyError. Can anybody help me figure out how to grab just the 'bill' values?
Thanks
I think you were on the right track with json_normalize. With your input as a Python dictionary d (note that pandas.io.json.json_normalize was deprecated in pandas 1.0; use pd.json_normalize instead):
import pandas as pd
pd.json_normalize(d, record_path=['travels', 'jobs'])
bill
0 a1
1 a2
2 a3
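If you also need fields from outside jobs, such as the vehicle plate, json_normalize's meta parameter can carry them along. A minimal sketch, assuming the modern pd.json_normalize and the API response already loaded into a dict d:

```python
import pandas as pd

# The API response from the question, assumed loaded into a dict.
d = {
    "status": "succeeded",
    "travels": [
        {
            "jobs": [{"bill": "a1"}, {"bill": "a2"}, {"bill": "a3"}],
            "vehicle": {"plate": "xyz123"},
        }
    ],
}

# record_path walks into each job; meta pulls the plate from the enclosing travel.
df = pd.json_normalize(
    d,
    record_path=["travels", "jobs"],
    meta=[["travels", "vehicle", "plate"]],
)
```

Each bill row then carries a travels.vehicle.plate column; drop the meta argument if you only want the bill column.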
Related
I'm trying to create a pandas DataFrame out of a JSON dictionary in Python. The nesting is tripping me up.
The column headers are in a different section of the JSON file from the values.
The JSON looks similar to the below. There is one section of column headers and multiple sections of data.
I need each column filled with the data that relates to it, so value_one in each case will fill the column under header_one, and so on.
I have come close, but can't seem to get it to spit out the DataFrame as described.
{
"my_data": {
"column_headers": [
"header_one",
"header_two",
"header_three"
],
"values": [
{
"data": [
"value_one",
"value_two",
"value_three"
]
},
{
"data": [
"value_one",
"value_two",
"value_three"
]
}
]
}
}
Assuming your dictionary is my_dict, try:
pd.DataFrame(data=[d["data"] for d in my_dict["my_data"]["values"]],
             columns=my_dict["my_data"]["column_headers"])
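Put together as a runnable sketch, with the sample JSON from the question hard-coded as my_dict:

```python
import pandas as pd

my_dict = {
    "my_data": {
        "column_headers": ["header_one", "header_two", "header_three"],
        "values": [
            {"data": ["value_one", "value_two", "value_three"]},
            {"data": ["value_one", "value_two", "value_three"]},
        ],
    }
}

# Pull each inner "data" list out as one row, then label the columns.
rows = [d["data"] for d in my_dict["my_data"]["values"]]
df = pd.DataFrame(data=rows, columns=my_dict["my_data"]["column_headers"])
```

Each "data" list becomes one row, so the frame here has two rows and three labelled columns.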
I want to convert the below text into a pandas DataFrame. Is there a built-in pandas parser I can use for the conversion? I could write a custom parsing function, but I want to know whether a pre-built and/or faster solution exists.
In this example, the DataFrame should result in two rows, one each for ABC and PQR.
{
"data": [
{
"ID": "ABC",
"Col1": "ABC_C1",
"Col2": "ABC_C2"
},
{
"ID": "PQR",
"Col1": "PQR_C1",
"Col2": "PQR_C2"
}
]
}
You've listed everything you need as tags. Use json.loads to get a dict from the string:
import json
import pandas as pd
d = json.loads('''{
"data": [
{
"ID": "ABC",
"Col1": "ABC_C1",
"Col2": "ABC_C2"
},
{
"ID": "PQR",
"Col1": "PQR_C1",
"Col2": "PQR_C2"
}
]
}''')
df = pd.DataFrame(d['data'])
print(df)
Output:
ID Col1 Col2
0 ABC ABC_C1 ABC_C2
1 PQR PQR_C1 PQR_C2
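As a side note, pd.json_normalize gives the same frame and saves the manual indexing into 'data' if you pass that key as the record path; a minimal sketch:

```python
import json
import pandas as pd

d = json.loads('{"data": [{"ID": "ABC", "Col1": "ABC_C1", "Col2": "ABC_C2"},'
               ' {"ID": "PQR", "Col1": "PQR_C1", "Col2": "PQR_C2"}]}')

# 'data' is the record path: each dict inside it becomes one row.
df = pd.json_normalize(d, record_path="data")
```

This becomes more useful than plain pd.DataFrame(d['data']) once the records themselves contain nested dicts, since json_normalize flattens those into dotted column names.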
I need to 'cross join' (for want of a better term!) two lists.
Between them they represent a tabular dataset, but one holds the column header names and the other a nested array with the row values.
I've managed the easy bit:
col_names = [i['name'] for i in c]
which strips the column names out into a list without 'typeName'.
But just thinking about how to extract the row field values and map them to the column names is giving me a headache!
Any pointers appreciated ;)
Thanks
Columns (as provided):
[
{
"name": "col1",
"typeName": "varchar"
},
{
"name": "col2",
"typeName": "int4"
}
]
Records (as provided):
[
[
{
"stringValue": "apples"
},
{
"longValue": 1
}
],
[
{
"stringValue": "bananas"
},
{
"longValue": 2
}
]
]
Required Result:
[
{
'col1':'apples',
'col2':1
},
{
'col1':'bananas',
'col2':2
}
]
You have to be able to assume there is a 1-to-1 correspondence between the names in the schema and the dicts in the records. Once you assume that, it's pretty easy:
names = [i['name'] for i in schema]
data = []
for row in records:
    d = {}
    for a, b in zip(names, row):
        d[a] = list(b.values())[0]
    data.append(d)
print(data)
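The same mapping collapses into a comprehension; a sketch with the sample data inlined, where next(iter(...)) grabs the single value out of each one-key cell dict:

```python
schema = [{"name": "col1", "typeName": "varchar"},
          {"name": "col2", "typeName": "int4"}]
records = [
    [{"stringValue": "apples"}, {"longValue": 1}],
    [{"stringValue": "bananas"}, {"longValue": 2}],
]

names = [c["name"] for c in schema]
# Pair each column name with the single value inside each cell dict.
data = [{name: next(iter(cell.values())) for name, cell in zip(names, row)}
        for row in records]
```

From there, pandas can consume the list of dicts directly if a DataFrame is the end goal.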
I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
"general_info": {
"name": "xxx",
"description": "xxx",
"language": "xxx",
"prefix": "xxx",
"version": "xxx"
},
"element_count": {
"folders": 23,
"conditions": 72,
"listeners": 1,
"outputs": 47
},
"external_resource_count": {
"total": 9,
"extensions": {
"jar": 8,
"json": 1
},
"paths": {
"/lib": 9
}
},
"complexity": {
"over_1_transition": {
"number": 4,
"percentage": 30.769
},
"over_1_trigger": {
"number": 2,
"percentage": 15.385
},
"over_1_output": {
"number": 4,
"percentage": 30.769
}
}
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the DataFrame is correct, but you should rename the 0 column to a proper name.
# this function extracts all the keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]
data_frame = pd.DataFrame.from_dict(data= extracted_metrics, orient='index').stack().to_frame(name='somecol')
#lets separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x:type(x)==dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})
# here we concat the exploded series to a frame
exploded_df = pd.concat(expp, axis=1).stack().to_frame(name='somecol2') \
    .reset_index(level=2).rename(columns={'level_2': 'somecol'})
# and now we concat the rows with dict elements with the rows with non dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
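For comparison, a shorter route to a flat two-column table is to let pd.json_normalize dot-join every nesting level into one wide row and then transpose; a sketch on a trimmed-down subset of the dictionary from the question:

```python
import pandas as pd

# Trimmed subset of the metrics dict from the question.
extracted_metrics = {
    "element_count": {"folders": 23, "conditions": 72},
    "external_resource_count": {
        "total": 9,
        "extensions": {"jar": 8, "json": 1},
    },
}

# One wide row with dotted column names, e.g. 'external_resource_count.extensions.jar'.
flat = pd.json_normalize(extracted_metrics, sep=".")

# Transpose into an index of dotted paths with a single value column.
table = flat.T.rename(columns={0: "value"})
```

The dotted index names encode the full nesting path, so arbitrary depths come out uniformly without needing to know which keys hold sub-dicts.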
I have some JSON, returned from an API call, that looks something like this:
{
"result": {
"code": "OK",
"msg": ""
},
"report_name": "FAMOUS_DICTATORS",
"columns": [
"rank",
"name",
"deaths"
],
"data": [
{
"row": [
1,
"Mao Zedong",
63000000
]
},
{
"row": [
2,
"Jozef Stalin",
23000000
]
}
]
}
I'd like to convert the JSON into a Pandas DataFrame:
rank name deaths
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
I wrote this and it works, but looks a bit ugly:
import pandas as pd
import json
columns = eval(r.content)['columns']
df = pd.DataFrame(columns = eval(r.content)['columns'])
for row in eval(r.content)['data']:
    df.loc[len(df)+1] = row['row']
Is there a more elegant/Pythonic way to do this (e.g. possibly using pandas.io.json.read_json)?
The read_json function of pandas is a tricky method to use. If you aren't certain that your JSON is valid, or that its structure maps cleanly onto a DataFrame, it's much better to stick to tried and tested methods that break your data down into something pandas can consume without issues.
In your case, I suggest breaking down your data to a list of lists. Out of all that JSON, the only part you really need is in the data and column keys.
Try this:
import json
import pandas as pd

with open("test.json") as f:
    js = json.load(f)
rows = [row["row"] for row in js["data"]]  # Transform the 'row' keys to a list of lists.
df = pd.DataFrame(rows, columns=js["columns"])
print(df)
This gives me the desired result:
rank name deaths
0 1 Mao Zedong 63000000
1 2 Jozef Stalin 23000000
see pandas.io.json.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None)
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html
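For completeness, read_json only helps when the JSON already matches one of its expected orients; the report format above doesn't, which is why the reshaping in the accepted answer is needed. A minimal sketch of the 'records' orient it does understand, wrapped in StringIO since newer pandas expects a path or buffer rather than a literal string:

```python
from io import StringIO
import pandas as pd

# 'records' orient expects a JSON list of row dicts.
js = '[{"rank": 1, "name": "Mao Zedong"}, {"rank": 2, "name": "Jozef Stalin"}]'
df = pd.read_json(StringIO(js), orient="records")
```

Anything nested differently, like the row-wrapper dicts in the question, needs to be reshaped into one of these orients first.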