Construct DataFrame from multiple JSON files - python

I'm using pandas to convert multiple JSON files into a dataframe. I only want the entries that match certain criteria, but at the moment I append each whole converted file and only filter afterwards.
Suppose I have 2 json files that look like this:
File 1500.json
[
{
"CodStore": 1500,
"CodItem": 10,
"NameItem": "Burger",
"Price": 10.0
},
{
"CodStore": 1500,
"CodItem": 20,
"NameItem": "Fries",
"Price": 3.0
},
{
"CodStore": 1500,
"CodItem": 30,
"NameItem": "Ice Cream",
"Price": 1.0
}
]
File 1805.json
[
{
"CodStore": 1805,
"CodItem": 10,
"NameItem": "Burger",
"Price": 9.0
},
{
"CodStore": 1805,
"CodItem": 20,
"NameItem": "Fries",
"Price": 2.0
},
{
"CodStore": 1805,
"CodItem": 30,
"NameItem": "Ice Cream",
"Price": 0.5
}
]
I only want entries with CodItem 10 and 30 on my dataframe, so my python code looks like this:
from pandas import DataFrame, read_json
df = DataFrame()
stores = [1500, 1805]
for store in stores:
    filename = '%s.json' % store
    df = df.append(read_json(filename))
df = df[(df.CodItem == 10) | (df.CodItem == 30)]
This is just an example. The real problem is that I have more than 600 JSON files, so reading takes a long time, the dataframe becomes very long, and memory consumption is very high.
Is there a way to read only the entries that match the criteria into the dataframe?

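One option would be to append your JSON records to a list, then convert once at the end and filter.
import json
import pandas as pd
coditems = [10, 30]
data = []
for filename in json_files:
    # each file holds a list of records, so extend the list with them
    with open(filename) as f:
        data.extend(json.load(f))
df = pd.DataFrame(data).query('CodItem in @coditems')
This should be a lot faster, because calling append on a DataFrame inside a loop copies the whole frame every time, which makes the loop quadratic overall. You have to read all the data in anyway, so you may as well build the DataFrame once and let pandas do the filtering.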
Another option would be to create one DataFrame per file inside the loop and then call pd.concat after you're done.
df_list = []
for filename in json_files:
    df_list.append(pd.read_json(filename))
df = pd.concat(df_list, ignore_index=True).query('CodItem in @coditems')

You can create a temporary data frame within your loop and filter it before appending:
from pandas import DataFrame, read_json
df = DataFrame()
stores = [1500, 1805]
for store in stores:
    filename = '%s.json' % store
    temp_df = read_json(filename)
    # DataFrame.append was removed in pandas 2.0; on newer versions collect the pieces in a list and use pd.concat
    df = df.append(temp_df[(temp_df.CodItem == 10) | (temp_df.CodItem == 30)])
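For newer pandas versions, where DataFrame.append is no longer available, the same filter-before-combining idea can be sketched with a list and a single pd.concat. The helper name load_filtered and the throwaway demo files are only illustrative assumptions, not part of the original answers:

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_filtered(paths, coditems):
    """Read each JSON file, keep only rows with a wanted CodItem, concatenate once."""
    frames = []
    for path in paths:
        frame = pd.read_json(path)
        # Filter per file so only matching rows are kept in memory
        frames.append(frame[frame["CodItem"].isin(coditems)])
    return pd.concat(frames, ignore_index=True)


# Tiny demo with two temporary files shaped like the question's data
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for store, prices in {1500: [10.0, 3.0, 1.0], 1805: [9.0, 2.0, 0.5]}.items():
        rows = [{"CodStore": store, "CodItem": item, "Price": price}
                for item, price in zip([10, 20, 30], prices)]
        path = Path(tmp) / f"{store}.json"
        path.write_text(json.dumps(rows))
        paths.append(path)
    df = load_filtered(paths, [10, 30])
```

Each file is still parsed in full, but only the matching rows are held until the final concat, which keeps peak memory closer to the size of the result.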

Related

Multi-level Python Dict to Pandas DataFrame only processes one level out of many

I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
"general_info": {
"name": "xxx",
"description": "xxx",
"language": "xxx",
"prefix": "xxx",
"version": "xxx"
},
"element_count": {
"folders": 23,
"conditions": 72,
"listeners": 1,
"outputs": 47
},
"external_resource_count": {
"total": 9,
"extensions": {
"jar": 8,
"json": 1
},
"paths": {
"/lib": 9
}
},
"complexity": {
"over_1_transition": {
"number": 4,
"percentage": 30.769
},
"over_1_trigger": {
"number": 2,
"percentage": 15.385
},
"over_1_output": {
"number": 4,
"percentage": 30.769
}
}
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to a proper column name.
import pandas as pd
# this function extracts the listed keys from your nested dicts, one Series per key
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if isinstance(x, dict) else x).rename(k)
            for col, nested in filterdict.items()
            for k in nested]
data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')
# let's separate the rows where a dict is present and explode only those rows
mask = data_frame.somecol.apply(lambda x: isinstance(x, dict))
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})
# here we concat the exploded series into a frame
exploded_df = (pd.concat(expp, axis=1).stack().to_frame(name='somecol2')
               .reset_index(level=2).rename(columns={'level_2': 'somecol'}))
# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
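As an alternative that does not need the keys listed by hand, the nested dictionary can be flattened recursively into a MultiIndex. This is only a sketch under assumptions: the helper flatten and the shortened metrics dict are illustrative stand-ins for the question's extracted_metrics, and shorter paths are padded with empty strings so every row has the same number of index levels.

```python
import pandas as pd


def flatten(d, prefix=()):
    """Yield (path_tuple, value) pairs for every leaf of a nested dict."""
    for key, value in d.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


# Shortened stand-in for the question's dictionary
metrics = {
    "element_count": {"folders": 23, "outputs": 47},
    "external_resource_count": {
        "total": 9,
        "extensions": {"jar": 8, "json": 1},
    },
}

leaves = list(flatten(metrics))
depth = max(len(path) for path, _ in leaves)
# Pad short paths with "" so every index tuple has the same length
index = pd.MultiIndex.from_tuples(
    [path + ("",) * (depth - len(path)) for path, _ in leaves]
)
table = pd.DataFrame({"value": [value for _, value in leaves]}, index=index)
```

With this layout, extensions gets jar and json as child rows, and the values sit in a single named column instead of a column labelled 0.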

Deeply nested json - a list within a dictionary to Pandas DataFrame

I'm trying to parse nested json results.
data = {
"results": [
{
"components": [
{
"times": {
"periods": [
{
"fromDayOfWeek": 0,
"fromHour": 12,
"fromMinute": 0,
"toDayOfWeek": 4,
"toHour": 21,
"toMinute": 0,
"id": 156589,
"periodId": 20855
}
],
}
}
],
}
],
}
I can get to and create dataframes for "results" and "components" lists, but cannot get to "periods" due to the "times" dict. So far I have this:
df = pd.json_normalize(data, record_path = ['results','components'])
Need a separate "periods" dataframe with the included column names and values. Would appreciate your help on this. Thank you!
The nesting has four levels:
I results
II components
III times
IV periods
json_normalize should be the correct way in general:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
There are 4 levels of nesting. There can be x components in results and y times in components, although that much nesting may be overengineering.
The simplest way of getting at the data is plain indexing:
print(data['a']['b']['c']['d'])  # (...)
In your case, results and components are lists, so index into them first:
print(data['results'][0]['components'][0]['times']['periods'])
You can collect one property from every period with this piece of code:
def GetPropertyFromPeriods(property):
    # results and components are lists; times is a dict holding the periods list
    propertyList = []
    for result in data['results']:
        for component in result['components']:
            for period in component['times']['periods']:
                propertyList.append(period[property])
    return propertyList
This gives you access to one property inside periods (fromDayOfWeek, fromHour, fromMinute, ...).
After extracting the JSON values, transform them into a pandas dataframe:
print(pd.DataFrame(data, columns=["columnA", "columnB"]))
If stuck:
How to Create a table with data from JSON output in Python
Python - How to convert JSON File to Dataframe
pandas documentation:
pandas.DataFrame.from_dict
pandas.json_normalize
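To get the separate "periods" dataframe the question asks for, one possible sketch is to walk the lists with a comprehension, step through the "times" dict by key, and hand the collected records to pandas. The data literal below is copied from the question; the variable names periods and periods_df are just illustrative:

```python
import pandas as pd

data = {
    "results": [
        {
            "components": [
                {
                    "times": {
                        "periods": [
                            {
                                "fromDayOfWeek": 0,
                                "fromHour": 12,
                                "fromMinute": 0,
                                "toDayOfWeek": 4,
                                "toHour": 21,
                                "toMinute": 0,
                                "id": 156589,
                                "periodId": 20855,
                            }
                        ],
                    }
                }
            ],
        }
    ],
}

# results and components are lists, times is a dict, periods is a list of records
periods = [
    period
    for result in data["results"]
    for component in result["components"]
    for period in component["times"]["periods"]
]
periods_df = pd.DataFrame(periods)
```

Each period dict becomes one row, and its keys become the column names.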

How to convert pandas dataframe into multi level JSON with headers?

I have a pandas dataframe which I would like to convert to JSON format for my source system to utilize, which requires a very specific JSON format.
I can't seem to get to the exact format shown in the expected output section using simple dictionary loops.
Is there any way I can convert a csv/pd.DataFrame to nested JSON?
Any python package specifically built for this?
Input Dataframe:
#Create Input Dataframe
data = {
'col6':['A','A','A','B','B','B'],
'col7':[1, 1, 2, 1, 2, 2],
'col8':['A','A','A','B','B','B'],
'col10':['A','A','A','B','B','B'],
'col14':[1,1,1,1,1,2],
'col15':[1,2,1,1,1,1],
'col16':[9,10,26,9,12,4],
'col18':[1,1,2,1,2,3],
'col1':['xxxx','xxxx','xxxx','xxxx','xxxx','xxxx'],
'col2':[2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13],
'col3':['xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012'],
'col4':['yyyy','yyyy','yyyy','yyyy','yyyy','yyyy'],
'col5':[0,0,0,0,0,0],
'col9':['A','A','A','B','B','B'],
'col11':[0,0,0,0,0,0],
'col12':[0,0,0,0,0,0],
'col13':[0,0,0,0,0,0],
'col17':[51,63,47,59,53,56]
}
df = pd.DataFrame(data)
Expected Output:
{
"header1": {
"col1": "xxxx"
"col2": "20201107023012"
"col3": "xxxx20201107023012"
"col4": "yyyy",
"col5": "0"
},
"header2":
{
"header3":
[
{
col6: A,
col7: 1,
header4:
[
{
col8: "A",
col9: 1,
col10: "A",
col11: 0,
col12: 0,
col13: 0,
"header5":
[
{
col14: "1",
col15: 1,
col16: 1,
col17: 51,
col18: 1
},
{
col14: "1",
col15: 1,
col16: 2,
col17: 63,
col18: 2
}
]
},
{
col8: "A",
col9: 1,
col10: "A",
col11: 0,
col12: 0,
col13: 0,
"header5":
[
{
col14: "1",
col15: 1,
col16: 1,
col17: 51,
col18: 1
},
{
col14: "1",
col15: 1,
col16: 2,
col17: 63,
col18: 2
}
]
}
]
}
]
}
}
Maybe this will get you started. I'm not aware of a current Python module that will do what you want, but this is the basis of how I'd start it, making assumptions based on what you've provided.
As each successive nest is based on some criteria, you'll need to loop through filtered dataframes. Depending on the size of your dataframes, using groupby may be a better option than what I have here, but the theory is the same. Also, you'll have to create your key/value pairs correctly; this just creates the data to support what you are building.
# assume header1 is constant, so take the first row and turn the chosen columns into a dictionary
header1 = dict(df.iloc[0][['col1','col2','col3','col4','col5']])
print('header1', header1)
# header3 looks like it needs the unique combinations, so create a dataframe of them
# and then iterate through it to get all the header3 dictionaries
header3_dicts = []
dfh3 = df[['col6', 'col7']].drop_duplicates().reset_index(drop=True)
for i in range(dfh3.shape[0]):
    header3_dicts.append(dict(dfh3.iloc[i][['col6','col7']]))
print('header3', header3_dicts)
# iterate over header3 to get header4
for i in range(dfh3.shape[0]):
    # rows of df that belong to this (col6, col7) combination
    dfh4 = df.loc[(df['col6']==dfh3.iat[i,0]) & (df['col7']==dfh3.iat[i,1])]
    header4_dicts = []
    for j in range(dfh4.shape[0]):
        # index into the filtered frame dfh4, not the full df
        header4_dicts.append(dict(dfh4.iloc[j][['col8','col9','col10','col11','col12','col13']]))
    print('header4', header4_dicts)
# next level repeat similar to above
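The groupby variant mentioned above might look roughly like the sketch below. It only builds two levels (header3 with a nested header4 list) on a cut-down frame; the column subset and the exact nesting are assumptions for illustration, not the asker's full specification. The inner records go through to_json and json.loads so that they contain plain Python types that json.dumps accepts.

```python
import json

import pandas as pd

# Cut-down stand-in for the question's dataframe
df = pd.DataFrame({
    "col6": ["A", "A", "A", "B"],
    "col7": [1, 1, 2, 1],
    "col8": ["A", "A", "A", "B"],
    "col14": [1, 1, 1, 1],
    "col17": [51, 63, 47, 59],
})

header3 = []
# One header3 entry per unique (col6, col7) pair, in order of first appearance
for (col6, col7), group in df.groupby(["col6", "col7"], sort=False):
    inner = group[["col8", "col14", "col17"]]
    header3.append({
        "col6": col6,
        "col7": int(col7),
        # Round-trip through JSON to obtain plain ints and strings
        "header4": json.loads(inner.to_json(orient="records")),
    })

nested = {"header2": {"header3": header3}}
text = json.dumps(nested, indent=2)
```

Deeper levels can be produced the same way by grouping each group again before building its list of records.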

Load a dataframe from a single json object

I have the following json object:
{
"Name": "David",
"Gender": "M",
"Date": "2014-01-01",
"Address": {
"Street": "429 Ford",
"City": "Oxford",
"State": "DE",
"Zip": 1009
}
}
How would I load this into a pandas dataframe so that it orients itself as:
name   gender  date        address
David  M       2014-01-01  {...}
What I'm trying now is:
pd.read_json(file)
But it orients it as four records instead of one.
You should read it as a Series and then (optionally) convert to a DataFrame:
df = pd.DataFrame(pd.read_json(file, typ='series')).T
df.shape
#(1, 4)
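Another way to get the same single-row shape, sketched here with the standard json module rather than read_json, is to parse the object yourself and wrap it in a one-element list. The raw string stands in for the file's contents and is an assumption for the sake of a self-contained example:

```python
import json

import pandas as pd

# Stand-in for the contents of the JSON file
raw = """
{
  "Name": "David",
  "Gender": "M",
  "Date": "2014-01-01",
  "Address": {"Street": "429 Ford", "City": "Oxford", "State": "DE", "Zip": 1009}
}
"""

record = json.loads(raw)
# A list with one dict gives one row; the nested Address dict stays in a single cell
df = pd.DataFrame([record])
```

With a real file, json.load on the open file object would replace json.loads on the string.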
If your JSON file is composed of one JSON object per line (not an array, and not a pretty-printed JSON object), then you can use:
df = pd.read_json(file, lines=True)
and it will do what you want.
If the file contains:
{"Name": "David","Gender": "M","Date": "2014-01-01","Address": {"Street": "429 Ford","City": "Oxford","State": "DE","Zip": 1009}}
on one line, then you get a single row with one column per top-level key.
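A small sketch of the lines=True case, using an in-memory buffer in place of the file (the StringIO wrapper is only there to keep the example self-contained):

```python
from io import StringIO

import pandas as pd

# One complete JSON object on a single line, as in the example above
line = ('{"Name": "David","Gender": "M","Date": "2014-01-01",'
        '"Address": {"Street": "429 Ford","City": "Oxford","State": "DE","Zip": 1009}}\n')

# lines=True treats every line of the input as one record
df = pd.read_json(StringIO(line), lines=True)
```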
If you use
df = pd.read_json(file, orient='records')
you can load as 1 key per column, but the sub-keys will be split up into multiple rows.

JSON to Pandas: is there a more elegant solution?

I have some JSON, returned from an API call, that looks something like this:
{
"result": {
"code": "OK",
"msg": ""
},
"report_name": "FAMOUS_DICTATORS",
"columns": [
"rank",
"name",
"deaths"
],
"data": [
{
"row": [
1,
"Mao Zedong",
63000000
]
},
{
"row": [
2,
"Jozef Stalin",
23000000
]
}
]
}
I'd like to convert the JSON into a Pandas DataFrame:
rank name deaths
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
I wrote this and it works, but looks a bit ugly:
import pandas as pd
import json
columns = eval(r.content)['columns']
df = pd.DataFrame(columns = eval(r.content)['columns'])
for row in eval(r.content)['data']:
    df.loc[len(df)+1] = row['row']
Is there a more elegant/Pythonic way to do this (e.g. possibly using pandas.io.json.read_json)?
The read_json function of pandas can be tricky to use. If you are not certain that your JSON is valid, or that its structure maps cleanly onto a dataframe, it is safer to first break the data down into something pandas can handle reliably.
In your case, I suggest breaking your data down to a list of lists. Out of all that JSON, the only parts you really need are under the data and columns keys.
Try this:
import json
import pandas as pd
# Read the saved API response; with a live response, parse its body with json.loads instead
with open("test.json") as f:
    js = json.load(f)
data = js["data"]
rows = [row["row"] for row in data]  # Transform the 'row' keys to a list of lists.
df = pd.DataFrame(rows, columns=js["columns"])
print(df)
This gives me the desired result:
rank name deaths
0 1 Mao Zedong 63000000
1 2 Jozef Stalin 23000000
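The same transformation can be wrapped in a small function that takes the already-parsed payload, which also avoids calling eval on the response body. This is a sketch: payload_to_frame is a hypothetical helper name, and the dict below is a trimmed copy of the question's JSON. If r in the question is a requests response, r.json() would presumably supply such a dict, but that is an assumption about the asker's setup.

```python
import pandas as pd

# Trimmed copy of the API payload from the question
payload = {
    "report_name": "FAMOUS_DICTATORS",
    "columns": ["rank", "name", "deaths"],
    "data": [
        {"row": [1, "Mao Zedong", 63000000]},
        {"row": [2, "Jozef Stalin", 23000000]},
    ],
}


def payload_to_frame(payload):
    """Build a DataFrame from the 'columns' list and the 'row' lists under 'data'."""
    rows = [item["row"] for item in payload["data"]]
    return pd.DataFrame(rows, columns=payload["columns"])


df = payload_to_frame(payload)
```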
see pandas.io.json.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html
