Converting Excel to JSON using Pandas in Python 3.9

This is my first ever post here, so go easy! :) I am attempting to convert data from Excel to JSON using the Python Pandas library.
I have data in Excel that looks like the table below. The columns labelled "Unnamed: x" are blank; I used those headers because that's how they come out when converting to JSON. There are around 20 tests formatted like the sample below:
Unnamed: 1    Unnamed: 2    Unnamed: 3    Unnamed: 4
Test 1        Menu          Setting       Value
              Menu1         Setting1      Value1
Test 2        A             B             C
              1             2             3
I would like to put these into JSON to look something like this:
{
    "Test 1": [
        {
            "Menu": "Menu1",
            "Setting": "Setting1",
            "Value": "Value1"
        }
    ]
}
And so on...
I can convert the current data to JSON (but not in the format detailed above), and I have been experimenting with creating different Pandas dataframes in Python. At the moment the JSON data I get looks something like this:
"3":[
{
"Unnamed: 0":"Test1",
"Unnamed: 1":"Menu",
"Unnamed: 2":"Setting",
"Unnamed: 2":"Value"
}
"4":[
{
"Unnamed: 1":"Menu1",
"Unnamed: 2":"Setting1",
"Unnamed: 2":"Value1"
}
So I am doing some manual work (copying and pasting) to set it up in the desired format.
Here is my current code:
import pandas
# Pointing to file location and specifying the sheet name to convert
excel_data_fragment = pandas.read_excel('C:\\Users\\user_name\\tests\\data.xls', sheet_name='Tests')
# Converting to data frame
df = pandas.DataFrame(excel_data_fragment)
# This will get the values in Column A and removes empty values
test_titles = df['Unnamed: 0'].dropna(how="all")
# This is the first set of test values
columnB = df['Unnamed: 1'].dropna(how="all")
# Saving original data in df and removing rows which contain all NaN values to mod_df
mod_df = df.dropna(how="all")
# Converting data frame with NaN values removed to json
df_json = mod_df.apply(lambda x: [x.dropna()], axis=1).to_json()
print(mod_df)

Your Excel sheet is basically composed of several distinct subtables put together (one for each test). The way I would go to process them in pandas would be to use groupby and then process each group as a table. DataFrame.to_dict will be your friend here to output JSON-able objects.
First, here is some sample data that resembles what you have provided:
import pandas as pd
rows = [
[],
[],
["Test 1", "Menu", "Setting", "Value"],
[None, "Menu1", "Setting1", "Value1"],
[None, "Menu2", "Setting2", "Value2"],
[],
[],
["Test 2", "A", "B", "C"],
[None, 1, 2, 3],
[None, 4, 5, 6],
]
df = pd.DataFrame(rows, columns=[f"Unnamed: {i}" for i in range(1, 5)])
df looks like:
Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 None None None None
1 None None None None
2 Test 1 Menu Setting Value
3 None Menu1 Setting1 Value1
4 None Menu2 Setting2 Value2
5 None None None None
6 None None None None
7 Test 2 A B C
8 None 1 2 3
9 None 4 5 6
Then use the following snippet, which cleans up all the missing values in df and turns each subtable into a dict.
# Remove entirely empty rows
df = df.dropna(how="all")
# Forward-fill the test names in column 1
df["Unnamed: 1"] = df["Unnamed: 1"].ffill()

def process_group(g):
    # Drop the first column (it only holds the test name)
    g = g.drop("Unnamed: 1", axis=1)
    # Use the first row as column names
    g = g.rename(columns=g.iloc[0])
    # Drop the first row
    g = g.drop(g.index[0])
    # Convert to a list of record dicts
    return g.to_dict(orient="records")

output = df.groupby("Unnamed: 1").apply(process_group).to_dict()
In the end, output is equal to:
{
"Test 1": [
{
"Menu": "Menu1",
"Setting": "Setting1",
"Value": "Value1"
},
{
"Menu": "Menu2",
"Setting": "Setting2",
"Value": "Value2"
}
],
"Test 2": [
{
"A": 1,
"B": 2,
"C": 3
},
{
"A": 4,
"B": 5,
"C": 6
}
]
}
You can finally get the JSON string by simply using:
import json
output_str = json.dumps(output)
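If you want to run the same steps against the original workbook rather than the hand-built sample, you could read the sheet without treating any row as a header so every row stays as data. The path, sheet name and four-column width below are taken from the question and are assumptions about the real file:
import pandas as pd

# Read the sheet with no header row; column names mirror the sample above
# (path, sheet name and column count are assumptions from the question).
df = pd.read_excel(
    r'C:\Users\user_name\tests\data.xls',
    sheet_name='Tests',
    header=None,
    names=[f"Unnamed: {i}" for i in range(1, 5)],
)
# ...then apply the dropna / ffill / groupby steps shown above.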

Related

Explode function

This is my first question on here. I have searched around on here and throughout the web and can't seem to find the answer to my question. I'm trying to explode a list in a JSON file out into multiple columns and rows. Everything I have tried so far has proven unsuccessful.
I am doing this over multiple JSON files within a directory, so that the dataframe prints out like this.
Goal: a dataframe with one row per fusage entry and the columns
did  Version  Nodes  rds  time  c  sc  f  uc
Instead I get this in my dataframe: the columns
did  Version  Nodes  rds  fusage
with each row's fusage cell still holding everything in fusage as a single list.
Example of the JSON I'm working with (the JSON structure will not change):
{
"did": "123456789",
"mId": "1a2b3cjsks",
"timestamp": "2021-11-26T11:10:58.322000",
"beat": {
"did": "123456789",
"collectionTime": "2010-05-26 11:10:58.004783+00",
"Nodes": 6,
"Version": "v1.4.6-2",
"rds": "0.00B",
"fusage": [
{
"time": "2010-05-25",
"c": "string",
"sc": "string",
"f": "string",
"uc": "int"
},
{
"time": "2010-05-19",
"c": "string",
"sc": "string",
"f": "string",
"uc": "int"
},
{
"t": "2010-05-23",
"c": "string",
"sc": "string",
"f": "string",
"uc": "int"
},
{
"time": "2010-05-23",
"c": "string",
"sc": "string",
"f": "string",
"uc": "int"
}
]
}
}
My end goal is to get the dataframe out to a csv in order to be ingested. I appreciate everyone's help looking at this.
Using Python 3.8.10 & pandas 1.3.4. Python code below:
import csv
import glob
import json
import os
import pandas as pd
tempdir = '/dir/to/files/json_temp'
json_files = os.path.join(tempdir, '*.json')
file_list = glob.glob(json_files)
dfs = []
for file in file_list:
    with open(file) as f:
        data = pd.json_normalize(json.loads(f.read()))
        dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
df.explode('fusage')
print(df)
If you're going to use the explode function, then afterwards apply pd.Series over the column containing the fusage list (beat.fusage) to obtain a Series for each exploded item.
/dir/to/files
├── example-v1.4.6-2.json
└── example-v2.2.2-2.json
...
for file in file_list:
    with open(file) as f:
        data = pd.json_normalize(json.loads(f.read()))
        dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
fusage_list = df.explode('beat.fusage')['beat.fusage'].apply(pd.Series)
df = pd.concat([df, fusage_list], axis=1)
# show desired columns
df = df[['did', 'beat.Version', 'beat.Nodes', 'beat.rds', 'time', 'c', 'sc', 'f', 'uc']]
print(df)
Output from df
did beat.Version beat.Nodes beat.rds time c sc f uc
0 123456789 v1.4.6-2 6 0.00B 2010-05-25 string string string int
0 123456789 v1.4.6-2 6 0.00B 2010-05-19 string string string int
0 123456789 v1.4.6-2 6 0.00B NaN string string string int
0 123456789 v1.4.6-2 6 0.00B 2010-05-23 string string string int
1 123777777 v2.2.2-2 4 0.00B 2010-05-25 string string string int
1 123777777 v2.2.2-2 4 0.00B 2010-05-19 string string string int
1 123777777 v2.2.2-2 4 0.00B NaN string string string int
1 123777777 v2.2.2-2 4 0.00B 2010-05-23 string string string int
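Since the stated end goal is a CSV for ingestion, the flattened frame can then be written straight out; the file name here is just an example:
# Write the flattened dataframe to CSV (file name is an example).
df.to_csv('fusage_flat.csv', index=False)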

How to convert pandas dataframe into multi level JSON with headers?

I have a pandas dataframe which I would like to convert to JSON format for my source system to utilize, which requires a very specific JSON format.
I can't seem to get the exact format shown in the expected output section using simple dictionary loops.
Is there any way I can convert a csv/pd.DataFrame to nested JSON?
Is there any Python package built specifically for this?
Input Dataframe:
#Create Input Dataframe
data = {
'col6':['A','A','A','B','B','B'],
'col7':[1, 1, 2, 1, 2, 2],
'col8':['A','A','A','B','B','B'],
'col10':['A','A','A','B','B','B'],
'col14':[1,1,1,1,1,2],
'col15':[1,2,1,1,1,1],
'col16':[9,10,26,9,12,4],
'col18':[1,1,2,1,2,3],
'col1':['xxxx','xxxx','xxxx','xxxx','xxxx','xxxx'],
'col2':[2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13],
'col3':['xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012'],
'col4':['yyyy','yyyy','yyyy','yyyy','yyyy','yyyy'],
'col5':[0,0,0,0,0,0],
'col9':['A','A','A','B','B','B'],
'col11':[0,0,0,0,0,0],
'col12':[0,0,0,0,0,0],
'col13':[0,0,0,0,0,0],
'col17':[51,63,47,59,53,56]
}
df = pd.DataFrame(data)
Expected Output:
{
    "header1": {
        "col1": "xxxx",
        "col2": "20201107023012",
        "col3": "xxxx20201107023012",
        "col4": "yyyy",
        "col5": "0"
    },
    "header2": {
        "header3": [
            {
                "col6": "A",
                "col7": 1,
                "header4": [
                    {
                        "col8": "A",
                        "col9": 1,
                        "col10": "A",
                        "col11": 0,
                        "col12": 0,
                        "col13": 0,
                        "header5": [
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 1,
                                "col17": 51,
                                "col18": 1
                            },
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 2,
                                "col17": 63,
                                "col18": 2
                            }
                        ]
                    },
                    {
                        "col8": "A",
                        "col9": 1,
                        "col10": "A",
                        "col11": 0,
                        "col12": 0,
                        "col13": 0,
                        "header5": [
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 1,
                                "col17": 51,
                                "col18": 1
                            },
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 2,
                                "col17": 63,
                                "col18": 2
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
Maybe this will get you started. I'm not aware of a current Python module that will do what you want, but this is the basis of how I'd start it, making assumptions based on what you've provided.
As each successive nest is based on some criteria, you'll need to loop through filtered dataframes. Depending on the size of your dataframes, using groupby may be a better option than what I have here, but the theory is the same. Also, you'll have to create your key-value pairs correctly; this just creates the data to support what you are building.
# assume header1 is constant, so take the first row and use .T to transpose to create dictionaries
header1 = dict(df.iloc[0].T[['col1','col2','col3','col4','col5']])
print('header1', header1)

# for header3, it looks like you need the unique combinations, so create a dataframe
# and then iterate through it to get all the header3 dictionaries
header3_dicts = []
dfh3 = df[['col6', 'col7']].drop_duplicates().reset_index(drop=True)
for i in range(dfh3.shape[0]):
    header3_dicts.append(dict(dfh3.iloc[i].T[['col6','col7']]))
print('header3', header3_dicts)

# iterate over header3 to get header4
for i in range(dfh3.shape[0]):
    #print(dfh3.iat[i,0], dfh3.iat[i,1])
    dfh4 = df.loc[(df['col6']==dfh3.iat[i,0]) & (df['col7']==dfh3.iat[i,1])]
    header4_dicts = []
    for j in range(dfh4.shape[0]):
        header4_dicts.append(dict(dfh4.iloc[j].T[['col8','col9','col10','col11','col12','col13']]))
    print('header4', header4_dicts)
# next level: repeat similar to the above
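For completeness, here is a rough sketch of the groupby route mentioned above, reusing the data dict from the question. The grouping keys are assumptions read off the expected output (header3 groups on col6/col7, header4 on col8 through col13, header5 collects col14 through col18), so adjust them to your real spec:
import json
import pandas as pd

df = pd.DataFrame(data)  # the input dataframe from the question

header1 = df.iloc[0][['col1', 'col2', 'col3', 'col4', 'col5']].to_dict()

header3 = []
h4_cols = ['col8', 'col9', 'col10', 'col11', 'col12', 'col13']
for (c6, c7), g3 in df.groupby(['col6', 'col7'], sort=False):
    header4 = []
    for keys4, g4 in g3.groupby(h4_cols, sort=False):
        entry = dict(zip(h4_cols, keys4))
        # header5 collects the remaining detail columns as records
        entry['header5'] = g4[['col14', 'col15', 'col16', 'col17', 'col18']].to_dict(orient='records')
        header4.append(entry)
    header3.append({'col6': c6, 'col7': c7, 'header4': header4})

output = {'header1': header1, 'header2': {'header3': header3}}
print(json.dumps(output, indent=2, default=str))  # default=str handles numpy scalars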

Required columns are not present in the data frame, but the column header gets created in the CSV and all rows get populated with null

I am working on getting data from an API using Python. The API returns data in the form of JSON, which is normalised and written to a data frame, which is then written to a CSV file.
The API can return any number of columns, and this differs between records. I need only a fixed number of columns, which I am defining in the code.
In the scenario where a required column is not returned, my code fails.
I need a solution where, even though a required column is not present in the data frame, the column header still gets created in the CSV and all its rows get populated with null.
Required CSV structure:
name address phone
abc bcd 1214
bcd null null
I'm not sure if I understood you correctly, but I hope the following code solves your problem:
import json
import pandas as pd
# Declare json with missing values:
# - First element doesn't contain "phone" field
# - Second element doesn't contain "married" field
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd", "married": true},
{ "name": "def", "address": "ghi", "phone" : 7687 }
]
}
"""
json_data = json.loads(api_data)
df = pd.DataFrame(
data=json_data["sentences"],
# Explicitly declare which columns should be presented in DataFrame
# If value for given column is absent it will be populated with NaN
columns=["name", "address", "married", "phone"]
)
# Save result to csv:
df.to_csv("tmp.csv", index=False)
The content of resulting csv:
name,address,married,phone
abc,bcd,True,
def,ghi,,7687.0
P.S.:
It should work even if columns are absent in all the records. Here is another example:
# Both elements do not contain "married" and "phone" fields
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd"},
{ "name": "def", "address": "ghi"}
]
}
"""
json_data = json.loads(api_data)
df = pd.DataFrame(
data=json_data["sentences"],
# Explicitly declare which columns should be presented in DataFrame
# If value for given column is absent it will be populated with NaN
columns=["name", "address", "married", "phone"]
)
# Print first rows of DataFrame:
df.head()
# Expected output:
# name address married phone
# 0 abc bcd NaN NaN
# 1 def ghi NaN NaN
df.to_csv("tmp.csv", index=False)
In this case the resulting csv file will contain the following text:
name,address,married,phone
abc,bcd,,
def,ghi,,
The last two commas in the 2nd and 3rd lines mean "an empty/missing value", and if you create a DataFrame from the resulting csv with pd.read_csv, then the "married" and "phone" columns will be populated with NaN values.
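If the data actually arrives through pd.json_normalize (as in the API workflow described in the question), reindex gives the same guarantee of a fixed column set; this is a sketch alongside the example above, with made-up records:
import pandas as pd

# Sketch: normalize whatever the API returned, then force the fixed column set.
records = [{"name": "abc", "address": "bcd"}, {"name": "def", "address": "ghi"}]
df = pd.json_normalize(records).reindex(columns=["name", "address", "married", "phone"])
df.to_csv("tmp.csv", index=False)  # absent columns are written as empty cells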

Original dict/json from a pd.io.json.json_normalize() dataframe row

I have a pandas dataframe with rows created from dicts, using pd.io.json.json_normalize(). The values (not the keys/column names) in the dataframe have been modified. I want to retrieve a dict, with the same nested format as the original dict, from a row of the dataframe.
sample = {
"A": {
"a": 7
},
"B": {
"a": "name",
"z":{
"dD": 20 ,
"f_f": 3 ,
}
}
}
df = pd.io.json.json_normalize(sample, sep='__')
As expected, df.columns returns:
Index(['A__a', 'B__a', 'B__z__dD', 'B__z__f_f'], dtype='object')
I want to "reverse" the process now.
I can guarantee no string in the original dict(key or value) has a '__' as a substring and neither starts or ends with '_'
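Given that guarantee, a minimal sketch of the reverse operation is to split each flattened column name on '__' and rebuild the nesting:
import pandas as pd

# Minimal sketch: rebuild the nested dict from one flattened row by splitting
# column names on the '__' separator (guaranteed unambiguous above).
def row_to_nested(row, sep='__'):
    nested = {}
    for flat_key, value in row.items():
        *parents, leaf = flat_key.split(sep)
        target = nested
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return nested

sample = {"A": {"a": 7}, "B": {"a": "name", "z": {"dD": 20, "f_f": 3}}}
df = pd.json_normalize(sample, sep='__')  # modern spelling of pd.io.json.json_normalize
print(row_to_nested(df.iloc[0].to_dict()))
# {'A': {'a': 7}, 'B': {'a': 'name', 'z': {'dD': 20, 'f_f': 3}}}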

JSON to Pandas: is there a more elegant solution?

I have some JSON, returned from an API call, that looks something like this:
{
"result": {
"code": "OK",
"msg": ""
},
"report_name": "FAMOUS_DICTATORS",
"columns": [
"rank",
"name",
"deaths"
],
"data": [
{
"row": [
1,
"Mao Zedong",
63000000
]
},
{
"row": [
2,
"Jozef Stalin",
23000000
]
}
]
}
I'd like to convert the JSON into a Pandas DataFrame:
rank name deaths
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
I wrote this and it works, but looks a bit ugly:
import pandas as pd
import json
columns = eval(r.content)['columns']
df = pd.DataFrame(columns = eval(r.content)['columns'])
for row in eval(r.content)['data']:
    df.loc[len(df)+1] = row['row']
Is there a more elegant/Pythonic way to do this (e.g. possibly using pandas.io.json.read_json)?
The read_json function of pandas is a very tricky method to use. If you don't know with certainty the validity of your JSON object, or whether its initial structure is sane enough to build a dataframe around, it's much better to stick to tried and tested methods to break your data down into something that pandas can use without issues.
In your case, I suggest breaking the data down into a list of lists. Out of all that JSON, the only parts you really need are under the data and columns keys.
Try this:
import json
import pandas as pd

# Load the JSON (here from a local file; in your case it would be the API
# response, e.g. json.loads(r.content))
with open("test.json") as f:
    js = json.load(f)

data = js["data"]
rows = [row["row"] for row in data]  # Transform the 'row' keys into a list of lists.
df = pd.DataFrame(rows, columns=js["columns"])
print(df)
This gives me the desired result:
rank name deaths
0 1 Mao Zedong 63000000
1 2 Jozef Stalin 23000000
See pandas.io.json.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None):
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html
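As a quick, hedged illustration of read_json (using a single record borrowed from the example above, not the full API payload), it works most naturally when the JSON is already table-shaped, e.g. a records-oriented array:
import io
import pandas as pd

# Sketch: records-oriented JSON parses directly into a DataFrame.
payload = '[{"rank": 1, "name": "Mao Zedong", "deaths": 63000000}]'
df = pd.read_json(io.StringIO(payload), orient='records')
print(df)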
