How can I organize JSON data from pandas dataframe - python

I can't figure out how to correctly organize the JSON data that is created from my pandas dataframe. This is my code:
with open(spreadsheetName, 'rb') as spreadsheet:
    newSheet = spreadsheet.read()
newSheet = pd.read_excel(newSheet)
exportSheet = newSheet.to_json('file.json', orient = 'index')
And I'd like for the JSON data to look something like
{
"cars": [
{
"Model": "Camry",
"Year": "2015"
},
{
"Model": "Model S",
"Year": "2018"
}
]
}
But instead I'm getting a single line of JSON data from the code I have. Any ideas on how I can make each row a JSON 'object' with its own keys and values taken from the column headers (like Model and Year)?

Set the indent argument of the to_json function to the desired value.
exportSheet = newSheet.to_json('file.json', orient='index', indent=4)
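Note that indent only changes the layout of the output. To get one object per row inside a "cars" list, as in the question, one option is orient='records' wrapped in a dictionary. This is a minimal sketch, not the only way to do it: the inline DataFrame stands in for the spreadsheet, and the "cars" key and the Model/Year columns are taken from the desired output in the question.

```python
import json

import pandas as pd

# Stand-in for the spreadsheet contents (assumed columns: Model, Year)
df = pd.DataFrame({"Model": ["Camry", "Model S"], "Year": ["2015", "2018"]})

# orient="records" yields one JSON object per row; going through to_json
# first converts every cell to a JSON-safe type.
records = json.loads(df.to_json(orient="records"))

# Wrap the row objects under the top-level key from the question.
payload = {"cars": records}
text = json.dumps(payload, indent=4)
print(text)
```

To write the file instead of printing, json.dump(payload, fp, indent=4) on an open file handle does the same thing.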

Related

Python nested dict. return deepest dict to csv

I'm trying to return some values from a nested dict (based on a json) to a csv without success due to the following structure.
{
"http_method":"GET",
"results":{
"FTKMOB21xxxxD":{
"serial_number":"FTKMOB21xxxxD",
"comments":"",
"q_type":432,
"license":"EFTM123123123",
"type":"mobile",
"user":"pippo",
"user_type":"user",
"drift":0,
"status":{
"name":"activated"
}
},
"FTKMOB21xxxxF":{
"serial_number":"FTKMOB21xxxxF",
"comments":"",
"q_type":432,
"license":"EFTM123123123",
"type":"mobile",
"drift":0,
"status":{
"name":"pending"
}
}
},
"vdom":"root",
"path":"user",
"name":"fortitoken",
"action":"",
"status":"success",
"serial":"FGT_VM",
"version":"v7.0.5",
"build":304
}
What I need to return in a csv are fields "serial_number", "user", "status".
The FTKMOB21xxxxD key changes for each device, so I need to treat it as a dynamic value. I suppose a loop based on its position is needed.
Could you please help me understand how to do that?
It's straight-forward with pandas:
import pandas as pd
df = pd.DataFrame(input_dict['results'])
df.T[["serial_number", "user", "status"]].to_csv('output.csv', index=False)
Your csv will then look like:
serial_number,user,status
FTKMOB21xxxxD,pippo,{'name': 'activated'}
FTKMOB21xxxxF,,{'name': 'pending'}
Edit: if you actually want status/name as status, you have to reassign df['status']:
df = pd.DataFrame.from_dict(input_dict['results'], orient='index', columns=["serial_number", "user", "status"])
df['status'] = pd.DataFrame(df['status'].to_list())['name'].to_list()
df.to_csv('output.csv', index=False)
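As an alternative sketch, pd.json_normalize can flatten the nested status dict into a "status.name" column in one step, which avoids reassigning the column by hand. The input_dict below is a trimmed copy of the data in the question, and the CSV is built as a string here only so the result is easy to inspect.

```python
import pandas as pd

# Trimmed copy of the response from the question
input_dict = {
    "results": {
        "FTKMOB21xxxxD": {
            "serial_number": "FTKMOB21xxxxD",
            "user": "pippo",
            "status": {"name": "activated"},
        },
        "FTKMOB21xxxxF": {
            "serial_number": "FTKMOB21xxxxF",
            "status": {"name": "pending"},
        },
    }
}

# The dynamic keys are not needed: each value already carries serial_number.
flat = pd.json_normalize(list(input_dict["results"].values()))

# Nested keys are flattened with a dot, so status -> name becomes "status.name".
out = flat[["serial_number", "user", "status.name"]].rename(
    columns={"status.name": "status"}
)

# Passing a path instead (e.g. out.to_csv('output.csv', index=False)) writes a file.
csv_text = out.to_csv(index=False)
print(csv_text)
```

The missing "user" on the second device becomes NaN and is written as an empty field.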

Excel to JSON format with python

I have an excel sheet which is in the below format
I want to convert this Excel sheet into JSON format using Python. Each JSON object is built from a diagonal value and the column headings, in the format below.
{
"Records": [
{
"RecordId": "F1",
"Assets": [
{
"AssetId": "A1",
"Support": "S11"
},
{
"AssetId": "A2",
"Support": "S12"
},
{
"AssetId": "A3",
"Support": "S13"
}
]
},
{
"RecordId": "F2",
"Assets": [
{
"AssetId": "A1",
"Support": "S21"
},
{
"AssetId": "A2",
"Support": "S22"
},
{
"AssetId": "A3",
"Support": "S23"
}
]
}
]
}
I have written some code, but it does not work as I expected.
import json
import pandas as pd
df = pd.read_excel (r'test.xlsx', sheet_name='Sheet2')
#initialize data
data=[0 for i in range(len(df))]
datac=[0 for c in range(len(df.columns))]
newset=dict()
for i in range(len(df)):
    # data[i] = r'{"'+str(df.columns.values[0])+'": "' +str(df.loc[i][0])+'", '+str(df.columns.values[1])+'": "' +str(df.loc[i][1])+'", '+str(df.columns.values[2])+'": "' +str(df.loc[i][2])+'"}'
    #data[i] = {str(df.columns.values[1]) : str(df.loc[i][0]), str(df.columns.values[1]): str(df.loc[i][1]), str(df.columns.values[2]): str(df.loc[i][2])}
    for c in range(1,len(df.columns)):
        #data[i] = {str('RecordId') : str(df.loc[i][0]),str('Assets'):[{"AssetId": str(df.columns.values[c]),"Support": str(df.loc[i][c])}]}
        datac[c] = {"AssetId": str(df.columns.values[c]),"Support": str(df.loc[i][c])}
        data[i]={str('RecordId') : str(df.loc[i][0]),str('Assets'):datac[c]}
        print(data[i])
output_lines = [json.dumps(line)+",\n" for line in data]
output_lines[-1] = output_lines[-1][:-2] # remove ",\n" from last line
with open(r'Savedwork.json', 'w') as json_file:
    json_file.writelines(output_lines)
What you need is the iterrows() method, which iterates over the dataframe's rows as (index, series) pairs. The columns attribute gives you the column names, so you can iterate over the columns in each series and access the values by name.
import json
import pandas as pd
df = pd.read_excel('test.xlsx')
recs = []
for i, row in df.iterrows():
    rec = {
        'RecordId': row.iloc[0],  # first column holds the record id (F1, F2, ...)
        'Assets': [{'AssetId': c, 'Support': row[c]} for c in df.columns[1:]]
    }
    recs.append(rec)
out = {'Records': recs}
(yes, it could all be done in a single list comprehension, but abusing those hinders readability)
Also, you don't need to do json.dumps on lines, and then assemble them with
newlines (don't work at the text level): build a dictionary with the entire
data, and then json.dump that:
print(json.dumps(out, indent=4))
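For trying this without the Excel file, here is a self-contained sketch of the same idea. The inline DataFrame is an assumption about the sheet layout (record ids in the first column, one column per asset), and the "Id" header and the frame_to_records helper are hypothetical names used only for illustration.

```python
import json

import pandas as pd

# Assumed sheet layout: first column = record id, remaining columns = assets
df = pd.DataFrame(
    {
        "Id": ["F1", "F2"],
        "A1": ["S11", "S21"],
        "A2": ["S12", "S22"],
        "A3": ["S13", "S23"],
    }
)


def frame_to_records(frame):
    """Build the nested {'Records': [...]} structure from a wide DataFrame."""
    id_col, *asset_cols = frame.columns
    records = []
    for row in frame.to_dict(orient="records"):
        records.append(
            {
                "RecordId": row[id_col],
                "Assets": [{"AssetId": c, "Support": row[c]} for c in asset_cols],
            }
        )
    return {"Records": records}


out = frame_to_records(df)
print(json.dumps(out, indent=4))
```

Keeping the conversion in a function makes it easy to check against a small hand-built frame before pointing it at pd.read_excel output.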
You can create the dicts directly in pandas.
First set the first column with F1, F2 as index:
df.set_index(0, inplace = True)
df.index.name = None
Then create the dicts in pandas with dict keys as column names, export it to a dict and save it to json:
import json
# With axis=1 each x is one row, so x.items() yields (column name, cell value) pairs
df = df.apply(lambda x: [{"AssetId": c, "Support": v} for c, v in x.items()], axis=1).reset_index().rename(columns={'index': 'RecordId', 0: 'Assets'})
json_data = {"Records": df.to_dict('records')}
with open(r'Savedwork.json', 'w') as fp:
    json.dump(json_data, fp)
Another solution is to take a snapshot of the entire workbook in JSON format and reorganize it afterwards. This can be done from the command line with the collect function of XLtoy, and the approach gives you more degrees of freedom.
[I'm the main developer of XLtoy]

How to handle a JSON that returns a list of dict-like objects in Pandas?

I am using an API from collegefootballdata.com to get data on scores and betting lines. I want to use betting lines to infer expected win % and then compare that to actual results (I feel like my team loses too many games where we are big favorites and want to test that.) This code retrieves one game for example purposes.
parameters = {
    "gameId": 401112435,
    "year": 2019
}
response = requests.get("https://api.collegefootballdata.com/lines", params=parameters)
The JSON output is this:
[
{
"awayConference": "ACC",
"awayScore": 28,
"awayTeam": "Virginia Tech",
"homeConference": "ACC",
"homeScore": 35,
"homeTeam": "Boston College",
"id": 401112435,
"lines": [
{
"formattedSpread": "Virginia Tech -4.5",
"overUnder": "57.5",
"provider": "consensus",
"spread": "4.5"
},
{
"formattedSpread": "Virginia Tech -4.5",
"overUnder": "57",
"provider": "Caesars",
"spread": "4.5"
},
{
"formattedSpread": "Virginia Tech -4.5",
"overUnder": "58",
"provider": "numberfire",
"spread": "4.5"
},
{
"formattedSpread": "Virginia Tech -4.5",
"overUnder": "56.5",
"provider": "teamrankings",
"spread": "4.5"
}
],
"season": 2019,
"seasonType": "regular",
"week": 1
}
]
I'm then loading into a pandas dataframe with:
def jstring(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    return text
json_str = jstring(response.json())
df = pd.read_json(json_str)
This creates a dataframe with a "lines" column that contains the entire lines section of the JSON as a string. Ultimately, I want to use the "spread" value in the block where "provider" = "consensus". Everything else is extraneous for my purposes. I've tried exploding the column with
df = df.explode('lines')
which gives me 4 rows with something like this for each game (as expected):
{'formattedSpread': 'Virginia Tech -4.5', 'overUnder': '57.5', 'provider': 'consensus', 'spread': '4.5'}
Here is where I'm stuck. I want to keep only the rows where 'provider' = 'consensus', and further I need to have 'spread' to use as a separate variable/column in my analysis. I've tried exploding a 2nd time, df.split, df.replace to change { to [ and explode as a list, all to no avail. Any help is appreciated!!
This is probably what you're looking for -
EDIT: Handling special case.
import pandas as pd
import requests
params = {
    "gameId": 401112435,
    "year": 2019,
}
r = requests.get("https://api.collegefootballdata.com/lines", params=params)
df = pd.DataFrame(r.json()) # Create a DataFrame with a lines column that contains JSON
df = df.explode('lines') # Explode the DataFrame so that each line gets its own row
df = df.reset_index(drop=True) # After explosion, the indices are all the same - this resets them so that you can align the DataFrame below cleanly
def fill_na_lines(lines):
    # Games without any lines explode to NaN; replace that with an empty record
    if pd.isna(lines):
        return {k: None for k in ['provider', 'spread', 'formattedSpread', 'overUnder']}
    return lines
df.lines = df.lines.apply(fill_na_lines)
lines_df = pd.DataFrame(df.lines.tolist()) # A separate lines DataFrame created from the lines JSON column
df = pd.concat([df, lines_df], axis=1) # Concatenating the two DataFrames side by side (axis=1 adds columns).
# Now you can filter down to whichever rows you need.
df = df[df.provider == 'consensus']
The documentation on joining DataFrames in different ways is probably useful.
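A more compact route, sketched here under the assumption that every game record has a "lines" key, is pd.json_normalize with record_path and meta. The games list below is a trimmed copy of the response in the question so the example runs without calling the API. The spread comes back as a string, so it is converted with pd.to_numeric before use.

```python
import pandas as pd

# Trimmed copy of the API response from the question
games = [
    {
        "id": 401112435,
        "homeTeam": "Boston College",
        "awayTeam": "Virginia Tech",
        "lines": [
            {"provider": "consensus", "spread": "4.5", "overUnder": "57.5"},
            {"provider": "Caesars", "spread": "4.5", "overUnder": "57"},
        ],
    }
]

# One row per entry in "lines", with the listed game-level fields repeated.
lines = pd.json_normalize(
    games, record_path="lines", meta=["id", "homeTeam", "awayTeam"]
)

# Keep only the consensus line and turn the spread into a number.
consensus = lines[lines["provider"] == "consensus"].copy()
consensus["spread"] = pd.to_numeric(consensus["spread"])
print(consensus[["id", "homeTeam", "awayTeam", "spread"]])
```

Games whose "lines" list is empty simply produce no rows here, so they would need to be merged back in if they matter for the analysis.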

Extracting JSON data into a relational table

I have a JSON file which resulted from YouTube's iframe API and needs to be preprocessed. I want to put this JSON data into a pandas dataframe, where each JSON key will be a column, and each recorded "event" should be a new row.
I was able to load the data as a dataframe using read_json, but with this the keys for each event are shown as an array.
Here is what my JSON data looks like :
{
"events":[
{
"timemillis":1563467463580,
"date":"18.7.2019",
"time":"18:31:03,580",
"name":"Player is loading",
"data":""
},
{
"timemillis":1563467463668,
"date":"18.7.2019",
"time":"18:31:03,668",
"name":"Player is loaded",
"data":"5"
}
]
}
And this is what I did to convert it to a dataframe:
data=pd.read_json("file.json")
df=pd.DataFrame(data)
print(df)
The output looks like this:
0 {'timemillis': 1563469276604, 'date': '18.7.20...
1 {'timemillis': 1563469276694, 'date': '18.7.20...
...
How can I convert this output into a table where I have separate columns for these keys, such as 'timemillis', 'date', 'name' and so on? I have never worked with JSON before, so I am a bit confused.
import pandas as pd
import json
data = {
    "events":[
        {
            "timemillis":1563467463580,
            "date":"18.7.2019",
            "time":"18:31:03,580",
            "name":"Player is loading",
            "data":""
        },
        {
            "timemillis":1563467463668,
            "date":"18.7.2019",
            "time":"18:31:03,668",
            "name":"Player is loaded",
            "data":"5"
        }
    ]
}
# Or read the data from the file. Rather than reading the file directly
# into a pandas dataframe, load it as JSON first:
# data=pd.read_json("file.json")
with open('file.json') as json_file:
    data = json.load(json_file)
# Build the dataframe from the list under "events", one row per event
df=pd.DataFrame(data['events'])
print(df)
Result
  data       date               name          time     timemillis
0       18.7.2019  Player is loading  18:31:03,580  1563467463580
1    5  18.7.2019   Player is loaded  18:31:03,668  1563467463668
import pandas as pd
df=pd.read_json("file.json",orient='columns')
rows = []
for i,r in df.iterrows():
    # each cell of the 'events' column is a dict describing one event
    rows.append({'eventid':i+1,'timemillis':r['events']['timemillis'],'name':r['events']['name']})
df = pd.DataFrame(rows)
print(df)
Now you can insert this df into a database.
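For the database step, one possible sketch uses DataFrame.to_sql with an in-memory SQLite connection from the standard library. The table name "events" and the choice of SQLite are assumptions for illustration; the rows are copied from the question.

```python
import sqlite3

import pandas as pd

# Rows in the shape produced above (values copied from the question)
df = pd.DataFrame(
    [
        {"eventid": 1, "timemillis": 1563467463580, "name": "Player is loading"},
        {"eventid": 2, "timemillis": 1563467463668, "name": "Player is loaded"},
    ]
)

# In-memory database for illustration; pass a file path to persist it.
conn = sqlite3.connect(":memory:")

# Create (or replace) the hypothetical "events" table from the DataFrame.
df.to_sql("events", conn, index=False, if_exists="replace")

# Read the rows back to confirm the insert.
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
names = [row[0] for row in conn.execute("SELECT name FROM events ORDER BY timemillis")]
conn.close()
print(count, names)
```

For other databases, to_sql expects an SQLAlchemy engine or connection instead of a sqlite3 connection.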

How to read this JSON into dataframe with specific dataframe format

This is my JSON string. I want to read it into a dataframe in the following tabular format.
I have no idea what I should do after pd.DataFrame(json.loads(data)).
JSON data, edited
{
"data":[
{
"data":{
"actual":"(0.2)",
"upper_end_of_central_tendency":"-"
},
"title":"2009"
},
{
"data":{
"actual":"2.8",
"upper_end_of_central_tendency":"-"
},
"title":"2010"
},
{
"data":{
"actual":"-",
"upper_end_of_central_tendency":"2.3"
},
"title":"longer_run"
}
],
"schedule_id":"2014-03-19"
}
That's a somewhat overly nested JSON. But if that's what you have to work with, and assuming your parsed JSON is in jdata:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ] + ['schedule_id' ]
sched_id = jdata['schedule_id']
rows = [ [item['data'][rn] for item in datapts ] + [sched_id] for rn in rownames]
df = pd.DataFrame(rows, index=rownames, columns=colnames)
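df is now:
                                2009  2010  longer_run  schedule_id
actual                         (0.2)   2.8           -   2014-03-19
upper_end_of_central_tendency      -     -         2.3   2014-03-19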
If you wanted to simplify that a bit, you could construct the core data without the asymmetric schedule_id field, then add that after the fact:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ]
rows = [ [item['data'][rn] for item in datapts ] for rn in rownames]
d2 = pd.DataFrame(rows, index=rownames, columns=colnames)
d2['schedule_id'] = jdata['schedule_id']
That will make an identical DataFrame (i.e. df == d2). It helps when learning pandas to try a few different construction strategies, and get a feel for what is more straightforward. There are more powerful tools for unfolding nested structures into flatter tables, but they're not as easy to understand first time out of the gate.
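One of those more powerful tools is pd.json_normalize. The sketch below is one way it could be applied to this data, not the only one: it flattens each item's nested "data" dict into dotted column names, then transposes so the titles become columns. The jdata literal is copied from the question.

```python
import pandas as pd

jdata = {
    "data": [
        {"data": {"actual": "(0.2)", "upper_end_of_central_tendency": "-"}, "title": "2009"},
        {"data": {"actual": "2.8", "upper_end_of_central_tendency": "-"}, "title": "2010"},
        {"data": {"actual": "-", "upper_end_of_central_tendency": "2.3"}, "title": "longer_run"},
    ],
    "schedule_id": "2014-03-19",
}

# Flatten: columns become "title", "data.actual", "data.upper_end_of_central_tendency"
flat = pd.json_normalize(jdata["data"])

# Titles become the columns, the flattened data keys become the rows.
table = flat.set_index("title").T
table.index = table.index.str.replace("data.", "", regex=False)
table.columns.name = None

# Add the asymmetric schedule_id field after the fact, as in the second variant above.
table["schedule_id"] = jdata["schedule_id"]
print(table)
```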
(Update) If you wanted a better structuring on your JSON to make it easier to put into this format, ask pandas what it likes. E.g. df.to_json() output, slightly prettified:
{
"2009": {
"actual": "(0.2)",
"upper_end_of_central_tendency": "-"
},
"2010": {
"actual": "2.8",
"upper_end_of_central_tendency": "-"
},
"longer_run": {
"actual": "-",
"upper_end_of_central_tendency": "2.3"
},
"schedule_id": {
"actual": "2014-03-19",
"upper_end_of_central_tendency": "2014-03-19"
}
}
That is a format from which pandas' read_json function will immediately construct the DataFrame you desire.
