Convert csv data to nested json data - python

I have a csv data as:
dataf = pd.DataFrame{'title':['tit1', 'tit1', 'tit2' 'tit2','tit3'],
'context':['con1', 'con1', 'con2', 'con2', 'con3'],
'answers':['text1', 'text2', 'text3', 'text4', 'text5'],
'question':['que1', 'que2', 'que3', 'que4', 'que5'],
'id':['DDA', 'SAV', 'AFS', 'ML', 'MLI']}
I wanted to convert it to nested json format as below using python:
[
{
"title": "tit1",
"paragraph": [
{
"context": "con1",
"qas": [
{
"answers": "text1",
"question": "que1",
"id": "DDA"
},
{
"answers": "text2",
"question": "que2",
"id": "SAV"
}
]
}
]
},
{
"title": "tit2",
"paragraph": [
{
"context": "con2",
"qas": [
{
"answers": "text3",
"question": "que3",
"id": "AFS"
},
{
"answers": "text4",
"question": "que4",
"id": "ML"
}
],
"context": "con3",
"qas": [
{
"answers": "text5",
"question": "que5",
"id": "MLI"
}
]
}
]
}
]

Related

Best way to parse a JSON to store in SQL database (SQL stored procedure/Python)

I have a table of overly complex JSON files I'm trying to convert to tabular format to store in a SQL database. I'm pulling the JSONs from the quickbooks online API, and the format is messy to say the least.. (We're talking 7x nested JSONs for some bits of it..
The format resembles the code snippet down below. Currently I am using a bunch of OpenJSON's + Cross applys to dig down to the innermost ColData then work my way up but it looks like some of the ColData's get skipped over doing that.
Are there any better ways, using either Python (since I pull the JSON initially in Python before sending the JSON to a SQL database to parse) or SQL to convert it to tabular format besides manually trying to use OpenJSON with Cross applys?
The goal is to get all of the ColData's into a SQL table...
Thanks!
{
"Header": {
"ReportName": "BalanceSheet",
"Option": [
{
"Name": "AccountingStandard",
"Value": "GAAP"
},
{
"Name": "NoReportData",
"Value": "false"
}
],
"DateMacro": "this calendar year-to-date",
"ReportBasis": "Accrual",
"StartPeriod": "2016-01-01",
"Currency": "USD",
"EndPeriod": "2016-10-31",
"Time": "2016-10-31T09:42:21-07:00",
"SummarizeColumnsBy": "Total"
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "ASSETS"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Current Assets"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Bank Accounts"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"id": "35",
"value": "Checking"
},
{
"value": "1350.55"
}
],
"type": "Data"
},
{
"ColData": [
{
"id": "36",
"value": "Savings"
},
{
"value": "800.00"
}
],
"type": "Data"
}
]
},
"type": "Section",
"group": "BankAccounts",
"Summary": {
"ColData": [
{
"value": "Total Bank Accounts"
},
{
"value": "2150.55"
}
]
}
},
{
"Header": {
"ColData": [
{
"value": "Accounts Receivable"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"id": "84",
"value": "Accounts Receivable (A/R)"
},
{
"value": "6383.12"
}
],
"type": "Data"
}
]
},
"type": "Section",
"group": "AR",
"Summary": {
"ColData": [
{
"value": "Total Accounts Receivable"
},
{
"value": "6383.12"
}
]
}
},
{
"Header": {
"ColData": [
{
"value": "Other current assets"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"id": "81",
"value": "Inventory Asset"
},
{
"value": "596.25"
}
],
"type": "Data"
},
{
"ColData": [
{
"id": "4",
"value": "Undeposited Funds"
},
{
"value": "2117.52"
}
],
"type": "Data"
}
]
},
"type": "Section",
"group": "OtherCurrentAssets",
"Summary": {
"ColData": [
{
"value": "Total Other current assets"
},
{
"value": "2713.77"
}
]
}
}
]
},
"type": "Section",
"group": "CurrentAssets",
"Summary": {
"ColData": [
{
"value": "Total Current Assets"
},
{
"value": "11247.44"
}
]
}
},
{
"Header": {
"ColData": [
{
"value": "Fixed Assets"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"id": "37",
"value": "Truck"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"id": "38",
"value": "Original Cost"
},
{
"value": "13495.00"
}
],
"type": "Data"
}
]
},
"type": "Section",
"Summary": {
"ColData": [
{
"value": "Total Truck"
},
{
"value": "13495.00"
}
]
}
}
]
},
"type": "Section",
"group": "FixedAssets",
"Summary": {
"ColData": [
{
"value": "Total Fixed Assets"
},
{
"value": "13495.00"
}
]
}
}
]
},
"type": "Section",
"group": "TotalAssets",
"Summary": {
"ColData": [
{
"value": "TOTAL ASSETS"
},
{
"value": "24742.44"
}
]
}
},
{
"Header": {
"ColData": [
{
"value": "LIABILITIES AND EQUITY"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Liabilities"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Current Liabilities"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"Header": {
"ColData": [
{
"value": "Accounts Payable"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"id": "33",
"value": "Accounts Payable (A/P)"
},
{
"value": "1984.17"
}
],
"type": "Data"
}
]
},
"type": "Section",
"group": "AP",
"Summary": {
"ColData": [
{
"value": "Total Accounts Payable"
},
{
"value": "1984.17"
}
]
}
},
{
"Header": {
"ColData": [
{
"value": "Credit Cards"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"id": "41",
"value": "Mastercard"
},
{
"value": "157.72"
}
],
"type": "Data"
}
]
},
"type": "Section",
"group": "CreditCards",
"Summary": {
"ColData": [
{
"value": "Total Credit Cards"
},
{
"value": "157.72"
}
]
}
},
{
"Header": {
"ColData": [
{
"value": "Other Current Liabilities"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"id": "89",
"value": "Arizona Dept. of Revenue Payable"
},
{
"value": "4.55"
}
],
"type": "Data"
},
{
"ColData": [
{
"id": "90",
"value": "Board of Equalization Payable"
},
{
"value": "401.98"
}
],
"type": "Data"
},
{
"ColData": [
{
"id": "43",
"value": "Loan Payable"
},
{
"value": "4000.00"
}
],
"type": "Data"
}
]
},
"type": "Section",
"group": "OtherCurrentLiabilities",
"Summary": {
"ColData": [
{
"value": "Total Other Current Liabilities"
},
{
"value": "4406.53"
}
]
}
}
]
},
"type": "Section",
"group": "CurrentLiabilities",
"Summary": {
"ColData": [
{
"value": "Total Current Liabilities"
},
{
"value": "6548.42"
}
]
}
},
{
"Header": {
"ColData": [
{
"value": "Long-Term Liabilities"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"id": "44",
"value": "Notes Payable"
},
{
"value": "25000.00"
}
],
"type": "Data"
}
]
},
"type": "Section",
"group": "LongTermLiabilities",
"Summary": {
"ColData": [
{
"value": "Total Long-Term Liabilities"
},
{
"value": "25000.00"
}
]
}
}
]
},
"type": "Section",
"group": "Liabilities",
"Summary": {
"ColData": [
{
"value": "Total Liabilities"
},
{
"value": "31548.42"
}
]
}
},
{
"Header": {
"ColData": [
{
"value": "Equity"
},
{
"value": ""
}
]
},
"Rows": {
"Row": [
{
"ColData": [
{
"id": "34",
"value": "Opening Balance Equity"
},
{
"value": "-9337.50"
}
],
"type": "Data"
},
{
"ColData": [
{
"id": "2",
"value": "Retained Earnings"
},
{
"value": "91.25"
}
],
"type": "Data"
},
{
"ColData": [
{
"value": "Net Income"
},
{
"value": "2440.27"
}
],
"type": "Data",
"group": "NetIncome"
}
]
},
"type": "Section",
"group": "Equity",
"Summary": {
"ColData": [
{
"value": "Total Equity"
},
{
"value": "-6805.98"
}
]
}
}
]
},
"type": "Section",
"group": "TotalLiabilitiesAndEquity",
"Summary": {
"ColData": [
{
"value": "TOTAL LIABILITIES AND EQUITY"
},
{
"value": "24742.44"
}
]
}
}
]
},
"Columns": {
"Column": [
{
"ColType": "Account",
"ColTitle": "",
"MetaData": [
{
"Name": "ColKey",
"Value": "account"
}
]
},
{
"ColType": "Money",
"ColTitle": "Total",
"MetaData": [
{
"Name": "ColKey",
"Value": "total"
}
]
}
]
}
}
Here is what I tried to get the ColData (unsuccessfully I might add), I think it might be a little too contrived to do in SQL but I'm not sure if I should continue trying this way or if there's a better way in Python:
declare #json nvarchar(max)
SELECT #json = json FROM QboApiRawJSONData WHERE ID = 2
--Outer layer of JSON breaks into 3 parts - header, columns, rows
SELECT * FROM OPENJSON(#json)
WITH
(
Rows nvarchar(max) AS JSON
) as MainLayer
CROSS APPLY OPENJSON (MainLayer.Rows)
WITH
(
Row nvarchar(max) AS JSON
) as SecondaryLayer
CROSS APPLY OPENJSON (SecondaryLayer.Row)
WITH
(
Rows nvarchar(max) AS JSON
) As ThirdLayer
CROSS APPLY OPENJSON (ThirdLayer.Rows)
WITH
(
Row nvarchar(max) AS JSON
) as FourthLayer
CROSS APPLY OPENJSON (FourthLayer.Row)
WITH
(
Rows nvarchar(max) AS JSON
) as FifthLayer
CROSS APPLY OPENJSON (FifthLayer.Rows)
WITH
(
Row nvarchar(max) AS JSON
) as SixthLayer
CROSS APPLY OPENJSON (SixthLayer.Row)
WITH
(
Rows nvarchar(max) AS JSON
) as SeventhLayer
CROSS APPLY OPENJSON (SeventhLayer.Rows)
WITH
(
Row nvarchar(max) AS JSON
) as EighthLayer
CROSS APPLY OPENJSON (EighthLayer.Row)
WITH
(
Rows nvarchar(max) AS JSON
) as LayerNine
---Things get funky here
CROSS APPLY OPENJSON (LayerNine.Rows)
WITH
(
Row nvarchar(max) AS JSON
) as LayerTen
CROSS APPLY OPENJSON (LayerTen.Row)
WITH
(
Rows nvarchar(max) AS JSON
) as LayerEleven
CROSS APPLY OPENJSON (LayerEleven.Rows)
WITH
(
Row nvarchar(max) AS JSON
) as LayerTwelve
--21 items in last col
There is JSON support for sql-server:
https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-ver15
There is a JSON storage method here: https://learn.microsoft.com/en-us/sql/relational-databases/json/store-json-documents-in-sql-tables?view=sql-server-ver15. Although it will be fairly complex here I recommend storing the JSON data as a Logs table and then following the tutorial above to see if that solves your issue.
Use Quickbooks API to get the JSON, then refer to this guide:
https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-ver15
and this guide:
https://learn.microsoft.com/en-us/sql/relational-databases/json/convert-json-data-to-rows-and-columns-with-openjson-sql-server?view=sql-server-ver15
You can also consider setting something like AWS Lambda or Google Cloud Functions if you need something more automated.

Python Script to convert multiple json files in to single csv

{
"type": "Data",
"version": "1.0",
"box": {
"identifier": "abcdef",
"serial": "12345678"
},
"payload": {
"Type": "EL",
"Version": "1",
"Result": "Successful",
"Reference": null,
"Box": {
"Identifier": "abcdef",
"Serial": "12345678"
},
"Configuration": {
"EL": "1"
},
"vent": [
{
"ventType": "Arm",
"Timestamp": "2020-03-18T12:17:04+10:00",
"Parameters": [
{
"Name": "Arm",
"Value": "LT"
},
{
"Name": "Status",
"Value": "LD"
}
]
},
{
"ventType": "Arm",
"Timestamp": "2020-03-18T12:17:24+10:00",
"Parameters": [
{
"Name": "Arm",
"Value": "LT"
},
{
"Name": "Status",
"Value": "LD"
}
]
},
{
"EventType": "TimeUpdateCompleted",
"Timestamp": "2020-03-18T02:23:21.2979668Z",
"Parameters": [
{
"Name": "ActualAdjustment",
"Value": "PT0S"
},
{
"Name": "CorrectionOffset",
"Value": "PT0S"
},
{
"Name": "Latency",
"Value": "PT0.2423996S"
}
]
}
]
}
}
If you're looking to transfer information from a JSON file to a CSV, then you can use the following code to read in a JSON file into a dictionary in Python:
import json
with open('data.txt') as json_file:
data_dict = json.load(json_file)
You could then convert this dictionary into a list with either data_dict.items() or data_dict.values().
Then you just need to write this list to a CSV file which you can easily do by just looping through the list.

Fast way of adding fields to a nested dict

I need a help with improving my code.
I've got a nested dict with many levels:
{
"11": {
"FacLC": {
"immty": [
"in_mm",
"in_mm"
],
"moood": [
"in_oo",
"in_oo"
]
}
},
"22": {
"FacLC": {
"immty": [
"in_mm",
"in_mm",
"in_mm"
]
}
}
}
And I want to add additional fields on every level, so my output looks like this:
[
{
"id": "",
"name": "11",
"general": [
{
"id": "",
"name": "FacLC",
"specifics": [
{
"id": "",
"name": "immty",
"characteristics": [
{
"id": "",
"name": "in_mm"
},
{
"id": "",
"name": "in_mm"
}
]
},
{
"id": "",
"name": "moood",
"characteristics": [
{
"id": "",
"name": "in_oo"
},
{
"id": "",
"name": "in_oo"
}
]
}
]
}
]
},
{
"id": "",
"name": "22",
"general": [
{
"id": "",
"name": "FacLC",
"specifics": [
{
"id": "",
"name": "immty",
"characteristics": [
{
"id": "",
"name": "in_mm"
},
{
"id": "",
"name": "in_mm"
},
{
"id": "",
"name": "in_mm"
}
]
}
]
}
]
}
]
I managed to write a 4-times nested for loop, what I find inefficient and inelegant:
for main_name, general in my_dict.items():
generals = []
for general_name, specific in general.items():
specifics = []
for specific_name, characteristics in specific.items():
characteristics_dicts = []
for characteristic in characteristics:
characteristics_dicts.append({
"id": "",
"name": characteristic,
})
specifics.append({
"id": "",
"name": specific_name,
"characteristics": characteristics_dicts,
})
generals.append({
"id": "",
"name": general_name,
"specifics": specifics,
})
my_new_dict.append({
"id": "",
"name": main_name,
"general": generals,
})
I am wondering if there is more compact and efficient solution.
In the past I created a function to do it. Basically you call this function everytime that you need to add new fields to a nested dict, independently on how many levels this nested dict have. You only have to inform the 'full path' , that I called the 'key_map'.
Like ['node1','node1a','node1apart3']
def insert_value_using_map(_nodes_list_to_be_appended, _keys_map, _value_to_be_inserted):
for _key in _keys_map[:-1]:
_nodes_list_to_be_appended = _nodes_list_to_be_appended.setdefault(_key, {})
_nodes_list_to_be_appended[_keys_map[-1]] = _value_to_be_inserted

Add nested object to existing logic in Python

Below is the dataframe
df = pd.DataFrame([['xxx xxx','specs','67646546','TEST 123','United States of America']], columns = ['name', 'type', 'aim', 'aimd','context' ])
I am trying to add a object 'aimd' under 'data'.
Below is the format
{
"entities": [{
"name": "xxx xxx",
"type": "specs",
"data": {
"attributes": {
"aimd": {
"values": [{
"value": "xxxxx",
"source": "internal",
"locale": "en_Us"
}
]
}
},
"contexts": [{
"attributes": {
"aim": {
"values": [{
"value": "67646546",
"source": "internal",
"locale": "en_Us"
}
]
}
},
"context": {
"country": "United States of America"
}
}
]
}
}
]
}
It's just a matter of inserting an additional array:
import pandas as pd
import json
df = pd.DataFrame([['xxx xxx','specs','67646546','TEST 123','R12',43,'789S','XXX','SSSS','GGG','TTT','United States of America']], columns = ['name', 'type', 'aim', 'aimd','aim1','aim2','alim1','alim2','alim3','apim','asim','context' ])
exclude_list = ['name','type','aimd','context']
data = {'entities':[]}
for key,grp in df.groupby('name'):
for idx, row in grp.iterrows():
temp_dict_alpha = {'name':key,'type':row['type'],'data' :{'attributes':{},'contexts':[{'attributes':{},'context':{'country':row['context']}}]}}
attr_row = row[~row.index.isin(['name','type'])]
for idx2,row2 in attr_row.iteritems():
if idx2 not in exclude_list:
dict_temp = {}
dict_temp[idx2] = {'values':[]}
dict_temp[idx2]['values'].append({'value':row2,'source':'internal','locale':'en_Us'})
temp_dict_alpha['data']['contexts'][0]['attributes'].update(dict_temp)
if idx2 == 'aimd':
dict_temp = {}
dict_temp[idx2] = {'values':[]}
dict_temp[idx2]['values'].append({'value':row2,'source':'internal','locale':'en_Us'})
temp_dict_alpha['data']['attributes'].update(dict_temp)
data['entities'].append(temp_dict_alpha)
print(json.dumps(data, indent = 4))
Output:
print(json.dumps(data, indent = 4))
{
"entities": [
{
"name": "xxx xxx",
"type": "specs",
"data": {
"attributes": {
"aimd": {
"values": [
{
"value": "TEST 123",
"source": "internal",
"locale": "en_Us"
}
]
}
},
"contexts": [
{
"attributes": {
"aim": {
"values": [
{
"value": "67646546",
"source": "internal",
"locale": "en_Us"
}
]
},
"aim1": {
"values": [
{
"value": "R12",
"source": "internal",
"locale": "en_Us"
}
]
},
"aim2": {
"values": [
{
"value": 43,
"source": "internal",
"locale": "en_Us"
}
]
},
"alim1": {
"values": [
{
"value": "789S",
"source": "internal",
"locale": "en_Us"
}
]
},
"alim2": {
"values": [
{
"value": "XXX",
"source": "internal",
"locale": "en_Us"
}
]
},
"alim3": {
"values": [
{
"value": "SSSS",
"source": "internal",
"locale": "en_Us"
}
]
},
"apim": {
"values": [
{
"value": "GGG",
"source": "internal",
"locale": "en_Us"
}
]
},
"asim": {
"values": [
{
"value": "TTT",
"source": "internal",
"locale": "en_Us"
}
]
}
},
"context": {
"country": "United States of America"
}
}
]
}
}
]
}

How to Remove Outer Layer of JSON in Python

I am trying to remove the outer (parent) layer of a JSON file so that I can process it, however I have no idea how.
As you will see by the code below, the outer 2 most layers are 2 dictionaries, however, python says the 2nd dictionary ("item") is just a string when I call its type. Am I incorrect in how I interpret the structure?
sample_object6 = {
"items":
{
"item":
[
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
},
{
"id": "0002",
"type": "donut",
"name": "Raised",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
},
{
"id": "0003",
"type": "donut",
"name": "Old Fashioned",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
},
{
"id": "0004",
"type": "bar",
"name": "Bar",
"ppu": 0.75,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
]
},
"topping":
[
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
],
"fillings":
{
"filling":
[
{ "id": "7001", "name": "None", "addcost": 0 },
{ "id": "7002", "name": "Custard", "addcost": 0.25 },
{ "id": "7003", "name": "Whipped Cream", "addcost": 0.25 }
]
}
},
{
"id": "0005",
"type": "twist",
"name": "Twist",
"ppu": 0.65,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
]
},
"topping":
[
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
]
},
{
"id": "0006",
"type": "filled",
"name": "Filled",
"ppu": 0.75,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
]
},
"topping":
[
{ "id": "5002", "type": "Glazed" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
],
"fillings":
{
"filling":
[
{ "id": "7002", "name": "Custard", "addcost": 0 },
{ "id": "7003", "name": "Whipped Cream", "addcost": 0 },
{ "id": "7004", "name": "Strawberry Jelly", "addcost": 0 },
{ "id": "7005", "name": "Rasberry Jelly", "addcost": 0 }
]
}
}
]
}
}
I thought that it might be possible to store the nested portion starting at the first list (right after 'item') in a variable and then work with this but if I can't get python to see that item is a dictionary inside the items dictionary, then I fear I am at a loss with how to proceed.
Does anyone know what I am doing wrong?
Thank you in advance!
As far as the processing goes, there has been none because I could not even get the string to read as a dictionary appropriately.
This is what I tried to test if it was a dictionary:
for i in sample_object6:
print(i + str(type(i)))
for n in i["item"]:
print(n + str(type(n)))
After submitting the same code that I thought I had already submitted, I noticed that python is interpreting the object correctly. I have some obvious fundamental gaps in how to work in python and I'm sorry I took it to the forum.
For the record (and for future python newbies out there like me), I used the following code which returned the proper class types:
#this returned a class type of dictionary
print(type(sample_object6["items"]))
#this returned a class type of list
print(type(sample_object6["items"]["item"]))
Thank you SungJin Steve Yoo & Pm2Ring for your help.

Categories

Resources