JSON to CSV Conversion - Python

I have a very large JSON file containing multiple individual JSON objects in the format shown below. I am trying to convert it to a CSV so that each row combines the outer id/name/alphabet of a JSON object with one id/name/alphabet set from its conversion list. This is repeated for every set within an individual JSON object. So from the object below, 2 rows should be created: the first row is the (outer) id/name/alphabet plus the 1st id/name/alphabet of conversion, and the second row is again the (outer) id/name/alphabet plus the 2nd id/name/alphabet of conversion.
An important note is that certain objects in the file can have upwards of 50-60 conversion id/name/alphabet pairs.
What I have tried so far is flattening the JSON objects first, which resulted in keys like conversion_id_0, conversion_id_1, etc. That lets me map the outer fields, since they are always constant, but I am unsure how to map each corresponding numbered set to a separate row.
Any help or insight would be greatly appreciated!
[
    {
        "alphabet": "ABCDEFGHIJKL",
        "conversion": [
            {
                "alphabet": "BCDEFGHIJKL",
                "id": 18589260,
                "name": [
                    "yy"
                ]
            },
            {
                "alphabet": "EFGHIJEFGHIJ",
                "id": 18056632,
                "name": [
                    "zx",
                    "cd"
                ]
            }
        ],
        "id": 23929934,
        "name": [
            "x",
            "y"
        ]
    }
]

Your question isn't clear about the exact mapping from input JSON data to rows of the CSV file, so I had to guess what should happen when there's more than one "name" associated with an inner or outer object.
Regardless, hopefully the following will give you a general idea of how to solve such problems.
import csv

objects = [
    {
        "alphabet": "ABCDEFGHIJKL",
        "id": 23929934,
        "name": [
            "x",
            "y"
        ],
        "conversion": [
            {
                "alphabet": "BCDEFGHIJKL",
                "id": 18589260,
                "name": [
                    "yy"
                ]
            },
            {
                "alphabet": "EFGHIJEFGHIJ",
                "id": 18056632,
                "name": [
                    "zx",
                    "cd"
                ]
            }
        ]
    }
]

def group(item):
    # Collapse an object's id/alphabet/name fields into one list of CSV
    # values, joining multiple names with a space.
    return [item["id"], item["alphabet"], ' '.join(item["name"])]

# Note: open in text mode with newline='' (Python 3) -- opening with 'wb'
# would make csv.writer raise a TypeError.
with open('converted_json.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile, quoting=csv.QUOTE_NONNUMERIC)
    for obj in objects:
        outer = group(obj)
        for conversion in obj["conversion"]:
            writer.writerow(outer + group(conversion))
Contents of the CSV file generated:
23929934,"ABCDEFGHIJKL","x y",18589260,"BCDEFGHIJKL","yy"
23929934,"ABCDEFGHIJKL","x y",18056632,"EFGHIJEFGHIJ","zx cd"

Related

Convert Embedded JSON Dict To Pandas DataFrame Where Column Headers Are Separate From Values

I'm trying to create a Python pandas DataFrame out of a JSON dictionary. The nesting is tripping me up.
The column headers are in a different section of the JSON file from the values.
The JSON looks similar to the below. There is one section of column headers and multiple sections of data.
I need each column filled with the data that relates to it, so value_one in each case will fill the column under header_one, and so on.
I have come close, but can't seem to get it to spit out the DataFrame as described.
{
    "my_data": {
        "column_headers": [
            "header_one",
            "header_two",
            "header_three"
        ],
        "values": [
            {
                "data": [
                    "value_one",
                    "value_two",
                    "value_three"
                ]
            },
            {
                "data": [
                    "value_one",
                    "value_two",
                    "value_three"
                ]
            }
        ]
    }
}
Assuming your dictionary is my_dict, try:
>>> pd.DataFrame(data=[d["data"] for d in my_dict["my_data"]["values"]],
...              columns=my_dict["my_data"]["column_headers"])
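Fleshed out into a runnable sketch using the sample data from the question (the dictionary name `my_dict` is assumed):

```python
import pandas as pd

my_dict = {
    "my_data": {
        "column_headers": ["header_one", "header_two", "header_three"],
        "values": [
            {"data": ["value_one", "value_two", "value_three"]},
            {"data": ["value_one", "value_two", "value_three"]},
        ],
    }
}

# Each inner "data" list becomes one row; the separate header list
# supplies the column names.
df = pd.DataFrame(data=[d["data"] for d in my_dict["my_data"]["values"]],
                  columns=my_dict["my_data"]["column_headers"])
print(df)
```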

merging list with nested list

I need to 'cross join' (for want of a better term!) 2 lists.
Between them they represent a tabular dataset, but one holds the column header names and the other a nested array with the row values.
I've managed the easy bit:
col_names = [i['name'] for i in c]
which strips the column names out into a list without 'typeName'.
But just thinking how to extract the row field values and map them to column names is giving me a headache!
Any pointers appreciated ;)
Thanks
Columns (as provided):
[
    {
        "name": "col1",
        "typeName": "varchar"
    },
    {
        "name": "col2",
        "typeName": "int4"
    }
]
Records (as provided):
[
    [
        {
            "stringValue": "apples"
        },
        {
            "longValue": 1
        }
    ],
    [
        {
            "stringValue": "bananas"
        },
        {
            "longValue": 2
        }
    ]
]
Required Result:
[
    {
        'col1': 'apples',
        'col2': 1
    },
    {
        'col1': 'bananas',
        'col2': 2
    }
]
You have to be able to assume there is a 1-to-1 correspondence between the names in the schema and the dicts in the records. Once you assume that, it's pretty easy:
names = [i['name'] for i in schema]
data = []
for row in records:
    d = {}
    for a, b in zip(names, row):
        d[a] = list(b.values())[0]
    data.append(d)
print(data)
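The same pairing can be written as a nested comprehension. A self-contained sketch, with `schema` and `records` mirroring the sample data from the question:

```python
schema = [
    {"name": "col1", "typeName": "varchar"},
    {"name": "col2", "typeName": "int4"},
]
records = [
    [{"stringValue": "apples"}, {"longValue": 1}],
    [{"stringValue": "bananas"}, {"longValue": 2}],
]

names = [c["name"] for c in schema]
# zip() pairs each column name with the matching value-dict in the row;
# next(iter(...)) pulls out that dict's single value regardless of its key
# ("stringValue", "longValue", etc.).
data = [{name: next(iter(cell.values())) for name, cell in zip(names, row)}
        for row in records]
print(data)  # → [{'col1': 'apples', 'col2': 1}, {'col1': 'bananas', 'col2': 2}]
```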

Multi-level Python Dict to Pandas DataFrame only processes one level out of many

I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
    "general_info": {
        "name": "xxx",
        "description": "xxx",
        "language": "xxx",
        "prefix": "xxx",
        "version": "xxx"
    },
    "element_count": {
        "folders": 23,
        "conditions": 72,
        "listeners": 1,
        "outputs": 47
    },
    "external_resource_count": {
        "total": 9,
        "extensions": {
            "jar": 8,
            "json": 1
        },
        "paths": {
            "/lib": 9
        }
    },
    "complexity": {
        "over_1_transition": {
            "number": 4,
            "percentage": 30.769
        },
        "over_1_trigger": {
            "number": 2,
            "percentage": 15.385
        },
        "over_1_output": {
            "number": 4,
            "percentage": 30.769
        }
    }
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, categories with a sub-sub-category get written as a string in the cell rather than as a further column. I've also tried using stack(level=1), but it raises "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series, with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to some meaningful name.
# this function extracts all the keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# here we concat the exploded series to a frame
exploded_df = (pd.concat(expp, axis=1).stack().to_frame(name='somecol2')
               .reset_index(level=2).rename(columns={'level_2': 'somecol'}))

# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
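As an alternative, if a flat listing of every leaf value is acceptable, `pandas.json_normalize` can flatten arbitrarily deep dictionaries in one call. A sketch using a subset of the question's data (this is a different approach from the answer above, not a drop-in replacement):

```python
import pandas as pd

extracted_metrics = {
    "element_count": {"folders": 23, "conditions": 72},
    "external_resource_count": {
        "total": 9,
        "extensions": {"jar": 8, "json": 1},
        "paths": {"/lib": 9},
    },
}

# json_normalize flattens nested keys into dotted column names in a
# single row, e.g. 'external_resource_count.extensions.jar'.
flat = pd.json_normalize(extracted_metrics, sep='.')

# Transpose into one (metric, value) row per leaf for display.
table = flat.T.reset_index()
table.columns = ['metric', 'value']
print(table)
```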

How to clear an array and reset the values in a for loop while building a JSON string?

I am looping through each row in an Excel sheet using the openpyxl import to ultimately build a large JSON string that I can feed to an API.
As I loop through each row and build out my JSON structure, I need to split a cell value by " || " and add each resulting value to a nested array inside a JSON section. With my current code, the problem is that I build my list object in the for loop, append the JSON chunk to a larger array, and it keeps accumulating my list values across loops. So I used the .clear() method on the list to clear it after each loop... but then when I compile my final output, the list is empty. It's like it does not maintain its values once it is added each loop. I am new to Python and gave it a good whirl. Any suggestions in the right direction would be appreciated. It's almost like each loop needs its own unique array to use and keep the values. The tags section of the JSON is emptied in the final output for each JSON line... when it should hold the values from its own iteration.
My Data Set (I have 3 rows in Excel). You can see that I have values I want to split in the 7th column; that is the column I am looping through to split the values, as they will be nested in my JSON.
Row 1 (cells) = "ABC","Testing","Testing Again","DATE","DATE",Empty,"A || B || C".
Row 2 (cells) = "ABC 2","Testing 2","Testing Again 2","DATE","DATE",Empty,"X || Y || Z".
Row 3 (cells) = "ABC 3","Testing 3","Testing Again 3","DATE","DATE",Empty,Empty.
My Code.
#from openpyxl import Workbook
import json
from openpyxl import load_workbook

output_table = input_table.copy()
var_path_excel_file = flow_variables['Location']
workbook = load_workbook(filename=var_path_excel_file)
sheet = workbook.active

#create a null value to be used
emptyString = "Null"

#list out all of the sections of the json that we want to print out - these are based on the inputs
jsonFull = []
jsondata = {}
tags = []
for value in sheet.iter_rows(min_row=2, min_col=0, max_col=40, values_only=True):
    #I add my split values to an array so that way when i add the array to the json it will have the proper brackets i need for the API to run correctly
    if value[6] is not None:
        data = value[6].split(" || ")
        for temp in data:
            tags.append(temp)
    #I build out the json structure here that will be added for each excel row basically
    jsondata = {
        "name": value[0],
        "short_description": value[1],
        "long_description": value[2],
        "effective_start_date": value[3],
        "effective_end_date": value[4],
        "workflow_state": emptyString,
        "tags": tags
    }
    #Add the jsondata row to the larger collection
    jsonFull.append(jsondata)
    tags.clear()
print(json.dumps(jsonFull))
And then my desired outcome would be something like this. I just need to figure out the proper syntax for the list handling...and can't seem to find an example to base off of.
[
{
"name": "ABC",
"short_description": "Testing",
"long_description": "Testing Again",
"effective_start_date": "2020-03-04T14:45:22Z",
"effective_end_date": "2020-03-04T14:45:22Z",
"workflow_state": "Null",
"tags": [
"A",
"B",
"C"
]
},
{
"name": "ABC 2",
"short_description": "Testing 2",
"long_description": "Testing Again 2",
"effective_start_date": "2020-03-04T14:45:22Z",
"effective_end_date": "2020-03-04T14:45:22Z",
"workflow_state": "Null",
"tags": [
"X",
"Y",
"Z"
]
},
{
"name": "ABC 3",
"short_description": "Testing 3",
"long_description": "Testing Again 3",
"effective_start_date": "2020-03-04T14:45:22Z",
"effective_end_date": "2020-03-04T14:45:22Z",
"workflow_state": "Null",
"tags": [
]
}
]
You're not making a copy of tags when you put it into the dictionary or call tags.clear(), you're just putting a reference to the same list. You need to create a new list at the beginning of each loop iteration, not reuse the same list.
for value in sheet.iter_rows(min_row=2, min_col=0, max_col=40, values_only=True):
    #Split the cell into a brand-new list on every iteration
    if value[6] is not None:
        tags = value[6].split(" || ")
    else:
        tags = []
    #I build out the json structure here that will be added for each excel row basically
    jsondata = {
        "name": value[0],
        "short_description": value[1],
        "long_description": value[2],
        "effective_start_date": value[3],
        "effective_end_date": value[4],
        "workflow_state": emptyString,
        "tags": tags
    }
    #Add the jsondata row to the larger collection
    jsonFull.append(jsondata)
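A stripped-down illustration of the aliasing problem and the fix, with the Excel plumbing removed (the `rows` list stands in for the 7th-column cell values):

```python
import json

rows = ["A || B || C", "X || Y || Z", None]

# Buggy pattern: one shared list, cleared each iteration.
shared = []
buggy = []
for cell in rows:
    if cell is not None:
        shared.extend(cell.split(" || "))
    buggy.append({"tags": shared})  # stores a reference, not a copy...
    shared.clear()                  # ...so this empties every stored entry

# Fixed pattern: bind a fresh list each iteration.
fixed = []
for cell in rows:
    tags = cell.split(" || ") if cell is not None else []
    fixed.append({"tags": tags})

print(json.dumps(buggy))  # every "tags" ends up []
print(json.dumps(fixed))  # tags survive per row
```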

Get values with the key names from json object

I am trying to retrieve the values of specific columns from the Python list object. This is the response format from the Log Analytics API here - https://dev.loganalytics.io/documentation/Using-the-API/ResponseFormat
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{
    "tables": [
        {
            "name": "PrimaryResult",
            "columns": [
                {
                    "name": "Category",
                    "type": "string"
                },
                {
                    "name": "count_",
                    "type": "long"
                }
            ],
            "rows": [
                [
                    "Administrative",
                    20839
                ],
                [
                    "Recommendation",
                    122
                ],
                [
                    "Alert",
                    64
                ],
                [
                    "ServiceHealth",
                    11
                ]
            ]
        }
    ]
}
There are hundreds of columns and I want to read specific columns and row values. To do that, I initially tried to find the index of a column, e.g. "Category", and retrieve all the values from the rows. Here is what I have done so far.
result = requests.get(url, params=params, headers=headers, verify=False)
index_category = (result.json()['tables'][0]['columns']).index('Category')
result contains data in the format posted above. I get the below error. What am I missing?
ValueError: 'Category' is not in list
I want to be able to retrieve the Category values from the rows array in a loop. I have also written the below loop and am able to get what I want, but I want to confirm whether there is a better way to do this. I am retrieving the column index before reading the row values because I suspect blindly reading rows with hard-coded index values is error-prone, particularly when the sequence of columns changes.
for column in range(0, columns):
    if result.json()['tables'][0]['columns'][column]['name'] == 'Category':
        index_category = column
for row in range(0, rows):
    print(result.json()['tables'][0]['rows'][row][index_category])
Your .index('Category') call fails because 'columns' is a list of dicts, not of strings, so you have to match on each dict's 'name' key instead:
json_data = result.json()

for index, column in enumerate(json_data['tables'][0]['columns']):
    if column['name'] == 'Category':
        category_index = index
        break

category_list = []
for row in json_data['tables'][0]['rows']:
    category_list.append(row[category_index])
Haven't tested it, by the way.
You could also refactor the first loop, where we find the index of the category, with the filter function.
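For example, the index lookup can be collapsed into a single `next()` expression over an enumerated generator. `json_data` below mirrors the response body from the question:

```python
json_data = {
    "tables": [{
        "name": "PrimaryResult",
        "columns": [{"name": "Category", "type": "string"},
                    {"name": "count_", "type": "long"}],
        "rows": [["Administrative", 20839],
                 ["Recommendation", 122],
                 ["Alert", 64],
                 ["ServiceHealth", 11]],
    }]
}

table = json_data["tables"][0]
# next() returns the first matching index (and raises StopIteration if
# no column is named "Category").
category_index = next(i for i, c in enumerate(table["columns"])
                      if c["name"] == "Category")
category_list = [row[category_index] for row in table["rows"]]
print(category_list)  # → ['Administrative', 'Recommendation', 'Alert', 'ServiceHealth']
```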
