Converting CSV to Hierarchical JSON output

Converting CSV to Hierarchical JSON output - python

I am trying to convert a CSV file into a hierarchical JSON file. The CSV input is shown below; it contains two columns, gene and disease.
gene,disease
A1BG,Adenocarcinoma
A1BG,apnea
A1BG,Athritis
A2M,Asthma
A2M,Astrocytoma
A2M,Diabetes
NAT1,polyps
NAT1,lymphoma
NAT1,neoplasms
The expected output should be in the following format:
{
    "name": "A1BG",
    "children": [
        {"name": "Adenocarcinoma"},
        {"name": "apnea"},
        {"name": "Athritis"}
    ]
},
{
    "name": "A2M",
    "children": [
        {"name": "Asthma"},
        {"name": "Astrocytoma"},
        {"name": "Diabetes"}
    ]
},
{
    "name": "NAT1",
    "children": [
        {"name": "polyps"},
        {"name": "lymphoma"},
        {"name": "neoplasms"}
    ]
}
The Python code I have written is below. Let me know what I need to change to get the desired output.
import json
finalList = []
finalDict = {}
grouped = df.groupby(['gene'])
for key, value in grouped:
    dictionary = {}
    dictList = []
    anotherDict = {}
    j = grouped.get_group(key).reset_index(drop=True)
    dictionary['name'] = j.at[0, 'gene']
    for i in j.index:
        anotherDict['disease'] = j.at[i, 'disease']
        dictList.append(anotherDict)
    dictionary['children'] = dictList
    finalList.append(dictionary)
with open('outputresult3.json', "w") as out:
    json.dump(finalList,out)

import json
json_data = []
# group the data by each unique gene; the column name is passed as a plain
# string because recent pandas versions yield a one-element tuple as the
# group key when a list is passed
for gene, data in df.groupby("gene"):
    # obtain a list of diseases for the current gene
    diseases = data["disease"].tolist()
    # create a new list of dictionaries to satisfy json requirements
    children = [{"name": disease} for disease in diseases]
    entry = {"name": gene, "children": children}
    json_data.append(entry)
with open('outputresult3.json', "w") as out:
    json.dump(json_data, out)
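For context on why the code in the question does not give the expected result: it creates anotherDict once per gene and appends that same object on every pass through the inner loop, so every entry in the list ends up pointing at one dictionary holding the last disease. It also uses the key 'disease' where the expected output uses 'name'. A minimal sketch of the effect, with illustrative variable names that are not from the question:

```python
diseases = ["Adenocarcinoma", "apnea", "Athritis"]

# Buggy pattern: one dict is created, then mutated and appended repeatedly.
shared = {}
buggy = []
for disease in diseases:
    shared["name"] = disease
    buggy.append(shared)  # appends a reference to the same dict each time

# Fixed pattern: build a fresh dict on every iteration.
fixed = [{"name": disease} for disease in diseases]

print(buggy)  # three references to one dict, all showing the last value
print(fixed)
```

Creating the inner dictionary inside the loop (or using a comprehension, as the answer above does) avoids the shared reference.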

Use DataFrame.groupby with a custom lambda function that converts each group's values to dictionaries with DataFrame.to_dict:
L = (df.rename(columns={'disease':'name'})
       .groupby('gene')
       .apply(lambda x: x[['name']].to_dict('records'))
       .reset_index(name='children')
       .rename(columns={'gene':'name'})
       .to_dict('records')
     )
print (L)
[{'name': 'A1BG', 'children': [{'name': 'Adenocarcinoma'},
                               {'name': 'apnea'},
                               {'name': 'Athritis'}]},
 {'name': 'A2M', 'children': [{'name': 'Asthma'},
                              {'name': 'Astrocytoma'},
                              {'name': 'Diabetes'}]},
 {'name': 'NAT1', 'children': [{'name': 'polyps'},
                               {'name': 'lymphoma'},
                               {'name': 'neoplasms'}]}]
with open('outputresult3.json', "w") as out:
    json.dump(L,out)
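If pandas is not available, the same grouping can be sketched with the standard library alone. The helper name rows_to_hierarchy and the in-memory CSV sample are assumptions made for illustration; they do not appear in the answers above.

```python
import csv
import io
import json

# A shortened copy of the sample data; in practice this would be an open file.
CSV_TEXT = """gene,disease
A1BG,Adenocarcinoma
A1BG,apnea
A2M,Asthma
NAT1,polyps
"""


def rows_to_hierarchy(csv_file):
    """Group the 'disease' values under each 'gene' (hypothetical helper)."""
    grouped = {}  # dicts keep insertion order, so genes stay in file order
    for row in csv.DictReader(csv_file):
        grouped.setdefault(row["gene"], []).append({"name": row["disease"]})
    return [{"name": gene, "children": children}
            for gene, children in grouped.items()]


hierarchy = rows_to_hierarchy(io.StringIO(CSV_TEXT))
print(json.dumps(hierarchy, indent=2))
```

Passing an open file object instead of the io.StringIO sample should work the same way, since csv.DictReader accepts any iterable of lines.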

Related

python dictionary to json

The output of the file comes as a list of dictionaries with 5 columns. Because of the 5th column, the first 4 are duplicated. My goal is to output it as JSON, without duplicates, in the following format.
Sample input:
test_dict = [
    {'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"123"},
    {'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"345"}
]
Previously there were no duplicates so it was easy to transform to json as below:
from collections import defaultdict
result = defaultdict(set)
for i in test_dict:
    id = i.get('ID')
    if id:
        # key the set by the ID value (the row dict itself is not hashable)
        result[id].add(i.get('ID_A'))
        result[id].add(i.get('ID_B'))
        result[id].add(i.get('ID_C'))
output = []
for id, details in result.items():
    output.append(
        {
            "ID": id,
            "otherDetails": {
                "IDs": [
                    {"id": ref} for ref in details
                ]
            },
        }
    )
How could I add INVOICE to this without duplicating the rows? The output would look like this:
[{'ID': '"A"',
'OtherDetails': {'IDs': [{'id': 'A1'},
{'id': 'A2'},
{'id': 'A3'}],
{'INVOICE': [{'id':'123'},
{'id':'345'}]}}}]
Thanks! (python 3.9)

Basically, you can just do the same as for the IDs, using a second defaultdict (or similar) for the invoice IDs. Afterwards, use a nested list/dict comprehension to build the final result.
from collections import defaultdict
id_to_ids = defaultdict(set)
id_to_inv = defaultdict(set)
for d in test_dict:
    id_to_ids[d["ID"]] |= {d[k] for k in ["ID_A", "ID_B", "ID_C"]}
    id_to_inv[d["ID"]] |= {d["INVOICE"]}
result = [{
    'ID': k,
    'OtherDetails': {
        'IDs': [{'id': v} for v in id_to_ids[k]],
        'INVOICE': [{'id': v} for v in id_to_inv[k]]
    }} for k in id_to_ids]
Note, though, that with this format you lose the information about which of the "other" IDs was which, and with which invoice ID they were associated.
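Because sets are unordered, the order of the IDs and invoices in the result above is not guaranteed. A minimal sketch of one alternative, assuming first-seen order is what you want: a dict with throwaway values can stand in for an insertion-ordered set, since dicts preserve insertion order from Python 3.7 on. This is a variation on the answer, not part of it.

```python
from collections import defaultdict

test_dict = [
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
]

# The keys of each inner dict act as an ordered set; the values are unused.
id_to_ids = defaultdict(dict)
id_to_inv = defaultdict(dict)
for d in test_dict:
    for k in ("ID_A", "ID_B", "ID_C"):
        id_to_ids[d["ID"]][d[k]] = None
    id_to_inv[d["ID"]][d["INVOICE"]] = None

result = [
    {
        "ID": k,
        "OtherDetails": {
            "IDs": [{"id": v} for v in id_to_ids[k]],
            "INVOICE": [{"id": v} for v in id_to_inv[k]],
        },
    }
    for k in id_to_ids
]
print(result)
```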

You were pretty close. I would make the intermediate dictionary a little more straightforward and have it just be a dictionary keyed by ID, holding two lists.
When walking the original data, you only need to append INVOICE if there is already an entry for the ID. Then, when you create the "json" format (a list with one dictionary per ID), all you have to do is use the lists you have already generated. Here is the structure I propose.
from collections import defaultdict
test_dict = [
    {'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"123"},
    {'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"345"}
]
result = {}
for i in test_dict:
    id = i.get('ID')
    if not id:
        continue
    if id in result:
        # just add INVOICE
        result[id]['INVOICE'].append(i.get('INVOICE'))
    else:
        # ID not in result dictionary, so populate it
        result[id] = {'IDs': [ i.get('ID_A'), i.get('ID_B'), i.get('ID_C')],
                      'INVOICE' : [i.get('INVOICE')]
                      }
output = []
for id, details in result.items():
    output.append(
        {
            "ID": id,
            "otherDetails": {
                "IDs": details['IDs'],
                'INVOICE': details['INVOICE']
            }
        }
    )
Duplicate IDs are handled by the if id in result check, which only appends the invoice to the existing list of invoices. I will also add that, since we are using a lot of dict.get() calls rather than plain dict[] lookups, we are potentially adding a bunch of None values to these lists.
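To make the None concern concrete, here is a small sketch with a made-up row that lacks ID_B. The row and the variable names are illustrative assumptions, not data from the question.

```python
# Hypothetical row in which the ID_B column is missing.
row = {"ID": "A", "ID_A": "A1", "ID_C": "A3", "INVOICE": "123"}
keys = ("ID_A", "ID_B", "ID_C")

# dict.get() returns None for a missing key, so None lands in the list.
with_none = [row.get(k) for k in keys]

# Checking membership first keeps only the keys that are actually present.
without_none = [row[k] for k in keys if k in row]

print(with_none)
print(without_none)
```

Whether to skip missing keys or fail loudly with row[k] depends on how much you trust the input.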

I like the answer from @tobias_k, but it does not handle duplicate values in any of the ID_* or invoice columns. His answer is the simplest if order and repetition are not important.
Check this out if they are important.
import pandas as pd
def create_item(df: pd.DataFrame):
    output = list()
    groups = df.groupby(["ID", "ID_A", "ID_B", "ID_C"])[["INVOICE"]]
    for group, gdf in groups:
        row = dict()
        row["ID"] = group[0]
        row["OtherDetails"] = dict()
        row["OtherDetails"]["IDS"] = [{"id": x} for x in group[1:]]
        row["OtherDetails"]["INVOICE"] = [{"id": x} for x in gdf["INVOICE"]]
        output.append(row)
    return output
test_dict = [
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
]
test_df = pd.DataFrame(test_dict)
create_item(test_df)
Which will return
[{'ID': 'A',
'OtherDetails': {'IDS': [{'id': 'A1'}, {'id': 'A2'}, {'id': 'A3'}],
'INVOICE': [{'id': '123'}, {'id': '345'}]}},
{'ID': 'B',
'OtherDetails': {'IDS': [{'id': 'A1'}, {'id': 'A2'}, {'id': 'A3'}],
'INVOICE': [{'id': '123'}, {'id': '345'}, {'id': '123'}]}}]

Create a list of nested dictionaries from a single csv file in python

I have a csv file with the following structure:
team,tournament,player
Team 1,spring tournament,Rebbecca Cardone
Team 1,spring tournament,Salina Youngblood
Team 1,spring tournament,Catarina Corbell
Team 1,summer tournament,Cara Mejias
Team 1,summer tournament,Catarina Corbell
...
Team 10, spring tournament,Jessi Ravelo
I want to create a nested dictionary (team, tournament) with a list of player dictionaries. The desired outcome would be something like:
{'data':
{Team 1:
{'spring tournament':
{'players': [
{name: Rebecca Cardone},
{name: Salina Youngblood},
{name: Catarina Corbell}]
},
{'summer tournament':
{'players': [
{name: Cara Mejias},
{name: Catarina Corbell}]
}
}
},
...
{Team 10:
{'spring tournament':
{'players': [
{name: Jessi Ravelo}]
}
}
}
}
I've been struggling to format it like this. I have been able to successfully nest the first level (team # --> tournament) but I cannot get the second level to nest. Currently, my code looks like this:
d = {}
header = True
with open("input.csv") as f:
    for line in f.readlines():
        if header:
            header = False
            continue
        team, tournament, player = line.strip().split(",")
        d_team = d.get(team,{})
        d_tournament = d_team.get(tournament, {})
        d_player = d_tournament.get('player',['name'])
        d_player.append(player)
        d_tournament['player'] = d_tournament
        d_team[tournament] = d_tournament
        d[team] = d_team
print(d)
What would be the next step in fixing my code so I can create the nested dictionary?

Some problems with your implementation:
You do d_player = d_tournament.get('player',['name']). But you actually want to get the key named players, and this should be a list of dictionaries. Each of these dictionaries must have the form {"name": "Player's Name"}. So you want
l_player = d_tournament.get('players',[]) (default to an empty list), and then do l_player.append({"name": player}) (I renamed it to l_player because it's a list, not a dict).
You do d_tournament['player'] = d_tournament. I suspect you meant d_tournament['player'] = d_player
Strip the whitespace off the elements in the rows. Do team, tournament, player = (word.strip() for word in line.split(","))
Your code works fine after you make these changes
I strongly suggest you use the csv.reader class to read your CSV file instead of manually splitting the line by commas.
Also, since python's containers (lists and dictionaries) hold references to their contents, you can just add the container once and then modify it using mydict["key"] = value or mylist.append(), and these changes will be reflected in parent containers too. Because of this behavior, you don't need to repeatedly assign these things in the loop like you do with d_team[tournament] = d_tournament
import csv
allteams = dict()
hasHeader = True
with open("input.csv") as f:
    csvreader = csv.reader(f)
    if hasHeader: next(csvreader) # Consume one line if a header exists
    # Iterate over the rows, and unpack each row into three variables
    for team_name, tournament_name, player_name in csvreader:
        # If the team hasn't been processed yet, create a new dict for it
        if team_name not in allteams:
            allteams[team_name] = dict()
        # Get the dict object that holds this team's information
        team = allteams[team_name]
        # If the tournament hasn't been processed already for this team, create a new dict for it in the team's dict
        if tournament_name not in team:
            team[tournament_name] = {"players": []}
        # Get the tournament dict object
        tournament = team[tournament_name]
        # Add this player's information to the tournament dict's "players" list
        tournament["players"].append({"name": player_name})
# Add all teams' data to the "data" key in our result dict
result = {"data": allteams}
print(result)
Which gives us what we want (prettified output):
{
  'data': {
    'Team 1': {
      'spring tournament': {
        'players': [
          { 'name': 'Rebbecca Cardone' },
          { 'name': 'Salina Youngblood' },
          { 'name': 'Catarina Corbell' }
        ]
      },
      'summer tournament': {
        'players': [
          { 'name': 'Cara Mejias' },
          { 'name': 'Catarina Corbell' }
        ]
      }
    },
    'Team 10': {
      ' spring tournament': {
        'players': [
          { 'name': 'Jessi Ravelo' }
        ]
      }
    }
  }
}
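Note that the ' spring tournament' key for Team 10 keeps the leading space from the CSV row. One option, sketched here with an in-memory sample rather than input.csv, is the skipinitialspace parameter of csv.reader, which ignores whitespace immediately following a delimiter:

```python
import csv
import io

# In-memory stand-in for the last row of the question's file.
sample = io.StringIO(
    "team,tournament,player\n"
    "Team 10, spring tournament,Jessi Ravelo\n"
)

reader = csv.reader(sample, skipinitialspace=True)
next(reader)  # skip the header row
rows = list(reader)
print(rows)
```

This only removes spaces right after the comma; trailing spaces would still need a strip() on each field.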

The example dictionary you describe is not possible (if you want multiple dictionaries under the key "Team 1", put them in a list), but this snippet:
if __name__ == '__main__':
    your_dict = {}
    with open("yourfile.csv") as file:
        all_lines = file.readlines()
        data_lines = all_lines[1:] # Skipping "team,tournament,player" line
        for line in data_lines:
            line = line.strip() # Remove \n
            team, tournament_type, player_name = line.split(",")
            team_dict = your_dict.get(team, {}) # e.g. "Team 1"
            tournaments_of_team_dict = team_dict.get(tournament_type, {'players': []}) # e.g. "spring_tournament"
            tournaments_of_team_dict["players"].append({'name': player_name})
            team_dict[tournament_type] = tournaments_of_team_dict
            your_dict[team] = team_dict
    your_dict = {'data': your_dict}
For this example yourfile.csv:
team,tournament,player
Team 1,spring tournament,Rebbecca Cardone
Team 1,spring tournament,Salina Youngblood
Team 2,spring tournament,Catarina Corbell
Team 1,summer tournament,Cara Mejias
Team 2,summer tournament,Catarina Corbell
Gives the following:
{
    "data": {
        "Team 1": {
            "spring tournament": {
                "players": [
                    {
                        "name": "Rebbecca Cardone"
                    },
                    {
                        "name": "Salina Youngblood"
                    }
                ]
            },
            "summer tournament": {
                "players": [
                    {
                        "name": "Cara Mejias"
                    }
                ]
            }
        },
        "Team 2": {
            "spring tournament": {
                "players": [
                    {
                        "name": "Catarina Corbell"
                    }
                ]
            },
            "summer tournament": {
                "players": [
                    {
                        "name": "Catarina Corbell"
                    }
                ]
            }
        }
    }
}

Maybe I'm overlooking something, but couldn't you use:
df.groupby(['team','tournament'])['player'].apply(list).reset_index().to_json(orient='records')

You might approach it this way:
from collections import defaultdict
import csv
from pprint import pprint
d = defaultdict(dict)
with open('f00.txt', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        d[ row['team'] ].setdefault(row['tournament'], []
                                    ).append(row['player'])
pprint(dict(d))
Prints:
{'Team 1': {'spring tournament': ['Rebbecca Cardone',
                                  'Salina Youngblood',
                                  'Catarina Corbell'],
            'summer tournament': ['Cara Mejias', 'Catarina Corbell']},
 'Team 10': {' spring tournament': ['Jessi Ravelo']}}

Create partial dict from recursively nested field list

After parsing a URL parameter for partial responses, e.g. ?fields=name,id,another(name,id),date, I'm getting back an arbitrarily nested list of strings, representing the individual keys of a nested JSON object:
['name', 'id', ['another', ['name', 'id']], 'date']
The goal is to map that parsed 'graph' of keys onto an original, larger dict and just retrieve a partial copy of it, e.g.:
input_dict = {
    "name": "foobar",
    "id": "1",
    "another": {
        "name": "spam",
        "id": "42",
        "but_wait": "there is more!"
    },
    "even_more": {
        "nesting": {
            "why": "not?"
        }
    },
    "date": 1584567297
}
should simplify to:
output_dict = {
    "name": "foobar",
    "id": "1",
    "another": {
        "name": "spam",
        "id": "42"
    },
    "date": 1584567297,
}
So far, I've glanced over nested defaultdicts, addict and glom, but the mappings they take as inputs are not compatible with my list (I might have missed something, of course), and I end up with garbage.
How can I do this programmatically, and accounting for any nesting that might occur?

you can use:
def rec(d, f):
    result = {}
    for i in f:
        if isinstance(i, list):
            result[i[0]] = rec(d[i[0]], i[1])
        else:
            result[i] = d[i]
    return result
f = ['name', 'id', ['another', ['name', 'id']], 'date']
rec(input_dict, f)
output:
{'name': 'foobar',
'id': '1',
'another': {'name': 'spam', 'id': '42'},
'date': 1584567297}
Here the assumption is that, in a nested list, the first element is a valid key from the upper level and the second element contains valid keys of the nested dict that is the value for the first element.
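If that assumption may not hold, for example when a client requests a field that does not exist, rec raises a KeyError. A hedged variant that silently skips unknown keys could look like the sketch below. The name partial_copy and the shortened sample data are illustrative, and skipping rather than raising is a design choice you may not want.

```python
def partial_copy(source, fields):
    """Copy only the listed keys; nested entries are [key, [subfields]] pairs."""
    result = {}
    for field in fields:
        if isinstance(field, list):
            key, subfields = field
            # Recurse only when the key exists and holds a dict.
            if isinstance(source.get(key), dict):
                result[key] = partial_copy(source[key], subfields)
        elif field in source:
            result[field] = source[field]
    return result


input_dict = {
    "name": "foobar",
    "another": {"name": "spam", "id": "42", "but_wait": "there is more!"},
    "date": 1584567297,
}
# "id" and "missing" are not present in the data and are skipped.
fields = ["name", "id", ["another", ["name", "missing"]], "date"]
partial = partial_copy(input_dict, fields)
print(partial)
```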

How to add an element to a specific position in json object using python

I want to add a new element to a JSON object at a specific index/position.
If I use data["country"] = "value", it is added to the end of the JSON object.
import json
data = json.loads('''{"user_name": "xcv","password": "dsjvwebv","age":27,"address":{"country_name": "value",
"state_name":"tamil nadu", "district" :"Tirunelveli" },"work_history": [ {"name": "CSC",
"location": "chennai"}, {"name": "Saturam", "location": "bangalore"}, {"name": "crayon",
"location": "chennai"} ],"marital_status" :"married","disability":"No"}''')
del data["password"]
country = data["address"]["country_name"]
data.pop("disability")
data.pop("address")
data["country"] = country
i=0
for j in data["work_history"]:
    if j["location"] != "chennai":
        data["work_history"].pop(i)
    i = i+1
print(data)
I want the country value to be at the position shown in the output below.
{'user_name': 'xcv', 'age': 27, 'country': 'value', 'work_history': [{'name': 'CSC', 'location': 'chennai'}, {'name': 'crayon', 'location': 'chennai'}], 'marital_status': 'married'}
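One possible approach, offered as a sketch rather than a definitive answer: dicts preserve insertion order (guaranteed from Python 3.7), so a key can be placed at a given position by rebuilding the dict from its items. The helper insert_at is hypothetical, and the JSON string is a shortened copy of the one in the question. The sketch also filters work_history with a list comprehension, because popping from a list while iterating over it can skip elements.

```python
import json

data = json.loads(
    '{"user_name": "xcv", "password": "dsjvwebv", "age": 27,'
    ' "address": {"country_name": "value"},'
    ' "work_history": [{"name": "CSC", "location": "chennai"},'
    ' {"name": "Saturam", "location": "bangalore"},'
    ' {"name": "crayon", "location": "chennai"}],'
    ' "marital_status": "married", "disability": "No"}'
)


def insert_at(mapping, index, key, value):
    """Return a new dict with key/value at the given position (hypothetical helper)."""
    items = [(k, v) for k, v in mapping.items() if k != key]
    items.insert(index, (key, value))
    return dict(items)


country = data.pop("address")["country_name"]
for unwanted in ("password", "disability"):
    data.pop(unwanted)

# Build a filtered list instead of popping while iterating.
data["work_history"] = [j for j in data["work_history"] if j["location"] == "chennai"]

data = insert_at(data, 2, "country", country)
print(data)
```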

filter json file with python

How to filter a json file to show only the information I need?
To start off I want to say I'm fairly new to python and working with JSON so sorry if this question was asked before and I overlooked it.
I have a JSON file that looks like this:
[
    {
        "Store": 417,
        "Item": 10,
        "Name": "Burger",
        "Modifiable": true,
        "Price": 8.90,
        "LastModified": "09/02/2019 21:30:00"
    },
    {
        "Store": 417,
        "Item": 15,
        "Name": "Fries",
        "Modifiable": false,
        "Price": 2.60,
        "LastModified": "10/02/2019 23:00:00"
    }
]
I need to filter this file to only show Item and Price, like
[
    {
        "Item": 10,
        "Price": 8.90
    },
    {
        "Item": 15,
        "Price": 2.60
    }
]
I have a code that looks like this:
# Transform json input to python objects
with open("StorePriceList.json") as input_file:
    input_dict = json.load(input_file)
# Filter python objects with list comprehensions
output_dict = [x for x in input_dict if ] #missing logical test here.
# Transform python object back into json
output_json = json.dumps(output_dict)
# Show json
print(output_json)
What logical test should I be doing here to achieve that?

Let's say we can use dict comprehension, then it will be
output_dict = [{k:v for k,v in x.items() if k in ["Item", "Price"]} for x in input_dict]

You can also do it like this :)
>>> [{key: d[key] for key in ['Item', 'Price']} for d in input_dict] # you should rename it to `input_list` rather than `input_dict` :)
[{'Item': 10, 'Price': 8.9}, {'Item': 15, 'Price': 2.6}]
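The two comprehensions above differ in how they treat a record that lacks one of the keys: the first silently skips it, while the second raises a KeyError. Below is an end-to-end sketch of the skipping behaviour, using an inline JSON string in place of StorePriceList.json. The names raw, wanted, input_list and output_list are illustrative.

```python
import json

# Inline stand-in for the file contents, trimmed to a few fields.
raw = (
    '[{"Store": 417, "Item": 10, "Name": "Burger", "Price": 8.90},'
    ' {"Store": 417, "Item": 15, "Name": "Fries", "Price": 2.60}]'
)
wanted = ("Item", "Price")

input_list = json.loads(raw)
# Keep only the wanted keys; a key missing from a record is skipped.
output_list = [
    {key: record[key] for key in wanted if key in record}
    for record in input_list
]
output_json = json.dumps(output_list, indent=2)
print(output_json)
```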

import json
with open('data.json', 'r') as f:
    qe = json.load(f)
# '<your data>' is a placeholder for the key that holds the list of items
for item in qe['<your data>']:
    query = f'{item["Item"]} {item["Price"]}'
    print(query)
