Convert csv to JSON tree structure and change name? - python

Thank you to Hatt for the explanation and the code. It works, but I am unable to replace the string "name" with a meaningful name taken from the column headers.
Can anyone suggest how to achieve that?
Data in the CSV file:
conversion_month channel sub_channel campaign Id cost kpi
2017-08 DISPLAY Retargeting Summer_Campaign 200278217 2.286261 0.1
2017-08 DISPLAY Retargeting Summer_Campaign 200278218 3.627064 2.5
2017-08 DISPLAY Retargeting Summer_Campaign 200278219 2.768436 0.001
2017-08 DISPLAY Retargeting August Campaign 200278220 5.653297 0.35
2017-09 DISPLAY Prospecting Test Campaign 200278221 4.11847 1.5
2017-08 DISPLAY Prospecting August Campaign 200278222 3.393972 0.26
2017-09 DISPLAY Prospecting Test Campaign 200278223 3.975332 4.2
2017-08 DISPLAY Prospecting August Campaign 200278224 4.131035 0.3
Code used:
import csv
from collections import defaultdict

def ctree():
    return defaultdict(ctree)

def build_leaf(name, leaf):
    res = {"name": name}
    # add children node if the leaf actually has any children
    if len(leaf.keys()) > 0:
        res["children"] = [build_leaf(k, v) for k, v in leaf.items()]
    return res

def main():
    tree = ctree()
    with open('file.csv') as csvfile:
        reader = csv.reader(csvfile)
        for rid, row in enumerate(reader):
            if rid == 0:
                continue
            leaf = tree[row[0]]
            for cid in range(1, len(row) - 2):
                leaf = leaf[row[cid]]
            for cid in range(len(row) - 1, len(row)):
                leaf = (leaf[row[cid - 1]], leaf[row[cid]])

    # building a custom tree structure
    res = []
    for name, leaf in tree.items():
        res.append(build_leaf(name, leaf))

    # printing results into the terminal
    import json
    print(json.dumps(res, indent=2))

main()
It gives the tree, but I would like to change the string "name" to a meaningful name such as "month", "channel", ... "id", etc. The names are in the first row of the CSV file.
[
  {
    "name": "2017-08",
    "children": [
      {
        "name": "DISPLAY",
        "children": [
          {
            "name": "Retargeting",
            "children": [
              {
                "name": "Summer_Campaign",
                "children": [
                  {
                    "name": "200278217",
                    "children": [
                      {
                        "name": "2.286261"
                      },
                      {
                        "name": "0.1"
                      }
                    ]
Thank you for any suggestions in advance.

Use next(reader) to first extract the header row from the CSV file. A level counter can then be used to track which column is currently being processed, so the corresponding column name can be looked up in the header:
import csv
from collections import defaultdict

def ctree():
    return defaultdict(ctree)

def build_leaf(name, leaf, level, header):
    res = {header[level]: name}
    # add children node if the leaf actually has any children
    if len(leaf.keys()) > 0:
        res["children"] = [build_leaf(k, v, level + 1, header) for k, v in leaf.items()]
    return res

def main():
    tree = ctree()
    with open('file.csv') as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)
        for row in reader:
            leaf = tree[row[0]]
            for cid in range(1, len(row) - 2):
                leaf = leaf[row[cid]]
            for cid in range(len(row) - 1, len(row)):
                leaf = (leaf[row[cid - 1]], leaf[row[cid]])

    # building a custom tree structure
    res = []
    for name, leaf in tree.items():
        res.append(build_leaf(name, leaf, 0, header))

    # printing results into the terminal
    import json
    print(json.dumps(res, indent=2))

main()
This would give you:
[
  {
    "conversion_month": "2017-08",
    "children": [
      {
        "channel": "DISPLAY",
        "children": [
          {
            "sub_channel": "Retargeting",
            "children": [
              {
                "campaign": "Summer_Campaign",
                "children": [
                  {
                    "Id": "200278217",
                    "children": [
                      {
                        "cost": "2.286261"
                      },
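If you also want to persist the renamed tree (for example to feed a d3 sunburst, as in the later questions below), a minimal sketch, assuming a hypothetical output file tree.json, is to dump res instead of only printing it:
    # at the end of main(); json is already imported just above the print,
    # and "tree.json" is only a placeholder filename
    with open('tree.json', 'w') as fp:
        json.dump(res, fp, indent=2)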

Related

How can I even use the 'else' syntax in Python?

I am reading data from a JSON file to check the existence of some values.
In the JSON structure below, I try to find adomain from the data in bid and check if there is a cat value, which is not always present.
How do I fix this in the code below?
import pandas as pd
import json

path = 'C:/MyWorks/Python/Anal/data_sample.json'
records = [json.loads(line) for line in open(path, encoding='utf-8')]

adomain = [
    rec['win_res']['seatbid'][0]['bid'][0]['adomain']
    for rec in records
    if 'adomain' in rec
]
Here is a data sample:
[
    {
        "win_res": {
            "id": "12345",
            "seatbid": [
                {
                    "bid": [
                        {
                            "id": "12345",
                            "impid": "1",
                            "price": 0.1,
                            "adm": "",
                            "adomain": [
                                "adomain.com"
                            ],
                            "iurl": "url.com",
                            "cid": "11",
                            "crid": "11",
                            "cat": [
                                "IAB12345"
                            ],
                            "w": 1,
                            "h": 1
                        }
                    ],
                    "seat": "1"
                }
            ]
        }
    }
]
As a result, the adomain value always exists, but the cat value is sometimes missing.
So, when cat is present I want to collect adomain and cat together, but when the cat value is missing, how can I handle it?
Your question is not clear but I think this is what you are looking for:
import json

path = 'C:/MyWorks/Python/Anal/data_sample.json'
with open(path, encoding='utf-8') as f:
    records = json.load(f)

adomain = [
    _['win_res']['seatbid'][0]['bid'][0]['adomain']
    for _ in records
    if _['win_res']['seatbid'][0]['bid'][0].get('adomain', None) and
       _['win_res']['seatbid'][0]['bid'][0].get('cat', None)
]
The code above will add the value of ['win_res']['seatbid'][0]['bid'][0]['adomain'] to the list adomain only if there is a ['win_res']['seatbid'][0]['bid'][0]['cat'] corresponding value.
The code will be a lot clearer if we just walk through a bids list. Something like this:
import json

path = 'C:/MyWorks/Python/Anal/data_sample.json'
with open(path, encoding='utf-8') as f:
    records = json.load(f)

bids = [_['win_res']['seatbid'][0]['bid'][0] for _ in records]

adomain = [
    _['adomain']
    for _ in bids
    if _.get('adomain', None) and _.get('cat', None)
]
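If the goal is instead to keep adomain for every bid and attach cat only when it is present, a small variation on the snippet above (my sketch, not part of the original answer) can build one dict per bid with dict.get():
import json

path = 'C:/MyWorks/Python/Anal/data_sample.json'
with open(path, encoding='utf-8') as f:
    records = json.load(f)

bids = [_['win_res']['seatbid'][0]['bid'][0] for _ in records]

# always keep adomain; fall back to an empty list when a bid has no cat
result = [
    {'adomain': bid['adomain'], 'cat': bid.get('cat', [])}
    for bid in bids
]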

Adding size to dynamic tree from last column

I need to create a nested dict structure where the number of children can vary at each level.
Appending “size” element to last json child element for a sunburst diagram
Tree creation is covered in this question, except that I need the size to be picked up from the last column.
Since my labels repeat between levels, and the same label "abc" can appear both as a terminal node and as a parent of the next level, I modified the code here slightly (to avoid duplicates within a children branch). However, I am unable to specify the size, which is stored in the last column and should replace the 1 in each leaf node. I know that I need to pass the value from the rows into the build_leaf recursion, but I can't figure out how.
import csv
from collections import defaultdict
import json

def ctree():
    return defaultdict(ctree)

def build_leaf(name, leaf):
    if len(name) == 0:
        res = {"name": "last node"}
        res['size'] = 1
    else:
        res = {"name": name}
        # add children node if the leaf actually has any children
        if len(leaf.keys()) > 0:
            res["children"] = [build_leaf(k, v) for k, v in leaf.items()]
        else:
            res['size'] = 1
    return res

def main():
    tree = ctree()
    # NOTE: you need to have test.csv file as neighbor to this file
    with open('./inpfile.csv') as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)  # read the header row
        i = 0
        for row in reader:
            # usage of python magic to construct dynamic tree structure and
            # basically grouping csv values under their parents
            leaf = tree[row[0]]
            size = row[-1]
            for value in row[1:-1]:
                leaf = leaf[value]

    # building a custom tree structure
    res = []
    for name, leaf in tree.items():
        res.append(build_leaf(name, leaf))

    # printing results into the terminal
    print(json.dumps(res, indent=2))
    with open('paths.json', 'w') as fp:
        json.dump(res, fp)

main()
The final output for the data mentioned should look something like:
[
  {
    "name": "A1",
    "children": [
      {
        "name": "A2",
        "children": [
          {
            "name": "A1",
            "children": [
              {
                "name": "A2",
                "children": [
                  {
                    "name": "A3",
                    "size": 80
                  }
                ]
              }
            ]
          },
          {
            "name": "A3",
            "children": [
              {
                "name": "A2",
                "children": [
                  {
                    "name": "A3",
                    "size": 169
                  }
                ]
              },
              {
                "name": "exit site",
                "size": 764
              }
            ]
          },
          {
            "name": "A6",
            "children": [
              {
                "name": "A3",
                "children": [
                  {
                    "name": "exit site",
                    "size": 127
                  }
                ]
              }
            ]
          },
          {
            "name": "exit site",
            "size": 576
          }
        ]
      }
    ]
  }
]
In case someone stumbles across the same problem: I got it to work by adding another recursive function to retrieve the size from the nested leaf (thanks to Douglas for the help).
import csv
import json
from collections import defaultdict

def ctree():
    return defaultdict(ctree)

def get_size(leaf1):
    for k, v in leaf1.items():
        if k == "size":
            return v
        else:
            return get_size(v)

def build_leaf(name, leaf):
    if len(name) == 0:
        res = {"name": "exit site"}
        res['size'] = int(get_size(leaf))
    else:
        res = {"name": name}
        # add children node if the leaf actually has any children
        if not leaf["size"]:
            res["children"] = [build_leaf(k, v) for k, v in leaf.items() if not k == "size"]
        else:
            res['size'] = int(get_size(leaf))
    return res

def make_json(inpfile, outjson):
    tree = ctree()
    # NOTE: you need to have test.csv file as neighbor to this file
    with open("./filepath.csv") as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)  # read the header row
        for row in reader:
            # usage of python magic to construct dynamic tree structure and
            # basically grouping csv values under their parents
            leaf = tree[row[0]]
            size = row[-1]
            for value in row[1:-1]:
                leaf = leaf[value]
            if len(row) < 6:
                leaf["exit site"]["size"] = size
            else:
                leaf["size"] = size

    # building a custom tree structure
    res = []
    for name, leaf in tree.items():
        res.append(build_leaf(name, leaf))

    with open(outjson, 'w') as fp:
        json.dump(res, fp)
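A usage sketch for the function above (note that, as pasted, it reads from the hard-coded "./filepath.csv" rather than from its inpfile argument, so the first argument is effectively a placeholder):
# hypothetical file names, for illustration only
make_json('./filepath.csv', 'paths.json')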

convert csv file to multiple nested json format

I have written code to convert a CSV file to nested JSON format. Since I have multiple columns to nest, I assign them separately for each column. The problem is that I'm getting two fields for the same column in the JSON output.
import csv
import json
from collections import OrderedDict

csv_file = 'data.csv'
json_file = csv_file + '.json'

def main(input_file):
    csv_rows = []
    with open(input_file, 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='|')
        for row in reader:
            row['TYPE'] = 'REVIEW',  # adding new key, value
            row['RAWID'] = 1,
            row['CUSTOMER'] = {
                "ID": row['CUSTOMER_ID'],
                "NAME": row['CUSTOMER_NAME']
            }
            row['CATEGORY'] = {
                "ID": row['CATEGORY_ID'],
                "NAME": row['CATEGORY']
            }
            del (row["CUSTOMER_NAME"], row["CATEGORY_ID"],
                 row["CATEGORY"], row["CUSTOMER_ID"])  # deleting since fields occurring twice
            csv_rows.append(row)

    with open(json_file, 'w') as f:
        json.dump(csv_rows, f, sort_keys=True, indent=4, ensure_ascii=False)
        f.write('\n')
The output is as below:
[
    {
        "CATEGORY": {
            "ID": "1",
            "NAME": "Consumers"
        },
        "CATEGORY_ID": "1",
        "CUSTOMER_ID": "41",
        "CUSTOMER": {
            "ID": "41",
            "NAME": "SA Port"
        },
        "CUSTOMER_NAME": "SA Port",
        "RAWID": [
            1
        ]
    }
]
I'm getting two entries for each of the fields I assigned using row['']. Is there any way to get rid of this? I want only one entry per field in each record.
Also, how can I convert the keys to lower case after reading them from csv.DictReader()? In my CSV file all the columns are in upper case, so I use those names for the assignments, but I want all of the keys in lower case.
In order to convert the keys to lower case, it is simpler to generate a new dict per row. Incidentally, that is also enough to get rid of the duplicate fields:
for row in reader:
    orow = collections.OrderedDict()
    orow['type'] = 'REVIEW',  # adding new key, value
    orow['rawid'] = 1,
    orow['customer'] = {
        "id": row['CUSTOMER_ID'],
        "name": row['CUSTOMER_NAME']
    }
    orow['category'] = {
        "id": row['CATEGORY_ID'],
        "name": row['CATEGORY']
    }
    csv_rows.append(orow)
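If there are many more upper-case columns to carry over unchanged, a variation (my addition, not part of the answer above) is to lower-case all of the DictReader keys up front and then overwrite just the nested fields:
for row in reader:
    # start from a lower-cased copy of the raw row
    orow = collections.OrderedDict((k.lower(), v) for k, v in row.items())
    # overwrite with the nested structures
    orow['customer'] = {"id": row['CUSTOMER_ID'], "name": row['CUSTOMER_NAME']}
    orow['category'] = {"id": row['CATEGORY_ID'], "name": row['CATEGORY']}
    # drop the flat fields that were folded into the nested dicts
    for key in ('customer_id', 'customer_name', 'category_id'):
        orow.pop(key, None)
    csv_rows.append(orow)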

Pandas Create Dict within Deeply Nested JSON

I'm having trouble modifying my code to add another dictionary that separates "hostNumber" and "hostMode" in my output. Below is the code that I found here and adapted:
import json
from json import dumps

# df is assumed to be the source DataFrame, with columns such as
# site, system, portId, hostName, hostNumber and hostMode
top = "Top_Level"
top_dict = {}
top_dict["name"] = top
top_dict["sub_name"] = []

for site, site_data in df.groupby("site", sort=False):
    site_dict = {}
    site_dict["site"] = site
    site_dict["sub_site"] = []
    for stor, stor_data in site_data.groupby("system", sort=False):
        stor_dict = {}
        stor_dict["system"] = stor
        stor_dict["sub_system"] = []
        for port, port_data in stor_data.groupby("portId", sort=False):
            port_dict = {}
            port_dict["portId"] = port
            port_dict["sub_portId"] = []
            for host, host_data in port_data.groupby("hostName", sort=False):
                host_data = host_data.drop(["portId", "system",
                                            "site"], axis=1).set_index("hostName")
                for n in host_data.to_dict(orient="records"):
                    port_dict["sub_portId"].append({"hostName": host,
                                                    "sub_hostName": [n]})
            stor_dict["sub_system"].append(port_dict)
        site_dict["sub_site"].append(stor_dict)
    top_dict["sub_name"].append(site_dict)

top_out = dumps(top_dict)
parsed = json.loads(top_out)
resulting in:
print(json.dumps(parsed, indent=4, sort_keys=True))
{
    "name": "Top_Level",
    "sub_name": [
        {
            "site": "A",
            "sub_site": [
                {
                    "system": "system01",
                    "sub_system": [
                        {
                            "portId": "1-A",
                            "sub_portId": [
                                {
                                    "hostName": "ahost005",
                                    "sub_hostName": [
                                        {
                                            "hostNumber": "1",
                                            "hostMode": "WIN"
                                        }
                                    ]
                                }, ...
How can I modify my code to have it output in the following way:
...
"sub_hostName": [
{"hostNumber": "1"},
{"hostMode": "WIN"}
]...
Use the following line instead of "sub_hostName": [n]:
"sub_hostName": [dict([i]) for i in n.items()]

Write json format using pandas Series and DataFrame

I'm working with CSV files. My goal is to write JSON from the CSV file's information. Specifically, I want to get a format similar to miserables.json.
Example:
{"source": "Napoleon", "target": "Myriel", "value": 1},
According to the information I have, the format would be:
[
    {
        "source": "Germany",
        "target": "Mexico",
        "value": 1
    },
    {
        "source": "Germany",
        "target": "USA",
        "value": 2
    },
    {
        "source": "Brazil",
        "target": "Argentina",
        "value": 3
    }
]
However, with the code I used, the output looks as follows:
[
    {
        "source": "Germany",
        "target": "Mexico",
        "value": 1
    },
    {
        "source": null,
        "target": "USA",
        "value": 2
    }
][
    {
        "source": "Brazil",
        "target": "Argentina",
        "value": 3
    }
]
The null source should be Germany. This is one of the main problems, because other entries have the same issue. Apart from this, the information is correct. I just want to remove the multiple lists inside the output and replace null with the correct country.
This is the code I used, with pandas and collections.
csvdata = pandas.read_csv('file.csv', low_memory=False, encoding='latin-1')
countries = csvdata['country'].tolist()
newcountries = list(set(countries))

for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frquency = Counter(bills)
    sourceTemp = []
    value = []
    country = element
    for k, v in frquency.items():
        sourceTemp.append(k)
        value.append(int(v))
    forceData = {'source': Series(country), 'target': Series(sourceTemp), 'value': Series(value)}
    dfForce = DataFrame(forceData)
    jsondata = dfForce.to_json(orient='records', force_ascii=False, default_handler=callable)
    parsed = json.loads(jsondata)
    newData = json.dumps(parsed, indent=4, ensure_ascii=False, sort_keys=True)
    # since to_json doesn't have append mode this will be written in a txt file
    savetxt = open('data.txt', 'a')
    savetxt.write(newData)
    savetxt.close()
Any suggestions to solve this problem are appreciated!
Thanks
Consider removing the Series() around the scalar value, country. Wrapping the scalar in Series() and then upsizing the dictionary of series into a dataframe forces NaN (later converted to null in JSON) into that column to pad it to the length of the other series. You can see this by printing out the dfForce dataframe:
from pandas import Series
from pandas import DataFrame

country = 'Germany'
sourceTemp = ['Mexico', 'USA', 'Argentina']
value = [1, 2, 3]

forceData = {'source': Series(country),
             'target': Series(sourceTemp),
             'value': Series(value)}
dfForce = DataFrame(forceData)

#     source     target  value
# 0  Germany     Mexico      1
# 1      NaN        USA      2
# 2      NaN  Argentina      3
To resolve this, simply keep country as a scalar in the dictionary of series:
forceData = {'source': country,
             'target': Series(sourceTemp),
             'value': Series(value)}
dfForce = DataFrame(forceData)

#     source     target  value
# 0  Germany     Mexico      1
# 1  Germany        USA      2
# 2  Germany  Argentina      3
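With the scalar fix in place, to_json(orient='records') then yields one flat record per row in the desired shape; a quick check, reusing the small example above (output wrapped here for readability):
print(dfForce.to_json(orient='records', force_ascii=False))
# [{"source":"Germany","target":"Mexico","value":1},
#  {"source":"Germany","target":"USA","value":2},
#  {"source":"Germany","target":"Argentina","value":3}]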
By the way, you do not need a dataframe object to output JSON. Simply use a list of dictionaries. Consider the following, which uses an OrderedDict (to maintain the order of keys). This way the growing list is dumped to a file in one pass instead of being appended piece by piece, which would produce invalid JSON, since opposite-facing adjacent square brackets ...][... are not allowed.
from collections import OrderedDict
...

data = []
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frquency = Counter(bills)
    for k, v in frquency.items():
        inner = OrderedDict()
        inner['source'] = element
        inner['target'] = k
        inner['value'] = int(v)
        data.append(inner)

newData = json.dumps(data, indent=4)
with open('data.json', 'w') as savetxt:
    savetxt.write(newData)
