Thank you to Hatt for the explanation and the code. It works, although I am unable to change the string "name" to a meaningful name taken from the column headers.
Can anyone suggest how to achieve that?
Data in the csv file:

conversion_month  channel  sub_channel  campaign         Id         cost      kpi
2017-08           DISPLAY  Retargeting  Summer_Campaign  200278217  2.286261  0.1
2017-08           DISPLAY  Retargeting  Summer_Campaign  200278218  3.627064  2.5
2017-08           DISPLAY  Retargeting  Summer_Campaign  200278219  2.768436  0.001
2017-08           DISPLAY  Retargeting  August Campaign  200278220  5.653297  0.35
2017-09           DISPLAY  Prospecting  Test Campaign    200278221  4.11847   1.5
2017-08           DISPLAY  Prospecting  August Campaign  200278222  3.393972  0.26
2017-09           DISPLAY  Prospecting  Test Campaign    200278223  3.975332  4.2
2017-08           DISPLAY  Prospecting  August Campaign  200278224  4.131035  0.3
Code used:
import csv
from collections import defaultdict

def ctree():
    return defaultdict(ctree)

def build_leaf(name, leaf):
    res = {"name": name}
    # add children node if the leaf actually has any children
    if len(leaf.keys()) > 0:
        res["children"] = [build_leaf(k, v) for k, v in leaf.items()]
    return res

def main():
    tree = ctree()
    with open('file.csv') as csvfile:
        reader = csv.reader(csvfile)
        for rid, row in enumerate(reader):
            if rid == 0:
                continue
            leaf = tree[row[0]]
            for cid in range(1, (len(row) - 2)):
                leaf = leaf[row[cid]]
            for cid in range((len(row) - 1), len(row)):
                leaf = (leaf[row[cid - 1]], leaf[row[cid]])

    # building a custom tree structure
    res = []
    for name, leaf in tree.items():
        res.append(build_leaf(name, leaf))

    # printing results into the terminal
    import json
    print(json.dumps(res, indent=2))

main()
It gives the tree, but I would like to change the string "name" to a meaningful name such as "month", "channel", ..., "id", etc. The names are in the first row of the csv file.
[
  {
    "name": "2017-08",
    "children": [
      {
        "name": "DISPLAY",
        "children": [
          {
            "name": "Retargeting",
            "children": [
              {
                "name": "Summer_Campaign",
                "children": [
                  {
                    "name": "200278217",
                    "children": [
                      {
                        "name": "2.286261"
                      },
                      {
                        "name": "0.1"
                      }
                    ]
Thank you in advance for any suggestions.
Use next(reader) to extract the header row from the CSV file first. A level counter can then be used to indicate which column is currently being handled, so the corresponding name can be looked up in the header row:
import csv
from collections import defaultdict

def ctree():
    return defaultdict(ctree)

def build_leaf(name, leaf, level, header):
    res = {header[level]: name}
    # add children node if the leaf actually has any children
    if len(leaf.keys()) > 0:
        res["children"] = [build_leaf(k, v, level + 1, header) for k, v in leaf.items()]
    return res

def main():
    tree = ctree()
    with open('file.csv') as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)
        for row in reader:
            leaf = tree[row[0]]
            for cid in range(1, (len(row) - 2)):
                leaf = leaf[row[cid]]
            for cid in range((len(row) - 1), len(row)):
                leaf = (leaf[row[cid - 1]], leaf[row[cid]])

    # building a custom tree structure
    res = []
    for name, leaf in tree.items():
        res.append(build_leaf(name, leaf, 0, header))

    # printing results into the terminal
    import json
    print(json.dumps(res, indent=2))

main()
This would give you:
[
  {
    "conversion_month": "2017-08",
    "children": [
      {
        "channel": "DISPLAY",
        "children": [
          {
            "sub_channel": "Retargeting",
            "children": [
              {
                "campaign": "Summer_Campaign",
                "children": [
                  {
                    "Id": "200278217",
                    "children": [
                      {
                        "cost": "2.286261"
                      },
I am reading data from a JSON file to check the existence of some values.
In the JSON structure below, I try to find adomain from the data in bid and check if there is a cat value, which is not always present.
How do I fix the syntax below?
import pandas as pd
import json

path = 'C:/MyWorks/Python/Anal/data_sample.json'
records = [json.loads(line) for line in open(path, encoding='utf-8')]

adomain = [
    rec['win_res']['seatbid'][0]['bid'][0]['adomain']
    for rec in records
    if 'adomain' in rec
]
Here is a data sample:
[
  {
    "win_res": {
      "id": "12345",
      "seatbid": [
        {
          "bid": [
            {
              "id": "12345",
              "impid": "1",
              "price": 0.1,
              "adm": "",
              "adomain": [
                "adomain.com"
              ],
              "iurl": "url.com",
              "cid": "11",
              "crid": "11",
              "cat": [
                "IAB12345"
              ],
              "w": 1,
              "h": 1
            }
          ],
          "seat": "1"
        }
      ]
    }
  }
]
The adomain value is always present, but the cat value is sometimes missing.
I want to collect adomain only for bids that also have a cat value, and skip the bid when cat is missing. How can I do that?
Your question is not clear but I think this is what you are looking for:
import json

path = 'C:/MyWorks/Python/Anal/data_sample.json'
with open(path, encoding='utf-8') as f:
    records = json.load(f)

adomain = [
    _['win_res']['seatbid'][0]['bid'][0]['adomain']
    for _ in records
    if _['win_res']['seatbid'][0]['bid'][0].get('adomain', None) and
       _['win_res']['seatbid'][0]['bid'][0].get('cat', None)
]
The code above will add the value of ['win_res']['seatbid'][0]['bid'][0]['adomain'] to the adomain list only if there is a corresponding ['win_res']['seatbid'][0]['bid'][0]['cat'] value.
The code will be a lot clearer if we just walk through a bids list. Something like this:
import json

path = 'C:/MyWorks/Python/Anal/data_sample.json'
with open(path, encoding='utf-8') as f:
    records = json.load(f)

bids = [_['win_res']['seatbid'][0]['bid'][0] for _ in records]

adomain = [
    _['adomain']
    for _ in bids
    if _.get('adomain', None) and _.get('cat', None)
]
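If some records might not contain the seatbid or bid arrays at all (the sample above always has both, so this is only a defensive assumption), the same walk can be written with .get() lookups:

bids = []
for rec in records:
    # fall back to an empty list when 'win_res' or 'seatbid' is missing
    seatbid = rec.get('win_res', {}).get('seatbid') or []
    if seatbid and seatbid[0].get('bid'):
        bids.append(seatbid[0]['bid'][0])

adomain = [
    b['adomain']
    for b in bids
    if b.get('adomain') and b.get('cat')
]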
I need to create a nested dict structure where the number of children can vary at each level.
Appending “size” element to last json child element for a sunburst diagram
Tree creation is covered in this question, except that I need the size to be picked up from the last column.
My labels repeat between levels, and the same label (say "abc") can be a terminal node at one level as well as a parent to the next level, so I modified the code here slightly to avoid duplicates within a children branch. However, I am unable to specify the size, which is stored in the last column and should replace the hard-coded 1 at each leaf. I know I need to pass the value from the rows into the build_leaf recursion, but I can't figure out how.
import csv
from collections import defaultdict
import json

def ctree():
    return defaultdict(ctree)

def build_leaf(name, leaf):
    if len(name) == 0:
        res = {"name": "last node"}
        res['size'] = 1
    else:
        res = {"name": name}
    # add children node if the leaf actually has any children
    if len(leaf.keys()) > 0:
        res["children"] = [build_leaf(k, v) for k, v in leaf.items()]
    else:
        res['size'] = 1
    return res

def main():
    tree = ctree()
    # NOTE: you need to have test.csv file as neighbor to this file
    with open('./inpfile.csv') as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)  # read the header row
        i = 0
        for row in reader:
            # usage of python magic to construct dynamic tree structure and
            # basically grouping csv values under their parents
            leaf = tree[row[0]]
            size = row[-1]
            for value in row[1:-1]:
                leaf = leaf[value]

    # building a custom tree structure
    res = []
    for name, leaf in tree.items():
        res.append(build_leaf(name, leaf))

    # printing results into the terminal
    print(json.dumps(res, indent=2))
    with open('paths.json', 'w') as fp:
        json.dump(res, fp)

main()
The final output for the data mentioned should look something like:
[
  {
    "name": "A1",
    "children": [
      {
        "name": "A2",
        "children": [
          {
            "name": "A1",
            "children": [
              {
                "name": "A2",
                "children": [
                  {
                    "name": "A3",
                    "size": 80
                  }
                ]
              }
            ]
          },
          {
            "name": "A3",
            "children": [
              {
                "name": "A2",
                "children": [
                  {
                    "name": "A3",
                    "size": 169
                  }
                ]
              },
              {
                "name": "exit site",
                "size": 764
              }
            ]
          },
          {
            "name": "A6",
            "children": [
              {
                "name": "A3",
                "children": [
                  {
                    "name": "exit site",
                    "size": 127
                  }
                ]
              }
            ]
          },
          {
            "name": "exit site",
            "size": 576
          }
        ]
      }
    ]
  }
]
In case someone stumbles across the same problem: I got it to work by adding another recursive function to retrieve the size from the nested leaf (thanks to Douglas for the help).
import csv
import json
from collections import defaultdict

def ctree():
    return defaultdict(ctree)

def get_size(leaf1):
    for k, v in leaf1.items():
        if k == "size":
            return v
        else:
            return get_size(v)

def build_leaf(name, leaf):
    if len(name) == 0:
        res = {"name": "exit site"}
        res['size'] = int(get_size(leaf))
    else:
        res = {"name": name}
    # add children node if the leaf actually has any children
    if not leaf["size"]:
        res["children"] = [build_leaf(k, v) for k, v in leaf.items() if not k == "size"]
    else:
        res['size'] = int(get_size(leaf))
    return res

def make_json(inpfile, outjson):
    tree = ctree()
    # NOTE: you need to have test.csv file as neighbor to this file
    with open("./filepath.csv") as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)  # read the header row
        for row in reader:
            # usage of python magic to construct dynamic tree structure and
            # basically grouping csv values under their parents
            leaf = tree[row[0]]
            size = row[-1]
            for value in row[1:-1]:
                leaf = leaf[value]
            if len(row) < 6:
                leaf["exit site"]["size"] = size
            else:
                leaf["size"] = size

    # building a custom tree structure
    res = []
    for name, leaf in tree.items():
        res.append(build_leaf(name, leaf))

    with open(outjson, 'w') as fp:
        json.dump(res, fp)
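For reference, a minimal invocation of the function above might look like the line below. Note that the csv path is hard-coded inside make_json, so the inpfile argument is effectively ignored as written:

make_json('./filepath.csv', 'paths.json')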
I have written code to convert a csv file to nested JSON format. I have multiple columns to nest, so I assign them separately for each column. The problem is that I am getting two fields for the same column in the JSON output.
import csv
import json
from collections import OrderedDict

csv_file = 'data.csv'
json_file = csv_file + '.json'

def main(input_file):
    csv_rows = []
    with open(input_file, 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='|')
        for row in reader:
            row['TYPE'] = 'REVIEW',  # adding new key, value
            row['RAWID'] = 1,
            row['CUSTOMER'] = {
                "ID": row['CUSTOMER_ID'],
                "NAME": row['CUSTOMER_NAME']
            }
            row['CATEGORY'] = {
                "ID": row['CATEGORY_ID'],
                "NAME": row['CATEGORY']
            }
            del (row["CUSTOMER_NAME"], row["CATEGORY_ID"],
                 row["CATEGORY"], row["CUSTOMER_ID"])  # deleting since fields occurring twice
            csv_rows.append(row)

    with open(json_file, 'w') as f:
        json.dump(csv_rows, f, sort_keys=True, indent=4, ensure_ascii=False)
        f.write('\n')

main(csv_file)
The output is as below:
[
    {
        "CATEGORY": {
            "ID": "1",
            "NAME": "Consumers"
        },
        "CATEGORY_ID": "1",
        "CUSTOMER_ID": "41",
        "CUSTOMER": {
            "ID": "41",
            "NAME": "SA Port"
        },
        "CUSTOMER_NAME": "SA Port",
        "RAWID": [
            1
        ]
    }
]
I am getting two entries for each field I have assigned using row[''].
Is there any other way to get rid of this? I want only one entry for a particular field in each record.
Also, how can I convert the keys to lower case after reading them with csv.DictReader()? In my csv file all the columns are in upper case, which is why I use upper-case names when assigning, but I want all of the output keys to be lower case.
In order to convert the keys to lower case, it would be simpler to generate a new dict per row; that should also be enough to get rid of the duplicate fields:
for row in reader:
    orow = OrderedDict()
    orow['type'] = 'REVIEW'  # adding new key, value
    orow['rawid'] = 1
    orow['customer'] = {
        "id": row['CUSTOMER_ID'],
        "name": row['CUSTOMER_NAME']
    }
    orow['category'] = {
        "id": row['CATEGORY_ID'],
        "name": row['CATEGORY']
    }
    csv_rows.append(orow)
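If there are many columns and all of them should be kept, a dict comprehension can lowercase every key coming out of csv.DictReader in one pass. This is only an alternative sketch, not part of the code above:

for row in reader:
    row = {key.lower(): value for key, value in row.items()}
    # row now carries only lower-case keys; the nested customer/category
    # dicts can then be built from row['customer_id'], row['customer_name'], etc.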
I'm having trouble modifying my code to add another dictionary to separate "hostNumber" and "hostMode" in my output. Below is the code that I found here and adapted:
import json
from json import dumps

top = "Top_Level"
top_dict = {}
top_dict["name"] = top
top_dict["sub_name"] = []

for site, site_data in df.groupby("site", sort=False):
    site_dict = {}
    site_dict["site"] = site
    site_dict["sub_site"] = []
    for stor, stor_data in site_data.groupby("system", sort=False):
        stor_dict = {}
        stor_dict["system"] = stor
        stor_dict["sub_system"] = []
        for port, port_data in stor_data.groupby("portId", sort=False):
            port_dict = {}
            port_dict["portId"] = port
            port_dict["sub_portId"] = []
            for host, host_data in port_data.groupby("hostName", sort=False):
                host_data = host_data.drop(["portId", "system", "site"],
                                           axis=1).set_index("hostName")
                for n in host_data.to_dict(orient="records"):
                    port_dict["sub_portId"].append({"hostName": host,
                                                    "sub_hostName": [n]})
            stor_dict["sub_system"].append(port_dict)
        site_dict["sub_site"].append(stor_dict)
    top_dict["sub_name"].append(site_dict)

top_out = dumps(top_dict)
parsed = json.loads(top_out)
resulting in:
print(json.dumps(parsed, indent=4, sort_keys=True))
{
    "name": "Top_Level",
    "sub_name": [
        {
            "site": "A",
            "sub_site": [
                {
                    "system": "system01",
                    "sub_system": [
                        {
                            "portId": "1-A",
                            "sub_portId": [
                                {
                                    "hostName": "ahost005",
                                    "sub_hostName": [
                                        {
                                            "hostNumber": "1",
                                            "hostMode": "WIN"
                                        }
                                    ]
                                }, ...
How can I modify my code to have it output in the following way:
...
"sub_hostName": [
{"hostNumber": "1"},
{"hostMode": "WIN"}
]...
Use the following line instead of "sub_hostName": [n]:
"sub_hostName": [dict([i]) for i in n.items()]
I'm working with csv files. My goal is to write JSON from the csv information. Specifically, I want to get a format similar to miserables.json
Example:
{"source": "Napoleon", "target": "Myriel", "value": 1},
According to the information I have, the format would be:
[
    {
        "source": "Germany",
        "target": "Mexico",
        "value": 1
    },
    {
        "source": "Germany",
        "target": "USA",
        "value": 2
    },
    {
        "source": "Brazil",
        "target": "Argentina",
        "value": 3
    }
]
However, with the code I used, the output looks as follows:
[
    {
        "source": "Germany",
        "target": "Mexico",
        "value": 1
    },
    {
        "source": null,
        "target": "USA",
        "value": 2
    }
][
    {
        "source": "Brazil",
        "target": "Argentina",
        "value": 3
    }
]
The null source should be Germany. This is one of the main problems, because more sources have the same issue. Apart from that, the information is correct. I just want to remove the extra lists nested in the output and replace null with the correct country.
This is the code I used, with pandas and collections.
import json
import pandas
from pandas import Series, DataFrame
from collections import Counter

csvdata = pandas.read_csv('file.csv', low_memory=False, encoding='latin-1')
countries = csvdata['country'].tolist()
newcountries = list(set(countries))

for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frquency = Counter(bills)
    sourceTemp = []
    value = []
    country = element
    for k, v in frquency.items():
        sourceTemp.append(k)
        value.append(int(v))

    forceData = {'source': Series(country), 'target': Series(sourceTemp), 'value': Series(value)}
    dfForce = DataFrame(forceData)
    jsondata = dfForce.to_json(orient='records', force_ascii=False, default_handler=callable)
    parsed = json.loads(jsondata)
    newData = json.dumps(parsed, indent=4, ensure_ascii=False, sort_keys=True)
    # since to_json doesn't have append mode this will be written in txt file
    savetxt = open('data.txt', 'a')
    savetxt.write(newData)
    savetxt.close()
Any suggestions to solve this problem are appreciated!
Thanks
Consider removing the Series() around the scalar value, country. Wrapping a scalar in Series() and then upsizing the dictionary of series into a dataframe forces NaN (later converted to null in json) into that column to match the lengths of the other series. You can see this by printing out the dfForce dataframe:
from pandas import Series
from pandas import DataFrame

country = 'Germany'
sourceTemp = ['Mexico', 'USA', 'Argentina']
value = [1, 2, 3]

forceData = {'source': Series(country),
             'target': Series(sourceTemp),
             'value': Series(value)}

dfForce = DataFrame(forceData)
#     source     target  value
# 0  Germany     Mexico      1
# 1      NaN        USA      2
# 2      NaN  Argentina      3
To resolve this, simply keep country as a scalar in the dictionary of series:
forceData = {'source': country,
             'target': Series(sourceTemp),
             'value': Series(value)}

dfForce = DataFrame(forceData)
#     source     target  value
# 0  Germany     Mexico      1
# 1  Germany        USA      2
# 2  Germany  Argentina      3
By the way, you do not need a dataframe object to output JSON. Simply use a list of dictionaries. Consider the following, which uses an OrderedDict (to maintain the order of keys). This way the growing list is dumped to a file in a single write instead of appending, which would produce invalid JSON because opposite-facing adjacent square brackets ...][... are not allowed.
from collections import OrderedDict
...

data = []
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frquency = Counter(bills)
    for k, v in frquency.items():
        inner = OrderedDict()
        inner['source'] = element
        inner['target'] = k
        inner['value'] = int(v)
        data.append(inner)

newData = json.dumps(data, indent=4)
with open('data.json', 'w') as savetxt:
    savetxt.write(newData)
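As a small optional variant, json.dump can serialize the list straight to the open file handle, which skips building the intermediate newData string:

with open('data.json', 'w') as savetxt:
    json.dump(data, savetxt, indent=4)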