Instantiating a nested dictionary - python

I'm trying to instantiate a nested dictionary: the top-level dictionary has other dictionaries as values, and each of those contains further dictionaries. I know which keys (and how many) the nested and nested-nested dictionaries will have, but I don't know how many keys the top-level dictionary will have or what they will be (it will be an OrderedDict and its keys will be integers).
The top-level dictionary has integers as keys and dictionaries as values; each of these dictionaries has three keys: 'forth', 'back' and 'price'.
'forth' and 'back' have dictionaries as their values. Each of these dicts (values) contains the keys
'arr_date', 'arr_place', 'dep_date', 'dep_place'.
So, for example, the 'forth' dict is:
dict.fromkeys(['arr_date','arr_place','dep_date','dep_place'], None)
The point is that I want to instantiate the dictionary with these keys, but the top-level dictionary can have a variable set of integer keys: it might contain the keys [1,2,3,4], but it could also contain [1,2,3,4,5,6,7,8].
This is an example of instantiating the nested and nested-nested dicts, i.e. the value of the top-level dictionary for each of its keys (I'm not sure the conditional will work):
dict.fromkeys(['forth','back','price'], dict.fromkeys(['arr_date','arr_place','dep_date','dep_place'],None) if key in ['forth','back'] else None)
The whole point is that I want to give the code as many default keys and values as possible up front.
Any advice?
EDIT: The conditional in the snippet above does not work (key is not defined there), so a way to express that would also help.
EDIT II: So the dict should look like:
{1:{'forth':{'arr_date':'15-8-4','arr_place':'Atlanta','dep_date':'15-8-4','dep_place':'New York'},'back':{'arr_date...},'price':158},2:{....}}

Maybe something like this:
def inner_dict(vals=[]):
    my_vals = vals + [None] * (4 - len(vals))
    my_keys = ['arr_date', 'arr_place', 'dep_date', 'dep_place']
    return dict(zip(my_keys, my_vals))

def middle_dict(fvals=[], bvals=[], price=None):
    d = {'forth': inner_dict(fvals), 'back': inner_dict(bvals), 'price': price}
    return d
Typical use:
>>> middle_dict(['5-18-4', 'Atlanta','5-18-4','New York'],
['5-19-4', 'New York','5-19-4','Atlanta'], 134.05)
{'forth': {'arr_date': '5-18-4', 'dep_place': 'New York', 'dep_date': '5-18-4', 'arr_place': 'Atlanta'}, 'price': 134.05, 'back': {'arr_date': '5-19-4', 'dep_place': 'Atlanta', 'dep_date': '5-19-4', 'arr_place': 'New York'}}
>>>
>>> d = {i:middle_dict() for i in range(1,4)}
>>> d
{1: {'forth': {'arr_date': None, 'dep_place': None, 'dep_date': None, 'arr_place': None}, 'price': None, 'back': {'arr_date': None, 'dep_place': None, 'dep_date': None, 'arr_place': None}}, 2: {'forth': {'arr_date': None, 'dep_place': None, 'dep_date': None, 'arr_place': None}, 'price': None, 'back': {'arr_date': None, 'dep_place': None, 'dep_date': None, 'arr_place': None}}, 3: {'forth': {'arr_date': None, 'dep_place': None, 'dep_date': None, 'arr_place': None}, 'price': None, 'back': {'arr_date': None, 'dep_place': None, 'dep_date': None, 'arr_place': None}}}
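If the top-level container needs to be an OrderedDict keyed by whatever integers turn up, a small sketch building on middle_dict (the key list here is just an example):

from collections import OrderedDict

keys = [1, 2, 3, 4]  # example: whichever integer keys you end up with
trips = OrderedDict((k, middle_dict()) for k in keys)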

This should produce the empty ordered dict you're looking for, assuming you want to instantiate your nested dicts with None values:
from collections import OrderedDict
d = OrderedDict()
for x in range(1, 6):
    d[x] = {key: dict.fromkeys(['arr_date', 'arr_place', 'dep_date', 'dep_place'], None)
            if key in ['forth', 'back'] else None
            for key in ['forth', 'back', 'price']}
Which gives the following dict:
In[42]: dict(d)
Out[42]: {1: {'price': None, 'forth': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}, 'back': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}}, 2: {'price': None, 'forth': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}, 'back': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}}, 3: {'price': None, 'forth': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}, 'back': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}}, 4: {'price': None, 'forth': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}, 'back': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}}, 5: {'price': None, 'forth': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}, 'back': {'arr_date': None, 'dep_date': None, 'arr_place': None, 'dep_place': None}}}

Related

How to load list columns into a dataframe?

I am trying to load "columns" from a Python list object into a dataframe.
This is my list object:
type(api_response.results)  # -> <class 'list'>
These are the values from the list object (I think this is a JSON structure):
{'results': [{'data': [{'interval': '2022-11-11T10:00:00.000Z/2022-11-11T10:30:00.000Z',
'metrics': [{'metric': 'nError',
'qualifier': None,
'stats': {'count': 4,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}},
{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 113,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None}],
'group': {'mediaType': 'voice'}}]}
I just need this result:
Dataframe:
interval metric count
0 2022-11-11T10:00:00.000Z/2022-11-11T10:30:00.000Z nError 4
1 2022-11-11T10:00:00.000Z/2022-11-11T10:30:00.000Z nOffered 113
How do I get this result? How can I access the intervals or metrics from the list object?
Thanks for any help.
You can use:
def get_metric(x):
    check = 0
    vals = []
    for i in range(0, len(x)):
        if len(x) == 1:
            check = 1
        for j in range(0, len(x) + check):
            vals.append(x[i]['data'][0]['metrics'][j]['metric'])
    return vals

def get_count(x):
    vals = []
    for i in range(0, len(x)):
        for j in range(0, len(x[0])):
            vals.append(x[i]['data'][0]['metrics'][j]['stats']['count'])
    return vals

df['interval'] = df['results'].apply(lambda x: [x[0]['data'][i]['interval'] for i in range(0, len(x[0]['data']))])
df['metric'] = df['results'].apply(lambda x: get_metric(x))
df['count'] = df['results'].apply(lambda x: get_count(x))
df = df.drop(['results'], axis=1)
df = df.explode(['metric', 'count']).explode('interval')
print(df)
'''
interval metric count
0 2022-11-11T10:00:00.000Z/2022-11-11T10:30:00.000Z nError 4
0 2022-11-11T10:00:00.000Z/2022-11-11T10:30:00.000Z nOffered 113
'''
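For reference, a simpler sketch with pandas.json_normalize can flatten the same structure directly. The payload name below is hypothetical, standing in for the dict printed above (trimmed to the relevant fields):

import pandas as pd

# Hypothetical name for the dict shown in the question, trimmed to the relevant fields.
payload = {'results': [{'data': [{'interval': '2022-11-11T10:00:00.000Z/2022-11-11T10:30:00.000Z',
                                  'metrics': [{'metric': 'nError', 'stats': {'count': 4}},
                                              {'metric': 'nOffered', 'stats': {'count': 113}}]}],
                        'group': {'mediaType': 'voice'}}]}

# record_path walks results -> data -> metrics; meta pulls the interval from each data item.
df = pd.json_normalize(payload['results'],
                       record_path=['data', 'metrics'],
                       meta=[['data', 'interval']])
df = df.rename(columns={'data.interval': 'interval', 'stats.count': 'count'})
print(df[['interval', 'metric', 'count']])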

Python value assigned to incorrect dict key

I am iterating through a CSV and, for each column, determining the longest string length and updating a dict as necessary.
If I do this:
def get_max_size(current, cell_value):
    if cell_value:
        current = max(current, len(cell_value))
    return current

def my_function():
    headers = ["val1", "val2", "val3", "val4", "val5"]
    d = {header: {'max_size': 0, 'other': {'test': None}} for header in headers}
    csv_file = [
        ["abc", "123", "HAMILTON", "1950.00", "17-SEP-2015"],
        ["ab", "321", "GLASGOW", "711.00", "13-NOV-2015"],
    ]
    for row in csv_file:
        for i, header in enumerate(headers):
            max_size = get_max_size(d[header]['max_size'], row[i])
            d[header]['max_size'] = max_size
    print(d)
I get the expected output:
{'val1': {'max_size': 3, 'other': {'test': None}},
'val2': {'max_size': 3, 'other': {'test': None}},
'val3': {'max_size': 8, 'other': {'test': None}},
'val4': {'max_size': 7, 'other': {'test': None}},
'val5': {'max_size': 11, 'other': {'test': None}}}
However, if I modify my code like this:
REQUIRED_VALUES = {
    'max_size': 0,
    'allowed_values': {'digit': None, 'alpha': None, 'whitespace': None, 'symbol': None},
    'max_value': None,
    'allow_null': None,
}

def my_function():
    headers = ["val1", "val2", "val3", "val4", "val5"]
    # d = {header: {'max_size': 0, 'other': {'test': None}} for header in headers}
    d = {header: REQUIRED_VALUES for header in headers}
    csv_file = [
        ["abc", "123", "HAMILTON", "1950.00", "17-SEP-2015"],
        ["ab", "321", "GLASGOW", "711.00", "13-NOV-2015"],
    ]
    for row in csv_file:
        for i, header in enumerate(headers):
            max_size = get_max_size(d[header]['max_size'], row[i])
            d[header]['max_size'] = max_size
    print(d)
Then the largest length across all columns (val5, the date field, length 11) is assigned to every max_size:
{'val1': {'max_size': 11, 'allowed_values': {'digit': None, 'alpha': None, 'whitespace': None, 'symbol': None}, 'max_value': None, 'allow_null': None},
'val2': {'max_size': 11, 'allowed_values': {'digit': None, 'alpha': None, 'whitespace': None, 'symbol': None}, 'max_value': None, 'allow_null': None},
'val3': {'max_size': 11, 'allowed_values': {'digit': None, 'alpha': None, 'whitespace': None, 'symbol': None}, 'max_value': None, 'allow_null': None},
'val4': {'max_size': 11, 'allowed_values': {'digit': None, 'alpha': None, 'whitespace': None, 'symbol': None}, 'max_value': None, 'allow_null': None},
'val5': {'max_size': 11, 'allowed_values': {'digit': None, 'alpha': None, 'whitespace': None, 'symbol': None}, 'max_value': None, 'allow_null': None}}
Is there some difference between the dicts that I'm missing? The dict is the only thing that changes, and both contain nested dictionaries... apart from the number of items, I can't really see the difference.
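For what it's worth, the difference is reference sharing: the second comprehension binds every header to the same REQUIRED_VALUES object, whereas the first builds a fresh dict literal per header. A minimal sketch illustrating this (and one way around it with copy.deepcopy):

import copy

REQUIRED_VALUES = {'max_size': 0, 'max_value': None}
headers = ['val1', 'val2']

shared = {h: REQUIRED_VALUES for h in headers}
print(shared['val1'] is shared['val2'])   # True: both keys point at the same dict,
shared['val1']['max_size'] = 11           # so updating one "updates" them all
print(shared['val2']['max_size'])         # 11

fresh = {h: copy.deepcopy(REQUIRED_VALUES) for h in headers}
fresh['val1']['max_size'] = 11
print(fresh['val2']['max_size'])          # 0: each header has its own copy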

How to iterate over interval in json file and create a dataframe?

I am iterating over a JSON file and creating a dataframe with the desired columns. I already implemented the code, but now the JSON file has changed a little, and I cannot work out where to change the code to get the required output.
Explanation:
previous json result:
queryResult: {'results': [{'data': [{'interval': '2021-10-11T11:46:25.000Z/2021-10-18T11:49:48.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 7,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}},
{'metric': 'nTransferred',
'qualifier': None,
'stats': {'count': 1,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None}],
'group': {'mediaType': 'voice',
'queueId': '73643cff-799b-41ae-9a67-efcf5e593155'}}]}
previous dataframe:
Queue_Id,Interval Start,Interval End,nOffered_count,nOffered_sum,nOffered.denominator,nOffered.numerator,nTransferred_count,nTransferred_sum,nTransferred.denominator,nTransferred.numerator
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-11T11:46:25.000Z,2021-10-18T11:49:48.000Z,7,,,,1.0,,,
new json result:
queryResult: {'results': [{'data': [{'interval': '2021-10-11T11:46:25.000Z/2021-10-12T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 1,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None},
{'interval': '2021-10-13T11:46:25.000Z/2021-10-14T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 2,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}},
{'metric': 'nTransferred',
'qualifier': None,
'stats': {'count': 1,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None},
{'interval': '2021-10-14T11:46:25.000Z/2021-10-15T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 3,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None},
{'interval': '2021-10-15T11:46:25.000Z/2021-10-16T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 1,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None}],
'group': {'mediaType': 'voice',
'queueId': '73643cff-799b-41ae-9a67-efcf5e593155'}}]}
The desired dataframe is now:
Queue_Id,Interval Start,Interval End,nOffered_count,nOffered_sum,nOffered.denominator,nOffered.numerator,nTransferred_count,nTransferred_sum,nTransferred.denominator,nTransferred.numerator
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-11T11:46:25.000Z,2021-10-12T11:46:25.000Z,1,,,,,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-13T11:46:25.000Z,2021-10-14T11:46:25.000Z,2,,,,1,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-14T11:46:25.000Z,2021-10-15T11:46:25.000Z,3,,,,,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-15T11:46:25.000Z,2021-10-16T11:46:25.000Z,1,,,,,,,
What changes do I need to make in the code below to get the new result?
column_names = []
if query_result.results is not None:
    for item in query_result.results:
        data_lst = []
        for lst_data in item.data:
            print("####################################")
            print(lst_data)
            print("####################################")
            for met in lst_data.metrics:
                metric_name = met.metric
                column_names.append('Queue_Id')
                column_names.append(metric_name + '_count')
                column_names.append(metric_name + '_sum')
                column_names.append(metric_name + '.denominator')
                column_names.append(metric_name + '.numerator')
                column_names.append('Interval Start')
                column_names.append('Interval End')
                data_lst.append(queue_id)
                data_lst.append(met.stats.count)
                data_lst.append(met.stats.sum)
                data_lst.append(met.stats.denominator)
                data_lst.append(met.stats.numerator)
                data_lst.append(lst_data.interval.split('/')[0])
                data_lst.append(lst_data.interval.split('/')[1])
            print(data_lst)
else:
    data_lst = []
    metric_name = query.metrics[0]
    column_names.append('Queue_Id')
    column_names.append(metric_name + '_count')
    column_names.append(metric_name + '_sum')
    column_names.append(metric_name + '.denominator')
    column_names.append(metric_name + '.numerator')
    column_names.append('Interval Start')
    column_names.append('Interval End')
    data_lst.append(queue_id)
    data_lst.append('')
    data_lst.append('')
    data_lst.append('')
    data_lst.append('')
    data_lst.append(query.interval.split('/')[0])
    data_lst.append(query.interval.split('/')[1])
print("data_lst", data_lst)
print("column_names", column_names)
return data_lst, column_names
I have modified my code a little and got the result. The code below is working for me:
import itertools
from collections import defaultdict

import pandas as pd

lst_of_metrics = ["nOffered", "nTransferred"]
out = defaultdict(list)
if query_result.results is not None:
    for item in query_result.results:
        for lst_data in item.data:
            print("####################################")
            print(lst_data)
            print("####################################")
            out['queue_id'].append(queue_id)
            # pair the queried metrics with the metrics actually returned for this interval
            for met1, met in itertools.zip_longest(query.metrics, lst_data.metrics):
                if met:
                    if met.metric == met1:
                        out[met.metric + "_count"].append(met.stats.count)
                        out[met.metric + "_sum"].append(met.stats.sum)
                        out[met.metric + ".denominator"].append(met.stats.denominator)
                        out[met.metric + ".numerator"].append(met.stats.numerator)
                    else:
                        out[met1 + "_count"].append('')
                        out[met1 + "_sum"].append('')
                        out[met1 + ".denominator"].append('')
                        out[met1 + ".numerator"].append('')
                else:
                    out[met1 + "_count"].append('')
                    out[met1 + "_sum"].append('')
                    out[met1 + ".denominator"].append('')
                    out[met1 + ".numerator"].append('')
            interval = lst_data.interval.split('/')
            out['Interval Start'].append(interval[0])
            out['Interval End'].append(interval[1])
            print("out", out)
else:
    metric_name = query.metrics[0]
    out['queue_id'].append(queue_id)
    out[metric_name + "_count"].append('')
    out[metric_name + "_sum"].append('')
    out[metric_name + ".denominator"].append('')
    out[metric_name + ".numerator"].append('')
    interval = query.interval.split('/')
    out['Interval Start'].append(interval[0])
    out['Interval End'].append(interval[1])
print(out)
df = pd.DataFrame(out)
print(df)
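The key piece above is itertools.zip_longest: it pairs the queried metric names with the metrics actually returned for an interval, yielding None once the shorter list runs out, which is what triggers the '' padding branch. A tiny illustration (the variable names are just examples):

import itertools

expected = ['nOffered', 'nTransferred']   # e.g. query.metrics
returned = ['nOffered']                   # an interval where only one metric has data

for name, met in itertools.zip_longest(expected, returned):
    print(name, met)
# nOffered nOffered
# nTransferred None   -> the None case appends '' placeholders for nTransferred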

Iterate over json result and get the desirable data in pandas dataframe

I have a JSON result which I am trying to convert into a dataframe, but I am not getting the correct result: it works for some cases and fails for others.
Example:
The API generates a result per metric for the specified interval, but it is not certain that a particular metric has output for that interval. The process runs for 4 different queue_ids.
Suppose the process is running for only 2 metrics: ['nOffered', 'nTransferred'].
queue_id = 'a72dba75-0bc6-4a65-b120-8803364f8dc3'
For this queue_id, nOffered has some values but nTransferred does not. The JSON result is given below:
queryResult: {'results': [{'data': [{'interval': '2021-10-11T11:46:25.000Z/2021-10-12T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 1,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None},
{'interval': '2021-10-13T11:46:25.000Z/2021-10-14T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 2,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None},
{'interval': '2021-10-14T11:46:25.000Z/2021-10-15T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 3,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None},
{'interval': '2021-10-15T11:46:25.000Z/2021-10-16T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 1,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None}],
'group': {'mediaType': 'voice',
'queueId': '73643cff-799b-41ae-9a67-efcf5e593155'}}]}
My code gives the output below:
queue_id nOffered_count nOffered_sum interval_start interval_end
0 a72dba75-0bc6-4a65-b120-8803364f8dc3 6 None 2021-10-11T11:46:25.000Z 2021-10-12T11:46:25.000Z
1 a72dba75-0bc6-4a65-b120-8803364f8dc3 1 None 2021-10-12T11:46:25.000Z 2021-10-13T11:46:25.000Z
2 a72dba75-0bc6-4a65-b120-8803364f8dc3 12 None 2021-10-13T11:46:25.000Z 2021-10-14T11:46:25.000Z
3 a72dba75-0bc6-4a65-b120-8803364f8dc3 6 None 2021-10-14T11:46:25.000Z 2021-10-15T11:46:25.000Z
4 a72dba75-0bc6-4a65-b120-8803364f8dc3 6 None 2021-10-15T11:46:25.000Z 2021-10-16T11:46:25.000Z
But when the process runs for the 2nd queue_id, it does not work.
queue_id - 73643cff-799b-41ae-9a67-efcf5e593155
JSON output for this queue_id:
queryResult: {'results': [{'data': [{'interval': '2021-10-11T11:46:25.000Z/2021-10-12T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 1,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None},
{'interval': '2021-10-13T11:46:25.000Z/2021-10-14T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 2,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}},
{'metric': 'nTransferred',
'qualifier': None,
'stats': {'count': 1,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None},
{'interval': '2021-10-14T11:46:25.000Z/2021-10-15T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 3,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None},
{'interval': '2021-10-15T11:46:25.000Z/2021-10-16T11:46:25.000Z',
'metrics': [{'metric': 'nOffered',
'qualifier': None,
'stats': {'count': 1,
'count_negative': None,
'count_positive': None,
'current': None,
'denominator': None,
'max': None,
'min': None,
'numerator': None,
'ratio': None,
'sum': None,
'target': None}}],
'views': None}],
'group': {'mediaType': 'voice',
'queueId': '73643cff-799b-41ae-9a67-efcf5e593155'}}]}
This time both metrics have some data, so the output should be:
Queue_Id,Interval Start,Interval End,nOffered_count,nOffered_sum,nOffered.denominator,nOffered.numerator,nTransferred_count,nTransferred_sum,nTransferred.denominator,nTransferred.numerator
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-11T11:46:25.000Z,2021-10-12T11:46:25.000Z,1,,,,,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-13T11:46:25.000Z,2021-10-14T11:46:25.000Z,2,,,,1,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-14T11:46:25.000Z,2021-10-15T11:46:25.000Z,3,,,,,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-15T11:46:25.000Z,2021-10-16T11:46:25.000Z,1,,,,,,,
And in the final result, both results are merged to give the output with all columns and data:
Queue_Id,Interval Start,Interval End,nOffered_count,nOffered_sum,nOffered.denominator,nOffered.numerator,nTransferred_count,nTransferred_sum,nTransferred.denominator,nTransferred.numerator
a72dba75-0bc6-4a65-b120-8803364f8dc3,2021-10-11T11:46:25.000Z,2021-10-12T11:46:25.000Z,6,,,,,,,
a72dba75-0bc6-4a65-b120-8803364f8dc3,2021-10-12T11:46:25.000Z,2021-10-13T11:46:25.000Z,1.0,,,,,,,
a72dba75-0bc6-4a65-b120-8803364f8dc3,2021-10-13T11:46:25.000Z,2021-10-14T11:46:25.000Z,12.0,,,,,,,
a72dba75-0bc6-4a65-b120-8803364f8dc3,2021-10-14T11:46:25.000Z,2021-10-15T11:46:25.000Z,6.0,,,,,,,
a72dba75-0bc6-4a65-b120-8803364f8dc3,2021-10-15T11:46:25.000Z,2021-10-16T11:46:25.000Z,6.0,,,,,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-11T11:46:25.000Z,2021-10-12T11:46:25.000Z,1,,,,,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-13T11:46:25.000Z,2021-10-14T11:46:25.000Z,2,,,,1.0,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-14T11:46:25.000Z,2021-10-15T11:46:25.000Z,3,,,,,,,
73643cff-799b-41ae-9a67-efcf5e593155,2021-10-15T11:46:25.000Z,2021-10-16T11:46:25.000Z,1,,,,,,,
Currently I am running the logic below:
out = defaultdict(list)
if query_result.results is not None:
    for item in query_result.results:
        for lst_data in item.data:
            print("####################################")
            print(lst_data)
            print("####################################")
            out['queue_id'].append(queue_id)
            for met in lst_data.metrics:
                out[met.metric + "_count"].append(met.stats.count)
                out[met.metric + "_sum"].append(met.stats.sum)
                out[met.metric + ".denominator"].append(met.stats.denominator)
                out[met.metric + ".numerator"].append(met.stats.numerator)
            interval = lst_data.interval.split('/')
            out['Interval Start'].append(interval[0])
            out['Interval End'].append(interval[1])
            print("out", out)
else:
    metric_name = query.metrics[0]
    out['queue_id'].append(queue_id)
    out[metric_name + "_count"].append('')
    out[metric_name + "_sum"].append('')
    out[metric_name + ".denominator"].append('')
    out[metric_name + ".numerator"].append('')
    interval = query.interval.split('/')
    out['Interval Start'].append(interval[0])
    out['Interval End'].append(interval[1])
print(out)
df = pd.DataFrame(out)
print(df)
return df
I used the logic below to get the desired result. It is working for me:
import itertools
from collections import defaultdict

import pandas as pd

lst_of_metrics = ["nOffered", "nTransferred"]
out = defaultdict(list)
if query_result.results is not None:
    for item in query_result.results:
        for lst_data in item.data:
            print("####################################")
            print(lst_data)
            print("####################################")
            out['queue_id'].append(queue_id)
            # pair the queried metrics with the metrics actually returned for this interval
            for met1, met in itertools.zip_longest(query.metrics, lst_data.metrics):
                if met:
                    if met.metric == met1:
                        out[met.metric + "_count"].append(met.stats.count)
                        out[met.metric + "_sum"].append(met.stats.sum)
                        out[met.metric + ".denominator"].append(met.stats.denominator)
                        out[met.metric + ".numerator"].append(met.stats.numerator)
                    else:
                        out[met1 + "_count"].append('')
                        out[met1 + "_sum"].append('')
                        out[met1 + ".denominator"].append('')
                        out[met1 + ".numerator"].append('')
                else:
                    out[met1 + "_count"].append('')
                    out[met1 + "_sum"].append('')
                    out[met1 + ".denominator"].append('')
                    out[met1 + ".numerator"].append('')
            interval = lst_data.interval.split('/')
            out['Interval Start'].append(interval[0])
            out['Interval End'].append(interval[1])
            print("out", out)
else:
    metric_name = query.metrics[0]
    out['queue_id'].append(queue_id)
    out[metric_name + "_count"].append('')
    out[metric_name + "_sum"].append('')
    out[metric_name + ".denominator"].append('')
    out[metric_name + ".numerator"].append('')
    interval = query.interval.split('/')
    out['Interval Start'].append(interval[0])
    out['Interval End'].append(interval[1])
print(out)
df = pd.DataFrame(out)
print(df)
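To produce the merged table shown earlier (both queue_ids in one frame), the per-queue frames can simply be concatenated. A small sketch with hypothetical stand-in frames for the two queue_ids:

import pandas as pd

# Hypothetical stand-ins for the frames built per queue_id by the code above.
df_a = pd.DataFrame({'queue_id': ['a72dba75-...'], 'nOffered_count': [6]})
df_b = pd.DataFrame({'queue_id': ['73643cff-...'], 'nOffered_count': [1], 'nTransferred_count': [1]})

# Columns missing from one frame show up as NaN in the other frame's rows.
final_df = pd.concat([df_a, df_b], ignore_index=True)
print(final_df)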

Remove nested JSON from DataFrame without creating new rows or duplicate data

OVERVIEW
Using pandas.json_normalize I was able to normalize JSON data output from BigQuery, which had several nested objects, into a dataframe. The CSV representation is below. You'll notice that columns like "device" broke out as they should. In other instances, such as the event_params column, it did not. I played around and was able to expand this object, but doing so introduced new rows; fortunately, I only need a single key/value depending on the event_name value. For this example, I want to extract session_user_id and normalize that into a similar wide column layout.
CSV DATA
event_date,event_timestamp,event_name,event_params,device.operating_system device.operating_system_version
20200105,69996099900,session_start,[{'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': '3', 'float_value': None, 'double_value': None}}, {'key': 'session_engaged', 'value': {'string_value': None, 'int_value': '1', 'float_value': None, 'double_value': None}}, {'key': 'engaged_session_event', 'value': {'string_value': None, 'int_value': '1', 'float_value': None, 'double_value': None}}, {'key': 'session_id', 'value': {'string_value': None, 'int_value': '123456789', 'float_value': None, 'double_value': None}}, {'key': 'firebase_event_origin', 'value': {'string_value': 'auto', 'int_value': None, 'float_value': None, 'double_value': None}}],IOS,9.3.5
QUESTION
I might just be overwhelmed, but is there a simple way to extract a single key (or, even easier, all of the keys) with the associated values into additional columns?
For example, new columns would be:
(Honestly, the naming for key/value is less of an issue as long as it's a wide CSV flat file.)
event_params.1.key = session_engaged
event_params.1.string_value = None
event_params.1.int_value = 1
[...]
event_params.2.key = session_id
event_params.2.string_value = None
event_params.2.int_value = 123456789
[...]
This answer, whilst not being very elegant and probably needing a little bit of work to get exactly what you want, is a reasonable start.
Set-up Data
I've taken your example data and modified the format slightly, to load it into a dataframe.
import pandas as pd

columns = 'event_date event_timestamp event_name event_params device.operating_system device.operating_system_version'.split()
data = [
    20200105,
    69996099900,
    'session_start',
    [{'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': '3', 'float_value': None, 'double_value': None}},
     {'key': 'session_engaged', 'value': {'string_value': None, 'int_value': '1', 'float_value': None, 'double_value': None}},
     {'key': 'engaged_session_event', 'value': {'string_value': None, 'int_value': '1', 'float_value': None, 'double_value': None}},
     {'key': 'session_id', 'value': {'string_value': None, 'int_value': '123456789', 'float_value': None, 'double_value': None}},
     {'key': 'firebase_event_origin', 'value': {'string_value': 'auto', 'int_value': None, 'float_value': None, 'double_value': None}}],
    'IOS',
    '9.3.5',
]
df = pd.DataFrame([data], columns=columns)
Initial data:
event_date event_timestamp event_name event_params device.operating_system device.operating_system_version
0 20200105 69996099900 session_start [{'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': '3', 'float_value': None, 'double_value': None}}, {'key': 'session_engaged', 'value': {'string_value': None, 'int_value': '1', 'float_value': None, 'double_value': None}}, {'key': 'engaged_session_event', 'value': {'string_value': None, 'int_value': '1', 'float_value': None, 'double_value': None}}, {'key': 'session_id', 'value': {'string_value': None, 'int_value': '123456789', 'float_value': None, 'double_value': None}}, {'key': 'firebase_event_origin', 'value': {'string_value': 'auto', 'int_value': None, 'float_value': None, 'double_value': None}}] IOS 9.3.5
Solution
new_df = df.explode('event_params').event_params.apply(pd.Series).value.apply(pd.Series)
df = pd.concat([df.drop('event_params', axis=1), new_df], axis=1)
Output:
event_date event_timestamp event_name device.operating_system device.operating_system_version string_value int_value float_value double_value
0 20200105 69996099900 session_start IOS 9.3.5 None 3 None None
0 20200105 69996099900 session_start IOS 9.3.5 None 1 None None
0 20200105 69996099900 session_start IOS 9.3.5 None 1 None None
0 20200105 69996099900 session_start IOS 9.3.5 None 123456789 None None
0 20200105 69996099900 session_start IOS 9.3.5 auto None None None
Explanation
I'll break the description of the first line into multiple steps:
Use explode to access the inside of the list in event_params (you could apply a lambda function, if preferred).
Use .apply(pd.Series) first to expand the JSON; see here for more info.
Use .apply(pd.Series) again to expand the value column.
Of course, the second line just concatenates the two dataframes, with event_params dropped from the original data.
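If the wide, one-row layout from the question is the goal (one set of columns per parameter rather than one row per parameter), a rough sketch along the same lines, assuming the df built in the set-up above (the column-naming scheme is just one possibility):

def params_to_columns(params):
    # Flatten [{'key': ..., 'value': {...}}, ...] into one wide Series,
    # naming columns event_params.<key>.<stat>.
    out = {}
    for p in params:
        for stat, val in p['value'].items():
            out[f"event_params.{p['key']}.{stat}"] = val
    return pd.Series(out)

wide = pd.concat([df.drop(columns='event_params'),
                  df['event_params'].apply(params_to_columns)], axis=1)
print(wide.filter(like='session_id'))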
