Convert Pandas Dataframe to nested json-keep 2 columns - python

I have a DataFrame with the following columns and data.
I would like to convert it to two columns, studentid and info, where info holds the remaining columns as nested JSON keyed by course.
The dataset is:
studentid course teacher grade rank
1 math A 91 1
1 history B 79 2
2 math A 88 2
2 history B 83 1
3 math A 85 3
3 history B 76 3
and the desired output is
studentid info
1 "{""math"":[{""teacher"":""A"",""grade"":91,""rank"":1}],
""history"":[{""teacher"":""B"",""grade"":79,""rank"":2}]}"
2 "{""math"":[{""teacher"":""A"",""grade"":88,""rank"":2}],
""history"":[{""teacher"":""B"",""grade"":83,""rank"":1}]}"
3 "{""math"":[{""teacher"":""A"",""grade"":85,""rank"":3}],
""history"":[{""teacher"":""B"",""grade"":76,""rank"":3}]}"

You don't really need groupby(), and the individual sub-dictionaries don't need to be wrapped in lists; they can be the values of the nested dict directly. After setting the columns you want as the index, df.to_dict(orient='index') gets you close to the desired output, keyed by (studentid, course) tuples:
df = df.set_index(['studentid','course'])
df.to_dict(orient='index')
Outputs:
{(1, 'math'): {'teacher': 'A', 'grade': 91, 'rank': 1},
(1, 'history'): {'teacher': 'B', 'grade': 79, 'rank': 2},
(2, 'math'): {'teacher': 'A', 'grade': 88, 'rank': 2},
(2, 'history'): {'teacher': 'B', 'grade': 83, 'rank': 1},
(3, 'math'): {'teacher': 'A', 'grade': 85, 'rank': 3},
(3, 'history'): {'teacher': 'B', 'grade': 76, 'rank': 3}}
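The dictionary above is keyed by (studentid, course) tuples rather than nested per student. A minimal sketch of regrouping it into one dictionary per student might look like this; the names flat and nested are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "studentid": [1, 1, 2, 2, 3, 3],
    "course": ["math", "history", "math", "history", "math", "history"],
    "teacher": ["A", "B", "A", "B", "A", "B"],
    "grade": [91, 79, 88, 83, 85, 76],
    "rank": [1, 2, 2, 1, 3, 3],
})

# Tuple-keyed dict, as produced by set_index + to_dict(orient='index')
flat = df.set_index(["studentid", "course"]).to_dict(orient="index")

# Regroup: the outer key is the student, the inner key is the course
nested = {}
for (studentid, course), fields in flat.items():
    nested.setdefault(studentid, {})[course] = fields
```

Here nested[1]["math"] holds the teacher, grade and rank for student 1's math course.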

Considering that the initial dataframe is df, there are various options, depending on the exact desired output.
If one wants the info column to be a dictionary of lists, this will do the work
df_new = df.groupby('studentid').apply(lambda x: x.drop('studentid', axis=1).to_dict(orient='list')).reset_index(name='info')
[Out]:
studentid info
0 1 {'course': ['math', 'history'], 'teacher': ['A...
1 2 {'course': ['math', 'history'], 'teacher': ['A...
2 3 {'course': ['math', 'history'], 'teacher': ['A...
If one wants a list of dictionaries, then do the following
df_new = df.groupby('studentid').apply(lambda x: x.drop('studentid', axis=1).to_dict(orient='records')).reset_index(name='info')
[Out]:
studentid info
0 1 [{'course': 'math', 'teacher': 'A', 'grade': 9...
1 2 [{'course': 'math', 'teacher': 'A', 'grade': 8...
2 3 [{'course': 'math', 'teacher': 'A', 'grade': 8...
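Neither shape above is keyed by course, which is what the question asks for. As a hedged sketch (not taken from either answer), one could iterate over the groups and serialize each student's courses with json.dumps; the helper names rows and info are assumptions made for illustration.

```python
import json
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "studentid": [1, 1, 2, 2, 3, 3],
    "course": ["math", "history", "math", "history", "math", "history"],
    "teacher": ["A", "B", "A", "B", "A", "B"],
    "grade": [91, 79, 88, 83, 85, 76],
    "rank": [1, 2, 2, 1, 3, 3],
})

rows = []
for studentid, group in df.groupby("studentid"):
    # Map each course to a one-element list of its remaining fields,
    # casting to int so json.dumps never sees NumPy integer types
    info = {
        rec["course"]: [{
            "teacher": rec["teacher"],
            "grade": int(rec["grade"]),
            "rank": int(rec["rank"]),
        }]
        for rec in group.to_dict(orient="records")
    }
    rows.append({"studentid": int(studentid), "info": json.dumps(info)})

df_new = pd.DataFrame(rows)
```

The info column then contains JSON strings, so df_new.to_csv(...) would quote them in the doubled-quote style shown in the question.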

Related

List to a Readable Representation using Python

I have data as
[{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]
I need to represent it as
Cluster Number Subset Name
0 ['X_1', 'X_A', 'X_B'] A, C
1 ['D_1', 'D_2', 'D_3', 'D_4'] D
2 ['B_1', 'B_A'] B
For the sake of completeness, it is fair to mention that in your case you can create the dataframe directly, without json_normalize, and apply groupby:
import pandas as pd
data = [{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]
# Wrap the chain in parentheses so it can span several lines
df = (pd.DataFrame(data)
      .groupby('cluster')
      .agg({'subsets': 'first', 'name': ', '.join})
      .rename_axis('Cluster Number'))
subsets name
Cluster Number
0 [X_1, X_A, X_B] A, C
1 [D_1, D_2, D_3, D_4] D
2 [B_1, B_A] B
You can use json_normalize + groupby "cluster" and apply join to "name" and first to "subsets":
df = pd.json_normalize(data).groupby('cluster').agg({'subsets':'first','name':', '.join}).reset_index()
Output:
cluster subsets name
0 0 [X_1, X_A, X_B] A, C
1 1 [D_1, D_2, D_3, D_4] D
2 2 [B_1, B_A] B
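For comparison, the same grouping can be sketched without pandas. This is only an illustration under the same assumption the answers make, namely that every item in a cluster shares the same subsets, so the first one seen is kept; the names names_by_cluster, subsets_by_cluster and rows are hypothetical.

```python
from collections import defaultdict

data = [{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
        {'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
        {'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
        {'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]

names_by_cluster = defaultdict(list)
subsets_by_cluster = {}
for item in data:
    names_by_cluster[item['cluster']].append(item['name'])
    # Keep the first subsets list seen for each cluster
    subsets_by_cluster.setdefault(item['cluster'], item['subsets'])

# One (cluster, subsets, joined names) row per cluster, in cluster order
rows = [(c, subsets_by_cluster[c], ', '.join(names_by_cluster[c]))
        for c in sorted(names_by_cluster)]

for cluster, subsets, names in rows:
    print(cluster, subsets, names)
```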

Pandas data manipulation: mapping a column with some predetermined values

My data look like this
import pandas as pd
import numpy as np
T1_Delivery = 20
T2_Delivery = 30
T3_Delivery = 40
T4_Delivery = 55
data = [
{'Person': 'A', 'Present_Delivery': -10, 'update': 'T1'},
{'Person': 'B', 'Present_Delivery': 30},
{'Person': 'C', 'Present_Delivery': 40},
{'Person': 'D', 'Present_Delivery': 70, 'update': 'T3'},
{'Person': 'E', 'Present_Delivery': 50, 'update': 'T2'},
{'Person': 'F', 'Present_Delivery': 50}
]
df = pd.DataFrame(data)
# NaN never compares equal to anything, so test for missing values with isna()
df['Actual_Delivery'] = np.where(df['update'].isna(), df['Present_Delivery'], 0)
#map T{x} to T{x}_Delivery
I need to map each update entry (T{x}) to the corresponding T{x}_Delivery value defined globally. Is this possible? I am able to do the mapping when the global names do not include the _Delivery suffix.
My desired output is something like this:
data = [
{'Person': 'A', 'Actual_Delivery': 20},
{'Person': 'B', 'Actual_Delivery': 30},
{'Person': 'C', 'Actual_Delivery': 40},
{'Person': 'D', 'Actual_Delivery': 40},
{'Person': 'E', 'Actual_Delivery': 30},
{'Person': 'F', 'Actual_Delivery': 50}
]
df_desired = pd.DataFrame(data)
EDIT: This is part of a bigger script, and it is not possible to change the global variables into a dictionary!
You can build a dictionary for the mapping, then use pd.Series.map and pd.Series.fillna:
mapping = {'T1':20,'T2':30,'T3':40,'T4':55}
df_final = (df[['Person', 'Present_Delivery']].
assign(Present_Delivery = df['update'].map(mapping).fillna(df['Present_Delivery']))
)
Person Present_Delivery
0 A 20.0
1 B 30.0
2 C 40.0
3 D 40.0
4 E 30.0
5 F 50.0
Another idea using pd.Series.where along with pd.Series.isna
df['Present_Delivery'] = (df['Present_Delivery'].where(
df['update'].isna(),df['update'].map(mapping))
)
df_final = df.drop(columns='update')
Person Present_Delivery
0 A 20
1 B 30
2 C 40
3 D 40
4 E 30
5 F 50
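Since the question's EDIT says the global variables must stay as they are, the mapping can reference them instead of hard-coding the numbers. The following is a minimal sketch under that assumption; the column name Actual_Delivery comes from the question, while mapped and df_result are illustrative names.

```python
import pandas as pd

# Globals exactly as defined in the question; they are left untouched
T1_Delivery = 20
T2_Delivery = 30
T3_Delivery = 40
T4_Delivery = 55

data = [
    {'Person': 'A', 'Present_Delivery': -10, 'update': 'T1'},
    {'Person': 'B', 'Present_Delivery': 30},
    {'Person': 'C', 'Present_Delivery': 40},
    {'Person': 'D', 'Present_Delivery': 70, 'update': 'T3'},
    {'Person': 'E', 'Present_Delivery': 50, 'update': 'T2'},
    {'Person': 'F', 'Present_Delivery': 50},
]
df = pd.DataFrame(data)

# The lookup reads the existing variables rather than replacing them
mapping = {
    'T1': T1_Delivery,
    'T2': T2_Delivery,
    'T3': T3_Delivery,
    'T4': T4_Delivery,
}

# Rows without an update map to NaN and fall back to Present_Delivery
mapped = df['update'].map(mapping)
df['Actual_Delivery'] = mapped.fillna(df['Present_Delivery']).astype(int)
df_result = df[['Person', 'Actual_Delivery']]
```

The astype(int) call is optional; it only undoes the float conversion that the intermediate NaN values cause.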

What is the most efficient way to sum a dict with multiple keys by one key?

I have the following dict structure.
product1 = {'product_tmpl_id': product_id,
'qty':product_uom_qty,
'price':price_unit,
'subtotal':price_subtotal,
'total':price_total,
}
And then a list of products, each item in the list is a dict with the above structure
list_ = [product1,product2,product3,.....]
I need to sum the items in the list, grouped by the key product_tmpl_id. I'm using collections.defaultdict, but it only sums the qty key; I need to sum every key except product_tmpl_id, which is the grouping criterion.
from collections import defaultdict
c = defaultdict(float)
for d in list_:
    c[d['product_tmpl_id']] += d['qty']
c = [{'product_id': id, 'qty': qty} for id, qty in c.items()]
I know how to do it with a for loop, but I'm looking for a more Pythonic way.
Thanks.
EDIT:
What I need is to go from this:
lst = [
{'Name': 'A', 'qty':100,'price':10},
{'Name': 'A', 'qty':100,'price':10},
{'Name': 'A', 'qty':100,'price':10},
{'Name': 'B', 'qty':100,'price':10},
{'Name': 'C', 'qty':100,'price':10},
{'Name': 'C', 'qty':100,'price':10},
]
to this
group_lst = [
{'Name': 'A', 'qty':300,'price':30},
{'Name': 'B', 'qty':100,'price':10},
{'Name': 'C', 'qty':200,'price':20},
]
Using basic Python, this doesn't get a whole lot better. You could hack something together with itertools.groupby, but it'd be ugly and probably slower, certainly less clear.
As @9769953 suggested, though, Pandas is a good package for handling this sort of structured, tabular data.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(lst)
Out[2]:
Name price qty
0 A 10 100
1 A 10 100
2 A 10 100
3 B 10 100
4 C 10 100
5 C 10 100
In [3]: df.groupby('Name').agg('sum')
Out[3]:
price qty
Name
A 30 300
B 10 100
C 20 200
You just need a little extra mojo if you don't want to keep the data as a dataframe:
In [4]: grouped = df.groupby('Name', as_index=False).agg('sum')
In [5]: grouped.to_dict(orient='records')
Out[5]:
[{'Name': 'A', 'price': 30, 'qty': 300},
{'Name': 'B', 'price': 10, 'qty': 100},
{'Name': 'C', 'price': 20, 'qty': 200}]
On the verbose side, but gets the job done:
from pprint import pprint

group_lst = []
lst_of_names = []
for item in lst:
    qty_total = 0
    price_total = 0
    # Get names that have already been totalled
    lst_of_names = [item_get_name['Name'] for item_get_name in group_lst]
    if item['Name'] in lst_of_names:
        continue
    # Sum qty and price over every entry that shares this name
    for item2 in lst:
        if item['Name'] == item2['Name']:
            qty_total += item2['qty']
            price_total += item2['price']
    group_lst.append(
        {
            'Name': item['Name'],
            'qty': qty_total,
            'price': price_total
        }
    )
pprint(group_lst)
Output:
[{'Name': 'A', 'price': 30, 'qty': 300},
{'Name': 'B', 'price': 10, 'qty': 100},
{'Name': 'C', 'price': 20, 'qty': 200}]
You can use defaultdict and Counter
>>> from collections import Counter, defaultdict
>>> from pprint import pprint
>>> cntr = defaultdict(Counter)
>>> for d in lst:
...     cntr[d['Name']].update(d)
...
>>> res = [dict(v, **{'Name':k}) for k,v in cntr.items()]
>>> pprint(res)
[{'Name': 'A', 'price': 30, 'qty': 300},
{'Name': 'C', 'price': 20, 'qty': 200},
{'Name': 'B', 'price': 10, 'qty': 100}]
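The question asks for every key except the grouping key to be summed, whatever those keys are. A small generic sketch in plain Python, with a hypothetical helper named sum_by_key that is not taken from any of the answers, could look like this:

```python
from collections import defaultdict


def sum_by_key(items, key):
    """Sum every field except `key`, grouping the dicts by their `key` value."""
    totals = defaultdict(lambda: defaultdict(int))
    for item in items:
        for field, value in item.items():
            if field != key:
                totals[item[key]][field] += value
    # Put the grouping key back into each summed record
    return [{key: group, **fields} for group, fields in totals.items()]


lst = [
    {'Name': 'A', 'qty': 100, 'price': 10},
    {'Name': 'A', 'qty': 100, 'price': 10},
    {'Name': 'A', 'qty': 100, 'price': 10},
    {'Name': 'B', 'qty': 100, 'price': 10},
    {'Name': 'C', 'qty': 100, 'price': 10},
    {'Name': 'C', 'qty': 100, 'price': 10},
]

group_lst = sum_by_key(lst, 'Name')
```

This makes a single pass over the list and assumes all non-key values are numeric.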

pandas dataframe convert values in array of objects

I want to convert the below pandas data frame
import pandas as pd
data = pd.DataFrame([[1,2], [5,6]], columns=['10+', '20+'], index=['A', 'B'])
data.index.name = 'City'
data.columns.name = 'Age Group'
print(data)
Age Group 10+ 20+
City
A 1 2
B 5 6
into a list of dictionaries, like
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
I am able to get the above expected result using the following loops
result = []
cols_name = data.columns.name
index_names = data.index.name
for index in data.index:
    for col in data.columns:
        result.append({cols_name: col, index_names: index, 'count': data.loc[index, col]})
Is there a better way of doing this? My original data will have a large number of records, so for loops will take too long.
I think you can use stack with reset_index to reshape, and finally to_dict:
print (data.stack().reset_index(name='count'))
City Age Group count
0 A 10+ 1
1 A 20+ 2
2 B 10+ 5
3 B 20+ 6
print (data.stack().reset_index(name='count').to_dict(orient='records'))
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
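As an alternative sketch of the same reshape, DataFrame.melt can also turn the wide table into long form. This is offered only as an illustration alongside the stack approach; note that melt emits rows column by column, so the sort_values call is added here to reproduce the city-by-city order.

```python
import pandas as pd

data = pd.DataFrame([[1, 2], [5, 6]], columns=['10+', '20+'], index=['A', 'B'])
data.index.name = 'City'
data.columns.name = 'Age Group'

records = (
    data.reset_index()                       # move City from the index to a column
        .melt(id_vars='City', var_name='Age Group', value_name='count')
        .sort_values(['City', 'Age Group'])  # restore the row order of stack()
        .to_dict(orient='records')
)
```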

Merging 2 list of dicts based on common values

I have two lists of dicts, as follows:
list1 = [
    {'name': 'john',
     'gender': 'male',
     'grade': 'third'
    },
    {'name': 'cathy',
     'gender': 'female',
     'grade': 'second'
    },
]
list2 = [
    {'name': 'john',
     'physics': 95,
     'chemistry': 89
    },
    {'name': 'cathy',
     'physics': 78,
     'chemistry': 69
    },
]
The output list I need is as follows:
final_list = [
    {'name': 'john',
     'gender': 'male',
     'grade': 'third',
     'marks': {'physics': 95, 'chemistry': 89}
    },
    {'name': 'cathy',
     'gender': 'female',
     'grade': 'second',
     'marks': {'physics': 78, 'chemistry': 69}
    },
]
First I tried iterating, as follows:
final_list = []
for item1 in list1:
    for item2 in list2:
        if item1['name'] == item2['name']:
            temp = dict(item2)
            temp.pop('name')
            final_list.append(dict(name=item1['name'], **temp))
However, this does not give me the desired result. I also tried pandas, where I have limited experience:
>>> import pandas as pd
>>> df1 = pd.DataFrame(list1)
>>> df2 = pd.DataFrame(list2)
>>> result = pd.merge(df1, df2, on=['name'])
However, I am clueless about how to get the data back into the format I need. Any help is appreciated.
You can first merge both dataframes
In [144]: df = pd.DataFrame(list1).merge(pd.DataFrame(list2))
Which would look like,
In [145]: df
Out[145]:
gender grade name chemistry physics
0 male third john 89 95
1 female second cathy 69 78
Then create a marks column holding a dict (wrapped in a one-element list here)
In [146]: df['marks'] = df.apply(lambda x: [x[['chemistry', 'physics']].to_dict()], axis=1)
In [147]: df
Out[147]:
gender grade name chemistry physics \
0 male third john 89 95
1 female second cathy 69 78
marks
0 [{u'chemistry': 89, u'physics': 95}]
1 [{u'chemistry': 69, u'physics': 78}]
Then call the to_dict(orient='records') method on the selected columns of the dataframe
In [148]: df[['name', 'gender', 'grade', 'marks']].to_dict(orient='records')
Out[148]:
[{'gender': 'male',
'grade': 'third',
'marks': [{'chemistry': 89L, 'physics': 95L}],
'name': 'john'},
{'gender': 'female',
'grade': 'second',
'marks': [{'chemistry': 69L, 'physics': 78L}],
'name': 'cathy'}]
Using your pandas approach, you can call
result.to_dict(orient='records')
to get it back as a list of dictionaries. It won't put marks in as a sub-field though, since there's nothing telling it to do that. physics and chemistry will just be fields on the same level as the rest.
You may also be having problems because your name is 'cathy' in the first list and 'kathy' in the second, which naturally won't get merged.
Create a function that adds a marks column containing a dictionary of the physics and chemistry marks:
def create_marks(row):
    # With axis=1, apply passes each row in as a Series
    row['marks'] = {'chemistry': row['chemistry'], 'physics': row['physics']}
    return row
result_with_marks = result.apply(create_marks, axis=1)
Out[19]:
gender grade name chemistry physics marks
male third john 89 95 {u'chemistry': 89, u'physics': 95}
female second cathy 69 78 {u'chemistry': 69, u'physics': 78}
then convert it to your desired result as follows
result_with_marks.drop( ['chemistry' , 'physics'], axis = 1).to_dict(orient = 'records')
Out[20]:
[{'gender': 'male',
'grade': 'third',
'marks': {'chemistry': 89L, 'physics': 95L},
'name': 'john'},
{'gender': 'female',
'grade': 'second',
'marks': {'chemistry': 69L, 'physics': 78L},
'name': 'cathy'}]
Since you want a list of dicts as output, you can easily do this without pandas. Use a dict keyed by name to store all the info and make one pass over each list, rather than the O(n^2) nested loops in your own code:
from pprint import pprint as pp
out = {d["name"]: d for d in list1}
for d in list2:
    out[d.pop("name")]["marks"] = d
pp(list(out.values()))
Output:
[{'gender': 'female',
'grade': 'second',
'marks': {'chemistry': 69, 'physics': 78},
'name': 'cathy'},
{'gender': 'male',
'grade': 'third',
'marks': {'chemistry': 89, 'physics': 95},
'name': 'john'}]
That reuses (and modifies) the dicts in your lists. If you want to create new dicts instead:
from pprint import pprint as pp
out = {d["name"]: d.copy() for d in list1}
for d in list2:
    marks = d.copy()  # copy first so the pop below leaves list2 untouched
    out[marks.pop("name")]["marks"] = marks
pp(list(out.values()))
The output is the same:
[{'gender': 'female',
'grade': 'second',
'marks': {'chemistry': 69, 'physics': 78},
'name': 'cathy'},
{'gender': 'male',
'grade': 'third',
'marks': {'chemistry': 89, 'physics': 95},
'name': 'john'}]
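One of the answers above points out that a name appearing in only one list will not be merged. A hedged sketch of the dict-based approach that tolerates such mismatches and leaves both input lists unmodified might look like this; merge_marks is a hypothetical helper name, not part of any answer.

```python
def merge_marks(people, scores):
    """Attach each person's marks as a nested dict, keyed by the shared name."""
    merged = {person["name"]: dict(person) for person in people}
    for score in scores:
        # Everything except the name counts as a mark
        marks = {k: v for k, v in score.items() if k != "name"}
        # Skip scores whose name has no match instead of raising KeyError
        if score["name"] in merged:
            merged[score["name"]]["marks"] = marks
    return list(merged.values())


list1 = [
    {'name': 'john', 'gender': 'male', 'grade': 'third'},
    {'name': 'cathy', 'gender': 'female', 'grade': 'second'},
]
list2 = [
    {'name': 'john', 'physics': 95, 'chemistry': 89},
    {'name': 'cathy', 'physics': 78, 'chemistry': 69},
]

final_list = merge_marks(list1, list2)
```

People without a matching score simply come back without a marks key; whether that or an error is preferable depends on the data.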
