Python Pandas: Nested Dictionary

I have a list of dictionaries that I wish to manipulate using Pandas. Say:
m = [{"topic": "A", "type": "InvalidA", "count": 1}, {"topic": "A", "type": "InvalidB", "count": 1}, {"topic": "A", "type": "InvalidA", "count": 1}, {"topic": "B", "type": "InvalidA", "count": 1}, {"topic": "B", "type": "InvalidA", "count": 1}, {"topic": "B", "type": "InvalidB", "count": 1}]
1) First, create a DataFrame using the constructor:
df = pd.DataFrame(m)
2) Group by columns 'topic' and 'type' and count:
df_group = df.groupby(['topic', 'type']).count()
I end up with:
                count
topic type
A     InvalidA      2
      InvalidB      1
B     InvalidA      2
      InvalidB      1
I want to now convert this to a nested dict:
{ "A" : {"InvalidA" : 2,
"InvalidB" : 1},
"B" : {"InvalidA" : 2,
"InvalidB": 1}
}
Any suggestions on how to get from df_group to a nested dict?

Using unstack + to_dict
df_group['count'].unstack(0).to_dict()
Out[446]: {'A': {'InvalidA': 2, 'InvalidB': 1}, 'B': {'InvalidA': 2, 'InvalidB': 1}}
Or, slightly changing your groupby to crosstab:
pd.crosstab(df.type,df.topic).to_dict()
Out[449]: {'A': {'InvalidA': 2, 'InvalidB': 1}, 'B': {'InvalidA': 2, 'InvalidB': 1}}
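If you'd rather skip the intermediate df_group entirely, a dict comprehension over the grouped frame reaches the same nested shape (a minimal sketch, assuming the df built above):
# Count rows per "type" within each "topic" group, straight to a nested dict
nested = {topic: grp["type"].value_counts().to_dict()
          for topic, grp in df.groupby("topic")}
# {'A': {'InvalidA': 2, 'InvalidB': 1}, 'B': {'InvalidA': 2, 'InvalidB': 1}}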

Related

Fill multiple empty values in a Python dictionary for a particular key throughout the dictionary

I have a dictionary as below.
The key id is present multiple times inside the dictionary. I need to fill in the id value at all of those places, ideally in a single line of code.
Currently I am writing multiple lines of code to fill the empty values.
dicts = {
    "abc": {
        "a": {"id": "", "id1": ""},
        "b": {"id": "", "hey": "1223"},
        "c": {"id": "", "hello": "4564"}
    },
    "xyz": {
        "d": {"id": "", "id1": "", "ijk": "water"}
    },
    "f": {"id": ""},
    "g": {"id1": ""}
}
id = 123
dicts['abc']['a']['id'] = id
dicts['abc']['b']['id'] = id
dicts['abc']['c']['id'] = id
dicts['xyz']['d']['id'] = id
dicts['f']['id'] = id
dicts
Output:
{'abc': {'a': {'id': 123, 'id1': ''},
         'b': {'id': 123, 'hey': '1223'},
         'c': {'id': 123, 'hello': '4564'}},
 'xyz': {'d': {'id': 123, 'id1': '', 'ijk': 'water'}},
 'f': {'id': 123},
 'g': {'id1': ''}}
You can solve it in place via a simple recursive function, for example:
id = 123
dicts = {
    "abc": {
        "a": {"id": "", "id1": ""},
        "b": {"id": "", "hey": "1223"},
        "c": {"id": "", "hello": "4564"}
    },
    "xyz": {
        "d": {"id": "", "id1": "", "ijk": "water"}
    },
    "f": {"id": ""},
    "g": {"id1": ""}
}
def process(dicts):
    # Fill empty 'id' values in place, recursing into nested dicts.
    for k, v in dicts.items():
        if k == 'id' and not dicts[k]:
            dicts[k] = id
        if isinstance(v, dict):
            process(v)

process(dicts)
print(dicts)
Output:
{
    'abc': {'a': {'id': 123, 'id1': ''},
            'b': {'id': 123, 'hey': '1223'},
            'c': {'id': 123, 'hello': '4564'}},
    'xyz': {'d': {'id': 123, 'id1': '', 'ijk': 'water'}},
    'f': {'id': 123},
    'g': {'id1': ''}
}
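A small variation (my own sketch, with a hypothetical name fill_ids) passes the fill value in explicitly instead of reading the module-level id, which also avoids shadowing the id builtin:
def fill_ids(node, fill_value):
    # Same traversal as process(), but the value to fill is a parameter.
    for k, v in node.items():
        if k == 'id' and not v:
            node[k] = fill_value
        elif isinstance(v, dict):
            fill_ids(v, fill_value)

fill_ids(dicts, 123)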

GroupBy results to a list of dictionaries, using the grouped-by object in it

My DataFrame looks like so:
Date  Column1  Column2
1.1   A        1
1.1   B        3
1.1   C        4
2.1   A        2
2.1   B        3
2.1   C        5
3.1   A        1
3.1   B        2
3.1   C        2
And I'm looking to group it by Date and extract that data to a list of dictionaries so it appears like this:
[
    {"Date": "1.1", "A": 1, "B": 3, "C": 4},
    {"Date": "2.1", "A": 2, "B": 3, "C": 5},
    {"Date": "3.1", "A": 1, "B": 2, "C": 2}
]
This is my code so far:
df.groupby('Date')[['Column1', 'Column2']].apply(lambda g: {k: v for k, v in g.values}).to_list()
With this method I can't reach the grouped-by key inside the apply itself, so the Date is lost:
[
    {"A": 1, "B": 3, "C": 4},
    {"A": 2, "B": 3, "C": 5},
    {"A": 1, "B": 2, "C": 2}
]
Using to_dict() gives me access to the grouped-by object, but not a way to parse the result into the shape I need.
Anyone familiar with some elegant way to solve it?
Thanks!!
You could first reshape your data using df.pivot, reset the index, and then apply to_dict to the new shape with the orient parameter set to "records". So:
import pandas as pd
data = {'Date': ['1.1', '1.1', '1.1', '2.1', '2.1', '2.1', '3.1', '3.1', '3.1'],
        'Column1': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
        'Column2': [1, 3, 4, 2, 3, 5, 1, 2, 2]}
df = pd.DataFrame(data)
df_pivot = df.pivot(index='Date', columns='Column1', values='Column2')\
             .reset_index(drop=False)
result = df_pivot.to_dict('records')
target = [{'Date': '1.1', 'A': 1, 'B': 3, 'C': 4},
          {'Date': '2.1', 'A': 2, 'B': 3, 'C': 5},
          {'Date': '3.1', 'A': 1, 'B': 2, 'C': 2}]
print(result == target)
# True
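If you do want to stay with groupby, the group key is the first element of each (key, group) pair when you iterate, so a comprehension can stitch Date back in (a sketch on the same df; the pivot route above remains tidier):
result = [{'Date': date, **dict(zip(g['Column1'], g['Column2']))}
          for date, g in df.groupby('Date')]
# [{'Date': '1.1', 'A': 1, 'B': 3, 'C': 4}, ...]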

Dropping duplicates from a JSON deep structure in pandas

I am working on converting Excel data to nested JSON with a group-by that should apply to the header as well as the items.
Tried as below. I am able to apply transformation rules using pandas:
df['Header'] = df[['A', 'B']].to_dict('records')
df['Item'] = df[['A', 'C', 'D']].to_dict('records')
By this, I am able to separate the records into separate data frames.
Applying below:
data_groupedby = data.groupby(['A', 'B']).agg(list).reset_index()
result = data_groupedby[['A', 'B', 'Item']].to_json(orient='records')
This gives me the required JSON, with the header as well as a drill-down into the items as a nested structure.
With groupby, I am able to group the header fields, but the corresponding items are not grouped correctly.
Any idea how we can achieve this?
Example DS:
Excel:
A    B      C     D
100  Test1  XX10  L
100  Test1  XX10  L
100  Test1  XX20  L
101  Test2  XX10  L
101  Test2  XX20  L
101  Test2  XX20  L
Current output:
[
    {
        "A": 100,
        "B": "Test1",
        "Item": [
            {"A": 100, "C": "XX10", "D": "L"},
            {"A": 100, "C": "XX10", "D": "L"},
            {"A": 100, "C": "XX20", "D": "L"}
        ]
    },
    {
        "A": 101,
        "B": "Test2",
        "Item": [
            {"A": 101, "C": "XX10", "D": "L"},
            {"A": 101, "C": "XX20", "D": "L"},
            {"A": 101, "C": "XX20", "D": "L"}
        ]
    }
]
If you look at the Item arrays, identical values are not grouped and are repeated.
Thanks
TC
You can drop_duplicates and then groupby, then apply the to_dict transformation for columns C and D, and then clean up with reset_index and rename.
(data.drop_duplicates()
     .groupby(["A", "B"])
     .apply(lambda x: x[["C", "D"]].to_dict("records"))
     .to_frame()
     .reset_index()
     .rename(columns={0: "Item"})
     .to_dict("records"))
Output:
[{'A': 100,
  'B': 'Test1',
  'Item': [{'C': 'XX10', 'D': 'L'}, {'C': 'XX20', 'D': 'L'}]},
 {'A': 101,
  'B': 'Test2',
  'Item': [{'C': 'XX10', 'D': 'L'}, {'C': 'XX20', 'D': 'L'}]}]
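The same structure can also be built without apply by looping over the groups directly (a sketch assuming the same data frame):
deduped = data.drop_duplicates()
result = [{"A": a, "B": b, "Item": g[["C", "D"]].to_dict("records")}
          for (a, b), g in deduped.groupby(["A", "B"])]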

Iterating over a list of dictionaries in Python to find all occurrences of a pair of values

I searched for some time, but couldn't find an exact solution to my problem. I have a list of dictionaries in format:
d = [{"sender": "a", "time": 123, "receiver": "b", "amount": 2},
     {"sender": "c", "time": 124, "receiver": "b", "amount": 10},
     {"sender": "a", "time": 130, "receiver": "b", "amount": 5}]
I would like to find the best way to iterate over all the dictionaries and count how many times a given pair of sender-receiver occurs and the sum of the total amount.
So I would like to get:
result = [{"sender": "a", "receiver": "b", "count": 2, "total_amount": 7},
          {"sender": "c", "receiver": "b", "count": 1, "total_amount": 10}]
I am pretty sure I can make this work by iterating over all the dictionaries in the list one by one and saving the information in a temporary dictionary, but that would lead to a lot of nested if statements. I was hoping there is a cleaner way to do this.
I know I can use Counter to count the number of occurrences of a unique value:
from collections import Counter
Counter(val["sender"] for val in d)
which will give me:
Counter({'a': 2, 'c': 1})
but how can I do this for a pair of values and have separate dictionaries for each?
Thank you in advance and I hope my question was clear enough
This is one approach using a simple iteration with dict methods.
Ex:
d = [{"sender": "a", "time": 123, "receiver": "b", "amount": 2},
     {"sender": "c", "time": 124, "receiver": "b", "amount": 10},
     {"sender": "a", "time": 130, "receiver": "b", "amount": 5}]

result = {}
for i in d:
    key = (i['sender'], i['receiver'])
    # del i['time']  # if you do not need the `time` key
    if key not in result:
        i.update({'total_amount': i.pop('amount'), 'count': 1})
        result[key] = i
    else:
        result[key]['total_amount'] += i['amount']
        result[key]['count'] += 1

print(list(result.values()))
Output:
[{'count': 2, 'receiver': 'b', 'sender': 'a', 'time': 123, 'total_amount': 7},
 {'count': 1, 'receiver': 'b', 'sender': 'c', 'time': 124, 'total_amount': 10}]
A pure-Python way is to create a new hash table of sender:receiver pairs.
I updated it to count the total amount as requested as well.
d = [{"sender": "a", "time": 123, "receiver": "b", "amount": 2},
     {"sender": "c", "time": 124, "receiver": "b", "amount": 10},
     {"sender": "a", "time": 130, "receiver": "b", "amount": 5}]

nd = {}
for o in d:
    sender = o['sender']
    recv = o['receiver']
    amount = o['amount']
    k = sender + ":" + recv
    if k not in nd:
        nd[k] = (0, 0)
    nd[k] = (nd[k][0] + 1, nd[k][1] + amount)

print(nd)
which results in {'c:b': (1, 10), 'a:b': (2, 7)}
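Since the question already mentions Counter: its keys can be tuples, so counting the pairs directly is a one-liner (a sketch on the question's data; summing the amounts still needs a separate pass or one of the approaches above):
from collections import Counter

pair_counts = Counter((item["sender"], item["receiver"]) for item in d)
# Counter({('a', 'b'): 2, ('c', 'b'): 1})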
You could use pandas to parse the list of dictionaries into a dataframe.
The dataframe would allow you to easily sum over the amount field for certain sender receiver pairs.
import pandas as pd
dict = [{"sender": "a", "time": 123, "receiver": "b", "amount": 2},
{"sender": "c", "time": 124, "receiver": "b", "amount": 10},
{"sender": "a", "time": 130, "receiver": "b", "amount": 5}]
df = pd.DataFrame.from_records(dict)
group = df.groupby(by=['sender', 'receiver'])
result = group.sum()
result['occurrences'] = group.size()
print(result)
will output
                 time  amount  occurrences
sender receiver
a      b          253       7            2
c      b          124      10            1
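To get from there to the list-of-dicts shape the question asks for, one more chain works (a sketch, assuming the result frame above):
records = (result.drop(columns='time')
                 .rename(columns={'amount': 'total_amount', 'occurrences': 'count'})
                 .reset_index()
                 .to_dict('records'))
# [{'sender': 'a', 'receiver': 'b', 'total_amount': 7, 'count': 2},
#  {'sender': 'c', 'receiver': 'b', 'total_amount': 10, 'count': 1}]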
Max Crous's answer is more elegant than this, but in case you'd like to avoid extra libraries, here is a pure-Python way:
import collections

result = collections.defaultdict(lambda: [0, 0])
for e in d:
    result[(e['sender'], e['receiver'])][0] += e['amount']
    result[(e['sender'], e['receiver'])][1] += 1
Result is now a dictionary with (sender, receiver) tuples as keys and two-element lists [total_amount, count] as values.
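If you then want the exact list-of-dicts output from the question, a final comprehension unpacks the keys and values (my own sketch on top of the answer above):
final = [{"sender": s, "receiver": r, "count": count, "total_amount": total}
         for (s, r), (total, count) in result.items()]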
Imo the easiest and cleanest solution would be to use a defaultdict:
from collections import defaultdict
dct = [{"sender": "a", "time": 123, "receiver": "b", "amount": 2},
       {"sender": "c", "time": 124, "receiver": "b", "amount": 10},
       {"sender": "a", "time": 130, "receiver": "b", "amount": 5}]

result = defaultdict(int)
for item in dct:
    key = "{}:{}".format(item["sender"], item["receiver"])
    result[key] += item["amount"]

print(result)
Which results in
defaultdict(<class 'int'>, {'a:b': 7, 'c:b': 10})
Besides, don't call your variables dict or list.
Using a dictionary, you can set sender as the key, storing a receiver count and a running amount as the values, then increment the count and add to the amount:
dict = [{"sender": "a", "time": 123, "receiver": "b", "amount": 2},
        {"sender": "c", "time": 124, "receiver": "b", "amount": 10},
        {"sender": "a", "time": 130, "receiver": "b", "amount": 5}]
dict1 = {}
for eachitem in dict:
    if eachitem["sender"] in dict1.keys():
        dict1[eachitem["sender"]]["amount"] = dict1[eachitem["sender"]]["amount"] + eachitem["amount"]
        dict1[eachitem["sender"]]["receiver"] += 1
    else:
        dict1[eachitem["sender"]] = {"receiver": 1, "amount": eachitem["amount"]}
print(dict1)
output
{'a': {'receiver': 2, 'amount': 7}, 'c': {'receiver': 1, 'amount': 10}}

Python: List of Dictionary Mapping

I have a list of 10,000 dictionaries from a JSON that looks like:
my_list = [
    {"id": 1, "val": "A"},
    {"id": 4, "val": "A"},
    {"id": 1, "val": "C"},
    {"id": 3, "val": "C"},
    {"id": 1, "val": "B"},
    {"id": 2, "val": "B"},
    {"id": 4, "val": "C"},
    {"id": 4, "val": "B"},
    ...
    {"id": 10000, "val": "A"}
]
and I want my output to be:
mapped_list = [
    {"id": 1, "val": ["A", "B", "C"]},
    {"id": 2, "val": ["B"]},
    {"id": 3, "val": ["C"]},
    {"id": 4, "val": ["A", "B", "C"]},
    ...
    {"id": 10000, "val": ["A", "C"]}
]
My goal is to map the first list's "id" to its "val" to create the second list as efficiently as possible. So far my running time has not been the greatest:
output = []
cache = {}
for unit in my_list:
    uid = unit['id']
    value = unit['val']
    if uid in cache:
        output[uid][value].append(value)
    else:
        cache[uid] = 1
        output.append({'id': uid, 'values': value})
My approach is to make a frequency check of the 'id' to avoid iterating through two different lists. I believe my fault is in understanding nested dicts/lists of dicts. I have a feeling I can get this in O(n), if not better; O(n^2) is out of the question since it's too easy for this to grow in magnitude.
Brighten my insight PLEASE, I could use the help.
Or any other way of approaching this problem.
Maybe map(), zip(), tuple() might be a better approach for this. Let me know!
EDIT: I'm trying to accomplish this with only built-in functions. Also, the last dictionary is there to show that the data is not limited to what I have displayed: there are more "id"s than I can share, with "val" being some combination of A, B, C for whatever id it's associated with.
UPDATE:
This is my final solution; if there are any improvements, let me know!
mapped_list = []
cache = {}
for item in my_list:
    id = item['id']
    val = item['val']
    if id in cache:
        mapped_list[cache[id]]['val'].append(val)
    else:
        cache[id] = len(mapped_list)
        mapped_list.append({'id': id, 'val': [val]})

mapped_list.sort(key=lambda k: k['id'])
print(mapped_list)
my_list = [
    {"id": 1, "val": "A"},
    {"id": 4, "val": "A"},
    {"id": 1, "val": "C"},
    {"id": 3, "val": "C"},
    {"id": 1, "val": "B"},
    {"id": 2, "val": "B"},
    {"id": 4, "val": "C"},
    {"id": 4, "val": "B"},
    {"id": 10000, "val": "A"}
]
temp_dict = {}
for item in my_list:
    n, q = item.values()  # relies on dicts preserving insertion order ('id', 'val')
    if n not in temp_dict:
        temp_dict[n] = []
    temp_dict[n].append(q)
mapped_list = [{'id': n, 'val': q} for n, q in temp_dict.items()]
mapped_list = sorted(mapped_list, key=lambda x: x['id'])
print(mapped_list)
If the same val can occur more than once for an id, you can use a set like this:
my_list = [
    {"id": 1, "val": "A"},
    {"id": 4, "val": "A"},
    {"id": 1, "val": "C"},
    {"id": 3, "val": "C"},
    {"id": 1, "val": "B"},
    {"id": 2, "val": "B"},
    {"id": 4, "val": "C"},
    {"id": 4, "val": "B"},
    {"id": 10000, "val": "A"}
]
from collections import defaultdict

ddict = defaultdict(set)
for lst in my_list:
    ddict[lst['id']].add(lst['val'])

result = [{"id": k, "val": list(v)} for k, v in ddict.items()]
result = sorted(result, key=lambda x: x['id'])
[{'id': 1, 'val': ['C', 'A', 'B']},
{'id': 2, 'val': ['B']},
{'id': 3, 'val': ['C']},
{'id': 4, 'val': ['C', 'A', 'B']},
{'id': 10000, 'val': ['A']}]
Insertion and lookup in a dict (or defaultdict) and a set are O(1), and the sort is O(N log N), so overall it is O(N + N log N).
You could just use collections.defaultdict, like:
>>> my_list
[{'id': 1, 'val': 'A'}, {'id': 4, 'val': 'A'}, {'id': 1, 'val': 'C'}, {'id': 3, 'val': 'C'}, {'id': 1, 'val': 'B'}, {'id': 2, 'val': 'B'}, {'id': 4, 'val': 'C'}, {'id': 4, 'val': 'B'}, {'id': 10000, 'val': 'A'}]
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for item in my_list:
...     d[item['id']].append(item['val'])
...
>>> mapped_list = [{'id': key, 'val': val} for key,val in d.items()]
>>> mapped_list = sorted(mapped_list, key=lambda x: x['id']) # just to make it always sorted by `id`
>>> import pprint
>>> pprint.pprint(mapped_list)
[{'id': 1, 'val': ['A', 'C', 'B']},
{'id': 2, 'val': ['B']},
{'id': 3, 'val': ['C']},
{'id': 4, 'val': ['A', 'C', 'B']},
{'id': 10000, 'val': ['A']}]
I think you won't be able to do it better than O(n*log(n)):
from collections import defaultdict

vals = defaultdict(list)
my_list.sort(key=lambda x: x['val'])  # pre-sort by val so each id's list comes out sorted
for i in my_list:
    vals[i['id']].append(i['val'])

output = [{'id': k, 'val': v} for k, v in vals.items()]
output.sort(key=lambda x: x['id'])
Output:
[{'id': 1, 'val': ['A', 'B', 'C']},
{'id': 2, 'val': ['B']},
{'id': 3, 'val': ['C']},
{'id': 4, 'val': ['A', 'B', 'C']},
{'id': 10000, 'val': ['A']}]
I created mapped_list using setdefault:
d = {}
for i in my_list:
    d.setdefault(i['id'], []).append(i['val'])

mapped_list = [{'id': key, 'val': val} for key, val in sorted(d.items())]
print(mapped_list)
defaultdict performs better than setdefault; I just wrote this answer to show another approach to creating mapped_list.
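If you want to check that performance claim yourself, a quick timeit comparison works (an illustrative sketch; the workload is arbitrary and timings vary by machine):
import timeit
from collections import defaultdict

items = [(i % 100, i) for i in range(10_000)]

def with_setdefault():
    d = {}
    for k, v in items:
        d.setdefault(k, []).append(v)

def with_defaultdict():
    d = defaultdict(list)
    for k, v in items:
        d[k].append(v)

print("setdefault :", timeit.timeit(with_setdefault, number=100))
print("defaultdict:", timeit.timeit(with_defaultdict, number=100))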
