I have a list of dictionaries, each containing a single key whose value is another dictionary, like so:
Array = [
{'Example1': {'Time Taken': 56, 'Type': 'Quiz'} },
{'Example1': {'Time Taken': 58, 'Type': 'Exam'} },
{'Example2': {'Time Taken': 40, 'Type': 'Quiz'} } ]
I want to iterate through the list and obtain only unique keys, with the values merged together into one list as such:
{ 'Example1': [
{ 'Time Taken': 56, 'Type': 'Quiz' },
{ 'Time Taken': 58, 'Type': 'Exam' } ] }
{ 'Example2': [ { 'Time Taken': 40, 'Type': 'Quiz' } ] }
Any idea on how to go about this? I've tried a lot of different things, but can't seem to find an efficient way to write this code. Any feedback appreciated.
Using collections.defaultdict
Ex:
from collections import defaultdict
Array = [ {"Example1": {"Time Taken": 56, "Type": "Quiz"} }, {"Example1": {"Time Taken": 58, "Type": "Exam"} }, {"Example2": {"Time Taken": 40, "Type": "Quiz"} } ]
result = defaultdict(list)
for ar in Array:              # iterate over each element in the list
    for k, v in ar.items():   # iterate over the dict's items
        result[k].append(v)   # build the key -> list mapping
print(result)
Output:
defaultdict(<class 'list'>, {'Example1': [{'Time Taken': 56, 'Type': 'Quiz'}, {'Time Taken': 58, 'Type': 'Exam'}], 'Example2': [{'Time Taken': 40, 'Type': 'Quiz'}]})
You can iterate over the dicts and use dict.setdefault to insert an empty list when the key is missing, then append the value to that list:
for dct in Array:
    for k, v in dct.items():
        out.setdefault(k, []).append(v)
out is the desired output dict.
Example:
In [1208]: arr = [ {'Example1': {'Time Taken': 56, 'Type': 'Quiz'} }, {'Example1': {'Time Taken': 58, 'Type': 'Exam'} }, {'Example2': {'Time Taken': 40, 'Type': 'Quiz'} } ]
In [1209]: out = {}
In [1210]: for dct in arr:
      ...:     for k, v in dct.items():
      ...:         out.setdefault(k, []).append(v)
      ...:
In [1211]: out
Out[1211]:
{'Example1': [{'Time Taken': 56, 'Type': 'Quiz'},
{'Time Taken': 58, 'Type': 'Exam'}],
'Example2': [{'Time Taken': 40, 'Type': 'Quiz'}]}
Try the following:
from collections import defaultdict
# l is list of dictionaries
d = defaultdict(list)
for x in l:
    for y in x:
        d[y].append(x[y])
print(d)
Here is a version that uses no external modules:
Array = [{'Example1': {'Time Taken': 56, 'Type': 'Quiz'} },
{'Example1': {'Time Taken': 58, 'Type': 'Exam'} },
{'Example2': {'Time Taken': 40, 'Type': 'Quiz'} }]
result = {}
for entry in Array:   # iterate directly; avoids shadowing the built-in name "dict"
    for key in entry:
        if key in result:
            result[key].append(entry[key])
        else:
            result[key] = [entry[key]]
print(result)
Output:
{
'Example1': [{'Time Taken': 56, 'Type': 'Quiz'},
{'Time Taken': 58, 'Type': 'Exam'}],
'Example2': [{'Time Taken': 40, 'Type': 'Quiz'}]
}
I have a dictionary of dictionaries.
Sample:
keyList = ['0','1','2']
valueList = [{'Name': 'Nick', 'Age': 39, 'Country': 'UK'}, {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}, {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}]
d = {}
for i in range(len(keyList)):
    d[keyList[i]] = valueList[i]
Output:
{'0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'}, '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}, '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}
I want to do two things:
Filter by a single string or int within one of the inner values, e.g. Name, ignoring case, i.e. remove any key/value pair where that string/int is found. So if 'Nick' is found in Name, remove the key '0' and its value completely:
{'1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}, '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}
The same as above, but with a list of strings instead, i.e. filter out and remove any keys where any of the strings ["uK", "Italy", "New Zealand"] appear in Country, ignoring case:
{'1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}}
I was hoping the code below would work for a single string, but I think it only works on a flat dictionary rather than a dictionary of dictionaries, so it's not working for me:
filtered_d = {k: v for k, v in d.items() if "nick".casefold() not in v["Name"]}
Any suggestions? Many thanks
Assuming there is one level of nesting in the dictionary (not a dictionary of dictionaries of dictionaries), you could use the following function which iterates over the keys and filters as per the supplied values:
from typing import List

def remove_from_dict(key_name: str, values: List[str], dictionary: dict):
    values = [value.casefold() for value in values]
    filtered_dict = {
        key: inner_dict
        for key, inner_dict in dictionary.items()
        if inner_dict[key_name].casefold() not in values
    }
    return filtered_dict
dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
}
# Output: {'1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}, '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}
print(remove_from_dict("Name", ["Nick"], dictionary))
# Output: {'1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}}
print(remove_from_dict("Country", ["uK", "Italy", "New Zealand"], dictionary))
Update:
If we want to account for partial matches, we can use the re module.
import re
from typing import List, Optional

dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
}

def remove_from_dict(
    key_name: str,
    values: List[str],
    dictionary: dict,
    use_regex: Optional[bool] = False,
):
    values = [value.casefold() for value in values]
    regular_comparator = lambda string: string.casefold() not in values
    # If the string partially matches anything in the list,
    # we need to discard that dictionary.
    regex_comparator = lambda string: not any(
        re.match(value, string.casefold()) for value in values
    )
    comparator = regex_comparator if use_regex else regular_comparator
    filtered_dict = {
        key: inner_dict
        for key, inner_dict in dictionary.items()
        if comparator(inner_dict[key_name])
    }
    return filtered_dict
# Output: {}, all dictionaries removed
print(remove_from_dict("Country", ["uK", "Spa"], dictionary, use_regex=True))
I have the following JSON object, in which I need to post-process some labels:
{
    'id': '123',
    'type': 'A',
    'fields': {
        'device_safety': {
            'cost': 0.237,
            'total': 22
        },
        'device_unit_replacement': {
            'cost': 0.262,
            'total': 7
        },
        'software_generalinfo': {
            'cost': 3.6,
            'total': 10
        }
    }
}
I need to split the names of labels by _ to get the following hierarchy:
{
    'id': '123',
    'type': 'A',
    'fields': {
        'device': {
            'safety': {
                'cost': 0.237,
                'total': 22
            },
            'unit': {
                'replacement': {
                    'cost': 0.262,
                    'total': 7
                }
            }
        },
        'software': {
            'generalinfo': {
                'cost': 3.6,
                'total': 10
            }
        }
    }
}
This is my current version, but I got stuck and am not sure how to deal with the hierarchy of fields:
import json

json_object = json.load(raw_json)
newjson = {}
for x, y in json_object['fields'].items():
    hierarchy = x.split("_")   # split the label name, not the value
    if len(hierarchy) > 1:
        for k in hierarchy:
            newjson[k] = ????
newjson = json.dumps(newjson, indent=4)
Here is a recursive function that will process a dict and split the keys:
def splitkeys(dct):
    if not isinstance(dct, dict):
        return dct
    new_dct = {}
    for k, v in dct.items():
        bits = k.split('_')
        d = new_dct
        for bit in bits[:-1]:
            d = d.setdefault(bit, {})
        d[bits[-1]] = splitkeys(v)
    return new_dct
>>> splitkeys(json_object)
{'fields': {'device': {'safety': {'cost': 0.237, 'total': 22},
'unit': {'replacement': {'cost': 0.262, 'total': 7}}},
'software': {'generalinfo': {'cost': 3.6, 'total': 10}}},
'id': '123',
'type': 'A'}
I'm working on a script for migrating data from MongoDB to ClickHouse. Because nested structures aren't implemented well enough in ClickHouse, I iterate over the nested structure and flatten it, so that every element of the nested structure becomes a distinct row in the ClickHouse database.
What I do is iterate over a list of dictionaries and take the target values. The structure looks like this:
[
    {
        'Comment': None,
        'Details': None,
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'Новый',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Новые',
            'Order': 0,
            '_id': 'newStage'
        },
        'Tags': None,
        'Type': 'Unknown',
        'Weight': 120,
        '_id': 'new'
    },
    {
        'Comment': None,
        'Details': {
            'Name': 'взят в работу',
            '_id': 1
        },
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'В работе',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Приглашение на интервью',
            'Order': 1,
            '_id': 'recruiterStage'
        },
        'Tags': None,
        'Type': 'InProgress',
        'Weight': 80,
        '_id': 'phoneInterview'
    }
]
I have a function that does this on a dataframe object via the data.iterrows() method:
def to_flat(data, coldict, field_last_upd):
    m_status_history = stc.special_mongo_names['status_history_cand']
    n_statuse_change = coldict['n_statuse_change']['name']
    data[n_statuse_change] = n_status_change(dp.force_take_series(data, m_status_history))
    flat_cols = [x for x in coldict.values() if x['coltype'] == stc.COLTYPE_FLAT]
    old_cols_names = [x['name'] for x in coldict.values() if x['coltype'] == stc.COLTYPE_PREPARATION]
    t_time = time.time()
    t_len = 0
    new_rows = list()
    for j in range(row[n_statuse_change]):
        t_new_value_row = np.empty(shape=[0, 0])
        for k in range(len(flat_cols)):
            if flat_cols[k]['colsubtype'] == stc.COLSUBTYPE_FLATPATH:
                new_value = dp.under_value_line(
                    row,
                    path_for_status(j, row[n_statuse_change]-1, flat_cols[k]['path'])
                )
                # Additionally process the date fields
                if flat_cols[k]['name'] == coldict['status_set_at']['name']:
                    new_value = dp.iso_date_to_datetime(new_value)
                if flat_cols[k]['name'] == coldict['status_set_at_mil']['name']:
                    new_value = dp.iso_date_to_miliseconds(new_value)
                if flat_cols[k]['name'] == coldict['status_stage_order']['name']:
                    try:
                        new_value = int(new_value)
                    except:
                        new_value = new_value
            else:
                if flat_cols[k]['name'] == coldict['status_index']['name']:
                    new_value = j
            t_new_value_row = np.append(t_new_value_row, dp.some_to_null(new_value))
        new_rows.append(np.append(row[old_cols_names].values, t_new_value_row))
    pdb.set_trace()
    res = pd.DataFrame(new_rows, columns=[
        x['name'] for x in coldict.values()
        if x['coltype'] == stc.COLTYPE_FLAT or x['coltype'] == stc.COLTYPE_PREPARATION
    ])
    return res
It takes values from the list of dicts, prepares them to meet ClickHouse's requirements using numpy arrays, and then appends them all together into a new dataframe with the target values and column names.
I've noticed that if the nested structure is big enough, it starts working much slower. I found an article where different methods of iteration in Python are compared.
It is claimed that iterating via the .apply() method is much faster, and vectorization faster still. But the samples given are pretty trivial and rely on applying the same function to all of the values. Is it possible to iterate over a pandas object in a faster manner while applying a variety of functions to different types of data?
I think your first step should be converting your data into a pandas dataframe; then it will be much easier to handle. I couldn't decipher the exact functions you wanted to run, but perhaps my example helps:
import datetime
import pandas as pd
data_dict_array = [
    {
        'Comment': None,
        'Details': None,
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'Новый',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Новые',
            'Order': 0,
            '_id': 'newStage'
        },
        'Tags': None,
        'Type': 'Unknown',
        'Weight': 120,
        '_id': 'new'
    },
    {
        'Comment': None,
        'Details': {
            'Name': 'взят в работу',
            '_id': 1
        },
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'В работе',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Приглашение на интервью',
            'Order': 1,
            '_id': 'recruiterStage'
        },
        'Tags': None,
        'Type': 'InProgress',
        'Weight': 80,
        '_id': 'phoneInterview'
    }
]
# converting your data into something pandas can read,
# in particular flattening the Stage dict
for data_dict in data_dict_array:
    d_temp = data_dict.pop("Stage")
    data_dict["Stage_Label"] = d_temp["Label"]
    data_dict["Stage_Order"] = d_temp["Order"]
    data_dict["Stage_id"] = d_temp["_id"]
df = pd.DataFrame(data_dict_array)
# let's say I want to set Comment to "cool" if Name is 'В работе'.
# In .loc[], the first argument filters the rows, the second picks the column
df.loc[df['Name'] == 'В работе', 'Comment'] = "cool"
df
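On the iteration-speed question: once the data is flat, row-wise Python loops can often be replaced by .apply(), or better, by vectorized column operations. A minimal sketch of the difference, using a toy dataframe modeled on the example above (the column WeightPlusOrder is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Новый", "В работе"],
    "Weight": [120, 80],
    "Stage_Order": [0, 1],
})

# Row-wise apply: flexible, but still a Python-level loop under the hood
df["WeightPlusOrder"] = df.apply(lambda r: r["Weight"] + r["Stage_Order"], axis=1)

# Vectorized equivalent: operates on whole columns at once, usually much faster
df["WeightPlusOrder2"] = df["Weight"] + df["Stage_Order"]

print(df["WeightPlusOrder"].tolist())  # [120, 81]
```

When different columns need different transformations, applying a vectorized operation per column is usually still faster than a single .iterrows() loop that dispatches on type per row.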
While writing a Python script for the API in MongoDB, we have:
new_posts = [{ 'name': 'A', 'age': 17, 'marks': 97, 'school': 'School1' },
{ 'name': 'B', 'age': 18, 'marks': 95, 'school': 'School2' },
{ 'name': 'C', 'age': 19, 'marks': 97, 'school': 'School2' }]
db.posts.insert( new_posts )
We create an index as follows:
db.posts.create_index([('name',1),('school',1)],unique=True)
Now we perform two operations:
db.posts.update({ 'name':'A', 'age': 17, 'school': 'School3' },
{ 'name':'D', 'age': 17, 'marks': 70, 'school': 'School1' },
upsert=True )
db.posts.update({ 'name':'A', 'age': 17, 'school': 'School1' },
{ 'name':'A', 'age': 17, 'marks': 60, 'school': 'School1' },
upsert=True )
What does update() return here? How can we find out whether a new document was inserted into the db or an existing document was updated?
Can we do something like:
post1 = db.posts.update({ 'name':'A', 'age': 17, 'school': 'School3' },
{ 'name':'D', 'age': 17, 'marks': 70, 'school': 'School1' },
upsert=True )
post2 = db.posts.update({ 'name':'A', 'age': 17, 'school': 'School1' },
{ 'name':'A', 'age': 17, 'marks': 60, 'school': 'School1' },
upsert=True )
print(post1)
print(post2)
As the docs for update say, the method returns:
A document (dict) describing the effect of the update or None if write acknowledgement is disabled.
Just try it and print the return value to see what's available. You'll see something like:
{u'syncMillis': 0, u'ok': 1.0, u'err': None, u'writtenTo': None,
u'connectionId': 190, u'n': 1, u'updatedExisting': True}
The updatedExisting field is what you're looking for.
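For illustration, here is one way to interpret that returned document. The result dicts below are sample values in the shape shown above, not output from a live server, and describe_update is a hypothetical helper; when an upsert inserts a new document, the server's result also carries an upserted field with the new _id:

```python
def describe_update(res):
    """Interpret a legacy update() result: update, upsert-insert, or no match."""
    if res.get('updatedExisting'):
        return 'updated existing document'
    if 'upserted' in res:
        return 'inserted new document with _id %s' % res['upserted']
    return 'no document matched'

# Sample result of an update that modified an existing document
print(describe_update({'ok': 1.0, 'n': 1, 'updatedExisting': True}))
# updated existing document

# Sample result of an upsert that inserted a new document
print(describe_update({'ok': 1.0, 'n': 1, 'updatedExisting': False,
                       'upserted': 'someid'}))
# inserted new document with _id someid
```

In newer PyMongo versions, update() is replaced by update_one()/update_many(), which return an UpdateResult whose matched_count and upserted_id fields answer the same question.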
I have a list of dictionaries and I wanted to group the data. I used the following:
group_list = []
for key, items in itertools.groupby(res, operator.itemgetter('dept')):
    group_list.append({key: list(items)})
For data that looks like this
[{'dept': 1, 'age': 10, 'name': 'Sam'},
 {'dept': 1, 'age': 12, 'name': 'John'},
 ...
 {'dept': 2, 'age': 20, 'name': 'Mary'},
 {'dept': 2, 'age': 11, 'name': 'Mark'},
 {'dept': 2, 'age': 11, 'name': 'Tom'}]
the output would be:
[{1: [{'dept': 1, 'age': 10, 'name': 'Sam'},
      {'dept': 1, 'age': 12, 'name': 'John'}]},
 {2: [{'dept': 2, 'age': 20, 'name': 'Mary'},
      {'dept': 2, 'age': 11, 'name': 'Mark'},
      {'dept': 2, 'age': 11, 'name': 'Tom'}]},
 ...]
Now if I want to group using multiple keys say 'dept' and 'age', the above mentioned method returns
[{(2, 20): [{'age': 20, 'dept': 2, 'name': 'Mary'}]},
{(2, 11): [{'age': 11, 'dept': 2, 'name': 'Mark'},
{'age': 11, 'dept': 2, 'name': 'Tom'}]},
{(1, 10): [{'age': 10, 'dept': 1, 'name': 'Sam'}]},
{(1, 12): [{'age': 12, 'dept': 1, 'name': 'John'}]}]
The desired output would be:
[
    {
        2: {
            20: [
                {'age': 20, 'dept': 2, 'name': 'Mary'}
            ],
            11: [
                {'age': 11, 'dept': 2, 'name': 'Mark'},
                {'age': 11, 'dept': 2, 'name': 'Tom'}
            ]
        }
    },
    {
        1: {
            10: [
                {'age': 10, 'dept': 1, 'name': 'Sam'}
            ],
            12: [
                {'age': 12, 'dept': 1, 'name': 'John'}
            ]
        }
    }
]
Can it be done with itertools? Or do I need to write that code myself?
Absolutely. Group by the first key, then apply itertools.groupby() again within each group for the second key. Remember that groupby only groups consecutive items, so the data must be sorted by the same keys first.
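A minimal sketch of that idea, assuming the input is already sorted by 'dept' and then 'age' (groupby only collects consecutive runs):

```python
import itertools
import operator

l = [{'dept': 1, 'age': 10, 'name': 'Sam'},
     {'dept': 1, 'age': 12, 'name': 'John'},
     {'dept': 2, 'age': 20, 'name': 'Mary'},
     {'dept': 2, 'age': 11, 'name': 'Mark'},
     {'dept': 2, 'age': 11, 'name': 'Tom'}]

# Outer grouping by 'dept'; inner grouping by 'age' within each dept group.
# The inner groupby must be consumed before advancing the outer iterator.
nested = [
    {dept: {age: list(items)
            for age, items in itertools.groupby(group, operator.itemgetter('age'))}}
    for dept, group in itertools.groupby(l, operator.itemgetter('dept'))
]
print(nested)
```

This yields one {dept: {age: [rows]}} dict per department, matching the desired nested shape for two keys.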
You would need to write a (probably recursive) bit of code to do this yourself - itertools doesn't have a tree-builder in it.
Thanks everyone for your help. Here is how I did it:
import itertools, operator

l = [{'dept': 1, 'age': 10, 'name': 'Sam'},
     {'dept': 1, 'age': 12, 'name': 'John'},
     {'dept': 2, 'age': 20, 'name': 'Mary'},
     {'dept': 2, 'age': 11, 'name': 'Mark'},
     {'dept': 2, 'age': 11, 'name': 'Tom'}]

groups = ['dept', 'age', 'name']
groups.reverse()

def hierarchical_data(data, groups):
    g = groups[-1]
    g_list = []
    for key, items in itertools.groupby(data, operator.itemgetter(g)):
        g_list.append({key: list(items)})
    groups = groups[0:-1]
    if len(groups) != 0:
        for e in g_list:
            for k, v in e.items():
                e[k] = hierarchical_data(v, groups)
    return g_list

print(hierarchical_data(l, groups))