I am working on XBRL document parsing. I got to a point where I have a large dict structured like this:
sample of a dictionary I'm working on
Since it's a bit challenging to describe the pattern of what I'm trying to achieve, I've just put an example of what I'd like it to be:
sample of what I'm trying to achieve
Since I'm fairly new to programming, I've been struggling with this for days, trying different approaches with loops and list and dict comprehensions, starting from here:
for k in storage_gaap:
    if 'context_ref' in storage_gaap[k]:
        for _k in storage_gaap[k]['context_ref']:
            storage_gaap[k]['context_ref'] = {_k}
storage_gaap is the master dictionary. Sorry for attaching pictures, but it's just much clearer to see the dictionary that way.
I'd really appreciate any help.
Here's a solution using zip and a dictionary comprehension, applied to toy data with a similar structure.
import itertools
import pprint
# Sample data similar to provided screenshots
data = {
    'a': {
        'id': 'a',
        'vals': ['a1', 'a2', 'a3'],
        'val_num': [1, 2, 3]
    },
    'b': {
        'id': 'b',
        'vals': ['b1', 'b2', 'b3'],
        'val_num': [4, 5, 6]
    }
}
# Takes a tuple of keys and a list of tuples of values, and transforms them into a list of dicts,
# i.e. ('id', 'val'), [('a', 1), ('b', 2)] => [{'id': 'a', 'val': 1}, {'id': 'b', 'val': 2}]
def get_list_of_dict(keys, list_of_tuples):
    return [dict(zip(keys, values)) for values in list_of_tuples]

def process_dict(key, values):
    # Transform the dict with lists of values into a list of dicts
    list_of_dicts = get_list_of_dict(
        ('id', 'val', 'val_num'),
        zip(itertools.repeat(key, len(values['vals'])), values['vals'], values['val_num']))
    # Dictionary comprehension to group them based on the 'val' property of each dict
    return {d['val']: {k: v for k, v in d.items() if k != 'val'} for d in list_of_dicts}
# Reorganize to put dict under a 'context_values' key
processed = {k: {'context_values': process_dict(k, v)} for k,v in data.items()}
# {'a': {'context_values': {'a1': {'id': 'a', 'val_num': 1},
# 'a2': {'id': 'a', 'val_num': 2},
# 'a3': {'id': 'a', 'val_num': 3}}},
# 'b': {'context_values': {'b1': {'id': 'b', 'val_num': 4},
# 'b2': {'id': 'b', 'val_num': 5},
# 'b3': {'id': 'b', 'val_num': 6}}}}
pprint.pprint(processed)
OK, here is the updated solution for my case. The catch for me was the zip function, since it only iterates up to the length of the shortest iterable passed to it. The solution was itertools.cycle. Here is the code:
import itertools
import pprint

data = {'us-gaap_WeightedAverageNumberOfDilutedSharesOutstanding': {
    'context_ref': ['D20210801-20220731',
                    'D20200801-20210731',
                    'D20190801-20200731',
                    'D20210801-20220731',
                    'D20200801-20210731',
                    'D20190801-20200731'],
    'decimals': ['-5', '-5', '-5', '-5', '-5', '-5'],
    'id': ['us-gaap:WeightedAverageNumberOfDilutedSharesOutstanding'],
    'master_id': ['us-gaap_WeightedAverageNumberOfDilutedSharesOutstanding'],
    'unit_ref': ['shares', 'shares', 'shares', 'shares', 'shares', 'shares'],
    'value': ['98500000',
              '96400000',
              '96900000',
              '98500000',
              '96400000',
              '96900000']}}
def get_list_of_dict(keys, list_of_tuples):
    return [dict(zip(keys, values)) for values in list_of_tuples]

def process_dict(k, values):
    list_of_dicts = get_list_of_dict(
        ('context_ref', 'decimals', 'id', 'master_id', 'unit_ref', 'value'),
        zip(values['context_ref'], values['decimals'], itertools.cycle(values['id']),
            itertools.cycle(values['master_id']), values['unit_ref'], values['value']))
    return {d['context_ref']: {k: v for k, v in d.items() if k != 'context_ref'} for d in list_of_dicts}

processed = {k: {'context_values': process_dict(k, v)} for k, v in data.items()}
pprint.pprint(processed)
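The zip truncation behaviour mentioned above can be seen in a minimal sketch (toy lists, not the XBRL data):

```python
import itertools

long_list = [1, 2, 3, 4]
short_list = ['x']  # e.g. a single 'id' repeated for every row

# zip stops at the shortest iterable, silently dropping rows
print(list(zip(long_list, short_list)))  # [(1, 'x')]

# itertools.cycle repeats the short list forever,
# so zip now runs to the end of the longest finite iterable
print(list(zip(long_list, itertools.cycle(short_list))))
# [(1, 'x'), (2, 'x'), (3, 'x'), (4, 'x')]
```

This is why the single-element 'id' and 'master_id' lists are wrapped in itertools.cycle while the six-element lists are passed to zip directly.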
EDIT: I realized I made a mistake in my data structure and updated it.
I have an ordered dict with some values as lists and some as lists of lists:
odict = OrderedDict([('A', [33]),
('B',
[[{'AA': 'DOG', 'BB': '2'},
{'AA': 'CAT', 'BB': '1'}]]),
('C',
[['01','012']])
])
I am currently manually adding the keys and values in this function:
def odict_to_listdict(odict, key1: str, key2: str, key3: str) -> list:
    return [{key1: v1, key2: v2, key3: v3} for v1, v2, v3 in
            zip(odict[key1], odict[key2], odict[key3])]

odict_to_listdict(odict, 'A', 'B', 'C')
that gives me the expected output of a list of dictionaries:
[{'A': 33,
'B': [{'AA': 'DOG', 'BB': '2'}, {'AA': 'CAT', 'BB': '1'}],
'C': ['01', '012']}]
I plan to add more keys and values. How do I iterate through the ordered dict without explicitly typing the keys, while maintaining the expected output? For example:
[{'A': 33,
  'B': [{'AA': 'DOG', 'BB': '2'}, {'AA': 'CAT', 'BB': '1'}],
  'C': ['01', '012'],
  'D': 42,
  'E': [{'A': 1}]}]
First off, from Python 3.7+ you can use a normal dict, since it maintains insertion order.
You can re-write your function like:
def odict_to_listdict(odict):
    keys = list(odict)
    return [dict(zip(keys, i)) for i in zip(*odict.values())]
Test:
from collections import OrderedDict
odict = OrderedDict()
odict['key1'] = ['1', '2']
odict['key2'] = ['Apple', 'Orange']
odict['key3'] = ['bla', 'bla']
def odict_to_listdict(odict):
    keys = list(odict)
    return [dict(zip(keys, i)) for i in zip(*odict.values())]
print(odict_to_listdict(odict))
output:
[{'key1': '1', 'key2': 'Apple', 'key3': 'bla'}, {'key1': '2', 'key2': 'Orange', 'key3': 'bla'}]
Explanation:
You can get the keys from the dictionary itself; there is no need to pass them to the function.
The iteration part, zip(*odict.values()), iterates through the odict's value lists in parallel.
For the expression part of the list comprehension, you zip the keys with each tuple of values produced by the iteration part.
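The zip(*...) transposition at the heart of this can be seen in isolation (toy lists, separate from the example above):

```python
values = [['1', '2'], ['Apple', 'Orange'], ['bla', 'bla']]

# zip(*values) transposes: the i-th tuple collects the i-th element of every list
print(list(zip(*values)))
# [('1', 'Apple', 'bla'), ('2', 'Orange', 'bla')]
```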
I have a list of JSON objects, already sorted (by time let's say). Each JSON object has type and status. For example:
[
{'type': 'A', 'status': 'ok'},
{'type': 'B', 'status': 'ok'},
{'type': 'A', 'status': 'fail'}
]
I'd like to convert it to:
{
'A': ['ok', 'fail'],
'B': ['ok']
}
So of course it's an easy task, but I'm looking for a Pythonic way of doing it that will spare me a few lines of code.
I don't know if there is a one-liner, but you can make use of setdefault or defaultdict to achieve the desired result:
data = [
{'type': 'A', 'status': 'ok'},
{'type': 'B', 'status': 'ok'},
{'type': 'A', 'status': 'fail'}
]
Using setdefault():
res = {}
for elt in data:
    res.setdefault(elt['type'], []).append(elt['status'])
Output:
{'A': ['ok', 'fail'], 'B': ['ok']}
Using defaultdict:
from collections import defaultdict
res = defaultdict(list)
for elt in data:
    res[elt['type']].append(elt['status'])
Output:
defaultdict(<class 'list'>, {'A': ['ok', 'fail'], 'B': ['ok']})
Since your new structure also depends on older (already processed) keys, I doubt you can use a dictionary comprehension (which operates on a per-key basis) to achieve the required final state.
I would prefer to use a defaultdict where each missing key defaults to a list.
from collections import defaultdict
new_data = defaultdict(list)
for item in data:
    new_data[item['type']].append(item['status'])
print(new_data)
which gives output:
defaultdict(<class 'list'>, {'A': ['ok', 'fail'], 'B': ['ok']})
A one-liner with multiple list comprehensions and sorted(set(…)) to remove the duplicate keys:
original = [
{"type": "A", "status": "ok"},
{"type": "B", "status": "ok"},
{"type": "A", "status": "fail"},
]
result = {
key: [x["status"] for x in original if x["type"] == key]
for key in sorted(set([x["type"] for x in original]))
}
Output:
{'A': ['ok', 'fail'], 'B': ['ok']}
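For completeness, itertools.groupby also gets close to a one-liner, though it only groups consecutive items, so the list must first be sorted by type (a sketch, not taken from the answers above):

```python
from itertools import groupby
from operator import itemgetter

original = [
    {"type": "A", "status": "ok"},
    {"type": "B", "status": "ok"},
    {"type": "A", "status": "fail"},
]

# groupby only merges adjacent items, so sort by the grouping key first;
# the sort is stable, so the time order within each type is preserved
result = {
    key: [item["status"] for item in group]
    for key, group in groupby(sorted(original, key=itemgetter("type")),
                              key=itemgetter("type"))
}
print(result)  # {'A': ['ok', 'fail'], 'B': ['ok']}
```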
I have two dictionaries:
dict1 = {'a': '2', 'b': '10'}
dict2 = {'a': '25', 'b': '7'}
I need to save all the values for same key in a new dictionary.
The best I can do so far is: defaultdict(<class 'list'>, {'a': ['2', '25'], 'b': ['10', '7']})
from collections import defaultdict

dd = defaultdict(list)
for d in (dict1, dict2):
    for key, value in d.items():
        dd[key].append(value)

print(dd)
That does not fully resolve the problem, since the desirable result is:
a = {'dict1':'2', 'dict2':'25'}
b = {'dict1': '10', 'dict2': '7'}
Also, I would like to use the initial dictionary names as the keys of the new dictionaries.
Your main problem is that you're trying to cross the implementation boundary between a string value and a variable name. This is almost always bad design. Instead, start with all of your labels as string data:
table = {
"dict1": {'a': '2', 'b': '10'},
"dict2": {'a': '25', 'b': '7'}
}
... or, in terms of your original post:
table = {
"dict1": dict1,
"dict2": dict2
}
From here, you should be able to invert the levels to obtain
invert = {
    "a": {'dict1': '2', 'dict2': '25'},
    "b": {'dict1': '10', 'dict2': '7'}
}
Is that enough to get your processing where it needs to be? Keeping the data in comprehensive dicts like this will make it easier to iterate through the sub-dicts as needed.
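A sketch of that inversion with a dict comprehension (assuming, as above, that every sub-dict has the same keys):

```python
table = {
    "dict1": {'a': '2', 'b': '10'},
    "dict2": {'a': '25', 'b': '7'}
}

# For each inner key ('a', 'b'), collect that key's value from every named dict
invert = {
    inner: {name: d[inner] for name, d in table.items()}
    for inner in next(iter(table.values()))
}
print(invert)
# {'a': {'dict1': '2', 'dict2': '25'}, 'b': {'dict1': '10', 'dict2': '7'}}
```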
As #Prune suggested, structuring your result as a nested dictionary will be easier:
{'a': {'dict1': '2', 'dict2': '25'}, 'b': {'dict1': '10', 'dict2': '7'}}
Which could be achieved with a dict comprehension:
{k: {"dict%d" % i: v2 for i, v2 in enumerate(v1, start=1)} for k, v1 in dd.items()}
If you prefer doing it without a comprehension, you could do this instead:
result = {}
for k, v1 in dd.items():
    inner_dict = {}
    for i, v2 in enumerate(v1, start=1):
        inner_dict["dict%d" % i] = v2
    result[k] = inner_dict
Note: this assumes you always want to keep the "dict1", "dict2", ... key structure.
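If you want the real source names rather than generated ones, a small variant zips a tuple of names against each value list (a sketch; dd is rebuilt here exactly as in the question):

```python
from collections import defaultdict

dict1 = {'a': '2', 'b': '10'}
dict2 = {'a': '25', 'b': '7'}

# dd maps each key to its values in source order, as in the question
dd = defaultdict(list)
for d in (dict1, dict2):
    for key, value in d.items():
        dd[key].append(value)

# Zip explicit source names against each value list instead of numbering them
names = ("dict1", "dict2")
result = {k: dict(zip(names, v)) for k, v in dd.items()}
print(result)
# {'a': {'dict1': '2', 'dict2': '25'}, 'b': {'dict1': '10', 'dict2': '7'}}
```

This keeps the mapping correct even if you later rename the sources, as long as names stays in the same order as the dicts passed to the loop.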
I have a dictionary of dictionaries, with each nested dictionary having the exact same keys, like this:
all_dicts = {'a':{'name': 'A', 'city': 'foo'},
'b':{'name': 'B', 'city': 'bar'},
'c':{'name': 'C', 'city': 'bar'},
'd':{'name': 'B', 'city': 'foo'},
'e':{'name': 'D', 'city': 'bar'},
}
How do I get a list (or dictionary) of all the dictionaries where 'city' has the value 'bar'?
The following code works, but isn't scalable:
req_key = 'bar'
selected = []
for one in all_dicts.keys():
    if req_key in all_dicts[one].values():
        selected.append(all_dicts[one])
Say 'city' can have 50,000 unique values and the dictionary all_dicts contains 600,000 values; iterating over the dictionary for every single value of 'city' is not very efficient.
Is there a scalable and efficient way of doing this?
What you could do is create an index on that dictionary, like this:
cityIndex = {}
for item in all_dicts.values():
    if item['city'] in cityIndex:
        cityIndex[item['city']].append(item)
    else:
        cityIndex[item['city']] = [item]
This will require some initial processing time as well as some additional memory, but afterwards it will be very fast. If you want all items with some cityName, you'll get them by doing:
mylist = cityIndex[cityName] if cityName in cityIndex else []
This gives you many benefits if all_dicts is built once and queried afterwards many times.
If all_dicts is being modified during the execution of your program, you will need some more code to maintain the cityIndex. If an item is added to all_dicts, just do:
if item['city'] in cityIndex:
    cityIndex[item['city']].append(item)
else:
    cityIndex[item['city']] = [item]
while if an item is removed, this is a straightforward way to remove it from the index as well (assuming the combination of 'name' and 'city' is unique among your items):
for i, val in enumerate(cityIndex[item['city']]):
    if val['name'] == item['name']:
        break
del cityIndex[item['city']][i]
If there are many more queries than updates, you will still get a huge performance improvement.
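The build-and-maintain bookkeeping above can be condensed with collections.defaultdict (a sketch of the same cityIndex idea on toy data, not the answer's exact code):

```python
from collections import defaultdict

all_dicts = {
    'a': {'name': 'A', 'city': 'foo'},
    'b': {'name': 'B', 'city': 'bar'},
}

# A defaultdict removes the explicit "is the key already present?" branch
cityIndex = defaultdict(list)
for item in all_dicts.values():
    cityIndex[item['city']].append(item)

# Adding a new item to the index later is the same one-liner
item = {'name': 'C', 'city': 'bar'}
cityIndex[item['city']].append(item)

print(cityIndex['bar'])
# [{'name': 'B', 'city': 'bar'}, {'name': 'C', 'city': 'bar'}]
```

One caveat: reading cityIndex[missing] on a defaultdict inserts an empty list as a side effect, so for pure lookups prefer cityIndex.get(cityName, []).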
You have to check all the values; there isn't an alternative to that. You could, however, use a list comprehension, which is typically faster than an explicit for loop:
selected = [d for d in all_dicts.values() if d['city']=='bar']
print(selected)
# [{'name': 'B', 'city': 'bar'}, {'name': 'C', 'city': 'bar'}, {'name': 'D', 'city': 'bar'}]
Using dict.values instead of accessing the dictionary keys also improves performance, and in Python 3 it returns a memory-efficient view.
Or use filter, in Python 3:
>>> list(filter(lambda x: x['city']=='bar', all_dicts.values()))
# [{'name': 'D', 'city': 'bar'}, {'name': 'B', 'city': 'bar'}, {'name': 'C', 'city': 'bar'}]
Or with pandas:
import pandas as pd
df = pd.DataFrame(all_dicts).T
df[df.city=='bar'].T.to_dict()
# {'e': {'city': 'bar', 'name': 'D'}, 'c': {'city': 'bar', 'name': 'C'}, 'b': {'city': 'bar', 'name': 'B'}}
all_dicts = {'a':{'name': 'A', 'city': 'foo'},
'b':{'name': 'B', 'city': 'bar'},
'c':{'name': 'C', 'city': 'bar'},
'd':{'name': 'B', 'city': 'foo'},
'e':{'name': 'D', 'city': 'bar'},
}
citys = {}
for key, value in all_dicts.items():
    citys[key] = value['city']
# {'a': 'foo', 'b': 'bar', 'e': 'bar', 'd': 'foo', 'c': 'bar'}

for key, value in citys.items():
    if value == 'bar':
        print(all_dicts[key])
Output:
{'name': 'B', 'city': 'bar'}
{'name': 'D', 'city': 'bar'}
{'name': 'C', 'city': 'bar'}
Build an auxiliary dict to store the city as an index, and you can reference it very quickly.
I have a bunch of lists like the following two:
['a', ['b', ['x', '1'], ['y', '2']]]
['a', ['c', ['xx', '4'], ['gg', ['m', '3']]]]
What is the easiest way to combine all of them into a single dictionary that looks like:
{'a': {
    'b': {
        'x': 1,
        'y': 2
    },
    'c': {
        'xx': 4,
        'gg': {
            'm': 3
        }
    }
}}
The depth of nesting is variable.
Here's a very crude implementation; it does not handle weird cases such as lists having fewer than two elements, and it overwrites duplicate keys, but it's something to get you started:
l1 = ['a', ['b', ['x', '1'], ['y', '2']]]
l2 = ['a', ['c', ['xx', '4'], ['gg', ['m', '3']]]]

def combine(d, l):
    if l[0] not in d:
        d[l[0]] = {}
    for v in l[1:]:
        if isinstance(v, list):
            combine(d[l[0]], v)
        else:
            d[l[0]] = v

h = {}
combine(h, l1)
combine(h, l2)
print(h)
Output:
{'a': {'c': {'gg': {'m': '3'}, 'xx': '4'}, 'b': {'y': '2', 'x': '1'}}}
It's not really 'Pythonic', but I don't see a good way to do this without recursion.
def listToDict(l):
    if not isinstance(l, list):
        return l
    return {l[0]: listToDict(l[1])}
It made the most sense to me to break this problem into two parts (well, that and I misread the question the first time through).
transformation
The first part transforms the [key, list1, list2] data structure into nested dictionaries:
def recdict(elements):
    """Create recursive dictionaries from [k, v1, v2, ...] lists.

    >>> import pprint, functools
    >>> pprint = functools.partial(pprint.pprint, width=2)
    >>> pprint(recdict(['a', ['b', ['x', '1'], ['y', '2']]]))
    {'a': {'b': {'x': '1',
                 'y': '2'}}}
    >>> pprint(recdict(['a', ['c', ['xx', '4'], ['gg', ['m', '3']]]]))
    {'a': {'c': {'gg': {'m': '3'},
                 'xx': '4'}}}
    """
    def rec(item):
        if isinstance(item[1], list):
            return [item[0], dict(rec(e) for e in item[1:])]
        return item
    return dict([rec(elements)])
It expects that
every list has at least two elements
the first element of every list is a key
if the second element of a list is a list, then all subsequent elements are also lists; these are combined into a dictionary.
The tricky bit (at least for me) was realizing that you have to return a list from the recursive function rather than a dictionary. Otherwise, you can't combine the parallel lists that form the second and third elements of some of the lists.
To make this more generally useful (i.e. to tuples and other sequences), I would change
if isinstance(item[1], list):
to
if (isinstance(item[1], collections.abc.Sequence)
        and not isinstance(item[1], str)):
You can also make it work for any iterable but that requires a little bit of reorganization.
merging
The second part merges the dictionaries that result from running the first routine on the two given data structures. I think this will recursively merge any number of dictionaries that don't have conflicting keys, though I didn't really test it for anything other than this use case.
def mergedicts(*dicts):
    """Recursively merge an arbitrary number of dictionaries.

    >>> import pprint
    >>> d1 = {'a': {'b': {'x': '1',
    ...                   'y': '2'}}}
    >>> d2 = {'a': {'c': {'gg': {'m': '3'},
    ...                   'xx': '4'}}}
    >>> pprint.pprint(mergedicts(d1, d2), width=2)
    {'a': {'b': {'x': '1',
                 'y': '2'},
           'c': {'gg': {'m': '3'},
                 'xx': '4'}}}
    """
    keys = set(k for d in dicts for k in d)

    def vals(key):
        """Return all values for `key` in all `dicts`."""
        withkey = (d for d in dicts if key in d)
        return [d[key] for d in withkey]

    def recurse(*values):
        """Recurse if the values are dictionaries."""
        if isinstance(values[0], dict):
            return mergedicts(*values)
        if len(values) == 1:
            return values[0]
        raise TypeError("Multiple non-dictionary values for a key.")

    return dict((key, recurse(*vals(key))) for key in keys)