I have two dictionaries. One maps chapter_id to book_id: {99: 7358, 852: 7358, 456: 7358}. Here just one book as an example, but there are many. The other maps the same chapter_id to some information: {99: ["John Smith", 20, 5], 852: ["Clair White", 15, 10], 456: ["Daniel Dylan", 25, 10]}. Chapter ids are unique across all the books. I have to combine them so that every book gets the information from all the chapters it contains, something like {7358: [[99, 852, 456], ["John Smith", "Clair White", "Daniel Dylan"], [20, 15, 25], [5, 10, 10]]}. I also already have a file with a dictionary where each book has the ids of all the chapters it contains.

I know how to do it by looping over both dictionaries (they used to be lists), but it takes ages. That is why they are now dictionaries, and I think I can manage with just one loop over all chapters. But in my head I always come back to looping over books and over chapters. Any ideas are very much appreciated! I will write the final result to a file, so it is not very important whether it is a nested dictionary or something else. Or at least I think so.
If you are open to using other packages, you might want to have a look at pandas, which will let you do many of these things easily and fast. Here is an example based on the data you provided...
import pandas as pd

# chapter_id -> book_id
d1 = {99: 7358, 852: 7358, 456: 7358}
df1 = pd.DataFrame.from_dict(d1, orient="index")
df1.reset_index(inplace=True)

# chapter_id -> [name, value1, value2]
d2 = {99: ["John Smith", 20, 5], 852: ["Clair White", 15, 10], 456: ["Daniel Dylan", 25, 10]}
df2 = pd.DataFrame.from_dict(d2, orient="index")
df2.reset_index(inplace=True)

# join the two frames on chapter_id
df = df1.merge(df2, on="index")
df.columns = ["a", "b", "c", "d", "e"]

# all data for book 7358 (ie subsetting)
df[df.b == 7358]

# all names for that book as a list
list(df[df.b == 7358].c)
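To get all the way to the book-keyed structure asked for, here is a sketch using groupby (column labels as assigned above: a = chapter id, b = book id, c/d/e = the chapter info):

result = {book: [list(grp.a), list(grp.c), list(grp.d), list(grp.e)]
          for book, grp in df.groupby("b")}
# e.g. {7358: [[99, 852, 456], ['John Smith', 'Clair White', 'Daniel Dylan'], [20, 15, 25], [5, 10, 10]]}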
You could always iterate over the dictionary keys, given that the same keys appear in both dictionaries:

result = {}
for chapter_id in dict1:
    book_id = dict1[chapter_id]
    chapter_info = dict2[chapter_id]
    # collect each chapter's info under its book
    result.setdefault(book_id, []).append([chapter_id, *chapter_info])
from collections import defaultdict

def append_all(l, a):
    if len(l) != len(a):
        raise ValueError
    for i in range(len(l)):
        l[i].append(a[i])

final_dict = defaultdict(lambda: [[], [], [], []])
for chapter, book in d1.items():
    final_dict[book][0].append(chapter)
    append_all(final_dict[book][1:], d2[chapter])
You only need to iterate over the chapters. You can replace the append_all function with explicit appends, but it seemed ugly to do it that way. I'm surprised there's no built-in method for this, but it may just be that I missed a clever way to use zip here.
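For what it's worth, a zip-based version of the same append step (it works because the slice holds references to the same inner lists):

for sub, value in zip(final_dict[book][1:], d2[chapter]):
    sub.append(value)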
I have a dictionary (table) defined like this:
table = {"id": [1, 2, 3]}, {"file": ['good1.txt', 'bad2.txt', 'good3.txt']}
and I have a list of bad candidates that should be removed:
to_exclude = ['bad0.txt', 'bad1.txt', 'bad2.txt']
I hope to filter the table based on whether the file in a row of my table can be found inside to_exclude.
filtered = {"id": [1, 3]}, {"file": ['good1.txt', 'good3.txt']}
I guess I could use a for loop to check the entries one by one, but I was wondering what's the most Pythonic and efficient way to solve this problem.
Could someone provide some guidance on this? Thanks.
I'm assuming you miswrote your data structure. As written, that expression is actually a tuple of two dictionaries (a set of two dictionaries would be impossible, since dictionaries are not hashable). I'm hoping your actual data is:
data = {"id": [1, 2, 3], "file": [.......]}
a dictionary with two keys.
So for me, the simplest would be:
# Create a set for faster testing
to_exclude_set = set(to_exclude)
# Create (id, file) pairs for the pairs we want to keep
pairs = [(id, file) for id, file in zip(data["id"], data["file"])
         if file not in to_exclude_set]
# Recreate the data structure
result = {'id': [id for id, _ in pairs],
          'file': [file for _, file in pairs]}
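A quick usage sketch with the assumed structure:

data = {"id": [1, 2, 3], "file": ['good1.txt', 'bad2.txt', 'good3.txt']}
to_exclude = ['bad0.txt', 'bad1.txt', 'bad2.txt']
# ... run the snippet above ...
print(result)  # {'id': [1, 3], 'file': ['good1.txt', 'good3.txt']}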
I'm trying to create an event manager in which a dictionary stores the events like this:

my_dict = {'2020': {'9': {'8': ['School ']},
                    '11': {'13': ['Doctors ']},
                    '8': {'31': ['Interview']}},
           '2021': {}}
The outer key is the year, the middle key is a month, and the innermost key is a day, which leads to a list of events.
I'm trying to first sort it so that the months are in order, then sort it again so that the days are in order. Thanks in advance.
Use-case
DevOrangeCrush wishes to sort on keys in a nested dictionary where the nesting occurs on multiple levels
Solution
Normalize the data so that the dates match the ISO8601 format, for easier sorting
In plain English, this means always use two digits for the month and day, and always use four digits for the year
Re-normalize the original dictionary data structure into a single list of dictionaries, where each dictionary represents a row, and the list represents an outer containing table
this is known as an Array of Hashes in perl-speak
this is known as a list of objects in JSON-speak
Once your data is restructured you are solving a much more well-known, well-documented, and more obvious problem, how to sort a simple list of dictionaries (which is already documented in the See also section of this answer).
Example
import pprint

## original data is formatted as a nested dictionary, which is clumsy
my_dict = {'2020': {'9': {'8': ['School ']},
                    '11': {'13': ['Doctors ']},
                    '8': {'31': ['Interview']}},
           '2021': {}}

## we want the data formatted as a standard table (aka list of dictionaries);
## this is the most common format for this kind of data, as you would see in
## databases and spreadsheets
mydata_table = []
for year in my_dict:
    for month in my_dict[year]:
        for day in my_dict[year][month]:
            mydata_row = {'year': year,
                          'month': '{0:02d}'.format(int(month)),
                          'day': '{0:02d}'.format(int(day)),
                          'task_list': my_dict[year][month][day]}
            mydata_row['date'] = '{year}-{month}-{day}'.format(**mydata_row)
            mydata_table.append(mydata_row)
## output result is now easily sorted and there is no data loss
## you will have to modify this if you want to deal with years that
## do not have any associated task_list data
pprint.pprint(mydata_table)
'''
## now we have something that can be sorted using well-known python idioms
## and easily manipulated using data-table semantics
## (search, sort, filter-by, group-by, select, project ... etc)
[
{'date': '2020-09-08','day': '08',
'month': '09','task_list': ['School '],'year': '2020'},
{'date': '2020-11-13','day': '13',
'month': '11','task_list': ['Doctors '],'year': '2020'},
{'date': '2020-08-31','day': '31',
'month': '08','task_list': ['Interview'],'year': '2020'},
]
'''
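Once the data is in this shape, the sort the question asks for is a one-liner, since ISO8601 date strings sort correctly as plain text:

mydata_table.sort(key=lambda row: row['date'])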
See also
How to sort a python list-of-dictionary
How to sort objects by multiple keys
Why you should use ISO8601 date format
ISO8601 vs timestamp
To get sorted events data, you can do something like this:

from collections import OrderedDict

def sort_events(my_dict):
    new_events_data = dict()
    for year, month_data in my_dict.items():
        new_month_data = dict()
        for month, day_data in month_data.items():
            # sort the days within each month numerically
            sorted_day_data = sorted(day_data.items(), key=lambda kv: int(kv[0]))
            new_month_data[month] = OrderedDict(sorted_day_data)
        # then sort the months within each year numerically
        sorted_months_data = sorted(new_month_data.items(), key=lambda kv: int(kv[0]))
        new_events_data[year] = OrderedDict(sorted_months_data)
    return new_events_data
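Calling it on the example data:

from pprint import pprint
pprint(sort_events(my_dict))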
Output:
{'2020': OrderedDict([('8', OrderedDict([('31', ['Interview'])])),
('9', OrderedDict([('8', ['School '])])),
('11', OrderedDict([('13', ['Doctors '])]))]),
'2021': OrderedDict()}
A plain dict had no guaranteed order before Python 3.7 (and has no sort method even now); you could use an OrderedDict, but if you simply need to visit the entries in sorted order while iterating, do it like this:
for year in sorted(map(int, my_dict)):
    year_dict = my_dict[str(year)]
    for month in sorted(map(int, year_dict)):
        month_dict = year_dict[str(month)]
        for day in sorted(map(int, month_dict)):
            events = month_dict[str(day)]
            for event in events:
                print(year, month, day, event)
The conversion to int ensures the right numeric ordering; without it you'd get the lexicographic ordering 1, 10, 11, ..., 2, 20, 21, ...
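A quick illustration of the difference:

>>> sorted(['1', '10', '2'])
['1', '10', '2']
>>> sorted(['1', '10', '2'], key=int)
['1', '2', '10']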
Before Python 3.7, a dictionary in Python did not preserve any order; you might want to try the OrderedDict class from the collections module, which remembers the order of insertion.
Of course, you would have to sort and reinsert the elements whenever you insert a new element that should be placed before any of the existing ones.
If you care about order, maybe a different data structure works better, for example a list of lists (see the sketch below).
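A minimal sketch of that idea, with each event stored as [year, month, day, name]:

events = [[2020, 9, 8, 'School'],
          [2020, 11, 13, 'Doctors'],
          [2020, 8, 31, 'Interview']]
events.sort()  # lists compare elementwise, so this orders by year, month, day
# [[2020, 8, 31, 'Interview'], [2020, 9, 8, 'School'], [2020, 11, 13, 'Doctors']]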
I have a TSV file where one of the columns are a dictionary-format type.
Example of headers and one row (notice the string-quotes in Preferences-column)
Name, Age, Preferences
Nick, 18, "[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]"
To read the file into python:
df = pd.read_csv('search_data_assessment.tsv',delimiter='\t')
To strip the surrounding string quotes from "Preferences" and parse it, I used ast.literal_eval:
df["Preferences"] = ast.literal_eval(df["Preferences"])
This raises "ValueError: malformed node or string: 0", but it seems to do the trick.
The question: How can I check all rows and look for "FavoriteNumber" in Preferences, and if it == 72, change it to 100 (arbitrary example)?
You can use pd.Series.apply with a custom function. Just note this is bordering on abuse of Pandas. Pandas isn't designed to hold lists of dictionaries in series. Here, you are running a loop in a particularly inefficient way.
from ast import literal_eval

import pandas as pd

df = pd.DataFrame([['Nick', 18, '[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]']],
                  columns=['Name', 'Age', 'Preferences'])

def updater(x):
    # x is the parsed list of dicts for one row
    if x[0]['FavoriteNumber'] == '72':
        x[0]['FavoriteNumber'] = '100'
    return x

# parse the string column into real Python objects, then update in place
df['Preferences'] = df['Preferences'].apply(literal_eval)
df['Preferences'] = df['Preferences'].apply(updater)

print(df['Preferences'].iloc[0])
[{'Hobby': 'Football', 'Food': 'Pizza', 'FavoriteNumber': '100'}]
Let's assume a very simple data structure. In the below example, IDs are unique. "date" and "id" are strings, and "amount" is an integer.
data = [[date1, id1, amount1], [date2, id2, amount2], etc.]
If date1 == date2 and id1 == id2, I'd like to merge the two entries into one and basically add up amount1 and amount2 so that data becomes:
data = [[date1, id1, amount1 + amount2], etc.]
There are many duplicates.
As data is very big (over 100,000 entries), I'd like to do this as efficiently as possible. What I did was create a new "common" field that is basically date + id combined into one string, with metadata allowing me to split it later (date + id + "_" + str(len(date))).
In terms of complexity, I have four loops:
Parse and load data from external source (it doesn't come in lists) | O(n)
Loop over data and create and store "common" string (date + id + metadata) - I call this "prepared data" where "common" is my encoded field | O(n)
Use the Counter() object to dedupe "prepared data" | O(n)
Decode "common" | O(n)
I don't care about memory here, I only care about speed. I could make a nested loop and avoid steps 2, 3 and 4 but that would be a time-complexity disaster (O(n²)).
What is the fastest way to do this?
Consider a defaultdict for aggregating data by a unique key:
Given
Some random data
import random
import collections as ct
random.seed(123)
# Random data
dates = ["2018-04-24", "2018-05-04", "2018-07-06"]
ids = "A B C D".split()
amounts = lambda: random.randrange(1, 100)
ch = random.choice
data = [[ch(dates), ch(ids), amounts()] for _ in range(10)]
data
Output
[['2018-04-24', 'C', 12],
['2018-05-04', 'C', 14],
['2018-04-24', 'D', 69],
['2018-07-06', 'C', 44],
['2018-04-24', 'B', 18],
['2018-05-04', 'C', 90],
['2018-04-24', 'B', 1],
['2018-05-04', 'A', 77],
['2018-05-04', 'A', 1],
['2018-05-04', 'D', 14]]
Code
dd = ct.defaultdict(int)
for date, id_, amt in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key] += amt

dd
Output
defaultdict(int,
{'2018-04-24B_10': 19,
'2018-04-24C_10': 12,
'2018-04-24D_10': 69,
'2018-05-04A_10': 78,
'2018-05-04C_10': 104,
'2018-05-04D_10': 14,
'2018-07-06C_10': 44})
Details
A defaultdict is a dictionary that calls a default factory (a specified function) for any missing key. In this case, every date + id combination is uniquely added to the dict. The amounts are added to the values of existing keys; otherwise an integer (0, from the int factory) initializes a new entry in the dict.
For illustration, you can visualize the aggregated values using a list as the default factory.
dd = ct.defaultdict(list)
for date, id_, val in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key].append(val)

dd
Output
defaultdict(list,
{'2018-04-24B_10': [18, 1],
'2018-04-24C_10': [12],
'2018-04-24D_10': [69],
'2018-05-04A_10': [77, 1],
'2018-05-04C_10': [14, 90],
'2018-05-04D_10': [14],
'2018-07-06C_10': [44]})
We see three occurrences of duplicate keys where the values were appropriately summed. Regarding efficiency, notice:
keys are made with format(), which should be a bit faster than string concatenation and calling str()
every key and value is computed in the same iteration
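The question's step 4 (decoding "common" back into rows) can then look like this sketch, splitting on the last underscore and using the embedded date length:

merged = []
for key, amount in dd.items():
    prefix, _, date_len = key.rpartition("_")
    date, id_ = prefix[:int(date_len)], prefix[int(date_len):]
    merged.append([date, id_, amount])
# e.g. [['2018-04-24', 'B', 19], ['2018-04-24', 'C', 12], ...]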
Using pandas makes this really easy:
import pandas as pd
df = pd.DataFrame(data, columns=['date', 'id', 'amount'])
df.groupby(['date','id']).sum().reset_index()
For more control you can use agg instead of sum():
df.groupby(['date','id']).agg({'amount':'sum'})
Depending on what you are doing with the data, it may be easier/faster to go this way just because so much of pandas is built on compiled C extensions and optimized routines that make it super easy to transform and manipulate.
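If the result needs to go back to the original list-of-lists shape, a short sketch:

merged = df.groupby(['date', 'id'], as_index=False)['amount'].sum()
data = merged.values.tolist()  # [[date, id, summed_amount], ...]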
You could import the data into a structure that prevents duplicates and then convert it to a list:
data = {
    date1: {
        id1: amount1,
        id2: amount2,
    },
    date2: {
        id3: amount3,
        id4: amount4,
        ....
    },
}
The program's skeleton:

import collections

ddata = collections.defaultdict(dict)
for date, id, amount in DATASOURCE:
    # accumulate, so duplicate (date, id) pairs have their amounts summed
    ddata[date][id] = ddata[date].get(id, 0) + amount

data = [[d, i, a] for d, subd in ddata.items() for i, a in subd.items()]
I'm trying to create a dictionary of dictionaries like this:
food = {"Broccoli": {"Taste": "Bad", "Smell": "Bad"},
"Strawberry": {"Taste": "Good", "Smell": "Good"}}
But I am populating it from an SQL table. So I've pulled the SQL table into an SQL object called "result". And then I got the column names like this:
nutCol = [i[0] for i in result.description]
The table has about 40 characteristics, so it is quite long.
I can do this...
foodList = {}
for id, food in enumerate(result):
    addMe = {str(food[1]): {nutCol[id + 2]: food[2], nutCol[id + 3]: food[3], ...}}
    foodList.update(addMe)
But this of course would look horrible and take a while to write. And I'm still working out how I want to build this whole thing so it's possible I'll need to change it a few times...which could get extremely tedious.
Is there a DRY way of doing this?
To make the solution position-independent, you can make use of dict1.update(dict2). This simply merges dict2 into dict1.
In our case, since we have a dict of dicts, we can use dict['key'] as dict1 and simply add any additional key/value pair as dict2.
Here is an example.
food = {"Broccoli": {"Taste": "Bad", "Smell": "Bad"},
"Strawberry": {"Taste": "Good", "Smell": "Good"}}
addthis = {'foo':'bar'}
Suppose you want to add the addthis dict to food["Strawberry"]; we can simply use:
food["Strawberry"].update(addthis)
Getting result:
>>> food
{'Strawberry': {'Taste': 'Good', 'foo': 'bar', 'Smell': 'Good'},'Broccoli': {'Taste': 'Bad', 'Smell': 'Bad'}}
>>>
Assuming that column 0 is what you wish to use as your key, and you do wish to build a dictionary of dictionaries, then it's:

detail_names = [col[0] for col in result.description[1:]]
foodList = {row[0]: dict(zip(detail_names, row[1:]))
            for row in result}
Generalising, if column k is your key column, then it's:

foodList = {row[k]: {col[0]: row[i]
                     for i, col in enumerate(result.description) if i != k}
            for row in result}
(Here each sub dictionary is all columns other than column k)
addMe = {str(food[1]): dict(zip(nutCol[2:], food[2:]))}
zip will take two (or more) lists of items and pair the elements, then you can pass the result to dict to turn the pairs into a dictionary.
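For instance:

>>> dict(zip(["Taste", "Smell"], ["Good", "Good"]))
{'Taste': 'Good', 'Smell': 'Good'}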