Nested dictionary to multiindex dataframe where dictionary keys are column labels - python

Say I have a dictionary that looks like this:
dictionary = {'A' : {'a': [1,2,3,4,5],
'b': [6,7,8,9,1]},
'B' : {'a': [2,3,4,5,6],
'b': [7,8,9,1,2]}}
and I want a dataframe that looks something like this:
   A     B
   a  b  a  b
0  1  6  2  7
1  2  7  3  8
2  3  8  4  9
3  4  9  5  1
4  5  1  6  2
Is there a convenient way to do this? If I try:
In [99]:
DataFrame(dictionary)
Out[99]:
A B
a [1, 2, 3, 4, 5] [2, 3, 4, 5, 6]
b [6, 7, 8, 9, 1] [7, 8, 9, 1, 2]
I get a dataframe where each element is a list. What I need is MultiIndex columns where each level corresponds to the keys of the nested dict, with one row per list element, as shown above. I think I can work out a very crude solution, but I'm hoping there is something a bit simpler.

Pandas wants the MultiIndex values as tuples, not nested dicts. The simplest thing is to convert your dictionary to the right format before trying to pass it to DataFrame:
>>> reform = {(outerKey, innerKey): values for outerKey, innerDict in dictionary.items() for innerKey, values in innerDict.items()}
>>> reform
{('A', 'a'): [1, 2, 3, 4, 5],
('A', 'b'): [6, 7, 8, 9, 1],
('B', 'a'): [2, 3, 4, 5, 6],
('B', 'b'): [7, 8, 9, 1, 2]}
>>> pandas.DataFrame(reform)
   A     B
   a  b  a  b
0  1  6  2  7
1  2  7  3  8
2  3  8  4  9
3  4  9  5  1
4  5  1  6  2
[5 rows x 4 columns]

You're looking for the functionality in .stack:
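df = pandas.DataFrame.from_dict(dictionary, orient="index").stack().to_frame()
# to break out the lists into columns
df = pandas.DataFrame(df[0].values.tolist(), index=df.index)
# the dictionary keys are now a MultiIndex on the rows; transpose to get them as column labels
df = df.T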

dict_of_df = {k: pd.DataFrame(v) for k,v in dictionary.items()}
df = pd.concat(dict_of_df, axis=1)
Note that the column order is not preserved on Python < 3.6, where plain dictionaries do not keep insertion order.
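As a quick check, the sketch below builds the frame both ways for the dictionary from the question (tuple keys and pd.concat) and compares the resulting column labels. The names by_tuples and by_concat are only illustrative.

```python
import pandas as pd

dictionary = {'A': {'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 1]},
              'B': {'a': [2, 3, 4, 5, 6], 'b': [7, 8, 9, 1, 2]}}

# Approach 1: flatten to tuple keys, which pandas turns into MultiIndex columns.
by_tuples = pd.DataFrame({(outer, inner): values
                          for outer, inner_dict in dictionary.items()
                          for inner, values in inner_dict.items()})

# Approach 2: one DataFrame per outer key; concat uses the dict keys as the outer level.
by_concat = pd.concat({k: pd.DataFrame(v) for k, v in dictionary.items()}, axis=1)

print(by_concat)
print(list(by_concat.columns) == list(by_tuples.columns))
```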

This recursive function should work:
def reform_dict(dictionary, t=(), reform=None):
    # Create the result dict per call; a mutable default ({}) would be shared between calls.
    if reform is None:
        reform = {}
    for key, val in dictionary.items():
        path = t + (key,)
        if isinstance(val, dict):
            reform_dict(val, path, reform)
        else:
            reform[path] = val
    return reform
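To illustrate the idea on a dictionary nested three levels deep, here is a self-contained sketch. The helper name flatten_keys and the sample data are made up for this example; it follows the same approach as the function above but returns a fresh dictionary on every call.

```python
import pandas as pd

def flatten_keys(nested, prefix=()):
    """Map each path of keys (as a tuple) to the non-dict value it leads to."""
    flat = {}
    for key, val in nested.items():
        path = prefix + (key,)
        if isinstance(val, dict):
            # Recurse into the sub-dictionary, carrying the path so far.
            flat.update(flatten_keys(val, path))
        else:
            flat[path] = val
    return flat

# Hypothetical sample data with three levels of keys.
deep = {'A': {'a': {'x': [1, 2], 'y': [3, 4]}},
        'B': {'a': {'x': [5, 6], 'y': [7, 8]}}}

df = pd.DataFrame(flatten_keys(deep))
print(df.columns.nlevels)
```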

If the lists in the dictionary are not all the same length, you can adapt BrenBarn's method:
>>> dictionary = {'A' : {'a': [1,2,3,4,5],
'b': [6,7,8,9,1]},
'B' : {'a': [2,3,4,5,6],
'b': [7,8,9,1]}}
>>> reform = {(outerKey, innerKey): values for outerKey, innerDict in dictionary.items() for innerKey, values in innerDict.items()}
>>> reform
{('A', 'a'): [1, 2, 3, 4, 5],
('A', 'b'): [6, 7, 8, 9, 1],
('B', 'a'): [2, 3, 4, 5, 6],
('B', 'b'): [7, 8, 9, 1]}
>>> df = pandas.DataFrame.from_dict(reform, orient='index').transpose()
>>> df.columns = pandas.MultiIndex.from_tuples(df.columns)
>>> df
A B
a b a b
0 1 6 2 7
1 2 7 3 8
2 3 8 4 9
3 4 9 5 1
4 5 1 6 NaN
[5 rows x 4 columns]
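Another option for lists of unequal length, sketched here under the assumption that padding with NaN is what you want, is to wrap each list in a Series so that pandas aligns the columns on the row index.

```python
import pandas as pd

dictionary = {'A': {'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 1]},
              'B': {'a': [2, 3, 4, 5, 6], 'b': [7, 8, 9, 1]}}

# Each list becomes a Series indexed 0..n-1; shorter ones are padded with NaN.
df = pd.DataFrame({(outer, inner): pd.Series(values)
                   for outer, inner_dict in dictionary.items()
                   for inner, values in inner_dict.items()})
print(df)
```

The padded column is converted to float, because NaN cannot be stored in a plain integer column.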

This solution works for a larger dataframe and fits what was requested:
cols = df.columns
int_cols = len(cols)
col_subset_1 = [cols[x] for x in range(1,int(int_cols/2)+1)]
col_subset_2 = [cols[x] for x in range(int(int_cols/2)+1, int_cols)]
col_subset_1_label = list(zip(['A']*len(col_subset_1), col_subset_1))
col_subset_2_label = list(zip(['B']*len(col_subset_2), col_subset_2))
df.columns = pd.MultiIndex.from_tuples([('','myIndex'),*col_subset_1_label,*col_subset_2_label])
OUTPUT
                      A                     B
     myIndex          a          b          c          d
0   0.159710   1.472925   0.619508  -0.476738   0.866238
1  -0.665062   0.609273  -0.089719   0.730012   0.751615
2   0.215350  -0.403239   1.801829  -2.052797  -1.026114
3  -0.609692   1.163072  -1.007984  -0.324902  -1.624007
4   0.791321  -0.060026  -1.328531  -0.498092   0.559837
5   0.247412  -0.841714   0.354314   0.506985   0.425254
6   0.443535   1.037502  -0.433115   0.601754  -1.405284
7  -0.433744   1.514892   1.963495  -2.353169   1.285580

Related

Turning a list of dictionaries into a DataFrame

If you have a list of dictionaries like this:
listofdict = [{'value1': [1, 2, 3, 4, 5]}, {'value2': [5, 4, 3, 2, 1]}, {'value3': ['a', 'b', 'c', 'd', 'e']}]
How can you turn it into a dataframe where value1, value2 and value3 are the column names and the lists are the columns?
I tried:
df = pd.DataFrame(listofdict)
But it gives me the values crammed into one row, with the remaining rows as NaN.
Here is another way:
df = pd.DataFrame({k:v for i in listofdict for k,v in i.items()})
Output:
value1 value2 value3
0 1 5 a
1 2 4 b
2 3 3 c
3 4 2 d
4 5 1 e
DataFrame expects a single dictionary with column names as keys, so you need to merge all these dictionaries into a single one, like {'value1': [1, 2, 3, 4, 5], 'value2': [5, 4, 3, 2, 1], ... }
You can try
listofdict = [{'value1':[1,2,3,4,5]}, {'value2':[5,4,3,2,1]},{'value3':['a','b','c','d','e']}]
dicofdics = {}
for dct in listofdict:
    dicofdics.update(dct)
df = pd.DataFrame(dicofdics)
df
   value1  value2 value3
0       1       5      a
1       2       4      b
2       3       3      c
3       4       2      d
4       5       1      e
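If the dictionaries might share keys, note that update keeps only the last value for each key. A possible alternative, sketched here, is to build one small frame per dictionary and concatenate them side by side, which keeps every column.

```python
import pandas as pd

listofdict = [{'value1': [1, 2, 3, 4, 5]},
              {'value2': [5, 4, 3, 2, 1]},
              {'value3': ['a', 'b', 'c', 'd', 'e']}]

# One frame per dictionary, joined column-wise on the shared row index.
df = pd.concat([pd.DataFrame(d) for d in listofdict], axis=1)
print(df)
```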

Remove values from pandas df and move remaining upwards

I have a dataframe with categorical data in it.
I have come up with a procedure to keep only the desired categories, moving the remaining values up into the cells left empty by the deleted ones.
But I would like to do it without the intermediate lists, if possible.
import pandas as pd
mydf = pd.DataFrame(data = {'a': [9,6,3,8,5],
'b': [4, 3,5,6,7],
'c': [5, 3,6,9,10]
}
)
selecList = [5,8,4,6] # only these categories shall remain
mydf
a b c
0 9 4 5
1 6 3 3
2 3 5 6
3 8 6 9
4 5 7 10
Desired Output
a b c
0 6 4 5
1 8 5 6
2 5 6 <NA>
My workaround:
myList = mydf.T.values.tolist()
myList
[[9, 6, 3, 8, 5], [4, 3, 5, 6, 7], [5, 3, 6, 9, 10]]
filtered_list = [[x for x in y if x in selecList ] for y in myList]
filtered_list
[[6, 8, 5], [4, 5, 6], [5, 6]]
filtered_df = pd.DataFrame(filtered_list).T
filtered_df.columns = list(mydf)
filtered_df = filtered_df.astype('Int64')
Unsuccessful try:
pd.DataFrame(mydf.apply(lambda y: [x for x in y if x in selecList ])).T
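Here is an alternative solution (df is the frame called mydf in the question). Note that it only masks the unwanted values as NaN and drops rows that are entirely NaN; it does not move the remaining values up:
df.where(df.isin(selecList)).dropna(how='all')
Here is another solution:
df.where(df.isin(selecList)).stack().droplevel(0).to_frame().assign(i = lambda x: x.groupby(level=0).cumcount()).set_index('i',append=True)[0].unstack(level=0)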
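A further option that avoids intermediate Python lists, offered as a sketch rather than a definitive answer, is to filter each column separately, renumber its rows, and let pd.concat align the results. It uses the question's mydf and selecList.

```python
import pandas as pd

mydf = pd.DataFrame({'a': [9, 6, 3, 8, 5],
                     'b': [4, 3, 5, 6, 7],
                     'c': [5, 3, 6, 9, 10]})
selecList = [5, 8, 4, 6]

# Filter each column, renumber its rows from 0, then let concat align the
# columns on that new index; shorter columns are padded with missing values.
filtered_df = pd.concat(
    {col: s[s.isin(selecList)].reset_index(drop=True) for col, s in mydf.items()},
    axis=1,
).astype('Int64')
print(filtered_df)
```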

How to Convert a dataframe into nested dictionary in the following format

2 3 4 loc_id
0 b b c 1
1 b b c 6
2 b a b 8
3 b b c 10
4 b a b 11
Can someone help me convert the above dataframe to the following dictionary in Python? The column names should be the outer keys, and each inner dictionary should map the values found in that column to the list of loc_id values of the rows where they occur.
{2:{'b':[1,6,8,10,11]},3:{'b':[1,6,10],'a':[8,11]},4:{'c':[1,6,10],'b':[8,11]}}
Use DataFrame.melt with GroupBy.agg and list to get a Series with a MultiIndex, then build the nested dictionary from it:
s = df.melt('loc_id').groupby(['variable','value'])['loc_id'].agg(list)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print (d)
{'2': {'b': [1, 1, 6, 8, 10, 11]},
'3': {'a': [8, 11], 'b': [1, 1, 6, 10]},
'4': {'b': [8, 11], 'c': [1, 1, 6, 10]}}
Or create a dictionary of Series and aggregate the index into lists:
d = {k: v.groupby(v).agg(lambda x: list(x.index)).to_dict()
     for k, v in df.set_index('loc_id').to_dict('series').items()}
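For reference, here is a self-contained sketch that rebuilds the frame from the question by hand (with integer column labels, which is an assumption about how the data was loaded) and produces the requested dictionary with one groupby per column. The output printed above has string keys and a repeated 1, which suggests it was produced from slightly different data than the table shown.

```python
import pandas as pd

# Frame from the question, typed in by hand; integer column labels are assumed.
df = pd.DataFrame({2: ['b'] * 5,
                   3: ['b', 'b', 'a', 'b', 'a'],
                   4: ['c', 'c', 'b', 'c', 'b'],
                   'loc_id': [1, 6, 8, 10, 11]})

# For every column except loc_id, collect the loc_id values per distinct cell value.
d = {col: df['loc_id'].groupby(df[col]).agg(list).to_dict()
     for col in df.columns.drop('loc_id')}
print(d)
```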

Extracting values from a dictionary for a respective key

I have a dictionary of the following form:
dict_one = {1: [2, 3, 4], 2: [3, 4, 4, 5],3 : [2, 5, 6, 6]}
I need to get an output such that for each key I have only one value adjacent to it and then finally I need to create a data frame out of it.
The output would be similar to:
1 2
1 3
1 4
2 3
2 4
2 4
2 5
3 2
3 5
3 6
3 6
Please help me with this.
dict_one = {1: [2, 3, 4], 2: [3, 4, 4, 5],3 : [2, 5, 6, 6]}
df_column = ['key','value']
for key in dict_one.keys():
    value = dict_one.values()
    row = (key,value)
extended_ground_truth = pd.DataFrame.from_dict(row, orient='index', columns=df_column)
extended_ground_truth.to_csv("extended_ground_truth.csv", index=None)
You can normalize the data as you iterate over the dictionary, emitting one (key, value) pair per list element:
df = pd.DataFrame(((key, value) for key, values in dict_one.items() for value in values),
                  columns=["key", "value"])
You can wrap the values in lists, then use DataFrame.from_dict and finally use explode to expand the lists:
pd.DataFrame.from_dict({k: [v] for k, v in dict_one.items()}, orient='index').explode(0)
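A variant of the explode idea that ends up with named key and value columns could look like the sketch below. It goes through a Series instead of from_dict, which is a choice made for this example.

```python
import pandas as pd

dict_one = {1: [2, 3, 4], 2: [3, 4, 4, 5], 3: [2, 5, 6, 6]}

df = (pd.Series(dict_one)           # one list per key
        .explode()                  # one row per list element, key repeated in the index
        .astype(int)                # explode leaves object dtype behind
        .rename_axis('key')
        .reset_index(name='value'))
print(df)
```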

Duplicating rows with certain value in a column

I have to duplicate rows that have a certain value in a column and replace the value with another value.
For instance, I have this data:
import pandas as pd
df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A','B','C','D']})
Now, I want to duplicate the rows that have 2 in column 'B' then change 2 to 4
df = pd.DataFrame({'Date': [1, 2, 2, 3, 4, 4], 'B': [1, 2, 4, 3, 2, 4], 'C': ['A','B','B','C','D','D']})
Please help me on this one. Thank you.
You can use append to add a copy of the rows where B == 2 (selected with a boolean mask), with B reassigned to 4 using assign. If order matters, you can then sort by C (to reproduce your desired frame):
>>> df.append(df[df.B.eq(2)].assign(B=4)).sort_values('C')
B C Date
0 1 A 1
1 2 B 2
1 4 B 2
2 3 C 3
3 2 D 4
3 4 D 4
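DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea has to be written with pd.concat. The following is a sketch of that; the names duplicates and result are illustrative, and sorting on the original index (instead of on C) is a choice made here so that each copy lands right after its source row.

```python
import pandas as pd

df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A', 'B', 'C', 'D']})

# Copy the rows where B == 2 and set B to 4 in the copies.
duplicates = df[df['B'].eq(2)].assign(B=4)

# Stack originals and copies, then use a stable sort on the original index so
# that each copy lands directly after the row it came from.
result = (pd.concat([df, duplicates])
            .sort_index(kind='stable')
            .reset_index(drop=True))
print(result)
```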
