Construct a MultiIndex pandas DataFrame from a nested Python dictionary - python

I would like to construct a MultiIndex DataFrame from a deeply-nested dictionary of the form
md = {'50': {'100': {'col1': ('0.100',
'0.200',
'0.300',
'0.400'),
'col2': ('6.263E-03',
'6.746E-03',
'7.266E-03',
'7.825E-03')},
'101': {'col1': ('0.100',
'0.200',
'0.300',
'0.400'),
'col2': ('6.510E-03',
'7.011E-03',
'7.553E-03',
'8.134E-03')},
'102': ...
},
'51': ...
}
I've tried
df = pd.DataFrame.from_dict({(i,j): md[i][j][v] for i in md.keys() for j in md[i].keys() for v in md[i][j]}, orient='index')
following Construct pandas DataFrame from items in nested dictionary, but I get a DataFrame with 1 row and many columns.
Bonus:
I'd also like to label the MultiIndex keys and the columns 'col1' and 'col2', as well as convert the strings to int and float, respectively.
How can I reconstruct my original dictionary from the dataframe?
I tried df.to_dict('list').

Check out this answer: https://stackoverflow.com/a/24988227/9404057. This method unpacks the keys and values of the dictionary and reshapes the data into an easily processed format for MultiIndex DataFrames. Note that `.iteritems()` was removed in Python 3, so on Python 3 you will need to use `.items()` rather than the `.iteritems()` shown in the linked answer:
>>> import pandas as pd
>>> reform = {(firstKey, secondKey, thirdKey): values for firstKey, middleDict in md.items() for secondKey, innerdict in middleDict.items() for thirdKey, values in innerdict.items()}
>>> df = pd.DataFrame(reform)
To change the 'col1' and 'col2' labels to an int and a float, you can then use pandas.DataFrame.rename() on that column level and specify any values you want (note this relabels the columns; it does not convert the data stored in them):
df.rename({'col1': 1, 'col2': 2.5}, axis=1, level=2, inplace=True)
Also, if you'd prefer the levels on the index rather than the columns, you can use pandas.DataFrame.T.
If you wanted to reconstruct your dictionary from this MultiIndex, you could do something like this:
>>> md2 = {}
>>> for i in df.columns:
...     if i[0] not in md2.keys():
...         md2[i[0]] = {}
...     if i[1] not in md2[i[0]].keys():
...         md2[i[0]][i[1]] = {}
...     md2[i[0]][i[1]][i[2]] = tuple(df[i[0]][i[1]][i[2]].values)
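Addressing the bonus directly: converting the string values to numbers is separate from relabeling. A minimal sketch (trimmed two-row sample data; the level names `level1`/`level2` are made up) that casts the keys to int and the data to float while building the frame:

```python
import pandas as pd

# Trimmed sample of the question's nested dict (hypothetical values)
md = {'50': {'100': {'col1': ('0.100', '0.200'),
                     'col2': ('6.263E-03', '6.746E-03')},
             '101': {'col1': ('0.100', '0.200'),
                     'col2': ('6.510E-03', '7.011E-03')}}}

# Cast the outer keys to int while flattening the dict
reform = {(int(k1), int(k2), col): vals
          for k1, mid in md.items()
          for k2, inner in mid.items()
          for col, vals in inner.items()}

# astype(float) converts the string values to floats in one step
df = pd.DataFrame(reform).astype(float)
df.columns.names = ['level1', 'level2', 'col']  # label the MultiIndex levels
```

Selecting a column is then, e.g., `df[(50, 100, 'col1')]`, which comes back as a float Series.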

Related

Create a dictionary from pandas empty dataframe with only column names

I have a pandas data frame with only two column names (a single row, which can also be considered a header). I want to make a dictionary out of this with the first column being the key and the second column being the value. I already tried the
.to_dict() method, but it's not working as it's an empty data frame.
Example
df = |Land|Norway| should become {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution :
dict(zip(a.iloc[0:0,0:1],a.iloc[0:0,1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list, then the list to a dictionary.
def list_to_dict(a):
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict
df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
Very manual solution
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.
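As a quick illustration of the pairwise variant (the extra column names here are invented):

```python
import pandas as pd

# Four made-up column names to show the pairwise zip
df = pd.DataFrame(columns=['Land', 'Norway', 'Capital', 'Oslo'])

# Even-positioned names become keys/columns, odd-positioned names become values
out = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
```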

use dict from CSV to format dataframe

I have a number of Pandas DFs with differing format that should get reshaped into a common target-format.
Right now, I write dictionaries for each DF:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"original_name":["a","b","c"],"original_value":[1,2,3]})
key_dict = {
"name":df1.original_name,
"value":df1.original_value,
"other_value":np.nan
}
target_colnames = ["name","value","other_value"]
new_df = pd.DataFrame(key_dict, columns = target_colnames)
My problem: the mapping of original to target columns with key_dict is stored in a CSV file (index = values, columns = key for each DF).
key_df= pd.read_csv("key_df.csv").set_index("key")
key_df= key_df.to_dict()
new_df = pd.DataFrame(key_df["df1"], columns = target_colnames)
This leads to the following error:
"If using all scalar values, you must pass an index"
I think it's because the values of 'key_df' are strings unlike in 'key_dict'. Do I need to apply 'eval' on the keys?
this is how 'key_df["df1"]' looks:
{'name': 'df1.original_name',
'other_value': 'np.nan',
'value': 'df1.original_value'}
Use:
key_dict = {i: eval(j) for i, j in key_df["df1"].items()}  # use .iteritems() on Python 2
new_df = pd.DataFrame(key_dict, columns=target_colnames)
Output
name value other_value
a 1 NaN
b 2 NaN
c 3 NaN
Explanation
After loading the CSV and converting it to a dict, the pandas expressions are stored as strings, so you need a dict comprehension with eval() to turn each string back into the Series (or scalar) it names; then you can reuse the same new_df code to get what you want.
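Put together, the flow might look like the sketch below. The CSV read is simulated with an in-memory dict, and `df1` and `np` must be in scope for `eval` to resolve the stored strings:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"original_name": ["a", "b", "c"],
                    "original_value": [1, 2, 3]})
target_colnames = ["name", "value", "other_value"]

# Stand-in for pd.read_csv("key_df.csv").set_index("key").to_dict()
key_df = {"df1": {"name": "df1.original_name",
                  "value": "df1.original_value",
                  "other_value": "np.nan"}}

# eval() turns each stored string back into the object it names
key_dict = {k: eval(v) for k, v in key_df["df1"].items()}
new_df = pd.DataFrame(key_dict, columns=target_colnames)
```

Note that eval() executes arbitrary code from the CSV, so this is only safe when you control the file's contents.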

How to coerce pandas dataframe column to be normal index

I create a DataFrame from a dictionary. I want the keys to be used as index and the values as a single column. This is what I managed to do so far:
import pandas as pd
my_counts = {"A": 43, "B": 42}
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
I get the following:
count
letter
A 43
B 42
The problem is I want to concatenate (with pd.concat) this with other dataframes, that have the same index name (letter), and seemingly the same single column (count), but I end up with an
AssertionError: invalid dtype determination in get_concat_dtype.
I discovered that the other dataframes have a different type for their columns: Index(['count'], dtype='object'). The above dataframe has MultiIndex(levels=[['count']], labels=[[0]]).
How can I ensure my dataframe has a normal index?
You can prevent the MultiIndex column by eliminating the trailing ',' so that name is a plain string rather than a 1-tuple:
df = pd.DataFrame(pd.Series(my_counts, name=("count")).rename_axis("letter"))
df.columns
Output:
Index(['count'], dtype='object')
OR you can flatten your multiindex columns like this:
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
df.columns = df.columns.map(''.join)
df.columns
Output:
Index(['count'], dtype='object')
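To check that the flattened column fixes the concat, a small sketch (the second frame is invented):

```python
import pandas as pd

my_counts = {"A": 43, "B": 42}
# name is a plain string, so the column index is a normal Index
df = pd.DataFrame(pd.Series(my_counts, name="count").rename_axis("letter"))

# A second frame with the same index name and a plain 'count' column
other = pd.DataFrame({"count": [1, 2]},
                     index=pd.Index(["C", "D"], name="letter"))

combined = pd.concat([df, other])
```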

split, map data in two columns in pandas data frame

I want to split data in two columns from a data frame and construct new columns using this data.
My data frame is,
dfc = pd.DataFrame({
    "A": ["GT:DP:RO:QR:AO:QA:GL", "GT:DP:RO:QR:AO:QA:GL", "GT:DP:RO:QR:AO:QA:GL",
          "GT:DP:GL", "GT:DP:GL"],
    "B": ["0/1:71:43:1363:28:806:-71.1191,0,-121.278",
          "0/1:71:43:1363:28:806:-71.1191,0,-121.278",
          "0/1:71:43:1363:28:806:-71.1191,0,-121.278",
          "1/1:49:-103.754,0,-3.51307", "1/1:49:-103.754,0,-3.51307"]})
I want individual columns named GT, DP, RO, QR, AO, QA, GL with values from column B
I want to produce a data frame with one column per key, filled with the corresponding values from column B.
We can split the two columns using a = df.A.str.split(":", expand=True) and b = df.B.str.split(":", expand=True) to get two individual data frames. These can be merged with c = pd.merge(a, b, left_index=True, right_index=True) to get all the desired data, but not in the expected format.
Any suggestions? I think a better way could be to split both columns A and B, then create a dict column with values from A as keys and B as values; that column could then be converted to a data frame.
Thanks
Use an OrderedDict to preserve the order after creating a dict mapping of the two concerned columns of the dataframe split on the sep ":", flattened to a list.
Feed this to the dataframe constructor later.
from collections import OrderedDict
L = dfc.apply(
    lambda x: OrderedDict(zip(x['A'].split(':'), x['B'].split(':'))), axis=1).tolist()
pd.DataFrame(L)
I'm going to split everything by ':', but I have 2 columns. If I stack first, I get a Series on which I can more easily use str.split.
I now have a split Series that I can group by level=0, which is the original index.
I zip and dict to get Series-like structures with the original column A as the indices and B as the values.
unstack and I'm done.
gb = dfc.stack().str.split(':').groupby(level=0)
gb.apply(lambda x: dict(zip(*x))).unstack()
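A quick check of the first approach on a trimmed sample; keys missing from a short record (e.g. RO in the `GT:DP:GL` rows) simply come out as NaN:

```python
from collections import OrderedDict

import pandas as pd

# Trimmed two-row sample of the question's frame
dfc = pd.DataFrame({"A": ["GT:DP:GL", "GT:DP:RO:QR:AO:QA:GL"],
                    "B": ["1/1:49:-103.754,0,-3.51307",
                          "0/1:71:43:1363:28:806:-71.1191,0,-121.278"]})

# One dict per row, mapping the keys in A to the values in B
L = dfc.apply(lambda x: OrderedDict(zip(x['A'].split(':'),
                                        x['B'].split(':'))), axis=1).tolist()
out = pd.DataFrame(L)
```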

Convert Pandas DataFrame to Dictionary using multiple columns for the key

I have a pandas DataFrame as below
From_email,To_email,email_count
110165.74#compuserve.com,klay#enron.com,1
2krayz#gte.net,klay#enron.com,1
"<""d#piassick"".#enron#enron.com>",klay#enron.com,1
I would like to change it to a dictionary of the following format
hrc_dict = {('110165.74#compuserve.com', 'klay#enron.com'): 1,
('2krayz#gte.net', 'klay#enron.com'): 1,
('<"d#piassick".#enron#enron.com>', 'klay#enron.com '): 1}
What is the best way to do this?
You can use a dict comprehension to create the dict from your DataFrame.
df = pd.DataFrame({
    'From_email': ['110165.74#compuserve.com', '2krayz#gte.net', '<"d#piassick".#enron#enron.com>'],
    'To_email': ['klay#enron.com', 'klay#enron.com', 'klay#enron.com'],
    'email_count': [1, 1, 1]})
d = {tuple(x[:2]):x[2] for x in df[['From_email', 'To_email', 'email_count']].values}
First we explicitly grab the necessary columns from your data frame in the desired order. Then we iterate over the rows and, for each row, create a tuple from the email addresses (first two columns) and use this as the key. The value is simply the 3rd column (email_count).
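An equivalent approach (my addition, not part of the original answer) sets both key columns as a MultiIndex and calls `Series.to_dict()`, which produces tuple keys directly:

```python
import pandas as pd

# Trimmed sample of the question's data
df = pd.DataFrame({
    'From_email': ['110165.74#compuserve.com', '2krayz#gte.net'],
    'To_email': ['klay#enron.com', 'klay#enron.com'],
    'email_count': [1, 1]})

# The MultiIndex levels become the tuple keys of the resulting dict
d = df.set_index(['From_email', 'To_email'])['email_count'].to_dict()
```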
