long-format pandas dataframe to dictionary - python

While I can find help and documentation on converting a pandas DataFrame to a dictionary so that columns are keys and rows are values, I'm stuck when I want one column's values as keys and the associated values from another column as values, so that a df like this
a  b
1  car
1  train
2  boot
2  computer
2  lipstick
converts to the following dictionary: {'1': ['car', 'train'], '2': ['boot', 'computer', 'lipstick']}
I have a feeling it's something pretty simple, but I'm out of ideas. I tried df.groupby('a').to_dict() but was unsuccessful.
Any suggestions?

You could view this as a groupby-aggregation (i.e., an operation which turns each group into one value -- in this case a list):
In [85]: df.groupby(['a'])['b'].agg(lambda grp: list(grp))
Out[85]:
a
1 [car, train]
2 [boot, computer, lipstick]
dtype: object
In [68]: df.groupby(['a'])['b'].agg(lambda grp: list(grp)).to_dict()
Out[68]: {1: ['car', 'train'], 2: ['boot', 'computer', 'lipstick']}
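As an aside, the lambda isn't needed; passing list directly to agg gives the same result:
df.groupby(['a'])['b'].agg(list).to_dict()
# {1: ['car', 'train'], 2: ['boot', 'computer', 'lipstick']}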

You can't call to_dict() on the result of groupby, but you can use groupby to drive your own dictionary construction. The following code works with the example you provided.
import pandas as pd

df = pd.DataFrame(dict(a=[1, 1, 2, 2, 2],
                       b=['car', 'train', 'boot', 'computer', 'lipstick']))

# Using a loop
dt = {}
for g, d in df.groupby('a'):
    dt[g] = d['b'].values

# Using a dictionary comprehension
dt2 = {g: d['b'].values for g, d in df.groupby('a')}
Now both dt and dt2 will be dictionaries like this:
{1: array(['car', 'train'], dtype=object),
2: array(['boot', 'computer', 'lipstick'], dtype=object)}
Of course you can put the numpy arrays back into lists, if you so desire.
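For example, calling .tolist() in the comprehension yields plain lists instead of arrays:
dt_lists = {g: d['b'].tolist() for g, d in df.groupby('a')}
# {1: ['car', 'train'], 2: ['boot', 'computer', 'lipstick']}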

Yes, because DataFrameGroupBy has no to_dict attribute; only DataFrame has one.
DataFrame.to_dict(orient='dict')
Convert DataFrame to dictionary. (The parameter was named outtype in older pandas versions.)
You can read more about DataFrame.to_dict in the pandas documentation.
Take a look at this:
import numpy as np
import pandas as pd

df = pd.DataFrame([np.random.sample(9), np.random.sample(9)])
df.columns = [c for c in 'abcdefghi']

# it will convert the DataFrame to a dict, with {column -> {index -> value}}
df.to_dict()
{'a': {0: 0.53252618404947039, 1: 0.78237275521385163},
'b': {0: 0.43681232450879315, 1: 0.31356312459390356},
'c': {0: 0.84648298651737541, 1: 0.81417040486070058},
'd': {0: 0.48419015448536995, 1: 0.37578177386187273},
'e': {0: 0.39840348154035421, 1: 0.35367537180764919},
'f': {0: 0.050381560155985827, 1: 0.57080653289506755},
'g': {0: 0.96491634442628171, 1: 0.32844653606404517},
'h': {0: 0.68201236712813085, 1: 0.0097104037581828839},
'i': {0: 0.66836630467152902, 1: 0.69104505886376366}}
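to_dict also accepts other orientations; for example, 'list' maps each column to a list of its values:
df.to_dict('list')
# {column -> [values]}, e.g. {'a': [0.53..., 0.78...], 'b': [0.43..., 0.31...], ...}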
type(df)
pandas.core.frame.DataFrame
# DataFrame.groupby is another type
type(df.groupby('a'))
pandas.core.groupby.DataFrameGroupBy
df.groupby('a').to_dict()
AttributeError: Cannot access callable attribute 'to_dict' of 'DataFrameGroupBy' objects, try using the 'apply' method
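As the error message suggests, apply does work on the grouped object; a minimal sketch on a small frame:
df2 = pd.DataFrame({'a': [1, 1, 2], 'b': ['car', 'train', 'boot']})
df2.groupby('a')['b'].apply(list).to_dict()
# {1: ['car', 'train'], 2: ['boot']}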

Related

How to compare 2 dictionary values in Python and make pairs with common ones by keys?

I have two columns: one holds pandas datetimes (data["start"]) and the second holds tags, data["parallels"] for example. I'm going to create a dictionary, like this:
a = []
pl = []
pl1 = []
for i in list(data.index):
    a.append(data["parallels"][i].astype(str))
    if a[i] != 'nan':
        pl1.append(i)
        pl.append(a[i])
    if i > list(data.index)[i]: break
parl1 = dict(zip(pl1, pl))
So, the output dictionary is: {3: '1.0', 5: '1.0'}
How can I check this dictionary for equal values (in the example both are) and, after checking, write down the keys? I'm going to use the output keys as indices when equating columns: data["start"][5] == data["start"][3].
I wonder how to do it automatically, for example given the dict {2: '2.0', 3: '1.0', 4: '2.0', 5: '1.0'}.
Reverse the dictionary so that keys with equal values are grouped together:
v = {}
for key, value in d.items():
    v.setdefault(value, set()).add(key)
print(v)

import pandas as pd
print(pd.DataFrame({'val': list(v.keys()), 'equal_keys': list(v.values())}))
{'2.0': {2, 4}, '1.0': {3, 5}}
   val equal_keys
0  2.0     {2, 4}
1  1.0     {3, 5}
Other than this, maybe you want to use pandas groupby and aggregate all the indices:
# just a sketch, not checked against your exact data
df.reset_index().groupby('parallels')['index'].agg(list)
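pandas also exposes the group-to-index mapping directly, which may be all you need (again assuming the column is named 'parallels'):
dict(df.groupby('parallels').groups)  # {tag -> Index of matching row labels}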

How to enable row[col_name] syntax with Namedtuple Pandas from df.itertuples()

We have a DataFrame with many columns and need to cycle through the rows with df.itertuples(). Many column names are in variables, and accessing the namedtuple row with getattr() works fine but is not very readable with many column accesses. Is there a way to enable the row[col_name] syntax? E.g. with a subclassed NamedTuple like here https://stackoverflow.com/a/65301971/360265?
import pandas as pd

col_name = 'b'
df = pd.DataFrame([{'a': 1, 'b': 2.}, {'a': 3, 'b': 4.}])
for row in df.itertuples():
    print(row.a)  # Using row._asdict() would disable this syntax
    print(getattr(row, col_name))  # works fine but is not as readable as row[col_name]
    print(row[col_name])  # how to enable this syntax?
Wrapping row in the following Frame class is a solution, but not really a Pythonic one.
from typing import NamedTuple

class Frame:
    def __init__(self, namedtuple: NamedTuple):
        self.namedtuple = namedtuple

    def __getattr__(self, item):
        return getattr(self.namedtuple, item)

    def __getitem__(self, item):
        return getattr(self.namedtuple, item)
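For reference, using the wrapper would look like this:
for row in df.itertuples():
    row = Frame(row)
    print(row.a)          # attribute access still works
    print(row[col_name])  # item access now works too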
Use to_dict
import pandas as pd

col_name = 'b'
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
for row in df.to_dict('records'):
    print(row[col_name])
Output
2
4
If you want to keep both notations, a possible approach would be to do:
def iterdicts(tuples):
    yield from ((tup, tup._asdict()) for tup in tuples)

df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
for tup, row in iterdicts(df.itertuples()):
    print(tup.a)
    print(row[col_name])
Output
1
2
3
4
A similar approach to yours, just using df.iterrows()
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
for index, row in df.iterrows():
    print(row.b)
    print(getattr(row, 'b'))
    print(row['b'])
These lines were tested using pandas versions 0.20.3 and 1.0.1.

extract dataframes from list of dictionaries and combine into one

I have a list of dictionaries. Each item in the list is a dictionary, and each dictionary is a single key-value pair with the value being a DataFrame. I would like to extract all the DataFrames and combine them into one.
I have tried:
df = pd.DataFrame.from_dict(data)
for both the full data file and for each dictionary in the list.
This gives the following error:
ValueError: If using all scalar values, you must pass an index
I have also tried turning the dictionary into a list and then converting it to a pd.DataFrame; I get:
KeyError: 0
Any ideas?
It should be doable with pd.concat(). Let's say you have a list of dictionaries l:
import numpy as np
import pandas as pd

l = (
    {'a': pd.DataFrame(np.arange(9).reshape((3, 3)))},
    {'b': pd.DataFrame(np.arange(9).reshape((3, 3)))},
    {'c': pd.DataFrame(np.arange(9).reshape((3, 3)))}
)
You can feed dataframes from each dict in the list to pd.concat():
df = pd.concat([df_ for dict_ in l for df_ in dict_.values()])
In my example all data frames have the same number of columns, so the result has a 9 x 3 shape. If your dataframes have different columns, the output will be malformed and require extra steps to process.
This should work.
import pandas as pd

dict1 = {'d1': pd.DataFrame({'a': [1, 2, 3], 'b': ['one', 'two', 'three']})}
dict2 = {'d2': pd.DataFrame({'a': [4, 5, 6], 'b': ['four', 'five', 'six']})}
dict3 = {'d3': pd.DataFrame({'a': [7, 8, 9], 'b': ['seven', 'eight', 'nine']})}

# dicts list. you would start from here
dicts_list = [dict1, dict2, dict3]

dict_counter = 0
for _dict in dicts_list:
    aux_df = list(_dict.values())[0]
    if dict_counter == 0:
        df = aux_df
    else:
        df = df.append(aux_df)
    dict_counter += 1

# Resetting and dropping the old index
df = df.reset_index(drop=True)
print(df)
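Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the loop above collapses to a single pd.concat call:
df = pd.concat([list(d.values())[0] for d in dicts_list]).reset_index(drop=True)
print(df)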
Just out of curiosity: why are your sub-dataframes wrapped in dictionaries in the first place? An easy way of creating a dataframe from dictionaries is to build a list of dictionaries and then call pd.DataFrame(list_with_dicts). If the keys are the same across all dictionaries, it should work. Just a suggestion from my side. Something like this:
list_with_dicts = [{'a': 1, 'b': 2}, {'a': 5, 'b': 4}, ...]
# my_df -> DataFrame with columns [a, b] and two rows with the values in the dict.
my_df = pd.DataFrame(list_with_dicts)

Convert pandas dataframe to dict to JSON, unflatten nested subkeys, drop None/NaN keys

Can the following be done in Pandas in one go, in more Pythonic code than below?
I have a row from a pandas dataframe:
- some values may be NaNs or empty strings or similar
- I'd like to map this information to a dict (which is then converted to JSON and passed on to another application)
- however, NaNs should not be included in the dict (by default they are passed as None)
- dict subkeys 'c.x', 'c.y', 'c.z' should be unflattened, i.e. converted to a subdict c with keys x, y, z; again, NaN keys in each row should be dropped
Sample input: I iterate over rows in a dataframe with row = next(df.iterrows()), where a sample row would look like:
a 3
b NaN
c.x 4
c.y 5
c.z NaN
Desired output
{"A": 3,
"C": {"X": 4, "Y": 5}}
The most natural way (to me) to do that would look something like this:
outdict={"A": row['a'] if not pandas.isna(row['a']) else None,
"B": row['b'] if not pandas.isna(row['b']) else None,
"C": {"X": row['c.x'] if not pandas.isna(row['c.x']) else None,
"Y": row['c.y'] if not pandas.isna(row['c.y']) else None,
"Z": row['c.z'] if not pandas.isna(row['c.z']) else None
}}
However, this still assigns None to the slots that I'd like to leave out entirely (the receiving application has difficulty handling nulls).
One workaround would be to use this code and remove all None values in a second pass, or to call outdict.update for each value (skipping the update if the value is NaN). But neither solution seems very efficient to me.
To transform your DataFrame to a dictionary without NaN, there is a straightforward way:
df.dropna().to_dict()
But you also want to create sub-dictionaries from the composed keys, and for that I found no way other than a loop:
import pandas as pd

df = pd.DataFrame({"col": [3, None, 4, 5, None]}, index=["a", "b", "c.x", "c.y", "c.z"])
d = df.dropna().to_dict()
d is:
{'col': {'a': 3.0, 'c.x': 4.0, 'c.y': 5.0}}
Then:
d2 = dict()
for k, v in d['col'].items():
    if '.' in k:
        a, b = k.split('.')
        d2.setdefault(a, {})
        d2[a][b] = v
    else:
        d2[k] = v
and d2 is:
{'a': 3.0, 'c': {'y': 5.0, 'x': 4.0}}
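Since the dict is ultimately converted to JSON, the standard library handles the nesting from here:
import json

print(json.dumps(d2))  # {"a": 3.0, "c": {"x": 4.0, "y": 5.0}} (key order may vary)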
If row is a Series object, the following code will not create any entries for NaNs:
outdict = {row.index[i]: row[i]
           for i in range(data.shape[1])
           if not pandas.isna(row[i])}
However, it won't create the nested structure that you want. There are several ways I can think of to solve this, none of which are extremely elegant. The best way I can think of is to exclude the columns with labels of the form a.b when creating outdict; i.e.
outdict = {row.index[i]: row[i]
           for i in range(data.shape[1])
           if not (pandas.isna(row[i]) or '.' in row.index[i])}
then create the subdicts individually and assign them in outdict.
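A minimal sketch of that last step, assuming the dotted labels all share the prefix 'c' and row is the Series:
sub = {k.split('.', 1)[1]: row[k]
       for k in row.index
       if k.startswith('c.') and not pandas.isna(row[k])}
if sub:
    outdict['c'] = sub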

creating dict from column with header as key, no list formatting

One example of a simple DataFrame
   ID1  ID2
0    1    2
1    2    2
I need to transform column ID1 into a dict where 'ID1' is the key and the row values are the pure values. My designated output is
dict1 = {ID1: ('1', '1', ...)}
So far I've used a simple pandas.DataFrame.to_dict(list) statement to get the following:
dict2 = {ID1: ['1', '1', ...]}
However, dict2 does not work when I pass the dict to my database.
Any suggestions on how to create a dict without the square brackets of a list, so I get a result as shown in dict1?
If I understand correctly, you want something like this:
d = {k: tuple(v) for k, v in dict2.items()}
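Applied to the example (assuming dict2 came out as described), that gives:
dict2 = {'ID1': ['1', '1']}
d = {k: tuple(v) for k, v in dict2.items()}
# d == {'ID1': ('1', '1')}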
Did you try:
df.to_dict()
On your small example this generates:
{'ID1': {0: 1, 1: 2}, 'ID2': {0: 2, 1: 2}}
for a single column:
df[['ID2']].to_dict()
{'ID2': {0: 2, 1: 2}}
