Convert pandas.groupby to dict - python

Consider, dataframe d:
d = pd.DataFrame({'a': [0, 2, 1, 1, 1, 1, 1],
                  'b': [2, 1, 0, 1, 0, 0, 2],
                  'c': [1, 0, 2, 1, 0, 2, 2]})
   a  b  c
0  0  2  1
1  2  1  0
2  1  0  2
3  1  1  1
4  1  0  0
5  1  0  2
6  1  2  2
I want to split it by column a into a dictionary like this:
{0:    a  b  c
    0  0  2  1,
 1:    a  b  c
    2  1  0  2
    3  1  1  1
    4  1  0  0
    5  1  0  2
    6  1  2  2,
 2:    a  b  c
    1  2  1  0}
The solution I've found using pandas.groupby is:
{k: table for k, table in d.groupby("a")}
What are the other solutions?

You can apply dict to the tuple (or list) of your groupby:
res = dict(tuple(d.groupby('a')))
A memory efficient alternative to dict is to create a groupby object and then use get_group:
res = d.groupby('a')
res.get_group(1) # select dataframe where column 'a' = 1
If the resulting tables need minor manipulation, such as resetting the index or dropping the groupby column, stick with a dictionary comprehension:
res = {k: v.drop('a', axis=1).reset_index(drop=True) for k, v in d.groupby('a')}
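For reference, a minimal sketch of what these approaches produce, using the d defined in the question (the comments describe the expected result rather than verified console output):

import pandas as pd

d = pd.DataFrame({'a': [0, 2, 1, 1, 1, 1, 1],
                  'b': [2, 1, 0, 1, 0, 0, 2],
                  'c': [1, 0, 2, 1, 0, 2, 2]})

# Keys are the unique values of 'a'; values are the matching sub-DataFrames.
res = dict(tuple(d.groupby('a')))
print(sorted(res))   # [0, 1, 2]
print(res[1])        # the five rows where a == 1, original index preserved

# The lazy alternative: keep the groupby object and pull groups on demand.
gb = d.groupby('a')
print(gb.get_group(2))   # the single row where a == 2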

Related

How to find the column numbers where a value is present and return them as an array in Python?

Suppose you have the following data frame:
ID  Q1  Q2  Q3
 0   1   0   1
 1   0   0   1
 2   1   1   1
 3   0   1   1
I would like to return an array of the column numbers wherever there is a 1, and add it as another column like this:
ID  Array      Q1  Q2  Q3
 0  [0,2]       1   0   1
 1  [2]         0   0   1
 2  [0,1,2]     1   1   1
 3  [1,2]       0   1   1
Thanks
I would use numpy.where:
a, b = np.where(df.filter(like='Q') == 1)
# or
a, b = np.where(df.drop(columns='ID') == 1)
df['Array'] = pd.Series(b).groupby(a).agg(list).set_axis(df.index)
Output:
   ID  Q1  Q2  Q3      Array
0   0   1   0   1     [0, 2]
1   1   0   0   1        [2]
2   2   1   1   1  [0, 1, 2]
3   3   0   1   1     [1, 2]
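To make the alignment explicit, here is a small sketch of the intermediate arrays for this df (values worked out by hand from the sample data): np.where returns the row positions in a and the column positions in b, and grouping b by a collects the column positions per row.

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2, 3],
                   'Q1': [1, 0, 1, 0],
                   'Q2': [0, 0, 1, 1],
                   'Q3': [1, 1, 1, 1]})

a, b = np.where(df.drop(columns='ID') == 1)
# a -> [0, 0, 1, 2, 2, 2, 3, 3]   row positions of the 1s
# b -> [0, 2, 2, 0, 1, 2, 1, 2]   column positions within Q1..Q3
# Grouping b by a and aggregating to lists gives one list per row;
# set_axis(df.index) realigns it (this assumes every row has at least one 1).
df['Array'] = pd.Series(b).groupby(a).agg(list).set_axis(df.index)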
Pure pandas variant:
df2 = df.filter(like='Q')
df['Array'] = (df2.set_axis(range(df2.shape[1]), axis=1).stack()
                  .loc[lambda s: s == 1].reset_index()
                  .groupby('level_0')['level_1'].agg(list)
               )
Another approach using np.where together with the "apply" method:
import pandas as pd
import numpy as np

data = {'ID': {0: 0, 1: 1, 2: 2, 3: 3},
        'Q1': {0: 1, 1: 0, 2: 1, 3: 0},
        'Q2': {0: 0, 1: 0, 2: 1, 3: 1},
        'Q3': {0: 1, 1: 1, 2: 1, 3: 1}}
df = pd.DataFrame(data)

def arr_func(row):
    # positions of the non-zero entries (here, the 1s) in the row
    return np.where(row)[0]

df['Array'] = df.drop(columns='ID').apply(arr_func, axis=1)
The result:
   ID  Q1  Q2  Q3      Array
0   0   1   0   1     [0, 2]
1   1   0   0   1        [2]
2   2   1   1   1  [0, 1, 2]
3   3   0   1   1     [1, 2]
Here's a simple solution with a plain list comprehension (note that you need the column positions, not the values themselves, so enumerate the row):
import pandas as pd

def my_func(row):
    # column positions (within Q1..Q3) where the value is 1
    return [i for i, item in enumerate(row) if item == 1]

df['Array'] = df[['Q1', 'Q2', 'Q3']].apply(my_func, axis=1)

After a groupby on a dataframe produced a dictionary, how can I convert that dictionary back to a dataframe?

I have a dictionary dict_1 like this
'''
{0: [1],
 1: [2, 3, 5],
 2: [1, 2],
 3: [4, 5]}
'''
how can I convert it to a dataframe df like this:
'''
column 1 column 2
0 1
1 2
1 3
1 5
2 1
2 2
3 4
3 5
'''
My approach is df = pd.DataFrame(list(dict_1.items())), but the output is
   0          1
0  0        [1]
1  1  [2, 3, 5]
2  2     [1, 2]
3  3     [4, 5]
It can surely be made more elegant, but is something like this what you're looking for?
for k, v in dict_1.items():
    for e in v:
        print(k, e)
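If you need an actual dataframe with the two columns from the question rather than printed pairs, the same nested loop can feed the constructor; a sketch (column names taken from the question):

import pandas as pd

dict_1 = {0: [1], 1: [2, 3, 5], 2: [1, 2], 3: [4, 5]}

# Collect (key, element) pairs instead of printing them
rows = [(k, e) for k, v in dict_1.items() for e in v]
df = pd.DataFrame(rows, columns=['column 1', 'column 2'])

# A roughly equivalent pandas one-liner:
# pd.Series(dict_1).explode().rename_axis('column 1').reset_index(name='column 2')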

Transform 2D numpy array to row-column-value pandas DataFrame

Suppose I have a 2D numpy array like this:
arr = np.array([[1, 2], [3, 4], [5, 6]])
# array([[1, 2],
#        [3, 4],
#        [5, 6]])
How can one transform that to a "long" structure with one record per value, associated with the row and column index? In this case that would look like:
df = pd.DataFrame({'row': [0, 0, 1, 1, 2, 2],
                   'column': [0, 1, 0, 1, 0, 1],
                   'value': [1, 2, 3, 4, 5, 6]})
melt only assigns the column identifier, not the row:
pd.DataFrame(arr).melt()
#    variable  value
# 0         0      1
# 1         0      3
# 2         0      5
# 3         1      2
# 4         1      4
# 5         1      6
Is there a way to attach the row identifier?
Pass the index to id_vars:
pd.DataFrame(arr).reset_index().melt('index')
#    index  variable  value
# 0      0         0      1
# 1      1         0      3
# 2      2         0      5
# 3      0         1      2
# 4      1         1      4
# 5      2         1      6
You can rename:
df = pd.DataFrame(arr).reset_index().melt('index')
df.columns = ['row', 'column', 'value']
melt can use the index if it's a column:
arrdf = pd.DataFrame(arr)
arrdf['row'] = arrdf.index
arrdf.melt(id_vars='row', var_name='column')
#    row  column  value
# 0    0       0      1
# 1    1       0      3
# 2    2       0      5
# 3    0       1      2
# 4    1       1      4
# 5    2       1      6
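If you would rather build the indices in numpy, a sketch using np.indices that skips melt entirely (and, as a side effect, comes out sorted by row like the target frame in the question):

import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])

# np.indices gives one array of row indices and one of column indices
rows, cols = np.indices(arr.shape)
df = pd.DataFrame({'row': rows.ravel(),
                   'column': cols.ravel(),
                   'value': arr.ravel()})
# row: 0 0 1 1 2 2, column: 0 1 0 1 0 1, value: 1 2 3 4 5 6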

Pandas - Does row fall below a row with a column value and same id

I am new to Pandas. I have a Pandas data frame like so:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0]})
I want to add a column val2 that indicates whether a row falls below another row with the same id where val1 == 1.
The result would be a data frame like:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0], 'val2': [0, 0, 1, 0, 0, 1, 1]})
My first thought was to use apply, but that only works one row at a time, and in my experience for loops are never the answer. Any help would be greatly appreciated!
Let's try shift + cumsum inside a groupby.
df['val2'] = df.groupby('id').val1.apply(
    lambda x: x.shift().cumsum()
).ge(1).astype(int)
Or, in an attempt to avoid the lambda,
df['val2'] = (
    df.groupby('id')
      .val1.shift()
      .groupby(df.id)
      .cumsum()
      .ge(1)
      .astype(int)
)
df
   id  val1  val2
0   1     0     0
1   1     1     0
2   1     0     1
3   2     0     0
4   2     1     0
5   2     0     1
6   2     0     1
Using groupby + transform. Similar to coldspeed's but using bool conversion for non-zero cumsum values.
df['val2'] = df.groupby('id')['val1'].transform(lambda x: x.cumsum().shift())\
               .fillna(0).astype(bool).astype(int)
print(df)
   id  val1  val2
0   1     0     0
1   1     1     0
2   1     0     1
3   2     0     0
4   2     1     0
5   2     0     1
6   2     0     1
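A variant on the same idea, sketched with cummax instead of cumsum (shift's fill_value argument needs pandas 0.24+); it reads as "has a 1 appeared earlier in this group":

df['val2'] = (df.groupby('id')['val1']
                .transform(lambda s: s.cummax().shift(fill_value=0))
                .astype(int))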

Pandas multiindex boolean indexing

Given a multi-indexed dataframe, I would like to return only those rows whose outer-level group satisfies a condition for every value of the inner level. Here is a small working example:
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4], 'c': [0, 2, 2, 2]})
df = df.set_index(['a', 'b'])
print(df)
out:
     c
a b
1 1  0
  2  2
2 3  2
  4  2
Now, I would like to return the entries for which c > 1 holds for the whole outer group. Plain boolean indexing, df[df['c'] > 1], filters row by row:
out:
     c
a b
1 2  2
2 3  2
  4  2
But I want to get
out:
     c
a b
2 3  2
  4  2
Any thoughts on how to do this in the most efficient way?
I ended up using groupby:
df.groupby(level=0).filter(lambda x: all(v > 1 for v in x['c']))
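An equivalent sketch without the Python-level loop, broadcasting the group-wise check back with transform (on older pandas versions, .transform(lambda s: s.all()) does the same thing):

# True for every row of a group only if all its 'c' values exceed 1
mask = (df['c'] > 1).groupby(level='a').transform('all')
df[mask]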
