aggregate and group three columns in pandas dataframe - python

My dataframe is
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A'],
'col2': ['action1', 'action2', 'action1', 'action3', 'action2', 'action1', 'action1', 'action2'],
'col3': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y']})
it looks like
col1 col2 col3
0 A action1 X
1 A action2 X
2 B action1 X
3 B action3 X
4 C action2 X
5 C action1 X
6 A action1 Y
7 A action2 Y
I would like to aggregate them into
col1 col2 col3
0 A,C action1,action2 X
1 B action1,action3 X
2 A action1,action2 Y
The order of items within a column does not matter. Basically, I would like to aggregate col1 and col2, but keep the aggregations separate when col3 differs.
What is the approach I should take?

There are probably many ways to do this, but here's a solution that uses groupby twice: once to build the set of actions per (col3, col1) pair, and again to group on those actions and col3.
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A'],
'col2': ['action1', 'action2', 'action1', 'action3', 'action2', 'action1', 'action1', 'action2'],
'col3': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y']})
df = df.sort_values(by='col2')
df = df.groupby(['col3','col1'], as_index=False)['col2'].apply(lambda x: ','.join(x))
df = df.groupby(['col3','col2'], as_index=False)['col1'].apply(lambda x: ','.join(x)).sort_index(axis=1)
Output
col1 col2 col3
0 A,C action1,action2 X
1 B action1,action3 X
2 A action1,action2 Y

IIUC, you want to group together the groups that have common values in col2.
For this you need to set up a helper group:
m = df.groupby('col1')['col2'].apply(frozenset)
(df.groupby([df['col1'].map(m), 'col3'], as_index=False)
.aggregate(lambda x: ','.join(set(x)))
)
output:
col3 col1 col2
0 X A,C action1,action2
1 Y A action1,action2
2 X B action1,action3
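To see why the helper works, it can help to inspect m directly (same df as in the question): it maps each col1 value to the order-insensitive frozenset of its actions, so A and C hash to the same group key.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A'],
                   'col2': ['action1', 'action2', 'action1', 'action3',
                            'action2', 'action1', 'action1', 'action2'],
                   'col3': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y']})

# Each col1 value -> frozenset of its actions; frozensets are hashable,
# so they can serve as group keys
m = df.groupby('col1')['col2'].apply(frozenset)
print(m)
```

Since m['A'] and m['C'] are the same frozenset, mapping col1 through m puts A and C in the same group.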

Related

pandas group by and calculate result using values from the same column (R equivalent included)

I want to group and aggregate and then calculate a ratio based on values in a certain column.
In R it's pretty straightforward.
df = data.frame(a = c('a', 'a', 'b', 'b'),
b = c('x', 'y', 'x', 'y'),
value = 1:4)
df %>%
group_by(a) %>%
summarise(calc = value[b == 'x']/value[b == 'y']) ## (1/2) and (3/4)
In python I tried
df = pd.DataFrame({'a': ['a', 'a', 'b', 'b'],
'b': ['x', 'y', 'x', 'y'],
'value': [1, 2, 3, 4]})
df.groupby('a').agg(df[df['b'] == 'x'] / df[df['b'] == 'y'])
But it's throwing errors.
You can try this:
import pandas as pd
import numpy as np
cond1 = lambda x: x['value'].loc[x['b'].eq('x')].to_numpy()
cond2 = lambda x: x['value'].loc[x['b'].eq('y')].to_numpy()
(df.groupby('a').apply(lambda x: (cond1(x) / cond2(x))[0])
.reset_index(name = 'result'))
a result
0 a 0.50
1 b 0.75
Or in a slightly different form we could do:
(df.groupby('a').apply(lambda x: np.divide(cond1(x), cond2(x)))
.reset_index(name = 'result')
.explode('result'))
a result
0 a 0.5
1 b 0.75
For this case, you can use a pivot:
df.pivot(index='a',columns='b',values='value').pipe(lambda df: df.x/df.y)
Out[9]:
a
a 0.50
b 0.75
dtype: float64
For this specific use case, you do not need a groupby, as there is no aggregation really happening here:
temp = df.set_index('a')
b_x = temp.loc[temp.b.eq('x'), 'value']
b_y = temp.loc[temp.b.eq('y'), 'value']
b_x/b_y
Out[23]:
a
a 0.50
b 0.75
Name: value, dtype: float64
You could do:
df.pivot(index='a', columns='b', values='value').assign(calc=lambda x: x.x/x.y).reset_index()
b a x y calc
0 a 1 2 0.50
1 b 3 4 0.75

PANDAS - Rename and combine like columns

I am trying to rename a column and combine that renamed column to others like it. The row indexes will not be the same (i.e. I am not combining 'City' and 'State' from two columns).
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
df.rename(columns={'Col_one' : 'Col_1'}, inplace=True)
# Desired output:
({'Col_1': ['A', 'B', 'C', 'G', 'H', 'I'],
'Col_2': ['D', 'E', 'F', '-', '-', '-'],})
I've tried pd.concat and a few other things, but it fails to combine the columns in a way I'm expecting. Thank you!
This is melt and pivot after you have renamed:
u = df.melt()
out = (u.assign(k=u.groupby("variable").cumcount())
.pivot(index="k", columns="variable", values="value").fillna('-'))
out = out.rename_axis(index=None,columns=None)
print(out)
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
Using append without modifying the actual dataframe:
result = (df[['Col_1', 'Col_2']]
.append(df[['Col_one']]
.rename(columns={'Col_one': 'Col_1'}),ignore_index=True).fillna('-')
)
OUTPUT:
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
This might be a slightly longer method than the other answers, but the below delivers the required output.
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
# Create a list of the values we want to retain
TempList = df['Col_one']
# Append existing dataframe with the values from the list
df = df.append(pd.DataFrame({'Col_1':TempList}), ignore_index = True)
# Drop the redundant column
df.drop(columns=['Col_one'], inplace=True)
# Populate NaN with -
df.fillna('-', inplace=True)
Output is
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
Using concat should work.
import pandas as pd
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
df2 = pd.DataFrame()
df2['Col_1'] = pd.concat([df['Col_1'], df['Col_one']], axis = 0)
df2 = df2.reset_index(drop=True)
df2['Col_2'] = df['Col_2']
df2['Col_2'] = df2['Col_2'].fillna('-')
print(df2)
prints
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -

Pandas how add a new column to dataframe based on values from all rows, specific columns values applied to whole dataframe

I'm working on a pandas DataFrame that needs a new column showing the count of specific values in specific columns.
I tried various combinations of groupby and pivot, but had problems applying them to the whole dataframe without errors.
df = pd.DataFrame([
['a', 'z'],
['a', 'x'],
['a', 'y'],
['b', 'v'],
['b', 'x'],
['b', 'v']],
columns=['col1', 'col2'])
I need to add col3, which counts the 'v' values in col2 for each value in col1. There is no 'v' in col2 for 'a' in col1, so the count is 0 for those rows, while the expected count is 2 for every 'b' row, including the row where col2 equals 'x' instead of 'v'.
Expected output:
['a', 'z', 0]
['a', 'x', 0]
['a', 'y', 0]
['b', 'v', 2]
['b', 'x', 2]
['b', 'v', 2]
I'm rather looking for a nice pandas-specific solution, because the original dataframe is quite big and things like row iteration are too time-expensive.
Create a Boolean Series checking the equality, then use groupby + transform + 'sum' to count the matches per group.
df['col3'] = df.col2.eq('v').astype(int).groupby(df.col1).transform('sum')
# col1 col2 col3
#0 a z 0
#1 a x 0
#2 a y 0
#3 b v 2
#4 b x 2
#5 b v 2
While ALollz's answer is a neat one-liner, here is another, two-step solution that introduces other concepts like str.contains and np.where!
First get the rows which have v using np.where and mark them as a flag:
df['col3'] = np.where(df['col2'].str.contains('v'), 1, 0)
Now perform a groupby on col1 and sum them:
df['col3'] = df.groupby('col1')['col3'].transform('sum')
Output:
col1 col2 col3
0 a z 0
1 a x 0
2 a y 0
3 b v 2
4 b x 2
5 b v 2
All the answers above are fine. The only caveat is that transform can be slow when the group size is very large. Alternatively, you can try the workaround below:
(df.assign(mask = lambda x:x.col2.eq('v'))
.pipe(lambda x:x.join(x.groupby('col1')['mask'].sum().map(int).rename('col3'),on='col1')))
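In the same spirit, one more sketch that avoids transform is to count the matching rows once per group and broadcast the counts back with map (using the question's df):

```python
import pandas as pd

df = pd.DataFrame([['a', 'z'], ['a', 'x'], ['a', 'y'],
                   ['b', 'v'], ['b', 'x'], ['b', 'v']],
                  columns=['col1', 'col2'])

# Count the 'v' rows once per col1 group...
counts = df[df['col2'].eq('v')].groupby('col1').size()
# ...then broadcast back; groups with no 'v' rows get NaN, so fill with 0
df['col3'] = df['col1'].map(counts).fillna(0).astype(int)
print(df)
```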

Computing a new column based on other columns

I am trying to create a subset of an existing variable (col1) in the below df. My new variable (col2) would only have "a" corresponding to "a" in col1. The rest of the values should be marked as "Other". Please help.
col1
a
b
c
a
b
c
a
Col2
a
Other
Other
a
Other
Other
a
Use numpy.where:
import numpy as np

df['col2'] = np.where(df['col1'] == 'a', 'a', 'Other')
#alternative
#df['col2'] = df['col1'].where(df['col1'] == 'a', 'Other')
print (df)
col1 col2
0 a a
1 b Other
2 c Other
3 a a
4 b Other
5 c Other
6 a a
Method 1: np.where
This is the most direct method:
df['col2'] = np.where(df['col1'] == 'a', 'a', 'Other')
Method 2: pd.DataFrame.loc
df['col2'] = 'Other'
df.loc[df['col1'] == 'a', 'col2'] = 'a'
Method 3: pd.Series.map
df['col2'] = df['col1'].map({'a': 'a'}).fillna('Other')
Most of these methods can be optimized by extracting the underlying numpy array representation via df['col1'].values.
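As a rough illustration of that point, a minimal sketch (assuming the same df with col1 as in the question) could be:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a', 'b', 'c', 'a']})

# Operate on the raw numpy array instead of the Series to skip
# pandas' alignment overhead
arr = df['col1'].to_numpy()
df['col2'] = np.where(arr == 'a', 'a', 'Other')
print(df)
```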
Without any additional library, since the question is tagged with neither pandas nor numpy:
You can use a list comprehension with if and else:
col1 = ['a', 'b', 'c', 'a', 'b', 'c', 'a']
col2 = [x if x == 'a' else 'Other' for x in col1]

Python Pandas lookup and replace df1 value from df2

I have two dataframes, df and df2.
df column FOUR matches with df2 column LOOKUP COL
I need to match df column FOUR with df2 column LOOKUP COL and replace df column FOUR with the corresponding values from df2 column RETURN THIS
The resulting dataframe could overwrite df but I have it listed as result below.
NOTE: THE INDEX DOES NOT MATCH ON EACH OF THE DATAFRAMES
df = pd.DataFrame([['a', 'b', 'c', 'd'],
['e', 'f', 'g', 'h'],
['j', 'k', 'l', 'm'],
['x', 'y', 'z', 'w']])
df.columns = ['ONE', 'TWO', 'THREE', 'FOUR']
ONE TWO THREE FOUR
0 a b c d
1 e f g h
2 j k l m
3 x y z w
df2 = pd.DataFrame([['a', 'b', 'd', '1'],
['e', 'f', 'h', '2'],
['j', 'k', 'm', '3'],
['x', 'y', 'w', '4']])
df2.columns = ['X1', 'Y2', 'LOOKUP COL', 'RETURN THIS']
X1 Y2 LOOKUP COL RETURN THIS
0 a b d 1
1 e f h 2
2 j k m 3
3 x y w 4
RESULTING DF
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
You can use Series.map. You'll need to create a dictionary or a Series to use in map. A Series makes more sense here but the index should be LOOKUP COL:
df['FOUR'] = df['FOUR'].map(df2.set_index('LOOKUP COL')['RETURN THIS'])
df
Out:
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
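For completeness, the dictionary variant mentioned above could be sketched like this (same df and df2 as in the question):

```python
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'],
                   ['e', 'f', 'g', 'h'],
                   ['j', 'k', 'l', 'm'],
                   ['x', 'y', 'z', 'w']],
                  columns=['ONE', 'TWO', 'THREE', 'FOUR'])
df2 = pd.DataFrame([['a', 'b', 'd', '1'],
                    ['e', 'f', 'h', '2'],
                    ['j', 'k', 'm', '3'],
                    ['x', 'y', 'w', '4']],
                   columns=['X1', 'Y2', 'LOOKUP COL', 'RETURN THIS'])

# Build a plain dict from the two lookup columns, then map over it
lookup = dict(zip(df2['LOOKUP COL'], df2['RETURN THIS']))
df['FOUR'] = df['FOUR'].map(lookup)
print(df)
```

Values in FOUR with no match in the dict become NaN, same as with the Series version.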
df['FOUR'] = [df2[df2['LOOKUP COL'] == i]['RETURN THIS'].iloc[0] for i in df['FOUR']]
Should be sufficient to do the trick, though there's probably a more pandas-native way to do it.
Basically, a list comprehension: we generate a new list of df2['RETURN THIS'] values by filtering on the lookup column as we iterate over each i in df['FOUR'].
