pandas groupby apply on multiple columns to generate a new column - python

I'd like to generate a new column in a pandas dataframe using groupby-apply.
For example, I have a dataframe:
df = pd.DataFrame({'A':[1,2,3,4],'B':['A','B','A','B'],'C':[0,0,1,1]})
and try to generate a new column 'D' by groupby-apply.
This works:
df = df.assign(D=df.groupby('B').C.apply(lambda x: x - x.mean()))
as (I think) it returns a series with the same index as the dataframe:
In [4]: df.groupby('B').C.apply(lambda x: x - x.mean())
Out[4]:
0 -0.5
1 -0.5
2 0.5
3 0.5
Name: C, dtype: float64
But if I try to generate a new column using multiple columns, I cannot assign the result directly to a new column. So this doesn't work:
df.assign(D=df.groupby('B').apply(lambda x: x.A - x.C.mean()))
returning
TypeError: incompatible index of inserted column with frame index
and in fact, the groupby-apply returns:
In [8]: df.groupby('B').apply(lambda x: x.A - x.C.mean())
Out[8]:
B
A 0 0.5
2 2.5
B 1 1.5
3 3.5
Name: A, dtype: float64
I could do
df.groupby('B').apply(lambda x: x.A - x.C.mean()).reset_index(level=0, drop=True)
but it seems verbose, and I am not sure whether this will always work as expected.
So my question is: (i) when does pandas groupby-apply return a like-indexed series vs a multi-index series? (ii) is there a better way to assign a new column by groupby-apply to multiple columns?

For this case I do not think including column A in the apply is necessary; we can use transform instead:
df.A-df.groupby('B').C.transform('mean')
Out[272]:
0 0.5
1 1.5
2 2.5
3 3.5
dtype: float64
And you can assign it back
df['diff']= df.A-df.groupby('B').C.transform('mean')
df
Out[274]:
A B C diff
0 1 A 0 0.5
1 2 B 0 1.5
2 3 A 1 2.5
3 4 B 1 3.5
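Because transform returns a series aligned with the original index, it also drops straight into the assign idiom from the question. A minimal sketch:
df = df.assign(D=df.A - df.groupby('B').C.transform('mean'))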

Let's use group_keys=False in the groupby
df.assign(D=df.groupby('B', group_keys=False).apply(lambda x: x.A - x.C.mean()))
Output:
A B C D
0 1 A 0 0.5
1 2 B 0 1.5
2 3 A 1 2.5
3 4 B 1 3.5
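To answer (i): with the default group_keys=True, a DataFrameGroupBy apply prepends the group keys as an extra index level whenever pandas cannot recognize the result as a transform of the input, which is the multi-index you saw. Your first SeriesGroupBy example came back like-indexed because each group's result had the same index as the group, which pandas (at least in the versions current when this was written) detects and concatenates without adding keys. With group_keys=False the original row labels always survive, so assign can align. A minimal sketch of the difference:
df.groupby('B').apply(lambda x: x.A - x.C.mean()).index
# MultiIndex([('A', 0), ('A', 2), ('B', 1), ('B', 3)], names=['B', None])
df.groupby('B', group_keys=False).apply(lambda x: x.A - x.C.mean()).index
# Index([0, 2, 1, 3]) -- same labels as df, just reordered by group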

Related

How can I correctly retrieve data with python pandas from two columns in a DataFrame? [duplicate]

I have a pandas dataframe in the following format:
df = pd.DataFrame([
[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3],
list('AAABBBBABCBDDD'),
[1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8],
['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],
['1','3','3','2','4','2','5','3','6','3','5','1','1','1']
]).T
df.columns = ['col1','col2','col3','col4','col5']
df:
col1 col2 col3 col4 col5
0 1.1 A 1.1 x/y/z 1
1 1.1 A 1.7 x/y 3
2 1.1 A 2.5 x/y/z/n 3
3 2.6 B 2.6 x/u 2
4 2.5 B 3.3 x 4
5 3.4 B 3.8 x/u/v 2
6 2.6 B 4 x/y/z 5
7 2.6 A 4.2 x 3
8 3.4 B 4.3 x/u/v/b 6
9 3.4 C 4.5 - 3
10 2.6 B 4.6 x/y 5
11 1.1 D 4.7 x/y/z 1
12 1.1 D 4.7 x 1
13 3.3 D 4.8 x/u/v/w 1
I want to get the count for each (col5, col2) pair, like the following. Expected output:
col5 col2 count
1 A 1
D 3
2 B 2
etc...
How do I get my expected output? And how do I find the largest count for each 'col2' value?
You are looking for size:
In [11]: df.groupby(['col5', 'col2']).size()
Out[11]:
col5 col2
1 A 1
D 3
2 B 2
3 A 3
C 1
4 B 1
5 B 2
6 B 1
dtype: int64
To get the same answer as waitingkuo (the "second question"), but slightly cleaner, groupby the level:
In [12]: df.groupby(['col5', 'col2']).size().groupby(level=1).max()
Out[12]:
col2
A 3
B 2
C 1
D 3
dtype: int64
Following @Andy's answer, you can do the following to solve your second question:
In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]:
0
col2
A 3
B 2
C 1
D 3
Idiomatic solution that uses only a single groupby
(df.groupby(['col5', 'col2']).size()
.sort_values(ascending=False)
.reset_index(name='count')
.drop_duplicates(subset='col2'))
col5 col2 count
0 3 A 3
1 1 D 3
2 5 B 2
6 3 C 1
Explanation
The result of the groupby size method is a Series with col5 and col2 in the index. From here, you can use another groupby method to find the maximum value of each value in col2 but it is not necessary to do. You can simply sort all the values descendingly and then keep only the rows with the first occurrence of col2 with the drop_duplicates method.
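For reference, the second-groupby alternative the paragraph mentions might look like this (a sketch; idxmax keeps one row per col2 and breaks ties by first occurrence):
counts = df.groupby(['col5', 'col2']).size().reset_index(name='count')
counts.loc[counts.groupby('col2')['count'].idxmax()]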
Inserting data into a pandas dataframe and providing column names:
import pandas as pd
df = pd.DataFrame([['A','C','A','B','C','A','B','B','A','A'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['Alphabet','Words']
print(df) #printing dataframe.
This is our printed data:
  Alphabet  Words
0        A    ONE
1        C    TWO
2        A    ONE
3        B    ONE
4        C    ONE
5        A    TWO
6        B    ONE
7        B    TWO
8        A    ONE
9        A  THREE
To group the dataframe and count, you need to provide one more column which counts the grouping; let's call that column "COUNTER".
Like this:
df['COUNTER'] = 1  # initially, set the counter to 1
group_data = df.groupby(['Alphabet','Words'])['COUNTER'].sum() #sum function
print(group_data)
OUTPUT:
Alphabet  Words
A         ONE      3
          THREE    1
          TWO      1
B         ONE      2
          TWO      1
C         ONE      1
          TWO      1
Name: COUNTER, dtype: int64
Should you want to add a new column (say 'count_column') containing the groups' counts into the dataframe:
df['count_column'] = df.groupby(['col5','col2']).col5.transform('count')
(I picked 'col5' as it contains no NaN.)
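As a quick check against the df defined above (a sketch), every row now carries the size of its (col5, col2) group:
df.loc[df.col2 == 'D', ['col5', 'col2', 'count_column']]
#    col5 col2  count_column
# 11    1    D             3
# 12    1    D             3
# 13    1    D             3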
Since pandas 1.1.0, you can use value_counts on a DataFrame:
out = df[['col5','col2']].value_counts().sort_index()
Output:
col5 col2
1 A 1
D 3
2 B 2
3 A 3
C 1
4 B 1
5 B 2
6 B 1
dtype: int64
If you want to construct a DataFrame as a final result (not a pandas Series), use the as_index= parameter:
df.groupby(['col5', 'col2'], as_index=False).size()
To get the final desired output, pivot_table may be used as well (instead of double groupby):
df.pivot_table(index='col5', columns='col2', aggfunc='size').max()
If you don't want to count NaN values, you can use groupby.count:
df.groupby(['col5', 'col2']).count()
Note that since each column may have a different number of non-NaN values, a plain groupby.count call (as in the example above) may return a different count per column unless you specify one. For example, the number of non-NaN values in col1 after grouping by ['col5', 'col2'] is:
df.groupby(['col5', 'col2'])['col1'].count()
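The size vs. count distinction in one toy example (a hypothetical two-row frame, not from the question):
import numpy as np
toy = pd.DataFrame({'g': ['a', 'a'], 'v': [1.0, np.nan]})
toy.groupby('g').size()        # 2 -- counts rows, NaN included
toy.groupby('g')['v'].count()  # 1 -- counts non-NaN values only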
You can just use the built-in count function after the groupby:
df.groupby(['col5','col2']).count()

How to apply lambda function to specific column based on the values in the adjacent column

I am trying to apply a lambda function to a pandas dataframe. My question is: how can I apply a lambda function to one column based on the value in an adjacent column, using an if statement?
A B C
2 5 7
4 5 9
6 7 9
df['B'].apply(lambda x: x+3 if x<(#the value in column C) else x)
You need to call apply on the dataframe, with axis=1, instead of on the B column:
>>> df.apply(lambda x: x['B']+3 if x['B']<x['C'] else x['B'], axis=1)
0 8
1 8
2 10
dtype: int64
But a much more efficient (faster) way is to stay vectorized: in arithmetic, the boolean mask df['B'].lt(df['C']) acts as 0/1, so multiplying by 3 adds 3 exactly where B < C:
>>> df['B'] + df['B'].lt(df['C']) * 3
0 8
1 8
2 10
dtype: int64
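If the boolean arithmetic feels too terse, an equivalent vectorized form using np.where (a sketch, not from the original answer):
import numpy as np
np.where(df['B'] < df['C'], df['B'] + 3, df['B'])
# array([ 8,  8, 10])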

Hierarchical Operation Pandas

Clearly I'm missing something simple, but I don't know what. I would like to propagate an operation by groups. Let's say I have a simple series with a multiindex (2 levels); I want to take the mean per group and subtract that mean at the correct index level.
Minimalist example code:
a = pd.Series({(2,1): 3., (1,2):4.,(2,3):4.})
b = a.groupby(level=0).mean()
r = a-b # this is the wrong line, b doesn't propagate to the multiindex of a
The result that I expect:
2 1 -0.5
1 2 0
2 3 .5
dtype: float64
Use Series.sub with the level parameter defined for alignment:
r = a.sub(b, level=0)
print (r)
2 1 -0.5
1 2 0.0
2 3 0.5
dtype: float64
Or use GroupBy.transform to get a Series with the same index as the original Series a:
b = a.groupby(level=0).transform('mean')
r = a-b
print (r)
2 1 -0.5
1 2 0.0
2 3 0.5
dtype: float64

How to DataFrame.groupby along axis=1

I have:
df = pd.DataFrame({'A':[1, 2, -3],'B':[1,2,6]})
df
A B
0 1 1
1 2 2
2 -3 6
Q: How do I get:
A
0 1
1 2
2 1.5
using groupby() and aggregate()?
Something like,
df.groupby([0,1], axis=1).aggregate('mean')
So basically, groupby along axis=1, using row indexes 0 and 1 for grouping (without using transpose).
Are you looking for this?
df.mean(1)
Out[71]:
0 1.0
1 2.0
2 1.5
dtype: float64
If you do want a groupby:
df.groupby(['key']*df.shape[1],axis=1).mean()
Out[72]:
key
0 1.0
1 2.0
2 1.5
Grouping keys can come in four forms; I will only mention the first and third, which are relevant to your question. The following is from "Data Analysis Using Pandas":
Each grouping key can take many forms, and the keys do not have to be all of the same type:
• A list or array of values that is the same length as the axis being grouped
• A dict or Series giving a correspondence between the values on the axis being grouped and the group names
So you can pass an array the same length as your columns axis (the grouping axis), or a dict, like the following:
df.groupby({x: 'mean' for x in df.columns}, axis=1).mean()
mean
0 1.0
1 2.0
2 1.5
Given a dataframe df (the original with an extra column C added) as follows:
A B C
0 1 1 2
1 2 2 3
2 -3 6 1
Please use the command
df.groupby(by=lambda x: df[x].loc[0], axis=1).mean()
to get the desired output:
1 2
0 1.0 2.0
1 2.0 3.0
2 1.5 1.0
Here, the function lambda x: df[x].loc[0] is used to map columns A and B to 1 and column C to 2. This mapping is then used to decide the grouping.
You can also use any complex function defined outside the groupby statement instead of the lambda function.
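For example, a named function equivalent to the lambda above (a sketch):
def first_row_value(col):
    # Map each column label to its value in row 0; columns that share
    # a row-0 value end up in the same group.
    return df[col].loc[0]

df.groupby(by=first_row_value, axis=1).mean()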
Try this:
import numpy as np

df["A"] = np.mean(df.loc[:, ["A", "B"]], axis=1)
df.drop(columns=["B"], inplace=True)
A
0 1.0
1 2.0
2 1.5

Group by value of sum of columns with Pandas

I got lost in the pandas docs and features trying to figure out a way to group a DataFrame by the values of the sums of its columns.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b, and c to be grouped, since they all have a sum equal to 1. The resulting DataFrame would have column labels equal to the sums of the columns it aggregated. Like this:
1 9
0 2 2
1 1 3
2 0 4
Any idea to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
1 9
0 2 2
1 1 3
2 0 4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1) and take the sum of each group.
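For reference, the grouper series itself, i.e. the column sums over the index axis:
df.sum()
# a    1
# b    1
# c    1
# d    9
# dtype: int64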
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
#totals 1 9
#0 2 2
#1 1 3
#2 0 4
