Clearly I'm missing something simple, but I don't know what. I would like to propagate an operation by groups. Say I have a simple Series with a two-level MultiIndex: I want to take the group mean and subtract it from the values at the corresponding index level.
Minimalist example code:
a = pd.Series({(2,1): 3., (1,2):4.,(2,3):4.})
b = a.groupby(level=0).mean()
r = a - b  # this is the wrong line: b doesn't align back to the MultiIndex of a
The result that I expect:
2 1 -0.5
1 2 0.0
2 3 0.5
dtype: float64
Use Series.sub, which lets you specify the index level to align on:
r = a.sub(b, level=0)
print (r)
2 1 -0.5
1 2 0.0
2 3 0.5
dtype: float64
Or use GroupBy.transform, which returns a Series with the same index as the original Series a:
b = a.groupby(level=0).transform('mean')
r = a-b
print (r)
2 1 -0.5
1 2 0.0
2 3 0.5
dtype: float64
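Both approaches can be checked against each other in one self-contained snippet (a sketch of the above, using only the data from the question):

```python
import pandas as pd

a = pd.Series({(2, 1): 3., (1, 2): 4., (2, 3): 4.})

# approach 1: subtract the per-group means, aligning on level 0
r1 = a.sub(a.groupby(level=0).mean(), level=0)

# approach 2: transform broadcasts each group mean back to the original index
r2 = a - a.groupby(level=0).transform('mean')

print(r1.equals(r2))  # True: both give -0.5, 0.0, 0.5
```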
Related
I have read several similar questions and cannot for the life of me find an answer that works for my specific case, even though the question is very simple. I have a set of data that has a grouping variable, a position, and a value at that position:
Sample Position Depth
A 1 2
A 2 3
A 3 4
B 1 1
B 2 3
B 3 2
I want to generate a new column that is an internally normalized depth as follows:
Sample Position Depth NormalizedDepth
A 1 2 0
A 2 3 0.5
A 3 4 1
B 1 1 0
B 2 3 1
B 3 2 0.5
This is essentially represented by the formula NormalizedDepth = (x - min(x))/(max(x) - min(x)), where the minimum and maximum are taken within each group.
I know how to do this with dplyr in R with the following:
depths %>%
group_by(Sample) %>%
mutate(NormalizedDepth = 100 * (Depth - min(Depth))/(max(Depth) - min(Depth)))
I cannot figure out how to do this with pandas. I've tried grouping and applying, but nothing seems to replicate what I am looking for.
We have transform (which does the same as mutate in R's dplyr), combined with np.ptp (which gives the difference between the max and min):
import numpy as np
g=df.groupby('Sample').Depth
df['new']=(df.Depth-g.transform('min'))/g.transform(np.ptp)
0 0.0
1 0.5
2 1.0
3 0.0
4 1.0
5 0.5
Name: Depth, dtype: float64
Group the DataFrame by the Sample column's values, apply an anonymous function to each chunk of the (split) Depth Series that performs min-max normalisation, and assign the result to the NormalizedDepth column of df (note: this is unlikely to be as efficient as YOBEN_S's answer above):
import pandas as pd
df['NormalizedDepth'] = df.groupby('Sample').Depth.apply(lambda x: (x - min(x))/(max(x)-min(x)))
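In newer pandas/NumPy versions, passing np.ptp to transform can raise deprecation warnings, so here is an equivalent sketch using only named reductions (same data and column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'Sample': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Position': [1, 2, 3, 1, 2, 3],
                   'Depth': [2, 3, 4, 1, 3, 2]})

g = df.groupby('Sample')['Depth']
# (x - min) / (max - min), with min and max taken within each Sample group
df['NormalizedDepth'] = (df['Depth'] - g.transform('min')) / (
    g.transform('max') - g.transform('min'))
print(df['NormalizedDepth'].tolist())  # [0.0, 0.5, 1.0, 0.0, 1.0, 0.5]
```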
Is there a Pandas function equivalent to the MS Excel fill handle?
It fills data down or extends a series if more than one cell is selected. My specific application is filling down with a set value in a specific column from a specific row in the dataframe, not necessarily filling a series.
This simple function essentially does what I want. I think it would be nice if ffill could be modified to fill in this way...
def fill_down(df, col, val, start, end=0, interval=1):
    if not end:
        end = len(df)
    j = df.columns.get_loc(col)
    for i in range(start, end, interval):
        # single-step .iloc assignment avoids chained-indexing warnings
        df.iloc[i, j] += val
    return df
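A runnable sketch of that helper with a small usage example (the data here is invented for illustration; the single `.iloc` call avoids chained-assignment warnings):

```python
import pandas as pd

def fill_down(df, col, val, start, end=0, interval=1):
    if not end:
        end = len(df)
    j = df.columns.get_loc(col)
    for i in range(start, end, interval):
        df.iloc[i, j] += val
    return df

df = pd.DataFrame({'A': [0, 0, 0, 0], 'B': [1, 2, 3, 4]})
fill_down(df, 'A', 5, start=1)  # add 5 to column A from row 1 onward
print(df['A'].tolist())  # [0, 5, 5, 5]
```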
As others commented, there isn't a GUI for pandas, but ffill gives the functionality you're looking for. You can also use ffill with groupby for more powerful functionality. For example:
>>> df
A B
0 12 1
1 NaN 1
2 4 2
3 NaN 2
>>> df.A = df.groupby('B').A.ffill()
A B
0 12 1
1 12 1
2 4 2
3 4 2
Edit: If you don't have NaNs, you can always create the NaNs where you want to fill down. For example:
>>> df
Out[8]:
A B
0 1 2
1 3 3
2 4 5
>>> df.replace(3, np.nan)
Out[9]:
A B
0 1.0 2.0
1 NaN NaN
2 4.0 5.0
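Chaining the two steps (replace, then ffill) in one runnable sketch, with the same values as the example above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4], 'B': [2, 3, 5]})
# turn the sentinel value into NaN, then fill down from the row above
filled = df.replace(3, np.nan).ffill()
print(filled)  # row 1 now carries row 0's values in both columns
```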
I have:
df = pd.DataFrame({'A':[1, 2, -3],'B':[1,2,6]})
df
A B
0 1 1
1 2 2
2 -3 6
Q: How do I get:
A
0 1
1 2
2 1.5
using groupby() and aggregate()?
Something like,
df.groupby([0,1], axis=1).aggregate('mean')
So basically groupby along axis=1 and use row indexes 0 and 1 for grouping. (without using Transpose)
Are you looking for this?
df.mean(axis=1)
Out[71]:
0 1.0
1 2.0
2 1.5
dtype: float64
If you do want groupby
df.groupby(['key']*df.shape[1],axis=1).mean()
Out[72]:
key
0 1.0
1 2.0
2 1.5
Grouping keys can come in four forms; I will mention only the first and third, which are relevant to your question. The following is from "Python for Data Analysis":
Each grouping key can take many forms, and the keys do not have to be all of the same type:
• A list or array of values that is the same length as the axis being grouped
• A dict or Series giving a correspondence between the values on the axis being grouped and the group names
So you can pass an array the same length as your columns axis (the axis being grouped), or a dict, like the following:
df.groupby({x: 'mean' for x in df.columns}, axis=1).mean()
mean
0 1.0
1 2.0
2 1.5
Given the original dataframe df as follows -
A B C
0 1 1 2
1 2 2 3
2 -3 6 1
Please use command
df.groupby(by=lambda x: df[x].loc[0], axis=1).mean()
to get the desired output as -
1 2
0 1.0 2.0
1 2.0 3.0
2 1.5 1.0
Here, the function lambda x : df[x].loc[0] is used to map columns A and B to 1 and column C to 2. This mapping is then used to decide the grouping.
You can also use any complex function defined outside the groupby statement instead of the lambda function.
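Note that axis=1 in groupby is deprecated in recent pandas versions; the same column grouping can be written with a transpose instead. A sketch with the same data as above, grouping columns by the values in row 0:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, -3], 'B': [1, 2, 6], 'C': [2, 3, 1]})
# df.loc[0] maps columns A and B to group 1 and column C to group 2
out = df.T.groupby(df.loc[0]).mean().T
print(out)
```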
try this:
import numpy as np

df["A"] = np.mean(df.loc[:, ["A", "B"]], axis=1)
df.drop(columns=["B"],inplace=True)
A
0 1.0
1 2.0
2 1.5
I'd like to generate a new column in a pandas dataframe using groupby-apply.
For example, I have a dataframe:
df = pd.DataFrame({'A':[1,2,3,4],'B':['A','B','A','B'],'C':[0,0,1,1]})
and try to generate a new column 'D' by groupby-apply.
This works:
df = df.assign(D=df.groupby('B').C.apply(lambda x: x - x.mean()))
as (I think) it returns a series with the same index as the dataframe:
In [4]: df.groupby('B').C.apply(lambda x: x - x.mean())
Out[4]:
0 -0.5
1 -0.5
2 0.5
3 0.5
Name: C, dtype: float64
But if I try to generate a new column using multiple columns, I cannot assign it directly to a new column. So this doesn't work:
df.assign(D=df.groupby('B').apply(lambda x: x.A - x.C.mean()))
returning
TypeError: incompatible index of inserted column with frame index
and in fact, the groupby-apply returns:
In [8]: df.groupby('B').apply(lambda x: x.A - x.C.mean())
Out[8]:
B
A 0 0.5
2 2.5
B 1 1.5
3 3.5
Name: A, dtype: float64
I could do
df.groupby('B').apply(lambda x: x.A - x.C.mean()).reset_index(level=0, drop=True)
but it seems verbose, and I am not sure it will always work as expected.
So my question is: (i) when does pandas groupby-apply return a like-indexed series vs a multi-index series? (ii) is there a better way to assign a new column by groupby-apply to multiple columns?
For this case I do not think including column A in the apply is necessary; we can use transform:
df.A-df.groupby('B').C.transform('mean')
Out[272]:
0 0.5
1 1.5
2 2.5
3 3.5
dtype: float64
And you can assign it back
df['diff']= df.A-df.groupby('B').C.transform('mean')
df
Out[274]:
A B C diff
0 1 A 0 0.5
1 2 B 0 1.5
2 3 A 1 2.5
3 4 B 1 3.5
Let's use group_keys=False in the groupby:
df.assign(D=df.groupby('B', group_keys=False).apply(lambda x: x.A - x.C.mean()))
Output:
A B C D
0 1 A 0 0.5
1 2 B 0 1.5
2 3 A 1 2.5
3 4 B 1 3.5
I'm wondering how to aggregate data within a grouped pandas dataframe using a function that takes into account the value stored in some column of the dataframe. This would be useful in operations where the order of operands matters, such as division.
For example I have:
In [8]: df
Out[8]:
class cat xer
0 a 1 2
1 b 1 4
2 c 1 9
3 a 2 6
4 b 2 8
5 c 2 3
I want to group by class and, for each class, divide the xer value corresponding to cat == 1 by the one for cat == 2. In other words, the entries in the final output should be:
class div
0 a 0.33 (i.e. 2/6)
1 b 0.5 (i.e. 4/8)
2 c 3 (i.e. 9/3)
Is this possible to do using groupby? I can't quite figure out how to do it without manually iterating through each class, and even then it's not clean or fun.
Without doing anything too clever:
In [11]: one = df[df["cat"] == 1].set_index("class")["xer"]
In [12]: two = df[df["cat"] == 2].set_index("class")["xer"]
In [13]: one / two
Out[13]:
class
a 0.333333
b 0.500000
c 3.000000
Name: xer, dtype: float64
Given your DataFrame, you can use the following:
from functools import reduce
import numpy as np

df.groupby('class').agg({'xer': lambda L: reduce(np.divide, L)})
Which gives you:
xer
class
a 0.333333
b 0.500000
c 3.000000
This caters for more than two rows per group (if need be), but you might want to ensure your df is sorted by cat first, so the values appear in the right order.
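A full sketch of that answer with the sort and the imports made explicit (reduce lives in functools, and np.divide replaces the removed pd.np alias):

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'class': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'cat': [1, 1, 1, 2, 2, 2],
                   'xer': [2, 4, 9, 6, 8, 3]})

# sort by cat so each group's values arrive in division order
out = (df.sort_values('cat')
         .groupby('class')['xer']
         .agg(lambda s: reduce(np.divide, s)))
print(out)  # a: 2/6, b: 4/8, c: 9/3
```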
You may want to rearrange your data to make it easier to view:
df2 = df.set_index(['class', 'cat']).unstack()
>>> df2
xer
cat 1 2
class
a 2 6
b 4 8
c 9 3
You can then do the following to get your desired result:
>>> df2.iloc[:,0].div(df2.iloc[:, 1])
class
a 0.333333
b 0.500000
c 3.000000
Name: (xer, 1), dtype: float64
This is one approach, step by step:
# get cat==1 and cat==2 merged by class
grouped = df[df.cat==1].merge(df[df.cat==2], on='class')
# calculate div
grouped['div'] = grouped.xer_x / grouped.xer_y
# return the final dataframe
grouped[['class', 'div']]
which yields:
class div
0 a 0.333333
1 b 0.500000
2 c 3.000000
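For completeness, pivot gives the same reshape as set_index + unstack in a single call; a sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'class': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'cat': [1, 1, 1, 2, 2, 2],
                   'xer': [2, 4, 9, 6, 8, 3]})

# one row per class, one column per cat value
wide = df.pivot(index='class', columns='cat', values='xer')
result = wide[1] / wide[2]
print(result)  # a: 2/6, b: 4/8, c: 9/3
```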