Pandas conditional group by and sum - python

Is there a way I can do a groupby and sum on some rows of a DataFrame, but leave the rest as is? For example I have the df:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)})
It looks like:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
And now I'd like to groupby/sum the rows where the value in B is "one" (keeping the last occurrence of column A). So the output would be:
A B sumC sumD
1 foo two -1.509059 -0.494929
2 bar three -1.135632 1.071804
3 foo two 1.212112 0.721555
4 bar two -0.173215 -0.706771
5 foo one 0.305458 -4.005993
6 foo three -1.044236 0.271860
How can this be done?

Let's use this:
pd.concat([df.query('B != "one"'),
           df.query('B == "one"').groupby('B', as_index=False)[['A', 'C', 'D']]
                                 .agg({'A': 'last', 'C': 'sum', 'D': 'sum'})])
Output:
A B C D
2 foo two 0.656942 -0.605847
3 bar three 1.022090 0.493374
4 foo two -1.016595 0.652162
5 bar two -0.738758 -0.669947
7 foo three 0.913342 1.156044
0 foo one 0.590764 -0.192638

Another kind-of workaround is to define a new column that is a constant (e.g. -1) if B is one and a unique value (e.g. a range) otherwise, then group on it.
df['B2'] = np.where(df['B']=='one', -1, np.arange(len(df)))
df.groupby('B2', as_index=False).agg({'A': 'last', 'B': 'max', 'C': 'sum', 'D': 'sum'}).drop('B2', axis=1)
This avoids doing computations that you end up throwing away (although, if you really want to avoid those, the easiest approach is probably to split your DataFrame in two, one part where df.B == 'one' and one where df.B != 'one', aggregate only the former, and then concatenate the results back).
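A minimal sketch of that split-and-concat approach, assuming the same df as above and that 'last'/'sum' are the aggregations you want:
# split off the rows to aggregate; leave the rest untouched
mask = df['B'] == 'one'
aggregated = (df[mask]
              .groupby('B', as_index=False)
              .agg({'A': 'last', 'C': 'sum', 'D': 'sum'}))
result = pd.concat([df[~mask], aggregated], ignore_index=True)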

Related

Add column to pandas multiindex dataframe

I have a pandas dataframe that looks like this:
import pandas as pd
import numpy as np
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns=['A', 'B', 'C', 'D'])
I want to add a column E such that df.loc[(slice(None),'one'),'E'] = 1 and df.loc[(slice(None),'two'),'E'] = 2, and I want to do this without iterating over ['one', 'two']. I tried the following:
df.loc[(slice(None),slice('one','two')),'E'] = pd.Series([1,2],index=['one','two'])
but it just adds a column E with NaN. What's the right way to do this?
Here is one way, using reindex:
# build a Series keyed by the second index level, align it to the rows, then assign
df.loc[:, 'E'] = (pd.Series([1, 2], index=['one', 'two'])
                    .reindex(df.index.get_level_values(1))
                    .values)
df
A B C D E
bar one -0.856175 -0.383711 -0.646510 0.110204 1
two 1.640114 0.099713 0.406629 0.774960 2
baz one 0.097198 -0.814920 0.234416 -0.057340 1
two -0.155276 0.788130 0.761469 0.770709 2
foo one 1.593564 -1.048519 -1.194868 0.191314 1
two -0.755624 0.678036 -0.899805 1.070639 2
qux one -0.560672 0.317915 -0.858048 0.418655 1
two 1.198208 0.662354 -1.353606 -0.184258 2
Methinks this is a good use case for Index.map:
df['E'] = df.index.get_level_values(1).map({'one':1, 'two':2})
df
A B C D E
bar one 0.956122 -0.705841 1.192686 -0.237942 1
two 1.155288 0.438166 1.122328 -0.997020 2
baz one -0.106794 1.451429 -0.618037 -2.037201 1
two -1.942589 -2.506441 -2.114164 -0.411639 2
foo one 1.278528 -0.442229 0.323527 -0.109991 1
two 0.008549 -0.168199 -0.174180 0.461164 2
qux one -1.175983 1.010127 0.920018 -0.195057 1
two 0.805393 -0.701344 -0.537223 0.156264 2
You can just get it from df.index.labels:
df['E'] = df.index.labels[1] + 1
print(df)
Output:
A B C D E
bar one 0.746123 1.264906 0.169694 -0.180074 1
two -1.439730 -0.100075 0.929750 0.511201 2
baz one 0.833037 1.547624 -1.116807 0.425093 1
two 0.969887 -0.705240 -2.100482 0.728977 2
foo one -0.977623 -0.800136 -0.361394 0.396451 1
two 1.158378 -1.892137 -0.987366 -0.081511 2
qux one 0.155531 0.275015 0.571397 -0.663358 1
two 0.710313 -0.255876 0.420092 -0.116537 2
Thanks to coldspeed, if you want different values (e.g. x and y), use:
df['E'] = pd.Series(df.index.labels[1]).map({0: 'x', 1: 'y'}).tolist()
print(df)
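Note that MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24, and labels was removed entirely in pandas 1.0, so on current versions the equivalent of the above is:
df['E'] = df.index.codes[1] + 1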

How is pandas groupby method actually working?

So I was trying to understand the pandas.DataFrame.groupby() function, and I came across this example in the documentation:
In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B' : ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C' : np.random.randn(8),
...: 'D' : np.random.randn(8)})
...:
In [2]: df
Out[2]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
Now, to explore further, I did this:
print(df.groupby('B').head())
it outputs the same DataFrame, but when I do this:
print(df.groupby('B'))
it gives me this:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f65a585b390>
What does this mean? In a normal DataFrame, printing .head() simply outputs the first 5 rows; what's happening here?
Also, why does printing .head() give the same output as the DataFrame? Shouldn't it be grouped by the elements of the column 'B'?
When you use just
df.groupby('A')
you get a GroupBy object. You haven't applied any function to it at that point. Under the hood, while this definition might not be perfect, you can think of a groupby object as:
An iterator of (group, DataFrame) pairs, for DataFrames, or
An iterator of (group, Series) pairs, for Series.
To illustrate:
df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
grouped = df.groupby('A')
# each `i` is a tuple of (group, DataFrame),
# so your output here will be a little messy
for i in grouped:
    print(i)
(1, A B
0 1 1
1 1 2)
(2, A B
2 2 3
3 2 4)
# this version unpacks each tuple into two loop
# variables in a single loop: each `group` is a group
# key, each `df` is its corresponding DataFrame
for group, df in grouped:
    print('group of A:', group, '\n')
    print(df, '\n')
group of A: 1
A B
0 1 1
1 1 2
group of A: 2
A B
2 2 3
3 2 4
# and if you just want to visualize the groups,
# the second variable is a "throwaway"
for group, _ in grouped:
    print('group of A:', group, '\n')
group of A: 1
group of A: 2
Now, as for .head: just have a look at the docs for that method:
Essentially equivalent to .apply(lambda x: x.head(n))
So here you're actually applying a function to each group of the groupby object. Keep in mind that .head(5) is applied to each group (each DataFrame), so because you have at most 5 rows per group, you get your original DataFrame back.
Consider this with the example above. If you use .head(1), you get only the first row of each group:
print(df.groupby('A').head(1))
A B
0 1 1
2 2 3

Pandas Custom Sort Row in Multiindex

Given the following:
import numpy as np
import pandas as pd
arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz'],
          ['total', 'two', 'one', 'two', 'four', 'total', 'five']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(7), index=index)
s
first second
bar total 0.334158
two -0.267854
one 1.161727
baz two -0.748685
four -0.888634
total 0.383310
five 0.506120
dtype: float64
How do I ensure that the 'total' rows (per the second index) are always at the bottom of each group, like this:
first second
bar one 0.210911
two 0.628357
total -0.911331
baz two 0.315396
four -0.195451
five 0.060159
total 0.638313
dtype: float64
solution 1
I'm not happy with this. I'm working on a different solution
unstacked = s.unstack(0)
total = unstacked.loc['total']
# note: DataFrame.append was removed in pandas 2.0;
# pd.concat([unstacked.drop('total'), total.to_frame().T]) is the modern equivalent
unstacked.drop('total').append(total).unstack().dropna()
first second
bar one 1.682996
two 0.343783
total 1.287503
baz five 0.360170
four 1.113498
two 0.083691
total -0.377132
dtype: float64
solution 2
I feel better about this one
second = pd.Categorical(
    s.index.levels[1].values,
    categories=['one', 'two', 'three', 'four', 'five', 'total'],
    ordered=True
)
# assign rather than mutate: inplace index mutation was removed in pandas 2.0
s.index = s.index.set_levels(second, level='second')
cols = s.index.names
s.reset_index().sort_values(cols).set_index(cols)
0
first second
bar one 1.682996
two 0.343783
total 1.287503
baz two 0.083691
four 1.113498
five 0.360170
total -0.377132
Use unstack to create a DataFrame whose columns are the second level of the MultiIndex, then reorder the columns so that total is the last column, and finally use an ordered CategoricalIndex. That way, after stack, the total level sorts last.
np.random.seed(123)
arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz'],
          ['total', 'two', 'one', 'two', 'four', 'total', 'five']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(7), index=index)
print(s)
first second
bar total -1.085631
two 0.997345
one 0.282978
baz two -1.506295
four -0.578600
total 1.651437
five -2.426679
dtype: float64
df = s.unstack()
df = df[df.columns[df.columns != 'total'].tolist() + ['total']]
df.columns = pd.CategoricalIndex(df.columns, ordered=True)
print(df)
second five four one two total
first
bar NaN NaN 0.282978 0.997345 -1.085631
baz -2.426679 -0.5786 NaN -1.506295 1.651437
s1 = df.stack()
print(s1)
first second
bar one 0.282978
two 0.997345
total -1.085631
baz five -2.426679
four -0.578600
two -1.506295
total 1.651437
dtype: float64
print(s1.sort_index())
first second
bar one 0.282978
two 0.997345
total -1.085631
baz five -2.426679
four -0.578600
two -1.506295
total 1.651437
dtype: float64
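For completeness, a version-agnostic alternative is to materialize an explicit sort key and use sort_values; the rank mapping below is my own assumption about the intended order:
# rank each second-level label; 'total' gets the highest rank so it sorts last
rank = {'one': 0, 'two': 1, 'three': 2, 'four': 3, 'five': 4, 'total': 5}
tmp = s.rename('value').reset_index()
tmp['key'] = tmp['second'].map(rank)
out = (tmp.sort_values(['first', 'key'])
          .set_index(['first', 'second'])['value'])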

How to sum negative and positive values separately when using groupby in pandas?

How do I sum positive and negative values separately in pandas and put them, say, into 'positive' and 'negative' columns?
I have this dataframe like below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8), 'D': np.random.randn(8)})
Output is as below:
df
A B C D
0 foo one 0.374156 0.319699
1 bar one -0.356339 -0.629649
2 foo two -0.390243 -1.387909
3 bar three -0.783435 -0.959699
4 foo two -1.268622 -0.250871
5 bar two -2.302525 -1.295991
6 foo one -0.968840 1.247675
7 foo three 0.482845 1.004697
I used the code below to get the negatives:
df['negative'] = df.groupby('A')['C'].apply(lambda x: x[x < 0].sum()).reset_index()
But the problem is that when I try to assign it to a new column called negative, it gives an error:
ValueError: Wrong number of items passed 2, placement implies 1
I know what it means: the groupby returned more than one column, which cannot be assigned to df['negative']. But I don't know how to solve this part of the problem, and I need a positive column too.
The desired outcome would be:
A Positive Negative
0 foo 0.374156 -0.319699
1 bar 0.356339 -0.629649
What is the right solution to the problem?
In [14]:
df.groupby(df['A'])['C'].agg([('negative', lambda x: x[x < 0].sum()), ('positive', lambda x: x[x > 0].sum())])
Out[14]:
negative positive
A
bar -1.418788 2.603452
foo -0.504695 2.880512
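On pandas 0.25 or later, the same result can be written with named aggregation, which avoids the list-of-tuples syntax (a sketch, same df assumed):
# one named output column per aggregation
df.groupby('A')['C'].agg(
    negative=lambda x: x[x < 0].sum(),
    positive=lambda x: x[x > 0].sum(),
)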
You may groupby on A and df['C'] > 0, and unstack the result:
>>> right = df.groupby(['A', df['C'] > 0])['C'].sum().unstack()
>>> right = right.rename(columns={True:'positive', False:'negative'})
>>> right
C negative positive
A
bar -3.4423 NaN
foo -2.6277 0.857
The NaN value is because all the A == bar rows have negative values for C.
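If zeros are preferable to NaN for groups that have no positive (or negative) values, you can fill them in (my addition, assuming zero is the placeholder you want):
right = right.fillna(0)  # bar's positive sum becomes 0.0 instead of NaN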
If you want to add these back to the original frame, matched on the groupby key A, it requires a left join:
>>> df.join(right, on='A', how='left')
A B C D negative positive
0 foo one 0.3742 0.3197 -2.6277 0.857
1 bar one -0.3563 -0.6296 -3.4423 NaN
2 foo two -0.3902 -1.3879 -2.6277 0.857
3 bar three -0.7834 -0.9597 -3.4423 NaN
4 foo two -1.2686 -0.2509 -2.6277 0.857
5 bar two -2.3025 -1.2960 -3.4423 NaN
6 foo one -0.9688 1.2477 -2.6277 0.857
7 foo three 0.4828 1.0047 -2.6277 0.857

compute mean of unique combinations in groupby pandas

I have the following pandas dataframe:
import pandas as pd

data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': [2, 1, 2, 1, 2, 1, 2, 1]})
that looks like:
A B C
0 foo one 2
1 bar one 1
2 foo two 2
3 bar three 1
4 foo two 2
5 bar two 1
6 foo one 2
7 foo three 1
What I need is to compute the mean of each unique combination of A and B. i.e.:
A B C
foo one 2
foo two 2
foo three 1
mean = 1.66666667
and having as output the 'means' computed per value of A i.e.:
foo 1.666667
bar 1
I tried with:
data.groupby(['A'], sort=False, as_index=False).mean()
but it returns me:
foo 1.8
bar 1
Is there a way to compute the mean of only unique combinations? How?
This is essentially the same as @S_A's answer, but a bit more concise.
You can calculate the means across A and B with:
In [41]: df.groupby(['A', 'B']).mean()
Out[41]:
C
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
And then calculate the mean of these over A with:
In [42]: df.groupby(['A', 'B']).mean().groupby(level='A').mean()
Out[42]:
C
A
bar 1.000000
foo 1.666667
Yes, here is a solution that does what you want. First, group by columns A and B to collapse each unique A/B combination; then group that result by A and take the mean.
You can do this like:
import pandas as pd

data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': [2.0, 1, 2, 1, 2, 1, 2, 1]})
data = data.groupby(['A', 'B'], sort=False, as_index=False).mean()
# numeric_only=True because B is a non-numeric column here
print(data.groupby('A', sort=False, as_index=False).mean(numeric_only=True))
Output:
A C
0 foo 1.666667
1 bar 1.000000
When you do data.groupby(['A'], sort=False, as_index=False).mean(), it averages every value of C for each value of A. That's why it returns
foo 1.8 (9/5)
bar 1.0 (3/3)
I think you should find your answer :)
This worked for me (note that it only matches the grouped mean because C is constant within each A/B combination, so drop_duplicates leaves exactly one row per combination):
test = data
test = test.drop_duplicates()
test = test.groupby(['A']).mean()
Output:
C
A
bar 1.000000
foo 1.666667
