I have a pandas dataframe that looks like this:
import pandas as pd
import numpy as np
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df = pd.DataFrame(np.random.randn(8,4),index=arrays,columns=['A','B','C','D'])
I want to add a column E such that df.loc[(slice(None),'one'),'E'] = 1 and df.loc[(slice(None),'two'),'E'] = 2, and I want to do this without iterating over ['one', 'two']. I tried the following:
df.loc[(slice(None),slice('one','two')),'E'] = pd.Series([1,2],index=['one','two'])
but it just adds a column E with NaN. What's the right way to do this?
Here is one way, using reindex:
df.loc[:,'E']=pd.Series([1,2],index=['one','two']).reindex(df.index.get_level_values(1)).values
df
A B C D E
bar one -0.856175 -0.383711 -0.646510 0.110204 1
two 1.640114 0.099713 0.406629 0.774960 2
baz one 0.097198 -0.814920 0.234416 -0.057340 1
two -0.155276 0.788130 0.761469 0.770709 2
foo one 1.593564 -1.048519 -1.194868 0.191314 1
two -0.755624 0.678036 -0.899805 1.070639 2
qux one -0.560672 0.317915 -0.858048 0.418655 1
two 1.198208 0.662354 -1.353606 -0.184258 2
Methinks this is a good use case for Index.map:
df['E'] = df.index.get_level_values(1).map({'one':1, 'two':2})
df
A B C D E
bar one 0.956122 -0.705841 1.192686 -0.237942 1
two 1.155288 0.438166 1.122328 -0.997020 2
baz one -0.106794 1.451429 -0.618037 -2.037201 1
two -1.942589 -2.506441 -2.114164 -0.411639 2
foo one 1.278528 -0.442229 0.323527 -0.109991 1
two 0.008549 -0.168199 -0.174180 0.461164 2
qux one -1.175983 1.010127 0.920018 -0.195057 1
two 0.805393 -0.701344 -0.537223 0.156264 2
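For reference, here is a self-contained sketch of the Index.map approach. It rebuilds the frame from the question and assumes the mapping dict covers every label in the second index level; the names `mapping` and `second_level` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the two-level index from the question explicitly.
index = pd.MultiIndex.from_arrays([
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
])
df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=['A', 'B', 'C', 'D'])

# Hypothetical mapping from second-level label to the value wanted in E.
mapping = {'one': 1, 'two': 2}

# get_level_values returns one label per row, so the mapped result
# already has the same length and order as the frame.
second_level = df.index.get_level_values(1)
df['E'] = second_level.map(mapping)

print(df)
```

Any label missing from the mapping would come back as NaN, so it is worth checking the result if the level can contain other values.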
You can just get it from df.index.codes (this attribute was called df.index.labels before pandas 0.24):
df['E'] = df.index.codes[1] + 1
print(df)
Output:
A B C D E
bar one 0.746123 1.264906 0.169694 -0.180074 1
two -1.439730 -0.100075 0.929750 0.511201 2
baz one 0.833037 1.547624 -1.116807 0.425093 1
two 0.969887 -0.705240 -2.100482 0.728977 2
foo one -0.977623 -0.800136 -0.361394 0.396451 1
two 1.158378 -1.892137 -0.987366 -0.081511 2
qux one 0.155531 0.275015 0.571397 -0.663358 1
two 0.710313 -0.255876 0.420092 -0.116537 2
Thanks to coldspeed, if you want different values (e.g. x and y), use:
df['E'] = pd.Series(df.index.codes[1]).map({0: 'x', 1: 'y'}).tolist()
print(df)
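A small sketch of what the integer codes in this answer refer to, assuming a recent pandas where the attribute is named `codes`. Each code is a position into the matching entry of `index.levels`, so adding 1 only gives the wanted values here because 'one' and 'two' happen to sit at positions 0 and 1.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([
    ['bar', 'bar', 'baz', 'baz'],
    ['one', 'two', 'one', 'two'],
])

# Integer position of each row's label within the second level.
codes = index.codes[1]
# The unique labels of the second level, in the order the codes refer to.
levels = index.levels[1]

# Indexing the level with the codes recovers the per-row labels.
labels_from_codes = levels[codes]

print(codes.tolist())
print(list(levels))
print(list(labels_from_codes))
```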
Consider this dataframe:
import pandas as pd
import numpy as np
iterables = [['bar', 'baz', 'foo'], ['one', 'two']]
index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 6), index=['A', 'B', 'C'], columns=index)
print(df)
first        bar                 baz                 foo
second       one       two       one       two       one       two
A      -1.954583 -1.347156 -1.117026 -1.253150  0.057197 -1.520180
B       0.253937  1.267758 -0.805287  0.337042  0.650892 -0.379811
C       0.354798 -0.835234  1.172324 -0.663353  1.145299  0.651343
I would like to drop 'one' from each column, while retaining other structure.
With the end result looking something like this:
first        bar       baz       foo
second       two       two       two
A      -1.347156 -1.253150 -1.520180
B       1.267758  0.337042 -0.379811
C      -0.835234 -0.663353  0.651343
Use drop:
df.drop('one', axis=1, level=1)
first bar baz foo
second two two two
A 0.127419 -0.319655 -0.878161
B -0.563335 1.193819 -0.469539
C -1.324932 -0.550495 1.378335
This should work as well:
df.loc[:,df.columns.get_level_values(1)!='one']
Try:
print(df.loc[:, (slice(None), "two")])
Prints:
first bar baz foo
second two two two
A -1.104831 0.286379 1.121148
B -1.637677 -2.297138 0.381137
C -1.556391 0.779042 2.316628
Use pd.IndexSlice:
indx = pd.IndexSlice
df.loc[:, indx[:, 'two']]
Output:
first bar baz foo
second two two two
A 1.169699 1.434761 0.917152
B -0.732991 -0.086613 -0.803092
C -0.813872 -0.706504 0.227000
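The answers above can be compared side by side on deterministic data. This is only a sketch: the integer values are made up so the results are easy to check, and the variable names `dropped`, `masked` and `sliced` are illustrative.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo'], ['one', 'two']], names=['first', 'second']
)
# Values 0..17 so every column is easy to identify.
df = pd.DataFrame(np.arange(18).reshape(3, 6), index=['A', 'B', 'C'], columns=columns)

# Remove every column whose second-level label is 'one'.
dropped = df.drop('one', axis=1, level=1)

# Keep columns whose second-level label is not 'one' (boolean mask).
masked = df.loc[:, df.columns.get_level_values('second') != 'one']

# Select columns whose second-level label is 'two'.
sliced = df.loc[:, pd.IndexSlice[:, 'two']]

print(dropped)
```

Note that the first two remove 'one' while the last selects 'two'. They agree here only because the second level has exactly those two labels.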
I was trying to understand the pandas.DataFrame.groupby() function and came across this example in the documentation:
In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B' : ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C' : np.random.randn(8),
...: 'D' : np.random.randn(8)})
...:
In [2]: df
Out[2]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
Now, to explore further, I did this:
print(df.groupby('B').head())
It outputs the same DataFrame, but when I do this:
print(df.groupby('B'))
it gives me this:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f65a585b390>
What does this mean? On a normal DataFrame, printing .head() simply outputs the first 5 rows. What's happening here?
Also, why does printing .head() give the same output as the original DataFrame? Shouldn't it be grouped by the elements of column 'B'?
When you use just
df.groupby('A')
You get a GroupBy object. You haven't applied any function to it at that point. Under the hood, while this definition might not be perfect, you can think of a groupby object as:
An iterator of (group, DataFrame) pairs, for DataFrames, or
An iterator of (group, Series) pairs, for Series.
To illustrate:
import pandas as pd
df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, 2, 3, 4]})
grouped = df.groupby('A')
# each `i` is a tuple of (group, DataFrame)
# so your output here will be a little messy
for i in grouped:
    print(i)
(1, A B
0 1 1
1 1 2)
(2, A B
2 2 3
3 2 4)
# this version unpacks each tuple into two names
# in a single loop. each `group` is a group key, each
# `group_df` is its corresponding DataFrame
# (named `group_df` so the original `df` is not overwritten)
for group, group_df in grouped:
    print('group of A:', group, '\n')
    print(group_df, '\n')
group of A: 1
A B
0 1 1
1 1 2
group of A: 2
A B
2 2 3
3 2 4
# and if you just wanted to visualize the groups,
# your second counter is a "throwaway"
for group, _ in grouped:
    print('group of A:', group, '\n')
group of A: 1
group of A: 2
Now as for .head. Just have a look at the docs for that method:
Essentially equivalent to .apply(lambda x: x.head(n))
So here you're actually applying a function to each group of the groupby object. Keep in mind that .head(5) is applied to each group (each DataFrame), and because every group here has at most 5 rows, you get back all the rows of your original DataFrame.
Consider this with the example above. If you use .head(1), you get only the first row of each group:
print(df.groupby('A').head(1))
A B
0 1 1
2 2 3
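The points above can be checked in one short sketch. It reuses the small frame from this answer; `group_sizes`, `first_rows` and `all_rows` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
grouped = df.groupby('A')

# Iterating yields (group key, sub-DataFrame) pairs.
group_sizes = {key: len(group_df) for key, group_df in grouped}

# head(n) keeps the first n rows of every group, in original row order.
first_rows = grouped.head(1)

# No group has more than 5 rows, so head(5) returns every row.
all_rows = grouped.head(5)

print(group_sizes)
print(first_rows)
print(all_rows.equals(df))
```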
How can I sum positive and negative values separately in pandas and put the results in, say, positive and negative columns?
I have this dataframe like below:
df = pandas.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8), 'D' : np.random.randn(8)})
Output is as below:
df
A B C D
0 foo one 0.374156 0.319699
1 bar one -0.356339 -0.629649
2 foo two -0.390243 -1.387909
3 bar three -0.783435 -0.959699
4 foo two -1.268622 -0.250871
5 bar two -2.302525 -1.295991
6 foo one -0.968840 1.247675
7 foo three 0.482845 1.004697
I used the below code to get negatives:
df['negative'] = df.groupby('A')['C'].apply(lambda x: x[x<0].sum()).reset_index()
But when I try to assign the result to a new column called negative, I get this error:
ValueError: Wrong number of items passed 2, placement implies 1
I understand what it says: the groupby result has more than one column, so it cannot be assigned to df['negative']. I don't know how to get around that, and I also need a positive column.
The desired outcome would be:
A Positive Negative
0 foo 0.374156 -0.319699
1 bar 0.356339 -0.629649
What is the right solution to the problem?
In [14]:
df.groupby(df['A'])['C'].agg([('negative' , lambda x : x[x < 0].sum()) , ('positive' , lambda x : x[x > 0].sum())])
Out[14]:
negative positive
A
bar -1.418788 2.603452
foo -0.504695 2.880512
You may groupby on A and df['C'] > 0, and unstack the result:
>>> right = df.groupby(['A', df['C'] > 0])['C'].sum().unstack()
>>> right = right.rename(columns={True:'positive', False:'negative'})
>>> right
C negative positive
A
bar -3.4423 NaN
foo -2.6277 0.857
The NaN value appears because all the A == bar rows have a negative value for C, so there is no positive group to sum.
If you want to add these to the original frame, matched on the groupby key A, that requires a left join:
>>> df.join(right, on='A', how='left')
A B C D negative positive
0 foo one 0.3742 0.3197 -2.6277 0.857
1 bar one -0.3563 -0.6296 -3.4423 NaN
2 foo two -0.3902 -1.3879 -2.6277 0.857
3 bar three -0.7834 -0.9597 -3.4423 NaN
4 foo two -1.2686 -0.2509 -2.6277 0.857
5 bar two -2.3025 -1.2960 -3.4423 NaN
6 foo one -0.9688 1.2477 -2.6277 0.857
7 foo three 0.4828 1.0047 -2.6277 0.857
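As an alternative sketch (not one of the methods above), the column can be split with Series.clip before grouping. The data below is made up so the sums are easy to verify, and `parts` and `sums` are illustrative names. One difference from the unstack approach: a group with no positive values gets 0 rather than NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': [1.0, -2.0, -3.0, -4.0, 5.0, -6.0, -7.0, 8.0],
})

parts = pd.DataFrame({
    'A': df['A'],
    # clip(upper=0) keeps negative values and turns positive ones into 0.
    'negative': df['C'].clip(upper=0),
    # clip(lower=0) keeps positive values and turns negative ones into 0.
    'positive': df['C'].clip(lower=0),
})

# One row per value of A, with the two sums side by side.
sums = parts.groupby('A')[['negative', 'positive']].sum()

print(sums)
```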
I have the following pandas dataframe:
data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' :[2,1,2,1,2,1,2,1]})
that looks like:
A B C
0 foo one 2
1 bar one 1
2 foo two 2
3 bar three 1
4 foo two 2
5 bar two 1
6 foo one 2
7 foo three 1
What I need is to compute the mean of each unique combination of A and B. i.e.:
A B C
foo one 2
foo two 2
foo three 1
mean = 1.66666667
and having as output the 'means' computed per value of A i.e.:
foo 1.666667
bar 1
I tried with :
data.groupby(['A'], sort=False, as_index=False).mean()
but it returns me:
foo 1.8
bar 1
Is there a way to compute the mean of only unique combinations? How ?
This is essentially the same as @S_A's answer, but a bit more concise.
You can calculate the means across A and B with:
In [41]: df.groupby(['A', 'B']).mean()
Out[41]:
C
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
And then calculate the mean of these over A with:
In [42]: df.groupby(['A', 'B']).mean().groupby(level='A').mean()
Out[42]:
C
A
bar 1.000000
foo 1.666667
Yes, here is a solution. First group by both A and B, so that each unique combination of A and B is reduced to a single mean. Then group that result by A and take the mean again.
You can do this like:
import pandas as pd
data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' :[2.0,1,2,1,2,1,2,1]})
# first mean: one row per unique (A, B) combination
data = data.groupby(['A','B'], sort=False, as_index=False).mean()
# second mean: select C explicitly, since recent pandas versions raise on the string column B
print(data.groupby('A', sort=False, as_index=False)['C'].mean())
Output:
A C
0 foo 1.666667
1 bar 1.000000
When you call data.groupby(['A'], sort=False, as_index=False).mean(), you average every value of column C within each A group, repeated combinations included. That's why it returns:
foo 1.8 (9/5)
bar 1.0 (3/3)
I think you should find your answer :) :)
This worked for me
test = data
test = test.drop_duplicates()
# select C explicitly, since recent pandas versions raise on the string column B
test = test.groupby(['A'])[['C']].mean()
Output:
C
A
bar 1.000000
foo 1.666667
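The two approaches in this thread can be put next to each other. This sketch uses the frame from the question; `plain`, `two_step` and `deduplicated` are illustrative names. One assumption to be aware of: drop_duplicates only matches the two-step groupby when repeated (A, B) pairs always carry the same C, as they do here.

```python
import pandas as pd

data = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [2, 1, 2, 1, 2, 1, 2, 1],
})

# Plain mean per A: repeated (A, B) rows are counted every time.
plain = data.groupby('A')['C'].mean()

# Mean per (A, B) first, then mean of those means per A.
two_step = data.groupby(['A', 'B'])['C'].mean().groupby(level='A').mean()

# Drop fully duplicated rows, then take the mean per A.
deduplicated = data.drop_duplicates().groupby('A')['C'].mean()

print(plain)
print(two_step)
print(deduplicated)
```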
I want to do some analysis on my data. So far I am able to group by the columns that I want; now I need to add two columns together. Here is my logic:
import pandas as pd
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'bar'],
'B' : ['one', 'one', 'two', 'two',
'two', 'two', 'one', 'two'],
'C' : [-1,2,3,4,5,6,0,2],
'D' : [-1,2,3,4,5,6,0,2]})
grouped = df.groupby(['A','B']).sum()
print(grouped)
The output looks like this:
C D
A B
bar one 2 2
two 12 12
foo one -1 -1
two 8 8
[4 rows x 2 columns]
What I need now is to add columns C and D together and generate an output like this:
A B Sum
bar one 4
two 24
foo one -2
two 16
Any ideas would really help me, as I am new to Python.
You could define a new column Sum:
In [107]: grouped['Sum'] = grouped['C']+grouped['D']
Now grouped would look like this:
In [108]: grouped
Out[108]:
C D Sum
A B
bar one 2 2 4
two 12 12 24
foo one -1 -1 -2
two 8 8 16
[4 rows x 3 columns]
To select just the Sum column as a DataFrame, use double brackets:
In [109]: grouped[['Sum']]
Out[109]:
Sum
A B
bar one 4
two 24
foo one -2
two 16
[4 rows x 1 columns]
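If the flat A / B / Sum layout from the question is wanted, one possible sketch is to sum across the columns and then reset the index. The name `result` is illustrative, and the frame is the one from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'two', 'two', 'two', 'one', 'two'],
    'C': [-1, 2, 3, 4, 5, 6, 0, 2],
    'D': [-1, 2, 3, 4, 5, 6, 0, 2],
})

grouped = df.groupby(['A', 'B']).sum()

# Sum C and D row by row, name the result, and turn the
# (A, B) index back into ordinary columns.
result = grouped[['C', 'D']].sum(axis=1).rename('Sum').reset_index()

print(result)
```

Using sum(axis=1) instead of grouped['C'] + grouped['D'] makes it easy to include more columns later by extending the list.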