Pandas dataframe with MultiIndex: exclude level values - python

I have a multi-indexed pandas dataframe like the following one.
import numpy as np
import pandas as pd
arrays = [np.array(['bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']),
np.array(['blo', 'bla', 'bla', 'blo', 'blo', 'blu', 'blo', 'bla'])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df.sort_index(inplace=True)
which returns:
0 1 2 3
bar one bla 0.478461 1.030308 0.012688 0.137495
blo 0.476041 -1.679848 1.346798 0.143225
two bla 1.148882 -2.074197 -2.567959 1.258016
blo 1.062280 3.846096 -0.346636 1.170822
foo one blo -0.761327 0.262105 0.151554 1.066616
two blu 1.431951 0.043307 -0.326498 2.402536
qux one blo -0.622017 -0.566930 0.417977 -0.345238
two bla 0.129273 -0.181396 -0.758381 0.995827
Now I want to select a subset by using a slice object:
idx = pd.IndexSlice
subset = df.loc[idx[['bar'], :, :], :]
This returns:
0 1 2 3
bar one bla 0.478461 1.030308 0.012688 0.137495
blo 0.476041 -1.679848 1.346798 0.143225
two bla 1.148882 -2.074197 -2.567959 1.258016
blo 1.062280 3.846096 -0.346636 1.170822
Now I want to exclude all rows having "blo" as level value. I know that I could select everything but the 'blo' values but my real dataframe is very big and I only know the level values which should not appear in the subset.
What's the easiest way to exclude certain level values from the subset?
Thanks in advance!

IIUC, maybe you can mask your subset with:
subset = subset.iloc[subset.index.get_level_values(2) != 'blo']

You can do it this way:
In [263]:
subset.loc[subset.index.get_level_values(2) != 'blo']
Out[263]:
0 1 2 3
bar one bla -1.039335 -1.124656 0.057114 -0.284754
two bla 0.007208 -0.403559 -1.317075 -0.340171

For multiple values, I used this:
subset.iloc[~subset.index.get_level_values(2).isin(['blo'])]
In this way, you can use multiple values excluded at the same time.

Related

drop a level two column from multi index dataframe

Consider this dataframe:
import pandas as pd
import numpy as np
iterables = [['bar', 'baz', 'foo'], ['one', 'two']]
index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 6), index=['A', 'B', 'C'], columns=index)
print(df)
first bar baz foo
second one two one two one two
A -1.954583 -1.347156 -1.117026 -1.253150 0.057197 -1.520180
B 0.253937 1.267758 -0.805287 0.337042 0.650892 -0.379811
C 0.354798 -0.835234 1.172324 -0.663353 1.145299 0.651343
I would like to drop 'one' from each column, while retaining other structure.
With the end result looking something like this:
first bar baz foo
second two two two
A -1.347156 -1.253150 -1.520180
B 1.267758 0.337042 -0.379811
C -0.835234 -0.663353 0.651343
Use drop:
df.drop('one', axis=1, level=1)
first bar baz foo
second two two two
A 0.127419 -0.319655 -0.878161
B -0.563335 1.193819 -0.469539
C -1.324932 -0.550495 1.378335
This should work as well:
df.loc[:,df.columns.get_level_values(1)!='one']
Try:
print(df.loc[:, (slice(None), "two")])
Prints:
first bar baz foo
second two two two
A -1.104831 0.286379 1.121148
B -1.637677 -2.297138 0.381137
C -1.556391 0.779042 2.316628
Use pd.IndexSlice:
indx = pd.IndexSlice
df.loc[:, indx[:, 'two']]
Output:
first bar baz foo
second two two two
A 1.169699 1.434761 0.917152
B -0.732991 -0.086613 -0.803092
C -0.813872 -0.706504 0.227000

Extract max of a multiindex pandas dataframe with strings and NaN

I've got the following multiindex dataframe:
first bar baz foo
second one two one two one two
first second
bar one NaN -0.056213 0.988634 0.103149 1.5858 -0.101334
two -0.47464 -0.010561 2.679586 -0.080154 <LQ -0.422063
baz one <LQ 0.220080 1.495349 0.302883 -0.205234 0.781887
two 0.638597 0.276678 -0.408217 -0.083598 -1.15187 -1.724097
foo one 0.275549 -1.088070 0.259929 -0.782472 -1.1825 -1.346999
two 0.857858 0.783795 -0.655590 -1.969776 -0.964557 -0.220568
I would like to to extract the max along one level. Expected result:
first bar baz foo
second
one 0.275549 1.495349 1.5858
two 0.857858 2.679586 -0.964557
Here is what I tried:
df.xs('one', level=1, axis = 1).max(axis=0, level=1, skipna = True, numeric_only = False)
And the obtained result:
first baz
second
one 1.495349
two 2.679586
How do I get Pandas to not ignore the whole column if one cell contains a string?
(created like this:)
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
df['bar','one'].loc['bar','one'] = np.NaN
df['bar','one'].loc['baz','one'] = '<LQ'
df['foo','one'].loc['bar','two'] = '<LQ'
I guess you would need to replace the non-numeric with na:
(df.xs('one', level=1, axis=1)
.apply(pd.to_numeric, errors='coerce')
.max(level=1,skipna=True)
)
Output (with np.random.seed(1)):
first bar baz foo
second
one 0.900856 1.133769 0.865408
two 1.744812 0.319039 0.901591

Add column to pandas multiindex dataframe

I have a pandas dataframe that looks like this:
import pandas as pd
import numpy as np
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df = pd.DataFrame(np.random.randn(8,4),index=arrays,columns=['A','B','C','D'])
I want to add a column E such that df.loc[(slice(None),'one'),'E'] = 1 and df.loc[(slice(None),'two'),'E'] = 2, and I want to do this without iterating over ['one', 'two']. I tried the following:
df.loc[(slice(None),slice('one','two')),'E'] = pd.Series([1,2],index=['one','two'])
but it just adds a column E with NaN. What's the right way to do this?
Here is one way reindex
df.loc[:,'E']=pd.Series([1,2],index=['one','two']).reindex(df.index.get_level_values(1)).values
df
A B C D E
bar one -0.856175 -0.383711 -0.646510 0.110204 1
two 1.640114 0.099713 0.406629 0.774960 2
baz one 0.097198 -0.814920 0.234416 -0.057340 1
two -0.155276 0.788130 0.761469 0.770709 2
foo one 1.593564 -1.048519 -1.194868 0.191314 1
two -0.755624 0.678036 -0.899805 1.070639 2
qux one -0.560672 0.317915 -0.858048 0.418655 1
two 1.198208 0.662354 -1.353606 -0.184258 2
Methinks this is a good use case for Index.map:
df['E'] = df.index.get_level_values(1).map({'one':1, 'two':2})
df
A B C D E
bar one 0.956122 -0.705841 1.192686 -0.237942 1
two 1.155288 0.438166 1.122328 -0.997020 2
baz one -0.106794 1.451429 -0.618037 -2.037201 1
two -1.942589 -2.506441 -2.114164 -0.411639 2
foo one 1.278528 -0.442229 0.323527 -0.109991 1
two 0.008549 -0.168199 -0.174180 0.461164 2
qux one -1.175983 1.010127 0.920018 -0.195057 1
two 0.805393 -0.701344 -0.537223 0.156264 2
You can just get it from df.index.labels:
df['E'] = df.index.labels[1] + 1
print(df)
Output:
A B C D E
bar one 0.746123 1.264906 0.169694 -0.180074 1
two -1.439730 -0.100075 0.929750 0.511201 2
baz one 0.833037 1.547624 -1.116807 0.425093 1
two 0.969887 -0.705240 -2.100482 0.728977 2
foo one -0.977623 -0.800136 -0.361394 0.396451 1
two 1.158378 -1.892137 -0.987366 -0.081511 2
qux one 0.155531 0.275015 0.571397 -0.663358 1
two 0.710313 -0.255876 0.420092 -0.116537 2
Thanks to coldspeed, if you want different values (i.e x and y), use:
df['E'] = pd.Series(df.index.labels[1]).map({0: 'x', 1: 'y'}).tolist()
print(df)

compute mean of unique combinations in groupby pandas

I have he following pandas dataframe:
data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' :[2,1,2,1,2,1,2,1]})
that looks like:
A B C
0 foo one 2
1 bar one 1
2 foo two 2
3 bar three 1
4 foo two 2
5 bar two 1
6 foo one 2
7 foo three 1
What I need is to compute the mean of each unique combination of A and B. i.e.:
A B C
foo one 2
foo two 2
foo three 1
mean = 1.66666667
and having as output the 'means' computed per value of A i.e.:
foo 1.666667
bar 1
I tried with :
data.groupby(['A'], sort=False, as_index=False).mean()
but it returns me:
foo 1.8
bar 1
Is there a way to compute the mean of only unique combinations? How ?
This is essentially the same as #S_A's answer, but a bit more concise.
You can calculate the means across A and B with:
In [41]: df.groupby(['A', 'B']).mean()
Out[41]:
C
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
And then calculate the mean of these over A with:
In [42]: df.groupby(['A', 'B']).mean().groupby(level='A').mean()
Out[42]:
C
A
bar 1.000000
foo 1.666667
Yes. Here is a solution which you want. Firstly you make group corresponding column for making unique combination A and B column. Later from making group, you count mean() corresponding A column.
You can do this like:
from pandas import *
data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' :[2.0,1,2,1,2,1,2,1]})
data = data.groupby(['A','B'], sort=False, as_index=False).mean()
print data.groupby('A', sort=False, as_index=False).mean()
Output:
A C
0 foo 1.666667
1 bar 1.000000
When you data.groupby(['A'], sort=False, as_index=False).mean() do, it's mean you count group_by all value of C column according to A Column. That's why it return
foo 1.8 (9/8)
bar 1.0 (3/3)
I think you should find your answer :) :)
This worked for me
test = data
test = test.drop_duplicates()
test = test.groupby(['A']).mean()
Output:
C
A
bar 1.000000
foo 1.666667

Pandas sort by group aggregate and calculate sum of two colums

I want to do some analysis on data. So far I am enable to group the columns that I want, now i need to add two columns here is my logic:
import pandas as pd
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'bar'],
'B' : ['one', 'one', 'two', 'two',
'two', 'two', 'one', 'two'],
'C' : [-1,2,3,4,5,6,0,2],
'D' : [-1,2,3,4,5,6,0,2]})
grouped = df.groupby(['A','B']).sum()
print grouped
The output looks like this:
C D
A B
bar one 2 2
two 12 12
foo one -1 -1
two 8 8
[4 rows x 2 columns]
What I need now is two use some addition operation to add column C and D and generate a output like this:
A B Sum
bar one 4
two 24
foo one -2
two 16
Any ideas will really help me as i am new to python
You could define a new column Sum:
In [107]: grouped['Sum'] = grouped['C']+grouped['D']
Now grouped would look like this:
In [108]: grouped
Out[108]:
C D Sum
A B
bar one 2 2 4
two 12 12 24
foo one -1 -1 -2
two 8 8 16
[4 rows x 3 columns]
To select just the Sum column (as a DataFrame use double brackets):
In [109]: grouped[['Sum']]
Out[109]:
Sum
A B
bar one 4
two 24
foo one -2
two 16
[4 rows x 1 columns]

Categories

Resources