On a DataFrame with single-level columns, you can group the data along the columns using a dictionary:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'],
                   columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
dict_col = {'a': 'ab', 'b': 'ab', 'c': 'c', 'd': 'd',
            'e': 'efgh', 'f': 'efgh', 'g': 'efgh', 'h': 'efgh'}
df1.groupby(dict_col, axis=1).sum()
ab c d efgh
A 1.014831 1.274621 -1.490353 -0.954438
B 1.484857 -0.968642 0.700881 -3.281607
C 0.898556 1.444362 0.680974 -2.985182
On a MultiIndexed DataFrame:
mi_col = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'],
                                     ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']])
df2 = pd.DataFrame(np.random.randn(3, 32), index=['A', 'B', 'C'], columns=mi_col)
df2.groupby(dict_col, axis=1, level=1).sum()
ab c d efgh
A 6.583721 -1.554734 1.922187 1.100208
B 6.138441 0.653721 -0.204472 1.890755
C 0.951489 2.695940 -1.494028 0.907464
How can I get something like this (all elements of level 0 retained)?
bar baz foo
ab c d efgh ab c d efgh ......
A 6.583721 -1.554734 1.922187 1.100208 4.944954 -1.343831 0.939265 -3.614612 ......
B 6.138441 0.653721 -0.204472 1.890755 -0.347505 1.633708 0.392096 0.414880 ......
C 0.951489 2.695940 -1.494028 0.907464 1.905409 -1.021097 -2.399670 0.799798 ......
One way is to pass a function to groupby and then convert the resulting tuple keys back into a MultiIndex:
out = df2.groupby(lambda x: (x[0], dict_col[x[1]]), axis=1).sum()
out.columns = pd.MultiIndex.from_tuples(out.columns)
Another way is to flatten the column levels with stack, then unstack back after the groupby:
df2.stack(level=0).groupby(dict_col, axis=1).sum().unstack(level=-1)
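Note that the stack/unstack route puts the grouped labels on the outer column level, i.e. (ab, bar), (ab, baz) and so on. A minimal sketch (an addition, not part of the original answer) to match the layout of the first approach:

out2 = (df2.stack(level=0)
           .groupby(dict_col, axis=1).sum()
           .unstack(level=-1)
           .swaplevel(axis=1)      # put bar/baz/foo/qux back on the outer level
           .sort_index(axis=1))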
Related
I have a df with multi-indexed columns, like this:
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
I want to be able to select all rows where the values in one of the level 1 columns pass a certain test. If there were no multi-index on the columns I would say something like:
data[data['d']<1]
But of course that fails on a MultiIndex. The level 1 indexes are unique, so I don't want to have to specify the level 0 index, just level 1. I'd like to return the table above but missing row 1, where d > 1.
If the values in the second level are unique, it is necessary to convert the mask from a one-column DataFrame to a Series; a possible solution is DataFrame.squeeze:
np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
print (data.xs('d', axis=1, level=1))
two
0 1.331864
1 0.953490
2 -0.189313
3 0.064969
print (data.xs('d', axis=1, level=1).squeeze())
0 1.331864
1 0.953490
2 -0.189313
3 0.064969
Name: two, dtype: float64
print (data.xs('d', axis=1, level=1).squeeze().lt(1))
0 False
1 True
2 True
3 True
Name: two, dtype: bool
df = data[data.xs('d', axis=1, level=1).squeeze().lt(1)]
Alternative with DataFrame.iloc:
df = data[data.xs('d', axis=1, level=1).iloc[:, 0].lt(1)]
print (df)
one two
a b c d e f
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
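If you do know the full column tuple, plain tuple indexing avoids xs entirely (a sketch, not from the original answer):

df = data[data[('two', 'd')].lt(1)]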
When working with a MultiIndex, a selection can return multiple columns, as here when selecting by the c level:
np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
So first select with DataFrame.xs and compare with DataFrame.lt (the < operator):
print (data.xs('c', axis=1, level=1))
                  two
0  1.481278  0.685609
1 -0.235634 -0.344943
2  0.244509  0.464802
3 -0.963336  1.054969
m = data.xs('c', axis=1, level=1).lt(1)
#alternative
#m = data.xs('c', axis=1, level=1) < 1
print (m)
            two
0  False   True
1   True   True
2   True   True
3   True  False
Then test whether at least one value per row is True with DataFrame.any and filter by boolean indexing:
df1 = data[m.any(axis=1)]
print (df1)
one two
a b c a b c
0 -0.217679 0.821455 1.481278 1.331864 -0.361865 0.685609
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
Or test whether all values per row are True with DataFrame.all and filter:
df1 = data[m.all(axis=1)]
print (df1)
one two
a b c a b c
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
Using your supplied data, a combination of xs and squeeze can help with the filtering. This works on the assumption that the level 1 entries are unique, as indicated in your question:
np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
one two
a b c d e f
0 -0.217679 0.821455 1.481278 1.331864 -0.361865 0.685609
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
Say you want to filter for d less than 1:
# squeeze turns it into a Series, making it easy to pass to loc via boolean indexing
condition = data.xs('d', axis=1, level=1).lt(1).squeeze()
# or you could use loc:
# condition = data.loc(axis=1)[:, 'd'].lt(1).squeeze()
data.loc[condition]
one two
a b c d e f
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
I think this can be done using query:
data.query("some_column < 1")
and get_level_values:
data[data.index.get_level_values('some_column') < 1]
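For the multi-indexed columns in this question, those ideas need a small adaptation (a sketch, assuming the unique level-1 labels from above): query wants flat column names, and get_level_values applies to the columns here rather than the index:

# query needs flat names, so work on a view with the first level dropped
flat = data.droplevel(0, axis=1)           # columns: a, b, c, d, e, f
out = data.loc[flat.query('d < 1').index]

# get_level_values on the columns locates 'd' by its level-1 label
d_col = data.columns[data.columns.get_level_values(1) == 'd']
out2 = data[data[d_col].squeeze().lt(1)]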
Thanks everyone for your help. As usual with these things, the specific answer to the problem is not as interesting as what you've learned in trying to fix it, and I learned a lot about .query, .xs and much more.
However, I ended up taking a side route to address my specific issue: I copied the columns to a new variable, dropped a column level, did my calculations, then put the original indexes back in place. E.g.:
cols = data.columns
data = data.droplevel(level=1, axis=1)
# do calculations
data.columns = cols
The advantage was that I could top and tail the operation modifying the indexes, but all the data manipulation in between used idioms I'm familiar with.
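As a concrete sketch with the example columns from earlier (there it is level 0 you would drop, since the unique names a-f live on level 1):

cols = data.columns                      # keep the original MultiIndex
data = data.droplevel(level=0, axis=1)   # plain columns: a, b, c, d, e, f
data['d'] = data['d'] * 2                # any familiar single-level manipulation
data.columns = cols                      # restore the MultiIndex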
At some point I'll sit down and read about multi-indexes at length.
I'm trying to join two dataframes - one with multiindex columns and the other with a single column name. They have similar index.
I get the following warning:
"UserWarning: merging between different levels can give an unintended result (3 levels on the left, 1 on the right)"
For example:
import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
np.random.seed(2022) # so the data is the same each time
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
df2 = pd.DataFrame(np.random.randn(3), index=['A', 'B', 'C'],columns=['w'])
df3 = df.join(df2)
DataFrame Views
df
first bar baz foo qux
second one two one two one two one two
A -0.000528 -0.274901 -0.139286 1.984686 0.282109 0.760809 0.300982 0.540297
B 0.373497 0.377813 -0.090213 -2.305943 1.142760 -1.535654 -0.863752 1.016545
C 1.033964 -0.824492 0.018905 -0.383344 -0.304185 0.997292 -0.127274 -1.475886
df2
w
A -1.940906
B 0.833649
C -0.567218
df3 - Result
(bar, one) (bar, two) (baz, one) (baz, two) (foo, one) (foo, two) (qux, one) (qux, two) w
A -0.000528 -0.274901 -0.139286 1.984686 0.282109 0.760809 0.300982 0.540297 -1.940906
B 0.373497 0.377813 -0.090213 -2.305943 1.142760 -1.535654 -0.863752 1.016545 0.833649
C 1.033964 -0.824492 0.018905 -0.383344 -0.304185 0.997292 -0.127274 -1.475886 -0.567218
df.join(df2) from pandas v1.3.0 results in a FutureWarning
FutureWarning: merging between different levels is deprecated and will be removed in a future version. (2 levels on the left, 1 on the right) df3 = df.join(df2).
What is the best way to join these two dataframes?
It depends on what you want! Do you want the column from df2 to be aligned with the first or the second level of columns from df?
Either way, you have to add a level to the columns of df2.
Super cheezy with pd.concat
df.join(pd.concat([df2], axis=1, keys=['a']))
Better way
df2.columns = pd.MultiIndex.from_product([['a'], df2.columns])
df.join(df2)
I think the simplest way is to convert the columns of df2 to a MultiIndex, and then use concat or join:
df2.columns = pd.MultiIndex.from_tuples([('a','w')])
print (df2)
a
w
A -1.940906
B 0.833649
C -0.567218
Or:
df2.columns = [['a'], df2.columns]
print (df2)
a
w
A -1.940906
B 0.833649
C -0.567218
df3 = pd.concat([df, df2], axis=1)
Or:
df3 = df.join(df2)
Result:
print (df3)
first bar baz foo qux a
second one two one two one two one two w
A -0.000528 -0.274901 -0.139286 1.984686 0.282109 0.760809 0.300982 0.540297 -1.940906
B 0.373497 0.377813 -0.090213 -2.305943 1.142760 -1.535654 -0.863752 1.016545 0.833649
C 1.033964 -0.824492 0.018905 -0.383344 -0.304185 0.997292 -0.127274 -1.475886 -0.567218
Additional Resources
pandas docs: Joining a single Index to a MultiIndex
pandas docs: Joining with two MultiIndexes
I am not sure if this is possible. I have two dataframes df1 and df2 which are presented like this:
df1            df2
id  value      id  value
a   5          a   3
c   9          b   7
d   4          c   6
f   2          d   8
               e   2
               f   1
They will have many more entries in reality than presented here. I would like to create a third dataframe df3 based on the values in df1 and df2. Any values in df1 would take precedence over values in df2 when writing to df3 (if the same id is present in both df1 and df2) so in this example I would return:
df3
id value
a 5
b 7
c 9
d 4
e 2
f 2
I have tried using df2 as the base (df2 will have all of the ids present for the whole universe) and then overwriting the value for ids that are present in df1, but I cannot find the merge syntax to do this.
You could use combine_first, provided that you first make the DataFrame index id (so that the values get aligned by id):
In [80]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[80]:
id value
0 a 5.0
1 b 7.0
2 c 9.0
3 d 4.0
4 e 2.0
5 f 2.0
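Note that the values come back as floats because the alignment introduces NaNs before combining; if integer dtype matters, you can cast back afterwards (a small sketch):

out = df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
out['value'] = out['value'].astype(int)   # safe here: no NaNs remain after combining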
Since you mentioned merging, you might be interested to see that
you could merge df1 and df2 on id, and then use fillna to replace NaNs in df1's value column with values from df2's value column:
df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})
result = pd.merge(df2, df1, on='id', how='left', suffixes=('_x', ''))
result['value'] = result['value'].fillna(result['value_x'])
result = result[['id', 'value']]
print(result)
yields the same result, though the first method is simpler.
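A third option (a sketch, not from the answers above) is concatenation plus drop_duplicates, relying on row order for precedence:

df3 = (pd.concat([df1, df2])
         .drop_duplicates(subset='id', keep='first')   # df1 rows come first, so they win
         .sort_values('id')
         .reset_index(drop=True))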
Consider the following DataFrame:
arrays = [['foo', 'bar', 'bar', 'bar'],
          ['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(4,4), columns = columnValues)
print(df)
foo bar
A B C D
0 0.859664 0.671857 0.685368 0.939156
1 0.155301 0.495899 0.733943 0.585682
2 0.124663 0.467614 0.622972 0.567858
3 0.789442 0.048050 0.630039 0.722298
Say I want to remove the first column, like so:
df.drop(df.columns[[0]], axis = 1, inplace = True)
print(df)
bar
B C D
0 0.671857 0.685368 0.939156
1 0.495899 0.733943 0.585682
2 0.467614 0.622972 0.567858
3 0.048050 0.630039 0.722298
This produces the expected result; however, the column labels foo and A are retained:
print(df.columns.levels)
[['bar', 'foo'], ['A', 'B', 'C', 'D']]
Is there a way to completely drop a column, including its labels, from a MultiIndex DataFrame?
EDIT: As suggested by John, I had a look at https://github.com/pydata/pandas/issues/12822. What I got from it is that it's not a bug; however, I believe the suggested solution (https://github.com/pydata/pandas/issues/2770#issuecomment-76500001) does not work for me. Am I missing something here?
df2 = df.drop(df.columns[[0]], axis = 1)
print(df2)
bar
B C D
0 0.969674 0.068575 0.688838
1 0.650791 0.122194 0.289639
2 0.373423 0.470032 0.749777
3 0.707488 0.734461 0.252820
print(df2.columns[[0]])
MultiIndex(levels=[['bar', 'foo'], ['A', 'B', 'C', 'D']],
           labels=[[0], [1]])
df2.set_index(pd.MultiIndex.from_tuples(df2.columns.values))
ValueError: Length mismatch: Expected axis has 4 elements, new values have 3 elements
New Answer
As of pandas 0.20, pd.MultiIndex has a method pd.MultiIndex.remove_unused_levels
df.columns = df.columns.remove_unused_levels()
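A quick check with the example above shows the stale foo and A labels are gone:

print(df.columns.levels)   # [['bar'], ['B', 'C', 'D']]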
Old Answer
Our savior is pd.MultiIndex.to_series().
It returns a Series of tuples restricted to what is actually in the DataFrame:
df.columns = pd.MultiIndex.from_tuples(df.columns.to_series())
I was trying to generate a pivot table with multiple "values" columns. I know I can use aggfunc to aggregate values the way I want, but what if I don't want to sum or average both columns, but instead want the sum of one column and the mean of the other? Is this possible with pandas?
df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'three'] * 6,
    'B': ['A', 'B', 'C'] * 8,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
    'D': np.random.randn(24),
    'E': np.random.randn(24)
})
Now this will get a pivot table with sum:
pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc=np.sum)
And this one with mean:
pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc=np.mean)
How can I get the sum for D and the mean for E?
Hope my question is clear enough.
You can apply a specific function to a specific column by passing in a dict.
pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc={'D': np.sum, 'E': np.mean})
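The dict values can also be lists if you want several statistics for one column (per the pivot_table docs, a key may map to a function or a list of functions):

pd.pivot_table(df, values=['D', 'E'], index=['B'],
               aggfunc={'D': np.sum, 'E': [np.mean, np.sum]})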
You can concat two DataFrames:
>>> df1 = pd.pivot_table(df, values=['D'], index=['B'], aggfunc=np.sum)
>>> df2 = pd.pivot_table(df, values=['E'], index=['B'], aggfunc=np.mean)
>>> pd.concat((df1, df2), axis=1)
D E
B
A 1.810847 -0.524178
B 2.762190 -0.443031
C 0.867519 0.078460
or you can pass a list of functions as the aggfunc parameter and then select the columns you need:
>>> df3 = pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc=[np.sum, np.mean])
>>> df3
sum mean
D E D E
B
A 1.810847 -4.193425 0.226356 -0.524178
B 2.762190 -3.544245 0.345274 -0.443031
C 0.867519 0.627677 0.108440 0.078460
>>> df3 = df3.loc[:, [('sum', 'D'), ('mean', 'E')]]
>>> df3.columns = ['D', 'E']
>>> df3
D E
B
A 1.810847 -0.524178
B 2.762190 -0.443031
C 0.867519 0.078460
Although it would be nice to have an option to define aggfunc for each column individually, I don't know how it could be done; maybe pass a dict-like parameter into aggfunc, like {'D': np.mean, 'E': np.sum}. (Newer pandas supports exactly this, as the first answer above shows.)
Update: actually, in your case you can aggregate by hand with groupby:
>>> df.groupby('B').aggregate({'D':np.sum, 'E':np.mean})
E D
B
A -0.524178 1.810847
B -0.443031 2.762190
C 0.078460 0.867519
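Older pandas did not preserve dict key order (note the E D column order above); reorder explicitly if needed:

>>> df.groupby('B').aggregate({'D': np.sum, 'E': np.mean})[['D', 'E']]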
The pandas documentation shows the same dict pattern, using its own sample DataFrame (where column C holds 'large'/'small', unlike the df above):
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                       aggfunc={'D': np.mean, 'E': np.sum})
table
                  D         E
               mean       sum
A   C
bar large  5.500000  7.500000
    small  5.500000  8.500000
foo large  2.000000  4.500000
    small  2.333333  4.333333