Join dataframes - one with multiindex columns and the other without - python

I'm trying to join two dataframes - one with multiindex columns and the other with a single column name. They share the same index.
I get the following warning:
"UserWarning: merging between different levels can give an unintended result (3 levels on the left, 1 on the right)"
For example:
import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
np.random.seed(2022) # so the data is the same each time
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
df2 = pd.DataFrame(np.random.randn(3), index=['A', 'B', 'C'],columns=['w'])
df3 = df.join(df2)
DataFrame Views
df
first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A      -0.000528 -0.274901 -0.139286  1.984686  0.282109  0.760809  0.300982  0.540297
B       0.373497  0.377813 -0.090213 -2.305943  1.142760 -1.535654 -0.863752  1.016545
C       1.033964 -0.824492  0.018905 -0.383344 -0.304185  0.997292 -0.127274 -1.475886
df2
w
A -1.940906
B 0.833649
C -0.567218
df3 - Result
(bar, one) (bar, two) (baz, one) (baz, two) (foo, one) (foo, two) (qux, one) (qux, two) w
A -0.000528 -0.274901 -0.139286 1.984686 0.282109 0.760809 0.300982 0.540297 -1.940906
B 0.373497 0.377813 -0.090213 -2.305943 1.142760 -1.535654 -0.863752 1.016545 0.833649
C 1.033964 -0.824492 0.018905 -0.383344 -0.304185 0.997292 -0.127274 -1.475886 -0.567218
df.join(df2) from pandas v1.3.0 results in a FutureWarning
FutureWarning: merging between different levels is deprecated and will be removed in a future version. (2 levels on the left, 1 on the right) df3 = df.join(df2).
What is the best way to join these two dataframes?

It depends on what you want! Do you want the column from df2 to be aligned with the first or the second level of columns from df?
Either way, you have to add a level to the columns of df2.
Super cheesy with pd.concat
df.join(pd.concat([df2], axis=1, keys=['a']))
Better way
df2.columns = pd.MultiIndex.from_product([['a'], df2.columns])
df.join(df2)
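As a self-contained sketch of the "add a level" idea, the snippet below uses small hand-made frames instead of the random data from the question. The name df2_wide and the top-level label 'a' are arbitrary choices for illustration, not anything pandas requires.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the frames in the question (values are made up).
columns = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), index=list("ABC"), columns=columns)
df2 = pd.DataFrame({"w": [10.0, 20.0, 30.0]}, index=list("ABC"))

# Give df2 the same number of column levels as df before joining.
# "a" is an arbitrary label for the new top level.
df2_wide = df2.copy()
df2_wide.columns = pd.MultiIndex.from_product(
    [["a"], df2.columns], names=columns.names
)

df3 = df.join(df2_wide)
print(df3)
```

Working on a copy leaves df2 untouched, which helps if the flat column name is still needed elsewhere.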

I think the simplest way is to convert df2 to MultiIndex, and then use concat or join:
df2.columns = pd.MultiIndex.from_tuples([('a','w')])
print (df2)
a
w
A -1.940906
B 0.833649
C -0.567218
Or:
df2.columns = [['a'], df2.columns]
print (df2)
a
w
A -1.940906
B 0.833649
C -0.567218
df3 = pd.concat([df, df2], axis=1)
Or:
df3 = df.join(df2)
Result:
print (df3)
first        bar                 baz                 foo                 qux                   a
second       one       two       one       two       one       two       one       two         w
A      -0.000528 -0.274901 -0.139286  1.984686  0.282109  0.760809  0.300982  0.540297 -1.940906
B       0.373497  0.377813 -0.090213 -2.305943  1.142760 -1.535654 -0.863752  1.016545  0.833649
C       1.033964 -0.824492  0.018905 -0.383344 -0.304185  0.997292 -0.127274 -1.475886 -0.567218
Additional Resources
pandas docs: Joining a single Index to a MultiIndex
pandas docs: Joining with two MultiIndexes

Related

Pandas merge doesn't retain as many rows as I would think

Consider the following two data frames
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
Running
df3 = pd.merge(df1, df2, on='a')
Yields
a b c
0 foo 1 3
But why not the following?
a b c
0 foo 1 3
1 bar 2 -
1 baz - 4
What do I need to tell python to get it to output both rows?
By default, a pandas merge performs an inner join, if you are familiar with database joins. That means it only returns the rows that have a matching entry in both the left and right dataframes. For you, that is just 'foo'.
You can change that by setting the how argument. If you want all rows from both the left and right frames, set it to outer; if you want to keep all rows from the left frame, set it to left; and if you want to keep all rows from the right frame, set it to right.
pd.merge(df1, df2, on='a', how='outer') will join on matching keys, with each non-matching key returned as a new row and NaN filling in the blanks.
For an overview, look up the different types of SQL-style joins, which merge uses as its basis.
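To make the difference between the how options concrete, here is a small sketch using the two frames from the question. It only prints which keys survive each join; row order is deliberately not relied on, since it can differ between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["foo", "bar"], "b": [1, 2]})
df2 = pd.DataFrame({"a": ["foo", "baz"], "c": [3, 4]})

inner = pd.merge(df1, df2, on="a")               # default: how="inner"
outer = pd.merge(df1, df2, on="a", how="outer")  # keys from either frame
left = pd.merge(df1, df2, on="a", how="left")    # every key from df1
right = pd.merge(df1, df2, on="a", how="right")  # every key from df2

for name, frame in [("inner", inner), ("outer", outer), ("left", left), ("right", right)]:
    print(name, sorted(frame["a"]))
```

In the outer result, the cells with no counterpart in the other frame come back as NaN, which is why the integer columns b and c are converted to floats there.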

drop a level two column from multi index dataframe

Consider this dataframe:
import pandas as pd
import numpy as np
iterables = [['bar', 'baz', 'foo'], ['one', 'two']]
index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 6), index=['A', 'B', 'C'], columns=index)
print(df)
first bar baz foo
second one two one two one two
A -1.954583 -1.347156 -1.117026 -1.253150 0.057197 -1.520180
B 0.253937 1.267758 -0.805287 0.337042 0.650892 -0.379811
C 0.354798 -0.835234 1.172324 -0.663353 1.145299 0.651343
I would like to drop 'one' from each column, while retaining other structure.
With the end result looking something like this:
first bar baz foo
second two two two
A -1.347156 -1.253150 -1.520180
B 1.267758 0.337042 -0.379811
C -0.835234 -0.663353 0.651343
Use drop:
df.drop('one', axis=1, level=1)
first bar baz foo
second two two two
A 0.127419 -0.319655 -0.878161
B -0.563335 1.193819 -0.469539
C -1.324932 -0.550495 1.378335
This should work as well:
df.loc[:,df.columns.get_level_values(1)!='one']
Try:
print(df.loc[:, (slice(None), "two")])
Prints:
first bar baz foo
second two two two
A -1.104831 0.286379 1.121148
B -1.637677 -2.297138 0.381137
C -1.556391 0.779042 2.316628
Use pd.IndexSlice:
indx = pd.IndexSlice
df.loc[:, indx[:, 'two']]
Output:
first bar baz foo
second two two two
A 1.169699 1.434761 0.917152
B -0.732991 -0.086613 -0.803092
C -0.813872 -0.706504 0.227000
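The outputs above differ only because each answer generated fresh random data. As a quick check, the sketch below runs all four approaches on one fixed frame (made-up values from np.arange) and compares the results.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz", "foo"], ["one", "two"]], names=["first", "second"]
)
# Deterministic values so every approach sees the same data.
df = pd.DataFrame(np.arange(18.0).reshape(3, 6), index=["A", "B", "C"], columns=columns)

by_drop = df.drop("one", axis=1, level=1)
by_mask = df.loc[:, df.columns.get_level_values(1) != "one"]
by_tuple = df.loc[:, (slice(None), "two")]
by_index_slice = df.loc[:, pd.IndexSlice[:, "two"]]

# Compare column labels and values against the drop() result.
same = all(
    list(r.columns) == list(by_drop.columns)
    and np.array_equal(r.to_numpy(), by_drop.to_numpy())
    for r in (by_mask, by_tuple, by_index_slice)
)
print(same)
```

Note that drop and the boolean mask say "everything except 'one'", while the two slicing forms say "only 'two'". They coincide here because the second level has just those two labels.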

Extract max of a multiindex pandas dataframe with strings and NaN

I've got the following multiindex dataframe:
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one          NaN -0.056213  0.988634  0.103149    1.5858 -0.101334
      two     -0.47464 -0.010561  2.679586 -0.080154       <LQ -0.422063
baz   one          <LQ  0.220080  1.495349  0.302883 -0.205234  0.781887
      two     0.638597  0.276678 -0.408217 -0.083598  -1.15187 -1.724097
foo   one     0.275549 -1.088070  0.259929 -0.782472   -1.1825 -1.346999
      two     0.857858  0.783795 -0.655590 -1.969776 -0.964557 -0.220568
I would like to extract the max along one level. Expected result:
first bar baz foo
second
one 0.275549 1.495349 1.5858
two 0.857858 2.679586 -0.964557
Here is what I tried:
df.xs('one', level=1, axis = 1).max(axis=0, level=1, skipna = True, numeric_only = False)
And the obtained result:
first baz
second
one 1.495349
two 2.679586
How do I get Pandas to not ignore the whole column if one cell contains a string?
(created like this:)
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
df['bar','one'].loc['bar','one'] = np.NaN
df['bar','one'].loc['baz','one'] = '<LQ'
df['foo','one'].loc['bar','two'] = '<LQ'
I guess you would need to replace the non-numeric with na:
(df.xs('one', level=1, axis=1)
.apply(pd.to_numeric, errors='coerce')
.max(level=1,skipna=True)
)
Output (with np.random.seed(1)):
first bar baz foo
second
one 0.900856 1.133769 0.865408
two 1.744812 0.319039 0.901591
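Here is a self-contained sketch of the same idea on a small hand-made frame (the values are invented for illustration). One assumption to flag: it uses groupby(level=1).max() in place of max(level=1), because the level argument of reductions such as max was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)

# Object array so numbers, NaN and the '<LQ' marker can share a column.
data = np.arange(16, dtype=float).reshape(4, 4).astype(object)
data[0, 0] = np.nan   # row (bar, one), column (bar, one)
data[1, 0] = "<LQ"    # row (bar, two), column (bar, one)
df = pd.DataFrame(data, index=index, columns=index)

result = (
    df.xs("one", level=1, axis=1)               # keep only the 'one' columns
      .apply(pd.to_numeric, errors="coerce")    # '<LQ' becomes NaN
      .groupby(level=1)                         # group rows by the 'second' level
      .max()                                    # NaN is skipped by default
)
print(result)
```

Coercing column by column turns the string markers into NaN, so a column containing a string is no longer dropped from the reduction.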

Remove column from multi index dataframe

Consider the following DataFrame:
arrays = [['foo', 'bar', 'bar', 'bar'],
['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(4,4), columns = columnValues)
print(df)
foo bar
A B C D
0 0.859664 0.671857 0.685368 0.939156
1 0.155301 0.495899 0.733943 0.585682
2 0.124663 0.467614 0.622972 0.567858
3 0.789442 0.048050 0.630039 0.722298
Say I want to remove the first column, like so:
df.drop(df.columns[[0]], axis = 1, inplace = True)
print(df)
bar
B C D
0 0.671857 0.685368 0.939156
1 0.495899 0.733943 0.585682
2 0.467614 0.622972 0.567858
3 0.048050 0.630039 0.722298
This produces the expected result; however, the column labels foo and A are retained:
print(df.columns.levels)
[['bar', 'foo'], ['A', 'B', 'C', 'D']]
Is there a way to completely drop a column, including its labels, from a MultiIndex DataFrame?
EDIT: As suggested by John, I had a look at https://github.com/pydata/pandas/issues/12822. What I got from it is that it's not a bug, however I believe the suggested solution (https://github.com/pydata/pandas/issues/2770#issuecomment-76500001) does not work for me. Am I missing something here?
df2 = df.drop(df.columns[[0]], axis = 1)
print(df2)
bar
B C D
0 0.969674 0.068575 0.688838
1 0.650791 0.122194 0.289639
2 0.373423 0.470032 0.749777
3 0.707488 0.734461 0.252820
print(df2.columns[[0]])
MultiIndex(levels=[['bar', 'foo'], ['A', 'B', 'C', 'D']],
labels=[[0], [1]])
df2.set_index(pd.MultiIndex.from_tuples(df2.columns.values))
ValueError: Length mismatch: Expected axis has 4 elements, new values have 3 elements
New Answer
As of pandas 0.20, pd.MultiIndex has a method pd.MultiIndex.remove_unused_levels
df.columns = df.columns.remove_unused_levels()
Old Answer
Our savior is pd.MultiIndex.to_series()
it returns a series of tuples restricted to what is in the DataFrame
df.columns = pd.MultiIndex.from_tuples(df.columns.to_series())
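A minimal sketch of the remove_unused_levels approach, using a frame shaped like the one in the question but filled with zeros. The variable names trimmed, before and after are only for illustration.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("foo", "A"), ("bar", "B"), ("bar", "C"), ("bar", "D")]
)
df = pd.DataFrame(np.zeros((2, 4)), columns=columns)

# Drop the first column; the stored levels may still list 'foo' and 'A'.
trimmed = df.drop(df.columns[[0]], axis=1)
before = [list(level) for level in trimmed.columns.levels]

# Rebuild the levels so they contain only labels that are actually used.
trimmed.columns = trimmed.columns.remove_unused_levels()
after = [list(level) for level in trimmed.columns.levels]

print(before)
print(after)
```

The data itself is unchanged by this step; only the level metadata of the column index is trimmed.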

define aggfunc for each values column in pandas pivot table

I was trying to generate a pivot table with multiple "values" columns. I know I can use aggfunc to aggregate values the way I want, but what if I don't want to sum or average both columns, and instead want the sum of one column and the mean of the other? Is it possible to do so using pandas?
df = pd.DataFrame({
'A' : ['one', 'one', 'two', 'three'] * 6,
'B' : ['A', 'B', 'C'] * 8,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
'D' : np.random.randn(24),
'E' : np.random.randn(24)
})
Now this will get a pivot table with sum:
pd.pivot_table(df, values=['D','E'], index=['B'], aggfunc=np.sum)
And this for mean:
pd.pivot_table(df, values=['D','E'], index=['B'], aggfunc=np.mean)
How can I get sum for D and mean for E?
Hope my question is clear enough.
You can apply a specific function to a specific column by passing in a dict.
pd.pivot_table(df, values=['D','E'], index=['B'], aggfunc={'D':np.sum, 'E':np.mean})
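For a result that can be checked by hand, here is a small sketch of the dict form on made-up numbers. It assumes a pandas version where pivot_table takes index (the keyword that replaced rows) and passes the aggregations as the strings 'sum' and 'mean'.

```python
import pandas as pd

# Tiny frame with values chosen so the aggregates are easy to verify.
df = pd.DataFrame({
    "B": ["A", "B", "C"] * 2,
    "D": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "E": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Sum of D and mean of E for each value of B.
table = pd.pivot_table(
    df, values=["D", "E"], index=["B"], aggfunc={"D": "sum", "E": "mean"}
)
print(table)
```

For B == 'A', column D holds 1 + 4 = 5 and column E holds (10 + 40) / 2 = 25.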
You can concat two DataFrames:
>>> df1 = pd.pivot_table(df, values=['D'], index=['B'], aggfunc=np.sum)
>>> df2 = pd.pivot_table(df, values=['E'], index=['B'], aggfunc=np.mean)
>>> pd.concat((df1, df2), axis=1)
D E
B
A 1.810847 -0.524178
B 2.762190 -0.443031
C 0.867519 0.078460
or you can pass list of functions as aggfunc parameter and then reindex:
>>> df3 = pd.pivot_table(df, values=['D','E'], index=['B'], aggfunc=[np.sum, np.mean])
>>> df3
sum mean
D E D E
B
A 1.810847 -4.193425 0.226356 -0.524178
B 2.762190 -3.544245 0.345274 -0.443031
C 0.867519 0.627677 0.108440 0.078460
>>> df3 = df3.loc[:, [('sum', 'D'), ('mean','E')]]
>>> df3.columns = ['D', 'E']
>>> df3
D E
B
A 1.810847 -0.524178
B 2.762190 -0.443031
C 0.867519 0.078460
Although it would be nice to have an option to define aggfunc for each column individually. I don't know how it could be done; maybe by passing a dict-like parameter into aggfunc, like {'D':np.mean, 'E':np.sum}.
Update: actually, in your case you can pivot by hand:
>>> df.groupby('B').aggregate({'D':np.sum, 'E':np.mean})
E D
B
A -0.524178 1.810847
B -0.443031 2.762190
C 0.078460 0.867519
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                       aggfunc={'D': np.mean, 'E': np.sum})
table
D E
mean sum
A C
bar large 5.500000 7.500000
small 5.500000 8.500000
foo large 2.000000 4.500000
small 2.333333 4.333333
