Extract max of a multiindex pandas dataframe with strings and NaN - python

I've got the following multiindex dataframe:
first bar baz foo
second one two one two one two
first second
bar one NaN -0.056213 0.988634 0.103149 1.5858 -0.101334
two -0.47464 -0.010561 2.679586 -0.080154 <LQ -0.422063
baz one <LQ 0.220080 1.495349 0.302883 -0.205234 0.781887
two 0.638597 0.276678 -0.408217 -0.083598 -1.15187 -1.724097
foo one 0.275549 -1.088070 0.259929 -0.782472 -1.1825 -1.346999
two 0.857858 0.783795 -0.655590 -1.969776 -0.964557 -0.220568
I would like to extract the max along one level. Expected result:
first bar baz foo
second
one 0.275549 1.495349 1.5858
two 0.857858 2.679586 -0.964557
Here is what I tried:
df.xs('one', level=1, axis = 1).max(axis=0, level=1, skipna = True, numeric_only = False)
And the obtained result:
first baz
second
one 1.495349
two 2.679586
How do I get Pandas to not ignore the whole column if one cell contains a string?
(created like this:)
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
df['bar','one'].loc['bar','one'] = np.NaN
df['bar','one'].loc['baz','one'] = '<LQ'
df['foo','one'].loc['bar','two'] = '<LQ'

I guess you would need to replace the non-numeric values with NaN first:
(df.xs('one', level=1, axis=1)
.apply(pd.to_numeric, errors='coerce')
.max(level=1,skipna=True)
)
Output (with np.random.seed(1)):
first bar baz foo
second
one 0.900856 1.133769 0.865408
two 1.744812 0.319039 0.901591
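On recent pandas versions the level argument of max is no longer available (it was removed in pandas 2.0), so the same idea can be written with groupby. A minimal sketch, assuming the frame is rebuilt with object dtype and a seeded generator (both are illustrative choices, not taken from the question):

```python
import numpy as np
import pandas as pd

# Rebuild a 6x6 frame with the same row/column MultiIndex as in the question.
rng = np.random.default_rng(0)  # seed is an arbitrary assumption
idx = pd.MultiIndex.from_product(
    [["bar", "baz", "foo"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(rng.standard_normal((6, 6)), index=idx, columns=idx).astype(object)

# Insert the NaN and the '<LQ' strings with a single .loc call each.
df.loc[("bar", "one"), ("bar", "one")] = np.nan
df.loc[("baz", "one"), ("bar", "one")] = "<LQ"
df.loc[("bar", "two"), ("foo", "one")] = "<LQ"

# Coerce strings to NaN, then take the max per value of the 'second' row level.
numeric = (
    df.xs("one", level="second", axis=1)
      .apply(pd.to_numeric, errors="coerce")
)
result = numeric.groupby(level="second").max()
print(result)
```

Because the coerced columns are float64, max skips the NaN cells by default and no column is dropped.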

Related

drop a level two column from multi index dataframe

Consider this dataframe:
import pandas as pd
import numpy as np
iterables = [['bar', 'baz', 'foo'], ['one', 'two']]
index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 6), index=['A', 'B', 'C'], columns=index)
print(df)
first bar baz foo
second one two one two one two
A -1.954583 -1.347156 -1.117026 -1.253150 0.057197 -1.520180
B 0.253937 1.267758 -0.805287 0.337042 0.650892 -0.379811
C 0.354798 -0.835234 1.172324 -0.663353 1.145299 0.651343
I would like to drop 'one' from each column, while retaining other structure.
With the end result looking something like this:
first bar baz foo
second two two two
A -1.347156 -1.253150 -1.520180
B 1.267758 0.337042 -0.379811
C -0.835234 -0.663353 0.651343
Use drop:
df.drop('one', axis=1, level=1)
first bar baz foo
second two two two
A 0.127419 -0.319655 -0.878161
B -0.563335 1.193819 -0.469539
C -1.324932 -0.550495 1.378335
This should work as well:
df.loc[:,df.columns.get_level_values(1)!='one']
Try:
print(df.loc[:, (slice(None), "two")])
Prints:
first bar baz foo
second two two two
A -1.104831 0.286379 1.121148
B -1.637677 -2.297138 0.381137
C -1.556391 0.779042 2.316628
Use pd.IndexSlice:
indx = pd.IndexSlice
df.loc[:, indx[:, 'two']]
Output:
first bar baz foo
second two two two
A 1.169699 1.434761 0.917152
B -0.732991 -0.086613 -0.803092
C -0.813872 -0.706504 0.227000
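The four answers above should select the same columns. A small sketch that checks this on a frame with fixed integer values (the values are an assumption, chosen so the outputs are easy to compare):

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["bar", "baz", "foo"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(18).reshape(3, 6), index=list("ABC"), columns=cols)

a = df.drop("one", axis=1, level=1)                        # drop by label on level 1
b = df.loc[:, df.columns.get_level_values(1) != "one"]     # boolean mask on level 1
c = df.loc[:, (slice(None), "two")]                        # explicit slice tuple
d = df.loc[:, pd.IndexSlice[:, "two"]]                     # same slice via IndexSlice

for other in (b, c, d):
    # Same column labels and same values as the drop() result.
    print(list(other.columns) == list(a.columns), (other.values == a.values).all())
```

Note the difference in intent: drop and the boolean mask exclude 'one', while the two slice forms keep only 'two'. They agree here only because the second level has exactly those two values.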

Calculating aggregate values in a pandas dataframe with multiple columns

I have a Pandas DataFrame with multiple columns.
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
print(df)
first bar baz foo qux \
second one two one two one two one
A -0.093829 -0.159939 -0.386961 -0.367417 0.625646 1.286186 0.429855
B 0.440266 0.345161 1.798363 -1.265215 0.204303 -1.492993 -1.714360
C 0.689076 -1.211060 -0.265888 0.769467 -0.706941 0.086907 -0.892892
first
second two
A -1.006210
B -0.275578
C -0.563757
I want to calculate the mean and standard deviation of each column, grouped by the upper column level. Once I have them, I want to double the columns in the lower level, appending the statistic to each column name as "column name" + "_" + "std/mean".
from itertools import product  # needed for product() below

group_cols = df.groupby(df.columns.get_level_values('first'), axis=1)
list_stat_dfs = []
for key, group in group_cols:
    group_descr = group.describe().loc[['mean', 'std'], :] # Get mean and std from single site
    group_descr.loc[:, (key, 'stats')] = group_descr.index
    group_descr.loc[:, (key, 'first')] = key
    group_descr.columns = group_descr.columns.droplevel(0) # Remove upper level column (site_name)
    group_descr = group_descr.pivot(columns='stats', index='first') # Rows to columns
    col_prod = list(product(group_descr.columns.levels[0], group_descr.columns.levels[1]))
    cols = ['_'.join((col[0], col[1])) for col in col_prod]
    group_descr.columns = pd.MultiIndex.from_product(([key], cols)) # From multiple columns to single column
    group_descr.reset_index(inplace=True)
    list_stat_dfs.append(group_descr)
group_descr = pd.concat(list_stat_dfs, axis=1)
print(group_descr)
first bar first baz \
one_mean one_std two_mean two_std one_mean one_std
0 bar 0.507185 1.799053 -0.249692 1.41507 baz -0.147664 0.595927
first foo first \
two_mean two_std one_mean one_std two_mean two_std
0 0.160018 1.405113 foo -0.433644 1.245972 0.254995 0.846983 qux
qux
one_mean one_std two_mean two_std
0 0.667629 0.315417 -0.757989 0.683273
As you can see, I have been able to manage it with a for loop and a few lines of code. Can someone do the same thing in a more optimized way? I am quite sure that with Pandas the same thing can be done in a few lines of code.
I think you need to get the mean and std of df, then concat them together and reshape with unstack:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
np.random.seed(1000)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
print(df)
first bar baz foo qux \
second one two one two one two one
A -0.804458 0.320932 -0.025483 0.644324 -0.300797 0.389475 -0.107437
B 0.595036 -0.464668 0.667281 -0.806116 -1.196070 -0.405960 -0.182377
C -0.138422 0.705692 1.271795 -0.986747 -0.334835 -0.099482 0.407192
first
second two
A -0.479983
B 0.103193
C 0.919388
df = pd.concat([df.mean(), df.std()], keys=('mean','std')).unstack(1)
df.index = [[0] * len(df.index), ['_'.join((col[1], col[0])) for col in df.index]]
df = df.unstack()
print (df)
first bar baz \
one_mean one_std two_mean two_std one_mean one_std two_mean
0 -0.115948 0.700018 0.187319 0.596511 0.637865 0.649139 -0.382846
first foo qux \
two_std one_mean one_std two_mean two_std one_mean one_std
0 0.894129 -0.610567 0.507346 -0.038656 0.401191 0.039126 0.32095
first
two_mean two_std
0 0.180866 0.702911
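A variation on the same reshape, as a sketch rather than the answerer's method: compute both statistics with agg, unstack into a Series, and rebuild the column labels by hand. The variable names stats, flat and result are illustrative.

```python
import numpy as np
import pandas as pd

arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
np.random.seed(1000)
df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

# Rows 'mean' and 'std', columns still (first, second).
stats = df.agg(["mean", "std"])

# Series indexed by (first, second, statistic).
flat = stats.unstack()

# Merge the lower level and the statistic into one label, e.g. 'one_mean'.
flat.index = pd.MultiIndex.from_tuples(
    [(first, f"{second}_{stat}") for first, second, stat in flat.index]
)

# One-row frame with (first, '<second>_<stat>') columns.
result = flat.to_frame().T
print(result)
```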

Pandas Custom Sort Row in Multiindex

Given the following:
import pandas as pd
arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz'],
['total', 'two', 'one', 'two', 'four', 'total', 'five']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(7), index=index)
s
first second
bar total 0.334158
two -0.267854
one 1.161727
baz two -0.748685
four -0.888634
total 0.383310
five 0.506120
dtype: float64
How do I ensure that the 'total' rows (per the second index) are always at the bottom of each group like this?:
first second
bar one 0.210911
two 0.628357
total -0.911331
baz two 0.315396
four -0.195451
five 0.060159
total 0.638313
dtype: float64
solution 1
I'm not happy with this. I'm working on a different solution
unstacked = s.unstack(0)
total = unstacked.loc['total']
unstacked.drop('total').append(total).unstack().dropna()
first second
bar one 1.682996
two 0.343783
total 1.287503
baz five 0.360170
four 1.113498
two 0.083691
total -0.377132
dtype: float64
solution 2
I feel better about this one
second = pd.Categorical(
s.index.levels[1].values,
categories=['one', 'two', 'three', 'four', 'five', 'total'],
ordered=True
)
s.index.set_levels(second, level='second', inplace=True)
cols = s.index.names
s.reset_index().sort_values(cols).set_index(cols)
0
first second
bar one 1.682996
two 0.343783
total 1.287503
baz two 0.083691
four 1.113498
five 0.360170
total -0.377132
Use unstack to create a DataFrame whose columns come from the second level of the MultiIndex, then reorder the columns so that total is the last one, and finally convert the columns to an ordered CategoricalIndex.
That way, after stack, total is last within each group.
np.random.seed(123)
arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz'],
['total', 'two', 'one', 'two', 'four', 'total', 'five']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(7), index=index)
print (s)
first second
bar total -1.085631
two 0.997345
one 0.282978
baz two -1.506295
four -0.578600
total 1.651437
five -2.426679
dtype: float64
df = s.unstack()
df = df[df.columns[df.columns != 'total'].tolist() + ['total']]
df.columns = pd.CategoricalIndex(df.columns, ordered=True)
print (df)
second five four one two total
first
bar NaN NaN 0.282978 0.997345 -1.085631
baz -2.426679 -0.5786 NaN -1.506295 1.651437
s1 = df.stack()
print (s1)
first second
bar one 0.282978
two 0.997345
total -1.085631
baz five -2.426679
four -0.578600
two -1.506295
total 1.651437
dtype: float64
print (s1.sort_index())
first second
bar one 0.282978
two 0.997345
total -1.085631
baz five -2.426679
four -0.578600
two -1.506295
total 1.651437
dtype: float64
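Both DataFrame.append and the inplace argument of set_levels were removed in pandas 2.0, so solutions 1 and 2 may not run unchanged on current versions. A minimal sketch of the ordered-categorical idea that avoids both, assuming the custom order list below is the one wanted (it is taken from solution 2):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
arrays = [["bar", "bar", "bar", "baz", "baz", "baz", "baz"],
          ["total", "two", "one", "two", "four", "total", "five"]]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
s = pd.Series(np.random.randn(7), index=index)

# Desired order of the second level; 'total' goes last.
order = ["one", "two", "three", "four", "five", "total"]

frame = s.rename("value").reset_index()
# An ordered categorical sorts by category position instead of alphabetically.
frame["second"] = pd.Categorical(frame["second"], categories=order, ordered=True)
out = frame.sort_values(["first", "second"]).set_index(["first", "second"])["value"]
print(out)
```

Any label missing from order would become NaN in the categorical column, so the list has to cover every value of the second level.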

Modify pandas group

I have a DataFrame, which I group.
I would like to add another column to the data frame, that is a result of function diff, per group. Something like:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df_grouped = df.groupby('B')
for name, group in df_grouped:
    new_df["D_diff"] = group["D"].diff()
For each group, I would like to get the difference of column D and end up with a DataFrame that includes a new column with the diff calculation.
IIUC you can use DataFrameGroupBy.diff:
df['D_diff'] = df.groupby('B')['D'].diff()
print (df)
A B C D D_diff
0 foo one 1.996084 0.580177 NaN
1 bar one 1.782665 0.042979 -0.537198
2 foo two -0.359840 1.952692 NaN
3 bar three -0.909853 0.119353 NaN
4 foo two -0.478386 -0.970906 -2.923598
5 bar two -1.289331 -1.245804 -0.274898
6 foo one -1.391884 -0.555056 -0.598035
7 foo three -1.270533 0.183360 0.064007
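To make the per-group behaviour easier to verify by hand, here is the same call on fixed values (the numbers in D are an assumption chosen for readability):

```python
import pandas as pd

df = pd.DataFrame({
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "D": [1.0, 3.0, 10.0, 5.0, 14.0, 11.0, 0.0, 7.0],
})

# diff() is computed within each group of B, but the result keeps the
# original row order, so it can be assigned straight back as a column.
df["D_diff"] = df.groupby("B")["D"].diff()
print(df)
```

The first row of each group has no predecessor inside that group, so it gets NaN.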

Pandas dataframe with MultiIndex: exclude level values

I have a multi-indexed pandas dataframe like the following one.
import numpy as np
import pandas as pd
arrays = [np.array(['bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']),
np.array(['blo', 'bla', 'bla', 'blo', 'blo', 'blu', 'blo', 'bla'])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df.sort_index(inplace=True)
which returns:
0 1 2 3
bar one bla 0.478461 1.030308 0.012688 0.137495
blo 0.476041 -1.679848 1.346798 0.143225
two bla 1.148882 -2.074197 -2.567959 1.258016
blo 1.062280 3.846096 -0.346636 1.170822
foo one blo -0.761327 0.262105 0.151554 1.066616
two blu 1.431951 0.043307 -0.326498 2.402536
qux one blo -0.622017 -0.566930 0.417977 -0.345238
two bla 0.129273 -0.181396 -0.758381 0.995827
Now I want to select a subset by using a slice object:
idx = pd.IndexSlice
subset = df.loc[idx[['bar'], :, :], :]
This returns:
0 1 2 3
bar one bla 0.478461 1.030308 0.012688 0.137495
blo 0.476041 -1.679848 1.346798 0.143225
two bla 1.148882 -2.074197 -2.567959 1.258016
blo 1.062280 3.846096 -0.346636 1.170822
Now I want to exclude all rows having "blo" as level value. I know that I could select everything but the 'blo' values but my real dataframe is very big and I only know the level values which should not appear in the subset.
What's the easiest way to exclude certain level values from the subset?
Thanks in advance!
IIUC, maybe you can mask your subset with:
subset = subset.iloc[subset.index.get_level_values(2) != 'blo']
You can do it this way:
In [263]:
subset.loc[subset.index.get_level_values(2) != 'blo']
Out[263]:
0 1 2 3
bar one bla -1.039335 -1.124656 0.057114 -0.284754
two bla 0.007208 -0.403559 -1.317075 -0.340171
For multiple values, I used this:
subset.iloc[~subset.index.get_level_values(2).isin(['blo'])]
This way, you can exclude several values at the same time by adding them to the list.
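A self-contained sketch of the isin approach with more than one excluded value. The integer data and the excluded list are assumptions for illustration:

```python
import numpy as np
import pandas as pd

arrays = [np.array(["bar", "bar", "bar", "bar", "foo", "foo", "qux", "qux"]),
          np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
          np.array(["blo", "bla", "bla", "blo", "blo", "blu", "blo", "bla"])]
df = pd.DataFrame(np.arange(32).reshape(8, 4), index=arrays).sort_index()

idx = pd.IndexSlice
subset = df.loc[idx[["bar"], :, :], :]

# Values of the third index level that must not appear in the result.
excluded = ["blo", "blu"]
mask = ~subset.index.get_level_values(2).isin(excluded)
filtered = subset.loc[mask]
print(filtered)
```

The mask is a plain boolean array, so it works with either .loc or .iloc.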
