Calculating aggregate values in a pandas dataframe with multiple columns - python

I have a Pandas DataFrame with multiple columns.
from itertools import product

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
print(df)
first bar baz foo qux \
second one two one two one two one
A -0.093829 -0.159939 -0.386961 -0.367417 0.625646 1.286186 0.429855
B 0.440266 0.345161 1.798363 -1.265215 0.204303 -1.492993 -1.714360
C 0.689076 -1.211060 -0.265888 0.769467 -0.706941 0.086907 -0.892892
first
second two
A -1.006210
B -0.275578
C -0.563757
I want to calculate the mean and standard deviation of each column, grouped by the upper column level. Once I have them, I want to double the columns in the lower level, appending the name of the statistic (mean or standard deviation) to the column name as "column name" + "_" + "mean/std".
group_cols = df.groupby(df.columns.get_level_values('first'), axis=1)
list_stat_dfs = []
for key, group in group_cols:
    group_descr = group.describe().loc[['mean', 'std'], :]  # Get mean and std from single site
    group_descr.loc[:, (key, 'stats')] = group_descr.index
    group_descr.loc[:, (key, 'first')] = key
    group_descr.columns = group_descr.columns.droplevel(0)  # Remove upper level column (site_name)
    group_descr = group_descr.pivot(columns='stats', index='first')  # Rows to columns
    col_prod = list(product(group_descr.columns.levels[0], group_descr.columns.levels[1]))
    cols = ['_'.join((col[0], col[1])) for col in col_prod]
    group_descr.columns = pd.MultiIndex.from_product(([key], cols))  # From multiple columns to single column
    group_descr.reset_index(inplace=True)
    list_stat_dfs.append(group_descr)
group_descr = pd.concat(list_stat_dfs, axis=1)
print(group_descr)
first bar first baz \
one_mean one_std two_mean two_std one_mean one_std
0 bar 0.507185 1.799053 -0.249692 1.41507 baz -0.147664 0.595927
first foo first \
two_mean two_std one_mean one_std two_mean two_std
0 0.160018 1.405113 foo -0.433644 1.245972 0.254995 0.846983 qux
qux
one_mean one_std two_mean two_std
0 0.667629 0.315417 -0.757989 0.683273
As you can see, I managed it with a for loop and several lines of code. Can someone do the same thing in a more optimized way? I am quite sure that with Pandas the same thing can be done in a few lines of code.

I think you need to get the mean and std of df, then concat them together and reshape with unstack:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
np.random.seed(1000)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
print(df)
first bar baz foo qux \
second one two one two one two one
A -0.804458 0.320932 -0.025483 0.644324 -0.300797 0.389475 -0.107437
B 0.595036 -0.464668 0.667281 -0.806116 -1.196070 -0.405960 -0.182377
C -0.138422 0.705692 1.271795 -0.986747 -0.334835 -0.099482 0.407192
first
second two
A -0.479983
B 0.103193
C 0.919388
df = pd.concat([df.mean(), df.std()], keys=('mean','std')).unstack(1)
df.index = [[0] * len(df.index), ['_'.join((col[1], col[0])) for col in df.index]]
df = df.unstack()
print (df)
first bar baz \
one_mean one_std two_mean two_std one_mean one_std two_mean
0 -0.115948 0.700018 0.187319 0.596511 0.637865 0.649139 -0.382846
first foo qux \
two_std one_mean one_std two_mean two_std one_mean one_std
0 0.894129 -0.610567 0.507346 -0.038656 0.401191 0.039126 0.32095
first
two_mean two_std
0 0.180866 0.702911
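The same reshaping can also be sketched with DataFrame.agg, which computes both statistics in one call. This is only an illustration under assumptions: the small fixed DataFrame and the names stats, long, new_cols and result are made up here so the numbers are easy to check by hand, and they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Small, fixed data (hypothetical) with two column levels.
columns = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'baz', 'baz'],
     ['one', 'two', 'one', 'two']],
    names=['first', 'second'])
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [3.0, 4.0, 5.0, 8.0],
                   [5.0, 6.0, 7.0, 12.0]],
                  index=['A', 'B', 'C'], columns=columns)

# Two rows ('mean', 'std'), one column per (first, second) pair.
stats = df.agg(['mean', 'std'])

# Move the statistic name next to the column labels:
# a Series indexed by (first, second, statistic).
long = stats.unstack()

# Join the lower level and the statistic name, keep the upper level.
new_cols = pd.MultiIndex.from_tuples(
    [(first, f'{second}_{stat}') for first, second, stat in long.index],
    names=['first', None])

# Single-row result with columns like ('bar', 'one_mean').
result = pd.DataFrame([long.to_numpy()], columns=new_cols)
print(result)
```

Compared with the loop in the question, this avoids the per-group pivot. The column order follows the original columns, with mean before std for each one.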

Related

Extract max of a multiindex pandas dataframe with strings and NaN

I've got the following multiindex dataframe:
first bar baz foo
second one two one two one two
first second
bar one NaN -0.056213 0.988634 0.103149 1.5858 -0.101334
two -0.47464 -0.010561 2.679586 -0.080154 <LQ -0.422063
baz one <LQ 0.220080 1.495349 0.302883 -0.205234 0.781887
two 0.638597 0.276678 -0.408217 -0.083598 -1.15187 -1.724097
foo one 0.275549 -1.088070 0.259929 -0.782472 -1.1825 -1.346999
two 0.857858 0.783795 -0.655590 -1.969776 -0.964557 -0.220568
I would like to extract the max along one level. Expected result:
first bar baz foo
second
one 0.275549 1.495349 1.5858
two 0.857858 2.679586 -0.964557
Here is what I tried:
df.xs('one', level=1, axis = 1).max(axis=0, level=1, skipna = True, numeric_only = False)
And the obtained result:
first baz
second
one 1.495349
two 2.679586
How do I get Pandas to not ignore the whole column if one cell contains a string?
(The DataFrame was created like this:)
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
# Use a single .loc call (row label, column label) instead of chained assignment
df.loc[('bar', 'one'), ('bar', 'one')] = np.nan
df.loc[('baz', 'one'), ('bar', 'one')] = '<LQ'
df.loc[('bar', 'two'), ('foo', 'one')] = '<LQ'
I guess you would need to convert the non-numeric values to NaN first:
(df.xs('one', level=1, axis=1)
.apply(pd.to_numeric, errors='coerce')
.max(level=1,skipna=True)
)
Output (with np.random.seed(1)):
first bar baz foo
second
one 0.900856 1.133769 0.865408
two 1.744812 0.319039 0.901591
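Note that the level argument of max was deprecated in later pandas releases and is no longer available in pandas 2.0, where the usual spelling is groupby(level=...).max(). A minimal sketch of that variant, assuming a small hand-made frame (the column names x and y and the values are hypothetical, chosen so the result can be checked by eye):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])

# Object-dtype columns mixing numbers, a missing value and the string '<LQ'.
df = pd.DataFrame({'x': [1.0, '<LQ', np.nan, 4.0],
                   'y': [0.5, 2.5, 1.5, '<LQ']}, index=index)

# Turn everything that is not a number into NaN, column by column.
numeric = df.apply(pd.to_numeric, errors='coerce')

# Max per value of the 'second' index level; NaN is skipped by default.
result = numeric.groupby(level='second').max()
print(result)
```

Here the '<LQ' cells are simply treated as missing, so a column is no longer dropped because of a single string.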

How to sort grouped multi-index pandas series by index level and values?

I have a pandas series:
import numpy as np
import pandas as pd
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
s
Out[3]:
first second
bar one -1.111475
two -0.644368
baz one 0.027621
two 0.130411
foo one -0.942718
two -1.335731
qux one 1.277417
two -0.242090
dtype: float64
How to sort this series by values within each group?
For example, the qux group should have row two (-0.242090) first, and then row one (1.277417).
Group bar is already sorted correctly because -1.111475 is lower than -0.644368.
I need something like s.groupby(level=0).sort_values().
Use reset_index, then sort_values on the first level and the values:
np.random.seed(0)
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
s = (s.reset_index(name='value')
.sort_values(['first', 'value'])
.set_index(['first', 'second'])['value'])
s.name = None
print(s)
first second
bar two 0.400157
one 1.764052
baz one 0.978738
two 2.240893
foo two -0.977278
one 1.867558
qux two -0.151357
one 0.950088
dtype: float64
You can use np.lexsort to sort first by your first index level, and second by values.
np.random.seed(0)
s = pd.Series(np.random.randn(8), index=index)
s = s.iloc[np.lexsort((s.values, s.index.get_level_values(0)))]
print(s)
# first second
# bar two 0.400157
# one 1.764052
# baz one 0.978738
# two 2.240893
# foo two -0.977278
# one 1.867558
# qux two -0.151357
# one 0.950088
# dtype: float64
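Closer to the s.groupby(level=0).sort_values() spelling from the question, the per-group sort can also be sketched with groupby and apply. The tiny Series below is made up for illustration, and group_keys=False is used on the assumption that the group label should not be added as an extra index level:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('qux', 'one'), ('qux', 'two')],
    names=['first', 'second'])
s = pd.Series([-1.1, -0.6, 1.3, -0.2], index=index)

# Sort the values inside each 'first' group; the groups keep their order.
result = (s.groupby(level='first', group_keys=False)
           .apply(lambda g: g.sort_values()))
print(result)
```

This is likely slower than the lexsort approach on large data, since it calls Python code once per group, but it reads closest to the intent.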

How to get pandas MultiIndex's values in the form of a list?

I have a pandas DataFrame with a MultiIndex. I want to get a list which includes MultiIndex level0 and level1, like this: [level0, [level1-1, level1-2, (...)]].
For example:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
df = pd.DataFrame(np.random.randn(8), index=arrays,columns=['values'])
df
out:
values
bar one 2.171200
two -0.665047
baz one 0.474036
two 0.082408
foo one 1.820585
two 0.698537
qux one 1.163479
two 0.129044
I want to output a dataframe like this:
output
bar ['one','two']
baz ['one','two']
foo ['one','two']
qux ['one','two']
How? Thanks a lot.
Use reset_index with groupby and list:
df1 = (df.reset_index()
.groupby('level_0')['level_1']
.apply(list)
.rename_axis(None)
.to_frame('output'))
Or MultiIndex.to_frame (new in pandas 0.20.0+):
df1 = df.index.to_frame().groupby(0)[1].apply(list).rename_axis(None).to_frame('output')
print (df1)
output
bar [one, two]
baz [one, two]
foo [one, two]
qux [one, two]
You can feed to the pd.DataFrame constructor and then use groupby:
res = pd.DataFrame(df.index.values.tolist(), columns=['idx1', 'idx2'])\
.groupby('idx1')['idx2'].apply(list)
print(res)
idx1
bar [one, two]
baz [one, two]
foo [one, two]
qux [one, two]
Name: idx2, dtype: object

Pandas Custom Sort Row in Multiindex

Given the following:
import numpy as np
import pandas as pd
arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz'],
['total', 'two', 'one', 'two', 'four', 'total', 'five']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(7), index=index)
s
first second
bar total 0.334158
two -0.267854
one 1.161727
baz two -0.748685
four -0.888634
total 0.383310
five 0.506120
dtype: float64
How do I ensure that the 'total' rows (per the second index) are always at the bottom of each group like this?:
first second
bar one 0.210911
two 0.628357
total -0.911331
baz two 0.315396
four -0.195451
five 0.060159
total 0.638313
dtype: float64
solution 1
I'm not happy with this. I'm working on a different solution
unstacked = s.unstack(0)
total = unstacked.loc['total']
unstacked.drop('total').append(total).unstack().dropna()
first second
bar one 1.682996
two 0.343783
total 1.287503
baz five 0.360170
four 1.113498
two 0.083691
total -0.377132
dtype: float64
solution 2
I feel better about this one
second = pd.Categorical(
s.index.levels[1].values,
categories=['one', 'two', 'three', 'four', 'five', 'total'],
ordered=True
)
s.index.set_levels(second, level='second', inplace=True)
cols = s.index.names
s.reset_index().sort_values(cols).set_index(cols)
0
first second
bar one 1.682996
two 0.343783
total 1.287503
baz two 0.083691
four 1.113498
five 0.360170
total -0.377132
Use unstack to create a DataFrame whose columns are the second level of the MultiIndex, then reorder the columns so that total is the last column, and finally convert the columns to an ordered CategoricalIndex.
After stacking back, total is then the last entry in each group.
np.random.seed(123)
arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz'],
['total', 'two', 'one', 'two', 'four', 'total', 'five']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(7), index=index)
print (s)
first second
bar total -1.085631
two 0.997345
one 0.282978
baz two -1.506295
four -0.578600
total 1.651437
five -2.426679
dtype: float64
df = s.unstack()
df = df[df.columns[df.columns != 'total'].tolist() + ['total']]
df.columns = pd.CategoricalIndex(df.columns, ordered=True)
print (df)
second five four one two total
first
bar NaN NaN 0.282978 0.997345 -1.085631
baz -2.426679 -0.5786 NaN -1.506295 1.651437
s1 = df.stack()
print (s1)
first second
bar one 0.282978
two 0.997345
total -1.085631
baz five -2.426679
four -0.578600
two -1.506295
total 1.651437
dtype: float64
print (s1.sort_index())
first second
bar one 0.282978
two 0.997345
total -1.085631
baz five -2.426679
four -0.578600
two -1.506295
total 1.651437
dtype: float64
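Series.append and the inplace argument of set_levels used in the earlier solutions are no longer available in pandas 2.0, so here is a minimal sketch that avoids both by sorting on a helper flag. The column names value, is_total and pos are hypothetical helpers, and the sketch assumes the non-total rows should keep their original relative order:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz'],
     ['total', 'two', 'one', 'two', 'four', 'total', 'five']],
    names=['first', 'second'])
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], index=index)

frame = s.reset_index(name='value')
frame['is_total'] = frame['second'].eq('total')  # False sorts before True
frame['pos'] = range(len(frame))                 # original position as tie-breaker

# Sort by group, then push 'total' last, then keep the original order.
result = (frame.sort_values(['first', 'is_total', 'pos'])
               .set_index(['first', 'second'])['value'])
result.name = None
print(result)
```

If the other labels should follow a fixed order instead (one, two, four, five), the ordered categorical from the earlier solutions is the better fit.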

Modify pandas group

I have a DataFrame, which I group.
I would like to add another column to the data frame, that is a result of function diff, per group. Something like:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df_grouped = df.groupby('B')
for name, group in df_grouped:
    new_df["D_diff"] = group["D"].diff()
For each group, I would like to get the difference of column D, and end up with a DataFrame that includes a new column with the diff calculation.
IIUC, you can use the groupby diff method (SeriesGroupBy.diff here, since a single column is selected):
df['D_diff'] = df.groupby('B')['D'].diff()
print (df)
A B C D D_diff
0 foo one 1.996084 0.580177 NaN
1 bar one 1.782665 0.042979 -0.537198
2 foo two -0.359840 1.952692 NaN
3 bar three -0.909853 0.119353 NaN
4 foo two -0.478386 -0.970906 -2.923598
5 bar two -1.289331 -1.245804 -0.274898
6 foo one -1.391884 -0.555056 -0.598035
7 foo three -1.270533 0.183360 0.064007
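To make the per-group behaviour easier to verify by hand, here is the same call on fixed numbers. The values in D are made up for illustration. Each row gets the difference to the previous row of the same B group, and the first row of every group is NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'D': [1.0, 3.0, 10.0, 5.0, 4.0, 7.0, 2.0, 5.5],
})

# Difference to the previous row within each group of B;
# the result is aligned with the original row order.
df['D_diff'] = df.groupby('B')['D'].diff()
print(df)
```

Because the result keeps the original index, it can be assigned straight back as a new column without any loop over the groups.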
