Conditional summing of columns in pandas - python

I have the following DataFrame in pandas:
Student-ID Last-name First-name HW1 HW2 HW3 HW4 HW5 M1 M2 Final
59118211 Alf Brian 96 90 88 93 96 78 60 59.0
59260567 Anderson Jill 73 83 96 80 84 80 52 42.5
59402923 Archangel Michael 99 80 60 94 98 41 56 0.0
59545279 Astor John 93 88 97 100 55 53 53 88.9
59687635 Attach Zach 69 75 61 65 91 90 63 69.0
I want to add only those columns which have "HW" in them. Any suggestions on how I can do that?
Note: The number of columns containing HW may differ. So I can't reference them directly.

You could call df.filter(regex='HW') to select the columns whose names contain 'HW' and then sum row-wise via sum(axis=1):
In [23]: df
Out[23]:
StudentID Lastname Firstname HW1 HW2 HW3 HW4 HW5 HW6 HW7 M1
0 59118211 Alf Brian 96 90 88 93 96 97 88 10
1 59260567 Anderson Jill 73 83 96 80 84 99 80 100
2 59402923 Archangel Michael 99 80 60 94 98 73 97 50
3 59545279 Astor John 93 88 97 100 55 96 86 60
4 59687635 Attach Zach 69 75 61 65 91 89 82 55
5 59829991 Bake Jake 56 0 77 78 0 79 0 10
In [24]: df.filter(regex='HW').sum(axis=1)
Out[24]:
0 648
1 595
2 601
3 615
4 532
5 290
dtype: int64

John's solution - using df.filter() - is more elegant, but you could also consider a list comprehension ...
df[[x for x in df.columns if 'HW' in x]].sum(axis=1)
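As a minimal sketch of the same idea, the regex can be anchored so that only names of the form HW followed by digits are summed. The small frame and the HW_total column name below are made up for illustration. Anchoring matters once a total column is added, because an unanchored 'HW' pattern would also match HW_total on a second run.

```python
import pandas as pd

# Hypothetical, shortened version of the gradebook from the question.
df = pd.DataFrame({
    "Student-ID": [59118211, 59260567],
    "HW1": [96, 73],
    "HW2": [90, 83],
    "M1": [78, 80],
    "Final": [59.0, 42.5],
})

# Select only columns named HW<digits>, however many there are, and sum each row.
hw_total = df.filter(regex=r"^HW\d+$").sum(axis=1)

# Store the result; the anchored pattern will not pick this column up later.
df["HW_total"] = hw_total
print(df)
```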

Related

set and reset index in pandas dataframe not working

import numpy as np
import pandas as pd
np.random.seed(121)
randArr =np.random.randint(0,100,20).reshape(5,4)
df =pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
df.index.name='RollNo'
print(df)
print("")
df.reset_index()
print(df)
print("")
df.set_index('PDS')
print(df)
print("")
Output (not as expected):
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
You need to assign the result back:
df = df.reset_index()
df = df.set_index('PDS')
Or you can use the inplace argument:
df.reset_index(inplace=True)
df.set_index('PDS', inplace=True)
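The behaviour can be checked directly with a short sketch: reset_index and set_index return new objects and leave the original frame untouched unless the result is assigned. The tiny frame below is invented for the example and reuses the column names from the question.

```python
import numpy as np
import pandas as pd

# Small stand-in for the frame in the question.
df = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=pd.Index([101, 102], name="RollNo"),
    columns=["PDS", "Algo", "SE", "INS"],
)

snapshot = df.copy()
result = df.reset_index()  # returns a new DataFrame

# The original still has RollNo as its index; only `result` changed.
print(df.equals(snapshot))
print(result.columns.tolist())

# Chaining the calls avoids both intermediate variables and inplace=True.
by_pds = df.reset_index().set_index("PDS")
print(by_pds.index.name)
```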

Using apply to the pandas group object with original function

I have a multi-index df and I want to add a new column by applying an operation:
class weight height time
A 45 150 85
50 160 80
55 155 74
B 78 180 90
51 158 65
40 155 68
C 80 185 90
86 175 81
52 162 73
def operation(col):
    concat = ''
    for i in col:
        concat += str(i)
    return concat
and the result df should look like
df['new'] = df.groupby(level=0)['height'].apply(operation)
class weight height time new
A 45 150 85 150160155
50 160 80
55 155 74
B 78 180 90 180158155
51 158 65
40 155 68
C 80 185 90 185175162
86 175 81
52 162 73
However, the resulting df actually has NaN in the new column. What am I doing wrong?
If I understand correctly, use transform instead of apply:
df['new'] = df.groupby(level=0)['height'].transform(operation)
Output:
height time new
class weight
A 45 150 85 150160155
50 160 80 150160155
55 155 74 150160155
B 78 180 90 180158155
51 158 65 180158155
40 155 68 180158155
C 80 185 90 185175162
86 175 81 185175162
52 162 73 185175162
OR
df['new'] = df.groupby(level=0)['height'].transform(operation).drop_duplicates()
Output:
height time new
class weight
A 45 150 85 150160155
50 160 80 NaN
55 155 74 NaN
B 78 180 90 180158155
51 158 65 NaN
40 155 68 NaN
C 80 185 90 185175162
86 175 81 NaN
52 162 73 NaN
#Concat height in each class, put it in a dict and map it back to the column class
df['new']=df['class'].map(df.groupby('class').height.apply(lambda x: x.astype(str).str.cat()).to_dict())
#Keep the first occurrence of each concatenated string and blank out the repeats with np.where
df['new']=np.where(~df['new'].duplicated(keep='first'),df['new'],'')
print(df)
class weight height time new
0 A 45 150 85 150160155
1 A 50 160 80
2 A 55 155 74
3 B 78 180 90 180158155
4 B 51 158 65
5 B 40 155 68
6 C 80 185 90 185175162
7 C 86 175 81
8 C 52 162 73
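A short sketch of the difference between the two calls, on a cut-down frame invented for the example: apply returns one value per group, indexed by class only, while transform returns a result with the same index as the original frame. That mismatch in index is presumably why the assignment from apply could not be aligned row by row.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("A", 45), ("A", 50), ("B", 78), ("B", 51)], names=["class", "weight"]
)
df = pd.DataFrame({"height": [150, 160, 180, 158]}, index=index)

def operation(col):
    # Concatenate the values of one group into a single string.
    return "".join(str(v) for v in col)

grouped = df.groupby(level=0)["height"]

per_group = grouped.apply(operation)    # one row per class
per_row = grouped.transform(operation)  # one row per original row

print(per_group.index.tolist())
print(per_row.index.equals(df.index))

# Because per_row shares df's index, it can be assigned directly.
df["new"] = per_row
print(df)
```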

From Matlab to Python Code [z,index]=sort(abs(z));

I am trying to convert code from MATLAB to Python.
Can you please help me convert it?
In the MATLAB code, z is a list of length 121:
z= 7.0502 5.8030 4.4657 3.0404 1.5416 0 -1.5416 -3.0404 -4.4657
-5.8030 -7.0502 7.5944 6.3059 4.8990 3.3662 1.7189 0 -1.7189 -3.3662 -4.8990 -6.3059 -7.5944 8.2427 6.9282 5.4611 3.8122 1.9735 0 -1.9735 -3.8122 -5.4611 -6.9282 -8.2427 9.0135 7.7027 6.2075 4.4590 2.3803 0 -2.3803 -4.4590 -6.2075 -7.7027 -9.0135 9.9185 8.6576 7.2038 5.4466 3.1530 0 -3.1530 -5.4466 -7.2038 -8.6576 -9.9185 10.9545 9.7980 8.4853 6.9282 4.8990 0 -4.8990 -6.9282 -8.4853 -9.7980 -10.9545 12.0986 11.0885 9.9947 8.8128 7.6119 -6.9282 -7.6119 -8.8128 -9.9947 -11.0885 -12.0986 13.3133 12.4632 11.5988 10.7649 10.0829 -9.7980 -10.0829 -10.7649 -11.5988 -12.4632 -13.3133 14.5583 13.8564 13.1842 12.5910 12.1612 -12.0000 -12.1612 -12.5910 -13.1842 -13.8564 -14.5583 15.8011 15.2238 14.6969 14.2594 13.9626 -13.8564 -13.9626 -14.2594 -14.6969 -15.2238 -15.8011 17.0207 16.5431 16.1227 15.7875 15.5684 -15.4919 -15.5684 -15.7875 -16.1227 -16.5431 -17.0207
Matlab code : [z,index]=sort(abs(z));
after the code
z = 0 0 0 0 0 0 1.5416 1.5416 1.7189 1.7189 1.9735 1.9735 2.3803 2.3803 3.0404 3.0404 3.1530 3.1530 3.3662 3.3662 3.8122 3.8122 4.4590 4.4590 4.4657 4.4657 4.8990 4.8990 4.8990 4.8990 5.4466 5.4466 5.4611 5.4611 5.8030 5.8030 6.2075 6.2075 6.3059 6.3059 6.9282 6.9282 6.9282 6.9282 6.9282 7.0502 7.0502 7.2038 7.2038 7.5944 7.5944 7.6119 7.6119 7.7027 7.7027 8.2427 8.2427 8.4853 8.4853 8.6576 8.6576 8.8128 8.8128 9.0135 9.0135 9.7980 9.7980 9.7980 9.9185 9.9185 9.9947 9.9947 10.0829 10.0829 10.7649 10.7649 10.9545 10.9545 11.0885 11.0885 11.5988 11.5988 12.0000 12.0986 12.0986 12.1612 12.1612 12.4632 12.4632 12.5910
12.5910 13.1842 13.1842 13.3133 13.3133 13.8564 13.8564 13.8564 13.9626 13.9626 14.2594 14.2594 14.5583 14.5583 14.6969 14.6969 15.2238 15.2238 15.4919 15.5684 15.5684 15.7875 15.7875 15.8011 15.8011 16.1227 16.1227 16.5431 16.5431 17.0207 17.0207
and index is
index = 6 17 28 39 50 61 5 7 16 18 27 29 38 40 4 8 49 51 15 19 26 30 37 41 3 9 14 20 60 62 48 52 25 31 2 10 36 42 13 21 24 32 59 63 72 1 11 47 53 12 22 71 73 35 43 23 33 58 64 46 54 70 74 34 44 57 65 83 45 55 69 75 82 84 81 85 56 66 68 76 80 86 94 67 77 93 95 79 87 92 96 91 97 78 88 90 98 105 104 106 103 107 89 99 102 108 101 109 116 115 117 114 118 100 110 113 119 112 120 111 121
So what is the equivalent of [z,index] in Python?
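Do you need to return the index? If you don't, you could use:
new_list = sorted(map(abs, z))
If you do, sort the positions by absolute value as well:
abs_z = [abs(v) for v in z]
new_list = sorted(abs_z)
index = sorted(range(len(abs_z)), key=lambda k: abs_z[k])
where new_list is the sorted output, z is the original list, and index holds 0-based positions (MATLAB's indices are 1-based, so add 1 to compare).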
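If NumPy is available, the same result can be sketched with argsort. This is an alternative to the sorted()-based answer above, not part of it, and the short z array is a hand-picked subset of the values in the question. A stable sort is requested so that equal absolute values keep their original order, which is how the ties appear in the MATLAB index shown in the question.

```python
import numpy as np

# A few values from the question, including ties in absolute value.
z = np.array([7.0502, 5.8030, 0.0, -1.5416, -5.8030, 1.5416])

abs_z = np.abs(z)

# 0-based positions that sort abs_z; kind="stable" keeps ties in input order.
order = np.argsort(abs_z, kind="stable")

sorted_abs = abs_z[order]   # plays the role of the new z in MATLAB
matlab_index = order + 1    # shift to 1-based indices for comparison

print(sorted_abs)
print(matlab_index)
```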

Preserving multindex column structure after performing a groupby summation

I have a three-level multiindex column. At the lowest level, I want to add a subtotal column.
So in the example here, I would expect a new column zone: day, person:dave, find:'subtotal' with value = 49+27+63=138. similarly for all the other combinations of zone and person.
cols = pd.MultiIndex.from_product([['day', 'night'], ['dave', 'matt', 'mike'], ['gems', 'rocks', 'paper']])
rows = pd.date_range(start='20191201', periods=5, freq="d")
data = np.random.randint(0, high=100,size=(len(rows), len(cols)))
xf = pd.DataFrame(data, index=rows, columns=cols)
xf.columns.names = ['zone', 'person', 'find']
I can generate the correct subtotal data with xf.groupby(level=[0,1], axis="columns").sum() but then I lose the find level of the columns; it only leaves the zone and person levels. I need that third level, with the label subtotal, so that I can join the result back with the original xf dataframe. But I cannot figure out a nice pythonic way to add a third level back into the MultiIndex.
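You can sum over the first two column levels first and then use MultiIndex.from_product to add the new level:
# On pandas < 2.0 this could be written as xf.sum(level=[0,1], axis="columns")
df = xf.T.groupby(level=[0, 1]).sum().T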
df.columns = pd.MultiIndex.from_product(df.columns.levels + [['subtotal']])
print (df)
day night
dave matt mike dave matt mike
subtotal subtotal subtotal subtotal subtotal subtotal
2019-12-01 85 99 163 210 93 252
2019-12-02 38 113 101 211 110 135
2019-12-03 145 75 122 181 165 176
2019-12-04 220 184 173 179 134 192
2019-12-05 126 77 29 184 178 199
And then join together by concat with DataFrame.sort_index:
df = pd.concat([xf, df], axis=1).sort_index(axis=1)
print (df)
zone day \
person dave matt mike
find gems paper rocks subtotal gems paper rocks subtotal gems paper
2019-12-01 33 96 24 153 34 89 90 213 15 51
2019-12-02 74 48 61 183 94 83 2 179 75 4
2019-12-03 88 85 51 224 65 3 52 120 95 80
2019-12-04 43 28 60 131 43 14 77 134 88 54
2019-12-05 41 72 44 157 63 77 37 177 8 66
zone ... night \
person ... dave matt mike
find ... rocks subtotal gems paper rocks subtotal gems paper rocks
2019-12-01 ... 24 102 19 49 4 72 43 57 92
2019-12-02 ... 90 206 96 55 92 243 75 58 68
2019-12-03 ... 29 182 11 90 85 186 9 20 46
2019-12-04 ... 30 84 25 55 89 169 98 41 85
2019-12-05 ... 73 167 52 90 49 191 51 80 37
zone
person
find subtotal
2019-12-01 192
2019-12-02 201
2019-12-03 75
2019-12-04 224
2019-12-05 168
[5 rows x 24 columns]
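The whole round trip can be sketched end to end as follows. This is one possible variant rather than the answer's exact code: it uses a smaller, seeded frame invented for the example, groups the transposed frame so that it does not depend on the level argument of sum, and builds the new columns with MultiIndex.from_tuples so that the level names are carried over.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["day", "night"], ["dave", "matt"], ["gems", "rocks"]],
    names=["zone", "person", "find"],
)
rows = pd.date_range(start="20191201", periods=3, freq="D")
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
xf = pd.DataFrame(
    rng.integers(0, 100, size=(len(rows), len(cols))), index=rows, columns=cols
)

# Sum over the 'find' level by grouping the transposed frame on zone and person.
sub = xf.T.groupby(level=["zone", "person"]).sum().T

# Add the third level back: every subtotal column gets the label 'subtotal'.
sub.columns = pd.MultiIndex.from_tuples(
    [(zone, person, "subtotal") for zone, person in sub.columns],
    names=xf.columns.names,
)

# Join with the original data and sort so each subtotal sits next to its group.
out = pd.concat([xf, sub], axis=1).sort_index(axis=1)
print(out)
```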

Monthly Climatology for Pandas DataFrame with MultiIndex

I have a DataFrame with two years of monthly data Y. I need the second column Y_avg with the climatology to be able to subtract both.
Y Y_avg
T X
2000-01-31 1 51 63
2 52 64
2000-02-29 1 53 65
2 54 66
2000-03-31 1 55 67
2 56 68
2000-04-30 1 57 69
2 58 70
2000-05-31 1 59 71
2 60 72
2000-06-30 1 61 73
2 62 74
2000-07-31 1 63 75
2 64 76
2000-08-31 1 65 77
2 66 78
2000-09-30 1 67 79
2 68 80
2000-10-31 1 69 81
2 70 82
2000-11-30 1 71 83
2 72 84
2000-12-31 1 73 85
2 74 86
2001-01-31 1 75 63
2 76 64
2001-02-28 1 77 65
2 78 66
2001-03-31 1 79 67
2 80 68
2001-04-30 1 81 69
2 82 70
2001-05-31 1 83 71
2 84 72
2001-06-30 1 85 73
2 86 74
2001-07-31 1 87 75
2 88 76
2001-08-31 1 89 77
2 90 78
2001-09-30 1 91 79
2 92 80
2001-10-31 1 93 81
2 94 82
2001-11-30 1 95 83
2 96 84
2001-12-31 1 97 85
2 98 86
This is my temporary solution:
f = np.tile(np.arange(1,25),2)
df['Y_avg'] = np.tile(df.groupby(f).mean().values.ravel(),2)
But how can I do that more efficiently?
Thanks for the help!
So you want the Y_avg to be the mean by X and the month of T, right? Assuming the T level of your MultiIndex is a DatetimeIndex, use
gb = df['Y'].groupby([df.index.get_level_values(0).month,
                      pd.Grouper(level=1)])
df['Y_avg'] = gb.transform('mean')
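A minimal, self-contained sketch of the grouping by month and X, using only the January and February rows from the question. The Y_anom column name is an assumption added here to show the subtraction that the question mentions. The grouping keys are passed as plain arrays rather than through pd.Grouper, which is a small variation on the code above.

```python
import pandas as pd

dates = pd.to_datetime(["2000-01-31", "2000-02-29", "2001-01-31", "2001-02-28"])
index = pd.MultiIndex.from_product([dates, [1, 2]], names=["T", "X"])
df = pd.DataFrame({"Y": [51, 52, 53, 54, 75, 76, 77, 78]}, index=index)

# Grouping keys: calendar month of T and the value of X, one entry per row.
month = df.index.get_level_values("T").month.to_numpy()
x = df.index.get_level_values("X").to_numpy()

# transform broadcasts each group's mean back onto the rows of that group.
df["Y_avg"] = df["Y"].groupby([month, x]).transform("mean")

# Anomaly relative to the monthly climatology (hypothetical column name).
df["Y_anom"] = df["Y"] - df["Y_avg"]
print(df)
```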
First of all, I had a hard time recreating the DataFrame by copy-pasting the data, so
for anyone who wants to answer the question, you can recreate the example with the following code:
import pandas as pd
# Create a date range, convert to list and duplicate
T = pd.date_range("2000-01-31", "2001-12-31", freq="M").tolist() * 2
# Create a list of repeated [1, 2] to match length of T
X = [1, 2] * (len(T) // 2)
Y = range(51, 99)
index = pd.MultiIndex.from_arrays([sorted(T), X], names=("T", "X"))
df = pd.DataFrame({"Y": Y}, index=index)
Then, to calculate the mean of Y with respect to level T, you can use the following code:
Y_avg = df.Y.groupby(level="T").mean()
df = df.join(Y_avg, on="T", rsuffix="_avg")
First, you calculate the mean for each value of the T index level by grouping on that level (older pandas versions also accepted df.Y.mean(level="T"), which was removed in pandas 2.0). Then you can perform a standard DataFrame join to merge the Y_avg series with the DataFrame on the "T" index. Please note that you must provide a suffix (rsuffix in this case) to deal properly with the clashing column names.
