set and reset index in pandas dataframe not working - python

import numpy as np
import pandas as pd
np.random.seed(121)
randArr =np.random.randint(0,100,20).reshape(5,4)
df =pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
df.index.name='RollNo'
print(df)
print("")
df.reset_index()
print(df)
print("")
df.set_index('PDS')
print(df)
print("")
Output:(not coming as expected)
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39

Neither method modifies df in place by default; each returns a new DataFrame, so you need to assign the result back:
df = df.reset_index()
df = df.set_index('PDS')
Or you can use the inplace argument:
df.reset_index(inplace=True)
df.set_index('PDS', inplace=True)
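A minimal sketch of the difference, using a small made-up two-row frame rather than the random data from the question (the variable names index_name_before and result are illustrative only):

```python
import pandas as pd

# Small stand-in for the question's DataFrame (values are illustrative).
df = pd.DataFrame(
    {"PDS": [66, 65], "Algo": [85, 52]},
    index=pd.Index([101, 102], name="RollNo"),
)

# The return value is discarded here, so df itself is unchanged.
df.reset_index()
index_name_before = df.index.name

# Keeping the returned DataFrame makes the change visible.
result = df.reset_index().set_index("PDS")

print(index_name_before)
print(result)
```

After the last step, RollNo is an ordinary column and PDS is the index of result, while df still has its original RollNo index.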

How to plot top ten for each column python

I want to plot the top 10 countries (displayed in rows index) according to the values taken for each column.
The columns are the features with which I evaluate the countries : "feature1","feature2","feature3","feature4","feature5","feature6","feature7","feature8","feature9","feature10".
So this would be 10 graphs with top 10 ranking. Then I want the "global" top 10 that is taking into account all columns (let's say each column has the same coefficient).
I was thinking about making a new df that shows the most recurring country in these top 10 dfs (the country that appears the most in the "top 10") but don't know how to.
I am struggling, so I started by creating new dataframes from the original large dataset named "data_etude", of which I made a copy named "data_etude_copy".
For each new dataframe "data_ind" I added a new column, to show top 10 based on each feature/column I am analysing (The columns are the features and the rows are the values taken by the countries).
Then I wrote a script to create from these dataframes another dataframe that shows only the top 10 ranking, the values and the parameter. I am aware that this is quite laborious, and as a beginner I didn't manage to make a loop from this...
the original dataset:
data_etude_copy = data_etude.copy()
dataframes of top 10 countries for each feature (but should do a loop, this is so laborious):
data_ind1 = data_etude_copy.sort_values(by=['feature1'], ascending=False).head(10)
data_ind2 = data_etude_copy.sort_values(by=['feature2'], ascending=False).head(10)
data_ind3 = data_etude_copy.sort_values(by=['feature3'], ascending=False).head(10)
data_ind4 = data_etude_copy.sort_values(by=['feature4'], ascending=False).head(10)
data_ind5 = data_etude_copy.sort_values(by=['feature5'], ascending=False).head(10)
data_ind6 = data_etude_copy.sort_values(by=['feature6'], ascending=False).head(10)
data_ind7 = data_etude_copy.sort_values(by=['feature7'], ascending=False).head(10)
data_ind8 = data_etude_copy.sort_values(by=['feature8'], ascending=False).head(10)
data_ind9 = data_etude_copy.sort_values(by=['feature9'], ascending=False).head(10)
data_ind10 = data_etude_copy.sort_values(by=['feature10'], ascending=False).head(10)
and simplified dfs with top 10 for each feature (i need a loop I know...)
data_ind1.drop(data_ind1.loc[:,data_ind1.columns!="feature1"], inplace=True, axis = 1)
data_ind2.drop(data_ind2.loc[:,data_ind2.columns!="feature2"], inplace=True, axis = 1)
data_ind3.drop(data_ind3.loc[:,data_ind3.columns!="feature3"], inplace=True, axis = 1)
data_ind4.drop(data_ind4.loc[:,data_ind4.columns!="feature4"], inplace=True, axis = 1)
data_ind5.drop(data_ind5.loc[:,data_ind5.columns!="feature5"], inplace=True, axis = 1)
data_ind6.drop(data_ind6.loc[:,data_ind6.columns!="feature6"], inplace=True, axis = 1)
data_ind7.drop(data_ind7.loc[:,data_ind7.columns!="feature7"], inplace=True, axis = 1)
data_ind8.drop(data_ind8.loc[:,data_ind8.columns!="feature8"], inplace=True, axis = 1)
data_ind9.drop(data_ind9.loc[:,data_ind9.columns!="feature9"], inplace=True, axis = 1)
data_ind10.drop(data_ind10.loc[:,data_ind10.columns!="feature10"], inplace=True, axis = 1)
How could I make this into a loop and plot the aimed result? That is to say:
-plotting the top 10 countries for each feature
-then a final "top 10 countries" taking into account all 10 features (either with the countries that appear the most in each df, or the countries with the best ranking if all features have the same coefficient value)?
I think this is what you're asking for? I put your code into a for loop form and added code for ranking the countries overall. The overall ranking is based on all features, not just the top 10 lists but if you'd like it the other way then just switch the order of the commented blocks in the first for loop. I also wasn't sure how you wanted to display it so currently it just prints the final dataframe. It's probably not the cleanest code ever but I hope it helps!
import pandas as pd
import numpy as np
data = np.random.randint(100,size=(12,10))
countries = [
'Country1',
'Country2',
'Country3',
'Country4',
'Country5',
'Country6',
'Country7',
'Country8',
'Country9',
'Country10',
'Country11',
'Country12',
]
feature_names_weights = {
'feature1' :1.0,
'feature2' :1.0,
'feature3' :1.0,
'feature4' :1.0,
'feature5' :1.0,
'feature6' :1.0,
'feature7' :1.0,
'feature8' :1.0,
'feature9' :1.0,
'feature10' :1.0,
}
feature_names = list(feature_names_weights.keys())
df = pd.DataFrame(data=data, index=countries, columns=feature_names)
data_etude_copy = df
data_sorted_by_feature = {}
country_scores = (pd.DataFrame(data=np.zeros(len(countries)),index=countries))[0]
for feature in feature_names:
    #Adds to each country's score and multiplies by weight factor for each feature
    for country in countries:
        country_scores[country] += data_etude_copy[feature][country]*(feature_names_weights[feature])
    #Sorts the countries by feature (your code in loop form)
    data_sorted_by_feature[feature] = data_etude_copy.sort_values(by=[feature], ascending=False).head(10)
    data_sorted_by_feature[feature].drop(data_sorted_by_feature[feature].loc[:,data_sorted_by_feature[feature].columns!=feature], inplace=True, axis = 1)
#sort country total scores
ranked_countries = country_scores.sort_values(ascending=False).head(10)
##Put everything into one DataFrame
#Create empty DataFrame
empty_data=np.empty((10,11),str)
outputDF = pd.DataFrame(data=empty_data,columns=((feature_names)+['Overall']))
#Add entries for all features
for feature in feature_names:
    for index in range(10):
        country = list(data_sorted_by_feature[feature].index)[index]
        #.loc avoids chained assignment, which may not write through in recent pandas versions
        outputDF.loc[index, feature] = f'{country}: {data_sorted_by_feature[feature][feature][country]}'
#Add column for overall country score
for index in range(10):
    country = list(ranked_countries.index)[index]
    outputDF.loc[index, 'Overall'] = f'{country}: {ranked_countries[country]}'
#Print DataFrame
print(outputDF)
Example data in:
feature1 feature2 feature3 feature4 feature5 feature6 feature7 feature8 feature9 feature10
Country1 40 31 5 6 4 67 65 57 52 96
Country2 93 20 41 65 44 21 91 25 43 75
Country3 93 34 87 69 0 25 65 71 17 91
Country4 24 20 41 68 46 1 94 87 11 97
Country5 90 21 93 0 72 20 44 87 16 42
Country6 93 17 33 40 96 53 1 97 51 20
Country7 82 50 34 27 44 38 49 85 7 70
Country8 33 81 14 5 72 13 13 53 39 47
Country9 18 38 20 32 52 96 51 93 53 16
Country10 75 94 91 59 39 24 7 0 96 57
Country11 62 9 33 89 5 77 37 63 42 29
Country12 7 98 43 71 98 81 48 13 61 69
Corresponding output:
feature1 feature2 feature3 feature4 feature5 feature6 feature7 feature8 feature9 feature10 Overall
0 Country2: 93 Country12: 98 Country5: 93 Country11: 89 Country12: 98 Country9: 96 Country4: 94 Country6: 97 Country10: 96 Country4: 97 Country12: 589.0
1 Country3: 93 Country10: 94 Country10: 91 Country12: 71 Country6: 96 Country12: 81 Country2: 91 Country9: 93 Country12: 61 Country1: 96 Country3: 552.0
2 Country6: 93 Country8: 81 Country3: 87 Country3: 69 Country5: 72 Country11: 77 Country1: 65 Country4: 87 Country9: 53 Country3: 91 Country10: 542.0
3 Country5: 90 Country7: 50 Country12: 43 Country4: 68 Country8: 72 Country1: 67 Country3: 65 Country5: 87 Country1: 52 Country2: 75 Country2: 518.0
4 Country7: 82 Country9: 38 Country2: 41 Country2: 65 Country9: 52 Country6: 53 Country9: 51 Country7: 85 Country6: 51 Country7: 70 Country6: 501.0
5 Country10: 75 Country3: 34 Country4: 41 Country10: 59 Country4: 46 Country7: 38 Country7: 49 Country3: 71 Country2: 43 Country12: 69 Country4: 489.0
6 Country11: 62 Country1: 31 Country7: 34 Country6: 40 Country2: 44 Country3: 25 Country12: 48 Country11: 63 Country11: 42 Country10: 57 Country7: 486.0
7 Country1: 40 Country5: 21 Country6: 33 Country9: 32 Country7: 44 Country10: 24 Country5: 44 Country1: 57 Country8: 39 Country8: 47 Country5: 485.0
8 Country8: 33 Country2: 20 Country11: 33 Country7: 27 Country10: 39 Country2: 21 Country11: 37 Country8: 53 Country3: 17 Country5: 42 Country9: 469.0
9 Country4: 24 Country4: 20 Country9: 20 Country1: 6 Country11: 5 Country5: 20 Country8: 13 Country2: 25 Country5: 16 Country11: 29 Country11: 446.0
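A more compact variant is sketched below, assuming the same layout (countries as the index, one column per feature). It is not the code from the answer above: it swaps in Series.nlargest for the sort-and-drop steps, and it also counts how often each country appears across the per-feature top 10 lists, which is the other "global" ranking the question mentions. The random data and the names top10, overall and appearances are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_etude: 12 countries x 10 features.
rng = np.random.default_rng(0)
features = [f"feature{i}" for i in range(1, 11)]
countries = [f"Country{i}" for i in range(1, 13)]
data_etude = pd.DataFrame(
    rng.integers(0, 100, size=(12, 10)), index=countries, columns=features
)

# Top 10 countries per feature, stored as a dict of Series (sorted descending).
top10 = {f: data_etude[f].nlargest(10) for f in features}

# Global ranking 1: equal-weight sum over all features.
overall = data_etude.sum(axis=1).nlargest(10)

# Global ranking 2: number of per-feature top 10 lists each country appears in.
appearances = pd.Series(
    [country for s in top10.values() for country in s.index]
).value_counts()

print(overall)
print(appearances.head(10))
```

Each entry of top10, as well as overall, is a Series, so something like top10["feature1"].plot.barh() should draw one of the ten charts once matplotlib is installed.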

How to create a heatmap with different color ranges for the sum column?

I've got my pd.dataframe ready.
Communication Services Consumer Discretionary Consumer Staples Energy Financials Health Care Industrials Materials Real Estate Technology Utilities Sum
Date
2020-09-15 61 65 39 3 36 53 68 89 74 43 53 584
2020-09-14 50 70 39 7 54 45 67 92 64 28 53 569
2020-09-11 38 54 30 0 28 27 46 82 25 18 28 376
2020-09-10 30 52 24 0 16 19 30 67 32 12 25 307
2020-09-09 50 57 36 0 33 30 52 71 51 30 42 452
2020-09-08 34 55 21 0 24 16 24 46 48 12 25 305
2020-09-04 53 59 51 3 66 32 47 71 74 35 28 519
2020-09-03 57 67 57 0 48 40 49 82 80 52 32 564
2020-09-02 73 85 78 3 80 74 94 89 87 94 64 821
2020-09-01 69 78 54 3 54 51 79 85 51 77 14 615
2020-08-31 76 73 78 7 50 61 75 64 54 70 21 629
2020-08-28 92 81 75 30 81 48 86 89 77 76 17 752
2020-08-27 88 77 81 11 83 53 82 82 70 64 14 705
2020-08-26 92 81 75 11 46 43 79 89 45 69 7 637
2020-08-25 92 86 78 23 65 45 82 82 64 64 21 702
2020-08-24 92 88 90 38 62 38 90 75 54 61 39 727
2020-08-21 80 78 69 11 28 37 71 50 45 49 17 535
2020-08-20 84 72 63 11 34 45 78 57 45 57 17 563
2020-08-19 80 83 81 34 48 56 84 71 29 60 35 661
2020-08-18 88 88 90 53 48 62 91 71 70 64 42 767
2020-08-17 80 95 87 80 69 62 94 78 77 63 42 827
2020-08-14 84 100 90 80 83 56 94 78 64 57 42 828
2020-08-13 88 98 87 69 81 56 95 78 64 66 57 839
2020-08-12 73 96 87 96 83 58 98 75 90 63 64 883
2020-08-11 73 86 72 84 89 50 95 78 77 53 46 803
2020-08-10 80 93 87 88 83 53 93 78 90 64 82 891
2020-08-07 69 81 84 65 84 58 91 60 83 71 89 835
2020-08-06 73 80 81 73 60 53 84 57 54 78 67 760
2020-08-05 69 81 87 73 68 69 89 64 51 78 64 793
2020-08-04 80 63 87 73 46 66 64 53 67 81 85 765
2020-08-03 69 55 78 50 60 74 68 42 51 81 78 706
2020-07-31 65 62 78 42 60 61 64 46 58 74 92 702
2020-07-30 65 62 75 34 65 74 71 50 64 61 89 710
2020-07-29 73 78 90 88 90 87 79 85 70 64 85 889
2020-07-28 46 67 81 38 71 72 78 85 61 47 89 735
2020-07-27 61 78 90 61 86 75 76 96 32 74 75 804
2020-07-24 80 77 87 73 87 72 83 100 32 56 96 843
2020-07-23 84 81 90 73 90 85 91 100 38 73 96 901
2020-07-22 88 90 90 84 92 93 94 100 45 90 96 962
2020-07-21 76 91 93 96 92 93 93 100 25 85 92 936
2020-07-20 65 81 81 34 62 91 84 96 32 87 89 802
2020-07-17 76 86 93 38 65 95 91 96 51 77 100 868
2020-07-16 80 90 93 50 81 93 89 96 22 70 85 849
2020-07-15 80 96 87 53 78 95 91 96 45 76 75 872
2020-07-14 69 59 81 23 53 82 73 96 25 60 82 703
2020-07-13 57 34 69 0 46 54 56 71 9 43 75 514
2020-07-10 61 44 66 0 43 59 39 60 35 66 64 537
2020-07-09 46 31 42 0 18 61 36 32 32 61 46 405
2020-07-08 50 42 57 3 34 67 50 46 25 61 57 492
2020-07-07 53 34 60 0 18 66 43 75 22 50 46 467
2020-07-06 50 52 54 7 30 75 64 89 41 76 53 591
Now I'd like to plot a heatmap by using matplotlib. The resulting heatmap should look something like this:
For the inner part (columns besides "sum"), if the value is above 50, then the color should be green, and the color should be darker for the largest values. Same logic for the values below 50.
For the "sum" column, the threshold is 550. How to achieve the gradual change in color?
By default, sns.diverging_palette(20, 145) has a light, near-white color in the center. A hue of 20 gives red and a hue of 145 gives green.
vmin= will then set the numeric value corresponding to red and vmax= for the value corresponding to green. The value in the center will be white.
You need to create 2 separate heatmaps as they have different color ranges. The ax= keyword tells on which subplot the heatmap should be created. The colorbars can be left out: the numbers inside the cells already indicate the correspondence.
A newline character in the label names helps to better use the available space.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
# df = pd.read_csv(...)
# df.set_index('Date', inplace=True)
column_labels = [col.replace(' ', '\n') for col in df.columns[:-1]]
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 10),
gridspec_kw={'width_ratios': [10, 1], 'wspace': 0.02, 'bottom': 0.14})
cmap = sns.diverging_palette(20, 145)
sns.heatmap(df[df.columns[:-1]], cmap=cmap, vmin=0, vmax=100, annot=True, fmt='.0f', annot_kws={'fontsize': 10},
lw=0.6, xticklabels=column_labels, cbar=False, ax=ax1)
sns.heatmap(df[df.columns[-1:]], cmap=cmap, vmin=0, vmax=1100, annot=True, fmt='.0f', annot_kws={'fontsize': 10},
lw=0.6, yticklabels=[], cbar=False, ax=ax2)
ax2.set_ylabel('')
ax2.tick_params(axis='x', labelrotation=90)
plt.show()
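The limits above put each threshold exactly halfway between vmin and vmax (50 between 0 and 100, 550 between 0 and 1100). If the limits should follow the data more closely and are not symmetric around the threshold, matplotlib's TwoSlopeNorm can pin the center of the colormap to the threshold instead. The sketch below only shows the mapping; the limits 300 and 1000 are assumptions chosen for illustration, not values from the answer.

```python
from matplotlib.colors import TwoSlopeNorm

# Assumed limits for the "Sum" column: 300 -> 0.0, 550 -> 0.5, 1000 -> 1.0.
norm_sum = TwoSlopeNorm(vmin=300, vcenter=550, vmax=1000)

at_threshold = float(norm_sum(550))  # position of the threshold in the colormap
above = float(norm_sum(775))         # halfway between the threshold and vmax
below = float(norm_sum(425))         # halfway between vmin and the threshold

print(at_threshold, above, below)
```

A norm object like this can be passed through the norm= argument of matplotlib plotting functions such as pcolormesh or imshow.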
plt.figure(figsize=(15, 15))
sns.heatmap(data, annot=True, cmap="YlGnBu", linewidths=.5)
Is that what you are looking for?
If you also want to set the range of the values, you can use the vmin and vmax parameters.

From Matlab to Python Code [z,index]=sort(abs(z));

I am trying to convert code from MATLAB to Python.
Can you please help me convert this code?
In the MATLAB code, z is a list of length 121:
z= 7.0502 5.8030 4.4657 3.0404 1.5416 0 -1.5416 -3.0404 -4.4657
-5.8030 -7.0502 7.5944 6.3059 4.8990 3.3662 1.7189 0 -1.7189 -3.3662 -4.8990 -6.3059 -7.5944 8.2427 6.9282 5.4611 3.8122 1.9735 0 -1.9735 -3.8122 -5.4611 -6.9282 -8.2427 9.0135 7.7027 6.2075 4.4590 2.3803 0 -2.3803 -4.4590 -6.2075 -7.7027 -9.0135 9.9185 8.6576 7.2038 5.4466 3.1530 0 -3.1530 -5.4466 -7.2038 -8.6576 -9.9185 10.9545 9.7980 8.4853 6.9282 4.8990 0 -4.8990 -6.9282 -8.4853 -9.7980 -10.9545 12.0986 11.0885 9.9947 8.8128 7.6119 -6.9282 -7.6119 -8.8128 -9.9947 -11.0885 -12.0986 13.3133 12.4632 11.5988 10.7649 10.0829 -9.7980 -10.0829 -10.7649 -11.5988 -12.4632 -13.3133 14.5583 13.8564 13.1842 12.5910 12.1612 -12.0000 -12.1612 -12.5910 -13.1842 -13.8564 -14.5583 15.8011 15.2238 14.6969 14.2594 13.9626 -13.8564 -13.9626 -14.2594 -14.6969 -15.2238 -15.8011 17.0207 16.5431 16.1227 15.7875 15.5684 -15.4919 -15.5684 -15.7875 -16.1227 -16.5431 -17.0207
Matlab code : [z,index]=sort(abs(z));
after the code
z = 0 0 0 0 0 0 1.5416 1.5416 1.7189 1.7189 1.9735 1.9735 2.3803 2.3803 3.0404 3.0404 3.1530 3.1530 3.3662 3.3662 3.8122 3.8122 4.4590 4.4590 4.4657 4.4657 4.8990 4.8990 4.8990 4.8990 5.4466 5.4466 5.4611 5.4611 5.8030 5.8030 6.2075 6.2075 6.3059 6.3059 6.9282 6.9282 6.9282 6.9282 6.9282 7.0502 7.0502 7.2038 7.2038 7.5944 7.5944 7.6119 7.6119 7.7027 7.7027 8.2427 8.2427 8.4853 8.4853 8.6576 8.6576 8.8128 8.8128 9.0135 9.0135 9.7980 9.7980 9.7980 9.9185 9.9185 9.9947 9.9947 10.0829 10.0829 10.7649 10.7649 10.9545 10.9545 11.0885 11.0885 11.5988 11.5988 12.0000 12.0986 12.0986 12.1612 12.1612 12.4632 12.4632 12.5910
12.5910 13.1842 13.1842 13.3133 13.3133 13.8564 13.8564 13.8564 13.9626 13.9626 14.2594 14.2594 14.5583 14.5583 14.6969 14.6969 15.2238 15.2238 15.4919 15.5684 15.5684 15.7875 15.7875 15.8011 15.8011 16.1227 16.1227 16.5431 16.5431 17.0207 17.0207
and index is
index = 6 17 28 39 50 61 5 7 16 18 27 29 38 40 4 8 49 51 15 19 26 30 37 41 3 9 14 20 60 62 48 52 25 31 2 10 36 42 13 21 24 32 59 63 72 1 11 47 53 12 22 71 73 35 43 23 33 58 64 46 54 70 74 34 44 57 65 83 45 55 69 75 82 84 81 85 56 66 68 76 80 86 94 67 77 93 95 79 87 92 96 91 97 78 88 90 98 105 104 106 103 107 89 99 102 108 101 109 116 115 117 114 118 100 110 113 119 112 120 111 121
so what is the [z,index] in python ?
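Do you need to return the index? If you don't, you could use:
new_list = sorted(map(abs, z))
If you do, sort the positions by absolute value as well:
index = sorted(range(len(z)), key=lambda k: abs(z[k]))
where new_list is the sorted output, index holds the original position of each sorted value, and z is the list.
Note that Python indices start at 0, whereas MATLAB indices start at 1.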
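If NumPy is available, the same result can be sketched with np.argsort. This is an alternative to the pure-Python approach above, not part of the original answer. Only the first 11 values of z from the question are used to keep the example short; kind="stable" keeps tied values in their original order, and adding 1 converts to MATLAB-style 1-based indices.

```python
import numpy as np

# First 11 values of z from the question.
z = np.array([7.0502, 5.8030, 4.4657, 3.0404, 1.5416, 0,
              -1.5416, -3.0404, -4.4657, -5.8030, -7.0502])

abs_z = np.abs(z)
index = np.argsort(abs_z, kind="stable")  # 0-based positions, ties keep input order
z_sorted = abs_z[index]                   # sorted absolute values
matlab_index = index + 1                  # 1-based positions, as MATLAB reports them

print(z_sorted)
print(matlab_index)
```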

read_csv() doesn't work with StreamingBody object where python engine is required

I am encountering a problem while using pandas read_csv(). The data is read from S3 as a StreamingBody object, and I noticed that it only works when the C engine parser is used. (Per the pandas documentation, skipfooter only works with the Python engine parser.)
Has anyone encountered a similar issue before? Or is there any advice on how to solve this problem? Thanks.
The following is how I reproduced this issue.
import boto3
import s3fs
import pandas as pd
s3 = boto3.client("s3")
response = s3.get_object(Bucket="df-raw-869771", Key="csv/customer-01.csv")
pd.read_csv(response["Body"])
customer_id|store_id|first_name|last_name|email|address_id|activebool|dw_insert_date|dw_update_date|active
0 9|2|MARGARET|MOORE|MARGARET.MOORE#sakilacustom...
1 13|2|KAREN|JACKSON|KAREN.JACKSON#sakilacustome...
2 17|1|DONNA|THOMPSON|DONNA.THOMPSON#sakilacusto...
3 21|1|MICHELLE|CLARK|MICHELLE.CLARK#sakilacusto...
4 25|1|DEBORAH|WALKER|DEBORAH.WALKER#sakilacusto...
... ...
1188 587|1|SERGIO|STANFIELD|SERGIO.STANFIELD#sakila...
1189 591|1|KENT|ARSENAULT|KENT.ARSENAULT#sakilacust...
1190 595|1|TERRENCE|GUNDERSON|TERRENCE.GUNDERSON#sa...
1191 599|2|AUSTIN|CINTRON|AUSTIN.CINTRON#sakilacust...
1192 4|2|BARBARA|JONES|BARBARA#sakilacustomer.org|8...
[1193 rows x 1 columns]
When passing in the skipfooter argument:
import boto3
import s3fs
import pandas as pd
s3 = boto3.client("s3")
response = s3.get_object(Bucket="df-raw-869771", Key="csv/customer-01.csv")
pd.read_csv(response["Body"], skipfooter=1)
__main__:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
99 117 115 116 111 109 101 114 95 105 100 124 115.1 116.1 111.1 ... 114.28 103.11 124.112 53.6 55.4 124.113 65.51 99.26 116.33 105.31 118.13 101.35 124.114 49.70 51.21
0 45 49 49 45 50 48 49 57 124 124 49 10 53 55 124 ... 69 83 84 124 69 68 78 65 46 87 69 83 84 64 115
1 97 107 105 108 97 99 117 115 116 111 109 101 114 46 111 ... 76 68 73 78 69 46 80 69 82 75 73 78 83 64 115
2 97 107 105 108 97 99 117 115 116 111 109 101 114 46 111 ... 73 76 76 73 65 77 83 79 78 64 115 97 107 105 108
3 97 99 117 115 116 111 109 101 114 46 111 114 103 124 50 ... 114 103 124 50 55 48 124 65 99 116 105 118 101 124 49
4 51 45 49 49 45 50 48 49 57 124 124 49 10 50 54 ... 115 97 107 105 108 97 99 117 115 116 111 109 101 114 46
.. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
84 99 116 105 118 101 124 49 51 45 49 49 45 50 48 49 ... 51 45 49 49 45 50 48 49 57 124 124 49 10 51 56
85 51 124 49 124 77 65 82 84 73 78 124 66 65 76 69 ... 51 53 124 50 124 82 73 67 75 89 124 83 72 69 76
86 66 89 124 82 73 67 75 89 46 83 72 69 76 66 89 ... 88 84 69 82 124 72 69 67 84 79 82 46 80 79 73
87 78 68 69 88 84 69 82 64 115 97 107 105 108 97 99 ... 103 124 53 52 53 124 65 99 116 105 118 101 124 49 51
88 45 49 49 45 50 48 49 57 124 124 49 10 53 52 51 ... 116 111 109 101 114 46 111 114 103 124 53 57 55 124 65
[89 rows x 1024 columns]
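One workaround that may be worth trying (an assumption, not something confirmed in this thread) is to read the whole body into memory and hand pandas an ordinary text buffer instead of the StreamingBody. The sketch below uses a literal bytes value in place of response["Body"].read() so that it runs without AWS access. The sample rows and the trailer line are made up, and sep="|" is assumed because the file shown above is pipe-delimited.

```python
import io
import pandas as pd

# In the real case this would be: raw = response["Body"].read()
# The bytes below are a made-up stand-in with one footer line.
raw = (
    b"customer_id|store_id|first_name\n"
    b"9|2|MARGARET\n"
    b"13|2|KAREN\n"
    b"trailer line\n"
)

# Decode to text and wrap in StringIO so the Python engine gets a plain text buffer.
buffer = io.StringIO(raw.decode("utf-8"))
df = pd.read_csv(buffer, sep="|", skipfooter=1, engine="python")

print(df)
```

This loads the entire object into memory, so it is only a reasonable choice for files that fit comfortably in RAM.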

Conditional summing of columns in pandas

I have the following database in Pandas:
Student-ID Last-name First-name HW1 HW2 HW3 HW4 HW5 M1 M2 Final
59118211 Alf Brian 96 90 88 93 96 78 60 59.0
59260567 Anderson Jill 73 83 96 80 84 80 52 42.5
59402923 Archangel Michael 99 80 60 94 98 41 56 0.0
59545279 Astor John 93 88 97 100 55 53 53 88.9
59687635 Attach Zach 69 75 61 65 91 90 63 69.0
I want to add only those columns which have "HW" in them. Any suggestions on how I can do that?
Note: The number of columns containing HW may differ. So I can't reference them directly.
You could call df.filter(regex='HW') to return the columns whose names contain 'HW' and then sum them row-wise via sum(axis=1):
In [23]: df
Out[23]:
StudentID Lastname Firstname HW1 HW2 HW3 HW4 HW5 HW6 HW7 M1
0 59118211 Alf Brian 96 90 88 93 96 97 88 10
1 59260567 Anderson Jill 73 83 96 80 84 99 80 100
2 59402923 Archangel Michael 99 80 60 94 98 73 97 50
3 59545279 Astor John 93 88 97 100 55 96 86 60
4 59687635 Attach Zach 69 75 61 65 91 89 82 55
5 59829991 Bake Jake 56 0 77 78 0 79 0 10
In [24]: df.filter(regex='HW').sum(axis=1)
Out[24]:
0 648
1 595
2 601
3 615
4 532
5 290
dtype: int64
John's solution - using df.filter() - is more elegant, but you could also consider a list comprehension ...
df[[x for x in df.columns if 'HW' in x]].sum(axis=1)
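To store the result alongside the original data, the row sums can be assigned to a new column. The sketch below uses a cut-down two-student frame and filter(like='HW'), which matches column names containing the substring. The column name homework_total is an illustrative choice; it deliberately avoids the substring "HW" so that re-running the line does not pick up the total itself.

```python
import pandas as pd

# Cut-down version of the table from the question.
df = pd.DataFrame({
    "Last-name": ["Alf", "Anderson"],
    "HW1": [96, 73],
    "HW2": [90, 83],
    "M1": [78, 80],
})

# Sum every column whose name contains "HW", however many there are.
df["homework_total"] = df.filter(like="HW").sum(axis=1)

print(df)
```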
