How to plot the top ten for each column in Python

I want to plot the top 10 countries (the row index) according to the values in each column.
The columns are the features I use to evaluate the countries: "feature1","feature2","feature3","feature4","feature5","feature6","feature7","feature8","feature9","feature10".
So this would be 10 graphs, each showing a top 10 ranking. Then I want the "global" top 10 that takes all columns into account (let's say each column has the same coefficient).
I was thinking about making a new df that shows the most recurring countries across these top 10 dfs (the countries that appear most often in the "top 10" lists), but I don't know how to.
I am struggling, so I started by creating new dataframes from the original large dataset named "data_etude", which I copied to "data_etude_copy".
For each feature/column I am analysing, I created a new dataframe "data_ind" showing the top 10 for that feature (the columns are the features and the rows are the values taken by the countries).
Then I wrote a script to reduce each of these dataframes to one that shows only the top 10 ranking, the values and the parameter. I am aware that this is quite laborious, and as a beginner I didn't manage to turn it into a loop.
The original dataset:
data_etude_copy = data_etude.copy()
Dataframes of the top 10 countries for each feature (this should be a loop, it is so laborious):
data_ind1 = data_etude_copy.sort_values(by=['feature1'], ascending=False).head(10)
data_ind2 = data_etude_copy.sort_values(by=['feature2'], ascending=False).head(10)
data_ind3 = data_etude_copy.sort_values(by=['feature3'], ascending=False).head(10)
data_ind4 = data_etude_copy.sort_values(by=['feature4'], ascending=False).head(10)
data_ind5 = data_etude_copy.sort_values(by=['feature5'], ascending=False).head(10)
data_ind6 = data_etude_copy.sort_values(by=['feature6'], ascending=False).head(10)
data_ind7 = data_etude_copy.sort_values(by=['feature7'], ascending=False).head(10)
data_ind8 = data_etude_copy.sort_values(by=['feature8'], ascending=False).head(10)
data_ind9 = data_etude_copy.sort_values(by=['feature9'], ascending=False).head(10)
data_ind10 = data_etude_copy.sort_values(by=['feature10'], ascending=False).head(10)
And simplified dfs with the top 10 for each feature (I need a loop, I know...):
data_ind1.drop(data_ind1.loc[:,data_ind1.columns!="feature1"], inplace=True, axis = 1)
data_ind2.drop(data_ind2.loc[:,data_ind2.columns!="feature2"], inplace=True, axis = 1)
data_ind3.drop(data_ind3.loc[:,data_ind3.columns!="feature3"], inplace=True, axis = 1)
data_ind4.drop(data_ind4.loc[:,data_ind4.columns!="feature4"], inplace=True, axis = 1)
data_ind5.drop(data_ind5.loc[:,data_ind5.columns!="feature5"], inplace=True, axis = 1)
data_ind6.drop(data_ind6.loc[:,data_ind6.columns!="feature6"], inplace=True, axis = 1)
data_ind7.drop(data_ind7.loc[:,data_ind7.columns!="feature7"], inplace=True, axis = 1)
data_ind8.drop(data_ind8.loc[:,data_ind8.columns!="feature8"], inplace=True, axis = 1)
data_ind9.drop(data_ind9.loc[:,data_ind9.columns!="feature9"], inplace=True, axis = 1)
data_ind10.drop(data_ind10.loc[:,data_ind10.columns!="feature10"], inplace=True, axis = 1)
How could I make this into a loop and plot the intended result? That is to say:
- plotting the top 10 countries for each feature
- then a final "top 10 countries" taking into account all 10 features (either the countries that appear most often across the dfs, or the countries with the best ranking if all features have the same coefficient)?

I think this is what you're asking for? I put your code into a for loop form and added code for ranking the countries overall. The overall ranking is based on all features, not just the top 10 lists but if you'd like it the other way then just switch the order of the commented blocks in the first for loop. I also wasn't sure how you wanted to display it so currently it just prints the final dataframe. It's probably not the cleanest code ever but I hope it helps!
import pandas as pd
import numpy as np
data = np.random.randint(100,size=(12,10))
countries = [
'Country1',
'Country2',
'Country3',
'Country4',
'Country5',
'Country6',
'Country7',
'Country8',
'Country9',
'Country10',
'Country11',
'Country12',
]
feature_names_weights = {
'feature1' :1.0,
'feature2' :1.0,
'feature3' :1.0,
'feature4' :1.0,
'feature5' :1.0,
'feature6' :1.0,
'feature7' :1.0,
'feature8' :1.0,
'feature9' :1.0,
'feature10' :1.0,
}
feature_names = list(feature_names_weights.keys())
df = pd.DataFrame(data=data, index=countries, columns=feature_names)
data_etude_copy = df
data_sorted_by_feature = {}
country_scores = (pd.DataFrame(data=np.zeros(len(countries)),index=countries))[0]
for feature in feature_names:
    #Adds to each country's score and multiplies by weight factor for each feature
    for country in countries:
        country_scores[country] += data_etude_copy[feature][country]*(feature_names_weights[feature])
    #Sorts the countries by feature (your code in loop form)
    data_sorted_by_feature[feature] = data_etude_copy.sort_values(by=[feature], ascending=False).head(10)
    data_sorted_by_feature[feature].drop(data_sorted_by_feature[feature].loc[:,data_sorted_by_feature[feature].columns!=feature], inplace=True, axis = 1)
#sort country total scores
ranked_countries = country_scores.sort_values(ascending=False).head(10)
##Put everything into one DataFrame
#Create empty DataFrame
empty_data=np.empty((10,11),str)
outputDF = pd.DataFrame(data=empty_data,columns=((feature_names)+['Overall']))
#Add entries for all features
for feature in feature_names:
    for index in range(10):
        country = list(data_sorted_by_feature[feature].index)[index]
        #.loc avoids chained assignment, which newer pandas versions may not write back
        outputDF.loc[index, feature] = f'{country}: {data_sorted_by_feature[feature][feature][country]}'
#Add column for overall country score
for index in range(10):
    country = list(ranked_countries.index)[index]
    outputDF.loc[index, 'Overall'] = f'{country}: {ranked_countries[country]}'
#Print DataFrame
print(outputDF)
Example data in:
feature1 feature2 feature3 feature4 feature5 feature6 feature7 feature8 feature9 feature10
Country1 40 31 5 6 4 67 65 57 52 96
Country2 93 20 41 65 44 21 91 25 43 75
Country3 93 34 87 69 0 25 65 71 17 91
Country4 24 20 41 68 46 1 94 87 11 97
Country5 90 21 93 0 72 20 44 87 16 42
Country6 93 17 33 40 96 53 1 97 51 20
Country7 82 50 34 27 44 38 49 85 7 70
Country8 33 81 14 5 72 13 13 53 39 47
Country9 18 38 20 32 52 96 51 93 53 16
Country10 75 94 91 59 39 24 7 0 96 57
Country11 62 9 33 89 5 77 37 63 42 29
Country12 7 98 43 71 98 81 48 13 61 69
Corresponding output:
feature1 feature2 feature3 feature4 feature5 feature6 feature7 feature8 feature9 feature10 Overall
0 Country2: 93 Country12: 98 Country5: 93 Country11: 89 Country12: 98 Country9: 96 Country4: 94 Country6: 97 Country10: 96 Country4: 97 Country12: 589.0
1 Country3: 93 Country10: 94 Country10: 91 Country12: 71 Country6: 96 Country12: 81 Country2: 91 Country9: 93 Country12: 61 Country1: 96 Country3: 552.0
2 Country6: 93 Country8: 81 Country3: 87 Country3: 69 Country5: 72 Country11: 77 Country1: 65 Country4: 87 Country9: 53 Country3: 91 Country10: 542.0
3 Country5: 90 Country7: 50 Country12: 43 Country4: 68 Country8: 72 Country1: 67 Country3: 65 Country5: 87 Country1: 52 Country2: 75 Country2: 518.0
4 Country7: 82 Country9: 38 Country2: 41 Country2: 65 Country9: 52 Country6: 53 Country9: 51 Country7: 85 Country6: 51 Country7: 70 Country6: 501.0
5 Country10: 75 Country3: 34 Country4: 41 Country10: 59 Country4: 46 Country7: 38 Country7: 49 Country3: 71 Country2: 43 Country12: 69 Country4: 489.0
6 Country11: 62 Country1: 31 Country7: 34 Country6: 40 Country2: 44 Country3: 25 Country12: 48 Country11: 63 Country11: 42 Country10: 57 Country7: 486.0
7 Country1: 40 Country5: 21 Country6: 33 Country9: 32 Country7: 44 Country10: 24 Country5: 44 Country1: 57 Country8: 39 Country8: 47 Country5: 485.0
8 Country8: 33 Country2: 20 Country11: 33 Country7: 27 Country10: 39 Country2: 21 Country11: 37 Country8: 53 Country3: 17 Country5: 42 Country9: 469.0
9 Country4: 24 Country4: 20 Country9: 20 Country1: 6 Country11: 5 Country5: 20 Country8: 13 Country2: 25 Country5: 16 Country11: 29 Country11: 446.0
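The question also asked for plots and for a ranking by how often a country appears in the per-feature lists. A minimal sketch of both, assuming a DataFrame shaped like the example above. The random stand-in data, the 3x4 subplot grid and the output file name are illustrative choices, not part of the original answer:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to use an interactive one
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_etude: 12 countries x 10 features.
rng = np.random.default_rng(0)
features = [f"feature{i}" for i in range(1, 11)]
countries = [f"Country{i}" for i in range(1, 13)]
df = pd.DataFrame(rng.integers(0, 100, size=(12, 10)), index=countries, columns=features)

# Top 10 per feature: one Series (country -> value) per column.
top10 = {feature: df[feature].nlargest(10) for feature in features}

# Count how often each country appears across the ten top-10 lists.
appearances = pd.Series(
    [country for values in top10.values() for country in values.index]
).value_counts()
global_top10 = appearances.head(10)

# One horizontal bar chart per feature, plus one for the appearance counts.
fig, axes = plt.subplots(3, 4, figsize=(20, 12))
flat_axes = axes.ravel()
for ax, (feature, values) in zip(flat_axes, top10.items()):
    values.sort_values().plot.barh(ax=ax, title=feature)
global_top10.sort_values().plot.barh(ax=flat_axes[10], title="Appearances in top 10")
flat_axes[11].axis("off")  # the twelfth panel is unused
fig.tight_layout()
fig.savefig("top10_per_feature.png")
```

With only 12 countries the appearance counts are not very discriminating; on a larger dataset they should separate the countries more clearly.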

Related

set and reset index in pandas dataframe not working

import numpy as np
import pandas as pd
np.random.seed(121)
randArr =np.random.randint(0,100,20).reshape(5,4)
df =pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
df.index.name='RollNo'
print(df)
print("")
df.reset_index()
print(df)
print("")
df.set_index('PDS')
print(df)
print("")
Output (not as expected):
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
You need to assign the result back:
df = df.reset_index()
df = df.set_index('PDS')
Or you can use the inplace argument:
df.reset_index(inplace=True)
df.set_index('PDS', inplace=True)
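A short sketch of why the original prints never changed: reset_index() and set_index() return a new DataFrame and leave the one they are called on untouched. The values below are made up; only the column names follow the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=pd.Index(np.arange(101, 106), name="RollNo"),
    columns=["PDS", "Algo", "SE", "INS"],
)

returned = df.reset_index()           # a new DataFrame with RollNo as a column
unchanged_index_name = df.index.name  # df itself still has RollNo as its index

# Reassigning keeps the result; the two calls can also be chained.
df = df.reset_index().set_index("PDS")
```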

How to create a heatmap with different color ranges for the sum column?

I've got my pd.dataframe ready.
Communication Services Consumer Discretionary Consumer Staples Energy Financials Health Care Industrials Materials Real Estate Technology Utilities Sum
Date
2020-09-15 61 65 39 3 36 53 68 89 74 43 53 584
2020-09-14 50 70 39 7 54 45 67 92 64 28 53 569
2020-09-11 38 54 30 0 28 27 46 82 25 18 28 376
2020-09-10 30 52 24 0 16 19 30 67 32 12 25 307
2020-09-09 50 57 36 0 33 30 52 71 51 30 42 452
2020-09-08 34 55 21 0 24 16 24 46 48 12 25 305
2020-09-04 53 59 51 3 66 32 47 71 74 35 28 519
2020-09-03 57 67 57 0 48 40 49 82 80 52 32 564
2020-09-02 73 85 78 3 80 74 94 89 87 94 64 821
2020-09-01 69 78 54 3 54 51 79 85 51 77 14 615
2020-08-31 76 73 78 7 50 61 75 64 54 70 21 629
2020-08-28 92 81 75 30 81 48 86 89 77 76 17 752
2020-08-27 88 77 81 11 83 53 82 82 70 64 14 705
2020-08-26 92 81 75 11 46 43 79 89 45 69 7 637
2020-08-25 92 86 78 23 65 45 82 82 64 64 21 702
2020-08-24 92 88 90 38 62 38 90 75 54 61 39 727
2020-08-21 80 78 69 11 28 37 71 50 45 49 17 535
2020-08-20 84 72 63 11 34 45 78 57 45 57 17 563
2020-08-19 80 83 81 34 48 56 84 71 29 60 35 661
2020-08-18 88 88 90 53 48 62 91 71 70 64 42 767
2020-08-17 80 95 87 80 69 62 94 78 77 63 42 827
2020-08-14 84 100 90 80 83 56 94 78 64 57 42 828
2020-08-13 88 98 87 69 81 56 95 78 64 66 57 839
2020-08-12 73 96 87 96 83 58 98 75 90 63 64 883
2020-08-11 73 86 72 84 89 50 95 78 77 53 46 803
2020-08-10 80 93 87 88 83 53 93 78 90 64 82 891
2020-08-07 69 81 84 65 84 58 91 60 83 71 89 835
2020-08-06 73 80 81 73 60 53 84 57 54 78 67 760
2020-08-05 69 81 87 73 68 69 89 64 51 78 64 793
2020-08-04 80 63 87 73 46 66 64 53 67 81 85 765
2020-08-03 69 55 78 50 60 74 68 42 51 81 78 706
2020-07-31 65 62 78 42 60 61 64 46 58 74 92 702
2020-07-30 65 62 75 34 65 74 71 50 64 61 89 710
2020-07-29 73 78 90 88 90 87 79 85 70 64 85 889
2020-07-28 46 67 81 38 71 72 78 85 61 47 89 735
2020-07-27 61 78 90 61 86 75 76 96 32 74 75 804
2020-07-24 80 77 87 73 87 72 83 100 32 56 96 843
2020-07-23 84 81 90 73 90 85 91 100 38 73 96 901
2020-07-22 88 90 90 84 92 93 94 100 45 90 96 962
2020-07-21 76 91 93 96 92 93 93 100 25 85 92 936
2020-07-20 65 81 81 34 62 91 84 96 32 87 89 802
2020-07-17 76 86 93 38 65 95 91 96 51 77 100 868
2020-07-16 80 90 93 50 81 93 89 96 22 70 85 849
2020-07-15 80 96 87 53 78 95 91 96 45 76 75 872
2020-07-14 69 59 81 23 53 82 73 96 25 60 82 703
2020-07-13 57 34 69 0 46 54 56 71 9 43 75 514
2020-07-10 61 44 66 0 43 59 39 60 35 66 64 537
2020-07-09 46 31 42 0 18 61 36 32 32 61 46 405
2020-07-08 50 42 57 3 34 67 50 46 25 61 57 492
2020-07-07 53 34 60 0 18 66 43 75 22 50 46 467
2020-07-06 50 52 54 7 30 75 64 89 41 76 53 591
Now I'd like to plot a heatmap by using matplotlib. The resulting heatmap should look something like this:
For the inner part (columns besides "sum"), if the value is above 50, then the color should be green, and the color should be darker for the largest values. Same logic for the values below 50.
For the "sum" column, the threshold is 550. How to achieve the gradual change in color?
By default, sns.diverging_palette(20, 145) has white in the center. A hue of 20 gives red and a hue of 145 gives green.
vmin= then sets the numeric value mapped to red and vmax= the value mapped to green; the value halfway between them is white. With vmin=0 and vmax=100 the center is 50, and with vmin=0 and vmax=1100 it is 550, matching the two thresholds.
You need to create 2 separate heatmaps as they have different color ranges. The ax= keyword tells on which subplot the heatmap should be created. The colorbars can be left out: the numbers inside the cells already indicate the correspondence.
A newline character in the label names helps to make better use of the available space.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
# df = pd.read_csv(...)
# df.set_index('Date', inplace=True)
column_labels = [col.replace(' ', '\n') for col in df.columns[:-1]]
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 10),
gridspec_kw={'width_ratios': [10, 1], 'wspace': 0.02, 'bottom': 0.14})
cmap = sns.diverging_palette(20, 145)
sns.heatmap(df[df.columns[:-1]], cmap=cmap, vmin=0, vmax=100, annot=True, fmt='.0f', annot_kws={'fontsize': 10},
lw=0.6, xticklabels=column_labels, cbar=False, ax=ax1)
sns.heatmap(df[df.columns[-1:]], cmap=cmap, vmin=0, vmax=1100, annot=True, fmt='.0f', annot_kws={'fontsize': 10},
lw=0.6, yticklabels=[], cbar=False, ax=ax2)
ax2.set_ylabel('')
ax2.tick_params(axis='x', labelrotation=90)
plt.show()
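An alternative to choosing a symmetric vmin/vmax around the threshold is matplotlib's TwoSlopeNorm, which pins the middle of the colormap to a chosen value while still using the real data range. This is a sketch of that different technique, not what the answer above does; the three sample sums and the "RdYlGn" colormap are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import TwoSlopeNorm

# Hypothetical stand-in for a few values of the "Sum" column.
sums = np.array([[305.0], [550.0], [962.0]])

# Map 550 to the middle of the colormap, and the data extremes to its two ends.
norm = TwoSlopeNorm(vcenter=550, vmin=sums.min(), vmax=sums.max())
low, mid, high = (float(norm(v)) for v in (305.0, 550.0, 962.0))

# Draw the column with that normalization.
fig, ax = plt.subplots(figsize=(2, 3))
mesh = ax.pcolormesh(sums, cmap="RdYlGn", norm=norm)
```

The trade-off is that the red and green halves then use different scales, so equal color intensity no longer means equal distance from the threshold.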
plt.figure(figsize=(15, 15))
sns.heatmap(data, annot=True, cmap="YlGnBu", linewidths=.5)
Is that what you are looking for?
If you also want to limit the range of the values, you can use the vmin and vmax parameters.

From Matlab to Python Code [z,index]=sort(abs(z));

I am trying to convert code from MATLAB to Python.
Can you please help me convert this code?
In the MATLAB code, z is a list of length 121:
z= 7.0502 5.8030 4.4657 3.0404 1.5416 0 -1.5416 -3.0404 -4.4657
-5.8030 -7.0502 7.5944 6.3059 4.8990 3.3662 1.7189 0 -1.7189 -3.3662 -4.8990 -6.3059 -7.5944 8.2427 6.9282 5.4611 3.8122 1.9735 0 -1.9735 -3.8122 -5.4611 -6.9282 -8.2427 9.0135 7.7027 6.2075 4.4590 2.3803 0 -2.3803 -4.4590 -6.2075 -7.7027 -9.0135 9.9185 8.6576 7.2038 5.4466 3.1530 0 -3.1530 -5.4466 -7.2038 -8.6576 -9.9185 10.9545 9.7980 8.4853 6.9282 4.8990 0 -4.8990 -6.9282 -8.4853 -9.7980 -10.9545 12.0986 11.0885 9.9947 8.8128 7.6119 -6.9282 -7.6119 -8.8128 -9.9947 -11.0885 -12.0986 13.3133 12.4632 11.5988 10.7649 10.0829 -9.7980 -10.0829 -10.7649 -11.5988 -12.4632 -13.3133 14.5583 13.8564 13.1842 12.5910 12.1612 -12.0000 -12.1612 -12.5910 -13.1842 -13.8564 -14.5583 15.8011 15.2238 14.6969 14.2594 13.9626 -13.8564 -13.9626 -14.2594 -14.6969 -15.2238 -15.8011 17.0207 16.5431 16.1227 15.7875 15.5684 -15.4919 -15.5684 -15.7875 -16.1227 -16.5431 -17.0207
MATLAB code: [z,index]=sort(abs(z));
After running this code:
z = 0 0 0 0 0 0 1.5416 1.5416 1.7189 1.7189 1.9735 1.9735 2.3803 2.3803 3.0404 3.0404 3.1530 3.1530 3.3662 3.3662 3.8122 3.8122 4.4590 4.4590 4.4657 4.4657 4.8990 4.8990 4.8990 4.8990 5.4466 5.4466 5.4611 5.4611 5.8030 5.8030 6.2075 6.2075 6.3059 6.3059 6.9282 6.9282 6.9282 6.9282 6.9282 7.0502 7.0502 7.2038 7.2038 7.5944 7.5944 7.6119 7.6119 7.7027 7.7027 8.2427 8.2427 8.4853 8.4853 8.6576 8.6576 8.8128 8.8128 9.0135 9.0135 9.7980 9.7980 9.7980 9.9185 9.9185 9.9947 9.9947 10.0829 10.0829 10.7649 10.7649 10.9545 10.9545 11.0885 11.0885 11.5988 11.5988 12.0000 12.0986 12.0986 12.1612 12.1612 12.4632 12.4632 12.5910
12.5910 13.1842 13.1842 13.3133 13.3133 13.8564 13.8564 13.8564 13.9626 13.9626 14.2594 14.2594 14.5583 14.5583 14.6969 14.6969 15.2238 15.2238 15.4919 15.5684 15.5684 15.7875 15.7875 15.8011 15.8011 16.1227 16.1227 16.5431 16.5431 17.0207 17.0207
and index is
index = 6 17 28 39 50 61 5 7 16 18 27 29 38 40 4 8 49 51 15 19 26 30 37 41 3 9 14 20 60 62 48 52 25 31 2 10 36 42 13 21 24 32 59 63 72 1 11 47 53 12 22 71 73 35 43 23 33 58 64 46 54 70 74 34 44 57 65 83 45 55 69 75 82 84 81 85 56 66 68 76 80 86 94 67 77 93 95 79 87 92 96 91 97 78 88 90 98 105 104 106 103 107 89 99 102 108 101 109 116 115 117 114 118 100 110 113 119 112 120 111 121
So what is the equivalent of [z,index] in Python?
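Do you need to return the index? If you don't, the first line is enough; the second line gives the index as well:
new_list = sorted(map(abs, z))
index = sorted(range(len(z)), key=lambda k: abs(z[k]))
where new_list is the sorted output, index holds the 0-based positions in the original list (MATLAB's are 1-based), and z is the list.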
EDIT:
Try that now
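If NumPy is available, the same result can be sketched with np.argsort. The five-element z below is a made-up sample rather than the 121 values from the question, and the + 1 is only needed if the MATLAB-style 1-based indices are wanted:

```python
import numpy as np

z = np.array([3.0, -1.0, 0.0, -3.0, 1.0])  # small made-up sample

abs_z = np.abs(z)
# A stable sort keeps tied values in their original order,
# which matches the index output shown in the question.
order = np.argsort(abs_z, kind="stable")
z_sorted = abs_z[order]    # counterpart of MATLAB's sorted z
index_matlab = order + 1   # 1-based indices, as MATLAB reports them
```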

Conditional summing of columns in pandas

I have the following database in Pandas:
Student-ID Last-name First-name HW1 HW2 HW3 HW4 HW5 M1 M2 Final
59118211 Alf Brian 96 90 88 93 96 78 60 59.0
59260567 Anderson Jill 73 83 96 80 84 80 52 42.5
59402923 Archangel Michael 99 80 60 94 98 41 56 0.0
59545279 Astor John 93 88 97 100 55 53 53 88.9
59687635 Attach Zach 69 75 61 65 91 90 63 69.0
I want to add only those columns which have "HW" in them. Any suggestions on how I can do that?
Note: The number of columns containing HW may differ. So I can't reference them directly.
You could call df.filter(regex='HW') to select the columns whose names contain 'HW', and then sum row-wise via sum(axis=1):
In [23]: df
Out[23]:
StudentID Lastname Firstname HW1 HW2 HW3 HW4 HW5 HW6 HW7 M1
0 59118211 Alf Brian 96 90 88 93 96 97 88 10
1 59260567 Anderson Jill 73 83 96 80 84 99 80 100
2 59402923 Archangel Michael 99 80 60 94 98 73 97 50
3 59545279 Astor John 93 88 97 100 55 96 86 60
4 59687635 Attach Zach 69 75 61 65 91 89 82 55
5 59829991 Bake Jake 56 0 77 78 0 79 0 10
In [24]: df.filter(regex='HW').sum(axis=1)
Out[24]:
0 648
1 595
2 601
3 615
4 532
5 290
dtype: int64
John's solution - using df.filter() - is more elegant, but you could also consider a list comprehension ...
df[[x for x in df.columns if 'HW' in x]].sum(axis=1)
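One thing to watch when storing the result back in the frame: regex='HW' matches 'HW' anywhere in a column name, so a total column whose name contains 'HW' would be summed too on a second run. A sketch with an anchored pattern, using a cut-down version of the table and a hypothetical "HW-total" column name:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Last-name": ["Alf", "Anderson"],
        "HW1": [96, 73],
        "HW2": [90, 83],
        "M1": [78, 80],
        "Final": [59.0, 42.5],
    }
)

# Match only names of the form HW<number>, however many there are.
hw_columns = df.filter(regex=r"^HW\d+$").columns.tolist()

# "HW-total" is a made-up name; the anchored pattern will not match it later.
df["HW-total"] = df[hw_columns].sum(axis=1)
```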

Array reshape not mapping correctly to numpy meshgrid

I have a long 121 element array where the data is stored in ascending order and I want to reshape to an 11x11 matrix and so I use the NumPy reshape command
Z = data.attributevalue[2,time,axial,:]
Z = np.reshape(Z, (int(math.sqrt(datacount)), int(math.sqrt(datacount))))
The data should be oriented in a Cartesian plane and I create the mesh grid with the following
x = np.arange(1.75, 12.5, 1)
y = np.arange(1.75, 12.5, 1)
X,Y = np.meshgrid(x, y)
The issue is that the rows of Z are in the wrong order: the data in the last row of the matrix should be in the first, and vice versa. I want to rearrange Z so the rows are filled in the proper manner. The starting array Z is assembled as [datapoint #1, datapoint #2, ..., datapoint #N]. Datapoint #1 should be in the top left and the last point in the bottom right. Is there a simple way of accomplishing this, or do I have to write a function to change the order of the rows?
My plot statement is the following:
surf = self.ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.jet,
linewidth=1, antialiased=True)
***UPDATE****
I tried populating the initial array backwards and still no luck. I changed the orientation of the axis to the following:
y = np.arange(12.5,1,-1)
This flipped the data, but my axis labels are now wrong, so it is not a real solution to my issue. Any ideas?
It is possible that your original array does not look like a 1x121 array. The following code block shows how you reshape an array from 1x121 to 11x11.
import numpy as np
A = np.arange(1,122)
print(A)
print(A.reshape((11,11)))
Gives:
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121]
[[ 1 2 3 4 5 6 7 8 9 10 11]
[ 12 13 14 15 16 17 18 19 20 21 22]
[ 23 24 25 26 27 28 29 30 31 32 33]
[ 34 35 36 37 38 39 40 41 42 43 44]
[ 45 46 47 48 49 50 51 52 53 54 55]
[ 56 57 58 59 60 61 62 63 64 65 66]
[ 67 68 69 70 71 72 73 74 75 76 77]
[ 78 79 80 81 82 83 84 85 86 87 88]
[ 89 90 91 92 93 94 95 96 97 98 99]
[100 101 102 103 104 105 106 107 108 109 110]
[111 112 113 114 115 116 117 118 119 120 121]]
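If the reshape itself is correct and only the row order is inverted, the rows can be reversed without touching the axes. With np.meshgrid's default indexing, row 0 of Y holds the smallest y value, so row 0 of Z is drawn at the bottom of the y range; reversing the rows puts datapoint #1 at the largest y instead. A small sketch on a 3x3 array (the size is only for readability):

```python
import numpy as np

A = np.arange(1, 10).reshape((3, 3))

# Reverse the row order; the order within each row is unchanged.
flipped = np.flipud(A)  # equivalent to A[::-1]
```

Whether this is the right fix depends on how the 121 values are actually ordered in the source data, which the question does not fully show.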
