How to make a histogram using pandas - python

In this problem, a .txt file is read using pandas. The number of genes needs to be calculated, and a histogram needs to be made showing, for a specific sample, the amount of interaction with each gene.
I have tried using .transpose() as well as value_counts() to access the appropriate information; however, because the data of interest sits in a row, and because of the way the table is set up, I cannot figure out how to get the appropriate histogram.
Use Pandas to read the file. Write a program to answer the following questions:
How many samples are in the data set?
How many genes are in the data set?
Which sample has the lowest average expression of genes?
Plot a histogram showing the distribution of the IL6 expression
across all samples.
Data:
protein M-12 M-24 M-36 M-48 M+ANDV-12 M+ANDV-24 M+ANDV-36 M+ANDV-48 M+SNV-12 M+SNV-24 M+SNV-36 M+SNV-48
ARG1 -11.67 -9.92 -4.37 -11.92 -3.62 -9.38 -11.54 -4.88 -3.59 -2.96 -4.95 -4.31
CASP3 0.05 -0.05 -0.18 0.02 0.04 0.14 -0.35 -0.41 0.24 0.23 -0.40 -0.36
CASP7 -1.40 -0.05 -0.78 -1.33 -0.43 0.63 -1.39 -0.95 0.81 1.45 0.09 0.11
CCL22 -0.96 1.47 0.37 -1.48 1.34 2.72 -11.12 -1.05 -0.63 1.42 0.30 0.12
CCL5 -5.59 -3.84 -4.64 -5.84 -5.19 -5.24 -5.45 -5.45 -2.86 -4.53 -4.80 -6.46
CCR7 -11.26 -9.50 -2.96 -11.50 -2.35 -2.31 -11.12 -3.66 -3.18 -1.31 -2.48 -2.84
CD14 2.85 4.14 3.87 4.33 1.16 3.28 3.68 3.74 1.20 2.80 3.23 2.79
CD200R1 -11.67 -9.92 -5.37 -11.92 -4.61 -9.38 -11.54 -11.54 -3.59 -2.96 -4.54 -4.89
CD274 -5.59 -9.92 -4.64 -5.84 -1.78 -3.30 -5.45 -5.45 -4.17 -10.61 -4.80 -4.48
CD80 -6.57 -9.50 -4.96 -6.82 -6.17 -4.28 -6.43 -6.43 -3.18 -5.51 -5.12 -4.16
CD86 0.14 0.94 0.87 1.12 -0.23 0.58 1.09 0.66 -0.15 0.42 0.74 0.49
CXCL10 -6.57 -2.85 -4.96 -6.82 -4.20 -2.31 -4.47 -4.47 -2.38 -2.74 -5.12 -4.67
CXCL11 -5.28 -9.50 -5.63 -11.50 -10.85 -8.97 -11.12 -11.12 -9.83 -10.20 -5.79 -6.14
IDO1 -5.02 -9.92 -4.37 -5.26 -4.61 -2.72 -4.88 -4.88 -2.60 -3.96 -4.54 -5.88
IFNA1 -11.67 -9.92 -5.37 -5.26 -11.27 -9.38 -11.54 -4.88 -3.59 -10.61 -6.52 -5.88
IFNB1 -11.67 -9.92 -6.35 -11.92 -11.27 -9.38 -11.54 -11.54 -10.25 -10.61 -12.19 -12.54
IFNG -2.09 -1.21 -1.66 -2.24 -2.75 -2.50 -2.83 -3.22 -2.48 -1.60 -2.13 -2.48
IFR3 -0.39 0.05 -0.21 0.15 -0.27 0.07 -0.01 -0.11 -0.28 0.28 0.04 -0.09
IL10 -1.53 -0.21 -0.51 0.45 -3.40 -1.00 -0.51 -0.04 -2.38 -1.55 -0.25 -0.72
IL12A -11.67 -9.92 -4.79 -11.92 -3.30 -3.71 -11.54 -11.54 -10.25 -3.38 -4.22 -4.09
IL15 -1.91 -2.53 -3.50 -3.85 -2.75 -9.38 -4.15 -4.15 -2.19 -2.09 -2.81 -3.16
IL1A -4.28 -2.53 -2.26 -3.39 -2.12 -0.51 -11.54 -2.67 -1.73 -1.75 -2.13 -1.84
IL1B -1.61 -2.53 -0.31 -0.16 0.77 -3.30 -1.95 -0.21 -1.73 -2.55 -0.65 -0.64
IL1RN 3.14 -0.40 -1.54 -3.53 3.95 0.76 0.15 -3.15 3.34 0.95 -1.23 -1.02
IL6 -4.60 -0.21 -1.82 -3.53 -1.25 0.76 -11.12 -2.47 -0.94 -0.60 -1.61 -1.74
IL8 5.43 5.04 4.57 4.22 5.67 5.06 4.30 4.53 4.84 4.53 4.25 3.79
IRF7 0.14 0.97 -0.13 -0.72 0.83 1.85 -0.19 -0.19 1.01 0.62 0.07 -0.03
ITGAM -1.68 0.91 0.28 -0.12 0.67 1.73 -0.30 -0.07 1.21 1.28 0.71 1.21
NFKB1 0.80 0.31 0.29 0.43 1.21 -0.74 0.39 0.02 0.15 -0.02 0.01 -0.09
NOS2 -11.26 -3.52 -4.50 -5.52 -4.87 -2.98 -5.14 -5.14 -3.85 -4.22 -5.79 -6.14
PPARG 0.68 0.23 0.02 -1.16 0.56 1.38 0.80 -0.95 1.17 1.04 1.09 0.94
TGFB1 3.99 3.21 2.41 2.62 4.05 3.48 2.87 2.15 3.68 2.97 2.46 2.31
TLR3 -3.61 -1.85 -1.72 -11.92 -2.40 -1.32 -11.54 -11.54 -0.57 0.09 -1.32 -1.60
TLR7 -3.80 -2.05 -1.64 -0.35 -6.17 -4.28 -2.47 -1.75 -3.18 -3.54 -1.86 -2.84
TNF 1.09 0.53 0.71 1.17 1.91 0.58 1.04 1.41 1.20 1.18 1.13 0.66
VEGFA -2.36 -2.85 -3.64 -3.53 -3.40 -4.28 -4.47 -4.47 -5.15 -5.51 -4.32 -4.67
import pandas as pd

df = pd.read_csv('../Data/virus_miniset0.txt', sep='\t')
len(df['Sample'])
df

Set the index in order to properly transpose:
In tabular data, the top row should contain the name of each column.
In this data, the first header was named 'Sample', with all the M-prefixed names being the samples.
'Sample' was renamed to 'protein' to properly identify the column.
Current Data:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df.set_index('protein', inplace=True)
Transpose:
df_sample = df.T
df_sample.reset_index(inplace=True)
df_sample.rename(columns={'index': 'sample'}, inplace=True)
df_sample.set_index('sample', inplace=True)
How many samples:
len(df_sample.index)
>>> 12
How many proteins / genes:
len(df_sample.columns)
>>> 36
Lowest average expression:
Find the mean, then find the min. df_sample.mean().min() works, but returns only the value without the protein name; df_sample.mean().idxmin() returns the name directly. To see both the name and the value:
protein_avg = df_sample.mean()
protein_avg[protein_avg == protein_avg.min()]
>>> protein
IFNB1 -10.765
dtype: float64
The following boxplot of all genes confirms IFNB1 as the protein with the lowest average expression across samples, and shows IL8 as the protein with the highest average expression.
Boxplot:
Use seaborn to make the plots look nicer:
plt.figure(figsize=(12, 8))
g = sns.boxplot(data=df_sample)
for item in g.get_xticklabels():
    item.set_rotation(90)
plt.show()
Alternate Boxplot:
plt.figure(figsize=(8, 8))
sns.boxplot(y='IL6', data=df_sample, orient='v')
plt.show()
IL6 Histogram:
sns.distplot(df_sample.IL6)  # distplot is deprecated in newer seaborn; use sns.histplot(df_sample.IL6)
plt.show()
Bonus Plot - Heatmap:
I thought you might like this
plt.figure(figsize=(20, 8))
sns.heatmap(df_sample, annot=True, annot_kws={"size": 7}, cmap='PiYG')
plt.show()
M-12 and M+SNV-48 are rendered at only half height in the plot; this is a known matplotlib issue, resolved in matplotlib v3.1.2.
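The steps above can be condensed into one runnable sketch; here a few rows of the posted data are inlined via StringIO in place of the original file path, and idxmin gives the lowest-mean protein directly:

```python
from io import StringIO
import pandas as pd

# Three rows of the posted data, tab-separated, standing in for the full file.
data = (
    "protein\tM-12\tM-24\tM-36\tM-48\tM+ANDV-12\tM+ANDV-24\tM+ANDV-36\tM+ANDV-48\t"
    "M+SNV-12\tM+SNV-24\tM+SNV-36\tM+SNV-48\n"
    "IFNB1\t-11.67\t-9.92\t-6.35\t-11.92\t-11.27\t-9.38\t-11.54\t-11.54\t-10.25\t-10.61\t-12.19\t-12.54\n"
    "IL6\t-4.60\t-0.21\t-1.82\t-3.53\t-1.25\t0.76\t-11.12\t-2.47\t-0.94\t-0.60\t-1.61\t-1.74\n"
    "IL8\t5.43\t5.04\t4.57\t4.22\t5.67\t5.06\t4.30\t4.53\t4.84\t4.53\t4.25\t3.79\n"
)

df = pd.read_csv(StringIO(data), sep='\t').set_index('protein')
df_sample = df.T.rename_axis('sample')   # rows = samples, columns = proteins

n_samples = len(df_sample.index)         # 12 samples
n_genes = len(df_sample.columns)         # 3 here; 36 in the full data
lowest = df_sample.mean().idxmin()       # protein with the lowest mean expression
print(n_samples, n_genes, lowest)
```

From here, `sns.histplot(df_sample.IL6)` produces the IL6 histogram as above.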


Tensorflow dataframe lambda modify the type of object

So I got a dataframe that holds fighter stats stored as floats:
fighers_clean.dtypes
sig_str_abs_pM float64
sig_str_def_pct float64
sig_str_land_pM float64
sig_str_land_pct float64
sub_avg float64
td_avg float64
td_def_pct float64
td_land_pct float64
win% float64
fighers_clean
sig_str_abs_pM sig_str_def_pct sig_str_land_pM sig_str_land_pct sub_avg td_avg td_def_pct td_land_pct win%
name
Hunter Azure 1.57 0.56 4.00 0.50 2.5 2.00 0.75 0.33 1.000000
Jessica Eye 3.36 0.60 3.51 0.36 0.7 0.51 0.56 0.50 0.666667
Rolando Dy 4.47 0.52 3.04 0.37 0.0 0.30 0.68 0.20 0.529412
Gleidson Cutis 8.28 0.59 2.99 0.52 0.0 0.00 0.00 0.00 0.700000
Damien Brown 4.86 0.50 3.66 0.38 0.7 0.68 0.53 0.27 0.586207
... ... ... ... ... ... ... ... ... ...
Xiaonan Yan 4.22 0.64 6.85 0.40 0.0 0.25 0.66 0.50 0.916667
Alexander Yakovlev 2.44 0.58 1.79 0.47 0.2 1.56 0.72 0.33 0.705882
Rani Yahya 1.61 0.52 1.59 0.36 2.1 2.92 0.22 0.32 0.722222
Eddie Yagin 5.77 0.42 3.13 0.30 1.0 0.00 0.62 0.00 0.727273
Jamie Yager 2.55 0.63 3.08 0.39 0.0 0.00 0.66 0.00 0.714286
With these lines I'm trying to add data to another dataframe that holds stats about matches:
for col in statistics:
    matches_clean[col] = matches_clean.apply(
        lambda row: fighers_clean.loc[row["fighter_1"], col] - fighers_clean.loc[row["fighter_2"], col],
        axis=1)
matches_clean.dtypes
fighter_1 object
fighter_2 object
result int64
sig_str_abs_pM object
sig_str_def_pct object
sig_str_land_pM object
sig_str_land_pct object
sub_avg object
td_avg object
td_def_pct object
td_land_pct object
win% object
dtype: object
fighter_1 fighter_2 result sig_str_abs_pM sig_str_def_pct sig_str_land_pM sig_str_land_pct sub_avg td_avg td_def_pct td_land_pct win%
fight_id
d7cbe2f23d75afd1 Julio Arce Hakeem Dawodu 0 0.56 0.03 -1.03 -0.1 0.6 0.58 0.08 0.3 -0.046154
f0418c2c989a5cde Grigorii Popov Davey Grant 0 2.52 -0.1 0.5 -0.15 0.0 -2.64 -0.19 -0.47 0.079167
fc16ccf0994c6e50 Jack Shore Nohelin Hernandez 1 -2.16 0.31 2.26 0.16 1.2 3.96 -0.37 0.07 0.285714
18e1b0df8da7010e Vanessa Melo Tracy Cortez 0 4.17 -0.05 -1.23 -0.26 0.0 -3.0 -0.23 -0.37 -0.286765
57ff0eb2351979c4 Khalid Taha Bruno Silva 1 -0.63 -0.11 1.1 0.1 0.5 -2.31 -0.37 -0.18 0.1875
This later causes the error ValueError: setting an array element with a sequence. at the line X_train_scaled = scaler.fit_transform(X_train):
# get ready for deep learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = matches_clean.iloc[:, 1:], matches_clean.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# normalization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train.dtypes
I'm pretty sure it's because the floats are converted to object during the lambda function.
Do you know why the lambda changes the returned values, and how to avoid that?
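One common cause of float columns turning into object after such an apply is a non-unique fighter-name index: .loc then returns a Series instead of a scalar, so the cells hold Series objects. The sketch below is a hypothetical two-row reproduction (names taken from the table above, columns illustrative); with unique names the lookups return scalars and the dtype stays float, and pd.to_numeric is one way to coerce a column that has already become object:

```python
import pandas as pd

# Hypothetical minimal frames mirroring fighers_clean / matches_clean.
fighters = pd.DataFrame({"td_avg": [2.00, 0.51]},
                        index=["Hunter Azure", "Jessica Eye"])
matches = pd.DataFrame({"fighter_1": ["Hunter Azure"],
                        "fighter_2": ["Jessica Eye"]})

# With a unique name index, .loc returns scalars and the result stays float64.
matches["td_avg"] = matches.apply(
    lambda row: fighters.loc[row["fighter_1"], "td_avg"]
              - fighters.loc[row["fighter_2"], "td_avg"], axis=1)
print(matches["td_avg"].dtype)

# If a column has already been coerced to object, convert it back:
matches["td_avg"] = pd.to_numeric(matches["td_avg"])
```

Checking `fighers_clean.index.is_unique` would confirm or rule out the duplicate-name explanation.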

Reading in a .txt file to get time series from rows of years and columns of monthly values

How could I read in a txt file like the one from
https://psl.noaa.gov/data/correlation/pna.data (example below)
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
into a pandas dataframe to plot as a time series, for example from 1960-1965, with each value column (corresponding to a month) being plotted? I rarely work with .txt files.
Here's what you can try:
import pandas as pd
import requests
import re
aa = requests.get("https://psl.noaa.gov/data/correlation/pna.data").text
aa = aa.split("\n")[1:-4]            # drop the header line and trailing footer lines
aa = list(map(lambda x: x[1:], aa))  # drop the leading character of each line
aa = "\n".join(aa)
aa = re.sub(" +", ",", aa)           # collapse runs of spaces into commas
with open("test.csv", "w") as f:
    f.write(aa)
df = pd.read_csv("test.csv", header=None, index_col=0).rename_axis('Year')
df.columns = list(pd.date_range(start='2021-01', freq='M', periods=12).month_name())
print(df.head())
df.to_csv("test.csv")
This gives you, in the test.csv file, a table with a Year index running from 1948 through 2021 and month-name columns from January through December.
Use pd.read_fwf as suggested by @SanskarSingh:
>>> pd.read_fwf('data.txt', header=None, index_col=0).rename_axis('Year')
1 2 3 4 5 6 7 8 9 10 11 12
Year
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
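To go from the year-by-month frame to a single plottable 1960-1965 series, one sketch (reading the sample rows inlined here with a whitespace separator) is to stack the twelve monthly columns into a datetime-indexed Series:

```python
from io import StringIO
import pandas as pd

# The sample rows from the question, inlined in place of data.txt.
text = """\
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56"""

df = pd.read_csv(StringIO(text), sep=r"\s+", header=None, index_col=0).rename_axis("Year")
ts = df.stack()  # MultiIndex of (year, month number 1..12)
ts.index = pd.to_datetime([f"{y}-{m:02d}" for y, m in ts.index])
print(ts.head())
```

After this, `ts.plot()` draws the whole monthly series in order.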

How can I get the average of the 3rd to 7th highest selection from three pandas columns?

id MaxS MaxA MaxD
43290 9.511364 2.70 0.27
43290 7.547727 2.56 0.34
43290 7.465909 2.66 0.48
43290 7.404545 3.90 0.60
43290 7.772727 2.38 0.11
43290 7.936364 2.62 0.97
43290 7.650000 4.20 1.64
43290 3.088636 1.79 0.06
43290 4.377273 2.19 0.05
43290 6.750000 4.65 1.90
43290 5.461364 2.82 0.19
43290 7.363636 4.13 1.48
43290 11.270455 3.72 0.41
43290 10.186364 3.88 1.17
43290 3.109091 2.05 0.02
43290 7.834091 3.38 0.01
43290 3.252273 2.31 0.03
43290 7.854545 3.00 0.70
43290 9.756818 3.26 0.54
43290 6.954545 2.93 0.24
43291 4.070455 1.21 0.21
43291 6.034091 3.42 0.42
43291 8.018182 2.41 0.66
43291 7.956818 3.55 0.62
43291 8.161364 2.74 0.64
43291 8.263636 4.11 0.13
43291 2.618182 1.80 0.08
43291 2.168182 2.12 0.04
43291 6.095455 3.04 0.11
43291 9.061364 2.91 0.33
45880 5.236364 2.43 0.15
45880 14.972727 4.86 0.23
45880 9.593182 4.48 1.36
45880 4.459091 3.67 0.14
45880 17.325000 4.21 0.44
45880 11.086364 3.30 1.00
45880 5.277273 2.25 0.12
45880 7.547727 2.92 0.34
45880 11.270455 3.33 0.03
45880 13.990909 3.21 0.50
45880 9.122727 3.86 1.14
45880 6.790909 4.24 1.30
45880 8.100000 4.31 0.80
45880 5.809091 3.22 0.94
45881 6.565909 3.50 0.86
45881 10.452273 4.64 0.85
45881 7.281818 3.47 0.71
45881 9.347727 3.67 0.02
45881 14.318182 3.97 0.51
45881 5.481818 3.99 0.21
45881 7.425000 3.93 1.65
45881 8.836364 3.50 0.26
45881 5.277273 2.21 0.57
45881 12.865909 4.38 0.94
45881 7.200000 2.86 0.45
45881 7.138636 4.39 1.18
45881 8.815909 4.34 0.34
45881 9.490909 4.53 0.28
45881 17.652273 4.59 0.05
45881 11.106818 2.64 0.31
45881 9.511364 3.83 1.14
45881 8.284091 3.90 0.20
45881 9.306818 3.54 0.22
45881 5.195455 2.66 0.14
45881 3.477273 2.50 0.16
45881 7.179545 3.70 0.08
45881 8.447727 3.19 0.32
45881 4.990909 2.32 0.86
45881 16.465909 4.28 0.25
Hi all, as you can see I have the table above in a pandas dataframe. For every id, I would like to take the 3rd through 7th largest values of MaxS, MaxA and MaxD and average them. I know you can use head or nlargest to get the largest values in these columns, but I am not sure how to get the 3rd through 7th largest for each id; also, nlargest on multiple columns throws an error, so I'm not sure how to proceed.
Would really appreciate it if someone could help me find the average of the 3rd through 7th largest values in each of the three columns (MaxS, MaxA and MaxD) for each id.
Thank you!
Converting the dataframe column to a numpy array can work:
import pandas as pd
import numpy as np

def _get_average(column):
    """Sort descending and average the 3rd through 7th largest values."""
    # Indices 2..6 of the descending sort are the 3rd through 7th largest.
    return np.mean(np.sort(column.to_numpy())[::-1][2:7])

def average_csv():
    """Read the data as csv and average the desired fields."""
    my_df = pd.read_csv("my_csv.csv", sep=r"\s+")
    return (_get_average(my_df["MaxS"]),
            _get_average(my_df["MaxA"]),
            _get_average(my_df["MaxD"]))

if __name__ == "__main__":
    print("MaxS: {}, MaxA: {}, MaxD: {}".format(*average_csv()))
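The function above averages each column over the whole file; since the question asks for the 3rd through 7th largest per id, a groupby-based variant might look like this sketch (using an illustrative, rounded subset of the posted MaxS values):

```python
import pandas as pd

# Rounded subset of the posted data, enough to exercise two ids.
df = pd.DataFrame({
    "id":   [43290] * 8 + [43291] * 8,
    "MaxS": [9.51, 7.55, 7.47, 7.40, 7.77, 7.94, 7.65, 3.09,
             4.07, 6.03, 8.02, 7.96, 8.16, 8.26, 2.62, 2.17],
})

# nlargest(7) keeps the seven biggest values per group; iloc[2:] drops the
# top two, leaving the 3rd through 7th largest, which we then average.
result = df.groupby("id")["MaxS"].apply(lambda s: s.nlargest(7).iloc[2:].mean())
print(result)

# For several columns at once:
# df.groupby("id")[["MaxS", "MaxA", "MaxD"]].agg(lambda s: s.nlargest(7).iloc[2:].mean())
```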

How to use `.to_string` method of `pd.DataFrame` to return string with actual `\t` characters?

I want to recreate the matrix string with the .to_string method of pd.DataFrame.
I can't figure out how to use the built-in .to_string method to achieve a format that contains the \t characters matching the input format.
Can this be done or do I need to create a custom parser?
matrix = """
#Names\tcol1\tcol2\tcol3\tcol4\tcol5\tcol6\tcol7
A\t-1.23\t-0.81\t1.79\t0.78\t-0.42\t-0.69\t0.58
B\t-1.76\t-0.94\t1.16\t0.36\t0.41\t-0.35\t1.12
C\t-2.19\t0.13\t0.65\t-0.51\t0.52\t1.04\t0.36
D\t-1.22\t-0.98\t0.79\t-0.76\t-0.29\t1.54\t0.93
E\t-1.47\t-0.83\t0.85\t0.07\t-0.81\t1.53\t0.65
F\t-1.04\t-1.11\t0.87\t-0.14\t-0.80\t1.74\t0.48
G\t-1.57\t-1.17\t1.29\t0.23\t-0.20\t1.17\t0.26
H\t-1.53\t-1.25\t0.59\t-0.30\t0.32\t1.41\t0.77
"""
from io import StringIO
import pandas as pd

df = pd.read_table(StringIO(matrix), index_col=0)
df.to_string()
# ' col1 col2 col3 col4 col5 col6 col7\n#Names \nA -1.23 -0.81 1.79 0.78 -0.42 -0.69 0.58\nB -1.76 -0.94 1.16 0.36 0.41 -0.35 1.12\nC -2.19 0.13 0.65 -0.51 0.52 1.04 0.36\nD -1.22 -0.98 0.79 -0.76 -0.29 1.54 0.93\nE -1.47 -0.83 0.85 0.07 -0.81 1.53 0.65\nF -1.04 -1.11 0.87 -0.14 -0.80 1.74 0.48\nG -1.57 -1.17 1.29 0.23 -0.20 1.17 0.26\nH -1.53 -1.25 0.59 -0.30 0.32 1.41 0.77'
pd.__version__
'0.22.0'
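to_string has no separator parameter, so one option, a sketch using to_csv rather than to_string, relies on the fact that to_csv returns a string when called without a path and accepts sep='\t':

```python
from io import StringIO
import pandas as pd

# A small tab-delimited matrix in the same shape as the question's.
matrix = "#Names\tcol1\tcol2\nA\t-1.23\t-0.81\nB\t-1.76\t-0.94\n"
df = pd.read_table(StringIO(matrix), index_col=0)

# With no path argument, to_csv returns the rendered string; sep="\t"
# reproduces the tab-delimited layout, index label included.
out = df.to_csv(sep="\t")
print(out)
```

This round-trips the input exactly (modulo the platform line terminator), which to_string cannot do.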

DataFrame of means of top N most correlated columns

I have a dataframe df1 where each column represents a time series of returns. I want to create a new dataframe df2 with columns that corresponds to each of the columns in df1 where the column in df2 is defined to be the average of the top 5 most correlated columns in df1.
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
                   columns=list(ascii_letters[26:36]))
print(df1.head())
A B C D E F G H I J
0 -2.13 -1.27 -1.97 -2.26 -0.35 -0.03 0.32 0.35 0.72 0.77
1 -0.61 0.35 -0.35 -0.42 -0.91 -0.14 0.75 -1.50 0.61 0.40
2 -0.96 1.49 -0.35 -1.47 1.06 1.06 0.59 0.30 -0.77 0.83
3 1.49 0.26 -0.90 0.38 -0.52 0.05 0.95 -1.03 0.95 0.73
4 1.24 0.16 -1.34 0.16 1.26 0.78 1.34 -1.64 -0.20 0.13
I expect the head of the resulting dataframe rounded to 2 places to look like:
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43
For each column in the correlation matrix, take the six largest values and ignore the first one (a column is 100% correlated with itself). Use a dictionary comprehension to do this for each column.
Use another dictionary comprehension to locate these columns in df1 and take their mean. Create a dataframe from the result, and reorder the columns to match those of df1 by appending [df1.columns].
corr = df1.corr()
most_correlated_cols = {col: corr[col].nlargest(6)[1:].index for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
                    for col in df1})[df1.columns]
>>> df2.head()
A B C D E F G H I J
0 -0.782 -0.698 -0.526 -0.452 -0.994 -0.102 -0.472 -0.856 -0.310 -0.638
1 -0.486 -0.106 -0.454 -0.032 -0.042 0.100 -0.258 0.108 -0.064 -0.102
2 0.026 0.132 0.544 0.330 -0.130 0.272 0.224 0.320 0.414 0.274
3 -0.224 0.128 0.186 0.582 0.626 0.242 0.344 0.506 0.318 0.224
4 -0.044 0.310 0.230 0.518 0.428 0.238 0.068 0.306 0.734 0.432
%%timeit
corr = df1.corr()
most_correlated_cols = {col: corr[col].nlargest(6)[1:].index for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
                    for col in df1})[df1.columns]
100 loops, best of 3: 10 ms per loop
%%timeit
corr = df1.corr()
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
100 loops, best of 3: 16 ms per loop
Setup
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
                   columns=list(ascii_letters[26:36]))
Solution
corr = df1.corr()

# I don't want a security's correlation with itself to be included.
# Because `corr` is symmetrical, I can assume that a series' name will be in its index.
def remove_self(x):
    return x.loc[x.index != x.name]

# This utilizes `remove_self`, then sorts by correlation and returns the index.
def argsort(x):
    return pd.Series(remove_self(x).sort_values(ascending=False).index)

# This reaches into the dataframe, gets all columns identified in x,
# then takes the mean.
def avg_of(x, df):
    return df.loc[:, x].mean(axis=1)

# Putting it all together.
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
print(df2.round(2).head())
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43
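As a quick sanity check, the two approaches above can be run side by side on the same seeded data; a self-contained sketch:

```python
import pandas as pd
import numpy as np
from string import ascii_letters

np.random.seed([3, 1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
                   columns=list(ascii_letters[26:36]))

corr = df1.corr()

# Approach 1: dictionary comprehensions over nlargest(6)[1:].
most = {col: corr[col].nlargest(6)[1:].index for col in corr}
a = pd.DataFrame({col: df1.loc[:, most[col]].mean(axis=1)
                  for col in df1})[df1.columns]

# Approach 2: corr.apply(argsort).head(5) pipeline.
def remove_self(x):
    return x.loc[x.index != x.name]

def argsort(x):
    return pd.Series(remove_self(x).sort_values(ascending=False).index)

b = corr.apply(argsort).head(5).apply(lambda x: df1.loc[:, x].mean(axis=1))

print(np.allclose(a, b))
```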
