DataFrame of means of top N most correlated columns

DataFrame of means of top N most correlated columns - python

I have a dataframe df1 where each column represents a time series of returns. I want to create a new dataframe df2 with columns that corresponds to each of the columns in df1 where the column in df2 is defined to be the average of the top 5 most correlated columns in df1.
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
print df1.head()
A B C D E F G H I J
0 -2.13 -1.27 -1.97 -2.26 -0.35 -0.03 0.32 0.35 0.72 0.77
1 -0.61 0.35 -0.35 -0.42 -0.91 -0.14 0.75 -1.50 0.61 0.40
2 -0.96 1.49 -0.35 -1.47 1.06 1.06 0.59 0.30 -0.77 0.83
3 1.49 0.26 -0.90 0.38 -0.52 0.05 0.95 -1.03 0.95 0.73
4 1.24 0.16 -1.34 0.16 1.26 0.78 1.34 -1.64 -0.20 0.13
I expect the head of the resulting dataframe rounded to 2 places to look like:
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43

For each column in the correlation matrix, take the six largest and ignore the first one (i.e. 100% correlated with itself). Use a dictionary comprehension to do this for each column.
Use another dictionary comprehension to located this columns in df1 and take their mean. Create a dataframe from the result, and reorder the columns to match those of df1 by appending [df1.columns].
corr = df1.corr()
most_correlated_cols = {col: corr[col].nlargest(6)[1:].index
for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
for col in df1})[df1.columns]
>>> df2.head()
A B C D E F G H I J
0 -0.782 -0.698 -0.526 -0.452 -0.994 -0.102 -0.472 -0.856 -0.310 -0.638
1 -0.486 -0.106 -0.454 -0.032 -0.042 0.100 -0.258 0.108 -0.064 -0.102
2 0.026 0.132 0.544 0.330 -0.130 0.272 0.224 0.320 0.414 0.274
3 -0.224 0.128 0.186 0.582 0.626 0.242 0.344 0.506 0.318 0.224
4 -0.044 0.310 0.230 0.518 0.428 0.238 0.068 0.306 0.734 0.432
%%timeit
corr = df1.corr()
most_correlated_cols = {
col: corr[col].nlargest(6)[1:].index
for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
for col in df1})[df1.columns]
100 loops, best of 3: 10 ms per loop
%%timeit
corr = df1.corr()
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
100 loops, best of 3: 16 ms per loop

Setup
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
Solution
corr = df.corr()
# I don't want a securities correlation with itself to be included.
# Because `corr` is symmetrical, I can assume that a series' name will be in its index.
def remove_self(x):
return x.loc[x.index != x.name]
# This builds utilizes `remove_self` then sorts by correlation
# and returns the index.
def argsort(x):
return pd.Series(remove_self(x).sort_values(ascending=False).index)
# This reaches into `df` and gets all columns identified in x
# then takes the mean.
def avg_of(x, df):
return df.loc[:, x].mean(axis=1)
# Putting it all together.
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df))
print df2.round(2).head()
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43

Related

How to add sum() and mean() value above the df column values in the same line?

Supposed we have a df with a sum() value in the below DataFrame, thanks so much for #jezrael 's answer here, now sum value is in the first line, and avg value is the second line, but it's ugly, how to let sum value and avg value in the same column and with index name:Total? Also place it in the first line as below
# Total 27.56 25.04 -1.31
code in pandas is as below:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df[['value_a','value_b']].sum().to_frame().T
df2 = df[['difference']].mean().to_frame().T
df = pd.concat([df1,df2, df], ignore_index=True)
df
value_a value_b name up_or_down difference
project_name
27.56 25.04
-1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Thanks so much for any advice

Use DataFrame.agg with dictionary for aggregate functions:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df.agg({'value_a':'sum', 'value_b':'sum', 'difference':'mean'}).to_frame('Total').T
df = pd.concat([df1,df])
print (df.head())
value_a value_b difference name up_or_down
Total 27.56 25.04 -0.035405 NaN NaN
2021-project11 0.43 0.48 0.050000 2021-project11 up
2021-project1 0.62 0.56 -0.060000 2021-project1 down
2021-project2 0.51 0.47 -0.040000 2021-project2 down
2021-porject3 0.37 0.34 -0.030000 2021-porject3 down

Add column using groupby in multiindex Pandas

I am trying to find the sum of a column based upon the groupby function. so in this example i want to find the sum of all the bar, baz, foo, and qux.
the sum would be added to a new column at the end. i can get the results i need but i can not join it back to the dataframe.
import numpy as np
import pandas as pd
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
np.random.seed(7)
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
results=df.groupby(level=[0]).sum(axis=1)
col_names=results.columns.values
hold=[]
for i in col_names:
hold.append('sum_'+str(i))
results.columns=hold
df=pd.concat([df,results],axis=1)
Desired result below. thanks for looking
0 1 2 3 sum_0 sum_1 sum_2 sum_3
bar one 1.69 (0.47) 0.03 0.41 0.90 (0.46) 0.03 (1.35)
bar two (0.79) 0.00 (0.00) (1.75) 0.90 (0.46) 0.03 (1.35)
baz one 1.02 0.60 (0.63) (0.17) 1.52 0.34 (0.87) (1.62)
baz two 0.51 (0.26) (0.24) (1.45) 1.52 0.34 (0.87) (1.62)
foo one 0.55 0.12 0.27 (1.53) 2.21 0.28 (0.11) 0.50
foo two 1.65 0.15 (0.39) 2.03 2.21 0.28 (0.11) 0.50
qux one (0.05) (1.45) (0.41) (2.29) 1.00 (1.87) (1.15) (1.22)
qux two 1.05 (0.42) (0.74) 1.07 1.00 (1.87) (1.15) (1.22)

Use transform instead, you can rid your code of that loop.
df = pd.concat([df, df.groupby(level=0).transform('sum').add_prefix('sum_')], axis=1)
df
0 1 2 3 sum_0 sum_1 sum_2 sum_3
bar one 1.69 -0.47 0.03 0.41 0.90 -0.46 0.03 -1.35
two -0.79 0.00 -0.00 -1.75 0.90 -0.46 0.03 -1.35
baz one 1.02 0.60 -0.63 -0.17 1.52 0.34 -0.87 -1.62
two 0.51 -0.26 -0.24 -1.45 1.52 0.34 -0.87 -1.62
foo one 0.55 0.12 0.27 -1.53 2.21 0.28 -0.11 0.50
two 1.65 0.15 -0.39 2.03 2.21 0.28 -0.11 0.50
qux one -0.05 -1.45 -0.41 -2.29 1.00 -1.87 -1.15 -1.22
two 1.05 -0.42 -0.74 1.07 1.00 -1.87 -1.15 -1.22

How to use `.to_string` method of `pd.DataFrame` to return string with actual `\t` characters?

I want to recreate the matrix string with the .to_string method of pd.DataFrame.
I can't figure out how to use the built in .to_string method to achieve a format that contains the \t characters matching the input format?
Can this be done or do I need to create a custom parser?
matrix = """
#Names\tcol1\tcol2\tcol3\tcol4\tcol5\tcol6\tcol7
A\t-1.23\t-0.81\t1.79\t0.78\t-0.42\t-0.69\t0.58
B\t-1.76\t-0.94\t1.16\t0.36\t0.41\t-0.35\t1.12
C\t-2.19\t0.13\t0.65\t-0.51\t0.52\t1.04\t0.36
D\t-1.22\t-0.98\t0.79\t-0.76\t-0.29\t1.54\t0.93
E\t-1.47\t-0.83\t0.85\t0.07\t-0.81\t1.53\t0.65
F\t-1.04\t-1.11\t0.87\t-0.14\t-0.80\t1.74\t0.48
G\t-1.57\t-1.17\t1.29\t0.23\t-0.20\t1.17\t0.26
H\t-1.53\t-1.25\t0.59\t-0.30\t0.32\t1.41\t0.77
"""
df = pd.read_table(StringIO(matrix), index_col=0)
df.to_string()
# ' col1 col2 col3 col4 col5 col6 col7\n#Names \nA -1.23 -0.81 1.79 0.78 -0.42 -0.69 0.58\nB -1.76 -0.94 1.16 0.36 0.41 -0.35 1.12\nC -2.19 0.13 0.65 -0.51 0.52 1.04 0.36\nD -1.22 -0.98 0.79 -0.76 -0.29 1.54 0.93\nE -1.47 -0.83 0.85 0.07 -0.81 1.53 0.65\nF -1.04 -1.11 0.87 -0.14 -0.80 1.74 0.48\nG -1.57 -1.17 1.29 0.23 -0.20 1.17 0.26\nH -1.53 -1.25 0.59 -0.30 0.32 1.41 0.77'
pd.__version__
'0.22.0'

Python(Pandas) - Create a column by matching column's values into dataframe

I have the below assumed dataframe
a b c d e F
0.02 0.62 0.31 0.67 0.27 a
0.30 0.07 0.23 0.42 0.00 a
0.82 0.59 0.34 0.73 0.29 a
0.90 0.80 0.13 0.14 0.07 d
0.50 0.62 0.94 0.34 0.53 d
0.59 0.84 0.95 0.42 0.54 d
0.13 0.33 0.87 0.20 0.25 d
0.47 0.37 0.84 0.69 0.28 e
Column F represents the columns of the dataframe.
For each row of column F I want to find relevant row and column from the rest of the dataframe and return the values into one column
The outcome will look like this:
a b c d e f To_Be_Filled
0.02 0.62 0.31 0.67 0.27 a 0.02
0.30 0.07 0.23 0.42 0.00 a 0.30
0.82 0.59 0.34 0.73 0.29 a 0.82
0.90 0.80 0.13 0.14 0.07 d 0.14
0.50 0.62 0.94 0.34 0.53 d 0.34
0.59 0.84 0.95 0.42 0.54 d 0.42
0.13 0.33 0.87 0.20 0.25 d 0.20
0.47 0.37 0.84 0.69 0.28 e 0.28
I am able to identify each case with the below, but not sure how to do it across the whole dataframe.
test.loc[test.iloc[:,5]==a,test.columns==a]
Many thanks in advance.

You can use lookup:
df['To_Be_Filled'] = df.lookup(np.arange(len(df)), df['F'])
df
Out:
a b c d e F To_Be_Filled
0 0.02 0.62 0.31 0.67 0.27 a 0.02
1 0.30 0.07 0.23 0.42 0.00 a 0.30
2 0.82 0.59 0.34 0.73 0.29 a 0.82
3 0.90 0.80 0.13 0.14 0.07 d 0.14
4 0.50 0.62 0.94 0.34 0.53 d 0.34
5 0.59 0.84 0.95 0.42 0.54 d 0.42
6 0.13 0.33 0.87 0.20 0.25 d 0.20
7 0.47 0.37 0.84 0.69 0.28 e 0.28
np.arange(len(df)) can be replaced with df.index.

Group by - select most recent 4 events

I have the following df in pandas:
df:
DATE STOCK DATA1 DATA2 DATA3
01/01/12 ABC 0.40 0.88 0.22
04/01/12 ABC 0.50 0.49 0.13
07/01/12 ABC 0.85 0.36 0.83
10/01/12 ABC 0.28 0.12 0.39
01/01/13 ABC 0.86 0.87 0.58
04/01/13 ABC 0.95 0.39 0.87
07/01/13 ABC 0.60 0.25 0.56
10/01/13 ABC 0.15 0.28 0.69
01/01/11 XYZ 0.94 0.40 0.50
04/01/11 XYZ 0.65 0.19 0.81
07/01/11 XYZ 0.89 0.59 0.69
10/01/11 XYZ 0.12 0.09 0.18
01/01/12 XYZ 0.25 0.94 0.55
04/01/12 XYZ 0.07 0.22 0.67
07/01/12 XYZ 0.46 0.08 0.54
10/01/12 XYZ 0.04 0.03 0.94
...
I want to group by the stocks, sort by date and then for specified columns (in this case DATA1 and DATA3), I want to get the last four items summed (TTM data).
The output would look like this:
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
01/01/12 ABC 0.40 0.88 0.22 NaN NaN
04/01/12 ABC 0.50 0.49 0.13 NaN NaN
07/01/12 ABC 0.85 0.36 0.83 NaN NaN
10/01/12 ABC 0.28 0.12 0.39 2.03 1.56
01/01/13 ABC 0.86 0.87 0.58 2.49 1.92
04/01/13 ABC 0.95 0.39 0.87 2.94 2.66
07/01/13 ABC 0.60 0.25 0.56 2.69 2.39
10/01/13 ABC 0.15 0.28 0.69 2.55 2.70
01/01/11 XYZ 0.94 0.40 0.50 NaN NaN
04/01/11 XYZ 0.65 0.19 0.81 NaN NaN
07/01/11 XYZ 0.89 0.59 0.69 NaN NaN
10/01/11 XYZ 0.12 0.09 0.18 2.59 2.18
01/01/12 XYZ 0.25 0.94 0.55 1.90 2.23
04/01/12 XYZ 0.07 0.22 0.67 1.33 2.09
07/01/12 XYZ 0.46 0.08 0.54 0.89 1.94
10/01/12 XYZ 0.04 0.03 0.94 0.82 2.70
...
My approach so far has been to sort by date, then group, then iterate through each group and if there are 3 older events then the current event I sum. Also, I want to check to see if the dates fall within 1 year. Can anyone offer a better way in Python? Thank you.
Added: As a clarification for the 1 year part, let's say you take the last four dates and it goes 1/1/1993, 4/1/12, 7/1/12, 10/1/12 -- a data error. I wouldn't want to sum those four. I would want that one to say NaN.

For this I think you can use transform and rolling_sum. Starting from your dataframe, I might do something like:
>>> df["DATE"] = pd.to_datetime(df["DATE"]) # switch to datetime to ease sorting
>>> df = df.sort(["STOCK", "DATE"])
>>> rsum_columns = "DATA1", "DATA3"
>>> grouped = df.groupby("STOCK")[rsum_columns]
>>> new_columns = grouped.transform(lambda x: pd.rolling_sum(x, 4))
>>> df[new_columns.columns + "_TTM"] = new_columns
>>> df
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
0 2012-01-01 00:00:00 ABC 0.40 0.88 0.22 NaN NaN
1 2012-04-01 00:00:00 ABC 0.50 0.49 0.13 NaN NaN
2 2012-07-01 00:00:00 ABC 0.85 0.36 0.83 NaN NaN
3 2012-10-01 00:00:00 ABC 0.28 0.12 0.39 2.03 1.57
4 2013-01-01 00:00:00 ABC 0.86 0.87 0.58 2.49 1.93
5 2013-04-01 00:00:00 ABC 0.95 0.39 0.87 2.94 2.67
6 2013-07-01 00:00:00 ABC 0.60 0.25 0.56 2.69 2.40
7 2013-10-01 00:00:00 ABC 0.15 0.28 0.69 2.56 2.70
8 2011-01-01 00:00:00 XYZ 0.94 0.40 0.50 NaN NaN
9 2011-04-01 00:00:00 XYZ 0.65 0.19 0.81 NaN NaN
10 2011-07-01 00:00:00 XYZ 0.89 0.59 0.69 NaN NaN
11 2011-10-01 00:00:00 XYZ 0.12 0.09 0.18 2.60 2.18
12 2012-01-01 00:00:00 XYZ 0.25 0.94 0.55 1.91 2.23
13 2012-04-01 00:00:00 XYZ 0.07 0.22 0.67 1.33 2.09
14 2012-07-01 00:00:00 XYZ 0.46 0.08 0.54 0.90 1.94
15 2012-10-01 00:00:00 XYZ 0.04 0.03 0.94 0.82 2.70
[16 rows x 7 columns]
I don't know what you're asking by "Also, I want to check to see if the dates fall within 1 year", so I'll leave that alone.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

DataFrame of means of top N most correlated columns - python

Related

How to add sum() and mean() value above the df column values in the same line?

Add column using groupby in multiindex Pandas

How to use `.to_string` method of `pd.DataFrame` to return string with actual `\t` characters?

Python(Pandas) - Create a column by matching column's values into dataframe

Group by - select most recent 4 events

Categories

Resources