I have a DataFrame like this
import pandas as pd
import numpy as np

dict_ = {'Date':['2018-01-01','2018-01-02','2018-01-03','2018-01-04','2018-01-05'],'Col1':[1,2,3,4,5],'Col2':[1.1,1.2,1.3,1.4,1.5],'Col3':[0.33,0.98,1.54,0.01,0.99]}
df = pd.DataFrame(dict_, columns=dict_.keys())
I then calculate the Pearson correlation between the columns and filter out columns that are correlated above my threshold of 0.95:
def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    df_not_correlated = ~(df_corr.mask(np.eye(len(df_corr), dtype=bool)).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
which yields
uncorrelated_factors = trimm_correlated(df, 0.95)
print(uncorrelated_factors)
Col3
0 0.33
1 0.98
2 1.54
3 0.01
4 0.99
So far I am happy with the result, but I would like to keep one column from each correlated pair, so in the above example I would like to include Col1 or Col2, to get something like this:
Col1 Col3
0 1 0.33
1 2 0.98
2 3 1.54
3 4 0.01
4 5 0.99
Also, on a side note, is there any further evaluation I can do to determine which of the correlated columns to keep?
Thanks!
You can use np.tril() instead of np.eye() for the mask:
def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    df_not_correlated = ~(df_corr.mask(np.tril(np.ones([len(df_corr)]*2, dtype=bool))).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
Output:
Col1 Col3
0 1 0.33
1 2 0.98
2 3 1.54
3 4 0.01
4 5 0.99
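Regarding the side note about which of two correlated columns to keep: one common heuristic is to keep the column whose mean absolute correlation with the remaining columns is lower, since it contributes more independent information. A minimal sketch of that idea (just one possible criterion, using the df from the question):

import numpy as np

# Mean absolute correlation of each column with the others (diagonal masked out).
corr = df[['Col1', 'Col2', 'Col3']].corr().abs()
mean_corr = corr.mask(np.eye(len(corr), dtype=bool)).mean()

# Within the correlated pair Col1/Col2, keep the one with the lower mean correlation
# (idxmin picks the first column on a tie, as happens in this toy example).
keep = mean_corr[['Col1', 'Col2']].idxmin()
print(keep)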
Use this directly on the dataframe to sort out the top correlation values.
import pandas as pd
import numpy as np
def correl(X_train):
    cor = X_train.corr()
    corrm = np.corrcoef(X_train.transpose())
    corr = corrm - np.diagflat(corrm.diagonal())
    print("max corr:", corr.max(), ", min corr: ", corr.min())
    c1 = cor.stack().sort_values(ascending=False).drop_duplicates()
    high_cor = c1[c1.values != 1]
    # change this value to get more correlation results
    thresh = 0.9
    # display() is available in IPython/Jupyter; use print() in a plain script
    display(high_cor[high_cor > thresh])
correl(X)
output:
max corr: 0.9821068918331252 , min corr: -0.2993837739125243
count_rech_2g_8 sachet_2g_8 0.982107
count_rech_2g_7 sachet_2g_7 0.979492
count_rech_2g_6 sachet_2g_6 0.975892
arpu_8 total_rech_amt_8 0.946617
arpu_3g_8 arpu_2g_8 0.942428
isd_og_mou_8 isd_og_mou_7 0.938388
arpu_2g_6 arpu_3g_6 0.933158
isd_og_mou_6 isd_og_mou_8 0.931683
arpu_3g_7 arpu_2g_7 0.930460
total_rech_amt_6 arpu_6 0.930103
isd_og_mou_7 isd_og_mou_6 0.926571
arpu_7 total_rech_amt_7 0.926111
dtype: float64
How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?
Specifically, I have a dataframe correlation_df with the following data:
id  scores  cosine
1   100     0.8
2   75      0.7
3   50      0.4
4   25      0.05
I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.
My understanding is that I should do this with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can ignore it in the correlation calculation.
NB. Problem code:
import numpy as np
import pandas as pd
score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
    {
        "score": score,
        "cosine": cosine,
    }
)
corr = correlation_df.corr().values[0, 1]
[Edit] Roundabout solution that I'm sure can be improved:
def my_fuct(row):
    i = int(row["index"])
    r = list(range(correlation_df.shape[0]))
    r.remove(i)
    subset = correlation_df.iloc[r, :].copy()
    subset = subset.set_index("index")
    return subset.corr().values[0, 1]

correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)
Your problem can be simplified to:
>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
score cosine diff_correlations
0 100 0.80 0.999015
1 75 0.70 0.988522
2 50 0.40 0.977951
3 25 0.05 0.960769
A more sophisticated method would be:
df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
This way the whole correlation matrix isn't computed on every call. Note that the walrus operator (:=) requires Python 3.8 or newer.
The index can be accessed in an apply with .name or .index, depending on the axis:
>>> correlation_df.apply(lambda x: x.name, axis=1)
0 0
1 1
2 2
3 3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
score cosine
0 0 0
1 1 1
2 2 2
3 3 3
Using
correlation_df = correlation_df.reset_index()
gives you a new column called index, denoting the index of the row (i.e., what previously was your index). Now, when using apply, access it via:
correlation_df.apply(lambda r: r["index"])
After you are done you could do:
correlation_df = correlation_df.set_index("index")
to get your previous format back.
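Putting the pieces together, a minimal end-to-end sketch of that workflow (using correlation_df and the column names from the question; the function name my_fuct is kept from the post):

correlation_df = correlation_df.reset_index()  # the old index becomes an "index" column

def my_fuct(row):
    # drop the current row by its original index before computing the correlation
    subset = correlation_df[correlation_df["index"] != row["index"]]
    return subset[["score", "cosine"]].corr().iat[0, 1]

correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)
correlation_df = correlation_df.set_index("index")  # restore the previous format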
I'm a biology student who is fairly new to Python, and I was hoping someone might be able to help with a problem I have yet to solve.
With some subsequent code I have created a pandas dataframe that looks like the example below:
Distance. No. of values Mean rSquared
1 500 0.6
2 80 0.3
3 40 0.4
4 30 0.2
5 50 0.2
6 30 0.1
I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.
I need to sum the 'No. of values' column until I reach a value >= 100, and then combine those rows, taking the weighted average of the Distance and Mean rSquared values, as seen in the example below:
Mean Distance. No. Of values Mean rSquared
1 500 0.6
(80*2+40*3)/120 (80+40) = 120 (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110 (30+50+30) = 110 (30*0.2+50*0.2+30*0.1)/110
etc...
I know pandas has its .cumsum function, which I might be able to implement into a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I haven't a clue how to average the adjacent columns.
Any help would be appreciated!
You can use this code snippet to solve your problem.
# First, compute some weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]

min_threshold = 100
indexes = []
temp_sum = 0

# placeholder for the final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]

# resetting the index to make 'df' usable in the following loop
df = df.reset_index(drop=True)

# main loop to check and compute the desired output
for index, _ in df.iterrows():
    temp_sum += df.iloc[index]["No. of values"]
    indexes.append(index)
    # if the sum exceeds 'min_threshold' then do some computation
    if temp_sum >= min_threshold:
        temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
        # create a temporary dataframe and concatenate it with 'final_df'
        temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
        final_df = pd.concat([final_df, temp_df])
        # reset the variables
        temp_sum = 0
        indexes = []
NumPy has a function, numpy.frompyfunc. You can use it to get a cumulative value that resets based on a threshold.
Here's how to implement it. With that, you can then figure out the indexes where the value goes over the threshold, and use those to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.
I also leveraged @sujanay's idea of calculating the weighted values first.
c = ['Distance','No. of values','Mean rSquared']
d = [[1,500,0.6], [2,80,0.3], [3,40,0.4],
[4,30,0.2], [5,50,0.2], [6,30,0.1]]
import pandas as pd
import numpy as np
df = pd.DataFrame(d,columns=c)
#calculate the weighted distance and weighted mean squares first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]
#use numpy.frompyfunc to setup the threshold condition
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
#assign value to cumvals based on threshold
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=object)
#find out all records that have >= 100 as cumulative values
idx = df.index[df['cumvals'] >= 100].tolist()
#if last row not in idx, then add it to the list
if (len(df)-1) not in idx: idx += [len(df)-1]
#iterate thru the idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
    df.loc[j,'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j,'cumvals']).round(2)
    df.loc[j,'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j,'cumvals']).round(2)
    i = j+1
print (df)
The output of this will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the records that are non NaN, you can do:
final_df = df[df['Mean Distance'].notnull()]
This will result in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I looked up BEN_YO's implementation of numpy.frompyfunc. The original SO post can be found here: Restart cumsum and get index if cumsum more than value.
If you figure out the grouping first, pandas groupby-functionality will do a lot of the remaining work for you. A loop is appropriate to get the grouping (unless somebody has a clever one-liner):
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
... if cumsum >= 100:
... cumsum = 0
... group = group + 1
... cumsum = cumsum + n
... groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]
Before doing the grouped operations you need to use the No. of values information to get the weighting in:
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727
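If you want the three columns from the desired output in a single frame, a small follow-up sketch (column names assumed from the question):

result = df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
result.insert(1, "No. of values", sums)
result = result.rename(columns={"Distance.": "Mean Distance"})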
I have a column that contains 0s and dates like 12/02/19. I want to transform all the dates into ones and multiply by the column Enrolls_F.
I would prefer using regex, but any other option is fine too. It is a large dataset; I tried a simple for loop and my kernel could not run it.
Data:
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
Attempts:
I am trying to search for everything that starts with 2, replace it with 1, and multiply by Enrolls_F:
df_test = (df.replace({'Enrolled_Date': r'2.$'}, {'Enrolled_Date': '1'}, regex=True)) * df.Enrolls_F
# Nothing happens
IIUC, this should help you get the trouble sorted:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
df['Enrolled_Date'] = np.where(df['Enrolled_Date'] == '0',0,1)
df['multiplication_column'] = df['Enrolled_Date'] * df['Enrolls_F']
print(df)
Output:
Enrolled_Date Enrolls_F multiplication_column
0 0 1.11
1 1 1.11 1.11
2 0 0.222
3 0 1.11
4 1 5.22 5.22
5 0 1
If you want the output as floats, try this:
df.Enrolled_Date.ne('0').astype(int) * df.Enrolls_F.astype(float)
Out[212]:
0 0.00
1 1.11
2 0.00
3 0.00
4 5.22
5 0.00
dtype: float64
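Since the question mentions preferring regex, here is a hedged sketch of that route as well (assuming the intent is simply "date present → 1, otherwise 0"), using str.match rather than replace:

# Flag values that start with "2" (the dates) as 1, everything else as 0,
# then multiply by Enrolls_F cast to float.
flag = df['Enrolled_Date'].str.match(r'^2').astype(int)
df['multiplication_column'] = flag * df['Enrolls_F'].astype(float)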
I have a data frame like this, but with many more columns. I would like to multiply each two adjacent columns, put the product in a new column beside them called Sub_pro, and at the end have the total sum of all Sub_pro columns in a column called F_Pro, rounded to 3 decimal places. I don't know how to get the Sub_pro columns. Below is my code.
import pandas as pd

df = pd.read_excel("C:dummy")
df['F_Pro'] = ("Result" * "Attribute").sum(axis=1)  # fails: this multiplies two string literals, not the columns
df.round(decimals=3)
print(df)
Input
Id Result Attribute Result1 Attribute1
1 0.5621 0.56 536 0.005642
2 0.5221 0.5677 2.15 93
3 0.024564 5.23 6.489 8
4 11.564256 4.005 0.45556 5.25
5 0.6123 0.4798 0.6667 5.10
Desired output
id Result Attribute Sub_Pro Result1 Attribute1 Sub_pro1 F_Pro
1 0.5621 0.56 0.314776 536 0.005642 3.024112 3.338888
2 0.5221 0.5677 0.29639617 2.15 93 199.95 200.2463962
3 0.024564 5.23 0.12846972 6.489 8 51.912 52.04046972
4 11.564256 4.005 46.31484528 0.45556 5.25 2.39169 48.70653528
5 0.6123 0.4798 0.29378154 0.6667 5.1 3.40017 3.69395154
Because you have several columns with similar names, here is one way using filter. To see how it works: on your df, df.filter(like='Result') gives you the columns whose name contains Result:
Result Result1
0 0.562100 536.00000
1 0.522100 2.15000
2 0.024564 6.48900
3 11.564256 0.45556
4 0.612300 0.66670
You can create an array containing the values for the Sub_pro columns:
import numpy as np
arr_sub_pro = np.round(df.filter(like='Result').values* df.filter(like='Attribute').values,3)
which gives you the values of the Sub_pro columns in arr_sub_pro:
array([[3.1500e-01, 3.0240e+00],
[2.9600e-01, 1.9995e+02],
[1.2800e-01, 5.1912e+01],
[4.6315e+01, 2.3920e+00],
[2.9400e-01, 3.4000e+00]])
Now you need to add them at the right positions in the dataframe; I think a for loop is necessary:
for nb, col in zip(range(arr_sub_pro.shape[1]), df.filter(like='Attribute').columns):
    df.insert(df.columns.get_loc(col)+1, 'Sub_pro{}'.format(nb), arr_sub_pro[:,nb])
Here I get the location of the column Attribute{nb} and insert the values from column nb of arr_sub_pro at the next position.
To add the column 'F_Pro', you can do:
df.insert(len(df.columns), 'F_Pro', arr_sub_pro.sum(axis=1))
the final df looks like:
Id Result Attribute Sub_pro0 Result1 Attribute1 Sub_pro1 \
0 1 0.562100 0.5600 0.315 536.00000 0.005642 3.024
1 2 0.522100 0.5677 0.296 2.15000 93.000000 199.950
2 3 0.024564 5.2300 0.128 6.48900 8.000000 51.912
3 4 11.564256 4.0050 46.315 0.45556 5.250000 2.392
4 5 0.612300 0.4798 0.294 0.66670 5.100000 3.400
F_Pro
0 3.339
1 200.246
2 52.040
3 48.707
4 3.694
import pandas as pd

src = "/opt/repos/pareto/test/stack/data.csv"
df = pd.read_csv(src)
count = 0

def multiply(x):
    res = x.copy()
    keys_len = len(x)
    idx = 1
    while idx + 1 < keys_len:
        left = x[idx]
        right = x[idx + 1]
        new_key = "sub_prod_{}".format(idx)
        # Multiply and round to three decimal places.
        res[new_key] = round(left * right, 3)
        idx = idx + 1
    return res

res_df = df.apply(lambda x: multiply(x), axis=1)
This solves the problem, but you will then need to reorder the columns. You could also iterate over the keys instead of making a deep copy of the full row. I hope the code helps you.
Here's one way using NumPy and a dictionary comprehension:
# extract NumPy array for relevant columns
A = df.iloc[:, 1:].values
n = int(A.shape[1] / 2)
# calculate products and feed to pd.DataFrame
prods = pd.DataFrame({'Sub_Pro_'+str(i): np.prod(A[:, 2*i: 2*(i+1)], axis=1) \
for i in range(n)})
# calculate sum of product rows
prods['F_Pro'] = prods.sum(axis=1)
# join to original dataframe
df = df.join(prods)
print(df)
Id Result Attribute Result1 Attribute1 Sub_Pro_0 Sub_Pro_1 \
0 1 0.562100 0.5600 536.00000 0.005642 0.314776 3.024112
1 2 0.522100 0.5677 2.15000 93.000000 0.296396 199.950000
2 3 0.024564 5.2300 6.48900 8.000000 0.128470 51.912000
3 4 11.564256 4.0050 0.45556 5.250000 46.314845 2.391690
4 5 0.612300 0.4798 0.66670 5.100000 0.293782 3.400170
F_Pro
0 3.338888
1 200.246396
2 52.040470
3 48.706535
4 3.693952
I have two dataframes. One has some probability brackets.
df1 = pd.DataFrame({'ProbabilityBrackets' : [0,0.50,0.75,1.0,0.75,0.90,1.0,0],\
'Group' : pd.Categorical(["test","test","test","test","train","train","train","train"]),'Destination' : pd.Categorical(["-","A","B","C","AA","BB","CC","-"])})
Destination Group ProbabilityBrackets
0 - test 0.00
1 A test 0.50
2 B test 0.75
3 C test 1.00
4 AA train 0.75
5 BB train 0.90
6 CC train 1.00
7 - train 0.00
The other dataframe has some random numbers and the group column.
df2 = pd.DataFrame({'randomnumbers' : [0.2,0.15,0.78,0.35],\
'Group' : pd.Categorical(["test","train","test","train"])})
Group randomnumbers
0 test 0.20
1 train 0.15
2 test 0.78
3 train 0.35
Now I need to merge the two dataframes together, both by group and based on the probability brackets. Merging by group is trivial; the challenging part is merging based on ProbabilityBrackets and the random numbers. A random number in df2 should be mapped to the smallest probability bracket that is larger than itself. E.g., test 0.2 in df2 is mapped to test 0.5 in df1, and test 0.78 in df2 is mapped to test 1.0 in df1.
I did it as follows, which works well:
for group in ['test','train']:
    brackets = df1[df1['Group']==group].sort_values(by='ProbabilityBrackets')['ProbabilityBrackets'].unique()
    bracketlabels = brackets[1:]  # remove the first element of the list (e.g., remove 0 from (0, 0.5, 1))
    # assign random numbers to the brackets so that we can easily merge them with df1
    df2.loc[df2['Group']==group,'ProbabilityBrackets'] = pd.cut(df2['randomnumbers'], brackets, labels=bracketlabels)

df3 = df2.merge(df1, on=['Group','ProbabilityBrackets'], how='left')
It generates the following output, which is what I want, but it is slower than I'd like because I have thousands of groups in my dataset. Is there a way to do this faster, in a Pythonic way?
Group randomnumbers ProbabilityBrackets Destination
0 test 0.20 0.50 A
1 train 0.15 0.75 AA
2 test 0.78 1.00 C
3 train 0.35 0.75 AA
You can try this.
# Step 1
df_m = df2.merge(df1, on="Group", how="outer")
# Step 2
df_m["diff"] = df_m["randomnumbers"] - df_m["ProbabilityBrackets"]
# Step 3
df_m_filtered = df_m[df_m["diff"] < 0].set_index(
    ["Destination", "ProbabilityBrackets"])
# Step 4
df_desired = df_m_filtered.groupby(
    ["Group", "randomnumbers"])["diff"].nlargest(1).reset_index()
index Group randomnumbers Destination ProbabilityBrackets diff
0 0 test 0.20 A 0.50 -0.30
1 1 test 0.78 C 1.00 -0.22
2 2 train 0.15 AA 0.75 -0.60
3 3 train 0.35 AA 0.75 -0.40
Explanation:
1. Begin with an outer merge.
2. Calculate the differences between randomnumbers and ProbabilityBrackets.
3. Filter the result with the condition df_m["diff"] < 0, since we are only interested in rows where randomnumbers is smaller than ProbabilityBrackets.
4. Group by ["Group", "randomnumbers"] and keep the row with the largest diff within each group.
Comparing “Group” for every element in df2 to every element in df1 is a lot of unnecessary string comparisons. You could instead try putting all the elements of df1 into a dictionary with Group as the key and having lists of (ProbabilityBrackets, Destination) tuples as the values. When inserting each element from df1, insert the tuple into the list maintaining the sort by ProbabilityBracket so that you don’t have to sort it again. Then you can retrieve the appropriate (ProbabilityBracket, Destination) for each element in df2 by looking in the dictionary by Group and performing a binary search on the list by ProbabilityBracket.
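A minimal sketch of that dictionary-plus-binary-search idea, assuming the df1/df2 from the question (lookup and find_destination are names introduced here, not from the post):

from bisect import bisect_left
from collections import defaultdict

# Build {group: sorted list of (bracket, destination)} once, skipping the 0 placeholder rows.
lookup = defaultdict(list)
for _, row in df1[df1['ProbabilityBrackets'] > 0].iterrows():
    lookup[row['Group']].append((row['ProbabilityBrackets'], row['Destination']))
for group in lookup:
    lookup[group].sort()  # sort once, by bracket

def find_destination(group, r):
    pairs = lookup[group]
    keys = [b for b, _ in pairs]
    pos = bisect_left(keys, r)  # first bracket >= r (mirrors pd.cut's right-closed bins)
    return pairs[pos]           # assumes r never exceeds the largest bracket

matched = [find_destination(g, r) for g, r in zip(df2['Group'], df2['randomnumbers'])]
df2['ProbabilityBrackets'] = [b for b, _ in matched]
df2['Destination'] = [d for _, d in matched]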
This is another way of doing it, taking some cues from @JasonR.
Explanation:
- We create a dictionary with lists of (Destination, ProbabilityBrackets) tuples. This is done to avoid looping over df1 multiple times.
- Next, we look up the dictionary for each row in df2 and assign the result based on the given criteria.
from collections import defaultdict

# remove the 0-bracket rows
df1 = df1[df1['ProbabilityBrackets'] > 0]

df_dict = defaultdict(list)

# create a dictionary of lists of tuples
for index, row in df1.iterrows():
    df_dict[row['Group']].append((row['Destination'], row['ProbabilityBrackets']))

# this calculates the output
for index, row in df2.iterrows():
    d = df_dict[row['Group']]
    randnum = row['randomnumbers']
    # this finds the suitable probability bracket
    low = 10000
    tuple_ix = 10000
    for ix, (i, j) in enumerate(d):
        sub = j - randnum
        if sub > 0 and sub < low:
            low = sub
            tuple_ix = ix
    combination = d[tuple_ix]
    df2.loc[index, 'ProbabilityBracket'] = combination[1]
    df2.loc[index, 'Destination'] = combination[0]
Group randomnumbers ProbabilityBracket Destination
0 test 0.20 0.50 A
1 train 0.15 0.75 AA
2 test 0.78 1.00 C
3 train 0.35 0.75 AA