Find high correlations in a large coefficient matrix - python

I have a dataset with 56 numerical features. Loading it into pandas, I can easily generate a correlation coefficient matrix.
However, due to its size, I'd like to find coefficients higher (or lower) than a certain threshold, e.g. >0.8 or <-0.8, and list the corresponding pairs of variables. Is there a way to do this? I figure it would require selecting by value across all columns and then returning not the row, but the column name and row index of the value, but I have no idea how to do either!
Thanks!

I think you can use where and stack(), like this:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.rand(10, 3))
coeff = df.corr()
# 0.3 is used for illustration
# replace with your actual value
thresh = 0.3
# keep coefficients whose absolute value is below the threshold
mask = coeff.abs().lt(thresh)
# or mask = coeff < thresh
coeff.where(mask).stack()
Output:
0 2 -0.089326
2 0 -0.089326
dtype: float64
With the mask relaxed so that only the diagonal is excluded (e.g. mask = coeff.abs().lt(1)), every off-diagonal pair is listed:
0 1 0.319612
2 -0.089326
1 0 0.319612
2 -0.687399
2 0 -0.089326
1 -0.687399
dtype: float64
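For the thresholds in the original question (>0.8 or <-0.8), here is a sketch of the same idea with the comparison flipped and the diagonal excluded; it assumes the coeff matrix built above, and the gt/eye combination is my addition rather than part of the original answer:
import numpy as np
thresh = 0.8
# keep coefficients whose absolute value exceeds the threshold,
# dropping the diagonal, which is always exactly 1.0
mask = coeff.abs().gt(thresh) & ~np.eye(len(coeff), dtype=bool)
coeff.where(mask).stack()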

This approach will work if you're looking to also deduplicate the correlation results.
thresh = 0.8
# get the correlation matrix as absolute values and flatten it to (row, column, value)
df_corr = df.corr().abs().unstack()
# filter: since abs() was already applied, a single > thresh comparison is enough
df_corr_filt = df_corr[df_corr > thresh].reset_index()
# deduplicate: keep only one of each (A, B) / (B, A) pair
df_corr_filt.iloc[df_corr_filt[['level_0', 'level_1']].apply(lambda r: ''.join(map(str, sorted(r))), axis=1).drop_duplicates().index]
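An alternative way to deduplicate (a sketch of my own, not from the original answer, assuming the same numeric DataFrame df): mask everything on and below the diagonal so each pair appears exactly once, then stack and filter.
import numpy as np
import pandas as pd
# keep only the strict upper triangle of the correlation matrix,
# so the (A, B) / (B, A) duplicates and the diagonal are dropped
corr = df.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8])  # replace 0.8 with your threshold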

Related

Sum the predictions of a Linear Regression from Scikit-Learn

I need to make a linear regression and sum all the predictions. Maybe this isn't a question for Scikit-Learn but for NumPy because I get an array at the end and I am unable to turn it into a float.
df
   rank  Sales
0     1  18000
1     2  17780
2     3  17870
3     4  17672
4     5  17556
import numpy as np
from sklearn.linear_model import LinearRegression
x = df['rank'].to_numpy()
y = df['Sales'].to_numpy()
X = x.reshape(-1, 1)
regression = LinearRegression().fit(X, y)
I am getting it right up to this point. The next part (which is a while loop to sum all the values) is not working:
number_predictions = 100
x_current_prediction = 1
total_sales = 0
while x_current_prediction <= number_predictions:
    variable_sum = x_current_prediction*regression.coef_
    variable_sum_float = variable_sum.astype(np.float_)
    total_sales = total_sales + variable_sum_float
    x_current_prediction =+1
return total_sales
I think that the problem is getting regression.coef_ to be a float, but when I use astype, it does not work?
You don't need to loop like this, and you don't need to use the coefficient to compute the prediction (don't forget there may be an intercept as well).
Instead, make an array of all the values of x you want to predict for, and ask sklearn for the predictions:
X_new = np.arange(1, 101).reshape(-1, 1) # X must be 2D.
y_pred = regression.predict(X_new)
If you want to add all these numbers together, use y_pred.sum() or np.sum(y_pred), or if you want a cumulative sum, np.cumsum(y_pred) will do it.
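For completeness, a minimal end-to-end sketch (my own, assuming the df with 'rank' and 'Sales' columns shown in the question):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# sample data from the question
df = pd.DataFrame({'rank': [1, 2, 3, 4, 5],
                   'Sales': [18000, 17780, 17870, 17672, 17556]})
X = df[['rank']].to_numpy()               # 2D feature matrix
y = df['Sales'].to_numpy()
regression = LinearRegression().fit(X, y)
X_new = np.arange(1, 101).reshape(-1, 1)  # ranks 1..100
y_pred = regression.predict(X_new)
total_sales = float(y_pred.sum())         # a single float, no loop needed
print(total_sales)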

Incorporate product of main and auxiliary variables in objective function

This is a follow-up question to a previous post.
The problem I'm trying to solve is to find the largest combination of cells at the intersection of company and category with an average above a certain threshold. Additionally, if I include 3 categories for a certain company, then in order to include another company, all 3 categories of that company must also be included.
I am incorporating disjunctive reasoning into my optimization problem. I set a constraint where at least 2 of the 3 companies (columns) and all three categories (rows) have to be non-zero, and I create zero_comp_vars and zero_cat_vars for this. For the optimization to incorporate this, I want to multiply the decision variable matrix by these vectors element-wise (e.g. if zero_comp_vars is [1, 0, 0], we multiply the first column of the decision variable matrix by 1 as a scalar and the second and third columns by 0 as a scalar). However, I believe doing this makes the objective non-linear.
I can use vstack and hstack to convert zero_comp_vars and zero_cat_vars to matrices, but then I encounter the DCP violation.
import numpy as np
import cvxpy as cp
import cvxopt
util = np.array([[0.7, 0.95, 0.3], [2, 1.05, 2.2], [4, 1, 3]])
# The variable we are solving for
dec_vars = cp.Variable(util.shape, boolean = True)
zero_comp_vars = cp.Variable(util.shape[1], boolean = True)
zero_cat_vars = cp.Variable(util.shape[0], boolean = True)
# define constraints
zero_comp_constr = cp.sum(dec_vars, axis=0) >= 2 * zero_comp_vars
zero_cat_constr = cp.sum(dec_vars, axis=1) >= 3 * zero_cat_vars
# need the following two constraints, otherwise all the values in the zero_comp_constr and zero_cat_constr vectors can be 0
above_one_non_zero_comp = cp.sum(zero_comp_vars) >= 1
above_one_non_zero_cat = cp.sum(zero_cat_vars) >= 1
# minimum-count array (at least 2 selected cells per row)
min_comp_array = np.empty(dec_vars.shape[0])
min_comp_array.fill(2)
min_comp_constr = cp.sum(dec_vars, axis=1) >= min_comp_array
temp = cp.multiply(util, dec_vars)
# as written, this attempts element-wise multiplication, and zero_comp_vars has shape (3,)
temp2 = cp.multiply(temp, zero_comp_vars)
temp3 = cp.multiply(temp2, zero_cat_vars)
tot_util = cp.sum(temp3)
# note: tot_util must be defined before it is used in this constraint
tot_avg_constr = tot_util >= 2.0 * cp.sum(dec_vars)
cluster_problem = cp.Problem(cp.Maximize(tot_util),
                             [zero_comp_constr, zero_cat_constr,
                              above_one_non_zero_comp, above_one_non_zero_cat,
                              min_comp_constr, tot_avg_constr])
cluster_problem.solve()
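One standard MILP workaround (not from the original post, just a sketch building on util, dec_vars, zero_comp_vars and zero_cat_vars defined above) is to avoid multiplying boolean variables in the objective: introduce an auxiliary boolean matrix z with z[i, j] = dec_vars[i, j] AND zero_cat_vars[i] AND zero_comp_vars[j], enforce that with linear constraints, and use z in the objective instead.
# auxiliary boolean: z[i, j] == dec_vars[i, j] AND zero_cat_vars[i] AND zero_comp_vars[j]
z = cp.Variable(util.shape, boolean=True)
n_rows, n_cols = util.shape
comp_mat = cp.vstack([zero_comp_vars] * n_rows)    # comp_mat[i, j] = zero_comp_vars[j]
cat_mat = cp.vstack([zero_cat_vars] * n_cols).T    # cat_mat[i, j]  = zero_cat_vars[i]
and_constraints = [
    z <= dec_vars,
    z <= comp_mat,
    z <= cat_mat,
    z >= dec_vars + comp_mat + cat_mat - 2,
]
tot_util_lin = cp.sum(cp.multiply(util, z))        # linear, since util is a constant
linear_problem = cp.Problem(cp.Maximize(tot_util_lin),
                            and_constraints + [zero_comp_constr, zero_cat_constr,
                                               above_one_non_zero_comp, above_one_non_zero_cat,
                                               min_comp_constr,
                                               tot_util_lin >= 2.0 * cp.sum(dec_vars)])
linear_problem.solve()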

scaling data between -1 and 1 centred on zero

Apologies in advance for any incorrect wording. The reason I am not finding answers to this might be that I am not using the right terminology.
I have a dataframe that looks something like
0 -0.004973 0.008638 0.000264 -0.021122 -0.017193
1 -0.003744 0.008664 0.000423 -0.021031 -0.015688
2 -0.002526 0.008688 0.000581 -0.020937 -0.014195
3 -0.001322 0.008708 0.000740 -0.020840 -0.012715
4 -0.000131 0.008725 0.000898 -0.020741 -0.011249
5 0.001044 0.008738 0.001057 -0.020639 -0.009800
6 0.002203 0.008748 0.001215 -0.020535 -0.008368
7 0.003347 0.008755 0.001373 -0.020428 -0.006952
8 0.004476 0.008758 0.001531 -0.020319 -0.005554
9 0.005589 0.008758 0.001688 -0.020208 -0.004173
10 0.006687 0.008754 0.001845 -0.020094 -0.002809
...
For each column I would like to scale the data to a float between -1.0 and 1.0 for this column's min and max.
I have tried scikit learn's minmax scaler with scaler = MinMaxScaler(feature_range = (-1, 1)) but some values change sign as a result, which I need to preserve.
Is there a way to 'centre' the scaling on zero?
Have you tried using StandardScaler from sklearn? It has with_mean and with_std options, which you can use to get the data you want.
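A minimal sketch of what this answer suggests (my own, assuming df is the dataframe shown above); note that StandardScaler standardizes by the standard deviation rather than bounding values to [-1, 1], and with_mean=False avoids shifting the data, so signs are preserved:
from sklearn.preprocessing import StandardScaler
import pandas as pd
# divide each column by its standard deviation without centering it,
# so the sign of every value is preserved
scaler = StandardScaler(with_mean=False, with_std=True)
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)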
The problem with scaling the negative values against the column's minimum and the positive values against the column's maximum is that the negative and positive values may end up on different scales. If you want to use the same scale for both negative and positive values, try the following:
def zero_centered_min_max_scaling(dataframe):
    """
    Scale the numerical values in the dataframe to be between -1 and 1,
    preserving the sign of all values.
    """
    df_copy = dataframe.copy(deep=True)
    for column in df_copy.columns:
        max_absolute_value = df_copy[column].abs().max()
        df_copy[column] = df_copy[column] / max_absolute_value
    return df_copy
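For a purely numeric dataframe this is equivalent to a one-liner (a sketch, assuming df holds the data shown above):
# divide every column by its maximum absolute value; each column then lies
# in [-1, 1] and zero stays at zero
df_scaled = df / df.abs().max()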

Weighted data problems, mean is fine, but Covar and std look wrong, how do I adjust?

I'm trying to apply a weighted filter to the data, rather than use the raw data, before calculating stats: mean, std and covariance. But the results clearly need adjusting.
import numpy as np
from pandas import DataFrame
# generate some data and a filter
f_n = 100.
np.random.seed(seed=101)
foo = np.random.rand(f_n, 3)
foo = DataFrame(foo).add(1).pct_change()
f_filter = np.arange(f_n, .0, -1)
f_filter = 1.0 / (f_filter**(f_filter/f_n))
# normalise the filter ... This could be where I'm going wrong?
f_filter = f_filter * (f_n / f_filter.sum())
Now we are ready to look at some results
print foo.mul(f_filter,axis=0).mean()
print foo.mean()
0 0.039147
1 0.039013
2 0.037598
dtype: float64
0 0.035006
1 0.042244
2 0.041956
dtype: float64
The means all look in line, but when we look at the covariance and std they are significantly different in terms of scale and also direction:
print foo.mul(f_filter,axis=0).cov()
print foo.cov()
0 1 2
0 0.124766 -0.038954 0.027256
1 -0.038954 0.204269 0.056185
2 0.027256 0.056185 0.203934
0 1 2
0 0.070063 -0.014926 0.010434
1 -0.014926 0.099249 0.015573
2 0.010434 0.015573 0.087060
print foo.mul(f_filter,axis=0).std()
print foo.std()
0 0.353223
1 0.451961
2 0.451590
dtype: float64
0 0.264694
1 0.315037
2 0.295060
dtype: float64
Any ideas why? How can we adjust the filter, or adjust the covariance matrix, to make them more comparable?
The problem is your weighting function. (Do you want Gaussian random numbers or uniform random variables?) See the plot produced by the code below:
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
f_n = 100.
np.random.seed(seed=101)
# ??? do you want a uniform random variable, or is this a typo for a normal random variable?
foo = np.random.rand(f_n, 3)
foo = DataFrame(foo)
f_filter = np.arange(f_n, .0, -1)
# here is the problem: uneven weights create an artificial trend, making the data
# non-stationary, and covariance only works for stationary data
# =============================================
f_filter = 1.0 / (f_filter**(f_filter/f_n))
fig, ax = plt.subplots()
ax.plot(f_filter)
An uneven weight creates an artificial trend (your random numbers are all positive uniform values), making the data non-stationary, and covariance only works for stationary data. Take a look at the resulting weighted data.
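If the goal is a weighted mean and weighted covariance rather than weighted data, one option (my own sketch, not part of the original answer, assuming foo and f_filter as defined in the question) is to pass the weights to NumPy instead of multiplying them into the data:
import numpy as np
# drop the NaN first row produced by pct_change and align the weights with it
data = foo.dropna().to_numpy()
weights = f_filter[foo.notna().all(axis=1).to_numpy()]
# weighted mean and covariance computed from the weights directly
weighted_mean = np.average(data, axis=0, weights=weights)
weighted_cov = np.cov(data, rowvar=False, aweights=weights)
weighted_std = np.sqrt(np.diag(weighted_cov))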

How to qcut with non unique bin edges?

My question is the same as this previous one:
Binning with zero values in pandas
however, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0 and the rest between, say, 1 and 100, how would I categorize all the 0 values in fractile 1 and the rest of the non-zero values in fractiles 2 to 10 (assuming I want 10 fractiles)? Could I convert the 0s to NaN, qcut the remaining non-NaN data into 9 fractiles (1 to 9), then add 1 to each label (now 2 to 10) and label all the 0 values as fractile 1 manually? Even this is tricky, because my data set also contains, in addition to the 600 values, another couple hundred which may already be NaN before I convert the 0s to NaN.
Update 1/26/14:
I came up with the following interim solution. The problem with this code though, is if the high frequency value is not on the edges of the distribution, then it inserts an extra bin in the middle of the existing set of bins and throws everything a little (or a lot) off.
import bisect
import pandas as pd

def fractile_cut(ser, num_fractiles):
    num_valid = ser.valid().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()
    high_freq = []
    i = 0
    while vcounts.iloc[i] > num_valid / float(remain_fractiles):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid = num_valid - vcounts.iloc[i]
        i += 1
    curr_ser = ser.copy()
    curr_ser = curr_ser[~curr_ser.isin(high_freq)]
    qcut = pd.qcut(curr_ser, remain_fractiles, retbins=True)
    qcut_bins = qcut[1]
    all_bins = list(qcut_bins)
    for val in high_freq:
        bisect.insort(all_bins, val)
    cut = pd.cut(ser, bins=all_bins)
    ser_fractiles = pd.Series(cut.labels + 1, index=ser.index)
    return ser_fractiles
The problem is that pandas.qcut chooses the bins/quantiles so that each one has the same number of records, but all records with the same value must stay in the same bin/quantile (this behaviour is in accordance with the statistical definition of quantile).
The solutions are:
1 - Use pandas >= 0.20.0, which has this fix. It added an option duplicates='raise'|'drop' to control whether to raise on duplicated edges or to drop them, which results in fewer bins than specified, some of them larger (with more elements) than others (see the sketch after this list).
2 - Decrease the number of quantiles. Fewer quantiles means more elements per quantile.
3 - Rank your data with DataFrame.rank(method='first'). The ranking assigns a unique value to each element in the dataframe (the rank) while keeping the order of the elements (except for identical values, which are ranked in the order they appear in the array; see method='first').
Example:
pd.qcut(df, nbins) <-- this generates "ValueError: Bin edges must be unique"
Then use this instead:
pd.qcut(df.rank(method='first'), nbins)
4 - Specify a custom quantiles range, e.g. [0, .50, .75, 1.] to get unequal number of items per quantile
5 - Use pandas.cut, which chooses bins that are evenly spaced according to the values themselves, while pandas.qcut chooses the bins so that each bin has the same number of records.
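A minimal sketch of option 1 with duplicates='drop' (the example data here is my own, half zeros, to mimic the question):
import pandas as pd
s = pd.Series([0] * 300 + list(range(1, 301)))
# drop the duplicated bin edges instead of raising; the result has fewer than 10 bins
deciles = pd.qcut(s, 10, labels=False, duplicates='drop')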
Another way to do this is to introduce a minimal amount of noise, which will artificially create unique bin edges. Here's an example:
import numpy as np
import pandas as pd

a = pd.Series(list(range(100)) + [0] * 20)

def jitter(a_series, noise_reduction=1000000):
    return (np.random.random(len(a_series)) * a_series.std() / noise_reduction) - (a_series.std() / (2 * noise_reduction))

# and now this works by adding a little noise
a_deciles = pd.qcut(a + jitter(a), 10, labels=False)
We can recreate the original error with something like this:
a_deciles = pd.qcut(a, 10, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 173, in qcut
precision=precision, include_lowest=True)
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 192, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 0. , 0. , 0. , 3.8 ,
11.73333333, 19.66666667, 27.6 , 35.53333333,
43.46666667, 51.4 , 59.33333333, 67.26666667,
75.2 , 83.13333333, 91.06666667, 99. ])
You ask about binning with non-unique bin edges, for which I have a fairly simple answer. In the case of your example, your intent and the behavior of qcut diverge in the pandas.tools.tile.qcut function, where the bins are defined:
bins = algos.quantile(x, quantiles)
Because your data is 50% 0s, this returns bins with multiple edges at the value 0 for any value of quantiles greater than 2. I see two possible resolutions. In the first, the fractile space is divided evenly, binning all 0s, but not only 0s, in the first bin. In the second, the fractile space is divided evenly for values greater than 0, binning all 0s, and only 0s, in the first bin.
import numpy as np
import pandas as pd
import pandas.core.algorithms as algos
from pandas import Series
In both cases, I'll create some random sample data fitting your description of 50% zeroes and the remaining values between 1 and 100
zs = np.zeros(300)
rs = np.random.randint(1, 100, size=300)
arr=np.concatenate((zs, rs))
ser = Series(arr)
Solution 1: bin 1 contains both 0s and low values
bins = algos.quantile(np.unique(ser), np.linspace(0, 1, 11))
result = pd.tools.tile._bins_to_cuts(ser, bins, include_lowest=True)
The result is
In[61]: result.value_counts()
Out[61]:
[0, 9.3] 323
(27.9, 38.2] 37
(9.3, 18.6] 37
(88.7, 99] 35
(57.8, 68.1] 32
(68.1, 78.4] 31
(78.4, 88.7] 30
(38.2, 48.5] 27
(48.5, 57.8] 26
(18.6, 27.9] 22
dtype: int64
Solution 2: bin1 contains only 0s
mx = np.ma.masked_equal(arr, 0, copy=True)
bins = algos.quantile(arr[~mx.mask], np.linspace(0, 1, 11))
bins = np.insert(bins, 0, 0)
bins[1] = bins[1]-(bins[1]/2)
result = pd.tools.tile._bins_to_cuts(arr, bins, include_lowest=True)
The result is:
In[133]: result.value_counts()
Out[133]:
[0, 0.5] 300
(0.5, 11] 32
(11, 18.8] 28
(18.8, 29.7] 30
(29.7, 39] 35
(39, 50] 26
(50, 59] 31
(59, 71] 31
(71, 79.2] 27
(79.2, 90.2] 30
(90.2, 99] 30
dtype: int64
There is work that could be done to Solution 2 to make it a little prettier I think, but you can see that the masked array is a useful tool to approach your goals.
If you want to enforce equal-size bins, even in the presence of duplicate values, you can use the following two-step process:
Rank your values, using method='first' to have pandas assign a unique rank to every record. If there is a duplicate value (i.e. a tie in the rank), this method ranks the tied records in the order they are encountered.
df['rank'] = df['value'].rank(method='first')
Use qcut on the rank to determine equal-sized quantiles. The example below creates deciles (10 bins).
df['decile'] = pd.qcut(df['rank'].values, 10).codes
I've had a lot of problems with qcut as well, so I used the Series.rank function combined with creating my own bins using those results. My code is on Github:
https://gist.github.com/ashishsingal1/e1828ffd1a449513b8f8
I had this problem as well, so I wrote a small function that only bins the non-zero values and then inserts the resulting labels where the original value was not 0.
import numpy as np
import pandas as pd

def qcut2(x, n=10):
    x = np.array(x)
    x_index_not0 = [i for i in range(len(x)) if x[i] > 0]
    x_cut_not0 = pd.qcut(x[x > 0], n - 1, labels=False) + 1
    y = np.zeros(len(x))
    y[x_index_not0] = x_cut_not0
    return y
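A quick usage sketch, reusing the imports and function above (the data here is made up to match the question's description of half zeros and the rest between 1 and 100): the zeros get label 0 and the non-zero values fall into labels 1 to 9.
vals = np.concatenate([np.zeros(300), np.random.randint(1, 101, size=300)])
labels = qcut2(vals, n=10)
print(pd.Series(labels).value_counts().sort_index())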
