Apologies in advance for any incorrect wording. The reason I am not finding answers to this might be that I am not using the right terminology.
I have a dataframe that looks something like this:
0 -0.004973 0.008638 0.000264 -0.021122 -0.017193
1 -0.003744 0.008664 0.000423 -0.021031 -0.015688
2 -0.002526 0.008688 0.000581 -0.020937 -0.014195
3 -0.001322 0.008708 0.000740 -0.020840 -0.012715
4 -0.000131 0.008725 0.000898 -0.020741 -0.011249
5 0.001044 0.008738 0.001057 -0.020639 -0.009800
6 0.002203 0.008748 0.001215 -0.020535 -0.008368
7 0.003347 0.008755 0.001373 -0.020428 -0.006952
8 0.004476 0.008758 0.001531 -0.020319 -0.005554
9 0.005589 0.008758 0.001688 -0.020208 -0.004173
10 0.006687 0.008754 0.001845 -0.020094 -0.002809
...
For each column I would like to scale the data to floats between -1.0 and 1.0, based on that column's min and max.
I have tried scikit-learn's MinMaxScaler with scaler = MinMaxScaler(feature_range=(-1, 1)), but some values change sign as a result, and I need to preserve the signs.
Is there a way to 'centre' the scaling on zero?
Have you tried using StandardScaler from sklearn?
It has with_mean and with_std options, which you can use to get the data you want.
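For example, a minimal sketch (assuming df is the dataframe from the question): dividing by the standard deviation without subtracting the mean keeps every value's sign, though the result is not guaranteed to stay within [-1, 1].
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Scale by each column's std only; with_mean=False avoids shifting the data,
# so no value changes sign.
scaler = StandardScaler(with_mean=False, with_std=True)
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)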
The problem with scaling the negative values against the column's minimum and the positive values against the column's maximum is that the scale of the negative numbers may be different from the scale of the positive numbers. If you want to use the same scale for both negative and positive values, try the following:
def zero_centered_min_max_scaling(dataframe):
    """
    Scale the numerical values in the dataframe to be between -1 and 1,
    preserving the sign of all values.
    """
    df_copy = dataframe.copy(deep=True)
    for column in df_copy.columns:
        max_absolute_value = df_copy[column].abs().max()
        df_copy[column] = df_copy[column] / max_absolute_value
    return df_copy
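Called on the dataframe from the question (assuming it is named df), this maps each column into [-1, 1] while keeping zeros at zero:
df_scaled = zero_centered_min_max_scaling(df)
print(df_scaled.abs().max())  # every column's largest magnitude is now 1.0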
I need to fit a linear regression and sum all the predictions. Maybe this isn't a question for Scikit-Learn but for NumPy, because I get an array at the end and I am unable to turn it into a float.
df
rank Sales
0 1 18000
1 2 17780
2 3 17870
3 4 17672
4 5 17556
import numpy as np
from sklearn.linear_model import LinearRegression

x = df['rank'].to_numpy()
y = df['Sales'].to_numpy()
X = x.reshape(-1, 1)
regression = LinearRegression().fit(X, y)
I am getting it right up to this point. The next part (which is a while loop to sum all the values) is not working:
number_predictions = 100
x_current_prediction = 1
total_sales = 0
while x_current_prediction <= number_predictions:
    variable_sum = x_current_prediction * regression.coef_
    variable_sum_float = variable_sum.astype(np.float_)
    total_sales = total_sales + variable_sum_float
    x_current_prediction =+1
return total_sales
I think the problem is getting regression.coef_ to be a float, but even when I use astype it does not work.
You don't need to loop like this, and you don't need to use the coefficient to compute the prediction (don't forget there may be an intercept as well).
Instead, make an array of all the values of x you want to predict for, and ask sklearn for the predictions:
X_new = np.arange(1, 101).reshape(-1, 1) # X must be 2D.
y_pred = regression.predict(X_new)
If you want to add all these numbers together, use y_pred.sum() or np.sum(y_pred), or if you want a cumulative sum, np.cumsum(y_pred) will do it.
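Putting it together, a minimal sketch of the whole computation (assuming df is the dataframe shown in the question):
import numpy as np
from sklearn.linear_model import LinearRegression

X = df['rank'].to_numpy().reshape(-1, 1)
y = df['Sales'].to_numpy()
regression = LinearRegression().fit(X, y)

X_new = np.arange(1, 101).reshape(-1, 1)      # ranks 1..100 to predict for
total_sales = regression.predict(X_new).sum()  # a single number, no loop needed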
I am creating a heatmap for the correlations between items.
sns.heatmap(df_corr, fmt=".2g", cmap='vlag', cbar=True, annot=True)
I chose vlag as it has red for high values, blue for low values, and white for the middle.
Seaborn automatically sets red for the highest value and blue for the lowest value in the dataframe.
However, as I am tracking Pearson's correlation, the value range is between -1 and 1, so I would like to set 1 to be represented by red and -1 by blue, leaving 0 to be represented by white.
What the result looks like:
What it should look like*:
*(Of course this was generated by "cheating": setting -1 as a value to force the range to be from -1 to 1; I want to set this range without warping my data.)
Use vmin=-1 and vmax=1:
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt

data = np.random.uniform(low=-0.5, high=0.5, size=(5, 5))
hm = sn.heatmap(data=data, cmap='vlag', annot=True, vmin=-1, vmax=1)
plt.show()
Here is an unorthodox solution. You can "standardize" your data to the range -1 to 1. Even though the theoretical range of the Pearson coefficient is [-1, 1], strong negative correlations are not present in your dataset.
So, you can create another dataframe containing the data scaled so that its max is 1 and its min is -1. You can then plot this dataframe to get the desired effect. The advantage of this procedure is that it generalizes to pretty much any dataframe (not verified, though).
Here is the code:
import pandas as pd

# Setting the target scale of the data
scale_minimum = -1
scale_maximum = 1
scale_range = scale_maximum - scale_minimum

# Applying the scaling
df_minimum, df_maximum = df.min(), df.max()   # Getting the range of the current dataframe
df_range = df_maximum - df_minimum            # The range of the data
df = (df - df_minimum) / df_range             # Scaling between 0 and 1
df_scaled = df * scale_range + scale_minimum  # Scaling between -1 and 1
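You can then pass df_scaled to the same heatmap call as before, for example:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df_scaled, cmap='vlag', annot=True)
plt.show()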
Hope this solves your problem.
I have a dataset with 56 numerical features. Loading it to pandas, I can easily generate a correlation coefficients matrix.
However, due to its size, I'd like to find coefficients higher (or lower) than a certain threshold, e.g. >0.8 or <-0.8, and list the corresponding pairs of variables. Is there a way to do it? I figure it would require selecting by value across all columns, then returning, not the row, but the column name and row index of the value, but I have no idea how to do either!
Thanks!
I think you can do this with where() and stack():
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.rand(10, 3))
coeff = df.corr()

# 0.3 is used for illustration;
# replace with your actual value
thresh = 0.3

mask = coeff.abs().lt(thresh)
# or mask = coeff < thresh

coeff.where(mask).stack()
Output:
0 2 -0.089326
2 0 -0.089326
dtype: float64
With a larger thresh (e.g. thresh = 1), the mask keeps every off-diagonal pair, and the output is:
0 1 0.319612
2 -0.089326
1 0 0.319612
2 -0.687399
2 0 -0.089326
1 -0.687399
dtype: float64
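For the thresholds in the question (|coefficient| > 0.8), flip the mask to gt and drop the diagonal. A sketch:
import numpy as np

thresh = 0.8
# keep strong pairs only; ~np.eye(...) masks each variable's correlation with itself
mask = coeff.abs().gt(thresh) & ~np.eye(len(coeff), dtype=bool)
coeff.where(mask).stack()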
This approach will work if you're looking to also deduplicate the correlation results.
thresh = 0.8

# get correlation matrix (absolute values) as long-form pairs
df_corr = df.corr().abs().unstack()

# filter: after abs(), only the upper bound matters
df_corr_filt = df_corr[df_corr > thresh].reset_index()

# deduplicate mirrored pairs (a, b) / (b, a)
df_corr_filt.iloc[df_corr_filt[['level_0', 'level_1']].apply(
    lambda r: ''.join(map(str, sorted(r))), axis=1).drop_duplicates().index]
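Note that the filter also keeps each variable's correlation with itself (always 1.0, hence above any threshold). Those rows can be dropped before deduplicating, for example:
# drop self-correlations
df_corr_filt = df_corr_filt[df_corr_filt['level_0'] != df_corr_filt['level_1']]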
I have a Pandas Series that needs to be log-transformed to be normally distributed. But I can't log-transform yet, because there are values equal to 0 and values below 1 (the range is 0-4000). Therefore I want to normalize the Series first. I have heard of StandardScaler (scikit-learn), Z-score standardization, and min-max scaling (normalization).
I want to cluster the data later; which would be the best method?
StandardScaler and Z-score standardization use mean, variance, etc. Can I use them on data that is not yet normally distributed?
To transform to logarithms, you need positive values, so translate your range of values (-1, 1] to the normalized (0, 1] as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(-1, 1, (10, 1)))
df['norm'] = (1 + df[0]) / 2  # (-1, 1] -> (0, 1]
df['lognorm'] = np.log(df['norm'])
results in a dataframe like
0 norm lognorm
0 0.360660 0.680330 -0.385177
1 0.973724 0.986862 -0.013225
2 0.329130 0.664565 -0.408622
3 0.604727 0.802364 -0.220193
4 0.416732 0.708366 -0.344795
5 0.085439 0.542719 -0.611163
6 -0.964246 0.017877 -4.024232
7 0.738281 0.869141 -0.140250
8 0.558220 0.779110 -0.249603
9 0.485144 0.742572 -0.297636
If your data is in the range (-1, +1) (assuming you lost the minus in your question), then a log transform is probably not what you need. At least from a theoretical point of view, it's obviously the wrong thing to do.
Maybe your data has already been (inadequately) preprocessed? Can you get the raw data? Why do you think a log transform will help?
If you don't care about doing what is meaningful, you can call log1p, which is the same as log(1+x) and will thus work on (-1, ∞).
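A quick sketch of that, assuming s is the Series from the question (log1p handles the zeros, since log1p(0) = 0):
import numpy as np

s_log = np.log1p(s)  # equals log(1 + x); defined for all x > -1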
I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this:
for j in range(n):
    for i in range(alphaLen):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)
normedValues = costListTrain / np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001,0.0003,0.0008,0.001,0.003,0.006,0.01,0.03,0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
As you can see here, the values are very, very close together!
I am trying to plot this data in a way where I have the two quantities on the x and y axes, and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(x, y, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error values by dividing all values by the max, but it didn't work!
The only way I could make it work (which is incorrect) is to normalize my data in two different ways: one based on each column (meaning factor 1 is constant and factor 2 changes), and the other based on each row (meaning factor 2 is constant and factor 1 changes). But it doesn't really make sense, because I need a single plot that shows the tradeoff between the two quantities' effects on the error values.
UPDATE
This is what I mean by the last paragraph.
Normalizing values based on the max of each row, which corresponds to the eigenvalues:
maxsEigBasedTrain = np.amax(costListTrain.T, 1)[:, np.newaxis]
maxsEigBasedTest = np.amax(costListTest.T, 1)[:, np.newaxis]
normEigCostTrain = costListTrain.T / maxsEigBasedTrain
normEigCostTest = costListTest.T / maxsEigBasedTest
Normalizing values based on the max of each column, which corresponds to the alphas:
maxsAlphaBasedTrain = np.amax(costListTrain, 1)[:, np.newaxis]
maxsAlphaBasedTest = np.amax(costListTest, 1)[:, np.newaxis]
normAlphaCostTrain = costListTrain / maxsAlphaBasedTrain
normAlphaCostTest = costListTest / maxsAlphaBasedTest
Plot 1:
Plot 2, where no. of eigenvalues = 10 and alpha changes (should correspond to column 10 of plot 1):
Plot 3, where alpha = 0.0001 and the eigenvalues change (should correspond to the first row of plot 1):
But as you can see, the results are different from plot 1!
UPDATE:
Just to clarify things further, this is how I read my data:
import numpy as np
from sklearn import datasets

rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes = np.c_[np.ones(len(X_diabetes)), X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)

#===============================================================================
# Split Data
#===============================================================================
import math
cross = math.ceil(0.7 * len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val = ind[cross:]
X_val, y_val = X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded .csv files HERE
log.csv contains the original values before normalization for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3
I think I found the answer. First of all, there was one problem in my code: I was expecting the "no. of eigenvalues" to correspond to rows, but in my for loop they fill the columns. The correct version is this:
for i in range(alphaLen):
    for j in range(n):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[i, j] = cost.err(xt_, xt_, yt_, c)
        costListTest[i, j] = cost.err(xt_, xv_, yv_, c)
After asking friends and colleagues, I got this answer:
I would assume that, by default, imshow and other plotting commands you
might want to use do equally sized intervals on the values you are
plotting. If you can set that to logarithmic, you should be fine.
Ideally, equally "populated" bins would prove most effective, I guess.
For plotting, I just subtract the min value from the errors, then add a small number, and finally take the log:
temp = costListTrain - costListTrain.min()
temp += 0.00000001
extent = [0, 20, alpha_list[0], alpha_list[-1]]
plt.imshow(np.log(temp), interpolation="nearest", cmap=plt.get_cmap('spectral'),
           extent=extent, origin="lower")
plt.colorbar()
And the result is: