Sum the predictions of a Linear Regression from Scikit-Learn - python

I need to make a linear regression and sum all the predictions. Maybe this isn't a question for Scikit-Learn but for NumPy because I get an array at the end and I am unable to turn it into a float.
df
   rank  Sales
0     1  18000
1     2  17780
2     3  17870
3     4  17672
4     5  17556
x = df['rank'].to_numpy()
y = df['Sales'].to_numpy()
X = x.reshape(-1,1)
regression = LinearRegression().fit(X, y)
I am getting it right up to this point. The next part (which is a while loop to sum all the values) is not working:
number_predictions = 100
x_current_prediction = 1
total_sales = 0
while x_current_prediction <= number_predictions:
    variable_sum = x_current_prediction*regression.coef_
    variable_sum_float = variable_sum.astype(np.float_)
    total_sales = total_sales + variable_sum_float
    x_current_prediction =+1
return total_sales
I think that the problem is getting regression.coef_ to be a float, but when I use astype, it does not work?

You don't need to loop like this, and you don't need to use the coefficient to compute the prediction (don't forget there may be an intercept as well).
Instead, make an array of all the values of x you want to predict for, and ask sklearn for the predictions:
X_new = np.arange(1, 101).reshape(-1, 1) # X must be 2D.
y_pred = regression.predict(X_new)
If you want to add all these numbers together, use y_pred.sum() or np.sum(y_pred), or if you want a cumulative sum, np.cumsum(y_pred) will do it.
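Here is a minimal sketch of the same idea that runs without scikit-learn. It assumes np.polyfit as a stand-in for the fitted LinearRegression (for a single feature both give the ordinary least-squares line). It also shows what the loop in the question misses: regression.coef_ is a one-element array, `x_current_prediction =+1` assigns +1 instead of incrementing, and the intercept is left out.

```python
import numpy as np

# Data from the question.
rank = np.array([1, 2, 3, 4, 5], dtype=float)
sales = np.array([18000, 17780, 17870, 17672, 17556], dtype=float)

# Stand-in for LinearRegression().fit(X, y): a degree-1 least-squares fit.
slope, intercept = np.polyfit(rank, sales, deg=1)

# Predict for ranks 1..100 in one vectorized step, then sum.
x_new = np.arange(1, 101)
y_pred = slope * x_new + intercept
total_sales = float(y_pred.sum())

# The same total in closed form: the intercept contributes once per prediction.
closed_form = slope * x_new.sum() + intercept * len(x_new)
print(total_sales, closed_form)
```

With the fitted scikit-learn model, float(regression.predict(X_new).sum()) should give the plain Python float the question asks for.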

Related

Find high correlations in a large coefficient matrix

I have a dataset with 56 numerical features. Loading it to pandas, I can easily generate a correlation coefficients matrix.
However, due to its size, I'd like to find coefficients higher (or lower) than a certain threshold, e.g. >0.8 or <-0.8, and list the corresponding pairs of variables. Is there a way to do it? I figure it would require selecting by value across all columns, then returning, not the row, but the column name and row index of the value, but I have no idea how to do either!
Thanks!
I think you can use where() and stack() like this:
np.random.seed(1)
df = pd.DataFrame(np.random.rand(10,3))
coeff = df.corr()
# 0.3 is used for illustration
# replace with your actual value
thresh = 0.3
mask = coeff.abs().lt(thresh)
# or mask = coeff < thresh
coeff.where(mask).stack()
Output:
0 2 -0.089326
2 0 -0.089326
dtype: float64
Output:
0 1 0.319612
2 -0.089326
1 0 0.319612
2 -0.687399
2 0 -0.089326
1 -0.687399
dtype: float64
This approach will work if you're looking to also deduplicate the correlation results.
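thresh = 0.8
# get the correlation matrix as a long Series (keep the sign so the negative filter is meaningful)
df_corr = df.corr().unstack()
# filter
df_corr_filt = df_corr[(df_corr > thresh) | (df_corr < -thresh)].reset_index()
# drop each variable's correlation with itself (always 1.0)
df_corr_filt = df_corr_filt[df_corr_filt['level_0'] != df_corr_filt['level_1']].reset_index(drop=True)
# deduplicate: (a, b) and (b, a) collapse to the same sorted key
df_corr_filt.iloc[df_corr_filt[['level_0','level_1']].apply(lambda r: ''.join(map(str, sorted(r))), axis = 1).drop_duplicates().index]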
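An alternative sketch, not taken from the answers above: masking everything except the strict upper triangle of the correlation matrix removes the self-correlations and the mirrored duplicates in one step. The function name high_correlation_pairs and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, thresh=0.8):
    """Return correlations with |r| > thresh, one entry per unordered column pair."""
    corr = df.corr()
    # True strictly above the diagonal: drops self-correlations and mirrored pairs.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(upper).stack()
    # NaN (masked) entries compare as False, so they are filtered out here as well.
    return stacked[stacked.abs() > thresh]

# Toy data: b and c are exact linear functions of a, d is only weakly related.
a = np.arange(10, dtype=float)
demo = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": -a, "d": [1, -1] * 5})
pairs = high_correlation_pairs(demo)
print(pairs)
```

The result is a Series whose MultiIndex holds the two column names of each pair, so pairs.index.tolist() gives the list of variable pairs directly.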

How is the offset handled in XGBoost's binary:logistic objective function?

I am working on a mortality prediction (binary outcome) problem with the “base mortality probability” as my offset in XGBoost.
I have used gbtree booster and binary:logistic objective function. In my data I have multiple observations/records with the same X values but different offset values.
As per my understanding (please correct me if I am wrong), XGBoost under the binary:logistic setup fits a model of the form log(p/(1-p)) = offset + F(x), where F(x) is optimized (for a specific loss function) using splits on the X values.
Thus, when the X values are exactly the same, I should be able to get F(x) by taking the predicted output (with the outputmargin = TRUE option) and subtracting the offset. However, with this approach I get different values of F(x) for the same X. I believe the way the offset is handled internally in XGBoost differs from my understanding. Can anyone explain the method/mathematical formulation used for handling the offset?
I am specifically interested in extracting the value of F(x) (as this is the additional information the model provides) by adjusting the model prediction for the offset values.
Here is the sample code:
library(xgboost)
x1 = runif(1000)
y1 = as.numeric(runif(1000)>.8)
y2 = as.numeric(runif(1000)>.8)
off1 = runif(1000)
off2 = runif(1000)
#stacking the data to have same X values
x= c(x1,x1)
y = c(y1,y2)
off = c(off1,off2)
length(unique(off)) # shows unique 2000 values
length(unique(x)) # shows unique 1000 values, i.e. each X is repeated once (as expected)
fulldata = cbind.data.frame(x,y,off)
train_dMtrix = xgb.DMatrix(data = as.matrix(x),
                           label = y,
                           base_margin = off)
params_list = list(booster = "gblinear", objective = "binary:logistic",
                   eta = 0.05, max_depth = 4, min_child_weight = 10, eval_metric = 'logloss')
set.seed(100)
xgbmodel = xgb.train(params = params_list, data = train_dMtrix, nrounds=100, callbacks = list(cb.gblinear.history()))
# Getting the prediction in link format
fulldata$Predicted_link = predict(xgbmodel, train_dMtrix, outputmargin = TRUE)
# Assuming Predicted_link = offset + F(x), calculating F(x) for each values of X
fulldata$F_x = fulldata$Predicted_link - fulldata$off
# As per my understanding, since F(x) is purely independent of the offset,
# the model predictions of F_x (not the predicted probability) should be exactly the same for the same values of x,
# irrespective of the corresponding offsets. Given I have 1000 distinct X values, I'm expecting 1000 distinct F_x values
length(unique(fulldata$F_x)) # shows almost 2000 unique values, which is contrary to my expectation.
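For reference, the relationship assumed in the question can be sketched in Python. This illustrates only the assumed formulation logit(p) = offset + F(x), not what XGBoost does internally; the names split_margin and f_true and all numbers are hypothetical.

```python
import math

def sigmoid(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))

def split_margin(predicted_link, offset):
    """Under the assumed model logit(p) = offset + F(x), recover F(x) and p."""
    f_x = predicted_link - offset
    p = sigmoid(predicted_link)
    return f_x, p

# Two hypothetical records sharing the same x (hence the same F(x)) but with different offsets.
f_true = -1.2
offsets = [0.3, 0.7]
links = [off + f_true for off in offsets]
recovered = [split_margin(link, off)[0] for link, off in zip(links, offsets)]
print(recovered)
```

If a model really followed this formulation, both recovered values would coincide, which is the behaviour the R code above is checking for.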

scaling data between -1 and 1 centred on zero

Apologies in advance for any incorrect wording. The reason I am not finding answers to this might be because I am not using the right terminology.
I have a dataframe that looks something like
0 -0.004973 0.008638 0.000264 -0.021122 -0.017193
1 -0.003744 0.008664 0.000423 -0.021031 -0.015688
2 -0.002526 0.008688 0.000581 -0.020937 -0.014195
3 -0.001322 0.008708 0.000740 -0.020840 -0.012715
4 -0.000131 0.008725 0.000898 -0.020741 -0.011249
5 0.001044 0.008738 0.001057 -0.020639 -0.009800
6 0.002203 0.008748 0.001215 -0.020535 -0.008368
7 0.003347 0.008755 0.001373 -0.020428 -0.006952
8 0.004476 0.008758 0.001531 -0.020319 -0.005554
9 0.005589 0.008758 0.001688 -0.020208 -0.004173
10 0.006687 0.008754 0.001845 -0.020094 -0.002809
...
For each column I would like to scale the data to a float between -1.0 and 1.0 for this column's min and max.
I have tried scikit learn's minmax scaler with scaler = MinMaxScaler(feature_range = (-1, 1)) but some values change sign as a result, which I need to preserve.
Is there a way to 'centre' the scaling on zero?
Have you tried using StandardScaler from sklearn?
It has with_mean and with_std options, which you can use to get the data you want.
The problem with scaling the negative values to the column's minimum value and the positive values to the column's maximum value is that the scale of the negative numbers may be different from the scale of the positive numbers. If you want to use the same scale for both negative and positive values, try the following:
def zero_centered_min_max_scaling(dataframe):
    """
    Scale the numerical values in the dataframe to be between -1 and 1, preserving the
    sign of all values.
    """
    df_copy = dataframe.copy(deep=True)
    for column in df_copy.columns:
        max_absolute_value = df_copy[column].abs().max()
        df_copy[column] = df_copy[column] / max_absolute_value
    return df_copy
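The same max-abs scaling can be sketched with plain NumPy. The helper name max_abs_scale and the guard for all-zero columns are additions for illustration; scikit-learn also ships sklearn.preprocessing.MaxAbsScaler, which scales each feature by its maximum absolute value.

```python
import numpy as np

def max_abs_scale(values):
    """Divide each column by its largest absolute value, so signs are preserved."""
    values = np.asarray(values, dtype=float)
    max_abs = np.abs(values).max(axis=0)
    # Leave all-zero columns unchanged instead of dividing by zero.
    max_abs = np.where(max_abs == 0, 1.0, max_abs)
    return values / max_abs

# First and last rows of the first two columns shown in the question.
sample = np.array([[-0.004973, 0.008638],
                   [0.006687, 0.008754]])
scaled = max_abs_scale(sample)
print(scaled)
```

Every column ends up inside [-1, 1] with zero mapped to zero, but only the entry with the largest magnitude reaches -1 or 1.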

Weighted data problems, mean is fine, but Covar and std look wrong, how do I adjust?

I'm trying to apply a weighted filter to the data, rather than use the raw data, before calculating stats: mu, std and covar. But the results clearly need adjusting.
import numpy as np
from pandas import DataFrame
# generate some data and a filter
f_n = 100  # an integer, since np.random.rand needs integer dimensions
np.random.seed(seed=101)
foo = np.random.rand(f_n, 3)
foo = DataFrame(foo).add(1).pct_change()
f_filter = np.arange(f_n, .0, -1)
f_filter = 1.0 / (f_filter**(f_filter/f_n))
# normalise the filter ... This could be where I'm going wrong?
f_filter = f_filter * (f_n / f_filter.sum())
Now we are ready to look at some results:
print(foo.mul(f_filter, axis=0).mean())
print(foo.mean())
0 0.039147
1 0.039013
2 0.037598
dtype: float64
0 0.035006
1 0.042244
2 0.041956
dtype: float64
Means all look in line, but when we look at covar and std they are significantly different in terms of scale and also direction
print(foo.mul(f_filter, axis=0).cov())
print(foo.cov())
          0         1         2
0  0.124766 -0.038954  0.027256
1 -0.038954  0.204269  0.056185
2  0.027256  0.056185  0.203934
          0         1         2
0  0.070063 -0.014926  0.010434
1 -0.014926  0.099249  0.015573
2  0.010434  0.015573  0.087060
print(foo.mul(f_filter, axis=0).std())
print(foo.std())
0 0.353223
1 0.451961
2 0.451590
dtype: float64
0 0.264694
1 0.315037
2 0.295060
dtype: float64
Any ideas why, how can we adjust the filter or to adjust the covar matrix to make it more comparable?
The problem is your weighting function. (Do you want Gaussian random numbers or uniform r.v.?) See this plot
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
f_n = 100  # an integer, since np.random.rand needs integer dimensions
np.random.seed(seed=101)
# ??? do you want a uniform random variable? or is this just a typo and you want a normal random variable?
foo = np.random.rand(f_n, 3)
foo = DataFrame(foo)
f_filter = np.arange(f_n, .0, -1)
# here is the problem: uneven weights create an artificial trend, making the data non-stationary. Covariance only works for stationary data.
# =============================================
f_filter = 1.0 / (f_filter**(f_filter/f_n))
fig, ax = plt.subplots()
ax.plot(f_filter)
Uneven weights create an artificial trend (your random numbers are all positive uniforms), making the data non-stationary. Covariance only works for stationary data. Take a look at the resulting weighted data.
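One further point, offered as a sketch rather than as part of the answer above: multiplying each observation by its weight and then calling .cov() computes the covariance of the rescaled values w_i * x_i, in which the weights enter squared, rather than a weighted covariance of x. NumPy can take observation weights directly through np.average(..., weights=...) and np.cov(..., aweights=...). The helper name weighted_stats and the random sample are hypothetical.

```python
import numpy as np

def weighted_stats(data, weights):
    """Weighted mean, covariance and standard deviation of the columns of data."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(data, axis=0, weights=weights)
    # aweights treats each weight as the relative importance of an observation.
    cov = np.cov(data, rowvar=False, aweights=weights)
    std = np.sqrt(np.diag(cov))
    return mean, cov, std

rng = np.random.default_rng(101)
sample = rng.random((100, 3))
n = len(sample)
decay = np.arange(n, 0, -1, dtype=float)
w = 1.0 / decay ** (decay / n)  # the filter shape from the question

mean_w, cov_w, std_w = weighted_stats(sample, w)
mean_u, cov_u, std_u = weighted_stats(sample, np.ones(n))
print(std_w, std_u)
```

Computed this way, the result does not depend on how the weights are normalised, and equal weights reproduce the ordinary unweighted statistics.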

Python cross correlation

I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the max cross correlation of the 2 series in Python. In MATLAB, the xcorr() function returns it without problems.
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but the values I get from Python are different from what comes out of MATLAB. Python gives me integer values > 1, whereas MATLAB gives actual correlation values between 0 and 1.
I have tried normalizing the 2 arrays first ((value - mean) / SD), but the cross correlation values I get are in the thousands, which doesn't seem correct.
MATLAB will also give you the lag value at which the cross correlation is the greatest. I assume it is easy to do this using indices, but what's the most appropriate way of doing this if my arrays contain tens of thousands of values?
I would like to mimic the xcorr() function that matlab has, any thoughts on how I would do that in python?
numpy.correlate(arr1, arr2, "full")
gave me the same output as
xcorr(arr1, arr2)
gives in MATLAB.
Implementation of MATLAB xcorr(x,y) and comparison of the result with an example.
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as signal
def xcorr(x, y):
    """
    Perform cross-correlation on x and y
    x : 1st signal
    y : 2nd signal
    returns
    lags : lags of correlation
    corr : coefficients of correlation
    """
    corr = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr
n = np.arange(15)
x = 0.84**n
y = np.roll(x, 5)
lags, c = xcorr(x, y)
plt.figure()
plt.stem(lags, c)
plt.show()
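The snippets above return raw, unnormalized sums, which is why the values can exceed 1. A minimal sketch of one way to scale them, assuming the goal is a peak of 1.0 for identical signals: divide by the square root of the product of the two signal energies. I believe this matches the scaling MATLAB documents for xcorr's 'normalized' (formerly 'coeff') option, but the lag sign convention may differ, so treat that as an assumption. The helper name xcorr_normalized is hypothetical.

```python
import numpy as np

def xcorr_normalized(x, y):
    """Cross-correlation scaled so that identical signals peak at 1.0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.correlate(x, y, mode="full")
    corr = corr / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    # For mode="full", entry k pairs x[n + k] with y[n]; k runs over these lags.
    lags = np.arange(-(len(y) - 1), len(x))
    return lags, corr

# The arrays from the question (different lengths are fine).
data1 = [0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
data2 = [0, 1, 1, 0, 1, 0, 0, 1]
lags, corr = xcorr_normalized(data1, data2)
best_lag = lags[np.argmax(corr)]
print(best_lag, corr.max())
```

np.argmax is a single pass over the output, so locating the best lag stays cheap for long arrays; for tens of thousands of samples the correlation itself dominates the cost, and an FFT-based routine such as scipy.signal.correlate may be faster than np.correlate.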
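This code will help in finding the delay between two channels in an audio file:
import numpy as np
import soundfile as sf  # assumption: sf in the original refers to the soundfile package
xin, fs = sf.read('recording1.wav')
frame_len = int(fs*5*1e-3)
dim_x = xin.shape
M = dim_x[0]  # No. of rows (samples)
N = dim_x[1]  # No. of columns (channels)
sample_lim = frame_len*100
tau = [0]
M_lim = 20000  # for testing as processing takes time
for i in range(1, N):
    # correlate channel 0 with channel i over the first M_lim samples
    c = np.correlate(xin[0:M_lim, 0], xin[0:M_lim, i], "full")
    maxlags = M_lim - 1
    c = c[M_lim - 1 - maxlags: M_lim + maxlags]
    Rmax_pos = np.argmax(c)
    pos = Rmax_pos - M_lim + 1  # convert the peak index to a lag in samples
    tau.append(pos)
print(tau)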
