GaussianProcessRegressor fitting perfectly but poor performance on test data? - python

I am trying to understand GPR, and I am testing it to predict some values. The response is the first component of a PCA, so it is relatively clean data without outliers. The predictors also come from a PCA (n=2), and both predictor columns have been standardized with StandardScaler().fit_transform, as I saw in previous posts that this was better. Since the predictors are standardized, I am using an RBF kernel, multiplying it by a constant kernel (1**2), and letting the hyperparameters fit. The thing is that the model fits the training data perfectly but gives almost constant values for the test data. The set has 463 points, and no matter whether I randomly pick 20, 100 or 200 points for the training data, or add WhiteKernel() or alpha values, I get the same result. I am almost certain that I am doing something wrong, but I can't find what. Any help? Here's the relevant chunk of code and the responses:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as cKrnl  # cKrnl is ConstantKernel

# Two constant * RBF components with very wide hyperparameter bounds
k1 = cKrnl(1**2, (1e-40, 1e40)) * RBF(2, (1e-40, 1e40))
k2 = cKrnl(1**2, (1e-40, 1e40)) * RBF(2, (1e-40, 1e40))
kernel = k1 + k2

gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, normalize_y=True)
gp.fit(x_train, y_train)
print("GPML kernel: %s" % gp.kernel_)
Output:
GPML kernel: 1**2 * RBF(length_scale=0.000388) + 8.01e-18**2 * RBF(length_scale=2.85e-18)
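For context, the surrounding pipeline boils down to something like the sketch below (illustrative names only; X_pca and y_resp stand for the PCA predictors and the PCA response, and the 100-point split is one of the sizes mentioned above):

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_std = StandardScaler().fit_transform(X_pca)        # standardized PCA predictors
x_train, x_test, y_train, y_test = train_test_split(
    X_std, y_resp, train_size=100, random_state=0)   # 20-200 training points were tried

gp.fit(x_train, y_train)
y_pred, y_sigma = gp.predict(x_test, return_std=True)  # near-constant on the test data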
Training data: (plot not shown)
Test data and prediction: (plot not shown)
Thanks to all!!!

Related

SVM testing - normalization of test data [duplicate]

(This question was closed as a duplicate of: what is the difference between fit(), fit_transform() and transform() in scikit-learn?)
I'm working with an SVM model to classify 5 different classes (N1, N2, N3, W, R).
Feature extractions -> Data normalization -> train SVM
When I tested the model (with the usual 80/20 train-test split), it showed high accuracy.
But when I tried testing with a completely new dataset, with the same method of
Feature extractions -> Data normalization -> test on trained SVM model
It came out really badly.
Let's say the original dataset used in training is A, and the new test dataset is B.
When I trained the model only with A and tested on B, it came out really badly.
First I thought it was model overfitting, so I included both A and B in training and tested with B. It came out badly again...
I think the problem is the normalization process. It eventually worked when I tried a new dataset C, but that time I brought in the training data A, concatenated A+C for normalization, and then cut only the C part out of it. When I compared that with dataset C normalized on its own, the values were different.
I used MinMaxScaler from sklearn.
I mean, mathematically speaking, of course it's different, because every dataset has a different minimum and maximum, so the normalized data changes when it is mixed with other data.
My question is: when you test with a new dataset, is it normal to bring in the training dataset, normalize them together, and then take out only the test part? It's like mixing A (112x12) and B (15x12) -> normalizing (127x12) together -> taking out (15x12).
Or should I start by fixing the code from feature extraction and SVM training?
(I attached the code; each feature has shape 12x1, which means each stage is a 12xN matrix.)
import pandas as pd
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()

# Load training data
N1_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N1_features")
N2_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N2_features")
N3_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N3_features")
W_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_W_features")
R_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_R_features")
# Load test data
N1_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N1_features")
N2_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N2_features")
N3_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N3_features")
W_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_W_features")
R_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_R_features")
# normalize with original raw features and take only test out
N1_scaled_test = features.normalize_together(N1_test, N1_train, "N1")
N2_scaled_test = features.normalize_together(N2_test, N2_train, "N2")
N3_scaled_test = features.normalize_together(N3_test, N3_train, "N3")
W_scaled_test = features.normalize_together(W_test, W_train, "W")
R_scaled_test = features.normalize_together(R_test, R_train, "R")
def normalize_together(test, raw, stage_no):
    # Concatenate test + train, fit the MinMaxScaler on the combined data,
    # then keep only the (rescaled) test rows.
    together = pd.concat([test, raw], ignore_index=True)
    scaled_test = pd.DataFrame(scaler.fit_transform(together.iloc[:, :-1]))
    scaled_test['label'] = "{}".format(stage_no)
    scaled_test = scaled_test.iloc[0:test.shape[0], :]
    return scaled_test
Test data should remain unseen during training (and that includes preprocessing): don't use test + train data together to compute a common normalisation. Fit the scaler on the training set only, then use that already-fitted scaler to transform the test set.
Why? It's vital to use an unseen test partition to evaluate your trained model. Otherwise you have not tested your model's ability to generalise: imagine playing a game of cards where you already have prior knowledge of the cards or the order of the deck.
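In scikit-learn terms, that pattern looks roughly like this (a sketch; X_train and X_test stand for your feature matrices without the label column):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse those same min/max values for the test data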

Retrieving r2 value in negative

I have the following code applying LightGBM to the dataset (link shared below). I get a negative r2 of -2.0687981990506565. The RMSE I am getting is very low, yet the r2 value is negative. How can the model perform badly while having very low MSE on both train and test data?
import pandas as pd
import lightgbm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

weights_data = pd.read_csv("dataset.csv")
columns = weights_data.columns
target = columns[-1:]
features = columns[:-1]

def regressor_model():
    X = weights_data[features].to_numpy()
    Y = weights_data[target].to_numpy() * 100
    x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8, random_state=2021)
    regressor = lightgbm.LGBMRegressor()
    regressor.fit(x_train, y_train)
    y_pred = regressor.predict(x_test)
    r2_score_value = r2_score(y_test, y_pred)
    print(r2_score_value)
    return regressor

regressor_model()
Link for dataset https://drive.google.com/file/d/1W1G67215vNZpsU1BEiz5S4XO0XwZJhwR/view?usp=sharing
If the order of the arguments to r2_score is swapped, for instance like below, an r2 value of 0.0 is retrieved.
r2_score_value=r2_score(y_pred,y_test)
If you are getting a negative r-squared, it means your model is doing worse than simply predicting the mean of the target. From the above code I guess you are using the default parameters of LGBMRegressor(); you need to tune the parameters of your model. Tuning the parameters will probably improve the result.
You can find a similar scenario here
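For example, a basic tuning pass could look like this (a sketch only; the parameter grid values are illustrative, and note that r2_score expects the true values first, i.e. r2_score(y_test, y_pred)):

import lightgbm
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "num_leaves": [15, 31, 63],
}
search = GridSearchCV(lightgbm.LGBMRegressor(), param_grid, scoring="r2", cv=5)
search.fit(x_train, y_train.ravel())   # ravel in case y_train is a column vector
print(search.best_params_, search.best_score_)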

Logistic regression coefficient meaning

I'm trying to write my own logistic regressor (using batch/mini-batch gradient descent) for practice purposes.
I generated a random dataset (see below) with normally distributed inputs, and the output is binary (0,1). I manually chose coefficients for the inputs and was hoping to be able to reproduce them (see below for the code snippet). However, to my surprise, neither my own code nor sklearn's LogisticRegression was able to reproduce the actual numbers (although the sign and order of magnitude are in line). Moreover, the coefficients my algorithm produced are different from the ones produced by sklearn.
Am I misinterpreting what the coefficients for a logistic regression are?
I will appreciate any insight into this discrepancy.
Thank you!
edit: I tried using statsmodels Logit and got yet a third set of slightly different values for the coefficients
Some more info that might be relevant:
I wrote a linear regressor using an almost identical code and it worked perfectly, so I am fairly confident this is not a problem in the code. Also my regressor actually outperformed the sklearn one on the training set, and they have the exact same accuracy on the test set, so I have no reason to believe the regressors are wrong.
Code snippets for the generation of the dataset:
o1 = 2
o2 = -3
# x, size and sigmoid are defined earlier in the script (not shown here)
x[:, 1] = np.random.rand(size) * 2
x[:, 2] = np.random.rand(size) * 3
y = np.vectorize(sigmoid)(x[:, 1] * o1 + x[:, 2] * o2 + np.random.normal(size=size))
so as can be seen, input coefficients are +2 and -3 (intercept 0);
sklearn coefficients were ~2.8 and ~-4.8;
my coefficients were ~1.7 and ~-2.6
and here are the most relevant parts of the regressor:
for j in range(bin_size):
    xs = x[i]                        # i is advanced elsewhere in the loop (not shown)
    y_real = y[i]
    z = np.dot(self.coeff, xs)
    h = sigmoid(z)
    dc += (h - y_real) * xs          # accumulate the gradient over the batch
self.coeff -= dc * (learning_rate / n)
What intercept was learned? It really should not be a surprise, as your y is a polynomial of third degree, while your model has only two coefficients; three, plus a y-intercept, would be needed to model the response variable from the predictors.
Furthermore, the values may differ simply because of the optimiser used, SGD for example.
Not really sure, but different coefficients could still return the correct y for a finite set of points. What are the metrics on each model? Do they differ?
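One quick way to put numbers on that last question (a sketch only; it assumes x and y from the generation snippet above, and binarising y at 0.5 is an assumption, since that snippet produces continuous sigmoid outputs rather than 0/1 labels):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

y_bin = (y > 0.5).astype(int)   # assumption: binarise the continuous sigmoid output
X_feats = x[:, 1:3]             # the two generated feature columns

clf = LogisticRegression().fit(X_feats, y_bin)
print("coefficients:", clf.coef_, "intercept:", clf.intercept_)
print("accuracy:", accuracy_score(y_bin, clf.predict(X_feats)))
print("log-loss:", log_loss(y_bin, clf.predict_proba(X_feats)[:, 1]))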

low SVM accuracy on train and test sets in python

I'm porting some matlab/octave scripts for support vector machines (SVMs) to python, but I'm getting poor accuracy in one of the two scripts with the sklearn method.
ex6_spam.py loads some data and trains a spam-detecting model.
In matlab, the SVM code provided, svmTrain.m, (see below for snippets) gives me ~99% accuracy in both the training and the test sets.
In python, sklearn.svm.SVC().fit() is giving me ~56% if I just use its linear kernel, and ~44% if I precompute the Gram matrix for a linear kernel. (The data and code - ex6_spam.py - are here.)
The odd thing, too, is that the exact same piece of code used in ex6.py gives me proper classification of 2D data points. Its behavior there is almost identical to that of the matlab/octave script.
I'm not doing much in ex6_spam.py - I load a training set:
import scipy.io

mat = scipy.io.loadmat('spamTrain.mat')
X = mat["X"]
y = mat["y"]
I feed it to sklearn.svm.SVC().fit() (through a small wrapper):
C = 0.1
model = svmt.svmTrain(X, y, C, "linear")
# this results in
# clf = svm.SVC(C = C, kernel=kernelFunction, tol=tol, max_iter=max_passes, verbose=2)
# return clf.fit(X, y)
and then I make a prediction:
p = model.predict(X)
The matlab/octave equivalent is
load('spamTrain.mat');
C = 0.1;
model = svmTrain(X, y, C, @linearKernel);  % see the link to svmTrain.m above
p = svmPredict(model, X);
However, the results are wildly different. Any ideas why? I haven't had the chance to run it on a different computer, but maybe that's a possible reason?
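For reference, a stripped-down, self-contained version of the Python path looks like this (a sketch; the svmt.svmTrain wrapper is replaced by the direct scikit-learn call it contains, tol and max_iter are left at their defaults, and y.ravel() is used because scikit-learn expects a 1-D label array):

import scipy.io
from sklearn import svm

mat = scipy.io.loadmat('spamTrain.mat')
X = mat["X"]
y = mat["y"].ravel()            # flatten the (n, 1) label matrix to shape (n,)

clf = svm.SVC(C=0.1, kernel="linear")
clf.fit(X, y)

p = clf.predict(X)
print("training accuracy:", (p == y).mean())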

sklearn's GradientBoostingRegressor gives the same prediction for different inputs

I encountered weird behavior while trying to train sklearn's GradientBoostingRegressor and make predictions. I will use an example to demonstrate the issue on a reduced dataset, but the issue remains on a larger dataset as well. I have the following two small datasets, adapted from a big dataset. As you can see, the target variable is identical in both cases, but the input variables are different, though their values are close to each other. The target variable (Y) is in the last column.
I have the following code:
import pandas as pd
from sklearn import ensemble

d1 = {'0':[101869.2,102119.9,102138.0,101958.3,101903.7,12384900],
'1':[101809.1,102031.3,102061.7,101930.0,101935.2,11930700],
'2':[101978.0,102208.9,102209.8,101970.0,101878.6,12116700],
'3':[101869.2,102119.9,102138.0,101958.3,101903.7,12301200],
'4':[102125.5,102283.4,102194.0,101884.8,101806.0,10706100],
'5':[102215.5,102351.9,102214.0,101769.3,101693.6,10116900]}
data1 = pd.DataFrame(d1).T
X1 = data1.iloc[:, :5]   # the first five columns are the features (pandas .ix is deprecated)
Y = data1[5]
d2 = {'0':[101876.0,102109.8,102127.6,101937.0,101868.4,12384900],
'1':[101812.9,102021.2,102058.8,101912.9,101896.4,11930700],
'2':[101982.5,102198.0,102195.4,101940.2,101842.5,12116700],
'3':[101876.0,102109.8,102127.6,101937.0,101868.4,12301200],
'4':[102111.3,102254.8,102182.8,101832.7,101719.7,10706100],
'5':[102184.6,102320.2,102188.9,101699.9,101548.1,10116900]}
data2 = pd.DataFrame(d2).T
X2 = data2.iloc[:, :5]   # the first five columns are the features (pandas .ix is deprecated)
Y = data2[5]
re1 = ensemble.GradientBoostingRegressor(n_estimators=40,max_depth=None,random_state=1)
re1.fit(X1,Y)
pred1 = re1.predict(X1)
re2 = ensemble.GradientBoostingRegressor(n_estimators=40,max_depth=None,random_state=3)
re2.fit(X2,Y)
pred2 = re2.predict(X2)
where
X1 is a pandas DataFrame corresponding to Column 1 through Column 5 on the 1st dataset
X2 is a pandas DataFrame corresponding to Column 1 through Column 5 on the 2nd dataset
Y represents the target column.
The issue I am facing is that I cannot explain why pred1 is exactly the same as pred2. As long as X1 and X2 are not the same, pred1 and pred2 must also be different, mustn't they? Please help me find my false assumption.
What you observe is perfectly expected.
You fit a high-complexity estimator to the data (max_depth=None), so it is easy for it to learn all of the data by heart, that is, to overfit completely on the training data.
Then the prediction will simply be whatever labels you gave it for training.
Have a look at Peter's talk here about how to tune GradientBoosting correctly:
https://www.youtube.com/watch?v=-5l3g91NZfQ
Anyhow, you should at least have a test-set.
My guess is that since you are fitting X1 and X2 to the same Y, it is reasonable that pred1 and pred2 are similar. When your regressor is very powerful (it can fit anything to anything) or your problem is too easy (it can be fitted exactly by your regressor), pred1 and pred2 will both be equal to Y.
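A quick check of both explanations (a sketch; it assumes the snippets from the question have been run):

import numpy as np

# With max_depth=None each tree can fit its residuals (almost) exactly, so the
# training-set predictions end up tracking the shared targets Y rather than the
# particular input columns.
print(np.allclose(pred1, pred2))             # expected: True
print(np.max(np.abs(pred1 - Y.to_numpy())))  # small relative to Y's scale (~1e7)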
