I am running ridge regression on a dataset and have done 5-fold cross-validation, so my dataset is divided into 5 train and 5 test folds.
This is how I did it in scikit-learn:
from sklearn.model_selection import KFold

k_fold = KFold(n_splits=5)
I set the regularisation parameter like this:
#Generating alpha values for regularization parameters
n_alphas = 200
alphas = np.logspace(-10, -1, n_alphas)
Now, my doubt is this: for each train and test fold, I do something like the following.
coefs = []
ridge_tourism = linear_model.Ridge()
for a in alphas:
    ridge_tourism.set_params(alpha=a)
    for train_indices, test_indices in k_fold.split(tourism_train_X):
        ridge_tourism.fit(tourism_train_X[train_indices], tourism_train_Y[train_indices])  # Fitting the model
        coefs.append(ridge_tourism.coef_)
The problem is that this gives me a coefficient vector for each of the five training folds within each alpha. All I want is, for each alpha, the best coefficient vector. How do we get that? How do we choose, out of the 5 training folds, which coefficient vector is finally reported for that alpha?
For each alpha value, take the mean of the validation error over the 5 validation folds. You will then get a curve of mean validation error vs. alpha; choose the alpha value that gives the lowest mean validation error. You do not pick one of the five fold-specific coefficient vectors: once the best alpha is chosen, refit the model on the full training data with that alpha, and report the coefficients of that refit.
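A minimal sketch of that procedure, assuming the same tourism_train_X/tourism_train_Y arrays from the question and mean squared error as the validation metric:

import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

k_fold = KFold(n_splits=5)
alphas = np.logspace(-10, -1, 200)

mean_val_errors = []
for a in alphas:
    fold_errors = []
    for train_idx, val_idx in k_fold.split(tourism_train_X):
        model = linear_model.Ridge(alpha=a)
        model.fit(tourism_train_X[train_idx], tourism_train_Y[train_idx])
        preds = model.predict(tourism_train_X[val_idx])
        fold_errors.append(mean_squared_error(tourism_train_Y[val_idx], preds))
    mean_val_errors.append(np.mean(fold_errors))  # mean validation error for this alpha

best_alpha = alphas[int(np.argmin(mean_val_errors))]
# Refit on the full training data with the chosen alpha to get the reported coefficients
final_model = linear_model.Ridge(alpha=best_alpha).fit(tourism_train_X, tourism_train_Y)
print(best_alpha, final_model.coef_)

RidgeCV(alphas=alphas, cv=5) from sklearn.linear_model performs essentially the same search for you.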
I'm facing a problem with different linear models from scikit-learn.
Here is my code:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_train).reshape(-1)
print(f"R2 on train set:{reg.score(X_train, y_train)}")
print(f"R2 on test set:{reg.score(X_test, y_test)}")
print(f"MSE on train set:{mean_squared_error(y_train, y_pred)}")
print(f"MSE on test set:{mean_squared_error(y_test, reg.predict(X_test))}")
output:
>R2 on train set:0.5810258473777401
>R2 on test set:0.5908396388537969
>MSE on train set:0.023576848498732563
>MSE on test set:0.02378699441936436
The model is fitted; now I want to get the slope coefficients and the intercept from my model:
A, B = reg.coef_[0], reg.intercept_[0]
A, B
output:
>(array([ 0.14373081, -1.8211677 , 1.81493948, 1.39041689, -0.14027746]),
> 0.060286931992710735)
Since I used 5 features to fit the model I also have 5 slope coefficients, ok.
But when I try to visualize y_true, y_pred and the regression line (ax + b) for each feature, the line looks wrong for the second feature (total rooms). With a slope coefficient of -1.81 that is at least consistent, but if the model's predictions look fine, how can that per-feature regression line look so bad? It makes no sense, right?
I thought that reg.coef_ might not be returned in the same order as the features the model was fitted with, but as far as I can see it should be the same order, so I don't know.
Here is also the part of the code that plots the regression, just in case:
sns.lineplot(x=X[:, i], y=(a[i]*X[:, i])+b, label="regression", color=c3, alpha=1, ci=None, ax=axes[i])
Any idea? I keep in mind that there may be no problem at all, but visually it hurts a bit.
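For context, the full plotting loop being described probably looks something like the sketch below; the subplot layout, the c3 colour, and the use of X, y and y_pred as 1-D NumPy arrays are assumptions, since only one line of it appears in the question.

import matplotlib.pyplot as plt
import seaborn as sns

a, b = reg.coef_[0], reg.intercept_[0]            # slopes and intercept from the fitted model
n_features = X.shape[1]
fig, axes = plt.subplots(1, n_features, figsize=(4 * n_features, 4))
c3 = "red"                                        # assumed colour constant

for i in range(n_features):
    sns.scatterplot(x=X[:, i], y=y, label="y_true", ax=axes[i])
    sns.scatterplot(x=X[:, i], y=y_pred, label="y_pred", ax=axes[i])
    sns.lineplot(x=X[:, i], y=a[i] * X[:, i] + b, label="regression",
                 color=c3, alpha=1, ax=axes[i])
plt.show()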
Having 5 coefficients is right: that number is independent of how the regression is plotted; there is simply one slope per independent variable (feature).
Say I have a learning curve from sklearn's learning_curve for an SVM, and I'm also doing 5-fold cross-validation, which, as far as I understand, means splitting your training data into 5 pieces, training on four of them and testing on the last one.
So my question is: since for each data point in the learning curve the size of the training set is different (because we want to see how the model performs with an increasing amount of data), how does the cross-validation work in that case? Does it still split the whole training set into 5 equal pieces? Or does it split the current point's training set into five smaller pieces and then compute the test score? Is it possible to get a confusion matrix for each data point (i.e. true positives, true negatives, etc.)? I don't see a way to do that yet based on the sklearn learning_curve code.
Also, does the number of cross-validation folds relate to the number of training-set sizes in train_sizes = np.linspace(0.1, 1.0, 5)?
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator, X, y, cv=cv, n_jobs=n_jobs, scoring=scoring,
    train_sizes=train_sizes, return_times=True)
Thank you!
No, it does not split that reduced training set into 5 folds again. The data is split into 5 folds once, as usual; then, for a particular combination of training folds (for example, folds 1, 2, 3 and 4 as training), it picks only the first k data points (the x-ticks of the curve) as training data from those 4 training folds. The test fold is used in full as the testing data.
If you look at the code here it would become clearer for you.
for train, test in cv_iter:
    for n_train_samples in train_sizes_abs:
        train_test_proportions.append((train[:n_train_samples], test))
n_train_samples would be something like [200,400,...1400] for the plot that you had mentioned.
Does the number of cross-validation folds relate to the number of training-set sizes in train_sizes = np.linspace(0.1, 1.0, 5)?
No, you can't tie the number of folds to a particular train size; each train size just selects a subset of the data points from all of the training folds.
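To make the mechanism concrete, here is a hand-rolled sketch of the same subsetting, which also lets you compute a confusion matrix per training size; the SVC estimator, the absolute train sizes, and X/y being NumPy arrays are assumptions for illustration.

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=5)
train_sizes_abs = [200, 400, 600, 800, 1000]   # assumed absolute sizes, i.e. the x-ticks

for train, test in cv.split(X, y):
    for n in train_sizes_abs:
        est = SVC()
        est.fit(X[train[:n]], y[train[:n]])    # only the first n points of the 4 training folds
        y_hat = est.predict(X[test])           # the test fold is always used in full
        print(n, est.score(X[test], y[test]))
        print(confusion_matrix(y[test], y_hat))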
I have a linear regression model and my cost function is a Sum of Squares Error function. I've split my full dataset into three datasets, training, validation, and test. I am not sure how to calculate the training error and validation error (and the difference between the two).
Is the training error the Residual Sum of Squares error calculated using the training dataset?
An example of what I'm asking: if I were doing this in Python and had, say, 90 data points in the training dataset, is this the correct code for the training error?
y_predicted = f(X_train, theta)  # predicted y-value at each point x; y_train is the actual y-value

training_error = 0
for i in range(90):
    out = y_predicted[i] - y_train[i]
    out = out * out
    training_error += out

training_error = training_error / 2
print('The training error for this regression model is:', training_error)
This is mentioned in a comment on the post, but you need to divide by the total number of samples to get a number that you can compare between the validation and test sets.
Simply changed, the code would be:
y_predicted = f(X_train, theta)  # predicted y-value at each point x; y_train is the actual y-value

training_error = 0
for i in range(90):
    out = y_predicted[i] - y_train[i]
    out = out * out
    training_error += out

# change 2 to 90
training_error = training_error / 90
print('The training error for this regression model is:', training_error)
The goal of this is to let you compare two different subsets of the data using the same metric. The divide by 2 you had in there is fine as well, as long as you also divide by the number of samples.
Another way you can do this in Python is with the scikit-learn library, which already has this function; see below.
from sklearn.metrics import mean_squared_error
training_error = mean_squared_error(y_train,y_predicted)
Also, for calculations like this it is generally better and faster to use vectorized array operations instead of a for loop. In the context of this question 90 records is quite small, but when you start working with larger sample sizes you could try something like this using numpy.
import numpy as np
training_error = np.mean(np.square(np.array(y_predicted)-np.array(y_train)))
All 3 ways should get you similar results.
I am working on a data project assignment where I am asked to use 50% of the data for training and the remaining 50% for testing. I would like to use the magic of cross-validation and still meet the aforementioned criteria.
Currently, my code is following:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

clf = LogisticRegression(penalty='l2', class_weight='balanced')

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

# cross validation
cv = StratifiedKFold(n_splits=2)
i = 0
for train, test in cv.split(X, y):
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    i += 1

print("Average AUC: ", sum(aucs)/len(aucs), "AUC: ", aucs[-1])
Since I am using just 2 splits, is it considered as if I were using a 50:50 train/test split? Or should I first split the data 50:50, then use cross-validation on the training part, and finally use that model to test on the remaining 50% of the data?
You should implement your second suggestion.
Cross-validation should be used to tune the parameters of your approach. Among others, such parameters in your example are the value of the C parameter and the class_weight='balanced' of Logistic Regression. So you should:
split in 50% training, 50% test
use the training data to select the optimal values of the parameters of your model with cross-validation
Refit the model with the optimal parameters on the training data
Predict for the test data and report the score of the evaluation measure you selected
Note that you should use the test data only for reporting the final score and not for tuning the model; otherwise you are cheating. Imagine that in reality you might not have access to the test data until the last moment, so you could not use them for tuning.
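A minimal sketch of that workflow, reusing the logistic regression from the question; the C grid, the scoring choice and the 5 inner folds are assumptions.

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# 1) 50/50 split, stratified so both halves keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# 2) tune hyperparameters with cross-validation on the training half only
param_grid = {"C": np.logspace(-3, 3, 7)}
search = GridSearchCV(
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000),
    param_grid, cv=StratifiedKFold(n_splits=5), scoring="roc_auc")
search.fit(X_train, y_train)

# 3) GridSearchCV refits the best model on the whole training half (refit=True by default)
# 4) report the final score once, on the untouched test half
probas = search.predict_proba(X_test)[:, 1]
print("Best C:", search.best_params_["C"], "Test AUC:", roc_auc_score(y_test, probas))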
I am working on large data, and I want to find important features.
As I am a biologist by training, please forgive my lack of knowledge.
My dataset has about 5000 attributes and 500 samples, with binary classes 0 and 1. Also, the dataset is imbalanced: about 400 samples are 0s and 100 are 1s.
I want to find some features, which influence most in determining class.
A1 A2 A3 ... Gn Class
S1 1.0 0.8 -0.1 ... 1.0 0
S2 0.8 0.4 0.9 ... 1.0 0
S3 -1.0 -0.5 -0.8 ... 1.0 1
...
Following advice I got on a previous question, I am trying to treat the attributes with high coefficients as important features, using Lasso regression with an L1 penalty, because it shrinks the coefficients of unimportant features to 0.
I am doing this work using scikit-learn library.
So, my questions are like this.
Can I use Lasso regression for an imbalanced binary class problem? If not, is logistic regression a good solution, even though it does not use an L1 penalty?
How can I find the optimal value of alpha using LassoCV? The documentation says that LassoCV supports it, but I cannot find the function.
Is there another good way to approach this kind of classification?
Thank you very much.
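On the LassoCV part of the question: the selected alpha is exposed as a fitted attribute rather than a separate function. A minimal sketch, assuming X and y are the feature matrix and the 0/1 labels (whether Lasso is appropriate here at all is addressed in the answer below):

import numpy as np
from sklearn.linear_model import LassoCV

# LassoCV cross-validates a grid of alphas and exposes the winner as .alpha_
lasso = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5).fit(X, y)
print(lasso.alpha_)              # the selected regularisation strength
print(np.sum(lasso.coef_ != 0))  # number of features kept with non-zero coefficients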
You should use a classifier instead of a regressor, so either an SVM or logistic regression would do the job. In scikit-learn you can use SGDClassifier, where you can set the loss parameter to 'log_loss' (called 'log' in older versions) for logistic regression or 'hinge' for a linear SVM.
In SGDClassifier you can set the penalty to any of 'l1', 'l2' or 'elasticnet', which is a combination of both.
You can find an optimum value of alpha either by looping over different values of alpha and evaluating the performance on a validation set, or by using GridSearchCV:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

tuned_parameters = {'alpha': [10 ** a for a in range(-6, -2)]}
clf = GridSearchCV(
    SGDClassifier(loss='hinge', penalty='elasticnet', l1_ratio=0.15, max_iter=5,
                  shuffle=True, verbose=False, n_jobs=10, average=False,
                  class_weight='balanced'),
    tuned_parameters, cv=10, scoring='f1_macro')
# after fitting, clf.best_estimator_ is the best classifier found in the search space
clf.fit(X_train, Y_train)
# you can find the best alpha here
print(clf.best_params_)
This searches over the range of alpha values you have provided in tuned_parameters and then finds the best one. You can change the performance criterion from 'f1_macro' to 'f1_weighted' or another metric.
To address the skewness of your dataset in terms of labels, use the class_weight parameter of SGDClassifier and set it to 'balanced'.
To find the top 10 features contributing to the class labels, you can find their indices as:
import numpy as np

for i in range(clf.best_estimator_.coef_.shape[0]):
    top10 = np.argsort(clf.best_estimator_.coef_[i])[-10:]
Note 1: It is always good to keep some part of your dataset aside as a validation/test set, and after finding your optimum model, to evaluate it on that held-out data.
Note 2: It is usually good to play a little bit with different types of feature normalisation and sample normalisation, dividing each row or column by its 'l2' or 'l1' norm, and see the effect on performance (see Normalizer).
Note 3: For elasticnet regularisation, play a little bit with the l1_ratio parameter.
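As a rough illustration of Notes 2 and 3, both the normalisation choice and l1_ratio can be folded into the same grid search with a Pipeline; the particular grids below are just examples.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("norm", Normalizer()),   # per-sample normalisation
    ("clf", SGDClassifier(loss="hinge", penalty="elasticnet", class_weight="balanced")),
])
param_grid = {
    "norm__norm": ["l1", "l2"],                    # Note 2: try different normalisations
    "clf__alpha": [10 ** a for a in range(-6, -2)],
    "clf__l1_ratio": [0.15, 0.5, 0.9],             # Note 3: vary the l1/elasticnet mix
}
clf = GridSearchCV(pipe, param_grid, cv=10, scoring="f1_macro")
clf.fit(X_train, Y_train)
print(clf.best_params_)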