Once I have iterated over each training combination given the k-fold split, I can estimate the mean and standard deviation of the models' performance, but I actually end up with k different models (each with its own fitted parameters). How do I get the final, whole model? Is it a matter of averaging the parameters?
I'm not showing code because this is a general question, so I'll write down the logic only:
take the dataset
split the dataset according to k-fold theory (let's say k = 5)
iterate: train the first through the fifth model
get 5 different models with, let's say, the following parameters:
model_1 = [p10, p11, p12] \
model_2 = [p20, p21, p22] |
model_3 = [p30, p31, p32] > param_matrix
model_4 = [p40, p41, p42] |
model_5 = [p50, p51, p52] /
What about model_final: [pf0, pf1, pf2]?
Too trivial solution 1:
model_final = mean(param_matrix, axis=0)
Too trivial solution 2:
model_final = the one of the five that reaches the highest performance
(it could be overfit rather than the optimal one)
First of all, the purpose of cross-validation (K-fold) is model checking, not model building.
In your question, you said that every fold of your program has different parameters; maybe this is not the best way to work.
One way to proceed is to evaluate every model (each one with different parameters) using K-fold internally (e.g. with GridSearchCV); if you obtain similar values of accuracy or other metrics in each split, then you are not overfitting.
Apply this methodology to every model you have, and choose the one with which you obtain the best results. Of course, there is always the possibility of overfitting, but with K-fold you reduce it.
Finally, once you have checked with cross-validation that you obtain similar metrics for every split and you have chosen the model parameters, train your model on all your training data; you will then obtain one unique, final model.
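To make that last step concrete, here is a minimal sketch (with a stand-in dataset and a hypothetical parameter grid): the K-fold loop lives inside GridSearchCV and is only used for checking/selection, while the final model is the best configuration refit on all the training data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# stand-in data; replace with your own training set
X, y = make_classification(n_samples=500, random_state=0)

# hypothetical grid -- substitute the hyperparameters of your own model
param_grid = {"n_estimators": [100, 300], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0),
                      scoring="accuracy")
search.fit(X, y)  # the inner K-fold only scores hyperparameter candidates

# the final, unique model: the chosen configuration refit on all the training data
final_model = search.best_estimator_  # refit=True (the default) retrains on the full data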
I am trying to build an XGBoost model for a binary classification problem on an imbalanced dataset, with class 1 (21k records) being the minority and class 2 (~1M records) being the majority.
I am not a big fan of SMOTE-style upsampling or of downsampling, because with downsampling we might end up losing data, and upsampling (augmented data) might introduce bias or overfitting.
I want to build my model something like the following:
I want to create stratified K-folds, maintaining the class ratio of the data,
so if I use 48 as my number of splits, I will have each fold where class 1 and class 2 are equal.
kf = model_selection.StratifiedKFold(n_splits=48)
By doing this, I feel that each of my models will be trained on a diverse data set.
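A rough sketch of that plan (with synthetic stand-in data; one XGBoost model per stratified split, which could later be "merged" by averaging predicted probabilities, in the spirit of the averaging snippet at the end of this question):
import numpy as np
from sklearn import model_selection
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# synthetic stand-in for the real imbalanced data (X, y = actual features/labels)
X, y = make_classification(n_samples=20000, weights=[0.98, 0.02], random_state=42)

kf = model_selection.StratifiedKFold(n_splits=48, shuffle=True, random_state=42)

fold_models = []
for train_idx, _ in kf.split(X, y):
    clf = XGBClassifier(n_estimators=100)   # one weak learner per fold
    clf.fit(X[train_idx], y[train_idx])
    fold_models.append(clf)

# "merge" the 48 weak learners by averaging their predicted probabilities
def ensemble_predict_proba(models, X_new):
    return np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)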
Questions:
Now how exactly do I merge all of these weak learner models (48 models) and make predictions with the final ensemble model (strong learner)?
Is this a good idea for working with such imbalanced data, or will just setting "scale_pos_weight" help here?
I also thought of something like the following (but I feel it's not the best way):
pred_logreg = logreg.predict_proba(xfold1)[:, 1]
pred_rf = rf.predict_proba(xfold1)[:, 1]
pred_xgbc = xgbc.predict_proba(xfold1)[:, 1]
avg_pred = (pred_logreg + pred_rf + pred_xgbc) / 3
Any other ways here?
Say I have a df that looks like this:
   userId  movieId  rating
0       1       31       1
1       1       34       5
2       1      742       2
3       1     1013       4
4       2       31       1
...
I've split it using stratified sampling to keep the same users in both the train and test sets.
When training on the train dataset, I would usually initialize embedding matrices for users and movies and try to learn them using SGD.
After the two matrices are learned, say P and Q, I take dot_product(P_i, Q.T_j) to get the prediction for the (i, j)-th position in the ratings matrix.
Since P and Q are learned embeddings, it seems correct to use them to predict the validation dataset. However, simply computing validation_dataset - dot_product(P, Q) doesn't make sense because the shapes of the train and validation datasets are different.
One way to do it is to take known ratings out of the original dataset and keep them as a validation set. However, I am wondering if there is a way to split the data first and then apply the learned embeddings to predict the test set (this seems more intuitive to me, but I do not know how to do it...).
The most widely accepted method to calculate test-set performance on collaborative-filtering systems is to keep some number of known user-item interactions separate, in the form of a test set. We exclude those test-set interactions from the training set, which is used to train the model.
After training, for each pair of user u and item i in the test set, we compute the model's predicted interaction score for u and i, and compare it with the known interaction score, which is either 1 or 0 (0 when negative-sampling is used). This is how we compute the test-set performance metrics.
If you use the model's predicted ratings/scores to create new data-points for the test-set, then it may not reflect the true generalization performance of the model on completely unseen/new data. Let me know if that answers your question, or if any clarifications are needed.
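As a minimal sketch of that evaluation (with random stand-ins for P, Q and the held-out interactions; here the known scores are explicit ratings rather than 0/1, so the metric is RMSE):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
k, n_users, n_items = 8, 100, 200

# stand-ins for the factor matrices learned with SGD on the train split
P = rng.normal(size=(n_users, k))   # user embeddings
Q = rng.normal(size=(n_items, k))   # item embeddings

# stand-in for the held-out test interactions (ids already mapped to 0-based row indices)
test_df = pd.DataFrame({"userId": rng.integers(0, n_users, 500),
                        "movieId": rng.integers(0, n_items, 500),
                        "rating": rng.integers(1, 6, 500)})

# predicted rating for each held-out (user, item) pair is dot(P_u, Q_i)
preds = np.sum(P[test_df["userId"].values] * Q[test_df["movieId"].values], axis=1)

rmse = np.sqrt(np.mean((test_df["rating"].values - preds) ** 2))
print("test RMSE:", rmse)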
I want to plot loss curves for my training and validation sets the same way as Keras does, but using Scikit. I have chosen the concrete dataset, which is a regression problem; the dataset is available at:
http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/
So, I have converted the data to CSV and the first version of my program is the following:
Model 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor

df=pd.read_csv("Concrete_Data.csv")
train,validate,test=np.split(df.sample(frac=1),[int(.8*len(df)),int(.90*len(df))])
Xtrain=train.drop(["ConcreteCompStrength"],axis="columns")
ytrain=train["ConcreteCompStrength"]
Xval=validate.drop(["ConcreteCompStrength"],axis="columns")
yval=validate["ConcreteCompStrength"]
mlp=MLPRegressor(activation="relu",max_iter=5000,solver="adam",random_state=2)
mlp.fit(Xtrain,ytrain)
plt.plot(mlp.loss_curve_,label="train")
mlp.fit(Xval,yval) #doubt
plt.plot(mlp.loss_curve_,label="validation") #doubt
plt.legend()
The resulting graph is the following:
In this model, I doubt whether the marked part is correct because, as far as I know, one should leave the validation or testing set apart, so maybe the fit call there is not correct. The score that I got is 0.95.
Model 2
In this model I try to use the validation score as follows:
df=pd.read_csv("Concrete_Data.csv")
train,validate,test=np.split(df.sample(frac=1),[int(.8*len(df)),int(.90*len(df))])
Xtrain=train.drop(["ConcreteCompStrength"],axis="columns")
ytrain=train["ConcreteCompStrength"]
Xval=validate.drop(["ConcreteCompStrength"],axis="columns")
yval=validate["ConcreteCompStrength"]
mlp=MLPRegressor(activation="relu",max_iter=5000,solver="adam",random_state=2,early_stopping=True)
mlp.fit(Xtrain,ytrain)
plt.plot(mlp.loss_curve_,label="train")
plt.plot(mlp.validation_scores_,label="validation") #line changed
plt.legend()
And for this model, I had to set early_stopping to True and plot validation_scores_, but the resulting graph is a little bit weird:
The score I get is 0.82, but I read that this occurs when the model finds it easier to predict the data in the validation set than in the train set. I believe that is because I am using validation_scores_, but I was not able to find any online reference about this particular attribute.
What would be the correct way to plot these loss curves for adjusting my hyperparameters in Scikit?
Update
I have programmed the module as advised, like this:
mlp=MLPRegressor(activation="relu",max_iter=1,solver="adam",random_state=2,early_stopping=True)
training_mse = []
validation_mse = []
epochs = 5000
for epoch in range(1,epochs):
    mlp.fit(X_train, Y_train)
    Y_pred = mlp.predict(X_train)
    curr_train_score = mean_squared_error(Y_train, Y_pred)   # training performances
    Y_pred = mlp.predict(X_valid)
    curr_valid_score = mean_squared_error(Y_valid, Y_pred)   # validation performances
    training_mse.append(curr_train_score)    # list of training perf to plot
    validation_mse.append(curr_valid_score)  # list of valid perf to plot
plt.plot(training_mse,label="train")
plt.plot(validation_mse,label="validation")
plt.legend()
but the plots obtained are two flat lines:
It seems I am missing something here.
You shouldn't fit your model on the validation set. The validation set is usually used to decide what hyperparameters to use, not the parameters' values.
The standard way to do training is to divide your dataset into three parts:
training
validation
test
For example, with a split of 80%, 10%, 10%.
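A quick sketch of such an 80/10/10 split (hypothetical data, two calls to train_test_split):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, random_state=0)   # stand-in for your data

# first split off 10% as the test set ...
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
# ... then split the remaining 90% into train/validation (1/9 of 90% is 10% overall)
X_train, X_valid, y_train, y_valid = train_test_split(X_tmp, y_tmp, test_size=1/9, random_state=0)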
Usually, you would select a neural network (how many layers, how many nodes, which activation functions), then train only on the training set, check the result on the validation set, and then on the test set.
I'll show a pseudo algorithm to make it clear:
for model in my_networks:           # hyperparameters selection
    model.fit(X_train, Y_train)     # parameters fitting
    model.predict(X_valid)          # no train, only check on performances
Save the model performances on the validation set and pick the best model (the one with the best validation scores), then check the results on the test set:
model.predict(X_test) # this will be the estimated performance of your model
If your dataset is big enough, you could also use something like cross-validation.
Anyway, remember:
the parameters are the network weights
you fit the parameters with the training set
the hyperparameters are the ones that define the net architecture (layers, nodes, activation functions)
you select the best hyperparameters checking the result of your model on the validation set
after this selection (best parameters, best hyperparameters) you get the model performance by testing the model on the test set
To obtain the same result as Keras, you should understand that when you call the fit() method on the model, the training will stop after a fixed number of epochs (200 by default, or your defined number, 5000 in your case) or when early_stopping kicks in.
max_iter: int, default=200
Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.
Check your model definition and arguments on the scikit-learn page.
To obtain the same result as Keras, you could fix the number of training epochs (e.g. 1 epoch per fit), check the result on the validation set, and then train again until you reach the desired number of epochs.
for example, something like this (if your model uses mse):
epochs = 5000
mlp = MLPRegressor(activation="relu",
                   max_iter=1,
                   solver="adam",
                   random_state=2,
                   early_stopping=True)
training_mse = []
validation_mse = []
for epoch in range(epochs):
    mlp.fit(X_train, Y_train)
    Y_pred = mlp.predict(X_train)
    curr_train_score = mean_squared_error(Y_train, Y_pred)   # training performances
    Y_pred = mlp.predict(X_valid)
    curr_valid_score = mean_squared_error(Y_valid, Y_pred)   # validation performances
    training_mse.append(curr_train_score)    # list of training perf to plot
    validation_mse.append(curr_valid_score)  # list of valid perf to plot
I had the same problem: I obtained two flat lines when using the code as advised. I solved it by just adding warm_start=True to the MLPRegressor parameters, as explained in MLPRegressor - 1.17.9. More control with warm_start.
mlp=MLPRegressor(activation="relu",max_iter=1,solver="adam",random_state=2,early_stopping=True, warm_start=True)
The plots obtained are now correct:
Train and validation loss curves
Let me first explain the data set that I am using.
I have three sets:
Train set with shape (1277, 927); the target is present about 12% of the time.
Eval set with shape (174, 927); the target is present about 11.5% of the time.
Hold-out set with shape (414, 927); the target is present about 10% of the time.
These sets were built using time slices: the train set is the oldest data, the hold-out set is the newest data, and the eval set is in the middle.
Now I am building two models.
Model1:
# Initialize CatBoostClassifier
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time= True,
    iterations=300,
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)
## Fitting catboost model
model.fit(
    train_set.values, Y_train.values,
    cat_features=categorical_features_indices,
    eval_set=(test_set.values, Y_test)
    # logging_level='Verbose'  # you can uncomment this for text output
)
Then I predict on the hold-out set.
Model2:
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time= True,
    iterations='bestIteration from model1',
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)
## Fitting catboost model
model.fit(
    train.values, Y.values,
    cat_features=categorical_features_indices,
    # logging_level='Verbose'  # you can uncomment this for text output
)
Both models are identical except for iterations. The first model has a fixed 300 rounds, but it will shrink the model to bestIteration; the second model uses that bestIteration from model1.
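(As a side note, a sketch of how that placeholder could be filled in programmatically, reusing the variables from the snippets above; this assumes CatBoost's get_best_iteration(), which should be available on model1 since it was fitted with an eval_set:)
# after fitting model1 with eval_set=(test_set.values, Y_test):
best_iter = model.get_best_iteration()      # iteration picked by the "Shrink model" step

# model2: same hyperparameters, trained on all the data for exactly that many rounds
model2 = CatBoostClassifier(depth=9, random_seed=42, l2_leaf_reg=1,
                            iterations=best_iter, learning_rate=0.05,
                            loss_function='Logloss', logging_level='Verbose')
model2.fit(train.values, Y.values, cat_features=categorical_features_indices)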
However, when I compare feature importances, they look drastically different.
  Feature  Score_m1  Score_m2     delta
0      x0  3.612309  2.013193 -1.399116
1      x1  3.390630  3.121273 -0.269357
2      x2  2.762750  1.822564 -0.940186
3      x3  2.553052       NaN       NaN
4      x4  2.400786  0.329625 -2.071161
As you can see, one of the features, x3, which was in the top 3 in the first model, dropped out in the second model. Not only that, but there is a large shift in importance between the models for a given feature. There are about 60 features present in model1 that are not present in model2, and about 60 features present in model2 that are not present in model1. delta is the difference between Score_m1 and Score_m2. I have seen models change scores a little, but not this drastically. AUC and LogLoss don't change much whether I use model1 or model2.
Now I have the following questions regarding this situation:
Are these models unstable due to the small number of samples and large number of features? If so, how can I check for this?
Are there features in this model that just don't give much information about the outcome, so that it is random chance which of them ends up creating a split? If so, how can I check for this situation?
Is CatBoost the right model for this situation?
Any help regarding this issue will be appreciated.
Yes. Trees in general are somewhat unstable. If you remove the least important feature, you can get a very different model.
Having more data reduces this tendency.
Having more features increases this tendency.
Tree algorithms are random by nature, so the results will be different.
Things to try:
Run the model a large number of times but with different random seeds. Use the results to determine which features seem to be the least important (see the sketch after this list). (How many features do you have?)
Try to balance your training set. This might require you to upsample the rarer cases.
Get more data. Maybe you'll have to combine your train and test set and use the holdout as the test.
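A rough sketch of that first suggestion, reusing the variable names from the question (averaging get_feature_importance() over several seeds to see which features are consistently unimportant):
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier

importances = []
for seed in range(20):                     # same model, different random seeds
    m = CatBoostClassifier(depth=9, l2_leaf_reg=1, iterations=300,
                           learning_rate=0.05, loss_function='Logloss',
                           random_seed=seed, logging_level='Silent')
    m.fit(train_set.values, Y_train.values,
          cat_features=categorical_features_indices)
    importances.append(m.get_feature_importance())

# mean importance and its spread across runs; features that are consistently
# near zero are the first candidates to drop
imp = pd.DataFrame({"mean": np.mean(importances, axis=0),
                    "std": np.std(importances, axis=0)},
                   index=train_set.columns).sort_values("mean", ascending=False)
print(imp.head(10))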
Long-time lurker, first-time poster.
I have data that roughly follows a y = sin(time) distribution, but also depends on other variables besides time. In terms of correlations, since the target y-variable oscillates, there is almost zero statistical correlation with time, but y obviously depends very strongly on time.
The goal is to predict future values of the target variable. I want to avoid an explicit model assumption and instead rely on data-driven models and machine learning, so I have tried using regression methods from sklearn.
I have tried the following methods (the parameters were blindly copied from examples and other threads):
LogisticRegression()
QDA()
GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=5,
             param_grid={"C": [1e0, 1e1, 1e2, 1e3],
                         "gamma": np.logspace(-2, 2, 5)})
GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1), cv=5,
             param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
                         "gamma": np.logspace(-2, 2, 5)})
GradientBoostingRegressor(loss='quantile', alpha=0.95,
                          n_estimators=250, max_depth=3,
                          learning_rate=.1, min_samples_leaf=9,
                          min_samples_split=9)
DecisionTreeRegressor(max_depth=4)
AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                  n_estimators=300, random_state=rng)
RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)
The results fall into two different categories of failure:
The time field has no effect, probably due to the absence of correlation caused by the oscillatory behaviour of the target variable. However, secondary effects from other variables allow a modest predictive capability for future time ranges (these other variables have a simple correlation with the target variable).
When applying predict() to the training time range, the prediction is near perfect with respect to the observations, but when given the future time range (whose data was not used in training), the predicted value stays constant.
Below is how I performed the training and testing:
weather_df.index = pd.to_datetime(weather_df.index,unit='D')
weather_df['Days'] = (weather_df.index-datetime.datetime(2005,1,1)).days
ts = pd.DataFrame({'Temperature': weather_df['Mean TemperatureC'].ix[:'2015-1-1'],
                   'Humidity': weather_df[' Mean Humidity'].ix[:'2015-1-1'],
                   'Visibility': weather_df[' Mean VisibilityKm'].ix[:'2015-1-1'],
                   'Wind': weather_df[' Mean Wind SpeedKm/h'].ix[:'2015-1-1'],
                   'Time': weather_df['Days'].ix[:'2015-1-1']
                   })
start_test = datetime.datetime(2012,1,1)
ts_train = ts[ts.index < start_test]
ts_test = ts
data_train = np.array([ts_train.Humidity, ts_train.Time])
data_target = np.array(ts_train.Temperature)[np.newaxis].ravel()
model.fit(data_train.T, data_target.T)
data_test = np.array([ts_test.Humidity, ts_test.Time])
pred = model.predict(data_test.T)
ts_test['Pred'] = pred
Is there a regression model I could/should use for this problem, and if so what would be appropriate options and parameters?
(Also, my treatment of the time objects in sklearn is far from elegant, so I will gladly take advice there.)
Here is my guess about what is happening in your two types of results:
.days does not convert your index into a form that repeats itself between your train and test samples. So it becomes a unique value for every date in your dataset.
As a consequence, your models either ignore days (1st result) or overfit on the days feature (2nd result), causing them to perform badly on your test data.
Suggestion:
If your dataset is large enough (it looks like it goes from 2005), try using dayofyear or weekofyear instead, so that your model will have something generalizable from the date information.
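For example (with a stand-in DatetimeIndex; dayofyear repeats every year, so the train and test years share the same feature values, and isocalendar().week is the current pandas way to get the week number):
import pandas as pd

dates = pd.date_range("2005-01-01", "2015-01-01", freq="D")   # stand-in for weather_df's index
weather_df = pd.DataFrame(index=dates)

weather_df["DayOfYear"] = weather_df.index.dayofyear
weather_df["WeekOfYear"] = weather_df.index.isocalendar().week.to_numpy()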
I agree with @zemekeneng that time should be taken modulo the corresponding period, like 24 hours, 12 months, etc.
Beyond that, I'd like to recommend using prior knowledge when selecting features or models. Since you already know that your data is highly likely to follow sin(x), this should be used even in a data-driven approach.
We know that sin(x) can be approximated by x - x^3/3! + x^5/5! - x^7/7!, so these terms should be used as features. None of the models you used is likely to have included such features. One way to do it would be to create these high-order features yourself and concatenate them to your other features. Then a linear model with regularization may give you reasonable results.
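A minimal sketch of that idea on synthetic sin-like data (high-order powers of time generated with PolynomialFeatures, then a regularized linear model; in practice you would concatenate these columns to your other features):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)          # "time"
y = np.sin(t).ravel() + 0.1 * rng.normal(size=len(t))      # y roughly follows sin(time)

# hand-made high-order features of time, then a regularized linear fit
model = make_pipeline(PolynomialFeatures(degree=9, include_bias=False),
                      StandardScaler(),
                      Ridge(alpha=1.0))
model.fit(t[:150], y[:150])      # train on the earlier part of the series
pred = model.predict(t[150:])    # predict the later ("future") time range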