I have this piece of Python code, taken from SoloLearn:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    # Build the train/test arrays for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print(scores)
print(np.mean(scores))
My question, then, is: do I need to create a new model in every split?
Why don't we just create one LogisticRegression before the for loop?
I would put it before the loop to save computation time, but since it has been presented this way I thought there was a reason.
Great question! The answer is...you don't have to create the model each time. Your intuition is correct. Feel free to move model = LogisticRegression() to the top, outside the loop, and re-run to confirm.
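The model object that exists after model.fit(X_train, y_train) each time through the loop will be the same either way, because in scikit-learn calling fit() again overwrites whatever was learned by a previous fit() (unless an option such as warm_start is enabled).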
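One way to check this claim is to run both variants on the same folds and compare the scores. The sketch below is only an illustration: the synthetic data from make_classification and the fixed random_state values are assumptions added here so that both loops see identical splits; they are not part of the SoloLearn code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; not the SoloLearn dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# A fixed integer random_state makes every call to kf.split(X) yield the same folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Variant 1: a new model is created in every fold.
scores_fresh = []
for train_index, test_index in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_index], y[train_index])
    scores_fresh.append(model.score(X[test_index], y[test_index]))

# Variant 2: one model is created before the loop and refitted in every fold.
shared_model = LogisticRegression()
scores_reused = []
for train_index, test_index in kf.split(X):
    shared_model.fit(X[train_index], y[train_index])
    scores_reused.append(shared_model.score(X[test_index], y[test_index]))

print(scores_fresh)
print(scores_reused)
print(scores_fresh == scores_reused)
```

With the default settings the two score lists should match, since each fit() starts from scratch. Estimators configured with warm_start=True would behave differently.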
The short answer is yes.
The reason is that this is k-fold cross validation.
Simply put, this means that you are training k models, evaluating the results of each, and averaging them together.
We do this in cases where we do not have separate data sets for training and testing. Cross validation splits the data into k subgroups, each of which defines its own train/test split (we call these folds). We then train a model on the training data of the first fold and test it on that fold's test data. Repeating this for all folds, with a new model for each, gives us proper predictions for the full dataset.
Here is a link to a detailed description of cross validation - https://machinelearningmastery.com/k-fold-cross-validation/
KFold is used for cross validation, which means training a model and evaluating it.
Here is an example of documentation on the subject.
When doing that you obviously need two datasets: a training AND an evaluation data set.
When using KFold, you split your training set into a number of folds (5 in your example) and run five models, each time using one fifth as the validation set and the rest of the dataset as the training set.
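The splitting described above can be sketched in plain Python. The helper name k_fold_indices and its parameters are hypothetical, and the way the shuffled indices are dealt into folds is a simplification for illustration, not how scikit-learn's KFold assigns samples internally.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_indices, validation_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_splits-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, validation in enumerate(folds):
        # The training set is everything that is not in the current validation fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=20, n_splits=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Each sample lands in the validation set exactly once and in the training set the other four times, which is why five separate fits are needed.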
Now, to answer the question: you need a new model each time because you are effectively training five models, since each of the five iterations has a different training set as well as a different validation set. In scikit-learn, model.fit() trains the model on one specific dataset, so a fitted model should not be carried over to a different training dataset.
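If you want to create it only once, you can make copies, for example:
from sklearn.base import clone

model = LogisticRegression(**params)

def parse_kfold(model):
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        # clone() returns a new unfitted estimator with the same parameters;
        # a plain assignment would only create a second name for the same object
        model_fold = clone(model)
        ...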
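As a side note, scikit-learn's cross_val_score does this copying itself: it clones the estimator for every fold, so the object passed in is never fitted. The following is a minimal sketch on synthetic data (the dataset and the random_state values are assumptions for illustration only).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption, used only to make the sketch runnable).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression()  # created once
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score fits a clone of `model` on each training fold
# and scores it on the matching validation fold.
scores = cross_val_score(model, X, y, cv=kf)

print(scores)
print(scores.mean())
```

After the call, model itself is still unfitted, because only the per-fold clones were trained.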
Related
I am using this GitHub package https://github.com/5663015/elm/blob/master/elm.py for Extreme Learning Machine models. I run the following code on my dataset.
# Create target series and data splits
y = df['rain'].copy()
X = df[['lag1']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, shuffle=False)
# model
model = ELM(hid_num=10).fit(X_train, y_train)
# predictions
prediction = model.predict(X_test)
In the dataset, the target variable is rainfall, and the predictor is lag1 of the rainfall data. The data is a time series, so I set shuffle=False. I used 70% of the data for training the model and 30% of the data as a test set.
The model is working and I can get predictions. However, each time I run the model, I get different prediction values and a different RMSE (which I use to evaluate model performance). Could you please let me know if this is common with ELM models? Is there any way to get fixed predictions and RMSE each time I run the model?
Each time you train the model, a different random seed is chosen. As a result, the initialization changes and, thus, the optimization behaves differently.
To fix the random seed with NumPy, call np.random.seed(...) with a fixed integer before training the model.
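Here is a small sketch of the effect of seeding. The helper random_hidden_weights is hypothetical and only imitates a randomly initialized layer. Whether seeding makes the linked elm.py reproducible depends on that package drawing its random numbers from NumPy's global generator, which is an assumption here.

```python
import numpy as np


def random_hidden_weights(n_features, n_hidden):
    """Hypothetical helper: draw random input weights from NumPy's global generator."""
    return np.random.uniform(-1.0, 1.0, size=(n_features, n_hidden))


# Seeding before each "training run" makes the random draws repeat.
np.random.seed(42)
first_run = random_hidden_weights(n_features=1, n_hidden=10)

np.random.seed(42)
second_run = random_hidden_weights(n_features=1, n_hidden=10)

# Without re-seeding, the generator simply continues and the draws differ.
unseeded_run = random_hidden_weights(n_features=1, n_hidden=10)

print(np.array_equal(first_run, second_run))
print(np.array_equal(first_run, unseeded_run))
```

The seed has to be set before every run whose result should be repeatable, not just once at import time.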
I know that to train your model and see the accuracy you must use something like this:
print("Fit model on training data")
history = model.fit(
    X_train,
    y_train,
    batch_size=64,
    epochs=2,
    validation_data=valid_set,
)
However, I don't know how to pass my own validation data set, since it has a different length than my training data set, and it has caused me many errors along the way.
I even pre-processed my own validation dataset with the same process I used for my base dataset.
Any tips on how to solve this?
I'm trying to do sentiment analysis on text documents but I got lost in the steps.
So my goal is to:
Train SVM, KNN and Naive Bayes algorithms
Use gridsearch to find best parameters
Evaluate the models' accuracy and find the best one
Use those parameters and get optimal result
Almost every guide I find uses the train_test_split method. But I've read that the holdout validation method isn't very accurate. That is when you split the data into train and test sets, for example 80:20, and hold out that 20% for testing. So instead I wanted to use K-fold cross validation. But the question is: how can I use it, and do I still need to split my data into train and test sets?
So far, what I've tried is:
sentences = svietimas_data['text']
y = svietimas_data['sentiment']
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.1, random_state=1)
sentences_train, sentences_validate, y_train, y_validate = train_test_split(sentences_train, y_train, test_size=0.1111, random_state=1)
classifier = KNeighborsClassifier()
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range, weights = weights, metric = metric )
vectorizer = TfidfVectorizer(lowercase=False, max_df=100)
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_validate = vectorizer.transform(sentences_validate)
X_test = vectorizer.transform(sentences_test)
grid_search = GridSearchCV(classifier, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid_search.fit(X_train, y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
I split the data into train, validate and test sets (80:10:10). I use my train data for the grid search to find the best parameters, and after that I put those parameters into my classifier and use it with the validate and test sets to find the best results, like this:
classifier.fit(X_train, y_train)
y_pred_validate = classifier.predict(X_validate)
print(classification_report(y_validate, y_pred_validate))
y_pred_test = classifier.predict(X_test)
print(classification_report(y_test, y_pred_test))
But since this method isn't very accurate, could I instead use my whole data set in the grid search and that's it? Or, after getting the best parameters with the 80% data set, should I put those parameters into the classifier and use K-fold cross validation with the full data set? Because by using grid search or K-folds with the train (80%) data I waste 20% of the data, and as far as I know, if I used 100% of the data, K-folds would split it into, for example, k=5 sets, and the data wouldn't count as seen or overfitted?
Or what should my exact steps be to correctly achieve that goal?
You're doing parameter tuning, which is equivalent to training: this is why you must keep a fresh test set to evaluate the final model (otherwise performance could be overestimated).
However, since you're using CV in the first level of training, you need only one more test set. So the typical process would be like this:
1. Split the data into a training set and a test set.
2. Apply CV to the training set for all combinations of parameters (grid search), then pick the best parameters.
3. Re-train the final model on the full training set with the best parameters.
4. Evaluate the model on the test set.
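The process above can be sketched with scikit-learn as follows. The synthetic numeric data and the small parameter grid are assumptions chosen to keep the example short and fast; for text data, the vectorizer would additionally need to be fitted on the training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the sentiment dataset from the question).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Split into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Grid search with 5-fold CV on the training set only.
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
grid_search = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# With the default refit=True, the best parameter combination is
# retrained on the full training set and exposed as best_estimator_.
final_model = grid_search.best_estimator_

# The test set is used once, only for the final evaluation.
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))

print(grid_search.best_params_)
print(grid_search.best_score_)
print(test_accuracy)
```

Here best_score_ is the cross-validated accuracy used for picking parameters, while test_accuracy is the estimate that has not been influenced by the tuning.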
"But since this method isn't very accurate, could I instead use my whole data set in the grid search and that's it?"
If you don't evaluate on a fresh test set after parameter tuning, you might have overfitting (best parameters by chance) and you wouldn't know it, the performance would be biased.
"Or, after getting the best parameters with the 80% data set, should I put those parameters into the classifier and use K-fold cross validation with the full data set?"
It is possible to use CV for the last stage of evaluation as well, but it's not so simple: you would have to use nested CV. It's not really worth it, and it would take a lot more time, because you would have to repeat the parameter tuning stage for each training run inside the top-level CV.
"Because by using grid search or K-folds with the train (80%) data I waste 20% of the data"
Actually, you don't waste the data. The test set is needed only for the purpose of reliable evaluation; once this is done, you could perfectly well retrain your model on the full data.
Also, it is a bad sign when 20% of the data matters a lot for performance: it means that the model probably doesn't have a large enough training set, and even the full data might not be enough.
I'm not an ML expert, but the normal flow I follow to train a machine learning model is, after data cleaning, to split the dataset into train and test sets using scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X.values,
                                                    y.values,
                                                    test_size=0.30,
                                                    random_state=0)
Skipping over the whole model-building process: when you go to train (fit) the model after defining and compiling it, the usual approach is to use the validation_split parameter, like below
history = model.fit(X_train, y_train, epochs=10,validation_split=0.2)
This seems to divide the training dataset again, setting aside 20% of its data points to validate our model during training. If you had, let's say, 1000 data points (rows) in a dataset, the first code above will lead to
700 data points for training and 300 for testing
and the second will then set aside 20% of those 700 for validation, leaving
560 data points for training and 140 for validation
which leaves us with little data to train our model with.
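The arithmetic of the two successive splits can be written out as a quick check. Note that 20% of the 700 training rows is 140, so 560 rows remain for training. The variable names below are just for illustration.

```python
n_total = 1000
test_size = 0.30          # fraction passed to train_test_split
validation_split = 0.2    # fraction passed to model.fit

# First split: hold out the test set.
n_test = round(n_total * test_size)
n_train_full = n_total - n_test

# Second split: validation_split is applied to the remaining training rows only.
n_validation = round(n_train_full * validation_split)
n_train = n_train_full - n_validation

print(n_train, n_validation, n_test)  # prints: 560 140 300
```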
I recently encountered a method, though, where you use the test data for validation, like below
history = model.fit(x_train, y_train, validation_data=(x_test, y_test))
My question is: what actually happens to the validation data after training completes? Is it automatically added to the training data, somehow improving our accuracy at the end? Also, is using the test data for validation enough, and if we do that, will it affect the result when we evaluate our model on the test data?
As stated by keras (the last model.fit method with validation_data comes from keras)
validation_split
Float between 0 and 1. Fraction of the training data to be used as
validation data. The model will set apart this fraction of the
training data, will not train on it, and will evaluate the loss and
any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling.
validation_data
Data on which to evaluate the loss and any model metrics at the end of
each epoch. The model will not be trained on this data. This could be
a list (x_val, y_val) or a list (x_val, y_val, val_sample_weights).
validation_data will override validation_split.
So with this setup, the model will not train on the validation data.
In theory, validation data is used to evaluate your model and tune its hyperparameters.
If you need to put this model in production, you should retrain the model with all the data, knowing that the performance would be roughly what you obtained on the validation/test data.
Keep in mind, though, that performance on the validation/test data is often an optimistic estimate of the performance in production.
I split my data into training and test samples (70/30) for a regression-based forecasting problem (MLP, LSTM, etc.).
Within the code:
history = model.fit(X_train, y_train, epochs=100, batch_size=32,
                    validation_data=(X_test, y_test), verbose=0, shuffle=False)
I used my test data as the validation set and made a couple of weeks' worth of predictions. So I did not hold back the test data...
But now that I think about it, I guess it was wrong to put the test data into the fit function. Or was it OK?
NEVER EVER use your test data as part of training or validation! The test set should only be used for inference after training. So yes, it's wrong to use your test data in the fit function; it should only appear in calls such as model.predict(X_test).
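One way to follow this advice is to carve a separate validation set out of the training data and keep the test set untouched until the end. The sketch below uses placeholder arrays and a 70/30 then 80/20 split as assumptions. It uses shuffle=False to keep the chronological order, as in the forecasting setup of the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder time-ordered data (an assumption; replace with the real series).
X = np.arange(1000, dtype=float).reshape(-1, 1)
y = np.arange(1000, dtype=float)

# Hold out the last 30% as the test set; shuffle=False preserves time order.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=False
)

# Take the last 20% of the remaining data as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.20, shuffle=False
)

print(len(X_train), len(X_val), len(X_test))

# With a Keras model, the validation set would then be passed as
#   model.fit(X_train, y_train, validation_data=(X_val, y_val), ...)
# and the test set used only afterwards, e.g. model.predict(X_test).
```

This way the validation set can drive decisions such as the number of epochs, while the test set still gives an estimate that was not used during fitting.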