I am using this GitHub package https://github.com/5663015/elm/blob/master/elm.py for Extreme Learning Machine models. I run the following code on my dataset.
# Create target series and data splits
y = df['rain'].copy()
X = df[['lag1']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, shuffle=False)
# model
model = ELM(hid_num=10).fit(X_train, y_train)
# predictions
prediction = model.predict(X_test)
In the dataset, the target variable is rainfall, and the predictor is lag1 of the rainfall data. The data is a time series, so I set shuffle=False. I used 70% of the data for training the model and 30% as a test set.
The model is working and I can get predictions. However, each time I run the model, I get different prediction values and a different RMSE (used for evaluating model performance). Could you please let me know if this is common with ELM models? And is there any way to get the same predictions and RMSE each time I run the model?
Each time you train the model, a different random seed is chosen. As a result, the initialization changes and, thus, the optimization behaves differently.
To fix the random seed with NumPy, set np.random.seed(...) (docs).
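For example, a minimal sketch (assuming, as above, that the package draws its random hidden-layer weights from NumPy's global RNG):
import numpy as np

# Fixing NumPy's global seed before fitting makes the random hidden-layer
# initialization, and therefore the predictions and RMSE, reproducible.
np.random.seed(42)

model = ELM(hid_num=10).fit(X_train, y_train)
prediction = model.predict(X_test)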
I'm trying to do sentiment analysis on text documents but I got lost in the steps.
So my goal is to:
Train SVM, KNN and Naive Bayes algorithms
Use gridsearch to find best parameters
Evaluate models accuracy and find the best one
Use those parameters and get optimal result
In almost every guide I find, the train_test_split method is used. But I've read that the holdout cross-validation method isn't very accurate: it's when you split data into train and test sets, for example 80:20, and hold out that 20% for testing. So instead I wanted to use k-fold cross-validation. But the question is how I could use it, and do I still need to split my data into train and test sets?
So far what I've tried is:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

sentences = svietimas_data['text']
y = svietimas_data['sentiment']
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.1, random_state=1)
sentences_train, sentences_validate, y_train, y_validate = train_test_split(sentences_train, y_train, test_size=0.1111, random_state=1)
classifier = KNeighborsClassifier()
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range, weights=weights, metric=metric)
vectorizer = TfidfVectorizer(lowercase=False, max_df=100)
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_validate = vectorizer.transform(sentences_validate)
X_test = vectorizer.transform(sentences_test)
grid_search = GridSearchCV(classifier, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid_search.fit(X_train, y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
I split the data into train, validate and test sets, 80:10:10. I use my training data for the grid-search parameter analysis to find the best parameters, and after that I put those parameters into my classifier and use it with the validate and test sets to find the best results, like this:
classifier.fit(X_train, y_train)
y_pred_validate = classifier.predict(X_validate)
print(classification_report(y_validate, y_pred_validate))
y_pred_test = classifier.predict(X_test)
print(classification_report(y_test, y_pred_test))
But since this method isn't very accurate, could I instead use my whole dataset for the grid search and leave it at that? Or, after getting the best parameters with the 80% training set, should I put those parameters into the classifier and use k-fold cross-validation on the full dataset? Because by using grid search or k-folds with only the training (80%) data I waste 20% of the data, and as far as I know, if I used 100% of the data, k-folds would split that data into, for example given k=5, five sets, and the data wouldn't count as seen or overfitted?
Or what should my exact steps be to correctly achieve that goal?
You're doing parameter tuning, which is equivalent to training: this is why you must keep a fresh test set to evaluate the final model (otherwise performance could be overestimated).
However, since you're using CV in the first level of training, you only need one more test set. So the typical process would be like this (a short code sketch follows the list):
Split training and test set
Apply CV to the training set for all combinations of parameters (grid search), then pick the best parameters.
Re-train the final model on the full training set with the best parameters.
Evaluate the model on the test set
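A minimal sketch of this with scikit-learn, assuming a TF-IDF + KNN setup like yours, where sentences and y are your text and label columns (putting the vectorizer in a Pipeline keeps the TF-IDF fit inside each CV fold):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# 1. Split training and test set
sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.2, random_state=1)

# 2. Grid search with 10-fold CV on the training set only
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('knn', KNeighborsClassifier()),
])
param_grid = {
    'knn__n_neighbors': list(range(1, 31)),
    'knn__weights': ['uniform', 'distance'],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=10, scoring='accuracy')
grid_search.fit(sentences_train, y_train)

# 3. GridSearchCV refits the best pipeline on the full training set by
#    default (refit=True), so grid_search is already the final model
# 4. Evaluate once on the held-out test set
y_pred_test = grid_search.predict(sentences_test)
print(classification_report(y_test, y_pred_test))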
But since this method isn't very accurate could i instead use my whole data set on gridsearch and thats it?
If you don't evaluate on a fresh test set after parameter tuning, you might have overfitting (best parameters selected by chance) and you wouldn't know it; the reported performance would be biased.
or after getting best parameters with 80% data set I should put those parameters into classifier and use K-folds cross validation with full data set?
It is possible to use CV for the last stage of evaluation as well, but it's not so simple: you would have to use nested CV. It's not really worth it, and it would take a lot more time, because you would have to repeat the parameter-tuning stage for each training run inside the top-level CV.
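For reference, a rough sketch of what nested CV would look like (X and y here would be your full vectorized feature matrix and labels; the inner GridSearchCV tunes the parameters, the outer cross_val_score estimates performance):
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

inner_search = GridSearchCV(KNeighborsClassifier(),
                            {'n_neighbors': list(range(1, 31))},
                            cv=5, scoring='accuracy')
# The parameter tuning is repeated inside every outer fold, which is why
# nested CV is much slower
nested_scores = cross_val_score(inner_search, X, y, cv=5, scoring='accuracy')
print(nested_scores.mean())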
Because using gridsearch or k-folds with train (80%) data i waste 20% of the data
Actually you don't waste the data. The test set is needed only for reliable evaluation, but once that is done you can perfectly well re-train your model on the full data.
Also, it is a bad sign when 20% of the data matters a lot for performance; it means that the model probably doesn't have a large enough training set, and even the full data might not be enough.
I have a doubt about classification algorithm comparison.
I am doing a project on hyperparameter tuning and classification model comparison for a dataset.
The goal is to find the best-fitting model with the best hyperparameters for my dataset.
For example: I have 2 classification models (SVM and Random Forest), my dataset has 1000 rows and 10 columns (9 columns are features) and the last column is the label.
First of all, I split the dataset into 2 portions (80-20) for training (800 rows) and testing (200 rows), respectively. After that, I use grid search with CV = 10 to tune the hyperparameters on the training set for these 2 models (SVM and Random Forest). When the hyperparameters are identified for each model, I use them to compute the accuracy_score on the training and testing sets again in order to find out which model is best for my data (conditions: accuracy_score on the training set < accuracy_score on the testing set (not overfitting), and whichever model has the higher accuracy_score on the testing set is the best model).
However, SVM shows a training-set accuracy_score of 100 and a testing-set accuracy_score of 83.56, which means SVM with tuned hyperparameters is overfitting. On the other hand, Random Forest shows a training-set accuracy_score of 72.36 and a testing-set accuracy_score of 81.23. Clearly the testing-set accuracy_score of SVM is higher than that of Random Forest, but SVM is overfitting.
I have some question as below:
_ Is my method correct when I compare the accuracy_score on the training and testing sets as above instead of using cross-validation? (If I should use cross-validation, how do I do it?)
_ It is clear that the SVM above is overfitting, but its testing-set accuracy_score is higher than that of Random Forest; could I conclude that SVM is the best model in this case?
Thank you!
It's good that you've done quite a bit of analysis on your part to investigate the best model. However, I would suggest you elaborate on your investigation a bit. Since you're searching for the best model for your data, "Accuracy" alone is not a good evaluation metric for your models. You should also evaluate your models on "Precision", "Recall", "ROC", "Sensitivity", "Specificity", etc. Find out whether your data is imbalanced (if it is, there are techniques to work around that). After evaluating all those metrics you can come to a decision.
For the training-testing part, you're quite on the right track, with only one issue (which is quite severe): every time you test your model on the test set and then adjust it, you're injecting a sort of bias. So I would say make 3 partitions of your data and use cross-validation (sklearn has what you need for this) on your "training set". After cross-validation, you may use another partition, the "validation set", for testing the generalization power of your model (performance on unseen data), and you may change some parameters after that. Once you've come to a conclusion and tuned everything you needed to, only then use your "test set". No matter what the results are (on the test set), don't change the model after that, as those scores represent the true capability of your model.
You can create 3 partitions of your data in the following way, for example:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
# Dummy dataset for example purpose
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, cluster_std=6.0)
# first partition i.e. "train-set" and "test-set"
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=123)
# second partition, we're splitting the "train-set" into 2 sets, thus creating a new partition of "train-set" and "validation-set"
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.9, random_state=123)
print(X_train.shape, X_test.shape, X_val.shape) # output : ((810, 2), (100, 2), (90, 2))
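And a rough sketch of the cross-validation step on the training partition (using a hypothetical SVC classifier and the X_train/y_train from the split above):
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 10-fold cross-validation on the training partition only; the validation
# and test partitions stay untouched until later
cv_scores = cross_val_score(SVC(C=1.0), X_train, y_train, cv=10)
print(cv_scores.mean(), cv_scores.std())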
I would suggest splitting your data into three sets, rather than two:
Training
Validation
Testing
Training is used to train the model, as you have been doing. The validation set is used to evaluate the performance of a model trained with a given set of hyperparameters. The optimal set of hyperparameters is then used to generate predictions on the test set, which wasn't part of either training or hyperparameter selection. You can then compare performance on the test set between your classifiers.
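A minimal sketch of that loop, assuming the three splits already exist and using, purely as an example, the SVM's C as the hyperparameter being tuned:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

best_C, best_val_acc = None, -1.0
for C in [0.1, 1, 10, 100]:
    # Train each candidate on the training set only
    candidate = SVC(C=C).fit(X_train, y_train)
    # Score it on the validation set
    val_acc = accuracy_score(y_val, candidate.predict(X_val))
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Only the winning hyperparameters ever touch the test set
final_model = SVC(C=best_C).fit(X_train, y_train)
print(accuracy_score(y_test, final_model.predict(X_test)))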
The large decrease in performance of your SVM model on your validation dataset does suggest overfitting, though it is common for a classifier to perform better on the training dataset than on an evaluation or test dataset.
For your second question: yes, your SVM would be overfitting, although in most machine-learning cases the training set's accuracy does not really matter. It is much more important to look at the testing set's accuracy. It is not unusual to have higher training accuracy than testing accuracy, so I suggest not focusing on the overfitting and only looking at the difference in testing accuracy. With the information provided, yes, you could say that the SVM is the best model in your case.
For your first question, you are already doing a type of cross-validation, and it is an acceptable way to evaluate the model.
This might be a useful article for you to read
I'm not an ML expert, but the normal flow I follow to train a machine learning model is, after data cleaning, to split the dataset into train and test sets using scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X.values,
                                                    y.values,
                                                    test_size=0.30,
                                                    random_state=0)
Skipping over the whole model-building process... when you go to train the model (fit it), after defining and compiling it, the next step is to use the validation_split parameter like below:
history = model.fit(x_train, y_train, epochs=10, validation_split=0.2)
This seems to again split off 20% of the training data points to validate the model during training. If you had, let's say, 1000 data points (rows) in a dataset, the first code above will lead to
700 data points for training and 300 for testing,
and the second will again take 20% of those 700 for validation, leaving
560 data points for training and 140 for validation,
which leaves us with little data to train our model with.
I recently encountered a method, though, where you use the test data for validation, like below:
history = model.fit(x_train, y_train, validation_data=(x_test, y_test))
My question is: what actually happens to the validation data after training completes? Is it automatically added to train our model, somehow improving our accuracy at the end? Also, is using the test data for validation enough, and if we do that, will it have an effect when we try to evaluate our model using the test data?
As stated by Keras (the last model.fit call with validation_data comes from Keras):
validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.
validation_data: Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. This could be a list (x_val, y_val) or a list (x_val, y_val, val_sample_weights). validation_data will override validation_split.
So with this setup, the model will not train on the validation data.
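A minimal sketch of that behaviour with a hypothetical toy network (x_train, y_train, x_test, y_test being the arrays from your split):
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# The validation data is only evaluated at the end of each epoch; the weights
# are never updated on it, and if validation_data is passed, validation_split
# is ignored
history = model.fit(x_train, y_train, epochs=10,
                    validation_data=(x_test, y_test))
print(history.history['val_loss'])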
In theory, validation data is used to evaluate your model and tune its hyperparameters.
If you need to put this model into production, you should retrain the model with all the data, knowing that the performance would be roughly what you obtained on the validation/test data.
In any case, the performance on validation/test data is often an optimistic estimate of the real performance.
source
Here I have this piece of Python code, taken from SoloLearn:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
import numpy as np

scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print(scores)
print(np.mean(scores))
My question then is, do I need to create a new model in every split?
Why don't we just create one LogisticRegression before the for loop?
I would put it before to save computation time, but since it has been presented this way I thought there was a reason.
Great question! The answer is...you don't have to create the model each time. Your intuition is correct. Feel free to move model = LogisticRegression() to the top, outside the loop, and re-run to confirm.
The model object that exists after model.fit(X_train, y_train) each time through the loop will be the same either way.
The short answer is yes.
The reason is that this is k-fold cross-validation.
Simply put, this means that you are training k models, evaluating the results of each, and averaging them together.
We do this in cases where we do not have separate datasets for training and testing. Cross-validation splits the training data into k subgroups, each of which gets its own train/test split (we call these folds). We then train a model on the training data of the first fold and test it on that fold's test data. Repeat for all folds, with a new model for each, and now we have proper out-of-sample predictions for the full dataset.
Here is a link to a detailed description of cross validation - https://machinelearningmastery.com/k-fold-cross-validation/
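For reference, the same evaluation can be written with cross_val_score, which internally builds a fresh clone of the estimator for every fold (X and y as in the question):
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
import numpy as np

kf = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print(scores, np.mean(scores))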
KFold is used for cross-validation, that is, training a model and evaluating it.
Here is an example of documentation on the subject.
When doing that you obviously need two datasets: a training AND an evaluation data set.
When using KFold, you split your training set into a number of folds (5 in your example) and run five models, using one fifth each time as the validation set and the rest as the training set.
Now, in order to answer the question: you need a new model each time because you have five models, as each of the five times you have a different training set as well as a different validation set. You must create a new one in scikit-learn because when you run model.fit() the model is trained on a specific dataset, so you cannot use it for another training dataset.
If you want to create it only once, you can make copies of it instead, for example:
from sklearn.base import clone

model = LogisticRegression(**params)

def parse_kfold(model):
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        # clone() returns a fresh, unfitted copy with the same parameters,
        # so each fold still trains its own model
        model_fold = clone(model)
        ...
I am building a Python application in which I want to forecast the values of PM2.5 over a month. I am using polynomial regression, and I split the data into 30% test data and 70% train data. I am using this line of code to split the data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,shuffle=True)
But I have noticed that if I give random_state different integers, the mean squared error differs, and so does the accuracy of the forecast. How can I find the optimal parameters for the train_test_split method so that the forecast is most accurate?
How much does the accuracy vary when you change the random seed?
You can use k-fold cross-validation to find the best split; however, I am not sure you want the one with the highest accuracy. You want your model to generalize. You should go for a split where you have enough training data and a test set that is representative of the real-world data the model will encounter.
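For example, a minimal sketch of scoring a model across several folds instead of relying on a single random split (assuming model is your polynomial-regression pipeline and X, y are your features and PM2.5 targets):
from sklearn.model_selection import cross_val_score
import numpy as np

# neg_mean_squared_error is negated so that higher is better; flip the sign
# back to get the MSE of each fold
mse_per_fold = -cross_val_score(model, X, y, cv=5,
                                scoring='neg_mean_squared_error')
print(mse_per_fold.mean(), mse_per_fold.std())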