I am a beginner with LSTMs and I have built a simple LSTM model for predicting stock prices.
However, I don't quite understand the purpose of y_train and y_test when preparing and splitting the data set.
When I feed in x_train and y_train, the model trains fine. After that, I pass in only x_test without y_test, and the model can still produce predictions. Why?
Thank you so much dude
The x_test values are the ones you are trying to make predictions on without having the answers. This represents a real-world scenario. You need to compare your y_pred with your y_test values in order to evaluate your model and get a score.
Once the model is deployed and you use real values, you won't have any y_test to compare against, since you are working with unseen data. The test set in model development tries to emulate this in order to check how well the model generalizes to real-world data.
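For example, a minimal sketch of that comparison (assuming a regression-style LSTM and that y_test holds the true prices aligned with x_test; RMSE is just one possible score):

import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = model.predict(x_test)                        # prediction needs only x_test
rmse = np.sqrt(mean_squared_error(y_test, y_pred))    # evaluation is where y_test is needed
print("RMSE:", rmse)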
I am using this GitHub package https://github.com/5663015/elm/blob/master/elm.py for Extreme Learning Machine models. I run the following code on my dataset.
from sklearn.model_selection import train_test_split
from elm import ELM  # the ELM class from the linked elm.py

# Create target series and data splits
y = df['rain'].copy()
X = df[['lag1']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, shuffle=False)
# model
model = ELM(hid_num=10).fit(X_train, y_train)
# predictions
prediction = model.predict(X_test)
In the dataset, the target variable is rainfall and the predictor is lag1 of the rainfall data. The data is a time series, so I set shuffle=False. I used 70% of the data to train the model and 30% as a test set.
The model works and I can get predictions. However, each time I run the model I get different prediction values and a different RMSE (which I use to evaluate model performance). Could you please let me know whether this is common with ELM models, and whether there is any way to get the same predictions and RMSE each time the model is run?
Each time you train the model, a different random seed is chosen. As a result, the initialization changes and, thus, the optimization behaves differently.
To fix the random seed with NumPy, call np.random.seed(...) (docs).
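A minimal sketch of that, assuming the linked elm.py draws its random hidden-layer weights from NumPy's global random state and is importable as elm:

import numpy as np
from elm import ELM  # assumed module/class name from the linked elm.py

np.random.seed(42)   # fix the global seed before the model is created and fitted
model = ELM(hid_num=10).fit(X_train, y_train)
prediction = model.predict(X_test)  # now reproducible across runs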
I am done training my LSTM model and it gives really good results. But I am wondering: can it only predict values for the x_test dataset? I usually write model.predict(x_test). What should I do if I want to predict the actual future, i.e. time steps that never appear in the test set?
Here is a piece of Python code, taken from SoloLearn:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print(scores)
print(np.mean(scores))
My question then is, do I need to create a new model in every split?
Why don't we just create one LogisticRegression before the for loop?
I would move it outside the loop to save computation time, but since it is presented this way I thought there must be a reason.
Great question! The answer is...you don't have to create the model each time. Your intuition is correct. Feel free to move model = LogisticRegression() to the top, outside the loop, and re-run to confirm.
The model object that exists after model.fit(X_train, y_train) each time through the loop will be the same either way.
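A minimal sketch of that variant (same imports, X, and y as the snippet above; each fit() call simply refits the same object from scratch):

model = LogisticRegression()  # created once, outside the loop
scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)   # refitting overwrites the previous fold's coefficients
    scores.append(model.score(X_test, y_test))
print(np.mean(scores))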
The short answer is yes.
The reason is that this is k-fold cross-validation.
Simply put, it means you are training k models, evaluating each one, and averaging the results.
We do this when we do not have separate data sets for training and testing. Cross-validation splits the data into k subgroups; each fold uses one subgroup as its test set and the remaining subgroups as its training set. We train a model on the training portion of the first fold and test it on the held-out portion, then repeat for all folds with a new model each time, and in the end we have out-of-sample predictions for the full dataset.
Here is a link to a detailed description of cross validation - https://machinelearningmastery.com/k-fold-cross-validation/
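If you only need the per-fold scores rather than the explicit loop, the same procedure can be expressed with scikit-learn's cross_val_score (a sketch, assuming X and y as in the snippet above; a fresh clone of the estimator is fitted for each fold):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print(scores, np.mean(scores))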
KFold is used for cross-validation, that is, training a model and evaluating it.
Here is an example of documentation on the subject.
When doing that you obviously need two datasets: a training AND an evaluation data set.
When using KFold, you split your training set into a number of folds (5 in your example) and run five models, each time using one fifth as the validation set and the remaining four fifths as the training set.
Now, to answer the question: you need a new model each time because you effectively have five models, since each of the five runs uses a different training set and a different validation set. You must create a new one in scikit-learn because when you run model.fit() the model is fitted to that specific dataset, so you cannot reuse the same fitted object for another training dataset without overwriting it.
If you want to write the constructor only once, you can make copies, for example with sklearn.base.clone (a plain assignment such as model_fold = model would only create a reference, not a copy):

from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

model = LogisticRegression(**params)

def parse_kfold(model):
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        model_fold = clone(model)  # a fresh, unfitted copy with the same parameters
        ...
I am building a Python application in which I want to forecast PM2.5 values over a month. I am using polynomial regression, and I split the data into 70% training data and 30% test data. I am using this line of code to make the split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,shuffle=True)
But I have noticed that if I give random_state different integers, the mean squared error differs and so does the accuracy of the forecast. How can I find the optimal parameters for train_test_split so that the forecast is as accurate as possible?
How much does the accuracy vary when you change the random seed?
You can use k-fold cross-validation to evaluate many splits; however, I am not sure you want the single split with the highest accuracy. You want your model to generalize. You should go for a split that leaves enough training data and a test set that is representative of the real-world data the model will encounter.
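One quick way to see how much the result depends on the split is to repeat the split-fit-score cycle for several seeds (a sketch, assuming X and y as in your code; the degree-2 polynomial pipeline is only a hypothetical stand-in for your model):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

mses = []
for seed in range(10):  # try several random splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, shuffle=True)
    model.fit(X_train, y_train)
    mses.append(mean_squared_error(y_test, model.predict(X_test)))

print("MSE mean:", np.mean(mses), "std:", np.std(mses))  # a large std means the score depends heavily on the split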
train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions about this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for stopword removal, which I have already done.)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?
tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.train_test_split() facilitates this by splitting your data into a test and a training dataset in such a way that you can analyse how well the classifier performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without train_test_split() (and I suspect that is what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with the data split you made with train_test_split(). Full stop. They come from a separate, randomly generated process (your ShuffleSplit() call) that is completely unrelated to what train_test_split() does.
The purpose of ShuffleSplit() is to give you indices so that you can partition the data into training and test sets yourself. train_test_split() instead chooses its own indices and does the partitioning for you. You should use one or the other, and use it consistently.
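If you do want the index-based route, here is a sketch using the current scikit-learn API (modern ShuffleSplit takes n_splits and exposes a split() method rather than the constructor-with-sample-count form used above; this assumes features and labels are NumPy arrays, so use .iloc for pandas objects):

import numpy as np
from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit(n_splits=1, train_size=0.67, test_size=0.33, random_state=42)
train_index, test_index = next(ss.split(features))   # index arrays into your data

features_train, features_test = features[train_index], features[test_index]
labels_train, labels_test = labels[train_index], labels[test_index]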
Yes. You can always just call
pred = clf.predict(features), or, if you want to stack the two splits back together, pred = clf.predict(np.concatenate([features_train, features_test]))
The Full Story
You need cross-validation if you want to do this right. The whole purpose of cross-validation is to avoid overfitting.
Basically, if you evaluate your model on the same data you trained it on, it is going to perform really well on the training set (because, well, that's what you trained it on), and that will skew your overall picture of how well the model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset).
The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise to new examples. If you don't give the classifier any data to train on, it will be unable to do anything with the data you feed it in predict.
Finally, while it is perfectly possible to get the labels for the whole set (see tl;dr), it is a really bad idea if you actually care about whether you are getting sensible results.
You already have the labels for the testing and training data. You don't need another column of predictions for the data you trained on, because those will either come out identical to the labels you already have or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: report an accuracy metric, for example, or use k-fold cross-validation to estimate accuracy, or look at log-loss or AUC or any one of a number of metrics to gauge whether your model is performing well.
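A sketch of what that evaluation could look like with scikit-learn's metrics (assuming a fitted clf and the features_test/labels_test split from your code; log_loss and roc_auc_score as written assume a binary target and a classifier that exposes predict_proba):

import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

pred = clf.predict(features_test)
proba = clf.predict_proba(features_test)[:, 1]   # probability of the positive class

print("accuracy:", accuracy_score(labels_test, pred))
print("log loss:", log_loss(labels_test, proba))
print("AUC:", roc_auc_score(labels_test, proba))

# 5-fold cross-validated accuracy over the full dataset
print("CV accuracy:", np.mean(cross_val_score(DecisionTreeClassifier(), features, labels, cv=5)))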
Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to split by row indices is below (here the indices are generated with train_test_split on the index range). X and y are np.arrays: X is (number of instances) × (number of features), and y contains the label of each row.
from sklearn.model_selection import train_test_split

train_inds, test_inds = train_test_split(range(len(y)), test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test, y_test = X[test_inds], y[test_inds]
You should not test on your training data! But if you want to see what happens, just do
pred = clf.predict(features_train)
Also, you do not need to pass the labels to predict; it takes only the features. For scoring, you should be using
from sklearn import metrics
score = metrics.accuracy_score(y_test, pred)