When you call cross_validation.train_test_split(features, labels, test_size), a single data set is automatically split into training and testing data. But how can you train and test on two separate sets of data? If the training data is in one file and the testing data is in another, and you want to train on the train file and then test on the test file, how can you do that? cross_validation only takes one set of data and splits it into train and test automatically.
Thanks!!
When there is just one split there is no cross-validation: you literally train on one dataset and check your accuracy (or other metric) on the test one, without using CV (as said before, there is no such thing as CV for a single split). This is the exact opposite of what CV is for. CV was introduced because a single split is not enough for a valid estimate of test performance on a small dataset.
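A minimal sketch of that single-split workflow, assuming the two sets are already loaded as arrays. The synthetic data, the choice of LogisticRegression, and all variable names are illustrative assumptions, not part of the original question. In current scikit-learn releases train_test_split lives in sklearn.model_selection, but it is not needed here at all.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for data that would normally come from two separate files.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(50, 4))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Fit on the training set only.
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate on the separate test set; no splitting function is involved.
test_predictions = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Accuracy on the separate test set: {test_accuracy:.2f}")
```

The only requirement is that both sets have the same columns in the same order.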
I'm trying to get into machine learning, and I've been following this tutorial:
https://www.analyticsvidhya.com/blog/2021/05/classification-algorithms-in-python-heart-attack-prediction-and-analysis/
Near the end, we split the dataset into training and testing using train_test_split
x = data3.drop("output", axis=1)
y = data3["output"]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
That is, we use the same dataset for training and testing, 70% for training and 30% for testing.
But how can I use another dataset to test my model?
One scenario came to mind: "You trained your model on 250 patients, now test it against these 3 patients that we have, so we can see the chances of them having a heart attack".
How can I, instead of splitting the data, use another csv/dataframe as the test set? Assume this test data has the same format as the training data, just fewer rows.
train_test_split(x, y, test_size=0.3) only divides the data into a training set and a testing set. After training the model on the training data, you can test it on your other data too. The function exists mainly to split the data you already have; any data can be used for testing as long as its attributes and types are the same as in the training data. To test on 3 patients, pass the patients' data to model.predict() as a dataframe or an array, depending on how the model was trained.
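As a rough illustration of passing new patients to model.predict(), here is a sketch with made-up numbers. The column names age and chol, the values, and the LogisticRegression model are assumptions for the example only and are not taken from the tutorial's dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training data; "output" is the target column, as in the question.
train_df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 67],
    "chol": [233, 250, 204, 236, 354, 263, 199, 286],
    "output": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Three new patients with the same feature columns and no target column.
new_patients = pd.DataFrame({"age": [45, 61, 58], "chol": [210, 300, 240]})

feature_columns = ["age", "chol"]
model = LogisticRegression(max_iter=1000)
model.fit(train_df[feature_columns], train_df["output"])

# One predicted label and one estimated probability of class 1 per patient.
predicted_labels = model.predict(new_patients[feature_columns])
predicted_probabilities = model.predict_proba(new_patients[feature_columns])[:, 1]
print(predicted_labels, predicted_probabilities)
```

Selecting the columns by name on both dataframes keeps the feature order identical between fitting and predicting.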
Just as you load one dataframe from a file:
data1 = pd.read_csv("heart.csv")
If you had two separate data files, you'd load them into separate dataframes and skip the train_test_split step.
train_df = pd.read_csv("heart_train.csv")
test_df = pd.read_csv("heart_test.csv")
Since the two dataframes are already separate, you just have to make sure you do any cleaning and pre-processing steps on both of them, including removal of the target variable (y).
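A sketch of the full two-file flow follows. To keep it runnable without files, io.StringIO objects stand in for heart_train.csv and heart_test.csv; the rows, the column names other than output, the helper split_features_target, and the DecisionTreeClassifier are all assumptions made for illustration.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# In-memory stand-ins for the two CSV files (made-up rows).
train_csv = io.StringIO(
    "age,chol,output\n"
    "63,233,1\n37,250,1\n41,204,1\n56,236,1\n"
    "57,354,0\n44,263,0\n52,199,0\n67,286,0\n"
)
test_csv = io.StringIO("age,chol,output\n45,210,1\n61,300,0\n58,240,1\n")

train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)


def split_features_target(df, target="output"):
    """Hypothetical helper: apply the same preprocessing to any dataframe."""
    return df.drop(target, axis=1), df[target]


# The same steps run on both dataframes, so the columns stay aligned.
x_train, y_train = split_features_target(train_df)
x_test, y_test = split_features_target(test_df)

model = DecisionTreeClassifier(random_state=0)
model.fit(x_train, y_train)
test_accuracy = model.score(x_test, y_test)
print(f"Accuracy on the test file: {test_accuracy:.2f}")
```

Putting the cleaning steps in one function is one way to make sure the train and test dataframes are treated identically.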
I have many categorical variables which exist in my test set but not in my train set. They are important, so I can't drop them. Should I combine the train and test sets, or is there another solution?
You have some options in this case. Instead of a holdout split, you can use another technique to separate your data, such as K-Fold cross-validation or leave-one-out.
When using a holdout, it is necessary to stratify your data so that every subset (train/test/validation) contains all classes. NEVER use your test or validation dataset to fit your model: if you do, the model will learn this data and you will probably overfit. Read more about it here
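One way to stratify a holdout split is the stratify argument of train_test_split. The sketch below uses a made-up imbalanced label vector (40 samples of class 0, 10 of class 1); the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 samples, imbalanced 80/20 between class 0 and class 1.
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 40 + [1] * 10)

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```

Without stratify, a small or imbalanced dataset can end up with a class missing from one of the subsets.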
How did you end up in this situation? Normally you take a dataset and divide it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
But it is clear that, because both subsets originate from the same original dataset, they have the same categorical variables.
I'm currently working on a problem that compares the performance of three different machine learning algorithms on the same data set. I divided the data set into 70/30 training/testing sets and then performed a grid search for the best parameters of each algorithm using GridSearchCV with X_train, y_train.
First question: am I supposed to perform the grid search on the training set, or on the whole data set?
Second question: I know that GridSearchCV uses K-fold in its implementation. Does that mean I performed cross-validation if I used the same X_train, y_train for all three algorithms I compare in GridSearchCV?
Any answer would be appreciated, thank you.
All estimators in scikit-learn whose names end with CV perform cross-validation.
But you need to keep a separate test set for measuring the performance.
So you need to split your whole data into train and test. Forget about this test data for a while.
Then pass only the train data to the grid search. GridSearchCV will split this train data further into train and validation folds to tune the hyper-parameters passed to it, and finally fit the model on the whole train data with the best parameters found.
Now you need to test this model on the test data you kept aside in the beginning. This will give you the near real-world performance of the model.
If you pass the whole data to GridSearchCV, test data leaks into the parameter tuning, and the final model may not perform that well on newer unseen data.
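The procedure described above could look roughly like this. The synthetic dataset, the decision tree, and the max_depth grid are assumptions chosen only to keep the sketch small; the same pattern would apply to each of the three algorithms being compared.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Step 1: set the test data aside before any tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 2: grid search with 5-fold cross-validation on the training data only.
param_grid = {"max_depth": [2, 4, 6]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# Step 3: the refitted best model is scored once on the untouched test data.
test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print(f"mean CV accuracy of best parameters: {search.best_score_:.2f}")
print(f"test accuracy: {test_accuracy:.2f}")
```

With the default refit=True, GridSearchCV retrains the best configuration on all of X_train, which is why search.score can be called directly.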
You can look at my other answers which describe the GridSearch in more detail:
Model help using Scikit-learn when using GridSearch
scikit-learn GridSearchCV with multiple repetitions
Yes, GridSearchCV performs cross-validation. If I understand the concept correctly - you want to keep part of your data set unseen for the model in order to test it.
So you train your models against train data set and test them on a testing data set.
Here I was doing almost the same - you might want to check it...
from sklearn.model_selection import train_test_split
I have a question about the train_test_split function from sklearn. First, why do we split the data??? And where do we get the testing data from? Do we just chop the data in half and use some of it to train and some of it to test?? That doesn't make sense since the data is already filled. If it is filled, then what are we predicting now?? I need help!
First, why do we split the data???
We split the data to isolate a portion of it for validation purposes. Use the non-isolated portion to fit the algorithm, then test the fitted algorithm against the isolated portion.
where do we get the testing data from.
The testing data is actually part of your original dataset.
Do we just chop the data in half
We usually chop at the 20-40% mark.
If it is filled, then what are we predicting now??
You are actually not trying to predict the result directly. You are training the algorithm to fit the training set and use the testing set to see how accurate the algorithm is.
Your original dataset should be split up into training and testing data. For example, 80% of the data could be used for training and 20% could be used for testing. The data is split so that there is data for the model to be evaluated on to see how well the model performs on unseen data.
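To make the point about "already filled" data concrete, here is a small sketch using the iris dataset bundled with scikit-learn; the dataset and the LogisticRegression model are assumptions picked for illustration. The labels of the test portion are known, and that is exactly what allows the predictions to be checked.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% of the rows for training, 20% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # the model never sees X_test or y_test here

# Predict the held-back rows, then compare with their known labels.
predictions = model.predict(X_test)
n_correct = int((predictions == y_test).sum())
accuracy = accuracy_score(y_test, predictions)
print(f"{n_correct} of {len(y_test)} test rows predicted correctly")
```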
Training: This data is used to build your model, e.g. finding the optimal coefficients in a Linear Regression model, or using the CART algorithm to create a Decision Tree.
Testing: This data is used to see how the model performs on unseen data, as it would in a real-world situation. This data should be left completely unseen until you would like to test your model to evaluate performance.
Extra notes on validation data:
To tune your model (for example, finding the best max_depth value for a decision tree), the training data should itself be divided into training and validation data. K-Fold cross-validation can be used here. The model is trained on the training portion and evaluated on the validation portion, and this is repeated multiple times, once per fold.
The results (e.g. MSE, F1 etc.) can then be evaluated on each fold and used to tune the hyperparameters. Tuning the hyperparameters with cross-validation ensures that the model is not overfitting to the test data.
Once the model is tuned, it can then be applied to the test data.
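A sketch of that tuning loop, assuming a decision tree and a handful of candidate max_depth values; the synthetic data, the candidate list, and the variable names are illustrative assumptions. It uses cross_val_score by hand rather than GridSearchCV so that the per-fold evaluation is visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Evaluate each candidate depth with 5-fold CV on the training data only.
candidate_depths = [1, 2, 3, 5, 8]
mean_cv_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    fold_scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_cv_scores[depth] = fold_scores.mean()

# Pick the depth with the best mean validation score.
best_depth = max(mean_cv_scores, key=mean_cv_scores.get)

# Refit on all training data, then apply the tuned model to the test data once.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=1)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"best max_depth: {best_depth}, test accuracy: {test_accuracy:.2f}")
```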
I have my training and testing data separate (from different CSVs loaded into different pandas dataframes) and I want to plot the learning curve with this training and testing data, instead of training and test data generated from the training set itself using cross-validation (which seems to be the usual way learning_curve works).
It seems like scikit-learn expects your testing and training data to be present in the same dataframe, but this way the classifier would learn the test data as well, which is not what I want.
How can I go about solving this problem? I am new to scikit-learn.
You will need to keep your training and test data separate (at least in separate variables within the code). The learning curve can then be applied on the training set. This way you can optimize your experiment without using the test set (in order to avoid overfitting).
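To see how the score changes across hyperparameter values, scikit-learn offers validation_curve. Note that, like learning_curve, it scores on cross-validation folds of the data you pass in rather than on a separate test set, so the final check against the test set still has to be done explicitly.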
Scikit-learn's learning_curve is trickier. It lets you define train_sizes for the training subsets and then runs a cross-validation for each of them (parameter cv, which defaulted to 3-fold in older releases and defaults to 5-fold since version 0.22).
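If the goal is a learning curve scored on a fixed, separate test set, one simple option is to skip learning_curve and write the loop by hand. This is a sketch of that swapped-in approach, not how learning_curve itself works; the synthetic arrays, the chosen sizes, and the LogisticRegression model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the separately loaded training and testing data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(100, 5))
y_test = (X_test[:, 0] > 0).astype(int)

train_sizes = [30, 100, 200, 300]
train_scores, test_scores = [], []
for n in train_sizes:
    # Fit on the first n training rows only; the test set is never fitted on.
    clf = LogisticRegression()
    clf.fit(X_train[:n], y_train[:n])
    train_scores.append(clf.score(X_train[:n], y_train[:n]))
    test_scores.append(clf.score(X_test, y_test))

for n, tr, te in zip(train_sizes, train_scores, test_scores):
    print(f"n={n:3d}  train accuracy={tr:.2f}  test accuracy={te:.2f}")
```

The two score lists can be plotted against train_sizes with any plotting library. Shuffling the training rows first would be advisable if the file is ordered in some way.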