I'm trying to get into machine learning, and I've been following this tutorial:
https://www.analyticsvidhya.com/blog/2021/05/classification-algorithms-in-python-heart-attack-prediction-and-analysis/
Near the end, we split the dataset into training and testing using train_test_split
x = data3.drop("output", axis=1)
y = data3["output"]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
That is, we use the same dataset for training and testing, 70% for training and 30% for testing.
But how can I use another dataset to test my model ?
One scenario came to mind: "You trained your model in 250 patients, now test it against these 3 patients that we have, so we can see the chances of them having a heart attack".
How can I, instead of splitting the data, use another csv/dataframe as a test ? Assuming this test data has the same format as the train, just fewer rows.
train_test_split(x,y,test_size=0.3) only divides data into training and testing set. After training the model on training data, you can use your other data for testing too. This function is mainly for splitting current data and you can use any data for testing purposes. You just have to make sure the attributes and the type are same as the training data. If you have to test on 3 patients, all you have to do is to pass the patients data into model.predict() function as a dataframe or an array depends on data.
Just as you load one dataframe from a file:
data1 = pd.read_csv("heart.csv")
If you had two separate data files, you'd load them into separate data files and skip the train_test_split step.
train_df = pd.read_csv("heart_train.csv")
test_df = pd.read_csv("heart_test.csv")
Since the two dataframes are already separate, you just have to make sure you do any cleaning and pre-processing steps on both of them, including removal of the target variable (y).
Related
I am working on a small project that requires using different classification models on BoW data.
I understand that the train and test data must be different to get the model's true accuracy.
For model.score() to work correctly, I need to give it test data and labels in the same dimensions as the initial data. But the test data is in different dimensions, so I do it like this:
vectorizer = CountVectorizer()
traindata_bow = vectorizer.fit_transform(traindata)
testdata_bow = vectorizer.transform(testdata)
Now, the test data has the same dimensions as the initial train data.
Now on to my question:
Test data has its own set of dimensions/"characteristics".
So by transforming it using the vectorizer are we not losing any of the test data's characteristics?
I am asking because my model's accuracy ends up being in the 99.9% range and I worry something gets calculated incorrectly (Though my dataset is quite easy)
For example, after the code above:
traindata_bow.shape is (35918, 34319) and
testdata_bow.shape is (8980, 34319)
But if I run:
testdata_bow = vectorizer.fit_transform(testdata) i get
testdata_bow.shape is (8980, 20806)
So is there any data loss (or even partial merging with the train data) in the transform stage?
I have a flight delay dataset and try to split the set to train and test set before sampling. On-time cases are about 80% of total data and delayed cases are about 20% of that.
Normally in machine learning ratio of train and test set size is 8:2. But the data is too imbalanced. So considering extreme case, most of train data are on-time cases and most of test data are delayed cases and accuracy will be poor.
So my question is How can I properly split imbalanced dataset to train and test set??
Probably just by playing with ratio of train and test you might not get the correct prediction and results.
if you are working on imbalanced dataset, you should try re-sampling technique to get better results. In case of imbalanced datasets the classifier always "predicts" the most common class without performing any analysis of the features.
Also use different metric for performance measurement such as F1 Score etc in case of imbalanced data set
Please go through the below link, it will give you more clarity.
What is the correct procedure to split the Data sets for classification problem?
Cleveland heart disease dataset - can’t describe the class
Start from 50/50 and go on changing the sets as 60/40, 70/30, 80/20, 90/10. declare all the results and come to some conclusion. In one of my work on Flight delays prediction project, I used 60/40 database and got 86.8 % accuracy using MLP NN.
There are two approaches that you can take.
A simple one: no preprocessing of the dataset but careful sampling of the dataset so that both classes are represented in the same proportion in the test and train subsets. You can do it by splitting by class first and then randomly sampling from both sets.
import sklearn
XclassA = dataX[0] # TODO: change to split by class
XclassB = dataX[1]
YclassA = dataY[0]
YclassB = dataY[1]
XclassA_train, XclassA_test, YclassA_train, YclassA_test = sklearn.model_selection.train_test_split(XclassA, YclassA, test_size=0.2, random_state=42)
XclassB_train, XclassB_test, YclassB_train, YclassB_test = sklearn.model_selection.train_test_split(XclassB, YclassB, test_size=0.2, random_state=42)
Xclass_train = XclassA_train + XclassB_train
Yclass_train = YclassA_train + YclassB_train
A more involved, and arguably better one, you can try first to balance your dataset. For that you can use one of many techniques (under-, over-sampling, SMOTE, AdaSYN, Tomek links, etc.). I recommend you review the methods of imbalanced-learn package. Having done balancing you can use the ordinary test/train split using typical methods without any additional intermediary steps.
The second approach is better not only from the perspective of splitting the data but also from the speed and even ability to train a model (which for heavily imbalanced datasets is not guaranteed to work).
I am currently trying to build a classification model for which I am using this dataset for training and testing. It is extracted from the TIMIT database and contains digitized frequencies of five different phoneme classes. The frequencies are under the 256 columns labelled "x.1" - "x.256", while the phoneme class itself is labelled "g". Furthermore, there is also a "speakers" column identifying the different speakers.
My question is, is it possible to split this dataset into a 50:50 ratio of training and test data considering the speakers column? In fact, I want to divide the data so that any speaker is not in both sets, so that I do not validate the trained model with test data containing the same speakers that are already in the training data.
My approach was to extract all speakers from the original dataset using NumPy and make use of the stratify parameter of train_test_split:
X_train, X_test, y_train, y_test = train_test_split(input_data, phonemes, random_state=42, test_size=0.5, stratify=speakers)
But this most likely is not the solution. I would greatly appreciate any help in solving this issue!
Hi you can use pandas library of python to load the csv in to dataframe by using
import pandas as pd
df = pd.read_csv(path_to_csv)
then you can get all unique values of the column speaker by using
arrayOfSpeaker = df['speaker'].unique()
now you can easily use the arrayOfSpeaker to split your data into training and testing set.
Also i would recommend to first randomize the arrayOfSpeaker before slicing the array.
and i normally split the data into 70:20:10 ratio for train:validation:test. I didnt get the point of 50:50 split !
When you do cross_validation.train_test_split(features,labels,test_size), it is one data set that is automatically being split into training and testing data by cross_validation but how can you train and test two separate sets of data? So if the training data is in one file and the testing data is in another file, and you want to first train the data using the train file and then test using the test file how can you do that? Because cross_validation only takes one set of data and splits it into train and test automatically.
Thanks!!
When there is just one split there is no cross validation, you just literally train on one dataset and check your accuracy (or other metric) on test one, without the use of CV (since, as said before - there is no such tring as CV for a single split). This is the exact oposite of what CV is for. CV has been introduced because single split is not enough for valid estimation of test for small dataset.
train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions from this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for stopword removal I have already done)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?
tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.test_train_split() facilitates this by splitting your data into a test and training dataset in such a way that you can analyse how well it performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without test_train_split() (and I suspect that was what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with your data. Full stop. They come from a randomly generated process that is completely unrelated to what test_train_split() does.
The purpose of ShuffleSplit() is to give you the indices to partition your data into training and test yourself. test_train_split() will instead choose their own indices and partition based on those indices. You should either use one or the other and sensibly.
Yes. You can always just call
pred = clf.predict(features) or pred = clf.predict(features_test + features_train)
The Full Story
You need cross_validation if you want to do this right. The whole purpose of cross-validation is to avoid overfit.
Basically, if you run your model on both the training and the testing data, then your model is going to perform really well on the training set (because, well, that's what you trained it on) and that's going to skew your overall metrics of how well your model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known
data on which training is run (training dataset), and a dataset of
unknown data (or first seen data) against which the model is tested
(testing dataset).
The goal of cross validation is to define a
dataset to "test" the model in the training phase (i.e., the
validation dataset), in order to limit problems like overfitting, give
an insight on how the model will generalize to an independent dataset
(i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise their model to new results. If you don't give them any data to train on, it will be unable to do anything with any data you feed it in predict.
Finally, while it is perfectly possible to get the labels for the whole set (see tl;dr) , it is a really bad idea if you actually care about whether or not you're getting sensible results.
You already have the labels for the testing and training data. You don't need another column that includes prediction on the testing data, because they'll either come out to be identical or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: get back an accuracy metric, for example, or try to do k-fold cross-validation to model accuracy, or look at log-loss or AUC or any one of number of metrics to gauge whether or not your model is performing well.
Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to use the indices return by ShuffleSplit is below. X and y are np.array. X is number of instances by number of features. y contains the labels of each row.
train_inds, test_inds = train_test_split(range(len(y)),test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test , y_test = X[test_inds] , y[test_inds]
You should not test on your training data! But if you want to see what happens just do
pred = clf.predict(features_train)
Also you do not need to pass the labels to predict. You should be using
score = metrics.accuracy_score(y_test, pred)