Problem getting the labels of training-set

Problem getting the labels of training-set - python

I have used train_test_split function to divide my data into X_train, X_test, y_train, y_test, and then used utils.data.DataLoader to feed it to my CNN but the problem is that I do not know how to access my labels tensor for making a confusion matrix and comparing them with my prediction tensor. I know its a basic question but anyway your help is appreciated.
X_train, X_test, y_train, y_test = train_test_split(faces, emotions, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=41)
and I used
train = torch.utils.data.TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
train_loader = torch.utils.data.DataLoader(train, batch_size=100, shuffle=True)
for feeding the data to my network
It seems you can access your labels by just typing targets attribute after your train_set like train_set.targets but it does not work for me that way. How can I get my labels?

PyTorch's DataLoader object is roughly used like this:
for i, (inputs, labels) in enumerate(dataloader):
inputs = inputs.to(device)
labels = labels.to(device)
outputs = model(inputs)
_, preds = torch.max(outputs, 1)
In general I would suggest to use two DataLoaders, one for training and one for testing/validating. Since you want to make a confusion matrix, you can access your labels simply by your numpy array y_train and your prediction preds e.g. by concatenating them inside the loop to a numpy array.
For more information on how to use the DataLoader, I suggest looking at this very good tutorial:
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
and
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

Related

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1)

What is purpose of this line :
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1)

For neural networks you have input features (X) and output labels (Y). It's very important to split your data into a training dataset and testing dataset.
To make this easy sklearn has a function called
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None).
Here's the documentation for sklearn.model_selection.train_test_split
Going through the function we can see that:
1.) X is your input features array
2.) Y is your output label array
3.) test_size = 0.25 states that you want your testing data to be 25% of your overall data. Therefore your training data will be 75% of your overall data.
4.) random_state = 1 Controls the shuffling applied to the data before applying the split.
5.) Your question is why do you have 4 outputs (X_train, X_test, y_train, y_test). It is because X will be split into X_train (75%) and X_test (25%) and then Y will be split into y_train (75%) and y_test (25%). It's all put onto one line.

Splitting data to training, testing and valuation when making Keras model

I'm a little confused about splitting the dataset when I'm making and evaluating Keras machine learning models.
Lets say that I have dataset of 1000 rows.
features = df.iloc[:,:-1]
results = df.iloc[:,-1]
Now I want to split this data into training and testing (33% of data for testing, 67% for training):
x_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
I have read on the internet that fitting the data into model should look like this:
history = model.fit(features, results, validation_split = 0.2, epochs = 10, batch_size=50)
So I'm fitting the full data (features and results) to my model, and from that data I'm using 20% of data for validation: validation_split = 0.2.
So basically, my model will be trained with 80% of data, and tested on 20% of data.
So confusion starts when I need to evaluate the model:
score = model.evaluate(x_test, y_test, batch_size=50)
Is this correct?
I mean, why should I split the data into training and testing, where does x_train and y_train go?
Can you please explain to me whats the correct order of steps for creating model?

Generally, in training time (model. fit), you have two sets: one is for the training set and another is for validation/tuning/development set. With the training set, you train the model, and with the validation set, you need to find the best set of hyper-parameter. And when you're done, you may then test your model with unseen data set - a set that was completely hidden from the model unlike the training or validation set.
Now, when you used
X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
By this, you split the features and results into 33% of data for testing, 67% for training. Now, you can do two things
use the (X_test and y_test as validation set in model.fit(...). Or,
use them for final prediction in model. predict(...)
So, if you choose these test sets as a validation set ( number 1 ), you would do as follows:
model.fit(x=X_train, y=y_trian,
validation_data = (X_test, y_test), ...)
In the training log, you will get the validation results along with the training score. The validation results should be the same if you later compute model.evaluate(X_test, y_test).
Now, if you choose those test set as a final prediction or final evaluation set ( number 2 ), then you need to make validation set newly or use the validation_split argument as follows:
model.fit(x=X_train, y=y_trian,
validation_split = 0.2, ...)
The Keras API will take the .2 percentage of the training data (X_train and y_train) and use it for validation. And lastly, for the final evaluation of your model, you can do as follows:
y_pred = model.predict(x_test, batch_size=50)
Now, you can compare with y_test and y_pred with some relevant metrics.

Generally, you'd want to use your X_train, y_train data that you have split as arguments in the fit method. So it would look something like:
history = model.fit(X_train, y_train, batch_size=50)
While not splitting your data before throwing it into the fit method and adding the validation_split arguments work as well, just be careful to refer to the keras documentation on the validation_data and validation_split arguments to make sure that you are splitting them up as expected.
There is a related question here:
https://datascience.stackexchange.com/questions/38955/how-does-the-validation-split-parameter-of-keras-fit-function-work
Keras documentation:
https://keras.rstudio.com/reference/fit.html

I have read on the internet that fitting the data into model should
look like this:
That means you need to fit features and labels. You already split them into x_train & y_train. So your fit should look like this:
history = model.fit(x_train, y_train, validation_split = 0.2, epochs = 10, batch_size=50)
So confusion starts when I need to evaluate the model:
score = model.evaluate(x_test, y_test, batch_size=50) --> Is this correct?
That's correct, you evaluate the model by using testing features and corresponding labels. Furthermore if you want to get only for example predicted labels, you can use:
y_hat = model.predict(X_test)
Then you can compare y_hat with y_test, i.e get a confusion matrix etc.

I want to compare my prediction value with original train data

I am trying to learn decision tree regressor and I have wrote below code.
X_train, X_test, y_train, y_test = train_test_split(
x, y, test_size = 0.3, random_state = 100)
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
I want to create a dataframe which include X_test and Y_test and Y_pred.
Is there any method or function for that.

Append the below code at the end of your prediction code:
final_df = X_test.copy()
final_df["Y_original"] = y_test
final_df["Y_predicted"] = y_pred
Here we are creating a new dataframe namely final_df and putting all the values you require into it. Would not suggest you to directly append values into X_test, as it might be needed for use again for prediction.

naive bayes classifier dynamic training

Is it possible (and how if it is) to dynamically train sklearn MultinomialNB Classifier?
I would like to train(update) my spam classifier every time I feed an email in it.
I want this (does not work):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
clf.fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
to have similar result as this (works OK):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
clf.fit(x_train, y_train)
preds = clf.predict(x_test)

Scikit-learn supports incremental learning for multiple algorithms, including MultinomialNB. Check the docs here
You'll need to use the method partial_fit() instead of fit(), so your example code would look like:
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
if i == 0:
clf.partial_fit([x_train[i]], [y_train[I]], classes=numpy.unique(y_train))
else:
clf.partial_fit([x_train[i]], [y_train[I]])
preds = clf.predict(x_test)
Edit: added the classes argument to partial_fit, as suggested by #BobWazowski

Python xgb: ValueError: "feature_names mismatch"

I'm trying to learn the basics of XGBoost and devises a script that splits some data I found on Kaggle about Corona virus outbreaks in China. The code and model work, but some some reason when I use the model to make a new prediction I get a "ValueError: feature_names mismatch." The new test data has a 2-d array with 2 values, just like the test data, but I still get a value error.
train = df[['RegionCode','ProvinceCode']].astype(int)
test = df['infected'].astype(int)
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
param = {
'max_depth':4,
'eta':0.3,
'num_class': 2}
epochs = 10
model = xgb.train(param, train, epochs)
All the code above works, but the terst below gives me the error:
testArray=np.array([[13, 67]])
test_individual = xgb.DMatrix(testArray)
print(model.predict(test_individual))
Any idea what I'm doing wrong?

Seems like you are missing out on the basics of using the train_test_split function of sklearn.
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
The line above expects the train to have all the features to be used for training, while the test expects the target feature.
Try fixing that first.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Problem getting the labels of training-set - python

Related

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1)

Splitting data to training, testing and valuation when making Keras model

I want to compare my prediction value with original train data

naive bayes classifier dynamic training

Python xgb: ValueError: "feature_names mismatch"

Categories

Resources