How to see the validation error after each epoch in keras - python

I am using keras to train a model for regression. My code looks like:
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=32, verbose=2)))
pipeline = Pipeline(estimators)
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25)
pipeline.fit(X_train, y_train)
The problem is that it is dramatically overfitting. How can I see the
validation error after each epoch?

You can pass parameters to the KerasRegressor fit method:
validation_split: float (0. < x < 1). Fraction of the data to use as held-out validation data.
validation_data: tuple (x_val, y_val) or tuple (x_val, y_val, val_sample_weights) to be used as held-out validation data. Will override validation_split.
through the Pipeline fit method:
**fit_params: dict of string -> object. Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Example:
pipeline.fit(X_train, y_train, mlp__validation_split=0.3)
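As a rough illustration of the s__p routing rule quoted above, the sketch below uses a hypothetical stand-in estimator, RecordingRegressor (not part of Keras or scikit-learn), so that it runs without Keras installed. It only records the validation_split value that Pipeline.fit forwards to the step named mlp; the data is random and purely illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical stand-in for KerasRegressor: it only records fit kwargs."""

    def fit(self, X, y, validation_split=0.0):
        # Remember what the pipeline forwarded to this step.
        self.validation_split_ = validation_split
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Trivial prediction: the mean of the training targets.
        return np.full(len(X), self.mean_)


# Random illustrative data (assumption: any numeric X, y would do here).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
y_train = rng.normal(size=40)

pipeline = Pipeline([
    ("standardize", StandardScaler()),
    ("mlp", RecordingRegressor()),
])

# The key "mlp__validation_split" is routed to step "mlp"
# as the fit parameter "validation_split".
pipeline.fit(X_train, y_train, mlp__validation_split=0.3)

received = pipeline.named_steps["mlp"].validation_split_
print(received)
```

With the real KerasRegressor and verbose=2, the forwarded validation_split should make Keras report val_loss after every epoch. Keep in mind that Keras takes the validation fraction from the end of the data it receives, and that the scaler in the pipeline has already been fitted on all of X_train by then.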

Related

Repeated holdout method

How can I implement a "repeated" holdout method? I implemented the holdout method and got an accuracy, but I need to repeat the holdout 30 times.
Here is my code for the holdout method:
[IN]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y.values.ravel(), random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
[OUT]
Accuracy: 49.62%
I have seen a lot of code for repeated methods, but only for k-fold cross-validation, nothing for the holdout method.
To run a repeated holdout you could use the ShuffleSplit class from sklearn. A minimal working example (following the naming conventions that you used) might be as follows:
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create some artificial data to train on; can be replaced by your own data
X, Y = make_classification()
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
model = LogisticRegression()
for train_index, test_index in rs.split(X):
    X_train, Y_train = X[train_index], Y[train_index]
    X_test, Y_test = X[test_index], Y[test_index]
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Accuracy: %.2f%%" % (result*100.0))
n_splits determines how many times you would like to repeat the holdout. test_size determines the fraction of samples that is drawn as the test set. In this case 75% of the samples go to the training set and 25% to the test set. For reproducible results you can set random_state (any number suffices, as long as you use the same number consistently).
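If only the per-repeat scores are needed, the explicit loop can probably be replaced by cross_val_score, which accepts a ShuffleSplit object as its cv argument. The sketch below is one way to summarize the 30 repeats by mean and standard deviation; the artificial data and the random_state values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Artificial data; replace with your own X and Y.
X, Y = make_classification(random_state=0)

# 30 independent random 75/25 splits (repeated holdout).
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)

# For a classifier the default score is accuracy; one value per split.
scores = cross_val_score(LogisticRegression(), X, Y, cv=rs)

print("Mean accuracy: %.2f%% (std %.2f%%)"
      % (scores.mean() * 100.0, scores.std() * 100.0))
```

Reporting the spread alongside the mean gives a sense of how much a single holdout result, such as the 49.62% in the question, can vary from split to split.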

naive bayes classifier dynamic training

Is it possible (and how if it is) to dynamically train sklearn MultinomialNB Classifier?
I would like to train(update) my spam classifier every time I feed an email in it.
I want this (does not work):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    clf.fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
to have similar result as this (works OK):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
clf.fit(x_train, y_train)
preds = clf.predict(x_test)
Scikit-learn supports incremental learning for multiple algorithms, including MultinomialNB. See the scikit-learn documentation on incremental learning.
You'll need to use the method partial_fit() instead of fit(), so your example code would look like:
import numpy
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    if i == 0:
        # The first call must be told every class that can ever appear
        clf.partial_fit([x_train[i]], [y_train[i]], classes=numpy.unique(y_train))
    else:
        clf.partial_fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
Edit: added the classes argument to partial_fit, as suggested by @BobWazowski
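To check that feeding samples one at a time through partial_fit ends up at the same model as a single fit call, a small comparison like the one below can help. It is a sketch on made-up count data (the feature matrix and labels are assumptions, standing in for real email features), and it passes classes on every call, which scikit-learn accepts as long as the list stays the same.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count features and binary labels standing in for email data.
rng = np.random.default_rng(0)
features = rng.integers(0, 5, size=(60, 8))
labels = np.arange(60) % 2
x_train, y_train = features[:48], labels[:48]
x_test = features[48:]

# Reference model: trained on the whole training set at once.
batch_clf = MultinomialNB().fit(x_train, y_train)

# Incremental model: updated with one sample per call.
online_clf = MultinomialNB()
all_classes = np.unique(y_train)
for i in range(len(x_train)):
    # Slicing with i:i+1 keeps the 2-D shape that partial_fit expects.
    online_clf.partial_fit(x_train[i:i + 1], y_train[i:i + 1], classes=all_classes)

same_predictions = np.array_equal(batch_clf.predict(x_test),
                                  online_clf.predict(x_test))
print(same_predictions)
```

MultinomialNB only accumulates per-class feature counts, which is why the incremental updates can reproduce the batch result; estimators that are not count-based will generally not match this closely.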

Python xgb: ValueError: "feature_names mismatch"

I'm trying to learn the basics of XGBoost and devised a script that splits some data I found on Kaggle about coronavirus outbreaks in China. The code and model work, but for some reason when I use the model to make a new prediction I get a "ValueError: feature_names mismatch." The new test data is a 2-D array with 2 values, just like the test data, but I still get a value error.
train = df[['RegionCode','ProvinceCode']].astype(int)
test = df['infected'].astype(int)
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
param = {
    'max_depth': 4,
    'eta': 0.3,
    'num_class': 2}
epochs = 10
model = xgb.train(param, train, epochs)
All the code above works, but the test below gives me the error:
testArray=np.array([[13, 67]])
test_individual = xgb.DMatrix(testArray)
print(model.predict(test_individual))
Any idea what I'm doing wrong?
It looks like the values returned by sklearn's train_test_split are being unpacked in the wrong order:
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
train_test_split expects its first argument (train here) to hold all the features used for training and its second argument (test here) to hold the target. It returns the splits in the order X_train, X_test, y_train, y_test, so the line above swaps the training and test sets.
Try fixing that first.
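A minimal sketch of the expected unpacking order, using small made-up arrays (features and target are illustrative names, not from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each; row i is [2*i, 2*i + 1].
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# Return order is: train features, test features, train target, test target.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Separately, and as an assumption that the thread does not confirm: a DMatrix built from a DataFrame keeps the column names, while one built from a bare NumPy array does not, which may be what triggers the "feature_names mismatch" message at prediction time. Building the new sample as a DataFrame with the same column names would be one way to test that.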

Problem getting the labels of training-set

I have used the train_test_split function to divide my data into X_train, X_test, y_train, y_test, and then used utils.data.DataLoader to feed it to my CNN. The problem is that I do not know how to access my labels tensor to build a confusion matrix and compare the labels with my prediction tensor. I know it's a basic question, but your help is appreciated.
X_train, X_test, y_train, y_test = train_test_split(faces, emotions, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=41)
and I used
train = torch.utils.data.TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
train_loader = torch.utils.data.DataLoader(train, batch_size=100, shuffle=True)
for feeding the data to my network
It seems the labels can be accessed through the targets attribute of the dataset, as in train_set.targets, but that does not work for me. How can I get my labels?
PyTorch's DataLoader object is roughly used like this:
for i, (inputs, labels) in enumerate(dataloader):
    inputs = inputs.to(device)
    labels = labels.to(device)
    outputs = model(inputs)
    _, preds = torch.max(outputs, 1)
In general I would suggest using two DataLoaders, one for training and one for testing/validating. Since you want to make a confusion matrix, you can collect the labels and the predictions preds inside the loop, e.g. by concatenating each batch onto a numpy array. Collecting the labels inside the loop keeps them aligned with the predictions even when the loader shuffles the data.
For more information on how to use the DataLoader, I suggest looking at these tutorials:
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
and
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
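The suggestion above, collecting labels and predictions batch by batch, can be sketched without PyTorch as follows. The helper names collect_batches and confusion_matrix, the toy batches, and predict_fn are all hypothetical stand-ins; in a real PyTorch loop one would append labels.cpu().numpy() and preds.cpu().numpy() instead.

```python
import numpy as np


def collect_batches(batches, predict_fn):
    """Gather true labels and predictions over all batches, in loader order."""
    all_labels, all_preds = [], []
    for inputs, labels in batches:
        all_labels.append(np.asarray(labels))
        all_preds.append(np.asarray(predict_fn(inputs)))
    return np.concatenate(all_labels), np.concatenate(all_preds)


def confusion_matrix(labels, preds, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at handles repeated (row, column) pairs correctly.
    np.add.at(cm, (labels, preds), 1)
    return cm


# Toy stand-ins: each batch is (scores per class, true labels).
batches = [
    (np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([0, 1])),
    (np.array([[0.6, 0.4], [0.7, 0.3]]), np.array([0, 1])),
]


def predict_fn(inputs):
    # Stand-in for model(inputs) followed by torch.max(outputs, 1).
    return inputs.argmax(axis=1)


labels, preds = collect_batches(batches, predict_fn)
cm = confusion_matrix(labels, preds, n_classes=2)
print(cm)
```

Because the labels come from the same iteration as the predictions, the two arrays stay aligned regardless of shuffle=True.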

Scaling of validation data in supervised ML algorithm

I have written a classification algorithm in Python. It conforms to scikit-learn's API. Given labeled data X, y, I would like to train my algorithm on this data in the following way:
X, y are split into X_aux, y_aux and X_test, y_test.
X_aux, y_aux are split into X_train, y_train and X_val, y_val.
Then, using Scikit-Learn, I define a Pipeline which is the concatenation of a StandardScaler (for feature normalization) and my model. Eventually, the pipeline is trained and evaluated as follows:
pipe = Pipeline([('scaler', StandardScaler()), ('clf', Model())])
pipe.fit(X_train, y_train, clf__validation_data=(X_val, y_val))  # fit params need the step prefix
pred_proba = pipe.predict_proba(X_test)
score = roc_auc_score(y_test, pred_proba)
The fit method of Model accepts a validation_data parameter to monitor progress during training and possibly avoid overfitting. To this end, at each iteration, the fit method prints the model loss on training data (X_train, y_train) (training loss) and the model loss on validation data (X_val, y_val) (validation loss). In addition to the validation loss, I would also like the fit method to return the ROC AUC score on validation data. My question is the following:
Should X_val be normalized with the scaler of the pipeline before it is used to compute the validation ROC AUC score during training? Also, in this code, only X_train is normalized by the scaler. Should I do X_aux = scaler.fit_transform(X_aux) instead and then split into train/validation?
I apologize in advance if my question is naive. I confess I got confused.
I think that X_val should be normalized. The way I see it is that the few lines of code above are equivalent to:
scaler = StandardScaler()
clf = Model()
X_train = scaler.fit_transform(X_train)
clf.fit(X_train, y_train, validation_data=(X_val, y_val))
# During `fit`, at each iteration, we would have:
#     train_loss = loss(X_train, y_train)
#     validation_loss = loss(X_val, y_val)
#     pred_proba_val = predict_proba(X_val)   (*)
#     roc_auc_val = roc_auc_score(y_val, pred_proba_val)
X_test = scaler.transform(X_test)
pred_proba = clf.predict_proba(X_test)  # (**)
score = roc_auc_score(y_test, pred_proba)
In line (*) the predict_proba method is called on unnormalized data whereas it is called on normalized data on line (**). This is why I believe that X_val should be normalized. Still I am not sure whether my thinking is correct.
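One common way to handle this, offered as a sketch rather than a settled answer, is to fit the scaler on X_train only and reuse its statistics for X_val, since a Pipeline forwards fit parameters to the step as they are, without transforming them. The random data below is an assumption for illustration, and the final fit call is left as a comment because Model is the asker's own class.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for X_aux, y_aux.
rng = np.random.default_rng(0)
X_aux = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
y_aux = rng.integers(0, 2, size=100)

# Split first, so that no validation rows influence the scaler.
X_train, X_val, y_train, y_val = train_test_split(
    X_aux, y_aux, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from training rows only
X_val_scaled = scaler.transform(X_val)          # same statistics reused for validation

# Hypothetical call, assuming Model.fit accepts validation_data:
# clf.fit(X_train_scaled, y_train, validation_data=(X_val_scaled, y_val))
```

Calling scaler.fit_transform(X_aux) before splitting would let validation rows contribute to the mean and variance, so the validation score would no longer be a fully held-out estimate.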
