I am using Keras to train a regression model. My code looks like this:
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=32, verbose=2)))
pipeline = Pipeline(estimators)
X_train, X_test, y_train, y_test = train_test_split(X, Y,
train_size=0.75, test_size=0.25)
pipeline.fit(X_train, y_train)
The problem is that it is dramatically overfitting. How can I see the
validation error after each epoch?
You can pass these parameters to the fit method of KerasRegressor:
validation_split: float (0. < x < 1). Fraction of the data to use as held-out validation data.
validation_data: tuple (x_val, y_val) or tuple (x_val, y_val, val_sample_weights) to be used as held-out validation data. Will override validation_split.
They go through the fit method of Pipeline:
**fit_params : dict of string -> object. Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
pipeline.fit(X_train, y_train, mlp__validation_split=0.3)
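The s__p routing can be checked without Keras. The sketch below is only an illustration: it swaps in sklearn's Ridge for the KerasRegressor step (an assumption, made so the example runs with sklearn alone), names that step ridge, and routes sample_weight, which is a fit parameter of Ridge. mlp__validation_split in the line above reaches the mlp step in the same way.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy regression data (made up for this illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
weights = rng.uniform(0.1, 1.0, size=50)

# Ridge stands in for the 'mlp' step; the step is named 'ridge' here
pipeline = Pipeline([("standardize", StandardScaler()), ("ridge", Ridge())])

# 'ridge__sample_weight' is forwarded to Ridge.fit as sample_weight
pipeline.fit(X, y, ridge__sample_weight=weights)

# The same thing done by hand: scale first, then fit Ridge with the weights
X_scaled = StandardScaler().fit_transform(X)
manual = Ridge().fit(X_scaled, y, sample_weight=weights)

same = np.allclose(pipeline.named_steps["ridge"].coef_, manual.coef_)
print("Pipeline routing matches manual fit:", same)
```

With the Keras step and a non-zero verbose setting, the validation loss should then show up in the per-epoch log next to the training loss.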
How can I implement a "repeated" holdout method? I implemented a single holdout and got an accuracy, but I need to repeat the holdout 30 times.
Here is my code for the holdout method:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y.values.ravel(), random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
Accuracy: 49.62%
I have seen a lot of code for repeated K-fold cross-validation, but nothing for a repeated holdout.
For a repeated holdout you can use the ShuffleSplit class from sklearn. A minimal working example (following the naming conventions you used) might look as follows:
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create some artificial data to train on; replace this with your own data
X, Y = make_classification()
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
model = LogisticRegression()
for train_index, test_index in rs.split(X):
    X_train, Y_train = X[train_index], Y[train_index]
    X_test, Y_test = X[test_index], Y[test_index]
    # Refit on this split's training part before scoring on its test part
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Accuracy: %.2f%%" % (result*100.0))
n_splits determines how many times the holdout is repeated. test_size determines the fraction of samples drawn as the test set. In this case 75% of the samples go to the training set and 25% go to the test set. For reproducible results you can set random_state (any number will do, as long as you use the same number consistently).
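If only the scores are needed, the loop can be shortened by passing the ShuffleSplit object as the cv argument of cross_val_score, which refits a copy of the model on every split. A sketch, again on artificial data (the random_state values and max_iter are arbitrary choices for this illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Artificial data; replace with your own X and Y
X, Y = make_classification(random_state=0)

rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
model = LogisticRegression(max_iter=1000)

# One accuracy value per holdout repetition
scores = cross_val_score(model, X, Y, cv=rs)
print("Mean accuracy over %d holdouts: %.2f%% (std %.2f%%)"
      % (len(scores), scores.mean() * 100.0, scores.std() * 100.0))
```

Reporting the mean together with the standard deviation gives a feel for how much a single holdout result depends on the particular split.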
Is it possible (and how if it is) to dynamically train sklearn MultinomialNB Classifier?
I would like to train (update) my spam classifier every time I feed an email into it.
I want this (does not work):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    clf.fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
to have similar result as this (works OK):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
clf.fit(x_train, y_train)
preds = clf.predict(x_test)
Scikit-learn supports incremental learning for several algorithms, including MultinomialNB. See the section on incremental learning in the scikit-learn documentation.
You'll need to use the method partial_fit() instead of fit(), so your example code would look like:
import numpy
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    if i == 0:
        # The first call must be told every class that can ever occur
        clf.partial_fit([x_train[i]], [y_train[i]], classes=numpy.unique(y_train))
    else:
        clf.partial_fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
Edit: added the classes argument to partial_fit, as suggested by #BobWazowski
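To check that feeding samples one at a time really ends up at the same model as a single fit call, here is a small sketch on made-up count data (the data, the alternating labels and the variable names are assumptions for illustration, not taken from the question):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count features (e.g. word counts) and alternating spam/ham labels
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(60, 10))
y = np.arange(60) % 2
classes = np.unique(y)

# Incremental model: one sample per call, classes given on the first call
incremental = MultinomialNB()
for i in range(len(X)):
    if i == 0:
        incremental.partial_fit(X[i:i + 1], y[i:i + 1], classes=classes)
    else:
        incremental.partial_fit(X[i:i + 1], y[i:i + 1])

# Batch model trained on everything at once
batch = MultinomialNB().fit(X, y)

X_new = rng.randint(0, 5, size=(20, 10))
same_predictions = np.array_equal(incremental.predict(X_new), batch.predict(X_new))
same_params = np.allclose(incremental.feature_log_prob_, batch.feature_log_prob_)
print("Same predictions:", same_predictions, "| same parameters:", same_params)
```

In a live spam filter the later partial_fit calls would simply happen whenever a newly labelled email arrives, instead of inside one loop.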
I'm trying to learn the basics of XGBoost and devised a script that splits some data I found on Kaggle about coronavirus outbreaks in China. The code and model work, but for some reason when I use the model to make a new prediction I get a "ValueError: feature_names mismatch." The new test data is a 2-d array with 2 values, just like the test data, but I still get a value error.
train = df[['RegionCode','ProvinceCode']].astype(int)
test = df['infected'].astype(int)
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
param = {
'num_class': 2}
epochs = 10
model = xgb.train(param, train, epochs)
All the code above works, but the test below gives me the error:
testArray=np.array([[13, 67]])
test_individual = xgb.DMatrix(testArray)
Any idea what I'm doing wrong?
It seems you are misusing the train_test_split function of sklearn.
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
The call above expects its first argument (train) to hold all the features used for training and its second argument (test) to hold the target. It returns the pieces in the order X_train, X_test, y_train, y_test, so unpacking them as X_test, X_train, y_test, y_train swaps your training and test sets.
Try fixing that first.
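The effect of the unpacking order is easy to see on a toy array. This is only a sketch: features and target are made-up stand-ins for the question's two feature columns and its infected column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the two feature columns and the target
features = np.arange(20).reshape(10, 2)
target = np.arange(10) % 2

# Correct order: the train parts come before the test parts
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)

# Swapped unpacking, as in the question: the names no longer match the data
X_test_bad, X_train_bad, y_test_bad, y_train_bad = train_test_split(
    features, target, test_size=0.2, random_state=42)
print(X_train_bad.shape, X_test_bad.shape)
```

As for the feature_names mismatch itself, a likely cause (an assumption, since the prediction call is not shown) is that the training DMatrix was built from a DataFrame and therefore carries the column names RegionCode and ProvinceCode, while a DMatrix built from a bare NumPy array carries none. If so, building the new DMatrix from a DataFrame with the same columns, or passing feature_names to xgb.DMatrix, should make the names match.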
I have used the train_test_split function to divide my data into X_train, X_test, y_train, y_test, and then used utils.data.DataLoader to feed it to my CNN. The problem is that I do not know how to access my labels tensor to build a confusion matrix and compare it with my prediction tensor. I know it's a basic question, but your help is appreciated.
X_train, X_test, y_train, y_test = train_test_split(faces, emotions, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=41)
and I used
train = torch.utils.data.TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
train_loader = torch.utils.data.DataLoader(train, batch_size=100, shuffle=True)
for feeding the data to my network
It seems you can access the labels through a targets attribute on the training set, like train_set.targets, but that does not work for me. How can I get my labels?
PyTorch's DataLoader object is roughly used like this:
for i, (inputs, labels) in enumerate(dataloader):
    inputs = inputs.to(device)
    labels = labels.to(device)
    outputs = model(inputs)
    _, preds = torch.max(outputs, 1)
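In general I would suggest using two DataLoaders, one for training and one for testing/validating. Since you want to make a confusion matrix, collect the labels and your predictions preds batch by batch inside the loop, e.g. by concatenating them into two numpy arrays. Note that with shuffle=True the batches arrive in a different order than your numpy array y_train, so comparing the predictions against y_train directly only works for a loader created with shuffle=False.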
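A minimal sketch of that idea follows. The random data, the three classes and the Linear layer standing in for the CNN are all assumptions for illustration. It also shows that a TensorDataset keeps the tensors it was built from in its tensors attribute, which is where the labels live (a targets attribute exists on some other dataset classes, such as torchvision's MNIST, but not on TensorDataset).

```python
import numpy as np
import torch

# Made-up data standing in for X_val / y_val (4 features, 3 classes)
rng = np.random.RandomState(0)
X_val = rng.normal(size=(50, 4)).astype(np.float32)
y_val = rng.randint(0, 3, size=50).astype(np.int64)
num_classes = 3

val = torch.utils.data.TensorDataset(torch.from_numpy(X_val), torch.from_numpy(y_val))
val_loader = torch.utils.data.DataLoader(val, batch_size=16, shuffle=True)

# A TensorDataset keeps its tensors in .tensors, in the order they were passed
all_labels_direct = val.tensors[1]

model = torch.nn.Linear(4, num_classes)  # placeholder for the real CNN
model.eval()

batch_labels, batch_preds = [], []
with torch.no_grad():
    for inputs, labels in val_loader:
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        # Store labels and predictions together so their order always matches
        batch_labels.append(labels)
        batch_preds.append(preds)

labels_all = torch.cat(batch_labels)
preds_all = torch.cat(batch_preds)

# Confusion matrix: rows are true classes, columns are predicted classes
flat = labels_all * num_classes + preds_all
cm = torch.bincount(flat, minlength=num_classes ** 2).reshape(num_classes, num_classes)
print(cm)
```

If you prefer sklearn's confusion_matrix, labels_all.numpy() and preds_all.numpy() can be passed to it instead of building the matrix by hand.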
For more information on how to use the DataLoader, I suggest looking at this very good tutorial:
I have written a classification algorithm in Python. It conforms to scikit-learn's API. Given labeled data X, y, I would like to train my algorithm on this data in the following way:
X, y are split into X_aux, y_aux and X_test, y_test.
X_aux, y_aux are split into X_train, y_train and X_val, y_val.
Then, using Scikit-Learn, I define a Pipeline which is the concatenation of a StandardScaler (for feature normalization) and my model. Eventually, the pipeline is trained and evaluated as follows:
pipe = Pipeline([('scaler', StandardScaler()), ('clf', Model())])
pipe.fit(X_train, y_train, clf__validation_data=(X_val, y_val))
pred_proba = pipe.predict_proba(X_test)
score = roc_auc_score(y_test, pred_proba)
The fit method of Model accepts a validation_data parameter to monitor progress during training and possibly avoid overfitting. To this end, at each iteration, the fit method prints the model loss on the training data (X_train, y_train) (training loss) and the model loss on the validation data (X_val, y_val) (validation loss). In addition to the validation loss, I would also like the fit method to report the ROC AUC score on the validation data. My question is the following:
Should X_val be normalized with the scaler of the pipeline before it is used to compute the validation ROC AUC score during training? Also, in this code, only X_train is normalized by the scaler. Should I do X_aux = scaler.fit_transform(X_aux) instead and then split into train/validation?
I apologize in advance if my question is very naive. I confess I got confused.
I think that X_val should be normalized. The way I see it is that the few lines of code above are equivalent to:
scaler = StandardScaler()
clf = Model()
X_train = scaler.fit_transform(X_train)
clf.fit(X_train, y_train, validation_data = (X_val, y_val))
# During `fit`, at each iteration, we would have:
# train_loss = loss(X_train, y_train)
# validation_loss = loss(X_val, y_val)
# pred_proba_val = predict_proba(X_val)    (*)
# roc_auc_val = roc_auc_score(y_val, pred_proba_val)
X_test = scaler.transform(X_test)
pred_proba = clf.predict_proba(X_test)    # (**)
score = roc_auc_score(y_test, pred_proba)
In line (*) the predict_proba method is called on unnormalized data, whereas it is called on normalized data in line (**). This is why I believe that X_val should be normalized. Still, I am not sure whether my thinking is correct.
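That reasoning looks right to me: a Pipeline only transforms the X it receives as its first argument, so anything handed over as a fit parameter reaches the last step untouched. One way around this, sketched below, is to scale X_val yourself with a scaler fitted on X_train only. Fitting the scaler on X_aux before splitting would let the validation rows influence the scaling statistics. The Model class here is a hypothetical stand-in built on LogisticRegression for a binary problem, and the data is artificial. Note that Pipeline only accepts fit parameters in the step__parameter form, hence clf__validation_data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Model(LogisticRegression):
    """Hypothetical stand-in for the question's Model (binary case only)."""

    def fit(self, X, y, validation_data=None):
        super().fit(X, y)
        if validation_data is not None:
            X_val, y_val = validation_data
            # Score the validation set exactly as it was handed in
            self.val_auc_ = roc_auc_score(y_val, self.predict_proba(X_val)[:, 1])
        return self


X, y = make_classification(n_samples=300, random_state=0)
X_aux, X_test, y_aux, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_aux, y_aux, test_size=0.2, random_state=0)

# Scale the validation set with statistics learned from X_train only
X_val_scaled = StandardScaler().fit(X_train).transform(X_val)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", Model())])
pipe.fit(X_train, y_train, clf__validation_data=(X_val_scaled, y_val))

# The pipeline scales X_test itself; ROC AUC needs the positive-class column
test_auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
print("validation AUC: %.3f, test AUC: %.3f" % (pipe.named_steps["clf"].val_auc_, test_auc))
```

The scaler inside the pipeline is fitted on the same X_train, so it learns the same statistics as the standalone one, and the validation and test data end up on the same scale as the training data.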