Getting different result even after having the same subset of data - python

I have searched for similar questions but have not found any so far. My problem is:
I want to do cross-validation on non-overlapping subsets of my data using KFold. I create the subsets with KFold and fix the outcome by setting random_state to a fixed integer. When I print the subsets multiple times, the results look fine. However, when I run the same subsets through the model multiple times with model.predict (i.e. running my code several times), I get different results. Naturally, I suspect there is something wrong with how I train the model, but I cannot figure out what it is. I would very much appreciate a hint. Here is my code:
random.seed(42)
# define K-fold cross validation test harness
kf = KFold(n_splits=3, random_state=42, shuffle=True)
for train_index, test_index in kf.split(data):
    print('Train', train_index, '\nTest ', test_index)
    # build the train/test subsets for this fold
    testX = data[test_index]
    trainX = data[train_index]
    testYcheck = labels[test_index]
    testP = Path[test_index]
    # convert the labels from integers to vectors
    trainY = to_categorical(labels[train_index], num_classes=2)
    testY = to_categorical(labels[test_index], num_classes=2)
    # construct the image generator for data augmentation
    aug = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                             height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
                             horizontal_flip=True, fill_mode="nearest")
    # train the network
    print("[INFO] training network...")
    model.fit_generator(aug.flow(trainX, trainY, batch_size=BS),
                        validation_data=(testX, testY),
                        steps_per_epoch=len(trainX) // BS, epochs=EPOCHS, verbose=1)
    # predict the test data
    y_pred = model.predict(testX)
    predYl = []
    for element in range(len(y_pred)):
        if y_pred[element, 1] > y_pred[element, 0]:
            predYl.append(1)
        else:
            predYl.append(0)
    pred_Y = np.array(predYl)
    # compute the confusion matrix for this fold
    cnf_matrix = confusion_matrix(testYcheck, pred_Y)
    np.set_printoptions(precision=2)
    print(cnf_matrix)
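Note that the snippet above only fixes the fold assignment (via random_state) and Python's random module; if Keras/TensorFlow is the backend (as fit_generator suggests), the network's weight initialization and the ImageDataGenerator augmentation draw from other random number generators. A minimal sketch of seeding those as well, assuming TensorFlow 2.x (even with all seeds set, GPU training is not guaranteed to be bit-identical across runs):

import os
import random

import numpy as np
import tensorflow as tf

def seed_everything(seed=42):
    # seed Python, NumPy and TensorFlow so that weight initialization,
    # shuffling and augmentation are repeatable across runs
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

seed_everything(42)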

Related

Using model.fit with different data generators

I'm using this code to train my CNN:
history = model.fit(
    x=train_gen1,
    epochs=epochs,
    validation_data=valid_gen,
    callbacks=noaug_callbacks,
    class_weight=class_weight
).history
where train_gen1 is built with ImageDataGenerator. But what if I want to use two different generators (call them train_gen1 and train_gen2) to feed the training phase? How should I change my code to do so?
This is how I generate the data:
aug_train_data_gen = ImageDataGenerator(rotation_range=0,
                                        height_shift_range=40,
                                        width_shift_range=40,
                                        zoom_range=0,
                                        horizontal_flip=True,
                                        vertical_flip=True,
                                        fill_mode='reflect',
                                        preprocessing_function=preprocess_input
                                        )
train_gen1 = aug_train_data_gen.flow_from_directory(directory=training_dir,
                                                    target_size=(96, 96),
                                                    color_mode='rgb',
                                                    classes=None,  # can be set to labels
                                                    class_mode='categorical',
                                                    batch_size=512,
                                                    shuffle=False  # set to False if you need to compare images
                                                    )
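The excerpt does not include an answer here, but one possible approach (a sketch, not from the original thread) is to wrap the two Keras iterators in a plain Python generator that alternates batches between them and pass that generator to model.fit with an explicit steps_per_epoch. train_gen2 below is assumed to be built like train_gen1 but from a second ImageDataGenerator configuration:

def interleave_generators(gen_a, gen_b):
    # yield one batch from each generator in turn, indefinitely
    while True:
        yield next(gen_a)
        yield next(gen_b)

combined_gen = interleave_generators(train_gen1, train_gen2)

history = model.fit(
    combined_gen,
    steps_per_epoch=len(train_gen1) + len(train_gen2),  # one pass over both sources per epoch
    epochs=epochs,
    validation_data=valid_gen,
    callbacks=noaug_callbacks,
    class_weight=class_weight
).history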

Splitting data into training, testing and validation sets when making a Keras model

I'm a little confused about splitting the dataset when I'm building and evaluating Keras machine learning models.
Let's say that I have a dataset of 1000 rows.
features = df.iloc[:,:-1]
results = df.iloc[:,-1]
Now I want to split this data into training and testing (33% of data for testing, 67% for training):
x_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
I have read on the internet that fitting the data into the model should look like this:
history = model.fit(features, results, validation_split = 0.2, epochs = 10, batch_size=50)
So I'm fitting the full data (features and results) to my model, and from that data I'm using 20% of data for validation: validation_split = 0.2.
So basically, my model will be trained with 80% of data, and tested on 20% of data.
The confusion starts when I need to evaluate the model:
score = model.evaluate(x_test, y_test, batch_size=50)
Is this correct?
I mean, why should I split the data into training and testing? Where do x_train and y_train go?
Can you please explain what the correct order of steps is for creating the model?
Generally, at training time (model.fit) you have two sets: a training set and a validation/tuning/development set. You train the model on the training set and use the validation set to find the best set of hyper-parameters. When you're done, you can then test your model on an unseen data set, one that was completely hidden from the model, unlike the training and validation sets.
Now, when you used
X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
With this, you split the features and results into 33% of the data for testing and 67% for training. Now you can do one of two things:

1. use X_test and y_test as the validation set in model.fit(...), or
2. use them for the final prediction in model.predict(...)
So, if you choose this test set as a validation set (number 1), you would do the following:
model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test), ...)
In the training log, you will get the validation results along with the training score. The validation results should be the same if you later compute model.evaluate(X_test, y_test).
Now, if you choose that test set as a final prediction or final evaluation set (number 2), then you need to create a separate validation set or use the validation_split argument as follows:
model.fit(x=X_train, y=y_train,
          validation_split=0.2, ...)
The Keras API will hold out 20% of the training data (X_train and y_train) and use it for validation. Lastly, for the final evaluation of your model, you can do the following:
y_pred = model.predict(X_test, batch_size=50)
Now you can compare y_test and y_pred with some relevant metrics.
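A minimal sketch of that comparison, assuming a classification model with a softmax output and integer class labels in y_test (if y_test is one-hot encoded, take its argmax first):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# model.predict returns one probability vector per row, so take the
# argmax to turn probabilities into predicted class indices
pred_classes = np.argmax(y_pred, axis=1)

print(accuracy_score(y_test, pred_classes))
print(confusion_matrix(y_test, pred_classes))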
Generally, you'd want to use your X_train, y_train data that you have split as arguments in the fit method. So it would look something like:
history = model.fit(X_train, y_train, batch_size=50)
Not splitting your data beforehand and instead passing the validation_split argument to fit works as well, but be careful to check the Keras documentation on the validation_data and validation_split arguments to make sure the data is split up as you expect.
There is a related question here:
https://datascience.stackexchange.com/questions/38955/how-does-the-validation-split-parameter-of-keras-fit-function-work
Keras documentation:
https://keras.rstudio.com/reference/fit.html
I have read on the internet that fitting the data into the model should look like this:
That means you need to fit features and labels. You already split them into x_train & y_train. So your fit should look like this:
history = model.fit(x_train, y_train, validation_split = 0.2, epochs = 10, batch_size=50)
The confusion starts when I need to evaluate the model:
score = model.evaluate(x_test, y_test, batch_size=50) --> Is this correct?
That's correct: you evaluate the model using the testing features and the corresponding labels. Furthermore, if you only want the predicted labels, for example, you can use:
y_hat = model.predict(X_test)
Then you can compare y_hat with y_test, e.g. compute a confusion matrix, etc.

Train many neural networks and pick best one

I'm working on a classification task, trying to reconstruct a network from a paper. In that paper, they do a train/test split 300 times and train the network each time; afterwards they take the mean of the predictions from all networks for a given input.
So here's the question: what is the best way to do that? I've already reconstructed their network and am thinking about using a for loop and saving the outputs of each network in a data frame, but I can't get it right.
Here's the code:
# Set X and Y for training
X = dum_bll_fsrq.drop(['type2', 'name', 'Type_is_bll', 'Type_is_fsrq'], axis=1)
Y = dum_bll_fsrq.iloc[:, -2:]

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)

# Create model
model_two_neuron = tf.keras.Sequential([
    tf.keras.layers.Dense(40, input_shape=(15,)),  # input shape required
    tf.keras.layers.Dense(2, activation=tf.nn.sigmoid)
])

model_two_neuron.compile(optimizer=tf.keras.optimizers.Adam(),
                         loss=tf.keras.losses.MeanSquaredError(),
                         metrics=[tf.keras.metrics.Precision()])

# Train
model_two_neuron.fit(X_train, y_train, epochs=20)
You can use callbacks to save the best weights for each of your models, then evaluate the best results saved by callbacks after training.
Here is a basic example from the Keras documentation:
model.compile(loss=..., optimizer=...,
              metrics=['accuracy'])

EPOCHS = 10
checkpoint_filepath = '/tmp/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

# Model weights are saved at the end of every epoch, if it's the best seen
# so far.
model.fit(epochs=EPOCHS, callbacks=[model_checkpoint_callback])

# The model weights (that are considered the best) are loaded into the model.
model.load_weights(checkpoint_filepath)
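If the goal is specifically the repeated split-and-train averaging described in the question, a minimal sketch of that loop might look like the following (this is not the paper's exact procedure; it assumes the X, Y and model-building code from the question, rebuilds the model inside the loop so every split trains from scratch, and uses the full X as the fixed evaluation set purely for illustration):

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

def build_model():
    # same two-neuron architecture as in the question
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(40, input_shape=(15,)),
        tf.keras.layers.Dense(2, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.MeanSquaredError(),
                  metrics=[tf.keras.metrics.Precision()])
    return model

N_RUNS = 300
all_preds = []

for run in range(N_RUNS):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)
    model = build_model()
    model.fit(X_train, y_train, epochs=20, verbose=0)
    # predict on a fixed set so predictions from different runs
    # can be averaged row by row
    all_preds.append(model.predict(X, verbose=0))

# mean prediction over all trained networks
mean_pred = np.mean(all_preds, axis=0)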

How to do both Data Augmentation and Cross Validation at the same time in NLP?

I have read somewhere that you should not use data augmentation on your validation set, and you should only use it on your training set.
My problem is this:
I have a dataset with a small number of training samples and I want to use data augmentation.
I split the dataset into training and test sets and use data augmentation on the training set. I then use StratifiedKFold on the training set, which returns a train index and a test index, but if I use X_train[test_index] as my validation set, it contains some augmented images, and I don't want that.
Is there any way to do data augmentation on the training set and still do cross-validation?
Here is my code (I haven't done the data augmentation yet, but I would love a way to keep the test_index separate from the augmented training samples):
kfold = StratifiedKFold(n_splits=5, shuffle=True)
i = 1
for train_index, test_index in kfold.split(X_train, y_train):
    dataset_train = tf.data.Dataset.from_tensor_slices((X_train[train_index],
                                                         y_train.iloc[train_index])).shuffle(len(X_train[train_index]))
    dataset_train = dataset_train.batch(512, drop_remainder=True).repeat()
    dataset_test = tf.data.Dataset.from_tensor_slices((X_train[test_index],
                                                       y_train.iloc[test_index])).shuffle(len(X_train[test_index]))
    dataset_test = dataset_test.batch(32, drop_remainder=True).take(steps_per_epoch).repeat()
    model_1 = deep_neural()
    print('-' * 120)
    print('\n')
    print(f'Training for fold {i} ...')
    print('Training on {} samples.........Validating on {} samples'.format(len(X_train[train_index]),
                                                                            len(X_train[test_index])))
    checkpoint = tf.keras.callbacks.ModelCheckpoint(get_model_name(i),
                                                    monitor='val_loss', verbose=1,
                                                    save_best_only=True, mode='min')
    history = model_1.fit(dataset_train, steps_per_epoch=len(X_train[train_index]) // BATCH_SIZE,
                          epochs=4, validation_data=dataset_test,
                          validation_steps=1, callbacks=[csv_logger, checkpoint])
    scores = model_1.evaluate(X_test, y_test, verbose=0)
    pred_classes = model_1.predict(X_test).argmax(1)
    f1score = f1_score(y_test, pred_classes, average='macro')
    print('\n')
    print(f'Score for fold {i}: {model_1.metrics_names[0]} of {scores[0]}; {model_1.metrics_names[1]} of {scores[1] * 100}; F1 Score of {f1score}%')
    print('\n')
    acc_per_fold.append(scores[1] * 100)
    loss_per_fold.append(scores[0])
    f1score_per_fold.append(f1score)
    tf.keras.backend.clear_session()
    gc.collect()
    del model_1
    i = i + 1
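The excerpt does not include an answer here, but a common pattern (a sketch, not from the original thread) is to split first and apply the augmentation only to the training fold inside the loop, so the validation fold is never augmented. augment_text below is a hypothetical stand-in for whatever NLP augmentation is used; the rest reuses the variables from the snippet above:

kfold = StratifiedKFold(n_splits=5, shuffle=True)
for train_index, test_index in kfold.split(X_train, y_train):
    # validation fold: original samples only, no augmentation
    X_val_fold, y_val_fold = X_train[test_index], y_train.iloc[test_index]

    # training fold: original samples plus their augmented copies
    X_tr_fold, y_tr_fold = [], []
    for x, y in zip(X_train[train_index], y_train.iloc[train_index]):
        X_tr_fold.append(x)
        y_tr_fold.append(y)
        for aug_x in augment_text(x):  # hypothetical augmentation function
            X_tr_fold.append(aug_x)
            y_tr_fold.append(y)

    # build the tf.data pipelines from these fold-specific arrays,
    # exactly as in the snippet above
    dataset_train = tf.data.Dataset.from_tensor_slices((X_tr_fold, y_tr_fold)).shuffle(len(X_tr_fold)).batch(512)
    dataset_test = tf.data.Dataset.from_tensor_slices((X_val_fold, y_val_fold)).batch(32)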

How to split test and train data in a dataset based on number of targets in each category

I have an imageFolder in PyTorch which holds my categorized data images. Each folder is the name of the category and in the folder are images of that category.
I've loaded the data and split it into train and test sets via a sampler with a random train_test_split. But the problem is that my class distribution is imbalanced: some classes have lots of images and some have fewer.
To solve this, I want to choose 20% of each class as my test data and use the rest as training data:
ds = ImageFolder(filePath, transform=transform)
batch_size = 64
validation_split = 0.2
indices = list(range(len(ds))) # indices of the dataset
# TODO: fix splitting
train_indices, test_indices = train_test_split(indices, test_size=0.2)
# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(test_indices)
train_loader = torch.utils.data.DataLoader(ds, batch_size=batch_size, sampler=train_sampler, num_workers=16)
test_loader = torch.utils.data.DataLoader(ds, batch_size=batch_size, sampler=test_sampler, num_workers=16)
Any idea of how I should fix it?
Use the stratify argument of train_test_split, as described in the docs. If your labels are in an array-like called y, do:
train_indices, test_indices = train_test_split(indices, test_size=0.2, stratify=y)
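For a torchvision ImageFolder, the per-sample class indices are exposed as ds.targets, so (as a small sketch building on the answer above) y can be taken directly from the dataset:

y = ds.targets  # list of class indices, one per image, in dataset order
train_indices, test_indices = train_test_split(indices, test_size=0.2, stratify=y)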
Try using StratifiedKFold or StratifiedShuffleSplit.
According to the docs:
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
In your case you can try:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(list(range(len(ds))), ds.targets):
    train = torch.utils.data.Subset(ds, train_index)
    test = torch.utils.data.Subset(ds, test_index)
    trainloader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False)
    testloader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False)
