How to balance data while using data generators with keras? [duplicate] - python

I am trying to use Keras to fit a CNN model that classifies two classes of data. My dataset is imbalanced and I want to balance it. I don't know whether I can use class_weight in model.fit_generator, and I wonder what happens if I pass class_weight="balanced" to model.fit_generator.
The main code:
def generate_arrays_for_training(indexPat, paths, start=0, end=100):
    while True:
        from_ = int(len(paths)/100*start)
        to_ = int(len(paths)/100*end)
        for i in range(from_, int(to_)):
            f = paths[i]
            x = np.load(PathSpectogramFolder+f)
            x = np.expand_dims(x, axis=0)
            if 'P' in f:
                y = np.repeat([[0,1]], x.shape[0], axis=0)
            else:
                y = np.repeat([[1,0]], x.shape[0], axis=0)
            yield (x, y)
history = model.fit_generator(generate_arrays_for_training(indexPat, filesPath, end=75),
                              validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
                              steps_per_epoch=int((len(filesPath)-int(len(filesPath)/100*25))),
                              validation_steps=int((len(filesPath)-int(len(filesPath)/100*75))),
                              verbose=2,
                              epochs=15, max_queue_size=2, shuffle=True, callbacks=[callback])

If you don't want to change your data creation process, you can use class_weight in your fit_generator call. You can set class_weight with a dictionary and tune it by observing the results. For instance, suppose class_weight is not used and you have 50 examples for class 0 and 100 examples for class 1. Then the loss function treats every sample equally, so the model is biased toward the majority class and the underrepresented class 0 becomes a problem. But when you set:
class_weight = {0:2 , 1:1}
the loss function now gives 2 times the weight to class 0. Misclassifying the underrepresented class is therefore penalized 2 times more heavily than before, so the model can handle the imbalanced data better.
Note that Keras does not accept class_weight='balanced' directly; that string is a scikit-learn convention, and in Keras class_weight must be a dictionary mapping class indices to weights. My suggestion is to create a dictionary like class_weight = {0:a1, 1:a2} and try different values for a1 and a2, so you can see the difference.
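For example, a minimal sketch (an assumption based on the question's file-naming scheme, where a 'P' in the filename marks the [0, 1] class) that counts the files per class and builds a balanced weight dictionary to pass to fit_generator:
from collections import Counter

# Count files per class using the same rule as the generator ('P' in the name => class 1)
counts = Counter(1 if 'P' in f else 0 for f in filesPath)
total = sum(counts.values())

# Same heuristic scikit-learn uses for 'balanced': total / (n_classes * count_per_class)
class_weight = {label: total / (2 * count) for label, count in counts.items()}

history = model.fit_generator(generate_arrays_for_training(indexPat, filesPath, end=75),
                              validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
                              steps_per_epoch=int(len(filesPath) - int(len(filesPath) / 100 * 25)),
                              validation_steps=int(len(filesPath) - int(len(filesPath) / 100 * 75)),
                              class_weight=class_weight,
                              verbose=2, epochs=15, max_queue_size=2, shuffle=True,
                              callbacks=[callback])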
Also, instead of using class_weight you can use undersampling methods for imbalanced data. Check bootstrapping methods for that purpose (see the sketch below).
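A minimal sketch of random undersampling at the level of the file list (again assuming the question's 'P' naming rule; balanced_paths would then replace filesPath in the calls above):
import random

# Split the file list by class using the same rule as the generator
pos_files = [f for f in filesPath if 'P' in f]
neg_files = [f for f in filesPath if 'P' not in f]

# Randomly undersample the larger class down to the size of the smaller one
n = min(len(pos_files), len(neg_files))
balanced_paths = random.sample(pos_files, n) + random.sample(neg_files, n)
random.shuffle(balanced_paths)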

Related

How to make predictions on new dataset with tensorflow's gradient tape

While I'm able to understand how to use model.fit(x_train, y_train), I can't figure out how to make predictions on new data using TensorFlow's gradient tape. My GitHub repository with runnable code (up to an error) can be found here. What currently works is that I get the trained model "network_output"; however, it appears that with gradient tape, argmax is being used on the model itself, whereas I'm used to model.fit() taking the test data as an input:
network_output = trained_network(input_images,input_number)
preds = np.argmax(network_output, axis=1)
Where "input_images" is an ndarray: (20,3,3,1) and "input_number" is an ndarray: (20,5).
Now I'm taking network_output as the trained model and would like to use it to predict similarly typed data of test_images, and test_number respectively.
I get the error 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'predict' here:
predicted_number = network_output.predict(test_images)
This is because I don't know how to use the tape to make predictions. However, once the prediction works, I would guess I can compare the resulting "predicted_number" against "test_number", as would usually be done when using the model.fit method.
acc = 0
for i in range(len(test_images)):
    if (predicted_number[i] == test_number[i]):
        acc += 1
print("Accuracy: ", acc / len(input_images) * 100, "%")
In order to obtain predictions, I usually iterate through batches manually like this:
predictions = []
for batch in range(num_batch):
    logits = trained_network(x_test[batch * batch_size: (batch + 1) * batch_size], training=False)
    # first obtain probabilities
    # (if the last layer of the network has no activation, otherwise skip the softmax here)
    prob = tf.nn.softmax(logits)
    # putting back together predictions for all batches
    predictions.extend(tf.argmax(input=prob, axis=1))
If you don't have a lot of data you can skip the loop. This is faster than using predict because you directly invoke the __call__ method of the model:
logits = trained_network(x_test, training=False)
prob = tf.nn.softmax(logits)
predictions = tf.argmax(input=prob, axis=1)
Finally, you could also use predict. In this case the batches are handled automatically. It is easier to use when you have lots of data since you don't have to create a loop to iterate through batches. The result is a numpy array of predictions. It can be used like this:
predictions = trained_network.predict(x_test) # you can set a batch_size if you want
What you're doing wrong is this part:
network_output = trained_network(input_images,input_number)
predicted_number = network_output.predict(test_images)
You have to call predict directly on your model trained_network.
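A short sketch of how the rest of the accuracy check could then look, assuming test_number holds one-hot labels (as its (20, 5) shape suggests):
import numpy as np

# predict handles batching automatically and returns a numpy array of class scores
scores = trained_network.predict(test_images)

# Convert scores and one-hot labels to class indices before comparing
predicted_number = np.argmax(scores, axis=1)
true_number = np.argmax(test_number, axis=1)

acc = np.mean(predicted_number == true_number)
print("Accuracy: ", acc * 100, "%")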

how to plot correctly loss curves for training and validation sets?

I want to plot loss curves for my training and validation sets the same way Keras does, but using scikit-learn. I have chosen the Concrete Compressive Strength dataset, which is a regression problem; the dataset is available at:
http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/
So, I have converted the data to CSV and the first version of my program is the following:
Model 1
df=pd.read_csv("Concrete_Data.csv")
train,validate,test=np.split(df.sample(frac=1),[int(.8*len(df)),int(.90*len(df))])
Xtrain=train.drop(["ConcreteCompStrength"],axis="columns")
ytrain=train["ConcreteCompStrength"]
Xval=validate.drop(["ConcreteCompStrength"],axis="columns")
yval=validate["ConcreteCompStrength"]
mlp=MLPRegressor(activation="relu",max_iter=5000,solver="adam",random_state=2)
mlp.fit(Xtrain,ytrain)
plt.plot(mlp.loss_curve_,label="train")
mlp.fit(Xval,yval) #doubt
plt.plot(mlp.loss_curve_,label="validation") #doubt
plt.legend()
The resulting graph is the following:
In this model, I doubt whether the marked part is correct, because as far as I know one should keep the validation or test set apart, so maybe the fit call there is wrong. The score I got is 0.95.
Model 2
In this model I try to use the validation score as follows:
df=pd.read_csv("Concrete_Data.csv")
train,validate,test=np.split(df.sample(frac=1),[int(.8*len(df)),int(.90*len(df))])
Xtrain=train.drop(["ConcreteCompStrength"],axis="columns")
ytrain=train["ConcreteCompStrength"]
Xval=validate.drop(["ConcreteCompStrength"],axis="columns")
yval=validate["ConcreteCompStrength"]
mlp=MLPRegressor(activation="relu",max_iter=5000,solver="adam",random_state=2,early_stopping=True)
mlp.fit(Xtrain,ytrain)
plt.plot(mlp.loss_curve_,label="train")
plt.plot(mlp.validation_scores_,label="validation") #line changed
plt.legend()
For this model I had to set early_stopping to True and plot validation_scores_, but the resulting graph is a little bit weird:
The score I get is 0.82, but I read that this occurs when the model finds it easier to predict the data in the validation set than in the train set. I believe that is because I am using validation_scores_, but I was not able to find any online reference about this particular attribute.
What would be the correct way to plot these loss curves for adjusting my hyperparameters in scikit-learn?
Update
I have programmed the model as advised, like this:
mlp = MLPRegressor(activation="relu", max_iter=1, solver="adam", random_state=2, early_stopping=True)
training_mse = []
validation_mse = []
epochs = 5000
for epoch in range(1, epochs):
    mlp.fit(X_train, Y_train)
    Y_pred = mlp.predict(X_train)
    curr_train_score = mean_squared_error(Y_train, Y_pred)  # training performances
    Y_pred = mlp.predict(X_valid)
    curr_valid_score = mean_squared_error(Y_valid, Y_pred)  # validation performances
    training_mse.append(curr_train_score)                   # list of training perf to plot
    validation_mse.append(curr_valid_score)                 # list of valid perf to plot
plt.plot(training_mse, label="train")
plt.plot(validation_mse, label="validation")
plt.legend()
but the plot obtained shows two flat lines:
It seems I am missing something here.
You shouldn't fit your model on the validation set. The validation set is usually used to decide what hyperparameters to use, not the parameters' values.
The standard way to do training is to divide your dataset into three parts
training
validation
test
For example, with a split of 80%, 10%, 10%.
Usually, you would select a neural network architecture (how many layers, how many nodes, which activation functions), train only on the training set, check the result on the validation set, and then on the test set.
I'll show a pseudo algorithm to make it clear:
for model in my_networks:           # hyperparameters selection
    model.fit(X_train, Y_train)     # parameters fitting
    model.predict(X_valid)          # no training, only check performances
Save the model performances on the validation set and pick the best model (the one with the best scores on the validation set), then check the results on the test set:
model.predict(X_test) # this will be the estimated performance of your model
If your dataset is big enough, you could also use something like cross-validation.
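For instance, a minimal sketch with scikit-learn's cross_val_score (reusing Xtrain and ytrain from the question; hidden_layer_sizes here is just a placeholder for one candidate architecture):
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# Score one candidate architecture with 5-fold cross-validation
mlp = MLPRegressor(hidden_layer_sizes=(50,), activation="relu", solver="adam",
                   max_iter=5000, random_state=2)
scores = cross_val_score(mlp, Xtrain, ytrain, cv=5, scoring="neg_mean_squared_error")
print("Mean CV MSE:", -scores.mean())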
Anyway, remember:
the parameters are the network weights
you fit the parameters with the training set
the hyperparameters are the ones that define the net architecture (layers, nodes, activation functions)
you select the best hyperparameters checking the result of your model on the validation set
after this selection (best parameters, best hyperparameters) you get the model performances testing the model on the test set
To obtain the same result as Keras, you should understand that when you call fit() on the model with default arguments, training stops after a fixed number of epochs (200 by default), after your defined number of epochs (5000 in your case), or earlier when you enable early_stopping.
max_iter: int, default=200
Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.
Check your model definition and arguments on the scikit-learn documentation page.
To obtain the same result as Keras, you could fix the number of training epochs per fit call (e.g. one epoch per call), check the result on the validation set, and then train again until you reach the desired number of epochs.
For example, something like this (if your model uses MSE):
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

epochs = 5000
mlp = MLPRegressor(activation="relu",
                   max_iter=1,
                   solver="adam",
                   random_state=2,
                   early_stopping=True)
training_mse = []
validation_mse = []
for epoch in range(epochs):
    mlp.fit(X_train, Y_train)
    Y_pred = mlp.predict(X_train)
    curr_train_score = mean_squared_error(Y_train, Y_pred)  # training performances
    Y_pred = mlp.predict(X_valid)
    curr_valid_score = mean_squared_error(Y_valid, Y_pred)  # validation performances
    training_mse.append(curr_train_score)                   # list of training perf to plot
    validation_mse.append(curr_valid_score)                 # list of valid perf to plot
I had the same problem: I obtained two flat lines when using the code as advised. I solved it by just adding warm_start=True to the MLPRegressor parameters, as explained in MLPRegressor - 1.17.9. More control with warm_start:
mlp = MLPRegressor(activation="relu", max_iter=1, solver="adam", random_state=2, early_stopping=True, warm_start=True)
The plots obtained are now correct:
Train and validation loss curves

Pytorch simulation fails to converge on convex loss function when not initialized with 0

My code works when the weights are initialized with 0. When I initialize them according to some seed, they fail to converge. This should be a bug, since the loss function is convex.
I filtered two labels from MNIST (0 and 1) and then trained a logistic regression model using PyTorch. Since I use only 200 training samples (and 784 parameters), the model should quickly converge to 100% accuracy on the training set. This is not the case when the weights are initialized from some seed.
I had some problem to share my code on stackoverflow, so here is a link to the code: https://drive.google.com/file/d/1ELe8TIWrXMiXgsB63B0Ss43GPr719rGc/view?usp=sharing
Your data are not rescaled and normalized. If you look at the images variable in your training loop, it's between 0 and 255, which is in all likelihood hurting your training process.
There are cleaner ways to subsample the dataset, but without modifying too much of your code, you can use this data loading definition:
import torchvision.transforms as transforms

# Load Dataset
preprocessing = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.1307,), (0.3081,))])

train_dataset = dsets.MNIST(root='./data', train=True, transform=preprocessing, download=True)

# Filter samples by label (to get binary classification) and by number of training samples
Binary_filter = torch.add(train_dataset.targets == 1, train_dataset.targets == 0)
train_dataset.data, train_dataset.targets = train_dataset.data[Binary_filter], train_dataset.targets[Binary_filter]

TrainSet_filter = torch.cat((torch.ones(num_of_training_samples),
                             torch.zeros(len(train_dataset.targets) - num_of_training_samples)), 0).bool()
train_dataset.data, train_dataset.targets = train_dataset.data[TrainSet_filter], train_dataset.targets[TrainSet_filter]

# Make Dataset Iterable
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
I have ~100% accuracy in about 5-10 epochs.
Your loss function (BCE) is convex only with respect to the outputs of the deep network, not with respect to the weights.
You definitely can't assume that any local minimum is also a global minimum.

Keras cross-validation overfitting: is my model carrying over information across different folds?

I would like to ensure that my code for running a cross-validation of a Keras model is correct. Currently I suspect that it is wrong, because the results appear to be over-fitting.
My code structure generally looks as follows:
def get_model():
    ....
    # code to create a Keras neural network model using the functional API

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(enumerate(kfold.split(X, y)))  # X is the train feature matrix, y the target

model = get_model()   # LINE A
model.compile(...)    # LINE B

for k in range(0, len(splits)):  # LINE C
    split = splits[k]
    X_split_train = ...  # slice X into corresponding training parts
    X_split_test = ...
    y_split_train = ...  # slice y into corresponding parts
    model.fit(X_split_train, y_split_train, ...)
    prediction_prob = model.predict(X_split_test)
    # ... code for evaluating the result for this fold
And I suspect my code is wrong. Specifically, lines A and B should be inside the loop at line C.
Reasons for my suspicion:
Looking at the training log generated across all epochs, there seems to be a continuation of model performance over the different folds. Say for the first fold the model obtains an accuracy of 75%; in the second fold, it starts reporting an accuracy of 75.x% and upwards.
The model seems to be overfitting, as it soon reports a training accuracy of 1.0.
For some rare classes that have only 1 instance in the dataset, the model in some cases even reported 100% F1 for those classes, which doesn't make sense.
All of this suggests that the model parameters and learned class distribution are carried forward between folds. The only way to fix this, I suppose, is to re-create the model in every fold. Is this correct?
Thanks
No, this code is not doing cross-validation correctly. For each fold you should train a new model from scratch; here you are reusing the model from the previous fold, which is incorrect.
I would do it like this:
for k in range(0, len(splits)):  # LINE C
    model = get_model()   # LINE A
    model.compile(...)    # LINE B
    split = splits[k]
    X_split_train = ...  # slice X into corresponding training parts
    X_split_test = ...
    y_split_train = ...  # slice y into corresponding parts
    model.fit(X_split_train, y_split_train, ...)
    prediction_prob = model.predict(X_split_test)
    del model

Python SkLearn Gradient Boost Classifier Sample_Weight Clarification

I am using Python's scikit-learn GradientBoostingClassifier. The setting I am using selects random samples (stochastic gradient boosting). I am using a sample_weight of 1 for one of the binary classes (outcome = 0) and 20 for the other class (outcome = 1). My question is how these weights are applied, in layman's terms.
Is it that at each iteration the model will select x rows from the sample for the 0 outcome and y rows for the 1 outcome, and then the sample_weight setting will kick in and keep all of x but oversample the y (outcome 1) rows by a factor of 20?
From the documentation it is not clear to me whether having sample_weight > 1 means oversampling. I understand that class_weight is different and does not change the data, only how the model interprets it via the loss function. Is it true that sample_weight, on the other hand, effectively changes the data fed into the model by oversampling?
Thanks
Sample weights are a multiplicative factor; here is the code:
https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/ensemble/gradient_boosting.py#L1225
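In other words, sample_weight does not duplicate rows; each sample's contribution to the loss (and to the gradients and residuals derived from it) is simply multiplied by its weight. A minimal illustration of the idea (not scikit-learn's actual implementation), using a weighted log loss:
import numpy as np

def weighted_log_loss(y_true, y_prob, sample_weight):
    # Each sample's loss term is multiplied by its weight; no rows are duplicated
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return np.sum(sample_weight * per_sample) / np.sum(sample_weight)

y_true = np.array([0, 0, 1])
y_prob = np.array([0.2, 0.4, 0.7])
weights = np.array([1, 1, 20])   # the class-1 sample counts 20 times as much

print(weighted_log_loss(y_true, y_prob, weights))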
