Why does keras model.fit with sample_weight have long initialization time?

I am using keras with a tensorflow (version 2.2.0) backend to train a classifier to distinguish between two datasets, A and B, which I have mixed into a pandas DataFrame object x_train (with two columns), and with labels in a numpy array y_train. I would like to perform sample weighting in order to account for the fact that A has far more samples than B. In addition, A is comprised of two datasets A1 and A2, with A1 much larger than A2; I would like to account for this fact as well using my sample weights. I have the sample weights in a numpy array called w_train. There are ~10 million training samples.
Here is example code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Four hidden layers of 64 units, each followed by light dropout
model = Sequential()
model.add(Dense(64, input_dim=x_train.shape[1], activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Pass the DataFrame itself; x_train.iloc on its own is only an indexer object
model.fit(x_train, y_train, sample_weight=w_train)
When I use the sample_weight argument in model.fit(), I find that the model fitting initialization (i.e. whatever happens before keras starts to display the training progress) takes forever, too long to wait for. The problem goes away when I limit the dataset to 1000 samples, but as I increase to 100000 or 1000000 samples I notice a significant difference in initialization and fitting time, so I suspect it has something to do with the way the data is being loaded. Nevertheless, it seems weird that merely adding the sample_weight argument would cause such a large timing difference.
Other information: I am running on CPU using a Jupyter notebook.
What is the problem here? Is there a way for me to modify the training setup or something else in order to speed up the initialization (or training) time?

The issue is caused by how TensorFlow validates certain types of input objects. When the data are known to be correct, such validation is purely wasted time (I hope it will be handled better in the future).
To force TensorFlow to skip these validation procedures, you can simply wrap the weights in a Pandas Series, as follows:
model.fit(x_train, y_train, sample_weight=pd.Series(w_train))
Do note that in your code you are using the metrics keyword. If you want the accuracy to actually be weighted by the provided weights, use the weighted_metrics argument instead.
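To illustrate the difference between metrics and weighted_metrics, here is a small sketch in plain Python of what a weighted accuracy computes, assuming it is the weight-normalized fraction of correct predictions. The labels, predictions, and weights are made-up values, not the asker's data, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Plain accuracy: every sample counts equally."""
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    return sum(correct) / len(correct)


def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy where each sample counts in proportion to its weight."""
    hit = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return hit / sum(weights)


# Made-up example: four majority-class samples, one minority-class sample.
# The classifier always predicts the majority class (0).
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
# Weights chosen so that both classes carry the same total weight (1.0 each)
weights = [0.25, 0.25, 0.25, 0.25, 1.0]

plain = accuracy(y_true, y_pred)
weighted = weighted_accuracy(y_true, y_pred, weights)
print(plain, weighted)
```

With these numbers the plain accuracy looks good while the weighted one exposes that the minority class is never predicted. In Keras, the weighted variant would be requested with model.compile(..., weighted_metrics=['accuracy']).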

Related

Learning with Batch Normalization vs without Batch Normalization

The primary benefit of Batch Normalization is supposed to be faster optimization. To confirm it, I created a model with and without batch normalization as shown below:
Without BN:
model=Sequential()
model.add(Dense(units=1,activation='linear',use_bias=False,kernel_initializer=init_1,input_shape=(28,28,1),trainable=False))
model.add(Flatten())
model.add(Dense(units=1024,kernel_initializer=init_2))
model.add(Activation(activation='relu'))
model.add(Dense(units=10,kernel_initializer=init_2))
model.add(Activation(activation='softmax'))
With BN:
model=Sequential()
model.add(Dense(units=1,activation='linear',use_bias=False,kernel_initializer=init_1,input_shape=(28,28,1),trainable=False))
model.add(Flatten())
model.add(Dense(units=1024,kernel_initializer=init_2))
model.add(BatchNormalization())
model.add(Activation(activation='relu'))
model.add(Dense(units=10,kernel_initializer=init_2))
model.add(Activation(activation='softmax'))
The model was then compiled and trained on augmented data as shown below;
opt=optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=opt,loss='categorical_crossentropy',metrics=['accuracy'])
model.fit_generator(iter,steps_per_epoch=np.ceil(len(X_train)/64),epochs=30)
To my surprise, I found that the model Without BN converges faster.
To further confirm the findings, I varied the number of units in the hidden layer and got the following results (this time using the fit() method rather than fit_generator(), and the original data rather than augmented data):
[Graph: No. of hidden units = 5]
[Graph: No. of hidden units = 50]
[Graph: No. of hidden units = 100]
From the graphs, we can observe that the model seems to converge faster without batch normalization and achieves higher accuracy than it does with batch normalization.
Please help me in finding out where I am doing it wrong. Thank you.

Neural net with duplicated inputs - Keras

I have a dataset of N videos. Each video is characterized by some metrics (which will be the inputs to a neural net), and my goal is to predict the score that a person will give when he or she watches the video.
The problem is that in my dataset each video was watched more than once by different subjects, so I was forced to duplicate the same metrics (inputs) as many times as the video was watched, in order to keep all the scores given by the subjects.
I built an MLP model to predict the scores, but when I calculate the RMSE it's always higher than 0.7.
I want to know whether having a dataset like that would affect the performance of my model, and how I can deal with it.
Here is what the dataset looks like:
The first 5 columns are the inputs and the last one is the score of subjects. Note that all of them are normalized.
Here is my Model:
import numpy
from keras.models import Sequential
from keras.layers import Dense

def mlp_model():
    # create model
    model = Sequential()
    model.add(Dense(100, input_dim=5, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

seed = 100
numpy.random.seed(seed)
myModel = mlp_model()
# plot_losses is a callback defined elsewhere in the asker's code
myModel.fit(x=x_train, y=y_train, batch_size=10, epochs=45, validation_split=0.3, shuffle=True, callbacks=[plot_losses])
predictions = myModel.predict(x_test)
print(predictions)

Your problem statement reveals an inherent flaw in the design. As you correctly pointed out, you have no way of knowing what the user does, how she has rated other videos, and how she will rate the current video.
It would be helpful to explain what your current input values are, and whether they could differ at all. For example, a metric like "time spent watching the video" might be different for different users.
On a larger scale, ask yourself whether you could determine the rating with a completely deterministic judgement, i.e. given the same input, would you always arrive at the same answer?
Since that is currently not the case, I would say that you should invest more time in finding a suitable approach to your problem, for example recommender systems, but that also requires you to use a lot of different input information.
Alternatively, you could try to find more input data, which specifically identifies the users, and allows you to make more suitable predictions; even then, it will be hard to base a reasonable prediction on such proxy metrics, since you might end up creating an unwanted bias in your preprocessing.
In any case, getting much better results with the current format of the input is very unlikely.
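The point about non-deterministic ratings can be made concrete. When identical inputs appear with different scores, the best a squared-error model can do for that input is to predict the mean of those scores, so the spread of scores within each group of duplicates puts a floor under the RMSE. The sketch below uses made-up rows (two features instead of five, hypothetical scores) to compute that floor.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical data: (features, score). Each video was rated by two subjects,
# so the same feature tuple appears twice with different scores.
rows = [
    ((0.1, 0.2), 1.0),
    ((0.1, 0.2), 3.0),
    ((0.5, 0.9), 2.0),
    ((0.5, 0.9), 4.0),
]

# Collect all scores given to each distinct input
groups = defaultdict(list)
for features, score in rows:
    groups[features].append(score)

# The squared-error-optimal prediction for an input is the mean of its scores
best_prediction = {f: sum(s) / len(s) for f, s in groups.items()}

# RMSE of that ideal predictor: no model fed only these inputs can beat it
squared_errors = [(score - best_prediction[f]) ** 2 for f, score in rows]
rmse_floor = sqrt(sum(squared_errors) / len(squared_errors))
print(best_prediction, rmse_floor)
```

If the floor computed this way on the real dataset is already close to the observed RMSE, changing the architecture is unlikely to help much; only additional inputs that distinguish the subjects could.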

Output the loss/cost function in keras

I am trying to find the cost function in Keras. I am running an LSTM with the loss function categorical_crossentropy and I added a regularizer. How do I output what the cost function looks like after adding my regularizer, for my own analysis?
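from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation
from keras import regularizers
from keras.optimizers import RMSprop

model = Sequential()
# Only this first LSTM layer carries the L2 kernel regularizer
model.add(LSTM(
    NUM_HIDDEN_UNITS,
    return_sequences=True,
    input_shape=(PHRASE_LEN, SYMBOL_DIM),
    kernel_regularizer=regularizers.l2(0.01)
))
model.add(Dropout(0.3))
model.add(LSTM(NUM_HIDDEN_UNITS, return_sequences=False))
model.add(Dropout(0.3))
model.add(Dense(SYMBOL_DIM))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=1e-03, rho=0.9, epsilon=1e-08))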

How do I output what the cost function looks like after adding my regularizer, for my own analysis?
Surely you can achieve this by obtaining the output (yourlayer.output) of the layer you want to see and print it (see here). However there are better ways to visualize these things.
Meet Tensorboard.
This is a powerful visualization tool that enables you to track and visualize your metrics, outputs, architecture, kernel_initializations, etc. The good news is that there is already a Tensorboard Keras Callback that you can use for this purpose; you just have to import it. To use it just pass an instance of the Callback to your fit method, something like this:
from keras.callbacks import TensorBoard
# indicate folder to save, plus other options
tensorboard = TensorBoard(log_dir='./logs/run1', histogram_freq=1,
                          write_graph=True, write_images=False)
# save it in your callback list
callbacks_list = [tensorboard]
# then pass to fit as callback, remember to use validation_data also
model.fit(X, Y, callbacks=callbacks_list, epochs=64,
          validation_data=(X_test, Y_test), shuffle=True)
After that, start your Tensorboard server (it runs locally on your PC) by executing:
tensorboard --logdir=logs/run1
For example, this is what my Kernels look like on two different models I tested (to compare them you have to save separate runs and then start Tensorboard on the parent directory instead). This is on the Histograms tab, on my second layer:
The model on the left I initialized with kernel_initializer='random_uniform', so its shape is that of a uniform distribution. The model on the right I initialized with kernel_initializer='normal', which is why it appears as a Gaussian distribution throughout my epochs (about 30).
This way you could visualize how your kernels and layers "look like", in a more interactive and understandable way than printing outputs. This is just one of the great features Tensorboard has, and it can help you develop your Deep Learning models faster and better.
Of course there are more options for the Tensorboard callback and for Tensorboard in general, so I do suggest you thoroughly read the links provided if you decide to attempt this. For more information you can check this and also this question.
Edit: So, you commented that you want to know how your regularized loss "looks" analytically. Recall that by adding a regularizer to a loss function we are extending the loss function to include some "penalty" or preference. So, if you are using cross-entropy as your loss function and adding an l2 regularizer (that is, the squared Euclidean norm) with a weight of 0.01, your whole loss function would look something like:
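A hedged reconstruction of that expression, not the answerer's original: Keras's regularizers.l2(0.01) adds 0.01 times the sum of squared entries of the regularized weights. Assuming C output classes, one-hot targets y, softmax outputs ŷ, and W denoting the kernel of the first LSTM layer (the only layer given a kernel_regularizer in the question), the per-sample loss would be:

```latex
L_{\text{total}} \;=\; \underbrace{-\sum_{c=1}^{C} y_c \log \hat{y}_c}_{\text{categorical cross-entropy}}
\;+\; \underbrace{0.01 \sum_{i,j} W_{ij}^{2}}_{\text{L2 penalty}}
```

The cross-entropy term is averaged over the samples in a batch, while the penalty term is added once per batch. The symbols C, y, ŷ, and W are notation introduced here for illustration.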

Keras: Training loss decreases (accuracy increases) while validation loss increases (accuracy decreases)

I am working on a very sparse dataset with the goal of predicting 6 classes.
I have tried working with a lot of models and architectures, but the problem remains the same.
When I start training, the training accuracy slowly starts to increase and the loss decreases, whereas the validation metrics do the exact opposite.
I have really tried to deal with overfitting, and I still cannot believe that this is what is causing the issue.
What I have tried:
Transfer learning on VGG16:
exclude top layer and add dense layer with 256 units and 6 units softmax output layer
finetune the top CNN block
finetune the top 3-4 CNN blocks
To deal with overfitting I use heavy augmentation in Keras and dropout after the 256 dense layer with p=0.5.
Creating own CNN with VGG16-ish architecture:
including batch normalization wherever possible
L2 regularization on each CNN+dense layer
Dropout from anywhere between 0.5-0.8 after each CNN+dense+pooling layer
Heavy "on the fly" data augmentation in Keras
Realising that perhaps I have too many free parameters:
decreasing the network to only contain 2 CNN blocks + dense + output.
dealing with overfitting in the same manner as above.
Without exception all training sessions are looking like this:
Training & Validation loss+accuracy
The last mentioned architecture looks like this:
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Activation, Dropout,
                          MaxPooling2D, Flatten, Dense)
from keras import regularizers

reg = 0.0001
model = Sequential()
# CNN block 1
model.add(Conv2D(8, (3, 3), input_shape=input_shape, padding='same',
                 kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
# CNN block 2
model.add(Conv2D(16, (3, 3), input_shape=input_shape, padding='same',
                 kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
# Dense head
model.add(Flatten())
model.add(Dense(16, kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(6))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
And the data is augmented by the generator in Keras and is loaded with flow_from_directory:
from keras.preprocessing.image import ImageDataGenerator

# Training data: rescaled and augmented on the fly
train_datagen = ImageDataGenerator(rotation_range=10,
                                   width_shift_range=0.05,
                                   height_shift_range=0.05,
                                   shear_range=0.05,
                                   zoom_range=0.05,
                                   rescale=1/255.,
                                   fill_mode='nearest',
                                   channel_shift_range=0.2*255)
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=True,
    class_mode='categorical')
# Validation data: rescaled only, no augmentation
validation_datagen = ImageDataGenerator(rescale=1/255.)
validation_generator = validation_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=1,
    shuffle=True,
    class_mode='categorical')

What I can think of by analyzing your metric outputs (from the link you provided):
It seems to me that around epoch 30 your model starts to overfit. You can therefore try stopping training at that point, or simply train for ~30 epochs (or the exact number). The Keras callbacks may be useful here, especially ModelCheckpoint, which saves the model during training so that you can stop when desired (Ctrl+C) or when a certain criterion is met. Here is an example of basic ModelCheckpoint use:
from keras.callbacks import ModelCheckpoint
# save_best_only=True would save only when the monitored metric improves
chk = ModelCheckpoint("myModel.h5", monitor='val_loss', save_best_only=False)
callbacks_list = [chk]
# pass the callback to fit
history = model.fit(X, Y, ... , callbacks=callbacks_list)
(Edit:) As suggested in comments, another option you have available is to use the EarlyStopping callback, where you can specify the minimum change tolerated and the 'patience' or epochs without such improvement before stopping the training. If using this, you have to pass it to the callbacks argument as explained before.
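The 'patience' rule described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras implementation; the function name and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, min_delta=0.0, patience=0):
    """Return the epoch index at which training would stop, or None.

    An epoch counts as an improvement only if the loss drops below the best
    value seen so far by more than min_delta. Training stops once `patience`
    consecutive epochs have passed without such an improvement.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stalls after epoch 2
val_losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.60]
stop_at = early_stopping_epoch(val_losses, min_delta=0.01, patience=3)
print(stop_at)
```

Note that the late improvement at the last epoch is never reached, which is the trade-off of a small patience. In Keras, the corresponding callback would be created with EarlyStopping(monitor='val_loss', min_delta=0.01, patience=3) and appended to callbacks_list.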
With your model's current setup (and with the modifications you have tried), that point in your training seems to be the optimal training time for your case; training further will bring no benefit to your model (in fact, it will make it generalize worse).
Given that you have tried several modifications, one thing you can do is increase your network depth, to give it more capacity. Try adding more layers, one at a time, and check for improvements. Also, you usually want to start with simpler models first, before attempting a multi-layer solution.
If a simple model doesn't work, add one layer and test again, repeating until you are satisfied or can go no further. And by simple I mean really simple: have you tried a non-convolutional approach? Although CNNs are great for images, they may be overkill here.
If nothing seems to work, maybe it is time to get more data, or to generate more data from what you have by sampling or other techniques. For that last suggestion, try checking this Keras blog, which I have found really useful. Deep learning algorithms usually require a substantial amount of training data, especially for complex inputs such as images, so be aware that this may not be an easy task. Hope this helps.

IMHO, this is just a normal situation for DL. In Keras you can set up a callback that will save the best model (depending on the evaluation metric that you provide), and a callback that will stop training if the model isn't improving.
See the ModelCheckpoint and EarlyStopping callbacks respectively.
P.S. Sorry, maybe I misunderstood the question: is your validation loss decreasing from the first step?

Validation loss is increasing. This means you need more data, or more regularization. Standard situation here, and nothing to be worried about. By the way, more parameters (bigger model) is just going to worsen this problem unless you fix it.
So you can now investigate profitably by introducing more examples, L2, L1, or dropout.

I faced a similar problem and managed to fix it by removing the Batch Normalisation layer that's just before the output dense layer. This made a ton of difference. Also one of the suggestions I was given is to remove the Dropout layer as it might be causing Shift Variance. Check this paper
I got part of the solution from this thread.

Accessing gradient values of keras model outputs with respect to inputs

I made a pretty simple NN model to do some non-linear regressions for me in Keras, as an introductory exercise. I uploaded my Jupyter notebook as a gist here (renders properly on github), which is pretty short and to the point.
It just fits the 1D function y = (x - 5)^2 / 25.
I know that Theano and Tensorflow are, at their core, graph-based derivative (gradient) passing frameworks, and that utilizing the gradients of loss functions with respect to weights for gradient step-based optimization is their main purpose.
But what I'm trying to get a sense of is whether I have access to something that, given a trained model, can approximate derivatives of the output layer with respect to the inputs for me (not the weights or loss function). So for this case, I would want y' = 2(x-5)/25.0 estimated via the network's derivative graph for an indicated value of the input x, in the network's currently trained state.
Do I have any options in either the Keras or Theano/TF backend APIs to do this, or do I need to do my own chain ruling somehow with the weights (or maybe adding my own non-trainable "identity" layers or something)? In my notebook, you can see me trying a few approaches based on what I was able to find so far, but without a ton of success.
To make it concrete, I have a working keras model with the structure:
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
# 1d input
model.add(Dense(64, input_dim=1, activation='relu'))
model.add(Activation("linear"))
model.add(Dense(32, activation='relu'))
model.add(Activation("linear"))
model.add(Dense(32, activation='relu'))
# 1d output
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam', metrics=["accuracy"])
model.fit(x, y,
          batch_size=10,
          epochs=25,
          verbose=0,
          validation_data=(x_test, y_test))
I would like to estimate the derivative of output y with respect to input x at, say, x = 0.5.
All of my attempts to extract gradient values based on searching for past answers have led to syntax errors. From a high level point of view, is this a supported feature of Keras, or is any solution going to be backend-specific?

As you mention, Theano and TF are symbolic, so doing a derivative should be quite easy:
import theano
import theano.tensor as T
import keras.backend as K
J = T.grad(model.output[0, 0], model.input)
jacobian = K.function([model.input, K.learning_phase()], [J])
First you compute the symbolic gradient (T.grad) of the output with respect to the input, then you build a function that you can call to perform the computation. Note that sometimes this is not that trivial due to shape problems, as you get one derivative for each element in the input.
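Whichever backend route is used, a backend-independent way to sanity-check the extracted gradient is a central finite difference. The sketch below is an assumption-laden illustration: it differentiates the analytic target y = (x - 5)^2 / 25 from the question in place of the trained network, and the helper names are hypothetical.

```python
def central_difference(f, x, h=1e-4):
    """Approximate df/dx at x with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def target(x):
    """The 1D function the network in the question was trained to fit."""
    return (x - 5.0) ** 2 / 25.0


x0 = 0.5
numeric = central_difference(target, x0)
analytic = 2.0 * (x0 - 5.0) / 25.0  # y' = 2(x - 5)/25 evaluated at x0
print(numeric, analytic)
```

To check a trained network instead, target could be replaced by a small wrapper around model.predict that takes and returns a plain float. A larger step h is then advisable, since the network computes in float32 and is only an approximation of the target.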
