I've implemented a basic one-hidden-layer NN to identify digits from part of the MNIST dataset (this is adapted from the Coursera ML course). It is implemented with plain matrix operations and conjugate-gradient (fmincg) optimization.
In addition, I've written both a Keras version and a vanilla TF version.
In the manual implementation there are 50 iterations (or epochs; since I'm using the whole batch, each epoch consists of a single full-batch update).
Now, what I've noticed is that the manual implementation takes a while to compute, but after 50 iterations gives me really good results.
Keras and TF, on the other hand, take less time to compute, but need about 500-1000 iterations on the whole batch to give me the same results. (Though 50 epochs with mini-batches of size 32 seem to work well too.)
So I wonder: how come the manual implementation converges better? And why does it take longer to compute even though it's the same number of iterations (50)?
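For reference, the full-batch Keras setup I'm comparing against looks roughly like this (a minimal sketch; the 400-25-10 layer sizes follow the Coursera exercise and the optimizer settings are placeholders, not my exact code):

import numpy as np
from tensorflow import keras

# stand-in data with the Coursera exercise's shapes: 20x20 digit images, 10 classes
X = np.random.rand(5000, 400).astype("float32")
y = np.random.randint(0, 10, size=5000)

model = keras.Sequential([
    keras.layers.Dense(25, activation="sigmoid", input_shape=(400,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# batch_size = the whole training set, so each epoch is exactly one update,
# matching the 50 full-batch fmincg iterations of the manual implementation
model.fit(X, y, epochs=50, batch_size=X.shape[0])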
Related
I'm using a relatively simple neural network with fully connected layers in keras. For some reason, the accuracy drastically increases basically to its final value after only one training epoch (likewise, the loss sharply decreases). I've tried architectures with larger and smaller numbers of hidden layers too. This network also performs poorly on the testing data, so I am trying to find a more optimal architecture or improve my training set accordingly.
It is trained on a set of 6500 1D array-like samples, and I'm using a batch size of 512.
As Murilo said, it's hard to say much without more information, but it can come from multiple things:
- Your network learns through the batches of each epoch, meaning that your ~12 batches (6500/512) are already enough to learn a good bit of the classification.
- Your weights are not well initialized and produce a huge loss for the first epoch. The massive decrease in the loss is actually the solver 'squishing' the weights. The best explanation I found for this comes from A. Karpathy in his 'makemore' tutorial: https://youtu.be/P6sfmUTpUmc?t=260
Now, the sudden decrease of the loss is not extreme here (from 0.5 to 0.2), so I would not worry much about it. I agree with Murilo that low validation accuracy can come from too few samples in your validation set, or from bad shuffling between the train and validation sets.
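To make the 'squishing' point concrete, a minimal sketch (the class count and initializer values are assumptions): at initialization a classifier that predicts roughly uniform probabilities should have a cross-entropy loss of about ln(C) for C classes; if the observed first-epoch loss is far above that, the output layer's weights are too large, and shrinking them at init removes most of that artificial early drop.

import numpy as np
from tensorflow import keras

num_classes = 10   # assumption: plug in your own class count
print("expected loss at init:", np.log(num_classes))   # ~2.30 for 10 classes

# Shrinking the last layer's initial weights makes the initial predictions
# close to uniform, so the first epochs measure real learning, not 'squishing'.
output_layer = keras.layers.Dense(
    num_classes,
    activation="softmax",
    kernel_initializer=keras.initializers.RandomNormal(stddev=0.01),
    bias_initializer="zeros",
)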
For those who don't want to read the whole story:
TL;DR: When using TF Estimator, do we have to scale the learning rate by the factor by which we increase the batch size (I know this is the right way; I am just not sure if TF handles it internally)? Similarly, do we have to scale the per-example loss by the global batch size (batch_size_per_replica * number of replicas)?
The documentation on TensorFlow distributed learning is confusing. I need clarification on the points below.
It is now understood that if you increase the batch size by a factor of k, you need to increase the learning rate by k as well (see this and this paper). However, the official TensorFlow page on distributed learning makes no clarifying comment about this. They do mention here that the learning rate needs to be adjusted. Do they handle the learning rate scaling themselves? To make matters more complicated, the behavior is different between Keras and tf.Estimator (see the next point). Any suggestions on whether I should increase the LR by a factor of k when using tf.Estimator?
It is widely accepted that the per-example loss should be scaled by global_batch_size = batch_size_per_replica * number of replicas. TensorFlow mentions this here, but when illustrating how to achieve it with a tf.Estimator, they either forget the scaling by global_batch_size or it is not required. See here; in the code snippet, the loss is defined as follows:
loss = tf.reduce_sum(loss) * (1. / BATCH_SIZE)
and BATCH_SIZE, to the best of my understanding, is defined above as the per-replica batch size.
To complicate things further, the scaling is handled automatically if you are using Keras (for reasons I will never understand; it would have been better to keep everything consistent).
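For concreteness, here is the loss scaling I would expect (a minimal sketch; the strategy, model, and constants are placeholders, not code from the linked guide):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

def compute_loss(labels, logits):
    per_example_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    # divide by the *global* batch size, not the per-replica one, so that the
    # per-replica losses sum to the same value a single machine would compute
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)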
The learning rate is not automatically scaled with the global batch size. As you said, they even suggest that you might need to adjust the learning rate, but only in some cases, so that's not the default. I suggest that you do increase the learning rate manually.
If we take a look at a simple tf.Estimator, tf.estimator.DNNClassifier (link), the default loss_reduction is losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE. If we go to that Reduction (found here), we see that it's a policy for how to combine losses from individual samples. On a single machine we would just use tf.reduce_mean, but you can't use that in a distributed setting (as mentioned in the next link). The Reduction docs lead us to here, which 1) shows an implementation of how you would do the scaling by the global batch size yourself and 2) explains why. Since they are telling you to implement this yourself, it implies it's not handled by tf.Estimator. Also note that the Reduction page explains the differences between Keras and Estimators regarding these parameters.
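Since the learning rate has to be adjusted by hand, a minimal sketch of the linear scaling rule the question refers to (the base values are placeholders to tune, not recommendations):

# Linear scaling rule: if the global batch size grows by a factor k,
# grow the learning rate by the same factor.
BASE_LR = 0.01            # learning rate tuned for the base batch size
BASE_BATCH_SIZE = 32      # batch size the base learning rate was tuned for
GLOBAL_BATCH_SIZE = 256   # batch_size_per_replica * number of replicas

k = GLOBAL_BATCH_SIZE / BASE_BATCH_SIZE
scaled_lr = BASE_LR * k   # 0.01 * 8 = 0.08

# pass scaled_lr to whatever optimizer the Estimator's model_fn builds,
# e.g. tf.compat.v1.train.AdamOptimizer(learning_rate=scaled_lr)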
I am trying to build a multi-class classification application, but my dataset has 300 classes. Is it possible to train a model on all of these classes with a normal PC?
Sure it is. You can even train ImageNet with 1000 categories or more, if you have enough time! ;)
You just have to think about which loss function you want (categorical cross-entropy, sparse categorical cross-entropy, or even binary cross-entropy if you want to penalize each output node independently); apart from that, there's not much difference between 10, 100, or 1000 classes. See the sketch at the end of this answer.
Of course you have to increase your model size to account for more classes, so RAM may be an issue, but then you can always decrease batch size. If you are using images and convnets and your model is still too large, try to downsample the images, use pooling layers or larger strides.
If your computer is too old and slow, you can also try Google Colab which offers free GPU and even TPU online!
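As a minimal sketch of the loss-function choice for 300 classes (the layer sizes, input shape, and batch size here are placeholders, not recommendations):

from tensorflow import keras

num_classes = 300

model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(1024,)),  # placeholder input size
    keras.layers.Dense(num_classes, activation="softmax"),            # one output node per class
])

# sparse categorical cross-entropy takes integer labels 0..299 directly,
# so no 300-wide one-hot targets are needed, which also saves memory
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# if RAM is tight, reduce the batch size here
# model.fit(X_train, y_train, batch_size=32, epochs=10)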
It is difficult to answer this question, as the training time of your model depends on a number of factors. It might be best to train your model for a certain number of hours and evaluate the performance. You could also fit a learning curve, which provides an estimation of how many data points you require to reach a certain performance; after that you can link the required number of data points to computation time.
Here is an article that provides an algorithm for fitting a learning curve: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307431/.
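A minimal sketch of fitting such a learning curve (the inverse-power-law form and the example numbers are assumptions, not taken from the linked article):

import numpy as np
from scipy.optimize import curve_fit

# accuracy measured on subsets of the training data (hypothetical numbers)
train_sizes = np.array([500, 1000, 2000, 4000, 8000])
accuracies  = np.array([0.62, 0.70, 0.76, 0.80, 0.83])

# inverse power law: accuracy approaches the plateau a as the sample size grows
def learning_curve(n, a, b, c):
    return a - b * n ** (-c)

params, _ = curve_fit(learning_curve, train_sizes, accuracies,
                      p0=[0.9, 1.0, 0.5], maxfev=10000)

# extrapolate the fitted curve to a larger training-set size
print("predicted accuracy at 20k samples:", learning_curve(20000, *params))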
Assuming we have 500k items worth of training data, does it matter if we train the model one item at a time or 'n' items at a time or all at once?
Consider inputTrainingData and outputTrainingData to be 2D arrays ([[]]) and train_step to be any generic TensorFlow training step.
Option 1: train one item at a time -

# one gradient update per training example
for i in range(len(inputTrainingData)):
    train_step.run(feed_dict={x: [inputTrainingData[i]], y: [outputTrainingData[i]], keep_prob: .60}, session=sess)

Option 2: train on everything at once -

# a single gradient update computed over the entire training set
train_step.run(feed_dict={x: inputTrainingData, y: outputTrainingData, keep_prob: .60}, session=sess)
Is there any difference between options 1 and 2 above as far as the quality of training is concerned?
Yes, there is a difference. Option 1 uses much less memory, but each update is also much less accurate. Option 2 could eat up all of your RAM, but each update should be more accurate. However, if you use your whole training set at once, be sure to limit the number of steps to avoid over-fitting.
Ideally, use data in batches (typically between 16 and 256).
Most optimization techniques are 'stochastic', i.e. they rely on a statistical sample of examples to estimate a model update.
To sum up:
- More data per update => more accurate gradient estimates (but more memory) => higher risk of over-fitting (so limit the number of training steps)
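For completeness, a minimal sketch of the in-between option (mini-batches), written in the same TF 1.x feed_dict style as the question; the batch size of 64 is just an example, and x, y, keep_prob, train_step, and sess are the names from the question's graph:

import numpy as np

# assumes inputTrainingData and outputTrainingData are numpy arrays
batch_size = 64
num_examples = len(inputTrainingData)

# one epoch of mini-batch training: shuffle, then one update per batch
perm = np.random.permutation(num_examples)
for start in range(0, num_examples, batch_size):
    idx = perm[start:start + batch_size]
    train_step.run(feed_dict={x: inputTrainingData[idx],
                              y: outputTrainingData[idx],
                              keep_prob: .60},
                   session=sess)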
There is a difference between these options. Normally you would train with a batch size, for example 128 examples per iteration.
You could also use a batch size of one, like in your first example.
The advantage of that method is that you can output the training progress of the neural network after every example.
If you train on all the data at once, it will be a little faster, but you will only know at the end whether the training worked well.
The best way is to pick a batch size and learn batch by batch, so you can monitor the progress after every batch and keep the training under control.
Mathematically these two methods are different. One is called stochastic gradient descent and the other is called batch gradient descent. You are missing the most commonly used one: mini-batch gradient descent. There has been a lot of research on this topic, but basically different batch sizes have different convergence properties. Generally people use batch sizes greater than one but not the full dataset; this is usually necessary since most datasets cannot fit into memory all at once. Also, if your model uses batch normalization, then a batch size of one won't converge. This paper discusses the effects of batch size (among other things) on performance. The takeaway is that larger batch sizes do not generalize as well (they actually argue it isn't the batch size itself but the fact that you make fewer updates when the batch is larger). I would recommend a batch size of 32 to start, and then experiment to see how the batch size affects performance.
Here is a graph of the effects of batch size on training and validation performance from the paper I linked.
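To make the mathematical difference explicit, a minimal sketch of the three update rules on a toy least-squares problem (the data, step size, and weights are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)

def grad(w, Xb, yb):
    # gradient of the mean squared error over the given (mini-)batch
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(X))
    # batch GD:      one update per epoch using all 500 examples
    # SGD:           500 updates per epoch, batch_size = 1
    # mini-batch GD: ~16 updates per epoch, batch_size = 32 (shown here)
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        w -= lr * grad(w, X[idx], y[idx])

print("estimated weights:", w)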
I'm trying to fit a regression model with an L1 penalty, but I'm having trouble finding an implementation in Python that fits in a reasonable amount of time. The data I've got is on the order of 100k by 500 (side note: several of the variables are pretty correlated), but running the sklearn Lasso implementation on this takes upwards of 12 hours to fit a single model (I'm not actually sure of the exact time; I've left it running overnight several times and it never finished).
I've been looking into Stochastic Gradient Descent as a way to get the job done faster. However, the SGDRegressor implementation in sklearn takes on the order of 8 hours to fit when I'm using 1e5 iterations. This seems like a relatively small amount (and the docs even suggest that the model often takes around 1e6 iters to converge).
I'm wondering if there's something that I'm being stupid about which is causing the fits to take a really long time. I've been told that SGD is often used for its efficiency (something around O(n_iter * n_samp * n_feat)), though so far I haven't seen much improvement over Lasso.
To speed things up, I have tried:
- Decreasing n_iter, but this often leads to a pretty bad solution because it hasn't converged yet.
- Increasing the step size (and decreasing n_iter), but this often makes the loss function explode.
- Changing the learning rate schedule (from inverse scaling to a rate based on the number of iterations), but this also didn't seem to make much of a difference.
Any suggestions for speeding this process up? It seems like partial_fit might be part of the answer, though the docs on this are somewhat sparse. I'd love to be able to fit these models without waiting for three days apiece.
partial_fit is not the answer. It will not speed anything up; if anything, it would make things slower.
The implementation is pretty efficient, and I am surprised that you say convergence is slow. You do way too many iterations, I think. Have you looked at how the objective decreases?
Often tuning the initial learning rate can give speedups. Your dataset really shouldn't be a problem. I'm not sure if SGDRegressor does that internally, but rescaling your target to unit variance might help.
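A minimal sketch of those suggestions in scikit-learn (the eta0, alpha, and iteration values are starting points to tune, not recommendations, and the data below is a hypothetical stand-in):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in for the real 100k x 500 data
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 500))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=10_000)

X_scaled = StandardScaler().fit_transform(X)
y_scaled = (y - y.mean()) / y.std()   # rescale the target to unit variance

sgd = SGDRegressor(
    penalty="l1",                 # L1 penalty, as in the Lasso
    alpha=1e-4,                   # regularization strength, to be tuned
    learning_rate="invscaling",
    eta0=0.01,                    # initial learning rate: tune this first
    max_iter=20,                  # passes over the data (n_iter in older scikit-learn)
    tol=1e-3,
)
sgd.fit(X_scaled, y_scaled)
print("non-zero coefficients:", np.sum(sgd.coef_ != 0))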
You could try Vowpal Wabbit, which is an even faster implementation, but it shouldn't be necessary.