The first tutorial describes how to build a TFF computation from a Keras model. The second tutorial describes how to build a custom TFF computation from scratch, possibly with a custom federated learning algorithm.
What I need is a combination of these: I want to build a custom federated learning algorithm while using an existing Keras model. How can it be done?
The second tutorial requires MODEL_TYPE, which is based on MODEL_SPEC, but I don't know how to obtain it. I can see some variables in model.trainable_variables (where model = tff.learning.from_keras_model(keras_model, ...)), but I doubt that's what I need.
Of course, I could implement the model by hand (as in the second tutorial), but I would like to avoid that.
I think you have the correct pointers for writing a custom federated computation, as well as converting a Keras model to a tff.learning.Model. So we'll focus on pulling a TFF type signature from an existing tff.learning.Model.
Once you have your hands on such a model, you should be able to use tff.learning.framework.weights_type_from_model to pull out the appropriate TFF type to use for your custom algorithm.
There is an interesting caveat here: how precisely you use a tff.learning.Model in your custom algorithm is pretty much up to you, and this could affect your desired model weights type. That said, this is unlikely to matter in practice (most likely you will simply be assigning values from incoming tensors to the model variables), so I'd prefer not to go deeper into this caveat.
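For concreteness, here is a minimal sketch of pulling the weights type out of a Keras-backed model (the toy architecture, input_spec, and loss below are placeholders you should match to your own data):

import collections
import tensorflow as tf
import tensorflow_federated as tff

def create_keras_model():
    # Toy architecture; substitute your own.
    return tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])

def model_fn():
    return tff.learning.from_keras_model(
        create_keras_model(),
        input_spec=collections.OrderedDict(
            x=tf.TensorSpec(shape=[None, 784], dtype=tf.float32),
            y=tf.TensorSpec(shape=[None, 1], dtype=tf.int32)),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

model_weights_type = tff.learning.framework.weights_type_from_model(model_fn)
# Use it, e.g., to declare the server-placed model type for your computation:
SERVER_MODEL_TYPE = tff.type_at_server(model_weights_type)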
Finally, a few pointers to end-to-end custom algorithm implementations in TFF:
One of the simplest complete examples in TFF is simple_fedavg, which is totally self-contained and comes with instructions for running it.
The code for a paper on Adaptive Federated Optimization contains a handwritten implementation of learning rate decay on the clients in TFF.
A similar implementation of adaptive learning rate decay (think Keras' functions to decay learning rate on plateaus) is right next door to the code for AFO.
Related
I know that it is possible, for example in TensorFlow but also in PyTorch or other frameworks, to store an instance of a trained (or in-training) model in a way that lets it be loaded later, loaded by another machine, or simply used as a checkpoint during training.
What I wonder is whether there is any way, analogous to the one above, to store the difference (maybe not exactly the algebraic subtraction, but a similar concept, always in terms of operations on tensors) between two instances of the same neural network (same architecture, different weights), for efficiency purposes.
If you are wondering why this would be convenient, consider a hypothetical setting where there are different entities that all know one model instance (a "shared model"); using the "difference" computed with respect to this shared model could be useful in terms of storage space or bandwidth (if the local model parameters have to be sent over the Internet to another machine).
The hypothesis is that it is possible to reconstruct a model knowing the shared model and its "difference" from the model to reconstruct.
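To make the idea concrete, here is a minimal TensorFlow sketch of what I mean (the toy model and the file name are just placeholders):

import numpy as np
import tensorflow as tf

def build_model():
    # Toy shared architecture; the two instances have different weights.
    return tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])

shared_model = build_model()
local_model = build_model()

# The "difference" (the update): element-wise subtraction of weight tensors.
update = [local_w - shared_w for local_w, shared_w
          in zip(local_model.get_weights(), shared_model.get_weights())]
np.savez('update.npz', *update)  # store or transmit only the update

# Reconstruction: shared weights + update = local weights.
loaded = np.load('update.npz')
reconstructed = build_model()
reconstructed.set_weights(
    [shared_w + loaded[key] for shared_w, key
     in zip(shared_model.get_weights(), loaded.files)])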
Summarizing my questions:
Are there any built-in features in TensorFlow, PyTorch, etc. to do this?
Would it be worthwhile, in your opinion, to do something like that? If not, why?
PS: In the literature this concept exists, and it has recently been explored under the "Federated Learning" topic; the "difference" I mentioned is called an update.
I am starting to learn TensorFlow 2.0, and one major source of my confusion is when to use the Keras-style model.compile vs. tf.GradientTape to train a model.
In the TensorFlow 2.0 tutorial for MNIST classification, they train two similar models: one with model.compile and the other with tf.GradientTape.
Apologies if this is trivial, but when do you use one over the other?
This is really a case-specific thing and it's difficult to give a definite answer here (it might border on "too opinion-based"). But in general, I would say:
The "classic" Keras interface (using compile, fit, etc.) allows for quick and easy building, training, and evaluation of standard models. However, it is very high-level/abstract and as such doesn't give you much low-level control. If you are implementing models with non-trivial control flow, this can be hard to accommodate.
GradientTape gives you full low-level control over all aspects of training/running your model, allowing easier debugging as well as more complex architectures etc., but you will need to write more boilerplate code for many things that a compiled model will hide from you (e.g. training loops). Still, if you do research in deep learning you will probably be working on this level most of the time.
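To make the contrast concrete, here is a minimal sketch of the same model trained both ways (a toy MNIST-style setup; the details are illustrative, not prescriptive):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10)])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# High-level: compile/fit hides the training loop entirely.
model = build_model()
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, batch_size=32)

# Low-level: the same training step spelled out with GradientTape.
model2 = build_model()
optimizer = tf.keras.optimizers.Adam()
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)
for x_batch, y_batch in dataset:
    with tf.GradientTape() as tape:
        logits = model2(x_batch, training=True)
        loss = loss_fn(y_batch, logits)
    grads = tape.gradient(loss, model2.trainable_variables)
    optimizer.apply_gradients(zip(grads, model2.trainable_variables))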
In the TensorFlow GitHub repository, in the file attention_wrapper.py, a hardmax operator is defined. In the docs it is mentioned as tf.contrib.seq2seq.hardmax.
I want to know the theoretical underpinning behind providing this hardmax operator. Prima facie Google searches for the past few weeks haven't led me to a concrete understanding of the concept.
If softmax is differentiable (soft), why would hardmax ever be used? If it can't be used in backpropagation (due to the non-differentiability required in gradient calculation), where else can it be used?
Reinforcement learning literature talks about soft vs. hard attention. However, I couldn't find concrete examples or explanations of where tf.contrib.seq2seq.hardmax can actually be used in an RL model.
By the looks of it, since it is mentioned in seq2seq, it should obviously have some application in natural language processing. But where exactly? There are tons of NLP tasks, and I couldn't find any task's SOTA algorithm that uses hardmax.
Hardmax is used when you have no choice but to make a decision non-probabilistically. For example, when you are using a model to generate a neural architecture, as in neural module networks, you have to make a discrete choice. To make this trainable (since, as you state, it would be non-differentiable), you can use REINFORCE (an algorithm in RL) to train via policy gradient and estimate the loss contribution via Monte Carlo sampling. Neural module networks are an NLP construct and depend on seq2seq. I'm sure there are many other examples, but this is one that immediately came to mind.
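For intuition, hardmax is just a one-hot vector at the argmax; here is a small sketch of the contrast (written against TF2, where tf.one_hot plus tf.argmax emulates what tf.contrib.seq2seq.hardmax did in TF1):

import tensorflow as tf

logits = tf.constant([[1.0, 3.0, 0.5]])

soft = tf.nn.softmax(logits)  # differentiable, roughly [[0.11, 0.82, 0.07]]

# Hardmax: a one-hot vector at the argmax, i.e. a discrete choice.
hard = tf.one_hot(tf.argmax(logits, axis=-1), depth=3)
# hard == [[0., 1., 0.]]; no gradient flows through argmax, which is why
# REINFORCE-style estimators are needed to train through such a choice.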
I'm studying different object detection algorithms out of interest.
My main reference is Andrej Karpathy's slides on object detection (slides here).
I would like to start from some reference implementation, in particular something that lets me directly test some of the networks mentioned there on my own data (mainly consisting of onboard camera footage from car and bike races).
Unfortunately, I have already used some pretrained networks (a repo forked from JunshengFu's, where I slightly adapted YOLO to my use case), but the classification accuracy is rather poor, I guess because there were not many training instances of racing cars like Formula 1 cars.
For this reason I would like to retrain the networks, and here is where I'm finding the most issues:
Properly training some of the networks requires either hardware (powerful GPUs) or time that I don't have, so I was wondering whether I could retrain just part of the network, in particular the classification part, and whether there is any repo that already allows that.
Thank you in advance
That is called fine-tuning of the network, or transfer learning. Basically you can do this with any network you find (with a similar problem domain, of course); then, depending on the amount of data you have, you will either fine-tune the whole network or freeze some layers and train only the last ones. In your case you would probably need to freeze the whole network except the last fully connected layers (which you will actually replace with new ones matching your number of classes), which perform the classification. I don't know which library you use, but TensorFlow has an official tutorial on transfer learning. However, it's not very clear, to be honest.
A more user-friendly tutorial by an enthusiast can be found here: tutorial. You can find a code repository there as well. One correction you need, though: the author fine-tunes the whole network, whereas if you want to freeze some layers you need to get the list of trainable variables, remove the ones you want to freeze, and pass the resulting list to the optimizer (so it ignores the removed vars), like the following:
import tensorflow as tf
import tensorflow.contrib.slim as slim

all_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='InceptionResnetV2')
to_train = all_vars[-6:]  # better to select these by name explicitly, but this still works
optimizer = tf.train.AdamOptimizer(learning_rate=0.0001)
# total_loss comes from the tutorial's model; only the vars in to_train get updated
train_op = slim.learning.create_train_op(total_loss, optimizer, variables_to_train=to_train)
Furthermore, TensorFlow has a so-called model zoo (a bunch of trained models you can use for your own purposes and for transfer learning). You can find it here.
I'm trying to write something similar to Google's Wide & Deep Learning, after running into difficulties doing multi-class classification (12 classes) with the sklearn API. I've tried to follow the advice in a couple of posts and used tf.group(logistic_regression_optimizer, deep_model_optimizer). It seems to work, but I'm trying to figure out how to get predictions out of this model. I'm hoping that with the tf.group operator the model learns to weight the logistic and deep models differently, but I don't know how to get these weights out so I can compute the right combination of the two models' predictions. Thanks in advance for any help.
https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/Cs0R75AGi8A
How to set layer-wise learning rate in Tensorflow?
tf.group() creates a node that forces a list of other nodes to run using control dependencies. It's really just a handy way to package up logic that says "run this set of nodes, and I don't care about their output". In the discussion you point to, it's just a convenient way to create a single train_op from a pair of training operators.
If you're interested in the value of a Tensor (e.g., weights), you should pass it to session.run() explicitly, either in the same call as the training step, or in a separate session.run() invocation. You can pass a list of values to session.run(), for example, your tf.group() expression, as well as a Tensor whose value you would like to compute.
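For example (a self-contained TF1 sketch; the variables and the two per-part train ops here are hypothetical stand-ins for the ones in your own graph):

import tensorflow as tf

# Hypothetical stand-ins for your wide/deep variables and their train ops:
wide_weights = tf.Variable([0.5], name='wide_weights')
deep_weights = tf.Variable([0.5], name='deep_weights')
loss = tf.reduce_sum(wide_weights) + tf.reduce_sum(deep_weights)
logistic_regression_optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss, var_list=[wide_weights])
deep_model_optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss, var_list=[deep_weights])

train_op = tf.group(logistic_regression_optimizer, deep_model_optimizer)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Run the combined training step and fetch the weights in one call:
    _, wide_w, deep_w = sess.run([train_op, wide_weights, deep_weights])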
Hope that helps!