In the TensorFlow GitHub repository, a hardmax operator is defined in the file attention_wrapper.py and documented as tf.contrib.seq2seq.hardmax.
I want to know the theoretical underpinning behind providing this hardmax operator. Prima facie Google searches over the past few weeks haven't led me to a concrete understanding of the concept.
If softmax is differentiable (soft), why would hardmax ever be used? If it can't be used in backpropagation (due to the non-differentiability that breaks gradient calculation), where else can it be used?
Reinforcement learning literature talks about soft vs. hard attention. However, I couldn't find concrete examples or explanations of where tf.contrib.seq2seq.hardmax can actually be used in an RL model.
Since it is part of seq2seq, it should obviously have some application in natural language processing. But exactly where? There are tonnes of NLP tasks, and I couldn't find a single state-of-the-art algorithm for any of them that uses hardmax.
Hardmax is used when you have no choice but to make a decision non-probabilistically. For example, when you use a model to generate a neural architecture, as in neural module networks, you have to make a discrete choice. To make this trainable (since, as you state, the choice is non-differentiable), you can use REINFORCE (an algorithm from RL) to train via policy gradient and estimate this loss contribution via Monte Carlo sampling. Neural module networks are an NLP construct and depend on seq2seq. I'm sure there are many examples, but this is one that immediately came to mind.
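To make the semantics concrete, here is a small sketch (TF 1.x, where tf.contrib still exists; the logits are just illustrative) contrasting softmax with hardmax. hardmax returns a one-hot vector at the argmax position, i.e. a hard, discrete choice:

```python
import tensorflow as tf

logits = tf.constant([[1.0, 3.0, 2.0]])

soft = tf.nn.softmax(logits)               # ~[0.09, 0.67, 0.24] -- differentiable everywhere
hard = tf.contrib.seq2seq.hardmax(logits)  # [0., 1., 0.]        -- a discrete, one-hot choice

# Functionally, hardmax(logits) behaves like a one-hot encoding of the argmax:
hard_manual = tf.one_hot(tf.argmax(logits, axis=-1),
                         depth=tf.shape(logits)[-1],
                         dtype=logits.dtype)

with tf.Session() as sess:
    print(sess.run([soft, hard, hard_manual]))
```

Because the one-hot output is piecewise constant, its gradient is zero almost everywhere, so you cannot backpropagate through the choice itself. Instead, you treat the selected index as an action and estimate the gradient with the score-function (REINFORCE) estimator, i.e. you optimize E[R] via E[R ∇_θ log π_θ(a)], typically with Monte Carlo samples as described above.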
I have a reinforcement learning problem where the optimal policy does not depend on the next state (i.e., gamma equals 0). I think this means that I only need an efficient exploration algorithm.
I know that contextual bandits are specialized for this situation, except that they only work for discrete action spaces, and I still need my policy network to make complex decisions (train a deep neural network, whereas most contextual bandit algorithms I have found learn linear policies).
Therefore I am looking for algorithms, or ideally a Python library, that solve RL for continuous action spaces when gamma = 0.
Many Thanks,
This tutorial describes how to build a TFF computation from a Keras model.
This tutorial describes how to build a custom TFF computation from scratch, possibly with a custom federated learning algorithm.
What I need is a combination of these: I want to build a custom federated learning algorithm, and I want to use an existing Keras model. How can this be done?
The second tutorial requires MODEL_TYPE, which is based on MODEL_SPEC, but I don't know how to get it. I can see some variables in model.trainable_variables (where model = tff.learning.from_keras_model(keras_model, ...)), but I doubt that is what I need.
Of course, I can implement the model by hand (as in the second tutorial), but I want to avoid it.
I think you have the correct pointers for writing a custom federated computation, as well as converting a Keras model to a tff.learning.Model. So we'll focus on pulling a TFF type signature from an existing tff.learning.Model.
Once you have your hands on such a model, you should be able to use tff.learning.framework.weights_type_from_model to pull out the appropriate TFF type to use for your custom algorithm.
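As a minimal sketch (assuming a TFF release where tff.learning.from_keras_model takes loss and input_spec, and where weights_type_from_model is exposed under tff.learning.framework; exact signatures and locations have moved between TFF versions, and the model and shapes below are purely illustrative):

```python
import collections
import tensorflow as tf
import tensorflow_federated as tff

def create_keras_model():
    # Any uncompiled Keras model will do.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(10, input_shape=(784,)),
        tf.keras.layers.Softmax(),
    ])

# Describes one batch of client data (features x, labels y).
input_spec = collections.OrderedDict(
    x=tf.TensorSpec(shape=[None, 784], dtype=tf.float32),
    y=tf.TensorSpec(shape=[None, 1], dtype=tf.int32),
)

def model_fn():
    # Wrap the Keras model as a tff.learning.Model.
    return tff.learning.from_keras_model(
        create_keras_model(),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        input_spec=input_spec,
    )

# The TFF type of the model weights, for use in the type signatures of your
# custom tff.tf_computation / tff.federated_computation building blocks.
model_weights_type = tff.learning.framework.weights_type_from_model(model_fn())
print(model_weights_type)
# -> something like <trainable=<float32[784,10],float32[10]>, non_trainable=<>>
```

The resulting type describes a tff.learning.ModelWeights structure (trainable and non-trainable tensors), which is exactly what you would pass around in a custom federated algorithm.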
There is an interesting caveat here: precisely how you use a tff.learning.Model in your custom algorithm is largely up to you, and this could affect the model weights type you want. In most cases it won't (you will likely just be assigning values from incoming tensors to the model variables), so I think we can avoid going deeper into this caveat.
Finally, a few pointers of end-to-end custom algorithm implementations in TFF:
One of the simplest complete examples TFF has is simple_fedavg, which is totally self-contained and contains instructions for running.
The code for a paper on Adaptive Federated Optimization contains a handwritten implementation of learning rate decay on the clients in TFF.
A similar implementation of adaptive learning rate decay (think Keras' ReduceLROnPlateau callback) is right next door to the code for AFO.
I am starting to learn TensorFlow 2.0, and one major source of my confusion is when to use the Keras-style model.compile versus tf.GradientTape to train a model.
In the TensorFlow 2.0 tutorial for MNIST classification, two similar models are trained: one with model.compile and the other with tf.GradientTape.
Apologies if this is trivial, but when do you use one over the other?
This is really a case-specific thing, and it's difficult to give a definite answer here (it might border on "too opinion-based"). But in general, I would say:
The "classic" Keras interface (using compile, fit, etc.) allows quick and easy building, training and evaluation of standard models. However, it is very high-level/abstract and as such doesn't give you much low-level control. If you are implementing models with non-trivial control flow, this can be hard to accommodate.
GradientTape gives you full low-level control over all aspects of training/running your model, allowing easier debugging as well as more complex architectures, but you will need to write more boilerplate code for many things that a compiled model hides from you (e.g. training loops). Still, if you do research in deep learning, you will probably work at this level most of the time.
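For concreteness, here is a minimal sketch of the same kind of MNIST classifier trained both ways (the architecture and hyperparameters are illustrative, not taken from the tutorial):

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
y_train = y_train.astype('int64')

def make_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])

# 1) High-level Keras interface: compile/fit hides the training loop from you.
model = make_model()
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=64, epochs=1)

# 2) tf.GradientTape: you write the loop yourself and control every step.
model = make_model()
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64)

for x_batch, y_batch in dataset:
    with tf.GradientTape() as tape:
        logits = model(x_batch, training=True)
        loss = loss_fn(y_batch, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```

Everything in the explicit loop (how gradients are computed, which variables are updated, extra losses, multiple optimizers, custom logging) is yours to change, which is exactly the control the compiled route hides from you.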
I am having problems implementing reinforcement learning (RL) algorithms in a pure tensor style in TensorFlow. The idea comes from DeepMind's IMPALA implementation, whose code is written in a pure tensor style without fetches and feeds. The authors use tf.py_func() to convert the emulator into TensorFlow ops; the worker ops then put data into a tf.queue, and the learner pulls training data from a tf.contrib.staging.StagingArea to compute the loss functions and run the optimizer. In the end, a single fetch of the optimizer op runs the whole algorithm.

I want to adapt this code to other RL algorithms, so as an exercise I am trying to train on gym environments with policy gradient (PG) and proximal policy optimization (PPO). However, I have had very bad results over the last few weeks and need someone's help.
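To illustrate the pattern I mean, here is a heavily simplified TF 1.x sketch (no queue or tf.contrib.staging.StagingArea, a single environment, a one-step policy-gradient update with no baseline or discounting; the tiny network and hyperparameters are only illustrative, and with CartPole's constant per-step reward this toy won't actually learn a useful policy). The point is only the wiring: because the gym step is wrapped in tf.py_func, a single session.run fetch both advances the environment and applies a training update.

```python
import gym
import numpy as np
import tensorflow as tf  # TF 1.x API (tf.compat.v1 under TF 2)

env = gym.make('CartPole-v0')
state = {'obs': env.reset().astype(np.float32)}

def _read_obs():
    return state['obs']

def _env_step(action):
    obs, reward, done, _ = env.step(int(action))
    if done:
        obs = env.reset()
    state['obs'] = obs.astype(np.float32)
    return np.float32(reward)

obs_t = tf.py_func(_read_obs, [], tf.float32, stateful=True)
obs_t.set_shape([4])                                   # CartPole observation size

logits = tf.layers.dense(obs_t[None, :], 2)            # tiny policy network
action = tf.squeeze(tf.multinomial(logits, 1))         # sampled action (int64 scalar)

reward_t = tf.py_func(_env_step, [action], tf.float32, stateful=True)

# REINFORCE-style surrogate: -log pi(a|s) * reward (one step, no baseline).
neg_logp = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=tf.expand_dims(action, 0), logits=logits)
loss = tf.reduce_mean(neg_logp * tf.stop_gradient(reward_t))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op)  # one fetch: environment step + gradient update
```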
My code to run the gym games is in my repo.
I am not sure whether I have implemented the PG algorithm correctly. The PG result for "CartPole-v0" eventually converges to a score of 200, but the PG result for "LunarLander-v2" looks problematic: the algorithm first learns something, but after a while the episodic return drops drastically and never recovers. My PPO does not learn anything at all (see pictures here).
My implementations are here: PG, and PPO.
Please help me find out what is going wrong in my code. The outcome of the pure-tensor code should match that of the conventional versions. As a reference, I also provide these RL algorithms implemented with conventional TensorFlow fetches and feeds: PG conventional and PPO conventional.
I have a dataset with 11k instances whose feature values are 0s, 1s and -1s. I heard that deep learning can be applied to feature values, so I applied it to my dataset, but surprisingly it resulted in lower accuracy (<50%) than traditional machine learning algorithms (RF, SVM, ELM). Is it appropriate to apply deep learning algorithms to feature values for a classification task? Any suggestion is greatly appreciated.
First of all, deep learning isn't a mythical hammer you can throw at every problem and expect better results. It requires careful analysis of your problem, choosing the right method, crafting your network and properly setting up your training, and only then, with a lot of luck, will you see significantly better results than with classical methods.
From what you describe (and without any more details about your implementation), it seems to me that several things could have gone wrong:
Your task may simply not be well suited to a neural network. Some tasks are still better solved with classical methods, since those manually account for patterns in your data or distill your advanced reasoning/knowledge into a prediction. You might not be directly aware of it, but sometimes neural networks are just overkill.
You don't describe how your 11,000 instances are distributed with respect to the target classes, how big the input is, what kind of preprocessing you perform for either method, etc. Maybe your data is simply processed incorrectly, your training is diverging due to an unfortunate parameter setup, or any number of other things.
To get a reasonable answer, you would have to share at least a bit of code for the implementation of your task, as well as the parameters you are using for training.