TensorFlow 2.0: create replica-local variable under MirroredStrategy

I'm implementing a Memory Transformer, and I need to keep the memory between calls to the model. I tried using tf.Variable for this, and it works perfectly on a single GPU.
But under MirroredStrategy on multiple GPUs this approach fails, because the mirrored strategy wants to synchronize variables written on multiple replicas. That is not what I need here: I need a separate set of memory variables on each 'tower', as in the TransformerXL implementation.
I think I could use with tf.device(): to create those variables, but I'm not sure how to get the current replica's device inside the build method.
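One possible approach, sketched below, avoids tf.device() altogether: create the variable with synchronization=ON_READ and aggregation=NONE, which under MirroredStrategy should give each replica its own copy that is never synchronized on writes. The MemoryLayer class and its update rule are made up for illustration; verify the behavior against your TF 2.x version.

import tensorflow as tf

class MemoryLayer(tf.keras.layers.Layer):
    def build(self, input_shape):
        # ON_READ + NONE: each replica keeps and updates its own copy;
        # writes are not synchronized across replicas.
        self.memory = self.add_weight(
            name="memory",
            shape=(input_shape[-1],),
            initializer="zeros",
            trainable=False,
            synchronization=tf.VariableSynchronization.ON_READ,
            aggregation=tf.VariableAggregation.NONE)

    def call(self, inputs):
        previous = self.memory.read_value()
        self.memory.assign(inputs[-1])  # hypothetical memory update
        return previous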

Related

Shared weights with model parallelism in PyTorch

Our setup involves an initial part of the network (the input interface) that runs on separate GPU cards. Each GPU gets its own portion of the data (model parallelism) and processes it separately.
Each input interface is, in turn, itself a complex nn.Module. Every input interface can occupy one or several cards (say, interface_1 runs on GPUs 0 and 1, interface_2 on GPUs 2 and 3, and so on).
We need to keep the weights of these input interfaces identical throughout training. We also need them to run in parallel to save training time, which is already weeks in our scenario.
The best idea we could come up with is initializing the interfaces with the same weights and then averaging the gradients for them. As the interfaces are identical, updating the same weights with the same gradients should keep them the same throughout the training process, thus achieving the desired “shared weights” mode.
However, I cannot find any good way to change the values of these weights and their gradients, represented as Parameter objects in PyTorch. Apparently, PyTorch does not allow this directly.
Our current state is: if we copy.deepcopy the parameter.data of the “master” interface and assign it to the parameter.data of the “slave” interface, the values are indeed changed, but .to(device_id) does not work and keeps them on the “master” device. However, we need them to move to the “slave” device.
Could someone please tell me whether this is possible at all or, if not, whether there is a better way to implement shared weights along with parallel execution for our scenario?
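A sketch of both pieces, assuming corresponding interfaces yield their parameters in the same order (the helper names are made up for illustration). The key detail is in-place copy_(), which writes into the existing tensor and therefore keeps each parameter on its own device, unlike assigning a deepcopy of .data:

import torch

@torch.no_grad()
def sync_weights(master, slave):
    # Copy master weights into the slave; copy_() preserves the
    # destination tensor's device.
    for p_master, p_slave in zip(master.parameters(), slave.parameters()):
        p_slave.copy_(p_master.to(p_slave.device))

@torch.no_grad()
def average_gradients(interfaces):
    # Average gradients of corresponding parameters across interfaces
    # and write the mean back to each one, on its own device.
    for params in zip(*(m.parameters() for m in interfaces)):
        device = params[0].grad.device
        mean = torch.stack([p.grad.to(device) for p in params]).mean(0)
        for p in params:
            p.grad.copy_(mean.to(p.grad.device))

Calling average_gradients(interfaces) after loss.backward() and before each optimizer.step() should keep identically initialized interfaces identical.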

Modify Tensorflow Code to place preprocessing on CPU and training on GPU

I am reading this performance guide on best practices for optimizing TensorFlow code for GPU. One suggestion they make is to place the preprocessing operations on the CPU so that the GPU is dedicated to training. I am trying to understand how one would actually implement this within an experiment (i.e., learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.
The article suggests placing with tf.device('/cpu:0') around the preprocessing operations. However, when I look at the custom estimator, the 'preprocessing' appears to be done in multiple steps:
Line 152/153: inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS) -- if I wrapped these two lines with tf.device('/cpu:0'), would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294: there are also a generate_input_fn and a parse_csv function that are used to set up input data queues. Would it be necessary to place with tf.device('/cpu:0') within these functions as well, or would that basically be forced by having inputs & label_values already wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the cpu, the GPU would be automatically used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow-up, I would like to understand how this can be further adapted in a distributed ML Engine experiment - would any of the recommendations above need to change if there were, say, 2 worker GPUs, 1 master CPU and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training, so each worker independently iterates through the data (and passes gradients asynchronously back to the PS), which suggests to me that no further modifications beyond the single-GPU setup above would be needed. However, this seems a bit too easy to be true.
MAIN QUESTION:
The two pieces of code you posted are actually two different parts of the training. Line 282/294 is, in my opinion, the so-called 'preprocessing' part, since it parses the raw input data into tensors. These operations are not suitable for GPU acceleration, so it is sufficient to place them on the CPU.
Line 152/153 is part of the training model, since it processes the raw features into different types of features.
'cpu:0' means the operations in this section will be placed on the CPU, but not bound to a specific core. Operations placed on the CPU run multi-threaded and can use multiple cores.
If the machine you run on has GPUs, TensorFlow will prefer to place operations on the GPUs whenever the device is not specified.
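For concreteness, here is a minimal sketch of that placement inside an input function, assuming parse_csv and the file list come from the census sample:

import tensorflow as tf

def generate_input_fn(filenames, batch_size=40):
    def _input_fn():
        # Pin reading and parsing to the CPU; with a GPU present,
        # unpinned model ops are placed on the GPU by default.
        with tf.device('/cpu:0'):
            dataset = tf.data.TextLineDataset(filenames)
            dataset = dataset.map(parse_csv)  # parse_csv as in the sample
            dataset = dataset.batch(batch_size)
            features, labels = dataset.make_one_shot_iterator().get_next()
        return features, labels
    return _input_fn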
The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.

how to use tensorflow saver with multiple models?

I'm having a lot of trouble understanding the proper use of tf.train.Saver.
I have a session where I create several distinct and separate network models. All models are trained and I save the best performing networks for later use.
However, when I try to restore a model at a later time I get an error which seems to indicate that some variables are either not getting saved or restored:
NotFoundError: Tensor name "Network_8/train/beta2_power" not found in checkpoint files networks/network_0.ckpt
For some reason, when I try to load the variables for Network_0, I'm being told I need variable information for Network_8.
What is the best way to make sure I only save/restore the correct variables from a multi-network session?
It seems part of my problem is that, while I have created a dict of the variables I want to save (weights and biases) for each network, setting up an optimizer such as AdamOptimizer makes TensorFlow automatically create extra variables that need to be initialized. This is fine if you use tf.train.Saver to save ALL variables and you only have one network; however, I am training multiple networks and only saving the best results. I'm not sure how to add the variables TF automatically creates to my dict for saving.
My solution is to create a partial saver that maps the same tensor names in both the original model and the new model (i.e., Network_0 and Network_8) and restores only the needed variables:
part_saver = tf.train.Saver({"W": w, "b": b, ...})
Initialize all the variables in Network_8 before restoring the partial model.
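An alternative sketch that also captures the optimizer's automatically created variables (the beta2_power in the error comes from AdamOptimizer) is to build each network inside its own variable scope and hand that scope's variable collection to a dedicated saver. The scope name and shapes below are made up to match the question:

import tensorflow as tf

with tf.variable_scope("Network_0"):
    w = tf.get_variable("W", shape=[10, 2])
    b = tf.get_variable("b", shape=[2])
    # ... build the rest of the network and its AdamOptimizer here,
    # so the optimizer's variables are created under this scope too.

# Collect every variable created under the scope, including the
# optimizer's extra variables such as beta2_power.
net0_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                              scope="Network_0")
saver_0 = tf.train.Saver(var_list=net0_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver_0.save(sess, "networks/network_0.ckpt")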

Theano's pkl_utils Dump Function not available in Theano 0.7?

I want to save a model that I have trained.
Since it uses shared variables (such as weights, biases, and so on), and since it should be readable on machines without Theano installed, I wanted to use the theano.misc.pkl_utils.dump() function.
However, it seems that this is only available in bleeding-edge installations (the current GitHub file looks different from my local one).
Is that really the case? And if so, why is it described in the docs?
I am using Theano 0.7.0, and I'm seriously confused about this.
If that feature is not yet available (I can't install the bleeding-edge version right now), what are other ways to do this? I'm sure I'm not the only one trying to save a trained model in the easiest way possible ;-)
Thank you a lot,
Roman
If you train your model using Theano, the parameters of the model will eventually be shared variables (probably a list of shared variables if the network consists of several layers). It is possible to pickle the list of shared variables and unpickle it later. However, you might have problems unpickling such variables on another machine, e.g. one with no Theano installation, or if you train on a GPU-capable machine that generates CudaNdarrays and then want to load the model back on a non-GPU-capable machine. What I recommend is the following: convert every shared variable in the list of parameters into a NumPy ndarray:
params_numpy = [numpy.asarray(p.get_value()) for p in params]
where params is a list of shared variables. Then you can safely pickle/unpickle params_numpy.
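A sketch of the full round trip, assuming params holds the model's shared variables in a fixed order:

import pickle
import numpy

params_numpy = [numpy.asarray(p.get_value()) for p in params]
with open('model_params.pkl', 'wb') as f:
    pickle.dump(params_numpy, f)  # plain ndarrays; no Theano needed to read

# Later, on any machine (Theano needed only for this restore step):
with open('model_params.pkl', 'rb') as f:
    values = pickle.load(f)
for p, value in zip(params, values):
    p.set_value(value)  # push the saved values back into the model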

Multiple networks in Theano

I'd like to have two separate networks running in Theano at the same time, where the first network trains on the results of the second. I could embed both networks in the same structure, but that would be a real mess in the entire forward pass (and probably wouldn't even work because of the shared variables etc.).
The problem is that when I define a Theano function I don't specify the model it's applied to, meaning that if I have a predict function and a train function, they'll both operate on the first model I define.
Is there a way to overcome that issue?
In a rather simple way, I've managed to find a nice solution. The trick was to create one model and define its functions, and then create the other model and define the second set of functions. Works like a charm.
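A minimal sketch of that sequencing (the toy linear 'models' are made up for illustration): each theano.function closes over the shared variables of the model it was built from, so compiling each model's functions right after building it keeps the two networks cleanly separated:

import numpy
import theano
import theano.tensor as T

x = T.matrix('x')

# Model 1: build it, then compile its function.
w1 = theano.shared(numpy.ones((3, 2)), name='w1')
predict_1 = theano.function([x], T.dot(x, w1))

# Model 2: built afterwards, with its own function.
w2 = theano.shared(numpy.zeros((3, 2)), name='w2')
predict_2 = theano.function([x], T.dot(x, w2))

# predict_1 and predict_2 each use only their own model's weights.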
