How to add L1-regularization to one hidden layer? - python

There have been some answers about adding L1-regularization to the weights of one hidden layer. However, what I want is not only sparse weights but also a sparse representation in that hidden layer. What I want is something like the code below. Is this feasible, or can I only add L1-regularization on the weights?
import tensorflow as tf
...
HIDDEN = tf.contrib.layers.dense(input_layer, n_nodes)
...
loss = meansq  # or other loss calculation
l1_regularizer = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
regularization_penalty = tf.contrib.layers.apply_regularization(l1_regularizer, HIDDEN)
regularized_loss = loss + regularization_penalty
This idea comes from the discussion of sparse representations in the book Deep Learning by Goodfellow, Bengio, and Courville.

If you are using tf.contrib.layers, the fully_connected function accepts a weights_regularizer argument, so your code should look like this:
l1 = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
hidden = tf.contrib.layers.fully_connected(inputs, n_nodes, weights_regularizer=l1)
That said, tf.contrib.layers has mostly been moved to the core API, so you should use tf.layers.dense with the kernel_regularizer argument instead.
The code above will regularize the weights in the layer. If you want to regularize both weights and the layer output, you can use the same tf.contrib.layers.l1_regularizer or create a different one with different parameters. Something like this should work for you:
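l1 = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
hidden = tf.contrib.layers.fully_connected(inputs, n_nodes, weights_regularizer=l1)
hidden_reg = l1(hidden)
# Both penalties only take effect once they are added to the loss
regularized_loss = loss + tf.losses.get_regularization_loss() + hidden_reg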
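To make concrete what `l1(hidden)` contributes, here is a minimal plain-Python sketch (no TensorFlow), assuming the regularizer returns the scale times the sum of absolute values of whatever it is applied to. The names `l1_penalty`, `hidden_activations`, and `base_loss` are hypothetical, and the numbers are made up for illustration.

```python
def l1_penalty(values, scale=0.005):
    """Return scale * sum(|v|) over a nested list of activations."""
    return scale * sum(abs(v) for row in values for v in row)


# Hypothetical hidden-layer output for a batch of two examples, three units each.
hidden_activations = [[0.5, -1.0, 0.0],
                      [2.0, 0.0, -0.5]]

base_loss = 0.25  # stand-in for the mean-squared-error term
activity_penalty = l1_penalty(hidden_activations)

# The penalty influences training only once it is part of the minimized loss.
regularized_loss = base_loss + activity_penalty
print(regularized_loss)
```

Because the penalty grows with every nonzero activation, minimizing the combined loss pushes individual hidden units toward exactly zero, which is the sparse representation the question asks for.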

Related

Keras: create submodel (from layer "m" to layer "n") of a "full" model without using loops

I have a Keras-model (let's call it full model), which was already trained and now I would like to create a new submodel using layers m to n of the full model.
E.g. full model has 10 layers and my submodel shall comprise layers 3 to 8
For the case that m=0, the task is trivial as one can use: (assume we want to go to layer 5)
full_model = ... # anything we load from a h5-file
submodel=tf.keras.Model(inputs=full_model.inputs, outputs=full_model.layers[5].output)
# =>
submodel.summary()
tf.keras.utils.plot_model(submodel, to_file = ...)
So, we can use the submodel, get its summary and also get the png-plot of the submodel-architecture.
The concrete problem is that I don't know how to do this when the submodel should not start at the first layer, for example when I want to take only the last layers of the model. I always get a GraphDisconnected error then.
The only workaround I found was to loop over the layers manually (as the function "create_submodel" below does). In my case I cannot use this, because the model is quite complex: the layers do not simply follow one another but are nested, so the architecture plot is not a straight series of layers but a "tree" with many different branches.
So: is there a way to create a submodel (from layer "m" to layer "n") of a "full" model without simple, naive looping through the layers (as demonstrated in the function below)?
Thanks very much!
def create_submodel(full_model, start_layer_number=None, end_layer_number=None):
    layers = tf.keras.layers
    if start_layer_number is None:
        start_layer_number = 0
    if end_layer_number is None:
        end_layer_number = len(full_model.layers)
    inp_shape = full_model.layers[start_layer_number].input.shape[1:]
    inp = layers.Input(shape=(inp_shape))
    x = inp
    for i in range(start_layer_number, end_layer_number):
        print(i, full_model.layers[i].name)
        x = full_model.layers[i](x)
    out = x
    sub_model = tf.keras.Model(inputs=inp, outputs=out)
    sub_model.summary()
    return sub_model

How does `tf.keras.layers.ActivityRegularization` work and how to use it correctly?

When training a deep neural network, how can tf.keras.layers.ActivityRegularization be used to regularize the output?
In my code, the output has very large values, so I tried to regularize it. For my last dense layer, I tried:
output = tf.layers.dense(inputs=dropout_dense1,
                         units=NUM_OUTPUTS,
                         kernel_initializer=tf.truncated_normal_initializer,
                         activity_regularizer=tf.keras.layers.ActivityRegularization())
But no regularization effect is observed in the output (it is still very large). I tried all kinds of parameter combinations (the default is tf.keras.layers.ActivityRegularization(l1=0, l2=0)), but none of them seems to have any effect.
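In your case, I think the proper method is the following. The activity_regularizer argument expects a regularizer (for example tf.keras.regularizers.L2), not the ActivityRegularization layer.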
(Tensorflow version >= 2)
output = tf.keras.layers.Dense(units=NUM_OUTPUTS,
                               kernel_initializer=tf.keras.initializers.TruncatedNormal(mean=0., stddev=1.),
                               activity_regularizer=tf.keras.regularizers.L2(0.01))(dropout_dense1)
You can change the method, for example from L2 to L1, or write a custom regularizer if you want to compute the penalty your own way. Please see the example here:
Tensorflow 2 Developing new regularizers
But if you want to use tf.keras.layers.ActivityRegularization, you can use it as follows:
output = tf.keras.layers.Dense(units=NUM_OUTPUTS)(dropout_dense1)
output_reg = tf.keras.layers.Activation('relu')(output)
# Define ActivityRegularization layer
reg_output = tf.keras.layers.ActivityRegularization(l1=0.001, l2=0.001)
# Apply the ActivityRegularization layer directly to the activations you want to regularize
output_reg = reg_output(output_reg)
other_layer = tf.keras.layers.Dense(units=NUM)(output_reg)
final_output = tf.keras.layers.Activation('relu')(other_layer)
model = tf.keras.Model(input, final_output)
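As a rough illustration of why the default parameters had no visible effect, here is a toy plain-Python stand-in for such a layer. It assumes the layer passes its input through unchanged and records a penalty of l1 * sum(|x|) + l2 * sum(x^2); the class `ActivityPenalty` is hypothetical and is not the Keras implementation (which may, for instance, also normalize by batch size).

```python
class ActivityPenalty:
    """Toy stand-in for an activity-regularization layer (hypothetical)."""

    def __init__(self, l1=0.0, l2=0.0):
        self.l1 = l1
        self.l2 = l2
        self.losses = []  # one penalty is recorded per call

    def __call__(self, activations):
        penalty = (self.l1 * sum(abs(a) for a in activations)
                   + self.l2 * sum(a * a for a in activations))
        self.losses.append(penalty)
        return activations  # the values themselves are passed through unchanged


default_layer = ActivityPenalty()                   # l1=0, l2=0
active_layer = ActivityPenalty(l1=0.001, l2=0.001)

outputs = [100.0, -50.0]  # made-up "very large" outputs
default_layer(outputs)
passed_through = active_layer(outputs)
print(default_layer.losses[0], active_layer.losses[0])
```

With both coefficients at zero the recorded penalty is zero, so nothing discourages large outputs. With nonzero coefficients the penalty is positive, but it shrinks the outputs only through training, because the layer never rescales the values it returns.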

TF Graph does not correspond to the code

I am trying to create a very simple neural network reading in information with the shape 1x2048 and to create a classification for two categories (object or not object). The graph structure however, deviates from what I believe to have coded. The dense layers should be included in the scope of "inner_layer" and should be receiving their input from the "input" placeholder. Instead, TF seems to be treating them as independent layers which do not receive any information from "input".
Also, when trying to use TensorBoard summaries, I get an error telling me that I have not fed inputs for the apparent placeholders of the dense layers. When omitting TensorBoard, everything works as I expected based on the code.
I have spent a lot of time trying to find the problem, but I think I must be overlooking something very basic.
The graph I get in TensorBoard is shown in this image.
Which corresponds to the following code:
tf.reset_default_graph()
keep_prob = 0.5
# Graph structure
## Placeholders for input
with tf.name_scope('input'):
    x_ = tf.placeholder(tf.float32, shape = [None, transfer_values_train.shape[1]], name = "input1")
    y_ = tf.placeholder(tf.float32, shape = [None, num_classes], name = "labels")
## Dense layer one with 2048 nodes
with tf.name_scope('inner_layers'):
    first_layer = tf.layers.dense(x_, units = 2048, activation=tf.nn.relu, name = "first_dense")
    dropout_layer = tf.nn.dropout(first_layer, keep_prob, name = "dropout_layer")
    # Readout layer, without softmax
    y_conv = tf.layers.dense(dropout_layer, units = 2, activation=tf.nn.relu, name = "second_dense")
# Evaluation and training
with tf.name_scope('cross_entropy'):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels = y_ , logits = y_conv),
                                   name = "cross_entropy_layer")
with tf.name_scope('trainer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
with tf.name_scope('accuracy'):
    prediction = tf.argmax(y_conv, axis = 1)
    correct_prediction = tf.equal(prediction, tf.argmax(y_, axis = 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Does anyone have an idea why the graph is so different from what you would expect based on the code?
The graph rendering in tensorboard may be a bit confusing (initially), but it's correct. Take a look at this picture where I've left only the inner_layers part of your graph:
You may notice that:
first_dense and second_dense are themselves name scopes (generated by the tf.layers.dense function; see also this question).
Their input/output tensors are inside the inner_layers scope and wire correctly to the dropout_layer. The corresponding linear ops of each dense layer live here: MatMul, BiasAdd, Relu.
Both scopes also include the variables (a kernel and a bias each), which are shown separately from inner_layers. They encapsulate the ops related specifically to the variables, such as read, assign, initialize, etc. The linear ops in first_dense depend on the variable ops of first_dense, and likewise for second_dense.
The reason for this separation is that in distributed settings the variables are managed by a different task called the parameter server. It usually runs on a different device (CPU as opposed to GPU), sometimes even on a different machine. In other words, in TensorFlow variable management is by design separate from matrix computation.
Having said that, I'd love to see a mode in tensorflow that would not split the scope into variables and ops and keep them coupled.
Other than this the graph perfectly matches the code.

Feeding parameters into placeholders in tensorflow

I'm trying to get into tensorflow, setting up a network and then feeding data to it. For some reason I end up with the error message ValueError: setting an array element with a sequence. I made a minimal example of what I'm trying to do:
import tensorflow as tf
K = 10
lchild = tf.placeholder(tf.float32, shape=(K))
rchild = tf.placeholder(tf.float32, shape=(K))
parent = tf.nn.tanh(tf.add(lchild, rchild))
input = [tf.Variable(tf.random_normal([K])),
         tf.Variable(tf.random_normal([K]))]
with tf.Session() as sess:
    print(sess.run([parent], feed_dict={lchild: input[0], rchild: input[1]}))
Basically, I'm setting up a network with place holders and a sequence of input embeddings that I want to learn, and then I try to run the network, feeding the input embeddings into it. From what I can tell by searching for the error message, there might be something wrong with my feed_dict, but I can't see any obvious mismatches in eg. dimensionality.
So, what did I miss, or how did I get this completely backwards?
EDIT: I've edited the above to clarify that the input represents embeddings that need to be learned. I guess the question can be asked more sharply as: Is it possible to use placeholders for parameters?
The inputs should be numpy arrays.
So, instead of tf.Variable(tf.random_normal([K])), simply write np.random.randn(K) and everything should work as expected.
EDIT (The question was clarified after my answer):
It is possible to use placeholders as parameters but in a slightly different way. For example:
lchild = tf.placeholder(tf.float32, shape=(K))
rchild = tf.placeholder(tf.float32, shape=(K))
parent = tf.nn.tanh(tf.add(lchild, rchild))
loss = <some loss that depends on the parent tensor or lchild/rchild>
# Compute gradients with respect to the input variables
grads = tf.gradients(loss, [lchild, rchild])
inputs = [np.random.randn(K), np.random.randn(K)]
for i in range(<number of iterations>):
    np_grads = sess.run(grads, feed_dict={lchild: inputs[0], rchild: inputs[1]})
    inputs[0] -= 0.1 * np_grads[0]
    inputs[1] -= 0.1 * np_grads[1]
It is not however the best or easiest way to do this. The main problem with it is that at every iteration you need to copy numpy arrays in and out of the session (which is running potentially on a different device like GPU).
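The arithmetic of that update loop can be sketched in plain Python, without a session or placeholders. This is only an illustration under stated assumptions: the loss sum(parent^2) is made up (the answer leaves the loss unspecified), the gradient is derived by hand for that particular loss, and the helper names `forward`, `loss_fn`, and `grads` are hypothetical.

```python
import math
import random

K = 10
random.seed(0)
# Two "embeddings" kept outside the model, like the fed numpy arrays.
inputs = [[random.gauss(0.0, 1.0) for _ in range(K)] for _ in range(2)]


def forward(lchild, rchild):
    """parent = tanh(lchild + rchild), element-wise."""
    return [math.tanh(l + r) for l, r in zip(lchild, rchild)]


def loss_fn(parent):
    """Assumed example loss: sum of squared parent activations."""
    return sum(p * p for p in parent)


def grads(lchild, rchild):
    """d loss / d lchild and d loss / d rchild (identical for this loss)."""
    parent = forward(lchild, rchild)
    g = [2.0 * p * (1.0 - p * p) for p in parent]
    return g, list(g)


initial_loss = loss_fn(forward(*inputs))
for _ in range(100):
    g_l, g_r = grads(*inputs)
    # Same update rule as above: step against the gradient with rate 0.1.
    inputs[0] = [x - 0.1 * g for x, g in zip(inputs[0], g_l)]
    inputs[1] = [x - 0.1 * g for x, g in zip(inputs[1], g_r)]
final_loss = loss_fn(forward(*inputs))
print(initial_loss, final_loss)
```

Every iteration here reads the parameters, computes gradients, and writes the parameters back by hand, which is exactly the bookkeeping that a variable plus an optimizer takes over.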
Placeholders generally are used to feed the data external to the model (like texts or images). The way to solve it using tensorflow utilities would be something like:
lchild = tf.Variable(tf.random_normal([K]))
rchild = tf.Variable(tf.random_normal([K]))
parent = tf.nn.tanh(tf.add(lchild, rchild))
loss = <some loss that depends on the parent tensor or lchild/rchild>
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
sess.run(tf.global_variables_initializer())  # variables must be initialized before training
for i in range(<number of iterations>):
    sess.run(train_op)
# Retrieve the weights back to numpy:
np_lchild = sess.run(lchild)

How to do Xavier initialization on TensorFlow

I'm porting my Caffe network over to TensorFlow but it doesn't seem to have xavier initialization. I'm using truncated_normal but this seems to be making it a lot harder to train.
Since version 0.8 there is a Xavier initializer, see here for the docs.
You can use something like this:
W = tf.get_variable("W", shape=[784, 256],
                    initializer=tf.contrib.layers.xavier_initializer())
Just to add another example on how to define a tf.Variable initialized using Xavier and Yoshua's method:
graph = tf.Graph()
with graph.as_default():
    ...
    initializer = tf.contrib.layers.xavier_initializer()
    w1 = tf.Variable(initializer(w1_shape))
    b1 = tf.Variable(initializer(b1_shape))
    ...
This prevented me from having nan values on my loss function due to numerical instabilities when using multiple layers with RELUs.
In TensorFlow 2.0 and later, tf.contrib has been removed and tf.get_variable() is only available through tf.compat.v1. To do Xavier initialization you now have to switch to:
init = tf.initializers.GlorotUniform()
var = tf.Variable(init(shape=shape))
# or a oneliner with a little confusing brackets
var = tf.Variable(tf.initializers.GlorotUniform()(shape=shape))
Glorot uniform and Xavier uniform are two different names of the same initialization type. If you want to know more about how to use initializations in TF2.0 with or without Keras refer to documentation.
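For intuition about what a Glorot/Xavier uniform initializer samples, here is a minimal plain-Python sketch. It assumes the usual definition, a uniform distribution on [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)); the function name `glorot_uniform` and its return format are hypothetical, not a TensorFlow API.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, seed=None):
    """Return a fan_in x fan_out nested list of weights and the sampling limit."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, limit


weights, limit = glorot_uniform(784, 256, seed=0)
print(len(weights), len(weights[0]), limit)
```

The limit shrinks as layers get wider, which keeps the variance of activations and gradients roughly constant from layer to layer.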
@Aleph7, Xavier/Glorot initialization depends on the number of incoming connections (fan_in), the number of outgoing connections (fan_out), and the kind of activation function (sigmoid or tanh) of the neuron. See this: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
So now, to your question. This is how I would do it in TensorFlow:
(fan_in, fan_out) = ...
low = -4*np.sqrt(6.0/(fan_in + fan_out))  # use 4 for sigmoid, 1 for tanh activation
high = 4*np.sqrt(6.0/(fan_in + fan_out))
W = tf.Variable(tf.random_uniform(shape, minval=low, maxval=high, dtype=tf.float32))
Note that we should be sampling from a uniform distribution, and not the normal distribution as suggested in the other answer.
Incidentally, I wrote a post yesterday for something different using TensorFlow that happens to also use Xavier initialization. If you're interested, there's also a python notebook with an end-to-end example: https://github.com/delip/blog-stuff/blob/master/tensorflow_ufp.ipynb
A nice wrapper around tensorflow called prettytensor gives an implementation in the source code (copied directly from here):
def xavier_init(n_inputs, n_outputs, uniform=True):
    """Set the parameter initialization using the method described.

    This method is designed to keep the scale of the gradients roughly the same
    in all layers.

    Xavier Glorot and Yoshua Bengio (2010):
        Understanding the difficulty of training deep feedforward neural
        networks. International conference on artificial intelligence and
        statistics.

    Args:
        n_inputs: The number of input nodes into each output.
        n_outputs: The number of output nodes for each input.
        uniform: If true use a uniform distribution, otherwise use a normal.

    Returns:
        An initializer.
    """
    if uniform:
        # 6 was used in the paper.
        init_range = math.sqrt(6.0 / (n_inputs + n_outputs))
        return tf.random_uniform_initializer(-init_range, init_range)
    else:
        # 3 gives us approximately the same limits as above since this repicks
        # values greater than 2 standard deviations from the mean.
        stddev = math.sqrt(3.0 / (n_inputs + n_outputs))
        return tf.truncated_normal_initializer(stddev=stddev)
TF-contrib has xavier_initializer. Here is an example how to use it:
import tensorflow as tf
a = tf.get_variable("a", shape=[4, 4], initializer=tf.contrib.layers.xavier_initializer())
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(a))
In addition to this, tensorflow has other initializers:
xavier_initializer_conv2d
variance_scaling_initializer
constant_initializer
zeros_initializer
ones_initializer
uniform_unit_scaling_initializer
truncated_normal_initializer
random_uniform_initializer
random_normal_initializer
orthogonal_initializer
as well as a lot of initializers from keras
I looked and I couldn't find anything built in. However, according to this:
http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
Xavier initialization is just sampling a (usually Gaussian) distribution where the variance is a function of the number of neurons. tf.random_normal can do that for you; you just need to compute the stddev from the number of neurons represented by the weight matrix you're trying to initialize.
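Computing that stddev could look like the following plain-Python sketch. It assumes the commonly used Glorot variance 2 / (fan_in + fan_out), which is one convention among several; the function name `glorot_normal` is hypothetical and does not reproduce any particular library's behavior (some implementations truncate the distribution, for example).

```python
import math
import random


def glorot_normal(fan_in, fan_out, seed=None):
    """Sample a fan_in x fan_out weight matrix from N(0, stddev^2)."""
    rng = random.Random(seed)
    stddev = math.sqrt(2.0 / (fan_in + fan_out))
    weights = [[rng.gauss(0.0, stddev) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, stddev


weights_n, stddev = glorot_normal(300, 200, seed=0)

# Empirical check: the sample standard deviation should be close to stddev.
flat = [w for row in weights_n for w in row]
mean = sum(flat) / len(flat)
sample_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print(stddev, sample_std)
```

The resulting stddev is the value you would pass to tf.random_normal for a weight matrix of that shape.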
You can also pass the initializer via the kernel_initializer parameter of tf.layers.conv2d, tf.layers.conv2d_transpose, tf.layers.Dense, etc., e.g.:
layer = tf.layers.conv2d(
    input, 128, 5, strides=2, padding='SAME',
    kernel_initializer=tf.contrib.layers.xavier_initializer())
https://www.tensorflow.org/api_docs/python/tf/layers/conv2d
https://www.tensorflow.org/api_docs/python/tf/layers/conv2d_transpose
https://www.tensorflow.org/api_docs/python/tf/layers/Dense
Just in case you want to use one line as you do with:
W = tf.Variable(tf.truncated_normal((n_prev, n), stddev=0.1))
You can do:
W = tf.Variable(tf.contrib.layers.xavier_initializer()((n_prev, n)))
Tensorflow 1:
W1 = tf.get_variable("W1", [25, 12288],
                     initializer=tf.contrib.layers.xavier_initializer(seed=1))
Tensorflow 2:
W1 = tf.Variable(tf.keras.initializers.GlorotUniform(seed=1)(shape=[25, 12288]),
                 name="W1")
