I am trying to compute multi-head attention between a sparse matrix and a dense matrix. I understand that, by default, the Keras MultiHeadAttention API requires two dense matrices and returns the attention output after the softmax operation over the queries, keys and values, as in the Vaswani et al. paper "Attention Is All You Need".
However, I have a use-case where I have a sparse and dense matrix, and I want to pass them to a MultiHead Attention layer as a Query and a Value respectively.
By default, there is no support for this, and converting to dense and back is not an option because the time complexity grows too much.
Is there any way to override the internal operations that are not compatible with sparse-dense combinations, and maybe replace them with mixed APIs such as sparse_dense_matmul for the attention computation? However, the documentation states that sparse_dense_matmul requires rank-2 matrices, which is why overriding the class directly does not seem feasible to me, unless I write my own sparse-dense computation block. Note: the rank for matmul is usually 3 in a transformer, as the shapes are in the format (Batch Size, Sequence Length, Dim).
To give an example:
from tensorflow.keras import layers
att = layers.MultiHeadAttention(num_heads=num_heads,
                                key_dim=embed_dim)
attn_output = att(query=inputs1, value=inputs2) # I would like to pass this query as sparse, this value as dense.
I appreciate any help.
You can take a look at the official repositories that publish implementations of sparse attention, such as the Sparse Transformer.
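One point that may help with the question above: inside an attention layer the query is first multiplied by a dense projection matrix, and that product is generally dense, so the first projection is the natural place to exploit a sparse query. Below is a minimal single-head sketch in plain NumPy (no bias terms, no multiple heads). The function names, the COO tuple layout, and the Python loop over the batch (used to stay within rank-2 sparse-dense products) are assumptions of this sketch, not part of the Keras API.

```python
import numpy as np

def sparse_dense_matmul(rows, cols, vals, n_rows, dense):
    """Rank-2 product of a COO sparse matrix with a dense matrix."""
    out = np.zeros((n_rows, dense.shape[1]))
    # Each nonzero (r, c, v) adds v * dense[c] to output row r.
    np.add.at(out, rows, vals[:, None] * dense[cols])
    return out

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sparse_query_attention(sparse_queries, values, w_q, w_k, w_v):
    """Single-head attention with one sparse query matrix per batch element.

    sparse_queries: list of (rows, cols, vals, n_rows), one tuple per example,
                    all with the same n_rows so the results can be stacked.
    values:         dense array of shape (batch, seq_len_v, dim).
    """
    outputs = []
    for (rows, cols, vals, n_rows), value in zip(sparse_queries, values):
        # Only this projection touches the sparse data; q is dense afterwards.
        q = sparse_dense_matmul(rows, cols, vals, n_rows, w_q)
        k = value @ w_k  # the key defaults to the value, as in the call above
        v = value @ w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v)
    return np.stack(outputs)
```

The result has shape (batch, seq_len_q, projection_dim). Whether this beats densifying the query depends on how sparse it really is, so it is worth measuring both.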
I'm implementing a text classifier with a CNN similar to Kim 2014 with Tensorflow. Tensorflow provides tf.nn.embedding_lookup_sparse, which allows you to provide the word IDs as a sparse tensor. This is nice, especially for enabling variable length sequences. However, this function requires a "combination" step after the lookup, such as "mean" or "sum". This coerces it back to the dense tensor space. I don't want to do any combination. I want to keep my vectors in the sparse representation, so I can do other convolutions afterwards. Is this possible in TF?
EDIT: I want to avoid padding the input prior to the embedding lookup. This is because Tensorflow's embedding lookup generates vectors for the pad value, and it's a kludge trying to mask it with zeros (see here).
I think there are two points of confusion in the question. Firstly, the combiner operation happens across the set of embedding IDs for each row of the sparse indices input sp_ids. So if sp_ids has a shape of N x 1, then you are "combining" just one embedding vector per row of sp_ids, which will just retrieve that embedding vector (which I think is what you are saying you want).
Secondly though, the return value is the embedding vector for each row of input. The embedding vector itself is a dense vector, by very definition of what the embedding is and what the TensorFlow embedding operations calculate. So this return result will always be dense, and that's what you want. A sparse matrix representation would be horribly inefficient, since the matrix truly will be dense (full of dense embeddings), regardless of whether any 'combiner' operation happens or not.
The research paper you linked does not seem to be doing any type of special methodology that would result in a special case of a sparse embedding vector, so I don't see a reason here for expecting or desiring sparse outputs.
Maybe I am incorrect, can you provide more details about why you expect the embedding vectors themselves to be sparse vectors? That would be a highly unusual situation if so.
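To make the first point concrete, here is a small NumPy imitation of a combined lookup. It is only a sketch of the behaviour described above, not TensorFlow's implementation; the function name and the list-of-lists input are assumptions made for illustration.

```python
import numpy as np

def embedding_lookup_combined(table, ids_per_row, combiner="mean"):
    """Look up the ids of each row and combine them into one dense vector per row."""
    reduce = np.mean if combiner == "mean" else np.sum
    return np.stack([reduce(table[ids], axis=0) for ids in ids_per_row])

table = np.arange(12, dtype=float).reshape(4, 3)  # 4 embeddings of size 3

# Several ids per row: the combiner really merges vectors.
combined = embedding_lookup_combined(table, [[0, 2], [1, 3]])

# One id per row (the N x 1 case): the combiner has nothing to merge,
# so each output row is simply the embedding vector itself.
single = embedding_lookup_combined(table, [[2], [0], [3]])
```

In both cases the output is a dense array with one row per input row.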
I am wondering if there is a way in TensorFlow, PyTorch or some other library to selectively connect neurons. I want to make a network with a very large number of neurons in each layer, but that has very few connections between layers.
Note that I do not think this is a duplicate of this answer: Selectively zero weights in TensorFlow?. I implemented a custom keras layer using essentially the same method that appears in that question - essentially by creating a dense layer where all but the specified weights are ignored in training and evaluation. This fulfills part of what I want to do by not training specified weights, and not using them for prediction. But the problem is that I still waste memory saving the untrained weights, and I waste time calculating the gradients of the zeroed weights. What I would like is for the computation of the gradient matrices to involve only sparse matrices, so that I do not waste time and memory.
Is there a way to selectively create and train weights without wasting memory? If my question is unclear or there is more information that it would be helpful for me to provide, please let me know. I would like to be helpful as a question-asker.
The usual, simple solution is to initialize your weight matrices to have zeros where there should be no connection. You store a mask of the location of these zeros, and set the weights at these positions to zero after each weight update. You need to do this as the gradient for zero weights may be nonzero, and this would introduce nonzero weights (i.e. connections) where you don't want any.
Pseudocode:
# setup network
weights = sparse_init() # only nonzero for existing connections
zero_mask = where(weights == 0)
# train
for e in range(num_epochs):
    train_operation() # may lead to introduction of new connections
    weights[zero_mask] = 0 # so we set them to zero again
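A runnable version of the pseudocode above could look like the following NumPy sketch. The layer size, the toy regression data and the plain gradient-descent update are assumptions made only to have something to train.

```python
import numpy as np

rng = np.random.default_rng(0)

# Connectivity of a hypothetical 5 -> 3 layer: True where a connection exists.
connected = rng.random((5, 3)) < 0.4
weights = rng.normal(size=(5, 3)) * connected  # zero where there is no connection
zero_mask = ~connected

# Toy regression data.
x = rng.normal(size=(64, 5))
target = rng.normal(size=(64, 3))

learning_rate = 0.05
for epoch in range(100):
    error = x @ weights - target
    grad = x.T @ error / len(x)      # dense gradient, nonzero almost everywhere
    weights -= learning_rate * grad  # may introduce unwanted connections
    weights[zero_mask] = 0.0         # so they are reset after every update
```

As the question points out, this still stores and differentiates the full dense matrix; it only guarantees that the missing connections stay at zero.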
Both tensorflow and pytorch support sparse tensors (torch.sparse, tf.sparse).
My intuitive understanding would be that if you were willing to write your network using the respective low level APIs (e.g. actually implementing the forward-pass yourself), you could cast your weight matrices as sparse tensors. That would in turn result in sparse connectivity, since the weight matrix of layer [L] defines the connectivity between neurons of the previous layer [L-1] with neurons of layer [L].
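To illustrate what writing the forward pass yourself can look like, here is a framework-neutral NumPy sketch that stores one weight per existing connection and computes the weight gradient only for those connections. The class name, the index-list layout and the hand-written gradient are assumptions of this sketch; torch.sparse and tf.sparse provide their own tensor types for the same idea.

```python
import numpy as np

class SparseLinear:
    """Linear layer that stores one weight per existing connection only."""

    def __init__(self, in_idx, out_idx, n_out, rng):
        self.in_idx = np.asarray(in_idx)    # source neuron of each connection
        self.out_idx = np.asarray(out_idx)  # target neuron of each connection
        self.n_out = n_out
        self.w = rng.normal(size=len(self.in_idx))

    def forward(self, x):
        # x: (batch, n_in). Connection c adds x[:, in_idx[c]] * w[c]
        # to output column out_idx[c].
        y_t = np.zeros((self.n_out, x.shape[0]))
        np.add.at(y_t, self.out_idx, (x[:, self.in_idx] * self.w).T)
        return y_t.T

    def grad_w(self, x, grad_y):
        # One gradient entry per connection; no dense matrix is ever built.
        return np.sum(x[:, self.in_idx] * grad_y[:, self.out_idx], axis=0)
```

Memory and gradient cost then scale with the number of connections rather than with n_in * n_out.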
I am wondering whether it is possible to add something similar to a Flatten layer for images of variable size.
Say we have an input layer for our CNN as:
input_shape=(1, None, None)
After performing your typical series of convolution/maxpooling layers, can we create a flattened layer, such that the shape is:
output_shape=(None,...)
If not, would someone be able to explain why not?
You can add GlobalMaxPooling2D and GlobalAveragePooling2D.
These will eliminate the spatial dimensions and keep only the channels dimension. Max will take the maximum values, Average will get the mean value.
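The effect is easy to check outside Keras. The NumPy sketch below assumes a channels-last (height, width, channels) layout, which differs from the channels-first shape in the question, and the helper names are made up for illustration.

```python
import numpy as np

def global_max_pool(feature_map):
    """(height, width, channels) -> (channels,), whatever height and width are."""
    return feature_map.max(axis=(0, 1))

def global_avg_pool(feature_map):
    """Same reduction, using the mean instead of the maximum."""
    return feature_map.mean(axis=(0, 1))

rng = np.random.default_rng(0)
small = rng.normal(size=(7, 9, 16))    # two feature maps with different
large = rng.normal(size=(30, 12, 16))  # spatial sizes but 16 channels each

pooled_shapes = [global_max_pool(small).shape, global_avg_pool(large).shape]
```

Both results have length 16, so a Dense layer can follow regardless of the input image size.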
I don't really know why you can't use a Flatten layer, but in fact you can't with variable dimensions.
I understand why a Dense wouldn't work: it would have a variable number of parameters, which is totally infeasible for backpropagation, weight update and things like that. (PS: Dense layers act only on the last dimension, so that is the only one that needs to be fixed).
Examples:
A Dense layer requires the last dimension fixed
A Conv layer can have variable spatial dimensions, but needs fixed channels (otherwise the number of parameters will vary)
A recurrent layer can have variable time steps, but needs fixed features and so on
Also, notice that:
For classification models, you'd need a fixed dimension output, so, how to flatten and still guarantee the correct number of elements in each dimension? It's impossible.
For models with variable output, why would you want to have a fixed dimension in the middle of the model anyway?
If you're going totally custom, you can always use K.reshape() inside a Lambda layer and work with the tensor shapes:
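import keras.backend as K

def myReshape(x):
    # dynamic shape of the incoming tensor
    shape = K.shape(x)
    # keep the batch dimension as a 1-element tensor
    batchSize = shape[:1]
    # -1 lets reshape infer the flattened size
    newShape = K.variable([-1],dtype='int32')
    newShape = K.concatenate([batchSize,newShape])
    return K.reshape(x,newShape)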
The layer: Lambda(myReshape)
I don't think you can because the compile step uses those dimensions to allocate fixed memory when your model is instanced for training or prediction. Some dimensions need to be known ahead of time, so the matrix dimensions can be allocated.
I understand why you want variable-sized image input, the world is not (226, 226, 3). It depends on your specific goals, but for me, scaling up or windowing to a region of interest using say Single Shot Detection as a preprocessing step may be helpful. You could just start with Keras's ImageDataGenerator to scale all images to a fixed size - then you see how much of a performance gain you get from conditional input sizing or windowing preprocessing.
@mikkola, I have found flatten to be very helpful for TimeDistributed models. You can add flatten after the convolution steps using:
your_model.add(Flatten())
I want to use my own convolution function in TensorFlow. I have implemented it using NumPy. How would I convert the code to the TensorFlow format (dynamic inputs in a computational graph)?
At present my function takes a 2d numpy array as input and produces a 3d numpy array(height, width and output channels). How can I iterate through all the input images?
What you want does not make sense. Convolution is a mathematical operation which is defined in some particular way. It is easily extended in N-dimensions and converted to a discrete case (by summing instead of integration), this is why TF has conv1d, conv2d, and general n-dim convolution.
So it is impossible to define your own convolution function, in the same way that you can't define your own matrix multiplication function: if it does not calculate the values in exactly the same way, it is no longer a convolution.
Now if you want to define your own operation, you should take a look at the official documentation on how to define it. In a few words:
create the op using already written functions
write stuff in C++
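As a sketch of the first option (building the operation from functions that already exist), here is a batched sliding-window operation written with array primitives only, which also shows one way to handle all input images at once instead of looping over them. NumPy stands in for TensorFlow here; the function name conv2d_valid and the (batch, height, width) layout are assumptions of this example. Like most deep learning libraries it computes a cross-correlation, that is, the kernel is not flipped.

```python
import numpy as np

def conv2d_valid(images, kernels):
    """Sliding-window product with no padding and stride 1.

    images:  (batch, height, width)
    kernels: (k_height, k_width, out_channels)
    returns: (batch, height - k_height + 1, width - k_width + 1, out_channels)
    """
    _, height, width = images.shape
    k_h, k_w, _ = kernels.shape
    out_h, out_w = height - k_h + 1, width - k_w + 1
    out = np.zeros((images.shape[0], out_h, out_w, kernels.shape[2]))
    # Loop over kernel offsets only; every image and output position
    # is handled by a single vectorized multiply-add per offset.
    for i in range(k_h):
        for j in range(k_w):
            window = images[:, i:i + out_h, j:j + out_w]
            out += window[..., None] * kernels[i, j]
    return out
```

Because the body uses only slicing, broadcasting and addition, the same structure should carry over to tensor operations in a graph framework, where gradients are then derived automatically.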
Some use cases for neural networks require that not all neurons are connected between two consecutive layers. For my neural network architecture, I need to have a layer, where each neuron only has connections to some prespecified neurons in the previous layer (at somewhat arbitrary places, not with a pattern such as a convolution layer). This is needed in order to model data on a specific graph. I need to implement this "Sparse" layer in Theano, but I'm not used to the Theano way of programming.
It seems that the most efficient way of programming sparse connections in Theano would be to use theano.tensor.nnet.blocksparse.SparseBlockGemv. An alternative would be to do matrix multiplication, where many weights are set to 0 (= no connection), but that would be very inefficient compared to SparseBlockGemv as each neuron is only connected to 2-6 neurons in the previous layer out of ~100000 neurons. Moreover, a weight matrix of 100000x100000 would not fit in my RAM or GPU memory. Could someone therefore provide an example of how to implement sparse connections using the SparseBlockGemv method or another computationally-efficient method?
A perfect example would be to extend the MLP Theano Tutorial with an extra layer after the hidden layer (and before softmax), where each neuron only has connections to a subset of neurons in the previous layer. However, other examples are also very welcome!
Edit: Note that the layer must be implemented in Theano as it is just a small part of a larger architecture.
The output of a fully-connected layer is given by the dot product of the input and the weights of that layer. In theano or numpy you can use the dot method.
y = x.dot(w)
If you only have connections to some neurons in the previous layer and those connections are predefined you could do something like this:
y = [x[edges[i]].dot(w[i]) for i in neurons]
Where edges[i] contains the indices for neurons connected to neuron i and w[i] the weights of this connection.
Please note, that theano doesn't know about layers or other high-level details.
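For reference, a self-contained NumPy version of the per-neuron dot product for a single input vector might look like this. The connectivity lists and sizes are made up for illustration, and the same indexing pattern is assumed to carry over to Theano tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)  # activations of the previous layer

# Hypothetical connectivity: edges[i] lists the inputs feeding neuron i.
edges = [np.array([0, 3]), np.array([1, 2, 7]), np.array([4, 5])]
# One weight vector per neuron, exactly as long as its list of edges.
w = [rng.normal(size=len(e)) for e in edges]

# Each neuron only reads the inputs it is connected to.
y = np.array([x[edges[i]].dot(w[i]) for i in range(len(edges))])
```

Storage grows with the number of connections, not with the size of a dense weight matrix.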
Apologies for resurrecting an old thread, but this was the simplest guidance I found that was useful in extending the guidance at https://iamtrask.github.io/2015/07/12/basic-python-network/ for partially-connected inputs. However, it took me a while to make sense of basaundi's answer and I think I can improve upon it.
There were a couple of things that I needed to change to make it work. In my case, I am trying to map from N inputs to M neurons in my first hidden layer. My inputs are in a NxF array, where F is the number of features for my inputs, and my synapse values (weights) between inputs and the first layer are in a FxM array. Therefore, the output of Inputs <dot> Weights is a NxM array. My edge matrix is an MxF array that specifies for each neuron in layer 1 (rows), which of the features of the input data are relevant (columns).
In this setup, at least, it required me to slice my arrays differently than specified above. Also, the list comprehension returns a list of matrices, which must be summed to get the correct NxM (otherwise you get an MxNxM array).
So I have used the following (util.sigmoid is a helper function of my own):
y = [numpy.dot(x[:, edges[i]], w[edges[i]])
for i in range(M)]
y = util.sigmoid(numpy.sum(y, 0))
This seems to work for me.