I'm implementing a text classifier with a CNN similar to Kim 2014 with Tensorflow. Tensorflow provides tf.nn.embedding_lookup_sparse, which allows you to provide the word IDs as a sparse tensor. This is nice, especially for enabling variable length sequences. However, this function requires a "combination" step after the lookup, such as "mean" or "sum". This coerces it back to the dense tensor space. I don't want to do any combination. I want to keep my vectors in the sparse representation, so I can do other convolutions afterwards. Is this possible in TF?
EDIT: I want to avoid padding the input prior to the embedding lookup. This is because TensorFlow's embedding lookup generates vectors for the pad value, and it's a kludge trying to mask them with zeros (see here).
I think there are two points of confusion in the question. Firstly, the combiner operation happens across the set of embedding IDs for each row of the sparse indices input sp_ids. So if sp_ids has a shape of N x 1, then you are "combining" just one embedding vector per row of sp_ids, which simply retrieves that embedding vector (which I think is what you are saying you want).
Secondly, though, the return value is the embedding vector for each row of input. The embedding vector itself is a dense vector, by the very definition of what an embedding is and what the TensorFlow embedding operations calculate. So the returned result will always be dense, and that's what you want. A sparse matrix representation would be horribly inefficient, since the matrix truly is dense (full of dense embeddings), regardless of whether any "combiner" operation happens or not.
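To make the two points above concrete, here is a minimal NumPy sketch that emulates a per-row lookup followed by a combiner. It is not TensorFlow's implementation; the table `embeddings` and the helper `lookup_and_combine` are hypothetical names chosen for illustration.

```python
import numpy as np

# Hypothetical embedding table: 5 words, 3 dimensions each.
embeddings = np.arange(15, dtype=np.float32).reshape(5, 3)

def lookup_and_combine(rows_of_ids, combiner="sum"):
    """Emulate a per-row embedding lookup followed by a combiner."""
    out = []
    for ids in rows_of_ids:
        vectors = embeddings[ids]  # (len(ids), 3), always dense
        if combiner == "sum":
            out.append(vectors.sum(axis=0))
        else:
            out.append(vectors.mean(axis=0))
    return np.stack(out)  # (num_rows, 3)

# Several IDs per row: the combiner merges them into one vector per row.
bag = lookup_and_combine([[0, 2], [1, 3, 4]])

# One ID per row (the N x 1 case): the combiner has nothing to merge,
# so each output row is just the looked-up embedding.
single = lookup_and_combine([[2], [4]], combiner="mean")
```

With one ID per row, `single` equals `embeddings[[2, 4]]`, and in both cases the result is an ordinary dense array.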
The research paper you linked does not seem to be doing any type of special methodology that would result in a special case of a sparse embedding vector, so I don't see a reason here for expecting or desiring sparse outputs.
Maybe I am wrong. Can you provide more details about why you expect the embedding vectors themselves to be sparse vectors? That would be a highly unusual situation.
Related
My goal is to compute multi-head attention between a sparse matrix and a dense matrix. I understand that, by default, the Keras MultiHeadAttention API requires two dense matrices and returns the attention output after the softmax operation over the queries, keys and values, as in the Vaswani et al. paper "Attention Is All You Need".
However, I have a use-case where I have a sparse and dense matrix, and I want to pass them to a MultiHead Attention layer as a Query and a Value respectively.
By default, there is no support, and converting to dense and back is not an option as the time complexity grows a lot.
Is there any way to override the internal operations that are not compatible with sparse-dense combinations, and perhaps replace them with mixed APIs such as sparse_dense_matmul for the attention computation? However, the documentation states that the matrix ranks must be 2 for sparse_dense_matmul, which is why overriding the class directly does not seem feasible to me, unless I write my own sparse-dense computation block. Note: the rank for matmul is usually 3 in a transformer, as the shapes are in the format (Batch Size, Sequence Length, Dim).
To give an example:
from tensorflow.keras import layers

att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
# I would like to pass this query as sparse and this value as dense.
attn_output = att(query=inputs1, value=inputs2)
I appreciate any help.
You can take a look at the official repos that have published implementations of sparse attention, such as the Sparse Transformer.
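On the rank-2 restriction raised in the question: the per-token projection inside attention only acts on the last axis, so the batch and sequence axes can be folded together to obtain a rank-2 product. The NumPy sketch below only demonstrates that equivalence with dense arrays; the names `query` and `w_q` are hypothetical, and swapping the rank-2 product for a sparse-dense one is left as an assumption, not something shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, dim, proj_dim = 2, 4, 6, 3

# Hypothetical mostly-zero query batch and a dense projection matrix.
query = rng.normal(size=(batch, seq_len, dim))
query[rng.random(query.shape) < 0.8] = 0.0
w_q = rng.normal(size=(dim, proj_dim))

# Rank-3 projection, as a dense layer would compute it.
projected_3d = query @ w_q  # (batch, seq_len, proj_dim)

# Same result through a rank-2 product: fold batch and sequence together.
query_2d = query.reshape(batch * seq_len, dim)
projected_2d = (query_2d @ w_q).reshape(batch, seq_len, proj_dim)
```

This only covers the input projection. The attention scores themselves are computed per batch element, so they would still need a batched treatment.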
The documentation for JAX says,
Not all JAX code can be JIT compiled, as it requires array shapes to be static & known at compile time.
Now I am somewhat surprised, because TensorFlow has operations like tf.boolean_mask that do what JAX seems incapable of doing when compiled.
Why is there such a regression from TensorFlow? I was under the assumption that the underlying XLA representation was shared between the two frameworks, but I may be mistaken. I don't recall TensorFlow ever having trouble with dynamic shapes, and functions such as tf.boolean_mask have been around forever.
Can we expect this gap to close in the future? If not, what makes it impossible to do in JAX's jit what TensorFlow (among others) enables?
EDIT
The gradient passes through tf.boolean_mask (obviously not on mask values, which are discrete); case in point here using TF1-style graphs where values are unknown, so TF cannot rely on them:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
x1 = tf.placeholder(tf.float32, (3,))
x2 = tf.placeholder(tf.float32, (3,))
y = tf.boolean_mask(x1, x2 > 0)
print(y.shape) # prints "(?,)"
dydx1, dydx2 = tf.gradients(y, [x1, x2])
assert dydx1 is not None and dydx2 is None
Currently, you can't (as discussed here)
This is not a limitation of JAX jit vs TensorFlow, but a limitation of XLA, or rather of how the two compile.
JAX simply uses XLA to compile the function. XLA needs to know the static shape. That's an inherent design choice within XLA.
TensorFlow uses tf.function: this creates a graph which can have shapes that are not statically known. This is not as efficient as using XLA, but still fine. However, tf.function offers an option, jit_compile, which compiles the graph inside the function with XLA. While this often offers a decent speedup (for free), it comes with restrictions: shapes need to be statically known (surprise, surprise...).
This is overall not too surprising: in general, computations are faster (given that a decent optimizer went over them) the more is known in advance, as more parameters (memory layout, ...) can be optimally scheduled. The less is known, the slower the code (at this end of the spectrum is normal Python).
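As a rough illustration of the static-shape requirement on the JAX side, the sketch below contrasts boolean indexing (data-dependent output shape) with two fixed-shape alternatives. The function names are hypothetical, and fixing `size=2` is an assumption made for this toy input.

```python
import jax
import jax.numpy as jnp

x = jnp.array([1.0, -2.0, 3.0, -4.0])

@jax.jit
def masked_sum(x):
    # Output shape equals the input shape, so XLA can compile this.
    return jnp.where(x > 0, x, 0.0).sum()

@jax.jit
def first_k_positive(x):
    # `size` fixes the number of returned indices at trace time;
    # unused slots would be filled with index 0 (the default fill_value).
    (idx,) = jnp.nonzero(x > 0, size=2)
    return x[idx]

def boolean_index(x):
    # Data-dependent output shape: fine eagerly, rejected under jit.
    return x[x > 0]

eager = boolean_index(x)
try:
    jax.jit(boolean_index)(x)
    jit_failed = False
except Exception:
    jit_failed = True
```

Here `jit_failed` ends up `True`, while the eager call and both jitted functions work, because their output shapes do not depend on the values in `x`.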
I don't think JAX is any less capable of doing this than TensorFlow. Nothing forbids you from doing this in JAX:
new_array = my_array[mask]
However, under jit, mask should contain integer indices and not booleans. This way, JAX knows the shape of new_array (the same as that of mask). In that sense, I'm pretty sure that tf.boolean_mask is not differentiable, i.e. it will raise an error if you try to compute its gradient at some point.
More generally, if you need to mask an array, whatever library you are using, there are two approaches:
if you know in advance which indices need to be selected, you provide these indices so that the library can compute the shape before compilation;
if you can't define these indices, for whatever reason, then you need to design your code so that the padding does not affect your result.
Examples for each situation
Let's say you're writing a simple embedding layer in JAX. The input is a batch of token indices corresponding to several sentences. To get the word embeddings corresponding to these indices, you simply write word_embeddings = embeddings[input]. Since you don't know the length of the sentences in advance, you need to pad all token sequences to the same length beforehand, so that input has shape (number_of_sentences, sentence_max_length). Now, JAX will recompile the indexing operation every time this shape changes. To minimize the number of compilations, you can always provide the same number of sentences (the batch size) and set sentence_max_length to the maximum sentence length in the entire corpus. This way, there will be only one compilation during training. Of course, you need to reserve one row in embeddings that corresponds to the pad index. But still, the masking works.
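The padding step described above can be sketched in NumPy as follows. The helper `pad_batch`, the pad index 0 and the choice to zero the pad row are assumptions for illustration; the sketch also assumes no sentence is longer than `max_len`.

```python
import numpy as np

PAD = 0  # assumed pad index; row 0 of the table is reserved for it

def pad_batch(sentences, max_len):
    """Right-pad lists of token indices to a fixed (batch, max_len) array.

    Assumes every sentence has at most max_len tokens.
    """
    batch = np.full((len(sentences), max_len), PAD, dtype=np.int32)
    for row, tokens in enumerate(sentences):
        batch[row, :len(tokens)] = tokens
    return batch

vocab_size, dim = 6, 4
embeddings = np.random.default_rng(0).normal(size=(vocab_size, dim))
embeddings[PAD] = 0.0  # optional choice: make the pad vector all zeros

tokens = pad_batch([[3, 1, 4], [5, 2]], max_len=5)
word_embeddings = embeddings[tokens]  # (2, 5, 4), same shape for every batch
```

Because `tokens` always has shape `(batch_size, max_len)`, a compiled lookup would see the same input shape for every batch.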
Later in the model, let's say you want to express each word of each sentence as a weighted average of all other words in the sentence (like a self-attention mechanism). The weights are computed in parallel for the entire batch and are stored in the matrix A of dimension (number_of_sentences, sentence_max_length, sentence_max_length). The weighted averages are computed with the formula A @ word_embeddings. Now, you need to make sure the pad tokens don't affect this formula. To do so, you can zero out the entries of A corresponding to the pad indices to remove their influence on the averaging. If the pad token index is 0, you would do:
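import jax.numpy as jnp

mask = jnp.array(input > 0, dtype=jnp.float32)  # 1.0 for real tokens, 0.0 for pad tokens
A = A * mask[:, jnp.newaxis, :]                 # zero the weights that point at pad tokens
weighted_mean = A @ word_embeddings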
So here we used a boolean mask, but the masking stays differentiable since we multiply the mask with another matrix instead of using it as an index. Note that we should proceed in the same way to remove the rows of weighted_mean that also correspond to pad tokens.
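The snippet above leaves its inputs undefined. A self-contained sketch with toy values, which also applies the row masking mentioned in the last sentence, might look like this. The token values, the uniform weights and the renormalisation step are assumptions made for illustration.

```python
import jax.numpy as jnp

PAD = 0
tokens = jnp.array([[3, 1, 4, PAD], [5, 2, PAD, PAD]])  # (batch, max_len)

# Toy embeddings: each 3-d vector just repeats its token id.
word_embeddings = jnp.ones((2, 4, 3)) * tokens[..., None]

mask = (tokens != PAD).astype(jnp.float32)   # (2, 4)
A = jnp.ones((2, 4, 4))                      # uniform toy weights
A = A * mask[:, jnp.newaxis, :]              # zero the columns of pad tokens
A = A / A.sum(axis=-1, keepdims=True)        # assumed step: renormalise each row
weighted_mean = A @ word_embeddings          # (2, 4, 3)
weighted_mean = weighted_mean * mask[:, :, jnp.newaxis]  # zero the rows of pad tokens
```

Every operation here keeps a fixed shape, so the same code can be traced once and reused. The renormalisation assumes each sentence contains at least one non-pad token.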
I have an intermediate model that outputs symmetric arrays. Those arrays are then used as input to another model. I'd like to just flatten the arrays and discard the lower triangles, since they're symmetric. Is there a best/most efficient way to do this?
Edit: I want the triangle extraction to be handled similar to any other Keras layer, so that the output of the first model can be input directly to the second model and trained end-to-end.
TensorFlow and Keras interoperate with NumPy arrays. Consider using the NumPy functions triu or triu_indices.
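As a minimal NumPy sketch of the triu_indices suggestion (the helper names `flatten_upper` and `restore_symmetric` are hypothetical), the upper triangle of each matrix in a batch can be extracted and later restored like this:

```python
import numpy as np

def flatten_upper(batch):
    """Keep the upper triangle (diagonal included) of each (n, n) matrix."""
    n = batch.shape[-1]
    rows, cols = np.triu_indices(n)
    return batch[..., rows, cols]  # (..., n * (n + 1) // 2)

def restore_symmetric(flat, n):
    """Inverse of flatten_upper, valid for symmetric matrices only."""
    rows, cols = np.triu_indices(n)
    full = np.zeros(flat.shape[:-1] + (n, n), dtype=flat.dtype)
    full[..., rows, cols] = flat
    full[..., cols, rows] = flat
    return full

a = np.arange(9.0).reshape(3, 3)
sym = (a + a.T)[np.newaxis]  # a batch containing one symmetric matrix
flat = flatten_upper(sym)    # shape (1, 6)
```

Inside a model trained end-to-end, the index arrays could still be precomputed once like this, but the gather itself would have to be written with the framework's own indexing operations so that gradients can flow. That part is not shown here.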
I want to use my own convolution function in TensorFlow. I have implemented it using NumPy. How would I convert the code to TensorFlow format (dynamic inputs in a computational graph)?
At present my function takes a 2D NumPy array as input and produces a 3D NumPy array (height, width and output channels). How can I iterate through all the input images?
What you want does not make sense. Convolution is a mathematical operation which is defined in a particular way. It is easily extended to N dimensions and converted to the discrete case (by summing instead of integrating), which is why TF has conv1d, conv2d, and general N-dimensional convolution.
So it is impossible to define your own convolution function, in the same way that you can't define your own matrix multiplication function: if it does not calculate the values in exactly the same way, it will no longer be a convolution.
Now if you want to define your own operation, you should take a look at the official documentation on how to define one. In a few words:
create the op using already written functions
write stuff in C++
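The first option, composing already written functions, can be sketched in NumPy, since the question's implementation is in NumPy. The function `conv2d_batch` is a hypothetical name, and the sketch assumes a "valid" cross-correlation (no padding, no kernel flip, as deep learning libraries usually compute it). It processes all images at once, so no Python loop over the batch is needed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_batch(images, filters):
    """'Valid' cross-correlation of (B, H, W) images with (kh, kw, C) filters."""
    kh, kw, _ = filters.shape
    # windows has shape (B, H - kh + 1, W - kw + 1, kh, kw).
    windows = sliding_window_view(images, (kh, kw), axis=(1, 2))
    # Multiply every window with every filter and sum over the window.
    return np.einsum("bhwij,ijc->bhwc", windows, filters)

rng = np.random.default_rng(0)
images = rng.normal(size=(2, 5, 5))
filters = rng.normal(size=(3, 3, 4))
out = conv2d_batch(images, filters)  # (2, 3, 3, 4)
```

Each output entry is the sum of one 3x3 image patch multiplied elementwise by one filter, which is the same quantity a per-image loop would compute.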
Some use cases for neural networks require that not all neurons are connected between two consecutive layers. For my neural network architecture, I need to have a layer where each neuron only has connections to some prespecified neurons in the previous layer (at somewhat arbitrary places, not with a pattern such as a convolution layer). This is needed in order to model data on a specific graph. I need to implement this "sparse" layer in Theano, but I'm not used to the Theano way of programming.
It seems that the most efficient way of programming sparse connections in Theano would be to use theano.tensor.nnet.blocksparse.SparseBlockGemv. An alternative would be to do a matrix multiplication where many weights are set to 0 (= no connection), but that would be very inefficient compared to SparseBlockGemv, as each neuron is only connected to 2-6 neurons in the previous layer out of ~100000 neurons. Moreover, a weight matrix of 100000x100000 would not fit in my RAM or GPU memory. Could someone therefore provide an example of how to implement sparse connections using the SparseBlockGemv method or another computationally efficient method?
A perfect example would be to extend the MLP Theano Tutorial with an extra layer after the hidden layer (and before softmax), where each neuron only has connections to a subset of neurons in the previous layer. However, other examples are also very welcome!
Edit: Note that the layer must be implemented in Theano as it is just a small part of a larger architecture.
The output of a fully-connected layer is given by the dot product of the input and the weights of that layer. In theano or numpy you can use the dot method.
y = x.dot(w)
If you only have connections to some neurons in the previous layer and those connections are predefined you could do something like this:
y = [x[edges[i]].dot(w[i]) for i in neurons]
Where edges[i] contains the indices for neurons connected to neuron i and w[i] the weights of this connection.
Please note that Theano doesn't know about layers or other high-level details.
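A runnable version of the per-neuron idea above, written in NumPy (not Theano) with made-up sizes, could look like the following. The contents of `edges` and `w` are hypothetical, and the dense matrix is built only to check the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons = 8, 3

x = rng.normal(size=n_inputs)  # activations of the previous layer
# edges[i]: indices of the inputs connected to neuron i.
edges = [np.array([0, 1]), np.array([2, 5, 7]), np.array([3, 4])]
# w[i]: one small weight vector per neuron, one weight per connection.
w = [rng.normal(size=len(e)) for e in edges]

# Each neuron only reads the inputs listed in edges[i].
y = np.array([x[edges[i]].dot(w[i]) for i in range(n_neurons)])

# Equivalent dense computation, for checking only.
w_dense = np.zeros((n_inputs, n_neurons))
for i in range(n_neurons):
    w_dense[edges[i], i] = w[i]
y_dense = x.dot(w_dense)
```

Only the small per-neuron weight vectors are stored, so the full weight matrix never has to exist outside of this check.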
Apologies for resurrecting an old thread, but this was the simplest guidance I found that was useful in extending the guidance at https://iamtrask.github.io/2015/07/12/basic-python-network/ for partially-connected inputs. However, it took me a while to make sense of basaundi's answer and I think I can improve upon it.
There were a couple of things that I needed to change to make it work. In my case, I am trying to map from N inputs to M neurons in my first hidden layer. My inputs are in a NxF array, where F is the number of features for my inputs, and my synapse values (weights) between inputs and the first layer are in a FxM array. Therefore, the output of Inputs <dot> Weights is a NxM array. My edge matrix is an MxF array that specifies for each neuron in layer 1 (rows), which of the features of the input data are relevant (columns).
In this setup, at least, it required me to slice my arrays differently than specified above. Also, the list comprehension returns a list of matrices, which must be summed to get the correct NxM (otherwise you get an MxNxM array).
So I have used the following (util.sigmoid is a helper function of my own):
y = [numpy.dot(x[:, edges[i]], w[edges[i]])
     for i in range(M)]
y = util.sigmoid(numpy.sum(y, 0))
This seems to work for me.
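For comparison, here is an alternative sketch that is not the code above: assuming edges is a boolean MxF mask, the weights of unused connections are zeroed and a single matrix product is taken. All values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, M = 4, 5, 3

x = rng.normal(size=(N, F))  # inputs
w = rng.normal(size=(F, M))  # weights
# edges[i, f] is True when neuron i uses feature f (assumed boolean mask).
edges = np.array([[1, 1, 0, 0, 0],
                  [0, 0, 1, 1, 0],
                  [0, 0, 0, 1, 1]], dtype=bool)

# Zero the weights of unused connections, then do one matrix product.
y = x @ (w * edges.T)  # (N, M)

# Per-neuron check: column i only depends on the features in edges[i].
y_loop = np.stack(
    [x[:, edges[i]] @ w[edges[i], i] for i in range(M)], axis=1
)
```

Note that this is a different computation from the list-and-sum version above, where each slice is multiplied into all M columns before summing. Here column i of the output only sees the features selected for neuron i. An activation such as the author's util.sigmoid would be applied to `y` afterwards.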