The documentation for JAX says,
Not all JAX code can be JIT compiled, as it requires array shapes to be static & known at compile time.
Now I am somewhat surprised, because TensorFlow has operations like tf.boolean_mask that do what JAX seems incapable of doing when compiled.
Why is there such a regression from TensorFlow? I was under the assumption that the underlying XLA representation was shared between the two frameworks, but I may be mistaken. I don't recall TensorFlow ever having trouble with dynamic shapes, and functions such as tf.boolean_mask have been around forever.
Can we expect this gap to close in the future? If not, what makes it impossible to do in JAX's jit what TensorFlow (among others) enables?
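For concreteness, here is a minimal sketch of the kind of thing I mean (my own toy example, not taken from the docs): the output shape of a boolean mask depends on the values of the input, so jit refuses it.
import jax
import jax.numpy as jnp
@jax.jit
def positive_part(x):
    # output shape depends on the values of x, not just on its shape
    return x[x > 0]
# positive_part(jnp.array([1.0, -2.0, 3.0]))
# -> fails under jit: boolean indices must be concrete / shapes must be static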
EDIT
The gradient passes through tf.boolean_mask (obviously not with respect to the mask values, which are discrete); case in point here, using TF1-style graphs where values are unknown at graph-construction time, so TF cannot rely on them:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
x1 = tf.placeholder(tf.float32, (3,))
x2 = tf.placeholder(tf.float32, (3,))
y = tf.boolean_mask(x1, x2 > 0)
print(y.shape) # prints "(?,)"
dydx1, dydx2 = tf.gradients(y, [x1, x2])
# gradient flows through x1; x2 only feeds the discrete mask, so it gets none
assert dydx1 is not None and dydx2 is None
Currently, you can't (as discussed here)
This is not a limitation of JAX's jit vs. TensorFlow, but a limitation of XLA, or rather of how the two frameworks compile.
JAX simply uses XLA to compile the function. XLA needs to know the static shapes; that's an inherent design choice within XLA.
TensorFlow uses tf.function: this creates a graph which can have shapes that are not statically known. This is not as efficient as using XLA, but still fine. However, tf.function offers an option jit_compile, which will compile the graph inside the function with XLA. While this often offers a decent speedup (for free), it comes with restrictions: shapes need to be statically known (surprise, surprise, ...).
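For illustration, a small sketch of that difference, assuming a recent TF 2.x where tf.function accepts jit_compile (my own toy example; exact error messages vary by version):
import tensorflow as tf
x = tf.constant([1.0, -2.0, 3.0])
# Plain tf.function: graph mode, a value-dependent output shape is fine.
@tf.function
def masked(x):
    return tf.boolean_mask(x, x > 0)
print(masked(x))  # [1. 3.]
# XLA-compiled variant: shapes must be static, so the data-dependent
# output shape of boolean_mask is a problem here.
@tf.function(jit_compile=True)
def masked_xla(x):
    return tf.boolean_mask(x, x > 0)
# masked_xla(x)  # may raise an error about dynamic shapes under XLA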
This is overall not too surprising behavior: computations are in general faster (given a decent optimizer has gone over them) the more is known in advance, because more parameters (memory layout, ...) can be optimally scheduled. The less is known, the slower the code (at this end of the spectrum sits plain Python).
I don't think JAX is any more incapable of doing this than TensorFlow. Nothing forbids you from doing this in JAX:
new_array = my_array[mask]
However, mask should be indices (integers) and not booleans. This way, JAX is aware of the shape of new_array (the same as that of mask). In that sense, tf.boolean_mask is not differentiable with respect to the mask itself (the mask values are discrete), even though, as the question's edit shows, gradients do flow through the selected values.
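A minimal sketch of that integer-index variant under jit (names are illustrative):
import jax
import jax.numpy as jnp
@jax.jit
def gather(my_array, idx):
    # idx is an integer index array with a fixed length, so the output
    # shape (len(idx),) is known at trace time
    return my_array[idx]
my_array = jnp.array([10.0, 20.0, 30.0, 40.0])
idx = jnp.array([0, 2])       # indices chosen outside the jitted code
print(gather(my_array, idx))  # [10. 30.]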
More generally, if you need to mask an array, whatever library you are using, there are two approaches:
either you know in advance what indices need to be selected, and you provide these indices so that the library can compute the shape before compilation;
or you can't define these indices, for whatever reason, and then you need to design your code so that the padding does not affect your result.
Examples for each situation
Let's say you're writing a simple embedding layer in JAX. The input is a batch of token indices corresponding to several sentences. To get the word embeddings corresponding to these indices, I would simply write word_embeddings = embeddings[input]. Since I don't know the length of the sentences in advance, I need to pad all token sequences to the same length beforehand, so that input has shape (number_of_sentences, sentence_max_length). Now, JAX will recompile the masking operation every time this shape changes. To minimize the number of compilations, you can provide the same number of sentences (also called batch size) each time, and you can set sentence_max_length to the maximum sentence length in the entire corpus. This way, there will be only one compilation during training. Of course, you need to reserve one row in embeddings that corresponds to the pad index. But still, the masking works.
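A small sketch of that padded lookup (sizes are illustrative, and this is not a full embedding layer):
import jax.numpy as jnp
vocab_size, embed_dim = 1000, 64                  # illustrative sizes
embeddings = jnp.zeros((vocab_size, embed_dim))   # row 0 reserved for the pad index
# two sentences padded to the same length with pad index 0:
# shape (number_of_sentences, sentence_max_length) = (2, 5)
input = jnp.array([[5, 7, 2, 0, 0],
                   [3, 9, 4, 8, 0]])
word_embeddings = embeddings[input]               # shape (2, 5, 64), known statically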
Later in the model, let's say you want to express each word of each sentence as a weighted average of all the other words in the sentence (like a self-attention mechanism). The weights are computed in parallel for the entire batch and are stored in a matrix A of dimension (number_of_sentences, sentence_max_length, sentence_max_length). The weighted averages are computed with the formula A @ word_embeddings. Now, you need to make sure the pad tokens don't affect this formula. To do so, you can zero out the entries of A corresponding to the pad indices to remove their influence in the averaging. If the pad token index is 0, you would do:
mask = jnp.array(input > 0, dtype=jnp.float32)  # 1.0 for real tokens, 0.0 for pad
A = A * mask[:, jnp.newaxis, :]                 # zero out the columns of pad positions
weighted_mean = A @ word_embeddings
So here we used a boolean mask, but the masking is still differentiable, since we multiply the mask with another matrix instead of using it as an index. Note that we should proceed the same way to remove the rows of weighted_mean that correspond to pad tokens.
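For that last note, the analogous multiplicative masking of weighted_mean would look roughly like this (same shapes as above assumed):
# mask has shape (number_of_sentences, sentence_max_length); broadcasting over
# the embedding dimension zeroes out the rows that belong to pad tokens
weighted_mean = weighted_mean * mask[:, :, jnp.newaxis]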
Related
I'm running a sequence-to-sequence model in TensorFlow, for which I need to extensively pad my input data (the samples' lengths vary). Thus, any metric calculated is heavily biased (if the true sample is ~10% of the data input to the model, the errors appearing there are somewhat hidden by "correct" predictions on the padded part).
Thus, I'd like to calculate "true" metrics (accuracy, AUC or whatever) that take into account the real sample only. In numpy-ish code, I'd like to do something like this:
def adjusted_metrics(y_true, y_pred):
    # y is padded with 0, and there is another value at the end of the real y
    last_index = np.nonzero(y_true)[0][-1]
    return AUC(y_true[:last_index], y_pred[:last_index])
But, I'm pretty new to tensorflow and:
I can't do that in TensorFlow code. Actually, I'm not able to find the index of the last nonzero element of y_true when it is a Tensor. I tried casting to numpy using tensorflow.experimental.numpy (no effect, actually; it still appears as a Tensor?) and calling .numpy() on the tensor (not working, despite the fact that I don't have eager execution disabled). I tried masking, but it's hard for me to find the mask dimension, also due to the following point:
All my attempts also seem inappropriate in the context of batches - y_true and y_pred are of shape (None, max_length). I suppose the calculation in batches is governed by my model, but I've no idea how (and if it's possible) to change the metrics calculation to be done per sample, keeping the whole learning process in batches.
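For what it's worth, the closest thing I can picture is selecting the unpadded entries with a boolean mask, but that flattens the batch, so per-sample metrics would still need a loop or ragged tensors (a rough sketch; the pad value is assumed to be 0):
import tensorflow as tf
y_true = tf.constant([[1.0, 2.0, 0.0, 0.0]])   # illustrative padded batch
y_pred = tf.constant([[0.9, 1.8, 0.1, 0.2]])
mask = tf.not_equal(y_true, 0.0)               # True where the sample is real
true_vals = tf.boolean_mask(y_true, mask)      # flattens across the batch
pred_vals = tf.boolean_mask(y_pred, mask)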
Any advice? :)
I have a problem where I have texts (which can be very long, up to ~9000 words) that I need to embed with a Keras Embedding layer. I chose the fixed size of 5000 for every text, and I need to pad each sequence to get to the right shape. The classical way is to use Keras' pad_sequences, which takes as input a list of lists of indices and pads with zeros or cuts the lists of indices down to 5000.
For my downstream task, I use a sort of convnet inspired by Kim's paper (https://arxiv.org/abs/1408.5882). My concern is that the network, in a certain sense, learns the word count by detecting the pattern of vectors that embed the 0 I used to pad the sequences. I am not saying that this feature is not important, but I would like to push the network to learn other features in preference. I was thinking about two things. First, use an additional task (like an adversarial task) that takes the latent representation created by the model before the output and uses a branch of the model to predict the size of the text, or a size bucket, for example:
[0, 1000] words -- cluster 1
[1001, 2000] words -- cluster 2
etc.
Then use the output to encourage the network to map other information into the latent space, by adding an adversarial loss to the main loss term. My other idea was, instead of using zero vectors to embed the zero padding, to use random vectors generated on the fly while training (every time the network sees a particular index, for example -1, it knows it has to generate a random vector). I was thinking that this breaks the symmetry introduced by using zero vectors and helps the model generalize better, as it introduces noise into the training process.
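Concretely, I imagine something roughly like the following (just a sketch, not a ready-made Keras layer; the names and the pad handling are illustrative):
import tensorflow as tf
def lookup_with_random_pad(embeddings, token_ids, pad_id=0, stddev=0.1):
    # normal lookup for every position...
    vectors = tf.gather(embeddings, token_ids)
    # ...then overwrite the pad positions with fresh noise on every call,
    # so the network cannot latch onto a constant zero-padding pattern
    noise = tf.random.normal(tf.shape(vectors), stddev=stddev)
    is_pad = tf.cast(tf.equal(token_ids, pad_id), vectors.dtype)[..., tf.newaxis]
    return vectors * (1.0 - is_pad) + noise * is_pad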
As I didn't find any papers on padding with something other than zeros, I turn to the community. What do you think? I went through the Embedding layer implementation and I am pretty sure that the second idea is pretty straightforward to implement in Keras by replacing the K.gather() with a flag for the right indexes (it would mean longer execution time, though).
Thanks in advance for your feedback and your resources!
I know that in gensim's KeyedVectors model, one can access the embedding matrix via the attribute model.syn0. There is also a syn0norm, which doesn't seem to work for the GloVe model I recently loaded. I think I have also seen syn1 somewhere previously.
I haven't found a docstring for this, and I'm just wondering what the logic behind it is.
So if syn0 is the embedding matrix, what is syn0norm? What would then syn1 be and generally, what does syn stand for?
These names were inherited from the original Google word2vec.c implementation, upon which the gensim Word2Vec class was based. (I believe syn0 only exists in recent versions for backward compatibility.)
The syn0 array essentially holds raw word-vectors. From the perspective of the neural-network used to train word-vectors, these vectors are a 'projection layer' that can convert a one-hot encoding of a word into a dense embedding-vector of the right dimensionality.
Similarity operations tend to be done on the unit-normalized versions of the word-vectors. That is, vectors that have all been scaled to have a magnitude of 1.0. (This makes the cosine-similarity calculation easier.) The syn0norm array is filled with these unit-normalized vectors, the first time they're needed.
This syn0norm will be empty until either you do an operation (like most_similar()) that requires it, or you explicitly do an init_sims() call. If you explicitly do an init_sims(replace=True) call, you'll actually clobber the raw vectors, in-place, with the unit-normed vectors. This saves the memory that storing both vectors for every word would otherwise require. (However, some word-vector uses may still be interested in the original raw vectors of varying magnitudes, so only do this when you're sure most_similar() cosine-similarity operations are all you'll need.)
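A short sketch of that flow, assuming an older (pre-4.0) gensim where syn0 and init_sims() still exist (the file path and query word are illustrative):
from gensim.models import KeyedVectors
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # illustrative path
print(kv.syn0.shape)          # raw word-vectors
kv.init_sims(replace=True)    # overwrite syn0 with the unit-normalized vectors, in place
print(kv.most_similar("king", topn=3))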
The syn1 (or syn1neg in the more common case of negative-sampling training) properties, when they exist on a full model (and not for a plain KeyedVectors object of only word-vectors), are the model neural network's internal 'hidden' weights leading to the output nodes. They're needed during model training, but not a part of the typical word-vectors collected after training.
I believe the syn prefix is just a convention from neural-network variable-naming, likely derived from 'synapse'.
I'm implementing a text classifier with a CNN similar to Kim 2014 with Tensorflow. Tensorflow provides tf.nn.embedding_lookup_sparse, which allows you to provide the word IDs as a sparse tensor. This is nice, especially for enabling variable length sequences. However, this function requires a "combination" step after the lookup, such as "mean" or "sum". This coerces it back to the dense tensor space. I don't want to do any combination. I want to keep my vectors in the sparse representation, so I can do other convolutions afterwards. Is this possible in TF?
EDIT: I want to avoid padding the input prior to the embedding lookup. This is because TensorFlow's embedding lookup generates vectors for the pad value, and it's a kludge trying to mask it with zeros (see here).
I think there are two points of confusion in the question. Firstly, the combiner operation happens across the set of embedding IDs for each row of the sparse indices input sp_ids. So if sp_ids has a shape of N x 1, then you are "combining" just one embedding vector per each row of sp_ids, which will just retrieve that embedding vector (which is I think what you are saying you want).
Secondly though, the return value is the embedding vector for each row of input. The embedding vector itself is a dense vector, by the very definition of what the embedding is and what the TensorFlow embedding operations calculate. So this return result will always be dense, and that's what you want. A sparse matrix representation would be horribly inefficient, since the matrix truly will be dense (full of dense embeddings), regardless of whether any 'combiner' operation happens or not.
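To make the first point concrete, a minimal sketch with one id per row (sizes are illustrative; TF 2.x API assumed):
import tensorflow as tf
params = tf.random.normal([10, 4])   # 10-word vocab, 4-dim embeddings (illustrative)
# one embedding id per row: the 'mean' combiner over a single id just returns it
sp_ids = tf.sparse.SparseTensor(indices=[[0, 0], [1, 0], [2, 0]],
                                values=tf.constant([3, 7, 1], dtype=tf.int64),
                                dense_shape=[3, 1])
out = tf.nn.embedding_lookup_sparse(params, sp_ids, sp_weights=None, combiner='mean')
print(out.shape)   # (3, 4): one dense embedding vector per row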
The research paper you linked does not seem to be doing any type of special methodology that would result in a special case of a sparse embedding vector, so I don't see a reason here for expecting or desiring sparse outputs.
Maybe I am incorrect; can you provide more details about why you expect the embedding vectors themselves to be sparse? That would be a highly unusual situation if so.
I have a 3D sparse tensor in TensorFlow which I want to split along the first dimension (axis=0). I was thinking of using the tf.sparse_split operation. But it requires the argument num_split as a Python integer. I wanted to know: if I have num_split in a scalar placeholder, is there any way to use it?
Why has such a convention been followed for this function? I haven't seen it in any other TensorFlow operation.
In the TensorFlow framework, num_split has to be known at graph building time, because the graph is meant to be static. At least when using daddy's old, graph-based TensorFlow. If you really need parts of your graph to be dynamic, you might have success using TensorFlow's imperative eager execution.
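For reference, a small sketch of the static-int requirement using the current TF 2.x name tf.sparse.split (values are illustrative):
import tensorflow as tf
st = tf.sparse.SparseTensor(indices=[[0, 0, 0], [1, 1, 1], [2, 0, 1], [3, 2, 0]],
                            values=[1.0, 2.0, 3.0, 4.0],
                            dense_shape=[4, 3, 2])
# num_split must be a plain Python int known when the graph is built,
# not a Tensor or a placeholder
parts = tf.sparse.split(sp_input=st, num_split=2, axis=0)
print(len(parts))   # 2 SparseTensors, each with dense_shape [2, 3, 2]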