Keras LSTM Input Transpose

I'm training a timeseries LSTM model using Keras. I understand that the input to the model has to be in the format: [samples, timesteps, features].
However, when I transpose each input element (swapping the timestep and feature axes), so the input is now in the format [samples, features, timesteps], my model accuracy improves significantly and training time is reduced quite a bit as well. Does anyone have an explanation as to why?
For reference, here are the stats on my training data:
samples: 720
timesteps: 256
features: 4
So my input tensor should have the shape [720, 256, 4] but reshaping to [720, 4, 256] produces better results. Why?

As I said in my rather long comments, the answer is "because you are not learning the same thing". Frameworks like TensorFlow and Keras try to make training and using networks convenient, so as long as your inputs are at least approximately the right shape, the framework will do its best to feed the data into the network. But the network has no way to interpret the data you feed into it; it is up to you to make sure that what you are feeding in makes sense in the context of your data.

No matter what data you send and what labels you use, the network will do its best to learn a mapping between the data and the labels, and it might succeed. Just because the pattern you are trying to learn makes no sense doesn't mean it cannot be learned.

So to answer your question, you need to figure out what your input transposition means: given that the LSTM treats your data as a sequence of consecutive datapoints, what sequences did you actually end up learning?
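To make that concrete, here is a minimal sketch (using random toy data with the dimensions from the question) of what the two layouts mean to Keras:

import numpy as np
import tensorflow as tf

# Toy data matching the question's dimensions; the values are random and only illustrate shapes.
x = np.random.rand(720, 256, 4).astype("float32")   # [samples, timesteps, features]
x_t = np.transpose(x, (0, 2, 1))                     # [samples, features, timesteps]

# Keras always interprets axis 1 as time and axis 2 as features,
# so the two layouts define two different sequence-learning problems:
out_a = tf.keras.layers.LSTM(64)(x)     # 256 steps, each a 4-dimensional vector
out_b = tf.keras.layers.LSTM(64)(x_t)   # 4 steps, each a 256-dimensional vector
print(out_a.shape, out_b.shape)         # both (720, 64), but computed over different "sequences"

In the transposed layout the LSTM only unrolls over 4 steps instead of 256, which is also consistent with the shorter training time you observed.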

Related

TensorFlow 2.0: Best way to structure the output of `tf.data.Dataset` in a multiple-inputs scenario

I'm building a GAN in TensorFlow for image deblurring; it's an implementation of DeblurGANv2. I set up the GAN so that it has two inputs: a batch of blurred images and a batch of sharp images. Along these lines, I designed the input as a Python dictionary with two keys, ['sharp', 'blur'], each holding a tensor of shape [batch_size, 512, 512, 3]. This makes it easy to feed the blurred batch to the generator, and then feed the generator output together with the sharp batch to the discriminator.
Based on these requirements, I create a tf.data.Dataset that outputs exactly that: a dict containing the two tensors, each with its own batch dimension. This fits my GAN implementation perfectly, and everything works smoothly.
So keep in mind, my input is not a tensor but a Python dict with no outer batch dimension; this will be relevant for explaining my problem later.
Recently, I decided to add support for distributed training using TensorFlow distribution strategies. This TensorFlow feature allows training to be distributed over multiple devices, even over multiple machines. Some of the implementations, for example MirroredStrategy, take the input tensor, split it into equal parts, and feed each slice to a different device. That means that with a batch size of 16 and 4 GPUs, each GPU ends up with a local batch of 4 datapoints; after that there is some magic for aggregating the results and other things that are not relevant to my problem.
As you have already noticed, it is critical for distribution strategies to have a tensor as input, or at least some input with an outer batch dimension, and what I have is a Python dict with the batch dimension buried inside the dictionary's tensor values. This is a huge problem: my current implementation is not compatible with distributed training.
I was looking for workarounds, but I can't quite wrap my head around this. Maybe just make the input one big tensor of shape [batch_size, 2, 512, 512, 3] and slice it? I'm not sure, this just came to mind. Anyway, I find this very ambiguous: I can't differentiate the two inputs, at least not with the clarity of the dictionary keys. Edit: the problem with this solution is that it makes my dataset transformations very expensive and therefore slows the dataset throughput a lot, which matters since this is an image loading pipeline.
Maybe my explanation of how distribution strategies work is not the most rigorous one; if I'm missing something, feel free to correct me.
PS: This is not a bug or code-error question, more of a system-design query; I hope that is allowed here.
Instead of using a dictionary as the input to the GAN, you can try mapping a function in the following way:
import glob
import tensorflow as tf

def load_image(fileA, fileB):
    # Read and decode a blurred/sharp image pair.
    imageA = tf.io.read_file(fileA)
    imageA = tf.image.decode_jpeg(imageA, channels=3)
    imageB = tf.io.read_file(fileB)
    imageB = tf.image.decode_jpeg(imageB, channels=3)
    return imageA, imageB

trainA = glob.glob('blur/*.jpg')   # blurred images
trainB = glob.glob('sharp/*.jpg')  # matching sharp images

batch_size = 16  # example value
train_dataset = tf.data.Dataset.from_tensor_slices((trainA, trainB))
train_dataset = train_dataset.map(load_image).batch(batch_size)

# for mirrored strategy
mirrored_strategy = tf.distribute.MirroredStrategy()
dist_dataset = mirrored_strategy.experimental_distribute_dataset(train_dataset)
You can then iterate over the distributed dataset and update the network by passing both images to each training step.
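A minimal sketch of what that loop could look like, continuing from the code above (the single-conv generator and the plain L1 loss are placeholders for the real DeblurGANv2 models and losses, and `mirrored_strategy.run` assumes a recent TF version; on TF 2.0/2.1 the call is `experimental_run_v2`):

with mirrored_strategy.scope():
    # Placeholder generator standing in for the real DeblurGANv2 generator.
    generator = tf.keras.Sequential([tf.keras.layers.Conv2D(3, 3, padding='same')])
    optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(blur_batch, sharp_batch):
    blur = tf.cast(blur_batch, tf.float32) / 255.0
    sharp = tf.cast(sharp_batch, tf.float32) / 255.0
    with tf.GradientTape() as tape:
        restored = generator(blur, training=True)
        loss = tf.reduce_mean(tf.abs(sharp - restored))  # placeholder L1 loss
    grads = tape.gradient(loss, generator.trainable_variables)
    optimizer.apply_gradients(zip(grads, generator.trainable_variables))
    return loss

for blur_batch, sharp_batch in dist_dataset:
    per_replica_loss = mirrored_strategy.run(train_step, args=(blur_batch, sharp_batch))
    loss = mirrored_strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)

Because the strategy receives the (blur, sharp) tuple with an outer batch dimension on each tensor, it can split both halves of the pair across replicas consistently.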
I hope this helps!

Can CNN autoencoders have different input and output dimensions?

I am working on a problem that requires me to build a deep learning model which, given a certain input image, has to output another image. It is worth noting that these two images are conceptually related, but they don't have the same dimensions.
At first I thought that a classical CNN with a final dense layer whose size is the product of the height and width of the output image would suit this case, but during training it produced strange figures, such as an accuracy of 0.
While looking for answers on the Internet I discovered the concept of CNN autoencoders, and I was wondering whether this approach could help me solve my problem. In all the examples I saw, the input and output of the autoencoder had the same size and dimensions.
At this point I wanted to ask whether there is a type of CNN autoencoder that produces an output image with dimensions different from those of the input image.
An autoencoder (AE) is an architecture that tries to encode your image into a lower-dimensional representation by simultaneously learning to reconstruct the data from that representation. AEs therefore rely on unsupervised data (no labels needed) that is used both as the input and as the target in the loss.
You can try a U-Net based architecture for your use case. A U-Net forwards intermediate data representations to later layers of the network, which should help with faster learning/mapping of the inputs into a new domain.
You can also experiment with a simple architecture containing a few ResNet blocks without any downsampling layers, which may or may not be enough for your use case.
If you want to dig a little deeper, you can look into DiscoGAN and related methods. They explicitly try to map an image into a new domain while preserving image information.
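To make the encoder-decoder idea concrete, here is a minimal convolutional sketch (the 128x128 input and 64x64 single-channel output are arbitrary example sizes, not taken from the question): because the output size is set purely by the strided convolutions and transposed convolutions, it does not have to match the input size.

import tensorflow as tf

inputs = tf.keras.Input(shape=(128, 128, 3))
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)      # 64x64
x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)           # 32x32
x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)  # 64x64
outputs = tf.keras.layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)              # 64x64x1

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()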

LSTM Embedding output size and number of LSTM units

I am not sure why we have an output vector of size only 32 while the LSTM has 100 units.
What confuses me is this: if we only have 32-dimensional word vectors, shouldn't an LSTM with 32 units be big enough to hold them once they are fed in?
model.add(Embedding(5000, 32))
model.add(LSTM(100))
Those are hyper-parameters of your model, and there is no best way of setting them without experimentation. In your case, embedding single words into a vector of dimension 32 might be enough, but the LSTM processes a sequence of them and might require more capacity (i.e., more dimensions) to store information about multiple words. Without knowing the objective or the dataset, it is difficult to make an educated guess about what those parameters should be. Often we look at past research papers tackling similar problems, see what hyper-parameters they used, and then tune them via experimentation.
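As a quick illustration of how the two numbers relate (a sketch only; the sequence length of 20 is an arbitrary example): the embedding dimension sets the size of each timestep's input vector, while the LSTM units set the size of the state it carries across the sequence.

import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),  # e.g. sequences of 20 word ids
    Embedding(5000, 32),          # each word id -> a 32-dim vector; output (batch, 20, 32)
    LSTM(100),                    # reads those 32-dim vectors step by step; output (batch, 100)
])
model.summary()

So the 32 describes a single word, while the 100 describes the memory the LSTM uses to summarize the whole sequence of words.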

Multilabel classification with LSTM on variable-length signals using Keras

I have recently started working on ECG signal classification into various classes. It is basically a multi-label classification task (4 classes in total). I am new to deep learning, LSTMs and Keras, which is why I am confused about a few things.
I am thinking about giving the normalized original signal as input to the network; is this a good approach?
I also need to understand the training input shape for an LSTM, as ECG signals are of variable length (9000 to 18000 samples) and classifiers usually need fixed-size input. How can I handle this kind of input with an LSTM?
Finally, what should the structure of a deep LSTM network be for such long inputs, and how many layers should I use?
Thanks for your time.
Regards
I am thinking about giving the normalized original signal as input to the network; is this a good approach?
Yes, this is a good approach. It is actually quite standard in deep learning to feed the network normalized or rescaled input.
This usually helps your model converge faster, as you are now working within a smaller range (e.g., [-1, 1]) instead of the larger, un-normalized range of your original input (say [0, 1000]). It also helps you get better, more precise results, as it mitigates problems like vanishing gradients and plays better with modern activation functions and optimizers.
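A minimal sketch of such a rescaling (the array here is random placeholder data standing in for one raw ECG recording):

import numpy as np

# One raw 1-D signal, rescaled to [-1, 1] per recording (placeholder data).
ecg = np.random.randint(0, 1000, size=9000).astype("float32")
ecg_norm = 2 * (ecg - ecg.min()) / (ecg.max() - ecg.min()) - 1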
I also need to understand the training input shape for an LSTM, as ECG signals are of variable length (9000 to 18000 samples) and classifiers usually need fixed-size input. How can I handle this kind of input with an LSTM?
This part is really important. You are correct: an LSTM expects to receive inputs with a fixed shape, one that you know beforehand (in fact, any deep learning layer expects fixed-shape inputs). This is also explained in the Keras docs on recurrent layers, where they say:
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
As we can see, it expects your data to have a number of timesteps as well as a dimension (number of features) at each of those timesteps; here I use a batch size of 1 purely for illustration. To exemplify, suppose your input data consists of elements like [[1,4],[2,3],[3,2],[4,1]]. Then, using a batch_size of 1, the shape of your data would be (1, 4, 2), as you have 4 timesteps, each with 2 features.
So, bottom line, you have to make sure you pre-process your data so it has a fixed shape that you can then pass to your LSTM layers. This one you will have to figure out yourself, as you know your data and problem better than we do.
Maybe you can fix the number of samples you take from each signal, discarding some and keeping others so that every signal has the same length (since your signals are between 9k and 18k samples, choosing 9000 could be the logical choice, discarding the extra samples from the longer recordings). You could even apply some other transformation that maps your 9000-18000-sample inputs to a fixed size.
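One possible sketch of that preprocessing (the variable names and lengths are placeholders; pad_sequences truncates long signals and zero-pads short ones):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# A list of 1-D signals of varying length, standing in for the real ECG recordings.
signals = [np.random.randn(n).astype("float32") for n in (9000, 12000, 18000)]

# Cut everything down (or pad up) to a fixed 9000 samples, then add a feature axis for the LSTM.
fixed = pad_sequences(signals, maxlen=9000, dtype="float32", truncating="post", padding="post")
x = fixed[..., np.newaxis]   # shape (num_signals, 9000, 1) -> [samples, timesteps, features]
print(x.shape)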
Finally, what should the structure of a deep LSTM network be for such long inputs, and how many layers should I use?
This one is really quite broad and doesn't have a unique answer. It would depend on the nature of your problem, and determining those parameters a priori is not so straightforward.
What I recommend you do is to start with a simple model first, and then add layers and blocks (neurons) incrementally until you are satisfied with the results.
Try just one hidden layer first, train and test your model and check your performance. You can then add more blocks and see if your performance improved. You can also add more layers and check for the same until you are satisfied.
This is a good way to create deep learning models, as you will arrive at the results you want while keeping your network as lean as possible, which in turn helps with execution time and complexity. Good luck with your coding, hope you find this useful.
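As a concrete starting point in the spirit of that advice, here is a minimal sketch; the 64 units, the fixed length of 9000, and the sigmoid multi-label head are assumptions for illustration, not prescriptions:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(9000, 1)),                  # fixed-length signal, 1 feature per timestep
    tf.keras.layers.LSTM(64),                         # start with a single hidden LSTM layer
    tf.keras.layers.Dense(4, activation="sigmoid"),   # 4 independent labels (multi-label setup)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

If performance is not satisfactory, you can then stack more LSTM layers (with return_sequences=True on all but the last) or widen the existing one, exactly as described above.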

Output Projection in Seq2Seq model Tensorflow

I'm going through the translation code implemented in TensorFlow using the seq2seq model; I'm following the TensorFlow tutorial on seq2seq models.
In that tutorial there is a part explaining a concept called output projection, which is implemented in the seq2seq_model.py code. I understand the code, but I don't understand what this output projection part is doing.
It would be great if someone could explain what is going on behind this output projection.
Thank you!
Internally, a neural network operates on dense vectors of some size, often 256, 512 or 1024 floats (let's say 512 here). But at the end it needs to predict a word from the vocabulary, which is often much larger, e.g., 40000 words. The output projection is the final linear layer that converts (projects) the internal representation into that larger one. So, for example, it can consist of a 512 x 40000 parameter matrix and a 40000-dimensional bias vector.

The reason it is kept separate in the seq2seq code is that some loss functions (e.g., the sampled softmax loss) need direct access to the final 512-sized vector and the output projection matrix. Hope that helps!
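In code terms, the projection is just a final dense/linear layer applied to the decoder's output state; a minimal sketch with the sizes used in the example above (the random values are placeholders):

import tensorflow as tf

hidden_size, vocab_size = 512, 40000
decoder_output = tf.random.normal([1, hidden_size])            # one 512-dim decoder state (dummy data)

W = tf.Variable(tf.random.normal([hidden_size, vocab_size]))   # output projection matrix
b = tf.Variable(tf.zeros([vocab_size]))                        # bias vector

logits = tf.matmul(decoder_output, W) + b    # shape (1, 40000): one score per vocabulary word
predicted_word_id = tf.argmax(logits, axis=-1)
print(predicted_word_id.shape)

Keeping W and b accessible outside the rest of the model is what lets sampled-softmax-style losses use them directly, as described above.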
