LSTM Sequence length - python

I want to ask whether there is an optimal sequence length for an LSTM network in general, or specifically for time series prediction problems?
I have read about the vanishing and exploding gradient problems that RNNs suffer from on very long sequences, which LSTMs were designed to address and do solve to a certain extent.
I have also heard about techniques for handling very long sequences with LSTMs and RNNs in general, such as truncating sequences, summarizing sequences, truncated backpropagation through time, or even using an encoder-decoder architecture.
I am asking because I could not find a research paper on this, only a blog post that stated an optimal sequence length of between 10 and 30.

Do some Model Selection.
TLDR: Just try it out.
Because training is already very computationally expensive, the easiest way to gauge how successful a model will be is simply to test it. The combination that works best cannot easily be determined in advance, especially with such a vague description (or no description at all) of what the actual problem looks like.
From this answer:
It totally depends on the nature of your data and its inner correlations; there is no rule of thumb. However, given that you have a large amount of data, a 2-layer LSTM can model a large body of time series problems / benchmarks.
So in your case, you might want to try sequence lengths from 10 to 30. But I would also evaluate how your training algorithm performs outside the range recommended by the post you linked.
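A minimal sketch of what "just try it out" can look like, assuming Keras and a hypothetical make_windows() helper that slices your series into (samples, seq_len, n_features) windows with matching targets; the sequence length is treated as just another hyperparameter to sweep:

```python
# Sweep candidate sequence lengths and keep the one with the lowest validation loss.
# make_windows() and series are assumptions standing in for your own data pipeline.
import numpy as np
from tensorflow import keras

def build_model(seq_len, n_features):
    model = keras.Sequential([
        keras.layers.Input(shape=(seq_len, n_features)),
        keras.layers.LSTM(32),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

results = {}
for seq_len in (10, 20, 30, 50):                       # candidate sequence lengths
    X_train, y_train, X_val, y_val = make_windows(series, seq_len)  # hypothetical helper
    model = build_model(seq_len, X_train.shape[-1])
    model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)
    results[seq_len] = model.evaluate(X_val, y_val, verbose=0)

best_seq_len = min(results, key=results.get)           # lowest validation loss wins
```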

Related

Can LSTMs handle incredibly dense time-series data?

I have 50 time series, each having at least 500 data points (some series have as many as 2000+ data points). All the time series go from a value of 1.089 down to 0.886, so you can see that the resolution of each dataset is on the order of 10^-4, i.e. the data looks something like:
1.079299, 1.078809, 1.078479, 1.078389, 1.078362,... and so on in a decreasing fashion from 1.089 to 0.886 for all 50 time-series.
My questions hence, are:
Can LSTMs handle such dense data?
In order to avoid overfitting, what can be the suggested number of epochs, timesteps per batch, batches, hidden layers and neurons per layer?
I have been struggling with this for more than a week, and no other source that I could find talks about this specific case, so it could also help others.
Good question, and I can understand why you did not find many explanations: most tutorials cover basic concepts and aspects, not necessarily custom problems like this.
You have 50 time series, but the sampling frequency is not the same for each one. If you want to construct the dataset properly, you have to interpolate so that every series has the same number of samples.
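A sketch of one way to do that with linear interpolation (NumPy assumed; series_list is a stand-in name for your 50 arrays of varying lengths):

```python
# Resample every series to a common length via linear interpolation.
import numpy as np

def resample(series, target_len):
    old_x = np.linspace(0.0, 1.0, num=len(series))
    new_x = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_x, old_x, series)

target_len = 500                                   # e.g. the length of the shortest series
resampled = np.stack([resample(s, target_len) for s in series_list])
# resampled now has shape (50, target_len) and can be windowed into LSTM inputs.
```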
LSTMs can handle such dense data. It can be framed as either a classification or a regression problem; neural networks can adapt to both situations.
In order to avoid overfitting (LSTMs are very prone to it), the first major aspect to consider is the number of hidden layers and the number of units per layer. People tend to default to 256-512 units because those sizes suit the huge datasets processed in Natural Language Processing. In my experience, simpler regression/classification problems do not need such a big number; it will only lead to overfitting on smaller problems.
Therefore, taking the two points above into consideration, start with a single LSTM/GRU layer of 32 units followed by the output layer. If you see that you do not get good results, add another layer (64 units in the first, 32 in the second) before the output layer.
Finally, the number of timesteps per batch is of crucial importance. It cannot be determined here; you have to iterate over values manually and see what yields the best results. I assume you create your dataset in a sliding-window manner; treat the window size as another hyperparameter to tune before settling on the batch size and number of epochs, as in the sketch below.
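A minimal sketch of that starting point, assuming Keras; values is one resampled 1-D series, and window_size is the hyperparameter mentioned above:

```python
# Build a sliding-window dataset from one series and fit a small single-layer LSTM.
import numpy as np
from tensorflow import keras

def sliding_windows(values, window_size):
    X, y = [], []
    for i in range(len(values) - window_size):
        X.append(values[i:i + window_size])
        y.append(values[i + window_size])
    X = np.array(X)[..., np.newaxis]        # shape (samples, window_size, 1)
    return X, np.array(y)

window_size = 20                             # tune this along with batch size and epochs
X, y = sliding_windows(values, window_size)

model = keras.Sequential([
    keras.layers.Input(shape=(window_size, 1)),
    keras.layers.LSTM(32),                   # add a second layer only if this underfits
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=50, batch_size=16, validation_split=0.2)
```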

Why do we drop different nodes for each training example in Dropout Regularisation?

In the Deep Learning course, Prof. Ng explained how to implement dropout regularisation.
Implementation by Prof. Ng:
The first image shows the implementation of dropout and the generation of the D3 matrix, i.e. the dropout matrix for the third layer, with shape (#nodes, #training examples).
As per my understanding, D3 would look like this for a single iteration and keeps changing with every iteration (here I've taken 5 nodes and 10 training examples).
Query: the one thing I didn't get is why we need to drop different nodes for each training example. Instead, why can't we keep the dropped nodes the same for all examples and then drop randomly again in the next iteration? For example, in the second image the 2nd training example passes through 4 nodes while the first one passes through all nodes. Why not the same nodes for all examples?
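For reference, this is roughly what the inverted-dropout step described above looks like in NumPy; note that the mask is drawn independently per node and per example, which is exactly the behaviour being asked about:

```python
# Inverted dropout: D3 has one column per training example, so each example
# gets its own random set of dropped nodes.
import numpy as np

keep_prob = 0.8
A3 = np.random.randn(5, 10)                     # activations: 5 nodes, 10 examples

D3 = np.random.rand(*A3.shape) < keep_prob      # independent mask per node AND per example
A3 = A3 * D3                                    # zero out the dropped nodes
A3 = A3 / keep_prob                             # rescale so expected activations are unchanged

# Keeping the same dropped nodes for every example would instead look like:
# col = np.random.rand(A3.shape[0], 1) < keep_prob
# D3  = np.broadcast_to(col, A3.shape)
```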
We choose different nodes within the same batch for the simple reason that it gives slightly better results. This is the rationale for virtually any choice in a neural network. As to why it gives better results ...
If you have many batches per epoch, you will notice very little difference, if any. The reason we have dropout at all is that models sometimes exhibit superstitious learning (in the statistical sense): if several examples in an early batch just happen to share a particularly strong correlation of some sort, the model will learn that correlation early on (primacy effect) and will take a long time to un-learn it. For instance, students doing the canonical dogs-vs-cats exercise will often use their own data set. Some students find that the model learns to identify anything on a soft chair as a cat, and anything on a lawn as a dog, because those are common places to photograph each type of pet.
Now, imagine that your shuffling algorithm brings up several such photos in the first three batches. Your model will learn this correlation. It will take quite a few counter-examples (cat in the yard, dog on the furniture) to counteract the original assumption. Dropout disables one or another of the "learned" nodes, allowing others to be more strongly trained.
The broad concept is that a valid correlation will be re-learned easily; an invalid one is likely to disappear. This is the same concept as repeating other experiments. For instance, if a particular experiment shows significance at p < .05 (a typical standard of scientific acceptance), then there is no more than a 5% chance that a correlation this strong would appear purely by chance rather than through the functional connection we hope to find. This is considered confident enough to move forward.
However, it's not certain enough to make large, sweeping changes in thousands of lives -- say, with a new drug. In those cases, we repeat the experiment enough times to achieve the social confidence desired. If we achieve the same result in a second, independent experiment, then instead of 1/20 chance of being mistaken, we have a 1/400 chance.
The same idea applies to training a model: dropout reduces the chance that we've learned something that isn't really true.
Back to the original question: why drop out per example rather than per iteration? Statistically, we do a little better if we drop out more frequently but for shorter spans. If we dropped the same nodes for several iterations, that would slightly increase the chance that the active nodes learn a mistake while those few nodes are inactive.
Do note that this is mostly ex post facto rationale: this wasn't predicted in advance; instead, the explanation comes only after the false-training effect was discovered, a solution found, and the mechanics of the solution were studied.

Getting some sort of Math Formula from a Machine Learning trained model

I already asked this question here: Can Convolutional Neural Networks (CNN) be represented by a Mathematical formula? but I feel that I was not clear enough and also the proposed idea did not work for me.
Let's say that, using my computer, I train a certain machine learning algorithm (e.g. naive Bayes, a decision tree, linear regression, or others). So I already have a trained model to which I can give an input value, and it returns the result of the prediction (e.g. 1 or 0).
Now, let's say that I still want to give an input and get a predicted output. However, this time I would like my input value to be, for example, multiplied by some sort of mathematical formula, weights, or matrix that represents my "trained model".
In other words, I would like my trained model to be "transformed" into some sort of formula to which I can give an input and get the predicted number.
The reason I want to do this is that I want to train on a big dataset with a complex prediction model, and then use this trained prediction model on simpler hardware such as a PIC32 microcontroller. The PIC32 microcontroller would not train the machine learning model or store all the inputs. Instead, it would simply read certain numbers from the system, apply a math formula or some sort of matrix multiplication, and give me the predicted output. With that, I could use "fancy" neural networks on much simpler devices that can easily evaluate math formulas.
If I read this properly, you want a generally continuous function in many variables to replace a CNN. The central point of a CNN existing in a world alongside ANNs ("normal" neural networks) is that it includes irruptive transformations: non-linearities, discontinuities, etc. that enable the CNN to develop recognitions and relationships that simple linear combinations, such as matrix multiplication, cannot handle.
If you want to understand this better, I recommend that you choose an introduction to Deep Learning and CNNs in whatever presentation mode fits your learning styles.
Essentially, every machine learning algorithm is a parameterized formula, with your trained model being the learned parameters that are applied to the input.
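To illustrate that point at the simple end of the spectrum: for a linear model the learned parameters really are just a handful of numbers, and the on-device prediction reduces to a dot product and a threshold. A hedged sketch with scikit-learn, where X_train and y_train are assumed to already exist:

```python
# Train on a PC, then export the weights; the deployed "model" is just arithmetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train, y_train)
w, b = clf.coef_[0], clf.intercept_[0]          # these numbers are the "formula"

def predict_on_device(x, w, b):
    z = float(np.dot(w, x) + b)                 # on a PIC32 this is a loop of multiply-adds
    return 1 if z >= 0.0 else 0                 # sigmoid(z) >= 0.5  <=>  z >= 0

# predict_on_device(X_train[0], w, b) matches clf.predict(X_train[:1])[0]
```

Anything more complex than this (deep networks in particular) cannot in general be collapsed back into a single linear expression, which is the point the next answer makes.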
So what you're actually asking is to simplify arbitrary computations to, more or less, a matrix multiplication. I'm afraid that's mathematically impossible. If you ever do come up with a solution to this, make sure to share it - you'd instantly become famous, most likely rich, and put a hell of a lot of researchers out of business. If you can't train a matrix multiplication to get the accuracy you want from the start, what makes you think you can boil down arbitrary "complex prediction models" to such simple computations?

Prediction and forecasting with Python tensorflow

I have created a prediction model and used RNN in it offered by the tensorflow library in Python. Here is the complete code I have created and tried:
Jupyter Notebook of the Code
But I have doubts.
1) Whether RNN is correct for what I am trying to predict?
2) Is there a better algorithm I can try?
3) Can anyone suggest how I can give multiple inputs and get the necessary output using the tensorflow model? Can anyone guide me, please?
I hope I am clear on my points. Please do tell me if anything else is required.
Having doubts is normal, but you should try to measure them before asking for advice. If you don't have a clear idea of what you want to improve, it's unlikely you will end up with something better.
1) Whether RNN is correct for what I am trying to predict?
Yes, an RNN is used appropriately here. If you don't care much about having arbitrary-length input sequences, you could also force them to a fixed size and apply convolutions on top (see convolutional neural networks), or even try a simpler DNN.
The more important question to ask yourself is if you have the right inputs and if you have sufficient training data to learn what you hope to learn.
2) Is there a better algorithm I can try?
Probably not. As I said, an RNN seems appropriate for this problem. Do try some hyperparameter tuning to make sure you don't accidentally pick a sub-optimal configuration.
3) Can anyone suggest how I can give multiple inputs and get the necessary output using the tensorflow model? Can anyone guide me, please?
The common way to handle variable-length inputs is to set a maximum length and pad the shorter examples until they reach it. The maximum length can be a value you pick, or you can dynamically set it to the largest length in the batch. This is needed only because the internal operations are done in batches. You can pick which outputs you want: picking the last one is reasonable (the model just has to learn to propagate its state through the padding values); another reasonable choice is to pick the first output you get after feeding the last meaningful value into the RNN.
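A small sketch of that padding step with tf.keras; the Masking layer is an optional addition on my part that lets the RNN skip the padded steps instead of having to learn to carry its state through them:

```python
# Pad variable-length sequences to a common length, then mask the padding.
import tensorflow as tf

sequences = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]          # variable-length inputs
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, padding="post", dtype="float32")           # pad to the longest, or pass maxlen=...
padded = padded[..., None]                                 # shape (batch, timesteps, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, 1)),  # ignore padded steps
    tf.keras.layers.SimpleRNN(16),
    tf.keras.layers.Dense(1),
])
```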
Looking at your code, there's one thing I would improve:
Instead of computing a loss on the last value only, I would compute it over all values in the series. This gives your model more training data with very little performance degradation.
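A sketch of what that looks like in tf.keras: with return_sequences=True the LSTM emits an output at every timestep, so the targets become a whole series per example rather than a single final value:

```python
# Compute the loss over all timesteps instead of only the last one.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(None, 1)),
    tf.keras.layers.Dense(1),            # applied independently at every timestep
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X, Y_all_steps, ...)  where Y_all_steps[i, t] is the target at step t of example i
```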

Using Pybrain to detect malicious PDF files

I'm trying to make an ANN to classify a PDF file as either malicious or clean, by utilising the 26,000 PDF samples (both clean and malicious) found on contagiodump. For each PDF file, I used PDFid.py to parse the file and return a vector of 42 numbers. The 26000 vectors are then passed into pybrain; 50% for training and 50% for testing. This is my source code:
https://gist.github.com/sirpoot/6805938
After much tweaking with the dimensions and other parameters I managed to get a false positive rate of about 0.90%. This is my output:
https://gist.github.com/sirpoot/6805948
My question is, is there any explicit way for me to decrease the false positive rate further? What do I have to do to reduce the rate to perhaps 0.05%?
There are several things you can try to increase the accuracy of your neural network.
Use more of your data for training. This will permit the network to learn from a larger set of training samples. The drawback of this is that having a smaller test set will make your error measurements more noisy. As a rule of thumb, however, I find that 80%-90% of your data can be used in the training set, with the rest for test.
Augment your feature representation. I'm not familiar with PDFid.py, but it appears to return only ~40 values for a given PDF file. It's possible that many more than 40 features are relevant in determining whether a PDF is malicious, so you could conceivably use a different feature representation that includes more values to increase the accuracy of your model.
Note that this can potentially involve a lot of work -- feature engineering is difficult! One suggestion I have if you decide to go this route is to look at the PDF files that your model misclassifies, and try to get an intuitive idea of what went wrong with those files. If you can identify a common feature that they all share, you could try adding that feature to your input representation (giving you a vector of 43 values) and re-train your model.
Optimize the model hyperparameters. You could try training several different models using training parameters (momentum, learning rate, etc.) and architecture parameters (weight decay, number of hidden units, etc.) chosen randomly from some reasonable intervals. This is one way to do what is called "hyperparameter optimization" and, like feature engineering, it can involve a lot of work. However, unlike feature engineering, hyperparameter optimization can largely be done automatically and in parallel, provided you have access to a lot of processing cores.
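A library-agnostic sketch of that random search; train_and_score() is a hypothetical wrapper around whatever training and validation routine you already have:

```python
# Random hyperparameter search: sample configurations from reasonable ranges,
# train each one, and keep the configuration with the best validation score.
import random

best_score, best_cfg = float("inf"), None
for _ in range(20):                                   # number of random trials
    cfg = {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "momentum":      random.uniform(0.5, 0.99),
        "hidden_units":  random.choice([16, 32, 64, 128]),
        "weight_decay":  10 ** random.uniform(-6, -3),
    }
    score = train_and_score(cfg)                      # hypothetical: returns validation error
    if score < best_score:
        best_score, best_cfg = score, cfg
```

Each trial is independent, which is why this parallelizes so easily across cores or machines.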
Try a deeper model. Deep models have become quite "hot" in the machine learning literature recently, especially for speech processing and some types of image classification. By using stacked RBMs, a second-order learning method (PDF), or a different nonlinearity like a rectified linear activation function, you can add multiple layers of hidden units to your model, and sometimes this will help improve your error rate.
These are the ones that come to mind right off the bat. Good luck!
Let me first say I am in no way an expert in neural networks. But I played with PyBrain once, and I used the .train() method in a loop until the error dropped below 0.001 to get the error rate I wanted. So you could try training on all of your samples with such a loop and then test on other files.
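From memory, and to be checked against the PyBrain docs, that loop looks roughly like the sketch below; BackpropTrainer.train() runs one epoch and returns the training error, and ds stands for a SupervisedDataSet built from your 42-value feature vectors:

```python
# Rough PyBrain sketch: train until the epoch error falls below a target.
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer

net = buildNetwork(42, 20, 1)            # 42 inputs, one hidden layer, 1 output (sizes illustrative)
trainer = BackpropTrainer(net, ds)       # ds: your SupervisedDataSet of labelled vectors

error = trainer.train()                  # one epoch, returns the training error
while error > 0.001:                     # keep training until the error target is reached
    error = trainer.train()
```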
