LSTM prediction normalization - python

I am currently trying to predict Ethereum prices with an LSTM network in Keras, and I have a small problem.
I am trying to predict the next 10 prices from the previous 40 prices. The time span between prices is 15 seconds, so I am predicting 2.5 minutes into the future.
My input data are the closing, opening, lowest and highest prices. I normalize the data between 0 and 1, since the activation function in my fully connected layer is linear. I normalize each sequence independently, based on the minimum and maximum value of the whole sequence (the 40 inputs plus the 10 prices to predict).
This model works fine when I test it on past data: there I have both the input prices and the prices I am trying to predict, so when I denormalize using the minimum and maximum of the whole sequence, the prediction is pretty accurate.
However, when I try to predict live data, I do not have the last 10 prices of the sequence, so I have to denormalize my output based on only the 40 input values, and that is where the problem is. The live result is never as good as the testing result.
I tried normalizing my data based on only the 40 input prices during training, but that threw off my whole prediction.
I also tried normalizing between -1 and 1 based on the 40 input values and changing my output activation function to tanh, but it still did not work correctly. My prediction was mostly a straight line, with no increase or decrease in value.
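The normalization issue described above can be made concrete with a small sketch. It scales each window using only the minimum and maximum of the 40 input steps, applies the same statistics to the 10 targets, and inverts a prediction with them, so training and live prediction see the same scaling. The function names (`scale_window`, `invert_prediction`), the column order (closing price in column 0) and the synthetic prices are assumptions for illustration, not part of the original setup.

```python
import numpy as np

def scale_window(inputs, targets=None, target_col=0):
    """Min-max scale a window using statistics of the input steps only."""
    lo = inputs.min(axis=0)
    hi = inputs.max(axis=0)
    # Fall back to a span of 1.0 on flat columns to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled_inputs = (inputs - lo) / span
    scaled_targets = None
    if targets is not None:
        # Targets reuse the input statistics, so they may leave [0, 1].
        scaled_targets = (targets - lo[target_col]) / span[target_col]
    return scaled_inputs, scaled_targets, (lo, span)

def invert_prediction(scaled_pred, stats, target_col=0):
    """Map a scaled prediction back to price units."""
    lo, span = stats
    return scaled_pred * span[target_col] + lo[target_col]

# Synthetic random-walk prices: 50 steps, 4 columns (assumed: close, open, low, high).
rng = np.random.default_rng(0)
prices = 200.0 + np.cumsum(rng.normal(0.0, 0.5, size=(50, 4)), axis=0)
window_in, window_out = prices[:40], prices[40:, 0]

x, y, stats = scale_window(window_in, window_out)
restored = invert_prediction(y, stats)
```

With this scheme the scaled targets can fall outside [0, 1] whenever the price leaves the range of the input window. A linear output layer can represent such values, while a bounded activation such as tanh cannot.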
My data consist of around 61,000 rows of prices. I took 85% as the training set (20% of which I held out as the validation set) and the remaining 15% as the testing set.
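Read literally, the percentages above give the row counts computed below. The sketch assumes exactly 61,000 rows and a chronological (unshuffled) split, neither of which the description confirms; `chronological_split` is a hypothetical helper.

```python
import numpy as np

def chronological_split(n_rows, train_pct=85, val_pct_of_train=20):
    """Return index arrays for train, validation and test, in time order."""
    n_train_pool = n_rows * train_pct // 100           # first 85% of the rows
    n_val = n_train_pool * val_pct_of_train // 100     # 20% of that pool
    n_train = n_train_pool - n_val
    idx = np.arange(n_rows)
    return idx[:n_train], idx[n_train:n_train_pool], idx[n_train_pool:]

train_idx, val_idx, test_idx = chronological_split(61_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 41480 10370 9150
```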
Do you have any idea where the problem could be? I can also provide some data about my architecture and training if needed.

Related

Model doesn't learn from data

We have a dataset with ~40000 data points, each having 160 features. We know nothing about what each feature represents, but they are integers from 0 to 5, most probably some kind of ranking. Our task is to take a subset of those features, let's say (40000, 30), and predict the initial (40000, 160) data. In other words, we need to create a model that takes 30 features as input and outputs the full set of 160 features.
https://i.stack.imgur.com/Ko6nR.png
the example of the dataset.
So far, we have trained an ANN with the following architecture:
30->200->150->163
We calculate an accuracy score by rounding each prediction (for example, if I predict 3.6 where the true value is 4, then 3.6 rounds to 4 and 4 == 4, so the prediction counts as correct).
We got ~52% accuracy and nothing makes it go higher.
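The rounding-based score described above can be sketched as follows. The helper name `rounded_accuracy`, the clipping to the 0-5 range and the toy arrays are assumptions for illustration; note that `np.rint` rounds exact halves to the nearest even integer.

```python
import numpy as np

def rounded_accuracy(y_true, y_pred, low=0, high=5):
    """Share of entries whose rounded prediction equals the integer target."""
    rounded = np.clip(np.rint(y_pred), low, high)
    return float(np.mean(rounded == y_true))

# Toy example: 2 samples, 3 output features each.
y_true = np.array([[4, 0, 2], [1, 5, 3]])
y_pred = np.array([[3.6, 0.2, 2.7], [1.4, 4.4, 3.1]])

# 3.6 -> 4 (hit), 0.2 -> 0 (hit), 2.7 -> 3 (miss),
# 1.4 -> 1 (hit), 4.4 -> 4 (miss), 3.1 -> 3 (hit): 4 of 6 correct.
score = rounded_accuracy(y_true, y_pred)
```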
So, this is a multi-output regression problem. The prediction is made from 30 discrete numeric features. We tried normalization with both min-max scaling and standardization (the target is also normalized). In the model, we tried different numbers of layers with different capacities, batch normalization, different activations (ReLU is used now; the output layer has no activation), different losses (MSE is the current one) and different optimizers (Adam is the current one). We used both Keras and PyTorch, in case something was wrong with the PyTorch implementation.
Still, the accuracy remains at 50-52%. One thing stands out: increasing model capacity (the number of parameters) normally makes a model more prone to overfitting, yet even after increasing the capacity a great deal we could not make the model overfit the data. We also tried using the features separately (for example, predicting one feature from another), with nothing useful. We tried to predict 1 feature using the other 159 features, but again got ~52% or even less.
What I conclude from this is that there is no relationship between those ratings, and most of them can't predict the others. What do you think about this case?
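One way to put the ~52% figure in context is to compare it with a baseline that ignores the inputs and always predicts each feature's most frequent training value. This is a suggested check rather than something from the original experiments; `mode_baseline_accuracy` and the tiny arrays are hypothetical.

```python
import numpy as np

def mode_baseline_accuracy(train, test, n_levels=6):
    """Accuracy of predicting, for every column, its most frequent training value."""
    modes = np.array([
        np.bincount(train[:, j], minlength=n_levels).argmax()
        for j in range(train.shape[1])
    ])
    # Broadcasting compares every test row against the per-column modes.
    return float(np.mean(test == modes))

# Tiny illustration with 2 features; the column modes in train are 1 and 3.
train = np.array([[1, 2], [1, 3], [0, 3]])
test = np.array([[1, 3], [0, 3]])
baseline = mode_baseline_accuracy(train, test)  # 3 of 4 entries match
```

If the network's accuracy is close to this baseline on the real data, that would be consistent with the conclusion that the features carry little information about each other; a clear gap above the baseline would suggest some signal is present.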

XGBoost scaling weights for time series data

I'm working on a binary classification problem using time series data and I've been having some trouble adjusting the scale_pos_weight parameter.
As it's time series data, most of my features are things like the mean over the last 30 days, the number of days since event X, the accumulated days of event X happening, etc. To avoid data leakage, I split the data chronologically: the first 80% for training and the last 20% for testing.
This works fine for most cases, but in a few the target's distribution changes a lot from the training data to the test data: the training data has a 100:1 ratio of negative to positive instances, while the test data is around 30:1.
I've tried changing the training size to get similar target distributions, but I end up with odd training sizes like 50% or 95%.
I also considered using the test data distribution to adjust the weights, but that would be data leakage.
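For reference, the weight can be derived from the training labels alone, which keeps the test distribution out of the fit. The sketch below uses the common negative-to-positive ratio that the XGBoost documentation suggests as a typical value for `scale_pos_weight`; the helper name and the synthetic labels (built to mimic the 100:1 and 30:1 ratios mentioned above) are assumptions.

```python
import numpy as np

def pos_weight_from_train(y_train):
    """Ratio of negative to positive labels, computed on training data only."""
    n_pos = int(np.sum(y_train == 1))
    n_neg = int(np.sum(y_train == 0))
    if n_pos == 0:
        raise ValueError("no positive examples in the training labels")
    return n_neg / n_pos

# Synthetic labels: 100:1 in the training period, 30:1 in the test period.
y_train = np.tile([0] * 100 + [1], 80)
y_test = np.tile([0] * 30 + [1], 65)

weight = pos_weight_from_train(y_train)  # y_test is deliberately not used
# The value could then be passed as
# xgboost.XGBClassifier(scale_pos_weight=weight).
```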
Any ideas on how I could sort this out?

Predicting future of LSTM resulting in weird answers

I was recently trying to predict stock prices using an LSTM (it seems really overdone, I know), but when I predict on data outside the dataset, I get a really weird graph that I do not think is correct. Am I doing something wrong?
Prediction: https://github.com/Alpheron/StockPred/blob/master/predictions/MSFT-5-Year-LSTM.ipynb
Training:
https://github.com/Alpheron/StockPred/blob/master/MSFT-5-Year-LSTM.ipynb
To process the data, I used a lookback of 60 points, so when predicting on data outside the dataset I need the last 60 points as well. Am I doing something wrong in the way I am predicting?
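Predicting beyond the dataset with a 60-point lookback usually means feeding predictions back into the window. The sketch below shows only that bookkeeping; it is not the code from the linked notebooks. `forecast` is a hypothetical helper and `naive_model` is a stand-in for a trained network (with Keras one would typically call `model.predict` on the window reshaped to `(1, 60, 1)`, which is an assumption about the notebook's input shape).

```python
import numpy as np

def forecast(history, predict_next, lookback=60, steps=5):
    """Roll a one-step model forward by feeding its outputs back in."""
    window = list(history[-lookback:])
    if len(window) < lookback:
        raise ValueError("need at least `lookback` points of history")
    predictions = []
    for _ in range(steps):
        next_value = float(predict_next(np.asarray(window)))
        predictions.append(next_value)
        # Drop the oldest point and append the prediction.
        window = window[1:] + [next_value]
    return np.array(predictions)

def naive_model(window):
    """Stand-in for a trained model: repeats the last observed value."""
    return window[-1]

history = np.linspace(100.0, 110.0, 200)
preds = forecast(history, naive_model)
```

With the persistence stand-in the forecast is flat by construction. The point is the window handling: after 60 steps the model is fed nothing but its own predictions, so errors can accumulate.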

Neural network classifies heavily towards few particular classes

I am performing sentiment analysis on a dataset of movie reviews. The neural network has a single hidden layer and is written from scratch in Python. The classifier is expected to assign one of five classes (0 to 4) to each review phrase. However, after training, the confusion matrix for the dev set gives the following results:
This means that the classifier is heavily biased towards class 0 and class 4. What could be the possible reasons?
Earlier, the classifier always predicted class 2 because the dataset was skewed (~50% of the data was from class 2). Hence I chose a subset of the dataset containing an equal number of examples from each of the 5 classes. I still don't understand the output and the low accuracy.
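The balancing step described above can be sketched as random downsampling to the size of the smallest class. `balanced_indices` and the toy label counts are assumptions; the original notebook may select its subset differently.

```python
import numpy as np

def balanced_indices(labels, seed=0):
    """Pick an equal number of examples per class (size of the rarest class)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_per_class = counts.min()
    picked = [
        rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(picked))

# Toy labels where class 2 makes up half of the data.
labels = np.array([2] * 50 + [0] * 10 + [1] * 15 + [3] * 15 + [4] * 10)
idx = balanced_indices(labels)
class_counts = np.bincount(labels[idx])  # 10 examples for each of the 5 classes
```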
The link to my notebook can be found here
First of all, your model is linear, with only 1 layer, so it is a simple model that might not produce good results. Try increasing the number of layers.
Your training cost is also very high; you have to train for more epochs until you get a good training cost. This also affects your validation cost, which is twice the training cost.
That is a sign of overfitting.

How to predict future Stock using LSTM Keras

First of all, I must say I'm a beginner at AI. I followed most of the tutorials about stock market prediction, and all of them are pretty much the same. These tutorials take a data set and split it into two sets: a training set and a test set. They use the closing price of the stocks to train a model. They then feed the model the test set, which also contains the closing price, and show two graphs. Then they say the actual and the predicted graphs are pretty much the same.
The GitHub repo of the tutorial:
https://github.com/surajr/Stock-Predictor-using-LSTM/blob/master/Stock-Predictor-using-LSTM.ipynb
This is my question,
1. Why do all those tutorials put the closing price in the testing set as well? They are only supposed to insert dates, right? We are predicting the closing price, so this is confusing. Please explain.
2. No one explains how to predict the next 7 days' values. If we have a model, how do we get the closing values for the next 7 days?
Please help me clarify this. Thanks a lot.
Take a look at this link. I think it will get you going in the right direction.
https://www.datacamp.com/community/tutorials/lstm-python-stock-market
Why all those tutorials are putting closing price in the testing set also?
The ultimate goal is to predict the movement (growth), which is the closing price minus the opening price. The best model is the one whose calculated growth on the test data set is very close to the actual growth. Growth is the main problem the model is trying to solve, and it is the point of reference when you calculate the accuracy of the trained model.
They are only supposed to insert dates, right? Because we are predicting the closing price
The model predicts the growth based on given factors. For a company, you have many factors that are quantified per day. I suspect the tutorial you followed uses a testing set extracted for one particular day and different stocks, such as extracting all parameters for all companies on the 10th of January only and then checking how accurate the trained model is. The training set, on the other hand, usually contains stock data for more than one day.
No one is telling me how to predict next 7 days values. So if we have a model, how to get next 7 days closing value?
To predict the stock price relatively accurately, you need a well-trained model, and to get one you need to train on many factors. The same model cannot predict stocks in different countries. One model might be suitable for predicting technology stocks (AAPL) but not other sectors.
Overall, this is a complicated subject. Financial advisers pay a massive amount of money just to use reliable models. Most of them use multiple models based on their clients' portfolios. These tutorials introduce the subject and teach you the main concepts. IMHO, the next step would be learning more and then competing on Kaggle.
In the training set, closing value is included as an input because it is relevant to the "next day's" price, or "price in X days" (for models that predict price movement over more than 1 day).
Note, in the training data, typically the future price (today + 1 day) is the target value (train_Y).
In the testing data, the closing price is included for the same reason: the model is still predicting a "future price" from it.
To determine the accuracy of the model, the price prediction for (today + X days) is compared against the actual future value (test_Y). Just like a human stock trader guessing whether the FUTURE price will be up or down, the model has access to the current day's closing price, which is why it is a relevant input. Obviously, in a real-world setting, the accuracy of the prediction is only known AFTER X days have passed. When training and then testing a model, the data is typically historical, so out-of-sample values (like the price at today + X days) are used to measure accuracy, though the FUTURE value should definitely not be an input.
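The input/target relationship described above can be illustrated with a tiny example: past closing prices form the input, and the price one step later is the target. `make_supervised` and the short price series are hypothetical, chosen so the pairs are easy to read.

```python
import numpy as np

def make_supervised(close, lookback=3, horizon=1):
    """Pair each window of past closes with the close `horizon` steps later."""
    X, y = [], []
    for start in range(len(close) - lookback - horizon + 1):
        end = start + lookback
        X.append(close[start:end])           # known at prediction time
        y.append(close[end + horizon - 1])   # future value, never an input
    return np.array(X), np.array(y)

close = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
X, y = make_supervised(close)
# X rows: [10, 11, 12], [11, 12, 13], [12, 13, 14]
# y:      13,           14,           15
```

The same function applies to both the training and the testing period, which is why closing prices appear in the test set: they are the inputs for windows whose targets lie later in time.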
Why all those tutorials are putting closing price in the testing set also?
-> The closing price is an input variable that the model needs in order to calculate the stock price.
Looking at the code, it seems to predict the stock price from a 22-day history:
X_train (1173, 22, 3)
y_train (1173,)
X_test (130, 22, 3)
y_test (130,)
I think you should re-train with (~~~, 7, 3) to predict the price 7 days after today.
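The shapes above follow the (samples, timesteps, features) layout. As a hedged alternative reading of the suggestion, the sketch below keeps the 22-day window and instead moves the target 7 days ahead; this is an assumption, not what the tutorial notebook does. `make_windows`, the random data and the choice of column 0 as the target are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(data, lookback=22, horizon=7, target_col=0):
    """Build (samples, lookback, features) inputs and targets `horizon` steps ahead."""
    n_samples = len(data) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("not enough rows for this lookback and horizon")
    # sliding_window_view appends the window axis last: (rows - lookback + 1, features, lookback).
    windows = sliding_window_view(data, window_shape=lookback, axis=0)
    X = windows[:n_samples].transpose(0, 2, 1)
    # Window i covers rows i .. i+lookback-1; its target is row i+lookback+horizon-1.
    y = data[lookback + horizon - 1:, target_col]
    return X, y

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))  # 200 days, 3 features (synthetic)
X, y = make_windows(data)
print(X.shape, y.shape)  # (172, 22, 3) (172,)
```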
